AGC - Conference 2: Alarm on the lunar landing

Apollo Guidance Computer Activities

Apollo Guidance Computer History Project

Second conference

September 14, 2001

Alarm on the Lunar Landing

FRED MARTIN: I wanted to touch upon what I would categorize as almost a religious war between the synchronous executive folks and the interrupt priority-driven folks. The issue really came to a head in the shuttle program, where there were forces that believed in asynchronous, priority-driven, interrupt-driven executives, and other forces that believed in absolutely synchronous executives, where you planned out all the software and you executed the software by tables. You could thereby tell what part of the software was operating at every instant of time, because you had planned it out so that it all operated very rigidly. The people who really used those kinds of executives were mostly people in the aircraft industry. They had come from various airplane programs where they had these computers and used synchronous executives. It was very important for them to know exactly what was happening at every instant of time, either from a testing standpoint or whatever. That was their mindset.

The people that had worked on Apollo, for the most part, believed in this asynchronous executive where you have priorities and you didn't have everything structured completely. But you allowed higher priority jobs to interrupt lower priority jobs and so on. There were these incredible discussions in the early shuttle time in deciding what kind of an executive and operating system should be used in the shuttle. The shuttle had its own hardware issues with multi-reliability strings and an IO processor and other things that had to be synced. So there were a lot of reasons why this was a big issue.
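Editor's sketch (not part of the conference record): the two philosophies described above can be contrasted in a few lines of Python. This is a toy model under stated assumptions, not the Apollo or Shuttle executive. The function names, the job names, the frame-based timing, and the job lengths are all hypothetical and chosen only for illustration.

```python
import heapq

def synchronous_schedule(table, n_frames):
    """Table-driven executive: frame k always runs the k-th table entry.

    What runs in any frame is known in advance, whether or not it is needed.
    """
    return [table[frame % len(table)] for frame in range(n_frames)]

def priority_schedule(requests, n_frames):
    """Priority-driven executive: each frame runs the highest-priority ready job.

    requests maps a frame number to a list of (priority, name, length) jobs
    that become ready in that frame. A larger priority number wins, and
    length is the number of frames the job needs (hypothetical units).
    """
    ready = []   # heap of [-priority, arrival order, name, frames left]
    trace = []
    seq = 0
    for frame in range(n_frames):
        for priority, name, length in requests.get(frame, []):
            heapq.heappush(ready, [-priority, seq, name, length])
            seq += 1
        if not ready:
            trace.append("idle")
            continue
        job = ready[0]          # highest priority; may preempt a running job
        trace.append(job[2])
        job[3] -= 1
        if job[3] == 0:
            heapq.heappop(ready)
    return trace

table = ["guidance", "display", "telemetry"]
print(synchronous_schedule(table, 6))

# A low-priority display job starts first and is interrupted by guidance.
requests = {0: [(1, "display", 3)], 1: [(3, "guidance", 2)]}
print(priority_schedule(requests, 6))
```

In the second trace the display job runs for one frame, gives way to guidance for two frames, and then resumes. The table-driven version never deviates from its plan, which is the repeatability the aircraft people valued.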

Hugh Blair-Smith adds:

Having worked on the Shuttle’s FCOS (Flight Computer Operating System) for many years, I can say that in all important respects it was an asynchronous executive exactly as the Apollo one was, that in fact the lessons of Apollo were applied there, though with great political difficulty in getting the decision made. The picture was muddied by the fact of frequent synchronization points to force the two or four redundant-set computers to deliver exactly the same command outputs. Every change of context among the three levels (calculation, I/O completion interrupts, and time/event management interrupts) had to be synchronized in the sense that no computer could embark on a change without assuring that all the others did too. But synchronization is really orthogonal to the issue of synchronicity as discussed here. The greatest nod toward synchronous operation was a scheduled time management interrupt for flight control (i.e., DAP) every 40 msec--but if the previous cycle of flight control was still unfinished, the new cycle would be skipped. So the essential benefit of priority-driven asynchronicity was obtained. Fortunately, the main-engine and thermal-tile development problems gave us time to study these "cycle-overrun" phenomena thoroughly and get comfortable with the design. I really don’t recall any decision to ignore Intermetrics features in this area.
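Editor's sketch (not part of the conference record): the cycle-overrun rule in the note above, where a flight-control cycle is scheduled every 40 msec but skipped if the previous one is still running, can be modeled as follows. The function name and the run times in the example are invented for illustration. Only the 40 msec period comes from the note, and the sketch ignores preemption by other work.

```python
def flight_control_cycles(durations_ms, period_ms=40):
    """Return (started, skipped) tick times for a periodic task.

    durations_ms[i] is the hypothetical run time of the cycle that would
    start at tick i * period_ms. A tick that arrives while the previous
    cycle is still running is skipped rather than queued.
    """
    started = []
    skipped = []
    busy_until = 0
    for i, duration in enumerate(durations_ms):
        tick = i * period_ms
        if tick < busy_until:
            skipped.append(tick)    # previous cycle unfinished: skip this one
            continue
        started.append(tick)
        busy_until = tick + duration
    return started, skipped

# The second cycle overruns (55 ms > 40 ms), so the tick at 80 ms is skipped.
started, skipped = flight_control_cycles([30, 55, 30, 30])
print(started, skipped)
```

Skipping rather than queuing means an overrun costs one missed update instead of a growing backlog.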

FRED MARTIN: But the roots of the Apollo viewpoint really stemmed back to the alarms that occurred on the lunar landing. As you recall, in that flight when the LM was coming down, all of a sudden you had these alarms, and the astronaut kept getting these alarms. He said they had alarms. Finally there was a decision in Houston to just push on and that he should land. Whoever made that decision, perhaps Steve Bales, understood what the problem was or he felt that he--

DAN LICKLY: Well, Jack Garman was screaming, I hear him: go, go, go, go! I don't think he had any evidence other than he thought we had restarted it perfectly.

FRED MARTIN: Anyway, he landed safely. And following that landing, there was just a frantic session at the MIT lab to find out what in the world was causing these alarms. It took us the better part of 24 hours, as I recall. Going back to the simulators, we simulated this. We tried everything to make this happen, and we could not make it happen. It was as if the AGC was operating about 25% slower than it usually operates. So we tried lots of things and we couldn't make it happen.

Eventually, at the suggestion of somebody who worked at the Cape a lot, we tracked down the fact that they had a switch in the wrong position and it was stealing cycles; it was the rendezvous radar switch. And that made the AGC run slower. Because it was running slower, it was dropping low priority jobs, not getting to them and only doing the highest priority jobs, so the job queue filled up. When the job queue filled up, it caused this alarm to go off and so on. At any rate, MIT got a lot of criticism for a software error causing this to happen. So now you had a sort of dichotomy, where some people believed that what had happened was actually an error. Other people believed, no, the software actually saved the program because, in the face of this mistake in the switch, the software, which was written as a priority executive, was able to go on with the highest priority jobs and not tank the mission, because it didn't have to do this boxcar-structured synchronous system where it would give time to everything whether it was important or not.
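Editor's sketch (not part of the conference record): the mechanism described above, where stolen cycles leave too little time for low priority work, the job queue fills, and an alarm fires while the high priority work carries on, can be imitated with a toy model. It is not the AGC's actual logic. The function name, the slot count, the workload percentages, and the "clear the queue on alarm" rule are all assumptions made for illustration.

```python
def simulate_executive(n_periods, stolen_pct, slots=7,
                       guidance_pct=60, display_pct=30):
    """Toy priority executive; returns (alarms, guidance_cycles_completed).

    Work is measured in percent of one period. Each period requests one
    high-priority guidance job and one low-priority display job. Guidance
    always runs first; display work gets whatever time is left over.
    """
    capacity = 100 - stolen_pct          # processor time left after stolen cycles
    if capacity < guidance_pct:
        raise ValueError("model assumes guidance always fits in one period")
    low_backlog = []                     # remaining work of queued display jobs
    alarms = 0
    guidance_done = 0
    for _ in range(n_periods):
        if len(low_backlog) == slots:    # no free slot for a new job
            alarms += 1
            low_backlog.clear()          # assumed restart: shed queued low-priority work
        low_backlog.append(display_pct)
        guidance_done += 1               # high priority job completes every period
        budget = capacity - guidance_pct
        while low_backlog and budget > 0:
            done = min(budget, low_backlog[0])
            low_backlog[0] -= done
            budget -= done
            if low_backlog[0] == 0:
                low_backlog.pop(0)
    return alarms, guidance_done

print(simulate_executive(200, stolen_pct=0))    # nominal load, no cycles stolen
print(simulate_executive(200, stolen_pct=15))   # cycles stolen, backlog builds up
```

With no cycles stolen the backlog never builds and no alarm occurs. With cycles stolen, alarms recur, yet the guidance count is the same in both runs, which is the argument made above for the priority executive.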

MARGARET HAMILTON: It was even more complicated than that, because the error was in the actual documentation.

FRED MARTIN: Yes. I was going to mention that in a moment. So this feeling went forward, including the NASA folks: people like Jack Garman, who absolutely felt that it was important to have such a system in the shuttle, versus the airplane folks, who felt that they would never be able to test the software afterward if they couldn't have an exact match and know exactly what happened at any time. And the asynchronous executive was almost not repeatable, you might say. That caused a great deal of difficulty.

What Margaret is pointing out is that, when we finally got and found-- I remember the instant that we ran upstairs to look at the telemetry to see where the bits were set. And sure enough, the radar bit, which was picked up by telemetry and downloaded, that was in, let's say, the 15th word, or however many bits were in the telemetry. And bit 9 or whatever it was showed that the rendezvous radar switch was on. Eventually, the ground told the astronaut, when he was about to take off from the moon, to put the switch in the right position. He said it in a very low-key fashion. But when the issue was run down to the end, it was found that it was in the crew procedures to put the switch in the RR position. So the next question was, well, if you put the switch in this position, how come this wasn't picked up in the simulators at Grumman, which these guys had been training on for a couple of years? And in that crew simulator, they had always done exactly what was in the document and put the switch in that position.

However, that switch wasn't connected to anything at Grumman. In other words, that switch was not connected in the simulator to the AGC which would slow them down in that simulator. So they trained exactly to what those crew procedures said, each time. And they did it on the landing too.

MARGARET HAMILTON: It kind of gives you an understanding more of how important everything is with respect to the entire system of software, peopleware, hardware...

DAN LICKLY: I could put it the other way. Why didn't our simulations follow the crew procedure?

FRED MARTIN: I don't know that the crew did that in our simulations. I don't know. I don't think it was in our digital simulator.

MARGARET HAMILTON: But when you're testing, what would make the software do something like that? In other words, you begin to think in terms of modeling and designing a problem, and where do you stop? Where's the outside of the outside? You are modeling all these things as one system, of which part is realized as software, part is peopleware, part is hardware. But then you begin to model it as an entire enterprise and then simulate it as an enterprise. And it makes you realize that, when people blame something as being a software error or a hardware error, you're not necessarily sure what it is and what it means to be a software error or a hardware error.

This brings up all these questions when you have events like this take place. And we've still not solved such issues today, in the industry at large.

SLAVA GEROVITCH: Did they stick to the asynchronous design in the shuttle program?

FRED MARTIN: I think that they had a mixed system. It was some sort of a mixed system. They had some hardware constraints in their IO processors and the synchronization of IO with the computer that made it impossible to have, I think, a purely asynchronous executive. So they had some real hardware issues involved in it. In the company that we formed afterward, we created a programming system for the Shuttle in which we built into the programming language the concept of asynchronicity. That is, you could have multiple jobs and priorities and interrupts and things like that.

The Shuttle people, for the most part, didn't want to use those features that we had put in the language. We were very influenced in the language by experiencing it in Apollo. And very influenced by the executive that Hal Laning had designed, and the manner in which he had designed that whole thing. We were very influenced in the language design to take advantage of those things. They were used to a minor extent in the manner in which the shuttle was put together.

Hugh Blair-Smith adds:

Fred’s account is accurate except that the percentage of wasted time was exactly 15%, just precisely using up, and microscopically overusing, the planned slack time. Another factor that I remember was Grumman’s insistence that their 400-cycle power supply, operating the radar, be independently phased relative to the GN&C system’s 400-cycle power supply. Had the two supplies been phase-locked, as we had urgently proposed, the switch error would not have created bogus angle differences to run the shaft and trunnion axis angle counters at full speed. Who would ever have thought, in a simulator, to randomize the phase between two AC power supplies?


site last updated 12-08-2002 by Alexander Brown