The Ariane 5 Failure: How a Huge Disaster Paved the Way for Better Coding
Imagine taking years of research, hundreds of millions of dollars, and huge amounts of effort, and seeing it all fly up into the sky and turn into a huge ball of fire and smoke while the world watched.
Unfortunately, this nightmare scenario is exactly what happened at the first Ariane 5 launch on June 4, 1996. Now known as one of the most reliable rocket launchers, Ariane 5 was designed as Europe’s next-generation launcher: it would replace the Ariane 4 and prepare the European Space Agency (ESA) for a collaboration with NASA on a proposed space station. The primary mission of that first flight was to study the interaction between solar wind and the Earth’s magnetosphere in more detail than ever before. However, the strategic goal for the ESA was to maintain a competitive edge for Europe in the space industry, and the cost of this failure was estimated to be $370m.
When the first crewless Ariane 5 sped away from the coast of French Guiana on that fateful day, the initial excitement of those involved transformed into despair in just 37 seconds: The rocket flipped 90 degrees and, mere seconds later, became a huge ball of fire that shot higher into the sky. Pieces of the rocket then traveled away from the fireball like shooting stars, flickering with light before falling to the ground, followed by streaks of gray smoke.
A software failure
So how could such a monumental disaster have occurred? A special team scoured the scene, collecting debris to try to find out. A week after the failure, Dr Colin O’Halloran, then of the UK’s Defence Evaluation and Research Agency (DERA), was appointed to represent the UK on the board of inquiry into the disaster. He was joined by representatives from France, Germany, Italy, and Sweden. According to O’Halloran, on the very first morning the board met, joined by a technical panel, it appeared to most that the disaster had been caused by a software failure.
“When I raised the idea that it could have been due to a software failure because there was a failure in the inertial reference system, there were eyebrows raised as if I had previous knowledge of what they had discovered,” he says. In fact, O’Halloran had looked at many software projects while at DERA and had noted that it was common for software issues to be the root of wider failures.
Despite this, the board did not focus solely on such issues. There was a broad range of potential causes and questions, and as with any investigation, its members wanted to exhaust all possibilities. O’Halloran’s main focus was on the register values, which had indicated that the failure had something to do with diagnostic information. To dig deeper into the code, O’Halloran and a colleague went to the company that wrote the software, printed the code, and examined it.
“I analyzed the source and machine code and, given the flight telemetry, determined that when the value output onto the 16-bit bus exceeded what could be represented in 16 bits then, according to the specification of the machine code, the flag called operand error would be set to true,” he says. “This in itself did not cause the failure. It was some later detection logic in the software that examined the operand flag and interpreted it as a manifestation of a hardware error and sent out a diagnostic message that included the location of whereabouts in the software the operand flag had been set to true.” This is how he knew where to examine the software. “The diagnostic information was interpreted by other systems on Ariane 5 as a command to the rocket nozzle actuators that sent Ariane [5 flight] 501 off course and caused it to break up. Ariane 5’s overall system fault tolerance strategy was therefore a key factor in the failure, the implicit assumption being that any error detected must be due to a hardware failure rather than a systematic software error.”
Any software error detected by the fault detection logic would have led to the primary system being ignored and the secondary system being used instead. But because the error was systematic, it would show up in the secondary backup system too, and be interpreted as a failure there as well. This would have led the onboard computer to believe the rocket had failed.
“The diagnostic information output was unfortunately interpreted as a genuine command from the control system, but Ariane 5 would have been essentially flying blind without the information from the inertial reference system,” explains O’Halloran.
The error was easily traceable through the diagnostic information that O’Halloran had: A 64-bit value too large to be represented in 16 bits had been created and put onto a 16-bit bus, causing an overflow.
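To make the overflow concrete, here is a minimal Python sketch of what happens when a value too large for 16 bits is forced into a 16-bit signed integer. It is an illustration only, not the flight code: the function name `to_int16` and the `operand_error` flag it returns are assumptions chosen to mirror O’Halloran’s description.

```python
from typing import Tuple

INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def to_int16(value: float) -> Tuple[int, bool]:
    """Convert a 64-bit float to a 16-bit signed integer (hypothetical helper).

    Returns (result, operand_error). The flag is set when the value does
    not fit in 16 bits, loosely mimicking the operand error flag.
    """
    truncated = int(value)  # drop the fractional part
    operand_error = not (INT16_MIN <= truncated <= INT16_MAX)
    # Keep only the low 16 bits and reinterpret them as a signed number,
    # which is what a silent, unchecked conversion would leave behind.
    wrapped = (truncated + 0x8000) % 0x10000 - 0x8000
    return wrapped, operand_error
```

In this sketch, `to_int16(1000.7)` returns `(1000, False)`, while `to_int16(40000.0)` returns `(-25536, True)`: the number that comes out bears no relation to the one that went in, and only the flag records that anything went wrong.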
Much of the problem appeared to revolve around the culture of those involved in the project. “The Ariane 5 program involved people who had previously worked with hardware, as Ariane 4 had been largely hardware-dependent. With mechanical systems, if such an incident had occurred and caused the inertial reference system to fail, you would go to a backup. But because [with Ariane 5] it was a systematic error through software, the second inertial reference system had also shut down for the same reason—it was left without any navigation guidance, and you had a misrepresentation of the diagnostic data, which eventually led to the breakup of the rocket,” says O’Halloran.
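The difference between a random hardware fault and a systematic software error can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a reconstruction of the real system: the names `ReferenceUnit`, `OperandError`, and `navigate` are hypothetical, and the only thing the two units share with the real ones is that they run identical code.

```python
class OperandError(Exception):
    """Hypothetical error: the value does not fit in 16 bits."""


def convert(value: float) -> int:
    """Conversion shared by both units (same software on each)."""
    n = int(value)
    if not -32768 <= n <= 32767:
        raise OperandError(value)
    return n


class ReferenceUnit:
    """Simplified stand-in for one inertial reference unit."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.active = True

    def read(self, value: float):
        try:
            return convert(value)
        except OperandError:
            # Assumption: any detected error is treated as a hardware
            # fault, so the unit shuts itself down.
            self.active = False
            return None


def navigate(units, value: float):
    """Return (unit name, reading) from the first healthy unit, else None."""
    for unit in units:
        if not unit.active:
            continue
        reading = unit.read(value)
        if reading is not None:
            return unit.name, reading
    return None  # no unit left: the vehicle is flying blind
```

If the primary unit alone is switched off, `navigate` falls back to the backup, which is the case redundancy is designed for. If both units are healthy but the input is out of range, the same conversion fails on each in turn and `navigate` returns `None`: a backup that runs the same software offers no protection against an error in that software.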
Everything is obvious in hindsight
Although the Ariane 5 project went down in history as a monumental failure, the code was well written and a very good software engineering process had been followed throughout. “When you look at it, it’s kind of obvious… except it wasn’t,” says O’Halloran.
The organization that had written the software had initially put a guard against this kind of situation into the code, so that if an output was too large to fit in 16 bits, those working on the spacecraft would have been alerted earlier, during the testing phases. However, the guard was removed because the developers were motivated to reduce the load on the processor and so took away elements they thought were unnecessary.
“The failure to understand the guard was critical—they didn’t understand the consequence of removing that seemingly innocuous condition,” says O’Halloran. “They thought it wouldn’t make a difference to the overall reliability of the rocket.”
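A guard of the kind described here can be very small. The Python sketch below shows one possible form, assuming a policy of clamping the value to the nearest representable limit and recording an alert; the function name `guarded_to_int16`, the `alerts` list, and the clamping policy itself are illustrative assumptions, not what the original software did.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def guarded_to_int16(value: float, alerts: list) -> int:
    """Convert to 16 bits, clamping and recording an alert on overflow.

    Hypothetical guard: the clamping policy is an assumption made for
    this sketch.
    """
    n = int(value)  # drop the fractional part
    if n > INT16_MAX:
        alerts.append(f"value {value} exceeds 16-bit range, clamped")
        return INT16_MAX
    if n < INT16_MIN:
        alerts.append(f"value {value} below 16-bit range, clamped")
        return INT16_MIN
    return n
```

The two comparisons are the entire cost of the guard. In exchange, an out-of-range value during testing leaves a visible entry in `alerts` instead of passing through unnoticed.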
It’s also worth remembering that this was not a project worked on by amateur programmers or in a format that would lead to many errors. A committee looked at every aspect of the rocket—its reliability, availability, maintainability, and safety. Despite this, they still failed to appreciate the devastating impact removing the guard would have.
“The most interesting thing about the Ariane 5 bug is what it said about the dark art of software and its hypnotic power for diversion and distraction, making clever people forget really basic risk-assessment analysis, along with the sway of dealing with very large numbers,” says Bola Rotibi, Research Director of Software Development at analysis firm CCS Insight. “It failed because of not anticipating the limitation of the computer systems, the software processes, and the base components used for doing the number-crunching calculations driving Ariane 5’s progression to space.”
Perhaps the fundamental issue was that the whole failure arose from the requirements that had been set. Says O’Halloran: “The inertial reference system was redeveloped from what it had been for Ariane 4, and [those working on Ariane 5] saw this as an opportunity to get a more reliable inertial reference system. But the problem was they faithfully reproduced the software to meet requirements that were there for Ariane 4, but there were further requirements for Ariane 5, particularly where the error came from.”
However, that’s not to say that another error would not have occurred. Rotibi feels this slack behavior of building on top of what was there before is inherent in many issues with software, even today. “[Those involved with the project] did not provide enough time to walk through the risk-assessment analysis and think from the perspective of what limitations could cause the computation to fail or deliver a catastrophic error,” she says. “It is the bane of the software world and the root of many problems that occur within software systems.”
Dr Bill Curtis, Senior Vice President and Chief Scientist of the software intelligence company CAST, believes that issues with software arise because there isn’t the same level of discipline that’s required for engineers when they start building hardware systems. “People go to these 90-day boot camps and come out as ‘software people,’ but they really don’t have adequate training to worry about security issues and numerical-analysis issues. And a lot of people working on software are rushed to get it out on time, so they don’t do an adequate job of evaluating it,” he says.
Along with these evaluations, those working on software also have to ensure its structure is sound from an engineering perspective, analyze its dynamics, and confirm that it is operating as it should be.
“There are many problems with building software,” says Curtis. “It requires process discipline and knowledge, and people require advanced tools, because some of the challenges are beyond what humans can comprehend, so we need more advanced automation on how the software is structured, what kind of defects they have on the architectural level and code level.
“Every time we catch up, we start building bigger systems, which are now several orders of magnitude larger than they were back when the Ariane 5 was [first] launched. They’re so complex that companies like Google have to build their own tools to keep up with their own scale and complexity.”
In addition, because organizations are shortening the time for software to be produced and tested as part of agile methods, it’s even harder to ensure that it is up to scratch. However, O’Halloran says that great advances have been made in software-verification systems, which check whether software is working properly. “Ariane 5 made us realize that if we had software-verification technology, we would be able to guard ourselves against something like this,” he says. “That doesn’t mean it wouldn’t happen at all, because there would still be cultural issues, but if tools could look at the complexity of what an end product is rather than how it was made, and the process goes back to the idea of quality control [then that is a positive move].”
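One simple idea behind such tools is to reason about the range of values a variable can take rather than testing individual inputs. The Python sketch below is a toy version of that idea under heavy assumptions: the `Interval` class, the `gain` factor, and every number used with it are invented for illustration, and real verification tools are far more sophisticated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed range [lo, hi] of values a quantity may take (hypothetical)."""

    lo: float
    hi: float

    def scale(self, k: float) -> "Interval":
        """Range of k * x for every x in this interval."""
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))

    def fits_int16(self) -> bool:
        """True if every value in the range fits a 16-bit signed integer."""
        return -32768 <= self.lo and self.hi <= 32767


def conversion_is_safe(input_range: Interval, gain: float) -> bool:
    """Check the whole declared input range, not just sampled values."""
    return input_range.scale(gain).fits_int16()
```

With a made-up gain of 100, `conversion_is_safe(Interval(-300, 300), 100)` is `True`, but widening the declared input range to `Interval(-500, 500)` makes it `False`. The code is unchanged in both cases; only the stated requirement differs, which is why a check like this has to be rerun whenever the operating envelope changes.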
What did sophistication look like in 1996?
It’s easy to look back a few decades and presume that the technology involved and the error made were simplistic. While the technology has advanced significantly, the error should not be seen as a simple blunder.
“It was a sophisticated mistake, as somebody had to be aware that the number produced that required processing was going to be bigger than the processor could handle. We still run into these challenges and interactions between people and systems today,” says Curtis.
The reality is that while awareness has grown about the importance of software in many big projects, there is still a disconnect between executives and those creating software in many enterprises and governments. O’Halloran sums this up with the recollection of someone present at the board inquiry saying, “Oh thank goodness it’s just the software and not the engine or something else more complicated.” This person hadn’t realized, of course, that there were hundreds of thousands of lines of code to go through.
The question is whether that same comment would be made in 2020, more than 20 years on from that first launch of Ariane 5. There is a lot more understanding that software isn’t easy, isn’t insignificant, and isn’t just for coders to work on in a basement—that it’s a fundamental part of everything. However, there’s still room for improvement.
This article is part of Behind the Code, the media for developers, by developers.
Illustration by Blok