Posted on 02/21/2004 3:20:02 PM PST by LibWhacker
SAN MATEO, Calif. When the Mars rover Spirit went dark on Jan. 21, a Jet Propulsion Laboratory team undertook to reprogram the craft's computer, only to find themselves introducing an unpredictable sequence of events. The trouble with the Mars rover Spirit started much earlier in the mission than the day the craft stopped communicating with ground controllers.
"It was recognized just after [the June 2003] launch that there were some serious shortcomings in the code that had been put into the launch load of software," said JPL data management engineer Roger Klemm. "The code was reworked, and a complete new memory image was uploaded to the spacecraft and installed on the rover shortly after launch."
That appeared to fix the problems that had been identified with the initial load. But what no one at JPL could have anticipated was that the new load also made possible a totally implausible sequence of events that would, many months later, silence Spirit.
The Spirit rover has a radiation-hardened RAD6000 CPU from Lockheed-Martin Federal Systems at the heart of the system. The processor accesses 128 Mbytes of RAM and 256 Mbytes of flash. Mounted in a 6U VME chassis, the processor board also has access to custom cards that interface to systems on the rover.
The operating system is Wind River Systems' VxWorks version 5.3.1, used with its flash file system extension. In operation, the real-time OS and all other executable code are RAM-resident.
The flash memory stores executable images that are loaded into RAM at system boot. Separately, about 230 Mbytes are used to implement a flash file system that stores data products, or data files that are created by the rover's subsystems and held for transmission to Earth.
Among the data products are the images created by the rover's cameras.
"Part of my responsibility in the data management team is to keep track of the data files that are created, transmitted and deleted on the rover during the mission," Klemm explained. "We recognized early in the planning process that the flash file system had a limited capacity for files. It is not just a limitation in the flash itself but also in the directory structure."
Klemm explained that as data is collected by Spirit, files are created and stored in the flash file system until a communications window opens an opportunity to transmit the data either directly to Earth or to one of the two orbiters circling the Red Planet. Then the files are transmitted. They are still held in the flash system until retrieved and error-corrected on Earth. If data is missing, requests are sent for retransmission. If the data is intact, a command is sent to delete the received files.
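The retention rule Klemm describes (hold each file in flash until Earth confirms it, resend on request, delete only on command) can be sketched as a small state machine. This is a minimal Python illustration only; the names `State`, `FlashStore`, `transmit_all` and `ground_report` are hypothetical and are not taken from the rover's flight software.

```python
from enum import Enum, auto


class State(Enum):
    STORED = auto()       # created on the rover, not yet sent
    TRANSMITTED = auto()  # sent, but still held in flash
    DELETED = auto()      # ground confirmed receipt and ordered deletion


class FlashStore:
    """Toy model of data products held in flash until Earth confirms them."""

    def __init__(self):
        self.files = {}  # file name -> State

    def create(self, name):
        self.files[name] = State.STORED

    def transmit_all(self):
        # A communications window opened: send everything not yet sent.
        for name, state in self.files.items():
            if state is State.STORED:
                self.files[name] = State.TRANSMITTED

    def ground_report(self, name, intact):
        # Ground either orders deletion or requests retransmission.
        if self.files.get(name) is not State.TRANSMITTED:
            return
        self.files[name] = State.DELETED if intact else State.STORED

    def held(self):
        # Names of files still occupying flash (everything not yet deleted).
        return sorted(n for n, s in self.files.items() if s is not State.DELETED)
```

The point of the sketch is that a transmitted file keeps occupying a directory entry until the delete command arrives, so the file count on board depends on how quickly the ground closes the loop.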
"But there were also directories of files already placed into the file system in the launch load," Klemm said. "When we uploaded a new image to the rover, we recognized that those files would have to be deleted, because they were being replaced by a new set using different directories."
Accordingly, on Martian day 15 (or sol 15) of rover operation, a utility was uploaded to the rover to find and delete the old directories.
Murphy strikes on Mars
But the transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19.
Thus was the fuse lit on a software hand grenade.
The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system.
As Murphy would have it, earlier on sol 19, Spirit attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended. That in turn led to a reboot, which attempted to mount the flash file system. But the utility software was unable to allocate enough memory for the directory structure in RAM, causing it to terminate, and so on.
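The loop described above can be modeled in a few lines. This is a toy Python sketch, not the flight software: the capacity numbers, the `mount` and `boot` functions, and the `skip_mount` flag are all invented for illustration. The flag stands in for a boot that does not mount the flash file system.

```python
class AllocationError(Exception):
    """Raised when the RAM directory table cannot hold every file."""


def mount(file_count, ram_capacity):
    # Mounting builds an in-RAM directory entry for every file in flash.
    if file_count > ram_capacity:
        raise AllocationError(
            f"{file_count} files exceed capacity {ram_capacity}"
        )
    return True


def boot(file_count, ram_capacity, skip_mount=False, max_attempts=5):
    """Return the number of boot attempts used, or None if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            if not skip_mount:
                mount(file_count, ram_capacity)
            return attempt  # boot completed
        except AllocationError:
            continue  # task suspended, system reboots, mount is retried
    return None  # still looping when we stopped watching
```

With made-up numbers, `boot(1100, 1000)` never completes because every reboot hits the same oversized directory, while `boot(1100, 1000, skip_mount=True)` completes on the first attempt. Nothing in the loop changes the file count, which is why rebooting alone could not fix it.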
Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot. And its human handlers at JPL seemed at a loss to help, unable to diagnose a system they could not see.
Luckily, early in the process of proposing failure scenarios, someone remembered the earlier failure to upload the second piece of the utility. The scenario was modeled, and it was discovered that a VxWorks flag that causes a task to be suspended on a memory allocation failure was set in the existing image.
"The irony of it was that the operating system was doing exactly what we'd told it to do," Klemm lamented.
Working on the theory that the rover was in fact listening and rebooting, the team commanded Spirit to reboot without mounting the flash file system.
The team then uploaded a script of low-level file manipulation commands that worked directly on the flash memory without mounting the volume or building the directory table in RAM. Using the low-level commands, about a thousand files and their directories, the leftovers from the initial launch load, were removed.
"At that point we mounted the flash file system and ran a checkdisk utility," Klemm said. To everyone's enormous relief, the mount was successful.
"As we had anticipated, there was some corruption from the event, so that was corrected," Klemm added. "In the process of going through the contents of the file system, we discovered a system log in which the problem was documented, step by step, right up to the allocation request that failed."
Klemm said that with the leftover directories and their files removed, the system is now functioning well. But just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure.
As a postscript, Klemm noted that the other day he heard a car commercial on the radio that made reference to the Mars rover, comparing, for example, the car's speed over the ground to Spirit's. In the process of touting the car's extended-warranty program, the ad noted that the Mars rover came with "interplanetary roadside assistance." "That phrase just stuck in my mind," Klemm said. "I love it."
For Heaven's sake, my Pocket PC phone has more "byte" power. A classic example of bureaucratic incompetence, not unlike the metric conversion nonsense awhile back. While I totally support these projects, bloated management is responsible. Every code 'bit' in such a small reserve needs to be scrutinized in a larger logic pattern. Geez, I wrote Y2K bugs in the early 80's in IBM assembler 360/370 hexadecimal just to save two bytes. That's where I learned to KISS: Keep It Simple, Stupid.
I say "Phooey" to the armchair geekxperts, who invariably show up to say how screwed up everything is. If y'all are so smart, why aren't you programming for interplanetary missions?
Well, you're REALLY going to enjoy the future. When all your code development and maintenance has been contracted out to a 3rd party, who has subcontracted it to India, who has subcontracted it to China. Where communication will go three levels, in at least 3 languages. (Heck, most people don't even know that in India, programmers in code shops may speak a half dozen different Indian languages among their little sub-groups.) But don't worry! They will all know .NET! Or J2EE!
From some Slashdot discussions I've seen on this, VxWorks seems to have some nifty on-the-fly reconfiguration features, especially when it's being run in debug mode [as they're doing on the Rover]. Apparently you can more or less replace the operating system in situ without a reboot [imagine recompiling a kernel, or applying a service pack, without having to do a full reboot afterwards].
This very feature appears to have been what saved their asses in this situation.
What always aggravates me is code that, with the same inputs, will sometimes do one thing and sometimes another. I prefer consistent bugs.
What's really sad is how often businesses just don't understand how important the person who translates user needs into program requirements is. Few users know how to express what they need, and programmers generally never use the software, and so don't know what to stress. I will grant that many of the people who do this really aren't very good at it, but having a good person doing this can cut development time by months.
Of course your Pocket PC phone isn't radiation hardened, probably proportionately uses far more power, won't operate consistently at -70C, and likely needs a substantial atmosphere to bleed heat. Then there's the issue of surviving a planetary landing.
Try reading it again, that's not what it says.
In fact, it explicitly states that they *did* take into account the pre-launch directories, *and* had sent commands to erase them once the new software was successfully uploaded and functioning as a replacement.
The problem was that the transmission instructing the Rover to delete the old directories was not properly received (due to signal noise), and thus the instructions were rescheduled for retransmission.
But before the retransmission occurred, the Rover's storage space filled up.
It seems inappropriate to blame this on a "Project Management failure".
[Highway55:] Incompetence!
No, bad timing.
[thoughtomator:] Sloppy code... patched by more sloppy code.
It wasn't "sloppy code" that caused the problem. In fact, it was due to being careful enough to keep the original code onboard until they were sure that the new code was functioning successfully and they knew it was safe to issue a command to the Rover to delete the old code. And there's no evidence that the patch was in any way "sloppy". The error arose because the command to delete the old code failed to arrive on time due to normal communication failures.
[quantim:] A classic example of bureaucratic incompetence, not unlike the metric conversion nonsense awhile back. While I totally support these projects, bloated management is responsible.
How do you figure that?
The job for writers of code is never finished, for any job, and I can't figure it out unless it is just plain communication, at the human level.
The problem is that there are more than a small infinite number of things that can go wrong with a program - in fact, there are such a large infinite number of things that can go wrong that Turing was able to prove some meta-theorems about the sorts of things we'll never be able to do with computers.
But even in day to day operations, the kinds of things that could possibly go wrong are just mind-boggling. For instance, programmers are constantly writing code that looks like
    z = x + y;
[Take the values at x and at y, add them together, and store the result at z.]
99.99999999999999999999% of the time, nothing will go wrong with this code. But every once in a while, the value at x and the value at y will be so large that their sum x + y is too big to fit into the piece of RAM set aside for z. As an example, you'd think that 2000000000 + 2000000000 would equal 4000000000, but it doesn't; it equals -294967296.
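The arithmetic behind that number, assuming z is a 32-bit signed integer: the true sum 4000000000 is larger than the biggest value such an integer can hold (2147483647), so it wraps around by 2^32, and 4000000000 - 4294967296 = -294967296. Here is a small Python sketch that reproduces the wraparound. Python's own integers never overflow, so the sketch borrows the fixed-width `c_int32` type from the standard `ctypes` module; the function name `add_int32_wrapping` is made up for this example.

```python
import ctypes


def add_int32_wrapping(x, y):
    """Add two integers the way a 32-bit signed variable would, wrapping on overflow."""
    # ctypes integer types do no overflow checking; they keep the low 32 bits.
    return ctypes.c_int32(x + y).value


print(add_int32_wrapping(2000000000, 2000000000))  # -294967296
print(add_int32_wrapping(1, 2))                    # 3
```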
The correct way to do even the simplest computer addition looks more like
    try { z = x + y; } catch(OverflowBitError theOverflowBitError) { throw(theOverflowBitError); }
But this introduces into the program a hideous new level of complexity, requiring the programmer to keep track of a monstrous construction called an error stack, and, about this time, he starts to pull his hair out and says to himself, "Ah, to hell with it. This'll never happen. Screw the error checking."
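For what it's worth, the checked version does not have to be hideous. Below is a minimal Python sketch of a range-checked 32-bit add. The names `add_int32_checked`, `INT32_MIN` and `INT32_MAX` are invented for the example, and the built-in `OverflowError` stands in for the `OverflowBitError` in the snippet above, which is pseudocode rather than a real exception type.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def add_int32_checked(x, y):
    """Return x + y, or raise OverflowError if the sum does not fit in 32 signed bits."""
    total = x + y  # Python integers are unbounded, so this sum itself cannot overflow
    if not INT32_MIN <= total <= INT32_MAX:
        raise OverflowError(f"{x} + {y} does not fit in a signed 32-bit integer")
    return total
```

The caller still has to decide what to do when the exception fires, which is the part that actually costs time: saturate, retry with a wider type, or abandon the operation.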
The same sort of phenomena occur when writing to files, but already the sequence of events is becoming vastly more complicated. To write to a file properly, you have to do something like the following:
1) Check to see whether a file with the given name exists already.
2) Decide what your contingency plans will be if a file with that name exists already. [Do I overwrite the existing file? Do I append to the end of the existing file? Do I parse the existing file and insert somewhere in the middle? Do I throw an error and refuse to proceed?]
3) In the case that the file doesn't exist already, you then attempt to create it and lock it for writing. Both creating and locking could return errors, so you've got to have contingency plans for those.
4) At some point, you have to make a guesstimate as to how many bytes you'll be writing to the file. Hopefully it's a fixed number of bytes, but if it's not [i.e. if you don't know a priori how many bytes you'll be writing], then your task just became orders of magnitude more difficult.
5) Once you have a guesstimate as to how many bytes you'll be writing to the file, you have to ask the operating system, "How many bytes remain free on the volume to which I'm attempting to write?" If you're anywhere near 90% full on the volume, or if there's any chance you'll be going anywhere near that mark with your file write, you've got to start developing contingency plans for what you might do as the volume reaches its capacity.
6) Once you start writing to disk [especially if you don't know a priori how many bytes you'll be writing], you've got to periodically stop and continue to ask the operating system, "Hey, while I was away writing to my file, did anyone else write to the volume? If so, how many bytes remain free?"
7) Hopefully, when all is said and done, there will have been enough room on the volume to hold the bytes in the file you were trying to write. But we aren't done yet, because the file hasn't been written to the hard drive - it's only been written to the operating system's file cache in RAM, so you have to send a signal to the file cache to write the file to the host bus adapter. But of course the host bus adapter has its own cache, so you have to send a signal to the host bus adapter to flush its cache to the hard drive. But of course the hard drive has its own cache, so you have to send a signal to the hard drive to flush its cache, and so on, and of course all of these signals can fail and return error messages ad infinitum ad infinitum ad infinitum...
Most programmers pull their hair out and go insane somewhere between steps 2) and 3), and, for the sake of their own sanity, simply declare, "Ah, to hell with it, this'll never happen..."
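A rough idea of what the steps above look like in practice, as a Python sketch using only the standard library. It assumes the simplest case from step 4 (the byte count is known in advance) and the "refuse to proceed" choice from step 2. The function name `careful_write` and the `reserve_fraction` parameter are made up for this example, and the default of 0.10 simply mirrors the 90%-full rule of thumb in step 5.

```python
import os
import shutil


def careful_write(path, data, reserve_fraction=0.10):
    """Write bytes to a new file, refusing to overwrite or to fill the volume.

    reserve_fraction is an illustrative safety margin, not a recommended value.
    """
    directory = os.path.dirname(os.path.abspath(path))

    # Step 5: ask how much room is left before writing anything.
    usage = shutil.disk_usage(directory)
    if usage.free - len(data) < usage.total * reserve_fraction:
        raise OSError("not enough free space on the volume")

    # Steps 1-3: mode "xb" creates the file and raises FileExistsError if it
    # already exists, so the existence check and the create are one step.
    with open(path, "xb") as f:
        f.write(data)
        # Step 7: push Python's buffer to the OS, then ask the OS to push its
        # cache to the device. Caches inside the device are out of our reach.
        f.flush()
        os.fsync(f.fileno())
    return len(data)
```

Even this leaves out step 6 (re-checking free space while writing) and file locking, and every call in it can still raise an `OSError` that the caller has to handle.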
But of course, just precisely this sort of thing did happen with the Rover: Someone wrote a routine that forgot to ask the operating system whether there was any free space left on the volume before attempting to write to the volume. The routine kept writing and writing and writing, and eventually the volume maxed out and the whole house of cards came tumbling down.
As a programmer, there's never enough time to check for all of these things that can go wrong, and, as a deadline looms, you start cutting corners and trying to focus on only the most urgent aspects of the package, hoping that the esoteric stuff won't trigger any errors often enough to be noticed, and with the further hope that at some point in the future, you'll be able to go back and update your code to catch some more of the errors you didn't have time to catch the first go 'round.
In real life, you ship the flawed product, and a few weeks later you get an angry call from a customer [in the general time frame of 8AM to 5PM] asking why your code just crashed at the plant 20 miles down the road. You have an emergency debugging session, find the bug, patch it, and someone drives a CD-ROM with the new code over to the customer for installation.
In the unreal life of NASA, you get an angry call at 3AM asking why your code just crashed on a planet called Mars about 200,000,000 miles down the road, and after your emergency debugging session produces a candidate for a patch, you darn better hope that you've got an operating system like VxWorks [in debug mode] that can patch itself on the fly.
It was incomplete QA that did not take into account these scenarios, which were not out of the realm of possibility given the conditions.
Either