Free Republic
Browse · Search
News/Activism
Topics · Post Article


The trouble with Rover is revealed
EE Times ^ | 2/20/04 | Ron Wilson

Posted on 02/21/2004 3:20:02 PM PST by LibWhacker

SAN MATEO, Calif. — When the Mars rover Spirit went dark on Jan. 21, a Jet Propulsion Laboratory team undertook to reprogram the craft's computer, only to find themselves introducing an unpredictable sequence of events. The trouble with the Mars rover Spirit started much earlier in the mission than the day the craft stopped communicating with ground controllers.

"It was recognized just after [the June 2003] launch that there were some serious shortcomings in the code that had been put into the launch load of software," said JPL data management engineer Roger Klemm. "The code was reworked, and a complete new memory image was uploaded to the spacecraft and installed on the rover shortly after launch."

That appeared to fix the problems that had been identified with the initial load. But what no one at JPL could have anticipated was that the new load also made possible a totally implausible sequence of events that would, many months later, silence Spirit.

The Spirit rover has a radiation-hardened RAD6000 CPU from Lockheed-Martin Federal Systems at the heart of the system. The processor accesses 128 Mbytes of RAM and 256 Mbytes of flash. Mounted in a 6U VME chassis, the processor board also has access to custom cards that interface to systems on the rover.

The operating system is Wind River Systems' VxWorks version 5.3.1, used with its flash file system extension. In operation, the real-time OS and all other executable code are RAM-resident.

The flash memory stores executable images that are loaded into RAM at system boot. Separately, about 230 Mbytes are used to implement a flash file system that stores “data products,” or data files that are created by the rover's subsystems and held for transmission to Earth.

Among the data products are the images created by the rover's cameras.

“Part of my responsibility in the data management team is to keep track of the data files that are created, transmitted and deleted on the rover during the mission,” Klemm explained. “We recognized early in the planning process that the flash file system had a limited capacity for files. It is not just a limitation in the flash itself but also in the directory structure.”

Klemm explained that as data is collected by Spirit, files are created and stored in the flash file system until a communications window opens — an opportunity to transmit the data either directly to Earth or to one of the two orbiters circling the Red Planet. Then the files are transmitted. They are still held in the flash system until retrieved and error-corrected on Earth. If data is missing, requests are sent for retransmission. If the data is intact, a command is sent to delete the received files.
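The hold-until-confirmed handling Klemm describes can be pictured as a small state machine. The Python sketch below is only an illustration: the class, method and file names are hypothetical and are not taken from the rover's flight software. It shows the one property that matters for the rest of the story, namely that a file keeps occupying flash (and a directory entry) until the ground explicitly confirms it.

```python
from enum import Enum, auto


class State(Enum):
    STORED = auto()    # written to flash, not yet sent
    SENT = auto()      # transmitted, but still held in flash
    DELETED = auto()   # removed after the ground confirmed receipt


class DataProductStore:
    """Toy model of the protocol described above (all names hypothetical)."""

    def __init__(self):
        self.files = {}

    def create(self, name):
        self.files[name] = State.STORED

    def transmit_all(self):
        """A communications window opens: send everything not yet sent."""
        sent = [n for n, s in self.files.items() if s is State.STORED]
        for name in sent:
            self.files[name] = State.SENT
        return sent

    def request_retransmit(self, name):
        """Ground found missing data: queue the file for the next window."""
        if self.files.get(name) is State.SENT:
            self.files[name] = State.STORED

    def confirm_and_delete(self, name):
        """Ground verified the file: only now is the flash space released."""
        if self.files.get(name) is State.SENT:
            self.files[name] = State.DELETED

    def held_in_flash(self):
        return sorted(n for n, s in self.files.items() if s is not State.DELETED)


store = DataProductStore()
store.create("pancam_001")
store.create("navcam_002")
store.transmit_all()
store.confirm_and_delete("pancam_001")   # arrived intact
store.request_retransmit("navcam_002")   # arrived with gaps
print(store.held_in_flash())             # ['navcam_002']
```

In this model nothing leaves flash on the rover's own initiative, which is why the number of files on board depends on how quickly the ground sends delete commands.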

"But there were also directories of files already placed into the file system in the launch load," Klemm said. "When we uploaded a new image to the rover, we recognized that those files would have to be deleted, because they were being replaced by a new set using different directories."

Accordingly, on Martian day 15 (or “sol 15”) of rover operation, a utility was uploaded to the rover to find and delete the old directories.

Murphy strikes on Mars

But the transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19.

Thus was the fuse lit on a software hand grenade.

The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system.

As Murphy would have it, before the sol 19 retransmission could take place, Spirit attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended. That in turn led to a reboot, which attempted to mount the flash file system. But the utility software was unable to allocate enough memory for the directory structure in RAM, causing it to terminate, and so on.
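The cycle described above is easier to see as a loop. The following Python sketch is a simplified model, not JPL's code: the function names, the capacity figure and the file counts are all made up for illustration. It assumes that mounting means building a directory table in RAM with a fixed capacity, and that a failed mount ends the boot attempt and leads to another one.

```python
class AllocationError(Exception):
    """The RAM directory table cannot hold every file found in flash."""


def mount(files_in_flash, table_capacity):
    """Build the RAM directory table; fail if flash holds too many entries."""
    if files_in_flash > table_capacity:
        raise AllocationError("directory table does not fit in RAM")
    return files_in_flash


def boot(files_in_flash, table_capacity, mount_flash=True):
    """One boot attempt. Returns True if the system comes up."""
    if not mount_flash:
        return True   # degraded mode: no file system, but commandable
    try:
        mount(files_in_flash, table_capacity)
    except AllocationError:
        return False  # task suspended, system resets, next attempt follows
    return True


def boot_attempts(files_in_flash, table_capacity, max_attempts=5):
    """Return the attempt number that succeeded, or None if none did."""
    for attempt in range(1, max_attempts + 1):
        if boot(files_in_flash, table_capacity):
            return attempt
    return None


CAPACITY = 4000  # made-up figure, for illustration only

print(boot_attempts(4500, CAPACITY))             # None
print(boot(4500, CAPACITY, mount_flash=False))   # True
print(boot_attempts(3500, CAPACITY))             # 1
```

Because nothing changes between attempts, retrying can never succeed on its own. In the model, the only ways out are to skip the mount or to reduce the number of files in flash.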

Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot. And its human handlers at JPL seemed at a loss to help, unable to diagnose a system they could not see.

Luckily, early in the process of proposing failure scenarios, someone remembered the earlier failure to upload the second piece of the utility. The scenario was modeled, and it was discovered that a VxWorks flag that causes a task to be suspended on a memory allocation failure was set in the existing image.

"The irony of it was that the operating system was doing exactly what we'd told it to do," Klemm lamented.

Working on the theory that the rover was in fact listening and rebooting, the team commanded Spirit to reboot without mounting the flash file system.

The team then uploaded a script of low-level file manipulation commands that worked directly on the flash memory without mounting the volume or building the directory table in RAM. Using the low-level commands, about a thousand files and their directories, the leftovers from the initial launch load, were removed.

"At that point we mounted the flash file system and ran a checkdisk utility," Klemm said. To everyone's enormous relief, the mount was successful.

"As we had anticipated, there was some corruption from the event, so that was corrected," Klemm added. "In the process of going through the contents of the file system, we discovered a system log in which the problem was documented, step by step, right up to the allocation request that failed."

Klemm said that with the leftover directories and their files removed, the system is now functioning well. But just in case, the team is working on an exception-handler routine that will more gracefully recover from an allocation failure.
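The article does not say how the planned exception handler will work. As one possible reading, the Python sketch below contrasts "suspend the task on allocation failure" with "log the failure and carry on". Every name in it is hypothetical, and the choice to drop the new product is an assumption made for the example, not something the team described.

```python
class DirectoryFull(Exception):
    """Raised when the (hypothetical) directory table has no free slot."""


class DirectoryTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def allocate(self, name):
        if len(self.entries) >= self.capacity:
            raise DirectoryFull(name)
        self.entries.append(name)


def store_product(table, name, log):
    """Try to record a new file; on failure, log and drop instead of halting."""
    try:
        table.allocate(name)
        return True
    except DirectoryFull:
        log.append(f"allocation failed for {name}; product dropped")
        return False


table = DirectoryTable(capacity=2)
log = []
results = [store_product(table, n, log) for n in ("a", "b", "c")]
print(results)  # [True, True, False]
print(log)      # ['allocation failed for c; product dropped']
```

The point of the design is that a full table costs one data product and leaves a log entry behind, rather than taking down the task that asked for the allocation.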

As a postscript, Klemm noted that the other day he heard a car commercial on the radio that made reference to the Mars rover, comparing, for example, the car's speed over the ground to Spirit's. In the process of touting the car's extended-warranty program, the ad noted that the Mars rover came with "interplanetary roadside assistance." "That phrase just stuck in my mind," Klemm said. "I love it."


TOPICS: News/Current Events
KEYWORDS: directory; file; flash; jpl; mars; memory; nasa; rover; spirit; techindex

1 posted on 02/21/2004 3:20:02 PM PST by LibWhacker

To: LibWhacker
*bump* for a lucid explanation
2 posted on 02/21/2004 3:22:56 PM PST by Cboldt

To: LibWhacker
it was probably that pre-installed Microsoft Works and free AOL trial that did it ... plus, it turns out the Rover was playing Minesweeper and Solitaire all the time it wasn't communicating ... 8)

interesting details on the equipment though ...
3 posted on 02/21/2004 3:24:47 PM PST by Bobby777

To: LibWhacker
So the original specs either did not take into account pre-launch directories or assumed they would be erased somehow.

Project Management failure.

4 posted on 02/21/2004 3:35:23 PM PST by Semper Paratus

To: Cboldt
It does an amazing job with comparatively little RAM.
5 posted on 02/21/2004 3:38:58 PM PST by billorites (freepo ergo sum)

To: Semper Paratus
Here's what got me:
Luckily, early in the process of proposing failure scenarios, someone remembered the earlier failure to upload the second piece of the utility.
Makes it sound like it was total serendipity that the problem was ever correctly diagnosed. Criminy, I'm losing my faith in that agency!
6 posted on 02/21/2004 3:43:09 PM PST by LibWhacker

To: LibWhacker
Here is the uploaded code that was in error!
for (i=1; i>0; i++)
    waste_dollars(i);

7 posted on 02/21/2004 3:44:04 PM PST by patriot5186

To: LibWhacker
"The irony of it was that the operating system was doing exactly what we'd told it to do,"

That is very often the case, BTW.

Wind-River. I'd been trying to remember the name of that company... just happened to pop up in this article. Ha.

8 posted on 02/21/2004 3:44:23 PM PST by Who dat?

To: billorites
It does an amazing job with comparatively little RAM.

The article does a pretty darn good job of explaining how filling up memory space resulted in hanging. It's great to read of the programmers' abilities in resolving the matter. The fact of the program keeping a log file that dutifully recorded events ... wow.

9 posted on 02/21/2004 3:48:30 PM PST by Cboldt

To: Who dat?
"The irony of it was that the operating system was doing exactly what we'd told it to do,"

That is very often the case, BTW.

As far as I know, that is most often the case. The first time a digital machine takes off with its own mind will be extremely newsworthy.

10 posted on 02/21/2004 3:51:54 PM PST by Cboldt

To: Cboldt
Incompetence!
11 posted on 02/21/2004 3:52:14 PM PST by Highway55 ("You're either on the bus, or off the bus.")

To: LibWhacker
My experience with SW engineers is that the job is most always completed as to be adequate but never finished to completion. They seem to always offer that upgrades can be added as needed. More than once code writers have informed me that my desire for a program to work in a specific fashion was not the way he understood it, therefore it was written in a way that was not correct for the functionality of the system. Days and sometimes months later corrections were made to make the system run correctly.

The job for writers of code is never finished, for any job, and I can't figure it out unless it is just plain communication, at the human level.
12 posted on 02/21/2004 3:53:23 PM PST by Final Authority

To: Highway55
Incompetence!

Mistake. Not as bad as the one that doomed previous Mars missions, and obviously, not as bad as the errors in judgement that doomed Columbia.

13 posted on 02/21/2004 3:54:20 PM PST by Cboldt

To: billorites
I was wondering why they chose Wind River for the OS, but I bet it's that they gave human support for questions about their products.
14 posted on 02/21/2004 3:55:15 PM PST by Thebaddog (Woof this!)

To: Cboldt
Sloppy code... patched by more sloppy code. For more information on that sort of thing, try this example.
15 posted on 02/21/2004 3:57:51 PM PST by thoughtomator ("What do I know? I'm just the President." - George W. Bush, Superbowl XXXVIII halftime statement)

To: thoughtomator
Sloppy code... patched by more sloppy code. For more information on that sort of thing, try this ....

LOL. Web-page loop by MS. Hey, $hit happens. Not being in the know about Rover code, my comments are those of a casual observer. I find the description illuminating.

16 posted on 02/21/2004 4:04:19 PM PST by Cboldt

To: Bobby777
"... plus, it turns out the Rover was playing Minesweeper and Solitaire all the time it wasn't communicating ... 8)"

While we are on the subject, would you happen to have a cheat for ms solitaire?
17 posted on 02/21/2004 4:08:07 PM PST by El Gran Salseron (It translates as the Great, Big Dancer, nothing more. :-))

To: Cboldt
Hey I'm even funnier than I thought... the Microsoft web site looks to be down at the moment LMAO... worse, actually, it's up and times out on response ROFL!
18 posted on 02/21/2004 4:13:32 PM PST by thoughtomator ("What do I know? I'm just the President." - George W. Bush, Superbowl XXXVIII halftime statement)

To: LibWhacker
The team then uploaded a script of low-level file manipulation commands that worked directly on the flash memory without mounting the volume or building the directory table in RAM

Sometimes ya gotta get a command prompt and do a little DOS.

19 posted on 02/21/2004 4:13:59 PM PST by Flyer (Don't abandon our military - Re-elect President Bush!)

To: thoughtomator
Hey I'm even funnier than I thought... the Microsoft web site looks to be down at the moment LMAO... worse, actually, it's up and times out on response ROFL!

Linux desktop here, computers being a hobby and meant for fun. I don't recall your name on the SCO threads, but if you like to poke at MS, you really should check the SCO threads.

20 posted on 02/21/2004 4:16:20 PM PST by Cboldt



Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson