Software lessons from Mars

ZDNet | January 28, 2004 | Rupert Goodwins

Posted on 01/30/2004 6:19:35 PM PST by gitmo

COMMENTARY--On a rocky plateau 300 million miles from here, the Mars Exploration Rover A--known to its friends and PR operatives as Spirit--sits quietly, conserving its strength after a near-fatal computer breakdown. As it was not named the Mars Static Nervous Wreck A, you may assume that things are not going exactly to plan.

Much closer to home, gaggles of geeks sit with furrowed brows as they work out exactly why the machine went mad just a third of the way through its mission, when everything was looking spotless. Instead of preparing to drill into a large and tempting rock, the robot had the silicon equivalent of a prolonged and devastating epileptic fit: when HQ tried to tune in, all they heard was binary gibberish.

You will not be surprised to learn that the number one suspect for the space probe's misery is buggy software, nor that Spirit's twin, Opportunity, is being handled with the kiddest of gloves as it too unfurls its sensors on the other side of the planet.

It's happened before: same planet, same people, same problem. More than six years ago, the Mars Pathfinder mission was also busy scurrying across the Martian surface--not doing so much science, but testing out many of the techniques used by the Rovers. Just as with the Rovers, a mysterious problem caused the machinery to reset itself continually, never getting to the point where it could do its programmed tasks or return information. And as with the Rovers, the engineers running the mission were presented with a mystery: the local replica of the robot wasn't repeating the problem and you can't slap a logic analyzer on a chip from the best part of half a billion miles away.

You might think, reasonably, that a NASA-developed space mission would fly with as many redundant, hand-crafted ultra-reliable systems as man has ever seen. It doesn't work like that, especially with robotic craft where lives aren't at stake. One of the most important factors is obtaining as much science per dollar as possible--which means keeping launch costs down and active payload up. Redundant systems are dead weight. And far from automatically improving safety, back-up systems increase complexity and can even reduce reliability--just ask anyone with experience of uninterruptible power supplies.

Hand-crafted code takes a very long time to create and verify: the timescales and budget are such that you want your team to be working on the unique aspects of the mission. All these factors mean that the technology on Mars looks awfully like that on your desk--a general purpose, standards-based platform like many others running a commercial operating system doing custom tasks.

The problems with Pathfinder boiled down to priorities, both technical and human. The technical side was a classic problem where a low priority task had taken exclusive ownership of a shared resource, only to be interrupted by a higher priority task. This also needed exclusive ownership of the same resource, so waited until it became free--which it never could, due to the suspended low priority task. After a while, safety software on the spacecraft noticed that the high priority task hadn't completed within its designated time: the computer therefore reset itself and stopped work until it got the next day's communication from Earth.
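
The scenario in this paragraph is usually called priority inversion. Accounts of the Pathfinder incident commonly add that a medium priority task was what kept the low priority task from running, and that the fix was to turn on priority inheritance for the shared lock; the sketch below assumes that version of events. It is a toy single-CPU tick simulation in Python, not the flight code: the task names, priorities, work durations and the seven-tick watchdog deadline are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

WATCHDOG_DEADLINE = 7  # hypothetical: "high" must finish by this tick


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time still needed
    needs_lock: bool   # True if all of its work is done holding the shared lock
    finished_at: Optional[int] = None


def simulate(priority_inheritance: bool) -> dict:
    """Run three invented tasks on one CPU and return each task's finish tick."""
    tasks = [
        Task("low", 1, 0, 3, True),
        Task("medium", 2, 1, 5, False),
        Task("high", 3, 1, 2, True),
    ]
    holder = None  # task currently owning the shared lock, if any
    tick = 0
    while any(t.finished_at is None for t in tasks):
        ready = [t for t in tasks if t.release <= tick and t.finished_at is None]

        def is_blocked(t):
            # A task is blocked if it needs the lock and another task holds it.
            return t.needs_lock and holder is not None and holder is not t

        blocked = [t for t in ready if is_blocked(t)]
        runnable = [t for t in ready if not is_blocked(t)]

        def effective(t):
            # With inheritance, the lock holder borrows the priority of the
            # most urgent task waiting on it.
            if priority_inheritance and t is holder and blocked:
                return max([t.priority] + [b.priority for b in blocked])
            return t.priority

        if runnable:
            current = max(runnable, key=effective)
            if current.needs_lock and holder is None:
                holder = current          # acquire the shared lock
            current.work -= 1             # run for one tick
            if current.work == 0:
                current.finished_at = tick + 1
                if holder is current:
                    holder = None         # release the lock on completion
        tick += 1
    return {t.name: t.finished_at for t in tasks}


def watchdog_would_reset(finish_ticks: dict) -> bool:
    """True if the high priority task missed the hypothetical deadline."""
    return finish_ticks["high"] > WATCHDOG_DEADLINE
```

With these made-up numbers, `simulate(False)` has the high priority task finishing at tick 10, past the hypothetical deadline, because the medium priority task hogs the CPU while the low priority task still holds the lock. `simulate(True)` lets the low priority task borrow the high priority, release the lock early, and the high priority task finishes at tick 5.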

The bug had been spotted before landing, but couldn't be reproduced back at base--it only happened when more data than anyone expected was being transferred and under certain timing conditions. Although nobody decided the bug was unimportant, it was deemed less important--and harder to find and fix--than many other ongoing problems, and the focus of the engineers was left on flight and landing. If the bug reappeared, they decided, the mission wouldn't be in jeopardy: the safety systems would ensure its survival and opportunities for recovery. The engineers considered that as these assumptions had been proved right and the mission was in the end a success, the prioritization was correct: a hard conclusion to argue against.

That recovery was aided by a couple of design decisions. The software on the spacecraft had a lot of diagnostic and logging features--the sort that normally get removed before shipping--in place and functional. This was part of a larger philosophy, "test what you fly and fly what you test": if you're responsible for looking after a system in the field, make as few changes as possible between testing and deployment--and don't touch the test system afterwards. Once the diagnostic logs were retrieved from the spacecraft, the problem could be replicated locally--with confidence that the results accurately reflected what was really happening, and that a correct fix could be made, tested and deployed effectively.
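
One cheap way to keep diagnostics in a shipped build is a fixed-size trace buffer that always records the most recent events and can be dumped after a fault. The Python sketch below is only an illustration of that idea under assumed names (`TraceBuffer`, `record`, `dump` are hypothetical); it says nothing about how the Pathfinder logging was actually built.

```python
from collections import deque


class TraceBuffer:
    """Keeps only the most recent `capacity` events, so it never grows."""

    def __init__(self, capacity: int):
        # deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, tick: int, task: str, event: str) -> None:
        """Append one event; cheap enough to leave enabled in the field."""
        self._events.append((tick, task, event))

    def dump(self) -> list:
        """Return the surviving events, oldest first, for offline analysis."""
        return list(self._events)


# Usage with invented events: only the last three survive.
trace = TraceBuffer(capacity=3)
trace.record(0, "low", "lock acquired")
trace.record(1, "high", "blocked on lock")
trace.record(1, "medium", "running")
trace.record(6, "low", "running")
trace.record(8, "low", "lock released")
recent = trace.dump()
```

The point of the fixed capacity is that the same code can run in test and in deployment without the log ever exhausting memory, which fits the "fly what you test" rule.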

For all this to happen with proprietary software that the engineers hadn't developed themselves, two further factors had to be in place. The company behind the software--Wind River Systems--had to be there with exemplary support: the bug wasn't in their code, but was dependent on abstruse aspects of the way the operating system worked. Linked to that, the mission engineers had to have an extraordinary knowledge of the guts of the operating system. As a report after the event said: "A good lesson when you fly COTS [commercial off-the-shelf] stuff--make sure you know how it works."

It's tempting to say that with luck, the current problems with Spirit will be fixable--but luck's not the factor. It may make for poor advertising copy, but the truth is that good software's got little to do with the operating system you buy, what languages you use or what rapid development system you use to cook your code. Solid knowledge, sound engineering discipline and a methodology spring-loaded for safety will save your project: if you can demand these from your suppliers and build them into your team, you too can rescue a project at half a billion miles. Lack any of these, and you'll be brought back to earth in no time.

TOPICS: Culture/Society; News/Current Events; Technical
KEYWORDS: development; mars; projectmanagement; software; spirit

The problems with Pathfinder boiled down to priorities, both technical and human. The technical side was a classic problem where a low priority task had taken exclusive ownership of a shared resource, only to be interrupted by a higher priority task. This also needed exclusive ownership of the same resource, so waited until it became free--which it never could, due to the suspended low priority task. After a while, safety software on the spacecraft noticed that the high priority task hadn't completed within its designated time: the computer therefore reset itself and stopped work until it got the next day's communication from Earth.

Sounds like a classic deadlock issue.

1 posted on 01/30/2004 6:19:36 PM PST by gitmo


To: gitmo; Phil V.; bonesmccoy; Howlin; RadioAstronomer; Grampa Dave; NormsRevenge

Most interesting!

I never worked with embedded interrupt driven systems but did chase mainframe software bugs a bit.

They were tough issues to figure out!

2 posted on 01/30/2004 6:36:05 PM PST by Ernest_at_the_Beach (The terrorists and their supporters declared war on the United States - and war is what they got!!!!)


To: gitmo

the local replica of the robot wasn't repeating the problem and you can't slap a logic analyzer on a chip from the best part of half a billion miles away.

Ditto what you said. Deadlocks can be avoided by design, traced at runtime, and so on: not basic engineering, but hardly cutting edge either.
The thing that got me was the quote above. Timing issues are always harder to recreate, but the moment you notice a problem you can't recreate, you'd better at least keep your eyes and ears open, because it will reveal itself again.
I do have to give these guys credit: on their thin budget, each rover is a prototype, and their 'replica' is not an exact copy, otherwise their rover budget would be up to 50% more than it is now.

3 posted on 01/30/2004 6:37:59 PM PST by sixmil


To: sixmil

I agree. Particularly with interrupt-driven systems, you have to be extremely careful with the order you access resources.
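
A rough sketch of what "careful with the order" can mean in practice: give every lock a fixed rank and always take locks in rank order, whatever order the caller names them in. This is plain Python threading, not interrupt-level code, and `RankedLock`, `acquire_all`, `release_all` and `run_demo` are made-up names for illustration.

```python
import threading


class RankedLock:
    """A lock with a fixed rank; the rank defines the one allowed order."""

    def __init__(self, rank: int):
        self.rank = rank
        self.lock = threading.Lock()


def acquire_all(*locks):
    """Acquire the given locks in ascending rank, regardless of argument order."""
    ordered = sorted(locks, key=lambda ranked: ranked.rank)
    for ranked in ordered:
        ranked.lock.acquire()
    return ordered


def release_all(ordered):
    """Release in the reverse of the acquisition order."""
    for ranked in reversed(ordered):
        ranked.lock.release()


def run_demo(iterations: int = 1000) -> int:
    bus, radio = RankedLock(1), RankedLock(2)
    counter = {"value": 0}

    def worker(first, second):
        for _ in range(iterations):
            held = acquire_all(first, second)
            try:
                counter["value"] += 1   # protected by both locks
            finally:
                release_all(held)

    # The two workers name the locks in opposite orders, which would risk
    # deadlock if each simply took them as listed.
    threads = [
        threading.Thread(target=worker, args=(bus, radio)),
        threading.Thread(target=worker, args=(radio, bus)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    assert not any(t.is_alive() for t in threads), "workers did not finish"
    return counter["value"]
```

Because both workers end up taking the rank 1 lock before the rank 2 lock, neither can hold one while waiting forever for the other.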

4 posted on 01/30/2004 6:42:12 PM PST by gitmo (Who is John Galt?)


To: Ernest_at_the_Beach

I imagine that the Spirit Rover software story will eventually become as important a 'case study' as the Apollo 11 landing buffer-overload problem.

But this one is much, much more complex.

If they succeed in restoring full function (or nearly so), it will be quite an achievement, and a neat case study.

5 posted on 01/30/2004 6:51:48 PM PST by edwin hubble


To: Ernest_at_the_Beach

They have minds of their own sometimes.. I had my fill of 'em. (supporting and troubleshooting callcenter switches)

Pulling weeds is a helluva lot easier on the nerves, imo.

6 posted on 01/30/2004 7:12:59 PM PST by NormsRevenge (Semper Fi Mac ...... /~normsrevenge - FoR California Propositions/Initiatives info...)


To: gitmo

Here's what I think is the real lesson here: "As a report after the event said: 'A good lesson when you fly COTS [commercial off-the-shelf] stuff--make sure you know how it works.'"

This is where the open source software concept (especially for an OS) will likely save your butt. I don't use Linux (or one of the BSDs) yet, but they are clearly more attractive if you depend on underlying software to get a job done.

I'm not sure that any individual can understand more than n (n >= 100,000?) lines of code (especially in a real time control system environment), but having the capability to take the code and add diagnostic breakpoints/data logging has got to be a huge advantage. The other takeaway is therefore KISS.

7 posted on 01/30/2004 7:18:58 PM PST by Paladin2


To: NormsRevenge

supporting and troubleshooting callcenter switches

What kind of a computer is in that?

8 posted on 01/30/2004 7:19:00 PM PST by Ernest_at_the_Beach (The terrorists and their supporters declared war on the United States - and war is what they got!!!!)


To: edwin hubble

the Apollo 11 landing buffer-overload problem.

What was that?

9 posted on 01/30/2004 7:22:45 PM PST by nj_pilot


To: Ernest_at_the_Beach

Motorola 68xxx series.. we ran redundant systems with 2 of everything, millions of lines of code, VxWorks as the RTOS... it was great when it worked but when it blew or didn't cut to backup cleanly.. Ouch!! Core dumps are not always your friend.

10 posted on 01/30/2004 7:23:31 PM PST by NormsRevenge (Semper Fi Mac ...... /~normsrevenge - FoR California Propositions/Initiatives info...)


To: Ernest_at_the_Beach

U have mail

UNIX Uber Alles!

11 posted on 01/30/2004 7:29:39 PM PST by NormsRevenge (Semper Fi Mac ...... /~normsrevenge - FoR California Propositions/Initiatives info...)


To: nj_pilot

"the Apollo 11 landing buffer-overload problem. What was that?"

It is well worth seeing the NOVA show on this incident. It is a white-knuckle close call, involving a computer.

As Apollo 11 descended from orbit and headed to the landing spot in the Sea of Tranquility, the onboard system was designed to automatically read the speed and altitude and adjust the thruster rockets for a steady landing.

But the data was coming in faster than the input buffer on the computer could handle, and the computer raised overload alarms. (The on-board computer was smaller than today's TI-83 used in high school classes).

Neil Armstrong took over the controls manually. He maneuvered past a field of huge boulders and landed with only a few seconds of fuel left. (A few seconds from certain death).

So, part of it was a computer problem; part of it a bad LZ.

12 posted on 01/30/2004 8:51:07 PM PST by edwin hubble


To: NormsRevenge

Check your in basket.

13 posted on 01/30/2004 8:57:52 PM PST by Ernest_at_the_Beach (The terrorists and their supporters declared war on the United States - and war is what they got!!!!)


To: Ernest_at_the_Beach; gitmo; Phil V.; bonesmccoy; Howlin; Grampa Dave; NormsRevenge

They were tough issues to figure out!

Indeed. However, on many systems there is a hardened, ROM-coded "safe mode" that completely bypasses operational software in case of non-recoverable software errors.

14 posted on 01/30/2004 9:01:49 PM PST by RadioAstronomer


To: Ernest_at_the_Beach

They were tough issues to figure out!

The toughest problem I ever heard about was tracked down by code cowboy John Bell at Konica USA. Somewhere in the code, they had misspelled the letter "I."

The number "1" was typed in by mistake when the code was written. Proofreading the code and rereading during the debugging kept passing over the problem until it finally fell out. I can sympathize.

I was taking an accounting exam once where everything was written by hand. I made a number six but placed it a little too far down on the line so it looked enough like a "1" followed by a zero, that it took forever to find.

Successful debugging produces an endorphin cascade rivaling sex.

15 posted on 01/30/2004 9:38:34 PM PST by gcruse (http://gcruse.typepad.com/)


To: gcruse

Successful debugging produces an endorphin cascade rivaling sex.

I don't know about that!
But both feel pretty good!

16 posted on 01/30/2004 9:57:22 PM PST by Ernest_at_the_Beach (The terrorists and their supporters declared war on the United States - and war is what they got!!!!)


To: Ernest_at_the_Beach

I don't know about that!

Ask the guy who fixed Spirit. When the problem cropped up, you can bet his whole career passed in front of his eyes. Now he's a hero. Comparing that to sex with my ex-wife...well, like I said, successful debugging packs an endorphin punch.

17 posted on 01/30/2004 10:19:37 PM PST by gcruse (http://gcruse.typepad.com/)


To: gcruse

Successful debugging produces an endorphin cascade rivaling sex.

Only time I felt good after successful debugging was when I found an obscure bug in the Cyrix x86 clone (non-aligned DWORD push across page boundary with lower page not present). Most of the time you find it's your own bug then sit around wondering what else you f***** up.

I love the product cycle. You work out all the known bugs, start feeling cocky about your code, chant "ship it, ship it". Your product manager says ok, you sober up from the RTM party, then nervousness increases as the time for retail availability gets closer. On the day of retail you wait for the phone to ring, check for disaster headlines on PCWeek/Infoworld/etc. Even if your part doesn't cause a s***storm, the old saying "I'd rather be lucky than good" keeps coming to mind.

18 posted on 01/30/2004 10:29:10 PM PST by mikegi


To: mikegi

LOL I know the feeling.

19 posted on 01/31/2004 8:28:49 AM PST by gcruse (http://gcruse.typepad.com/)


