Maybe I'm just a forgiving kind of guy, but I could see how this could happen. The operating system in use here is a fairly old one with a good track record. It's probably one of the last places they expected to have an issue. They have this huge electro-mechanical machine full of moving parts and bleeding-edge technology for imaging, telecommunications, and locomotion on uncertain surfaces. If you're drawing up the testing budget, how much time do you allocate to debugging an OS that's been in service in ten million little widgets around the world for a decade?
The thing is, embedded systems do not often have the problem of managing lots of "files." That's not really what they are about. Embedded-system OSes are about compact size, interrupt latency, and juggling lots of little tasks. "File management" is what IT shops do; it hardly ever comes up on the factory floor or in a heart-lung machine.
So, they got surprised. It happens to everybody. People who haven't made any mistakes haven't tried enough new things.
I would love to have been on the team that figured out what this was. When it finally dawned on somebody what was causing this, they must have whooped for joy, because this is an easy fix and the $400 million machine is safe. There are a lot of ways this could have turned out worse.
The underlying problem here is that they didn't do a realistic loading test. They obviously demonstrated the basic capabilities, but they never ran enough uploads to trigger the failure that showed up in the field.
This is a systems engineering/test planning problem -- probably nobody even thought of doing such a test.
In hindsight we can point fingers, but of all the problems that could have happened (which would have driven the Failure Modes and Effects Analysis that in turn drove the test plan), I can really understand them not having thought of it.
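Just to make the lesson concrete, here's a toy sketch of the kind of soak test that would have caught this. Everything in it is invented for illustration -- this is not the actual file system or flight software -- but it models one plausible bug class: deleting a file frees the data while quietly leaking its directory entry, so metadata grows until a fixed table fills up.

```python
class TinyFlashFS:
    """Toy flash file system with a fixed-size directory table (hypothetical)."""
    MAX_DIR_ENTRIES = 256  # invented limit for illustration

    def __init__(self):
        self.live = {}            # name -> data for files still present
        self.dir_slots_used = 0   # latent bug: never decremented on delete

    def create(self, name, data):
        if self.dir_slots_used >= self.MAX_DIR_ENTRIES:
            raise OSError("directory table full")
        self.live[name] = data
        self.dir_slots_used += 1

    def delete(self, name):
        del self.live[name]
        # Bug: the directory slot is not reclaimed here.

def soak_test(fs, uploads, keep=10):
    """Upload files continuously, deleting all but the newest `keep`.
    Returns the number of uploads completed before a failure."""
    completed = 0
    for i in range(uploads):
        try:
            fs.create(f"upload_{i}.dat", b"telemetry")
        except OSError:
            break
        completed += 1
        if i >= keep:
            fs.delete(f"upload_{i - keep}.dat")
    return completed

# A short "capability demo" looks perfectly healthy; only a long,
# mission-realistic run exposes the leak.
print(soak_test(TinyFlashFS(), 20))      # 20 -- passes
print(soak_test(TinyFlashFS(), 10_000))  # 256 -- dies at the table limit
```

The point isn't the specific bug; it's that the demo-length run and the soak run exercise the exact same code and only the latter fails, which is why the loading test has to be part of the plan.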
I think this will probably join the annals of "Learned the Hard Way" lessons, and folks will do that test from now on. I know I'm gonna use this as an example for the students in my class.