A Conversation with Mike Deliman [member of Mars Rovers' computer operating system team]

Association for Computing Machinery | October 2004 | GEORGE NEVILLE-NEIL and MIKE DELIMAN

Posted on 11/20/2004 7:59:06 PM PST by Mike Fieschko

And you think your operating system needs to be reliable.

Introductions

Mike Deliman was pretty busy last January when the Mars rover Spirit developed memory and communications problems shortly after landing on the Red Planet. He is a member of the team at Wind River Systems who created the operating system at the heart of the Mars rovers, and he was among those working nearly around the clock to discover and solve the problem that had mysteriously halted the mission on Mars.

Deliman serves as chief engineer of operating systems at Wind River Systems. After leaving the University of California at Santa Cruz, where he majored in computer and information sciences, he went to work for a Unix company and was introduced to VxWorks, Wind River’s realtime operating system, later adapted for use in the Mars rovers. “I was very impressed with that very early version of VxWorks,” he says. “As fate would have it,” he adds, “just a few years after starting that job, the company closed its San Jose offices, and I moved to Wind River.” He has worked with NASA’s Jet Propulsion Laboratory on various space projects ever since.

Discussing the role of software in space with Deliman is George Neville-Neil, who is also well acquainted with VxWorks. He developed a device-driver model for networking devices used in VxWorks, worked on a multi-instance version of the Berkeley TCP/IP stack, and ported open source networking code to VxWorks. He has worked in the embedded systems area for the past eight years, both as an integrator of final products and as an implementer of off-the-shelf embedded operating systems. His work has centered on the networking aspects of embedded systems, but he has also done general work on the broader aspects of the systems. Neville-Neil is currently working on a new, commercial, dynamic host configuration protocol (DHCP) server at Nominum. He also teaches seminars and classes.
GEORGE NEVILLE-NEIL How did you wind up working with NASA on its space projects?
MIKE DELIMAN In 1994 Wind River Systems was asked to port its operating system to a radiation-hardened processor based on the IBM Power chip, the 32-bit predecessor to the current PowerPC line. The Power chip was also called the RS6000; the rad-hard version was called Rad6000. I was lucky enough to be asked to help with the Wind River end of the software and became an expert with both the chip and the VxWorks port. Everyone else who had worked with it moved on, but I kept helping the NASA folks use the software in other space-based applications—DS1 (Deep Space 1), SeaWinds, SMEX (Small Explorer Project)-Lite, Genesis, Stardust, SORCE (Solar Radiation and Climate Experiment), Gravity Probe B, and several other deep space probes and satellites.
When the Mars Exploration Rover (MER) project started, I was called and asked who was left from the Pathfinder project that could work on MER. I was it.
GNN How does one “radiation-harden” a processor?
MD Radiation in space takes the form of high-energy particles—protons, electrons, etc.—moving at very high rates of speed, and thus carrying a lot of energy. When these subatomic particles hit something made of metal, they can induce transient charges on the metal. When they hit silicon, they are capable of “burning” holes right into the silicon.

To radiation-harden a processor, you sort of engineer the chip backward. Every year there’s a big push to squeeze more transistors into less silicon, and use smaller and smaller gold or copper vias (wires) over the troughs in the silicon that make up the transistors. The smaller these features are, the more susceptible to voltage transients they become. To make the chips more resilient to power surges that can be caused by protons or electrons, you make the troughs deeper and wider, and use bigger “wires.” To help protect the silicon from being burned through, the chips are encased in different kinds of ceramic shells that are thicker than they normally would be. A side effect of the bigger features inside a radiation-hardened chip is that it takes more electrical charge to operate normally, and/or its clock rate (the speed at which it runs) must be turned down to allow the necessary charges to build up.
GNN What is your role working with NASA?
MD For the MER project, 2001 through February 2004, I was the chief engineer of the operating system. I did extensions, modifications, bug fixes, investigations, and porting work (new compiler tools)—pretty much everything for the Wind River side of the project.
I’ve since left Wind River Systems and now work as a full-time employee at NASA’s Jet Propulsion Laboratory (JPL).
GNN Can you tell us a bit about your work at Wind River? How many other people at Wind River work on the software for NASA/JPL and how does that relationship work?

MD I was the only one at Wind River working on the Rad6000 software. I consulted other engineers for specific issues, but I was the only engineer responsible for the Rad6000 processor support on the Wind River side.
GNN What was your role during different phases of the mission (launch, transit, planetfall, etc.)?
MD In all phases I was the chief engineer—I acted as engineer, consultant, and the only technical support contact. This included responding while on vacation, even in remote areas (I had a laptop and a cellphone, and took them everywhere). The only difference is that while the mission was on the ground, I had a little more time to respond; once it was in flight, any problem encountered inherited new urgency.
Writing Code for Spacecraft

GNN Your primary focus for the MER project was the porting of VxWorks to the Rad6000, right? Did you also work on applications for MER? What were the typical support issues you ran into? Can you give us an example of a call you might have received from NASA at this phase?

MD My primary role was to update and maintain the software and extend it as needed by the MER team. In January, when the Spirit rover suffered from the file-systems anomaly, I was called in to help diagnose the problem, and as we understood more about it, to help characterize the exact nature and extent of the problem. In everyday terms, we had two sets of “buckets” to put data into: a very big bucket for long-term holding (a bank of flash memory), and a set of smaller buckets for temporary holding (blocks of RAM used to “cache” data until it could be moved to the flash bank.) The software that managed the set of smaller buckets was allowed to ask for more buckets as needed; eventually, the system just ran out of space to make more small buckets, and the whole process of managing buckets was shut down. This precipitated other problems, which led to the cycle of rebooting.
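The "bucket" description above can be sketched as a toy model. This is only an illustration under stated assumptions, not the rover's flight software: the class `BucketPool`, its `cap` parameter, and the helper `fill` are hypothetical names invented here. The sketch contrasts a pool that may keep asking for RAM until none is left with one that refuses requests once a configured limit is reached.

```python
class BucketPool:
    """Toy model of the small RAM 'buckets' that cache data bound for flash."""

    def __init__(self, ram_blocks, cap=None):
        self.free_ram = ram_blocks   # RAM blocks not yet handed out
        self.in_use = 0              # blocks currently serving as buckets
        self.cap = cap               # hypothetical limit; None means "grow freely"

    def request(self):
        """Return True if a bucket was granted, False if the cap refused it."""
        if self.cap is not None and self.in_use >= self.cap:
            return False             # caller must flush or delete data first
        if self.free_ram == 0:
            # Uncapped growth ends here: nothing is left to make a bucket from.
            raise MemoryError("RAM exhausted while growing the bucket pool")
        self.free_ram -= 1
        self.in_use += 1
        return True

    def release(self):
        """Give one bucket back, e.g. after its data has moved to flash."""
        if self.in_use == 0:
            raise ValueError("no bucket to release")
        self.in_use -= 1
        self.free_ram += 1


def fill(pool, requests):
    """Issue `requests` bucket requests and return how many were granted."""
    granted = 0
    for _ in range(requests):
        if pool.request():
            granted += 1
    return granted


if __name__ == "__main__":
    uncapped = BucketPool(ram_blocks=8)
    try:
        fill(uncapped, 9)
    except MemoryError as err:
        print("uncapped pool failed:", err)

    capped = BucketPool(ram_blocks=8, cap=4)
    print("capped pool granted", fill(capped, 9), "of 9 requests")
```

In the capped case the pool degrades by refusing work while RAM remains for the rest of the system; in the uncapped case the failure arrives only when memory is already gone, which is the shape of the problem the answer describes.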
GNN What makes writing code for spacecraft hard?
MD Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet: you can’t put your hands on the malfunctioning system to see what’s going on; you must use intuition and experience.
GNN How do you debug problems on the ground versus in space?
MD On the ground, you use all the tools you can put your hands on, including software debugging tools (WindView, shared memory dumps, software source-level debuggers, etc.). From off-planet, you mostly think about the problem and run tests with what you have in the lab to see if you can re-create the symptoms.

GNN Can you give us an example of a problem you had to debug for MER? How would you fix a problem—upload a patch, or a whole new version of everything (operating system, apps)?
MD In the case of the Spirit rover file-systems problem (bucket managing), the team at JPL realized this might be a possible problem. They did their best to address it, by sending up routines designed to clean out older files, freeing up space both in the long-term storage and in the set of smaller buckets. The day after sending up those routines, the team found out that not all of the routines had made it to the rover intact. The routines were to be sent up to the rover again on “Sol 19” (the 19th day of operation on Mars). Unfortunately, on Sol 18, the problem occurred.
What we did as a team was first to diagnose and characterize the problem as completely as possible, then test ways to detect and prevent it. We realized that some of the testing in the lab wasn’t exactly the same as what was happening on Mars. The team made a combination of changes based on the differences in environments and the work we did to prevent the problem from occurring again. These changes were tested in the lab, verified to provide relief from the problem, and then sent to both rovers (Spirit and Opportunity). The fix mostly affected applications. It should be noted that part of the problem was a configuration issue—the system is extremely complex and there are numerous items that can be configured to react in specific ways. All of the configured items worked exactly as they had been configured to.
GNN How do differences in space-based hardware affect what the software sees or does?
MD In the best theory, you will test what you write, and fly what you test. In reality, sometimes that may not be possible. Having said that, if you have two pieces of hardware that are identical except that one is hardened for space flight, the software should run identically on both pieces of hardware.

GNN Is hardening a spacecraft the same thing as hardening a processor, or are there additional steps?
MD Hardening a spacecraft is much easier than hardening a processor. The steps to harden a processor can take years, and it requires testing and reworking, iterated several times to create a processor that is both radiation-hardened and functional. This is a costly and time-consuming process. These are the reasons why rad-hard processors are so far behind consumer-grade processors. For instance, the current state-of-the-art rad-hard processor is a PPC750-based chip that runs at 130 megahertz, whereas you can buy consumer-grade PPC750s that run at well over 1,000 megahertz.
To harden a spacecraft, you just need to add more layers of “stuff” that makes it harder for the protons (etc.) to get into the “guts” of the craft. The problem with this approach is that each layer adds weight to the craft; more weight means you need more thrust to get it into orbit. More thrust means stronger rockets and more propellant, which in turn adds more weight. The cost goes up dramatically as weight is added.
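The weight spiral described above can be made concrete with the ideal (Tsiolkovsky) rocket equation, delta_v = v_e * ln(m_initial / m_final). The sketch below is illustrative only: the function name `propellant_mass`, the 9,400 m/s delta-v (a commonly quoted rough figure for reaching low Earth orbit), and the 3,000 m/s exhaust velocity are assumptions, not numbers from the interview, and the single-stage model ignores tanks, staging, and drag.

```python
import math


def propellant_mass(dry_mass_kg, delta_v_ms, exhaust_velocity_ms):
    """Propellant needed by an ideal single-stage rocket.

    Solves delta_v = v_e * ln((dry + propellant) / dry) for the propellant mass.
    """
    mass_ratio = math.exp(delta_v_ms / exhaust_velocity_ms)
    return dry_mass_kg * (mass_ratio - 1.0)


if __name__ == "__main__":
    # Assumed, illustrative values (not from the interview).
    DELTA_V = 9400.0   # m/s
    V_EXHAUST = 3000.0  # m/s

    per_kg = propellant_mass(1.0, DELTA_V, V_EXHAUST)
    print(f"propellant per kg of dry mass: {per_kg:.1f} kg")

    shielding = 10.0    # kg of extra shielding, hypothetical
    extra = propellant_mass(shielding, DELTA_V, V_EXHAUST)
    print(f"extra propellant for {shielding:.0f} kg of shielding: {extra:.0f} kg")
```

Under these assumed numbers each added kilogram of shielding calls for roughly 22 kg of propellant, before accounting for the larger tanks and engines that propellant would itself require.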
GNN What kinds of applications are placed on top of the operating system in a spacecraft?
MD In the case of the rovers, the applications were of three different natures. The first set of applications was designed to get the craft off of Earth and out to Mars; the second set to get the craft out of space and landed on Mars safely; the third set was how to be a robot geologist and accomplish the main goals of the project: looking for signs of water.

GNN How is application programming done for a spacecraft?
MD Much the same as for anything else—software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it’s sent off to do its job. In the case of satellites, you want to be extra cautious about designing and implementing software, and diligent with testing the software, to make it as robust as possible before launch. It is almost impossible to schedule an on-site visit once the craft is on its way.
[snip: PAGE THREE snipped]
Close Calls

GNN What scary problems have been caught on the ground, before the mission went into space?

MD There were many problems found on the ground, even at Wind River. I wouldn’t call any of them “scary,” though some were nontrivial.
Many years ago, while Stardust was still on the ground, a problem was found with the compiler tools. It was mishandling one of the registers, overwriting a value before storing what had been in the register. This had the potential to make the entire system into a really expensive random number generator—not what you want from your spacecraft. The tools were fixed, the entire set of releases currently in use was rebuilt with the fixed tools, and updates were sent to all the customers using the software at the time.
GNN What is the hardest problem you’ve had to debug during a mission?
MD Two problems: the Mars Pathfinder priority inversion problem and the MER file-systems anomaly. Both were solved by some of the most brilliant engineers I’ve had the pleasure to work with, working as teams. I think Glenn Reeves [Mars Pathfinder Flight Software Cognizant Engineer] does the best job of explaining what went wrong and how it was fixed [see http://research.microsoft.com/~mbj/Mars_Pathfinder/Authoritative_Account.html.] [Note: the Mike most often referred to in this online document is Mike Jones of Microsoft; Mike Deliman is from Wind River.]
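For readers unfamiliar with the term, priority inversion can be reproduced in a few lines. The sketch below is a generic textbook scenario, not a reconstruction of the Pathfinder software: the `Task` fields, the `simulate` function, and the three tasks' timings are all invented for illustration. A low-priority task holds a mutex that a high-priority task needs, and a medium-priority task that never touches the mutex ends up delaying both. Priority inheritance, the remedy usually associated with this class of bug, lets the mutex holder temporarily run at the priority of its most urgent waiter.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    arrival: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time the task needs
    needs_lock: bool   # True if all of its work happens under the shared mutex


def simulate(tasks, inheritance):
    """One-CPU, tick-based preemptive scheduler; returns finish tick per task."""
    remaining = {t.name: t.work for t in tasks}
    finish = {}
    holder = None      # task currently holding the mutex, if any
    tick = 0
    while len(finish) < len(tasks):
        ready = [t for t in tasks if t.arrival <= tick and t.name not in finish]
        # A task that wants the mutex while another task holds it is blocked.
        waiters = [t for t in ready
                   if t.needs_lock and holder is not None and t is not holder]
        runnable = [t for t in ready if t not in waiters]

        def effective(t):
            # With inheritance, the holder borrows its most urgent waiter's priority.
            if inheritance and t is holder and waiters:
                return max(t.priority, max(w.priority for w in waiters))
            return t.priority

        if runnable:
            current = max(runnable, key=effective)
            if current.needs_lock:
                holder = current               # acquire (or keep) the mutex
            remaining[current.name] -= 1
            if remaining[current.name] == 0:
                finish[current.name] = tick + 1
                if holder is current:
                    holder = None              # release the mutex on completion
        tick += 1
    return finish


if __name__ == "__main__":
    tasks = [
        Task("low", priority=1, arrival=0, work=3, needs_lock=True),
        Task("high", priority=3, arrival=1, work=1, needs_lock=True),
        Task("medium", priority=2, arrival=2, work=5, needs_lock=False),
    ]
    print("without inheritance:", simulate(tasks, inheritance=False))
    print("with inheritance:   ", simulate(tasks, inheritance=True))
```

In this scenario the high-priority task finishes at tick 9 without inheritance, after the medium task has run to completion, and at tick 4 with inheritance.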
GNN Obviously there was a huge news frenzy as one of the rovers became inoperable for a couple of weeks. What went wrong—and why did it take so long to fix? Was it one of those situations where once you figured out what the problem was, it was easy to fix? Or was it just a real difficult thing to fix, requiring lots of work?

MD It certainly wasn’t an easy problem to diagnose or rectify.
There were many aspects of diagnosing and addressing the problem with the Spirit rover, which occurred in mid-January. There were many possible problems: it could have been a power surge, radiation from space, an intermittent wiring short, thermal-related problems, mechanical failures precipitated from launch or landing, etc. We had the task before us of eliminating the least likely, characterizing the most likely, and paring down the list of possibilities into a manageable set of probable causes, and then exploring those causes. Once the problem was accurately diagnosed and characterized, it had to be simulated in the lab, and the remedy still had to be implemented and tested.
Though I wasn’t directly responsible for implementing or testing the remedy, I did assist the team as much as I possibly could. For me, the call to help came literally 20 minutes after Opportunity had landed on Mars. It required research into the source code, discussions with experts in three time zones: Japan, California, and Gusev (that’s where Spirit landed on Mars), and taking copies of the work wherever I went so I could access it. There were many days of long hours, working late into the night, creating and running tests. I worked through weekends, woke up three times a day to make the contacts I needed to make, took breaks only for meals, sleep, showers, and enough time to care for my dogs.
I know the rest of the team was just as dedicated and focused, and put in at least as much effort. We all had to do our best to handle the situation and juggle our own family requirements and personal difficulties. Throughout the effort, I had numerous distractions (other projects approaching deadlines, requests from the media, explanations and status updates to be sent to the management teams and executives, and a death in the family). I don’t expect any of the team had an easy time handling their parts of the mission. I am extremely proud of our achievement and very thankful to have received so much support from friends and coworkers who made it possible for me to make my contributions.
GNN Has the design of a spacecraft ever affected the operating system software? Do things you learn working with NASA/JPL wind up in the base operating system code?
MD Some of the things we fixed for various space missions did get folded back into the base package. Of note, with several support engineers at Wind River, we fixed some of the math routines for a space customer; the resulting routines were every bit as accurate as the IEEE 754 versions, and had better timing characteristics.

GNN Do you think NASA/JPL would switch to an open source operating system for a space mission?
MD I would not rule it out, but the fact is, when you’re dealing with a billion dollars worth of hardware and many man-years worth of effort, you tend to go with what you know will work. As an example, let’s look at the Rad6000 processor. It’s a 32-bit computer that runs at 20 megahertz and was the pinnacle of technology perhaps in 1990. It has a limited amount of RAM, and that RAM is pretty slow by today’s standards. It’s at best a relic compared with today’s processors that run hundreds of times faster.
Even though it’s an old design, the Rad6000 is very well understood, has been used in many successful space missions, and is still in use in several others. This history of success builds up a good reputation, which in turn translates into confidence. If you’re confident in the basic platform at the heart of your satellite, you can feel more confident about the satellite surviving to achieve its goals.

TOPICS: Miscellaneous; Technical
KEYWORDS: jpl; mars; roverspirit

These are pages one, two and four of the interview. I excerpted because of the length.

Page three deals with QA, differences between the mission's OS and other Wind River products, security concerns, etc.

1 posted on 11/20/2004 7:59:07 PM PST by Mike Fieschko

To: KevinDavis

ping

2 posted on 11/20/2004 8:02:11 PM PST by Mike Fieschko

To: Two Thirds Vote Aye

ping?

3 posted on 11/20/2004 8:06:30 PM PST by prairiebreeze (Ted Rall is a waste of perfectly good oxygen.)

To: prairiebreeze

Very Informative, I love this kind of stuff.

4 posted on 11/20/2004 8:25:09 PM PST by corbe

To: Mike Fieschko

"Radiation in space takes the form of high-energy particles—protons, electrons, etc.—moving at very high rates of speed, and thus carrying a lot of energy. When these subatomic particles hit something made of metal, they can induce transient charges on the metal. When they hit silicon, they are capable of “burning” holes right into the silicon. "

Could this be an example of why NASA is important to us? Couldn't hardening of this chip have given our scientists ways to harden chips for military against not just electro shock but also against high-energy particles from a nuclear attack?

5 posted on 11/20/2004 9:40:18 PM PST by JSteff

To: JSteff

Couldn't hardening of this chip have given our scientists ways to harden chips for military against not just electro shock but also against high-energy particles from a nuclear attack?

Being neither a physicist nor a materials engineer, I can't say anything more than 'I haven't the foggiest idea'.

6 posted on 11/20/2004 9:51:15 PM PST by Mike Fieschko

To: Mike Fieschko

Okay, never mind.

7 posted on 11/20/2004 10:09:13 PM PST by JSteff

To: Mike Fieschko

Mike Deliman is a gem, and may have returned to Wind River. I doubt that George would want to return to Wind River. I was a colleague. Seeing snatches of how they solved the problem is fascinating, but even more interesting is finding a way to test and discover such problems before the launch. You can have really smart people perform selflessly to save a mission from the naturally probable results of building complex systems. Or you can develop tools to address complexity, leaving the smart people to build more intelligence and experiments into the instrument.

In fact I believe the problem would have been revealed had NASA modeled and executed the system before building it. DOD is now modeling most big computer-based weapons systems. That is the trend in business systems too, as attested by IBM's acquisition of Rational Software, Borland's of Togethersoft, and Sun's work with Embarcadero. Wind River and its customers are understandably a little behind the curve, as few engineers working close to hardware design with "objects," even though the syntax of C++ or Java requires that they employ them. But no semiconductor designer would dream of not modeling logic before committing to hardware implementation.

For the curious, since this is not a computer science forum, see www.omg.org and explore model driven architectures and SysML, or look at I-Logix Rhapsody or Mentor's BridgePoint (among a half dozen commercial vendors).

8 posted on 11/20/2004 10:24:41 PM PST by Spaulding (Wagdadbythebay)

To: prairiebreeze

Thanks for the ping, prairie! Excellent!

9 posted on 11/21/2004 7:24:04 AM PST by Two Thirds Vote Aye (9/11/2001 - The final, ultimate, defining Legacy of william clinton.)
