Posted on 11/20/2004 7:59:06 PM PST by Mike Fieschko
And you think your operating system needs to be reliable.
Introductions
Mike Deliman was pretty busy last January when the Mars rover Spirit developed memory and communications problems shortly after landing on the Red Planet. He is a member of the team at Wind River Systems who created the operating system at the heart of the Mars rovers, and he was among those working nearly around the clock to discover and solve the problem that had mysteriously halted the mission on Mars.
Deliman serves as chief engineer of operating systems at Wind River Systems. After leaving the University of California at Santa Cruz, where he majored in computer and information sciences, he went to work for a Unix company and was introduced to VxWorks, Wind River's realtime operating system, later adapted for use in the Mars rovers. "I was very impressed with that very early version of VxWorks," he says. "As fate would have it," he adds, "just a few years after starting that job, the company closed its San Jose offices, and I moved to Wind River." He has worked with NASA's Jet Propulsion Laboratory on various space projects ever since.
Discussing the role of software in space with Deliman is George Neville-Neil, who is also well acquainted with VxWorks. He developed a device-driver model for networking devices used in VxWorks, worked on a multi-instance version of the Berkeley TCP/IP stack, and ported open source networking code to VxWorks. He has worked in the embedded systems area for the past eight years, both as an integrator of final products and as an implementer of off-the-shelf embedded operating systems. His work has centered on the networking aspects of embedded systems, but he has also done general work on the broader aspects of the systems. Neville-Neil is currently working on a new, commercial, dynamic host configuration protocol (DHCP) server at Nominum. He also teaches seminars and classes.
GEORGE NEVILLE-NEIL How did you wind up working with NASA on its space projects?
MIKE DELIMAN In 1994 Wind River Systems was asked to port its operating system to a radiation-hardened processor based on the IBM Power chip, the 32-bit predecessor to the current PowerPC line. The Power chip was also called the RS6000; the rad-hard version was called Rad6000. I was lucky enough to be asked to help with the Wind River end of the software and became an expert with both the chip and the VxWorks port. Everyone else who had worked with it moved on, but I kept helping the NASA folks use the software in other space-based applicationsDS1 (Deep Space 1), SeaWinds, SMEX (Small Explorer Project)-Lite, Genesis, Stardust, SORCE (Solar Radiation and Climate Experiment), Gravity Probe B, and several other deep space probes and satellites.
When the Mars Exploration Rover (MER) project started, I was called and asked who was left from the Pathfinder project that could work on MER. I was it.
GNN How does one radiation-harden a processor?
MD Radiation in space takes the form of high-energy particles (protons, electrons, etc.) moving at very high rates of speed, and thus carrying a lot of energy. When these subatomic particles hit something made of metal, they can induce transient charges on the metal. When they hit silicon, they are capable of burning holes right into the silicon.
To radiation-harden a processor, you sort of engineer the chip backward. Every year there's a big push to squeeze more transistors into less silicon, and use smaller and smaller gold or copper vias (wires) over the troughs in the silicon that make up the transistors. The smaller these features are, the more susceptible to voltage transients they become. To make the chips more resilient to power surges that can be caused by protons or electrons, you make the troughs deeper and wider, and use bigger wires. To help protect the silicon from being burned through, the chips are encased in different kinds of ceramic shells that are thicker than they normally would be. A side effect of the bigger features inside a radiation-hardened chip is that it takes more electrical charge to operate normally, and/or its clock rate (the speed at which it runs) must be turned down to allow the necessary charges to build up.
GNN What is your role working with NASA?
MD For the MER project, 2001 through February 2004, I was the chief engineer of the operating system. I did extensions, modifications, bug fixes, investigations, and porting work (new compiler tools): pretty much everything for the Wind River side of the project.
I've since left Wind River Systems and now work as a full-time employee at NASA's Jet Propulsion Laboratory (JPL).
GNN Can you tell us a bit about your work at Wind River? How many other people at Wind River work on the software for NASA/JPL and how does that relationship work?
MD I was the only one at Wind River working on the Rad6000 software. I consulted other engineers for specific issues, but I was the only engineer responsible for the Rad6000 processor support on the Wind River side.
GNN What was your role during different phases of the mission (launch, transit, planetfall, etc.)?
MD In all phases I was the chief engineer: I acted as engineer, consultant, and the only technical support contact. This included responding while on vacation, even in remote areas (I had a laptop and a cellphone, and took them everywhere). The only difference is that while the mission was on the ground, I had a little more time to respond; once it was in flight, any problem encountered inherited new urgency.
Writing Code for Spacecraft
GNN Your primary focus for the MER project was the porting of VxWorks to the Rad6000, right? Did you also work on applications for MER? What were the typical support issues you ran into? Can you give us an example of a call you might have received from NASA at this phase?
MD My primary role was to update and maintain the software and extend it as needed by the MER team. In January, when the Spirit rover suffered from the file-systems anomaly, I was called in to help diagnose the problem, and as we understood more about it, to help characterize the exact nature and extent of the problem. In everyday terms, we had two sets of buckets to put data into: a very big bucket for long-term holding (a bank of flash memory), and a set of smaller buckets for temporary holding (blocks of RAM used to cache data until it could be moved to the flash bank). The software that managed the set of smaller buckets was allowed to ask for more buckets as needed; eventually, the system just ran out of space to make more small buckets, and the whole process of managing buckets was shut down. This precipitated other problems, which led to the cycle of rebooting.
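The bucket picture can be sketched as a toy model. This is only an illustration under stated assumptions, not the flight software: the class name `BlockCache`, the `run` helper, and the pool size are all hypothetical. It shows how a cache that keeps asking for one more RAM block eventually exhausts a fixed pool unless cached data is moved out to long-term storage.

```python
class OutOfMemory(Exception):
    """Raised when the toy RAM pool cannot supply another block."""


class BlockCache:
    """Toy model of RAM blocks ("small buckets") staging data for flash."""

    def __init__(self, ram_budget_blocks):
        self.ram_budget_blocks = ram_budget_blocks  # hypothetical fixed pool size
        self.blocks = []  # data still held in RAM
        self.flash = []   # the "big bucket"

    def write(self, data):
        # The cache asks for one more block per write and has no cap of its own;
        # only the size of the RAM pool stops it.
        if len(self.blocks) >= self.ram_budget_blocks:
            raise OutOfMemory("no RAM left for another cache block")
        self.blocks.append(data)

    def flush(self):
        # Moving cached data to flash releases the RAM blocks.
        self.flash.extend(self.blocks)
        self.blocks.clear()


def run(cache, n_writes, flush_every=None):
    """Return how many writes completed before the pool ran out."""
    for i in range(n_writes):
        if flush_every and i and i % flush_every == 0:
            cache.flush()
        try:
            cache.write(f"record-{i}")
        except OutOfMemory:
            return i
    return n_writes


unbounded_growth = run(BlockCache(8), 100)               # never flushes
with_flushing = run(BlockCache(8), 100, flush_every=4)   # flushes regularly
```

In this sketch the first run stops as soon as the eight-block pool is full, while the second completes all 100 writes because blocks are released before the pool is exhausted.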
GNN What makes writing code for spacecraft hard?
MD Writing the code for spacecraft is no harder than for any other realtime life- or mission-critical application. The thing that is hard is debugging a problem from another planet: you can't put your hands on the malfunctioning system to see what's going on; you must use intuition and experience.
GNN How do you debug problems on the ground versus in space?
MD On the ground, you use all the tools you can put your hands on, including software debugging tools (WindView, shared memory dumps, software source-level debuggers, etc.). From off-planet, you mostly think about the problem and run tests with what you have in the lab to see if you can re-create the symptoms.
GNN Can you give us an example of a problem you had to debug for MER? How would you fix a problem: upload a patch, or a whole new version of everything (operating system, apps)?
MD In the case of the Spirit rover file-systems problem (bucket managing), the team at JPL realized this might be a problem. They did their best to address it by sending up routines designed to clean out older files, freeing up space both in the long-term storage and in the set of smaller buckets. The day after sending up those routines, the team found out that not all of the routines had made it to the rover intact. The routines were to be sent up to the rover again on Sol 19 (the 19th day of operation on Mars). Unfortunately, on Sol 18, the problem occurred.
What we did as a team was first to diagnose and characterize the problem as completely as possible, then test ways to detect and prevent it. We realized that some of the testing in the lab wasn't exactly the same as what was happening on Mars. The team made a combination of changes based on the differences in environments and the work we did to prevent the problem from occurring again. These changes were tested in the lab, verified to provide relief from the problem, and then sent to both rovers (Spirit and Opportunity). The fix mostly affected applications. It should be noted that part of the problem was a configuration issue: the system is extremely complex and there are numerous items that can be configured to react in specific ways. All of the configured items worked exactly as they had been configured to.
GNN How do differences in space-based hardware affect what the software sees or does?
MD In the best theory, you will test what you write, and fly what you test. In reality, sometimes that may not be possible. Having said that, if you have two pieces of hardware that are identical except that one is hardened for space flight, the software should run identically on both pieces of hardware.
GNN Is hardening a spacecraft the same thing as hardening a processor, or are there additional steps?
MD Hardening a spacecraft is much easier than hardening a processor. The steps to harden a processor can take years, and it requires testing and reworking, iterated several times to create a processor that is both radiation-hardened and functional. This is a costly and time-consuming process. These are the reasons why rad-hard processors are so far behind consumer-grade processors. For instance, the current state-of-the-art rad-hard processor is a PPC750-based chip that runs at 130 megahertz, whereas you can buy consumer-grade PPC750s that run at well over 1,000 megahertz.
To harden a spacecraft, you just need to add more layers of stuff that makes it harder for the protons (etc.) to get into the guts of the craft. The problem with this approach is that each layer adds weight to the craft; more weight means you need more thrust to get it into orbit. More thrust means stronger rockets and more propellant, which in turn adds more weight. The cost goes up dramatically as weight is added.
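The weight spiral described here is commonly quantified with the Tsiolkovsky rocket equation, delta_v = v_e * ln(m0 / mf). The sketch below is a rough single-stage illustration only: the function name is hypothetical, and the delta-v, exhaust velocity, and masses are made-up round numbers, not figures from any MER launch.

```python
import math


def propellant_needed(dry_mass_kg, delta_v_ms, exhaust_velocity_ms):
    """Propellant mass for one stage, from the Tsiolkovsky rocket equation.

    delta_v = v_e * ln(m0 / mf)  =>  m0 = mf * exp(delta_v / v_e)
    so propellant = m0 - mf = mf * (exp(delta_v / v_e) - 1).
    """
    mass_ratio = math.exp(delta_v_ms / exhaust_velocity_ms)
    return dry_mass_kg * (mass_ratio - 1.0)


# Illustrative values only (assumptions, not mission data).
DELTA_V_MS = 9_000.0
EXHAUST_VELOCITY_MS = 3_000.0

base = propellant_needed(1_000.0, DELTA_V_MS, EXHAUST_VELOCITY_MS)
shielded = propellant_needed(1_100.0, DELTA_V_MS, EXHAUST_VELOCITY_MS)  # +100 kg of shielding

# Extra propellant required for each added kilogram of shielding.
extra_propellant_per_kg = (shielded - base) / 100.0
```

With these made-up numbers, each extra kilogram of shielding costs about 19 kg of additional propellant, and that is before counting the larger tanks and structure needed to carry it.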
GNN What kinds of applications are placed on top of the operating system in a spacecraft?
MD In the case of the rovers, the applications were of three different natures. The first set of applications was designed to get the craft off of Earth and out to Mars; the second set to get the craft out of space and landed on Mars safely; the third set was how to be a robot geologist and accomplish the main goals of the project: looking for signs of water.
GNN How is application programming done for a spacecraft?
MD Much the same as for anything else: software requirements are written, with specifications and test plans, then the software is written and tested, problems are fixed, and eventually it's sent off to do its job. In the case of satellites, you want to be extra cautious about designing and implementing software, and diligent with testing the software, to make it as robust as possible before launch. It is almost impossible to schedule an on-site visit once the craft is on its way.
[snip: PAGE THREE snipped]
Close Calls
GNN What scary problems have been caught on the ground, before the mission went into space?
MD There were many problems found on the ground, even at Wind River. I wouldn't call any of them scary, though some were nontrivial.
Many years ago, while Stardust was still on the ground, a problem was found with the compiler tools. It was mishandling one of the registers, overwriting a value before storing what had been in the register. This had the potential to make the entire system into a really expensive random number generator, not what you want from your spacecraft. The tools were fixed, the entire set of releases currently in use was rebuilt with the fixed tools, and updates were sent to all the customers using the software at the time.
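The ordering mistake described here (overwriting a register before saving its old contents) can be shown with a toy model. This is not a reconstruction of the actual compiler bug: the register name `r13`, the scratch value, and the function are hypothetical, and a Python dictionary stands in for the register file.

```python
SCRATCH_VALUE = 0xDEAD  # arbitrary value the routine writes while working


def call_helper(regs, save_before_use):
    """Toy routine that borrows register r13 as scratch space, then restores it."""
    if save_before_use:
        saved = regs["r13"]            # correct order: store the old value first
        regs["r13"] = SCRATCH_VALUE    # then use the register as scratch
    else:
        regs["r13"] = SCRATCH_VALUE    # buggy order: the old value is overwritten...
        saved = regs["r13"]            # ...so the "save" captures the scratch value
    regs["r13"] = saved                # restore on exit
    return regs


good = call_helper({"r13": 42}, save_before_use=True)
bad = call_helper({"r13": 42}, save_before_use=False)
```

In the correct ordering the caller gets its value of 42 back; in the buggy ordering the caller's value is silently replaced, which is how code that looks correct at the source level can end up computing with garbage.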
GNN What is the hardest problem you've had to debug during a mission?
MD Two problems: the Mars Pathfinder priority inversion problem and the MER file-systems anomaly. Both were solved by some of the most brilliant engineers I've had the pleasure to work with, working as teams. I think Glenn Reeves [Mars Pathfinder Flight Software Cognizant Engineer] does the best job of explaining what went wrong and how it was fixed [see http://research.microsoft.com/~mbj/Mars_Pathfinder/Authoritative_Account.html]. [Note: the "Mike" most often referred to in this online document is Mike Jones of Microsoft; Mike Deliman is from Wind River.]
GNN Obviously there was a huge news frenzy as one of the rovers became inoperable for a couple of weeks. What went wrong, and why did it take so long to fix? Was it one of those situations where once you figured out what the problem was, it was easy to fix? Or was it just a really difficult thing to fix, requiring lots of work?
MD It certainly wasn't an easy problem to diagnose or rectify.
There were many aspects of diagnosing and addressing the problem with the Spirit rover, which occurred in mid-January. There were many possible problems: it could have been a power surge, radiation from space, an intermittent wiring short, thermal-related problems, mechanical failures precipitated from launch or landing, etc. We had the task before us of eliminating the least likely, characterizing the most likely, and paring down the list of possibilities into a manageable set of probable causes, and then exploring those causes. Once the problem was accurately diagnosed and characterized, it had to be simulated in the lab, and the remedy still had to be implemented and tested.
Though I wasn't directly responsible for implementing or testing the remedy, I did assist the team as much as I possibly could. For me, the call to help came literally 20 minutes after Opportunity had landed on Mars. It required research into the source code, discussions with experts in three time zones: Japan, California, and Gusev (that's where Spirit landed on Mars), and taking copies of the work wherever I went so I could access it. There were many days of long hours, working late into the night, creating and running tests. I worked through weekends, woke up three times a day to make the contacts I needed to make, took breaks only for meals, sleep, showers, and enough time to care for my dogs.
I know the rest of the team was just as dedicated and focused, and put in at least as much effort. We all had to do our best to handle the situation and juggle our own family requirements and personal difficulties. Throughout the effort, I had numerous distractions (other projects approaching deadlines, requests from the media, explanations and status updates to be sent to the management teams and executives, and a death in the family). I don't expect any of the team had an easy time handling their parts of the mission. I am extremely proud of our achievement and very thankful to have received so much support from friends and coworkers who made it possible for me to make my contributions.
GNN Has the design of a spacecraft ever affected the operating system software? Do things you learn working with NASA/JPL wind up in the base operating system code?
MD Some of the things we fixed for various space missions did get folded back into the base package. Of note, with several support engineers at Wind River, we fixed some of the math routines for a space customer; the resulting routines were every bit as accurate as the IEEE 754 versions, and had better timing characteristics.
GNN Do you think NASA/JPL would switch to an open source operating system for a space mission?
MD I would not rule it out, but the fact is, when you're dealing with a billion dollars' worth of hardware and many man-years' worth of effort, you tend to go with what you know will work. As an example, let's look at the Rad6000 processor. It's a 32-bit computer that runs at 20 megahertz and was the pinnacle of technology perhaps in 1990. It has a limited amount of RAM, and that RAM is pretty slow by today's standards. It's at best a relic compared with today's processors that run hundreds of times faster.
Even though it's an old design, the Rad6000 is very well understood, has been used in many successful space missions, and is still in use in several others. This history of success builds up a good reputation, which in turn translates into confidence. If you're confident in the basic platform at the heart of your satellite, you can feel more confident about the satellite surviving to achieve its goals.
ping
ping?
Very Informative, I love this kind of stuff.
Okay, never mind.
Thanks for the ping, prairie! Excellent!