Posted on 04/02/2020 9:41:53 AM PDT by dayglored
US air safety bods call it 'potentially catastrophic' if reboot directive not implemented
The US Federal Aviation Administration has ordered Boeing 787 operators to switch their aircraft off and on every 51 days to prevent what it called "several potentially catastrophic failure scenarios" including the crashing of onboard network switches.
The airworthiness directive, due to be enforced from later this month, orders airlines to power-cycle their B787s before the aircraft reaches the specified days of continuous power-on operation.
The power cycling is needed to prevent stale data from populating the aircraft's systems, a problem that has occurred on different 787 systems in the past.
According to the directive itself, if the aircraft is powered on for more than 51 days this can lead to "display of misleading data" to the pilots, with that data including airspeed, attitude, altitude and engine operating indications. On top of all that, the stall warning horn and overspeed horn also stop working.
This alarming-sounding situation comes about because, for reasons the directive did not go into, the 787's common core system (CCS), at heart an Intel Wind River VxWorks realtime OS product, stops filtering out stale data from key flight control displays. That stale data-monitoring function going down in turn "could lead to undetected or unannunciated loss of common data network (CDN) message age validation, combined with a CDN switch failure".
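To make "message age validation" concrete, here is a minimal sketch of the general idea: a receiver compares each message's timestamp with the current time and refuses to display anything too old. This is not Boeing's implementation. The `Message` type, the `is_fresh` and `display_value` helpers and the 200 ms limit are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical freshness limit; not a real 787 or ARINC 664 value.
MAX_AGE_MS = 200


@dataclass
class Message:
    name: str        # e.g. "airspeed"
    value: float     # the reading carried by the message
    sent_at_ms: int  # sender timestamp, in milliseconds


def is_fresh(msg, now_ms, max_age_ms=MAX_AGE_MS):
    """Return True if the message is no older than max_age_ms."""
    age = now_ms - msg.sent_at_ms
    # A negative age means the timestamp lies in the future: also reject.
    return 0 <= age <= max_age_ms


def display_value(msg, now_ms):
    """Return the value to show, or None so the display can flag it invalid."""
    return msg.value if is_fresh(msg, now_ms) else None
```

If a check like this silently stops running, old readings keep being shown as if they were current, which is the kind of "misleading data" the directive describes.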
Solving the problem is simple: power the aircraft down completely before reaching 51 days. It is usual for commercial airliners to spend weeks or more continuously powered on as crews change at airports, or ground power is plugged in overnight while cleaners and maintainers do their thing.
The CDN is a Boeing avionics term for the 787's internal Ethernet-based network. It is built to a slightly more stringent aviation-specific standard than common-or-garden Ethernet, that standard being called ARINC 664.
Airline pilots were sanguine about the implications of the failures when El Reg asked a handful about the directive. One told us: "Loss of airspeed data combined with engine instrument malfunctions isn't unheard of," adding that there wasn't really enough information in the doc to decide whether or not the described failure would be truly catastrophic. Besides, he said, the backup speed and attitude instruments are for obvious reasons completely separate from the main displays.
Another mused that loss of engine indications would make it harder to adopt the fallback drill of setting a known pitch and engine power setting that guarantees safe straight-and-level flight while the pilots consult checklists and manuals to find a fix.
A third commented, tongue firmly in cheek: "Anything like that with the aircraft is unhealthy!"
A previous software bug forced airlines to power down their 787s every 248 days for fear that electrical generators could shut down in flight.
Airbus suffers from similar issues with its A350, with a relatively recent but since-patched bug forcing power cycles every 149 hours.
Persistent or unfiltered stale data is a known 787 problem. In 2014 a Japan Airlines 787 caught fire because of the (entirely separate, and since fixed) lithium-ion battery problem. Investigators realised the black boxes had been recording false information, hampering their task, because they were falsely accepting stale old data as up-to-the-second real inputs.
More seriously, another 787 stale data problem in years gone by saw superseded backup flight plans persisting in standby navigation computers, and activating occasionally. Activation caused the autopilot to wrongly decide it was halfway through flying a previous journey and manoeuvre to regain the "correct" flight path. Another symptom was for the flight management system to simply go blank and freeze, triggered by selection of a standard arrival path (STAR) with exactly 14 waypoints such as the BIMPA 4U approach to Poland's rather busy Warsaw Airport. The Polish air safety regulator published this mildly alarming finding in 2016 [2-page PDF, in Polish].
This was fixed through a software update, as the US Federal Aviation Administration reiterated last year. In addition, Warsaw's BIMPA 4U approach has since been superseded.
The Register asked Boeing to comment. ®
Are they using ANY Gate$ software?
The I.T. Crowd - great comedy!!
“Have you tried turning it off and on again??!?!?”
https://www.youtube.com/watch?v=nn2FB1P_Mn8
0118 999 881 99 9119 725....3.
Boeing will soon field a team of lesbian-Muslim-multiple abortion-African American studies majors from India to rewrite the codebase with JavaScript and nodeJS ...
Well that’s computers for ya.
Just to be safe, do it every 50 days.
“No kidding. I have datacenter network gear that stays up without a reboot for years. The only time it gets restarted is after applying a required security patch.”
You wait years to put in security patches?
It’s McDonnell Douglas’ fault. They introduced beancounters and effed up Boeing’s focus on quality engineering. Beware mergers.
I worked on this exact system so I can shed a little light. The ARINC 664 protocol was developed 10+ years ago to be ‘deterministic ethernet’. The short version is that with ‘normal’ ethernet you can’t 100% guarantee that data will not be ‘old’ when it gets where it is going. So a fancy ethernet flavor was cooked up that, when used right, allows you to be 100% sure that when you get your air speed data over the network it will not be ‘stale’. What this article is saying is the system was not designed or tested to be powered up that long. I can say from experience (I worked on that project) that we did not have a requirement to be powered up continuously for more than 51 days. There are actually a lot of systems on other aircraft that expect to be periodically power cycled. So it is an embarrassing glitch when exposed in a hit piece article but it is pretty understandable why the ‘bug’ occurs. I am not familiar with the details of the specific ‘bug’ and I have not worked for that company for many years.
It sounds like a moronic Windows thing. Thank you, Gates.
“sounds similar to the Y2K problem except it happens after every 51 days … i.e., counter overflows and similar extremely bad coding …”
Yes, that was my first thought as well. 51 days sounds about like the time span when a 32-bit millisecond tick counter overflows. Looks like many idiot programmers still have problems handling overflow and modulo arithmetic correctly. Maybe that is what happens if they spend more and more time learning gender studies instead of math in college…
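For anyone who wants to check the arithmetic behind the counter guess, here is a rough sketch. It assumes an unsigned 32-bit counter ticking once per millisecond, which is the commenter's hypothesis and not something the directive confirms. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
WRAP = 2 ** 32  # number of distinct values of an unsigned 32-bit counter


def days_until_wrap(bits=32, ticks_per_second=1000):
    """Days of continuous uptime before a counter of this width wraps."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


def naive_elapsed(now_tick, then_tick):
    """Plain subtraction: goes negative once the counter has wrapped."""
    return now_tick - then_tick


def wrap_safe_elapsed(now_tick, then_tick):
    """Modulo subtraction: correct as long as under 2**32 ticks have passed."""
    return (now_tick - then_tick) % WRAP


# A reading taken 10 ticks before the wrap, compared with one 5 ticks after.
before, after = WRAP - 10, 5
print(round(days_until_wrap(), 2))       # days until a 32-bit ms counter wraps
print(naive_elapsed(after, before))      # large negative number
print(wrap_safe_elapsed(after, before))  # 15
```

A 32-bit millisecond counter wraps after roughly 49.7 days, which is near the 51-day figure but not equal to it, so the guess is plausible without being proven. The directive itself gives no cause.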
Haha...that was great :)
Great show - wish it were still going...
Well, without throwing any stones at you or your team personally, I gotta say as a computer systems engineer since (oh, god...) 1980 or so, any system that relies on power cycling to clear stale state information is poorly designed and/or poorly implemented. Whether it's memory leaks, system table size crashes, disk space filling up, or whatever, those conditions are symptoms of a problem that should be addressed by fixing the design, the implementation, or both.
In my opinion, any computer worthy of the "reliable" descriptor should be capable of running continuously forever, modulo patches to correct stability or security flaws that require a reboot.
Again, nothing personal. But assuming what you're saying is true, I find it disturbing.
No, but in general, better gear only gets patches released at very long intervals, because they did a good job prior to shipment, and most of the found problems got patched early on.
IMO, any critical system that still requires frequent system patches after a few years wasn't ready for prime time when it was released. After a couple years of operation, patches should only reflect things like changes in protocols (like dropping TLS 1.0/1.1 and requiring TLS 1.2), or patching for a vuln in a supported application.
Somebody at Boeing neglected to include a trash collection routine in the software. Or the H-1B programmers interpreted the requirement as something to do with trash cans in the lavatories.
I told them they shouldn’t use XP.
I saw mention of this bug quite some time ago. It has been known for a while now. Crappy software.
Windows?
Future Edit: Do not perform this procedure during flight.