Boeing 787s must be turned off and on every 51 days (2020)

renw0rp · on May 11, 2021

Reminds me of patriot missiles:

> The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed this was equivalent to a miss distance of 600 meters.

> The radar system had successfully detected the Scud and predicted where to look for it next. However, the timestamps of the two radar pulses being compared were converted to floating point differently: one correctly, the other introducing an error proportionate to the operation time so far (100 hours) caused by the truncation in a 24-bit fixed-point register. As a result, the difference between the pulses was wrong, so the system looked in the wrong part of the sky and found no target. With no target, the initial detection was assumed to be a spurious track and the missile was removed from the system. No interception was attempted, and the Scud impacted on a makeshift barracks in an Al Khobar warehouse, killing 28 soldiers, the first Americans to be killed from the Scuds that Iraq had launched against Saudi Arabia and Israel.

So such things happen also in military (although this was early nineties)

Source: https://en.m.wikipedia.org/wiki/MIM-104_Patriot

35fbe7d3d5b9 · on May 11, 2021

Which then reminds me of a classic:

> I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customers chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits it's target or at the end of it's flight, the ultimate in garbage collection is performed without programmer intervention.

https://groups.google.com/forum/message/raw?msg=comp.lang.ad...

mousepilot · on May 12, 2021

that post contains a pretty serious error, the sig should start with two dashes and a space.

bellyfullofbac · on May 11, 2021

Ouch, imagine dying because of a rounding off error...

More detail in 1 link from the article: https://web.archive.org/web/20100702180720/http://mate.uprh....

interestica · on May 11, 2021

Kind of related. I always found it interesting that the Terminator's heads up display noted no casualties as "0.0" -- and I always wondered how a computer/killing-machine from the future might calculate casualties.

https://youtube.com/watch?v=oolP65DLQLY

heavenlyblue · on May 11, 2021

Taking a contrarian point of view: people die for far simpler reasons.

_ktx2 · on May 11, 2021

At first I was thinking this was a component like internet or in-flight entertainment. Then:

> This alarming-sounding situation comes about because, for reasons the directive did not go into, the 787's common core system (CCS) stops filtering out stale data from key flight control displays. That stale data-monitoring function going down in turn "could lead to undetected or unannunciated loss of common data network (CDN) message age validation, combined with a CDN switch failure".

Excuse me? Clearing the queue is not a "solution" if you have to remind a human to do it.

jsight · on May 11, 2021

Why not? These are maintenance heavy machines. Another line item in the maintenance procedure isn't necessarily a show-stopper.

We have a tendancy in software to think that the software should do literally everything even with a user actively working against it. A lot of times that is even more harmful.

_ktx2 · on May 11, 2021

Commercial and industrial software are very different things. Humans are an unreliable gate to safety and reliability, so if the gate to staying in the air is a human doing something that they can't see the results of it will inevitably fail.

Veserv · on May 11, 2021

And yet commercial airplanes, which have used this methodology for decades, are one of the most reliable mechanical systems humans have ever made. The mean-time-before-failure (MTBF) for component failures that impacted a 787 flight is ~40,000 hours [1] or about 4.5 flight-years per component failure, not full system failure, component failure. After nearly ~4,000,000 flight-hours there has not been a single full system failure or fatality yet in a 787 and this is only a fraction of the standard in aerospace so far. Between 2000 and 2010 there was ~1 death per 50 billion passenger-miles [2]. At an average speed of ~700 mph that would be 1 death per ~71 million passenger-hours or ~8150 passenger-years or ~125 airframe-years.

People who want to design reliable systems and processes look to airplanes for how to do that. That does not mean the same techniques are applicable or cost-effective in a different context, but at the very least the historical processes have been empirically demonstrated to produce extremely high reliability far beyond what nearly every other industry can attain and what many industries, such as commercial software, do not even attempt to achieve and may not even think is possible in their environment.

[1] https://www.sciencedirect.com/science/article/pii/S100093612...

[2] https://en.wikipedia.org/wiki/Transportation_safety_in_the_U...

jsight · on May 11, 2021

> Commercial and industrial software are very different things.

Exactly my point. Trust no humans is not a philosophy that applies in aviation software.

It shouldn't apply quite as much in some other software either.

osenthuortuh · on May 11, 2021

Flight software is fabulously expensive to maintain and update, because it is safety critical. Furthermore, avionics programmers have a surprisingly captive audience. You would be surprised to learn how many problems are left to operator-based workarounds.

It's actually fairly depressing and is part of the reason I transitioned to consumer software.

sheepybloke · on May 11, 2021

This! One way to mitigate safety critical issues is to pass it onto humans. This lowers how detailed you have to be when verifying the software, because you don't have to handle some critical aspects.

johnklos · on May 11, 2021

Think about how absolutely insane some assumptions would have to be for this to be a problem.

These are the kinds of things that separate commercial software from open source software: in many organizations, commercial programmers typically write to a spec, and the specs usually come managers asking for specific, needed things without both a proper understanding of the larger picture and a nuanced understanding about how those specific pieces intimately fit in to the larger picture.

Open source is usually written by people who take pride in what they write, so even though a 32 bit timer for millisecond events almost certainly won't still be running 50 days after you've started a program, they'd still consider, "What if it overflows? Should I handle that event, or should use 64 bits instead?" instead of, "Not in the spec. Why should I care?"

And with bad programming practices comes unmaintainability. You want the department who wrote that code several years ago to open it up again, fix some things, then give you a new version? Well, that's going to take months and hundreds of thousands of dollars because they're not set up like that.

It's a stupid game with stupid results, yet we have so many apologists who think that if companies with money do it, it must be right.

sheepybloke · on May 11, 2021

This post is filled with assumptions. In aviation, the specs generally come down from the OEMs like Boeing or Airbus, who have systems engineers who've done a lot of work creating the specs and defining the nuances. Granted, there are errors (and external bosses who push requirements), but when they are seen down the line by a supplier, they get raised and discussed with the systems engineers at the OEM. Generally, these specs are pretty tight and specific. For something like this, the OEM probably had the operating time set to something like 25 days (over 3 weeks of never turning the plane off once), which gives the timer plenty of room. However, if the customer then doesn't restart for more than 6 weeks, and that's outside of the OEM's specs for the plane, then you're going to have issues no matter what.

The actual larger cost is not the maintenance or updates but the verification. There are definitely issues, especially when it's a code base you've been maintaining for 15-20 years, but generally it's the verification that takes the most effort, since often you're not just verifying the new functionality but downstream functionality as well.

izacus · on May 11, 2021

Seems like a common issue :)

Airbus A350s had a similar type of bug where the fix was to reboot them at least every 149 hours: https://it.slashdot.org/story/19/07/25/1932229/airbus-a350-s...

data_acquired · on May 11, 2021

I'm sure this has been asked on HN before, but is there something intrinsic to complex systems and the need for a reset? That is, as a given system gets more complex in design and purpose, is it inevitable that beyond a certain point of complexity, some kind of a power reset is needed to get things to work correctly? Any interesting writings on the subject?

Towards the extreme ends of the spectrum of complexity, humans need sleep. On the other end, my pocket calculator likely doesn't need to be switched off and on to ensure that numbers add correctly. I guess complex operating systems sit closer to humans than a calculator on this "spectrum". I do remember reading that the space shuttle's computer systems were close to perfect in design, but they're not operated as frequently as a 787.

dirkt · on May 12, 2021

> is there something intrinsic to complex systems and the need for a reset?

As the number of possible states of the system explodes, and as you layer levels of abstraction onto each other, the probability that you somehow reach a state the designers didn't think about increases. Then you need to reset.

Add to that limited bits for representation of numbers, and people not thinking about what do to when one overflows (because this is really really hard in many cases), and you get to "better reset it every n days" scenario.

> Any interesting writings on the subject?

None that I know of.

marcodiego · on May 11, 2021

Very similar to https://www.cnet.com/news/windows-may-crash-after-49-7-days/

tssva · on May 11, 2021

In the late 90s into the early 2000s I worked at a large international ISP. During my first few years there our network containing ATM switches both terminating backbone circuits and as intra-PoP fabrics. For a period of time the ATM switches had to be rebooted every 45 days because otherwise they would perform an uncontrolled reboot.

s3r3nity · on May 11, 2021

> Airbus suffers from similar issues with its A350, with a relatively recent but since-patched bug forcing power cycles every 149 hours.

WTF? Is this standard operating procedure?

merricksb · on May 11, 2021

Previous discussion:

https://news.ycombinator.com/item?id=22761395

ChrisArchitect · on May 11, 2021

from a year ago when this was news, on old url, lots of discussion:

https://news.ycombinator.com/item?id=22761395

or even 16 hrs ago some pts and talk: https://news.ycombinator.com/item?id=27111650

stunt · on May 11, 2021

Hello IT, have you tried turning it off and on again? https://www.youtube.com/watch?v=5UT8RkSmN4k

killingtime74 · on May 11, 2021

I always thought aviation design by committee was designed to avoid these kind of issues...I was wrong

alkonaut · on May 11, 2021

This has been posted several times before and it’s not really a problem is the usual conclusion.

alexfoo · on May 11, 2021

It has to be submitted to HN at least once every 149 days.

throwaway0a5e · on May 11, 2021

1.49 days if Boeing has recently done something dumb and newsworthy.

anovikov · on May 11, 2021

almost reminds me of Windows 95 that had to be rebooted every 2^31 milliseconds = ~25 days.

drfuchs · on May 11, 2021

And nobody noticed for years, since the Win95 would typically crash way before then.

fassssst · on May 11, 2021

This should be standard operating procedure for any device with software on it.

jeffrallen · on May 11, 2021

Remind me not to ride in any elevators with you...

fassssst · on May 11, 2021

Is your device running only formally verified software and has it been tested against every possible bit-flip or race or deadlock? Has anyone tested running the device for 51 days with a statistically sufficient sampling of configurations? Is it repeated after every update?

Fail safe and mitigate harm as much as possible. A power cycle is cheap and easy way to mitigate leaks and overflows.

mousepilot · on May 12, 2021

so can you reboot during a flight? just asking, I'm a bit forgetful about things like that.

astrea · on May 11, 2021

Memory leak?

im_down_w_otp · on May 11, 2021

I suspect it's a sequence number that advances monotonically on at hardware controlled tick rate which is used as a simple high-watermark for establishing recency and probably (subsequently) ordering, and this thing is probably a 32bit integer or some slice of a 64bit one and after being online for ~51 days it nears the threshold of rolling over, which would then cause the receiver to never show new data because it thinks its very, very old data.

cellularmitosis · on May 11, 2021

> I suspect

That was my first thought as well. The Arduino millis() function wraps back around to zero at around 50 days of uptime. https://www.arduino.cc/reference/en/language/functions/time/...

CheezeIt · on May 11, 2021

That’s 49.7 days. So this must be something else.

rurban · on May 11, 2021

As long as they don't do it inflight, ok. But I see no recommendation to do it strictly onground. So this could lead to several new crashes, when the operators strictly try to follow the rules.

kube-system · on May 11, 2021

The linked Airworthiness Directive says to follow the procedures in Boeing Alert Service Bulletin B787-81205-SB420045-00, Issue 002, which subsequently says to follow the normal maintenance procedures in Aircraft Maintenance Manual (AMM) 787 AMM 24-22-00

I don't have a copy of the 787 maintenance manual, but I have a feeling it covers that concern.

waterba · on May 11, 2021

Has The Register gotten rid of its motto "We bite the hand that feeds IT"? I can no longer find it.

This article is still critical, but they are pandering to FAANG etc. (see for instance the recent Stallman coverage).

RankingMember · on May 11, 2021

Scroll to the bottom