> The Patriot missile battery at Dhahran had been in operation for 100 hours, by which time the system's internal clock had drifted by one-third of a second. Due to the missile's speed this was equivalent to a miss distance of 600 meters.
> The radar system had successfully detected the Scud and predicted where to look for it next. However, the timestamps of the two radar pulses being compared were converted to floating point differently: one correctly, the other introducing an error proportionate to the operation time so far (100 hours) caused by the truncation in a 24-bit fixed-point register. As a result, the difference between the pulses was wrong, so the system looked in the wrong part of the sky and found no target. With no target, the initial detection was assumed to be a spurious track and the missile was removed from the system. No interception was attempted, and the Scud impacted on a makeshift barracks in an Al Khobar warehouse, killing 28 soldiers, the first Americans to be killed from the Scuds that Iraq had launched against Saudi Arabia and Israel.
So such things happen in the military too (although this was in the early nineties). Source: https://en.m.wikipedia.org/wiki/MIM-104_Patriot
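For anyone curious how a clock can "drift" from number representation alone, here is a rough back-of-the-envelope reconstruction in Python, based on the published analysis rather than the actual Patriot code; the 0.1 s tick and 24-bit register come from the quote above, while the Scud closing speed is the commonly cited figure, not something from this thread:

```python
# Rough reconstruction of the Patriot clock error (not the actual code).
# Time was counted in 0.1 s ticks and stored in a 24-bit fixed-point register.
FRACTION_BITS = 24
SCALE = 1 << FRACTION_BITS

true_tick = 0.1
stored_tick = int(true_tick * SCALE) / SCALE   # truncated to 24 fractional bits
error_per_tick = true_tick - stored_tick       # ~9.5e-8 s lost every 0.1 s

hours = 100
ticks = hours * 3600 * 10                      # one tick per 0.1 s
clock_error = ticks * error_per_tick           # ~0.34 s after 100 hours

scud_speed_m_per_s = 1676                      # commonly cited closing speed
print(f"clock error: {clock_error:.3f} s")
print(f"range-gate error: {clock_error * scud_speed_m_per_s:.0f} m")  # ~575 m
```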
> I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customer's chief software engineer said "Of course it leaks". He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.
Kind of related. I always found it interesting that the Terminator's heads up display noted no casualties as "0.0" -- and I always wondered how a computer/killing-machine from the future might calculate casualties.
At first I was thinking this was a component like internet or in-flight entertainment. Then:
> This alarming-sounding situation comes about because, for reasons the directive did not go into, the 787's common core system (CCS) stops filtering out stale data from key flight control displays. That stale data-monitoring function going down in turn "could lead to undetected or unannunciated loss of common data network (CDN) message age validation, combined with a CDN switch failure".
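The directive doesn't describe the mechanism, but "message age validation" generally means the receiver checks how old each value is before displaying it. A purely illustrative sketch of that idea, with made-up names and threshold (nothing here comes from Boeing's documentation):

```python
from dataclasses import dataclass

STALE_THRESHOLD_MS = 200  # made-up threshold, purely illustrative

@dataclass
class Message:
    value: float
    age_ms: int   # how old the sender says this sample is

def latest_displayable(messages):
    """Return the freshest message that is still young enough to show.

    The failure mode the directive describes is this check silently not
    running, so stale values keep being displayed as if they were current.
    """
    fresh = [m for m in messages if m.age_ms <= STALE_THRESHOLD_MS]
    return min(fresh, key=lambda m: m.age_ms) if fresh else None
```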
Excuse me? Clearing the queue is not a "solution" if you have to remind a human to do it.
Why not? These are maintenance heavy machines. Another line item in the maintenance procedure isn't necessarily a show-stopper.
We have a tendency in software to think that the software should do literally everything, even with a user actively working against it. A lot of the time that is even more harmful.
Commercial and industrial software are very different things. Humans are an unreliable gate to safety and reliability, so if the gate to staying in the air is a human doing something whose results they can't see, it will inevitably fail.
And yet commercial airplanes, which have used this methodology for decades, are one of the most reliable mechanical systems humans have ever made. The mean time between failures (MTBF) for component failures that impacted a 787 flight is ~40,000 hours [1], or about 4.5 flight-years per component failure (not full system failure, component failure). After nearly 4,000,000 flight-hours there has not been a single full system failure or fatality in a 787, and this is only a fraction of the standard in aerospace so far. Between 2000 and 2010 there was ~1 death per 50 billion passenger-miles [2]. At an average speed of ~700 mph, that works out to 1 death per ~71 million passenger-hours, ~8,150 passenger-years, or ~125 airframe-years.
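Working the cited figures through (just arithmetic on the numbers above; the ~65 passengers per airframe is my assumption, used only to back out the airframe-years figure):

```python
passenger_miles_per_death = 50e9   # ~1 death per 50 billion passenger-miles [2]
avg_speed_mph = 700
hours_per_year = 8760

passenger_hours = passenger_miles_per_death / avg_speed_mph   # ~71 million
passenger_years = passenger_hours / hours_per_year            # ~8,150
airframe_years = passenger_years / 65   # ~125, if ~65 passengers aboard on average (assumption)

print(f"{passenger_hours:,.0f} passenger-hours per death")
print(f"{passenger_years:,.0f} passenger-years per death")
print(f"{airframe_years:,.0f} airframe-years per death")
```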
People who want to design reliable systems and processes look to airplanes for how to do that. That does not mean the same techniques are applicable or cost-effective in a different context, but at the very least the historical processes have been empirically demonstrated to produce extremely high reliability, far beyond what nearly every other industry attains, and a level that many industries, such as commercial software, do not even attempt to achieve and may not even think is possible in their environment.
Flight software is fabulously expensive to maintain and update, because it is safety critical. Furthermore, avionics programmers have a surprisingly captive audience. You would be surprised to learn how many problems are left to operator-based workarounds.
It's actually fairly depressing and is part of the reason I transitioned to consumer software.
This! One way to mitigate safety-critical issues is to pass them on to humans. This lowers how detailed you have to be when verifying the software, because you don't have to handle some of the critical aspects.
Think about how absolutely insane some assumptions would have to be for this to be a problem.
These are the kinds of things that separate commercial software from open source software: in many organizations, commercial programmers typically write to a spec, and the specs usually come from managers asking for specific, needed things without a proper understanding of the larger picture or a nuanced understanding of how those specific pieces fit into it.
Open source is usually written by people who take pride in what they write, so even though a 32-bit timer for millisecond events almost certainly won't still be running 50 days after you've started the program, they'd still consider, "What if it overflows? Should I handle that event, or should I use 64 bits instead?" instead of, "Not in the spec. Why should I care?"
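For concreteness, here's what the overflow question looks like in practice: a 32-bit millisecond counter wraps after about 49.7 days (2^32 ms), and whether that matters depends on how elapsed time is computed. A small sketch, with Python integers masked to stand in for fixed-width arithmetic:

```python
UINT32_MASK = 0xFFFF_FFFF   # 2**32 ms ≈ 49.7 days before the counter wraps

def elapsed_naive(now_ms, then_ms):
    # Breaks across the wrap: yields a huge negative number.
    return now_ms - then_ms

def elapsed_wrap_safe(now_ms, then_ms):
    # Unsigned-subtraction idiom: correct as long as the real interval
    # is shorter than one full wrap period.
    return (now_ms - then_ms) & UINT32_MASK

then = UINT32_MASK - 500                 # shortly before day ~49.7
now = (then + 1_000) & UINT32_MASK       # one second later, counter has wrapped

print(elapsed_naive(now, then))      # large negative number: nonsense
print(elapsed_wrap_safe(now, then))  # 1000: correct
```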
And with bad programming practices comes unmaintainability. You want the department who wrote that code several years ago to open it up again, fix some things, then give you a new version? Well, that's going to take months and hundreds of thousands of dollars because they're not set up like that.
It's a stupid game with stupid results, yet we have so many apologists who think that if companies with money do it, it must be right.
This post is filled with assumptions. In aviation, the specs generally come down from the OEMs like Boeing or Airbus, who have systems engineers who've done a lot of work creating the specs and defining the nuances. Granted, there are errors (and external bosses who push requirements), but when they are seen down the line by a supplier, they get raised and discussed with the systems engineers at the OEM. Generally, these specs are pretty tight and specific. For something like this, the OEM probably had the operating time set to something like 25 days (over 3 weeks of never turning the plane off once), which gives the timer plenty of room. However, if the customer then doesn't restart for more than 6 weeks, and that's outside of the OEM's specs for the plane, then you're going to have issues no matter what.
The actual larger cost is not the maintenance or updates but the verification. There are definitely issues, especially when it's a code base you've been maintaining for 15-20 years, but generally it's the verification that takes the most effort, since often you're not just verifying the new functionality but downstream functionality as well.
I'm sure this has been asked on HN before, but is there something intrinsic to complex systems and the need for a reset? That is, as a given system gets more complex in design and purpose, is it inevitable that beyond a certain point of complexity, some kind of a power reset is needed to get things to work correctly? Any interesting writings on the subject?
At one extreme end of the complexity spectrum, humans need sleep. At the other end, my pocket calculator likely doesn't need to be switched off and on to ensure that numbers add correctly. I guess complex operating systems sit closer to humans than to a calculator on this "spectrum". I do remember reading that the space shuttle's computer systems were close to perfect in design, but they're not operated as frequently as a 787.
> is there something intrinsic to complex systems and the need for a reset?
As the number of possible states of the system explodes, and as you layer levels of abstraction onto each other, the probability that you somehow reach a state the designers didn't think about increases. Then you need to reset.
Add to that the limited bits for representing numbers, and people not thinking about what to do when one overflows (because this is really, really hard in many cases), and you get to the "better reset it every n days" scenario.
In the late 90s into the early 2000s I worked at a large international ISP. During my first few years there, our network contained ATM switches, both terminating backbone circuits and serving as intra-PoP fabrics. For a period of time the ATM switches had to be rebooted every 45 days because otherwise they would perform an uncontrolled reboot.
Is your device running only formally verified software and has it been tested against every possible bit-flip or race or deadlock? Has anyone tested running the device for 51 days with a statistically sufficient sampling of configurations? Is it repeated after every update?
Fail safe and mitigate harm as much as possible. A power cycle is a cheap and easy way to mitigate leaks and overflows.
I suspect it's a sequence number that advances monotonically at a hardware-controlled tick rate and is used as a simple high-watermark for establishing recency and probably (subsequently) ordering. This thing is probably a 32-bit integer, or some slice of a 64-bit one, and after being online for ~51 days it nears the threshold of rolling over, which would then cause the receiver to never show new data because it thinks it is very, very old data.
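If that guess is right, the failure would look something like this sketch (entirely speculative; the counter width and names are mine): a receiver keeping a simple high-watermark rejects everything forever once the counter wraps, whereas circular, serial-number-style comparison (in the spirit of RFC 1982) keeps working across the wrap:

```python
SEQ_MODULUS = 2**32      # speculative: a 32-bit sequence counter
HALF = SEQ_MODULUS // 2

def newer_naive(seq, watermark):
    # Plain high-watermark: once the counter wraps back to 0, every new
    # message compares as "older" and is discarded from then on.
    return seq > watermark

def newer_serial(seq, watermark):
    # Serial-number arithmetic (RFC 1982 style): treats the counter as
    # circular, so values just past the wrap still count as newer.
    return seq != watermark and ((seq - watermark) % SEQ_MODULUS) < HALF

watermark = SEQ_MODULUS - 2            # just before the wrap
for seq in (SEQ_MODULUS - 1, 0, 1):    # the next few updates, wrapping to 0
    print(seq, newer_naive(seq, watermark), newer_serial(seq, watermark))
```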
As long as they don't do it in flight, OK. But I see no recommendation to do it strictly on the ground, so this could lead to several new crashes when operators strictly try to follow the rules.
The linked Airworthiness Directive says to follow the procedures in Boeing Alert Service Bulletin B787-81205-SB420045-00, Issue 002, which in turn says to follow the normal maintenance procedures in the 787 Aircraft Maintenance Manual (AMM), 24-22-00.
I don't have a copy of the 787 maintenance manual, but I have a feeling it covers that concern.