The Patriot Missile Failure (umn.edu)
152 points by AndyBaker on March 24, 2014 | 109 comments



In 1993, while in college, I wrote a paper about the fallacies of the Patriot Missile effectiveness. What I learned in writing it was that the so-called accuracy of the Patriot was grossly exaggerated. The mainstream media, and indeed the country, needed some symbol to rally around and the Patriot Missile became that symbol. The battery's very name started the process.

The US touted a 95% accuracy.

http://www.slate.com/articles/news_and_politics/war_stories/...

In reality, a "success" didn't mean disarming the scud. It meant flying on a trajectory sufficiently close to the scud to have disabled it if all else worked as designed.


The success rate was hugely exaggerated. I witnessed first hand missiles coming into and near Israel.

Missile defence is a funny thing, especially when it comes to budgets and politics. There is often this cost/benefit of perfectly intercepting a missile/at all vs. just letting it hit. It sounds sick, but it really is a problem.

Right now in Israel, we have an issue with our kipat barzel system. Originally they said it was a stupid idea and too expensive, but it has really proven itself in the field. This brings several problems. First, people are dumb sometimes and think when it is deployed near them that they are perfectly safe. Interception rates for this system are also nowhere near 100%. Secondly, shooting what amounts to flying garbage with this thing is really expensive. You might as well burn money to keep warm in the winter.

What sets it apart from a lot of other systems like the Patriot is that it factors in all kinds of things, such as whether the projectile would impact a populated area or just explode in the middle of nowhere.

The same problems with people rallying around the Patriot are happening now. Luckily though it's being put into perspective and supplemented with systems that do other kinds of interception. At least the army realised that multiple kinds of interception are necessary and something like the Patriot will not be appropriate in every situation. These systems range from actually blowing up the targets to confusing them to throwing garbage at them to make them explode at the wrong time.


I forget the source of the quote, but it went something like this:

"The Scud isn't a military weapon, it's a terror weapon. And when you have an anti-terror weapon that the public perceives works, then it works."


The touted 95% accuracy was the primary reason US expats would sometimes leave their houses when the sirens sounded with their home cams to try to catch the Patriots intercepting incoming Scud missiles during the Gulf War.

I lived on the civilian airport during my youth in Riyadh during the Gulf War. The other expats thought the Americans were crazy no matter what the supposed accuracy.

It wasn't until after the war that we all realized the interception rates in the field were so much lower.

Pretty surreal.


I was in the Air Force in Riyadh during the Gulf War - living in "Eskan Village" - south west of town, and just north of a Patriot missile battery. Many US servicemen (including myself) went up on the roofs of their housing units to watch the Scud/Patriot fireworks every night -- I don't think anybody had any illusions that we were "protected" by the Scuds -- everybody knew the shrapnel would land somewhere, even if in slightly smaller bits -- people felt "safe" just because of the math of probability - the risk of you being hit by anything falling from the sky was quite low, given the size of Riyadh!

You could stand on the water tower on your roof and see lit cigarettes for blocks around (other guys up on their roofs watching the show). Ahh! Good times...


Seriously, thank you for your service.

>>I don't think anybody had any illusions that we were "protected" by the Scuds [...]

Maybe servicemen were in the know, but non-military US expats drank the kool-aid. Alternatively, maybe they were encouraged by the military personnel watching the fireworks and extrapolated their own conclusions from there.


Would have been sweet to connect back then and add your anecdote to my paper. I was a Computer Engineering major and this was the final paper for my only writing course. I learned a ton about the media on that project.


Well, it helped that the Scuds had a 5% accuracy rate.


The question then: is lying to people kinder than freaking them out with the truth?


This breaks down when your lie goes a full circle around the feedback loop and now someone above you is making the decision to pursue further a solution that was based on a lie.


I think that's a very interesting question. However, it is necessary to take into account the loss of credibility an exposed lie will produce, and how that will affect the efficacy of the technique in the future.


I wonder if you're in a position to comment on a couple of recollections I have from the first Iraq war (which I was aware of via TV news here in the UK, no direct involvement) - things that I've wondered about over the years?

The first: I remember a report on the UK's Channel 4 News which included some footage of Patriot missiles in flight. Now this is a long time ago, and I only saw the clip once, but what I recall is that shortly after launch, rather than a clean arc, one missile seemed to describe some crazy set of mid-air loops (which I immediately interpreted as a software error) before what seemed to be a sudden dive into the ground. The voice-over didn't refer to this apparent failure directly, but you could sense a certain awkwardness. Have you ever heard of any incident matching this?

The second: as you'll know, the Patriot wasn't designed for a direct hit on incoming missiles, but was intended to get within a critical radius and then explode, triggering an explosion in the incoming missile's warhead. My friend theorised that the Iraqis exploited this by repeatedly launching missiles without payloads, so that even successfully exploding Patriots had little effect other than to deplete the military budget by X hundred thousand bucks a throw.

Do either of these sound familiar or plausible? Things that I've wondered about for years!


Number 1 I can't offer any comment. I didn't delve too much into the programming of the missile though I have a similar recollection regarding the bug listed in the OP.

As to two, yes, I think that makes a lot of sense. One other "success" mode of the Patriot was to explode and turn the Scud into many fragments (de facto fragmentation bomb). These fragments in turn could (and in some cases did) reach or approximate their initial target and do massive damage in excess of a single Scud hit.


Thanks for the reply!


Thanks for yours. Some more data:

McPeak's data, drawn from a variety of coalition sources, indicates that between January 18 and February 26, 1991, 40 Scuds were launched against Israel and 46 against Saudi Arabia.

http://www.pbs.org/wgbh/pages/frontline/gulf/weapons/scud.ht...


> The second: as you'll know, the Patriot wasn't designed for a direct hit on incoming missiles, but was intended to get within a critical radius and then explode, triggering an explosion in the incoming missile's warhead. My friend theorised that the Iraqis exploited this by repeatedly launching missiles without payloads, so that even successfully exploding Patriots had little effect other than to deplete the military budget by X hundred thousand bucks a throw.

How much does a Scud warhead cost compared to the cost of the missile body, fuel, guidance system, etc? Scuds are certainly lower-tech than Patriots, but they're much larger and carry much more fuel, and Iraq had much less money to burn than the US. It seems unlikely to me.


There's an interesting overview of Hussein's SCUD deception efforts in a paper by USAF Col. Kipphut:

http://www.dtic.mil/dtic/tr/fulltext/u2/a468155.pdf


There should be some sort of penalty for re-inventing language for gain. This reminds me of the Columbia Accident Board Report, which pointed out that NASA & contractors redefined the meaning of "foreign object impact" to exclude foam from the fuel tank.


As an Israeli who lived through the Gulf War and served in the army, I can attest that the Patriot missile was a success and a massive failure at the same time. It was better than nothing, let me just say that.

The problem was that we knew it had issues, complained many times, but it was tied up in politics. We wanted to develop and deploy our own missile defence systems for a long time, but in many ways we were more or less blackmailed into spending the defence loans we receive to pay back the American defence establishment. The message was take what you are given and enrich private American companies, or else (btw for the haters, we must spend our "aid" with companies like Lockheed, Raytheon, Boeing, etc., it does not go to anything else at all, so really it's your tax dollars shilling your military industry).

Anyway, I saw first hand Patriot misses and the fear after that, especially regarding chemical weapons. A huge part of the country spent time with gas masks and plastic in safe rooms during the Gulf War. At the end of the war, we felt like our leaders failed to protect us sufficiently, especially when they knew there were issues.

The interesting outcome is this directly led to various missile programs including kipat barzel, arrow, spider, and others. Before, missile defence was a much harder sell, but the aftermath of the Patriot failures raised the case again that as a country we had to be more self-sufficient regardless of the cost. The other reason of course is that the Americans never really had an offering that addressed needs such as short-range, low flying projectiles, rockets, shells, multiple-target tracking, etc. Today, we have arguably the most advanced short-range and tactical missile defence systems.

All of these systems are built heavily on targeting/guidance and to run on cheap hardware that can fail massively. The interceptors and computers are not necessarily cost effective and super expensive, but much more practical. Additionally, redundancy in terms of overlap of targeting errors and misses is a heavy part of deployment. Resource-wise, it's not always possible, but I know first-hand it is a combination of various operational failures of the Patriot combined with years of relentless attacks by our enemies using anything from glorified flying garbage to old Soviet tech.


> The interesting outcome is this directly led to various missile programs including kipat barzel, arrow, spider, and others.

Good for you guys. One gets the sense that the US defense industry is breathtakingly inefficient, and overcomes this handicap only through a massive infusion of funding.


Part of the issue as well was that the Patriot was pressed into a role for which it wasn't designed. The surface to air mission has historically been a low priority for the U.S., and Patriot was the first real SAM system since Hawk to be designed from the start for the anti-aircraft role. In '91 it was suddenly pressed into the ballistic missile interception role, (which had been explored a little earlier in the program...but to say it was green is an understatement.)

The Gulf War foreshadowed the issues that snacktime so well described earlier: Cheap rockets, artillery, and missiles being used against urban targets would be, and will continue to be, a problem that requires countermeasures. But the army that was deployed in '91 was designed to fight a ground war in Europe, not revenge attacks and Scud potshots in the Middle East. It worked to assuage fears at the time, and that was good enough.

Don't think that those lessons were ignored, however. Systems like Arrow and Iron Dome were developed not just with Israeli ingenuity and necessity, but large amounts of direct funding from the U.S. DoD.


Or perhaps it's the massive infusion of funding that makes the US defense industry breathtakingly inefficient? :-)


The US defense industry is breathtakingly efficient at extracting money from the US government, which is, after all, its whole point.

The US military procurement system may be breathtakingly inefficient at procuring cost-effective equipment, but that's a different issue.


For most of the whiz-bang web apps being written today, this sort of thing doesn't matter.

Every so often, though, working on embedded devices or medical software or finance stuff, it becomes really important that you remember that lives depend--in a non-trivial and quite real way--on your code being correct and on the implemented algorithms fitting the problem.

Something that's very tricky isn't just understanding that the code does what it says it does, but that the code implements a solution that is properly modeled to the problem at hand.

EDIT:

My background is in mechanical engineering, and I never forget this quote by Dr. Dykes:

"Engineering is the art of modelling materials we do not wholly understand, into shapes we cannot precisely analyse so as to withstand forces we cannot properly assess, in such a way that the public has no reason to suspect the extent of our ignorance."

Software engineering in life-critical applications is serious business.


> Software engineering in life-critical applications is serious business.

Probably one of the most tragic cases of a SW bug:

http://en.wikipedia.org/wiki/Therac-25


Yep. And that's why you really, really want hardware failsafes in addition to whatever nonsense code you're writing.


True, but those are not always possible, nor can cover everything.

What struck me the most was the criminal incompetence of the developers, both of the total system as it moved to software control and especially the software itself. Not to mention the Crown Corporation's response to the problem.


While science fiction and not directly related, one of the Tom Clancy novels went into the science of missile intercept pretty extensively (it was a very important part of the book).

This book brought to my attention the mathematics and, for lack of a better term, difficulty of having a missile intercept system. Not only the mathematics but the raw limitations of physics that must be dealt with. In the case of the book it was ICBMs, which travel much faster, 7 km/s (15,700 mph)[1], compared to the 3749.11 mph of the Scud missile.

For anyone interested the book is Tom Clancy - The Bear and the Dragon.

[1] http://en.wikipedia.org/wiki/Missile_defense


This clip of a Sprint ABM test gives some idea of how incredibly fast ICBM RVs are - it makes the 0-Mach 10 in 5 seconds Sprint look like it is hardly moving:

https://www.youtube.com/watch?v=msXtgTVMcuA


As I remember (the source would be Jerry Pournelle, I think), the time from when ICBM RVs hit enough atmosphere that mass separates warheads from decoys (if your decoy weighs as much as a warhead, it's pointless) to when they hit is about 10 seconds.

Not a lot of time for a discrete point defense (as opposed to e.g. throwing a curtain of stuff up between you and the RVs).


I think that is why systems try to hit the missile on the way up before RV separation. The navy has missile systems, such as the TBMD section of the Aegis weapons system, that use US/ship based SM-3 missiles to hit a ballistic missile on its ascent. However, this means you have to detect the BM early and launch your missiles from somewhere in the ocean so that you could hit the target around its apex. They do this by sharing radar data between land and ship based radars, getting the SM-3 close to target, and then using a kinetic warhead to make contact with the BM.

I would think that the Patriot missile system uses a similar theory. It probably uses a radar to detect the Scud launch and then calculates the correct time to deploy a missile for intercept. The Patriot would either be flown close to the Scud using telemetry based systems on the Patriot, or by using mid course/terminal guidance from the Patriot launcher. Terminal guidance would fly the missile in front of the Scud and "Patriot go boom". If the terminal guidance is slightly off, you don't go boom in front of the Scud and the Scud hits its target.

In some fire control radars, the terminal guidance is constructed of data the radar receives from a missile downlink, where the missile reports its position to the launcher in order to use a more precise radar on the ground to direct the missile's path towards a target. I think it's more popular to use a combination of missile communications uplink/downlink and combine it with ground based radar data, to determine where the missile is, where the target is, and where the expected point of missile intercept should be.


Sprint was a last ditch system. It was supposed to intercept the warheads that leaked through the Spartan system, and at only about 100k ft. At that point, the decoys would have burned up. Of course both types of interceptor had nuclear warheads. 100 Sprints going off over Grand Forks wouldn't be very pretty... The real point was to defend the missile silos though, and they might have worked for that.


Enhanced radiation ("neutron bomb!!!") warheads for Sprint, emphasize the neutron flux, lessen the explosive yield.

Better way up there than the RV warheads detonating on the surface, although the successor proposed LOADS planned for a 75K foot intercept to among other things decrease EMP effects. And, hey, wouldn't all those bright lights in the sky look pretty ^_^?


My favorite project in school was writing a function that would calculate the angle and speed to shoot an interceptor given the location of the battery, the speed of the interceptor, and the location and speed of the incoming missile. If we didn't use the right mathematical methods, the calculation would take too long and the missile would cruise right by the small window.
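For anyone curious, the simplest version of that calculation has a closed form. This is only a minimal sketch, assuming a flat 2D plane, a target moving in a straight line at constant velocity, and an interceptor that flies straight at constant speed (none of which holds for a real ballistic intercept). The function name and argument layout are made up for illustration.

```python
import math

def intercept(battery, interceptor_speed, target_pos, target_vel):
    """Return (time_to_intercept, heading_in_radians), or None if no intercept exists.

    Solves |D + V*t| = s*t for the smallest positive t, where D is the target
    position relative to the battery, V its velocity and s the interceptor speed.
    """
    dx = target_pos[0] - battery[0]
    dy = target_pos[1] - battery[1]
    vx, vy = target_vel

    # Quadratic a*t^2 + b*t + c = 0 obtained by squaring both sides.
    a = vx * vx + vy * vy - interceptor_speed ** 2
    b = 2 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy

    if abs(a) < 1e-12:
        # Equal speeds: the equation degenerates to b*t + c = 0.
        if abs(b) < 1e-12:
            return None
        candidates = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            return None
        root = math.sqrt(disc)
        candidates = [(-b - root) / (2 * a), (-b + root) / (2 * a)]

    times = [t for t in candidates if t > 0]
    if not times:
        return None
    t = min(times)

    # Aim at where the target will be at time t, not where it is now.
    heading = math.atan2(dy + vy * t, dx + vx * t)
    return t, heading
```

A real system would have to handle gravity, drag and acceleration, which is where iterative numerical methods (and the time budget mentioned above) come in.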


Math and physics seem to be much more fun when discussed in military context. I suppose it's because, unlike pretty much anything else in life, military problems have at the same time both importance (if you fail, people die / equipment gets destroyed) and a time limit (you need to make your decision fast, otherwise you fail).


"3749.11 mph"

Six sig figs... impressive. I was under the impression the scuds were not quite built up to that level of precision.

This is to some extent meta-humor as the whole point of the story is the programmers involved needed an engineer helping them who understands sig figs and the effect of built up error tolerances.


The article listed the speed of a scud at "1,676 meters/second", "3749.11 mph" was simply a conversion from that unit to MPH.

I can only hope that for the sake of my post that my lack of attention to the increase of precision of the velocity of the Scud missile does not in some way invalidate my claim :P.

https://www.google.com/search?q=1%2C676+meters+per+second+to...


> The article listed the speed of a scud at "1,676 meters/second"

Which looks suspiciously like a rounded "3750 mph to meters per second" conversion. Indeed, "scud 3750 mph" turns up a lot of hits.
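The round trip is easy to check. A small sketch using the exact definition of the mile (1609.344 m), with helper names chosen here for illustration:

```python
# 1 mph in metres per second: 1609.344 m / 3600 s (exact by definition).
MPH_TO_MPS = 0.44704

def mph_to_mps(mph):
    return mph * MPH_TO_MPS

def mps_to_mph(mps):
    return mps / MPH_TO_MPS

# The figure quoted from the article, converted to mph.
print(round(mps_to_mph(1676), 2))

# A round 3750 mph converted back to metres per second.
print(round(mph_to_mps(3750), 1))
```

If 3750 mph was the original figure, it has at most three or four significant digits, so the extra decimals in either direction carry no information.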


I hated that book. But couldn't put it down. Looking back it's amazing what he got right, like the darkstar drone system for example. But even more amazing is how much he got wrong, that western media coverage of the war would be sufficient to overcome local propaganda and cause an overthrow of the Chinese government in a matter of days.


This makes no sense in the way it is stated - the calculations involved are independent of the up-time. You do not aim at a target differently depending on how long the system is up. The correct way to think about this is probably the following - they noticed that the up-time drifted and tried to improve on that, but they failed to do so in all places. In consequence different parts of the system used different times that also advanced with differing speeds and these inconsistencies caused the calculations to go wrong. As mentioned in the article, had they not made the "improvement" all parts would have used the same time and the errors would have canceled because the calculations are time-independent and the drift is probably too small to be important during the relatively short time a target approaches. This seems one of the rare cases where blaming math with limited precision is wrong.


Do you have additional information about this incident? Although TFA alludes to something like what you're saying with its "inaccuracies did not cancel" comment, your point seems to require an assumption that is not stated there. Aren't you assuming a non-distributed system, when it seems very likely that radar sensors and missile launchers would not be co-located? In a distributed system, we certainly cannot assume identical boot dates, and so an accumulating error would accumulate differently at different components of the system.

Systems like this must have time agreement among their different components. It seems that a system developed later would have just used GPS to prevent this problem, but I wonder why they didn't use NTP?


The obvious evidence is the laws of physics, which are invariant under time translation. What you are saying is exactly what I said - it does not matter if the drifting clocks are in different parts of a distributed system or if a single system uses different clocks. In the end the system fails because different clocks disagree on the current time, not because they drift away from the actual time. And the article states exactly this - they tried to fix the clock drift relative to the actual time but failed to do so in all components and thereby inadvertently introduced two clocks drifting relative to each other.


> In the end the system fails because different clocks disagree on the current time, not because they drift away from the actual time.

The possibility I mentioned was that the clocks disagreed because they had drifted away from actual time at similar rates, for different periods. [r ≠ 1 ∧ t₀ ≠ t₁] → [t₀ + (t-t₀)r ≠ t₁ + (t-t₁)r]. Unless specifically addressed, this phenomenon will occur in distributed systems.
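That implication can be checked numerically. A minimal sketch, assuming each clock is set to true time at its own boot and then advances at the same slightly wrong rate; the rate and boot times below are made-up values, and `local_reading` is a name invented for this example:

```python
def local_reading(t, boot_time, rate):
    """Reading at true time t of a clock that was set to true time at boot
    and has advanced `rate` local seconds per true second ever since."""
    return boot_time + (t - boot_time) * rate

rate = 1 - 1e-6          # hypothetical: loses one microsecond per second
t0 = 0.0                 # first component boots at true time 0
t1 = 50 * 3600.0         # second component boots 50 hours later
t = 100 * 3600.0         # compare both clocks at true time 100 hours

skew = local_reading(t, t0, rate) - local_reading(t, t1, rate)

# Algebraically the skew is (t0 - t1) * (1 - rate): nonzero whenever
# the rate is not exactly 1 and the boot times differ.
print(skew, (t0 - t1) * (1 - rate))
```

Note that the skew does not depend on t at all, only on the gap between boot times, so two such components disagree by a fixed offset rather than drifting further apart.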

That said, the link 'gibrown provided seems to establish firmly that this particular error resulted from times being up-converted (24bit fixed to 48bit floating) via different methods in different parts of the system.


Now I see what you mean, equal drift rates but different up-time. Assuming two clocks coming up at different points in time this will indeed cause time differences if they are initially synchronized with a third clock. If the clock coming up later synchronizes with the other clock already running there should be no issues though.

I read the article, too, and these two different ways of converting are more or less equivalent to two clocks starting in sync but drifting relative to each other. Therefore my initial comment was quite on the spot.


Yep, that's how I read the article, the target path prediction system was 0.34 seconds out of sync with the targeting system due to differing implementations.


The following scenario helps explain: you have two radars. One is a wide-angle, general radar, which sees a missile at some coordinates, travelling at some speed.

The other radar is for target acquisition, has a very narrow spread ("range gate"), and must determine the precise location of the missile given an initial set of coordinates, a velocity, and a timestamp.

In other words, the system is actually aiming at targets differently depending on how long it has been up, tracking so far in front (or behind, I don't recall the details) that it can't acquire the target.


You are describing a problem where two clocks drift relative to each other, not a problem where one clock drifts away from the actual time. On the other hand the article gives the impression the failure occurred because the system failed to exactly measure the up-time, not because clocks in different systems or system components drifted relative to each other.


That is not the impression I got from the article. No logic in the system cares how long it has been up, not directly at least. What matters is drift from its time reference, which is a function of uptime.

Various modules in a complex system like this each have their own clock, which I will refer to generically as a real-time binary counter (RTBC), which the module uses as its event time reference. The RTBC starts at 0 when the module comes up. At some point shortly after coming up the module will check in with its controller, which will send a time-of-day (TOD) message. The module links the TOD message to a particular RTBC tick to create its time reference. At this point the time is free to start drifting relative to the actual wall clock time, until the system is power cycled again.
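A rough sketch of that bookkeeping, with all names invented for illustration and the per-tick error assumed to be the one from a 0.1 s tick chopped to 23 fractional bits (this is not taken from any real Patriot code):

```python
from dataclasses import dataclass

@dataclass
class Module:
    tick_seconds: float   # what this module believes one counter tick lasts
    ref_tick: int = 0     # counter value when the TOD message was linked
    ref_tod: float = 0.0  # time of day carried by that message, in seconds

    def sync(self, current_tick, tod):
        """Link a time-of-day message to a particular counter tick."""
        self.ref_tick = current_tick
        self.ref_tod = tod

    def time_of_day(self, current_tick):
        """Estimate the time of day from ticks elapsed since the reference."""
        return self.ref_tod + (current_tick - self.ref_tick) * self.tick_seconds

TRUE_TICK = 0.1
believed_tick = 0.1 - 1 / 10485760   # assumed per-tick error, see lead-in

module = Module(tick_seconds=believed_tick)
module.sync(current_tick=50, tod=43200.0)   # synced shortly after coming up

elapsed_ticks = 100 * 3600 * 10             # 100 hours of 0.1 s ticks
true_tod = 43200.0 + elapsed_ticks * TRUE_TICK
drift = true_tod - module.time_of_day(50 + elapsed_ticks)
print(f"drift after 100 hours: {drift:.4f} s")
```

The drift is zero at the moment of sync and grows linearly with the number of ticks since then, which is why it is a function of uptime.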


That is exactly what I said - different clocks drifting relative to each other. It is completely irrelevant that their one tenth of a second was not exactly one tenth of a second, what matters is that different clocks in the system had different ideas of one tenth of a second.


You're applying a principle too broadly. Although the laws of physics don't change under a linear expansion of time, they are, for example, sensitive to linear expansion of velocities of missiles only: any non time-linear effect on the velocity is going to impact you -- for example, Reynolds numbers for air depend non-linearly on the velocity which may be varying with time. Sure, if you multiply the whole system you would have that compensated by the increase in temperature and pressure which a faster time reference would observe, but it's not simulating the universe, just a limited set of variables.

Also, for obvious reasons of consistency and precision it would be better to keep a standard reference regardless.


I do not think of the problem as scaling the time by a factor - although this is the correct description - but as adding a constant offset. I think this is justified because the small drift is not significant during the relatively brief period of time a target approaches. The offset builds up over time but only in the parts of the system that did not receive the improved algorithm and therefore these different parts disagree more and more on what the current time is.


> It is completely irrelevant that their one tenth of a second was not exactly one tenth of a second.

It's very relevant when the module that is off is trying to make telemetry calculations based on target Doppler velocity, which is given with real, ISO standard seconds. There is no clock involved in that. Diverging module clocks amplifies the problem.

Also, the ultimate reference is the true definition of a second. All modules are expected to be using it, as it is used to synchronize modules. It is the clock and at some level a clock that has a faulty definition will be drifting off another clock. Your distinction is irrelevant as far as real-time systems are concerned.


You are making a lot of assumptions about how time might be used, but I will ignore that because I have no clue if that is what really happens.

Let me repeat my point clearly. All clocks will drift away from the actual time. All the physics involved and measurements done are not dependent on the current time - they will work the same at 14:07 as they do at 23:51 and they will therefore also work the same when the clock of the system drifted away from the actual time and believes it is 12:34 while it is 12:35. Important is only that all parts of the system agree on what the current time is and that the clock does not drift at such a high rate that all measurements and calculations done during a brief period of time become invalid, i.e. the clock should not report that it took two seconds for the incoming missile to travel one kilometer while it took only one second.

And the article gave the impression - at least to me - that the failure was caused because the system believed to be up for 100 hours while it was up for 100 hours and 340 milliseconds longer due to an imperfect representation of one tenth of a second. This makes no sense and is not what caused the failure. The failure was caused - as detailed in the other linked article - because one part of the system believed to be up for 100 hours while another part of the system performed more precise time conversions and knew that it was up for 100 hours and 340 milliseconds and this time difference between two parts of the system caused the failure.

For example one part of the system may have decided that the missile should be launched at 12:00:00.000 and the system responsible for doing so did that according to its clock but because of the time difference it was at 12:00:00.340 according to the clock of the system that made the decision.


My interpretation of the article:

Time is kept as an integer, stepped ten times per second. This can be exactly represented as a float, so probably uses the same 24 bit register. For 100 hours this integer would be 3600000, which fits into 24 bits with some room to spare. (But it would give a max uptime of the system of about 466 hours.)

Wide arc radar notes location, velocity, and time from clock above. This output data is still good enough for pinpointing the next position with a precision of about 170 meters (the distance the scud travels in the 0.1 second step of the clock). The precision radar system probably had accounted for this, and had a wide enough beam to handle this case.

Now, when deciding where to point the next precision beam, the radar multiplies the stored time value (exactly 3600000) with 0.1 (which is not represented exactly, but instead is about 0.000000095 less than 0.1) and uses this computed value in further calculations. This floating point value is now 0.34 seconds less than expected. The precision radar, even though it uses the same clock as above, has an incorrect representation of when the last wide arc radar update took place, and this propagates to the prediction of where the Scud will be next (which is now off by 0.34 * 1676 ~= 570 meters). Thus, when it points the beam to where it believes the Scud will be, the Scud is outside the precision radar's cone.
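Those figures can be reproduced with exact rational arithmetic. A sketch, assuming 0.1 is chopped (not rounded) to 23 bits after the binary point, which is the layout that yields the 0.000000095 error quoted above; the constant names are mine:

```python
from fractions import Fraction

FRACTION_BITS = 23   # assumption: bits after the binary point in the register

# 0.1 chopped to FRACTION_BITS binary places, kept as an exact fraction.
stored_tenth = Fraction(2**FRACTION_BITS // 10, 2**FRACTION_BITS)
per_tick_error = Fraction(1, 10) - stored_tenth

ticks_in_100_hours = 100 * 3600 * 10      # the counter steps ten times a second
time_error = float(per_tick_error * ticks_in_100_hours)
miss_distance = time_error * 1676         # Scud speed in m/s, from the article

# How long a 24-bit tick counter lasts before it wraps.
max_uptime_hours = 2**24 / 10 / 3600

print(f"per-tick error : {float(per_tick_error):.3e} s")
print(f"after 100 hours: {time_error:.4f} s, about {miss_distance:.0f} m")
print(f"counter wraps after about {max_uptime_hours:.0f} hours")
```

With the unrounded time error the distance comes out a few metres above the 570 m figure, which uses 0.34 s.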

Note that both wide arc and precision beam systems have exact knowledge of the current system time at the point of their respective operations. What fails is precision beam's calculation of what wide arc's time reference actually meant.

The ironic part in the article probably refers to some computation using a delta, and if both time references ("then" and "now") have the error the delta will be small and possibly insignificant. However, if "now" is replaced with a more accurate representation of the clock above, only "then" has the big error, and the delta will be just as far off as the incorrect value above.
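A quick numerical illustration of that cancellation, again assuming the chopped 23-bit constant for the old conversion and a plain division by ten standing in for the "more accurate" one (both function names are made up):

```python
CHOPPED_TENTH = 838860 / 2**23   # assumed: 0.1 chopped to 23 fractional bits

def old_conversion(ticks):
    """Ticks to seconds using the chopped constant."""
    return ticks * CHOPPED_TENTH

def improved_conversion(ticks):
    """Ticks to seconds using a (near) exact tenth."""
    return ticks / 10

then_ticks = 3_600_000           # a radar update after 100 hours of uptime
now_ticks = then_ticks + 10      # exactly one second later

# Both timestamps carry the same accumulated error, so it cancels in the delta.
consistent_delta = old_conversion(now_ticks) - old_conversion(then_ticks)

# Only "then" carries the accumulated error, so the full 0.34 s shows up.
mixed_delta = improved_conversion(now_ticks) - old_conversion(then_ticks)

print(f"consistent: {consistent_delta:.6f} s, mixed: {mixed_delta:.6f} s")
```

The consistent delta is off by about a microsecond, while the mixed delta is off by about a third of a second.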

The exact error propagation depends on the order the calculations are performed in, and there's a whole field (numerical analysis) dedicated to controlling these errors. As developers we gladly ignore the problem even when we shouldn't.


> You are describing a problem where two clocks drift relative to each other, not a problem where one clock drifts away from the actual time

The latter is simply a specific case of the former. The second clock is the one measuring 'actual' time, and the drift is relative to it.


Of course, but the point was, that a drift relative to actual time does not cause problems as long as all clocks in the system drift at the same rate while the failure was caused by different clocks in the system drifting relative to each other and therefore with different rates relative to actual time. Therefore I treated actual time as a special clock.


A better description of the bug: http://www.ual.es/~plopez/docencia/itis/patriot.htm

It was related to limited precision, but was also related to using different calculation methods in different parts of the code.


> This makes no sense in the way it is stated - the calculations involved are independent of the up-time. You do not aim at a target differently depending on how long the system is up.

At some point or another your anti-SCUD computer is going to have to say "Send the electrical signal to launch this missile NOW!". I assume this is where they make use of the system clock.


Definitely, but it does not matter what the value of NOW is as long as all parts use the same time. Only their attempt to fix the drift caused different parts to have a different understanding of what NOW = 5484515 means, and that caused the problem.


My understanding is that they used a floating point value for time measurements, and they used time since boot as a reference. After a long enough time since boot, there were not enough significant digits available to both represent time since boot and accurately capture the difference in time between two events.

The first Ariane 5 launch also had a failure relating to floating point math, in that case an overflow when converting to an int16:

http://www.di.unito.it/~damiani/ariane5rep.html


The article states they use a 24 bit fixed point format.


Sure, but the point is that it did not have enough precision for both a large absolute value, and small time measurements, when based on time since boot.


No, this is not what happened. They converted the 24 bit fixed point value into a different format using different algorithms in different places of the system resulting in slightly different values. The failure was then caused by the difference between these values, i.e. different parts of the system did not agree on what the current time is.


What's interesting is that this is the kind of thing where a standard "boot-and-test" routine is likely to miss it (because the clock starts out in sync). Time-dependent bugs are always tricky.


I was an 18 year old grunt, fresh out of basic training, when I went to Iraq in 1991.

After about 48 hours of living on tarmac at the airport (where I experienced the first of many MOPP-4 chemical warnings: let me tell you how fun it is to have a gas mask, charcoal suit and rubber accoutrements while laying on tarmac, psychosomatically creating the nerve agent symptoms they just taught us), I moved to a high-rise apartment complex.

I was in a North-facing room, on the northernmost edge of the complex. Open desert as far as the eye can see to the North. I watched scuds get shot down (seems like every night, but memories are wont to be inaccurate).

Here's the thing: I hear all this talk about the Patriot missile being inaccurate, and I seem to remember something like "no patriot ever shot down a scud". That takes some serious parsing to arrive at-- because I saw patriots "hit" scuds, but of course they could have "exploded in the vicinity of" scuds, destroying them.

From what I hear, the Patriots were running on 100-mile-an-hour tape (you did know that the military has its own version of duct tape, right?!) and bubble gum, but I'm thankful for them nonetheless; you tend to take what defense you can when folks shoot at you.


Even if it didn't fail (numerically), there is a decent probability it still would have failed to intercept the scud:

http://en.wikipedia.org/wiki/MIM-104_Patriot#Success_rate_vs...

The U.S. Army claimed an initial success rate of 80% in Saudi Arabia and 50% in Israel. Those claims were eventually scaled back to 70% and 40%. However, when President George H. W. Bush traveled to Raytheon's Patriot manufacturing plant in Andover, Massachusetts, during the Gulf War, he declared, the "Patriot is 41 for 42: 42 Scuds engaged, 41 intercepted!"[28] The President's claimed success rate was thus over 97% during the war.


You left this part out though.

"Patriot PAC-3, GEM, and GEM+ missiles both had a very high success rate, intercepting Al-Samoud 2 and Ababil-100 tactical ballistic missiles.[17] However, no longer-range ballistic missiles were fired during that conflict. The systems were stationed in Kuwait and successfully destroyed a number of hostile surface-to-surface missiles using the new PAC-3 and guidance enhanced missiles"


Indeed. The real world performance of Patriot PAC-2, the first generation tweaked for BMD but using the same hardware except for increasing the size of the warhead's projectiles, says little about how later versions and generations (PAC-3 uses a dedicated missile, 1/4 the size and therefore 4 times as many in a launcher) also performed in the real world.

Since in this case the real world = actual instances of enemies shooting live missiles at you trying to kill you, it's been proven in some acid tests. And one reason Guam is sprouting Patriot and THAAD batteries.


It's like when, in the sixties, the army prepared a report trying to assess the risk of an accidental nuclear detonation. They concluded that since no such detonation had ever occurred, the risk was 0%.


The devops fix? Reboot every night. Works wonders. :/


Reboot three times each day; though the Army assumes you know that.

"Army officials said that they believed the Israeli experience was atypical--they assumed other Patriot users were not running their systems for 8 or more hours at a time. However, after analyzing the Israeli data and confirming some loss in targeting accuracy, the officials made a software change which compensated for the inaccurate time calculation. This change allowed for extended run times and was included in the modified software version that was released on February 16, 1991. However, Army officials did not use the Israeli data to determine how long the Patriot could operate before the inaccurate time calculation would render the system ineffective.

On February 21, 1991, the Patriot Project Office sent a message to Patriot users stating that very long run times could cause a shift in the range gate, resulting in the target being offset. The message also said a software change was being sent that would improve the system's targeting. However, the message did not specify what constitutes very long run times. According to Army officials, they presumed that the users would not continuously run the batteries for such extended periods of time that the Patriot would fail to track targets. Therefore, they did not think that more detailed guidance was required."

http://www.fas.org/spp/starwars/gao/im92026.htm


This is exactly what the Israelis figured out and told the Pentagon before this failed interception, but the information was not passed on to units in Saudi Arabia.


The information was not passed on because it was damaging to the military industrial complex interests at the time. Americans at that time in particular really felt infallible after the Cold War and did not accept feedback very well. The skunkworks/American military machine was starting to crack apart with laziness as things descended into contracting hell.

As I mentioned in my other comment, my perspective as an Israeli witnessing all this was, "Don't bite the hand that feeds you." It is a shame that Americans and others had to pay a price for arrogance and politics. It may not be the official story, but anyone exposed to that world could tell you otherwise.


They did however release a software patch within about 2 weeks of getting the report from Israel, but it took 10 days to make it to Saudi Arabia, one day late.


In time of war, minutes are an eternity. No reason these soldiers should have had to wait so long other than inefficiency.


That's an interesting thing about system design philosophy, right? If you just accept that there are unknown unknowns, and make the system easy to fix and bring back up, you can oftentimes do better than a perfectly designed architecture (which matches its model of reality well but not reality itself).

Similar engineering feats would be the inclusion of the bolt-assist on the M16 family of rifles and the design of Erlang.


That's actually what they did as an interim measure, once the error was identified.


That's the first thing I thought, but I wonder if it would violate SLAs to have the system down for a couple of minutes on a daily basis? In this case SLA violations could be penalized severely.


They probably reboot them in batches, so not all of them were unavailable at the same time.


I actually think reboots in military hardware like this are pretty common; that way you only have to debug a certain time window.


Here's a post mortem report sent to the House of Representatives:

http://www.fas.org/spp/starwars/gao/im92026.htm


We talked about this in my microcontrollers class today! Not this exact incident but about the problems associated with using the system clock in a micro for precise timing.

Our micro (Freescale HC9S12) is clocked by a 24 MHz crystal quartz oscillator and if you use the timing module to clock the micro, you cannot get exactly 1ms, 10ms, etc. This makes summing the time over a long period tricky.


What was the solution proposed in your class?

Off the top of my head, I can only think of keeping time values in units of clock-ticks (equivalent to the Linux "jiffy") and only converting to/from milliseconds when turning programmer or user input into internal values.


This. There is almost no reason to put time back into ms when absolute clock ticks will do. Never sum low precision values, etc.
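
A minimal sketch of the tick-based approach, in case it helps. The tick rate and the function names are hypothetical; the only point is that the counter stays an integer and conversion happens once, at the edges.

```python
TICKS_PER_SECOND = 10   # hypothetical tick rate; a real timer's rate would differ

def seconds_to_ticks(seconds: float) -> int:
    """Convert an input duration to whole ticks once, at the edge."""
    return round(seconds * TICKS_PER_SECOND)

def ticks_to_seconds(ticks: int) -> float:
    """Convert ticks to seconds only for display or output."""
    return ticks / TICKS_PER_SECOND

uptime_ticks = 3_600_000                              # 100 hours of uptime
event_ticks = uptime_ticks + seconds_to_ticks(0.5)    # an event 0.5 s later
delta_ticks = event_ticks - uptime_ticks              # exact integer subtraction
print(delta_ticks, ticks_to_seconds(delta_ticks))
```

Because the subtraction happens on integers, the size of the uptime counter has no effect on the accuracy of the delta.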


I'm not that knowledgeable about hardware, but the fact that they were using a "24 bit fixed point register" sounds like they were not using a commodity processor, even for 1991. I assume that the error described would simply not arise today on a modern CPU running Linux?


The Motorola 56000 DSPs fit the description, and they are used everywhere - all throughout military-industrial applications.

And yes, these bugs still occur - even on modern CPUs running Linux.


Medical and military equipment thrive on outdated hardware just because they can save a few hundred dollars on a product that will cost thousands or millions by avoiding going through the simple certification process for the new hardware.

It's a simple case of beancounters saving little in obvious places and wasting a lot later (in engineer time, testing, recalls, and in this case soldiers' lives).

Sadly, everyone in those industries buys that idiotic notion and repeats it like a mantra.


You've never worked for a military contractor have you? It's never black and white.

Let's say you ran the project that converted the missile from the militarized rad-hard ceramic-capped chip ($$!) that's likely in there now to a nice and cheap off-the-shelf Arm7. Except you forgot about coefficient of expansion for 120degF middle-of-the-day Middle East launches. Your Arm7 flexes too much and loses some bits, your Patriot misses, the soldiers lose their lives, and now the parallel universe equivalent of you can spout off on Internet forums about how the military just doesn't respect traditional, proven designs and is always wasting money on following the latest trend.

Also, you probably don't realize when the Patriot entered service? And the number of upgrades it's received?


Don't forget the fun of making sure the chip is built for an extended period of time to the exact specs that you started with.


I'm going to print this out and frame it. Engineering isn't no cakewalk.


In addition to what bronson already said:

Once upon a time I did production support for a military radar power supply. One of my tasks was something we called Diminishing Material Sources (DMS). Basically, we had a system that would be produced over the course of 10 years and would need to be supported for a minimum 20 years after it was produced. The design, of course, had to be complete before production started, so tack 7 years on to the front. Not every supplier would want to keep producing specific parts long enough to support our needs, so if something was going to drop out of production I had to perform a DMS analysis to find a substitute. That involved either finding a drop-in replacement or redesigning part of the circuit. The former was sometimes painful, because a new supplier would need to have its parts evaluated for compatibility, which meant dozens of hours of testing parts and testing boards with parts installed.

The latter was an order of magnitude more painful. Any change to the design meant a partial or full re-qualification, and the cost could be anywhere from hundreds of thousands to tens of millions of dollars. Anything that involved changing the circuit board automatically went to the right side of that scale. The "simple certification process" involves intensive environmental stress and screening testing: thermal shock on assembled boards (-40C to 70C transition in 30s); operational thermal cycling (repeated cycles of -40C to 70C with the hardware operating); and operational vibration testing (several tens of minutes operating while attached to a table shaking violently enough to liquefy your internal organs). There is also accelerated life testing, which determines if the modified design will meet the system longevity requirements. Depending on what changed and the system complexity, this may have to be repeated at multiple subsystem levels above the affected board.

For this reason we bent over backwards not to change the design unless we had no other choice. Design changes brought too much risk for us, our customers, and our end users. For this reason the military much prefers the spiral upgrade process for maintaining long-lived systems: once you get some practical experience with the current version, start work on an upgrade package that contains multiple fixes and enhancements at once and switch to that when it is ready. Sometimes it is even possible to retrofit existing systems with the spiral upgrades.


There's probably EMP hardening requirements that non-specialized processors don't meet.


If the time was in tenths, why would you multiply it by 1/10 to get seconds? Shouldn't you be multiplying it by 10 to get seconds?

update: Ahh, I get it. You'd need to divide it by 10 to get seconds.


No. 10 tenths of a second is 1 second, not 100 seconds.

10*(1/10)=1


I think they mean an integer storing tenths of a second (so 10 = 1 second) , so 10 * (1/10) == 10 * 0.1 == 1.0


So, shouldn't it have just been a divide by 10?


Multiplying by 1/10 is the same as dividing by 10.


And faster too. For older processors without hardware divide, much much faster.


No. No. 1/10s / s = 10. so 10 1/10s / 1s. No. s / 1/10s = No. 1/10s / 10.


> Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.

-----------

I wonder what trade-off informed their decision not to have the time code in one place and pass it around as they needed it. Or at least defined in one place and then inserted where they needed it if there was some requirement that it be right there in the functions that wanted to know about time.

:/

Just seems a danged odd thing to do.


And I didn't want to take that Numerical Methods course...


There's an algorithm to deal with this kind of additive small error, the Kahan Summation algorithm. It keeps a second number to record the residual of an addition so that adding something small to something large doesn't go awry. With modern floating point, it's rarely an issue, but with a mere 24 bits, I guess it was a terrible problem.
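
For anyone who hasn't seen it, here is a sketch of Kahan summation next to a plain loop. The input (100,000 increments of 0.1) is just an illustrative workload, and `math.fsum` is used as the reference answer.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kahan_sum(values):
    """Kahan (compensated) summation: carry the rounding residual forward."""
    total = 0.0
    residual = 0.0            # low-order bits lost by the previous addition
    for v in values:
        corrected = v - residual
        new_total = total + corrected
        residual = (new_total - total) - corrected
        total = new_total
    return total

ticks = [0.1] * 100_000       # hypothetical workload: many small increments
reference = math.fsum(ticks)  # correctly rounded sum from the standard library
naive_error = abs(naive_sum(ticks) - reference)
kahan_error = abs(kahan_sum(ticks) - reference)
print(naive_error, kahan_error)
```

The naive loop is written out by hand because the built-in `sum` may itself use a compensated algorithm on recent Python versions, which would hide the effect.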


This sounds similar to the technique used in the integer-only version of Bresenham's line drawing algorithm.


I had a very similar bug in some software I wrote and used to perform in a band. The upshot was that everything worked fine at soundcheck; 6 hours later, the tempos were off by about 8%, and varied pseudo-randomly during each song.

The fix was to switch from float (24 bits precision) to a long int (32 bits), and to reset all the relevant vars between songs.

Fun times!
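
A rough illustration of why a float clock degrades over hours, not a reconstruction of the actual bug. It assumes a hypothetical 1 ms timer step and emulates a 32-bit float with `struct`, then measures how far the clock really moves when that step is added.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def float32_step(elapsed_s: float, step_s: float = 0.001) -> float:
    """How far a float32 clock actually advances when step_s is added."""
    start = to_float32(elapsed_s)
    return to_float32(start + step_s) - start

early = float32_step(1.0)            # shortly after start
late = float32_step(6 * 3600.0)      # six hours in
print(early, late)
```

Shortly after start the step lands very close to 1 ms; six hours in, adjacent float32 values are almost 2 ms apart, so the requested step cannot be represented. An integer tick counter does not have this problem, which fits the fix described above.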


What is next? Iraqi soldiers didn't take babies out of incubators as Congress and the American people/world was told?

Oh yeah, that was a lie also. http://en.wikipedia.org/wiki/Nayirah_%28testimony%29


Dumb question, but could you solve the problem by simply using 1/16th of a second, not 1/10th?
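
On the representation side at least, the premise checks out: 1/16 is a power of two, so a binary fixed point register holds it exactly, while 1/10 always gets chopped. A small sketch, assuming 23 binary fractional digits (the helper name is made up); whether the hardware clock could actually tick in sixteenths is a separate question.

```python
import math
from fractions import Fraction

FRACTION_BITS = 23   # assumption: binary fractional digits available

def chop_error(step: Fraction, bits: int = FRACTION_BITS) -> Fraction:
    """Error left after chopping `step` to `bits` binary fractional digits."""
    scale = 2 ** bits
    stored = Fraction(math.floor(step * scale), scale)
    return step - stored

tenth_error = chop_error(Fraction(1, 10))      # nonzero: 1/10 repeats in binary
sixteenth_error = chop_error(Fraction(1, 16))  # zero: 1/16 is exactly 2**-4
print(float(tenth_error), float(sixteenth_error))
```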


I'm impressed the GAO investigators got into pretty technical detail.

For a government agency, I know the GAO reports are generally considered objective and nonpartisan. For politico experts - how has the GAO managed to not go, say, the way of the EPA?


Next time I have a "bad" bug I'll try and remember this.


So if one happens to own or operate an early Patriot battery, wait until the last possible moment to fire it up so accuracy doesn't go to shit. Wow. Engineering maths fail.



