Nice reminder of a "glitch" that happened in one of the datacenters in use at the airline where I used to work.
Went like this: Guy who shows around the new datacenter/ops guy demonstrated how the emergency power off works by lifting the protection plate. Protection plate unhinges suddenly and drops onto emergency power off button. Hilarity ensues.
At a former managed-hosting startup, we would deploy a pair of EMC Symmetrixes into our DC cage as storage for our DB tier, with volumes mirrored over the pair. While equipment was being moved into the cage one day, someone accidentally banged the power-switch protection plate on one of the EMCs. Fortunately the power switch had to be held depressed for several seconds to turn off the equipment. Unfortunately, the protection plate itself got jammed in against the power switch, keeping it depressed. Hoisted by its own petard, as it were.
Good thing we deployed those EMCs in pairs. But then curiosity gets the better of a DC ops guy wondering how the plate got jammed. So he punches the plate on the 2nd EMC causing it to similarly jam and power-off. Doh.
When we told EMC about this flaw in the design, they deployed a fix to the production line - a rubber stopper under the plate next to the switch to protect the switch from the protection plate.
I remember one of the LiveJournal outages (IIRC) was caused by a dude in the datacenter thinking that the emergency power off button opened one of the doors.
Had a similar experience at an old company. One of the light switches in the server room was placed next to a switch connected to the UPS of certain servers (I have no idea how/why this even came about).
Confused the heck out of us when we were trying to figure out why some of our servers went on UPS power randomly at night. Turns out we'd get the notifications on nights the cleaning crew decided to flick the lights off then on again.
A data center where I work had a naked big red switch without a protection plate. Until I flagged up this lack and let the IT people know about it, it was a disaster waiting to happen.
Similar situation here. A big red button on the wall in the basement data center - nobody knew what it was for and everyone was afraid to touch it. One day a night operator decided he was going to flip the switch and see what happened. It cut power to the entire datacenter, including the elevator systems. We spent a few days getting all the systems back online.
Had another instance where a pipe ran from the floor above right over the UPS. Construction workers on the first floor decided to pour some unused paint into it (figured it was a drain pipe) and of course, we lost power again.
I have a nagging (intuition? feeling?) that software safety/reliability/security needs are going to explode soon (because unreliabilities multiply in non-resilient systems interacting with each other) and that these are simply foreshocks.
(yeah, I know security is already a huge deal, but as we come to trust software systems more and more, the safety/reliability factor will come more into play)
EDIT: This is also part of the reason I've been learning Elixir (http://elixir-lang.org/) since it's based on the highly-resilient Erlang and is designed to embrace failure. This was also informed by me reading Nassim Taleb's book "Antifragile" as well as "Thinking in Systems: A Primer" by the (late) Donella Meadows.
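To put a rough number on "unreliabilities multiply": if a request only succeeds when every system it touches is up, and the failures are independent (a big assumption, real outages tend to be correlated), the combined availability is the product of the individual ones. A minimal sketch in Python, with a made-up 99.9% figure per dependency and a hypothetical helper name:

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a request that needs every component to be up.

    Assumes failures are independent, which is optimistic in practice.
    """
    return prod(availabilities)


# Hypothetical figure: each dependency is up 99.9% of the time.
single = 0.999

for n in (1, 10, 50):
    combined = chain_availability([single] * n)
    print(f"{n:>2} dependencies -> {combined:.4f} available")
```

The point is only the shape of the curve: each extra hard dependency shaves a little off, and nothing in a non-resilient chain ever adds it back.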
You might be right, unfortunately. If it's cheaper to buy insurance that will cover the (expected) losses caused by outages, most organizations will choose to do that instead of making the software more failure-resistant. The problem is that insurance only works well for isolated incidents, but a software failure can cause a cascading failure with a huge impact. Insurance companies generally aren't prepared for that and don't have the resources to pay out to everyone.
But aren't the insurance companies smart enough to figure this out and start correcting their rates to be much higher? And if they actually have their acts together, wouldn't those same insurance companies start insisting on basic audits of their clients' systems?
I actually don't know about this stuff, so any correction of my thoughts is appreciated.
> But aren't the insurance companies smart enough to figure this out and start correcting their rates to be much higher?
That seems like a naive "MARKET WILL FIX IT" approach.
More likely, if the market does fix it, it'll be by having insurance companies deploy actual inspectors who know what they are doing and what sorts of problems to look for.
They might even be a fun combination of physical pentester/irl chaos monkey. Doesn't that sound like a fun job?
Quite often operational losses aren't insured against this kind of error, for example the Knight Capital automated trading losses. Sufficiently big operational failures can just destroy companies, especially small companies.
System audits would have to be standardised. There's bits of this in ISO9001, PCI compliance, FIPS, and so on. But the technology changes rapidly and the insurance companies don't have the expertise.
The cynic in me feels that someone will figure out the problem with the cascade case and sell insurance against insurance companies failing to pay out. Just like during the 2008 financial crisis.
I think we just need to build smarter. I've become disillusioned with the "runaway state" problem (as well as spaghetti-dependency problems) in OO languages which contributes to bugs and general nondeterministic behavior as well as making long term maintenance difficult, and at the same time I've become enamored of unit test suites and functional immutable languages like Elixir as well as static code analysis tools (I'm still coming around to Haskell-esque typing, but I generally think it's a good idea to write "potentially provably correct" code that has parts which provably have no side effects).
So obviously the "move fast and break things" philosophy is meant for some fun web applications. But what's the equivalent "modern best practice" for systems that are much more weighted towards stability and resilience as opposed to new features?
Systemic quality focus would be a good start. Deming-based management philosophy driving a systems-oriented model.
This actually applies to all businesses; speed improves, market fit improves (quality is just "what is good and valuable" after all), employee happiness improves (exponential gains from that alone). The effects are systemically positive.
But, no one cares, and the belief that we have to crack the whip to get people to make things faster, and we should reward the good people and punish the bad ones—will continue on as pure religious fallacy resulting in the failure or constant-operation-at-the-edge-of-failure of everything involving more than five people, probably until the end of the human race.
Six Sigma and Agile are indeed at pretty opposite sides of the spectrum.
Deming -- W. Edwards Deming, that is -- is somewhere in the middle, around the right balance. He advocated for spreading a philosophy of systems thinking, scientific method, and statistical understanding, while simultaneously empowering employees by recognizing the power and responsibility of management and leadership, and understanding the motivation from a scientifically accurate psychological viewpoint.
It's a correct framework, and it's all aimed at driving quality by improving the things that directly impact it at a base level: fundamentally, how people work together, how they build systems that work, and how they're motivated (and demotivated) in reality.
Schuberg Philis (https://www.schubergphilis.com/) in the Netherlands has been selling 100% functional up-time for a while now. They've set their entire business model and management structure up to support this.
I worked on United's computer systems for a year (never that one though), and so I get nervous when I see a headline like that. True story: one of their systems still runs on a mainframe that has 9 bits in a byte!
One of my friends used to intern for a very large company that maintained software for flight control towers. His entire summer was spent writing bash tests for these old Fortran apps that kept planes from running into each other. Most of the mainframes still had tape.
I must admit I don't take that as a negative thing. Code that's so old has had a lot of time for corner cases to be ironed out. If it's maintained properly it's probably less buggy than any rewrite, even though Fortran is less enticing than Rust as a language.
I did not know that was ever a thing. I knew that bit counts had been lower - 5 or 6 - but assumed once we hit 8, the whole power-of-2 thing was too comfortable in a binary system to ever be anything besides a power of 2 again - and sure enough, we get 16 bit, 32 bit, and 64 bit systems, and double-byte and triple-byte char sets, etc. Need more space? Take another byte.
Was the 9th bit special? Or just a standard bit in the byte?
36 bit was popular for scientific computing because 35 gives you a sign bit and 10 decimal digits (yes, that's weird, but that's the argument I read everywhere, including at https://en.m.wikipedia.org/wiki/36-bit). 35 likely was skipped because its only divisors are 5 and 7; limiting instructions to 7 bits was felt to be too restricting even then. For some architectures, the DoD had a say in this, too. See https://en.m.wikipedia.org/wiki/Unisys_2200_Series_system_ar...
36 bits got us the 6-bit character (10 digits, 26 letters, and punctuation) with six characters in a word. Because of that, some OSes had six-character file names.
If you want to get upper- and lowercase, you need more than 6 bits. 9 is the smallest divisor larger than 6 of 36, so nine-bit characters made sense.
On such systems, file names could still use 6-bit characters, while applications used 9-bit ones. Also, some instructions could work on words, half words, quarter words, or sixth words.
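The arithmetic behind the 35/36-bit argument above is easy to check. A quick Python sketch (the variable and function names are mine, just for illustration):

```python
# Bits needed to hold any 10-digit decimal magnitude (0 .. 10**10 - 1).
magnitude_bits = (10**10 - 1).bit_length()
word_bits_needed = magnitude_bits + 1  # plus one sign bit
print(magnitude_bits, word_bits_needed)


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


# 35 splits only into 5s and 7s; 36 splits many more ways.
print(divisors(35))
print(divisors(36))

# Smallest field width that divides a 36-bit word and is wider than 6 bits,
# and how many such characters fit in one word.
next_char_width = min(d for d in divisors(36) if d > 6)
print(next_char_width, 36 // next_char_width)

# A 6-bit character has 2**6 code points: room for 10 digits + 26 letters
# plus punctuation, but not for a second letter case on top of that.
print(2**6, 10 + 26, 10 + 26 + 26)
```

So 10 decimal digits plus a sign need 35 bits, 36 is the nearest width that divides into 6-bit and 9-bit fields, and 9 is the next usable character size once 6 is too small.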
Someone probably read too much Iain M. Banks and decided to adopt Marain's base-9 system :)
If I'm not mistaken, some old consoles used a 9-bit RGB encoding, so this could theoretically help there. Minecraft also uses a 9-bit system for Redstone, afaik.
The PowerPC based AS/400 systems were technically 65-bit machines, as the extra bit was a privilege flag to separate system code from user code. Hardware enforced security - one reason why it is such a stable machine.
Sabre still runs all their reservations through zTPF... If you know anyone there ask them about how they get the data off the mainframe. The story I heard sounds like the process was designed by Rube Goldberg.
I knew someone at United that once offered to give me a tour of one of their data centers: "It is like a computer museum - we have one of everything." Hard to imagine that they would have problems as a result. United is a really, really bad airline.
I bet this is an issue with an old mainframe used somewhere in the booking system, something that has worked well but is difficult to fix when things go wrong.
I think there is / will be a lot of money to be made trying to solve the problem of software security and reliability. This is obviously an extremely difficult problem; however, given the number of ancient systems that we currently have interconnected, I think more large-scale outages like this are inevitable.
edit: As mentioned in [1], I assume that at least someone is aware of the cost-benefit of potential projects; in turn, I assume that someone would've pulled the trigger if the $'s make sense.
Was reading this on HN and heard it on NPR simultaneously.
I have a sneaking suspicion that booking systems for most airlines run atop legacyware. It just seems like the type of thing that would've been put in place long ago and then be very expensive to migrate/upgrade.
> The biggest problem, one that would drive any tech-savvy user crazy, is that United junked an award-winning, state-of-the-art reservation system and adopted the Continental Airlines model based on older technology known as System One.
Not a word about it on United's web site. Flight status page doesn't load correctly. "Today's Operations" gives an error message. United's Twitter is silent.
United is a mess. I had the misfortune of flying with them a couple months ago. I ended up in a city a 2 hour drive from my actual destination and had to rent a car on my own to get to where I was going.
That was the worst of it, but almost every flight I saw on the way (both ones I was on and other flights at nearby gates) was delayed or overbooked or otherwise messed up in some way.
If I ever listened to horror stories like these, there wouldn't be an airline left I would fly with. This is textbook; replace United with Delta, Southwest, Air Canada, etc., at will. The only company I've used that I haven't heard exactly this type of complaint about is Widerøe, a tiny regional line in Norway with a fleet of prop planes. This is unfortunate, as I live in Pennsylvania and get motion sickness on those little aircraft. And I imagine that if I spoke Norwegian—it seemed to me that while many Norwegians speak English, they don't do much complaining in it when the native language would do—I wouldn't even have Widerøe left.
It's also worth remembering that quality of service depends on a host of variables, including departure/arrival cities (major hubs are better than regional airports), routes, and times (morning flights tend to be less delayed, Thursday afternoon flying always a bit of a shitshow), etc.
I fly out of UA hubs frequently and have had nothing but excellent service from them this year (over 50 segments flown this year, ~60K miles).
Definitely helps to have status too...
Personally, I consider SW to be the shittiest of them all. I hate having to fight for a seat...
Lastly, use google.com/flights; it's by far my favorite booking tool now.
I've been flying 1-2 times a month for the past year or so, mostly on US Air or American (same thing now) and I never saw anything approaching the level of problems United had on one trip.
Obviously, with travel sometimes things go wrong, but the quantity and severity of things going wrong at United makes me think there's something off about the way they're running their airline. For instance, the two times this year that they've had to ground all flights.
AA destroyed a piece of baggage of mine. They denied responsibility. I sued and won. Then I couldn't collect because that judgement was essentially nullified because AA was going through bankruptcy.
I have 1k status with United -- I fly internationally with them almost monthly. The problems I've had with United have been when bags have to be interlined to Brussels Airlines and occasionally (surprisingly) Lufthansa. I've also had Lufthansa somehow think it was a good idea for 2 toddlers to sit in scattered seats rows away from their parents (who were also rows apart, despite having checked in almost 24 hours before the flight and being Star Alliance Gold). I've had Brussels Airlines say they were going to "gate check" a stroller only to have it show up days later. I've been stranded in Detroit back in the Northwest Airlines days when aircrews hadn't shown up to work. I've been stuck in Paris when the Air France pilots decided that salaries up to $300,000 per year just weren't enough. On a recent United trip from Hartford to Marseille, I was stuck in Hartford for % extra hours waiting for an airplane that was stuck in Newark (just a <50 minute flight away). I then missed a cascade of connections, leaving me rather miserable. However, United sorted the problem and got me on my way as quickly as possible. Let's not forget JetBlue's antics on multiple occasions a few years ago: a 10.5 hour tarmac delay and a 7 hour tarmac delay, among several other extremely long tarmac delays. AA had 14 long (over 3 hours) tarmac delays in February. United had zero long tarmac delays during the same period. Envoy/American Eagle was in last place for on-time arrivals last year.
I'm not defending United. I'm not disparaging the others. The fact is that the air transport industry is extremely complex and perceptions of quality are as varied as there are passengers in the sky.
Every airline sucks and every airline is great. Pick a day, pick a destination and roll the dice. When you fly often enough it seems like it all averages out to just one level of melancholic service; unless you're flying on Singapore Airlines -- then it just becomes sublime.
Not that I know of. My particular flight was delayed because they had to replace a piece of the navigation equipment on the plane. This delay caused me to miss a connecting flight and that's how I ended up in an entirely different city.
This happens with every airline. No airline has magical planes that never break and they all have similar maintenance schedules on the exact same planes running the exact same workload. Blame Boeing for making a crappy plane if they keep breaking down.
I wasn't so annoyed that the flight got delayed and I got stranded in the wrong city, it was the way they handled it. United's response was:
1. Do nothing. You're stuck here. We'll get you on the next flight. Oh, the next available flight isn't for over a week. Sorry. No, we can't pay for your rental car.
2. Upon renting my own car, and writing to customer service to complain I got a $125 voucher for United. Great. Not enough to buy an entire ticket, so it just ensures I'll have to continue giving money to this airline that failed to deliver what I gave them money for in the first place.
3. After several weeks of emails back-and-forth with customer service finally a manager agrees to issue a reimbursement for the rental car as a one-time exception to their policy of never doing that. (but not the $7 worth of gas. I guess they just have to draw a line somewhere? Oh, and they also rescinded the voucher. No big deal since I'm not too keen to fly with them again anyway)
So, yeah. Sometimes shit goes wrong when you travel. Everybody knows that. How airlines act to fix it makes all the difference. If you hamstring all your customer service reps so they can't actually solve someone's problem, it makes something that's already annoying way more frustrating.
Very little news about this on Google News, but heard over the local Chicago ABC affiliate that the FAA attributed this to an "automation error"
Edit: And its Twitter account has been relatively inactive, with more than 30 minutes since the last reply-to or general tweet...presumably a lot of complaining tweets have come in in the last 30 minutes https://twitter.com/united/with_replies
I bet they could have saved themselves untold numbers of telephone calls in the support queue, and thousands of in-person queries to ticket agents and other airport staff with one or two tweets and a facebook post or two.
"Departing DEN; taxied and then returned to gate. Pilot says nationwide failure of "three or four" computer systems. Only information from airport staff is that since the computers are down UAL can't book pax onto any other airline ..."
"Systemwide Ground Stop posted at FAA:
Due to USER REQUEST DUE TO AUTOMATION ISSUES. UAL AND SUBS ONLY., departure traffic destined to ALL airport will not be allowed to depart until at or after 13:15 UTC."
Considering just yesterday their flight system didn't believe they flew from SFO for 2+ hours in the morning (I have screenshots), I'm not all that shocked.
Lol HP system. I used to work for SABRE; it's far more dependable despite its antiquity. HP is a marginal player in this market, which is dominated by SABRE and Amadeus.
ISO all the way. Month names have issues across languages (e.g. Août vs. August), but a format using them is otherwise more elegant (no delimiter needed, e.g. 08JUL15).
I always kinda liked the meme from that language of "variable as picture". Calling something a picture has inherent and obvious implications WRT call by value vs call by name.
On the other hand intrinsic functions like FUNCTION CURRENT-DATE are not as logically amusing.
Why is this being down voted? It is true. When British sites spell things with superfluous letters (colour, for instance), should HN readers demand that they change it to the more efficient spelling? American English is spoken natively by more people in the world than British English, so therefore must we banish British English from these pages? Of course not. We could argue that the US has more visitors to HN than any other country. HN is an American site. YC is an American organization. Thus, those not comfortable with American conventions really ought to get over it.
The United States has the world's largest economy, according to the IMF, World Bank and UN. So obviously the US is doing something right. Perhaps the rest of the world ought to adopt American conventions. I say that in jest, but the point is that criticizing the American way of doing things is a popular sport, yet at the end of the day, the American system has resulted in an economic output greater than Germany, the UK, France and Italy combined. The EU has a per-capita GDP of $36,779 and the US is at $54,601. So apparently something is working with the American way of doing things. Just because "the rest of the world" does it doesn't make it better. This whole idea that the American way of doing things is somehow inferior is just as ridiculous as the British/French rivalry.
The fact that the parent comment is the top comment is ridiculous. It's a petty thing about which to complain and adds nothing to the discussion. In fact, the parent comment ought to be down voted for being absolutely irrelevant to the posting in question.
I worked for a company that did a special translation of its software from American English to British English alongside the other languages they did. The Brits loved it, although they were mystified by the pricing - it cost them in Pounds as much as it cost the US customer in Dollars.
I tried to get a former employer to switch to ISO format, since they had offices both in the US and Ireland. There was too much pushback so I didn't succeed, but I at least got them close: 2015-Jul-08.
But a month from now, some people will be confused for sure. So, it's more future-proof to settle for ISO or alphanumeric format to avoid confusion and miscommunication.
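To make the confusion concrete, here's a small Python sketch. The string "07/08/15" is my own illustrative example (it is not from the posting), picked because it is exactly the kind of date that reads as two different days a month apart depending on which side of the Atlantic you're on:

```python
from datetime import date, datetime

# Illustrative all-numeric date string (an assumption for this example).
ambiguous = "07/08/15"

# Same string, two regional readings.
as_us = datetime.strptime(ambiguous, "%m/%d/%y").date()  # month first
as_eu = datetime.strptime(ambiguous, "%d/%m/%y").date()  # day first

print(as_us.isoformat())
print(as_eu.isoformat())

# The ISO 8601 text maps back to exactly one date, whoever reads it.
iso_text = as_us.isoformat()
assert date.fromisoformat(iso_text) == as_us
```

The two readings land a month apart, which is why either ISO (year-month-day) or a spelled-out month removes the guesswork.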
I wouldn't get on a Russian autonomous flight for a million dollars. Between the corruption, hacky engineering, and complete disregard for safety standards, Russian accidents surprise no one. The Tu-154 accident list is long and scary. Hell, look at all the fires that broke out.
That's twice in a short time now - 'coincidence' on Ian Fleming's scale. A third time is enemy action. Airlines do seem like a pretty juicy target for cyber war operations - you can cause a gigantic amount of disruption with a successful attack on a single system.
I don't see any convincing rebuttal. Why are airline systems not a juicy target? Why would a successful attack on the system not cause gigantic disruption? Why have these long-running, stable systems only recently begun failing so severely?
Yes, one is wise to attribute to incompetence over malice, but these systems are demonstrably not run by incompetents: they've been in operation for decades.
Or, you know, the load is increasing over time, and the system is failing to scale to it, and so will fail more and more frequently as the load continues to increase?
The number of daily UAL passenger seats doesn't change or grow radically over weeks or months. What is the scale-up here? Everyone using their phones to check rez/status/boarding passes?
Some quick googling suggests they take delivery of several-many dozen planes a year (based on various order/delivery/etc. coverage), which would absolutely cause a step function in the number of passenger seats.
I expect there's growth on all of these:
- Total planes
- Total flights
- Number of routes
- Number of passengers
- Passenger utilization of electronic boarding
- Passenger utilization of reservation modification (seat changes, upgrades, etc.)
- Passenger utilization of in-flight electronic amenities (wifi, in flight entertainment, etc.) that are billed through the system (United has a proprietary WiFi that's tied to your frequent flier account)
- Online booking
- Travel agent/reseller pricing queries
- Travel agent/reseller booking
I'm not saying any of these are dramatically increasing, but I'd bet they're all going up slowly, which will add more and more load to the system.
Capacity only increased 0.01% over the first quarter of 2015. Passenger loads have remained flat at 81.1%. I would suggest that compared to last year, there hasn't been much change in scale. Also, all of the above systems are not linked together, so I'm not sure an increase in one of those factors would be enough to take down an entire system. I could be wrong, I am definitely not an aviation IT expert. (Apparently Untied could use some though, so perhaps I might need to think about adding that to my skills!)