How to lose $172k per second for 45 minutes (2013) (hmmz.org)
327 points by sunasra on April 1, 2019 | 166 comments



I'm amused by the tone. It's like the author doesn't realize that 99% of software development and deployment is done like this, or much much worse. Welcome to the real world.

We work in an incredibly immature industry. And trying to enforce better practices rarely works out as intended. To give one example: we rolled out mandatory code reviews on all changes. Now we have thousands of rubber-stamped "looks good to me" code reviews without any remarks.

Managers care about speed of implementation, not quality. At retrospectives, I hear unironic boasts about how many bugs were solved last sprint, instead of reflection on how those bugs were introduced in the first place.


> I'm amused by the tone. It's like the author doesn't realize that 99% of software development and deployment is done like this, or much much worse. Welcome to the real world.

Agree with this, a lot of developers are in a filter bubble where they stick to communities that advocate modern practices like automated testing, continuous integration, containers, gitflow, staging environments etc.

As a contractor, I get to see the internals of lots of different companies - forget practices even as basic as doing code reviews, I've seen companies not using source control with no staging environments and no local development environments where all changes are being made directly on the production server via SFTP on a basic VPS. A lot of the time there's no internal experts there that are even aware there's better ways to do things instead of it being the case they're lacking resources to make improvements.


I wish I hadn't experienced this exact scenario as well. I actually worked at a publishing company where I discovered the previous team building their major education platform didn't know there was such a thing as source control. Their method for sharing multiple team members' changes was to email the full project to one person every Friday, who would manually open a merging tool and merge all changes themselves. They would then send out the updated code base to everyone again. Because of this method, they were afraid to delete any stale code, and just prefixed old versions of functions with XX. As you can imagine, inheriting that code base was a nightmare to deal with.


I remember having this debate many years ago. Would it be better to introduce a team with no source control knowledge to something like svn first or straight to git?

Svn is easier to understand and use, but then you’d have to break some existing habits to get to git. But going straight to git might be a big step and cause reversion back to whatever system was already there.


Straight to git but have a simple and clear branching strategy. As in a full written procedure for the three main events of git. "Get code, commit code, push code".

Then disallow direct commits to master to make people work in feature branches and make merge requests through the platform (GitHub / GitLab / Bitbucket). I find merging and branching locally is where people normally trip up.

Git GUI tools always make git seem way more complicated than it is, so depending on the team's platform I would recommend the CLI from the start.


I would not start people new to git on cli. That's how you get someone's entire Camera Uploads directory committed to master (I've seen it before). I recommend Git Tower. I use it for most tasks actually even though I am comfortable in the CLI too. It tends to stop me from doing stupid things before I do them.


Also, the git CLI in particular is extraordinarily terrible, given how many conceptually different things are done by the same set of commands. (For example: imagine trying to use git checkout in a repo with a file named "master" at the root level.)


You can probably assume a lot of people who aren't using version control aren't using the terminal either. GUIs are generally much better than CLIs at giving you an overview of what's happening and what actions are available too.


I've supported a few svn instances. Devs could be trusted with svn, but analysts, testers, system people... My god, the horrors these guys dream up.

Say you want to edit a few chars in a 250MB file. Why don't you, just to be sure, paste that whole file into the comment field? Do that for a few hundred commits. Tortoise really hated that one, and crashed Windows Explorer every time you dared look at the logs (out of memory).

Or the time some joker (his CV proudly declares 10 years of developer experience) deletes the root of the tree, doesn't know about history, and goes with his manager straight to the storage admin, who wipes everybody's commits from that day (a few hundred people). There clearly is no need to contact someone who knows anything about subversion if the data is gone, and maybe this way nobody will notice anything and yell at them.

Or say you want to do an upgrade. Theory is every user leaves, service and network port get shut down, VM instance is backed up just to be sure, you do the svn upgrade. Of course enterprise IT means I have to write detailed instructions for every party involved and am under no circumstances allowed to look at that server myself.

So it turns out: A) some users just keep on committing straight through the maintenance window. B) The clown who shuts down the service doesn't check if the service is actually shut down, and there is a bug in the shutdown script. So svn just keeps on running. C) The RPM containing the patch is transported by an unreliable network that actually manages to drop a few bytes while downloading over HTTP. D) The guy who should shut down the svn network port is away eating, so they decide to skip that step. E) SVN gets installed anyway (what do you mean, checksum mismatch) and starts committing all kinds of weird crimes to its storage. F) The VM guy panics, rolls back to the previous version, except for the mount which contains the data files. G) Then they do it all again, and mail me how the release was successful without any detail of what happened.
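Steps C and E are the kind of thing a checksum gate exists for. Below is a minimal sketch, not what that team ran: the file name `package.rpm` and the helper names `sha256_of` and `verify_package` are hypothetical, and the "published" digest is computed locally only so the demo is self-contained. The point is simply that a mismatch should stop the upgrade rather than produce a warning someone can skip.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(path: Path, expected_sha256: str) -> None:
    """Raise instead of continuing when the download does not match."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path.name}: refusing to install")


# Demo with a stand-in file; a real package and its published digest go here.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp) / "package.rpm"  # hypothetical file name
    package.write_bytes(b"pretend this is the package payload")
    published = sha256_of(package)  # stands in for the vendor's digest

    verify_package(package, published)  # intact download: passes silently

    package.write_bytes(b"pretend this is the package payloa")  # a dropped byte
    try:
        verify_package(package, published)
        upgrade_allowed = True
    except ValueError:
        upgrade_allowed = False  # the install step never runs
```

Raising an exception, rather than returning a flag, makes it harder for a script further down the runbook to carry on regardless.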

Let me tell you, svn really loved having its binary change right under it, in the middle of a commit, while meeting its own previous version in memory. Oh and clients having revision N+10, while the server is at version N. A problem that solves itself in a few minutes as N goes up really fast ;-)

Now that's what happens with subversion, which is rock solid and never drops a byte once committed. This company is now discovering the joys of git, where you can rewrite history whenever you feel like it.


Here's a vote for going straight to Mercurial (a lot easier to grok) and a link to Joel Spolsky's excellent tutorial

https://web.archive.org/web/20180903164646/http://hginit.com...


GitHub can be used with both GIT and SVN so you don't have to choose.

I've introduced a lot of people to SVN over the past decades. Be it programmers, sysadmins, artists, translators, it's fairly quick to learn.

I couldn't begin to imagine introducing anybody to git. It's a horrible nightmare to use, even for developers. Nothing else comes close in how many times it screws up and you have to search for help on the internet.


If you haven't used any source control at all, I don't see why svn would be easier to understand than git? Using git will save you from a lot of pain up ahead so I would definitely go for git.


My limited experience with git (and none with svn) leads me to suggest someone might prefer another source control system, not because it's easier to understand per se, but because you can do simple things in it without fully understanding it.


"Straight" to git? If you have already made up your mind, why ask?

Subversion is newer than RCS. But that doesn't mean every use of the latter can or even should be replaced.


Surely these days that question is answered by the existence of GitLab and several other similar tools.


SVN and git! Pah!!

IBM ClearCase is the way to go


Someone created a jenkins pipeline that would deploy code in a zip file into prod.


Am I missing something? Isn't this exactly how deployment servers actually work?

I'm not into Java development, but this sounds fine on the face of it, without you giving the context of how this pipeline is triggered.


In theory it sounds alright. It's not great, because Jenkins is usually layered with some existing deploy framework that makes "deploy a zip file" pretty suspect. A healthier setup would look more like "build a Tomcat war file with Maven, upload and deploy that with Jenkins". But in context, it sounds like the horror is that people were making and transferring a zip from local code rather than building from the tip of source control.


Developers had a copy of code in a google drive... they'd modify it, zip it and overwrite the one in google drive, copy it to the network folder which would deploy it and delete it from the network folder... in 2016.


That’s a pretty standard CI process - zip files are often your deploy artifact (using, for example, git archive)


Yeah, except the zip file was... the input.

You pulled the code from google drive, modified it, pushed it to PROD, checked it and moved it back to google drive... and asked the other developers to update.


>I've seen companies not using source control

Well I've seen companies that have their own idea of source control. Which is lots of copies on the network drive, and an Excel registry with what is in which file.

It is source control. Just bad source control.


I worked on a system with “octopus locking”. We had a toy octopus, and you could only change the code (obviously in production) when you had the octopus.


My day job involves multiple third-party systems that enforce their own proprietary version control systems for any custom scripting/programming within those systems. Unsurprisingly, these proprietary version control systems are complete garbage 90% of the time, especially if you're trying to collaborate with someone else.


This comment resonates very well with what I've been experiencing at my current job, and with our new parent company. It's always nice if you can take a new application, throw it into a clean container build chain with orchestration, or some AWS stack, and end up with a cheap, low-maintenance, HA production system.

However, there's also the skill set of taking such a ... let's call it well-aged development team and approach and modernizing and/or professionalizing it. And yes, sometimes this means to build some entirely lovecraftian deployment mechanism on mutable VMs because of how the application behaves. But hey, automated processes are better than manual processes, which beat undocumented processes. Baby steps.


> I've seen companies not using source control with no staging environments and no local development environments where all changes are being made directly on the production server via SFTP on a basic VPS.

Omg. And here I am feeling ashamed to tell others about my small small personal website with separate dev, qa, production environments on same server (via VirtualHost), code checked into github, deployed via Jenkins self hosted on another VPS, which was initially spun up with Ansible and shell scripts. All done by me for self training purpose. All because I thought businesses would have something more sophisticated with bells and whistles.

And then I hear there are businesses that make changes directly on live production servers...

But I'm not surprised by such stories as I have seen some bad workflow in real businesses that deal with tens of millions of dollars a year.

Years ago, I worked in the NOC of a company that's at the top of its small niche. They have dozens of employees and have been around for years.

Part of the job responsibility was rolling out patches to production servers. The kicker was the production servers were all Windows servers, running various versions, covering practically all Windows versions ever released by Microsoft. You can see where this is headed.

Rolling out a patch was all done manually, 1 Windows server at a time. Everything was manually done.

The instruction for deploying a change was easily multiple lines long, each with different style of writing/instruction. Often in plain text file format. We would print them out so we could check them off as we went down the list.

The CTO is still there, but every IT person under him has left or been let go. Working in IT there is a struggle because of the lack of automation and the old, old stuff, but the CTO just blames bad employees and keeps churning them in and out. The real issue is the decade or two worth of legacy stuff that needs to be cleaned up and/or thrown out, which can only be done at the direction of the CTO. But he knows he won't get that kind of budget from higher-ups, so he will just keep hiring and firing employees and/or bring in some H1B workers who are basically trapped once they join. And of the few H1B workers I met there, they were truly completely non-technical. One did not want to learn how to use keyboard shortcuts to do common tasks... Good guy though.


> I've seen companies not using source control

> ...all changes are being made directly on the production servers via SFTP

I know this used to be common, but recently? Curious how often this is still the case.


> I know this used to be common, but recently? Curious how often this is still the case.

Several times within the last year for me. Not all companies have big tech departments with knowledgeable developers advocating modern best practices. Some big internal systems can start out with someone internal applying some basic self-taught skills, for example.

To be fair, the jump to using Git (and especially dealing with Git merge conflicts) is scary. It can be a hard sell to get people to move from a system they already completely understand and have used successfully for years, even if their system looks like ancient history to us.

Literally heard "...but my IDE already automatically uploads to FTP on save, I'm usually the only one editing it and I already know what I changed" last week.


I have seen it recently. I did my best to change the practice before I left the company, but was mostly unsuccessful. Given that they were still running some of the spaghetti-code PHP scripts that were written in 1999 and still used PHP4 in new development they were stuck in the stone ages. To give a little perspective, support for PHP4 ended in 2008, so they had almost a decade to update, but didn't.


"If it ain't broken, don't fix it". And then one day the server goes boom, the backup was incomplete, and everyone is trying to find the usb flash disk with Spinrite in it.

Meanwhile the CEO who has been rejecting the €¥$£ in the budget since 2000 is angry at everyone!

Oh the times I have seen this!!!


Oh, now that you mention backups, that was a nightmare too. Thankfully, the production database was backed up daily on magnetic tape and stored offsite, but the code was generally edited live on the server, and backups consisted of adding ".bak20190402" to the end of the file. Needless to say, losing code wasn't uncommon.

This was for a 100+ year old company with millions of dollars in annual revenue that was owned by the government. So, yeah. 100% the IT director's fault, who'd been there since the early 90s.


It's both the CEO's fault for either not understanding or not hiring someone who properly understands the risk they're taking on in their tech stack, and the fault of whoever's job it was to understand the risk. Part of being a responsible engineer (or IT manager, etc) is to be able to say "no" to new things and to explain that a bad day can and will take you down.


At my first (very small, research-group) employer, I was the one to introduce source control. Another company I saw ~4 years ago had Git, but no one knew how to use it adequately so "lost code" was still a regular event. I haven't seen it since, though; lots of people frightened by and badly misusing Git, but they mostly manage to keep code histories intact.

Having no testing/staging environments remains pretty common, along with its cousin "production work happens on staging". Partnering with not-primarily-software companies and asking about staging infrastructure, you hear that regularly. And yeah, SFTP/SCP/SSH is a standard push-and-deploy approach in places where that happens.


On the one hand, source control is really the most basic thing you can find anywhere. It's really hard to find an actual development shop without source control.

On the other hand, outside of a tech company and with less than a dozen developers, don't expect to find any source control. Consultants see a lot of this shit, they work in all industries including where developers don't exist, with a lot of thrown away projects.

Funny thing. Git probably made it worse in recent years by being impossibly hard to use.


It's nice to have a source code history to see why and when something was changed. But a lot of tooling usually just treats the symptoms of complexity.


Hey, at least they're using SFTP - you know, for the security!

/seriously, though - I...hope this isn't being done any longer - but I bet it is. Sigh...


Way too common. So many clients I've dealt with have a deployment workflow that is some variation of this.


There is a really, really interesting breakdown of this document on kitchensoap.com [0]

The writer breaks down why this document is not a post-mortem despite superficial similarities.

>Again, the purpose of the doc is to point out where Knight violated rules. It is not: 1) a description of the multiple trade-offs that engineering at Knight made or considered when designing fault-tolerance in their systems, or 2) how Knight as an organization evolved over time to focus on evolving some procedures and not others, or 3) how engineers anticipated in preparation for deploying support for the new RLP effort on Aug 1, 2012.

>To equate any of those things with violation of a rule is a cognitive leap that we should stay very far away from.

>It’s worth mentioning here that the document only focuses on failures, and makes no mention of successes. How Knight succeeded during diagnosis and response is unknown to us, so a rich source of data isn’t available. Because of this, we cannot pretend the document to give explanation.

He also makes an interesting point related to what you're saying: the SEC says the risk management controls were inappropriate, but clearly Knight thought they were appropriate or they would have fixed it.

>What is deemed “appropriate”, it would seem, is dependent on the outcome. Had an accident? It was not appropriate control. Didn’t have an accident? It must be appropriate control. This would mean that Knight Capital did have appropriate controls the day before the accident. Outcome bias reigns supreme here.

I'd go into it more but I would instead recommend you take a look at his breakdown as I'd be trying to do a shoddy summary of a really interesting write-up.

[0] https://www.kitchensoap.com/2013/10/29/counterfactuals-knigh...


> He also makes an interesting point related to what you're saying: the SEC says the risk management controls were inappropriate, but clearly Knight thought they were appropriate or they would have fixed it.

> This would mean that Knight Capital did have appropriate controls the day before the accident.

I think these claims are seriously confused.

SEC fines don't require mens rea, so Knight is simply being punished for having inappropriate controls, their view on the matter be damned. Kitchensoap rightly observes that "this event was very harmful" does not imply "this event was caused by extreme negligence". But the SEC filing focuses on alerting and controls; position limits don't prevent a specific misstep, but they limit the maximum size of any error that does occur. (Knight had position limits on accounts, but didn't use them as fundamental boundaries restricting actual trade volume.) The thesis is that Knight should have prepared to mitigate "unknown unknowns", in which case the size of the error is relevant because the size was exactly what should have been controlled for.

On appropriate controls, SEC fines are certainly outcome-biased, but the claim is obviously that these controls were always inappropriate, and the disaster simply revealed them. Post-disaster punishment creates an ugly system where people who don't take excess risk can be outcompeted before their competitors crumble, but the rule isn't actually conditional on failures.

Kitchensoap asks whether Knight would be judged so harshly if they'd only lost $1,000. Socially, perhaps not, but legally they actually would have! The SEC isn't just punishing Knight for losing money but for disrupting the market with improper trades; it specifically notes that for some "...of those stocks, the price moved by greater than ten percent, and Knight’s executions constituted more than 50 percent of the trading volume. These share price movements affected other market participants...". A smaller loss wouldn't have defended against that charge, while a smaller trade wouldn't have violated SEC rules.

I think the author is basically aware of this, since his fundamental point is that the SEC is describing the legal wrongs rather than the technical mistakes. That's a good point and I'm glad you linked this. But I think his focus on the specific deployment error neglects the fact missing position controls were the larger legal and technical failure.


> Kitchensoap rightly observes that "this event was very harmful" does not imply "this event was caused by extreme negligence".

What makes negligence extreme? Doing things you have specifically been warned against would be one thing, but there are others, including being oblivious to the magnitude of the risk when, with a little thought, it should have been clear.

The inverse of the above quote is equally valid: not-very-harmful outcomes do not imply that the negligence is not extreme, and it was all the days of operating without big problems that allowed the organization to be blind to the risk it was running, day in, day out.

OP wrote: >> Clearly Knight thought [its controls] appropriate or they would have fixed it.

But the problem is that it did not think about it, in a meaningful way: it did not have an informed opinion. Every day in which nothing very bad happened contributed to the normalization of deviance. I am sure there were other days when things went wrong, but without the worst possible outcome, and they became just part of the way things are, instead of a wakeup call.


> SEC says the risk management controls were inappropriate, but clearly Knight thought they were appropriate or they would have fixed it.

By this logic, we can claim that if a hospital is storing all its employee passwords in plaintext, that's "appropriate" because if it was inappropriate they wouldn't do so.

Or that if a company is neglectful about offsite backups, that's an "appropriate" data retention strategy because if it wasn't, then the company would be taking backups.

In this case, if Knight thought its controls were "appropriate", then that's the problem that needs fixed.


This kitchensoap article may be more correct, but man, it's an arduous read. A bit of brevity and humour can work wonders; conversely, complete correctness can be detrimental if people lose attention.


It is a lengthy slog for sure, but I found it a great way to kill some time. Maybe you just have to be in one of those reading moods.


I'm not hugely fond of this specific article; I think the dissection of the SEC filing misunderstands what the accusation is. (Specifically, it's not having the bug, it's the lack of mitigation for large bugs in general.) To me, the "this is not a post-mortem, here's why" part would have stood better on its own.

But I appreciate the writing style, and I've got his site bookmarked for more systems safety reading in the future. It's a slog, but often this sort of ruthlessly-comprehensive breakdown is the best way to understand exactly how a complex thing went awry. Reading them every so often - even for non-software topics like drug treatments - seems to be a good refresher for my own error-analysis skills.


But also for 99% of companies, running the wrong version for an hour won't bankrupt the company and all of its investors.


I've seen a lot of companies and teams with hair-raising practices, but most of them still manage some basic business-logic safeguards like "don't buy for more than $X" or "if we don't sell anything for 10 minutes, shut it all down and sound the alarms". Generally they can't do worse than "making no money" or maybe "paying full expenses without making any profits".
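A rough sketch of what those two safeguards could look like in code, purely as an illustration. The class `OrderGuard`, the exception `TradingHalted`, the limits, and the idea of passing timestamps in explicitly are all assumptions made for the example, not anything taken from Knight's systems or the SEC filing.

```python
from dataclasses import dataclass


class TradingHalted(Exception):
    """Raised when a safeguard trips and a human has to step in."""


@dataclass
class OrderGuard:
    max_gross_exposure: float  # hypothetical "don't buy for more than $X"
    max_idle_seconds: float    # hypothetical "nothing sold for 10 minutes" window
    gross_exposure: float = 0.0
    last_sale_at: float = 0.0
    halted: bool = False

    def record_sale(self, now: float) -> None:
        """Note the time of the most recent sale."""
        self.last_sale_at = now

    def check_order(self, notional: float, now: float) -> None:
        """Approve an order or halt everything; never silently continue."""
        if self.halted:
            raise TradingHalted("already halted, waiting for a human")
        if now - self.last_sale_at > self.max_idle_seconds:
            self.halted = True
            raise TradingHalted("no sales within the idle window")
        if self.gross_exposure + abs(notional) > self.max_gross_exposure:
            self.halted = True
            raise TradingHalted("order would exceed the exposure limit")
        self.gross_exposure += abs(notional)


guard = OrderGuard(max_gross_exposure=1_000_000, max_idle_seconds=600)
guard.record_sale(now=0.0)
guard.check_order(notional=400_000, now=10.0)   # accepted
guard.check_order(notional=-400_000, now=20.0)  # accepted; shorts count too
try:
    guard.check_order(notional=300_000, now=30.0)  # would take gross to 1.1M
except TradingHalted as exc:
    reason = str(exc)  # the guard stays halted until someone resets it
```

Counting long and short notional together is a deliberate choice here: offsetting positions can net to almost nothing while the gross amount at risk keeps growing.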

The link here excerpts the full filing, but the real meat wasn't included:

> 16... Because the cumulative quantity function had been moved, this server continuously sent child orders, in rapid sequence, for each incoming parent order without regard to the number of share executions Knight had already received from trading centers... 17. The consequences of the failures were substantial. For the 212 incoming parent orders that were processed by the defective Power Peg code, SMARS sent millions of child orders, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. Knight inadvertently assumed an approximately $3.5 billion net long position in 80 stocks and an approximately $3.15 billion net short position in 74 stocks. Ultimately, Knight realized a $460 million loss on these positions.

Knight did $1.4 billion in revenue per year, but there was no safeguard against adopting $7 billion worth of positions in 45 minutes. The end result was that they lost 4 years of profits in under an hour. That's crazy, and not at all standard.
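For scale, here is the back-of-the-envelope arithmetic on the quoted figures. It assumes exactly 45 minutes (the filing says "approximately"), and the $1.4 billion revenue number is the one given in this comment, not something from the filing excerpt.

```python
# Figures quoted from the SEC excerpt above.
net_long = 3.5e9        # dollars, across 80 stocks
net_short = 3.15e9      # dollars, across 74 stocks
realized_loss = 460e6   # dollars
duration_s = 45 * 60    # assumes exactly 45 minutes

# Revenue figure as stated in the comment, not in the filing excerpt.
annual_revenue = 1.4e9

gross_positions = net_long + net_short
loss_per_second = realized_loss / duration_s
exposure_vs_revenue = gross_positions / annual_revenue

print(f"gross positions:    ${gross_positions / 1e9:.2f} billion")
print(f"loss per second:    ${loss_per_second:,.0f}")
print(f"exposure / revenue: {exposure_vs_revenue:.2f}x")
```

With these inputs the loss comes to roughly $170k per second, close to the $172k in the headline, and the combined long and short positions are about $6.65 billion, or around 4.75 times the stated annual revenue.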


You live by the AI stock trader, you die by the AI stock trader.


This may be the majority of the industry, true. But some devs filter out employers to not feel miserable. Whenever I find an employer like this, either I try to change the status quo, or I leave.


Changing the status quo is kinda fun in these environments. Low hanging fruit, easy work. As long as your head isn't on the chopping block.

And everyone benefits. You, the employer, your colleagues, the fresh grads with starry eyes


It can be rewarding, but there are a couple caveats.

First, management has to be willing to assign you to cleanup tasks, rather than have you work on new features.

Second, cleaning up spaghetti code safely and turning it into something maintainable is often not easy -- it wasn't written to be maintained! No tests, missing documentation, ill-considered interfaces...

It isn't enough to keep your head off the chopping block: management has to be fully on board.


Yeah in my case I've got my manager and architect on board and they love it. My more senior teammates might be threatened though.

I found an impending any-day-now production down scenario last week and got it fixed while learning tremendously and fixing some incidentally related things. It feels amazing.

No tests, no docs, no comments, total spaghetti, global variables everywhere, no logs, and I am loving it. What a job.

Upper upper management is starting to keep tabs on the "productive" output of my team though. We'll see.


'Wanna make some enemies? Try to change something'


This. Don't expect things to change if you happily stay in a workplace where things are done in a half-assed way. There's an endless supply of jobs for software people nowadays.


> We work in an incredibly immature industry

Whether or not that's true, I don't know if it's the right framing for the problem. Think about how many new cars or new airplanes or new bridges are designed every year. (One or two per large firm? Less? None by any small firms?) Then think about how many new web services are designed every year. (Several for a small firm, dozens for a large firm?)

If "maturing" the software design process means 10x or 100x the cost and time that a project takes today, that isn't going to fly in the market. It might make more sense to point out the discrepancies between the typical software project (low cost for mistakes, high tolerance for downtime) and the atypical finance/medicine/aerospace software project (don't screw up or people die). Maybe the specific part of the industry that needs to mature is the _awareness_ of when you cross that boundary, especially within a single organization. The folks working on the internal HR system and the folks working on the high-frequency trading system need to use different procedures (and different levels of funding and different management expectations), and if they're the same folks on different days that might be very far from obvious.


+1 on this... there's a huge difference in code that may be used by a dozen users to pick where to have lunch and HFT systems that handle millions of dollars worth of capital investment a day. There also is a correlated high burnout.

About a decade ago, I was a contractor in a security services team at a large bank. There were systems in place to minimize the impact of our software, which handled requests for access, and the actual access granted. Our software was developed like most, or worse in the beginning. In the end, by separating workflows and agreeing on interfaces, a lot of issues were prevented altogether.

Other systems should be designed with similar safeguards and separate certain control flows behind well thought out APIs.

Others still should be created much closer to classic waterfall, with adjustments meaning a stop-work, re-evaluate, update design, and proceed process in place.

It should really depend on the situation.


We do intensive root cause analysis on every serious defect that leaks into production. Look into the "5 why's" technique. Usually there's something to learn in terms of a process or training improvement.

If the analysis shows the defect could have reasonably been caught during the code review phase then management sometimes has to recalibrate the reviewers. It's also important to select the right reviewers based on the scope and risk of the change. For the largest, most complex changes we sometimes pull in up to 6 reviewers including members of other teams and high level architects.


You guys think this is bad?

In finance, I've seen people compute deals worth billions of dollars using excel spreadsheets and a team of MBAs.


There’s nothing explicitly wrong with this. MBA’s aren’t intrinsically stupid, and excel is a reasonable tool for calculations. It’s just a pure functional language using geometric space instead of a namespace.


The team of MBA's can probably help avoid 172k$/second types of mistakes no?


Yep - all the focus on the engineering specifics here seems to miss the sheer magnitude of the thing. Knight risked 4x their annual revenue in under an hour! You can get away with some incredibly crappy systems if you have final-line safeguards like "let's not risk all the money we've ever seen and more without asking a human".

The team of MBAs might screw up an Excel formula and lose money on a deal, but presumably if the math comes out to "let's pay $500 billion for Yahoo", somebody's going to sanity-check that before they transfer the money.


I think companies have on occasion lost billions due to a glitch in a contract, sometimes due to delegating work to a novice who didn't fully understand it. And a contract is not totally unlike a program.


Most business deals are done in part based on results from excel spreadsheets. What other tool would you expect to be used?


This must be ignorance speaking, but what is the issue with that?


As a tool for rough approximations, spreadsheets are amazing. Beyond that though:

* Lack of unit (or any) testing

* Lack of good versioning practices and code review (e.g. emailing around a sheet that's been doing the rounds for 15 years in various different guises and formats and has who knows what horrors lurking)

* Lack of typing (e.g. doing a SUM across a dataset consisting of "3", $3, 3, "III", an emoji of the number 3, a jpeg screenshot of the number 3 - might not add up to 18)

* Lack of precision (rounding errors)
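
To make the typing and precision points concrete, here is a small Python sketch (not a claim about how any particular spreadsheet behaves). `strict_sum` is a made-up helper that refuses anything that isn't an integer or a `Decimal`, and the last two lines show binary floating point drifting where decimal arithmetic doesn't.

```python
from decimal import Decimal


def strict_sum(values):
    """Sum values, refusing strings, floats, bools and anything else ambiguous."""
    total = Decimal("0")
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, Decimal)):
            raise TypeError(f"not an exact number: {v!r}")
        total += v
    return total


# Ten binary floats of 0.1 do not sum to exactly 1.0 ...
float_total = sum([0.1] * 10)
# ... while ten Decimal("0.1") values do.
decimal_total = strict_sum([Decimal("0.1")] * 10)
```

With this, `strict_sum([3, "3"])` raises instead of quietly producing some number.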


> It's like the author doesn't realize that 99% of software development and deployment is done like this, or much much worse.

Not sure if you are talking about the author of the SEC document (the “bug report”) or the blogger, but in either case, what is okay in 99% of software development may be quite inadequate for critical software used in tightly regulated industries. Context matters.

> Managers care about speed of implementation, not quality.

Managers care about speed of implementation not quality insofar as quality is often hard to unambiguously measure and hard to assess impact on the bottom line (not just quantity, but whether there is really any impact.)

This is a fairly dramatic example of the impact being made concrete.


> we rolled out mandatory code reviews on all changes. Now we have thousands of rubber-stamped "looks good to me" code reviews without any remarks

with this practice I have gotten a better habit of looking at the code changes and having a mental image of what things do

this coupled with annotation in the IDE helps me really keep up with the code

only occasionally do I get an actual WTF that prompts me to leave any remark or not approve the change

I'm still better informed about the code change and implementation than not


> We work in an incredibly immature industry.

I'd add that it's the incredibly immature economic system, paired with an even more immature (new) industry, that leads to this kind of failure.

I feel like this is a symptom of tech in capitalism. Where the goal is to maximize profits and minimize effort, rather than doing the job correctly. Fitting that this would befall a high freq trading firm.


> I'm amused by the tone.

It reads to me like standard RCA (Root Cause Analysis) language/tone.


Every industry is immature. We just see it more clearly because we are in it.


This is true for the bottom ~80% of the industry. The top ~20% of tech companies actually care about quality (as well as speed of implementation). You cannot deliver a shitty solution quickly and call it a day.


> bottom ~80% of the industry. The top ~20%

How are you defining "top" and "bottom" here? I can agree with the 80/20 split, but I don't think it necessarily tracks with, say, name recognition or market cap. Amazon, for example, utterly failed on their last Prime day, and from the discussions of those Amazonians I know, it was pretty much inevitable.

The REASON these fast-and-loose habits are habits is because, when they work, they make money. You can't look at a lottery and say the winners "did it right" and the losers didn't - the winners just haven't failed yet. Those companies that AREN'T trying to grow at ridiculous speeds but ARE trying to maintain quality (and are fortunate enough to be in a market niche that supports that) are the ones most likely to have bulletproof reliability. Those won't be the "top" companies by most people's definitions.


>> Amazon, for example, utterly failed on their last Prime day, and from the discussions of those Amazonians I know, it was pretty much inevitable

I am not familiar with what happened. I am also ex-Amzn. I think there is also another dimension to this: scale. With Amazon's scale there are quite different challenges than in a smaller company. There are solutions that could not possibly scale to their needs.

Fast-and-loose was largely introduced when Google and Facebook entered the scene. I remember that Google did a survey and most of the users were ok with some breakage if they got the newest features ASAP. Many people concluded based on this that all software development is like that. Ironically Facebook lately adopted certain technologies to reduce the number of bugs in their frontend code (ReasonML). I think there is a large distance between bulletproof and feck-all reliability. Top companies are closer to the bulletproof range of the spectrum while bottom companies are closer to the feck-all end.


It's easy to rank companies from top to bottom. Whether there is source control, automated build, tests, reproducible scripted deployment, code reviews, test environments, bug tracker, logs, monitoring of hardware usage, analytics, backup.

Each of these is strongly correlated to quality.


> The top ~20% of tech companies actually care about quality

Most companies are not tech companies, fyi. So you're talking about the top 20% of a very small subset of business.


discussed previously at:

https://news.ycombinator.com/item?id=6589508

I remember the week after this. Everyone I knew who worked at a fund was going over their code and also updating their Compliance documents covering testing and deployment of automated code.

As a side note, one of the biggest ways funds tend to get in trouble with their regulators is to not follow the steps outlined in their compliance manual. It's been my experience that regulators care more that you follow the steps in your manual than whether those steps are necessarily the best way to do something.

I came away from this thinking the worst part of this was that their system did send them errors; it's just that when you deal with billions of events, emailed errors just tend to get ignored, as at that scale you generate so many false positives with logging.

I still don't know the best way to monitor and alert users for large distributed systems.

The other take away was that this wasn't just a software issue but a deployment issue as well. It wasn't just one root cause but a number of issues that built up to cause the issue.

1) New exchange feature going live so this is the first day you are actually running live with this feature

2) old code left in the system long after it was done being used

3) re-purposed command flag that used to call the old code, but now is used in the new code

4) only a partial deployment leaving both old and new code working together.

5) inability to quickly diagnose where the problem was

6) you are also managing client orders and have the equivalent of an SLA with them so you don't want to go nuclear and shut down everything


> I came away from this thinking the worst part of this was that their system did send them errors; it's just that when you deal with billions of events, emailed errors just tend to get ignored, as at that scale you generate so many false positives with logging.

I write apps that generate lots of logs too...I think an improvement lies in some form of automated algorithmic/machine learning (to incorporate a buzzword in your pitch) log analysis.

When I page through the log in a text editor, or watch `tail` if it's live, there's a lot of stuff that looks like

    TRACE: 2019-04-01 09:45:03 ID A1D65F19: Request 1234 initiated
    ERROR: 2019-04-01 09:45:04 ID A1D65F19: NumberFormatException: '' is not a valid number in ProfileParser, line 127
    WARN : 2019-04-01 09:45:04 ID A1D65F19: Profile incomplete, default values used
    WARN : 2019-04-01 09:45:14 ID A1D65F19: Timeout: Service did not respond within 10 seconds
    TRACE: 2019-04-01 09:45:14 ID A1D65F19: Request 1234 completed. A = 187263, B = 1.8423, C = $-85.12, T = 11.15s

Visually (or through regex), you can filter out all the "Request initiated" noise. Maybe the default value warning occurs 10% of the time, and is usually accompanied by that number format exception (which somebody should address, but it still functions, and there's other stuff to fix). But maybe the "Timeout error" hasn't been seen in weeks, and the value of C has always been positive - that is useful information!

Don't email me when there's a profile incomplete warning. Don't email me any time there's an "ERROR" entry, because that just makes people reluctant to use error level logging. Definitely don't email me when there's a unique request complete string, that's trivially different every time. But do let me know when something weird is going on!
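
One low-tech way to approximate "let me know when something weird is going on", sketched in Python with no machine learning at all: reduce each log line to its shape by masking IDs and numbers, count how often each shape has appeared before, and surface only the shapes that are new or rare. The function names, the 8-hex-digit ID pattern and the `min_seen` threshold are all assumptions based on the sample log above. Minus signs are deliberately left unmasked so that a negative value shows up as a different shape.

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse a log line to its shape: mask request IDs and unsigned numbers."""
    line = re.sub(r"\b[0-9A-F]{8}\b", "<ID>", line)   # assumed 8-hex-digit IDs
    line = re.sub(r"\d+(\.\d+)?", "<N>", line)        # sign is kept on purpose
    return line


def unusual(history, new_lines, min_seen=1):
    """Return the new lines whose shape appeared fewer than min_seen times before."""
    seen = Counter(template(l) for l in history)
    return [l for l in new_lines if seen[template(l)] < min_seen]
```

Raising `min_seen` turns "never seen" into "rarely seen", which is closer to the "hasn't been seen in weeks" case if the history covers a long enough window.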


I once worked in an environment like this that saw about 1k emails a day when I started... What I started doing was trying to resolve one bug a day... I'd look at an error, do a search in my email filtered folder for that error, and manually count them; whichever had the most emails, I took care of it.

About 2/3 just came down to filtering certain classes of errors. 4xx errors, I stopped email notification altogether, since they were already being trapped/handled by the system. Others were a little more specific. Ironically, .Net tends to handle some things that should be 4xx errors as 500, so reclassifying those took out a lot as well.

In the end, within about a month, the emails were down to a manageable 20 or so a day and got more visibility as a result.


Don’t mean to sound snarky, but there are tools that do this and have been for years. If you’ve been grepping through logs for the last 3 years, you’re doing it wrong for the cloud era.

Often times the answer is writing better alert triggers that take historical activity into account to cut down on false positives. Other times it’s simply to reduce the number of alerts. In every case you need an alerting strategy that balances stakeholder needs, and you need to realign on that strategy quarterly. It’s ultimately an operational problem, not a technical one.

Alas, back in the real world, logging is always the last thing teams have time to think about...


Care to share what types of tools do this? I'm genuinely interested. I haven't come across a log management tool that uses AI to detect abnormal conditions based on the log message contents like the OP describes. I stick to Papertrail for the most part though so I'm likely out of the loop.


I’ve used and really liked DataDog in the past. It has some rudimentary ML functionality for anomaly detection of certain fields, but it’s only getting better.

I’ve also had clients in the past use Splunk with ML forecasting models that inject fields as part of the ingest pipeline. I don’t know the details of that implementation; I just know how the dev teams were using it.


My company uses Sentry.io, which doesn't have AI stuff but does have good tells for separating "normal" errors from "unusual" errors.


There's nothing particularly wrong with e-mailing (or using other push notification mechanisms) when a warning or error occurs. When those logging levels are properly used it means some sort of administrator intervention is needed. But send the notification only once. Don't keep spamming out the same notification over and over.
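
A minimal Python sketch of "send it once, don't keep spamming". The `OnceNotifier` class, the one-hour cooldown and the idea of keying alerts by a string are all assumptions; `send` stands in for whatever actually delivers the e-mail or push message.

```python
import time


class OnceNotifier:
    """Forward an alert at most once per key within a cooldown window."""

    def __init__(self, send, cooldown_s=3600.0, clock=time.monotonic):
        self._send = send            # callable that delivers the message
        self._cooldown = cooldown_s  # assumed default: one hour
        self._clock = clock          # injectable so it can be tested
        self._last_sent = {}

    def notify(self, key: str, message: str) -> bool:
        """Send unless the same key was already sent recently. Returns True if sent."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_sent[key] = now
        self._send(message)
        return True
```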


Honestly, it sounds like there just needs to be some higher discipline here around what the different log levels mean. ERROR should only be used if your system failed to service a request. Any other errors are WARN. Then you tell the system to notify you on any ERROR log, because that means that requests are failing to be processed.

In your example, the `NumberFormatException` is a bad ERROR entry, because it's covered by a WARN entry right below it. Meaning that the request was not failed -- it entered a default value instead of the bad parsing. So that exception should also be at WARN level.

(Arguably, overwriting input values with defaults because of a parsing error is probably a bad idea and should be an ERROR due to rejecting the request. But I'm rolling with what we got here.)


> I still don't know the best way to monitor and alert users for large distributed systems.

I'd imagine that this is somewhere statistical process control would apply: if you're dealing with a system in which errors are expected, then monitor frequency & magnitude, and alert when they fall outside of one or two standard deviations from the mean.

For a financial company, you'd probably want a graph of net worth or somesuch, and alert when it falls outside of one standard deviation. If you don't have the IT to calculate net worth to at least hourly granularity, then get there, and aim for to-the-minute granularity. This shouldn't be hard, but it might be, and if it is then it's worth fixing.
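
The control-limit idea fits in a few lines of Python. This is only a sketch: `out_of_control` is a made-up name, the history is assumed to be a list of at least two recent readings (error counts per minute, net worth per hour, whatever you track), and `k` is the number of standard deviations you tolerate.

```python
from statistics import mean, stdev


def out_of_control(history, value, k=2.0):
    """True if value lies more than k sample standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)   # needs at least two data points
    return abs(value - mu) > k * sigma
```

Real process control usually adds rules about runs and trends on top of this, but even the plain version catches "this number has never looked like that before".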


Not only did they leave old code in the system, but they made changes to it without reviewing whether the old code would still work correctly, despite still being callable:

> In 2003, Knight ceased using the Power Peg functionality. In 2005, Knight moved the tracking of cumulative shares function in the Power Peg code to an earlier point in the SMARS code sequence. Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called.

If you want to leave the old code in, fine, but then it still needs to be tested.


Reading this I was considering point #6 and shutting down the system. If the proper alerts had shown a system improperly taking on massive positions, and the related risk in dollar terms, wouldn't shutting down be a better route than bankruptcy? In retrospect, would (could?) they have chosen that?

Of course, violating those SLA's could cause bankruptcy through client dissatisfaction but that seems less certain than bleeding out money.


> It's been my experience that regulators care more that you follow the steps in your manual than whether those steps are necessarily the best way to do something.

Not a joke. Bitfinex lost its Bitcoins by being compliant (regulators required that users wallets are displayed on the blockchain). It'd have been safer to just have a single cold wallet.


Deployment is where the really scary bugs can happen the easiest.

I worked on warehouse management software that ran on mobile barcode scanners; each warehouse worker had one as he moved stuff around the warehouse and confirmed each step with the system by reading barcodes on shelves and products.

We had a test mode, running on a test database, and production mode, running on the production database, and you could switch between them in a menu during the startup.

During testing/training users were running on the test database, then we intended to switch the devices to production mode permanently, so that the startup menu wouldn't show.

A few devices weren't switched for some reason (I suspect they were lost when we did the switch and found later), and on these devices the startup menu remained active.

Users were randomly taking devices each day in the morning, and most of them knew to choose "production" when the menu was showing. Some didn't, and were choosing the first option instead.

We started getting some small inaccuracies on the production database. People were directed by the system to take 100 units of X from the shelf Y, but there were only 90 units there. We looked at the logs on the (production) database, and on the application server, but everything looked fine.

We were suspecting someone might just be stealing, but later we found examples where there was more stuff in reality on some shelves than in the system.

At that time we introduced a big change to pathfinding, and we thought the system was directing users to put products in the wrong places. Mostly we were trying to confirm that this was the cause of the bugs.

Finally we found the reason by deploying a change to the thin client software running on the mobile devices to gather log files from all the mobile devices and send to server.


I bet you had one engineer who claimed that the real problem was that the users were stupid and not that the deployment process was error prone.

I've heard about this case many times before but somehow in the other renditions they downplayed or neglected to mention that the deployments were manual. As this story was first explained to me, one of the servers was not getting updated code, but I was convinced by the wording that it was a configuration problem with the deployment logic.

Performing the same action X times doesn't scale. Somewhere north of 5 you start losing count. Especially if something happens and you restart the process (did I do #4 again? or am I recalling when I did it an hour ago?)


Was the deployment process of your parent post actually error-prone? From what I gathered, the developers were unaware of the lost handheld scanners. I imagine if they did they could've proactively put them out-of-service until found.


We had automatic updates in the thin clients (that's how we were able to add "logging to server" on all of them at once).

The problem was - the startup menu with testing/production choice was enabled independently of the autoupdate mechanism (separate configuration file ignored by autoupdates) for some technical reason (I think to allow a few people to test new processes while most of the warehouse works on the old version on production database).


My company's legacy system (which still does most revenue producing work) has deployment problems like this. The deployment is fully automated, but if it fails on a server it fails silently.

I rarely work on this system, but had to make an emergency change last summer. We deployed the change at around 10 pm. A number of our tests failed in a really strange way. It took several hours to determine that one of the 48 servers had the old version still. Its disk was full, so the new version rollout failed. The deployment pipeline happily reported all was well.

We got lucky in that our tests happened to land on the affected server. The results of this making it past the validation process would have been catastrophic. Not as catastrophic as this case I hope, but it'd be bad.

We made a couple human process changes, like telling the sysadmins to not ignore full disk warnings anymore (sigh). We also fixed the rollout script to actually report failures, but I don't actually trust it still.
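
For what it's worth, a post-deploy check that doesn't rely on the rollout script's own opinion can be very small. A Python sketch, where `get_version` is a hypothetical callable that asks one server which version it is actually running (an HTTP endpoint, an ssh command, whatever you have):

```python
def verify_rollout(servers, expected, get_version):
    """Raise if any server is not running the expected version or can't be asked."""
    stale = {}
    for server in servers:
        try:
            version = get_version(server)
        except Exception as exc:   # an unreachable server counts as a failure
            version = f"unreachable: {exc}"
        if version != expected:
            stale[server] = version
    if stale:
        details = ", ".join(f"{s} -> {v}" for s, v in sorted(stale.items()))
        raise RuntimeError(f"rollout incomplete, expected {expected}: {details}")
```

The important property is that it fails loudly and names the servers, instead of reporting success because nothing crashed.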


Ignoring a disk-full condition! Really.

Handling an out of space condition should be part of your test suite - it certainly was back when I looked after a Map reduce based Billing system at BT and that was back in the day when a cluster of 17 systems was a really big thing.


IMO that's too late of a condition check. You should keep a margin of free space for OS/Background applications just so you have enough life left in the application to detect and log it.
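
A sketch of that margin check in Python, using the standard library's `shutil.disk_usage`. The `has_headroom` name and the threshold are assumptions; pick a margin that leaves room for logs and whatever else shares the volume.

```python
import shutil


def has_headroom(path, min_free_bytes, usage=shutil.disk_usage):
    """True if the filesystem holding path has at least min_free_bytes free."""
    return usage(path).free >= min_free_bytes
```

A deploy script could call something like `has_headroom("/opt/app", 5 * 1024**3)` (path and size made up) before copying anything and abort early if it returns False.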


In this case, the application itself does not use the disk in question, so the old version kept chugging along obliviously.

We actually did have monitoring on the capacity of the disk in question. We discovered during the analysis that it had been alerting the responsible team all day. They had just been ignoring it.


I think the parent did well openly and honestly raising a personal example where missing a "basic" check caused near career changing problems - I applaud them for sharing a difficult situation.

I was concerned that it's possible to read your comment as if it was critical of the parent - was that your intention?


To clarify a little, I was neither responsible for monitoring the underlying hardware or the deployment systems in this case. I also didn't have access to fix it myself. It took me a couple hours to go from "random weird test results" to "full disk broke the deploy".


It was more a criticism of the OP's sysads -

Though for the system I mentioned one of the reasons they employed me as a developer was that I was also a sysad on PR1MES - an early example of devops maybe.


> Knight did not design these types of messages to be system alerts, and Knight personnel generally did not review them when they were received

So they received these 90 minutes before they were executed, and as it so happens in many organizations, automated emails fly back and forth without anyone paying attention.

Also.. running new trading code and NOT having someone looking at it LIVE at kick-off is simply irresponsible and reckless.


I bring up this story every time someone talks about trying to do something dumb with feature toggles.

(Except I had remembered them losing $250M, not $465M, yeow)

The sad thing about this is if the engineering team had insisted on removing the old feature toggle first, deploying that code and letting it settle, and only then started work on the new toggle, they may well have noticed the problem prior to turning on the flag, and it certainly would have been the case that rolling back would not have caused the catastrophic failure they saw.

Basically they were running with scissors. When I say 'no' in this sort of situation I almost always get pushback, but I also can find at least a couple people who are as insistent as I am. It's okay for your boss to be disappointed sometimes. That's always going to happen (they're always going to test boundaries to see if the team is really producing as much as they can). It's better to have disappointed bosses than ones that don't trust you.


I had a chance to get familiar with deployment procedures at Knight two years after the incident. And let me tell you, they were still atrocious. It's no surprise this thing happened. In fact, what's more surprising is that it didn't happen again and again (or perhaps it did, but not on such a large scale).

Anyway, this is what the deployment looked like two years after:

* All configuration files for all production deployments were located in a single directory on an NFS mount. Literally, countless .ini files for hundreds of production systems in a single directory without any subdirectories (or any other structure) allowed. The .ini files themselves were huge, as typically happens in a complex system.

* The deployment config directory was called 'today'. Yesterday's deployment snapshot was called 'yesterday'. This is as much revision control as they had.

* In order to change your system configuration, you'd be given write access to the 'today' directory. So naturally, you could end up wiping out all other configuration files with a single erroneous command. Stressful enough? This is not all.

* Reviewing config changes was hardly possible. You had to write a description of what you changed, but I've never seen anybody attach an actual diff of changes. Say you changed 10 files: in the absence of a VCS, manually diff'ing 10 files wasn't something anybody wanted to do.

* The deployment of binaries was also manual. Binaries were on the NFS mount as well. So theoretically, you could replace your single binary and all production servers would pick it up the next day. In practice though, you'd have multiple versions of your binary, and production servers would use different versions for one reason or another. In order to update all production servers, you'd need to check which version each of the servers uses and update that version of the binary.

* There wasn't anything to ensure that changes to configs and binaries are done at the same time in an atomic manner. Nothing to check if the binary uses the correct config. No config or binary version checks, no hash checks, nothing.

Now, count how many ways you can screw up. This is clearly an engineering failure. You cannot put more people or more process over this broken system to make it more reliable. On the upside, I learned more about reliable deployment and configuration by analyzing shortcomings of this system than I ever wanted to know.
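
To illustrate the missing "hash checks" item from the list above: one simple approach is to record the digests of the binary and its config together at release time, and refuse to start if what is on disk doesn't match that pair. This Python sketch is generic and hypothetical (`build_manifest` and `check_release` are invented names); it says nothing about how Knight's systems were actually laid out.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(binary: bytes, config: bytes) -> dict:
    """Record, at release time, which binary and config belong together."""
    return {"binary_sha256": digest(binary), "config_sha256": digest(config)}


def check_release(manifest: dict, binary: bytes, config: bytes) -> None:
    """Raise ValueError if either file differs from the released pair."""
    if digest(binary) != manifest["binary_sha256"]:
        raise ValueError("binary does not match the release manifest")
    if digest(config) != manifest["config_sha256"]:
        raise ValueError("config does not match the release manifest")
```

Running `check_release` at process startup turns "old binary with new config" from a silent state into a refusal to start.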


I realize that the consensus is that lots of companies do this kinda thing. I don't know if it's 99% - but the percentage is pretty high.

However, what gets neglected is the risk associated with a catastrophic software error. If you are, say, Instagram and you lose an uploaded image of what someone ate for lunch, that is undesirable and inconvenient. The consequences of that risk, should it come to fruition, are relatively low.

On the other hand, if you employ software developers that are literally the lifeblood of your business for automatic trading, you'd think that a company like that would understand the value of treating this "cost-center" as a critical asset rather than just a commodity.

Unfortunately you would be wrong. Nearly every developer I have ever met that has worked for a trading firm has told me that the general attitude is to treat nearly all employees who are not generating revenue as a disposable commodity. It's not just developers but also research, governance, secretarial, customer service, etc. This is a bit of a broad brush, but generally the principals and traders of those aforementioned firms are arrogant and greedy and cut corners whenever possible.

In this case you'd think these people would be rational enough to know that cutting corners on your IT staff could be catastrophic. This is where you would be wrong. Friends who have worked at small/mid sized financial firms have told me those firms generally treat their staff like garbage and routinely push out people who want decent raises/bonuses, etc. These people are generally greedy and also egocentric and egomaniacal, and they believe all their employees are leeching from their yearly bonus directly.

This story is not a surprise to me in the least. What's shocking is no one in the finance industry has learned anything. Instead of looking at this story as a warning, most of the finance people hear this story and laugh at how stupid everyone else is and that this would never happen to them personally because they're so much smarter than everyone else.


>>> Instead of looking at this story as a warning, most of the finance people hear this story and laugh at how stupid everyone else is and that this would never happen to them personally because they're so much smarter than everyone else.

What if we're smarter than everyone else? When I was in big bank, we had mandatory source control, lint, unit tests, code coverage, code review, automated deployment, etc... pretty good tools actually. Not everybody is stuck in the stone age.

Even in a small trading company before that, we had most of the tooling although not as polished. Very small company with a billion dollars a month in executed trade. One could say amateur scale.


Big bank is not the same as a small/mid sized trading firm. Banks have regulations they need to meet, and typically do things by the book.

I'm not an expert here. Part of what I said is based on the 6 different people I've met who have worked in the industry. I'm just saying if you have $400+ million to lose and you rely on the IT infrastructure that allows you to make that money, then you can spend a few million on top notch people and processes to prevent this kind of thing. I worked at a relatively large media company, and every deployment had a go/no-go meeting where knowledgeable professionals asked probing questions and you defended your decisions. I'd love to know what they did at Knight Capital. The idea of re-using an existing variable for code that was out of use strikes me as a terrible idea.


What baffles me is how they got this far into operations with such dreadful practices. 100-200k could have got them a really solid CI pipeline with rollbacks, monitoring, testing etc.

But spend 200,000 on managing 460,000,000? No way!


How would CI help in this case? It isn't even a software bug, it's a process issue - they had old code running on one out of 8 servers. The monitoring was triggered, but no action was taken.


I disagree that it isn't a software bug.

"The new RLP code also repurposed a flag" - this is the moment when terrible software development idea was executed that resulted in all of the mess.

Of course I don't know the full context and maybe, just maybe there was a really solid reason to reuse a flag on anything.

What I observe more often is something like this though:

  1. We need a flag to change behaviour of X for use case A, let's introduce enable_a flag.
  2. We want similar behaviour change of X also for use case B, let's use the enable_a flag despite the fact the name is not a good fit now.
  3. Turns out use case B needs to be a bit different so let's introduce enable_b flag but not change the previous code so basically we need them both true to handle use case B.
  4. Turns out for use case A we need to do something more but things should stay the same for B.
  5. At this point no one really knows what enable_a and enable_b really do. Hopefully at least someone noticed that enable_a affects use case B.

If you have a use case A, create a handle_a flag. If you have a use case B, create a handle_b flag even if they do exactly the same thing, as more than likely they do exactly the same thing only for now.

What would probably be even better is separate, properly named config flags for each little behaviour change and just use all 5 of those to handle different use cases.
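
A toy Python version of the one-flag-per-use-case rule, just to show the shape. `Flags`, `new_behaviour` and `process` are placeholders; the point is that B has its own switch even though it currently does the same thing as A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    handle_a: bool = False   # controls use case A only
    handle_b: bool = False   # controls use case B only, even if it matches A today


def new_behaviour(x: int) -> int:
    """Placeholder for whatever the flag switches on."""
    return x + 1


def process(x: int, use_case: str, flags: Flags) -> int:
    if use_case == "A" and flags.handle_a:
        return new_behaviour(x)
    if use_case == "B" and flags.handle_b:
        return new_behaviour(x)   # same for now; free to diverge without touching A
    return x
```

Turning on `handle_a` never changes what happens for B, so step 5 in the list above can't sneak up on you.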

edit: formatting


I interpret "repurposed a flag" to be re-using a bit in a bitfield in a binary RPC.

This is a little inevitable when working with (internal) binary protocols. You have some bits that used to be used as one thing, haven't been used for that thing in awhile, and it can be very tempting to just repurpose those old bits for new tricks.

In that case I'd call the sin deprecating and reusing in a single step. If you have to change the meaning of bits in a binary protocol, you should deprecate the old meaning, wait several release cycles, and then repurpose it after you're very, very confident that there are no remaining clients or servers lurking around production with the old meaning, and that you won't need to roll back production to the old state ever again.
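
In code, the "deprecate first, reuse much later" step can be as simple as marking the old bit as retired and rejecting any message that still sets it. A Python sketch with an invented bit layout (none of these constants come from a real protocol):

```python
URGENT = 0x01
RETIRED_LEGACY_MODE = 0x02   # hypothetical old meaning: retired, not yet reusable
NEW_FEATURE = 0x04
KNOWN = URGENT | NEW_FEATURE


def decode(raw: int) -> dict:
    """Decode a flag byte, refusing retired or unknown bits instead of guessing."""
    if raw & RETIRED_LEGACY_MODE:
        raise ValueError("message sets a retired flag bit; sender is out of date")
    if raw & ~KNOWN:
        raise ValueError(f"unknown flag bits set: {raw & ~KNOWN:#x}")
    return {"urgent": bool(raw & URGENT), "new_feature": bool(raw & NEW_FEATURE)}
```

Only after several releases of seeing zero rejections for the retired bit would you give it a new meaning.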

Roll-out failures like this happen all the time; you have to roll-out new changes almost assuming they will happen.


The urge to use the same flag for both B and A is probably motivated either by a desire to clean up the code by not using redundant flags, or perhaps by a forward-thinking (in theory, well intentioned) design to avoid a separate flag B in the future.

Either way, it shows some attempt at longer-term thinking undone by short term implementation done improperly. That’s a microcosm of this story as a whole which gives this incident a fractal quality.


> If you have an use case A, create a handle_a flag. If you have a use case B create handle_b flag even if they do exactly the same thing as more than likely they do exactly the same thing only for now.

A hard lesson to learn, and a hard rule to push for with others who have not yet learned.

Imagine what our species could do if experience were directly and easily transferable...


Hah exactly :)

Same goes for functions, classes, React components, DB tables and everything else.

Just model it as close as possible to the real world. The world doesn't really change that often. What does is how we interpret and behave within it (logic/behaviour/appearance on top).

If you have a Label and Subheader in your app, create separate components for them. It doesn't matter that they look exactly the same now. Those are two separate things and I guarantee you more likely than not at some point they will differ.

My rule of thumb is: If it's something I can somehow name as an entity (especially in product and not tech talk) it deserves to be its own entity.


It's funny though, because my experience has led me to the exact opposite approach. Modeling based on real world understanding has been very fragile and error prone, and instead modeling as data and systems that operate on that data has been very fruitful.


Maybe I was using the wrong terminology, but when I said "CI" I meant that to include automated, atomic deployments.

In the last 16 years I've worked in software the last 10 or so of that has not included manually copying files to production


Those must be new businesses / startups. The world is full of companies more than 20 years old.


The bug report describes how they reuse a flag for new code that triggered old code.

"CI" would generally help keep things tidy and transparent. It would be easier to find out what branches or flags or parameters already exist, which ones are not relevant any more; someone could rename them with a prefix "OLD_" or otherwise clean up, so that you typically pick from a list of options instead of setting a flag by accidentally copying it from one script to the other.

It is also easier to have error visualization plugins, or to look around previous deployments or test runs and get a pretty good idea if any new errors have shown up, even if you haven't analyzed the old ones.


One of the tests in my CI pipeline is verifying that my new code works with previously released code. It gets painful when these tests fail, but it is a valid configuration so I have to work around bugs in the old code from time to time.


Firms like this have spent decades driven by a short-term, "get it done now, future entropy be damned" mentality, partially because market opportunities are fleeting and partially because the principals usually come from a trading background and have very little respect or understanding for technology. The norm is gigantic systems accruing decades of technical debt, very large swaths of dead code and a refusal (and sometimes plain inability) to address software engineering issues. The firms that survive are the ones that realize (and staff accordingly) they are tech firms that happen to trade rather than the other way around.


One problem is that business-focused people seem to think of tech debt as an easily determined quantity, when it behaves more like a complex derivative than a simple financial instrument like a loan.


That's assuming that business-focused people have any interest in determining the amount of technical debt at all. Generally they're content to let other teams pay the debt, and many of them are quite happy to complain when it doesn't get paid fast enough.


"You're not taking that $200K out of my budget."

While something may be a rounding error to the company, you can't get someone at the correct political level where it's a rounding error to pay attention.


Makes me feel less bad about rm -rf 'ing a product database and losing 1 hour of client data, the other week. Maybe I should show them this...


I would argue you shouldn't have been able to do that in your organization without bypassing (several) significant safeguards.

Did you forget a where clause while deleting data on a table, or were you actually on the production server hosting the database?

Any code you write that interacts with a database (or really any production code at all...) should be reviewed before being merged. And developers shouldn't be writing raw SQL commands on a production server. It's hard for me to see this as anything other than an organizational failure rather than your own.

EDIT: Based on the number of downvotes this has received, I can only imagine we have a lot of devs on HN who cowboy SQL in production...holy hell how can any of what I said be controversial.


While I mostly agree, many companies have a tech department of half a dozen people, and implementing and enforcing every good devops practice isn't always realistic.

That said, I'd expect at least a backup of production, then again he said he lost 1 hr of data so it was likely between backups.


If you haven't been able to invest the time to do database maintenance tasks in a safe way, at the very least enforce a 4-eyes principle and write up a checklist / script before hacking away in the production database.

I mean I get it, I've made mistakes like this as well knowing I shouldn't have (we had test and prod running on the same server, about 40K people received a test push notification). But the bigger your product gets, the less you can afford to risk losing data.


I totally agree, if feasible those steps should be done!

I was just trying to explain that many businesses like the one I'm at don't do business in tech (mine sells wholesale clothing), with 6 people in the tech dept, so understandably there are certain limitations on how far best practices can go. While I would usually consider it a mistake, if you thought you were just making a quick, what should be read-only query, and it happens to hit some random edge-case bug and crash a db... Edit, continuing - Sure you should have tested that on the test DB first, but I'd be kinda understanding of how that happened.

Depends on the business too, if you're a startup-tech company then yea, get your -stuff- together! It's just that a lot of businesses only need their website and some order management, their focus isn't on the tech side of things.


Any IT dept with less than 5-10 DBAs will have to throw out the window any Segregation of Duties plans (keeping apart access on Prod-Dev-QA) or separating/dedicating DBAs to the three different environments.

But the backup, hell yeah, you NEED mitigating controls (preventive/corrective) for when you allow people to make changes in Prod that haven't gone through all the testing phases.


That's the problem with DevOps / CI/CD. The DBA team, and separation of duties / least privilege more generally, are seen as old-fashioned impediments to business velocity. The foundations of DevOps are supposed to be trust, tools, and testing, but in my experience once the dev team gets their hands on the tools, that's all she wrote.


To be fair, when you have a tech department of 6 people, who still have to respond to internal support tickets, manage the business's intranet, and continue development of current projects, not to mention only 2 of those 6 have any clue how to set up/administer a database... You can see the issue.

You can't just say "Hire more people" because the current setup is "working" and isn't considered critical to the rest of the business when it isn't tech related.


No I totally agree, looking back on my actions, rm -rf'ing the production database was not a productive use of my time.


I worked in a place for years where we had access to production databases, to do updates, deletes and so on. We had a certain stylized way of doing things to prevent irreversible errors. If you were going to delete data for instance, you always created a backup table first with 100% of the columns for the rows of interest, then selected from it to verify it looked reasonable, and then deleted using the backup table to correlate rows to the production table using the primary key. And then another analyst reviewed your work (originally two levels of review, later reduced to one). The backup tables and comments in a ticket including a copy of the SQL run were kept indefinitely, so even if the review missed something it generally was correctable.
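For anyone who hasn't seen this pattern, here is a minimal sketch of the backup-table routine described above, using Python's built-in `sqlite3`. The `orders` table, its `id` primary key and the function names are all made up for illustration; the original shop's schema and database are not known. Table names and the WHERE text are interpolated into the SQL (identifiers can't be bound as parameters), so they must come from the analyst, never from user input.

```python
import sqlite3


def backup_rows(conn, backup_table, where_sql, params=()):
    """Steps 1-2: copy the rows of interest, all columns, and return them for review."""
    # Create an empty table with the same columns as the (hypothetical) orders table.
    conn.execute(f"CREATE TABLE {backup_table} AS SELECT * FROM orders WHERE 0")
    conn.execute(
        f"INSERT INTO {backup_table} SELECT * FROM orders WHERE {where_sql}", params
    )
    conn.commit()
    # A person eyeballs these rows before anything is deleted.
    return conn.execute(f"SELECT * FROM {backup_table} ORDER BY id").fetchall()


def delete_backed_up_rows(conn, backup_table):
    """Step 3: delete only rows whose primary key appears in the backup table."""
    with conn:  # commits on success, rolls back the DELETE if an error is raised
        cur = conn.execute(
            f"DELETE FROM orders WHERE id IN (SELECT id FROM {backup_table})"
        )
        return cur.rowcount
```

Because the delete is keyed off the backup table, anything removed from `orders` is by construction still sitting in the backup, which is what makes a later fix possible if the review missed something.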

The DBAs in the company did think SQL should be reviewed in advance, but that's not how our department did it. I think it's arguable that, given that reviewers before or after doing something dangerous may miss things, it's better to establish safe practices and if you do that, then you don't really need a review in advance.


SQL fixes in production don't have to be that risky. Just run a SELECT first with the same predicates and be sure that's what you want. In some cases this can be better than trying to develop a fix using a test system which may not have the same data.

Same with "rm -r" --- run an "ls" first.
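As a sketch of the "SELECT first" habit, assuming Python's `sqlite3` and made-up function names: preview the rows with the predicate, then run the DELETE with the very same predicate and roll back if the row count isn't what the preview showed. The table name and WHERE text are interpolated, so they have to be trusted input.

```python
import sqlite3


def preview(conn, table, where_sql, params=()):
    """Run the SELECT first: show exactly which rows the predicate matches."""
    return conn.execute(f"SELECT * FROM {table} WHERE {where_sql}", params).fetchall()


def delete_expected(conn, table, where_sql, params, expected_count):
    """Delete with the same predicate; roll back if the row count is a surprise."""
    with conn:  # commits on success, rolls back if the check below raises
        deleted = conn.execute(
            f"DELETE FROM {table} WHERE {where_sql}", params
        ).rowcount
        if deleted != expected_count:
            raise RuntimeError(f"expected {expected_count} rows, matched {deleted}")
    return deleted
```

The count check is the cheap part that catches the forgotten or mistyped WHERE clause before it is committed.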

Sure it's not ideal but life isn't ideal. When you have a production problem, you need to fix production.


I think you're making a number of assumptions about the size and maturity of the parent's organization; it could be just one person.

With the right tools and processes, it's possible to build very successful companies with tiny engineering teams. Developers run queries against prod because there's nobody else to do so. The risk of mistakes is mitigated by the increased situational awareness and the developer quality and communication in 3-person teams vs 30-person teams.

Neither approach is inherently 'wrong', although running a 30-person team the same way as a 3-person team (or vice versa) can have nasty consequences.


Even cowboys don't herd cattle by themselves...


Loosely related - this is what terrifies me about deploying to cloud services like Google which have no hard limit on monthly spend - if background jobs get stuck in an infinite loop using 100% CPU while I'm away camping, my fledgling business could be bankrupt by the time I get phone signal back.


Woah, how does Google Cloud still not support budget capping?

It has budget alerting, so the capabilities are obviously there, but it's never been added. Instead, there's just a vaguely insulting guide on writing a script to catch the alert and trigger a shutdown...


Pretty sure Google cloud does support it

Pretty sure aws still doesn't


Google App Engine has spending cutoffs. Cloud allows API call cutoffs, but for actual spend it only has alerts. Their 'controlling budget' page sends you to a guide on writing your own triggers to respond to those alerts:

https://cloud.google.com/billing/docs/how-to/budgets
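For a sense of what such a trigger boils down to, here is a very rough Python sketch. It assumes the alert arrives as base64-encoded JSON with `costAmount` and `budgetAmount` fields; those field names are an assumption and should be checked against the guide before relying on them. `shut_down` is a hypothetical callback standing in for whatever actually stops the spending (disabling billing, stopping instances, and so on).

```python
import base64
import json


def over_budget(message_data: str) -> bool:
    """Decode a base64-encoded JSON alert and report whether cost exceeds budget."""
    # ASSUMPTION: the payload carries "costAmount" and "budgetAmount" fields.
    payload = json.loads(base64.b64decode(message_data).decode("utf-8"))
    return float(payload["costAmount"]) > float(payload["budgetAmount"])


def handle_alert(message_data: str, shut_down) -> bool:
    """Call the (hypothetical) shut_down callback when the budget is exceeded."""
    if over_budget(message_data):
        shut_down()
        return True
    return False
```

Note this only reacts after an alert is delivered, so it is a circuit breaker with some delay, not a hard cap.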


You can set maximum spend limits :)


This is one of the classics of the genre. If you're interested in software reliability/failure, you should read some of COMP.RISKS .. and then stop before you get too depressed to continue.


> This is probably the most painful bug report I've ever read

I suggest further reading, starting with Therac-25.


An honest bug report for the recent Boeing fuckup would be even worse. They deployed unconscionably shitty software (MCAS system) that killed a total of 346 people in two perfectly airworthy planes.


Totally unrelated, but the title made me think back to one of my previous roles in the broadcast industry. If you're using a satellite as part of your platform, every second that you aren't transmitting to your birds (satellites), you're losing a massive amount of money. There are always a lot of buffers and redundant circuits in those situations, but things can always go wrong.

Funny tangent- the breakroom at that job was somewhat near the base stations. Some days around lunchtime we'd have transmission interruptions. The root cause ended up being an old noisy microwave.


Needs (2013) tag. As usual, human negligence is to blame.


Is there any case where human negligence is not to blame?


Pompeii?


I mostly was going for software mistakes in the last century or so, not natural disasters before we had reasonable tech to protect ourselves. But even in the case of Pompeii, the residents knew that Vesuvius (likely etymologies include "unquenchable" or "hurler of violence") was an active volcano, so human negligence isn't unreasonable there.

My claim is mostly that "human negligence" is so universal of a root cause that it's meaningless. What form of human negligence happened, and how could it be averted in the future?


Building a city on a known lava field?


It had been "silent" for about 300 years when Pompeii was destroyed. And before that, it had mainly had small series of low-level eruptions. Basically, for much of known history at that time, it was a safe place to live.


I mean, developers are building houses in flood plains as we speak and no one bats an eye.


Just popping in to say I believe the Equifax hack was also due to a 'bad manual deployment' similar to this. They had a number of servers but they didn't patch one of the servers in their system. Hackers were able to find this one server with outdated and vulnerable software and took advantage of it.
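The "one server got missed" failure is cheap to check for after a deploy. A minimal sketch in Python, assuming you already have some way to collect the version each host reports (that collection step, the host names and the version strings are all hypothetical here):

```python
from typing import Dict, List


def find_stragglers(reported: Dict[str, str], expected: str) -> List[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in reported.items() if version != expected)
```

For example, `find_stragglers({"web1": "1.4.2", "web2": "1.4.2", "web3": "1.4.1"}, "1.4.2")` flags `web3`, and a deploy script could refuse to declare success until the list is empty.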

I think deploys get better with time, but that initial blast of software development at a startup is insane. You literally need to do whatever it takes to get your shit running. Some of these details don't matter because initially you have no users. But if your company survives for a couple years and builds a user base, you still have the same shitty code and practices from early times.


I have no sympathy for high frequency traders losing everything.

So many more interesting and meaningful uses of computing than trying to build a system to out-cheat other systems in manipulating the global market for the express interest of amplifying wealth.


I remember seeing this in the late 90s, seeing 18-year-olds becoming "millionaires" investing in a dot-com. Lasted a good six months before rock bottom.


A trading bot is a money-making machine and so is Facebook. What's worse, a "headless" machine that is directly manipulating buy/sell orders to feed off market inefficiencies, or a machine that lures humans in, then converts their attention and time into money?


I've thought before that transforming our soft grey matter into gold is basically what the Rosicrucian alchemists were on about.


I watched the market go haywire on this day. Attentive people made a cool buck or two as dislocations arose.

What's crazy is that there were already rules in place to prevent stuff like this from happening - namely the Market Access Rule https://www.sec.gov/news/press/2010/2010-210.htm which was in place in 2010.

When the dust settled, Knight sold the entire portfolio via a blind portfolio bid to GS. It was a couple $bn gross portfolio. I think they made a pretty penny on this as well.


>Knight relied primarily on its technology team to attempt to identify and address the SMARS problem in a live trading environment.

Ah the good old "fk it we'll do it live" approach to managing billions.


Could use a [2013] tag, but this story is fascinating and horrifying and I re-read it every time it pops up. It's a textbook case of why a solid CI/CD pipeline is vital.


... and a requirements tracker linked to the code.


Who did they hire to develop this software?


Incidents such as these are rarely a people failure and nearly always a process failure. People will always make mistakes — perhaps seniors will make fewer mistakes than juniors, but no-one makes no mistakes.


Yeah there's a lot of macho, procrastinatory bullshit that gets in the way of having a good process. There's always someone who thinks that each step is 'simple' and refuses to believe in death by a thousand cuts. They blame people who make mistakes and feel better about themselves.


People are responsible for making the processes, no? It’s still a people problem.


Right, it's usually something like this:

A: I'd like to hire some people to improve our processes. It will take time and money and prevent future problems, but you will never notice.

B: Time and money and no new features? No way, I won't approve that.

A: tries to sell it some more even though they are technical and not a salesperson

B: No.


Agreed. This has the feeling of software done by the absolute lowest bidder who promised or was given unsustainable timelines.


I interviewed at Knight 1-2 years before this incident for a SWE position. I was very junior then, but I remember all the developers I met seemed to be very knowledgeable and the office environment was very professional, especially compared to some other companies at the time.



