We built a self-healing system to survive a concurrency bug at Netflix (pushtoprod.substack.com)
345 points by zdw 46 days ago | 165 comments



Vaguely related anecdote:

30 years ago or so I worked at a tiny networking company where several coworkers came from a small company (call it C) that made AppleTalk routers. They recounted being puzzled that their competitor (company S) had a reputation for having a rock-solid product, but when they got it into the lab they found their competitor's product crashed maybe 10 times more often than their own.

It turned out that the competing device could reboot faster than the end-to-end connection timeout in the higher-level protocol, so in practice failures were invisible. Their router, on the other hand, took long enough to reboot that your print job or file server copy would fail. It was as simple as that, and in practice the other product was rock-solid and theirs wasn't.

(This is a fairly accurate summary of what I was told, but there's a chance my coworkers were totally wrong. The conclusion still stands, I think - fast restarts can save your ass.)


This is along the lines of how one of the wireless telecom products I really liked worked.

Each running process had a backup on another blade in the chassis. All internal state was replicated. And the processes were written in a crash-only fashion: if anything unexpected happened, the process would just write a minicore and exit.

One day I noticed that we had (I think) over a hundred thousand crashes in the previous 24 hours, but no one complained; we just sent the minicores over to the devs and got them fixed. In theory the users triggering the crashes would be impacted (their devices might glitch and need to re-associate with the network), but the crashes caused no widespread impact.

To this day I'm a fan of crash only software as a philosophy, even though I haven't had the opportunity to implement it in the software I work on.


Seems like the next priority would be to make your product reboot just as fast as, if not faster than, theirs.


Clearly, but maybe the thing that makes your product crash less also makes it take longer to reboot.

Also, the story isn't that they couldn't; it's that they were measuring the actual failure rate rather than the effective failure rate, because the competitor's device could recover faster than the failure could cause real problems.


My workplace currently has a similar problem, where a resource leak can be greatly exacerbated by certain unpredictable/unknown traffic conditions.

Our half-day workaround implementation was the same thing, just cycle the cluster regularly automatically.

Since we're running on AWS, we just double the size of the cluster, wait for the instances to initialize, then rapidly decommission the old instances. Every 2 hours.

It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.
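For what it's worth, a minimal sketch of that double-then-drain cycle in Python with boto3, assuming the service lives in a single EC2 Auto Scaling group (the group name, timings, and the assumption that MaxSize allows doubling are all illustrative):

    # Hypothetical "double, wait, drain" cycler; run it every 2 hours.
    import time
    import boto3

    asg = boto3.client("autoscaling")
    ASG_NAME = "my-service-asg"  # illustrative group name

    def cycle_cluster():
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
        old_ids = [i["InstanceId"] for i in group["Instances"]]
        desired = group["DesiredCapacity"]

        # 1. Double the fleet so capacity never dips during the swap
        #    (assumes the group's MaxSize permits it).
        asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME,
                                 DesiredCapacity=desired * 2)

        # 2. Wait until the replacements report InService.
        while True:
            group = asg.describe_auto_scaling_groups(
                AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
            ready = [i for i in group["Instances"]
                     if i["LifecycleState"] == "InService"]
            if len(ready) >= desired * 2:
                break
            time.sleep(30)

        # 3. Terminate the old instances, shrinking back to the original size.
        for instance_id in old_ids:
            asg.terminate_instance_in_auto_scaling_group(
                InstanceId=instance_id, ShouldDecrementDesiredCapacity=True)

The real setup presumably also waits on health checks and connection draining; this only shows the shape of the workaround.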


Sounds like process-level garbage collection: just kill it and restart. Which also sounds like the apocryphal tale about the leaky code and the missile.

"This sparked and interesting memory for me. I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customers chief software engineer said "Of course it leaks"

He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention."

https://x.com/pomeranian99/status/858856994438094848


At least with the missile case, someone _did the analysis and knows exactly what's wrong_ before deciding the "solution" was letting the resources leak. That's fine.

What always bothers me is when people don't understand what exactly is broken, but just reboot every so often to fix things (note: I'm not saying this is the case for the grandparent comment, but it's implied). :0

For a lot of bugs, there's often the component you see (like the obvious resource leak) combined with subtle problems you don't see (data corruption, perhaps?) and you won't really know until the problem is tracked down.


That's super interesting and I love the idea of physically destructive GC. But to me that calculation and tracking sounds a lot harder than simply fixing the leaks :)


> It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

The trick is to not tell your manager that your bandaid works so well, but that it barely keeps the system alive and you need to introduce a proper fix. Been doing this for the last 10 years and we got our system so stable that I haven't had a midnight call in the last two years.


Classic trick. As a recent dev turned manager, these are the kind of things I've had a hard time learning.


Heroku reboots servers every night no matter what stack is running on them. Same idea.

The problem is that you've merely bought yourself some time. As time goes on, more inefficiencies/bugs of this nature will creep in unnoticed, and some may silently corrupt data before they're noticed (!). At that point it will be vastly harder to troubleshoot 10 bugs of varying severity and frequency all happening at the same time, forcing you to reboot said servers at faster and faster intervals, which in turn makes the bugs harder to diagnose individually.

> It's shockingly stable.

Well of course it is. You're "turning it off and then on again," the classic way to return to a known-good state. It is not a root-cause fix though, it is a band-aid.


Also, it means you are married to the reboot process. If you lose control of your memory management too much, you'll never be able to fix it absent a complete rewrite. I worked at a place that had a lot of (C++) CGI programs with a shocking disregard for freeing memory, but that was OK because the process restarted when the CGI request was over. But then they reused that same code in SOA/long-lived services, and they could never have one worker process handle more than 10 requests due to memory leaks (and the inability to re-initialize all the memory used in a request). So they could never use in-process caching or any other optimization that long-lived processes enable.


I never considered "having to reboot" as "introducing another dependency" (in the sense of wanting to keep those at a minimum) but sure enough, it is.

Also, great point about (depending on your architecture) losing the ability to do things like cache results.


You kinda want some machines to reboot less frequently than others, so issues don't creep up on you.

You also want some machines to reboot much more frequently than others, so you catch boot issues before they affect your entire fleet.


Not rebooting after a software upgrade is an oft-repeated mistake.


> It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

I don't know why, but my senses tell me that this is wrong even if you can afford it.


> I don't know why my senses tell me that this is wrong

The fix is also hiding other issues that show up. So it degrades over time and eventually you’re stuck trying to solve multiple problems at the same time.


^ This is the problem. Not only that, solving 10 bugs (especially those more difficult nondeterministic concurrency bugs) at the same time is hideously harder than solving 1 at a time.

As a Director of Engineering at my last startup, I had an "all hands on deck" policy as soon as any concurrency bug was spotted. You do NOT want to let those fester. They are nondeterministic, infrequent, and exponentially dangerous as more and more appear and are swept under the rug via "reset-to-known-good" mitigations.


These guys might be looking to match the fame of SolarWinds.


I think this is a prime example of why the cloud won.

You don’t need wizards in your team anymore.

Something seems off in the instance? Just nuke it and spin up a new one. Leave the system debugging to the Amazon folks.


This has been done forever. Ops teams have had cronjobs to restart misbehaving applications outside business hours since before I started working. In a previous job, the solution for full disks on an on-prem VM (no, not databases) was an automatic reimage. I've seen scheduled index rebuilds on Oracle. The list goes on.


> I've seen scheduled index rebuilds on Oracle

If you look into the Oracle DBA handbook, scheduled index rebuilds are somewhat recommended. We do it on weekends on our Oracle instances. Otherwise you will encounter severe performance degradation in tables where data is inserted and deleted at high throughput, leading to fragmented indexes. And since Oracle 12c with ONLINE REBUILD this is no problem anymore, even at peak hours.


Rebooting Windows IIS instances every night has been a mainstay for most of my career. haha


I’ve got an IIS instance pushing eight years of uptime… auto pool recycling is disabled.


Amazon folks won’t debug your code though, they’ll just happily bill you more.


The point is not to spend time frantically fixing code at 3 AM.


This is not exactly a new tactic, nor something that could only be implemented with a cloud solution. A randomized 'kill -HUP' could do the same thing, for example.


Amazon needs wizards then.


This fix means that you won't notice when you accumulate other such resource leaks. When the shit eventually hits the fan, you'll have to deal with problems you didn't even know you had.


People will argue you should spend time on something else once you've put a bandaid on a wooden leg.

You should do a proper risk assessment. Such a bug may be leveraged by an attacker, or may actually be a symptom of an attack in progress. It may also lead to data corruption or exposure. It may mean some parts of the system are poorly optimised and over-consuming resources, maybe impacting user experience. With a dirty workaround, your technical debt increases; expect more and more random issues that require aggressive "self-healing".


It's just yet another piece of debt that gets prioritized against other pieces of debt. As long as the cost of this debt is purely fiscal, it's easy enough to position in the debt backlog. Maybe a future piece of debt will increase the cost of this. Maybe paying off another piece of debt will also pay off some of this. The tech debt payoff prioritization process will get to it when it gets to it.


Without proper risk assessment, that's poor management and a recipe for disaster. Without that assessment, you don't know the "cost", if that can even be measured. Of course one can still run a business without doing such risk assessment and poorly managing technical debt, just be prepared for higher disaster chances.


There's nothing as permanent as a temporary solution.


Production environments are full of PoCs that were meant to be binned


This sounds terrible


If you squint hard enough, this is an implementation of a higher order garbage collection: MarkNothingAndSweepEverything.

There, formalized the approach, so you can't call it terrible anymore.


Oh no it isn't. A garbage collector needs to prove that what's being collected is garbage. If objects get collected because of an error... that's not really how you want GC to work.

If you are looking for an apt metaphor, Stalin sort might be more in line with what's going on here. Or maybe "ostrich algorithm".


I think it’s more like Tech Support Sort, as in “Try turning it off and on again and see if it’s sorted”.


LOL - I like that one! :-)


>Garbage collector needs to prove that what's being collected is garbage

Some collectors may need to do this, but there are several collectors that don't. EpsilonGC is a prime example of a GC that doesn't need to prove anything.


EpsilonGC is a GC in the same sense as a suitable-size stick is a fully automatic rifle when you hold it to your shoulder and say pew-pew...

I mean, I interpret your comment to be a joke, but you could've made it a bit more obvious for people not familiar with the latest fancy in Java world.


To be fair, this is what the BEAM VM structures everything around: if something is wonky, crash it and restart from a known-good state. Except when BEAM does it, everyone says it's brilliant.


It's one thing to design a crash-only system, and quite another to design a system that crashes all the time and paper over it with a cloud orchestration layer later.


I don't see the fundamental difference. Both systems work under expected conditions and will crash parts of themselves when those conditions don't hold. The scale (and thus the visibility of bugs) changes, the technologies change, but the architecture really doesn't. Erlang programs are not magically devoid of bugs; the bugs just don't turn into visible errors.


I understand this perspective but a BEAM thread can die and respawn in microseconds but this solution involves booting a whole Linux kernel. The cost of the crash domain matters. Similarly, thread-per-request webservers are a somewhat reasonable architecture on unix but awful on Windows. Why? Windows processes are more expensive to spawn and destroy than unix ones.


I am running an old statically compiled Perl binary that has a memory leak. So every day the container is restarted automatically, and I don't have to deal with the problem. It has been running like this for many, many years now.


I guess this works right up until it doesn't? It's been a while, but I've seen AWS hit capacity for a specific instance size in a specific availability zone. I remember spot pricing being above the on-demand pricing, which might have been part of the issue.


Yup, don't want to get ICEd out of anything.

Also, sometimes the management API goes out due to a bug/networking issue/thundering herd


I've realized that the majority of engineers have no critical thinking and are unable to see things beyond their domain of speciality. Arguments like "even when accounting for a potential incident, your solution is more expensive, while our main goal is making money" almost never work, and I've been in countless discussions where some random document with "best practices", whatever those are supposed to be, was treated like sacred scripture.


We are dogmatic and emotional, but the temptation to base your opinions on the "deeper theory" is large.

Pragmatically, restart the service periodically and spend your time on more pressing matters.

On the other hand, we fully understand the reason for the fault, but we don't know exactly where the fault is. And it is our fault. It takes a certain kind of discipline to say "there are many things I understand but don't have the time to master now, let's leave it."

It's, mostly, embarrassing.


"certain kind" of discipline, indeed... not the good kind. and while your comment goes to great pains to highlight how that particular God is dead (and i agree, for the record), the God of Quality (the one that Pirsig goes to great lengths to not really define) toward which the engineer's heart of heart prays that lives within us all is... unimpressed, to say the least.


Sure, you worship the God of Quality until you realize that the memory leak is caused by a third-party library (extra annoying when you could have solved it yourself) or a quirky stdlib implementation.

Then you realize it's a paper idol and the best you can do is suck less than the average.

Thanks for playing Wing Commander!


>> Thanks for playing Wing Commander!

captain america voice I got that reference :-)


> "certain kind" of discipline, indeed... not the good kind.

Not OP, but this is a somewhat normal case of making a tradeoff? They aren't able to repair it at the moment (or rather don't want to / can't allocate the time for it) and instead trade their resource usage for stability and technical debt.


that's because the judge(s) and executioner(s) aren't engineers, and the jury is not of their peers. and for the record i have a hard time faulting the non-engineers above so-described... they are just grasping for things they can understand and have input on. who wouldn't want that? it's not at all reasonable for the keepers of the pursestrings to expect a certain amount of genuflection by way of self-justification. no one watches the watchers... but they're the ones watching, so may as well present them with a verisimilitudinous rendition of reality... right?

but, as a discipline, engineers manage to encourage the ascent of the least engineer-ly (or, perhaps, "hacker"-ly) among them ("us") ...-selves... through their sui generis combination of learned helplessness, willful ignorance, incorrigible myopia, innate naïvete, and cynical self-servitude that signify the Institutional (Software) Engineer. coddled more than any other specialty within "the enterprise", they manage to simultaneously underplay their hand with respect to True Leverage (read: "Power") and overplay their hand with respect to complices of superiority. i am ashamed and dismayed to recall the numerous times i have heard (and heard of) comments to the effect of "my time is too expensive for this meeting" in the workplace... every single one of which has come not from the managerial class-- as one might reasonably, if superficially, expect-- but from the software engineer rank and file.

to be clear: i don't think it's fair to expect high-minded idealism from anyone. but if you are looking for the archetypical "company person"... engineers need look no further than their fellow podmates / slack-room-mates / etc. and thus no one should be surprised to see the state of the world we all collectively hath wrought.


I dig your vibe. whaddya working on these days?


How about the costs? Isn’t this a very expensive bandaid? How is it not a priority? :)


Depends what else it's solving for.

I've seen multiple issues solved like this after engineering teams have been cut to the bone.

If the cost of maintaining enough engineers to keep systems stable for more than 24 hours is higher than the cost of doubling the container count, then this is what happens.


This. All the domain knowledge has left. This sounds like a hacky workaround at best, one that AWS will welcome with open arms come invoice day.


Depends on how long it takes for the incoming instances to initialize and outgoing instances to fully decommission.

x = time it takes to switchover

y = length of the cycles

x/y = % increase in cost

For us, it's 15 minutes / 120 minutes = 12.5% increase, which was deemed acceptable enough for a small service.


Shouldn't be too high cost if you only run 2x the instances for a short amount of time. A reasonable use of Cloud, IMHO, if you can't figure out a less disruptive bandaid.


AWS charges instances in 1 hour increments - so you're paying 150% the EC2 costs if you're doing this every 2 hours


AWS has been charging by the second since 2017: https://aws.amazon.com/blogs/aws/new-per-second-billing-for-...


> So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

I mean, you probably know this, but sooner or later this attitude is going to come back to bite you. What happens when you need to do it every hour? Every ten minutes? Every 30 seconds?

This sort of solution is really only suitable for use as short-term life-support; unless you understand exactly what is happening (but for some reason have chosen not to fix it), it's very, very dangerous.


Well that's the thing: a bug that happens every 2 hrs and cannot be traced easily gives a developer roughly 4 opportunities in an 8hr day to reproduce + diagnose.

Once it's happening every 30 seconds, then they have up to 120 opportunities per hour, and it'll be fixed that much quicker!


In a way, yes. But it's also a sledgehammer approach to stateless design. New code will be built within the constraint that stuff will be rebooted fairly often. That's not only a bad thing.


"It's shockingly stable." You're running a soup. I'm not sure if this is satire or not. This reminds me of using a plug-in light timer to reboot your servers because some java program eats all the memory.


or installing software to jiggle the mouse every so often so that the computer with the spreadsheet that runs the company doesn't go to sleep


Still infinitely cheaper than rebuilding the spreadsheet tbh.


Sometimes running a soup is the correct decision


What's neat is that this is a differential equation. If you kill 5% of all instances each hour at random, the rate at which bad instances get removed is proportional to the current fraction of bad instances (with good instances going bad at their own rate, taken as 1% per hour below).

i.e.

if bad(t) = fraction of bad instances at time t

and

bad(0) = 0

then

d(bad(t))/dt = -0.05 * bad(t) + 0.01 * (1 - bad(t))

so

bad(t) = 0.166667 - 0.166667 e^(-0.06 t)

Which looks a mighty lot like the graph of bad instances in the blog post.
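A quick numerical sanity check of that closed form in Python, using the same assumed rates (1% of good instances go bad per hour, 5% of the fleet killed per hour):

    import math

    KILL_RATE = 0.05   # fraction of all instances killed per hour
    BAD_RATE = 0.01    # fraction of good instances going bad per hour

    def closed_form(t):
        # Steady state 0.01/0.06 = 1/6, approached at rate 0.06/hour.
        return (BAD_RATE / (KILL_RATE + BAD_RATE)) * \
               (1 - math.exp(-(KILL_RATE + BAD_RATE) * t))

    # Euler-integrate d(bad)/dt = -0.05*bad + 0.01*(1 - bad) over 24 hours.
    bad, dt = 0.0, 0.001
    for _ in range(int(24 / dt)):
        bad += (-KILL_RATE * bad + BAD_RATE * (1 - bad)) * dt

    print(bad, closed_form(24))   # both ~0.127, heading toward ~0.167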


Love it! I wonder if the team knew this explicitly or intuitively when they deployed the strategy.

> We created a rule in our central monitoring and alerting system to randomly kill a few instances every 15 minutes. Every killed instance would be replaced with a healthy, fresh one.

It doesn't look like they worked out the numbers ahead of the time.
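For a sense of how small such a rule can be, here is a hypothetical version as a scheduled Python job with boto3 (the Service tag, kill count, and use of raw EC2 termination are all assumptions; the post doesn't describe Netflix's actual tooling):

    # Kill a few random instances of the affected service every 15 minutes;
    # the autoscaler replaces each one with a healthy, fresh instance.
    import random
    import boto3

    ec2 = boto3.client("ec2")

    def kill_random_instances(service_tag="api", count=3):
        reservations = ec2.describe_instances(
            Filters=[{"Name": "tag:Service", "Values": [service_tag]},
                     {"Name": "instance-state-name", "Values": ["running"]}]
        )["Reservations"]
        instance_ids = [i["InstanceId"]
                        for r in reservations for i in r["Instances"]]
        victims = random.sample(instance_ids, min(count, len(instance_ids)))
        if victims:
            ec2.terminate_instances(InstanceIds=victims)
        return victims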


Title is grossly misleading.

That Netflix had already built a self-healing system means they were able to handle a memory leak by killing random servers faster than memory was leaking.

This post isn't about how they've managed that, it's just showing off that their existing system is robust enough that you can do hacks like this to it.


Your take is much different than mine. The issue was a practical one of sparing people from working too much over one weekend since the bug would have to wait until Monday, and the author willingly described the solution as the worst.


I have a project where one function (reading metadata from an Icecast stream [0]) was causing a memory leak and ultimately consuming all of it.

I don't remember all the details, but I've still not been able to find the bug.

But this being in Elixir I "fixed it" with Task, TaskSupervisor and try/catch/rescue.

Not really a win but it is still running fine to this day.

[0] https://github.com/conradfr/ProgRadio/blob/1fa12ca73a40aedb9...


Half of HN posts are people showing off things where they spent a herculean amount of effort reinventing something that Elixir/Erlang solved 30+ years ago.


Some are even proud of their ignorance and belittle Erlang and Elixir.


I'm fine with it.

If people want to belittle something, either we aren't trying to solve the same problem (sure) or they're actively turning people away from what could be a serious advantage (more for me!)

If the cost of switching wasn't so high, I'd love to write Elixir all day. It's a joy.


This is a bit odd coming from the company of chaos engineering - has the chaos monkey been abandoned at Netflix?

I have long advocated randomly restarting things with different thresholds partly for reasons like this* and to ensure people are not complacent wrt architecture choices. The resistance, which you can see elsewhere here, is huge, but at scale it will happen regardless of how clever you try to be. (A lesson from the erlang people that is often overlooked).

* Many moons ago I worked on a video player which had a low level resource leak in some decoder dependency. Luckily the leak was attached to the process, so it was a simple matter of cycling the process every 5 minutes and seamlessly attaching a new one. That just kept going for months on end, and eventually the dependency vendor fixed the leak, but many years later.


In cases like this won't Chaos Monkey actually hide the problem, since it's basically doing exactly the same as their mitigation strategy - randomly restarting services?


Right. The point of the question is why not ramp up the monkey? They seem to imply it isn’t there now, which wouldn’t surprise me with the cultural shifts that have occurred in the tech world.


They were just lucky not to have data corruption from the concurrency issue, and that the manifestation was an infinite get. Overall, if you can just randomly "kill -9", the case is rather trivial.

Likely replacing HashMap with CHM would not solve the underlying concurrency issue either, but it would prevent an infinite loop. (Edit) It appears that part is just wrong: "some calls to ConcurrentHashMap.get() seemed to be running infinitely" <-- that can happen to a plain HashMap during concurrent put(s), but not to ConcurrentHashMap.


it wasn’t luck, it was very deliberately engineered for. The article does lack a good bit of context about the Netflix infra:

https://netflixtechblog.com/the-netflix-simian-army-16e57fba...

https://github.com/Netflix/chaosmonkey


yep, the link in the "some calls to ConcurrentHashMap.get() seemed to be running infinitely." sentence points to HashMap.html#get(java.lang.Object)


I have seen that part myself (infinite loops), also I have quite extensive experience with CHM (and HashMap).

Overall such a mistake alone undermines the effort/article.


You gotta pick your battles. Part of being in a startup is to be comfortable with quick and dirty when necessary. It’s when things get bigger, too corporate and slow that companies stop moving fast.


We are talking about Netflix. You know, the 'N' in FAANG/MAANG or whatever.


As a non-FAANGer Netflix has always intrigued me because of this. While Google, Facebook and others seem to have bogged themselves down in administrative mess, Netflix still seems agile. From the outside at least.

(also worth noting this post seems to be discussing an event that occurred many years ago, circa 2011, so might not be a reflection of where they are today)


Netflix is a much smaller enterprise. It got included because it was high growth at the time, not because it was destined to become a trillion dollar company.


Netflix isn’t trying to be a search engine, hardware manufacturer, consumer cloud provider (email, OneDrive, etc), cloud infrastructure provider, and an ad company at the same time. Or an Online Walmart who does all the rest and more.


If the principles of languages like Erlang were taught in American schools, things like this would be much less likely to occur. Silly that Computer Science is regarded more highly by many than Software Engineering for Software Engineering jobs.


Ideas stemming from Erlang and Mozart/Oz are indeed a big blind spot in most undergrad programs. Sadly, even in the EU all this is becoming a niche topic, which is weird given that today's applications are more concurrent and data-intensive than ever.


> Could we roll back? Not easily. I can’t recall why

I can appreciate the hack to deal with this (I actually came up with the same solution in my head while reading), but if you cannot roll back and you cannot roll forward, you are stuck in a special purgatory of CD hell that you should be spending every moment getting out of before doing anything else.


This was a nice short read. A simple (temporary) solution, yet a clever one.

How was he managing the instances? Was he using Kubernetes, or did he write some script to manage the automatic termination of the instances?

It would also be nice to know why:

1. Killing was quicker than restarting. Perhaps because of the business logic built into the java application?

2. Killing was safe. How was the system architected so that requests weren't dropped altogether?

EDIT: formatting


The author mentions 2011 as the time they switched from REST to RPC-ish APIs, and this issue was related to that migration.

Kubernetes launched in 2014, if memory serves, and it took a bit before widespread adoption, so I’m guessing this was some internal solution.

This was a great read, and harkens back to the days of managing 1000s of cores on bare metal!


> It would also be nice to know why:

1. Killing was quicker than restarting.

If you happen to restart one of the instances that was hanging in the infinite loop, you can wait a very long time until the Java container actually decides to kill itself, because it did not finish its graceful shutdown within the allotted timeout period. Some Java containers have a default of 300s for this. In this circumstance kill -9 is faster by a lot ;)

We also had circumstances where the affected Java container did not stop even when the timeout was reached, because the misbehaving thread consumed the whole CPU and none was left for the supervisor thread. Then you can only kill the JVM's host process.


This reminds me of LLM pretraining and how there are so many points at which the program can fail, so you need clever solutions to keep uptime high. And it's not possible to just fix the bugs: GPUs will often just crash (in graphics, if a pixel flips to the wrong color for a frame, it's fine, whereas the same kind of fault can cause numerical instability in deep learning, which is why ECC catches it). You also often have a fixed-size cluster whose utilization you want to maximize.

So improving uptime involves holding out a set of spare GPUs to swap in for failed ones while they reboot. The whole run can also just randomly deadlock, so you might handle that by watching the logs and restarting after a certain amount of inactivity. And you have to be clever about how you save and load checkpoints, since that can start to become a huge bottleneck.

After many layers of self healing, we managed to take a vacation for a few days without any calls :)
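A minimal sketch of that log-inactivity watchdog in Python (the log path, stall timeout, and launch command are illustrative; it assumes the training script resumes from its latest checkpoint when relaunched):

    import os
    import subprocess
    import time

    LOG_PATH = "/var/log/train.log"              # illustrative
    STALL_SECONDS = 15 * 60                      # silence treated as deadlock
    LAUNCH_CMD = ["bash", "launch_training.sh"]  # illustrative

    def launch():
        return subprocess.Popen(LAUNCH_CMD)

    proc = launch()
    last_activity = time.time()
    while True:
        time.sleep(60)
        if os.path.exists(LOG_PATH):
            last_activity = max(last_activity, os.path.getmtime(LOG_PATH))
        stalled = time.time() - last_activity > STALL_SECONDS
        if proc.poll() is not None or stalled:
            # Crashed or deadlocked: kill it and start over from the checkpoint.
            proc.kill()
            proc.wait()
            proc = launch()
            last_activity = time.time()  # fresh grace period for the new run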


Interesting read, the fix seems to be straightforward, but I'd have a few more questions if I was trying to do something similar.

Is software deployed regularly on this cluster? Does that deployment happen faster than the rate at which they were losing CPUs? Why not just periodically force a deployment, given it's a repeated process that probably already happens frequently.

What happens to the clients trying to connect to the stuck instances? Did they just get stuck/timeout? Would it have been better to have more targeted terminations/full terminations instead?


An answer to basically all your questions is: doesn’t matter, they did their best to stabilize in a short amount of time, and it worked - that’s what mattered.


Netflix is supposed to be the bastion of microservices and the trailblazer of all-aws infrastructure.

But as time goes by I just ask: all this work and cost and complexity, to serve files? Yeah, don't get me wrong, the files are really big, AND they are streamed, noted. But it's not the programming complexity challenge that one would expect; almost all of the complexity seems to stem from metadata like when users stop watching, how to recommend them titles to keep them hooked, and when to cut the titles and autoplay the next video to make them addicted to binge watching.

Case in point: the blog post speaks of a CPU concurrency bug and clients being servers, but never once refers to an actual business-domain purpose. Like, are these servers even serving video content? My bet is they are more on the engagement-optimization side of things. And I make this bet knowing that these are servers with high video-like load, but I'm more confident that these guys are juggling 10TB/s of mouse metadata into some ML system than that they have some problem with the core of their technology, which has worked since launch.

As I say this, I know I'm probably wrong, surely the production issues are cause by high peak loads like a new chapter of the latest series or whatever.

I'm all over the place, I just don't like netflix is what I'm saying


Netflix has done massive amounts of work on BSD to improve its network throughput; that's part of how they enable file delivery from their CDN appliances. https://people.freebsd.org/~gallatin/talks/euro2022.pdf

They've also contributed significantly to open source tools for video processing, one of the biggest things that stands out is probably their VMAF tool for quantifying perceptual quality in video. It's probably the best open source tool for measuring video quality out there right now.

It's also absolutely true that in any streaming service, the orchestration, account management, billing and catalogue components are waaaay more complex than actually delivering video on-demand. To counter one thing you've said: mouse movement... most viewing of premium content isn't done on web or even mobile devices. Most viewing time of paid content is done on a TV, where you're not measuring focus. But that's just a piece of trivia.

As you said, you just don't like them, but they've done a lot for the open source community and that should be understood.


Yeah I stand corrected. Video being one of the highest entropy types of data probably means they face state of the art throughput challenges. Which are inherently tied to cost and monetization.

That said, free apps like TikTok and YouTube probably face higher throughput, so the user-pays model probably means Netflix is at the state of the art for high-volume, high-quality service (both app experience and content), rather than for the sheer-volume low-quality or premium-quality low-volume markets.

I mean serving millions of customers at 8 bucks per month. Which is not quite like serving billions.


> But as time goes by I just ask, all this work and costs and complexity, to serve files?

IMHO, a large amount of the complexity is all the other stuff. Account information, browsing movies, recommendations, viewed/not/how much seen, steering to local CDN nodes, DRM stuff, etc.

The file servers have a lot less complexity; copy content to CDN nodes, send the client to the right node for the content, serve 400Gbps+ per node. Probably some really interesting stuff for their real time streams (but I haven't seen a blog/presentation on those)

Transcoding is probably interesting too. Managing job queues isn't new, but there's probably some fun stuff around cost effectiveness.


> But as time goes by I just ask, all this work and costs and complexity, to serve files?

You could say the same thing about the entire web.


Not really. People are not posting data into Netflix. Netflix is mostly read-only. That is a huge complexity reducer.


Is it? It's pretty rare to download assets from servers that you're uploading to. Sometimes you have truly interactive app servers but that's a pretty small percentage of web traffic. Shared state is not the typical problem to solve on the internet, though it is a popular one to discuss.


Whatever your service is, usually the database is the bottleneck. The database limits the latency, scaling and availability.

Of course, how much, depends on the service. Particularly, how much concurrent writing is happening, and do you need to update this state globally, in real-time as result of this writing. Also, is local caching happening and do you need to invalidate the cache as well as a result of this writing.

The most of the relevant problems disappear, if you can just replicate most of the data without worrying that someone is updating it, and you also don't have cache invalidation issues. No race conditions. No real-time replication issues.


> Whatever your service is, usually the database is the bottleneck. The database limits the latency, scaling and availability.

Database-driven traffic is still a tiny percentage of internet traffic. It's harder to tell these days with encryption but on any given page-load on any project I've worked on, most of the traffic is in assets, not application data.

Now, latency might be a different issue, but it seems ridiculous to me to consider "downloading a file" to be a niche concern—it's just that most people offload that concern to other people.


> It's harder to tell these days with encryption but on any given page-load on any project I've worked on, most of the traffic is in assets, not application data.

Yet you have to design the whole infrastructure so that that tiny margin works flawlessly, because otherwise the service usually isn't serving its purpose.

Read-only assets are the easy part, which was my original claim.


> Read-only assets are the easy part, which was my original claim.

I don't think this is true at all given the volume. With that kind of scale everything is hard. It's just a different sort of hard than contended resources. Hell, even that is as "easy" these days with CRDTs (and I say this with dripping sarcasm).


Asset volume is just a cost issue these days. You can reduce the cost with clever caching, with programming language choices, or with higher compression rates, but in the end it is not a real problem anymore in the overall infrastructure architecture. Read-only assets can be copied, duplicated, and cached without any worry that they might need to be re-synced soon.


That's true. As the internet becomes more $-focused, companies become more interested in shoving messages (ads/propaganda) at users than in letting users say anything. Even ISP plans have asymmetric specs.

I'm honestly much more impressed by free apps like youtube and tiktok in terms of throughput, they have MUCH more traffic since users don't pay!


Every time you like/dislike/watchlist a movie, you're posting data. When you're watching a movie, your progress is constantly updated, posting data. Simple stuff, but there are possibly hundreds of thousands of concurrent users doing that at any given moment.


Yes, but it still accounts for only a fraction of the purpose of their infrastructure. There are no hard global real-time sync requirements.

> When you're watching a movie your progress is constantly updated, posting data

This can be implemented on server side and with read requests only.

A proper comparison would be YouTube where people upload videos and comment stuff in real-time.


There are many cases where the server side takes you 98% of the way there, but it still makes economic sense to spend shitloads on getting the remaining 2%.

In this case it's not the same whether your server sends a 10-second segment and the viewer watches all of it, or the server sends a 10-second segment but the client pauses at the 5s mark (which needs client-side logic).

Might sound trivial, but at netflix scale there's guaranteed a developer dedicated to that, probably a team, and maybe even a department.


> A proper comparison would be YouTube where people upload videos and comment stuff in real-time.

Even in this one sentence you're conflating two types of interaction. Surely downloading videos is yet a third, and possibly the rest of the assets on the site a fourth.

Why not just say, with your full chest, the exact problem you think is worthy of discussion, if you so clearly have one in mind?


I'd make a distinction on:

- the entropy of the data: a video is orders of magnitude larger than browsing metadata.

- the compute required: other than an ML algorithm optimizing for engagement, there's no computationally intensive business-domain work (throughput-related challenges don't count)

- finally, the programming complexity, in terms of the business domain, just isn't there.

I mean my main argument is that a video provider is a simple business requirement. Sure you can make something simple at huge scale and that is a challenge. Granted.


I thought about the complexity in terms of compute, but I guess if there's no input then there's no compute possible, as all functions are idempotent and static. At the very least their results are cacheable, or the input is centralized (admins/show producers)


[flagged]


Rizz skibidi bro


That presumably fixed things from a deployment point of view, but if there was a concurrency bug involving a hashmap, the service may have been emitting incorrect results.

For example: calculate hash code of string, determine index, find apparent match, hashmap is modified by another thread, return value at that index which no longer matches.

I don't think that particular issue can happen with Java's HashMap, but there's probably some sort of similar goofiness.


> It was Friday afternoon

> Rolling back was cumbersome

It's a fundamental principle of modern DevOps practice that rollbacks should be quick and easy, done immediately when you notice a production regression, and ideally automated. And at Netflix's scale, one would have wanted this rollout to be done in waves to minimize risk.

Apparently this happened back in 2021. Did the team investigate later why you couldn't do this, and address it?


>It's a fundamental principle of modern DevOps practice that rollbacks should be quick and easy

Then DevOps principles are in conflict with reality.


Go on...


That was a bit underwhelming compared to what the headline set my expectations up for, but definitely a good idea and neat solution.


from the headline alone I got LinkedIn CEO vibes. "Built a Self-Healing System to Survive a Concurrency Bug" is how I could describe wrapping a failing method in a retry loop


Put in a couple more if statements checking the output of rand(), call it AI, and you'll be CEO in no time!


On a long enough timescale, everything eventually converges to Erlang


Hah, hinted at that in my comment: https://news.ycombinator.com/item?id=42126301

It really is a fundamental advantage against the worst kinds of this category of bug


I understand how their approach worked well enough, but I don't get why they couldn't selectively target the VMs that were currently experiencing problems rather than randomly selecting any VM to terminate. If the bad VMs were exhausting all their CPU resources, wouldn't that be easy enough to search for using something like Ansible?


I agree, I've been at places that can tie alerts at a host level to an automated task runner. Basically a workflow system that gets kicked off on an alert. Alert fires, host is rebooted or terminated. Helpful for things like this.
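A rough sketch of that targeted variant in Python with boto3: check recent CPUUtilization per instance and terminate only the ones pegged near 100% (the 30-minute window and 95% threshold are assumptions):

    from datetime import datetime, timedelta, timezone
    import boto3

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    def pegged_instances(instance_ids, threshold=95.0):
        end = datetime.now(timezone.utc)
        start = end - timedelta(minutes=30)
        pegged = []
        for instance_id in instance_ids:
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2", MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=start, EndTime=end,
                Period=300, Statistics=["Average"])
            points = stats["Datapoints"]
            # Pegged = every 5-minute average in the window is above threshold.
            if points and min(p["Average"] for p in points) > threshold:
                pegged.append(instance_id)
        return pegged

    # e.g. ec2.terminate_instances(InstanceIds=pegged_instances(fleet_ids))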


Not that familiar with Elixir/OTP, but isn't the approach OP took similar to the Let It Crash philosophy of OTP?


Not really, you wouldn't normally kill or restart processes randomly in an OTP system. "Let it crash" is more about separating error handling from business logic.


> and to my memory, some calls to ConcurrentHashMap.get() seemed to be running infinitely.

Of course they did. And whoever thought "Concurrent" meant it would work fine gets burned by it. Of course.

And of course it doesn't work properly or intuitively for some very stupid reason. Sigh


It has to be an error in the article - that could happen to HashMap, but it has never been an issue with CHM.


this sounds more like citing chapter and verse in an exegesis than anything of direct relevance to the Mortal Plane...


I had to deal with a concurrency bug in Ruby once and it was so bad* that it pushed me into Elixir, which makes the vast majority of concurrency bugs impossible at the language-design level, thus enabling more sanity.

Ingeniously simple solution for this particular bug though.

*as I recall, it had to do with merging a regular Hash in the ENV with a HashWithIndifferentAccess, which as it turns out was ill-conceived at the time and had undefined corner cases (example: what should happen when you merge a regular Hash containing either a string or symbol key (or both) into a HashWithIndifferentAccess containing the same key but internally only represented as a string? Which takes precedence was undefined at the time.)


One of the things I am grateful for is Kubernetes and its killing of pods.

Had a similar problem, but memory-wise, with a pesky memory leak; the short-term solution was to do nothing, as the pods would just get killed and replaced anyway.


During one of my past gigs, this exact feature hid a huge memory leak in old code that had always run on k8s, which we found out only when we moved some instances to bare metal.


We hit this in a past gig too. One of the big services had a leak, but deployed every 24 hours which was hiding it. When the holiday deploy freeze hit the pods lived much longer than normal and caused an OOM storm.

At first I thought maybe we should add a "hack" to cycle all the pods over 24 hours old, but then I wondered if making holiday freezes behave like normal weeks was really a hack at all or just reasonable predictability.

In the end folks managed to fix the leak and we didn't resolve the philosophical question though.


I've dealt with something similar. We were able to spin up zombie reapers, looking for the cores / CPUs that were pegged at 100%, and prioritize the instances that were worst hit.


In a technical context, leaving behind unknowns with unknown boundaries makes it hard to assess next actions in error cases, but even more importantly it roots future system design work in uncertainty.

To me that is a blocker. I really need to understand the impact of leaving something behind before continuing. I'd likely do everything in my power to remove that unknown from the current state, for the sake of sanity.


Self-healing system: increase the cluster size and replace servers randomly. It works because the problem was threads occasionally entering an infinite loop without corrupting data, and because the whole system can tolerate these kinds of whole-server crashes. IMHO an unusual combination of preconditions.

It's not explained why they couldn't write a monitoring script instead to find the servers having the issue and kill only those.


I think they just needed a quick and dirty solution that was good enough for a few days. They figured that for a 1% failure rate per hour, they needed to kill x processes every y minutes to keep ahead of the failures. I'm sure it would be much more efficient, but also more complicated, to target the specific failures, and the "good enough" solution was acceptable.


> Practical engineering can mean many things, but a definition I often return to is: having clear goals and making choices that are aligned with them.

The main takeaway from this post. You will not encounter exactly this or similar issues at your workplace, but this single piece of advice can help you fix any issues that you do encounter.


This reminds me of a couple of startups I knew running Node.js circa 2014 that would just restart their servers every night due to memory issues.

iirc it was mostly folks with websocket issues, but fixing the upstream was harder

10 years later and specific software has gotten better, but this type of problem is certainly still prevalent!



True, but for somewhat different reasons. For the OP, they take this approach because they simply don't know yet what the problem is, and it would take some time to track it down and fix it and they don't want to bother.

For Boeing, it's probably something fairly simple actually, but they don't want to fix it because their software has to go through a strict development process based on requirements and needing certification and testing, so fixing even a trivial bug is extremely time-consuming and expensive, so it's easier to just put a directive in the manual saying the equipment needs to be power-cycled every so often and let the users deal with it. The OP isn't dealing with this kind of situation.


"worked"

(not that i don't get the sarcasm)


Meta has a similar strategy, and this is why memory leak bugs in HHVM are not fixed (they consider that instances are going to be regularly killed anyway)


Apache where I used to work was configured to restart each child after 1000 requests. So many outages prevented at very little cost.


> Why not just reboot them? Terminating was faster.

If you don't know why you should reboot servers/services properly instead of terminating them..


Well, why? This comment seems counter to the now-popular "cattle not pets" approach.


state


It is pretty typical these days for services in a distributed architecture to not depend on local state whatsoever. In fact, in k8s there is no way to "properly reboot" a pod. The equivalent would be to replace the pod with a new one.


The real key here is to understand Netflix’s business, and also many social media companies too.

These companies have achieved vast scale because correctness doesn’t matter that much so long as it is “good enough” for a large enough statistical population, and their Devops practices and coding practices have evolved with this as a key factor.

It is not uncommon at all for Netflix or Hulu or Facebook or Instagram to throw an error or do something bone headed. When it happens you shrug and try again.

Now imagine if this was applied to credit card payments systems, or your ATM network, or similar. The reality of course is that some financial systems do operate this way, but it’s recognized as a problem and usually gets on people’s radar to fix as failed transaction rates creep up and it starts costing money directly or clients.

“Just randomly kill shit” is perfectly fine in the Netflix world. In other domains, not so much (but again it can and will be used as an emergency measure!).


Kill and restart the service. This seems to be the coder solution to everything. We do it for our service as well. The programmer could fix their stuff but alas, that’s too much to ask.


Yes - lots of writing for a common solution to a bug...

Memory leaks are often "resolved" this way... until time allows for a proper fix.


"Did you try turning it off and on again?"


I like the practicality of this


Yeah, I had the same issue: the EC2 instance I used to host my personal websites would randomly hit 100% CPU and become unreachable.

I put a CloudWatch alarm at 90% CPU usage which would trigger a reboot (which completed way before anyone would notice a downtime).

Never had issues again.
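For reference, roughly that alarm expressed with boto3 (region, instance ID, and thresholds are placeholders; the reboot uses CloudWatch's built-in EC2 reboot action):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="cpu-90-reboot",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,          # one 5-minute period over 90%
        Threshold=90.0,
        ComparisonOperator="GreaterThanThreshold",
        # Built-in alarm action that reboots the instance when it fires.
        AlarmActions=["arn:aws:automate:us-east-1:ec2:reboot"],
    )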


Reminds me of the famous quote by Rasmus Lerdorf, creator of PHP

> I’m not a real programmer. I throw together things until it works then I move on. The real programmers will say “Yeah it works but you’re leaking memory everywhere. Perhaps we should fix that.” I’ll just restart Apache every 10 requests.


i'll argue that doing the restart is more important until someone else finds the leak


Or future me. It hurts on the inside to just kick EC2 every hour because every 61 minutes something goes awry in the process. But the show must go on, so you put in the temporary fix knowing that it's not going to be temporary. Still, weeks/months/years down the line you could get lucky and the problem will go away and you can remove the kludge. But if you're ridiculously lucky, not only will the problem get fixed, but you'll get to understand exactly why the mysterious problem was happening in the first place. Like the gunicorn 500 upgrade bug, or the Postgres TOAST json thing. That sort of satisfaction isn't something money can buy. (Though it will help pay for servers in the interim until you find the bug.)


or at least after the weekend :P


Also uttered by others who thought borrowing money was more important until they could figure out a way to control spending.



