Tell us how you handle 20 trillion events per day? That’s 230 million a second.

jiggawatts · on March 11, 2022

By not collecting that much data. Netflix is a streaming platform. So... stream the movie. That's it. Collect just enough data for performance analysis and for keeping track of where the user stopped watching.

That's mere millions of data points per day, not trillions.

Like... why do they need real time machine learning at that rate? No, really: Why!? Would it be catastrophic to their business if they ran ML as batch jobs? Do their recommendations need to change second-to-second?

Another crazy quote: "At that time, Netflix had ~500 microservices, generating more than 10PB data every day in the ecosystem."

Wat?

That's a 170 MB of logs pre customer per day! Most of those customers might watch an episode of a TV show per day, or watch one movie per day. Call it 2 hours of usage per day? These guys were blasting out 1.4MB of logs per minute per active user while essentially doing nothing much more than streaming big binary blocks of data from a CDN!

In my mind that's the crux of the issue. The architecture astronauts have gone amok, solving problems that shouldn't exist...

The argument / excuse is: "Many product features, such as personalized recommendation, search, etc., can benefit from fresher data to improve user experiences, leading to higher user retention, engagement, etc."

My experience with NetFlix is that their recommendations are garbage and getting worse over time. Exabytes of data won't solve this.

booi · on March 11, 2022

I'm with you on this one. I was an architect of a quite large SVOD streaming service and this sounds like a runaway engineering division that's lost.

Sure, you can log and trace every tiny event but they've forgotten to ask whether or not they should and what are you getting out of it. It's creating a problem that doesn't exist, then solving it (by creating 2 more complex systems). 500 microservices sounds excessive but not out of the realm of belief.

I do wonder if it might relate to the way they're measuring engineering performance.

techie128 · on March 18, 2022

I think you're misunderstanding Netflix's scale and their culture. I am sure the SVOD service you architected was great but it probably wasn't Netflix scale.

Netflix's culture is heavily biased towards collecting metrics and experimenting with them. Their entire ethos is data oriented therefore they tend to capture a lot of metrics. You are terming it as a 'runaway engineering division that's lost' is unfair. You probably need to work at their scale for a while to appreciate the complexity of their business and technology needs. If it was easy to build Netflix, then everyone would.

markus_zhang · on March 11, 2022

From my BA experience it could also be a runaway project manager issue or runaway data analytics issue as people just say hey let's get that piece of data.

Is it useful? Rarely asked. But more importantly, they rarely got pulled off when proved to be useless.

orf · on March 11, 2022

It sure is convenient to just say "don't do that", but it's a lazy answer divorced from reality.

Netflix does way more than "streaming big binary blocks of data from a CDN". Think of all the backend stuff behind the scenes that might go into this - encoding/re-encoding all these "big binary blocks", request tracing, audit logs, BI, etc.

Part of Netflix's success is powered by collecting this detailed information. Hiring 6 smart people to build something that can handle that volume of events is a complete no-brainer compared to the business value it generates.

southerntofu · on March 11, 2022

> Netflix does way more than "streaming big binary blocks of data from a CDN".

Yes, but our point (or at least mine) is that it should not. Transcoding takes place once and for all when you first upload the video, then you can call it a day and let caching proxies distribute bits to people who want to download them.

Do users benefit from the machine learning algorithms and statistics collected by Netflix? No. Managers and shareholders benefit from it, and they're gonna let such big corporations destroy our entire planet if that's what it takes for their fame and profits. Just think about the ecological impact of what you're building, for humanity's sake.

orf · on March 11, 2022

You’re focusing on the “ecological impact” of an event bus over the impact of massive transcoding jobs and the infrastructure required to serve it fast to users?

> Transcoding takes place once and for all when you first upload the video, then you can call it a day and let caching proxies distribute bits

Damn! You should let Netflix know that, I’m sure they would be relieved to know it’s as simple as that, as anyone who’s ever worked with serving video to many devices would attest!

> Do users benefit from the machine learning algorithms and statistics collected by Netflix?

Yes. Those metrics allow them to take targeted bets on new types of content that other studios might pass over, across the entire world.

Your whole argument seems to be that “Netflix could be run with just python -m http.server /movies/, all these kids better get off my lawn because they are messing it up with all this complexity that I don’t appreciate because I’ve got no clue what’s required to build and operate something like Netflix at scale”. I don’t buy it.

southerntofu · on March 11, 2022

> You’re focusing on the “ecological impact” of an event bus over the impact of massive transcoding jobs and the infrastructure required to serve it fast to users?

Not exactly no. I'm focusing on the excess, useless consumption of resources just to please psychopaths on a corporate board. Video distribution as a domain has inherent complexity that has to be dealt with, but there's obvious ways to greatly reduce the ecological impact of Netflix:

- distribute videos P2P, enabling shorter routes and less "edge" infrastructure

- remove all the useless DRMs so that people can share the videos on their USB key without requiring considerable network infrastructure

But of course business-minded people can't come to terms with the reality of how much they're fucking the environment just for their so called "property rights" and for the sake of imaginary numbers and very real control over people/workers.

> python -m http.server /movies/

Something like that. Better with a more solid HTTP server implementation (nginx?), and even better if you can replicate split/replicate the content across machines. Then you can leave Bittorrent protocol + HTTP(S) seeds do the job of delivering video on a budget. The only complex parts in such a setup is transcoding (but that's ok they've got machines to do it in advance) and maintaining a central database of everything available across storage servers (that's also OK given it's only a few million entries).

> I’ve got no clue what’s required to build and operate something like Netflix at scale”

Nope, i've got no clue, and i'm ok with that. I don't think anything approaching the scale of Netflix or Facebook should even exist. We need to dismantle industrial capitalism before it finishes dismantling our entire planet.

orf · on March 11, 2022

Ok sure but in the meantime the point still stands: Netflix needs it, both they and consumers get value out of it and serving video at their scale to the number of devices they support is much harder than “use nginx”.

If the system that enables these companies should be dismantled or not is a completely different argument to “should this event bus exist”.

southerntofu · on March 11, 2022

> If the system that enables these companies should be dismantled or not is a completely different argument to “should this event bus exist”.

Fair enough. But if the "170 MB of logs pre customer per day" suggested in another comment are even remotely true, that's still a huge mammoth to shave...

mattcwilson · on March 11, 2022

How much business value does it generate?

The article fails to mention.

I wonder if they are collecting any kind of A/B test data on ROI per new data platform feature?

Complete no-brainer to look at dollars gained vs time/dollars spent.

jamra · on March 11, 2022

I recently had a discussion with a former Netflixer about keeping track of user views. It’s important to pick up where you left off and is likely important to pay the owners of the movies. It’s a very hard problem to solve. The data is highly skewed by skips and afk viewers.

jiggawatts · on March 11, 2022

This is mere kilobytes of data over the length of the entire movie, and can tolerate hours of latency for most aspects (e.g.: analysis, charge-back, ML, etc...).

A lot of the time I look at "modern cloud architectures" and I see a firehose of raw data being spewed out in hideously inefficient formats. Think 1 kilobyte of JSON to represent a metric, often a single number or data point. It's not just NetFlix, this is the default for most of Azure Monitor and its agents as an example that would affect a lot of people.

The write amplification is just insanity. Orders of magnitude of overhead, not data.

It's as if the entire world forgot that binary formats exist. Google in no small part benefits from remembering this lost art -- they default to gRPC and similar packed binary formats. One reason Kubernetes (written mostly by ex-Googlers) is so fast is because internally it's all compiled code and gRPC. Compared to, say, Azure itself it is ludicrously faster. Literally hundreds of times lower latencies for common operations.

beebmam · on March 11, 2022

>One reason Kubernetes is so fast

I spit out my Pepsi at this. Some of us have actually used kubernetes with 5,000+ nodes. It's orders of magnitude slower than Mesos

tux3 · on March 11, 2022

Kubernetes may be depressingly slow on its own, but Azure might as well be powered by humans handling support tickets.

You can order a dedicated server at a traditional host faster than it takes to scale an AKS cluster sometimes.

Gravityloss · on March 11, 2022

This should be on the front page. We could always have had good fast metrics but the extremely inefficient data formats have made everything slow and expensive. It probably has a measurable ecological impact too.

orf · on March 11, 2022

> One reason Kubernetes (written mostly by ex-Googlers) is so fast is because internally it's all compiled code and gRPC. Compared to, say, Azure itself it is ludicrously faster.

Are you saying Azure, as a single entity, is not mostly "compiled code and (binary wire format)"? And that's why "Azure" is hundreds of times slower than "Kubernetes"? How does this make any sense?

Also gzipped json is pretty compact.

jiggawatts · on March 11, 2022

Azure is a random mish-mash of C#, Python, and Node.js -- or at least appears to be from the outside. None of those are compiled.

All of its protocols and APIs are JSON over HTTPS. I have never seen anything in Azure using a binary protocol. Ever.

mschuster91 · on March 11, 2022

> None of those are compiled.

C# as well as Python are compiled at build time. Python is a hot garbage for other reasons (just look at how dog slow the OpenStack CLI is, which is written in Python), but there's no need to pile on C#.

C# itself is actually pretty resource efficient - just look at StackOverflow, they use ridiculously few resources.

orf · on March 11, 2022

All of kubernetes external APIs are JSON over HTTPS as well, FYI. Like Azure, AWS and GCP internal communication is done via different protocols.

I hoped to convey the fact that you’re comparing apples to oranges. Of course a 3-master k8s setup with a pretty hard ~500 node cap is faster to make scheduling decisions than “Azure”, and it’s not anything to do with the data format the external API uses.

speedgoose · on March 11, 2022

That’s an interesting reason to explain Azure’s slowness.

jiggawatts · on March 11, 2022

Or anything for that matter. Write amplification or giving the computer "unnecessary, mandatory work" is one of the two big causes of poor performance despite ludicrously high performance hardware.

The other is latency. It's a metric that may as well not exist in the minds of 99% of developers or architects, but is the most important one by far.

If you ever wondered how it's possible to have 100 physical servers hosting an application that's glacially slow while no single part of the entire thing is running at more than 5% utilisation, these two are sufficient explanations most of the time:

    Work amplification + ignoring network latency.

Have you not ever wondered why something like Jira takes 10-30 seconds of wall clock time to display an empty form with less than 1 KB of text displayed on the screen?

It's burning through 100 million CPU cycles per byte that it's showing you! That's how.

That... or its server is twiddling its thumbs waiting for the network 29.99 seconds out of 30.00 seconds.

It's one or the other: There are no other options!

speedgoose · on March 11, 2022

That’s not a lot of data to keep track of that.

nvarsj · on March 11, 2022

I feel like the majority of celebrity tech companies are like this. They attract a certain kind of person who gets easily bored and needs interesting things to work on to stick around. There is also internal pressure to justify promotions by building things with "large impact". And CV padding to move onto the next big thing.

So you end up with lots of large, complex, engineering projects with virtually 0 or most likely negative actual business impact, to keep all the smart kids happy. And these beget more projects to deal with the complexity. It's pretty much Parkinson's Law applied to tech.

mschuster91 · on March 11, 2022

> and the recommendation system seems far worse than the old days

That's partially also because they lost or let expire the licenses to so much content.

Maybe governments should, in the interest of all people, mandate that all owners of IP/content should fairly license them to everyone who asks to operate a streaming or purchase service... that way we'd get competition on service quality and features between streaming providers instead of having to pay more for all the "content islands" than we did for cable TV.

raffraffraff · on March 11, 2022

Lol, reminded of of wat: https://www.destroyallsoftware.com/talks/wat

And I completely agree. What the hell are they doing, and is it worth the candle? Copyright aside, torrents and a shitty imdb wannabe site would have smaller CO² footprint and better content.

faangiq · on March 11, 2022

It’s clown eng run amok. But like any whale, no one tells them because they pay well.

carapace · on March 11, 2022

Reminds me of the Nutri-Matic machine:

> When the Drink button was pressed it made an instant but highly detailed examination of the subject's taste buds, a spectroscopic analysis of the subject's metabolism and then sent tiny experimental signals down the neural pathways to the taste centers of the subject's brain to see what was likely to go down well. However, no one knew quite why it did this because it invariably delivered a cupful of liquid that was almost, but not quite, entirely unlike tea.

tgsovlerkhgsel · on March 11, 2022

500 microservices is less than I expected, assuming they're actually "micro". Stuff that comes up from the top of my mind: bandwidth estimation, best cache finding, rating, recommendations, kids experience, various aspects of user management, subscription management, analytics, ... - each of these could be multiple microservices. And that's just the user facing part. They also have corporate stuff, ingestion, ...

ryandvm · on March 11, 2022

Couldn't agree more. While it's true that Netflix does have some unique engineering problems to solve, this smells like "Paycheck Driven Development". When you employ 10,000 software engineers, if you don't have enough problems for them to solve, they'll start inventing problems.

20 trillion events per day? Are you kidding me? This sounds like some sort of Montessori day care for software engineers.

mathgladiator · on March 11, 2022

I'll tell you how my team did it when I was an architect: https://dl.acm.org/doi/10.1145/3477132.3483572