How Netflix works: the stuff that happens when you hit Play

luckydude · on Nov 5, 2017

So this is perhaps a lame post but I'm super tired, been up since 4am. I'll try and follow up with more detail in the morning.

This article, in my opinion, is way off. I base that on the fact that I've been dancing with Netflix for a month, I might end up working with them to try and make NUMA machines serve up content faster. As in I'd be working on exactly what happens when you hit play.

My take on how Netflix serves you movies is nothing like what this article says.

They have servers in every ISP, the servers send a heartbeat to a conductor in AWS, the heartbeat says "I've got this content and I am this over worked", when you hit play the app reaches out the conductor and says "I want this", the conductor looks around and finds a server close to you that is not overloaded and off you go.

That might look easy. It's not. Take a look at this post about how they fill a 100Gbit pipe: https://news.ycombinator.com/item?id=15367421

I'm a kernel guy, I'm old school, I get what they are doing there, that is impressive.

I wish hacker news got excited about the filling the pipe post and less excited about this thread.

Slartie · on Nov 5, 2017

Hehe, I can feel your pain. Had a discussion just a few days ago with some guys demonstrating the typical enthusiasm about cloud stuff, especially AWS. Because after all, Netflix is using it, and they create 1/3rd of all Internet traffic, right?

There's this very widespread misconception about Netflix being a monolithic service running on Amazons' cloud infrastructure, even though the truth is that just the "rather boring" routine stuff like billing, view history tracking, suggestions, everything necessary to show the UI is running there. Netflix does a good job at this, but after all it's not that impressive when you've architected some distributed systems yourself. Not even their extreme take on the microservices concept - that is after all just a nice way of letting their devs do "their thing" in the way they want with as little restrictions from the environment as possible (which basically only works because they clearly have above-average-competence devs who can deal with these degrees of freedom).

What's really crazy is the way they squeeze unimaginable amounts of bytes per second out of modern hardware and into the internet infrastructure. They saturate 100Gbit links that usually serve an aggregation of many boxes in a datacenter with just ONE box! This is way beyond what even most above-average devs are capable of doing - you NEED old-school guys which still managed to stay on top of the crazy stack created by the evolution of hardware and low-level software in the last decade. There's not many of those out there, and Netflix apparently managed to catch a good bunch of them. These guys do the magic, and the magic they do never touches that damn Amazon Cloud. It just floats way above it.

SomeStupidPoint · on Nov 5, 2017

Netflix is my goto example of "do your core competency in-house, outsource the rest".

They do one thing themselves in their stack: their distribution network for content, and they do an incredible job of it. Every post I see about Netflix's CDN for video is insightful and a learning experience.

Then they throw a CRUD app in the cloud on top of it and call it a day. Okay, a little simplified -- there's still some neat tech in the DRM, in load-balancing the CDN, and in keeping all of their tech highly available. But conceivably, Netflix could retain much of their value by simply offering all the content to other websites that displayed it to customers -- the hard part of what they do is the CDN (and contracts with content owners), and opening their platform to other interfaces doesn't change that. (Heck, Netflix might be worth more if they opened their content to other interfaces, since they're not actually very good at the front end experience.)

But when I was doing consulting (about cloud stuff), that was my advice: do the core of your business yourself (eg, CDN) then offload as much of the rest as you can.

luckydude · on Nov 5, 2017

You stated it very well, especially that last paragraph.

I'm busily bringing lmbench up to date and seeing if I can get my mojo back enough to hang with these guys and do some work. It's both exciting to think about kernel work and depressing to realize so few people understand it any more.

X86BSD · on Nov 5, 2017

That's what I find most impressive. The insane amount of bits they do from a single FreeBSD box.

And I concur as well on the BSD crew responsible for engineering it. The grey beards are getting harder to find as they retire and I weep each time one does.

nickjj · on Nov 5, 2017

Is it every ISP, or only certain ISPs who opt in?

I remember OpenConnect being mentioned back when the ISP I use came on board. The service has been really good. I haven't gotten a single buffer in like 3 years.

If anyone is curious, glance through this. It's a pretty cool initiative:

https://openconnect.netflix.com/en/

If you get crappy Netflix service you should call your ISP and point them to this page.

side_up_down · on Nov 5, 2017

> I wish hacker news got excited about the filling the pipe post and less excited about this thread.

That's disingenuous considering the number of votes and comments each thread have received. HN IS more excited by the pipe post as reflected by user participation.

Would love to hear more detail insofar as this article is concerned.

luckydude · on Nov 5, 2017

The 100Gb pipe article was off the front page and not really commented on that much. I see post after post about containers this and vagrant that, all sorts of posts about stuff that I find pretty uninteresting. Something as remarkable as the 100Gbit thing gets a little attention but it is clear that it's not that interesting to the majority of the people here.

I'm just depressed that real low level systems stuff seems to be not sexy any more. When I was at Sun the kernel group was the unchallenged top of the heap, it was the place to be. When I was fixing source management at Sun (as a hobby, it wasn't my official job) they asked me to go work in the tools group. Are you frigging kidding me? Noone in their right mind would leave the kernel group for the tools group. Sorry to be snooty but that just wasn't a thing.

I just got back from a FreeBSD conference at Netflix and was asking about the state of the world and it's depressing. People don't seem to write solid papers like they used to. Sun produced papers on vnodes, the VM system architecture and implementation, Sparc, I wrote one on making UFS perform like an extent based file system. Who is doing that now? I looked in the usual conference proceedings and it was full of academic stuff, not very interesting stuff, but no industry stuff.

sliken · on Nov 10, 2017

I think that must have been common. Nobody good worked in tools. I managed a couple sunos and later solaris boxes. Userspace was a disaster. I ended up replacing it all with GNU and every piece I installed had a warning. Don't use Solaris make. Don't use solaris tar. Don't use the solaris compiler, etc. Ended up having to install gcc from binaries, then build zip, make, etc. Even tail (max 10k lines) and awk (max 16 columns or something) had issues. While debugging even xterm would die if you made it too wide.

Once you replaced the userspace with GNU + X11 the kernel was solid, although SunOS and early solaris didn't multitask well. For some workloads I'd end up disabling all but one CPU.

luckydude · on Nov 11, 2017

So I believe you on the Solaris userspace, it was full of System 5 garbage. GNU was a much better answer.

But on SunOS 4.x, you had userspace problems? That was a pretty stock BSD userspace with a lot of bugfixes. Every open source makefile in the world just worked when you typed make on SunOS. Seemed pretty solid to me, I think the only thing I would do is add perl, it took Sun a long time to decide to include perl (if they ever did).

arca_vorago · on Nov 5, 2017

Is there a GNU/Linux project equivalent of the netflix/freebsd sendfile? Besides this use case, what is your assessment of the linux (netfilter) vs bsd netstacks? What kernel is going to be at the center of faster and faster switch fabrics?

luckydude · on Nov 5, 2017

I wrote up a hand wavy way of doing a generalized sendfile here:

http://mcvoy.com/lm/papers/splice.pdf

and linux took some of that. As I recall, they don't use it much because the user land apps (like web servers) end up touching the data for TLS (https).

Netflix/FreeBSD are working with the NIC folks to get the byte by byte TLS pushed down in the card along with the TCP offload. I think they'll get there first because BSD is so heavily used by CDN people (not just Netflix, Limelight is very active in BSD as well).

Thaxll · on Nov 5, 2017

Senfile is a syscall in the Linux kernel since 2003~~

anonu · on Nov 5, 2017

Can you recommend a more technical version of OP's article or a video-taped talk?

kaplun · on Nov 5, 2017

An Amazon Web Services data center in Frankfurt, Germany, specially dedicated to CERN.

I believe this is actually the CERN computer center itself and totally unrelated with Amazon.

http://cds.cern.ch/record/1103476

qualitytime · on Nov 5, 2017

You know what, I'm going to use this post to tell you what happened last night when I hit play in firefox.

Nothing happened.

I saw The spinning loading wheel and the firefox/netflix header saying "blah blah audio video software being installed try again blah restart blah"...

I waited, I reloaded the page, I googled, I checked the DRM settings, I did blah blah blah.

Wasted time and frustrated I then opened microsoft edge and everything worked.

You know that firefox share is at 8% according to W3Counter? You know this kind of crap will only reduce this?

And then we'll only have the big daddy corporate sponsored browsers.

And all because users want to watch netflix.

What a bag of mediocre horse shit.

kilburn · on Nov 5, 2017

I'm not upvoting because of the snarkiness, but I found myself in the exact same situation yesterday. Today it magically works fine again.

If there is any netflixer around here: please, please push for better testing to prevent that from happening again (i.e: firefox should be one of the test platforms!)

crypt1d · on Nov 5, 2017

FWIW, Firefox has been behaving quite weird for me for the last 2-4 weeks (Linux version). It randomly freezes and slows down during page loading, even crashed a few times. Seems to happen mostly when there is some video content on one of the tabs, so I'm blaiming the plugins for now. Kinda sucks because I switched from Chrome few months ago only. I really hope I didn't make a mistake.

bababooey · on Nov 5, 2017

Same thing happened to me last year. I got very hyped up on the switch. But then it turned into hot garbage.

Tried to get support on the firefox subreddit and they told me I shouldn't be using Firefox Beta if I'm not "technically inclined". Okay, so I switched to the normal release and it was too slow for me to even bother. Back to Google...

I'm wondering if their new quantum engine fixes issues. Are you on that newest version?

JBlue42 · on Nov 5, 2017

Have also had issues for a month or so now. Weird stuff like trying to do a search in incognito mode, which should do a Google search, and it spinning forever.

zoeysaurusrex · on Nov 5, 2017

wscott · on Nov 5, 2017

BTW. Netflix is in the process of transitioning to TLS (https) transport. Not because they need it for any reason, but because shady advertisers keep snooping connections and using that to profile users. Netflix is tired of being accused of selling user data. My understanding is that the movie was already encrypted, but that stream could be fingerprinted to identify which movie it was.

Using TLS is a lot more expensive so that costs them money, so I have to respect them for that.

Grandma's TV is still unencrypted, but anyone who updates their client is protected.

They do have to deal with a whole ton of legacy clients.

arianvanp · on Nov 5, 2017

Alas. The https protected videos can be fingerprinted as well. https://news.ycombinator.com/item?id=14070130

johnwheeler · on Nov 5, 2017

i don’t understand how microservices solves the problem of broken interfaces. if an api changes or disappears, how is that different from the locations.txt file changing or disappearing?

i think it’s just another instance of humans making things more complicated than they need be. Same line of reasoning Linus went with a monolith vs a micro kernel.

reificator · on Nov 5, 2017

Microservices set out to do the same thing as Object Oriented programming set out to do. You define an {object,microservice} by exposing methods, and other services and clients interact by calling those methods. Theoretically, if your API remains stable, your internal implementation can change drastically and the system will continue to function as it should. There's no reason you can't draw these boundaries inside a monolith, but with microservices you'd have to go out of your way to not have those boundaries.

IMO by using HTTP to communicate, they ended up being significantly closer to the original concepts behind OOP and message passing.

http://lists.squeakfoundation.org/pipermail/squeak-dev/1998-...

Now, whether microservices are successful in dealing with the problems they set out to solve, and are worth the tradeoffs they entail is still up for debate.

solipsism · on Nov 5, 2017

Absolutely right. locations.txt is an API. And there's nothing monolithic about the example. It involves two completely separate apps with an API in between!

Godel_unicode · on Nov 5, 2017

I think you have missed the point. I think the point of that section actually has nothing to do with the locations.txt interface between the maps app and locolist. It's actually about the fact that, with microservices, any piece of the system only depends on it's dependency graph, not on every piece of the system working.

"... on a huge service like Netflix the entire application going down because a change was made to one part of it..."

To extend your OS example, just because there was a kernel panic in the Bluetooth stack is no reason to stop servicing requests over the Ethernet NIC which is already bound by the web server. There's also no need to edit or recompile the Ethernet driver because of a Bluetooth problem; see others posts about encapsulation.

ImSkeptical · on Nov 5, 2017

You might change the location of locations.text or store the locations in a database, or on the cloud, etc. You shouldn't change the commitment your service has to vend data about locations etc. You make a contract saying you will do X when given Y but you don't commit to anything about how you will do it, and that can change as necessary.

fuckuservices9 · on Nov 5, 2017

They don’t, they just let the trendy nerds beat off over unnecessarily complex software. C.f. The entirety of the JavaScript ecosystem.

rb666 · on Nov 5, 2017

"A special piece of code is also added to these files to lock them with what is called digital rights management or DRM — a technological measure which prevents piracy of films."

They should probably update this to say "tries to prevent", as NF DRM has been long cracked.

nitrogen · on Nov 5, 2017

Also nice would be a note that DRM prevents fair use.

colde · on Nov 5, 2017

Do you have a source for it being "cracked"?

Circumvented sure, but the actual cryptography components haven't been cracked as far as I am aware.

wolco · on Nov 5, 2017

I think lower resolutions have been cracked. I don't think 4K has been cracked yet but the comment might be related to this bug.

https://www.reddit.com/r/Piracy/comments/6pkypj/direct_strea...

jccalhoun · on Nov 5, 2017

I don't have a source that explicitly says it has been cracked but I saw a reddit post stating that all episodes of stranger things 2 was available on torrent sites less than 10 minutes after it went live on Netflix and people commenting on how it was possible because the drm was cracked

fasouto · on Nov 5, 2017

Correct me if I'm wrong but the data center picture looks very similar to the CERN computing center inside CERN installations in Meyrin.

saagarjha · on Nov 5, 2017

The caption says that it is:

> An Amazon Web Services data center in Frankfurt, Germany, specially dedicated to CERN.

ephimetheus · on Nov 5, 2017

But CERN is not in Frankfurt and there is a data center on site here.

jb1991 · on Nov 5, 2017

Another comment here claims that that photo is mislabeled and not an Amazon property.

Thaxll · on Nov 5, 2017

I still believe that Netflix stack is way too complicated for what Netflix needs. It's a CDN with a recommendation engine that is completely garbage ( the 95% stuff I'm interested ). Also comparing Youtube and Netflix speed, Youtube is like 2/3x faster to load any content.

X86BSD · on Nov 5, 2017

Their open conect FreeBSD boxes are nuts. The amount of data they spew is crazy. 1/3 of all internet traffic. One service. Mind blowing.

StillBored · on Nov 5, 2017

I suspect (having worked on an application handing > 100Gbit of I/O per node) that the OS choice doesn't really matter that much.

That is because in my case, the data path portion basically talked directly to a couple PCIe boards. It bypassed the entirety of the kernel outside of some setup API's to claim memory/interrupts/etc. That meant the transfer limits generally came down to lack of PCIe or memory bandwidth (depending on which generation of machine/configuration we were using). The CPU's in the machines spent 99.99% of their time running code we wrote. Despite the talents of most OS developers, generic OS/driver code is not optimized for absolute performance in one case, rather it tends to be tuned to perform well over a wide range of situations. The general goal is to be a fair arbitrator of system resources to multiple competing processes. Further, most general purpose OS's are under the assumption that I/O is slow or low bandwidth. Take the entirety of the linux filesystem/block layer/scsi layer, which is written under the assumption that the system is attached to a high latency low bandwidth spinning disk, so burning a few cycles coalescing requests, or handling the page cache isn't a big deal. That code doesn't scale when you plug it into a NVMe disk with 2GB/sec of bandwidth, much less a storage network with 100GB/sec of IO bandwidth.

Anyway, if you throw all these assumptions away and ignore modern "best practices" development models of assembling piles of unrelated libraries to solve a task, you end up with really lean (probably fits in the L1i cache) software that can perform two or three orders of magnitude faster than similar code written using modern methods.

luckydude · on Nov 5, 2017

That sounds like a ftp-like measurement of throughput, and yeah, what you said will work for that just fine.

Netflix connections are typically about 1mbit/sec each (older apps open up ~4 connections per video for reasons that are no longer valid but the apps aren't all updated).

So to fill a 100Gbit pipe they have 100,000 connections running at the same time. Which makes filling that pipe super super impressive.

StillBored · on Nov 6, 2017

In our case we were doing a fair amount of data manipulation, so it wasn't strictly a case of pushing the data through, although we had higher bandwidth per stream.

But, there are a bunch of different ways to solve the problems. I guess how impressive it is depends on they have gone about solving their particular cases. There is a fair number of network accelerators that offload individual stream level management to little cores running on the network adapter itself. Cavium, EzChip and now even companies like mellanox are playing in this space https://www.enterprisetech.com/2017/10/04/mellanox-etherneta....

So, i'm not sure the impressive parts are necessarily in the stream counts but what they must be doing to "align" (for lack of a better term) them. AKA the trade offs between keeping a few seconds of a video stream in RAM vs sourcing it from disk/wherever so that multiple users streams are aligned to avoid having to hit a secondary storage medium. In netflix's case I suspect that requiring fairly large buffers on the endpoint allow them to get away with a much lower QoS metric on any given stream.

Put another way, at least the few times I've watched netflix's bandwidth usage, it seems to be bursty. It blasts a few 10's of MB/s of data and then sits idle for a few seconds while the stream plays and then you get another chunk.

luckydude · on Nov 6, 2017

Randall Stewart at Netflix did a new TCP implementation that helps quite a bit. And he did this really cool thing for the nay sayers, he made it possible to have multiple stacks running in FreeBSD at the same time. I believe the default is you get the original stack, you can ask for his stack, and he did a super simple TCP stack just to show you how small a TCP stack could be.

They are using either Chelsio or Mellanox cards and they use the offload but they are doing TLS with the Xeon cpus. So they are getting 100Gbit while touching every byte.

And don't under estimate how hard it is to do 100,000 TCP connections. When I was at SGI we had a bunch of big SMP machines (I think they were 12 cpu Challenge) that someone was using to serve up web pages (AOL? It was someone big). Modems brought that machine to its knees. You would think that would be easy but it was not. A single (or small number of) fast streams is easy, a boat load of slow streams is hard. Think about it, if you have a TCP stack that gets a request and then nothing, you have all the overhead of finding that socket, doing that work, then nothing. It's way easier to have a stream of packets all for one socket.

It's that sort of stuff that they worked on so far as I can tell. Your caching idea is nice but the cache hit rate is very very low. They did way more work in the sendfile area, managing the page cache. Did you read Drew's post? It's worth a read for sure.

StillBored · on Nov 6, 2017

I didn't mean to minimize the difficulties of maintaining that many TCP connections (much less getting useful work out of them). I read the original article when it was on HN, but must have mentally thrown most of it away due to the freebsd bias. So I just reread it, and the fact that they are getting those numbers utilizing much of the OS buffer management and Nginx, is impressive by itself. But their difficulties sort of plays into my original assumptions. Basically, if you want cutting edge I/O perf your better off dumping most general purpose OS's I/O stacks unless you want to spend a lot of time re-engineering them to work around bottlenecks.

sendfile() is good, but the general concept tends to waste far to much time doing filesystem traversals, buffer management, dma scatter gather lists, and a bunch of other crap that gets in the way of getting a blob of data from the disk, encrypting it, and passing it off to a send offload to handle breaking up and apply the TCP/IP headers/checksums. Frankly the minimum MSS size is something that ipv6 should have fixed, given that no one is on 9600bps modems, but didn't.

Good for them for realizing that modern machines have a little less than a GB of bandwidth per pcie lane per direction, and memory bandwidth to match. If you don't mess up the CPU side of things you can even touch all that data once or twice and still maintain pretty amazing I/O numbers.

EDIT: Also in the case of x86 NUMA, you _REALLY_ want to make sure that the nvme/source disk, the memory buffer your writing to and the network adapter are on the same node with the core doing the encryption/etc. That is pretty easy if the "application" controls buffer allocation/pooling, but much harder with a general purpose OS which will fragment the memory pools.

kev009 · on Nov 6, 2017

We'll make it work on FreeBSD

toomuchtodo · on Nov 5, 2017

It gives me hope that someone could still go out and challenge YouTube using a stable of these at various internet exchanges and a central object store/web front end.

X86BSD · on Nov 5, 2017

I think YouTube simply has first to market mindshare. I personally use and prefer Vimeo. It's a better experience and quality. I don't think anything technical or special keeps YouTube number one other than first to market mindshare.

reificator · on Nov 5, 2017

Part of that mindshare is that I only go to Vimeo when I can't find something on YouTube, and then when I find it on Vimeo it's incomplete and in approximately 240p.

It's not fair to Vimeo, because they actually had a copy and YouTube didn't, but because of multiple experiences like that I was surprised to see a positive remark about their quality.

analogic · on Nov 5, 2017

just hope there's at least enough competition to keep them halfway honest

blocked_again · on Nov 5, 2017

> I don't think anything technical or special keeps YouTube number one other than first to market mindshare

It's the youtube content creators like PewDiPie, Casey Niestat, Kurzgesagt etc that makes Youtube special just like that community that makes Hacker news special.

pidybi · on Nov 5, 2017

nice to know ;)

reificator · on Nov 5, 2017

[flagged]

CaveTech · on Nov 5, 2017

[flagged]

reificator · on Nov 5, 2017

[flagged]

CaveTech · on Nov 5, 2017

[flagged]

reificator · on Nov 5, 2017

[flagged]

dang · on Nov 5, 2017

Would you all please stop? This subthread is off-topic, uncivil, and embarrassing.

We all understand the seduction of internet spats, but on HN we're trying for at least a little higher discussion quality. We need everyone to pitch in with that.

reificator · on Nov 5, 2017

My apologies. I was frustrated at claims that I was saying the opposite of what I intended, and when attempts to clarify backfired it fueled a desire to try harder to resolve the misunderstanding. I was out of line a few times because of that frustration.

Learning to temper that is difficult for me, as I'm not always the best communicator, but I value clear communication highly.

benchaney · on Nov 5, 2017

[flagged]

reificator · on Nov 5, 2017

[flagged]

benchaney · on Nov 5, 2017

[flagged]

reificator · on Nov 5, 2017

[flagged]

benchaney · on Nov 5, 2017

[flagged]

reificator · on Nov 5, 2017

[flagged]

benchaney · on Nov 5, 2017

If you are willing to accept that amount of inane pedantry, you may as well argue that CaveTech never said he disagreed with you, so your insinuation that he did is a straw man. Either way, the only stawman came from you, and there is still no "backlash".

dang · on Nov 5, 2017

Please stop. Posting like this only makes the thread worse and does less good the more right you are.

https://news.ycombinator.com/newsguidelines.html

alexnewman · on Nov 5, 2017

Wish they would have mentioned the widevine drm Shit show