Streaming video on 10 Gigabit Ethernet and beyond (bbc.co.uk)
80 points by howsilly on Oct 17, 2015 | 64 comments



Interesting, reminds me of a related question. I've looked recently for 10 gig ethernet on a new laptop and haven't been able to find it.

I know it is overkill, it's just that it has been about ten years already. Isn't it cheap enough yet? Can't a modern SSD keep up with it?


There are Thunderbolt adapters; as another comment noted, few other interfaces have the bandwidth.

There are a lot of other bandwidth issues that aren't sorted out yet, too. If the adapter and drivers don't have TCP offloading, you'll be very hard pressed to get more than 3Gbit for anything other than a UDP dump.

Most of the OS network stacks aren't properly tuned for 10Gbit either.

I'd say give it a few years; as it becomes more mainstream, things should improve.


I've had the pleasure of testing my rMBP on a 10GbE internet-facing port. I can get 4Gbps+ on a browser-based speedtest without any tuning. (And the full 10Gbps with iperf.)


And how did you connect it? Thunderbolt 2 adapter?


> If the adapter and drivers don't have TCP offloading

Any modern adapters you can think of that don't support TCP & UDP offloading (+ARP, etc.)? As far as I know, all of them support it.


Well, it doesn't apply to OP's question, which is why I didn't mention it directly, but I know for sure that most enterprise class virtualization software doesn't accelerate through to the VMs. (Yes, there is SR-IOV, but then you lose hot migration and other HA features. SolarFlare had it a while back but it no longer works)

I didn't mean to suggest that it's hard to find, just that there are a host of things that need to be in place before you'll get the expected speeds. This isn't a knock on the tech or any manufacturer, it's just that it's not mature enough to be like 1Gbit, where you plug it in and almost everything starts running at 100Mbyte/s.

Here's an example of what to expect (I have no affiliation with this thread): https://forums.creativecow.net/thread/197/860183


I doubt IP/TCP offloading even makes that big of a difference, if SIMD (SSE2+ or AVX2+) can be used. A single CPU core is probably capable of TCP checksumming more than 100 Gbps.

Of course it's a completely different story without SIMD. A naive traditional checksum loop with a register dependency stall is just not going to be fast.
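
For reference, the checksum being argued about is the 16-bit one's-complement sum described in RFC 1071. Below is a minimal Python sketch of the arithmetic only; it says nothing about SIMD or throughput, and the function name `internet_checksum` is just an illustrative choice.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """Return the RFC 1071 checksum of data (16-bit one's-complement sum)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    # Sum the data as big-endian 16-bit words.
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    # Fold the carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Example bytes taken from the worked example in RFC 1071.
sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
print(hex(checksum))
```

A receiver verifies a packet by running the same sum over the data with the checksum field included; a result of zero means the check passed.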


You forgot the cost of memory access.

The L3 checksum doesn't matter here because the IP header is small and the kernel has to read/write all its fields anyway.

The L4 checksum covers TCP/UDP packet data, which the kernel can avoid touching if necessary.

When a TCP sender uses sendfile(), the kernel does a DMA read from storage to a page if the data is not already in memory (in the so-called page cache), and just asks the network card to send this page, prepended with an ETH/IP/TCP header. That only works if the NIC can checksum the TCP packet content and update the header.

If the network card can do TCP segmentation offload, the kernel does not have to repeat this operation for each 1500-byte packet; it can fetch a large amount of data from disk, and the NIC will split the data into smaller packets by itself.
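
To make the segmentation step concrete, here is a rough Python sketch of the splitting a TSO-capable NIC performs on one large send. It is only a model: a real NIC also replicates and fixes up the headers and checksums for every segment. The `segment` helper and the 1460-byte MSS (1500-byte MTU minus 20-byte IPv4 and 20-byte TCP headers, no options) are assumptions for illustration.

```python
from typing import List, Tuple

MSS = 1460  # assumption: 1500-byte MTU, IPv4 and TCP headers without options

def segment(payload: bytes, first_seq: int, mss: int = MSS) -> List[Tuple[int, bytes]]:
    """Split one large send into (sequence number, chunk) pairs."""
    segments = []
    for offset in range(0, len(payload), mss):
        seq = (first_seq + offset) % 2**32  # TCP sequence numbers wrap at 32 bits
        segments.append((seq, payload[offset:offset + mss]))
    return segments

big_send = bytes(64 * 1024)  # one 64 KiB buffer handed down in a single call
segments = segment(big_send, first_seq=1000)
print(len(segments), "segments, last one", len(segments[-1][1]), "bytes")
```

Without the offload, the kernel would run the equivalent of this loop itself and build a header for each chunk.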


The benchmarking I had done had the [non-offloaded] bandwidth peak out around 2.5-3Gbit/s. Could have been trouble with the drivers, or a naive implementation, or any of a number of things. Didn't dig into it too deeply at the time as the offloading drivers worked fine.


simple checksum computation and/or verification is indeed something most cards can do (sometimes with restrictions: not for IPv6, not for VLAN...)

the other kind of offloading that the kernel can use is TCP Segmentation Offload (TSO), which is much more complex to implement in hardware, and you won't find it on cheap NICs (like Realtek)


But not all drivers/network cards are bug-free :/


Which raises an interesting question - why don't we have Thunderbolt switches for our networks? The switch could convert to Ethernet, reducing the cost of servers and giving desktops lower cost access to higher speeds. Yes, I know about the security issue, but I'm referring to corporate or internal networks.


Supposedly you can use a Mac Pro as a thunderbolt hub and connect other computers with thunderbolt speeds.


Windows doesn't keep up with it well. We had a lot of issues with VDI workstations running 10Gb interfaces. Lots of weird latency problems, solved by presenting a 1Gb adapter.


Completely the opposite of my experience with PyParallel; I can saturate 2x10Gb Mellanox-2 ($35 each off eBay) via TransmitFile() with about 3% CPU use.


The tests we run are based on our workload. Lots of Windows RPC, Outlook MAPI/RPC traffic and similar stuff.


Probably a combination of lack of need until recently (125MB/sec is more than fine to saturate a classic disk), parts cost and power consumption.

I can't think of anything except connecting to a _fast_ SAN that would require a 10GBE port in a laptop. Maybe something specialized for a network engineer, but even then it's probably easier to buy dedicated equipment for line rate port monitoring


The use case is high-def video editing, which people could do on laptops, and there are thunderbolt connections at that speed.

I found this:

http://www.fastestssd.com/featured/ssd-rankings-the-fastest-...

Pushing 3000MB/s, which is 3GB/s, or 24Gbit/s once multiplied by 8. I think it should be viable now, no?
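
A quick sanity check of that conversion, as a small Python sketch (decimal units assumed, so 1 Gbit = 1000 Mbit; the helper name is illustrative):

```python
def mbytes_to_gbits(mbytes_per_s: float) -> float:
    """Convert MB/s to Gbit/s using decimal units (8 bits per byte)."""
    return mbytes_per_s * 8 / 1000

ssd_gbits = mbytes_to_gbits(3000)      # the SSD figure quoted above
gigabit_in_mb = 1000 / 8               # what plain gigabit Ethernet can carry, in MB/s
print(ssd_gbits, "Gbit/s from the SSD vs 10 Gbit/s on the link")
print(gigabit_in_mb, "MB/s is the ceiling for 1 Gbit/s")
```

So a drive in that class could, on paper, fill a 10 Gbit/s link more than twice over, ignoring protocol overhead.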


Even a cheap $100-$200 SSD can do over 2Gbytes/sec these days:

    256060514304 bytes (256 GB) copied, 95.4762 s, 2.7 GB/s

The M.2 interface has finally shrugged off the SATA bottleneck for commodity hardware. It's common on new motherboards, new laptops, and I recently read there's similar circuitry in the new iPhone 6s.


> I can't think of anything except connecting to a _fast_ SAN that would require a 10GBE port in a laptop.

An off-the-shelf consumer RAID-5 NAS would require more than a gigabit port (so a 10 gig port would be needed).


A very cheap RAID5 setup with 4 spinning disks and a filesystem like ZFS or btrfs should get you about 300-500 MB/s, so 10 GbE is good for that setup on nodes.


The only way you're getting 300-500MB/s on a 4-disk ZFS RAID5/raidz is with very, very fast SSDs and a very, very fast CPU. Not exactly "very cheap". (The compute and I/O overhead for raidz is significant.)


This is... not my experience at all. I have a RAIDZ2 comprised of 6 4TB Seagate drives (the 5900RPM variety) and it can do about 600MBps read/write with moderate CPU usage on an Intel i3. A mirrored zpool of two Samsung 850 EVO SSDs can do nearly 1GBps read/write. That's not a particularly expensive setup.


600MB/s write over 6 drives in RAID-Z2 is very good on an i3.


Yep, I have some WD Red 3TB drives in a raidz6 pool.

At best I can get 60MB/s out of it. Each drive can do about 100MB/s sequential but that is rare.

Putting an SSD on for caching reads/writes really changes the calculus of this, though.


ZFS maybe not, but btrfs does it.


Building it in doesn't really make sense. Nearly no users, power hungry (although that has gotten better), bulky. And what connector do you use? 10Gig Base-T? Not really common and power intensive. SFP+ port? Really bulky.

There are Thunderbolt adapters. USB3.0 only has 4Gbit/s available, ExpressCard only 2GBit/s, so both aren't really good options.


Point-in-fact, 10G-base-T is exactly where Intel is trying to take the market.


USB 3.1 is 10Gbps I believe.


What do you do on a laptop that it requires 10 Gbit?


Don't know about OP, but I could use it to connect to fast storage on network and do some video finishing. Laptop has sufficient power, but no storage of that capacity.


Gigabit should be more than enough for that. There's no way your CPU/GPU can process video faster than Gbit/second.


Depends what I'm doing. For editing it's usually just fine. Color and fx work require higher bandwidth. For example, at 2k a single frame is 12 MBytes (times 24 or 25 per second, depending on the project). And we are at the dawn of an era where we are talking about 4k DCI or QHD mastering for all. That would be 48 MBytes per frame. So, we're looking at 300 and 1200 MBytes per second for a single workstation. Gigabit is not up to it.
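
The arithmetic behind those figures, as a short Python sketch. The frame sizes and the 25 fps rate are the ones quoted above; the names are illustrative.

```python
GIGABIT_MBYTES_PER_S = 125  # 1 Gbit/s expressed in MB/s (1000 / 8)

def stream_rate(frame_mbytes: int, fps: int) -> int:
    """Uncompressed stream rate in MB/s for a given frame size and frame rate."""
    return frame_mbytes * fps

rate_2k = stream_rate(12, 25)  # 2k frames at 12 MB each
rate_4k = stream_rate(48, 25)  # 4k frames at 48 MB each

for label, rate in (("2k", rate_2k), ("4k", rate_4k)):
    links = rate / GIGABIT_MBYTES_PER_S
    print(f"{label}: {rate} MB/s, about {links:.1f}x a gigabit link")
```

At 1200 MB/s the 4k stream is roughly 9.6 Gbit/s, which is close to the raw capacity of a single 10GbE link.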


I doubt your CPU/GPU can do color/fx at 1200 MBytes/second. Heck, I don't just doubt it, you're making it up.


If you say so. Explain how I am running 2K DPX 10-bit in realtime with color correction applied on Lustre then? Do my machine and Autodesk software perform magic?


Nothing much other than copying lots of stuff, but the tech is a decade+ old and I believe in future proofing when buying hardware that I'll use for years. I would pay an extra hundred for the port, but found zero options.


I have to wonder why 10 gig. Why don't they make an intermediate step? Like 5 gig?


Until recently Ethernet only increased in speed by factors of 10. This made sense when it was increasing 10x every 5 years, because it takes years for standards and products to be developed. But now there are intermediate speeds like 2.5G, 5G, 25G, and 40G.


from the article:

> Each core needs to generate a few thousand data packets per second, because Ethernet packets typically contain up to 1500 bytes. This gives the CPU around 100 microseconds to process each packet.

No it doesn't, not when using TCP Segmentation Offload (TSO).

This only works for a particular use-case: sending static data using TCP, but this is the most common use-case since a typical "video streaming server" is actually a simple HTTP server that serves static MP4/MPEG-TS data.

for each connected client this is what happens:

- nginx/apache does sendfile(file, sock, off, <large_number>)
- kernel issues a large (> 10kB) DMA read to the file storage backend into a set of memory pages and waits for completion
- kernel allocates/clones a small IP/TCP header (40 bytes)
- kernel gives that small header + set of memory pages to the network card, which will segment and create those 1500-byte packets and send them on the wire

if you have a lot of RAM, the read from storage could even be skipped because the previously read data pages are kept in the page cache with an LRU approach (this helps if clients are requesting the same file).

you can easily saturate a 10G link with spare CPU cycles on cheap hardware with that approach, no need to bypass anything.
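
As a rough illustration of the sendfile() path, here is a minimal Python sketch that serves one file to one client over loopback. It is not how nginx or Apache are written; `serve_file_once`, `fetch_all` and `demo` are hypothetical names, and the random 1 MiB payload stands in for a video file. `socket.sendfile()` uses `os.sendfile()` where the platform provides it and falls back to ordinary sends otherwise.

```python
import os
import socket
import tempfile
import threading

def serve_file_once(listener: socket.socket, path: str) -> None:
    """Accept one client and push the whole file with sendfile()."""
    conn, _ = listener.accept()
    with conn, open(path, "rb") as f:
        # The payload is handed to the socket by the kernel; it is never
        # read into a Python bytes object on the sending side.
        conn.sendfile(f)

def fetch_all(addr) -> bytes:
    """Connect, read until the server closes, and return everything received."""
    chunks = []
    with socket.create_connection(addr) as s:
        while True:
            chunk = s.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)

def demo() -> bool:
    payload = os.urandom(1 << 20)  # stand-in for a static MP4/MPEG-TS file
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick
    server = threading.Thread(target=serve_file_once, args=(listener, tmp.name))
    server.start()
    try:
        received = fetch_all(listener.getsockname())
    finally:
        server.join()
        listener.close()
        os.unlink(tmp.name)
    return received == payload

if __name__ == "__main__":
    print("payload intact:", demo())
```

Whether the NIC then segments and checksums the data is decided by the driver and hardware, not by anything visible at this level.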


Kind of arcs back to the days when people were putting HTTP servers in kernel space. Slightly different tack though.


but the same result: fewer copies.


Reducing context switches is probably just as important.


Context switches into kernel land are a lot cheaper on x86 than they used to be.


My state of knowledge leads me to think that bypassing the kernel requires some non-blob network drivers with which you can tinker around. Am I mistaken?

So right now, I am missing the information on what kind of NIC they were using. Any thoughts or comments on that, HN community?

What vendor and product model would be a reasonable entry point for such endeavours? Answers very much appreciated.


In the past, some of my colleagues have used Intel's 82599 NICs for kernel bypass. Their Linux driver is quite good, they have a DPDK platform for developing user-space apps to directly access ring buffers on the NIC, and if you do a quick search, you should be able to find examples online.

Cloudflare wrote a blog post recently about accelerated packet IO and their post mentions the 82599 NIC: https://blog.cloudflare.com/kernel-bypass/.


via netmap, yes.


You might be interested in this writeup by Luke Gorrie: https://github.com/lukego/blog/issues/13

His project, Snabb Switch, also utilizes the Intel 82599 10G NIC that 'xtacy mentioned.


The netmap project (http://info.iet.unipi.it/~luigi/netmap/) or Intel's DPDK help with this and don't require super fancy non-standard NICs.


Just a modified driver and a NIC that supports 'rings'.


Some Solarflare cards?

http://www.openonload.org/


That's a blob driver.


Is a single CPU core able to process a 4k/50fps video stream? Or is there no need for any processing, other than encapsulating it into data packets for sending to the network card?


Yes, assuming 4:2:0 that's just under 5 cycles per byte at 3 GHz, which is enough for simple processing, or 2.5 cycles per byte for 4:4:4. But they're using the other cores for processing and just one for network handling.
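
The cycle budget can be reproduced with a few lines of Python. The assumptions here are mine, not the article's: a 3840x2160 frame, 8-bit samples (so 1.5 bytes per pixel for 4:2:0 and 3 bytes per pixel for 4:4:4), 50 frames per second and a 3 GHz core.

```python
def cycles_per_byte(width: int, height: int, fps: int,
                    bytes_per_pixel: float, clock_hz: float = 3e9) -> float:
    """CPU cycles available per byte of uncompressed video on one core."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    return clock_hz / bytes_per_second

budget_420 = cycles_per_byte(3840, 2160, 50, 1.5)  # 4:2:0, 8-bit
budget_444 = cycles_per_byte(3840, 2160, 50, 3.0)  # 4:4:4, 8-bit
print(f"4:2:0: {budget_420:.2f} cycles/byte")
print(f"4:4:4: {budget_444:.2f} cycles/byte")
```

Under those assumptions the 4:2:0 stream is about 622 MB/s and the 4:4:4 stream about 1244 MB/s, which is where the roughly 5 and 2.5 cycles per byte come from.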


no. he says he can only spare one core for network. so he used network cards for a 1 to 1 physical cable. something that could have been done much more easily with a dedicated pcie card.

other than that they are just bypassing tcp, arp, ethernet, etc. basically one pc has a kernel that says "every one written to this file descriptor turns the voltage up on this cable" and the other pc has a driver that says "every voltage up on this cable writes a one to this file". then they add some rudimentary sync logic for the timings. maybe just a known initial handshake that both expect and know... like a modem handshake.

i wonder if there is already a well known project/Linux kernel driver for this dumbed down network-as-fast-interface around or if they are writing it. the article is really light on detail.


They already have a dedicated PCIe card -- it's the standard 10GigE NIC they're using.

What they're doing is bypassing the kernel overhead of header parsing, demultiplexing and copying to user space. Instead, the network card's ring buffers are mapped directly into the user processes' address space.

The application is almost certainly still talking TCP/IP (or maybe UDP), and there are no changes at the physical layer at all. It isn't a case of a file descriptor being hooked up to generate voltages on a cable at all -- in fact, the overhead of read()/write() calls on a file descriptor is one of the things they're trying specifically to avoid!

http://dpdk.org/ is an open-source library to implement this kind of thing, mostly aimed at Intel NICs.


From the description, they use "proper" packets. Probably not TCP, but UDP or their own protocol would still enable them to use a lot of existing tech instead of special purpose devices. Which I guess is the reason why they don't use "dedicated PCIe cards" with special cabling (if I understand correctly what you mean).


What do you mean? There's still a network card at both ends of a connection, which handles physical layer tasks like voltages or timings. A kernel can't perform those tasks.


Clearly not, which is exactly why kernel bypass is such a huge red flag: it's only possible (rather, useful) if you aren't doing anything sensible with the data anyway, or the kernel overhead would be tiny compared to the processing.

Use the right tool for the job and don't funnel network data through your instruction pipeline. When they realized that for memory, they called it "DMA", and when graphics was scaling up, we created the GPU.


The networks used by supercomputers have had kernel bypass with DMA for more than a decade... and they do a lot of processing on the data, too. Check out InfiniBand and Intel's Omni-Path for modern examples. Or the Cray T3E (1995), which had an excellent network with a user-level DMA engine that only did 8- or 64-byte transfers.


Are you kidding me? How could not bypassing the kernel POSSIBLY be better here? It's infinitely harder for kernel devs to optimize these scenarios than for userland devs.


What would be "the right tool for the job" here? If no processing is needed, why is kernel bypass "a huge red flag"?


They probably could have used cards/systems capable of RDMA instead, but I don't see what's wrong with their approach.


Linus Tech Tips talked about the difficulties in getting 10 Gigabit working.

https://youtu.be/D03t890dKTU


All this talk about TCP and HTTP streaming when they are actually trying to do low-latency UDP for broadcast production.

(When you have a hammer (web technologies) everything looks like a nail I guess)



