NUMA Siloing in the FreeBSD Network Stack (2019) [pdf]

drewg123 · on Oct 7, 2021

This is my EuroBSDcon talk from 2 years ago. The more recent one (about getting to 400G) is up now at https://www.youtube.com/watch?v=_o-HcG8QxPc

throw0101a · on Oct 7, 2021

Looking to the 800Gb/s next year. :)

Is that with the ConnectX-7?

* https://www.nvidia.com/content/dam/en-zz/Solutions/networkin...

drewg123 · on Oct 7, 2021

No, just multiple CX6-DX

I'm currently stuck because the only boards we can find that have enough PCIe lanes exposed to do 800g (64 for NICs + close to 64 for NVME, with them more or less equally divided between sockets) have only 3 xgmi links to the 2nd socket, rather than 4. That causes uneven loading on the xgmi links, which causes saturation issues well below 800g

EDIT: I didn't realize the CX7 specs were public; I no longer feel quite so special :)

infogulch · on Oct 7, 2021

Great talk, thanks for sharing!

Have you looked into something like DirectStorage to allow the nic to request data directly over PCIe, thus cutting out memory bandwidth limitations entirely? This is what new generation consoles are using to load static scene data from NVMe drives to GPUs directly over PCIe, your workload also seems to match this model well.

Edit to add: Nvidia branded this as GPUDirect Storage in 2019: https://developer.nvidia.com/blog/gpudirect-storage/

Edit2: Oh neat this is mentioned in the connectx-7 doc linked in sibling comment.

drewg123 · on Oct 7, 2021

The problem is that with current generation NICs and NVME drives, you need a buffer somewhere, and there is no place for the buffer. You need a buffer because the NVME drive speaks in 4k chunks, and doing anything sub-4K is non optimal.

So picture that TCP wants to send 2 segments (1448* 2 == 2896), possibly from a strange offset (say 7240, or 5 segments in). You'd be asking the NVME drive to read 2 4K blocks and pull 2896 bytes out of the middle. What you really want to do is read the first 12K and keep it buffered.

With the current generation NICs, there is no place to buffer. With non gold-plated enterprise NVME, there is no place to buffer. NVME does have a thing called CMB, but the last time I searched, I could not find a drive with CMB that was large enough to be useful at a decent price.

The alternative is to have a box of RAM sitting on the PCIe bus. That's pretty much a GPU, but GPUs are expensive, power-hungry, and unobtainable.

So for now, its just not practical to do this.

toast0 · on Oct 8, 2021

> you need a buffer somewhere, and there is no place for the buffer.

> The alternative is to have a box of RAM sitting on the PCIe bus.

One thing I didn't think of until now is your 400 Gbps system uses effectively 32 PCI-e 4.0 lanes of Nic and 32 more lanes of storage (i think it's 64 lanes of 3.0 storage in system), and you mention elsewhere that your 800 Gbps system needs 64 lanes of NIC and about 64 lanes of storage, and that you're having a hard time finding that many lanes on a board that meets your requirements. You'd need more lanes for the PCI-e ram disk, even if you could find one more affordable and available than GPUs.

Sounds like Mellanox (or someone) needs to add some ram to their NICs. (Or wait for PCI-e 5.0 and DDR5)

a1369209993 · on Oct 8, 2021

> So picture that TCP wants to send 2 segments (1448* 2 == 2896), possibly from a strange offset (say 7240, or 5 segments in). You'd be asking the NVME drive to read 2 4K blocks and pull 2896 bytes out of the middle. What you really want to do is read the first 12K and keep it buffered.

The obvious solution here would be to just send less-than-maximum-size TCP segments. You have to support that to some extent anyway, to deal with the last segment of a non-multiple-of-1448 file, and to talk to things behind small MSSs. So send 4 1024-byte segments, or 3 segments of 1368+1368+1360 bytes (or whatever subdivision is technically convenient). If you can stream out 181 uint64s of data (from RAM, I assume?) in the first place, it shouldn't be that much harder to stream out 171 or 170.

There may be technical reasons why that won't work (eg the hardware is double-buffered and only supports current-packet and next-packet), but it's not just a matter of "TCP segments are a irregular size", because TCP segments are variable sized.

drewg123 · on Oct 8, 2021

That's the obvious solution. But we A/B test QoE metrics to death. Touching something so fundamental about TCP would surely hurt QoE.

The last time we tried to coerce TCP into helping with a problem like this, we were dealing with a NIC that could not save state in the middle of TLS records. Eg, if TCP sent 2800 bytes, then tried to send another 2800 bytes, the NIC would need to re-DMA those first 2800 bytes so as to be able to encrypt the 2800 bytes that it was asked to send.

  So we made the TLS record size variable from 16k down to 4K, and forced TCP to always send a complete TLS record at a TLS record boundary so as to avoid having the NIC have to re-DMA the start of a TLS record.  This was *MISERABLE* for QoE.  It wasn't the NIC itself; enforcing those restrictions led to horrible QoE with any NIC.

So no, I'm not gonna touch anything having to do with TCP :(

infogulch · on Oct 8, 2021

Have you considered using a cypher mode like AES-CTR [0]? That might reduce the need to buffer because later packets are no longer dependent on the encryption result of previous packets. Of course that would only work if clients support CTR, and if the NIC supports it, and depends on how much you trust the encryption, but if it applies it could help a lot and eliminate the whole issue of retransmissions causing IO spikes to rebuild the TLS state.

To be clear, I just wondered if such an encryption mode existed and found this after spending 2m googling it, so I have no idea what the adoption rate of such a cypher is or what the consensus is on its security. And for all I know you're already using it and it doesn't solve the problem because something else is forcing the NIC to rebuild state.

[0] https://datatracker.ietf.org/doc/html/draft-ietf-tls-ctr-01#...

> AES-CTR is capable of random access within the key stream. For DTLS, this implies that records can be processed out of order without dependency on packet arrival order, and also without keystream buffering

dragontamer · on Oct 7, 2021

> Have you looked into something like DirectStorage to allow the nic to request data directly over PCIe, thus cutting out memory bandwidth limitations entirely? This is what new generation consoles are using to load static scene data from NVMe drives to GPUs directly over PCIe, your workload also seems to match this model well.

Lets say an Ethernet Port on NUMA#1 is asking for a movie found on a NVMe SSD on NUMA#3.

There's simply no way for you to get the data unless you traverse the NUMA-fabric somehow. Either NUMA#1 tells NUMA#3 "Yo, start sending me data so that I can reply to this TCP/TLS/HTTP message"... or NUMA#1 offloads the job to NUMA#3 somehow (Its now NUMA#3's problem to figure out how to talk back).

----------

There are many ways to implement the system, each with apparently their own pros and cons. There are further issues: in an earlier post a few weeks ago, drewg123 explained that his system / bosses demanded that each server was only allowed to have 1 IP Address (meaning all 4 NICs had to cooperate using link-aggregation).

You can effectively think of this as "each Netflix stream comes in on one of the 4-random ports" (which may be on NUMA#1 or NUMA#2). While each bit of data is scattered across NUMA#1, #2, #3, and #4.

---------

Note: NUMA#1 and NUMA#2 have the NICs, which means that they have fewer remaining PCIe lanes for storage (The ConnectX NIC uses multiple PCIe lanes). As a result, most of the movie's data will be on NUMA#3 or NUMA#4.

-------

Note: I expect the NUMA-fabric to be faster than PCIe fabric. NUMA-fabric is designed so that different chips can pretend each other's RAM at roughly 50+GBps bandwidth at something like 300ns latencies. In contrast, PCIe 4.0 x16 is only 30GBps or so... and a lot of these NVMe SSDs are only 4x lanes (aka: 8GBps).

You also need to handle all the networking tidbits: HTTPS has state (not just the TLS state, but also the state of the stream: who's turn is it to talk and such). Which means the CPU / Application data needs to be in the loop somehow. I know that the most recent talk offloads the process as much as possible thanks to the "sendfile" interface on FreeBSD (also available on Linux), which allows you to "pipe" data from one file-descriptor to another.

infogulch · on Oct 7, 2021

Both NICs and SSDs are connected to the CPU via PCIe: all SSD and NIC traffic already go through PCIe before they hit main memory today.

PCIe 4.0 x16 is 64GB/s, and there are 8 groups of x16 lanes in a 128 lane EPYC, for a total of 512GB/s or 4Tb/s, more than double the max fabric bandwidth of 200GB/s.

Lets take the original diagram from the video:

               CPU
              ↑   ↓
    Storage → Memory → NIC

He was able to use NIC-kTLS offloading to simplify it to this:

    Storage → Memory → NIC

Now lets add a bit more detail to the second diagram, expanding it to include PCIe:

             Memory
              ↑  ↓
    Storage → PCIe → NIC

This third diagram is the same as the diagram above it, except it explicitly describes how data gets from storage to memory and from memory to NIC.

So the story for a request is something like: 1. request comes in for a file, loop { 2. CPU requests chunks of data, 3. data is delivered to memory via PCIe and signals CPU to handle it, 4. CPU tells NIC to send data from memory, 5. NIC requests data from memory via PCIe, sends it out on port }

If you squint this looks kinda like the first diagram where there were extra unnecessary data transfers going up through the CPU, except now they're going up through main memory. My proposal is to skip main memory and go straight from storage to NIC as in:

    Storage → PCIe → NIC

The story for serving a request would now be: 1. request comes in for a file, 2. CPU tells NIC: "ok get it from SSD 13", 3. NIC requests data from SSD via PCIe, sends it out on port, 4. cpu & main memory: crickets

toast0 · on Oct 7, 2021

From drewg123's other threads, it seems like their machines are also hitting (or approaching) memory bandwidth limits, so being able to reduce some of the memory write/read requirements should help with that.

I think what was working best was having the disk I/O write to RAM that was NUMA aligned with the NIC, so disk -> PCIe -> NUMA Fabric -> RAM -> PCIe -> NIC.

If instead you could do disk -> PCIe -> NUMA Fabric -> PCIe -> NIC, at least for a portion of the disk reads, that would still be the same amount of traffic on the NUMA Fabric, but less traffic on the memory bus. This probably means that the NIC would be doing more high latency reads though, so you need more in-flight sends to keep throughput up.

throw0101a · on Oct 7, 2021

Video of the talk:

* https://www.youtube.com/watch?v=_o-HcG8QxPc

* https://2021.eurobsdcon.org/home/speakers/#serving

Playlist of all the 2021 EuroBSDCon videos:

* https://www.youtube.com/playlist?list=PLskKNopggjc4dadqaCDmc...

perihelions · on Oct 7, 2021

Related (from a more recent talk):

https://news.ycombinator.com/item?id=28584738 ("Serving Netflix Video at 400Gb/s on FreeBSD [pdf]")

There's also an AMA by the author (in the HN thread).

jeffbee · on Oct 7, 2021

I live for getting downvoted on HN so I'd just like to point out that this deck supports my previously-expressed opinion that the AMD EPYC architecture is harder to use. Out of the box, the Intel machine that is obsolete on paper was beating the EPYC machine by more than 50%.

adrian_b · on Oct 7, 2021

Intel Xeons for servers have a few features outside the CPU that AMD Epyc are lacking for now, e.g. the ability to transfer data directly between the network interface cards and the CPU cache.

These features are usually exploited by high-performance networking applications and they can provide superior performance on Intel, even if the Intel CPUs are inferior.

As long as the application is dominated by the data transfers, such extra features of the Intel uncore can provide superior performance, but when the application needs heavy processing on the cores, the low energy efficiency of the Intel Ice Lake Server or older Xeons allows Epyc to deliver better results.

drewg123 · on Oct 7, 2021

The feature you're talking about, DDIO, is worse than useless for our application. It wastes cache ways on I/O that has long, long, long since been evicted from the cache by the time we go to look at it.

It might be helpful for a low-latency polling sort of scenario, but with interrupt coalescing all it does is waste cache.

wmf · on Oct 7, 2021

Intel's unified mesh does have some advantages over AMD's quadrants, but Netflix's workload is pretty unusual. Most people are seeing better performance on AMD due to more cores and much larger cache.

drewg123 · on Oct 7, 2021

Note that this was with the chip running in NUMA (4-NPS) mode. It did much better in 1-NPS mode when paired with a kernel without NUMA optimizations.

Rome in 1-NPS mode is just fine, far better than naples.

monocasa · on Oct 7, 2021

EPYC isn't NUMA anymore.

dragontamer · on Oct 7, 2021

Both EPYC and Intel Skylake-X are NUMA.

Yes, Skylake-X. It turns out that the placement of those L3 caches matter, and some cores are closer to some memory controllers than others.

https://software.intel.com/content/www/us/en/develop/article...

------------

Some cores have lower latency access to some memory channels than other memory channels. Our modern CPUs are so big, that even if everything is on a singular chip, the difference in latency can be measured.

The only question that matters is: what is the bandwidth and latencies of _EACH_ core compared to _EACH_ memory channel? The answer is "it varies". "It varies" a bit for Skylake, and "it varies a bit more" for Rome (Zen 2), and "it varies a lot lot more" for Naples (Zen1).

---------

For simplicity, both AMD and Intel offer memory-layouts (usually round-robin) that "mixes" the memory channels between the cores, causing an average latency.

But for complexity / slightly better performance, both AMD and Intel also offer NUMA-modes. Either 4-NUMA for AMD's EPYCs, or SNC (Sub-numa clustering) for Intel chips. There are always a set of programmers who care enough about latency/bandwidth to drop down to this level.

monocasa · on Oct 7, 2021

It looks like the parent edited the context out of their post.

They were specifically calling out EPYC's extreme NUMAness, in contrast to Intel's, as the cause of their problems. That distinction has more or less been fixed since Zen 2, to the point that the NUMA considerations are basically the same between Intel and AMD (and really would be for any similar high core count design).

jeffbee · on Oct 7, 2021

Don't blame me for editing out something you imagined. I didn't touch it. If you're having problems with hallucinations and memory see a neurologist.

MisterTea · on Oct 7, 2021

I can not find anything to back up this claim. How else is AMD linking the multiple dies/sockets together?

monocasa · on Oct 7, 2021

The DDR phys are on the I/O die, so all of the core complexes have the same length path to DRAM.

Multi socket is still NUMA, but that's true of Intel as well.

dragontamer · on Oct 7, 2021

> The DDR phys are on the I/O die, so all of the core complexes have the same length path to DRAM.

The I/O die has 4 quadrants. The 2 chips in the 1st quadrant access the 1st quadrant's 2-memory channels slightly faster than the 4th quadrant.

> Multi socket is still NUMA, but that's true of Intel as well.

Intel has 6-memory controllers split into pairs of 3 IIRC (I'm going off of my memory here). The "left" 3 memory channels access the "left 9 cores" a bit faster than the "right 9 cores" in an 18-core Intel Skylake-X chip.

--------

Both AMD and Intel have non-uniform latency/bandwidth even within the chips that they make.

monocasa · on Oct 7, 2021

There's a few cycles difference based on how the on chip network works, but variability in the number of off chip links between you and memory is what dominates the design. And in the context of what the parent said (but has since edited out), was what was being discussed.

thinkingkong · on Oct 7, 2021

Previous discussion https://news.ycombinator.com/item?id=28584738

detaro · on Oct 7, 2021

this is not a previous discussion of this talk.

thinkingkong · on Oct 7, 2021

Youre totally right. The talk I linked simply references NUMA siloing as a way of boosting throughput for their usecase.