Intel Storage Performance Development Kit (github.com/spdk)
89 points by luu on Nov 5, 2015 | 30 comments



"operating in a polled mode instead of relying on interrupts, which avoids kernel context switches and eliminates interrupt handling overhead"

Yeah, that's great if all you're running is a benchmark. As soon as you need to combine this with a network stack, also polling its own devices as fast as it can, it becomes a lot harder to avoid those context switches. If you're running an actual application it becomes even harder. Likewise if you have more devices than you have cores to spin waiting for them.

In an extremely latency-sensitive and resource-rich environment this kind of thing can yield great results, but otherwise it's almost a form of cheating. Yes, that's what I was accused of when I wrote network drivers at Dolphin, and again at SiCortex, because some people felt they polled too much. Oddly enough, users felt that their CPUs should be running their applications, and were more interested in maximum throughput per cycle consumed. Hardware designers don't build interrupt-based interfaces just for fun, and it's worth remembering that Intel makes most of its money from selling CPUs. Think about it.


It's certainly true that the polled-mode driver model doesn't interact well when the application needs to use other APIs that don't provide a polled mode. However, SPDK can be used in conjunction with a polled user-mode network stack so that a storage application can operate fully in user mode without any user-to-kernel context switches or hardware interrupts.

It's definitely not a drop-in replacement for a kernel storage stack in the general case, but rather an optimization for specific applications (e.g. storage appliances) that can be structured to take advantage of the polled/no-interrupts model.
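
(For concreteness, a minimal sketch of the "fully in user mode" shape described above: one thread polling both a NIC receive queue and a storage completion queue in the same loop. The poll and handler functions are placeholders, not real SPDK or network-stack APIs.)

    /* Sketch of a single-threaded user-space reactor that polls a storage
     * completion queue and a NIC receive queue in one loop. nic_poll_rx(),
     * storage_poll_completions(), and the handle_*() functions are
     * placeholders, not real SPDK or network-stack calls. */
    #include <stdbool.h>

    struct packet;          /* opaque handle for a received frame           */
    struct io_completion;   /* opaque handle for a finished storage command */

    /* Placeholder poll functions: return the number of items drained. */
    int nic_poll_rx(struct packet **pkts, int max);
    int storage_poll_completions(struct io_completion **cpls, int max);
    void handle_packet(struct packet *p);
    void handle_completion(struct io_completion *c);

    void reactor_loop(volatile bool *running)
    {
        struct packet *pkts[32];
        struct io_completion *cpls[32];

        while (*running) {
            /* Drain whatever the NIC has ready; no interrupt, no syscall. */
            int n = nic_poll_rx(pkts, 32);
            for (int i = 0; i < n; i++)
                handle_packet(pkts[i]);

            /* Then drain finished storage I/Os from the same thread. */
            int m = storage_poll_completions(cpls, 32);
            for (int i = 0; i < m; i++)
                handle_completion(cpls[i]);

            /* If both polls came back empty we simply spin; see the
             * latency/CPU tradeoff discussed further down the thread. */
        }
    }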


Thank you. That's kind of the point I was trying to make. For a storage appliance - and that's the most common deployment model for the software I work on - it's great. I just don't want to see every "full stack" halfwit trying to use a specialized tool for general-purpose stuff. That often ends up wasting everyone's time - especially that of the people providing the tool.


That said, there are cases where it would make sense in a multicore environment to pin one core to storage polling, and another to network polling.

You will then, of course, have contention in transitioning data between the two, but there are existing and useful models for eliding much of the locking load there.
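
(A minimal sketch of the kind of lock-eliding handoff mentioned above: a single-producer/single-consumer ring between the network-polling core and the storage-polling core, built on C11 atomics. Illustrative only, not an SPDK or DPDK facility.)

    /* Single-producer/single-consumer ring for handing work items from a
     * network-polling core to a storage-polling core without locks. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 1024   /* must be a power of two */

    struct spsc_ring {
        void *slots[RING_SIZE];
        _Atomic size_t head;   /* advanced by the consumer */
        _Atomic size_t tail;   /* advanced by the producer */
    };

    /* Producer (network core): returns false if the ring is full. */
    static bool ring_push(struct spsc_ring *r, void *item)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;                          /* full */
        r->slots[tail & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer (storage core): returns NULL if the ring is empty. */
    static void *ring_pop(struct spsc_ring *r)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                           /* empty */
        void *item = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return item;
    }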


I don't understand where you're getting the increase in context switches from. The expectation should be the opposite, since the network stack is almost certainly going to be user-space and in the same process.

Does it make sense for every application? Unlikely. Does it make sense for an application that runs on 10k nodes, and where moving all IO to poll mode user-space doubles the number of requests you can serve per unit of time? Saving 5k machines worth of capex buys a lot of engineering complexity.

And since we know that systems like this have been done, 2x should not be an unreasonable number. See for example https://www.usenix.org/system/files/conference/osdi14/osdi14... and https://www.cs.cmu.edu/~hl/papers/mica-nsdi2014.pdf


It all depends how many cores you're willing to burn. If you're getting 2x performance but you only have 75% as many CPUs left for your application, you're going to need more than half as many nodes. At SiCortex we were selling into HPC. These people were thoroughly used to working on football-field-sized compute complexes, and they still bitched about how much CPU time was being burned in the I/O stack. No matter how many cycles they had available, they wanted those cycles used for their apps.

As I also said before, this kind of thing does have its place. It all depends on how many cycles you're likely to spin before you actually find anything to do, and whether you have another need for those cycles. If it's not many cycles because you really are pushing a lot of I/O, that's great. If you don't need those cycles for something else because you're an appliance and this is your only job, that's great too. If it's a lot of cycles and you do need those cycles for other things - which is the most common case - then burning lots of cycles busy-waiting for events that haven't happened yet only decreases real hardware utilization and increases either capex or time to completion.


You're assuming this uses more cycles, but that's probably false. All the interrupts, context switches, and kernel processing related to IO add up to a lot of processing. You have control here over how many cycles to burn busy-waiting; it's the classic latency vs. throughput tradeoff. Longer sleep times between polls mean less busy waiting and more latency.
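
(Roughly the knob being described, as a sketch: poll hard while there is work, back off toward short sleeps when idle. The thresholds and sleep interval are arbitrary, and poll_once() is a placeholder.)

    /* Sketch of the latency/CPU knob: spin while there is work, back off
     * to short sleeps when idle. poll_once() is a placeholder that drains
     * completions and returns how many it found. */
    #include <stdbool.h>
    #include <time.h>

    int poll_once(void);   /* placeholder */

    void polling_loop(volatile bool *running)
    {
        unsigned idle_spins = 0;

        while (*running) {
            if (poll_once() > 0) {
                idle_spins = 0;          /* found work: stay hot */
                continue;
            }
            if (++idle_spins < 1000)
                continue;                /* busy-wait: lowest latency */

            /* Nothing for a while: give back some CPU at the cost of latency. */
            struct timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 };  /* 10 us */
            nanosleep(&ts, NULL);
        }
    }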


I'm not quite sure where you're getting that 25% reduction in cores from. Yes, it'd probably be a pretty bad tradeoff to dedicate 25% of the machine to busy-looping on IO just to reduce the IO overhead by a factor of 2. But that's not at all what I'm suggesting. I think there's a reasonable case to be made that application performance measured on two systems with exactly the same hardware would be 2x higher when the IO is moved into user-space.


Kind of depends on the application, wouldn't you say? For some applications, absolutely not. For some applications, maybe yes, assuming the application programmer knows enough not to negate the advantage e.g. by creating lock contention or cache thrashing. And they know enough to use something like LKL (interestingly also from Intel) instead of reinventing their own filesystem-like layer on top of that raw storage. And they don't make huge security blunders. Because if they don't get all of those things right, it doesn't matter if their performance is 2x for one brief shining moment before everything goes to hell. That's not comparing apples to apples. You have to hold the functionality/quality bar constant or else it's meaningless.


I would hope that DPDK apps only poll when they have no other work to do, but I don't know if it's actually set up this way.


The DPDK operates (or at least it did the last time I looked) in a run-to-completion model for packets received. Once the current batch of received packets has been processed by a thread, the DPDK runtime immediately looks for more packets to process. The number of polling threads is configurable, and often there are multiple threads per network interface (esp. on 10 gig or faster interfaces) with receive flow steering to direct related packets to the same receiving thread to maintain cache coherency during packet processing.

If you have long-lived work that shouldn't block processing more packets, it would be typical to offload that to a separate thread/process from the one doing the packet processing (e.g., for control plane work).
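
(A stripped-down sketch of the run-to-completion receive loop described above, using DPDK's burst receive API. EAL initialization and port/queue setup are omitted, handle_packet() stands in for the application's per-packet work, and the exact types follow later DPDK releases.)

    /* Run-to-completion receive loop on one lcore, polling one RX queue. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    void handle_packet(struct rte_mbuf *m);   /* application-supplied */

    static int lcore_rx_loop(void *arg)
    {
        uint16_t port_id = *(uint16_t *)arg;
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Grab up to BURST_SIZE packets; returns immediately even if
             * the queue is empty -- this is the busy poll. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                handle_packet(bufs[i]);       /* run to completion */
                rte_pktmbuf_free(bufs[i]);    /* then release the mbuf */
            }
            /* No sleep, no interrupt: loop straight back to the poll. */
        }
        return 0;
    }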


I haven't looked at DPDK specifically (I'd like to) but most such libraries I've seen poll on their own in their own thread. If the application has to tell it to poll, then it's cooperative multitasking; we all saw how much fun that was in Windows or (pre-OSX) MacOS. Polling introduces a classic latency/resource-consumption tradeoff. If you do it too much, you waste resources. If you don't do it enough, you get crappy latency. This is why interrupts were invented, and they work quite well when used correctly. Whether the storage or networking stacks in the Linux kernel use them correctly is left as a point for the reader to ponder. ;)


IMO people should only try to use DPDK if they're willing to go all in on "1975 programming", so the event loop is a minor annoyance on the scale of things that you have to deal with. Besides, kids today love event loops and callbacks anyway.


Intel just does it for the kids.


You're experienced in an environment (HPC) where polling for communications works great for compute/network intensive programs. InfiniPath didn't use interrupts, basically, and its latest evolution as Intel's Omni-Path probably does the same thing. Even Mellanox InfiniBand doesn't use interrupts when doing HPC-ish things.

This Intel library means that the network and storage can use the same event loop. It will integrate beautifully with Omni-Path's user-level library.


It will be great if that actually happens. I'm not down on the idea as an addition to the arsenal; I'm just pointing out that it's not as generally applicable as people might think. User-space I/O is another one of those ideas that has come and gone many times over the last few decades. This time it will find some applications, to be sure, but there are also good reasons why most people will probably continue to be better off sticking with the in-kernel implementations.


It's easy to predict that most people will probably not use this. It's much more fun to talk about why this will be awesome for some.


One reason to use a polling model instead of interrupts is that storage on non-volatile memory is much more predictable than a hard disk.


I am an engineer working at Intel on SPDK, and I can answer any technical questions you might have.

Currently SPDK consists of a usermode NVMe (PCIe-attached SSD) driver. We will soon be releasing a usermode driver for the Intel I/OAT DMA engine (copy offload hardware) that is available on some server platforms.


Will combining SPDK with the usermode driver for the Intel I/OAT DMA engine provide the necessary building blocks for a complete solution (network+storage)?

And can this be combined with PCIe-attached accelerators (e.g. Xeon Phi or GPUs)?


The SPDK libraries are mostly storage-specific components (the I/OAT DMA engine can be used for generic copy offload, but it is particularly useful for copying between network and storage buffers). SPDK itself does not provide any network functionality.

I am not familiar enough with the Xeon Phi or GPU programming model to say for sure, but they could possibly be used to offload tasks like hashing/dedup or other storage-related functions.


> they could possibly be used to offload tasks like hashing/dedup or other storage-related functions

Sorry, I was not referring to accelerating storage-related functions; I was wondering about efficient DMA copy from one PCIe device (Intel NVM storage) to another (Xeon Phi accelerator), which would be useful for many different functions if the NVM storage device capacity is much larger than the accelerator device memory.


Ah, I see. The I/OAT DMA copy offload is essentially equivalent to an asynchronous memcpy(), so anything addressable on the memory bus could be a source or destination (with some caveats about alignment requirements and pinned pages if copying to/from RAM).


As someone who has never touched any of this stuff before, where might I actually find it useful in my day-to-day programming (assuming I might have an application where it would be useful)?


The NVMe driver will only work for a fairly narrow set of uses in which the whole NVMe device(s) can be dedicated to a single application (this is because the user-space application takes control of the NVMe device directly, so the kernel driver can't simultaneously use it).

Some of the straightforward use cases would be inside network-attached storage appliances (ideally in conjunction with a user-mode network stack) or in a database (database systems already typically want to avoid any OS interference with storage access). In general, the NVMe driver can be dropped in fairly easily when existing code is using something like Linux AIO with O_DIRECT on a raw block device; the AIO programming model maps quite directly to the NVMe driver programming model (create a queue, submit I/Os, and poll for completions).
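
(For reference, the Linux AIO + O_DIRECT pattern being compared to looks roughly like this: create a queue, submit an I/O, poll for the completion. The device path is just an example, error handling is trimmed, and the program links with -laio.)

    /* Linux AIO against a raw block device opened with O_DIRECT:
     * set up a queue, submit a read, poll for the completion. */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* example device */

        io_context_t ctx;
        memset(&ctx, 0, sizeof(ctx));
        io_setup(64, &ctx);                 /* "create a queue" */

        void *buf;
        posix_memalign(&buf, 4096, 4096);   /* O_DIRECT needs aligned buffers */

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);             /* "submit I/Os" */

        struct io_event ev;
        struct timespec timeout = { 0, 0 };
        while (io_getevents(ctx, 1, 1, &ev, &timeout) < 1)
            ;                               /* "poll for completions" */

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }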


Are you going to integrate this with the Intel Omni-Path user-level library? You could probably serve a quite insane number of transactions that way.


I don't know anything about Omni-Path, sorry. However, based on the publicly available information, it does look like a very interesting combination. One major advantage of SPDK over the traditional kernel-provided storage stack is lower latency (by avoiding interrupts and other context switches), and it would fit nicely with a low-latency network stack.


This is a beautiful thing. I'm eagerly awaiting when NVM arrives in AWS to see if I can put this to work.

Intro here https://software.intel.com/en-us/articles/introduction-to-th...


Damn this is sweet. User-mode networking has been fairly widespread for a while but user-mode storage stacks have been pretty rare outside MirageOS and a few other places.

I wish I had an NVMe device to play with this.


The Intel consumer one, the SSD 750, is relatively cheap (around £300 or so).



