I assert that open source databases don't do it because it's a bad idea. It's th...

jandrewrogers · on April 3, 2014

You both overestimate the engineering complexity and underestimate the benefits. I've both designed and worked on a couple different bypass kernels as well PostgreSQL internals over the years.

You are correct that the initial development is steep. However, once the infrastructure is there it really is not much different than working with the operating system infrastructure and you gain a level of predictability and stability in terms of behavior that saves engineering time. Also, bypass implementations have almost no locking internally (either "lock-free" types or heavier types) which reduces complexity considerably.

Some bypass kernel code bases allow you to compile with the bypass implementation disabled, using highly-optimized PostgreSQL-like internals. I've seen and run quite a few comparative benchmarks on the same design with and without bypass enabled, as well as absolute benchmarks against engines like PostgreSQL. We don't have to guess about single node performance.

Broadly speaking, a properly designed bypass kernel buys you 2-3x the throughput of a highly optimized non-bypass kernel in my experience. If it was only 25% no one would bother. Furthermore, for massively parallel databases, you essentially require a bypass kernel to design a well-behaved system due to the adaptive operation scheduling requirements.

I agree that it is a lot of work but it is also entirely worth it if you need to either (1) maximize throughput on a single node and (2) build a well-behaved massively parallel database kernel. The differences are not trivial.

jeffdavis · on April 3, 2014

Any number we have is going to be sensitive to the workload, so I think it's unfair to say 2-3x without a lot of context.

Also, you dismiss ideas that help the database and the OS work together better. For instance, I did "synchronized scans" for postgres. It coordinates sequential scans to start from the block another scan is already reading, improving cache behavior and dramatically reducing seeks. This could have been done by lots of extra code controlling the I/O very carefully (as at least one paper seemed to suggest was a good idea). But I chose to do it the simple way, just start the scan off in the same place as another scan, and concurrent scans got almost ideal behavior -- each ran in about the same time as if no other scan was in progress (with no overhead in the single scan case).

Linux is clearly interested in allowing more hooks and making them more useful. From an engineering standpoint, that makes more sense to me.

Two other points:

* I'm a little skeptical that such a bypass can easily be made resilient to some strange/degenerate cases.

* You say that the reason an open source system won't do it is because the MVP is too expensive. But the MVP for a cost-based optimizer is also very expensive, and postgres has one of those. I think that was a much better investment than investment in the filesystem/scheduling layer.

jandrewrogers · on April 4, 2014

Jeff, I am familiar with your work, I lurked on the PostgreSQL hackers mailing list for years when I was hacking on that database. :-) I am not dismissing the coordination of OS and database, it just has really deep limits because the OS must hide information critical to optimizing database throughput.

While the increased throughput is a complex function of hardware, workload, etc, it is also consistently substantial. The reason why it works is simple: the database processes have nearly omniscient view of hardware and state and there is only (in modern designs) a single process per core. Consequently, even if you have thousands of concurrent high-level database operations, each process can dynamically select and continuously reorder the low-level operations to (nearly) optimally maximize the throughput for the execution graph at that moment because the execution is completely cooperative. You can do the “synchronous scan” optimization for CPU caches that you do for disk systems. You can schedule around any conflicts in the execution graph and even the impact of outside CPU interrupts can be detected and optimized around. And it is easy to track the aggregate costs of these choices. To the extent possible, every clock cycle is spent on end-user database work instead of database internals overhead.

So minimal processing stalls, micro or macro, and no context-switching or coordination overhead. All combined with incredible locality knowledge (by inference) that is not available if you let the OS manage things for you.

On your other two points:

- Bypass is generally more resilient partly because the software has more explicit and immediate knowledge of the nature of the fault and can do something sensible about it. Obviously you have to handle faults when they occur. A lot of OS behavior when faults occur is pathological from the standpoint of optimizing databases. It is like memory management in C; it requires extra effort but also adds extra power if you handle it well.

- Postgres has expensive capability add-ons to an existing, useful system so it is more incremental in nature. The problem with OS bypass database kernels (and I learned this the hard way) is that (1) they are huge in terms of LoC long before rudimentary functionality is available and (2) it takes many years of atypical software design experience to be competent at trying to write one. It could be done, but it would require a critical mass of a tiny demographic willing to do a lot of work. My argument in this regard was less about inevitability and more about statistical probability.

I spent a lot of years hacking on and customizing Postgres. I recommend it to anyone and everyone that will listen because it is a great piece of engineering and would still use it for many OLTP systems. But it does leave a lot of performance on the table for a variety of reasons that probably make sense for a portable, open source project. The fact remains that I can design and have built bypass kernels that are substantially faster largely by exploiting the optimizations bypassing offers.

jeffdavis · on April 4, 2014

Perhaps it's just because I've never seen a good implementation of a bypass, and I might agree if I had seen one. Like many things, maybe it just takes the right people to make it successful.

Postgres leaves a lot of performance on the table in much more basic ways, too, so I certainly am not suggesting that postgres is anywhere near optimal.

deafbybeheading · on April 4, 2014

Stupid (and somewhat tangential) question: how do bypass kernels work with virtualization, if at all?

jurjenh · on April 3, 2014

Is there a middle ground here somewhere? In that the kernel developers create some sort of DB specific hooks that allow some of the kernel bypass mechanisms to be implemented?

What are the key things that a kernel bypass version does different? Can these be separated out in a concise way which would lead to multiple DB implementations being able to use these same interfaces? Essentially for any major DB system, you'd want the kernel tailored anyway - you're not going to be doing much else on your DB server (are you?)

kev009 · on April 3, 2014

Or you could, you know, fix the OS like they are trying to do in TFA.

cookiecaper · on April 4, 2014

The argument is that "fix" is relative. Things that improves Postgres's performance may negatively impact other applications. The suggestion, therefore, is that Postgres takes control of these tasks for its own purposes, and then they don't have to worry about the implications for other systems or wait on anyone else.