Jeff, I am familiar with your work; I lurked on the PostgreSQL hackers mailing list for years when I was hacking on that database. :-) I am not dismissing coordination between the OS and the database, it just has really deep limits because the OS must hide information that is critical to optimizing database throughput.
While the increased throughput is a complex function of hardware, workload, etc., it is also consistently substantial. The reason it works is simple: the database processes have a nearly omniscient view of the hardware and its state, and there is (in modern designs) only a single process per core. Consequently, even if you have thousands of concurrent high-level database operations, each process can dynamically select and continuously reorder the low-level operations to maximize throughput for the execution graph at that moment, nearly optimally, because the execution is completely cooperative. You can do the “synchronous scan” optimization for CPU caches that you do for disk systems. You can schedule around any conflicts in the execution graph, and even the impact of outside CPU interrupts can be detected and optimized around. And it is easy to track the aggregate costs of these choices. To the extent possible, every clock cycle is spent on end-user database work instead of database-internals overhead.
So minimal processing stalls, micro or macro, and no context-switching or coordination overhead. All combined with incredible locality knowledge (by inference) that is not available if you let the OS manage things for you.
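To make the scheduling point concrete, here is a rough sketch (in C++, with every name and size made up for illustration) of the kind of per-core cooperative scheduler I am describing. It batches pending work by the data partition it touches, so a partition stays hot in the CPU cache across many logical operations, which is the cache-level analog of a synchronized disk scan. A real kernel would pick the next batch from cost tracking over the execution graph rather than raw queue depth, but the shape is the same: one pinned process per core, no locks, no involuntary context switches.

```cpp
// Hedged sketch, not from any real engine: one cooperative scheduler per core,
// single-threaded, run-to-completion. Work items are tagged with the data
// partition they touch; the scheduler drains one partition at a time so that
// partition stays hot in the CPU cache across many logical operations.
#include <cstdint>
#include <deque>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Task {
    uint32_t partition;          // which slice of data this operation touches
    std::function<void()> run;   // short, non-blocking unit of work
};

class CoreScheduler {
public:
    void submit(Task t) {
        uint32_t p = t.partition;
        queues_[p].push_back(std::move(t));
    }

    // Pick the partition with the most pending work, drain it entirely, repeat.
    // Execution is cooperative and single-threaded per core, so there are no
    // locks, no preemption, and no context-switch overhead in this loop.
    void run_once() {
        while (!queues_.empty()) {
            auto best = queues_.begin();
            for (auto it = queues_.begin(); it != queues_.end(); ++it)
                if (it->second.size() > best->second.size()) best = it;
            auto batch = std::move(best->second);
            queues_.erase(best);
            for (auto& t : batch) t.run();  // this partition's data stays cached
        }
    }

private:
    std::unordered_map<uint32_t, std::deque<Task>> queues_;
};

int main() {
    CoreScheduler sched;
    std::vector<int> part0(1 << 16, 1), part1(1 << 16, 2);
    long sum = 0;
    // Two logical "queries" each touch both partitions; the scheduler reorders
    // the work so all partition-0 scans run back-to-back, then all partition-1.
    for (int q = 0; q < 2; ++q) {
        sched.submit({0, [&] { for (int v : part0) sum += v; }});
        sched.submit({1, [&] { for (int v : part1) sum += v; }});
    }
    sched.run_once();
    std::cout << "sum = " << sum << "\n";
}
```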
On your other two points:
- Bypass is generally more resilient, partly because the software has more explicit and immediate knowledge of the nature of the fault and can do something sensible about it. Obviously you have to handle faults when they occur, and a lot of OS behavior around faults is pathological from the standpoint of optimizing a database. It is like memory management in C: it requires extra effort, but it also adds extra power if you handle it well (see the sketch after this list).
- Postgres adds expensive capabilities onto an existing, useful system, so it is more incremental in nature. The problem with OS-bypass database kernels (and I learned this the hard way) is that (1) they are huge in terms of LoC long before even rudimentary functionality is available, and (2) it takes many years of atypical software design experience to become competent at writing one. It could be done, but it would require a critical mass of a tiny demographic willing to do a lot of work. My argument in this regard was less about inevitability and more about statistical probability.
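To illustrate the fault-handling point, here is a hedged sketch of what “explicit and immediate knowledge of the fault” looks like in practice: reading a block with O_DIRECT so I/O errors surface straight to the database code, which then chooses the recovery policy itself instead of inheriting the kernel page cache's behavior. The file path, block size, and retry policy below are all illustrative, not from any real system.

```cpp
// Hedged sketch, not from any production system: read a block through O_DIRECT
// (Linux-specific) so that I/O faults surface immediately to the database code,
// which then decides per-request what "sensible" means (retry, fail only this
// query, fence the block) instead of inheriting the kernel's caching and retry
// behavior. Paths, sizes, and the retry policy are illustrative.
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

static constexpr size_t kBlockSize = 4096;  // assumed multiple of the sector size

// Returns 0 on success, otherwise the errno the caller should act on.
int read_block_direct(int fd, off_t offset, void* aligned_buf) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        ssize_t n = pread(fd, aligned_buf, kBlockSize, offset);
        if (n == static_cast<ssize_t>(kBlockSize)) return 0;
        if (n >= 0) return EIO;          // short read: treat as an I/O fault
        if (errno == EINTR) continue;    // interrupted by a signal: just retry
        return errno;                    // EIO, etc.: hand the decision upward
    }
    return EINTR;
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <datafile>\n", argv[0]); return 1; }

    // The data file is assumed to be at least one block long.
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { std::perror("open"); return 1; }

    void* buf = nullptr;
    if (posix_memalign(&buf, kBlockSize, kBlockSize) != 0) { close(fd); return 1; }

    int err = read_block_direct(fd, 0, buf);
    if (err != 0) {
        // The database, not the OS, now chooses the recovery policy: fail only
        // the affected query, mark the block suspect, read from a replica, etc.
        std::fprintf(stderr, "block 0 unreadable: %s\n", std::strerror(err));
    }

    free(buf);
    close(fd);
    return 0;
}
```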
I spent a lot of years hacking on and customizing Postgres. I recommend it to anyone and everyone who will listen because it is a great piece of engineering, and I would still use it for many OLTP systems. But it does leave a lot of performance on the table, for a variety of reasons that probably make sense for a portable, open-source project. The fact remains that I can design, and have built, bypass kernels that are substantially faster, largely by exploiting the optimizations that bypassing offers.
Perhaps it's just because I've never seen a good implementation of a bypass, and I might agree if I had seen one. Like many things, maybe it just takes the right people to make it successful.
Postgres leaves a lot of performance on the table in much more basic ways, too, so I certainly am not suggesting that Postgres is anywhere near optimal.