A lot to digest. Presumably the compression schemes outlined in the paper would benefit columns generally, and not just keys, and would have a positive impact on row size, which would in turn benefit general performance?
One short-string-compression library to compare with might be https://github.com/antirez/smaz by antirez, the author of Redis.
In the last decade TokuDB and then MyRocks showed how to build a fast storage engine for data too big to fit in a traditional RDBMS. They are multi-level, they do page-level compression, etc. And yet there are still so many easy wins to be had.
Generally, databases are completely inefficient and there is just so much low-hanging fruit; if the team working on e.g. MyRocks were doubled in size and tasked with looking at the inefficiencies in the server as well as the storage engine, things might change. I have a list in my head of the various really promising ideas that databases don't actually do:
* linux doesn't have a syscall batching system, but if it did, the number of context switches would be cut down dramatically. Research in 2010 https://www.usenix.org/legacy/events/osdi10/tech/full_papers... proved this, and it wouldn't just be databases that benefit. These days context switching is more expensive than ever.
* database engines all use blocking io. Finally io_uring offers workable async io and that would benefit database storage engines immensely. See https://fosdem.org/2020/schedule/event/rust_techniques_sled/
* tokudb showed that simply not tracking the affected row count could speed things up massively (tokudb called it NOAR)
* query engines often don't handle things that a few more lines of code would handle efficiently. I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql evaluating constraints that it can against the other columns in the key before dereferencing the row? Arrgh. Dumb dumb dumb. Etc.
> Generally, databases are completely inefficient and there is just so much low-hanging fruit;
Not generally, unless you consider only popular open-source databases on Linux to be the "general" case.
> Linux doesn't have a syscall batching system
Windows has scatter-gather I/O, and notably SQL Server uses it.
> Database engines all use blocking io
SQL Server almost exclusively uses asynchronous I/O, and the entire Windows kernel I/O stack has been async since the 1990s.
> I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql
That's because MySQL barely qualifies as a database engine. It has only a fraction of the features of any commercial database, but it is cheap and it has low latency for simple queries, which makes it popular.
PostgreSQL is substantially more feature-rich, but they made tradeoffs elsewhere in the design that result in high write amplification, unfortunately making it a no-go for many potential use cases.
You would be shocked at the concurrency levels achievable with modern in-memory databases, and the compression levels available with technologies like SQL Server's ColumnStore.
Unfortunately, all of that technology comes with a licensing cost...
> Unfortunately, all of that technology comes with a licensing cost...
Yes, so in the end it doesn't decrease TCO—it decreases OpEx but increases CapEx.
(Or, to put that another way: anything one SQL Server instance can do, five-to-ten Postgres read-replicas can do for the same price.)
Low-hanging fruit is "low-hanging" (i.e. tempting to pluck) because it'd be high-positive-ROI to solve these problems. Switching to an RDBMS that has already solved these problems, but has a licensing cost, is more likely just ROI-neutral.
>> the entire Windows kernel I/O stack has been async since the 1990s.
Sincere Qs from a novice database developer: what good is all that async IO when you also have to consider data integrity? How would you e.g. uphold the integrity of a file that stores the state of an autoincrementing field if you didn't use a singleton writer? In what db layer would I be helped by async IO?
Imagine a simple search like SELECT * FROM my_table WHERE foo = ?
Imagine that the column foo is indexed.
With blocking IO, we go through the foo index, finding the rows that have the correct value of foo.
The rows we find are on some kind of pages in the storage of the main table. So we now have a list of pages we have to fetch.
With blocking IO, we would fetch a page at a time, process it, then request the next page.
With async IO, we would request the fetching of all pages, as we discover we need them, and then process those pages in the order the IO subsystem can return them to us.
Writes are the same; we have some number of pages we want to write to disk. With blocking IO, we write a page at a time and wait for it to be written. With async you can buffer up the writes, let the OS write them in the order it chooses, and wait for them all to finish.
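To make that concrete, here is roughly what the async read path looks like with io_uring via liburing. This is only a sketch: the 8K page size, the page-number-to-offset mapping and process_page() are made up, and error handling is mostly skipped.

    /* Hypothetical sketch: submit reads for every page the index pointed us at,
       then process pages in whatever order the completions arrive.
       Assumes npages fits in one ring; a real engine would chunk. */
    #include <liburing.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 8192   /* made-up page size */

    static void process_page(uint64_t page_no, const char *buf)
    {
        (void)page_no; (void)buf;   /* row extraction would happen here */
    }

    int fetch_pages(int fd, const uint64_t *page_nos, unsigned npages)
    {
        struct io_uring ring;
        if (io_uring_queue_init(npages, &ring, 0) < 0)
            return -1;
        char *bufs = malloc((size_t)npages * PAGE_SIZE);

        /* Queue one read per page; nothing hits the kernel yet. */
        for (unsigned i = 0; i < npages; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs + (size_t)i * PAGE_SIZE,
                               PAGE_SIZE, page_nos[i] * PAGE_SIZE);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i);  /* remember which page */
        }
        io_uring_submit(&ring);   /* one syscall submits the whole batch */

        /* Consume completions in whatever order the device returns them. */
        for (unsigned done = 0; done < npages; done++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            unsigned i = (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe);
            if (cqe->res == PAGE_SIZE)
                process_page(page_nos[i], bufs + (size_t)i * PAGE_SIZE);
            io_uring_cqe_seen(&ring, cqe);
        }

        free(bufs);
        io_uring_queue_exit(&ring);
        return 0;
    }

The kernel and the device get to reorder and merge those reads however they like; we just consume whatever comes back first.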
Stuff like that is even more confusing because your performance analysis at moderate load tells you absolutely nothing about how the system will behave under high load (people who complain about GC often mention this as one of their grievances)
Turns out this causes a lot of issues when working with large diverse teams in particular. If you have a team of 4 you can socialize the idea of avoiding situations like this.
> How would you e.g. uphold the integrity of file that stores the state of a autoincrementing field if you didn't use a singleton writer?
I mean, you wouldn't get write-concurrency for that field. And in fact you'd avoid serial PKs in general in a write-concurrent design, in favor of e.g. client-generated UUIDs.
Where you'd get wins is when 1000 different clients are each opening a transaction and upserting 1000 rows, with, say, 50% of them being duplicates of other clients' rows. Rather than linearizing+deduping all of those right away, you can just write down all 1000 individual MVCC-world-state WAL log fragments concurrently, as 1000 queued write-batches. This lets you take actual advantage of all those write IOPS that RAIDed NVMe gets you.
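At the I/O layer that might look something like the following. Again a liburing sketch; the wal_fragment struct, the pre-reserved log offsets and the single group-commit fdatasync are my invention, not how any particular engine does it.

    /* Invented-for-illustration WAL fragment: each client batch already has its
       own reserved region of the log, so the writes are independent. */
    #include <liburing.h>

    struct wal_fragment {
        const void        *buf;
        unsigned           len;
        unsigned long long off;   /* pre-reserved offset in the log file */
    };

    int flush_fragments(int wal_fd, const struct wal_fragment *frags, unsigned n)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        if (io_uring_queue_init(n + 1, &ring, 0) < 0)
            return -1;

        /* All n appends go out concurrently; no ordering between them. */
        for (unsigned i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_write(sqe, wal_fd, frags[i].buf, frags[i].len, frags[i].off);
        }
        io_uring_submit_and_wait(&ring, n);   /* one syscall, wait for every write */
        for (unsigned i = 0; i < n; i++) {    /* reap the write completions */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        /* One durability barrier for the whole group commit. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);
        io_uring_submit_and_wait(&ring, 1);
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }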
Using threads for in-flight IOs means burning up lots of memory with stacks, and burning ever more CPU in context switches as IOs complete. In practice this means using some kind of state machine (either explicit or in the form of coroutines) to pick up where logic paused at the initiation of the IO. The logic with respect to locking and concurrency control doesn't change; the costs in memory and CPU do.
The CPU cost is increasing all the time due to all the cache attacks being revealed and the mitigations blowing away various cache contents, which is especially damaging in VM contexts like AWS where the mitigations can't be reasonably disabled.
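A toy version of the explicit-state-machine idea, to show where the memory saving comes from. The two-step index-page-then-row-page read, the 8K buffer and the state names are all invented.

    /* Toy state machine for one lookup: read an index page, then the row page it
       points at. Each in-flight operation costs one small struct, not a parked
       thread stack. */
    #include <liburing.h>
    #include <stdlib.h>

    enum io_state { WAIT_INDEX_PAGE, WAIT_ROW_PAGE, DONE };

    struct io_op {
        enum io_state state;
        int           fd;
        char          buf[8192];
    };

    /* Called on every completion; resumes the operation where it paused. */
    static void advance(struct io_uring *ring, struct io_op *op)
    {
        struct io_uring_sqe *sqe;
        switch (op->state) {
        case WAIT_INDEX_PAGE:
            /* Index page is in op->buf; pretend it told us the row page lives at
               offset 4 * 8192 (made up). Queue the next read and pause again. */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_read(sqe, op->fd, op->buf, sizeof op->buf, 4 * 8192);
            io_uring_sqe_set_data(sqe, op);
            op->state = WAIT_ROW_PAGE;
            break;
        case WAIT_ROW_PAGE:
            /* Row page is in op->buf; the query above us can emit its rows. */
            op->state = DONE;
            break;
        case DONE:
            break;
        }
    }

    /* Event loop: assumes the caller already queued `pending` initial index-page
       reads, each tagged with its io_op via io_uring_sqe_set_data(). */
    void run(struct io_uring *ring, unsigned pending)
    {
        while (pending) {
            io_uring_submit(ring);
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) < 0)
                break;
            struct io_op *op = io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(ring, cqe);
            advance(ring, op);
            if (op->state == DONE) {
                free(op);
                pending--;
            }
        }
    }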
Data integrity and asynchronous I/O are unrelated concerns. The value of async I/O is that it exposes the concurrency and parallelism of the storage hardware to the software layer, enabling many types of performance optimizations. The vast majority of I/O operations in many database architectures do not have any sequential dependencies on other I/O operations, even within a single execution context.
Data integrity is a function of design and implementation correctness, in addition to having robust mechanisms for dealing with data corruption during I/O, hardware failures, etc.
Only in open source; advanced closed-source database engines haven't been designed around blocking I/O in a very long time. I have not worked on any database engine with blocking I/O in the last decade.
This is an area of easy integer factor performance improvement for open source databases, and one of the primary reasons they are not competitive with the performance of many closed source designs.
Well, you already mention io_uring. That's your syscall batching system. It doesn't cover all existing syscalls because not everything is safe to do asynchronously.
Asynchronous io_uring is good for things that take a lot of time, e.g. IO. It's really just a queue between user-space and kernel threads.
But it doesn't replace just making a synchronous syscall for small things, e.g. stat'ing files etc.
A lot of those syscalls are called in quick succession, often with no logic between them.
The Soares paper I linked to showed how buffering up a list of syscalls, doing a single context switch to the kernel, executing them and then switching back is a major gain for a wide variety of applications.
Synchronous syscall batching would be much more straightforward to adopt than async IO; imagine if the cstdlib file functions all used batched syscalls, everyone would benefit! Whereas I would imagine adopting true async IO would be very hard, because cstdlib exposes a blocking API to its users.
You can use io_uring for synchronous batching too. All you have to do is submit a batch and set min_complete to the number of submitted items when calling io_uring_enter. That's a single syscall, assuming you already have some thread-local ring lying around just for batching.
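Something like this, say for the stat-heavy case mentioned above (hypothetical helper; assumes a kernel recent enough to have IORING_OP_STATX):

    /* Synchronous batching via io_uring: prepare n statx requests, then one
       io_uring_enter (inside io_uring_submit_and_wait) submits and waits for
       all of them. */
    #define _GNU_SOURCE
    #include <liburing.h>
    #include <fcntl.h>      /* AT_FDCWD */
    #include <sys/stat.h>   /* struct statx, STATX_BASIC_STATS */

    int batched_stat(const char **paths, struct statx *out, unsigned n)
    {
        struct io_uring ring;
        if (io_uring_queue_init(n, &ring, 0) < 0)
            return -1;

        for (unsigned i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_statx(sqe, AT_FDCWD, paths[i], 0,
                                STATX_BASIC_STATS, &out[i]);
        }

        /* One trip into the kernel for the whole batch. */
        io_uring_submit_and_wait(&ring, n);

        struct io_uring_cqe *cqe;
        for (unsigned i = 0; i < n; i++) {   /* reap results; cqe->res < 0 is -errno */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }

Still one context switch per batch instead of one per stat, which is the whole point.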
> A lot of those syscalls are called in quick succession, often with no logic between them.
Speaking of which, I noticed on macOS that the “du” command-line utility seems much slower than a third-party GUI utility called DaisyDisk [1] at collecting the sizes of everything on a disk.
This has me wondering what DaisyDisk is doing, if anything, that “du” isn’t, to achieve this.
One theory of mine is that perhaps APFS caches metadata about file sizes and that this can be accessed faster than walking all directories and stat’ing all the things.
An alternative theory I have is that perhaps GUI apps just run with a higher priority.
If anyone knows, I’d be delighted to learn.
A third possibility is that all of this was just due to “noise”. For example, disks happening to be under less I/O pressure when I ran DaisyDisk than when I’ve run “du”. My perception of their relative speed is based on very little scientific rigor.
One of the features of APFS is supposedly a cache for total size of directory trees.
As far as I remember, it is something the OS decides to turn on or off for different folders. It’s wasteful for /tmp, for example, because it makes file creation and deletion slightly slower.
I’m not entirely sure if du would want to tap into that. A lot has changed in MacOS, and it’s not entirely clear if we still have a universal definition of “file size”. You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.
Don’t remember the exact name, sorry.
There is a macOS-native way to explore recursive size, by the way: About This Mac -> Storage -> Manage -> Documents -> File Browser. This is limited to your user folder by default, but you can just create a soft link to take you to /
> You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.
Do you know how that translates to the BSD layer? Does such a file appear to exist and have a certain size, but is e.g. exclusively read-locked, unless you manually download it? Or does fopen(2) block, trigger an iCloud download, and then unblock once it's complete?
It also doesn't (yet) handle use cases like FOPEN -> FREAD -> FCLOSE, where all of them are queued at once. This is because the read and the close need the file descriptor that the open operation returns.
Perhaps not a problem for databases (which probably keep all relevant files open anyhow).