A lot to digest. Presumably the compression schemes outlined in the paper would benefit columns generally, and not just keys, and would have a positive impact on row size, which would in turn benefit general performance?
One short-string-compression library to compare with might be https://github.com/antirez/smaz by antirez, the author of Redis.
In the last decade TokuDB and then MyRocks showed how to build a fast storage engine for data too big to fit in a traditional RDBMS. They are multi-level, they do page-level compression, etc. And yet there are still so many easy wins to be had.
Generally, databases are completely inefficient and there is just so much low-hanging fruit; if the team working on e.g. MyRocks were doubled in size and tasked with looking at the inefficiencies in the server as well as the storage engine, things might change. I have a list in my head of the various really promising ideas that databases don't actually do:
* linux doesn't have a syscall batching system, but if it did, the number of context switches would be cut down dramatically. Research in 2010 https://www.usenix.org/legacy/events/osdi10/tech/full_papers... proved this, and it wouldn't just be databases that benefit. These days context switching is more expensive than ever.
* database engines all use blocking io. Finally io_uring offers workable async io and that would benefit database storage engines immensely. See https://fosdem.org/2020/schedule/event/rust_techniques_sled/
* tokudb showed that simply not tracking the affected row count could speed things up massively (tokudb called it NOAR)
* query engines often don't handle things that a few more lines of code would handle efficiently. I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql evaluating constraints that it can against the other columns in the key before dereferencing the row? Arrgh. Dumb dumb dumb. Etc.
> Generally, databases are completely inefficient and there is just so much low-hanging fruit;
Not generally, unless you consider only popular open-source databases on Linux to be the "general" case.
> Linux doesn't have a syscall batching system
Windows has scatter-gather I/O, and notably SQL Server uses it.
> Database engines all use blocking io
SQL Server almost exclusively uses asynchronous I/O, and the entire Windows kernel I/O stack has been async since the 1990s.
> I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql
That's because MySQL barely qualifies as a database engine. It has only a fraction of the features of any commercial database, but it is cheap and it has low latency for simple queries, which makes it popular.
PostgreSQL is substantially more feature-rich, but they made tradeoffs elsewhere in the design that result in high write amplification, unfortunately making it a no-go for many potential use cases.
You would be shocked at the concurrency levels achievable with modern in-memory databases, and the compression levels available with technologies like SQL Server's ColumnStore.
Unfortunately, all of that technology comes with a licensing cost...
> Unfortunately, all of that technology comes with a licensing cost...
Yes, so in the end it doesn't decrease TCO—it decreases OpEx but increases CapEx.
(Or, to put that another way: anything one SQL Server instance can do, five-to-ten Postgres read-replicas can do for the same price.)
Low-hanging fruit is "low-hanging" (i.e. tempting to pluck) because it'd be high-positive-ROI to solve these problems. Switching to an RDBMS that has already solved these problems, but has a licensing cost, is more likely just ROI-neutral.
>> the entire Windows kernel I/O stack has been async since the 1990s.
Sincere Qs from a novice database developer: what good is all that async IO when you also have to consider data integrity? How would you e.g. uphold the integrity of a file that stores the state of an autoincrementing field if you didn't use a singleton writer? In what db layer would I be helped by async IO?
Imagine a simple search like SELECT * FROM my_table WHERE foo = ?
Imagine that the column foo is indexed.
With blocking IO, we go through the foo index, finding the rows that have the correct value of foo.
The rows we find are on some kind of pages in the storage of the main table. So we now have a list of pages we have to fetch.
With blocking IO, we would fetch a page at a time, process it, then request the next page.
With async IO, we would request the fetching of all pages, as we discover we need them, and then process those pages in the order the IO subsystem can return them to us.
Writes are the same; we have some number of pages we want to write to disk. With blocking IO, we write a page at a time and wait for it to be written. With async you can buffer up the writes, let the OS write them in the order it chooses, and wait for them all to finish.
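To make that concrete, here is roughly what the async read path looks like with io_uring via liburing. This is only a sketch: the 8K page size, the page-number-to-offset mapping and process_page() are made up, and error handling is mostly skipped.

    /* Hypothetical sketch: submit reads for every page the index pointed us at,
       then process pages in whatever order the completions arrive.
       Assumes npages fits in one ring; a real engine would chunk. */
    #include <liburing.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define PAGE_SIZE 8192   /* made-up page size */

    static void process_page(uint64_t page_no, const char *buf)
    {
        (void)page_no; (void)buf;   /* row extraction would happen here */
    }

    int fetch_pages(int fd, const uint64_t *page_nos, unsigned npages)
    {
        struct io_uring ring;
        if (io_uring_queue_init(npages, &ring, 0) < 0)
            return -1;
        char *bufs = malloc((size_t)npages * PAGE_SIZE);

        /* Queue one read per page; nothing hits the kernel yet. */
        for (unsigned i = 0; i < npages; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs + (size_t)i * PAGE_SIZE,
                               PAGE_SIZE, page_nos[i] * PAGE_SIZE);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)i);  /* remember which page */
        }
        io_uring_submit(&ring);   /* one syscall submits the whole batch */

        /* Consume completions in whatever order the device returns them. */
        for (unsigned done = 0; done < npages; done++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            unsigned i = (unsigned)(uintptr_t)io_uring_cqe_get_data(cqe);
            if (cqe->res == PAGE_SIZE)
                process_page(page_nos[i], bufs + (size_t)i * PAGE_SIZE);
            io_uring_cqe_seen(&ring, cqe);
        }

        free(bufs);
        io_uring_queue_exit(&ring);
        return 0;
    }

The kernel and the device get to reorder and merge those reads however they like; we just consume whatever comes back first.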
Stuff like that is even more confusing because your performance analysis at moderate load tells you absolutely nothing about how the system will behave under high load (people who complain about GC often mention this as one of their grievances)
Turns out this causes a lot of issues when working with large diverse teams in particular. If you have a team of 4 you can socialize the idea of avoiding situations like this.
> How would you e.g. uphold the integrity of file that stores the state of a autoincrementing field if you didn't use a singleton writer?
I mean, you wouldn't get write-concurrency for that field. And in fact you'd avoid serial PKs in general in a write-concurrent design, in favor of e.g. client-generated UUIDs.
Where you'd get wins is when 1000 different clients are each opening a transaction and upserting 1000 rows, with, say, 50% of them being duplicates of other clients' rows. Rather than linearizing+deduping all of those right away, you can just write down all 1000 individual MVCC-world-state WAL log fragments concurrently, as 1000 queued write-batches. This lets you take actual advantage of all those write IOPS that RAIDed NVMe gets you.
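At the I/O layer that might look something like the following. Again a liburing sketch; the wal_fragment struct, the pre-reserved log offsets and the single group-commit fdatasync are my invention, not how any particular engine does it.

    /* Invented-for-illustration WAL fragment: each client batch already has its
       own reserved region of the log, so the writes are independent. */
    #include <liburing.h>

    struct wal_fragment {
        const void        *buf;
        unsigned           len;
        unsigned long long off;   /* pre-reserved offset in the log file */
    };

    int flush_fragments(int wal_fd, const struct wal_fragment *frags, unsigned n)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        if (io_uring_queue_init(n + 1, &ring, 0) < 0)
            return -1;

        /* All n appends go out concurrently; no ordering between them. */
        for (unsigned i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_write(sqe, wal_fd, frags[i].buf, frags[i].len, frags[i].off);
        }
        io_uring_submit_and_wait(&ring, n);   /* one syscall, wait for every write */
        for (unsigned i = 0; i < n; i++) {    /* reap the write completions */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        /* One durability barrier for the whole group commit. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);
        io_uring_submit_and_wait(&ring, 1);
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }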
Using threads for in-flight IOs means burning up lots of memory with stacks, and burning ever more CPU in context switches as IOs complete. In practice this means using some kind of state machine (either explicit or in the form of coroutines) to pick up where logic paused at the initiation of the IO. The logic with respect to locking and concurrency control doesn't change; the costs in memory and CPU do.
The CPU cost is increasing all the time due to all the cache attacks being revealed and the mitigations blowing away various cache contents, which is especially damaging in VM contexts like AWS where the mitigations can't be reasonably disabled.
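A toy version of the explicit-state-machine idea, to show where the memory saving comes from. The two-step index-page-then-row-page read, the 8K buffer and the state names are all invented.

    /* Toy state machine for one lookup: read an index page, then the row page it
       points at. Each in-flight operation costs one small struct, not a parked
       thread stack. */
    #include <liburing.h>
    #include <stdlib.h>

    enum io_state { WAIT_INDEX_PAGE, WAIT_ROW_PAGE, DONE };

    struct io_op {
        enum io_state state;
        int           fd;
        char          buf[8192];
    };

    /* Called on every completion; resumes the operation where it paused. */
    static void advance(struct io_uring *ring, struct io_op *op)
    {
        struct io_uring_sqe *sqe;
        switch (op->state) {
        case WAIT_INDEX_PAGE:
            /* Index page is in op->buf; pretend it told us the row page lives at
               offset 4 * 8192 (made up). Queue the next read and pause again. */
            sqe = io_uring_get_sqe(ring);
            io_uring_prep_read(sqe, op->fd, op->buf, sizeof op->buf, 4 * 8192);
            io_uring_sqe_set_data(sqe, op);
            op->state = WAIT_ROW_PAGE;
            break;
        case WAIT_ROW_PAGE:
            /* Row page is in op->buf; the query above us can emit its rows. */
            op->state = DONE;
            break;
        case DONE:
            break;
        }
    }

    /* Event loop: assumes the caller already queued `pending` initial index-page
       reads, each tagged with its io_op via io_uring_sqe_set_data(). */
    void run(struct io_uring *ring, unsigned pending)
    {
        while (pending) {
            io_uring_submit(ring);
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(ring, &cqe) < 0)
                break;
            struct io_op *op = io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(ring, cqe);
            advance(ring, op);
            if (op->state == DONE) {
                free(op);
                pending--;
            }
        }
    }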
Data integrity and asynchronous I/O are unrelated concerns. The value of async I/O is that it exposes the concurrency and parallelism of the storage hardware to the software layer, enabling many types of performance optimizations. The vast majority of I/O operations in many database architectures do not have any sequential dependencies on other I/O operations, even within a single execution context.
Data integrity is a function of design and implementation correctness, in addition to having robust mechanisms for dealing with data corruption during I/O, hardware failures, etc.
Only in open source; advanced closed-source database engines haven't been designed around blocking I/O in a very long time. I have not worked on any database engine with blocking I/O in the last decade.
This is an area of easy integer factor performance improvement for open source databases, and one of the primary reasons they are not competitive with the performance of many closed source designs.
Well, you already mention io_uring. That's your syscall batching system. It doesn't cover all existing syscalls because not everything is safe to do asynchronously.
Asynchronous io_uring is good for things that take a lot of time, e.g. IO. It's really just a queue between user-space and kernel threads.
But it doesn't replace just making a synchronous syscall for small things, e.g. stat'ing files etc.
A lot of those syscalls are called in quick succession, often with no logic between them.
The Soares paper I linked to showed how buffering up a list of syscalls, doing a single context switch to the kernel, executing them and then switching back is a major gain for a wide variety of applications.
Synchronous syscall batching would be much more straightforward to adopt than async IO; imagine if the cstdlib file functions all used batched syscalls, everyone would benefit! Whereas I would imagine adopting true async IO would be very hard, because cstdlib exposes a blocking API to its users.
You can use io_uring for synchronous batching too. All you have to do is submit a batch and set min_complete to the number of submitted items when calling io_uring_enter. That's a single syscall, assuming you already have some thread-local ring lying around just for batching.
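Something like this, say for the stat-heavy case mentioned above (hypothetical helper; assumes a kernel recent enough to have IORING_OP_STATX):

    /* Synchronous batching via io_uring: prepare n statx requests, then one
       io_uring_enter (inside io_uring_submit_and_wait) submits and waits for
       all of them. */
    #define _GNU_SOURCE
    #include <liburing.h>
    #include <fcntl.h>      /* AT_FDCWD */
    #include <sys/stat.h>   /* struct statx, STATX_BASIC_STATS */

    int batched_stat(const char **paths, struct statx *out, unsigned n)
    {
        struct io_uring ring;
        if (io_uring_queue_init(n, &ring, 0) < 0)
            return -1;

        for (unsigned i = 0; i < n; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_statx(sqe, AT_FDCWD, paths[i], 0,
                                STATX_BASIC_STATS, &out[i]);
        }

        /* One trip into the kernel for the whole batch. */
        io_uring_submit_and_wait(&ring, n);

        struct io_uring_cqe *cqe;
        for (unsigned i = 0; i < n; i++) {   /* reap results; cqe->res < 0 is -errno */
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }

Still one context switch per batch instead of one per stat, which is the whole point.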
> A lot of those syscalls are called in quick succession, often with no logic between them.
Speaking of which, I noticed on macOS that the “du” command-line utility seems much slower than a third-party GUI utility called DaisyDisk [1] at collecting the sizes of everything on a disk.
This has me wondering what DaisyDisk is doing, if anything, that “du” isn’t, to achieve this.
One theory of mine is that perhaps APFS caches metadata about file sizes and that this can be accessed faster than walking all directories and stat’ing all the things.
An alternative theory I have is that perhaps GUI apps just run with a higher priority.
If anyone knows, I’d be delighted to learn.
A third possibility is that all of this was just due to “noise”. For example, disks happening to be under less I/O pressure when I ran DaisyDisk than when I’ve run “du”. My perception of their relative speed is based on very little scientific rigor.
One of the features of APFS is supposedly a cache for total size of directory trees.
As far as I remember, it is something the OS decides to turn on or off for different folders. It’s wasteful for /tmp, for example, because it makes file creation and deletion slightly slower.
I’m not entirely sure if du would want to tap into that. A lot has changed in MacOS, and it’s not entirely clear if we still have a universal definition of “file size”. You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.
Don’t remember the exact name, sorry.
There is a macOS-native way to explore recursive size, by the way: About This Mac -> Storage -> Manage -> Documents -> File Browser. This is limited to your user folder by default, but you can just create a soft link to take you to /
> You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.
Do you know how that translates to the BSD layer? Does such a file appear to exist and have a certain size, but is e.g. exclusively read-locked, unless you manually download it? Or does fopen(2) block, trigger an iCloud download, and then unblock once it's complete?
It also doesn't (yet) handle use cases like FOPEN -> FREAD -> FCLOSE, where all of them are queued at once. This is because the read and the close need the file descriptor that the open operation returns.
Perhaps not a problem for databases (which probably keep all relevant files open anyhow).