The diagram is a bit misleading. There is no system call (switch into the kernel) on read or write hits, assuming the memory is already mapped, has a page table entry, and is resident. You only incur switches when performing the mapping and on the various page fault paths (lazy mapping, non-resident data, copy-on-write).
> This usually happens when the ratio of storage size to RAM size is significantly higher than 1:1. Every page that is brought into cache causes another page to be evicted.
While true, you can optimize for this by unmapping larger ranges in bulk (to reduce cache pressure) or prefetching them (to reduce blocking) with madvise, letting the kernel do the loading asynchronously while you're accessing the previously prefetched ranges.
If you know your read and write patterns very well you can effectively use this for nearly asynchronous IO without the pains of AIO and with few to no extra threads.
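To make that concrete, a scan over a large read-only mapping could look roughly like this (a minimal sketch; CHUNK is just an illustrative window size, and the right value depends on your access pattern):

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative prefetch window; tune to your workload. */
#define CHUNK (64UL * 1024 * 1024)

/* Walk a large mmap()ed region: ask the kernel to start loading the
 * next window (MADV_WILLNEED) while we work on the current one, then
 * drop finished windows (MADV_DONTNEED) to limit cache pressure. */
uint64_t scan(uint8_t *base, size_t total)
{
    uint64_t sum = 0;
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t len = total - off < CHUNK ? total - off : CHUNK;
        if (off + len < total) {
            size_t ahead = total - (off + len);
            madvise(base + off + len, ahead < CHUNK ? ahead : CHUNK,
                    MADV_WILLNEED);               /* async read-ahead */
        }
        for (size_t i = 0; i < len; i++)          /* stand-in workload */
            sum += base[off + i];
        madvise(base + off, len, MADV_DONTNEED);  /* release this window */
    }
    return sum;
}
```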
In really new Linux kernels there's a recently merged preadv2 syscall. preadv2 supports a flag, RWF_NOWAIT, that tells the kernel not to block if a read requires uncached data.
This lets you play around with the policy for blocking reads in user space. Examples: try to make progress on partially read data and queue up the rest on another thread; or try performing reads from your network thread and, if the data is unavailable, queue the blocking read onto a separate disk thread and serve a different request.
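A minimal sketch of that kind of policy, assuming a kernel new enough to support RWF_NOWAIT (queue_blocking_read is a hypothetical helper standing in for your own disk-thread queue):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/uio.h>

/* Hypothetical: hands the read off to a dedicated disk thread. */
void queue_blocking_read(int fd, void *buf, size_t len, off_t off);

/* Attempt a non-blocking read: it only succeeds if the data is already
 * in the page cache; otherwise defer the blocking read and move on. */
ssize_t try_read(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n < 0 && errno == EAGAIN)
        queue_blocking_read(fd, buf, len, off);
    return n;  /* may be a short read if only part of it was cached */
}
```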
There's no RWF_NOWAIT support for writes in pwritev2. But it's technically possible to implement: if the write would cause a writeout to disk, return EWOULDBLOCK.
Disclaimer: I'm the author of the preadv2/pwritev2 syscalls, and the original author of the support for RWF_NOWAIT, which Christoph Hellwig took over. So feel free to consider this to be self-congratulatory.
Correction: It looks like there's a patch set floating around for adding support for RWF_NOWAIT in pwritev2 as well. https://patchwork.kernel.org/patch/9787271/ (comment from Christoph)
Is this a way to have non-blocking reads from the file system? For regular files O_NONBLOCK does nothing, so one has to use thread pools. For plain files served from a local disk it's usually not worth it, but files can be served by sshfs, nfs, or something more exotic made with FUSE.
If I understand this correctly one can use RWF_NOWAIT to do what one would naively expect from O_NONBLOCK. Am I correct?
This was my original model when I came up with the idea; in fact, my v1 hack used recv and MSG_DONTWAIT. For a number of reasons the kernel community did not want to overload that interface and that flag.
Besides some technical reasons, the big reason was an "impedance mismatch" with the recv API, which works on stream data. In that context you're depending on another party to send data, and you can wait for that to happen using another API (select et al.), so the buffer will be filled without any action on your part. On the other hand, preadv2(..., RWF_NOWAIT) is not going to trigger any further read-in, and there's no way to wait on the data. Although the previous statement is not 100% true... preadv2 may or may not trigger readahead (if it's enabled).
I think he means perform a disk read in response to a network request, and choose to respond to a different queued/pooled request if the data is not available to complete it.
I would assume the intent is similar but applied to a different type/source of IO.
I noticed that as well. Maybe it's assuming a CPU cache and TLB miss on read, which would mean going to the kernel. Also, on writes to new pages you have to fault once for every page, which, depending on the write size, is like having to go to the kernel for every write. It can actually be good for performance in some dimensions to pre-touch pages in a loop.
I can't speak for Avi though.
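The pre-touch trick looks roughly like this (a sketch; it just forces all the per-page soft faults to happen up front instead of spread through the write path):

```c
#include <stddef.h>
#include <unistd.h>

/* Touch one byte per page of a freshly mapped writable region so the
 * allocation / copy-on-write faults all happen here, in one tight loop. */
void pretouch(volatile char *p, size_t len)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < len; i += pg)
        p[i] = p[i];  /* read + write forces a writable, resident page */
}
```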
I suspect he is writing from the perspective of a database implementor. For a database the page cache is not predictable in ways you want it to be.
For writes the issue is IO scheduling, when flushing occurs, and how much is flushed at a time and at what rate.
For reads, if a page is not in memory you have to block a thread to read it. If you are thread-per-core this means you either have a thread pool on the side to do IO, or you block an event loop, preventing other work from making progress. Even having the thread pool on the side is problematic, because checking whether a page is in the cache is not cheap either and requires a system call, which is kind of self-defeating.
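The residency check in question is mincore(2); something like this (a sketch - and note it's a syscall itself, which is exactly the self-defeating part):

```c
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns true if every page of [addr, addr+len) is resident.
 * addr must be page-aligned (e.g. straight from mmap). */
bool range_resident(void *addr, size_t len)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + pg - 1) / pg;
    unsigned char vec[pages];            /* VLA; heap-allocate big ranges */
    if (mincore(addr, len, vec) != 0)    /* a syscall per check */
        return false;
    for (size_t i = 0; i < pages; i++)
        if (!(vec[i] & 1))               /* low bit = page is resident */
            return false;
    return true;
}
```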
> Maybe it's assuming a CPU cache and TLB miss on read which means going to the kernel.
My understanding is that those normally don't incur faults that would switch to the kernel context. They're just slow due to several memory lookups that have to be done to satisfy the primary memory access.
Correct for x86/x64. A TLB miss requires the CPU to look up the translation in the currently active page table, which means additional memory accesses. I'm not sure how the cache hierarchy works for page table lookups following a TLB miss, but worst case it's equivalent to an additional L3 cache miss.
Actually it depends on the processor. Nowadays it's rare, but in the past there used to be software TLBs where a TLB miss did trap into the kernel, and there was no radix tree-like page table structure in the definition of the architecture.
SPARC and MIPS for example worked like this, and server POWER chips had a hashed page table that many OSes populated lazily, effectively treating it like a (more efficient) software TLB.
Well, MIPS is still in widespread use in wifi hardware, and POWER is being resurrected as a scale-up way to deal with a growing database, considering the current generation scales higher at about the same price point as a 4-socket Intel machine. So I am not sure what the past tense is supposed to mean: that these architectures no longer comprise a significant part of worldwide (say, Dhrystone) MIPS, or that they lack use where they are good (false).
I am not sure about MIPS, but other processors have moved away from either software TLBs or hashed page tables (POWER9 supports tree page tables). Hence the past tense.
> If you know your read and write patterns very well you can effectively use this for nearly asynchronous IO without the pains of AIO and with few to no extra threads.
This may work for sequential reads, but not for random reads.
> If you know your read and write patterns very well you can effectively use this for nearly asynchronous IO without the pains of AIO and with few to no extra threads.
The same is true of plain read/write, barring extra data copies to/from the kernel.
The only reason to ever use AIO is if you're doing direct I/O, in which case mmap doesn't apply.
"If you know your read and write patterns very well you can effectively use this for nearly asynchronous IO without the pains of AIO and few to none extra threads."
I have seen Windows File Explorer lock up over and over again because of the same assumption. It assumes the directory scan is a fast operation. A slow, bad USB drive, an NFS or Samba mount, or scanning a directory with thousands of files will lock up the GUI for a long period of time.
Dragging a file over a slow/busy system icon with an SMB mount will just spin / hang the whole GUI - not a very good UX.
Very nice job, I like the diagrams and the description.
Noticed this bit, "block size which is typically 512 or 4096 bytes", and was wondering how the application would know how to align. Does it query the file descriptor for the block size? Is there an ioctl call for that?
When it comes to IO it's also possible to differentiate between blocking/non-blocking and synchronous/asynchronous, and those categories are orthogonal in general.
So there is blocking synchronous: read, write, readv, writev. The calling thread blocks until data is ready and while it gets copied to user memory.
Non-blocking synchronous: using non-blocking file descriptors with select/poll/epoll/kqueue. Checking when data is ready is done asynchronously, but the read or write itself still happens inline and the thread waits for the data to be copied to user space. This works for socket IO on Linux but not for disk.
Non-blocking asynchronous: using AIO on Linux. Here both waiting until data is ready to be transferred and the transfer itself happen asynchronously. aio_read returns right away, before the read finishes; then you have to use aio_error to check its status. This works for disk but not socket IO on Linux.
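A minimal sketch of that third flavor with POSIX AIO (link with -lrt; real code would do useful work instead of spinning on aio_error):

```c
#include <aio.h>
#include <errno.h>
#include <string.h>

/* Submit an asynchronous disk read; aio_read returns immediately and
 * the kernel (or glibc's thread pool) completes it in the background. */
ssize_t async_read(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_read(&cb) != 0)              /* returns right away */
        return -1;

    while (aio_error(&cb) == EINPROGRESS)
        ;                                /* do other work here instead */

    return aio_return(&cb);              /* bytes read, or -1 on error */
}
```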
I released a native add-on [1] for Node.js earlier this week for querying a block device file descriptor for its block size. It uses different ioctl calls depending on the platform and works on FreeBSD, Linux, macOS and Windows.
It also provides cross-platform aligned buffers for DMA.
Note that "block size" can mean logical or physical sector size. IIRC st_blksize will typically report the logical sector size (usually 512 bytes), but the physical sector size (usually 4096 bytes) is often larger on Advanced Format drives and is what you want to be using for optimal alignment. Any multiple of 4096 bytes is usually a safe bet for block size. Note that some SSDs have 8192 or 16384 bytes as their "physical sector" page size / block size.
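On Linux the two sizes come from different ioctls; roughly like this (a Linux-specific sketch - other platforms need other calls, e.g. DKIOCGETBLOCKSIZE on macOS, which is presumably what the add-on above abstracts over):

```c
#include <linux/fs.h>    /* BLKSSZGET, BLKPBSZGET (Linux-specific) */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>

/* Print logical vs physical sector size of an open block device fd,
 * plus the st_blksize that stat reports for comparison. */
void print_sector_sizes(int fd)
{
    int logical = 0;
    unsigned int physical = 0;
    struct stat st;

    if (ioctl(fd, BLKSSZGET, &logical) == 0)
        printf("logical sector:  %d bytes\n", logical);
    if (ioctl(fd, BLKPBSZGET, &physical) == 0)
        printf("physical sector: %u bytes\n", physical);
    if (fstat(fd, &st) == 0)
        printf("st_blksize:      %ld bytes\n", (long)st.st_blksize);
}
```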
Ooh, very nice work. Was reading your README and the binding.cc file, learned a lot of stuff, like how calling fsync on a direct I/O file descriptor might not flush anything to disk since the dirty-page writeback mechanism is bypassed.
In the benchmark, what was the physical block size, 16KB? It seems with O_DIRECT the speed tapers off after 16KB and stays at about 130MB/s, and there is a significant jump from 4K (60MB/s) to 8K (90MB/s).
Sorry for the late reply, just spotted your comment.
Thanks for the input, glad you found it useful!
Regarding the benchmark, the physical block size of the drive itself was 512 bytes. The memory page size on the Ubuntu system was 4K. It would be good to update the benchmark script to include these kinds of numbers.
I'm not sure why O_DIRECT speed increased from block size 4K (60MB/s) to 8K (90MB/s) and tapered off at 16K (130MB/s). Perhaps system call overhead is more noticeable at 4K and 8K and less so at 16K? Any other ideas?
130MB/s is the raw drive write speed and matches what I got from dd for the same flags for the same drive.
> The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache.
Some other advantages:
The kernel has a global view of what is going on with all the different applications running on the system, whereas your application only knows about itself.
The cache can be shared amongst different applications.
You can restart applications and the cache will stay warm.
For servers running big data tools, like the one described, 99.99% of the IO is going to be from the application, so you want the IO algorithms tuned to that. If there is some inefficiency in the remaining 0.01% then too bad.
Do you mean you want to cache things that can't be serialised to memory pages, or you don't want the memory pages to be backed by files?
In the context of this article, many databases use the page cache to cache their data structures, either in combination with private pages (e.g. Postgres) or entirely (e.g. LMDB).
Oh I see, I thought you might be referring to how the page cache (obviously) can't cache things like sockets.
Some applications do use the page cache for non-caching purposes though, for instance IPC, or ensuring their working-data will survive an application restart. The memory is still associated with files naturally, but this is often regarded as a feature, not a problem. Some applications use files on tmpfs so that their page cache data is backed by swap (or nothing), the same as anonymous pages.
Yes, good point, I was speaking in general terms. This article shows they've put a lot of thought into it and argues convincingly that bypassing the page cache makes sense for their system.
Great article for a noob like me. Any suggestions for similar articles that give a good high-level overview of kernel internals, specifically IO topics like what happens when an application requests a file - right from issuing the system call to the disk interactions?
I was always confused - still am - about the different caching layers involved in an IO operation.
There's also a class of functions that act like read/write but are done completely in the kernel, without returning to userspace (sendfile/(vm)splice/copy_file_range).
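For example, copy_file_range copies file-to-file without the data ever crossing into userspace (a sketch; needs kernel >= 4.5 and a glibc with the wrapper):

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Copy len bytes between two regular files entirely inside the kernel;
 * passing NULL offsets advances both file positions as read/write would. */
int copy_file(int fd_in, int fd_out, size_t len)
{
    while (len > 0) {
        ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
        if (n <= 0)
            return -1;   /* real code: check errno, retry on EINTR */
        len -= (size_t)n;
    }
    return 0;
}
```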
In my experience, the amount of data many applications are expected to access is growing geometrically, while RAM costs are declining only arithmetically. There certainly are in-memory use cases, but for most use cases it is not now (and probably won't become) economical.
You're both right and wrong at the same time. The disconnect is based on the kind of data we're talking about.
Human-generated metadata (i.e. OLTP data) grows roughly with the population of internet-connected humans - humans can only produce so much content in their finite time. Advances in computers far exceed this growth rate, so more of these kinds of problems fit into RAM each year.
The volume of media data, which depends more on the increasing fidelity of the media (videos going from 320p to 4K, photos going from 1MP to 30MP), has been increasing rapidly, but that trend won't continue forever because there are diminishing returns. Do you care about the difference between 4K and 8K video? TV companies hope you do. At some point (likely soon) you will stop caring.
OLAP data, which includes data produced by IoT devices, sensors, etc., is increasing geometrically. But this kind of data is not best stored in RAM anyway, and is served best by scale-out column-oriented and time-series databases.
Agreed. For certain use cases (session store, user profile store), the lower cost of memory and the relatively slow growth of data will mean that they will usually fit in memory. But, in my experience, there's a larger, and growing, class of use cases (ML, IOT, Fraud detection, Gaming data) where both the amount of raw data captured and the expectations for data retention are causing the total dataset to spiral into the dozens or hundreds of TB, with expectations for real-time SLAs on query. That's what I was talking about.
Yes, but data on the order of TBs is usually stored on platters (as opposed to SSD).
In that case, a fast data access library will not buy you much, because the disk will be slow anyway.
(By the way it is common that an in-memory database is combined with on-disk storage for large blobs; those blobs are just flat files, and the OS already offers the fastest way to access them without any library)
Ok, but that doesn't take away from the point that very often the database can still be stored in memory, except for large blobs, which are stored in flat files. Access to those files is usually sequential, so you will not need a special library.
Yes, video streaming and management is a use case as you describe. Metadata is kept in memory, and the video files, themselves, are stored on an inexpensive media - usually as file system. So queries about "what can i watch" are quick, but loading the video can take seconds. But any use case where you want to keep state over time - sums, averages, maxes, histories - requires you to keep all the individual points of data in the database. Those are the (increasingly common) use cases that I run into where very fast disk storage like Scylla are important.
The one thing that will surely ruin your server's reliability is adding more RAM to it. Based on my relatively recent experience, a server with 1TB of RAM will have an MTBF of about 150k to 200k hours. Most of the failures (>80%) are memory or memory-socket related.
What are typical use cases for direct I/O? I never stumbled upon anything that used it. Intuitively, I would assume specialized logging applications or databases, but even these seem to use other mechanisms.
I work in online advertising (AdGear/Montreal/Canada) and a lot of the business is around RTB (Real Time Bidding). This is an area where millions of requests/second is the norm and latency is the name of the game.
We've been using Cassandra for many years. As the business has grown, we've refreshed/updated/resized/tweaked/hacked the cluster. Unfortunately past a certain point, there were no more performance wins to be had - further - we started experiencing stability issues.
We came across ScyllaDB. I had heard about the underlying SeaStar framework and was excited to test something based on it that was addressing a real pain point.
We are not (yet) customers of ScyllaDB - however we're pretty far along in our POC. To summarize:
* Extremely low latency compared to cassandra
* During delivery valleys, peaks, and even burst batch loads
* It delivers the above running on a fraction of the hardware Cassandra does
* Very good no-bs responsive engineering support
Scylla's having a conference in SF later this month where AdGear will be presenting our use case more in depth. If it sounds like the product may address some pain points you're working on, might be worth attending.
/not affiliated with Scylla beyond the potential-customer-in-POC case described above