Hacker News
Why buffered writes are sometimes stalled (yoshinorimatsunobu.blogspot.com)
69 points by sciurus on Dec 15, 2014 | 14 comments



To save others a few minutes with the stable pages patch, it's already in the mainline kernel, visible in sysfs as /sys/block/*/bdi/stable_pages_required.
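To make that concrete, here's a small sketch (the function name is mine, not from the thread) that checks the sysfs flag for each visible block device; it assumes a Linux sysfs layout, and on other systems the glob simply matches nothing:

```python
# Sketch: report which block devices require stable pages.
# Assumes a Linux sysfs layout; matches nothing on other systems.
import glob

def stable_page_devices():
    results = {}
    for path in glob.glob("/sys/block/*/bdi/stable_pages_required"):
        with open(path) as f:
            results[path] = f.read().strip() == "1"
    return results

for path, required in stable_page_devices().items():
    print(path, "requires stable pages" if required else "does not require stable pages")
```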

Also, on a sort of related note, aio is steadily progressing, so maybe we'll see more reliance on it: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.g... and libaio too: https://git.fedorahosted.org/cgit/libaio.git/


"When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done."

I'm not sure I got this, but this seems similar to vsync - making sure only complete 4k pages are written to disk, and then flipping the buffer and processing the next one.

But what error condition does this guard against? It seems this is only useful in a non-journaled file system.


If you don't lock, you can get a situation where the file on disk is in an inconsistent state until all writes have completed (for example, half of the second write applied but the other half not, etc)

As far as I know, at least. If (when) I'm wrong, please correct me.


Yes, isn't that a situation that a journaled FS should prevent? So for e.g. XFS it should be redundant.


Large "granularities" on storage devices always suffer this problem - whether it's sectors on an HDD (which have silently transitioned from 512B to 4KB) or blocks/pages in flash on an SSD. Perhaps it's become more prominent now that the granularities have increased while small read/write operations are still common.

(Aside: The autogenerated spam comments there are also strangely interesting - they sound almost poetic.)


>(Aside: The autogenerated spam comments there are also strangely interesting - they sound almost poetic.)

Given the right corpus and parameters, Markov chains can do a surprisingly scary job of producing content that seems profound and/or humorous.


If the task is to just overwrite existing files without blocking, why not mmap()?


I wish people would stop bringing up mmap as a performance panacea. It's frequently worse than using IO system calls due to the need to actually take a page fault and map the file's backing page into the faulting process's address space. mmap is not a magical fast path. It certainly shouldn't be your default IO strategy.


mmap() has no way of knowing how much you'll write to the file, where, in what order, or even whether you may read before doing any writing, so it still has to read the blocks into memory first. A block-aligned write() or pwrite(), however, doesn't require any reading.
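A rough sketch of that write pattern (the helper name is illustrative, not from the thread): each `pwrite()` covers whole pages at page-aligned offsets, so the kernel can replace the block contents outright instead of reading them in first.

```python
# Sketch: overwrite a file with page-aligned, page-sized pwrite() calls,
# the pattern that avoids any read-modify-write of partial blocks.
import os, tempfile

PAGE = 4096  # assumed page/block size; real code should query it

def overwrite_pages(fd, num_pages):
    buf = b"\xab" * PAGE
    for page_no in range(num_pages):
        # pwrite at an absolute, page-aligned offset; no seek needed
        written = os.pwrite(fd, buf, page_no * PAGE)
        assert written == PAGE
    return os.fstat(fd).st_size

fd, path = tempfile.mkstemp()
try:
    size = overwrite_pages(fd, 4)
    print(size)  # 16384
finally:
    os.close(fd)
    os.unlink(path)
```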


As others have suggested, if mmap is so fast, wouldn't it be expected that write/pwrite would just be mapped (pun intended) to mmap inside the standard library? [Heck for all I know, that might already be the case]

I have had to actually write benchmark tests to show some people I work with that there was no conclusive difference between the two (with our data access patterns). mmap has the disadvantage that it gives you more ways to shoot yourself in the foot (ever seen SIGBUS signals?). Before that, they swore up and down that mmap is this magic performance hack that was buried in there for ages and only the elites know about it.


My experience has been that mmap is not magical for writing in most workloads, but it is mostly magical for reading - I have yet to encounter a real-life workload in which mmap'ing and using memory was NOT easier and at least as fast as reading. On 32-bit systems, however, it's easy to run out of usable address space - 64-bit makes it useful again.
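For reading, that style looks something like the sketch below (the helper name is mine): map the file once, then slice memory instead of issuing a read() per access. Pages are faulted in on demand.

```python
# Sketch: a read path that maps the file and slices memory rather than
# calling read() for each access.
import mmap, os, tempfile

def read_via_mmap(path, start, length):
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
            # Slicing the map touches pages on demand via page faults;
            # no explicit read() syscall per access.
            return bytes(m[start:start + length])
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.write(fd, b"hello, mapped world")
os.close(fd)
result = read_via_mmap(path, 7, 6)
print(result)  # b'mapped'
os.unlink(path)
```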


One workload where I find it worse: if my access pattern is sequential reads from the start of a file (no seeking or random access), or could reasonably be rewritten as such. In that case, using mmap() breaks some expected Unixy flexibility, because it demands real files, while there's no reason in this access pattern that your program should die if it finds a named pipe instead. Of course you could test for pipe and provide an alternate read path, but then you might as well just use that for real files too, instead of maintaining two paths.
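That flexibility difference is easy to demonstrate: read() is perfectly happy with a pipe, while mmap needs a real file and fails. (The exact error is an assumption here: on Linux, mmap on a pipe returns ENODEV, which Python surfaces as OSError; the sketch also catches ValueError for Python's own size checks.)

```python
# Sketch: read() works on a pipe; mmap() demands a regular file and fails.
import mmap, os

r, w = os.pipe()
os.write(w, b"streamed data")
os.close(w)

first_read = os.read(r, 8)  # plain read() works fine on a pipe

try:
    mmap.mmap(r, 8, access=mmap.ACCESS_READ)
    mmap_worked = True
except (OSError, ValueError):
    # Linux returns ENODEV for mmap on a pipe (an assumption of this sketch)
    mmap_worked = False

os.close(r)
print(first_read, mmap_worked)
```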


The semantics for write(2) and mmap(2) differ. Neither can be used to implement the other in full.

It's possible to have access patterns which don't demonstrate the efficiencies of mmap but it would be incorrect to suggest they do not exist. The overhead of making a syscall to access an area of a file and copying the data being accessed is significant for many workloads.


Significant typo: the second memset() in each pair needs to be memcpy() instead.



