Hacker News
Why buffered writes are sometimes stalled (yoshinorimatsunobu.blogspot.com)
69 points by sciurus on Dec 15, 2014 | 14 comments



To save others a few minutes with the stable pages patch, it's already in the mainline kernel, visible in sysfs as /sys/block/*/bdi/stable_pages_required.
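To make that concrete, here's a small sketch (the function name is mine, not from the thread) that checks the sysfs flag for each visible block device; it assumes a Linux sysfs layout, and on other systems the glob simply matches nothing:

```python
# Sketch: report which block devices require stable pages.
# Assumes a Linux sysfs layout; matches nothing on other systems.
import glob

def stable_page_devices():
    results = {}
    for path in glob.glob("/sys/block/*/bdi/stable_pages_required"):
        with open(path) as f:
            results[path] = f.read().strip() == "1"
    return results

for path, required in stable_page_devices().items():
    print(path, "requires stable pages" if required else "does not require stable pages")
```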

Also, on a sort of related note, aio is steadily progressing, so maybe we'll see more reliance on it: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.g... and libaio too: https://git.fedorahosted.org/cgit/libaio.git/


"When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done."

I'm not sure I got this, but this seems similar to vsync - making sure only complete 4k pages are written to disk, and then flipping the buffer and processing the next one.

But what error condition does this guard against? It seems this is only useful in a non-journaled file system.


If you don't lock, you can get a situation where the file on disk is in an inconsistent state until all writes have completed (for example, half of the second write applied but the other half not, etc)

As far as I know, at least. If (when) I'm wrong, please correct me.


Yes, isn't that a situation that a journaled FS should prevent? So for e.g. XFS it should be redundant.


Large "granularities" on storage devices always suffer this problem - whether it's sectors on an HDD (which have silently transitioned from 512B to 4KB) or blocks/pages in flash on an SSD. Perhaps it's become more prominent now that the granularities have increased while small read/write operations are still common.

(Aside: The autogenerated spam comments there are also strangely interesting - they sound almost poetic.)


>(Aside: The autogenerated spam comments there are also strangely interesting - they sound almost poetic.)

Given the right corpus and parameters, Markov chains can do a surprisingly scary job of producing content that seems profound and/or humorous.


If the task is to just overwrite existing files without blocking, why not mmap()?


I wish people would stop bringing up mmap as a performance panacea. It's frequently worse than using IO system calls due to the need to actually take a page fault and map the file's backing page into the faulting process's address space. mmap is not a magical fast path. It certainly shouldn't be your default IO strategy.


mmap() has no way of knowing how much you'll write to the file, where, in what order, or even whether you may read before doing any writing, so it still has to read the blocks into memory first. A block-aligned write() or pwrite(), however, doesn't require any reading.
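A rough sketch of that write pattern (the helper name is illustrative, not from the thread): each `pwrite()` covers whole pages at page-aligned offsets, so the kernel can replace the block contents outright instead of reading them in first.

```python
# Sketch: overwrite a file with page-aligned, page-sized pwrite() calls,
# the pattern that avoids any read-modify-write of partial blocks.
import os, tempfile

PAGE = 4096  # assumed page/block size; real code should query it

def overwrite_pages(fd, num_pages):
    buf = b"\xab" * PAGE
    for page_no in range(num_pages):
        # pwrite at an absolute, page-aligned offset; no seek needed
        written = os.pwrite(fd, buf, page_no * PAGE)
        assert written == PAGE
    return os.fstat(fd).st_size

fd, path = tempfile.mkstemp()
try:
    size = overwrite_pages(fd, 4)
    print(size)  # 16384
finally:
    os.close(fd)
    os.unlink(path)
```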


As others have suggested, if mmap is so fast, wouldn't it be expected that write/pwrite would just be mapped (pun intended) to mmap inside the standard library? [Heck for all I know, that might already be the case]

I have had to actually write benchmark tests to show some people I work with that there was no conclusive difference between the two (with our data access patterns). mmap has the disadvantage that it gives you more ways to shoot yourself in the foot (ever seen SIGBUS signals?). Before that, they swore up and down that mmap is this magic performance hack that was buried in there for ages and only the elites know about it.


My experience has been that mmap is not magical for writing in most workloads, but it is mostly magical for reading - I have yet to encounter a real-life workload in which mmap'ing and using memory was NOT easier and at least as fast as reading. On 32-bit systems, however, it's easy to run out of usable address space - 64-bit makes it useful again.
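For reading, that style looks something like the sketch below (the helper name is mine): map the file once, then slice memory instead of issuing a read() per access. Pages are faulted in on demand.

```python
# Sketch: a read path that maps the file and slices memory rather than
# calling read() for each access.
import mmap, os, tempfile

def read_via_mmap(path, start, length):
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
            # Slicing the map touches pages on demand via page faults;
            # no explicit read() syscall per access.
            return bytes(m[start:start + length])
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.write(fd, b"hello, mapped world")
os.close(fd)
result = read_via_mmap(path, 7, 6)
print(result)  # b'mapped'
os.unlink(path)
```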


One workload where I find it worse: if my access pattern is sequential reads from the start of a file (no seeking or random access), or could reasonably be rewritten as such. In that case, using mmap() breaks some expected Unixy flexibility, because it demands real files, while there's no reason in this access pattern that your program should die if it finds a named pipe instead. Of course you could test for pipe and provide an alternate read path, but then you might as well just use that for real files too, instead of maintaining two paths.
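That flexibility difference is easy to demonstrate: read() is perfectly happy with a pipe, while mmap needs a real file and fails. (The exact error is an assumption here: on Linux, mmap on a pipe returns ENODEV, which Python surfaces as OSError; the sketch also catches ValueError for Python's own size checks.)

```python
# Sketch: read() works on a pipe; mmap() demands a regular file and fails.
import mmap, os

r, w = os.pipe()
os.write(w, b"streamed data")
os.close(w)

first_read = os.read(r, 8)  # plain read() works fine on a pipe

try:
    mmap.mmap(r, 8, access=mmap.ACCESS_READ)
    mmap_worked = True
except (OSError, ValueError):
    # Linux returns ENODEV for mmap on a pipe (an assumption of this sketch)
    mmap_worked = False

os.close(r)
print(first_read, mmap_worked)
```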


The semantics for write(2) and mmap(2) differ. Neither can be used to implement the other in full.

It's possible to have access patterns which don't demonstrate the efficiencies of mmap but it would be incorrect to suggest they do not exist. The overhead of making a syscall to access an area of a file and copying the data being accessed is significant for many workloads.


Significant typo: the second memset() in each pair needs to be memcpy() instead.



