
Which Postgres bug are you referring to? I've always been under the impression that the big databases use O_DIRECT to stay in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, and mostly sqlite3).

Multiple processes writing to the same file would need some coordination anyway, so it's not much of a problem that you see errors "from" other processes. Note that under Linux, every process is told about the last error that happened since it opened the file (errors are reported from write() and other functions as well, and any error is reported only once, AFAIK). Again, looking at the implementation side, where the page cache has a partial view of the file address space and writes just go to the pages in that view, the behaviour makes quite a bit of sense. (I'm not saying that this is a great model to build a database on.)

Yep, if fsync() returns an error, in general you should probably just assume that any of your writes since the last fsync() may or may not have hit the disk. In practice it may mean even less in some situations.
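To make that concrete, here is a minimal sketch in C of the conservative pattern. The helper names write_all() and write_durably() are made up for illustration, and the error policy (treat everything since the last successful fsync() as suspect, and do not count a second fsync() as a retry) is my reading of the behaviour described above, not something any particular database is confirmed to do.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on short
 * writes and on EINTR. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: returns 0 only if the data was handed to the
 * kernel AND fsync() reported success. On -1 the caller should assume
 * that any write since the last successful fsync() may be lost.
 * Calling fsync() again is not a retry: the error may already have
 * been consumed, so a later "success" proves nothing about old data. */
static int write_durably(int fd, const char *buf, size_t len)
{
    if (write_all(fd, buf, len) != 0)
        return -1;
    if (fsync(fd) != 0)
        return -1;
    return 0;
}
```

On failure the only safe recovery I can see is to redo the writes from a copy you still hold (a WAL, an in-memory buffer), not to trust the page cache.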

You can try sync_file_range() as well (Linux specific).
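A rough sketch of how that call is typically used, assuming glibc on Linux. The helper names start_writeback() and flush_range() are hypothetical. Keep in mind that, per the man page, sync_file_range() does not write out file metadata and gives no guarantee that the data has reached stable storage, so it is a writeback hint and not a replacement for fsync().

```c
#define _GNU_SOURCE          /* sync_file_range() is a GNU/Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: ask the kernel to start writing back the dirty
 * pages in [offset, offset + nbytes) without waiting for completion. */
static int start_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for any writeback already in flight on the
 * range, submit what is still dirty, then wait for that to finish.
 * Returns 0 on success, -1 with errno set on failure. */
static int flush_range(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

The usual trick is to call start_writeback() right after each chunk you write, so that the eventual fsync() has less left to do.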

As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring will enable such a style of programming some day?




> I've always been under the impression that the big databases use O_DIRECT to stay in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, and mostly sqlite3).

As I understand it, O_DIRECT doesn't change fsync's wonky behaviour. It just asks the OS to stop using its filesystem cache.

> As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring will enable such a style of programming some day?

Maybe.

The nice thing about barriers or write groups is that you don't need to move to an async API like io_uring to keep issuing write commands to the OS. The kernel can still buffer writes and issue them to the underlying hardware when it's ready, and your program can go back to responding to requests. I think that'll always be nicer to work with than some io_uring + sync_file_range combination.



