1. Signals are bad. All the bad bits of IRQs, and you're not even working in assembly language, so they're twice as hard. Just say no.
2. The forking model is seductive, but wrong-headed. It's hard to make the file descriptor inheritance behave correctly, and it means the memory requirements are unpredictable (due to the copy-on-write pages).
3. Readiness I/O is not really the right way to do things, because there are obvious race conditions, and the OS can't guarantee one thread woken per pending operation (because it has no idea which threads will do what). Also, the process owns the buffer, which really limits what the OS can do... it needs to be able to own the buffer for the duration of the entire operation for best results, so it can fill it on any ready thread, map the buffer into the driver's address space, etc.
4. Poor multithreading primitives. This really annoyed me... like, you've got a bunch of pthreads stuff, but it doesn't interact with select & co. The Linux people aren't dumb, so they give you eventfd - but there's no promise of single wakeup! NT wakes up one thread per increment, and the wakeup atomically decrements the semaphore; Linux wakes up every thread waiting on the semaphore, and they all fight over it, because there's no other option.
(I'm just ignoring POSIX semaphores entirely, because they don't let you wait on them with select/poll/etc. in the first place.)
(Perhaps this and #3 ought to be the same item, because they are related. And the end result is that you need to use non-blocking IO for everything... but the non-blocking IO is crippled, because it still has to copy out into the caller's buffer. It's just a more inconvenient programming model, for no real benefit. A sketch of that readiness-plus-non-blocking pattern follows below.)
I guess it just boils down to what you want: a beautifully polished turd, or a carefully engineered system assembled from turds.
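A minimal sketch of the readiness-plus-non-blocking pattern criticized in #3 and the parenthetical above, assuming every fd in the epoll set is O_NONBLOCK (handle_data() is a hypothetical placeholder, not a real API):

    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    void event_loop(int epfd)
    {
        struct epoll_event events[64];
        char buf[4096];   /* caller-owned buffer: the kernel must copy into it */

        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                for (;;) {
                    ssize_t r = read(fd, buf, sizeof buf);   /* fd is O_NONBLOCK */
                    if (r > 0) {
                        /* handle_data(fd, buf, r); -- hypothetical: consume the copied bytes */
                    } else if (r == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
                        close(fd);    /* EOF or a real error */
                        break;
                    } else {
                        break;        /* EAGAIN: readiness was only a hint */
                    }
                }
            }
        }
    }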
Using signals in ye olde UNIX fashion is bad. However, that was entirely fixed with the advent of POSIX threads: block the signals you care about with pthread_sigmask() before spawning any other threads (so every thread inherits the mask), and have a dedicated signal-handling thread that loops on sigwaitinfo(). That thread then handles signals entirely synchronously, notifying other parts of the program through the usual inter-thread communication.
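A minimal sketch of that pattern, assuming the program only cares about SIGTERM and SIGHUP:

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    static sigset_t handled;

    static void *signal_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            siginfo_t info;
            int sig = sigwaitinfo(&handled, &info);
            if (sig < 0)
                continue;   /* e.g. interrupted */
            /* Entirely synchronous context: forward the event to the rest of
             * the program however you like (pipe, queue, condvar, ...). */
            printf("got signal %d from pid %ld\n", sig, (long)info.si_pid);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        sigemptyset(&handled);
        sigaddset(&handled, SIGTERM);
        sigaddset(&handled, SIGHUP);
        /* Block before creating any threads so they all inherit the mask. */
        pthread_sigmask(SIG_BLOCK, &handled, NULL);

        pthread_create(&tid, NULL, signal_thread, NULL);

        /* ... the rest of the program runs free of async signal handlers ... */
        pthread_join(tid, NULL);
        return 0;
    }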
That is quite the rant; I am impressed with how deeply you must know both systems to crank out a list like that. Did you just write all this for this post, or has it been kicking around your hard drive for a while?
> The Linux people aren't dumb, so they give you eventfd - but there's no promise of single wakeup! NT wakes up one thread per increment, and the wakeup atomically decrements the semaphore; Linux wakes up every thread waiting on the semaphore, and they all fight over it, because there's no other option.
You can get Linux to only wake up one thread using epoll and either EPOLLEXCLUSIVE or EPOLLONESHOT (sketched below).
Though I don't really understand why you'd want to wait on a semaphore and other waitables at the same time… probably this is just my Unix bias/lack of Windows experience showing.
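A minimal sketch of the EPOLLEXCLUSIVE approach, assuming the shared waitable is an eventfd created with EFD_SEMAPHORE (so each successful read() consumes exactly one token) and each worker thread owns its own epoll instance; EPOLLEXCLUSIVE needs Linux 4.5+:

    #include <stdint.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    /* Each worker registers the shared eventfd in its own epoll instance with
     * EPOLLEXCLUSIVE, so a post wakes one waiter rather than the whole herd. */
    int make_worker_epoll(int efd)
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE, .data.fd = efd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
        return epfd;
    }

    void worker_loop(int efd, int epfd)
    {
        struct epoll_event ev;
        uint64_t one;

        for (;;) {
            if (epoll_wait(epfd, &ev, 1, -1) <= 0)
                continue;
            /* EFD_SEMAPHORE: a successful read() takes exactly one token. */
            if (read(efd, &one, sizeof one) == sizeof one) {
                /* ... handle one unit of work ... */
            }
        }
    }

    /* Setup, e.g. in main():
     *   int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
     *   each thread: worker_loop(efd, make_worker_epoll(efd));
     * Posting a unit of work: uint64_t inc = 1; write(efd, &inc, sizeof inc); */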
> Though I don't really understand why you'd want to wait on a semaphore and other waitables at the same time…
This. I probably also lack some specific kind of experience, because I never understood why you would block a lot of threads on a semaphore and only want one of them to execute after a signal.
I just never saw the use case for that. Yet people complain about it a lot.
> it needs to be able to own the buffer for the duration of the entire operation for best results, so it can fill it on any ready thread, map the buffer into the driver's address space, etc.
Can it actually do that? For network write operations the OS has to split the data into MSS/PMTU-sized packets and add headers to it. For network read operations the OS has to reassemble the packets back into a stream or datagram, and it doesn't even know which process a packet is for until after the packet is read into memory.
You're making the copy regardless. Meanwhile IOCP requires O(n) read buffers for n sockets, instead of O(1) when the OS notifies you that it has enough packets to reassemble something.
If the HW has good vectored I/O support then it might be possible to send out data directly from user buffers. This is done by composing a packet from two buffer descriptors: one for the header, chained to a second descriptor for the data. But there are complications:
- Almost all hardware would require the buffers to be aligned. Though I think unaligned buffers could probably be handled with a hack by copying some bytes from the user buffer to the "header" buffer.
- The user buffer would need to remain available not only until the packets are transmitted but until they are acknowledged (assuming TCP). Therefore if you want to avoid copying data to kernel buffers for the potential retransmission, the application needs to track which buffers are pending (and is notified when they can be released).
Also, I was reading that zero-copy receive is also possible in some scenarios by changing virtual memory mappings. I'm sure lots of info can be found by googling "zero-copy TCP". FreeBSD supposedly has support for zero-copy TCP. (A sketch of the pending-buffer notification scheme from the previous point follows.)
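One concrete instance of that "track pending buffers, get notified on release" scheme is Linux's MSG_ZEROCOPY (kernel 4.14+). A sketch, assuming fd is a connected TCP socket and the caller keeps each buffer untouched until its send number is reported as completed:

    #include <linux/errqueue.h>   /* struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY */
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>       /* SO_ZEROCOPY / MSG_ZEROCOPY need newish headers */
    #include <sys/types.h>

    int enable_zerocopy(int fd)
    {
        int one = 1;
        return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof one);
    }

    /* Each MSG_ZEROCOPY send gets a sequence number; the pages backing buf must
     * stay untouched until that number shows up as completed below. */
    ssize_t send_zerocopy(int fd, const void *buf, size_t len)
    {
        return send(fd, buf, len, MSG_ZEROCOPY);
    }

    /* Drain one completion notification from the socket's error queue.
     * Fills [*lo, *hi] with the range of send numbers whose buffers the kernel
     * has released; returns -1 if nothing is pending (error-queue reads never block). */
    int reap_completions(int fd, uint32_t *lo, uint32_t *hi)
    {
        char control[128];
        struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof control };

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return -1;

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
            /* IPPROTO_IP/IPPROTO_IPV6 match the SOL_IP/SOL_IPV6 levels the kernel uses. */
            if ((c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_RECVERR) ||
                (c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_RECVERR)) {
                struct sock_extended_err err;
                memcpy(&err, CMSG_DATA(c), sizeof err);
                if (err.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                    *lo = err.ee_info;   /* first completed send number */
                    *hi = err.ee_data;   /* last completed send number */
                    return 0;
                }
            }
        }
        return -1;
    }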
> Therefore if you want to avoid copying data to kernel buffers for the potential retransmission, the application needs to track which buffers are pending (and is notified when they can be released).
Not necessarily. The kernel could map the page into kernel space and mark it copy-on-write so the application can't modify the kernel page. Then the application doesn't have to care when the kernel is finished with it.
> Also, I was reading that zero-copy receive is also possible in some scenarios by changing virtual memory mappings.
In practice, it's faster to copy the page up front than mess about with the page tables and TLB shootdowns, doubly so if you end up copying the page to break the COW anyway!
> In practice, it's faster to copy the page up front than mess about with the page tables and TLB shootdowns, doubly so if you end up copying the page to break the COW anyway!
It could be worth it when the data fills an entire page or more. And it's common that after a write, the thread will either sleep waiting on events and not get one during the milliseconds it takes for the kernel to be finished with the data, or the buffer is immediately used for read()/recv() which allows the kernel to remap the page without copying it.
But yes, that seems to be the problem in general -- we're trying to optimize something which isn't actually that slow. A memcpy() on a <500 byte packet is only tens of cycles. Even a full page is hundreds of cycles, which is on the same order as the cost of the syscall to have the OS notify the application it has finished with a buffer. None of this complexity can justify its overhead unless you're sending thousands of contiguous bytes, and at that point you're in sendfile() territory anyway.
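For reference, the sendfile() territory mentioned above: a sketch of pushing a whole file to a socket with no user-space buffer at all (Linux signature; sock_fd and file_fd are assumed to be already open, and real code would handle EINTR/EAGAIN):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int send_whole_file(int sock_fd, int file_fd)
    {
        struct stat st;
        off_t off = 0;

        if (fstat(file_fd, &st) < 0)
            return -1;
        while (off < st.st_size) {
            /* The kernel moves file pages to the socket; nothing is copied to user space. */
            ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
            if (n <= 0)
                return -1;
        }
        return 0;
    }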
> the memory requirements are unpredictable (due to the copy-on-write pages)
On modern Linux the OOM killer has become rather good at finding the real memory offenders. Plus this unpredictability allows you, for example, to transparently enable memory compression (zram or zswap), eliminating the need for swap in many configurations.
Can Linux do that? I remember when OS X introduced it, performance improved notably because the system did much less paging. Would be nice to have that on Linux.
(Rephrase my question: If Linux can do that, do I have to take any special steps to enable it?)
EDIT: Nevermind, I should have thought of googling for it first!
I'm curious, given all these limitations in POSIX, what are some examples of software that do better on Windows because of these more advanced OS features?