Worth noting that there's an equivalent of epoll on most platforms.
On Windows there's IOCP, on Mac and BSD derivatives there's kqueue, and Linux has epoll, but to a first approximation they all do basically the same thing: instead of giving you a full-sized array of "active or not" results that you have to iterate across to detect activity on each of your sockets (as you get from the standard Berkeley sockets 'select' and 'poll' APIs), they only inform you about the sockets that are actually active, so you can spend less CPU time iterating through that big results array.
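To make the "iterate across the whole array" point concrete, here's roughly what the select() side looks like; a sketch only, where `fds`, `on_readable`, and error handling are placeholders:

```c
/* Rough sketch of the classic select() pattern: the interest set is
 * rebuilt and every tracked fd is scanned on each iteration. */
#include <sys/select.h>

void event_loop(int *fds, int nfds_tracked, void (*on_readable)(int))
{
    for (;;) {
        fd_set readfds;
        int maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < nfds_tracked; i++) {   /* re-register everything, every time */
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            continue;                              /* error handling elided */

        for (int i = 0; i < nfds_tracked; i++)     /* scan every fd, active or not */
            if (FD_ISSET(fds[i], &readfds))
                on_readable(fds[i]);
    }
}
```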
I can say from personal experience that when you've got a single server process monitoring 25,000 sockets on a Pentium Pro 200MHz box, it makes a huge difference!
I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
epoll and kqueue really are just edge-triggered select/poll.
However, IOCP and the new io_uring are different beasts: they are completion-based APIs rather than readiness-based.
To quickly explain the difference:
readiness based: tell me all sockets that are ready to be read from
completion based: do this, tell me when you are done
The "tell me when you are done" part is usually handled in the form a message on a queue (or ring buffer, hence the name io_uring, with the u being for userspace). Which also generally means really high scalability of submitting tons of tasks and also processing tons of completions.
Completion-based APIs are superior IMO, and it was always sad to me that Windows had one and Linux didn't, so it's awesome that Jens Axboe got his hands dirty and implemented it. It beats the pants off of libaio, eventfd, epoll, and piles of hacks.
A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.
Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.
AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.
(aside, `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation)
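For the aside about `poll`, a rough example with just two descriptors; `handle_readable` is a placeholder handler:

```c
/* poll() really is simple for a handful of descriptors. */
#include <poll.h>

extern void handle_readable(int fd);   /* placeholder */

void wait_two(int fd0, int fd1)
{
    struct pollfd pfds[2] = {
        { .fd = fd0, .events = POLLIN },
        { .fd = fd1, .events = POLLIN },
    };

    if (poll(pfds, 2, -1) > 0) {        /* block until something is readable */
        if (pfds[0].revents & POLLIN)
            handle_readable(fd0);
        if (pfds[1].revents & POLLIN)
            handle_readable(fd1);
    }
}
```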
Edge-triggered requires an extra read/write on every epoll pass relative to level-triggered, though, because you must keep reading or writing until you hit the error state (EAGAIN), so it can actually be much slower (libuv considered switching at one point, but it wasn't clear the extra syscalls required by edge triggering were worthwhile).
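Concretely, a non-buggy edge-triggered consumer has to drain the fd until EAGAIN, roughly like this; a sketch assuming a nonblocking fd, with error handling elided:

```c
/* With EPOLLET you must drain the socket until read() returns EAGAIN,
 * otherwise you can miss the edge and stall. fd is nonblocking. */
#include <errno.h>
#include <unistd.h>

void drain(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            continue;                           /* process buf[0..n), then keep reading */
        if (n == 0)
            break;                              /* peer closed */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                              /* drained: this is the "extra" read */
        break;                                  /* real error; handling elided */
    }
}
```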
Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
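Roughly, that EPOLL_CTL_MOD dance looks like this; a sketch assuming `epfd` and `fd` are already registered, with error handling elided:

```c
/* Enable EPOLLOUT only while there is pending data to write,
 * and disable it again once the userspace buffer drains. */
#include <sys/epoll.h>

void set_want_write(int epfd, int fd, int want_write)
{
    struct epoll_event ev = {
        .events = EPOLLIN | (want_write ? EPOLLOUT : 0),
        .data = { .fd = fd },
    };
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);   /* one syscall per toggle */
}
```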
Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?
True, I don't remember the details for writes, and the complexity of managing high/low water marks makes it even trickier for optimal code. And large kernel send buffers mostly avoid the performance problem here anyway. But on a short write, I'm not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.
Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events
Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.
> to a first approximation they all do basically the same thing; instead of giving you an array you have to iterate across they inform you about the sockets that are actually active
Well, that's half the story. The other half is that select/poll is stateless, meaning the application–kernel bridge is flooded with data about which events you are interested in, despite the fact that this set usually doesn't change much between calls.
kqueue and the like are stateful instead: you tell the kernel which events you are interested in and then it remembers that.
Stateless is simpler to implement and easier to scale across multiple nodes, but comes with additional overhead. For polling the kernel about sockets, that overhead turned out to outweigh the benefits of simpler implementation and horizontal scaling. (Implementation complexity is still a problem – see the discussions about the quality of epoll – and horizontal scaling is just not something desktop kernels do.)
It's a subscription-based kernel API. Instead of passing roughly the same set of descriptors in each successive iteration of the event loop, you tell the kernel once that you are interested in a given fd. It can then append to the "ready list" as each of the descriptors becomes available, even when you're not waiting on epoll_wait.
With the older select() and poll() interfaces, you have to pass the whole set every time you wait, and the kernel has to construct individual wait queues for each fd. Epoll is just a more convenient API that also allows an efficient implementation on the kernel side.
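A rough sketch of that subscription model: subscribe once with epoll_ctl, then just wait; `handle_event` is a placeholder and error handling is elided:

```c
/* Register interest once; the kernel remembers the interest set
 * between calls and only hands back the fds that are actually ready. */
#include <sys/epoll.h>

extern void handle_event(int fd);   /* placeholder */

void epoll_loop(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    struct epoll_event ready[64];

    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);          /* subscribe once */

    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);      /* only active fds come back */
        for (int i = 0; i < n; i++)
            handle_event(ready[i].data.fd);
    }
}
```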
I tried to use IOCP with Python and the pywin32 module a few years ago. I was never able to make it work, even with C++ code as a guiding source. Documentation and resources for IOCP were also very scarce when I looked, and it seemed like an obscure topic to dive into, so I finally gave up. On the other hand, kqueue and epoll are almost trivial to use. And if everything else fails there is always select, which is easy to use as well.
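For comparison, a rough kqueue loop really is about this small; `handle_event` is a placeholder and error handling is elided:

```c
/* kqueue on BSD/macOS: register interest once with EV_ADD,
 * then wait for only the descriptors that have activity. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

extern void handle_event(int fd);   /* placeholder */

void kqueue_loop(int fd)
{
    int kq = kqueue();
    struct kevent change, events[64];

    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);             /* subscribe once */

    for (;;) {
        int n = kevent(kq, NULL, 0, events, 64, NULL); /* wait for activity */
        for (int i = 0; i < n; i++)
            handle_event((int)events[i].ident);
    }
}
```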
Making IOCP work underneath stateful protocols like TLS is more complicated still, and then add on multithreaded handlers and things get messy real quick. It's a similar story with io_uring. It can be done (I've done it) but it's not easy to reason about (at least for me; maybe I just don't have the mental horsepower or proper brain wiring to grasp it easily).
> Aside: All of the above work on many operating systems and support API’s other than epoll, which is Linux specific. The Internet is mostly made of Linux, so epoll is the API that matters.
That quote is talking about epoll-using software (Go, nginx and 'most programming languages', including Rust), not polling-based APIs. You should instead quote:
> without BSD’s kqueue (which preceded epoll by two years), we’d really be in trouble because the only alternatives were proprietary (/dev/poll in Solaris 8 and I/O Completion Ports in Windows NT 3.5).
> I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
Are you aware that Linux is used e.g. by Google, Microsoft or Amazon? Can there be occasions when they handle more than 5 simultaneous connections per box? Maybe epoll can make a difference for them, no?