Worth noting that there's an equivalent of epoll on most platforms.
On Windows there's IOCP, on Mac and BSD derivatives there's kqueue, and Linux has epoll, but to a first approximation they all do basically the same thing: instead of giving you a full-sized array of "active or not" results that you have to iterate across to detect activity on each of your sockets (as you get from the standard Berkeley sockets 'select' and 'poll' APIs), they only inform you about the sockets that are actually active, so you can spend less CPU time iterating through that big results array.
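To make the "iterate across the whole array" point concrete, here's roughly what the select() side looks like; a sketch only, where `fds`, `on_readable`, and error handling are placeholders:

```c
/* Rough sketch of the classic select() pattern: the interest set is
 * rebuilt and every tracked fd is scanned on each iteration. */
#include <sys/select.h>

void event_loop(int *fds, int nfds_tracked, void (*on_readable)(int))
{
    for (;;) {
        fd_set readfds;
        int maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < nfds_tracked; i++) {   /* re-register everything, every time */
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            continue;                              /* error handling elided */

        for (int i = 0; i < nfds_tracked; i++)     /* scan every fd, active or not */
            if (FD_ISSET(fds[i], &readfds))
                on_readable(fds[i]);
    }
}
```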
I can say from personal experience that when you've got a single server process monitoring 25,000 sockets on a Pentium Pro 200MHz box, it makes a huge difference!
I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
epoll and kqueue really are just edge-triggered select/poll.
However, IOCP and the new io_uring are different beasts: they are completion-based APIs rather than readiness-based.
To quickly explain the difference:
readiness based: tell me all sockets that are ready to be read from
completion based: do this, tell me when you are done
The "tell me when you are done" part is usually handled in the form a message on a queue (or ring buffer, hence the name io_uring, with the u being for userspace). Which also generally means really high scalability of submitting tons of tasks and also processing tons of completions.
Completion-based APIs are superior IMO, and it was always sad to me that Windows had one and Linux didn't, so it's awesome that Jens Axboe got his hands dirty and implemented it. It beats the pants off of libaio, eventfd, epoll, and piles of hacks.
A point that people seem to miss: epoll supports both level-triggered and edge-triggered. Most similar APIs only support level-triggered.
Edge-triggered is theoretically less work for the kernel than level-triggered, but requires that your application not be buggy. People tend to either assume "nobody uses edge-triggered" or "everybody uses edge-triggered".
Completion-based is far from trivial; since the memory traffic can happen at any time, the kernel has to consider "what if somebody changes the memory map between the start syscall and the end syscall". It complicates the application too, since now you have to keep ownership of a buffer but you aren't allowed to touch it.
AIX and Solaris apparently also support completion-based APIs, but I've never seen anyone actually run these OSes.
(aside, `poll` is the easiest API to use for just a few file descriptors, and `select` is more flexible than it appears if you ignore the value-based API assumptions and do your own allocation)
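For the aside about `poll`, a rough example with just two descriptors; `handle_readable` is a placeholder handler:

```c
/* poll() really is simple for a handful of descriptors. */
#include <poll.h>

extern void handle_readable(int fd);   /* placeholder */

void wait_two(int fd0, int fd1)
{
    struct pollfd pfds[2] = {
        { .fd = fd0, .events = POLLIN },
        { .fd = fd1, .events = POLLIN },
    };

    if (poll(pfds, 2, -1) > 0) {        /* block until something is readable */
        if (pfds[0].revents & POLLIN)
            handle_readable(fd0);
        if (pfds[1].revents & POLLIN)
            handle_readable(fd1);
    }
}
```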
Edge-triggered requires an extra read/write on every epoll pass relative to level-triggered, though, because you must keep reading or writing until you hit the error state (EAGAIN), so it can actually be much slower (libuv considered switching at one point, but it wasn't clear the extra syscalls required by edge triggering were worthwhile).
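Concretely, a non-buggy edge-triggered consumer has to drain the fd until EAGAIN, roughly like this; a sketch assuming a nonblocking fd, with error handling elided:

```c
/* With EPOLLET you must drain the socket until read() returns EAGAIN,
 * otherwise you can miss the edge and stall. fd is nonblocking. */
#include <errno.h>
#include <unistd.h>

void drain(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            continue;                           /* process buf[0..n), then keep reading */
        if (n == 0)
            break;                              /* peer closed */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                              /* drained: this is the "extra" read */
        break;                                  /* real error; handling elided */
    }
}
```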
Only on reads. For writes you always want to loop until the kernel buffer really is full (remember the kernel can do I/O while you're working). Writes, incidentally, are a case where epoll is awkward since you have to EPOLL_CTL_MOD it every single time the buffer empties/fills (though you should only do this after a full tick of the event loop of course ... but the bursty nature means that you often do have to, thus you get many more syscalls than `select`).
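Roughly, that EPOLL_CTL_MOD dance looks like this; a sketch assuming `epfd` and `fd` are already registered, with error handling elided:

```c
/* Enable EPOLLOUT only while there is pending data to write,
 * and disable it again once the userspace buffer drains. */
#include <sys/epoll.h>

void set_want_write(int epfd, int fd, int want_write)
{
    struct epoll_event ev = {
        .events = EPOLLIN | (want_write ? EPOLLOUT : 0),
        .data = { .fd = fd },
    };
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);   /* one syscall per toggle */
}
```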
Even for reads, there exist plenty of scenarios where you will get short reads despite more data being available by the time you check. Though ... I wonder if deferring that and doing a second pass over all your FDs might be more efficient, since that gives more time for real data to arrive again?
True, I don't remember the details for writes, and the complexity of managing high/low water marks makes it even trickier for optimal code. And large kernel send buffers mostly avoid the performance problem here anyway. But on a short write, I'm not sure I see the value in testing for EAGAIN over looping through epoll and getting a fresh set of events for everything instead of just this one fd.
Right, for reads, epoll will happily tell you if there is more data still there. If the read buffer size is reasonable, short reads should not be common. And if the buffer is huge, a trip around the event loop is probably better at that point to avoid starvation of the other events
Perhaps it's just that I cut my teeth on the classic Mac OS and absorbed its way of thinking, but after using its asynchronous, callback-driven IO API, the multithreaded polling/blocking approach dominant in the Unix world felt like a clunky step backward. I've been glad to see a steady shift toward asynchronous state machines as the preferred approach for IO.
> to a first approximation they all do basically the same thing; instead of giving you an array you have to iterate across they inform you about the sockets that are actually active
Well, that's half the story. The other half is that select/poll is stateless, meaning the application–kernel bridge is flooded with data about which events you are interested in, despite the fact that this set usually doesn't change much between calls.
kqueue and the like are stateful instead: you tell the kernel which events you are interested in and then it remembers that.
Stateless is simpler to implement and easier to scale across multiple nodes, but comes with additional overhead. For polling the kernel about sockets, that overhead turned out to outweigh the benefits of simpler implementation and horizontal scaling. (Implementation complexity is still a problem – see the discussions about the quality of epoll – and horizontal scaling is just not something desktop kernels do.)
It's a subscription-based kernel API. Instead of passing roughly the same set of descriptors in each successive iteration of the event loop, you tell the kernel once that you are interested in a given fd. It can then append to the "ready list" as each of the descriptors becomes available, even when you're not waiting on epoll_wait.
With the older select() and poll() interfaces, you have to pass the whole set every time you wait, and the kernel has to construct individual wait queues for each fd. Epoll is just a more convenient API that also allows an efficient implementation on the kernel side.
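A rough sketch of that subscription model: subscribe once with epoll_ctl, then just wait; `handle_event` is a placeholder and error handling is elided:

```c
/* Register interest once; the kernel remembers the interest set
 * between calls and only hands back the fds that are actually ready. */
#include <sys/epoll.h>

extern void handle_event(int fd);   /* placeholder */

void epoll_loop(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    struct epoll_event ready[64];

    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);          /* subscribe once */

    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);      /* only active fds come back */
        for (int i = 0; i < n; i++)
            handle_event(ready[i].data.fd);
    }
}
```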
I tried to use IOCP with Python and the pywin32 module a few years ago. I was never able to make it work, even with C++ code as a guiding source. Documentation and resources for IOCP were also very scarce when I looked, and it seemed like an obscure topic to dive into, so I finally gave up. On the other hand, kqueue and epoll are almost trivial to use. And if everything else fails there is always select, which is easy to use as well.
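For comparison, a rough kqueue loop really is about this small; `handle_event` is a placeholder and error handling is elided:

```c
/* kqueue on BSD/macOS: register interest once with EV_ADD,
 * then wait for only the descriptors that have activity. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

extern void handle_event(int fd);   /* placeholder */

void kqueue_loop(int fd)
{
    int kq = kqueue();
    struct kevent change, events[64];

    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &change, 1, NULL, 0, NULL);             /* subscribe once */

    for (;;) {
        int n = kevent(kq, NULL, 0, events, 64, NULL); /* wait for activity */
        for (int i = 0; i < n; i++)
            handle_event((int)events[i].ident);
    }
}
```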
Making IOCP work underneath stateful protocols like TLS is more complicated still, and then add on multithreaded handlers and things get messy real quick. It's a similar story with io_uring. It can be done (I've done it) but it's not easy to reason about (at least for me; maybe I just don't have the mental horsepower or proper brain wiring to grasp it easily).
> Aside: All of the above work on many operating systems and support API’s other than epoll, which is Linux specific. The Internet is mostly made of Linux, so epoll is the API that matters.
That quote is talking about epoll-using software (Go, nginx and 'most programming languages', including Rust), not polling-based APIs. You should instead quote:
> without BSD’s kqueue (which preceded epoll by two years), we’d really be in trouble because the only alternatives were proprietary (/dev/poll in Solaris 8 and I/O Completion Ports in Windows NT 3.5).
> I'm a little surprised (and maybe skeptical) that it'd make a noticeable difference for the much smaller number of sockets that your average web server would be using, but.. maybe?
Are you aware that Linux is used e.g. by Google, Microsoft or Amazon? Can there be occasions when they handle more than 5 simultaneous connections per box? Maybe epoll can make a difference for them, no?