
I think you need to look at a common use case and consider how many syscalls you'd like it to take and how many CPU cycles would be reasonable.

Let's take downloading a 1MB jpeg image over QUIC and rendering it on the screen.

I would hope that can be done in about 100k CPU cycles and 20 syscalls, considering that all the jpeg decoding and rendering is going to be hardware accelerated. The decryption is also hardware accelerated.

Unfortunately, no network API allows that right now. The CPU needs to do a substantial amount of processing for every individual packet, in both userspace and kernel space, for receiving the packet and sending the ACK, and there is no 'bulk decrypt' non-blocking API.

Even the data path is troublesome - there should be a way for the data to go straight from the network card to the GPU, with the CPU not even touching it, but we're far from that.




There are a few issues here.

1. A 1 MB file is at the very least 64 individually encrypted TLS records (16k max size each) sent in sequence, possibly more. So 64 separate decryptions is the coarsest batching you can do. Records are kept small to allow streaming verification and decryption in parallel with the download, whereas one big block would have you wait for the very last byte before any processing could start.

2. TLS is still userspace and decryption does not involve the kernel, and thus no syscalls. The benefits of kernel TLS largely focus on servers sending files straight from disk, bypassing userspace for the entire data processing path. This is not really relevant receive-side for something you are actively decoding.

3. JPEG is, to my knowledge, rarely hardware offloaded on desktop, so no syscalls there.
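The record count in point 1 is just a ceiling division. A minimal sketch, assuming "1 MB" means 1 MiB and every record carries the full 16 KiB of plaintext (the function and constant names are made up for illustration):

```python
# Assumption: 16 KiB (2**14 bytes) is the maximum plaintext per TLS record.
TLS_MAX_RECORD_PLAINTEXT = 16 * 1024


def min_records(payload_bytes: int, record_size: int = TLS_MAX_RECORD_PLAINTEXT) -> int:
    """Smallest number of records needed to carry payload_bytes (ceiling division)."""
    return (payload_bytes + record_size - 1) // record_size


if __name__ == "__main__":
    print(min_records(1024 * 1024))  # 1 MiB payload -> 64
```

Real senders often use smaller records, so the actual count can be higher.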

Now, the number of actual syscalls ends up being dictated by the speed of the sender and the tunable receive buffer size. The slower the sender, the more kernel roundtrips you end up with, which allows you to amortize the processing over a longer period so everything is ready when the last packet is. For a fast enough sender with big enough receive buffers, this could be a single kernel roundtrip.
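To make the sender-speed point concrete, here is a rough sketch that counts recv() calls for one transfer. It uses a local stream socket pair as a stand-in, not QUIC or a real network, and the helper name and parameters are invented for the example. The counts depend on the OS, its socket buffer sizes and scheduling, so treat them as illustrative only.

```python
import socket
import threading
import time


def count_recv_calls(payload_size, send_chunk, send_delay, recv_size):
    """Transfer payload_size bytes over a local socket pair.

    Returns (bytes_received, recv_calls). recv_calls includes the final
    call that returns b"" to signal end of stream.
    """
    tx, rx = socket.socketpair()

    def sender():
        data = b"\0" * send_chunk
        sent = 0
        while sent < payload_size:
            n = min(send_chunk, payload_size - sent)
            tx.sendall(data[:n])
            sent += n
            if send_delay:
                time.sleep(send_delay)  # simulate a slow sender
        tx.close()  # receiver sees EOF

    t = threading.Thread(target=sender)
    t.start()

    received = 0
    calls = 0
    while True:
        chunk = rx.recv(recv_size)  # one recv() call per loop iteration
        calls += 1
        if not chunk:
            break
        received += len(chunk)

    t.join()
    rx.close()
    return received, calls


if __name__ == "__main__":
    size = 1024 * 1024
    # Slow sender, small writes: the receiver tends to wake up once per write.
    print(count_recv_calls(size, 4096, 0.0005, 1 << 20))
    # Fast sender: each recv() can drain whatever the socket buffer holds.
    print(count_recv_calls(size, 65536, 0, 1 << 20))
```

On a typical machine the slow sender produces far more recv() calls than the fast one, even though the receiver asks for up to 1 MiB each time.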


JPEG is not a particularly great example. However, most video streams are partially hardware decoded. Usually you still need to decode part of the stream, namely entropy coding and metadata, on the CPU first.



