> In practice, it's faster to copy the page up front than mess about with the page tables and TLB shootdowns, doubly so if you end up copying the page to break the COW anyway!
It could be worth it when the data fills an entire page or more. It's also common that, after a write, one of two things happens: either the thread sleeps waiting on events and doesn't get one during the milliseconds it takes the kernel to finish with the data, or the buffer is immediately reused for read()/recv(), which lets the kernel remap the page without copying it.
But yes, that seems to be the problem in general -- we're trying to optimize something which isn't actually that slow. A memcpy() on a <500 byte packet is only tens of cycles. Even a full page is hundreds of cycles, which is on the same order as the cost of the syscall to have the OS notify the application it has finished with a buffer. None of this complexity can justify its overhead unless you're sending thousands of contiguous bytes, and at that point you're in sendfile() territory anyway.
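If you want to check those ballpark figures on your own machine, here's a rough sketch rather than a rigorous benchmark. It assumes Linux with GCC or Clang; the helper names (now_ns, avg_memcpy_ns, avg_syscall_ns) are made up for illustration, and getpid is only a stand-in for "the cheapest possible syscall", not the actual completion notification a zero-copy API would use.

```c
#define _GNU_SOURCE            /* for syscall(); assumes Linux + glibc */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average nanoseconds per memcpy() of `len` bytes (capped at one 4 KiB
 * page) over `iters` runs. The buffers stay hot in cache, so this is a
 * best case for the copy. */
double avg_memcpy_ns(size_t len, unsigned iters) {
    static unsigned char src[4096], dst[4096];
    if (iters == 0) return 0.0;
    if (len > sizeof src) len = sizeof src;
    memset(src, 0xAB, sizeof src);

    uint64_t start = now_ns();
    for (unsigned i = 0; i < iters; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier (GCC/Clang syntax) so the copy is not
         * optimized away or hoisted out of the loop. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    return (double)(now_ns() - start) / iters;
}

/* Average nanoseconds per trivial syscall over `iters` runs. */
double avg_syscall_ns(unsigned iters) {
    if (iters == 0) return 0.0;
    uint64_t start = now_ns();
    for (unsigned i = 0; i < iters; i++)
        syscall(SYS_getpid);   /* raw syscall, does almost no kernel work */
    return (double)(now_ns() - start) / iters;
}
```

Call avg_memcpy_ns(500, 1000000), avg_memcpy_ns(4096, 1000000) and avg_syscall_ns(1000000) from a main() and print the results to compare. The numbers are in nanoseconds, not cycles, and the hot-cache loop flatters memcpy(), so treat them as an order-of-magnitude check only.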