It's been well known for much longer. I exchanged email with John when diagnosing a strange 200ms connection stalling issue between Solaris and NT, around 1997. At the time the problem was familiar to him.
Back in 1996 I developed a TCP/IP stack from the ground up, based on the DARPA and RFC specifications of that era. The impression I got from the uneven quality and consistency of those documents was that TCP/IP was a student project. Delayed ACKs were described only briefly, as a concept in an addendum. Implementing them was a nightmare.
What happened was that with a bulk unidirectional flow (FTP), the window would slowly fill up until it was completely full, after which transport would grind to a standstill. Only when retransmit timers fired would transport resume at full speed, slowly filling the window again to repeat the cycle. As far as I can remember, this cycle repeated every 5 seconds or so.
Out of sheer frustration I logged all traffic with (then) high-resolution timestamps and hand-drew both sides of the connection on 132-column zigzag printer paper, which filled the corridor.
As it turned out, the protocol state timings assume that the transit time is 0 ms. For one-to-one REQ/ACK this isn't a problem, but for delayed ACKs it was a game-breaker.
When a packet arrives, several tests are performed to determine whether it is in sequence; if not, it is rejected. With delayed ACKs, the bookkeeping did not represent the actual state of the connection, triggering lots of false negatives.
The solution was to keep dual state information: one set holding the actual values for this end-point, the other the projected/assumed state of the other end, taking into account that it is suppressing ACKs. Incoming packet headers are then tested against the projected other-end state, while outgoing packets are constructed from the this-end state.
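Roughly sketched in C (names made up here for illustration, not the original code), the bookkeeping amounted to something like this:

    /* Dual bookkeeping: the real state of this end, plus a projected
     * state of the peer that accounts for it suppressing ACKs.
     * Illustrative sketch only; field names are invented. */
    #include <stdint.h>

    struct tcp_endpoint_state {
        uint32_t snd_nxt;  /* next sequence number to send */
        uint32_t rcv_nxt;  /* next sequence number expected */
        uint32_t window;   /* advertised window */
    };

    struct tcp_connection {
        struct tcp_endpoint_state self;      /* actual values for this end */
        struct tcp_endpoint_state peer_proj; /* projected/assumed state of the
                                                other end under delayed ACKs */
    };

    /* Incoming headers are validated against peer_proj;
     * outgoing segments are built from self. */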
Performance went through the roof and filled the cable to 98% of the theoretical bandwidth, much more than the stacks implemented by competitors.
Sadly my employers did not allow me to publish the findings.
I believe there is still potential here. Dominant TCP implementations bet on the single horse of "TCP congestion control", which is a different breed for a different situation.
> The current Nagle algorithm is very important in protecting the health of the internet; the proposed modification [hopefully] provides the same level of protection.
A high-level library that buffers before sending doesn't need the delay, as the send-send-reply pattern would go into the buffer and then be sent as full packets.
Pretty sure this was the issue giving me trouble on a ROS application for remote sensing. I have a hardware trigger signal, a bank of cameras, and a GPS/INS unit which emits an exact time/space stamp, all off the trigger. It is a very ROS-esque app, with camera and GPS "drivers" which parse the data into ROS messages which are published. Occasionally, the tiny GPS packet gets delayed far longer than the image buffering, which throws everything out of whack.
Then I discovered the TCP_NODELAY flag, which made this occurrence go from about 1 in 100 to fewer than 1 in 10,000. Had to do some digging to understand "why would anyone want to delay packets, ever?", the answer of course being naive code writing one byte at a time to sockets.
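For anyone curious, disabling Nagle at the socket level is a one-liner with setsockopt. Plain POSIX C sketch below (ROS has its own way of requesting this through transport hints, so take this as the underlying idea only):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle on an already-created TCP socket.
     * Returns 0 on success, -1 on error (errno is set). */
    static int disable_nagle(int sockfd)
    {
        int one = 1;
        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY,
                          &one, sizeof(one));
    }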
Modern web as in...? Tinygram prevention is a TCP-level mechanism, with no regard to whether it's carrying DB queries or web pages. With the prevalence of "smart" appliances and the ever-growing IoT, I'd argue it's more important than ever. I don't trust most vendors to get it anywhere close to optimal otherwise.
Packet sizes are increasing [1], which makes it more likely you'll go over the threshold quickly. It used to be far more common to write a hand-rolled protocol directly to a TCP stream. Now the shitty IoT stuff is more likely to be throwing massive JSON payloads down the wire.
The original motivation (as I understand it, and recognizing I may well get corrected by John, who is in this thread) was situations like a remote terminal, where you'd be writing single bytes to a TCP stream. I just don't think much software is written that way any more, and if it is, it's probably latency-sensitive.
And 200 ms is an eternity to wait if you happen to build a tiny packet in a real-time situation and this trips you up.
I write multiplayer games for a living, and you're not always in control of the connection you're passing data through. Usually, but not always. If Nagle is enabled, you literally have to write padding data to the stream to make sure your packet gets sent, like this poor soul [2].
Like I say, I'm happy to learn I'm wrong. Given the downvotes, people certainly seem to think I am, and I'm happy to believe they're experienced network engineers dumbfounded at my naïve stupidity.
> With the prevalence of "smart" appliances and ever coming IOT
Someone was recently complaining that their Arduino project was gobbling up mobile data. They had used an Ethernet shield coupled to some 4G network something.
After a bit of back and forth the culprit was found: it was sending one byte packets.
The standard Arduino is extremely memory-limited, so a typical trick is to use the Flash memory to store string constants and then use those when writing to devices etc. The library code has some nice wrappers which make things like sending a Flash string constant to some output stream (a serial port, or say the Ethernet shield) fairly seamless.
However, for reasons [1], the Flash variant of the "send constant string" call works by reading and sending one character at a time. And the (memory-constrained) Ethernet library or shield did no extra buffering, so it sent at minimum one packet per call...
[1]: IIRC the Arduino Flash is 16-bit and the characters in the string are stored zero-expanded, so the code can't just memcpy; it has to strip away the top byte.
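In plain POSIX C terms (not the actual Arduino library, just to illustrate the effect), the difference is essentially this:

    #include <string.h>
    #include <sys/socket.h>

    /* With no buffering underneath, each one-byte send() can end up
     * as its own packet on the wire. */
    static void send_per_byte(int fd, const char *s)
    {
        for (size_t i = 0; i < strlen(s); i++)
            send(fd, &s[i], 1, 0);   /* worst case: one packet per byte */
    }

    /* One call with the whole string lets the stack fill packets. */
    static void send_buffered(int fd, const char *s)
    {
        send(fd, s, strlen(s), 0);
    }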
Nagle is fine for the web, as long as you flush when you reach the end of your request/response. And at strategic places like the end of the headers or the end of the HTML head, if you know the rest of the content isn't immediately forthcoming.
A lot of HTTP software writes things to sockets in small chunks though, especially headers and whatnot. I've seen some things with chunked-encoded POSTs; I don't think it was visible at the TCP level, but I would get three TLS application data records: one for $SIZE\r\n, one for the chunk's worth of data, and one for the trailing \r\n. Ugh.
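One way to avoid that three-record pattern (sketch only, names here are mine) is to gather the size line, the chunk body, and the trailing CRLF into a single writev() so they leave the application as one write:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Write one HTTP chunk ("<size in hex>\r\n<data>\r\n") as a single
     * gathered write instead of three small ones. A real implementation
     * would also handle short writes and errors. */
    static ssize_t write_chunk(int fd, const char *data, size_t len)
    {
        char size_line[32];
        int n = snprintf(size_line, sizeof(size_line), "%zx\r\n", len);

        struct iovec iov[3] = {
            { .iov_base = size_line,    .iov_len = (size_t)n },
            { .iov_base = (void *)data, .iov_len = len },
            { .iov_base = "\r\n",       .iov_len = 2 },
        };
        return writev(fd, iov, 3);   /* one syscall, one coherent write */
    }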
But it's there, as I understand it, mainly to combine multiple logical packets over a short, human-scale time window.
By its nature it also keeps poorly written logical packets from being fragmented.
But you could also do this as a separate level of protection: say, a 5 ms delay on any write without a flush, where any further write resets the delay, up to a maximum of 20 ms.
This would seem to rein in the worst of the behavior (developers not realizing they're sending multiple packets) without adding a large delay for developers who use the system properly but don't flush.
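As a sketch of that rule (the numbers and names here are just the made-up ones from above):

    #include <stdbool.h>
    #include <stdint.h>

    #define COALESCE_MS 5    /* debounce after the most recent write */
    #define MAX_HOLD_MS 20   /* never hold the first byte longer than this */

    /* Decide whether buffered small writes should be flushed now.
     * Timestamps are milliseconds from any monotonic clock. */
    static bool should_flush(uint64_t now_ms,
                             uint64_t first_write_ms,
                             uint64_t last_write_ms)
    {
        if (now_ms - last_write_ms >= COALESCE_MS)
            return true;     /* writer went quiet: send */
        if (now_ms - first_write_ms >= MAX_HOLD_MS)
            return true;     /* cap reached: send anyway */
        return false;        /* keep coalescing */
    }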
And since the penalty is much smaller, it would be less likely to be turned off aggressively; see the Go example in sibling comments.
If the goal is to hand training wheels to developers of shitty apps, it's pretty important they actually use them.
We had TCP performance issues on EU-Singapore leased lines at the time of the article (2005) and ended up using sockets with the TCP_NODELAY flag to disable Nagle. And tuning window sizes, of course.