Why upgrading your Linux Kernel will make your customers much happier (samsaffron.com)
260 points by sams99 on March 1, 2012 | 55 comments



TL;DR version - another new guy discovers TCP slow start, and not doing it, wow major speedup! Now, imagine the whole Internet doing that, whoops.

I once peevishly pointed out that if you weren't required to stop at stop signs your commute would go faster, but if nobody was required to stop at stop signs it would be slower because every other intersection would have an accident blocking the way.

This is also very much true of TCP congestion control algorithms. A few people not using them can get away with it, but if everyone stops using them you will find your network latency goes from a median with a low standard deviation to a slightly lower median with a HUGE standard deviation.

One of the things that slow start does is spread the change in median latency over a longer period of time. You can think of it intuitively: each new connection starts slow and then gradually gets faster, until it is as fast as it can be, and as more people start connections they start slow and get faster, while the current connections get slightly slower to accommodate the new traffic. The result is a non-chaotic adjustment of the network flow.
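
To make that ramp-up concrete, here is a toy model (my own simplification, not real kernel code: a fixed ssthresh, doubling once per round trip, then roughly linear growth):

    # Toy model of slow start (a simplification, not a real TCP stack):
    # the congestion window doubles every round trip until it reaches
    # ssthresh, then grows by roughly one segment per round trip.
    def window_per_rtt(initial_window=2, ssthresh=32, rtts=10):
        window, history = initial_window, []
        for _ in range(rtts):
            history.append(window)
            if window < ssthresh:
                window = min(window * 2, ssthresh)  # exponential phase
            else:
                window += 1                         # congestion avoidance
        return history

    print(window_per_rtt(initial_window=2))   # [2, 4, 8, 16, 32, 33, 34, 35, 36, 37]
    print(window_per_rtt(initial_window=10))  # [10, 20, 32, 33, 34, 35, 36, 37, 38, 39]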

The converse is that everyone starts out going as fast as they can; not only do they overwhelm the node, the node ends up massively congested for a moment while it tries to sort things out. And of course IP doesn't care if you lose a fragment; you'll eventually resend it. So now, during this massive congestion, the retransmits are causing more congestion. You get lots of pushback and finally everyone is back to a level where the network is doing OK with it and wham! a new connection opens up and everyone gets hosed again and backs off again, and then ramps up again.

Moral of the story: if you alone skip slow start you can be fast; if everyone skips it, network latency gets really unpredictable and poor.


The converse is that everyone starts out going as fast as they can

As far as I can see, nobody is recommending that. This isn't about getting rid of slow-start. They're just talking about tuning the initial window to be larger.

If you were writing this algorithm today, you would look at the value for the window that most systems end up on and then pick an initial value just slightly below or equal to that. I doubt that number would be 2.


Q: Do you think the designers of TCP cared about the specific constant they started with in the first place? Or do you think, as computer scientists, they said "some integer n > 0 so that we can achieve the exponential effect"?

While we're modifying the initial constant, why not reconsider the exponent? Or the shape of the equation itself?


Or the shape of the equation itself?

There's quite a bit of research[1] in the area, and it seems that for most applications you can indeed get away with quadratic or basically any superlinear backoff algorithm.

However, most of the work I've seen was based either on synthetic models or tested only on smallish networks under artificial congestion. It's hard to predict how it will cope at internet scale when theory meets broken vendor implementations. (A toy sketch of the exponential vs. quadratic shapes follows the reference.)

[1] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5766...
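
Here's the toy sketch (my own illustration of the two growth shapes; nothing is taken from the cited paper's models):

    # Toy comparison of backoff schedules after k consecutive failures.
    # Both grow faster than linearly; exponential just backs off much
    # more aggressively than quadratic.
    def exponential_backoff(k, base=1.0):
        return base * 2 ** k

    def quadratic_backoff(k, base=1.0):
        return base * (k + 1) ** 2

    for k in range(6):
        print(f"attempt {k}: exponential waits {exponential_backoff(k):5.1f}, "
              f"quadratic waits {quadratic_backoff(k):5.1f}")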


INACCURATE TL;DR

I don't usually post in all caps here, but many readers, myself included, will sometimes go to HN comments first to get a TL;DR version of an article. The thoughtful rebuttals by po and sams99 can easily be overlooked by people in a hurry. I think it's too late to edit or delete the parent comment, and the "flag" function is for spam and incivility, not for carelessness.

As po and sams99 pointed out, the article is not about not doing slow start. It's about tweaking slow start parameters. This particular tweak has been adopted by the Linux kernel authors and others who do understand why TCP slow start is important.


Yes, it's about changing the initial slow-start congestion window from 2 to 10. And the author accurately points out that this is a huge win for your basic web page.

But let's think about that from the other side for a moment. Most systems will end up with a segment size of about 1452 bytes because their router has an MTU of 1500. With a CW of 2 that is about 3000 bytes blasted out while waiting for an ack (only 2904 of them data bytes, but the MTU constraint means two 1500-byte packets on the wire); with a CW of 10 that is about 15000 bytes blasted out initially.

Let's say everyone on Comcast's network gets this change; they have an estimated 15-20 million subscribers. We'll be conservative and call it 15 million, and at any given time maybe half of them are doing an HTTP request (think about all the things that do HTTP requests in your house for a moment). So at any given instant in time you've gone from blasting out 3000 * 7.5M or 22.5GBytes (or 225 Gigabits) to 112.5GB or 1.125 terabits of data. Not surprisingly a lot of their traffic goes to some peering network, and now their hammer is hitting with terabit whacks instead of 200-gigabit whacks. That is a noticeable change, and it's an even bigger change when you consider they are running VoIP traffic as well.
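
Reproducing that arithmetic in a few lines (the 10-bits-per-byte factor is an assumption on my part, a rule of thumb that folds in framing overhead and appears to match the figures above):

    # Rough reproduction of the back-of-the-envelope numbers.
    # 10 bits per byte is an assumed rule of thumb that folds in framing
    # overhead; strictly it is 8 bits per byte plus headers.
    SEGMENT_BYTES = 1500        # one packet on a 1500-byte MTU path
    CONCURRENT = 7_500_000      # half of ~15M subscribers

    for iw in (2, 10):
        burst = iw * SEGMENT_BYTES * CONCURRENT
        print(f"IW={iw:2d}: {burst / 1e9:6.1f} GB burst, "
              f"~{burst * 10 / 1e12:.3f} Tb on the wire")
    # IW= 2:   22.5 GB burst, ~0.225 Tb on the wire
    # IW=10:  112.5 GB burst, ~1.125 Tb on the wire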

Some useful work to be done here is to look at where the congestion window settles out on your network. And for what servers. And for which transit networks. How often does it get to 10? 20? The worst case would be it settles at 16. That's because with an IW of 10 the next window is 20, and then you get clamped.
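
A rough way to eyeball where the windows settle on a live Linux box is to sample the output of iproute2's "ss -tin" (a sketch only; the output format varies between ss versions, so treat it as illustrative):

    # Sample the congestion windows of current TCP connections via
    # iproute2's `ss -tin` and count how often each value shows up.
    import re
    import subprocess
    from collections import Counter

    out = subprocess.run(["ss", "-tin"], capture_output=True, text=True).stdout
    cwnds = Counter(int(m) for m in re.findall(r"cwnd:(\d+)", out))

    for cwnd, count in sorted(cwnds.items()):
        print(f"cwnd {cwnd:4d}: {count} connection(s)")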

There has been a lot of great work done on congestion control, and yes, it's really annoying, like metering lights[1], when it's not needed. But when it is needed it makes the system work.

Congestion is a function of cross-section bandwidth and traffic demands. The cross-section bandwidth of the Internet has gone up a lot; the number of 'ports' on the Internet has gone up even more. The ratio has not improved much (and in some cases has gotten worse) since the days of dial-up.

I fully recognize that a number of people have done the same test (or thought experiment) the author has done with slow start and seen a green field for improvement. I was reasonably active in the IETF before the work on congestion was implemented, with a protocol that could be very latency-sensitive (NFS), and it was a hellish environment. My disagreement is that changing slow start in this way will destroy a number of interesting streaming services by pushing the standard deviation of latency beyond what they can tolerate. And for what? So that a 34K web page loads 30% faster? I'd much rather compress the web page or build a web service protocol that knows about slow start and accommodates it than do this.

As others have pointed out, this change is going to happen regardless, and perhaps I'll have the opportunity for an 'I told you so', or perhaps I'll be relegated to the dustbin of ranting network dudes from the last century. But I stand by my TL;DR that the author did not demonstrate an appreciation for the impact that changing slow start would have on the network in general, because they focused only on how it would make their own life faster.

[1] Here in California we have congestion control 'metering' lights on some on-ramps to the freeway; it's annoying as hell when they are on and there isn't anyone on the freeway.


Since changing the congestion window doesn't increase the total number of packets for an HTTP request, for your Comcast example to cause problems there would have to be massive synchronization of the start of the HTTP requests.


This is very far off from what I am proposing. The Linux kernel team decided to implement a recommendation that is still under review by the IETF, and every release of Linux from 3.0 onwards has IW 10 enabled by default. Like it or not, an IW of 10 is probably here to stay. The 3.x line has 3 stable release branches, all with this change. The majority of web servers online are still running 2.x kernels.


It's more complicated than that though. Slow start is tuned for long-lived pipes sending large files, which doesn't match the observed load. And in any case the congestion bottleneck on modern networks is almost always in the last mile to the client anyway: most web services can freely pump as much data into the network as they want[1].

[1] Jim Gettys and his bufferbloat posse might have something to say to you though.


This is something that I think is slowly going to bite us big time. So much memory out there, so much ability to "absorb" and retransmit. That gives the network a sort of 'resonant' frequency with respect to packet retransmission. I could imagine doing the Tesla trick of injecting bursts of packets into the network at the resonant frequency and being able to get all the buffers to explode.


You'd probably find reading up on system dynamics interesting. Systems with multiple stages linked by delaying steps tend to have quite dramatic oscillations. A great example is "The Beer Game"[1]: small changes in purchases of stock lead to massive whipsawing of production schedules.

[1] http://en.wikipedia.org/wiki/Beer_distribution_game


Heh - I can see an art project here… (poetic network terrorism?)


Yes, I think those bloated buffers can easily absorb 10-packet bursts. If bufferbloat gets fixed there may be packet loss in the initial window burst and people should adjust window sizes back down. But I wouldn't worry about less-slow start unless it actually causes packet drops.


That's the problem though, isn't it. The buffers absorb the packets, preventing packet loss, and TCP performance goes down. Adding more initial packets puts more data in flight, filling intermediate buffers faster. This tactic seems like it would accelerate buffer issues.

Like someone said in the comments on the article, I'd like to see what Van Jacobson, or someone of similar stature, thinks.


No, TCP performance goes up for the connection in question. Latencies are lower because the packets arrive earlier. Latency (for all protocols) goes up on the whole, though, due to the backlog.

Balancing these requirements against each other is a really hard problem. TCP slow start is (well, was, cf. this article) an early attempt at an auto-tuning solution. But it isn't the only part of the problem, nor is it an optimal (or even "good") solution to its part of the problem. Its defaults are very badly tuned for modern networks (though they'd be a lot better if everyone were using jumbograms...).
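
For reference, the initial window RFC 3390 prescribes is a function of the MSS, which is why jumbograms would change the picture. A quick illustration (the formula is RFC 3390's; the MSS values are just examples):

    # RFC 3390's initial window: IW = min(4*MSS, max(2*MSS, 4380 bytes)).
    def rfc3390_initial_window(mss):
        return min(4 * mss, max(2 * mss, 4380))

    for mss in (536, 1460, 8960):  # dial-up-era, typical Ethernet, jumbo frames
        iw = rfc3390_initial_window(mss)
        print(f"MSS {mss:5d}: {iw:6d} bytes initial window (~{iw // mss} segments)")
    # MSS   536:   2144 bytes initial window (~4 segments)
    # MSS  1460:   4380 bytes initial window (~3 segments)
    # MSS  8960:  17920 bytes initial window (~2 segments)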


I object to your metaphor; stop signs do slow down traffic. If every 4-way stop sign junction had a roundabout (either full-sized or mini, depending on available space), traffic wouldn't need to stop very often, overall throughput would be much higher, and there would probably be even fewer accidents.

4-way stop sign junctions were probably the most asinine, time-wasting, fuel-wasting road control I found when I drove in the US.


Discussing this with a US friend, I have come to the conclusion that for single-lane roads, a roundabout is superior to a 4-way stop as you only have to watch one direction for traffic.

Once you get to multiple lanes, the answer is simple: both roundabouts and 4-way stops are inferior...


Totally agree with this, and I love them in the common case (intersections of single-lane streets). We have a few in various places; however, the American driver's lack of experience with them leads to some interesting results. One person insisted that when there wasn't a light you had to treat it like a 4-way stop!


Yes, cloverleafs are necessary for >1 lane per direction.


Nah, we have plenty of 2 lane roundabouts here in Australia, and I'm sure the UK does too. It's really a matter of what you're used to. They don't seem to get used much in the US from what I see.


4-way stops are popular because of construction cost and space savings. Different optimizations yield different problem areas.


I don't think there are many 4-way stops in the US that cannot be replaced by a mini-roundabout for reasons of space or cost; mini-roundabouts are just solid white circles of paint on the ground, with a more or less pronounced mound. Their primary purpose is to trigger the right-of-way rules of a roundabout; often the junctions they are used on are too small to actually go "around" the mini-roundabout, in London (where I live) at least. Roads (and hence space for road furniture etc.) in the US are much larger than most roads in London.


A standard 4-way stop has two roads with two lanes each (both directions). No space beyond the widths of the road is used. They are very space efficient. It is also easy to build as an afterthought (just cross the roads).

I cannot find a roundabout design that fits into that same space. Looking at the Wikipedia entry[1], it seems much larger than a standard 4-way. Given the block patterns in American suburbs and housing developments, the extra space would eat into someone's yard.

[1] http://en.wikipedia.org/wiki/Roundabout


Mini-roundabouts are much smaller; most UK roads do not have two lanes in either direction, so junctions are naturally smaller too.

A road with two lanes in either direction would normally be major enough to warrant traffic lights; or, if it's just a particularly wide suburban road, then the lane closest to the pavement is often used for parking, and the road would narrow to one lane (with the advantage of reducing speed somewhat) for a roundabout.


We also have 4-way stops, but you don't actually stop entirely; that would be counterproductive, wouldn't it. You "stop" (in a grey sense of the word), look left and right, then you go.


Legally, the 'grey' stop doesn't work. I have been given a ticket by a cop. Yes, it may work 90% of the time but if you get caught, you get 2 points (in NJ) and your insurance goes up.


Well, the problem is that you're in New Jersey. There's a reason it's called the California Stop.


Can't comprehend why you were downvoted. Unless Americans are aware of a more asinine, time-wasting, fuel-wasting road control.


I'm not sure it's more asinine, but in my town there are sometimes lights where a major road intersects a single lane road with almost no traffic. But typically these lights do not have a traffic sensor, so they just regularly interrupt the flow of traffic on the major road. We also have lots of untimed lights.


He was exaggerating a bit when he said "tens" of years ago. It wasn't that long ago that dialup was downright pervasive in North America and Europe. At that time it didn't take too much abuse to discover just what an overstuffed pipe would do to your connection.

On the other hand, we have so much more bandwidth now. 56kBd is a fart in today's gale. Of course we shouldn't start out at full throttle, but the factor of 5 he's endorsing is virtually nothing and it still gives a higher proportion of overhead than we had back then.


All that being said, there is still something to be considered about adjusting the parameters to more closely match the network as it is today. That's mostly what this is going on about, not about completely doing away with congestion control, but instead making it fit better with the higher bandwidth connections that exist. What the right changes would be I have no clue.


Hm - spreading out the ramping-up over time may not be necessary for getting the window up to size while keeping congestion low. One idea that I found intriguing is "probing" the appropriate window size by "trying out" a large range of window sizes in a single round-trip time: http://www.cs.unc.edu/~jasleen/papers/infocom09.pdf

So you can start slowly, but time isn't necessarily the axis along which your start is "slow" or "fast." I am really interested to see whether the above or a similar protocol will be able to spread out "slow"-start over an axis of window sizes that the protocol probes for. Then maybe we will be less concerned with who gets to decide what the initial congestion window is, because you can't ramp up too "fast" when you know what to ramp up to.


I also came across this http://yuba.stanford.edu/~nanditad/talks.html which is very interesting


This article gives the IW status on Windows and Linux. What is the status on other systems (e.g. Mac OS X, FreeBSD, Solaris...)?

Does it matter only on the server side, or do clients benefit from having this window increased too?

Also, a comment on the article mentions this:

    Why are you talking about upgrading the kernel, when you can simply do:
    
        ip route change default via MYGATEWAY dev MYDEVICE initcwnd 10
which would be similar to the netsh tunable on Windows. So upgrading the kernel is only needed to have it set to 10 by default.

EDIT:

It seems Mac OS X is using either NewReno or LEDBAT instead of the mentioned CUBIC or Vegas. Look for tcp_ledbat_cwnd_init in [1] which looks quite simple, or tcp_newreno_cwnd_init_or_reset in [0] which looks a bit more involved:

    /* Calculate initial cwnd according to RFC3390,
     * - On a standard link, this will result in a higher cwnd
     * and improve initial transfer rate.
     * - Keep the old ss_fltsz sysctl for ABI compabitility issues.
     * but it will be overriden if tcp_do_rfc3390 sysctl is set.
     */
PS: xnu-1699.24.23 is Lion 10.7.3

[0] http://opensource.apple.com/source/xnu/xnu-1699.24.23/bsd/ne...

[1] http://opensource.apple.com/source/xnu/xnu-1699.24.23/bsd/ne...


The initcwnd change is helpful on any host that has more than 2 segments worth of data ready to send at the beginning of the connection. So a client that wants to send lots of data would benefit from the change.

For 99% of web browsing, the client's request fits in one or two segments and so would not benefit from the change.


This FreeBSD-related paper[1] is a little old but seems to suggest "it depends". I didn't actually read it...

Tuning and Testing the FreeBSD 6 TCP Stack (2007) http://caia.swin.edu.au/reports/070717B/CAIA-TR-070717B.pdf



In OS X all you need to do is set a sysctl setting. No need for even a restart. Check out net.inet.tcp.slowstart_flightsize (set it to 10) if you are interested.


If upgrading my Linux kernel will solve all of my problems, why is the experimental comparison between a Linux box and a Windows box? Just saying...


Mainly because I did not have a chance to set up an old Linux VM. The numbers hold though; the initial congestion window is 2-3 on the 2.x line kernels.


I don't disagree, but it makes it an apples-and-oranges comparison. It introduces the variables of how Linux vs. Windows deals with TCP (notwithstanding that Linux 2.x vs. 3.x might have some internal IPv4 changes, but your recommendation is to upgrade anyway, so that's fine) but also changes in the web server. It seems like the changes are hard-coded into the compiled kernel, so there's no way to simply change configuration flags?

That said, thanks for the post, and I'll definitely be tcpdumping in the upcoming week and reading some more about slowstart!

Maybe testing with net.ipv4.tcp_slow_start_after_idle 0 vs 1 would make a cleaner comparison?


I totally agree with the concern, but the only way to get a clean comparison here would be for me to spin up a new VM. I observe the exact same patterns on our Linux 2.x prod box as I get from the Windows VM, so I assume they are the same.

There were a slew of TCP changes leading up to the 3.x branch, which included changing the default congestion control algorithm to CUBIC.

Slow start after idle does not really play a part here. The test is for a clean/new connection.

I am no expert, but it is possible I could lower the IW on my 3.2 box to 3 to demonstrate the same pattern; however, that too is not a clean comparison.

If my sys admins push me I may set up another VM to demonstrate this.


Thanks again for your blog post and comments in this thread! Experiments where you are already really confident about the conclusions are pretty silly, but I feel that, despite that, being skeptical in general toward posts on the internet has value. I haven't investigated yet, but if I do investigate clear benefits of slow start and can make a corroborating case, I'll be happy to correspond and write it up in a blog post. No promises though :)


Not really a solution for the short term, but CCNx[1] looks like it'd solve a lot of problems that TCP currently has for both large file transfers and short web browsing.

[1] http://www.ccnx.org/


That site is terribly vague; what are the specific problems, and how does CCNx address them?


It would indeed be nice if the project had a proper introductory page or something. Either way, the paper "Networking Named Content" is pretty much the best introduction you can get, and a very interesting read at the same time. Googling gave me a PDF at the following URL: http://conferences.sigcomm.org/co-next/2009/papers/Jacobson....


I'm sure there is a chance I am missing some bits here, but I set up a 3.0.18 / Ubuntu 11.10 / Apache 2.2 server to compare against a 2.6.39 / Apache 2.2 server, and in all tests they are basically identical.


Yeah... the change was introduced in 2.6.39... but it's a pretty rare kernel to have, AFAIK; it's not even in Debian backports anymore.


What versions of Ubuntu have this larger (faster) setting?


Looks like Oneiric and Precise have > 3.0 kernels.


And the Oneiric kernel is backported to 10.04LTS:

Installing "linux-image-generic-lts-backport-oneiric" outta do it.


I wonder if Apple has implemented this in OS X kernel.


sudo sysctl -w net.inet.tcp.slowstart_flightsize=10


Wow, it changed from 1 to 10. Should I put that in a boot script, or what do you recommend?


Linkbait title. Editors, please fix.



