Unbloating the buffers (dgroshev.com)
112 points by dgroshev 11 months ago | 30 comments



The biggest problem with CAKE/CoDel is that if you're not deploying it on the bottleneck device, you have to artificially limit the download/upload speed ("bandwidth 280mbit" and "bandwidth 45mbit" in this example) in order to create a new bottleneck. Finding the correct limits is nontrivial because they might change over time.
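
Outside of the article's VyOS config, the equivalent raw Linux knob is tc; a minimal sketch, where eth0 and the rate are placeholders and the right numbers depend on your actual line:

    # egress: shape just below the provisioned upload rate so the queue builds here, not in the modem
    tc qdisc replace dev eth0 root cake bandwidth 45mbit
    # verify, and watch the drop/mark statistics under load
    tc -s qdisc show dev eth0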

Anecdotally, switching from DOCSIS 3.0 to 3.1 seems to have improved my worst-case latency, because DOCSIS 3.1 includes a CAKE-like algorithm (PIE) in the modem itself:

https://www.cablelabs.com/blog/how-docsis-3-1-reduces-latenc...


The bottleneck device can change with every different destination, so CAKE and fq_codel find it the same way older TCP code did: increase the load until they get a timeout or a "slow down, you idiot" flag (:-))

The initial bandwidth setting is really to get it started at a good value for your usual bottleneck device, such as your local cable modem.

Always starting off as if you had fast ethernet to the ISP when you actually have a super-el-cheapo link is a waste of time and effort. Even if you have a good algorithm like CAKE.

--dave


When I used CoDel on OpenWrt years ago, setting a 1 Mbps limit prevented the link from ever going faster. If it's now smart enough to discover the correct limit, then that's an interesting/useful development.


Followup: I just tested OpenWrt 23.05.2 + luci-app-sqm + cake with https://speed.cloudflare.com/

With the limit set to 1 Mbps (1000/1000), my upload latency dropped from 80ms to 25ms, but speed was hard-limited to 1000/1000. With the limit raised to 1G/1G, cake stopped working and my upload latency returned to 80ms.

So I stand by my original comment. You still have to configure the speed limits manually.
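
For anyone who wants to poke at the same thing, those limits live in SQM's uci config; a rough sketch (section name, interface, and rates are all placeholders; values are in kbit/s):

    uci set sqm.wan=queue
    uci set sqm.wan.enabled='1'
    uci set sqm.wan.interface='wan'
    uci set sqm.wan.download='1000'
    uci set sqm.wan.upload='1000'
    uci set sqm.wan.qdisc='cake'
    uci set sqm.wan.script='piece_of_cake.qos'
    uci commit sqm
    /etc/init.d/sqm restart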


You're right: CoDel and derivatives like fq_codel and cake don't auto-tune anything on timescales much longer than the interval parameter which defaults to 100ms. And fq_codel doesn't even do traffic shaping.
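
To be concrete, those are static per-qdisc parameters set at configuration time, e.g. (eth0 is a placeholder interface):

    # this just restates the fq_codel defaults; nothing re-tunes them afterwards
    tc qdisc replace dev eth0 root fq_codel target 5ms interval 100ms
    tc -s qdisc show dev eth0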

But I think davecb may have been confusing traffic shaper limits with TCP congestion control behaviors, and maybe the impact of a large TCP initial window releasing a sudden burst of packets that may be large enough to build up a bit of a queue on particularly slow links. (It was a serious problem in the ADSL era; now, only wireless gets that slow, and large bursts of packets are as likely to help as not when frame aggregation enters the picture.)


I mean that a static traffic control limit was generally set as a high-water mark, and even if you had excess bandwidth you didn't want to exceed it.

Now, to get around that, some devices perform regular speed tests and dynamically adjust the high-water mark. That said, there are limits to how often you should run these tests: do it too often and you may affect the actual applications you're trying to run.


> The biggest problem with CAKE/CoDel is that if you're not deploying it on the bottleneck device.

Right. This is a scheme for trying to manage the flows from the end you control so that the clueless FIFO nodes out there don't overload.

Flow management should be at the point where multiple flows feed into a bottleneck. Usually at the point where the LAN connects to the outside world, the telco modem or cable box. And, in the other direction, where bulk bandwidth goes into limited bandwidth to the consumer at the upstream end. I'm encouraged to hear that this finally got into DOCSIS. Now if only we could get it into AT&T modems.


> Finding the correct limits is nontrivial because they might change over time.

It's even worse than that: a lot of consumer Internet connections are bottlenecked by some shared link, which means your max bandwidth literally depends on what the other customers are doing at the moment. There is no "correct limit".


Once you find what your network's maximum throughput is, using CAKE or CoDel or whatever algorithm to prioritize latency over capacity (as this article does), a very simple way to get the shaping effect is to lower your bandwidth plan, so long as there's enough headroom to do so.

Three homes ago, my Comcast Internet only supported 135mbps before latency spikes while using active shaping, so I downgraded from the 250mbps plan to the 125mbps plan and no longer had to use the shaping device at all. Another home could only get 75mbps, so I switched to a 25mbps plan just to prove that low bandwidth excelled at low latency; it did.

This does mean that console video games download slower, but I grew up on modems, so I can keep myself occupied while my computer is downloading — and instead of needing a shaping router, I saved half my internet bill per month.


Heh - the article refers to CAKE as a "really tortured acronym" (Common Applications Kept Enhanced). It is a backronym indeed. Some of the backstory behind CAKE is that we were shooting to compensate for some of the known flaws of fq_codel, shortly after PIE (backed by Cisco and Comcast, among other "big boys") appeared, and so we used "cake" as an internal codename. To some in the project (me) it was a reference to a scene in 2010 where the Russian guy keeps getting American idioms wrong: "Piece of pie" - "No, cake!". Others thought it came from the game Portal, where we started saying "Bandwidth is a lie!". Anyway, the name stuck (although I wish we had come up with something more unique; google "cake shaper" to see why), and that's where it came from.

In general fq_codel is superior to PIE in just about every way. The one thing PIE does that makes it appealing to hardware designers is that it is O(1) on egress, and fq_codel is not (but there are ways around that). fq_codel achieves lower latency than PIE does, and does so faster; codel alone takes a bit longer than we would like.

fq_pie, using the same FQ algorithm as fq_codel and cake, still has worse TCP latency, and cake solves all kinds of edge cases that no other fq+aqm+shaper can - per-host FQ (solving torrent issues), ack-filtering, NAT transparency, a better codel model, easy-to-use link layer compensation, diffserv - and my favorite feature is actually that it runs more or less the same as a default qdisc as it does as a shaped one. I had hoped it would replace fq_codel in both the line-rate and the shaped cases, but it proved a bit too cpu intensive on cheesy hardware, and the edge cases were not noticeable enough on simple benchmarks.
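
For the curious, most of those features are single keywords on the qdisc; an illustrative sketch (interface and rate are placeholders, not a recommendation):

    # per-host + per-flow fairness, NAT awareness, ACK thinning, DOCSIS framing compensation, diffserv
    tc qdisc replace dev eth0 root cake bandwidth 20mbit \
        dual-srchost nat ack-filter docsis diffserv4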

I would like to multi-thread it, and add some new features, discussion here: https://docs.google.com/document/d/1tTYBPeaRdCO9AGTGQCpoiuLO...


The problem I always have with this is that my maximum bandwidth is not a constant. I am technically getting 1200Mbps/20Mbps from Comcast (ugh, gross), but I usually do not see that. Sometimes my max download is significantly less. So let's say I normally see 1100Mbps down, and I set my CAKE max to 90% of that, or 990Mbps. Then during a peak time of day, my max bandwidth drops to 1000Mbps, 990Mbps is no longer the right setting, and my latency goes to hell again.

So do I set my max to 900Mbps all the time? I'd really rather not.

And this is worse for the upload bandwidth, as there's so precious little to spare of it.


This is a really good article. I do not know how to reach the author, as there is one thing that bothers me about his default config. He set CAKE rtt to 30ms, which is OK so long as 98% or so of the sites he wants to reach are within 30ms. Otherwise, the default of 100ms is scaled to manage bandwidth efficiently to sites around the world, and has been shown to work pretty well at real RTTs up to 280ms. In general I have been recommending the rtt be set lower, with the world now consisting largely of CDNs, but not as low as 30ms.
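
If anyone wants to experiment, rtt is just another cake keyword, e.g. (eth0 being a placeholder, and assuming cake is already the root qdisc):

    # the default; the article's config uses rtt 30ms instead
    tc qdisc change dev eth0 root cake rtt 100ms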

Delighted to see vyos pick it up!

Next is the shaping vs variable rate issue, on outbound and inbound. On outbound, ideally the fq_codel or cake algorithms are located right on the bottleneck link and respond dynamically to changes in the available bandwidth from upstream (asserted by ethernet pause frames, for example), managed underneath by BQL (ethernet) or AQL (some wifi: ath10k, ath11k, mt76, mt79; true native fq_codel support exists for the ath9k), which store up only enough data to service one interrupt. These algorithms are much (20x!) lighter weight than shaping to a fixed rate is.

And like I said, codel was designed to deal with variable rate links. A fixed shaping rate was not what we intended, but it has become one of the most common and finicky use cases, dang it. It is the FQ that matters most in most cases, however.

On inbound shaping, it's a SWAG; we recommend 85% or so of the provisioned rate as a starting point. You might need more, you might need less; due to the behavior of "slow start" in particular, it is never going to be accurate enough, and we keep encouraging ISPs to simply manage their own egress well (ideally dynamically and without a shaper) so that inbound shaping is not needed. Libreqos/preseem/bequant/paraqum all make middleboxes that do ISP-level shaping. Libreqos (for whom I work these days) has pushed CAKE to where an ISP can do 10k subscribers at 25Gbit at 50% of cpu on a cheap ($2k) Xeon or Ryzen box.
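
For completeness, inbound shaping on stock Linux means redirecting ingress traffic through an IFB device and shaping there; a rough sketch of the usual recipe (interface names and the 85mbit figure are placeholders):

    ip link add ifb0 type ifb
    ip link set ifb0 up
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol all matchall \
        action mirred egress redirect dev ifb0
    # shape to roughly 85% of the provisioned downstream rate as a starting point
    tc qdisc add dev ifb0 root cake bandwidth 85mbit ingress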

So you get something that works well on the rrul test... in the morning, but not in the evening. Folk then complain that their bandwidth from their provider is variable and that shaping via their other box does not work all the time, and I go back to hoping more ISPs get told how to fix it at egress, truly right at their end.

There is another project which perhaps can be ported to vyos, called cake-autorate, which attempts to use active measurements to cope with variable rates.
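
Conceptually (this is not the project's actual code), cake-autorate boils down to a control loop along these lines, with every name, rate, and threshold below being purely illustrative, and cake assumed to already be the root qdisc:

    #!/bin/sh
    # toy sketch: back the cake rate off when load-induced latency appears, probe up when it doesn't
    IFACE=eth0; RATE=20000; FLOOR=5000; CEIL=25000   # kbit/s, all placeholders
    while sleep 5; do
        rtt=$(ping -nqc 3 1.1.1.1 | awk -F/ 'END { print int($5) }')   # avg RTT in ms (iputils summary)
        [ -n "$rtt" ] || continue                     # skip the cycle if the probe failed
        if [ "$rtt" -gt 50 ] && [ "$RATE" -gt "$FLOOR" ]; then
            RATE=$((RATE * 9 / 10))                   # latency up: back off
        elif [ "$rtt" -lt 30 ] && [ "$RATE" -lt "$CEIL" ]; then
            RATE=$((RATE + 1000))                     # latency low: probe upward
        fi
        tc qdisc change dev "$IFACE" root cake bandwidth "${RATE}kbit"
    done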


CAKE is pretty modern. You can also read up on the original algorithm proposed to solve the problem, CoDel [0]. The article doesn't quite go into the details of how to unbloat the buffers, so the RFC is still worth a read. The TL;DR is that there are good queues and bad queues that we must distinguish, and that to distinguish them it is not a good idea to use queue length, but rather the minimum time packets have recently spent in the queue. I find this very insightful.

[0]: https://datatracker.ietf.org/doc/html/rfc8289


The codel ACM queue article is IMHO, better than the RFC: https://queue.acm.org/detail.cfm?id=2209336

Kathie Nichols has also spoken about CoDel multiple times.


Can someone explain why bufferbloat is always tied to specific hardware? From what I've read, the fix is a different scheduler in Linux, so is it a hardware or a software problem, or both?


It's a problem in various networking devices, in which those devices buffer too heavily rather than dropping packets. That behavior may be implemented in a combination of hardware/firmware/software. Among other things, this prevents other devices from being able to adapt to bandwidth limits (because dropped packets are the primary way to signal "slow down"), and causes high latency (because packets are sitting in deep buffers for a while).

Software such as network schedulers in Linux can work around that problem. But ultimately, it should be fixed in the various networking devices that create the problem in the first place.


It's a software issue, but many routers are sold as appliances where the average user does not choose what OS to run. Therefore, a common recommended solution to bufferbloat is to replace the router with one whose software has more modern buffer management algorithms.

OpenWRT, for example, can be flashed on many older routers and will be able to solve bufferbloat without replacing the hardware.


Bufferbloat can be a hardware or software problem. The fix is always to move the bottleneck to a device you control and can configure to not have bufferbloat at that bottleneck. For consumer hardware that means implementing AQM in software.


The buffer bloat issue happens due to bad queue management on the router immediately preceding the bottleneck link. Whether the queue management happens in "hardware" or "software" depends on the router (but I suspect it'll be software/firmware, not actual hardware).

"bottleneck link" is, of the path from the packet source to the target, whichever link is trying to exceed the available bandwidth. The problematic router (prior to that link) will receive packets faster than it can send them, so it needs to either use the queue, drop the packet, or use the ECN bits in the IP header to tell the target to tell the source to slow down. A router with bad queue management will enqueue packets until the queue is full, before picking one of the other options. This adds latency depending on the queue size; big queues on slow links cause significant latency.

With bad internet connections, the bottleneck link is often the connection between the home router/modem and the ISP's infrastructure. If upload bandwidth is fully used, bad queue management on the home router will cause latency to spike. If download bandwidth is fully used, bad queue management on the ISP's router will cause latency to spike.

With fast internet connections, the bottleneck link might be somewhere else, e.g. at the interface between two ISPs.

Because the problem only happens at the bottleneck link, it is possible to work around a bufferbloat problem in the ISP's infrastructure by introducing another artificial bottleneck in your own home network. So by using a bit of Linux software that artificially limits bandwidth, that software creates a virtual "bottleneck link" where it can control both ends, and thus "fix" the problem (as the link affected by bufferbloat will no longer be overloaded, and thus no longer use its queue). A real fix would require updating the firmware/configuration on the ISP's routers.


One thing to be cautious about with cake/fq_codel is that many consumer routers aren't fast enough to run it at line rate on current "fast" networks, so enabling it on those routers effectively caps the throughput lower than any limit you set and likely has unexpected effects elsewhere (potentially increasing latency issues instead of reducing them).


fq_codel should be fine on any hardware that's still operational and has enough RAM to run a currently-maintained Linux-based OS (eg. OpenWRT). It's specifically the traffic shaping (rate limiting) that tends to hit CPU limits on consumer hardware. CAKE includes a traffic shaper; fq_codel doesn't and usually needs to be paired with a separate traffic shaper.
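
As a concrete example of that pairing (interface name and rate are placeholders, not a recommendation), the classic recipe looks roughly like:

    # HTB does the rate limiting, fq_codel manages the queue inside the HTB class
    tc qdisc replace dev eth0 root handle 1: htb default 10
    tc class add dev eth0 parent 1: classid 1:10 htb rate 18mbit ceil 18mbit
    tc qdisc add dev eth0 parent 1:10 fq_codel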

If you have an older, slower router in your network that is not acting as your gateway router and does not need to compensate for bufferbloat in somebody else's device (eg. you're using it as a secondary WiFi access point), then you should almost always use fq_codel on that device's network interfaces.

Some devices like some cable modems have even more drastically anemic CPUs because the CPU was never intended to be part of the data plane and most traffic is supposed to be offloaded to special-purpose hardware. Those may be genuinely unable to even handle CoDel, and are a big part of the reason why DOCSIS 3.1 invented PIE AQM rather than adopting CoDel as the standard AQM.


This is a funny and digestible read, thanks for sharing!


I've never understood these "fancy" algorithms. I've always just used HTB with simple egress-only rate and ceil limiting tied to the actual ISP line upload rate and it's never failed to make the latency problem go away.

What am I missing? Are things like cake/codel "easier" to deploy than creating a single paired qdisc and class?


Token buckets have burstiness problems. CAKE lets you run things at ~99%+ of your link speed, assuming a stable medium, without bufferbloat, thanks to the packet pacing it does. CAKE also does several other things using Cobalt (CoDel + BLUE), and clever 8-way set associative hashing that helps out when you have many flows.


Ironically, we made fq_codel the default in many Linux distributions starting in 2012. HTB will pick up the default qdisc when you set it up, so we figured many small shaping tools were just picking that up automagically.

Most people do not measure in-stream latency, rather multiple streams; whether or not you picked up fq_codel, a packet capture will show you whether the TCP RTTs are being managed. Note that on a system where the TCP originates and then transits HTB on the same box, another subsystem (TSQ) manages the rates fairly well before the packets ever hit HTB.


I don't know why that would work. It sounds like you're just moving the bloat into HTB.


It's likely that the default queue limits for HTB on Linux are a lot more reasonable than the crazy bloat some cable and DSL modems have shipped with.


A bit miffed that there was no mention of L4S though.

Prior HN discussion: https://news.ycombinator.com/item?id=38597744


Even trying L4S is still rather problematic these days. All the support for it - from the AQMs to the transports - lives out of tree against an increasingly ancient 5.15 release, whereas cake has been shipping for nearly 10 years in some form or another. I would like that group to get on the stick about submitting it upstream, because there are some code quality problems there that they should fix (notably GRO issues), to say nothing of the algorithmic issues that I have also pointed out!

My biggest gripe about that project is that everyone focuses on the ECN change, and nobody looks at the effects on normal everyday traffic, where fq_codel or cake utterly smoke the dual-pi AQM. There actually is L4S support in fq_codel now, but they are not testing it. :(


The endpoints can also be smarter about it by using a buffer-aware congestion control algorithm like BBR, which should avoid filling the buffers up to capacity.
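
On Linux that's a one-line switch on the sending host (assuming the tcp_bbr module is built), usually paired with the fq qdisc for pacing:

    modprobe tcp_bbr
    sysctl -w net.ipv4.tcp_congestion_control=bbr
    sysctl -w net.core.default_qdisc=fq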



