Linux 3.3: Making a Dent In Bufferbloat?
mtaht writes "Has anyone, besides those that worked on byte queue limits and sfqred, had a chance to benchmark networking using these tools on the Linux 3.3 kernel in the real world? A dent, at least theoretically, seems to have been made in bufferbloat, and now that the new kernel and new iproute2 are out, it should be easy to apply in general (e.g. server/desktop) situations."
Dear readers: Have any of you had problems with bufferbloat that were alleviated by the new kernel version?
I've had all sorts of trouble with bloat of all kinds since I turned 40.
You name it, it's become bloated: buffers, bellies, butts, pretty much everything.
Bufferbloat...is the result of our misguided attempt to protect streaming applications (now 80 percent of Internet packets) by putting large memory buffers in modems, routers, network cards, and applications. These cascading buffers interfere with each other and with the flow control built into TCP from the very beginning, ultimately breaking that flow control, making things far worse than they’d be if all those buffers simply didn’t exist.
In my day, if your modem had a 16550A UART with its mighty 16-byte FIFO buffer protecting you, you were a blessed man. That little thing let you potentially multitask. In OS/2, you could even format a floppy disk while downloading something thanks to that sucker.
Has there been widespread empirical analysis of bufferbloat? Particularly by device manufacturers?
I read TFA and all I got was this lousy cookie
3.3 is odd (thus unstable?); does anyone recommend actually installing it (in my case, Ubuntu)? Are there considerable advantages versus its drawbacks/instability? Thanks.
... routers and gateways to have any effect?
I state the obvious because who's already installing it on any but home routers so soon after release?
Umm, it was only released 9 days ago. Do you really think every server, router, gateway, etc. is upgraded through magic days after a new kernel version is released? Considering most devices will probably never have their firmware updated, don't you think it's a bit early to be asking this?
Yes there has.
Unfortunately, the analysis is "it's almost all bad". We have seen with Netalyzr some network kit that had properly sized buffers, sized in terms of delay rather than capacity, but the hardware in question (an old Linksys cable modem) was obsolete, and when I bought one and plugged it into my connection, I got into the cable company's walled garden of 'your cable modem is too obsolete to be used'.
We would encourage all device manufacturers to test their devices with Netalyzr, it can find a lot of bugs, and we would be glad to assist in the testing process.
Test your net with Netalyzr
Like the apocryphal monkey throwing darts at the stocks page, Cringely does get things right occasionally, but not because he actually understands or is capable of explaining them.
It steals all of the memory.
Seriously. My 10.0.2 Firefox on Debian Squeeze often grows to over 1GB RSS and 1.5GB VSZ in a day or so. And then it becomes extremely sluggish. Closing windows or tabs does not help. It will run my system out of memory. Where are the built-in memory usage stats, especially for extensions?
Why isn't 2GB of physical memory sufficient for a laptop running the latest Firefox?
I run no Flash. Addons include only adblock+, noscript and ghostery.
It seems to me that people blame cheap memory, which makes larger buffers possible, for this problem, but no: if there is a problem, it's from bad programming.
Buffering serves a purpose where the rate of receiving data is potentially faster than the rate of sending data in unpredictable conditions. A proper event driven system should always be draining the buffer whenever there is data in it that can possibly be transmitted.
Simply increasing the size of a buffer should absolutely not increase the time that data waits in that buffer.
A large buffer serves to minimize potential dropped packets when there is a large burst of incoming data or the transmitter is slow for some reason.
If a buffer actually adds delay to the system because it's always full beyond the ideal, one of two things is done totally wrong:
a) Data is not being transmitted (draining the buffer) when it should be for some stupid reason.
b) The characteristics of the data (average rate, burstiness, etc.) were not properly analyzed, and the system with the buffer does not meet its requirements to handle such data.
In the end, it's about bad design and bad programming. It is not about "bigger buffers" slowing things down.
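The claim that a bigger buffer need not add delay can be checked with a toy discrete-time model (an illustration only, not a model of any real device; `standing_delay` is a hypothetical name). Packets arrive at 11 per tick, the link drains 10 per tick, and excess arrivals are tail-dropped. This assumes the offered load persistently exceeds the drain rate, as it does while a TCP sender keeps probing for more bandwidth.

```python
from collections import deque


def standing_delay(buffer_pkts: int, arrive_per_tick: int = 11,
                   drain_per_tick: int = 10, ticks: int = 5000) -> int:
    """Toy tail-drop FIFO: return the queueing delay, in ticks, of the last
    packet sent when arrivals persistently exceed the drain rate."""
    queue = deque()  # holds the arrival tick of each queued packet
    last_delay = 0
    for now in range(ticks):
        for _ in range(arrive_per_tick):
            if len(queue) < buffer_pkts:
                queue.append(now)  # admit; otherwise the packet is tail-dropped
        for _ in range(min(drain_per_tick, len(queue))):
            # The link drains as fast as it can on every tick.
            last_delay = now - queue.popleft()
    return last_delay


for size in (10, 100, 1000):
    print(size, "packets of buffer ->", standing_delay(size), "ticks of delay")
```

With these parameters the 10-, 100- and 1000-packet buffers settle at 0, 9 and 99 ticks of standing delay: the buffer is drained on every tick in all three cases, yet the delay scales with the buffer size, because the sender only learns to slow down once drops start.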
the alternative is data loss
TCP was designed to work around this by putting predictably sized retransmit buffers on the endpoints, and then the endpoints would scale their transmission rate based on the rate of packet loss that the host on the other end reports. Bufferbloat happens when unpredictably sized buffers in the network interfere with this automatic rate control.
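A minimal sketch of that endpoint rate control, assuming Reno-style additive increase / multiplicative decrease during congestion avoidance (no slow start, timeouts or SACK; `aimd_step` and `trace` are hypothetical names). The sender only slows down when a loss is signalled, so every packet a bloated buffer absorbs instead of dropping postpones that signal.

```python
def aimd_step(cwnd: float, loss: bool, increase: float = 1.0,
              decrease: float = 0.5, min_cwnd: float = 1.0) -> float:
    """One round trip of Reno-style congestion avoidance (window in segments).

    No loss seen: grow additively by `increase` segments.
    Loss seen: shrink multiplicatively by the factor `decrease`.
    """
    if loss:
        return max(min_cwnd, cwnd * decrease)
    return cwnd + increase


def trace(loss_rtts: set, rtts: int, start: float = 10.0) -> list:
    """Window after each RTT, with a loss signalled in the listed RTTs."""
    cwnd, out = start, []
    for t in range(rtts):
        cwnd = aimd_step(cwnd, loss=t in loss_rtts)
        out.append(cwnd)
    return out


print(trace({5}, rtts=8))  # [11.0, 12.0, 13.0, 14.0, 15.0, 7.5, 8.5, 9.5]
```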
Then I discovered it was mostly firebug with the network log turned on that ate the memory with every ajax request made by setInterval.
The bufferbloat "movement" infuriates me because it's light on science and heavy on publicity. It reminds me of my dad's story about his buddy who tried to make his car go faster by cutting a hole in the firewall underneath the gas pedal so he could push it down further.
There's lots of research on this dating back to the '90s, starting with CBQ and RED. The existing research is underdeployed, and merely shortening the buffers is definitely the wrong move. We should use an adaptive algorithm like BLUE or DBL, which are descendants of RED. These don't have constants that need tuning like queue-length (FIFO/bufferbloat) or drop probability (RED), and they're meant to handle TCP and non-TCP (RTP/UDP) flows differently. Linux does support these in 'tc', but (1) we need to do it by default, not after painful amounts of undocumented configuration, and (2) to do them at >1Gbit/s ideally we need NIC support. FWIH, Cisco supports DBL in cat45k sup4 and newer, but I'm not positive, and they leave it off by default.
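For readers who haven't seen BLUE, here is a sketch of its core idea: the marking probability is driven by queue events (overflow and link idle) rather than by a fixed queue-length threshold. BLUE does still carry a few parameters (increment `d1`, decrement `d2`, and `freeze_time`, which limits how often the probability may change), but they govern how fast it adapts. The values below are illustrative only, and the class is a hypothetical sketch, not the Linux or Cisco code.

```python
import random


class Blue:
    """Sketch of the BLUE AQM marking-probability update.

    p rises by d1 when the queue overflows and falls by d2 when the link
    goes idle; freeze_time limits how often p may change.
    """

    def __init__(self, d1=0.02, d2=0.002, freeze_time=0.1, rng=None):
        self.p = 0.0
        self.d1, self.d2, self.freeze_time = d1, d2, freeze_time
        self.last_update = float("-inf")
        self.rng = rng or random.Random(0)

    def on_overflow(self, now: float) -> None:
        """Queue overflowed (packet lost): mark more aggressively."""
        if now - self.last_update > self.freeze_time:
            self.p = min(1.0, self.p + self.d1)
            self.last_update = now

    def on_idle(self, now: float) -> None:
        """Link went idle: we are marking too much, back off."""
        if now - self.last_update > self.freeze_time:
            self.p = max(0.0, self.p - self.d2)
            self.last_update = now

    def should_mark(self) -> bool:
        """Mark (ECN) or drop an arriving packet with probability p."""
        return self.rng.random() < self.p
```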
For file sharing, HFSC is probably more appropriate. It's the descendant of CBQ, and is supported in 'tc'. But to do any queueing on cable Internet, Linux needs to be running, with 'tc', *on the cable modem*. With DSL you can somewhat fake it because you know what speed the uplink is, so you can simulate the ATM bottleneck inside the kernel and then emit prescheduled packets to the DSL modem over Ethernet. The result is that no buffer accumulates in the DSL modem, and packets get laid out onto the ATM wire with tiny gaps between them: this is what I do, and it basically works. With cable you don't know the conditions of the wire, so this trick is impossible. Also, end users can only effectively schedule their upstream bandwidth, so ISPs need to somehow give you control of the downstream, *configurable* control through reflected upstream TOS/DSCP bits or something, to mark your filesharing traffic differently, since obviously we can't trust them to do it.
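The DSL trick boils down to pacing each packet by the time it will actually occupy the ATM line. This minimal sketch assumes standard 53-byte cells that each carry 48 bytes of payload, plus a per-packet encapsulation overhead that you must supply yourself. That overhead varies with PPPoA, PPPoE and bridged setups, and the value 10 used here is only a placeholder. The function names are hypothetical. iproute2's `tc` can apply the same cell-rounding accounting through its size-table option (`stab` with `linklayer atm`).

```python
import math

ATM_CELL_BYTES = 53     # 5-byte header + 48-byte payload
ATM_PAYLOAD_BYTES = 48


def atm_wire_bytes(ip_len: int, overhead: int) -> int:
    """Bytes one IP packet occupies on the ATM line.

    `overhead` is the per-packet encapsulation added before the packet is
    split into cells (an assumed value; it depends on the DSL setup).
    The last cell is padded, so we round up to whole cells.
    """
    cells = math.ceil((ip_len + overhead) / ATM_PAYLOAD_BYTES)
    return cells * ATM_CELL_BYTES


def send_time_s(ip_len: int, overhead: int, atm_rate_bps: float) -> float:
    """How long the packet occupies the uplink at the ATM line rate."""
    return atm_wire_bytes(ip_len, overhead) * 8 / atm_rate_bps


def schedule(sizes, overhead, atm_rate_bps, start=0.0):
    """Departure times that pace packets at the ATM rate, so no queue
    builds up inside the modem."""
    t, times = start, []
    for size in sizes:
        times.append(t)
        t += send_time_s(size, overhead, atm_rate_bps)
    return times
```

Small packets suffer most: a 40-byte TCP ACK plus 10 bytes of overhead needs two cells, or 106 bytes on the wire. That rounding is why shaping on IP bytes alone underestimates line usage and lets a queue build in the modem.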
Buffer bloat infuriates me because it's blitheringly ignorant of implemented research more than a decade old and is allowing people to feel like they're doing something about the problem when really they're just swapping one bad constant for another. It's the wrong prescription. The fact he's gotten this far shows our peer review process is broken.
There is a similar, and well known situation that comes up in database optimization. For example, the Oracle database has over the years optimized its internal disk cache based on its own LRU algorithms, and performance tuning involves a combination of finding the right cache size (there is a point where too much causes performance issues), and manually pinning objects to the cache. If the database is back-ended by a SAN with its own cache and LRU algorithms, you wind up with the same data needlessly cached in multiple places and performance statistics reported incorrectly.
As a result I've run across recommendations from Oracle and other tuning experts to disable the SAN cache completely in favor of the database disk cache. That, or perhaps keep the SAN write cache and disable the read cache, because the fact is that Oracle knows better than the SAN how best to cache data for the application. Add in caching at the application server level, which involves much of the same data, and the same information ends up needlessly cached at many tiers.
Then, of course, every vendor at every tier will tell you that you should keep their cache enabled because caching is good and of course it doesn't conflict with other caching, but the reality is that caching is not 100% free... there is overhead to manage the LRU chains, do garbage collection, etc. So in the end you wind up dealing with a database buffer bloat issue very similar to Cringely's network buffer bloat. Let's not discount the fact that much server-to-disk communication is migrating toward the same kinds of protocols as networks (NAS, iSCSI, etc.). Buffer bloat is not a big deal at home or even on a mid-sized corporate intranet, but for super high speed communications like on-demand video, and mission critical multi-terabyte databases, these things matter.
Unfortunately, I think you haven't quite got this right.
The problem isn't buffering at the *ends* of the link (the two applications talking to one another), rather, it's buffering in the middle of the link.
TCP flow control works by getting (timely notification of) dropped packets when the network begins to saturate. Once the network reaches about 95% of full capacity, it's important to drop some packets so that *all* users of the link back off and slow down a bit.
The easiest way to imagine this is by considering a group of people all setting off in cars along a particular journey. Not all roads have the same capacity, and perhaps there is a narrow bridge part way along.
So the road designer thinks: that bridge is a choke point, but the flow isn't perfectly smooth. So I'll build a car-park just before the bridge: then we can receive inbound traffic as fast as it can arrive, and always run the bridge at maximum flow. (The same thing happens elsewhere: we get lots of car-parks acting as stop-start FIFO buffers).
What now happens is that everybody ends up sitting in a car-park every single time they hit a buffer. It makes the end-to-end latency much much larger.
What should happen (and TCP flow-control will autodetect if it gets dropped packet notifications promptly) is that people know that the bridge is saturated, and fewer people set off on their journey every hour. The link never saturates, buffers don't fill, and nobody has to wait.
Bufferbloat is exactly like this: we try to be greedy and squeeze every last baud out of a connection: what happens is that latency goes way too high, and ultimately we waste packets on retransmits (because some packets arrive so late that they are given up for lost). So we end up much much worse off.
A side consequence of this is that the traffic jams can sometimes oscillate wildly in unpredictable manners.
If you've ever seen your mobile phone take 15 seconds to make a simple request for a search result, despite having a good signal, you've observed buffer bloat.
You are correct that replacing one bad constant with another is a problem, though I certainly argue many of our existing constants are egregiously bad and substituting a less bad one makes the problem less severe: that is what the cable industry is doing this year in a DOCSIS change that I hope starts to see the light of day later this year. That can take bloat in cable systems down by about an order of magnitude, from typically > 1 second to of order 100-200ms; but that's not really good enough for VOIP to work as well as it should. The enemy of the good is the perfect: I'm certainly going to encourage obvious mitigation such as the DOCSIS changes while trying to encourage real long term solutions, which involve both re-engineering of systems and algorithmic fixes. There are other places where similar "no brainer" changes can help the situation.
I'm very aware of the research over a decade old, and the fact that what exists is either *not available* where it is now needed (e.g. any of our broadband gear, our OS's, etc.), or *doesn't work* in today's network environment. I was very surprised to be told that even where AQM was available, it was often/usually not enabled, for reasons that are now pretty clear: classic RED and derivatives (the most commonly available) require manual tuning, and if untuned, can hurt you. Like you, I had *thought* this problem was a *solved* problem in the 1990's; it isn't....
RED and related algorithms are a dead end: see my blog entry on the topic: http://gettys.wordpress.com/2010/12/17/red-in-a-different-light/ and in particular the "RED in a different light" paper referenced there (which was never formally published, for reasons I cover in the blog posting). So thinking we can just apply what we have today is *not correct*; when Van Jacobson (who, with Sally Floyd, originally designed RED) tells me RED won't hack it, I tend to believe him.... We have an unsolved research problem at the core of this headache.
If you were tracking kernel changes, you'd see "interesting" recent patches to RED and other queuing mechanisms in Linux. That bugs are still being found in such algorithms in Linux in this day and age shows just how little they have been used: in short, what we have had in Linux has often been broken, showing little active use.
We have several problems here:
1) basic mistakes in buffering, where semi-infinite statically sized buffers have been inserted in lots of hardware/software. BQL goes a long way toward addressing some of this in Linux (the device driver/ring buffer bufferbloat that is present in Linux and other operating systems).
2) variable bandwidth is now commonplace, in both wireless and wired technologies. Ethernet scales from 10Mbps to 10 or 40Gbps.... Yet we've typically had static buffering, sized for the "worst case". So even stupid things like cutting the buffers in proportion to the bandwidth you are operating at can help a lot (similar to the DOCSIS change), though with BQL we're now in a better place than before.
3) the need for an AQM that actually *works* and never hurts you. RED's requirement for tuning is a fatal flaw; and we need an AQM that adapts dynamically over orders of magnitude of bandwidth *variation* on timescales of tens of milliseconds, a problem not present when RED was designed or most of the AQM research of the 1990's done. Wireless was a gleam in people's eyes in that era.
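Item 2's point about static buffers is easy to quantify: a full buffer adds (bytes × 8) / link rate of delay. The sketch below uses a hypothetical driver ring of 256 full-size 1500-byte frames, chosen purely for illustration. It also includes the inverse calculation, which sizes a buffer to a delay target.

```python
def drain_time_ms(buffer_bytes: float, rate_bps: float) -> float:
    """Worst-case queueing delay: time to empty a full buffer at the link rate."""
    return buffer_bytes * 8 * 1000 / rate_bps


def buffer_bytes_for_delay(rate_bps: float, target_ms: float) -> int:
    """Bytes a buffer may hold so that, when full, it adds at most target_ms."""
    return int(rate_bps * target_ms / 8000)


if __name__ == "__main__":
    ring = 256 * 1500  # hypothetical: 256 full-size frames in a driver ring
    for rate in (10e6, 100e6, 1e9, 10e9):
        print(f"{rate / 1e6:>6.0f} Mbit/s: {drain_time_ms(ring, rate):8.3f} ms")
```

The same ring is about 307 ms of delay at 10 Mbit/s but about 3 ms at 1 Gbit/s. That is why a buffer sized for the fastest rate a device supports turns into bloat at the slowest.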
I'm now aware of at least two different attempts at fully adaptive AQM algorithms; I've seen simulation results for one of those which look very promising. But simulations are ultimately a guide (and sometimes a source of real insight): running code is the next step, along with comparison against existing AQMs in real systems. Neither of these AQMs has been published, though I'm hoping to see either/both published soon and their implementation happening immediately thereafter.
So no, existing AQM algorithms won't hack it; the size of t
RTFA. Fourth link.
They (the modern supposed bufferbloat conspiracy) talk about exactly the things you're raving about.
I don't have mod points, but wanted to let you know that I found your post very interesting. Thanks.
There is one other problem: TCP assumes that dropped packets mean the link is saturated, and backs off the transmit rate. But Wireless isn't like that: frequently packets are lost because of noise (especially near the edge of the range). TCP responds by backing off (it thinks the link is congested) when actually it should be trying harder to overcome the noise. So we get really really poor performance(*).
In this case, I think the kernel should somehow realise that there is "10 Mbit/s of bandwidth, with a 25% probability of packets returning". It should do forward-error correction, pre-emptively retransmitting every packet 4x as soon as it is sent. Of course there is a huge difference between the case of lots of users on the same wireless AP, all trying to share bandwidth (everyone needs to slow down), and 1 user competing with lots of background noise (the computer should be more aggressive). TCP flow-control seems unable to distinguish them.
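The 25% figure can be worked through for simple repetition, which means sending each packet k times (the crudest form of forward-error correction). The model assumes each copy is lost independently. That assumption is optimistic, because bursty interference tends to wipe out consecutive copies together. The function names are hypothetical.

```python
import math


def delivery_prob(p_success: float, copies: int) -> float:
    """Chance that at least one of `copies` independent transmissions arrives."""
    return 1 - (1 - p_success) ** copies


def copies_needed(p_success: float, target: float) -> int:
    """Smallest number of copies whose combined delivery chance reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_success))


def useful_rate(link_rate: float, p_success: float, copies: int) -> float:
    """Rate of distinct packets delivered when each is sent `copies` times."""
    return link_rate * delivery_prob(p_success, copies) / copies
```

With p = 0.25, four copies deliver about 68% of packets, and reaching 99% takes 17 copies. Under this model, repetition trades capacity for reliability. `useful_rate(10, 0.25, 4)` is about 1.7 Mbit/s, compared with 2.5 Mbit/s when each packet is sent once. What repetition buys is fewer losses visible to TCP (and therefore less backing off), not more raw throughput.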
(*)I've recently experienced this with wifi, where the connection was almost completely idle (I was the only one trying to use it), but where I was near the edge of range from the AP. The process of getting onto the network (with DHCP) was so slow that most times it failed: by the time DHCP got the final ACK, NetworkManager had seen a 30 second wait, and brought the interface down! But if I could get DHCP to succeed, the network was usable (albeit very slow).
Actually, this focus is driven very much by a technical approach. We know it is a problem in the real world due to widespread empirical measurements. Basically, for most users, the Internet can't "walk and chew gum": interactive tasks or bulk data work just fine, but combining bulk data transfer with interactive activity results in a needless world of hurt.
And the proper solution is to utilize the solutions known in the research community for a decade plus, but the problem is getting AQM deployed to the millions of possible existing bottlenecks, or using 'ugly-hack' approaches like RAQM where you divorce the point of control from the buffer itself.
Heck, even a simple change to FIFO design: "drop incoming packets when the oldest packet in the queue is >X ms old" [1], that is, sizing buffers in delay rather than capacity, is effectively good enough for most purposes: I'd rather have a good AQM algorithm in my cable modem but, without that, a simple buffer sized in delay gets us 90% of the way there.
[1] X should be "measured RTT to the remote server", but in a pinch a 100-200ms number will do in most cases.
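The delay-sized FIFO rule is simple to express in code. This minimal sketch assumes a single queue that timestamps packets at enqueue, with X defaulting to 150 ms to match the footnote's 100-200 ms rule of thumb. `DelaySizedFifo` is a hypothetical name, and unlike real hardware it has no byte cap at all.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    payload: bytes
    enqueued_at: float = 0.0  # set by the queue on admission (seconds)


class DelaySizedFifo:
    """FIFO bounded by queueing delay rather than by packet count."""

    def __init__(self, max_age_s: float = 0.150):
        self.max_age_s = max_age_s  # "X" from the rule above
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, pkt: Packet, now: float) -> bool:
        # Drop the *incoming* packet if the head of the queue has already
        # waited longer than X: the backlog is too deep for this link rate.
        if self.queue and now - self.queue[0].enqueued_at > self.max_age_s:
            self.dropped += 1
            return False
        pkt.enqueued_at = now
        self.queue.append(pkt)
        return True

    def dequeue(self) -> Optional[Packet]:
        return self.queue.popleft() if self.queue else None
```

The test is on the *oldest* packet's age, not on the queue length. As a result, the same code yields a deep buffer on a fast link and a shallow one on a slow link.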
DHCP isn't sent via TCP. It uses UDP broadcast.
TCP has SACK to handle moderate link layer packet loss, and at a certain point link layer packet loss is the link layer's fault and up to the link layer to solve via its own retransmission/forward error correction methods.
Yes...which is why DHCP shows the problem even more severely. DHCP needs 4 consecutive packets to get through OK, and when the environment is noisy, this doesn't happen. But the same happens for TCP, mitigated (slightly) by TCP having a faster retransmit timeout.
My point still stands:
Symptom: packet loss.
Common cause: link saturation.
Remedy: back off slightly, and hope everyone else also notices.
Symptom: packet loss (indistinguishable from the above)
Less common cause: RF interference because the AP is near the edge of range.
Remedy: try really hard, and flood the link with repeats to get at least some packets through.
In the wireless example, we still have a dedicated 10 Mbit link, it's just really unreliable. Being profligate with packets might get me 25% of 8 Mbit/s; being conservative with packets would back off to perhaps 90% of 0.01 Mbit/s.
Shouldn't this be handled at the datalink level by the wireless hardware? If there's transmission errors due to noise, more bits should be dedicated to ECC codes. The reliability is maintained at the expense of (usable) bandwidth and the higher layers of the stack just see a regular link with reduced capacity.
Yes, it certainly should be. But it often isn't.
Incidentally, regular ECC won't help here: adding 1 kB of ECC to 1 kB of packet doesn't help against a 1 ms long burst of interference, which obliterates the whole packet.
DHCP needs 4 consecutive packets to get through OK,
Eh? It needs 4 packets to get through, but doesn't require them all to be consecutive. Lose a DHCP request and the client will retransmit without going back to discover...
The solution for wireless could be a TCP congestion control change, such as Westwood+, which after a loss sets its window from a bandwidth estimate derived from the rate of returning ACKs, rather than simply halving it.
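Linux ships Westwood+ as the `westwood` congestion-control module, which can be selected via the `net.ipv4.tcp_congestion_control` sysctl. The Python sketch below is not that code. It only illustrates the idea under simplifying assumptions: one bandwidth sample per RTT, a 7/8 exponential low-pass filter, and ssthresh set from the estimated bandwidth-delay product after a loss. `WestwoodSketch` is a hypothetical name.

```python
class WestwoodSketch:
    """Westwood-style bandwidth estimation driving ssthresh after a loss."""

    def __init__(self):
        self.bw_est = 0.0            # filtered estimate, bytes/second
        self.rtt_min = float("inf")  # smallest RTT seen, seconds

    def on_rtt_sample(self, acked_bytes: float, interval_s: float,
                      rtt_s: float) -> None:
        """Feed one sample per RTT: bytes ACKed during interval_s."""
        sample = acked_bytes / interval_s
        if self.bw_est == 0.0:
            self.bw_est = sample  # first sample seeds the filter
        else:
            self.bw_est = (7 * self.bw_est + sample) / 8  # low-pass filter
        self.rtt_min = min(self.rtt_min, rtt_s)

    def ssthresh_after_loss(self, mss: int) -> int:
        """ssthresh = estimated bandwidth-delay product in segments
        (floor of 2), instead of blindly halving the window."""
        return max(2, int(self.bw_est * self.rtt_min / mss))
```

The intent is that a random wireless loss barely dents the ACK rate, so the window restarts near the path's measured capacity instead of being cut in half.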
But even better is a simple proxy setup. The proxy handles the request at the AP for the client, and retransmits can occur over the much faster wireless link.
It's mostly a cost issue, since only recent APs are powerful enough to run a local caching proxy.
"The bufferbloat "movement" infuriates me because it's light on science and heavy on publicity."
Of the articles I've read on it, they've been VERY heavy on science.
"merely shortening the buffers is definitely the wrong move"
Who is saying this? The issue that I have read about talks about the HUGE difference in performance of different links. If you have a 10Gb card and have a 1Mb link, the buffers are grossly different in size. To fix TCP, we can't look at packet-loss, we need to look at latency.
The problem is "how" you discover the base latency for a path without the routers telling you and how that algorithm interacts with many streams. There are many cases TCP needs to cover. There are decent algorithms that are better than the current TCP, but no one wants to mass-deploy those changes before they become standard.
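TCP Vegas is one long-standing example of this delay-based approach. It treats the minimum RTT it has seen as the path's base latency, then estimates how many of its own segments are sitting in queues. The simplified sketch below has no slow start and makes one update per RTT. The alpha = 2 and beta = 4 segment thresholds are illustrative values, and `VegasLike` is a hypothetical name.

```python
class VegasLike:
    """Delay-based window control in the style of TCP Vegas."""

    def __init__(self, cwnd: float = 10.0, alpha: float = 2.0, beta: float = 4.0):
        self.cwnd = cwnd                      # segments
        self.alpha, self.beta = alpha, beta   # target range of queued segments
        self.base_rtt = float("inf")          # best guess at the path's base latency

    def on_rtt(self, rtt: float) -> float:
        self.base_rtt = min(self.base_rtt, rtt)
        # (expected rate - actual rate) * base_rtt, i.e.
        # (cwnd/base_rtt - cwnd/rtt) * base_rtt = segments sitting in queues.
        queued = self.cwnd * (1 - self.base_rtt / rtt)
        if queued < self.alpha:
            self.cwnd += 1                    # path looks empty: speed up
        elif queued > self.beta:
            self.cwnd = max(2.0, self.cwnd - 1)  # queue building: slow down
        return self.cwnd
```

The `min()` on `base_rtt` is exactly the weak point raised above. If the path's true latency rises after a route change, or the earliest samples were already inflated by queueing, the base estimate is wrong and the window control goes wrong with it.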
"With cable you don't know the conditions of the wire so this trick is impossible."
You haven't worked with DOCSIS3.0+channel-bonding+CDMA. I get less latency and less jitter on my DOCSIS3 connection to my ISP than my mom's fiber connection. How your ISP implements your connection makes a HUGE difference.
Your main argument is great; it's actually the SAME argument these "bufferbloat" evangelists are preaching. So you are also part of the "bufferbloat movement" that you talked down in your first sentence.
Did you know that Gettys (X Window) was once VP of Software Engineering at OLPC? Me neither, but apparently he resigned in 2009 to go back to the W3C and now concentrates on bufferbloat and other matters.
Another fun google search is "NPR OLPC". Lots of hits - who would've thunk it.