A Possible Cause of AT&T's Wireless Clog — Configuration Errors
AT&T customers (iPhone users notably among them) have seen some wireless congestion in recent months; Brough Turner thinks the trouble might be self-inflicted. According to Turner, the poor throughput and connection errors can be chalked up to "configuration errors specifically, congestion collapse induced by misconfigured buffers in their mobile core network." His explanation makes an interesting read.
This is not really news at all. They spend little to nothing on keeping their network up to the demands of the devices they have on it. This misconfiguration of buffers (if that is really a cause at all) probably comes from not hiring people who know what they are doing, in order to keep costs low.
Anything can be found funny, from a certain point of view.
His explanation makes an interesting read.
I'd like to think that's a given, considering it's a news story. At any rate, from TFA:
The bottleneck link is the over-the-air link, i.e. the connection from radio access network or UTRAN to the Mobile Station (MS) in the above diagram, therefore the critical buffers are those at the UTRAN. In practice the UTRAN includes both the basestations (called Node-Bs) and the Radio Network Controllers (RNCs) which coordinate handovers between basestations (among other things). Because of handovers, the amount of data buffered at the Node-B is relatively small. It's the buffers at the RNC that must be large enough to deal with the delay variations in the radio network and yet small enough to induce packet loss when the network gets congested.
I am not a network engineer, but how exactly could an 8-second ping time go unnoticed by the AT&T engineers who set up, configured, and monitored their OTA link? I would think that we're not talking about some dude's set of bridged dd-wrt Linksys routers, but some serious heavy-duty RF equipment. I'm thinking on the order of several zeros...
You see, most blokes, you know, will be buffering at ten. You're on ten here, all the way up, all the way up, all the way up, you're on ten on your buffer. Where can you go from there? Where?
I don't know.
Nowhere. Exactly. What we do is, if we need that extra push over the cliff, you know what we do?
Put it up to eleven.
Eleven. Exactly. One more buffered.
Every time I deal with AT&T I am amazed that anything works at all over there. My phone almost always shows five bars at home, yet frequently calls don't cause the phone to ring - they go to voicemail after pretending to ring. The jaded amongst us could suspect a deliberate misconfiguration of phones and signal strength monitoring. Similarly, it would not surprise me if AT&T data networks were about as reliable as the signal strength indicator on my phone. The recent alleged blurb from an Apple "genius" in NYC that 1/3 of all iPhone calls get dropped seems to point in that direction.
That a cell-phone won't work everywhere and perfectly every time is a given. However, wouldn't it be nice if the companies that stood behind these networks would actually be held accountable for some of the advertising statements they make? What it comes down to is that we're dealing with an oligopolistic market, where only a few carriers can achieve the scale and the coverage to satisfy most mobile customers most of the time. On the flipside, that also means that said carriers can be truly dismal when it comes to customer service, back-end efficiency, etc. since consumers don't have many choices. Considering the ongoing consolidation in the industry, the only way out seems to be a trust-busting activity on the part of the DoJ to regulate the industry.
Not sure that is the better alternative... nor what the best structure for a regulated market would be.
I worked for AT&T in several parts of the country on their core networks. In the early 2000s they had misconfigured all of their Solaris boxes, and I worked with the infrastructure group to implement a startup script on Solaris to tune all the ndd settings for performance. The problem with Solaris is that by default all the TCP, UDP, Ethernet, etc. settings are set for a desktop workstation, not a server. Most system admins know to tune these settings; otherwise, in a lot of cases a multi-CPU box will perform as slowly as a single-CPU box. Anyway, at specific companies I worked with (AT&T Broadband / Worldnet in St. Charles, MO was one big one), all the servers were configured without the proper settings for a server, so we had all kinds of issues as a result. A big one is that the TCP accept queue is not set high enough, so connections to daemons will drop after a low number of connections, making it appear that the box can't handle the connections. As a result, they had spent millions on numerous servers (in one situation they had over twenty 12-CPU servers just for SMTP...
These changes seem small. However, changing "ndd" kernel parameters on a Solaris box is not a single task; it is an infrastructure-wide task that requires the coordination of dozens of different groups, so it took a long, long time to get this script implemented. It was called "S99nddfix" and it had all the ndd tunable parameters in it. Later, when I worked at a different AT&T group in a different state, I noticed my script had been implemented on all the Solaris servers in the 200+ server environment.
This is the problem. Thanks to the competitive barriers (such as the inability to move phones between all but two of the top four networks, and between none of the top three), switching networks can take a long time: the 2-year contract must expire first, unless you want to pay a large fee.
And then, you probably lose your phone. So even if you like it, you have to buy either a different phone from the new provider, or the same one in their version. Both will cost you even more money, unless you're willing to be stuck on another 2-year contract.
The US system is very well set up, as far as carrier lock-in goes.
It's rather amazing how many people go to AT&T for the iPhone. I think they said about 1/3 of their iPhone customers are coming from other networks. I wonder how many more people would get iPhones if it wasn't for their current contract? That's a big reason for many people I've talked to. The rest who want an iPhone are in the "I'd love it but I'm not touching AT&T again" camp.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Wouldn't be the first time, except maybe for AT&T.
I don't think that it's limited to just AT&T - I am in Australia, so have never even had to deal with them, but I am finding that in the vast majority of Australian companies as well, simple back-to-basics work quality is plummeting. Everything seems to be about making everything as cheap as possible - whether or not it even functions the way it is supposed to. That also goes for the majority of customer service dealings as well.
It seems that the "Do it once but do it properly" mentality is limited to very few people and businesses. I work as a business analyst, and I have to argue on every project to get extra money spent to do things properly (the majority of the time it saves money in the long run anyhow for other projects, and that is before taking the maintenance and support savings into account). Yet I seem to always have to fight the same battles over and over.
Moved to http://soylentnews.org/. You are invited to join us too!
Wow. This is kind of amazing.
Nothing on this page (as I type) talks about zero packet loss, except you. That means you read the article.
Of course, the article says that AT&T has set their buffers large enough to prevent packet loss due to congestion in transit, not that they expect no radio packet loss. The problem is that TCP/IP needs packet loss to tell it when it's going too fast and AT&T's decision causes this to fail spectacularly at times.
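To make that concrete, here is a toy model of the effect. It is only a sketch under loud assumptions: a single AIMD (additive-increase, multiplicative-decrease) sender, a bottleneck that drains a fixed number of packets per tick, and made-up numbers. The names `simulate`, `link_rate`, `base_rtt` and `buffer_packets` are invented for this example, and nothing here models AT&T's actual RNC settings.

```python
def simulate(buffer_packets, ticks=2000, link_rate=10.0, base_rtt=1.0):
    """Toy AIMD sender feeding one bottleneck queue.

    Returns (max queueing delay in ticks, total packets dropped).
    All parameters are illustrative assumptions, not real network values.
    """
    cwnd = 1.0        # congestion window, in packets
    queue = 0.0       # packets waiting in the bottleneck buffer
    dropped = 0.0
    max_delay = 0.0
    for _ in range(ticks):
        # The round trip grows with the queue the sender has built up.
        rtt = base_rtt + queue / link_rate
        # A window of cwnd packets per RTT is cwnd / rtt packets per tick.
        arrivals = cwnd / rtt
        queue = max(0.0, queue + arrivals - link_rate)
        if queue > buffer_packets:
            # Buffer overflow: packets are lost, so the sender backs off.
            dropped += queue - buffer_packets
            queue = float(buffer_packets)
            cwnd = max(1.0, cwnd / 2.0)
        else:
            # No loss signal: the sender keeps growing its window.
            cwnd += 1.0 / rtt
        max_delay = max(max_delay, queue / link_rate)
    return max_delay, dropped


if __name__ == "__main__":
    for size in (5, 100_000):
        delay, drops = simulate(buffer_packets=size)
        print(f"buffer={size:>7} packets  max queue delay={delay:.2f} ticks  drops={drops:.1f}")
```

With the small buffer the sender sees drops and backs off, so the queueing delay stays bounded by the buffer size. With the huge buffer nothing is ever dropped, the window never shrinks, and the delay just keeps climbing for as long as the transfer runs.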
The trolls read the articles. Weird.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
That's why to the greatest degree possible, libraries, programs, and algorithms should be auto-tuning. You can provide all the knobs you want, but people won't actually touch them. They'll choose which library, application, or operating system they use based on the default settings, so you'd better damn well make sure the default settings are good --- or even better, that you don't need settings at all.
quality, and fashion, one not having to be the other...
Indeed, and that is why many companies built atop the foundations of showy fashion are gone now. Fashion is transient and fickle. Apple however delivers a quality product that delivers new customers through loyalty and word of mouth. If this were not so Apple would not be a tenth of what it is now.
It doesn't hurt that it is fashionable, too. But that is not why I and so many other people buy Apple products.
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Blackberries are awesome about this with the bi-directional communication arrows. When I'm with friends in an area of low reception, they're all walking around randomly trying to call every two yards, and waiting 15 seconds before determining that it's not going to work. I walk around until I see an incoming arrow. I freeze and then make a call. Works wondrously.
So in this case, zero packet loss is a setup for disaster instead of a desirable quality.
The trouble is that it's not an intuitive solution to a problem, the introduction of occasional packet loss. It's usually something to avoid.
The determined Real Programmer can write Fortran programs in any language.
Zero packet loss may sound impressive to a telephone guy, but it causes TCP congestion collapse and thus doesn't work for the mobile Internet!
I was in the standardisation group that specified the RLC/MAC layer (ETSI SMG2, later called 3GPP TSG GERAN) and our priorities were not the behaviour of TCP. We were designing the radio layer to provide a bearer service for the higher layer protocols; at that time they were X.25 and IP (UDP and TCP). The "problem" we were trying to solve was the tendency of the radio layer to fade, have multipath and generally lose packets. The RLC layer was designed to deliver error-free packets, in sequence, over the radio layer. Generally that is exactly what it does, and does well. If it didn't, there would be no mobile internet.
What we did find to be a significant performance problem was the asymmetric channel. The uplink is usually the root of the TCP performance issues; UDP works much better. When the discrepancy is higher than 10, i.e. the downlink is more than ten times faster than the uplink, the TCP ACKs don't arrive in time and it stalls. Sadly a faster uplink is difficult and expensive to provide.
When did you last meet an unfunny penis?
Probably when he met the guy that modded that comment down.
I know somebody who works on network infrastructure for Telstra. I suggested to him that a lot of traffic which currently goes through wireless and wired LANs will soon run through the cellular networks. He was horrified at the idea. Apparently TCP/IP traffic from 3G cells has to go all the way back to the internet backbone, so anything resembling P2P still saturates the links between the base stations and the back end. That's a minor issue just now, but in addition the links to the 3G cells are only just keeping up with demand right now.
I pointed to the European environment where 3G data is much cheaper and more bandwidth is available. He says that we don't do that kind of investment here. So at the end of the day it's a money problem. Lots of profit being taken while they can get away with it.
http://michaelsmith.id.au
I walk around until I see an incoming arrow. I freeze and
people and cars crash into me
The arrows show data traffic as well as voice traffic. It is very nice to see a whole lot of up, down, or both arrows flashing when an app is sitting "unresponsive." You know data is flying so nothing is wrong, just wait and the app will respond when it has the data it needs. The arrows (at least on my 8330) are large for the faster network, and thin for the slow network so I even know when it will take longer because of poor network coverage. I used a Windows Mobile phone for a week and it drove me mad not knowing what was going on with the network data.
TCP measures round trip time, and doesn't need packet loss to tell it that the round trip time is long. The retransmit interval will go up appropriately. TCP will behave reasonably with a long round trip time. If you're trying to do a bulk transfer, there's nothing wrong with this. The problem comes when short messages and bulk transfers are sharing the same channel. The short messages can spend too much time in the queue.
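For anyone curious how the retransmit interval follows the round trip time, here is a rough sketch of the standard timer calculation from RFC 6298. The class name, the 0.1 s clock granularity and the sample values are made up for illustration; the smoothing constants (alpha = 1/8, beta = 1/4, K = 4), the 1-second initial value and the 1-second floor are the ones the RFC gives.

```python
class RtoEstimator:
    """Retransmission timeout estimator following the RFC 6298 formulas."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier

    def __init__(self, min_rto=1.0, granularity=0.1):
        self.min_rto = min_rto          # RFC 6298 recommends a 1 s floor
        self.granularity = granularity  # clock granularity G (assumed value)
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                  # initial RTO before any measurement

    def update(self, rtt):
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.min_rto,
                       self.srtt + max(self.granularity, self.K * self.rttvar))
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator()
    # A link that starts out quick and then settles at an 8 second round trip.
    for sample in [0.2] * 5 + [8.0] * 50:
        rto = est.update(sample)
    print(f"srtt={est.srtt:.2f}s rto={rto:.2f}s")
```

Once the samples settle at 8 seconds the timeout sits above 8 seconds, so a steady long round trip does not by itself cause spurious retransmits. The sketch says nothing about what happens while the delay is still ramping up.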
The solution is reordering the packets, not dropping them. That's what "fair queuing" is about. It may be worthwhile to implement fairness at the port-pair level, rather than the IP address level, at entry to the air link. Then low-traffic connections will get through faster.
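A minimal sketch of the idea, assuming the simplest possible variant: plain packet-by-packet round robin between flows, keyed by whatever tuple you choose (here a hypothetical address and port pair). Real fair queuing implementations also account for packet sizes; this one does not, and the class and flow names are invented for the example.

```python
from collections import OrderedDict, deque


class RoundRobinFairQueue:
    """Serve one packet per active flow in turn instead of strict FIFO."""

    def __init__(self):
        # flow key -> packets waiting for that flow, in service order
        self._flows = OrderedDict()

    def enqueue(self, flow, packet):
        self._flows.setdefault(flow, deque()).append(packet)

    def dequeue(self):
        """Return the next packet to transmit, or None if nothing is queued."""
        if not self._flows:
            return None
        # Take the flow at the head of the rotation...
        flow, packets = self._flows.popitem(last=False)
        packet = packets.popleft()
        # ...and put it at the back if it still has packets waiting.
        if packets:
            self._flows[flow] = packets
        return packet


if __name__ == "__main__":
    fq = RoundRobinFairQueue()
    bulk = ("10.0.0.2", 40000, "198.51.100.7", 80)   # hypothetical bulk download
    chat = ("10.0.0.2", 40001, "198.51.100.9", 80)   # hypothetical interactive flow
    for i in range(5):
        fq.enqueue(bulk, f"bulk-{i}")
    fq.enqueue(chat, "chat-0")
    order = []
    while (pkt := fq.dequeue()) is not None:
        order.append(pkt)
    print(order)
```

In a plain FIFO the single interactive packet would wait behind all five bulk packets. Here it goes out second, and nothing is dropped to achieve that.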
"Quality of service" can help, but it's not a panacea. The network layer can't tell which of the TCP connections on port 80 is highly interactive and which is a bulk download, other than by traffic volume.
(I used to do this stuff.)
Wouldn't be the first time, except maybe for AT&T.
I don't think that it's limited to just AT&T - I am in Australia, so have never even had to deal with them, but I am finding that in the vast majority of Australian companies as well, simple back-to-basics work quality is plummeting. Everything seems to be about making everything as cheap as possible - whether or not it even functions the way it is supposed to. That also goes for the majority of customer service dealings as well.
It seems that the "Do it once but do it properly" mentality is limited to very few people and businesses. I work as a business analyst, and I have to argue on every project to get extra money spent to do things properly (the majority of the time it saves money in the long run anyhow for other projects, and that is before taking the maintenance and support savings into account). Yet I seem to always have to fight the same battles over and over.
There's a simple reason for that: money is trivial to measure. Quality is much harder to measure. For example, failure rates like MTBFs often don't translate directly into dollars and cents, but into a small percentage chance that something might cost a large but unknown amount at some point in the future. This kind of thing confuses people, so they stick to the simple stuff. In an Excel spreadsheet, the solution that costs fewer up-front dollars is just "better" in the world view of most people.
I've had a conversation recently with the CIO of a major business who didn't quite understand why backups were worthwhile. He said something along the lines of "how does this help the business sell more widgets?".
I see the same thing, but often much worse, in big government or big bureaucracies. Project management is complex, so to simplify things, they just ignore the rest of the business or potential future requirements like they don't even exist. In the past, I've tried to point out that, say, with an additional 10% spend on one project they could halve the cost of a dozen future projects, but that's basically crazy talk to a project manager that has to minimize the cost of this project, right now. I've given up trying, and I bet a lot of other people have too.
Oh noes, I'm feeding the trolls again.
But whatever... it doesn't matter. Because at the end of the day, the techie nerds will continue to have no respect for management... and then they'll wonder why they're treated with no respect in return.
So you think the techies that have taken the time to explain all the reasons *why* something needs to be done are stupid.
But you can sit back and say 'lose 8% from your budget - go do it'. No reasons, no explanations, just a demand. (Brillant!).
I'm guessing you're also the same arsehole that screams at the 'stupid' techies for not being able to restore that sales contract from two months ago that you accidentally deleted - forgetting about that replacement for the broken tape drive you refused to pay for last quarter.
As a manager you have got to be the conduit between the workers and the directors. Here's a tip, how about try talking to your techies. No seriously, talk to them. Show them your budget, show them your overheads. Ask them to provide assistance in setting the priorities instead of telling them to get stuffed.
You may end up *earning* some respect from the people who are actually keeping your company running and who don't play musical employers when things start getting too hard.
As I recall, the story went: Mandelbrot was a mathematician at an IBM lab. The engineers were attempting high speed data networking, but were encountering data/signal loss due to some noise. So like good engineers, they made things more robust, with better isolation, grounds, shielding, etc., but the darn noise was still there. They could not get rid of it. Determined to find the cause, they went to Mandelbrot with the request to analyze the noise, to determine its cause, in order to eliminate it.
Mandelbrot examined the data and found that there were periods of clear signal interrupted by noise. He examined the noise and found that within it were periods of clear signal, interrupted by noise, and so on. Hmmm... He astutely determined that "shit happens" and what was needed was a redundant protocol, not better shielding. The noise, you see, was inherent in a damped and driven system.
It was from this that he began his explorations of fractals and chaos theory, and we got robust network protocols.