Software Glitch Caused 911 Outage For 11 Million People
HughPickens.com writes: Brian Fung reports at the Washington Post that earlier this year emergency services went dark for over six hours for more than 11 million people across seven states. "The outage may have gone unnoticed by some, but for the more than 6,000 people trying to reach help, April 9 may well have been the scariest time of their lives." In a 40-page report (PDF), the FCC found that an entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not," the FCC's report reads. "The causes of this outage highlight vulnerabilities of networks as they transition from the long-familiar methods of reaching 911 to [Internet Protocol]-supported technologies."
On April 9, the software responsible for assigning the identifying code to each incoming 911 call maxed out at a pre-set limit; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure. Adm. David Simpson, the FCC's chief of public safety and homeland security, says having a single backup does not provide the kind of reliability that is ideal for 911. "Miami is kind of prone to hurricanes. Had a hurricane come at the same time [as the multi-state outage], we would not have had that failover, perhaps. So I think there needs to be more [distribution of 911 capabilities]."
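The summary gives no implementation detail, so the following is only a minimal sketch of the failure mode it describes: an ID allocator with a pre-set cap that simply refuses new calls once the cap is hit. The class name, the `warn_fraction` parameter, and the early-warning check are all assumptions for illustration, not the vendor's code.

```python
class CallIdAllocator:
    """Hypothetical allocator that hands out sequential call IDs up to a fixed cap."""

    def __init__(self, limit, warn_fraction=0.9):
        self.limit = limit
        # Point at which operators should be warned, well before the hard stop.
        self.warn_at = int(limit * warn_fraction)
        self.next_id = 0

    def allocate(self):
        # Once the cap is reached, every new call is rejected.
        if self.next_id >= self.limit:
            raise RuntimeError("call ID limit reached; new calls rejected")
        self.next_id += 1
        return self.next_id

    def near_limit(self):
        # True once the counter has crossed the warning threshold.
        return self.next_id >= self.warn_at


if __name__ == "__main__":
    allocator = CallIdAllocator(limit=40_000_000)
    print(allocator.allocate(), allocator.near_limit())
```

A check like `near_limit()` wired into an alarm is one way a hard cap could be noticed before it starts dropping calls.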
Have your local police and fire phone numbers in your cell phone and posted next to your landline.
40 million doesn't seem to cross any boundary?
You see, killbots have a preset kill limit. Knowing their weakness, I sent wave after wave of my own men at them until they reached their limit
Sounds like some dummy read a best-practices, conserve-the-resources book and made a column an integer data type instead of big integer, or whatever the corresponding names are for Oracle or non-MSSQL. Some auto process or identity column creates the keys and it reached the max amount. And it wasn't set up to use negative numbers either.
While you might find 911 service operable and efficient in the burbs, cash-strapped cities with large populations like Miami run out of operators before they run out of capacity. Dialing 911 in Cincinnati, for example, or any other major city in the Rust Belt, results in a pre-recorded message instructing you to stay on the line and wait for the next available operator. It's a fun joke to make on sitcoms, but when you're actually in danger it's not. Having been backed over on a motorcycle by a truck, I was at the mercy of this hold system for nearly 10 minutes in a busy downtown intersection.
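For scale, here is a rough sketch of how long a signed 32-bit versus a signed 64-bit key column lasts. The rate of 100,000 IDs per day is a made-up figure for illustration, and nothing in the story confirms the counter was a database column at all.

```python
# Largest values of signed 32-bit and 64-bit integers
# (typically what INT and BIGINT columns hold).
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def years_until_exhausted(max_value, ids_per_day):
    """Years before a counter starting at zero reaches max_value."""
    return max_value / ids_per_day / 365


if __name__ == "__main__":
    rate = 100_000  # hypothetical IDs issued per day
    print("32-bit:", years_until_exhausted(INT32_MAX, rate), "years")
    print("64-bit:", years_until_exhausted(INT64_MAX, rate), "years")
```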
Good people go to bed earlier.
That's just about the most ridiculous thing I've heard here in a while.
Thank you for calling Springfield RescuePhone!
If you know the name of the crime being committed, press one!
To choose from a list of felonies press two!
If you are being murdered or are calling from a rotary phone please stay on the line!
An entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not,"
So, let us be clear. The error was not simply preventable but absolutely and completely preventable in all cases. There was no impediment to prevent it. Its prevention was not only possible but also within the reach of any error prevention effort or action. It could have been prevented.
The preventability of the error was absolute. No situation, fictive or factual, in this or any other world, would allow a situation in which this error was not preventable.
Finally, it's important to note that the eventual series of events that would lead to the fully avoidable non-prevention of this error would be unfortunate.
the FCC found that an entirely preventable software error was responsible for causing 911 service to drop
As opposed to what? An entirely unpreventable software error? Sounds more like a configuration issue than a software error anyway.
It's the fault of the administrators to begin with. I am friends with one of the technical advisors for the Midwest EOC and the problem is that the administrators don't know their ass from a hole in the ground and ignore their tech guys and listen to the vendors.
He has been screaming for all call centers to have analog failover, but the administrators refuse to hear it.
So who is to blame for the failures? That top moron of Homeland Security. It would have been in place if he would realize that he is not an expert and actually LISTEN to the experts in the field.
Do not look at laser with remaining good eye.
If a private railroad owns rolling stock that would occupy, say 10 miles of track, but actually owns 2000 miles of track, it is no skin off your nose. Your taxes are not funding it. But if the government is running that railroad, we should restrict the total track length owned by the government to the actual track required by those rail cars and not an inch more.
That is how we reduce the size of the government, reduce deficits and reduce taxes. When will America see the logic here?
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Wait, what?
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Don't do critical things in hastily-written, poorly designed software. Instead, take sufficient time and make the design and implementation robust. Tried and tested methods exist for all of this. (Consider avionics, for example).
I am sure that there are many other solipsists out there.
If calls are lost then help is delayed. This impacts the outcome of incidents.
I'm not saying that people died because of this but I'm absolutely certain that there were some who suffered worse injury and losses because of the delays. Loss of 6,000 calls will result in a lot of hurt.
Like so many other issues, it wasn't a single fault but a chain of events. In this case there was a software failure but the fault monitoring systems and support services failed to immediately note that there were no calls going through the affected systems. A change from 1,000 calls per hour to zero should be pretty obvious.
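A minimal sketch of that kind of check, assuming the monitoring system can see per-interval call counts. The function name, the 10% threshold, and the sample numbers are invented for illustration.

```python
def volume_alarm(recent_counts, baseline_per_interval, min_fraction=0.1):
    """Return True if every recent interval is far below the normal call volume.

    recent_counts: call counts for the last few intervals, newest last.
    baseline_per_interval: the volume normally expected per interval.
    """
    if not recent_counts:
        return False  # no data yet, nothing to judge
    threshold = baseline_per_interval * min_fraction
    # Require all recent intervals to be low so one quiet interval doesn't page anyone.
    return all(count < threshold for count in recent_counts)


if __name__ == "__main__":
    print(volume_alarm([0, 0, 0], baseline_per_interval=1000))
    print(volume_alarm([950, 1020, 980], baseline_per_interval=1000))
```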
They didn't appear to have a credible mitigation process to handle this sort of failure like diverting calls to another location. This could have been automated or manually initiated by the NOC operators.
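The automated version of that diversion could be as simple as walking an ordered list of destinations and taking the first healthy one. This is only a sketch: the site names and the health map are hypothetical, and a real system would need actual health probes behind them.

```python
def pick_route(sites, healthy):
    """Return the first healthy site from an ordered preference list, or None.

    sites: destination names in order of preference.
    healthy: dict mapping site name to True/False; unknown sites count as down.
    """
    for site in sites:
        if healthy.get(site, False):
            return site
    return None


if __name__ == "__main__":
    sites = ["primary", "backup", "local-dispatch"]  # hypothetical names
    print(pick_route(sites, {"primary": False, "backup": True}))
```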
Shit happens in all systems, the important thing is how you deal with problems.
I used to work in the NOC for a large Telco and we'd handle 911 outages. Usually 911 goes down because the entire network's down. Like the switch failed, or the trunk from one area that leads to the area the 911 center is in would get cut. Most of this stuff is in a ring so there's usually an alternate route, but in some areas that's not physically possible. For example, a remote mountain town with a single road in would likely have its only trunk running along that same road, and it'd get cut all the time as the road constantly needed repair. Choose where you live wisely.
We'd handle this in different ways depending on the situation. For example, if we had 4 trunks that could handle 4X number of calls, and 3 got cut so it could only handle 1X, we could actually prioritize certain numbers so 911 and emergency services would get priority. If the trunk leading to the 911 center failed, we could do something like re-route the calls to the local police dispatcher, who literally had no warning and would suddenly have their phone ringing off the hook. You may say "you should warn them!" but our policy was "Get it done" because who's dying while you're arguing with the dispatcher about how her day's going to suck?
The most important skill you can have in any NOC is your ability to triage problems. That term comes from the medical world but it's just networking equipment... until you get into the situation I was in. And you're making triage decisions that could actually result in death. These were real engineers that really cared and did what they could. But when you have an area ravaged by a hurricane and you tell the tech to put gas in generator 1 instead of 2, because you've been up for 30 hrs straight... and a remote goes down so they can't call 911? I just couldn't detach myself from that. I took a pay cut to leave. A lot of people floated through that job, it wasn't just me. It takes a special kind of person that can detach themselves from the consequences of their decisions.
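The prioritization part can be sketched roughly as below: when capacity shrinks, emergency numbers are admitted first and everything else takes what is left. Real switches do this in their own configuration, not in Python; the function name and the sample numbers are made up.

```python
EMERGENCY_NUMBERS = {"911"}


def admit_calls(pending, capacity):
    """Pick which pending calls get a line when capacity is limited.

    pending: dialed numbers in arrival order.
    capacity: how many calls the surviving trunks can carry.
    """
    # sorted() is stable, so arrival order is kept within each group;
    # emergency calls sort first because their key is False.
    ordered = sorted(pending, key=lambda dialed: dialed not in EMERGENCY_NUMBERS)
    return ordered[:capacity]


if __name__ == "__main__":
    queue = ["555-0100", "911", "555-0101", "911"]
    print(admit_calls(queue, capacity=2))
```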
For 911 services in 7 states? Set aside issues about the backup system for the moment (which may be a second server in the same data center): Why do all the 911 calls have to funnel through a single system? Emergency services are largely local. Not many people have to make 911 calls across a large region. So why isn't the emergency call routing handled by local systems? And calls routed to local service centers?
Even if there was a common software glitch in all the handling systems, I doubt everyone would hit some call limit in a database field simultaneously.
Have gnu, will travel.
And the number "40,000,000" doesn't come up on my list of "potential overflows to watch out for". What's special about 40 million?
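A quick way to check that, as a sketch: compare 40,000,000 against the usual fixed-width integer limits and find the powers of two on either side of it. The helper names and the list of widths are my own choice.

```python
def bracketing_powers_of_two(n):
    """Return (lower, upper) powers of two with lower <= n < upper, for n >= 1."""
    lower_exp = n.bit_length() - 1
    return 2**lower_exp, 2**(lower_exp + 1)


def is_common_overflow_boundary(n):
    """True if n is the max of a signed or unsigned 8/16/24/32/64-bit integer."""
    boundaries = set()
    for bits in (8, 16, 24, 32, 64):
        boundaries.add(2**(bits - 1) - 1)  # signed maximum
        boundaries.add(2**bits - 1)        # unsigned maximum
    return n in boundaries


if __name__ == "__main__":
    print(bracketing_powers_of_two(40_000_000))     # (33554432, 67108864)
    print(is_common_overflow_boundary(40_000_000))  # False
```

40,000,000 sits between 2^25 and 2^26 and matches none of the usual type limits, which is consistent with a hand-picked cap and not a data-type overflow.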
Do you have ESP?
At least there are no current / recent worries what might make someone want to call 911 ...
Actually, it's an interesting question -- just what is the threshold? Suspected ebola vomit? Suspicious bag on suburban street? Seems like it would be a very easy system to game, or even to unintentionally render useless. Takes a lot of goodwill and good behavior, all around.
I took a CPR class last night, and the instructor (a firefighter in his day job) basically encouraged people to use 911 more, even for things about which the caller might be on the fence. "We'll sort it out. Don't call about your kid's math homework, but don't *not* call because you're not sure it's serious enough" was the gist.
I've called 911 quite a few times, or asked others to (hypothermia case, guy on railroad tracks, gun shots, more gunshots, drunk drivers, etc), now I wonder what the mean / mode is for that ;)
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
As near as I can tell from the TFA somebody just put an arbitrary limit of 40 million in the code somewhere. Would be nice to have more technical details.
And they were dying like flies for millennia too. We have arranged and are paying for these services to help us out when need be. Some of those things that happen (fire or stroke) can kill or maim if not dealt with fast. That said, it is indeed true that you need to accept that accidents also happen to such emergency centers. You can also expect that in this day and age such centers are handled in a way that assures redundancy. There are many things that failed, apparently. This is an occasion to improve. For some it is an occasion to curse and play the blame game (Putin would be a nice candidate to blame). For you this is just a statement of surprise at why others are so shocked.
Well, what the gp suggests is pretty much impossible, but the truth is, software *is* the weakest link. OSes and apps are never or rarely bug free, and each new patch often introduces new bugs. Some things, like OSes, have tens of millions of lines of code, and where just a missing or misplaced semicolon can wreak random havoc, that's practically a guaranteed problem waiting to happen.
I sometimes wonder if it's really a language thing. I'm not a developer though. High level coding (that I'm aware of anyway) is in English, yet there is a pervasive attitude these days, even here at /. among IT and science people, against "grammar nazis", with the frequent defense of "you know what I meant". Well, computers don't know. And recently I've seen a surprising number of typos in SuSE/SLES conf file comments and whatnot. It makes me wonder if developers are screwing up in the code too, syntactically.
Flexibility is proportional to complexity, and inversely proportional to reliability/stability. Dedicated hardware devices are the most stable, followed by firmware driven appliances. I've been in IT (repair, then administration) for about 20 years now, which is a fair amount of time, and I've seen far fewer hardware issues than I ever have with software.
Then there are the issues of security; Windows, for one example, has been around for two and a half decades now, and there are still numerous bug fixes and security patches released on a weekly basis. It is permanently flawed.
Look back up at my post, now look back down, you're on the Internet. Now look back up. I'm a signature.