Software Glitch Caused 911 Outage For 11 Million People
HughPickens.com writes: Brian Fung reports at the Washington Post that earlier this year emergency services went dark for over six hours for more than 11 million people across seven states. "The outage may have gone unnoticed by some, but for the more than 6,000 people trying to reach help, April 9 may well have been the scariest time of their lives." In a 40-page report (PDF), the FCC found that an entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not," the FCC's report reads. "The causes of this outage highlight vulnerabilities of networks as they transition from the long-familiar methods of reaching 911 to [Internet Protocol]-supported technologies."
On April 9, the software responsible for assigning the identifying code to each incoming 911 call maxed out at a pre-set limit; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure. Adm. David Simpson, the FCC's chief of public safety and homeland security, says having a single backup does not provide the kind of reliability that is ideal for 911. "Miami is kind of prone to hurricanes. Had a hurricane come at the same time [as the multi-state outage], we would not have had that failover, perhaps. So I think there needs to be more [distribution of 911 capabilities]."
Have your local police and fire phone numbers in your cell phone and posted next to your landline.
40 million doesn't seem to cross any boundary?
You see, killbots have a preset kill limit. Knowing their weakness, I sent wave after wave of my own men at them until they reached their limit
Then we would've just rolled to -39,999,999 calls. :D
Did the 911 operators just think it was a slow day for emergencies?
Sounds like some dummy read a best-practices, conserve-the-resources book and made a column an integer data type instead of big integer, or whatever the corresponding names are for Oracle or non-MSSQL. Some auto process or identity column creates the keys, and it reached the max amount. And it wasn't set up to use negative numbers either.
While you might find 911 service operable and efficient in the burbs, cash-strapped cities with large populations like Miami run out of operators before they run out of capacity. Dialing 911 in Cincinnati, for example, or any other major city in the Rust Belt, results in a pre-recorded message instructing you to stay on the line and wait for the next available operator. It's a fun joke to make on sitcoms, but when you're actually in danger it's not. Having been backed over on a motorcycle by a truck, I was at the mercy of this hold system for nearly 10 minutes in a busy downtown intersection.
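To put numbers on the integer-column theory, here is a rough Python sketch of how a signed 32-bit ID behaves. The function name `next_id_int32` is made up for illustration, and nothing here is taken from the actual 911 routing software. It only shows that 40 million is nowhere near the point where a 32-bit value wraps.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit value


def next_id_int32(current: int) -> int:
    """Increment an ID the way a signed 32-bit field would, wrapping on overflow.

    Hypothetical helper for illustration only.
    """
    return ctypes.c_int32(current + 1).value


print(next_id_int32(40_000_000))  # 40000001: still far inside the int32 range
print(next_id_int32(INT32_MAX))   # -2147483648: this is where a real wraparound happens
```

If the column really were a plain 32-bit integer, it would have had more than two billion IDs left at the time of the outage, which suggests the 40 million figure was a configured limit rather than a type overflow.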
Good people go to bed earlier.
That's just about the most ridiculous thing I've heard here in a while.
Thank you for calling Springfield RescuePhone!
If you know the name of the crime being committed, press one!
To choose from a list of felonies press two!
If you are being murdered or are calling from a rotary phone please stay on the line!
An entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not,"
So, let us be clear. The error was not simply preventable but absolutely and completely preventable in all cases. There was no impediment to prevent it. Its prevention was not only possible but also within the reach of any error prevention effort or action. It could have been prevented.
The preventability of the error was absolute. No situation, fictive or factual, in this or other world, would allow a situation in which this error was not preventable.
Finally, it's important to note that the eventual series of events that would lead to the fully avoidable non-prevention of this error, would be unfortunate.
the FCC found that an entirely preventable software error was responsible for causing 911 service to drop
As opposed to what? An entirely unpreventable software error? Sounds more like a configuration issue than a software error anyway.
always works for my Windows computers...
It's the fault of the administrators to begin with. I am friends with one of the technical advisors for the Midwest EOC, and the problem is that the administrators don't know their ass from a hole in the ground and ignore their tech guys and listen to the vendors.
He has been screaming for all call centers to have analog failover, but the administrators refuse to hear it.
So who is to blame for the failures? That top moron of Homeland Security. It would have been in place if he would realize that he is not an expert and actually LISTEN to the experts in the field.
Do not look at laser with remaining good eye.
If a private railroad owns rolling stock that would occupy, say 10 miles of track, but actually owns 2000 miles of track, it is no skin off your nose. Your taxes are not funding it. But if the government is running that railroad, we should restrict the total track length owned by the government to the actual track required by those rail cars and not an inch more.
That is how we reduce the size of the government, reduce deficits and reduce taxes. When will America see the logic here?
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Don't do critical things in hastily-written, poorly designed software. Instead, take sufficient time and make the design and implementation robust. Tried and tested methods exist for all of this. (Consider avionics, for example).
I am sure that there are many other solipsists out there.
If calls are lost then help is delayed. This impacts the outcome of incidents.
I'm not saying that people died because of this but I'm absolutely certain that there were some who suffered worse injury and losses because of the delays. Loss of 6,000 calls will result in a lot of hurt.
Like so many other issues, it wasn't a single fault but a chain of events. In this case there was a software failure but the fault monitoring systems and support services failed to immediately note that there were no calls going through the affected systems. A change from 1,000 calls per hour to zero should be pretty obvious.
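A check for that kind of drop does not need to be fancy. Below is a minimal sketch, assuming call counts are sampled at a fixed interval; the class name `CallVolumeMonitor` and the thresholds are invented for illustration and are not from any real 911 monitoring product.

```python
from collections import deque


class CallVolumeMonitor:
    """Flag an interval whose call count collapses relative to the recent baseline.

    Hypothetical sketch: window size and thresholds are illustrative guesses.
    """

    def __init__(self, window: int = 6, min_baseline: float = 100.0,
                 drop_ratio: float = 0.1) -> None:
        self.history: deque = deque(maxlen=window)
        self.min_baseline = min_baseline  # ignore quiet periods with too few calls
        self.drop_ratio = drop_ratio      # alarm below this fraction of the baseline

    def observe(self, calls_this_interval: int) -> bool:
        """Record one interval's count; return True if it looks like an outage."""
        alarm = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if (baseline >= self.min_baseline
                    and calls_this_interval < baseline * self.drop_ratio):
                alarm = True
        if not alarm:
            # Only healthy intervals feed the baseline, so the alarm persists
            # for as long as the traffic stays collapsed.
            self.history.append(calls_this_interval)
        return alarm


monitor = CallVolumeMonitor()
for count in [1000, 980, 1020, 1010, 990, 1000, 0]:
    if monitor.observe(count):
        print("ALERT: call volume dropped to", count)
```

Keeping the collapsed intervals out of the baseline is a deliberate choice here: otherwise a long outage would gradually teach the monitor that zero calls is normal.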
They didn't appear to have a credible mitigation process to handle this sort of failure like diverting calls to another location. This could have been automated or manually initiated by the NOC operators.
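The automated version of that diversion can be as simple as walking an ordered list of routing targets. This is only a sketch under assumed names (`RoutingCenter`, `pick_center`, and the center labels are all hypothetical); a real system would base `accepting_calls` on live health checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoutingCenter:
    """A place calls can be sent; `accepting_calls` stands in for a health check."""
    name: str
    accepting_calls: bool = True


def pick_center(centers: List[RoutingCenter]) -> Optional[RoutingCenter]:
    """Return the first center in priority order that is accepting calls.

    Returns None when nothing is healthy, at which point the NOC would have
    to intervene manually.
    """
    for center in centers:
        if center.accepting_calls:
            return center
    return None


centers = [
    RoutingCenter("primary", accepting_calls=False),  # e.g. stuck at its call limit
    RoutingCenter("backup"),
    RoutingCenter("local-dispatch"),
]
chosen = pick_center(centers)
print(chosen.name if chosen else "no healthy center, escalate to NOC")
```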
Shit happens in all systems, the important thing is how you deal with problems.
I used to work in the NOC for a large Telco and we'd handle 911 outages. Usually 911 goes down because the entire network's down. Like the switch failed, or the trunk from one area that leads to the area the 911 center is in would get cut. Most of this stuff is in a ring so there's usually an alternate route, but in some areas that's not physically possible. For example, a remote mountain town with a single road in would likely have its only trunk running along that same road, and it'd get cut all the time as the road constantly needed repair. Choose where you live wisely.
We'd handle this in different ways depending on the situation. For example, if we had 4 trunks that could handle 4X number of calls, and 3 got cut so it could only handle 1X, we could actually prioritize certain numbers so 911 and emergency services would get priority. If the trunk leading to the 911 center failed, we could do something like re-route the calls to the local police dispatcher, who literally had no warning and would suddenly have their phone ringing off the hook. You may say "you should warn them!" but our policy was "Get it done," because who's dying while you're arguing with the dispatcher about how her day's going to suck?
The most important skill you can have in any NOC is your ability to triage problems. That term comes from the medical world, but it's just networking equipment... until you get into the situation I was in. And you're making triage decisions that could actually result in death. These were real engineers that really cared and did what they could. But when you have an area ravaged by a hurricane and you tell the tech to put gas in generator 1 instead of 2, because you've been up for 30 hrs straight... and a remote goes down so they can't call 911? I just couldn't detach myself from that. I took a pay cut to leave. A lot of people floated through that job; it wasn't just me. It takes a special kind of person that can detach themselves from the consequences of their decisions.
For 911 services in 7 states? Set aside issues about the backup system for the moment (which may be a second server in the same data center): Why do all the 911 calls have to funnel through a single system? Emergency services are largely local. Not many people have to make 911 calls across a large region. So why isn't the emergency call routing handled by local systems? And calls routed to local service centers?
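The prioritization described above (4 trunks cut down to 1, emergency numbers go first) can be sketched in a few lines. The names `EMERGENCY_NUMBERS` and `admit_calls` are made up for this example, and a real switch does this in its own routing tables, not in Python.

```python
from typing import List

# Assumption for illustration: only "911" is treated as an emergency number.
EMERGENCY_NUMBERS = {"911"}


def admit_calls(pending: List[str], capacity: int) -> List[str]:
    """Pick which pending calls get a line when capacity is reduced.

    `pending` holds dialed numbers in arrival order. Emergency calls are moved
    to the front; sorted() is stable, so arrival order is kept within each group.
    """
    ordered = sorted(pending, key=lambda dialed: dialed not in EMERGENCY_NUMBERS)
    return ordered[:capacity]


pending = ["555-0100", "911", "555-0101", "911", "555-0102"]
print(admit_calls(pending, capacity=4))  # all trunks up: everyone but the last caller
print(admit_calls(pending, capacity=2))  # 3 of 4 trunks cut: only the 911 calls fit
```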
Even if there was a common software glitch in all the handling systems, I doubt everyone would hit some call limit in a database field simultaneously.
Have gnu, will travel.
The police are under no obligation to respond to 911 calls. Not accepting them due to a software problem just speeds up the process.
And the number "40,000,000" doesn't come up on my list of "potential overflows to watch out for". What's special about 40 million?
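A quick way to check that hunch is to compare 40 million against the usual fixed-width integer limits. This is just arithmetic in Python; the `LIMITS` table and `nearest_limit` helper are my own illustration, not anything from the FCC report.

```python
from typing import Tuple

# Common fixed-width integer maxima that show up as accidental ceilings.
LIMITS = {
    "uint16": 2**16 - 1,
    "int24": 2**23 - 1,
    "uint24": 2**24 - 1,
    "int32": 2**31 - 1,
    "uint32": 2**32 - 1,
}


def nearest_limit(n: int) -> Tuple[str, int]:
    """Return the (name, value) of the limit numerically closest to n."""
    return min(LIMITS.items(), key=lambda item: abs(item[1] - n))


name, value = nearest_limit(40_000_000)
print(f"closest limit to 40,000,000 is {name} = {value:,}")
print("exact match with any limit:", 40_000_000 in LIMITS.values())
```

The closest candidate is the unsigned 24-bit maximum (16,777,215), which is still less than half of 40 million, so a round, human-chosen threshold looks more plausible than a type boundary.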
Do you have ESP?
All you have to do is dial 9... oh
...and prepare your life around the principle that nobody else is responsible for how you respond to things but you. It may still be troublesome if/when 911 goes down, but far less so when you realize that people lived for millennia without the telephone. And I say that as a heart patient who's been grateful for 911 in the past.
At least there are no current / recent worries what might make someone want to call 911 ...
Actually, it's an interesting question -- just what is the threshold? Suspected ebola vomit? Suspicious bag on suburban street? Seems like it would be a very easy system to game, or even to unintentionally render useless. Takes a lot of goodwill and good behavior, all around.
I took a CPR class last night, and the instructor (a firefighter in his dayjob) basically encouraged people to use 911 more, even for things about which the caller might be on the fence. "We'll sort it out. Don't call about your kid's math homework, but don't *not* call because you're not sure it's serious enough" was the gist.
I've called 911 quite a few times, or asked others to (hypothermia case, guy on railroad tracks, gun shots, more gunshots, drunk drivers, etc), now I wonder what the mean / mode is for that ;)
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
It's a software error, stupid. It may be due to carelessness, oversight, lack of testing, or other reason. It may be trivial or catastrophic, but it is still an error. The implication is that it is harmless. If it's being reported, it certainly is not harmless.
As near as I can tell from TFA, somebody just put an arbitrary limit of 40 million in the code somewhere. Would be nice to have more technical details.
Maybe a sequence value tied to a 32-bit integer was shared with other tables, resulting in 40 million not being the important number, just what was visible to the end user. Perhaps it was a sequence to a related but unknown table, say an auditing table, that overflowed; there may have been a design intent, never realized in engineering, to have that table cycled every so often.
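Without those details, the best anyone here can do is guess at the shape of the bug. The sketch below is purely hypothetical (`CallIdAllocator`, `CounterExhausted`, and the 90% warning fraction are all invented): a counter with a hard ceiling that refuses new IDs once it is reached, plus the early-warning check that would have given operators time to react.

```python
class CounterExhausted(Exception):
    """Raised when the allocator has handed out every ID up to its limit."""


class CallIdAllocator:
    """Hypothetical call-ID counter with a preset ceiling and a warning threshold."""

    def __init__(self, limit: int = 40_000_000, warn_fraction: float = 0.9) -> None:
        self.limit = limit
        self.warn_fraction = warn_fraction
        self.next_id = 1

    def near_limit(self) -> bool:
        """True once the number of issued IDs reaches the warning threshold."""
        issued = self.next_id - 1
        return issued >= self.limit * self.warn_fraction

    def allocate(self) -> int:
        """Hand out the next ID, or fail loudly instead of silently stalling."""
        if self.next_id > self.limit:
            raise CounterExhausted(f"call ID limit of {self.limit} reached")
        call_id = self.next_id
        self.next_id += 1
        return call_id


allocator = CallIdAllocator(limit=3)  # tiny limit so the failure is easy to see
for _ in range(4):
    try:
        print("assigned call ID", allocator.allocate())
    except CounterExhausted as exc:
        print("ROUTING STOPPED:", exc)
```

The point of `near_limit` is that a ceiling like this is only dangerous when nobody is told it is approaching.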
Well, what the gp suggests is pretty much impossible, but the truth is, software *is* the weakest link. OSes and apps are never or rarely bug free, and each new patch often introduces new bugs. Some things, like OSes, have tens of millions of lines of code, and where just a missing or misplaced semicolon can wreak random havoc, that's practically a guaranteed problem waiting to happen.
Flexibility is proportional to complexity, and inversely proportional to reliability/stability. Dedicated hardware devices are the most stable, followed by firmware driven appliances. I've been in IT (repair, then administration) for about 20 years now, which is a fair amount of time, and I've seen far fewer hardware issues than I ever have with software.
Then there are the issues of security; Windows, for one example, has been around for two and a half decades now, and there are still numerous bug fixes and security patches released on a weekly basis. It is permanently flawed.
I sometimes wonder if it's really a language thing. I'm not a developer though. High level coding (that I'm aware of anyway) is in English, yet there is a pervasive attitude these days, even here at /. among IT and science people, against "grammar nazis", with the frequent defense of "you know what I meant". Well, computers don't know. And recently I've seen a surprising number of typos in SuSE/SLES conf file comments and whatnot.. it makes me wonder if developers are screwing up in the code too, syntactically.
Look back up at my post, now look back down, you're on the Internet. Now look back up. I'm a signature.
I once joined an IBM business partner, where one of my first calls was to a police department in the Detroit suburbs. They wanted me to configure a printer on their AIX based high-availability 911 system. It all seemed odd. When I met the police chief he was very slow to shake my hand, and really left me hanging. He told the office people to watch everything I did - and not the "watch this expensive expert and learn!" sort of watching.
It wasn't until years later that I became aware of the background check practice and realized they didn't actually need the printer configured. It was just a ruse to send me to do work for a police department, who would give me a thorough background check, courtesy of taxpayers.
I did not drink the IBM Kool-Aid, so the partner gig did not work out.
lmbo