Software Glitch Caused 911 Outage For 11 Million People
HughPickens.com writes: Brian Fung reports at the Washington Post that earlier this year emergency services went dark for over six hours for more than 11 million people across seven states. "The outage may have gone unnoticed by some, but for the more than 6,000 people trying to reach help, April 9 may well have been the scariest time of their lives." In a 40-page report (PDF), the FCC found that an entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not," the FCC's report reads. "The causes of this outage highlight vulnerabilities of networks as they transition from the long-familiar methods of reaching 911 to [Internet Protocol]-supported technologies."
On April 9, the software responsible for assigning the identifying code to each incoming 911 call maxed out at a pre-set limit; the counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure. Adm. David Simpson, the FCC's chief of public safety and homeland security, says having a single backup does not provide the kind of reliability that is ideal for 911. "Miami is kind of prone to hurricanes. Had a hurricane come at the same time [as the multi-state outage], we would not have had that failover, perhaps. So I think there needs to be more [distribution of 911 capabilities]."
have your local police and fire phone numbers in your cell phone and posted next to your land line.
40 million doesn't seem to cross any boundary?
You see, killbots have a preset kill limit. Knowing their weakness, I sent wave after wave of my own men at them until they reached their limit
Then we would've just rolled to -39,999,999 calls. :D
drop your socks and grab your cocks, Slashdot, it's time for some frosty posts with your hot grits and petrified Natalie Portman
This is how sad and pathetic society has become. That big daddy government, a corrupt Federal Tammany Hall, will save you.
Start saving yourselves. Start helping others in need.
When seconds count, the police are minutes away.
When you call 911, there is a good chance you'll be on the phone for the rest of your life.
YOU are responsible for your well being and those who need help around you if you can give aid.
WE The People have become WE THE SHEEPLE, government is not the answer.
Remind yourselves that public "servants" take their pay and pension from those who are elected by giving those same people concessions in terms of PAY and PENSION. They pay for all this by debasing the currency and monetizing the debt.
You have been hoodwinked, bamboozled. Sheeple.
Tsarkon Reports
Don't do critical things in software.
Did the 911 operators just think it was a slow day for emergencies?
Sounds like some dummy read a best-practices, conserve-the-resources book and made a column an integer data type instead of big integer, or whatever the corresponding names are for Oracle or non-MSSQL. Some auto process or identity column creates the keys and it reached the max amount, and it wasn't set up to use negative numbers either.
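For scale, here is a minimal sketch comparing the reported 40 million figure with the largest values a signed 32-bit and a signed 64-bit integer can hold. The constant and function names are made up for illustration; nothing here comes from the FCC report. The reported limit sits far below both maximums, which would fit a configured cap better than a plain integer overflow, though the summary does not say which it was.

```python
# Hypothetical illustration: signed integer ranges for common column widths.
INT32_MAX = 2**31 - 1  # largest signed 32-bit value
INT64_MAX = 2**63 - 1  # largest signed 64-bit value

# Figure quoted in the story summary.
REPORTED_LIMIT = 40_000_000


def headroom(limit: int, type_max: int) -> int:
    """Return how many more IDs the type could hold beyond `limit`."""
    return type_max - limit


if __name__ == "__main__":
    print("32-bit max:", INT32_MAX)
    print("64-bit max:", INT64_MAX)
    print("IDs left in a 32-bit column after the limit:",
          headroom(REPORTED_LIMIT, INT32_MAX))
```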
While you might find 911 service operable and efficient in the burbs, cash-strapped cities with large populations like Miami run out of operators before they run out of capacity. Dialing 911 in Cincinnati, for example, or any other major city in the Rust Belt, results in a pre-recorded message instructing you to stay on the line and wait for the next available operator. It's a fun joke to make on sitcoms, but when you're actually in danger it's not. Having been backed over on a motorcycle by a truck, I was at the mercy of this hold system for nearly 10 minutes in a busy downtown intersection.
Good people go to bed earlier.
Thank you for calling Springfield RescuePhone!
If you know the name of the crime being committed, press one!
To choose from a list of felonies press two!
If you are being murdered or are calling from a rotary phone please stay on the line!
An entirely preventable software error was responsible for causing 911 service to drop. "It could have been prevented. But it was not,"
So, let us be clear. The error was not simply preventable but absolutely and completely preventable in all cases. There was no impediment to prevent it. Its prevention was not only possible but also within the reach of any error prevention effort or action. It could have been prevented.
The preventability of the error was absolute. No situation, fictive or factual, in this or any other world, would allow a situation in which this error was not preventable.
Finally, it's important to note that the eventual series of events that would lead to the fully avoidable non-prevention of this error, would be unfortunate.
the FCC found that an entirely preventable software error was responsible for causing 911 service to drop
As opposed to what? An entirely unpreventable software error? Sounds more like a configuration issue than a software error anyway.
always works for my Windows computers...
It's the fault of the administrators to begin with. I am friends with one of the technical advisors for the Midwest EOC and the problem is that the administrators don't know their ass from a hole in the ground and ignore their tech guys and listen to the vendors.
He has been screaming for all call centers to have analog failover, but the administrators refuse to hear it.
So who is to blame for the failures? That top moron of Homeland Security. It would have been in place if he would realize that he is not an expert and actually LISTEN to the experts in the field.
Do not look at laser with remaining good eye.
If a private railroad owns rolling stock that would occupy, say 10 miles of track, but actually owns 2000 miles of track, it is no skin off your nose. Your taxes are not funding it. But if the government is running that railroad, we should restrict the total track length owned by the government to the actual track required by those rail cars and not an inch more.
That is how we reduce the size of the government, reduce deficits and reduce taxes. When will America see the logic here?
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
If calls are lost then help is delayed. This impacts the outcome of incidents.
I'm not saying that people died because of this but I'm absolutely certain that there were some who suffered worse injury and losses because of the delays. Loss of 6,000 calls will result in a lot of hurt.
Like so many other issues, it wasn't a single fault but a chain of events. In this case there was a software failure but the fault monitoring systems and support services failed to immediately note that there were no calls going through the affected systems. A change from 1,000 calls per hour to zero should be pretty obvious.
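A check like the one described above could be sketched as follows. This is only an illustration: the function name, the 20% threshold, and the idea of comparing against an hourly baseline are assumptions, not details of the actual monitoring system.

```python
def call_volume_alarm(baseline_per_hour: float,
                      observed_per_hour: float,
                      min_fraction: float = 0.2) -> bool:
    """Return True when observed call volume is suspiciously far below baseline.

    `min_fraction` is a hypothetical tuning knob: the alarm fires when the
    observed rate falls under that fraction of the expected rate.
    """
    if baseline_per_hour <= 0:
        # Without a meaningful baseline there is nothing to compare against.
        return False
    return observed_per_hour < baseline_per_hour * min_fraction


if __name__ == "__main__":
    # A drop from about 1,000 calls per hour to zero should trip the alarm.
    print(call_volume_alarm(1000, 0))
    # Normal fluctuation should not.
    print(call_volume_alarm(1000, 950))
```

A real system would also need to account for time of day and quiet periods, which this sketch ignores.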
They didn't appear to have a credible mitigation process to handle this sort of failure like diverting calls to another location. This could have been automated or manually initiated by the NOC operators.
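The automated diversion mentioned above might look roughly like this. It is a sketch under stated assumptions: `RouteUnavailable`, `route_with_failover`, and the two example routes are hypothetical names, and a real call router is far more involved.

```python
from typing import Callable, Sequence


class RouteUnavailable(Exception):
    """Raised by a route that cannot accept a call (hypothetical)."""


def route_with_failover(call_id: str,
                        routes: Sequence[Callable[[str], str]]) -> str:
    """Try each route in order and return the first destination that accepts."""
    for route in routes:
        try:
            return route(call_id)
        except RouteUnavailable:
            # This route refused the call; fall through to the next one.
            continue
    raise RouteUnavailable("no route accepted call " + call_id)


def primary_center(call_id: str) -> str:
    # Simulates a routing system that has stopped accepting new calls.
    raise RouteUnavailable("primary is not accepting calls")


def backup_center(call_id: str) -> str:
    return "backup handled " + call_id


if __name__ == "__main__":
    print(route_with_failover("call-1", [primary_center, backup_center]))
```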
Shit happens in all systems, the important thing is how you deal with problems.
I used to work in the NOC for a large telco and we'd handle 911 outages. Usually 911 goes down because the entire network's down. Like the switch failed, or the trunk from one area that leads to the area the 911 center is in would get cut. Most of this stuff is in a ring so there's usually an alternate route, but in some areas that's not physically possible. For example, a remote mountain town with a single road in would likely have its only trunk running along that same road, and it'd get cut all the time as the road constantly needed repair. Choose where you live wisely.
We'd handle this in different ways depending on the situation. For example, if we had 4 trunks that could handle 4X number of calls, and 3 got cut so it could only handle 1X, we could actually prioritize certain numbers so 911 and emergency services would get priority. If the trunk leading to the 911 center failed, we could do something like re-route the calls to the local police dispatcher, who literally had no warning and would suddenly have their phone ringing off the hook. You may say "you should warn them!" but our policy was "Get it done," because who's dying while you're arguing with the dispatcher about how her day's going to suck?
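The prioritization described above can be sketched as picking which calls to carry when capacity shrinks. The function name, the call record layout, and the set of priority numbers are all assumptions for illustration; real switches do this in signaling, not in a Python list.

```python
def admit_calls(calls, capacity, priority_numbers=frozenset({"911"})):
    """Choose which calls to carry when only `capacity` calls fit.

    Calls to a priority number go first; within each group, arrival order
    is preserved because Python's sort is stable.
    """
    ordered = sorted(calls, key=lambda c: c["dialed"] not in priority_numbers)
    return ordered[:capacity]


if __name__ == "__main__":
    # Hypothetical calls in arrival order.
    waiting = [
        {"id": 1, "dialed": "555-0100"},
        {"id": 2, "dialed": "911"},
        {"id": 3, "dialed": "555-0101"},
        {"id": 4, "dialed": "911"},
    ]
    # With capacity for only two calls, both 911 calls are carried.
    print([c["id"] for c in admit_calls(waiting, 2)])
```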
The most important skill you can have in any NOC is your ability to triage problems. That term comes from the medical world but it's just networking equipment... until you get into the situation I was in. And you're making triage decisions that could actually result in death. These were real engineers that really cared and did what they could. But when you have an area ravaged by a hurricane and you tell the tech to put gas in generator 1 instead of 2, because you've been up for 30 hours straight... and a remote goes down so they can't call 911? I just couldn't detach myself from that. I took a pay cut to leave. A lot of people floated through that job, it wasn't just me. It takes a special kind of person that can detach themselves from the consequences of their decisions.
For 911 services in 7 states? Set aside issues about the backup system for the moment (which may be a second server in the same data center): Why do all the 911 calls have to funnel through a single system? Emergency services are largely local. Not many people have to make 911 calls across a large region. So why isn't the emergency call routing handled by local systems? And calls routed to local service centers?
Even if there was a common software glitch in all the handling systems, I doubt everyone would hit some call limit in a database field simultaneously.
Have gnu, will travel.
The police are under no obligation to respond to 911 calls. Not accepting them due to a software problem just speeds up the process.
And the number "40,000,000" doesn't come up on my list of "potential overflows to watch out for". What's special about 40 million?
Do you have ESP?
All you have to do is dial 9... oh
...and prepare your life around the principle that nobody else is responsible for how you respond to things but you. It may still be troublesome if/when 911 goes down, but far less so when you realize that people lived for millennia without the telephone. And I say that as a heart patient who's been grateful for 911 in the past.
At least there are no current / recent worries what might make someone want to call 911 ...
Actually, it's an interesting question -- just what is the threshold? Suspected Ebola vomit? Suspicious bag on suburban street? Seems like it would be a very easy system to game, or even to unintentionally render useless. Takes a lot of goodwill and good behavior, all around.
I took a CPR class last night, and the instructor (a firefighter in his day job) basically encouraged people to use 911 more, even for things about which the caller might be on the fence. "We'll sort it out. Don't call about your kid's math homework, but don't *not* call because you're not sure it's serious enough" was the gist.
I've called 911 quite a few times, or asked others to (hypothermia case, guy on railroad tracks, gunshots, more gunshots, drunk drivers, etc.); now I wonder what the mean / mode is for that ;)
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
It's a software error, stupid. It may be due to carelessness, oversight, lack of testing, or some other reason. It may be trivial or catastrophic, but it is still an error. The implication is that it is harmless. If it's being reported, it certainly is not harmless.
As near as I can tell from the TFA somebody just put an arbitrary limit of 40 million in the code somewhere. Would be nice to have more technical details.
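Absent those details, here is a minimal sketch of how a counter with a hard cap could behave, plus one possible mitigation: flagging a warning before the cap is reached. The class name, the `warn_at` parameter, and the choice to raise an error at the limit are assumptions; the summary only says the counter stopped at a pre-set limit and the routing system stopped accepting calls.

```python
class CallIdCounter:
    """Hands out sequential call IDs up to a fixed limit (hypothetical)."""

    def __init__(self, limit: int, warn_at: int):
        self.limit = limit      # highest ID that may ever be assigned
        self.warn_at = warn_at  # ID at which operators should be alerted
        self.next_id = 1
        self.warned = False

    def assign(self) -> int:
        """Return the next ID, or raise once the limit is exhausted."""
        if self.next_id > self.limit:
            # The failure mode in the story: no ID, so no new calls accepted.
            raise RuntimeError("call ID limit reached")
        call_id = self.next_id
        self.next_id += 1
        if call_id >= self.warn_at:
            # A real system would page someone here instead of setting a flag.
            self.warned = True
        return call_id


if __name__ == "__main__":
    counter = CallIdCounter(limit=5, warn_at=4)
    print([counter.assign() for _ in range(5)])
    print("warning raised before exhaustion:", counter.warned)
```

The point of the sketch is the early warning: the cap itself may be unavoidable, but reaching it silently is not.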
Maybe a sequence value tied to a 32-bit integer was shared with other tables, resulting in 40 million not being the important number, just what was visible to the end user. Perhaps it was a sequence to a related but unknown table, say an auditing table, that overflowed - there may have been a design intent, never realized in engineering, to have that table cycled every so often.
I once joined an IBM business partner, where one of my first calls was to a police department in the Detroit suburbs. They wanted me to configure a printer on their AIX based high-availability 911 system. It all seemed odd. When I met the police chief he was very slow to shake my hand, and really left me hanging. He told the office people to watch everything I did - and not the "watch this expensive expert and learn!" sort of watching.
It wasn't until years later that I became aware of the background check practice and realized they didn't actually need the printer configured. It was just a ruse to send me to do work for a police department, who would give me a thorough background check, courtesy of taxpayers.
I did not drink the IBM Kool-Aid, so the partner gig did not work out.
lmbo