Software Bug Behind Biggest Telephony Outage In US History (bleepingcomputer.com)
An anonymous reader writes: A software bug in a telecom provider's phone number blacklisting system caused the largest telephony outage in US history, according to a report released by the US Federal Communications Commission (FCC) at the start of the month. The telco is Level 3, now part of CenturyLink, and the outage took place on October 4, 2016.
According to the FCC's investigation, the outage began after a Level 3 employee entered phone numbers suspected of malicious activity in the company's network management software. The employee wanted to block incoming phone calls from these numbers and had entered each number in fields provided by the software's GUI. The problem arose when the Level 3 technician left a field empty, without entering a number. Unbeknownst to the employee, the buggy software didn't ignore the empty field, like most software does, but instead viewed the empty space as a "wildcard" character. As soon as the technician submitted his input, Level 3's network began blocking all incoming and outgoing telephone calls — over 111 million in total.
Check the spec - perhaps it was by design or not called out to ignore empty entries?
Browsing at +1 - no ACs, I ignore their posts. So refreshing!
It was Linux.
had me laughing out loud. I of course didn't rtfa but what was the abused language.
I can see why CenturyLink sucked up such talented employees and wonderful software...it's right up their alley....
I'm 99% sure they were using the Sonus EMS management software (L3 is a huge Sonus shop) to manage the PSX routing engine. The software works as longest match of the number. Since you have to always select the country, a blank entry would be treated as +1 and block everything after that or everything in the US.
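To illustrate the longest-match behaviour described above, here is a minimal Python sketch. It is not Sonus code; `to_rule` and `longest_prefix_match` are hypothetical names, and the only assumption carried over from the comment is that the country code is always present, so a blank number field leaves just "+1" as the prefix.

```python
def to_rule(country_code, national_digits):
    # Hypothetical rule builder: the country code is always selected,
    # so a blank number field leaves only the country code as the prefix.
    return "+" + country_code + national_digits.strip()


def longest_prefix_match(number, blocked_prefixes):
    """Return the longest blocked prefix that `number` starts with, or None."""
    best = None
    for prefix in blocked_prefixes:
        if number.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return best


# One real entry and one field left blank by mistake.
rules = [to_rule("1", "2025550123"), to_rule("1", "")]

# An unrelated US number now matches the bare "+1" rule.
print(longest_prefix_match("+13035550199", rules))

# A UK number does not match anything.
print(longest_prefix_match("+442079460000", rules))
```

Under that assumption, every number in the +1 country code matches the blank rule, which fits the "block everything in the US" outcome.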
My best guess is the bug was: "block if phoneNumber.containsAnyFrom(blacklist)"
Every phone number contains "" so all get blocked. I'm assuming the input field gets trimmed of whitespace and assuming it wasn't a feature.
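A quick Python sketch of that guess, purely as an illustration. The function names are made up, and nobody outside Level 3 knows what the real check looked like.

```python
def is_blocked_buggy(phone_number, blacklist):
    # Substring test: the empty string is "contained" in every string,
    # so a single blank entry blocks every number.
    return any(entry in phone_number for entry in blacklist)


def is_blocked_safer(phone_number, blacklist):
    # Drop blank entries and require an exact match instead of a substring.
    cleaned = {entry.strip() for entry in blacklist if entry.strip()}
    return phone_number in cleaned


blacklist = ["2025550123", ""]  # second field left empty

print(is_blocked_buggy("3035550199", blacklist))   # unrelated number gets blocked
print(is_blocked_safer("3035550199", blacklist))   # unrelated number passes
print(is_blocked_safer("2025550123", blacklist))   # listed number is still blocked
```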
Any other ways it could have occurred?
Bugs happen, and when they do you should fix them, but having no sanity checks on the input data nor on the rules which govern critical infrastructure? That's irresponsible to the extreme.
Also, why are naughty phone numbers being entered into a GUI, by hand, one at a time? Don't tell me these numbers are based on Caller ID, too.
While empty fields are a stupid wild-card option, it seems reasonable to assume this sort of error didn't happen all that often. Why wasn't the employee aware of the stupid UI? Was entering a malicious number rarely done? Was this the first time this employee performed this task?
1) There should have been a warning: "Do you want to block all calls?" If not, then require the employee to enter a phone number.
2) Or for a better solution, that form should not interpret a blank field as a wildcard. If all phone calls are to be blocked, then someone must sign on with a manager's user id, and fill out a special form that lets you block all phone calls.
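The two suggestions above could look something like this in Python. It is only a sketch: `validate_block_list` and `block_all_calls` are hypothetical names, and the manager check is a stand-in for whatever authorization the real system would use.

```python
def validate_block_list(fields):
    """Return the cleaned numbers, refusing blank or non-numeric fields."""
    numbers = []
    for index, raw in enumerate(fields, start=1):
        value = raw.strip()
        if not value:
            # A blank field is never treated as a wildcard.
            raise ValueError(f"field {index} is empty: enter a number or remove the field")
        if not value.lstrip("+").isdigit():
            raise ValueError(f"field {index} is not a phone number: {value!r}")
        numbers.append(value)
    return numbers


def block_all_calls(manager_id, confirmed):
    # Blocking everything is a separate, deliberate action (hypothetical check).
    if not manager_id or not confirmed:
        raise PermissionError("blocking all calls requires a manager ID and explicit confirmation")
    return "BLOCK ALL"


print(validate_block_list(["+12025550123", " 3035550199 "]))

try:
    validate_block_list(["+12025550123", ""])
except ValueError as error:
    print("rejected:", error)
```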
One guy can bring down a whole system so easily. For years I've been trying to tell you people your contraptions are not ready to be put online, just like your silly "self driving" cars. Maybe in 50 years or so you'll get it right, but for now, let's stick to slide rulers, pencil, and paper. That's how we built the 747 and went to the moon, and notice, we haven't been back, or built a superior airplane, why? Because we no longer use the tools that work!
and of course rm -r .* -- which might work if you're a restricted user deleting your .files, until you do it as root or with sudo.
It should be an entirely reasonable thing to do, right?
What about you doing a "ls -R" or "find" or anything recursive, and hitting a symlink or a network mount point? Perhaps you're now scanning a million files on a secondary HDD or on a slow network mount. Hope you're not doing something worse.
What if you're automatically "backing up" your files?, then after you've written "banana" into all your files, your backup gets overwritten.
What about "-r" vs "-R" and "-p" vs "-P" : a much milder issue but shows that user interfaces aren't all that great or consistent.
What if I put a file named "-r" in your directory. (or "~")
If all those examples are fine, then I'm thinking the phone exchange software worked as intended.
Also, how about a dialogue box that says "you are about to block 111 million phone lines from xxxx to xxxxxxxxx. Also manager authorization codes needed"
I used to think that US' FCC was this responsible organization being involved in all the tech stuff going on, but now it seems it is as tainted by corruption as the one and only CIA.
I am a European and I don't live in the USA, but until the FCC's recent dubious decision to work against net neutrality I had a fairly positive view of the FCC. Admittedly I didn't know much about them.
As an adult, it seems the organizations I heard about growing up are falling from grace with me.
I can't help wonder if a major outage with comms tech in a country could perhaps be related to secret changes in a looming police state, though I might ofc be wrong, I am not surprised by anything anymore.
Ha ha. No, this is not just a bug. The fuckup goes much deeper than that. "An empty field acts as a wildcard" is the least of your problems. It may or may not be expected behaviour for a GUI. "Not finding it during testing" is par for the course for GUIs for this sort of thing. You're not supposed to give wrong input, even accidentally!
The real problem is thinking a GUI is appropriate to feed lists of boring numbers through. By hand, no less. It's way too easy to accidentally leave a field empty or --if it's a micro-managing form like windows IP address entry type things-- copy part of a number in the wrong line, shift it a sub-field, or something else similarly silly.
What we have here is a mismatch between user interface and purpose, cooked up without thinking. This is the same mode that makes users stupid, but now it was the designer who wasn't thinking. The focus was on "getting some input fields done", not on "how will this be used and what might the consequences be?" The deeper problem is TFIing such lists. GUIs are entirely stupid for this.
Compare "here, have a GUI" with this sequence: Check the list then feed it to the system as a textfile. It gets queued. Then check the list as it appears in the system against your original list. The system probably should make explicit just what it will do with each entry, like "block one number" or "block a range of numbers". Possibly have someone else look over the proposed actions. THEN activate it.
So the problem is that the workflow is entirely too stupid to live. And it was shaped into that form by a GUI.
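For what it's worth, the queue-review-activate sequence described above is easy to sketch in Python. Everything here is hypothetical: the function names, the trailing "*" as a range marker, and the idea that activation just returns a count.

```python
def plan_from_text(text):
    """Turn a text file's contents into an explicit list of proposed actions."""
    plan = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            raise ValueError(f"line {line_no} is blank; refusing to guess what it means")
        if entry.endswith("*"):  # hypothetical syntax for a range
            prefix = entry[:-1]
            if not prefix:
                raise ValueError(f"line {line_no} would block everything")
            plan.append(("block range", prefix))
        else:
            plan.append(("block one number", entry))
    return plan


def describe(plan):
    # What the system says it will do, for a human to compare with the original list.
    return [f"{action}: {value}" for action, value in plan]


def activate(plan, approved_description):
    # Only activate exactly what was reviewed.
    if describe(plan) != approved_description:
        raise RuntimeError("queued plan differs from what was reviewed")
    return len(plan)


queued = plan_from_text("2025550123\n303555*")
reviewed = describe(queued)      # a second person reads this before activation
print(reviewed)
print(activate(queued, reviewed))
```

A blank line or a bare "*" stops the whole batch before anything is queued, which is the point of making each proposed action explicit.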
No, the outage began years ago when someone created a process in which a human being manually enters data directly into a production system.
I swear, if I had a nickel for every time a major fuckup was root caused to "human error in a process that should have never had a human factor to begin with", I could buy a house in the Bay Area.
In 1987 I had just taken a job at the local Telco and was hitting a steep learning curve. My experience to that point had been PC computers and networks, assembler, CBASIC, dBase and the like. This was an IBM System/38 and their billing software used RPG/III, which was a real structured language unlike its spaghetti-GOTO RPG/II cousin, but aspects were still position sensitive and opcodes were silly-simple compared to languages with which I was familiar. It was more like assembler than anything else. Most data flows consisted of running commands that generated a relational input stream sort of like an SQL query, through simple RPG programs.
We had just installed an ITT 1210 switch and ITT had sent over a block of sample RPG code demonstrating how to parse the various fields and flags appearing on call tapes. My boss provided specs for the internal call ticket system they were using and the simple (!) task was to write a shim that generated a batch of call tickets from each tape. Pretty straightforward, tedious without being intricate. But one part of their code slapped me across the face when I examined it.
The tape recorded end time and call duration in whole seconds; call start time would need to be calculated. They had supplied a routine to do this but it didn't make any sense because I could see no modulo 60 arithmetic in it; they were applying the simple RPG subtraction opcode on the zoned fields. I spent the most mystified HOUR of my LIFE searching the language manuals for the section that surely described RPG's 'magic' ops for manipulating times and dates, which I assumed had to be there because IBM is GREAT and I am STUPID... finding none. Forced to conclude that I was looking at concept code that was dashed off hurriedly in two minutes, I confronted my boss with it (and my solution) but it was a hard sell at first, because my boss was incredulous too.
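To show why plain subtraction fails, here is a small Python sketch (not RPG, and not the ITT code). It assumes the end time is stored as an HHMMSS number, which is a guess at the tape format; the function names are made up.

```python
def hhmmss_to_seconds(hhmmss):
    # Split an HHMMSS number into its parts and convert to seconds since midnight.
    hours, rest = divmod(hhmmss, 10000)
    minutes, seconds = divmod(rest, 100)
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hhmmss(total):
    # Wrap around midnight, then rebuild the HHMMSS number.
    hours, rest = divmod(total % 86400, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours * 10000 + minutes * 100 + seconds


def call_start(end_hhmmss, duration_seconds):
    return seconds_to_hhmmss(hhmmss_to_seconds(end_hhmmss) - duration_seconds)


end_time, duration = 120010, 45       # call ended 12:00:10 and lasted 45 seconds

print(end_time - duration)            # naive subtraction gives 119965, not a valid time
print(call_start(end_time, duration)) # 115925, i.e. 11:59:25
```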
<blink>down the rabbit hole</blink>
So these phone companies have the ability to block all incoming calls from malicious phones. All these days... All the complaints about spam callers... About scam artists posing as IRS employees .... They had the ability to block them. But they never did. Bastards.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Name the god damn supplier!
Does your country have a "Province, Prefecture, or Other Region" more general than city but more specific than a country? If so, that'd go in the State field. (Source: my experience integrating with postage software published by Endicia and UPS.) If the form states that the name or postal abbreviation of your province is invalid, then perhaps the business doesn't ship to your country.
This is why QA is good and tech companies used to consider it a separate and necessary discipline.
Awe, fuck it. We're too smart for that. Just throw it against the wall and see if it sticks. If it don't, we can just roll it back, right?