RIM Releases Reason for Blackberry Outage
An anonymous reader writes "According to BBC News, RIM has announced that the cause of this week's network failure for the Blackberry wireless e-mail device was an insufficiently tested software upgrade. Blackberry said in a statement that the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space. The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options."
A routine that can take down the system is a tad more critical than you think?
The Kruger Dunning explains most post on
I'd really hate to be the guy that signed off on the quality of this software update. And apparently they didn't adequately test their recovery system. Oh, well. I hope they learn from this and improve!
Avoid Missing Ball for High Score
This is all just technical jargon for, "I tripped over the power cord. MY BAD."
"Please, shut up. Just when I think you can't say anything more stupid, you speak again." -Archie Bunker.
This is obviously some new definition of the word "non-critical" with which I was previously unfamiliar.
bkd
The irony is that the SEC couldn't do any more investigating during the outage because they had no email access!
"It is a miracle that curiosity survives formal education." -Albert Einstein
Their tubes were clogged and the plumber wasn't responding. Damn Canadian plumbers...
This guy's the limit!
So, an outage affecting a core part of the business was caused by a 'non-critical' upgrade. Someone needs to redefine what non-critical actually is. As far as my experience goes (mostly in mission critical datacentres), most of the testing was actually done by the engineers installing and fixing on-the-fly. Engineers are more likely to look in the right places to find a bug, due to hands-on real life experience.
"I am not bound to please thee with my answers" [William Shakespeare]
So that is where the missing 5 million White House emails went! Sneaky Canadians!
Life needs more saving throws.
This is funnier than the other comments about it being critical.
And let me tell you, I have no problem believing they have buggy software.
Mod me down with all of your hatred and your journey towards the dark side will be complete!
...they just became famous as a lesson in what not to do
all publicity is good publicity, right?
as the other poster said:- boy I would hate to be their QA at this time.
I think not. You realize this is 2007, yes? Ask the marketing department for how much testing they get.
Yeah, they've got areas to tighten up their QA and patch processes, but on the whole they got it all back up and running faster than most enterprises get their email functioning after a worm.
Mistakes in QA do happen and everyone can do more testing, but RIM's biggest failure during the outage was not their QA but their PR. How many BES admins wasted an hour or two trying to figure out why their servers were not delivering properly to their users' handhelds? If there had been a statement on their website or a message on their support line, a lot of wasted time would have been averted. If it were not for a few of the independent Blackberry forums, I would not have known there was a nationwide outage during my troubleshooting.
Strange women lying in ponds distributing swords is no basis for a system of government.
Which is worse:
A) The fact one piece of software took down their environment.
B) Their failover plan didn't work.
C) All of the above.
D) None of the above.
Personally, I vote for "B". Face it, s**t happens. But when you plan for s**t happening and the plan doesn't work, that's a VERY bad thing.
This is my opinion. To make sure you don't steal it, it's covered by the DMCA.
Nobody should be allowed to charge for anything or make any money ever!
Information wants to be free!
Serves them right!
It's an imperfect world. Now, show Dick some respect!
Weaselmancer
Ridiculous.
And a bunch of suits will want the heads of the technicians responsible.
I feel for them, I really do.
A few years ago I put in a minor maintenance change that made headlines for my employer.
This is a natural result of the budgetary constraints we have to live with in the real world. Testing and certification is expensive, and the more complex the environment, the more expensive it gets. It is difficult to justify a full blown certification test for minor, routine maintenance, unless you are talking about health and safety systems. So a worst-case event occurred, RIM suffers some corporate embarrassment, some low-level techs will get yelled at, and possibly fired, and a bunch of people had to suffer crackberry withdrawal.
Nobody died. No planes crashed. No reactors melted down.
RIM will work up some new and improved testing standards, and tighten the screws on system maintenance so much that productivity will suffer, they may even spend a bunch of money on the equipment needed to do full-production-parallel certification testing. And then in a year or so cut the budget to upgrade the certification environment as 'needless expense', and come up with work-arounds to reduce the time it takes to get trivial changes and bugfixes rolled out.
I wish them luck. Especially to the poor sods who did the implementation.
At least when I did my 'headline-making-minor-maintenance' it only made the local papers for a couple of days.
http://visualizecommonsense.com/
>...the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space.
:-)
:
>The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options.
Hmmm... so when they wiped the incriminating e-mails from the system (which would certainly create more space), they took the rest of the system down (which prevented anyone else from grabbing copies).
I'm reading WAY too many conspiracy novels these days
(Not that I think this actually happened - but it makes for a great plotline).
Of course he would not elaborate more on what it is.
This Computer World article has more detail.
2bits.com, Inc: Drupal, WordPress, and LAMP performance tuning.
I think it's more along the lines of "Unplugged the coffee maker; please feel free to restart the server now."
You know the thing about UDP jokes? I don't care if you get it or not.
"insufficiently tested software upgrade" => "untested software upgrade" => "some superstar at RIM changed the CRASH_NETWORK constant from 0 to 1."
stuff |
"In other news, the wikipedia.org web site screeched to a halt as /. readers rushed to lookup the meaning of the term 'routine' applied in the context of software systems. The RIM public relations department could not be reached for a clarification as to why such an anachronism was used in their announcement."
Chandler: "Quick, we must telegraph President Coolidge!"
I own a Blackberry and could not send/receive e-mail from approximately 8:00 pm to 10:00 am. Considering that I was asleep or showering for 8 hours, that is 6 hours of personal impact. Although I think the outage is unacceptable and shows the fragility of the system, I am surprised at the size of the reaction, even considering the "Crackberry" effect. I guess that wireless e-mail is now seen as a utility like cell-phones, land-lines and electricity.
Although an explanation would have been better yesterday, as an IT person, I can understand the process: Tuesday night: Panic! Get the system back up ASAP. Wednesday: Investigate exactly what went wrong, monitor systems with extreme diligence, hold your breath. Thursday: Meet with marketing folks to come up with a statement. Thursday night: Release statement.
Does anyone have a link to the actual statement that RIM made? 5 minutes of googling could only find articles that quoted the statement.
I heard it was a practical joke gone bad at The Perimeter Institute. Apparently, Lee Smolin was preparing synthetic black holes for the string theorists' offices, and one of them escaped and headed for RIM's control centre.
Their use of the term "non critical" is most likely referring to the nature of the patch. It was an "optional" patch that did not fix any "critical vulnerabilities" or anything like that.
It is quite obvious they were not referring to the criticalness of the system which was affected.
Many things went wrong at once:
- defective patch
- automated and manual testing missed the defect
- defective patch rolled out to a huge portion of the user base at once
- rollback failed/ineffective
On the flip side, some things went right:
- no data loss, messages delayed instead of lost
- BB's continued to function properly while server offline
- phone functions unaffected
- firms running their own BES servers unaffected
- lastly, even with this outage, they're still offering lots of 9's of availability.
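For anyone who wants to put a number on "lots of 9's", here is a rough back-of-the-envelope sketch. It assumes a single outage of about 14 hours (the 8:00 pm to 10:00 am window another poster reports) and no other downtime for the rest of the year. Both are assumptions, not figures from RIM, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_hours / period_hours


def nines(avail):
    """Number of leading nines, e.g. 0.9984 counts as 2 (better than 99%, worse than 99.9%)."""
    return math.floor(-math.log10(1.0 - avail))


if __name__ == "__main__":
    a = availability(14)  # assumed 14-hour outage, nothing else all year
    print(f"availability: {a:.4%}, nines: {nines(a)}")
```

Under those assumptions a 14-hour outage on its own works out to a bit over 99.8% for the year, so the real count of nines depends heavily on how long the outage actually lasted and what period you average over.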
Don't VMware's admins know to turn Automatic Updates off in the copy of Windows ME that the Blackberry backend runs on?
The problem nowadays with the product life cycle is ship dates. I have seen time and time again where something was shipped based on a date. This all comes from a product project cycle that is based on software that is shipped to customers. When you develop a service/product that is run on the internet, developers need to have the mindset to "Develop to run, not to ship". This is what I preach every day as an operations manager for a large online ecommerce site.
RIM is not a regular company. They have specifically created a centralized system where the email for millions of people depends on the uptime of their two (?!?!) data centres. Delivering email is literally their business and uptime is a critical part of that. IMHO, a half an hour of system wide downtime is pushing RIM's luck.
Several hours of email downtime is "OKish" if you are talking about a medium sized company that only has a handful of servers and a few IT guys. This is not the same at all.
Prior to this, I never realized that the RIM system was THIS centralized. It's kind of concerning really. And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).
Whatever happened to no single point of failure? And since when do you update a live system? Has no one learned anything in the past decade?
Reminds me of when a mobile phone company upgraded over the weekend and everyone discovered you could make long distance phone calls for free.
davecb5620@gmail.com
...somebody forgot the ~ in rm -rf ~/
Adding storage space to a single system shouldn't be a problem, since you take your system down for that anyway (or put it in spare mode or so) even if it's a hotplug-always-on-superfast-resizing-raid-with-automatic-failover-and-d2d2t2brain system. That it takes the whole network down is a problem.
Custom electronics and digital signage for your business: www.evcircuits.com
Maybe someone could've told them that erasing (shredding) files and unused disk space can grind the system to a halt.
If anyone said something like this five years ago, I'd accuse them of being a tin-hat wearing paranoid fool. But times have changed.
There are too many things, such as the unprecedented use of signing statements, abuse of the Patriot Act, death of investigative journalism (replaced by partisan pundits disguised as reporters), Valerie Plame's outing, and unchecked kleptocracy going on that turns trusting people into cynics.
When hearing about lost e-mails on TV, I think "e-mails get lost all the time" but when I read detailed reports, the facts clearly show that it could not have been an accident. For example, this report shows facts about the lost e-mails that should be unacceptable to Democrats, Republicans, and independents alike:
WITHOUT A TRACE: THE MISSING WHITE HOUSE EMAILS AND THE VIOLATIONS OF THE PRESIDENTIAL RECORDS ACT
http://www.citizensforethics.org/node/27607
If you read the report and know that people in Washington use Blackberries, how could you not wonder if the recent outage was caused by attempts to destroy evidence?
Congress would have to be completely blind if they don't immediately contact RIM and have them confirm under oath that no evidence was destroyed.
Given that Karl Rove is known to be a Blackberry user, Congress would have to be incompetent to ignore this incident. Please give them a clue by contacting your representative by e-mail or fax or phone!
B) Their failover plan didn't work.
..
What failover plan and assuming what they said really happened
I pity the RIM staff now. I work at a client that has had two "bad headlines" incidents. Understandably they are now highly risk averse: up to 12 sign-offs required for minor changes; documentation to code ratio is >> 20:1; 5 chiefs to 1 Indian on many projects ...
The public is ignorant as to what causes IT problems - even if RIM upgrade their QA process to "better than normal" no one will forgive them if lightning strikes twice. Thus RIM are likely to bring in extraordinarily restrictive processes. If I was a creative developer or solutions architect in RIM I'd be looking for a new job.
it started with a vendor issue, and then RIM's software did not react well to that issue.
Given the nature of the technology I find the explanation of a 'fail-over' system failing to kick in a tad disingenuous. It's not like a generator kicking in when the mains electricity stops. And what kind of design decisions led to an upgrade triggering outages for all of North America?
I would have thought they had multiple nodes at multiple locations with no single point of failure. Or at least three redundant and independent systems: a main system, a backup system and a system for testing upgrades. Or is it that, like most commercial companies, they designed the cheapest system possible?
Tell me it's not like the UK's DoH system, where power cuts in Kent lead to system outages in the North of England. It takes real genius to design a distributed database that borks because of a power cut. (Sarcasm.)
Turned out one of our contract software guys had made a simple change to the file retention period - so trivial, he said, there was no need to test it. He was rather chagrined the next day.
Yeah, this was a long time ago - 1973 or so. But some cherished principles hold up pretty well, such as: Test the damn "trivial" changes!
You can't expect programmers to do perfect work, even with unit testing and all the other basic amenities of software development. It requires QA, and that is something sorely lacking in contemporary software products. From the smallest OSX widget to MS Vista, testing matters.
RS
Shoes for Industry. Shoes for the Dead.
Support Right To Repair Legislation.
What gets me is all the media talk about emergency responders not being able to be contacted. It's not like their Blackberries burst into flames because the message passing servers were down; they still had SMS and phone capability. Hell, because we aren't certain our email relays or BES servers won't be the system that is down, our alerting system automatically switches from email to SMS for the second round of notifications. I guess RIM isn't the only one who could use a little process improvement!
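The email-then-SMS escalation described above could look roughly like this. It is only a sketch: the poster gives no details of their alerting system, so the function name, the idea that each channel reports back whether the alert was acknowledged, and the treatment of a crashed channel as "not acknowledged" are all assumptions.

```python
from typing import Callable, List

# A channel takes (recipient, message) and returns True if the alert was acknowledged.
# Hypothetical interface for illustration; real email/SMS gateways will differ.
Sender = Callable[[str, str], bool]


def notify_with_escalation(recipient: str, message: str, channels: List[Sender]) -> int:
    """Try each channel in order (e.g. email first, then SMS).

    Returns the index of the first channel that was acknowledged, or -1 if none was.
    """
    for round_number, send in enumerate(channels):
        try:
            if send(recipient, message):
                return round_number
        except Exception:
            # A channel that is down is treated the same as one nobody answered.
            continue
    return -1
```

The point is simply that the second round does not go through the same path as the first, so a dead mail relay or BES server cannot silence both.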
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Your information isn't quite right here. RIM has more than two data centers in more than two locations in more than two continents.
Governments tend to be (justifiably) paranoid customers. I'm sure it's safe to assume that each government does a fair amount of investigation before deciding it's safe to use BlackBerry for official use. And even then I expect it's only permitted for certain classification levels -- probably low ones.
That actually happened at a medium sized web host I did tech support for! A network admin had been at the NOC all night installing patches on our servers and on his way out tripped over the cords for the main routing system. As a result, every one of our websites (serving some 20,000 customers worldwide) were offline for about two hours before someone discovered what he'd done.
Any plan which depends on a fundamental change in human behavior is doomed from the start.
Awesome. Better him than me! Sounds like something I'd have happen to me.
"Please, shut up. Just when I think you can't say anything more stupid, you speak again." -Archie Bunker.
Maybe the feds check out that kind of thing. I can testify that at least one county government (with several thousand employees and about a million citizens) has such a bad IT department that they would be hard pressed to figure out that RIM is Canadian. All employees within about 5 hops of the county manager on the O-chart have Blackberries for official use.
Damn, so that's what happened!
I thought it was just aliens trying to attack Earth.
It's my understanding that corporate BlackBerrys use encryption while the messages are in transit. I'm not sure if the central RIM server ever gets a chance to see the cleartext message.
Uh, I think the RIM BES runs constantly, 100% of the time. They'd have to update a live system, if they were going to update anything at all. Otherwise, depending on the frequency of updates, I'd imagine BB users would be pissed if they had fluky BBs for an hour every week. And the "centralization" aspect is what a lot of people are wondering about.
So you update your fully redundant production system, then test that production system a bit just to make sure, then flip live traffic over to it. If everything looks good you upgrade the system that was previously live, otherwise you flip back.
This isn't even hard (unless, of course, you really have learned nothing in the past 10 years and don't have a fully redundant production system hot at all times).
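A minimal sketch of that flip-and-flip-back procedure, assuming two identical environments and two caller-supplied checks. The class, the method and the check functions are invented for illustration and say nothing about how RIM's systems actually work.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Environment:
    name: str
    version: str


class BlueGreen:
    """Two identical production environments; only `live` receives traffic."""

    def __init__(self, live: Environment, standby: Environment) -> None:
        self.live = live
        self.standby = standby

    def upgrade(self, new_version: str,
                smoke_test: Callable[[Environment], bool],
                healthy_under_load: Callable[[Environment], bool]) -> bool:
        """Return True if the upgrade was kept, False if it was abandoned or rolled back."""
        old_version = self.live.version

        # 1. Upgrade the environment that is NOT taking traffic.
        self.standby.version = new_version

        # 2. Test it before any user sees it.
        if not smoke_test(self.standby):
            self.standby.version = old_version
            return False

        # 3. Flip live traffic over to the upgraded environment.
        self.live, self.standby = self.standby, self.live

        # 4. If it misbehaves under real traffic, flip back. The old environment
        #    was never touched, so going back is just another swap.
        if not healthy_under_load(self.live):
            self.live, self.standby = self.standby, self.live
            self.standby.version = old_version
            return False

        # 5. Only now upgrade the environment that used to be live.
        self.standby.version = new_version
        return True
```

The key property is in step 4: until the new version has survived real traffic, there is always one untouched environment to fall back on.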
Socialism: a lie told by totalitarians and believed by fools.
How tired do you have to be to not notice that you tripped over a bunch of cables?
Encrypted transit makes sense. However, that still leaves a fairly important point of failure significantly outside of US control. I hate to say it, but if RIM use continues to grow in the government, eventually those NOCs become strategic targets.
I can understand smaller countries having to accept that as a part of life, but let's face it, America has a few bucks to toss around. I'm quite surprised that the government hasn't forced RIM to put a NOC on American ground.
I don't really know how many data centres RIM has. Two doesn't really sound right to me, but that's what has been frequently quoted in the media lately. Maybe people, including myself, are mixing up data centres and NOCs in the RIM world.
E.g. http://news.zdnet.com/2100-1035_22-6177829.html
Wow, just in time for STP Con!
http://www.stpcon.com/
They probably missed the early bird discount, though.
My favorite quote: "The cost of software failures is high -- and in today's increasingly litigious and regulated business environment, they're higher than ever. Security flaws, usability problems, functional defects, performance issues, all carry a tremendous price tag."
This is a match made in heaven.
P.S.
Non-
Function: prefix
2 : of little or no consequence : unimportant : worthless <nonissues> <nonsystem>
"non-critical" is an interesting usage in this context. It's probably just an understatement oversight. I bet what they really meant is "non-critical-system" or non-critical system, or maybe "non-critical System". In any case, it seems.... off. No pun intended.
the <nonsystem> seems appropriate. *shrug*
If you do what you always did, you get what you always got.
The most important thing about RIM's e-mail infrastructure, from a gov't or business perspective, is strong end-to-end encryption of every message. (IIRC they use elliptic curve crypto for their PKI, and 3DES for symmetric session keys. They might have changed it, but I know the guy who implemented most of their crypto algorithms many years ago. They didn't do anything boneheaded crypto-wise; it's a pretty unbreakable system.)
In other words, the reason US gov't users can route mail through a Canadian NOC without fear is, RIM delivers the packets but it can't actually read any of the messages. The crypto is strong enough that it would probably take hundreds of years to break, barring the development of quantum computing or something.
I highly doubt they will ever say who is officially to blame, but most likely it was a combination of pressure from 'above' for the developers to complete the upgrade by xxxxx, for the roll-out team to implement & verify the upgrade globally with absolutely no downtime, the lack of time to test the application for every possible bug or 'feature' that may arise (including going through the code step-by-step to make sure no weird situations or invalid data input/output could occur) and the sheer complexity of the system(s). Some level of management is definitely to blame for the outage. At least one person is going to lose their job or get a severe telling off over this.
As for the non-critical upgrade statement, if it was non-critical how come it caused a major outage, and in the case of where I work, caused a reasonably significant loss via disruption of communications? I'd classify that as more than non-critical, but I'm funny that way...