Slashdot Mirror


RIM Releases Reason for Blackberry Outage

An anonymous reader writes "According to BBC News, RIM has announced that the cause of this week's network failure for the Blackberry wireless e-mail device was an insufficiently tested software upgrade. Blackberry said in a statement that the failure was triggered by 'the introduction of a new, non-critical system routine' designed to increase the system's e-mail holding space. The network disruption comes as RIM faces a formal probe by the US financial watchdog, the Securities and Exchange Commission, over its stock options."

16 of 106 comments

  1. perhaps by geekoid · · Score: 5, Interesting

    a routine that can take down the system is a tad more critical than you think?

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  2. What really happened... by Mockylock · · Score: 5, Funny

    This is all just technical jargon for, "I tripped over the power cord. MY BAD."

    --
    "Please, shut up. Just when I think you can't say anything more stupid, you speak again." -Archie Bunker.
  3. Non-critical? by Anonymous Coward · · Score: 5, Funny

    This is obviously some new definition of the word "non-critical" with which I was previously unfamiliar.

    bkd

  4. Buying time by faloi · · Score: 5, Funny

    The irony is that the SEC couldn't do any more investigating during the outage because they had no email access!

    --
    "It is a miracle that curiosity survives formal education." -Albert Einstein
  5. Re:I'd hate to be their QA manager right now! by Mr Pippin · · Score: 5, Insightful

    More importantly, they apparently had no backout plan, or a very bad one.

    It's quite likely the development group listed this as a risk, with a good backout plan, and upper management simply didn't want to pay for the cost of having a quick backout.

    If that's the case, you can be pretty sure upper management WON'T take the blame.

  6. Ah ha! by Grashnak · · Score: 4, Funny

    So that is where the missing 5 million White House emails went! Sneaky Canadians!

    --
    Life needs more saving throws.
  7. Re:I'd hate to be their QA manager right now! by spells · · Score: 5, Insightful

    You can tell this is a geek site. Bad software rollout, first post wants to blame the QA manager, second wants to blame "Upper Management." How about a little blame for the devs?

  8. Re:testing departments by Red Flayer · · Score: 4, Informative

    Someone needs to redefine what non-critical actually is.
    A non-critical upgrade is one that it isn't critical to perform.

    Increasing storage capacity (when current capacity is not close to exhaustion)? Non-critical.

    Fixing the shut-down system that resulted from the upgrade? Critical.

    Watching the sales reps in my office apoplectically try to figure out how to get in touch with their clients? Priceless.
    --
    "Trolls they were, but filled with the evil will of their master: a fell race..." -- J.R.R. Tolkien on Olog-hai
  9. Re:I'd hate to be their QA manager right now! by lucabrasi999 · · Score: 5, Insightful
    How about a little blame for the devs?

    Blasphemer!

  10. Re:I'd hate to be their QA manager right now! by bradkittenbrink · · Score: 5, Insightful

    Clearly bugs originate with devs, the same way typos and spelling errors originate with authors. The occurrence of such errors is inevitable. The process as a whole is what is responsible for eliminating them. To the extent that the devs failed to contribute to that process then yes, they also deserve blame.

  11. Re:I'd hate to be their QA manager right now! by roman_mir · · Score: 4, Insightful

    I am not sure whether you are trying to be funny or insightful (probably a bit of both). However, while bugs in software are (inevitably) the developers' fault, the release of buggy software into a production system is always management's fault. There must be a process in place to catch bugs before release for mission-critical systems (isn't this one of them?). There must be a process in place for a quick rollback for such systems. There must be some form of backup. How about running both the new and old systems in parallel for a while, with the ability to switch back to the old one if the new one fails?

    Whatever it is, the production problems are due to bad process, which is what management is supposed to control. They are not even responsible for coming up with the technicalities of the process; they are responsible for making sure that there is a sufficient process (sufficient in the sense that all parties, DEVs, QAs, BAs and the client, agree that it is good enough). They are also responsible for making sure that the process is followed.

    Over a year ago now in Toronto, ON, Canada, the Royal Bank of Canada had a similar problem, but of course with a bank it is much more dangerous: it involves lots of money belonging to lots of people. Heads rolled at the management level only.

  12. Is this really so bad? by TheBishop613 · · Score: 4, Insightful
    Am I the only one who thinks they actually survived this pretty well? I mean sure, the goal is to try to make sure that the system never goes down and is up 24/7, but sometimes shit happens in large systems. It seems to me that getting everything back to normal within 12 hours is pretty reasonable. Did they have an instant fix? Well no, of course not, but they got the system back to a working state relatively quickly and hopefully didn't lose data.


    Yeah, they've got areas to tighten up their QA and patch processes, but on the whole they got it all back up and running faster than most enterprises get their email functioning after a worm.

  13. RIM's biggest failure by toupsie · · Score: 4, Interesting

    Mistakes in QA do happen and everyone can do more testing, but RIM's biggest failure during the outage was not their QA but their PR. How many BES Admins wasted an hour or two trying to figure out why their servers were not delivering properly to their users' handhelds? If there had been a statement on their website or a message on their support line, a lot of wasted time would have been averted. If it were not for a few of the independent blackberry forums, I would not have known there was a nationwide outage during my troubleshooting.

    --
    Strange women lying in ponds distributing swords is no basis for a system of government.
  14. Testing of Complex Systems by Fritz T. Coyote · · Score: 4, Insightful
    I love the (Friday) morning quarterbacks who will now proceed to beat up RIM for a system outage after a 'non critical' upgrade.

    And a bunch of suits will want the heads of the technicians responsible.

    I feel for them, I really do.

    A few years ago I put in a minor maintenance change that made headlines for my employer.

    This is a natural result of the budgetary constraints we have to live with in the real world. Testing and certification are expensive, and the more complex the environment, the more expensive they get. It is difficult to justify a full-blown certification test for minor, routine maintenance, unless you are talking about health and safety systems. So a worst-case event occurred, RIM suffers some corporate embarrassment, some low-level techs will get yelled at, and possibly fired, and a bunch of people had to suffer crackberry withdrawal.

    Nobody died. No planes crashed. No reactors melted down.

    RIM will work up some new and improved testing standards and tighten the screws on system maintenance so much that productivity will suffer; they may even spend a bunch of money on the equipment needed to do full-production-parallel certification testing. And then, in a year or so, they will cut the budget to upgrade the certification environment as a 'needless expense' and come up with work-arounds to reduce the time it takes to get trivial changes and bugfixes rolled out.

    I wish them luck. Especially to the poor sods who did the implementation.

    At least when I did my 'headline-making-minor-maintenance' it only made the local papers for a couple of days.

  15. Re:I'd hate to be their QA manager right now! by jimicus · · Score: 4, Insightful

    How about a little blame for the devs?

    Because that's not how change should happen in large, business-critical applications.

    What should happen is that the update is thoroughly tested, a change control request is raised and at the next change control meeting the change request is discussed.

    The change request should include at the very least a benefit analysis (what's the benefit in making this change), risk analysis (what could happen if it goes wrong) and a rollback plan (what we do if it goes wrong). None of these should necessarily be vastly complicated - but if the risk analysis is "our entire network falls apart horribly" and the rollback plan is "er... we haven't got one. Suppose we'll have to go back to backups. We have tested those, haven't we?" then the change request should be denied.

    As much as anything else, this process forces the person who's going to be making the change to think about what they're going to be doing in a clear way and make sure they've got a plan B. It also serves as a means to notify the management that a change is going to be taking place, and that a risk is attached to it.

    And if a change is made but hasn't been approved through that process, then it's a disciplinary issue.

    Of course, it's entirely possible that such a process was in place and someone did put a change through without approval. In which case, I don't envy their next job interview.... "Why did you leave your last job?"
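
    As a rough illustration of the change request check described above, here is a minimal Python sketch. Everything in it is hypothetical: the `ChangeRequest` fields, the `review` function and the `rollback_tested` flag are made-up names for this example, not anything RIM or any particular change-control tool is known to use. It only encodes the rule that a request without a benefit analysis, a risk analysis and a (tested) rollback plan should be denied.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed production change."""
    summary: str
    benefit_analysis: str = ""     # what is the benefit in making this change
    risk_analysis: str = ""        # what could happen if it goes wrong
    rollback_plan: str = ""        # what we do if it goes wrong
    rollback_tested: bool = False  # assumption: the plan B has actually been tried


def review(request: ChangeRequest) -> list[str]:
    """Return the reasons to deny the request; an empty list means approve."""
    reasons = []
    if not request.benefit_analysis.strip():
        reasons.append("missing benefit analysis")
    if not request.risk_analysis.strip():
        reasons.append("missing risk analysis")
    if not request.rollback_plan.strip():
        reasons.append("missing rollback plan")
    elif not request.rollback_tested:
        reasons.append("rollback plan has not been tested")
    return reasons


# Example: an upgrade with a benefit and a risk written down, but no plan B.
storage_upgrade = ChangeRequest(
    summary="increase e-mail holding space",
    benefit_analysis="more holding space per user",
    risk_analysis="could disrupt message delivery",
)
print(review(storage_upgrade))  # -> ['missing rollback plan']
```

    The point of returning a list of reasons rather than a plain yes/no is that the change control meeting gets something concrete to discuss, and the person raising the request can see what is still missing.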

  16. Yes it is. They've put themselves in a critical... by WoTG · · Score: 4, Insightful

    RIM is not a regular company. They have specifically created a centralized system where the email for millions of people depends on the uptime of their two (?!?!) data centres. Delivering email is literally their business and uptime is a critical part of that. IMHO, half an hour of system-wide downtime is pushing RIM's luck.

    Several hours of email downtime is "OKish" if you are talking about a medium-sized company that only has a handful of servers and a few IT guys. This is not the same at all.

    Prior to this, I never realized that the RIM system was THIS centralized. It's kind of concerning really. And I don't quite understand why so many US gov't users are allowed to route their email through a NOC in Canada (disclosure: I'm Canadian).