British Airways IT Outage Caused By Contractor Who Accidentally Switched off Power (independent.ie)

Posted by msmash on Friday June 2, 2017 @02:00AM from the getting-to-the-bottom-of-things dept.

An anonymous reader shares a report: A contractor doing maintenance work at a British Airways data centre inadvertently switched off the power supply, knocking out the airline's computer systems and leaving 75,000 people stranded last weekend, according to reports. A BA source told The Times the power supply unit that sparked the IT failure was working perfectly but was accidentally shut down by a worker.

Did they try... by 110010001000 · 2017-06-02 02:03 · Score: 5, Funny

...turning it on again?
1. Re: Did they try... by Anonymous Coward · 2017-06-02 02:16 · Score: 5, Funny
  
  Textbook example of a "career changing event"
2. Re:Did they try... by Zocalo · 2017-06-02 02:28 · Score: 5, Interesting
  
  Apparently that was what turned the major outage into a prolonged major outage. The sequence of events now seems to be that the contractor turned off the power (presumably killing one or more phases), obviously leading to a large-scale hardware shutdown. Someone (the same contractor, most likely) tried to restore power, as you would, only to find that the surge from all that hardware switching back on at the same time, which is when equipment is often close to its maximum power draw, overloaded the system and caused physical damage to the hardware.
  
  While there's a lot of mocking of BA going on at the moment, that's actually a pretty easy situation to get into if you've expanded a DC - or even a regular equipment room - over several years without proper power management, and BA is far from the first company to be caught out. So, if you are responsible for some IT equipment rooms, here are two things to consider: what's the combined total power draw of all the equipment in each room on power on (don't forget to include any UPS units topping up their batteries!), and what is the maximum power load that can be supplied to each room? If you can't answer both of those, or at least be certain that the latter exceeds the former in each case, then you've potentially got exactly the same situation as BA.
  
  None of which excuses BA from not having the ability to successfully failover between redundant DCs in the event of a catastrophic outage at one facility, of course.
  
  --
  UNIX? They're not even circumcised! Savages!
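
A rough sketch of the two-question check described in the comment above: add up the power-on draw of everything in a room (UPS recharge included) and compare it with what the room can be supplied. The `Device` class, `room_power_report` function and every wattage here are made up for illustration; nothing in it comes from BA.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    running_watts: float   # steady-state draw
    power_on_watts: float  # peak draw while starting up (hypothetical figures)


def room_power_report(devices, supply_limit_watts):
    """Compare combined power-on draw against what the room can be supplied."""
    running = sum(d.running_watts for d in devices)
    power_on = sum(d.power_on_watts for d in devices)
    return {
        "running_watts": running,
        "power_on_watts": power_on,
        "headroom_watts": supply_limit_watts - power_on,
        "safe_to_power_on_together": power_on <= supply_limit_watts,
    }


# Illustrative numbers only; UPS battery recharge is counted as a load.
room = [
    Device("rack-a", 4000, 9000),
    Device("rack-b", 3500, 8000),
    Device("ups-recharge", 500, 2000),
]
report = room_power_report(room, supply_limit_watts=15000)
print(report)
```

In this made-up room the steady-state load fits comfortably, but switching everything on at once does not, which is the trap the comment warns about.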
3. Re:Did they try... by PolygamousRanchKid+ · 2017-06-02 02:33 · Score: 1
  
  "Holy Mother Of All Single Point of Failures, Batman!"
  Well, if the contractor is like some of the ones I know, he will justly say, "I was instructed to turn off the switch . . . not to turn it back on again!"
  Which brings us to the obvious point: Which British Airways employee was responsible for the work being done? Blaming the lowly contractor is a complete shift of the blame to someone who obviously couldn't know any better.
  Or is British Airways an example of "Contractors . . . all the way down" . . . ?
  
  --
  Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
4. Re:Did they try... by sycodon · 2017-06-02 02:34 · Score: 5, Insightful
  
  Bullshit.
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  Handling power outages is about as basic an IT task as they come. Basic Lock Out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
5. Re: Did they try... by DickBreath · 2017-06-02 02:34 · Score: 1
  
  The individual will merely have to change which contractor employs them. In this world, education is now proportional to the number of contractors you've been employed by. Sad. Terrible.
  
  --
  
  I'll see your senator, and I'll raise you two judges.
6. Re:Did they try... by DickBreath · 2017-06-02 02:37 · Score: 2
  
  In Open Compute I believe the power supplies wait a random amount of time before applying power to the rails to fire up the load. What an idea. You auto-magically spread the start-up load out over a short time.
  
  --
  
  I'll see your senator, and I'll raise you two judges.
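
To illustrate the random-delay idea from the comment above, here is a toy simulation that counts how many supplies are in their inrush phase at the same moment, with and without a random wait. It is only a sketch: the function names, the 0.1 second inrush and the 10 second delay window are assumptions chosen for illustration, not Open Compute's actual parameters.

```python
import random


def peak_simultaneous_inrush(start_times, inrush_seconds):
    """Largest number of supplies that are in their inrush phase at once."""
    events = []
    for t in start_times:
        events.append((t, 1))                    # inrush begins
        events.append((t + inrush_seconds, -1))  # inrush ends
    # Sort by time; at equal times an end (-1) is processed before a start (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def random_start_times(count, max_delay_seconds, rng):
    """Each supply independently waits a random time before energising its rails."""
    return [rng.uniform(0, max_delay_seconds) for _ in range(count)]


rng = random.Random(1)  # fixed seed so the run is repeatable
simultaneous = peak_simultaneous_inrush([0.0] * 100, inrush_seconds=0.1)
staggered = peak_simultaneous_inrush(
    random_start_times(100, max_delay_seconds=10.0, rng=rng), inrush_seconds=0.1
)
print(simultaneous, staggered)
```

With no delay all 100 supplies overlap; with the random wait only a handful do, so the peak demand on the feed is a small fraction of the worst case.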
7. Re:Did they try... by Archangel+Michael · 2017-06-02 02:42 · Score: 5, Interesting
  
  There are two ways to engineer power for a datacenter. You can engineer for maximum efficiency/lowest cost or you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter.
  Here is the real equation: Cost * likelihood of catastrophic event. If you think $100,000 * a .0000001 chance of catastrophe, you err on the side of savings. On the other hand, if you think $25 * 100.00 chance of catastrophe, you err on the side of cost.
  My guess is that they didn't account for business losses when plugging in that (obviously oversimplified) formula. This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and don't properly evaluate the catastrophic losses in event of disaster, then you're just an idiot that nobody should listen to.
  I get that there are budgets and such, but here is the one question I (IT guy) ask the "business" decision makers: If you lost everything, how much would it cost you? Most people undervalue the data inside the databases and documents, because they have no way of quantifying how much all that data is worth.
  Data is the biggest unaccounted-for asset of a business.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
8. Re: Did they try... by Anonymous Coward · 2017-06-02 02:42 · Score: 1
  
  That is, an instant promotion to the management due to the record unexpected point savings of electricity during the budget month.
9. Re:Did they try... by HornWumpus · 2017-06-02 02:53 · Score: 1
  
  A sort of similar thing happened to me, in 1991, running netmare, at a business location. Conscientious employee made sure she shut the office down at the end of business. I got to run out there the next morning.
  I fixed it with duct tape over the power switch ('server' was a desktop, AT power supply). Wrote 'touch this and die' on it in sharpie. Arranged to have 'server closet' locked, which was good for cutting down dust as well.
  Can I be BA director of IT now? I'm obviously better qualified, despite being 'promoted' to full time programmer 20 years ago.
  
  --
  John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
10. Re:Did they try... by interkin3tic · 2017-06-02 02:56 · Score: 1
  
  As someone who doesn't work in IT I have to ask, what are the chances of other big organizations learning from this? Are we talking other airlines will make sure they avoid the exact same scenario but don't bother putting any additional resources to other IT disasters, or are we talking other companies laugh at BA's customers and then cut IT support?
11. Re: Did they try... by __aaclcg7560 · 2017-06-02 02:57 · Score: 4, Interesting
  
  Textbook example of a "career changing event"
  Not necessarily. I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can). Managers are less likely to punish someone who comes forward immediately. In other cases where blame must be assigned, I've already documented my actions and sometimes the actions of those around me. If my CYA is stronger than everyone else's, I'm not going to get blamed for something that I didn't do.
12. Re:Did they try... by swb · 2017-06-02 03:06 · Score: 5, Insightful
  
  I think they also suffer from what I call "efficiency savings hoarding".
  If you have a process that requires 10 labor inputs to achieve and you buy a machine that reduces it to 5 labor inputs, your ongoing savings isn't really 5 labor inputs. You have to spend some of that labor savings in keeping the machine maintained and operational and investing in its replacement when it reaches end of life.
  When I started working for a company in 1993, they had some 40 secretarial positions whose workload was about half spent doing correspondence and scheduling meetings. By 2001, thanks to a widely deployed email/calendaring system, they had cut about 30 of those positions because internal meetings could be automatically planned via email and the bulk of internal correspondence shifted from paper memos to email.
  Yet when it came time to expand/replace the email system due to growth it was seen as a "cost". I actually got the project approved by arguing that the cost of the replacement was actually being paid for by the savings realized from fewer administrative staff -- they still had ample savings (the project was less than 1 administrative FTE). But the efficiency gain from the project wasn't free on an ongoing basis.
  Too many businesses gain efficiencies and savings from automation, but assume these are permanent gains whose maintenance incurs no costs.
  I have an existing client with a large, internally developed kind of ERP system that supports a couple of thousand remote workers. The system is aging out (software versions, resources, performance issues all identified by their own internal developer) and of course the owner is balking at investing in it without realizing that the "free money" from reduced in-office staff needed to process faxes, etc, needs to be applied to maintaining the system to keep achieving the savings.
13. Re:Did they try... by will_die · 2017-06-02 03:06 · Score: 1
  
  They could easily be well under maximum power draw in normal usage since the switch-on surge can be multiple times larger.
14. Re:Did they try... by __aaclcg7560 · 2017-06-02 03:07 · Score: 1
  
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  I took a PC hardware course in college. The instructor had six "server" PCs on a cart plugged into a $5 power strip. He left the cart plugged in after the last class on Friday. An electrical storm passed through over the weekend. Monday morning he found a blackened power strip and six dead PCs.
  Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
15. Re:Did they try... by PPH · 2017-06-02 03:17 · Score: 2
  
  The overarching problem is quantifying the value of the functions being performed. You can compare the costs of a clerical staff versus that of an e-mail/calendaring system. But it's difficult to figure out in a business setting what these functions are worth.
  My boss would wail and cry over the inability to peruse the individual schedules of all of his minions for the purpose of calling yet another self-aggrandizing staff meeting. And he would assign a very high value to this function. But back in the 'old days', we just didn't have as many. And we got more work done.
  
  --
  Have gnu, will travel.
16. Re:Did they try... by jcr · 2017-06-02 03:24 · Score: 3, Insightful
  
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  Given the duration of the outage, I'd say that's a fair conclusion.
  -jcr
  
  --
  The only title of honor that a tyrant can grant is "Enemy of the State."
17. Re:Did they try... by mspohr · 2017-06-02 03:27 · Score: 1
  
  Isn't that how you fix Windows computers?
  
  --
  I don't read your sig. Why are you reading mine?
18. Re:Did they try... by I'm+New+Around+Here · 2017-06-02 03:27 · Score: 5, Funny
  
  Remember, kids, don't plug your "servers" into a $5 power strip and hope for the best.
  Yes, buy a Monster power strip for $50. It has gold plated vacuum tubes for efficient power control. ;^)
  
  --
  If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
19. Re:Did they try... by Drakonblayde · 2017-06-02 03:32 · Score: 5, Informative
  
  Not entirely true...
  So it depends on what kind of UPS you're employing. If it's the really big ones, you know, the ones the size of generators, then you don't plug stuff directly into them. They tend to be centralized and distribute power to PDUs that are in the racks themselves. The servers plug into the PDUs in the racks, and those PDUs have on/off switches. My fat ass has bumped the power switch on PDUs more than once trying to squeeze into tight spaces between racks. UPSs aren't employed to protect against human error; they're designed to protect against loss of main power.
  If your data center is small enough, you can get away with UPSs mounted in the rack and plug your servers directly into them, but when you're talking about scale, that's just not feasible or cost effective.
  Somehow I doubt British Airways' data center is of the 'couple of cabinets in a colo' variety, and they've probably got the big UPS setup.
  Most likely the fault lies with whoever architected the data center. I'll bet either there's very little room between the racks, or the PDUs are mounted in a way that they can be accidentally bumped (probably either mid rack or at the bottom). I personally have taken to mounting PDUs at the top of the rack on the backside just to minimize any potential human contact with them.
20. Re: Did they try... by LS1+Brains · 2017-06-02 03:33 · Score: 5, Insightful
  
  I've been in a few of these "career changing events" over the years. If I make a mistake, I step forward, take responsibility and fix the problem (if I can).
  
  As an IT Manager/Director, THANK YOU. Everyone screws up at some point, it's what you do after that really matters.
21. Re:Did they try... by Zocalo · 2017-06-02 03:34 · Score: 1
  
  Still unclear, but that's what BA claimed in their previous statement - physical damage to some of their hardware - although it was never made very clear whether they meant to the actual IT hardware (which was generally assumed, given the outage), the power supply hardware, or something else entirely. Given the power supply is now a major part of the cause of failure it could be almost anything, but if I had to guess I'd probably go with a physical failure somewhere between the incoming HV supply from the National Grid and the power distribution rails in the DC.
  
  Even before you consider the safety and procedural aspects of working on high-voltage electrical switchgear (permits to work, etc.), that's often hardware that cannot simply be swapped out on the fly, especially if a breaker or whatever has actually blown rather than tripped, which might also entail issues with actually getting a replacement to site in the first place or collateral damage to adjacent equipment. Plus, if they'd already determined they couldn't simply replace/reset the failed unit(s) and try to power on again without ending up with another failure, they'd need to look into powering stuff back on in stages. If they didn't properly understand which systems were dependent on which (lots of things can fail to start up cleanly if things like DNS and LDAP/AD are absent, for instance), then they'd have to get that data from the admins - who they outsourced to India last year...
  
  --
  UNIX? They're not even circumcised! Savages!
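
The staged power-on described in the comment above is essentially a dependency ordering problem. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows; the system names and the dependency map are hypothetical, chosen only to mirror the DNS and LDAP example in the comment.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each system lists what must already be up before it starts.
depends_on = {
    "dns": set(),
    "ntp": set(),
    "ldap": {"dns"},
    "database": {"dns", "ldap"},
    "booking-app": {"database", "ldap"},
}


def power_on_stages(depends_on):
    """Group systems into stages; everything in a stage can start together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all systems whose deps are up
        stages.append(ready)
        sorter.done(*ready)                 # mark this stage as running
    return stages


stages = power_on_stages(depends_on)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

The hard part in practice is the input, not the algorithm: the dependency map has to exist somewhere other than in the heads of the admins.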
22. Re:Did they try... by __aaclcg7560 · 2017-06-02 03:50 · Score: 1
  
  Yes, buy a Monster power strip for $50.
  That's exactly what the instructor did. :/
  I wasn't surprised. The class was meant to qualify students for the A+ certification. Never mind that you're supposed to have six months of work experience. A 16-week course doesn't cut it.
23. Re:Did they try... by swb · 2017-06-02 04:05 · Score: 1
  
  My sense is that maybe part of the problem is that the perceived value of the function when performed via automation declines.
  In the case of planning a meeting, if all you have to do is use the "suggest a time" function in Outlook and send invites it appears to be a not very valuable function unless you compare it to the 1 hour of human labor required to make calls, check calendars and manually determine a common time all attendees are available.
  For some outputs like email, they may actually be individually less valuable, especially when compared to a memo typed on letterhead which has been spell checked, corrected for grammar, and validated for its content (in terms of policy, etc, and now being an official communication, possibly even setting policy). Emails in comparison are slapdash, impulsive and it may take a string of them to achieve the content value that an original typed correspondence had.
  When it comes time to pay for maintenance on the automated system, the amount you pay for the maintenance appears too high because you now value the functionality far less because it seems so much less valuable.
24. Re:Did they try... by fuzzywig · 2017-06-02 04:10 · Score: 2
  
  Many BIOS (or EFI) setups have an option to delay hard drive spin-up, so the drives don't all demand spin-up power from the PSU at the same time.
25. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:15 · Score: 1
  
  [...] is how you maintain your Karma Houdini status.
  1) I've been on Slashdot for 18+ years and accumulated 9K+ in karma points.
  2) I'm consistently up voted more than I'm down voted by the mods.
26. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:18 · Score: 4, Funny
  
  Almost forgot... 3) Kiss my shiny metal ass!
27. Re:Did they try... by Zocalo · 2017-06-02 04:20 · Score: 1
  
  Probably next to none. BA certainly failed to learn any lessons from the various other organizations that have suffered similar problems when power was restored after a major outage, and even if the people responsible for other DCs that are susceptible take note of what went wrong for BA (I'm hoping they'll publish an incident report at some point), they've still got to consider whether it applies to their own systems and, if it does, convince bean counters that it's something that needs money to be spent on. As Archangel Michael notes in his reply to my post, "You can engineer for maximum efficiency/lowest cost or you can engineer for redundancy/max safety. Penny Pinchers always choose the former, and IT guys usually want the latter" which is perfectly true, because once it gets up to the bean counters it typically devolves into a simple case of risk:
  
  What is the probability of this happening? (P)
  What is the cost of impact (financial, reputation damage, compensation, etc.) if it does? (Ci)
  What is the cost of preventing this from happening? (Cp)
  
  If P*Ci < Cp then the IT guys are not getting their toys, but you can bet they'll get the blame if the gamble comes up short.
  
  --
  UNIX? They're not even circumcised! Savages!
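
The P*Ci versus Cp comparison in the comment above can be sketched in a few lines. All figures are invented for illustration; the second call reuses the same probability and prevention cost but a ten times lower impact estimate, to show how leaving business losses out of Ci flips the decision.

```python
def expected_loss(probability, impact_cost):
    """P * Ci: probability of the event times the cost if it happens."""
    return probability * impact_cost


def worth_preventing(probability, impact_cost, prevention_cost):
    """Spend on prevention only when P * Ci is at least Cp."""
    return expected_loss(probability, impact_cost) >= prevention_cost


# Hypothetical numbers: a 1% yearly chance of an outage, 500k to prevent it.
full_impact = worth_preventing(0.01, 100_000_000, 500_000)
underestimated_impact = worth_preventing(0.01, 10_000_000, 500_000)
print(full_impact, underestimated_impact)
```

The formula is only as good as the impact estimate fed into it, which is the point made earlier in the thread about unaccounted business losses.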
28. Re:Did they try... by WarlockD · 2017-06-02 04:48 · Score: 1
  
  Data is the biggest unaccounted-for asset of a business.
  I wish businesses realized this. I used to do these monthly all-nighters for Dell at this data center for a hospital. This data center was one of those "legacy" ones, built into a single-floor office off the side of a busy highway. They are at the ABSOLUTE LIMIT of power consumption and they cannot get more power. Just the act of switching on or off one rack is enough to pop the outside transformer, but because they are THE hospital in the city, they get away with it.
  The problem becomes that the UPS is the only thing keeping the whole thing from falling apart when load spikes. So every month, they do this "update/upgrade" where all the various vendors go in and we do firmware updates on only one machine at a time, make sure the UPS doesn't melt down, and replace/add any new systems.
  I mean, Christ, they still had a VAX-11/780 in there that I was ogling. This was 10 years ago, but they told me then they had only JUST shut it off. Not because they wanted to, but because they needed the extra power for a few more web app boxes. I never got a clear answer on why they can't move either.
  On the other hand, my grandpa owned a 30m a year distribution company. He knows nothing about IT, except that every time he skipped 4 years of upgrades, he paid double the next 4 years. So he learned, even back in the '70s. Do you only learn these lessons when you're the single owner? :P
29. Re: Did they try... by __aaclcg7560 · 2017-06-02 04:58 · Score: 2
  
  You're a middle-aged man counting Slashdot karma points?
  Nope. I'm a middle-aged man who wrote a Python script to scrape my comment history. I'll run the script and post the stats when I get home.
30. Re: Did they try... by jellomizer · 2017-06-02 05:00 · Score: 5, Insightful
  
  You mean for the executive who didn't approve the hot offsite failover solution?
  You know, the stuff that normal large organizations have to make sure their business can stay operational.
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
31. Re:Did they try... by Dunbal · 2017-06-02 05:12 · Score: 2
  
  Many electrical companies bill industrial customers based on PEAK power consumption, so it's in your interest to spread the load as widely as possible.
  
  --
  Seven puppies were harmed during the making of this post.
32. Re:Did they try... by zeugma-amp · 2017-06-02 05:22 · Score: 5, Interesting
  
  Many moons ago I was working in a datacenter and we had a crew in the hallway hanging wallpaper. In order to do so, they had to remove the box that normally covered the Emergency Power Cutoff switch (which was actually a Big Red Button) that would instantly drop power to the whole room. I'm sure you can guess where this is going...
  One of the paper hangers bumped into the BRB, and *poof*, there went the power to the room. In the data center, we were in the middle of a shift change. My coworkers and I were standing around discussing handoff, and whatnot. Suddenly, we heard a huge Boom as a crapload of switches tripped all around us. Then we heard the drives and fans spinning down.
  As I said, this was a ways back. In the data center we had 12 HP-3000/70 minicomputers, a couple of VAX 11/780s, and a water-cooled IBM 3090 mainframe that were our main systems in the room. The disk drives on the HPs were disk packs of 16" platters sitting in drives the size of a small washtub. They produced a lot of noise. Each HP3K had about 8 or 9 of these things daisy-chained behind the system itself.
  The room was loud. All the time. Well, when the power dropped, all those drives started to spin down. We were all just kind of standing around looking at each other, not knowing what had happened. You could hear the pitch of all those drives winding down, becoming a lower and lower note, until finally - silence.
  Simon and Garfunkel had a song called "The Sound of Silence" many years even further back in the dim reaches of time from when all this was taking place. This was the first instance in my life when I really understood what silence actually sounded like. It was eerie. You never heard silence in the computer room. You have UPS, generators and all kinds of other things to make sure you never actually hear silence.
  So, there we were, standing around with our mouths hanging open, and listening to the eerie silence. The moment broke, and we quickly determined what had happened. Rather than just cut the power back on, we went through and powered off all the drives and such so we could slowly bring everything back up in an orderly fashion.
  One thing that I learned that day was that HP-3000 minicomputers contain a battery designed to allow the things to ride through such catastrophes. Out of the twelve HPs, once we had powered back on all of the drives, nine of them just started executing their next instruction and continued on as if nothing whatsoever had happened. Three needed to be coldstarted, which wasn't a really big deal. Within 30 minutes or so of power being brutally disconnected we had all of them running smoothly, or at least on the way up.
  The two DECs weren't quite so resilient, but after checking their dirty disks, they came back up as well.
  An IBM 3090 does not like to have its power just cut off. It really doesn't. We ended up having issues and it took about 24 hours to return to normal operational status.
  The entire event was kind of cool to run through. Gave me a new respect for HP engineering. For many of our users, all they experienced was that their terminal froze for about 20 minutes, then continued on where it had stopped.
  I don't know if the paper hanger lost his job, but we lost several thousand user hours of time while they were sitting staring at their frozen terminals.
  It was certainly an interesting experience, and I'll never forget the Sound of Silence in the Computer Room.
  
  --
  This is an ex-parrot!
33. Re:Did they try... by Archangel+Michael · 2017-06-02 05:28 · Score: 1
  
  The sad thing is, I am pretty sure there is a VAX virtual machine out there that emulates a VAX on standard PC hardware and is actually more efficient than the real VAX.
  https://en.wikipedia.org/wiki/...
  Not sure how well it works.
  
  --
  Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
34. Re:Did they try... by Minupla · 2017-06-02 05:28 · Score: 1
  
  The good data centers have two PDUs, a 'red' and 'blue' one, plugged into different circuits, which run back to separate power distribution systems (transformers, UPS, generators, and through a switching system to multiple grids.)
  You then plug one of your power supplies into the Red line and one into the Blue line and are protected against any single "oops!" that doesn't involve the Coyote trying to catch the road runner level hi-jinks.
  It's hard for me to conceive of a situation where a 'biologic' as one of the IT managers I worked with called humans, could take out all of BA for a day by "turning off a switch".
  The closest example I can think of from my experience would be a cable plant in Calgary that went down because someone mis-architected the fire suppression system and a fire in the 'Red' electrical room tripped water fire suppression in the 'Blue' room. But that requires some serious bad Murphy mojo, and considerably more than BA has copped to so far. I think there's more to the story, although they have improved from the power surge story.
  Min
  
  --
  On the whole, I find that I prefer Slashdot posts to twitter ones because I don't get limited to 140 chars before
35. Re:Did they try... by ghoul · 2017-06-02 05:29 · Score: 4, Insightful
  
  Managers get paid to take the blame and the stress while workers get paid to do the work.
  
  --
  **Life is too short to be serious**
36. Re:Did they try... by sl3xd · 2017-06-02 05:41 · Score: 1
  
  The power supply for a data center rack lasts 5-10 seconds. It's good for accidental cord bumps, but that's about it.
  And that is making the huge assumption the nodes actually have power supplies - many data centers are straight DC power these days, which means they don't need a power supply.
  Generally there's one large UPS to power the entire data center (or building). I'm used to seeing several circuit panels filled with large breakers that are all after power has flowed from the UPS.
  The machines themselves are typically plugged into normal power sockets -- the power is already conditioned by the UPS. There might be a PDU to convert from 240V 3-phase down to 120V single phase; but they are little more than wire and switches (usually manual).
  Alternatively, the UPS puts straight 12VDC to every rack, with no PDU, and no PSU.
  So yeah, a guy who didn't know what he was doing could easily have shut down the breaker for the whole datacenter, panicked, switched power back on, and boom.
  
  --
  -- Sometimes you have to turn the lights off in order to see.
37. Re:Did they try... by drew_kime · 2017-06-02 05:45 · Score: 1
  
  This is why you leave penny pinching idiots out of the decision making, because when all you see is cost, and don't properly evaluate the catastrophic losses in event of disaster, then you're just an idiot that nobody should listen to.
  What catastrophic losses? They saved the fuel on all those flights that didn't happen, and I'm sure 90%+ of the tickets were non-refundable.
  Shit, if I ran an airline I'd "accidentally" shut down one holiday weekend per year.
  
  --
  Nope, no sig
38. Re:Did they try... by Ichijo · 2017-06-02 05:57 · Score: 1
  
  ...usually.
  
  --
  Any sufficiently unpopular but cohesive argument is indistinguishable from trolling.
39. Re: Did they try... by lactose99 · 2017-06-02 05:59 · Score: 2
  
  Resume line-item:
  - Single-handedly tested entire DR operation of British Airways
  
  --
  Fully licensed blockchain psychiatrist
40. Re: Did they try... by lactose99 · 2017-06-02 06:02 · Score: 3, Insightful
  
  "we didn't budget for that"
  "well does your budget include a multi-day downtime when the primary site goes offline?"
  "now how could the primary site possibly go offline?"
  Unfortunately I run into this far more than I should in this industry.
  
  --
  Fully licensed blockchain psychiatrist
41. Re:Did they try... by lactose99 · 2017-06-02 06:04 · Score: 3, Informative
  
  Even a brand new IT graduate knows computers should be plugged into UPS devices that protect against this.
  And you'd be surprised how many shops think this knowledge is someone else's problem and subsequently don't add it to any server installation docs, then look for a scapegoat when systems go tits-up like this.
  
  --
  Fully licensed blockchain psychiatrist
42. Re: Did they try... by __aaclcg7560 · 2017-06-02 06:36 · Score: 1
  
  Did you step forward and admit it was a bonehead thing to do, considering the importance of your imaginary legacy of stories that no one reads?
  I didn't think Slashdot was relevant to my personal brand. So I continued to use my user account from 18+ years ago. However, you asshats have shown me the errors of my way and I now see $$$ that I haven't seen before. Commenting daily on Slashdot is now part of my regular routine. Thank you.
43. Re: Did they try... by __aaclcg7560 · 2017-06-02 06:51 · Score: 1
  
  You don't have any Python coding experience on your GitHub, so you're obviously lying about coding a Python script, you fat liar.
  I haven't posted anything to GitHub yet. When I get done with the script, I'll post it on GitHub and submit it for consideration as Slashdot article.
44. Re:Did they try... by LinuxIsGarbage · 2017-06-02 07:11 · Score: 1
  
  Many electrical companies bill industrial customers based on PEAK power consumption, so it's in your interest to spread the load as widely as possible.
  Demand billing is usually based on the highest average 15-minute window. The effect of less than 1 second of inrush on demand billing is minimal. The bill is also a combination: part is based on demand (kVA), part on energy (kWh). Given that you were probably down for a portion of that 15-minute window, it would be almost impossible to make it up in inrush.
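
A back-of-the-envelope check of the claim above that a one-second inrush barely moves a 15-minute demand average. This sketch assumes one load sample per second, a sliding 900-sample window, and made-up loads in kW (a real tariff may bill kVA over fixed intervals instead); `peak_window_average` is a hypothetical helper, not any utility's method.

```python
def peak_window_average(samples_kw, window):
    """Highest average over any run of `window` consecutive samples."""
    if window <= 0 or window > len(samples_kw):
        raise ValueError("window must be between 1 and len(samples_kw)")
    total = sum(samples_kw[:window])  # sum of the first window
    best = total
    for i in range(window, len(samples_kw)):
        # Slide the window one sample: add the new one, drop the oldest.
        total += samples_kw[i] - samples_kw[i - window]
        best = max(best, total)
    return best / window


# One hour at a steady 100 kW, sampled once per second.
steady = [100] * 3600
# Same hour, but with a single one-second inrush spike to 500 kW.
with_inrush = list(steady)
with_inrush[1800] = 500

print(peak_window_average(steady, 900))
print(peak_window_average(with_inrush, 900))
```

Under these assumptions a fivefold spike lasting one second lifts the billed peak by well under one percent.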
45. Re:Did they try... by LinuxIsGarbage · 2017-06-02 07:14 · Score: 1
  
  Stromasys provides enterprise grade emulation of old DEC hardware. Very pricey licences too. Like $20K.
46. Re: Did they try... by rickb928 · 2017-06-02 07:14 · Score: 1
  
  Or, to put it simply, it doesn't matter WHY the power went out...
  
  --
  deleting the extra space after periods so i can stay relevant, yeah.
47. Re:Did they try... by lgw · 2017-06-02 07:14 · Score: 1
  
  Handling power outages is about as basic an IT task as they come. Basic lock-out practices that prevent power from accidentally being turned off are also Server Maintenance 101.
  For this to actually have been the cause means their IT organization was run by rank amateurs.
  Bet it was the UPS being serviced. I've seen that happen twice - guy plugs in some probe, causes a GFI, and the UPS goes down hard. And echoes with the sound of silence. Often all the UPSs in the DC are chained in some stupid way and they all go down hard.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
48. Re:Did they try... by ghoul · 2017-06-02 07:21 · Score: 1
  
  Loads of great Indian Computer folks work for outsourcing companies because outsourcing companies do their visas and give them a shot at the American Dream. Of course if they stick around after getting their GCs then questions on their competency start to arise. The model of the Indian Outsourcing companies is to get great folks (who would normally earn more than market wage) at market wage by doing their visas and GCs. They know the good people will leave but the process takes 6-10 years so they get 6-10 years of above market performance at market wages. The ones who don't prove themselves are sent back. Just imagine how good your job performance would be if the result of a bad appraisal was not just a missed promotion or a firing but a deportation. These folks live to work for 6-10 years. Once they get their GCs and move over to Product companies they have effectively 20 years of experience because they have been working 80 hour weeks for 10 years. No wonder they then outshine their peers who haven't had the immigrant experience. The average Indian is not smarter than the average American but the average Indian immigrant definitely is because there have been multiple levels of filtering to thin the herd. It's not surprising that the founders of Sun, the CEOs of Citibank, Google and Microsoft are all Indian immigrants.
  
  --
  **Life is too short to be serious**
49. Re:Did they try... by __aaclcg7560 · 2017-06-02 07:22 · Score: 1
  
  The only thing that is a bigger joke than an A+ certification is a school which offers training courses for the A+ calling itself a "college."
  
  Community colleges are more vocational-oriented in their missions. The one I attended had networking classes for the CCNA exam and those often had waiting lists for the waiting lists. That is until healthcare became the new money major after the dot com bust. The classes were cancelled after everyone and their grandparents switched to healthcare.
  
  [...] 16 weeks of hands-on work is far more than sufficient to pass that exam.
  48 course hours vs. 640 work hours. That's a bit lacking. I had the required two years of work experience when I took the A+ and Network+ exams back to back.
50. Re: Did they try... by slashrio · 2017-06-02 07:29 · Score: 1
  
  It's not the calories ingested that count, it's the calories stored without burning them that count.
  
  --
  "Trump!!", the new Godwin.
51. Re: Did they try... by tattood · 2017-06-02 08:10 · Score: 1
  
  Sad. Terrible.
  Thank you for your input, Mr. President.
  
  --
  WTB [sig], PST!!!
52. Re: Did they try... by sabri · 2017-06-02 08:17 · Score: 4, Interesting
  
  I pay people to not screw up so if you do I'm terminating you and finding someone competent.
  Which would be stupid. What do you think the chances are that this guy will repeat this mistake?
  
  Here is a story a friend of mine once told me. He was working on an AS migration of a major telco, when he made a big boo-boo causing a huge outage for hundreds of thousands of subscribers, making headline news. The next morning he got called into his boss's office, expecting to be fired. He was not. The reason why?
  
  His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
  
  --
  I'm not a complete idiot... Some parts are missing.
53. Re:Did they try... by snookiex · 2017-06-02 08:28 · Score: 1
  
  They did, but you know, there was a really long queue of Windows updates to be applied at startup...
  
  --
  Open Source Network Inventory for the masses! Kuwaiba
54. Re: Did they try... by xystren · 2017-06-02 08:28 · Score: 1
  
  If it was just a case of the power supply being turned off, isn't it of greater concern that it took so long to troubleshoot and remedy? What about redundant failover systems? It sounds like there was a lot more incompetence than just a single person.
  Again, I will admit, I wasn't there, have no idea how their infrastructure is set up, and so on and so forth. It just seems like there were many different failures on many different levels.
55. Re:Did they try... by DickBreath · 2017-06-02 08:42 · Score: 1
  
  So you are saying that there is a finite possibility that all those power supplies could randomly choose to start up in the same instant? Oh, my. :-)
  
  Of course, I don't run any hardware like that. So for me it's academic.
  
  --
  
  I'll see your senator, and I'll raise you two judges.
56. Re: Did they try... by rrohbeck · 2017-06-02 08:56 · Score: 1
  
  ... for the person who designed this single point of failure in.
  
  --
  thegodmovie.com - watch it
57. Re:Did they try... by I'm+New+Around+Here · 2017-06-02 08:57 · Score: 1
  
  Never mind that you're supposed to have six months of work experience.
  No, they recommend 6-12 months of experience. There's no requirement to have it. The reason they do this is that the average helpdesk tech is a barely-sentient moron, and they figure that if you can survive 6-12 months of grueling helpdesk monkey work, you can probably survive their 90 minute exam.
  You keep touting these certifications as if they're somehow impressive. They're not. Any baboon with two brain cells to rub together can learn what's required to pass these certs in a 16-week course.
  That's because the classes teach exactly what's on the exam, no more. It's the equivalent of a law school only teaching what's on the bar exam for a state.
  
  --
  If you think I voted for Trump because of this post, you're wrong. I voted for Dr. Jill Stein of the Green Party. Again.
58. Re:Did they try... by Dr.+Evil · 2017-06-02 09:05 · Score: 1
  
  I was hooking up a monitoring system to a mine system fire and environment panel.
  On the panel was a delicate 8-way toggle switch. If the switch was hit, a deluge water suppression system would cause irreparable damage to a major project. Hundreds of millions of dollars.
  Deactivating the alarm would require months of planning, signoff from multiple safety authorities, work stoppage and evacuation of major groups of personnel.
  It looked really simple to connect...
  I think you know how the story goes...
  I refused and they went with a webcam.
59. Re:Did they try... by dwywit · 2017-06-02 10:56 · Score: 1
  
  UK, so 230-240VAC, not 120. And each rack would have multiple 20 or even 32 amp supplies. Single phase, but not your domestic 10-amp circuit. A rack of traditional spinning rust would need that, perhaps not so much with SSD, but if this organisation has been cutting costs, I suspect SSD is a distant dream.
  *IF* one or more mainframes went down because "a contractor turned off the power", then your system design is, shall we say, not best practice. Mainframes are *hard* to shut down, and even a decent minicomputer has enough internal battery to allow a fast flush and preserve state before it goes quiet.
  Anyone who's worked with a mainframe and/or big data centre should be familiar with what's needed to 1. keep it running, 2. take it down gracefully and bring it back up, and 3. DON'T LET NON-CORE STAFF, i.e. contractors, anywhere near the core systems. IBM field staff would be the only exception.
  
  --
  They sentenced me to twenty years of boredom
60. Re: Did they try... by jabuzz · 2017-06-02 11:06 · Score: 1
  
  Never seen a PDU with an easy-to-trip switch. All had protective covers or required getting your pinky down the side to turn off. Don't cheap out on PDUs is the lesson there.
61. Re: Did they try... by jabuzz · 2017-06-02 11:14 · Score: 1
  
  Problem is peak power draw occurs when you apply mains to the PSU and it is charging the capacitors on the other side of the mains bridge rectifier. For example, an IBM TS3500 tape library that draws a couple of kW tops when running has an inrush current of well over 100A. Turning the tape library on always upset the UPS, which immediately decided to drop from line interactive to bypass mode, then a minute later decided to go back into line interactive mode when the power draw was now sane and clearly staying that way. It was not worth the $$$ to get the much more expensive UPS required to handle the tape library in line interactive mode during startup, which only happened during maintenance on the library, which was a couple of times a year tops.
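
One way to live with inrush like that is to stagger power-on. A minimal sketch follows, assuming you know each device's steady-state and inrush current; the `Device` class, the `plan_power_on` function, and all the amp figures are hypothetical, and the greedy grouping is just one simple strategy rather than anything a real PDU sequencer is known to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    running_amps: float  # steady-state draw once started
    inrush_amps: float   # brief peak draw at switch-on


def plan_power_on(devices, supply_limit_amps):
    """Group devices into sequential steps so that, at every step, the running
    load of everything already on plus the inrush of the devices being
    switched on stays within the supply limit."""
    steps = []
    running_total = 0.0
    pending = sorted(devices, key=lambda d: d.inrush_amps, reverse=True)
    while pending:
        step, step_inrush, step_running = [], 0.0, 0.0
        for dev in list(pending):
            if running_total + step_inrush + dev.inrush_amps <= supply_limit_amps:
                step.append(dev)
                step_inrush += dev.inrush_amps
                step_running += dev.running_amps
                pending.remove(dev)
        if not step:
            raise ValueError(f"{pending[0].name} cannot be started within the limit")
        steps.append([d.name for d in step])
        running_total += step_running
    return steps


# Illustrative figures only: switching everything on at once would need 230 A.
devices = [
    Device("tape_library", running_amps=8, inrush_amps=110),
    Device("rack_a", running_amps=20, inrush_amps=60),
    Device("rack_b", running_amps=20, inrush_amps=60),
]
plan = plan_power_on(devices, supply_limit_amps=150)
print(plan)
```

With a 150 A limit the tape library goes first on its own, and both racks follow once its draw has settled.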
62. Re: Did they try... by Razed+By+TV · 2017-06-02 11:40 · Score: 2
  
  His boss argued that this mistake made him more valuable, since he would not be making that mistake ever ever again.
  I believe there is wisdom in this, but there is a prerequisite.
  The person must have the capacity to learn.
  
  I currently have the pleasure of working with someone who must repeat the same mistakes before he learns from them.
  He breaks off, on average, one screw a month.
  It's always the same.
  *WHIRR*
  *SNAP*
  "Oh crap!"
63. Re: Did they try... by __aaclcg7560 · 2017-06-02 12:36 · Score: 1
  
  Pages Processed: 629, Comments (Accepted/Total): 9407/9424
  Oldest Date: 2008-08-04, Newest Date: 2017-06-02
  Scores (9254) | -1: 76, 0: 390, 1: 7037, 2: 1010, 3: 401, 4: 332, 5: 160
  Bonuses (1258) | Flamebait: 32, Funny: 301, Informative: 200, Insightful: 334, Interesting: 270, Offtopic: 47, Redundant: 11, Troll: 63
  Total Time: 00:16:19.00
64. Re: Did they try... by __aaclcg7560 · 2017-06-02 13:35 · Score: 1
  
  Negative mods should be deducted from the total.
  Good point.
  
  But I'm not sure why you'd penalize yourself twice for a -1, but not at all for a 0, which is also a down-mod based on your karma level.
  I'll double check -1 count and revise the 0 count.
  
  The 7037 "1" moderations suggest that you spent a long time shit-posting with low karma, and relied on having a shitload of posts out here to gain a little positive karma.
  Not every post is brilliant. Most of it was just snarky. People today just want to fight about every little detail in existence.
  
  From this, we can determine that you're a middle-aged man who wrote a very poor Python script to calculate numbers poorly.
  Much of the work was in the parser section. Someone wanted a SQLite database option.
65. Re:Did they try... by sysrammer · 2017-06-02 15:39 · Score: 2
  
  Hello darkness my old friend...
  
  --
  His ignorance covered the whole earth like a blanket, and there was hardly a hole in it anywhere. - Mark Twain
66. Re: Did they try... by __aaclcg7560 · 2017-06-02 18:21 · Score: 1
  
  You're a failure at almost everything you try, you know that?
  
  You only fail if you give up.
  
  Except failure itself.
  You can't learn without failing.
67. Re: Did they try... by Megol · 2017-06-02 23:06 · Score: 1
  
  Not by you, true. Frankly I don't want to work at "Mom's basement Inc." even if they get a competent leader...
68. Re: Did they try... by __aaclcg7560 · 2017-06-03 06:26 · Score: 1
  
  Only three lines of code got changed.
  
  Pages Processed: 630, Comments (Accepted/Total): 9423/9440
  Oldest Date: 2008-08-04, Newest Date: 2017-06-03
  Scores (12329) | -1: 76, 0: 390, 1: 7054, 2: 1010, 3: 401, 4: 332, 5: 160
  Bonuses ( 952) | Flamebait: 32, Funny: 301, Informative: 200, Insightful: 334, Interesting: 270, Offtopic: 47, Redundant: 11, Troll: 63
  Total Time: 00:12:43.00
69. Re: Did they try... by jeremyp · 2017-06-03 07:17 · Score: 1
  
  Wouldn't you be to blame for hiring somebody not competent in the first place?
  
  --
  All I want is a secure system where it's easy to do anything I want. Is that too much to ask ~~ Randall Munroe
Am I in the Matrix? by DontBeAMoran · 2017-06-02 02:03 · Score: 1

Because I'm having déjà vu.

--
#DeleteFacebook
1. Re:Am I in the Matrix? by Anonymous Coward · 2017-06-02 02:11 · Score: 2, Insightful
  
  The new article has more details.
2. Re:Am I in the Matrix? by nedlohs · 2017-06-02 02:11 · Score: 1
  
  If you can't see the difference in the two articles you have bigger problems than being in the matrix.
3. Re:Am I in the Matrix? by DontBeAMoran · 2017-06-02 02:20 · Score: 1
  
  The difference is British Airways shifting the blame to someone else.
  
  --
  #DeleteFacebook
4. Re:Am I in the Matrix? by nedlohs · 2017-06-02 02:31 · Score: 1
  
  Right, the story has been updated and so news websites (and sites that pretend to be news websites) post a new article about it. Slashdot is great at dupes, this isn't one though.
LOL by nospam007 · 2017-06-02 02:04 · Score: 1

Seems like this 'test' to see if the UPS would kick in didn't work.
So the CEO _should_ resign after all.
1. Re: LOL by haemish · 2017-06-02 02:13 · Score: 5, Insightful
  
  Right. It's not the poor guy that turned off the power supply. It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies and hired cheap labour that had no clue how to architect for fault tolerance.
2. Re:LOL by sycodon · 2017-06-02 02:35 · Score: 1
  
  He should resign because he apparently relied on a single UPS.
  
  --
  When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
3. Re: LOL by thegarbz · 2017-06-02 03:05 · Score: 5, Insightful
  
  who wouldn't let the engineers put in redundant power supplies
  That's an interesting assumption. Have you seen anything even remotely indicating that the data centre didn't have redundant power? No amount of redundancy has ever withstood some numbnuts pushing a button. But I'm interested to see your knowledge of the detailed design of this datacentre.
  Hell, we had an outage on a 6kV dual-fed sub the other day thanks to someone in another substation working on the wrong circuit. He was testing intertrips to a completely different substation and applied some power to an intertrip signal. Realising he had hit the wrong circuit (A), he immediately moved to the one he was supposed to do (B). Both were in the wrong cubicle, so he successfully knocked out both redundant feeds to a 6kV sub and took down a portion of the chemical plant in the process.
  Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
4. Re: LOL by edtice1559 · 2017-06-02 03:25 · Score: 1
  
  I agree with you that the OP's assumption reads a lot of world-view into a statement that is fairly light on facts. However, there is also probably some truth. A single data center failure shouldn't have caused such an outage. There should be a redundant data center. So, yes, it's reasonable for somebody to accidentally shut off all the power to a data center or for a natural disaster to wipe it out. But it's not reasonable for an operation the size of BA not to have redundant servers somewhere.
5. Re: LOL by gmack · 2017-06-02 03:33 · Score: 1
  
  The place I work at would have been fine with that scenario. They have two separate power feeds per rack going to two separate UPS systems going to two separate generators.
  There is no one switch that would take out the entire facility.
6. Re: LOL by chispito · 2017-06-02 03:50 · Score: 3, Insightful
  
  Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
  It isn't armchair engineering. The CEO should accept full responsibility because that's what it means to be at the top of the reporting chain when such a devastating preventable outage occurs. If he was misled by his direct reports, then he should fire them and take full responsibility for not firing them sooner. Maybe he resigns maybe he doesn't--the point is that he must own the failure, whatever the logical conclusion.
  
  --
  The Daddy casts sleep on the Baby. The Baby resists!
7. Re: LOL by Anne+Thwacks · 2017-06-02 04:01 · Score: 1
  
  I don't think the EPO should affect data centres on other continents. (Oh, there weren't any? Perhaps there was no justification for the CTO's salary then).
  
  --
  Sent from my ASR33 using ASCII
8. Re: LOL by DutchSter · 2017-06-02 04:43 · Score: 1
  
  There is no one switch that would take out the entire facility
  Then your facility may not be code compliant. Data centers are often required to have an emergency power off (EPO) button that does exactly this - kill power to the entire facility, or at the very least the room, across all sources and phases of power.
  I believe the NEC was updated to remove this requirement in some situations but data centers built to the earlier standards still have them. The NEC is technically voluntary; it has no force of law until it is adopted by a local authority having jurisdiction (AHJ), and that AHJ is free to only adopt portions of the code or modify the requirements.
  As an example a place I used to work has a small data center. When the NEC was updated it took three years before the city adopted that version of the NEC, and when they did they stated that the only way to remove the EPO button was to bring the entire facility up to the latest code, a very expensive undertaking.
9. Re: LOL by WarlockD · 2017-06-02 04:55 · Score: 1
  
  Not sure what's worse, managers who don't put in redundant power, or armchair engineers who just *assume* that they didn't because redundant power can't ever go out.
  One word: Marketing
  If a data center comes to you saying they have triple redundant power (battery-battery-generator), they can say to the manager, under the table, that they have no need for rack UPSs. Orders to save money, no matter what the cost, are the killer of IT.
  Hell, it probably is even WORSE if the airline owned the data center. With airlines being as cheap as they are, they are probably using a good UPS/battery setup, except it is 20 years old and not able to handle power spikes because they have built out to 2x its spec.
10. Re: LOL by gmack · 2017-06-02 04:57 · Score: 1
  
  The easy fix would be to put said button behind one of those "break glass in case of fire" type enclosures and have it cut both circuits. Luckily the requirements here don't seem to need them and the only big red buttons are for keeping the fire suppressant system from going off.
11. Re: LOL by Dunbal · 2017-06-02 05:20 · Score: 2
  
  The CEO should accept full responsibility
  Hah, the CEO is probably trying to figure out how to give himself more stock options now that they're cheaper. These greedy fuckers can never think past their multi-million payouts.
  
  --
  Seven puppies were harmed during the making of this post.
12. Re: LOL by sl3xd · 2017-06-02 05:44 · Score: 1
  
  It's the shit-for-brains managers who wouldn't let the engineers put in redundant power supplies
  WHY are you assuming that the data center even uses power supplies? Most of the better datacenters I've worked with are all 12VDC -- the racks & nodes have no power supplies at all.
  
  --
  -- Sometimes you have to turn the lights off in order to see.
13. Re: LOL by thegarbz · 2017-06-02 08:31 · Score: 1
  
  The place I work at would have been fine with that scenario. ... snip ... There is no one switch that would take out the entire facility
  I hope you're not talking about the scenario I just mentioned. If you are you may want to re-read it.
14. Re: LOL by thegarbz · 2017-06-02 08:34 · Score: 1
  
  The CEO should accept full responsibility
  He already is. He reports to shareholders and is there to deliver on their requirements. If they decide that this incident caused him to miss targets he may get some form of punishment. Chances are BA will just report another record profit for the year (like they have for the last couple) and he'll be named CEO of the year for raking in cash despite the problems.
  With overarching responsibility comes an overarching metric by which you're judged. Pissing off a few customers and having a bad day is not typically one of those metrics.
15. Re: LOL by gmack · 2017-06-02 08:39 · Score: 1
  
  We would be fine. The external feed would go down and the batteries would kick in until the generators start up.
  We are fully redundant internally so the issue described in the actual article would not affect us either.
16. Re: LOL by thegarbz · 2017-06-02 08:46 · Score: 1
  
  You may want to re-read it again.
  The person was working on a final piece of redundancy feeding equipment and then manually through his own mistake tripped both feeds.
  Saying "we would be fine" is just plain stupid on the face of it. Human error can take down the most redundant system you could possibly design.
17. Re: LOL by jabuzz · 2017-06-02 11:37 · Score: 1
  
  Would you like to provide a reference to that fictitious regulation please.
18. Re: LOL by ebvwfbw · 2017-06-03 06:33 · Score: 1
  
  Could have two power supplies, or more. It's fool proof. However that makes it only idiot resistant. Never underestimate the power of an idiot.
That still doesn't explain why by Anonymous Coward · 2017-06-02 02:10 · Score: 1

they didn't just switch over to their DR site.
1. Re:That still doesn't explain why by bill_mcgonigle · 2017-06-02 02:12 · Score: 2
  
  they didn't just switch over to their DR site.
  You forgot the mic drop.
  
  --
  My God, it's Full of Source!
  OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Bright side by SpaghettiPattern · 2017-06-02 02:10 · Score: 2

Floor got cleaned cheaply and everyone got home early. Long live outsourcing!
Of course I didn't RTFA! With respect to outsourcing there's no difference between daily tasks like cleaning and strategic planning. Both need to be done short and long term. I can understand outsourcing occasional tasks but daily and strategic stuff will always be needed. Outsourcing of those tasks is a sign of utterly bad management.

--

I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
1. Re:Bright side by avandesande · 2017-06-02 03:19 · Score: 1
  
  If you look at the timeline of events it's highly unlikely that outsourced individuals designed this fault.
  
  --
  love is just extroverted narcissism
2. Re:Bright side by SpaghettiPattern · 2017-06-03 05:28 · Score: 1
  
  At one point company towns used to have company doctors. Now we have hospitals.
  You don't seem to get that health isn't a company's core business -unless your business is a hospital/clinic/GP, etc...
  Outsourcing recurring tasks -cleaning, local production, etc...- that can employ people 100% is essentially bonkers. If buying services from a company is cheaper that means your managers mainly consume capital and don't pull their weight.
  Outsourcing your strategic analysis/development means you make it easier for your competitors to have access to your IP.
  The tough part of management is to set up a system whereby everyone and anyone continues performing and doesn't stagnate. Outsourcing is the easy way out.
  
  --
  
  I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I know the feeling by Anonymous Coward · 2017-06-02 02:11 · Score: 1

Been there, sort of done that.
Years ago I was in the basement of a 5-star hotel in South Africa, busiest time of the week, everyone was checking out, and I had to install a simple little Novell Netware to internet gateway machine, and there was one spare port on the power strip. Something shouted out in my head, "Don't put it in that one!", but I thought "The machine supplied tests fine, the cable is approved... what could possibly go..." *BLAM*, everything went down and took a few hours to get back up as the Netware "mirror" servers decided to argue about who comes up first. No idea why, something was wrong with the power strip in the rack I suppose.
Needless to say, I'd hate to be the poor chap who took down BA like that, might be a little hard finding work, unless it's retelling their story at a geek-comedy club.
Out of band by MikeB0Lton · 2017-06-02 02:11 · Score: 1

I guess it cost too much to add monitoring and remote management.
N+1 guess not by silas_moeckel · 2017-06-02 02:13 · Score: 3, Insightful

So it was all running in a single DC with a single power bus? Plenty of room at real datacenters; they need to stop running out of a closet somewhere.

--
No sir I dont like it.
Re:How is this a thing by queazocotal · 2017-06-02 02:13 · Score: 1

That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
Yeah, yeah... blame the contractor... by __aaclcg7560 · 2017-06-02 02:14 · Score: 5, Insightful

This is human error because a contractor accidentally turned off a power supply that caused a world-wide outage? It should be operational error for allowing such a single point of failure to exist.
1. Re:Yeah, yeah... blame the contractor... by __aaclcg7560 · 2017-06-02 02:40 · Score: 1
  
  Design error you probably mean?
  Operational error. Backup systems need to be periodically checked to see if they are still working as designed. If the backup system got tested and failed to work, then it would be a design error.
What the heck does this switch do? by JoeyRox · 2017-06-02 02:14 · Score: 4, Funny

Not sure, Bob - just flip it so that we can go get some lunch. I'm starving.
1. Re:What the heck does this switch do? by interkin3tic · 2017-06-02 02:59 · Score: 1
  
  Maybe it was more like this?
2. Re:What the heck does this switch do? by LordWabbit2 · 2017-06-02 03:21 · Score: 2
  
  Heh, you joke, but we had a server in our server room no one was using any more. It was underpowered (i.e. old), we had all gotten our stuff off of it, and thought we might as well shut it down. So we did. Got a call a couple days later from across the country, "WTF happened to our XYZ?". So we switched it on again. No one knew wtf they were doing on/with the server, and our manager didn't even try to find out, he just said "Well leave it on then". It's probably still sitting there quietly doing whatever the fuck it was doing before.
  
  --
  There are three kinds of falsehood: the first is a 'fib,' the second is a downright lie, and the third is statistics.
Stephen Stucker unavailable for comment by RogueWarrior65 · 2017-06-02 02:15 · Score: 2

"Just kidding!"
Root Cause by FerociousFerret · 2017-06-02 02:17 · Score: 1

I found the culprit: https://youtu.be/9WYGdstEVJQ?t...
Re:How is this a thing by __aaclcg7560 · 2017-06-02 02:17 · Score: 1

That doesn't help if there is one master switch, in case of (for example) fire, and he activated it.
More like an extension cord stretched across a busy walkway just waiting for someone to trip on it.
What they MEANT to say is that. . . by Salgak1 · 2017-06-02 02:18 · Score: 1

. . . . the power was turned off by a FORMER contractor.
Then again, BA probably promoted him to executive VP.. .
Human Error? Sue but Still... by bobbied · 2017-06-02 02:18 · Score: 1

Human Error accounts for 99% of actual power outages in my experience. It's ALWAYS some idiot throwing the wrong switch, unplugging the wrong thing, yanking the wrong wires or spilling something in the wrong place...
You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
That being said... For a mission-critical system in a multi-million dollar company like BA, where was the backup site in a different geographic location, configured to take over in the not-so-uncommon event of an outage? I don't care if it WAS a human that messed up and turned everything off, you need a contingency plan to deal with such things. Why? Because outages WILL happen no matter how much engineering and resources you pile into your primary location.

--
"File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
1. Re:Human Error? Sue but Still... by ghoul · 2017-06-02 02:33 · Score: 1
  
  They had an offsite DR. The DR was set up wrong and did not have the latest data, so when they switched to it they started seeing wrong data and had to switch it off.
  
  --
  **Life is too short to be serious**
2. Re:Human Error? Sue but Still... by PolygamousRanchKid+ · 2017-06-02 02:45 · Score: 1
  
  You simply cannot engineer around stupid well enough to fix it, regardless of how hard you try..
  "Nothing can be made foolproof . . . because fools are so ingenious."
  
  --
  Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
3. Re:Human Error? Sue but Still... by HornWumpus · 2017-06-02 03:03 · Score: 1
  
  They thought they had DR. They were wrong.
  Somebody responsible should have signed off on the plans and routine testing schedule for that. It is a key job responsibility.
  
  --
  John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
4. Re:Human Error? Sue but Still... by turbidostato · 2017-06-02 03:14 · Score: 1
  
  "They had an offsite DR. The DR was setup wrong and did not have the latest data so when they switched to it they started seeing wrong data and had to switch it off"
  In the kind of companies that outsource the hell out to save some pennies there are two and only two types of highly available systems:
  1) Active/Passive. When the active goes nuts the passive fails to start for whatever reasons (tightly coupled to the fact that the system was tested exactly once, in the happy path, when given the "operational" label -and then only partly: there were minor flaws but everybody covered it up because the project was already late and they were "oh, not so important").
  2) Active/Active. When part of the system fails, the surviving part can't handle the load and falls to its knees too because nobody did a proper load assessment nor capacity planning (it would have outgrown the promised costs and some manager would have looked bad).
  In both cases, bonus points when they reach a split-brain situation and start killing each other left and right without ever reaching quorum.
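
The load assessment for the Active/Active case can be boiled down to one check. This is a minimal sketch under the assumption that capacity and load are measured in the same unit (say, requests per second); the function name `survives_site_loss` and the figures are hypothetical.

```python
def survives_site_loss(site_capacities, peak_load):
    """Return True if the remaining sites can carry `peak_load` after losing
    the single largest site (the worst single failure)."""
    if len(site_capacities) < 2:
        return False  # nothing left to fail over to
    remaining = sum(site_capacities) - max(site_capacities)
    return remaining >= peak_load


# Two sites sized for "half the load plus a bit" fall over together...
undersized = survives_site_loss([1000, 1000], peak_load=1400)
# ...while two sites each sized for the full peak survive losing one.
properly_sized = survives_site_loss([1500, 1500], peak_load=1400)
print(undersized, properly_sized)
```

The uncomfortable consequence is that a two-site Active/Active setup needs each site sized for the whole peak, which is exactly the cost that tends to get cut.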
5. Re:Human Error? Sue but Still... by ghoul · 2017-06-02 05:37 · Score: 1
  
  That's why you always have a third site to be the tie-breaker vote. The third site can be very simple: a PC in a manager's office somewhere. It is not meant to run any load, just to be a tie breaker with independent power and internet connections to both data centers. A regular check job should run to make sure the PC is up and running. It can be down for short periods, just as long as you don't have the bad luck of having a DR shutdown AND restart during the time the PC is down. Shutdown itself will be fine, as when one of the actual centers is down whatever the DR says goes. The vote is needed only when you restart the primary and need to resync.
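
A minimal sketch of the tie-breaker idea, assuming three votes (two data centers plus the witness PC) and a strict-majority rule. Note that strict majority is stricter than what is described above: here a site that can see neither its peer nor the witness stops, rather than carrying on. The function names are hypothetical.

```python
def has_quorum(reachable_votes, total_votes=3):
    """A partition may act as primary only if it holds a strict majority."""
    return reachable_votes > total_votes // 2


def may_become_primary(sees_peer_site, sees_witness):
    """Decision for one data center; its own vote always counts."""
    votes = 1 + int(sees_peer_site) + int(sees_witness)
    return has_quorum(votes)


# Link between the two data centers is cut; only site A can reach the witness.
site_a = may_become_primary(sees_peer_site=False, sees_witness=True)
site_b = may_become_primary(sees_peer_site=False, sees_witness=False)
print(site_a, site_b)
```

Exactly one side wins, so the two sites cannot both decide they are primary, which is the split-brain case the witness is there to prevent.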
  
  --
  **Life is too short to be serious**
6. Re:Human Error? Sue but Still... by ghoul · 2017-06-02 05:40 · Score: 1
  
  Screwups can also happen because you did not outsource to the experts and instead tried to use your internal techs, who don't understand DR and have never implemented it, to try and set up the DR. Bonus points if the company got tax breaks for employing out of work non tech workers in an IT retraining program and had them do the DR. Outsourcing is not always the problem.
  
  --
  **Life is too short to be serious**
not the contractor's fault by ooloorie · 2017-06-02 02:20 · Score: 4, Insightful

When your business depends on your IT infrastructure like that, turning off the power to a single machine or data center shouldn't bring down your operation; that's just stupid and bad design. Good enterprise software provides resilience, automatic failover, and geographically distributed operations. Companies need to use that.
And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
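
As a rough illustration of such a drill, here is a small Python simulation, assuming you keep a map of which services have replicas at which sites. The service and site names and both functions are made up for the example; a real drill would power things down rather than just compute the result.

```python
import random
from typing import Dict, List, Set, Tuple


def services_lost(placement: Dict[str, Set[str]], failed_site: str) -> List[str]:
    """List services that have no replica left once failed_site is down."""
    return sorted(
        service
        for service, sites in placement.items()
        if not (sites - {failed_site})
    )


def run_drill(placement: Dict[str, Set[str]], rng: random.Random) -> Tuple[str, List[str]]:
    """Pick one site at random and report which services would not survive."""
    all_sites = sorted(set().union(*placement.values()))
    victim = rng.choice(all_sites)
    return victim, services_lost(placement, victim)


# Hypothetical placement: "baggage" only lives in one data center.
placement = {"booking": {"dc-a", "dc-b"}, "baggage": {"dc-a"}}
victim, lost = run_drill(placement, random.Random())
print(f"shut down {victim}; services lost: {lost or 'none'}")
```

If the drill ever reports a lost service, that is the single point of failure to fix before a contractor finds it for you.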
1. Re:not the contractor's fault by crow · 2017-06-02 03:05 · Score: 1
  
  Systems in a data center should have two different power systems. The contractor shut one of them down to do some work. That should have been fine. I would guess that the work was to replace or repair some of the power infrastructure. The most likely situation here is that the contractor switched off the wrong one, and the correct one was already off (possibly due to the failure for which the contractor was called in the first place, or else someone had already shut it off for him).
  Process errors like this are, unfortunately, all too common. I've heard stories of service people replacing redundant parts pulling out the good one by mistake and crashing entire systems, so it's not surprising that this could happen on a larger scale. I know my employer has worked very hard to adjust processes to minimize this type of mistake.
2. Re:not the contractor's fault by h4ck7h3p14n37 · 2017-06-02 03:14 · Score: 1
  
  And they should actually have tests every few months where they do shut down parts of their infrastructure randomly.
  A place I used to work at had an outage at their main data center because of a scheduled test of the power system.
  The week prior they had an outage because the APC battery-backup system in the server room developed a short. One of the engineers flipped the bypass switch while the APC tech was fixing the problem.
  There was a diesel generator next to the building that was used in case the building itself lost its electrical connection and a power-on test was performed every two weeks.
  Sure enough, the power-on test was forgotten and it ran while the server room's battery backup was bypassed knocking them out again. The admins didn't practice good configuration management, so people were having to log into individual systems to bring software up after the outages.
3. Re:not the contractor's fault by ooloorie · 2017-06-02 04:12 · Score: 1
  
  Systems in a data center should have two different power systems.
  Of course. But in addition to that, a global corporation like BA needs redundant data centers as well. There are many ways in which a single geographic location can be knocked off line even with redundant power systems.
Simple case of CEO GREED .. by Anonymous Coward · 2017-06-02 02:23 · Score: 1

He gutted the knowledgeable staff and replaced with inexperienced outsourced help.
Incoming power would/should have been the first thing checked.
Re:How is this a thing by queazocotal · 2017-06-02 02:24 · Score: 1

At the moment, there is no reasonable way to tell between various scenarios.
It could go all the way from 'worker pressed big red button despite being told not to, signs telling him not to, and having signed an agreement not to', to 'worker followed what they believed was procedure and did what 99% of people would have done', to 'worker did precisely as instructed and are being scapegoated'.
This is the ultimate single point of failure. by SuperKendall · 2017-06-02 02:24 · Score: 1

The first thing I think of is anything happening at that location - flood, bomb, larger grid outage lasting more than a day or so - and BA is finished.
Heck if you were a terrorist now you know exactly where to attack that would truly hose an entire company that brings in a lot of money (and people) to England...

--
"There is more worth loving than we have strength to love." - Brian Jay Stanley
Mental picture from the movie 'Airplane' by DirkDaring · 2017-06-02 02:26 · Score: 2

of Johnny unplugging the extension cord from the wall and the lights on the runway going out. "Just kidding!"
https://datacenteroverlords.files.wordpress.com/2017/01/airplane.jpg
Yeah, sure by Rik+Sweeney · 2017-06-02 02:27 · Score: 1

This is just one step up from the cleaner killing a patient because they unplugged the life support machine to vacuum in the room.
Pull the other one, it's got bells on it.

--
Summation 2
Why the power went out is unimportant by minus9 · 2017-06-02 02:35 · Score: 1

The more important question is why it took the best part of two days to get things up and running again.

As for the power outage - A UPS test to check if power transferred to battery/generator that failed maybe?
1. Re:Why the power went out is unimportant by HornWumpus · 2017-06-02 03:06 · Score: 1
  
  Ship a tape to the backup site and wait for it to restore. No online backups at that site.
  
  --
  John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
Re:How is this a thing by sycodon · 2017-06-02 02:37 · Score: 1

Switches such as that should be locked out, requiring multiple people to allow access.
If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.

--
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
A bigger boy did it and ran away... by JonnyCalcutta · 2017-06-02 02:38 · Score: 1

Sounds like a load of baloney to me and really explains nothing. Sounds, in fact, like a cover up from someone who doesn't understand the implications of their lie.
It still doesn't explain why everything went down so catastrophically. Why was there only one power source? What about back up servers and other redundant systems? Why was it so easy for a contractor to switch the power off? Was he following procedure? What about redundancy? Why couldn't he just switch it back on again (I know, but if it's such a simple system that it doesn't need redundancy then surely switching it back on would fix it). What about redundancy?
At the end of the day, unless the contractor was working way outside allowed procedures - e.g. deliberately switching it off for a laugh - then the fault lies way over his head.
(I know I'm preaching to the converted here - it just grinds my gears)
1. Re:A bigger boy did it and ran away... by gweihir · 2017-06-02 05:07 · Score: 1
  
  Indeed. You can _not_ kill a properly set-up architecture with one switch or one action. You cannot even kill it from one site (unless you are the admin and are doing it deliberately).
  This is dishonest, incompetent and greedy management fighting for survival with every lie they can find. The fault is >99% with them.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:How is this a thing by Archangel+Michael · 2017-06-02 02:44 · Score: 3, Funny

Worker: The sign says "Do not use"
Manager: I don't care what it says, flip the switch
Worker: That's a really stupid idea
Manager: Do it, or you're fired
Worker:
Manager: Well, now you really screwed things up, you're fired!

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
How does one DR test in a 24/7 business? by zerofoo · 2017-06-02 02:45 · Score: 4, Interesting

I've worked in banking and real estate businesses where we had the luxury of being able to DR failover test things like redundant databases, WAN connections, power supplies...etc - knowing that if something failed we had time to put it back together - before the business and customers would notice the outage.
How does one actually fail-over test things in production in a 24/7 business - especially one that spans time zones all across the world?
Are lab simulations simply enough? I've never seen a lab environment that could truly replicate a production environment.
1. Re:How does one DR test in a 24/7 business? by silas_moeckel · 2017-06-02 02:53 · Score: 4, Insightful
  
  You do it in production because none of it should cause a massive failure. They bought a DR site and failed to test it. Working at some big shops the DR site was prod every other quarter.
  
  --
  No sir I dont like it.
2. Re:How does one DR test in a 24/7 business? by will_die · 2017-06-02 03:12 · Score: 2
  
  We actually test ours by failing over portions each month and making sure everything works.
  For a smaller place I worked which had a limited DR (not everything failed over), parts were tested on a monthly basis, and everything was tested yearly with a planned failure that was also to ensure the users had training.
  Some DR stuff also now is really nice in that when you tell it to self-test it creates a separate network so you can test the installation at the COOP site.
3. Re:How does one DR test in a 24/7 business? by jader3rd · 2017-06-02 03:34 · Score: 2
  
  How does one actually fail-over test things in production in a 24/7 business
  You eliminate any distinction between maintenance operations and DR. The redundant systems should behave the same during upgrade/patching of one of the nodes, a disk dying on one of the nodes, a node hosting active client connections has its NIC die, having a rack die, having the WAN cut, having the entire datacenter lose power, etc.
  If the underlying redundancy system doesn't significantly differentiate discretionary failover operations from DR failover situations, you can run a 24/7 system.
  See Exchange Database Availability Groups as an example.
4. Re:How does one DR test in a 24/7 business? by freeze128 · 2017-06-02 03:51 · Score: 1
  
  Easy. You schedule an outage.
5. Re:How does one DR test in a 24/7 business? by gweihir · 2017-06-02 05:04 · Score: 1
  
  Simple. Use the redundant site for that and make sure the primary is stable before. This is not rocket-science. Competent IT organizations with competent CEOs do this regularly.
  
  --
  Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
6. Re:How does one DR test in a 24/7 business? by Bob+the+Super+Hamste · 2017-06-02 05:40 · Score: 2
  
  Easily. Regularly switching to the backup site should be done as part of the day-to-day business operations. For example at my job I work with a company that will switch daily between the main and backup system. It doesn't hurt that the main and backup are running in a hot standby configuration and the backup can take over at a moment's notice. They also have 2 additional systems for further levels of redundancy. One is a system that they do a system restore to each day (the previous backup of the main system) that is sitting warm and the other is a cold system where they do a weekly restore from a previous backup of the main system. As the switch-over, as well as the recovery, is done daily as part of regular operations it isn't an issue and everyone there knows what to do. This is for a piece of critical infrastructure which is why there is that level of redundancy, as well as many others to ensure a 99.999% uptime of the system, but it shows that it is possible to have the requisite uptime with a properly designed system and processes.
  
  If you are worried about testing switching to a backup site on a 24/7 system you should also be worried about hardware failures and patches to that same system as those also require outages that you say can't happen as you obviously don't have a system with the required levels of redundancy and are lacking in recovery ability.
  
  --
  Time to offend someone
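For a sense of scale, the 99.999% uptime figure mentioned in the replies above leaves roughly 5.3 minutes of downtime per year. A small Python sketch of that arithmetic, assuming a 365-day year (the function name is just for illustration):

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

Against a budget that small, an outage lasting the best part of two days uses up several centuries' worth of allowance, which is why the switch-over has to be routine rather than an emergency procedure.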
Blame the Worker for Management's Incompetence by boulat · 2017-06-02 02:45 · Score: 1

It was not working perfectly at all - there was a single point of failure, poor design with no redundancy responsible for critical infrastructure, clearly approved by senior management.
So no, it wasn't a contractor responsible for the outage. It was the CEO who did not ensure there were redundancies in place on critical infrastructure, business continuity was not tested and disaster recovery was not a thing.
Re:How is this a thing by queazocotal · 2017-06-02 02:46 · Score: 1

'flick it off' may include 'opened the interlocks and keyed in the code as he believed he was doing the correct thing'.
This could be a personal failure due to stupidity, a training failure, or he was in fact instructed to turn it off, and though he protested, is now getting scapegoated.
Did they try to turn it off and on again? by grumpy-cowboy · 2017-06-02 02:51 · Score: 1

:)

--
Will $CURRENT_YEAR be the year of the Linux Desktop?
You can only idiot-proof so much by ErichTheRed · 2017-06-02 02:52 · Score: 1

It's good practice to make things so simple that no one could possibly mess them up. It works in programming - look at how many JavaScript frameworks abstract an already sandboxed development environment to a point where "signalling intent" is basically all the developer needs to do. Or in hardware -- we're using HPE servers and there is literally a "don't remove this drive" light that comes on when a drive fails in a RAID set. That had to be a customer-requested change after one too many data-loss events stemming from someone replacing the wrong disk.
But at some point, all that abstraction meets the real world and the man behind the curtain really does need full control of whatever system they're in charge of. A favorite example of mine is a project we're doing in Azure -- the developers have full faith in the magic box that will never fail and is so simple that we don't need to know how it works. Sure we might not need to know the exact implementation details, but it doesn't absolve you from knowing what is and is not possible in the realm of compute, network and storage combinations. I've dealt with support tickets where the Microsoft personnel are quite obviously looking in whatever monster Hyper-V / SCVMM console is controlling their back-end to solve something complex.
At the data center level, you can only idiot-proof so much. Some operations person is actually going to need to control the system directly at some point and have access to the Big Red Lever. You can put a million fail-safes in place to avoid routine problems, but when automatic processes fail, you need at least one smart person who knows _everything_ that could go wrong. When you outsource this function to the lowest bidder, don't expect to get a super-genius in that role. Your typical body shop outsourcer isn't going to pay people enough to stay on to learn the ins and outs of an environment.
Re:Insurer will be unhappy by 91degrees · 2017-06-02 03:00 · Score: 1

That's their hard luck. But if it wasn't for accidents like this, they wouldn't have a business so they're not really in a position to complain.
Re:How is this a thing by Required+Snark · 2017-06-02 03:03 · Score: 1

There is one thing we do know: ultimately this was a management failure, not a tech/operations failure. A cascade failure from a single point is bad, but it inevitably follows from bad management. Read about the decision making before Fukushima or the Challenger disaster for examples. Someone always speaks up at some point before it all goes up in flames, and they are ignored.
We also know one other thing: no one up in management will accept responsibility. All upper managers will be shielded from personal responsibility so their reputations and wealth will be preserved. Even if they retire early, they will never see their retirement be reduced because they screwed up. It will always be the case that they will be able to go out and get new positions, often as overpaid consulting parasites.
Everyone else, employees, customers, stockholders will lose out, but the insiders will barely feel a bump.

--
Why is Snark Required?
Re:How is this a thing by queazocotal · 2017-06-02 03:13 · Score: 1

No, it isn't.
Sometimes someone, despite proper training, management, and instruction does something that goes against all of that training, to the point that no reasonable person given the same instruction and training would have done the same thing.
In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
Re:How is this a thing by nedlohs · 2017-06-02 03:19 · Score: 1

If you have a switch like that accessible so that just anyone can flick it off, you are an idiot.
If you don't you are probably in violation of the local fire codes. Though the 2011 updates to the NEC did remove the "shall be readily accessible at the principal exit door" language from the emergency power off requirements instead allowing "shall be located at approved locations readily accessible in case of fire to authorized personnel and emergency responders", so a bunch of jurisdictions will just be using that by now and not requiring the big red button at the exit... You can do away with it entirely if you check a bunch of other boxes.
Of course their data center probably isn't in the US in the first place, so they likely live under a whole different set of requirements for the off switch.
he is called to the CEO's office by tommeke100 · 2017-06-02 03:22 · Score: 1

contractor: "so, I guess I'm pretty much done with this company right?"
CEO: "Not at all! We just spent 1 billion $ educating you!"
contractor in tears: "oh thank you"
CEO: "I was joking, dumbass. This is the real world. You're fired and we're going to sue you for 2 billion $".
Re:How is this a thing by edtice1559 · 2017-06-02 03:27 · Score: 1

As I've pointed out earlier, they should have been able to fail-over to another data center. So the fact that they didn't have these procedures and/or hadn't tested them is a management failure. The localized problem, though, should not be blamed on management.
Keep your core competencies in house by raymorris · 2017-06-02 03:29 · Score: 2

What definitely needs to be done in-house is whatever your company is supposed to be good at. Ford designs and assembles cars - they shouldn't outsource the design and assembly of cars because that's what they DO - if they stop making cars, they are no longer doing anything and have no reason to exist. Ford is not in the business of making cleaning products, so they probably shouldn't make the cleaning products they use. They should outsource that, buying cleaning products from SC Johnson or someone. Ford is not in the business of cleaning carpets, so that's also a candidate for outsourcing.
Once you have a list of items that can be outsourced because they aren't your "core competencies", the "make or buy" decision becomes mostly a matter of arithmetic. For the same budget cost, will you get it done better by hiring people to do it, or by hiring a company to do it? Equivalently, for the same level of quality, does it cost less to pay in-house people to do it or to pay an outside source? Probably, you'll find that it's better to get an operating system from an outside source, not make your own.
While there is no hard and fast rule, a rule of thumb is to consider the company next door. If you could easily buy the same product or service from the same vendor that the company next door uses, and it would serve your purpose, you should probably do so. General purpose things like office supplies, office cleaning, and payroll services should be purchased, not manufactured in house, because there is no competitive advantage to be gained from having better office supplies than the other company.
Single point of failure? by mspohr · 2017-06-02 03:32 · Score: 1

Shit happens and most competent companies plan for it by having redundant live backup systems.
I can't believe that BA didn't have a live backup system at another site to fail over to.
Really, this costs money but these cheap bastards don't seem to have a clue.

--
I don't read your sig. Why are you reading mine?
1. Re:Single point of failure? by JustNiz · 2017-06-02 03:55 · Score: 1
  
  BA totally sucks ass.
  I recently spent 3 hours trying to get what I needed from the insanity that is their website booking system, and finally gave up in frustration, so I phoned them, was kept on long hold multiple times and repeatedly asked for the same info, then was told it would cost extra just because I was trying to book over the phone.
  They have so many stupid, apparently arbitrary and entirely arcane rules about seat pricing and availability depending on which currency you pay in (hint: far fewer available seats and far more expensive if you try to pay in UK pounds). They don't even accept their own travellers' points (Avios) in many circumstances. Their internet pricing keeps going up every time you quit and start over, which you will need to do multiple times just to work out the arcane tricks to get what you want from their system. I can't believe BA are even still in business.
  Judging from their awful customer relations and hideously unfriendly/buggy website I'd be surprised if it wasn't all running off a single laptop on some offshoring company's desk in Mumbai.
No HA? by elistan · 2017-06-02 03:33 · Score: 2

Business critical systems should operate in an active/active high-availability scenario in at least two separate locations. That way the loss of any one node has zero effect except perhaps a transaction retry and reduced performance.

Systems of the next lower level of criticality should have real-time replication to a separate location, so that if a node fails the recovery time is simply what it takes to boot the replacement node.

At further lower levels of criticality you start getting into things like virtualization clusters to mitigate hardware failures, supported by point-in-time backups to mitigate data failures. The IT department's Minecraft server can just be a spare desktop machine sitting on an admin's desk.

(There are additional considerations for all levels of criticality too, of course, like SAN volume snapshots, and backups too of course.)
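
The active/active case above only holds if the surviving locations can carry the whole load on their own. A minimal Python sketch of that check, assuming you know each site's capacity and the peak load in the same units (requests per second, say); the function and site names are invented for the example:

```python
from typing import Dict, List


def sites_that_cannot_fail(capacity: Dict[str, float], peak_load: float) -> List[str]:
    """Return the sites whose loss would leave less capacity than peak_load."""
    total = sum(capacity.values())
    return sorted(
        site for site, cap in capacity.items() if total - cap < peak_load
    )


# Two equal sites, each sized for a bit more than half the peak:
# losing either one leaves 100 units against a peak of 120.
print(sites_that_cannot_fail({"dc-a": 100, "dc-b": 100}, peak_load=120))
```

An empty result is what "zero effect except reduced performance" needs to mean; anything else is active/active in name only.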
Oh really? by fustakrakich · 2017-06-02 03:46 · Score: 1

The janitor just tripped over the extension cord?

--
“He’s not deformed, he’s just drunk!”
Headline is incorrect by p51d007 · 2017-06-02 03:55 · Score: 1

Should be: "British Airways IT outage caused by FORMER CONTRACTOR who accidentally switched off the power".
They must have a really shitty IT ... by Qbertino · 2017-06-02 04:04 · Score: 1

... setup if their entire order processing can be turned off by a single guy.
I wouldn't even feel guilty if this happened to me. I'd just be surprised and say "Whooops ... guess that was the wrong switch/command/ansible script/whatever procedure."

--
We suffer more in our imagination than in reality. - Seneca
Re:How is this a thing by thsths · 2017-06-02 04:16 · Score: 2

> In some cases you actually do need emergency global 'off' switches that are never meant to be used in normal operation.
Yes, if you run a simple experiment, and there is the possibility for harm, a single red button is a good idea.
But if shutting down the server room costs $100 000 000, then a single red button is not a good idea. Instead, you have two parallel power distribution systems, with some physical separation, and there are two off switches. Of course there should be a sign that explains how to use the switch, and I guess that is where this story eventually leads.
Oops by TheDarkener · 2017-06-02 04:16 · Score: 2

(Hopefully) an honest, albeit very consequential mistake. I've done the same thing when I was working on the backside of a server cabinet - the PDU was right there by my shoulder and I swiped it on accident. No UPS in the cabinet (a mistake not of my own but the ones who built it out). Fortunately everything came back on. Good thing to have BIOS settings to 'stay off' after a power failure (so you can turn them back on individually and not overdraw power). I feel bad for the guy who did this, it was probably his last day working there.

--
It is pitch black. You are likely to be eaten by a grue.
Re:Insurer will be unhappy by thsths · 2017-06-02 04:26 · Score: 1

I would not be too worried.
a) It is not clear that the contractor is the only person to blame.
b) Maybe there is some small print that was violated.
c) Even if they have to pay, there is probably a limit of 5 Million...
Something similar happened at Boeing, and at Nike by Anonymous Coward · 2017-06-02 04:27 · Score: 1

A few years ago a sys admin at Boeing's main site in Washington flipped a main power switch ("the big red button"). He wanted to restart the network hardware for the machines in that server closet, to solve a network issue (not a shutdown). He had no idea it was a single point of failure, a doomsday switch (when in doubt, ask more than one person!). The entire system went down, and took 24+ hours to restart, effectively shutting down Boeing's production of airplanes for a day (manufacturing typically requires lots of servers for automation, etc.). Ouch.
But Nike's famous flameout was much worse. Several years ago they replaced their ERP system (basically, it analyzes sales to keep their factories making the right products well in advance of need, so availability meets demand). Despite many red flags, the head of Nike had the ERP company deploy the new system in an absurdly short time, not in a proof-of-concept or limited deployment, nor an A-B comparison with the legacy, but instead global. While the new system worked, it had never been tested at scale, and it turned out it couldn't handle a serious load. Worse yet, it wasn't obvious that it wasn't handling the load. The effect was that the system lost track of the Nike products that were selling the most, e.g., Air Jordan shoes, so continued manufacturing wasn't triggered for the most popular products. Meanwhile, products that barely sold at all continued to register in the system, and since Nike had accidentally left some legacy triggers in place, unpopular products were manufactured double the needed amount. Months later, as stores ran out of Air Jordans and similar, it turned out none were being made, and couldn't be available for several more months. But stores were being shipped products that nobody wanted to buy. In a short time, Nike lost at least $100M, and nearly went bankrupt (in revenge, Nike bankrupted the company that did the new ERP system, despite the fact that the company had very clearly told Nike the short timelines were impossible to meet). Very recently, Nike has finally replaced their legacy ERP systems with best-of-breed software (e.g., based on JustEnough) that is tested to death for both accuracy and scalability. E.g., their unit testing has code coverage of close to 100% (I know, because I spent more time writing tests than services there). And they have a huge infrastructure team that leverages AWS scalability (Lambda and similar) to the extreme.
Re:Shit rolls downhill by thsths · 2017-06-02 04:30 · Score: 1

Of course. I am just waiting for the statement by BA that it was "not their fault", and they are therefore not paying compensation to passengers...
And we should believe this? by whitroth · 2017-06-02 04:32 · Score: 2

In our secure rooms, we have an EPO button. It's LARGE, red, and inside a cover that you have to lift to hit it.
And this contractor turned off the *entire* power for an *entire* datacenter? Yep, yep, not our fault, not your fault, it's gotta be the fault of that guy over there pushin' a broom!
1. Re:And we should believe this? by markana · 2017-06-02 06:07 · Score: 3, Informative
  
  We had an entire data center shut down this way. Facilities *insisted* that the BRB (Big Red Button) not have any sort of shroud or cover over it. Just in case someone couldn't figure out how to get to the button in a dire emergency.
  So one day, they've got a clueless photographer taking pictures of the racks. He was backing up to frame the perfect shot and... well, you can guess the rest.
  Now, the button has a shroud that you have to reach into to hit it, and non-essential personnel are banned from the rooms. Total cost of the outage (even with the geo-redundant systems kicking in) was over $1M.
  Just another day in the life of IT.
It was the cleaning lady by Anonymous Coward · 2017-06-02 04:36 · Score: 1

True story.
In the late 90's I worked at a small startup and was the main IT guy. Each night we had to send out large files, this is back around 98 or so when a 256k bonded business class ISDN or something like that cost us about $1,000 a month. So, this thing needed to be sending data all night long.
I kept having to go back into work because for some strange reason the line would sometimes go down, only after hours, and the crappy old software we were forced to use by the client for the uploads would just fail. I had to manually restart the file transfer.
This happened about once a week for months.
We then got a client that needed better security. So, among other things from the audit we did, we got an electronic lock for the server room door.
Week or two later without any failures my boss stops by with a guy I did not recognize.
"Hi, this is Bob, he is the manager of the cleaning company and he says his workers need a key for something?" Fun conversation..
The cleaning lady was ignoring (or unable to read) the signs saying keep out and such, was going up the ramp, around the server racks, over next to the network rack by the wall, unplugging the network power cord, and then proceeding to vacuum the spotless room I thought no one but me ever went into... Then plugging it back in when done.
My fault for not locking it obviously.
Sounds like a system and database design fault by WillAffleckUW · 2017-06-02 04:56 · Score: 2

Have they never heard of multiple servers with the ability to handle server down events for one machine?

--
-- Tigger warning: This post may contain tiggers! --
And _more_ lies! by gweihir · 2017-06-02 05:00 · Score: 2

Sure, that may have been the proverbial last drop. But the actual root-cause is that their systems were not able to cope with outages that must be expected. And the responsibility for that is straight with top management. Their utterly dishonest smoke-screen is just more proof that they should be removed immediately for gross incompetence.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Been there ... by CaptainDork · 2017-06-02 05:06 · Score: 3, Interesting

... I was walking behind the server rack and unknowingly brushed up against the power cord to the Novell 3.1 server.
Later, when my boss asked me for an outage report, I told her, "I wish you hadn't asked that."
I made damned sure that plug was tied to the server after that.

--
It little behooves the best of us to comment on the rest of us.
Re:Unacceptable. by gweihir · 2017-06-02 05:12 · Score: 1

DR tests are expensive and may show the DR site does not work, making them even more expensive. The bonus for upper management is obviously far more important.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Re:Insurer will be unhappy by gweihir · 2017-06-02 05:15 · Score: 1

He will not have to. An outage of this magnitude from a single cause like this can only happen if gross negligence was rampant on the other side.

--
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Redundancy? by Anonymous Coward · 2017-06-02 05:17 · Score: 1

I'm Navy trained, and every fail-critical system should be designed with the assumption that the greatest threat is the incompetence of your own employees. No single switch should be able to collapse a critical system. The contractor who physically pulled the plug is not the responsible party. The engineers who designed a power system without single-fault tolerance and/or the managers who implemented inadequate supervision, training, and procedural compliance are responsible.
Re:How is this a thing by bws111 · 2017-06-02 05:34 · Score: 1

They are life safety equipment, dumbass.
Seen this movie before by Anonymous Coward · 2017-06-02 06:12 · Score: 1

Data center with backup generators, automatic transfer switches and the whole nine yards goes down hard during routine testing of equipment and stays that way for the better part of a day because it couldn't handle inrush currents. Panic turning on the little switch you turned off not only didn't work but caused physical damage.
What if power supplies had a current sensor w/ a random timer to flip a relay and avoid inrush synchronization, or what if the power system was designed appropriately to deal with the problem in the first place? What if there was an off-site backup system?
Saying a contractor caused it is like saying a customer withdrawing $0 from an ATM caused a bank's transaction system to crash. MULTIPLE design failures on many scales CAUSED this outage.
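The random-timer idea in the comment above can be sketched in a few lines. This is an illustration under loud assumptions, not a description of any real power supply: each unit is modelled as drawing a fixed inrush current for a fixed time after it starts and a lower steady current afterwards, and all names and numbers are made up.

```python
import random


def staggered_start_delays(unit_ids, window_s, seed=None):
    """Give each unit an independent random power-on delay between 0 and window_s."""
    rng = random.Random(seed)
    return {uid: rng.uniform(0.0, window_s) for uid in unit_ids}


def peak_current(delays, inrush_a, steady_a, inrush_s):
    """Peak total current for the simplified model described above.

    Assumes inrush_a >= steady_a, so the total can only rise at the
    moment some unit starts; checking those moments is enough.
    """
    def draw(start, t):
        if t < start:
            return 0.0  # unit not switched on yet
        return inrush_a if t < start + inrush_s else steady_a

    starts = list(delays.values())
    return max((sum(draw(s, t) for s in starts) for t in starts), default=0.0)


# Hypothetical rack: 40 units, 50 A inrush for 1 s, 5 A steady state.
units = [f"psu{i}" for i in range(40)]
together = peak_current({u: 0.0 for u in units}, 50.0, 5.0, 1.0)
spread = peak_current(staggered_start_delays(units, 60.0, seed=1), 50.0, 5.0, 1.0)
print(together, spread)
```

With every unit starting at once the peak is simply the unit count times the inrush current; spreading the starts over a window can never exceed that figure in this model, and usually lands well below it.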
I presume the fabled Big Red Button by swschrad · 2017-06-02 06:45 · Score: 1

there really should be a shield cover over The Big Red Button so prevalent in data centers at the door. the damn thing always scared me, I never got within a foot of the bugger. always felt safer leaning on the Halon tank.

--
if this is supposed to be a new economy, how come they still want my old fashioned money?
IT SHOULD NOT HAVE MATTERED!! by Anonymous Coward · 2017-06-02 07:06 · Score: 1

As I'm sure others have posted, IT SHOULD NOT HAVE MATTERED!!
The fact that there was no redundant system anyway: fail!
The fact that turning it on again did not restore service: fail!
We can all laugh at the clown that turned off the power supply, but c'mon, we all know that this wasn't the *true* problem here!
They are either incompetent or liars by Anonymous Coward · 2017-06-02 07:37 · Score: 1

No sane person who operates critical infrastructure goes without a backup system and built-in redundancy. Also, you cannot switch off power in a computing facility, or even a single rack in there, without proper permission.
In addition to having no backup system, they also did not have an emergency plan. Maybe they are both.
I don't know if I really buy this story. by keith_nt4 · 2017-06-02 08:23 · Score: 1

I mean I don't have any reason to doubt it either, just seems convenient that a dude named Ben just happens to get the blame...

--
"UNIX is very simple, it just needs a genius to understand its simplicity." -Dennis Ritchie
The outage wasn't caused by the contractor by Anonymous Coward · 2017-06-02 08:27 · Score: 1

It was triggered by the contractor.
The cause is in the system design and testing that allowed that trigger to cause too much pain.
CEO trying to spin it... by Anonymous Coward · 2017-06-02 09:43 · Score: 1

Example:
If you fired the CFO based on capriciousness and lack of understanding of what a CFO does you don't get to dodge responsibility saying the delay in getting out the financial quarterlies is because someone didn't order enough paper...
As a CEO, especially in 2017, you should know better than to trust outsourcing solely based on a sales pitch and perhaps a free lunch. These systems are delicate, and frankly even a midlevel IT position takes 2-3 months to get up to speed. If you are taking over a large-scale organization, in essence you are in a de facto freeze for 6-9 months depending on how much turnover there is. You can't ITILv3 your way out of the complexities of these systems...