Slashdot Mirror


Vendors Take Blame For Most Data Center Incidents

dcblogs writes "External forces who work on the customer's data center or supply equipment to it, including manufacturers, vendors, factory representatives, installers, integrators, and other third parties were responsible for 50% to 60% of abnormal incidents reported in a data center, according to Uptime Institute, which has been collecting data since 1994. Over the last three years, Uptime found that 34% of the abnormal incidents in 2009 were attributed to operations staff, followed by 41% in 2010, and 40% last year. Some 5% to 8% of the incidents each year were tied to things like sabotage, outside fires, other tenants in a shared facility. But when an abnormal incident leads to a major outage that causes a data center failure, internal staff gets the majority of blame. 'It's the design, manufacturing, installation processes that leave banana peels behind and the operators who slip and fall on them,' said Hank Seader, managing principal research and education at Uptime."

39 of 57 comments

  1. Whelp. by GmExtremacy · · Score: 1

    I think it's time to switch to Gamemaker. Have to face the music some day, yes?

  2. correlation is not causation by Karmashock · · Score: 4, Insightful

    I'm sure outside forces installing things are disruptive. But then are they the primary forces doing installations in general? And if that's the case, then it would be more appropriate to call them simply installation related issues... and that's both common and to be expected.

    Install anything new and teething issues tend to crop up.

    --
    I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    1. Re:correlation is not causation by Anonymous Coward · · Score: 1

      I think you are 100% correct. I would also consider it the responsibility of the operational staff of a datacenter to properly monitor any external vendors granted access to the datacenter to ensure they don't break anything. I mean, I'm a customer of Datacenter A, I don't really give a crap if the person who took down my network works for Datacenter A or Vendor B that Datacenter A hired. The fact is, my SLA is with Datacenter A, and they are 100% responsible for maintaining the integrity of that datacenter, regardless of what outside forces are involved.

    2. Re:correlation is not causation by starfishsystems · · Score: 1

      Speaking only anecdotally, I've found, on many occasions when installing server applications, that the vendor's installation mechanism breaks the system in some way. These products can't be trusted in their default configuration, yet the nature of software installation entails an elevated degree of trust.

      A characteristic example would be a network service which inserts its own startup behavior into one of the standard Unix scripts in /etc/init.d rather than providing its own standalone script. There's really no excuse for this; it's just a mediocre hack by someone who doesn't know or care to do it the right way. But it's exactly here, where the application is provisioned to the system, that the greatest opportunity lies for breakage.
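
      One way to catch the behavior described above is to snapshot the init script directory before and after running a vendor installer. This is a minimal sketch, not any vendor's or distribution's tooling: the function names `snapshot` and `compare` are made up for illustration, and it assumes the scripts live as plain files directly in a directory such as /etc/init.d that the current user can read.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Map each regular file's name to the SHA-256 digest of its contents."""
    digests = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare(before, after):
    """Return (added, modified, removed) file names between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    # A name present in both snapshots with a different digest was edited in place.
    modified = sorted(
        name for name in before.keys() & after.keys() if before[name] != after[name]
    )
    return added, modified, removed
```

      Take `before = snapshot("/etc/init.d")`, run the installer, take a second snapshot, and call `compare`. An installer that ships its own standalone script should only show up under `added`; anything under `modified` means an existing script was changed.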

      It's also been my experience that contractors who perform installations on site, as well as vendors acting in that capacity, are generally not motivated to do the installation in a clean manner. Sure, it's nice to have someone to blame when things go wrong, but the kind of things that go wrong are often not encountered until the next system upgrade or patch interacts with whatever the installation broke, by which time the outside party is long gone. So usually the local staff take the blame, and in any case they're the ones who have to identify and fix the issues.

      Ideally, the remedy for these scenarios lies with the operating system vendor. If the system provided and enforced its own defensive installation API instead of relying on third parties to play nice, then the system vendor and the application vendor could duke it out on their own turf instead of dragging site administrators into the battle.

      Can anyone think of an example where this approach is strictly not possible?

      --
      Parity: What to do when the weekend comes.
    3. Re:correlation is not causation by Imrik · · Score: 1

      What happens when the person who took down the network works for Vendor C that you hired?

    4. Re:correlation is not causation by Karmashock · · Score: 1

      It's also very very easy to blame the guy that was there for a day and then not there the next to defend himself.

      I've run into a few situations where a coworker blamed mistakes they made on someone that was recently fired or only came to the office occasionally.

      Why not after all?... what are they going to say in their defense?

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
  3. Internal staff IS to blame by nurb432 · · Score: 1, Interesting

    They need to monitor and control their vendors/contracts/etc better.

    --
    ---- Booth was a patriot ----
    1. Re:Internal staff IS to blame by chuckinator · · Score: 2

      but, but, but, THAT'S TOO HARD when passing the buck is so much easier!

    2. Re:Internal staff IS to blame by houstonbofh · · Score: 1

      Is there a category for warning the PHB that it won't work and being told "I don't need excuses! Just get it done!" as that might be the missing 102%...

    3. Re:Internal staff IS to blame by Archangel+Michael · · Score: 5, Insightful

      Overworked, understaffed, three projects added this month and only one closed, and that one was already in the works. It isn't that it's too hard or that it can't be done; it's that we don't have the time to do it right, because we're still cleaning up the mess from the last three projects that were "critical" and were over budget and late. We'd be outsourced, but the cost of hiring an outside vendor is about 10x what in-house staff costs, and they would charge more for each project added.

      Which is why I no longer try to do things on "low budget" and why everything I look at is Enterprise level. Enterprise level allows me to blame the vendor, because THEY are the ones that are selling this shit to the PHB who doesn't know how ridiculously over simplified the vendor makes it sound.

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    4. Re:Internal staff IS to blame by sociocapitalist · · Score: 4, Informative

      Nice try -

      The reality is that whenever something goes wrong, the vendors/contractors are almost always blamed regardless of who is at fault. It's standard business practice for the customer to bring in a vendor for just this reason - if something goes wrong, they can point at the vendor. The bigger the vendor name the better this works. If you can bring the manufacturer(s) in that's best of all. Who can blame you then?

      The 'abnormal' incidents where an internal employee is blamed are probably instances where there was absolutely no way for that employee to escape responsibility (i.e., the syslog entry shows that user logged in, using a one-time password token in his possession so that there's no chance of "the vendor has my username and password" bullshit, and entering the command 'reboot').

      I'm not saying that vendors, contractors and manufacturers don't make mistakes - they're human and from the manufacturer standpoint there are always bugs that are going to cause problems. I'm just saying that the employee / external aspect should be taken into account and thus these statistics taken with a very large grain of salt.

      --
      blindly antisocialist = antisocial
    5. Re:Internal staff IS to blame by s73v3r · · Score: 3, Insightful

      I honestly wonder how many of these incidents blamed on outside vendors are actually the result of something the outside vendor did, and not the result of some manager yelling and screaming loud enough to make the vendor do something to shut him up and not lose business.

    6. Re:Internal staff IS to blame by eclectus · · Score: 1

      As a vendor, I will attest to this. I have 'fallen on the sword' more than once for employees to save face.

      --
      This signature is a waste of 42 characters
    7. Re:Internal staff IS to blame by houstonbofh · · Score: 1

      wise man, i have been in this exact same boat before.

      I think it is a rather large boat...

    8. Re:Internal staff IS to blame by ArsenneLupin · · Score: 1

      I think it is a rather large boat...

      With Schettino at the helm...

      And in the enterprise world, he's got a golden lifeboat that he can fall into when things turn sour....

  4. Blame game? by swb · · Score: 2, Insightful

    It sounds like this is just some kind of tool to show that "it's not our fault, really" -- but at the end of the day, aren't the internal staff responsible for managing the "outside forces" up to and including setting standards, supervision, etc?

    Or is this one of those deals where so much is outsourced that it's easy for everyone to deny culpability?

    1. Re:Blame game? by sociocapitalist · · Score: 1

      Depends - if you bring in HP or IBM to provide a solution for you all the way from business requirements, applications development, systems, networking, security, etc, etc, and then at implementation you have IBM, Juniper, Cisco, etc on site to support and something happens...you've covered yourself perfectly. No one can blame you because you got 'the best in the business'.

      --
      blindly antisocialist = antisocial
    2. Re:Blame game? by trevelyon · · Score: 1

      I think that's SAP's entire business model.

  5. Well duh, cuz they outsource everything. by Above · · Score: 2, Insightful

    Corporate America loves to outsource. Not because it's efficient or cheap, but because it provides someone to blame!

    Outsource the network to one firm, the generator to another, the HVAC to a third. Hire temp contract lackeys to staff the place, and rent-a-cops to "guard" it. Then, when something goes wrong, blame them. If it's a big enough issue fire them and replace them with the next batch of people who won't be trained, won't care, and will eventually screw up.

    This article isn't illuminating, it's simply restating the design parameters of the system!

    1. Re:Well duh, cuz they outsource everything. by swb · · Score: 1

      This is what I really wanted to post.

      And it's not like it's even done "intentionally" to find someone to blame, it's just that there is SO much outsourcing and the buck stops...nowhere. Everybody does the least amount they possibly can to keep something from going wrong, because they (well, hell, *I*) know it will because there's inadequate training, documentation, testing, PHBs and Suits screaming about how late projects are, nobody bought enough storage/CPU/bandwidth/amperage and the aforementioned suits/PHBs won't spend "any more".

      It's just fucking endless and the blame just gets shipped downstream, rather than someone wondering if maybe somebody with a brain bigger than the shift knob on their BMW sedan should be in charge.

    2. Re:Well duh, cuz they outsource everything. by dkf · · Score: 1

      Corporate America loves to outsource. Not because it's efficient or cheap, but because it provides someone to blame!

      Outsource the network to one firm, the generator to another, the HVAC to a third. Hire temp contract lackeys to staff the place, and rent-a-cops to "guard" it. Then, when something goes wrong, blame them. If it's a big enough issue fire them and replace them with the next batch of people who won't be trained, won't care, and will eventually screw up.

      They're forgetting that the one thing they cannot outsource is the overall responsibility for having things working enough to support their business, for if they get rid of that then they've eliminated the need for them to exist at all (and their supplier will simply cut them out of the equation with no ill-consequences). If things keep failing horribly because the people they're outsourcing to suck, it's Corporate America's fault for outsourcing to the wrong people (or outsourcing at all).

      Mind you, it might be better to deal with problems through conventional insurance than trying to make the system infallible.

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  6. Dive deeper... by Anonymous Coward · · Score: 1

    80-90% of abnormal incidents caused by vendors were the previous vendor's fault.

  7. Banana peels? by Chemisor · · Score: 3, Funny

    It's the design, manufacturing, installation processes that leave banana peels behind and the operators who slip and fall on them

    When a company tries to get around minimum wage laws by hiring low-paid monkeys to do their design, manufacturing, and installation, they get exactly what they deserve.

  8. UPS datacenter testing by DigiShaman · · Score: 3, Insightful

    My favorite is getting notifications that all our servers went offline. Now typically, that would be at the network (ISP) level. So come to find out later that the entire facility lost power. Apparently they performed an internally scheduled UPS test without letting us know beforehand. Well, they completed the test alright. It was a failure.

    In that whole event, we ended up with dirty NTFS volumes that needed to have chkdsk run and one or two servers with a failed drive in their respective RAID5 arrays. Not happy!

    --
    Life is not for the lazy.
    1. Re:UPS datacenter testing by thebeige · · Score: 1

      Justified rant!

  9. It's part of the shell game. by L3370 · · Score: 2

    If you let them in your datacenter, it's your fault if anything goes wrong in there.
    If your vendor botched a deployment or delivers a functionally useless product, it's your fault for buying into their marketing campaign and not understanding what you just got yourself into.

    But mostly, I think the blame system was by design here...Hire someone else to do the job for everything possible. Fire them/drop contracts when they don't work for you, then file insurance claims to compensate (plus extra if you do it right) for the damages. The trick is to keep the damages rolling as expected--enough to keep insurance revenues up, but not enough so that your premiums adjust to make it unprofitable.

  10. I think her name was Abby... by neo-mkrey · · Score: 1

    Abby Normal

  11. Well by M0j0_j0j0 · · Score: 1

    i don't know but i have been taking these pills, my wife is happy, and they were recommended to me by the Uptime institute as well, so this study must be close.

  12. contractors and sub contractors add middle man by Joe_Dragon · · Score: 2

    contractors and sub contractors add middle man and overhead.

    Sometimes to the point where a sub may get a job with little to no documentation, or a job with poor or bad documentation.

    Or a sub may hit an issue and have to work through a lot of off-site middle-man managers to get things fixed, or just be told: do as the documentation says, and we will have to get another contract to fix it.

    1. Re:contractors and sub contractors add middle man by Anonymous Coward · · Score: 1

      As a contractor I experienced this exact thing in a datacenter. During installation of fiber infrastructure I noticed some anomalies on the scope of work. On site contact pretty much said to do exactly what it said (which was wrong/incomplete) because it was designed by the corporate level 1 engineers. Of course this contact retires and the replacement comes in and blames me for an unfinished job.

  13. rotating contracts leads to people with no knowled by Joe_Dragon · · Score: 1

    rotating contracts leads to people with no knowledge of the site and more errors as people get up to speed.

  14. who takes the blame working on another company’s by Joe_Dragon · · Score: 2

    When a data center is working on another company’s server than the one that they should be working on?

    http://thedailywtf.com/Comments/Remotely-Incompetent.aspx

  15. Re:Anonymous Cowards takes credit for most by eternaldoctorwho · · Score: 1

    The headline said "blame", not "credit". I think the former is more relevant here.

  16. Deer in the headlights... by billybob_jcv · · Score: 2

    Back in the prehistoric days a group of us were sitting in a bull-pen outside the datacenter. There were big windows on the datacenter wall so we could all ooh & ahh at the blinky lights on the servers and switches. Suddenly, my workstation froze - and when I (and every other person in the bullpen) yelled and looked up, we saw our network admin standing in the datacenter looking back at us with a "What?" look on his face. In his hand was the Ethernet cable he had just pulled out of a core switch...

       

  17. They do the risky work by hawguy · · Score: 2

    Is this surprising? The vendors/contractors do more of the risky work. When it comes time for UPS maintenance, our vendor comes in to take the UPS offline and do the work. If they screw up when they bypass the UPS, they can take down the datacenter. Likewise, when it comes time to add a new disk tray to the storage system or replace a failed controller board, instead of having our staff do it (who may add one tray every year if that), we have the vendor do it, so there's more chance of him doing the wrong thing and bringing down our storage system -- but there's less chance of the vendor causing a problem than our own staff since the vendor's engineer does this twice a week.

  18. Re:rotating contracts leads to people with no know by L3370 · · Score: 1

    Which is PERFECT if you don't want things to work correctly in the first place! Good products delivered efficiently become cheap.

    It's about profit maximization. More fuckups = more billable hours and expenses to pass on to the customer :)~

  19. A matter of control by PPH · · Score: 1

    Quality in a data center, or any facility for that matter, depends on controlling the processes within that facility. If vendors have signed on to working within the procedures developed by the data center operators, fine. There should be minimal problems. But if vendors are allowed on the property to do work not covered by these plans and controls, antics will ensue.

    There is nothing inherently wrong with bringing in outside vendors. As long as their function has been planned for. And there is some means to hold those vendors to working within that plan. But all too often, data center managers overlook certain functions in their procedures. Like installation and commissioning new equipment, for example. So when these operations become necessary, people are brought in (or the task is handed to in house techs) with insufficient directions on how to proceed. The difference between vendors and your own techs is that vendors come in familiar with their own equipment, but unfamiliar with data center processes. In house technicians have the opposite problem. They know their way around the facility, but not so much the equipment. Either way, somebody is going to need training.

    So, do you train your people on functions that they'll rarely have to perform? Or do you expect vendors to learn your processes when they may not return for months or years?

    --
    Have gnu, will travel.
  20. Who to blame by Skapare · · Score: 1

    But with the possible exception of a meteor strike, there's always someone to blame for a data center problem.

    I always blame Anonymous Coward. He's the one that failed to order the meteor shields.

    --
    now we need to go OSS in diesel cars
  21. Tech support war story here. by sirwired · · Score: 1

    Several years ago, I was working a support case with a major bank. Their remote storage mirroring between BFE, [Southwest State Here] and BFE, [Flyover Country State here] failed, and they wanted to know why. I obtain SAN switch logs from both fabrics and attempt to troubleshoot the issue. The logs revealed that the network ports dropped offline one by one, about 5-7 seconds apart, and then the problem hit the other switch. They came back online one-by-one about three minutes later. The ports were scattered all over the respective switches.

    I inform the customer of my findings and am informed that there happened to be somebody working on the cabling in that exact same rack cabinet, but he swears he didn't touch anything having to do with these cables and that the problem MUST be within our hardware. I inform the customer that hardware or software issues do not spread to random ports within a switch, and then to a switch that has NOTHING in common besides a nearby rack cabinet, and ONLY affect a particular group of ports that are otherwise completely randomly spread throughout the switch. (We are talking good old-fashioned light loss here... not some esoteric failure that could be caused by software.)

    The customer replies: "We'll be having a "discussion" with that cabling contractor."
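
    The reasoning in this story (ports dropping one by one at a steady interval, on ports with nothing else in common) can be sketched as a simple check over port-offline events. This is only an illustration: the `(timestamp, port)` event format, the function names, and the sample values are all invented here, and real SAN switch logs differ by vendor. The 5 to 7 second window comes from the account above.

```python
from datetime import datetime


def offline_gaps(events):
    """Seconds between consecutive port-offline events, in time order.

    events: list of (datetime, port_number) tuples (hypothetical format).
    """
    ordered = sorted(events)
    return [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]


def steady_cadence(events, min_gap=5.0, max_gap=7.0):
    """True if at least three ports dropped and every gap falls in the window."""
    gaps = offline_gaps(events)
    return len(gaps) >= 2 and all(min_gap <= gap <= max_gap for gap in gaps)


# Invented sample data: four scattered ports dropping a few seconds apart.
sample = [
    (datetime(2012, 1, 1, 3, 0, 0), 14),
    (datetime(2012, 1, 1, 3, 0, 6), 2),
    (datetime(2012, 1, 1, 3, 0, 11), 37),
    (datetime(2012, 1, 1, 3, 0, 17), 9),
]
print(steady_cadence(sample))
```

    A steady human-paced cadence across unrelated ports is a hint, not proof, that someone was physically unplugging cables; simultaneous drops would point somewhere else.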