Cisco's Network Bugs Are Front and Center in Bankruptcy Fight (bloomberg.com)
Reader Dharkfiber writes: Bloomberg is covering a story today about a hosting business that is now filing for Chapter 11 due to bugs in a switch. Good, bad, or ugly, is it time to admit that business really can't continue without IT? When will IT training become formal curriculum in schools? An excerpt from the Bloomberg report: There's buggy code in virtually every electronic system. But few companies ever talk about the cost of dealing with bugs, for fear of being associated with error-prone products. The trial, along with Peak Web's bankruptcy filings, promises a rare look at just how much or how little control a company may have over its own operations, depending on the software that undergirds it. Think of the corporate computers around the world rendered useless by a faulty update from McAfee in 2010, or of investment company Knight Capital, which lost $458 million in 30 minutes in 2012 -- and had to be sold months later -- after new software made erratic, automated stock market trades.
Peak Web, founded in 2001, had worked with companies including MySpace, JDate, EHarmony, and Uber. Under its $4 million-a-month contract with Machine Zone, which began on April 1, 2015, it had to keep Game of War running with fewer than 27 minutes of outages a year, court filings show. According to Machine Zone, the hosting service couldn't make it a month without an outage lasting almost an hour. Another in August of that year was traced to faulty cables and cooling fans, according to the publisher.
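To put the 27-minute figure in perspective, here is a minimal sketch that converts a yearly downtime allowance into an availability percentage. The function name and the 365-day year are assumptions made for illustration, and "almost an hour" is rounded up to 60 minutes; none of this comes from the court filings.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_pct(downtime_minutes_per_year: float) -> float:
    """Return availability as a percentage for a given yearly downtime."""
    return 100.0 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)


# The contract's allowance: fewer than 27 minutes of outages a year.
contract_target = availability_pct(27)

# A single outage of roughly an hour, with nothing else going wrong all year.
after_one_outage = availability_pct(60)

print(f"27 min/year -> {contract_target:.4f}% availability")
print(f"60 min/year -> {after_one_outage:.4f}% availability")
print(f"One such outage uses {60 / 27:.1f}x the yearly allowance")
```

Under these assumptions the contract works out to roughly 99.995% availability, and a single hour-long outage spends more than twice the whole year's allowance.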
I'm a photographer, and I sell my work through a web service. They bring together the finishing providers (prints, calendars, t-shirts, etc) and take care of payment, and all I have to do is provide content and manage sales. When I finish post-processing on a new photo, the tool I use (Adobe Lightroom) automatically uploads to the web service in the album I select. I cover events, so there's often a massive number (600 or so) of photos to upload.
Yesterday I was getting sporadic "service not available" messages from the service. After doing some triage to verify the problem was not at my end, I contacted customer support. Mind you, this was 10:30 PM PST. But that's the way it is with photographers -- we often take photos during the day and process them at night, which is somewhat the opposite of a standard use case. (And should be borne in mind when said services schedule maintenance. Just sayin'.)
Browsing the service's forum, I saw others were seeing the same error message, and people were starting to get excited. (This is our livelihood, after all.)
I got an answer to my service ticket in less than 30 minutes, saying that they were struggling with network problems at one of their service providers (probably a cloud service). I got a followup shortly after, saying that they thought the service was up now but they were still testing. And I got another followup at 6:30 AM saying that the problem had been resolved and they had put steps in place to ensure it would not happen again. They also implemented a "status page" that we could consult in the future (which should have already existed, but live and learn).
Now, *that's* the way to handle an incident like this. Very commendable. But it does point up the problems a business sometimes has when they rely too much on external services. Just my opinion, but the main difference I can see between in-house and outsourced is one of motivation. If you're providing an online service, your employees realize in their heart of hearts that outages can easily result in business failure and loss of jobs. But if you're renting all the pieces of your service from outside vendors, you soon find that those vendors may be concerned about their contract with you, and the money they make off you, which isn't at the same level in the hierarchy of needs as the live-or-die situation you are in.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Agreed. I've worked at places that kept five nines of availability. By "normal" IT standards, it was massively overbuilt: multiple sets of gear, clustered in failover mode, with a separate redundant setup elsewhere in the data center, on an entirely different power feed (as I recall, we had at LEAST 4 independent power feeds).
We also had cabinets full of spare parts, entire full pieces of gear on the shelf, and an entire library of config files on the TFTP server, plus a duplicate copy on a laptop that lived in one of the cabinets. It took a LOT of labor and gear, and was not cheap.
And we constantly had to explain the man-hour and spares costs to the suits.
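For anyone wondering what "five nines" buys you, here is a rough sketch of the yearly downtime budget at each level. The function name and the 365-day year are assumptions for illustration only.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed at an availability of N nines.

    Three nines means 99.9%, five nines means 99.999%, and so on.
    """
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


for n in range(2, 6):
    print(f"{n} nines: about {downtime_budget_minutes(n):.1f} minutes per year")
```

Five nines leaves a little over five minutes a year, which is why the spares, the duplicate gear, and the extra power feeds add up so fast. The 27 minutes a year in the Peak Web contract sits between four nines and five nines.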