Ask Slashdot: How Transparent Should Companies Be When Operational Technology Failures Happen?
New submitter supernova87a writes: Last week, Southwest Airlines had an epic crash of IT systems across their entire business when "a router failure caused the airlines' systems to crash [...] and all backups failed, causing flight delays and cancellations nationwide and costing the company probably $10 million in lost bookings alone." Huge numbers of passengers, crew, and airplanes were stranded as not only reservations systems, but scheduling, dispatch, and other critical operational systems had to be rebooted over the course of 12 hours. Passenger delays, which were directly attributable to this incident, continued to trickle down all the way from Wednesday to Sunday as the airline recovered. Aside from the technical issues of what happened, what should a public-facing company's obligation be to discuss what happened in full detail? Would publicly talking about the sequence of events before and after failure help restore faith in their operations? Perhaps not aiming for Google's level of admirable disclosure (as in this 18-minute cloud computing outage where a full post-mortem was given), should companies aim to discuss more openly what happened and how they recovered from system failures?
Router failures shouldn't cause loss of data in any appreciable amount. Enterprise-level organizations should have automatic failover routers in place. This was far more than a simple router failure... so the real question should be: should companies be allowed to lie to their customers about major technical issues?
i've been delayed because of weather, engine troubles, etc and i'm still alive and happy. the only thing public pressure does is cause the company to spend more money in redundant hardware which mostly sits unused and raises prices
What failure?
The companies understand one thing: profit.
It depends on the volume of business and a variety of factors. For example, I was recently considering the purchase of a new automobile. There was one make which I ended up removing from consideration because their infotainment was not open for me to hack on. I felt like this was important and so I told the salesman why it was important to me and that this single factor resulted in my no longer considering any models from this manufacturer.
In another instance, a specific dealership had two different sales people contact me by phone, essentially competing with each other. I didn't like that so I didn't bother calling back either one. Several days later I received a form inquiry from the general manager (certainly an automated message). I took the time to respond, explaining that I wouldn't be doing business with them because of the poor coordination of their salesmen's activities. If I already talked with one and explained what I needed in a vehicle, why was another going to call me and try to make me go through all that again?
Granted, these are different examples, but I make this small effort in the hopes that it will either improve the situation for the person who comes along after me or for myself the next time. Of course, the larger the organization, the less likely this is to have an effect. I expect that the GM of the dealership with two salesmen could possibly do something based on my feedback. I fully expect nothing to change from the manufacturer of the car with the closed infotainment system. However, if 10,000 customers all told different dealers the same thing or bothered to write to the manufacturer directly, then something might change.
Southwest and other airlines are by necessity very large companies. If you tell a booking agent something it is almost certain no manager will hear of it. But, if you contact the execs directly, perhaps if there is a VP of customer service or an ombudsman, contact that person and let them know that you value openness and that you are specifically avoiding giving them your business because of their lack of it. If they hear this from enough people, they will get the message: we are losing out on business because of our approach to blah blah blah.
So, bottom line: companies should be as transparent as their customers demand. If you, the customer, don't demand then they won't know and won't make any change.
Aside from the technical issues of what happened, what should a public-facing company's obligation be to discuss what happened in full detail?
There is no simple single answer to this question. It's going to be circumstance dependent. In many cases a lot of transparency will be helpful and appropriate. In other cases it probably won't matter much and in a few cases it might even be counterproductive though I expect that would be uncommon. If the problem is something like a security problem that will take time to resolve, immediate transparency might do more harm than good in some cases. But in general people are pretty forgiving if they understand the mistake was an honest one and that the company is working in good faith and transparently to fix it.
Would publicly talking about the sequence of events before and after failure help restore faith in their operations?
Generally speaking the answer is probably yes. If people can see that the company is acting in good faith to solve a problem that might shake confidence in the product then yes, transparency will probably help. Probably the canonical example of transparency working to the benefit of the company is how Johnson & Johnson handled the Tylenol tampering back in 1982. The company acted quickly, transparently and forcefully to deal with the problem and it probably saved the product and changed how such products were packaged going forward. It's one of the better examples of how to handle a major crisis. It's not hard to find examples of companies brushing things under the rug and then it blowing up in their face down the road. See GM and their ignition failures for a good example of that.
Airlines solicit people to be 5 miles in the air and thus vulnerable to death. To me that means that zero levels of privacy should be allowed so that all individuals and competitors can study every single detail about anything to do with an airline. For example, the pay rates for their mechanics are one indicator of the quality of maintenance performed. How about dollars spent on maintenance per hour of flight? How about the hours in the air for every plane they fly? All these things can be used to judge safety and should be wide open for all inspections at all times.
I worked IT in the airline industry for over 20 years and that happening does not surprise me.
In many cases the systems are old, the software is not well maintained, and management does not understand how critical it is to the operation of the company. Many airline/aircraft companies have outsourced their IT to Managed Service Providers under the guise that "We are an airline, not an IT company." In doing so management negotiated the contracts, not IT, and the contracts are crap. No clauses for upgrading systems, no clauses for management of software patching, and one such contract, that I have read, guaranteed a 98% uptime. Yes, it really was 98% and not 99.999%.
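To put those two uptime figures side by side, here is a minimal sketch of the arithmetic. The helper name `allowed_downtime_hours` is hypothetical, and the calculation assumes a plain 365-day year with downtime measured around the clock, which an actual contract may define differently.

```python
# Hypothetical helper for illustration: converts an uptime percentage
# into the downtime it permits over a period (default: one 365-day year).
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted per period at this uptime."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for sla in (98.0, 99.999):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime permits about {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Under those assumptions, 98% permits roughly 175 hours (a little over a week) of downtime per year, while 99.999% permits a bit more than five minutes.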
In almost all cases, once IT was outsourced, they not only eliminated their IT department, they added rules stating that they could not hire IT people, as it was all outsourced and they had no need of them. The companies I have worked for have hired me with odd titles to avoid such rules.
Redundancy is, in many cases, non-existent. Equipment is aging and starting to fail, and there are no plans or projects in the works to update it. Heck, one company I know of is still running on computers that were purchased in 1995.
When projects are put forward with proper HA, network failover, SAN, etc., they get cut in cost-cutting measures to the point that they are unrecognizable. A great example is an upgrade to an Oracle server that I was working on. The original upgrade plan was to deploy an HA pair with back-end SAN on a dual 10g failover connection. After it was cut it ended up being a single dual-proc Windows system with internal drives running on a 1g connection. It has already crashed multiple times and each time has brought the company to a standstill.
In this day and age, companies need to realize that they run on IT. If your IT infrastructure fails, your company comes to a halt and you lose money!
If you forget to shave or shower on a particular day, should you be required to post that to your Facebook page or wear a billboard sign all day decrying your lack of hygiene?
What do we do when buildings and bridges fail, or when an aircraft falls out of the sky? We should do something like that. In a more enlightened age, we'd have the NTSB-equivalent for massive IT failures.
the growth in cynicism and rebellion has not been without cause
The companies understand one thing: profit.
That's not true. Companies and the people that run them understand more than just profit. I defy you to find a single person in a company who cannot comprehend something other than profit. To claim that profit is all they can understand is absurdly untrue. But there is a nugget of truth in what you say. What is true is that companies and some (not all) of those who run them have a strong tendency to focus on profits excessively, particularly short term profits. They do this to the detriment of all else including the long term health of the company sometimes. It's too glib to say that companies only understand profit but it is fair to say that companies tend to focus on it too hard at times and make bad decisions as a result.
A well managed company has to consider things like the health of their community, the well being of their suppliers, the trust of their customers, etc. All these things sooner or later will impact profits, so if a company focuses excessively on near term profits then in the long term it will likely be worse off and so will all those who depend on the company - customers, suppliers, community, shareholders and employees.
Southwest Airlines Co. has filed 33 labor condition applications for H1B visa and 1 labor certifications for green card from fiscal year 2013 to 2015. Southwest Airlines was ranked 5651 among all visa sponsors.
Ah. That explains it.
Yes! I think airlines and all companies exposing the public to potential life and death situations should definitely give a post mortem when critical systems fail, regardless of whether they are mechanical or not. However, if your local supermarket had a crash of their inventory management system, would you really care? No you probably would not because you will still be able to pay with cash and take your goods anyway. I think the line should be drawn somewhere near exposure to mortal danger. Therefore every company offering some sort of transportation service should be as transparent as possible and should have near-zero privacy.
The previous post offering the title "That Depends" is on the right track.
Some industry sectors have legal requirements to disclose technical failures that could impact their operating bottom line. For example, think about Section 404 of the Sarbanes-Oxley Act.
Other requirements are driven by locations - for example California was the first US State to require formal disclosure if a company lost unencrypted client data.
The bottom line is that, for a growing number of industry sectors, legislative jurisdictions and use cases, there is a legal requirement to make necessary disclosures and in a timely manner. In the case of some requirements [like SOX-404] there is the potential of jail time for company officials that fail to abide by the law.
Ultimately, it is the legal responsibility of the CEO of a [publicly listed] company to ensure that the company operations fully comply with all legal obligations at all times. Irrespective of whether or not a company unaware of its obligations may end up breaking the law, a company that doesn't understand those obligations has a negligent CEO.
Here be dragons. Tread carefully!
I think it's smart to be as transparent as you can about system failures (without creating a security risk by discussing your infrastructure in too much detail). A company like Southwest depends in part on consumer confidence in order to gain customer loyalty. People are still a little afraid to fly. The act of transparency can boost confidence because the customer is expected to accept that bad things happen but that the right person is in charge, knows what happened, knows how to fix it, and can assume that changes have been or will be made so that it doesn't happen again. If Southwest instead chose to be completely opaque the customer would either be wondering what they're covering up and why (hurting confidence) or if they are completely incompetent (also greatly hurting confidence). For companies like Southwest, I think it's essential that they be as detailed as they can be.
Accidents happen. And only people who don't work make no mistakes. So if anyone claims he never makes mistakes, you have found the slacker.
People are surprisingly willing to cut you some slack if you admit mistakes, apologize and offer them some token compensation. Provided that they don't happen too often and that it cannot be considered malice or gross negligence.
Also, what you offer in compensation should be in sync with your mistake. Handing out a free trial that marketing has been throwing about left and right like it's some candy that's reaching its best before date after losing your customer's credit card info is NOT going to cut it. Sony, I'm looking your way.
Generally, Sony could be used as the poster child of how NOT to reconcile with your customer base after fucking up...
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
"Outsourcing partner" in Bangalore must have screwed up.
On Indian outsourcing, here's a war story. When working with Fokker, the Dutch aerospace company, I was sent to Bangalore to deliver a final judgment on an outsourcing firm there. On the second day, needing to go to the toilet, I lost my way in the building. Trying to find the loo, I walked by an empty cubicle (the cubicles had large glass panes in them). On the table lay a blueprint. Being an engineer, I couldn't refrain from looking at it. The name "Areva" was printed all over it, Areva being a French constructor of nuclear power plants. It soon became clear to me that those st***d Indians had left the blueprint of an important safety valve in a current nuclear reactor design, unsupervised, on a table in an empty cubicle, and that anyone could walk in on it. I took a picture with my cell phone and sent it to Areva - after having stood there, for a test, for about 10 minutes. Nobody turned up. Anyways - some high-up security guy there went ballistic; on the phone, he thanked me and explained to me the kind of mayhem that blueprint falling in the wrong hands could have caused. (Needless to say we at Fokker immediately cut ties with that Bangalore company.)
Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
Those customers that got "badly burned" are going to want to know that you've learned your lesson.
If the event hit the press or word got around to your target customer base, you'll need to convince them that it won't happen again (I'm looking at you, Southwest Airlines).
If your industry is one where the failure could cause death or injury if it happened again - even to a competitor - then you have a moral and possibly legal obligation to "go public" within your industry so they can learn from your experience (I'm looking at you, Blue Bell Creamery).
Even if it's not life-or-death, you may find it good busine$$/good PR to share details within your industry or to the general public (thank you, Google).
There are some cases where publicity isn't critical.
For example, if you sell widgets and you had a no-critical-lessons-learned systemic failure in one of your factories that shut down production in that factory for a week, but your other factories were able to ramp up production so all your distributors and major customers noticed was a half-day shipping delay on some parts resulting in their own inventories, but your other end users didn't notice anything, then all you need to do is apologize for the inconvenience and say that a plant had to be taken offline and it took half a day to add shifts to the other plants and get your widgets shipped out. If you are a public company you may need to issue a press release for the benefit of investors. If you had temporary layoffs or if employee health and safety were affected, you may have to notify the government, unions, and affected employees. Other than that, you probably don't need to say much more.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The public expects no explanation. They have been conditioned by over 50 years of experience to accept "Our computers are down" as sufficient.
You may think I'm joking but it is mostly true.
First off, how does a router failure make you lose data? That's either a lie or they have no clue how their system works...which would explain why they failed so miserably. Also, how does such a large company not run a RAID on their server hard drives? So you're telling me the RAID failed, the NAS failed, the on-site backup failed and the offsite backup(s) failed? I guess I could believe Southwest outsources to a company that doesn't consider the basics that every 20+ person small business in the USA takes for granted.
Actually, that is the definition of a company.
All large organizations have some messy aspects of their internal IT. The longer the organization has existed, and the larger and more diverse it is, the worse it gets. There was a story a couple days ago circulating about a Citibank employee (NOC engineer or something like it) that was able to stop most network traffic by removing the configs in a few key routers. (Turns out he was upset about a bad review he had just been given.) If a network were properly designed with no choke points, no SPOFs, etc. it would be extremely hard to take out all traffic. But the reality is that stuff grows organically over time and there are lots of IT skeletons in closets. I doubt there's a CIO on earth that wants to go out there and say to the public that they screwed up because they didn't, say, pay an extra $5K for a redundant router.
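For what hunting for SPOFs can look like in practice, here is a rough sketch. It treats the network as an undirected graph and flags every node whose removal splits the rest. The topology and the node names (`core-a`, `legacy-router`, and so on) are invented for illustration and do not describe any real airline or bank network.

```python
from collections import deque


def is_connected_without(graph, removed):
    """Check whether all remaining nodes can still reach each other
    once the node `removed` is taken out of the graph."""
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def single_points_of_failure(graph):
    """Return the nodes whose loss disconnects the rest of the network."""
    return sorted(n for n in graph if not is_connected_without(graph, n))


# Hypothetical topology: two redundant cores, but one legacy router
# that grew "organically" in front of a critical system.
network = {
    "core-a": {"core-b", "edge-1", "edge-2"},
    "core-b": {"core-a", "edge-1", "edge-2"},
    "edge-1": {"core-a", "core-b", "legacy-router"},
    "edge-2": {"core-a", "core-b"},
    "legacy-router": {"edge-1", "dispatch"},
    "dispatch": {"legacy-router"},
}

if __name__ == "__main__":
    print(single_points_of_failure(network))
```

This brute-force check removes one node at a time, which is fine for a small diagram; a large network would call for a proper articulation-point algorithm instead.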
About SWA's troubles, here's a clue -- airlines have absolutely zero interest investing more in IT than the basics required to run the business. There's cool stuff being done, but airlines are a low-margin business (believe it or not) and have historically relied on a web of third party companies to provide IT services. It used to be just reservation systems, etc. that most airlines couldn't or wouldn't want to run themselves anyway. But in recent years, lots of development and operations work has been moved to "offshore partners" or IT companies that in turn offshore everything. Because of all this abstraction, I'm sure Southwest's onsite IT staff had a very difficult time figuring out who and what was actually to blame for the issue. That, and airline IT is full of single points of failure that are just the nature of the business. Losing operational messaging links, having one system fail in a chain of dependencies that prevents aircraft dispatch or crew scheduling, and others can stop an airline from operating until they're fixed.
Another point - the cloud doesn't really solve this either. It has the potential to, but architecting a failure-tolerant solution in a public cloud is actually harder to do than on-site stuff. Sure, if you're starting from scratch you can write software in a way that gracefully handles failure. However, any legacy application port into the cloud requires very careful thought about how to design it for fault tolerance.
Actually, that is the definition of a company.
No it is not. The definition of a company is "an 'artificial person', invisible, intangible, created by or under law, with a discrete legal personality, perpetual succession and a common seal. It is not affected by the death, insanity or insolvency of an individual member."
A company is a term that refers to a variety of types of organizations. Some types of companies are explicitly not concerned with profits at all. Perhaps you've heard of non-profit companies? Those are a thing you know.
From the linked article: "In the United States, a company may be a "corporation, partnership, association, joint-stock company, trust, fund, or organized group of persons, whether incorporated or not, and (in an official capacity) any receiver, trustee in bankruptcy, or similar official, or liquidating agent, for any of the foregoing". In the US, a company is not necessarily a corporation."
100% transparent to someone. It needs to be known whether or not software is actually reliable; hype needs to be set aside. When we are talking about lives hanging in the balance, the spreadsheet mentality is very inappropriate. In some cases, Silicon Valley and its champions are beginning to remind me of the pharmaceutical industry!
Undoubtedly, it's related in some way to cheaping out - as you say, it's in the culture of every corporation. And yes, it's the same for their airplanes, though they're a far cry from Allegiant. I've been delayed several times by problems with the airplane on Southwest, though the worst actually flying plane I've been on was United. Southwest even had the fuselage open up on one, which caused rapid (though not explosive) decompression and an emergency landing in Yuma, and like many airlines has been cited in the past for outsourced maintenance practices. In all, though, they may have problems but they're way ahead of second place unless you can afford to fly in your own plane.
Don't say they're cheap to fly, though. Fares are very similar to the other big boys these days.
I'm suspecting that a router did fail, which triggered a bunch of other cascade failures which it shouldn't have.
If it's a government entity, yeah, full disclosure, down to the last comma separated value. A public company? That's between them and the share holders. Private company, disclose whatever they want or not. In the end, there'll be some consumer watchdog outfit that will publish all the up and down time percentages and companies will reap their desserts. Unless they're calling me in to fix the problem, I don't care whether they were hacked or somebody's cat pissed on a circuit breaker, they're either up or they are down.
One router failed and corrupted all backups? Umm, what the f***? Failover is a thing, like a really really important thing. If they're going for transparency they're doing a poor job because the excuse they provided is a lie. This sounds more like an IT worker trying to save his job by lying his ass off, and his excuse somehow made it all the way to Slashdot.
I've worked at a helpdesk and we were always lying through our noses to cover our screw ups. On the phone while the incident was happening we'd be very vague and give as few details as possible. We'd then give a full account to the manager who would produce an incident report which was what we called "customer-service friendly".
Tech companies have no incentives to admit when they screwed up and as such, will always try to lie/cover it up when they think they can get away with it.
it was probably not a router.
You are either transparent, or you aren't.
Southwest has redundant systems, and many redundant paths to those systems. Most everything will survive a single point of failure, except the failure of one of the corner routers (they used to use F5s set up in a redundant network path; it has been a few years since I saw things). Most of the apps share the 4 corner F5 routers.
The apps tend to be interconnected. The reservation system talks to the ground system so the bags get on the right airplane. The ground system talks to the crew system, so the gate agent knows what flight attendants are on the flight, etc.
Normally processes load balance from the main data center to the alternate data center (if it is a new app; the older apps will only fail over to the alternate data center). If either data center isn't responding, the F5s are supposed to route the traffic to the known good data center, and the apps should work fine. If the F5 gets confused, it may route data to the bad data center, and in effect lose stuff.
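The failover idea described above can be sketched roughly as follows. This is not F5 configuration or any vendor API, just plain Python with made-up names (`choose_data_center`, `delivered`, and the labels "main" and "alternate") to show why a router working from a wrong health picture drops traffic.

```python
def choose_data_center(router_view, preferred=("main", "alternate")):
    """Pick the first data center the router believes is healthy.

    `router_view` maps a data center name to True/False as seen by the
    router's health checks. Returns None if nothing looks healthy.
    """
    for name in preferred:
        if router_view.get(name, False):
            return name
    return None


def delivered(router_view, actual_health):
    """Return True if traffic lands on a data center that is really up."""
    target = choose_data_center(router_view)
    return target is not None and actual_health.get(target, False)


if __name__ == "__main__":
    actual = {"main": False, "alternate": True}  # main is really down

    # Router has the correct picture: traffic fails over and arrives.
    print(delivered({"main": False, "alternate": True}, actual))

    # Router is "confused" and still thinks main is up: traffic is lost.
    print(delivered({"main": True, "alternate": True}, actual))
```

The point of the second case is that redundancy only helps if the component deciding where to send traffic has an accurate view of what is healthy.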
I haven't heard anything about what actually went wrong, but my guess is one of the F5s was mis-configured, or broken, and it caused many of the apps to not be able to talk to each other. It probably started small, but as failure compounded, and more systems couldn't talk to each other, the front line people just couldn't use the computers, and had to revert to manually checking in passengers, which slowed things down. When the pilots couldn't get flight releases, then everything stopped.
I am speculating a bunch here, don't quote me on this.
The companies understand one thing: profit.
To claim that profit is all they can understand is absurdly untrue. But there is a nugget of truth in what you say. What is true is that companies and some (not all) of those who run them have a strong tendency to focus on profits excessively . . .
Uhm, yeah. That's what the parent is implying, duh.