An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com)
Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week. The company also outlined what caused the issue -- the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds: The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
Cheap help is so easy to overwork.
They can have my command prompt when they pry it from my cold dead fingers.
>> wrong command
Sure, blame the intern actually typing the command.
More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
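For what it's worth, a guard like that is only a few lines. Here is a minimal Python sketch, not anything Amazon actually runs: the function name `confirm_removal`, the threshold of 10, and the prompt wording are all made up for illustration.

```python
def confirm_removal(count, threshold=10, ask=input):
    """Return True if removing `count` servers may proceed.

    Small removals pass silently. Anything above `threshold` (an arbitrary,
    hypothetical limit) requires the operator to type YES exactly.
    `ask` defaults to input() and can be swapped out for testing.
    """
    if count <= threshold:
        return True
    reply = ask(f"I see you're about to remove {count} servers. Type YES to proceed: ")
    return reply.strip() == "YES"
```

Only asking above a threshold matters: if every command prompts, people learn to type YES without reading.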
You pushed the "trigger disruptions to S3 storage service" button!
The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"
playbook?? This is my data not a football!
Well that all sounds easy enough
Rookie mistake. I think we've all been there at least once. Hopefully not more than once...
AC because posting from work.
Lucky 10000: A "playbook" or "runbook" is an operational document which describes how humans should carry out regular work or respond to pages.
fyi
Enter command: DELETE ALL SERVERS
Confirm that you wish to delete all servers: YES
Are you sure? YES
You really wish to delete all servers? YES
I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO
#DeleteChrome
If this is a playbook, someone needs to script/automate a menu system with safety checks.
perl, python, shell or whatever system tool your admin team uses.
hopefully this was a jr member that made the goof.
go go gadget Openview....
This is why we have the technology called graphical user interfaces today. Maybe all that fashionable AI technology could be harnessed for the benefit of the tired, lone CLI rangers in order to minimize errors while preserving productivity. Oh, really.
Except anyone can make a mistake. Therefore inevitably everyone does make mistakes. Usually you're lucky and it wasn't important, but very rarely something like this happens to someone. All you have demonstrated is your own prejudice. I could equally say it was an overpaid lazy American worker, which is why companies are climbing over themselves to get cheaper skilled H-1B workers, but that also would be prejudiced. Unless you have the full details, which neither of us does, it's just meaningless noise.
I bet you are a hoot at parties...
The big problem is not the US-EAST-1 S3 outage.
The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.
Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
So Echo/Alexa was down because it depends on Lambda, new subscriptions to AMI software, Simple Email Service, etc.
Racist comments like that will earn you a bigot brand from half the country. Better be careful. In your best interests, The Democratic party of the United States
An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.
"A person is smart. People are dumb, panicky dangerous animals and you know it." - K
... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.
Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.
This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
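A minimal sketch of that default in Python with argparse, assuming a hypothetical removal tool. The `servers` argument, the `run` helper, and the `remove` callback are invented for the example. Dry-run is simply what happens when `--force` is absent.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Remove servers from a pool (hypothetical tool)."
    )
    parser.add_argument("servers", nargs="+", help="server names to remove")
    parser.add_argument(
        "--force",
        action="store_true",
        help="actually remove; without this flag nothing is changed",
    )
    return parser


def run(argv, remove=print):
    """Parse argv and return a list describing what was done or planned."""
    args = build_parser().parse_args(argv)
    actions = []
    for name in args.servers:
        if args.force:
            remove(name)  # stand-in for the real, destructive call
            actions.append(f"removed {name}")
        else:
            # Default path: report only, touch nothing.
            actions.append(f"would remove {name}")
    return actions
```

Running it without `--force` first shows the full list of affected servers, which is exactly the point at which a wrong input would be noticed.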
Really going to claim redundant sites for static data is hard? Eventually consistent databases are a thing, and have been for a long time. Outside of some very specific niches, how much stuff really needs ACID transactions?
And yes, I've built these many times, well before the cloud was a "thing". Using a single cloud provider for anything is a risk, for the same reasons we have used multiple data centers in different parts of the country/world since before the internet allowed commercial traffic, and probably before that (no direct experience, but the greybeards of my youth told stories).
No sir I dont like it.
I once saw someone change the usable memory of SQL Server down to 1/10 of the physical RAM by accident, because he thought he was so awesome and only used SQL for changing configuration options.
That's why I like the almost dumb-proof GUI, where you can double and triple check visually before you do something that can take a dozen applications offline.
A PHB. Not even rogue.
PHB: Hey, could I interrupt you for a second while you're typing that command? I've got something more important. We're thinking of changing the locks on the data center doors to another brand that offers locks in a variety of different colors. And the vendor has assured me that these locks can use the same code as my luggage.
I'll see your senator, and I'll raise you two judges.
"I must have put a decimal point in the wrong place or something. I always do that! I always mess up some mundane detail!"
Nobody expects The Spanish Inquisition!
At the very least they should have made the VMs unavailable instead of removing them, so that if something went wrong all they had to do was bring them back up again.
His name was Robert `); DROP TABLE S3-subsystem; --
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.
But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.
Anyone relinquishing control over their infrastructure to unaccountable third parties needs to be fired ASAP, and be replaced with someone who isn't a complete and utter moron. The mine is littered with dead canaries, and too many responses are along the lines of, "that won't happen to our canaries. Let's forge ahead."
Ok, anyone can make a mistake, but if H1Bs built the server management system to rely on manually typed commands and no one saw the obvious risk of doing that, where does the blame really lie?
The stereotypical H1B is culturally preconditioned to serve, not analyze; they'll (attempt to) do exactly what they're told... no more, no less, with little questioning.
So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.
I would say most stuff requires ACID or at least continuously consistent databases (you don't always need transactions or atomicity) and eventually consistent is a niche. Most 'eventually consistent' systems I've seen have an entire layer on top to make sure the data is consistent.
Anytime you do a financial transaction of any sort (free or not), you need a consistent system or risk someone being able to manipulate the data. Obviously, some developers don't really care at first since eventually consistent updates are fast enough initially. But once they realize the mistake they made, an entire layer of patchwork gets written to make it behave like a rational database again.
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended
oh, nevermind.
Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button. Then you could put a red dot or a green dot in the run book.
Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
Of course, you have all your business critical systems spread amongst several AWS data centers, right? And anything business critical is replicated several times? Not.
AWS should use the cloud, that way when one server goes down the load is picked up seamlessly by another one with no downtime ...... Oh Wait. Never Mind
**Life is too short to be serious**
The double-check script way works until it doesn't.
The command wasn't mis-typed, the scope was wrong.
Now Amazon can sell AWS Prime. If you are a subscriber of AWS Prime, we will check "Twice" before removing your servers. That should boost the profit somewhat.
> For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.
My experience is the exact opposite. When I write software to automate something, that automated procedure is planned and reviewed, then undergoes unit testing, integration testing, and acceptance testing. When I do something by hand - well you better hope the phone doesn't ring while I'm in the middle of it because if I lose concentration for a moment mistakes are quite possible. My boss agrees; the other day I mentioned I was doing something manually and he cocked his head and asked "manually? Isn't that subject to typos and other errors?"
Nothing but more meaningless noise. It's no wonder lazy overpaid Americans are getting replaced when they make bad assumptions and spout prejudice.
Maybe it's just me, but I typically drop a VM's NIC for a week, then shut it down for at least 2 weeks, before I remove it from inventory, and even then only after I've shoved it off to cold storage so it can be brought back.
Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week.
What am I missing here? Is there a way to "knock down" a website without affecting it?
Something extremely similar happened to my company last month. An EMC tech was onsite to work on the Isilon system. He was supposed to issue a command to put one of the nodes into maintenance mode. Instead he put the entire cluster into maintenance mode.
Needless to say, he is not welcome back. Ever.
Not to put myself above anyone else, I made a similar mistake a couple of years ago. I wrote a script that checked a CSV input against a list of computers in Active Directory. It was supposed to delete all of the servers not on the list. Instead it deleted all of the servers on the list. It was pretty easy to fix with another script that rejoined all of them to the domain. Nonetheless it was a pretty major fuck up, and one that I should have caught if I had tested my code properly.
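The inverted condition is easy to picture as a set operation. A small Python sketch, not the original script: `servers_to_delete` and `check_plan` are hypothetical names, and plain lists stand in for the CSV and the Active Directory query. The second function is the kind of cheap sanity check that would have caught the bug before anything was deleted.

```python
def _normalize(name):
    # Compare names case-insensitively and ignore stray whitespace.
    return name.strip().lower()


def servers_to_delete(directory, keep_list):
    """Return the servers in `directory` that are NOT on `keep_list`."""
    keep = {_normalize(name) for name in keep_list}
    return sorted(name for name in directory if _normalize(name) not in keep)


def check_plan(to_delete, keep_list):
    """Raise ValueError if the deletion plan touches anything on the keep list."""
    keep = {_normalize(name) for name in keep_list}
    overlap = [name for name in to_delete if _normalize(name) in keep]
    if overlap:
        raise ValueError(f"plan would delete protected servers: {overlap}")
```

Dropping the `not` in `servers_to_delete` reproduces the mistake described above, and `check_plan` then refuses the resulting plan.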
The admin has a very powerful tool. It has almost no constraints on what it can do because 99% of the time we want that power. We are dealing with an uncommon, unexpected situation and need to be able to have the power to do something different. The exact correct command might be something that no one anticipated before. It would be very time consuming to come up with rules preventing such a command.
Also I don't think more warning messages or safety logic is always the answer. Maybe practicing more without the autopilot is the answer. Look at Air France 447.
and the password is
1...
2...
3...
4...
5.
OK, so somebody took down thousands of servers, shit happens. Once the mistake was recognized why does it take so long to start up those servers again?
..since errors of this nature could be worse than a missile launch. https://en.wikipedia.org/wiki/...
Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
To all those guys who are bragging about how they would never put anything in the cloud (AWS or otherwise) because their data centers are so reliable, so redundant, fault-tolerant and insulated from human error that they can be held to the highest possible standards of up-time and accountability, are you hiring?
Or, would you be interested in a bridge I have for sale?
No, you can't use H1Bs in data centers because they shit behind the racks, and sometimes pull out cables by accident when getting up
These comments have gone from suck to blow.
The distributed failover mechanisms of Amazon Web Services have never failed because of an unrecoverable error in either hardware or software. In all cases where such a mistake was suspected, the problem has always turned out to be *human* error.
Some female with long fingernails flubbed it.
Well that all sounds easy enough
Well, computers are easy. A child can program one. That's why you should always hire the cheapest IT workers you can get.
Nothing like good old shell-commands to get things done fast and efficient.
Who needs GUIs, with their ugly labels and noobish warning popups?
Quoting the Dunning-Kruger effect is often the smug illustration that one is suffering from the Dunning-Kruger effect.
Much like quoting Godwin's Law is a favourite countering tactic for Nazi apologists.
Does anyone know exactly what the typo was?
I can't find the fumbled command line anywhere on the internet.
- these are not the droids you are looking for -
Can't it be switched into reverse?
Warnings are nice, but it's hard to anticipate all of the conditions where a warning might be in order. It's also hard to make people pay attention to warnings when each and every action produces one or more warnings.
Reversibility is a key. Let the admin see the consequences of the last given command and undo it if necessary. Warnings are for actions that are intrinsically irreversible. Build it into the commands if possible. If not, build it into the procedures. Don't delete an instance, just take it offline for a while first and see if anyone squawks. Don't even allow a delete of an instance that isn't offline already, preferably not until it has been offline for some time.
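One way to build that rule into the commands is a two-step removal where delete refuses to run unless the instance has already been offline for a while. A Python sketch under stated assumptions: the `Instance` class, its state names, and the `min_offline_seconds` parameter are all invented here and say nothing about how AWS tooling works.

```python
import time


class Instance:
    """Hypothetical instance record with a reversible, two-step removal."""

    def __init__(self, name):
        self.name = name
        self.state = "online"
        self.offline_since = None

    def take_offline(self, now=None):
        if self.state == "deleted":
            raise RuntimeError(f"{self.name} is already deleted")
        self.state = "offline"
        self.offline_since = time.time() if now is None else now

    def bring_online(self):
        # The undo: valid at any point before delete() succeeds.
        if self.state == "deleted":
            raise RuntimeError(f"{self.name} is already deleted")
        self.state = "online"
        self.offline_since = None

    def delete(self, min_offline_seconds, now=None):
        # The only irreversible step, so it carries all the checks.
        now = time.time() if now is None else now
        if self.state != "offline":
            raise RuntimeError(f"{self.name} must be offline before deletion")
        if now - self.offline_since < min_offline_seconds:
            raise RuntimeError(f"{self.name} has not been offline long enough")
        self.state = "deleted"
```

The waiting period is the "see if anyone squawks" window: until it expires, `bring_online` puts everything back the way it was.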
Nothing but more meaningless noise. It's no wonder lazy overpaid Americans are getting replaced when they make bad assumptions and spout prejudice.
India, the predominant H-1B visa source for American companies, is an authoritarian culture with arranged marriages and all that sort of stuff. That's not prejudice, it's an observation of fact. Can you dispute that with facts and evidence? Or do you just want the fantasy version of reality in your head to be right even if it's complete fantasy? You know that's the definition of delusion, right?
We'll make great pets
Except the Dunning-Kruger effect seems to affect most people. Most people seem to have zero intuition about technical problems that they have not experienced before. All technical problems are logical and a quick bit of thinking can give you nearly everything you need to know. There are only a few good ways to solve any given technical problem, and knowing even a small bit about what you're doing can tell you which of those few ways was chosen. If you can quickly come up with the few good ways to solve the problem, then you can immediately figure out the implementation.
The Dunning-Kruger effect is caused by a lack of metacognition. A lack of metacognition is caused by a lack of abstract reasoning. Abstract reasoning is what allows you to infer and deduce. People with strong abstract reasoning are able to quickly reverse engineer by inference and deduction.
For example: my co-worker was working on a project in Go, a language I have never used. He ran into some performance issues. He showed me an example of how the code was slowing down. After looking at the example, I was able to quickly recognize that the issue did not fit my mental model of how an async message-passing language like Go should work. After about a minute, I looked at his code. Again, I had never looked at Go code. Just looking at it, I saw no issue with his design, but I did notice he was making use of a nifty feature, zero-sized message queues, which in my parlance work like co-routines. I came to the conclusion that the implementation of Go that he was using better fit my mental model of how the program would work if it was not truly async co-routines, but instead backed by threads with a limited threadpool.
Lo and behold, I was correct. The GCC Go compiler was backing the "go routines" with a small threadpool of something like 4. I assumed it had an exponential decrease in performance based on his example. I had him double the threadpool and it agreed with my prediction by effectively running twice as fast for a given progress of the processing, but the N^2 scaling was killing it. I also predicted that making the message channel sizes at least 1 would give a large improvement, but a size of at least 2 would see near the max benefit. That was also true. The official Go compiler did not have this issue, but he was running this on a Raspberry Pi, and ARM was not supported by Google's compiler at the time.
This was back when I had about 4 days of experience with multi-threading and had only heard about thread pools. When I said "better fit my mental model of how the program would work if it was not truly async co-routines", that was a mental model I made up on the spot; it took me tens of seconds. I took what I assumed were facts (assuming Go was co-routine like), created a mental model about what guarantees I had, then I took my assumptions and iterated through them by plugging them into my partial mental model and "seeing" how it would affect the outcome. Then I chose the outcome that most closely fit what I thought I saw.
That is an example of inference and deduction. I knew what I did know and knew what I didn't know and filled in the blanks by making purposeful assumptions while fully understanding the ramifications of my assumptions, making sure my theoretical mental model would create the same characteristic reduced performance. Not just any reduced performance, but the exact same type.
Of course this has absolutely nothing to do with someone making a "typo" style mistake. I make mistakes all of the time, just rarely with reasoning.