An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com)

Posted by msmash on Thursday March 2, 2017 @06:46AM from the how-it-all-happened dept.

Amazon is apologizing for the disruptions to its S3 storage service that knocked down -- or, in some cases, otherwise affected -- dozens of websites earlier this week. The company also outlined what caused the issue: the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds:

> The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

Fucking interns by xxxJonBoyxxx · 2017-03-02 06:55 · Score: 5, Funny

>> wrong command

Sure, blame the intern actually typing the command.

More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
1. Re:Fucking interns by DickBreath · 2017-03-02 07:33 · Score: 5, Insightful
  
  Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.
  
  Here are two other ideas.
  
  1. Confirmation. Are you sure you want to delete 3,207 servers?
  (oh, drat, that's not what I meant!)
  
  2. Require more typing. If you really want to delete 3,207 servers, then type "DELETE SERVERS" in all caps and press enter. (or something like that. Similar to how Ripley had to go through a lot of motions to activate the self destruct.)
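Idea 2 above could be sketched like this in Python. This is only an illustration under stated assumptions: `required_phrase` and `confirm_delete` are hypothetical names, and the phrase includes the server count so that a reflexive "YES" is not enough.

```python
from typing import Callable


def required_phrase(count: int) -> str:
    """Phrase the operator must type; it embeds the count being deleted."""
    return f"DELETE {count} SERVERS"


def confirm_delete(count: int, ask: Callable[[str], str] = input) -> bool:
    """Return True only if the operator types the exact phrase.

    `ask` defaults to input() and is injectable so the check can be tested.
    """
    phrase = required_phrase(count)
    answer = ask(f'About to delete {count:,} servers. Type "{phrase}" to proceed: ')
    return answer.strip() == phrase
```

Because the count is part of the phrase, someone who meant to remove 3 servers has to type 3207 by hand before anything happens.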
  
  --
  
  I'll see your senator, and I'll raise you two judges.
2. Re:Fucking interns by squiggleslash · 2017-03-02 08:06 · Score: 5, Funny
  
  "alexa take down s3 servers a b and c"
  "OK, taking down s3"
  
  --
  You are not alone. This is not normal. None of this is normal.
3. Re:Fucking interns by imatter · 2017-03-02 08:33 · Score: 2
  
  That will be the first thing I do when I get home tonight, be prepared.
4. Re:Fucking interns by dgatwood · 2017-03-02 08:50 · Score: 5, Insightful
  
  > Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.
  The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system: the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster first, and it should get rolled back if it breaks things.
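The staged rollout with rollback described here could look something like the following Python sketch. All names (`staged_rollout`, `apply`, `healthy`, `rollback`) are hypothetical callbacks chosen for illustration; this is not any real configuration management system's API.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    clusters: Sequence[str],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Apply a change one cluster at a time.

    Returns the clusters that were changed, or an empty list if a health
    check failed and every applied change was rolled back.
    """
    done: List[str] = []
    for cluster in clusters:
        apply(cluster)
        done.append(cluster)
        if not healthy(cluster):
            # Undo in reverse order, newest change first.
            for changed in reversed(done):
                rollback(changed)
            return []
    return done
```

Putting a single small cluster first in `clusters` means a bad change is caught there before it reaches the rest of the fleet.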
  
  --
  Check out my sci-fi/humor trilogy at PatriotsBooks.
5. Re:Fucking interns by lgw · 2017-03-02 12:00 · Score: 2
  
  All of that was done, more-or-less, is the problem.
  Some poor schmuck took the command line from the (presumably reviewed) change to something billing-related (TFA is short on details there), and typed in that approved command to do the approved thing in a controlled way. But the command line had a typo, and, total WTF, the command line with a typo was able to wreak wholesale destruction.
  Whatever configuration management system acted on that command was garbage. Any sane system would have said "hell no!", or at least asked in Gary Gygax's voice "are you really sure you want to do that", and/or stopped at the point where a bunch of servers were down and redundancy was getting low and required some other scary command to do real damage.
  If it were a config file checked in instead of a command, nothing changes about that. The configuration management system simply shouldn't have the power to destroy the world based on a single input of any kind.
  S3 must have some vast fleet of servers, more than most companies have in total. How do you have a configuration management system that allows destructive changes on that scale in the first place? TFA is silent there.
  
  --
  Socialism: a lie told by totalitarians and believed by fools.
6. Re:Fucking interns by complete+loony · 2017-03-02 14:38 · Score: 2
  
  https://aws.amazon.com/message/41926/
  
  > We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.
  Yeah, they have apparently made this screw up much harder to repeat.
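The two safeguards in that statement (a capacity floor, and slower removal) might look roughly like this. A minimal Python sketch, not Amazon's actual tool: `Subsystem`, `CapacityError`, `plan_removal`, the capacity numbers, and the batch size are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subsystem:
    name: str
    active: int        # servers currently in service
    min_required: int  # hypothetical minimum capacity for this subsystem


class CapacityError(Exception):
    """Raised when a removal would take a subsystem below its floor."""


def plan_removal(sub: Subsystem, requested: int, max_per_step: int = 5) -> List[int]:
    """Return batch sizes for removing `requested` servers gradually.

    Refuses the whole request up front if it would breach the floor.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    if sub.active - requested < sub.min_required:
        raise CapacityError(
            f"{sub.name}: removing {requested} leaves {sub.active - requested}, "
            f"below the minimum of {sub.min_required}"
        )
    batches: List[int] = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

With a check like this, a mistyped input asks for too many servers and gets an error instead of an outage.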
  
  --
  09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
AWS Internal Help Desk by EmagGeek · 2017-03-02 06:55 · Score: 5, Funny

The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"
1. Re:AWS Internal Help Desk by Lisias · 2017-03-02 10:12 · Score: 3, Funny
  
  No. He is splitting the command in more than one line. The next input would trigger the killing frenzy.
  Not enough UNIX, as it appears.
  
  --
  Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
playbook?? This is my data not a football! by Joe_Dragon · 2017-03-02 06:55 · Score: 2, Funny

playbook?? This is my data not a football!
1. Re:playbook?? This is my data not a football! by Altrag · 2017-03-02 07:56 · Score: 2
  
  Yeah that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.
  And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty much requires having a large number of players all scaling up at different times in order to maximize workload (and thus minimize wasted cycles/bandwidth/whatever measurement.)
  If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.
  But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services. At the very least, they'll have the knowledge and tools to secure their system a hell of a lot better than you just running updates and crossing your fingers.
  And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?
  Cloud services aren't a silver bullet to be sure. My own company doesn't use them because they don't suit our needs (we sell pure desktop apps.) But if they can suit your needs, there's a good chance that they'll be a far better long-term solution than anything you can throw together yourself for the vast majority of "you" (whether cheaper or not is another question..)
2. Re:playbook?? This is my data not a football! by __aaclcg7560 · 2017-03-02 08:40 · Score: 2
  
  > If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.
  A dedicated network between my workstations and the file server is adequate for my needs.
  
  > But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services.
  Very little of my data need to live 24/7 on the Internet. Since I'm converting my dynamic websites to static generated websites, the data behind my websites stays off the Internet as well.
  
  > And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?
  My file server is a custom built PC that runs FreeNAS (BSD) in a Z2 (RAID-6) hard drive configuration. Current uptime is three months, since an extended power outage drained the UPS battery and prompted the server to safely power down. It's been ten years since I lost any data due to a hard drive crashing on the file server.
Re: cloud by Anonymous Coward · 2017-03-02 07:01 · Score: 3, Insightful

Well that all sounds easy enough
Transcript by 93+Escort+Wagon · 2017-03-02 07:12 · Score: 5, Funny

Enter command: DELETE ALL SERVERS
Confirm that you wish to delete all servers: YES
Are you sure? YES
You really wish to delete all servers? YES
I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO

--
#DeleteChrome
1. Re:Transcript by sky_khan72 · 2017-03-02 08:31 · Score: 3, Funny
  
  I once worked at a software firm. They said they got rid of one feature after getting support calls about it: an irreversible destructive operation for which you had to enter something like "YES I UNDERSTAND. DELETE ANYWAY" to proceed.
2. Re:Transcript by painandgreed · 2017-03-02 10:05 · Score: 2
  
  > Enter command: DELETE ALL SERVERS
  Funny enough, I started this one job and was given an account on the VAX system that handled all main applications plus internal email. When I went into the email list to read and write my personal email, it looked something like this:
  Read Email
  Send Email
  Read Sent Email
  Read Deleted Email
  Delete All Email
  Well, after a week or two, I had a bunch of email and no need to keep it, so I hit Delete All Email. What nobody told me was since I was an admin, that command meant ALL EMAIL, for EVERYBODY on the system, gone forever. Luckily, my new boss covered for me. A year or two later another newbie did the same thing, but thankfully, he only deleted a year or two of email.
Re:rm -rf /* by molarmass192 · 2017-03-02 07:22 · Score: 2

It's the "-f" that's scary. Heck, I "rm -r" at least once a month, but when that "-f" is needed, then it's double and triple check time, proceeded by a feeling of dread as my finger depresses the return key.

--

Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
S3 outage not the big problem by TheSync · 2017-03-02 07:27 · Score: 4, Interesting

The big problem is not the US-EAST-1 S3 outage.
The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.
> Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
So Echo/Alexa was down because it depends on Lambda, new subscriptions to AMI software, Simple Email Service, etc.
1. Re:S3 outage not the big problem by phantomfive · 2017-03-02 07:50 · Score: 3, Interesting
  
  Yeah, actually I was surprised how much Amazon stuff actually went down. I had a coworker who had previously worked at Amazon, and he profusely assured me that AWS was not stable. Clearly he was right. (I don't know why this is big news all of a sudden; AWS has a big outage every year on average.)
  
  --
  "First they came for the slanderers and i said nothing."
2. Re:S3 outage not the big problem by swb · 2017-03-02 09:31 · Score: 2
  
  Wasn't it Netflix that released their internal scripts for testing reliability? Randomly blowing away cloud instances and other core components so they could more realistically test the ability of the HA to deliver?
  I generally agree that HA is hard, and it's made harder still by PHBs who ask for HA and then cherry pick the cheapest element (out of several necessary), blab to management that you are fault tolerant and then never allow actually testing it.
  I also blame vendors for waaayyyy overpromising what their HA products can actually do, and sales people for piling on and selling actually unnecessary shit when customers actually indicate that, yes, they would like to buy everything the overcomplex HA system requires.
  Now nobody believes the vendor requirement list (when they can find it) because it's been so bloated.
When I was in ops... by QuietLagoon · 2017-03-02 07:31 · Score: 4, Insightful

... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.
1. Re:When I was in ops... by rijrunner · 2017-03-02 09:25 · Score: 2
  
  Procedures are just the archaeology of mistakes..
Been there, done that by Dorianny · 2017-03-02 07:31 · Score: 5, Insightful

Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.
This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
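A dry-run-by-default tool along those lines could be sketched with Python's standard `argparse`. The tool name `remove-servers`, the `run` function, and the injected `remove` callback are hypothetical; only the `--force` switch comes from the suggestion above.

```python
import argparse
from typing import Callable, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="remove-servers")
    parser.add_argument("hosts", nargs="+", help="hosts to remove")
    # Without --force the tool only reports what it would do.
    parser.add_argument("--force", action="store_true",
                        help="actually remove the hosts (default is a dry run)")
    return parser


def run(argv: Sequence[str],
        remove: Callable[[str], None] = lambda host: None) -> List[str]:
    """Return one report line per host; call `remove` only with --force."""
    args = build_parser().parse_args(argv)
    report: List[str] = []
    for host in args.hosts:
        if args.force:
            remove(host)
            report.append(f"removed {host}")
        else:
            report.append(f"[dry-run] would remove {host}")
    return report
```

The dry-run output lists every affected host, so an oversized target list is visible before anyone adds `--force`.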
1. Re:Been there, done that by idji · 2017-03-02 08:13 · Score: 3, Insightful
  
  then people will type "--force" automatically.
Re: cloud by silas_moeckel · 2017-03-02 07:34 · Score: 2

Really going to claim redundant sites for static data is hard? Eventually consistent databases are a thing and have been for a long time. Outside of some very specific niches, how much stuff really needs ACID transactions?
And yes, I've built these many times, well before the cloud was a "thing". Using a single cloud provider for anything is a risk, for the same reasons we have used multiple data centers in different parts of the country/world since before the internet allowed commercial traffic, and probably before that (no direct experience, but the greybeards of my youth told stories).

--
No sir I dont like it.
the magic of the command line by known_coward_69 · 2017-03-02 07:38 · Score: 2

i've seen someone once change the usable memory of SQL server down to 1/10 the physical RAM by accident cause he thought he was so awesome and only used sql for changing configuration options
why i like the almost dumb proof GUI where you can double and triple check visually before you do something that can take a dozen applications offline
This is not some mundane detail, Michael! by paratek · 2017-03-02 07:41 · Score: 2

"I must have put a decimal point in the wrong place or something. I always do that! I always mess up some mundane detail!"

--
Nobody expects The Spanish Inquisition!
The employee name was revealed. by 140Mandak262Jamuna · 2017-03-02 07:55 · Score: 5, Funny

Looks like Amazon has very strict sign off requirement. After entering the command, the system asked for his name to be logged for the audit trail.
His name was Robert `); DROP TABLE S3-subsystem; --

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
1. Re:The employee name was revealed. by The+Other+White+Meat · 2017-03-02 15:45 · Score: 2
  
  Every time they tried to add his resume to the blacklist, it just disappeared.
  
  --
  
  --- Generation X: The first generation to have SIG lines inferior to their parents... ---
Cloud Services Are Inherently Unreliable by StormReaver · 2017-03-02 07:55 · Score: 2

Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.
But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.
Anyone relinquishing control over their infrastructure to unaccountable third parties needs to be fired ASAP, and be replaced with someone who isn't a complete and utter moron. The mine is littered with dead canaries, and too many responses are along the lines of, "that won't happen to our canaries. Let's forge ahead."
Re: cloud by guruevi · 2017-03-02 08:08 · Score: 4, Insightful

I would say most stuff requires ACID or at least continuously consistent databases (you don't always need transactions or atomicity) and eventually consistent is a niche. Most 'eventually consistent' systems I've seen have an entire layer on top to make sure the data is consistent.
Anytime you do a financial transaction of any sorts (free or not), you need a consistent system or risk someone being able to manipulate the data. Obviously, some developers don't really care at first since eventually consistent updates are fast enough initially. But once they realize the mistake they made, an entire layer of patchwork gets written to make it behave like a rational database again.

--
Custom electronics and digital signage for your business: www.evcircuits.com
Well, at least that won't happen again by s1d3track3D · 2017-03-02 08:09 · Score: 3, Insightful

I'm glad they located the issue and put safeguards in place to make sure it doesn't happen again.

> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended
oh, nevermind.
Re:Admin by color by mjr167 · 2017-03-02 08:46 · Score: 3, Insightful

What do color blind admins do?
Amazon should use the cloud by ghoul · 2017-03-02 09:03 · Score: 2

AWS should use the cloud, that way when one server goes down the load is picked up seamlessly by another one with no downtime ...... Oh Wait. Never Mind

--
**Life is too short to be serious**
Business Opportunity by ghoul · 2017-03-02 09:15 · Score: 2

Now Amazon can sell AWS prime. If you are a subscriber of AWS prime we will check "Twice" before removing your servers. That should boost the profit somewhat.

--
**Life is too short to be serious**
My experience suggests the opposite by raymorris · 2017-03-02 09:16 · Score: 5, Insightful

> For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.
My experience is the exact opposite. When I write software to automate something, that automated procedure is planned and reviewed, then undergoes unit testing, integration testing, and acceptance testing. When I do something by hand - well you better hope the phone doesn't ring while I'm in the middle of it because if I lose concentration for a moment mistakes are quite possible. My boss agrees; the other day I mentioned I was doing something manually and he cocked his head and asked "manually? Isn't that subject to typos and other errors?"
Re:GUIs and AIs and Ohs by fzammett · 2017-03-02 09:50 · Score: 2

Oh, be careful, friend! I made the grave mistake of suggesting on Reddit that we've kinda/sorta/maybe become too enamored of CLIs and that just MAYBE a GUI MIGHT have prevented this, and I got hammered mercilessly.
You don't want to say anything that doesn't equate to worship at the feet of the almighty, great and awesome CLI around the wrong people.

--
If a pion (n-) collides with a proton in the woods & noone is there to hear it, does lamdba decay into the source pa
Re:Admin by color by dissy · 2017-03-02 12:06 · Score: 2

>> Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button.
> What do color blind admins do?
Take down all the S3 services in the US-EAST-1 Region :P
Re: cloud by RabidReindeer · 2017-03-02 19:51 · Score: 2

> Well that all sounds easy enough
Well, computers are easy. A child can program one. That's why you should always hire the cheapest IT workers you can get.