Slashdot Mirror


An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com)

Amazon is apologizing for the disruptions to its S3 storage service that knocked down, or in some cases otherwise affected, dozens of websites earlier this week. The company also outlined what caused the issue: the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds: "The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

12 of 169 comments

  1. Fucking interns by xxxJonBoyxxx · · Score: 5, Funny

    >> wrong command

    Sure, blame the intern actually typing the command.

    More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."

    1. Re:Fucking interns by DickBreath · · Score: 5, Insightful

      Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      Here are two other ideas.

      1. Confirmation. Are you sure you want to delete 3,207 servers?
      (oh, drat, that's not what I meant!)

      2. Require more typing. If you really want to delete 3,207 servers, then type "DELETE SERVERS" in all caps and press enter. (or something like that. Similar to how Ripley had to go through a lot of motions to activate the self destruct.)
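Both ideas can be combined in a few lines. What follows is only a rough Python sketch, not anything Amazon is known to use: the function name `confirm_removal` and the exact phrase format are assumptions made up for illustration. Putting the server count inside the phrase means the operator has to read the number before anything happens.

```python
def confirm_removal(servers, read_line=input):
    """Return True only if the operator types the exact confirmation phrase.

    Hypothetical helper: the phrase embeds the server count, so a wrong
    input that selects too many servers is visible before anything runs.
    """
    count = len(servers)
    phrase = f"DELETE {count} SERVERS"
    print(f"You are about to remove {count:,} servers.")
    answer = read_line(f'Type "{phrase}" in all caps to proceed: ')
    # Anything other than the exact phrase (including a plain "yes") aborts.
    return answer.strip() == phrase
```

The `read_line` parameter exists only so the prompt can be driven from a test instead of a keyboard.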

      --

      I'll see your senator, and I'll raise you two judges.
    2. Re:Fucking interns by squiggleslash · · Score: 5, Funny

      "alexa take down s3 servers a b and c"
      "OK, taking down s3"

      --
      You are not alone. This is not normal. None of this is normal.
    3. Re:Fucking interns by dgatwood · · Score: 5, Insightful

      > Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system: the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster first, and it should get rolled back if it breaks things.
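A staged rollout with rollback could look roughly like the Python sketch below. It assumes the review has already happened upstream, and every name in it (`staged_rollout`, `apply_change`, `is_healthy`, `roll_back`) is a hypothetical placeholder rather than part of any real configuration management tool.

```python
def staged_rollout(clusters, apply_change, is_healthy, roll_back):
    """Apply a change one cluster at a time, undoing it on the first failure.

    All four arguments are hypothetical: `clusters` is an ordered list and
    the three callables each take a single cluster.
    Returns True if every cluster was changed and stayed healthy.
    """
    changed = []
    for cluster in clusters:
        apply_change(cluster)
        changed.append(cluster)
        if not is_healthy(cluster):
            # Undo in reverse order; later clusters are never touched.
            for done in reversed(changed):
                roll_back(done)
            return False
    return True
```

Putting the smallest or least critical cluster first in `clusters` limits how much breaks before the health check catches a bad change.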

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

  2. AWS Internal Help Desk by EmagGeek · · Score: 5, Funny

    The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"

  3. Transcript by 93+Escort+Wagon · · Score: 5, Funny

    Enter command: DELETE ALL SERVERS
    Confirm that you wish to delete all servers: YES
    Are you sure? YES
    You really wish to delete all servers? YES
    I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
    Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO

    --
    #DeleteChrome
  4. S3 outage not the big problem by TheSync · · Score: 4, Interesting

    The big problem is not the US-EAST-1 S3 outage.

    The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.

    > Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

    So Echo/Alexa was down because it depends on Lambda, and so were new subscriptions to AMI software, Simple Email Service, etc.

  5. When I was in ops... by QuietLagoon · · Score: 4, Insightful

    ... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.
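One way to enforce the "second set of eyes" rule mechanically is to refuse to run any script whose exact contents have not been approved by someone other than its author. This is a minimal Python sketch under that assumption; `may_run` and the `approvals` mapping are invented for illustration and do not describe any particular ops team's tooling.

```python
import hashlib


def may_run(script_text, author, approvals):
    """Allow a script only if a different person approved this exact text.

    `approvals` is a hypothetical mapping from the SHA-256 hex digest of a
    reviewed script to the name of the reviewer who signed it off.
    """
    digest = hashlib.sha256(script_text.encode("utf-8")).hexdigest()
    reviewer = approvals.get(digest)
    # Editing the script after review changes the digest, voiding approval.
    return reviewer is not None and reviewer != author
```

Because approval is tied to the digest, a last-minute edit to the script requires a fresh review.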

  6. Been there, done that by Dorianny · · Score: 5, Insightful

    Speaking as a sysadmin who has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-C] was enough to blow up half the production environment.

    This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
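Something like this is easy to build with Python's standard `argparse` module. The sketch below is illustrative only: the `remove-servers` program name and the `run` helper are made up, and the actual removal is passed in as a callable so that nothing real is touched here.

```python
import argparse


def build_parser():
    # Hypothetical CLI: dry run is the default, --force is the opt-in.
    parser = argparse.ArgumentParser(prog="remove-servers")
    parser.add_argument("servers", nargs="+", help="servers to remove")
    parser.add_argument("--force", action="store_true",
                        help="actually remove the servers instead of listing them")
    return parser


def run(argv, remove=print):
    """Return the lines that would be reported; call `remove` only with --force."""
    args = build_parser().parse_args(argv)
    if not args.force:
        return [f"DRY RUN: would remove {name}" for name in args.servers]
    for name in args.servers:
        remove(name)
    return [f"removed {name}" for name in args.servers]
```

With this layout, `run(["web-01", "web-02"])` only reports what it would do, and the destructive path needs `--force` typed out explicitly.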

  7. The employee name was revealed. by 140Mandak262Jamuna · · Score: 5, Funny

    Looks like Amazon has a very strict sign-off requirement. After entering the command, the system asked for his name to be logged for the audit trail.

    His name was Robert `); DROP TABLE S3-subsystem; --

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  8. Re: cloud by guruevi · · Score: 4, Insightful

    I would say most stuff requires ACID or at least continuously consistent databases (you don't always need transactions or atomicity) and eventually consistent is a niche. Most 'eventually consistent' systems I've seen have an entire layer on top to make sure the data is consistent.

    Anytime you do a financial transaction of any sort (free or not), you need a consistent system or risk someone being able to manipulate the data. Obviously, some developers don't really care at first since eventually consistent updates are fast enough initially. But once they realize the mistake they made, an entire layer of patchwork gets written to make it behave like a rational database again.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  9. My experience suggests the opposite by raymorris · · Score: 5, Insightful

    > For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.

    My experience is the exact opposite. When I write software to automate something, that automated procedure is planned and reviewed, then undergoes unit testing, integration testing, and acceptance testing. When I do something by hand - well you better hope the phone doesn't ring while I'm in the middle of it because if I lose concentration for a moment mistakes are quite possible. My boss agrees; the other day I mentioned I was doing something manually and he cocked his head and asked "manually? Isn't that subject to typos and other errors?"