Slashdot Mirror


An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com)

Amazon is apologizing for the disruptions to its S3 storage service that knocked down, or in some cases otherwise affected, dozens of websites earlier this week. The company also outlined what caused the issue: the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds: "The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

169 comments

  1. CLI ? by doesnothingwell · · Score: 1

    Cheap help is so easy to overwork.

    --
    They can have my command prompt when they pry it from my cold dead fingers.
    1. Re:CLI ? by Anonymous Coward · · Score: 1

      I don't know about cheap, but overworked is on the money. It's surprising Amazon's people aren't fucking up catastrophically more often, when they're stuck in the office 80 hours a week.

  2. Fucking interns by xxxJonBoyxxx · · Score: 5, Funny

    >> wrong command

    Sure, blame the intern actually typing the command.

    More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."

    1. Re:Fucking interns by Anonymous Coward · · Score: 0

      That's why we run Linux.

      We don't want stupid shit getting in the way of running commands with constant nags of "Are you really sure?" and "Please elevate your privileges even though you have permission to do this.".

    2. Re:Fucking interns by DickBreath · · Score: 5, Insightful

      Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      Here are two other ideas.

      1. Confirmation. Are you sure you want to delete 3,207 servers?
      (oh, drat, that's not what I meant!)

      2. Require more typing. If you really want to delete 3,207 servers, then type "DELETE SERVERS" in all caps and press enter. (or something like that. Similar to how Ripley had to go through a lot of motions to activate the self destruct.)
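
The two ideas above can be combined in a few lines. What follows is only a sketch: the function name `confirm_destructive`, the `threshold` of 10, and the exact phrase format are assumptions made for illustration, not anything Amazon's tooling is known to do.

```python
def confirm_destructive(action, count, read_line=input, threshold=10):
    """Return True only if the operator explicitly confirms the action.

    Small operations need a plain "yes"; anything at or above `threshold`
    requires typing a full phrase that includes the exact server count.
    """
    if count < threshold:
        expected = "yes"
        prompt = f"Are you sure you want to {action} {count} servers? Type yes: "
    else:
        expected = f"{action.upper()} {count} SERVERS"
        prompt = f'This will {action} {count} servers. Type "{expected}" to proceed: '
    # read_line is injectable so the check can be exercised without a terminal.
    return read_line(prompt).strip() == expected
```

Making the operator retype the count is the point: someone who meant 32 servers is unlikely to type "DELETE 3207 SERVERS" without noticing.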

      --

      I'll see your senator, and I'll raise you two judges.
    3. Re:Fucking interns by squiggleslash · · Score: 5, Funny

      "alexa take down s3 servers a b and c"
      "OK, taking down s3"

      --
      You are not alone. This is not normal. None of this is normal.
    4. Re: Fucking interns by Anonymous Coward · · Score: 0

      This way of fixing things, making lifer harder forever for everyone just because might stop a dumb employee doing something dumb, well, it makes me hate what I do.
      Issue can be fixed with some old-fashioned verbal abuse.

    5. Re:Fucking interns by Tablizer · · Score: 1

      Sure, blame the intern actually typing the command.

      A (now retired) colleague of mine, I'll call Bob, once was a mainframe operator. An incompetent programmer used to blame his accounting application adding errors on Bob for "entering the command wrong".

      It was something simple like "RUN ACCT7", but Bob was accused of doing it wrong without specifics, and formally written up for that. HR didn't know anything about computers, so it was easy to bullshit them into creating reprimands.

      We'd always joke when something went wrong that "Bob pressed the Enter key at the wrong angle."

    6. Re:Fucking interns by imatter · · Score: 2

      That will be the first thing I do when I get home tonight, be prepared.

    7. Re:Fucking interns by dgatwood · · Score: 5, Insightful

      Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.
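
A staged rollout with automatic rollback, as described above, might look roughly like this. It is a minimal sketch: `apply_change`, `is_healthy`, and `roll_back` are hypothetical callbacks the caller would supply, and real systems would add soak time between stages.

```python
def staged_rollout(clusters, apply_change, is_healthy, roll_back):
    """Apply a change one cluster at a time; undo everything on first failure.

    Returns the list of clusters left changed (empty if rolled back).
    """
    done = []
    for cluster in clusters:
        apply_change(cluster)
        done.append(cluster)
        if not is_healthy(cluster):
            # Undo in reverse order, including the cluster that just failed.
            for changed in reversed(done):
                roll_back(changed)
            return []
    return done
```

The first cluster acts as the canary: if it breaks, the damage stops there instead of reaching the whole fleet.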

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    8. Re:Fucking interns by Anne+Thwacks · · Score: 1
      ^^^ and it should get rolled back if it breaks things

      This! A thousand times: this!

      --
      Sent from my ASR33 using ASCII
    9. Re: Fucking interns by xxxJonBoyxxx · · Score: 1

      >> making lifer harder forever for everyone just because might stop a dumb employee doing something dumb

      Did you know that this is the heart of any "quality" initiative since World War II? Make things idiot-proof, and your employees will save their mental energy for harder things and make fewer mistakes overall.

    10. Re: Fucking interns by Megol · · Score: 1

      If you think anybody that makes mistakes is dumb then you are an absolute idiot. I am all for making "lifer" harder for idiots.

    11. Re:Fucking interns by safetyinnumbers · · Score: 1

      I can see how this happened:

      - "Okay, I've typed the shutdown command, now what was the name of the PBX server?"

      - "Asterisk"

    12. Re: Fucking interns by Anonymous Coward · · Score: 0

      We have a horrible internally designed system that we've paid millions for; it's years overdue and still full of bugs. Anyway, we had one bug and one of the guys on the support team actually asked if we were clicking correctly. No lie.

    13. Re:Fucking interns by lgw · · Score: 2

      All of that was done, more-or-less, is the problem.

      Some poor schmuck took the command line from the (presumably reviewed) change to something billing-related (TFA is short on details there), and typed in that approved command to do the approved thing in a controlled way. But the command line had a typo, and, total WTF, the command line with a typo was able to wreak wholesale destruction.

      Whatever configuration management system acted on that command was garbage. Any sane system would have said "hell no!", or at least asked in Gary Gygax's voice "are you really sure you want to do that", and/or stopped at the point where a bunch of servers were down and redundancy was getting low and required some other scary command to do real damage.

      If it were a config file checked in instead of a command, nothing changes about that. The configuration management system simply shouldn't have the power to destroy the world based on a single input of any kind.

      S3 must have some vast fleet of servers, more than most companies have in total. How do you have a configuration management system that allows destructive changes on that scale in the first place? TFA is silent there.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    14. Re:Fucking interns by complete+loony · · Score: 2
      https://aws.amazon.com/message/41926/

      We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level

      Yeah, they have apparently made this screw up much harder to repeat.
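
The two safeguards in that quote (slower removal, and a floor on capacity) could be sketched as below. The function name `plan_removal` and its parameters are made up for illustration; the actual AWS tool and its thresholds are not public.

```python
def plan_removal(active, requested, min_required, max_batch):
    """Split a removal request into small batches, refusing unsafe requests.

    Raises ValueError if removing `requested` servers would leave fewer than
    `min_required` active; otherwise returns batch sizes of at most `max_batch`.
    """
    if requested < 0 or max_batch < 1:
        raise ValueError("requested must be >= 0 and max_batch >= 1")
    if active - requested < min_required:
        raise ValueError(
            f"refusing: {active} - {requested} leaves fewer than {min_required} servers"
        )
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_batch, remaining)
        batches.append(step)
        remaining -= step
    return batches
```

With a check like this, a mistyped count fails before anything is touched, rather than partway through the removal.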

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    15. Re:Fucking interns by Anonymous Coward · · Score: 0

      And, honestly, the article is greatly oversimplifying what actually happened. The number and variety of abstractions make it difficult to manage the services in a way that prevents all possible variations from halting the system. It's not in the realm of pure logic anymore but a confluence of resources being tied up that creates a domino effect. Timeouts and retries at every level of abstraction, metrics in every microservice, and perfect coordination are impossible-to-achieve goals, but AWS gets farther than most.

      It's rather pitiful that most people's reactions to this will be "this is why you have a process in place for change management." and not some more nuanced discussion about how difficult it is to build resilient distributed systems and how current strategies are, in the scope of what will come, still in their infancies.

    16. Re:Fucking interns by Anonymous Coward · · Score: 0

      Everyone loves blaming the underling when something goes wrong. How about giving him credit when the company makes billions due to his ingenuity? No... the company/management has to take credit for that.

    17. Re: Fucking interns by Anonymous Coward · · Score: 0

      Found the Amazon employee that fucked it up!

    18. Re:Fucking interns by Anonymous Coward · · Score: 0

      >> wrong command
      Sure, blame the intern actually typing the command.

      They seem to be going out of their way to *not* blame the person who ran the correct command with an incorrect parameter.

      More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."

      The statement from Amazon/AWS includes:

      "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks."

    19. Re:Fucking interns by Bongo · · Score: 1

      The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.

      Yes, and I gather, similarly, airline safety is about the systems (culture/procedures/tech) which allowed the individual to make the mistake. Even the way a checklist is written, is critical (what to leave out is as important as what to include).

      And I gather like with a new drug, you don't give all your subjects the same injection all together. Wait and see if the first survives, then go on.

      And it is always kinda fascinating how, the system one builds to cope with one set of scenarios, in turn creates a new set of scenarios.

    20. Re: Fucking interns by Anonymous Coward · · Score: 0

      Was his name Steve Jobs?

    21. Re: Fucking interns by DickBreath · · Score: 1

      I agree strongly.

      When I do anything on a production server, I am extremely careful. Paranoid even. I double check everything. And automation helps avoid mistakes. I only configure a few parameters of a script. But I can double check that before I run it. And I leave the previous configurations commented as examples. That way I just clone the current one, change the version numbers, etc.

      Since these Amazon servers that were deleted have the potential to do a HUGE amount of damage, I don't have any pity for someone crying about "making life more difficult". Waaaah! If you can't accept a few controls to help avoid errors, then you shouldn't have responsibility for the potential amount of damage you could cause.

      I find it funny that the missile commanders who can launch nuclear weapons can't just type in rm -rf / and hit ENTER. It seems like someone made their life a lot more difficult. They have to go through several procedures and safeguards designed to prevent accidental destruction of servers owned not just by Amazon, but by others as well. Google, Microsoft, Apple, etc.


      On a different note, as for the subject line, which you merely inherited from the parent post; nobody should be doing that to the poor interns. Assuming the first word was a verb and not an adjective.

      --

      I'll see your senator, and I'll raise you two judges.
  3. You Crazy Fool! by seven+of+five · · Score: 1

    You pushed the "trigger disruptions to S3 storage service" button!

  4. AWS Internal Help Desk by EmagGeek · · Score: 5, Funny

    The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"

    1. Re:AWS Internal Help Desk by Anonymous Coward · · Score: 0

      # rm -fr \
      error good sir - which command now?

    2. Re:AWS Internal Help Desk by Holi · · Score: 1

      Why did you use a backslash? Windows much?

      --
      Sorry, teleporters just kill you and then make a copy. A perfect, soul-less copy.
    3. Re:AWS Internal Help Desk by fluffernutter · · Score: 1

      My eyes are bleeding!

      --
      Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
    4. Re:AWS Internal Help Desk by Lisias · · Score: 3, Funny

      No. He is splitting the command in more than one line. The next input would trigger the killing frenzy.

      Not enough UNIX, as it appears.

      --
      Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
    5. Re: AWS Internal Help Desk by Anonymous Coward · · Score: 0

      Have you tried turning it off and on, again?

  5. playbook?? This is my data not a football! by Joe_Dragon · · Score: 2, Funny

    playbook?? This is my data not a football!

    1. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0

      playbook?? This is my data not a football!

      It might be your data, but it's Amazon's football game. Their field, their ball, their refs, their rules, their playbook.

    2. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0

      Maybe they are using Ansible.

    3. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0

      Scripted and rehearsed system maintenance is SOP for good shops. You don't have your highly paid senior folks doing grunt work, and you don't put a grunt in front of a terminal and let them work by the seat of their pants. Obviously you try to automate as much as possible, but there will always remain things that cannot be automated and must be done by a human.

    4. Re:playbook?? This is my data not a football! by __aaclcg7560 · · Score: 1

      This is my data not a football!

      If your data is so important, keep it on your own server. Preferably on a separate network not directly connected to the Internet.

    5. Re:playbook?? This is my data not a football! by sexconker · · Score: 1

      Obviously you try to automate as much as possible

      For efficiency, sure.

      For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.

      Similarly for security. Does your shit come back up after a reboot? Or does someone have to key in passwords to get the drives unlocked/decrypted, then get the OS running, and then get the various service accounts to do their shit.

      No matter where you draw the line, documentation for regular procedures, disaster recovery, and initial configuration is king.

    6. Re:playbook?? This is my data not a football! by Altrag · · Score: 2

      Yeah that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.

      And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty much requires having a large number of players all scaling up at different times in order to maximize workload (and thus minimize wasted cycles/bandwidth/whatever measurement.)

      If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.

      But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services. At the very least, they'll have the knowledge and tools to secure their system a hell of a lot better than you just running updates and crossing your fingers.

      And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?

      Cloud services aren't a silver bullet to be sure. My own company doesn't use them because they don't suit our needs (we sell pure desktop apps.) But if they can suit your needs, there's a good chance that they'll be a far better long-term solution than anything you can throw together yourself for the vast majority of "you" (whether cheaper or not is another question..)

    7. Re:playbook?? This is my data not a football! by __aaclcg7560 · · Score: 2

      If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.

      A dedicated network between my workstations and the file server is adequate for my needs.

      But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services.

      Very little of my data need to live 24/7 on the Internet. Since I'm converting my dynamic websites to static generated websites, the data behind my websites stays off the Internet as well.

      And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?

      My file server is a custom built PC that runs FreeNAS (BSD) in a Z2 (RAID-6) hard drive configuration. Current uptime is three months since an extended power outage drained the UPS battery and prompted the server to safely power down. It's been ten years since I lost any data due to a hard drive crashing on the file server.

    8. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0
      Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.

      All the evidence is that the average half-wit can outperform AWS, Google, and even below average half-wits can outperform Azure.

      Disclaimer: I keep all my data in a dufflebag under my kid sisters' bed.

    9. Re:playbook?? This is my data not a football! by wasteoid · · Score: 1

      Relax, they were just deflating it.

    10. Re:playbook?? This is my data not a football! by lgw · · Score: 1

      Scripted and rehearsed system maintenance is SOP for good shops. You don't have your highly paid senior folks doing grunt work, and you don't put a grunt in front of a terminal and let them work by the seat of their pants. Obviously you try to automate as much as possible, but there will always remain things that cannot be automated and must be done by a human.

      Amazon famously has their highly paid senior engineers doing grunt work. Clearly, given yesterday, it's not their only mistake.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    11. Re:playbook?? This is my data not a football! by __aaclcg7560 · · Score: 1

      Disclaimer: I keep all my data in a dufflebag under my kid sisters' bed.

      Not necessarily the best place to store your Playboy magazine collection. ;)

    12. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0

      I know my lesbian sister always steals my Playboys.

    13. Re:playbook?? This is my data not a football! by Altrag · · Score: 1

      is adequate for my needs

      That's kind of the tricky part there. Not everyone's needs will match yours.

      the data behind my websites stays off the Internet as well

      So basically a backup that you can use to regenerate your internet-facing site. Always good to have more backups no matter what their form but again, other people might have different needs. In particular some dynamic sites actually need to be dynamic (frequently for business reasons more than technical ones, but that's beside the point. Needs are needs no matter where they're generated.)

      And of course, I'm assuming you're transferring your data from the dynamic server/network to the internet-facing network using a clean flashdrive or equivalent every time? Because if the two are connected via a network link then you have a potential security risk. Perhaps a small risk but hey, even things like OpenSSL (where the entire point is security) have failed here and there.

      It's been ten years since I lost any data due to a hard drive crashing on the file server

      And motherboard, CPU, PSU, etc failures? RAID protects you from data loss but it does little for your uptime if some other component crashes (or even if an HDD crashes and you don't happen to be using a hot-swappable setup.) The only way to protect uptime is redundant servers and now we're talking into large investments in hardware and technical knowledge to get that kind of thing going.

      Don't get me wrong, if your setup is sufficient for your needs then great. As I stated, I myself don't use cloud services because they don't suit my needs. But I still allow myself to recognize the benefit of such services for people who have different needs than I do. If you need solid uptime (recent issue notwithstanding) and more importantly if you need to support unpredictable scaling more than you need to worry about nosy Amazon employees, then AWS or its competitors can be great services.

    14. Re:playbook?? This is my data not a football! by naubol · · Score: 1

      It might be your data, but it's Amazon's football game. Their field, their ball, their refs, their rules, their playbook.

      It sounds so sinister until you consider that it could analogously apply to self-storage companies or handing over your luggage to an airline. Consider that the former have caught on fire and the latter have misplaced luggage. It's a platform that you don't have to use that makes a lot of things easier. Like all services, it's not perfect.

      --
      Reality is a slackware box running on a 386 tucked away in god's sock drawer.
    15. Re:playbook?? This is my data not a football! by naubol · · Score: 1

      I'm sure Amazon has none of that.

      --
      Reality is a slackware box running on a 386 tucked away in god's sock drawer.
    16. Re:playbook?? This is my data not a football! by Anonymous Coward · · Score: 0

      Highly reasonable post. Regarding the "nosy Amazon employees" bit, it is extremely difficult, practically impossible, to get access to customer data. Systems are coded such that without the proper keys, information is opaque in logs, databases, and storage. Inserting a back door would generally require complicity at multiple levels and across multiple teams to generate enough to be useful. It's often difficult to even know which data belongs to which customer.

    17. Re:playbook?? This is my data not a football! by __aaclcg7560 · · Score: 1

      And motherboard, CPU, PSU, etc failures?

      Never happened. Probably because I replace everything when the hard drives start to have problems after running 24/7 for five years. I had to replace the nine-year-old motherboard in my gaming PC so it would have better specs than the file server.

      [...] worry about nosy Amazon employees [...]

      Uh, no. Script kiddies from China and Russia banging down my virtual doors. I got tired of playing whack the mole with trying to keep everything up to date for Joomla and WordPress. When I replaced a dynamic website with a static website, hacking attempts dropped from 20,000+ per day to zero per day.

  6. Re: cloud by Anonymous Coward · · Score: 3, Insightful

    Well that all sounds easy enough

  7. rm -rf /* by Anonymous Coward · · Score: 0

    Rookie mistake. I think we've all been there at least once. Hopefully not more than once...

    1. Re:rm -rf /* by molarmass192 · · Score: 2

      It's the "-f" that's scary. Heck, I "rm -r" at least once a month, but when that "-f" is needed, then it's double and triple check time, followed by a feeling of dread as my finger depresses the return key.

      --

      Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
    2. Re:rm -rf /* by Anonymous Coward · · Score: 0

      Yeh, there needs to be a rm -rf -list that simply displays the file names it would delete.

      (In the windows world, I often use robocopy. robocopy src dest /mir /l has saved my bacon a few times - the /l means "display the console output as you would have done things, but don't actually do things" - you look at what it would have done, and you suddenly go "it shouldn't do that, just as well I ran it /l")
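
The same "show me first" idea is easy to bolt onto a recursive delete. This is a rough sketch, not an existing tool: `remove_tree` and its `dry_run` flag are hypothetical names, built on the standard library's `os.walk` and `shutil.rmtree`.

```python
import os
import shutil

def remove_tree(root, dry_run=True):
    """Return the paths under `root` that would be deleted, deepest first.

    With dry_run=True (the default) nothing is touched; pass dry_run=False
    to actually remove the tree after reviewing the list.
    """
    targets = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        targets.extend(os.path.join(dirpath, name) for name in filenames)
        # os.walk does not descend into symlinked directories, so list the
        # links themselves here; real directories are added as `dirpath`.
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                targets.append(path)
        targets.append(dirpath)
    if not dry_run:
        shutil.rmtree(root)
    return targets
```

Defaulting to the dry run means the dangerous behaviour has to be asked for explicitly, which is the reverse of how rm -rf works.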

    3. Re:rm -rf /* by fisted · · Score: 1

      Yeh, there needs to be a rm -rf -list that simply displays the file names it would delete.

      They gave it the name find(1).

    4. Re:rm -rf /* by Megol · · Score: 1

      Yeah I don't think you got the idea...

    5. Re:rm -rf /* by Anonymous Coward · · Score: 0

      alias ls=rm -rf --list

    6. Re:rm -rf /* by FrankHaynes · · Score: 1

      That's exactly what a tired/disgruntled operator at AOL did many years ago, I believe at their data center in Japan. Wherever it was, it affected a very important system that took them down in a pretty broad geographic area for something like 2 or 3 days. It was a big deal.

      --
      slashdot: A failed experiment.
    7. Re: rm -rf /* by Anonymous Coward · · Score: 0

      That's me with certain SQL commands. Feeling of dread, Richard Wagner type over the top music starts to swell as my finger presses the enter key.

  8. Lucky 10000 by Anonymous Coward · · Score: 0

    AC because posting from work.

    Lucky 10000: A "playbook" or "runbook" is an operational document which describes how humans should carry out regular work or respond to pages.

    1. Re: Lucky 10000 by Anonymous Coward · · Score: 0

      Thanks for that enlightening bit of trivia.. And why do we give two shits why you're posting AC?

  9. His name is Steve Bartman by Anonymous Coward · · Score: 0

    fyi

  10. Transcript by 93+Escort+Wagon · · Score: 5, Funny

    Enter command: DELETE ALL SERVERS
    Confirm that you wish to delete all servers: YES
    Are you sure? YES
    You really wish to delete all servers? YES
    I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
    Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO

    --
    #DeleteChrome
    1. Re:Transcript by Altrag · · Score: 1

      I would guess it was something more along the lines of "TAKEDOWN server_subset *" when they meant "TAKEDOWN server_subset/*". The same kind of thing that you can accidentally do with say rm. Obviously I don't work in the bowels of AWS and don't know the real command syntax, but chances are it wasn't as obvious as pushing a big red button ie: not a "predefined" scenario but a simple typo that happened to pass syntax parsing and got run anyway even though it wasn't what the user wanted.

      We've all done something like that. We've all kicked ourselves for it. Difference is, most of us weren't controlling major cloud hosting services at the time.

    2. Re:Transcript by Anonymous Coward · · Score: 0

      That is one thing I thought was funny with Star Trek. Blowing up the ship is such a big and irreversible decision that it originally required multiple steps by multiple officers (usually the top two or three) but by the time they get to Captain Janeway? Well, there is a reason why sfdebris refers to setting the self-destruct sequence as a "Janeway Pi". Just compare here: https://www.youtube.com/watch?v=RPQB7Wkqo3M

    3. Re:Transcript by sky_khan72 · · Score: 3, Funny

      Once I worked at a software firm. They said they got rid of one feature after getting support calls about it: an irreversible destructive operation for which you had to enter something like "YES I UNDERSTAND. DELETE ANYWAY" to proceed.

    4. Re:Transcript by Tablizer · · Score: 1

      Blowing up the [early Trek] ship is such a big and irreversible decision that it originally required multiple steps by multiple officers...but by the time they get to Captain Janeway [it was one step]

      Janeway's ship was "lost" out in nowhere-land. There was no Federation to inspect or enforce such rules. They probably hot-wired the ship to give her more control to be more nimble since there was no help around. It was the space equivalent of the Wild West.

    5. Re:Transcript by ghoul · · Score: 1

      There you go trying to analyze a TV show to show it's logical. It's fiction, folks, and not even Science Fiction. It's more like fantasy set in the future. It's not supposed to be logical.

      --
      **Life is too short to be serious**
    6. Re:Transcript by painandgreed · · Score: 2

      Enter command: DELETE ALL SERVERS

      Funny enough, I started this one job and was given an account on the VAX system that handled all main applications plus internal email. When I went into the email list to read and write my personal email, it looked something like this:

      Read Email
      Send Email
      Read Sent Email
      Read Deleted Email
      Delete All Email

      Well, after a week or two, I had a bunch of email and no need to keep it, so I hit Delete All Email. What nobody told me was since I was an admin, that command meant ALL EMAIL, for EVERYBODY on the system, gone forever. Luckily, my new boss covered for me. A year or two later another newbie did the same thing, but thankfully, he only deleted a year or two of email.

    7. Re:Transcript by Tablizer · · Score: 1

      I'm suggesting issues the writers may have considered. If you don't like reading speculation on such, don't read it.

    8. Re:Transcript by ebvwfbw · · Score: 1

      The override is just the return key

  11. playbook work needed by Anonymous Coward · · Score: 0

    If this is a playbook, someone needs to script/automate a menu system with safety checks.

    perl, python, shell or whatever system tool your admin team uses.

    hopefully this was a jr member that made the goof.

    go go gadget Openview....

  12. GUIs and AIs and Ohs by Anonymous Coward · · Score: 0

    This is why we have the technology called graphical user interfaces today. Maybe all that fashionable AI technology could be harnessed for the benefit of the tired, lone CLI rangers in order to minimize errors while preserving productivity. Oh, really.

    1. Re:GUIs and AIs and Ohs by fzammett · · Score: 2

      Oh, be careful, friend! I made the grave mistake of suggesting on Reddit that we've kinda/sorta/maybe become too enamored of CLIs and that just MAYBE a GUI MIGHT have prevented this, and I got hammered mercilessly.

      You don't want to say anything that doesn't equate to worship at the feet of the almighty, great and awesome CLI around the wrong people.

      --
      If a pion (n-) collides with a proton in the woods & noone is there to hear it, does lamdba decay into the source pa
    2. Re:GUIs and AIs and Ohs by Anonymous Coward · · Score: 0

      Pay heed to the warnings of the still smoking man, for the burning flames have touched him.--- Proverbs of the Internet, The Second Genesis Of Social Media

    3. Re:GUIs and AIs and Ohs by naubol · · Score: 1

      Obviously, some mistakes are less likely or impossible in a GUI just as some kinds of work are more efficient with a GUI, but the opposite has always, and will always, be true. Some mistakes are much more likely, and CLIs can do many things more efficiently. GUI also tends to be more expensive to write well to achieve similar functionality.

      --
      Reality is a slackware box running on a 386 tucked away in god's sock drawer.
    4. Re:GUIs and AIs and Ohs by Anonymous Coward · · Score: 0

      I made the grave mistake of suggesting on Reddit that we've kinda/sorta/maybe become too enamored of CLIs and that just MAYBE a GUI MIGHT have prevented this, and I got hammered mercilessly.

      A GUI with the exact same behaviour as the CLI would have resulted in exactly the same outcome.

      The problem isn't CLI vs GUI, but ensuring that tools with destructive behaviour have sanity checks, as AWS has indicated they have added:

      "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks."
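
A rough sketch of the two safeguards described in that quote: refuse any removal that would take a subsystem below a minimum capacity, and split whatever is allowed into small batches. This is not Amazon's tool. The names `Subsystem`, `plan_removal`, `min_required` and `max_batch` are all made up for illustration, and the numbers are arbitrary.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would take a subsystem below its floor."""


@dataclass
class Subsystem:
    name: str
    active: int        # servers currently in service
    min_required: int  # hypothetical minimum capacity floor


def plan_removal(subsystems, requested, max_batch=2):
    """Check a removal request, then split it into small batches.

    `requested` maps subsystem name -> number of servers to remove.
    Nothing is removed here; the caller gets a list of (name, count) steps.
    """
    by_name = {s.name: s for s in subsystems}
    # Validate everything first so one bad input rejects the whole request.
    for name, count in requested.items():
        sub = by_name[name]
        remaining = sub.active - count
        if count < 0 or remaining < sub.min_required:
            raise CapacityError(
                f"{name}: removing {count} leaves {remaining}, "
                f"minimum is {sub.min_required}"
            )
    # Remove slowly: at most max_batch servers per step.
    batches = []
    for name, count in requested.items():
        while count > 0:
            step = min(max_batch, count)
            batches.append((name, step))
            count -= step
    return batches
```

With a floor of 4 on a 10-server subsystem, asking for 5 yields three small steps, while asking for 7 is rejected before anything is touched. The same check applies whether the request comes from a CLI or a GUI.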

  13. Re: H-1B Visas by Anonymous Coward · · Score: 1

    Except anyone can make a mistake. Therefore inevitably everyone does make mistakes. Usually you're lucky and it wasn't important, but very rarely something such as this happens to someone. All you have demonstrated is your own prejudice. I could equally say it was an overpaid lazy American worker which is why companies are climbing over themselves to get cheaper skilled H-1B workers, but that also would be prejudiced. Unless you have the full details, which neither of us does, then it's just meaningless noise.

  14. Re: H-1B Visas by Anonymous Coward · · Score: 0

    I bet you are a hoot at parties...

  15. S3 outage not the big problem by TheSync · · Score: 4, Interesting

    The big problem is not the US-EAST-1 S3 outage.

    The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.

    Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

    So Echo/Alexa was down because it depends on Lambda, new subscriptions to AMI software, Simple Email Service, etc.

    1. Re:S3 outage not the big problem by phantomfive · · Score: 3, Interesting

      Yeah, actually I was surprised how much Amazon stuff actually went down. I had a coworker who had previously worked at Amazon, and he assured me profusely that AWS was not stable. Clearly he was right. (I don't know why this is big news all of a sudden, AWS has a big outage every year on average).

      --
      "First they came for the slanderers and i said nothing."
    2. Re: S3 outage not the big problem by bill_mcgonigle · · Score: 1

      My boy couldn't turn his desk light on because an intern in Seattle made a typo? The 'S' in IoT also stands for 'sanity'.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    3. Re:S3 outage not the big problem by guruevi · · Score: 1

      HA is hard. "Cloud" makes it even harder because in most instances you don't have much control over the lower levels anymore.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    4. Re: S3 outage not the big problem by sir-gold · · Score: 1

      Is there an H in there somewhere too? And let's just drop the lower-case 'o' entirely.

      (I hope someone understands this really convoluted joke)

    5. Re:S3 outage not the big problem by swb · · Score: 2

      Wasn't it Netflix that released their internal scripts for testing reliability? Randomly blowing away cloud instances and other core components so they could more realistically test the ability of the HA to deliver?

      I generally agree that HA is hard, and it's made harder still by PHBs who ask for HA and then cherry pick the cheapest element (out of several necessary), blab to management that you are fault tolerant and then never allow actually testing it.

      I also blame vendors for waaayyyy overpromising what their HA products can actually do, and sales people for piling on and selling actually unnecessary shit when customers actually indicate that, yes, they would like to buy everything the overcomplex HA system requires.

      Now nobody believes the vendor requirement list (..when they can find it) because it's been so bloated.

  16. Re:H-1B Visas by Anonymous Coward · · Score: 0

    Racist comments like that will earn you a bigot brand from half the country. Better be careful. In your best interests, The Democratic party of the United States

  17. Simple errors with big affect by ArhcAngel · · Score: 1

    An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.

    --
    "A person is smart. People are dumb, panicky dangerous animals and you know it." - K
    1. Re:Simple errors with big affect by David_Hart · · Score: 1

      An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.

      Why would it have taken a week to recover? You can un-delete objects in AD. Below is an example... Should have taken, at most, a few hours to recover, not a week.

      https://technet.microsoft.com/...

    2. Re:Simple errors with big affect by ArhcAngel · · Score: 1

      That is a VERY good question. I learned long ago that those questions seldom get answered here. They recreated each ID manually. Which generated a new SID meaning when they logged back in they couldn't access their old profile data. Yeah, like I said...fiasco.

      --
      "A person is smart. People are dumb, panicky dangerous animals and you know it." - K
  18. When I was in ops... by QuietLagoon · · Score: 4, Insightful

    ... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.

    1. Re:When I was in ops... by rijrunner · · Score: 2

      Procedures are just the archaeology of mistakes..

    2. Re:When I was in ops... by lgw · · Score: 1

      Doesn't really matter though: command, or script, or checked-in config file. There just shouldn't be a way to destroy the world with any one action, regardless of how that action is expressed.

      --
      Socialism: a lie told by totalitarians and believed by fools.
    3. Re:When I was in ops... by QuietLagoon · · Score: 1

      Procedures are just the archaeology of mistakes..

      Fortunately, that procedure was put into place because of a small mistake, and the procedure prevented much larger mistakes down the road.

      .
      Experience is something you don't get until just after you need it.

      Experience is the worst teacher; it gives the test before the lesson. ---Vernon Law

  19. Been there, done that by Dorianny · · Score: 5, Insightful

    Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.

    This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.

    1. Re:Been there, done that by idji · · Score: 3, Insightful

      then people will type "--force" automatically.

    2. Re:Been there, done that by Anonymous Coward · · Score: 0

      And at that point, then you know for sure the fault is the user who ran the command, and not the fault of the person who created the unsafe tool defaults.

    3. Re:Been there, done that by gravewax · · Score: 1

      that would change nothing. admins would just get accustomed to always adding --force. you don't run a command with the intention of doing damage, you do it because you needed to do it, a mistype/mistake will happen with or without --force.

    4. Re:Been there, done that by Anonymous Coward · · Score: 0

      Or the fault of the organization that has created a system that is too large for the heuristic and requires using the --force practically every time.

    5. Re:Been there, done that by Anonymous Coward · · Score: 0

      --force="three random words"

      Where the words appear only after you have reviewed the dry-run.

      Ala "You have deleted how many servers?:"
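
A minimal sketch of that idea, assuming the confirmation phrase is derived from the exact target list, so a memorized "--force" is useless and the phrase only becomes known after the dry run has been read. `confirmation_phrase` and `remove_servers` are hypothetical names, and the real removal step is left as a comment.

```python
import hashlib


def confirmation_phrase(targets):
    """Derive a short phrase from the exact set of targets."""
    joined = "\n".join(sorted(targets)).encode()
    digest = hashlib.sha256(joined).hexdigest()
    # Include the count so the operator sees how many servers are affected.
    return f"remove-{len(targets)}-{digest[:6]}"


def remove_servers(targets, confirm=None):
    """Dry run by default; execute only with the phrase for these targets."""
    phrase = confirmation_phrase(targets)
    if confirm != phrase:
        return {
            "executed": False,
            "would_remove": sorted(targets),
            "confirm_with": phrase,
        }
    # The real removal would happen here (omitted in this sketch).
    return {"executed": True, "removed": sorted(targets)}
```

Running it once prints what would be removed plus the phrase, and running it again with that phrase executes. If the target list changes between the two runs (a different scope, say), the old phrase no longer matches and the tool falls back to a dry run.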

  20. Re: cloud by silas_moeckel · · Score: 2

    Really going to claim redundant sites for static data is hard? Eventually consistent databases are a thing, and have been for a long time. Outside of some very specific niches, how much stuff really needs ACID transactions?

    And yes I've built these many times well before the cloud was a "thing". Using a single cloud provider for anything is a risk, for the same reasons we have used multiple data centers in different parts of the country/world since before the internet allowed commercial traffic and probably before that (no direct experience but the greybeards of my youth told stories).

    --
    No sir I dont like it.
  21. the magic of the command line by known_coward_69 · · Score: 2

    i've seen someone once change the usable memory of SQL server down to 1/10 the physical RAM by accident cause he thought he was so awesome and only used sql for changing configuration options

    why i like the almost dumb proof GUI where you can double and triple check visually before you do something that can take a dozen applications offline

    1. Re:the magic of the command line by Anonymous Coward · · Score: 0

      Shhh... the grown ups are talking now.

    2. Re:the magic of the command line by Megol · · Score: 1

      Grownups and then you, right?

    3. Re:the magic of the command line by spongman · · Score: 1

      some "grown up" deleted half of S3.

    4. Re: the magic of the command line by Anonymous Coward · · Score: 0

      The world isn't perfect. In fact, perfection is impossible, so you should stop worrying about it. Strive, but know that you will sometimes fail and thereby learn from it. This has been a public service announcement by the department of trite aphorisms.

  22. Re:Let me guess... by DickBreath · · Score: 1

    A PHB. Not even rogue.

    PHB: Hey, could I interrupt you for a second while you're typing that command? I've got something more important. We're thinking of changing the locks on the data center doors to another brand that offers locks in a variety of different colors. And the vendor has assured me that these locks can use the same code as my luggage.

    --

    I'll see your senator, and I'll raise you two judges.
  23. This is not some mundane detail, Michael! by paratek · · Score: 2

    "I must have put a decimal point in the wrong place or something. I always do that! I always mess up some mundane detail!"

    --
    Nobody expects The Spanish Inquisition!
  24. why remove in the middle of the day? by known_coward_69 · · Score: 1

    at the very least they should have made the VM's unavailable instead of removing them so that if something happened all they had to do was bring them back up again

    1. Re: why remove in the middle of the day? by Anonymous Coward · · Score: 0

      Some systems have a point at which they are "down enough" that more serious intervention is required to bring them back up.

      For example, systems using consensus algorithms can reach a point where so many nodes are down that the system can't automatically determine which of the nodes have the "correct" view of the current state. Such algorithms are generally specified as being resilient to some amount of outage, such as 50% of nodes, after which manual intervention is required.

      This reads to me as such a situation: their mitigation plan was to safeguard against reducing capacity so quickly, suggesting that the system is resilient only to partial failure.
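
To make the majority rule concrete, here is a small sketch assuming a plain strict-majority quorum, which is how many consensus systems define it. Whether S3's index or placement subsystems work this way is not stated anywhere in Amazon's summary; `has_quorum` and `max_safe_removals` are illustrative names only.

```python
def has_quorum(total_nodes, live_nodes):
    """True if a strict majority of the configured nodes is still up."""
    return live_nodes > total_nodes // 2


def max_safe_removals(total_nodes, live_nodes):
    """How many more nodes can go down before the majority is lost."""
    needed = total_nodes // 2 + 1
    return max(0, live_nodes - needed)
```

For a 5-node cluster, 3 live nodes still form a quorum and 2 do not. A removal tool could call something like `max_safe_removals` before acting and refuse any request larger than the answer.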

    2. Re:why remove in the middle of the day? by Anonymous Coward · · Score: 0

      It was the billing system that was fucking up. I guarantee you engineering asked for a maintenance window but got overruled by the bean counters who were afraid money was being lost. Little did they know!

  25. The employee name was revealed. by 140Mandak262Jamuna · · Score: 5, Funny
    Looks like Amazon has very strict sign off requirement. After entering the command, the system asked for his name to be logged for the audit trail.

    His name was Robert `); DROP TABLE S3-subsystem; --

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    1. Re:The employee name was revealed. by roman_mir · · Score: 1

      That's all good and stuff but how did little Bobby Tables end up working at an AWS data centre in the first place?

    2. Re:The employee name was revealed. by Anonymous Coward · · Score: 0

      And also importantly how was young Bobby able to graduate after removing his year's student records?

    3. Re:The employee name was revealed. by The+Other+White+Meat · · Score: 2

      Every time they tried to add his resume to the blacklist, it just disappeared.

      --

      --- Generation X: The first generation to have SIG lines inferior to their parents... ---
    4. Re:The employee name was revealed. by 140Mandak262Jamuna · · Score: 1

      Every time they tried to add his resume to the blacklist, it just disappeared.

      The blacklist disappeared. Not little Bobby's resume alone.

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  26. Cloud Services Are Inherently Unreliable by StormReaver · · Score: 2

    Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.

    But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.

    Anyone relinquishing control over their infrastructure to unaccountable third parties needs to be fired ASAP, and be replaced with someone who isn't a complete and utter moron. The mine is littered with dead canaries, and too many responses are along the lines of, "that won't happen to our canaries. Let's forge ahead."

    1. Re:Cloud Services Are Inherently Unreliable by guruevi · · Score: 1

      You found out too late that Microsoft doesn't have backups of its service. I actually had the same issue, employer decided that $25/mailbox/month was 'normal' and then mailboxes corrupted on the O365 servers (because, it's still Exchange after all, the worst e-mail system in the world and it has inherited the same Exchange problems: corrupt data stores). Now we're scrambling to find a 'backup' solution. TCO calculations that were already dodgy suddenly went up 50%.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    2. Re:Cloud Services Are Inherently Unreliable by s1d3track3D · · Score: 1

      Greetings Sir,

      I just wanted to point out how wrong you are.
      We have finished migrating to the cloud and now we have redundancy, speed and guaranteed uptime
      This also allowed us to get rid of our surly ops guys, so it's a win win really.

      We are now in the process of outsourcing our programming department to India, another win, win for us in management.
      We are excited about the future, what could possibly go wrong?

      Regards,

      Bob Stiff - PHB

    3. Re:Cloud Services Are Inherently Unreliable by Anonymous Coward · · Score: 0

      management-level boneheads seem to either not understand or not care about until it's too late.

      It's a matter of risk vs reward. Can I take a kickback or stock options from $commercial_vendor for switching my company to their product or service and make it to retirement without a major disaster?

  27. Re: H-1B Visas by Dracos · · Score: 1

    Ok, anyone can make a mistake, but if H1Bs built the server management system to rely on manually typed commands and no one saw the obvious risk of doing that, where does the blame really lie?

    The stereotypical H1B is culturally preconditioned to serve, not analyze; they'll (attempt to) do exactly what they're told... no more, no less, with little questioning.

  28. cloud serives, rm -rf / has global significance by Anonymous Coward · · Score: 1

    So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.

    1. Re:cloud serives, rm -rf / has global significance by Tablizer · · Score: 1

      So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.

      I'm hoping T's fingers are too short to reach the Red Button. I'm hoping O had a collar soldered to it that only fit his long fingers.

  29. Re: cloud by guruevi · · Score: 4, Insightful

    I would say most stuff requires ACID or at least continuously consistent databases (you don't always need transactions or atomicity) and eventually consistent is a niche. Most 'eventually consistent' systems I've seen have an entire layer on top to make sure the data is consistent.

    Anytime you do a financial transaction of any sort (free or not), you need a consistent system or risk someone being able to manipulate the data. Obviously, some developers don't really care at first since eventually consistent updates are fast enough initially. But once they realize the mistake they made, an entire layer of patchwork gets written to make it behave like a rational database again.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  30. Well, at least that won't happen again by s1d3track3D · · Score: 3, Insightful
    I'm glad they located the issue and put safeguards in place to make sure it doesn't happen again.

    Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended

    oh, nevermind.

  31. Admin by color by fluffernutter · · Score: 1

    Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button. Then you could put a red dot or a green dot in the run book.

    --
    Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
    1. Re:Admin by color by mjr167 · · Score: 3, Insightful

      What do color blind admins do?

    2. Re:Admin by color by Anonymous Coward · · Score: 1
      What do color blind admins do?

      Press 'em both, just to be sure they got the right one!

      --
      You have the right to remain dead.

    3. Re:Admin by color by dissy · · Score: 2

      Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button.

      What do color blind admins do?

      Take down all the S3 services in the US-EAST-1 Region :P

    4. Re:Admin by color by Anonymous Coward · · Score: 0

      Read colour in braille?

    5. Re:Admin by color by WallyL · · Score: 1

      Customize the colour scheme in the console?

  32. AWS by Kinnison · · Score: 1

    Of course, you have all your business critical systems spread amongst several AWS data centers, right? And anything business critical is replicated several times? Not.

    1. Re:AWS by Anonymous Coward · · Score: 0

      including the most important bit. replicated on your *own* hardware, on your *own* internet pipe, in your *own* cage or datacenter or office......

      clouds evaporate. and those that don't just rain on your parade.

  33. Amazon should use the cloud by ghoul · · Score: 2

    AWS should use the cloud, that way when one server goes down the load is picked up seamlessly by another one with no downtime ...... Oh Wait. Never Mind

    --
    **Life is too short to be serious**
  34. Scripting won't work either, really by mveloso · · Score: 1

    The double-check script way works until it doesn't.

    The command wasn't mis-typed, the scope was wrong.

    1. Re:Scripting won't work either, really by QuietLagoon · · Score: 1

      The command wasn't mis-typed, the scope was wrong.

      And how did the wrong scope appear? By magic? Or did someone enter it?

      .
      Having the wrong scope is precisely one of the types of errors a script and a second set of eyes will help to prevent.

  35. Business Opportunity by ghoul · · Score: 2

    Now Amazon can sell AWS prime. If you are a subscriber of AWS prime we will check "Twice" before removing your servers. That should boost the profit somewhat

    --
    **Life is too short to be serious**
  36. My experience suggests the opposite by raymorris · · Score: 5, Insightful

    > For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.

    My experience is the exact opposite. When I write software to automate something, that automated procedure is planned and reviewed, then undergoes unit testing, integration testing, and acceptance testing. When I do something by hand - well you better hope the phone doesn't ring while I'm in the middle of it because if I lose concentration for a moment mistakes are quite possible. My boss agrees; the other day I mentioned I was doing something manually and he cocked his head and asked "manually? Isn't that subject to typos and other errors?"

    1. Re:My experience suggests the opposite by Anonymous Coward · · Score: 0

      If only more bosses/colleagues/bystanders were like yours. This thing was caused by a typo. The "cloudbleed" thing was caused by a single-character typo. Don't interrupt us when we're writing code!

    2. Re:My experience suggests the opposite by sexconker · · Score: 1

      I'm all for having things scripted or menuized, or otherwise made foolproof over manually keying things in that aren't security-sensitive (such as passwurdz).
      But triggering those scripts and running the "do this complex thing" job should have a human at the trigger, watching things as they burst into flames (or don't).

      Another option for complex procedures is to have 2 people to serve as a check against each other. These kinds of checks are commonplace in the military and various regulated industries (mining, manufacturing, precision butt scratching). But they cost more, and companies hate that.

  37. Re: H-1B Visas by Anonymous Coward · · Score: 0

    Nothing but more meaningless noise. It's no wonder lazy overpaid Americans are getting replaced when they make bad assumptions and spout prejudice.

  38. Turn off, then delete when nothing screams by Anonymous Coward · · Score: 0

    Maybe it's just me, but I typically drop a VM's NIC for a week, then shut it down for at least 2 weeks, before I remove it from inventory, and only then once I've shoved it off to cold storage so it can be brought back.

  39. In Some Cases? by tsqr · · Score: 1

    Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week.

    What am I missing here? Is there a way to "knock down" a website without affecting it?

    1. Re:In Some Cases? by Carcass666 · · Score: 1

      As an example, if your website was hosted on servers in a different region than the outage, but tried to send email using US-EAST-1, your website would have still been up, but would have been affected because it couldn't send email.

  40. Happens all the time by dave562 · · Score: 1

    Something extremely similar happened to my company last month. An EMC tech was onsite to work on the Isilon system. He was supposed to issue a command to put one of the nodes into maintenance mode. Instead he put the entire cluster into maintenance mode.

    Needless to say, he is not welcome back. Ever.

    Not to put myself above anyone else, I made a similar mistake a couple of years ago. I wrote a script that checked a csv input against a list of computers in Active Directory. It was supposed to delete all of the servers not on the list. Instead it deleted all of the servers on the list. Nonetheless it was pretty easy to fix with another script that rejoined all of them to the domain. It was still a pretty major fuck up, and one that I should have caught if I had tested my code properly.
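
The inverted-list bug is easy to show in a few lines. This is only a sketch of the selection step, assuming the keep-list is a one-column CSV of hostnames; `servers_to_delete` is a hypothetical helper and nothing here talks to Active Directory.

```python
import csv
import io


def servers_to_delete(directory_names, keep_csv_text):
    """Return directory entries that are NOT on the keep-list."""
    keep = {
        row[0].strip().lower()
        for row in csv.reader(io.StringIO(keep_csv_text))
        if row and row[0].strip()
    }
    # The membership test is the whole bug: "in keep" deletes the
    # servers you meant to protect, "not in keep" deletes the rest.
    doomed = sorted(n for n in directory_names if n.lower() not in keep)
    # Cheap sanity check before anything destructive runs.
    assert not any(n.lower() in keep for n in doomed)
    return doomed
```

Given `["web1", "web2", "db1"]` and a keep-list containing web1 and DB1, only web2 comes back. A test like that, run before pointing the script at production, is exactly the kind of check that would have caught the flip.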

  41. The more power the more rope to hang yourself with by FeelGood314 · · Score: 1

    The admin has a very powerful tool. It has almost no constraints on what it can do because 99% of the time we want that power. We are dealing with an uncommon, unexpected situation and need to be able to have the power to do something different. The exact correct command might be something that no one anticipated before. It would be very time consuming to come up with rules preventing such a command.

    Also I don't think more warning messages or safety logic is always the answer. Maybe practicing more without the autopilot is the answer. Look at Air France 447.

  42. Re: Let me guess... by Anonymous Coward · · Score: 0

    and the password is
    1...
    2...
    3...
    4...
    5.

  43. How was this so disruptive? by emorning · · Score: 1

    OK, so somebody took down thousands of servers, shit happens. Once the mistake was recognized why does it take so long to start up those servers again?

    1. Re:How was this so disruptive? by Anonymous Coward · · Score: 0

      OK, so somebody took down thousands of servers, shit happens.
      Once the mistake was recognized why does it take so long to start up those servers again?

      "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected." ...

      "As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."

  44. Time to implement the Two-Man Rule... by RealGene · · Score: 1

    ..since errors of this nature could be worse than a missile launch. https://en.wikipedia.org/wiki/...

    --
    Mission: To provide products that consume time and energy as entertainingly as permitted by the laws of thermodynamics.
  45. I Want to Work Where You Work by Carcass666 · · Score: 1

    To all those guys who are bragging about how they would never put anything in the cloud (AWS or otherwise) because their data centers are so reliable, so redundant, fault-tolerant and insulated from human error that they can be held to the highest possible standards of up-time and accountability, are you hiring?

    Or, would you be interested in a bridge I have for sale?

  46. Re: Employee probably was H1B by Anonymous Coward · · Score: 0

    No, you can't use H1Bs in data centers because they shit behind the racks, and sometimes pull out cables by accident when getting up

  47. Re: Let me guess... by Anonymous Coward · · Score: 0

    These comments have gone from suck to blow.

  48. Oblig by Anonymous Coward · · Score: 0

    The distributed failover mechanisms of Amazon Web Services have never failed because of an unrecoverable error in either hardware or software. In all cases where such a mistake was suspected, the problem has always turned out to be *human* error.

  49. Fingernails by Anonymous Coward · · Score: 0

    Some female with long fingernails flubbed it.

  50. Re: cloud by RabidReindeer · · Score: 2

    Well that all sounds easy enough

    Well, computers are easy. A child can program one. That's why you should always hire the cheapest IT workers you can get.

  51. Leet powr by Mats+Svensson · · Score: 1

    Nothing like good old shell-commands to get things done fast and efficient.
    Who needs GUIs, with their ugly labels and noobish warning popups?

  52. Re: cloud by Anonymous Coward · · Score: 0

    Quoting the Dunning-Kruger effect is often the smug illustration that one is suffering from the Dunning-Kruger effect.

    Much like quoting Godwin's Law is a favourite countering tactic for Nazi apologists.

  53. Where is the typo? by cabazorro · · Score: 1

    Does anyone know exactly what the typo was?
    I can't find the fumbled command line anywhere on the internet.

    --
    - these are not the droids you are looking for -
  54. Re: Let me guess... by DickBreath · · Score: 1

    Can't it be switched into reverse?

    --

    I'll see your senator, and I'll raise you two judges.
  55. Re:The more power the more rope to hang yourself w by sjames · · Score: 1

    Warnings are nice, but it's hard to anticipate all of the conditions where a warning might be in order. It's also hard to make people pay attention to warnings when each and every action produces one or more warnings.

    Reversibility is key. Let the admin see the consequences of the last given command and undo it if necessary. Warnings are for actions that are intrinsically irreversible. Build it into the commands if possible. If not, build it into the procedures. Don't delete an instance; just take it offline for a while first and see if anyone squawks. Don't even allow a delete of an instance that isn't offline already, preferably not until it has been offline for some time.
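
    A minimal sketch of that "offline first, delete later" rule, in Python. Everything here is hypothetical and made up for illustration (the `Fleet` class, the `DeleteRefused` exception, the 24-hour grace period); it says nothing about how Amazon's tooling actually works. The point is only that taking an instance offline is reversible, and the irreversible delete is refused unless the instance has already been offline for a while.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


class DeleteRefused(Exception):
    """Raised when a delete is attempted before it is considered safe."""


@dataclass
class Instance:
    name: str
    # None means the instance is online; otherwise the time it went offline.
    offline_since: Optional[datetime] = None


class Fleet:
    """Hypothetical inventory that only deletes long-offline instances."""

    def __init__(self, grace_period: timedelta = timedelta(hours=24)):
        # The grace period is an assumed policy value, not a recommendation.
        self.grace_period = grace_period
        self.instances: Dict[str, Instance] = {}

    def add(self, name: str) -> None:
        self.instances[name] = Instance(name)

    def take_offline(self, name: str, now: datetime) -> None:
        """Reversible step: stop using the instance but keep it around."""
        inst = self.instances[name]
        if inst.offline_since is None:
            inst.offline_since = now

    def bring_online(self, name: str) -> None:
        """Undo for take_offline, for when somebody squawks."""
        self.instances[name].offline_since = None

    def delete(self, name: str, now: datetime) -> None:
        """Irreversible step: refused unless offline for the grace period."""
        inst = self.instances[name]
        if inst.offline_since is None:
            raise DeleteRefused(f"{name} is still online; take it offline first")
        if now - inst.offline_since < self.grace_period:
            raise DeleteRefused(f"{name} has not been offline long enough")
        del self.instances[name]
```

    With this shape, a mistyped argument to `take_offline` can be undone with `bring_online`, while a mistyped argument to `delete` only hits instances that nobody has missed for a full grace period. The caller passes `now` explicitly so the rule is easy to test.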

  56. Re: H-1B Visas by zifn4b · · Score: 1

    Nothing but more meaningless noise. It's no wonder lazy overpaid Americans are getting replaced when they make bad assumptions and spout prejudice.

    India, the predominant H-1B visa source for American companies, is an authoritarian culture with arranged marriages and all that sort of stuff. That's not prejudice, it's an observation of fact. Can you dispute that with facts and evidence? Or do you just want the fantasy version of reality in your head to be right even if it's complete fantasy? You know that's the definition of delusion, right?

    --
    We'll make great pets
  57. Re: cloud by Anonymous Coward · · Score: 0

    Except the Dunning-Kruger effect seems to affect most people. Most people seem to have zero intuition about technical problems that they have not experienced before. All technical problems are logical and a quick bit of thinking can give you nearly everything you need to know. There are only a few good ways to solve any given technical problem, and knowing even a small bit about what you're doing can tell you which of those few ways was chosen. If you can quickly come up with the few good ways to solve the problem, then you can immediately figure out the implementation.

    The Dunning-Kruger effect is caused by a lack of metacognition. A lack of metacognition is caused by a lack of abstract reasoning. Abstract reasoning is what allows you to infer and deduce. People with strong abstract reasoning are able to quickly reverse engineer by inference and deduction.

    For example, my co-worker was working on a project in Go, a language I have never used. He ran into some performance issues. He showed me an example of how the code was slowing down. After looking at the example, I was able to quickly recognize that the issue did not fit my mental model of how an async message-passing language like Go should work. After about a minute, I looked at his code. Again, I had never looked at Go code. Just looking at it, I saw no issue with his design, but I did notice he was making use of a nifty feature, zero-sized message queues, which in my parlance work like co-routines. I came to the conclusion that the implementation of Go that he was using better fit my mental model of how the program would work if it was not truly async co-routines, but instead backed by threads with a limited threadpool.

    Lo and behold, I was correct. The GCC Go compiler was backing the "go routines" with a small threadpool of something like 4. I assumed it had an exponential decrease in performance based on his example. I had him double the threadpool and it agreed with my prediction by effectively running twice as fast for a given progress of the processing, but the N^2 scaling was killing it. I also predicted that making the message channel sizes at least 1 would give a large improvement, but that a size of at least 2 would see near the max benefit. That was also true. The official Go compiler did not have this issue, but he was running this on a Raspberry Pi, and ARM was not supported by Google's compiler at the time.

    This was back when I had about 4 days of experience with multi-threading and had only heard about thread pools. When I said "better fit my mental model of how the program would work if it was not truly async co-routines", that mental model, which I made up on the spot, took me tens of seconds. I took what I assumed were facts (assuming Go was co-routine like), created a mental model about what guarantees I had, then I took my assumptions and iterated through them by plugging them into my partial mental model and "seeing" how it would affect the outcome. Then I chose the outcome that most closely fit what I thought I saw.

    That is an example of inference and deduction. I knew what I did know and knew what I didn't know and filled in the blanks by making purposeful assumptions while fully understanding the ramifications of my assumptions, making sure my theoretical mental model would create the same characteristic reduced performance. Not just any reduced performance, but the exact same type.

    Of course this has absolutely nothing to do with someone making a "typo" style mistake. I make mistakes all of the time, just rarely with reasoning.