Amazon Explains Why S3 Went Down

Posted by timothy on Saturday July 26, 2008 @09:33AM from the not-mere-sluttiness dept.

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence. In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."

39 of 114 comments

for want of a nail ... by thrillseeker · 2008-07-26 09:34 · Score: 4, Interesting

a single bit?! I think there are some serious design deficiencies ...
1. Re:for want of a nail ... by Daimanta · 2008-07-26 09:37 · Score: 5, Funny
  
  It was the evil bit...
  
  --
  Knowledge is power. Knowledge shared is power lost.
2. Re:for want of a nail ... by Ctrl-Z · 2008-07-26 09:43 · Score: 4, Informative
  
  Thank you Capt. Obvious. A single bit is enough to cause a cascading failure, and someone overlooked this instance. It's not the first time, nor will it be the last. See New York City blackout of 1977, The Crash of the AT&T Network in 1990, et al.
  
  --
  www.timcoleman.com is a total waste of your time. Never go there.
3. Re:for want of a nail ... by CalSolt · 2008-07-26 11:52 · Score: 2
  
  It's like a self-replicating virus that arose from the result of a random mutation.
  "Ever since the first computers,
  there have always been
  ghosts in the machine.
  Random segments of code that
  have grouped together to
  form unexpected protocols."
4. Re:for want of a nail ... by iamhassi · 2008-07-26 13:38 · Score: 4, Funny
  
  FTA:
  "On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests."
  
  sounds like a restaurant, gossiping servers were failing to process customer requests
  
  --
  my karma will be here long after I'm gone
5. Re:for want of a nail ... by mrmeval · 2008-07-26 13:57 · Score: 2, Funny
  
  1 million code monkeys typing out Aleister Crowley?
  
  --
  I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
6. Re:for want of a nail ... by Ctrl-Z · 2008-07-26 14:13 · Score: 3, Interesting
  
  Actually, that should have been Northeast Blackout of 1965. But you already knew that.
  
  --
  www.timcoleman.com is a total waste of your time. Never go there.
Simple by gardyloo · 2008-07-26 09:46 · Score: 3, Insightful

S3 is a total slut.
Re:They need more Erlang. by nacturation · 2008-07-26 09:51 · Score: 5, Insightful

They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.
I think Erlang's cool and all, but it's not the magical bullet that will solve this. It's still possible to have information corrupted during message passing between processes in Erlang (say, as the result of an intermittently failing network switch) as it is in any language.

--
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
haha by msauve · 2008-07-26 09:56 · Score: 4, Informative

For those who don't know what you're referring to, like the AC who commented: search in this for "evil bit".

--
"National Security is the chief cause of national insecurity." - Celine's First Law
1. Re:haha by ivoras · 2008-07-26 11:22 · Score: 2, Informative
  
  Not widely known, but the RFC was actually implemented, at least once: http://lists.freebsd.org/pipermail/cvs-all/2003-April/001098.html :)
  
  --
  -- Sig down
...make lemonade. by fahrbot-bot · 2008-07-26 10:03 · Score: 3, Funny

A random bit got flipped in one of the server state messages...

Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector. :-)

--
It must have been something you assimilated. . . .
1. Re:...make lemonade. by erc · 2008-07-26 10:16 · Score: 2, Insightful
  
  Or they could checksum their UDP packets. The entire packet, not just the customer payload. Duh.
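  A minimal sketch of that idea, assuming a made-up framing in which each state message carries a trailing CRC-32 computed over every byte of the message. The function names and the 4-byte big-endian trailer are illustrative assumptions, not Amazon's actual wire format; only zlib.crc32 and struct are real library calls.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the whole payload as a 4-byte big-endian trailer."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(packet: bytes) -> bytes:
    """Return the payload, or raise ValueError if the trailer does not match."""
    if len(packet) < 4:
        raise ValueError("packet too short")
    payload = packet[:-4]
    (expected,) = struct.unpack(">I", packet[-4:])
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: drop the message")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error anywhere in the packet (hypothetical helper)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


packet = frame(b"server-17:state=healthy")
try:
    unframe(flip_bit(packet, 42))
except ValueError as err:
    print("rejected:", err)
```

  A receiver that gets the ValueError would drop the message rather than gossip it onward.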
  
  --
  -- Ed Carp, N7EKG erc@pobox.com PGP KeyID: 0x0BD32C9B What I'm up to: http://intuitives.mine.nu
2. Re:...make lemonade. by spinkham · 2008-07-26 10:52 · Score: 2, Interesting
  
  There's probably information that changes as the packets move around, and they probably wanted to avoid the overhead. I'm guessing it was a deliberate design decision, but it turned out to be the wrong one. It's easy to see that after a failure, but it's hard to design large distributed systems and foresee every possible way things can break, and where the computation overhead is worth it. The number of interactions between servers here makes any small design flaw a big thing.
  
  --
  Blessed are the pessimists, for they have made backups.
3. Re:...make lemonade. by TheRaven64 · 2008-07-26 11:48 · Score: 2, Insightful
  
  True, but not particularly informative. The point is to detect errors. Error correction is nice, but error detection is enough if the sender can then retransmit. Oh, and your 'proof' is flawed, since you are completely ignoring the fact that any correction scheme contains redundant information, so while n bits might work instead of n+1 bits, n-1 might not.
  If the last bit is missing, then your receiver knows that there is an error. If the last bit is flipped, then it knows that there is an error. A checksum can be very simple and just give the count of the number of bits that are set in the message. This will protect from single-bit errors, but an error in both the message and the checksum can cause an erroneous packet to pass. It's basically a hash, and the idea is to make sure that hash collisions are as infrequent as possible.
  You can usually guarantee a maximum number of errors in the network and it's possible to design correction schemes which will detect any n-bit error.
  In some cases, it's possible to accurately detect all errors because all failures will be in one direction (0 to 1 or 1 to 0, but not both). In this case, an XOR'd copy of the message will work, because any bit in the message flipping from 0 to 1 needs a corresponding bit in the check flipping from 1 to 0 (there are shorter encoding schemes that work in this case too, but this is a very simple example).
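  Both schemes described above can be sketched in a few lines. The function names are invented for illustration, and the "XOR'd copy" is read here as a bitwise complement (XOR with all ones), which is an assumption about what was meant.

```python
def popcount_checksum(message: bytes) -> int:
    """Count the bits that are set to 1 across the whole message."""
    return sum(bin(byte).count("1") for byte in message)


def complement_copy(message: bytes) -> bytes:
    """Check data for a one-directional channel: every bit inverted."""
    return bytes(byte ^ 0xFF for byte in message)


def consistent(message: bytes, check: bytes) -> bool:
    """True when check is still the exact bitwise complement of message."""
    return len(message) == len(check) and all(
        m ^ c == 0xFF for m, c in zip(message, check)
    )


original = bytes([0b00001111])
one_flip = bytes([0b00001110])   # a single 1 -> 0 error
two_flips = bytes([0b00011110])  # one 1 -> 0 error plus one 0 -> 1 error

print(popcount_checksum(original) != popcount_checksum(one_flip))   # True: detected
print(popcount_checksum(original) == popcount_checksum(two_flips))  # True: missed

check = complement_copy(original)
only_up = bytes([0b00011111])    # a 0 -> 1 error, the only kind this channel allows
print(consistent(original, check), consistent(only_up, check))      # True False
```

  The two compensating flips show why a bare bit count is a weak hash: the count is unchanged, so the damaged message passes.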
  
  --
  I am TheRaven on Soylent News
4. Re:...make lemonade. by Sique · 2008-07-26 12:05 · Score: 4, Interesting
  
  I see you completely miss the point of the proof. I know that you can minimize the impact of a bit error by checksums, and that you can improve reliability by adding redundancy. But what is the consequence of error detection? Normally the protocol then asks for resending the message. But how do you (as the sender) know that the message finally arrived correctly? You wait for an acknowledgement. But what if the acknowledgement gets lost or scrambled? You add redundancy in the handshaking. But how does redundancy help reliability? You can ask for a resend due to detected errors etc.pp.
  Your protocol never finds an end because it has to secure the correctness of the security of the correctness of the security...
  
  --
  .sig: Sique *sigh*
5. Re:...make lemonade. by spinkham · 2008-07-26 14:43 · Score: 4, Insightful
  
  No, I mean favoring speed and computational simplicity over error detection.
  It is often a valid trade-off. For example, most filesystems do not validate the stored data at all for size and computational reasons. As hard drives and arrays get bigger, that trade-off no longer makes much sense, and most all new filesystems being designed have hash based error detection built in at some level.
  Good design takes experience. There aren't that many systems like S3 that have been built in the past, and there are many tricky decisions to be made. No system gets it all correct out of the gate.
  
  --
  Blessed are the pessimists, for they have made backups.
Other companies could learn from this... by Manip · 2008-07-26 10:10 · Score: 5, Insightful

Other large businesses could learn a lot from Amazon's example.
How often do you have the problem really explained to you, an apology, and a reasonable set of changes to stop it occurring again?
Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.
If I was an Amazon customer I would be happy with their explanation and apology even if obviously the downtime is still an issue.
1. Re:Other companies could learn from this... by FalcDot · 2008-07-26 10:33 · Score: 2, Insightful
  
  Looking back, I feel that the one thing all our technological progress has given us more than anything else, is more and better means of communication.
  You can talk to people on the other side of the world, heck, on the other side of the solar system if you don't mind the delay. Video feeds, planes that'll actually get you there in less than a day, ...
  And yet, with all of this, it seems that we're not actually doing it. A company explaining what went wrong is the exception. Internet forums without flamers and trolls? Exception.
  "Anything you say can and will be used against you."
2. Re:Other companies could learn from this... by Anonymous Coward · 2008-07-26 10:34 · Score: 5, Funny
  
  Other companies could learn something from this, unfortunately they won't be able to do anything similar as Amazon has patented the process of explaining technological problems to customers.
3. Re:Other companies could learn from this... by SanityInAnarchy · 2008-07-26 12:30 · Score: 2, Insightful
  
  Well, technically speaking, there isn't an apology there:
  
  Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.
  Allow me to translate:
  
  We screwed up. We'll do better next time.
  Nowhere in the document do the words "I'm sorry" appear. That's entirely implied.
  
  --
  Don't thank God, thank a doctor!
4. Re:Other companies could learn from this... by Alpha830RulZ · 2008-07-26 15:41 · Score: 4, Insightful
  
  The words may not be there, but there is a pretty clear message there for me to see that they are not happy or smug about this event, and are agreeing with the consumer that this shouldn't have happened, and won't happen again if they can help it. That's enough for me.
  And I'm actually one of their consumers, compared to some of the dilettantes here. We use S3 and EC2 to manage training and demo instances of our software, and are pretty pleased so far.
  
  --
  I was taught to respect my elders. The trouble is, it's getting harder and harder to find some.
Re:Lost time? by Anpheus · 2008-07-26 10:14 · Score: 4, Informative

Read "the system's state cleared" as "we turned everything off" and they proceeded to turn every server on one by one until around 3PM when the EU location was complete and not showing any symptoms.
Re:Lost time? by Anonymous Coward · 2008-07-26 10:21 · Score: 3, Informative

sun enterprise and most other enterprise servers take 30-45 minutes to pass the prom tests after you start them up. that's per server, before you even boot the os. my guess is they brought up servers in large batches of 25% at a time and took about an hour per batch.
It was a design defect by j.+andrew+rogers · 2008-07-26 10:44 · Score: 4, Informative

It has been generally well-known for a number of years now that any time you have a large cluster you cannot count on hardware checksums to catch every bit flip that may occur during copies and transmission, particularly with consumer hardware which has many internal paths with no checksums at all. Google learned this the hard way, like the supercomputing people before them, and now like Amazon after Google. And some of the better database engines also do their own internal software checksums to catch uncaught errors introduced as the data gets copied across the silicon, disks, and network -- it is one way they get their very high uptime and low failure rate.
It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs. The performance cost of doing a software CRC on your data every time it is moved around is low enough that it is generally worth it these days. If your system is large enough, the probability of getting bitten by this approaches unity. Very fast implementations of Adler-32 and other high-performance checksum algorithms are widely available online.
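The "approaches unity" claim follows from the formula P(at least one error) = 1 - (1 - p)^n. The sketch below evaluates it for a per-transfer error probability of 1e-12, which is a purely illustrative assumption and not a measured rate; the function name is also invented for this example.

```python
import math


def prob_at_least_one_error(p_per_transfer: float, transfers: int) -> float:
    """Return 1 - (1 - p)^n, using log1p/expm1 so tiny p values stay accurate."""
    return -math.expm1(transfers * math.log1p(-p_per_transfer))


ASSUMED_P = 1e-12  # hypothetical chance that one transfer is silently corrupted

for transfers in (10**9, 10**12, 10**13):
    probability = prob_at_least_one_error(ASSUMED_P, transfers)
    print(f"{transfers:>16,d} transfers -> {probability:.6f}")
```

Whatever the true per-transfer rate is, the shape is the same: once the number of transfers is comparable to 1/p, at least one silent corruption becomes the expected case.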
1. Re:It was a design defect by James+Youngman · 2008-07-26 11:17 · Score: 4, Interesting
  
  Adler-32 wouldn't be a great choice. It's fast but it's weak for short messages and I've seen it fooled by multi-bit errors on large messages too.
  See Koopman's paper 32-bit cyclic redundancy codes for Internet applications for some better ideas.
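  The short-message weakness is easy to see by splitting the checksum into its two 16-bit sums. This is a rough sketch using zlib.adler32 from the Python standard library; the helper name adler_halves is made up for the example.

```python
import zlib
from typing import Tuple


def adler_halves(data: bytes) -> Tuple[int, int]:
    """Split zlib's Adler-32 into its two 16-bit running sums (s1, s2)."""
    value = zlib.adler32(data)
    return value & 0xFFFF, value >> 16


# s1 starts at 1 and adds each byte, so for an n-byte message s1 <= 1 + 255 * n.
# Short messages therefore never reach the upper bits of the 16-bit field.
for length in (1, 4, 16, 64):
    s1, s2 = adler_halves(b"\xff" * length)
    print(length, s1, s1.bit_length())
```

  Even with every byte at its maximum value, a 64-byte message leaves the top bits of s1 unused, so the effective checksum is narrower than 32 bits.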
2. Re:It was a design defect by James+Youngman · 2008-07-26 12:22 · Score: 4, Interesting
  
  I've seen Adler-32 fail twice, so your assumption of "a few times per millennium" doesn't seem to work for me (I'm under 40). In fact in those cases it was even stacked on top of at least one other checksum mechanism, too.
  One of the problems with it is poor spreading of the input bits into s2. There are other algorithms which don't have that weakness but don't (IIRC) cost any more to compute.
Re:They need more Erlang. by edsousa · 2008-07-26 11:17 · Score: 5, Insightful

This message is written by one that writes real parallel, distributed and concurrent code (they are not all the same):
Erlang or any other functional language will not account for lack of good design. If you have a good design with the right concerns you can implement in Java, C, Fortran, ASM and if done right, it will work.
I'm sick of hearing "Erlang is THE solution". It is not. Good design and implementation practices are.
It's quite an old story - see RFC789 by anti-NAT · 2008-07-26 11:18 · Score: 5, Interesting

Vulnerabilities of Network Control Protocols: An Example, published in January 1981.
What do they say about those who ignore history?

--
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
1. Re:It's quite an old story - see RFC789 by Gazzonyx · 2008-07-26 15:17 · Score: 4, Funny
  
  [...]
  What do they say about those who ignore history?
  I think it was, they're doomed to reimplement it... poorly. Or was that Unix? ;)
  
  --
  If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Re:Programmers never learn... by SanityInAnarchy · 2008-07-26 12:33 · Score: 3, Interesting

I think the real lesson here is simply that input over the network cannot ever be trusted.
It is their network, and S3 is built on tech which is explicitly designed to not have any kind of security built-in. The security is applied at the API level, but any misbehaving machine within the S3 cluster could cause some serious damage.
I actually agree with this philosophy, to an extent. After all, this is essentially a large number of computers acting as a hard disk. How would you approach talking to a hard disk in your own machine? Do you assume that everything is corrupt, untrusted, or wrong?

--
Don't thank God, thank a doctor!
Always happens on a Sunday by sebastiengiroux · 2008-07-26 13:05 · Score: 2, Interesting

Anybody else noticed that these major problems always occur when no one is around to fix them?
1. Re:Always happens on a Sunday by innerweb · 2008-07-26 14:41 · Score: 3, Informative
  
  After working in a production environment as a developer, I can assure you that the correct interpretation is "Anybody else noticed that these major problems always occur when no one is around to catch them (before they get out of hand)?"
  InnerWeb
  
  --
  Freud might say that Intelligent Design is religion's ID.
ECC memory, anyone? by Maxmin · 2008-07-26 14:35 · Score: 3, Interesting

I hafta wonder if the bit flipped due to a bad RAM stick?

We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted.
Nothing specific about *what* caused the bit to flip.
This comes to mind only because bad RAM on a new server at work caused installation of a stock Perl module to throw excessive errors during the XS compile phase - the same package installed without error on an identical machine 20 minutes earlier. Took over an hour before we realized it was probably hardware. Memtest86 quickly turned up the problem.
Would hashes and the like protect against RAM suddenly going south? Wouldn't any piece of data that passes through main memory be vulnerable to corruption? Makes me wonder why ECC memory isn't being used much anymore... we have various flavors of RAID to protect slow memory from corruption, but not many machines I see have ECC anymore.
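On the question of whether hashes help against bad RAM: a digest only protects the path between where it is computed and where it is checked. A toy sketch, assuming an in-memory store and a client that computes the MD5 before sending; the class and method names are hypothetical and this is not Amazon's implementation.

```python
import hashlib
from typing import Dict, Tuple


class ChecksummedStore:
    """Toy in-memory object store that keeps an MD5 digest beside each object."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, str]] = {}

    def put(self, key: str, data: bytes, client_md5: str) -> None:
        # Compare against the digest the client computed before sending.
        if hashlib.md5(data).hexdigest() != client_md5:
            raise ValueError("corrupted in transit")
        self._objects[key] = (data, client_md5)

    def get(self, key: str) -> bytes:
        data, stored_md5 = self._objects[key]
        # Recompute on the way out to catch corruption that happened at rest.
        if hashlib.md5(data).hexdigest() != stored_md5:
            raise ValueError("corrupted at rest")
        return data


store = ChecksummedStore()
body = b"customer object"
store.put("photo.jpg", body, hashlib.md5(body).hexdigest())
print(store.get("photo.jpg") == body)
```

If the bit flips in RAM before the digest is first computed, the digest faithfully certifies the corrupted bytes, so a checksum complements ECC rather than replacing it.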

--
O lord, bless this thy holy hand grenade, that with it thou mayest blow thine enemies to tiny bits, in thy mercy.
1. Re:ECC memory, anyone? by this+great+guy · 2008-07-26 15:26 · Score: 3, Informative
  
  Even ECC memory isn't a panacea. ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect nor correct 3-bit (or more) errors. To the poster that seemed to think a 1-bit error causing a downtime is a sign of a defective design: the truth is that 99.9% of the software out there doesn't even try to work around data corruption issues. One can easily introduce 1-bit errors capable of crashing virtually any app. For example by flipping 1 bit of the first byte of the MBR (master boot record) of an OS, it can make it unbootable (it changes the opcode of what is usually a JMP instruction to something else).
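  The 1-bit / 2-bit / 3-bit behaviour can be reproduced with a tiny single-error-correcting, double-error-detecting (SECDED) code. This sketch uses a Hamming(7,4) code plus one overall parity bit on a 4-bit value; real ECC DIMMs typically protect 64-bit words, and the layout and names here are assumptions chosen for brevity.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position      # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        code[syndrome] ^= 1           # assume one flip; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"  # even number of flips: detected, not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def flip(code: List[int], *positions: int) -> List[int]:
    """Hypothetical helper: return a copy with the given bit positions inverted."""
    damaged = list(code)
    for position in positions:
        damaged[position] ^= 1
    return damaged


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flip: corrected back to 11
print(decode(flip(word, 5, 6)))     # two flips: (None, 'uncorrectable')
print(decode(flip(word, 1, 2, 4)))  # three flips: reported "corrected", data is wrong
```

  The last case is the dangerous one: the decoder cannot tell three flips from one, so it "fixes" the wrong bit and hands back bad data without complaint.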
2. Re:ECC memory, anyone? by Maxmin · 2008-07-26 17:13 · Score: 4, Informative
  
  ECC can only correct 1-bit errors. It can't correct 2-bit errors (only detect them) and can't even detect nor correct 3-bit (or more) errors.
  No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations. If part or all of one byte goes bad, the rest of the bits and the parity code are unchanged, and the affected bytes can be reconstructed.
  Also like Red Book CDs are multiply redundant memory systems, with -just what it sounds like- multiple copies of each byte, and the memory controller arbitrates differences. CDs effectively contain three copies of the data, striped and parity encoded. That's how scratched CDs can still operate error-free (sometimes.) The space shuttle's computer systems are relatively fault-tolerant - multiple redundant computers all running the same programs and data, with a fourth computer evaluating the output of the other computers, looking for failures.
  Where there's a will, there's a way, but the will in the mainstream x86 server industry to build truly fault-tolerant computers is slim. It's a specialty, and that makes it very expensive. Stratus, for example, makes a line of fault-tolerant servers, with some of the fail-over in hardware, so they make their 99.999% uptime claim (about 5 minutes downtime per year.)
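  The "about 5 minutes" figure is simple arithmetic, and the same arithmetic shows what a single outage of around 8 hours (the figure from the story summary) does to a year's availability. A quick sketch, assuming a 365.25-day year; the function names are invented for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability if the only downtime is one outage of this length."""
    return 1.0 - (outage_hours * 60) / MINUTES_PER_YEAR


print(round(downtime_minutes_per_year(0.99999), 2))  # 5.26
print(round(availability_after_outage(8) * 100, 3))  # 99.909
```

  One 8-hour outage therefore lands a service at roughly three nines for that year, even if nothing else goes wrong.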
  "Five nines" is a claim I've heard from most top-dollar *nix hosting companies, but have *never* experienced - it's generally been hours of downtime per year. Not even their network infrastructure gets close to 99.999% uptime! Cadillac prices, but downtime contingency planning is all up to the client, even with "managed hosting." They all suck.
  
  --
  O lord, bless this thy holy hand grenade, that with it thou mayest blow thine enemies to tiny bits, in thy mercy.
3. Re:ECC memory, anyone? by iluvcapra · 2008-07-26 18:53 · Score: 2, Informative
  
  No, that's just one kind of memory system. There are a number of designs, and recovery also depends on the kind of error. IIRC, one design is somewhat similar to the CD Red Book spec, in that the bits for a given byte are distributed around - a physical byte is composed of bits all from different memory locations.
  Red Book audio CDs, Sony MiniDiscs and DATs all use a form of Cross-Interleaved Reed-Solomon coding, which has the nice characteristic of being able to use the fact that a piece of information is known to be missing when reconstructing the original signal, whereas other systems can't necessarily be improved by being informed of the difference between an "error" and an "erasure." Side information about "known-bad" media areas is a natural fit for physical media, not necessarily for serial data or other things.
  CDs also have parity bits on every (EFM-encoded) byte on the media, which can contribute to the "erasure" side-information along with tracking data from the laser. Also working in the CD's favor is the fact that they carry relatively low-information PCM data, and if there is a complete loss of a sample or two, the decoding device can just do a 1st order interpolation between the surrounding known good samples. This is why CDs can sound excellent until one day it might just not play at all, without a significant period of declining quality, because errors were accumulating until you reached a critical point where your player couldn't spackle over the errors anymore, and it just gives up.
  only ot fyi :)
  
  --
  Don't blame me, I voted for Baltar.
4. Re:ECC memory, anyone? by afidel · 2008-07-27 18:48 · Score: 2, Informative
  
  All decent servers use multibit ECC, and the better ones are using IBM's chipkill technology, which is basically RAID for RAM: it uses an extra memory chip to do parity calculations.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Byzantine failure by ge · 2008-07-27 02:15 · Score: 3, Interesting

So the whole cloud is in trouble if one node starts spewing nonsense? So much for redundancy. Amazon developers would be well advised to read up on the "Byzantine Generals" problem.
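One small building block from that literature can be sketched without implementing a full agreement protocol: if at most f nodes may be faulty, only act on a state report once f + 1 distinct peers have sent the same value, so at least one of them must be honest. The function name and the peer/report shapes below are assumptions for illustration, and this is not a Byzantine agreement protocol in itself.

```python
from collections import Counter
from typing import Dict, Optional


def accept_with_quorum(reports: Dict[str, str], max_faulty: int) -> Optional[str]:
    """Return the most-reported value if at least max_faulty + 1 peers agree on it.

    reports maps a peer id to the state value that peer sent.
    Returns None when no value has enough independent support.
    """
    if not reports:
        return None
    value, votes = Counter(reports.values()).most_common(1)[0]
    return value if votes >= max_faulty + 1 else None


reports = {"node-a": "healthy", "node-b": "healthy", "node-c": "\x00garbage"}
print(accept_with_quorum(reports, max_faulty=1))                    # healthy
print(accept_with_quorum({"node-c": "\x00garbage"}, max_faulty=1))  # None
```

With a rule like this, a single node spewing nonsense cannot by itself change what its neighbours believe, at the cost of waiting for more messages.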