Amazon Explains Why S3 Went Down

Posted by timothy on Saturday July 26, 2008 @09:33AM from the not-mere-sluttiness dept.

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent a recurrence. In summary: a random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
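
To make the "no checksum on these messages" point concrete, here is a minimal sketch of framing a state message with a CRC-32 trailer so that a single flipped bit gets the message rejected instead of propagated. It is not Amazon's protocol: the `frame`, `unframe` and `flip_bit` helpers and the sample payload are hypothetical names made up for illustration, and only `zlib.crc32` and `struct` come from the Python standard library.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (hypothetical framing)."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC-32 trailer does not match."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, trailer = message[:-4], message[-4:]
    (expected,) = struct.unpack(">I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch; dropping message")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    msg = frame(b"node-17 state=healthy")  # made-up state message
    print(unframe(msg))                    # intact message passes
    try:
        unframe(flip_bit(msg, 5))          # one flipped bit is caught
    except ValueError as err:
        print("rejected:", err)
```

A receiver that drops the message on mismatch stops the bad state from spreading; what to do after detection (retry, ignore, alert) is a separate design question.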

5 of 114 comments

for want of a nail ... by thrillseeker · 2008-07-26 09:34 · Score: 4, Interesting

a single bit?! I think there are some serious design deficiencies ...
Re:It was a design defect by James+Youngman · 2008-07-26 11:17 · Score: 4, Interesting

Adler-32 wouldn't be a great choice. It's fast but it's weak for short messages and I've seen it fooled by multi-bit errors on large messages too.
See Koopman's paper "32-Bit Cyclic Redundancy Codes for Internet Applications" for some better ideas.
It's quite an old story - see RFC789 by anti-NAT · 2008-07-26 11:18 · Score: 5, Interesting

Vulnerabilities of Network Control Protocols: An Example, published in January 1981.
What do they say about those who ignore history?

--
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
Re:...make lemonade. by Sique · 2008-07-26 12:05 · Score: 4, Interesting

I see you completely miss the point of the proof. I know that you can minimize the impact of a bit error with checksums, and that you can improve reliability by adding redundancy. But what is the consequence of error detection? Normally the protocol then asks for the message to be resent. But how do you (as the sender) know that the message finally arrived correctly? You wait for an acknowledgement. But what if the acknowledgement gets lost or scrambled? You add redundancy in the handshaking. But how does redundancy help reliability? You can ask for a resend due to detected errors, and so on.
Your protocol never finds an end because it has to secure the correctness of the security of the correctness of the security...

--
.sig: Sique *sigh*
Re:It was a design defect by James+Youngman · 2008-07-26 12:22 · Score: 4, Interesting

I've seen Adler-32 fail twice, so your assumption of "a few times per millennium" doesn't seem to work for me (I'm under 40). In fact in those cases it was even stacked on top of at least one other checksum mechanism, too.
One of the problems with it is poor spreading of the input bits into s2. There are other algorithms which don't have that weakness but don't (IIRC) cost any more to compute.
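
A rough sketch of the weakness being described, assuming the standard Adler-32 definition (s1 is 1 plus the byte sum, s2 is the running sum of s1 values, both modulo 65521, packed as `(s2 << 16) | s1`). The helper names `adler32_parts` and `adler32` are made up here; `zlib.adler32` and `zlib.crc32` are the real standard-library functions. On short inputs s2 stays far below 2**16, so most of the upper bits carry no information, and a three-byte change of (+1, -2, +1) leaves both sums untouched.

```python
import zlib

MOD = 65521  # largest prime below 2**16, as used by Adler-32


def adler32_parts(data: bytes):
    """Return Adler-32's two running sums (s1, s2) for data."""
    s1, s2 = 1, 0
    for byte in data:
        s1 = (s1 + byte) % MOD
        s2 = (s2 + s1) % MOD
    return s1, s2


def adler32(data: bytes) -> int:
    """Pack s2 into the high 16 bits and s1 into the low 16 bits."""
    s1, s2 = adler32_parts(data)
    return (s2 << 16) | s1


if __name__ == "__main__":
    # Even the worst-case 8-byte input leaves s2 well under 2**14.
    print(adler32_parts(b"\xff" * 8))

    # Two different 3-byte messages with identical Adler-32 values.
    a, b = b"abc", b"b`d"
    print(hex(zlib.adler32(a)), hex(zlib.adler32(b)))

    # CRC-32 tells the same pair apart.
    print(hex(zlib.crc32(a)), hex(zlib.crc32(b)))
```

This only illustrates the short-message case; it says nothing about how often such corruption patterns turn up in practice.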