Amazon Explains Why S3 Went Down

Posted by timothy on Saturday July 26, 2008 @09:33AM from the not-mere-sluttiness dept.

Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence. In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
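
For illustration only: the summary says the state messages carried no checksum, so a single flipped bit was accepted and passed along. The sketch below shows how a receiver could reject such a message. It is not Amazon's code or message format; the JSON payload, the CRC-32 choice, and the names pack_state, unpack_state and flip_bit are all assumptions made up for this example.

```python
import json
import zlib


def pack_state(state: dict) -> bytes:
    """Serialize a (hypothetical) server state message with a CRC-32 prefix."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(payload)  # unsigned 32-bit value in Python 3
    return checksum.to_bytes(4, "big") + payload


def unpack_state(message: bytes) -> dict:
    """Verify the CRC-32 prefix and return the state, or raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    expected = int.from_bytes(message[:4], "big")
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        # A corrupted message is dropped here instead of being gossiped onward.
        raise ValueError("checksum mismatch: dropping corrupted state message")
    return json.loads(payload.decode("utf-8"))


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Return a copy of the message with exactly one bit inverted."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    original = pack_state({"server": "node-17", "status": "up"})
    print(unpack_state(original))
    try:
        unpack_state(flip_bit(original, 40))
    except ValueError as err:
        print("rejected:", err)
```

CRC-32 catches every single-bit error, which is the failure described in the summary. It is not a defense against deliberate tampering; that would need a keyed MAC or a signature.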

8 of 114 comments

Re:for want of a nail ... by Daimanta · 2008-07-26 09:37 · Score: 5, Funny

It was the evil bit...

--
Knowledge is power. Knowledge shared is power lost.

...make lemonade. by fahrbot-bot · 2008-07-26 10:03 · Score: 3, Funny

A random bit got flipped in one of the server state messages...

Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector. :-)

--
It must have been something you assimilated. . . .

It was drunk, had father issues, and... by SensitiveMale · 2008-07-26 10:21 · Score: 1, Funny

was trying to hold onto a man?
I'm just guessing here.

Re:Other companies could learn from this... by Anonymous Coward · 2008-07-26 10:34 · Score: 5, Funny

Other companies could learn something from this, unfortunately they won't be able to do anything similar as Amazon has patented the process of explaining technological problems to customers.

It providers a timeline of events by Anonymous Coward · 2008-07-26 11:22 · Score: 1, Funny

It providers a timeline of events
It provideRS? PROVIDERS?!?
I'TS PROVIDED!

Re:for want of a nail ... by iamhassi · 2008-07-26 13:38 · Score: 4, Funny

FTA:
"On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests."

sounds like a restaurant, gossiping servers were failing to process customer requests

--
my karma will be here long after I'm gone

Re:for want of a nail ... by mrmeval · 2008-07-26 13:57 · Score: 2, Funny

1 million code monkeys typing out Aleister Crowley?

--
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty

Re:It's quite an old story - see RFC789 by Gazzonyx · 2008-07-26 15:17 · Score: 4, Funny

[...]
What do they say about those who ignore history?
I think it was, they're doomed to reimplement it... poorly. Or was that Unix? ;)

--
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.