Amazon Explains Why S3 Went Down
Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence.
In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
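For anyone wondering what "a checksum on these messages" would look like in practice, here is a minimal sketch in Python. It is not Amazon's protocol: the summary doesn't describe the message format, so the JSON payload and the names pack_state, unpack_state and flip_bit are all made up for illustration. The idea is just that a receiver which verifies a CRC32 can drop a message with a flipped bit instead of passing it along.

```python
import json
import zlib


def pack_state(state: dict) -> bytes:
    """Serialize a (hypothetical) state message and append a 4-byte CRC32."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack_state(message: bytes) -> dict:
    """Return the state, or raise ValueError if the checksum does not match."""
    payload = message[:-4]
    received = int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received:
        # A corrupted message is rejected here rather than gossiped onward.
        raise ValueError("corrupt state message")
    return json.loads(payload)


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate a single flipped bit somewhere in the message."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    msg = pack_state({"server": "s1", "healthy": True})
    print(unpack_state(msg))  # the clean message round-trips
    try:
        unpack_state(flip_bit(msg, 13))
    except ValueError as err:
        print("rejected:", err)
```

CRC32 catches every single-bit error, including one that lands in the checksum bytes themselves, so it fits the failure described above. It is an error-detection code only and does nothing against deliberate tampering.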
It was the evil bit...
Knowledge is power. Knowledge shared is power lost.
Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector. :-)
It must have been something you assimilated. . . .
was trying to hold onto a man?
I'm just guessing here.
Other companies could learn something from this, unfortunately they won't be able to do anything similar as Amazon has patented the process of explaining technological problems to customers.
It provideRS? PROVIDERS?!?
I'TS PROVIDED!
FTA:
"On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests."
sounds like a restaurant, gossiping servers were failing to process customer requests
my karma will be here long after I'm gone
1 million code monkeys typing out Aleister Crowley?
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
[...]
What do they say about those who ignore history?
I think it was, they're doomed to reimplement it... poorly. Or was that Unix? ;)
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.