Amazon Explains Why S3 Went Down
Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action take to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence.
In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
It was the evil bit...
Knowledge is power. Knowledge shared is power lost.
They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.
I think Erlang's cool and all, but it's not the magical bullet that will solve this. It's still possible to have information corrupted during message passing between processes in Erlang (say, as the result of an intermittently failing network switch) as it is in any language.
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Other large businesses could learn a lot from Amazon's example.
How often do you have the problem really explained to you, an apology, and a reasonable set of changes to stop it occurring again?
Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.
If I was an Amazon customer I would be happy with their explanation and apology even if obviously the downtime is still an issue.
This message is written by one that writes real parallel, distributed and concurrent code (they are not all the same):
Erlang or any other functional language will not account for lack of good design. If you have a good design with the right concerns you can implement in Java, C, Fortran, ASM and if done right, it will work.
I'm sick of hear "Erlang is THE solution". It is not. Good design and implementation practices are.
Vulnerabilities of Network Control Protocols: An Example, published in January 1981.
What do they say about those who ignore history?
The Internet's nature is peer to peer - 20050301_cs_profs.pdf