Amazon Explains Why S3 Went Down
Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It provides a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent a recurrence.
In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
A single bit?! I think there are some serious design deficiencies ...
S3 is a total slut.
They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.
I think Erlang's cool and all, but it's not the magic bullet that will solve this. Information can still be corrupted during message passing between processes in Erlang (say, as the result of an intermittently failing network switch), just as it can in any language.
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
For those who don't know what you're referring to, like the AC who commented: search in this for "evil bit".
"National Security is the chief cause of national insecurity." - Celine's First Law
Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector. :-)
It must have been something you assimilated. . . .
Other large businesses could learn a lot from Amazon's example.
How often do you have the problem really explained to you, an apology, and a reasonable set of changes to stop it occurring again?
Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.
If I were an Amazon customer, I would be happy with their explanation and apology, even though the downtime is obviously still an issue.
Read "the system's state cleared" as "we turned everything off." They then turned every server back on one by one until around 3PM, when the EU location was complete and not showing any symptoms.
Sun Enterprise and most other enterprise servers take 30-45 minutes to pass the PROM tests after you start them up. That's per server, before you even boot the OS. My guess is they brought up servers in large batches of 25% at a time and took about an hour per batch.
It has been well known for a number of years now that any time you have a large cluster you cannot count on hardware checksums to catch every bit flip that may occur during copies and transmission, particularly with consumer hardware, which has many internal paths with no checksums at all. Google learned this the hard way, like the supercomputing people before them, and now Amazon after Google. Some of the better database engines also do their own internal software checksums to catch errors introduced as the data gets copied across the silicon, disks, and network -- it is one way they get their very high uptime and low failure rate.
It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs. The performance cost of doing a software CRC on your data every time it is moved around is low enough that it is generally worth it these days. If your system is large enough, the probability of getting bitten by this approaches unity. Very fast implementations of Adler-32 and other high-performance checksum algorithms are widely available online.
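As a rough illustration of how cheap this is, here is a minimal sketch of a software checksum on a state message. Everything in it is an assumption made for the example: the JSON payload, the 4-byte trailer layout, and the function names `pack`, `unpack`, and `flip_bit` are hypothetical and have nothing to do with how S3 actually encodes its messages. It uses CRC-32 from Python's standard `zlib` module; `zlib.adler32` could be swapped in the same way.

```python
import json
import zlib


def pack(state: dict) -> bytes:
    """Serialize a state message and append a CRC-32 of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def unpack(message: bytes) -> dict:
    """Verify the trailing CRC-32, then decode; raise ValueError on mismatch."""
    if len(message) < 4:
        raise ValueError("message too short")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload) != int.from_bytes(trailer, "big"):
        raise ValueError("checksum mismatch, dropping message")
    return json.loads(payload.decode("utf-8"))


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error somewhere between sender and receiver."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    msg = pack({"node": "a", "healthy": True})
    print(unpack(msg))  # the intact message decodes normally
    try:
        unpack(flip_bit(msg, 13))
    except ValueError as err:
        print("rejected:", err)
```

The receiver drops a damaged message instead of acting on it and passing it along, which is the behavior the write-up says was missing.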
This message is written by someone who writes real parallel, distributed and concurrent code (they are not all the same):
Erlang or any other functional language will not account for lack of good design. If you have a good design with the right concerns you can implement in Java, C, Fortran, ASM and if done right, it will work.
I'm sick of hearing "Erlang is THE solution". It is not. Good design and implementation practices are.
Vulnerabilities of Network Control Protocols: An Example, published in January 1981.
What do they say about those who ignore history?
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
I think the real lesson here is simply that input over the network cannot ever be trusted.
It is their network, and S3 is built on tech which is explicitly designed to not have any kind of security built-in. The security is applied at the API level, but any misbehaving machine within the S3 cluster could cause some serious damage.
I actually agree with this philosophy, to an extent. After all, this is essentially a large number of computers acting as a hard disk. How would you approach talking to a hard disk in your own machine? Do you assume that everything is corrupt, untrusted, or wrong?
Don't thank God, thank a doctor!
Anybody else noticed that these major problems always occur when no one is around to fix them?
I hafta wonder if the bit flipped due to a bad RAM stick?
Nothing specific about *what* caused the bit to flip.
This comes to mind only because bad RAM on a new server at work caused installation of a stock Perl module to throw excessive errors during the XS compile phase - the same package installed without error on an identical machine 20 minutes earlier. Took over an hour before we realized it was probably hardware. Memtest86 quickly turned up the problem.
Would hashes and the like protect against RAM suddenly going south? Wouldn't any piece of data that passes through main memory be vulnerable to corruption? Makes me wonder why ECC memory isn't being used much anymore... we have various flavors of RAID to protect slow memory from corruption, but not many machines I see have ECC anymore.
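One way to look at the hash question is a small sketch, using SHA-256 from Python's standard `hashlib`. The helper names `seal`, `verify`, and `corrupt` are hypothetical, and the single flipped bit just stands in for a bad RAM cell. A hash taken at the source catches corruption that happens after it was computed, but it cannot help if the data was already bad when it was hashed.

```python
import hashlib
from typing import Tuple


def seal(data: bytes) -> Tuple[bytes, str]:
    """Pair the data with a digest computed at the source."""
    return data, hashlib.sha256(data).hexdigest()


def verify(data: bytes, digest: str) -> bool:
    """Recompute the digest at the destination and compare."""
    return hashlib.sha256(data).hexdigest() == digest


def corrupt(data: bytes) -> bytes:
    """Flip the lowest bit of the first byte, standing in for a bad RAM cell."""
    return bytes([data[0] ^ 0x01]) + data[1:]


if __name__ == "__main__":
    original = b"server state: OK"

    # Corruption after the digest was taken: detected.
    data, digest = seal(original)
    print(verify(corrupt(data), digest))

    # Corruption before the digest was taken: the digest matches the bad data.
    data, digest = seal(corrupt(original))
    print(verify(data, digest))
```

So hashes narrow the window but do not close it, which is the gap ECC memory is meant to cover.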
O lord, bless this thy holy hand grenade, that with it thou mayest blow thine enemies to tiny bits, in thy mercy.
So the whole cloud is in trouble if one node starts spewing nonsense? So much for redundancy. Amazon developers would be well advised to read up on the "Byzantine Generals" problem.
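To make the "one node spewing nonsense" point concrete, here is a minimal sketch of a majority filter, not a full Byzantine agreement protocol. It assumes peer identities can be trusted, that all healthy peers report the same value, and that at most `max_faulty` peers are broken; the function name `accepted_value` and the report format are hypothetical. The classic result for the Byzantine Generals problem without signed messages needs at least 3f+1 nodes to tolerate f faulty ones, and this sketch does not attempt that.

```python
from collections import Counter
from typing import Hashable, Mapping, Optional


def accepted_value(reports: Mapping[str, Hashable],
                   max_faulty: int) -> Optional[Hashable]:
    """Return the most common reported value if at least max_faulty + 1
    distinct peers agree on it; otherwise return None.

    With at most max_faulty bad peers, any value that clears this bar
    has at least one healthy peer behind it.
    """
    if not reports:
        return None
    value, count = Counter(reports.values()).most_common(1)[0]
    return value if count >= max_faulty + 1 else None


if __name__ == "__main__":
    # Two peers agree, one disagrees: the majority value is accepted.
    print(accepted_value({"a": "up", "b": "up", "c": "down"}, max_faulty=1))
    # A lone report is not enough to act on.
    print(accepted_value({"c": "down"}, max_faulty=1))
```

The design choice is simply that no node acts on state it has heard from only one source.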