Amazon Explains Why S3 Went Down
Angostura writes "Amazon has provided a decent write-up of the problems that caused its S3 storage service to fail for around 8 hours last Sunday. It providers a timeline of events, the immediate action taken to fix it (they pulled the big red switch) and what the company is doing to prevent re-occurrence.
In summary: A random bit got flipped in one of the server state messages that the S3 machines continuously pass back and forth. There was no checksum on these messages, and the erroneous information was propagated across the cloud, causing so much inter-server chatter that no customer work got done."
a single bit?! I think there are some serious design deficiencies ...
They need to start using Erlang more. It's designed specifically for building highly-distributed, concurrent systems that must scale to millions of transactions per minute. So it's a natural fit between Erlang and what Amazon is trying to offer with their S3 service.
S3 is a total slut.
A random bit got flipped
...that a bit on the side has caused problems.
Look at Max Mosley, for example.
For those who don't know what you're referring to, like the AC who commented: search in this for "evil bit".
"National Security is the chief cause of national insecurity." - Celine's First Law
Cosmic Rays perhaps? I guess they could line the room with lead, or simply re-market S3 as a Neutrino detector. :-)
It must have been something you assimilated. . . .
By 11:05am PDT, ..., the system's state cleared.
At 2:57pm PDT, Amazon S3's EU location began successfully completing customer requests.
So WTF happened during the four hours between 11:05AM and 2:57PM?
And... have they learned nothing from the TIBCO fiasco?
Other large businesses could learn a lot from Amazon's example.
How often do you have the problem really explained to you, an apology, and a reasonable set of changes to stop it occurring again?
Most businesses would never explain the root of any problem. They simply list "hardware issues." And they NEVER say sorry anymore - supposedly it opens them up to more liability or something.
If I was an Amazon customer I would be happy with their explanation and apology even if obviously the downtime is still an issue.
was trying to hold onto a man?
I'm just guessing here.
I remember at least one similar incident:
Early (or earlish) ARPANET, the network controllers did not implement checksums on messages. A bit flip caused the whole shebang to keep on forwarding the same message over and over, to the point that nothing else would flow, not even a reset command. Someone had to be sent to the culprit box to manually reset it.
I remember this being discussed in some technical circles for a while (I think Software Engineering Notes from the ACM had a go at it).
Stop making protocols whereby one server can crash another already. Especially when they talk to each other constantly and there's a lot of them. Cascade failures, FTW.
It has been generally well-known for a number of years now that any time you have a large cluster you cannot count on hardware checksums to catch every bit flip that may occur during copies and transmission, particularly with consumer hardware which has many internal paths with no checksums at all. Google learned this the hard way, like the supercomputing people before them, and now like Amazon after Google. And some of the better database engines also do their own internal software checksums as well to catch uncaught errors introduced as the data gets copied across the silicon, disks, and network -- it is one way they get their very high uptime and low failure rate.
It does not reflect well on the software community that most people *still* do not know to do this for very large scale system designs. The performance cost of doing a software CRC on your data every time it is moved around is low enough that it is generally worth it these days. If your system is large enough, the probability of getting bitten by this approaches unity. Very fast implementations of Adler-32 and other high-performance checksum algorithms are widely available online.
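To make that concrete, here is a minimal sketch of an application-level checksum on a message, assuming a made-up framing where a 4-byte CRC-32 is prepended to the payload. The names seal and unseal are hypothetical, and this is not a description of what Amazon or anyone else actually runs; it only shows how cheap the check is to add.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Prepend a big-endian CRC-32 of the payload (hypothetical framing)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def unseal(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a checksum")
    (expected,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch: message corrupted")
    return payload


if __name__ == "__main__":
    frame = seal(b"node-17 state=healthy")
    damaged = bytearray(frame)
    damaged[10] ^= 0x04  # flip a single bit somewhere in the payload
    try:
        unseal(bytes(damaged))
    except ValueError as err:
        print("rejected:", err)
```

Swapping zlib.crc32 for zlib.adler32 gives the Adler-32 variant. Either way the receiver drops the damaged message instead of acting on it and passing it along.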
heading for s3.
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
Vulnerabilities of Network Control Protocols: An Example, published in January 1981.
What do they say about those who ignore history?
The Internet's nature is peer to peer - 20050301_cs_profs.pdf
It provideRS? PROVIDERS?!?
I'TS PROVIDED!
What Solid Snake Simulation?
"sudo rm -rf your-face"
Anybody else noticed that these major problems always occur when no one is around to fix them?
As someone who worked at Amazon as a software engineer for over three years in various backend areas, I can say without a doubt that Amazon's code and production quality is so horrible that it's hard to believe.
Engineers carry pagers and, in many groups, are constantly paged. The only thing that keeps the systems running is a bunch of junior engineers responding in the middle of the night, fixing databases, bouncing services, etc., etc. Engineers are rarely, if ever, given the chance to actually *fix* things; they're just supposed to band-aid them up.
And, here's a big secret for you: When I left Amazon a little over a year ago, no development groups internally were even using EC2, S3, SQS, or any of the other web services they sell to you. They make it sound like you're using the same high-end services they use to satisfy tens of millions of customers. They're not.
I hafta wonder if the bit flipped due to a bad RAM stick?
Nothing specific about *what* caused the bit to flip.
This comes to mind only because bad RAM on a new server at work caused installation of a stock Perl module to throw excessive errors during the XS compile phase - the same package installed without error on an identical machine 20 minutes earlier. Took over an hour before we realized it was probably hardware. Memtest86 quickly turned up the problem.
Would hashes and the like protect against RAM suddenly going south? Wouldn't any piece of data that passes through main memory be vulnerable to corruption? Makes me wonder why ECC memory isn't being used much anymore... we have various flavors of RAID to protect slow memory from corruption, but not many machines I see have ECC anymore.
O lord, bless this thy holy hand grenade, that with it thou mayest blow thine enemies to tiny bits, in thy mercy.
"There was a cock-up and it took us a while to figure out what it was."
Love it when PR departments try to sound technical.
See Koopman's paper "32-Bit Cyclic Redundancy Codes for Internet Applications" [ieee.org] for some better ideas.
Free, non-payware preprint version, enjoy!
(Found in about a minute via Google)
See RFC 3514 http://www.ietf.org/rfc/rfc3514.txt?number=3514
So the whole cloud is in trouble if one node starts spewing nonsense? So much for redundancy. Amazon developers would be well advised to read up on the "Byzantine Generals" problem.
I work for a company that uses Amazon S3 for our customers' data storage for the same reasons that many other companies do - they're reliable and inexpensive. We have a couple hundred terabytes of data stored on Amazon's servers and, aside from this one instance, we haven't had a major problem in three years.
Because we're in Seattle and a few blocks from Amazon's headquarters, we got a personal visit last week from one of the senior managers of Amazon's hosted platforms group. In addition to being able to ask him all kinds of great questions about how they do their business and what technologies they employ that we could also use, we got to ask him about what happened.
He was completely open and honest about it. He knew that we, like every other Amazon S3 customer, had suffered and that some of us had lost a Sunday to dealing with customer complaints. He apologized and told us that they were taking steps to make sure it wouldn't happen again.
Amazon has handled this very well and we will continue to be a customer of theirs.
---
Three nines allows for over eight hours of downtime a year.
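The downtime budget for a given number of nines is simple arithmetic. A minimal sketch, assuming a 365-day year (the function name is made up for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Three nines means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10.0 ** -nines
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        hours = downtime_hours_per_year(n)
        print(f"{n} nines: {hours:.4f} hours ({hours * 60:.2f} minutes) per year")
```

By this arithmetic 99.9% works out to roughly 8.76 hours a year, while 99.999% is only about 5.3 minutes.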
I thought that was caused by the bouncy-ball "gift" from the Great Collector. (He thought it was funny as hell....)
"The server is down, purple monkey dishwasher"
We reserve the right to serve refuse to anyone. -management
More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.
"A single bit" corruption would be detected even by the UDP checksum mechanism (which is guaranteed to catch any single-bit error).
So, either Amazon uses something even more primitive than UDP in their inter-server messages (which I don't believe), or they are flat-out lying.
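The single-bit claim is easy to check. Below is a minimal sketch of the 16-bit ones'-complement Internet checksum (the RFC 1071 algorithm that UDP uses), assuming we checksum only the payload and ignore the pseudo-header that real UDP also covers. The helper undetected_single_bit_flips is hypothetical and just flips every bit in turn. Note that a transport checksum only protects the bytes in flight: if the bit flips in memory before the checksum is computed, the corrupted message goes out with a perfectly valid checksum.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def undetected_single_bit_flips(message: bytes) -> int:
    """Count single-bit corruptions that leave the checksum unchanged."""
    original = internet_checksum(message)
    missed = 0
    for bit in range(len(message) * 8):
        corrupted = bytearray(message)
        corrupted[bit // 8] ^= 1 << (bit % 8)
        if internet_checksum(bytes(corrupted)) == original:
            missed += 1
    return missed


if __name__ == "__main__":
    print(undetected_single_bit_flips(b"server state: healthy"))
```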
I thought that was caused by the bouncy-ball "gift" from the Great Collector. (He thought it was funny as hell....)
He apparently thought we WERE hosting an intergalactic kegger down here.