Amazon EC2 Crash Caused Data Loss

Posted by timothy on Thursday April 28, 2011 @06:17PM from the but-that-was-off-site-backup! dept.

Relayman writes "Henry Blodget is reporting that the recent EC2 crash caused permanent data loss. Apparently, the backups that were being made were not sufficient to recover the lost data. Although a small percentage of the total data was lost, any data loss can be bad to a Website operator."

81 of 112 comments

Um, backups? by Anonymous Coward · 2011-04-28 18:22 · Score: 1

srsly, as in your own
I am not rightly able to comprehend... by Man+On+Pink+Corner · 2011-04-28 18:25 · Score: 5, Insightful

... the confusion of ideas that would lead someone to treat their live web server as their primary/master data repository.
I guess I'm still stuck in Commodore 64 World, or something..
1. Re:I am not rightly able to comprehend... by SpiralSpirit · 2011-04-28 18:41 · Score: 1
  
  well, the a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand. It took something pretty catastrophic to bring it down and cause data lass. the problem is if someone decided to only have one copy, at amazon. If they had 2 at 2 different servers, success!
2. Re:I am not rightly able to comprehend... by Anonymous Coward · 2011-04-28 18:53 · Score: 1
  
  naaa, I've not heard of any meteor swarm hitting amazon servers, so there was not anything catastrophic.
  
  just business as usual.
3. Re:I am not rightly able to comprehend... by obarthelemy · 2011-04-28 18:53 · Score: 2
  
  I'm not so sure about rigorous...
  1- I personally have never lost a single byte of meaningful data
  2- do amazon detail their exact procedures and commitments ?
  3- do amazon back up those "commitments" with hard cash ? How much will the people whose data they lost be compensated ?
  read the sig....
  
  --
  The Cloud - because you don't care if your apps and data are up in the air.
4. Re:I am not rightly able to comprehend... by MichaelSmith · 2011-04-28 18:53 · Score: 4, Informative
  
  It took something pretty catastrophic to bring it down and cause data lass
  Catastrophic would be an earthquake, tsunami and meltdown, in that order. From my reading of the situation, Amazon stuffed up their own replication mechanism and it recursively replicated the system to fill up the available hardware. That's just bad design. It's obvious they did no testing under realistic conditions.
  
  --
  http://michaelsmith.id.au
5. Re:I am not rightly able to comprehend... by greenbird · 2011-04-28 19:08 · Score: 2
  
  well, the a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with
  It's funny. Not a single place I've worked at has had backups as good as the ones I have for my personal stuff. And I didn't even spend 6 figures on some useless enterprise backup solution. Some scripting, cp -al, rsync, dm-crypt, ssh and a remote PC at my girlfriend's house, and you have an incremental backup solution more secure and more robust than any enterprise solution I've ever seen, and it only cost a couple hundred for the drives.
  
  --
  Who is John Galt?
6. Re:I am not rightly able to comprehend... by zonky · 2011-04-28 19:17 · Score: 1, Insightful
  
  Congrats. Does your GF have a key to your house? Because your "perfect system" has a single point of failure: an insider who could damage both copies and cause loss of data. Best not get on her bad side for now, anyway....
7. Re:I am not rightly able to comprehend... by jpapon · 2011-04-28 19:18 · Score: 2
  
  Until she dumps you and throws your backup drives out her window that is. Tying the security of your backup to the security of your relationship is an interesting gamble. One day you might find yourself lonely AND data-less.
  Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.
  
  --
  -- Let us endeavor so to live that when we pass even the undertaker shall be sorry. -- M. Twain
8. Re:I am not rightly able to comprehend... by Anonymous Coward · 2011-04-28 19:20 · Score: 1
  
  I am not rightly able to comprehend how someone whose primary data source is people entering data via their website could have it elsewhere. Sure, they can have copies elsewhere, but those would be the backups.
9. Re:I am not rightly able to comprehend... by jimicus · 2011-04-28 19:25 · Score: 1
  
  Bigger solutions are invariably more complicated. And when they're more complicated, there's more to go wrong - and when it does go wrong, there's more that can be affected.
  This is why I'm quite wary of people throwing the word "Enterprise" around. IME, it's frequently a codeword meaning "A proprietary vendor has told us their product can be all things to all men - which is technically true but what we're buying needs many more man-hours of work to turn it into anything for anyone than we can hope to dedicate."
10. Re:I am not rightly able to comprehend... by im_thatoneguy · 2011-04-28 19:33 · Score: 1
  
  My backup system is nowhere near the backup system at the studio I work at. But I could deploy something even easier for one simple reason: I have less data.
  Do you know how much it would cost to remote push 10TB of data once a week?
11. Re:I am not rightly able to comprehend... by leehwtsohg · 2011-04-28 19:47 · Score: 1
  
  Hmmmm? You don't store copies of your data in remote locations? What about a fire? I think a backup scheme must store data remotely. Leave one copy with your parents (upstairs), and one with a friend in australia (I guess across the street...)
12. Re:I am not rightly able to comprehend... by Yvanhoe · 2011-04-28 20:14 · Score: 4, Funny
  
  That is still better than Amazon's plan actually.
  
  --
  The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
13. Re:I am not rightly able to comprehend... by wvmarle · 2011-04-28 20:20 · Score: 4, Informative
  
  From a look at the linked article, it seems that one of the issues is data generated by these web sites. Such as user statistics, or user uploaded content, etc. That naturally lives primarily on the live web server and is also data that you don't want to lose. Also as other commenters mentioned as well the EC2 service is not a cloud-storage server, it's a web hosting service, and web hosts tend to indeed generate their own data.
  This data of course needs to be backed up actively, and one would expect a web host to include that in its service. That's one of the reasons to pay for such a service, instead of doing it yourself.
  Besides relying on their backups it's of course a good idea to regularly take backups yourself. But even if you do this daily, it means you may lose up to a day's worth of data. And that's (partly) what happened here. It's similar to someone who takes a photo on a digital camera, and subsequently loses that camera and the photo with it. You don't say "they shouldn't use a camera as primary data repository". It isn't. It's a temporary repository, and when the data is generated it's the one and only repository, simply pending copying to backup media.
14. Re:I am not rightly able to comprehend... by SuricouRaven · 2011-04-28 20:27 · Score: 1
  
  I wouldn't be too hard. Yes, they screwed up - but a bug like that could easily slip through testing, as it might only occur on extreme-sized data sets. Their real screwup was in not noticing right away and reverting to the previous config.
15. Re:I am not rightly able to comprehend... by SuricouRaven · 2011-04-28 20:29 · Score: 1
  
  That's what differentials are for.
16. Re:I am not rightly able to comprehend... by Eivind · 2011-04-28 20:29 · Score: 1
  
  I've pondered the same thing. My workplace spends 6 figures, and as far as I'm able to tell, gets significantly less than I have at home, despite my investment being 2 orders of magnitude less.
  Every 2 hours for the last day, every day for the last week, every week for the last month, every month for the last year, every year forever. Physically backed up to 2 distinct discs in-house (one of which is pretty burglar-proof, living in a safe), and 2 encrypted copies under the care of 2 distinct companies, in different jurisdictions and on different continents. (There are 3 copies of the decryption phrase: one in my head, one in my safety-deposit box in the bank and one stored with my will in a secure will-storage system run by a respected law firm.)
  While this ain't "perfect" (nothing is), it's a *hell* of a lot better than what we've got at work, despite the latter costing 100 times more. (And no, the amount of data backed up at work is not larger; both backups are approximately 5TB for a complete copy.)
  Yes, a burglar could steal one copy from my house. But I'm more concerned with not losing files than with preserving privacy, there's nothing *really* secret anywhere in my data.
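The retention schedule Eivind describes (every 2 hours for a day, daily for a week, weekly for a month, monthly for a year, yearly forever) can be sketched as a pruning rule. This is only an illustration, not his actual setup: the function name `snapshots_to_keep`, the bucket definitions, and the 31-day and 365-day cutoffs are assumptions made here for the example.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now):
    """Return the subset of snapshot timestamps a tiered schedule would retain.

    Each tier is (maximum age, bucket function). Within a tier, only the
    newest snapshot in each bucket is kept. All cutoffs are assumptions.
    """
    tiers = [
        (timedelta(days=1), lambda t: (t.date(), t.hour // 2)),  # every 2 hours
        (timedelta(days=7), lambda t: t.date()),                 # daily
        (timedelta(days=31), lambda t: t.isocalendar()[:2]),     # weekly (ISO week)
        (timedelta(days=365), lambda t: (t.year, t.month)),      # monthly
        (None, lambda t: t.year),                                # yearly, forever
    ]
    keep = set()
    for max_age, bucket in tiers:
        newest = {}
        for t in snapshots:
            if max_age is not None and now - t > max_age:
                continue  # too old for this tier
            key = bucket(t)
            if key not in newest or t > newest[key]:
                newest[key] = t
        keep.update(newest.values())
    return keep

if __name__ == "__main__":
    now = datetime(2011, 4, 28, 20, 0)
    hourly = [now - timedelta(hours=h) for h in range(72)]
    kept = snapshots_to_keep(hourly, now)
    print(f"{len(kept)} of {len(hourly)} snapshots retained")
```

Anything not in the returned set is a candidate for deletion; a real script would also want to guard against deleting the only remaining copy.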
17. Re:I am not rightly able to comprehend... by jc2brown · 2011-04-28 20:47 · Score: 2
  
  You might want to read this.
  
  They're crediting all accounts that had any activity in the USA-East region for 10 days of usage, regardless if they were affected.
  
  Remember that it was EC2 that was affected, which is just a virtual machine with volatile storage. Had it been S3 data that was lost one should expect restitution, but in this case downtime and data loss is ultimately the fault of the user.
18. Re:I am not rightly able to comprehend... by digitig · 2011-04-28 20:57 · Score: 1
  
  So that's basically "not much" -- only a bit more than just not charging them for the period of the outage.
  
  --
  Quidnam Latine loqui modo coepi?
19. Re:I am not rightly able to comprehend... by mwvdlee · 2011-04-28 21:00 · Score: 1
  
  Just curious; what 5TB worth of personal data requires a 4-figure backup spending?
  
  --
  Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
20. Re:I am not rightly able to comprehend... by teh+kurisu · 2011-04-28 21:02 · Score: 3, Interesting
  
  That depends. Only a couple of our servers in that availability zone were actually affected, but we're apparently being compensated as though all of them were. Bonus for us.
21. Re:I am not rightly able to comprehend... by jimicus · 2011-04-28 22:24 · Score: 1
  
  Where things start to get more complicated is when the data being stored requires some massaging before you can take a copy - or for that matter if you can only take a copy under specific circumstances or your copy is only useful under specific circumstances.
  For instance:
  Most modern databases store their data in files on the disk. Database transactions are atomic, sure. And (hopefully, assuming a modern FS) so are disk transactions. This does not mean, however, that you can simply copy the underlying files. cp, tar et al are not atomic, so you can wind up with an unusable backup. Both MySQL and Postgres explicitly state that you need to either shut down the database, use the native backup tool (which requires a lot of free space because you're essentially dumping the DB to a text file on disk and you take a backup of that) or in the case of MySQL, lock the tables.
  For instance:
  Microsoft Exchange stores all the data in a honking great database. This has pros and cons. The biggest pro is with appropriate indexing (pre-cooked by Microsoft because they designed the application), it's very fast.
  The biggest con is that backup and restoration of individual mailboxes are a PITA (though I understand some companies produce proprietary software to try to resolve this). It's not too difficult to backup and recover the entire server.
  Something similar is true for most mail systems to a greater or lesser extent. Systems which store everything in discrete files (such as Courier IMAP) at least have the advantage that it's dead easy to recover a person's mailbox, but it's a pig to recover a particular email from it because the metadata you'd use to find the email isn't stored in the filename, it's in the file itself. About the only sensible thing you can do is create them a new mailbox, restore the entire backup into it and tell them to find the email(s) they need from in there.
  These are fairly simple examples I can come up with without having to put any real effort in. It's likely that Amazon - in creating their EC2 system - created something with lots of wonderful features but "dead easy to backup and restore, very difficult to screw up the backup process" wasn't one of them.
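The point about not copying live database files can be illustrated with a small sketch. It uses SQLite from the Python standard library as a stand-in, not MySQL or Postgres, whose native tools are different; the function name `consistent_backup` and the file paths are hypothetical. The idea is the same as in the comment above: ask the database engine for a consistent copy instead of running cp or tar on files that may be mid-write.

```python
import sqlite3

def consistent_backup(live_path: str, backup_path: str) -> None:
    """Copy a live SQLite database using the engine's own backup API.

    Connection.backup() produces a consistent snapshot even if other
    connections write during the copy, which a plain file copy cannot
    guarantee.
    """
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

A caller would run something like `consistent_backup("site.db", "site-backup.db")` and then ship the resulting file offsite with whatever file-level tool they prefer.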
22. Re:I am not rightly able to comprehend... by renoX · 2011-04-28 22:57 · Score: 1
  
  He didn't claim that his backup system was perfect, just better than what many enterprises do, which is probably true.
  > Best not get on her bad side for now, anyway....
  Bah, if he is wise his backups are encrypted, so this shouldn't be a big issue (unless he has a bad break-up with his GF and loses data at the same time: Murphy's law in action).
23. Re:I am not rightly able to comprehend... by tigersha · 2011-04-28 23:21 · Score: 1
  
  This is definitely better than Amazon's backup plan. It backs up data and LITERALLY screws you. Amazon just screws you.
  
  --
  The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
24. Re:I am not rightly able to comprehend... by brusk · 2011-04-29 00:01 · Score: 1
  
  Unless of course you're one of those people who refers to female friends as "girlfriends", in which case, I hate you.
  Women do that too, but this is /.
  
  --
  .sig withheld by request
25. Re:I am not rightly able to comprehend... by Cyrrus30 · 2011-04-29 00:16 · Score: 1
  
  That's what I was wondering. What can be 5 TB of personal data? My "real" worthwhile data (meaning things that I can't get back if something wrong happens) takes like 30 GB. And 29.5 of this are pictures. But things like downloaded movies or MP3s (a certain chunk of it being illegally acquired) are not worth backing up using such a complicated scheme. My house burns and I lose some MP3s I bought on iTunes? No big deal, there would be a lot of other things more important that would bug me (like, you know, my house).
26. Re:I am not rightly able to comprehend... by davidbrit2 · 2011-04-29 00:28 · Score: 1
  
  well, the a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand
  And a fat lot of good that did.
27. Re:I am not rightly able to comprehend... by ColdWetDog · 2011-04-29 00:34 · Score: 1
  
  I guess I'm still stuck in Commodore 64 World, or something..
  Cassette tapes? I'm so very sorry.
  
  --
  Faster! Faster! Faster would be better!
28. Re:I am not rightly able to comprehend... by Thing+1 · 2011-04-29 00:49 · Score: 1
  
  It's obvious they did no testing under realistic conditions.
  And how do you test under realistic conditions when those realistic conditions are an enormous, ~10 datacenter system that serves a good percentage of the internet?
  Like in Contact: you build two of them.
  
  --
  I feel fantastic, and I'm still alive.
29. Re:I am not rightly able to comprehend... by CSMoran · 2011-04-29 01:06 · Score: 1
  
  bring it down and cause data lass.
  I'm a lad, you insensitive clod!
  
  --
  Every end has half a stick.
30. Re:I am not rightly able to comprehend... by Khyber · 2011-04-29 02:15 · Score: 1
  
  "the a data center run by amazon certainly has more rigorous backup and maintenance schedules than anything I could personally come up with, offhand."
  Differential backups every five minutes across three different backup systems rolling RAID-6, one local and two remote.
  I don't have data loss issues, EVER.
  And I'm just a low-level tech guy. If Amazon can't get it right and I can, something is wrong.
  
  --
  Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
31. Re:I am not rightly able to comprehend... by breakfastpirate · 2011-04-29 02:18 · Score: 1
  
  Well, so long as his backups at her place consist only of jpeg pictures of their relationship, you could actually kill two birds with one stone there...
32. Re:I am not rightly able to comprehend... by MarkGriz · 2011-04-29 02:24 · Score: 1
  
  Just curious; what 5TB worth of personal data requires a 4-figure backup spending?
  Porn
  
  --
  Beauty is in the eye of the beerholder.
33. Re:I am not rightly able to comprehend... by darkpixel2k · 2011-04-29 03:04 · Score: 2
  
  1- I personally have never lost a single byte of meaningful data
  Yep--the moment I accidentally 'rm -rf /', I simply re-classify the drive as 'not containing meaningful data' and my stats are saved.
  
  --
  There's no place like ::1 (I've completed my transition to IPv6)
34. Re:I am not rightly able to comprehend... by aztracker1 · 2011-04-29 03:38 · Score: 1
  
  It's not a "web host" even.. it's simply a virtual machine run-time environment. You set up the OS, and configure it... Amazon does not... they provide storage facilities that can be used to back up to, and even mount to your host OS. Also, many virtual machine, or virtual host providers don't necessarily provide backup solutions.
  
  --
  Michael J. Ryan - tracker1.info
35. Re:I am not rightly able to comprehend... by greenbird · 2011-04-29 05:41 · Score: 1
  
  Do you know how much it would cost to remote push 10TB of data once a week?
  If you're generating 10TB of data weekly you're talking about an extremely rare situation that would require a specialized solution anyway since no backup solution out there can support that.
  If you're talking about having around 10TB of data total, that's what rsync is for. You do fast incrementals on the local system over a high-bandwidth pipe (SATA, SAS). Then you have 2 options depending on your system requirements. You either run a daily rsync of one of the incrementals directly offsite over a separate network pipe, or you dump to an onsite backup server over a fat network pipe and then do your remotes from there. You're isolating your backup bandwidth from your production systems, and with rsync only modified portions of files are sent, greatly reducing bandwidth requirements. The cp -al gives you point-in-time backup images (say every four hours) while only needing disk space for modified files, allowing finely incremented recoveries (e.g. I need a file that was created Tuesday morning and deleted that afternoon). Your offsites are really only for major disasters like a plane crashing into the building. If you use an onsite backup server, put it in a different location than your servers. That way when the server room floods you still have your incrementals.
  All this can be done in a highly reliable fashion with a few hundred lines of bash scripting including the email warnings.
  
  --
  Who is John Galt?
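The hard-link snapshot idea greenbird describes (cp -al plus rsync) can be sketched in a few lines. The sketch below is in Python rather than bash and is not his script: the function name `snapshot`, the directory layout, and the assumption that snapshot names are unique and sort chronologically are all made up for illustration. Each snapshot looks like a full image, but files unchanged since the previous snapshot are hard links and take no extra disk space.

```python
import filecmp
import os
import shutil
from pathlib import Path

def snapshot(source: Path, backup_root: Path, name: str) -> Path:
    """Create backup_root/name as a point-in-time image of source.

    Assumes snapshot names are unique and sort in chronological order
    (for example, ISO timestamps). Unchanged files are hard-linked to
    the most recent earlier snapshot; new or changed files are copied.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    earlier = sorted(p for p in backup_root.iterdir() if p.is_dir())
    prev = earlier[-1] if earlier else None
    dest = backup_root / name
    for dirpath, _dirs, files in os.walk(source):
        rel = Path(dirpath).relative_to(source)
        (dest / rel).mkdir(parents=True, exist_ok=True)
        for fname in files:
            src = Path(dirpath) / fname
            new = dest / rel / fname
            old = prev / rel / fname if prev is not None else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the same inode
            else:
                shutil.copy2(src, new)  # new or modified: store a fresh copy
    return dest
```

Because every snapshot directory is a complete tree, restoring a file from Tuesday morning is just a copy out of that snapshot's directory. Deleting an old snapshot frees only the blocks no other snapshot still links to.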
36. Re:I am not rightly able to comprehend... by greenbird · 2011-04-29 05:50 · Score: 1
  
  Until she dumps you and throws your backup drives out her window that is.
  She'd have to come here and trash those also. That'd be a trick though. I have some 10 computers spread over several rooms here and a dozen or more external drives. I never said it was perfect. Just better than any enterprise setup I've seen. The malicious insider is always the toughest hole to cover in any data protection scheme.
  
  --
  Who is John Galt?
37. Re:I am not rightly able to comprehend... by arndawg · 2011-04-29 07:24 · Score: 1
  
  Wait. Your backup target is ONLINE? I've got news for you buddy. You don't have backups!
38. Re:I am not rightly able to comprehend... by MichaelSmith · 2011-04-29 13:17 · Score: 1
  
  It's obvious they did no testing under realistic conditions.
  And how do you test under realistic conditions when those realistic conditions are an enormous, ~10 datacenter system that serves a good percentage of the internet?
  I work on air traffic control systems. In our environment the whole system, including the bit between the keyboard and the seat, is intensively exercised in realistic conditions. Simulation modes are built in. It's expensive but that's the way to deploy a complex system which works reliably.
  
  --
  http://michaelsmith.id.au
39. Re:I am not rightly able to comprehend... by LordLimecat · 2011-04-29 14:31 · Score: 1
  
  It doesn't take 6 figures, and if it does and isn't as good as rsync, you need to find a new line of work.
  Any backup solution should include tape, for a very simple reason-- you end up with multiple offline copies of data at about $25 per terabyte. Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.
  A good backup system isn't even that expensive; about $4000 will get you an LTO4 autoloader with a full complement of tapes, and another $500 gets you a good disk-based system.
40. Re:I am not rightly able to comprehend... by greenbird · 2011-04-29 15:45 · Score: 1
  
  Your backup solution sounds like it gets knocked out if someone introduces bogus data into your system; once the backup occurs, it overwrites all your good backups.
  Ummm...you need to look at what cp -al and rsync -H do. Let's just say you're completely and utterly wrong, to be polite about it. I have exact images of my systems at 4 hour intervals available instantly just by going to my backup drive. No need to go digging through tapes, spending hours looking through dumps and incrementals or trying to figure out if it's on an offsite tape. I suspect you're one of those people who set up the backups for one of the companies I've worked for. Tape is utterly useless in this day and age. The volume of data is just too great for any tape system to be manageable. Incrementals on tape are impossible to manage for more than a very small amount of data. Just the problem of consistently checking your tapes to ensure you got a good backup is outrageously expensive in man-hours.
  
  --
  Who is John Galt?
41. Re:I am not rightly able to comprehend... by LordLimecat · 2011-04-30 01:54 · Score: 1
  
  I have used (and still use) rsync. I missed the -H option in your post, but regardless, all of your history is in one place.
  The biggest problem I have seen with rsync is that in directory structures with tens of thousands of tiny files, it takes a VERY long time to search for changes, which can be a problem if you have, for example, database files which need to be taken out of use for backup (the particular database I'm dealing with doesn't have a "dump" command, as it, itself, is used for remote backup, and its files are gigantic 4GB encrypted files). It is possible that it is just this particular implementation, but I have seen this in several places where rsyncing these backup files takes hours even for 1GB or less of changes over a gigabit LAN.
  As for tape being useless, you can buy, right now, approximately 1.5TB of storage at $25 using LTO4 tapes. With that, you get the ability to easily remove and add storage units to your system without awful wonky drivers (ala REV), or dealing with possibly changing drive letters (under windows), or changing device identifiers (under linux-- though as I have only once dealt with tape vs external HDD on linux, there may be a way around this that I am unaware of). Durability-wise, HDDs can't hold a candle to tape; tape can be dropped, kicked around, and reused with no issues. Also, good luck rotating 24TB of removable storage; with a budget of $6500, I can easily do that -- a 16-slot LTO4 autoloader goes for around $4k, and with the remaining $2500 I can purchase 100 LTO4 tapes and rotate 5x 24TB sets a week if I desire, and store the other weeks in vaults.
  Claiming that "tape is obsolete" just makes you look ignorant. There are fortune 500 companies with $1mil tape libraries, and it's not because their CTOs are incompetent; it's because when dealing with absolutely monstrous quantities of vital information, you would be silly to rely on spinning platters. Additionally, I am not aware of an equivalent to WORM tapes in an HDD form; how do you propose to deal with archival needs?
  There is a reason that tape is used a lot-- it is cheap, it is very reliable, and it is trivial to swap out storage.
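The tape budget in this comment can be checked with a few lines of arithmetic. The figures below are taken from the comment itself (about 1.5TB per LTO4 tape, 100 tapes for $2500, 24TB per backup set); the variable names are made up for this sketch, and the autoloader cost is deliberately left out of the per-terabyte media figure.

```python
import math

# Figures quoted in the comment above; treat them as the poster's estimates.
TAPE_CAPACITY_TB = 1.5   # approximate capacity per LTO4 tape, as stated
TAPE_COUNT = 100         # tapes bought with the remaining budget
TAPE_BUDGET_USD = 2500   # money spent on tapes alone
SET_SIZE_TB = 24         # size of one full backup set

tapes_per_set = math.ceil(SET_SIZE_TB / TAPE_CAPACITY_TB)
full_sets = TAPE_COUNT // tapes_per_set
total_capacity_tb = TAPE_COUNT * TAPE_CAPACITY_TB
media_cost_per_tb = TAPE_BUDGET_USD / total_capacity_tb

print(f"tapes per 24TB set: {tapes_per_set}")
print(f"full sets from {TAPE_COUNT} tapes: {full_sets}")
print(f"total capacity: {total_capacity_tb:.0f} TB")
print(f"media cost: ${media_cost_per_tb:.2f} per TB")
```

One set works out to the same number of tapes as the 16 slots in the autoloader mentioned, and the tape count covers the five weekly sets described with some tapes to spare.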
42. Re:I am not rightly able to comprehend... by greenbird · 2011-05-02 12:43 · Score: 1
  
  all of your history is in one place.
  No, it's not. It's here and at my offsite location (girlfriend's house atm).
  
  Also, good luck rotating 24TB of removable storage; with a budget of $6500
  There's no rotation required. Once it's set up the only time you have to touch a drive is to replace a bad one. With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server. Not only that but I can have backup images at, say, 4 hour intervals instantly available at any given moment. No trying to get an offsite tape hoping they get the right one or haven't lost it or get the right tape and find it's not readable. All of which I've had happen, mind you.
  
  without awful wonky drivers (ala REV), or dealing with possibly changing drive letters (under windows), or changing device identifiers (under linux-- though as I have only once dealt with tape vs external HDD on linux, there may be a way around this that I am unaware of).
  
  None of this is an issue. Setting up hot-swappable SAS drives with RAID is trivial in both cost and time these days and has been for years (unless you're stupid enough to use windows for this kind of thing). With tape you have to constantly check each and every tape to ensure it's still readable. You have to worry about transporting tapes. With 24 TB you could spend days searching tapes looking for some specific files in a certain state.
  
  There are fortune 500 companies with $1mil tape libraries, and it's not because their CTOs are incompetent;
  I learned early in my career that going to tape was a terrifying experience. The odds of data loss at that point were always much higher than any level I would consider acceptable. No company I've ever worked for, and this includes at least one fortune 500 company at the time, had the time, money or procedures in place to ensure secure, good, reliable backups on tape.
  
  There is a reason that tape is used a lot-- it is cheap, it is very reliable, and it is trivial to swap out storage.
  Have you ever hot swapped a hard drive? What could be more trivial than that. And I know you can't be claiming that tape drives and, god forbid, tapes are more reliable than hard drives. From my experience tape is way too far towards the unreliable side of the scale for my comfort.
  
  Claiming that "tape is obsolete" just makes you look ignorant.
  Hmmm...interesting. You call someone ignorant yet you have no understanding or knowledge of (by your own admission in some areas even) what they're talking about. But it's much easier to just declare everyone else ignorant rather than actually making an effort and learning about new things that might upset your world view. I'll say it again. Tape is obsolete. There are MUCH faster, cheaper, less labor intensive, more reliable and more secure ways to do backups if you take the time to learn all the new tools and options that are available.
  
  --
  Who is John Galt?
43. Re:I am not rightly able to comprehend... by LordLimecat · 2011-05-03 18:54 · Score: 1
  
  No, it's not. It's here and at my offsite location (girlfriend's house atm).
  Which means you have 2 copies. The tape system I will be setting up this week (~$5000) will give me 20 copies, on LTO4 tapes. That's 20 entire(ish) backups of our network, with the ability to roll back to any date within the last 3 weeks (our supplemental backup system covers dates further back).
  
  With well less than half that I could build a backup server with 24TB. For that much I could have RAID 1 on the backup server.
  Have fun with your error and drive failure rates. 24 drives @ 2TB each (RAID1) means an awful lot of failures each year. That's also a LOT of arrays to present to your system-- in order to have any reasonable kind of data security, you'd need to do 12 separate RAID1 arrays.
  Additionally, 24 drives @ 2TB costs between $2400 and $4800, depending on whether you get the "enterprise" quality drives, and a RAID card capable of driving all of them another $1400. Server chassis with backplane and drive cages another $600 or so, and we haven't even gotten to the meat of the server yet (unless you want to drive the thing on an Atom platform). You might be able to do it for around $6k, I remember pricing something similar out a year ago, but all of your data is now on one system-- a fire would neatly wipe out 24TB worth of storage, not to mention the filesystem errors that (so I've heard) start to creep into systems that large.
  
  Not only that but I can have backup images at, say, 4 hour intervals instantly available at any given moment.
  That's not really what tape is for; we have a supplemental system (CrashPlan) for that, which IS disk based. Tape is for disaster recovery, not dealing with a trigger-happy accountant.
  
  With tape you have to constantly check each and every tape to ensure it's still readable. You have to worry about transporting tapes.
  Not much of an issue; LTO4 auto-verifies data, we have enough copies in a regular rotation that one failure is of minor concern, and we have backup software which alerts us to such failures anyways. Transport is of minor concern as well. Ideally, you're supposed to move tape into a fireproof safe, which would mean transport anyways.
  
  With 24 TB you could spend days searching tapes looking for some specific files in a certain state.
  We use a backup suite which keeps a catalog of the tapes, and the tapes are barcoded. We tell it we want to roll back to 4/20, for file /etc/sshd/sshd_config, and it says "please insert tape LTO40002A". I go to set A, and insert the tape. Most backup suites will do this.
  Worst case, if the catalogs were lost, I could insert the tapes for the proper week, and tell the drive to start inventorying them.
  
  I learned early in my career that going to tape was a terrifying experience. The odds of data loss at that point were always much higher than any level I would consider acceptable.
  Odd dataloss scenarios are a fact of life. I would much rather have a multitude of copies (ie, 10 or so) that I can roll back to, PARTICULARLY on windows based systems where backup is more complicated than just "cp -aH / /RaidBackupPartition".
  
  Have you ever hot swapped a hard drive? What could be more trivial than that.
  The problem is that ISN'T trivial. Drive letters can change unpredictably on Windows and Linux (identifiers, that is) depending on factors such as what other media is present (though GUIDs can alleviate that to some degree...) and in what order it was presented to the system. Hard drives also have a number of other problems, such as all being hooked to the same electrical source and in the same environment, and often from the same batch. Multiple drive failures are much more common than pure statistics would have you believe since they tend to experien
44. Re:I am not rightly able to comprehend... by LordLimecat · 2011-05-03 19:16 · Score: 1
  
  Sorry for double post, I also did want to point out that here
  
  Also, good luck rotating 24TB of removable storage; with a budget of $6500
  That $6500 budget went towards both the autoloader, and ~140TB of tape storage-- that is, 5 or so sets of ~24TB each (100 tapes).
  Tape is currently $16/TB or so. HDD platters, for the absolute cheapest deals (1TB drives), are around $40/TB-- 3x the price-- and require RAID cards to drive them, as well as hotswap cages. You will not be able to match tape prices for a long, long time, if ever.
Lost data? by DWMorse · 2011-04-28 18:28 · Score: 2

Was the lost data... all the stuff the PSN network lost? I think I see a connection!

--
There's a spot in User Info for World of Warcraft account names? Really?
Clouds are ephemeral by sincewhen · 2011-04-28 18:29 · Score: 1

Who knew?

--
-- Braden's law of data: All data spends some of its lifetime in an excel spreadsheet.
1. Re:Clouds are ephemeral by mini+me · 2011-04-28 18:40 · Score: 5, Informative
  
  Cloud applications hosted on Amazon survived this incident without issue, as expected. Only the regular old hosted applications had problems with the outage. They were never "the cloud" to begin with, so I'm not sure why the term even comes up in this discussion.
  The cloud represents a black box that hides the underlying network topology so that there are no single points of failure. Cloud applications are tolerant because they are spread through different datacenters across multiple points in the world. A catastrophe at one or more datacenters will have no noticeable effect on the availability of a cloud application because it continues to run in many more.
  Amazon offers a few cloud applications: S3 comes to mind. But Amazon's EC2/EBS hosting service is a plain old hosting service like any other. The EC2 topology is not hidden away from you. You have to make active decisions about where you want your EC2 instance to live. That goes against the idea of the cloud. What Amazon does offer in EC2 is the tools necessary for you to build a cloud application, but not everything hosted on EC2 is a cloud application by default.
2. Re:Clouds are ephemeral by The+End+Of+Days · 2011-04-28 19:17 · Score: 1
  
  When you put it like that, it's hard to just bash things.
  Stupid facts.
3. Re:Clouds are ephemeral by im_thatoneguy · 2011-04-28 19:36 · Score: 1
  
  You're right, by default EC2 isn't a cloud solution, but Amazon doesn't help alleviate that confusion (from their website for EC2):
  
  Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
  It doesn't help when the name of the service (EC2) has the word "Cloud" in it.
4. Re:Clouds are ephemeral by SuricouRaven · 2011-04-28 21:50 · Score: 1
  
  I still don't think there is any real definition for what 'cloud' means. As far as I can gather it's just fancy new marketing-speak for the very old idea of putting things on a server - the only difference is that with the cloud, you don't have to care about the server's physical location.
5. Re:Clouds are ephemeral by jimicus · 2011-04-28 22:35 · Score: 1
  
  That's exactly what it is.
  It's confusing a lot of people because something sold as a cloud application (ie. SaaS) may or may not be designed with HA in mind and the vendor likely won't tell you. If it is, and the underlying infrastructure is sound, you're probably OK. Hopefully.
  If it's not, it's not much different to an application running on some server in a co-lo somewhere, the only real difference is that you don't lease the server directly and you're not responsible for any backups.
  Then you've got virtual servers (which with modern hypervisors and a suitable SAN backend can in theory make an entire application HA regardless of whether or not that was part of the original design) which confuses the issue even further. Yes, they can in theory do all that but you often find that companies selling virtual servers haven't done it at all. All they've done is bought a few racks of servers and put together some fancy management software to make it easy to create and destroy virtual servers on the fly. The features like live migration are mostly not used at all, but you don't find that one out until your virtual server fails and their support team tell you they're "rebooting the physical server it lives on". These may also be sold as "cloud servers".
6. Re:Clouds are ephemeral by mini+me · 2011-04-29 04:13 · Score: 1
  
  In networking, the cloud has always represented an abstract network whose implementation details are unknown – it just magically works, thanks to the hard efforts of third parties. It is only lately that some marketing types want to exploit the term to make it mean something else.
What is S3? by badran · 2011-04-28 18:36 · Score: 5, Informative

EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.
1. Re:What is S3? by Big_Mamma · 2011-04-29 00:03 · Score: 1
  
  And that's exactly how EBS is supposed to be backed up - it saves snapshots all the time to S3. Small and cheap incremental backups stored to a 99.999999999% durable storage area. But apparently, Amazon messed up the backed up copies as well - instead of producing an outdated, but valid snapshot, they replied to affected customers with:
  
  A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful.
2. Re:What is S3? by Anonymous Coward · 2011-04-29 00:32 · Score: 1
  
  The quote is about the snapshots they took before they started recovery efforts. If you took snapshots of your EBS volumes regularly, these were not affected at all...
3. Re:What is S3? by Slashdot+Parent · 2011-04-29 06:58 · Score: 1
  
  EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.
  I don't think this is a fair criticism of people who lost data.
  S3 isn't designed as an online datastore for live applications. Sure, I can put any content in there that I want, but it can't be up-to-the-millisecond.
  AWS said to consider EBS volumes to be like hard disks, with a similar failure rate to hard disks. I forget the expected failure rate that they posted, but I think it was roughly between 1:100 and 1:1000 EBS volumes should be expected to fail each year. So go ahead and make your usual solutions with RAID arrays, DB write logs on a different volume, consistent snapshots stored on S3 for backup, etc.
  But last week's outage was way different. That was a failure of EBS the service, not an EBS volume. This turned the whole "EBS volume as a hard disk" paradigm on its ear. That shiny RAID array you've got? Dead. Those DB write logs? Dead. Those pristine consistent snapshots sitting safe and sound on S3? Sorry, you can't access those. Those EBS-backed virtual machines that your application runs on? Sorry, you can't access those, and you can't launch any new ones, either.
  So now you're left with offsite backups, which many users had, but are going to be out-of-date by nature. It didn't really matter if your offsite backups were in S3, on your local hard disk, or at some other online storage provider. Also, if your application was architected for EBS-backed instances, you couldn't launch new instances, anyway. Not without rearchitecting your application.
  So "sorry, you should have used S3 for your data" isn't really the answer. It's a little hard for your application to run with no access to CreateInstance or CreateVolume!
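  To put the per-volume numbers in perspective, here is a back-of-the-envelope sketch of the chance that at least one of n EBS volumes fails in a year. It uses the rough 1:100 and 1:1000 rates recalled above (not official figures), and it assumes volumes fail independently, which is exactly the assumption that a failure of the EBS service as a whole breaks. The function name is made up for illustration.

```python
def p_any_failure(afr, n_volumes):
    """Probability that at least one of n_volumes fails within a year.

    afr is the annual failure rate of a single volume (0.01 means 1:100).
    Assumes failures are independent, which a service-wide outage violates.
    """
    return 1.0 - (1.0 - afr) ** n_volumes


if __name__ == "__main__":
    for afr in (0.01, 0.001):
        for n in (1, 4, 20):
            print(f"AFR {afr:.3f}, {n:2d} volumes: {p_any_failure(afr, n):.4f}")
```

  Under independence, RAID and separate log volumes help a lot. When every volume in the region goes away at once, this model stops applying, and that is the whole problem.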
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Did this save Wikileaks? by kulnor · 2011-04-28 18:53 · Score: 2, Funny

Guess Wikileaks feels good about not being hosted there anymore.... their critical information could have been "lost" as well....
Availability zones by nereid666 · 2011-04-28 19:08 · Score: 1

What scares me more is that Amazon tells you they have multiple availability zones in each region, and recommends that you distribute replicated servers across these zones. For example, I have a project with the master database in one zone and the replica in another zone. Why did both zones fail? Are they not isolated/independent? Amazon charges you for data transfer between zones. As others say, servers fail; everyone should have had backups in another place (S3, or outside Amazon).

--
Damia
1. Re:Availability zones by Anonymous Coward · 2011-04-28 19:47 · Score: 1
  
  Availability zones are just one part of the picture for a good software design in data centers. While they are isolated from each other, they may still be in the same location (or within the same general area), and could more easily be hit with a network partition, floods, tornadoes, etc. Instead, as most big businesses know (like Netflix, which didn't suffer from the outage), you need regional separation in addition to availability zones. Amazon does provide this functionality, it just costs more, and most of the businesses that went out did not pay for regional separation.
2. Re:Availability zones by nereid666 · 2011-04-28 20:16 · Score: 2
  
  From: http://aws.amazon.com/es/ec2/
  Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.
  
  Rather than using a different region, I think it is better to have multiple cloud providers...
  
  --
  Damia
Re:The Cloud Is Dead by inputdev · 2011-04-28 19:51 · Score: 2

I think people miss the point of the cloud - saying the cloud is worthless because it "brings people that would otherwise have nothing against you trying to take down your server" is like saying that the internet is worthless because it opens up security risks.
I for one am glad to be connected, and obviously so are many others. Don't use services that aren't good for you - there are some cloud based services that are great, and some that aren't. It's pretty clear that in the future, things will be more connected, not less - adapt and take advantage of the good parts, the rest will fade anyway.
Re:The Cloud Is Dead by Hazel+Bergeron · 2011-04-28 20:20 · Score: 1

There's something simplistically technocratic about assuming that what is now is better than what has been.
Buy X! It's newer, thus better, than Y!
Because the economy's like a religion and set up so people lose their jobs and their homes if you don't needlessly produce and consume nothing of value.
..at my girlfriends house by Anonymous Coward · 2011-04-28 20:22 · Score: 1

What's a girlfriend?
1. Re:..at my girlfriends house by Anonymous Coward · 2011-04-28 23:59 · Score: 1
  
  "Wonder what capacities they come in."
  There's a lot of variance, and bigger isn't necessarily better. Most capacities are specified in the form "x-y-z w/ nX", where x, y, z, and n are numbers, and X is an alphabetic designation that may consist of multiple letters. Many people attracted to women prefer to maximize x and z (while still having them be nearly equal) while minimizing y and n, and want a designation of "C" or "D" used for X.
Post-mortem Amazon explanation by nereid666 · 2011-04-28 21:17 · Score: 5, Informative

Post-mortem Amazon explanation:
http://aws.amazon.com/message/65648/

--
Damia
any data loss can be bad to a Website operator. by jamesh · 2011-04-28 21:59 · Score: 1

any data loss can be bad to a Website operator.
any data loss is catastrophic, if it's your data. They claim "a small percentage" of data was lost... 1% is a small percentage... 10% is also a small percentage, but it's a huge amount of data.
Fortunately where I live and work there isn't really sufficient and reliable connectivity to "the cloud" to make it a worthwhile endeavor, so hopefully all the mistakes are learnt from before I have to worry about it.
1. Re:any data loss can be bad to a Website operator. by Rakishi · 2011-04-28 23:09 · Score: 1
  
  any data loss is catastrophic, if it's your data.
  No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.
  Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.
2. Re:any data loss can be bad to a Website operator. by Slashdot+Parent · 2011-04-29 07:03 · Score: 1
  
  No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.
  Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.
  Please explain to me how I can keep my data in S3 and/or offsite backups up-to-the-millisecond.
  I'll wait.
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
3. Re:any data loss can be bad to a Website operator. by Rakishi · 2011-04-29 07:44 · Score: 1
  
  And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?
4. Re:any data loss can be bad to a Website operator. by Slashdot+Parent · 2011-04-29 08:14 · Score: 1
  
  And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?
  Was that supposed to be an answer to my question? Is every application that fails to store all of its data immediately in S3 "clearly not compatible" with EC2?
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
5. Re:any data loss can be bad to a Website operator. by Rakishi · 2011-04-29 08:40 · Score: 1
  
  If losing the intermittent data is catastrophic then yes they're not compatible. Find a different solution.
  That said, database replication is a very old problem and solutions exist to that. Likewise, some applications simply don't suffer too much from losing a bit of data so the cost of that is low. Other applications have no data to lose since they're simply acting as data serving platforms. And that's all for web applications which aren't quite what EC2 was made for, after all it's called "elastic compute cloud" not "elastic web serving cloud."
6. Re:any data loss can be bad to a Website operator. by Slashdot+Parent · 2011-04-29 09:24 · Score: 1
  
  That said, database replication is a very old problem and solutions exist to that.
  Thank you for at least attempting to answer my question.
  While your answer does not involve the use of S3, it is exactly the answer that should have worked, but did not in the case of yesterday's outage. Replicate your database to a different availability zone. Great, except EBS failed region-wide, so your slave database just died, too.
  Not that I'm really even all that upset about the outage. My application was down for a few hours and degraded for a few more hours until I could replay the transactions that occurred after my 6am snapshots (app was in the bad zone), and it lost no data. The only point I'm trying to make is that "you should have used S3" is not the answer.
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Amazon is pretty up-front about expected data loss by molotov303 · 2011-04-28 22:19 · Score: 1

Unless you pay extra, they say you can expect to lose data stored in S3 on a regular basis. There's nothing wrong with that per se, but it's something you need to plan for.
S3:

Designed to provide 99.99% durability and 99.99% availability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.01% of objects.

http://aws.amazon.com/s3/
EBS:

...Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% - 0.5%, where failure refers to a complete loss of the volume.
http://aws.amazon.com/ebs/
Store a backup yourself by olau · 2011-04-28 22:35 · Score: 2

This is not the first time I've heard about a big hosting centre losing data even though it never happens, and they are keeping backups, etc.
If it's at all manageable, keep one copy safe at your own place in addition to the replication at the hosting centre. You can set up a cheap box at the office with a couple of terabytes disk space and suck down the data periodically with something like rsync and rdiff-backup. It's not a whole lot of work and can make the difference between having a big problem and total disaster.
It would help if hosting centres actually told you how exactly they store and backup your data and what they do in case of emergency instead of throwing meaningless phrases like "99.999% uptime!" and "fully redundant storage backbone!" at you. Fully redundant storage backbone is nothing if it means it's built with some big arse proprietary SAN stuff where the whole array goes down if the main controller goes down. Which it of course does because it's a flaky embedded thing with 2k memory that has to be programmed in assembler and C with dangling memory pointers all over the place.
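As a rough idea of what the "cheap box at the office" could run from cron, here is a minimal sketch that pulls a dated copy with rsync. It uses rsync's --link-dest option for cheap dated snapshots in place of rdiff-backup, so it is a swapped-in approach. The host, paths, and function names are hypothetical.

```python
import subprocess
from datetime import date
from pathlib import PurePosixPath


def build_rsync_command(remote, backup_root, today, previous=None):
    """Build an rsync command that pulls `remote` into a dated directory.

    remote      -- source such as "user@host:/var/www/" (hypothetical)
    backup_root -- local directory that holds one subdirectory per day
    previous    -- earlier snapshot directory; unchanged files are
                   hard-linked against it instead of being stored again
    """
    dest = PurePosixPath(backup_root) / today.isoformat()
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        cmd.append(f"--link-dest={previous}")
    cmd += [remote, f"{dest}/"]
    return cmd


def run_backup(remote, backup_root, previous=None):
    """Run today's pull; raises CalledProcessError if rsync fails."""
    cmd = build_rsync_command(remote, backup_root, date.today(), previous)
    subprocess.run(cmd, check=True)
```

Calling run_backup("user@host:/var/www/", "/backup", previous="/backup/2011-04-27") once a day would keep a browsable directory per day. Databases would still need a proper dump before the pull, since copying live data files is not guaranteed to be consistent.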
1. Re:Store a backup yourself by Junta · 2011-04-29 00:01 · Score: 1
  
  Very good advice. One issue is a *lot* of their users are commercial companies that viewed this as a way not to sweat the details at all. For many of those, if they have to sweat backup and all that, they might as well do the hosting themselves because the cost delta for them is not particularly large.
  
  --
  XML is like violence. If it doesn't solve the problem, use more.
2. Re:Store a backup yourself by dsouza42 · 2011-04-29 02:54 · Score: 1
  
  I don't know if it would help if they told you exactly how everything works. I'm sure any company with an infrastructure like Amazon's takes backups and safety very seriously. The availability numbers tell you what you can expect statistically from their services. The service that caused data loss is called EBS and according to Amazon: "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume". So if you have your data there you have to know that it can fail and it probably will fail eventually, so you're right, it's a really good idea to backup the data yourself.
  
  My business runs on Amazon's infrastructure and that was one of my main concerns before hiring their service. Because of this chance of failure I take hourly snapshots of my EBS volumes (which is enough for me.. I could even do it every 5 minutes) and copy the data back to my own servers periodically as you suggest. It's just common sense for anyone who deals with this type of thing. In my case, when the outage happened I just restored the latest snapshots and was up and running in a few minutes.
  
  Now even with "proprietary SAN stuff" and "flaky things with 2k memory programmed in assembler and C with dangling memory pointers all over the place" it's still many times more reliable and many times cheaper than hosting it myself. After moving to Amazon I greatly increased my uptime and reduced IT costs by 90%. That doesn't mean I trust they'll work 100% of the time, and that's why I do my due diligence and make backups.
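  For the curious, an hourly snapshot job can be very small. The following is only a sketch, not my actual script: it assumes you pass in an EC2 client object with a create_snapshot(VolumeId=..., Description=...) call that returns a dict containing "SnapshotId" (boto3's EC2 client has that shape), and the volume IDs and function name are placeholders.

```python
from datetime import datetime, timezone


def snapshot_volumes(ec2_client, volume_ids, now=None):
    """Request one snapshot per EBS volume and return the snapshot IDs.

    ec2_client -- assumed to expose create_snapshot(VolumeId, Description),
                  for example boto3.client("ec2")
    volume_ids -- placeholder IDs such as ["vol-aaa", "vol-bbb"]
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%MZ")
    snapshot_ids = []
    for volume_id in volume_ids:
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"hourly backup of {volume_id} at {stamp}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

  Run it from cron once an hour. Note that it only requests the snapshots; it does not wait for them to complete or quiesce the filesystem first, so a database volume needs extra care to get a consistent image.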
3. Re:Store a backup yourself by Slashdot+Parent · 2011-04-29 07:15 · Score: 1
  
  Be careful with your quoting from the middle of a sentence. When you quoted, "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume," that gave the impression the EBS snapshots had an AFR of 0.1% – 0.5%. Actually, EBS snapshots are stored on S3, so they have a durability rate of 99.999999999% each year.
  It's EBS volumes, themselves, that have 0.1% – 0.5% annual failure rates. I'm sure you already knew this, but others might not.
  Naturally, offsite backups are still a good idea, but if snapshots failed as often as implied, offsite backups would be a total necessity, rather than just a mere "good idea". :)
  
  --
  They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
Clarification by Mascot · 2011-04-28 22:40 · Score: 2

The durability you quote for S3 (99.99%) is for the reduced redundancy option. The standard storage lists 99.999999999% durability.
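To make the gap between the two tiers concrete, here is a small sketch of the expected number of objects lost per year at each quoted durability. The one-million-object count is an arbitrary assumption and the function name is made up; Decimal is used because 1 - 0.99999999999 is not exact in floating point.

```python
from decimal import Decimal


def expected_annual_loss(n_objects, durability):
    """Expected number of objects lost per year.

    durability is passed as a string such as "0.9999" so that Decimal
    keeps every digit of the eleven-nines figure.
    """
    return n_objects * (Decimal(1) - Decimal(durability))


if __name__ == "__main__":
    n = 1_000_000  # arbitrary example size
    print("reduced redundancy:", expected_annual_loss(n, "0.9999"))
    print("standard storage:  ", expected_annual_loss(n, "0.99999999999"))
```

These are averages from the design targets, not guarantees, and they say nothing about correlated failures.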
Re:The Cloud Is Dead by sarhjinian · 2011-04-29 01:47 · Score: 1

Yes, because building your own datacentre, or paying hosting fees to a five-nines-plus facility, costs nothing. Air conditioning, batteries, generators, fire suppression, multiple, redundant network connectivity: that stuff's all free. A mainframe solves it all!!
Look, a quality DC costs millions to build or tens of thousands to rent space in. Servers and mainframes cost money to manage, support and spare out. If you're starved for capital, why wouldn't you use EC2+EBS+S3 for a few bucks a month, rather than tie up dollars that could be spent on developers, marketing or suchlike in hardware and facilities that you're not really benefitting from? To build something like EC2 and the like is seriously expensive. Can your average startup with a server or two claim five nines? Really?
All these people who chant "Don't use the cloud, there could be an outage/breach!!" are just one screw-up away from the same, and it's often pure luck that they haven't been whacked yet.

--
--srj/mmv