Slashdot+Parent · Slashdot Mirror

Re:any data loss can be bad to a Website operator. on Amazon EC2 Crash Caused Data Loss · 2011-04-29 09:24 · Score: 1

That said, database replication is a very old problem and solutions exist to that.

Thank you for at least attempting to answer my question.

While your answer does not involve the use of S3, it is exactly the answer that should have worked, but did not in the case of yesterday's outage. Replicate your database to a different availability zone. Great, except EBS failed region-wide, so your slave database just died, too.

Not that I'm really even all that upset about the outage. My application was down for a few hours and degraded for a few more hours until I could replay the transactions that occurred after my 6am snapshots (app was in the bad zone), and it lost no data. The only point I'm trying to make is that "you should have used S3" is not the answer.

Re:any data loss can be bad to a Website operator. on Amazon EC2 Crash Caused Data Loss · 2011-04-29 08:14 · Score: 1

And who is forcing you to use EC2 and EBS? Is there a gun to your head? Why in god's name are you using an infrastructure that is clearly not compatible with your needs?

Was that supposed to be an answer to my question? Is every application that fails to store all of its data immediately in S3 "clearly not compatible" with EC2?

Re:Can we get this in non-Amazon speak on Amazon EC2 Failure Post-Mortem · 2011-04-29 07:42 · Score: 1

And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.

There are a couple of interesting tools to help you with this. One that's been around for a while, but is not maintained by Amazon, is ec2-consistent-snapshot. It is a tool that automates the process of quickly quiescing your database and filesystem, initiating a snapshot of your volume (or volumes, if you have a RAID array), and then restoring read/write access. It all happens very quickly, so the disruption to your application is short (although the I/O performance of your volume will suffer while the snapshot is in progress).
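For anyone who hasn't seen this kind of tool, the general freeze / snapshot / thaw ordering can be sketched roughly as below. This is not ec2-consistent-snapshot's actual code or interface; the three callables (`freeze`, `start_snapshot`, `thaw`) are hypothetical stand-ins for whatever locks the database and filesystem, starts the EBS snapshot, and releases the locks again.

```python
from typing import Callable, List


def consistent_snapshot(
    freeze: Callable[[], None],
    start_snapshot: Callable[[str], str],
    thaw: Callable[[], None],
    volume_ids: List[str],
) -> List[str]:
    """Quiesce writes, initiate one snapshot per volume, then resume writes.

    All three callables are hypothetical placeholders, not a real API.
    """
    snapshot_ids = []
    freeze()  # e.g. flush and lock the database, freeze the filesystem
    try:
        for volume_id in volume_ids:
            # Snapshots are only *initiated* inside the frozen window,
            # which is why the disruption to the application is short.
            snapshot_ids.append(start_snapshot(volume_id))
    finally:
        thaw()  # always restore read/write access, even if a snapshot call fails
    return snapshot_ids


if __name__ == "__main__":
    events = []
    ids = consistent_snapshot(
        freeze=lambda: events.append("freeze"),
        start_snapshot=lambda v: events.append("snap " + v) or "snap-of-" + v,
        thaw=lambda: events.append("thaw"),
        volume_ids=["vol-a", "vol-b"],  # made-up volume names
    )
    print(events)
    print(ids)
```

The `try`/`finally` is the part worth copying: whatever goes wrong while starting the snapshots, the database and filesystem should not be left frozen.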

I use ec2-consistent-snapshot for all of my EBS volumes and it made recovery from last week's outage pretty painless. It went something like this:
1. Application died at about 10am. Alarm.
2. I look at the instances. Can't log in.
3. I look at the AWS status monitor page. Find out EBS the service (as opposed to EBS volumes, 0.1% to 0.5% of which are expected to fail in any given year) has failed in all 4 us-east-1 availability zones. Find out my app is running in the AZ that has completely failed.
4. Curse.
5. Look at status of today's EBS snapshots, which begin at 6am. Some have completed, not all.
6. Wait for snapshots to complete. Happened at about 12pm.
7. Terminate all instances. None terminate. Curse.
8. AWS recovers 2 availability zones (I think this happened at about 1pm.)
9. Do some testing to figure out which of my zones are good.
10. Launch application from snapshots in a good AZ. App is up and running, but missing 4 hours' worth of transactions, by 1:30pm.
11. Wait for AWS to restore access to bad AZ.
12. Initiate snapshots of bad volumes.
13. Replay missing transactions from 6-10am.
14. That's it.
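
Step 13 boils down to picking out the logged transactions that fall between the last good snapshot and the failure, then applying them in order. A minimal sketch, assuming a hypothetical transaction log of (timestamp, payload) pairs; the steps above don't say how the transactions were actually recorded or replayed, and the date and payloads below are made up for illustration:

```python
from datetime import datetime
from typing import Callable, Iterable, Tuple

Transaction = Tuple[datetime, str]  # (commit time, payload); hypothetical shape


def replay_missing(
    log: Iterable[Transaction],
    snapshot_time: datetime,
    outage_time: datetime,
    apply: Callable[[str], None],
) -> int:
    """Apply, oldest first, every transaction committed after the snapshot
    and no later than the outage. Returns how many were replayed."""
    missing = sorted(
        (t for t in log if snapshot_time < t[0] <= outage_time),
        key=lambda t: t[0],
    )
    for _, payload in missing:
        apply(payload)
    return len(missing)


if __name__ == "__main__":
    day = datetime(2011, 4, 21)  # illustrative date
    log = [
        (day.replace(hour=5, minute=30), "already in snapshot"),
        (day.replace(hour=9, minute=15), "order B"),
        (day.replace(hour=7, minute=5), "order A"),
    ]
    replayed = []
    count = replay_missing(
        log, day.replace(hour=6), day.replace(hour=10), replayed.append
    )
    print(count, replayed)
```

Anything committed before the 6am snapshot is already in the restored volume, so it is skipped rather than applied twice.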

In all, it was a few hours of outage, and a bit of care not to lose any transactions. Not bad for AWS's biggest failure, to date.

Another interesting tool, which I have not used, is CloudFormation. It automates the task of provisioning your application's infrastructure. I'm not sure how complete it is, as I haven't looked into it. I already have scripts to provision my app's resources from before CloudFormation was created.

Re:Can we get this in non-Amazon speak on Amazon EC2 Failure Post-Mortem · 2011-04-29 07:22 · Score: 1

EBS which is your high IO filesystem

Damnit, now you owe me a new monitor.

Re:Store a backup yourself on Amazon EC2 Crash Caused Data Loss · 2011-04-29 07:15 · Score: 1

Be careful with your quoting from the middle of a sentence. When you quoted, "Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume," that gave the impression the EBS snapshots had an AFR of 0.1% – 0.5%. Actually, EBS snapshots are stored on S3, so they have a durability rate of 99.999999999% each year.

It's EBS volumes, themselves, that have 0.1% – 0.5% annual failure rates. I'm sure you already knew this, but others might not.
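
To put the volume figure in perspective, here is a rough calculation of the chance of losing at least one of several volumes in a year. It assumes volumes fail independently, which is a simplification, and the fleet size of 10 is a made-up example.

```python
def prob_any_failure(afr: float, volumes: int) -> float:
    """Chance that at least one of `volumes` fails in a year,
    assuming each fails independently with annual failure rate `afr`."""
    return 1.0 - (1.0 - afr) ** volumes


if __name__ == "__main__":
    for afr in (0.001, 0.005):  # the 0.1% and 0.5% bounds quoted above
        p = prob_any_failure(afr, volumes=10)  # 10 volumes is hypothetical
        print(f"AFR {afr:.1%}: {p:.2%} chance of losing at least one of 10 volumes")
```

Even at the high end that works out to a few percent per year for a small fleet, which is why the usual hard-disk precautions still apply to volumes.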

Naturally, offsite backups are still a good idea, but if snapshots failed as often as implied, offsite backups would be a total necessity, rather than just a mere "good idea". :)

Re:any data loss can be bad to a Website operator. on Amazon EC2 Crash Caused Data Loss · 2011-04-29 07:03 · Score: 1

No, it's only catastrophic if you're an idiot. Then again many website operators seem to be just that given how many need to use google cache to recover data after their web provider's server croaks.

Anyway, having your data in any single unreliable location is a recipe for disaster. And yes, with a 0.5-1% annual failure rate EBS is unreliable and no one claims otherwise. If you want reliable you use S3 and off-site backups.

Please explain to me how I can keep my data in S3 and/or offsite backups up-to-the-millisecond.

I'll wait.

Re:What is S3? on Amazon EC2 Crash Caused Data Loss · 2011-04-29 06:58 · Score: 1

EC2 is not meant to be used for data storage, that is what S3 is designed for. You store data and backups on S3, and use EC2 to serve high bandwidth websites to the masses.

I don't think this is a fair criticism of people who lost data.

S3 isn't designed as an online datastore for live applications. Sure, I can put any content in there that I want, but it can't be up-to-the-millisecond.

AWS said to consider EBS volumes to be like hard disks, with a similar failure rate to hard disks. I forget the expected failure rate that they posted, but I think it was roughly between 1:100 and 1:1000 EBS volumes should be expected to fail each year. So go ahead and make your usual solutions with RAID arrays, DB write logs on a different volume, consistent snapshots stored on S3 for backup, etc.

But last week's outage was way different. That was a failure of EBS the service, not an EBS volume. This turned the whole "EBS volume as a hard disk" paradigm on its ear. That shiny RAID array you've got? Dead. Those DB write logs? Dead. Those pristine consistent snapshots sitting safe and sound on S3? Sorry, you can't access those. Those EBS-backed virtual machines that your application runs on? Sorry, you can't access those, and you can't launch any new ones, either.

So now you're left with offsite backups, which many users had, but are going to be out-of-date by nature. It didn't really matter if your offsite backups were in S3, on your local hard disk, or at some other online storage provider. Also, if your application was architected for EBS-backed instances, you couldn't launch new instances, anyway. Not without rearchitecting your application.

So "sorry, you should have used S3 for your data" isn't really the answer. It's a little hard for your application to run with no access to RunInstances or CreateVolume!

Re:Chapter 1: on Book Review: Amazon SimpleDB Developer Guide · 2011-04-29 06:27 · Score: 1

Yeah, let's start this stupid argument about clouds again. Slashdot used to have knowledgeable persons commenting on stuff, not some idiots making remarks about things they know nothing about. Sigh.

He may have meant it as a joke, but the truth is no laughing matter. I use AWS extensively, so I know what I am talking about.

Amazon provides no way to do the following in SimpleDB:

Export a consistent copy of all of your data.
Import all of your data.
Backup your data.
Snapshot your data.
Point in time recovery of your data (i.e. I want to keep all of the data except the result of my update statement that lacked a where clause).

It's a shame, really, because SimpleDB is a great datastore for many use cases. But I cannot in good conscience ever recommend the service to a client due to its lack of a data backup facility.

Re:guilty eh? on Bizarre Porn Raid Underscores Wi-Fi Privacy Risks · 2011-04-25 04:47 · Score: 5, Insightful

Do you know how many cops are killed every year? 48 in 2009, over 3/4 of them at traffic stops (speeding, suspected drunk driving, etc.).

There are 800,000 law enforcement officers in the US, so we're talking about 0.006% here. Assuming the officer has a 10 mi. commute, he has a greater chance of getting killed on the way into the office or "home to their spouse's[sic] at the end of their shift" (0.007%).
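
The 0.006% figure follows directly from the two numbers above. A quick check (the 0.007% commute figure is not reproduced here, since its inputs aren't given):

```python
def annual_rate_percent(deaths: int, population: int) -> float:
    """Deaths per year as a percentage of the population."""
    return deaths / population * 100


if __name__ == "__main__":
    # 48 officers killed in 2009, out of roughly 800,000 officers
    print(f"{annual_rate_percent(48, 800_000):.3f}%")  # prints 0.006%
```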

Yawn.

Re:Emergency Plan on Major Outage At the Amazon Web Services · 2011-04-22 04:54 · Score: 1

If they should be most embarrassed of one thing, I'd say that not spinning Regions into their High Availability PR would be the biggest one.

I suppose that's true, but in the end, what's the difference between deploying your app across Regions and deploying your app across AWS's competitors? One of the biggest value-adds that AWS provides, as far as I'm concerned, is the ability to do scaling, HA, and DR with roughly zero effort.

Need more capacity? You can clone a running server from a consistent snapshot.

Need HA? Spin up the clone in a different Availability Zone.

Need DR? Take automated consistent snapshots of your running server at whatever interval is appropriate for your application.

So, sure. They could just update their best practices and say, "Yeah, you should really be multi-Region if you want to be HA". But that would leave a lot of implementation work to the customers. And if I have to implement it myself, do you really think my HA site will be hosted with Amazon? Hell, no. If I'm going to go to the trouble of custom-developing a solution, the HA site will be with a different provider since that even further reduces my risk.

This outage exposed a huge wart in EC2's AZ isolation. The root cause of this was a network fault that caused EBS to fail in 1 AZ. As far as I'm concerned, I'm totally cool with that happening. Stuff happens in the datacenter, and I get that. But when a network fault in one AZ takes out an entire Region, that shows that there is insufficient isolation between AZs, and I am definitely not cool with that.
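
The isolation argument can be made concrete with a back-of-the-envelope model. The 1% per-AZ fault probability below is purely illustrative, not an AWS figure, and both functions are toy models; the point is only how much the answer depends on whether a fault in one AZ stays in that AZ.

```python
def prob_all_down_isolated(p_az: float, zones: int) -> float:
    """All zones down at once when each zone fails independently."""
    return p_az ** zones


def prob_all_down_coupled(p_az: float, zones: int) -> float:
    """All zones down at once when a fault in any one zone takes the
    others with it: the Region then fails if *any* zone has a fault."""
    return 1.0 - (1.0 - p_az) ** zones


if __name__ == "__main__":
    p, n = 0.01, 2  # hypothetical: 1% chance of a fault per AZ, 2 AZs in use
    print(f"isolated AZs: {prob_all_down_isolated(p, n):.4%}")
    print(f"coupled AZs:  {prob_all_down_coupled(p, n):.4%}")
```

With properly isolated AZs, adding a second zone makes a full outage far less likely. With coupled AZs, adding a zone does not help at all in this model, because every extra zone is one more place for the shared failure to start.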

As far as I can see, a useful option for AWS (wholly aside from fixing their isolation architecture) would be to introduce the ability to copy an EBS snapshot from one Region to another. That would have helped out a lot of their customers who got caught by this, because while we were fully ready to launch our applications in us-west-1, we couldn't get our data out of us-east-1.

I was fairly close to restoring an older backup to us-west-1 anyway when AWS got an AZ online for EBS-backed instances. At that point, I was able to get back online without too much fuss.

Oh well, those are my thoughts. As a result of this, I may do more frequent off-site backups, but I still maintain that I shouldn't have to. I was using AWS infrastructure in the manner that they recommend, and yet I still suffered a 5-hour downtime.

Re:They're called "FAILOVER" zones for a reason... on Amazon Outage Shows Limits of Failover 'Zones' · 2011-04-22 04:09 · Score: 1

You can't hold amazon accountable for your own stupidity.

I'm pretty sure you don't really understand what happened.

First, they're called Availability Zones. Not to be pedantic, but I just want you to be able to have the correct terminology if you want to read up on this.

Secondly, a failure in one AZ took out an entire Region. This is NOT supposed to happen. Each AZ is supposed to be considered as a separate datacenter in your application (separate power source, separate facility, separate uplink, etc.). AZs are supposed to be isolated from failures in other AZs.

Like you, I have little sympathy for AWS customers who put all of their eggs in one VM. However, lots of AWS customers who did the Right Thing(TM) got hosed by this. You can't say that they suffered because of their own stupidity.

Re:6 weeks before the AWS summit 2011 on Major Outage At the Amazon Web Services · 2011-04-22 03:55 · Score: 1

I have machines in the affected zone, not a problem with them. If I had a problem it would have had no impact since I spun up a full standby deployment in the west coast data center.

Only EBS volumes and EBS-backed instances were affected.

Re:6 weeks before the AWS summit 2011 on Major Outage At the Amazon Web Services · 2011-04-22 03:54 · Score: 2

Only 1 region is affected. If your app was set to work with multiple zones then it likely wouldn't be impacted by this outage.

Not true. My application works just fine in multiple Availability Zones, yet it was knocked out yesterday due to an entire Region getting knocked offline.

And before you tell me that the application should have been multi-Region, I'm not buying it. AWS has always maintained that deploying an app across multiple AZs is HA. AZs are supposed to be considered as separate datacenters: separate power, separate uplink, etc. And yes, separate EBS infrastructure (you can't attach an EBS volume to an instance that was launched in a different AZ). Multi-Region is for geographic reasons (reduced latency, compliance with EU data laws, etc.) or Disaster Recovery.

In yesterday's case, a network hiccup triggered EBS to eat itself in one AZ. Fine, I'm totally cool with that. I understand that stuff happens. But that EBS failure bringing EBS down in all Availability Zones is something I am absolutely not cool with. The fact that it happened reveals a serious architectural flaw in the supposed isolation between AZs. Make no mistake about it, it is a huge egg on the face of AWS's EBS team.

Would making my app multi-Region have saved my bacon? Sure. And so would have deploying across multiple providers, etc. But the point is, I shouldn't have to do that. AWS told their customers that we don't have to do that. So as far as I'm concerned, nobody gets to say, "Well, you should have been multi-Region." That's just hindsight's 20/20 vision talking.

Personally, it didn't take much effort to get my app back online. Most of the effort was me trying to decide whether to wait it out or go into DR mode. Around the time I decided to go ahead and restore in us-west-1, AWS got EBS-backed instances working in an AZ, so I just relaunched in us-east-1. In all, I didn't lose much. But some people really got hosed by this, and I can't say I blame them for being upset. They did the Right Thing, and they still got hosed.

Re:Emergency Plan on Major Outage At the Amazon Web Services · 2011-04-22 03:36 · Score: 2

It's also worth pointing out that all cloud SLAs are basically useless: if Amazon falls below their advertised uptime they'll refund you some of your charges - but they'll never refund more than what you've paid them: they don't compensate you for all the money you're losing (and the AWS charges are likely pocket change compared to this)

FYI, I don't think this outage even falls under EC2's SLA. The Region was still technically on line. Only EBS was down.

Granted, many customers depend heavily on EBS, but the SLA doesn't cover an outage in just one specific EC2 feature. That being said, I wonder if AWS will honor SLA claims anyway, as a PR move. This outage is just so clearly Amazon's fault: a network hiccup causes EBS to overload in one Availability Zone, which cascades into all Availability Zones in the Region.

Personally, I think that they should honor SLA claims. But you're right, any money recovered would be chump change compared to the cost of the downtime.

Re:Emergency Plan on Major Outage At the Amazon Web Services · 2011-04-22 03:31 · Score: 1

Just using Amazon West as well as Amazon East would have saved customers from this outage.

This is hindsight talking.

AWS has always maintained that deploying an app across multiple Availability Zones was sufficient for High Availability. They introduced Regions for geographical reasons (reduced latency, compliance with EU data laws, etc.), not for HA reasons.

What happened yesterday should not have happened. It is one big, giant egg in the faces of AWS's EBS team.

Re:Emergency Plan on Major Outage At the Amazon Web Services · 2011-04-22 03:22 · Score: 1

Actually in the case of EC2 the smart thing would have been to have your instances spread over different availability zones...

This is exactly what AWS recommends, and this would not have saved you yesterday.

The reason why yesterday was such a Big Deal(TM) is that a software failure in one AZ took out an entire region. That is absolutely not how EC2 is supposed to work. Each AZ in a Region is supposed to function like a separate data center: independent power supply, uplink, etc. But in yesterday's outage (they're still having issues today, by the way), an entire Region failed, and it failed for reasons other than a huge natural disaster.

I've always been a big fan of AWS, and have used them for a long time. I will continue to do so, but make no mistake about it, yesterday's event is a colossal egg in the face of EC2's EBS team. Many AWS users are seriously bent out of shape over this, and I tend to agree with them. They architected their applications to failover to another AZ, just as AWS recommended. They did the right thing, yet they still got burned.

Re:tethered via adhoc wifi will do the job on The Tablet Debate: 3G Or Wi-Fi? · 2011-04-21 14:48 · Score: 1

IMO, CDMA's biggest disadvantage is that the US is the only country in the world that uses it. I've never had a problem roaming internationally with my GSM (and now UMTS) phones, and that's important to me. This is no reflection on the technical merits of CDMA, obviously. Just an observation.

You must travel internationally a lot to care that much.

Seems stupid to me to base a big (in terms of expense) decision on a seldom-used feature. Personally, I'd just get a $30 unlocked GSM phone to travel internationally with and call it a day.

Re:tethered via adhoc wifi will do the job on The Tablet Debate: 3G Or Wi-Fi? · 2011-04-21 14:45 · Score: 1

I don't think there's anything deceptive - there's nothing "superior" in a phone that's lacking a certain feature, despite the superiority of whatever protocol it's running.

Sure there is. Superior battery life due to not having to power two radios.

It's not about features, it's about end user experience. My end user experience sucks ass when my battery is dead.

Re:tethered via adhoc wifi will do the job on The Tablet Debate: 3G Or Wi-Fi? · 2011-04-21 14:44 · Score: 1

What is quite a lot more amusing is to watch someone claim a feature that is significantly less useful for the user is in any way a mark of "superiority". Just because it has a nicer technical design does not make it superior if the end result to the user is inferior.

I guess it all depends on how you define a good experience to the end user. Personally, I'd prefer longer battery life (one radio) to simultaneous voice and data (two radios).

I speak from experience. My current phone has 3 radios (CDMA, WiFi, and WiMAX (4G)), and while I hardly ever use voice and data simultaneously, I'd use a longer battery life every day.

So, yeah. This end user would gladly give up his simultaneous voice and data if it meant better battery life.

Re:Job Change on Promotion Or Job Change: Which Is the Best Way To Advance In IT? · 2011-04-20 06:26 · Score: 1

No offense, but an employee whose only job security is his ability to type "Uninformative error message #54122: Illegal operation at 0xFACB4464" into Google doesn't have much job security.

I know what you were getting at, but I think that a lot of people overestimate the damage that the loss of a single, key employee can do. Certainly in a well-managed department. One of my firm's clients lost two key people within the span of 1 month (for different reasons, by the way), but they were able to recover. It was expensive for them, but I'm OK with that! :)

Re:Job Change on Promotion Or Job Change: Which Is the Best Way To Advance In IT? · 2011-04-20 06:18 · Score: 1

True enough.

Oh well, at least you weren't hoping for a 'man kill' to be worked in there somewhere!

Re:Job Change on Promotion Or Job Change: Which Is the Best Way To Advance In IT? · 2011-04-20 05:15 · Score: 1

Getting married is plenty popular as a concept. There just isn't a clear man page on how to start the process.

That's preposterous. The following should pretty much talk you from the procurement stage, right on through midlife crisis:

man find
man nice
man tail
man get
man date
man chat
man talk
man sh
man talk
man sh
man talk
man sh

man touch
man unzip
man finger
man strip
man head
man mount
man make
man fsck
man login
man logout
man login
man logout
man more
man yes
man more
man yes
man more
man yes
man eject
man raw
man umount
man awk
man sleep

man get
man jobs
man fork
man wait
man yacc

man complain
man whoami
man look
man split
man dash
man get
man ex

Re:Just say on Michigan Police Could Search Cell Phones During Traffic Stops · 2011-04-20 03:21 · Score: 1

"Sorry officer, I don't have a cellphone"

"Sorry Ziwcam, you just committed a felony."

Re:Just say on Michigan Police Could Search Cell Phones During Traffic Stops · 2011-04-20 03:19 · Score: 2

IAAL.

Are you a criminal defense attorney that defends traffic violations? I'm going to guess that you are not, because most of what you wrote disagrees with the advice given by experts in this area of law.

Why advise a motorist to withhold consent to a search if the officer could just arrest him and perform the search anyway? Why would the officer even waste time requesting consent to search if it is not required? Why do many jurisdictions have a consent form for the motorist to sign that affirms his consent to the search?

If the police can show probable cause (e.g., if the officer detects the smell of marijuana emanating from the vehicle), no consent or warrant would be required. However, the article refers to traffic stops for minor moving violations (i.e., a Terry Stop. See Arizona v. Johnson, 129 S.Ct. 781 (2009), for those following along at home).

A bit of advice: When you identify yourself as an attorney, and then proceed to dispense advice, you might consider first familiarizing yourself with the area of law that you are discussing. In failing to do so, you run a considerable risk of looking like a bit of an ass.

Re:Job Change on Promotion Or Job Change: Which Is the Best Way To Advance In IT? · 2011-04-20 02:09 · Score: 2

Often this is not the case, but it'll still work against you. If you are good at what you currently do, management will always be reluctant to promote you. They'll prefer to leave well enough alone, and instead promote the guy with mediocre performance but strong communication skills; maybe he'll improve his performance in a management role. From your manager's perspective, it kind of makes sense to take a chance on promoting a non-performer or hire a new guy, rather than promote the guy who is already doing a good job.

I've found that good worker-bee skills don't always translate into good management skills. I can see why you'd see things the way that you do, but it's worth looking at it from the other side.

In no particular order, here are some qualities that I look for when I'm considering promoting an employee to a supervisory role:

Being a people person: good at communicating, and enjoying interaction with people. You can't manage a team with your nose buried in your monitor.
Caring about others and wanting to help them succeed. A huge part of a manager's job is developing the careers of his/her direct reports.
Ability to work toward a goal and drive chunks of work to completion.
Estimation skills. If you can't estimate your own work, how will you estimate a team's work?
Organization. Can you keep track of what an entire team is supposed to be doing?

That's what's coming to mind right now. It might be worth doing a little research on your own on what the experts say are good management skills in your field. Those are the skills you'll want to develop if you're interested in developing your career down a management track.

User: Slashdot+Parent

Comments · 3,032