Quickly Switching Your Servers to Backups?

Posted by Cliff on Tuesday May 8, 2007 @09:15AM from the fast-failover dept.

moogoogaipan writes "After a few days thinking about the quickest way to bring my website back to the internet users, I am still stuck at DNS. From experience, even if I set the TTL for my DNS zone file as low as 5 minutes, there are still DNS servers out there that won't update until a few days later (Yeah. I'm looking at you, AOL). Here is my situation. Say that I have my web servers and database servers at a remote backup location, ready to serve. If we get hit by an earthquake at our main location, what can I do in a few hours to get everyone to go to our backup location?"

BGP by ckdake · 2007-05-08 09:19 · Score: 5, Informative

Same Provider at both (N) locations, Same IPs for servers/services, Just don't advertise the prefixes via BGP from the backup location until the primary one goes down.
1. Re:BGP by georgewilliamherbert · 2007-05-08 10:46 · Score: 4, Informative
  
  Bingo.
  
  This is exactly what BGP (or OSPF feeding in to your providers' BGP) is made for.
  
  If you're big enough, you can get your own AS number and do this without having the same provider at each end (useful if the disaster that happens is that the software on all of provider X's core routers goes insane all at once, which happens from time to time).
  
  DNS just can't be assumed to fail fast enough for very high reliability services. You can do DNS right... low TTLs and all... and some providers just cache the results and do the wrong thing, and some client systems will never look up the changed data if the old IP stops responding, until someone reboots the browser or workstation.
  
  BGP.
2. Re:BGP by sjames · 2007-05-08 14:21 · Score: 2, Interesting
  
  In addition, make sure that your routers at both locations have a routable IP NOT in your IP block and use that as the source for your BGP sessions AND make sure you can log in remotely using that address (perhaps from a short list of external IPs you control). That way you can log in to each and make sure the route is actually announced by only one of them. Not all failover situations will completely take your network down (but unless you have a way to do it yourself, you'll sure wish it did).
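The "only one site announces the prefix" rule from this thread can be sketched as a small decision loop. This is a minimal illustration, not anyone's production setup: the `FailoverState` class, the `step` function, and both thresholds are hypothetical names and values, and the actual announce/withdraw would be carried out by your router or BGP daemon, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    failures: int = 0         # consecutive failed probes of the primary site
    successes: int = 0        # consecutive successful probes of the primary site
    announcing: bool = False  # True while the backup site advertises the prefix


def step(state, primary_ok, fail_threshold=3, recover_threshold=5):
    """Feed one probe result; return 'announce', 'withdraw', or None.

    The thresholds are assumptions: they add hysteresis so a single lost
    probe does not move the prefix back and forth between sites.
    """
    if primary_ok:
        state.successes += 1
        state.failures = 0
        if state.announcing and state.successes >= recover_threshold:
            state.announcing = False
            return "withdraw"
    else:
        state.failures += 1
        state.successes = 0
        if not state.announcing and state.failures >= fail_threshold:
            state.announcing = True
            return "announce"
    return None


if __name__ == "__main__":
    state = FailoverState()
    for ok in [False, False, False, True, True, True, True, True]:
        print(ok, step(state, ok))
```

Requiring more good probes to fail back than bad probes to fail over is a design choice, not a rule; the out-of-band login suggested above is still what lets a human confirm which site is really announcing.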
Server Clusters by Tuoqui · 2007-05-08 09:20 · Score: 2, Informative

NLB (Network Load Balancing) Cluster, link the two together and have them both serve the website. Not only will it not go down (barring freak accidents like both locations being hit at once) but it will also have the added benefit of presumably double the bandwidth and such.

Only problem is if you're locating them in two separate locations that they need to be able to communicate with each other and keep identical copies of the website and be able to connect to any databases you may need.

Basically, any server clustering setup, if you can connect the two sites remotely, would probably be a good starting point for your website, assuming it is so important that it never goes down.

--
09F911029D74E35BD84156C5635688C0
+2 Troll is Slashdot's way of saying groupthink is confused
1. Re:Server Clusters by Tackhead · 2007-05-08 09:32 · Score: 5, Funny
  
  > Only problem is if you're locating them in two separate locations that they need to be able to communicate with each other and keep identical copies of the website and be able to connect to any databases you may need.
  Depending on the industry, that's a very real problem.
  Sysadmin: "Don't worry, we're already switched over to the hot spare, just get out of there!"
  CIO: "What if the whole building goes?"
  Sysadmin: "No worries. Remember that $1M we spent stringing all that fiber over to the other datacenter?"
  CIO: "Oh yeah, the one in WTC 2!"
  Sysadmin: "Aaw, shit."
Get the ISP involved by MeanMF · 2007-05-08 09:21 · Score: 3, Insightful

Talk to your ISP. They can set it up so the IP addresses at the main location can be rerouted to the DR site almost instantly.
Scale back your expectations by eln · 2007-05-08 09:25 · Score: 5, Informative

You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake.

In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event. This is true even for business critical functions. Unless your business would cease to exist within 24 hours if your website went down, I would consider a 72 hour return to service to be perfectly adequate, and it would cost a whole lot less time and money to set up. Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center.
1. Re:Scale back your expectations by eln · 2007-05-08 10:17 · Score: 5, Insightful
  
  Ok sure, but if you're the kind of company that is making millions of dollars a minute through your website, you're paying qualified IT professionals to go out and spend a bundle of money developing an architecture that will allow full global load balancing with constant mirroring, probably with dedicated circuits between sites. You are definitely not posting Ask Slashdot articles about how to get around other ISPs' annoying habit of holding on to DNS records for too long.
2. Re:Scale back your expectations by CrankyOldBastard · 2007-05-08 11:02 · Score: 4, Insightful
  
  "You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake."
  
  I wish I still had mod points left. Many people overlook that you should always compare the cost of a disaster/break-in/breakdown to the cost of being prepared for it. I've seen situations where over $10,000 was spent on a bug that would have cost about $200 in downtime. Similarly, I've seen a few thousand dollars spent fixing a bug that affected one customer who was paying $15 per month.
3. Re:Scale back your expectations by autocracy · 2007-05-08 23:46 · Score: 2, Insightful
  
  A million dollars in revenue isn't what he meant, I think. A bank's revenue, for example, is much less than its transaction amount. The same goes for a stock exchange.
  
  Also, consider that $2 million a minute x 60 x 40 x 52 comes out to $249,600,000,000 a year. I bet Bank of America sees a billion dollars a day move. Peak transaction volume is often used when calculating potential loss, so it may only be $2 million / minute during the highest hour -- but that's always the hour you'll fail during ;)
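The arithmetic in this comment can be checked with a few lines of Python. This is a sketch under one assumption: that the leading 2 stands for $2 million per minute, multiplied over 60 minutes, a 40-hour week, and 52 weeks. The function names and the 5-minute outage are illustrative only.

```python
# Assumption: $2 million per minute, 60 minutes per hour,
# 40 business hours per week, 52 weeks per year.
DOLLARS_PER_MINUTE = 2_000_000
MINUTES_PER_HOUR = 60
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52


def yearly_volume(dollars_per_minute: int) -> int:
    """Dollar volume moved over one year of 40-hour weeks."""
    return dollars_per_minute * MINUTES_PER_HOUR * HOURS_PER_WEEK * WEEKS_PER_YEAR


def outage_exposure(dollars_per_minute: int, outage_minutes: int) -> int:
    """Volume that cannot move during an outage at the given (peak) rate."""
    return dollars_per_minute * outage_minutes


if __name__ == "__main__":
    print(f"{yearly_volume(DOLLARS_PER_MINUTE):,}")       # 249,600,000,000
    print(f"{outage_exposure(DOLLARS_PER_MINUTE, 5):,}")  # 10,000,000
```

Using the peak rate for `outage_exposure` follows the comment's point that potential loss is usually sized against the busiest hour.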
  
  --
  SIG: HUP
Um by Anonymous Coward · 2007-05-08 09:29 · Score: 2, Funny

You could hire an actual IT administrator who knows what they're doing? Like, one who's actually trained?
1. Re:Um by CRiMSON · 2007-05-08 09:50 · Score: 2, Funny
  
  But then it wouldn't be another half-assed implementation. Come on what were you thinking.
  
  --
  oogly boogly!
DNS failover by linuxwrangler · 2007-05-08 09:42 · Score: 3, Informative

We have used DNS failover from dnsmadeeasy.com for a couple years and have put it to the test a couple times. They have had perfect reliability and a low cost (typically well under $100/year).

The method is not perfect, but it is plenty good enough for our needs to protect against something that takes a datacenter down for a prolonged time (several minutes/hours/days). And the price

And to those who recommend avoiding "disaster prone" places: they all have people. People like the backhoe guy who took out the OC192 down the street. Or the core drillers who managed to punch both the primary and secondary optical links to a building of ours at a point where they were too close to each other.

You can roll your own by having a DNS server at each site and DNS 1 always issues IP of server 1 while DNS 2 always issues IP of server 2. But there are a number of issues like traffic hitting both sites at the same time. And you will have to detect more than just a down link so you will be scripting web test and DNS update systems. By the time you are done, you will have spent decades' worth of dnsmadeeasy fees.

Note: dnsmadeeasy isn't the only game in town. Just the one we happen to use.
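For anyone weighing the roll-your-own route, the "web test" half might look roughly like the sketch below. It assumes each site exposes an HTTP health URL; the function names, IPs, and URLs are placeholders, and the step that actually pushes the chosen address into your DNS server is left out because it depends entirely on which DNS software you run.

```python
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def pick_record(sites, is_healthy=http_ok):
    """Return the IP of the first healthy site.

    sites: list of (ip, health_url) tuples in priority order.
    If nothing is healthy, keep pointing at the first (primary) site
    rather than publishing no address at all.
    """
    for ip, url in sites:
        if is_healthy(url):
            return ip
    return sites[0][0]


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges.
    sites = [
        ("192.0.2.10", "http://primary.example/health"),
        ("198.51.100.10", "http://backup.example/health"),
    ]
    print(pick_record(sites))
```

A check like this only decides what your own name servers hand out; resolvers that ignore the TTL will keep serving the old answer regardless.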

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
1. Re:DNS failover by Joe5678 · 2007-05-08 10:27 · Score: 4, Informative
  
  dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.
  
  The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.
  
  As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.
2. Re:DNS failover by mother_reincarnated · 2007-05-09 01:25 · Score: 2, Informative
  
  Disclosure time: I deal with this stuff every day and I have a vested interest in commercial solutions to this issue.
  > dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.
  >
  > The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.
  >
  > As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.
  
  The problem is that this is a somewhat specious concern. It tends to be browser/OS combos that don't honor TTLs, and there are very few of those around anymore. In that case all the client has to do is close and reopen their browser.
  
  BGP is cumbersome but a valid solution (if you've got your own AS). You probably want to supplement it with a DNS-based solution though: far more of your outages will be less than whole-site events. The more viable (and likely) scenario to keep in mind is a partial outage at the primary DC, where you can use DNS to send 99% of the users of affected applications to the DR site, and the 1% who are non-RFC-compliant can be handled via more draconian methods (either 302s, L3 backhaul, or even shared L2 if you have it between sites). A properly designed solution will be able to do all of this for you and also know what is really available without manual intervention.
  
  I really want to address your comment about convincing people they "don't really need instant failover in the event of a major disaster." Well, they probably don't [need 100% of users to have instant failover during a major disaster]. Metrics get wrapped around BC for a good reason. What people often fail to grasp is that if your entire primary site goes 'poof', you, and likely most of the world, will not notice the time it takes either a DNS- or BGP-based solution to 'reconverge', and you certainly won't care if a fractional percent of your users need to close their browsers (btw, many users probably won't be able to get to your site because of route flapping mayhem et al. caused by said disaster...). Frankly, it won't be anyone's biggest concern at that point.
  
  Layered approaches are often needed; your particular requirements and budget determine how deep you go. A DNS-based first layer will do 99%+ of the job.
  
  [FYI for dealing with the bad boy superproxy types like AOL you could segment that traffic out and handle it differently than everyone who 'plays fair'... If you've got the money to play with the best of breed solutions then it's all been done before- but you probably wouldn't be asking /.]
Re:Location by bahwi · 2007-05-08 09:43 · Score: 2, Informative

I've got my personal/small business server down in Florida (it does nothing but serve a crappy webpage, so it's not critical). It's gone down because of router issues but never because of a hurricane, and oh yes, it's been hit. I actually think those places may be better, as they are built to weather those types of storms.

Yeah, I've heard of lots of people sweating and panicking because a backhoe was working somewhere near the datacenter. On site, with beads of sweat on their foreheads.
Oops by tedhiltonhead · 2007-05-08 09:59 · Score: 2, Funny

That was me... sorry... my bad. FSBs (Fiber-Seeking Backhoes) are tough to control.
Buy "Scalable Internet Architectures" by tedhiltonhead · 2007-05-08 10:01 · Score: 3, Informative

For a real answer, buy Theo Schlossnagle's book, "Scalable Internet Architectures". Theo presented a lengthy and highly-informative session at OSCON last year, and I subsequently bought and read his book. Worth every penny if you're professionally involved in providing reliable Internet services of any kind.
excellent point by swschrad · 2007-05-08 10:03 · Score: 2, Insightful

but wrong answer for it. the disaster plan should include backup for key people and assume responsibility for their dependents, so the key people give a schytte about what they're doing and have an out for the whole family from the (hopefully local) disaster.

it's incumbent on manglement to have useful plans, and you should help make what they have useful. shift the end focus and present it to them.

--
if this is supposed to be a new economy, how come they still want my old fashioned money?
Re:Fail the IP address across by QuantumRiff · 2007-05-08 10:30 · Score: 4, Insightful

you may find that people have other things on their minds when that amount of shit hits the fan.

Really interesting point that seems to be overlooked. The CEO is concerned about getting everything back up and running (since statistically they have no heart or pulse), but the employees are more concerned about finding family members in the wreckage of their house, cleanup, watching the kids cause schools are shut down, etc..

Whatever you do, ensure it is as automated as possible, and please, please, please don't forget to test. I've heard too many stories about everything looking okay until the emergency generator runs for several hours, vibrating a connection loose and causing it to shut down. It would pass the monthly test run, which was only 15 minutes long. "Hmm, power is out, and power poles are blown all over the streets, do I stay safely inside? Or do I brave a trip across town to try to flip a switch for my wonderful employer?"

--

What are we going to do tonight Brain?
Re:geographical load balancing by passthecrackpipe · 2007-05-08 11:25 · Score: 2, Informative

F5 mainly uses DNS for its Global Traffic Management solution. There are other bits and pieces, but that is the core, really.

--
People who think they know everything are a great annoyance to those of us who do.
Stochastic Resilience by DamonHD · 2007-05-08 12:08 · Score: 4, Interesting

Hi,

An alternative is to forget the all-or-nothing view. With some simple round-robin DNS and enough geographically separated servers for the DNS and HTTP/whatever, even if one is taken out by a quake or Act of Congress (ewwww, those nature programmes), *most* users will still get through just fine. Any clients/proxies that are smart and can try out multiple A records for one hostname will always get through if even one of your servers is reachable.

Example: my main UK server failed strangely yesterday morning, but only about 30% of my visitors can even have noticed, and the other servers worldwide took up some of the load.

Just simple and reliable and cheap round-robin DNS.

Rgds

Damon
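What a "smart" client does with several A records can be sketched in a few lines. This is an illustration only: `connect_any` is a made-up name, and the `resolve` and `connect` parameters exist just so the lookup and connection steps can be swapped out for testing.

```python
import socket


def connect_any(host, port, timeout=3.0,
                resolve=socket.getaddrinfo,
                connect=socket.create_connection):
    """Try every address the name resolves to; return the first connection.

    Raises the last OSError seen if no address accepts the connection.
    """
    last_error = None
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (ip, port) for both IPv4 and IPv6 results.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # this server is unreachable, try the next one
    raise last_error or OSError(f"no addresses found for {host}")
```

Python's own `socket.create_connection(("www.example.com", 80))` already walks the address list this way when handed a hostname; the loop is spelled out here only to show the behaviour round-robin DNS relies on. Clients that stop at the first address are the ones that notice the outage.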

--
http://m.earth.org.uk/
Try asking at webhostingtalk.com by Wabbit+Wabbit · 2007-05-08 12:24 · Score: 3, Informative

The industry pros discuss this sort of thing there all the time. The colocation sub-forum would be the best place to ask. I know that sounds odd, but that's the area on WHT where the best network/transit/BGP people hang out.

--
Nothing is inexplicable; only unexplained -Tom Baker, Doctor Who
Re:Location by walt-sjc · 2007-05-08 12:36 · Score: 2, Interesting

There is an AT&T data center in Virginia that was hit by a tornado. Our servers are there. In the part that got hit hardest, water was pouring in and down onto a few racks of servers. The servers were still up, but they powered down that section of the data center for safety reasons. Our servers were fortunate not to be affected, and AT&T kept them running throughout the whole ordeal (power grid was down too, so they were on generator for a couple days.) BTW, that was the "before SBC" AT&T.
Back in the day I used Exodus to do this by ejoe_mac · 2007-05-08 15:11 · Score: 3, Informative

When $ is no issue, a tier 1 colocation provider with their own services would be the best option. They've got big pipes, and will work with you to provide the additional services needed. I'd go as far as to say that you're going to want a failover script that they would follow in the event of site A going offline. You'd need redundant equipment, or a DR firm to get you back up.