Quickly Switching Your Servers to Backups?

Posted by Cliff on Tuesday May 8, 2007 @09:15AM from the fast-failover dept.

moogoogaipan writes "After a few days thinking about the quickest way to bring my website back to internet users, I am still stuck at DNS. From experience, even if I set the TTL for my DNS zone file as low as 5 minutes, there are still DNS servers out there that won't update until a few days later (Yeah. I'm looking at you, AOL). Here is my situation. Say that I have my web servers and database servers at a remote backup location, ready to serve. If we get hit by an earthquake at our main location, what can I do in a few hours to get everyone to go to our backup location?"

BGP by ckdake · 2007-05-08 09:19 · Score: 5, Informative

Same provider at both (N) locations, same IPs for servers/services; just don't advertise the prefixes via BGP from the backup location until the primary one goes down.
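
The announce/withdraw rule described above can be sketched as a small decision function. This is only an illustration of the policy, not router configuration: the `Site` class, the `reconcile` function, and the idea of an external health signal (`primary_healthy`) are all assumptions made for the example, and the actual announce and withdraw steps would be carried out by whatever BGP speaker each site runs.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    is_primary: bool
    announcing: bool = False  # whether this site currently advertises the prefix


def reconcile(sites, primary_healthy):
    """Return the (site name, action) pairs needed to apply the failover policy.

    Hypothetical policy: the primary announces while it is healthy; the backup
    announces only once the primary has been declared down.
    """
    actions = []
    for site in sites:
        should_announce = primary_healthy if site.is_primary else not primary_healthy
        if should_announce and not site.announcing:
            actions.append((site.name, "announce"))
        elif not should_announce and site.announcing:
            actions.append((site.name, "withdraw"))
        site.announcing = should_announce
    return actions


sites = [Site("primary", True, announcing=True), Site("backup", False)]
print(reconcile(sites, primary_healthy=True))   # nothing to change
print(reconcile(sites, primary_healthy=False))  # primary withdraws, backup announces
```

In a real outage the primary's "withdraw" usually happens on its own when its BGP sessions drop; the part that needs automation or a human decision is the backup's "announce".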
1. Re:BGP by georgewilliamherbert · 2007-05-08 10:46 · Score: 4, Informative
  
  Bingo.
  
  This is exactly what BGP (or OSPF feeding into your providers' BGP) is made for.
  
  If you're big enough, you can get your own AS number and do this without having the same provider at each end (useful if the disaster that happens is that the software on all of provider X's core routers goes insane all at once, which happens from time to time).
  
  DNS just can't be assumed to fail fast enough for very high reliability services. You can do DNS right... low TTLs and all... and some providers just cache the results and do the wrong thing, and some client systems will never look up the changed data if the old IP stops responding, until someone reboots the browser or workstation.
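
The caching problem in the paragraph above can be illustrated with a toy resolver that enforces its own minimum TTL. Everything here is assumed for the sake of the example: the `CachingResolver` class, the two-day clamp, and the documentation-range IP addresses. It is a sketch of the behaviour, not a model of any particular provider's resolver.

```python
class CachingResolver:
    """Toy caching resolver that may ignore low TTLs by enforcing a minimum."""

    def __init__(self, authoritative, min_ttl=0):
        self.authoritative = authoritative  # callable returning (ip, ttl_seconds)
        self.min_ttl = min_ttl              # hypothetical resolver-side clamp
        self._cached_ip = None
        self._expires_at = 0.0

    def resolve(self, now):
        # Only ask the authoritative server when the cache is empty or expired.
        if self._cached_ip is None or now >= self._expires_at:
            ip, ttl = self.authoritative()
            self._cached_ip = ip
            self._expires_at = now + max(ttl, self.min_ttl)
        return self._cached_ip


zone = {"ip": "192.0.2.10", "ttl": 300}  # primary site, 5-minute TTL


def authoritative():
    return zone["ip"], zone["ttl"]


well_behaved = CachingResolver(authoritative)
clamping = CachingResolver(authoritative, min_ttl=2 * 24 * 60 * 60)  # two days

# Both resolvers cache the primary address at t=0.
well_behaved.resolve(now=0)
clamping.resolve(now=0)

# The primary site fails and the zone is switched to the backup address.
zone["ip"] = "198.51.100.10"

print(well_behaved.resolve(now=301))  # TTL expired, sees the backup address
print(clamping.resolve(now=301))      # still serves the dead primary address
```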
  
  BGP.
Scale back your expectations by eln · 2007-05-08 09:25 · Score: 5, Informative

You could spend a bundle of money doing global load balancing and maintaining a full hot spare site, or you could figure out how critical it really is that your website be up within 5 minutes of some major disaster like an earthquake.

In the event of a major disaster, the need for "immediate" recovery is actually defined as being able to be back up and running within 24 hours of the event. This is true even for business critical functions. Unless your business would cease to exist within 24 hours if your website went down, I would consider a 72 hour return to service to be perfectly adequate, and it would cost a whole lot less time and money to set up. Keep in mind that we are talking about an eventuality that would only occur if your primary site was entirely disabled for an extended period of time, which is highly unlikely to happen if you're hosted in any kind of modern data center.
DNS failover by linuxwrangler · 2007-05-08 09:42 · Score: 3, Informative

We have used DNS failover from dnsmadeeasy.com for a couple years and have put it to the test a couple times. They have had perfect reliability and a low cost (typically well under $100/year).

The method is not perfect, but it is plenty good enough for our needs to protect against something that takes a datacenter down for a prolonged time (several minutes/hours/days). And the price

And to those who recommend avoiding "disaster prone" places: they all have people. People like the backhoe guy who took out the OC192 down the street. Or the core drillers who managed to punch both the primary and secondary optical links to a building of ours at a point where they were too close to each other.

You can roll your own by having a DNS server at each site and DNS 1 always issues IP of server 1 while DNS 2 always issues IP of server 2. But there are a number of issues like traffic hitting both sites at the same time. And you will have to detect more than just a down link so you will be scripting web test and DNS update systems. By the time you are done, you will have spent decades' worth of dnsmadeeasy fees.
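
For a sense of what the roll-your-own route involves, here is a minimal sketch of the "web test" half: probe the primary site over HTTP and only switch the published address after several consecutive failures. The names `http_probe` and `FailoverMonitor`, the threshold of three failures, and the placeholder IPs are all assumptions for illustration. The step that actually pushes the new record to a DNS server is left out because it depends entirely on the DNS software in use.

```python
import http.client
import urllib.request

PRIMARY_IP = "192.0.2.10"     # documentation-range placeholders
BACKUP_IP = "198.51.100.10"


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


class FailoverMonitor:
    """Decide which IP to publish, tolerating a few isolated probe failures."""

    def __init__(self, probe, failures_to_trip=3):
        self.probe = probe                    # callable returning True/False
        self.failures_to_trip = failures_to_trip
        self.consecutive_failures = 0

    def check(self):
        """Run one probe and return the IP that should currently be published."""
        if self.probe():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_to_trip:
            return BACKUP_IP
        return PRIMARY_IP


# Demonstration with a scripted probe instead of real network traffic.
results = iter([True, False, False, False, True])
monitor = FailoverMonitor(probe=lambda: next(results))
print([monitor.check() for _ in range(5)])
```

In practice the probe would be something like `lambda: http_probe("http://www.example.com/health")` run on a timer. Note that this sketch fails back to the primary as soon as one probe succeeds, which may not be what you want after a real outage.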

Note: dnsmadeeasy isn't the only game in town. Just the one we happen to use.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
1. Re:DNS failover by Joe5678 · 2007-05-08 10:27 · Score: 4, Informative
  
  dnsmadeeasy doesn't solve the problem the OP is asking about. They simply monitor your services and start serving a different DNS record if your primary is down.
  
  The OP is concerned with all the DNS servers that aren't yours that would then have a cached version already, and continue to serve up the dead DNS record until their (incorrectly configured) TTL expired.
  
  As another poster already mentioned, BGP is really the only technical solution to this problem. All other "solutions" are going to be convincing people that they don't really need instant failover in the event of a major disaster.
Buy "Scalable Internet Architectures" by tedhiltonhead · 2007-05-08 10:01 · Score: 3, Informative

For a real answer, buy Theo Schlossnagle's book, "Scalable Internet Architectures". Theo presented a lengthy and highly informative session at OSCON last year, and I subsequently bought and read his book. Worth every penny if you're professionally involved in providing reliable Internet services of any kind.
Try asking at webhostingtalk.com by Wabbit+Wabbit · 2007-05-08 12:24 · Score: 3, Informative

The industry pros discuss this sort of thing there all the time. The colocation sub-forum would be the best place to ask. I know that sounds odd, but that's the area on WHT where the best network/transit/BGP people hang out.

--
Nothing is inexplicable; only unexplained -Tom Baker, Doctor Who
Back in the day I used Exodus to do this by ejoe_mac · 2007-05-08 15:11 · Score: 3, Informative

When $ is no issue, a tier 1 colocation provider with their own services would be the best option. They've got big pipes, and will work with you to have the additional services needed. I'd go as far as to say that you're going to want to have a failover script that they would follow in the event of site A going offline. You'd need redundant equipment, or use a DR firm for getting back up.