Tips for Increasing Server Availability?
uptime asks: "I've got a friend that needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important, but more difficult to accomplish, it seems. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
If you are moving to a level where you need uptime but can't dedicate more resources to overseeing it, you may want to consider a hosted solution. They host, monitor, upgrade, and do checkups (YMMV with whom you choose).
If that isn't something you want to venture down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
If you have a service that must be highly available, cluster or load balance the service. Use more than 1 box and either cluster them or load balance them.
RAID, ECC RAM, team NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.
P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful.
Don't believe anything I say. I crash test crack pipes for a living.
You are hosting this on a 56K dial-up in your root cellar?
Your apps need to run on Microsoft Windows or HP-UX or...?
You've got a SAN or local disk or...?
You're using home-built white-box x86s or Sun E15000s or...?
You have sysadmin talent on hand? You're outsourced to IBM global services?
Who vets these silly questions? Oh, I forgot - the "Editors".
Advice: on VPS providers
Just post a link to this server of his - we'll gladly stress-test it for him at no charge. :)
From the internet to the resources you are trying to provide, identify every point of failure: power outage, uplink, router, switch, server, etc. Prioritize - which are the most likely to fail? Budget - which ones can we afford to duplicate, and which are cost-effective to duplicate? Clustering may be an option, but might be too expensive. What about a cold or hot standby? Either reduces downtime overall. You can find relatively inexpensive UPSs just about everywhere. Making your entire network redundant can take a lot of time and money.
I think sysadmins would respond very nicely to tips for increasing server availability. Let's say the average tip is about 15%, and for simplicity's sake we'll say that every hour of downtime costs about $10,000. Baseline service should be set either by an SLA or a measured baseline. For this we'll say 99% uptime per month. That allows for 7.2 hours of unscheduled downtime in a 30-day month. So if the server is up for 100% of that time, then you'd want to tip your sysadmin about $10,800 per server (assuming the #s stay the same) for a month of great work.
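A quick sketch of that arithmetic, assuming a 30-day (720-hour) month. The function names are made up for illustration; the 99%, $10,000 and 15% figures are the ones from the comment.

```python
# Illustrative only: names are hypothetical, figures come from the comment.
HOURS_PER_MONTH = 30 * 24  # assumes a 30-day month, i.e. 720 hours


def allowed_downtime_hours(sla_uptime, hours=HOURS_PER_MONTH):
    """Hours of downtime an uptime SLA permits in one month."""
    return (1.0 - sla_uptime) * hours


def sysadmin_tip(sla_uptime, cost_per_hour, tip_rate):
    """Tip on the downtime cost avoided by running at 100% instead of the SLA."""
    avoided_cost = allowed_downtime_hours(sla_uptime) * cost_per_hour
    return avoided_cost * tip_rate


downtime = allowed_downtime_hours(0.99)
tip = sysadmin_tip(0.99, 10_000, 0.15)
print(f"{downtime:.1f} hours allowed, tip ${tip:,.0f}")
```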
That is all...
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
You will find that "availability" is a vague term. First you need to have a discussion to determine what availability means. It must be put in measurable, non-vague terms. "99% uptime" is not a good definition. "The system must handle 99.7% of requests in 30 milliseconds or less" is much better, in part because it includes a performance expectation. It also recognizes that not every request will receive the desired level of response. Additionally, if you determine that you want N+1 redundancy then you need to know the appropriate value of N (how many servers are needed to provide your required response times).
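One way such a target could be checked against measured latencies. This is only a sketch: the function name, its parameters and the sample data are hypothetical, and the 30 ms / 99.7% defaults are just the example figures from the comment.

```python
def meets_latency_target(latencies_ms, threshold_ms=30.0, required_fraction=0.997):
    """Return (ok, fraction), where fraction is the share of requests
    answered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests measured")
    within = sum(1 for t in latencies_ms if t <= threshold_ms)
    fraction = within / len(latencies_ms)
    return fraction >= required_fraction, fraction


# Made-up sample: 1000 requests, 4 of them slower than 30 ms.
sample = [12.0] * 996 + [45.0] * 4
ok, fraction = meets_latency_target(sample)
print(ok, fraction)
```

A plain uptime percentage would call this system "up" the whole time; the request-based target is what catches the slow responses.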
You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak-load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague-of-locusts destroys our primary site? Cough up the dough.
Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.
Once you know the requirements, list everything that may impact your availability including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server - any database, app-server, dns-server, load-balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.
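To see why every piece has to be on the list: if each piece must be up for the site to work, and failures are assumed to be independent, the availabilities multiply. A rough sketch with invented numbers (none of these figures are measurements):

```python
from math import prod


def chain_availability(components):
    """Availability of a chain in which every component must be up.
    Assumes component failures are independent of each other."""
    return prod(components.values())


# Hypothetical per-component availabilities, for illustration only.
stack = {
    "internet uplink": 0.999,
    "switch": 0.9995,
    "load balancer": 0.9995,
    "web server": 0.999,
    "database": 0.999,
    "dns": 0.9999,
}

total = chain_availability(stack)
weakest = min(stack, key=stack.get)  # first component with the lowest figure
print(f"whole chain: {total:.4f}, weakest link: {weakest}")
```

The chain always comes out worse than its weakest piece, which is why the database and DNS matter as much as the web server.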
With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives or hot-swap drives with warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours or do you need people on-site 24/7?
A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC-192 the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or large natural disasters. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
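A back-of-the-envelope sketch of that last point, assuming the sites fail independently and ignoring the few minutes a DNS failover takes. The 99% figure is invented for illustration.

```python
def combined_availability(site_availability, sites=2):
    """Chance that at least one of `sites` identical sites is up.
    Assumes site failures are independent and that failover itself works."""
    return 1.0 - (1.0 - site_availability) ** sites


one_site = 0.99
two_sites = combined_availability(one_site, sites=2)
print(f"one site: {one_site:.2%}, two sites: {two_sites:.2%}")
```

Two distant sites are the easiest way to make the independence assumption roughly true; two racks on the same fiber are not independent.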
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
Hmm, two RAID drives failed simultaneously? This is possible but so unlikely, it isn't worth mentioning. Either the equipment he (Freudian You) is using is utter crap or they didn't really fail at the same time. Most likely one failed and no one noticed until the second failed. Or possibly He/You were using software RAID in which case your OS or you failed and caused the apparent drive failure. In any of these cases the real cause of the failure is him/you!
Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP Proliant servers. Sure, there will be flames for this recommendation but, think about it. There is a reason that HP Proliant servers are the ONLY choice in almost every Fortune 1000 company. Regardless of some anecdotal whining about poor support that is sure to follow my post, HP Proliants ARE that good!
Use hot-pluggable SCSI drives attached to a battery backed-up RAID controller and a hot-spare drive. Do NOT use IDE and you might even want to forgo SATA though they are a possibility. Use high quality ECC memory, dual processors, redundant power supplies and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only let you know, via alarms or alerts or pagers or email, before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, or even restart Apache for you should it fail.
Now this is all well and good for greatly reducing downtime to almost none but, it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load-balancer or implement a cluster. If you're super paranoid, both. Naturally you need to also have redundant power sources and network connectivity with all this so that you do not have a single point of failure ANYWHERE.
Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try and use white-box desktop hardware you've already failed it!
The odds of two drives failing simultaneously are extremely low given the MTBF of modern drives. You have a better chance of a power supply or fan failure.
On that basis I am going to make some wild assed guesses that are more probable given the little information we have.
1) the drives were consumer models from the same production lot,
2) the death of the first drive was not immediately noticed,
3) compatible replacement drives are not easy to come by (no hot spare),
4) the second drive died before the first one was replaced,
5) the server did not have hot swap drive carriers,
6) someone tried to replace the dead drive in the running chassis.
If you don't like my guesses, provide your own.
Preventing downtime is an expensive, time consuming exercise, with few limits.
Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.
Specifically, you need to make everything in your system redundant. The web servers need to be redundant, and you need redundant copies of the data. The paths to the internet need to be redundant, and the environment should be remote and redundant.
Once you get a handle on your environment, you should consider some sort of clustering technology for server duality. I suggest you read "In Search of Clusters" by Gregory F. Pfister to get a fundamental understanding of the technology.
As was posted earlier, you might just want to throw in the towel and accept web hosting. Use the Uptime Institute specifications against the ISP's service level agreement.
You might also consider a local ISP co-lo and do your own remote clustering.
machinator omnis sine licentia
Could you negotiate my contract for me? Tips or paid bonuses for uptime... I love you, man!
Use a service monitor, such as Hobbit or Big Brother, to monitor the services. If a service fails once, have it auto-restarted. If it fails repeatedly, have the monitor reboot the box automatically. If the server keeps crashing, or if recovery locks up, then have it notify you to intervene.
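The escalation described above could be sketched roughly like this. The function, thresholds and action names are hypothetical and are not Hobbit or Big Brother configuration; a real monitor would be wired up through its own alert scripts.

```python
def next_action(consecutive_failures, reboots_tried, restart_limit=1, reboot_limit=1):
    """Pick a recovery step for a failing service.
    The limits are illustrative defaults, not settings of any real monitor."""
    if consecutive_failures == 0:
        return "ok"
    if consecutive_failures <= restart_limit:
        return "restart_service"  # first failure: just restart the service
    if reboots_tried < reboot_limit:
        return "reboot_box"       # repeated failures: reboot the machine
    return "notify_admin"         # still failing after a reboot: page a human


print(next_action(1, 0))
print(next_action(2, 0))
print(next_action(3, 1))
```

Capping the automatic steps matters: a box that reboots itself in a loop is harder to diagnose than one that gives up and pages someone.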
If you need guaranteed 100% uptime -and- the software is more reliable than the hardware (for whatever reason), your best bet is to run two boxes in parallel and have BOTH serve all requests, but have your router filter out the latter of any two identical packets. Then, if a box crashes, the other already has the connection running and no fail-over is required. When the crashed box is restored, you then have to replicate the state, but you can afford to take more time over that, as the user won't be aware of it.
If you need even higher levels of availability, then you'd want to move the disks onto a SAN and mirror the disk access - this time, filtering out duplicates in both directions. That way, either computer can crash AND either disk RAID can crash, and you STILL have a functioning system.
You can keep parallelizing, adding redundancy (such as mixing hot-swap and cold-standby), etc, as far as you like for the reliability you require. Need better WAN network reliability? Get two providers and set up dynamic routing over BGP. Let BGP take care of monitoring connectivity and dropping routes that aren't working. That's what routers are designed to do. Ideally, you'd co-locate, have backup IP providers for each site, then replicate transactions on write. If a site totally crashes, you can't avoid the time to fail over, but you can keep it to an absolute minimum.
If the LAN is potentially a weak point, then have each server with a line to each WAN router. Have an independent cable running between servers, and use it specifically for replicating states when a server is restarted.
These options range from costing virtually nothing (Big Brother is free) to tens of thousands of dollars or more, depending on the scale of redundancy you want, and give you from a few seconds of downtime perhaps every few months through to no user-visible downtime ever (short of a nuclear attack on all locations you co-locate between).
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Put a server in San Francisco, put a server in Massachusetts, put a server in Florida or Texas, put a server in Chicago, put a server in New York, and then redundant-cluster them all with something like mod_globule or rsync scripting mojo. Problem solved.
http://tinyurl.com/4ny52
Drives die. Fans die. Power supplies die. Motherboards die. Having a RAID array is not enough. There are plenty of other things in a single system that can go wrong that can take the system down for a period of time. The biggest issue I have with the limited description here is the fact you are talking about one system. If you want availability, you need to be looking at scaling by the machine.
Now you can just have other machines waiting to take the load with a quick reconfig, or you can start doing things automatically (people have mentioned using things like Nagios for monitoring, but monitoring doesn't give you uptime . . . it gives you response to minimize downtime . . .)
If you want to look at solutions that don't cost, check out LVS. It'll allow you to balance your ports across multiple systems (you can even balance win32 and linux systems if you for some reason wanted to) with a couple different methods (I prefer the DR method myself). The setup isn't that bad with several of the recent kernels in the major distros including all the ipvs support turned on by default, so you may not even have to recompile a kernel on your balancer systems.
Now, of course you can't depend on a single balancer any more than you can depend on a single web server; there is support using the HA linux stuff to allow you to have backup LVS systems to take over as a balancer if your primary balancer bites it (heartbeat and ldirectord are your friends here).
With a pair of fairly low end systems and some monkey work at the keys, you can have a system that will balance your tcp traffic (or set up an automagic failover from a system) that can be as good as some of the commercial balancing products out there.
I currently use LVS with heartbeat/ldirectord to balance the following:
Win32 Apache Servers
Linux Apache Servers
Win2k3 IIS Servers (the LVS system balanced better than the built in WLBS from MS . . . and there were a lot fewer broadcasts)
Postfix
Amavisd-new
And as others have mentioned about setting up some good monitoring (à la Nagios if you want), we monitor the virtual services on the LVS systems in addition to the real servers' services so that we can know if we are still delivering service externally even though real server B is down...
When you get bigger, then you should even start looking at having datacenter redundancy . . . deploying the meteor net never seems to be the right answer to the 'Force Majeure' question . . .
Despite the obvious price tag, vmware products allow you to "virtualize" your server(s) and run them across multiple hardware hosts. Ok, that just adds another option, but it's nice to be able to just "move" one virtual host from one hardware box to another without shutting it down.
#include "coucou.h"
Well, 2 simultaneous drive failures: it has happened to me on two separate occasions on the same Dell PE2650. I also complained to tech support that this was very strange, that I did not trust the server anymore, and that there had to be a hardware failure in the array or backplane.
But, apparently you have to schedule a consistency check every week or so... if you do not do this there is a possibility that data on the RAID gets out of sync/corrupted (or something like that) and when a drive fails the subsequent rebuild will fail also. In my case the RAID then complained of 2 drive failures, while the second drive would afterwards format and work normally (but I got it and almost everything inside the server replaced anyhow).
Maybe this happened to you also? Make sure you schedule the weekly check of the RAID!!!
After examining those things, you'll need to look at load balancing multiple servers. I personally prefer F5 for my load balancing, but they come with a hefty price tag (mid five figures). I've used Cisco, Foundry and F5 so far and I love the F5s. They're extremely extensible but you don't have to be a rocket scientist to get the basic functionality out of them. Dual methods of configuration (web and CLI) make it nice for the newbies and the seasoned professionals. It also helps that the CLI is a full Linux system, so you can write shell scripts to do all of your basic maintenance tasks.
If you have more questions, please reply.
Still, with a plan, you only get the best you can imagine. I'd always hoped for something better than that. -CP
- two (or more!) network feeds from different vendors, verify monthly that they don't have common routing the best you can (sometimes you end up sharing a fiber even
though it doesn't look like it...). These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
- redundant front end servers which have their own copies of static content and cache any active content from...
- redundant back end servers that actually do the active content, and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that
traffic doesn't fight with the actual web service.
- Backup power (UPS + generator) with regular tests. (test on one
side of your redundant servers at a time, just in case...)
- Log only raw IPs, have a backend system with a caching DNS setup
where you do web reports. Do things like log file compression,
reports, etc. on the back end server only.
- tripwire all the config stuff against a tripwire database burned to CD-ROM.
- update configs on a test server (you do have test servers, right?); when they're
right, update the tripwire stuff, build a new tripwire CDROM, then update the
production boxes.
- use a fast network-switch-style load balancer on the front. They also
help defend your servers against certain DoS attacks (e.g. SYN floods).
- when things get busy, load your test servers with the latest production
stuff, and bring them into the load balance pool. If it takes N servers
to handle a given load, it takes N+1 or N+2 to dig back out of a hole,
because the load has at least 1 server out of commission at a time...
- use revision control (RCS, CVS, subversion, whatever) on your config files.
- use rsync or some such to keep 2 copies of your version control, above.
- make sure you can reproduce a front-end or back-end machine from media +
revision control + backups in under an hour. Test this regularly with a
test server.
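The N+1 / N+2 sizing mentioned in the list comes down to simple arithmetic. A sketch, with hypothetical traffic and per-server capacity numbers:

```python
from math import ceil


def servers_needed(peak_requests_per_sec, per_server_capacity, spares=1):
    """N servers to carry the peak load, plus spares for boxes that are out."""
    n = ceil(peak_requests_per_sec / per_server_capacity)
    return n + spares


# Made-up figures: 900 requests/s at peak, 250 requests/s per server, so N = 4.
print(servers_needed(900, 250))            # N + 1
print(servers_needed(900, 250, spares=2))  # N + 2
```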
If you have a site whose content changes less frequently (i.e. at most daily), burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CDROM, as well. Then if you blow a server, you can just slap the CDs/DVDs in another box and be back in business, and it's much harder to hack.
Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Anycast allows you to serve the same IP from multiple servers distributed around the net. It's a BGP hack, and some people don't like it because of that, but it's used to keep some of the DNS root servers up. It has the added ability to divide and conquer DDoS attacks. The caveat is that routing changes break stateful connections, which is why DNS can use it well (single-packet UDP exchanges most of the time).
Mosix is a UNIX patch that allows processes to migrate across machines. There's even a neat utility called mtop or something that shows load on all servers in a curses interface, and you can see them balance out like water pouring between them. This also has issues with maintaining active connections.
If you could migrate stateful connections you could effectively have infinite uptime. Anyone have a patch?
Well, I was really more after general approaches to be honest. I didn't post a lot of specifics because I didn't think anyone was truly interested in solving my exact problem for me.
Here are some details:
The budget is probably $30k or less.
I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production, one sits idle to be swapped in the event of failure.
The server is hosted locally, bandwidth is not an issue.
The system is Windows based, ASP and SQL Server.
The RAID array is using SCSI drives. They are hot swappable, but do not have a hot spare.
I am approaching other sysadmins and looking for advice from them as well. I am not as worried about the traffic or about the backbone at this point as I am keeping the hardware up and the data backed up and available. I am also interested in methods of getting things back up and running quickly should a hardware failure occur or the database become corrupt.
This is not my area of expertise (obvious from my questions) and I thought there might be some general guidelines for this sort of thing. I suspect that he will be paying someone to help with this, but I was hoping to get a good feel for what to expect and to have some knowledge beforehand to be better able to make decisions. (get more than one opinion basically)
Thanks for everyone's responses, I appreciate your time.
You're right that disks don't fail together that often, but components do tend to fail when you get them or at the end of their expected lifetimes (just like us!). This is called the bathtub curve. If you buy a bunch of disks at the same time with the same MTBF, you'll get a big spike of failures within the first few days or in say 4 years. If you use RAID5 on lots of disks, you're hosed because it can't tolerate a failure during a recovery. This may sound exotic, but it's a key design consideration on larger disk systems like archive.org's petaboxen (though, I guess those are exotic :).
As usual, variety is the spice of life... just don't buy lots of the same kind of stuff at once.
- the bathtub curve of disk
failure rates, and
- that a raid reconstruct can take about a day on a lot of RAID sets
you can certainly have a second drive fail while another one is being reconstructed if all the drives in the RAID are near end of life. The only good way to prevent it is to intentionally fail & replace sufficiently old drives before they actually fail (i.e. before you start climbing the steep end of the "bathtub" curve). It can be hard to explain to a company with whom you have a maintenance contract that a drive needs to be replaced that hasn't actually failed yet. I know one admin (honest, it isn't me!) who advocates pulling old drives from the raid set and dropping them on the floor a few times and then calling service to "schedule" these replacements ;-).
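A rough feel for the numbers, as a sketch only. It assumes independent drives with a constant failure rate (1/MTBF), which is exactly what the bathtub curve says is not true near end of life, so the low MTBF in the second call is just a stand-in for worn-out drives. All figures are invented.

```python
from math import exp


def second_failure_probability(drives_remaining, rebuild_hours, mtbf_hours):
    """Chance that at least one surviving drive fails before the rebuild ends.
    Assumes independent drives with a constant failure rate of 1/mtbf_hours."""
    return 1.0 - exp(-drives_remaining * rebuild_hours / mtbf_hours)


# Hypothetical array: 5 surviving drives, 24-hour rebuild.
healthy = second_failure_probability(5, 24, 500_000)  # datasheet-style MTBF
worn_out = second_failure_probability(5, 24, 5_000)   # stand-in for end of life
print(f"healthy: {healthy:.4%}, worn out: {worn_out:.2%}")
```

The absolute values mean little; the point is that the risk during a rebuild scales with both the rebuild time and how tired the remaining drives are.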
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
I would recommend the Google approach - cluster cheap computers. Clustering ASP can be easy (it depends on whether you use the Session variable) - look into Microsoft's Network Load Balancing, which, while it load balances HTTP applications, also provides clustering and failover (I think - you'd have to check) without setting up a formal Windows cluster.
As for SQL, you could have two installations of SQL 2000 and use NLB to share among them; so long as you either manually take care of write transactions or use replication. (I'm not sure what the potential for lost information is if one SQL server goes down taking all data on its discs with it.)
You should obviously also read about MS's own clustering support and look into that. It tends to be for bigger systems than you're talking about. Certain configurations use shared discs - you will have to research.
The ideal in my book is multiple "share nothing" servers where any can take the load of the whole - protects against disc failure too!
The idea of manually swapping in a spare server suggests you don't need 99.9% uptimes, otherwise you'd be looking into clustering systems to make that swap automatically.
Oh, and one thing I did saved my bacon at work once: every two hours, have your SQL Server back up (dump) the transaction log to another computer entirely across the network - I use the SQL Server Maintenance Plans. If you lose the server entirely, you've still got most of today's data! (Adjust frequency to taste.)
Well, as you can see, you can get a bit of information here, obviously.
Since there were no specific details, but a request for information sources, I would say this:
Many vendors of products will offer an assortment of solutions to high availability needs. Legato used to have cluster software for Windows servers (for availability, not load balancing). But alas (as I just discovered) that's now gone with their EMC merger.
Microsoft actually makes an okay clustering solution.
Oracle has clustering ability in their database product and is considered by some to be one of the better solutions for a truly high availability database.
The Linux High Availability project is a good place to look around if you have time on your hands to implement it. I've done it and it helps if you already understand a lot of the concepts involved in HA solutions.
As you will find out, though, you really have to determine the value a solution can provide versus the potential loss of revenue a failure of any type can cause. When you realize how much money you can lose, you can evaluate how much money you can spend. That's the real key to any high availability solution.
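That trade-off can be sketched as a simple comparison. The function names and all the dollar and availability figures below are hypothetical:

```python
HOURS_PER_YEAR = 365 * 24


def expected_annual_loss(availability, revenue_loss_per_hour):
    """Revenue expected to be lost to downtime over one year."""
    return (1.0 - availability) * HOURS_PER_YEAR * revenue_loss_per_hour


def worth_buying(current_avail, improved_avail, loss_per_hour, annual_cost):
    """Return (ok, saved): ok is True if the downtime cost avoided per year
    exceeds what the solution costs per year."""
    saved = (expected_annual_loss(current_avail, loss_per_hour)
             - expected_annual_loss(improved_avail, loss_per_hour))
    return saved > annual_cost, saved


# Made-up case: going from 99% to 99.9% at $1,000 lost per hour of downtime,
# for a solution costing $30,000 a year.
ok, saved = worth_buying(0.99, 0.999, 1_000, 30_000)
print(ok, round(saved))
```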
Keep in mind there are also two types of clustering to think about (you'll discover it on your own in your research anyways):
If you think you are falling behind the rest of the world, you are not. Right now I am going through this whole process at work, figuring out what it will take to get the management team to buy into high availability, and we have a customer base that really needs us to implement it. It all comes down to the money game.
Worked on that a bit while I was employed there.
Basically, we synchronize routing table state across several hot-standby routers, so failover is instantaneous, limiting flapping in the network.
Rather cool, actually.
You could've hired me.
Modern server hardware is pretty good. If you have redundant disks and PSUs, which are the main moving parts in a system, that should be enough. Your bigger worry, especially if you're running Windows systems, is downtime due to rebooting for service patches, and downtime due to malicious break-ins. You can mitigate this to an extent by having lots of servers. Make sure they all have their own passwords so breaking in to one won't compromise your whole network. Check regularly using tripwire or a similar tool that security and integrity haven't been compromised.
You also have to worry about changing a wrong setting, or not testing a new configuration enough. Use revision control, so you have a log of every change you make on production systems. Test first on non-production systems. Keep backups for as long as you can, and practice disaster recovery to make sure if a hurricane hits your data center, you can get back up and running without trouble. Store backups offsite in case your building is destroyed, and make sure you aren't the only one who knows how to restore the systems. Make sure backups are only accessible by authorised personnel - especially if your system passwords get backed up. See point above about break-ins. You are far more likely to suffer downtime due to human weakness than machine weakness, and all the hardware redundancy in the world won't save you.
double-parity raid and add a hot spare.
First you should make a list of all your servers and define your tolerances (figure 2/5). What needs to be kept online in the event of a power outage, for example (maybe your PBX, telephones, publicly accessible servers, and associated networking equipment), is a good question that needs to be asked, but don't stop there. List every possible disaster (structure fire, flood, earthquake, terrorist attack, you name it) and determine which resources need to be available. Then you can design a redundant system (maybe with servers in redundant locations in different cities), but you can only do this when you know what you are up against.
Then you can start building redundancy into every one of these areas. Get duplicate network connections, duplicate switches, hot spares in your RAID arrays, redundant servers, and more. Maybe even redundant locations....
LedgerSMB: Open source Accounting/ERP