Tips for Increasing Server Availability?
uptime asks: "I've got a friend that needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important but, it seems, more difficult to accomplish. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
If you are moving to a level where you need uptime but can't dedicate more resources to overseeing it, you may want to consider a hosted solution. They host, monitor, upgrade, and do checkups (YMMV depending on whom you choose).
If that isn't a road you want to go down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
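To give an idea of what such a plugin looks like, here is a minimal sketch of a Nagios-style check. It assumes Linux software RAID (md) and reads /proc/mdstat; a hardware RAID controller would need its vendor's tool instead. The function names are made up for illustration. The exit codes 0/1/2/3 (OK/WARNING/CRITICAL/UNKNOWN) are the standard Nagios plugin convention.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mdstat(text):
    """Return (exit_code, message) for the contents of /proc/mdstat."""
    degraded = []
    seen = 0
    current = None
    for line in text.splitlines():
        # An array stanza starts with a line like "md0 : active raid1 ...".
        start = re.match(r"(md\d+)\s*:", line)
        if start:
            current = start.group(1)
            continue
        # The status line ends with e.g. "[2/2] [UU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            seen += 1
            if "_" in status.group(3):
                degraded.append(f"{current} [{status.group(3)}]")
            current = None
    if seen == 0:
        return UNKNOWN, "RAID UNKNOWN - no md arrays found"
    if degraded:
        return CRITICAL, "RAID CRITICAL - degraded: " + ", ".join(degraded)
    return OK, f"RAID OK - {seen} array(s) healthy"


def main(path="/proc/mdstat"):
    """Print the one-line status Nagios expects and return the exit code."""
    try:
        with open(path) as handle:
            code, message = check_mdstat(handle.read())
    except OSError as err:
        code, message = UNKNOWN, f"RAID UNKNOWN - cannot read {path}: {err}"
    print(message)
    return code
```

To use it as an actual plugin, finish the script with `sys.exit(main())` (after `import sys`) so Nagios sees the exit code.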
If you have a service that must be highly available, cluster or load balance the service. Use more than 1 box and either cluster them or load balance them.
RAID, ECC RAM, team NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.
P.S. - your second server should be able to handle the exact same load as the first server, or it's not going to be terribly helpful.
Don't believe anything I say. I crash test crack pipes for a living.
Just post a link to this server of his - we'll gladly stress-test it for him at no charge. :)
You will find that "availability" is a vague term. First you need a discussion to determine what availability means, and it must be put in measurable, non-vague terms. "99% uptime" is not a good definition. "The system must handle 99.7% of requests in 30 milliseconds or less" is much better, in part because it includes a performance expectation. It also recognizes that not every request will receive the desired level of response. Additionally, if you determine that you want N+1 redundancy, then you need to know the appropriate value of N (how many servers are needed to provide your required response times).
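As a rough illustration of making that definition measurable, here is a small sketch. It assumes you already have per-request latencies in milliseconds (from access logs or a load test) and a measured per-server throughput; the function names and the example figures are made up.

```python
import math


def fraction_within(latencies_ms, threshold_ms=30.0):
    """Fraction of requests answered in threshold_ms or less."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    fast = sum(1 for t in latencies_ms if t <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms=30.0, target=0.997):
    """True if at least `target` of the requests met the latency threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target


def servers_needed(peak_rps, per_server_rps, spares=1):
    """N + spares, where N is the number of servers needed to carry the peak load."""
    n = math.ceil(peak_rps / per_server_rps)
    return n + spares
```

For example, if peak load were 900 requests/second and one box could serve 250 requests/second within the target latency, N would be 4 and `servers_needed(900, 250)` would give 5 for N+1.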
You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak-load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague-of-locusts destroys our primary site? Cough up the dough.
Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.
Once you know the requirements, list everything that may impact your availability, including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server: any database, app server, DNS server, load balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.
With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives, or hot-swap drives with a warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours, or do you need people on-site 24/7?
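One common way to put numbers on that list is the textbook MTBF/MTTR model, sketched below. It assumes every piece must be up for the service to be up and that failures are independent, which is a simplification. Any figures you feed it are your own estimates; nothing here comes from real measurements.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series_availability(parts):
    """Availability of a chain where every part must work (independent failures)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR
```

This is where the "tech in 2 hours vs. on-site 24/7" question shows up: cutting the repair time raises `availability()` for that component without touching the hardware, and the product in `series_availability()` shows which piece is dragging the whole chain down.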
A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC192 the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or a large natural disaster. To run multiple sites you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
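The last point can be sketched with a little arithmetic. The sketch assumes the sites fail independently and that failover is instant and perfect, neither of which is quite true in practice (DNS failover takes minutes, as noted above), so treat the results as an upper bound. Function names are illustrative.

```python
def parallel_availability(sites):
    """Availability when any one working site is enough (independent failures)."""
    all_down = 1.0
    for a in sites:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def required_site_availability(target, n_sites=2):
    """Per-site availability needed for n identical sites to reach `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / n_sites)
```

Under those assumptions, two sites that are each up 99% of the time combine to roughly 99.99%, so each site can be built to a much cheaper standard than a single site aiming for the same figure.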
~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
- two (or more!) network feeds from different vendors; verify monthly, as best you can, that they don't share routing (sometimes you end up sharing a fiber even though it doesn't look like it...).
These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
- redundant front end servers which have their own copies of static content and cache any active content from...
- redundant back end servers that actually generate the active content and hold any databases, etc. Use a separate LAN for the front-end/back-end connections so that
traffic doesn't fight with the actual web service.
- Backup power (UPS + generator) with regular tests. (test on one
side of your redundant servers at a time, just in case...)
- Log only raw IPs on the web servers; resolve names on a backend system with a caching DNS setup
where you do web reports. Do things like log file compression,
reports, etc. on the back end server only.
- tripwire all the config stuff against a tripwire database burned to CD-ROM.
- update configs on a test server (you do have test servers, right?). When they're
right, update the tripwire stuff, build a new tripwire CDROM, then update the
production boxes.
- use a fast network-switch-style load balancer on the front. They also
help defend your servers against certain DoS attacks (e.g. SYN floods).
- when things get busy, load your test servers with the latest production
stuff, and bring them into the load balance pool. If it takes N servers
to handle a given load, it takes N+1 or N+2 to dig back out of a hole,
because the load has at least 1 server out of commission at a time...
- use revision control (RCS, CVS, subversion, whatever) on your config files.
- use rsync or some such to keep 2 copies of your version control, above.
- make sure you can reproduce a front-end or back-end machine from media +
revision control + backups in under an hour. Test this regularly with a
test server.
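One way to read the "N+1 or N+2 to dig back out of a hole" point from the list above is as a queueing argument: a backlog only shrinks if there is capacity beyond what the incoming load already uses. Here is a rough sketch under that reading, assuming a steady arrival rate and identical servers; the names and figures are illustrative only.

```python
import math


def drain_time_hours(backlog_requests, arrival_rps, per_server_rps, servers):
    """Hours needed to clear a backlog while new requests keep arriving."""
    # Only capacity above the incoming load can be spent on the backlog.
    spare_rps = servers * per_server_rps - arrival_rps
    if spare_rps <= 0:
        return math.inf  # exactly N servers (or fewer) never catch up
    return backlog_requests / spare_rps / 3600.0
```

With 400 requests/second arriving and 100 requests/second per box, four servers never clear a 360,000-request backlog, five clear it in an hour, and six in half an hour.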
If you have a site whose content changes less frequently (i.e. at most daily), burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CDROM as well. Then if you blow a server, you can just slap the CDs/DVDs in another box and be back in business, and it's much harder to hack.
Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
6) The drives are overheating. This happened to my two Seagate 200GB drives. I had to mount them in a heatsink; the normal bay does not provide adequate cooling for Seagate's 7200 rpm drives.