Tips for Increasing Server Availability?
uptime asks: "I've got a friend that needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important, but more difficult to accomplish, it seems. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."
Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.
Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have fewer than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers.
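To put numbers on that sizing rule, here is a minimal sketch. The function names and the load figures (900 req/s peak, 500 req/s per server) are made up for illustration; only the "two needed, four deployed" shape comes from the setup described above.

```python
import math

def servers_needed(peak_load, per_server_capacity, tolerated_failures):
    """Servers to deploy so the cluster still carries peak_load
    after tolerated_failures machines have gone down."""
    minimum = math.ceil(peak_load / per_server_capacity)
    return minimum + tolerated_failures

def survives(total_servers, failed, peak_load, per_server_capacity):
    """True if the servers still running can carry the peak load."""
    return (total_servers - failed) * per_server_capacity >= peak_load

# Hypothetical numbers: 900 req/s peak, 500 req/s per server.
print(servers_needed(900, 500, 2))  # 2 needed for load + 2 spare = 4
print(survives(4, 2, 900, 500))     # True: two survivors are enough
print(survives(4, 3, 900, 500))     # False: one survivor is overloaded
```

The point is to size for the load you must carry after failures, not the load you carry on a good day.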
- AMW
- two (or more!) network feeds from different vendors. Verify monthly, as best you can, that they don't share common routing (sometimes you end up sharing a fiber even
though it doesn't look like it...). These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
- redundant front end servers which have their own copies of static content and cache any active content from...
- redundant back end servers that actually generate the active content, and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that
traffic doesn't fight with the actual web service.
- Backup power (UPS + generator) with regular tests. (test on one
side of your redundant servers at a time, just in case...)
- Log only raw IPs; resolve hostnames on a back-end system with a caching
DNS setup, where you also run your web reports. Do things like log file
compression, reports, etc. on the back end server only.
- tripwire all the config stuff against a tripwire database burned to CD-ROM.
- update configs on a test server (you do have test servers, right?). When they're
right, update the tripwire stuff, build a new tripwire CD-ROM, then update the
production boxes.
- use a fast network-switch-style load balancer on the front. They also
help defend your servers against certain DoS attacks (e.g. SYN floods).
- when things get busy, load your test servers with the latest production
stuff, and bring them into the load balance pool. If it takes N servers
to handle a given load, it takes N+1 or N+2 to dig back out of a hole,
because the overload keeps at least one server out of commission at a time...
- use revision control (RCS, CVS, subversion, whatever) on your config files.
- use rsync or some such to keep 2 copies of your version control, above.
- make sure you can reproduce a front-end or back-end machine from media +
revision control + backups in under an hour. Test this regularly with a
test server.
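The tripwire items in the list above boil down to "record known-good checksums somewhere read-only, then compare against them". The sketch below is not Tripwire itself, just a rough stand-in using Python's hashlib to show the idea; `build_manifest` and `diff_manifest` are names invented for this example.

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest

def diff_manifest(baseline, current):
    """Return (added, removed, changed) file names between two manifests."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(name for name in set(baseline) & set(current)
                     if baseline[name] != current[name])
    return added, removed, changed
```

You would build the baseline on the test server once the configs are right, store it on read-only media, and run the diff on the production boxes. Anything that shows up as added, removed, or changed without a matching revision-control commit deserves a look.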
If you have a site whose content changes less frequently (say, at most daily), burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CD-ROM as well. Then if you blow a server, you can just slap the CDs/DVDs in another box and be back in business, and it's much harder to hack.
Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
This is basically the principle that my company runs on with their servers. We should be able to be running perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLEC/ILECs.
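For a rough feel of what "fine with 2/5ths down" buys you, here is a back-of-the-envelope sketch. It assumes each server fails independently with the same probability, which is optimistic: shared power, switches, or a bad config push make failures correlated. The function name and the 1% figure are made up for illustration.

```python
from math import comb

def prob_at_most_down(n, k, p_down):
    """Probability that at most k of n servers are down at the same time,
    assuming each is down independently with probability p_down."""
    return sum(comb(n, i) * p_down**i * (1 - p_down)**(n - i)
               for i in range(k + 1))

# Hypothetical: 5 servers, each down 1% of the time, service survives 2 down.
outage = 1 - prob_at_most_down(5, 2, 0.01)
print(f"chance of 3 or more down at once: {outage:.2e}")
```

With those made-up numbers the chance of losing three or more at once comes out at roughly one in a hundred thousand, versus one in a hundred for a single unprotected box. That only holds for as long as the independence assumption does, which is why the redundant links and switches matter as much as the server count.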
As for the double-failure in a RAID5 array thing the article poster mentioned, for Pete's sake, buy a couple spare disks. You should follow the same rule in making your RAID arrays as your server clusters. You *should* be able to lose 2/5ths of your disks without losing the array. This means that you need at least 1 spare for every 5 drives, for a total of 6 drives.
Add some good monitoring on top of these and your downtimes drop to almost nothing. In fact, you shouldn't ever see service downtimes with a proper setup, provided you actually bring machines back up as they fail.
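As a very small illustration of the monitoring part, the sketch below just checks whether each box still accepts TCP connections on its service port. It uses only the standard socket module; the function names are invented for this example, and a real setup would also check that the service answers correctly and would alert someone.

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def down_servers(servers, timeout=2.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [(host, port) for host, port in servers
            if not is_listening(host, port, timeout)]
```

Run something like `down_servers([...])` from cron against your own list of hosts and page someone whenever the result is non-empty. The redundancy only buys you time to bring the failed machine back before the next one goes.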