Tips for Increasing Server Availability?

Posted by Cliff on Tuesday September 27, 2005 @05:30AM from the more-nines-the-better dept.

uptime asks: "I've got a friend who needs some help with his web server availability. On two separate occasions, his server has had a problem that caused it to be unavailable for a period of time. One was early on and was probably preventable, but this latest one was due to two drives failing simultaneously in a RAID5 array. As a web business moves from a small site to a fairly busy one, availability and reliability become not only more important but, it seems, more difficult to achieve. Hardware gets bigger, services get more expensive, and options seem to multiply. Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."

Hosting by hatch815 · 2005-09-27 05:34 · Score: 5, Insightful

If you are moving to a level where you need uptime but can't dedicate more resources to overseeing it, you may want to consider a hosted solution. The provider hosts, monitors, upgrades, and does checkups (YMMV depending on whom you choose).

If that isn't a road you want to go down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
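
To make the Nagios suggestion concrete, here is a minimal sketch of what a RAID check can look like. It assumes Linux software RAID and parses text in the format of /proc/mdstat, which is not the setup described in the story; real plugins for hardware controllers are vendor-specific. The function names are made up for illustration. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the usual Nagios plugin convention.

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status shows a missing member.

    Looks for the "[n/m] [UU_]" status that follows each "mdX :" line;
    an underscore marks a failed or absent disk.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            total, active = int(status.group(1)), int(status.group(2))
            if active < total or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


def check(mdstat_text):
    """Map the parsed state to an (exit_code, message) pair."""
    bad = degraded_arrays(mdstat_text)
    if bad:
        return CRITICAL, "RAID CRITICAL - degraded: " + ", ".join(bad)
    return OK, "RAID OK - all arrays have every member"
```

A wrapper script would read /proc/mdstat, print the message, and exit with the returned code (or with UNKNOWN if the file cannot be read). The point is that a dead disk pages someone right away instead of being discovered when the second one goes.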
high availability of the service by PFactor · 2005-09-27 05:36 · Score: 4, Informative

If you have a service that must be highly available, cluster or load balance the service. Use more than 1 box and either cluster them or load balance them.

RAID, ECC RAM, team NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.

P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful

--
Don't believe anything I say. I crash test crack pipes for a living.
1. Re:high availability of the service by Rolan · 2005-09-27 05:48 · Score: 4, Interesting
  
  P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful
  
  Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.
  
  Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have less than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers.
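
The sizing rule in this comment can be written down as a small calculation. This is only a sketch: the function names and the load figures in the usage note are invented for illustration, and it assumes load spreads evenly across whatever servers are still up.

```python
import math


def servers_required(peak_load, per_server_capacity, tolerated_failures):
    """Servers needed so the cluster still carries peak load after failures."""
    needed_for_load = math.ceil(peak_load / per_server_capacity)
    return needed_for_load + tolerated_failures


def survives(total_servers, failed, peak_load, per_server_capacity):
    """True if the remaining servers can still carry the peak load."""
    remaining = total_servers - failed
    return remaining * per_server_capacity >= peak_load
```

For example, with a hypothetical peak of 900 requests/s and 500 requests/s per server, two servers are needed just for load, so tolerating two failures means running four, which matches the "need two, run four" setup described above.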
  
  --
  - AMW
2. Re:high availability of the service by wolf31o2 · 2005-09-27 07:51 · Score: 4, Interesting
  
  This is basically the principle that my company runs on with its servers. We should be able to run perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLEC/ILECs.
  As for the double-failure in a RAID5 array thing the article poster mentioned, for Pete's sake, buy a couple spare disks. You should follow the same rule in making your RAID arrays as your server clusters. You *should* be able to lose 2/5ths of your disks without losing the array. This means that you need at least 1 spare for every 5 drives, for a total of 6 drives.
  Add some good monitoring on top of these and your downtimes drop to almost nothing. In fact, you shouldn't ever see service downtimes with a proper setup, provided you actually bring machines back up as they fail.
3. Re:high availability of the service by captainclever · 2005-09-27 08:08 · Score: 2, Interesting
  
  If you find yourself running a bunch of servers all with similar spec/config, you should consider removing the disks from them and netbooting off a single image on another server (or a single image available on 2 other servers, just in case). Disks are far more likely to break than any other component IMO, fans and PSUs included.
  As for RAID5, it's not always practical, but bear in mind if you buy all your disks from the same mfg at the same time, your chances of concurrent failure are increased. (the batch the disks came from may be suspect). You could buy disks from different manufacturers. Hot spares are always handy too.
  The webservers for last.fm are all diskless and boot off a single Debian image. Makes it helluvalot easier to upgrade/update them. We use Perlbal (from the LiveJournal crew) as a reverse-proxy load balancer, which works nicely.
  
  --
  Last.fm - join the social music revolution
4. Re:high availability of the service by dwater · 2005-09-27 15:46 · Score: 2, Interesting
  
  This is my take on RAID5 - I take it as written that I will be corrected :
  
  You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated on the spare before it's back to its 'redundant' state.
  
  Therefore, the time between the two disks failing *must* be enough for the regeneration to complete, else you'll lose everything(?).
  
  This can take quite some time.
  
  On my s/w RAID5, it takes hours.
  
  Furthermore, the process causes significantly more disk activity on the remaining disks, increasing the risk of another failure.
  
  The time taken to regenerate the array depends on many things. The ones I can think of :
  
  1) speed of the read performance of the disks remaining in the array, and the write performance of the spare.
  2) controller performance (or interface(s)/chipset/memory/cpu performance if s/w RAID).
  
  Of course, once the array is back to its redundant state, it still isn't back to 'normal' because the spare is now missing and needs to be replaced.
  
  Better to use mirroring (0+1?) in some way, IMO. If you have 3 mirrors, you can hot swap one 'reflection' and immediately replace it - keeping many off-line 'reflections' in the same way as you would tape and using it as a backup.
  
  I think 0+1 is faster too, since the '0' is a stripe. Of course, it's also more expensive to obtain a given capacity.
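
The rebuild-window argument above can be given rough numbers. The sketch below assumes independent disks with a constant failure rate (an exponential model driven by the MTBF); the function name and the figures in the usage note are illustrative, not measurements. Other comments in this thread argue that same-batch disks and rebuild stress break the independence assumption, so treat the result as an optimistic lower bound.

```python
import math


def second_failure_probability(surviving_disks, rebuild_hours, mtbf_hours):
    """P(at least one surviving disk fails during the rebuild window).

    Assumes independent disks with a constant failure rate of 1/MTBF each.
    """
    rate = surviving_disks / mtbf_hours
    return 1.0 - math.exp(-rate * rebuild_hours)
```

With 4 surviving disks, an 8 hour rebuild and an assumed MTBF of 500,000 hours, this comes out at roughly 6.4e-5 per rebuild. Doubling the rebuild time roughly doubles the risk, which is why hours-long software rebuilds matter.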
  
  --
  Max.
Where's the link? by azuroff · 2005-09-27 05:36 · Score: 5, Funny

Just post a link to this server of his - we'll gladly stress-test it for him at no charge. :)
Identify, Prioritize, Budget by Nos. · 2005-09-27 05:59 · Score: 3, Informative

From the internet to the resources you are trying to provide, identify every point of failure: power, uplink, router, switch, server, etc. Prioritize: which are the most likely to fail? Budget: which ones can we afford to duplicate, and for which is duplication cost-effective? Clustering may be an option, but might be too expensive. What about a cold/hot standby? It reduces downtime overall. You can find relatively inexpensive UPSs just about everywhere. Making your entire network redundant can take a lot of time and money.
Not a bad idea... by Ingolfke · 2005-09-27 06:06 · Score: 2, Funny

I think sysadmins would respond very nicely to tips for increasing server availability. Let's say the average tip is about 15%, and for simplicity's sake we'll say that every hour of downtime costs about $10,000. Baseline service should be set either by an SLA or a measured baseline. For this we'll say 99% uptime per month. That allows for 7.2 hours of unscheduled downtime in a month. So if the server is up for 100% of that time, then you'd want to tip your sysadmin about $10,800 per server (assuming the #s stay the same) for a month of great work.
Hire a Professional? by marcus · 2005-09-27 06:13 · Score: 3, Insightful

That is all...

--
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO
Define then plan by linuxwrangler · 2005-09-27 06:15 · Score: 4, Informative

You will find that "availability" is a vague term. First you need to have a discussion to determine what availability means. It must be expressed in measurable, non-vague terms. "99% uptime" is not a good definition. "The system must handle 99.7% of requests in 30 milliseconds or less" is much better, in part because it includes a performance expectation. It also recognizes that not every request will receive the desired level of response. Additionally, if you determine that you want N+1 redundancy, then you need to know the appropriate value of N (how many servers are needed to provide the required response times).
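
A definition like the 99.7%-within-30-ms one above has the advantage that it can be checked mechanically. A minimal sketch, assuming you already have a list of per-request latencies in milliseconds from your logs (the function name and defaults are illustrative):

```python
def meets_slo(latencies_ms, threshold_ms=30.0, required_fraction=0.997):
    """True if enough requests finished within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests measured")
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms) >= required_fraction
```

Failed or timed-out requests need to be counted as slow ones (for instance by recording them with a very large latency), otherwise an outage would not show up in the number at all.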

You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak-load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague-of-locusts destroys our primary site? Cough up the dough.

Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.

Once you know the requirements, list everything that may impact your availability, including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server - any database, app-server, dns-server, load-balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.

With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives or hot-swap drives with warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours or do you need people on-site 24/7?

A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC192 the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or large natural disasters. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
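
The last point, that two sites let each site be less reliable, follows from simple probability. The sketch below assumes the sites fail independently and that failover is instant and always works; neither is quite true in practice (DNS failover takes minutes), so the real figure will be somewhat lower. Function names are illustrative.

```python
def combined_availability(site_availability, sites=2):
    """Availability of the whole system if any one site being up is enough."""
    return 1.0 - (1.0 - site_availability) ** sites


def required_site_availability(target, sites=2):
    """Per-site availability needed to reach a target for the whole system."""
    return 1.0 - (1.0 - target) ** (1.0 / sites)
```

Under those assumptions, two sites at 99% each give about 99.99% combined, and hitting a 99.99% target needs only about 99% from each site.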

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis
Uptime ++ by Anonymous Coward · 2005-09-27 06:16 · Score: 3, Informative

Hmm, two RAID drives failed simultaneously? This is possible but so unlikely, it isn't worth mentioning. Either the equipment he (Freudian You) is using is utter crap or they didn't really fail at the same time. Most likely one failed and no one noticed until the second failed. Or possibly He/You were using software RAID, in which case your OS or you failed and caused the apparent drive failure. In any of these cases the real cause of the failure is him/you!

Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP Proliant servers. Sure, there will be flames for this recommendation but, think about it. There is a reason that HP Proliant servers are the ONLY choice in almost every Fortune 1000 company. Regardless of some anecdotal whining about poor support that is sure to follow my post, HP Proliants ARE that good!

Use hot-pluggable SCSI drives attached to a battery backed-up RAID controller and a hot-spare drive. Do NOT use IDE and you might even want to forgo SATA, though they are a possibility. Use high quality ECC memory, dual processors, redundant power supplies and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only warn you, via alarms, alerts, pagers, or email, before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, and can even restart Apache for you should it fail.

Now this is all well and good for reducing downtime to almost none, but it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load-balancer or implement a cluster. If you're super paranoid, both. Naturally you need to also have redundant power sources and network connectivity with all this so that you do not have a single point of failure ANYWHERE.

Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try and use white-box desktop hardware you've already failed it!
Probability of simultaneous two disk failure by metoc · 2005-09-27 06:23 · Score: 2, Insightful

is extremely low given the MTBF of modern drives. You have a better chance of a power supply or fan failure.

On that basis I am going to make some wild assed guesses that are more probable given the little information we have.

1) the drives were consumer models from the same production lot,
2) the death of the first drive was not immediately noticed,
3) compatible replacement drives are not easy to come by (no hot spare),
3) the second drive died before the first one was replaced,
4) the server did not have hot swap drive carriers
5) someone tried to replace the dead drive in the running chassis

If you don't like my guesses provide your own
1. Re:Probability of simultaneous two disk failure by mangu · 2005-09-27 07:32 · Score: 4, Insightful
  
  If you don't like my guesses provide your own
  
  6) The drives are overheating. This happened to my two Seagate 200GB drives. Had to mount them in a heatsink; the normal bay does not provide adequate cooling for Seagate's 7200 rpm drives.
HA is elusive by Ropati · 2005-09-27 06:27 · Score: 3, Informative

Preventing downtime is an expensive, time consuming exercise, with few limits.

Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.
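
To put numbers on "how much downtime is acceptable", it helps to convert each availability percentage into a yearly downtime budget. A small sketch, using a 365-day year (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


for percent in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(percent)
    print("%7.3f%% -> %8.2f hours (%.1f minutes) per year"
          % (percent, hours, hours * 60))
```

99% allows 87.6 hours a year, while 99.999% leaves a little over five minutes, which is less than a single reboot on most hardware. That is why five nines is out of reach without full redundancy.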

Specifically, you need to make everything in your system redundant. The web servers need to be redundant, and you need redundant copies of the data. The paths to the internet need to be redundant, and the environment should be remote and redundant.

Once you get a handle on your environment, you should consider some sort of clustering technology for server duality. I suggest you read "In Search of Clusters" by Gregory F. Pfister to get a fundamental understanding of the technology.

As was posted earlier, you might just want to throw in the towel and accept web hosting. Use the Uptime Institute specifications against the ISP's service level agreement.

You might also consider a local ISP co-lo and do your own remote clustering.

--
machinator omnis sine licentia
There was a really good LISA talk... by mengel · 2005-09-27 07:06 · Score: 4, Interesting
... about this a few years back. I forget the guy's name; he was administering a site that did stock quotes with pretty graphs, etc. I suspect I don't remember all of his points anymore, but:
- two (or more!) network feeds from different vendors, verify monthly that they don't have common routing the best you can (sometimes you end up sharing a fiber even though it doesn't look like it...). These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
- redundant front end servers which have their own copies of static content and cache any active content from...
- redundant back end servers that actually do the active content, and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that traffic doesn't fight with the actual web service.
- Backup power (UPS + generator) with regular tests. (test on one side of your redundant servers at a time, just in case...)
- Log only raw IPs; have a backend system with a caching DNS setup where you do web reports. Do things like log file compression, reports, etc. on the back end server only.
- tripwire all the config stuff against a tripwire database burned to CD-ROM.
- update configs on a test server (you do have test servers, right?); when they're right, update the tripwire stuff, build a new tripwire CD-ROM, then update the production boxes.
- use a fast network-switch-style load balancer on the front. They also help defend your servers against certain DoS attacks (e.g. SYN floods).
- when things get busy, load your test servers with the latest production stuff, and bring them into the load balance pool. If it takes N servers to handle a given load, it takes N+1 or N+2 to dig back out of a hole, because the load has at least 1 server out of commission at a time...
- use revision control (RCS, CVS, subversion, whatever) on your config files.
- use rsync or some such to keep 2 copies of your version control, above.
- make sure you can reproduce a front-end or back-end machine from media + revision control + backups in under an hour. Test this regularly with a test server.
If you have a site whose content changes less frequently (i.e. at most daily), burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CD-ROM as well. Then if you blow a server, you can just slap the CDs/DVDs in another box and be back in business, and it's much harder to hack.
Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...
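
The tripwire items in the list above boil down to keeping a known-good fingerprint of your config files and comparing against it. The sketch below is not Tripwire itself, just an illustration of the idea using SHA-256 from the standard library; the function names are invented, and a real setup would keep the baseline on read-only media as the list suggests.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file, read in chunks so large files are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    manifest = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def compare(baseline, current):
    """Return (added, removed, changed) relative paths, each sorted."""
    added = sorted(set(current) - set(baseline))
    removed = sorted(set(baseline) - set(current))
    changed = sorted(path for path in baseline
                     if path in current and baseline[path] != current[path])
    return added, removed, changed
```

Build the manifest on the test server once the configs are right, store it somewhere the production box cannot write to, and run the comparison from cron. Anything in the three lists that you did not deploy yourself deserves a look.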
--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Re: Here they are by Anonymous Coward · 2005-09-27 07:14 · Score: 2, Informative

Well, I was really more after general approaches to be honest. I didn't post a lot of specifics because I didn't think anyone was truly interested in solving my exact problem for me.

Here are some details:
The budget is probably $30k or less.
I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production, one sits idle to be swapped in the event of failure.
The server is hosted locally, bandwidth is not an issue.
The system is Windows based, ASP and SQL Server.
The RAID array is using SCSI drives. They are hot swappable, but do not have a hot spare.

I am approaching other sysadmins and looking for advice from them as well. I am not as worried about the traffic or about the backbone at this point as I am about keeping the hardware up and the data backed up and available. I am also interested in methods of getting things back up and running quickly should a hardware failure occur or the database become corrupt.

This is not my area of expertise (obvious from my questions) and I thought there might be some general guidelines for this sort of thing. I suspect that he will be paying someone to help with this, but I was hoping to get a good feel for what to expect and to have some knowledge beforehand to be better able to make decisions. (get more than one opinion basically)

Thanks for everyone's responses, I appreciate your time.
Bathtub Curve Makes it More Likely by freality · 2005-09-27 07:20 · Score: 2, Informative

You're right that disks don't fail together that often, but components do tend to fail when you get them or at the end of their expected lifetimes (just like us!). This is called the bathtub curve. If you buy a bunch of disks at the same time with the same MTBF, you'll get a big spike of failures within the first few days or in say 4 years. If you use RAID5 on lots of disks, you're hosed because it can't tolerate a failure during a recovery. This may sound exotic, but it's a key design consideration on larger disk systems like archive.org's petaboxen (though, I guess those are exotic :).
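
For anyone who has not seen the bathtub curve, one common way to model it is as the sum of three failure rates: a falling "infant mortality" term, a constant random-failure term, and a rising wear-out term. The sketch below uses Weibull hazard functions for the first and last. Every parameter in it is made up purely to produce the right shape; none of them are real disk statistics.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def bathtub_hazard(t_years):
    """Illustrative failure rate (per year) at age t_years, t_years > 0."""
    infant = weibull_hazard(t_years, shape=0.5, scale=20.0)   # falls with age
    random_failures = 0.02                                    # flat
    wear_out = weibull_hazard(t_years, shape=6.0, scale=6.0)  # rises with age
    return infant + random_failures + wear_out
```

With these invented parameters the rate is high in the first days, bottoms out around the two-year mark, and climbs again by year five. Disks bought together sit at the same point on the curve, so they reach the risky ends together.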

As usual, variety is the spice of life... just don't buy lots of the same kind of stuff at once.