Tips for Increasing Server Availability?

Hosting by hatch815 · 2005-09-27 05:34 · Score: 5, Insightful

If you are moving to a level where you need uptime but can't dedicate more resources to overseeing it, you may want to consider a hosted solution. They host, monitor, upgrade, and do checkups (YMMV depending on whom you choose).

If that isn't a road you want to venture down, then start planning outages for fsck, upgrades, and standard checkups. There are also plugins for Nagios that will check RAID controller status, server response, and server load.
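For anyone who hasn't seen one, a Nagios check is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Here is a minimal sketch of a load check in Python, assuming a Unix host. The thresholds and function names are made up for illustration and are not taken from any shipped plugin.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_load(load1, warn, crit):
    """Map a 1-minute load average to a Nagios status code and message."""
    if load1 >= crit:
        return CRITICAL, f"CRITICAL - load average {load1:.2f}"
    if load1 >= warn:
        return WARNING, f"WARNING - load average {load1:.2f}"
    return OK, f"OK - load average {load1:.2f}"


def check_load(warn=4.0, crit=8.0):
    """Read the local load average (Unix only) and classify it."""
    try:
        load1, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        # Platform without getloadavg, or the value could not be read.
        return UNKNOWN, "UNKNOWN - load average not available"
    return classify_load(load1, warn, crit)
```

A real plugin would print the message and call sys.exit() with the status code. RAID checks follow the same pattern but have to query the controller's own tools.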

high availability of the service by PFactor · 2005-09-27 05:36 · Score: 4, Informative

If you have a service that must be highly available, cluster or load balance the service. Use more than 1 box and either cluster them or load balance them.

RAID, ECC RAM, team NICs and all that stuff are very helpful, but if you want to make DARN sure that service is as available as possible, do server times two.

P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful

--
Don't believe anything I say. I crash test crack pipes for a living.

Re:high availability of the service by Thauma · 2005-09-27 05:47 · Score: 1

Don't forget to have a redundant network connection. Unless you sign up for HSRP, most data centers only hook you into one of their providers.
Re:high availability of the service by Rolan · 2005-09-27 05:48 · Score: 4, Interesting

P.S. - your second server should be able to handle the exact same load as the first server or it's not going to be terribly helpful

Actually, each server should be able to handle the entire load of the cluster. A lot of people forget to pay attention to this. It's great that you have two servers in a cluster, so that if one fails the other still works, except when you HAVE to have two servers in the cluster for it to work at all.

Where I work we run our web solutions on clusters. This works great for redundancy, availability, etc. BUT, if we ever have less than two servers in the cluster, the system will go down anyway due to load. Our primary production cluster, therefore, is four servers.
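The sizing rule can be written down in a few lines. This is only a sketch: the load figures in the usage note are hypothetical, and it assumes load spreads evenly across whatever servers are still up.

```python
import math


def servers_needed(peak_load, per_server_capacity):
    """Minimum number of servers that can carry the peak load (N)."""
    return math.ceil(peak_load / per_server_capacity)


def tolerated_failures(cluster_size, peak_load, per_server_capacity):
    """How many servers can fail before the survivors are overloaded."""
    return cluster_size - servers_needed(peak_load, per_server_capacity)
```

With a hypothetical peak of 180 requests/sec and 100 requests/sec per server, you need two servers just to stay up, so a two-server cluster tolerates zero failures and a four-server cluster tolerates two.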

--
- AMW
Re:high availability of the service by PFactor · 2005-09-27 05:55 · Score: 1

"Actually, each server should be able to handle the entire load of the cluster"

Good point. I took it as read that the first server in question is handling the load on its own and the question was solely about providing higher availability. I put in a note about clustered/load balanced servers being equivalent in available capacity because in my experience, this is a place that novices will shortsightedly cut corners to save $$$.

--
Don't believe anything I say. I crash test crack pipes for a living.
Re:high availability of the service by mengel · 2005-09-27 07:08 · Score: 1

or, for really busy sites, 3 out of 5 should be able to handle the load...

--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Re:high availability of the service by wolf31o2 · 2005-09-27 07:51 · Score: 4, Interesting

This is basically the principle that my company runs on with their servers. We should be able to be running perfectly fine with 2/5ths of our servers down at any given time. Of course, this almost never happens, but building with that sort of redundancy in mind reduces the chances of downtime to almost nothing. Each machine is also on redundant links to redundant switches on redundant upstream links. We do have the advantage of being an ISP and CLEC ourselves, so we already have multiple peering agreements with many other CLEC/ILECs.
As for the double-failure in a RAID5 array thing the article poster mentioned, for Pete's sake, buy a couple spare disks. You should follow the same rule in making your RAID arrays as your server clusters. You *should* be able to lose 2/5ths of your disks without losing the array. This means that you need at least 1 spare for every 5 drives, for a total of 6 drives.
Add some good monitoring on top of these and your downtimes drop to almost nothing. In fact, you shouldn't ever see service downtimes with a proper setup, provided you actually bring machines back up as they fail.
Re:high availability of the service by captainclever · 2005-09-27 08:08 · Score: 2, Interesting

If you find yourself running a bunch of servers all with similar spec/config, you should consider removing the disks from them and netbooting off a single image on another server (or a single image available on 2 other servers just in case). Disks are far more likely to break than any other component IMO, far more likely than fans or PSUs.
As for RAID5, it's not always practical, but bear in mind if you buy all your disks from the same mfg at the same time, your chances of concurrent failure are increased. (the batch the disks came from may be suspect). You could buy disks from different manufacturers. Hot spares are always handy too.
The webservers for last.fm are all diskless and boot off a single Debian image. Makes it helluvalot easier to upgrade/update them. We use Perlbal (from the LiveJournal crew) as a reverse-proxy load balancer, which works nicely.

--
Last.fm - join the social music revolution
Re:high availability of the service by Glonoinha · 2005-09-27 12:09 · Score: 1

Actually, per definition of RAID 5 - no, you *shouldn't* be able to lose 2/5ths of your disks at the same time without losing the array.
A RAID 5 array can happily chug along after a single disk dies, but if you lose two drives on the same array at the same time, you are pretty well and good screwed.
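For readers who would rather not go to Google, here is a toy illustration in Python of why one lost disk is recoverable and two are not. It models a single stripe with XOR parity, which is the idea behind RAID 5. The block contents are arbitrary and this is not how any particular controller lays out data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (RAID 5 rotates
# which disk holds parity from stripe to stripe; one stripe is enough here).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose one disk: XOR of the survivors and the parity recovers it.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Lose two disks: a single parity equation cannot determine two unknowns.
# XOR of what is left only yields the XOR of the two missing blocks.
leftover = xor_blocks([data[2], parity])
assert leftover == xor_blocks([data[0], data[1]])
```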
--
The proof of this, of course, is left up to the reader (with help from Google.)

--
Glonoinha the MebiByte Slayer
Re:high availability of the service by innosent · 2005-09-27 12:45 · Score: 1

Not bad, but by booting from a remote image, you run the risk of the remote image server being unavailable, unless you have a redundant image server. Your best bet is to have the remote image server, but use/create a provisioning system that allows you to install the image you want to the machine OR run a network image. This eliminates the need for a redundant image server most of the time (machines can still run from local disk), but allows you to reprovision for temporary needs like a failed server or scheduled maintenance.

For (an overly simplified) instance, say you run two main sites, site x on 4 machines, and site y on 4 machines, with the image server set to do nothing (allowing the machines to continue to boot from disk). If site x is on /. today, you simply adjust your provisioning to have 2 of the machines that usually host site y to run site x from network image, and reboot the two machines. Now you have a 6/2 split, and when the load subsides, all you have to do is remove the rule and reboot the two machines. Of course, if you decide that site x needs 5 full time machines, you could just set the provisioning to load the host image for x onto one of the y machines, reboot it, and remove the rule. I don't know of any existing free/open source software to do this in a general case off hand, but I'm sure such an animal already exists, since there are dozens of commercial packages to do this. If not, it won't take more than an hour or two to whip up some shell scripts to do this in your specific case. All you need is a network boot system that writes an image file or copies a directory named in some configuration file to disk and reboots for the permanent case, and runnable network boot images for the temporary case (probably just a root NFS that uses the same directory as the permanent one).
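As a rough sketch of the rule table described above (not a working provisioning system), the decision "boot from local disk unless a temporary rule says otherwise" might look like this in Python. The host names web1..web8 and the image names are hypothetical.

```python
# Default: every machine boots the site image installed on its local disk.
LOCAL_IMAGE = {
    "web1": "site-x", "web2": "site-x", "web3": "site-x", "web4": "site-x",
    "web5": "site-y", "web6": "site-y", "web7": "site-y", "web8": "site-y",
}


def boot_plan(host, netboot_rules):
    """Return (boot_source, image) for a host given temporary override rules."""
    if host in netboot_rules:
        return "network", netboot_rules[host]
    return "local-disk", LOCAL_IMAGE[host]


def machines_per_site(netboot_rules):
    """Count how many machines end up serving each site after a reboot."""
    counts = {}
    for host in LOCAL_IMAGE:
        _, image = boot_plan(host, netboot_rules)
        counts[image] = counts.get(image, 0) + 1
    return counts
```

With no rules the split is 4/4. Adding rules that point web7 and web8 at the site-x network image gives the 6/2 split, and deleting the rules restores the original layout on the next reboot.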

--
--That's the point of being root, you can do anything you want, even if it's stupid.
Re:high availability of the service by einhverfr · 2005-09-27 13:11 · Score: 1

Actually, per definition of RAID 5 - no, you *shouldn't* be able to lose 2/5ths of your disks at the same time without losing the array.

Depends what "at the same time" means. If it happens a few minutes apart and you have a hot spare, your data might be saved by the spare.

--

LedgerSMB: Open source Accounting/ERP
Re:high availability of the service by dwater · 2005-09-27 15:46 · Score: 2, Interesting

This is my take on RAID5 - I take it as written that I will be corrected :

You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated on the spare before it's back to its 'redundant' state.

Therefore, the time between the two disks failing *must* be enough for the regeneration to complete, else you'll lose everything(?).

This can take quite some time.

On my s/w RAID5, it takes hours.

Furthermore, the process causes significantly more disk activity on the remaining disks, increasing the risk of another failure.

The time taken to regenerate the array depends on many things. The ones I can think of :

1) speed of the read performance of the disks remaining in the array, and the write performance of the spare.
2) controller performance (or interface(s)/chipset/memory/cpu performance if s/w RAID).
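A back-of-the-envelope estimate, as a sketch only: whatever the bottleneck from the list above turns out to be, the whole replacement disk has to be written once, so capacity divided by the sustained rebuild rate gives a lower bound. The function name and the example figures are assumptions, not measurements.

```python
def rebuild_hours(disk_capacity_gb, rebuild_rate_mb_per_s):
    """Rough lower bound on rebuild time: the whole replacement disk
    must be written once at the sustained rebuild rate."""
    seconds = (disk_capacity_gb * 1000) / rebuild_rate_mb_per_s
    return seconds / 3600
```

For example, a hypothetical 250 GB disk rebuilt at a sustained 20 MB/s comes out at roughly three and a half hours, and that is before the array has to serve any normal traffic at the same time.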

Of course, once the array is back to its redundant state, it still isn't back to 'normal' because the spare is now missing and needs to be replaced.

Better to use mirroring (0+1?) in some way, IMO. If you have 3 mirrors, you can hot swap one 'reflection' and immediately replace it - keeping many off-line 'reflections' in the same way as you would tape and using it as a backup.

I think 0+1 is faster too, since the '0' is a stripe. Of course, it's also more expensive to obtain a given capacity.

--
Max.
Re:high availability of the service by afidel · 2005-09-28 02:07 · Score: 1

DO NOT listen to this advice. Every real RAID controller I have ever worked with requires exactly matched disks. In fact having different firmware versions on the drive controllers can cause problems. You can get away with unmatched disks using pseudo software controllers but I still wouldn't do it because I really doubt the manufacturer has tested such a configuration. Total software setups can of course use whatever as they are just a layer on top of the already existing disks and drivers.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:high availability of the service by Nutria · 2005-09-29 09:56 · Score: 1

You can lose one disk in a RAID5 and it still works. However, the contents of the failed disk need to be regenerated on the spare before it's back to its 'redundant' state.

[snip]

This can take quite some time.

On my s/w RAID5, it takes hours.

This is what 15K RPM SCSI drives and expensive RAID controllers are for.

s/w RAID is great 99% of the time, but...

--
"I don't know, therefore Aliens" Wafflebox1

Um, details? by afabbro · 2005-09-27 05:36 · Score: 1, Insightful

You have a budget of $1 million?
You are hosting this on a 56K dial-up in your root cellar?
Your apps need to run on Microsoft Windows or HP-UX or...?
You've got a SAN or local disk or...?
You're using home-built white-box x86s or Sun E15000s or...?
You have sysadmin talent on hand? You're outsourced to IBM global services?

Who vets these silly questions? Oh, I forgot - the "Editors".

--
Advice: on VPS providers

Where's the link? by azuroff · 2005-09-27 05:36 · Score: 5, Funny

Just post a link to this server of his - we'll gladly stress-test it for him at no charge. :)

Identify, Prioritize, Budget by Nos. · 2005-09-27 05:59 · Score: 3, Informative

From the internet to the resources you are trying to provide, identify every point of failure. Power outage, uplink, router, switch, server, etc. Prioritize - which are the most likely to fail? Budget - which ones can we afford to, or are cost effective, to duplicate? Clustering may be an option, but might be too expensive. What about a cold/hot standby - reduces downtime overall. You can find relatively inexpensive UPSs just about everywhere. Making your entire network redundant can take a lot of time and money.

Not a bad idea... by Ingolfke · 2005-09-27 06:06 · Score: 2, Funny

I think sysadmins would respond very nicely to tips for increasing server availability. Let's say the average tip is about 15%, and for simplicity's sake we'll say that every hour of downtime costs about $10,000. Baseline service should be set either by an SLA or a measured baseline. For this we'll say 99% uptime per month. That allows for 7.2 hours of unscheduled downtime in a month. So if the server is up for 100% of that time, then you'd want to tip your sysadmin about $10,800 per server (assuming the #s stay the same) for a month of great work.

Re:Not a bad idea... by hatch815 · 2005-09-27 06:12 · Score: 1

Who then pays when it is down? SLAs for the most part are a joke. Did you ever get your $10,000 per hour back for downtime? I had multiple T-1s down for a week and never got a single penny. The owners estimated that we lose 8K per hour per T-1. That week was 6 T-1s x 10 hours x 4 days = 240 x 8K.

Hire a Professional? by marcus · 2005-09-27 06:13 · Score: 3, Insightful

That is all...

--
Good judgement comes from experience, and experience comes from bad judgement.
- W. Wriston, former Citibank CEO

Re:Hire a Professional? by tgbrittai · 2005-09-27 08:51 · Score: 1

A server's stability is a transitive property: it is about as stable as its administrator's personality.

A third party product is simply an additional tool in this case. One he will have to manage in addition to his server. It will help, but your friend needs to look at how he manages his technology (including product selection). I'm not trying to bust his chops but he can learn it sooner or he can learn it later. I wish him luck...

Define then plan by linuxwrangler · 2005-09-27 06:15 · Score: 4, Informative

You will find that "availability" is a vague term. First you need to have a discussion to determine what availability means. It must be able to be put in measurable and non-vague terms. "99% uptime" is not a good definition. "The system must handle 99.7% of requests in 30 milliseconds or less" is much better, in part because it includes a performance expectation. It also recognizes that not every request will receive the desired level of response. Additionally, if you determine that you want N+1 redundancy then you need to know the appropriate value of N (how many servers are needed to provide our required response times).
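A definition like that is easy to check mechanically, which is part of its appeal. Here is a minimal sketch in Python, assuming you already have a list of per-request latencies in milliseconds from your logs. The function names are invented for the example.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests answered within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests measured")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


def meets_target(latencies_ms, threshold_ms=30.0, target=0.997):
    """True if at least `target` of the requests met the threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target
```

Note that a site that is "up" but answering slowly fails this test, which a plain uptime percentage would never show.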

You may find that one valuable outcome of this exercise is that it puts everything on a sliding scale rather than a managerial edict of "just make sure we don't go down." It also means that costs can be attached to everything. Peak time slowness is OK and we can take the system down 30 minutes each night for maintenance? Here's the tab. No maintenance windows allowed and peak-load must be handled well? That costs more. We need to stay up even if a hurricane/earthquake/volcano/terror-attack/plague-of-locusts destroys our primary site? Cough up the dough.

Managers deal with money/value issues all the time and expressing things this way is really just giving them the info they need to do their job.

Once you know the requirements, list everything that may impact your availability including hardware, OS, application(s), network switches, internet connectivity, etc. And it doesn't just include the web server - any database, app-server, DNS server, load-balancer or other necessary piece of the puzzle must be included as well. You will have to determine the likelihood of failure of each piece, its impact on your defined goal, and the speed with which the failure must be corrected.

With this in hand you can start to make informed decisions on whether to have single drives (since your servers are mirrored), non hot-swap drives, hot-swap drives or hot-swap drives with warm spare. You can determine if you need hot redundant networking or if a spare switch on the shelf is good enough. Can you page a tech and have him be there in 2 hours or do you need people on-site 24/7?

A personal note: to be really well covered you have to have multiple sites located at significant distances from each other. I've suffered FAR more cumulative downtime due to fiber cuts (when a backhoe hits an OC192 the backhoe wins and large parts of the city lose) than to all other failures combined. Colo facilities have suffered downtime due to improper use of the Emergency Power Off switch or a large natural disaster. To do this you can use DNS failover (from the inexpensive but effective dnsmadeeasy to the high-end and pricey UltraDNS) to switch traffic to your backup site within a few minutes or, if you are really big (i.e. can afford $$$), you can use routing protocols to reroute the traffic to your other location at the TCP/IP level very quickly. But one nice thing about having two sites is that each individual site doesn't need to be as highly reliable in order to achieve the desired system reliability.
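The last point can be put into numbers. This is a sketch under two big assumptions: the sites fail independently of each other (which is the whole reason for putting distance between them), and the failover mechanism itself always works.

```python
def parallel_availability(site_availability, sites):
    """Availability of a service that is up if at least one site is up,
    assuming sites fail independently and failover itself never fails."""
    return 1 - (1 - site_availability) ** sites
```

Under those assumptions, two sites that are each only 99% available combine to about 99.99% for the service as a whole. Real failover takes a few minutes each time, so treat the result as an upper bound.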

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis

Re:Define then plan by azrider · 2005-09-28 06:44 · Score: 1

What I found running maintenance operations at a large, multinational bank, is that the definition of down time is very important. You need to separate "scheduled" down time (patches, preventative maintenance, etc) from unplanned downtime (disk died). Otherwise, you will never get to a reasonable benchmark.

--
And ye shall know the truth, and the truth shall make you free.
John 8:32(King James Version)
Re:Define then plan by linuxwrangler · 2005-09-29 14:14 · Score: 1

You are correct. Definitions are key. Be VERY VERY careful what constitutes "scheduled" downtime. Scheduling downtime to fix a problem is unacceptable.

Circa Y2K, we had a provider who promised full redundancy on everything but they had a piece-o-crap load balancer and firewall from a company that got bought by Cisco. Every so often this would cause an amusing array of serious network problems but only on a portion of the sites that were handled by this equipment. This had a severe impact on our site but rebooting the balancer would "fix" the problem. The company would refuse to reboot because it would impact the other users of the balancer, they would claim that since some packets were handled correctly we weren't "down", then they would schedule some downtime that night to reboot and tell us that we weren't entitled to compensation because the downtime was "scheduled", not unplanned.

Of course these were the same jerks (who shall remain nameless but who trade under the symbol NAVI) who, when a machine would go down, would get alerts on http, https, ping, smtp, and all the many other system checks. Fine, so far - the tech would spend 5 minutes rebooting the machine. But when we got the bill it was 5 minutes for http, 5 minutes for https, 5 minutes for smtp, 5 minutes for ping... Somehow a single reboot would cost us an hour and a half. When we challenged them, they said they didn't have time to correct their bills and tried to get us to do their work for them. I guess that shouldn't surprise me - we had to diagnose their load-balancer problem for them, too.

--

~~~~~~~
"You are not remembered for doing what is expected of you." - Atul Chitnis

Uptime ++ by Anonymous Coward · 2005-09-27 06:16 · Score: 3, Informative

Hmm, two RAID drives failed simultaneously? This is possible but so unlikely, it isn't worth mentioning. Either the equipment he (Freudian You) is using is utter crap or they didn't really fail at the same time. Most likely one failed and no one noticed until the second failed. Or possibly He/You were using software RAID in which case your OS or you failed and caused the apparent drive failure. In any of these cases the real cause of the failure is him/you!

Regardless of how badly you may have done things in the past, here's how to prevent problems in the future. First, start with top-shelf equipment like HP Proliant servers. Sure, there will be flames for this recommendation but, think about it. There is a reason that HP Proliant servers are the ONLY choice in almost every Fortune 1000 company. Regardless of some anecdotal whining about poor support that is sure to follow my post, HP Proliants ARE that good!

Use hot-pluggable SCSI drives attached to a battery backed-up RAID controller and a hot-spare drive. Do NOT use IDE and you might even want to forgo SATA though they are a possibility. Use high quality ECC memory, dual processors, redundant power supplies and don't forget to fully utilize HP's management utilities to monitor and manage the server. SNMP and HP's Insight Manager will not only let you know, via alarms or alerts or pagers or email, before a drive fails; they will also let you know when your logs are getting too large or your utilization is too high, or even restart Apache for you should it fail.

Now this is all well and good for greatly reducing downtime to almost none but, it doesn't guarantee uptime. To guarantee 100% uptime you need to implement redundant systems behind a load-balancer or implement a cluster. If you're super paranoid, both. Naturally you need to also have redundant power sources and network connectivity with all this so that you do not have a single point of failure ANYWHERE.

Naturally, all this will cost big piles of cash. But that's what it takes for 100% uptime. If you're going to try and use white-box desktop hardware you've already failed it!

Re:Uptime ++ by ewwhite · 2005-09-27 07:12 · Score: 1

Eh... two drives... it can happen. I manage 175 remote Proliant servers ranging from lowly ML330's to 8-CPU DL740's. These machines are reliable and have excellent value-added monitoring features, but anything can happen. I average about 20-30 drive failures annually across those 175 machines. I had two drives fail simultaneously on a four-drive Raid 10 array recently following a power-outage. The drives wouldn't spin up. And unfortunately, they were mirror pairs. I didn't have any predictive-failure or S.M.A.R.T. errors leading up to it.... but I certainly did lose the array.

--
Edmund White
http://flickr.com/ewwhite
Re:Uptime ++ by Kalak · 2005-09-27 12:36 · Score: 1

Ever heard of a failure somewhere else other than the drives causing the failure? The controller? The Cable? 2 drives from the same batch that had a flaw? And saying it is going to cost big piles of cash, you obviously haven't heard of how Hotmail and Google started. (Hotmail back when it ran BSD.) He's not asking for marketing speak for big iron, but real examples. My One Big Servers have cost me more sleep than my little cheap, but well thought out cluster ever has. And the cluster has been running for over 3 years now with only a halon trip in the data center and a planned upgrade to the power systems taking it down. My One Big Servers go down every time they need a reboot.

Also, if you really knew your stuff, you'd know that a load balancer includes a basic cluster of at least 2 machines, so saying "If you're super paranoid, both." shows you don't work with load balancers.

(for my set up, see my above posting - brand names removed to protect the innocent, but they make both low and high end servers, and the high ends are outside the cluster because if I could cluster the apps, I'd save the money.)

Real demanding applications (your Fortune 1000) *will* have load balanced clusters, even if they use expensive hardware, and that is the key.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
Re:Uptime ++ by dtfinch · 2005-09-27 15:47 · Score: 1

Rebuilding is pretty disk intensive. It can push a second drive that's near failure over the edge.
Re:Uptime ++ by wild_berry · 2005-09-27 22:49 · Score: 1

I'm no expert but that sounds like the filesystem or RAID program isn't doing a good job of planning where it puts the data (or its hash) in the array. While there will be benefits in avoiding putting data in the same access pattern across an array of drives (if the drive heads all scratch the same area of disk surface), if you have the data lined up nicely, rebuilding the arrays should be lighter work on the drive.

I would advise that the MTBF divided between the drives in the array be used to give a guide as to when risk of failure is increased to the point that you perform a pre-emptive drive retirement.
Re:Uptime ++ by Anonymous Coward · 2005-09-28 02:03 · Score: 0

It's interesting that people such as yourself that use lesser equipment are the ones that have experienced most multidrive failures. Don't you think so? Furthermore, I did say they were possible but, when using good kit, they are extraordinarily rare which is why most people question the likelihood of multidrive failures.

You lost me with your other ramblings though... HotMail and Google were cheaper why? Because they ran tons and tons of cheap hardware or because the hardware was provided by the university? Do you not think that someone shelled out a lot of cash for the equipment?

As for your clustering comments... Two machines, by themselves, do not make a cluster. The machines within a cluster are aware of each other and interact with each other. Load balancers can load balance to numerous independent machines that are completely unaware of each other and have no interaction whatever. Load balancers do not necessarily create clusters nor are clusters a requirement. It can be a simple matter of balancing requests between multiple independent machines. Additionally, you can have clusters behind load balancers. I won't even bother discussing that there are different types of clusters.
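To make "balancing requests between multiple independent machines" concrete, here is a toy round-robin picker in Python. It is a sketch only: the backends never talk to each other, only the balancer tracks their state. The class and method names are invented, and real balancers add health probes, weights, and session handling.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand requests to independent backends in turn, skipping down ones."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none are left."""
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")
```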

(for my set up, see my above posting - brand names removed to protect the innocent, but they make both low and high end servers, and the high ends are outside the cluster because if I could cluster the apps, I'd save the money.)

I am happy for you and your setup! But, your situation is probably not the same as everyone else and what works best for you may not suit everyone.
Re:Uptime ++ by Kalak · 2005-09-28 02:29 · Score: 1

It's interesting that people such as yourself that use lesser equipment are the ones that have experienced most multidrive failures.

I never said that I experienced a multi-drive failure. I am aware of how they come about, and I know of those who have experienced them. You are making assumptions that are not supported by fact.

HotMail and Google were cheaper why?
I also never said this. I said they used less expensive components. I made no statement as to their total cost. Again, you are making assumptions that are not supported by fact.

Two machines, by themselves, do not make a cluster. The machines within a cluster are aware of each other and interact with each other.
Again, you show your lack of research into the subject:
http://en.wikipedia.org/wiki/Computer_cluster#Load_balancing_clusters
http://www.linux-ha.org/FAQ#head-7f4d8eec3b4075a464bab8ccedfc1c970cb2cd29
Your statement implies a HPC cluster:
http://en.wikipedia.org/wiki/Computer_cluster#High-performance_.28HPC.29_clusters

But, your situation is probably not the same as everyone else and what works best for you may not suit everyone.
Yet, you spout that the only "real" way to achieve high reliability is using *your* recommendations, using *your* recommended hardware. I agree that the solution to clustering is situationally dependent upon what the needs of availability are, the budget constraints, etc. I present a possible scenario, and corrections to your statements, not an answer for the submitter. You present your statement as "all this will cost big piles of cash", to which I provided counterexamples showing that it need not require a large budget.

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)
Re:Uptime ++ by Linux_Bastard · 2005-09-28 08:59 · Score: 1

Raid 5 was specified.
That's what Raid 5 does.
Rebuilding a drive in raid 5 is a long intensive process, as it has to rebuild the parity.
Raid 5's often fail when a second near end of life drive dies while rebuilding a failed drive. The number one cause of this is heat buildup over time IMHO. Usually when a raid 5 loses 2 drives, it is actually the case that one drive failed, and there was no hot spare to take over; when the second drive fails (at some later date) the raid is broken. Anyone running a raid 5 with no hot spare is either an idiot or out of $$$.

http://www.pcguide.com/ref/hdd/perf/raid/levels/single_Level5.htm

If you want more resilience, go with multiple mirror raid1
Mirrors just duplicate the data exactly, which yields far quicker rebuilds.

Most people go raid 10 (Striping across mirror sets)

http://www.pcguide.com/ref/hdd/perf/raid/levels/multLevel01-c.html

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q

Probability of simultaneous two disk failure by metoc · 2005-09-27 06:23 · Score: 2, Insightful

are extremely low given the MTBF of modern drives. You have a better chance of a power supply or fan failure.
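To put a rough number on "extremely low", here is a sketch in Python. It assumes drives fail independently at a constant rate of 1/MTBF, which is exactly the assumption that a shared production lot, shared heat, or a shared power supply breaks. The drive count, window, and MTBF in the usage note are hypothetical.

```python
import math


def second_failure_probability(remaining_drives, window_hours, mtbf_hours):
    """Chance that at least one surviving drive fails within the window,
    assuming independent drives with a constant failure rate of 1/MTBF."""
    return 1 - math.exp(-remaining_drives * window_hours / mtbf_hours)
```

With 4 surviving drives, a 24-hour window before the first failure is dealt with, and a quoted MTBF of 500,000 hours, this comes out at about 0.02%. If the observed rate is far higher than that, the failures were probably not independent.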

On that basis I am going to make some wild assed guesses that are more probable given the little information we have.

1) the drives were consumer models from the same production lot,
2) the death of the first drive was not immediately noticed,
3) compatible replacement drives are not easy to come by (no hot spare),
3) the second drive died before the first one was replaced,
4) the server did not have hot swap drive carriers
5) someone tried to replace the dead drive in the running chassis

If you don't like my guesses provide your own

Re:Probability of simultaneous two disk failure by knarfling · 2005-09-27 07:10 · Score: 1

I have actually had this happen. Actually, option 4 is closest to it, although the time between failures was too short to do anything. We have an older Dell computer (First mistake!) with a SCSI hot swap RAID. One of the drives failed, and we replaced it with the spare and ordered a second drive. About a day later, another drive failed before the replacement arrived. For some reason, the Windows Admin removed the drive even though there was no replacement available. He placed the bad one back in, and within moments, a third drive failed. There is some discussion about whether it was the drives or the backplane, but the system reported that two drives failed within minutes of each other. Fortunately, we were able to restore from tape to another server (slated for a different purpose) and lost only a day's worth of work.

--
Great civilizations have lived and died on false theories. Don't mess up mine with a few facts.
Re:Probability of simultaneous two disk failure by mangu · 2005-09-27 07:32 · Score: 4, Insightful

If you don't like my guesses provide your own

6) The drives are overheating. This happened to my two Seagate 200Gb drives. Had to mount them in a heatsink, the normal bay does not provide adequate cooling for Seagate's 7200 rpm drives.
Re:Probability of simultaneous two disk failure by HawkingMattress · 2005-09-27 07:39 · Score: 1

6. PSU went crazy and killed all the drives in the array and the spares instantly.
Happened to us... It didn't stop at the hard drives either.
But yes, it was cheap hardware.

HA is elusive by Ropati · 2005-09-27 06:27 · Score: 3, Informative

Preventing downtime is an expensive, time consuming exercise, with few limits.

Before tackling the problem of downtime you should consider how much downtime is acceptable. See the discussion on downtime at the Uptime Institute regarding what is acceptable. Are you looking for 99.999% uptime? Dream on.

Specifically you need to make everything in your system redundant. The web servers need to be redundant, you need to have redundant copies of the data. The paths to the internet need to be redundant and the environment should be remote and redundant.

Once you get a handle on your environment, you should consider some sort of clustering technology for server duality. I suggest you read "In Search of Clusters" by Gregory F. Pfister to get a fundamental understanding of the technology.

As was posted earlier, you might just want to throw in the towel and accept web hosting. Use the Uptime Institute specifications against the ISP's service level agreement.

You might also consider a local ISP co-lo and do your own remote clustering.

--
machinator omnis sine licentia

Re:HA is elusive by Miniluv · 2005-09-28 02:08 · Score: 1

Are you looking for 99.999% uptime? Dream on.
Why? What makes 5 nines unachievable, especially for such a simple setup as described in TFQ?
HA for web applications isn't very difficult, it just requires doing a little architecture up front, and spending the appropriate amount of money. Load balanced N+1 clusters, multiple redundant Internet links and their associated hardware, none of this is difficult or even that expensive today.
There are very few places in the computing world where real HA is even remotely difficult, and even in those places there are workable, if not actually good, solutions available.
Re:HA is elusive by Linux_Bastard · 2005-09-28 10:09 · Score: 1

You do realize that that is less than six minutes of downtime per year?
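The six-minute figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 365-day year and treats availability as a plain percentage of wall-clock time; the function name is made up for illustration.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows {downtime_minutes_per_year(pct):.2f} minutes of downtime per year")
```

Five nines works out to roughly 5.26 minutes per year, which is where "less than six minutes" comes from.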

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-09-28 10:14 · Score: 1

Yes, I'm fully aware. And like I said, it's not that hard to achieve. You have to be talking about service availability though, not component availability.
I fully agree that keeping any individual server to 5 nines availability is generally an unattainable goal. In my environment we strive for 3 nines on individual components, and 5 nines on the overall service. With that goal we build HA into everything we design, whether it's through load balancing, heartbeat/IP takeover, or building the failover logic into our custom applications. We pick the setup that we'll use for each component when we're architecting the platform, since each piece has different limitations, etc.
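A rough way to see why 3 nines per component can still add up to 5 nines for the service is sketched below. It rests on strong assumptions that are not from the post: component failures are independent, failover is instant, and the function names are made up for illustration.

```python
def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of a tier that stays up while at least one of n
    redundant components is up (assumes independent failures)."""
    return 1 - (1 - component_availability) ** n

def series_availability(*tier_availabilities: float) -> float:
    """Availability of a service that needs every tier to be up at once."""
    result = 1.0
    for a in tier_availabilities:
        result *= a
    return result

if __name__ == "__main__":
    tier = parallel_availability(0.999, 2)   # two 3-nines boxes in one tier
    service = series_availability(tier, tier, tier)  # e.g. three such tiers
    print(f"one redundant tier: {tier:.6f}")
    print(f"three tiers in series: {service:.6f}")
```

Under these idealized assumptions a pair of 3-nines boxes gives 0.999999 for the tier. Real numbers come out lower, because failures are often correlated (shared power, shared switch) and failover takes time.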
Re:HA is elusive by Linux_Bastard · 2005-09-28 15:53 · Score: 1

Five nines is difficult, even for simple services like web serving. For this guy's 30K budget, it will be very difficult. He will need at least 3 machines + NAS. Several whitebox machines would be fine, and even cheap, but the shared storage will not be cheap: IBM4200 fiber-attached SATA or similar. Only if his web content is very static can he get by without an external RAID (all content on each server, rsync and a gang of servers).

I do clustering for db, and we shoot for "NO Nines" (100%). Our apps are all medical and no outage is acceptable. Our best cluster is 100% after 493 days (three tier lvs/apps/db). The best box in that cluster is at 370 days. Honestly though we only average 99.997%, but more than half the outage is from our leased-line service provider, not the clusters. Onsite availability is over five nines.
Just on the off chance, do you use Saru?

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-09-29 02:46 · Score: 1

I'd disagree on the cost of the shared storage, especially depending on how it's done. I'd also disagree that whether the content is static or not really makes a difference. If he's running a bunch of CGIs (as an example), which all pull data from a database, then simply doing a heartbeat/IP-takeover pair of whitebox DB servers isn't that expensive. The shared storage there could be expensive, but depending on the DB engine may not be necessary.
We've bought some really large storage solutions at really cut-throat pricing by finding small vendors with decent/good tech. Area Systems is one vendor we used to use who provides SATA->SCSI solutions at very competitive prices. We bought a 1.2TB enclosure from them for around $5k, and it performed pretty darn well. The only hesitation I have is doing shared storage with SCSI, due to the lack of hardware level arbitration on access. I'm sure there're some boxes that do provide this, but its not the norm. I've been burned on clusters doing this when both heads bounced at the same time (idiot predecessor plugged them into the same circuit in the colo), and since they booted at the same rate, they both tried to fsck the file system at the same time. Thanks Veritas for that bug! If you want to do it right, then you're absolutely correct that it costs money, cuz FCAL just ain't gettin' cheaper.
Another option would be to buy a used NetApp or similar device from Ebay for a couple grand and just mount the entire content tree for the webservers via NFS. Then you can push content to one place, the sites all automagically pick it up, and you've got a highly available filer managing the data.
The cost of this stuff just keeps coming down, which is why we're all expected as sysadmins/netadmins/etc to keep driving availability up. Thankfully there's also some really good F/OSS packages out there which are maturing rapidly to provide some of these HA services.
I've never actually heard of Saru, got a link and maybe some info? I'm always interested in new technologies, particularly ones related to clustering/HA.
Re:HA is elusive by Linux_Bastard · 2005-09-29 05:08 · Score: 1

My point about static content was that if his content is static, he can avoid having shared storage. No shared storage significantly reduces the price tag. The problem of using a NetApp or similar is the single point of failure. The cheapest possible way with shared storage would be a 2 box nfs server cluster with shared scsi. Finding hardware that is reliable for this can be problematic. LSI Logic MegaRAID (perc 3) is usable and not too expensive... I built one cluster that used AFS, but distributed file systems just don't cut it yet.

Saru is a package that allows active-active load balancers (LVS Linux directors).
With active-active, there is no single point of failure in the LVS, as the LVS directors load balance themselves and share state.
The really nice thing here is that in some cluster layouts, the LVS directors are the real servers. All the boxes load balance without the need for dedicated LVS directors and with no bottleneck or single point of failure.

http://www.ultramonkey.org/
http://www.ultramonkey.org/papers/active_active/
http://www.ultramonkey.org/papers/active_active/active_active.shtml

If you know of any other active-active LVS let me know.

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-09-29 08:49 · Score: 1

You don't actually need active/active to avoid the single point of failure, though I do see some advantages in terms of resource costs to going active/active. No matter what you won't pick up a huge amount of savings, since you can't load active/active boxes as heavily as active/passive but it does look a little better when presenting to a cost-conscious management team.
I've seen the ultramonkey stuff before, and even deployed a fairly complex LVS balanced cluster (the pair of lvs directors actually had a single transaction balanced through them 5 different times as different resources talked to each other) with heartbeat and so forth. For F/OSS it's pretty slick. Not as easy or nice as, for example, the Foundry ServerIrons I use in my current environment, but still damn nifty considering the price.
NetApps don't have to be a single point of failure, they have supported real clustering for an awfully long time. I use a pair to serve the message store for my mail cluster, and while I've never actually suffered a head failure (my head uptime is 650 days 21 hours as of just now, that being the exact amount of time since they were installed), I've thoroughly tested failover and giveback and it all works seamlessly.
I think ultimately we're in agreement that there are a number of good, low cost solutions to solve various aspects of availability, we've just worked with different pieces of them in different respects. Thanks for repointing me to the LVS stuff, because I'd not kept up and they've done some amazing maturing in the couple years since I last looked at it.
Re:HA is elusive by Linux_Bastard · 2005-09-29 10:16 · Score: 1

With active-passive, when you lose the active director, you lose the session mapping through it. With more complex things going on through LVS like telnet or remote instruments or ftp, or even some .asp, when you lose the LVS state, you lose the connection. Sure they can just reconnect, but when 400 techs & data entry people get bumped off, and 40 instruments in a lab have to be reset and reloaded, that's not an acceptable level of service. For continuous sessions, active-passive is a single point of failure. Additionally, the active-active solution is scalable, where active-passive isn't (for a single connection type anyway). With equal boxes, you get better performance and service with active-active. With Saru and UltraMonkey, configuration is something of a pain, but on price/performance nothing I know can touch it.

Clustered NetApps... Hmm, I've never really considered that. I guess I need to get some money in next quarter's budget for "NAS research". There are a few niches here for lightweight storage with nothing really the right fit, and EMC wants an arm and a leg.
Thanks for the idea.

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-09-30 02:12 · Score: 1

I can speak from experience that active/passive versus active/active is not the determining factor for single point of failure with load balancers (or anything else for that matter). It merely determines whether both nodes are providing services or if one is in standby.
Some examples of services I've run in active/passive with full stateful failover include Cisco PIX firewalls, Foundry ServerIron load balancers, NetApp filers, a pair of LVS directors (there was an app you ran which passed state data from the active node to the passive in near real time so that a failure didn't kill stateful connections) and Cisco LocalDirectors. All of these except the LVS nodes required a direct physical connection (Cisco, Foundry and NetApp all use proprietary hardware cabling for this purpose).
I definitely agree though that LVS is untouchable for price/performance in virtually every situation. And definitely look into NetApp. Their pricing is a lot more aggressive than EMC, and they're a much better NAS solution. The only downside is that until you get to near the top of their model line at the moment, you can't use anything except FCAL drives, which are pricy. We're taking delivery on a pair of FAS3050s in a couple weeks, which will allow us to start deploying some SATA storage for stuff that doesn't demand quite as much performance as the FCAL provides. The other cool thing about NetApp is their support for trunking ethernet connections (it may only work with Cisco switches though). Our filers have 6 ethernet interfaces, 3 into each of our core switches, trunked for 300Mbit and then trunked for failover. NetApp really does understand HA, and has support for it built into their product at just about every level. The coolest thing I've experienced with them was them calling me while I was out at lunch to notify me that not only had a drive failed, but the replacement would arrive in about 3 hours. What truly blew my mind was the tech trying to convince me to let them send a field service person out to swap the drives, since I paid for it they felt they should provide it. Find me another vendor who'll insist upon incurring cost to themselves for no reason other than that they feel you deserve the service. Oh yeah, and they let you pick from 6 different genres of hold music when calling tech support. I don't know why, but that just kills me.
Re:HA is elusive by Linux_Bastard · 2005-09-30 08:05 · Score: 1

I really have to look into NetApp. Service levels are critical to us and they sound outstanding.

I've never worked with Foundry ServerIron load balancers or NetApp (to do this). Unfortunately our clusters are not the standard type of workload usually found behind this type of gear. Cisco, Veritas and Piranha (RedHat LVS) all failed in work simulation testing. There were too many dropped connections and timeouts. The Cisco gear failed over ok, but had problems with failback and high load. Veritas was just a waste of money, and piranha/ipvsadm weren't working together. Ultramonkey+Saru passed, but had issues with increased latency under high load when in straight NAT mode. Like I said, our RDC clusters are a long way from the run of the mill web farm.
YMMV

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-09-30 09:53 · Score: 1

Overall I think the foundries are excellent devices. They balance my mail cluster exceptionally well. The site here has a number of ip based virtual hosts on a cluster of 7 servers, and the foundries balance each virtual host separately which means that on a server load basis the balancing is uneven. For a single host to a group of machines though they're really quite good. The failover and failback works quite well also.
I strongly recommend NetApp to everyone I encounter professionally who needs HA NAS. They also support iSCSI (for the last 2 years they've been giving the licenses away for free with the purchase of the filers to drive adoption, and I believe they're still doing so). We've used that for our windows ms-sql servers since they don't support storing databases on a CIFS share. There is some delay when they give back after a head failure, but it's clearly documented exactly what the delay is, and my testing validated that their docs were 100% accurate.
I have to agree that Veritas is just crap. Everything I did with VCS was incredibly painful, and the failover/giveback was shite. Even worse, when both sides failed and came back they couldn't properly arbitrate which side was master, despite the config file specifically stating which head was the default master, and they ended up eating 500G of mailboxes. That turned into a 32 hour support call, which in turn woke up a senior engineer who dug deep into his cvs tree and found a mysteriously hacked up version of fsck that knew how to deal with the fall out of this specific situation. While on the one hand that was great because it recovered the vast majority (probably 95%+) of the data, it also scared me that they had this lying around, because that really indicated to me that this was a recurring problem for them and that's pretty bad for a clustering product.
Re:HA is elusive by Linux_Bastard · 2005-10-04 03:24 · Score: 1

With our inhouse expertise in LVS, I'm not likely to use any of the Foundry ServerIron gear, it's too hard to justify on the budget. I like getting new kit to play with though. Where would Foundry have a reasonable justification cost (over LVS). Where is it that the Foundry ServerIron is best suited?

What, an iSCSI implementation that's not junk? I'll believe it when I see it.

I really can't say too little about Veritas. They have been costing me sleep since UnixWare 7.1.1.
Why is it that the beancounters are happy to fork over scads of money to Veritas? I've worked with some that actually require its use, even where there is no benefit at all (even if it worked!). And don't even get me started about the Veritas backup.

--
F X=0:1:9999 F D=2:1 Q:((X>2)&(X#D=0)!((D>X/2)&(X'=1))) I D>(X/2) W:$X>75 ! W X,?$X+5-$l(X) Q
Re:HA is elusive by Miniluv · 2005-10-05 01:58 · Score: 1

I doubt the ServerIron gear, for example, has any significant advantages over a well run LVS cluster, which it sounds like you've got. There are some higher end LBs out there that I've looked at to do things like SSL acceleration, caching, etc. Stuff that LVS doesn't really hook into very well, though you could easily build from open source components should you want those features. For my environment SSL acceleration is a big deal, because I'm running 30+ sites with different certs, and therefore stuck doing IP based virtual hosting. SSL acceleration would let me switch to name based hosting on my apache boxes, greatly simplifying their config, while still offering separate certs for each site.
The Windows iSCSI initiator is actually halfway decent, and the NetApp iSCSI implementation is seemingly bulletproof. I've only played a bit with the one for Linux (which was jointly written by Cisco and NetApp), but so far it appears pretty solid. It's not a protocol I'd build large parts of my network around just yet, but it's emerging as a pretty cool alternative to FC and the whole fiber router/switch/etc mess.
Veritas to me is in the same camp as Sun. They used to have market leading technology, and they still have market leading high prices, but the bean counters have all heard of them so they think this is a good reason to pay top dollar. In a year or two the bean counters should start hearing the negatives about Veritas and then maybe they'll see the decline they so richly deserve.

I love you, man! by Anonymous Coward · 2005-09-27 06:30 · Score: 0

Could you negotiate my contract for me? Tips or paid bonuses for uptime... I love you, man!

Depends on budget and requirements by jd · 2005-09-27 06:41 · Score: 1

If you need High Availability (ie: almost, but not quite, 100% uptime) then you want two or more boxes which you can either load-balance between (dropping crashed servers from the list) OR which you can fail-over to in the event of a problem.

Use a service monitor, such as Hobbit or Big Brother, to monitor the services. If a service fails once, have it auto-restarted. If it fails repeatedly, have the monitor reboot the box automatically. If the server keeps crashing, or if recovery locks up, then have it notify you to intervene.
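The escalation described above (restart first, reboot on repeated failure, page a human last) can be sketched as a small decision function. This is plain Python for illustration, not Hobbit or Big Brother configuration; the thresholds and action names are assumptions, not anything from those tools.

```python
def next_action(consecutive_failures: int,
                restart_limit: int = 1,
                reboot_limit: int = 3) -> str:
    """Pick an escalation step from the number of consecutive failed checks.

    restart_limit and reboot_limit are hypothetical thresholds:
    up to restart_limit failures -> restart the service,
    up to reboot_limit failures  -> reboot the box,
    beyond that                  -> stop automating and notify an admin.
    """
    if consecutive_failures <= 0:
        return "ok"
    if consecutive_failures <= restart_limit:
        return "restart_service"
    if consecutive_failures <= reboot_limit:
        return "reboot_host"
    return "notify_admin"

if __name__ == "__main__":
    for failures in range(6):
        print(failures, "->", next_action(failures))
```

The final step matters most: automatic recovery that loops forever on a broken box hides the problem, so the policy has to end with a person being told.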

If you need guaranteed 100% uptime -and- the software is more reliable than the hardware (for whatever reason), your best bet is to run two boxes in parallel and have BOTH serve all requests, but have your router filter out the latter of any two identical packets. Then, if a box crashes, the other already has the connection running and no fail-over is required. When the crashed box is restored, you then have to replicate the state, but you can afford to take more time over that, as the user won't be aware of it.
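The "filter out the latter of any two identical packets" idea boils down to duplicate suppression. A minimal sketch follows, assuming every reply can be identified by some key such as a (connection, sequence number) pair; that key, the class name and the window size are illustrative assumptions, and a real router would do this in the forwarding path, not in Python.

```python
from collections import OrderedDict

class DuplicateFilter:
    """Pass the first copy of each packet id and drop later copies.

    Remembers at most `window` recent ids so memory stays bounded.
    """

    def __init__(self, window: int = 1024):
        self.window = window
        self._seen = OrderedDict()

    def accept(self, packet_id) -> bool:
        if packet_id in self._seen:
            return False                    # second copy from the other box: drop
        self._seen[packet_id] = True
        if len(self._seen) > self.window:
            self._seen.popitem(last=False)  # forget the oldest id
        return True

if __name__ == "__main__":
    f = DuplicateFilter()
    # Both servers answer the same request; only the first reply gets through.
    print(f.accept(("conn-1", 1)))  # True
    print(f.accept(("conn-1", 1)))  # False
```

If one server crashes, its copies simply stop arriving and the survivor's replies keep passing, which is why no explicit failover step is needed in this scheme.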

If you need even higher levels of availability, then you'd want to move the disks onto a SAN and mirror the disk access - this time, filtering out duplicates in both directions. That way, either computer can crash AND either disk RAID can crash, and you STILL have a functioning system.

You can keep parallelizing, adding redundancy (such as mixing hot-swap and cold-standby), etc, as far as you like for the reliability you require. Need better WAN network reliability? Get two providers and set up dynamic routing over BGP. Let BGP take care of monitoring connectivity and dropping routes that aren't working. That's what routers are designed to do. Ideally, you'd co-locate, have backup IP providers for each site, then replicate transactions on write. If a site totally crashes, you can't avoid the time to fail over, but you can keep it to an absolute minimum.

If the LAN is potentially a weak point, then give each server a line to each WAN router. Have an independent cable running between servers, and use it specifically for replicating states when a server is restarted.

These options range from costing virtually nothing (Big Brother is free) to tens of thousands of dollars or more, depending on the scale of redundancy you want, and give you from a few seconds of downtime perhaps every few months through to no user-visible downtime ever (short of a nuclear attack on all locations you co-locate between).

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:Depends on budget and requirements by stevey · 2005-09-27 11:07 · Score: 1

If you need High Availability (ie: almost, but not quite, 100% uptime) then you want two or more boxes which you can either load-balance between (dropping crashed servers from the list)
You can do this with software such as Pound, which can be set up fairly easily. Of course if you're running a massive site like /. you might be better off with a hardware load balancer.

But Pound can be used to easily pool a collection of back-end hosts, and avoid forwarding connections if one of them dies. It will even take care of maintaining state if you need that.
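To make the pooling idea concrete, here is a toy round-robin pool that skips back-ends marked as dead. It is not Pound's configuration or API, just an illustration of the behaviour; the class and host names are made up.

```python
class BackendPool:
    """Round-robin over back-end hosts, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.alive = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.alive.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.alive.add(backend)

    def pick(self):
        # Try each back-end at most once, starting where we left off.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.alive:
                return candidate
        raise RuntimeError("no live back-ends")

if __name__ == "__main__":
    pool = BackendPool(["web1", "web2", "web3"])  # hypothetical host names
    pool.mark_down("web2")
    print([pool.pick() for _ in range(4)])
```

A real balancer also needs something to call mark_down and mark_up, usually a periodic health check against each back-end.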

mod_globule, bitches by ubiquitin · 2005-09-27 06:43 · Score: 1

Put a server in San Francisco, put a server in Massachusetts, put a server in Florida or Texas, put a server in Chicago, put a server in New York, and then cluster them all redundantly with something like mod_globule or rsync scripting mojo. Problem solved.

--
http://tinyurl.com/4ny52

You need to start looking at server redundancy by millisa · 2005-09-27 06:47 · Score: 1

Drives die. Fans die. Power supplies die. Motherboards die. Having a RAID array is not enough. There are plenty of other things in a single system that can go wrong that can take the system down for a period of time. The biggest issue I have with the limited description here is the fact you are talking about one system. If you want availability, you need to be looking at scaling by the machine.

Now you can just have other machines waiting to take the load with a quick reconfig, or you can start doing things automatically. (People have mentioned using things like Nagios for monitoring, but monitoring doesn't give you uptime; it gives you a response that minimizes downtime.)

If you want to look at solutions that don't cost, check out LVS. It'll allow you to balance your ports across multiple systems (you can even balance win32 and linux systems if you for some reason wanted to) with a couple of different methods (I prefer the DR method myself). The setup isn't that bad, since several of the recent kernels in the major distros have all the ipvs options turned on by default, so you may not even have to recompile a kernel on your balancer systems.

Now, of course you can't depend on a single balancer any more than you can depend on a single web server; there is support in the HA linux stuff for having backup LVS systems take over as a balancer if your primary balancer bites it (heartbeat and ldirectord are your friends here).

With a pair of fairly low end systems and some monkey work at the keys, you can have a system that will balance your tcp traffic (or set up an automagic failover from a system) that can be as good as some of the commercial balancing products out there.

I currently use LVS with heartbeat/ldirectord to balance the following:
Win32 Apache Servers
Linux Apache Servers
Win2k3 IIS Servers (the LVS system balanced better than the built-in WLBS from MS, and there were a lot fewer broadcasts)
Postfix
Amavisd-new

And as others have mentioned, set up some good monitoring (a la Nagios if you want). We monitor the virtual services on the LVS systems in addition to the real servers' services so that we can know if we are still delivering service externally even though real server B is down...
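Checking the virtual service separately from the real servers might look something like the sketch below. It uses only the standard library rather than Nagios plugins, and the status wording, function names and any addresses you feed it are assumptions for illustration.

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def summarize(virtual_ok: bool, real_ok: dict[str, bool]) -> str:
    """Combine the virtual-service result with per-real-server results."""
    down = sorted(name for name, ok in real_ok.items() if not ok)
    if not virtual_ok:
        return "CRITICAL: virtual service unreachable"
    if down:
        return "WARNING: service up, real servers down: " + ", ".join(down)
    return "OK"

if __name__ == "__main__":
    # Results would normally come from tcp_check() against your own VIP
    # and real servers; these values are made up.
    print(summarize(True, {"A": True, "B": False}))
```

The point of the split is the WARNING case: customers are still being served, so it is a fix-it-in-the-morning alert rather than a page.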

When you get bigger, then you should even start looking at having datacenter redundancy . . . deploying the meteor net never seems to be the right answer to the 'Force Majeure' question . . .

Re:You need to start looking at server redundancy by Kalak · 2005-09-27 12:15 · Score: 1

The parent has said some of the most important things about high-availability, and load balancing includes HA as well. The nugget of the logic driving this is that with redundancy outside the servers, you don't need redundancy *within* a server. Then whiteboxes will do, RAIDs are less important, and a reboot on one machine will not take you down (you can actually kill the correct half of your machines without a hitch). Add in redundant, separate UPSs and network switches - my pager is *so* quiet these days - and you sleep well at night. A rack of properly configured cheap boxes beats the big iron that is out of your budget any day ($30k from the submitter's earlier answer).

If you have no linux admins, I'm sure there is a Windows load balancer out there, or you could shell out the big money on a hardware load balancer.

My credentials can be served up by any of the 2 peers that are currently live (the others are being reconfigured and are in testing, then I'll switch to the 3 not up today and reconfigure the other 2). It's a LVS, using Direct Routing, with linux HA on the front and back end. The servers inside the dashed rectangle are all either older boxes or simple 1Us. The expensive component is the shared storage, and if you can get by on a few hundred GB, you can do this for a lot less (soon to be replaced with one of these, fully redundant).

Another advantage of this logic (which grew from 2 machines in a Linux HA setup) is the possibilities for growth. Got a box that's too slow for a desktop? Make a peer out of it!

And one day I'll get the darn thing slashdotted and see how it holds out. ;)

--
I am, and always will be, an idiot. Karma: Coma (mostly effected by .hack)

Virtualisation ? by dago · 2005-09-27 06:56 · Score: 1

Despite the obvious price tag, VMware products let you "virtualize" your server(s) and run them across multiple hardware hosts. OK, that just adds another option, but it's nice to be able to just "move" one virtual host from one hardware box to another without shutting it down.

--
#include "coucou.h"

2 drive fail happening 2 times! by wimbor · 2005-09-27 06:57 · Score: 1

Well, 2 simultaneous drive fails: it has happened to me on two separate occasions on the same Dell PE2650. I also complained to tech support that this was very strange, that I did not trust the server anymore, and that there had to be a hardware failure in the array or backplane.

But apparently you have to schedule a consistency check every week or so... if you do not do this there is a possibility that data on the RAID gets out of sync/corrupted (or something like that), and when a drive fails the subsequent rebuild will fail also. In my case the RAID then complained of 2 drive failures, while the second drive would afterwards format and work normally (but I got it and almost everything inside the server replaced anyhow).

Maybe this happened to you also? Make sure you schedule the weekly check of the RAID!!!

Re:2 drive fail happening 2 times! by Undertaker43017 · 2005-09-27 07:53 · Score: 1

So does a RAID consistency check mean you have to reboot the system, every week or so, and use the RAID bios to do the consistency check? If so that sort of defeats any uptime requirements, since I would imagine that consistency checks on large RAID volumes could take quite a while...
Re:2 drive fail happening 2 times! by jsellens · 2005-09-27 07:57 · Score: 1

Make sure you schedule the weekly check of the RAID!!!

Better yet, make sure you buy hardware that isn't complete crap and needs you to manually check it to make sure it hasn't silently started corrupting itself!!!!

Sheesh -- next you'll be telling us to run mysqlcheck from time to time to make sure your "database" hasn't corrupted itself.
Re:2 drive fail happening 2 times! by wimbor · 2005-09-27 18:47 · Score: 1

No, there is a Dell utility you can use to schedule this... No need to take the system offline...

Remove Single Points of Failure by arnie_apesacrappin · 2005-09-27 07:02 · Score: 1

You need to take a look at the infrastructure and see where the single points of failure are. After you have come up with a list of single points of failure, you can then analyze the cost vs. risk of each one. Here's a list of things I would look at (off the top of my head):

Access provider - Do you have more than 1? If you only have one, do they have multiple uplinks? Do you have multiple connections to your provider? If so, do they terminate into different pieces of equipment? Also, if these are data circuits (straight network connections), do they take different paths to the provider?
Networks - Is every piece of equipment dual homed? For each network segment, are there two physical devices? When you dual homed the equipment, did you plug the connections into different physical devices?
Power - Do you have redundant power connections? Are the power connections fed from different circuits? Are the circuits fed from different UPSes? Are there generators to back up the UPS(es)?
Cooling - Do you have redundant cooling systems? How long can your servers last if a single cooling system fails? What about complete loss of cooling?
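One way to work through a checklist like this is to write down, for each role, the physical device that every redundant path actually ends in, then flag any role that depends on fewer than two distinct devices. The inventory below is entirely hypothetical, and the function name is made up.

```python
def single_points_of_failure(inventory: dict[str, list[str]]) -> list[str]:
    """Return the roles that depend on fewer than two distinct physical devices."""
    return sorted(role for role, devices in inventory.items()
                  if len(set(devices)) < 2)

if __name__ == "__main__":
    # Hypothetical site: each list names the device each redundant path lands on.
    inventory = {
        "uplink":  ["provider-router-1", "provider-router-2"],
        "switch":  ["sw1", "sw1"],        # dual homed, but into the same switch
        "power":   ["ups-a", "ups-b"],
        "cooling": ["crac-1"],
    }
    print(single_points_of_failure(inventory))
```

Counting distinct devices rather than connections catches the "two cables into the same box" case the questions above are getting at.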

After examining those things, you'll need to look at load balancing multiple servers. I personally prefer F5 for my load balancing, but they come with a hefty price tag (mid five figures). I've used Cisco, Foundry and F5 so far and I love the F5s. They're extremely extensible but you don't have to be a rocket scientist to get the basic functionality out of them. Dual methods of configuration (web and CLI) make it nice for the newbies and the seasoned professionals. It also helps that the CLI is a full Linux system, so you can write shell scripts to do all of your basic maintenance tasks.

If you have more questions, please reply.

--

Still, with a plan, you only get the best you can imagine. I'd always hoped for something better than that. -CP

Re:Remove Single Points of Failure by Anonymous Coward · 2005-09-28 04:35 · Score: 0

You forgot single points of failure implemented as services/daemons or other software.

For instance, if you have multiple virtual hosts, rather than using a single Apache instance to service them all, create a virtual interface for each virtual host, and bind a unique Apache instance for each virtual host to its corresponding virtual interface. That way, if one virtual host goes mental, the whole lot doesn't. You can down/debug individual instances without affecting the others.

Minimize your use of CRON - this is often a SPOF, people assume CRON doesn't fail, but it can, and does.

Make good use of init - it can fail, but if it does your machine is going down anyway. Set important services to respawn if they fail. Consider using monit.

trim the fat - anything running that doesn't need to is not only a resource hog, but also a potential vector for a catastrophic failure.

I'm sure you can all think of others...

There was a really good LISA talk... by mengel · 2005-09-27 07:06 · Score: 4, Interesting

... about this a few years back. I forget the guy's name; he was administering a site that did stock quotes with pretty graphs, etc. I suspect I don't remember all of his points anymore, but:

two (or more!) network feeds from different vendors, verify monthly that they don't have common routing the best you can (sometimes you end up sharing a fiber even though it doesn't look like it...). These various connections all come into your front-end service LAN (which is distinct from your back-end service LAN...)
redundant front end servers which have their own copies of static content and cache any active content from...
redundant back end servers that actually do the active content, and keep any databases, etc. Use a separate LAN for the front-end/back-end connections so that traffic doesn't fight with the actual web service.
Backup power (UPS + generator) with regular tests. (test on one side of your redundant servers at a time, just in case...)
Log only raw IP's, have a backend system with a caching DNS setup where you do web reports. Do things like log file compression, reports, etc. on the back end server only.
tripwire all the config stuff against a tripwire database burned to CD-ROM.
update configs on a test server (you do have test servers, right?) when they're right update the tripwire stuff, build a new tripwire CDROM, then update the production boxes.
use a fast network-switch-style load balancer on the front. They also help defend your servers against certain DoS attacks (e.g. SYN floods).
when things get busy, load your test servers with the latest production stuff, and bring them into the load balance pool. If it takes N servers to handle a given load, it takes N+1 or N+2 to dig back out of a hole, because the load has at least 1 server out of commission at a time...
use revision control (RCS, CVS, subversion, whatever) on your config files.
use rsync or some such to keep 2 copies of your version control, above.
make sure you can reproduce a front-end or back-end machine from media + revision control + backups in under an hour. Test this regularly with a test server.
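The N+1 / N+2 point in the list above can be turned into a quick sizing check. This is only a rough sketch: the function names, the load units, and the idea that every server has the same fixed capacity are my assumptions, not anything from the talk.

```python
import math


def servers_needed(peak_load, per_server_capacity, spares=1):
    """Servers required to carry peak_load, plus `spares` extra boxes.

    Assumes (hypothetically) that all servers are identical and that
    capacity is measured in the same units as the load.
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    base = math.ceil(peak_load / per_server_capacity)
    return base + spares


def survives_failures(pool_size, peak_load, per_server_capacity, failures=1):
    """True if the pool still carries peak_load with `failures` servers out."""
    remaining = pool_size - failures
    return remaining * per_server_capacity >= peak_load


if __name__ == "__main__":
    # Illustrative numbers only: 900 requests/s peak, 300 requests/s per box.
    pool = servers_needed(900, 300, spares=1)
    print("servers to buy:", pool)
    print("survives one failure:", survives_failures(pool, 900, 300))
```

If the check fails with one server out, you have a cluster that needs every member up to work at all, which is not redundancy.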

If you have a site whose content changes less frequently (i.e. at most daily), burn the whole site to a CD-ROM or DVD-ROM image, and boot your webservers from CD-ROM as well. Then if you blow a server, you can just slap the CDs/DVDs in another box and be back in business, and it's much harder to hack.

Well, anyhow, those are my top N recommendations for a keeps-on-running web service configuration. I'm sure I'm overlooking some stuff, but that should head you in the right direction. And if it doesn't sound like a lot of work, you weren't paying attention...

--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'

Anycast, Mosix by freality · 2005-09-27 07:08 · Score: 1

Anycast allows you to serve the same IP from multiple servers distributed around the net. It's a BGP hack, and some people don't like it because of that, but it's used to keep some of the DNS root servers up. It has the added ability to divide and conquer DDoS attacks. The caveat is that routing changes break stateful connections, which is why DNS can use it well (single-packet UDP exchanges most of the time).

Mosix is a UNIX patch that allows processes to migrate across machines. There's even a neat utility called mtop or something that shows load on all servers in a curses interface... and you can see them balance out like water pouring between them. This also has issues with maintaining active connections.

If you could migrate stateful connections you could effectively have infinite uptime. Anyone have a patch?

Re: Here they are by Anonymous Coward · 2005-09-27 07:14 · Score: 2, Informative

Well, I was really more after general approaches, to be honest. I didn't post a lot of specifics because I didn't think anyone was truly interested in solving my exact problem for me.

Here are some details:
The budget is probably $30k or less.
I'm not exactly sure of the server models (I didn't spec them), but they are Dell boxes and are fairly new. One is production, one sits idle to be swapped in in the event of failure.
The server is hosted locally, bandwidth is not an issue.
The system is Windows based, ASP and SQL Server.
The RAID array is using SCSI drives. They are hot swappable, but do not have a hot spare.

I am approaching other sysadmins and looking for advice from them as well. I am not as worried about the traffic or about the backbone at this point as I am keeping the hardware up and the data backed up and available. I am also interested in methods of getting things back up and running quickly should a hardware failure occur or the database become corrupt.

This is not my area of expertise (obvious from my questions) and I thought there might be some general guidelines for this sort of thing. I suspect that he will be paying someone to help with this, but I was hoping to get a good feel for what to expect and to have some knowledge beforehand to be better able to make decisions. (get more than one opinion basically)

Thanks for everyone's responses, I appreciate your time.

Bathtub Curve Makes it More Likely by freality · 2005-09-27 07:20 · Score: 2, Informative

You're right that disks don't fail together that often, but components do tend to fail either soon after you get them or at the end of their expected lifetimes (just like us!). This is called the bathtub curve. If you buy a bunch of disks at the same time with the same MTBF, you'll get a big spike of failures within the first few days or in, say, 4 years. If you use RAID5 on lots of disks, you're hosed because it can't tolerate a second failure during a recovery. This may sound exotic, but it's a key design consideration on larger disk systems like archive.org's petaboxen (though, I guess those are exotic :).

As usual, variety is the spice of life... just don't buy lots of the same kind of stuff at once.

Not really so unlikely... by mengel · 2005-09-27 07:31 · Score: 1

... when the drives in a RAID set near their end of life: Given

the bathtub curve of disk failure rates, and
that a RAID reconstruct can take about a day on a lot of RAID sets

you can certainly have a second drive fail while another one is being reconstructed if all the drives in the RAID are near end of life. The only good way to prevent it is to intentionally fail & replace sufficiently old drives before they actually fail (i.e. before you start climbing the steep end of the "bathtub" curve).
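To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each surviving drive fails independently at a constant rate during the rebuild, which is exactly the assumption the bathtub curve undermines for drives bought together, so treat the result as optimistic. The function name and the failure rates plugged in are illustrative, not measured.

```python
import math

HOURS_PER_YEAR = 8760


def second_failure_probability(drives_in_set, annual_failure_rate, rebuild_hours):
    """Chance that at least one surviving drive fails during the rebuild.

    Simplifying assumptions (not from the thread): failures are independent
    and each drive has a constant hazard derived from its annual failure rate.
    """
    if drives_in_set < 2:
        raise ValueError("need at least two drives in the set")
    if not 0.0 <= annual_failure_rate < 1.0:
        raise ValueError("annual_failure_rate must be in [0, 1)")
    surviving = drives_in_set - 1
    # Convert the annual failure probability into a per-hour hazard rate.
    hourly_hazard = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    p_single = 1.0 - math.exp(-hourly_hazard * rebuild_hours)
    return 1.0 - (1.0 - p_single) ** surviving


if __name__ == "__main__":
    # Hypothetical 8-drive set with a one-day rebuild.
    for label, afr in (("mid-life", 0.03), ("end of life", 0.30)):
        p = second_failure_probability(8, afr, 24)
        print(f"{label}: {p:.4%} chance of a second failure during rebuild")
```

The interesting part is how fast the number climbs once you plug in an end-of-life failure rate, and that is before accounting for the extra stress a rebuild puts on the remaining drives.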

It can be hard to explain to a company with whom you have a maintenance contract that a drive needs to be replaced that hasn't actually failed yet. I know one admin (honest, it isn't me!) who advocates pulling old drives from the RAID set and dropping them on the floor a few times and then calling service to "schedule" the replacements ;-).

--
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'

Re: Here they are by swmccracken · 2005-09-27 08:11 · Score: 1

I would recommend the Google approach - cluster cheap computers. Clustering ASP can be easy (depends if you use the Session variable) - look into Microsoft's Network Load Balancing, which, while it load balances HTTP applications, also provides clustering and failover (I think - you'd have to check) without setting up a formal Windows cluster.

As for SQL, you could have two installations of SQL 2000 and use NLB to share among them; so long as you either manually take care of write transactions or use replication. (I'm not sure what the potential for lost information is if one SQL server goes down taking all data on its discs with it.)

You should obviously also read about MS's own clustering support and look into that. It tends to be aimed at bigger systems than you're talking about. Certain configurations use shared discs - you will have to research.

The ideal in my book is multiple "share nothing" servers where any can take the load of the whole - protects against disc failure too!

The idea of manually swapping in a spare server suggests you don't need 99.9% uptimes, otherwise you'd be looking into clustering systems to make that swap automatically.

Oh, and one thing I did that saved my bacon at work once: every two hours, have your SQL Server back up (dump) the transaction log to another computer entirely, across the network - I use the SQL Server Maintenance Plans. If you lose the server entirely, you've still got most of today's data! (Adjust frequency to taste.)
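A scheduled dump only helps if it is actually still running, so it is worth checking the age of the newest file on the receiving machine. A minimal sketch follows; the function names and the three-hour threshold (a two-hour schedule plus slack) are my own assumptions, and the directory is whatever folder your log dumps land in.

```python
import os
import time


def newest_file_age_seconds(directory, now=None):
    """Age in seconds of the most recently modified file, or None if empty."""
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(directory)
        if entry.is_file()
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_stale(directory, max_age_seconds=3 * 3600, now=None):
    """True if there is no dump at all, or the newest one is too old."""
    age = newest_file_age_seconds(directory, now)
    return age is None or age > max_age_seconds
```

Hook the result into whatever monitoring you already have, so a silently failed backup job gets noticed the same day instead of on the day you need the restore.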

Some thoughts on this by bruciferofbrm · 2005-09-27 11:07 · Score: 1

"Where could one find material on recommended strategies for increasing server availability? Anything related to equipment, configurations, software, or techniques would be appreciated."

Well, as you can see, you can get a bit of information here, obviously.

Since there were no specific details, but a request for information sources, I would say this:

Many vendors of products will offer an assortment of solutions to high availability needs. Legato used to have cluster software for Windows servers (for availability, not load balancing). But alas (as I just discovered) that's now gone with their EMC merger.

Microsoft actually makes an okay clustering solution.
Oracle has clustering ability in their database product and is considered by some to be one of the better solutions for a truly high availability database.

The Linux High Availability project is a good place to look around if you have time on your hands to implement it. I've done it, and it helps if you already understand a lot of the concepts involved in HA solutions.

As you will find out, though, you really have to determine the value a solution can provide versus the potential loss of revenue a failure of any type can cause. When you realize how much money you can lose, you can evaluate how much money you can spend. That's the real key to any high availability solution.
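That trade-off is easy to rough out before you talk to management. The sketch below converts an availability percentage into hours of downtime per year and multiplies the hours saved by what an hour of downtime costs you. The function names and the dollar figure are hypothetical placeholders; plug in your own numbers.

```python
HOURS_PER_YEAR = 8760


def downtime_hours_per_year(availability):
    """Expected yearly downtime for an availability given as a fraction (0.999 = 99.9%)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def yearly_loss_avoided(current_availability, target_availability, loss_per_hour):
    """Revenue saved per year by moving from current to target availability."""
    saved_hours = (
        downtime_hours_per_year(current_availability)
        - downtime_hours_per_year(target_availability)
    )
    return saved_hours * loss_per_hour


if __name__ == "__main__":
    # Illustrative only: assume an hour of downtime costs 1000 in lost revenue.
    for level in (0.99, 0.999, 0.9999):
        print(f"{level:.2%} uptime -> {downtime_hours_per_year(level):.2f} h down per year")
    print("going from 99% to 99.9% saves about", round(yearly_loss_avoided(0.99, 0.999, 1000.0)))
```

If the yearly loss avoided is smaller than what the HA setup costs to buy and run, the spare box in the closet may be the right answer after all.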

Keep in mind there are also two types of clustering to think about (you'll discover this on your own in your research anyway):

One is Load Balancing clusters like web farms. All the servers in the cluster share the work load. If one server drops, the others have to take up the slack.
The other type is a High Availability cluster with active and passive nodes, where one computer does all the work while another sits idle waiting for the first to fail. When that happens it takes over the first's work. A variation on this is an active/active cluster, where both machines do work but each has to be ready to take on the other's work load as well if the other fails.

If you think you are falling behind the rest of the world, you are not. Right now I am going through this whole process at work, figuring out what it will take to get the management team to buy into high availability, and we have a customer base that really needs us to implement it. It all comes down to the money game.

Similar but different by renehollan · 2005-09-27 12:08 · Score: 1

Chiaro S.T.A.R.: STateful Assured Routing.

Worked on that a bit while I was employed there.

Basically, we synchronize routing table state across several hot-standby routers, so failover is instantaneous, limiting flapping in the network.

Rather cool, actually.

--
You could've hired me.

Don't forget downtime due to human error/malice by pyrotic · 2005-09-27 12:14 · Score: 1

Modern server hardware is pretty good. If you have redundant disks and PSUs, which are the main moving parts in a system, that should be enough. Your bigger worry, especially if you're running Windows systems, is downtime due to rebooting for service patches, and downtime due to malicious break-ins. You can mitigate this to an extent by having lots of servers. Make sure they all have their own passwords so breaking into one won't compromise your whole network. Check regularly, using tripwire or a similar tool, that security and integrity haven't been compromised.

You also have to worry about changing a wrong setting, or not testing a new configuration enough. Use revision control, so you have a log of every change you make on production systems. Test first on non-production systems. Keep backups for as long as you can, and practice disaster recovery to make sure if a hurricane hits your data center, you can get back up and running without trouble. Store backups offsite in case your building is destroyed, and make sure you aren't the only one who knows how to restore the systems. Make sure backups are only accessible by authorised personnel - especially if your system passwords get backed up. See point above about break-ins. You are far more likely to suffer downtime due to human weakness than machine weakness, and all the hardware redundancy in the world won't save you.

try by a11 · 2005-09-27 12:39 · Score: 0

double-parity raid and add a hot spare.

But it is a generic problem: Here is how to start by einhverfr · 2005-09-27 13:17 · Score: 1

First you should make a list of all your servers and define your tolerances (figure 2/5). What needs to be kept online in the event of a power outage, for example (maybe your PBX, telephones, publicly accessible servers, and associated networking equipment), is a good question that needs to be asked, but don't stop there. List every possible disaster (structure fire, flood, earthquake, terrorist attack, you name it) and determine which resources need to be available. Then you can design a redundant system (maybe with servers in redundant locations in different cities), but you can only do this when you know what you are up against.

Then you can start building redundancy into every one of these areas. Get duplicate network connections, duplicate switches, hot spares in your RAID arrays, redundant servers, and more. Maybe even redundant locations....

--

LedgerSMB: Open source Accounting/ERP

Slashdot Mirror

74 comments