Clustering vs. Fault-Tolerant Servers
mstansberry writes "According to SearchDataCenter.com, fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supply and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.
Heh. In order to do it completely right, you'd make a cluster out of fault tolerant nodes :-P
So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.
Breakfast served all day!
Hardware fails... it's as simple as that. You should plan on having to shut down and replace hardware for one reason or another. If it can be done with minimal or no disruption to the services, then that's all the better. OS makes licensing no longer a problem.
Clustering provides a backup for software failures that fault-tolerant servers don't. Also, upgrades without downtime are easier with a load-balanced cluster.
If HA is what you are really after, you should use both. You want a fault tolerant server so you never have to go down unexpectedly and you want a fail over node so if the unexpected occurs, you'll be back up in a jiffy.
"That's the sort of blinkered, philistine pig ignorance I've come to expect from you non-creative garbage."-Monty Python
Fault tolerant systems are all in one physical location.
Clusters can span different server racks, buildings, cities, even countries.
It depends what the goal is. Fault tolerance, scalability, disaster recovery, etc.
They both have their uses, let's not discount one or the other, just use them properly.
**Typically, the goal is a mix of the ones I enumerated, hence I typically choose clusters. However, I always re-evaluate every time a new requirement comes in.
A Web farm is the simplest form of clustering; some would argue it isn't even a cluster because the nodes are not aware of each other. However, it gets more confusing when you add a Java layer that load balances...
Anyway, I do agree that I've seen DB clustering solutions cause more trouble than they prevent...
A cluster adds complexity to the environment. Complexity == Cost, even without the expensive software.
This sig is the express property of someone.
The Good: Using cheap components in a cluster to create scalability at a good value The Bad: Using a cluster to cover up coding issues, architectural crap, or instabilities in the system The Ugly: "the bad" gets so bad that it crashes the whole freakin' cluster. Why did we do this again?
We run volumes of Dell 2850s with RAID arrays, redundant power, etc., powering high-volume websites... I can say first-hand that internal fault tolerance in these systems can only get you so far. The management device in charge of the two power supplies can itself fail, leaving both power supplies useless. Or a RAID card can go out of commission, leaving drives with mangled and unrecoverable data. As with most solutions, a mixture of both fault tolerance and data clustering is the safest alternative.
It all comes down to Availability (Clustering) vs. Reliability (Fault Tolerant). They are NOT the same thing.
Fault tolerant servers are nice; even the simplest true server should offer some degree of fault tolerance (e.g., RAID drives). This is handy, but it may not help your availability if you have an SLA promising xx% uptime and then find yourself needing to take the server down to apply service packs or other patches.
Clustered servers allow you to increase the availability of your machines, because when you need to take one down for some updates, you can simply fail over all your traffic to the other server in the cluster. Clustering may increase the availability of the services those machines are offering, but it does not help the reliability of the machines themselves.
Therefore, I personally choose to start with fault tolerant machines initially (RAID and dual power supplies at a minimum). It makes for a good base. If the services on that machine are 'mission critical', then cluster that machine with other fault tolerant machines.
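To put rough numbers on the availability side of that, here is a minimal Python sketch. The downtime figures are made up for illustration, and the cluster formula assumes independent failures, instant failover, and patching one node at a time, none of which real systems fully deliver.

```python
HOURS_PER_YEAR = 24 * 365


def single_server_availability(planned_hours: float, unplanned_hours: float) -> float:
    """Fraction of the year one server is up, counting both patch windows
    (planned) and hardware or software failures (unplanned)."""
    return 1.0 - (planned_hours + unplanned_hours) / HOURS_PER_YEAR


def two_node_cluster_availability(unplanned_hours: float) -> float:
    """Service availability for a two-node cluster.

    Simplifying assumptions (not from the thread): nodes are patched one at a
    time so planned work costs no service downtime, node failures are
    independent, and failover is instantaneous. The service is then down only
    when both nodes are down at once.
    """
    per_node_outage = unplanned_hours / HOURS_PER_YEAR
    return 1.0 - per_node_outage ** 2


if __name__ == "__main__":
    # Hypothetical figures: 48 hours of patching and 8 hours of failures a year.
    print(f"single server: {single_server_availability(48, 8):.5f}")
    print(f"two-node cluster: {two_node_cluster_availability(8):.7f}")
```

The point of the sketch is only that the planned-downtime term disappears once you can fail over; the hardware in each node is exactly as reliable as before.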
Well, in case you haven't noticed, it's late 2005 now.
Some things have changed, for example Windows 2003 Server came out and MSCS is now quite a decent HA solution.
(BTW, the grandparent post didn't say that Microsoft's own clustering solution was lame, he made a general statement about all clustering software for the Windows platform).
That's true, it's a massively distributed app. In every class of solution, there are extreme cases for which the rule does not apply. Those cases do not change how the average customer does business.
--LWM
Not worth doing. The cluster components should be dumb. There isn't a valid reason to have them know about each other. Your round robin, or whatever balancing you want, should come from outside. F5 makes a nice box for that, and so do others; if you're really a cheapskate you could duplicate them yourself. If anything needs to know who is on what machine, let the system tell that to the backend DB machine. It should be a channel architecture, not a crazy tangle. The more you break the functions down on the system level, the better and faster your cluster will be.
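For what it's worth, the "balancing comes from outside" idea fits in a few lines. This is a toy Python sketch, not how an F5 box works internally; the node names and the `is_healthy` callback are hypothetical stand-ins for whatever health check you actually run.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator


def round_robin(nodes: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Iterator[str]:
    """Hand out nodes in rotation, skipping any that fail the health check.

    The nodes themselves know nothing about each other; all the routing
    knowledge lives here, outside the cluster members.
    """
    pool = list(nodes)
    if not pool:
        raise ValueError("need at least one node")
    ring = cycle(pool)
    while True:
        # Try each node at most once per request before giving up.
        for _ in range(len(pool)):
            node = next(ring)
            if is_healthy(node):
                yield node
                break
        else:
            raise RuntimeError("no healthy nodes available")
```

Usage would be something like `picker = round_robin(["web1", "web2"], check)` followed by `next(picker)` per incoming request.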
Syncing databases on the other hand is tricky. Save your money and resources for that.
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
What about iFolder? Looking at the specs, I think it's missing serverless/hiving (which could be provided by any of the normal p2p people) and file history ... not understanding your database object comment.
;^)
Speaking of which, what about freenet? The only thing it's missing is "guaranteed availability of critical business data", eh? And I hear it might have some performance problems.
--Robert
The usual clustering I've seen is "Hot Spare" clustering. The primary runs until it goes kaput, then the second takes over. For database clustering, the two boxes usually share the same disks. I think I've seen more outages from false takeovers by the secondary than real failures of the primary.
The other problem with clustering is that all of your software applications have to be cluster tolerant. If the user app keeps a database connection open and a rollover occurs, the connection state doesn't and can't roll over with it. To a client system, a cluster failover looks like a server reboot. Don't underestimate the difficulty of this problem. A new application has to be designed with that in mind. Retrofitting it later is hard and costly, even with free platforms.
Another issue that can't be solved with clustering is application failure or application limits. You may recall the airline system failure last Christmas? Some 80% of Slashdot readers asked where the backup was (there was one) or said they should have used Unix (they were). The box (RS6000) and operating system (AIX) kept running just fine. A hundred-computer cluster couldn't solve the real problem: the application couldn't handle the volume of information it was required to hold, and they were at the mercy of a proprietary source code vendor.
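A rough Python sketch of what "designed with that in mind" can look like on the client side. Everything here is hypothetical (`connect`, `work`, the retry counts); the key assumption is that `work` is safe to run again from scratch, because whatever session state the old connection held is gone after the failover.

```python
import time
from typing import Any, Callable


def run_with_reconnect(connect: Callable[[], Any],
                       work: Callable[[Any], Any],
                       attempts: int = 5,
                       delay_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Open a fresh connection and run `work` on it, retrying on failure.

    A failover looks like a server reboot to the client, so each retry
    starts over with a new connection instead of trying to resume the old one.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()
            return work(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off a little longer each time while the standby takes over.
                sleep(delay_seconds * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The hard part the sketch glosses over is exactly the one described above: making `work` restartable in an application that was written assuming the connection never goes away.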
This is a boring sig
The choice between fault-tolerant systems and clusters is decided by how long an outage your company can sustain. A cluster can take 1-2 minutes to move applications from a dead node to a working one. If your applications require sustained 100% connectivity, you need to go fault tolerant. Usually it's for real-time monitoring software, like the computers used to monitor telephone exchanges. For databases and NFS services, clusters work better, as you can take a 1-2 minute hit in the response when a node fails. Software licenses do not come into it with active-active nodes, where you pay for all the CPUs you are running on. With active-passive failover, only 1 instance of your licensed software is running on your 2 systems. If your software vendor insists on your paying a license for both nodes, then I would opt for an active-active cluster instead of an active-passive one.
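The outage-interval argument is easy to turn into arithmetic. A small Python sketch, where the SLA percentages are made-up inputs and the only figure taken from the post is the 1-2 minute failover time:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def yearly_downtime_budget(sla_percent: float) -> float:
    """Seconds of downtime per year that an uptime target leaves you."""
    return SECONDS_PER_YEAR * (1.0 - sla_percent / 100.0)


def failovers_within_budget(sla_percent: float, failover_seconds: float) -> int:
    """How many cluster failovers per year fit inside the downtime budget,
    assuming failovers are the only source of downtime (a simplification)."""
    return int(yearly_downtime_budget(sla_percent) // failover_seconds)


if __name__ == "__main__":
    for sla in (99.9, 99.99, 100.0):
        n = failovers_within_budget(sla, failover_seconds=120)
        print(f"{sla}% uptime leaves room for {n} two-minute failovers a year")
```

A target of 100% leaves room for zero failovers, which is the case where the post says you have to go fault tolerant.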
Both.
If you have an application that requires ULTIMATE uptime, then you need a geographically remote cluster (Cluster spread over two sites with a redundant leased line link to provide the heartbeat). No matter how many redundant parts in a server, if it gets nuked (read power failure, flood, or other, not ACTUALLY nuked) then that application is down.
Active-active clusters are not really ideal. Load-balancing is a nice idea, but in this instance it means that when half of the cluster fails, the application suffers severe performance issues. Active-active also creates data issues, as you've got two servers writing to their own local storage, which also requires real-time replication between sites. Veritas Storage Foundation is about the most cost-effective option here; you don't even need 2003 Server Enterprise.
If you want a nice simple active-passive cluster and it's in the same locale, fine, use a SAN. If the nodes are geographically remote, then they will need real-time replication, and as one is passive you can use HP Storage Mirror or similar. HP are the only vendor in fact that do a nice packaged cluster solution with a SAN included all under one part code. FYI.
Having said that, if you're buying a decent server, then you are an absolute idiot to not put RAID into it. After that, it only costs another £300 or so to add a redundant hot-plug PSU & fan. Plus p'raps a bit for an extra CPU. After that, the only component that will cause a total outage is the mainboard failing - and the only real way to get around that is to... uh... add another mainboard! Well... guess that's another server then...!
Others have said it, I'll say it again: you don't use clustering in place of FT hardware, or vice versa. You use them together!
Take a server: Hot-swappable mirrored OS disks, N+1 power supplies, dual NICs (which support failover), dual cards initiating separate paths to your storage (through independent switches, if fibre-attached), ECC RAM with on-system logic to take out a failing DIMM. Oh yeah, and multiple CPUs, again with logic to remove one from active use if need be. (chipkill sort of stuff.)
Now take another identical server (or two) and cluster them. By cluster, I mean add the heartbeat interconnects and software layer to monitor all of the mandated hardware and application resources, and fail over as necessary, or take other appropriate actions. Gluing a pile of machines together in a semi-aware grid is NOT a cluster, and does not properly address the same problem!
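As a very rough illustration of the heartbeat part, here is a Python sketch of the decision a standby node makes. The class, its parameters, and the "three missed beats" default are all assumptions for the example; real cluster software also monitors application resources and fences the failed node, which this leaves out.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks heartbeats from a peer and decides when to fail over.

    Requiring several consecutive missed beats, rather than one, is a common
    way to avoid taking over just because a single packet was delayed.
    """

    def __init__(self, interval_seconds: float, missed_limit: int = 3) -> None:
        self.interval = interval_seconds
        self.missed_limit = missed_limit
        self.last_seen: Optional[float] = None

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def should_take_over(self, now: float) -> bool:
        """True once `missed_limit` heartbeat intervals have passed in silence."""
        if self.last_seen is None:
            # Never heard from the peer at all: don't assume it is dead.
            return False
        missed = (now - self.last_seen) / self.interval
        return missed >= self.missed_limit
```

Timestamps are passed in explicitly so the logic can be tested without waiting; in practice they would come from something like `time.monotonic()`.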
Now once you've got this environment in place, add the most crucial aspect: Highly competent sysadmins, and a strict change control system. The former will cost you a fair sum of money in salary, and the latter will likely necessitate duplicating your entire cluster for dev/test purposes, before rolling out changes.
That's the beginning of an HA environment. Still up for it?
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
Of course you do. Fault tolerance is pretty cheap and straightforward. It costs me about a 10% premium on my Dell servers. However, it does not buy me the ability to take a machine down for maintenance the way clustering does. If you're looking for serious uptime, fault tolerance is not going to get you there on its own.
You are in a maze of twisted little posts, all alike.