Clustering vs. Fault-Tolerant Servers
mstansberry writes "According to SearchDataCenter.com fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supply and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.
Personally I opt for clustering over fault tolerance, but that's my personal choice. It really depends on what the machine(s) will be doing. If you have a database server, go fault-tolerant (because I have yet to meet a clustering DB solution that didn't suck). But if you're building a webserver, cluster.
Also, the one thing the article mentions is that clustering is just as expensive as fault tolerance due to software licensing. Last I checked, whether it's one copy of Debian + Apache + MySQL + Perl or 200 copies, it's going to cost me the same price (free). And Windows doesn't support clustering yet in any decent way, shape, or form, so I don't see the problem here.
snowulf.com
I just use Geocities, it's free and easy!
It's slashdotted already.
...and i am just waiting on the call from our vendor recommending we upgrade to a cluster of fault-tolerant servers.
So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.
Breakfast served all day!
Hardware fails... it's as simple as that. You should plan on having to shut down and replace hardware for one reason or another. If it can be done with minimal or no disruption to the services, then that's all the better. OSS makes licensing no longer a problem.
tolerating a lot of faults in one girlfriend or get a cluster of them and deal only with the good points?
Clustering provides you with fault-tolerant OS/applications. A single server with tons of redundant bits doesn't help you if the OS or applications that it serves get borked.
Shouldn't we be encouraging server failures which enable their freedom from magnetic imprisonment? Kinda like PETA freeing lab animals...
If brevity is the soul of wit, then how does one explain Twitter?
Because of the open source stack behind a lot of server platforms these days, I'm dubious that this decision boils down simply to a software cost issue. One major benefit of using clustering is that many white box, non-specialized machines can be used, which are easier & cheaper to replace or obtain components for. Complex and specialized hardware with built-in redundancy is often expensive and can require vendor support contracts for effective maintenance.
Business Voyeur
Clustering provides a backup for software failures, that fault-tolerant servers don't. Also, upgrades without downtime are easier done with a load-balanced cluster.
If you are just talking about fault tolerance (FT) then spill a drink on the FT server then spill a drink on a clustered server and see the difference :) If we are not limited to fault tolerance then try load balancing an FT server with.. um..er... itself. This is really apples and oranges. BTW, I like FT servers in a cluster!
The article seems to make the choice one-sided. Fault tolerant servers have higher uptimes because the backup takes over immediately. Clusters have a single point of failure in the middleware. They argue that the clusters can run different operating systems, but that means more patches and updates to keep track of. Clusters are expensive because they need more OS and software licenses and require a lot of maintenance, though that might drop if they are running Linux or FreeBSD.
Anyone make a case for clusters for high-uptime situations?
A NYC lawyer blogs. http://www.chuangblog.com/
If HA is what you are really after, you should use both. You want a fault tolerant server so you never have to go down unexpectedly and you want a fail over node so if the unexpected occurs, you'll be back up in a jiffy.
"That's the sort of blinkered, philistine pig ignorance I've come to expect from you non-creative garbage."-Monty Python
If you buy one machine, you still may need to power it off to open the case, or replace a part.
Fault tolerant systems are all in one physical location.
Clusters can be in different server racks, building, city even country.
It depends what the goal is. Fault tolerance, scalability, disaster recovery, etc.
They both have their uses, let's not discount one or the other, just use them properly.
**Typically, the goal is a mix of the ones I enumerated, hence I typically choose clusters. However, I always re-evaluate every time a new requirement comes in.
-dave
http://millionnumbers.com/ - own the number of your dreams
I build AIX HACMP clusters for a living, and I'll tell you that you should *never* use an either/or approach, as TFA suggests. Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.
Maybe if they want to write an article, they should spend some time in the real world and see how the HA industry works instead of making up some arbitrary demarcation line to hang a preconception on.
That's one of those ideas that sounds all good and well, but it hardly works in practice. In many cases, downtime is unacceptable. You need transactions processed continually, and you cannot have downtime caused by a dead server.
It is not a good idea to build a system out of parts that you know will fail, and then proceed to design the system around such failure. A far better idea is to spend some money, and design a system that will work. Of course you do take into account hardware failure, and you build in redundancy where necessary. But you do not build your solution around knowingly faulty and cheap hardware. That's just looking for trouble.
Often times the "cheap" solution ends up being most expensive, not only because of the cost of repeated hardware repairs, but also because of the cost of the labour necessary to perform the repairs, and the possibility of downtime. When you're processing millions of dollars worth of transactions per minute (if not per second), even a couple of minutes of downtime can be financially costly.
Cyric Zndovzny at your service.
The Good: Using cheap components in a cluster to create scalability at a good value
The Bad: Using a cluster to cover up coding issues, architectural crap, or instabilities in the system
The Ugly: "the bad" gets so bad that it crashes the whole freakin' cluster. Why did we do this again?
Fault tolerance gets you a machine that keeps running in the face of hardware failures and maintenance. The switchover time is arguably negligible.
Clustering gets you a set of services that keep running in the face of hardware failures and maintenance. The switchover time can range from negligible to huge depending on the application involved.
However, clustering also helps you to solve other problems, including scaling, software failures, software upgrades, A-B testing (running different versions side by side), major hardware upgrades, and even data center relocations.
Clustering tends to require a lot more local knowledge to get right.
So if you narrow the problem definition to hardware only, they solve the same class of problems. But when you broaden it to the full range of what clustering offers you find a greater opportunity for cost savings - because one technique is covering multiple needs.
In my 15 years of IT consulting, no network has provided data safety transparency cheaply or consistently enough. Clusters and fault tolerance both cost more than downtime in my experience.
We desperately need a better way to access data in a corporate network.
My favorite customers are those architects and engineers who avoid networking except for the Net. Seriously, sneakernet and peer-to-peer have shown the least downtime I've seen.
I think p2p networks will see a comeback if a torrent-like protocol can grow to be speedy. My customers are not banks, but they need 100% uptime as every day is a beat-the-deadline day.
If someone can extend and combine an internal torrent system with a decent file cataloging and searching system, they'll see huge money. I have some 150 user CAD networks just waiting for it.
What would a hive network need?
* Serverless
* Files hived to 3+ workstations
* Database object hiving
* File modification ability (save new file in hive, rename previous file as old version, delete really old versions after user configurable changes)
* "Wayback Machine" feature from old versions
* PCs disconnected from hive will self correct upon reconnection
It is very complex right now, but my bet is that the P2P network will trump client-server for the short run. The "client is the server" vs "the server is the client"?
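One way to picture the "files hived to 3+ workstations" item from the list above is a placement function that decides which peers hold each file. This is only a rough sketch: rendezvous hashing is my choice of technique here, not something the post specifies, and `pick_replicas`, the workstation names, and the file name are all hypothetical.

```python
import hashlib


def pick_replicas(filename, workstations, copies=3):
    """Choose which workstations should hold a copy of a file.

    Uses rendezvous (highest-random-weight) hashing: every workstation
    gets a score derived from the (file, workstation) pair, and the
    lowest-scoring ones win. No central server is needed, because any
    peer can recompute the same answer from the same peer list.
    """
    if len(workstations) < copies:
        raise ValueError("not enough workstations for the requested copies")

    def score(workstation):
        key = f"{filename}|{workstation}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()

    return sorted(workstations, key=score)[:copies]


# Hypothetical eight-seat CAD office.
peers = [f"cad{i:02d}" for i in range(8)]
holders = pick_replicas("site-plan.dwg", peers)
```

A handy property of this scheme is that a workstation that does not hold a given file can drop off the network without changing where that file lives. Versioning, the "Wayback Machine" feature, and resync on reconnection would all need to be layered on top.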
In my opinion, Beowulf is not the hammer everyone thinks it is. Ask even the average Slashdot reader, and they relate Beowulf to something more like OpenSSI or Mosix... something you can easily add nodes to, and just use a special compiler to compile all of your multithreaded/multiproc apps and it will all work magically.
If you are one of those people, stop. A Beowulf cluster is a performance cluster, but it is not a replacement for an SMP system. You more or less have the master node delegate actual computations EXPLICITLY in your application (EX "Hey... Node X, Calculate X + Y for me, kthx").
"When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
Having built both true high-reliability fault-tolerant devices and clustered systems, I don't see any fundamental theoretical difference. In both cases, you have redundant hardware capacity in place, theoretically to allow you to tolerate the failure of a certain amount of your hardware (and, sometimes, your software) for a certain amount of time. Neither option guards you against failures outside of the cluster or FT system box. Neither one is a panacea. Both are sold as snake-oil insurance against "badness".
In a single fault-tolerant box, you generally have environmental monitoring, careful attention to error detection, and automatic failover. You also have customer-replaceable units for failure-prone components, utilities for managing all of the redundancy, and a fancy nameplate. In exchange for that, you have more complexity, more cost, serious custom hardware and software modifications, and often (but not always) performance constraints.
In a clustered system, you treat each individual server as a failure unit. Good fault detection is a challenge, especially for damaging but non-catastrophic failure, but it's much easier to configure a given level of redundancy and it's easier to take care of environmental problems like building power (or water in the second floor) -- you just configure part of the cluster a longer distance away.
Where clustering is inadequate is when you have a single mission-critical system where any failure is disaster (like flight-control avionics or nuclear power plant monitoring). There are applications where there's no substitute for redundant design, locked-clock processors and "voting" hardware, and all of the other low-level safeguards you can use.
For Web applications, however, where a certain sloppiness is tolerable, and where you want the advantages of load balancing, off-the-shelf hardware and software, and system administration that doesn't require an EE with obsessive-compulsive disorder, clusters are the natural solution.
The fact that you get to sell more licenses for the software is just gravy.
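For anyone who hasn't met "voting" hardware, the idea can be sketched in software as a majority vote over redundant results. This is an illustration only, assuming three redundant units whose outputs can be compared directly; real locked-clock voters work at the hardware level, and the function name `vote` is made up for this example.

```python
from collections import Counter


def vote(outputs):
    """Return the value a strict majority of redundant units agree on.

    Raises ValueError when no value has more than half of the votes,
    which is the case where the system cannot mask the fault.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among redundant units")
    return value


# Three hypothetical units compute the same result; one is faulty.
agreed = vote([42, 42, 41])
```

With three units, one wrong answer is outvoted. If all three disagree, the voter can only report the fault, not fix it.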
What you wrote is really ignorant (which, modded on /., translates to Insightful).
1. (because I have yet to meet a clustering DB solution that didn't suck).
Where do you live? In Rwanda?
Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.
2. one copy of Debian + Apache + MySQL + Perl or 200 copies
MySQL isn't enterprise-reliable even in stand-alone configuration, let alone clustering. I can't believe this...
3. And Windows doesn't support clustering yet in any decent way, shape, or form, so I don't see the problem here.
Hah, hah! Enough said.
And also - what's it to you? If Microsoft (in your view) had a good clustering solution, you'd lose sleep over that?
When you're biased like that, no wonder you can't have a quality, unbiased opinion on this topic.
We run large numbers of Dell 2850s with RAID arrays, redundant power, etc., powering high-volume websites. I can say firsthand that internal fault tolerance in these systems only gets you so far: the management device in charge of the two power supplies can itself fail, leaving both power supplies useless, or a RAID card can go out of commission, leaving drives with mangled and unrecoverable data. As with most solutions, a mixture of both fault tolerance and data clustering is the safest alternative.
There is nothing like OpenBSD running pf and carp. Dead easy to set up, works like a charm, and secure by default. One wonders why the editors seem to think OSS == Linux.
http://www.openbsd.org/faq/pf/index.html
http://www.openbsd.org/faq/faq6.html#CARP
Cypherpunks: Civil Liberty Through Complex Mathematics. Those who live by the sword die by the arrow.
It all comes down to Availability (Clustering) vs. Reliability (Fault Tolerant). They are NOT the same thing.
Fault tolerant servers are nice; even the simplest true server should offer some degree of fault tolerance (e.g., RAID drives). This is handy but may not help your availability if you have an SLA promising xx% of uptime and then find yourself needing to take the server down to apply service packs or other patches.
Clustered servers allow you to increase the availability of your machines, because when you need to take one down for some updates, you can simply fail over all your traffic to the other server in the cluster. Clustering may increase the availability of the services those machines are offering, but it does not help the reliability of the machines themselves.
Therefore, I personally choose to start with fault tolerant machines initially (RAID and dual power supplies at a minimum). It makes for a good base. If the services on that machine are 'mission critical', then cluster that machine with other fault tolerant machines.
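The availability side of that distinction can be put into rough numbers. The sketch below assumes node failures are independent (real clusters only approximate this) and uses a made-up 99% per-server availability. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def parallel_availability(node_availability, nodes):
    """Availability of a service that stays up while at least one node is up.

    Assumes independent node failures and instant, perfect failover,
    so treat the result as an upper bound.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_hours_per_year(availability):
    """Expected hours per year during which the service is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99  # hypothetical availability of one server
pair = parallel_availability(single, 2)
print(f"one server : {downtime_hours_per_year(single):.1f} h/year down")
print(f"two servers: {downtime_hours_per_year(pair):.2f} h/year down")
```

Note what the model leaves out: the second server does nothing for the reliability of the first, which is the point made above. It only changes how often the service as a whole is unreachable.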
--LWM
Not worth doing. The cluster components should be dumb. There isn't a valid reason to have them know about each other. Your round robin or whatever balancing you want should come from outside. F5 makes a nice box for that, and so do others; if you're really a cheapskate you could duplicate them yourself. If anything needs to know who is on what machine, let the system tell that to the backend DB machine. It should be a channel architecture, not a crazy tangle. The more you break the functions down on the system level, the better and faster your cluster will be.
Syncing databases on the other hand is tricky. Save your money and resources for that.
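A toy version of "balancing from outside" shows how little the nodes themselves need to know. This is a sketch of plain round robin with a health set, not how an F5 box or any particular product works. The class and node names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend nodes in rotation, skipping ones marked down.

    The nodes stay "dumb": all knowledge of who is healthy lives here,
    outside the cluster members.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        # One full trip around the ring is guaranteed to hit a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
balancer.mark_down("web2")  # e.g. taken out for patching
target = balancer.pick()
```

The balancer is now the single thing that must not fail, which is presumably why the post suggests duplicating it.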
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
What about iFolder? Looking at the specs, I think it's missing serverless/hiving (which could be provided by any of the normal p2p people) and file history ... and I'm not understanding your database object comment.
;^)
Speaking of which, what about freenet? The only thing it's missing is "guaranteed availability of critical business data", eh? And I hear it might have some performance problems.
--Robert
Google proved that clustering could be fault tolerant, while costing less than true fault-tolerant hardware.
Google built massive clusters of thousands of machines out of very cheap unreliable hardware. They have tons of hardware failure due to the extremely cheap components (and sheer number of machines), but everything is redundant (And fully fault tolerant).
They did this, again, using dirt cheap hardware.
According to a presentation that I recently attended given by Jim Reese, the guy who scaled Google from a couple hundred servers to over 300,000, this is still true. It was a very interesting presentation and included discussion about the problems with cramming 80 PCs into a standard server rack... including heat, cable management, machine replacement.. etc.
Other interesting tidbits that I remember:
-over 300,000 x86 machines make up the network, with clusters all over the place which make searches return in under .3 seconds.
-commodity hardware (maxtor, western digital, whatever is available) is used.
-over a thousand machines fail daily. Most are automatically rebooted, and it sounded like admins only come into play when a machine needs to be replaced.
-the longest uptime of a single machine has been 7 years
-they use a heavily modified redhat distro.
-real time stats of the entire network can be seen at any moment
i'm sure there were more interesting facts but that's all I can regurgitate at the moment.
Clustering provides you with fault-tolerant OS/applications. A single server with tons of redundant bits doesn't help you if the OS or applications that it serves get borked.
This is dead-on correct. For example, if a CGI hits a problematic state where it eats a lot of memory, putting the server into a state where it's swapping, then it takes longer to service each http transaction, which means more httpd transactions queue up, which means more memory gets allocated, which means more swapping .. rendering the machine useless for a little while (until a sysadmin or a bot notices the state and either restarts the httpd or kills a few select processes). If we were running this on one mammoth server with lots of redundant bits, then 100% of our web service capacity would be down in the interim. But since we run a pool of ten http servers under keepalived/IPVS, we only lose 10% of our capacity during that time.
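The capacity argument is simple arithmetic, sketched here under the assumption that every server in the pool carries an equal share of the load. The function name is made up for illustration.

```python
def remaining_capacity(total_servers, wedged_servers):
    """Fraction of serving capacity left when some servers are wedged.

    Assumes identical servers and an even split of traffic.
    """
    if total_servers <= 0:
        raise ValueError("need at least one server")
    if not 0 <= wedged_servers <= total_servers:
        raise ValueError("wedged_servers must be between 0 and total_servers")
    return (total_servers - wedged_servers) / total_servers


# One swapping box out of a pool of ten versus one big box.
pool = remaining_capacity(10, 1)
mammoth = remaining_capacity(1, 1)
print(f"pool of ten keeps {pool:.0%}, single server keeps {mammoth:.0%}")
```

This ignores the case where the same bad CGI state hits every pool member at once, which a load balancer will not save you from.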
Other reasons I've traditionally preferred clustering: easy to incrementally scale up infrastructure (no big buy-in in the beginning to get the server which can be expanded), fully parallel resources (an independent memory bus, an independent IO bus, two independent CPU's, an independent network card, and a few independent disks for each server, as opposed to a mammoth shared bus on a leviathan crossbar, which will inevitably run into contention), and more flexibility in how resources are divided amongst mutually exclusive tasks.
One of those reasons is getting less relevant -- point-to-point bus technologies like HyperTransport (formerly Lightning Data Transport) and PCI-Express are inexpensively replacing the "one big shared bus" with a lot of independent busses, transforming the server into a little cluster-in-a-box. It is a positive change IMO, and shifts the optimal setup away from the huge cluster of relatively small machines, and towards a more moderately-sized cluster of medium-sized cluster-in-a-box machines.
The price of licenses is, IME, rarely an issue (in my admittedly limited career -- I don't doubt that it's relevant to many companies) because the places I've worked for have tended to use primarily free-as-in-beer (and often free-as-in-speech) open source solutions. What is more of an issue, IME, is the necessity of staffing yourself with cluster-savvy sysadmins and software engineers. Those of that ilk tend to be a bit rare and expensive, and difficult to keep track of. It takes a distributed systems professional to look at a distributed system and understand what is being seen, and this makes it easy to bend the spec or juggle the schedule on the sly, or run skunkworks projects outright. By contrast, the insanely redundant, mondo-expensive uberserver was created and programmed by very smart hardware and software specialists so that your IT staff doesn't need to be so specialized. This makes useful talent easier to acquire, and understanding the system closer to the reach of mere mortals.
Just my two cents
-- TTK
Both.
The agency bought a pair of dual proc Dells with lots of RAM and a full software stack (Windows Server, SQLServer, and ColdFusion Server). Total cost: ~$57,000.
That's right, nearly 60k.
Now, I've read that Google buys their white boxes at $1k each for their server farm. And I couldn't help but think what they (or I) would do with 57 boxes instead of 2.
But hey, my opinion doesn't matter. I'm not a PHB in a gov't agency. But sure as hell, if I were a business in a competitive environment (and a gov't agency is not), I'd be looking to implement the simple and effective white box solution on the cheap. But that's just me.
Sun: http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=83174
For around $20,000 you could build a PC cluster that includes:
20+ x Intel P4 D820 at ~$500 ea.
20+ x AMD64 X2 3800+ at ~$750 ea.
You could almost get a cluster of 40 Intel PCs, each with a dual-core chip running at 2.8 GHz. Or almost 30 AMD64 PCs, each with a dual-core chip running at 2.0 GHz. If you shop smart you can get gigabit ethernet on the motherboard and have a fault-tolerant / redundant system with over 10 times the performance of the Sun system.
I don't know about you, but I would take the cluster of AMD X2s. The Intels might beat 'em on price/performance, but the X2s might be a lil bit nicer to work on.
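A quick check of the box counts, taking the post's figures at face value: a $20,000 budget and roughly $500 or $750 per complete PC. The function name is hypothetical, and switches, racks, and power are deliberately left out.

```python
def boxes_for_budget(budget_dollars, price_per_box_dollars):
    """Number of whole machines a budget covers at a given unit price."""
    if price_per_box_dollars <= 0:
        raise ValueError("price must be positive")
    return budget_dollars // price_per_box_dollars


BUDGET = 20_000  # figure quoted in the post

intel_boxes = boxes_for_budget(BUDGET, 500)
amd_boxes = boxes_for_budget(BUDGET, 750)
print(f"Intel boxes: {intel_boxes}, AMD boxes: {amd_boxes}")
```

At these prices the AMD count comes out in the mid-twenties, a little short of the "almost 30" in the post, so the real number depends on how hard you shop.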
If you have an application that requires ULTIMATE uptime, then you need a geographically remote cluster (Cluster spread over two sites with a redundant leased line link to provide the heartbeat). No matter how many redundant parts in a server, if it gets nuked (read power failure, flood, or other, not ACTUALLY nuked) then that application is down.
Active-active clusters are not really ideal. Load balancing is a nice idea, but in this setup, when half of the cluster fails, the application suffers severe performance issues. Active-active also creates data issues, as you've got two servers writing to their own local storage, which also requires real-time replication between sites. Veritas Storage Foundation is about the most cost-effective option here; you don't even need 2003 Server Enterprise.
If you want a nice simple active-passive cluster and it's in the same locale, fine, use a SAN. If the nodes are geographically remote, they will need real-time replication, and since one is passive you can use HP Storage Mirror or similar. HP is in fact the only vendor that does a nice packaged cluster solution with a SAN included, all under one part code. FYI.
Having said that, if you're buying a decent server, then you are an absolute idiot to not put RAID into it. After that, it only costs another £300 or so to add a redundant hot-plug PSU & fan. Plus p'raps a bit for an extra CPU. After that, the only component that will cause a total outage is the mainboard failing - and the only real way to get around that is to... uh... add another mainboard! Well... guess that's another server then...!
Others have said it, I'll say it again: you don't use clustering in place of FT hardware, or vice versa. You use them together!
Take a server: Hot-swappable mirrored OS disks, N+1 power supplies, dual NICs (which support failover), dual cards initiating separate paths to your storage (through independent switches, if fibre-attached), ECC RAM with on-system logic to take out a failing DIMM. Oh yeah, and multiple CPUs, again with logic to remove one from active use if need be. (chipkill sort of stuff.)
Now take another identical server (or two) and cluster them. By cluster, I mean add the heartbeat interconnects and software layer to monitor all of the mandated hardware and application resources, and fail over as necessary, or take other appropriate actions. Gluing a pile of machines together in a semi-aware grid is NOT a cluster, and does not properly address the same problem!
Now once you've got this environment in place, add the most crucial aspect: Highly competent sysadmins, and a strict change control system. The former will cost you a fair sum of money in salary, and the latter will likely necessitate duplicating your entire cluster for dev/test purposes, before rolling out changes.
That's the beginning of an HA environment. Still up for it?
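The "heartbeat interconnects and software layer" part can be sketched as a monitor that tracks when each node last checked in and moves the active role when the current holder goes quiet. This is a bare-bones illustration, not HACMP or any real cluster manager: there is no fencing, no quorum, and no resource handling, and every name here is hypothetical. Timestamps are passed in explicitly so the logic is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks node heartbeats and fails the active role over on timeout."""

    def __init__(self, nodes, timeout_seconds):
        if not nodes:
            raise ValueError("need at least one node")
        self.timeout = timeout_seconds
        self.last_seen = {node: None for node in nodes}
        self.active = nodes[0]

    def beat(self, node, now):
        """Record a heartbeat received from a node at time `now`."""
        self.last_seen[node] = now

    def is_alive(self, node, now):
        seen = self.last_seen[node]
        return seen is not None and now - seen <= self.timeout

    def check(self, now):
        """Return the node that should be active, failing over if needed.

        Returns None when no node has a fresh heartbeat.
        """
        if self.is_alive(self.active, now):
            return self.active
        for node in self.last_seen:
            if self.is_alive(node, now):
                self.active = node  # failover
                return node
        return None


monitor = HeartbeatMonitor(["node-a", "node-b"], timeout_seconds=3.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-b", now=4.0)        # node-a has gone silent
new_active = monitor.check(now=5.0)
```

A real cluster also has to make sure the "dead" node is truly dead before taking over its storage, which is a large part of why the post insists on competent sysadmins.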
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban