Clustering vs. Fault-Tolerant Servers

Posted by ScuttleMonkey on Monday October 3, 2005 @07:01AM from the i-don't-want-my-server-to-tolerate-faults dept.

mstansberry writes "According to SearchDataCenter.com, fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supplies and storage into a single box, while clustering involves networking multiple standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate, based on both proprietary and Linux-based approaches.

14 of 321 comments

It depends on what you want to do. by ResQuad · 2005-10-03 07:02 · Score: 3, Interesting

Personally I opt for clustering over fault tolerance - but that's my personal choice. It really depends on what the machine(s) will be doing. If you have a database server - fault tolerance (because I have yet to meet a clustering DB solution that didn't suck). But if you're building a webserver - cluster.

Also, the one thing the article mentions is that clustering is just as expensive as fault tolerance due to software licensing. Last I checked, whether it's one copy of Debian + Apache + MySQL + Perl or 200 copies - it's going to cost me the same price (free). And Windows doesn't support clustering yet - in any decent way shape or form - so I don't see the problem here.

--
snowulf.com
1. Re:It depends on what you want to do. by crimethinker · 2005-10-03 07:14 · Score: 4, Interesting
  
  Quoth the GP: "in any decent way shape or form"
  Yes, Windows has supported clustering since NT4 (Wolfpack), and per the GP, it SUCKED BOLLOCKS. I had to deal with that shite every damn day for almost 3 years (1997-2000). We used active-active failover, and the joke around the company was that MS were halfway there: the "fail" worked just fine.
  -paul
  
  --
  Pistol caliber is like religion: everyone has their favourite, and theirs is the only right choice.
2. Re:It depends on what you want to do. by donaldm · 2005-10-03 08:13 · Score: 2, Interesting
  
  Most clusters are equivalent to DEC-safe (you can even get the source code on Freshmeat), which is mainly a group of machines joined together via a SCSI interconnect or a Storage Area Network and a common LAN. All interconnects should be redundant, and that includes the network. The only cluster that is different is the Tru64 cluster, which has a clustered file system. I think Red Hat clustering uses NFS (can anyone advise on this?), but you need a very fast network if you want disk performance.
  
  Fault tolerance, as in the Himalaya machines, is the most expensive option; nearly all components can be replaced while the machine is hot.
  
  A cluster is quite a reliable method of providing application availability. In the event of a cluster member failure, application failover to another cluster member should be relatively quick (about 1 to 5 minutes). However, if an application takes, say, 25 minutes to start (I actually struck this once), then failover is going to take at least 25 minutes. Also, your application should be capable of restarting and recovering after a power-off and power-on. If the application cannot do this, then clustering is useless and you should be thinking about fault-tolerant machines or getting your application vendor to fix the issue.
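
  A rough way to put numbers on the paragraph above, as a sketch only: it assumes the outage is simply the cluster-side work (detecting the failure and moving resources) plus the application start time. The function name and the split into two terms are illustrative assumptions, not taken from any cluster product.

```python
def failover_minutes(cluster_overhead_min: float, app_start_min: float) -> float:
    """Estimated outage: cluster-side work first, then the application start."""
    if cluster_overhead_min < 0 or app_start_min < 0:
        raise ValueError("durations must be non-negative")
    return cluster_overhead_min + app_start_min


# A quick-starting application versus the 25-minute start mentioned above.
# The 1-minute overhead is an assumed figure for illustration.
fast = failover_minutes(cluster_overhead_min=1, app_start_min=1)
slow = failover_minutes(cluster_overhead_min=1, app_start_min=25)
```

  However quick the cluster itself is, the application start time is a floor on the outage, which is the point of the 25-minute example.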
  
  PS. All clusters I have set up (TruCluster - Tru64 Unix) using Informix, Oracle, Sybase and SAP applications have worked extremely well.
  
  --
  There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
More about the cost of hardware? by Sv-Manowar · 2005-10-03 07:06 · Score: 4, Interesting

Because of the open source stack behind a lot of server platforms these days, I'm dubious that this decision boils down simply to a software cost issue. One major benefit of using clustering is that many white box, non-specialized machines can be used, which are easier and cheaper to replace or obtain components for. Complex and specialized hardware with built-in redundancy is often expensive and can require vendor support contracts for effective maintenance.

--
Business Voyeur
Why are clusters better? by darkmeridian · 2005-10-03 07:06 · Score: 2, Interesting

The article seems to make the choice one-sided. Fault tolerant servers have higher uptimes because the backup takes over immediately. Clusters have a single point of failure in the middleware. They argue that the clusters can run different operating systems, but that means more patches and updates to keep track of. Clusters are expensive because they need more OS and software licenses and require a lot of maintenance, though that might drop if they are running Linux or FreeBSD.

Can anyone make a case for clusters in high-uptime situations?

--
A NYC lawyer blogs. http://www.chuangblog.com/
Clustering is safer by arcadum · 2005-10-03 07:09 · Score: 2, Interesting

If you buy one machine, you still may need to power it off to open the case, or replace a part.
Not either/or by Declarent · 2005-10-03 07:13 · Score: 5, Interesting

I build AIX HACMP clusters for a living, and I'll tell you that you should *never* use an either/or approach, as TFA suggests. Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.

Maybe if they want to write an article, they should spend some time in the real world and see how the HA industry works instead of making up some arbitrary demarcation line to hang a preconception on.
1. Re:Not either/or by sapbasisnerd · 2005-10-03 07:45 · Score: 2, Interesting
  
  What Google does barely deserves the label clustering.
  Actually that's not really fair; the problem is that the term clustering has become overloaded. What Google does would be more completely described as "shared nothing" distributed computing. They use cheap-as-chips iron because nobody cares if a transaction fails: no data is lost, and the end user just pushes refresh. Similarly, the various grid compute "clusters" (SETI, Folding@Home etc.) can recover from a lost unit of work by sending it out for reprocessing after a timeout (or, IIRC, SETI doesn't wait: every unit of work is sent out multiple times and the results that do come back are compared).
  If on the other hand you are dealing with applications that actually save data, silly little things like, oh, electronic funds transfers or credit card charges, that's a whole different class of problem.
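
  The "send every unit out several times and compare what comes back" idea can be sketched roughly as below. This is a generic quorum check written for illustration, not the actual validator of SETI or any other project; `accept_result` and `quorum` are hypothetical names.

```python
from collections import Counter


def accept_result(results, quorum=2):
    """Return the value reported by at least `quorum` workers, else None.

    `results` holds whatever answers came back for one unit of work;
    lost units simply never show up in the list.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


agreed = accept_result([42, 42, 7])   # two workers agree, one is off
undecided = accept_result([1, 2])     # no agreement, so resend the unit
```

  This works because redoing a unit costs nothing but time. It does not help with the funds-transfer case above, where the work has side effects that must happen exactly once.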
SneakerNet * by dada21 · 2005-10-03 07:16 · Score: 5, Interesting

In my 15 years of IT consulting, no network has provided data safety transparency cheaply or consistently enough. Clusters and fault tolerance both cost more than downtime in my experience.

We desperately need a better way to access data in a corporate network.

My favorite customers are those architects and engineers who avoid networking except for the Net. Seriously, sneakernet and peer-to-peer have shown the least downtime I've seen.

I think p2p networks will see a comeback if a torrent-like protocol can grow to be speedy. My customers are not banks, but they need 100% uptime as every day is a beat-the-deadline day.

If someone can extend and combine an internal torrent system with a decent file cataloging and searching system, they'll see huge money. I have some 150 user CAD networks just waiting for it.

What would a hive network need?

* Serverless
* Files hived to 3+ workstations
* Database object hiving
* File modification ability (save new file in hive, rename previous file as old version, delete really old versions after user configurable changes)
* "Wayback Machine" feature from old versions
* PCs disconnected from hive will self correct upon reconnection
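
One way the "files hived to 3+ workstations" item could work without a server is for every PC to compute the same replica set from the file name alone. The sketch below uses rendezvous (highest-random-weight) hashing for that. It is only an illustration of the idea: `pick_replicas` and the workstation names are made up, and nothing here handles versioning or resynchronisation.

```python
import hashlib


def pick_replicas(filename, workstations, copies=3):
    """Pick `copies` workstations for a file via rendezvous hashing.

    Every peer that knows the same workstation list computes the same
    answer, so no central server has to hand out placements.
    """
    if len(workstations) < copies:
        raise ValueError("not enough workstations for the requested copies")

    def score(host):
        # Hash host and file name together; the highest scores win.
        return hashlib.sha256(f"{host}:{filename}".encode()).hexdigest()

    return sorted(workstations, key=score, reverse=True)[:copies]


hosts = [f"ws{i}" for i in range(10)]          # hypothetical workstation names
replicas = pick_replicas("plan.dwg", hosts)    # hypothetical CAD file
```

With this scheme, a workstation that was not holding a copy can drop off the network without changing where the file lives, and a rejoining PC can recompute which files it is supposed to hold.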

It is very complex right now, but my bet is that the P2P network will trump client-server for the short run. The "client is the server" vs "the server is the client"?
Re:Not the same. by Anonymous Coward · 2005-10-03 07:17 · Score: 1, Interesting

You're still not safe with clustering if you share data. I once worked on a SQL Server cluster with shared disks. SQL Server would crash because a database page contained crap data. The system would then take 10 minutes to fail over to another node. Once it was running, it would read the same page and crap out, causing the other node to come back up. Lather, rinse, repeat.
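
A common guard against this kind of ping-pong is to stop failing over automatically once it has happened too often in a short period. The sketch below shows the idea only; it is not a SQL Server or cluster-manager setting, and the function name, threshold and window are all assumptions.

```python
def should_fail_over(past_failovers, now, max_failovers=2, window_seconds=3600):
    """Allow another automatic failover only if few happened recently.

    `past_failovers` is a list of timestamps (in seconds) of earlier failovers.
    """
    recent = [t for t in past_failovers if now - t <= window_seconds]
    return len(recent) < max_failovers


first = should_fail_over([], now=1000)            # nothing recent: go ahead
looping = should_fail_over([0, 600], now=1200)    # two in 20 minutes: stop
```

When the guard says no, the nodes stay put and a human gets paged. With a bad page on shared disk, the second node was never going to do any better than the first.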
Google as an example by Guspaz · 2005-10-03 07:36 · Score: 2, Interesting

Google proved that clustering could be fault tolerant, while costing less than true fault-tolerant hardware.

Google built massive clusters of thousands of machines out of very cheap, unreliable hardware. They have tons of hardware failures due to the extremely cheap components (and sheer number of machines), but everything is redundant (and fully fault tolerant).

They did this, again, using dirt cheap hardware.
Re:This brings to mind Google's strategy. by mcewen98 · 2005-10-03 07:47 · Score: 5, Interesting

According to a presentation that I recently attended given by Jim Reese, the guy who scaled Google from a couple hundred servers to over 300,000, this is still true. It was a very interesting presentation and included discussion of the problems with cramming 80 PCs into a standard server rack, including heat, cable management, machine replacement, etc.

Other interesting tidbits that I remember:

-over 300,000 x86 machines make up the network, with clusters all over the place, which makes searches return in under 0.3 seconds.
-commodity hardware (Maxtor, Western Digital, whatever is available) is used.
-over a thousand machines fail daily. Most are automatically rebooted, and it sounded like admins only come into play when a machine needs to be replaced.
-the longest uptime of a single machine has been 7 years
-they use a heavily modified Red Hat distro.
-real-time stats of the entire network can be seen at any moment
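
For scale, here is the back-of-the-envelope arithmetic on the two figures above. Both were quoted as "over", so treat the result as a rough ratio. It also assumes failures are spread evenly across machines, which is my simplification and not something from the talk.

```python
machines = 300_000        # "over 300,000" machines
failures_per_day = 1_000  # "over a thousand" failures daily

# Fraction of the fleet failing on a given day.
daily_failure_rate = failures_per_day / machines

# Average days between failures for any one machine, assuming an even spread.
days_between_failures = machines / failures_per_day
```

That works out to about a third of a percent of the fleet per day, or one failure per machine every 300 days or so, most of which a reboot apparently fixes.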

I'm sure there were more interesting facts, but that's all I can regurgitate at the moment.
Real world example and cost by MarkEst1973 · 2005-10-03 08:38 · Score: 2, Interesting

A gov't contractor I worked for was getting a contract to consolidate multiple servers and apps into a single pair of servers (web and db) for a small gov't agency.
The agency bought a pair of dual proc Dells with lots of RAM and a full software stack (Windows Server, SQLServer, and ColdFusion Server). Total cost: ~$57,000.
That's right, nearly 60k.
Now, I've read that Google buys their white boxes at $1k each for their server farm. And I couldn't help but think what they (or I) would do with 57 boxes instead of 2.
But hey, my opinion doesn't matter. I'm not a PHB in a gov't agency. But sure as hell, if I were a business in a competitive environment (and a gov't agency is not), I'd be looking to implement the simple and effective white box solution on the cheap. But that's just me.
Re:Ignoramus by Anonymous Coward · 2005-10-03 08:57 · Score: 1, Interesting

Ever seen the actual numbers on Oracle RAC scaling on any decent-sized install... let's say 4+? Not pretty. Oracle RAC is an interesting technology, but it serves one purpose and one purpose only: to allow Larry to buy fancier boats.

Why do you think Oracle runs its business (main ERP apps) on 4 nodes of big Sun iron and not 8+ nodes of cheap Linux boxes? They tout the shit out of the cheap Linux boxes to their customers because if an IT department has a $4 million budget for a big project, Oracle wants to get $3.95 million of that. The only way they can do that is to push the crap out of cheap Dell and Linux for the hardware and the OS. But when it is *Oracle's* business on the line... well, the proof is in the pudding, as they say.