Slashdot Mirror


Clustering vs. Fault-Tolerant Servers

mstansberry writes "According to SearchDataCenter.com, fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supplies and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.

20 of 321 comments

  1. Software vendors by PCM2 · · Score: 4, Insightful

    So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.

    --
    Breakfast served all day!
  2. Not the same. by tekn0lust · · Score: 5, Informative

    Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or the applications that it serves get borked.

  3. Since information wants to be free by Shadow+Wrought · · Score: 4, Funny

    Shouldn't we be encouraging server failures which enable their freedom from magnetic imprisonment? Kinda like PETA freeing lab animals...

    --
    If brevity is the soul of wit, then how does one explain Twitter?
  4. More about the cost of hardware? by Sv-Manowar · · Score: 4, Interesting

    Because of the open source stack behind a lot of server platforms these days, I'm dubious that this decision boils down simply to a software cost issue. One major benefit of using clustering is that many white box, non-specialized machines can be used, which are easier & cheaper to replace or obtain components for. Complex and specialized hardware with built-in redundancy is often expensive and can require vendor support contracts for effective maintenance.

  5. Apples and Oranges by Steven_M_Campbell · · Score: 4, Funny

    If you are just talking about fault tolerance (FT), then spill a drink on the FT server, then spill a drink on a clustered server, and see the difference :) If we are not limited to fault tolerance, then try load balancing an FT server with.. um..er... itself. This is really apples and oranges. BTW, I like FT servers in a cluster!

  6. Re:It depends on what you want to do. by Tenareth · · Score: 4, Insightful

    A Web farm is the simplest form of clustering, some would argue it isn't even a cluster because the nodes are not aware of each other. However, it gets more confusing when you add a Java layer that load balances...

    Anyway, I do agree that I've seen DB clustering solutions cause more trouble than they prevent...

    A cluster adds complexity to the environment, and Complexity == Cost, even without the expensive software.

    --
    This sig is the express property of someone.
  7. Re:And what's so difficult about... by MankyD · · Score: 4, Funny
    And what's so difficult about clustering a bunch of fault tolerant servers?
    Well, that's just plain redundant. Err...
    --
    -dave
    http://millionnumbers.com/ - own the number of your dreams
  8. Re:It depends on what you want to do. by CSHARP123 · · Score: 5, Informative
    And windows doesnt support clustering yet
    Windows Server 2003 actually supports two different types of clustering. One is called network load balancing, which enables up to 32 clustered servers to run a high-demand application to prevent a single server from being bogged down. If one of the servers in the cluster fails, then the other servers instantly pick up the slack.

    Network load balancing has been most often used with Web servers, which tend to use fairly static code and require little data replication. If a clustered web site needs more performance than what the cluster is currently providing, additional servers can be instantaneously added to the cluster. Once the cluster reaches the 32-server limit, you can further expand the cluster by creating a second cluster and then using round-robin DNS to divide traffic between the two clusters.

    The other type of clustering that Windows Server 2003 supports by default is often referred to simply as clustering. The idea behind this type of clustering is that two or more servers share a common hard disk. All of the servers in the cluster run the same application and reference the same data on the same disk. Only one of the servers actually does the work. The other servers constantly check to make sure that the primary server is online. If the primary server does not respond, then the secondary server takes over.

    This type of clustering doesn't really give you any kind of performance gain. Instead, it gives you fault tolerance and enables you to perform rolling upgrades. (A server can be taken offline for upgrade without disrupting users.) In Windows 2000 Advanced Server, only two servers could be clustered together in this way (four servers in Windows 2000 Datacenter Edition). In Windows Server 2003, though, the limit has been raised to eight servers. Microsoft offers this as a solution to long-distance fault tolerance when used in conjunction with the iSCSI protocol (SCSI over IP).
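
    The primary/secondary arrangement described above can be sketched in a few lines. This is only an illustration of the heartbeat-and-takeover idea, not how Windows clustering is implemented; the class name FailoverMonitor, the timeout value, and the node labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Standby-side view of an active/passive pair (hypothetical names)."""
    timeout: float                 # seconds of silence tolerated before takeover
    last_heartbeat: float = 0.0    # time the primary was last heard from
    active: str = "primary"        # which node currently does the work

    def heartbeat(self, now: float) -> None:
        # Called whenever the primary answers a health check.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # Promote the standby once the primary has been silent too long.
        if self.active == "primary" and now - self.last_heartbeat > self.timeout:
            self.active = "secondary"
        return self.active


monitor = FailoverMonitor(timeout=3.0)
monitor.heartbeat(now=10.0)
print(monitor.check(now=12.0))  # 2 s of silence, inside the timeout
print(monitor.check(now=14.5))  # 4.5 s of silence, the standby takes over
```

    The sketch never hands control back to the primary, and it cannot tell a dead primary from a slow one. With a shared disk, a real cluster also has to stop the old primary from writing before the standby takes over.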

  9. Not either/or by Declarent · · Score: 5, Interesting

    I build AIX HACMP clusters for a living, and I'll tell you that you should *never* use an either/or approach, as TFA suggests. Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.

    Maybe if they want to write an article, they should spend some time in the real world and see how the HA industry works instead of making up some arbitrary demarcation line to hang a preconception on.

  10. Re:It depends on what you want to do. by crimethinker · · Score: 4, Interesting
    Quoth the GP: "in any decent way shape or form"

    Yes, Windows has supported clustering since NT4 (Wolfpack), and per the GP, it SUCKED BOLLOCKS. I had to deal with that shite every damn day for almost 3 years (1997-2000). We used active-active failover, and the joke around the company was that MS were halfway there: the "fail" worked just fine.

    -paul

    --
    Pistol caliber is like religion: everyone has their favourite, and theirs is the only right choice.
  11. Clustering Potentially Solves More Problems by bradm · · Score: 4, Informative

    Fault tolerance gets you a machine that keeps running in the face of hardware failures and maintenance. The switchover time is arguably negligible.

    Clustering gets you a set of services that keep running in the face of hardware failures and maintenance. The switchover time can range from negligible to huge depending on the application involved.

    However, clustering also helps you to solve other problems, including scaling, software failures, software upgrades, A-B testing (running different versions side by side), major hardware upgrades, and even data center relocations.

    Clustering tends to require a lot more local knowledge to get right.

    So if you narrow the problem definition to hardware only, they solve the same class of problems. But when you broaden it to the full range of what clustering offers you find a greater opportunity for cost savings - because one technique is covering multiple needs.

  12. SneakerNet * by dada21 · · Score: 5, Interesting

    In my 15 years of IT consulting, no network has provided data safety transparency cheaply or consistently enough. Clusters and fault tolerance both cost more than downtime in my experience.

    We desperately need a better way to access data in a corporate network.

    My favorite customers are those architects and engineers who avoid networking except for the Net. Seriously, sneakernet and peer-to-peer have shown the least downtime I've seen.

    I think p2p networks will see a comeback if a torrent-like protocol can grow to be speedy. My customers are not banks, but they need 100% uptime as every day is a beat-the-deadline day.

    If someone can extend and combine an internal torrent system with a decent file cataloging and searching system, they'll see huge money. I have some 150 user CAD networks just waiting for it.

    What would a hive network need?

    * Serverless
    * Files hived to 3+ workstations
    * Database object hiving
    * File modification ability (save new file in hive, rename previous file as old version, delete really old versions after user configurable changes)
    * "Wayback Machine" feature from old versions
    * PCs disconnected from hive will self correct upon reconnection
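
    The "files hived to 3+ workstations" item needs some rule for deciding which workstations hold which file without a server to ask. One possible rule, offered here as an assumption rather than anything the list above specifies, is rendezvous hashing: every peer can compute the same answer on its own. The function name pick_replicas and the workstation names are made up for the sketch.

```python
import hashlib
from typing import List, Sequence


def pick_replicas(file_name: str, workstations: Sequence[str], copies: int = 3) -> List[str]:
    """Choose which workstations hold a file, using rendezvous hashing."""
    def score(workstation: str) -> str:
        # Every peer computes the same score for a (workstation, file) pair.
        return hashlib.sha256(f"{workstation}:{file_name}".encode()).hexdigest()

    # The highest-scoring workstations win; no central directory is consulted.
    ranked = sorted(workstations, key=score, reverse=True)
    return ranked[:copies]


hive = [f"cad-{i:02d}" for i in range(10)]
print(pick_replicas("plans/site.dwg", hive))
```

    When a workstation drops out, only the files it was holding need a new home; the placement of every other file is unchanged, which keeps re-copying to a minimum.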

    It is very complex right now, but my bet is that the P2P network will trump client-server for the short run. The "client is the server" vs "the server is the client"?

  13. No difference, just a matter of packaging. by TheMohel · · Score: 4, Informative

    Having built both true high-reliability fault-tolerant devices and clustered systems, I don't see any fundamental theoretical difference. In both cases, you have redundant hardware capacity in place, theoretically to allow you to tolerate the failure of a certain amount of your hardware (and, sometimes, your software) for a certain amount of time. Neither option guards you against failures outside of the cluster or FT system box. Neither one is a panacea. Both are sold as snake-oil insurance against "badness".

    In a single fault-tolerant box, you generally have environmental monitoring, careful attention to error detection, and automatic failover. You also have customer-replaceable units for failure-prone components, utilities for managing all of the redundancy, and a fancy nameplate. In exchange for that, you have more complexity, more cost, serious custom hardware and software modifications, and often (but not always) performance constraints.

    In a clustered system, you treat each individual server as a failure unit. Good fault detection is a challenge, especially for damaging but non-catastrophic failure, but it's much easier to configure a given level of redundancy and it's easier to take care of environmental problems like building power (or water in the second floor) -- you just configure part of the cluster a longer distance away.
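
    To put a number on "a given level of redundancy", here is a rough back-of-the-envelope calculation. It assumes node failures are independent, which shared power, shared storage, or a common software bug will violate; the function name and the sample figures are illustrative only.

```python
from math import comb


def cluster_availability(node_availability: float, nodes: int, needed: int = 1) -> float:
    """Probability that at least `needed` of `nodes` are up.

    Assumes every node fails independently with the same availability.
    """
    a = node_availability
    return sum(
        comb(nodes, k) * a**k * (1 - a) ** (nodes - k)
        for k in range(needed, nodes + 1)
    )


# Two 99%-available servers, either of which can carry the load alone.
print(cluster_availability(0.99, nodes=2))
# Three 90%-available servers, where two must be up to handle the traffic.
print(cluster_availability(0.90, nodes=3, needed=2))
```

    The independence assumption is the weak point: moving part of the cluster a longer distance away is exactly what makes it closer to true.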

    Where clustering is inadequate is when you have a single mission-critical system where any failure is disaster (like flight-control avionics or nuclear power plant monitoring). There are applications where there's no substitute for redundant design, locked-clock processors and "voting" hardware, and all of the other low-level safeguards you can use.
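
    The "voting" idea is easy to show in software, even though the real thing is done in hardware with lock-stepped processors. This is a minimal sketch of a majority voter over redundant units; the function name and the sample readings are assumptions.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value a strict majority of units agree on, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required; ties and three-way splits yield None.
    return value if count > len(outputs) // 2 else None


print(majority_vote([42, 42, 41]))  # one faulty unit is outvoted
print(majority_vote([1, 2, 3]))     # no agreement, so no answer
```

    With three units the voter masks any single wrong answer, but it cannot help if two units fail the same way, for example through a shared design bug.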

    For Web applications, however, where a certain sloppiness is tolerable, and where the advantages of load balancing, off-the-shelf hardware and software, and system administration that doesn't require an EE with obsessive-compulsive disorder all count for something, clusters are the natural solution.

    The fact that you get to sell more licenses for the software is just gravy.

  14. Ignoramus by Donny+Smith · · Score: 4, Informative

    What you wrote is really ignorant (which, modded on /., translates to Insightful).

    1. (because I have yet to meet a clustering DB solution that didnt suck).

    Where do you live? In Rwanda?
    Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.

    2. one copy of Debian + Apache + MySQL + Perl or 200 copies

    MySQL isn't enterprise-reliable even in a stand-alone configuration, let alone clustering. I can't believe this...

    3. And windows doesnt support clustering yet - in any decent way shape or form, I dont see the problem here.

    Hah, hah! Enough said.
    And also - what's it to you? If Microsoft (in your view) had a good clustering solution, you'd lose sleep over that?
    When you're biased like that, no wonder you can't have a quality, unbiased opinion on this topic.

  15. Re:It depends on what you want to do. by Jim+Hall · · Score: 5, Informative

    Let me preface this by saying I'm the Enterprise IT Manager for a large, Big-10 University. "Enterprise" means I am responsible for all servers that run the University, not just a small department. My userbase is 70,000+ students, and somewhere between 15,000-20,000 faculty and staff.

    We run a variety of hardware platforms, including a large Linux deployment. Yes, it really does depend on what you want to do with that server, before you can decide to go with a bunch of servers behind a load balancer v. a larger, fault-tolerant server.

    For our production web servers (PeopleSoft, web registration, etc.) we run a bunch of cheap servers running Red Hat Enterprise Linux, and we distribute them across two data centers (for redundancy.) We run a load balancer in front of them, so that users access one URL, and the load balancer automagically distributes traffic to the servers on both data centers. For a lightly-used application, we may only run 2 web servers. For heavily-used applications (web registration) we run 5 web servers. Those are IBM x-series now, but we are in the process of moving to IBM BladeCenters.

    With multiple servers in production, I can lose any single web server and not experience downtime on the application. We usually only have a single PSU in each server, because there's no point in the extra expense when we have redundancy at the server level. And because we've split our web servers across two data centers, I can actually lose an entire data center and only experience slow response time on the application. (Note to the paranoid: while the data centers are only 1.4miles apart, they are on separate power grids, etc. The other back-end infrastructure is also split between data centers.) We run a lot of sites behind load balancers, so we can afford to have a separate load balancer pair at each site (which can provide backup to each other.)
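
    The "lose any single web server" behaviour comes from the load balancer skipping servers that fail their health checks. Here is a minimal sketch of that logic as round-robin over healthy backends; the class name and the server names are invented for the example, and a production load balancer does far more (connection draining, session affinity, its own redundancy).

```python
from itertools import cycle
from typing import Dict, List


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        self.backends = list(backends)
        self.healthy: Dict[str, bool] = {b: True for b in self.backends}
        self._ring = cycle(self.backends)

    def mark(self, backend: str, up: bool) -> None:
        # A health checker would call this as servers fail and recover.
        self.healthy[backend] = up

    def pick(self) -> str:
        # Look at each backend at most once per request.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["dc1-web1", "dc2-web1", "dc1-web2"])
lb.mark("dc2-web1", up=False)          # one server dies
print([lb.pick() for _ in range(4)])   # traffic keeps flowing to the other two
```

    Marking every backend in one data center as down models losing a whole site: the survivors take all the requests, which matches the "slow response time but no outage" outcome described above.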

    However, for large applications we may use a single fault-tolerant Linux server. For example, we used to do this with a database server. Multiple power supplies, multiple network connections, RAID storage, etc. To be honest, though, we tend to run databases on "big iron" hardware such as Sun SPARC (E25000, V890, etc.) and IBM p-series. We don't have any Linux database servers left, but that's not because Linux wasn't up to the task (our DBAs preferred to have the same platform for all databases, to make debugging and knowledge-sharing easier.)

    In a few cases, we have a third tier. If the application is low-priority (e.g. a development server) and/or low-volume (e.g. a web site that doesn't get much traffic), we run a single server for that. The server is a cheap IBM x-series box running Red Hat Enterprise Linux, usually with no built-in redundancy.

    Yes, for us Linux has been able to play along quite nicely with the "big iron" UNIX systems. We've run Linux at the Enterprise level since 1998 or 1999, and Linux is definitely considered part of our Enterprise solution.

  16. Re:It depends on what you want to do. by Donny+Smith · · Score: 5, Insightful

    Well, in case you haven't noticed, it's late 2005 now.
    Some things have changed, for example Windows 2003 Server came out and MSCS is now quite a decent HA solution.

    (BTW, the grandparent post didn't say that Microsoft's own clustering solution was lame, he made a general statement about all clustering software for the Windows platform).

  17. Re:It depends on what you want to do. by Marillion · · Score: 4, Insightful
    That makes lots of sense. Software costs do multiply in clustering. Zero times 100 is still zero. But, clustering has other headaches beyond money.

    The usual clustering I've seen is "Hot Spare" clustering. The primary runs until it goes kaput, then the second takes over. For database clustering, the two boxes usually share the same disks. I think I've seen more outages from false takeovers by the secondary than real failures of the primary.

    The other problem with clustering is that all of your software applications have to be cluster tolerant. If the user app keeps a database connection open and a rollover occurs, the connection state doesn't and can't rollover with it. To a client system, a cluster failover looks like a server reboot. Don't underestimate the difficulty of this problem. A new application has to be designed with that in mind. Retro-fitting it in later is hard - and costly, even with free platforms.
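
    One common way to make a client "cluster tolerant" is to treat a dropped connection as a signal to reconnect and replay the whole unit of work. This is a hedged sketch of that pattern; connect and work are placeholder callables supplied by the application, not functions from any particular database driver.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       work: Callable[[object], T],
                       attempts: int = 3,
                       delay: float = 0.0) -> T:
    """Run work(conn); if the connection drops, reconnect and start over."""
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()      # fresh connection: the old session state is gone
            return work(conn)     # the whole unit of work is replayed from the top
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

    Replaying is only safe if the work is idempotent or wrapped in a transaction that was rolled back when the connection died, and that is the part that is hard to retrofit into an existing application.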

    Another issue that can't be solved with clustering is application failure or application limits. You may recall the airline system failure last Christmas? Some 80% of Slashdot readers asked "where was the backup?" (there was one) or said they should have used Unix (they were). The box (RS6000) and operating system (AIX) kept running just fine. A hundred-computer cluster couldn't solve the real problem: the application couldn't handle the volume of information it was required to hold, and they were at the mercy of a proprietary source code vendor.

    --
    This is a boring sig
  18. Re:Never build systems on a core of failure. by dkleinsc · · Score: 4, Informative

    Most successful strategies I've heard of involve building a system out of parts that you know can't fail, and then designing the system around the failure of the parts that you know can't fail.

    --
    I am officially gone from /. Long live http://www.soylentnews.com/
  19. Re:This brings to mind Google's strategy. by mcewen98 · · Score: 5, Interesting

    According to a presentation that I recently attended given by Jim Reese, the guy who scaled Google from a couple hundred servers to over 300,000, this is still true. It was a very interesting presentation and included discussion about the problems with cramming 80 PCs into a standard server rack... including heat, cable management, machine replacement.. etc.

    Other interesting tid bits that I remember:

    -over 300,000 x86 machines make up the network, with clusters all over the place which make searches return in under .3 seconds.
    -commodity hardware (Maxtor, Western Digital, whatever is available) is used.
    -over a thousand machines fail daily. Most are automatically rebooted, and it sounded like admins only come into play when a machine needs to be replaced.
    -the longest uptime of a single machine has been 7 years
    -they use a heavily modified Red Hat distro.
    -real time stats of the entire network can be seen at any moment

    I'm sure there were more interesting facts but that's all I can regurgitate at the moment.

  20. Re:So the choice is between... by mickwd · · Score: 5, Funny

    It depends how often they go down.