Slashdot Mirror


Clustering vs. Fault-Tolerant Servers

mstansberry writes "According to SearchDataCenter.com fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supply and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.

20 of 321 comments

  1. Not the same. by tekn0lust · · Score: 5, Informative

    Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked.

  2. Re:It depends on what you want to do. by TheRealMindChild · · Score: 2, Informative

    What is this then:

    http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx

    Clustering (NOT performance clustering mind you, which is NOT the topic at hand anyway) has been around in Windows NT as far back as I can remember. With NT4, you needed to have Enterprise Edition, but it was there.

    --

    "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
  3. Re:It depends on what you want to do. by CSHARP123 · · Score: 5, Informative
    And windows doesnt support clustering yet
    Windows Server 2003 actually supports two different types of clustering. One is called network load balancing, which enables up to 32 clustered servers to run a high-demand application to prevent a single server from being bogged down. If one of the servers in the cluster fails, then the other servers instantly pick up the slack.

    Network load balancing has been most often used with Web servers, which tend to use fairly static code and require little data replication. If a clustered web site needs more performance than what the cluster is currently providing, additional servers can be instantaneously added to the cluster. Once the cluster reaches the 32-server limit, you can further expand the cluster by creating a second cluster and then using round-robin DNS to divide traffic between the two clusters.

    The other type of clustering that Windows Server 2003 supports by default is often referred to simply as clustering. The idea behind this type of clustering is that two or more servers share a common hard disk. All of the servers in the cluster run the same application and reference the same data on the same disk. Only one of the servers actually does the work. The other servers constantly check to make sure that the primary server is online. If the primary server does not respond, then the secondary server takes over.

    This type of clustering doesn't really give you any kind of performance gain. Instead, it gives you fault tolerance and enables you to perform rolling upgrades. (A server can be taken offline for upgrade without disrupting users.) In Windows 2000 Advanced Server, only two servers could be clustered together in this way (four servers in Windows 2000 Datacenter Edition). In Windows Server 2003, though, the limit has been raised to eight servers. Microsoft offers this as a solution to long-distance fault tolerance when used in conjunction with the iSCSI protocol (SCSI over IP).
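    The "secondaries keep checking on the primary" behaviour described above can be sketched in a few lines of Python. This is only a toy model of an active/passive failover decision, not how Windows Server implements it; the `Node` class, the `pick_active` function and the three-second timeout are all assumptions made for illustration.

```python
# Illustrative only: a toy active/passive failover decision.
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before a node is presumed dead (assumed value)


@dataclass
class Node:
    name: str
    last_heartbeat: float  # timestamp of the most recent heartbeat received


def is_alive(node: Node, now: float) -> bool:
    """A node counts as alive if its last heartbeat is recent enough."""
    return now - node.last_heartbeat <= HEARTBEAT_TIMEOUT


def pick_active(nodes: List[Node], now: float) -> Optional[Node]:
    """Return the first live node in priority order; only that node does the work."""
    for node in nodes:
        if is_alive(node, now):
            return node
    return None  # every node has missed its heartbeat


primary = Node("primary", last_heartbeat=100.0)
secondary = Node("secondary", last_heartbeat=100.5)
cluster = [primary, secondary]

print(pick_active(cluster, now=101.0).name)  # primary is still responding

secondary.last_heartbeat = 109.5  # secondary keeps reporting; primary has gone silent
print(pick_active(cluster, now=110.0).name)  # primary timed out, secondary takes over
```

    The shared disk is what makes the takeover possible: the secondary starts from the same data the primary was using, which is also why this arrangement buys availability rather than performance.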

  4. Solution by Anonymous Coward · · Score: 1, Informative

    Just go with fault tolerant clusters.

  5. Never build systems on a core of failure. by CyricZ · · Score: 2, Informative

    That's one of those ideas that sounds all good and well, but it hardly works in practice. In many cases, downtime is unacceptable. You need transactions processed continually, and you cannot have downtime caused by a dead server.

    It is not a good idea to build a system out of parts that you know will fail, and then proceed to design the system around such failure. A far better idea is to spend some money, and design a system that will work. Of course you do take into account hardware failure, and you build in redundancy where necessary. But you do not build your solution around knowingly faulty and cheap hardware. That's just looking for trouble.

    Oftentimes the "cheap" solution ends up being most expensive, not only because of the cost of repeated hardware repairs, but also because of the cost of the labour necessary to perform the repairs, and the possibility of downtime. When you're processing millions of dollars worth of transactions per minute (if not per second), even a couple of minutes of downtime can be financially costly.

    --
    Cyric Zndovzny at your service.
    1. Re:Never build systems on a core of failure. by dkleinsc · · Score: 4, Informative

      Most successful strategies I've heard of involve building a system out of parts that you know can't fail, and then designing the system around the failure of the parts that you know can't fail.

      --
      I am officially gone from /. Long live http://www.soylentnews.com/
  6. Clustering Potentially Solves More Problems by bradm · · Score: 4, Informative

    Fault tolerance gets you a machine that keeps running in the face of hardware failures and maintenance. The switchover time is arguably negligible.

    Clustering gets you a set of services that keep running in the face of hardware failures and maintenance. The switchover time can range from negligible to huge depending on the application involved.

    However, clustering also helps you to solve other problems, including scaling, software failures, software upgrades, A-B testing (running different versions side by side), major hardware upgrades, and even data center relocations.

    Clustering tends to require a lot more local knowledge to get right.

    So if you narrow the problem definition to hardware only, they solve the same class of problems. But when you broaden it to the full range of what clustering offers you find a greater opportunity for cost savings - because one technique is covering multiple needs.

  7. False Dichotomy by bluffcityjk · · Score: 1, Informative

    Since when can every software solution be categorized as "proprietary" or "Linux-based"?

  8. Re:Queue... by TheRealMindChild · · Score: 2, Informative

    In my opinion, Beowulf is not the hammer everyone thinks it is. Ask the average slashdot reader even, and they relate Beowulf to something more like OpenSSI or Mosix... something you can easily add nodes to, and just use a special compiler to compile all of your multithreaded/multiproc apps and it will all work magically.

    If you are one of those people, stop. A Beowulf cluster is a performance cluster, but it is not a replacement for an SMP system. You more or less have the master node delegate actual computations EXPLICITLY in your application (EX "Hey... Node X, Calculate X + Y for me, kthx").
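    To make "delegate EXPLICITLY" concrete, here is a rough Python sketch. Threads stand in for cluster nodes purely so the example runs on one machine; a real Beowulf job would normally use a message-passing library instead. The function names `node_sum` and `master_sum` are made up for this illustration.

```python
# Illustrative only: threads stand in for cluster nodes.
from concurrent.futures import ThreadPoolExecutor


def node_sum(chunk):
    """Work done on one 'node': sum its slice of the data."""
    return sum(chunk)


def master_sum(data, n_nodes):
    """The master splits the data, hands each slice to a node, and combines the partial results."""
    chunks = [data[i::n_nodes] for i in range(n_nodes)]  # partitioning is written by the application author
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_sum, chunks))
    return sum(partials)


print(master_sum(list(range(1, 101)), n_nodes=4))  # 5050
```

    The point is that nothing here is automatic: the application decides how to split the work and how to merge the answers, which is exactly what an SMP box with a threaded program does not ask of you.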

    --

    "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
  9. No difference, just a matter of packaging. by TheMohel · · Score: 4, Informative

    Having built both true high-reliability fault-tolerant devices and clustered systems, I don't see any fundamental theoretical difference. In both cases, you have redundant hardware capacity in place, theoretically to allow you to tolerate the failure of a certain amount of your hardware (and, sometimes, your software) for a certain amount of time. Neither option guards you against failures outside of the cluster or FT system box. Neither one is a panacea. Both are sold as snake-oil insurance against "badness".

    In a single fault-tolerant box, you generally have environmental monitoring, careful attention to error detection, and automatic failover. You also have customer-replaceable units for failure-prone components, utilities for managing all of the redundancy, and a fancy nameplate. In exchange for that, you have more complexity, more cost, serious custom hardware and software modifications, and often (but not always) performance constraints.

    In a clustered system, you treat each individual server as a failure unit. Good fault detection is a challenge, especially for damaging but non-catastrophic failure, but it's much easier to configure a given level of redundancy and it's easier to take care of environmental problems like building power (or water in the second floor) -- you just configure part of the cluster a longer distance away.

    Where clustering is inadequate is when you have a single mission-critical system where any failure is disaster (like flight-control avionics or nuclear power plant monitoring). There are applications where there's no substitute for redundant design, locked-clock processors and "voting" hardware, and all of the other low-level safeguards you can use.
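    For anyone who hasn't met "voting" hardware, the idea can be shown in software with a small Python sketch. Real systems do this in lock-stepped hardware; the `vote` function below is only an assumed, simplified stand-in for a majority voter over redundant units.

```python
from collections import Counter


def vote(outputs):
    """Return the majority value among redundant units, or raise if there is no strict majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: units disagree")
    return value


print(vote([42, 42, 41]))  # 42, the single faulty unit is outvoted
```

    With three units, one can be wrong without anyone downstream noticing; a cluster failover, by contrast, usually only reacts after a node has visibly died.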

    For Web applications, however, where a certain sloppiness is tolerable, and where the advantages of load balancing, off-the-shelf hardware and software, and system administration that doesn't require an EE with obsessive-compulsive disorder all matter, clusters are the natural solution.

    The fact that you get to sell more licenses for the software is just gravy.

  10. Ignoramus by Donny+Smith · · Score: 4, Informative

    What you wrote is really ignorant (which, modded on /., translates to Insightful).

    1. (because I have yet to meet a clustering DB solution that didnt suck).

    Where do you live? In Rwanda?
    Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.

    2. one copy of Debian + Apache + MySQL + Perl or 200 copies

    MySQL isn't enterprise-reliable even in stand-alone configuration, let alone clustering. I can't believe this...

    3. And windows doesnt support clustering yet - in any decent way shape or form, I dont see the problem here.

    Hah, hah! Enough said.
    And also - what's it to you? If Microsoft (in your view) had a good clustering solution, you'd lose sleep over that?
    When you're biased like that, no wonder you can't have a quality, unbiased opinion on this topic.

  11. Re:Fault tolerant hardware is not the solution by TinyManCan · · Score: 2, Informative

    Unfortunately, for many reasons, Open Source does not end the cost of licensing for many organizations. Most of the good clustering solutions that I have seen recently involve breaking every application and service into a 'package' that can run on many different physical servers. Each package has a virtual IP address associated with it.

    When hardware fails, you bring up the required packages on a different physical host, and other applications access it using the virtual IP. Going this route allows you to do N+1 style clustering where say 3 servers are hosting 2 applications. This is a big win over the older model where each box had a physical duplicate that would step in when failure occurred.
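    A rough Python sketch of the package-plus-virtual-IP idea, using the "3 servers hosting 2 applications" case. Host names, package names and addresses are invented, and picking the least-loaded survivor is just one plausible placement policy, not something any particular cluster product is claimed to do.

```python
# Illustrative only: host names, package names and addresses are made up.
def fail_over(placement, failed_host, hosts):
    """Move every package off the failed host onto the least-loaded surviving host.

    placement maps package name -> (virtual_ip, host). The virtual IP travels
    with the package, so clients keep using the same address after the move.
    """
    survivors = [h for h in hosts if h != failed_host]
    if not survivors:
        raise RuntimeError("no surviving host")
    load = {h: 0 for h in survivors}
    for _vip, host in placement.values():
        if host in load:
            load[host] += 1
    new_placement = {}
    for package, (vip, host) in placement.items():
        if host == failed_host:
            host = min(survivors, key=lambda h: load[h])
            load[host] += 1
        new_placement[package] = (vip, host)
    return new_placement


hosts = ["host-a", "host-b", "host-c"]  # host-c is the "+1" spare
placement = {
    "db": ("10.0.0.10", "host-a"),
    "web": ("10.0.0.11", "host-b"),
}
print(fail_over(placement, "host-a", hosts))
# {'db': ('10.0.0.10', 'host-c'), 'web': ('10.0.0.11', 'host-b')}
```

    One spare covers either application, which is where the saving over "every box has a physical twin" comes from.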

    To use this style of clustering, you need to have excellent shared storage support, which has come in the form of SAN based disk arrays in all cases I have seen. The cost of software licensing aside, SAN equipment can cost an arm and a leg.

    For real, enterprise, supported applications you pay through the nose for the software, the hardware and then again for the support systems (HVAC, Power Conditioning and UPS, fault tolerant networking, SAN gear, Backup infrastructure, etc). It all costs, and it all has to be supported. Adding more machines (in the case of these clusters) increases the base overhead cost even before you get to the licensing.

    Providing reliable and functional enterprise services (the type that require clustering) is expensive, plain and simple.

  12. Re:It depends on what you want to do. by pete-classic · · Score: 3, Informative

    I worked in Dell server support from summer of '98 to summer 2000. I supported NT 4 HA clustering and I have to tell you, it was an unqualified nightmare.

    Since I was in support I didn't see a cross-section, I only saw the failures. That said, there were a LOT of installations out there that would have had better availability with a beige box, and MUCH better availability with a single fault-tolerant server.

    It didn't help that sales constantly sold invalid configurations and set unreasonable expectations.

    Bad, bad memories. If I never hear the word quorum again it will be too soon.

    -Peter

  13. For firewalls and/or routers by SquadBoy · · Score: 2, Informative

    There is nothing like OpenBSD running pf and carp. Dead easy to set up, works like a charm, and secure by default. One wonders why the editors seem to think OSS == Linux.

    http://www.openbsd.org/faq/pf/index.html
    http://www.openbsd.org/faq/faq6.html#CARP

    --

    Cypherpunks: Civil Liberty Through Complex Mathematics. Those who live by the sword die by the arrow.
  14. Re:It depends on what you want to do. by Jim+Hall · · Score: 5, Informative

    Let me preface this by saying I'm the Enterprise IT Manager for a large, Big-10 University. "Enterprise" means I am responsible for all servers that run the University, not just a small department. My userbase is 70,000+ students, and somewhere between 15,000-20,000 faculty and staff.

    We run a variety of hardware platforms, including a large Linux deployment. Yes, it really does depend on what you want to do with that server, before you can decide to go with a bunch of servers behind a load balancer v. a larger, fault-tolerant server.

    For our production web servers (PeopleSoft, web registration, etc.) we run a bunch of cheap servers running Red Hat Enterprise Linux, and we distribute them across two data centers (for redundancy.) We run a load balancer in front of them, so that users access one URL, and the load balancer automagically distributes traffic to the servers on both data centers. For a lightly-used application, we may only run 2 web servers. For heavily-used applications (web registration) we run 5 web servers. Those are IBM x-series now, but we are in the process of moving to IBM BladeCenters.

    With multiple servers in production, I can lose any single web server and not experience downtime on the application. We usually only have a single PSU in each server, because there's no point in the extra expense when we have redundancy at the server level. And because we've split our web servers across two data centers, I can actually lose an entire data center and only experience slow response time on the application. (Note to the paranoid: while the data centers are only 1.4 miles apart, they are on separate power grids, etc. The other back-end infrastructure is also split between data centers.) We run a lot of sites behind load balancers, so we can afford to have a separate load balancer pair at each site (which can provide backup to each other.)

    However, for large applications we may use a single fault-tolerant Linux server. For example, we used to do this with a database server. Multiple power supplies, multiple network connections, RAID storage, etc. To be honest, though, we tend to run databases on "big iron" hardware such as Sun SPARC (E25000, V890, etc.) and IBM p-series. We don't have any Linux database servers left, but that's not because Linux wasn't up to the task (our DBAs preferred to have the same platform for all databases, to make debugging and knowledge-sharing easier.)

    In a few cases, we have a third tier. If the application is low-priority (i.e. a development server) and/or low-volume (i.e. a web site that doesn't get much traffic), we run a single server for that. The server is a cheap IBM x-series box running Red Hat Enterprise Linux, usually with no built-in redundancy.

    Yes, for us Linux has been able to play along quite nicely with the "big iron" UNIX systems. We've run Linux at the Enterprise level since 1998 or 1999, and Linux is definitely considered part of our Enterprise solution.

  15. Absolutely right by TTK+Ciar · · Score: 3, Informative

    Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked.

    This is dead-on correct. For example, if a CGI hits a problematic state where it eats a lot of memory, putting the server into a state where it's swapping, then it takes longer to service each http transaction, which means more httpd transactions queue up, which means more memory gets allocated, which means more swapping .. rendering the machine useless for a little while (until a sysadmin or a bot notices the state and either restarts the httpd or kills a few select processes). If we were running this on one mammoth server with lots of redundant bits, then 100% of our web service capacity would be down in the interim. But since we run a pool of ten http servers under keepalived/IPVS, we only lose 10% of our capacity during that time.
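    The capacity arithmetic is simple enough to write down. A tiny Python sketch, assuming all servers in the pool are equal and that a swapping server contributes nothing (the function name `remaining_capacity` is invented here):

```python
def remaining_capacity(pool_size, failed):
    """Fraction of serving capacity left when `failed` of `pool_size` equal servers are unusable."""
    if pool_size <= 0 or not 0 <= failed <= pool_size:
        raise ValueError("failed must be between 0 and pool_size")
    return (pool_size - failed) / pool_size


print(remaining_capacity(10, 1))  # 0.9, pool of ten with one box swapping
print(remaining_capacity(1, 1))   # 0.0, one mammoth server in the same state
```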

    Other reasons I've traditionally preferred clustering: easy to incrementally scale up infrastructure (no big buy-in in the beginning to get the server which can be expanded), fully parallel resources (an independent memory bus, an independent IO bus, two independent CPU's, an independent network card, and a few independent disks for each server, as opposed to a mammoth shared bus on a leviathan crossbar, which will inevitably run into contention), and more flexibility in how resources are divided amongst mutually exclusive tasks.

    One of those reasons is getting less relevant -- point-to-point bus technologies like HyperTransport and PCI-Express are inexpensively replacing the "one big shared bus" with a lot of independent busses, transforming the server into a little cluster-in-a-box. It is a positive change IMO, and shifts the optimal setup away from the huge cluster of relatively small machines, and towards a more moderately-sized cluster of more medium-sized cluster-in-a-box machines.

    The price of licenses is, IME, rarely an issue (in my admittedly limited career -- I don't doubt that it's relevant to many companies) because the places I've worked for have tended to use primarily free-as-in-beer (and often free-as-in-speech) open source solutions. What is more of an issue, IME, is the necessity of staffing yourself with cluster-savvy sysadmins and software engineers. Those of that ilk tend to be a bit rare and expensive, and difficult to keep track of. It takes a distributed systems professional to look at a distributed system and understand what is being seen, and this makes it easy to bend the spec or juggle the schedule on the sly, or run skunkworks projects outright. By contrast, the insanely redundant, mondo-expensive uberserver was created and programmed by very smart hardware and software specialists so that your IT staff doesn't need to be so specialized. This makes useful talent easier to acquire, and understanding the system closer to the reach of mere mortals.

    Just my two cents
    -- TTK

  16. Well, let's see by Cyno · · Score: 2, Informative

    Sun:
    http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=83174

    For around $20,000 you could build a PC cluster that includes:
    20+ x Intel P4 D820 at ~$500 ea.
    20+ x AMD64 X2 3800+ at ~$750 ea.

    You could almost get a cluster of 40 Intel PCs, each with a dual-core chip running at 2.8 GHz. Or almost 30 AMD64 PCs, each with a dual-core chip running at 2.0 GHz. If you shop smart you can get gigabit ethernet on the motherboard and have a fault-tolerant / redundant system with over 10 times the performance of the Sun system.
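    For anyone who wants to check the arithmetic, a quick Python sketch using the per-PC prices quoted above. It assumes the whole $20,000 goes on the PCs themselves (no switches, racks or cabling), and the helper name `nodes_for_budget` is made up. At $750 each the AMD figure works out to 26 boxes.

```python
BUDGET = 20_000  # dollars, the rough figure from the post

# Approximate per-PC prices as quoted in the post.
PRICE_PER_NODE = {
    "Intel P4 D820": 500,
    "AMD64 X2 3800+": 750,
}


def nodes_for_budget(budget, price):
    """Whole number of PCs that fit in the budget at the given unit price."""
    return budget // price


for name, price in PRICE_PER_NODE.items():
    nodes = nodes_for_budget(BUDGET, price)
    print(f"{name}: {nodes} nodes, {nodes * 2} cores")
# Intel P4 D820: 40 nodes, 80 cores
# AMD64 X2 3800+: 26 nodes, 52 cores
```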

    I don't know about you, but I would take the cluster of AMD X2s. The Intels might beat 'em on price/performance, but the X2s might be a lil bit nicer to work on.

  17. Re:So the choice is between... by ScuzzMonkey · · Score: 3, Informative

    Mods! Wake up! How is this not +5 (either Funny or Insightful, I haven't decided which yet) already?

    --
    No relation to Happy Monkey
  18. Re:Google as an example by Thundersnatch · · Score: 2, Informative

    Google does not have to worry about ACID compliance in their database. From what I've read about the google file system, cluster nodes lazily share new data amongst themselves. Serving up old data is explicitly allowed.

    To cluster something like an OLTP database, every node has to be immediately informed about updates to the data, and they all have to report back that they have said data intact before the transaction commits. This can be something of a problem when you have hundreds of thousands of updates per second happening.

    And of course, you need to have a method to rapidly bring back into sync a server which has been out of commission for a while before it comes back online.

    The only way I've seen to do that is to have some sort of high-speed shared interconnect between nodes, and some cluster-awareness in the application to handle synchronization. That is currently very expensive, especially if some of your nodes are in California, and the rest are in Chicago.

    Shared-nothing clusters simply require high-speed interconnects for transactional applications. Data changes must be pushed everywhere, immediately, before the transaction commits. I don't see how you get around that.
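    The "everyone must acknowledge before the commit" rule can be sketched in Python. What follows is essentially a stripped-down two-phase commit, which is my choice of technique for the illustration; the post doesn't name one. The `Replica` class and `replicated_write` function are invented, and there is no network, logging or recovery here.

```python
# Illustrative only: in-memory replicas, no network, no crash recovery.
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}
        self.staged = None

    def prepare(self, key, value):
        """Stage the change and acknowledge, or refuse if this node is down."""
        if not self.healthy:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        self.staged = None


def replicated_write(replicas, key, value):
    """Commit only if every replica acknowledged the staged change."""
    acks = [r.prepare(key, value) for r in replicas]
    if all(acks):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False


nodes = [Replica("california"), Replica("chicago")]
print(replicated_write(nodes, "balance", 100))  # True, both nodes acknowledged

nodes[1].healthy = False
print(replicated_write(nodes, "balance", 50))   # False, the write is rolled back everywhere
```

    Every write waits on the slowest acknowledgement, so the round trip between sites sits directly in the transaction path. That is the cost being described above.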

  19. Re:It depends on what you want to do. by Mateito · · Score: 2, Informative

    This is not a case of "which is better", but a "what is right for what I want to do".
    There are "Best Practises" for doing this sort of thing that take the religion out of server-farm design.

    First thing to work out:
    (1) How many minutes of APPLICATION downtime are acceptable
    (2) How much money will I lose for each minute the application is down.

    Multiply (1) by (2), and you have a rough idea of your budget. Ideally, this should be the last thing - you work out your needs and then pay for them - and that was true five years ago. Today, IT budgets are a lot tighter, and the money often comes first. At least by taking this approach, you have a dollar value to present to the board to get funding approved before you spend a huge amount of time and effort putting together a proposal that will just be rejected.
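    As a worked example of that multiplication, in Python. The 30 minutes and $10,000 per minute are invented figures, and `availability_budget` is just a name for the product of (1) and (2).

```python
def availability_budget(acceptable_downtime_minutes, loss_per_minute):
    """Rough budget: acceptable downtime (minutes) times money lost per minute of downtime."""
    return acceptable_downtime_minutes * loss_per_minute


# Hypothetical figures: 30 minutes of acceptable downtime, $10,000 lost per minute.
print(availability_budget(30, 10_000))  # 300000
```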

    If this is only a few thousand dollars, you aren't about to rush out and buy Oracle RAC licenses. You don't need them. If you are going to lose tens of thousands of dollars per minute, you are going to go for big-iron servers running Oracle RAC and run a global cluster between HA data centers.

    From there, you can attack the design from the Top-down.

    You then look at your application, and work out how each component scales: Horizontally, vertically or diagonally (H+V).
    In general:
    - web servers don't need to be clustered as they aren't stateful
    - Databases scale vertically, though if you've got the money, Oracle RAC is an option. Once you get above 4 CPU cores in the cluster though, you need to go Enterprise edition, and this is expensive.
    - App servers may go horizontal or vertical, depending on the design of the app.

    Once you know how stuff scales, you can start working out what will run where. Some applications play nicely together, and can be combined. From there you can start to work out what OS's you need, and from that the hardware platform. Yes, I know this runs contrary to most people's design philosophies (i.e., choose the OS, then the app), but 90% of the time the app has dependencies that will limit the OS.

    Designing a data center should be as detached from personal preference as possible. The obvious link is that you don't spend huge amounts on one OS if all your in-house expertise is in another. But this is one of the last filters, not one of the first. It may be cheaper to roll out Solaris (for example) and hire or train to get the expertise, than it is to port the app to (say) Windows.