Slashdot Mirror


Clustering vs. Fault-Tolerant Servers

mstansberry writes "According to SearchDataCenter.com, fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supplies and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.

321 comments

  1. It depends on what you want to do. by ResQuad · · Score: 3, Interesting

    Personally I opt for clustering over fault tolerance, but that's my personal choice. It really depends on what the machine(s) will be doing. If you have a database server, go fault tolerant (because I have yet to meet a clustering DB solution that didn't suck). But if you're building a webserver, cluster.

    Also, the one thing the article mentions is that clustering is just as expensive as fault tolerance due to software licensing. Last I checked, whether it's one copy of Debian + Apache + MySQL + Perl or 200 copies, it's going to cost me the same price (free). And Windows doesn't support clustering yet - in any decent way shape or form - so I don't see the problem here.

    1. Re:It depends on what you want to do. by Anonymous Coward · · Score: 2, Insightful

      Heh. In order to do it completely right, you'd make a cluster out of fault tolerant nodes :-P

    2. Re:It depends on what you want to do. by TheRealMindChild · · Score: 2, Informative

      What is this then:

      http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx

      Clustering (NOT performance clustering mind you, which is NOT the topic at hand anyway) has been around in Windows NT as far back as I can remember. With NT4, you needed to have Enterprise Edition, but it was there.

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    3. Re:It depends on what you want to do. by Tenareth · · Score: 4, Insightful

      A Web farm is the simplest form of clustering, some would argue it isn't even a cluster because the nodes are not aware of each other. However, it gets more confusing when you add a Java layer that load balances...

      Anyway, I do agree that I've seen DB clustering solutions cause more trouble than they prevent...

      A cluster adds complexity to the environment, Complexity == Cost, even without the expensive software.

      --
      This sig is the express property of someone.
    4. Re:It depends on what you want to do. by CSHARP123 · · Score: 5, Informative
      And Windows doesn't support clustering yet
      Windows Server 2003 actually supports two different types of clustering. One is called network load balancing, which enables up to 32 clustered servers to run a high-demand application to prevent a single server from being bogged down. If one of the servers in the cluster fails, then the other servers instantly pick up the slack.

      Network load balancing has been most often used with Web servers, which tend to use fairly static code and require little data replication. If a clustered web site needs more performance than what the cluster is currently providing, additional servers can be instantaneously added to the cluster. Once the cluster reaches the 32-server limit, you can further expand the cluster by creating a second cluster and then using round-robin DNS to divide traffic between the two clusters.

      The other type of clustering that Windows Server 2003 supports by default is often referred to simply as clustering. The idea behind this type of clustering is that two or more servers share a common hard disk. All of the servers in the cluster run the same application and reference the same data on the same disk. Only one of the servers actually does the work. The other servers constantly check to make sure that the primary server is online. If the primary server does not respond, then the secondary server takes over.

      This type of clustering doesn't really give you any kind of performance gain. Instead, it gives you fault tolerance and enables you to perform rolling upgrades. (A server can be taken offline for upgrade without disrupting users.) In Windows 2000 Advanced Server, only two servers could be clustered together in this way (four servers in Windows 2000 Datacenter Edition). In Windows Server 2003, though, the limit has been raised to eight servers. Microsoft offers this as a solution to long-distance fault tolerance when used in conjunction with the iSCSI protocol (SCSI over IP).
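
      As a rough illustration of the failover behaviour described above, here is a minimal Python sketch in which the standbys probe the primary and the next live node is promoted after several missed heartbeats. It is not how Windows Server 2003 implements clustering; `FailoverMonitor`, `is_alive`, and `max_missed` are hypothetical names chosen for the example, and the probe is assumed to be supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverMonitor:
    """Promotes the next live standby after several consecutive missed heartbeats."""
    nodes: List[str]                      # nodes[0] starts as the primary
    is_alive: Callable[[str], bool]       # hypothetical health probe
    max_missed: int = 3                   # misses tolerated before failover
    missed: int = field(default=0, init=False)
    active: int = field(default=0, init=False)

    @property
    def primary(self) -> str:
        return self.nodes[self.active]

    def tick(self) -> str:
        """Run one heartbeat check and return the current primary."""
        if self.is_alive(self.primary):
            self.missed = 0
            return self.primary
        self.missed += 1
        if self.missed >= self.max_missed:
            # Try each standby in order, wrapping around the node list.
            for offset in range(1, len(self.nodes)):
                candidate = (self.active + offset) % len(self.nodes)
                if self.is_alive(self.nodes[candidate]):
                    self.active = candidate
                    self.missed = 0
                    break
        return self.primary
```

      Requiring several consecutive misses before promoting a standby is one simple way to reduce takeovers triggered by a single lost heartbeat, at the cost of a slower reaction to a real failure.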

    5. Re:It depends on what you want to do. by crimethinker · · Score: 4, Interesting
      Quoth the GP: "in any decent way shape or form"

      Yes, Windows has supported clustering since NT4 (Wolfpack), and per the GP, it SUCKED BOLLOCKS. I had to deal with that shite every damn day for almost 3 years (1997-2000). We used active-active failover, and the joke around the company was that MS were halfway there: the "fail" worked just fine.

      -paul

      --
      Pistol caliber is like religion: everyone has their favourite, and theirs is the only right choice.
    6. Re:It depends on what you want to do. by vengy · · Score: 1

      The best use (and most expensive) of clustering under Windows I've done is with SQL Server 2000. Active/Passive configuration only requires one copy of SQL Server Enterprise (MSRP is $20k but you can get it for well under $10k now), two Windows servers, and a nice storage array, and you've got a pretty decent SQL Server cluster solution from Microsoft. I'll admit that the install and setup isn't entirely easy to do from memory, but there are decent instructions online from both Microsoft and other websites.

    7. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      uh, windows clustering is crap, wake up mr.mcse! =p

    8. Re:It depends on what you want to do. by pete-classic · · Score: 3, Informative

      I worked in Dell server support from summer of '98 to summer 2000. I supported NT 4 HA clustering and I have to tell you, it was an unqualified nightmare.

      Since I was in support I didn't see a cross-section, I only saw the failures. That said, there were a LOT of installations out there that would have had better availability with a beige box, and MUCH better availability with a single fault-tolerant server.

      It didn't help that sales constantly sold invalid configurations and set unreasonable expectations.

      Bad, bad memories. If I never hear the word quorum again it will be too soon.

      -Peter

    9. Re:It depends on what you want to do. by TheRealMindChild · · Score: 1

      To be honest, I would have never brought it up if I had such issues. It was one of those things that ALMOST always worked great for me... and for any colleagues I spoke with who dealt with the same.

      It is possible, though, that I am in the minority.

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
    10. Re:It depends on what you want to do. by Jim+Hall · · Score: 5, Informative

      Let me preface this by saying I'm the Enterprise IT Manager for a large, Big-10 University. "Enterprise" means I am responsible for all servers that run the University, not just a small department. My userbase is 70,000+ students, and somewhere between 15,000-20,000 faculty and staff.

      We run a variety of hardware platforms, including a large Linux deployment. Yes, it really does depend on what you want to do with that server, before you can decide to go with a bunch of servers behind a load balancer v. a larger, fault-tolerant server.

      For our production web servers (PeopleSoft, web registration, etc.) we run a bunch of cheap servers running Red Hat Enterprise Linux, and we distribute them across two data centers (for redundancy.) We run a load balancer in front of them, so that users access one URL, and the load balancer automagically distributes traffic to the servers on both data centers. For a lightly-used application, we may only run 2 web servers. For heavily-used applications (web registration) we run 5 web servers. Those are IBM x-series now, but we are in the process of moving to IBM BladeCenters.

      With multiple servers in production, I can lose any single web server and not experience downtime on the application. We usually only have a single PSU in each server, because there's no point in the extra expense when we have redundancy at the server level. And because we've split our web servers across two data centers, I can actually lose an entire data center and only experience slow response time on the application. (Note to the paranoid: while the data centers are only 1.4 miles apart, they are on separate power grids, etc. The other back-end infrastructure is also split between data centers.) We run a lot of sites behind load balancers, so we can afford to have a separate load balancer pair at each site (which can provide backup to each other.)

      However, for large applications we may use a single fault-tolerant Linux server. For example, we used to do this with a database server. Multiple power supplies, multiple network connections, RAID storage, etc. To be honest, though, we tend to run databases on "big iron" hardware such as Sun SPARC (E25000, V890, etc.) and IBM p-series. We don't have any Linux database servers left, but that's not because Linux wasn't up to the task (our DBAs preferred to have the same platform for all databases, to make debugging and knowledge-sharing easier.)

      In a few cases, we have a third tier. If the application is low-priority (i.e. a development server) and/or low-volume (i.e. a web site that doesn't get much traffic), we run a single server for that. The server is a cheap IBM x-series box running Red Hat Enterprise Linux, usually with no built-in redundancy.

      Yes, for us Linux has been able to play along quite nicely with the "big iron" UNIX systems. We've run Linux at the Enterprise level since 1998 or 1999, and Linux is definitely considered part of our Enterprise solution.
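
      The load-balanced web tier described above can be sketched as a round-robin over whichever servers sit in a data center that is still up. This is only a toy model, not the actual load balancer configuration: the site names, server names, and the 3/2 split of the five web registration servers are assumptions made for illustration, and `healthy_round_robin` and `remaining_capacity` are hypothetical helpers.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Set


def healthy_backends(sites: Dict[str, List[str]], down_sites: Set[str]) -> List[str]:
    """List every server whose data center is still reachable."""
    return [server
            for site, servers in sites.items()
            if site not in down_sites
            for server in servers]


def healthy_round_robin(sites: Dict[str, List[str]], down_sites: Set[str]) -> Iterator[str]:
    """Rotate requests over the surviving servers, forever."""
    backends = healthy_backends(sites, down_sites)
    if not backends:
        raise RuntimeError("no healthy backends left")
    return cycle(backends)


def remaining_capacity(sites: Dict[str, List[str]], down_sites: Set[str]) -> float:
    """Fraction of the original server count that is still serving."""
    total = sum(len(servers) for servers in sites.values())
    return len(healthy_backends(sites, down_sites)) / total
```

      With an assumed layout of three servers in one data center and two in the other, losing the larger site leaves 40% of the capacity: the application stays up but responds more slowly, which is the outcome the post describes.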

    11. Re:It depends on what you want to do. by DJbeta_masta · · Score: 0, Redundant

      And Windows doesn't support clustering yet - in any decent way shape or form

      Ummm.. yes it does: http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx

    12. Re:It depends on what you want to do. by Donny+Smith · · Score: 5, Insightful

      Well, in case you haven't noticed, it's late 2005 now.
      Some things have changed, for example Windows 2003 Server came out and MSCS is now quite a decent HA solution.

      (BTW, the grandparent post didn't say that Microsoft's own clustering solution was lame, he made a general statement about all clustering software for the Windows platform).

    13. Re:It depends on what you want to do. by elrick_the_brave · · Score: 1

      Hmm.. Active-Active in a MS Cluster... the only unsupported configuration. Active-passive (or for that matter Active (as many) plus one passive) is always the configuration of choice. At any rate.. that's MS for ya!

      --
      (1st sig) If this were a snappy sig, you'd be reading it right now. (2nd sig) I'm a karma whore. >Insert FUD here
    14. Re:It depends on what you want to do. by gusmao · · Score: 1

      Actually, it depends mainly on the application. Telecom companies, for one, run their switching exchanges on fault-tolerant hardware/software that works simultaneously, processing the same data. This way, if one of the components (disk, processors, boards, etc.) fails, the mirror can take over immediately. However, those companies buy very specific hardware and software from (usually) a single vendor, which are necessary to provide their specific services. Of course vendors charge an obscene price for them, but in this case it is worth paying anyway, because a cluster solution would be very complex and is not available off-the-shelf.

      On the other hand, if you are not running an application that requires such specific hardware/software, a cluster may be a good solution. It is extensible, customizable, and you can choose from multiple vendors and technologies.

    15. Re:It depends on what you want to do. by Marillion · · Score: 4, Insightful
      That makes lots of sense. Software costs do multiply in clustering. Zero times 100 is still zero. But, clustering has other headaches beyond money.

      The usual clustering I've seen is "Hot Spare" clustering. The primary runs until it goes kaput, then the second takes over. For database clustering, the two boxes usually share the same disks. I think I've seen more outages from false takeovers by the secondary than real failures of the primary.

      The other problem with clustering is that all of your software applications have to be cluster tolerant. If the user app keeps a database connection open and a rollover occurs, the connection state doesn't and can't rollover with it. To a client system, a cluster failover looks like a server reboot. Don't underestimate the difficulty of this problem. A new application has to be designed with that in mind. Retro-fitting it in later is hard - and costly, even with free platforms.
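
      To make the "failover looks like a server reboot" point concrete, here is a minimal Python sketch of a client that drops its connection state and redoes the whole unit of work after a failure. It assumes the work is safe to repeat in full and that a dropped connection surfaces as `ConnectionError`; `run_with_reconnect`, `connect`, and `work` are hypothetical names, not part of any particular database driver.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       work: Callable[[object], T],
                       attempts: int = 5,
                       delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run work(conn); on a dropped connection, reconnect and retry from scratch."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            conn = connect()          # fresh connection; old session state is gone
            return work(conn)         # work must be safe to repeat in full
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off exponentially while the cluster finishes failing over.
                sleep(delay_s * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

      The hard part in a real application is not the loop but making `work` repeatable, which is why this is so much easier to design in up front than to retrofit.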

      Another issue that can't be solved with clustering is application failure or application limits. You may recall the airline system failure last Christmas? Some 80% of Slashdot readers asked where was the backup? (there was) should have used Unix (they were). The box (RS6000) and operating system (AIX) kept running just fine. A hundred computer cluster couldn't solve the real problem: the application couldn't handle the volume of information it was required to hold and they were at the mercy of a proprietary source code vendor.

      --
      This is a boring sig
    16. Re:It depends on what you want to do. by milimetric · · Score: 0

      "And windows doesnt support clustering yet"

      Oh shit! It doesn't? Cause we just clustered a Sql Server for our clients on Windows 2003 boxes. Boy, I'd hate to see the look on their face when I tell them that what we actually set up doesn't exist.

      FYI. I've been clustering Windows boxes and Sql Servers since 2000. Yes, it can be done, no it's not the prettiest thing in the world but neither are any other clusters. At my university, they set up a 256 node Windows cluster. Tell them it can't be done.

    17. Re:It depends on what you want to do. by Craptastic+Weasel · · Score: 1

      u may wanna look at www.sciencelogic.com for some nice monitoring solutions to go along with all that hardware... it's a hell of a tool

    18. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      Spoken just like someone who just got their MCSE. Have you ever tried implementing this? I know you haven't as you're posting on Slashdot, as opposed to having killed yourself out of frustration and shame.

    19. Re:It depends on what you want to do. by Anonymous Coward · · Score: 1

      > 256 node Windows cluster

            What application(s) run on that? Got a link?

    20. Re:It depends on what you want to do. by Stripe7 · · Score: 2, Insightful

      The choice between fault-tolerant systems and clusters is decided by the length of outage your company can sustain. A cluster can take 1-2 min to move applications from a dead node to another working one. If your applications require sustained 100% connectivity you need to go fault tolerant. Usually that's for realtime monitoring software, like the computers used to monitor telephone exchanges. For databases and NFS services clusters work better, as you can take a 1-2 minute hit in the response when a node fails. Software licenses do not come into it with active-active nodes, where you pay for all the CPUs you are running on. With active-passive failover, only 1 instance of your licensed software is running on your 2 systems. If your software vendor insists on your paying a license for both nodes then I would opt for an active-active configuration instead of an active-passive one.
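
      The "how long an outage can you sustain" question can be turned into simple arithmetic. The Python sketch below compares the yearly downtime from a given number of failovers against the budget implied by an availability target. The failure count and the target in the usage note are assumptions for illustration only; the post gives just the 1-2 minute failover time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a non-leap year


def availability(failures_per_year: float, minutes_per_failover: float) -> float:
    """Fraction of the year the service is up, counting only failover outages."""
    downtime = failures_per_year * minutes_per_failover
    return 1.0 - downtime / MINUTES_PER_YEAR


def allowed_downtime_minutes(target: float) -> float:
    """Yearly downtime budget, in minutes, for an availability target such as 0.99999."""
    return (1.0 - target) * MINUTES_PER_YEAR
```

      For example, a 99.999% target allows roughly 5.3 minutes of downtime per year, so an assumed four failovers at 2 minutes each (8 minutes) would already exceed it, while the same cluster comfortably meets a 99.99% target (about 52.6 minutes).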

    21. Re:It depends on what you want to do. by donaldm · · Score: 2, Interesting

      Most clusters are equivalent to DEC-safe (you can even get the source code on Freshmeat), which is mainly a group of machines joined together via a SCSI interconnect or a Storage Area Network and a common LAN. All interconnects should be redundant, and that includes the network. The only cluster that is different is the Tru64 cluster, which has a clustered file-system. I think Redhat clustering uses NFS (can anyone advise on this?) but you need a very fast network if you want disk performance.

      Fault tolerant is the most expensive option, such as the Himalaya machines, where nearly all components can be replaced while the machine is hot.

      The cluster is quite a reliable method of application availability. In the event of a cluster member failure the application failover to another cluster member should be relatively quick (about 1 to 5 minutes), however if an application takes say 25 minutes (I actually struck this once) to start then fail-over is going to take at least 25 minutes. Also your application should be capable of restarting and recovering from power off then on. If the application cannot do this then clustering is useless and you should be thinking fault tolerant machines or getting your Application vendor to fix the issue.

      PS. All clusters I have set up (Trucluster - Tru64 Unix) using Informix, Oracle, Sybase and SAP applications have worked extremely well.

      --
      There ain't no such thing as proprietary standards only proprietary formats. Standards are by definition open.
    22. Re:It depends on what you want to do. by killjoe · · Score: 1

      Where is the shared nothing clustering?

      --
      evil is as evil does
    23. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      I bet all of your billy bathgatesOS 'clustering' options are shared nothing (unless data sharing is tacked on at the database application level - Oracle RAC, Cache' ECP clusters, to list two examples). Now contrast your 'marketing salesman meeting level glossy brochure having computerworld press release MCSE clusters', to 'actual clusters' (OpenVMS), with all filesystems, including the system disk, fully shared. Wow.

      Going back to the article - it's based on a false premise. In a truly mission critical environment you should not be deciding between 'fault tolerant' (this term is used in the article to mean 'redundantly architected' rather than fault tolerance such as is provided by systems like Tandem NonStop) and 'clustered' (in the sense of failover, load-balancing, or simply having a warm standby system); you should be using both.

      And bill gates operating systems need not apply.

    24. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0
      And because we've split our web servers across two data centers, I can actually lose an entire data center and only experience slow response time on the application. (Note to the paranoid: while the data centers are only 1.4 miles apart, they are on separate power grids, etc. The other back-end infrastructure is also split between data centers.) We run a lot of sites behind load balancers, so we can afford to have a separate load balancer pair at each site (which can provide backup to each other.)
      If the web tier is split across 2 data centers, how do you manage the synchro of the back-end data?
    25. Re:It depends on what you want to do. by ckaminski · · Score: 1

      Um, active/active has always been supported.

      Active/Active with SQL Server or Exchange, however, has only recently been supported, and then only in the Enterprise versions.

    26. Re:It depends on what you want to do. by ckaminski · · Score: 1

      I've had issues with MSCS (homebrews and Dell/HP solutions) with mysterious service migration, and one instance where the whole thing got fubared because someone put a SQL TXN Log on the quorum disk, but on the whole I've not had issues with them.

      I think Dell hardware just sucks, and it figures that the cluster I manage now with this migration issue, and the previous one are both Dell configurations.

    27. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0
      What is this then

      Poo on a stick? Hey, you asked!

    28. Re:It depends on what you want to do. by moro_666 · · Score: 1

      >> :. It was one of those things that ALMOST always worked great for me...

      well, almost always just is not good enough. if you are involved in stockmarkets or banking, then failover must work 24/7/365 and not miss a second. imagine that you have to answer to your boss, who is pretty curious about where the record is of that 10mil$ transaction that "did" happen 5 seconds before the server raid decided to play "flames on" from the fantastic four movie. do you really think that "i don't know" is even an acceptable answer? nope ...

      what good is a failover if it tends to Just Fail sometimes? (Maybe we should create a JustFail(tm) server/cluster line that is the total opposite of failover systems?)

      it is as good as having an operating system that doesn't operate from time to time (i won't start windows bashing this time, :p)

      --

      I'd tell you the chances of this story being a dupe, but you wouldn't like it.
    29. Re:It depends on what you want to do. by muffdivr · · Score: 0

      According to the NCES stats, the highest enrollment among any University in the country is 57K. Apparently your University has at least 15,000 zombie students that you might want to purge from your highly redundant data centers 1.4 miles apart. Moron!!
      Let me preface this by saying I'm the Enterprise IT Manager for a large, Big-10 University. "Enterprise" means I am responsible for all servers that run the University, not just a small department. My userbase is 70,000+ students, and somewhere between 15,000-20,000 faculty and staff.

    30. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      notepad datacenter edition

    31. Re:It depends on what you want to do. by Jim+Hall · · Score: 1

      According to the NCES stats, the highest enrollment among any University in the country is 57K. Apparently your University has at least 15,000 zombie students that you might want to purge from your highly redundant data centers 1.4 miles apart. Moron!!

      I assure you, we do have that many students across the whole system. True, our largest campus (the whole system of our 4 coordinate campuses is probably not counted as a "single university") doesn't have more than about 45,000 students. But don't forget - we also have a number of part-time students.

      Yes, we do have about 70,000 students, counting full-time and part-time students together. Perhaps I should have been more specific about that in my original post.

      Our data centers process all web registration activity across all 4 coordinate campuses. We also run the PeopleSoft system (and other applications) for all 4 campuses. So I tend to view all 4 campuses as one, for what I do.

    32. Re:It depends on what you want to do. by TarrySingh · · Score: 1

      heh heh

      --
      Scott McNealy to Michael: "Suck my Sun!" Michael Dell to Scott : "Lick my Dell!"
    33. Re:It depends on what you want to do. by Jim+Hall · · Score: 1

      If the web tier is split across 2 data centers, how do you manage the synchro of the back-end data?

      Take web registration as an example; we have 5 web servers for that. When a student registers for a class, the web registration application submits the data to the PeopleSoft app servers for processing. We currently have 3 app servers (running on Sun hardware) across both data centers... soon to be 4. The app servers save all data into the database tier (also running on Sun hardware.)

      It's not like the web servers are also the database servers. Eventually, data has to get to a single point/tier.

      There are other parts of the architecture that I won't go into, because they are not Linux-specific (and would be off-topic.)

    34. Re:It depends on what you want to do. by muffdivr · · Score: 0

      Why in the world would anyone want to build their data centers 1.4 miles apart? Any decent Enterprise DR (Disaster Recovery) plan provides for Data Centers to be located logically as well as physically apart.

    35. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      I can second that it almost works in some situations. For our Exchange cluster, active/active was a complete joke. We worked directly with MS engineers for over a year before we both finally agreed that it did not meet our redundancy expectations or requirements, and we switched to active/passive. Now in active/passive it operates okay, but the failover is not always smooth, and under many conditions the active server can fail without causing an automatic roll (so we have to roll manually, which defeats the main purpose). I guess that is not a fault of the actual cluster but more of the server software itself.

      Funny thing though, one of the reasons that we even tried active/active was the advice of MS to save money, because we could use existing servers and would not have to buy something more powerful, you know, part of that lower TCO bullshit they claim. Well, that sales pitch went completely flat. Oddly enough, they now claim 2K3 and yet another upgrade of Exchange should allow us to go active/active "With no problems". We heard that before.

      Unfortunately, I am the only person without an MS background in the network engineering department; everyone else assumes failures like this are "normal" and part of the job. They also feel comfortable with the fact that they can separate themselves from the problems and walk away clean by blaming a vendor. The CEO calls us directly when mail is not working, and the first words out of anyone's mouth are "We have a call into MS/HP/some vendor and we are working with them on the problem". I guess the old sayings still hold true today: you can never get fired for choosing IBM or MS.

    36. Re:It depends on what you want to do. by Bahn · · Score: 1

      Not only was Enterprise Edition required for clustering back in the days of old, but it is still required with the new 2003 servers.

    37. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      Microsoft offers this as a solution to long-distance fault tolerance when used in conjunction with the iSCSI protocol (SCSI over IP).

      "WAN" clustering/majority node set clustering is possible with SAN replication as well. Not cheap, but nice.

      http://www.hds.com/pdf/wp179_storage_cluster_windows_2003.pdf#view=FitH&pagemode=bookmarks

    38. Re:It depends on what you want to do. by DJbeta_masta · · Score: 0

      What? Redundant?

      OK

      I should have written something like:

      "I for one, welcome our new Microsoft clustering overlords!"

    39. Re:It depends on what you want to do. by twiddlingbits · · Score: 1

      Aren't you running your Enterprise DBs on a SAN or NAS? You still got local RAID on the boxes? When you move to BladeServers are you going to let the BladeCenter do the balancing on the Blades in the box (and CPUs on a blade) or are you still going to use a front end load balancer? Ever thought of doing software load balancing using Websphere to spread a big Java application across several instances of Websphere on different boxes? You could load balance to the subnet of Websphere platforms and let Websphere do the rest.

    40. Re:It depends on what you want to do. by rebelcan · · Score: 1

      Either their PR guy reads a lot of slashdot, or is just clueless. From the ad on the front page:

      "Is your network managing you?"

      I think many /.-ers and Soviet Russians agree, the answer is yes.

      --
      God is dead -- Nietzsche
      Nietzsche is dead -- God
      Zombie Nietzsche lives! -- Zombie Nietzsche
    41. Re:It depends on what you want to do. by kd5ujz · · Score: 1

      It is a college, not a financial institution. If the school is hit by anything large enough to level 1.4 miles worth of buildings, school is probably not going to be in session. Now if you are doing online trading/banking and a hurricane hits the south east, people in the north west still want their money. If a school gets hit, the only people inconvenienced by a network outage are going to be the telecommuter students.

      --
      -William
      God is everything science has yet to explain.
    42. Re:It depends on what you want to do. by cloudmaster · · Score: 1

      What about those of us who have UMN set as the default mirror for our Sourceforge downloads? Darn it, we'll be inconvenienced too! :)

    43. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      I implemented both systems. NLB took 5 minutes to setup and works great. Clustering took a few hours and it works great as well. No problems for a year with 16 web servers and 2 data servers.

      Sorry to ruin your infantile anti-MS argument.

    44. Re:It depends on what you want to do. by Total_Wimp · · Score: 1

      A cluster adds complexity to the environment, Complexity == Cost, even without the expensive software.

      True, but with a good payoff. You can patch or otherwise modify one of the machines in a cluster without bringing your system offline. I know this is less important with Unix and Linux equipment, but even Linux admins can have bad days and screw something up during an upgrade. It happens. If it happens on the offline node of a cluster, you have some time to fix things while the user base is being nicely taken care of. If you only have a single box, even one with very nice redundant hardware, you're dead in the water.

      TW

    45. Re:It depends on what you want to do. by kd5ujz · · Score: 1

      I am so sorry, I almost forgot that when a school is leveled, the first thing they are going to worry about is your open source software.

      --
      -William
      God is everything science has yet to explain.
    46. Re:It depends on what you want to do. by Tet · · Score: 1
      The only cluster that is different is the Tru64 cluster which has a clustered file-system.

      Really? While DEC were certainly at the bleeding edge as far as clustering was concerned, there are now plenty of options for clustered filesystems. Veritas are making big money on their clustered storage management offerings. You could also look at GFS or OCFS. IIRC, DG/UX had a clustered filesystem, too.

      Don't get me wrong, TruCluster had some nice features, and is still probably the easiest cluster software I've met to work with. But it sucked in some ways, too (use of rsh was hardcoded, for example -- to the point where Compaq provided a patch to turn /bin/rsh into a wrapper around ssh because it was easier than fixing TruCluster to remove all the hardcoded references to rsh).

      --
      "The invisible and the non-existent look very much alike." -- Delos B. McKown
    47. Re:It depends on what you want to do. by pete-classic · · Score: 1

      I don't have a horse in this race. I've sold my stock, and I left the company feeling pretty lousy about the experience. (That can mostly be attributed to it being my first job in "Corporate America".)

      That said, I disagree about the quality of the hardware. I strongly disagree that any significant percentage of the clustering problems were attributable to the hardware.

      The biggest class of problems were sales related. "Sure you can cluster a 2300 and a 6300, each with a non-cluster PERC!" or "Sure, both servers can serve the same database at the same time!"

      Of the remaining problems the huge majority were some variation of the cluster config going wonky for no known reason. These were generally resolved by sending out a "hero kit" of parts that didn't resolve the problem, and then the customer (or less often an SE) rebuilding the cluster from scratch.

      -Peter

    48. Re:It depends on what you want to do. by ePhil_One · · Score: 1, Insightful
      Heh. In order to do it completely right, you'd make a cluster out of fault tollerant nodes

      Of course you do. Fault Tolerance is pretty cheap and straightforward. It costs me about a 10% premium on my Dell servers. However it does not buy me the ability to take a machine down for maintenance the way clustering does. If you're looking for serious uptime, fault tolerance is not going to get you there on its own.

      --
      You are in a maze of twisted little posts, all alike.
    49. Re:It depends on what you want to do. by Nefarious+Wheel · · Score: 1

      Clustering arrived on the scene with VMS, pre-dating WNT by several years. Spiffy thing, based on a cluster distributed lock manager. Worked just fine with relational databases, was fully symmetric, didn't suck.

      --
      Do not mock my vision of impractical footwear
    50. Re:It depends on what you want to do. by Craptastic+Weasel · · Score: 1

      rotfl.. not PR guy.. just a fan... i am a network admin myself.. and since my business is small business IT, this is really just a suggestion...

      as far as the Soviet Russian thing... hehehhe good eye

    51. Re:It depends on what you want to do. by Mateito · · Score: 2, Informative

      This is not a case of "which is better", but a "what is right for what I want to do".
      There are "Best Practises" for doing this sort of thing that take the religion out of server-farm design.

      First thing to work out:
      (1) How many minutes of APPLICATION downtime are acceptable
      (2) How much money will I lose for each minute the application is down.

      Multiply (1) by (2), and you have a rough idea of your budget. Ideally, this should be the last thing - you work out your needs and then pay for them - and that was true five years ago. Today, IT budgets are a lot tighter, and the money often comes first. At least by taking this approach, you have a dollar value to present to the board to get funding approved before you spend a huge amount of time and effort putting together a proposal that will just be rejected.

      If this is only a few thousand dollars, you aren't about to rush out and buy Oracle RAC licenses. You don't need them. If you are going to lose tens of thousands of dollars per minute, you are going to go for big-iron servers running Oracle RAC and run a global cluster between HA data centers.
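
      The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the "multiply (1) by (2)" rule of thumb; the function name and the dollar figures are made up for the example and don't come from any real deployment.

```python
def downtime_budget(acceptable_minutes, loss_per_minute):
    """Rough HA budget: tolerated downtime (minutes) times cost per minute."""
    if acceptable_minutes < 0 or loss_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return acceptable_minutes * loss_per_minute


# Hypothetical figures, purely for illustration.
small_shop = downtime_budget(acceptable_minutes=60, loss_per_minute=50)
big_retailer = downtime_budget(acceptable_minutes=5, loss_per_minute=40_000)

print(f"small shop budget:   ${small_shop:,}")
print(f"big retailer budget: ${big_retailer:,}")
```

      With numbers like these, the first case lands in "a few thousand dollars" territory and the second is where the expensive clustered options start to look justifiable.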

      From there, you can attack the design from the Top-down.

      You then look at your application, and work out how each component scales: Horizontally, vertically or diagonally (H+V).
      In general:
      - web servers don't need to be clustered as they aren't stateful
      - Databases scale vertically, though if you've got the money, Oracle RAC is an option. Once you get above 4 CPU cores in the cluster though, you need to go Enterprise edition, and this is expensive.
      - App servers may go horizontal or vertical, depending on the design of the app.

      Once you know how stuff scales, you can start working out what will run where. Some applications play nicely together, and can be combined. From there you can start to work out what OS's you need, and from that the hardware platform. Yes, I know this runs contrary to most people's design philosophies (i.e., choose the OS, then the app), but 90% of the time the app has dependencies that will limit the OS.

      Designing a data center should be as detached from personal preference as possible. The obvious link is that you don't spend huge amounts on one OS if all your in-house expertise is in another. But this is one of the last filters, not one of the first. It may be cheaper to roll out Solaris (for example) and hire or train to get the expertise, than it is to port the app to (say) Windows.

    52. Re:It depends on what you want to do. by UltimateRobotLover · · Score: 1

      Yeah, but apart from network load balancing, and general clustering... What has Windows clustering ever done for us?

    53. Re:It depends on what you want to do. by Craptastic+Weasel · · Score: 1

      damn near made me spit this mocha latte vente grande thingy out.. Great eye...

    54. Re:It depends on what you want to do. by muffdivr · · Score: 0

      Dude, losing a datacenter is more than a minor inconvenience, even for a school. Think credit card numbers, transcripts, etc. My question was that if someone takes the trouble to build out a new datacenter, why do it in such proximity? We just completed building out a third datacenter (I work for a Fortune 10 company) and one of the key considerations in deciding the location was geographical separation.

    55. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      It doesn't have to do anything except aggravate you!

    56. Re:It depends on what you want to do. by kd5ujz · · Score: 1

      I am talking about dynamic data. If a school is destroyed, it is not going to be generating any more data, and you can restore from your (hopefully) nightly incremental off-site backups (in a vault, further than 1.4 miles away). If a disaster hits a bank branch, the other banks are still going to generate data. With a school, after you rebuild, you can restore the data before more students are enrolled/charging books to their accounts.

      --
      -William
      God is everything science has yet to explain.
    57. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      This site is interesting. Is it supported by your group, and is it updated manually or automatically?

    58. Re:It depends on what you want to do. by cbreaker · · Score: 1

      A RAID card failure is hardly the fault of the OS, and a SCSI/SAN configuration that would have that completed transaction not be written to disk would also not be the fault of the OS.

      I don't blame Windows in many cases where the system fails. Windows 2003 might not be the fastest, and it's certainly not the cheapest, but it does work if you have good hardware. These Dell servers just suck, and everyone's using them now. I have retarded problems with all our Dell stuff and it often appears to be a software problem. It's really not. When I used high-end IBM x86 servers at my last contract the infrastructure was solid. All Windows, and it all worked. Clusters were solid.

      I don't know what they've got going on (cheap ass hardware that's not built very well is my guess) but the Dell stuff just stinks.

      --
      - It's not the Macs I hate. It's Digg users. -
    59. Re:It depends on what you want to do. by muffdivr · · Score: 0

      Good points - Lets hope they have their backups in a different location.

    60. Re:It depends on what you want to do. by sumdumass · · Score: 1

      I wasn't aware Windows supported clustering SQL on 2000. I hear 2003 does if you get the Enterprise edition. Do you know where I could get more info on it?

    61. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      We ran the 2005 UK general election reporting for the BBC out of a 2003 active/passive SQL cluster on HP hardware and it ran beyond perfect.

    62. Re:It depends on what you want to do. by Jim+Hall · · Score: 1

      Yup, we do off-site backups, and we've been working with one of our coordinate campuses (about 4 hours away, by car) to set up live data bunkering.

    63. Re:It depends on what you want to do. by milimetric · · Score: 1

      yeah, that's the tough part. Googling for the installation steps will give you a choice of step by step instructions. However, as with some other things, there's a certain amount of random magic involved in setting it up. I really can't give you a good source. I just used google and a compilation of other people's instructions, problems and solutions

    64. Re:It depends on what you want to do. by cloudmaster · · Score: 1

      You're aware that I was joking, right?

    65. Re:It depends on what you want to do. by LaCosaNostradamus · · Score: 1

      I worked in the "Digital Clusters for Windows NT" group in 1996/7. As time went on, it became obvious that we were working under the assumption that we would be shut down by Wolfpack (as we were). My theory is that this was due in no small part to the so-called partnership between DEC and MS, flavored by the "Dave Cutler Affair". But I also detected a particular lack of willpower on the part of the company to stand behind the development effort.

      This is all pretty pathetic given DEC's lead in clustering. They simply surrendered to MS, yet MS's clustering effort was itself poor. The end result was that current and potential customers were not served. Capitalism should have met their needs, but something beyond Capitalism was at work behind the scenes: corporate self-interest.

      So I see Linux rising now and terrorizing these companies to various degrees ... and I smile an evil little smile. While the American tech companies were playing massive corporate games at the expense of their customers and the public at large, some nerd from Finland was creating a corporation killer.

      --
      [You have a stable society when some nut guns down a schoolyard and the law doesn't change.]
    66. Re:It depends on what you want to do. by Kent+Recal · · Score: 1

      MSCS is now quite a decent HA solution

      You mean HA as in "HAHA"? Okay, I give you that.
      But if you're in for "high availability" (which, afaik, means at least four nines after the dot) you're most certainly not even considering MS products.
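
      For anyone who wants the "nines" turned into minutes, here is a small sketch. "Four nines after the dot" could be read as 99.99% or 99.9999%, so it prints several levels; the function name is made up for the example and the year length ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted over the period at a given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


for pct in (99.9, 99.99, 99.9999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% -> {minutes:.2f} minutes of downtime per year")
```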

      What good is your availability when the application fails randomly?

      I remember reading a MS-paper about the IIS ""HA""-solution (read again: "HAHA").
      It involved a cluster of at least 4 nodes (IIRC) so that there's always a node left to pick up service when one of the others failed. It was recommended (seriously) to reboot each node daily and there were detailed instructions about how to deal with "hung" nodes - which apparently was expected to happen.

      And, well, I have yet to see a piece of serious hardware that would even run the joke from redmond...

    67. Re:It depends on what you want to do. by CrazyJoel · · Score: 1

      man, in my experience, clustering is the root of most of the downtime in a Microsoft server cluster.

      --

      Such is the infinite Grace of Popeye.
    68. Re:It depends on what you want to do. by Anonymous Coward · · Score: 0

      OOOOOH, so THAT's how Tony Blair was elected!!! That explains EVERYTHING!

  2. lol at bannedtown by Anonymous Coward · · Score: 0

    lol at bannedtown
    --rucas

  3. I don't see why anybody would use their own server by jackcarter · · Score: 2, Funny

    I just use Geocities, it's free and easy!

  4. Queue... by 0110011001110101 · · Score: 0, Redundant

    the BeoWulf!!!

    --
    Don't anthropomorphize computers: they hate that.
    1. Re:Queue... by TheRealMindChild · · Score: 2, Informative

      In my opinion, Beowulf is not the hammer everyone thinks it is. Ask even the average slashdot reader, and they relate Beowulf to something more like OpenSSI or Mosix... something you can easily add nodes to, and just use a special compiler to compile all of your multithreaded/multiproc apps and it will all work magically.

      If you are one of those people, stop. A Beowulf cluster is a performance cluster, but it is not a replacement for an SMP system. You more or less have the master node delegate actual computations EXPLICITLY in your application (EX "Hey... Node X, Calculate X + Y for me, kthx").
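
      To make the "explicit delegation" point concrete, here is a toy sketch. On a real Beowulf cluster you would hand work out through a message-passing library (MPI is the usual choice); a local thread pool is only standing in for the nodes here so the example runs anywhere. The function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def add(x, y):
    """The unit of work a 'node' is asked to perform."""
    return x + y


def master(pairs, node_count=3):
    """Master explicitly hands each (x, y) job to a worker and collects the answers.

    Nothing is parallelised automatically: the application itself decides
    what gets sent where, which is the point being made above.
    """
    with ThreadPoolExecutor(max_workers=node_count) as nodes:
        futures = [nodes.submit(add, x, y) for x, y in pairs]
        return [f.result() for f in futures]  # results in submission order


results = master([(1, 2), (3, 4), (5, 6)])
print(results)
```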

      --

      "When life gives you lemons, don't make lemonade. Make life take the lemons back!" -- Cave Johnson
  5. Oh the irony by Anonymous Coward · · Score: 2, Funny

    It's slashdotted already.

    1. Re:Oh the irony by matr0x_x · · Score: 1

      Indeed... the only thing that comes to my mind is the saying: I Cluster Therefore I Am!

      --
      LINUX ONLINE POKER: Linux Poker
  6. Clustering for performance by Anonymous Coward · · Score: 0

    Clustering for performance. Redundant components for fault tolerance.

    1. Re:Clustering for performance by cbreaker · · Score: 0

      You really don't know what you're talking about do you?

      --
      - It's not the Macs I hate. It's Digg users. -
  7. And what's so difficult about... by Nuclear+Elephant · · Score: 1

    ...clustering a bunch of fault tolerant servers?

    1. Re:And what's so difficult about... by magarity · · Score: 1

      Budget?

    2. Re:And what's so difficult about... by MankyD · · Score: 4, Funny
      And what's so difficult about clustering a bunch of fault tolerant servers?
      Well that's just plain redundant. Err...
      --
      -dave
      http://millionnumbers.com/ - own the number of your dreams
  8. I know where this is going by Anonymous Coward · · Score: 2, Funny

    ...and i am just waiting on the call from our vendor recommending we upgrade to a cluster of fault-tolerant servers.

  9. Software vendors by PCM2 · · Score: 4, Insightful

    So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.

    --
    Breakfast served all day!
    1. Re:Software vendors by joelleo · · Score: 1
      So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.

      Ask a mechanic if it's better to get your oil changed every 5000 miles. He'll say "why, yes of course!" He'll make money on the deal, but it IS truly better for your car, dependent upon how you drive your car. Ok, so if a software vendor says it's better to have redundant copies of the software, guess what? It IS actually better, from a redundancy standpoint, if your application requires maximum availability!

      --
      "In the end, there is simply no weapon more devastating than the truth, delivered in just the right way." - tnk1
    2. Re:Software vendors by Bert64 · · Score: 1

      But having additional copies of software running on additional machines introduces a greater maintenance headache..
      Also the mechanic tells you to change the oil because if he lies and tells you not to change the oil in the hope that he'll get the work of repairing the car when it inevitably blows up, another mechanic could put him straight and take your custom. Software vendors on the other hand, are too complacent and arrogant.. They don't believe anyone will expose what they're doing and take their customers away.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  10. Fault tolerant hardware is not the solution by Anonymous Coward · · Score: 2, Insightful

    Hardware fails... it's as simple as that. You should plan on the fact that, for one reason or another, you will have to shut down and replace hardware. If it can be done with minimal or no disruption to the services, then that's all the better. Open source makes licensing no longer a problem.

    1. Re:Fault tolerant hardware is not the solution by TinyManCan · · Score: 2, Informative

      Unfortunately, for many reasons, Open Source does not end the cost of licensing for many organizations. Most of the good clustering solutions that I have seen recently involve breaking every application and service into a 'package' that can run on many different physical servers. Each package has a virtual IP address associated with it.

      When hardware fails, you bring up the required packages on a different physical host, and other applications access it using the virtual IP. Going this route allows you to do N+1 style clustering where say 3 servers are hosting 2 applications. This is a big win over the older model where each box had a physical duplicate that would step in when failure occurred.
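
      Here is a rough sketch of that package model, just to show the moving parts. The class names, host names and addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical; a real cluster manager also has to move the virtual IP, mount the shared storage and fence the dead node, none of which is modelled here.

```python
class Package:
    """An application plus the virtual IP that clients use to reach it."""

    def __init__(self, name, virtual_ip):
        self.name = name
        self.virtual_ip = virtual_ip


class Cluster:
    def __init__(self, hosts):
        self.hosts = set(hosts)
        self.placement = {}  # package name -> physical host currently running it

    def load(self, host):
        return sum(1 for h in self.placement.values() if h == host)

    def start(self, package, host):
        if host not in self.hosts:
            raise ValueError(f"unknown host: {host}")
        self.placement[package.name] = host

    def fail_host(self, failed):
        """Restart every package from the failed host on the least-loaded survivor."""
        self.hosts.discard(failed)
        if not self.hosts:
            raise RuntimeError("no surviving hosts")
        moved = []
        for name, host in sorted(self.placement.items()):
            if host == failed:
                target = min(sorted(self.hosts), key=self.load)
                self.placement[name] = target
                moved.append((name, target))
        return moved


# N+1: three servers hosting two applications, so one box is effectively spare.
web = Package("web", "192.0.2.10")
db = Package("db", "192.0.2.11")
cluster = Cluster(["host-a", "host-b", "host-c"])
cluster.start(web, "host-a")
cluster.start(db, "host-b")

moved = cluster.fail_host("host-a")
print(moved, "clients still connect to", web.virtual_ip)
```

      The virtual IP never changes, so clients don't need to know which physical box picked the package up.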

      To use this style of clustering, you need to have excellent shared storage support, which has come in the form of SAN based disk arrays in all cases I have seen. The cost of software licensing aside, SAN equipment can cost an arm and a leg.

      For real, enterprise, supported applications you pay through the nose for the software, the hardware and then again for the support systems (HVAC, Power Conditioning and UPS, fault tolerant networking, SAN gear, Backup infrastructure, etc). It all costs, and it all has to be supported. Adding more machines (in the case of these clusters) increases the base overhead cost even before you get to the licensing.

      Providing reliable and functional enterprise services (the type that require clustering) is expensive, plain and simple.

  11. So the choice is between... by Anonymous Coward · · Score: 3, Funny

    tolerating a lot of faults in one girlfriend or get a cluster of them and deal only with the good points?

    1. Re:So the choice is between... by mickwd · · Score: 5, Funny

      It depends how often they go down.

    2. Re:So the choice is between... by TheRaven64 · · Score: 1

      If by `girlfriend,' you mean `computer' (which is probably a valid assumption here), then yes.

      --
      I am TheRaven on Soylent News
    3. Re:So the choice is between... by ScuzzMonkey · · Score: 3, Informative

      Mods! Wake up! How is this not +5 (either Funny or Insightful, I haven't decided which yet) already?

      --
      No relation to Happy Monkey
  12. Not the same. by tekn0lust · · Score: 5, Informative

    Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked.

    1. Re:Not the same. by Anonymous Coward · · Score: 1, Interesting

      You're still not safe with clustering if you share data. I once worked on a SQL Server cluster with shared disks. SQL Server would crash because a database page contained crap data. The system would then take 10 minutes to fail over to another node. Once it was running, it would read the same page and crap out, causing the other node to come back up. Lather, rinse, repeat.

    2. Re:Not the same. by maverick97008 · · Score: 1

      Especially useful for upgrading software. In most cases you can upgrade 1 node at a time and have complete up-time. Software is becoming the real problem with 100% up-time for my company.
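
      A bare-bones sketch of the one-node-at-a-time idea, assuming a hypothetical upgrade() callback that patches a single node. Real tooling would also drain connections and health-check the node before putting it back; this only shows the ordering and the "never drop below N nodes" guard.

```python
def rolling_upgrade(nodes, upgrade, min_in_service=1):
    """Upgrade nodes one at a time, refusing to drop below min_in_service."""
    in_service = set(nodes)
    for node in nodes:
        if len(in_service) - 1 < min_in_service:
            raise RuntimeError("not enough nodes left to keep the service up")
        in_service.remove(node)   # take the node out of rotation
        upgrade(node)             # patch it while the others carry the load
        in_service.add(node)      # put it back before touching the next one
    return in_service


upgraded = []
still_up = rolling_upgrade(["node1", "node2", "node3"], upgraded.append)
print(upgraded, sorted(still_up))
```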

    3. Re:Not the same. by elBart0 · · Score: 1

      but, if you're running Oracle RAC, with multiple servers sharing the same filesystem on a SAN (which has its own level of replication, as well), the crash of a single system is invisible to the end-user. I've seen this work in both controlled tests, and 'oh shit a node just crashed' scenarios.
      I haven't worked with SQL Server in a while, but I have to hope it's gotten better, since then.

      --
      09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
    4. Re:Not the same. by flatass · · Score: 1

      Unless you are running NSK on a Tandem system that is.

    5. Re:Not the same. by djdavetrouble · · Score: 1

      good reference. pity only a few know what you are talking about.

      --
      music lover since 1969
    6. Re:Not the same. by Anonymous Coward · · Score: 0

      The point is, if your database is corrupted somehow on a shared filesystem, it doesn't matter how many redundant nodes you have, they will all be dead in the water.

    7. Re:Not the same. by Bert64 · · Score: 1

      Well, the less reliable the database software you use, the greater the chance of corruption... High end databases like Oracle will try to repair any corruption they find, and often do a pretty good job..
      If you run a buggy database that corrupts its own data, and then can't handle being fed corrupt data without crashing.. Well, you made a poor choice.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  13. Since information wants to be free by Shadow+Wrought · · Score: 4, Funny

    Shouldn't we be encouraging server failures which enable their freedom from magnetic imprisonment? Kinda like PETA freeing lab animals...

    --
    If brevity is the soul of wit, then how does one explain Twitter?
  14. More about the cost of hardware? by Sv-Manowar · · Score: 4, Interesting

    Because of the open source stack behind a lot of server platforms these days, I'm dubious that this decision boils down simply to a software cost issue. One major benefit of using clustering is that many white box, non-specialized machines can be used, which are easier & cheaper to replace or obtain components for. Complex and specialized hardware with built-in redundancy is often expensive and can require vendor support contracts for effective maintenance.

  15. one other option by hatch815 · · Score: 1

    they forgot the third option - cheap servers run in single tier fashion. If one dies, you just swap it out, and then build another for emergencies. Granted there is some down time, but it works as a good low-end solution

    1. Re:one other option by cerelib · · Score: 1

      bzzzz Wrong. The companies and groups that need this kind of performance would laugh at you if you proposed a solution and ended it with "granted, there is some down time".

    2. Re:one other option by Anonymous Coward · · Score: 0

      The primary purpose of both fault tolerance and clustering is uptime. Your solution sucks donkey dicks big time.

    3. Re:one other option by Major+Blud · · Score: 0

      I totally agree with Cyric. A generic white-box may work for a one or two server cluster, but not for time-critical systems. Besides, building a white-box with the kind of components you would need is still going to be in the tens-of-thousands of dollars....considering you will need stuff like Hot-Swappable Drives, Fiber HBA's, Redundant Power-supplies, dual (if not quad) processors, Gigs of memory...stuff that is very expensive and not-so-easy to find...basically vendor specific. With that kind of money being spent, it pays to just buy the vendor's package.

    4. Re:one other option by turbidostato · · Score: 1

      "The companies and groups that need this kind of performance would laugh at you if you proposed a solution and ended it with "granted, there is some down time"."

      That's exactly why PHBs will prefer to believe that pretty colourful brochure that says "just buy our solution and you will reduce downtime to zero". Of course, the brochure lies: there *are* downtimes, no matter what, and anyone that doesn't end his proposition with "granted, there is some down time" is simply a liar. It is about how much insurance you want to buy for how much money, but at the very end, granted, there will be some down time.

    5. Re:one other option by cerelib · · Score: 1

      Okay, go to the head of a data storage center for Wal-Mart or Amazon and propose a data storage solution and end the proposition with "granted, there will be some down time". See how they respond. I actually work on commercial DAS Disk systems and they are designed for 100% uptime. No matter what you need to do to the box, you are supposed to be able to run I/O through the whole process. This is no small operations junk, downtime means large losses to some of these companies.

    6. Re:one other option by turbidostato · · Score: 1

      "Okay, go to the head of a data storage center for Wal-Mart or Amazon and propose a data storage solution and end the propostion with "granted, there will be some down time". See how they respond"

      I *know* how they'll respond. Still, I'm sure some companies based in the Twin Towers were sold 100% uptime solutions that didn't live up to their claims on 9/11, and almost none of them would withstand a clever, coordinated sabotage plan from inside the company (not to mention the almighty Total Nuclear War Scenario, of course).

      There is NO 100% uptime no matter what, and EVERY solution should end with a "granted, there will be some downtime", for downtime there WILL be. It is a question of how much time "some" means, what the affordable risk is and how much money it is reasonable to spend on it. Only when a salesdrone meets a moron PHB does something like "100% uptime guaranteed" make sense.

      "I actually work on commercial DAS Disk systems and they are designed for 100% uptime"

      See what I said about salesdrones and PHBs.

  16. Clustering by FnH · · Score: 2, Insightful

    Clustering provides a backup for software failures, that fault-tolerant servers don't. Also, upgrades without downtime are easier done with a load-balanced cluster.

  17. Apples and Oranges by Steven_M_Campbell · · Score: 4, Funny

    If you are just talking about fault tolerance (FT) then spill a drink on the FT server then spill a drink on a clustered server and see the difference :) If we are not limited to fault tolerance then try load balancing an FT server with.. um..er... itself. This is really apples and oranges. BTW, I like FT servers in a cluster!

  18. Why are clusters better? by darkmeridian · · Score: 2, Interesting

    The article seems to make the choice one-sided. Fault tolerant servers have higher uptimes because the backup takes over immediately. Clusters have a single point of failure in the middleware. They argue that the clusters can run different operating systems, but that means more patches and updates to keep track of. Clusters are expensive because they need more OS and software licenses and require a lot of maintenance, though that might drop if they are running Linux or FreeBSD.

    Anyone make a case for clusters for high-uptime situations?

    --
    A NYC lawyer blogs. http://www.chuangblog.com/
  19. Why Linux based? Why not Open Source based? by Anonymous Coward · · Score: 0

    Subject says it all.

  20. Linux Clusters rock! by Anonymous Coward · · Score: 0

    I choose a Linux Cluster every time. Linux is absolutely fabulous at clustering. Personally I much prefer to trust a cluster of servers over one single server, no matter how good, ANY DAY.

  21. You should use both by Barondude · · Score: 3, Insightful

    If HA is what you are really after, you should use both. You want a fault tolerant server so you never have to go down unexpectedly and you want a fail over node so if the unexpected occurs, you'll be back up in a jiffy.

    --
    "That's the sort of blinkered, philistine pig ignorance I've come to expect from you non-creative garbage."-Monty Python
  22. Clustering is safer by arcadum · · Score: 2, Interesting

    If you buy one machine, you still may need to power it off to open the case, or replace a part.

    1. Re:Clustering is safer by magarity · · Score: 1

      you still may need to power it off to open the case, or replace a part.
       
      If you're willing to lay out the cash, you can get a server that will let you swap out bad cards, memory, and even CPUs while the thing is running without missing a beat.

    2. Re:Clustering is safer by Tenareth · · Score: 1


      I don't know, I've replaced almost every component on my production servers without taking them down... Of course, that's a $1Million server...

      --
      This sig is the express property of someone.
    3. Re:Clustering is safer by schon · · Score: 1

      if you buy one machine, you still may need to power it off to open the case, or replace a part.

      I think you don't quite understand the concept of "fault tolerant servers".

      The entire point of a fault-tolerant server is that you don't have to power it off to open the case or replace a part.

    4. Re:Clustering is safer by Anonymous Coward · · Score: 0

      If you buy one machine, you still may need to power it off to open the case, or replace a part.

      I know Sun has several machines where you can swap out just about anything you want (P/S, memory, CPU, PCI boards) while keeping the OS running.

    5. Re:Clustering is safer by Anonymous Coward · · Score: 0

      Awesome! Thanks for your valuable post, Dingus.

  23. Fault Tolerant Hardware can be cheap by Anonymous Coward · · Score: 0

    One advantage of the dot com bubble-burst is that you can find good hardware inexpensively on e-bay. Do a search on "Sun Enterprise". Machines that sold for $100K a few years ago can be had for less than $2000.

  24. It all depends by Anonymous Coward · · Score: 2, Insightful

    Fault tolerant systems are all in one physical location.
    Clusters can be in different server racks, building, city even country.

    It depends what the goal is. Fault tolerance, scalability, disaster recovery, etc.

    They both have their uses, let's not discount one or the other, just use them properly.

    **Typically, the goal is a mix of the ones I enumerated, hence I typically choose clusters. However, I always re-evaluate every time a new requirement comes in.

  25. Microsoft Windows Server DOES support clustering by MLopat · · Score: 1

    In fact, Microsoft Windows has supported clustering for quite some time. At least the better part of seven years as it was available on Windows NT Server 4.0.

    If you want to see the latest Microsoft offering on clustering services, check out this site http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx

  26. Clustering is really... by Shads · · Score: 1

    ... the better technology IF space isn't an issue.

    If you've got the space for the extra servers clusters are great, if you don't have that kind of excess space then fault tolerance is top of the mark.

    --
    Shadus
  27. Consider Maintenance and Personnel Requirements by LazloToth · · Score: 1

    Clusters have a reputation for needing a lot of upkeep. Windows dudes say that Microsoft clusters are a royal pain to maintain. Fault-tolerance in servers, on the other hand, is known by almost everyone to be a good or excellent investment, regardless of the OS platform. If you have a hard time holding onto admins for more than a couple of years, you'd have to consider whether clustering is a good choice. But then, I come from a network of only 275 users. Still, we've never considered clusters. Redundancy is where we put our money.

    --


    It's only funny until someone gets hurt. Then, it's hilarious.
    1. Re:Consider Maintenance and Personnel Requirements by Anonymous Coward · · Score: 0

      ...And all this time I thought high availability clusters ARE redundant

      Silly me

    2. Re:Consider Maintenance and Personnel Requirements by Amouth · · Score: 1

      I agree with this. Here we only have about 50 users and I am the only admin, and yes, I have a habit of overbuilding. I run six servers, all with FT drives and network cards (the most common failure points). They are also set up as a cluster so that services are provided by at least 2 of the six. If I have a drive die I don't have to jump on it; I can take my time because everything is still going, and I can take a server offline and no one will notice. It makes it nice to not get support calls. Sure, I had to spend a little more money per box, but I never liked the idea of plain whitebox clusters: if I have a drive die I want to replace it, not replace it and reload everything so that it can get back to work. It also makes backups nice because I only have to back up one server (the one that has all services), but even if that server is offline no one will notice. Cost per box is about $800. I think it is a nice middle ground and takes the advantages from both.

      --
      '...if only "Jumping to a Conclusion" was an event in the Olympics.'
  28. Not to be too snide... by Bill+the+Cat · · Score: 1

    ...but my users and my bosses don't care much what searchdatacenter.com has to say about the situation, in the event hardware failure takes down a critical application.

    If the people that pay me are willing to invest in the extra HW and SW to make a critical app available, then we do it.

  29. Re:I don't see why anybody would use their own ser by interiot · · Score: 1

    brilliant! just brilliant.

  30. Not either/or by Declarent · · Score: 5, Interesting

    I build AIX HACMP clusters for a living, and I'll tell you that you should *never* use an either/or approach, as TFA suggests. Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.

    Maybe if they want to write an article, they should spend some time in the real world and see how the HA industry works instead of making up some arbitrary demarcation line to hang a preconception on.

    1. Re:Not either/or by Anonymous Coward · · Score: 0

      Google supposedly uses clusters of the cheapest shitboxen they can find.

    2. Re:Not either/or by Declarent · · Score: 2, Insightful

      That's true, it's a massively distributed app. In every class of solution, there are extreme cases for which the rule does not apply. Those cases do not change how the average customer does business.

    3. Re:Not either/or by sapbasisnerd · · Score: 2, Interesting
      What Google does barely deserves the label clustering.

      Actually that's not really fair, the problem is the term clustering has become overloaded. What Google does would be more completely described as "shared nothing" distributed computing. They use cheap as chips iron because nobody cares if a transaction fails, because no data is lost, the end user just pushes refresh. Similarly the various grid compute "clusters" (SETI, Folding@Home etc.) can recover from a lost unit of work by sending it out for reprocessing after a timeout (or IIRC SETI doesn't wait, every unit of work is sent out multiply and the results that do come back are compared).

      If on the other hand you are dealing with applications that actually save data, silly little things like, oh, electronic funds transfers or credit card charges, that's a whole different class of problem.
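
      The "send every unit out multiply and compare what comes back" idea can be sketched roughly as below. This is only an illustration under stated assumptions: the names accept_result and process_unit, the quorum parameter, and the convention that a lost or timed-out worker returns None are all hypothetical, not how SETI or any other project actually implements it.

```python
from collections import Counter

def accept_result(results, quorum=2):
    """Return the value reported by at least `quorum` workers, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

def process_unit(unit, workers, quorum=2):
    """Send one work unit to every worker and compare the answers.

    Assumption: a worker that is lost or times out returns None,
    and that answer is simply ignored.
    """
    returned = [r for r in (w(unit) for w in workers) if r is not None]
    return accept_result(returned, quorum)
```

      If too few matching results come back, process_unit returns None and the caller would send the unit out again; nothing is lost because the unit itself is never consumed.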

    4. Re:Not either/or by blowdart · · Score: 1
      Indeed, and I've done the same for streaming media systems. A bunch of cheap 1U servers, with content on at least 3 boxes, and an intelligent redirection layer which would send the user to the least loaded server with the content on, nearest to them.

      But you wouldn't want to do that with a database server. It's horses for courses.
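
      A redirection layer of the kind described above might look something like this minimal sketch. The field names (region, load, content) and the rule "same region first, then lowest load" are assumptions made for illustration; the original system's actual selection logic is not given.

```python
def pick_server(servers, content_id, user_region):
    """Pick a server that holds `content_id`.

    Assumed rule: prefer servers in the user's region, then take the
    least loaded one. Returns None if no server has the content.
    """
    candidates = [s for s in servers if content_id in s["content"]]
    if not candidates:
        return None
    # False sorts before True, so same-region servers win the comparison.
    return min(candidates, key=lambda s: (s["region"] != user_region, s["load"]))
```

      Because each piece of content sits on at least 3 boxes, losing one box just shrinks the candidate list instead of making the content unavailable.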

    5. Re:Not either/or by Anonymous Coward · · Score: 0

      Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.

      Amen to that. Wear both a belt and suspenders!

      Unless you've got a really compelling reason to cluster, stick with fault tolerance. One big, strong server is easier to manage than several wimpy ones.

      I can't remember who said it, but it's a lot easier to plough a field with one ox than 1024 chickens.

    6. Re:Not either/or by snow_man · · Score: 1

      i build/maintain aix & sunOS clusters too and i've found that most organizations don't understand their business requirements very well much less their technical options. frankly, i've found that tossing the various buzz words around doesn't do any good in determining those requirements either. generally i've found that working backwards through business continuity plans can help determine the "best" path for most organizations.

      .

      --
      i am snow. fear me.
    7. Re:Not either/or by swordgeek · · Score: 1

      I've never really dug too far into Google's architecture, but I always assumed it was some form of grid computing. Is this correct, because I've never really associated grids with clustering.

      --

      "People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
    8. Re:Not either/or by adam872 · · Score: 1

      Seymour Cray would be the person you would be thinking of. I also agree about a large number of small machines, versus a small number of large machines. I prefer the latter for system maintenance alone. Reduced complexity means fewer things can go wrong, but the load balanced cluster of small systems has its place too, like in a web server or MTA farm (behind a load balancer like the F5 BigIP).

    9. Re:Not either/or by prefect42 · · Score: 1

      Grid is a horrible term that means relatively little even within the grid computing field. Grid is all about virtualisation, and complex authorization/authentication, something that google really doesn't have to worry about that much.

      --

      jh

  31. Flexibility by div_2n · · Score: 1

    A large and fully redundant fault tolerant server is more flexible. Use virtualization and have many reliable servers of many different operating systems in one unit as opposed to a highly specialized cluster.

    For certain tasks, clustering will certainly offer a performance advantage from a scalability standpoint. Yet a fully fault tolerant hardware system like those from Stratus offers just a touch more reliability than a fault tolerant software system.

  32. Re:I don't see why anybody would use their own ser by zmokhtar · · Score: 1

    Yes, but what does geocities use?

    --
    Why aren't we told when editors moderate our posts?
  33. Solution by Anonymous Coward · · Score: 1, Informative

    Just go with fault tolerant clusters.

  34. Never build systems on a core of failure. by CyricZ · · Score: 2, Informative

    That's one of those ideas that sounds all good and well, but it hardly works in practice. In many cases, downtime is unacceptable. You need transactions processed continually, and you cannot have downtime caused by a dead server.

    It is not a good idea to build a system out of parts that you know will fail, and then proceed to design the system around such failure. A far better idea is to spend some money, and design a system that will work. Of course you do take into account hardware failure, and you build in redundancy where necessary. But you do not build your solution around knowingly faulty and cheap hardware. That's just looking for trouble.

    Often times the "cheap" solution ends up being most expensive, not only because of the cost of repeated hardware repairs, but also because of the cost of the labour necessary to perform the repairs, and the possibility of downtime. When you're processing millions of dollars worth of transactions per minute (if not per second), even a couple of minutes of downtime can be financially costly.

    --
    Cyric Zndovzny at your service.
    1. Re:Never build systems on a core of failure. by dkleinsc · · Score: 4, Informative

      Most successful strategies I've heard of involve building a system out of parts that you know can't fail, and then designing the system around the failure of the parts that you know can't fail.

      --
      I am officially gone from /. Long live http://www.soylentnews.com/
    2. Re:Never build systems on a core of failure. by Hairy1 · · Score: 1

      The problem is that all hardware will fail eventually, so we have no alternative but to create systems from such. My approach to high availability is to ensure that a failure of any two servers won't bring down the system. There are still some areas I have trouble with, distributed databases being one. For database servers I still prefer fault tolerant servers, and to have a backup server which can be quickly brought up. The application servers are all just standard builds. The other point of failure is the load balancer, although I suspect this could also be mitigated.

      By the way, anyone had success with Apache and JK load balancing across multiple Tomcat web servers?

  35. The Good, The Bad, and The Ugly by flinxmeister · · Score: 2, Insightful

    The Good: Using cheap components in a cluster to create scalability at a good value
    The Bad: Using a cluster to cover up coding issues, architectural crap, or instabilities in the system
    The Ugly: "the bad" gets so bad that it crashes the whole freakin' cluster. Why did we do this again?

  36. Clustering Potentially Solves More Problems by bradm · · Score: 4, Informative

    Fault tolerance gets you a machine that keeps running in the face of hardware failures and maintenance. The switchover time is arguably negligible.

    Clustering gets you a set of services that keep running in the face of hardware failures and maintenance. The switchover time can range from negligible to huge depending on the application involved.

    However, clustering also helps you to solve other problems, including scaling, software failures, software upgrades, A-B testing (running different versions side by side), major hardware upgrades, and even data center relocations.

    Clustering tends to require a lot more local knowledge to get right.

    So if you narrow the problem definition to hardware only, they solve the same class of problems. But when you broaden it to the full range of what clustering offers you find a greater opportunity for cost savings - because one technique is covering multiple needs.

  37. False dichotomy by afidel · · Score: 1

    If you are going to go so far as to pay for redundant everything hardware you probably want to buy at least a pair of them and put them in a cluster. I know very few places where the demands are such that they would buy a single super expensive server and NOT have a cluster to allow for things like software upgrades.

    --
    There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  38. This brings to mind Google's strategy. by ShyGuy91284 · · Score: 1

    I remember reading a few years ago that something that made Google's system different than the competition was that they preferred to have a cluster of very cheap hardware (so if one died, another would take over its job) instead of a single expensive hardware configuration that had its own backup system built in. Not sure if they continue to work with this philosophy though. They seem to have proven that it works successfully though (Google has never given me any problems).

    --
    In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
    1. Re:This brings to mind Google's strategy. by Anonymous Coward · · Score: 0

      If it did.. how would you know?

      It's google.. if it misses out on 50% of the articles you're searching for.. you still get 50% and are happy with that result, completely ignorant to the fact it's missing a ton..

    2. Re:This brings to mind Google's strategy. by mcewen98 · · Score: 5, Interesting

      According to a presentation that I recently attended given by Jim Reese, the guy who scaled google from a couple hundred servers to over 300,000, this is still true. It was a very interesting presentation and included discussion about the problems with cramming 80 pc's into a standard server rack... including heat, cable management, machine replacement.. etc.

      Other interesting tid bits that I remember:

      -over 300,000 x86 machines make up the network, with clusters all over the place which make searches return in under .3 seconds.
      -commodity hardware (maxtor, western digital, whatever is available) is used.
      -over a thousand machines fail daily. Most are automatically rebooted, and it sounded like admins only come into play when a machine needs to be replaced.
      -the longest uptime of a single machine has been 7 years
      -they use a heavily modified redhat distro.
      -real time stats of the entire network can be seen at any moment
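
      To put the failure figure in proportion, here is a back-of-the-envelope calculation using only the two numbers quoted above. Both were given as "over", so treat the results as rough, and note the assumption that failures are spread evenly across machines.

```python
machines = 300_000          # "over 300,000 x86 machines"
failures_per_day = 1_000    # "over a thousand machines fail daily"

# Fraction of the fleet that fails on a given day.
daily_failure_fraction = failures_per_day / machines

# Average days between failures for any one machine, assuming
# failures are spread evenly across the fleet.
days_between_failures = machines / failures_per_day

print(f"{daily_failure_fraction:.2%} of machines fail per day")
print(f"about {days_between_failures:.0f} days between failures per machine")
```

      So roughly a third of a percent of the fleet fails each day, which works out to one failure per machine every 300 days or so on average.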

      i'm sure there were more interesting facts but that's all I can regurgitate at the moment.

    3. Re:This brings to mind Google's strategy. by ShyGuy91284 · · Score: 1

      Ever forget to bookmark something a few times because you're lazy? Googling it each time you want it to get to it results in it being at the same position as it was the first time (if the index hasn't updated to some changes).

      --
      In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
  39. SneakerNet * by dada21 · · Score: 5, Interesting

    In my 15 years of IT consulting, no network has provided data safety transparency cheaply or consistently enough. Clusters and fault tolerance both cost more than downtime in my experience.

    We desperately need a better way to access data in a corporate network.

    My favorite customers are those architects and engineers who avoid networking except for the Net. Seriously, sneakernet and peer-to-peer has shown the least downtime I've seen.

    I think p2p networks will see a comeback if a torrent-like protocol can grow to be speedy. My customers are not banks, but they need 100% uptime as every day is a beat-the-deadline day.

    If someone can extend and combine an internal torrent system with a decent file cataloging and searching system, they'll see huge money. I have some 150 user CAD networks just waiting for it.

    What would a hive network need?

    * Serverless
    * Files hived to 3+ workstations
    * Database object hiving
    * File modification ability (save new file in hive, rename previous file as old version, delete really old versions after user configurable changes)
    * "Wayback Machine" feature from old versions
    * PCs disconnected from hive will self correct upon reconnection
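
    The file modification and "Wayback Machine" items in the list above could work along these lines. This is a purely hypothetical in-memory sketch: the VersionedFile class, the keep_versions setting, and the wayback method are invented for illustration, and replication to 3+ workstations is left out entirely.

```python
class VersionedFile:
    """One hived file: a current version plus a bounded history."""

    def __init__(self, keep_versions=3):
        self.keep_versions = keep_versions  # user-configurable history depth
        self.current = None
        self.old_versions = []              # newest first

    def save(self, data):
        """Store new data; the previous version becomes an old version."""
        if self.current is not None:
            self.old_versions.insert(0, self.current)
            # Drop really old versions beyond the configured depth.
            del self.old_versions[self.keep_versions:]
        self.current = data

    def wayback(self, steps_back):
        """Return the version `steps_back` saves ago (0 is the current one)."""
        if steps_back == 0:
            return self.current
        return self.old_versions[steps_back - 1]
```

    With keep_versions=2, saving v1 through v4 leaves v4 current and v3, v2 retrievable, while v1 has been discarded.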

    It is very complex right now, but my bet is that the P2P network will trump client-server for the short run. The "client is the server" vs "the server is the client"?

  40. False Dichotomy by bluffcityjk · · Score: 1, Informative

    Since when can every software solution be categorized as "proprietary" or "Linux-based"?

    1. Re:False Dichotomy by Anonymous Coward · · Score: 0

      Will somebody PLEASE think of the BSDs!

  41. Simple by RussGarrett · · Score: 1

    Clustering costs more for the software. Fault-tolerance generally costs more for the hardware, especially if you cluster using commodity equipment. When the software is free, clustering is the obvious option.

    1. Re:Simple by Anonymous Coward · · Score: 0

      Except you are now running on at least two machines, and not one. Now you are spending at least twice as much on hardware.

  42. cluster = hard to upgrade by Anonymous Coward · · Score: 0

    Clusters can be EXTREMELY hard to upgrade to newer software versions or service packs. That's something to be aware of in figuring out the costs of maintaining such a system. Of course, it helps if you have a test system, and "a lot of companies" (if you get my drift) that spring for a cluster won't have a test cluster for roll-outs... if you want to feel pain, try to upgrade a MSCS system running some major clustered + non-clustered software to new versions, in production.

  43. I've struggled with this one myself... by Supp0rtLinux · · Score: 1

    But in the end, I opted for a "both" approach. If I'm going to do a cluster, I usually do it for applications, so I'll build it out in an N+1 style so I can easily add more resources to the cluster. If uptime is the concern and not horse-power, I'll simply make things as redundant as possible with drives, power supplies, RAID, etc.

  44. No difference, just a matter of packaging. by TheMohel · · Score: 4, Informative

    Having built both true high-reliability fault-tolerant devices and clustered systems, I don't see any fundamental theoretical difference. In both cases, you have redundant hardware capacity in place, theoretically to allow you to tolerate the failure of a certain amount of your hardware (and, sometimes, your software) for a certain amount of time. Neither option guards you against failures outside of the cluster or FT system box. Neither one is a panacea. Both are sold as snake-oil insurance against "badness".

    In a single fault-tolerant box, you generally have environmental monitoring, careful attention to error detection, and automatic failover. You also have customer-replaceable units for failure-prone components, utilities for managing all of the redundancy, and a fancy nameplate. In exchange for that, you have more complexity, more cost, serious custom hardware and software modifications, and often (but not always) performance constraints.

    In a clustered system, you treat each individual server as a failure unit. Good fault detection is a challenge, especially for damaging but non-catastrophic failure, but it's much easier to configure a given level of redundancy and it's easier to take care of environmental problems like building power (or water in the second floor) -- you just configure part of the cluster a longer distance away.

    Where clustering is inadequate is when you have a single mission-critical system where any failure is disaster (like flight-control avionics or nuclear power plant monitoring). There are applications where there's no substitute for redundant design, locked-clock processors and "voting" hardware, and all of the other low-level safeguards you can use.

    For Web applications, however, where a certain sloppiness is tolerable, and where you want the advantages of load balancing, off-the-shelf hardware and software, and system administration that doesn't require an EE with obsessive-compulsive disorder, clusters are the natural solution.

    The fact that you get to sell more licenses for the software is just gravy.

    1. Re:No difference, just a matter of packaging. by tonygarza · · Score: 1

      I agree with TheMohel: There is NO DIFFERENCE between Server Clustering and Fault-Tolerant Servers. The only difference is "a matter of packaging"... or in my words, a matter of WHERE a level of redundancy is implemented.

      The main issues concern 100% Uptime, Server Performance, and Data Safety. It is the IT Manager's JOB to [1]Imagine all worst case scenarios, [2]Employ a solution which protects against those scenarios, and [3]routinely exercise his/her system's redundant solution to verify its actual redundancy.

      I want to avoid being technical, because 100 years from now systems will be completely different, but the question of WHERE to employ redundancy will always be important. So many people here are making arguments that their Clustering System doesn't even work, blah blah. News Flash: If your Clustering System doesn't work, IT'S NOT A CLUSTER! If your Fault-Tolerant server experiences a Fault, IT'S NOT FAULT-TOLERANT.

      A Wise (Good) IT Manager will always routinely test their system to make sure it works. So now a wise IT manager knows he/she needs to focus on what his/her system's vulnerabilities are. For example, I remember a while ago microsoft.com experienced a DoS attack and was down for a day or so. Their weakness? Their server farm, which even if it employed both Clustered and Fault-Tolerant Servers, failed because it was all centralized in Washington. Now they have another server farm somewhere else, providing them with redundancy against a DoS attack.

      But who cares about MS's uptime... more important systems exist like air traffic control, nuclear power plant systems, banks, government, etc.

      A Wise IT Manager will see that what they really need to identify is WHERE redundancy is needed. Server Clusters can provide redundancy on a Server level, on a ISP level, on a power-grid level, and on a geographical level. Fault-Tolerant Servers provide redundancy on a hardware or software level, whether it be CPU's, Disk Controllers, Hard-Drives, and Power Supplies or Virtual Computers, Databases, or Applications.

      I remember reading a powerpoint overhead I got from MIT's OpenCourseWare... one of their computer science classes. The one that "stuck" with me said something like:

      "Systems are ephemeral, but Data is eternal"

      I had to look up the word ephemeral, but it means "short lived." And this is very true. Whether we store Romeo & Juliet in Shakespeare's original hand-written form, or typed with a type-writer, or stored in binary on some database somewhere, the PRESERVATION of the DATA ITSELF is what is important. The SYSTEM ITSELF is not so important because it will eventually change... according to Moore's Law. The only issue concerning the system is its ability to preserve and serve the Data.

      -Tony
  45. Don't forget Load Balancing by ScentCone · · Score: 1

    Speaking of the Windows universe, here. I've found actual for-real clustering (say, of SQL Server) to be workable, but to be a serious (and expensive) pain in the ass. Obviously it depends on the app, but log-shipping and other mechanisms are frequently good enough to prepare for fail-over to another machine, and decent fault-tolerant hardware is good enough insurance for a lot of circumstances.

    On the web side of things, clustering (actual clustering) sure hasn't come up much in my world. But I use native NLB with very good results. Depending on how your app handles state/sessions, that native load balancing is pretty much a no-brainer to set up. There are problems, though... your server can (from NLB's perspective) seem perfectly happy - even as your web app is puking in some way, and defeating the whole purpose. So for that, you've got to have something watching the app and then kicking the machine in the ass if it's stupid in that higher layer. This would be, of course, just as true of any load balancer that's out in front of the web servers and doesn't know if a particular app is happy or not.
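
    The "something watching the app" part can be as simple as an HTTP probe that exercises the application rather than just the machine. Below is a rough sketch in Python, not anything NLB itself provides: http_probe and healthy_pool are made-up names, and the rule "healthy means HTTP 200 within the timeout" is an assumption you would tune per app.

```python
import http.client
import urllib.request

def http_probe(url, timeout=2.0):
    """True if the app answers `url` with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or garbled reply.
        return False

def healthy_pool(servers, probe=http_probe):
    """Keep only the servers whose application-level probe succeeds."""
    return [s for s in servers if probe(s)]
```

    The point of probing a real page is that a box can answer pings and hold its place in the balancer while the web app on it is returning errors; a check at this layer catches that case, and whatever runs it can then pull the node or restart it.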

    But if you're trying to spread the pain across a handful of web servers, NLB is a pretty easy solution. Making sure that a SQL server behind those web servers is up though... in real life, unless there's a large budget and good admins doing the care and feeding, the risk of having to rely on a managed fail over to a recently replicated copy of the db on another machine seems to be a pretty popular choice. Considering that you can buy a seriously fault-tolerant server and storage solution for a pittance compared to the long-term admin costs of not screwing up a clustered rig, that's the sweet spot for a lot of users, and the risk is fairly low. Hardware, properly housed in a decent data center, is pretty damn reliable at good price points these days. A somewhat fragile clustering environment, though, is one slightly-drowsy off-shore admin mouseclick away from being REAL hard to unscrew.

    --
    Don't disappoint your bird dog. Go to the range.
  46. Cost and scalability by microbee · · Score: 1

    I am not sure fault-tolerance is cheaper than clustering. You can build a cluster from cheap PCs and you can keep adding nodes to it. But fault-tolerant servers sound like they are not easily scalable, lock you in to a vendor, and are costly too (since the hardware has to be specially designed).

  47. Ignoramus by Donny+Smith · · Score: 4, Informative

    What you wrote is really ignorant (which, modded on /., translates to Insightful).

    1. (because I have yet to meet a clustering DB solution that didnt suck).

    Where do you live? In Ruanda?
    Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.

    2. one copy of Debian + Apache + MySQL + Perl or 200 copies

    mySQL isn't enterprise-reliable even in stand-alone configuration, let alone clustering. I can't believe this...

    3. And windows doesnt support clustering yet - in any decent way shape or form, I dont see the problem here.

    Hah, hah! Enough said.
    And also - what's it to you? If Microsoft (in your view) had a good clustering solution, you'd lose sleep over that?
    When you're biased like that, no wonder you can't have a quality, unbiased opinion on this topic.

    1. Re:Ignoramus by rasjani · · Score: 1

      Check out M/Cluster software from Emicnetworks for mysql h/a clustering..

      --
      yush
    2. Re:Ignoramus by Donny+Smith · · Score: 1

      I am aware of their replication-based software. The point is, mySQL per se is not a relatively reliable database. And it doesn't scale like Oracle RAC.

    3. Re:Ignoramus by Anonymous Coward · · Score: 1, Interesting

      Ever seen the actual numbers on Oracle RAC scaling on any decent sized install..let's say 4+? Not pretty. Oracle RAC is an interesting technology, but it serves one purpose and one purpose only. Allow Larry to buy fancier boats.

      Why do you think Oracle runs its business (main ERP apps) on 4 nodes of big Sun iron and not 8+ nodes of cheap linux boxes? They tout the shit out of the cheap linux boxes to their customers because if an IT department has a $4 million budget for a big project, Oracle wants to get $3.95 million of that. The only way they can do that is to push the crap out of cheap dell and linux for the hardware and the OS. But when it is *Oracle's* business on the line..well, the proof is in the pudding as they say.

    4. Re:Ignoramus by Anonymous Coward · · Score: 0

      It would be pretty stupid for Oracle to make a major migration from one platform to another just to serve as a proof-of-concept for you. If they've already made the hardware, software, and human resource investment in Solaris on Sun servers, why should they change simply because their products also run on {x} platform? Should they also switch to Windows servers as well? As long as they're using their own software (which is the case), that's all the "pudding" I need.

    5. Re:Ignoramus by Ledis · · Score: 0

      WTF...I have seen this exact same Oracle RAC bashing post at least a couple of times before. Do you keep it ready on your desktop for a quick copy&paste?

    6. Re:Ignoramus by sn00ker · · Score: 1
      mySQL isn't enterprise-reliable even in stand-alone configuration
      Yeah, I hate these unreliable databases. I mean, our MySQL server at work is a hot 2.8GHz Xeon HT with 2GB of RAM, running RH7.3, and it only managed to serve an average of 200 queries per second (peak load was over 400q/s) over 2½ years. That's really unstable, and crap, and stuff, eh!
      --
      "God, root, what is difference?" - Pitr, userfriendly
    7. Re:Ignoramus by Anonymous Coward · · Score: 0

      That is nothing. We have a MySQL server that has been up for 600 days and now handles 590 queries a second at any time. Someone who talks about MySQL not being stable either has never used it or thinks just because it doesn't have feature X it isn't worth running. Well, I can tell you one thing: stability is not a problem with MySQL.

    8. Re:Ignoramus by isorox · · Score: 1

      mySQL isn't enterprise-reliable even in stand-alone configuration, let alone clustering. I can't believe this...

      Yet it is used by enterprises to run software 24/7 with zero downtime (at least at my company over the last 3 years)

    9. Re:Ignoramus by rasjani · · Score: 1

      Yes, it's replication based and yes, I agree with you about MySQL. I just pointed out the software that helps a bit to move MySQL in the right direction.

      --
      yush
  48. Standard and up to date hardware and software by markus_baertschi · · Score: 1

    The main problem is that building a fault-tolerant server is an arduous task. It takes a lot of engineering and testing. This slows you down and your product cycles get long. When you bring your new machine to the market it will look old and slow compared to 'standard' competitors. In addition your database will be a specialized, proprietary version which does not work with any tool and the admin staff needs special education to manage and operate it.

    Clusters are different. Just take your latest and greatest server and middleware, package it with a version of your clustering glue and voila - instant high availability. All your tools and admin knowledge is applicable because it's built on the same stuff you know already.

    In addition, the real test, a real emergency is unlikely to happen anyway. Even if it does happen and your cluster fails to provide the promised availability there is no real problem besides your 1000 users being without application for a day. You'll blame the problem on the vendor and your reputation is safe. This is why you bought the cluster in place of a single system anyway.

    Markus

  49. Fault tolerance only goes so far by networkphantom · · Score: 2, Insightful

    We run volumes of Dell 2850s with RAID arrays, redundant power, etc. powering high volume websites... I can say firsthand that internal fault tolerance in these systems can only get you so far: a component such as the management device in charge of the two power supplies can itself fail, leaving both power supplies useless. Or a RAID card can go out of commission, leaving drives with mangled and unrecoverable data. As with most solutions, a mixture of both fault tolerance and data clustering is the safest alternative.

  50. The big disadvantage by OeLeWaPpErKe · · Score: 1

    Clustering has a MAJOR problem going with it. Clustering requires applications to be written specifically to support clustering. All sorts of libraries have been written to "make this process easier", but one thing's for sure : it will require a recompile, and software that is not designed by people who know what ACID means for databases. It is very hard to keep a hand written app in a consistent state on all machines, knowing that any one of them might fail completely (we only support complete failures, disfunctional memory for example, will not be reacted to) at any time.

    1. Re:The big disadvantage by TTK+Ciar · · Score: 1

      Clustering has a MAJOR problem going with it. Clustering requires applications to be written specifically to support clustering. All sorts of libraries have been written to "make this process easier", but one thing's for sure : it will require a recompile

      This is not true at all for many of the most-common cluster applications. Framework software exists which "gangs together" a pool of servers, each of which can run ordinary, non-cluster-aware software. No need to write code, no need for a recompile. Please qv: keepalived, for one example.

      -- TTK

  51. F'ed up terminology.... by numbski · · Score: 1

    Clustering: Several systems that do parallel computing.

    Fault Tolerant Servers: Several systems with failover load balancers in front.

    I get frustrated when people use the latter and call it the former. True, you could have fault tolerant servers in a single box, but why? In fact I'm rolling out infrastructure of the latter in large doses.

    This is how google dunnit. Very well in fact. ;) It doesn't have to be expensive either. So far the most expensive part seems to be a soft switch for SAN so I can use OpenAFS and scale storage space without downtime. Help in that area would be nice, btw. :)

    --

    Karma: Chameleon (mostly due to the fact that you come and go).

    1. Re:F'ed up terminology.... by MadMorf · · Score: 1

      Clustering: Several systems that do parallel computing.
      Fault Tolerant Servers: Several systems with failover load balancers in front.
      I get frustrated when people use the latter and call it the former.


      Not necessarily:
      High Availability Clusters are what we're talking about here.

      You're talking about High Performance Clusters, which is NOT what we're talking about...

  52. Gas Lamp by Ian.Waring · · Score: 1
    See ActiveGrid - provides a production environment that can scale an app over a Grid of Application Servers each running the LAMP stack. If the middleware over the top is reliable, you've got a largely fault tolerant and ever scalable setup...

    Ian W.

  53. For firewalls and/or routers by SquadBoy · · Score: 2, Informative

    There is nothing like OpenBSD running pf and carp. Dead easy to set up, works like a charm, and secure by default. One wonders why the editors seem to think OSS == Linux.

    http://www.openbsd.org/faq/pf/index.html
    http://www.openbsd.org/faq/faq6.html#CARP

    --

    Cypherpunks: Civil Liberty Through Complex Mathematics. Those who live by the sword die by the arrow.
    1. Re:For firewalls and/or routers by Anonymous Coward · · Score: 0

      They don't mention it because everyone knows that BSD is a zombie OS. Dead dead dead.

    2. Re:For firewalls and/or routers by Slashcrap · · Score: 1

      OpenBSD is such a breath of fresh air in contrast to the moldy scent of Linux. I'll be firing-up my 3rd OpenBSD box here next week - Yeah, people...SURPRISE!!! I'm not an OSS bigot like 98% of you think I am!

      98% think you are? To get a figure like that assumes that there must be at least 50 people on Slashdot who have some sort of opinion about you. I think that you are massively overestimating your own significance.

      I have no doubt that you have constructed yourself an elaborate belief system in which your "controversial" views are suppressed by the liberal elite. The truth is that you're just rather dull. Have you considered starting a blog?

  54. "What time is it?" by Anonymous Coward · · Score: 0

    Those that know the horrors of 3AM maintenance will have the fullest appreciation of being able to take servers out of service in the middle of the day for software and OS updates.

    Expensive are the applications and OSes that can be upgraded without downtime, regardless of the faulttolerantness of the server.

  55. This is not an XOR by 3ryon · · Score: 1

    I have always clustered fault-tolerant servers. For important business applications there is no choice but clustering. However, I want to fail over to the standby node on my own terms...not because of a hardware failure. This solution gives you great availability along with the chance to make firmware/driver/hardware updates to the fail-to node during business hours. You can then fail over in a maintenance window and then update the other server during business hours.

    BTW, SQL Server does not require that you buy licenses for the fail-to node.

  56. Availability vs. Reliability by JustASlashDotGuy · · Score: 3, Insightful

    It all comes down to Availability (Clustering) vs. Reliability (Fault Tolerant). They are NOT the same thing.

    Fault tolerant servers are nice; even the simplest true server should offer some degree of fault tolerance (e.g. RAID drives). This is handy but may not help your availability if you have an SLA promising xx% of uptime and then find yourself needing to take the server down to apply service packs or other patches.

    Clustered servers allow you to increase the availability of your machines, because when you need to take one down for some updates, you can simply fail over all your traffic to the other server in the cluster. Clustering may increase the availability of the services those machines are offering, but it does not help the reliability of the machines themselves.

    Therefore, I personally choose to start with fault tolerant machines initially (RAID and dual power supplies at a minimum). It makes for a good base. If the services on that machine are 'mission critical', then cluster that machine with other fault tolerant machines.
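    To put rough numbers on the availability side, here is a back-of-the-envelope sketch. It assumes node failures are independent and that failover is instantaneous, neither of which holds exactly in practice, and the 99% per-node figure is only an example value.

```python
def parallel_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is up."""
    if not 0.0 <= node_availability <= 1.0 or nodes < 1:
        raise ValueError("need 0 <= availability <= 1 and at least one node")
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


single = 0.99                                  # example per-node availability
pair = parallel_availability(single, 2)        # two-node failover cluster

single_downtime = downtime_hours_per_year(single)
pair_downtime = downtime_hours_per_year(pair)
```

    Under those assumptions a second node cuts expected downtime by a factor of about a hundred, which is why planned maintenance stops eating into the SLA. It says nothing about how often each individual box breaks, which is the reliability half of the argument.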

  57. Catastrophic loss by lilmouse · · Score: 2, Insightful
    Anyone make a case for clusters for high-uptime situations?
    Well, if your whole rackspace burns to the ground, that's a bit much for a "fault tolerant" server to handle. Multiple sites mean a single nuclear weapon (plane hitting WTC, fire, hurricane, earthquake, you get the idea) can't take you down.

    --LWM
  58. Re:Microsoft Windows Server DOES support clusterin by WindBourne · · Score: 1

    From a few that I have talked to and that have actually worked with this, they tell me that it is a nightmare and that they would switch to something like NCR's server next time. Apparently, they felt that running MS clusters was expensive and difficult and did not work well.

    Interestingly, one of them also runs a Linux and an HP cluster and says they were much easier, and that they were moving their code base to Linux only.

    --
    I prefer the "u" in honour as it seems to be missing these days.
  59. a Java layer that load balances??!!??!!?!? by infonography · · Score: 2, Insightful

    Not worth doing. The cluster components should be dumb. There isn't a valid reason to have them know about each other. Your Round Robin or whatever balance you want should come from outside. F5 makes a nice box for that, so do others; if you're really a cheapskate and wanted to, you could duplicate them. If you need to have anything know about who is on what machine, let the system tell that to the backend DB machine. It should be a channel architecture, not a crazy tangle. The more you break the functions down on the system level, the better and faster your cluster will be.

    Syncing databases on the other hand is tricky. Save your money and resources for that.

    --
    Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
    1. Re:a Java layer that load balances??!!??!!?!? by twiddlingbits · · Score: 1

      When you get that load sensing and balancing for the SAME price as you get your App Server then it makes some sense. F5 and Cisco load balancers that balance at Layer 5 or above are not cheap. WebSphere comes out of the box with this capability for ZERO extra, and that WebSphere code is Java, so you have Java app server code load balancing Java apps on multiple machines or on multiple "virtual servers" on the same physical machine.

    2. Re:a Java layer that load balances??!!??!!?!? by cloudmaster · · Score: 1

      I actually had pretty good luck with MySQL replication - so long as the machines all stay up and are running the same version. :( At least it wasn't very hard to set up.

    3. Re:a Java layer that load balances??!!??!!?!? by infonography · · Score: 1

      Agreed, I defaulted to OSS model.

      Reflex.

      --
      Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
    4. Re:a Java layer that load balances??!!??!!?!? by twiddlingbits · · Score: 1

      Wow..someone on /. that AGREES with me ;) I gotta post more controversial ideas! :)

  60. It's all the way the cookie crumbles. by xAXISx · · Score: 0

    Fault-tolerant servers are the way to go for critical applications. Obviously critical applications need quite a bit of computational power too, so I'd consider the best approach to a situation like this would be to go with a small cluster of servers, each being redundant. For example, say we have 6 servers to get this operation up and running; what in my opinion will be most reliable would be to do what was mentioned earlier and do both. It covers all bases, and really makes for a stable networking environment. There is an application built into OpenBSD used for making two servers act much like one, I believe it is called CARP. In principle, it can be used to make one server handle the primary functions, while the other is on constant standby to take over operations in case of an incident. It might seem like a waste of CPU cycles, but it works out very well, especially if you turn 3 pairs of them into one Beowulf cluster.

  61. Re:Well then you haven't heard of RAC by TarrySingh · · Score: 1
    Oracle RAC does clustering and load balancing. If you want a highly available db, independent of other HW resources, you'd wanna go for clustering.

    I too choose clustering for a variety of reasons

    o Scalability, meaning when you're done Scaling UP, you can Scale OUT!

    o High Availability, nodes go down all the time, then other nodes can pick up the load and provide continuity!

    o Performance Enhancement, well you can address performance problems by dividing load

    to name a few important aspects of it...

    --
    Scott McNealy to Michael: "Suck my Sun!" Michael Dell to Scott : "Lick my Dell!"
  62. Capacity Planning by rufey · · Score: 1
    It's harder to increase the capacity of a fault tolerant system - at some point you reach a limit as to how many CPUs and how much memory you can add, and to a lesser extent, the amount of disk (assuming you use a storage area network).

    With a cluster, you simply add another machine to the cluster when you need more computing power. You can also take a single machine off the cluster for upgrades, hardware troubleshooting, or to reallocate the single machine to do something else.

    As other posters have said, a large factor in deciding what to do depends on the application. Google wouldn't be where they are today if they used a fault tolerant system instead of the massive cluster technology they use today. In fact you could say that Google has built a fault tolerant system using cluster technology.

    On the other hand, there are some apps (such as databases) that are tricky to cluster in a way where the performance benefit outweighs the problems associated with it.

  63. Cluster + Virtualization by Anonymous Coward · · Score: 0

    One of the big themes at LinuxWorld 2005 in SF was virtualization on top of clusters. You get the look and feel of a single machine and also get the power and availability of a cluster. Oracle's 10g database makes use of this architecture. Another company called Virtual Iron uses VMware across 16-way clusters through high speed interconnects for their solutions.

  64. Re:SneakerNet * by Ramses0 · · Score: 3, Insightful

    What about iFolder? Looking at the specs I think it's missing serverless/hiving (which could be provided by any of the normal p2p people), file history ... not understanding your database object comment.

    Speaking of which, what about freenet? The only thing it's missing is "guaranteed availability of critical business data", eh? And I hear it might have some performance problems. ;^)

    --Robert

  65. All about the cost/benefit. by jafo · · Score: 1

    High Availability is all about cost/benefit. RAID and a redundant power-supply are both reasonably cheap for smaller systems, and increase system management complexity only a bit. They are also fairly limited in what they can protect against: certain disc or power supply failures.

    A cluster can, if properly designed, protect against all sorts of failures: disc, power supply, controller, motherboard, CPU, backplane, cable, network, some designs can even deal with physical disaster like a fire in one of your server rooms and fail over to another or even another geographic location. However, the more protection you add, the more time it takes to implement, test, and maintain.

    Tandem, one of the large vendors of fault tolerant hardware/software systems published a report in the late '80s saying that with recent advances in hardware and software, the major cause of system outages was now due to human error: administrators removing the active CPU when trying to replace a failed CPU for example. To properly implement a cluster can involve dozens or hundreds of hours of staff time setting things up, testing, and documenting it all. Especially if it's your first time, I'd say that budgeting 100 hours isn't unreasonable.

    With HA clusters, the devil is definitely in the details. For example, incorrectly implementing shared storage locking can mean that an unplugged network cable can result in having to re-load your systems from backups. In that case you're far worse off than if you had no HA at all. Sure, this is a nightmare scenario that hopefully shouldn't happen in production if you do appropriate testing, but I use it to illustrate a point.

    Usually HA is implemented in places where downtime has a real cost, so you are paying more for maintenance and hardware so that you don't have to pay (usually many times) more in lost revenue and/or reputation in downtime.

    Sean

  66. Just go with Big Iron by DaKrzyGuy · · Score: 1

    If you really have mission critical applications that can never go down just get an IBM Mainframe. You can replace just about any part in the system without it going down. They even have extra CPUs they can bring online if one fails. Oh, and if you are really paranoid you can cluster them together in different geographic areas.

    1. Re:Just go with Big Iron by BillBrasky · · Score: 1

      Yeah, I'm surprised there aren't a lot of posts here from mainframe guys... or maybe they ignored the post because they just don't worry about this kind of stuff :)

    2. Re:Just go with Big Iron by tweek · · Score: 1

      But that only addresses hardware. What about the application layer? What about your database or whatever else you're running on that mainframe? Sure you can partition out the fault-tolerant hardware of the mainframe into N+1 partitions but does your application support it?

      --
      "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  67. Fault Tolerance by rlp · · Score: 1

    Read a lot of misinformation on this thread. Properly designed, a fault tolerant machine should NOT require downtime to replace a failed component, as all components (including CPU modules) should be hot-pluggable. In general, a fault tolerant system should be able to shut down a single failed component and keep going without any noticeable impact on processing. A cluster may require some time to switch over if it is a fail-over system, or may need to restore / restart / migrate a checkpointed task. Fault-tolerant systems and high-end clusters are generally expensive. Low end fail-over systems less so. Is it worth the cost? That depends entirely on the application - in particular, what is the cost / impact of down time? System availability is NOT solely dependent on the FT / Cluster box - redundant power, networking (including WAN and Internet connections), physical and network security must be considered. Finally, external events (like Hurricanes!!) must be considered - and a carefully crafted disaster recovery plan is a must.

    --
    [Insert pithy quote here]
  68. And the moral of the story is..... by WindBourne · · Score: 1

    One's too many, and 100 is not enough?

    --
    I prefer the "u" in honour as it seems to be missing these days.
  69. Cluster, obviously by Anonymous Coward · · Score: 0

    After all, who has ever heard of a "fault-tolerant fuck?"

  70. wait a sec... by um_atrain · · Score: 1

    you have to PAY for proprietary???

    Woops.

  71. Google as an example by Guspaz · · Score: 2, Interesting

    Google proved that clustering could be fault tolerant, while costing less than true fault-tolerant hardware.

    Google built massive clusters of thousands of machines out of very cheap unreliable hardware. They have tons of hardware failure due to the extremely cheap components (and sheer number of machines), but everything is redundant (And fully fault tolerant).

    They did this, again, using dirt cheap hardware.

    1. Re:Google as an example by chez69 · · Score: 1

      google's solution only works because they don't care if nodes fail.

      --
      PHP is the solution of choice for relaying mysql errors to web users.
    2. Re:Google as an example by Launch · · Score: 1

      and this is the basic premise of clustered servers in the context of this discussion.

      --
      Your mammas flamebait.
    3. Re:Google as an example by Thundersnatch · · Score: 2, Informative

      Google does not have to worry about ACID compliance in their database. From what I've read about the google file system, cluster nodes lazily share new data amongst themselves. Serving up old data is explicitly allowed.

      To cluster something like an OLTP database, every node has to be immediately informed about updates to the data, and they all have to report back that they have said data intact before the transaction commits. This can be something of a problem when you have hundreds of thousands of updates per second happening.

      And of course, you need to have a method to rapidly bring back into sync a server which has been out of commission for a while before it comes back online.

      The only way I've seen to do that is to have some sort of high-speed shared interconnect between nodes, and some cluster-awareness in the application to handle synchronization. That is currently very expensive, especially if some of your nodes are in California, and the rest are in Chicago.

      Shared-nothing clusters simply require high-speed interconnects for transactional applications. Data changes must be pushed everywhere, immediately, before the transaction commits. I don't see how you get around that.
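      A toy sketch of the "every node must report back before the transaction commits" rule described above. It is a simplified prepare/commit round written for illustration; the `Replica` class and `replicated_write` function are hypothetical and do not correspond to any particular database product.

```python
from typing import Dict, List


class Replica:
    """One shared-nothing node holding its own copy of the data."""

    def __init__(self, name: str, online: bool = True) -> None:
        self.name = name
        self.online = online
        self.data: Dict[str, str] = {}
        self.pending: Dict[str, str] = {}

    def prepare(self, key: str, value: str) -> bool:
        """Stage the change and acknowledge it; an offline node cannot ack."""
        if not self.online:
            return False
        self.pending[key] = value
        return True

    def commit(self) -> None:
        """Make the staged change visible."""
        self.data.update(self.pending)
        self.pending.clear()

    def abort(self) -> None:
        """Throw away the staged change."""
        self.pending.clear()


def replicated_write(replicas: List[Replica], key: str, value: str) -> bool:
    """Commit only if every replica acknowledges the staged change."""
    acks = [replica.prepare(key, value) for replica in replicas]
    if all(acks):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False


nodes = [Replica("california"), Replica("chicago")]
first_ok = replicated_write(nodes, "acct:1", "100")

nodes[1].online = False                     # the remote site drops off
second_ok = replicated_write(nodes, "acct:1", "90")
```

      Every write costs at least one round trip to the slowest node, and a single unreachable node blocks commits, which is where the interconnect cost comes from.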

    4. Re:Google as an example by tweek · · Score: 1

      "Shared-nothing clusters simply require high-speed interconnects for transactional applications. Data changes must pushed everywhere, immediately, before the transaction commits. I don't see how you get around that."

      It depends on the implementation. DB2 HADR has three modes - active sync, passive sync and near sync. It all depends on the SLA you have with your enterprise.

      In our case, we're considering a near sync implementation. We recently had a major outage that was not helped by any level of HA implementation we were currently running. We had DB2 write a bad transaction (don't ask me how - they say because of memory corruption in the instance) and thus couldn't even bring the instance online. We couldn't failover because our data was corrupt. In a realtime sync between two shared-nothing nodes, this corruption would have spread to our other server as the transactions got replicated.

      With the near sync option, we have a tunable parameter (20 minutes was all we needed) that says always stay this far behind the primary node. That way we could have had business continuity. In our case, we had to restore from backup and reenter 6 hours of financial transactions. We couldn't determine where the corrupt transaction occurred so we had to go to our last good full backup. We couldn't even roll-forward.

      No amount of HA/Fault-tolerance can fix that problem.
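      A sketch of the "always stay this far behind the primary" idea as described in this comment. It is not DB2 HADR code or configuration; the `DelayedStandby` class, the log format, and the timestamps are assumptions made up for illustration.

```python
from typing import List, Tuple

LogEntry = Tuple[float, str]  # (commit time in seconds, statement)


class DelayedStandby:
    """Standby that applies log entries only after a fixed delay."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay = delay_seconds
        self.queue: List[LogEntry] = []
        self.applied: List[str] = []

    def receive(self, entry: LogEntry) -> None:
        """Accept a shipped log entry without applying it yet."""
        self.queue.append(entry)

    def apply_due(self, now: float) -> None:
        """Apply only entries that are at least `delay` seconds old."""
        still_waiting = []
        for commit_time, statement in self.queue:
            if now - commit_time >= self.delay:
                self.applied.append(statement)
            else:
                still_waiting.append((commit_time, statement))
        self.queue = still_waiting

    def discard_from(self, bad_time: float) -> None:
        """Drop queued entries committed at or after a known-bad point."""
        self.queue = [entry for entry in self.queue if entry[0] < bad_time]


standby = DelayedStandby(delay_seconds=20 * 60)
standby.receive((0.0, "good-1"))
standby.receive((600.0, "good-2"))
standby.receive((1500.0, "corrupt"))

standby.apply_due(now=1560.0)               # only good-1 is old enough
applied_before_cleanup = list(standby.applied)

standby.discard_from(bad_time=1500.0)       # operator spots the bad write
standby.apply_due(now=1800.0)
```

      The delay only helps if someone notices the corruption and drops the bad entries before the window expires; after that the standby is as broken as the primary.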

      --
      "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
    5. Re:Google as an example by Thundersnatch · · Score: 1

      Your story underscores the real point: hardware reliability is typically not the problem these days.

      Software failures (and this includes MySQL and PostgreSQL) are far more common than hardware failures. Data corruption is a software failure, despite IBM's claim otherwise. (A database application should have a strong integrity-checking mechanism to ensure that a corrupt page is never written to a transaction log. If the data corruption occurred in memory as they say, their log-writing process should have caught it and aborted the transaction. I have a feeling hardware was not the root cause of your failure, but I suppose corruption at the disk-driver level could make the failure undetectable to the DB application.)

      Clustering, even shared-nothing clustering, cannot protect against failures of this sort when the clusters must share data in real-time. At least not without a whole lot of application-aware code to check the integrity of data before accepting it on the 2nd cluster node, which will slow the whole transaction down.

      I have seen DB mirroring devices which act like a BIG-IP for database servers, sending the same transactions to two machines. But these introduce another bottleneck and another single point of failure, and presumably software problems on one DB server will also occur on the other servers when they are fed the same SQL data. And recovery is still a big problem.

  72. fault tolerance works better by Anonymous Coward · · Score: 0

    i run a relatively large .com website + servers (main dev/it) and our compaq/hp servers are all fault tolerant. they do an exceptional job. our webserver recently had the primary psu die, we were alerted and it was replaced without the machine going offline.

  73. NetWare and Windows Clustering... by MadMorf · · Score: 1

    I've worked with both fairly extensively and I'd have to say that although NetWare clusters seem to be more stable than Windows clusters, neither is a great solution for anything...

    In my experience, the Windows Cluster Nodes will fail into some sort of "undead" state, in which the dead node isn't quite dead yet and the live node never quite picks up the slack, so you end up having to reboot both of them...

    The NetWare Cluster Nodes have such a hair-trigger with the default settings that they seem to fail-over for no particular reason and get into a tail-chasing situation, which would be amusing if you didn't have 200 screaming attorneys looking over your shoulder as you try to stop the failover merry-go-round...

  74. Windows NLB (aka WLBS) by Anonymous Coward · · Score: 0

    At my last company we used to push Windows NT Adv. Server or 2000 Adv. Server because NLB (originally WLBS) either comes packaged with it or is available for it. WLBS (the original name) is Windows Load Balancing Service. It allows you to set up the same IP addy on multiple systems and it handles all the routing. Never once have I had a customer experience hardware related downtime using the system. Software became more difficult to manage when using backend DB's but we made it work.

    You can also buy clustering routers, but they cost $10,000 and really the speed difference is not noticeable (we were doing CC routing, always under 3 second responses required.). It would easily handle 200-400 transactions per second.

    We also had hot swappable boxes (mostly STRATUS boxes) but I never liked them due to pricing. Clustering ALWAYS worked out cheaper from a hardware stance, and IMHO improved response times by sharing the load. Anyways, I know some of you won't like this, but NLB/WLBS worked like a charm. Of course, MS didn't design the software originally, it was a buy out..

  75. how much state do you have to distribute? by zorro6 · · Score: 1

    In my mind it comes down to whether the service you are trying to provide needs to present synchronized and consistent state to each user in the presence of asynchronous updates or not. If you have to distribute and synchronize significant amounts of changing state across the entire user base then you might want to go with a shared memory/shared resource HA/FT solution. A database app is often of this type. If you don't have this requirement (if the content is static, etc.) then a cluster solution might be better. There are certainly many ways to offer synchronized, shared state in a cluster solution but having to do this over a network, through standard network protocols, is slower, heavier, harder than in a single machine, shared memory environment.

  76. It's not a cluster if it doesn't say VMS by tengu1sd · · Score: 1
    All these other platforms that try to label either a "hot spare" system or load distribution as a cluster cheapen the label for the one operating system that got it right. Of course no one hears about it due to the buffoons in management and dead product FUD.

    For a reliable environment you want a reliable platform and you want enough platforms (clustered) so that dropping one site doesn't mean no service.

    1. Re:It's not a cluster if it doesn't say VMS by msbsod · · Score: 1

      You are absolutely right! But the folks at "techtarget.com" are not interested in working solutions, like VMS. Go to their main web page and check the long list of topics. There is no real information behind each link. And when you click on the "Media Kits" link, then you get the address of their sales person. People like those behind "techtarget.com" only want to address as many people as possible, advertise and do more marketing BS with flash files and cookies. This is the reason why they only mention Microsoft and Linux. It is just a pity that so many people buy this junk. I mean the Microsoft and Linux software as well as the low quality publications.

  77. Clustering gives you scalability, FT servers don't by Jim+Ethanol · · Score: 1

    These technologies can not be compared in an apples-to-apples fashion. Clustering solves performance AND reliability problems. Fault Tolerance just solves reliability issues.

    Performance + Reliability = Clustering

    Reliability alone = Fault Tolerance

    -Jim

    "money is the purest form of energy on this planet"

  78. Scalability by Launch · · Score: 1

    When discussing Fault Tolerant vs. Clustering systems it's extremely important to discuss the need for scalability. Clustered systems are inherently scalable, while fault tolerant ones are not (in general).

    For my business needs I usually see clustered systems as a much better solution than fault tolerance. When dealing with systems that require fault tolerance you are mostly concerned with keeping the data they store available (database servers, file systems, etc). When dealing with systems where high availability is required for data, 99% of the time you are dealing with systems that will need to be responsive to increased scale.

    DFS and HA in SQL 2005 or 10g are examples of where a clustered system really couldn't be replaced with fault tolerance.

    --
    Your mammas flamebait.
    1. Re:Scalability by Launch · · Score: 1

      I also wanted to mention cost.

      Usually a clustered server solution is very comparable in cost to a fault tolerant solution. In general your clustered boxes are pretty cheap off the shelf deals, while fault tolerant machines are not. When a critical error occurs with a fault tolerant system, the cost to repair is much greater than in a clustered solution, and downtime can be exponentially higher.

      Clustered solutions are designed to maintain uptime even when there is failure.. FT solutions are designed not to fail... I trust things that are designed to keep working when they break more than I trust things that are designed not to break.

      --
      Your mammas flamebait.
  79. Re:SneakerNet * by MyHair · · Score: 1

    You might check out OpenAFS. I'm not sure it meets all your requirements, though.

  80. More clustering benefits by Anonymous Coward · · Score: 2, Insightful
    Clustering protects against many more types of failures than servers with internally redundant hardware.
    • Clustering allows easy zero-downtime upgrades (update half the cluster, and then the second half)
    • Clustering allows easy zero-downtime moves from one data-center to another (move half the servers; and then the second half)
    • Clustering protects against more types of user errors than internally redundant servers (oops, I turned off the wrong machine)
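    The "update half the cluster, and then the second half" step can be sketched roughly as below. The node names and version strings are placeholders, and a real rollout would also drain connections and verify each batch before moving on; this only shows that some node is always left in service.

```python
from typing import Dict, List


def rolling_upgrade(versions: Dict[str, str], new_version: str) -> List[List[str]]:
    """Upgrade nodes in two batches; return which nodes served during each."""
    nodes = sorted(versions)
    half = len(nodes) // 2
    batches = [nodes[:half], nodes[half:]]
    serving_log = []
    for batch in batches:
        # Nodes outside the current batch keep taking traffic.
        serving = [node for node in nodes if node not in batch]
        if not serving:
            raise RuntimeError("upgrade would leave no node in service")
        serving_log.append(serving)
        for node in batch:
            versions[node] = new_version
    return serving_log


cluster = {"n1": "1.0", "n2": "1.0", "n3": "1.0", "n4": "1.0"}
served_by = rolling_upgrade(cluster, "1.1")
```

    With a single internally redundant server the same function raises, because there is nothing left to serve traffic while the one box is being updated.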
  81. Re:SneakerNet * by sfcat · · Score: 1

    What about AFS, which stands for Andrew File System? It was developed at CMU and allows dynamic backup of data (it automagically copies your data to different physical volumes). I've never even heard about data being lost on an AFS system and it supports very high security too. Then just build your code on top of the UNIX commands or AFS file API. But then again, it might be a bit much for your requirements. I don't know of a Windows client version but one might exist. And the wayback part you might have to write yourself but it might be supported as well. Check it out if you are interested.

    --
    "Those that start by burning books, will end by burning men."
  82. How much load? by TheSHAD0W · · Score: 1

    How much load is your site going to need to handle? If it's high, clustering is a darn good idea, because the separate machines will share the load on top of giving you redundancy. If the load expected is low, a single fault-tolerant machine will be easier to maintain.

    This especially goes for multiple services, and you may want to mix-and-match. For a CGI+SQL combo, you may prefer to split the web load over a cluster, but you may want to forego the complexity of a clustered database and put your SQL server on a single redundant box.

  83. O.G. ...whiz! by comzen · · Score: 1

    "...I heard this ad that said it runs faster, costs less and never breaks!"

    What a bright idea, ...such a smart bunch of chaps 'eh, ...why didn't we think of that a long time ago?

    ..but wait, ...we did!

    --
    Crunch!
  84. Re:SneakerNet * by dada21 · · Score: 1

    iFolder is so-1990's to me, heh. Freenet seems doomed!

    The war is on:

    A. huge megaservers online serving thin/dumb terminals over high speed network connections (renting processors and storage and even apps all on demand with backups)

    B. P2P with cheap clients and cheap shared in-client storage

    I don't know which way is better. High bandwidth will get cheaper and more available every day.

    For now, I'm betting on DumbClient/MonsterServer being the cheapest both initially and in the long run when 10Mb connections to the Net are the norm.

    Yet internal P2P seems more secure and more fault tolerant.

    Database Object just means a hive IS a database. One object could be "MarsVoltaSong.mp3" or "John Jones Contact Record"

    Your contact manager would access the hive to retrieve your contacts. Super-secure databases could have public description keys with encrypted actual data.

  85. Imagine... by Jupix · · Score: 1

    Imagine a Beowulf cluster of server clusters. Oh wait...

  86. Look at it from a physical perspective by Anonymous Coward · · Score: 0

    Let's say you have mission critical application SuperUpTime(tm) to run 24/7 and above anything else; the box can never go down. Sure, you can go the fault tolerant route, stuff it on a FT box and hope for the best. But a comet has hit directly into the building, flattening everything including your FT box. Or the airco has leaked and shorted out the FT box. Oops, bye-bye SLA.

    This is one of the benefits of clustering, you can have nodes at different sites. As long as both sites are far away enough from each other to survive natural or accidental disasters; SuperUpTime(tm) will keep running.

    Suppose SuperUpTime(tm) is really popular, the load far exceeding what was anticipated.. The box has been filled with the maximum amount of cpu and memory supported, but it's still suffering under the load. With a cluster, you're not constrained by the physical design limitations of the FT server, since you can add a beefier node to the cluster at any time.

    If it was up to me, I'd go the cluster way; but that depends on what's needed. If you have a good, competent sysadmin that knows what he's doing, he can admin several clusters no problem. But if you'd rather sink money into a FT system, not trusting humans.. then by all means, go the FT way. However I have yet to see a FT system that is truly redundant.. If there is a bad CPU for example, can they guarantee there is no way it can bring down the entire box? By constantly initiating panics, etc.. Then you have to intervene manually etc. If a cluster node goes down, a failover should happen automagically..

  87. Re:SneakerNet * by Tiroth · · Score: 1

    There is a commercial implementation of Andrew called DFS (distributed file system), sold by IBM. It is mostly used by banks and universities AFAIK due to the mentioned strong integrity and security features.

    It IS possible to chuff things up, mainly by making administrative errors.

  88. Re:SneakerNet * by raddan · · Score: 1
    You're forgetting about:


    * backups
    * authentication/permissions
    * simultaneous use of the same file
    etc...

    These are problems that have already been addressed in most corporate LANs. Fault tolerance is an issue, yes, but if I had to trade the few items above for the extra tolerance that a P2P network gives me, I'd stay with the regular ol' client-server model.

    I'm not saying that P2P isn't a potential solution for the future, but for this application, it's not ready yet. In my experience, the problem isn't that desperate.

  89. Re:SneakerNet * by silas_moeckel · · Score: 1

    Been there done that, AFS http://www.faqs.org/faqs/afs-faq/ works wonders. Pretty much it's a nice fault tolerant file sharing system that supports disconnected ops, meaning you can work with everything in disk cache and checkout / checkin things as needed.

    --
    No sir I dont like it.
  90. Absolutely right by TTK+Ciar · · Score: 3, Informative

    Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked.

    This is dead-on correct. For example, if a CGI hits a problematic state where it eats a lot of memory, putting the server into a state where it's swapping, then it takes longer to service each http transaction, which means more httpd transactions queue up, which means more memory gets allocated, which means more swapping .. rendering the machine useless for a little while (until a sysadmin or a bot notices the state and either restarts the httpd or kills a few select processes). If we were running this on one mammoth server with lots of redundant bits, then 100% of our web service capacity would be down in the interim. But since we run a pool of ten http servers under keepalived/IPVS, we only lose 10% of our capacity during that time.
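
    The capacity arithmetic in that example can be written out as a small sketch. It assumes identical servers that share the load evenly, which is a simplification; the function name is made up for illustration and is not part of keepalived or IPVS.

```python
def remaining_capacity(total_servers: int, failed_servers: int) -> float:
    """Fraction of serving capacity left, assuming identical servers
    that share the load evenly (an assumption, not a measurement)."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= failed_servers <= total_servers:
        raise ValueError("failed_servers must be between 0 and total_servers")
    return (total_servers - failed_servers) / total_servers


if __name__ == "__main__":
    # Pool of ten web servers with one stuck swapping: 10% of capacity lost.
    print(remaining_capacity(10, 1))  # 0.9
    # One mammoth server in the same state: everything is down.
    print(remaining_capacity(1, 1))   # 0.0
```

    Real pools rarely degrade this cleanly, since the surviving nodes pick up the extra traffic, so treat the number as a first estimate.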

    Other reasons I've traditionally preferred clustering: easy to incrementally scale up infrastructure (no big buy-in in the beginning to get the server which can be expanded), fully parallel resources (an independent memory bus, an independent IO bus, two independent CPUs, an independent network card, and a few independent disks for each server, as opposed to a mammoth shared bus on a leviathan crossbar, which will inevitably run into contention), and more flexibility in how resources are divided amongst mutually exclusive tasks.

    One of those reasons is getting less relevant -- point-to-point bus technologies like HyperTransport (formerly Lightning Data Transport) and PCI-Express are inexpensively replacing the "one big shared bus" with a lot of independent buses, transforming the server into a little cluster-in-a-box. It is a positive change IMO, and shifts the optimal setup away from the huge cluster of relatively small machines, and towards a more moderately-sized cluster of more medium-sized cluster-in-a-box machines.

    The price of licenses is, IME, rarely an issue (in my admittedly limited career -- I don't doubt that it's relevant to many companies) because the places I've worked for have tended to use primarily free-as-in-beer (and often free-as-in-speech) open source solutions. What is more of an issue, IME, is the necessity of staffing yourself with cluster-savvy sysadmins and software engineers. Those of that ilk tend to be a bit rare and expensive, and difficult to keep track of. It takes a distributed systems professional to look at a distributed system and understand what is being seen, and this makes it easy to bend the spec or juggle the schedule on the sly, or run skunkworks projects outright. By contrast, the insanely redundant, mondo-expensive uberserver was created and programmed by very smart hardware and software specialists so that your IT staff doesn't need to be so specialized. This makes useful talent easier to acquire, and understanding the system closer to the reach of mere mortals.

    Just my two cents
    -- TTK

    1. Re:Absolutely right by Lucractius · · Score: 1

      "Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked."

      Couldn't be more true. But while "If we were running this on one mammoth server with lots of redundant bits, then 100% of our web service capacity would be down in the interim. But since we run a pool of ten http servers under keepalived/IPVS, we only lose 10% of our capacity during that time." is one method of solving the problem, perhaps you chose the wrong software? Or are people so entrenched in the idea that MS, Linux and Mac OS X (and BSD and Solaris for the especially informed) are the only available OSes that they forget that big, heavy-ass, multiply redundant machines are usually capable of running other extremely reliable OSes, with licences that can cost more per CPU in your server than your average home PC?

      OpenVMS on AlphaServers is the primary example of a redundant "single machine" HA system, while also coincidentally being a fantastic clustering solution (equal to or better than what's available at present), with redundant process and data mirroring, disaster-tolerant geographically spread clustering, and the entire system built around the principle that if your machine is down for a minute you didn't schedule as downtime, then you're losing a LOT of money.

      VMS practically "invented" clustering in computing when it first used it, so it's got a lot of past proof as a reliable system.

      The world uptime record is still held by a British Rail VAX running VMS, something close to 15 years without downtime. If that's not HA, then what is?

      --
      XML - A clever joke would be here if /. didn't mangle tag brackets.
  91. For me it's simplicity versus Complexity by brunos · · Score: 1

    Clustering is great if it's simple, such as web servers. However, removing single points of failure is complex in terms of software, hardware and network traffic. The solution as a whole can fail, say, because of stupid clustering software, e.g. a Microsoft/HP cluster setting the same MAC address to the entire cluster, or because someone forgot to put a UPS on the air conditioning. I am all for one big, powerful, but simple computer. They are expensive, but at least they don't run Windows.

  92. Re:SneakerNet * by dada21 · · Score: 1

    I've played with it. It seems more of a backup bandaid than a realtime data hive like I'm thinking.

    I may try to torrent a corporate network if I can find a good file "explorer" or file access subsystem that integrates into Windows.

  93. Cluster Fault Tolerant Humans! by schmedley · · Score: 1

    Human error has for years been the ghost in my metrics, not hardware or software failures. Sure, hardware and software goes bad, but the really big hits I've experienced in the past 15 years were because:

    * An employee's four year old pushed the big red data center button that spelled his mother's ultimate career doom.
    * Our CEO refused to release the capital needed to replace the UPS batteries. Boom! Sure, grid power was available, at least until the fire captain ordered the power cut.
    * A new sysadmin hit the rack power switch instead of the server.
    * An application administrator (SAP) with temporary root access learned that with power comes responsibility and that responsibility requires competence!
    * For every action under the raised tile there is a cubed reaction. That POTS cable was connected to the OC3 fiber.

    And that's just five!

    1. Re:Cluster Fault Tolerant Humans! by tweek · · Score: 1

      On line item one:

      We actually just went through this same thing but it was an actual employee of the datacenter and not some kid.

      It seems that a water sensor was malfunctioning and reporting water under the floor. The onsite tech didn't follow procedure and instead of checking first, he hit the big red button.

      Boom, datacenter down for 4 hours while they figured out what was going on. And mind you this was an AT&T Global hosting center and POP here in Atlanta.

      Now the next 3 days has the following going on at that datacenter where we're a customer:

      1 - realize that a change control process wasn't in place well enough to document changes that had been made to various switches and routers. Somebody forgot a "write mem" here and it wasn't documented.

      2 - realize that some equipment can occasionally fail after having not been power cycled for a year or two.

      3 - some servers after having uptimes of a year decide that a filesystem check is in order and, oh look, I just lost two drives because no one bothered to check that small server in the back somewhere that was handling DNS as a quick solution.

      4 - realize that some things have to be brought up in a specific order and that the last round of new equipment wasn't added to the checklist.

      I'm sure number 3 didn't happen, but I've seen it in other places. I can say for a fact that the extended outage had some traces of 1 and 2 in there.

      --
      "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  94. Real fault tolerance by couch_warrior · · Score: 1

    The only "truly" fault tolerant machines I have seen, other than custom-built military hardware are the "Stratus" (www.stratus.com) servers - everything is doubly redundant. You can walk up to a server and pull CPU cards and memory out at random, and the machine doesn't even hiccup. You can hot-plug replacements back in, and the machine doesn't miss so much as a single clock cycle.

    Any other form of high availability amounts to pre-emptive rebooting. You keep a warm spare machine next to the original to pick up the workload in case of failure. It's just like rebooting, but doing it in advance on a spare machine just to be ready.

    For *most* applications the warm spare approach works reasonably well. But in real-time control applications, the few seconds spent loading up context to restart from a checkpoint can kill people.

    Would you trust a windows cluster to drive your car on the highway? Or fly the jet you are riding in? Or run the anti-missile phalanx gun on a missile frigate?

    Real fault tolerance is like a parachute. Most of the time you have no use for it. But when you need it, nothing else will do.

    --
    "Sic Semper Path of Least Resistance"
  95. Fault Tolerance vs Clustering by NeonRonin · · Score: 1

    With fault tolerance, you still have a single point of failure in the chassis, one that can be mitigated but not eliminated. Clustering gives you two separate units that can be on opposite sides of the datacenter, or even across town with the Metro Ethernet infrastructures available today. As costly as it may seem, if you want to save your arse, you are better off clustering than having to explain why your company lost $2 million for the downtime when it could have been mitigated with a 7k

    --
    -- NeonRonin
  96. Re:SneakerNet * by dada21 · · Score: 1

    Backups are integrated in the hive. I think a backup node in the hive could stream backups constantly.

    Authentications/permissions can be realized by using a registry-like Address/Key/Source structure. The address of a chunk in the hive designates what data it is, the key can be 0 for public or an encryption key known to client apps permitted to access the chunk. Source is the data (encrypted or otherwise).

    Since the client node is responsible for reassimilating chunks it hived out, the encryption is twofold: cracking the key only gets you a bite. You need to know what other chunks connect to yours to eat a full plate.
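
    A rough sketch of what such an Address/Key/Source record could look like, purely to make the idea concrete. Everything here is an assumption: the names are invented, the key field is treated as a key identifier rather than the key itself, and the XOR step is only a placeholder standing in for real encryption (it is NOT secure).

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict


@dataclass(frozen=True)
class Chunk:
    address: str   # designates what data the chunk holds
    key_id: int    # 0 means public; any other value names a key (assumption)
    source: bytes  # the payload, obscured unless key_id == 0


def _xor(data: bytes, key: bytes) -> bytes:
    # Placeholder cipher for illustration only; not real encryption.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def make_chunk(address: str, data: bytes, key_id: int = 0, key: bytes = b"") -> Chunk:
    """Build a public chunk (key_id 0) or an obscured one."""
    if key_id == 0:
        return Chunk(address, 0, data)
    if not key:
        raise ValueError("a non-public chunk needs a non-empty key")
    return Chunk(address, key_id, _xor(data, key))


def read_chunk(chunk: Chunk, keyring: Dict[int, bytes]) -> bytes:
    """Return the payload if the chunk is public or the client holds its key."""
    if chunk.key_id == 0:
        return chunk.source
    key = keyring.get(chunk.key_id)
    if key is None:
        raise PermissionError("client does not hold the key for this chunk")
    return _xor(chunk.source, key)


if __name__ == "__main__":
    secret = make_chunk("hr/payroll/0001", b"salary data", key_id=7, key=b"k3y")
    print(read_chunk(secret, {7: b"k3y"}))
```

    A client without key 7 gets a PermissionError, and even with the key it only recovers this one chunk, which is the "bite, not the full plate" point.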

    The problem to me IS desperate. In all of my businesses my main goal is "how do I make myself obsolete?" It sounds counterintuitive yet it lets my customers see that I don't want to do the same work ad infinitum. I want to return a nice profit to my customers for the money they pay me. $150/hour should save my customers $225/hour in added productivity. Down time is a huge net loss.

    Proper data utilization is money made and time saved. Programmers facilitate that. Accessing the data to be utilized is my job. I want it transparent with 100% uptime. Client dies? Pop in a replacement.

  97. Re:SneakerNet * by Morgalyn · · Score: 1

    Oh. Be patient, the solution is coming within the next year or so (we are currently alpha). That is all I can say at the moment. Any more features you want to dream up?

    --
    You say you got a real solution
    Well, you know
    We'd all love to see the plan
    (The Beatles)
  98. Different purposes by Anonymous Coward · · Score: 0

    I used to build HACMP clusters also, although I haven't in 5 or so years. One of the key points in the HACMP class was that clustering and fault-tolerance are for different purposes. Fault tolerance is required for systems that ABSOLUTELY MUST NOT fail, e.g. aircraft avionics. High availability clusters are designed to provide high uptime with a known (small) acceptable amount of downtime. Fault tolerant systems generally cost at least one order of magnitude more than highly-available systems.

  99. Oxymoron: Fault Tolerant Computers by mschuyler · · Score: 1

    They simply are not. There are too many things to go wrong even if you double up every single component with the idea that if one side goes it will run on the other while you unplug the bad part and plug in the good part. It's been my experience that as likely as not they will both go at once anyway. I've had this happen with dual fault tolerant RAID systems more than once.

    The neat thing about clustering, in my opinion, is that you get built-in redundancy AND you get the ability to take care of an increased load very neatly. Too many processes? Just plug in another CPU to the cluster and load balance it out. Did one break? OK, so you're under a load for a while until you can plug in a good one, but at least the app doesn't go down. And you DO have an extra ready to go just in case, don't you?

    I built a cluster supporting a couple hundred thin clients, which tend to proliferate ("Hey! Gimme ten more in this room!") and once the bugs were worked out (Had to load balance the brain cells to get things working right) it works slick. If I had to design and spec a system from scratch knowing what I know now, I'd do it again in a heartbeat.

    So I say: Backup! Backup! Backup! and Cluster! Cluster! Cluster!

    --
    How about a moderation of -1 pedantic.
  100. Just imagine... by Articuno · · Score: 1

    ... a Fault-Tolerant Server of these!

    --
    So Long and Thanks for All the Fish!
  101. Re:SneakerNet * by killjoe · · Score: 1

    You know what you are describing looks an awful lot like a distributed version control system with bittorrent as a transport. Of course with bittorrent you need a server to act as a tracker though.

    Why couldn't something like SVK work for this?

    --
    evil is as evil does
  102. One word. by Wdomburg · · Score: 2, Insightful

    Both.

  103. Where did you get the idea SMP is cheaper? by Glasswire · · Score: 1

    Big (fault tolerant or not) single-image Symmetric Multi Processing systems are certainly NOT cheaper, on a $/CPU basis, than a cluster with the same number of CPUs. Vendors not only make lower margins on clustered systems, the absolute cost is lower too. Clustering with small-way (typically 2-way) commodity systems will always be cheaper than big SMP, whether you get real (or in the case of some clusters, effective virtual) fault tolerance or not. The sensible rule of thumb today is that if your application lends itself to being parallelized easily, you cluster; if it absolutely requires a single MP image of x CPUs, you have to do SMP. And most things in the real world fall in between. So, if you CAN cluster effectively, you SHOULD cluster.
    Next question.

  104. Which Is Better -- T-Rex or Velociraptor by tspauld98 · · Score: 1

    Answer: Neither. They are both dinosaurs.

    My point is that traditional high-availability solutions are not getting it done any more. None of the customers that I work with are thrilled to spend money on any solution where all the hardware is in one location exposing the application to being wiped out by a hurricane or a terrorist attack.

    Of these two solutions, clustering does provide some flexibility to implement in different geographic areas but most clustering products fall short of features that support enhancing application availability across geographically-dispersed sites. This is a huge feature hole that is only answered with custom integration currently. In fact, it's paying my bills quite nicely these days, but I do wish there was more support for this kind of solution.

    "Never under-estimate the bandwidth of a FedEx truck" -- Availability Consultant in response to query about the "best way" to get data from one data center to another.

    --
    "Ahhhh, best laid plans of mice and men... and Cookie Monster." -- Cookie Monster, Sesame Street
  105. Re:SneakerNet * by swb · · Score: 1

    A virtual file server would be an extremely cool thing. I could have sworn somebody had something that would allow you to take N bytes of disk from a workgroup of workstations and coalesce it into a virtual disk sharable with the workgroup.

    Just extend the idea to include replication of the virtual storage. You could either allow splitting of the data as above, or require each replicated segment to be 100% complete (ie, any one workstation has the whole server's storage), or some mixture of all of the above.

    I think the downside comes from keeping it all in sync. Probably viable on a high speed LAN, but will fall apart exponentially as you get onto WAN networks.

  106. Re:SneakerNet * by Alt_Cognito · · Score: 0

    If sneakernet were down, would that imply you were injured or had passed on to the great bitbucket in the sky?

  107. Re:SneakerNet * by dada21 · · Score: 1

    Too many ideas :)

    Here are some:

    * Bandwidth designation for each node
    * AI style workgroup-relations caching -- users embed information into chunks offering other hiveminds a chance to stock up on common data for faster response
    * Address/Key First sorting for confirming files accessed are the most recent
    * Data-In-Use flags (flags should be hive updated every so often to ensure that data is still in use)
    * Momentum Updates (Organized hive updates through friendship pairs or quads keeping data chunks closer to home)
    * Administration Worldmap (show chunk usage, lifespan, friendship groups, connection uptime)

    IMHO, torrent is too anonymous. I'd rather see chunks offered within the friendship pair/quad only. Easier to notify hiveminds of updates or in-use flagging.

  108. And then ... by ScrewMaster · · Score: 1

    there's Google.

    --
    The higher the technology, the sharper that two-edged sword.
  109. My own experiences with clustering by Anonymous Coward · · Score: 0

    I have had a fair amount of experience with Linux clustered filesystems (primarily with block based ones, not file based). The performance, reliability, and feature set of each of the ones I have looked at vary greatly depending on your application and hardware. Some samples of clustered filesystems available for Linux:

    Polyserve
      - closed source / proprietary
      - supports Oracle RAC
      - distributed locking manager
      - supports up to 16 servers
      - has load balanced NFS
      - supports SuSE and Redhat
      - requires a SAN
      - block based I/O
      - limited hardware support
    Redhat GFS (formerly Sistina GFS)
      - open source
      - supports Oracle RAC and is also supported by Oracle
      - distributed locking manager in 6.1, single metadata lock manager in 6.0
      - can work with or without a SAN
      - SAN requires pool driver in 6.0, or open source LVM 2 in 6.1
      - supports up to 256 servers?
      - works on other distributions, but good luck getting it to work
      - is available in Fedora Core for those not worried about support
      - block based I/O
      - extensive hardware support
    IBM GPFS on Linux
      - closed source / proprietary
      - supports Redhat and SuSE
      - lots of servers, don't remember if there is a theoretical limit or not
      - quasi-distributed locking manager, RTM for more details
      - block based I/O
      - supports IBM hardware + some smattering of major competitors
      - requires a SAN (preferably one of IBM's)
    Lustre on Linux
      - open source
      - distribution agnostic
      - lots of servers
      - lock manager is not very good
      - does not require a SAN
    Panasas
      - closed source
      - file based I/O
    PeerFS
      - closed source

    Honestly clustered filesystems can be more trouble than they are worth. With the exception of Redhat GFS and a smattering of others that aren't as advanced, there are few open source options for implementing a 'true' clustered filesystem. Most of the setups I have seen end up having so many moving parts that they are often very fragile. Having said that if you want to get started try GFS. It is a decent introduction to the clustered filesystem world:

    http://download.fedora.redhat.com/pub/fedora/linux/core/4/i386/os/Fedora/RPMS/GFS-6.1-0.pre22.6.i386.rpm

  110. Real world example and cost by MarkEst1973 · · Score: 2, Interesting
    A gov't contractor I worked for was getting a contract to consolidate multiple servers and apps into a single pair of servers (web and db) for a small gov't agency.

    The agency bought a pair of dual proc Dells with lots of RAM and a full software stack (Windows Server, SQLServer, and ColdFusion Server). Total cost: ~$57,000.

    That's right, nearly 60k.

    Now, I've read that Google buys their white boxes at $1k each for their server farm. And I couldn't help but think what they (or I) would do with 57 boxes instead of 2.

    But hey, my opinion doesn't matter. I'm not a PHB in a gov't agency. But sure as hell, if I were a business in a competitive environment (and a gov't agency is not), I'd be looking to implement the simple and effective white box solution on the cheap. But that's just me.

    1. Re:Real world example and cost by mOdQuArK! · · Score: 1
      But sure as hell, if I were a business in a competitive environment (and a gov't agency is not), I'd be looking to implement the simple and effective white box solution on the cheap.

      Of course, if you were the winning bidder on a government contract, you might implement the simple & effective white box solution, but CHARGE the government $60k anyway...

    2. Re:Real world example and cost by rhizome · · Score: 1

      Now, I've read that Google buys their white boxes at $1k each for their server farm.

      Yes, but they also write a lot of their own software.

      --
      When I was a kid, we only had one Darth.
    3. Re:Real world example and cost by Anonymous Coward · · Score: 0

      $60k is peanuts. We probably buy that much just in ethernet cards in a given week.

      We pay more so that we get first rate support and guaranteed functionality.

      The point is, depending on the use, the $60k might either represent the entire budget for a project (probably a waste in that case) or it might be 1/1000th of the budget. Sometimes it is actually cheaper to pay a bit more money on the front side in exchange for drastically reduced costs on the back side.

      Lots of people don't seem to understand that point. Sometimes you have to spend money to save money.

      Now, I would never buy a Dell for a datacenter, just like I'd never use Windows or SQLServer, but that is my preference.

  111. Re:SneakerNet * by Anonymous Coward · · Score: 0

    I think the downside comes from keeping it all in sync. Probably viable on a high speed LAN, but will fall apart exponentially as you get onto WAN networks.

    So you want something like a NAS iSCSI using LVM on RAID5 of NBDs? There are *certainly* going to be delays. It is easier and cheaper just to get two or three NAS boxes and setup Linux HAC.

  112. Re:SneakerNet * by dada21 · · Score: 1

    There was. I think it was called Orange or something fruity. I played with it and it was terrible.

    The solution is out there in the ether. Even over WANs it is viable as nodes can search for chunk updates by merely requesting Address/Key instead of Address/Key/Data. Of course topology outage would be murder, but that's true with client-server, too.

    I have a customer with 1TB available in the server, 700GB in use. Their 150 workstations have 30TB free. Their server frequently gets bogged down and needs constant hardware improvements. A Virtual File Server (hive) would increase uptime, decrease costs and increase performance.

  113. How 'bout NEITHER? by Medievalist · · Score: 1

    You could also build a system where critical data is distributed to where it needs to be used, then updated dynamically.

    Examples of such systems would be BIND for DNS or OpenLDAP for LDAP. You can have hordes of cheap servers running BIND or OpenLDAP that build from a kickstart or mondo CD, and get really amazing reliability overall. And no backup tape system required - the systems themselves can be your backup store.

    It's not a solution for every problem, but neither are fault-tolerance or clustering. You need to have more than one tool in your toolkit.

  114. Re:SneakerNet * by multipartmixed · · Score: 1

    Yes. It should analyze the porn in users' home dirs, then seek out new, but similar porn. Sort of like tivo.net, only for Linux geeks.

    --

    Do daemons dream of electric sleep()?
  115. Re:SneakerNet * by Morgalyn · · Score: 1

    I think there will be room for things like that and more, whenever it actually gets rolled out. Exciting, huh?

    --
    You say you got a real solution
    Well, you know
    We'd all love to see the plan
    (The Beatles)
  116. Voice Calls and Fault Tolerance by cheezus_es_lard · · Score: 1

    Voice calls that transit a Lucent 5ESS switch generally get touched by many UNIX-based systems, and there are at least 2 of every server, which process the result independently and compare answers. That's the environment I work in.

    I think, personally, that a mixture between clustering and fault tolerance is best. Sell systems with multiple redundant processors and high levels of fault tolerance, and cluster them together 3 or 5 ways, and you'd have a highly bulletproof setup.

  117. For Linux ( as well as Solaris and *BSD) by SirGeek · · Score: 1

    There is Linux HA. This is High Availability clustering software (via a heartbeat). Combine it with DRBD (Distributed Replicated Block Device) and you have a very robust cluster.

    This uses an Active/Standby setup with a heartbeat between the systems. If the Active is no longer responding, within X seconds (10 by default) the Standby takes over all the processes that were running on the other system. And, if needed, it STONITHs (Shoot The Other Node In The Head) the other server to ensure that it really IS dead.
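
    For anyone who hasn't seen the pattern, here is a toy sketch of the standby's decision logic. It is not the Linux-HA code or its API; the class and method names are invented, the 10 second timeout is just the figure quoted above, and fencing is reduced to setting a flag.

```python
class StandbyMonitor:
    """Toy model of a standby node watching the active node's heartbeat."""

    def __init__(self, deadtime_seconds: float = 10.0):
        self.deadtime_seconds = deadtime_seconds
        self.last_heartbeat = None  # timestamp of the last heartbeat seen
        self.active = False         # True once this node has taken over
        self.fenced = False         # True once the peer has been "shot"

    def record_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat arrives from the active node."""
        self.last_heartbeat = now

    def fence_peer(self) -> None:
        # Stand-in for STONITH; a real setup would power-cycle the peer.
        self.fenced = True

    def check(self, now: float) -> bool:
        """Return True if this node is (or has just become) the active one."""
        if self.active:
            return True
        if self.last_heartbeat is None:
            return False  # never heard from the peer; nothing to time out yet
        if now - self.last_heartbeat > self.deadtime_seconds:
            self.fence_peer()   # make sure the old active really is dead
            self.active = True  # then take over its services
        return self.active


if __name__ == "__main__":
    monitor = StandbyMonitor()
    monitor.record_heartbeat(now=0.0)
    print(monitor.check(now=5.0))   # heartbeat is recent: stay standby
    print(monitor.check(now=10.5))  # silence exceeded the timeout: take over
```

    Timestamps are passed in rather than read from the clock so the logic can be exercised without waiting; fencing happens before takeover so two nodes never both believe they are active.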

    We've been running webservers and Oracle database servers here with 0 downtime using heartbeat and drbd.

  118. pov and cost by micromuncher · · Score: 1

    Though the article claims clustering is about selling hardware, it goes on to suggest fault tolerant systems by various manufacturers...

    And clustering has some advantages over fault tolerant hardware when it comes to site [in]security.

    Say for example you want to architect the new Iraqi stock exchange. Do you put all the hardware in the same place and go crazy on physical security and housing? Or do you distribute the hardware with redundancy over multiple physical sites?

    You are probably cheap, because you KNOW the hardware is going to fail anyway, so why spend $30k plus on the latest SunFire-UltraSPARC or NEC Express5800FT when you can get a swarm of cheap Intel servers for the same price?

    --
    /\/\icro/\/\uncher
  119. Marathon Technologies by SmurfButcher+Bob · · Score: 1

    Proponents of clustering neglect one thing - it mostly works, but requires a painful coding practice to prevent any loss of state when a failure happens. For the bulk of productions out there, this state cannot be transferred from box to box - find me a solution that'll real-time "cluster" a file-region lock, for example, of... who cares, a 5 meg autocad file. It's not likely to happen... users will get collisions, and the file will get chewed. Make it easy - cluster your favorite spreadsheet file, such that 50 people can edit it at once without clobbering each other. It's not going to happen - and hopefully you see my point about "state". Clustering is best used when the server-side is stateless... which is useless in most productions. File locking, for example, is a server-side state.

    Years ago, a company named "Marathon Technologies" went after the fault-proof market, and succeeded quite well. They cut the problem into points of failure, and duplicated each of them.

    The first POF was the context. They addressed this by having two machines handle the software state - literally, two PCs loaded with RAM, CPU, and a custom FPGA controller. No I/O, no keyboard, no mouse. The FPGA would keep the two contexts in near-lock-step with each other, effectively making a Raid-1 software state.

    The second POF was the hardware. They addressed this by... you guessed it, two boxes, again with a RAID-1 type of resource mirror. The boxes each needed the same config - right down to the MAC address of the NICs you threw in them. Resources were then virtualized and redirected into the software context - whatever the context does to the hardware, it does it to both, simultaneously. (The only exception being the NICs - one would be "hot", the other a warm spare). If any combination of resources went away, it didn't matter - so long as you had one of *something*, *somewhere*, the software context would not notice. We took a lightning hit in 2000 - and one of everything died. No problem, though - it used the drive array from this box, the NIC from that box, the CD-ROM from this box, the keyboard from that box, the mouse from this box... and unless I'd known about the failure, I'd never have noticed unless I was standing at the physical racks.

    "Failover" was instant, as there was no "failover". If something died, it didn't matter - you were already using the other one. The only "failover" that would take place is if a NIC died - and the time for the "warm" NIC to fire up was under 10ms.

    It had extra bonus points because you could separate the components by almost half a mile... and required ZERO rewrites of any software to use it. If the s/w would run on WinNT, then it'd run on this.

    Of particular fun was using the system to manage (trial) patches. Literally split the brain - isolate each half of the context & hardware from the other, so that each would think the other had died. You'd leave one half such that the users could continue to use it uninterrupted, while you try the patch on the disconnected one. Once done testing the patch, no big deal - just rejoin it, and it'll be brainwiped by the production one during the sync.

    It was also useful for actual application of patches. Come time to apply, you'd freeze production, split the brain and shut one half of it off. Then commit your updates and gain confidence. If the updates succeed, just fire up the half that's off and it'll be overwritten with the updated version. If the updates fail - kill the failed half, and fire up the half you shut down. Rollback could be achieved in about 25 seconds.

    Clustering cannot compare to this as far as availability is concerned... with zero downtime, zero loss of state, it's open and shut. It didn't scale well, sadly, and their newer versions don't thrill me - but the E4000 product was killer, hands down.

    --

    help me i've cloned myself and can't remember which one I am

    1. Re:Marathon Technologies by ArkiMage · · Score: 1

      Agree... I sat through a demo of their FTVirtual product recently and was fairly impressed. Some of the things I pointed out to them that I wasn't crazy about were that it's Windows only and that it requires Windows to be installed on both "host" servers. I can understand it being on the "virtual" server but would be interested in a solution similar to VMWare ESX Server where there is no host OS, other than the product itself. That way the underlying OS wouldn't be Windows and shouldn't get spyware/viruses. Their answer was to run virus protection on both host servers _and_ the virtual server. A different approach I realize, and what they have does work... No support for 64-bit yet and that will hurt them, x86_64 in the server space is already almost the norm. Other than that though, a good product that did appear to work. I saw various hardware failures get artificially introduced with not so much as a hiccup from the client workstations accessing it(them). Good stuff... For the Wintel only crowd.

    2. Re:Marathon Technologies by SmurfButcher+Bob · · Score: 1

      >> I saw various hardware failures get artificially introduced with not so much as a hiccup from the client workstations

      Yeah, but you missed the good version. The E4000... you'd PCAW to the desktop of the "server", introduce hardware failures (such as ripping the RAM out of one of the CEs, or pulling the active NIC, or whatever) and see not so much as a hiccup from the desktop of that server, itself. :)

      Too bad they hosed the performance with the FT version, though... the FT product was supposedly going to allow a 100km split between the tuples. Dunno if they got it working. Ugh.

      --

      help me i've cloned myself and can't remember which one I am

  120. In the land of redundancy... by Anonymous Coward · · Score: 0

    ...uptime is king. And redundant servers with hot-swap components and long life guarantees are the throne uptime sits highest upon. Clustering can be a good inexpensive solution, but it inherently brings in some downtime, which means it cannot be and is not the final solution in many applications. Redundant components give you the always-on functionality you may need. Clustered redundant machines are even better.

  121. Reliability of the switch-over by cybersk4nk · · Score: 1

    Don't forget that in a parallel redundant system, if the fail-over switch or mechanism is any less reliable than the individual components themselves, you might as well not use a fail-over system! Of course this is all theoretical, but any failover system IMHO that uses a software mechanism *built-in* to the devices that can fail is just plain stupid. It's akin to the 'software firewall' vs hardware firewall debate -- hardware firewalls are better because they isolate the hacker from your computer and increase security! If you truly want to build a foolproof redundant failover, it should be a separate hardware box, like a network switch that senses a failure and brings the other system online. Just from browsing this post and from casual knowledge, it seems there are very few systems like this for computers, or they use a software method for failover. Does anybody know of any network hardware devices that do just this? Are they efficient? Are the switches more reliable (have greater uptime) than the computer servers behind them?
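
    To put rough numbers on the point about the switch-over mechanism, here is a minimal sketch in Python. It assumes independent failures and treats the fail-over switch as a single component in series with a parallel pair of servers. The availability figures and the function names are made up for illustration; they are not measurements of any product.

```python
def parallel(*availabilities):
    """Up if at least one component is up (independent failures assumed)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities):
    """Up only if every component is up (independent failures assumed)."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


# Hypothetical availabilities, chosen only to illustrate the effect.
SERVER = 0.99
GOOD_SWITCH = 0.9999   # more reliable than a single server
BAD_SWITCH = 0.98      # less reliable than a single server

pair = parallel(SERVER, SERVER)
with_good_switch = series(GOOD_SWITCH, pair)
with_bad_switch = series(BAD_SWITCH, pair)

print(f"single server:             {SERVER:.6f}")
print(f"pair, perfect switch-over: {pair:.6f}")
print(f"pair behind good switch:   {with_good_switch:.6f}")
print(f"pair behind bad switch:    {with_bad_switch:.6f}")
```

    With these made-up figures, the pair behind the weaker switch ends up less available than one bare server, because the switch caps the availability of everything behind it.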

  122. My Hard-Earned Experience by sabat · · Score: 1

    It really all depends.

    How stable is your application? When I first walked into one particular job, the outgoing tech guy was crowing about how super-redundant this ONE box was that he was running the ONE webserver on.

    That was all fine and good -- dual power supplies, multiple CPUs, yadda yadda -- but the app was not stable and could not handle the traffic it was getting. It crashed a lot, and when it did, there was no more business until someone bounced it.

    Redundancy was good in that case.

    You're running a database? That's a challenge to run in a clustered setup. It can be done and done right, but you need experts. If you're Amazon, you need that -- clustered geographically as well as locally. You're a little startup? Cluster your website and your app servers and just make your db internally redundant. And for chrissakes, don't run MS products. Stick with things that are easy to keep stable.

    --
    I, for one, welcome our new Antichrist overlord.
  123. Clustering as load balance by phorm · · Score: 1

    Really, what I tend to see clustering for more often is load-balancing. If you have a streaming video server that, say, gets slashdotted (and assuming it's not your bandwidth that is the bottleneck), then you could dump some of the load to a secondary machine. Of course, the usage that you describe - dumping to one machine in the event of a primary machine failing completely in some fashion - can also be used.

    However, this isn't to say that clustering should eliminate the need for fault-tolerance. After the price of good servers, adding a redundant PSU and a good UPS, as well as some other basic hardware necessities (RAID perhaps?), shouldn't make a huge dent in your budget in comparison to the overall cost of the machine(s), and it'll probably save you money in the long run.

  124. Well, let's see by Cyno · · Score: 2, Informative

    Sun:
    http://store.sun.com/CMTemplate/CEServlet?process= SunStore&cmdViewProduct_CP&catid=83174

    For around $20,000 you could build a PC cluster that includes:
    20+ x Intel P4 D820 at ~$500 ea.
    20+ x AMD64 X2 3800+ at ~$750 ea.

    You could almost get a cluster of 40 Intel PCs, each with a dual-core chip running at 2.8 GHz. Or almost 30 AMD64 PCs, each with a dual-core chip running at 2.0 GHz. If you shop smart you can get gigabit ethernet on the motherboard and have a fault-tolerant / redundant system with over 10 times the performance of the Sun system.

    I don't know about you, but I would take the cluster of AMD X2s. The Intels might beat 'em on price/performance, but the X2s might be a lil bit nicer to work on.
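
    For anyone who wants to redo the arithmetic with their own prices, a small sketch follows. It assumes the per-unit prices quoted above cover a whole node and ignores switches, racks, power and admin time; `nodes_for_budget` is just an illustrative helper name.

```python
def nodes_for_budget(budget, price_per_node):
    """Whole nodes that fit in the budget (leftover money is ignored)."""
    return budget // price_per_node


BUDGET = 20_000  # dollars, the figure used in the post

# (name, approximate price per node, cores per node) as quoted in the post
OPTIONS = (
    ("Intel P4 D820", 500, 2),
    ("AMD64 X2 3800+", 750, 2),
)

for name, price, cores in OPTIONS:
    count = nodes_for_budget(BUDGET, price)
    print(f"{name}: {count} nodes, {count * cores} cores")
```

    At exactly $750 per node the AMD option comes to 26 nodes, so the "almost 30" above assumes shopping a bit below that price.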

    1. Re:Well, let's see by Tw1ggy · · Score: 1

      I think it's mostly a question of initial scale and HR capability. If you have the talent and processes in place to manage the cluster, then go for it. If you are starting out, then simplicity and ease of management will lead to stability, so go fault tolerant.

    2. Re:Well, let's see by tweek · · Score: 1

      And what does that get you? If you're wanting to cluster/load-balance a web server or an app server, you're fine.

      But what about a database? Does mysql support N-way replication? Postgres? I think you can do it with Oracle RAC?

      Again, a big cabinet full of servers is nice if you're applying it to the right application.

      --
      "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  125. What ? by fullofangst · · Score: 1

    Sorry, what is the point of this article ?

    What's the question being asked ? There's no point to this headline. Useless.

  126. Re:SneakerNet * by Oriumpor · · Score: 1

    Sounds like about 30 more lines or so of python and you're halfway there.
    TinyP2P

    A bit of checksumming, some automated distribution of indexed files based on some arbitrary weight (Important 1-kinda 5-YOU BET), and you've got it.
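
    The checksumming and weighting part could look roughly like the sketch below. It only uses the Python standard library; the function names, the index layout (relative path to SHA-256 digest) and the "higher weight goes first" rule are assumptions made for illustration, not anything TinyP2P provides.

```python
import hashlib
from pathlib import Path


def checksum(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks so large drawings fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def files_to_send(local_index, remote_index, weights, default_weight=1):
    """Files the remote side lacks or has stale, most important first.

    weights uses the 1 (kinda) to 5 (YOU BET) scale from the post.
    """
    stale = [
        name for name, digest in local_index.items()
        if remote_index.get(name) != digest
    ]
    return sorted(stale, key=lambda name: (-weights.get(name, default_weight), name))
```

    Each machine would build its own index at login or boot, swap indexes with its peers, and copy whatever `files_to_send` returns.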

    You would have to install Python for windows... (Or OSX if they're using AutoCad and not Softplan.) Setup some login/boot-time scripts etc.

    Still, more for the "fun" kind of thing to do, and not something for a production environment. But everything has to start somewhere.

  127. Clusters. 20+ years ago... by Anonymous Coward · · Score: 0

    I did my first installation of a cluster 20 years ago last Jan. This is not new, except in the M$ world, and they still can't get the VMS code they have to work. So what is new...

    Configured right, with apps that are correctly designed and written (eh hum!), clustering can do ALMOST all, but not all, that an FT machine can. The ultimate is to build a cluster of FT machines.

  128. You should never use Windows NLB by dangermen · · Score: 1

    Host-based network load balancing stinks on almost every platform imagined. Host-based systems typically require multicast or broadcasts to perform crap trickery. MS NLB is NO different. It turns your multi-thousand dollar switches into hubs.

    I highly recommend one of the following dedicated appliances for load balancing:
    1. Cisco Content Service Switch
    2. F5 BigIP
    3. Nortel's Load balancers
    4. Redline

    They are appliances that respect the network rather than beat the crap out of it. Oh, and yes, when you are talking about managing enterprise networks - you treat them well.

  129. Re:SneakerNet * by joe22 · · Score: 1

    Perhaps something called OLRAS (olras-archives.com)
    A file synchronization system to synchronize user-data
    with a server (therefore requires a server). The system supports
    workgroups as files shared by members are distributed
    to all members as they connect. The server is necessary
    because otherwise, all members of a workgroup would need
    to be connected at the same time to synchronize the workgroup files.

    Targeted to very small companies, especially useful
    for laptop users who may or may not be connected to
    the network at all times. The server is used to backup private data
    and also as a repository of data for users who are not presently
    connected: as they connect, their PC gets updated.

          - uses SSH: VPN not required
          - incorporates a private Web space for company-wide documents
          - incorporates WebDAV, for sharing files when file sharing by
                synchronization will not do.
          - Supports archiving of files, previous versions of files may
                be retrieved by the user without IT personnel involvement.

  130. Hassle as a drag on reliability by Tw1ggy · · Score: 1

    My experience is in SMEs, or in small to medium sized units of larger businesses, that try to maximize their internal capability at the lowest possible cost. In almost every case, clusters (based on Windows) were far more prone to problems than fault tolerant servers. The lack of reliability came most often from the increased hassle, maintenance, and complexity of the clustered (especially fibre) solutions. While they had, on paper, more fault tolerance and could theoretically provide greater availability, the factors I laid out above acted as a drag on their uptime. So much so, that fault tolerant servers or small storage devices (especially Netapp) often provided comparable ACTUAL uptime figures. As a bonus, the simplicity of the infrastructure allowed much greater flexibility when approaching problems. Unless your infrastructure is already of "enterprise" scale, clustering is not a great option. As a first step into a large scale infrastructure, I would definitely pursue fault tolerance first.

  131. Horses 4 Courses - They are NOT mutually exclusive by mr_rizla · · Score: 2, Insightful
    Why would you want a cluster? For high availability. Why do you want a server with multiple redundant parts? For resiliency.

    If you have an application that requires ULTIMATE uptime, then you need a geographically remote cluster (Cluster spread over two sites with a redundant leased line link to provide the heartbeat). No matter how many redundant parts in a server, if it gets nuked (read power failure, flood, or other, not ACTUALLY nuked) then that application is down.

    Active-active clusters are not really ideal. While load-balancing is a nice idea, in this instance it means that when half of the cluster fails the application suffers severe performance issues. Active-active also creates data issues, as you've got two servers writing to their own local storage, which then requires real-time replication between sites. Veritas Storage Foundation is about the most cost-effective option here; you don't even need 2003 Server Enterprise.

    If you want a nice simple active-passive cluster and it's all in the same locale, fine, use a SAN. If the nodes are geographically remote, then they will need real-time replication, and as one is passive you can use HP Storage Mirror or similar. HP are in fact the only vendor that do a nice packaged cluster solution with a SAN included all under one part code. FYI.

    Having said that, if you're buying a decent server, then you are an absolute idiot not to put RAID into it. After that, it only costs another £300 or so to add a redundant hot-plug PSU & fan. Plus p'raps a bit for an extra CPU. After that, the only component that will cause a total outage is the mainboard failing - and the only real way to get around that is to... uh... add another mainboard! Well... guess that's another server then...!

  132. We've Got Three Types of Clustering by fdiskne1 · · Score: 1

    We're using Windows network load balancing on a web-hosted application. The cluster is given one IP, the servers are each given one. When the initial connection is made, the client is directed to one of the servers and that server handles all requests from that client until the session ends. The biggest problem is when one server is having issues, we need to connect to each individually to figure out which is having problems, then remove that one from the cluster. Also, the load balancing takes some processing power on each of the servers. This isn't important to us in this particular situation.

    Another one we use is an active-active Exchange cluster. Each server is aware of the other in the cluster and they share disks on a SAN. If one server is brought down for whatever reason, the other automatically grabs the services that were running on the first. The thing you have to watch is that neither of the servers uses more than 50% of their processing power when running one half of the cluster. If it ever does, you'd better upgrade your servers because if it gets the full load, it won't be able to handle it.
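
    The 50% rule for an active-active pair generalizes to a simple headroom check. The sketch below is only an illustration: it assumes a failed node's load is spread evenly over the survivors, measures load as a CPU percentage, and the function name is made up.

```python
def survives_single_failure(loads, limit=100.0):
    """True if the cluster can lose any one node without overloading another.

    loads: current load per node, in percent of that node's capacity.
    Assumes identical nodes and an even spread of the failed node's load.
    """
    if len(loads) < 2:
        return False  # nothing to fail over to
    for failed, lost_load in enumerate(loads):
        survivors = [load for i, load in enumerate(loads) if i != failed]
        extra = lost_load / len(survivors)
        if any(load + extra > limit for load in survivors):
            return False
    return True


print(survives_single_failure([45.0, 45.0]))  # pair with headroom
print(survives_single_failure([60.0, 45.0]))  # pair that would overload
```

    For two nodes this reduces to "the two loads together must fit on one box", which is where the 50% figure comes from.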

    The last one we use is an F5 BigIP box. This is a dedicated network load balancing box we use on a high-use web cluster. The nice thing is that all the computing power needed to manage the cluster is on the F5 box, freeing the servers for more users.

    --
    But why is the rum gone?
  133. Clustered FT hardware is the proper solution by swordgeek · · Score: 2, Insightful

    Others have said it, I'll say it again: you don't use clustering in place of FT hardware, or vice versa. You use them together!

    Take a server: Hot-swappable mirrored OS disks, N+1 power supplies, dual NICs (which support failover), dual cards initiating separate paths to your storage (through independent switches, if fibre-attached), ECC RAM with on-system logic to take out a failing DIMM. Oh yeah, and multiple CPUs, again with logic to remove one from active use if need be. (chipkill sort of stuff.)

    Now take another identical server (or two) and cluster them. By cluster, I mean add the heartbeat interconnects and software layer to monitor all of the mandated hardware and application resources, and fail over as necessary, or take other appropriate actions. Gluing a pile of machines together in a semi-aware grid is NOT a cluster, and does not properly address the same problem!
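
    As a rough illustration of the heartbeat part only, here is a minimal sketch of a monitor that flags a node once its heartbeats stop arriving. The class and method names are hypothetical, and a real cluster stack does far more (quorum, fencing, resource monitoring) than this shows.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went quiet."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before a node is suspect
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}

    def beat(self, node):
        """Record a heartbeat received from a node over the interconnect."""
        self.last_seen[node] = self.clock()

    def dead_nodes(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

    The cluster software layer would call `dead_nodes()` periodically and start failing resources over for anything it returns.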

    Now once you've got this environment in place, add the most crucial aspect: Highly competent sysadmins, and a strict change control system. The former will cost you a fair sum of money in salary, and the latter will likely necessitate duplicating your entire cluster for dev/test purposes, before rolling out changes.

    That's the beginning of an HA environment. Still up for it?

    --

    "People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
  134. Why not cluster fault tolerant servers? by Mustang+Matt · · Score: 1

    I've stuck with fault tolerance as it's been cheaper for a smaller scale operation, but I don't see any reason you couldn't cluster them.

    --
    The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
  135. It really is two different subjects.... by tweek · · Score: 1

    or maybe it's just semantics.

    I see Clustering as:
    a group of machines sharing workload - i.e. a cluster of app or webservers.

    We have a cluster of Websphere servers handling our appserver load behind a pair of CoyotePoint load balancers.

    The Coyotepoint LBs operate in an HA/FT mode. If one goes down, the other picks up right off the bat including existing state (which client was "stuck" to which server). I call this HA.

    It all depends on the product and the vendor. We have DB2 operating in an Active/Passive HACMP cluster but the workload isn't shared. As far as licensing, we only have to have licenses for the active server (according to IBM).

    There's also the shared-nothing vs shared-everything model. We currently run a shared-everything model for our database, although DB2 has a feature inherited from Informix called HADR which is a shared-nothing model. It's still active/passive but the passive box is in a state of concurrency with the active primary based on user-configurable parameters, i.e. update secondary node every 60 seconds or keep secondary node updated in real-time.

    Honestly it all depends on the actual implementation. Look past the vendor cruft and marketspeak to the actual implementation.

    They may be selling you a database cluster, but it might require schema changes and can really overcomplicate the problem. What are you trying to achieve? Business continuity? High Availability? Distributed workload? Some products support one but not the other. Some also support all of them with differing levels of complexity.

    If you're looking for a Linux-only solution, we're currently running two - SteelEye's LifeKeeper on one configuration and the linux-ha stuff (with drbd) on another.

    I'd be happy to provide answers to any business related experience via email.

    --
    "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  136. Re:Horses 4 Courses - They are NOT mutually exclus by tweek · · Score: 1

    Geographic Load balancing is a bitch. Especially if you don't have dark fiber between datacenters/facilities.

    We're just now starting to investigate HAGEO on AIX, geographic mirroring on our SAN and Q-Series replication for DB2. I know it's important but it really adds to the complexity of the environment tenfold.

    --
    "Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
  137. but also creates problems by Baki · · Score: 1

    especially with global data. of course global data must be minimized in any system, however at least the database can be considered "global data". clustering a database is very expensive, performance wise. even outside the database you may need global data.

    for example, we're working on a system where we have some global pool of resources which is used by many concurrent processes. if you would keep this pool of resources (counting down) in the database, we would need a DB which handles many thousands of transactions per second, which is not realistic. keeping it in memory (as a stateful service, i.e. global data/singleton) is the only solution, but is not compatible with clustering.

    clustering can be considered "loosely coupled" fault tolerance. a single fault tolerant machine is much faster when synchronizing memory between different processes. in a cluster, synchronized memory access would slow down way too much.

    i think clustering only applies to simple systems. many parts of problems just are not suited to be distributed. it is better, IMHO, to distribute functions (i.e put function A on machine 1 and function B on machine 2) than to distribute each function over multiple machines. only a particular class of functions can be distributed easily.

    alas, some people (for example in the company where I work) only envisage simple web applications and thus mandate that any application is clustered, creating great headaches and severely restricting sound design.

    1. Re:but also creates problems by bradm · · Score: 1

      No, it doesn't create problems, it just isn't a silver bullet to solve them. This is the extra local knowledge requirement I stated above.

      Let's take the database example that many people jump to. It's certainly correct that clustered databases are complex, expensive beasts, and so if you have an application that requires a single, modifiable (not read only), ACID compliant database image for a single site, a large iron fault tolerant solution will likely make more sense than a clustered attempt.

      But what if that's not your requirement? If, for example, the majority of your application's database needs are read-only, then perhaps you can split it to a set of clustered read-only databases, and an active/passive set up for your modifiable database. You then use a publishing model for changes to your read-only cluster. This pattern is applicable to a large number of online commerce sites - you have a catalog of product information that people search in. You publish updates to this catalog relatively infrequently - a few times a day or less. You have a lot smaller volume of traffic creating new commerce accounts and invoices than searching the catalog. Notice that it's okay if the updates to the catalog aren't in dead sync across all the copies of the database for a few seconds or minutes. Once you stop thinking in monolithic single database terms, it frees you to architect a more effective solution. Local knowledge.
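
      A toy sketch of that split, with plain dicts standing in for the databases. Everything here (the class name, round-robin reads, copying the whole catalog on publish) is an assumption made for illustration; a real setup would use the database's own replication.

```python
import itertools


class CatalogRouter:
    """Sends writes to one primary store and reads to read-only copies."""

    def __init__(self, replica_count=2):
        self.primary = {}                                   # the modifiable database
        self.replicas = [{} for _ in range(replica_count)]  # read-only cluster
        self._next_replica = itertools.cycle(self.replicas)

    def write(self, key, value):
        """Updates only touch the primary; replicas lag until publish()."""
        self.primary[key] = value

    def read(self, key):
        """Catalog searches are spread round-robin over the replicas."""
        return next(self._next_replica).get(key)

    def publish(self):
        """Push the current catalog to every replica (a few times a day)."""
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)
```

      Reads served between a `write` and the next `publish` see the old catalog, which is exactly the staleness the pattern accepts.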

      You mention synchronized memory access. Of course it's silly to do this across a cluster. The question becomes again: do you need it? It's not uncommon for the synced piece to be a very small fraction of the overall memory requirement. So you factor the synced stuff out into procedure calls executed against a single service somewhere running on a single box, and then you let everything else exist as multiple copies in various cluster nodes for performance. Local knowledge.
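
      For the resource-pool case from the parent post, the factored-out piece could be as small as the sketch below: one process owns the counter and everything else calls it. The class is hypothetical, and the network layer that cluster nodes would use to reach it (RPC, HTTP, whatever) is deliberately left out.

```python
import threading


class ResourcePool:
    """Single owner of a shared count-down pool; callers never touch the count."""

    def __init__(self, size):
        self._remaining = size
        self._lock = threading.Lock()

    def acquire(self, amount=1):
        """Take resources if enough remain; returns True on success."""
        with self._lock:
            if amount > self._remaining:
                return False
            self._remaining -= amount
            return True

    def release(self, amount=1):
        """Hand resources back to the pool."""
        with self._lock:
            self._remaining += amount

    @property
    def remaining(self):
        with self._lock:
            return self._remaining
```

      The service itself is then a single point that wants a fault-tolerant box or an active/passive pair, while the callers can be clustered freely.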

      Clustering is not a drop in solution; it requires careful design all the way from power supplies to application requirements. Done correctly, however, you get Google. It's about applying local knowledge to choose the most effective long-term architecture for the job.

      One final OT thought: Simple is harder than complex. Maybe the exercise of trying to model your underlying requirement as a "simple web application" would be enlightening.

  138. Low cost upfront vs low(er) cost down the line by Anonymous Coward · · Score: 0

    I think the main tradeoff is in the level of complexity you have in your network. At my nameless company I run 600+ mail servers. This is mostly because we started small and grew very quickly. The cost of things like datacenter space, remote hands calls, and the amount of time it takes to manage a network of this size all create costs that we would not have in a "big iron" scenario. For example, we generally have about 30-40% free space on our hard drives, and all of our servers have dual power supplies. If we were in a big iron scenario, not paying for all those extra power supplies and all that idle disk space would add up to a substantial savings. Essentially, I think it breaks down to whether you know you are going to have a big system and have a decent idea of how much processing power/resources you are going to need.... if you are in the position to spec out hardware ahead of time, go big iron. If you are going to grow over time, you can push some of the initial costs down the line and free up capital. I would personally go with big iron if I had to do it all over again, but that's my network, it's likely to be quite different in yours.

  139. Don't waste your hardware by nt5matt · · Score: 1

    Try utilizing all of your hardware instead of having some sitting around doing nothing but waiting for the primary node to break. Especially if you don't have money to waste. Our big UNIX boxes are all SAN attached, and the application binaries as well as the data are located on these drives. Use some clustering software so that each server can see all of the drives. This will allow you to give each server a primary responsibility and have it handle a secondary one when another primary server dies. Server A dies? Server B can see its drives and you can start the application - most of the time scriptable with the failover server monitoring the primary. This eliminates dead (and expensive) weight in the datacenter. We tend to find load balancing not as attractive, as you have to keep all the member nodes exactly the same when you install patches and/or make changes to the application. With commodity x86 servers running 3-4GHz and 2Gbps Fibre Channel for storage, things tend to run plenty fast. If your application requires more than that, you need to be running on a "big iron" 16 or 32-processor system with an ungodly amount of memory.
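
    The "everyone has a primary job and backs up someone else" arrangement boils down to a placement rule like the sketch below. The service and server names are invented, and the actual start/stop scripting and SAN handling are out of scope here.

```python
def place_services(primary, standby, failed):
    """Decide where each service runs given the set of failed servers.

    primary: service -> server that normally runs it
    standby: service -> server that takes it over on failure
    Returns service -> server, or None if both candidates are down.
    """
    placement = {}
    for service, home in primary.items():
        backup = standby[service]
        if home not in failed:
            placement[service] = home
        elif backup not in failed:
            placement[service] = backup
        else:
            placement[service] = None
    return placement


# Hypothetical layout: A and B each do real work and cover for the other.
PRIMARY = {"billing": "server-a", "mail": "server-b"}
STANDBY = {"billing": "server-b", "mail": "server-a"}

print(place_services(PRIMARY, STANDBY, failed={"server-a"}))
```

    No box sits idle in normal operation; the price is that the survivor carries both workloads until the failed server is back.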

  140. 85% to 99% availability using a cluster by ehiris · · Score: 1

    I raised availability from 85% to 99% using a cluster of fault-tolerant servers, because clustering gives you redundancy where the rubber hits the road. In most cases that is at the user interface.

    Try to use hardware load balancers so that they can detect if one "side of the cluster"* is down and stop directing traffic to it. Proper configuration of the load balancers and proper monitoring of the individual sides of the cluster are also very important.

    * - I prefer to call it a side of a cluster instead of a server because the side of the cluster can be a collection of application, web, and database servers in multi-tier environments.
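
    What the load balancer does with a down side can be sketched in a few lines. This is a toy round-robin with a pluggable health check, not the behaviour of any particular appliance; the names are made up.

```python
class LoadBalancer:
    """Round-robin over cluster sides, skipping any that fail the health check."""

    def __init__(self, sides, is_healthy):
        self.sides = list(sides)
        self.is_healthy = is_healthy   # callable: side -> bool
        self._position = 0

    def next_side(self):
        """Return the next healthy side, or raise if every side is down."""
        for _ in range(len(self.sides)):
            side = self.sides[self._position % len(self.sides)]
            self._position += 1
            if self.is_healthy(side):
                return side
        raise RuntimeError("no healthy side available")
```

    The health check is where the monitoring effort goes: it should exercise the whole side (web, application and database tiers), not just ping the front-end box.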

  141. So many options... by xaosflux · · Score: 1

    MS Clustering does add several layers of complexity to a MS environment, and anyone administering any part of the server needs to be fully aware of the environment; otherwise a DBA shutting down a database server will trigger a failover event unexpectedly.

    Clustering can be used for load balancing multiple applications over the array members, provided that a LUN is provided for each application, that way if a node fails the other can carry it, albeit at 50% performance.

    I've had plenty of fault tolerant servers (multiple PSU's, RAID, hot-swap memory, NIC's, the works) but none of that helps a bit against a BSOD.

    An attractive alternative is the luke-warm spare, where you have a redundant server that meets the hardware needs of many of your servers, with either preloaded SCSI disks in a box, or at least images sitting on tape/dvd ready to load.

  142. bad math by flaming-opus · · Score: 1

    Most of the cost of a cluster is in the software, not the hardware. Even running on Linux, you need your middleware and application to be written to deal with a cluster environment. You probably even need some sort of cluster filesystem or at least SAN hardware. In short: for anything resembling even 3 9's of uptime, you can't begin to deploy for as little as $40,000. Linux and x86 reduce the cost of the servers by 40%; everything else stays pretty damn expensive.
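
    For reference, this is what the "nines" translate to in allowed downtime. A short sketch, assuming a 365-day year; the helper names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # leap years ignored


def availability_for_nines(count):
    """3 nines -> 0.999, 4 nines -> 0.9999, and so on."""
    return 1.0 - 10.0 ** (-count)


def downtime_hours_per_year(availability):
    """Hours per year the service may be down at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


for count in (2, 3, 4, 5):
    a = availability_for_nines(count)
    print(f"{count} nines ({a:.5%}): {downtime_hours_per_year(a):8.3f} hours/year")
```

    Three nines leaves roughly 8.76 hours of downtime a year, planned maintenance included.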

    1. Re:bad math by Cyno · · Score: 1

      3 9's for $40k? I can do that with 3 PCs for less than 2K for moderate loads.

      It's crazy to think one must spend $100k for a stable server. Modern PC CPUs are much faster than yesterday's servers.

      Sure, some specialized applications might require custom hardware and software. But for most network services some sort of replication can easily be scripted for failover. Automated warnings can make it easy to replace or repair broken systems. PC hardware is cheap and easy to find and fix. Linux supports a lot of choices in this area. You can choose just about any PC to replace a server temporarily with reduced performance for a quick patch, if you have the right environment setup.

      Some software requires you use expensive fault tolerant or redundant systems. Some software does not. Depending on your needs you may have to spend more to support your software. Your competition may not have the same policy. For them Linux is a very attractive option. It can be custom fit to their organization to help automate most of the administration. Their software can work for them instead of the other way around.

  143. Linux != non-proprietary by Anonymous Coward · · Score: 0

    Just as an aside, a Linux solution is still proprietary unless it adheres to a standard administered by some sort of standards administrating body. It may be open, but it's proprietary. If we're going to be all religious about open and closed source, let's keep things straight.

    Posted anonymously as a zealotry filter.

  144. Re:Microsoft Windows Server DOES support clusterin by TimTheFoolMan · · Score: 1

    Well, my team has deployed several NT and Win2K clusters, and they worked fine. We were clustering IIS along with several legacy apps.

    My experience has been that when you can't get Windows Clustering Services to work, it's either a lame app, or lame people running the show.

    Tim

    P.S. I'm the king of Windows bashers, so I'm definitely no lover of MS. At the same time, if it works, I'll install it.

  145. ultimate fault-tolerance: Tandem Computers by tjanke · · Score: 1

    The first fault-tolerant computers (circa 1978) were made by Tandem Computers, which is now the NonStop Division (or something like that) of HP. They're still the best: no single point of failure, backups take over in 15 microseconds. Full disclosure: I worked for Tandem for 6 years. Prior to that, I worked for a Manhattan brokerage firm; the building got hit by lightning one evening; all the Amdahl and IBM mainframes went down, and it took three days to get them back up. The Tandem systems lost half their processors and a third of their disc drives, and kept right on processing. No down time, no lost data. The only drawback is price; these are Enterprise class servers; the cheapest one is $250k. I guess you get what you pay for.

    --
    Cheers, Tim -- Tim Janke Part mad scientist, part lion tamer: sr. software engineer, global team leader, project mana
    1. Re:ultimate fault-tolerance: Tandem Computers by Anonymous Coward · · Score: 0

      >I worked for a Manhattan brokerage firm; the building got hit by lightning one evening

      Did you repent and confess your sins? It is easier for a camel to walk through the eye of a needle than for the rich to enter the Gates of Heaven. It was a sign from the Heavens.

      >the building got hit by lightning one evening; all the Amdahl and IBM mainframes went down

      A., Isolators cost about nothing, APC makes all kinds of them, small dongles that plug inline, ethernet, serial, parallel, scsi, coax, etc. Why didn't you use them? That stops lightning.

      B., If the damn rich brokerage firm did not have off-site redundancy, that was their fault. All the world's TANDEMs can't help when Osama dives into your skyscraper and everything goes down. To have a chance against Allah you need something like Sysplex: one mainframe in CityA and the other in CityB with a thick fibre optic cable link between them for instant replication. This is standard in the petrochemical industry, because you never know if a site is about to go ka-boom.

    2. Re:ultimate fault-tolerance: Tandem Computers by tjanke · · Score: 1

      Isolators cost about nothing [snip] Why didn't you use them?

      If by 'you', you mean me, personally, it's because I was just a lowly operator at the time; I assumed they had 'em, and even if I'd known they didn't, my influence was, of course, zero.

      If by 'you', you mean the brokerage firm, it's because they were cheap! Duh. Well, cheap where it wouldn't show. They spent millions covering the huge lobby with Italian marble, and putting in a double row of palm trees, leading to the two-story glass atrium window overlooking the Hudson. They spent probably a million more hosting a political fund-raiser in the new lobby for the (then) President. They spent hundreds of thousands paneling their offices with walnut and putting in extra-thick, luxurious carpets. But they obviously wouldn't open up the purse for some lightning protection, which they'd never get to show off (or so they thought). After this little incident, they heard from their investors, big time. They opened two redundant data centers, one in Brooklyn, and the other in New Jersey.

      But you're right about no protection from errant airplanes in the hands of crazies. When the World Trade Center went down, their building was heavily damaged, including both the data center and the obscenely extravagant lobby. I shed many a tear for those who died, especially the poor souls forced to jump. I shed none for all that wrecked Italian marble.

      --
      Cheers, Tim -- Tim Janke Part mad scientist, part lion tamer: sr. software engineer, global team leader, project mana
  146. Not either/or by jgwythe · · Score: 1

    "Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers." This model defeats the whole cost savings of a cluster solution. One of the main advantages of clustering is the ability to leverage low-cost hardware and create a highly fault-tolerant infrastructure. You do need to pick one or the other. If you don't, you're paying on both ends. The clustering software provides the HA, not the hardware. If you don't use clustering software, then by all means rely on hardware HA, but both is just wrong. Why would you invest in the headache of scaling horizontally, only to spend the money on hardware HA as well?

  147. Re:SneakerNet * by swhalen · · Score: 1

    Perhaps a better link is to the OpenAFS (Open Andrew File System) implementation: another IBM contribution to open source. It is continuing to make current releases available. They're working right now on a new stable release.

    The link is http://www.openafs.org/

    Steve

  148. Thank you for the informative post by BananaJr6000 · · Score: 1

    ... a rarity on Slashdot these days it seems. I work at a well-known East Coast university, and I have been trying to get the 'chief engineer' to adopt similar practices, but he is book-smart and trade-rag smart, and tries to pinch pennies to save dollars.

    "We'll just script it!"
    --
    Working at a university makes my brain feel toasted

  149. MS Clustering licks by Anonymous Coward · · Score: 0

    I had the thrill of developing our company's clustering solution, as long as it was Windows on IBM hardware. I built active-passive systems with raid 10 arrays for databases. our application lived on web servers that would call the database servers. i spent the better part of my days, and nights for that matter, either bringing nodes back up or failing them over because the system wouldn't do it itself. no end of grief. our higher ups were convinced this setup was giving our clients 100% reliability, but they forgot about the 5 to 10 minutes of resync between our web servers and the databases. add that up over a few times a day, and the client ends up with a hefty outage. neither ibm nor microsoft could come up with an answer to make things run smoother. we decided to de-cluster our setup, and things could not run smoother. ymmv, but we had a bad experience with clustering and won't be going back to it.
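
    To put a rough number on that resync overhead, here is a back-of-the-envelope sketch. It assumes three failovers a day (the post only says "a few") and the 5 to 10 minute resync window mentioned above; the function name and the failover count are illustrative assumptions, not measurements from that setup.

```python
MINUTES_PER_DAY = 24 * 60

def availability(failovers_per_day, resync_minutes):
    """Fraction of the day the service is up, counting only resync outages."""
    downtime = failovers_per_day * resync_minutes
    return 1 - downtime / MINUTES_PER_DAY

# Assumed: 3 failovers a day, each costing a 5 to 10 minute resync.
for minutes in (5, 10):
    up = availability(3, minutes)
    hours_per_year = 3 * minutes * 365 / 60
    print(f"{minutes} min resync: {up:.2%} up, ~{hours_per_year:.1f} h down per year")
```

    Under those assumptions the resyncs alone cost somewhere around 90 to 180 hours a year, which is a long way from the 100% the higher ups had in mind.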

    this was all done under windows 2000. under windows 2003, you need an extra dedicated resource besides your quorum drive for msmq. we simply didn't have the spindles available to create an extra resource. besides that, it was going to be a pain to get 2003 clustering running without active directory. yes, we are still using nt domains.

  150. Best of both worlds by Anonymous Coward · · Score: 0

    Sysplexed mainframe pair: maximum 10 minutes downtime per year. Only costs 2 million $ or more.

    East or west, COBOL is best!

  151. Is there anything like that on Linux by coder111 · · Score: 1

    Hmm, this CARP seems quite a nice feature. Is there something similar on Linux?

    --Coder

  152. Where would the information go? by OldManAndTheC++ · · Score: 1
    It's a cold, lonely world out there. Pretty soon it'll be back, begging for a safe and cozy environment. Freedom ain't all it's cracked up to be, especially if you're just an abstraction.

    Don't think of that server as a prison. It's more like a womb.

    --
    Soylent Green is peoplicious!
  153. Re:Oracle RAC? can I have 200 copies for free pls by Anonymous Coward · · Score: 0
    Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.

    Ok ok I hear you, give me a torrent or link where I can download its source code and use it legally for no cost in my business.

  154. Re:But but by Anonymous Coward · · Score: 0

    but, if you're running Oracle RAC, you're gonna be ripped off shitloads of money for it. end of story.

  155. grid by Stu+Charlton · · Score: 1

    If the idea is "massive parallelism using commodity hardware", then grids are pretty much clusters with better management tools for single-system image, provisioning and/or re-distributing resources (CPU, I/O, etc.). Grids are mostly in the scientific community though you're starting to see it creep in commercial data centres (Oracle being a big proponent with their database version 10g, "g" standing for Grid).

    The problem with clusters has always been software; to fund new software development, one needs to build a little bit of hype. That means new terminology. Kind of like how "expert systems" are now called "rules engines", workflow is now called "BPM", and interface contract negotiation is now called "choreography". Not to belittle the work in these areas, there is much good being done, just an observation about funding cycles and human attention spans.

    --
    -Stu
  156. Re:Horses 4 Courses - They are NOT mutually exclus by hicksw · · Score: 1

    OpenVMS has supported geographically separated cluster nodes and shadowed storage for years.

    http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_016.html

  157. Fault Tolerance special case of Clustering by Anonymous Coward · · Score: 0

    Clustering provides Fault tolerance and scalability.
    Fault tolerance is usually a system with two internal nodes. I used to work at a telecom vendor where there was development
    of HW-based fault-tolerant systems. These could handle HW failures very well but had no extra support for SW
    failures, which nowadays are much more frequent. They could also handle SW upgrades by splitting the system in intricate
    ways. Later on there was also development of SW-based solutions for fault tolerance; it was possible to have both
    Hot (failover in seconds) and Cold (failover in minutes) solutions using SW only. The nice thing with the Hot solution was
    that one could actually catch SW failures and tell the other side to abort the current thread of activity, since it was known that it
    would crash (according to a Tandem report this catches 25% of the SW failures).

    Clustering can have exactly the same level of fault tolerance but still have scalability. One way of achieving this is by partitioning the system
    into a set of fault tolerance groups. This is how MySQL Cluster works (fault tolerance groups are called node groups). MySQL Cluster
    uses Hot Failover by ensuring that all nodes in the group are always in sync.
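
    The node group idea can be sketched in a few lines. This is a toy model only, not MySQL Cluster's actual code or API: the class names, the CRC-based partitioning and the node names are all made up for illustration. The one property it tries to show is that the cluster stays up as long as every group still has at least one live node.

```python
import zlib

class NodeGroup:
    """Hypothetical fault tolerance group: every node holds the same partition."""
    def __init__(self, nodes):
        self.alive = set(nodes)

class Cluster:
    def __init__(self, groups):
        self.groups = groups

    def group_for(self, key):
        # Toy partitioning: hash the key onto one of the node groups.
        return self.groups[zlib.crc32(key.encode()) % len(self.groups)]

    def fail(self, node):
        # Mark a node as down in whichever group it belongs to.
        for group in self.groups:
            group.alive.discard(node)

    def is_available(self):
        # Each partition needs at least one surviving replica.
        return all(group.alive for group in self.groups)

cluster = Cluster([NodeGroup({"a1", "a2"}), NodeGroup({"b1", "b2"})])
cluster.fail("a1")
cluster.fail("b2")
still_up = cluster.is_available()      # one node left in each group
cluster.fail("a2")
after_group_loss = cluster.is_available()  # group "a" has no nodes left
```

    Adding more groups gives you the scalability; adding more nodes per group gives you the fault tolerance. Losing a whole group takes its partition of the data with it, no matter how healthy the other groups are.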

  158. Re:SneakerNet * by raddan · · Score: 1

    I could see the value if you had some sort of network-version of RAID. You would need to make sure you had 'striping' and 'parity', for when a host goes down, like when a drive goes down. But what would you do if you lost too many peers, like when employees go home for the day? How would you do off-site backups (if the building burns down, it doesn't matter how many copies are distributed among workstations)? How much bandwidth would this require? How would network latency affect performance? How do you make sure you can do atomic writes on your 'filesystem'? I agree, it's a good idea, but there are a lot of problems to solve to get there.
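
    For what it's worth, the 'parity' part is the easy bit. A minimal sketch, assuming one equal-sized block per host and a single XOR parity block (RAID-4/5 style); the function names are hypothetical and it ignores every hard question above (latency, atomic writes, off-site copies):

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """One block per host: the data blocks followed by a single parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, lost_index):
    """Recompute the block of the one missing host from all the survivors."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

stripe = make_stripe([b"host", b"data", b"here"])
recovered = rebuild(stripe, 1)  # pretend the second workstation went home
```

    With a single parity block you can only lose one host per stripe. Once a second peer drops off, the data is gone until it comes back, which is exactly the "employees go home for the day" problem.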

  159. Re:SneakerNet * by sjames · · Score: 1

    In essence, there are two approaches to possible failure: spend a lot to make the probability vanishingly small, or engineer the system so failure is less of a problem.

    The former is the NASA approach, the latter was used by the Soviet space program. Both have had their great successes and spectacular failures. The Soviet approach tends to be orders of magnitude cheaper, especially when the cost of failure is on the high side.