Clustering vs. Fault-Tolerant Servers
mstansberry writes "According to SearchDataCenter.com fault-tolerant server vendors say the majority of hardware and software makers have pushed clustering as a high-availability option because it sells more hardware and software licenses. Fault-tolerant servers pack redundant components such as power supply and storage into a single box, while clustering involves the networking of multiple, standard servers used as failover machines." Perhaps some readers on the front lines can shed a bit more light on the debate based on both proprietary and Linux-based approaches.
Personally I opt for clustering over fault tolerance, but that's my personal choice. It really depends on what the machine(s) will be doing. If you have a database server: fault tolerance (because I have yet to meet a clustering DB solution that didn't suck). But if you're building a webserver: cluster.
Also, the one thing the article mentions is that clustering is just as expensive as fault tolerance due to software licensing. Last I checked, whether it's one copy of Debian + Apache + MySQL + Perl or 200 copies, it's going to cost me the same price (free). And Windows doesn't support clustering yet - in any decent way, shape or form - so I don't see the problem here.
snowulf.com
lol at bannedtown
--rucas
I just use Geocities, it's free and easy!
the BeoWulf!!!
Don't anthropomorphize computers: they hate that.
It's slashdotted already.
Clustering for performance. Redundant components for fault tolerance.
...clustering a bunch of fault tolerant servers?
...and i am just waiting on the call from our vendor recommending we upgrade to a cluster of fault-tolerant servers.
So if you ask a software vendor whether it's better to buy expensive hardware or to save money on hardware and install more copies of software, what's he going to say? Even if you had a site license he'd still say that, because guess what ... he's a software vendor. He's not in the business of solving your problems with hardware.
Breakfast served all day!
Hardware fails... it's as simple as that. You should plan on having to shut down and replace hardware for one reason or another. If it can be done with minimal or no disruption to the services, then that's all the better. OS makes licensing no longer a problem.
tolerating a lot of faults in one girlfriend or get a cluster of them and deal only with the good points?
Clustering provides you with fault-tolerant OS/applications. A single server with tons of redundant bits doesn't help you if the OS or applications that it serves get borked.
Shouldn't we be encouraging server failures which enable their freedom from magnetic imprisonment? Kinda like PETA freeing lab animals...
If brevity is the soul of wit, then how does one explain Twitter?
Because of the open source stack behind a lot of server platforms these days, I'm dubious that this decision boils down simply to a software cost issue. One major benefit of using clustering is that many white box, non-specialized machines can be used, which are easier & cheaper to replace or obtain components for. Complex and specialized hardware with built-in redundancy is often expensive and can require vendor support contracts for effective maintenance.
Business Voyeur
They forgot the third option: cheap servers run in single-tier fashion. If one dies, you just swap it out, and then build another for emergencies. Granted, there is some downtime, but it works as a good, cheap low-end solution.
Clustering provides a backup for software failures, that fault-tolerant servers don't. Also, upgrades without downtime are easier done with a load-balanced cluster.
If you are just talking about fault tolerance (FT), then spill a drink on the FT server, then spill a drink on a clustered server, and see the difference :) If we are not limited to fault tolerance, then try load balancing an FT server with... um... er... itself. This is really apples and oranges. BTW, I like FT servers in a cluster!
The article seems to make the choice one-sided. Fault tolerant servers have higher uptimes because the backup takes over immediately. Clusters have a single point of failure in the middleware. They argue that the clusters can run different operating systems, but that means more patches and updates to keep track of. Clusters are expensive because they need more OS and software licenses and require a lot of maintenance, though that might drop if they are running Linux or FreeBSD.
Anyone make a case for clusters for high-uptime situations?
A NYC lawyer blogs. http://www.chuangblog.com/
Subject says it all.
I choose a Linux cluster every time. Linux is absolutely fabulous at clustering. Personally I much prefer to trust a cluster of servers over one single server, no matter how good, ANY DAY.
If HA is what you are really after, you should use both. You want a fault tolerant server so you never have to go down unexpectedly and you want a fail over node so if the unexpected occurs, you'll be back up in a jiffy.
"That's the sort of blinkered, philistine pig ignorance I've come to expect from you non-creative garbage."-Monty Python
If you buy one machine, you still may need to power it off to open the case, or replace a part.
One advantage of the dot com bubble-burst is that you can find good hardware inexpensively on e-bay. Do a search on "Sun Enterprise". Machines that sold for $100K a few years ago can be had for less than $2000.
Fault tolerant systems are all in one physical location.
Clusters can be in different server racks, building, city even country.
It depends what the goal is. Fault tolerance, scalability, disaster recovery, etc.
They both have their uses, let's not discount one or the other, just use them properly.
**Typically, the goal is a mix of the ones I enumerated, hence I typically choose clusters. However, I always re-evaluate every time a new requirement comes in.
In fact, Microsoft Windows has supported clustering for quite some time. At least the better part of seven years as it was available on Windows NT Server 4.0.
If you want to see the latest Microsoft offering on clustering services, check out this site: http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx
... the better technology IF space isn't an issue.
If you've got the space for the extra servers clusters are great, if you don't have that kind of excess space then fault tolerance is top of the mark.
Shadus
Clusters have a reputation for needing a lot of upkeep. Windows dudes say that Microsoft clusters are a royal pain to maintain. Fault tolerance in servers, on the other hand, is known by almost everyone to be a good or excellent investment, regardless of the OS platform. If you have a hard time holding onto admins for more than a couple of years, you'd have to consider whether clustering is a good choice. But then, I come from a network of only 275 users. Still, we've never considered clusters. Redundancy is where we put our money.
It's only funny until someone gets hurt. Then, it's hilarious.
...but my users and my bosses don't care much what searchdatacenter.com has to say about the situation, in the event hardware failure takes down a critical application.
If the people that pay me are willing to invest in the extra HW and SW to make a critical app available, then we do it.
brilliant! just brilliant.
I build AIX HACMP clusters for a living, and I'll tell you that you should *never* use an either/or approach, as TFA suggests. Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers.
Maybe if they want to write an article, they should spend some time in the real world and see how the HA industry works instead of making up some arbitrary demarcation line to hang a preconception on.
A large and fully redundant fault tolerant server is more flexible. Use virtualization and have many reliable servers of many different operating systems in one unit as opposed to a highly specialized cluster.
For certain tasks, clustering will certainly offer a performance advantage from a scalability standpoint. Yet a fully fault-tolerant hardware system, like those from Stratus, offers just a touch more reliability than a fault-tolerant software system.
Yes, but what does geocities use?
Why aren't we told when editors moderate our posts?
Just go with fault tolerant clusters.
That's one of those ideas that sounds all good and well, but it hardly works in practice. In many cases, downtime is unacceptable. You need transactions processed continually, and you cannot have downtime caused by a dead server.
It is not a good idea to build a system out of parts that you know will fail, and then proceed to design the system around such failure. A far better idea is to spend some money, and design a system that will work. Of course you do take into account hardware failure, and you build in redundancy where necessary. But you do not build your solution around knowingly faulty and cheap hardware. That's just looking for trouble.
Often times the "cheap" solution ends up being most expensive, not only because of the cost of repeated hardware repairs, but also because of the cost of the labour necessary to perform the repairs, and the possibility of downtime. When you're processing millions of dollars worth of transactions per minute (if not per second), even a couple of minutes of downtime can be financially costly.
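To put rough numbers on the point about downtime cost, here is a minimal Python sketch. Every figure in it (the availability levels, the cost per minute, the annual price of the HA setup) is a made-up assumption for illustration, not data from the article or from any poster, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def expected_annual_downtime_cost(availability: float, cost_per_minute: float) -> float:
    """Expected yearly cost of downtime for a service with the given availability.

    `availability` is a fraction between 0 and 1 (0.999 means "three nines").
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR * cost_per_minute


def ha_pays_off(base_availability: float, ha_availability: float,
                cost_per_minute: float, ha_annual_cost: float) -> bool:
    """True if the downtime cost avoided exceeds what the HA setup costs per year."""
    saved = (expected_annual_downtime_cost(base_availability, cost_per_minute)
             - expected_annual_downtime_cost(ha_availability, cost_per_minute))
    return saved > ha_annual_cost


if __name__ == "__main__":
    # Hypothetical shop losing $1,000 per minute of downtime.
    print(expected_annual_downtime_cost(0.999, 1000.0))
    print(ha_pays_off(0.999, 0.9999, 1000.0, 100_000.0))
```

With a high cost per minute, even an expensive setup pays for itself quickly; with a low one, the same hardware may never break even, which is the trade-off several posters describe.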
Cyric Zndovzny at your service.
The Good: Using cheap components in a cluster to create scalability at a good value.
The Bad: Using a cluster to cover up coding issues, architectural crap, or instabilities in the system.
The Ugly: "The Bad" gets so bad that it crashes the whole freakin' cluster. Why did we do this again?
Fault tolerance gets you a machine that keeps running in the face of hardware failures and maintenance. The switchover time is arguably negligible.
Clustering gets you a set of services that keep running in the face of hardware failures and maintenance. The switchover time can range from negligible to huge depending on the application involved.
However, clustering also helps you to solve other problems, including scaling, software failures, software upgrades, A-B testing (running different versions side by side), major hardware upgrades, and even data center relocations.
Clustering tends to require a lot more local knowledge to get right.
So if you narrow the problem definition to hardware only, they solve the same class of problems. But when you broaden it to the full range of what clustering offers you find a greater opportunity for cost savings - because one technique is covering multiple needs.
If you are going to go so far as to pay for redundant everything hardware you probably want to buy at least a pair of them and put them in a cluster. I know very few places where the demands are such that they would buy a single super expensive server and NOT have a cluster to allow for things like software upgrades.
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
I remember reading a few years ago that something that made Google's system different than the competition was that they preferred to have a cluster of very cheap hardware (so if one died, another would take over its job) instead of a single expensive hardware configuration that had its own backup system built in. Not sure if they continue to work with this philosophy, though. They seem to have proven that it works (Google has never given me any problems).
In undeveloped countries, the consumer controls the market. In capitalist America, the market controls you.
In my 15 years of IT consulting, no network has provided data safety transparency cheaply or consistently enough. Clusters and fault tolerance both cost more than downtime in my experience.
We desperately need a better way to access data in a corporate network.
My favorite customers are those architects and engineers who avoid networking except for the Net. Seriously, sneakernet and peer-to-peer have shown the least downtime I've seen.
I think p2p networks will see a comeback if a torrent-like protocol can grow to be speedy. My customers are not banks, but they need 100% uptime as every day is a beat-the-deadline day.
If someone can extend and combine an internal torrent system with a decent file cataloging and searching system, they'll see huge money. I have some 150 user CAD networks just waiting for it.
What would a hive network need?
* Serverless
* Files hived to 3+ workstations
* Database object hiving
* File modification ability (save new file in hive, rename previous file as old version, delete really old versions after user configurable changes)
* "Wayback Machine" feature from old versions
* PCs disconnected from hive will self correct upon reconnection
It is very complex right now, but my bet is that the P2P network will trump client-server for the short run. The "client is the server" vs "the server is the client"?
Since when can every software solution be categorized as "proprietary" or "Linux-based"?
Clustering costs more for the software. Fault-tolerance generally costs more for the hardware, especially if you cluster using commodity equipment. When the software is free, clustering is the obvious option.
Clusters can be EXTREMELY hard to upgrade to newer software versions or service packs. That's something to be aware of in figuring out the costs of maintaining such a system. Of course, it helps if you have a test system, and "a lot of companies" (if you get my drift) that spring for a cluster won't have a test cluster for roll-outs... if you want to feel pain, try to upgrade a MSCS system running some major clustered + non-clustered software to new versions, in production.
But in the end, I opted for a "both" approach. If I'm going to do a cluster, I usually do it for applications, so I'll build it out in an N+1 style so I can easily add more resources to the cluster. If uptime is the concern and not horse-power, I'll simply make things as redundant as possible with drives, power supplies, RAID, etc.
Having built both true high-reliability fault-tolerant devices and clustered systems, I don't see any fundamental theoretical difference. In both cases, you have redundant hardware capacity in place, theoretically to allow you to tolerate the failure of a certain amount of your hardware (and, sometimes, your software) for a certain amount of time. Neither option guards you against failures outside of the cluster or FT system box. Neither one is a panacea. Both are sold as snake-oil insurance against "badness".
In a single fault-tolerant box, you generally have environmental monitoring, careful attention to error detection, and automatic failover. You also have customer-replaceable units for failure-prone components, utilities for managing all of the redundancy, and a fancy nameplate. In exchange for that, you have more complexity, more cost, serious custom hardware and software modifications, and often (but not always) performance constraints.
In a clustered system, you treat each individual server as a failure unit. Good fault detection is a challenge, especially for damaging but non-catastrophic failure, but it's much easier to configure a given level of redundancy and it's easier to take care of environmental problems like building power (or water in the second floor) -- you just configure part of the cluster a longer distance away.
Where clustering is inadequate is when you have a single mission-critical system where any failure is disaster (like flight-control avionics or nuclear power plant monitoring). There are applications where there's no substitute for redundant design, locked-clock processors and "voting" hardware, and all of the other low-level safeguards you can use.
For Web applications, however, where a certain sloppiness is tolerable, and where you want the advantages of load balancing, off-the-shelf hardware and software, and system administration that doesn't require an EE with obsessive-compulsive disorder, clusters are the natural solution.
The fact that you get to sell more licenses for the software is just gravy.
Speaking of the Windows universe, here. I've found actual for-real clustering (say, of SQL Server) to be workable, but to be a serious (and expensive) pain in the ass. Obviously it depends on the app, but log-shipping and other mechanisms are frequently good enough to prepare for fail-over to another machine, and decent fault-tolerant hardware is good enough insurance for a lot of circumstances.
On the web side of things, clustering (actual clustering) sure hasn't come up much in my world. But I use native NLB with very good results. Depending on how your app handles state/sessions, that native load balancing is pretty much a no-brainer to set up. There are problems, though... your server can (from NLB's perspective) seem perfectly happy - even as your web app is puking in some way, and defeating the whole purpose. So for that, you've got to have something watching the app and then kicking the machine in the ass if it's stupid in that higher layer. This would be, of course, just as true of any load balancer that's out in front of the web servers and doesn't know if a particular app is happy or not.
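The "something watching the app" idea above can be sketched in a few lines of Python. This is only an illustration of an application-level health check, not how NLB or any particular load balancer works: the `/health` URLs, the `OK` marker, and the function names are all assumptions made up for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_probe(url: str, marker: str = "OK", timeout: float = 2.0) -> bool:
    """Return True only if the app answers HTTP 200 and the body contains `marker`.

    A host that answers pings, but whose web app returns errors or garbage,
    fails this probe, which is the case the comment above worries about.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(4096).decode("utf-8", errors="replace")
            return resp.status == 200 and marker in body
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def healthy_backends(urls: Iterable[str],
                     probe: Callable[[str], bool] = http_probe) -> List[str]:
    """Keep only the backends whose application-level probe succeeds."""
    return [u for u in urls if probe(u)]


if __name__ == "__main__":
    # Hypothetical health URLs; replace with whatever your app actually exposes.
    pool = ["http://web1.example.internal/health",
            "http://web2.example.internal/health"]
    print(healthy_backends(pool))
```

The probe is passed in as a parameter so the selection logic can be tested without a network, and so a deeper check (for example one that exercises the database connection) can be swapped in later.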
But if you're trying to spread the pain across a handful of web servers, NLB is a pretty easy solution. Making sure that a SQL server behind those web servers is up though... in real life, unless there's a large budget and good admins doing the care and feeding, the risk of having to rely on a managed fail over to a recently replicated copy of the db on another machine seems to be a pretty popular choice. Considering that you can buy a seriously fault-tolerant server and storage solution for a pittance compared to the long-term admin costs of not screwing up a clustered rig, that's the sweet spot for a lot of users, and the risk is fairly low. Hardware, properly housed in a decent data center, is pretty damn reliable at good price points these days. A somewhat fragile clustering environment, though, is one slightly-drowsy off-shore admin mouseclick away from being REAL hard to unscrew.
Don't disappoint your bird dog. Go to the range.
I am not sure fault tolerance is cheaper than clustering. You can build a cluster from cheap PCs and you can keep adding nodes to it. But fault-tolerant servers sound hard to scale, vendor-locked, and costly too (since the hardware has to be specially designed).
What you wrote is really ignorant (which, modded on /., translates to Insightful).
1. (because I have yet to meet a clustering DB solution that didn't suck).
Where do you live? In Rwanda?
Perhaps you have heard of Oracle RAC. And there are other very good clustering solutions for DBMS.
2. one copy of Debian + Apache + MySQL + Perl or 200 copies
MySQL isn't enterprise-reliable even in stand-alone configuration, let alone clustering. I can't believe this...
3. And Windows doesn't support clustering yet - in any decent way, shape or form - so I don't see the problem here.
Hah, hah! Enough said.
And also - what's it to you? If Microsoft (in your view) had a good clustering solution, you'd lose sleep over that?
When you're biased like that, no wonder you can't have a quality, unbiased opinion on this topic.
The main problem is that building a fault-tolerant server is an arduous task. It takes a lot of engineering and testing. This slows you down and your product cycles get long. When you bring your new machine to the market it will look old and slow compared to 'standard' competitors. In addition, your database will be a specialized, proprietary version which does not work with any tool, and the admin staff needs special education to manage and operate it.
Clusters are different. Just take your latest and greatest server and middleware, package it with a version of your clustering glue and voila - instant high availability. All your tools and admin knowledge is applicable because it's built on the same stuff you know already.
In addition, the real test, a real emergency, is unlikely to happen anyway. Even if it does happen and your cluster fails to provide the promised availability, there is no real problem besides your 1000 users being without their application for a day. You'll blame the problem on the vendor and your reputation is safe. This is why you bought the cluster in place of a single system anyway.
Markus
We run volumes of Dell 2850s with RAID arrays, redundant power, etc., powering high-volume websites... I can say first-hand that internal fault tolerance in these systems can only get you so far: the management device in charge of the two power supplies can itself fail, leaving both power supplies useless. Or a RAID card can go out of commission, leaving drives with mangled and unrecoverable data. As with most solutions, a mixture of both fault tolerance and data clustering is the safest alternative.
Clustering has a MAJOR problem going with it. Clustering requires applications to be written specifically to support clustering. All sorts of libraries have been written to "make this process easier", but one thing's for sure: it will require a recompile, and a lot of software is not designed by people who know what ACID means for databases. It is very hard to keep a hand-written app in a consistent state on all machines, knowing that any one of them might fail completely at any time (we only support complete failures; dysfunctional memory, for example, will not be reacted to).
Clustering: Several systems that do parallel computing.
Fault Tolerant Servers: Several systems with failover load balancers in front.
I get frustrated when people use the latter and call it the former. True, you could have fault tolerant servers in a single box, but why? In fact I'm rolling out infrastructure of the latter in large doses. ;) It doesn't have to be expensive either. So far the most expensive part seems to be a soft switch for SAN so I can use OpenAFS and scale storage space without downtime. Help in that area would be nice, btw. :)
This is how google dunnit. Very well in fact.
Karma: Chameleon (mostly due to the fact that you come and go).
Ian W.
There is nothing like OpenBSD running pf and carp. Dead easy to set up, works like a charm, and secure by default. One wonders why the editors seem to think OSS == Linux.
http://www.openbsd.org/faq/pf/index.html
http://www.openbsd.org/faq/faq6.html#CARP
Cypherpunks: Civil Liberty Through Complex Mathematics. Those who live by the sword die by the arrow.
Those that know the horrors of 3AM maintenance will have the fullest appreciation of being able to take servers out of service in the middle of the day for software and OS updates.
Expensive are the applications and OSes that can be upgraded without downtime, regardless of the fault tolerance of the server.
I have always clustered fault-tolerant servers. For important business applications there is no choice but clustering. However, I want to fail over to the standby node on my own terms... not because of a hardware failure. This solution gives you great availability along with the chance to make firmware/driver/hardware updates to the fail-to node during business hours. You can then fail over in a maintenance window and then update the other server during business hours.
BTW, SQL Server does not require that you buy licenses for the fail-to node.
Kind thoughts do not change the world
It all comes down to Availability (Clustering) vs. Reliability (Fault Tolerant). They are NOT the same thing.
Fault tolerant servers are nice; even the simplest true server should offer fault tolerance to a degree (e.g., RAID drives). This is handy but may not help your availability if you have an SLA promising xx% of uptime and then find yourself needing to take the server down to apply service packs or other patches.
Clustered servers allow you to increase the availability of your machines, because when you need to take one down for some updates, you can simply fail over all your traffic to the other server in the cluster. Clustering may increase the availability of the services those machines are offering, but it does not help the reliability of the machines themselves.
Therefore, I personally choose to start with fault tolerant machines initially (RAID and dual power supplies at a minimum). It makes for a good base. If the services on that machine are 'mission critical', then cluster that machine with other fault tolerant machines.
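The difference between making one box more reliable and putting several boxes side by side can be shown with textbook availability arithmetic. The Python sketch below assumes failures are independent, which real systems often violate (shared power, shared storage, the same bad patch on every node), and the 99% figures are invented for the example.

```python
from functools import reduce


def series_availability(*parts: float) -> float:
    """Availability of a chain where every part must be up (multiply them)."""
    return reduce(lambda acc, a: acc * a, parts, 1.0)


def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant copies where any one copy being up is enough."""
    if n < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - a) ** n


def downtime_hours_per_year(a: float) -> float:
    """Expected hours of downtime per year (8760 hours) at availability a."""
    return (1.0 - a) * 8760.0


if __name__ == "__main__":
    single = 0.99                          # hypothetical single server
    pair = parallel_availability(single, 2)
    print(f"one server:  {downtime_hours_per_year(single):.1f} h/year down")
    print(f"two servers: {downtime_hours_per_year(pair):.2f} h/year down")
    # A server behind a single 99% load balancer is a series chain again.
    print(f"pair behind one balancer: {series_availability(0.99, pair):.4f}")
```

The last line is the "single point of failure in the middleware" argument from earlier in the thread: redundancy behind a non-redundant component is capped by that component.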
--LWM
From a few people I have talked to who have actually worked with this, they tell me that it is a nightmare and that they would switch to something like NCR's server next time. Apparently they felt that running MS clusters was expensive and difficult and did not work well.
Interestingly, one of them also runs a Linux and an HP cluster and says they were much easier, and that they were moving their code base to Linux only.
I prefer the "u" in honour as it seems to be missing these days.
Not worth doing. The cluster components should be dumb. There isn't a valid reason to have them know about each other. Your round robin, or whatever balance you want, should come from outside. F5 makes a nice box for that, and so do others; if you're really a cheapskate and wanted to, you could duplicate them. If you need to have anything know about who is on what machine, let the system tell that to the backend DB machine. It should be a channel architecture, not a crazy tangle. The more you break the functions down on the system level, the better and faster your cluster will be.
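A minimal Python sketch of "round robin from outside, dumb nodes inside" follows. It is not how an F5 or any commercial balancer is implemented; the class name and node names are hypothetical, and marking a node up or down is assumed to be driven by some external health check.

```python
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hands out nodes in rotation; the nodes themselves know nothing of each other."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self._nodes: List[str] = list(nodes)
        if not self._nodes:
            raise ValueError("need at least one node")
        self._down: Set[str] = set()
        self._next = 0

    def mark_down(self, node: str) -> None:
        self._down.add(node)

    def mark_up(self, node: str) -> None:
        self._down.discard(node)

    def pick(self) -> str:
        """Return the next healthy node, skipping any that are marked down."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node not in self._down:
                return node
        raise RuntimeError("no healthy nodes available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])
    print([lb.pick() for _ in range(4)])
    lb.mark_down("web2")  # e.g. taken out for a daytime upgrade
    print([lb.pick() for _ in range(2)])
```

All state lives in the balancer, so a node can be pulled for upgrades or replaced without the other nodes noticing.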
Syncing databases on the other hand is tricky. Save your money and resources for that.
Sorry about the writing. Robot fingers, you know? Cliff Steele in DOOM PATROL #23
Fault-tolerant servers are the way to go for critical applications. Obviously critical applications need quite a bit of computational power too, so I'd consider the best approach to a situation like this to be a small cluster of servers, each being redundant. For example, say we have 6 servers to get this operation up and running; what in my opinion will be most reliable would be to do what was mentioned earlier and do both. It covers all bases, and really makes for a stable networking environment. There is an application built into OpenBSD used for making two servers act much like one; I believe it was called CARP. In principle, it can be used to make one server handle the primary functions while the other is on constant standby to take over operations in case of an incident. It might seem like a waste of CPU cycles, but it works out very well, especially if you turn 3 pairs of them into one Beowulf cluster.
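The active/standby idea described above can be sketched as a generic heartbeat timeout. To be clear, this is not CARP's actual protocol or configuration; it is a simplified illustration in Python, and the class name, the three-second threshold, and the step-down behaviour are assumptions chosen for the example.

```python
class StandbyNode:
    """Backup that promotes itself when the primary's heartbeats stop arriving."""

    def __init__(self, started_at: float, dead_after: float = 3.0) -> None:
        self.last_heartbeat = started_at  # treat startup as the first heartbeat
        self.dead_after = dead_after      # seconds of silence before takeover
        self.active = False

    def on_heartbeat(self, now: float) -> None:
        """Primary is alive: record the time and step back to standby."""
        self.last_heartbeat = now
        self.active = False

    def tick(self, now: float) -> bool:
        """Called periodically; returns True while this node should serve traffic."""
        if now - self.last_heartbeat > self.dead_after:
            self.active = True
        return self.active


if __name__ == "__main__":
    node = StandbyNode(started_at=0.0, dead_after=3.0)
    node.on_heartbeat(1.0)
    print(node.tick(2.0))   # primary heard recently, stay on standby
    print(node.tick(4.5))   # silence longer than dead_after, take over
    node.on_heartbeat(5.0)  # primary is back
    print(node.tick(5.5))
```

Timestamps are passed in rather than read from the clock so the behaviour is easy to test; the choice of `dead_after` is the usual trade-off between fast takeover and false alarms.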
I too choose clustering, for a variety of reasons:
o Scalability, meaning when you're done scaling UP, you can scale OUT!
o High availability: nodes go down all the time, and then other nodes can pick up the load and provide continuity!
o Performance enhancement: you can address performance problems by dividing the load
to name a few important aspects of it...
Scott McNealy to Michael: "Suck my Sun!" Michael Dell to Scott : "Lick my Dell!"
With a cluster, you simply add another machine to the cluster when you need more computing power. You can also take a single machine off the cluster for upgrades, hardware troubleshooting, or to reallocate the single machine to do something else.
As other posters have said, a large factor in deciding what to do depends on the application. Google wouldn't be where they are today if they used a fault tolerant system instead of the massive cluster technology they use today. In fact you could say that Google has built a fault tolerant system using cluster technology.
On the other hand, there are some apps (such as databases) that are tricky to cluster in a way where the performance benefit outweighs the problems associated with it.
One of the big themes at LinuxWorld 2005 in SF was virtualization on top of clusters. You get the look and feel of a single machine and also get the power and availability of a cluster. Oracle's 10g database makes use of this architecture. Another company called Virtual Iron uses VMware across 16-way clusters through high speed interconnects for their solutions.
What about iFolder? Looking at the specs, I think it's missing serverless/hiving (which could be provided by any of the normal p2p people) and file history... and I'm not understanding your database object comment.
;^)
Speaking of which, what about freenet? The only thing it's missing is "guaranteed availability of critical business data", eh? And I hear it might have some performance problems.
--Robert
High Availability is all about cost/benefit. RAID and a redundant power-supply are both reasonably cheap for smaller systems, and increase system management complexity only a bit. They are also fairly limited in what they can protect against: certain disc or power supply failures.
A cluster can, if properly designed, protect against all sorts of failures: disc, power supply, controller, motherboard, CPU, backplane, cable, network, some designs can even deal with physical disaster like a fire in one of your server rooms and fail over to another or even another geographic location. However, the more protection you add, the more time it takes to implement, test, and maintain.
Tandem, one of the large vendors of fault tolerant hardware/software systems published a report in the late '80s saying that with recent advances in hardware and software, the major cause of system outages was now due to human error: administrators removing the active CPU when trying to replace a failed CPU for example. To properly implement a cluster can involve dozens or hundreds of hours of staff time setting things up, testing, and documenting it all. Especially if it's your first time, I'd say that budgeting 100 hours isn't unreasonable.
With HA clusters, the devil is definitely in the details. For example, incorrectly implementing shared storage locking can mean that an unplugged network cable can result in having to re-load your systems from backups. In that case you're far worse off than if you had no HA at all. Sure, this is a nightmare scenario that hopefully shouldn't happen in production if you do appropriate testing, but I use it to illustrate a point.
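One common guard against the unplugged-cable scenario is a majority quorum rule: a node may touch shared storage only while it can see more than half of the cluster. The Python sketch below illustrates that general idea only; it is not the locking scheme of any specific cluster product, and the function name is made up for the example.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this partition holds a strict majority of the cluster.

    `reachable_nodes` counts the local node plus every peer it can still see.
    After a network split, at most one side can hold a strict majority, so at
    most one side keeps writing to the shared storage.
    """
    if cluster_size < 1 or not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes > cluster_size // 2


if __name__ == "__main__":
    # Three-node cluster, one node loses its network cable.
    print(has_quorum(2, 3))  # the two connected nodes keep running
    print(has_quorum(1, 3))  # the isolated node must stop writing
    # Two-node cluster: after a split neither side has a majority,
    # which is why such setups usually add a tiebreaker of some kind.
    print(has_quorum(1, 2))
```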
Usually HA is implemented in places where downtime has a real cost, so you are paying more for maintenance and hardware so that you don't have to pay (usually many times) more in lost revenue and/or reputation from downtime.
Sean
If you really have mission critical applications that can never go down just get an IBM Mainframe. You can replace just about any part in the system without it going down. They even have extra CPUs they can bring online if one fails. Oh, and if you are really paranoid you can cluster them together in different geographic areas.
Read a lot of misinformation on this thread. Properly designed, a fault tolerant machine should NOT require downtime to replace a failed component, as all components (including CPU modules) should be hot-pluggable. In general, a fault tolerant system should be able to shut down a single failed component and keep going without any noticeable impact on processing. A cluster may take some time to switch over, depending on whether it is a fail-over system, or may need to restore / restart / migrate a checkpointed task. Fault-tolerant systems and high-end clusters are generally expensive. Low-end fail-over systems less so. Is it worth the cost? That depends entirely on the application - in particular, what is the cost / impact of downtime? System availability is NOT solely dependent on the FT / cluster box - redundant power, networking (including WAN and Internet connections), and physical and network security must be considered. Finally, external events (like hurricanes!!) must be considered - and a carefully crafted disaster recovery plan is a must.
[Insert pithy quote here]
Ones too many, and 100 is not enough?
I prefer the "u" in honour as it seems to be missing these days.
After all, who has ever heard of a "fault-tolerant fuck?"
you have to PAY for proprietary???
Woops.
Google proved that clustering could be fault tolerant, while costing less than true fault-tolerant hardware.
Google built massive clusters of thousands of machines out of very cheap unreliable hardware. They have tons of hardware failure due to the extremely cheap components (and sheer number of machines), but everything is redundant (And fully fault tolerant).
They did this, again, using dirt cheap hardware.
i run a relatively large .com website + servers (main dev/it) and our compaq/hp servers are all fault tolerant. they do an exceptional job. our webserver recently had the primary psu die, we were alerted and it was replaced without the machine going offline.
I've worked with both fairly extensively and I'd have to say that although NetWare clusters seem to be more stable than Windows clusters, neither is a great solution for anything...
In my experience, the Windows Cluster Nodes will fail into some sort of "undead" state, in which the dead node isn't quite dead yet and the live node never quite picks up the slack, so you end up having to reboot both of them...
The NetWare Cluster Nodes have such a hair-trigger with the default settings that they seem to fail-over for no particular reason and get into a tail-chasing situation, which would be amusing if you didn't have 200 screaming attorneys looking over your shoulder as you try to stop the failover merry-go-round...
Goofy, Geeky Gifts and More!
At my last company we used to push Windows NT Adv. Server or 2000 Adv. Server because NLB (or WLBS) comes packaged with it or is available for it. WLBS (the original name) is Windows Load Balancing Service. It allows you to set up the same IP addy on multiple systems and it handles all the routing. Never once have I had a customer experience hardware related downtime using the system. Software became more difficult to manage when using backend DB's but we made it work.
You can also buy clustering routers, but they cost $10,000 and really the speed difference is not noticeable (we were doing CC routing, always under 3 second responses required.). It would easily handle 200-400 transactions per second.
We also had hot swappable boxes (mostly STRATUS boxes) but I never liked them due to pricing. Clustering ALWAYS worked out cheaper from a hardware stance, and IMHO improved response times by sharing the load. Anyways, I know some of you won't like this, but NLB/WLBS worked like a charm. Of course, MS didn't design the software originally, it was a buy out..
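For anyone wondering how "the same IP addy on multiple systems" can work at all, here is a rough sketch of one way to do it. Every node sees every incoming connection, and each node independently hashes the client address to decide whether the connection is its own. This is an assumption-laden illustration, not a description of how NLB/WLBS is actually implemented; the function names and the CRC32 hash are made up for the example.

```python
import zlib


def owner_index(client_ip: str, node_count: int) -> int:
    """Map a client address to exactly one node, deterministically.

    CRC32 is used only because it gives the same answer on every node;
    it is an illustrative choice, not what any real product uses.
    """
    return zlib.crc32(client_ip.encode("ascii")) % node_count


def should_accept(my_index: int, client_ip: str, node_count: int) -> bool:
    """Each node runs this on every connection it sees and drops the rest."""
    return owner_index(client_ip, node_count) == my_index


# Three nodes share one address; exactly one of them claims each client.
for ip in ["192.0.2.17", "198.51.100.4", "203.0.113.250"]:
    takers = [i for i in range(3) if should_accept(i, ip, 3)]
    print(ip, "-> node", takers[0])
```

The catch is what happens when a node dies: the survivors have to agree on a new node count, and existing sessions may land on a different box, which is part of why the backend DB side gets harder to manage.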
In my mind it comes down to whether the service you are trying to provide needs to present synchronized and consistent state to each user in the presence of asynchronous updates or not. If you have to distribute and synchronize significant amounts of changing state across the entire user base then you might want to go with a shared memory/shared resource HA/FT solution. A database app is often of this type. If you don't have this requirement (if the content is static, etc.) then a cluster solution might be better. There are certainly many ways to offer synchronized, shared state in a cluster solution but having to do this over a network, through standard network protocols, is slower, heavier, harder than in a single machine, shared memory environment.
For a reliable environment you want a reliable platform and you want enough platforms (clustered) so that dropping one site doesn't mean no service.
These technologies can not be compared in an apples-to-apples fashion. Clustering solves performance AND reliability problems. Fault Tolerance just solves reliability issues.
Performance + Reliability = Clustering
Reliability alone = Fault Tolerance
-Jim
"money is the purest form of energy on this planet"
When discussing Fault Tolerant vs. Clustering systems it's extremely important to discuss the need for scalability. Clustered systems are inherently scalable, while fault tolerant systems are not (in general).
For my business needs I usually see clustered systems as a much better solution than fault tolerance. When dealing with systems that require fault tolerance you are mostly concerned with keeping the data they store available (database servers, file systems, etc). When dealing with systems where high availability is required for data, 99% of the time you are dealing with systems that will need to be responsive to an increased scale.
DFS and HA in SQL 2005 or 10g are examples of where a clustered system really couldn't be replaced with fault tolerance.
Your mammas flamebait.
You might check out OpenAFS. I'm not sure it meets all your requirements, though.
What about AFS, which stands for Andrew File System? It was developed at CMU and allows dynamic backup of data (it automagically copies your data to different physical volumes). I've never even heard about data being lost on an AFS system and it supports very high security too. Then just build your code on top of the UNIX commands or AFS file API. But then again, it might be a bit much for your requirements. I don't know of a windows client version but one might exist. And the wayback part you might have to write yourself but it might be supported as well. Check it out if you are interested.
"Those that start by burning books, will end by burning men."
How much load is your site going to need to handle? If it's high, clustering is a darn good idea, because the separate machines will share the load on top of giving you redundancy. If the load expected is low, a single fault-tolerant machine will be easier to maintain.
This especially goes for multiple services, and you may want to mix-and-match. For a CGI+SQL combo, you may prefer to split the web load over a cluster, but you may want to forego the complexity of a clustered database and put your SQL server on a single redundant box.
"...I heard this ad that said it runs faster, costs less and never breaks!"
...such a smart bunch of chaps 'eh, ...why didn't we think of that a long time ago?
..but wait, ...we did!
What a bright idea,
Crunch!
iFolder is so-1990's to me, heh. Freenet seems doomed!
The war is on:
A. huge megaservers online serving thin/dumb terminals over high speed network connections (renting processors and storage and even apps all on demand with backups)
B. P2P with cheap clients and cheap shared in-client storage
I don't know which way is better. High bandwidth will get cheaper and more available every day.
For now, I'm betting on DumbClient/MonsterServer being the cheapest both initially and in the long run when 10Mb connections to the Net are the norm.
Yet internal P2P seems more secure and more fault tolerant.
Database Object just means a hive IS a database. One object could be "MarsVoltaSong.mp3" or "John Jones Contact Record"
Your contact manager would access the hive to retrieve your contacts. Super-secure databases could have public description keys with encrypted actual data.
Imagine a Beowulf cluster of server clusters. Oh wait...
Let's say you have mission critical application SuperUpTime(tm) to run 24/7 and above anything else; the box can never go down. Sure, you can go the fault tolerant route, stuff it on a FT box and hope for the best. But a comet has hit directly into the building, flattening everything including your FT box. Or the airco has leaked and shorted out the FT box. Oops, bye-bye SLA.
This is one of the benefits of clustering, you can have nodes at different sites. As long as both sites are far away enough from each other to survive natural or accidental disasters; SuperUpTime(tm) will keep running.
Suppose SuperUpTime(tm) is really popular, the load far exceeding what was anticipated. The box has been filled with the maximum amount of cpu and memory supported, but it's still suffering under the load. With a cluster, you're not constrained by the physical design limitations of the FT server, since you can add a beefier node to the cluster at any time.
If it was up to me, i'd go the cluster way; but that depends on what's needed. If you have a good, competent sysadmin that knows what he's doing, he can admin several clusters no problem. But if you'd rather sink money into a FT system, not trusting humans.. then by all means, go the FT way. However I have yet to see a FT system that is truly redundant.. If there is a bad CPU for example, can they guarantee there is no way it can bring down the entire box? By constantly initiating panics, etc.. Then you have to intervene manually etc. If a cluster node goes down, a failover should happen automagically..
There is a commercial implementation of Andrew called DFS (Distributed File System), sold by IBM. It is mostly used by banks and universities AFAIK due to the mentioned strong integrity and security features.
It IS possible to chuff things up, mainly by making administrative errors.
* backups
* authentication/permissions
* simultaneous use of the same file
etc...
These are problems that have already been addressed in most corporate LANs. Fault tolerance is an issue, yes, but if I had to trade the few items above for the extra tolerance that a P2P network gives me, I'd stay with the regular 'ol client-server model.
I'm not saying that P2P isn't a potential solution for the future, but for this application, it's not ready yet. In my experience, the problem isn't that desperate.
Been there done that, AFS http://www.faqs.org/faqs/afs-faq/ works wonders. Pretty much it's a nice fault tolerant file sharing system that supports disconnected ops, meaning you can work with everything in disk cache and checkout / checkin things as needed.
No sir I dont like it.
Clustering provides you with Fault Tolerant OS/Applications. A single server with tons of redundant bits doesn't help you if the OS or Applications that it serves get borked.
This is dead-on correct. For example, if a CGI hits a problematic state where it eats a lot of memory, putting the server into a state where it's swapping, then it takes longer to service each http transaction, which means more httpd transactions queue up, which means more memory gets allocated, which means more swapping .. rendering the machine useless for a little while (until a sysadmin or a bot notices the state and either restarts the httpd or kills a few select processes). If we were running this on one mammoth server with lots of redundant bits, then 100% of our web service capacity would be down in the interim. But since we run a pool of ten http servers under keepalived/IPVS, we only lose 10% of our capacity during that time.
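The arithmetic behind that 10% figure, as a quick sketch. It assumes identical servers and a balancer that spreads load evenly and stops sending traffic to the wedged box; the function name is made up for the example.

```python
def remaining_capacity(pool_size: int, failed: int) -> float:
    """Fraction of serving capacity left when `failed` of `pool_size`
    identical servers are out of rotation."""
    if pool_size <= 0 or not 0 <= failed <= pool_size:
        raise ValueError("need pool_size > 0 and 0 <= failed <= pool_size")
    return (pool_size - failed) / pool_size


# One mammoth server stuck swapping: nothing left to serve with.
print(remaining_capacity(pool_size=1, failed=1))
# One of ten pooled http servers stuck swapping: nine still serving.
print(remaining_capacity(pool_size=10, failed=1))
```

Note this only holds if the pool had headroom to begin with. If all ten boxes were already running flat out, losing one pushes the other nine over the edge too.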
Other reasons I've traditionally preferred clustering: easy to incrementally scale up infrastructure (no big buy-in in the beginning to get the server which can be expanded), fully parallel resources (an independent memory bus, an independent IO bus, two independent CPU's, an independent network card, and a few independent disks for each server, as opposed to a mammoth shared bus on a leviathan crossbar, which will inevitably run into contention), and more flexibility in how resources are divided amongst mutually exclusive tasks.
One of those reasons is getting less relevant -- point-to-point bus technologies like HyperTransport and PCI-Express are inexpensively replacing the "one big shared bus" with a lot of independent busses, transforming the server into a little cluster-in-a-box. It is a positive change IMO, and shifts the optimal setup away from the huge cluster of relatively small machines, and towards a more moderately-sized cluster of more medium-sized cluster-in-a-box machines.
The price of licenses is, IME, rarely an issue (in my admittedly limited career -- I don't doubt that it's relevant to many companies) because the places I've worked for have tended to use primarily free-as-in-beer (and often free-as-in-speech) open source solutions. What is more of an issue, IME, is the necessity of staffing yourself with cluster-savvy sysadmins and software engineers. Those of that ilk tend to be a bit rare and expensive, and difficult to keep track of. It takes a distributed systems professional to look at a distributed system and understand what is being seen, and this makes it easy to bend the spec or juggle the schedule on the sly, or run skunkworks projects outright. By contrast, the insanely redundant, mondo-expensive uberserver was created and programmed by very smart hardware and software specialists so that your IT staff doesn't need to be so specialized. This makes useful talent easier to acquire, and understanding the system closer to the reach of mere mortals.
Just my two cents
-- TTK
Clustering is great if it's simple, such as web servers. However, removing single points of failure is complex, in terms of software, hardware and network traffic. The solution as a whole can fail, say, because of stupid clustering software. E.g. a Microsoft/HP cluster setting the same MAC address to the entire cluster. Or forgetting to put a UPS on the air conditioning. I am all for one big powerful, but simple computer. They are expensive, but at least they don't run Windows.
I've played with it. It seems more of a backup bandaid than a realtime data hive like I'm thinking.
I may try to torrent a corporate network if I can find a good file "explorer" or file access subsystem that integrates into Windows.
Human error has for years been the ghost in my metrics, not hardware or software failures. Sure, hardware and software goes bad, but the really big hits I've experienced in the past 15 years were because:
* An employee's four year old pushed the big red data center button that spelled his mother's ultimate career doom.
* Our CEO refused to release the capital needed to replace the UPS batteries. Boom! Sure, grid power was available, at least until the fire captain ordered the power cut.
* A new sysadmin hit the rack power switch instead of the server.
* An application administrator (SAP) with temporary root access learned that with power comes responsibility and that responsibility requires competence!
* For every action under the raised tile there is a cubed reaction. That pots cable was connected to the OC3 fiber.
And that's just five!
The only "truly" fault tolerant machines I have seen, other than custom-built military hardware are the "Stratus" (www.stratus.com) servers - everything is doubly redundant. You can walk up to a server and pull CPU cards and memory out at random, and the machine doesn't even hiccup. You can hot-plug replacements back in, and the machine doesn't miss so much as a single clock cycle.
Any other form of high availability amounts to pre-emptive rebooting. You keep a warm spare machine next to the original to pick up the workload in case of failure. It's just like rebooting, but doing it in advance on a spare machine just to be ready.
For *most* applications the warm spare approach works reasonably well. But in real-time control applications, the few seconds spent loading up context to restart from a checkpoint can kill people.
Would you trust a windows cluster to drive your car on the highway? Or fly the jet you are riding in? Or run the anti-missile phalanx gun on a missile frigate?
Real fault tolerance is like a parachute. Most of the time you have no use for it. But when you need it, nothing else will do.
"Sic Semper Path of Least Resistance"
With fault tolerance, you still have a single point of failure in the chassis, one that can be mitigated but not eliminated. Clustering gives you two separate units that can be on opposite sides of the datacenter, or even across town with the Metro Ethernet infrastructures available today. As costly as it may seem, if you want to save your arse, you are better off clustering than having to explain why your company lost $2 million for the downtime when it could have been mitigated with a 7k
-- NeonRonin
Backups are integrated in the hive. I think a backup node in the hive could stream backups constantly.
Authentications/permissions can be realized by using a registry-like Address/Key/Source structure. The address of a chunk in the hive designates what data it is, the key can be 0 for public or an encryption key known to client apps permitted to access the chunk. Source is the data (encrypted or otherwise).
Since the client node is responsible for reassimilating chunks it hived out, the encryption is twofold: cracking the key only gets you a bite. You need to know what other chunks connect to yours to eat a full plate.
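To make the Address/Key/Source idea concrete, here is a toy version of the structure. It is only a sketch under loud assumptions: the class and field names are invented, the hive is a single in-memory dict rather than a set of peers, and the key is merely compared rather than used to encrypt anything (a real hive would store Source encrypted, as described above).

```python
import hmac
from dataclasses import dataclass
from typing import Dict, Optional

PUBLIC = b""  # stands in for the "key is 0 for public" convention


@dataclass(frozen=True)
class Chunk:
    address: str   # designates what the data is
    key: bytes     # PUBLIC, or a secret known to permitted client apps
    source: bytes  # the data itself (left unencrypted in this toy)


class Hive:
    """Single-process stand-in for the distributed chunk store."""

    def __init__(self) -> None:
        self._chunks: Dict[str, Chunk] = {}

    def put(self, chunk: Chunk) -> None:
        self._chunks[chunk.address] = chunk

    def get(self, address: str, key: bytes = PUBLIC) -> Optional[bytes]:
        """Return the Source only to callers holding the right key."""
        chunk = self._chunks.get(address)
        if chunk is None:
            return None
        if chunk.key == PUBLIC or hmac.compare_digest(chunk.key, key):
            return chunk.source
        return None


hive = Hive()
hive.put(Chunk("music/MarsVoltaSong.mp3/part-0", PUBLIC, b"first bytes of the song"))
hive.put(Chunk("contacts/john-jones", b"contact-app-key", b"John Jones, 555-0100"))

print(hive.get("music/MarsVoltaSong.mp3/part-0"))          # public chunk, no key needed
print(hive.get("contacts/john-jones"))                      # no key, so nothing comes back
print(hive.get("contacts/john-jones", b"contact-app-key"))  # permitted client
```

The "bite versus full plate" property is not modelled here: that would come from the client keeping its own private map of which addresses make up a file.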
The problem to me IS desperate. In all of my businesses my main goal is "how do I make myself obsolete?" It sounds counterintuitive yet it lets my customers see that I don't want to do the same work ad infinitum. I want to return a nice profit to my customers for the money they pay me. $150/hour should save my customers $225/hour in added productivity. Down time is a huge net loss.
Proper data utilization is money made and time saved. Programmers facilitate that. Accessing the data to be utilized is my job. I want it transparent with 100% uptime. Client dies? Pop in a replacement.
Oh. Be patient, the solution is coming within the next year or so (we are currently alpha). That is all I can say at the moment. Any more features you want to dream up?
You say you got a real solution
Well, you know
We'd all love to see the plan
(The Beatles)
I used to build HACMP clusters also, although I haven't in 5 or so years. One of the key points in the HACMP class was that clustering and fault-tolerance are for different purposes. Fault tolerance is required for systems that ABSOLUTELY MUST NOT fail, e.g. aircraft avionics. High availability clusters are designed to provide high uptime with a known (small) acceptable amount of downtime. Fault tolerant systems generally cost at least one order of magnitude more than highly-available systems.
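That "known (small) acceptable amount of downtime" is easy to turn into minutes. A small sketch follows; the availability percentages are round numbers picked for illustration and do not come from the HACMP class or from any vendor.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime budget implied by an availability target (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# Illustrative targets only.
for label, target in [("99%", 0.99), ("99.9%", 0.999), ("99.99%", 0.9999)]:
    minutes = downtime_minutes_per_year(target)
    print(f"{label:>7}: about {minutes:.0f} minutes of downtime per year")
```

If the budget that falls out of this is something your application can live with, an HA cluster is in the running. If the honest answer is "zero", you are in avionics territory and paying the order of magnitude.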
They simply are not. There are too many things to go wrong even if you double up every single component with the idea that if one side goes it will run on the other while you unplug the bad part and plug in the good part. It's been my experience that as likely as not they will both go at once anyway. I've had this happen with dual fault tolerant RAID systems more than once.
The neat thing about clustering, in my opinion, is that you get built-in redundancy AND you get the ability to take care of an increased load very neatly. Too many processes? Just plug in another CPU to the cluster and load balance it out. Did one break? OK, so you're under a load for a while until you can plug in a good one, but at least the app doesn't go down. And you DO have an extra ready to go just in case, don't you?
I built a cluster supporting a couple hundred thin clients, which tend to proliferate ("Hey! Gimme ten more in this room!") and once the bugs were worked out (Had to load balance the brain cells to get things working right) it works slick. If I had to design and spec a system from scratch knowing what I know now, I'd do it again in a heartbeat.
So I say: Backup! Backup! Backup! and Cluster! Cluster! Cluster!
How about a moderation of -1 pedantic.
... a Fault-Tolerant Server of these!
So Long and Thanks for All the Fish!
You know, what you are describing looks an awful lot like a distributed version control system with bittorrent as a transport. Of course with bittorrent you need a server to act as a tracker though.
Why couldn't something like SVK work for this?
evil is as evil does
Both.
Big (fault tolerant or not) single-image Symmetric Multi Processing systems are certainly NOT, on a $/cpu basis, cheaper than a cluster of the same number of cpus. Vendors not only make lower margins on clustered systems, the absolute amount of the cost is lower. Clustering with small-way (typically 2-way) commodity systems will always be cheaper than big SMP - whether you get real (or in the case of some clusters, effective virtual) fault tolerance or not. The sensible rule-of-thumb today is that if your application lends itself to being parallelized easily, you cluster - if it absolutely requires a single MP image of x cpus, you have to do SMP. And most things in the real world fall in between. So, if you CAN cluster effectively, you SHOULD cluster.
Next question.
Answer: Neither. They are both dinosaurs.
My point is that traditional high-availability solutions are not getting it done any more. None of the customers that I work with are thrilled to spend money on any solution where all the hardware is in one location exposing the application to being wiped out by a hurricane or a terrorist attack.
Of these two solutions, clustering does provide some flexibility to implement in different geographic areas but most clustering products fall short of features that support enhancing application availability across geographically-dispersed sites. This is a huge feature hole that is only answered with custom integration currently. In fact, it's paying my bills quite nicely these days, but I do wish there was more support for this kind of solution.
"Never under-estimate the bandwidth of a FedEx truck" -- Availability Consultant in response to query about the "best way" to get data from one data center to another.
"Ahhhh, best laid plans of mice and men... and Cookie Monster." -- Cookie Monster, Sesame Street
A virtual file server would be an extremely cool thing. I could have sworn somebody had something that would allow you to take N bytes of disk from a workgroup of workstations and coalesce it into a virtual disk sharable with the workgroup.
Just extend the idea to include replication of the virtual storage. You could either allow splitting of the data as above, or require each replicated segment to be 100% complete (ie, any one workstation has the whole server's storage), or some mixture of all of the above.
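A rough sketch of what that replication choice looks like, anywhere from "split the data" (one copy) to "every workstation has everything" (as many copies as workstations). The placement rule here is plain round-robin and the host names are invented; nothing about it comes from a real product.

```python
from typing import Dict, List, Set


def place_segments(segments: int, hosts: List[str], copies: int) -> Dict[int, List[str]]:
    """Assign each storage segment to `copies` distinct hosts, round-robin."""
    if not 1 <= copies <= len(hosts):
        raise ValueError("copies must be between 1 and the number of hosts")
    n = len(hosts)
    return {
        seg: [hosts[(seg + i) % n] for i in range(copies)]
        for seg in range(segments)
    }


def readable_after_loss(placement: Dict[int, List[str]], lost: Set[str]) -> bool:
    """True if every segment still has at least one surviving holder."""
    return all(
        any(host not in lost for host in holders)
        for holders in placement.values()
    )


workstations = ["ws1", "ws2", "ws3", "ws4"]
mirrored = place_segments(segments=6, hosts=workstations, copies=2)

print(readable_after_loss(mirrored, {"ws1"}))          # one box powered off
print(readable_after_loss(mirrored, {"ws1", "ws2"}))   # two neighbours powered off
```

With two copies the virtual disk survives any single workstation going away, but not two that happen to hold the same segment. Cranking copies up to the number of hosts gives the "any one workstation has the whole server's storage" case, at the cost of syncing every write everywhere.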
I think the downside comes from keeping it all in sync. Probably viable on a high speed LAN, but will fall apart exponentially as you get onto WAN networks.
If sneakernet were down, would that imply you were injured or had passed on to the great bitbucket in the sky?
Too many ideas :)
Here are some:
* Bandwidth designation for each node
* AI style workgroup-relations caching -- users embed information into chunks offering other hiveminds a chance to stock up on common data for faster response
* Address/Key First sorting for confirming files accessed are the most recent
* Data-In-Use flags (flags should be hive updated every so often to entrust that data is still in use)
* Momentum Updates (Organized hive updates through friendship pairs or quads keeping data chunks closer to home)
* Administration Worldmap (show chunk usage, lifespan, friendship groups, connection uptime)
IMHO, torrent is too anonymous. I'd rather see chunks offered within the friendship pair/quad only. Easier to notify hiveminds of updates or in-use flagging.
there's Google.
The higher the technology, the sharper that two-edged sword.
I have had a fair amount of experience with linux clustered filesystems (primarily with block based ones not file based). The performance, reliability, and feature set for each of the ones I have looked at have varied greatly depending on your application and hardware. Some samples of clustered filesystems available for Linux:
x /core/4/i386/os/Fedora/RPMS/GFS-6.1-0.pre22.6.i386 .rpm
Polyserve
- closed source / proprietary
- supports Oracle RAC
- distributed locking manager
- supports up to 16 servers
- has load balanced NFS
- supports SuSE and Redhat
- requires a SAN
- block based I/O
- limited hardware support
Redhat GFS (formerly Sistina GFS)
- open source
- supports Oracle RAC and is also supported by Oracle
- distributed locking manager in 6.1, single metadata lock manager in 6.0
- can work with or without a SAN
- SAN requires pool driver in 6.0, or open source LVM 2 in 6.1
- supports up to 256 servers?
- works on other distributions, but good luck getting it to work
- is available in Fedora Core for those not worried about support
- block based I/O
- extensive hardware support
IBM GPFS on Linux
- closed source / proprietary
- supports Redhat and SuSE
- lots of servers, don't remember if there is a theoretical limit or not
- quasi-distributed locking manager, RTM for more details
- block based I/O
- supports IBM hardware + some smattering of major competitors
- requires a SAN (preferably one of IBM's)
Lustre on Linux
- open source
- distribution agnostic
- lots of servers
- lock manager is not very good
- does not require a SAN
Panasas
- closed source
- file based I/O
PeerFS
- closed source
Honestly clustered filesystems can be more trouble than they are worth. With the exception of Redhat GFS and a smattering of others that aren't as advanced, there are few open source options for implementing a 'true' clustered filesystem. Most of the setups I have seen end up having so many moving parts that they are often very fragile. Having said that if you want to get started try GFS. It is a decent introduction to the clustered filesystem world:
http://download.fedora.redhat.com/pub/fedora/linux/core/4/i386/os/Fedora/RPMS/GFS-6.1-0.pre22.6.i386.rpm
The agency bought a pair of dual proc Dells with lots of RAM and a full software stack (Windows Server, SQLServer, and ColdFusion Server). Total cost: ~$57,000.
That's right, nearly 60k.
Now, I've read that Google buys their white boxes at $1k each for their server farm. And I couldn't help but think what they (or I) would do with 57 boxes instead of 2.
But hey, my opinion doesn't matter. I'm not a PHB in a gov't agency. But sure as hell, if I were a business in a competitive environment (and a gov't agency is not), I'd be looking to implement the simple and effective white box solution on the cheap. But that's just me.
I think the downside comes from keeping it all in sync. Probably viable on a high speed LAN, but will fall apart exponentially as you get onto WAN networks.
So you want something like a NAS iSCSI using LVM on RAID5 of NBDs? There are *certainly* going to be delays. It is easier and cheaper just to get two or three NAS boxes and setup Linux HAC.
There was. I think it was called Orange or something fruity. I played with it and it was terrible.
The solution is out there in the ether. Even over WANs it is viable as nodes can search for chunk updates by merely requesting Address/Key instead of Address/Key/Data. Of course topology outage would be murder, but that's true with client-server, too.
I have a customer with 1TB available in the server, 700GB in use. Their 150 workstations have 30TB free. Their server frequently gets bogged down and needs constant hardware improvements. A Virtual File Server (hive) would increase uptime, decrease costs and increase performance.
You could also build a system where critical data is distributed to where it needs to be used, then updated dynamically.
Examples of such systems would be BIND for DNS or OpenLDAP for LDAP. You can have hordes of cheap servers running BIND or OpenLDAP that build from a kickstart or mondo CD, and get really amazing reliability overall. And no backup tape system required - the systems themselves can be your backup store.
It's not a solution for every problem, but neither are fault-tolerance or clustering. You need to have more than one tool in your toolkit.
Yes. It should analyze the porn in users' home dirs, then seek out new, but similar porn. Sort of like tivo.net, only for Linux geeks.
Do daemons dream of electric sleep()?
I think there will be room for things like that and more, whenever it actually gets rolled out. Exciting, huh?
You say you got a real solution
Well, you know
We'd all love to see the plan
(The Beatles)
Voice calls which transit a Lucent 5ESS switch generally get touched by many UNIX based systems, and there are at least 2 of every server, which process the result independently and compare answers. That's the environment I work in.
I think, personally, that a mixture between clustering and fault tolerance is best. Sell systems with multiple redundant processors and high levels of fault tolerance, and cluster them together 3 or 5 ways, and you'd have a highly bulletproof setup.
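The "process independently and compare answers" part can be sketched in a few lines. This is only an illustration of the general idea, not how a 5ESS or any real telecom system does it; the replicas here are plain functions and the fault is simulated.

```python
from typing import Callable, Sequence, TypeVar

Req = TypeVar("Req")
Res = TypeVar("Res")


class ReplicaMismatch(Exception):
    """Raised when redundant replicas disagree on an answer."""


def run_and_compare(replicas: Sequence[Callable[[Req], Res]], request: Req) -> Res:
    """Run the same request on every replica and insist the answers match."""
    if len(replicas) < 2:
        raise ValueError("need at least two replicas to compare")
    results = [replica(request) for replica in replicas]
    if any(result != results[0] for result in results[1:]):
        raise ReplicaMismatch(f"replicas disagree: {results!r}")
    return results[0]


def healthy(digits: str) -> int:
    return sum(int(d) for d in digits)


def faulty(digits: str) -> int:
    return sum(int(d) for d in digits) + 1  # simulated hardware fault


print(run_and_compare([healthy, healthy], "5551212"))
try:
    run_and_compare([healthy, faulty], "5551212")
except ReplicaMismatch as err:
    print("caught:", err)
```

With two replicas you can only detect a disagreement, not tell who is right. That is one argument for the 3 or 5 way setup: an odd number lets the majority outvote the broken one.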
There is Linux HA. This is High Availability Clustering software (via a heartbeat). Combine this with DRBD (Distributed Replicated Block Device) and you have a very robust cluster.
This uses an Active/Standby setup with a heartbeat between the systems. If the Active is no longer responding, within X seconds (10 by default) the Standby takes over all the processes that were running on the other system. And (if needed) it STONITHs (Shoot The Other Node In The Head) the other server to ensure that it really IS dead.
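Roughly, the standby's side of that logic looks like the sketch below. It is not Heartbeat's code or configuration, just the shape of the decision: the class name and the two callback hooks are hypothetical, and time is passed in by hand so the example runs without sleeping.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StandbyMonitor:
    """Standby half of an active/standby pair (illustrative only)."""

    deadtime: float                      # seconds of silence before takeover
    fence_peer: Callable[[], bool]       # hypothetical STONITH hook, True once peer is off
    start_services: Callable[[], None]   # hypothetical hook that starts services here
    last_heartbeat: float = 0.0
    active: bool = False

    def on_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat arrives from the active node."""
        self.last_heartbeat = now

    def tick(self, now: float) -> None:
        """Called periodically; takes over once the peer has been silent too long."""
        if self.active or now - self.last_heartbeat < self.deadtime:
            return
        # Fence BEFORE starting services, so two nodes never write the same data.
        if self.fence_peer():
            self.start_services()
            self.active = True


events: List[str] = []
monitor = StandbyMonitor(
    deadtime=10.0,
    fence_peer=lambda: events.append("fence peer") or True,
    start_services=lambda: events.append("start services"),
)

monitor.on_heartbeat(now=0.0)
monitor.tick(now=5.0)    # peer was heard 5s ago, nothing happens
monitor.tick(now=10.0)   # 10s of silence, take over
print(events, monitor.active)
```

The ordering is the whole point. An unplugged heartbeat cable looks exactly like a dead peer, and if the standby starts services without first making sure the other node is off, both sides end up writing to the same storage.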
We've been running webservers and Oracle database servers here with 0 downtime using heartbeat and drbd.
UPS Sucks
Though the article claims clustering is about selling hardware, it goes on to suggest fault tolerant systems by various manufacturers...
And clustering has some advantages over fault tolerant hardware when it comes to site [in]security.
Say for example you want to architect the new Iraqi stock exchange. Do you put all the hardware in the same place and go crazy on physical security and housing? Or do you distribute the hardware with redundancy over multiple physical sites?
You are probably cheap, because you KNOW the hardware is going to fail anyway, so why spend $30k plus on the latest SunFire-UltraSPARC or NEC Express5800FT when you can get a swarm of cheap intel servers for the same price.
/\/\icro/\/\uncher
Proponents of clustering neglect one thing - it mostly works, but requires a painful coding practice to prevent any loss of state when a failure happens. For the bulk of productions out there, this state cannot be transferred from box to box - find me a solution that'll real-time "cluster" a file-region lock, for example, of... who cares, a 5 meg autocad file. It's not likely to happen... users will get collisions, and the file will get chewed. Make it easy - cluster your favorite spreadsheet file, such that 50 people can edit it at once without clobbering each other. It's not going to happen - and hopefully you see my point about "state". Clustering is best used when the server-side is stateless... which is useless in most productions. File locking, for example, is a server-side state.
Years ago, a company named "Marathon Technologies" went after the fault-proof market, and succeeded quite well. They cut the problem into points of failure, and duplicated each of them.
The first POF was the context. They addressed this by having two machines handle the software state - literally, two PCs loaded with RAM, CPU, and a custom FPGA controller. No I/O, no keyboard, no mouse. The FPGA would keep the two contexts in near-lock-step with each other, effectively making a Raid-1 software state.
The second POF was the hardware. They addressed this by... you guessed it, two boxes, again with a raid-1 type of resource mirror. The boxes each needed the same config - right down to the mac address of the NICs you threw in them. Resources were then virtualized and redirected into the software context - whatever the context does to the hardware, it does it to both, simultaneously. (The only exception being the NICs - one would be "hot", the other a warm spare). If any combination of resources went away, it didn't matter - so long as you had one of *something*, *somewhere*, the software context would not notice. We took a lightning hit in 2000 - and one of everything died. No problem, though - it used the drive array from this box, the Nic from that box, the CDRom from this box, the keyboard from that box, the mouse from this box... and I'd never have noticed the failure unless I was standing at the physical racks.
"Failover" was instant, as there was no "failover". If something died, it didn't matter - you were already using the other one. The only "failover" that would take place is if a NIC died - and the time for the "warm" nic to fire up was under 10ms.
It had extra bonus points because you could separate the components by almost half a mile... and required ZERO rewrites of any software to use it. If the s/w would run on WinNT, then it'd run on this.
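The mirrored design described above can be sketched in a few lines of Python. This is only an illustration of the idea (every write goes to both halves, so losing one half needs no failover step); the `Replica` and `MirroredPair` names are hypothetical and have nothing to do with how the actual Marathon hardware worked internally.

```python
class Replica:
    """One half of a hypothetical mirrored pair; holds a full copy of the state."""

    def __init__(self):
        self.state = {}
        self.alive = True

    def apply(self, key, value):
        self.state[key] = value


class MirroredPair:
    """Applies every write to all live replicas, so there is nothing to fail over to."""

    def __init__(self):
        self.replicas = [Replica(), Replica()]

    def write(self, key, value):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("both halves are down")
        for r in live:
            r.apply(key, value)

    def read(self, key):
        for r in self.replicas:
            if r.alive:
                return r.state[key]
        raise RuntimeError("both halves are down")


pair = MirroredPair()
pair.write("balance", 100)
pair.replicas[0].alive = False   # one half dies
pair.write("balance", 90)        # the caller sees no error and no pause
result = pair.read("balance")
```

The calling code never changes when a half dies, which is the software analogue of "zero rewrites".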
Of particular fun was using the system to manage (trial) patches. Literally split the brain - isolate each half of the context & hardware from the other, so that each would think the other had died. You'd leave one half such that the users could continue to use it uninterrupted, while you try the patch on the disconnected one. Once done testing the patch, no big deal - just rejoin it, and it'll be brainwiped by the production one during the sync.
It was also useful for actual application of patches. Come time to apply, you'd freeze production, split the brain and shut one half of it off. Then commit your updates and gain confidence. If the updates succeed, just fire up the half that's off and it'll be overwritten with the updated version. If the updates fail - kill the failed half, and fire up the half you shut down. Rollback could be achieved in about 25 seconds.
Clustering cannot compare to this as far as availability is concerned... with zero downtime, zero loss of state, it's open and shut. It didn't scale well, sadly, and their newer versions don't thrill me - but the E4000 product was killer, hands down.
help me i've cloned myself and can't remember which one I am
...uptime is king. And redundant servers with hot-swap components and long life guarantees are the throne uptime sits highest upon. Clustering can be a good inexpensive solution, but it inherently brings in some downtime, so it cannot be the final solution in many applications. Redundant components give you the always-on functionality you may need. Clustered redundant machines are even better.
Don't forget that in a parallel redundant system, if the fail-over switch or mechanism is any less reliable than the individual components themselves, you might as well not use a fail-over system! Of course this is all theoretical, but IMHO any failover system that uses a software mechanism *built-in* to the devices that can fail is just plain stupid. It's akin to the 'software firewall' vs hardware firewall debate -- hardware firewalls are better because they isolate the hacker from your computer and increase security! If you truly want to build a foolproof redundant failover, it should be a separate hardware box, like a network switch that senses a failure and brings the other system online. Just from browsing this post and from casual knowledge, it seems there are very few systems like this for computers; most use a software method for failover. Does anybody know of any network hardware devices that do just this? Are they efficient? Are the switches more reliable (do they have greater uptime) than the servers behind them?
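A rough way to check the switch argument is the textbook series/parallel availability formula, sketched below in Python. It assumes failures are independent, and the numbers (99% per server, 99.99% and 98% for the switch) are made up for illustration.

```python
def failover_pair_availability(server, switch):
    """Availability of two redundant servers behind one failover switch.

    Assumes independent failures; both arguments are fractions in [0, 1].
    """
    pair = 1 - (1 - server) ** 2   # at least one of the two servers is up
    return switch * pair           # the switch sits in series with the pair


single = 0.99
good_switch = failover_pair_availability(single, 0.9999)
bad_switch = failover_pair_availability(single, 0.98)
```

With the reliable switch the pair beats a single server; with the 98% switch the whole arrangement is worse than one server on its own, which is the point made above.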
It really all depends.
How stable is your application? When I first walked into one particular job, the outgoing tech guy was crowing about how super-redundant this ONE box was that he was running the ONE webserver on.
That was all fine and good -- dual power supplies, multiple CPUs, yadda yadda -- but the app was not stable and could not handle the traffic it was getting. It crashed a lot, and when it did, there was no more business until someone bounced it.
Redundancy was good in that case.
You're running a database? That's a challenge to run in a clustered setup. It can be done and done right, but you need experts. If you're Amazon, you need that -- clustered geographically as well as locally. You're a little startup? Cluster your website and your app servers and just make your db internally redundant. And for chrissakes, don't run MS products. Stick with things that are easy to keep stable.
I, for one, welcome our new Antichrist overlord.
Really, what I tend to see clustering for more often is load-balancing. If you have a streaming video server that, say, gets slashdotted (and assuming it's not your bandwidth that is the bottleneck), then you could dump some of the load to a secondary machine. Of course, the usage that you describe - dumping to one machine in the event of a primary machine failing completely in some fashion - can also be used.
However, this isn't to say that clustering should eliminate the need for fault tolerance. On top of the price of good servers, adding a redundant PSU and a good UPS, as well as some other basic hardware necessities (RAID perhaps?), shouldn't make a huge dent in your budget compared to the overall cost of the machine(s), and it'll probably save you in the long run.
Sun: http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=83174
For around $20,000 you could build a PC cluster that includes:
20+ x Intel P4 D820 at ~$500 ea.
20+ x AMD64 X2 3800+ at ~$750 ea.
You could almost get a cluster of 40 Intel PCs, each with a dual-core chip running at 2.8 GHz. Or almost 30 AMD64 PCs, each with a dual-core chip running at 2.0 GHz. If you shop smart you can get gigabit ethernet on the motherboard and have a fault-tolerant / redundant system with over 10 times the performance of the Sun system.
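For what it's worth, the node counts can be checked with a couple of lines of Python. The prices are the approximate per-box figures quoted above; switches, cabling and racks are ignored, so the real numbers would come out a bit lower.

```python
BUDGET = 20_000  # dollars, from the post
PRICE_PER_NODE = {"Intel P4 D820": 500, "AMD64 X2 3800+": 750}  # approximate

# Whole machines the budget covers, and total cores (both chips are dual-core).
nodes = {name: BUDGET // price for name, price in PRICE_PER_NODE.items()}
cores = {name: count * 2 for name, count in nodes.items()}
```

That works out to 40 Intel boxes or 26 AMD boxes before any networking gear is bought.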
I don't know about you, but I would take the cluster of AMD X2s. The Intels might beat 'em on price/performance, but the X2s might be a lil bit nicer to work on.
Sorry, what is the point of this article ?
What's the question being asked ? There's no point to this headline. Useless.
Sounds like about 30 more lines or so of python and you're halfway there.
TinyP2P
A bit of checksumming, some automated distribution of indexed files based on some arbitrary weight (Important 1-kinda 5-YOU BET), and you've got it.
You would have to install Python for Windows... (Or OS X if they're using AutoCAD and not SoftPlan.) Set up some login/boot-time scripts etc.
Still, more for the "fun" kind of thing to do, and not something for a production environment. But everything has to start somewhere.
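A minimal sketch of the checksum-plus-weight index, in Python since that is what is being suggested. The function names and the weight-to-copies mapping are assumptions for illustration only; nothing here comes from TinyP2P itself.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 of the file contents, used to spot changed or corrupt copies."""
    return hashlib.sha256(data).hexdigest()


def copies_wanted(weight: int, peers: int) -> int:
    """Map a weight (1 = kinda important ... 5 = you bet) to a replica count.

    The linear mapping is an assumption; the post only says "arbitrary weight".
    """
    return min(peers, weight + 1)


def build_index(files: dict, weights: dict, peers: int) -> dict:
    """files maps name -> bytes; weights maps name -> 1..5."""
    return {
        name: {"sha256": checksum(data), "copies": copies_wanted(weights[name], peers)}
        for name, data in files.items()
    }


index = build_index(
    {"plan.dwg": b"floor plan", "notes.txt": b"scratch"},
    {"plan.dwg": 5, "notes.txt": 1},
    peers=4,
)
```

The actual distribution of copies to workstations, and what happens when those workstations are switched off, is the part this sketch leaves out.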
I did my first installation of a cluster 20 years ago last Jan. This is not new, except in the M$ world, and they still can't get the VMS code they have to work. So what is new...
Configured right, with apps that are correctly designed and written (eh hum!), clustering can do ALMOST, but not quite, everything an FT machine can. The ultimate is to build a cluster of FT machines.
Host-based network load balancing stinks on almost every platform imagined. Host-based systems typically require multicast or broadcasts to perform crap trickery. MS NLB is NO different. It turns your multi-thousand dollar switches into hubs.
I highly recommend one of the following dedicated appliances for load balancing:
1. Cisco Content Service Switch
2. F5 BigIP
3. Nortel's Load balancers
4. Redline
They are appliances that respect the network rather than beat the crap out of it. Oh, and yes, when you are talking about managing enterprise networks - you treat them well.
Perhaps something called OLRAS (olras-archives.com)
A file synchronization system to synchronize user-data
with a server (therefore requires a server). The system supports
workgroups as files shared by members are distributed
to all members as they connect. The server is necessary
because otherwise, all members of a workgroup would need
to be connected at the same time to synchronize the workgroup files.
Targeted to very small companies, especially useful
for laptop users who may or may not be connected to
the network at all times. The server is used to backup private data
and also as a repository of data for users who are not presently
connected: as they connect, their PC gets updated.
- uses SSH: VPN not required
- incorporates a private Web space for company-wide documents
- incorporates WebDAV, for sharing files when file sharing by
synchronization will not do.
- Supports archiving of files, previous versions of files may
be retrieved by the user without IT personnel involvement.
My experience is in SMEs or in small to medium sized units of larger businesses that try to maximize their internal capability at the lowest possible cost. In almost every case, clusters (based on Windows) were far more prone to problems than fault tolerant servers. The lack of reliability came most often from the increased hassle, maintenance, and complexity of the clustered (especially fibre) solutions. While they had, on paper, more fault tolerance and could theoretically provide greater availability, the factors I laid out above acted as a drag on their uptime. So much so that fault tolerant servers or small storage devices (especially NetApp) often provided comparable ACTUAL uptime figures. As a bonus, the simplicity of the infrastructure allowed much greater flexibility when approaching problems. Unless your infrastructure is already of "enterprise" scale, clustering is not a great option. As a first step into a large scale infrastructure, I would definitely pursue fault tolerance first.
If you have an application that requires ULTIMATE uptime, then you need a geographically remote cluster (Cluster spread over two sites with a redundant leased line link to provide the heartbeat). No matter how many redundant parts in a server, if it gets nuked (read power failure, flood, or other, not ACTUALLY nuked) then that application is down.
Active-active clusters are not really ideal, while load-balancing is a nice idea in this instance it means that when half of it fails then the application suffers severe performance issues. Active-active also creates data issues, as you've got two servers writing to their own local storage that also requires real-time replication between sites. Veritas Storage Foundation is about the most cost-effective option here, you don't even need 2003 Server Enterprise.
If you want a nice simple active-passive cluster and it's in the same location, fine, use a SAN. If the nodes are geographically remote, then they will need real-time replication, and as one is passive you can use HP Storage Mirror or similar. HP are the only vendor in fact that do a nice packaged cluster solution with a SAN included all under one part code. FYI.
Having said that, if you're buying a decent server, then you are an absolute idiot to not put RAID into it. After that, it only costs another £300 or so to add a redundant hot-plug PSU & fan. Plus p'raps a bit for an extra CPU. After that, the only component that will cause a total outage is the mainboard failing - and the only real way to get around that is to... uh... add another mainboard! Well... guess that's another server then...!
We're using Windows network load balancing on a web-hosted application. The cluster is given one IP, the servers are each given one. When the initial connection is made, the client is directed to one of the servers and that server handles all requests from that client until the session ends. The biggest problem is when one server is having issues, we need to connect to each individually to figure out which is having problems, then remove that one from the cluster. Also, the load balancing takes some processing power on each of the servers. This isn't important to us in this particular situation.
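The "same server until the session ends" behaviour can be sketched as a hash of the client address. This is only an illustration of affinity in general, not a description of how Windows NLB is implemented; the server names and the choice of SHA-256 are assumptions.

```python
import hashlib

SERVERS = ["web1", "web2", "web3"]  # hypothetical cluster members


def pick_server(client_ip: str, servers=SERVERS) -> str:
    """Deterministically map a client to one server so its session stays put."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


first = pick_server("203.0.113.7")
again = pick_server("203.0.113.7")
```

Because every node can compute the same answer on its own, no central box is needed, but it also means each node spends some cycles on the decision.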
Another one we use is an active-active Exchange cluster. Each server is aware of the other in the cluster and they share disks on a SAN. If one server is brought down for whatever reason, the other automatically grabs the services that were running on the first. The thing you have to watch is that neither of the servers uses more than 50% of their processing power when running one half of the cluster. If it ever does, you'd better upgrade your servers because if it gets the full load, it won't be able to handle it.
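The 50% rule is easy to turn into a check. A small Python sketch, assuming a two-node pair with identical servers and utilisation expressed as a fraction of one server's capacity (the sample figures are invented):

```python
def survives_failover(loads, capacity=1.0):
    """True if one surviving node of a two-node pair can carry both workloads.

    loads are the per-node utilisations as fractions of one node's capacity.
    """
    return sum(loads) <= capacity


ok = survives_failover([0.45, 0.40])       # 85% on the survivor: tight but fine
too_hot = survives_failover([0.60, 0.55])  # 115%: the survivor cannot cope
```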
The last one we use is an F5 BigIP box. This is a dedicated network load balancing box we use on a high-use web cluster. The nice thing is that all the computing power needed to manage the cluster is on the F5 box, freeing the servers for more users.
But why is the rum gone?
Others have said it, I'll say it again: you don't use clustering in place of FT hardware, or vice versa. You use them together!
Take a server: Hot-swappable mirrored OS disks, N+1 power supplies, dual NICs (which support failover), dual cards initiating separate paths to your storage (through independent switches, if fibre-attached), ECC RAM with on-system logic to take out a failing DIMM. Oh yeah, and multiple CPUs, again with logic to remove one from active use if need be. (chipkill sort of stuff.)
Now take another identical server (or two) and cluster them. By cluster, I mean add the heartbeat interconnects and software layer to monitor all of the mandated hardware and application resources, and fail over as necessary, or take other appropriate actions. Gluing a pile of machines together in a semi-aware grid is NOT a cluster, and does not properly address the same problem!
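As a rough illustration of what the heartbeat layer does, here is a minimal Python sketch. The class name, the 3-second timeout and the explicit `now` arguments are all assumptions chosen to keep the example deterministic; real cluster software also has to deal with quorum, fencing and resource monitoring.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that have gone quiet."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node is presumed dead
        self.last_seen = {}

    def beat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


mon = HeartbeatMonitor(timeout=3.0)
mon.beat("node1", now=0.0)
mon.beat("node2", now=0.0)
mon.beat("node1", now=2.5)       # node2 stays silent
failed = mon.dead_nodes(now=4.0)
```

Here `failed` contains only `node2`, and the cluster software would then start that node's resources elsewhere.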
Now once you've got this environment in place, add the most crucial aspect: Highly competent sysadmins, and a strict change control system. The former will cost you a fair sum of money in salary, and the latter will likely necessitate duplicating your entire cluster for dev/test purposes, before rolling out changes.
That's the beginning of an HA environment. Still up for it?
"People who do stupid things with hazardous materials often die." -- Jim Davidson on alt.folklore.urban
I've stuck with fault tolerance as it's been cheaper for a smaller scale operation, but I don't see any reason you couldn't cluster them.
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
or maybe it's just semantics.
I see Clustering as:
a group of machines sharing workload - i.e. a cluster of app or webservers.
We have a cluster of WebSphere servers handling our appserver load behind CoyotePoint load balancers.
The Coyotepoint LBs operate in an HA/FT mode. If one goes down, the other picks up right off the bat including existing state (which client was "stuck" to which server). I call this HA.
It all depends on the product and the vendor. We have DB2 operating in an Active/Passive HACMP cluster but the workload isn't shared. As far as licensing, we only have to have licenses for the active server (according to IBM).
There's also the shared-nothing vs shared-everything model. We currently run a shared-everything model for our database, although DB2 has a feature inherited from Informix called HADR which is a shared-nothing model. It's still active/passive, but the passive box is kept in sync with the active primary based on user-configurable parameters, e.g. update the secondary node every 60 seconds or keep the secondary node updated in real time.
Honestly it all depends on the actual implementation. Look past the vendor cruft and marketspeak to the actual implementation.
They may be selling you a database cluster, but it might require schema changes and can really overcomplicate the problem. What are you trying to achieve? Business continuity? High availability? Distributed workload? Some products support one but not the other. Some also support all of them with differing levels of complexity.
If you're looking for a Linux-only solution, we're currently running two: SteelEye's LifeKeeper on one configuration and the linux-ha stuff (with drbd) on another.
I'd be happy to provide answers to any business related experience via email.
"Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
Geographic Load balancing is a bitch. Especially if you don't have dark fiber between datacenters/facilities.
We're just now starting to investigate HAGEO on AIX, geographic mirroring on our SAN and Q-Series replication for DB2. I know it's important, but it really adds to the complexity of the environment tenfold.
"Fighting the underpants gnomes since 1998!" "Bruce Schneier knows the state of schroedinger's cat"
especially with global data. of course global data must be minimized in any system, however at least the database can be considered "global data". clustering a database is very expensive, performance wise. even outside the database you may need global data.
for example, we're working on a system where we have some global pool of resources which is used by many concurrent processes. if you would keep this pool of resources (counting down) in the database, we would need a DB which handles many thousands of transactions per second, which is not realistic. keeping it in memory (as a stateful service, i.e. global data/singleton) is the only solution, but is not compatible with clustering.
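a minimal python sketch of such an in-memory pool, just to show what "stateful singleton" means here. the class name and the numbers are made up; the point is that the counter lives in one process, so a second cluster node would have its own independent count.

```python
import threading


class ResourcePool:
    """A counted pool held in one process's memory (the stateful singleton)."""

    def __init__(self, size):
        self._free = size
        self._lock = threading.Lock()

    def acquire(self):
        """Take one resource if any are left; cheap compared to a DB transaction."""
        with self._lock:
            if self._free == 0:
                return False
            self._free -= 1
            return True

    @property
    def free(self):
        with self._lock:
            return self._free


pool = ResourcePool(1000)
granted = []


def worker():
    for _ in range(300):
        granted.append(pool.acquire())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

four workers ask for 1200 resources in total and exactly 1000 requests succeed. run the same pool on two nodes and you would hand out 2000.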
clustering can be considered "loosely coupled" fault tolerance. a single fault tolerant machine is much faster when synchronizing memory between different processes. in a cluster, synchronized memory access would slow down way too much.
i think clustering only applies to simple systems. many parts of problems just are not suited to be distributed. it is better, IMHO, to distribute functions (i.e put function A on machine 1 and function B on machine 2) than to distribute each function over multiple machines. only a particular class of functions can be distributed easily.
alas, some people (for example in the company where I work) only envisage simple web applications and thus mandate that any application is clustered, creating great headaches and severely restricting sound design.
I think the main tradeoff is in the level of complexity you have in your network. At my nameless company I run about 600+ mailservers. This is mostly because we started small and grew very quickly. The cost of things like datacenter space, remote hands calls, and the amount of time it takes to manage a network of this size all create costs that we would not have in a "big iron" scenario. For example, we generally have about 30-40% free space on our hard drives, and all of our servers have dual power supplies. In a big iron scenario, not paying for all those extra power supplies and all that idle disk space would add up to a substantial savings. Essentially, I think it comes down to whether you know you are going to have a big system and have a decent idea of how much processing power and resources you are going to need. If you are in a position to spec out hardware ahead of time, go big iron. If you are going to grow over time, you can push some of the initial costs down the line and free up capital. I would personally go with big iron if I had to do it all over again, but that's my network; yours is likely to be quite different.
Try utilizing all of your hardware instead of having some sitting around doing nothing but waiting for the primary node to break. Especially if you don't have money to waste. Our big UNIX boxes are all SAN attached, and the application binaries as well as the data are located on these drives. Use some clustering software so that each server can see all of the drives. This lets each server have a primary responsibility and take on a secondary one when another primary server dies. Server A dies? Server B can see its drives and you can start the application there. Most of the time this is scriptable, with the failover server monitoring the primary. This eliminates dead (and expensive) weight in the datacenter. We tend to find load balancing not as attractive, as you have to keep all the member nodes exactly the same when you install patches and/or make changes to the application. With commodity x86 servers running 3-4GHz and 2Gbps Fibre Channel for storage, things tend to run plenty fast. If your application requires more than that, you need to be running on a "big iron" 16 or 32-processor system with an ungodly amount of memory.
I raised availability from 85% to 99% using a cluster of fault-tolerant servers, because clustering gives you redundancy where the rubber hits the road. In most cases that is at the user interface.
Try to use hardware load balancers so that they can detect if one "side of the cluster"* is down and stop directing traffic to it. Proper configuration of the load balancers and proper monitoring of the individual sides of the cluster are also very important.
* - I prefer to call it a side of a cluster instead of a server because the side of the cluster can be a collection of application, web, and database servers in multi-tier environments.
MS Clustering does add several layers of complexity to an MS environment, and anyone administering any part of the server needs to be fully aware of the environment; otherwise a DBA shutting down a database server will trigger a failover event unexpectedly.
Clustering can be used for load balancing multiple applications over the array members, provided that a LUN is provided for each application, that way if a node fails the other can carry it, albeit at 50% performance.
I've had plenty of fault tolerant servers (multiple PSU's, RAID, hot-swap memory, NIC's, the works) but none of that helps a bit against a BSOD.
An attractive alternative is the luke-warm spare, where you have a redundant server that meets the hardware needs of many of your servers, with either preloaded SCSI disks in a box, or at least images sitting on tape/dvd ready to load.
Most of the cost of a cluster is in the software, not the hardware. Even running on linux, you need your middleware and application to be written to deal with a cluster environment. You probably even need some sort of cluster filesystem or at least san hardware. In short: for anything resembling even 3 9's of uptime, you can't begin to deploy for as little as $40,000. Linux and x86 reduces the cost of the servers by 40%, everything else stay pretty damn expensive.
Just as an aside, a Linux solution is still proprietary unless it adheres to a standard administered by some sort of standards administrating body. It may be open, but it's proprietary. If we're going to be all religious about open and closed source, let's keep things straight.
Posted anonymously as a zealotry filter.
Well, my team has deployed several NT and Win2K clusters, and they worked fine. We were clustering IIS along with several legacy apps.
My experience has been that when you can't get Windows Clustering Services to work, it's either a lame app, or lame people running the show.
Tim
P.S. I'm the king of Windows bashers, so I'm definitely no lover of MS. At the same time, if it works, I'll install it.
The first fault-tolerant computers (circa 1978) were made by Tandem Computers, which is now the NonStop Division (or something like that) of HP. They're still the best: no single point of failure, backups take over in 15 microseconds. Full disclosure: I worked for Tandem for 6 years. Prior to that, I worked for a Manhattan brokerage firm; the building got hit by lightning one evening; all the Amdahl and IBM mainframes went down, and it took three days to get them back up. The Tandem systems lost half their processors and a third of their disc drives, and kept right on processing. No down time, no lost data. The only drawback is price; these are enterprise-class servers; the cheapest one is $250k. I guess you get what you pay for.
Cheers, Tim -- Tim Janke Part mad scientist, part lion tamer: sr. software engineer, global team leader, project mana
"Nobody in their right mind is wondering if they should get a cluster OR FT hardware. They get a cluster of FT servers." This model defeats the whole cost savings of a cluster solution. One of the main advantages of clustering is the ability to leverage low cost hardware and still create a highly fault tolerant infrastructure. You do need to pick one or the other. If you don't, you're paying on both ends. The clustering software provides the HA, not the hardware. If you don't use clustering software, then by all means rely on hardware HA, but both is just wrong. Why would you invest in the headache of scaling horizontally, only to spend the money on hardware HA as well?
Perhaps a better link is to the OpenAFS (Open Andrew File System) implementation: another IBM contribution to open source. The project is still putting out current releases, and they're working right now on a new stable release.
The link is http://www.openafs.org/
Steve
... a rarity on Slashdot these days it seems. I work at a well-known East coast university, and I have been trying to get the 'chief engineer' to adopt similar practices, but he is book-smart and trade-rag smart, and tries to pinch pennies to save dollars.
"We'll just script it!"
--
Working at a university makes my brain feel toasted
I had the thrill of developing our company's clustering solution, as long as it was Windows on IBM hardware. I built active-passive systems with raid 10 arrays for databases. our application lived on web servers that would call the database servers. i spent the better part of my days, and nights for that matter, either bringing nodes back up or failing them over because the system wouldn't do it itself. no end of grief. our higher ups were convinced this setup was giving our clients 100% reliability, but they forgot about the 5 to 10 minutes of resync between our web servers and the databases. add that up over a few times a day, and the client ends up with a hefty outage. neither ibm or microsoft could come up with an answer to make things run smoother. we decided to de-cluster our setup, and things could not run smoother. ymmv, but we had a bad experience with clustering and won't be going back to it.
this was all done under windows 2000. under windows 2003, you need an extra dedicated resource besides your quorum drive for msmq. we simply didn't have the spindles available to create an extra resource. besides that, it was going to be a pain to get 2003 clustering running without active directory. yes, we are still using nt domains.
Sysplexed mainframe pair: maximum 10 minutes downtime per year. Only costs 2 million $ or more.
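To put the 10 minutes in perspective, here is the conversion between yearly downtime and availability as a small Python sketch (a 365-day year is assumed; the three-nines line is only there for comparison):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability(downtime_minutes_per_year):
    """Fraction of the year the system is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def downtime_minutes(avail):
    """Yearly downtime allowed by a given availability fraction."""
    return (1 - avail) * MINUTES_PER_YEAR


sysplex = availability(10)             # the 10 minutes/year quoted above
three_nines = downtime_minutes(0.999)  # for comparison
```

Ten minutes a year is roughly 99.998% availability, while "three nines" still allows close to nine hours of downtime a year.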
East or west, COBOL is best!
Hmm, this CARP seems quite a nice feature. Is there something similar on Linux?
--Coder
Don't think of that server as a prison. It's more like a womb.
Soylent Green is peoplicious!
Ok ok I hear you, give me a torrent or link where I can download its source code and use it legally for no cost in my business.
but, if you're running Oracle RAC, you're gonna be ripped off shitloads of money for it. end of story.
If the idea is "massive parallelism using commodity hardware", then grids are pretty much clusters with better management tools for single-system image, provisioning and/or re-distributing resources (CPU, I/O, etc.). Grids are mostly in the scientific community though you're starting to see it creep in commercial data centres (Oracle being a big proponent with their database version 10g, "g" standing for Grid).
The problem with clusters has always been software; to fund new software development, one needs to build a little bit of hype. That means new terminology. Kind of like how "expert systems" are now called "rules engines", workflow is now called "BPM", and interface contract negotiation is now called "choreography". Not to belittle the work in these areas, there is much good being done, just an observation about funding cycles and human attention spans.
-Stu
OpenVMS has supported geographically separated cluster nodes and shadowed storage for years.
http://h71000.www7.hp.com/doc/82FINAL/6318/6318pro_016.html
Clustering provides Fault tolerance and scalability.
Fault tolerance is usually a system with two internal nodes. I used to work at a telecom vendor where there was development
of HW-based fault-tolerant systems. These could handle HW failures very well but had no extra support for SW
failures, which nowadays are much more frequent. They could also handle SW upgrades by splitting the system in intricate
ways. Later on there was also development of SW-based solutions for fault tolerance; it was possible to have both
Hot (failover in seconds) and Cold (failover in minutes) solutions using SW only. The nice thing with the Hot solution was
that one could actually catch SW failures and tell the other side to abort the current thread of activity since it was known that it
would crash (according to a Tandem report this catches 25% of the SW failures).
Clustering can have exactly the same level of fault tolerance but still have scalability. One way of achieving this is by partitioning the system
into a set of fault tolerance groups. This is how MySQL Cluster works (fault tolerance groups are called node groups). MySQL Cluster
uses Hot failover by ensuring that all nodes in the group are always in sync.
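The node group idea can be sketched in Python as follows. This is an illustration of partitioning into fault tolerance groups in general; the node names, the two-by-two layout and the CRC32 hash are assumptions and are not how MySQL Cluster actually distributes rows.

```python
import zlib

NODE_GROUPS = [["n1", "n2"], ["n3", "n4"]]  # hypothetical: 2 groups x 2 replicas


def group_for(key: str) -> int:
    """Pick the node group that owns a row by hashing its key."""
    return zlib.crc32(key.encode()) % len(NODE_GROUPS)


def nodes_serving(key: str, down: set) -> list:
    """Live nodes that hold the row; empty only if its whole group is down."""
    return [n for n in NODE_GROUPS[group_for(key)] if n not in down]


def cluster_up(down: set) -> bool:
    """All data stays available while every group still has one live node."""
    return all(any(n not in down for n in group) for group in NODE_GROUPS)
```

Losing `n1` and `n3` leaves the cluster up because each group still has a live member, while losing `n1` and `n2` takes out a whole group. Adding more groups adds capacity without changing the fault tolerance of each group.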
I could see the value if you had some sort of network-version of RAID. You would need to make sure you had 'striping' and 'parity', for when a host goes down, like when a drive goes down. But what would you do if you lost too many peers, like when employees go home for the day? How would you do off-site backups (if the building burns down, it doesn't matter how many copies are distributed among workstations)? How much bandwidth would this require? How would network latency affect performance? How do you make sure you can do atomic writes on your 'filesystem'? I agree, it's a good idea, but there are a lot of problems to solve to get there.
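The parity part at least is simple. A minimal Python sketch of RAID-style XOR parity across hosts, with made-up four-byte stripes standing in for file blocks:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


data = [b"AAAA", b"BBBB", b"CCCC"]   # stripes held by three hosts
parity = xor_blocks(data)            # held by a fourth host

# The second host goes home for the day; rebuild its stripe from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
```

This survives the loss of any one host per stripe set. Two hosts leaving at once is already too many, which is the "employees go home" problem in a nutshell.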
In essence, there are two approaches to possible failure. Spend a lot to make the probability vanishingly small, or engineer the system so failure is less of a problem.
The former is the NASA approach; the latter was used by the Soviet space program. Both have had their great successes and spectacular failures. The Soviet approach tends to be orders of magnitude cheaper, especially when the cost of failure is on the high side.