Remus Project Brings Transparent High Availability To Xen
An anonymous reader writes "The Remus project has just been incorporated into the Xen hypervisor. Developed at the University of British Columbia, Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host. Remus requires no modifications to the OS or applications within the protected VM: on failure, Remus activates the replica on the second host, and the VM simply picks up where the original system died. Open TCP connections remain intact, and applications continue to run unaware of the failure. It's pretty fun to yank the plug out on your web server and see everything continue to tick along. This sort of HA has traditionally required either really expensive hardware, or very complex and invasive modifications to applications and OSes."
They may have a patent too!!
"It's pretty fun to yank the plug out on your web server and see everything continue to tick along."
Or an ordinary, everyday, run-of-the-mill, off-the-shelf, plain-Jane beige UPS. Or a ghetto one, if you'd like.
Still, it's pretty cool. Just wondering how much overhead there is in setting up this system.
How does this compare to a "big iron" solution like Tandem/Himalaya/NonStop/whatever-it's-called-nowadays?
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
Not immediately clear on the Remus page... Is this like a constantly going "live migration" (without actually switching hosts) in that it _only_ keeps a copy of the memory of the guest? Or does this also keep a copy of the disk image? It'd be nice to not need shared storage just to be able to migrate without downtime...
Intact is one word, O ye editors...
Galileo: "The Earth revolves around the Sun!"
... Of course, this ignores the fact that if it's a software glitch, it'll happily replicate the bug into the copy. There are also certain hardware failures that will replicate: Mountain Dew spilled on top of the unit, for example. There's this huge push for virtualization, but it only solves a few classes of failure conditions. No amount of virtualization will save you if the server room catches fire and the primary system and backup are colocated. Keep this in mind when talking about "High Availability" systems.
On a different note, nothing that's claimed to be transparent in IT ever is. Whenever I hear that word, I usually cancel my afternoon appointments... Nothing is ever transparent in this industry. Only managers use that word. The rest of us use the term "hopefully".
#fuckbeta #iamslashdot #dicemustdie
Xen ? The computer of the Liberator?
I've worked with Remus, so I can answer your question.
It's not "constantly going" into live migration. The backup image is constantly kept in a "paused" state. It doesn't come out of the paused state until communication with the original is broken.
Until the backup goes live, the shadow pages for memory are updated, via checkpoints. The checkpointing interval is somewhat variable, but it's actually hardcoded into the Xen software (at present - this will change), regardless of what the user level utility tells you.
As it is, subsecond checkpointing doesn't work too well, but intervals of about 1-2 seconds work great. Subsecond checkpointing can be done (I've done it), but you need more code than Remus currently provides.
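To make the checkpoint-and-failover behaviour described above concrete, here is a minimal toy sketch in Python. It is not Remus or Xen code: the class names `Primary` and `Backup`, the page dictionary, and the `write`/`checkpoint`/`apply`/`activate` methods are all hypothetical stand-ins. It only illustrates the idea that the backup stays paused, receives the pages dirtied since the previous checkpoint, and resumes from the last checkpoint it received once contact with the original is lost.

```python
class Primary:
    """Toy stand-in for the protected VM (hypothetical, not Remus code)."""

    def __init__(self):
        self.pages = {}      # page number -> contents
        self.dirty = set()   # pages written since the last checkpoint

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def checkpoint(self):
        """Return only the pages dirtied since the previous checkpoint."""
        delta = {n: self.pages[n] for n in self.dirty}
        self.dirty.clear()
        return delta


class Backup:
    """Toy stand-in for the paused replica on the second host."""

    def __init__(self):
        self.pages = {}
        self.paused = True

    def apply(self, delta):
        # Merge a checkpoint into the shadow copy while staying paused.
        self.pages.update(delta)

    def activate(self):
        # Called when communication with the primary is lost.
        self.paused = False
        return dict(self.pages)


primary, backup = Primary(), Backup()
primary.write(0, "a")
primary.write(1, "b")
backup.apply(primary.checkpoint())   # checkpoint reaches the backup
primary.write(1, "c")                # written after the last checkpoint
state = backup.activate()            # primary "dies" before the next one
```

In this toy model the backup resumes with page 1 still holding "b": anything written after the last completed checkpoint never reaches the replica, which is why the checkpoint interval matters.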
Similar comments are applicable to the storage updating. This works absolutely superbly if you're using something like DRBD for the storage replication.
Remus is pretty cool technology, and it serves as a very solid foundation for taking things to the next level.
The folks at UBC have done a superb job here, and should be well congratulated.
I'm pretty sure that if I just yank the cable, not everything will be replicated. :-)
Remus presented their software well before VMware came out with their product.
What's different now is that the Remus patches have finally been incorporated into the Xen source tree.
If VMware has any patents, they'll have to clear the hurdle of predating the original publication of the Remus work, which was a while ago.
Besides, Remus can be used in more ways than what VMware offers, since you have the source code.
it is absolutely unbelievable that the official Xen kernel is still 2.6.18. there's a lot of modern hardware that isn't supported by it. this is an absolute showstopper.
Surely there is a strong possibility of a failure where both VMs run at once: the original image thinking it has lost touch with a dead backup, and the backup thinking the master is dead, and so starting to execute independently? If they're connected to the same storage / network segment, it could cause data loss, bring down the network service and so on. I've not investigated these types of lockstep VMs, but it seems you have to make some pretty strong assumptions about failure modes, which always break eventually on commodity hardware (I've seen bad backplanes, network chips, CPU caches, RAM of course, switches...). How can you possibly handle these cases to avoid having to mop up after your VM is accidentally cloned?
Matthew @ Bytemark Hosting
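One common way to approach the "both VMs run at once" problem raised above is to let a third party decide who may run. The sketch below is a generic lease arbiter in Python, offered only as an illustration. Nothing in this thread says Remus works this way, and the `Arbiter` class, its `request` method, and the node names are hypothetical.

```python
class Arbiter:
    """Hypothetical third-party witness that grants one run lease at a time."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def request(self, node, now):
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.lease_seconds
            return True
        return False


arbiter = Arbiter(lease_seconds=5.0)
first = arbiter.request("primary", now=0.0)        # primary takes the lease
denied = arbiter.request("backup", now=1.0)        # backup must stay paused
takeover = arbiter.request("backup", now=6.0)      # lease lapsed, backup may run
late_primary = arbiter.request("primary", now=7.0) # old primary is refused
```

The scheme only helps if a node that fails to renew its lease stops itself before the lease expires, and if the arbiter is not on the same failing switch or backplane as both hosts. It narrows the failure-mode assumptions rather than removing them.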
So it replicates the state to the new machine and then the new machine executes the same instructions and crashes the same way....
I left VMware ESX 3.5 for XenServer 5.5 and I have never been happier.
I am running 4 DL585 servers with (so far) 42 production guests (Linux & win2k3) and have really great, more predictable performance.
If someone is running VMware and is worried about the cost or performance they need to consider Citrix XenServer.
For $50,000 maybe you should develop in-house technical support, since it won't be just $50,000 in licenses, it will eventually be another $50,000 in support, perhaps.
but taking transparent high-availability to Xen can't bode well for Gordon or the Vortigaunts. . .
My sister opened a computer store in Hawaii. She sells C shells by the seashore.
If you can get in house technical support available 24x7 that has the programmers of the product on hand to deal with it in a timely fashion, sure - go for it.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
But does it solve those awful jumping puzzles?
Remember when virtualization was only something for companies with highly specialized needs? And RAID? And cooled CPUs? And hard drives? and computers?
When a solution like this comes along, it generally starts out being used only by a few people (nerds and people who REALLY need it)
Then it filters down into the rest of the market as a nice solution to a common problem.
Then it becomes something which nobody can imagine living without.
Then it becomes unthinkable to design a system which doesn't have this ability.
Not true of every technology, surely, but "allow an arbitrary system to fail without stopping" is one of those "how did we ever live without it?" things. People will laugh at "three nines" as something absurd, like advertising that your web servers connect to the Internet or are powered by Electricity.
-- 'The' Lord and Master Bitman On High, Master Of All
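For a sense of scale on "three nines": availability figures translate directly into allowed downtime. The short Python sketch below does that arithmetic. The function name `downtime_hours_per_year` is made up for illustration, and it assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(nines):
    """Allowed downtime per year for an availability of N nines.

    Three nines means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** -nines
    return unavailability * HOURS_PER_YEAR


three = downtime_hours_per_year(3)
five = downtime_hours_per_year(5)
```

Three nines works out to roughly 8.76 hours of downtime a year, while five nines allows only about five minutes.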
After reading this announcement, I tried to imagine the earliest possible year in which a technical reader would be able to comprehend what is being described. 2004? 1998? Last week? Never heard of either the Remus Project or the Xen hypervisor, and yet here I sit, merrily cranking out successful commercial software products, as I've been doing for the past 30 years. It took me a bit of browsing to understand what was being described.
I wonder how many readers completely understood this announcement at face value without doing a little digging. 5? 10? Everybody but me?
I think if you tried keeping up with all the technology/terms in our field, it would be a full time job.
This is nothing new, simply a modern implementation of a classic idea.
See "Hypervisor-based Fault Tolerance" by Bressoud and Schneider (SIGOPS 1995).
http://www.cs.cornell.edu/fbs/publications/HyperFTol.pdf
Every now and then, someone has to come along and pretend to do something new, either out of ignorance or the academic "publish or perish" pressure.
Just the other day, we were looking at yet another implementation of a transactional operating system (TXOS).
I think more readers understood it than you expect. If you haven't heard of the Xen hypervisor or this type of virtualization then you probably have nothing to do with managing a server farm. If someone in that business has not heard of Xen then maybe they should be in another line of work.
I agree that keeping up with all tech would be a full time job. However, this is pretty mainstream stuff.
S.