Slashdot Mirror


Remus Project Brings Transparent High Availability To Xen

An anonymous reader writes "The Remus project has just been incorporated into the Xen hypervisor. Developed at the University of British Columbia, Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host. Remus requires no modifications to the OS or applications within the protected VM: on failure, Remus activates the replica on the second host, and the VM simply picks up where the original system died. Open TCP connections remain intact, and applications continue to run unaware of the failure. It's pretty fun to yank the plug out on your web server and see everything continue to tick along. This sort of HA has traditionally required either really expensive hardware, or very complex and invasive modifications to applications and OSes."

33 of 137 comments

  1. Already done by VMware by Lurching · · Score: 5, Interesting

    They may have a patent too!!

    1. Re:Already done by VMware by palegray.net · · Score: 2, Insightful

      I'll bet a paycheck that prior art in various incarnations would handily dispatch any such patent. As for it already being done by VMware, a lot of organizations prefer a purely open source solution, and Xen works extremely well for many companies.

    2. Re:Already done by VMware by TheRaven64 · · Score: 3, Interesting

      I know that a company called Marathon Technologies owns a few patents in this area. A few of their developers were at the XenSummit in 2007 where the project was originally presented.

      --
      I am TheRaven on Soylent News
    3. Re:Already done by VMware by Bert64 · · Score: 2, Insightful

      They bought a particular version of vmware, and paid vmware to support the setup they had bought and paid for...
      VMware's method of providing support was to tell them to buy new expensive products... They failed to provide adequate support for the version they were actually being supported for...
      If their product fails, then an upgrade to a working version should be free at the very least.

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  2. Himalaya by mwvdlee · · Score: 2, Interesting

    How does this compare to a "big iron" solution like Tandem/Himalaya/NonStop/whatever-it's-called-nowadays?

    --
    Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
    1. Re:Himalaya by Jay+L · · Score: 2, Informative

      I was just thinking that...

      Tandems may still have other advantages, though; back in the day, we built a database on Himalayas/NSK because, availability aside, it outperformed Sybase, Oracle, and other solutions. (They implemented SQL down at the drive controller level; it was ridiculously efficient.) No idea if that's still the case.

      But Tandem required you to build their availability hooks into your app; it wasn't transparent. OTOH, Stratus's approach is; a Stratus server is like having RAID-1 for every component of your server. I gotta think this will cut into their business.

    2. Re:Himalaya by teknopurge · · Score: 5, Interesting

      VM replication like this still has an IO bottleneck. This isn't magic: unless you move to infiniband you're not going to touch something like a Stratus or NonStop machine. By the time you add in the cost of the high-perf interconnects, you're on-par with the real-time boxes. All this convergence going on with people redesigning the mainframe but ass-backward with client/server gear. Makes little sense to me other than it being a gimmick.

      By the time you get all the components that provide the processing and I/O throughput of those high-end boxes, the x86/64 commodity hardware cost advantage has evaporated.

    3. Re:Himalaya by Cheaty · · Score: 3, Informative

      Actually, after reading the paper, this is no threat to Stratus or other players in the space like Marathon or VMWare's FT. The performance impact is pretty significant - by their own benchmarks there was a 50% perf hit in a kernel compile test, and 75% in a web server benchmark.

      This is an interesting approach and seems to handle multiple vCPUs in the VM, which I haven't seen done by the software approaches like Marathon and VMware FT, but I think it will mainly be used in applications that would have never been considered for a more expensive solution anyway.

    4. Re:Himalaya by Vancorps · · Score: 3, Insightful

      Were you replying to my comment? Because it doesn't sound like you read my comment. I specifically said there are cut-off points where virtual infrastructure doesn't make sense.

      Also, the fact that you think the IO of SAN is any different than that of an HP Non-Stop setup is where things get really comical because you're talking about Infiniband which is used in x86 hardware as well. As I said, the threshold is moving into higher and higher workloads.

      I'm also not sure where you get your information about Exchange not being IO intensive. Exchange setups easily handle billions of transactions just like the big RDBMS out there. That's why when you evaluate virtual platforms they always ask you about your Exchange environment as well as your database environment. They are both considered to be high IO applications as all they do practically is read and write from disk.

      I find the whole concept of your argument funny considering the Non-stop setups were early attempts at abstraction from the hardware to handle failure and be able to spread the load. In essence it was the start of virtual infrastructure. There is a reason Non-Stop isn't primarily part of HP's business anymore, people are achieving what they need to with commodity hardware. Sorry, but you do indeed save a lot of money that way too. Enterprise crap used to cost boat loads, now it is accessible to much smaller players with smaller workloads but the same demands for up-time.

    5. Re:Himalaya by anon+mouse-cow-aard · · Score: 2, Interesting

      We had a 700 kline app written in some Tandem-specific application language. The smallest server we could get from HP was 400 K$. We re-wrote the app in Python to use pairs of servers replicating via DRBD over ethernet and a load balancer in front. DRBD is slow, but with the new app I could just add pairs of nodes. We already had such a configuration for another application, and we combined the two, so the hardware cost was just adding two nodes in this cluster, at about 4 K$ per server node. 400 K$ -> 8 K$. I think it would take a heck of a lot of hardware to compensate for the pricing of that gear.

  3. Intact? by Glock27 · · Score: 4, Informative

    Intact is one word, O ye editors...

    --
    Galileo: "The Earth revolves around the Sun!"
    Score: -1 100% Flamebait
    1. Re:Intact? by martin-boundary · · Score: 2, Funny

      Intact is one word

      That was before someone gave Romulus a shovel!

  4. state transfer by girlintraining · · Score: 3, Insightful

    ... Of course, this ignores the fact that if it's a software glitch, it'll happily replicate the bug into the copy. Also, there are certain hardware bugs that will also replicate: Mountain dew spilled on top of the unit, for example. There's this huge push for virtualization, but it only solves a few classes of failure conditions. No amount of virtualization will save you if the server room starts on fire and the primary system and backup are colocated. Keep this in mind when talking about "High Availability" systems.

    On a different note, nothing that's claimed to be transparent in IT ever is. Whenever I hear that word, I usually cancel my afternoon appointments... Nothing is ever transparent in this industry. Only managers use that word. The rest of us use the term "hopefully".

    --
    #fuckbeta #iamslashdot #dicemustdie
    1. Re:state transfer by Garridan · · Score: 4, Funny

      Mountain dew spilled on top of the unit, for example.

      FTFS:

      Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host.

      Wow! This software is *incredible* if mountain dew spilled on top of one machine is instantly replicated on the other machine! I'm gonna go read the source immediately, this has huge ramifications! In particular, if an officemate gets coffee and I also want coffee, only one of us needs to actually purchase a cup!

    2. Re:state transfer by Vancorps · · Score: 3, Interesting

      If your primary and secondary systems are physically located next to each other then they aren't in the category of highly available. Furthermore with storage replication and regular snapshotting you can have your virtual infrastructure at your DR site on the cheap while gaining enterprise availability and most importantly, business continuity.

      I'll agree with being skeptical about transparency although how many people already have this? I went with XenServer and Citrix Essentials for it, I already have this fail-over and I can tell you that it works. I physically pulled a blade out of the chassis and sure enough, by the time I got back to my desk the servers were functioning having dropped a whole packet. Further tweaking of the underlying network infrastructure resulted in keeping the packet with just a momentary rise in latency.

      Enterprise availability is fast coming to the little guys.

    3. Re:state transfer by bcully · · Score: 3, Informative

      FWIW, we have an ongoing project to extend this to disaster recovery. We're running the primary at UBC and a backup a few hundred KM away, and the additional latency is not terribly noticeable. Failover requires a few BGP tricks, which makes it a bit less transparent, but still probably practical for something like a hosting provider or smallish company.

    4. Re:state transfer by bcully · · Score: 5, Informative

      It depends pretty heavily on your workload. Basically, the amount of bandwidth you need is proportional to the number of different memory addresses your application wrote to since the last checkpoint. Reads are free -- only changed memory needs to be copied. Also, if you keep writing to the same address over and over, you only have to send the last write before a checkpoint, so you can actually write to memory at a rate which is much higher than the amount of bandwidth required. We have some nice graphs in the paper, but for example, IIRC, a kernel compilation checkpointed every 100ms burned somewhere between 50 and 100 megabits. By the way, there's plenty of room to shrink this through compression and other fairly straightforward techniques, which we're prototyping.
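
      A rough sketch of the accounting described above, not taken from Remus itself: it assumes dirty memory is tracked per 4 KiB page (the page size and all function names here are illustrative), so repeated writes to the same page cost one page per checkpoint no matter how many writes land there.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, for illustration only


def dirty_pages(write_addresses, page_size=PAGE_SIZE):
    """Return the set of page numbers written during one checkpoint epoch."""
    return {addr // page_size for addr in write_addresses}


def checkpoint_bytes(write_addresses, page_size=PAGE_SIZE):
    """Bytes shipped at the checkpoint: one copy of each dirty page."""
    return len(dirty_pages(write_addresses, page_size)) * page_size


def required_mbps(bytes_per_checkpoint, interval_s):
    """Average replication bandwidth in megabits per second."""
    return bytes_per_checkpoint * 8 / interval_s / 1e6


# 10,000 writes that all land in the same page dirty a single page...
hot = [0x1000 + (i % 64) for i in range(10_000)]
# ...while 10,000 writes to distinct pages dirty 10,000 pages.
spread = [i * PAGE_SIZE for i in range(10_000)]

print(checkpoint_bytes(hot), checkpoint_bytes(spread))
```

      Reads never appear in `write_addresses`, which is the sense in which they are free.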

    5. Re:state transfer by shmlco · · Score: 2, Interesting

      "If your primary and secondary systems are physically located next to each other then they aren't in the category of highly available."

      High availability covers more than just distributed data centers. Load-balancing, fail-over, clustering, mirroring, redundant switches, routers, and other hardware: all are zero-point-of-failure, high availability solutions.

      --
      Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
    6. Re:state transfer by girlintraining · · Score: 2, Funny

      Wow! This software is *incredible* if mountain dew spilled on top of one machine is instantly replicated on the other machine! I'm gonna go read the source immediately, this has huge ramifications! In particular, if an officemate gets coffee and I also want coffee, only one of us needs to actually purchase a cup!

      I told them quantum computing was a bad idea, but nobody listened...

      I told them quantum computing was a bad idea, but nobody listened...

      I told them...

      --
      #fuckbeta #iamslashdot #dicemustdie
  5. Answer by Anonymous Coward · · Score: 5, Informative

    I've worked with Remus, so I can answer your question.

    It's not "constantly going" into live migration. The backup image is constantly kept in a "paused" state. It doesn't come out of the paused state until communication with the original is broken.

    Until the backup goes live, the shadow pages for memory are updated, via checkpoints. The checkpointing interval is somewhat variable, but it's actually hardcoded into the Xen software (at present - this will change), regardless of what the user level utility tells you.
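
    The paused-until-silent behaviour can be pictured with a small sketch. This is not Remus code: the class name, the timeout value, and the idea of passing timestamps in explicitly are all assumptions made to keep the example self-contained.

```python
class BackupMonitor:
    """Toy model of a backup that stays paused while checkpoints keep arriving."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # hypothetical silence threshold
        self.last_seen = now        # time of the last checkpoint received
        self.state = "paused"

    def on_checkpoint(self, now):
        # Each checkpoint refreshes the shadow copy and proves the primary is alive.
        if self.state == "paused":
            self.last_seen = now

    def tick(self, now):
        # Resume only once the primary has been silent for longer than the timeout.
        if self.state == "paused" and now - self.last_seen > self.timeout_s:
            self.state = "running"
        return self.state


monitor = BackupMonitor(timeout_s=1.0, now=0.0)
monitor.on_checkpoint(0.5)
print(monitor.tick(1.2))  # still within the timeout
print(monitor.tick(1.6))  # primary silent for more than 1.0 s
```

    Once the backup is running it ignores late checkpoints, since applying them would overwrite state the live replica has already moved past.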

    As it is, the subsecond checkpointing doesn't work too well. But intervals of about 1-2 seconds work great. Getting subsecond checkpointing can be done (I've done it), but you need more code than what Remus currently provides.

    Similar comments are applicable to the storage updating. This works absolutely superbly if you're using something like DRBD for the storage replication.

    Remus is pretty cool technology, and it serves as a very solid foundation for taking things to the next level.

    The folks at UBC have done a superb job here, and should be well congratulated.

  6. Re:It's pretty fun by Hurricane78 · · Score: 4, Informative

    Uuum... session management? Transaction management? The server dying in the process of something that costs money?
    Even if it's something as simple as losing the contents of your shopping cart just before you wanted to buy, and then becoming angry at the stupid ass retarded admins and developers of that site.
    Or losing the server connection in your flash game, right before saving the highscore of the year.

    Webservers are far less stateless than you might think. Nowadays they practically are app servers. (Disclosure: I did web applications since 2000, so I know a bit about the subject.)

    When 5 minutes downtime mean over a hundred complaints in your inbox and tens of thousands of dropped connections, which your boss does not find funny at all, you don't do that error again.

    --
    Any sufficiently advanced intelligence is indistinguishable from stupidity.
  7. How does it deal with replication latency? by melted · · Score: 2, Interesting

    I'm pretty sure that if I just yank the cable, not everything will be replicated. :-)

    1. Re:How does it deal with replication latency? by bcully · · Score: 5, Informative

      Hello slashdot, I'm the guy that wrote Remus. It's my first time being slashdotted, and it's pretty exciting! To answer your question, Remus buffers outbound network packets until the backup has been synchronized up to the point in time where those packets were generated. So if you checkpoint every 50ms, you'll see an average additional latency of 25ms on the line, but the backup _will_ always be up to date from the point of view of the outside world.

    2. Re:How does it deal with replication latency? by bcully · · Score: 5, Informative

      The buffering I mentioned above means that packet X will not escape the machine until the checkpoint that produced X has been committed to the backup. So when it recovers on the backup, X will already be in the OS send buffer. There's no possibility for misprediction. If the buffer is lost, TCP will handle recovering the packet.
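
      To make the buffering rule concrete, here is a minimal sketch under stated assumptions; it is not the Remus implementation, and the class, method names and epoch numbering are invented for illustration. Packets are tagged with the epoch that produced them and released only when the backup has acknowledged that epoch.

```python
from collections import deque


class OutputBuffer:
    """Holds outbound packets until the epoch that produced them is committed."""

    def __init__(self):
        self.pending = deque()  # (epoch, packet) pairs not yet safe to release
        self.epoch = 0          # current, not yet checkpointed epoch

    def send(self, packet):
        # The guest believes it sent the packet; in fact it is only queued.
        self.pending.append((self.epoch, packet))

    def checkpoint(self):
        # Close the current epoch and start a new one; returns the closed epoch.
        closed = self.epoch
        self.epoch += 1
        return closed

    def commit(self, acked_epoch):
        # The backup holds all state up to acked_epoch: release those packets.
        released = []
        while self.pending and self.pending[0][0] <= acked_epoch:
            released.append(self.pending.popleft()[1])
        return released


def mean_added_latency_ms(interval_ms):
    # A packet generated at a uniformly random moment in an epoch waits half
    # the interval on average (ignoring the time to copy and acknowledge).
    return interval_ms / 2


buf = OutputBuffer()
buf.send("X")
epoch = buf.checkpoint()
buf.send("Y")             # belongs to the next epoch
print(buf.commit(epoch))  # only "X" may leave the machine
```

      If the primary dies before `commit`, nothing from the uncommitted epoch has been seen outside, so the backup resuming from the previous checkpoint is consistent with the outside world.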

    3. Re:How does it deal with replication latency? by BitZtream · · Score: 2, Interesting

      No it won't.

      VMWare claims the same crap and it's simply not true.

      You have a 50ms window between checkpoints that can be lost, in your example. The only way to ensure no loss is to ensure that every change, every instruction, every microcode executed in the CPU on machine A is duplicated on B before A continues to the next one. You simply can't do that without specialized hardware since you don't even have access to the microcode as it's executed on standard hardware.

      50ms on my hardware/software can mean thousands of transactions lost. That can wreak havoc on certain network protocols and cause database operations to fail completely as you replay portions of transactions that the database has already seen.

      I can come up with situations all day long as to how this isn't as seamless as you make it out to be. Sure, xclock transitions to the other machine in what appears to be a perfect no-loss transition, or solitaire on a Windows machine, but that's not exactly useful.

      Remus has plenty of uses, but it has plenty of pitfalls and, regardless of claims, does require consideration when developing systems, unless you're introducing latency that, to me, would just be completely unacceptable and would require applications to be aware of the latency. Hell, that's 6.25MB of data that can be transmitted over a gigabit pipe between checkpoints. That can kill performance.
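
      The 6.25MB figure is link rate times checkpoint interval. A quick check, assuming a 1 Gbit/s link, decimal megabytes, and no protocol overhead (the function name is made up for the example):

```python
def bytes_per_interval(link_bits_per_s, interval_ms):
    """Bytes a link can carry during one checkpoint interval (integer math)."""
    return link_bits_per_s * interval_ms // 8000  # /8 bits per byte, /1000 ms per s


gigabit = 1_000_000_000
print(bytes_per_interval(gigabit, 50))  # 6250000 bytes, i.e. 6.25 MB
```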

      I know what you're saying, I know what you mean, and I just don't think you realize how much that latency can affect certain classes of applications.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
    4. Re:How does it deal with replication latency? by bcully · · Score: 5, Insightful

      I think you're missing the point of output buffering. Remus _does_ introduce network delay, and some applications will certainly be sensitive to it. But it never loses transactions that have been seen outside the machine. Keeping an exact copy of the machine _without_ having to synchronize on every single instruction is exactly the point of Remus.

    5. Re:How does it deal with replication latency? by Antique+Geekmeister · · Score: 4, Insightful

      If your application cannot tolerate a 50 msec pause in outbound traffic (which is what Remus seems to introduce, similar to VMWare switchovers) then you have no business running it over a network, much less over the Internet as a whole. Similar pauses are introduced in core switching and core routers on a fairly frequent basis, and are entirely unavoidable.

      There are certainly classes of application sensitive to that kind of issue: various "real-time-programming" and motor control sensor systems require consistently low latency. But for public facing, high-availability services, it seems useful, and much lighter to implement than VMWare's expensive solutions.

  8. Nope by Anonymous Coward · · Score: 4, Insightful

    Remus presented their software well before VMware came out with their product.

    What's different now is that the Remus patches have finally been incorporated into the Xen source tree.

    If VMware has any patents, they'll have to jump over the hurdle of being before the Remus work was originally published, which was a while ago.

    Besides, Remus can be used in more ways than what VMware offers, since you have the source code.

  9. Re:It's pretty fun by Fulcrum+of+Evil · · Score: 3, Insightful

    Webservers are far less stateless than you might think. Nowadays they practically are app servers. (Disclosure: I did web applications since 2000, so I know a bit about the subject.)

    Webservers have no business being the sole repository for these things - the whole point of separating out web from app is that web boxes are easily replaceable with no state.

    Session mgmt: store the session in a distributed way at least after each request. Transactions: they fail if you die half way through. Shopping cart: this doesn't live on a web server.

    If you require all that state, how do you ever do load balancing? Add a web server and it's another SPOF.

    When 5 minutes downtime mean over a hundred complaints in your inbox and tens of thousands of dropped connections, which your boss does not find funny at all, you don't do that error again.

    That's right, you move the state off the webserver so nobody ever sees the downtime and tell your boss that you promised 99.9 and damnit, you're delivering it!

    --
    "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
  10. Re:It's pretty fun by stefanlasiewski · · Score: 3, Insightful

    In many cases, the webserver IS the app server.

    This sort of feature could be very useful for those smaller shops and cheap shops who haven't yet created a dedicated Web tier, or for all those internal webservers which host the Wiki, etc.

    Webservers also help with capacity. Run 4 and if 1 drops off, not a big problem. But what if half the webservers drop off because the circuit which powers that side of the cage went down? And the 'redundant' power supplies on your machines weren't really 'redundant' (Thanks Dell)?

    --
    "Can of worms? The can is open... the worms are everywhere."
  11. Wrong place to put a failsafe? by mattbee · · Score: 3, Insightful

    Surely there is a strong possibility of a failure where both VMs run at once - the original image thinking it has lost touch with a dead backup, and the backup thinking the master is dead, and so starting to execute independently? If they're connected to the same storage / network segment, it could cause data loss, bring down the network service and so on. I've not investigated these types of lockstep VMs, but it seems you have to make some pretty strong assumptions about failure modes, which always break eventually on commodity hardware (I've seen bad backplanes, network chips, CPU caches, RAM of course, switches...). How can you possibly handle these cases to avoid having to mop up after your VM is accidentally cloned?

    --
    Matthew @ Bytemark Hosting
    1. Re:Wrong place to put a failsafe? by bcully · · Score: 4, Informative

      Split brain is a possibility, if the link between the primary and backup dies. Remus replicates the disks rather than requiring shared storage, which provides some protection over the data. But there are already a number of protocols for managing which replica is active (e.g., "shoot-the-other-node-in-the-head") -- we're worried about maintaining the replica, but happy to use something like linux-HA to control the actual failover.

    2. Re:Wrong place to put a failsafe? by dido · · Score: 4, Interesting

      This is something that the much simpler Linux-HA environment deals with by using something they call STONITH, which basically means to Shoot The Other Node In The Head. STONITH peripherals are devices that can completely shut down a server physically, e.g. a power strip that can be controlled via a serial port. If you wind up with a partitioned cluster, which they more colorfully call a 'split brain' condition, where each node thinks the other one is dead, each of them uses the STONITH device to make sure, if it is able. One of them will activate the STONITH device before the other, and the one which wins keeps on running, while the one that loses really kicks the bucket if it isn't fully dead. I imagine that Remus must have similar mechanisms to guard against split brain conditions as well. I've had several Linux-HA clusters go split brain on me, and I tell you it's never pretty. The best case is that they only both try to grab the same IP address and get an IP address conflict; in the worst case, they both try to mount and write to the same Fibre Channel disk at the same time and bollix the file system. If a Remus-based cluster split brains, I can imagine that you'll get mayhem just as awful unless you have a STONITH-like system to prevent it from happening.
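
      The race described above can be sketched in a few lines. This is a toy model, not Linux-HA code: a lock stands in for the power strip, and the class and function names are hypothetical. Both nodes fire at once, and the device guarantees that exactly one shot takes effect.

```python
import threading


class FencingDevice:
    """Stand-in for a controllable power strip: the first shot wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self.powered_off = None  # name of the node that was shot, if any

    def shoot(self, target):
        # Atomic check-and-set: once one node is fenced, later shots are refused.
        with self._lock:
            if self.powered_off is not None:
                return False
            self.powered_off = target
            return True


def resolve_split_brain(device, nodes=("primary", "backup")):
    """Each node tries to fence the other; return the nodes whose shot landed."""
    a, b = nodes
    results = {}
    threads = [
        threading.Thread(target=lambda: results.__setitem__(a, device.shoot(b))),
        threading.Thread(target=lambda: results.__setitem__(b, device.shoot(a))),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [node for node, won in results.items() if won]


device = FencingDevice()
survivors = resolve_split_brain(device)
print(survivors, "survives;", device.powered_off, "was fenced")
```

      Which node wins depends on thread scheduling, which mirrors the real situation: the outcome is arbitrary, but there is only ever one survivor.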

      --
      Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.