Remus Project Brings Transparent High Availability To Xen

Posted by timothy on Wednesday November 11, 2009 @10:48AM from the when-servers-go-south-a-song dept.

An anonymous reader writes "The Remus project has just been incorporated into the Xen hypervisor. Developed at the University of British Columbia, Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host. Remus requires no modifications to the OS or applications within the protected VM: on failure, Remus activates the replica on the second host, and the VM simply picks up where the original system died. Open TCP connections remain intact, and applications continue to run unaware of the failure. It's pretty fun to yank the plug out on your web server and see everything continue to tick along. This sort of HA has traditionally required either really expensive hardware, or very complex and invasive modifications to applications and OSes."

137 comments

Already done by VMware by Lurching · 2009-11-11 10:50 · Score: 5, Interesting

They may have a patent too!!
1. Re:Already done by VMware by palegray.net · 2009-11-11 11:18 · Score: 2, Insightful
  
  I'll bet a paycheck that prior art in various incarnations would handily dispatch any such patent. As for it already being done by VMware, a lot of organizations prefer a purely open source solution, and Xen works extremely well for many companies.
  
  --
  512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
2. Re:Already done by VMware by illegibledotorg · 2009-11-11 11:24 · Score: 1, Insightful
  
  Yeah, and at a great price point. *rolleyes*
  
  IIRC, to get this kind of functionality from ESX or vSphere you have to pay licenses numbering in the thousands of dollars for each VM host as well as a separate license fee for their centralized Virtual Center management system. I'm glad to see that this is finally making it into the Xen mainline.
3. Re:Already done by VMware by Anonymous Coward · 2009-11-11 11:36 · Score: 1, Insightful
  
  To anyone who actually needs this kind of uninterrupted HA the cost of a VMware license is an insignificant irrelevance. Of course, it's nice that people can play around with HA at home now for free.
4. Re:Already done by VMware by Anonymous Coward · 2009-11-11 11:47 · Score: 0
  
  Agreed. A specialized support team a phone call away is how to run a business. Time is the most expensive thing to waste.
5. Re:Already done by VMware by TheRaven64 · 2009-11-11 12:00 · Score: 3, Interesting
  
  I know that a company called Marathon Technologies owns a few patents in this area. A few of their developers were at the XenSummit in 2007 where the project was originally presented.
  
  --
  I am TheRaven on Soylent News
6. Re:Already done by VMware by nurb432 · 2009-11-11 12:10 · Score: 1
  
  And it didn't require any "really expensive hardware, or very complex and invasive modifications" to do it. Not saying it's going to run on some old beat-up Pentium Pro from 10 years ago, but the hardware I see it run on every day isn't out of line for a modern data center.
  And it requires ZERO changes to the OS.
  (At risk here of sounding like a VMware fanboy, but come on... at least they can present facts when tooting their horn.)
  
  --
  ---- Booth was a patriot ----
7. Re:Already done by VMware by Anonymous Coward · 2009-11-11 12:13 · Score: 0
  
  This? http://www.vmware.com/products/fault-tolerance/
8. Re:Already done by VMware by Anonymous Coward · 2009-11-11 12:35 · Score: 1
  
  I think you're forgetting academic institutions, startups, research groups and all the other organizations that would MUCH rather spend their money on other things than VMware when a free alternative is available.... Or any place that just wants to keep a pure open source environment.
  For that matter, why would anyone NOT want HA if they can get it easy and cheap?
  Just because VMware has it does not in any way reduce the significance of Remus making it easily available in Xen.
9. Re:Already done by VMware by Anonymous Coward · 2009-11-11 12:37 · Score: 0
  
  Thus driving up the cost of everything, just to pad VMware's profit margin.
  Go fuck yourself.
10. Re:Already done by VMware by Anonymous Coward · 2009-11-11 12:38 · Score: 0
  
  I just went through this at my company. VMware calls it "Fault Tolerance". You're looking at $2K+ per CPU socket. A rack of 24 dual-socket servers is over $50K in licenses. Of course, that's just the hypervisor license with no support. Plus you need the management server licenses, which add another $xxK for 24 servers (I don't remember the cost on that. It was $6K for three hosts or something like that).
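The per-socket figure in the comment above can be multiplied out as a quick check. This is only a sketch using the poster's own numbers ("$2K+" per socket, 24 dual-socket servers); the variable names are made up for illustration and nothing here is an official price.

```python
# Figures quoted in the comment above; the dollar amount is the poster's
# lower bound ("$2K+"), not an official price.
servers = 24
sockets_per_server = 2
license_per_socket_usd = 2_000

total_sockets = servers * sockets_per_server
hypervisor_license_usd = total_sockets * license_per_socket_usd

print(f"{total_sockets} sockets -> ${hypervisor_license_usd:,} before management licenses")
```

At the quoted lower bound the hypervisor licenses alone come to roughly double the "over $50K" mentioned, which is consistent with that statement but suggests it was a conservative floor.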
11. Re:Already done by VMware by smash · 2009-11-11 13:28 · Score: 1
  
  beaten. ESX 4.0 has vmware FT, and "lockstep" is patented i believe...
  
  --
  I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
12. Re:Already done by VMware by smash · 2009-11-11 13:29 · Score: 1
  
  +1 to this. And vmware support is *actually good*.
  
  --
  I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
13. Re:Already done by VMware by smash · 2009-11-11 13:30 · Score: 1
  
  And if downtime costs you 100k/hr, it's a bargain. The support is also excellent, which is worth the price of admission, if FT is important to you.
  
  --
  I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
14. Re:Already done by VMware by Anonymous Coward · 2009-11-11 13:33 · Score: 0
  
  This feature sounds a lot like what Tandem NonStop systems used to do...
15. Re:Already done by VMware by Cheaty · 2009-11-11 13:39 · Score: 1
  
  This isn't lockstep. In storage terms, if you think of lockstep as synchronous replication, this is more akin to asynchronous snapshot-based replication. The metaphor falls apart a bit because the primary does wait for acknowledgment before modifying its external state (sending network packets or writing to disk), but can otherwise continue execution.
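The distinction drawn above (execution runs ahead, but externally visible output is held until the replica acknowledges a checkpoint) can be illustrated with a toy model. This is a minimal sketch of the idea as described in the comment, not Remus's actual code; the `Primary` and `Backup` classes and their methods are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical replica that stores the last acknowledged checkpoint."""
    state: dict = field(default_factory=dict)

    def receive_checkpoint(self, snapshot: dict) -> bool:
        self.state = dict(snapshot)
        return True  # acknowledgment sent back to the primary


@dataclass
class Primary:
    """Hypothetical primary that buffers external output until the backup acks."""
    backup: Backup
    state: dict = field(default_factory=dict)
    pending_output: list = field(default_factory=list)
    released_output: list = field(default_factory=list)

    def execute(self, key: str, value: int, packet: str) -> None:
        # Speculative execution: state changes at once, output is held back.
        self.state[key] = value
        self.pending_output.append(packet)

    def checkpoint(self) -> None:
        # Release buffered output only after the backup confirms the snapshot.
        if self.backup.receive_checkpoint(self.state):
            self.released_output.extend(self.pending_output)
            self.pending_output.clear()


backup = Backup()
primary = Primary(backup)
primary.execute("counter", 1, "reply-1")
visible_before_ack = list(primary.released_output)  # nothing has left the host yet
primary.checkpoint()
visible_after_ack = list(primary.released_output)   # reply released after the ack
```

If the primary died before `checkpoint()`, the outside world would never have seen `reply-1`, so the backup could resume from its last snapshot without contradicting anything a client observed.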
16. Re:Already done by VMware by Howard+Beale · 2009-11-11 13:57 · Score: 1
  
  We use our product with Marathon's everRun FT. Just starting to do load testing using the Xen with their 2g product. It looks nice, but the second layer of management gets to be a pain.
17. Re:Already done by VMware by uncqual · 2009-11-11 14:05 · Score: 1
  
  IIRC, NonStop applications had to be very involved in FT - initiating checkpoint type operations to the "backup" etc. Absolutely nothing like Remus.
  
  And, IIRC, NonStop SQL wasn't one of those applications - that amused me.
  
  --
  Why is there an "insightful" mod and why isn't it "-1"? If I wanted insight, I wouldn't be reading /.
18. Re:Already done by VMware by jipn4 · 2009-11-11 14:16 · Score: 1
  
  This sort of stuff is far older; it goes back to mainframe days and supercomputing.
  Furthermore, the idea of running two machines in lockstep and failing over shouldn't be patentable at all. Specific, particularly clever implementations of it might be, but those shouldn't preclude others from creating other implementations of the same functionality.
19. Re:Already done by VMware by ckaminski · 2009-11-11 16:57 · Score: 1
  
  No, it's not. We had four 2.x hosts running VCenter 1.3 which would randomly hang every couple of weeks. The ONLY solution was a hard power-cycle. They never could resolve the issue, saying upgrade to 3.x and VCenter 2. Right. We're going to upgrade all 20 of our ESX hosts just because you can't resolve this problem...
  
  Eventually we had to... retired the servers and the SAN they were connected to - problems never recurred. Great support, VMware.
20. Re:Already done by VMware by ckaminski · 2009-11-11 17:01 · Score: 1
  
  Prior art by HP which used to do this in Pentium-based Netservers?
  
  Granted real hardware, as opposed to software, but perhaps?
21. Re:Already done by VMware by Anonymous Coward · 2009-11-11 17:15 · Score: 0
  
  So you're saying that they gave you advice, you followed their advice, and the problem is gone? Sounds like terrible support to me.
22. Re:Already done by VMware by Per+Wigren · 2009-11-11 20:53 · Score: 1
  
  To anyone who actually needs this kind of uninterrupted HA the cost of a VMware license is an insignificant irrelevance.
  But now, we who don't actually need completely uninterrupted HA can have it anyway and as a bonus it will probably be easier to setup and maintain than a semi-custom "only one minute downtime"-HA solution. This is a good thing indeed.
  
  --
  My other account has a 3-digit UID.
23. Re:Already done by VMware by Bert64 · 2009-11-11 21:40 · Score: 2, Insightful
  
  They bought a particular version of vmware, and paid vmware to support the setup they had bought and paid for...
  VMware's method of providing support was to tell them to buy new expensive products... They failed to provide adequate support for the version they were actually being supported for...
  If their product fails, then an upgrade to a working version should be free at the very least.
  
  --
  http://spamdecoy.net - free throwaway anonymous email - avoid spam!
24. Re:Already done by VMware by oreaq · 2009-11-11 22:13 · Score: 1
  
  No. IBM did this kind of thing on mainframes 20 years ago. This stuff is actually pretty old.
25. Re:Already done by VMware by rjr3 · 2009-11-12 02:42 · Score: 1
  
  You would be wrong there, buddy. If I can save $15K per ESX host (3.5 - I don't know 4.x w/new vSphere) I will certainly try ... $100K is a lot of money ....
26. Re:Already done by VMware by Traksius+Egas · 2009-11-12 04:21 · Score: 1
  
  Mod parent up +1.
  Some people just think that money still grows on trees. I work for a local government who is migrating to VMWare for some of our core servers. But we have many other servers that could be virtualized with HA using this, but for a fraction of the cost.
27. Re:Already done by VMware by Jacques+Chester · 2009-11-12 19:49 · Score: 1
  
  I'd be surprised if the whole field isn't absolutely blanketed with patents by IBM. Mainframes have had this since the 80s or 90s, I think.
  
  --
  
  Classical Liberalism: All your base are belong to you.
28. Re:Already done by VMware by pedershk · 2009-11-12 21:27 · Score: 1
  
  No. http://www.marathontechnologies.com/high_availability_xenserver.html
  
  --
  Henning Same Shit (TM)
It's pretty fun by ickleberry · 2009-11-11 10:53 · Score: 1

It's pretty fun to yank the plug out on your web server and see everything continue to tick along.
Or an ordinary, every day run of the mill 'off the shelf' plain jane beige UPS. or a Ghetto one, if you'd like.

Still, it's pretty cool; just wondering how much overhead there is in setting up this system.
1. Re:It's pretty fun by Fulcrum+of+Evil · 2009-11-11 10:56 · Score: 1, Insightful
  
  if it's a webserver, what's the big deal? Run 4 and if 1 drops off, stop sending it requests. For an app server, I can see the advantages.
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
2. Re:It's pretty fun by Hurricane78 · 2009-11-11 11:28 · Score: 4, Informative
  
  Uuum... session management? Transaction management? The server dying in the process of something that costs money?
  Even if it's something as simple as losing the contents of your shopping cart just before you wanted to buy, and then becoming angry at the stupid ass retarded admins and developers of that site.
  Or losing the server connection in your flash game, right before saving the highscore of the year.
  Webservers are far less stateless than you might think. Nowadays they practically are app servers. (Disclosure: I did web applications since 2000, so I know a bit about the subject.)
  When 5 minutes downtime mean over a hundred complaints in your inbox and tens of thousands of dropped connections, which your boss does not find funny at all, you don't do that error again.
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
3. Re:It's pretty fun by Anonymous Coward · 2009-11-11 11:45 · Score: 0
  
  Not sure how we do the coding for it, but we have a 4 web server setup, it round robins through them for everything. If the ports 80, 443 goes down, it gets marked as down in the CSS and it stops taking traffic. The customer might see an error page if they were on the server that failed, but a single press of F5 and they are back working, no lost sessions, no lost transactions. It can be done, otherwise thar be magic in them machines.
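The setup described above (round robin across four web servers, with a failed server marked down and skipped) can be sketched in a few lines. This is a toy illustration only; `RoundRobinBalancer` and the server names are hypothetical and do not correspond to any particular load balancer product.

```python
class RoundRobinBalancer:
    """Toy balancer: rotates through servers and skips ones marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._next = 0

    def mark_down(self, server):
        self.down.add(server)

    def mark_up(self, server):
        self.down.discard(server)

    def pick(self):
        # Try each server at most once per call, starting at the rotation point.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server not in self.down:
                return server
        raise RuntimeError("no healthy servers")


lb = RoundRobinBalancer(["web1", "web2", "web3", "web4"])
lb.mark_down("web2")  # e.g. a health check saw ports 80/443 stop answering
picks = [lb.pick() for _ in range(6)]
```

Requests already in flight on the failed server are still lost, which is why the poster mentions the customer pressing F5; the balancer only protects requests that arrive after the server is marked down.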
4. Re:It's pretty fun by Fulcrum+of+Evil · 2009-11-11 11:50 · Score: 3, Insightful
  
  Webservers are far less stateless than you might think. Nowadays they practically are app servers. (Disclosure: I did web applications since 2000, so I know a bit about the subject.)
  Webservers have no business being the sole repository for these things - the whole point of separating out web from app is that web boxes are easily replaceable with no state.
  Session mgmt: store the session in a distributed way at least after each request. Transactions: they fail if you die half way through. Shopping cart: this doesn't live on a web server.
  If you require all that state, how do you ever do load balancing? Add a web server and it's another SPOF.
  
  When 5 minutes downtime mean over a hundred complaints in your inbox and tens of thousands of dropped connections, which your boss does not find funny at all, you don't do that error again.
  That's right, you move the state off the webserver so nobody ever sees the downtime and tell your boss that you promised 99.9 and damnit, you're delivering it!
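The "move the state off the webserver" argument above can be made concrete with a small sketch. It assumes a shared session store that every web server can reach; `SessionStore` here is just an in-memory stand-in for whatever database or cache cluster would play that role, and all names are hypothetical.

```python
import copy


class SessionStore:
    """Stand-in for a shared store (database, cache cluster); hypothetical."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        return copy.deepcopy(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = copy.deepcopy(session)


class WebServer:
    """Keeps no per-user state between requests; everything lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.load(session_id)
        session.setdefault("cart", []).append(item)
        self.store.save(session_id, session)  # persist after each request
        return session["cart"]


store = SessionStore()
web1, web2 = WebServer("web1", store), WebServer("web2", store)
web1.add_to_cart("abc", "book")
# web1 dies here; the next request lands on web2 and still sees the cart.
cart = web2.add_to_cart("abc", "lamp")
```

The trade-off is that the store itself now has to be highly available, which is the part of the problem the thread keeps circling back to.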
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
5. Re:It's pretty fun by Anonymous Coward · 2009-11-11 12:02 · Score: 0
  
  Not meaning to pick on you, but it sounds like bad coding practices are at fault. Have the session id get offloaded to a server that keeps track of the client and ties it to any transactions / what not that they need to use the web server for. Have the transactions logged into yet another server that allows the transactions to queue up. Have both of these session and transaction servers replicate to standby servers should they fail. In a group of web servers, as long as one web server is still up, you'll continue on, albeit slowly, but with no loss of customer... anything.
  
  The biggest inconvenience is teaching your customers to refresh the page if they get an error so they will be moved over to the working web server. I'll end this by saying I have no idea what I'm talking about when it comes to handling a flash game and getting it to handle these kinds of failures. But just plain old web servers that handle e-commerce or whatever, it is very doable to have it track session and transaction management not local to the web server it is running on. It adds to the complexity, but it will be more fault tolerant.
6. Re:It's pretty fun by radish · 2009-11-11 12:08 · Score: 1
  
  Web servers are stateless and sit in front of app servers, which are stateful but which have their sessions propagated to at least one other instance. When a web server dies no one cares; if an app server dies you just need to have some logic that allows the box which gets the next request in the session to either (a) redirect the request to the app server which was the backup for that session or (b) pull the session into its own cache from the backup.
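The failover logic described above, specifically option (b), might look something like the following. It is a rough sketch under simplifying assumptions (two app servers that back each other up, synchronous replication after every request); `AppServer` and `route` are hypothetical names, not any real framework's API.

```python
class AppServer:
    """Toy app server with a local session cache and one backup peer."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}      # sessions this server owns
        self.replicas = {}      # copies held on behalf of a peer
        self.backup_peer = None
        self.alive = True

    def handle(self, session_id, update):
        session = self.sessions.setdefault(session_id, {})
        session.update(update)
        if self.backup_peer is not None and self.backup_peer.alive:
            # Propagate the session to the backup after every request.
            self.backup_peer.replicas[session_id] = dict(session)
        return session


def route(servers, owner, session_id, update):
    """Send the request to the owner, or adopt the replica if the owner died."""
    if owner.alive:
        return owner, owner.handle(session_id, update)
    for server in servers:
        if server.alive and session_id in server.replicas:
            # Option (b): pull the session into the survivor's own cache.
            server.sessions[session_id] = server.replicas.pop(session_id)
            return server, server.handle(session_id, update)
    raise LookupError("session lost: no live replica")


a, b = AppServer("a"), AppServer("b")
a.backup_peer, b.backup_peer = b, a
route([a, b], a, "s1", {"user": "alice"})
a.alive = False  # simulated crash of the owning app server
new_owner, session = route([a, b], a, "s1", {"step": 2})
```

Unlike the Remus approach in the story, this requires the application tier to be written with replication in mind, which is the cost the summary calls "invasive modifications".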
  
  --
  ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"
7. Re:It's pretty fun by stefanlasiewski · 2009-11-11 12:11 · Score: 3, Insightful
  
  In many cases, the webserver IS the app server.
  This sort of feature could be very useful for those smaller shops and cheap shops who haven't yet created a dedicated Web tier, or for all those internal webservers which host the Wiki, etc.
  Webservers also help with capacity. Run 4 and if 1 drops off, not a big problem. But what if half the webservers drop off because the circuit which powers that side of the cage went down? And the 'redundant' power supplies on your machines weren't really 'redundant' (Thanks Dell)?
  
  --
  "Can of worms? The can is open... the worms are everywhere."
8. Re:It's pretty fun by Anonymous Coward · 2009-11-11 12:18 · Score: 0
  
  LOL, remind your boss he needs to fire you and get a competent programmer. Doing web apps since 2000... your comment would be spot on if this was still 2000, but web apps these days are a lot more robust than you seem to realize. Sounds like you are having major outages when you have a single server down... and you seem to think that this is expected and unavoidable... scary.
9. Re:It's pretty fun by Anonymous Coward · 2009-11-11 12:33 · Score: 0
  
  remind your boss he needs to fire you and get a competent programmer
  Hell, in 2000 I was writing shitty webapps and if the webserver wasn't also running the database and if we had owned more than one webserver, the webserver could have died and nobody would notice.
  Of course, nobody noticed when the company died either, so...
10. Re:It's pretty fun by BitZtream · 2009-11-11 12:56 · Score: 1
  
  I don't know about you, but my web apps don't let the web server handle session and transaction management. That's what I have a database server for, one that's capable of dealing with those issues in a known way that I can recover from to some extent. My important web apps use clusters of databases that take care of each other. There's a reason Oracle costs a fortune, and MySQL is free. I can't stand working with Oracle, but there's a reason it exists. Of course you don't have to use Oracle, that's just one example; there are plenty of alternatives from other vendors and middleware to do the job.
  Server dies in the middle of a process? Did the transaction complete? No? Rollback. Yes? Good, the database is in a known good state with everyone being updated with proper information, life goes on, just a bit delayed.
  You designed your system so your game state is totally client dependent or something? So if the web server dies there's no acceptable failure mode to revert to a previous relatively recent state? Must not be that important. If it is so important, why are you not saving the state at acceptable intervals to allow the user to restart from a reasonable point? Sounds like you're doing too much on the client, which probably means some massive security issues as well. Clients are never trusted for anything.
  Nothing about a web app is new. We solved all of these issues years ago. I think the problem is that you're just learning about them and don't realize that these problems are actually rather old and there are known ways to deal with them.
  I can turn off one of my web servers or database servers, literally killing tens of thousands of connections, and the worst case is a half a second of delay or so while the cluster removes it from the loop. The most the user sees is some web pages don't load some content. Any application that uses the web servers for access to the database will retry the request, seeing a different server which will be more than happy to handle the request. If the request was completed and the client couldn't be notified in time, the client will retry the request and be notified it was already completed and move on.
  This stuff is 40 years old, not anything new and exclusive to web apps. In short, the problems you listed are because you're doing it wrong.
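The retry behaviour described above ("the client will retry the request and be notified it was already completed") is usually achieved by making requests idempotent. Below is a minimal sketch of that idea, assuming the client attaches its own request ID to each operation; `OrderService` and the ID format are hypothetical, and a real system would keep the completed-request record in the same database transaction as the order itself.

```python
class OrderService:
    """Toy backend that remembers completed requests by client-chosen ID."""

    def __init__(self):
        self.completed = {}  # request_id -> result
        self.orders = []

    def place_order(self, request_id, item):
        if request_id in self.completed:
            # Retry of a request that already went through: no second order.
            return {"status": "already completed", **self.completed[request_id]}
        self.orders.append(item)
        result = {"order_number": len(self.orders)}
        self.completed[request_id] = result
        return {"status": "created", **result}


service = OrderService()
first = service.place_order("req-42", "book")   # reply lost when a server died
retry = service.place_order("req-42", "book")   # client retries with the same ID
```

The retry is safe because the second call is recognised by its ID and returns the original result instead of creating a duplicate order.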
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
11. Re:It's pretty fun by mikep554 · 2009-11-11 12:59 · Score: 1
  
  In many cases, the webserver IS the app server.
  This sort of feature could be very useful for those smaller shops and cheap shops who haven't yet created a dedicated Web tier, or for all those internal webservers which host the Wiki, etc.
  If they are smaller/cheaper shops, they probably aren't playing around with heavy virtualization to begin with. If you are virtualizing your example box, you're doing it wrong.
  
  But what if half the webservers drop off because the circuit which powers that side of the cage went down? And the 'redundant' power supplies on your machines weren't really 'redundant' (Thanks Dell)?
  Get a better UPS setup. If you have entire racks of systems that fill a cage, and your servers all shut down because their power died, you're doing it wrong. Rather than plugging all of the servers into individual UPS systems, get a UPS that covers all the circuits for the cage. And a generator.
12. Re:It's pretty fun by Fulcrum+of+Evil · 2009-11-11 14:53 · Score: 1
  
  or just hire someone to host your boxes, depending on what they're for.
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
13. Re:It's pretty fun by smash · 2009-11-11 16:33 · Score: 1
  
  a UPS does not protect against CPU/motherboard/ram hardware failure. This sort of HA does.
  
  --
  I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
14. Re:It's pretty fun by shmlco · 2009-11-11 17:42 · Score: 1
  
  "Session mgmt: store the session in a distributed way at least after each request."
  Bingo. With your solution, a submitted page request will fail. In fact, every page request and connection being handled by that server when it fails will fail.
  With the article's solution, things automagically switch over and everyone gets the data they requested. Users notice nothing.
  "... so nobody ever sees the downtime..."
  Except all of the users that clicked register or buy and get nothing at all.
  
  --
  Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
15. Re:It's pretty fun by shmlco · 2009-11-11 18:01 · Score: 1
  
  "I can turn off one of my web servers or database servers, literally killing tens of thousands of connections, and the worst case is a half a second of delay or so while the cluster removes it from the loop. The most the user sees is some web pages don't load some content."
  So if that server is running a shopping cart, then "thousands" of users might just have had their credit card submissions fail. They don't get confirmations and they don't know if the order went through or not. And I'd almost guarantee that your web server code sending the payment info to the payment processor and then updating the order database isn't atomic and will NOT fail gracefully, as two different systems are involved. Sure, your database may (repeat, may) have rolled back on a dropped connection, but that payment processor request to their server was applied and was NOT recorded.
  In short, your scenario wasn't "worst case" at all.
  If I were you I'd THINK about the consequences of such events occurring and not snidely assume that you have all the bases covered and that someone else is automatically "doing it wrong". In fact, I'll bet you a hundred bucks I could walk into your server room right now, pull just one power cord to the right switch, router, or load-balancer, and bring down the entire house.
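The failure window described above (payment applied at the processor, but never recorded locally) is easy to reproduce in a toy model. This sketch simulates a crash between the two steps; `PaymentProcessor`, `OrderDatabase`, and `checkout` are hypothetical stand-ins, not a real payment API.

```python
class Crash(Exception):
    """Simulates the web server dying mid-request."""


class PaymentProcessor:
    def __init__(self):
        self.charges = []

    def charge(self, order_id, amount):
        self.charges.append((order_id, amount))


class OrderDatabase:
    def __init__(self):
        self.paid_orders = set()

    def mark_paid(self, order_id):
        self.paid_orders.add(order_id)


def checkout(processor, database, order_id, amount, crash_between=False):
    processor.charge(order_id, amount)   # step 1: external system
    if crash_between:
        raise Crash("server died after charging, before recording")
    database.mark_paid(order_id)         # step 2: local database


processor, database = PaymentProcessor(), OrderDatabase()
try:
    checkout(processor, database, "order-7", 25, crash_between=True)
except Crash:
    pass

# Orders the processor charged for but the local database knows nothing about.
charged_but_unrecorded = [
    order_id for order_id, _ in processor.charges
    if order_id not in database.paid_orders
]
```

A database rollback cannot undo step 1 because it happened in someone else's system, so recovering from this needs either reconciliation against the processor's records or a retry-safe scheme keyed on the order ID.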
  
  --
  Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
16. Re:It's pretty fun by Jeremi · 2009-11-11 18:12 · Score: 1
  
  Or an ordinary, every day run of the mill 'off the shelf' plain jane beige UPS. or a Ghetto one, if you'd like.
  Sure, but power failure isn't the only thing that can stop your server from running -- it's just the easiest one to reproduce without permanently damaging anything. If you'd like a better example, yank the CPU out of your web server's motherboard instead. Your UPS won't save you then! :^)
  
  --
  
  I don't care if it's 90,000 hectares. That lake was not my doing.
17. Re:It's pretty fun by Fulcrum+of+Evil · 2009-11-11 18:36 · Score: 1
  
  it's a choice between reliability and complexity, and complexity has its own reliability problems. Ideally, the HA solution is best, but it relies on a lot more than the simple solution. The users that get an error can try again and it will work. I did say that it's mostly useful for the app server layer, right?
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
18. Re:It's pretty fun by queazocotal · 2009-11-12 02:11 · Score: 1
  
  No, it doesn't.
  This sort of solution protects from a limited subset of faults.
  It protects 100% from any fault that causes instant death.
  It does not protect from any fault that causes data corruption, where the system continues to run.
  Undetected bit-errors cause the states across the machines to differ.
  If these bit errors are replicated, you've got a machine in a copied, but corrupt state - the original and the copy may crash at exactly the same point.
  If they aren't, then you may get 'lucky', and have it failover, but with corrupted data that leads to corrupted data being saved, or later crashes.
  At best in this situation, you can flag a 'parity error'.
  With lots of additional complexity, you might be able to hack this into a redundant system.
  For example - you run 3 machines, and if one gets into a different state, you disable it.
  This has problems of its own though.
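The three-machine scheme mentioned above amounts to majority voting over replica state. Here is a minimal sketch of the comparison step only, assuming each host can produce a byte string representing its state; the host names, the `vote` function, and the use of SHA-256 digests are illustrative choices, not something Remus does.

```python
import hashlib
from collections import Counter


def state_digest(state: bytes) -> str:
    return hashlib.sha256(state).hexdigest()


def vote(replica_states: dict) -> tuple:
    """Return (majority_digest, names of replicas that disagree with it)."""
    digests = {name: state_digest(s) for name, s in replica_states.items()}
    winner, count = Counter(digests.values()).most_common(1)[0]
    if count <= len(digests) // 2:
        raise RuntimeError("no majority: cannot tell which replica is correct")
    outliers = sorted(n for n, d in digests.items() if d != winner)
    return winner, outliers


states = {
    "host-a": b"balance=100",
    "host-b": b"balance=100",
    "host-c": b"balance=108",  # simulated undetected single-bit flip
}
winner, to_disable = vote(states)
```

With only two replicas the same function raises an error on disagreement, which matches the point above that a pair can at best flag a "parity error" without knowing which side is wrong.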
19. Re:It's pretty fun by afidel · 2009-11-12 02:38 · Score: 1
  
  Silent bit errors on current server class hardware should be vanishingly rare; the buses and memory are protected by ECC.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
20. Re:It's pretty fun by queazocotal · 2009-11-12 03:16 · Score: 1
  
  Server class hardware should never have hardware faults either.
  Yes, server class hardware is usually more robust than consumer grade in terms of some bit errors.
  However, in the last minutes or seconds before a crash due to hardware failure, something is obviously going way out of spec.
  If this is detectable - it's a no-brainer - you simply failover if thresholds are breached, but before the crash occurs. (and you can afford to be a _lot_ more critical if you've got spare hardware)
  But a fair proportion of crashes happen with no software-detectable warning, as there is no sensor coverage of the failing part.
  For example, a failing solderjoint on a chipset or CPU, or ...
21. Re:It's pretty fun by Unequivocal · 2009-11-12 05:08 · Score: 1
  
  It depends. You can engineer a system to be very stateful, and you have to route the same client to the same webserver in order to maintain functionality. Or, you can build a totally stateless webserver, with all data stored on db servers and/or memcache installs. It's not hard to do this with many different web frameworks. So I disagree with you on the facts: many, many webservers these days are totally stateless. Perhaps you program in .NET? I have no idea how they do things.
22. Re:It's pretty fun by jon3k · 2009-11-12 05:48 · Score: 1
  
  Yeah and one webserver would be even simpler and less reliable. And no webserver would be even simpler. Good argument.
23. Re:It's pretty fun by Fulcrum+of+Evil · 2009-11-12 06:09 · Score: 1
  
  actually, yes. One webserver means no loadbalancing hardware to fail. LB is a mature tech and means that you can treat your N web servers as independent and also scale boxes out individually instead of in pairs.
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
24. Re:It's pretty fun by stefanlasiewski · 2009-11-12 12:05 · Score: 1
  
  If they are smaller/cheaper shops, they probably aren't playing around with heavy virtualization to begin with.
  My point is, this is a great virtualization feature which is very accessible and affordable for smaller shops. It may not be as nice as some of the solutions offered by VMware, Citrix, etc. but it's not as expensive either.
  Get a better UPS setup.
  Even your 'better UPS setup' will fail, sometimes. I'm specifically thinking of several power outages at major datacenters in Northern California, which were backed by millions of dollars worth of redundant UPSs and generators, all N+1. It will fail, usually somewhere down the line where they didn't realize it, to the point where even big players like NetApp, rhn.redhat.com and Linden Lab had hosts go down.
  
  --
  "Can of worms? The can is open... the worms are everywhere."
Himalaya by mwvdlee · 2009-11-11 10:57 · Score: 2, Interesting

How does this compare to a "big iron" solution like Tandem/Himalaya/NonStop/whatever-it's-called-nowadays.

--
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
1. Re:Himalaya by teknopurge · 2009-11-11 11:09 · Score: 1
  
  It doesn't. HP Non-Stop is a beast.
  
  --
  Website Hosting
2. Re:Himalaya by Jay+L · 2009-11-11 11:13 · Score: 2, Informative
  
  I was just thinking that...
  Tandems may still have other advantages, though; back in the day, we built a database on Himalayas/NSK because, availability aside, it outperformed Sybase, Oracle, and other solutions. (They implemented SQL down at the drive controller level; it was ridiculously efficient.) No idea if that's still the case.
  But Tandem required you to build their availability hooks into your app; it wasn't transparent. OTOH, Stratus's approach is; a Stratus server is like having RAID-1 for every component of your server. I gotta think this will cut into their business.
3. Re:Himalaya by Anonymous Coward · 2009-11-11 11:18 · Score: 1
  
  How does this compare to a "big iron" solution like Tandem/Himalaya/NonStop/whatever-it's-called-nowadays.
  Precisely.
  It's actually pretty cool from a computing history aspect. Once upon our time, the mainframes were the bad-assed machines. Hot-swapping power supplies and core modules. Several nines of uptime. Now we're doing it in software.
  I see it as a mirror to what's happening with data storage and the whole "cloud computing" thing. Going back and forth between big hosted machines and dumb clients to smaller smarter machines. It's like we flip back and forth every few years when it comes to computer ideology.
  I guess, what I'm trying to get at is...I can't think of anything too insightful to say. The only thing that comes to mind is: It's pretty damn cool how old ideas become new ideas. How the archaic way of doing things suddenly finds a place with new technology.
4. Re:Himalaya by teknopurge · 2009-11-11 11:22 · Score: 5, Interesting
  
  VM replication like this still has an IO bottleneck. This isn't magic: unless you move to infiniband you're not going to touch something like a Stratus or NonStop machine. By the time you add in the cost of the high-perf interconnects, you're on-par with the real-time boxes. All this convergence going on with people redesigning the mainframe but ass-backward with client/server gear. Makes little sense to me other than it being a gimmick.
  
  By the time you get all the components that provide the processing and I/O throughput of those high-end boxes, the x86/64 commodity hardware cost advantage has evaporated.
  
  --
  Website Hosting
5. Re:Himalaya by Anonymous Coward · 2009-11-11 11:27 · Score: 0
  
  I think Stratus still has some differentiation versus this approach since there's no hypervisor involved. However, this is very similar to what Marathon was already doing with Xen in their latest everRun products and this doesn't require the VM to be running windows.
6. Re:Himalaya by Vancorps · 2009-11-11 11:32 · Score: 1
  
  Huh? We have a SAN, son. You need more throughput? Add another 4 or 8gig trunk and bam, you've added significant bandwidth. With individual blades having dual 8gig HBAs you have quite a bit of IO available to you, assuming proper PCI-E. There is an upper limit where you shouldn't be virtualizing infrastructure, but that limit is moving ever higher. I don't know about you, but I have a NetApp based storage array with redundant switching gear that is more than capable of keeping up with the IO of having 20 servers on a single physical host, and that includes Oracle, Reporting Services, Exchange, and a few other high IO applications. My security server recording our multi-megapixel security cameras and a backup Oracle database will stay outside the virtual environment for obvious reasons. Then of course there is our DR setup for basic business continuity.
7. Re:Himalaya by Anonymous Coward · 2009-11-11 11:50 · Score: 1, Informative
  
  The IO bottleneck in this case is the interconnect between the two machines, not disk, so the SAN isn't relevant. VMware FT needs at least a dedicated GbE NIC for replay/lockstep traffic, I think the recommendation is 10Gb, and is still limited to using a single vCPU in the VM.
8. Re:Himalaya by teknopurge · 2009-11-11 11:52 · Score: 1
  
  The fact you're comparing NonStop/Stratus to the IO of a SAN is comical. There's a reason you don't virtualize large RDBMS in production environments: they fall over.
  
  Exchange is not a "high IO application". A high IO application is something like all the ATM transactions for Chase bank in North America. If you can have 20 servers on a single physical host you're doing it wrong: your apps aren't heavy by a long shot.
  
  --
  Website Hosting
9. Re:Himalaya by Cheaty · 2009-11-11 12:04 · Score: 3, Informative
  
  Actually, after reading the paper, this is no threat to Stratus or other players in the space like Marathon or VMWare's FT. The performance impact is pretty significant - by their own benchmarks there was a 50% perf hit in a kernel compile test, and 75% in a web server benchmark.
  This is an interesting approach and seems to handle multiple vCPU's in the VM which I haven't seen done by the software approaches like Marathon and VMware FT, but I think it will mainly be used in applications that would have never been considered for a more expensive solution anyway.
10. Re:Himalaya by mwvdlee · 2009-11-11 12:11 · Score: 1
  
  I'm not comparing this to mainframes in general, only to the "redundant" types.
  This isn't going to compare to a general mainframe simply because it doesn't have the massive resources (cpu's, disk space, memory, bandwidth, etc).
  A lot of those Tandems aren't used like a typical mainframe though. Sure, they may offer more resources than this Remus project solution, but many Tandem applications don't need those resources, they only need the redundancy and as-near-to-100%-as-possible-at-any-expense uptime.
  Another reply pointed to ATM machines. Indeed, these generally communicate with Tandems. The Tandems don't really do much with it, though. Mostly they just register the transactions, do some basic checks against non-realtime data and prepare the transaction to be handled by the actual bank systems housed on a different machine. The reason for this is simple economics; when comparing just performance, a Tandem is very expensive. It's wise to minimize the workload so as to minimize the amount of Tandem hardware required to run it.
  The question is: will the Remus project be able to handle some of those traditional Tandem workloads with similar quality?
  
  --
  Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
11. Re:Himalaya by Vancorps · 2009-11-11 16:32 · Score: 3, Insightful
  
  Were you replying to my comment? Because it doesn't sound like you read my comment. I specifically said there are cut-off points where virtual infrastructure doesn't make sense.
  Also, the fact that you think the IO of a SAN is any different than that of an HP Non-Stop setup is where things get really comical because you're talking about Infiniband which is used in x86 hardware as well. As I said, the threshold is moving into higher and higher workloads.
  I'm also not sure where you get your information about Exchange not being IO intensive. Exchange setups easily handle billions of transactions just like the big RDBMS out there. That's why when you evaluate virtual platforms they always ask you about your Exchange environment as well as your database environment. They are both considered to be high IO applications as all they do practically is read and write from disk.
  I find the whole concept of your argument funny considering the Non-stop setups were early attempts at abstraction from the hardware to handle failure and be able to spread the load. In essence it was the start of virtual infrastructure. There is a reason Non-Stop isn't primarily part of HP's business anymore, people are achieving what they need to with commodity hardware. Sorry, but you do indeed save a lot of money that way too. Enterprise crap used to cost boat loads, now it is accessible to much smaller players with smaller workloads but the same demands for up-time.
12. Re:Himalaya by Jeremi · 2009-11-11 18:17 · Score: 1
  
  By the time you get all the components that provide the processing and I/O throughput of those high-end boxes, the x86/64 commodity hardware cost advantage has evaporated
  I think the potential savings comes not so much from the hardware as from not having to redesign/re-write your low-availability (tm) software from scratch in order to make it highly-available. Instead you just slap your existing software in to the new Remus VM environment, connect the backup machine, and call it done.
  (Whether or not that method actually works in real life remains to be seen, of course, but that's the idea)
  
  --
  
  I don't care if it's 90,000 hectares. That lake was not my doing.
13. Re:Himalaya by anon+mouse-cow-aard · 2009-11-11 23:56 · Score: 2, Interesting
  
  We had a 700 kline app written in some Tandem specific application language. The smallest server we could get from HP was 400 K$. We re-wrote the app in python to use pairs of servers replicating via DRBD over ethernet and a load balancer in front. DRBD is slow, but with the new app I could just add pairs of nodes. We already had such a configuration for another application, and we combined the two, so the hardware cost was just adding two nodes in this cluster, at about 4 K$ per server node. 400 K$ -> 8 K$. I think it would take a heck of a lot of hardware to compensate for the pricing of that gear.
14. Re:Himalaya by afidel · 2009-11-12 02:43 · Score: 1
  
  Exchange hasn't been high I/O since 2007 and when 2010 launches it gets even better. A big enough environment might still see some decent IOPS but nothing like the same organization's DB environment in all likelihood.
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
15. Re:Himalaya by jon3k · 2009-11-12 05:53 · Score: 1
  
  "By the time you get all the components that provide the processing and I/O throughput of those high-end boxes, the x86/64 commodity hardware cost advantage has evaporated."
  
  One word: scaling
16. Re:Himalaya by lewiscr · 2009-11-12 06:01 · Score: 1
  
  You forgot to account for the time it took you to re-write the app. Porting 700kLOC in an obscure language doesn't sound like one guy did it in a week.
  Without the data, I'll still assume it's cheaper. It would take a couple man years to make up the difference. But it's not a 98% cost savings.
17. Re:Himalaya by anon+mouse-cow-aard · 2009-11-12 07:03 · Score: 1
  
  funny you should mention the app. We replaced 750 klines of application in TAL, with 20,000 lines of python, which is precisely a 98% reduction in code size. Yes, it took a couple of years, this is mission critical stuff. tackled one functionality facet at a time.
18. Re:Himalaya by bl8n8r · 2009-11-12 07:23 · Score: 1
  
  > Unless you move to infiniband you're not going to touch something like a Stratus
  I don't know who makes the infiniband, but the Stratus is only a V6 at best. It's not *that* fast.
  
  --
  boycott slashdot February 10th - 17th check out: altSlashdot.org
19. Re:Himalaya by Anonymous Coward · 2009-11-12 18:02 · Score: 0
  
  Memory speed > 8Gbps FC
  That is the entirety of this conversation. Taking all the possible angles:
  DDR Infiniband > 8Gbps FC
  QDR Infiniband > 8Gbps FC
  and so on, and so forth.
  The problem isn't making sure the image is synced - it's making sure the running state is synced, and on a high throughput application that has been tuned to run primarily in RAM you're going to run into all sorts of thresholds and bottlenecks. VMware is shit that windows admins on small LANs need to keep up with the times, not something that's going to replace big iron, large clusters, hpc kit, or scale out infrastructure design. The marketing slides say otherwise - I however call bullshit.
  "That's why when you evaluate virtual platforms they always ask you about your Exchange environment as well as your database environment."
  Actually they ask you that because the sales droid excel cheat sheet says to ask you that. VMware is a non-player in a *lot* of places, and it is not as dominant in others as they would make you believe. The pure reality of this is inescapable, unless you're asking VMware to design your environment for you, which given your statement ... well good luck with that.
Question by Anonymous Coward · 2009-11-11 11:00 · Score: 0

Not immediately clear on the Remus page... Is this like a constantly going "live migration" (without actually switching hosts) in that it _only_ keeps a copy of the memory of the guest? Or does this also keep a copy of the disk image? It'd be nice to not need shared storage just to be able to migrate without downtime...
1. Re:Question by Kjella · 2009-11-11 11:08 · Score: 1
  
  I'd think that'd be the easy part, much easier than having shared storage. The synchronization to make sure writes against shared storage happened exactly once would be much harder.
  
  --
  Live today, because you never know what tomorrow brings
Intact? by Glock27 · 2009-11-11 11:00 · Score: 4, Informative

Intact is one word, O ye editors...

--
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
1. Re:Intact? by Anonymous Coward · 2009-11-11 12:03 · Score: 1, Informative
  
  Infact, you're right!
2. Re:Intact? by stefanlasiewski · 2009-11-11 12:12 · Score: 1
  
  Your complaint shows a lack of tact ;)
  
  --
  "Can of worms? The can is open... the worms are everywhere."
3. Re:Intact? by martin-boundary · 2009-11-11 13:40 · Score: 2, Funny
  
  Intact is one word
  
  That was before someone gave Romulus a shovel!
state transfer by girlintraining · 2009-11-11 11:03 · Score: 3, Insightful

... Of course, this ignores the fact that if it's a software glitch, it'll happily replicate the bug into the copy. Also, there are certain hardware bugs that will also replicate: Mountain dew spilled on top of the unit, for example. There's this huge push for virtualization, but it only solves a few classes of failure conditions. No amount of virtualization will save you if the server room starts on fire and the primary system and backup are colocated. Keep this in mind when talking about "High Availability" systems.
On a different note, nothing that's claimed to be transparent in IT ever is. Whenever I hear that word, I usually cancel my afternoon appointments... Nothing is ever transparent in this industry. Only managers use that word. The rest of us use the term "hopefully".

--
#fuckbeta #iamslashdot #dicemustdie
1. Re:state transfer by Anonymous Coward · 2009-11-11 11:23 · Score: 0
  
  On a different note, nothing that's claimed to be transparent in IT ever is. Whenever I hear that word, I usually cancel my afternoon appointments... Nothing is ever transparent in this industry. Only managers use that word. The rest of us use the term "hopefully".
  You should know better than to make a claim like this.
  http://www.newegg.com/Product/Product.aspx?Item=N82E16811166006
2. Re:state transfer by Garridan · 2009-11-11 11:36 · Score: 4, Funny
  
  Mountain dew spilled on top of the unit, for example.
  FTFS:
  
  Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host.
  Wow! This software is *incredible* if mountain dew spilled on top of one machine is instantly replicated on the other machine! I'm gonna go read the source immediately, this has huge ramifications! In particular, if an officemate gets coffee and I also want coffee, only one of us needs to actually purchase a cup!
3. Re:state transfer by Vancorps · 2009-11-11 11:38 · Score: 3, Interesting
  
  If your primary and secondary systems are physically located next to each other then they aren't in the category of highly available. Furthermore with storage replication and regular snapshotting you can have your virtual infrastructure at your DR site on the cheap while gaining enterprise availability and most importantly, business continuity.
  I'll agree with being skeptical about transparency although how many people already have this? I went with XenServer and Citrix Essentials for it, I already have this fail-over and I can tell you that it works. I physically pulled a blade out of the chassis and sure enough, by the time I got back to my desk the servers were functioning having dropped a whole packet. Further tweaking of the underlying network infrastructure resulted in keeping the packet with just a momentary rise in latency.
  Enterprise availability is fast coming to the little guys.
4. Re:state transfer by bcully · 2009-11-11 11:48 · Score: 3, Informative
  
  FWIW, we have an ongoing project to extend this to disaster recovery. We're running the primary at UBC and a backup a few hundred KM away, and the additional latency is not terribly noticeable. Failover requires a few BGP tricks, which makes it a bit less transparent, but still probably practical for something like a hosting provider or smallish company.
5. Re:state transfer by Bender0x7D1 · 2009-11-11 12:14 · Score: 1
  
  How much bandwidth is needed for the connection on a per-machine basis? Asked another way - if I had 10 machines that I wanted to use this approach on, how fast of a connection would I need? At what levels of latency do problems start?
  
  --
  Reading code is like reading the dictionary - you have to read half of it before you can go back and understand it.
6. Re:state transfer by bcully · 2009-11-11 12:23 · Score: 5, Informative
  
  It depends pretty heavily on your workload. Basically, the amount of bandwidth you need is proportional to the number of different memory addresses your application wrote to since the last checkpoint. Reads are free -- only changed memory needs to be copied. Also, if you keep writing to the same address over and over, you only have to send the last write before a checkpoint, so you can actually write to memory at a rate which is much higher than the amount of bandwidth required. We have some nice graphs in the paper, but for example, IIRC, a kernel compilation checkpointed every 100ms burned somewhere between 50 and 100 megabits. By the way, there's plenty of room to shrink this through compression and other fairly straightforward techniques, which we're prototyping.
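  A rough Python sketch of the accounting described above: only the set of distinct locations dirtied during an epoch has to be sent, so rewriting the same location repeatedly costs nothing extra. Tracking at 4 KiB page granularity, the function name, and both write traces are illustrative assumptions, not details taken from the Remus code.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; not taken from Remus


def checkpoint_bandwidth_mbps(write_addresses, interval_s, page_size=PAGE_SIZE):
    """Estimate replication bandwidth for one checkpoint epoch.

    write_addresses: byte addresses written during the epoch.
    Only distinct dirty pages are sent; rewrites of a page are free.
    """
    dirty_pages = {addr // page_size for addr in write_addresses}
    bytes_sent = len(dirty_pages) * page_size
    return bytes_sent * 8 / interval_s / 1e6  # megabits per second


# 10,000 writes hammering one page vs. 10,000 writes to distinct pages,
# both within a single 100 ms checkpoint interval.
hot_loop = [0x1000 + (i % 64) for i in range(10_000)]
scattered = [i * PAGE_SIZE for i in range(10_000)]

print(checkpoint_bandwidth_mbps(hot_loop, 0.1))
print(checkpoint_bandwidth_mbps(scattered, 0.1))
```

  The two traces issue the same number of writes, yet the first dirties one page and the second dirties ten thousand, which is why the write rate can be far higher than the replication bandwidth.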
7. Re:state transfer by Bender0x7D1 · 2009-11-11 15:20 · Score: 1
  
  Cool. Thanks for the info.
  
  --
  Reading code is like reading the dictionary - you have to read half of it before you can go back and understand it.
8. Re:state transfer by Vancorps · 2009-11-11 16:16 · Score: 1
  
  Plenty of room for a Riverbed or Cisco WAAS in between to accelerate transfers as well. Sounds like you and I want to use the tech in similar ways.
  For me, I don't mess with BGP yet, I can accomplish what I need through virtual links with OSPF. Won't be as smooth as my per site fail-over since I have two locations on site. It's a temporary setup so I have three locations, a primary at our event, a secondary at our event, and a third back at HQ with a fourth on its way for DR purposes. Sucks moving your network from city to city but at least it makes for some interesting problems.
9. Re:state transfer by shmlco · 2009-11-11 17:32 · Score: 2, Interesting
  
  "If your primary and secondary systems are physically located next to each other then they aren't in the category of highly available."
  High availability covers more than just distributed data centers. Load-balancing, fail-over, clustering, mirroring, redundant switches, routers, and other hardware: all are zero-point-of-failure, high availability solutions.
  
  --
  Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
10. Re:state transfer by girlintraining · 2009-11-11 18:11 · Score: 2, Funny
  
  Wow! This software is *incredible* if mountain dew spilled on top of one machine is instantly replicated on the other machine! I'm gonna go read the source immediately, this has huge ramifications! In particular, if an officemate gets coffee and I also want coffee, only one of us needs to actually purchase a cup!
  I told them quantum computing was a bad idea, but nobody listened...
  I told them quantum computing was a bad idea, but nobody listened...
  I told them...
  
  --
  #fuckbeta #iamslashdot #dicemustdie
11. Re:state transfer by MistrBlank · 2009-11-12 02:44 · Score: 1
  
  You're confusing high availability with disaster recovery. Don't worry, my managers can't get it right either.
Blakes 7 by Anonymous Coward · 2009-11-11 11:25 · Score: 0

Xen ? The computer of the Liberator?
Answer by Anonymous Coward · 2009-11-11 11:26 · Score: 5, Informative

I've worked with Remus, so I can answer your question.
It's not "constantly going" into live migration. The backup image is constantly kept in a "paused" state. It doesn't come out of the paused state until communication with the original is broken.
Until the backup goes live, the shadow pages for memory are updated, via checkpoints. The checkpointing interval is somewhat variable, but it's actually hardcoded into the Xen software (at present - this will change), regardless of what the user level utility tells you.
As it is, the subsecond checkpointing doesn't work too well. But intervals of about 1-2 seconds work great. Getting subsecond checkpointing can be done (I've done it), but you need more code than what Remus currently provides.
Similar comments are applicable to the storage updating. This works absolutely superbly if you're using something like DRBD for the storage replication.
Remus is pretty cool technology, and it serves as a very solid foundation for taking things to the next level.
The folks at UBC have done a superb job here, and should be well congratulated.
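The behaviour described above (a backup held paused, fed checkpoints, and released only when contact with the original is lost) might be sketched in Python roughly as follows. The class name, the silence timeout, and the dictionary standing in for guest memory are all illustrative assumptions; none of this is Remus or Xen code.

```python
class PausedBackup:
    """Toy model of a backup replica that stays paused until the primary goes silent."""

    def __init__(self, timeout_s=1.0):
        self.timeout_s = timeout_s  # assumed silence threshold before failover
        self.state = "paused"
        self.memory = {}            # page number -> contents (stand-in for shadow pages)
        self.last_contact = 0.0

    def apply_checkpoint(self, dirty_pages, now):
        """Merge one checkpoint's dirty pages into the paused image."""
        if self.state != "paused":
            raise RuntimeError("backup already active; primary must resynchronize")
        self.memory.update(dirty_pages)
        self.last_contact = now

    def poll(self, now):
        """Unpause if nothing has been heard from the primary for timeout_s."""
        if self.state == "paused" and now - self.last_contact > self.timeout_s:
            self.state = "running"
        return self.state


backup = PausedBackup(timeout_s=1.0)
backup.apply_checkpoint({0: b"boot", 7: b"heap-v1"}, now=0.0)
backup.apply_checkpoint({7: b"heap-v2"}, now=0.5)
print(backup.poll(now=1.0))  # 0.5 s of silence, inside the timeout
print(backup.poll(now=2.0))  # 1.5 s of silence, past the timeout
```

The backup never executes while checkpoints keep arriving; it only accumulates the latest committed pages, which is what lets it resume from the last checkpoint once the primary stops responding.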
1. Re:Answer by Vrtigo1 · 2009-11-12 01:51 · Score: 1
  
  This does sound very similar to what VMware implemented in vSphere 4.0 as far as nearly real-time failover where the HA mechanism is application agnostic. It sounds like with Remus you can use a single box as a failover target for multiple physical hosts, (i.e. a single failover box can protect the web server, db server and mail server assuming it is sized appropriately). Does Remus only work with physical to virtual or can you also keep two VMs in lockstep?
2. Re:Answer by BitZtream · 2009-11-13 07:54 · Score: 1
  
  From http://www.usenix.org/events/nsdi/tech/full_papers/cully/cully_html/index.html
  
  We then evaluate the overhead of the system on application performance across very different workloads. We find that a general-purpose task such as kernel compilation incurs approximately a 50% performance penalty when checkpointed 20 times per second, while network-dependent workloads as represented by SPECweb perform at somewhat more than one quarter native speed. The additional overhead in this case is largely due to output-commit delay on the network interface.
  Based on this analysis, we conclude that although Remus is efficient at state replication, it does introduce significant network delay, particularly for applications that exhibit poor locality in memory writes. Thus, applications that are very sensitive to network latency may not be well suited to this type of high availability service (although there are a number of optimizations which have the potential to noticeably reduce network delay, some of which we discuss in more detail following the benchmark results). We feel that we have been conservative in our evaluation, using benchmark-driven workloads which are significantly more intensive than would be expected in a typical virtualized system; the consolidation opportunities such an environment presents are particularly attractive because system load is variable.
  So with 20 checkpoints a second you turn a normal compile time into twice as long as the original time. With a web server, just serving web pages and not pulling data off an external database, you're down to roughly 25% of the original speed. If you throw in database access it's not just going to be 25% of 25%, it's going to be far worse due to the two way communications with the database having to wait on checkpointing.
  That's at 20 checkpoints a second. At 1-2 seconds per checkpoint I imagine the kernel compile would probably go faster if you're not using NFS, but networked IO is going to be unusable.
  It's cool that you've done it, but it's really not useful for many applications.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
How does it deal with replication latency? by melted · 2009-11-11 11:33 · Score: 2, Interesting

I'm pretty sure that if I just yank the cable, not everything will be replicated. :-)
1. Re:How does it deal with replication latency? by bcully · 2009-11-11 11:41 · Score: 5, Informative
  
  Hello slashdot, I'm the guy that wrote Remus. It's my first time being slashdotted, and it's pretty exciting! To answer your question, Remus buffers outbound network packets until the backup has been synchronized up to the point in time where those packets were generated. So if you checkpoint every 50ms, you'll see an average additional latency of 25ms on the line, but the backup _will_ always be up to date from the point of view of the outside world.
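  The 25ms average can be checked with a small Python sketch. It assumes packets are generated evenly across an epoch and that each one is released at the next checkpoint boundary, with the checkpoint itself committing instantly; both are simplifications, and the function names are made up for illustration.

```python
import math


def release_time_ms(generated_ms, interval_ms):
    """A packet is held until the next checkpoint boundary at or after its creation."""
    return math.ceil(generated_ms / interval_ms) * interval_ms


def mean_added_latency_ms(packet_times_ms, interval_ms):
    """Average time packets spend waiting in the output buffer."""
    delays = [release_time_ms(t, interval_ms) - t for t in packet_times_ms]
    return sum(delays) / len(delays)


# One packet per millisecond (offset by 0.5 ms) across a single 50 ms epoch.
packets = [k + 0.5 for k in range(50)]
print(mean_added_latency_ms(packets, 50))
```

  A packet generated just after a checkpoint waits almost the whole interval and one generated just before waits almost nothing, so the mean lands at half the interval.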
2. Re:How does it deal with replication latency? by shentino · 2009-11-11 11:49 · Score: 1
  
  How does remus handle things if it mispredicts the packets?
  Supposing that it sends packet X, crashes, and then when it's restored from checkpoint it decides to send packet Y instead?
  Schroedinger
3. Re:How does it deal with replication latency? by bcully · 2009-11-11 11:56 · Score: 5, Informative
  
  The buffering I mentioned above means that packet X will not escape the machine until the checkpoint that produced X has been committed to the backup. So when it recovers on the backup, X will already be in the OS send buffer. There's no possibility for misprediction. If the buffer is lost, TCP will handle recovering the packet.
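  As a minimal Python sketch of that rule, assuming a simple in-memory queue (the class and method names are hypothetical, not the Remus implementation): a packet reaches the outside world only after the checkpoint covering it has been committed, so a failure can only discard packets that nobody outside ever saw.

```python
class OutputCommitBuffer:
    """Toy output buffer: packets leave only after their checkpoint is committed."""

    def __init__(self):
        self.pending = []   # generated since the last committed checkpoint
        self.released = []  # actually visible to the outside world

    def send(self, packet):
        """Guest transmits a packet; it is held, not put on the wire."""
        self.pending.append(packet)

    def checkpoint_committed(self):
        """Backup acknowledged the checkpoint: release everything it covers."""
        self.released.extend(self.pending)
        self.pending.clear()

    def primary_failed(self):
        """Primary died: held packets are dropped and were never seen outside."""
        dropped = list(self.pending)
        self.pending.clear()
        return dropped


buf = OutputCommitBuffer()
buf.send("W")
buf.checkpoint_committed()   # W is now visible, and the backup's state can produce it
buf.send("X")                # X is generated, then the primary crashes
print(buf.released)
print(buf.primary_failed())
```

  In this model the outside world has seen W but not X, so it makes no difference whether the backup later sends X or some other packet Y.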
4. Re:How does it deal with replication latency? by BitZtream · 2009-11-11 13:12 · Score: 2, Interesting
  
  No it won't.
  VMWare claims the same crap and it's simply not true.
  You have a 50ms window between checkpoints that can be lost, in your example. The only way to ensure no loss is to ensure that every change, every instruction, every microcode executed in the CPU on machine A is duplicated on B before A continues to the next one. You simply can't do that without specialized hardware since you don't even have access to the microcode as it's executed on standard hardware.
  50ms on my hardware/software can mean thousands of transactions lost. That can wreak havoc on certain network protocols and cause database operations to fail completely as you replay portions of transactions that the database has already seen.
  I can come up with situations all day long as to how this isn't as seamless as you make it out to be. Sure, xclock transitions to the other machine in what appears to be a perfect no loss transition, or solitaire on a windows machine, but that's not exactly useful.
  Remus has plenty of uses, but it has plenty of pitfalls and, regardless of claims, does require consideration when developing systems, unless you're introducing latency that, to me, would just be completely unacceptable and would require applications to be aware of the latency. Hell, that's 6.25MB of data that can be transmitted over a gigabit pipe between checkpoints. That can kill performance.
  I know what you're saying, I know what you mean, and I just don't think you realize how much that latency can affect certain classes of applications.
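  The 6.25MB figure is straightforward to reproduce in Python. The helper name is made up for illustration, and the calculation treats a gigabit as 10^9 bits per second with no protocol overhead.

```python
def bytes_per_window(link_bits_per_s, window_ms):
    """Bytes a link can carry during one checkpoint window."""
    # Divide by 1000 to convert ms to s and by 8 to convert bits to bytes.
    return link_bits_per_s * window_ms // 8000


print(bytes_per_window(1_000_000_000, 50))
```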
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
5. Re:How does it deal with replication latency? by bcully · 2009-11-11 13:19 · Score: 5, Insightful
  
  I think you're missing the point of output buffering. Remus _does_ introduce network delay, and some applications will certainly be sensitive to it. But it never loses transactions that have been seen outside the machine. Keeping an exact copy of the machine _without_ having to synchronize on every single instruction is exactly the point of Remus.
6. Re:How does it deal with replication latency? by convolvatron · 2009-11-11 14:24 · Score: 1
  
  this isn't true. a fully recoverable abstraction can be maintained without digging into the architecture. you just need a point periodically where you flush everything and define a consistent checkpoint
  personally i prefer doing this in the database, or operating system, or application, but suggesting that you can't do this underneath is simply wrong. it just comes down to performance
7. Re:How does it deal with replication latency? by Antique+Geekmeister · 2009-11-11 15:08 · Score: 4, Insightful
  
  If your application cannot tolerate a 50 msec pause in outbound traffic (which is what Remus seems to introduce, similar to VMWare switchovers) then you have no business running it over a network, much less over the Internet as a whole. Similar pauses are introduced in core switching and core routers on a fairly frequent basis, and are entirely unavoidable.
  There are certainly classes of application sensitive to that kind of issue: various "real-time-programming" and motor control sensor systems require consistently low latency. But for public facing, high-availability services, it seems useful, and much lighter to implement than VMWare's expensive solutions.
8. Re:How does it deal with replication latency? by Gazzonyx · 2009-11-11 15:59 · Score: 1
  
  Indeed. With the right (or more accurately, wrong) file system, IO scheduler, RAID layout, and workload, you can push your disk latency to well over 50 ms before it has a chance to get to the wire's buffer. The objective is to avoid hours of latency, not milliseconds. TCP/IP will take care of the road bumps if you make sure that the road doesn't stop at the edge of a cliff.
  
  --
  If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
9. Re:How does it deal with replication latency? by msh104 · 2009-11-12 00:56 · Score: 1
  
  Hi bcully
  Mind if i ask you something.
  Currently i am running a xen setup where we replicate the storage between two machines using drbd.
  Live migration is supported in this scenario and failover is said to be as well though i haven't come around to check that out yet.
  1. Are there any advantages to using Remus over such a setup? ( other than being much easier to set up :p )
  2. Would it be possible to use proven solutions like drbd with remus or does this simply miss the point?
  I'll be sure to check it out when it turns up in the next version
  Greets
  Mark
10. Re:How does it deal with replication latency? by bill_mcgonigle · 2009-11-12 02:29 · Score: 1
  
  Do large memory operations cause the network buffer to stall until the memory changes are synchronized?
  
  --
  My God, it's Full of Source!
  OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
11. Re:How does it deal with replication latency? by happyhangone · 2009-11-12 04:57 · Score: 1
  
  Does Remus require shared storage? It seems that it doesn't require it. If that is the case, is there any value in using it within a SAN?
12. Re:How does it deal with replication latency? by bcully · 2009-11-12 07:10 · Score: 1
  
  Yes, dirtying a lot of RAM will extend the time between checkpoints, and outbound traffic is buffered until the next checkpoint. I've put it under fairly heavy load and I don't think I've seen more than about 100ms epochs outside of my deliberate dirty-memory-as-fast-as-possible microbenchmark.
13. Re:How does it deal with replication latency? by bcully · 2009-11-12 07:12 · Score: 1
  
  No, Remus doesn't require shared storage -- just two regular PCs with their own disks will do. It could be used on a SAN, but the disk replication system would have to be modified, to do something like journalling the writes from the primary so that they could be undone on failover up to the point of the last checkpoint.
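  One way to picture the journalling idea, as a hedged Python sketch: record the previous contents of each block before overwriting it, discard the record when a checkpoint commits, and replay it backwards on failover. The class, its methods, and the dictionary standing in for a block device are assumptions for illustration; this is not how Remus's disk replication is actually written.

```python
class JournaledDisk:
    """Toy block device that can undo writes made since the last checkpoint."""

    def __init__(self):
        self.blocks = {}    # block number -> data
        self.undo_log = []  # (block number, previous data or None), oldest first

    def write(self, block, data):
        """Save the old contents, then overwrite the block."""
        self.undo_log.append((block, self.blocks.get(block)))
        self.blocks[block] = data

    def checkpoint(self):
        """Checkpoint committed on the backup: earlier writes become permanent."""
        self.undo_log.clear()

    def rollback(self):
        """Failover: undo uncommitted writes, newest first."""
        while self.undo_log:
            block, previous = self.undo_log.pop()
            if previous is None:
                self.blocks.pop(block, None)  # block did not exist before
            else:
                self.blocks[block] = previous


disk = JournaledDisk()
disk.write(1, "a")
disk.checkpoint()    # "a" is now part of the committed state
disk.write(1, "b")   # uncommitted overwrite
disk.write(2, "c")   # uncommitted new block
disk.rollback()      # primary failed before the next checkpoint
print(disk.blocks)
```

  After the rollback the shared disk matches the memory image the backup resumes from, which is the property the journal exists to provide.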
14. Re:How does it deal with replication latency? by bcully · 2009-11-12 07:16 · Score: 1
  
  I think it'd be interesting to use DRBD with Remus, since it does a nice job with things like resynchronization which you would need when the primary comes back on line. But DRBD gives your VM one shared block device, so as I mentioned in the SAN question, Remus would need some form of journal or transaction layer over the block device to be able to roll back uncommitted writes. A shared-storage option is on our to-do list.
15. Re:How does it deal with replication latency? by msh104 · 2009-11-12 09:43 · Score: 1
  
  Thanks for the feedback!
  I am really looking forward to it.
  You know it's actually quite easy to have multiple block devices using drbd and make them available to your VM? You can specify as many drbd devices as you like in your config. I am currently using one for root and one for swap.
  disk = [
  'drbd:drbd-server1-root,xvda1,w',
  'drbd:drbd-server1-swap,xvda2,w',
  ]
  I am no expert at this but couldn't you use a third drbd block for storing the journal that keeps track of all the changes and use that for recovery?
  I suppose it would cause some writes all over the disk, so that might get tricky to get optimized. But it does come with the advantage that it doesn't do any strange things to the data on the drbd device all the time :)
  Either that, or i am completely misunderstanding you :)
  Have fun and keep up the good work ;)
16. Re:How does it deal with replication latency? by BitZtream · 2009-11-12 21:08 · Score: 1
  
  It's not 'one 50ms pause' that's the problem, it's 'one 50ms pause for every sort of communication with external hosts of any sort'
  Open a database connection, for instance:
  VM sends start request, wait for checkpoint (50ms)
  DB responds to packet with ACK
  VM sends response ACK
  VM sends DB handshake start, wait for checkpoint (50ms)
  Server responds with server info
  VM sends DB protocol version requested, wait for checkpoint (50ms)
  Server responds.
  VM sends transaction start request, wait for checkpoint (50ms)
  Server responds that it's opened a transaction and returns a transaction id
  VM sends a query which locks some rows and returns some data, using a cursor because its a potentially large dataset, wait for checkpoint (50ms)
  Server responds with the cursor to us
  VM sends request to read first row, wait for checkpoint (50ms)
  Server responds with first row
  VM reads result and sends request for the next row (or 100 for that matter), wait for checkpoint (50ms)
  Server responds with data
  ( .. REPEAT FOR 10k rows ..)
  VM now starts an update on some of those rows, sends query off to the server, wait for checkpoint (50ms)
  Server performs the update and sends the response to the VM
  VM does another update on a different set of rows or table or whatever, wait for checkpoint (50ms)
  I'm already several seconds into a query that should have been done in the first 50ms.
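  The arithmetic behind the sequence above can be sketched in Python. It assumes every outbound request waits a fixed time for a checkpoint and that nothing overlaps, which makes it an upper-bound style estimate rather than a measurement; the function name is made up, and the count of eight fixed sends is simply the number of non-repeated "wait for checkpoint" steps in the list.

```python
import math

FIXED_SENDS = 8  # sends outside the row-fetch loop, counted from the list above


def added_delay_ms(rows, rows_per_fetch, wait_ms):
    """Extra latency if every outbound request waits wait_ms for a checkpoint."""
    fetches = math.ceil(rows / rows_per_fetch)
    return (FIXED_SENDS + fetches) * wait_ms


print(added_delay_ms(10_000, 100, 50))  # 100 rows per fetch, full 50 ms wait each time
print(added_delay_ms(10_000, 100, 25))  # same, using the 25 ms average quoted upthread
print(added_delay_ms(10_000, 1, 50))    # one row per fetch
```

  Under these assumptions the total is dominated by the number of round trips rather than the amount of data, so batching rows per fetch changes the result by orders of magnitude.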
  It has nothing to do with real time. If the outside world is going to see the state of the VM such that it can be transferred to another host and continue as if nothing happened, then you've got to checkpoint every time a round trip happens between the server and the client. Introducing a 50ms delay into every single step of a database transaction would be a nightmare. I spend days trying to get transactions to occur in less than 50ms. It would be far, far easier to fix the application to deal with failover to another machine than it would be to deal with the performance problems in this sort of setup.
  No matter how you look at it, you are adding a 50ms delay during EVERY SINGLE STAGE of the communication sequence (which is what would have to happen if the outside world is going to have NO CLUE that the guest OS disappeared and was moved to new host hardware), and that's just for a single database connection. For every new network connection you add, you have to checkpoint between each one. If you have two processes on the guest doing two transactions, they are fighting with each other to get the checkpoint so they can proceed to the next step.
  If the outside world has any indication that something went wrong, then it's probably just as easy to have a standard cluster and take the failed host out of it, rather than trying to migrate it.
  This sort of thing is fine when you have a process that uses a lot of CPU power compared to the amount of communication it does with other hosts, but as soon as you start bringing network IO into the process, it's almost worthless.
  Do you know how absolutely horrible NFS would be in this situation, if it actually does what it claims?
  It'd be fine for running SETI@home, but not for anything that was interacting with the outside world on any regular interval.
  These are just the first couple of examples I can think of. Put some effort into understanding what's being said and what actually has to happen to accomplish what's claimed and you should see the problem real quick.
  Hell, the claim is that the TCP/IP state is kept perfectly ... think about that alone. You know how many times the TCP/IP state can change in 50ms over a 1Gb link? Without this checkpointing it can change a hell of a lot; with this checkpointing it can change exactly once before you're stuck waiting on a checkpoint in order to maintain coherence. You're going to have to snapshot far more often than you can possibly imagine if there is network IO involved.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
17. Re:How does it deal with replication latency? by BitZtream · 2009-11-13 07:47 · Score: 1
  
  From a Remus whitepaper:
  http://www.usenix.org/events/nsdi/tech/full_papers/cully/cully_html/index.html
  
  We then evaluate the overhead of the system on application performance across very different workloads. We find that a general-purpose task such as kernel compilation incurs approximately a 50% performance penalty when checkpointed 20 times per second, while network-dependent workloads as represented by SPECweb perform at somewhat more than one quarter native speed. The additional overhead in this case is largely due to output-commit delay on the network interface.
  Based on this analysis, we conclude that although Remus is efficient at state replication, it does introduce significant network delay, particularly for applications that exhibit poor locality in memory writes. Thus, applications that are very sensitive to network latency may not be well suited to this type of high availability service (although there are a number of optimizations which have the potential to noticeably reduce network delay, some of which we discuss in more detail following the benchmark results). We feel that we have been conservative in our evaluation, using benchmark-driven workloads which are significantly more intensive than would be expected in a typical virtualized system; the consolidation opportunities such an environment presents are particularly attractive because system load is variable.
  50% performance lost for non-network IO applications, only 1/4 original speed for a normal web hit, not using a database to generate data on the page. This is well outside any acceptable level of performance for anything more than a toy/example.
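  The "output-commit delay" the paper mentions can be pictured with a toy buffer: outbound packets are held until the backup acknowledges the checkpoint that covers them, and only then go out on the wire. This is a simplified sketch, not Remus code. The class and method names are made up for illustration, and it ignores the detail of which epoch a packet belongs to.

```python
from collections import deque


class OutputCommitBuffer:
    """Toy model: hold outbound packets until a checkpoint is acknowledged."""

    def __init__(self):
        self._pending = deque()  # packets generated since the last acknowledged checkpoint
        self.released = []       # packets that have actually left the host, in order

    def send(self, packet):
        # The guest believes the packet was sent; in reality it is only queued.
        self._pending.append(packet)

    def checkpoint_acked(self):
        # The backup now holds the state that produced these packets,
        # so it is safe to let the outside world see them.
        count = len(self._pending)
        while self._pending:
            self.released.append(self._pending.popleft())
        return count


buf = OutputCommitBuffer()
buf.send("SYN-ACK")
buf.send("HTTP 200")
print(buf.released)            # nothing has left the host yet
print(buf.checkpoint_acked())  # number of packets released by this acknowledgement
print(buf.released)
```

  Under this model the penalty on network workloads comes from the time packets sit in the queue, not from the replication bandwidth itself.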
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
18. Re:How does it deal with replication latency? by bcully · 2009-11-13 09:30 · Score: 1
  
  You're right that this increases cost for round trips, but it's not nearly as bad for _throughput_ as you imply. Applications pipeline, and TCP is windowed, so you don't have a round trip per packet or message. Above, where you imply a round trip for each row returned from a table, the actual behaviour is that the client sends a request, and the server immediately starts spitting out a large number of rows. There's a delay of on average your checkpoint length/2 for the initial response to arrive (so 25ms if you're checkpointing at 50ms), then the rest of the query arrives at link speed.
  Remus is meant for applications that are separated by a WAN/internet link from their clients (internet applications generally have to tolerate this degree of latency anyway). For multi-server cooperative applications (like a web server in front of a database), you could get rid of the network delay by checkpointing both servers and failing them over as a unit. We have some experimental cluster checkpoint code here, but it's not part of this release.
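  A rough way to compare the two pictures, assuming a held packet waits on average half a checkpoint interval and that each new request from a chatty client lands at a random point in the interval. The function names are illustrative only.

```python
def mean_release_delay_ms(checkpoint_ms):
    """Average wait for a packet generated at a random point in a checkpoint interval."""
    return checkpoint_ms / 2


def added_delay_ms(round_trips, checkpoint_ms=50):
    """Average extra delay when each round trip pays one buffered release."""
    return round_trips * mean_release_delay_ms(checkpoint_ms)


pipelined = added_delay_ms(1)   # one request, rows streamed back at link speed
chatty = added_delay_ms(100)    # 100 separate fetch round trips
print(pipelined, chatty)        # 25.0 2500.0
```

  The cost scales with the number of dependent round trips, not with the amount of data returned, which is why pipelined and windowed traffic fares much better than request-per-row traffic.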
Nope by Anonymous Coward · 2009-11-11 11:35 · Score: 4, Insightful

Remus presented their software well before VMware came out with their product.
What's different now is that the Remus patches have finally been incorporated into the Xen source tree.
If VMware has any patents, they'll have to clear the hurdle of predating the original publication of the Remus work, which was a while ago.
Besides, Remus can be used in more ways than what VMware offers, since you have the source code.
1. Re:Nope by Eli+Gottlieb · 2009-11-11 22:31 · Score: 1
  
  What's different now is that the Remus patches have finally been incorporated into the Xen source tree.
  Hear, hear! I spent my summer research internship this year incorporating Remus patches into the Xen source tree for use on a departmental project. It was two months of bloody hacking to make the patched source, the build system, and the use environment cooperate well enough to actually get a Remus system running and backing up its VMs over the network. We never got it perfect.
2. Re:Nope by Anonymous Coward · 2009-11-11 23:00 · Score: 0
  
  That's plain WRONG. VMWARE demoed this at VMWARE 2007 while the REMUS paper wasn't published till 2008.
3. Re:Nope by spotter · 2009-11-12 03:36 · Score: 1
  
  The Remus paper references VMware's high availability. (It was also published in 2008, about 1.5 years ago, though I don't know when it first started to be used; possibly before then.)
  however, incremental checkpoint precedes both. See (pulling from my bibtex for paper I helped write)
  author = "J. S. Plank and J. Xu and R. H. B. Netzer",
  title = "{Compressed Differences: An Algorithm for Fast
  Incremental Checkpointing}",
  author = {Roberto Gioiosa and Jose Carlos Sancho and Song Jiang and Fabrizio Petrini},
  title = "{Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers}",
  author = {Ashok Joshi and William Bridge and Juan Loaiza and Tirthankar Lahiri},
  title = "{Checkpointing in Oracle}",
  author = "Angkul Kongmunvattana and Santipong Tanchatchawal and Nian-Feng Tzeng",
  title = "{Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems}",
  as well as a paper I was a coauthor on where we continuously checkpointed a regular gnome desktop (along with its file system) and enabled you to restart it at any point in the past.
  author = "Oren Laadan and Ricardo Baratto and Dan Phung and Shaya Potter and Jason Nieh",
  title = {{DejaView: A Personal Virtual Computer Recorder}},
make dom0 support for recent kernels first by Anonymous Coward · 2009-11-11 11:59 · Score: 0

It is absolutely unbelievable that the official Xen kernel is still 2.6.18. There's a lot of modern hardware that isn't supported by it. This is an absolute show stopper.
Wrong place to put a failsafe? by mattbee · 2009-11-11 13:18 · Score: 3, Insightful

Surely there is a strong possibility of a failure where both VMs run at once: the original image thinking it has lost touch with a dead backup, and the backup thinking the master is dead and so starting to execute independently? If they're connected to the same storage / network segment, it could cause data loss, bring down the network service and so on. I've not investigated these types of lockstep VMs, but it seems you have to make some pretty strong assumptions about failure modes, which always break eventually on commodity hardware (I've seen bad backplanes, network chips, CPU caches, RAM of course, switches...). How can you possibly handle these cases to avoid having to mop up after your VM is accidentally cloned?

--
Matthew @ Bytemark Hosting
1. Re:Wrong place to put a failsafe? by bcully · 2009-11-11 13:26 · Score: 4, Informative
  
  Split brain is a possibility, if the link between the primary and backup dies. Remus replicates the disks rather than requiring shared storage, which provides some protection over the data. But there are already a number of protocols for managing which replica is active (e.g., "shoot-the-other-node-in-the-head") -- we're worried about maintaining the replica, but happy to use something like linux-HA to control the actual failover.
2. Re:Wrong place to put a failsafe? by dido · 2009-11-11 13:51 · Score: 4, Interesting
  
  This is something that the much simpler Linux-HA environment deals with by using something they call STONITH, which basically means Shoot The Other Node In The Head. STONITH peripherals are devices that can completely shut down a server physically, e.g. a power strip that can be controlled via a serial port. If you wind up with a partitioned cluster, which they more colorfully call a 'split brain' condition, where each node thinks the other one is dead, each of them uses the STONITH device to make sure, if it is able. One of them will activate the STONITH device before the other, and the one which wins keeps on running, while the one that loses really kicks the bucket if it isn't fully dead. I imagine that Remus must have similar mechanisms to guard against split brain conditions as well. I've had several Linux-HA clusters go split brain on me, and I tell you it's never pretty. The best case is that they only both try to grab the same IP address and get an IP address conflict. In the worst case, they both try to mount and write to the same Fibre Channel disk at the same time and bollix the file system. If a Remus-based cluster split brains, I can imagine that you'll get mayhem just as awful unless you have a STONITH-like system to prevent it from happening.
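  The race can be sketched with a toy fencing device. This is only an illustration of the idea, not Linux-HA code: the class and method names are invented, and a real STONITH agent talks to actual power hardware.

```python
class FenceDevice:
    """Toy stand-in for a STONITH device such as a switchable power strip."""

    def __init__(self, nodes):
        self.powered = {name: True for name in nodes}

    def shoot(self, shooter, target):
        # A node that has already been powered off cannot shoot back.
        if not self.powered[shooter]:
            return False
        self.powered[target] = False
        return True


device = FenceDevice(["node-a", "node-b"])
# Split brain: each node believes the other is dead and tries to fence it.
print(device.shoot("node-a", "node-b"))  # True, node-a got there first
print(device.shoot("node-b", "node-a"))  # False, node-b is already off
survivors = [name for name, on in device.powered.items() if on]
print(survivors)                         # ['node-a']
```

  Whichever node reaches the device first wins, so at most one copy keeps running.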
  
  --
  Qu'on me donne six lignes écrites de la main du plus honnête homme, j'y trouverai de quoi le faire pendre.
3. Re:Wrong place to put a failsafe? by CharlyFoxtrot · 2009-11-11 15:49 · Score: 1
  
  Sounds like a godawful mess, glad I've never had to deal with a split brain. We manage mostly Solaris clusters and they're pretty good about panicking a node when there's a chance the cluster risks becoming inconsistent (loss of quorum). If you're already syncing disks like in this case, it shouldn't be too difficult to set up a quorum device or HACMP-like disk heartbeats. Doesn't Linux-HA support this type of setup?
  
  --
  If all else fails, immortality can always be assured by spectacular error.
4. Re:Wrong place to put a failsafe? by lewiscr · 2009-11-12 06:31 · Score: 1
  
  I ran some cluster software (Veritas) on Solaris and later Linux. The Solaris version was great. If a node lost sync, it panicked, rebooted, and attempted to rejoin. If it couldn't join the quorum, it didn't do anything. The Linux version had frequent single-node splits. If a node lost sync, it would dump a kernel stack trace to the serial console (taking several minutes), and then pick up where it left off.
  Technically, the Solaris cluster needed the same STONITH system that the Linux cluster needed. Practically, it never came up that I needed it. So rarely did Solaris have the problem that the expensive consultants we paid to install the Solaris cluster didn't recommend the extra billable hours. That's rare!
So it replicates the state to the new machine by Anonymous Coward · 2009-11-11 13:26 · Score: 0

So it replicates the state to the new machine and then the new machine executes the same instructions and crashes the same way....
1. Re:So it replicates the state to the new machine by JustinRLynn · 2009-11-11 16:56 · Score: 1
  
  This technology is meant to guard against physical layer problems (power, hardware failure) and not against software bugs.
Left VMware ESX for Xenserver 5.5 by Anonymous Coward · 2009-11-11 13:49 · Score: 0

I left VMware ESX 3.5 for XenServer 5.5 and I have never been happier.
I am running 4 DL585 servers with (so far) 42 production guests (Linux & win2k3) and have really great, more predictable performance.
If someone is running VMware and is worried about the cost or performance they need to consider Citrix XenServer.
That's at the point where in-house support works? by Futurepower(R) · 2009-11-11 14:38 · Score: 1

For $50,000 maybe you should develop in-house technical support, since it won't be just $50,000 in licenses, it will eventually be another $50,000 in support, perhaps.
I don't know how Dr. Breen is doing it. . . by MagusSlurpy · 2009-11-11 16:25 · Score: 1

but taking transparent high-availability to Xen can't bode well for Gordon or the Vortigaunts. . .

--
My sister opened a computer store in Hawaii. She sells C shells by the seashore.
Re:That's at the point where in-house support work by smash · 2009-11-11 16:30 · Score: 1

If you can get in house technical support available 24x7 that has the programmers of the product on hand to deal with it in a timely fashion, sure - go for it.

--
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Xen by Anonymous Coward · 2009-11-11 20:48 · Score: 0

Remus Project Brings Transparent High Availability To Xen
But does it solve those awful jumping puzzles?
Re:Who doesn't? by Lord+Bitman · 2009-11-11 23:58 · Score: 1

Remember when virtualization was only something for companies with highly specialized needs? And RAID? And cooled CPUs? And hard drives? and computers?
When a solution like this comes along, it generally starts out being used only by a few people (nerds and people who REALLY need it)
Then it filters down into the rest of the market as a nice solution to a common problem.
Then it becomes something which nobody can imagine living without.
Then it becomes unthinkable to design a system which doesn't have this ability.
Not true of every technology, surely, but "allow an arbitrary system to fail without stopping" is one of those "how did we ever live without it?" things. People will laugh at "three nines" as something absurd, like advertising that your web servers connect to the Internet or are powered by Electricity.

--
-- 'The' Lord and Master Bitman On High, Master Of All
Be honest - did anyone actually understand this? by rclandrum · 2009-11-12 02:51 · Score: 1

After reading this announcement, I tried to imagine the earliest possible year in which a technical reader would be able to comprehend what is being described. 2004? 1998? Last week? Never heard of either the Remus Project or the Xen hypervisor, and yet here I sit, merrily cranking out successful commercial software products, as I've been doing for the past 30 years. It took me a bit of browsing to understand what was being described.

I wonder how many readers completely understood this announcement at face value without doing a little digging. 5? 10? Everybody but me?

I think if you tried keeping up with all the technology/terms in our field, it would be a full time job.
Nothing new under the sun by Anonymous Coward · 2009-11-12 03:24 · Score: 0

This is nothing new, simply a modern implementation of a classic idea.
See "Hypervisor-based Fault Tolerance" by Bressoud and Schneider (SIGOPS 1995).
http://www.cs.cornell.edu/fbs/publications/HyperFTol.pdf
Every now and then, someone has to come along and pretend to do something new, either out of ignorance or the academic "publish or perish" pressure.
Just the other day, we were looking at yet another implementation of a transactional operating system (TXOS).
Re:Be honest - did anyone actually understand this by sharkman67 · 2009-11-12 14:31 · Score: 1

I think a larger portion of readers understood than you think. If you haven't heard of the Xen hypervisor or this type of virtualization then you probably have nothing to do with managing a server farm. If someone in that business has not heard of Xen then maybe they should be in another line of work.

I agree that keeping up with all tech would be a full time job. However, this is pretty mainstream stuff.

S.