First Look At VMware's vSphere "Cloud OS"

Posted by ScuttleMonkey on Friday May 22, 2009 @09:13AM from the trusting-someone-else-with-your-data dept.

snydeq writes "InfoWorld's Paul Venezia takes VMware's purported 'cloud OS,' vSphere 4, for a test drive. The bottom line: 'VMware vSphere 4.0 touches on almost every aspect of managing a virtual infrastructure, from ESX host provisioning to virtual network management to backup and recovery of virtual machines. Time will tell whether these features are as solid as they need to be in this release, but their presence is a substantial step forward for virtual environments.' Among the features Venezia finds particularly worthwhile is vSphere's Fault Tolerance: 'In a nutshell, this allows you to run the same VM in tandem across two hardware nodes, but with only one instance actually visible to the network. You can think of it as OS-agnostic clustering. Should a hardware failure take out the primary instance, the secondary instance will assume normal operations instantly, without requiring a VMotion.'"

21 of 86 comments

Instantly? by whereizben · 2009-05-22 09:19 · Score: 4, Insightful

With no delay at all? Somehow I don't believe it - there is always delay, but I wonder if it is "significant" enough to be noticed by an end-user.
1. Re:Instantly? by Inakizombie · 2009-05-22 09:23 · Score: 4, Funny
  
  Sure it's instant! There's just an item in the hardware requirements that states "Quantum processing required."
2. Re:Instantly? by BSAtHome · 2009-05-22 09:27 · Score: 2, Funny
  
  Quantum processing,... hm, that means it will first happen when you look. That is definitely not a good idea. It should just work and I should not be required to walk down to the cellar to find the damn hardware box I am using. Next I will be required to locate the processor before Heisenberg is satisfied.
3. Re:Instantly? by lightyear4 · 2009-05-22 10:12 · Score: 5, Informative
  
  Instantly? Of course not. But the time required is equivalent to vmotion/live migration in bog-standard virtualization. How long? "That depends." To throw numbers at you, 30-100ms -- variance largely dependent upon how quickly your network infrastructure can react to MACs changing locations, whether in-flight TCP streams are broken as a result, etc. To help switches cope, people usually send a gratuitous ARP to jumpstart the process.
4. Re:Instantly? by Anonymous Coward · 2009-05-22 11:03 · Score: 2, Insightful
  
  Basically the two nodes will be in constant communication with each other. All data sent to the primary node is also sent to the secondary node, and the primary and secondary have a constant link between each other. Both nodes will perform the same computations on the data, but only the primary will reply to the user.
  If the secondary node notices that the primary is not responding it will immediately send what the primary was supposed to have been sending back to the user.
5. Re:Instantly? by Mista2 · 2009-05-22 21:55 · Score: 2, Informative
  
  It keeps a running copy on the failover host, reading from the same storage as the active host. It's as if the server were about to complete a VMotion and had done everything but the final step. Outage time is a small hiccough, less than a second. Current running sessions just carry on. If it's uploading a file to someone, it just carries on. The outage is well within the tolerance of typical TCP sessions.
6. Re:Instantly? by Thumper_SVX · 2009-05-23 08:54 · Score: 2, Informative
  
  It's close enough. I played with this feature at VMworld last year, and when running SQL transactions along with a ping, we dropped one packet but the SQL transactions didn't miss a beat.
  It's impressive enough... the two systems are working in lockstep, such that even memory is duplicated between the two systems. It's an extension of the existing VMotion function in VMware today. However, bear in mind it has some limitations: only one CPU is possible at the moment, and you still have the overhead of really two VMs running at once instead of just one. So it's not a solution for ALL of your environment, just part of it.
  I'm sure the limitations will be eased over time as they tune the technology... but as a first attempt it IS awesome. Thing is, in my environment the stuff that is needed so critically that it can't take a hardware failure is usually >1 CPU, so this isn't a solution... but I guess if you have some relatively low-load but high-criticality servers then you could find a use for it (web servers seem like a good place to do this).
  And the answer is no, I don't think the end user will ever notice so long as your network infrastructure is good enough. Certainly, my users never notice a VMotion event.
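
The lockstep behavior described in the replies above (both nodes receive every input and compute the same result, but only the primary answers) can be sketched as a toy model. This is only an illustration under simplifying assumptions: the `Replica` and `FaultTolerantPair` names are hypothetical, the "VM" is a trivial deterministic state machine, and nothing here reflects how vSphere actually implements Fault Tolerance.

```python
class Replica:
    """Hypothetical stand-in for a VM: a deterministic state machine."""

    def __init__(self):
        self.total = 0

    def handle(self, request: int) -> int:
        # Same inputs in the same order always yield the same state.
        self.total += request
        return self.total


class FaultTolerantPair:
    """Toy primary/secondary pair; only one reply is ever visible."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.primary_alive = True

    def fail_primary(self):
        # Simulates pulling the plug on the primary host.
        self.primary_alive = False

    def submit(self, request: int) -> int:
        # The secondary processes every input, so its state never lags.
        shadow_reply = self.secondary.handle(request)
        if self.primary_alive:
            return self.primary.handle(request)
        # Primary is gone: the shadow's reply is sent instead.
        return shadow_reply


pair = FaultTolerantPair()
replies = [pair.submit(5), pair.submit(7)]
pair.fail_primary()
replies.append(pair.submit(1))
print(replies)  # [5, 12, 13]
```

The client sees an unbroken sequence of replies across the failure because the secondary already holds the same state. The cost is the one noted above: two copies of the workload run at all times.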
Re:FT by MartijnL · 2009-05-22 09:29 · Score: 5, Informative

FT only supports a single vCPU from my understanding... Not too many people running single CPU VMs, at least in my experience...
You should be running single-vCPU machines by default and only scale CPUs if absolutely necessary and if the app is fully SMP aware and functional.

--
http://virtualize.wordpress.com/
Xen did it first by lightyear4 · 2009-05-22 09:40 · Score: 2, Informative

Check out the Kemari and Remus projects, which allow precisely the same in Xen environments. In essence, it's a continual live migration (vmware people, think continual vmotion) that resumes virtual machine execution on the backup node if the origin node dies. Very cool tech. The demonstration involved pulling the plug on one of the nodes. For more information just search, there are code and papers and presentation slides galore.
1. Re:Xen did it first by ACMENEWSLLC · 2009-05-22 10:19 · Score: 2, Informative
  
  We have both VMotion and Xen.
  VMotion is very noticeable. Some things fail when it happens. Zenworks 6.5 is an example.
  With Xen, we set up a VNC mirror, i.e. the guest was VNC-viewing itself. We were moving a window around and then we moved the guest from Xen server 1 to 2 (we have iSCSI, BTW). There was a noticeable effect that lasted for less than a second, but then we were on Xen #2.
  It's nice to see VMware getting this feature right with vSphere.
2. Re:Xen did it first by qnetter · 2009-05-22 11:58 · Score: 2, Informative
  
  Marathon has had it working on Xen for quite a while.
Re:FT by Jaime2 · 2009-05-22 09:46 · Score: 4, Insightful

He didn't say to use a single vCPU for a non-SMP aware app, he said to use a single vCPU for all application loads. For SMP aware apps, adding another virtual CPU is a scaling option. If you have non-SMP aware apps, then you need to find another solution, like migrating to a host with faster cores.

It makes sense. If you have 32 workloads and 16 cores, don't add the overhead of making 64 vCPUs; 32 will use the host resources more efficiently as long as no single app needs the power of more than one core. If one does, give that guest a second vCPU only when it needs it.
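
The arithmetic behind the 32-workload example can be written out as a quick sketch. The helper name `vcpus_per_core` is made up for illustration, and the ratio is only a rough proxy for scheduling pressure, not a VMware metric.

```python
def vcpus_per_core(workloads: int, vcpus_per_guest: int, physical_cores: int) -> float:
    """Average number of vCPUs the host must multiplex onto each core."""
    return workloads * vcpus_per_guest / physical_cores


# Figures from the comment above: 32 workloads on a 16-core host.
single = vcpus_per_core(32, 1, 16)  # 32 vCPUs in total
dual = vcpus_per_core(32, 2, 16)    # 64 vCPUs in total

print(single, dual)  # 2.0 4.0
```

Doubling the vCPU count doubles what the hypervisor has to juggle per core without adding any physical capacity.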
Re:FT by asdf7890 · 2009-05-22 09:56 · Score: 5, Informative

If your apps are not SMP aware, WTF are you using? Multiple CPUs have been the standard for servers for at least a decade in x86 gear.
Yes, but it doesn't work in VMs the same way, at least not in VMware. On a loaded system you often find a single-vCPU VM will outperform one with more than one vCPU. In fact, if you can spread your app over multiple machines, you are generally better off running two single-vCPU VMs instead of one dual-vCPU one. This is true no matter how many physical CPUs/cores you have available.
Why is this? Because a single-vCPU VM can be scheduled whenever time slices are available on any physical CPU/core. (A good hypervisor will try not to bounce VMs between cores too often, as this reduces the potential gains from the core's L1 cache and, on architectures where L2 cache isn't shared, its L2 cache too.) A dual-vCPU VM, however, has to wait until the hypervisor can give it time slices on two CPUs at the same time. If this is the only VM actively doing anything on the physical machine (and the host OS is otherwise quiet too), this makes little difference aside from a small overhead on top of the normal hits for not running on bare metal. But as soon as that VM is competing with other processes for CPU resources, it can have a massive negative effect on scheduling latency.
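
A rough way to see the effect described above is a toy probability model. It assumes strict co-scheduling (a VM runs in a time slice only if as many cores are free as it has vCPUs) and assumes each core is independently free with the same probability. Both are simplifications made for illustration, not a description of the real ESX scheduler, and `chance_to_run` is a hypothetical helper.

```python
from math import comb


def chance_to_run(vcpus: int, cores: int, p_free: float) -> float:
    """Probability that at least `vcpus` of `cores` are free in one time slice.

    Assumes each core is free independently with probability `p_free`.
    """
    return sum(
        comb(cores, free) * p_free**free * (1 - p_free) ** (cores - free)
        for free in range(vcpus, cores + 1)
    )


# A busy 4-core host where any given core is free 25% of the time.
single = chance_to_run(1, 4, 0.25)
dual = chance_to_run(2, 4, 0.25)

print(f"{single:.3f} {dual:.3f}")  # 0.684 0.262
```

Under these assumptions the single-vCPU VM finds a usable slice well over twice as often as the dual-vCPU VM, which is the scheduling-latency penalty in miniature. On an idle host (`p_free` near 1) the two figures converge, matching the point that the penalty only bites under contention.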
How does it detect a 'failure'? by moosesocks · 2009-05-22 10:09 · Score: 4, Informative

How many hardware failures are actually characterized by a complete 100% loss of communication (as you'd get by pulling the plug)?
Don't CPU and Memory failures tend to make the computer somewhat unstable before completely bringing it down? How would vSphere handle (or even notice) that?
Even hard disk failures can take a small amount of time before the OS notices anything is awry (although you're an idiot if you care enough about redundancy to worry about this sort of thing, but don't have your disks in a RAID array).

--
-- If you try to fail and succeed, which have you done? - Uli's moose
1. Re:How does it detect a 'failure'? by Mista2 · 2009-05-22 22:07 · Score: 2, Interesting
  
  Several brand-new servers with VI3 installed 2 weeks ago, left to run to burn in, first production guests moved onto them on Friday, and on Saturday the CPU voltage regulator in one goes pop: dead server. It would have been nice to just have the Exchange server keep on rocking until Monday when we could replace the hardware, but no, now I've spent my Saturday morning going into work and fixing it.
  However, thanks to VMware, the High Availability service did restart the guests automatically, but I did have to repair a damaged mailstore. 8(
2. Re:How does it detect a 'failure'? by Thumper_SVX · 2009-05-23 09:01 · Score: 2, Interesting
  
  This is one reason I run Exchange 2007 with a clustered PHYSICAL mailbox server, and all the CAS and HT roles I run on virtual machines. I don't run database type apps on VMware for exactly these reasons... I am a big VMware supporter, but I also specify for our big apps that we use big SQL and Exchange clusters for HA... not VMware. Yes, it's a bit more expensive that way, but our Exchange cluster now hasn't been "down" in over a year, despite the fact that each node gets patched once a month and rebooted. My users love it :)
Don't ask, just look. by RulerOf · 2009-05-22 10:13 · Score: 4, Informative

One of the statistics measured by VirtualCenter is the lag you're asking about.

The first hit on Google Images should give you a good idea.

In practice, I don't know... I imagine that the secondary instance will still receive network traffic bound for the cluster, so it'd probably be perceived as a hiccup when the primary one goes down, which is good enough for most services.

--
Boot Windows, Linux, and ESX over the network for free.
It works as advertised by RobiOne · 2009-05-22 10:16 · Score: 5, Informative

Like everyone else pointed out, it's a VM in lockstep with a 'shadow' VM. This is not just 'continuous VMotion'.
If something happens to the VM, the shadow VM goes live instantly (you don't notice a thing if you're doing something on the VM).
Right after that, the system starts bringing up another shadow VM on another host to regain full FT protection.
This can be network intensive, depending on the VM load, and currently only works with 1 vCPU per VM. Think 1-2 FT VMs per ESX host + shadow VMs.
You'll need recent CPUs that support FT and a VMware HA/DRS cluster set up.
So if you've got it, use it wisely. It's very cool.

--
-- Robi
Re:FT by asdf7890 · 2009-05-22 10:24 · Score: 2, Informative

...but a dual vCPU VM will have to wait until the hypervisor can give it timeslices on two CPUs at the same time.
Then their hypervisor is broken. It should be possible for a dual vCPU machine to have vCPU1 and vCPU2 be two timeslices on the same real CPU if need be.
Which would kill any benefit of running SMP in the VM anyway, if it were possible.
My understanding, which may be out of date, is that this is not considered a good idea, because timing issues between threads on the two vCPUs could potentially cause race conditions if they are scheduled one after the other on the same core. And even if it is not that serious, the threads on the vCPU that gets the first slice of the real core could be paused waiting for locks to be released by threads the guest OS has lined up to run on the other vCPU. This is an explanation that I have seen given as to why VMware would not allow you to do virtual SMP on a single-core, single-CPU host machine (i.e. emulating SMP in the guest by giving two or more vCPUs alternating time on the only physical core).
Re:FT by OrangeTide · 2009-05-22 13:04 · Score: 2, Insightful

Try supporting synchronization of two virtual machines over a network/SAN when you also have to deal with SMP. Gets hard.

--
“Common sense is not so common.” — Voltaire
Re:FT by Jaime2 · 2009-05-22 14:44 · Score: 3, Insightful

Yeah, but you also get the overhead of two schedulers (the host and the guest) and two systems moving thread context from core to core, which is an expensive operation. Most VMware systems are pretty heavily oversubscribed in terms of cores. It's not uncommon to have 60 guests on a 16 core host. If all 60 guests have 4 virtual CPUs, you do get the advantage that one guest can expand out to consume about a quarter of the total host CPU power, but you also have the cost of guest SMP switching even though you are using an average of about 0.25 cores per guest.
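
The figures in that example work out as follows. This is just the arithmetic from the comment written as a small sketch; the variable names are illustrative.

```python
guests = 60
physical_cores = 16
vcpus_per_guest = 4

# Total virtual CPUs the host scheduler has to place.
total_vcpus = guests * vcpus_per_guest

# How many vCPUs compete for each physical core.
oversubscription = total_vcpus / physical_cores

# Largest share of the host one 4-vCPU guest can consume.
max_share_one_guest = vcpus_per_guest / physical_cores

# Physical core capacity actually available per guest on average.
avg_cores_per_guest = physical_cores / guests

print(total_vcpus, oversubscription, max_share_one_guest)  # 240 15.0 0.25
print(round(avg_cores_per_guest, 3))  # 0.267
```

So each guest carries four vCPUs' worth of SMP overhead while averaging roughly a quarter of one core, which is the mismatch the comment is pointing at.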