Extreme Memory Oversubscription For VMs

← Back to Stories (view on slashdot.org)

Extreme Memory Oversubscription For VMs

Posted by Soulskill on Tuesday August 10, 2010 @03:55PM from the eXtreme-virtualization dept.

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMWare's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."

8 of 129 comments (clear)

Min score:

Reason:

Sort:

Kernel shared memory by Narkov · 2010-08-10 16:08 · Score: 5, Informative

The Linux kernel uses something called kernel shared memory (KSM) to achieve this with it's virtualization technology. LWN has a great article on it:
http://lwn.net/Articles/306704/
1. Re:Kernel shared memory by amscanne · 2010-08-10 17:41 · Score: 5, Informative
  
  Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.
  Our mechanism for performing over-subscription is actually rather unique.
  Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.
  We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.
  We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.
  So the mechanism for accomplishing this works as follows:
  1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
  2. A memory server is setup to serve that snapshot up.
  3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
  4. The clones start pulling memory from the server on-demand.
  All of this takes a few seconds from start to finish, whether on the same machine or across the network.
  We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).
  The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)
Re:So... is this different from Linux KVM w/ KMS? by amscanne · 2010-08-10 16:32 · Score: 5, Informative

I have one possibility. The blog post alluded to this. Page sharing can be done *much* more efficiently on Linux due to the fact that the ELF loader does not need to rewrite large chunks of the binaries when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at it's "preferred base" virtual address. In Linux, these addresses are isolated through the use of trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the code for the application. In Linux, if you have ten VMs running the same binaries of Open Office, there will be a lot more sharing.
Re:Leaky Fawcet by ls671 · 2010-08-10 16:35 · Score: 4, Informative

Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again. I have tried several times to explain that to some slashdotters that bragged about not using any swap space anymore nowadays and that called me stupid for reserving a 2 gig swap partition or more on a 4 gig ram machine that sometimes runs for 2 years before rebooting.
Oh well....

--
Everything I write is lies, read between the lines.
This having been done before ... by cdrguru · 2010-08-10 16:43 · Score: 5, Informative

One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevent.
Well, here I would say it is extremely relevent to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.
The key bit of information here is that for interactive users running CMS a significant optimization was put in place of sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared and by design avoiding writes to the shared portion. If a page was written to a local copy was made and that page was no longer shared.
This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems so sharing these pages meant for huge gains in memory utilization.
It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful in that a bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would instead load a shared system where the shared (read only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.
This would seem to be able to be done easily for Linux even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should also be possible as well. This would greatly reduce the memory footprint of adding another virtual machine also using a shared operating system. The memory then used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.
1. Re:This having been done before ... by Anonymous Coward · 2010-08-10 16:52 · Score: 4, Informative
  
  Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent. That means, each time MS Office is loaded on a different machine, the function jump points are re-written according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.
  These guys seem to be doing something different...
2. Re:This having been done before ... by pz · 2010-08-10 17:16 · Score: 4, Informative
  
  In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.
  From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarily-named Cambridge College in Cambridge, Massachusetts, but as the latter is an adult-educational center founded in 1971, the chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center that used to be in the same building as MIT's Project MAC. Project MAC (becoming later the MIT Lab for Computer Science) being where much of the structure of modern OSes was invented.
  Those were heady days for Tech Square. And, otherwise, the parent poster is right on.
  
  --
  
  Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Re:Leaky Fawcet by GooberToo · 2010-08-10 17:41 · Score: 5, Informative

Unfortunately you're not alone in doing this. Its a deprecated practice that used to make sense, but hasn't made sense to do so in a very long time.
The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/wirte) 16GB, 4KB at a time? In the event you have legitimate applications which use large amounts of memory run away with a bug, it can effectively bring your entire system to a halt as it will take a long, long time before it runs out of memory.
Excluding Window boxes (they have their own unique paging, memory/file mapping, and backing store systems), generally more than 1/4-1/2 memory is a waste these days. As someone else pointed out, sure you can buy more uptime from leaking applications but frankly, that's hardly realistic in the least. The chances of not requiring a kernel update over the span of a couple years is just silly unless you care more for uptime than you do for security and/or features and/or performance.
The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come closing to needing 1:1, let alone 2:1 page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.
You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact its on a file system won't matter.
Three decades ago it made sense. These days, its just silly and begging for your system to one day grind to a halt.