Extreme Memory Oversubscription For VMs
Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMware's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."
Given how many programs leak memory, it's amazing that companies get away with oversubscribing memory without running into big issues. And desktop programs are usually the worst of the bunch.
The Linux kernel uses something called kernel shared memory (KSM) to achieve this with its virtualization technology. LWN has a great article on it:
http://lwn.net/Articles/306704/
It even reaches the same over-subscription ratio, around 300%, but without the overhead this article admits to, which reduces its actual over-subscription ratio to just over 200%:
http://lwn.net/Articles/306704/
Specifically, this link/LKML post: 52 1GB Windows VMs in 16GB of total physical RAM installed:
http://lwn.net/Articles/306713/
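For anyone who wants to check the numbers, here is a quick sketch of the nominal ratios behind the figures quoted above. The function name is made up for illustration, and the calculation only compares configured guest memory to physical RAM; it ignores hypervisor overhead and says nothing about how much of that memory the guests actually touch.

```python
def overcommit_ratio(vm_count, gb_per_vm, physical_gb):
    """Configured guest memory divided by physical host memory."""
    return (vm_count * gb_per_vm) / physical_gb


# Figures quoted in this thread.
ksm_ratio = overcommit_ratio(52, 1, 16)   # LKML post: 52 x 1GB Windows VMs on a 16GB host
demo_ratio = overcommit_ratio(16, 1, 5)   # GridCentric demo: 16 x 1GB VMs on a 5GB host

print(f"KSM example:      {ksm_ratio:.2f}:1")
print(f"GridCentric demo: {demo_ratio:.2f}:1")
```

Both land a little above 3:1, which matches the "around 300%" figure.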
This blog post is just a summary of 3 existing techniques: Paging, Ballooning, and Content-Based Sharing. It does not describe any new techniques, or give any new insights.
It's a solid summary of these techniques, but nothing more.
OpenVZ has had this for years now, which is one of the reasons it has gained popularity in the hosting world.
Nothing new. I ran 6 W2K3 servers on a Linux box running VMware Server with 4GB of RAM, and allocated 1GB to each VM.
I noticed free memory on the system was at 2GB and dropping quickly when they moved focus away from the console session (even though all of the VMs had the exact same app set running). This appears to be absolutely nothing new or amazing... in fact, it reads like an ad for GridCentric.
Oversubscription of memory for VMs has been around for decades - just not for the Intel platform. There are other older, more mature platforms for VM support...
One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevant.
Well, here I would say it is extremely relevant to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at IBM's Cambridge Scientific Center (Cambridge, MA, USA, not the UK) and called CP/67.
The key bit of information here is that for interactive users running CMS, a significant optimization was put in place: sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared, and by design avoiding writes to the shared portion. If a page was written to, a local copy was made and that page was no longer shared.
This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems, so sharing these pages meant huge gains in memory utilization.
It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful, in that the bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only, so sharing them across multiple virtual machines would make a lot of sense. Instead of booting an OS "natively", you would load a shared system where the shared (read-only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.
This looks easy to do for Linux, even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should be possible as well. This would greatly reduce the memory footprint of adding another virtual machine that uses the shared operating system: the memory used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages, but the working set of a virtual machine is likely to include a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.
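To make the shared/non-shared idea concrete, here is a toy sketch of page sharing with copy-on-write. It is not how VM/370 or any real hypervisor implements it: real systems do this with page tables and write-protection faults, whereas this model just keys page contents by hash. The `Host` and `Guest` classes and the 100-page image are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096


class Host:
    """Toy host that stores each distinct page content only once."""

    def __init__(self):
        self.frames = {}   # content hash -> page bytes
        self.refs = {}     # content hash -> number of guest pages mapped to it

    def map_page(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key not in self.frames:
            self.frames[key] = data          # first copy: allocate a frame
        self.refs[key] = self.refs.get(key, 0) + 1
        return key

    def unmap_page(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:              # last user gone: free the frame
            del self.refs[key]
            del self.frames[key]

    def frames_in_use(self):
        return len(self.frames)


class Guest:
    """Toy guest whose pages are references into the host's frame store."""

    def __init__(self, host, image):
        self.host = host
        self.pages = [host.map_page(page) for page in image]

    def read(self, n):
        return self.host.frames[self.pages[n]]

    def write(self, n, data):
        # Copy-on-write: the writer gets its own frame, other guests keep theirs.
        old = self.pages[n]
        self.pages[n] = self.host.map_page(data)
        self.host.unmap_page(old)


# 16 guests booted from the same 100-page "OS image".
image = [bytes([i]) * PAGE_SIZE for i in range(100)]
host = Host()
guests = [Guest(host, image) for _ in range(16)]
print(host.frames_in_use())    # frames backing 1600 guest pages

guests[0].write(0, b"\xff" * PAGE_SIZE)
print(host.frames_in_use())    # one extra private frame for guest 0
```

In this model the 16 guests cost 100 frames rather than 1600 until they start writing, and each write costs only the writer one private frame.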
Yes, we pay attention...
The concept is in Unix, including Linux, and probably in Windows - COW (copy-on-write) pages...
fork() uses COW, vfork() shares the entire address space (but suspends the parent).
$ man vfork
[snip]
Historic Description
Under Linux, fork(2) is implemented using copy-on-write pages, so the
only penalty incurred by fork(2) is the time and memory required to
duplicate the parent's page tables, and to create a unique task struc-
ture for the child. However, in the bad old days a fork(2) would
require making a complete copy of the caller's data space, often need-
lessly, since usually immediately afterwards an exec(3) is done. Thus,
for greater efficiency, BSD introduced the vfork() system call, which
did not fully copy the address space of the parent process, but bor-
rowed the parent's memory and thread of control until a call to
execve(2) or an exit occurred. The parent process was suspended while
the child was using its resources. The use of vfork() was tricky: for
example, not modifying data in the parent process depended on knowing
which variables are held in a register.
[snip]
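A minimal sketch of what that looks like from user space, assuming a Unix-like system where os.fork() is available. The function name is made up. Note that it only shows the visible semantics (the child's write does not reach the parent); the copy-on-write of the underlying pages happens inside the kernel, and CPython's own bookkeeping touches pages too, so this is not a measurement of page sharing.

```python
import os


def fork_demo():
    """Fork, let the child modify its copy of a buffer, and return both views."""
    data = bytearray(b"parent")
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: write to the buffer, report what we see, and exit immediately.
        os.close(read_fd)
        data[:] = b"child!"
        os.write(write_fd, bytes(data))
        os.close(write_fd)
        os._exit(0)

    # Parent: collect the child's view, then check our own copy.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view


parent_view, child_view = fork_demo()
print(parent_view, child_view)
```

The parent still sees its original buffer after the child has overwritten its own copy.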
This guy did some crazy stuff with it - 64GB or so of fake memory on an 8GB box http://vinf.net/2010/02/25/8-node-esxi-cluster-running-60-virtual-machines-all-running-from-a-single-500gbp-physical-server/
This may seem obvious, but in reading some of the trade press and the general buzz, it seems that it isn't obvious to everyone:
Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as when each VM belongs to a different customer, but in an enterprise environment it suggests that you should be running more than one service per instance and have fewer instances.
I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).
When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers with a higher offering of hardware.
This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.
Our internet is oversubscribed, our processors are getting there, and now RAM?
When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?
I recently started poking around the lguest hypervisor. From my limited reading, I believe 2 of the 3 memory oversubscription choices mentioned in the article are present in Linux. Existing Linux-based open source hypervisors such as KVM use a paging/swap mechanism (i.e., for x86, the paravirt mechanism). Ballooning is possible using virtio_balloon. Kernel shared memory in Linux allows dynamic sharing of memory pages between processes; this probably doesn't apply to virtualization.
I couldn't find any CPU over-subscription thing in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.
On another note, established players like IBM tend to use Type-1 hypervisors for enterprise servers. It would be interesting to see how this company fares against them.
Why not 24 or 32? Seriously, why should 5 GB real memory be capable of only supporting 16 VMs? What's the limit?
VMware has allowed RAM oversubscription for years. Indeed, it's one of the killer features of that platform over the alternatives. Who out there using VMware in non-trivial environments _isn't_ oversubscribing RAM?
Really, this amount of overcommit is nothing. It's been done for decades.
I manage a little over 200 virtual servers, spread across 7 z/VM hypervisors, and 2 mainframes. They are currently running with overcommit ratios of 4.59:1, 3.87:1, 3.56:1, 2.05:1, 1.19:1, 1.19:1, and .9:1. And this is a relatively small shop and somewhat low overcommits for the environment.
That's one of the benefits of virtualization...and yes, I know that if all guests decided to allocate all of their memory at once, we'd drive the hypervisor paging subsystem up the wall. Actually, this did happen a few months ago, and while everything was dog slow for a while, z/VM happily paged along without issue.
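A back-of-the-envelope sketch of that worst case, for illustration only. The 100GB host size and the 20% figure are hypothetical (no memory sizes are given above); only the 4.59:1 ratio comes from this comment. The model ignores the hypervisor's own memory and any page sharing between guests.

```python
def paging_shortfall_gb(physical_gb, overcommit, touched_fraction):
    """GB that cannot fit in RAM when guests touch the given fraction
    of their configured memory; 0.0 means everything fits."""
    configured_gb = physical_gb * overcommit
    in_use_gb = configured_gb * touched_fraction
    return max(0.0, in_use_gb - physical_gb)


# Hypothetical 100GB host at the 4.59:1 overcommit quoted above.
quiet = paging_shortfall_gb(100, 4.59, 0.2)    # guests touch 20% of their memory
storm = paging_shortfall_gb(100, 4.59, 1.0)    # every guest touches everything

print(f"quiet day: {quiet:.1f} GB paged out")
print(f"all at once: {storm:.1f} GB paged out")
```

As long as the guests collectively touch less than the physical memory, nothing has to be paged; once they all want everything, most of it ends up in the paging subsystem.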
Newsflash from 2 years in the future: "Massive performance improvements in the VM sector by combining VMs into a single... let's call it... kernel. We present: chroot()!"
What's the fucking point of VMs if you introduce all the security problems over and over again? Might as well leave out the superfluous garbage complexity and improve performance.
2 years so updates are way behind? Not all of them are no reboot ones.
..for the day when my ISP wants to sell me a more expensive class of memory because they oversold their physical memory so much that they can't support users that actually use all of what they were sold.
Why is this even an issue? Modern OSes are designed to use the maximum amount of RAM available. Unused RAM is wasted RAM, on both M$ and penguin-boi, so apparently there's no need to oversubscribe since we apparently have too much already.
True, ESX is not free, but ESXi (http://www.vmware.com/products/vsphere-hypervisor/index.html) certainly is. If you have just one box and thus do not need stuff like HA or vMotion, ESXi works just fine.
"Extraordinary claims require extraordinary evidence" - Carl Sagan
I've yet to encounter a sane system that does swapping well. Windows especially is notorious for using the swapfile even though memory is not filled yet. Why do I want disk writes/reads before my memory is used up? I don't, and since I'm well below my 4GB of memory, I erase the swapfile instead and the system runs that much more smoothly.
Even desktop Linux (both Gnome and KDE) eats up memory and starts swapping needlessly in my experience, but the bloat is a bigger fault in Linux than in Windows desktop, and perhaps not the paging mechanism itself.
Whatever works. I just can't stand a slow system just because the system "decides on its own" that it needs to page, when I can clearly see it doesn't. Often paged memory comes back to haunt me in the form of thrashing or other slowdowns. Paging is often a fine "optimization" for a server, but is a horrible solution for a desktop, which should be more responsive.