Slashdot Mirror


Extreme Memory Oversubscription For VMs

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMWare's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."

15 of 129 comments (clear)

  1. Kernel shared memory by Narkov · · Score: 5, Informative

    The Linux kernel uses something called kernel shared memory (KSM) to achieve this with it's virtualization technology. LWN has a great article on it:

    http://lwn.net/Articles/306704/

    1. Re:Kernel shared memory by amscanne · · Score: 5, Informative

      Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.

      Our mechanism for performing over-subscription is actually rather unique.

      Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.

      We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.

      We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.

      So the mechanism for accomplishing this works as follows:
      1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
      2. A memory server is setup to serve that snapshot up.
      3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
      4. The clones start pulling memory from the server on-demand.

      All of this takes a few seconds from start to finish, whether on the same machine or across the network.

      We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).

      The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)

    2. Re:Kernel shared memory by pookemon · · Score: 5, Funny

      The article summary is not /entirely/ accurate

      That's surprising. That never happens on /.

      --
      dnuof eruc rof aixelsid
    3. Re:Kernel shared memory by descubes · · Score: 5, Interesting

      Having written VM software myself (HP Integrity VM), I find this fascinating. Congratulations for a very interesting approach.

      That being said, I'm sort of curious how well that would work with any amount of I/O happening. If you have some DMA transfer in progress to one of the pages, you can't just snapshot the memory until the DMA completes, can you? Consider a disk transfer from a SAN. With high traffic, you may be talking about seconds, not milliseconds, no?

      --
      -- Did you try Tao3D? http://tao3d.sourceforge.net
  2. Re:So... is this different from Linux KVM w/ KMS? by amscanne · · Score: 5, Informative

    I have one possibility. The blog post alluded to this. Page sharing can be done *much* more efficiently on Linux due to the fact that the ELF loader does not need to rewrite large chunks of the binaries when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at it's "preferred base" virtual address. In Linux, these addresses are isolated through the use of trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the code for the application. In Linux, if you have ten VMs running the same binaries of Open Office, there will be a lot more sharing.

  3. Re:Leaky Fawcet by ls671 · · Score: 4, Informative

    Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again. I have tried several times to explain that to some slashdotters that bragged about not using any swap space anymore nowadays and that called me stupid for reserving a 2 gig swap partition or more on a 4 gig ram machine that sometimes runs for 2 years before rebooting.

    Oh well....

    --
    Everything I write is lies, read between the lines.
  4. This having been done before ... by cdrguru · · Score: 5, Informative

    One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevent.

    Well, here I would say it is extremely relevent to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

    The key bit of information here is that for interactive users running CMS a significant optimization was put in place of sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared and by design avoiding writes to the shared portion. If a page was written to a local copy was made and that page was no longer shared.

    This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems so sharing these pages meant for huge gains in memory utilization.

    It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful in that a bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would instead load a shared system where the shared (read only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.

    This would seem to be able to be done easily for Linux even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should also be possible as well. This would greatly reduce the memory footprint of adding another virtual machine also using a shared operating system. The memory then used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.

    1. Re:This having been done before ... by Anonymous Coward · · Score: 4, Informative

      Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent. That means, each time MS Office is loaded on a different machine, the function jump points are re-written according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.

      These guys seem to be doing something different...

    2. Re:This having been done before ... by pz · · Score: 4, Informative

      In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.

      From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarily-named Cambridge College in Cambridge, Massachusetts, but as the latter is an adult-educational center founded in 1971, the chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center that used to be in the same building as MIT's Project MAC. Project MAC (becoming later the MIT Lab for Computer Science) being where much of the structure of modern OSes was invented.

      Those were heady days for Tech Square. And, otherwise, the parent poster is right on.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
  5. Re:Leaky Fawcet by Mr+Z · · Score: 4, Interesting

    Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.

    I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.

  6. Re:Leaky Fawcet by GooberToo · · Score: 5, Informative

    Unfortunately you're not alone in doing this. Its a deprecated practice that used to make sense, but hasn't made sense to do so in a very long time.

    The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/wirte) 16GB, 4KB at a time? In the event you have legitimate applications which use large amounts of memory run away with a bug, it can effectively bring your entire system to a halt as it will take a long, long time before it runs out of memory.

    Excluding Window boxes (they have their own unique paging, memory/file mapping, and backing store systems), generally more than 1/4-1/2 memory is a waste these days. As someone else pointed out, sure you can buy more uptime from leaking applications but frankly, that's hardly realistic in the least. The chances of not requiring a kernel update over the span of a couple years is just silly unless you care more for uptime than you do for security and/or features and/or performance.

    The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come closing to needing 1:1, let alone 2:1 page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.

    You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact its on a file system won't matter.

    Three decades ago it made sense. These days, its just silly and begging for your system to one day grind to a halt.

  7. Re:Leaky Fawcet by sjames · · Score: 4, Interesting

    I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for it's useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.

    In other cases, some of the drivers may need an update, but if they're modules and not for something you can't take offline (such as the disk the root filesystem is on), it's no problem to update.

    Note that I generally spec RAM so that zero swap is actually required if nothing leaks and no exceptional condition arises.

    When disks come in 2TB sizes and server boards have 6 SAS ports on them, why should I sweat 8 GB?

    Let's face it, if the swap space thrashes (yes, I know paging and swapping are distinct but it's still called swap space for hysterical raisins) it won't much matter if it is 1:1 or .5:1, performance will tank. However, it it's just leaked pages, it can be useful.

    For other situations, it makes even more sense. For example, in HPC, if you have a long running job and then a short but high priority job comes up, you can SIGSTOP the long job and let it page out. Then when the short run is over, SIGCONT it again. Yes, you can add a file at that point, but it's nice if it's already there, especially if a scheduler might make the decision to stop a process on demand. Of course, on other clusters (depending on requirements) I've configured with no swap at all.

    And since Linux can do crash dumps and can freeze into swap, it makes sense on laptops and desktops as well.

    Finally, it's useful for cases where you have RAID for availability, but don't need SO much availability that a reboot for a disk failure is a problem. In that case, best preformance suggests 2 equal sized swaps on 2 drives. If one fails, you might need a reboot, but won't have to wait on a restore from backup and you'll still have enough swap.

    Pick your poison, either way there exists a failure case.

    And yes, in the old days I went with 2:1, but don't do that anymore because it really is excessive these days.

  8. Re:So... is this different from Linux KVM w/ KMS? by amscanne · · Score: 3, Insightful

    Yes, rebasing is reduced by careful selection of preferred base addresses (particularly by Microsoft for their DLLs). Yes, if DLLs are not rebased then they are shared -- I did not claim otherwise. My point along these lines is that rebasing *does* occur surprisingly often, and can hurt sharing. The actual level of sharing you achieve obviously depends almost *entirely* your applications, workload, data, etc.

    By the way, as far as I know versions of Windows newer than Vista enable address-space randomization by default for security purposes. Since the starting virtual address of each DLL is randomized, preferred bases can't be respected. I don't know what impact this has on Windows memory usage post-Vista, but it seems like one can't rely on carefully curated base addresses.

    I'm not saying one approach is better than the other (Linux, Windows, whatever) -- I'm only positing a possibility for why one might see better improvement with KSM. One might just as easily see better over-subscription with Windows simply due to the fact that it zeroes out physical pages when they are released, as far as I know. Those zero pages can all be mapped to the same machine frame transparently (without the need for balloooning).

  9. Re:Leaky Fawcet by akanouras · · Score: 3, Interesting

    Excuse my nitpicking, your post sparked some new questions for me:

    The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/wirte) 16GB, 4KB at a time?

    Are you sure that's it's only reading/writing 4KB at a time? It seems pretty braindead to me.

    The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was.

    Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.

    You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved.

    Just mentioning here that Swapspace (Debian package) takes care of that, with configurable thresholds.

    You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact its on a file system won't matter.

    Quoting Andrew Morton:
    "[On 2.6 kernels the difference is] None at all. The kernel generates a map of swap offset -> disk blocks at swapon time and from then on uses that map to perform swap I/O directly against the underlying disk queue, bypassing all caching, metadata and filesystem code."

  10. Re:Leaky Fawcet by vlm · · Score: 4, Insightful

    When disks come in 2TB sizes .... why should I sweat 8 GB?

    You are confusing capacity problems with thruput problems. Sweat how poor performance is when 8 gigs gets thrashing.

    The real problem is the ratio of memory access speed vs drive access speed has gotten dramatically worse over the past decades.

    Look at two scenarios with the same memory leak:

    With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.

    With no/little swap, the OOM killer will reap your processes, which will be restarted automatically by your init scripts or equivalent. The users will notice the maybe, just maybe, they had to click refresh twice on a page. Or maybe it seemed slow for a moment before it was normal speed. They'll probably just blame the network guys.

    End result, with swap means long outage that needs manual fix, no swap means no outage at all and automatic fix.

    In the 80s, yes you sized your swap based on disk space. In the 10s (heck, in the 00s) you size your swap based on how long you're willing to wait.

    It takes a very atypical workload and very atypical hardware for users to tolerate the thrashing of gigs of swap...

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger