Extreme Memory Oversubscription For VMs

Posted by Soulskill on Tuesday August 10, 2010 @03:55PM from the eXtreme-virtualization dept.

Laxitive writes "Virtualization systems currently have a pretty easy time oversubscribing CPUs (running lots of VMs on a few CPUs), but have had a very hard time oversubscribing memory. GridCentric, a virtualization startup, just posted on their blog a video demoing the creation of 16 one-gigabyte desktop VMs (running X) on a computer with just 5 gigs of RAM. The blog post includes a good explanation of how this is accomplished, along with a description of how it's different from the major approaches being used today (memory ballooning, VMWare's page sharing, etc.). Their method is based on a combination of lightweight VM cloning (sort of like fork() for VMs) and on-demand paging. Seems like the 'other half' of resource oversubscription for VMs might finally be here."

129 comments

Leaky Fawcet by suso · 2010-08-10 16:06 · Score: 1

Given how many programs leak memory, it's amazing that companies get away with oversubscribing memory without running into big issues. And desktop programs are usually the worst of the bunch.
1. Re:Leaky Fawcet by warewolfsmith · 2010-08-10 16:24 · Score: 1
  
  Leaky memory, you need NuIO memory stop leak, just pour it in and off you go.....NuIO a Microsoft Certified Product
2. Re:Leaky Fawcet by ls671 · 2010-08-10 16:35 · Score: 4, Informative
  
  Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again. I have tried several times to explain that to some slashdotters that bragged about not using any swap space anymore nowadays and that called me stupid for reserving a 2 gig swap partition or more on a 4 gig ram machine that sometimes runs for 2 years before rebooting.
  Oh well....
  
  --
  Everything I write is lies, read between the lines.
3. Re:Leaky Fawcet by druke · 2010-08-10 16:37 · Score: 1
  
  You make it sound like this is some sort of conspiracy. Generally when you'd want to do something like this you would be doing VM servers anyways. They didn't do much (anything, actually) in the way of 'desktop programs' beyond X...
  Why does this matter anyways, it's not the vm dev's job to fix memory leaks in openoffice. They have to go forward assuming everything is working correctly. Also, if they're all sharing the memory leak, it'd be optimized anyways :p
4. Re:Leaky Fawcet by Mr+Z · 2010-08-10 16:48 · Score: 4, Interesting
  
  Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.
  I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.
  
  --
  Program Intellivision!
5. Re:Leaky Fawcet by sjames · 2010-08-10 16:54 · Score: 1, Informative
  
  Personally, I like to make swap equal to the size of RAM for exactly that reason. It's not like a few Gig on a HD is a lot anymore.
6. Re:Leaky Fawcet by buchner.johannes · 2010-08-10 16:58 · Score: 1
  
  Sometimes that doesn't work out so well. If you have a fragmented heap with gaps between the leaked items that keep getting reused, it can lead to a lot of strange thrashing, since it effectively amplifies your working set size.
  I think that may be one of the things that was happening to older Firefoxes (2.x when viewing gmail, in particular)... not only did it leak memory, it leaked memory in a way such that the leak couldn't just stay in swap.
  Wouldn't that be a good exercise for kernels? Recording the usage patterns of memory subsections, defragmenting them into segments by usage frequency. If that is not possible at runtime, store and apply at the next run.
  Or maybe clustering chunks by the code piece that allocated it would already help. That said, I don't know what malloc's current wisdom is.
  
  --
  NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
7. Re:Leaky Fawcet by Osso · 2010-08-10 17:04 · Score: 1
  
  Often you don't want to swap, and want memory allocations to fail. Sometimes everything will be slow and you can barely access your server, instead of being able to check on what's happening.
8. Re:Leaky Fawcet by Mr+Z · 2010-08-10 17:28 · Score: 1
  
  The heap is entirely in userspace, and the kernel is powerless to do anything about it.
  Imagine some fun, idiotic code that allocated, say, 1 million 2048 byte records sequentially (2GB total), and then only freed the even-numbered records. (I'm oversimplifying a bit, but the principle holds.) Now you've leaked 1GB of memory, but it's spread over 2GB of space.
  The kernel only works in 4K chunks when paging. Each 4K page, though, has 2K of leaked data and 2K of free space. For all the subsequent non-leak allocations that fit in these holes, you effectively "amplify" the footprint due to the leaked data that shares the same 4K page. If you try to use 1GB of space for some actual work within that same process, the working set the kernel's VM will see will look more like 2GB if all the allocations fill the holes.
  Make sense?
  
  --
  Program Intellivision!
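
An illustrative check of the arithmetic in Mr Z's example above, not code from the thread. It is a minimal sketch that assumes the records are packed back to back starting at a page-aligned address, with no allocator headers. The names `resident_pages`, `leaked_bytes`, and `pinned_bytes` are made up for the illustration.

```python
PAGE_SIZE = 4096         # bytes per page, as in the comment
RECORD_SIZE = 2048       # bytes per record, as in the comment
NUM_RECORDS = 1_000_000  # records allocated sequentially


def resident_pages(record_indices, record_size=RECORD_SIZE, page_size=PAGE_SIZE):
    """Count the distinct pages touched by the given record indices.

    Assumes record i occupies bytes [i * record_size, (i + 1) * record_size).
    """
    pages = set()
    for i in record_indices:
        first = (i * record_size) // page_size
        last = ((i + 1) * record_size - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages)


# The even-numbered records were freed, so the odd-numbered ones are the leak.
leaked = range(1, NUM_RECORDS, 2)

leaked_bytes = len(leaked) * RECORD_SIZE
pinned_bytes = resident_pages(leaked) * PAGE_SIZE

print(f"leaked data:          {leaked_bytes:>13,} bytes")
print(f"pages holding leaks:  {pinned_bytes:>13,} bytes")
print(f"amplification factor: {pinned_bytes / leaked_bytes:.1f}x")
```

Under those assumptions every leaked record sits alone in its own 4K page, so any new allocation that lands in a hole drags a leaked half-page into the working set along with it.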
9. Re:Leaky Fawcet by ensignyu · 2010-08-10 17:35 · Score: 1
  
  The kernel can only defragment pages, which are 4KB on most Linux systems. If you have a page with 4080 bytes of leaked memory and 16 bytes of memory that you actually use, accessing that memory will swap in the entire page.
  You can't move stuff around within a page because the address would change (moving pages is OK because all memory accesses go through the TLB), unless you have a way of fixing up all the pointers to point at the new location. That's generally only possible in a type-safe language like Java, so the memory manager can guarantee that it's modifying a pointer and not some arbitrary data. The Java virtual machine can move objects around as part of the garbage collection process, defragmenting the heap in the process.
  Clustering by allocation site might work for some applications, but if you allocate, say, a string, it's difficult to tell if the string is going to be freed now, or later, or never, much less whether it'll be freed at the same time as any other objects. It might depend on the input data.
10. Re:Leaky Fawcet by GooberToo · 2010-08-10 17:41 · Score: 5, Informative
  
  Unfortunately you're not alone in doing this. It's a deprecated practice that used to make sense, but hasn't made sense in a very long time.
  The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time? In the event that a legitimate application which uses large amounts of memory runs away with a bug, it can effectively bring your entire system to a halt, as it will take a long, long time before it runs out of memory.
  Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), generally more than 1/4-1/2 memory is a waste these days. As someone else pointed out, sure you can buy more uptime from leaking applications but frankly, that's hardly realistic in the least. Expecting not to require a kernel update over the span of a couple of years is just silly unless you care more for uptime than you do for security and/or features and/or performance.
  The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was. These days, if you ever come close to needing a 1:1, let alone 2:1, page file/partition, you're not even close to properly spec'ing your required memory. In other words, with few exceptions, if you have a page file/partition anywhere near that size, you didn't understand how the machine was to be used in the first place.
  You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved. You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact it's on a file system won't matter.
  Three decades ago it made sense. These days, it's just silly and begging for your system to one day grind to a halt.
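
The three commands in GooberToo's comment above can be collected in a small helper. This is only a sketch: it builds and prints the command lines rather than running them (running them needs root). `swapfile_commands` and its `size_mib` parameter are hypothetical names, the 2048 MiB size is an arbitrary example standing in for the elided `count=xxxx`, and the `chmod 600` step is an addition that is not in the comment.

```python
import shlex


def swapfile_commands(path, size_mib, priority=0):
    """Return the commands that create and enable a swap file (nothing is executed)."""
    if size_mib <= 0:
        raise ValueError("size_mib must be positive")
    count = size_mib * 1024  # number of 1024-byte dd blocks, matching bs=1024
    return [
        ["dd", "if=/dev/zero", f"of={path}", "bs=1024", f"count={count}"],
        ["chmod", "600", path],  # assumption: not in the comment, keeps the file private
        ["mkswap", path],
        ["swapon", "-p", str(priority), path],  # low priority, as suggested
    ]


# Print the command lines for review instead of running them.
for cmd in swapfile_commands("/pagefile", 2048):
    print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to hand them to `subprocess.run` later, once the size and path have been checked.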
11. Re:Leaky Fawcet by GooberToo · 2010-08-10 18:02 · Score: 0, Redundant
  
  Why is a factually accurate, topical, informative, and polite message marked troll?
12. Re:Leaky Fawcet by GooberToo · 2010-08-10 18:08 · Score: 1
  
  Sorry. My other post, which provides lots of good, accurate information, was troll-moderated. It's been forever since I last saw meta-moderation actually fix a troll-moderated post, so I'm hoping others will fix it. Not to mention, it's information many, many users should learn.
  Hopefully you and others will read the post and realize why it's a bad idea, despite the fact it's a popular notion.
13. Re:Leaky Fawcet by akanouras · 2010-08-10 18:23 · Score: 1
  
  1. Programs can't use the swapped memory directly. The kernel only swaps parts of memory that haven't been accessed in a while.
  2. By swapping out unused (even because of leaks) memory, the kernel has more memory to use for disk caching.
  All this has nothing to do with whether your system will grind to a halt today instead of one month later.
  And to answer your question, the mod(s) apparently thought this was common knowledge and not worth responding to.
14. Re:Leaky Fawcet by GooberToo · 2010-08-10 18:30 · Score: 1
  
  Yes, everything you said is known and understood, but hardly topical.
  By paging leaked memory, if the leak is indeed bad enough to justify an abuse of the VM to offset it, chances are you'll be suffering from fragmentation and be on the negative side of the performance curve at some point. It's just silly to believe you'll be running a badly leaking application over the span of years and desire to hide the bug rather than fix it. There is just nothing about that strategy which makes sense.
  So to bring this full circle, the troll-moderator was completely wrong. And while your post is well intentioned, it's naive at best. More likely the moderator is completely clueless as to the subject matter or the moderation was done out of spite.
15. Re:Leaky Fawcet by sjames · 2010-08-10 18:33 · Score: 4, Interesting
  
  I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.
  In other cases, some of the drivers may need an update, but if they're modules and not for something you can't take offline (such as the disk the root filesystem is on), it's no problem to update.
  Note that I generally spec RAM so that zero swap is actually required if nothing leaks and no exceptional condition arises.
  When disks come in 2TB sizes and server boards have 6 SAS ports on them, why should I sweat 8 GB?
  Let's face it, if the swap space thrashes (yes, I know paging and swapping are distinct but it's still called swap space for hysterical raisins) it won't much matter if it is 1:1 or .5:1, performance will tank. However, if it's just leaked pages, it can be useful.
  For other situations, it makes even more sense. For example, in HPC, if you have a long running job and then a short but high priority job comes up, you can SIGSTOP the long job and let it page out. Then when the short run is over, SIGCONT it again. Yes, you can add a file at that point, but it's nice if it's already there, especially if a scheduler might make the decision to stop a process on demand. Of course, on other clusters (depending on requirements) I've configured with no swap at all.
  And since Linux can do crash dumps and can freeze into swap, it makes sense on laptops and desktops as well.
  Finally, it's useful for cases where you have RAID for availability, but don't need SO much availability that a reboot for a disk failure is a problem. In that case, best performance suggests 2 equal sized swaps on 2 drives. If one fails, you might need a reboot, but won't have to wait on a restore from backup and you'll still have enough swap.
  Pick your poison, either way there exists a failure case.
  And yes, in the old days I went with 2:1, but don't do that anymore because it really is excessive these days.
16. Re:Leaky Fawcet by GooberToo · 2010-08-10 18:39 · Score: 2, Insightful
  
  I often see uptimes measured in years. It's not at all unusual for a server to need no driver updates for its useful lifetime if you spec the hardware based on stable drivers being available. The software needs updates in that time, but not the drivers.
  Yes, we've all seen that. It makes for nice bragging rights. But realistically, to presume that one might have a badly leaking application, which can not ever be restarted, and that memory/paging fragmentation is not a consequence, to justify a poor practice is just that, a poor practice. And of course, that completely ignores the fact that there are likely nasty kernel bugs going unfixed. So it means you're advertising a poor practice, which will likely never be required, as an excuse to maintain uptime at the expense of security and/or reliability.
  And if you somehow manage to break the odds whereby the poor practice miraculously pays off, you can always create a paging file.
17. Re:Leaky Fawcet by akanouras · 2010-08-10 18:46 · Score: 1
  
  I apologise, I didn't pay enough attention to the context while replying.
  Indeed, using swapping for the sole purpose of mitigating memory leaks is wrong.
18. Re:Leaky Fawcet by tenchikaibyaku · 2010-08-10 19:08 · Score: 1
  
  I have lately disabled my swap for a very simple reason: with 4GB of RAM, the swap was only ever used when some rogue application suddenly went into an eat-all-memory loop *cough*adobe flash*cough*.
  
  It might not be a good reason in theory, but in practice I'd rather have the OOM killer kick in sooner than have to struggle with a system that is practically hung due to all the swapping. I can live with having slightly less of my filesystem cached.
19. Re:Leaky Fawcet by akanouras · 2010-08-10 20:05 · Score: 3, Interesting
  
  Excuse my nitpicking, your post sparked some new questions for me:
  
  The problem stems when legitimate applications attempt to use that memory. How long does it take to page (read/write) 16GB, 4KB at a time?
  Are you sure that it's only reading/writing 4KB at a time? It seems pretty braindead to me.
  
  The old 1:1+x and 2:1 memory to disk ratios are based on notions of swapping rather than paging (yes, those are two different virtual memory techniques), plus allowing room for kernel dumps, etc. Paging is far more efficient than swapping ever was.
  Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.
  
  You might come back and say, one day I might need it. Well, one day you can create a file (dd if=/dev/zero of=/pagefile bs=1024 count=xxxx), initialize it as page (mkswap /pagefile), and add it as a low priority paging device (swapon -p0 /pagefile). Problem solved.
  Just mentioning here that Swapspace (Debian package) takes care of that, with configurable thresholds.
  
  You may say the performance will be horrible with paging on top of a file system - but if you're overflowing several GB to a page file on top of a file system, the performance impact won't be noticeable as you already have far, far greater performance problems. And if the page activity isn't noticeable, the fact it's on a file system won't matter.
  Quoting Andrew Morton:
  "[On 2.6 kernels the difference is] None at all. The kernel generates a map of swap offset -> disk blocks at swapon time and from then on uses that map to perform swap I/O directly against the underlying disk queue, bypassing all caching, metadata and filesystem code."
20. Re:Leaky Fawcet by Anonymous Coward · 2010-08-10 20:25 · Score: 0
  
  You must be new here
21. Re:Leaky Fawcet by somersault · 2010-08-10 21:59 · Score: 0, Redundant
  
  I'm assuming you've already heard of it, but you can use something like ksplice to patch up the kernel on the fly. It's not necessary to skip updates even if you want 100% uptime.
  
  --
  which is totally what she said
22. Re:Leaky Fawcet by lars_stefan_axelsson · 2010-08-10 22:19 · Score: 2, Informative
  
  Could you elaborate on the difference between swapping and paging? I have always thought of it (adopting the term "paging") as an effort to disconnect modern Virtual Memory implementations from the awful VM performance of Windows 3.1/9x. Wikipedia mentions them as interchangeable terms and other sources on the web seem to agree.
  It's actually (buried) in the wikipedia article you link, but only a sentence or so. In the old days, before paging, a Unix system would swap an entire running program onto disk/drum. (That's where the sticky bit comes from, as swap space was typically much faster than other secondary storage, if nothing else the lack of a file system helps, the sticky bit on an executable file meant, "keep text of program on swap even when it's stopped executing". This meant that executing the program again would go much faster). Then came paging, where only certain pages of a running program would get ejected to swap space.
  Unix systems would then both swap and page. Roughly, when memory pressure was low (but still high enough to demand swap space), the system would page. As memory pressure rose, the OS would decide the situation to be untenable and select entire processes to be evicted to swap for a long time (several seconds to tens of seconds) and then check periodically to see if they could/should be brought back (evicting someone else in the process). The BSDs even divided the task struct into two parts, the swappable and the unswappable part. Where the swappable part would record things like page tables etc. that is superfluous information when all the pages of a process have been ejected. The unswappable part contained only the bare minimum needed to remember there was a process on swap, and to make scheduling decisions regarding it. This made sense when main memory was measured in single digit megabytes, I don't think that Linux bothered with this (or even swapping as a concept, implementing just paging, but don't quote me on that, as memories were becoming bigger fast).
  Of course, swapping meant that those of us that ran X on a 4MB Sun system in the eighties would find that our X-term processes had been swapped out (the OS had decided that since they hadn't been used in a while, and were waiting for I/O, they were probably batch oriented in nature and could be swapped out wholesale) and it would take several seconds for the cursor to become responsive when you changed windows... :-) The scheduling decisions hadn't kept up. The solution though was the same as today, buy more memory... :-)
  Any good *old* book on OS internals, esp. the earlier incarnations of "The Design and Implementation of the FreeBSD Operating System" by McKusick et al. would have the gory details. (But the FreeBSD version of that book might have done away with that. It was still in the 4.2 version though.) :-)
  
  --
  Stefan Axelsson
23. Re:Leaky Fawcet by vlm · 2010-08-10 23:04 · Score: 4, Insightful
  
  When disks come in 2TB sizes .... why should I sweat 8 GB?
  You are confusing capacity problems with throughput problems. Sweat how poor performance is when 8 gigs gets thrashing.
  The real problem is the ratio of memory access speed vs drive access speed has gotten dramatically worse over the past decades.
  Look at two scenarios with the same memory leak:
  With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.
  With no/little swap, the OOM killer will reap your processes, which will be restarted automatically by your init scripts or equivalent. The users will notice that maybe, just maybe, they had to click refresh twice on a page. Or maybe it seemed slow for a moment before it was normal speed. They'll probably just blame the network guys.
  End result, with swap means long outage that needs manual fix, no swap means no outage at all and automatic fix.
  In the 80s, yes you sized your swap based on disk space. In the 10s (heck, in the 00s) you size your swap based on how long you're willing to wait.
  It takes a very atypical workload and very atypical hardware for users to tolerate the thrashing of gigs of swap...
  
  --
  "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
24. Re:Leaky Fawcet by kenh · 2010-08-10 23:36 · Score: 1
  
  First, it's Leaky Faucet (Unless you are thinking of Farrah Fawcett 8^)
  Second, never try to teach a pig to fly, it wastes your time and annoys the pig. The same goes for having in-depth technical discussions with many slashdot commenters...
  
  --
  Ken
25. Re:Leaky Fawcet by Anonymous Coward · 2010-08-11 00:39 · Score: 0
  
  Excluding Windows boxes (they have their own unique paging, memory/file mapping, and backing store systems), generally more than 1/4-1/2 memory is a waste these days.
  
  Unless, of course, you want to use hibernation.
26. Re:Leaky Fawcet by Just+Some+Guy · 2010-08-11 01:30 · Score: 1
  
  With 8 gigs of glacially slow swap, true everything will keep running but performance will drop by a factor of perhaps 1000. The users will SCREAM. Which means your pager/cellphone will scream. Eventually you can log in, manually restart the processes, and the users will be happy, for a little while.
  Is there a modern OS with a VM manager that horrible? And while I agree that the ratio of memory speed to HDD speed (but not necessarily SSD speed) keeps growing in favor of RAM, the ratio of RAM size to hard drive throughput still seems about the same. For instance, my first 512KB Amiga 1000 had a 5KB/s floppy, so writing out the entire contents of RAM would take about 100 seconds. These days my home server has 8GB of RAM and each of its drives can sustain about 80MB/s throughput, so writing out the entire contents of RAM would take about... 100 seconds.
  Finally, while I don't know as much about Linux's VMM, I know that FreeBSD's is fairly proactive about copying long-unused RAM pages to swap during idle periods. If those processes suddenly decide to access those pages, they're still in RAM and the processes race ahead as normal. If some other process tries to allocate that RAM, then those pages are released and allocated to the new process with no new disk IO at all - because they've already been copied out. I can't think of a single real-world reason why that isn't a good thing.
  
  --
  Dewey, what part of this looks like authorities should be involved?
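
Just Some Guy's back-of-the-envelope figures above can be checked in a few lines. This is a minimal sketch, assuming binary units (1 KB = 1024 bytes) and that the quoted throughput is sustained for the whole write. `seconds_to_write` is a hypothetical helper name.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3


def seconds_to_write(ram_bytes, throughput_bytes_per_s):
    """Time to stream all of RAM to storage at a constant throughput."""
    return ram_bytes / throughput_bytes_per_s


# Figures quoted in the comment.
amiga_seconds = seconds_to_write(512 * KB, 5 * KB)  # 512KB RAM, 5KB/s floppy
server_seconds = seconds_to_write(8 * GB, 80 * MB)  # 8GB RAM, 80MB/s drive

print(f"Amiga 1000:  {amiga_seconds:.1f} s")
print(f"Home server: {server_seconds:.1f} s")
```

Both cases come out a little over 100 seconds, which matches the comment. The figure only describes one linear dump of RAM, not the access pattern of a system paging under load.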
27. Re:Leaky Fawcet by StayFrosty · 2010-08-11 01:55 · Score: 1
  
  The one good use for a 1:1 memory to disk ratio nowadays is suspend to disk. If you don't have enough swap space available and you try to suspend, it doesn't work.
  
  --
  "Frequently wrong, never in doubt."
28. Re:Leaky Fawcet by ultranova · 2010-08-11 01:59 · Score: 1
  
  Memory leaks usually get swapped out... your swap usage will grow but the system will keep going just as fast since those pages will never get swapped in again.
  
  I once had explorer.exe on Windows 7 go into some kind of seizure where it ended up using over 2 gigs of memory (of a total 4) before I killed it. It certainly was swapping in and out constantly. Fun, that.
  
  --
  Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
29. Re:Leaky Fawcet by Anonymous Coward · 2010-08-11 02:13 · Score: 0
  
  The raisins have calmed down, can we call it a page file now? :)
30. Re:Leaky Fawcet by akanouras · 2010-08-11 02:42 · Score: 1
  
  Thank you very much for your reply, old Unix stories are always fascinating to read! :D
  I have amended the Wikipedia "disambiguation" page to make clicking through more ...enticing for future visitors :)
31. Re:Leaky Fawcet by GooberToo · 2010-08-11 03:33 · Score: 1
  
  Yes, I've read plenty on ksplice. Most distributions do not yet include it. Besides, even those that do include it are not exempt from reboot; not by a wide measure. Ksplice is only good for very minor updates which do not change any data structures. Once a structure changes, ksplice is no longer an option.
  And assuming ksplice were always an option, it still doesn't change the fact that the justification for this entire thought experiment is as contrived and as poor a practice as they come.
32. Re:Leaky Fawcet by GooberToo · 2010-08-11 03:35 · Score: 0, Offtopic
  
  Please name me a server which requires multiple 24x7x365 operation and which should be suspended to disk. They are mutually exclusive concepts. Either the server needs to be up and running or it doesn't. If it doesn't, and therefore can be suspended, it can be upgraded and rebooted.
33. Re:Leaky Fawcet by GooberToo · 2010-08-11 03:47 · Score: 1
  
  Excellent summary.
34. Re:Leaky Fawcet by GooberToo · 2010-08-11 04:07 · Score: 1
  
  Is there a modern OS with a VM manager that horrible?
  You're mis-attributing the issue. The issue is not one of a poor VM. The issue is one of a poor admin. The VM is attempting to honor the configuration which the admin provided. By providing massive page area the admin has instructed the VM/paging system to suffer massive performance loss in exchange for not returning out of memory errors.
  
  These days my home server has 8GB of RAM and each of its drives can sustain about 80MB/s throughput, so writing out the entire contents of RAM would take about... 100 seconds.
  That's an extremely simplistic way of looking at it. Paging rarely happens entirely linearly; at least not because of memory pressures. Worse, paging is triggered because of memory pressure. This is why others have referred to it as thrashing - because it causes exactly that. So rather than a linear write of 8GB, you're now looking at non-linear read/write cycles, on (typically) 4KB page boundaries, whereby some portion of that 8GB is very likely to be part of a cyclic, read/write thrash. And this is why the current paging algorithms are so damn complex and frequently completely hose corner cases.
  Here's an extremely simplified example. Applications X and Y both require memory. In order to satisfy the demand, both X and Y must be paged. Since X is currently running, parts (pages) of Y are paged out first. But, since we want to be smart about this, we only page out the least frequently used pages of Y, which need not be linear in RAM - and in fact are likely not to be. So, now we have a good chunk of Y paged out and the pages in RAM are freed and then committed to X. X uses the memory and now it's time for Y to run. Repeat this cycle over and over. Over time, the chances of the paged activity coming even close to linear access on disk become more and more unlikely, and these two applications' paging activity becomes ever more interleaved on disk.
  Now, add a memory leak to the above. Since the smallest granularity is 4KB, it creates large opportunities for fragmentation. Imagine one application having multiple data structures which fit within that 4KB page. Let's say 1/3 of that page is leaked. That 1/3 prevents reaping, and since the other 2/3 is considered hot while 1/3 is cold, it may become a candidate for paging under heavy pressure. In turn, you're now shuffling leaked memory to and from disk, all the while making less memory available to the entire OS. In short, you've dramatically harmed performance through reduced available contiguous memory, larger scatter/gather and more random read/write cycles.
  Now imagine a dozen processes rather than just two. You'll need a tent and coffee near your keyboard because you'll be waiting a while. And the term "thrashing" absolutely is appropriate here. As you can see, it's really not about a single, simple, linear, 8GB write.
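
To put rough numbers on the contrast GooberToo draws above, the sketch below compares one linear 8GB write with the same 8GB moved as scattered 4KB pages. The 8GB and 80MB/s figures come from the quoted comment. The 10 ms cost per random 4KB I/O is an assumed, illustrative figure for a spinning disk, not a measurement from the thread, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3


def sequential_seconds(total_bytes, throughput_bytes_per_s):
    """One linear transfer at a constant throughput."""
    return total_bytes / throughput_bytes_per_s


def random_seconds(total_bytes, page_bytes, seconds_per_io):
    """The same data moved one page at a time, each page paying a fixed I/O cost."""
    pages = total_bytes // page_bytes
    return pages * seconds_per_io


ASSUMED_SECONDS_PER_RANDOM_IO = 0.010  # assumption: about 10 ms per scattered 4KB I/O

linear = sequential_seconds(8 * GB, 80 * MB)
scattered = random_seconds(8 * GB, 4 * KB, ASSUMED_SECONDS_PER_RANDOM_IO)

print(f"linear 8GB write:    {linear:8.1f} s")
print(f"scattered 4KB pages: {scattered:8.1f} s ({scattered / 3600:.1f} hours)")
print(f"slowdown:            {scattered / linear:8.1f}x")
```

The exact ratio depends entirely on the assumed per-I/O cost, and a real system would not page out all 8GB at random. The sketch is only meant to show why the access pattern matters more than raw throughput.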
35. Re:Leaky Fawcet by sjames · 2010-08-11 04:22 · Score: 1
  
  *IF* it was thrashing the swap, it would be a terrible problem, but it's not. A leaked page swaps out and NEVER swaps back in. It's a one time event for that page. Since it is leaked, nothing even "remembers" it exists except the swap.
  If I SIGSTOP a memory hog for a bit, there is a one time hit as it gets paged out. Then another as it is paged back in. Still no thrashing.
  As I said, I do make sure that there is enough RAM for the actual memory in use. No thrashing at all.
36. Re:Leaky Fawcet by StayFrosty · 2010-08-11 04:23 · Score: 1
  
  I don't recall the words "server" or "24x7x365 operation" in my previous post. Since you seemed to miss my point entirely, I'll spell it out again--only slowly and more loudly this time.
  
  Swap space is useful on Linux LAPTOPS, TABLETS and sometimes even DESKTOPS/WORKSTATIONS for suspend to disk.
  
  --
  "Frequently wrong, never in doubt."
37. Re:Leaky Fawcet by GooberToo · 2010-08-11 04:25 · Score: 1
  
  Someone else already provided an excellent summary. But in a nutshell, swapping moves an entire process at a time, while paging typically moves the least frequently used pages within that process. Swapping typically leads to linear I/O while paging typically does not.
  It's for historical reasons that paging and swapping are frequently intermixed - just as the two implementations frequently were. Just the same, the distinction is important to understand when it comes time to allocate a paging area.
  
  Quoting Andrew Morton:
  Good point - and something I had forgotten about. It wasn't always so. But do keep in mind that when creating a pagefile on top of a filesystem, the file may be non-contiguously allocated, especially if the filesystem has existed and been heavily used for some time. And the allocation is completely at the mercy of the underlying filesystem. Which means it's very reasonable, though certainly not a requirement, for paging via a file on top of a filesystem to be slower than direct paging partition access, which is assured to be contiguous.
38. Re:Leaky Fawcet by GooberToo · 2010-08-11 04:28 · Score: 0, Offtopic
  
  It was inescapably implied. Go re-read it. By definition, any system which requires non-stop operation and uptime measured in years is exactly as I described. After all, the entire premise is that these are requirements and even moreso, paging is to be used to avoid application restarts and/or system reboot so as to work around a horrible memory leak.
  
  Swap space is useful on Linux LAPTOPS, TABLETS and sometimes even DESKTOPS/WORKSTATIONS for suspend to disk.
  No one said otherwise.
39. Re:Leaky Fawcet by GooberToo · 2010-08-11 04:32 · Score: 1
  
  A leaked page frequently causes fragmentation. Under memory pressure you've now directly inflicted additional I/O and a loss of contiguous memory, and are now imposing yet additional paging pressure.
  The saner solution is to simply, periodically, restart your application. Followed by, getting it updated as soon as a fix is available.
40. Re:Leaky Fawcet by sjames · 2010-08-11 04:39 · Score: 2, Informative
  
  You are confusing the pageout of little used (or unused but unfreed) pages as a one time event with a constant state of thrashing. The former makes no noticeable difference to system performance except that it keeps NOT ending up out of memory, the latter is a horrific performance killer that happens any time the running processes actively demand more RAM than the machine has.
  An admin who routinely allows the machines to thrash would indeed be bad. The solution to that is adding RAM or moving some services to another machine. Shrinking swap will not help; it will just trigger total failure sooner (but really the machine has already functionally failed since it is thrashing, no matter how much or little swap it has).
  An admin who doesn't provide enough swap to handle little used or unused pages that are still allocated due to some silly superstition that the server will suddenly start thrashing because he didn't use his version of the golden ratio (but all would be just peachy otherwise) is a bad admin.
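  The difference between a one-time pageout and thrashing can be sketched with a toy LRU simulation. This is only an illustration under assumptions: the function name, the page labels, and the 8-frame "RAM" are made up, and a real kernel does not use pure LRU.

```python
from collections import OrderedDict

def count_page_faults(accesses, ram_frames):
    """Simulate LRU paging and return the number of page faults."""
    resident = OrderedDict()  # page -> None, ordered oldest-first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(resident) >= ram_frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = None
    return faults

# Case 1: 4 leaked pages are touched once, then a 6-page working set loops.
leaked = [f"leak{i}" for i in range(4)]
working_set = [f"hot{i}" for i in range(6)]
leak_faults = count_page_faults(leaked + working_set * 100, ram_frames=8)

# Case 2: a 10-page working set loops in the same 8 frames.
thrash_faults = count_page_faults([f"hot{i}" for i in range(10)] * 100, ram_frames=8)

print(leak_faults, thrash_faults)
```

  In the first trace the leaked pages are pushed out as the working set warms up and never come back; in the second, the active working set is larger than RAM, so every single access faults.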
41. Re:Leaky Fawcet by Anonymous Coward · 2010-08-11 04:39 · Score: 0
  
  The point of swap files is how you answer the following question:
  
  What do you want to happen when you run out of real, physical, RAM?
  
  Instant crash? Don't bother with a swap file.
  
  A small tide-you-over grace period? Use a small swap file.
  
  (Given today's drive speeds and access times - something in the 4-16GB range is about right for a server. It's small enough not to be a big bother, but big enough that you'll never use all of it. And if you do suddenly start paging, it will give you enough time to be alerted via Nagios/Cacti before everything goes boom.)
42. Re:Leaky Fawcet by GooberToo · 2010-08-11 04:41 · Score: 0, Troll
  
  Considering I qualified my statements with "under pressure", the only person who misunderstood, is you.
43. Re:Leaky Fawcet by akanouras · 2010-08-11 04:58 · Score: 1
  
  Thank you both for your posts, it's these gems of insight that keep me visiting Slashdot, even though the S/N ratio is constantly diminishing...
44. Re:Leaky Fawcet by sjames · 2010-08-11 05:08 · Score: 1
  
  Agreed. However, since people don't always drop everything and devote their lives to fixing the bug when I file a report, it's useful to employ a workaround in the meantime. There's generally not much problem with a scheduled restart, but it's nice to know the system won't go down in flames should that be put off for a while.
45. Re:Leaky Fawcet by GooberToo · 2010-08-11 07:27 · Score: 1
  
  "A while" was never the point of contention. The "I do x, y, and z" to address a problem which almost never happens, to avoid updating a buggy application (or simply periodically restarting it), such that one can maintain an uptime of "years", is the point of contention. It's such a point of contention, I'd call it "complete bullshit."
46. Re:Leaky Fawcet by sjames · 2010-08-11 08:03 · Score: 1
  
  If it means that much to you, fine. I hereby declare that I shall never in your lifetime hold a gun to your head and order you to allocate swap 1:1 with RAM!
  There, feel better now? YEEESH!
47. Re:Leaky Fawcet by badkarmadayaccount · 2010-08-11 23:04 · Score: 1
  
  How about a garbage collector?
  
  --
  I know tobacco is bad for you, so I smoke weed with crack.
Kernel shared memory by Narkov · 2010-08-10 16:08 · Score: 5, Informative

The Linux kernel uses something called kernel shared memory (KSM) to achieve this with its virtualization technology. LWN has a great article on it:
http://lwn.net/Articles/306704/
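For anyone who wants the idea without reading the article, here is a rough sketch of content-based page merging. It is not how KSM is actually implemented (this version just keys pages by a SHA-256 digest to keep things short); the function name, the VM names, and the page contents are all invented for illustration.

```python
import hashlib

PAGE_SIZE = 4096

def merge_identical_pages(vm_pages):
    """Map each (vm, page_index) to one shared frame per distinct content.

    vm_pages: dict of VM name -> list of bytes objects, one per page.
    Returns (mapping, frames), where frames holds each distinct page once.
    """
    frames = {}   # content digest -> page bytes (stored once)
    mapping = {}  # (vm, index) -> content digest
    for vm, pages in vm_pages.items():
        for index, page in enumerate(pages):
            digest = hashlib.sha256(page).hexdigest()
            frames.setdefault(digest, page)
            mapping[(vm, index)] = digest
    return mapping, frames

zero = bytes(PAGE_SIZE)        # an untouched, zero-filled page
libc = b"\x90" * PAGE_SIZE     # stand-in for identical library code
vms = {
    "vm1": [zero, libc, b"a" * PAGE_SIZE],
    "vm2": [zero, libc, b"b" * PAGE_SIZE],
}
mapping, frames = merge_identical_pages(vms)
pages_before = sum(len(p) for p in vms.values())
pages_after = len(frames)
print(pages_before, "guest pages backed by", pages_after, "frames")
```

A real implementation also has to mark merged frames read-only and break the sharing (copy-on-write) as soon as one guest writes to a page.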
1. Re:Kernel shared memory by amscanne · 2010-08-10 17:41 · Score: 5, Informative
  
  Disclaimer: I wrote the blog post. I noticed the massive slashdot traffic, so I popped over. The article summary is not /entirely/ accurate, and doesn't really completely capture what we're trying to do with our software.
  Our mechanism for performing over-subscription is actually rather unique.
  Copper (the virtualization platform in the demo) is based on an open-source Xen-based virtualization technology named SnowFlock. Where KSM does post-processing on memory pages to share memory on a post-hoc basis, the SnowFlock method is much more similar to unix 'fork()' at the VM level.
  We actually clone a single VM into multiple ones by pausing the original VM, COW-ing its memory, and then spinning up multiple independent, divergent clones off of that memory snapshot.
  We combine this with a mechanism for bringing up lightweight VMs fetching remote memory on-demand, which allows us to bring up clones across a network about as quickly and easily as clones on the same machine. We can 'clone' a VM into 10 VMs spread across different hosts in a matter of seconds.
  So the mechanism for accomplishing this works as follows:
  1. The master VM is momentarily paused (a few milliseconds) and its memory is snapshotted.
  2. A memory server is set up to serve that snapshot up.
  3. 16 'lightweight' clone VMs are brought up with most of their memory "empty".
  4. The clones start pulling memory from the server on-demand.
  All of this takes a few seconds from start to finish, whether on the same machine or across the network.
  We're using all of this to build a bona-fide cluster operating system where you host virtual clusters which can dynamically grow and shrink on demand (in seconds, not minutes).
  The blog post was intended not as an ad, but rather a simple demo of what we're working on (memory over-subscription that leverages our unique cloning mechanism) and a general introduction to standard techniques for memory over-commit. The pointer to KSM is appreciated, I missed it in the post :)
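  The four steps above can be sketched as a toy model. To be clear, this is not SnowFlock or Copper code; the class names, the 1024-page master, and the string "pages" are assumptions made up for illustration, and each clone here keeps its own copy of fetched pages, whereas clones on the same host would presumably share unmodified ones.

```python
class MemoryServer:
    """Serves pages from a frozen snapshot of the master VM."""

    def __init__(self, snapshot):
        self._snapshot = dict(snapshot)  # copy taken at "pause" time
        self.requests = 0

    def fetch(self, page_no):
        self.requests += 1
        return self._snapshot[page_no]


class CloneVM:
    """A lightweight clone that starts empty and pulls pages on demand."""

    def __init__(self, server):
        self._server = server
        self._local = {}  # pages fetched or privately written so far

    def read(self, page_no):
        if page_no not in self._local:
            self._local[page_no] = self._server.fetch(page_no)
        return self._local[page_no]

    def write(self, page_no, data):
        self._local[page_no] = data  # private copy; the snapshot is untouched

    def resident_pages(self):
        return len(self._local)


master_memory = {n: f"page-{n}" for n in range(1024)}
server = MemoryServer(master_memory)            # steps 1 and 2
clones = [CloneVM(server) for _ in range(16)]   # step 3

clones[0].read(7)               # step 4: fetched on first touch
clones[0].write(7, "diverged")  # clone 0 now differs from its siblings
```

  A clone that never touches a page never pays for it, which is where the over-subscription comes from in this model.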
2. Re:Kernel shared memory by pookemon · 2010-08-10 18:45 · Score: 5, Funny
  
  The article summary is not /entirely/ accurate
  That's surprising. That never happens on /.
  
  --
  dnuof eruc rof aixelsid
3. Re:Kernel shared memory by descubes · 2010-08-10 18:46 · Score: 5, Interesting
  
  Having written VM software myself (HP Integrity VM), I find this fascinating. Congratulations for a very interesting approach.
  That being said, I'm sort of curious how well that would work with any amount of I/O happening. If you have some DMA transfer in progress to one of the pages, you can't just snapshot the memory until the DMA completes, can you? Consider a disk transfer from a SAN. With high traffic, you may be talking about seconds, not milliseconds, no?
  
  --
  -- Did you try Tao3D? http://tao3d.sourceforge.net
4. Re:Kernel shared memory by kenh · 2010-08-10 23:23 · Score: 1
  
  Let me see if I understand - you take one VM (say, an Ubuntu 10.04 server running a LAMP stack, just to pick one), then you make "diffs" of that initial VM and create additional VMs that are also running the same OS/software (as a starting point). Of course, I can load up other software on the "diff'd" VMs, but they increase the actual memory footprint of each VM. So, to maximize oversubscription of memory, I'd want to limit myself to running VMs that are as similar as possible (say a farm of Ubuntu 10.04 LAMP servers), and were I to run a machine with a couple Ubuntu 10.04 LAMP servers, a couple Windows Server 2008 servers, a Windows Server 2003 server, and an Ubuntu 9.04 server on the same machine I'd have minimal memory oversubscription benefits (the multiple Windows Server 2008 and Ubuntu 10.04 LAMP servers would share memory, but the one-off Windows Server 2003 and Ubuntu 9.04 would have no shared memory)... Correct?
  Interesting idea, seems to me the memory server would cause a serious impact on server performance, but that is the view from my armchair, I'll reserve judgement until I see it in action.
  Thanks for following up on the /. story.
  
  --
  Ken
5. Re:Kernel shared memory by Daniel+Boisvert · 2010-08-11 00:21 · Score: 1
  
  This is an interesting approach, especially across hosts in a cluster. Is it safe to assume you expect your hosts and interconnect to be very reliable?
  
  I'm curious about the methods you use to mitigate the problems that would seem to result if you clone VM 1 from Host A onto VM's 2-10 on hosts B-E, and Host A dies before the entirety of VM 1's memory is copied elsewhere. Can you shed any light on this?
6. Re:Kernel shared memory by calmond · 2010-08-11 00:26 · Score: 1
  
  Correct me if I'm wrong, but the method you described sounds almost exactly like LVM snapshots. A great approach, and saves a ton of disk space. How often should a VM be rebooted or re-cloned though? Memory is a lot more volatile than disk storage, so I would think that the longer the system runs, the more divergent the memory stacks would be, thus the less efficient this method would be over time, or am I missing something? Thanks!
7. Re:Kernel shared memory by kscguru · 2010-08-11 05:15 · Score: 2, Interesting
  
  Disclaimer: VMware engineer; but I do like your blog post. It's one of the more accessible descriptions of cpu and memory overcommit that I have seen.
  The SnowFlock approach and VMware's approach end up making slightly different assumptions that really make each's techniques not applicable to the other. In a cluster it is advantageous to have one root VM because startup costs outweigh customization overhead; in a datacenter, each VM is different enough that the customization overhead outweighs the cost of starting a whole new VM. Particularly with Windows: a Windows VM essentially needs to be rebooted to be customized (and thus the memory server stops being useful), whereas Linux can more easily customize on-the-fly. Different niches of the market.
  The second big difference is architectural. VMware handles more in the virtual machine monitor; KVM and Xen use simpler virtual machine monitors that offload the complex tasks to a parent partition. This means that for VMware, each additional VM instance takes ~100MB of hypervisor overhead - small relative to non-idle VMs, but large relative to idle VMs. It's purely an engineering tradeoff: a design like VMware's vmm will always be (a little bit) quicker per-VM; a design like KVM/Xen's vmm will always scale (a little bit) better with idle VMs.
  These combine to make it easy to show KVM/Xen hypervisors more deeply overcommitted than VMware hypervisors by using only idle Linux VMs. VMware doesn't care about such numbers, because the difference disappears or favors VMware as load increases. If GridCentric has found a business for deeply overcommitted VMs, more power to you!
  
  --
  A witty [sig] proves nothing. --Voltaire
So... is this different from Linux KVM w/ KMS? by Anonymous Coward · 2010-08-10 16:10 · Score: 1

Even the same ratio of over-subscribed memory, around 300%, but without the overhead this article admits it has, which reduces its actual over-subscription ratio down to just over 200% instead:
http://lwn.net/Articles/306704/
Specifically, this link/LKML post: 52 1GB Windows VMs in 16GB of total physical RAM installed:
http://lwn.net/Articles/306713/
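The arithmetic behind those percentages, as a quick sketch. The helper name is made up; the inputs are the 16 one-gigabyte VMs on 5 GB from the story summary and the 52 one-gigabyte VMs on 16 GB from the LKML post, and the result ignores any hypervisor overhead.

```python
def overcommit_ratio(vm_count, gb_per_vm, host_gb):
    """Return allocated guest memory as a multiple of physical host memory."""
    return (vm_count * gb_per_vm) / host_gb

gridcentric_demo = overcommit_ratio(16, 1, 5)
ksm_lkml_report = overcommit_ratio(52, 1, 16)

print(f"{gridcentric_demo:.0%} vs {ksm_lkml_report:.0%}")
```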
1. Re:So... is this different from Linux KVM w/ KMS? by fR0993R-on-Atari-520 · 2010-08-10 16:17 · Score: 1
  
  Funny... in the VMware whitepaper linked to from the article, even VMware wasn't able to get more than 110% memory over-consolidation from page sharing. I wonder what's so different about KVM's page sharing approach?
  
  --
  There are 11 types of people in the world: those who understand unary, and those who don't.
2. Re:So... is this different from Linux KVM w/ KMS? by amscanne · 2010-08-10 16:32 · Score: 5, Informative
  
  I have one possibility. The blog post alluded to this. Page sharing can be done *much* more efficiently on Linux due to the fact that the ELF loader does not need to rewrite large chunks of the binaries when applications are loaded into memory. The Windows loader will rewrite addresses in whole sections of code if a DLL or EXE is not loaded at its "preferred base" virtual address. In Linux, these addresses are isolated through the use of trampolines. Basically, you can have ten instances of Windows all running the exact same Microsoft Word binaries and they might not share the code for the application. In Linux, if you have ten VMs running the same binaries of Open Office, there will be a lot more sharing.
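  A toy model of why rebasing hurts content-based sharing. This is not a PE or ELF loader; the 16-byte "pages", the fixup list, the base addresses, and the function names are all assumptions chosen to keep the example tiny.

```python
import struct

PAGE = 16  # tiny "pages" to keep the demo readable

def load_with_fixups(image, fixup_offsets, preferred_base, actual_base):
    """Return the in-memory image after patching absolute addresses.

    Each fixup is a 4-byte little-endian address stored at the given offset,
    linked as if the image were loaded at preferred_base.
    """
    loaded = bytearray(image)
    delta = actual_base - preferred_base
    for off in fixup_offsets:
        (addr,) = struct.unpack_from("<I", loaded, off)
        struct.pack_into("<I", loaded, off, addr + delta)
    return bytes(loaded)

def shareable_pages(a, b):
    """Count pages whose bytes are identical in both loaded images."""
    return sum(a[i:i + PAGE] == b[i:i + PAGE] for i in range(0, len(a), PAGE))

# A 4-page image; pages 0 and 2 each hold one absolute address.
image = bytearray(4 * PAGE)
struct.pack_into("<I", image, 4, 0x10000000 + 0x20)
struct.pack_into("<I", image, 2 * PAGE + 8, 0x10000000 + 0x30)
fixups = [4, 2 * PAGE + 8]

at_preferred = load_with_fixups(image, fixups, 0x10000000, 0x10000000)
also_preferred = load_with_fixups(image, fixups, 0x10000000, 0x10000000)
rebased = load_with_fixups(image, fixups, 0x10000000, 0x20000000)

print(shareable_pages(at_preferred, also_preferred))
print(shareable_pages(at_preferred, rebased))
```

  Two copies loaded at the preferred base are byte-identical, so every page can be shared; once one copy is rebased, every page that contains a patched address stops matching.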
3. Re:So... is this different from Linux KVM w/ KMS? by Anonymous Coward · 2010-08-10 18:26 · Score: 0
  
  I'm sorry, but this post and the blog post are extremely inaccurate, and I hesitate to say flat-out wrong. EXEs are never relocated unless mapped via LoadLibrary (a debugging technique only). All code (DLLs and EXEs) are system shared with copy-on-write memory mapping. If a DLL is relocated, what typically happens is that the .reloc section is copied to the private address space and rewritten. All other binary sections should remain shared (I believe; IAT rewriting should only happen once globally). Additionally, most (all?) Microsoft DLLs have unique base addresses to minimize the potential relocations.
4. Re:So... is this different from Linux KVM w/ KMS? by ringm000 · 2010-08-10 18:26 · Score: 1
  
  Base addresses of DLLs in an application are typically chosen to avoid conflicts with system DLLs and between each other, so these conflicts are relatively rare. When they happen, the DLLs can be manually rebased.
5. Re:So... is this different from Linux KVM w/ KMS? by amscanne · 2010-08-10 19:06 · Score: 3, Insightful
  
  Yes, rebasing is reduced by careful selection of preferred base addresses (particularly by Microsoft for their DLLs). Yes, if DLLs are not rebased then they are shared -- I did not claim otherwise. My point along these lines is that rebasing *does* occur surprisingly often, and can hurt sharing. The actual level of sharing you achieve obviously depends almost *entirely* on your applications, workload, data, etc.
  By the way, as far as I know versions of Windows newer than Vista enable address-space randomization by default for security purposes. Since the starting virtual address of each DLL is randomized, preferred bases can't be respected. I don't know what impact this has on Windows memory usage post-Vista, but it seems like one can't rely on carefully curated base addresses.
  I'm not saying one approach is better than the other (Linux, Windows, whatever) -- I'm only positing a possibility for why one might see better improvement with KSM. One might just as easily see better over-subscription with Windows simply due to the fact that it zeroes out physical pages when they are released, as far as I know. Those zero pages can all be mapped to the same machine frame transparently (without the need for ballooning).
6. Re:So... is this different from Linux KVM w/ KMS? by Anonymous Coward · 2010-08-10 19:14 · Score: 0
  
  Hmm... last I checked, the .reloc section only specified code locations, and it's the locations themselves and not the index that is modified when relocated. And modification to the mapped code section should trigger a page fault, which results in the entire code section being copied to the private address space. If I'm right about this then both you and the GP are correct.
7. Re:So... is this different from Linux KVM w/ KMS? by Anonymous Coward · 2010-08-10 19:57 · Score: 0
  
  ASLR DLLs are randomly rebased only once while in existence (loaded by some process). It takes a reboot to relocate persistent system DLL's to another address. It's an improvement over preferred base loading in that two DLL's cannot request the same address.
8. Re:So... is this different from Linux KVM w/ KMS? by Wierdy1024 · 2010-08-10 20:53 · Score: 1
  
  Using trampolines for every cross-library call seems very inefficient...
  The windows method seems better for the more common case, where it does the costly rewriting at library load time, and then avoids an extra jump for every library function call.
  What's the performance impact of this? I bet it's at least a couple of percent, which is significant if it's across the entire system.
9. Re:So... is this different from Linux KVM w/ KMS? by milosoftware · 2010-08-10 21:51 · Score: 2, Interesting
  
  On x64 Windows systems, addressing is always relative, so this eliminates the DLL relocation. So it might actually save memory to use 64-bit guest OSses, as there will be less relocation and more sharing.
  
  --
  Musicians don't die. They just decompose.
Just a summary of existing techniques by Anonymous Coward · 2010-08-10 16:14 · Score: 2, Informative

This blog post is just a summary of 3 existing techniques: Paging, Ballooning, and Content-Based Sharing. It does not describe any new techniques, or give any new insights.
It's a solid summary of these techniques, but nothing more.
1. Re:Just a summary of existing techniques by GooberToo · 2010-08-10 18:22 · Score: 1
  
  A new implementation of an existing technique and/or technology can still be noteworthy. If this isn't the case then an F-22 is really just a Wright Brothers' Flyer - nothing new. My metaphor is absurd, but you get the point.
2. Re:Just a summary of existing techniques by dirtyhippie · 2010-08-10 18:24 · Score: 1
  
  It doesn't even say which if any of those techniques it's using. It's a teaser, not news.
3. Re:Just a summary of existing techniques by Anonymous Coward · 2010-08-10 23:28 · Score: 0
  
  More or less agreed. I always get annoyed when people demo memory overcommit technology with desktop VMs. It's easy to build a couple dozen Windows XP VMs with 1GB of RAM each and get huge overcommit ratios because XP will run in 256MB of RAM. Cramming 16GB of allocated VMs into 8GB of space is trivial when your actual memory requirements are close to 4GB. There's a reason virtualization vendors don't demo memory overcommit on VDI platforms using 4GB Windows 7 VMs running simulated workloads. Regarding the technologies listed:
  Page Sharing only works when you have a lot of duplicate memory pages, and if you're using large pages then you essentially have none (unless they are zeroed pages, in which case you have probably over-allocated your VMs).
  Paging at the hypervisor level is an absolute disaster waiting to happen. You're basically blindly paging data to disk that VMs think is stored in RAM. Consequently VMs request paged data and disk I/O increases. You're better off letting the VMs handle the paging so that they can intelligently determine what to page out.
  Ballooning isn't a bad technology, especially if your VMs have more memory than they need. But if they actually need the memory that they have then it doesn't get you much. It basically relies on the VM to handle its own paging.
  This is one area where virtualization is going to struggle. Oversubscribing CPUs isn't much of an issue because most physical servers are over-allocated CPUs anyway, leading to lots of idle cycles. Oversubscribing memory is trickier because servers typically don't need memory one minute and then less memory the next, and even if memory pages are not actively being "used" at a particular instance there is still a performance hit involved in paging them out.
  At this point I think that Microsoft is actually in the lead when it comes to VM memory management with their Dynamic Memory technology. It uses hooks into the VM operating system to determine how much memory a VM actually needs at that particular time and provides it, but also gives it the capability to dynamically scale up memory allocations as workloads require it. You still can't "use" more memory than you actually have and you never will (in the same way that you can't "use" more CPU cycles than you actually have), but Dynamic Memory ensures that you can more highly optimize the memory utilization in your virtual environments.
  http://blogs.technet.com/b/virtualization/archive/2010/07/12/dynamic-memory-coming-to-hyper-v-part-6.aspx
OpenVZ? by Anonymous Coward · 2010-08-10 16:20 · Score: 1, Informative

OpenVZ has had this for years now, which is one of the reasons it has gained popularity in the hosting world.
1. Re:OpenVZ? by KiloByte · 2010-08-10 23:31 · Score: 1
  
  Or vserver. Or BSD jails.
  These just use the good old Unix memory management -- if you can coordinate between multiple VMs, things get a whole lot easier. The problem with VMs with separate kernels (Xen, VirtualBox, VMWare, etc) is that they have no way of knowing a given page mmaps the same block on the disk.
  The technique described in the article is a hack that works only if all processes are started before you clone the VMs and nothing else happens later. Vserver does it strictly better -- if multiple VMs use the same file on the disk, it will use the memory exactly once, no matter when it was read.
  
  --
  The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
nothing new by Anonymous Coward · 2010-08-10 16:21 · Score: 1, Interesting

nothing new .. i ran 6 w2k3 servers on a linux box running vmware server with 4GB of ram and allocated 1GB to each vm
Is this an ad? by saleenS281 · 2010-08-10 16:25 · Score: 2, Insightful

I noticed free memory on the system was at 2GB and dropping quickly when they moved focus away from the console session (even though all of the VM's had the exact same app set running). This appears to be absolutely nothing new or amazing... in fact, it reads like an ad for gridcentric.
Not exactly new by Anonymous Coward · 2010-08-10 16:38 · Score: 1, Informative

Oversubscription of memory for VMs has been around for decades - just not for the Intel platform. There are other older, more mature platforms for VM support...
This having been done before ... by cdrguru · 2010-08-10 16:43 · Score: 5, Informative

One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before. Often, even when there is an opportunity to gather information about older systems, they don't think it is relevant.
Well, here I would say it is extremely relevant to understand some of the performance tricks utilized by VM/370 and VM/SP in the 1970s and 1980s. VM/370 is pretty much the foundation of today's IBM virtualization offerings. In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.
The key bit of information here is that for interactive users running CMS a significant optimization was put in place of sharing the bulk of the operating system pages. This was done by dividing the operating system into two parts, shared and non-shared and by design avoiding writes to the shared portion. If a page was written to a local copy was made and that page was no longer shared.
This was extremely practical for VM/370 and later systems because all interactive users were using pretty much the same operating system - CMS. It was not unusual to have anywhere from 100 to 4000 interactive users on such systems so sharing these pages meant for huge gains in memory utilization.
It seems to me that a reasonable implementation of this for virtualization today would be extremely powerful in that a bulk of virtualized machines are going to be running the same OS. Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense. So instead of booting an OS "natively" you would instead load a shared system where the shared (read only) pages would be loaded along with an initial copy of writable non-shared memory from a snapshot taken at some point during initialization of the OS.
This would seem to be able to be done easily for Linux even to the extent of having it assist with taking a snapshot during initialization. Doing this with Windows should also be possible as well. This would greatly reduce the memory footprint of adding another virtual machine also using a shared operating system. The memory then used by a new virtual machine would only be the non-shared pages. True, the bulk of the RAM of a virtual machine might be occupied by such non-shared pages but the working set of a virtual machine is likely to be composed of a significant number of OS pages - perhaps 25% or more. Reducing memory requirements by 25% would be a significant performance gain and increase in available physical memory.
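To put a rough number on that, here is a back-of-the-envelope sketch. It assumes, hypothetically, that a fixed fraction of each VM's footprint is read-only OS pages identical across all VMs and stored exactly once; the function name, the 16 one-gigabyte VMs, and the 25% figure are inputs picked for illustration, not measurements.

```python
def footprint_with_shared_os(vm_count, gb_per_vm, shared_fraction):
    """Total host memory if the shared OS pages are stored only once."""
    private = vm_count * gb_per_vm * (1 - shared_fraction)
    shared = gb_per_vm * shared_fraction  # one copy for everyone
    return private + shared

naive_gb = 16 * 1.0
shared_gb = footprint_with_shared_os(16, 1.0, 0.25)
saving = 1 - shared_gb / naive_gb

print(f"{shared_gb} GB instead of {naive_gb} GB ({saving:.1%} saved)")
```

With many VMs the saving approaches the shared fraction itself, since the single shared copy becomes negligible next to the private pages.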
1. Re:This having been done before ... by Anonymous Coward · 2010-08-10 16:52 · Score: 4, Informative
  
  Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent. That means, each time MS Office is loaded on a different machine, the function jump points are re-written according to where in the address space the code gets loaded, which is apparently usually different on different Windows instances. That means very little opportunity for page sharing.
  These guys seem to be doing something different...
2. Re:This having been done before ... by pz · 2010-08-10 17:16 · Score: 4, Informative
  
  In the 1960s the foundation of VM/370 was created at Cambridge University (MA, USA, not UK) and called CP/67.
  From what I can gather, it was not Cambridge University (of which I believe there is still only one, located in the UK, despite the similarly named Cambridge College in Cambridge, Massachusetts, but as the latter is an adult-educational center founded in 1971, the chances are that wasn't where CP/67 was developed), but rather IBM's Cambridge Scientific Center, which used to be in the same building as MIT's Project MAC. Project MAC (later becoming the MIT Lab for Computer Science) was where much of the structure of modern OSes was invented.
  Those were heady days for Tech Square. And, otherwise, the parent poster is right on.
  
  --
  
  Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
3. Re:This having been done before ... by Anonymous Coward · 2010-08-10 17:28 · Score: 0
  
  It's true, if a module isn't loaded at its preferred base address, there is pointer rewriting that occurs on Windows.
  However, each process gets its own address space. An app like MS Office is always going to be loaded at its preferred base address - none of the built-in modules are going to conflict over preferred base address. Two MS Office plugins that were written unaware of each other may conflict on preferred base address, but only the conflicting plugin will have its pointers rewritten. The rest of the process can happily share memory.
  Not even considering virtualization, this is useful in the many-user terminal server scenario. If you have 100 users running MS Word, Windows doesn't load 100 copies of winword.exe.
4. Re:This having been done before ... by ChipMonk · 2010-08-10 18:00 · Score: 1
  
  Very informative, but pure page sharing doesn't work for most Windows variants, due to the fact that Windows binaries aren't position independent.
  Is that also true for 64-bit Windows binaries? According to the docs I've read, position-independent binary code is preferred in 64-bits.
5. Re:This having been done before ... by Anonymous Coward · 2010-08-10 18:10 · Score: 0
  
  One of the problems with folks in the computer software business today is that they are generally young and haven't had much experience with what has gone on before.
  
  In my experience age has little to do with it. I know a sysadmin in his 50s who's one of the least knowledgeable people I know when it comes to actually understanding what's going on inside the machine. I know people 20 years younger who could do cartwheels around this guy. Age has little correlation with experience. The desire to understand is really the key.
  
  Today most kernel pages are read-only so sharing them across multiple virtual machines would make incredible sense.
  
  I guess. Kernels are dinky at a few megabytes of code compared to the gigabytes of memory available. Glibc is a couple megabytes. There's some other shared libs for sure, but I've a hard time believing they add up to anything substantial.
  The real question here to me is, why try to share memory in the first place? Memory is cheap. This hasn't always been the case, but it is now.
6. Re:This having been done before ... by Anonymous Coward · 2010-08-11 00:43 · Score: 0
  
  Is that also true for 64-bit Windows binaries? According to the docs I've read, position-independent binary code is preferred in 64-bits.
  
  Except that when you're using large amounts of memory you enable large pages, which reduces the possibility of page sharing to almost zero.
7. Re:This having been done before ... by petermgreen · 2010-08-11 01:50 · Score: 2, Informative
  
  Memory is cheap
  Kind of, the memory itself isn't too expensive but the cost of a system has a highly nonlinear relationship to memory requirements, at least with the Intel Nehalem stuff (it's been a while since I've looked at AMD so I can't really comment there).
  Up to 16GB you can use an ordinary LGA1366 board and CPU.
  To get to 24GB you need a LGA1366 board and CPU.
  To get to 48GB (or 72GB if you are prepared to take the performance hit and motherboard choice hit that comes from putting three memory modules on a channel) you need a dual-socket LGA1366 board and associated dual-socket capable CPUs (which are far, far more expensive clock for clock than their single-socket equivalents) and an associated special case.
  To get to 96GB (or 144GB if you are prepared to take the performance hit and motherboard choice hit that comes from putting three memory modules on a channel) you need the aforementioned dual-socket platform plus insanely expensive 8GB modules.
  Beyond that you are talking moving to a quad-socket platform afaict.
  
  --
  note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
8. Re:This having been done before ... by Maarx · 2010-08-11 02:34 · Score: 2, Insightful
  
  You guys gotta learn to use the quote tags instead of the italics. Slashdot knows to hide the quote when displaying your post in abbreviated mode, so we can actually read what you said.
  And face it, if you post as AC, you're going to be in abbreviated mode.
9. Re:This having been done before ... by petermgreen · 2010-08-11 03:18 · Score: 2, Informative
  
  Up to 16GB you can use an ordinary LGA1366 board and CPU.
  That line should have said LGA1156
  
  --
  note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
Ok, but look at this... by ratboy666 · 2010-08-10 16:52 · Score: 1

Yes, we pay attention...
The concept is in Unix, including Linux, and probably in Windows - COW (copy-on-write) pages...
fork() uses COW, vfork() shares the entire address space (but suspends the parent).
$ man vfork
[snip]
Historic Description
Under Linux, fork(2) is implemented using copy-on-write pages, so the
only penalty incurred by fork(2) is the time and memory required to
duplicate the parent's page tables, and to create a unique task struc-
ture for the child. However, in the bad old days a fork(2) would
require making a complete copy of the caller's data space, often need-
lessly, since usually immediately afterwards an exec(3) is done. Thus,
for greater efficiency, BSD introduced the vfork() system call, which
did not fully copy the address space of the parent process, but bor-
rowed the parent's memory and thread of control until a call to
execve(2) or an exit occurred. The parent process was suspended while
the child was using its resources. The use of vfork() was tricky: for
example, not modifying data in the parent process depended on knowing
which variables are held in a register.
[snip]
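A small way to see the fork() half of this for yourself. This sketch only shows the semantics (the child's write is not visible to the parent); it does not measure how many pages were actually copied, and CPython touches plenty of pages on its own. The function name is made up, and it needs a POSIX system since it calls os.fork().

```python
import os

def fork_and_modify():
    """Fork, let the child overwrite a buffer, and report both views.

    Returns (parent_view, child_view) as bytes. POSIX only.
    """
    buffer = bytearray(b"original")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands on the child's private copy of the page.
        os.close(read_fd)
        buffer[:] = b"modified"
        os.write(write_fd, bytes(buffer))
        os.close(write_fd)
        os._exit(0)
    # Parent: read what the child saw, then reap it.
    os.close(write_fd)
    child_view = os.read(read_fd, 64)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(buffer), child_view

parent_view, child_view = fork_and_modify()
print(parent_view, child_view)
```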

--
Just another "Cubible(sic) Joe" 2 17 3061
1. Re:Ok, but look at this... by sjames · 2010-08-10 20:02 · Score: 1
  
  Parent is talking about the kernel itself sharing pages across instances, not userspace processes running under a single instance.
VMware ESX does this (yeah it's not free) by Anonymous Coward · 2010-08-10 17:10 · Score: 0

This guy did some crazy stuff with it - 64GB or so of fake memory on an 8GB box http://vinf.net/2010/02/25/8-node-esxi-cluster-running-60-virtual-machines-all-running-from-a-single-500gbp-physical-server/
Limitations by sjames · 2010-08-10 17:18 · Score: 1

This may seem obvious, but in reading some of the trade press and the general buzz, it seems that it isn't obvious to everyone:
Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as when each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one service per instance and have fewer instances.
I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).
1. Re:Limitations by drsmithy · 2010-08-10 19:26 · Score: 1
  
  Oversubscription only works when the individual VMs aren't doing much. If you have a pile of VMs oversubscribed to the degree TFA is talking about, it means the VM overhead is exceeding the useful computation. There are cases where that can't be helped, such as each VM is a different customer, but in an enterprise environment, it suggests that you should be running more than one service per instance and have less instances.
  No, you ideally want as few services per instance as possible, to reduce dependencies and simplify architectures.
  A dozen small VMs running a single service each, is generally easier to look after than a single server running a dozen different services, especially if your environment involves customers and/or services with differing availability requirements.
  I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).
  There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.
2. Re:Limitations by sjames · 2010-08-10 19:51 · Score: 1
  
  No, you ideally want as few services per instance as possible, to reduce dependencies and simplify architectures.
  If the services can be separated onto 2 VMs, they are necessarily orthogonal. If the availability requirements differ, they should certainly NOT be running as VMs on the same machine.
  As for the case of different customers, that would fall under the exception where it can't be helped.
  
  There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.
  Name one!
3. Re:Limitations by jon3k · 2010-08-17 03:16 · Score: 1
  I swear, some in the trade rags seem to honestly think there is a benefit to splitting a server into 16 VMs and then combining those into a virtual beowulf cluster for production work (it makes perfect sense for development and testing, of course).
  There are a number of benefits to what you just described.
  
  Application compatibility issues - much easier to troubleshoot a problem when each guest is performing a single function
  Performance - you can move VMs around without rebooting using vMotion if one VM begins to consume too many resources
  Upgrades/Maintenance - you can upgrade software or even a single host OS and reboot a guest without affecting the other guests.
  In practice these are exceptionally useful features and worth the small performance hit (5-10% hvm overhead) in an enterprise environment.
4. Re:Limitations by jon3k · 2010-08-17 03:18 · Score: 1
  
  There are numerous examples where multiple clustered VMs will perform better than a single OS image, on the same hardware.
  I agree that there are numerous advantages to taking a single host performing many functions and splitting it into many VMs performing individual functions (UNIX philosophy after all: do one thing and do it well!), but I have yet to find an instance that bears out the statement you made above. What workload would perform better in that instance? If I ran one webserver on bare metal or 4 VMs each running an instance of that same webserver on the same hardware, it will perform better on bare metal (due to hypervisor overhead, if nothing else).
  
  Can you provide a link to a study that proves what you're describing?
5. Re:Limitations by sjames · 2010-08-17 05:16 · Score: 1
  
  Absolutely none of those things apply to a beowulf cluster.
  Understand, I'm not saying VMs have no place at all, they certainly do. It's just that they are not the end all and be all of computing they are cracked up to be in the trade rags. There is always a performance penalty for using a VM overall. It may be worth it but that doesn't mean it isn't there.
6. Re:Limitations by jon3k · 2010-08-17 17:30 · Score: 1
  
  Oh, you literally meant a beowulf cluster. That's a horrible idea; who does that? I'm against taking a workload on bare metal, creating multiple VMs on that same hardware and running identical workloads across all those guests, at least for performance reasons. I've seen people do things like MS Exchange clusters in a box (1 server, 2 VMs each running Exchange Enterprise) and there's at least an argument to be made there, but of course it isn't one for performance.
Oversubscription by Khyber · 2010-08-10 17:23 · Score: 1

When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers with a higher offering of hardware.
This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.
Our internet is oversubscribed, our processors are getting there, and now RAM?
When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
1. Re:Oversubscription by TooMuchToDo · 2010-08-10 17:36 · Score: 1
  
  When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?
  When people are willing to pay for it. If you shop on price, this is the natural result, the need to squeeze as much as you can out of a capital asset.
2. Re:Oversubscription by gregrah · 2010-08-10 18:00 · Score: 1
  
  Really? That was your conclusion upon reading this article??
  
  Virtual memory has been around for quite a while now, and I don't think its inventors came up with the idea with the intention to scam anyone. I'd say your outrage at "the designers of this stuff" may be misplaced.
3. Re:Oversubscription by sjames · 2010-08-10 19:58 · Score: 1
  
  Sad but true, especially when there's always someone out there ready to promise more for less and customers ready to believe the lie.
4. Re:Oversubscription by Slashcrap · 2010-08-10 20:27 · Score: 1
  
  When can we just effectively get what we pay for? This would explain the sudden jump in Intel-based Camfrog servers with a higher offering of hardware.
  This effectively means people can now lie about the hardware they're leasing out to you in a data center. They say you're getting 4GB, you're actually getting 1.5GB of RAM.
  Our internet is oversubscribed, our processors are getting there, and now RAM?
  When are the designers of this stuff going to just build the fucking hardware instead of trying to lie about it?
  Sorry about your anger issues and obvious lack of understanding about what this is.
5. Re:Oversubscription by Khyber · 2010-08-10 21:45 · Score: 1
  
  Not when this is apparently the exact same technology being used to run multiple heavy-traffic video chat servers on the same physical silicon. No wonder people on Camfrog are complaining about their servers lagging so hard, if this is the kind of thing we're paying for when we're actually expecting physical hardware.
  
  --
  Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
6. Re:Oversubscription by Khyber · 2010-08-10 21:47 · Score: 1
  
  No, I know EXACTLY what the issue is, having called my hosting provider for my video chat server. They just upgraded to this sort of management system, and my video server had been lagging horribly almost since the moment of implementation. And this would explain it - I've been moved to a shared server with overprovisioned hardware.
  Sorry you're not experienced enough with realtime applications to know when something's fucking with your system.
  
  --
  Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
7. Re:Oversubscription by TheRaven64 · 2010-08-11 01:35 · Score: 2, Insightful
  
  Nope, you really don't seem to understand what this is at all. It is eliminating duplicated pages in the system, so if two VMs have memory pages with the same contents the system only keeps one copy. To a VM, this makes absolutely no difference - the pages are copy-on-write, and when neither VM modifies them they both can see the same one without any interference (as is common with mapped process images, kernel stuff, and so on). The only thing that will change is that there will be reduced cache contention (as all VMs will be using the same copy of the page, rather than evicting each other's copy to get their own (identical) one into the data cache).
  And if you're running realtime applications in a VM, then this isn't the only thing that you don't understand.
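The page sharing described above can be sketched as a toy model. This is an illustration under stated assumptions, not how any particular hypervisor implements it: the class SharedPageStore and its methods are hypothetical, pages are matched by a SHA-256 hash of their contents, and a "write" simply remaps the writer to a private frame while the other VMs keep the old one.

```python
import hashlib


class SharedPageStore:
    """Toy model: identical pages are stored once; a write gets its own copy."""

    def __init__(self):
        self.frames = {}    # content hash -> [page bytes, reference count]
        self.mappings = {}  # (vm name, page number) -> content hash

    @staticmethod
    def _key(content):
        return hashlib.sha256(content).hexdigest()

    def _release(self, key):
        # Drop one reference; free the frame when nobody maps it any more.
        self.frames[key][1] -= 1
        if self.frames[key][1] == 0:
            del self.frames[key]

    def map_page(self, vm, page_no, content):
        old_key = self.mappings.get((vm, page_no))
        if old_key is not None:
            self._release(old_key)
        key = self._key(content)
        if key in self.frames:
            self.frames[key][1] += 1        # share the existing frame
        else:
            self.frames[key] = [content, 1]  # first copy of this content
        self.mappings[(vm, page_no)] = key

    def write(self, vm, page_no, content):
        # Copy-on-write: only the writer is remapped; sharers are untouched.
        self.map_page(vm, page_no, content)

    def read(self, vm, page_no):
        return self.frames[self.mappings[(vm, page_no)]][0]

    def frames_in_use(self):
        return len(self.frames)


if __name__ == "__main__":
    store = SharedPageStore()
    kernel_page = b"\x90" * 4096
    for vm in ("vm1", "vm2", "vm3"):
        store.map_page(vm, 0, kernel_page)
    print(store.frames_in_use())   # three VMs share one frame
    store.write("vm2", 0, b"\x00" * 4096)
    print(store.frames_in_use())   # vm2 now has a private frame
```

In this model the saving only exists while the contents stay identical; once every VM has written to its copy, the store holds one frame per VM again.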
  
  --
  I am TheRaven on Soylent News
8. Re:Oversubscription by Khyber · 2010-08-11 07:36 · Score: 1
  
  You still are in the wrong direction, Arctic cold.
  Let's take a machine with 4 VMs and only 2GB RAM, running Camfrog video servers.
  Only about 20MB will be used for the same program - everything else is a constant video stream and thus can't be swapped out to a disk cache.
  With Camfrog, anything higher than 300 video streams will max out that 2GB RAM.
  4 VMs all streaming that many video streams with only 2GB of physical memory WILL NOT WORK.
  Run a bunch of realtime programs as intensive as a real time thousand+ video stream server. You will notice overprovisioning of your hardware or bandwidth RAPIDLY.
  
  --
  Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
What's new? by voxner · 2010-08-10 17:56 · Score: 1

I had recently started poking around the lguest hypervisor. From my limited reading I believe 2 of the 3 memory subscription choices mentioned in the article are present in Linux. Existing Linux-based open source hypervisors like kvm etc. use the paging/swap mechanism (i.e. for x86 - the paravirt mechanism). Ballooning is possible using virtio_balloon. Kernel shared memory in Linux allows dynamic sharing of memory pages between processes - this probably doesn't apply to virtualization.
I couldn't find any CPU over-subscription thing in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.
On another note, established players like IBM tend to use Type-1 hypervisors (link) for enterprise servers, so it would be interesting to see how this company fares against them.
1. Re:What's new? by Slashcrap · 2010-08-10 20:32 · Score: 1
  
  I couldn't find any CPU over-subscription thing in open-source hypervisors. It seems to be the only area where open-source hypervisors are lacking.
  
  Didn't look too hard, did you?
2. Re:What's new? by DrPizza · 2010-08-11 02:47 · Score: 1
  
  If your load average is >1, you have CPU over-subscription....
3. Re:What's new? by idiot900 · 2010-08-11 04:04 · Score: 1
  
  Actually, if the load average is greater than the number of cores, you have oversubscription.
4. Re:What's new? by jon3k · 2010-08-17 03:30 · Score: 1
  
  Uh, no. Load average can also refer to outstanding I/O (think disk, network, etc). CPU utilization could be 2% but you could have a load average of 500.
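For anyone who wants to check the numbers being argued about here, a small sketch using Python's os.getloadavg() and os.cpu_count() (both standard library, Unix-only for the former). The helper names are made up. Keep the caveat from the post above in mind: on Linux the load average also counts tasks in uninterruptible sleep, which usually means waiting on I/O, so a value above 1.0 per core is a hint rather than proof of CPU oversubscription.

```python
import os


def load_per_core(load_average, cores):
    """Normalise a load average by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def current_load_per_core():
    """One-minute load average of this machine, divided by its core count."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


if __name__ == "__main__":
    print(f"load per core: {current_load_per_core():.2f}")
```

Comparing the result against CPU utilisation from a tool such as top or vmstat is one way to tell a busy CPU from a box that is mostly waiting on disk or network.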
Why stop at 16? by Anonymous Coward · 2010-08-10 18:00 · Score: 0

Why not 24 or 32? Seriously, why should 5 GB real memory be capable of only supporting 16 VMs? What's the limit?
1. Re:Why stop at 16? by gregrah · 2010-08-10 18:44 · Score: 1
  
  It depends. With a suitably small linux build (say, the firmware running on my router) you could probably go much, much higher than 16 VMs.
  
  At some point though after you've started up enough VMs, the probability that a given virtual memory location necessary to continue processing is actually residing in your RAM effectively drops to 0, and you spend all of your time waiting on disk IO. At that point your system is effectively hosed.
2. Re:Why stop at 16? by drsmithy · 2010-08-10 19:36 · Score: 1
  
  What's the limit?
  Roughly: average VM working set * number of VMs has to fit in physical RAM, less a few hundred MB for the hypervisor and overheads.
  Generally speaking, IME, you don't even need to _begin_ worrying until your RAM is oversubscribed above 2:1 (obviously YMMV depending on what your VMs are doing).
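A back-of-the-envelope version of that rule, as a sketch. It assumes the limit is reached when the combined working sets fill physical RAM minus the hypervisor overhead; the function name and the 300 MB figures below are made-up illustration values, not measurements from TFA.

```python
def max_vm_count(physical_mb, overhead_mb, avg_working_set_mb):
    """How many VMs fit before their working sets no longer fit in RAM."""
    usable_mb = physical_mb - overhead_mb
    if usable_mb <= 0 or avg_working_set_mb <= 0:
        raise ValueError("need positive usable RAM and working set size")
    return usable_mb // avg_working_set_mb


if __name__ == "__main__":
    # 5 GB host, 300 MB assumed overhead, 300 MB assumed working set per VM
    print(max_vm_count(5120, 300, 300))
    # Same host if every VM really touched its full 1 GB
    print(max_vm_count(5120, 300, 1024))
```

The point of the arithmetic is that the configured size of a VM does not appear in it at all; only the memory each VM actively uses does.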
How is this interesting ? by drsmithy · 2010-08-10 19:13 · Score: 1

VMware has allowed RAM oversubscription for years. Indeed, it's one of the killer features of that platform over the alternatives. Who out there using VMware in non-trivial environments _isn't_ oversubscribing RAM ?
1. Re:How is this interesting ? by swb · 2010-08-11 03:27 · Score: 1
  
  We usually advise against it if possible, but some of that is consulting CYA; when clients are new to virtualization they are often very sensitive to perceived performance differences between physical and virtual systems. A new virtual environment where someone decided they wanted 8 Windows machines with 8 GB RAM running in 32 GB physical RAM usually gets too far oversubscribed, swaps hard (on a SAN) and the customer complains mightily.
  Usually we find that a little tuning of VMs makes sense, since you don't have to robotically give every x32 system 4 GB RAM or every x64 system 8 or 16 GB RAM. "Detuning" the RAM from individual VMs is almost always possible and allows you to keep your VMs RAM sum running within the total physical RAM and avoid the possibility of swapping.
  In many ways it's less of an issue than it was, say, a few years ago, too. The CPUs have gotten so powerful that it actually makes sense to buy less CPU per node but buy more nodes (and hence more RAM). The bonus generally being overall more RAM, generally better performance (since I/O is distributed) and greater HA capacity.
  Sales even tells me lately that it's cheaper to buy two nodes x 32 GB than a single node x 64 GB of RAM.
2. Re:How is this interesting ? by drsmithy · 2010-08-11 09:15 · Score: 1
  
  We usually advise against it if possible, but some of that is consulting CYA; when clients are new to virtualization they are often very sensitive to perceived performance differences between physical and virtual systems. A new virtual environment where someone decided they wanted 8 Windows machines with 8 GB RAM running in 32 GB physical RAM usually gets too far oversubscribed, swaps hard (on a SAN) and the customer complains mightily.
  Well, that (swapping) should only happen if the VMs really do need and use all 8GB of RAM at the same time - in which case it's the host that is improperly sized, not the VMs.
  The point and benefit of being able to oversubscribe RAM is that your VMs won't all use all of their RAM at the same time, thus allowing you to make better utilisation of your hardware. Probably the second most important benefit is to circumvent having to convince people that their Domain Controller doesn't really need 2GB of RAM to run effectively, even if it's currently on a physical box with that much (or something similar).
  I must admit I'm kind of surprised whenever I see (non-trivial) VM environments that don't oversubscribe RAM (assuming the Hypervisor is capable). It's kind of defeating one of the main purposes for virtualising.
  Usually we find that a little tuning of VMs makes sense, since you don't have to robotically give every x32 system 4 GB RAM or every x64 system 8 or 16 GB RAM. "Detuning" the RAM from individual VMs is almost always possible and allows you to keep your VMs RAM sum running within the total physical RAM and avoid the possibility of swapping.
  The problem there is you can end up spending a LOT of time babysitting VMs, trying to find their optimal RAM allocation.
  Which is not to say that you should just go out and give VMs $LOTS_OF_RAM just because you can, as you note, but more that you don't need to do the VM babysitting I mention earlier and, more important, your VMs are more capable of handling varying workloads throughout the day.
  In many ways it's less of an issue than it was, say, a few years ago, too. The CPUs have gotten so powerful that it actually makes sense to buy less CPU per node but buy more nodes (and hence more RAM). The bonus generally being overall more RAM, generally better performance (since I/O is distributed) and greater HA capacity.
  The problem then becomes that more nodes == more licensing costs. You are right that RAM is generally the first bottleneck, however, which is why you should get hosts with as much RAM as possible.
No big deal by cyball · 2010-08-10 20:50 · Score: 1

Really, this amount of overcommit is nothing. It's been done for decades.
I manage a little over 200 virtual servers, spread across 7 z/VM hypervisors, and 2 mainframes. They are currently running with overcommit ratios of 4.59:1, 3.87:1, 3.56:1, 2.05:1, 1.19:1, 1.19:1, and .9:1. And this is a relatively small shop and somewhat low overcommits for the environment.
That's one of the benefits of virtualization...and yes, I know that if all guests decided to allocate all of their memory at once, we'd drive the hypervisor paging subsystem up the wall. Actually, this did happen a few months ago and while everything was dog slow for a while, z/VM happily paged along without issue.
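For reference, the overcommit ratios quoted above are presumably the sum of the memory defined for all guests divided by the real memory of the hypervisor; that definition is an assumption here, as is the function name. A sketch, using the 16 one-gigabyte VMs on a 5 GB host from the story as input:

```python
def overcommit_ratio(guest_memory_mb, physical_mb):
    """Total memory promised to guests divided by real memory on the host."""
    if physical_mb <= 0:
        raise ValueError("physical memory must be positive")
    return sum(guest_memory_mb) / physical_mb


if __name__ == "__main__":
    guests = [1024] * 16  # 16 VMs configured with 1 GB each
    print(f"{overcommit_ratio(guests, 5120):.2f}:1")
```

By that measure the demo in the story sits in the same range as the middle of the z/VM figures listed above.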
VMs by Anonymous Coward · 2010-08-10 23:05 · Score: 0

Newsflash from 2 years in the future: "Massive performance improvements in the VM sector by combining VMs into a single... let's call it... kernel. We present: chroot()!"
What's the fucking point of VMs if you introduce all the security problems over and over again? Might as well leave out the superfluous garbage complexity and improve performance.
2 years so updates are way behind? by Joe+The+Dragon · 2010-08-11 01:37 · Score: 1

2 years so updates are way behind? Not all of them are no reboot ones.
I can't wait by hilltop+coder · 2010-08-11 02:46 · Score: 1

..for the day when my ISP wants to sell me a more expensive class of memory because they oversold their physical memory so much that they can't support users that actually use all of what they were sold.
Why is this an issue? by Anonymous Coward · 2010-08-11 03:11 · Score: 0

Why is this even an issue? Modern OSes are designed to use the maximum amount of RAM available. Unused RAM is wasted RAM, say both M$ and penguin-boi, so apparently there's no need to oversubscribe since we apparently have too much already.
ESXi is free, though by the_doctor_23 · 2010-08-11 03:14 · Score: 2, Informative

True, ESX is not free, but ESXi (http://www.vmware.com/products/vsphere-hypervisor/index.html) certainly is. If you have just one box and thus do not need stuff like HA or Vmotion, ESXi works just fine.

--
"Extraordinary claims require extraordinary evidence" - Carl Sagan
Windows paging by Anonymous Coward · 2010-08-11 03:32 · Score: 0

I've yet to encounter a sane system that does swapping well. Windows especially is notorious for using the swapfile even though memory is not filled yet. Why do I want disk writes/reads before my memory is used up? I don't, and since I'm well below my 4GB of memory, I erase the swapfile instead and the system runs that much more smoothly.
Even desktop Linux (both Gnome and KDE) eats up memory and starts swapping needlessly in my experience, but the bloat is a bigger fault in Linux than in Windows desktop, and perhaps not the paging mechanism itself.
Whatever works. I just can't stand a slow system just because the system "finds out on its own" that it needs to page, when I clearly see it doesn't. Often paged memory comes back to haunt me in the form of thrashing or other slowdowns. Paging is often a fine "optimization" for a server, but is a horrible solution for a desktop, which should be more responsive.
1. Re:Windows paging by GooberToo · 2010-08-11 04:17 · Score: 1
  
  I've yet to encounter a sane system that does swapping well. Especially Windows is notorious of using the swapfile, even though memory is not filled yet. Why do I want diskwrites/reads before my memory is used up? I don't, and since I'm well below my 4GB of memory, I erase the swapfile instead and the system runs that much more smoothly.
  That's a good question and actually has a good answer.
  Windows implements paging using a completely different approach than Linux and most Unixes do. Basically, everything gets paged, which is to say, back-stored to a page file. When memory pressure requires paging, they need only release the page in memory because it's already paged to disk, presumably at a time when disk I/O was not at a premium. Once the page is released it can immediately be made available to the application requiring it. In doing so they have avoided an entire write cycle when I/O latency and/or bandwidth is at a premium, and saved the corresponding latency in the paging activity.
  This is why Windows uses lots of page space and why they recommend that well over physical memory be allotted to page files.
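The trade-off described in this post can be sketched as a toy model. It illustrates the commenter's description only and is not a statement about how the Windows memory manager actually works; the class PreCleaningPager and its counters are hypothetical. Pages that were written to the page file while the disk was idle can be dropped for free under memory pressure, while pages that are still dirty cost a write at the worst possible time.

```python
class PreCleaningPager:
    """Toy model of writing pages to the page file ahead of eviction."""

    def __init__(self):
        self.dirty = set()  # resident pages not yet copied to the page file
        self.clean = set()  # resident pages that already have a copy on disk
        self.writes_when_idle = 0
        self.writes_under_pressure = 0

    def touch(self, page):
        # A write to a page makes any on-disk copy stale.
        self.clean.discard(page)
        self.dirty.add(page)

    def idle_flush(self):
        # Background pass: copy dirty pages out while the disk is quiet.
        self.writes_when_idle += len(self.dirty)
        self.clean |= self.dirty
        self.dirty.clear()

    def evict(self, page):
        # Under memory pressure: clean pages are just released,
        # dirty pages must be written out first.
        if page in self.dirty:
            self.writes_under_pressure += 1
            self.dirty.discard(page)
        else:
            self.clean.discard(page)


if __name__ == "__main__":
    pager = PreCleaningPager()
    pager.touch(1)
    pager.touch(2)
    pager.idle_flush()
    pager.evict(1)   # already on disk, no write needed now
    pager.touch(3)
    pager.evict(3)   # still dirty, costs a write under pressure
    print(pager.writes_when_idle, pager.writes_under_pressure)
```

The model also shows the cost: every page that gets pre-written takes up page file space whether or not it is ever evicted, which fits the large page file recommendation mentioned above.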
2. Re:Windows paging by sjames · 2010-08-11 04:59 · Score: 1
  
  Older windows (Possibly newer as well, but I don't know for a fact) had a TERRIBLE vm.
  On a decent VM, if a program has a page allocated, but rarely uses it (perhaps not until the program is ready to exit), would you rather it be pinned in memory or would you like to page it out (cost 4KB of I/O) and have one more disk buffer available?
  If it's actually thrashing, you need more RAM. If it's just speculatively paging something out (which isn't done when there is I/O demand), it isn't impacting performance. At least that's the way it works in Linux. YMMV for Windows.