Ask Slashdot: How do Software MMUs Work?
Rob_D_Clark asks: "How does a program (like VMware) implement memory management on top of Linux (or Unix in general)? For example: in VMware, the guest OS is going to expect to have a 32-bit address space, into which the memory you allocate to the guest OS is mapped. Also, the guest OS is going to expect hardware registers for different devices, etc., to be mapped in at certain addresses. How does a program trap reads/writes to these addresses and deal with them appropriately?"
Funny, my dad told me that's how an 'internal combustion engine' works.
My dad explained "marginal utility" and "network externalities" the same way. In fact, he used the same explanation for sex.
Hmmm . . .
. . . is a parody of the humorless mental dwarves who moderate all the life out of Slashdot.
Lighten up.
You don't need a disassembler, just a pattern matcher. Scan through each text page looking for unvirtualizable binary opcodes and overwrite them with some other opcode(s) you know will cause a trap. Make sure to pick a trap that is unique enough that you can determine which opcode you replaced.
I believe the main problem is that the opcode that asks the CPU what mode (user or supervisor) it is in is not virtualizable, so you won't need to recompute things like offsets into the stack or other complicated stuff like that. It's just a binary search and replace.
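A minimal Python sketch of that search-and-replace idea, for illustration only. The opcode choices are assumptions, not something VMware is known to do: POPF (0x9D) stands in for the "unvirtualizable" opcode and INT3 (0xCC) for the trapping replacement, and `patch_page` and `original_opcode` are made-up names. The sketch also ignores instruction boundaries, so it would patch a 0x9D byte that happens to sit inside an operand.

```python
# Toy scan-and-patch pass over one page of guest code.
# Assumed for illustration: POPF is the sensitive opcode, INT3 the trap.

POPF = 0x9D   # one-byte x86 opcode; does not fault in ring 3 when CPL > IOPL
INT3 = 0xCC   # one-byte x86 breakpoint opcode; always traps


def patch_page(page: bytearray, sensitive=(POPF,), trap=INT3) -> dict:
    """Overwrite sensitive opcodes in place; return {offset: original_byte}."""
    patched = {}
    for offset, byte in enumerate(page):
        if byte in sensitive:
            patched[offset] = byte   # remember what was replaced
            page[offset] = trap
    return patched


def original_opcode(patched: dict, fault_offset: int) -> int:
    """What a trap handler would consult to learn which opcode to emulate."""
    return patched[fault_offset]


# NOP, POPF, NOP, RET
page = bytearray([0x90, 0x9D, 0x90, 0xC3])
table = patch_page(page)
```

The side table is what makes the trap "unique enough": the handler looks up the faulting offset to find out which instruction it has to emulate.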
But how do you know what to breakpoint? There are only 4 hardware breakpoints on x86, I think.
In a nutshell, this is how I would do it:
My program (VMware work-alike) would run in ring 0. ALL other programs would run in ring 3. I'd set up a separate x86 VM for each instance of a hosted operating system. My work-alike program would emulate any instructions which can't be run in ring 3. The work-alike would have to emulate a lot of the functionality of the x86.
When a program tries to use an instruction that can't be used in ring 3, a GPF will occur, which will switch control to my work-alike. My work-alike will look at the stack to get the address of the offending instruction and emulate its functionality, then return control to the program. The program would have no idea it was just interrupted and its request emulated, as long as it gets what it wants.
It'll be much more involved than the above and would require a lot of work to write, but that's basically how it could be done.
Anonymous Coder
TLB == Translation Lookaside Buffer
It's flushing the on-chip cache.
Actually, that is exactly how VMware is doing it, I believe. We've discussed VMware at length here in the office -- we're all REALLY impressed -- and we have come to the conclusion (so realise that the following is just speculation) that:
As others have pointed out, the i386 is NOT virtualisable, so you have to play some tricks, unless you want to emulate the processor (hey, it worked for Insignia). But that is too slow to get VMware's level of performance. Even Digital's assembly language translation thingie (they had a vaporware Alpha/i386 hybrid processor a few years ago that did this -- look in back issues of Byte for pointers) is too slow.
VMware's trick is to scan the object code at load time and translate unvirtualisable instruction sequences to something else -- what, I don't know, but I suspect they jump to an emulator for just that sequence. So it's just like Pure Atria's stuff, and even related to the Melting Ice tech available for rapid development for Eiffel.
hope this helps.
Johan
johan@ccs.neu.edu -- I'll log in when I get the johan uid. gimmie!
Anyone who has some time to spare and knows a great deal about how to implement this should head over to: http://www.freemware.org/ they could really use your help.
[The Freemware project started directly after VMWare was announced. It's an effort to create an open source (and possibly portable) VMWare-clone.]
I have a dual CPU system, and that second processor isn't doing all that much. Rather than use a bunch of kluges to virtualize the processor, why not give a whole processor to that OS?
This would require you to "hide" the other processor from the OS as far as SMP support goes (or to use an OS that doesn't do SMP), and to make it use "device drivers" that use inter-processor communication to funnel the actual device I/O through the hosting OS.
There would be other issues, like arranging for the hosted processor to see a BIOS that doesn't try to access devices directly and guaranteeing that the hosted processor doesn't go playing with hardware directly. Unfortunately, the solutions to those problems might leave you back at the original problem.
It's been >5 years since I last studied this, but I thought that one could set a flag that allows even debug registers to be "protected", in effect causing a fault upon *any* access attempt by ring 3 ("user") code.
My current favorite example comes from Linux: The kernel allows user processes to read the current value of the CPU clock counter (using the instruction "rdtsc", or "read time stamp counter"). That instruction can be made to cause a fault by an appropriate flag setting.
I would expect Intel to be fairly good at VM technology after hearing some of the complaints about the '386. (The obvious one is the lack of ring 0 write-protect page faults.)
I'd dig around in deja-news if I were you.
No, it's simpler than that. Read the linux-kernel archives and see how the UltraSparc guys discussed working around the bugs in the UltraSparc CPUs:
(1) you mark all the pages that you want to trap instructions in as non-executable
(2) when code attempts to execute in one of those pages, you get a fault
(3) you trap the fault, and then (and only then) scan the page and modify instructions as necessary
(4) you then mark the page executable and not writable, and let it run
(5) if the page is modified, you then clear the executable bit, because you may have to re-scan it.
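The five steps above can be sketched as a small state machine. This is a toy Python model, not kernel code: `Page`, `scan_and_patch`, and the `scans` counter are illustrative names, the scan itself is only counted, and making the page writable again on a write fault is an assumption the steps do not spell out.

```python
# Toy model of lazy scanning: pages start non-executable, are scanned on the
# first execute fault, and lose the executable bit again when written.

class Page:
    def __init__(self, code: bytes):
        self.code = bytearray(code)
        self.executable = False   # step 1: start non-executable
        self.writable = True
        self.scans = 0            # how many times the page was scanned

    def scan_and_patch(self):
        # Placeholder for the real scan (step 3); here we only count it.
        self.scans += 1

    def execute(self):
        if not self.executable:      # step 2: execute fault
            self.scan_and_patch()    # step 3: scan only now
            self.executable = True   # step 4: executable...
            self.writable = False    #         ...but not writable
        # the guest code would now run natively

    def write(self, offset: int, value: int):
        if not self.writable:        # write fault on a scanned page
            self.executable = False  # step 5: force a re-scan later
            self.writable = True     # assumption: allow the write to proceed
        self.code[offset] = value


page = Page(b"\x90\x90\xc3")
page.execute()
page.execute()        # already scanned: no second scan
page.write(0, 0x9D)   # modifying the page invalidates the scan
page.execute()        # faults again and is re-scanned
```

The point of doing it lazily is that pages which never execute are never scanned, and pages that are never modified are scanned exactly once.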
Okay, imagine that the memory of your computer is like a vast attic, full of flies. Each of the flies is either asleep or awake, and they change state frequently. They live, work, and play in groups of eight, called "bytes". Now, when the computer gets hungry, it opens up its mouth much like a blue whale and sucks in a great big gulp of air from the attic. It filters the flies out of the air with its giant long strandy teeth and gobbles up the flies -- gobble gulp!
So.
The whale has no eyes, and in the whale's tummy there is a man without his greatcoat. That guy is called the "kernel", or "Colonel", and he looks and talks exactly like Colonel Klink on Hogan's Heroes. He has a goofy, bumbling sidekick named Sergeant Shultz, otherwise known as the "Memory Management Unit". What Sgt. Shultz does is, well . . . okay. Let's start over. Colonel Klink is in charge of sorting through these flies and putting them together in the right order before the whale (the computer, remember) digests them. This way, the whale won't get a tummyache and feel funny. Col. Klink has to decide which flies to send when, but he needs to have them organized in the right way so he knows which flies are which. If two batches of flies crash into each other, the computer will get very frowny and sad. Col. Klink doesn't like that, because when that happens the General comes and yells at him in German, and Col. Klink doesn't speak German, he just speaks English with a funny accent. So Sgt. Shultz has the very important job of ensuring that the flies don't get mixed up before Col. Klink gets to look at them.
In the arrangement that you're talking about above, things are more complex, because Col. Klink and Sgt. Shultz have to coexist with Col. Hogan and Richard Dawson, who are doing the same thing at the same time. (A little imagination will suffice to guess which OS is which). Hilarity ensues! But everything runs smoothly again at the end of the episode.
Hope this helps.
The general consensus in comp.arch is that vmware is doing some dynamic recompilation, but is otherwise allowing the hosted operating system to execute natively, and thus use the hardware mmu for the majority of the work.
As has already been mentioned, the IA32 instruction set architecture (ISA) is not completely self-virtualizable, i.e. you can't trap accesses to all cpu state information. But, you can scan through the text of your process and search for those specific opcodes that are not virtualizable. Substitute a call to your own handler for those opcodes and voila! we are now effectively fully virtualizable and the performance hit is minimal, especially if you can save your changes so that you don't have to scan and recompile each page of text more than once. And once you are fully virtualized, as long as you properly trap the right operations and do the right thing, you can let the hardware do 99% of the work for you.
Clearly vmware does more than this with its various virtualized devices, but fundamentally this is probably what is going on.
Remember, i386 is little endian:
"fc ff ff ff" is really "0xFFFFFFFC"
or -4.
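A quick way to check that byte order, using only Python's standard `struct` module (the variable names are just for illustration):

```python
import struct

raw = bytes.fromhex("fc ff ff ff")          # bytes in the order they sit in memory

unsigned = struct.unpack("<I", raw)[0]      # "<" = little endian, "I" = unsigned 32-bit
signed = struct.unpack("<i", raw)[0]        # "i" = signed 32-bit

print(hex(unsigned), signed)
```

Reading the same four bytes big endian (`">I"`) would give 0xFCFFFFFF instead, which is why the byte order matters when picking offsets out of a hex dump.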
--synaptik
HSJ$$*&#^!#+++ATH0
NO CARRIER
My experience in MS Windows land is that writes to DR7 are protected, but reads are not.
In fact, the Intel documents I have specifically state that this register is readable at any privilege level, but nowhere have I seen a statement that you can MAKE it a privileged instruction.
--synaptik
HSJ$$*&#^!#+++ATH0
NO CARRIER
Wow, I'm glad Rob Clark thought to ask this on Ask Slashdot. I was wondering this myself.
Although, I would like to add a rider to his question:
With Intel processors, some hardware registers can't be trapped. For example, any privilege level can read DR7 to find out if a debugger is resident. Writes to this can obviously be trapped, but AFAIK there is no way to get the processor to trap on reads.
I am sure there are other examples like this, as well. This seems to indicate that it is impossible to virtualize every aspect of the machine.
(Although, I suppose you could put the processor into single-step mode, and look at each instruction before it executes, looking for these types of instructions, but that would slow things WAAAYYYY down.)
--synaptik
HSJ$$*&#^!#+++ATH0
NO CARRIER
Posted by Nick Carraway:
I think you mean "voila." I liked the Colonel Klink explanation better...
Rather than rewriting instructions you detect on a scan of a page of program text, you could possibly set a hardware breakpoint for the instruction. Since there are only 4 hardware breakpoints (at least on the 386), you would only be able to do this on pages containing less than 4 instructions that need to be watched.
I wrote a simple program to scan all the windows dlls and exes for "dangerous" instructions. I found that for most exes and dlls, there were less than 4 instructions per page that would be dangerous. For the remaining ones, you could rewrite the instructions. But then, you have to make the page execute only (not readable or writeable-- is this possible?) and trap any access to it by the processor, to fool it into seeing the original instruction instead of the rewritten one.
Or, you could simply do single step on that page (which might be a viable option since there would be so few of those pages in the average OS-- unless someone specifically wanted to make your VM perform badly ;-) ).
Life's a lot like money-- you spend it, then it's gone. Spend wisely.
A few comments show that a program may determine that it's in ring 3 rather than 0. It's important to remember that an OS has little legitimate reason to check for that. I wouldn't be surprised if M$ added such checks now that vmware is out, but apparently, they haven't done that in the past.
In general, it's not necessary to perfectly virtualize ring 0 instructions, it just has to be 'good enough'. In practice, determining what 'good enough' actually is can be a tough problem (which is why there aren't dozens of vmware-like products out there), but perfection is not required. Most OSes are not hostile to being virtualized, they just assume that they're not being virtualized.
The short answer is: "mmap + SIGILL + SIGSEGV". If you're curious about the details, you might want to check out the Brown Simulator, which provides a full MMU at user-level on top of Solaris.
The point of vmware is to provide the fastest possible emulation of an ia32 machine. So it wants to execute all (or nearly all) the instructions directly on the host processor, rather than having to emulate them. The clever bit is to allow it to do this without clobbering the host OS -- this is what requires lots of memory management tricks.
that VMware uses a virtual machine for each operating system. This is similar to JVM. It allows each OS to run natively on the same computer using its resources without conflicting with the others. What *I* want to know is disk partitions/file systems. Do you need a different partition for each OS? Or a different drive entirely?
Yes, you need a different partition for each OS. But it's that way anyways (for the multi-boot people) without VMWare. VMWare, however, can be set to use an existing partition with an OS in "RAW" mode, so you don't have to reinstall the OS in a VMWare space on the host OS. Of course, you DO NOT want the host OS to be able to access that partition when VMWare is running...
Quite elegantly put :)
While we're at it, here's my guess, which is based on badly blurred memories of 80486 documentation.
Memory reads and writes really aren't the difficult part. In protected mode, every process (or task) gets executed in its own 4GB (max) virtual memory space, which the processor translates into absolute memory space. The OS swaps out these task spaces to disk while they're not being used. One process should never be able to write to another process's space, which was the whole point of protected mode with the i386.
The real issues involve handling interrupts and executing protected instructions. Take for instance writing directly to hardware through IO ports. The host OS absolutely can't let the hosted OS do what it wants in this area. But the interrupt mechanisms of the x86 architecture come to the rescue here.
Run the hosted OS at some unprivileged level (not ring 0) and let the processor interrupt whenever a privileged instruction is executed. The host then examines the situation and recovers by implementing the privileged instruction in an alternative way.
Registers also won't be a problem in most cases since they are saved and restored at a task state switch. Linux shouldn't care what NT does with the registers as long as they get restored when NT gets preempted.
- dw
Wow, that reminds me of the old Altos 3068 I used on my first job out of college. It had a discrete logic MMU on a separate (big!) board. I never really thought about it, but I wonder if there was some problem with the 68[48]51 chips (the system was a 16MHz 68020).
Just junk food for thought...
What I am really trying to figure out is how to trap writes/reads to certain addresses without having to interpret the machine instructions... the best hack I have thought of is to trap SIGSEGV, and have your signal handler try to figure out what was going on.
For example you can mark pages of memory as not being readable (PROT_NONE flag for the mmap). This will cause a SIGSEGV if the program tries to read/write that address.
Another idea I just thought of as I was writing this post... you could use a kernel module to create a /proc file, then have your VM mmap that file to use as its memory. Then the module could deal with simulating memory space. (This is assuming you can prevent the mmap file from being cached.) This would be even better if you could bind a user-space program to a file. (I vaguely remember reading that GNU hurd has the ability to do this.)
--Rob
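A toy Python model of the trap-and-dispatch idea in this post, for illustration only. It does not use real mmap or SIGSEGV: the `PageFault` exception stands in for the signal, and `GuestMemory`, `FakeDevice`, and `guest_access` are made-up names. It also sidesteps the hard part raised above, since the value being written is handed to the handler directly instead of being recovered by decoding the faulting instruction.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Stands in for the SIGSEGV a real PROT_NONE page would raise."""
    def __init__(self, addr, is_write, value=None):
        super().__init__(hex(addr))
        self.addr, self.is_write, self.value = addr, is_write, value


class GuestMemory:
    def __init__(self, size, protected_pages):
        self.ram = bytearray(size)
        self.protected = set(protected_pages)   # page numbers that trap

    def _check(self, addr, is_write, value=None):
        if addr // PAGE_SIZE in self.protected:
            raise PageFault(addr, is_write, value)

    def read(self, addr):
        self._check(addr, False)
        return self.ram[addr]

    def write(self, addr, value):
        self._check(addr, True, value)
        self.ram[addr] = value


class FakeDevice:
    """Hypothetical device with registers that just log what is written."""
    def __init__(self):
        self.log = []

    def handle(self, fault):
        if fault.is_write:
            self.log.append((fault.addr, fault.value))
            return None
        return 0xFF   # arbitrary value returned for register reads


def guest_access(mem, dev, addr, value=None):
    """Try the access directly; on a fault, let the device model emulate it."""
    try:
        return mem.read(addr) if value is None else mem.write(addr, value)
    except PageFault as fault:
        return dev.handle(fault)


mem = GuestMemory(4 * PAGE_SIZE, protected_pages={3})
dev = FakeDevice()
guest_access(mem, dev, 0x10, 42)     # ordinary RAM write, no trap
guest_access(mem, dev, 0x3000, 7)    # trapped "device register" write
status = guest_access(mem, dev, 0x3004)   # trapped "device register" read
```

Ordinary pages go straight to RAM at full speed; only the pages marked as protected pay the cost of the trap, which is the appeal of the PROT_NONE approach.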
Well, I remember using an old Siemens box. It ran SINIX 1.0, had an 80186 processor and just under 1MB of RAM. It also (wonder of wonders) had an MMU implemented as a piggy-back board on the bus with an 8086 and a software ROM. There's software MMUs for you!
OK, well since it's the day of whacky metaphors, I'll try and tackle this one...
A normal linux application has to obey posix rules to interface with the outside world (memory, disk, printer, etc..). Let's take memory as the normal example. An *application* has to ask nicely for whatever resources it wants. It doesn't know much of anything about the *real* state of the machine. Like a cow in its pen, the process has no idea what the other cows are up to or even how many other cows there are or how big the ranch is. Operating systems attempt to "dial directly" to the hardware, and this is the part that VMWare must emulate. No easy job either, considering all the whacky things you can do on a PC. So if I'm an OS (or one of those old boot-me game disks like flight sim 2.0), I don't even worry about allocating memory - I just start eating it by the bucketful. BIOS loads the kernel into memory starting at 0x00, then "jmp 0x00" (goto 0x00). Kernel executes, checks how much memory it has, and starts parcelling it out to other applications - it rules the ranch. It doesn't *ask* for more memory, it just "walks the fence" to figure out how much there is.
Clearly to make the ranch-owner behave as a simple cow, while letting him run his own little rat-ranch and never letting him have a clue that he's just a cow in a pen is a pretty neat trick.
Most of the "neat trick" is done for you by the CPU, however as some of the more advanced hackers have pointed out, there are a few weak spots in this virtualized environment. So you still have some crazy stuff to do before you can fully fool the rancher. He's always asking if there's a larger world out there, and you have to keep him in the dark at all costs (or he'll die of surprise and fright). Like flatland...
Basically, this is done by brain-washing the rancher into never poking his head through that hole in the wall - which we know leads to the *real* outside world. Or what we know as the outside world - but is it really? Maybe we've already been brainwashed ourselves!!
Application = cow
OS = rancher
computer = ranch
vmware = sophisticated brainwashing for ranchers which makes them think rats are cows and keeps them from looking over the wall of the stall. Also makes them live on hay instead of beef.
-=Julian=-
Unfortunately, unlike MIPS, the x86 TLB is implemented entirely in hardware, so this won't work.
Maybe you could get some hints as to how VMWare works by watching what MS changes in their next OS release in order to break it, if they can... :-)
Choice of masters is not freedom.
It all has to do with virtual memory. (not the misnomer use as swap) Basically, there is a mapping between _real_ memory addresses and the addresses programs use to access data.
In a kernel, this is done (usually) using a mix of hardware and software. If a program tries to access a piece of memory, the hardware looks at the Translation Lookaside Buffer (TLB) to translate the address. If the address exists in the buffer, it does the translation and all is good. If it does not exist, a trap is called to the kernel. It is the kernel's responsibility to look at the virtual memory tables, allocate the memory, copy it if it was copy on write, and most importantly update the TLB so next time it does not have to set up the translation.
So in the VM case, this is sorta conjecture. The VM can allocate a slew of memory on the host OS. (As far as the client OS is concerned, this is physical RAM.) Then it can make a TLB and all memory accesses will go through it first. This way it can stop Windows from pissing all over OS/2 running on the VM. But Linux will stop the VM from pissing on anything else on the host OS.
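A toy Python model of the lookup described above, purely as a sketch of the conjecture: `SoftMMU`, its dictionary-based page table, and the `misses` counter are made-up names, and a real TLB has a fixed size and an eviction policy that this ignores.

```python
PAGE_SIZE = 4096


class SoftMMU:
    def __init__(self, page_table):
        self.page_table = dict(page_table)   # virtual page -> physical page
        self.tlb = {}                        # cache of recent translations
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.tlb:            # TLB miss: consult the tables
            self.misses += 1
            if vpage not in self.page_table:
                raise LookupError(f"page fault at {vaddr:#x}")
            self.tlb[vpage] = self.page_table[vpage]   # fill the TLB
        return self.tlb[vpage] * PAGE_SIZE + offset


mmu = SoftMMU({0: 5, 1: 2})
a = mmu.translate(0x0010)   # miss, then physical page 5 plus offset 0x10
b = mmu.translate(0x0020)   # hit: same virtual page as before
c = mmu.translate(0x1004)   # miss on a second page
```

An access to a page that is not in the table raises `LookupError`, which is the point where a real kernel (or the VM, for its guest) would allocate memory or kill the offender.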
As far as kernel traps, the user level program's data needs to be copied over to kernel space for the kernel to access it.
I hope this begins to answer your question.
I never really thought it would be all that complicated.
wouldn't linux be able to give it a fixed amount of RAM as an application and then let VMWare tell linux what to do with it? (er.. i guess i'm asking why is it different from any normal application? don't they all have protected memory?)
arghh.. my head hurts now..
--- all posts are not affiliated with my workplace. period. i dont care how good it may make them look, they are all
Robert