Ask Slashdot: How Do Software MMUs Work?
Rob_D_Clark
asks:
"How does a program (like VMware) implement
memory management on top of Linux (or Unix
in general)? For example: in VMware, the
guest OS is going to expect to have a 32-bit
address space, into which the memory you
allocate to the guest OS is mapped. Also,
the guest OS is going to expect hardware
registers for different devices, etc., to
be mapped in at certain addresses. How
does a program trap reads/writes to these
addresses and deal with them appropriately?"
It's been >5 years since I last studied this, but I thought that one could set a flag that allows even debug registers to be "protected", in effect causing a fault upon *any* access attempt by ring 3 ("user") code.
My current favorite example comes from Linux: The kernel allows user processes to read the current value of the CPU clock counter (using the instruction "rdtsc", or "read time stamp counter"). That instruction can be made to cause a fault by an appropriate flag setting.
I would expect Intel to be fairly good at VM technology after hearing some of the complaints about the '386. (The obvious one is the lack of ring 0 write-protect page faults.)
I'd dig around in deja-news if I were you.
No, it's simpler than that. Read the linux-kernel archives and see how the UltraSparc guys discussed working around the bugs in the UltraSparc CPUs:
(1) you mark all the pages that you want to trap instructions in as non-executable
(2) when code attempts to execute in one of those pages, you get a fault
(3) you trap the fault, and then (and only then) scan the page and modify instructions as necessary
(4) you then mark the page executable and not writable, and let it run
(5) if the page is modified, you then clear the executable bit, because you may have to re-scan it.
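The five steps above amount to a small per-page state machine. Here is a minimal Python sketch of just that bookkeeping, not of any real kernel interface: the `Page` and `ScanOnExecute` classes, the `patch` callback, and the one-byte toy opcodes are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy model of one code page; the flags stand in for page-table bits."""
    data: bytearray
    executable: bool = False   # step 1: pages start out non-executable
    writable: bool = True
    scans: int = 0             # how many times this page has been scanned


class ScanOnExecute:
    def __init__(self, patch):
        self.patch = patch     # hypothetical callback that rewrites a page in place

    def execute(self, page):
        """Model an instruction fetch from the page."""
        if not page.executable:        # step 2: the fetch faults
            self.patch(page.data)      # step 3: scan and patch, and only now
            page.scans += 1
            page.executable = True     # step 4: executable but not writable
            page.writable = False
        return bytes(page.data)        # stand-in for "let it run"

    def write(self, page, offset, value):
        """Model a store to the page."""
        if not page.writable:          # write fault on an already-scanned page
            page.executable = False    # step 5: it may need a re-scan
            page.writable = True
        page.data[offset] = value


def replace_ff_with_90(data):
    # Hypothetical patch: swap toy opcode 0xFF for toy opcode 0x90.
    for i, b in enumerate(data):
        if b == 0xFF:
            data[i] = 0x90
```

The point of the lazy scan is that a page which never executes is never scanned, and a page which executes many times without being written is scanned exactly once.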
Okay, imagine that the memory of your computer is like a vast attic, full of flies. Each of the flies is either asleep or awake, and they change state frequently. They live, work, and play in groups of eight, called "bytes". Now, when the computer gets hungry, it opens up its mouth much like a blue whale and sucks in a great big gulp of air from the attic. It filters the flies out of the air with its giant long strandy teeth and gobbles up the flies -- gobble gulp!
So.
The whale has no eyes, and in the whale's tummy there is a man without his greatcoat. That guy is called the "kernel", or "Colonel", and he looks and talks exactly like Colonel Klink on Hogan's Heroes. He has a goofy, bumbling sidekick named Sergeant Shultz, otherwise known as the "Memory Management Unit". What Sgt. Shultz does is, well . . . okay. Let's start over. Colonel Klink is in charge of sorting through these flies and putting them together in the right order before the whale (the computer, remember) digests them. This way, the whale won't get a tummyache and feel funny. Col. Klink has to decide which flies to send when, but he needs to have them organized in the right way so he knows which flies are which. If two batches of flies crash into each other, the computer will get very frowny and sad. Col. Klink doesn't like that, because when that happens the General comes and yells at him in German, and Col. Klink doesn't speak German, he just speaks English with a funny accent. So Sgt. Shultz has the very important job of ensuring that the flies don't get mixed up before Col. Klink gets to look at them.
In the arrangement that you're talking about above, things are more complex, because Col. Klink and Sgt. Shultz have to coexist with Col. Hogan and Richard Dawson, who are doing the same thing at the same time. (A little imagination will suffice to guess which OS is which). Hilarity ensues! But everything runs smoothly again at the end of the episode.
Hope this helps.
The general consensus in comp.arch is that vmware is doing some dynamic recompilation, but is otherwise allowing the hosted operating system to execute natively, and thus use the hardware mmu for the majority of the work.
As has already been mentioned, the IA32 instruction set architecture (ISA) is not completely self-virtualizable, i.e. you can't trap accesses to all cpu state information. But, you can scan through the text of your process and search for those specific opcodes that are not virtualizable. Substitute a call to your own handler for those opcodes and voila! we are now effectively fully virtualizable and the performance hit is minimal, especially if you can save your changes so that you don't have to scan and recompile each page of text more than once. And once you are fully virtualized, as long as you properly trap the right operations and do the right thing, you can let the hardware do 99% of the work for you.
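To make the scan-and-substitute idea concrete, here is a rough Python sketch on a toy instruction set in which every instruction is exactly one byte. The opcode values, the `Rewriter` class, and its side table are assumptions for illustration only; real IA32 instructions are variable-length, so a real scanner has to decode instructions rather than match raw bytes.

```python
TRAP = 0xCC                  # toy "call the monitor" opcode (assumption)
SENSITIVE = {0xA0, 0xA1}     # toy opcodes that would not trap on their own


class Rewriter:
    def __init__(self):
        self.originals = {}  # (page_no, offset) -> opcode that was replaced
        self.done = set()    # pages that have already been scanned
        self.scan_count = 0

    def prepare(self, page_no, text):
        """Patch a page of toy code in place, at most once per page."""
        if page_no in self.done:
            return           # saved work: no second scan of the same page
        self.scan_count += 1
        for offset, opcode in enumerate(text):
            if opcode in SENSITIVE:
                self.originals[(page_no, offset)] = opcode
                text[offset] = TRAP
        self.done.add(page_no)

    def on_trap(self, page_no, offset):
        """Return the original opcode the monitor must emulate at this site."""
        return self.originals[(page_no, offset)]
```

Everything that is not in `SENSITIVE` is left alone and would run natively, which is where the "let the hardware do the work" part comes from.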
Clearly vmware does more than this with its various virtualized devices, but fundamentally this is probably what is going on.
Wow, I'm glad Rob Clark thought to ask this on Ask Slashdot. I was wondering this myself.
Although, I would like to add a rider to his question:
With Intel processors, some hardware registers can't be trapped. For example, any privilege level can read DR7 to find out if a debugger is resident. Writes to this can obviously be trapped, but AFAIK there is no way to get the processor to trap on reads.
I am sure there are other examples like this, as well. This seems to indicate that it is impossible to virtualize every aspect of the machine.
(Although, I suppose you could put the processor into single-step mode, and look at each instruction before it executes, looking for these types of instructions, but that would slow things WAAAYYYY down.)
--synaptik
HSJ$$*&#^!#+++ATH0
NO CARRIER
What I am really trying to figure out is how to trap writes/reads to certain addresses without having to interpret the machine instructions... the best hack I have thought of is to trap SIGSEGV, and have your signal handler try to figure out what was going on.
For example, you can mark pages of memory as not being readable (the PROT_NONE flag for mmap). This will cause a SIGSEGV if the program tries to read or write that address.
Another idea I just thought of as I was writing this post... you could use a kernel module to create a /proc file, then have your VM mmap that file to use as its memory. Then the module could deal with simulating the memory space. (This is assuming you can prevent the mmap'd file from being cached.) This would be even better if you could bind a user-space program to a file. (I vaguely remember reading that GNU Hurd has the ability to do this.)
--Rob
It may do the work, and does, at least for VMware. There is a thing called single-step on the 80x86, which lets you single-step a program, with an interrupt after every instruction, so you can easily look ahead and see if the next one would compromise your emulation... This functionality is used by VMware, while they say it is not available to debuggers running inside a guest OS...
--The knowledge that you are an idiot, is what distinguishes you from one.
It all has to do with virtual memory (not the common misuse of the term to mean swap). Basically, there is a mapping between _real_ memory addresses and the addresses programs use to access data.
In a kernel, this is done (usually) using a mix of hardware and software. If a program tries to access a piece of memory, the hardware looks at the Translation Lookaside Buffer (TLB) to translate the address. If the address exists in the buffer, it does the translation and all is good. If it does not exist, a trap is raised to the kernel. It is the kernel's responsibility to look at the virtual memory tables, allocate the memory, copy it if it was copy-on-write, and most importantly update the TLB so next time it does not have to set up the translation.
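The hit/miss/refill flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `SoftMMU`, `PageFault`, the dict-based page table, and the 4096-byte page size are illustrative names and values, not any real kernel's data structures.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class PageFault(Exception):
    """Raised when there is no mapping for the address at all."""


class SoftMMU:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # small cache of recent translations
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:                    # TLB miss: take the slow path
            self.misses += 1
            if vpn not in self.page_table:         # nothing mapped here
                raise PageFault(hex(vaddr))
            self.tlb[vpn] = self.page_table[vpn]   # refill so the next access hits
        return self.tlb[vpn] * PAGE_SIZE + offset
```

A second access to the same page goes straight through the `tlb` dict without touching `page_table`, which is the whole reason for updating the TLB on a miss.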
For the VM case, this is sorta conjecture. The VM can allocate a slew of memory on the host OS. (As far as the client OS is concerned, this is physical RAM.) Then it can make a TLB, and all memory accesses will go through it first. This way it can stop Windows from pissing all over OS/2 running on the VM. But Linux will stop the VM from pissing on anything else on the host OS.
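As a rough illustration of that conjecture, here is a Python sketch in which the guest's "physical RAM" is just a buffer allocated by the host process, and every guest access is checked before it touches anything. The `GuestMemory` class, the device windows, and the `Latch` device are hypothetical names invented for this sketch; it interprets every access in software, which a real monitor would presumably avoid by leaning on the hardware MMU.

```python
class GuestFault(Exception):
    """Raised when the guest touches an address that is not backed by anything."""


class GuestMemory:
    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)  # host allocation the guest sees as RAM
        self.devices = []               # (base, size, handler) register windows

    def map_device(self, base, size, handler):
        self.devices.append((base, size, handler))

    def _device_at(self, addr):
        for base, size, handler in self.devices:
            if base <= addr < base + size:
                return base, handler
        return None

    def read8(self, addr):
        hit = self._device_at(addr)
        if hit is not None:
            base, handler = hit
            return handler.read(addr - base)   # device register read
        if 0 <= addr < len(self.ram):
            return self.ram[addr]              # ordinary guest RAM
        raise GuestFault(hex(addr))            # guest cannot reach outside

    def write8(self, addr, value):
        hit = self._device_at(addr)
        if hit is not None:
            base, handler = hit
            handler.write(addr - base, value)
        elif 0 <= addr < len(self.ram):
            self.ram[addr] = value & 0xFF
        else:
            raise GuestFault(hex(addr))


class Latch:
    """Hypothetical device: every register returns the last byte written."""

    def __init__(self):
        self.last = 0

    def read(self, offset):
        return self.last

    def write(self, offset, value):
        self.last = value & 0xFF
```

Because the guest only ever sees offsets into `ram` or a device window, nothing it does can reach the rest of the host process, and the host OS in turn fences off the whole VM process from everything else.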
As for kernel traps, the user-level program's data needs to be copied over to kernel space for the kernel to access it.
I hope this begins to answer your question.
Robert