The New Garbage Man
We've all heard of "garbage collection" in reference to cleaning up after yourself after a memory allocation. Two graduate students at the Illinois Institute of Technology are proposing a new method of garbage collection and memory allocation called DMM. What they are trying to do is "map basic software algorithms such as malloc(), free(), and realloc() into hardware." Sounds as if it has some promise, but can it ever be as flexible as some of the memory managers used today?
After working on a project that was rewriting a heap allocator (I just wrote the test program, but I did read up on memory allocators quite a bit), I realized exactly how complicated memory allocators can be and more importantly, how little the community knows about exactly what is the optimal way to do memory allocation so that both speed and fragmentation are mimimized. Given this uncertanty it is probably best to keep the allocator in software so that it can be modified.
The whole idea behind RISC is that the hardware provide FAST primitives that the upper levels can use to create more complex structures. Putting malloc free etc into hardware would just make things more complicated and maybe faster in the VERY short run but would surely not be as good as just making a DAMN fast memory subsytem or an awesome processor.
Thoughts on tech, Software Engineering, and stuff
How is this different from what Lisp machines (Symbolics and LMI) did 15-20 years ago?
FrontPage's lack of proper software garbage collection is the perfect environment to test hardware garbage collection in.
Wouldn't hardware dedicated to collecting software garbage tend to delete FrontPage entirely?
They are going to insert malloc/free into hardware, that doesn't mean they will build garbage collection into the hardware (which is a completely different thing). However, as the project description says, it could improve garbage collection performance, but that is another issue.
DMM (Dynamic Memory Management) is a well known term, not something new as the slashdot article tries to make it sound like.
Building DMM into hardware it not a new idea. However, I'm not really sure if it has been successfully implemented in any "real" CPUs.
People would love to write programs for such systems using a high-level language like Java or C++. The main issue is that real-time systems must be able to deterministically predict how long any operation will take. Problem is, with current implementations of runtimes, the garbage collector could kick in at any time and take some arbitrary amount of time to finish, totally messing things up.
Lets say you've got a software-controlled robot arm, putting together a car or something (cheesy example). A start command is executed, and the arm begins moving. Then, right before the stop command was supposed to be executed, in comes your garbage collector. It does its thing and finishes, and the stop command executes - 50 milliseconds too late, putting the arm out of position. Oops - I hope you weren't planning on selling that car.
But if you can deterministically predict when and how long all of the operations will take, voila, you can just take them into account when scheduling what happens and everything is all hunky-dory. The hardware part of it is a bit of a distraction I think - the deterministic part is what's cool.
Sigh. Have you ever written a garbage collector? Think about it- the people who cared enough about garbage collection to implement it in Lisp, Java, Python, etc. did. I've written one too, and I promise, you MUST have an intimate knowledge of memory to get one working. It's not incompetent programmers who write these things.
So why did those non-incompetent programmers who know all about managing memory on their own bother with writing these tools for the weak-minded? Mainly, it's that good programmers understand the principles of program design. One important design principle is that when you want to know what your program is doing at the top level, you shouldn't have to care what is really going on at the lower levels. That principle is the reason we have operating systems and compilers, for example. All of those things allow the programmer not to have to worry about what goes on under the hood, not because the programmer is stupid, but because the programmer's time is better spent solving the problem than dealing over and over again with low-level details that have nothing to do with the program at hand as opposed to any other program in the world.
Garbage-collected languages do the same thing- you trade a little speed (the idea that garbage collectors must be slower than hand allocation and deallocation actually turns out not to be true, incidentally, but in practice it often is) for the ability not to have to think about a ridiculously machine-level idea (id est, 'in which cell of memory am I going to write down this particular chunk of data?') so your code can just concisely reflect what your program actually does- quicker to write, easier to read, better all the way around. A smidgen slower, but who cares? You can certainly write programs of acceptable speed in garbage-collected languages, so it's no big deal unless you're writing something really speed critical, like tight graphics rendering loops or something, in which case us softheaded garbage-collection-needing incompetent programmers just use inline assembly. =)
-jacob
None of those machines have caught on. The reason is that generally, you seem to do better if you invest the same kinds of resources into just making your processor faster.
Simply implementing existing malloc/free algorithms in hardware is unlikely to help. Those are complex algorithms with complex data structures, and they are probably memory limited. Where hardware could help is by changing the representation of pointers themselves: adding tag bits that are invisible to other instructions (allowing pointers and non-pointers to be distinguished, as well as space for mark bits), and possibly representing pointers as base+offset or some other kind of descriptor that makes it easier for the allocator to move memory around without the C/C++ program knowing about. As a side benefit, this kind of approach could also make C/C++ programs much safer and help run languages like Java faster.
But altogether, after having seen several attempts at this, I'm not hopeful. Unless Intel or Motorola do this in a mainstream, mass market processor, it will be very expensive and not perform as well. If you want those features, you end up getting better performance at a lower price by simply buying a mainstream processor without the features and simulating what you need.
Argh! Moving stuff to hardware, when we have long struggled to get rid of esoteric hardware instructions?!? The whole idea behind RISC is to remove this kind of stupidness.
History is full of examples where this kind of hardware "features" made things faster originally, but has since become bottlenecks instead of improvements. "rep movsb" made sense on the 8086, but on a Pentium Pro it's just another thing which makes the chip more complicated, big, and power-comsuming.
And as for garbage collection, those of you who say that "GC is slow" are just plain wrong. There is a lot of research on garbage collection, and nowadays a good GC may be faster than manual malloc/freeing. It does not lock up your programs either - there are incremental GC's which you won't notice. They are even used in real-time environments today.
A major hindrance for good GC's in commonplace programs is that people usually write in C(++), and you can't have a good GC in a programming language which allows pointers. Just to give an example: I once saw a program which used a doubly-linked list. But instead of storing "next" and "prev" pointers, it only used one value - the XOR of prev and next! It was possible to traverse, since you always knew the address of either the previous or the next entry, so getting the other was just an XOR operation. But please tell me, how the h* is a GC algorithm supposed to know this? Even if you use the loosest approach, "treat every single word in memory as a possible pointer", it will fail - since there are no real pointers at all!
So the result is this: There are lots of good GC's around, but they can't be used in C - so people don't know that they exist. More or less.
GC for procedural code is optional. A good idea, IMHO- but still optional. If worse comes to worse, you can always "bubble up" responsibility for deallocating the memory to the block in which encompasses the entire lifetime of the memory. For example, if you had:
/* use mem_p */
/* does mem_p need to be allocated? */
/* use mem_p */
/* does mem_p need to be freed? */
...
...
/* we don't need to free mem_p here */
...
void * mem_p;
void foo(void) {
mem_p = malloc(somesize);
}
void bar(void) {
}
void bang(void) {
foo();
bar();
}
We can refactor this to:
void foo (void * mem_p) {
}
void bar (void * mem_p) {
}
void bang (void) {
void * mem_p = malloc(somesize);
foo(mem_p);
bar(mem_p);
}
We can do this because the life time of the body of bang() completely encompasses the life time of mem_p- which neither foo() nor bar() does.
Unfortunately, OO programming makes GC signifigantly less optional. And exceptions make GC no longer an option. Consider the following C++ code:
{
someclass * ptr1 = new someclass();
someclass * ptr2 = new someclass();
}
Can you spot the bug? If the program runs out of memory trying to allocate the second class, new throws an exception. But since the exception isn't immediatly caught, the stack is simply poped, and the memory pointed to by ptr1 is leaked.
The solution, as any C++ programmer will tell you, is to use smart classes and smart pointers, which implement GC. I.e., if the language doesn't have GC, you have to add it.
There are other reasons to use GC- by using GC you can more effectively divorce the implementation of a class from it's interface. In a GC system you don't have to export information about the memory management aspects of the implementation. This is especially important if you have two different classes implementing the same interface, but with different memory management requirements.
Which is faster, GC or manual allocation, often depends upon both the program in question, and the GC algorithm implemented. There is some data indicating that copying garbage collection is the fastest in complex situations- what you lose copying the data you win back in better cache and virtual memory utilization.
There seems to be a number of misconceptions as to what this is all about or how might it be useful in practice. I will offer some opinions on some of the issues raised in other posts.
The first one, as already mentioned by a number of people, is that hardware implementation of malloc and free has nothing to do with GC. The most difficult part of GC is to determine which part of the memory is garbage (this is not as easy as it may seem) without using lots of memory (after all, garbage collectors are typically only activated when free memory blocks are running low), and for those garbage collectors running as an independent thread, avoid possible race conditions with the foreground processes. Other issues a garbage collector faces include how to reduce the impact on overall system performance, and how to decrease memory fragmentation.
A garbage collector is not something easy to design and implement. Making a good garbage collector especially requires the almost-impossible combination of profound understanding in the memory usage behavior of typical programs, logical reasoning abilities, and coding techniques. In addition to the garbage collector itself, you also need support from the language and compiler side, and you have to integrate all of these into a clean design. That is about as hard as things like this can be.
(Of course, you can also write a conservative pointer finding GC like the libgc used by many functional language compilers --- but that is far from the state of art in this business.)
The proposed hardware support has nothing to do with the points we mentioned above, therefore has nothing to do with GC. Then, is it possible to build some hardware support for garbage collection? Maybe, but I am not an expert in this field. Whatever the solution turns out to be, it will never be as simple as hardware implementation of malloc and free.
Second, this also has nothing to do with real time systems. Many people seems to think that being "real time" means you have to code everything in assembly language and make things run as fast as possible, but that is simply not true. Being (hard) real time means operations must respond and complete within bounded time; as long as that bound is satisfied, there is no need for the task be completed "very fast". The trick of building real time systems is in making sure that nothing will delay the operation for a long time, and that requires careful analysis and implementation.
If you remain to be convinced, think of a typical hard real time application: CD-R burning control programs. They must feed a continuous stream of data (say, 600KB/s for 4X burning) into the burner, or the disk will turn into a coaster. Is it necessary to code it in assembly? Absolutely not, because pumping data at 600KB/s is very little work for current architectures with 600MHz processors and 100MHz memory busses. However, does that mean you do not need to pay special attention to make it real time? Wrong, because although the hardware is several orders of magnitudes more powerful than it needs to be, there are still possibilities that something will make the program stop working for a second or two and screw the party. It is the responsibility of real time system designers to make sure that it does not happen.
You fail to grasp that fast memory will always be more expensive than slow memory. That should be obvious, if for no other reason than fast memory needs six transistors where dram needs one.
Here's an example. Say fast memory is 10x the price of slow memory and 10x the speed. Also, assume that a cache one tenth the size of the main ram has a miss rate of 0.5%. Given these specs, you can have either:
N megs of fast memory with an average access time of Q.
5N megs of dram with N/2 megs of cache with an average access time of 1.045Q.
That's five times the capacity for a 4.5% speed hit. In the real world the situation is even more favorable for a memory heirarchy.
BTW, nobody just uses ram anymore, everyone has swap space too. Why don't you argue for disks that run at core speed?
Ryan Salsbury