Slashdot Mirror


Review:Garbage Collection

A.M. Kuchling is back again, this time with a review of Richard Jones and Rafael Lins' oddly named Garbage Collection. This book is not actually about Waste Management Inc, but is "A highly detailed survey of garbage collection algorithms, and the research literature surrounding them. Interesting, if at times very dry.". Click below for more information. Garbage Collection author Richard Jones and Rafael Lins pages publisher Wiley & Sons rating 6 reviewer A.M. Kuchling ISBN summary A highly detailed survey of garbage collection algorithms, and the research literature surrounding them. Interesting, if at times very dry.

The first time you encounter garbage collection (GC, for short) is often by using Lisp or Scheme in an introductory computer science course, where it can seem an almost magical technology. How does the interpreter determine that a variable has become garbage, and can safely be deallocated? This book explains how GC works, in great detail, and interest people implementing garbage collectors or who've wondered how it's done. There are 3 approaches to GC, all of which are covered in this book:

  1. Reference counting:
    Every object keeps a count of the references pointing to it. When new references are created, the counter must be increased; when references are destroyed, the counter is decremented, and, if the count is zero, the object has become garbage and can be reclaimed. Reference counting unfortunately exacts the most overhead from constantly incrementing and decrementing the counters, and in its simplest form it can't collect cyclic data structures, because their count will never reach zero.
  2. Mark-sweep:
    Traverse the graph of object references, starting from the roots of the computation; roots might be the program's global variables, the contents of the C stack, static class variables, or a main namespace of some sort. Each object that's traversed is marked as being reachable; when the traversal is complete, any unmarked objects are unreachable because they can't be reached by the program any longer, and can be reclaimed as garbage.
  3. Copying:
    Divide memory into two halves, called "Fromspace" and "Tospace". Memory for new objects is always allocated out of Fromspace. When Fromspace fills up, a traversal similar to the one used by mark-sweep is done, but instead of just marking objects as reachable, they're copied into Tospace. Once the traversal is done, all the live objects are now in Tospace, and Fromspace no longer contains any live data; only garbage objects can be left in Fromspace. Tospace is now declared the new Fromspace, and new objects are allocated until it fills up; the algorithm is then repeated.

This book covers the above 3 algorithms for GC in great detail; there's a chapter for each algorithm, examining its limitations, common implementation techniques, and historical notes on the its development. Later chapters cover more advanced collectors that extend the basic algorithms in various ways. Generational collectors are based on the observation that many objects are temporary and have short lifetimes, while some few objects may last for the entire lifetime of the program. A generational garbage collector concentrates on young objects, because they will occupy the most recyclable space, and spends less time examining old objects that probably won't be reclaimed. Incremental and concurrent collectors don't bring the whole program to a halt while they scavenge through memory doing a collection, making them more suited for interactive or even real-time applications.

Chapter 9 is an interesting examination of the implementation of GC in an unfriendly environment, namely C programs. The Boehm-Demer-Weiser collector implements a mark-sweep garbage collector that can replace C's malloc() function, scanning stack frames and registers for potential pointers, and managing to do it fairly efficiently. One study found the Boehm-Demer-Weiser collector added 20% more execution time overhead than malloc() did -- that is, the time spent in memory allocation was 20% longer. (That doesn't mean it made programs 20% slower, because most programs spend more time computing results than they spend allocating memory). A garbage collector for C is a tricky job, complicated by the lack of type information, data values that may mimic pointers, and compiler optimizations that may render objects invisible to the collector. Barlett's Mostly Copying Garbage Collector, also described in this chapter, is more liberal and requires some assistance from the C programmer in exchange for greater accuracy.

The remaining 3 chapters consider suggestions for adding garbage collection to C++, the interaction of GC with cache memory, and distributed garbage collection. The topics of these chapters are still subjects of active research, so the final chapters are lighter on implementation details and heavy on references to research papers, making them a lot less interesting.

The authors are computer scientists specializing in GC (Richard Jones maintains a garbage collection page), and their coverage is quite complete, which at times makes for exceedingly dry reading as the authors enumerate what seems like every minor implementation variation and every relevant paper. On the other hand, if you're actually faced with implementing a garbage collector, the pointers into the research literature will be very useful. A casual reader (like me) can simply skim, or even skip, pages until the next interesting section arrives.

Excerpt

However, the matter is more subtle than this. The collectors described in this chapter are very simple and inefficient. The behaviour of these collectors cannot be automatically ascribed to their more sophisticated variants. The cost of copying an object is likely to be more expensive than simply testing and setting a mark-bit, particularly if the object is large. Although mark-sweep must sweep the entire heap, in practice its real cost is dominated by the mark phase. Linearly scanning the heap will generally be less expensive than tracing data structures even using the simple technique shown above. ... Furthermore, more sophisticated methods can substantially reduce the cost of the sweep. Although copying collectors have predominated in the past, recent studies suggest that the choice between mark-sweep and copying garbage collection may well depend as much on the behaviour of the client program as on the inherent properties of the garbage collection algorithm.
-- From section 2.4 of the book.

To purchase this book, head over to Fatbrain.

Table of Contents
  1. Introduction
  2. The Classical Algorithms
  3. Reference Counting
  4. Mark-Sweep Garbage Collection
  5. Mark-Compact Garbage Collection
  6. Copying Garbage Collection
  7. Generational Garbage Collection
  8. Incremental and Concurrent Garbage Collection
  9. Garbage Collection for C
  10. Garbage Collection for C++
  11. Cache-Conscious Garbage Collection
  12. Distributed Garbage Collection

5 of 105 comments (clear)

  1. Garbage collection leads to better programs. by unruh · · Score: 2
    The most hideous bugs can lurk in memory management. You get memory leaks if you are lucky or segfaults if you are not. GC frees you from a whole class of bugs and permits leaner, more legible code.

    GC was a revelation to me when I switched from C/C++ to Perl. Of course Perl's memory management goes beyond GC, but that's another story.

    Rapid Application Development is nearly impossible without GC -- you can't develop rapidly when you are hunting down memory leaks and illegal pointers.

  2. Grrr... by bluGill · · Score: 2

    Obviously you have never written a real program spanning several million lines of code with subroutines written by someone other then you.

    Combine the above with a data structure malloced in a library not written by you (possibal in OS/2, though it isn't malloc) that needs to be freeed in a latter subroutine. Maybe we should toss threads in too, so you may have a differnt thread looking at your memory after your subroutine is done, and your subroutine has to return a pointer to memory malloced. Suddenly you don't know who is done with that memory you just started with first. Can't call free() too soon lest you prevent access by someone else who needs it, but you have to call it. GC solves this problem, and if it was fast enough would be more universial.

    The creater of C++ has listed allowing himself to be convinced that GC was too slow as his greatest regret.

    Now for simple cases you are right, but the simple cases are rarely got wrong (for long) but when the memory needs to hang around for a long time the case is no longer simple.

  3. SmallTalk by Stu+Charlton · · Score: 2

    Actually, I agree more or less with your assessment, though I'm a little more optimistic about Java's abilities. (This probably comes down to the strong vs. weak typing argument that I won't get into. In summary: I like both, but going foward there are certain domains that need strong typing, and others that should have weak typing. This is too much of a religious issue: witness the flamewars against Tcl :)

    SmallTalk did catch on in many industries, but it never quite penetrated into mainstream use. Does anyone really know why? Perhaps it primarily because it was so advanced for its time (in the business community, I'd say late 80s to early 90s). Perhaps it was because C++ offered the "path of least resistance". I don't really know.

    Java is just an attempt (imho, and from what I understand, some veteren SmallTalker's opinions) at doing SmallTalk over again for the C++ crowd (albeit with some limitations, since the programming community at large seems to be set against weakly-typed languages)

    I love SmallTalk [and am very interested in Dylan or Self], and wish it would resurge with industry support. Until it does, (go Squeak go!) however, Java is an acceptable alternative, as it also provides the "path of least resistence" to some of SmallTalk's better technologies: GC, dynamic typing (though restricted through java interfaces), and a Packages/Classes/Methods IDE (through VisualAge for Java).

    Regarding COM: I felt that Microsoft actually did some cool stuff with their VM to support COM and make it MUCH easier to use than the C++-way of using COM. COM inherently is based upon the idea of C/C++ pointers, so it's not going to naturally map to any language, Java or otherwise (witness the atrocity called "Automation" to allow COM to talk to Visual Basic).

    Furthermore, as for your "double lines of code" assertion, I think you're overstating it a bit. Java doesn't have blocks, and it requires explicit type down-casting, but I don't see this as "doubling" code requirements. Certainly, the code requirement is larger than SmallTalk, but not significantly so.

    --
    -Stu
  4. I can growl louder than you can by SimonK · · Score: 2

    Garbage collection for those who find it to lazy or dificult to call free()...

    I am afraid you have just pressed one of my buttons. There is nothing that alerts me to danger more than one programmer acusing another of being 'lazy' because he does not want to use some rediculously complicated tool of which the first programmer is fond.

    I have NEVER seen a place where, if you were careful (as you should be) in designing your code, you couldn't have nice matching malloc()/free() pairs.

    If this were usenet, at this point I would endeavour to call you out on this, and we would end up in a pissing contest over who had the most experience and whose experience was most real

    Since this is not usenet, and I am in a good mood, I will not. I will just say that I have seen many examples of supposed OO code hacked to death or made overly complicated by the need to keep track of memory allocation. The stability, memory usage and locality of a great many C programs (including XFree86) can be improved by linking them with a garbage collector. In many cases this can actually make them faster.

    I am, at this very minute, attempting to track down a bug in a C++ library linked against my Java program, which is caused by a failure of the evil reference counting scheme the library author used to keep track of his memory usage. Now I know, from looking at the code, that he could not have used malloc and free in the traditional way without breaking the clean interface of his code, and using a real garbage collector with C++ is hard, so what choice did he have ? But the need to reimplement garbage collection badly for every project rather than doing it properly once is a total waste of time.

  5. Grrr... by Rob_D_Clark · · Score: 2

    My point is exactly that. If you have bad program design, GC is necessary, but well thought out design makes it unnecessary.

    Well, you could probably do without GC, but the question is whether you can do it better without GC.

    Without GC, there are situations where you either need to (a) implement reference counting (which is a form of GC) in order to know when to free some data, or (b) make duplicate copies of objects so each object can be freed by the code that receives a reference to the object when it is done with it. I am really not convinced that either approach is better than just using GC in the first place. (FWIW, the example I am thinking of is a multithreaded application that communicates between threads via message passing, where you would like to be able to have multicast messages. The objects passed are treated as read-only, so side effects are not an issue. I am sure that given a couple minutes I could come up with a better example.)

    --
    --Rob