Slashdot Mirror


A Glance At Garbage Collection In OO Languages

JigSaw writes "Garbage collection (GC) is a technology that frees programmers from the hassle of explicitly managing memory allocation for every object they create. Traditionally, the benefit of this automation has come at the cost of significant overhead. However, more efficient algorithms and techniques, coupled with the increased computational power of computers have made the overhead negligible for all but the most extreme situations. Zachary Pinter wrote an excellent article about all this."

20 of 216 comments

  1. Sigh. It's not a "feature" of other languages... by devphil · · Score: 3, Insightful


    ...it's required by them.

    Stack-based languages like the C family (including Java) don't need GC to operate correctly, but can use it if it's available. (Java just has it all turned on by default.)

    By "correctly," I'm specifically leaving out memory leaks. Your program may leak, but it will still run correctly, give the right answers to computations, not suddenly lose track of variables, etc. (Right up until you run out of swap.)

    Those "other languages" the author dumps a list of don't use GC just to free the poor programmer from the burden of thinking, or whatever. Nearly every one of those languages either has support for functional programming, or is centered around it. And in functional programming, you're creating functions on the fly.

    Which means returning functions as data. Possibly involving local variables in the creating function. Which means that locally-declared variables have to keep existing after the creating function returns, even if the coder can't get to them anymore. And the only way to do that is to have the runtime system manage its own heap, which means a garbage collector.

    So for all those languages, it's not an "ease of use" thing. It's a "there's no way for a programmer to even do it manually at all" thing. GC is the only option.
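
    A minimal sketch of the situation described above, in Ruby (the language used for examples elsewhere in this thread). The name make_counter is made up for illustration; the point is only that the local variable count has to outlive the call that created it, so it cannot live on the stack.

    ```ruby
    # make_counter's local variable `count` must survive after make_counter
    # returns, because the returned lambda still refers to it.
    def make_counter
      count = 0                 # local to make_counter
      -> { count += 1 }         # closure capturing `count`
    end

    counter = make_counter      # make_counter has returned...
    counter.call                # ...yet `count` is still alive
    counter.call
    other = make_counter        # a fresh, independent `count`
    ```

    Nothing in the program says when count may be freed; only the runtime can tell when the last closure referring to it is gone.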

    --
    You cannot apply a technological solution to a sociological problem. (Edwards' Law)
  2. Under the Rug by Markus+Registrada · · Score: 4, Informative
    As with most such presentations, this article sweeps under the rug most of the reasons why languages dependent on garbage collection have always failed to find much deployment in industrial settings.

    A previous poster noted that most GC algorithms are distinctly unfriendly to virtual memory systems. They usually have similar problems with cache locality, which can result in an enormous slowdown, regardless of the time actually spent in the GC itself. A practical problem is that GC regimes are notoriously non-portable, so that each new language implementation needs to have the (increasingly complex) GC redone.

    A more fundamental problem is that memory is only one of many resources a typical industrial program must manage. GC takes over memory management, but leaves the other scarce resources -- file descriptors, sockets, mutexes, database connections -- to be managed manually, as in C. (Java has this problem, for instance.) "Finalization" simply cannot provide the necessary guarantees.

    Given a resource management regime that can handle all these other important resources, as is commonly practiced in C++, memory becomes just another resource. Management is encapsulated the same way for all. A language that lacks the tools necessary to implement such a regime needs GC, so the presence of GC may actually (as in the case of Java) indicate a fundamental weakness in the language.

    (Anybody who thinks languages like Haskell or ML are fundamentally more powerful than C++ must be unaware of the Boost Lambda library, and of FC++, a set of header files that implements Haskell language semantics for C++ programs. They get along fine without GC, as well.)

    1. Re:Under the Rug by WayneConrad · · Score: 4, Informative

      GC takes over memory management, but leaves the other scarce resources -- file descriptors, sockets, mutexes, database connections -- to be managed manually, as in C.

      Ruby has an interesting approach using closures to handle manual resource allocation. One calls the function that allocates a resource, passing it a closure. The function allocates the resource, calls the closure, and then deallocates the resource (even if an exception occurs). Here's how you might write to a file the manual way (I apologize for the lousy formatting; I don't know how to trick /. into indenting):

      file = File.new("foo", "w")
      file.puts "My mistress's eyes are nothing like the sun"
      file.close

      That's the usual way, easy to get wrong: What if an exception occurs? What if I forget to call close? Here's the better way, calling File.open and passing it a closure:

      File.open("foo", "w") do |file|
      file.puts "My mistress's eyes are nothing like the sun"
      end

      File.open might use this common idiom:

      def File.open(filename, mode = "r")
      file = File.new(filename, mode)
      begin
      yield(file)
      ensure
      file.close
      end
      end

      The "yield" calls the closure that was passed in, passing it the file object. The "begin...ensure" is like Java's "try...finally" construct, used here to make sure that the file gets closed whether the closure terminates normally or raises an exception.

      This idiom doesn't solve all manual resource allocation/cleanup problems, but it's a pretty way to solve some of them.
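
      The same idiom carries over to resources other than files. A rough sketch, assuming a made-up Connection class standing in for a socket, lock, or database handle (none of these names come from a real library):

      ```ruby
      # Hypothetical resource: stands in for a socket, lock, or database handle.
      class Connection
        attr_reader :closed

        def initialize(name)
          @name = name
          @closed = false
        end

        def close
          @closed = true
        end

        # Acquire, hand the resource to the block, and always release it.
        def self.open(name)
          conn = new(name)
          begin
            yield conn
          ensure
            conn.close
          end
        end
      end

      seen = nil
      begin
        Connection.open("db") do |conn|
          seen = conn            # kept only so we can inspect it afterwards
          raise "query failed"
        end
      rescue RuntimeError
        # the exception still propagates, but the connection was closed first
      end
      ```

      After this runs, seen.closed is true even though the block raised.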

      I don't think Ruby invented this idiom, but I don't know where it came from. Perhaps Lisp: Everything seems to have come from Lisp.

    2. Re:Under the Rug by Pseudonym · · Score: 4, Insightful
      A previous poster noted that most GC algorithms are distinctly unfriendly to virtual memory systems.

      It depends on the language. Haskell, for example, has very different memory access patterns than Java. Being lazy, a value is produced only when it's time to be first consumed, at which point it often becomes garbage immediately. It follows that most of the garbage that a decent generational GC will be collecting will probably be in cache.

      Anybody who thinks languages like Haskell or ML are fundamentally more powerful than C++ must be unaware of the Boost Lambda library [...]

      I'm one of those rarest of beasts, a programmer who regularly uses (and likes) both Haskell and C++. (Disclaimer: I'm not familiar with FC++, though from what I've read it doesn't really support lazy evaluation, which is one of Haskell's most important distinguishing features.)

      From a reductionist point of view, of course, neither is more powerful than the other. However, even with Boost.Lambda and the like, I still find Haskell almost always allows for far more rapid development than C++ does, all other things being equal. Naturally, all other things are rarely equal, and speed of development is not always the greatest concern, and I won't be drawn into ranking one of my two favourite languages over the other.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    3. Re:Under the Rug by BCoates · · Score: 4, Informative
      That sounds like the way a C++ destructor is used with the "Resource Acquisition is Initialization" model. You'd open a file by creating an object on the stack, and the destructor would close the file-handle once control returns (or the object is deleted, if on the heap).
      // some_file_object is a hypothetical file i/o object with manual open(), close(), write(), etc. functions

      class File : public some_file_object {
      public:
      File( const std::string & fname ) : m_handle( open(fname) ) {}
      ~File() { close(m_handle); }
      private:
      const handle m_handle;
      };
      It's sort of inside-out relative to the Ruby version because it doesn't use the closure, but the usage is near-identical:
      {
      File file( "foo" );
      file.write( "There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy." );
      } // close happens here, or at the throw/return/break/continue site, if any
      new/delete is just another open/close pair to be avoided or contained away in a small object when practical, so this reduces the benefit gained from GC use.
    4. Re:Under the Rug by DavidTurner · · Score: 3, Informative

      Well put!

      Another important consideration is that where the programmer has the expectation that his garbage will be cleaned up for him, he will tend to assume that all of his resources will be cleaned up. This is clearly not the case. The seminal example for me was the use of database query result sets in C# - if you don't explicitly close them, they tend to hang around, and the next time you try to perform a query on the same connection, you'll as likely as not get an exception. Surprise!

      Also, as some other posters have pointed out, not all garbage collection is automatically bad. It works pretty well in Lisp, Scheme, Haskell, and friends. However in the cases of Java and C# it is certainly detrimental, as it disables the only really effective mechanism for managing resources in those languages - destructors.

      Perhaps the thing to do is introduce an analogue for Lisp unwind-protect mechanisms. I suggested this a while back on the Java community forums, in the form of having sentry objects with the lifetime semantics of C++ automatic objects. Someone made the suggestion that the volatile keyword could be twisted to serve this purpose.

    5. Re:Under the Rug by Spy+Hunter · · Score: 3, Insightful
      So we should ignore GC because it doesn't solve all the world's resource problems at once? Your post doesn't provide a convincing argument against the use of GC. Non-portability is a non-issue; only language writers have to worry about that, and it's already their job. The cache-thrashing issue is the only real problem you mention. Generational GC significantly reduces this problem, to the point where the small runtime performance hit (if there is one at all; malloc and free take time too you know) is balanced out by increased programmer productivity (giving more time to optimize if you so desire, or add features if that's what you value more).

      Side note: Boost Lambda and FC++ are impressive but ugly hacks with horrible syntax, lots of "gotchas" that make code not work (often related to operator precedence and order of evaluation), and compiler errors from hell. Probably not the best examples of the power of C++. (OTOH, maybe that makes them the perfect examples of the "power" of C++ ;-)

      --
      main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
    6. Re:Under the Rug by hak1du · · Score: 3, Insightful

      this article sweeps under the rug most of the reasons why languages dependent on garbage collection have always failed to find much deployment in industrial settings.

      For the same reason people in industry have kept programming in Cobol and Fortran, and for the same reason they keep producing software with all sorts of problems, bugs, and limitations.

      A previous poster noted that most GC algorithms are distinctly unfriendly to virtual memory systems. They usually have similar problems with cache locality, which can result in an enormous slowdown, regardless of the time actually spent in the GC itself

      Not true at all. Generational collectors generally achieve far better locality than malloc-style allocators.

      A more fundamental problem is that memory is only one of many resources a typical industrial program must manage. GC takes over memory management, but leaves the other scarce resources -- file descriptors, sockets, mutexes, database connections -- to be managed manually, as in C.

      Fundamentally, the point behind GC is not to make your life easier, it is to make it possible for the language to be safe. Without GC, a language with heap allocated mutable data structures just cannot be safe. GC generally cannot reliably manage any other resources besides memory and it is not meant to.

      Given a resource management regime that can handle all these other important resources, as is commonly practiced in C++, memory becomes just another resource.

      But memory isn't just any resource, memory is a resource that can contain machine pointers to other memory (as well as references to other resources).

      The problem of resource management for memory is that of arbitrary directed graph reachability. And that is exactly the problem that a garbage collector solves, as efficiently as possible.

      A language that lacks the tools necessary to implement such a regime needs GC, so the presence of GC may actually (as in the case of Java) indicate a fundamental weakness in the language.

      C++ solves a common but limited subset of the resource management problem and then just declares victory. And even that false victory is not very satisfying because in order to achieve it, C++ has sacrificed runtime safety. (In fact, with that choice, it has also sacrificed efficiency, but you aren't going to believe that no matter what I say.)
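
      To make the graph-reachability point concrete, here is a toy sketch of the marking step of a tracing collector, written in Ruby. The HEAP table and the reachable function are invented for illustration; a real collector walks actual object pointers, not a hash of symbols.

      ```ruby
      require "set"

      # Toy heap: each object name maps to the names it points to (made-up data).
      HEAP = {
        a: [:b],
        b: [:c],
        c: [],
        d: [:e],   # d and e point at each other, but nothing live points at them
        e: [:d],
      }

      # Mark phase: everything reachable from the roots is live.
      def reachable(heap, roots)
        live = Set.new
        stack = roots.dup
        until stack.empty?
          obj = stack.pop
          next unless live.add?(obj)      # add? returns nil if already marked
          stack.concat(heap.fetch(obj))
        end
        live
      end

      live = reachable(HEAP, [:a])
      garbage = HEAP.keys - live.to_a     # the d/e cycle ends up here
      ```

      The cycle between d and e needs no special handling: it is simply never reached from the root.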

  3. Re:An Obvious Fault by p2sam · · Score: 3, Insightful

    An obvious fault that seems to go without notice about sorting algorithms, particularly bubble sort, is that it takes O(n^2) time to complete.

  4. Circular References by bcore · · Score: 3, Insightful

    Another flaw of ref counting is that if you have two objects which are no longer referenced by any part of the active application, but which have references to each other, they will not get GC'ed, leading to memory leaks. Ref counting alone is just not good enough for any serious application, unless you force the programmer to look after cleaning up circular references, which kinda defeats a lot of the benefit of using a GC'ed language.
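
    A hand-rolled illustration of the leak described above, in Ruby. Ruby itself uses a tracing collector, so the counting here is simulated; Node, retain, and release are made-up names, not any real API.

    ```ruby
    # Simulated reference counting, purely for illustration.
    class Node
      attr_reader :refcount, :freed
      attr_accessor :other

      def initialize
        @refcount = 0
        @freed = false
        @other = nil
      end

      def retain
        @refcount += 1
        self
      end

      def release
        @refcount -= 1
        return unless @refcount.zero?
        @freed = true
        @other&.release        # drop our reference to the neighbour
      end
    end

    a = Node.new.retain        # held by the "application"
    b = Node.new.retain
    a.other = b.retain         # a -> b
    b.other = a.retain         # b -> a, closing the cycle

    a.release                  # the application lets go of both...
    b.release                  # ...but each count is stuck at 1
    ```

    Neither object is ever marked freed, because each is kept alive only by the other.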

  5. Re:Sigh. It's not a "feature" of other languages.. by jonadab · · Score: 3, Interesting

    > By "correctly," I'm specifically leaving out memory leaks.

    What a thing to leave out. Memory leaks are one of the hardest-to-track-down
    and most annoying kinds of bugs that we perpetually see in application after
    application. Okay, crashes are more annoying and pervasive, sure. And
    buffer overruns (which are not a problem in most languages that have GC,
    albeit GC is not the reason they're not a problem). But memory leaks are
    high on the list.

    > And in functional programming, you're creating functions on the fly.

    I'm trying to imagine a programming language that doesn't let you create
    functions on the fly but is powerful enough for writing real applications.
    The only thing I can come up with is that you could write what basically
    amounts to an interpreter so that you wouldn't have to write "functions"
    in the implementation language but could write them in the interpreted
    language instead. But that seems like a really ugly hack, just to avoid
    including real memory management in the compiler/interpreter/vm/whatever.

    It is possible to get around the need for closures (i.e., anonymous routines
    that hold references to otherwise-out-of-scope lexicals), if you have a
    sufficiently powerful object system. But again, it seems like a questionable
    goal; sometimes closures are really the most convenient way to accomplish
    something. (Sometimes they're not, of course... that's why I favour
    multiparadigmatic languages.)

    > So for all those languages, it's not an "ease of use" thing. It's a
    > "there's no way for a programmer to even do it manually at all" thing.
    > GC is the only option.

    Strictly *theoretically*, the programmer can do all that stuff in any
    Turing-complete language; it's possible to do functional programming in
    8086 assembly language, for example, if you're willing to go far out of
    your way to do it. But in practice, neither assembly language nor C
    really makes that easy or practical, no. But then, there are actually
    quite a lot of things that those languages don't make easy or practical.

    --
    Cut that out, or I will ship you to Norilsk in a box.
  6. Dilbert GC by StarWynd · · Score: 5, Funny

    This is my kind of garbage collection!

  7. Re:Reference Counting... by Pseudonym · · Score: 4, Insightful

    It can be.

    Let's ignore circular references for a moment. To be honest, cycles don't turn up as often as people claim in programs where reference counting is done manually (or through smart pointers) because people are smart enough to know the issues and avoid them (e.g. by using weak references or other non-owning pointers to break cycles).

    For a start, reference counting interacts badly with multithreading. The reference count has to be protected against concurrent updates, and that can cost a lot, especially if the count is already effectively protected in some other way (e.g. by only being used single-threadedly). This is such a problem that many C++ library vendors are doing away with reference counting in their std::basic_strings.

    Secondly, every time you copy a pointer, you modify the reference count. Every single time. Sometimes (e.g. if you take a temporary local copy) that will be in cache, but not always. If there's contention between CPUs (see previous point), for example, the count will bounce between them. Sometimes it's an almost guaranteed cache miss.

    Admittedly, this isn't such a big problem in C++-implemented reference counting, because the programmer is usually far more aware of what's going on with pointer copying and will go to some lengths to avoid copying, but it can cost if reference counting is automatic. Have a look at the Python source code some time and see just how much trouble it goes to to avoid manipulating reference counts.
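
    On the earlier point about breaking cycles with non-owning pointers, a rough sketch in Ruby with simulated counts. Counted, retain, and release are invented names; the only idea being shown is that the back-pointer is deliberately left uncounted.

    ```ruby
    # Simulated reference counting where the back-pointer is non-owning.
    class Counted
      attr_reader :refcount, :freed
      attr_accessor :child, :parent   # child is owning, parent is borrowed

      def initialize
        @refcount = 0
        @freed = false
        @child = nil
        @parent = nil
      end

      def retain
        @refcount += 1
        self
      end

      def release
        @refcount -= 1
        return unless @refcount.zero?
        @freed = true
        @child&.release         # only owning references are released
      end
    end

    parent = Counted.new.retain
    child  = Counted.new
    parent.child = child.retain # owning edge: bumps the count
    child.parent = parent       # back edge: deliberately not counted

    parent.release              # count hits zero, and the child follows
    ```

    The price is that child.parent can dangle once the parent is gone, which is exactly the sort of thing the programmer has to keep in mind.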

    --
    sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  8. a way to give the GC a hint? by Doppler00 · · Score: 3, Interesting

    It might be useful if some languages had an optional method of hinting that an object should be garbage collected soon. This would help in languages like Java where you get a huge amount of data stored and then all at once the disk thrashes as it GCs everything. For some algorithms, it would be nice to tell Java ahead of time that you're done with the object and you're not going to reference it anymore. The nice thing, though, is that it wouldn't be a requirement, so you wouldn't have to worry about deleting an object still in use by mistake. I wonder how efficient this would be.
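
    The closest thing I know of in Ruby (the language used for examples elsewhere in this thread) is much coarser than a per-object hint: drop the reference, then ask for a collection with GC.start. A sketch, assuming the standard MRI interpreter; process_batch is a made-up name, and whether the array is actually reclaimed at that moment is up to the collector.

    ```ruby
    # Build a large temporary structure, drop the only reference, then ask
    # the runtime to collect now instead of at some later allocation.
    def process_batch
      data = Array.new(100_000) { |i| "row #{i}" }
      total = data.sum(&:length)
      data = nil        # we are done with it; nothing else refers to the array
      GC.start          # request a collection
      total
    end

    runs_before = GC.count      # number of collections so far
    total = process_batch
    runs_after = GC.count
    ```

    Java's System.gc() plays a similar role, and is likewise a request about the whole heap rather than about one object.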

  9. GC has always been efficient by hak1du · · Score: 3, Insightful
    however, [manual storage management] can be more efficient in many ways if properly handled. This discrepancy in efficiency has slowed the widespread adoption of the automated approach.

    There hasn't been a "discrepancy in efficiency". Good garbage collectors have been comparable to, or better than, manual storage allocators for decades.

    The perception of a "discrepancy in efficiency" has several causes:
    • Garbage collection allows programmers to get sloppy about storage management: if a non-GC program gets sloppy about storage management, it crashes; if a GC program gets sloppy about storage management, it just runs slowly. Unfortunately, as a result, many core libraries in garbage collected languages are pretty sloppily written and slow--the fault is with the libraries, not with garbage collection.
    • Garbage collection allows language implementors to make different design decisions. Many garbage collected languages will do memory allocation every time you use a floating point number. Imagine how slow C would be if you called "malloc" for every floating point number.
    • Garbage collection often bundles memory management overhead into single chunks of time, while manual storage allocators don't. Furthermore, garbage collector implementations really rub your nose in it, printing messages like "[starting garbage collection... done]". But doing a lot of storage management at once is usually more efficient overall--in aggregate, manual storage managers spend more time, they just diffuse it out. However, both kinds of behaviors exist with both storage managers, and you can pick and choose.
    The article is right that garbage collection is a good choice today. It is wrong in that it has pretty much always been a good choice. Garbage collection could have been widely adopted in the 1970's or 1980's, and we would have saved ourselves a lot of headaches and troubles without any loss in efficiency.
    1. Re:GC has always been efficient by hak1du · · Score: 4, Insightful

      What about real-time constraints?

      What about them? Real-time garbage collectors give you guaranteed real-time responses.

      I suspect that you have actually never used a real-time storage allocator of any form. The memory allocators that ship with major C/C++ compilers certainly make no real-time guarantees. The way people usually get real-time performance out of them is by pre-allocating large chunks of memory. Well, you can do that in garbage collected languages as well.
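
      A minimal sketch of that pre-allocation approach in a garbage collected language (Ruby here). BufferPool and its methods are made-up names; the idea is simply that all buffers are allocated once, up front, and the time-critical path only pops and pushes them.

      ```ruby
      # Hypothetical fixed-size pool: every buffer is allocated once, up front.
      class BufferPool
        def initialize(count, size)
          @free = Array.new(count) { Array.new(size, 0) }
        end

        # Hand out a pre-allocated buffer; returns nil when the pool is
        # exhausted rather than allocating a new one.
        def acquire
          @free.pop
        end

        # Zero the buffer in place and put it back for reuse.
        def release(buf)
          buf.fill(0)
          @free.push(buf)
        end

        def available
          @free.size
        end
      end

      pool = BufferPool.new(4, 1024)
      buf = pool.acquire
      buf[0] = 42
      pool.release(buf)
      ```

      Returning nil on exhaustion is a design choice for the sketch; a real system would have to decide whether to block, fail, or grow the pool.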

      GC are generally non-deterministic (they start and finish according to their own rules),

      No, they don't. Just like with malloc implementations, their behavior may differ from implementation to implementation, but it is generally pretty well understood. It can usually also be controlled well.

      Simple garbage collectors will only start a garbage collection when you ask for a block of memory and they can't satisfy the request; they don't just start up for no reason at all. Parallel garbage collectors may run a thread in parallel with the main program without ever stopping it. Incremental collectors do a little bit of work each time you allocate. Real-time collectors guarantee well-defined maximum response times for allocation.

      If the garbage collector in your language (Java?) doesn't do what you want, it's not a problem with garbage collection in general, it's a problem with the specific implementation your vendor has chosen to give you. Just like there are mediocre or bad malloc implementations, there are mediocre or bad garbage collectors.

      How about one of the earlier comments to the effect that mark-and-sweep type algorithms page-fault all the memory used by an application? That has got to be inefficient, and since virtual memory is not under the control of the application by definition there is nothing that can be done, except if the GC is directly under the control of the OS, which doesn't often make sense (it's not very flexible then).

      Well, that comment is wrong. First of all, you don't have to use a mark-and-sweep collector. Most high-performance collectors are, in fact, generational and are very VM friendly (moreso than malloc/free in many cases). Second, operating systems have interfaces to their VM subsystems, so the GC can, in fact, control what is happening with paging--prefetching pages, etc. And they do. Even 20 years ago, Berkeley UNIX had system calls specifically designed to let Franz Lisp let the kernel know what it was doing. Third, a malloc implementation cannot move pointers around to make accesses more local or sequential--good garbage collectors do, so GC is actually superior in that regard.

      The article itself says that there is no way to make a GC perform as well as or better than finely tuned, hand-micro-managed code in every case.

      You can "micro-manage" and "fine tune" in the presence of a GC as much as you can in its absence. But in the presence of GC, you have the freedom to be sloppy and your code will still run--so many people don't bother. In C/C++, you don't have a choice.

      In languages that don't have GCs you can add one yourself (Boehm's GC works fine for C/C++, and is in fact used for GCJ, the GNU implementation of the Java language), with the benefit that you can turn it off if you don't want it for some reason, something you can't do in Java for example.

      No, that is backwards. In languages without GC, you cannot add a GC and get all the benefits from the GC. Boehm's GC, for example, may retain arbitrary amounts of garbage, and its lack of integration with the language and compiler means that it can't be anywhere near as efficient as an integrated GC. Boehm's GC is a great hack, and it works really well, but it is not something you can ultimately rely on. Furthermore, if you add Boehm's GC to a language without GC, you are still left with an unsafe programming language.

      Secondly, languages with garbage collection often give you full control over the GC: you can enable it or disable i

  10. The GC pitfall by jtheory · · Score: 4, Insightful

    Good article, though very limited in scope (basically just a list of GC methods, wrapping up with the methods used by recent Java and .NET interpreters). I was a little disappointed that they didn't get into the implications of using languages with GC.

    One pitfall that I've noticed basically comes along with the benefit of avoiding "micro-managed" explicit memory management -- there are a lot of Java coders who don't think at *all* about memory management, because they think it's all handled for them. Mix that in with an over-excitement about OO, and you get some impressively slow and non-scalable code.

    You DO need to understand, at least on a basic level, what's going onto the heap, and what the garbage collector has to do to keep up with your "garbage". Carefully nulling out objects that are going to be out of scope in a millisecond is just wasting space, but you should definitely keep an eye on what objects you're allocating within that loop that runs a million times. They're all going on the heap; are they all going to be on there at the same time? When are they going to be eligible for collection? Are they just Strings, or larger objects (which possibly create other objects when they are created)?

    If you have to optimize a section of code, consider sticking to primitives and Strings (obviously you're balancing this against the cost of possibly less-maintainable code!), and don't forget that when you instantiate com.foo.Bar, all of its superclasses are also instantiated, including any member objects they hold. And don't make a variable static for no reason -- it won't get collected with the object instance....

    Two useful things to think about -- heap size (the objects you're actively using at a given moment, so they can't be collected), and churn rate (how fast you're creating and trashing objects). Object creation/destruction isn't as costly as it was with the early versions of Java (no, you probably don't need that Thread pool!). But any application that needs to scale requires some thought on memory usage and churn before you start coding.
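
    Churn is easy to measure once you look for it. A rough sketch in Ruby rather than Java (Ruby being the language used for examples elsewhere in this thread), assuming the standard MRI interpreter, whose GC.stat exposes an allocation counter; allocations, PIECE, and N are names made up for the example.

    ```ruby
    PIECE = "x".freeze   # one shared string, so the loops allocate nothing extra

    # Count objects allocated while the block runs (MRI-specific counter).
    def allocations
      before = GC.stat(:total_allocated_objects)
      yield
      GC.stat(:total_allocated_objects) - before
    end

    N = 10_000

    churny = allocations do
      s = ""
      N.times { s = s + PIECE }     # builds a brand-new String every iteration
    end

    steady = allocations do
      s = +""
      N.times { s << PIECE }        # appends in place to the same String
    end
    ```

    Both loops build the same string, but the first one leaves roughly N dead Strings behind for the collector, while the second allocates only a handful of objects.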

    --
    There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
  11. Re:Reference Counting... by Circuit+Breaker · · Score: 3, Interesting

    Reference counting can interact nicely with multithreading on modern (post `96) hardware - most modern CPUs have this nice "compare-and-swap" atomic operation, which can be used to manage refcounts without any form of locking. Yes, it is a little less efficient and a little more intricate, but it's doable; in Windows, for example, it's called "InterlockedIncrement()" and "InterlockedDecrement()".

    Also, in many environments you DON'T modify the reference count every time you copy a pointer; there's a concept called "borrowed references" which is used in Python, COM, and many other ref count schemes to avoid some useless refcounts.

    Python (pre 2.0) used to do only refcounting, and did it much better than Java (using GC) in all respects except thread friendliness. Modern Python (2.0 and beyond) does both -- but it's extremely rare for the gc to be needed at all.

  12. Reference counting by Antity-H · · Score: 3, Informative

    It was mentioned earlier that reference counting was pretty good, but had a few drawbacks when it came to cycles and multi-threading.

    I took a bit of time to go and read Wikipedia's page

    In the description they give, they mention that reference counting GC can represent managed objects by directed graphs.
    I know there exist algorithms to find cycles in such graphs, so I suppose these could be applied to this problem. Another proposal is to use a tracing GC to detect them. To which it was replied that this would be able to reclaim the memory but not to properly finalize the objects. I don't see why that would be true. I mean, if you have found a member of the cycle to be collected, can't you just finalize that one and let the whole cycle unravel itself? If there are cycles inside that cycle, just do it again on these, etc.

    As I said, another common objection was the cost of updating the counters in multithreaded environments. Multiple solutions have been proposed, some more portable than others (using processor/platform specific atomic increments, or deferring the update until it is really necessary and using the standard mutex protection).

    All this said, I try to understand a couple of things.
    -I am no genius, thus these ideas must not be new, what is the problem which can't be solved with these?
    -Reference counting seems to integrate better in the runtime of the program. All the other techniques proposed seem to imply some monolithic operation on the memory, summing up all the overhead at one time and doing the cleaning once in a while, with the possibility of becoming a bottleneck in heavily loaded systems. Reference counting OTOH seems to allow the cleanup to continually add a little bit of overhead to the system but nothing which will bring the whole thing to a grinding halt before allowing it to go on. What have I missed?

  13. Java doesn't have *a* garbage collector by blamanj · · Score: 3, Informative

    It has different collectors, which you can select according to the needs of your application. Currently there are two, the default collector (generational) and an incremental collector which is slower but less likely to pause.

    Also, the default collector is a 3-generation one, not 2, at least as of Java 1.4.1. More details here.