A Glance At Garbage Collection In OO Languages
JigSaw writes "Garbage collection (GC) is a technology that frees programmers from the hassle of explicitly managing memory allocation for every object they create. Traditionally, the benefit of this automation has come at the cost of significant overhead. However, more efficient algorithms and techniques, coupled with the increased computational power of computers have made the overhead negligible for all but the most extreme situations. Zachary Pinter wrote an excellent article about all this."
...it's required by them.
Stack-based languges like the C family (including Java) don't need GC to operate correctly, but can use it if it's available. (Java just has it all turned on by default.)
By "correctly," I'm specifically leaving out memory leaks. Your program may leak, but it will still run correctly, give the right answers to computations, not suddenly lose track of variables, etc. (Right up until you run out of swap.)
Those "other languages" the author dumps a list of don't use GC just to free the poor programmer from the burden of thinking, or whatever. Nearly every one of those languages either has support for functional programming, or is centered around it. And in functional programming, you're creating functions on the fly.
Which means returning functions as data. Possibly involving local variables in the creating function. Which means that locally-declared variables have to keep existing after the creating function returns, even if the coder can't get to them anymore. And the only way to do that is to have the runtime system manage its own heap, which means a garbage collector.
So for all those languages, it's not an "ease of use" thing. It's a "there's no way for a programmer to do even do it manually at all" thing. GC is the only option.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
An obvious fault that seems to go with out notice about sorting algorithms, particularly bubble sort is that it takes O(n^2) time to complete.
Another flaw of ref counting is that if you have two objects which are no longer referenced by any of the active application, but which have references to each other, they will not get GC'ed, leading to memory leaks. Circular refs alone are just not good enough for any serious application, unless you force the programmer to look after cleaning up circular references, which kinda defeats alot of the benefit of using a GC'ed language.
It depends on the language. Haskell, for example, has very different memory access patterns than Java. Being lazy, a value is produced only when it's time to be first consumed, at which point it often becomes garbage immediately. It follows that most of the garbage that a decent generational GC will be collecting will probably be in cache.
I'm one of those rarest of beasts, a programmer who regularly uses (and likes) both Haskell and C++. (Disclaimer: I'm not familiar with FC++, though from what I've read it doesn't really support lazy evaluation, which is one of Haskell's most important distinguishing features.)
From a reductionist point of view, of course, neither is more powerful than the other. However, even with Boost.Lambda and the likw, I still find Haskell almost always allows for far more rapid development than C++ does, all other things being equal. Naturally, all other things are rarely equal, and speed of development is not always the greatest concern, and I won't be drawn into ranking one of my two favourite languages over the other.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
It can be.
Let's ignore circular references for a moment. To be honest, cycles don't turn up as often as people claim in programs where reference counting is done manually (or through smart pointers) because people are smart enough to know the issues and avoid them (e.g. by using weak references or other non-owning pointers to break cycles).
For a start, reference counting interacts badly with multithreading. The reference count has to be protected against concurrent updates, and that can cost a lot, especially if the count is already effectively protected in some other way (e.g. by only being used single-threadedly). This is such a problem that many C++ library vendors are doing away with reference counting in their std::basic_strings.
Secondly, every time you copy a pointer, you modify the reference count. Every single time. Sometimes (e.g. if you take a temporary local copy) that will be in cache, but not always. If there's contention between CPUs (see previous point), for example, the count will bounce between them. Sometimes it's an almost guaranteed cache miss.
Admittedly, this isn't such a big problem in C++-implemented reference counting, because the programmer is usually far more aware of what's going on with pointer copying and will go to some lengths to avoid copying, but it can cost if reference counting is automatic. Have a look at the Python source code some time and see just how much trouble it goes to to avoid manipulating reference counts.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
There hasn't been a "discrepancy in efficiency". Good garbage collectors have been comparable to, or better than, manual storage allocators for decades.
The perception of a "discrepancy in efficiency" has several causes:
- Garbage collection allows programmers to get sloppy about storage managmentt: if a non-GC program gets sloppy about storage management, it crashes, if a non-GC program gets sloppy about storage management, it just runs slowly. Unfortunately, as a result, many core libraries in garbage collected languages are pretty sloppily written and slow--the fault is with the libraries, not with garbage collection.
- Garbage collection allows language implementors to make different design decisions. Many garbage collected languages will do memory allocation every time you use a floating point number. Imagine how slow C would be if you called "malloc" for every floating point number.
- Garbage collection often bundles memory management overhead into single chunks of time, while manual storage allocators don't. Furthermore, garbage collector implementations really rub your nose in it, printing messages like "[starting garbage collection... done]". But doing a lot of storage management at once is usually more efficient overall--in aggregate, manual storage managers spend more time, they just diffuse it out. However, both kinds of behaviors exist with both storage managers, and you can pick and choose.
The article is right that garbage collection is a good choice today. It is wrong in that it has pretty much always been a good choice. Garbage collection could have been widely adopted in the 1970's or 1980's, and we would have saved ourselves a lot of headaches and troubles without any loss in efficiency.Good article, though very limited in scope (basically just a list of GC methods, wrapping up with the methods used by recent Java and .NET interpreters). I was a little disappointed that they didn't get into the implications of using languages with GC.
One pitfall that I've noticed basically comes along with the benefit of avoiding "micro-managed" explicit memory management -- there are a lot of Java coders who don't think at *all* about memory management, because they think it's all handled for them. Mix that in with an over-excitement about OO, and you get some impressively slow and non-scaleable code.
You DO need to understand, at least on a basic level, what's going onto the heap, and what the garbage collector has to do to keep up with your "garbage". Carefully nulling out objects that are going to be out of scope in a millisecond is just wasting space, but you should definitely keep an eye on what objects you're allocating within that loop that runs a million times. They're all going on the heap; are they all going to be on there at the same time? When are they going to be eligible for collection? Are they just Strings, or larger objects (which possible create other objects when they are created)?
If you have to optimize a section of code, consider sticking to primitives and Strings (obviously you're balancing this against the cost of possibly less-maintainable code!), and don't forget that when you instantiate com.foo.Bar, all of its superclasses are also instantiated, including any member objects they hold. And don't make a variable static for no reason -- it won't get collected with the object instance....
Two useful things to think about -- heap size (the objects you're actively using at a given moment, so they can't be collected), and churn rate (how fast you're creating and trashing objects). Object creation/destruction isn't as costly as it was with the early versions of Java (no, you probably don't need that Thread pool!). But any application that needs to scale requires some thought on memory usage and churn before you start coding.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
There are Godlike Real Programmers and there are sloppy programmers. I don't think that GC will automatically make everyone a sloppy programmer. If it stops some of the sloppy programmers from creating applications with huge memory leaks, isn't that a good thing?
Side note: Boost Lambda and FC++ are impressive but ugly hacks with horrible syntax, lots of "gotchas" that make code not work (often related to operator precedence and order of evaluation), and compiler errors from hell. Probably not the best examples of the power of C++. (OTOH, maybe that makes them the perfect examples of the "power" of C++ ;-)
main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
On the other hand,it can probably deal with transaction-style locks easier than RAII can - although I've seen a system to handle that that uses RAII objects combined with functors (instead of closures). It works almost identically to the Ruby model.
Obviously the poster's point was that there are better garbage collection algorithms that do not generally suffer from the original poster's problems, as there are sorts that generally perform better than bubble sort. It was intended to be a tad sarcastic, I'd think.
> > I'm trying to imagine a programming language that doesn't let you
> > create functions on the fly but is powerful enough for writing real
> > applications.
>
> C, C++, Java.
[Scratches Java off list of languages to learn.]
I know C and C++ have been traditionally used for writing applications, but I
have long been of the opinion that they're not really powerful enough for the
job. It takes several times as many programmer-hours as it ought to to do
anything, from prototyping to new feature work to debugging, which IMO means
that "powerful enough" is a real stretch. These languages get by and continue
to be used at this point mostly because a lot of people know them.
In the past, these languages were selected because programmer time was cheaper
than computer resources (with which they're more miserly than a higher-level
language), but that's no longer anywhere near true, as the article points out;
the *average* computer has enough RAM to run three horribly-inefficient
extreme memory-hog applications at the *same time* without needing any swap,
and newer models are coming with more and more. You talk about GC screwing
up virtual RAM algorithms, but it's really not an issue on most systems; if
a process grows to three or four *times* the size it needs to be, it doesn't
actually have any user-noticeable impact on performance. Memory leaks are
actually much worse, because in that case the wasted memory doesn't ever get
collected and eventually it becomes a problem, after a couple of hours of
use. (Actually, a very small memory leak can go for days without being a
problem, but those aren't the ones we notice so much.) In 1996, when most
consumer-grade operating systems were so stable that you had to reboot every
few hours, memory leaks weren't such a big deal (provided you had lots of
swap space), but now that almost any modern OS (and most applications) can
run for weeks and weeks if not months or even years without being restarted,
memory leaks are now a big deal. It's okay to continually use five times as
much RAM as you technically need; it's not okay for your memory requirements
to keep growing as a function of how long you've been running, because that
can get to be *way* more than five times what you need.
Back to creating functions on the fly, I'm just a little bit surprised to
learn that Java doesn't have such an important feature; I had been lead to
believe it was a relatively high-level language with fairly high-level
features. It runs on a virtual machine, for crying out loud; I had imagined
it would be fairly modern and flexible in its design. Are you sure it can't
create functions on the fly, or is that just something you don't know how to
do in Java? That's a pretty serious accusation to level at a language,
almost as bad as saying it can't allocate extra memory on the fly.
Cut that out, or I will ship you to Norilsk in a box.
this article sweeps under the rug most of the reasons why languages dependent on garbage collection have always failed to find much deployment in industrial settings.
For the same reason people in industry have kept programming in Cobol and Fortran, and for the same reason they keep producing software with all sorts of problems, bugs, and limitations.
A previous poster noted that most GC algorithms are distinctly unfriendly to virtual memory systems. They usually have similar problems with cache locality, which can result in an enormous slowdown, regardless of the time actually spent in the GC itself
Not true at all. Generational collectors generally achieve far better locality than malloc-style allocators.
A more fundamental problem is that memory is only one of many resources a typical industrial program must manage. GC takes over memory management, but leaves the other scarce resources -- file descriptors, sockets, mutexes, database connections -- to be managed manually, as in C.
Fundamentally, the point behind GC is not to make your life easier, it is to make it possible for the language to be safe. Without GC, a language with heap allocated mutable data structures just cannot be safe. GC generally cannot reliably manage any other resources besides memory and it is not meant to.
Given a resource management regime that can handle all these other important resources, as is commonly practiced in C++, memory becomes just another resource.
But memory isn't just any resource, memory is a resource that can contain machine pointers to other memory (as well as references to other resources).
The problem of resource management for memory is that of arbitrary directed graph reachability. And that is exactly the problem that a garbage collector solves, as efficiently as possible.
A language that lacks the tools necessary to implement such a regime needs GC, so the presence of GC may actually (as in the case of Java) indicate a fundamental weakness in the language.
C++ solves a common but limited subset of the resource management problem and then just declares victory. And even that false victory is not very satisfying because in order to achieve it, C++ has sacrified runtime safety. (In fact, with that choice, it has also sacrificed efficiency, but you aren't going to believe that no matter what I say.)
I think you mean "mark and sweep" collectors. "Stop and copy" collectors just trace the working set from whatever your heap root is. Add in the copy step, and you only touch twice the size of you working set. If your collector is well-written and the OS provides the hooks, it will ask for the new space to be allocated in core, and the old space to be discarded, wherever it is.
In the great CONS chain of life, you can either be the CAR or be in the CDR.