Memory Management Technique Speeds Apps By 20%
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.
If you want to test a new memory allocator, the benchmarks these days are the Hoard allocator, and the TCmalloc allocator from google. These alone will give you more than the 20% speedup mentioned in the article.
However, that isn't the end of the story. There are proprietary allocators, like the Lockless memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.
Actually i've found the opposite. Java tends to be really good at transparent memory re-use. From experience if i have ~1,000,000 objects of the same type with the some constantly being deleted and replaced Java will run that program faster than a non-memory pooled C implementation (of course the memory pooled C implementation will be faster again).
In fact many of the benchmarks around that you see claiming Java is faster than C will use an example of a program that creates and destroys objects in a tight loop. The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.
Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.
The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.
And now I've read their paper. Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).
In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.
A witty [sig] proves nothing. --Voltaire