Memory Management Technique Speeds Apps By 20%
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Beware the key term there: "up to."
Are you adequate?
Nothing to see here...
Moving malloc() to a separate thread does not do a thing for the putative word processor.
They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.
But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.
I wish I'd thought of it.
Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
Need a Python, C++, Unix, Linux develop
If most programs are spending 20% of their time on memory management, something is wrong.
Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.
I once submitted an April Fool's joke to this effect to a moderated website; I claimed superlinear speedup for sorting by starting with a bubble sort, then "modifying it for parallelism" until it morphed into a quicksort. Alas, the moderator rejected it...
Have you read my blog lately?
A common simplified structure is:
With these new innovations you get:
And free shouldn't take a noticable amount of time.
Because we learnt to program for a single threaded core with it's single processing pipeline since way back, using high level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.
If I still coded much anymore it would drive me to drink.
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
It's a performance gain because it's extremely rare that all your cores are maxed out at once, if you can distribute the computing power more evenly it's a performance gain in most circumstances even if the net computing power required increases.
Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.
Game! - Where the stick is mightier than the sword!
The Fivefold Mother
There are very few developers left in the US that even know what memory management is.
People don't even try anymore.
This sig is the express property of someone.
The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.
If I may quote: "Phkmalloc was written by Poul-Henning Kamp for FreeBSD in 1995-1996 and subsequently adapted by a number of operating systems, including NetBSD, OpenBSD, and several Linux distributions." Phkmalloc is the example implementation they used for the benchmarks..
Not much to see in the article. Move along.
IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).
One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends upon how much new memory is allocated depending on how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.
Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.