Memory Management Technique Speeds Apps By 20%
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Beware the key term there: "up to."
Are you adequate?
Sound nice, but I hope the library also handle multi processor systems, where each processer have its own ram block. You don't want one cpu to allocate the memory, which is used by an other cpu.
Do operation systems even have support to say "These 2 threads should run on the same processor, but I don't care about which one".
Nothing to see here...
Moving malloc() to a separate thread does not do a thing for the putative word processor.
They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.
But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.
You can use a wait-free lock-free data structure to add deallocated memory to a free list to reduce contention on the critical section in a heap. You can also use it to implement the idea presented in the paper.
On Windows, you can use the Interlocked Singly Linked List data structure (see InitializeSListHead).
You can also partition memory allocations over several heaps to reduce contention further.
"The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."
The release says this technique is for programs difficult to parellelize "by running the memory-management functions on a separate thread".
I wish I'd thought of it.
Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
Need a Python, C++, Unix, Linux develop
If most programs are spending 20% of their time on memory management, something is wrong.
Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.
Uhm, so what's the big deal? .Net's garbage collector runs in its own thread.
No, I will not work for your startup
A common simplified structure is:
With these new innovations you get:
And free shouldn't take a noticable amount of time.
Because we learnt to program for a single threaded core with it's single processing pipeline since way back, using high level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.
If I still coded much anymore it would drive me to drink.
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
As far as I can tell, the author of this paper just figured out a way to offload a bunch of memory management tasks to an idle CPU core, and then counted it as a performance gain. OK?
So, monolithic single-threaded applications can be made faster on multi-core systems, at the cost of bulk-allocating more memory than is actually required, and only if there is enough idle CPU to run your memory-management thread in realtime on another core. I am not exactly blown away.
The press release tries to trump it up as some kind of major advancement in performance, but it's not a performance gain at all. If you ran four copies of these single-threaded benchmarks simultaneously, you'd probably see a net performance decrease.
Haven't ready the OP: shame on me.
But as it is true that a program which calls malloc() is basically blocked until malloc() returns, the trick here could be that malloc() is made to be a lot dumber... but free() just puts pointers in a queue "to the other thread" and then the other thread can be more time consuming about being smart with making that space available to malloc() again.
OK I'm expressing it badly. Obviously an app which needs memory is blocked until it gets the memory. But the same isn't true of releasing memory: you don't *need* your free()ed blocks to be immediately available. So maybe that's what they do: backload the joint malloc()+free() work into free(), make malloc() really dumb, and shift the free() work off to the other thread via a simple queue.
The guys that developed SmartHeap http://www.microquill.com/ have been preaching this for years.
Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
How does this compare to jemalloc(), for example? This paper seems to be either very overhyped or plain outright silly. What does this do that other methods to improve allocations like the Windows Low Fragmentation Heap (LFH) and BSD's jemalloc() don't?
The Fivefold Mother
So basically, they apply garbage collection techniques to regular malloc/free program
Not bad, all things considered...
After I scrolled down and then back up the front page, I am genuinely disappointed that this was not a psychology-based article. Imagine slowpokes like me thinking all all the possibilities for a title I parsed as "Memory management technique speed APES by 20%" :(
a good friend of mine has done this and much better. check out http://newcodeinc.com/ for a threaded fast drop-in replacement for malloc. yes, it is closed source and commercial. i am slowly influencing him to open source it and gain more customers for support. it is currently being used by several large financial firms who love it.
Okay, I can understand multicore for web broswers. But word processors? What kind of insanity does it take to write a word processor that needs more than a late-90s PC to run smoothly, let alone multi-anything?
If your word processor can't run smoothly on a single core, it's not the memory allocator's fault. Jesus.
There are very few developers left in the US that even know what memory management is.
People don't even try anymore.
This sig is the express property of someone.
It's clear this is widely available - the devil is in the details. You can always abstract over a blocking system when you have threads/cores available. Those who work on fixed platforms like me kinda have to look to making the threading/locking more efficient. If you have the option of spending money on more potential parallel computing power for less, and can move up the chain, then solutions like this are more valuable.
It's interesting watching people talk about this kinda stuff, because ultimately the value of these kinds of approaches really either come down to what might be practical on the next game console that MS or Sony will release, or what is immediately valuable to you if you're pretty malloc bound and are in a position to replace your hardware anytime you like.
They mention word processors cause that's what turns heads. Ultimately, that's what R&D is .. its super specific, it's not immediately applicable to all, and in 20 years, it'll probably be standard.
As it pertains to this article .. it's handling the caching for new/free on a thread. As other posters have pointed out, it's not a new academic point of interest, and like all other things like this, it's value will go from specific to becoming an "ok, I have the hardware and or implementation bandwidth for that."
"Old man yells at systemd"
The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.
Link
You are being MICROattacked, from various angles, in a SOFT manner.
given you could do most things a typical word processor can do on a 486, why in the world would you need to parallelize it?
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
It's interesting to read your post http://duiattorneyorangecountyca.com/
Read the paper too. 100% agree with you, which of course makes sense since
IPDPS is known for publishing subpar systems papers.
While most programs do use libc others try to either avoid libc altogether (for example: http://blog.ksplice.com/2010/03/libc-free-world/ ) or use other "diet" versions of libc (for example: http://www.fefe.de/dietlibc/ )
What about those programs? Will they get to enjoy what the article talks about?
If I may quote: "Phkmalloc was written by Poul-Henning Kamp for FreeBSD in 1995-1996 and subsequently adapted by a number of operating systems, including NetBSD, OpenBSD, and several Linux distributions." Phkmalloc is the example implementation they used for the benchmarks..
I have a wonderful memory management technique to speed up apps by 20%:
MANAGE to put MOAR RAMZ.
o hai
I saw a talk by the Erlang developers that also said that malloc was the bottleneck in a multicore system, so they wrote their own for Erlang virtual machine. See "Insights" on page 11: http://www.erlang-factory.com/upload/presentations/188/ErlangUserConference2009-PatrikNyblom.pdf Or watch the talk: http://www.erlang-factory.com/conference/ErlangUserConference2009/speakers/PatrikNyblom
Shouldn't web browsers be almost embaressingly easy to parallellize, or am I missing something here?
Each page in its own thread / process, plus loading of page:
1. dns resolve
2. fetch html
3. parse html
3a. find css link, start thread for fetching and parsing css
3b. find image link, start thread for fetching and parsing image
3x. so on and so forth for other resources encountered.
4. run js in seperate thread if needed
5. run plugins in seperate threads / processes
The lags I forsee here are network related, and as far as I can see, it should be rather easy to parallellize large chunks of it.
But, as I haven't written my own web browser, I'm probably missing something fundamental here.
It's The Golden Rule: "He who has the gold makes the rules."
"The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."
What's so difficult about parallelizing Web browsers? I have 5 tabs open in my Google Chrome browser right now. Google Chrome runs each tab in it's own process. Seems to be taking advantage of my multi-core system quite well.
I can see how you can make gains when _DE_ allocating the memory, since there are no dependencies on the memory at that stage (by it's very nature, you've free'd it because nothing is using it), so I can see how deallocs can be totally decoupled and made to run asynchronously as a "fire and forget", that's an easy "win" if done right.
But I'm much less convinced about allocations via a separate thread, due to the very fact that you're implicitly waiting for either successful allocation, NULL or an exception, depending on your memory allocation error handling model....
there are many allocator implementations that behave way better than the single-thread-optimized default malloc. Hooray for another one. Here is one old article about that: http://developers.sun.com/solaris/articles/multiproc/multiproc.html IIRC even glibc uses a slab allocator these days, and is much better performance wise on multi processor machines.
Sig ?
How is this different than the linux slab allocator? http://en.wikipedia.org/wiki/Slab_allocation I guess I could RTFPdf ...
News at 11.
WTF Slashdot, why do I have to login 50 times to post?
It is concurrency to evolve to better memory management, not the opposite.
A modern PC has a small amount of RAM, plenty of Hard disk space and network storage.
Operating System evolved for managing CPU time and application memory. The application has little knowledge of memory management.
In a similar way, database are a way of managing huge amount of information, in the same abstraction like Service Oriented Application can be seen like an abstraction of web communication.
Take a look to
http://gioorgi.com/2009/evolving-concurrency/
-- Giovanni Daitan Giorgi http://gioorgi.com http://www.siforge.org
It also seems to me that the memory handling thread can also spend its idle time locating the next available chunk -- or several of them, based on size ranges. (This only makes sense with custom pooling.) That being said, I'm gonna go and read TFP to see if they've already talked about that...
Not much to see in the article. Move along.
IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).
One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends upon how much new memory is allocated depending on how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.
Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.
I always find it interesting when academics write papers about things that games programmers have been doing for years. Not that we had a client-server memory management thread mind you, because that would be fucking crazy, but preallocation of memory, and avoiding locks and syncs etc. Seriously, if you have a program that does 13 million allocations per second (see Table II), wouldn't a better memory manager be on your list?
wow, being from NCSU and personally knowing most of the people involved in this project, this is nothing but a paper. There was nothing much in this work that has not been done already and this paper itself got published after getting rejected numerous times in most high quality conferences. There are so many issues with this technique I dont know where to begin with. But it suffices to say that these people have abandoned this and moved on. How did this article become so popular ? Well, a few days back some newspaper people came to us to ask for a paper that has been published but not presented and this seemed to be the only one. They have glorified this paper so much in their article it almost sounds like an invention !!! Wow, we truly live in an mis-information age