Memory Management Technique Speeds Apps By 20%
Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.
Beware the key term there: "up to."
Are you adequate?
Sound nice, but I hope the library also handle multi processor systems, where each processer have its own ram block. You don't want one cpu to allocate the memory, which is used by an other cpu.
Do operation systems even have support to say "These 2 threads should run on the same processor, but I don't care about which one".
Nothing to see here...
Moving malloc() to a separate thread does not do a thing for the putative word processor.
They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.
But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.
You can use a wait-free lock-free data structure to add deallocated memory to a free list to reduce contention on the critical section in a heap. You can also use it to implement the idea presented in the paper.
On Windows, you can use the Interlocked Singly Linked List data structure (see InitializeSListHead).
You can also partition memory allocations over several heaps to reduce contention further.
"The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."
The release says this technique is for programs difficult to parellelize "by running the memory-management functions on a separate thread".
I wish I'd thought of it.
Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
Need a Python, C++, Unix, Linux develop
If most programs are spending 20% of their time on memory management, something is wrong.
Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.
Uhm, so what's the big deal? .Net's garbage collector runs in its own thread.
No, I will not work for your startup
A common simplified structure is:
With these new innovations you get:
And free shouldn't take a noticable amount of time.
Because we learnt to program for a single threaded core with it's single processing pipeline since way back, using high level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.
If I still coded much anymore it would drive me to drink.
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
As far as I can tell, the author of this paper just figured out a way to offload a bunch of memory management tasks to an idle CPU core, and then counted it as a performance gain. OK?
So, monolithic single-threaded applications can be made faster on multi-core systems, at the cost of bulk-allocating more memory than is actually required, and only if there is enough idle CPU to run your memory-management thread in realtime on another core. I am not exactly blown away.
The press release tries to trump it up as some kind of major advancement in performance, but it's not a performance gain at all. If you ran four copies of these single-threaded benchmarks simultaneously, you'd probably see a net performance decrease.
Haven't ready the OP: shame on me.
But as it is true that a program which calls malloc() is basically blocked until malloc() returns, the trick here could be that malloc() is made to be a lot dumber... but free() just puts pointers in a queue "to the other thread" and then the other thread can be more time consuming about being smart with making that space available to malloc() again.
OK I'm expressing it badly. Obviously an app which needs memory is blocked until it gets the memory. But the same isn't true of releasing memory: you don't *need* your free()ed blocks to be immediately available. So maybe that's what they do: backload the joint malloc()+free() work into free(), make malloc() really dumb, and shift the free() work off to the other thread via a simple queue.
Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
The Fivefold Mother
So basically, they apply garbage collection techniques to regular malloc/free program
Not bad, all things considered...
There are very few developers left in the US that even know what memory management is.
People don't even try anymore.
This sig is the express property of someone.
It's clear this is widely available - the devil is in the details. You can always abstract over a blocking system when you have threads/cores available. Those who work on fixed platforms like me kinda have to look to making the threading/locking more efficient. If you have the option of spending money on more potential parallel computing power for less, and can move up the chain, then solutions like this are more valuable.
It's interesting watching people talk about this kinda stuff, because ultimately the value of these kinds of approaches really either come down to what might be practical on the next game console that MS or Sony will release, or what is immediately valuable to you if you're pretty malloc bound and are in a position to replace your hardware anytime you like.
They mention word processors cause that's what turns heads. Ultimately, that's what R&D is .. its super specific, it's not immediately applicable to all, and in 20 years, it'll probably be standard.
As it pertains to this article .. it's handling the caching for new/free on a thread. As other posters have pointed out, it's not a new academic point of interest, and like all other things like this, it's value will go from specific to becoming an "ok, I have the hardware and or implementation bandwidth for that."
"Old man yells at systemd"
The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.
Link
You are being MICROattacked, from various angles, in a SOFT manner.
given you could do most things a typical word processor can do on a 486, why in the world would you need to parallelize it?
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
If I may quote: "Phkmalloc was written by Poul-Henning Kamp for FreeBSD in 1995-1996 and subsequently adapted by a number of operating systems, including NetBSD, OpenBSD, and several Linux distributions." Phkmalloc is the example implementation they used for the benchmarks..
I saw a talk by the Erlang developers that also said that malloc was the bottleneck in a multicore system, so they wrote their own for Erlang virtual machine. See "Insights" on page 11: http://www.erlang-factory.com/upload/presentations/188/ErlangUserConference2009-PatrikNyblom.pdf Or watch the talk: http://www.erlang-factory.com/conference/ErlangUserConference2009/speakers/PatrikNyblom
How does this compare to jemalloc(), for example? This paper seems to be either very overhyped or plain outright silly. What does this do that other methods to improve allocations like the Windows Low Fragmentation Heap (LFH) and BSD's jemalloc() don't?
It solves an entirely different problem: those systems improve multithreaded allocation performance by (effectively) reducing the amount of locking and/or number of cache misses per operation. This system improves *single* threaded performance by offloading work to a new thread.
Shouldn't web browsers be almost embaressingly easy to parallellize, or am I missing something here?
Each page in its own thread / process, plus loading of page:
1. dns resolve
2. fetch html
3. parse html
3a. find css link, start thread for fetching and parsing css
3b. find image link, start thread for fetching and parsing image
3x. so on and so forth for other resources encountered.
4. run js in seperate thread if needed
5. run plugins in seperate threads / processes
The lags I forsee here are network related, and as far as I can see, it should be rather easy to parallellize large chunks of it.
But, as I haven't written my own web browser, I'm probably missing something fundamental here.
It's The Golden Rule: "He who has the gold makes the rules."
there are many allocator implementations that behave way better than the single-thread-optimized default malloc. Hooray for another one. Here is one old article about that: http://developers.sun.com/solaris/articles/multiproc/multiproc.html IIRC even glibc uses a slab allocator these days, and is much better performance wise on multi processor machines.
Sig ?
How is this different than the linux slab allocator? http://en.wikipedia.org/wiki/Slab_allocation I guess I could RTFPdf ...
News at 11.
WTF Slashdot, why do I have to login 50 times to post?
It is concurrency to evolve to better memory management, not the opposite.
A modern PC has a small amount of RAM, plenty of Hard disk space and network storage.
Operating System evolved for managing CPU time and application memory. The application has little knowledge of memory management.
In a similar way, database are a way of managing huge amount of information, in the same abstraction like Service Oriented Application can be seen like an abstraction of web communication.
Take a look to
http://gioorgi.com/2009/evolving-concurrency/
-- Giovanni Daitan Giorgi http://gioorgi.com http://www.siforge.org
It also seems to me that the memory handling thread can also spend its idle time locating the next available chunk -- or several of them, based on size ranges. (This only makes sense with custom pooling.) That being said, I'm gonna go and read TFP to see if they've already talked about that...
Not much to see in the article. Move along.
IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).
One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends upon how much new memory is allocated depending on how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.
Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.
I always find it interesting when academics write papers about things that games programmers have been doing for years. Not that we had a client-server memory management thread mind you, because that would be fucking crazy, but preallocation of memory, and avoiding locks and syncs etc. Seriously, if you have a program that does 13 million allocations per second (see Table II), wouldn't a better memory manager be on your list?
wow, being from NCSU and personally knowing most of the people involved in this project, this is nothing but a paper. There was nothing much in this work that has not been done already and this paper itself got published after getting rejected numerous times in most high quality conferences. There are so many issues with this technique I dont know where to begin with. But it suffices to say that these people have abandoned this and moved on. How did this article become so popular ? Well, a few days back some newspaper people came to us to ask for a paper that has been published but not presented and this seemed to be the only one. They have glorified this paper so much in their article it almost sounds like an invention !!! Wow, we truly live in an mis-information age