Memory Management Technique Speeds Apps By 20%

← Back to Stories (view on slashdot.org)

Memory Management Technique Speeds Apps By 20%

Posted by kdawson on Monday April 5, 2010 @12:04PM from the rememberance-of-data-past dept.

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.

12 of 252 comments (clear)

Min score:

Reason:

Sort:

Re:Wow, this is pretty clever by nxtw · 2010-04-05 12:24 · Score: 3, Informative

If you want faster AES, just upgrade your CPU.
Re:Nothing to see here.... by kscguru · 2010-04-05 12:53 · Score: 4, Informative

Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.
With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if its not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.
And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.
Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...

--
A witty [sig] proves nothing. --Voltaire
Re:Wow, this is pretty clever by macemoneta · 2010-04-05 13:07 · Score: 4, Informative

Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.
That's already implemented in the high performance ssh patch, available here. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.
I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.

--
Can You Say Linux? I Knew That You Could.
Re:free() is probably more parallizable than mallo by JessGras · 2010-04-05 13:11 · Score: 5, Informative

Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.
Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!
Re:Beware the key term there: by Anonymous Coward · 2010-04-05 13:12 · Score: 3, Informative

The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.
It would be nice if Unix/Linux had easier and better support for asynchronous I/O.
Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.
Much simpler than Linux, where you have to use completely different and complex mechanisms if you want to use something better than synchronous read()/write() calls.
http://en.wikipedia.org/wiki/QIO
http://en.wikipedia.org/wiki/Asynchronous_System_Trap
Re:20%?! by naasking · 2010-04-05 13:30 · Score: 4, Informative

I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free.

--
Higher Logics: where programming meets science.
Re:You can malloc it but you can't use it by kisielk · 2010-04-05 13:43 · Score: 3, Informative

The PDF of the paper has all the details. The article is just fluff.
Re:Just remember to be aware of multi PROCESSOR by Mad+Merlin · 2010-04-05 14:20 · Score: 4, Informative

The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.

--
Game! - Where the stick is mightier than the sword!
Re:Beware the key term there: by EvanED · 2010-04-05 14:26 · Score: 3, Informative

Operating systems like VMS made all I/O asynchronous by default...
This is mostly true in Windows too actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether it should be asynchronous or synchronous. If synchronous, a higher level of the kernel handle that aspect itself, without the driver knowing anything about what's going on.
(I think this is more-or-less correct anyway.)
Re:Beware the key term there: by Anonymous Coward · 2010-04-05 15:28 · Score: 3, Informative

Clearly you didn't read the paper.
One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.
The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for it's 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.
Re:Nothing to see here.... by Anonymous Coward · 2010-04-05 19:28 · Score: 4, Informative

They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy looping it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is the lock will be held for a very short time, And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.
It's not a cute trick, its a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation is mostly false. It's true that it incurs some penalty for the worst case scenario.
Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"
In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.
Re:Nothing to see here.... by kscguru · 2010-04-06 05:12 · Score: 3, Informative

I should perhaps note that I do implement low-level libraries for an extremely reputable company as a day job; I'm familiar with low-level lock implementations both in kernel and in userlevel on Linux, Windows, and MacOS, and exactly how those implementations balance spinning versus blocking.
The Linux kernel preference for spinlocks dates from years ago, when the whole kernel ran under the BKL and was non-premptable anyways so you couldn't use blocking locks. When the BKL came out, all locks were made spinlocks to maintain correctness (and the -rt patchset started up, doing a conversion). The default implementation (still in use today by anything except the -rt patchset!) disables interrupts while any spinlock is held, and thus assumes the only thing holding the lock is another core.
In contrast, Solaris and Windows (and I think MacOSX, though I would have to check my references) use a mix of spinlocks and adaptive locks - spinlocks for use within interrupt handlers, and adaptive locks for everywhere else. Good pthread implementations (glibc included) use adaptive locks - which means the pthread implementation this paper declared too slow to use ALREADY spins ~1000 cycles before blocking. The canonical rule here is that an adaptive lock spins for the same amount of time it would take for a block/wakeup cycle, then blocks; this is guaranteed to be within a factor of 2 of optimal in all cases, which is the best overall lower bound you can possibly get. (Yes, Linux kernel is behind the times; they are slowly getting better, and when eventually the -rt patchset gets merged, Linux will have finally caught up. Sorry, Linux fanboys.)
Spinning versus blocking is a tradeoff. The research paper manages to extract all the gains from the "spin forever" side of the tradeoff without ever admitting the drawbacks (one full CPU core wasted).

--
A witty [sig] proves nothing. --Voltaire