Slashdot Mirror


Memory Management Technique Speeds Apps By 20%

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.

31 of 252 comments (clear)

  1. Beware the key term there: by Estanislao+Mart�nez · · Score: 5, Insightful

    Beware the key term there: "up to."

    1. Re:Beware the key term there: by Idiomatick · · Score: 3, Interesting

      Why couldn't it be applied at the compiler option level? A checkbox and recompile isn't so terrible. It could probably be done at the OS level but it'd be more of a pain.

    2. Re:Beware the key term there: by Anonymous Coward · · Score: 3, Informative

      The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.

      It would be nice if Unix/Linux had easier and better support for asynchronous I/O.

      Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.

      Much simpler than Linux, where you have to use completely different and complex mechanisms if you want to use something better than synchronous read()/write() calls.

      http://en.wikipedia.org/wiki/QIO
      http://en.wikipedia.org/wiki/Asynchronous_System_Trap

    3. Re:Beware the key term there: by Spatial · · Score: 4, Insightful

      I like to mentally replace that with the actual meaning: "between 0 and".

      It could allow software applications to run between 0 and 20% faster!

    4. Re:Beware the key term there: by Georules · · Score: 4, Insightful

      You might consider mentally replacing it with the sad reality that it might be between 0 and x faster AND it could also be infinitely slower.

    5. Re:Beware the key term there: by EvanED · · Score: 3, Informative

      Operating systems like VMS made all I/O asynchronous by default...

      This is mostly true in Windows too actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether it should be asynchronous or synchronous. If synchronous, a higher level of the kernel handle that aspect itself, without the driver knowing anything about what's going on.

      (I think this is more-or-less correct anyway.)

    6. Re:Beware the key term there: by Anonymous Coward · · Score: 3, Informative

      Clearly you didn't read the paper.

      One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.

      The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for it's 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.

    7. Re:Beware the key term there: by mswhippingboy · · Score: 4, Insightful

      What you are missing (as are most of the posters so far) is that there is considerable overhead involved in the actual management of the memory in terms of keeping track of what memory is free or allocated. This is outside the issue of maintaining locks. Moving this management overhead to a separate thread allows the otherwise single threaded app to take advantage of additional cores without any code changes. This does not appear all that novel however as modern garbage collectors do this today.

      --
      Sometimes the light at the end of the tunnel is the headlight of an oncoming train.
  2. Nothing to see here.... by Ancient_Hacker · · Score: 5, Insightful

    Nothing to see here...

    Moving malloc() to a separate thread does not do a thing for the putative word processor.

    They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.

    But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.

    1. Re:Nothing to see here.... by Anonymous Coward · · Score: 5, Funny

      new malloc()

      I see what you did there.

    2. Re:Nothing to see here.... by Anonymous Coward · · Score: 5, Interesting

      Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.

      If you want to test a new memory allocator, the benchmarks these days are the Hoard allocator, and the TCmalloc allocator from google. These alone will give you more than the 20% speedup mentioned in the article.

      However, that isn't the end of the story. There are proprietary allocators, like the Lockless memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.

    3. Re:Nothing to see here.... by kscguru · · Score: 4, Informative

      Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.

      With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if its not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.

      And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.

      Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...

      --

      A witty [sig] proves nothing. --Voltaire

    4. Re:Nothing to see here.... by b4dc0d3r · · Score: 4, Interesting

      Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.

      The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.

    5. Re:Nothing to see here.... by kscguru · · Score: 5, Interesting

      And now I've read their paper. Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).

      In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.

      --

      A witty [sig] proves nothing. --Voltaire

    6. Re:Nothing to see here.... by wealthychef · · Score: 3, Insightful

      But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.

      --
      Currently hooked on AMP
    7. Re:Nothing to see here.... by Angst+Badger · · Score: 3, Funny

      This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers.

      On the contrary, this opens up whole new possibilities for MS-DOS!

      --
      Proud member of the Weirdo-American community.
    8. Re:Nothing to see here.... by Anonymous Coward · · Score: 4, Informative

      They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy looping it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is the lock will be held for a very short time, And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.

      It's not a cute trick, its a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation is mostly false. It's true that it incurs some penalty for the worst case scenario.

      Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"

      In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.

    9. Re:Nothing to see here.... by kscguru · · Score: 3, Informative

      I should perhaps note that I do implement low-level libraries for an extremely reputable company as a day job; I'm familiar with low-level lock implementations both in kernel and in userlevel on Linux, Windows, and MacOS, and exactly how those implementations balance spinning versus blocking.

      The Linux kernel preference for spinlocks dates from years ago, when the whole kernel ran under the BKL and was non-premptable anyways so you couldn't use blocking locks. When the BKL came out, all locks were made spinlocks to maintain correctness (and the -rt patchset started up, doing a conversion). The default implementation (still in use today by anything except the -rt patchset!) disables interrupts while any spinlock is held, and thus assumes the only thing holding the lock is another core.

      In contrast, Solaris and Windows (and I think MacOSX, though I would have to check my references) use a mix of spinlocks and adaptive locks - spinlocks for use within interrupt handlers, and adaptive locks for everywhere else. Good pthread implementations (glibc included) use adaptive locks - which means the pthread implementation this paper declared too slow to use ALREADY spins ~1000 cycles before blocking. The canonical rule here is that an adaptive lock spins for the same amount of time it would take for a block/wakeup cycle, then blocks; this is guaranteed to be within a factor of 2 of optimal in all cases, which is the best overall lower bound you can possibly get. (Yes, Linux kernel is behind the times; they are slowly getting better, and when eventually the -rt patchset gets merged, Linux will have finally caught up. Sorry, Linux fanboys.)

      Spinning versus blocking is a tradeoff. The research paper manages to extract all the gains from the "spin forever" side of the tradeoff without ever admitting the drawbacks (one full CPU core wasted).

      --

      A witty [sig] proves nothing. --Voltaire

  3. Wow, this is pretty clever by Omnifarious · · Score: 3, Interesting

    I wish I'd thought of it.

    Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.

    1. Re:Wow, this is pretty clever by nxtw · · Score: 3, Informative

      If you want faster AES, just upgrade your CPU.

    2. Re:Wow, this is pretty clever by macemoneta · · Score: 4, Informative

      Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.

      That's already implemented in the high performance ssh patch, available here. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.

      I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.

      --

      Can You Say Linux? I Knew That You Could.

    3. Re:Wow, this is pretty clever by nxtw · · Score: 3, Insightful

      Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)

      Are your storage and network devices that fast?

  4. 20%?! by temojen · · Score: 5, Insightful

    If most programs are spending 20% of their time on memory management, something is wrong.

    1. Re:20%?! by naasking · · Score: 4, Informative

      I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free.

    2. Re:20%?! by guyminuslife · · Score: 5, Funny

      I was aware that malloc() had a price tag attached, but free()? That's misleading advertising.

      --
      I don't believe in time. It's a grand conspiracy designed to sell watches.
  5. Re:Might be particularly applicable to Java by SpazmodeusG · · Score: 4, Interesting

    Actually i've found the opposite. Java tends to be really good at transparent memory re-use. From experience if i have ~1,000,000 objects of the same type with the some constantly being deleted and replaced Java will run that program faster than a non-memory pooled C implementation (of course the memory pooled C implementation will be faster again).

    In fact many of the benchmarks around that you see claiming Java is faster than C will use an example of a program that creates and destroys objects in a tight loop. The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.

  6. It's programmers that need parallelization by w0mprat · · Score: 5, Insightful

    Because we learnt to program for a single threaded core with it's single processing pipeline since way back, using high level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.
     
    If I still coded much anymore it would drive me to drink.

    --
    After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
  7. Re:free() is probably more parallizable than mallo by JessGras · · Score: 5, Informative

    Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.

    Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!

  8. Re:You can malloc it but you can't use it by kisielk · · Score: 3, Informative

    The PDF of the paper has all the details. The article is just fluff.

  9. Re:Just remember to be aware of multi PROCESSOR by Mad+Merlin · · Score: 4, Informative

    The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.

  10. Re:Does it matter anymore? by jasmusic · · Score: 3, Insightful

    Those developers can hold the rest of the software industry hostage for mad income. OS kernels don't write themselves.