Slashdot Mirror


Memory Management Technique Speeds Apps By 20%

Dotnaught writes "A paper (PDF) to be presented later this month at the IEEE International Parallel and Distributed Processing Symposium in Atlanta describes a new approach to memory management that allows software applications to run up to 20% faster on multicore processors. Yan Solihin, associate professor of electrical and computer engineering at NCSU and co-author of the paper, says that using the technique is just a matter of linking to a library in a program that makes heavy use of memory allocation. The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers." Informationweek has a few more details from an interview with Solihin.

252 comments

  1. Beware the key term there: by Estanislao+Mart�nez · · Score: 5, Insightful

    Beware the key term there: "up to."

    1. Re:Beware the key term there: by Jurily · · Score: 1

      OK, so they run memory allocation in a separate thread. What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?

      The idea sounds fun, but this approach requires a rewrite to make it usable, like most everything else out there.

      The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.

    2. Re:Beware the key term there: by Idiomatick · · Score: 3, Interesting

      Why couldn't it be applied at the compiler option level? A checkbox and recompile isn't so terrible. It could probably be done at the OS level but it'd be more of a pain.

    3. Re:Beware the key term there: by Anonymous Coward · · Score: 3, Informative

      The most noticeable speedup I found with threading was to separate disk I/O out in its own thread.

      It would be nice if Unix/Linux had easier and better support for asynchronous I/O.

      Operating systems like VMS made all I/O asynchronous by default, with optional flags or function wrappers that would wait for an I/O to complete before returning if you really wanted synchronous I/O. You could even designate a user function to run asynchronously (without the drawbacks and limitations of Unix signals) whenever any specific I/O completed.

      Much simpler than Linux, where you have to use completely different and complex mechanisms if you want to use something better than synchronous read()/write() calls.

      http://en.wikipedia.org/wiki/QIO
      http://en.wikipedia.org/wiki/Asynchronous_System_Trap

    4. Re:Beware the key term there: by Spatial · · Score: 4, Insightful

      I like to mentally replace that with the actual meaning: "between 0 and".

      It could allow software applications to run between 0 and 20% faster!

    5. Re:Beware the key term there: by Georules · · Score: 4, Insightful

      You might consider mentally replacing it with the sad reality that it might be between 0 and x faster AND it could also be infinitely slower.

    6. Re:Beware the key term there: by EvanED · · Score: 3, Informative

      Operating systems like VMS made all I/O asynchronous by default...

      This is mostly true in Windows too actually, given NT's strong VMS inspirations. From what I understand, drivers implement (only) asynchronous I/O calls, and the read/write system calls (NtReadFile and NtWriteFile) contain a specification of whether it should be asynchronous or synchronous. If synchronous, a higher level of the kernel handle that aspect itself, without the driver knowing anything about what's going on.

      (I think this is more-or-less correct anyway.)

    7. Re:Beware the key term there: by gmhowell · · Score: 1

      Why limit yourself? With enough overhead from the memory management, you could get it to go some negative amount faster.

      --
      Jesus was all right but his disciples were thick and ordinary. -John Lennon
    8. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      Not really, where you get your inferior limit from?

      It actually means, betwen -infinutum to 20% faster

    9. Re:Beware the key term there: by Anonymous Coward · · Score: 3, Informative

      Clearly you didn't read the paper.

      One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.

      The MM thread preallocates blocks the application is likely to ask for, so that when the application asks for it's 300th small block for storing window coordinates or whatever, the memory manager thread can instantly say "here, I had that". It also batches deallocations and does them out-of-band, while the application continues running.

    10. Re:Beware the key term there: by mdf356 · · Score: 2, Interesting

      but this approach requires a rewrite to make it usable

      A rewrite of part of libc, yes. Change the implementation of malloc(3) and link with -lpthread and you're pretty much done.

      I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.

      --
      Terrorist, bomb, al Qaeda, nuclear, yellowcake, kill, assassinate. Carnivore is dead... long live Echelon.
    11. Re:Beware the key term there: by SanityInAnarchy · · Score: 2, Informative

      Actually, I can see how malloc would help, if you assume that they're always allocating small-ish amounts -- just keep a certain amount reserved, or don't free stuff instantly (in case it's about to be reallocated).

      However, all of this seems very much like it could be done either inside libc (as proposed) or in the kernel, without having to touch existing apps, at least on platforms like Linux where libc is shared.

      --
      Don't thank God, thank a doctor!
    12. Re:Beware the key term there: by nikanth · · Score: 1

      Upto 20% may not be actually 0 to 20% always, it could be -infinity to 20%

    13. Re:Beware the key term there: by mswhippingboy · · Score: 4, Insightful

      What you are missing (as are most of the posters so far) is that there is considerable overhead involved in the actual management of the memory in terms of keeping track of what memory is free or allocated. This is outside the issue of maintaining locks. Moving this management overhead to a separate thread allows the otherwise single threaded app to take advantage of additional cores without any code changes. This does not appear all that novel however as modern garbage collectors do this today.

      --
      Sometimes the light at the end of the tunnel is the headlight of an oncoming train.
    14. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      I like to mentally replace that with the actual meaning: "between 0 and".

      It could allow software applications to run between 0 and 20% faster!

      Why assume the lower bound is 0%? I read it as "between -100% and 20% faster"!

    15. Re:Beware the key term there: by Darinbob · · Score: 2, Interesting

      I had a prof who referred to benchmarks and predictions as "guaranteed not to exceed" numbers.

    16. Re:Beware the key term there: by Jurily · · Score: 2

      Whatever the case may be, the idea certainly didn't spread into neither (cross-platform?) APIs, nor application design. Qt, for example, offers no asynchronous file operations, and most applications I've seen do disk I/O in the GUI thread.

      It's easy to notice because everything grinds to a halt when the disk is thrashing. One would think we can do better in the age of supercomputers.

    17. Re:Beware the key term there: by Michael+Kristopeit · · Score: 1
      imagine a dynamic construction project on mars... there has to be a process if a part is needed from earth... either the people who need parts will have their own ships, or have make a request for someone else's ship to bring the part to them. depending on the volume of part requests, different approaches become ideal, and any approach will fall victim to a worst case scenario.

      this new approach streamlines a problem of construction managers implementing a process to immediately dispatch a ship to earth every time they need a single screw, and they never keep a supply on hand. this process will put a screw store, with a supply to last far longer than the time between necessary ship trips to earth with free cargo space to lease, right in the ship dock, and flag down pilots before they leave for earth.

      you might wonder how the pilot or part manager never looked around to realize the waste of their process.... but mars doesn't even have an atmosphere yet, WE'RE TOO BUSY!

    18. Re:Beware the key term there: by Michael+Kristopeit · · Score: 1

      so from the aspect of the part manager (the user program), there is no change in process (no need for rewrite).... they just send their pilot out who comes back a lot sooner than they thought he would.

    19. Re:Beware the key term there: by spitzak · · Score: 1

      Don't Linux drivers work this way as well?

      It is true that the normal POSIX api does not have async methods. I am unsure if Linux offers any. However on both Linux and Windows you nowadays get better performance by using multiple threads and having one of them do the synchronous calls.

    20. Re:Beware the key term there: by AcidPenguin9873 · · Score: 2, Informative

      What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?

      They keys are speculative allocation, and batch freeing. They decouple the actual allocation/deallocation that the system's memory management library performs (which may involve slow system calls into the kernel, even), from the malloc and free calls that the program makes. By decoupling the rest of the program thread from the memory allocation thread, the application then doesn't always have to wait for all the accounting and data structure manipulation that malloc and free do. Of course there are times when it does have to wait, but with intelligent speculative mallocs, and batched frees, they try to minimize those.

      It's actually very similar to threaded I/O in that it decouples two aspects of the program, and allows for forward progress on things that aren't dependent on the thing that gets shoved off to another thread.

    21. Re:Beware the key term there: by KeithIrwin · · Score: 2, Informative

      It's true that in this case we're looking at a max of about 20%, but we're also looking at an average of about 16-18% (I'm eye-balling from the graphs). There's one test in the benchmark suite which is almost entirely CPU-bound and it is only a few percentage points faster. This, of course, makes perfect sense as some real processes are CPU-bound and so should be included in the benchmark suite. But realistically, no allocation scheme can speed up a process which does almost no allocating by very much. The rest of the tests in the benchmark show a pretty uniform increase in speed in the 15-22% range (again, eye-balling from the charts).

    22. Re:Beware the key term there: by Hognoxious · · Score: 1

      Nope. I don't get it. Is it like I'm driving a car, and I hit the gas pedal and the engine knew I was going to do that and already revved itself up?

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
    23. Re:Beware the key term there: by Anonymous Coward · · Score: 1, Interesting

      I don't see how spinning off malloc(3) calls would help anything, but if there's unused CPUs then clearly free(3) can be done by another thread.

      With compiler support there could be a significant benefit here. Suppose you have a function that looks like this:

      void fn(int array[], const unsigned size)
      {
              for(int i=0; i size; i++)
                      array[i] = array[i] * 3;
              int* newarray = calloc(size, sizeof(int));
              for(int i=0; i size; i++)
                      newarray[i] = array[i] + 5; // etc.
      }

      If you notice, the memory allocation is independent of the first loop. Suppose the compiler knows what malloc/calloc is and can rewrite this function internally like this:
      void fn(int array[], const unsigned size)
      {
              calloc_in_a_different_thread(size, sizeof(int));
              for(int i=0; i size; i++)
                      array[i] = array[i] * 3;
              int* newarray = get_result_of_the_async_calloc();
              for(int i=0; i size; i++)
                      newarray[i] = array[i] + 5; // etc.
      }

      That way the memory allocation is happening at the same time as the first loop is executing. The memory allocation becomes "free" (no pun intended) as long as there is enough independent code to execute concurrently with the malloc.

    24. Re:Beware the key term there: by weicco · · Score: 2

      IIRC NT 5.1 and earlier didn't have asynchronous I/O canceling. That caused some problems even when I/O stuff were ran asynchronously. But my memory is bit flaky so I just can't remember what those problems were... Vista and 7 has asynchronous canceling also.

      --
      You don't know what you don't know.
    25. Re:Beware the key term there: by Sun · · Score: 2, Informative

      Maybe at the API level. The API for asynchronous IO is there for every system call in Win32. What isn't there, however, is the use. I've read through the specs for doing "overlapped IO", and the conclusion is always the same - getting the semantics right is a pain. It's so much easier to create a GUI thread and a working thread, that no one I have ever seen bothered with it.

      Linux does have asynchronous IO, and they suffer from, pretty much, the same problem - getting the semantics right is difficult.

      Shachar

    26. Re:Beware the key term there: by Cyberax · · Score: 1

      No, they have it since long ago.

      For example, to move files to the trashbin Explorer first sends 'delete' command to check that it has enough privileges to proceed and then _cancels_ the pending 'delete' command to stop file from actually being destroyed.

    27. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      But that would be completely wrong. In worst case scenarios it could be much much slower than the existing methods. So 0% is not a lower bound. They only listed an upper bound, saying that in no case will it be more than 20% faster.

    28. Re:Beware the key term there: by Jurily · · Score: 1

      just keep a certain amount reserved, or don't free stuff instantly (in case it's about to be reallocated).

      I imagine that wouldn't work well with some security features.

      You don't want to be labeled "less secure than Vista", do you?

    29. Re:Beware the key term there: by gwjgwj · · Score: 1

      I like to mentally replace that with the actual meaning: "between 0 and".

      Are you sure, that this implementation cannot be actually slower in some cases?
      "It could allow software applications to run between -20% and 20% faster!"

    30. Re:Beware the key term there: by KiloByte · · Score: 2, Interesting

      Except that pre-allocating small chunks of expected size can be done much faster in-thread if you first allocate large chunks and then sub-allocate constant sized parts. That's what g_slice() from glib does.

      Replacing malloc() with g_slice() tends to improve allocation speed by insane factors -- I've seen cases where the speedup of allocations was 10x, and since the program in question was quite malloc-heavy, the overall speed was increased by over 100%.

      --
      The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
    31. Re:Beware the key term there: by weicco · · Score: 1

      I meant that allthough there is "cancel" command it does not actually cancel the query asynchronously but synchronously thus blocking until cancellation is complete. I may wrong here though and it wouldn't be the first time :)

      --
      You don't know what you don't know.
    32. Re:Beware the key term there: by Chatterton · · Score: 1

      I always know that the Prius was in advance of its time :)

    33. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      No, it's asynchronous as well.

      In Windows _all_ requests are represented as asynchronous IRPs, which can be deferred or canceled 'in-flight'. That's actually a clever way to create a proprietary kernel, which can nevertheless be easily expanded by 3-d part developers.

    34. Re:Beware the key term there: by home-electro.com · · Score: 1

      That's what I think.
      20% percent speedup appears to be huge exaggeration. I mean what does the app need to do in the first place to be able to save 20% just by moving free to another thread?

    35. Re:Beware the key term there: by asdf7890 · · Score: 1

      OK, so they run memory allocation in a separate thread. What exactly does the other thread do while the mm thread is running, and if it blocks like I think, how does that speed anything up?

      If the new memory allocation thread keeps a small pool pre-allocated then the calling thread doesn't need to block unless it needs something more than the pool has available. Whether or not such a pool is kept, when free is called it can return almost immediately so the main thread can get on with something else leaving the memory management thread to either add the freeed block back to the pool or hand it back to the OS in the background (presumably another CPU/core).

      For a task that allocates and deallocates small bocks a lot, and isn't itself keeping the available cores busy, I can image this giving a noticeable-but-not-massive boost in performance which could be significant for long-running number crunching or high performance tasks like physics engine calculations. Though it is not unlikely that some use patterns will see a small performance drop with this technique so blindly applying it everywhere is not a good idea (even where there may be a small gain I'd refrain - a small gain in performance may not be worth the little bit of extra complexity when debugging and potential limiting of portability).

    36. Re:Beware the key term there: by SkunkPussy · · Score: 1

      read the paper, its quite approachable and your questions are answered.

      (basically it pre-allocates in the separate thread and doesn't involve any rewrite of source code as it can be implemented via library preloading)

      --
      SURELY NOT!!!!!
    37. Re:Beware the key term there: by ILostMyOldUserID · · Score: 2, Funny

      Checkbox with a compiler?!! I denounce you as a Windows user - hand in your geek card!

    38. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      Actually, it is possible that some applications could run more slowly. Right?

    39. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      Various linux kernel schedulers.
      Do not leave the programmer to manage that which is historically his greatest failure. :. EOD QFT

    40. Re:Beware the key term there: by Sprouticus · · Score: 1

      Well assuming the memory manager was smart enough to not use the same core as the primary app thread(s),, I'm not sure how it could make things slower. i suppose they could be short sighted and not have the api verify that there IS a free core to use, but since that is the point of the api I would hope they were smart enough to check for that.

    41. Re:Beware the key term there: by Anonymous Coward · · Score: 2, Funny

      He could be talking about Eclipse, but I think that's just as bad.

    42. Re:Beware the key term there: by tibit · · Score: 2, Insightful

      I think that part of the problem is that there are still human developers on the other side of the keyboard. Code that utilizes asynchronous I/O in the general case, where you may be accessing multiple files from different places in your application, is just a pain to write in languages like C or C++.

      You need at least sensible coroutine support to make it palatable, IMHO. To really utilize async I/O without spawning many threads that each use sync I/O, you need to have cooperative multitasking -- thus coroutines or somesuch.

      --
      A successful API design takes a mixture of software design and pedagogy.
    43. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      What? I did Windows programming for years, before moving to real-time embedded systems. What's so hard about Async. I/O???

    44. Re:Beware the key term there: by 517714 · · Score: 2, Informative

      Are you sure that zero is the baseline? They may have suckered you in.

      There is undoubtedly some trivial utilization of memory for which the overhead exceeds any gains.

      --
      The US government have made it clear that we have no inalienable rights; any we do not defend vigorously will be taken.
    45. Re:Beware the key term there: by Idiomatick · · Score: 2, Funny

      I just assumed Windows needed the help more, my bad!

    46. Re:Beware the key term there: by thechao · · Score: 1

      I prefer between -20 and 20% faster.

    47. Re:Beware the key term there: by spacey · · Score: 2, Interesting

      One of the goals was to *not* require a rewrite of applications, and they succeeded on that goal.

      This is interesting stuff, but if the goal is to not have to change source, isn't this sub-par? Hasn't the Boehm collector been tested as faster than using malloc/free forever? See http://www.drdobbs.com/cpp/184401632;jsessionid=IRGXEUGCDWGBJQE1GHOSKH4ATMY32JVN for a trivial example (a paper at ftp.cs.boulder.edu is offline, I guess with the server for now).

      -Peter

      --
      == Just my opinion(s)
    48. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      Yes, except.... What guarantee do we have 0 is the lower bound here ? Many a speedup technique slows stuff down when things are not the way you assume (you always make assumptions !) when designing the technique. So I suggest this :

      It could allow software applications to run somewhere between infinitely much slower and 20% faster!

    49. Re:Beware the key term there: by EvanED · · Score: 1

      Don't Linux drivers work this way as well?

      My understanding is that this is not the case, and that Linux drivers have separate entry points for synchronous and asynchronous I/O, and that implementing the asynch I/O calls isn't required (and if an application then tries to make an asynchronous call, it'll just fail).

      This is basically the one way that implementing a Linux driver is actually more complex in some sense than a windows driver, since for a production driver you really want the asynchronous version too, and in the Windows world you don't have to think about the synchronous version yourself at all.

      (That said, I'm not a big kernel hacker or anything, so I could be wrong.)

    50. Re:Beware the key term there: by newdsfornerds · · Score: 1

      You are inadequate if you think Thomas Kinkade is an artist. I clicked the link in your sig and now I need to flush my browser's cache and wash my hands.
      Yikes!

      --
      Damping absorbs vibrations. Dampening is caused by moisture.
    51. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      Insight? Only if insightful means moronic. The whole point of async. I/O is you DON'T need multitasking. I've been writing real-time systems for 20+ yrs., and I gotta say: you and your moderators don't know jack. You WANT sync. I/O for cooperative multitasking, not async! You want something to block on.

    52. Re:Beware the key term there: by agbinfo · · Score: 1

      Would you always need a free core? Since they are doing pre-allocation, the allocator thread could be allocating some of the memory during the I/O cycles of the application thread. The gain may not be large enough for improvement overall but it might offset the overhead. Similarly, by pooling the free requests, executing them in a separate thread might provide some gains even with a single core. ?

    53. Re:Beware the key term there: by Spatial · · Score: 1

      Good point. I've been had!

    54. Re:Beware the key term there: by jgrahn · · Score: 1

      Beware the key term there: "up to."

      Yeah. And also: you don't normally even *notice* a 20% speed increase. If you have to wait 5 seconds or 4 seconds doesn't really matter -- you're annoyed in either case.

    55. Re:Beware the key term there: by Anonymous Coward · · Score: 0

      You'd better translate it for "between -infinity and".

      You're (incorrectly) assuming this technique cannot have a bad impact in performance.

    56. Re:Beware the key term there: by lsatenstein · · Score: 1

      It should not be a compiler option. I could request pooled memory for use as global, local, shared memory, semaphore or buffer and continue processing in anticipation that when I need it, it will be there. Consider a simple case using C language char *ptr = malloc(somememory amount); // Ask for memory using a new contrived malloc function to use the new facility execute a few statements then if (memory_obtained(ptr) == NULL) // wait for completion or test for failure to obtain memory crash..... else continue working. memory_obtained() is the function I use to test if I may need to wait for the new malloc() to complete, If the test was done immediately, without intervening instructions, we are serializing our work. Is it a compiler function to do the blocking? I think not. Good Idea for a research project. I like it.

      --
      Leslie Satenstein Montreal Quebec Canada
  2. Just remember to be aware of multi PROCESSOR by TheSunborn · · Score: 1

    Sound nice, but I hope the library also handle multi processor systems, where each processer have its own ram block. You don't want one cpu to allocate the memory, which is used by an other cpu.

    Do operation systems even have support to say "These 2 threads should run on the same processor, but I don't care about which one".

    1. Re:Just remember to be aware of multi PROCESSOR by WrongSizeGlass · · Score: 1

      What platform is this library available for? I didn't see it in the article.

      How much does this slow down an application that's running on a single CPU with a single core? Splitting memory allocation off into its own thread will negatively impact performance when running on many of the existing desktops out there.

    2. Re:Just remember to be aware of multi PROCESSOR by jisatsusha · · Score: 1

      Windows does - SetThreadIdealProcessor().

    3. Re:Just remember to be aware of multi PROCESSOR by sys.stdout.write · · Score: 1, Troll

      wrapped in a [sic] ugly brown robe and a poorly draped orange sarong

      Ah, the Ubuntu color scheme!

    4. Re:Just remember to be aware of multi PROCESSOR by Idiomatick · · Score: 1

      It is purple now you insensitive clod!

    5. Re:Just remember to be aware of multi PROCESSOR by Mad+Merlin · · Score: 4, Informative

      The type of system you're talking about is NUMA (Non-Uniform Memory Architecture), and yes, any OS worth its salt has supported it automagically for years. I think even Windows advertises support for NUMA now, whether it works is another question.

    6. Re:Just remember to be aware of multi PROCESSOR by maxwell+demon · · Score: 1

      That's not what he requested. It lets you say "I want this thread run on that core." What he wanted was a function to tell "I want this thread to be run on whatever core that other thread runs without caring which core it is."

      --
      The Tao of math: The numbers you can count are not the real numbers.
    7. Re:Just remember to be aware of multi PROCESSOR by GigaplexNZ · · Score: 1

      "I want this thread to be run on whatever CPU that other thread runs without caring which core it is."

      I assume you meant something more along these lines?

    8. Re:Just remember to be aware of multi PROCESSOR by godrik · · Score: 1

      well, I guess it depends on your memory allocation policy. I used to use numactl to have a firsttouch allocation policy, which is, the physical memory is allocated when first read or writen and it is allocated to the processor that did this access.

  3. Nothing to see here.... by Ancient_Hacker · · Score: 5, Insightful

    Nothing to see here...

    Moving malloc() to a separate thread does not do a thing for the putative word processor.

    They might get some speedup if they take a lousy old malloc() and have one thread hold onto the locks.

    But of course the *right* way would be to write a new malloc() that can from the get-go run re-entrantly and not require a bevy of slow locks.

    1. Re:Nothing to see here.... by Anonymous Coward · · Score: 5, Funny

      new malloc()

      I see what you did there.

    2. Re:Nothing to see here.... by Anonymous Coward · · Score: 5, Interesting

      Exactly, and they are even comparing it to the old and relatively slow Doug Lea allocator.

      If you want to test a new memory allocator, the benchmarks these days are the Hoard allocator, and the TCmalloc allocator from google. These alone will give you more than the 20% speedup mentioned in the article.

      However, that isn't the end of the story. There are proprietary allocators, like the Lockless memory allocator, that are about twice as fast as the older allocators which aren't optimized for multi-core machines.

    3. Re:Nothing to see here.... by kscguru · · Score: 4, Informative

      Indeed. This "technique" appears to be nothing more than replacing a poorly-locked malloc() implementation with a good malloc() implementation that has better locks and (probably) does some work speculatively.

      With a proper malloc() implementation, locks are NOT highly contended and a thread doing malloc() does not lock out other threads for a long period of time. In theory, the overhead of managing the queueing / signalling to offload work to a malloc()-thread should be higher than the overhead of doing a proper malloc() in the first place - if its not, then the original malloc() implementation is sub-optimal. Modern malloc() implementations use slabs, thread-local caches, and other tricks to avoid slow paths - they really aren't that inefficient, there isn't "20% CPU time" left to squeeze out unless your baseline is a non-optimal malloc() in the first place. Which leads me to conclude that they are doing speculative execution: use an extra thread to pre-expand caches, pre-fault pages, pre-grow the heap segment, and burn a bunch of CPU time on another thread to do it. Speculative execution is, after all, the sexy research area nowadays (especially for some EE researchers who like to publish "Foo runs 20% faster!" papers while ignoring the larger systemic slowdown they induced) - speculative execution only works when hardware is idle, and in the current climate of low-power computing, it's cheaper to turn off idle hardware than use it for speculative execution.

      And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup, and I think they are more likely to solve P=NP than to have actually made malloc() take less total work ... which is why I'm so convinced this is just speculative execution, the only way to do less work is to guess what that work was beforehand and burn another thread doing it.

      Now, maybe the paper is more restrained in its claims and it's just the journalist hyping this up. But if this is the hyped-up work coming out of a CS department, I wouldn't want to go there...

      --

      A witty [sig] proves nothing. --Voltaire

    4. Re:Nothing to see here.... by AuMatar · · Score: 2, Insightful

      Wouldn't it be rather trivial to write a lockless malloc? Just have every thread allocate its own memory and maintain its own free list- problem solved.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    5. Re:Nothing to see here.... by martin-boundary · · Score: 2, Interesting

      And we don't see the trade-off. A technique isn't very good if it burns 40% more CPU time (and thus 40% more battery life) to get a 20% speedup

      There's theoretical good and then there's practical good. A good rule of thumb these days is that RAM is the new disk, and most current and legacy software results in huge numbers of CPU stalls. If those stalls can be converted to useful work, even at 2:1 conversion, that's better than having the stalls.

    6. Re:Nothing to see here.... by b4dc0d3r · · Score: 4, Interesting

      Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted, or turn off all interrupts and switching for a few hundred cycles so it can do the allocation. Usually the runtime requests reserved pages far in excess of what it needs, and then doles out pieces, committing them as needed. You need 2k, so the runtime gets 4MB and commits 32k page(s). Next time you need more, then the runtime just returns more of the 32k block.

      The operating system has to lock its list for the runtime, and/or the runtime does the same for the program. Someone's locking something somewhere.

    7. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      Most CPU architectures support atomic operations - that's all you need to build lock-free architectures. No disabling of interrupts required.

    8. Re:Nothing to see here.... by kscguru · · Score: 5, Interesting

      And now I've read their paper. Quick summary: (1) they do indeed speculatively pre-allocate heap blocks, and cache pre-allocated blocks per client thread. (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing. (3) They busy-loop the malloc()-thread because pthread_semaphore wakeups are too slow for them to see a performance gain (section 2.E.2-3).

      In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.

      --

      A witty [sig] proves nothing. --Voltaire

    9. Re:Nothing to see here.... by wealthychef · · Score: 3, Insightful

      But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.

      --
      Currently hooked on AMP
    10. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      And that's similar to what good malloc's do. What makes it tricky is the need to balance the per-CPU pools.

      A very common multithread case is that you have a producer thread (calling malloc a lot) and a consumer thread (which free's all of those objects) If you don't rebalance the pools, the consumer thread will end up all of the system's memory on its per-CPU free list.

      The best-of-breed malloc's deal with this by hitting the per-CPU pool in the 99% case, but they still need to occasionally synchronize to balance the free lists.

      This is on top of all of the other considerations that make building a great malloc tricky (fragmentation, NUMA-affinity, bounding wasted RAM, ...)

    11. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      Have every thread allocate its memory from... what? At some point either the operating system or the runtime has to lock something so it doesn't get interrupted

      For Fsck Sake.

      Clearly your parent is saying that each thread should allocate and manage its own memory at thread startup time.

    12. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      In this performance context an atomic operation is not that much better than a lock. It's a locked bus cycle either way, and too many of them will cause scalability problems.

    13. Re:Nothing to see here.... by Tenareth · · Score: 1

      If each thread allocates its own memory then it is just returning to LWP, which was good a while back, but threading should try to avoid allocating its own memory except in a few specific instances.

      However, based on this thread most people don't know how CPUs work.

      --
      This sig is the express property of someone.
    14. Re:Nothing to see here.... by Darinbob · · Score: 1

      Well, the article was talking about applications that aren't easily parallelizable, so I imagined that they were single-threaded already, and thus didn't need locks.

      Now if you had an application that used garbage collecting memory allocators, then you'd certainly have a win by having that part run on a side thread. Minimal CPU to alloc memory, and none to free it.

    15. Re:Nothing to see here.... by Angst+Badger · · Score: 3, Funny

      This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers.

      On the contrary, this opens up whole new possibilities for MS-DOS!

      --
      Proud member of the Weirdo-American community.
    16. Re:Nothing to see here.... by KeithIrwin · · Score: 0

      It's nice to talk about what happens if you're on a laptop, but the fact of the matter is that there are still a lot of power-hungry applications out there where response-time is key. Mainly these are scientific simulations, genome processing, image processing, 3-d render farms and similar (and these days, computer games some of the time). For those high-power applications, they would love to get a 20% speed-up across the board (rather than just in the parallelizable portions of their code). Obviously sure, this isn't going to cut it when you're worried about cents per computing power or battery life, but that's not the only market which exists.

    17. Re:Nothing to see here.... by zunger · · Score: 1

      Also, I can't help noticing that most of their argument for speed starts from the hypothesis that an uncontended mutex lock takes about as long as a single-threaded malloc. That makes me wonder what the hell kind of locking implementation they're using for their design, and whether their time wouldn't be better spent improving *that*.

    18. Re:Nothing to see here.... by zunger · · Score: 1

      Speaking as someone who writes performance-critical very-large-scale applications, the idea sounds just as nuts in that world. A 20% speedup in malloc is worth, at the most, a 2% speedup in the overall process. (If you're spending more than 10% of time in malloc, perhaps you should be using a freelist, arena, *anything* else?) Wasting an entire core for a 2% speedup doesn't seem horribly efficient to me.

    19. Re:Nothing to see here.... by headLITE · · Score: 2, Insightful

      A large amount of malloc()/free() calls is something very typical of server applications that handle many concurrent requests. In this scenario, the problem is made worse by the locking used in many traditional implementations. Don't underestimate that.

      This is becoming more and more of a problem in client applications as well. Thanks to object orientation, many modern applications are little more than endless streams of created and subsequently destroyed objects; and in many modern languages this happens implicitly all the time.

    20. Re:Nothing to see here.... by Anonymous Coward · · Score: 4, Informative

      They block the thread (by spinning, not sleeping) that calls malloc() while the allocation request is serviced by the server thread. The server thread is not busy looping it is signalled when a request is issued. The combination of pre-allocation and the lockless server protocol means that the probability of a thread needing to be blocked in the first place is very low, and if it is the lock will be held for a very short time, And for short periods of time spinning is more efficient than the whole signal/goto sleep/wakeup dance.

      It's not a cute trick, its a way of reducing latency (used in thousands of places in fast code paths in most operating systems), and your claims about CPU utilisation is mostly false. It's true that it incurs some penalty for the worst case scenario.

      Also the paper says "...especially for sequential applications which cannot easily benefit from the multicore architecture otherwise"

      In sequential apps the overhead of the spin locks is much less anyway, because there is less internal concurrency in the process.

    21. Re:Nothing to see here.... by Rockoon · · Score: 1

      Remember that a program that sits idle 90% of the time, but spends 5% of the time in malloc()...

      ...will save an effective 25% processor load by doubling the performance of malloc... which means longer battery life, and so forth.

      Also, there are plenty of inner loops with malloc in them. Really. Most programs are not optimized, at all.

      --
      "His name was James Damore."
    22. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      OMG! Lowercase class name!!! *ducks*

    23. Re:Nothing to see here.... by Carewolf · · Score: 1

      Moving malloc() to a separate thread does not do a thing for the putative word processor.

      In a memory allocator there is a design choice in how much bookkeeping is done in malloc, and how much is done in free. They can make malloc really fast, by making free do most of the work, and unlike malloc, free can return to the calling thread immediately and does not need to way for the result. This gives real parallelism. I am guessing they only make malloc async to keep one thread doing memory-allocations thus saving locks.

    24. Re:Nothing to see here.... by Carewolf · · Score: 2, Informative

      In other words, it's a cute trick for making one thread go faster, at the expense of burning 100% of another core by busy-looping. If you are on a laptop, congrats, your battery life just went from 4 hours to 1 hour. On a server, your CPU utilization just went up by 1 core per process using this library. This trick absolutely cannot be used in real life - it's useful only when the operating system runs exactly one process, a scenario that occurs only in research papers. Idea (2) is interesting (though not innovative); idea (3) makes this whole proposal a non-starter for anything except an academic paper.

      When used for locking it is called spinning and not busy-looping, and stop your silly doomsday speak and grow a brain. The linux kernel itself more often use spinning than locking, because it is much faster and uses less cpu-cycles. You have busy-looping thousands of time each second when the kernel synchronizes threads and hardware, this is a no-go in application design, but a really common and efficient trick in low-level libraries and routines, and it will save you cpu-cycles and energy compared to semaphores, not use more.

    25. Re:Nothing to see here.... by julesh · · Score: 2, Informative

      But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.

      Average is about 15%. Many programs spend nearly 50% of time in memory allocation.

    26. Re:Nothing to see here.... by julesh · · Score: 2, Insightful

      When used for locking it is called spinning and not busy-looping, and stop your silly doomsday speak and grow a brain. The linux kernel itself more often use spinning than locking, because it is much faster and uses less cpu-cycles. You have busy-looping thousands of time each second when the kernel synchronizes threads and hardware, this is a no-go in application design, but a really common and efficient trick in low-level libraries and routines, and it will save you cpu-cycles and energy compared to semaphores, not use more.

      Only if you restrict its use to occasions where you know the lock will become available quickly. The Linux kernel uses spinlocks for its internal structures where it knows that no other CPU is going to lock them for more than a few thousand cycles at most. I also believe it (usually) disables interrupts while the lock is held, so it knows that nothing will interrupt that operation prior to its completion. This is a very different situation from an environment where there may easily be multiple seconds between allocation requests.

    27. Re:Nothing to see here.... by julesh · · Score: 1

      Yeah. This isn't about performance critical card, as TFA makes it abundantly clear. Such code is usually using a custom allocator that's *much* faster than malloc, and is very often easily parallelizable, so will be using all of the cores on your machine most of the time anyway.

      The article makes it very clear that the target application is the kind of code that is not trivial, and specifically mentions "word processors and Web browsers". So laptop power use clearly *is* a concern.

    28. Re:Nothing to see here.... by julesh · · Score: 1

      It's nice to talk about what happens if you're on a laptop, but the fact of the matter is that there are still a lot of power-hungry applications out there where response-time is key. Mainly these are scientific simulations, genome processing, image processing, 3-d render farms and similar

      Except this is explicitly *not* what TFA is talking about. TFA states the target application is "hard-to-parallelize applications, such as word processors and Web browsers". Laptop power usage is a killer for the very application they suggest they're targetting.

    29. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      Holy crap that's hokey.

    30. Re:Nothing to see here.... by taniwha · · Score: 2, Interesting

      I've worked with real multi-threaded apps that turned out to use more than 50% of their time in malloc/new free/delete and the associated locks - large;y due to the use of C++ string routines by people who didn't understand the single threadedness that was going on behind the scenes. The most important thing to take away from this is that malloc/free are not cheap, they involve synchronization in multithreaded code (like stdio and most people don't know that either). (and should be avoided like the plague in real-time code because of the risk of priority inversion)

    31. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      And for a BSD take on this, see http://people.freebsd.org/~jasone/jemalloc/, including benchmarks of memory fragmentation, discussion on thread access locality etc.

    32. Re:Nothing to see here.... by milosoftware · · Score: 1

      old malloc()

      Now that made me think.

      --
      Musicians don't die. They just decompose.
    33. Re:Nothing to see here.... by jenesuispasgoth · · Score: 1

      Don't forget that many systems use a "first touch" allocation policy. It means even though you might actually allocate only at the beginning of your program, it could actually start to really allocate right in the middle of your program (take a look at how gnu sort is implemented, for example).

    34. Re:Nothing to see here.... by ultranova · · Score: 1

      But how much of your time is spent allocating memory? If you spend 5% of your time in malloc(), doubling its speed saves you 2.5% of your execution time.

      And if you're spending more, consider using Java or another language with a relocating garbage collector. Doing lots of short-lived allocations seems to be the perfect match for generational collection.

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    35. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      Remember that a program that sits idle 90% of the time, but spends 5% of the time in malloc()... ...will save an effective 25% processor load by doubling the performance of malloc... which means longer battery life, and so forth.

      Remember, a program that sits idle 90% of the time, and spends 5% of its time in malloc, actually spends 0.5% of the system time in malloc, which is also the effective maximum of processor load. However since most time spent in classic malloc is waiting locked, it's not actual CPU load as much as it's just waiting.

      So you can save up to a quarter *percent* of CPU load if you double the speed of malloc, but in fact usually less, as, again, it's not pure CPU load.

      Ah, numbers are such fun to play with :P

    36. Re:Nothing to see here.... by wealthychef · · Score: 1

      You're right, I've written parallel applications on the SGI's where malloc() was such a problem I rewrote it to avoid locks, I had forgotten that...

      --
      Currently hooked on AMP
    37. Re:Nothing to see here.... by kscguru · · Score: 3, Informative

      I should perhaps note that I do implement low-level libraries for an extremely reputable company as a day job; I'm familiar with low-level lock implementations both in kernel and in userlevel on Linux, Windows, and MacOS, and exactly how those implementations balance spinning versus blocking.

      The Linux kernel preference for spinlocks dates from years ago, when the whole kernel ran under the BKL and was non-premptable anyways so you couldn't use blocking locks. When the BKL came out, all locks were made spinlocks to maintain correctness (and the -rt patchset started up, doing a conversion). The default implementation (still in use today by anything except the -rt patchset!) disables interrupts while any spinlock is held, and thus assumes the only thing holding the lock is another core.

      In contrast, Solaris and Windows (and I think MacOSX, though I would have to check my references) use a mix of spinlocks and adaptive locks - spinlocks for use within interrupt handlers, and adaptive locks for everywhere else. Good pthread implementations (glibc included) use adaptive locks - which means the pthread implementation this paper declared too slow to use ALREADY spins ~1000 cycles before blocking. The canonical rule here is that an adaptive lock spins for the same amount of time it would take for a block/wakeup cycle, then blocks; this is guaranteed to be within a factor of 2 of optimal in all cases, which is the best overall lower bound you can possibly get. (Yes, Linux kernel is behind the times; they are slowly getting better, and when eventually the -rt patchset gets merged, Linux will have finally caught up. Sorry, Linux fanboys.)

      Spinning versus blocking is a tradeoff. The research paper manages to extract all the gains from the "spin forever" side of the tradeoff without ever admitting the drawbacks (one full CPU core wasted).

      --

      A witty [sig] proves nothing. --Voltaire

    38. Re:Nothing to see here.... by someSnarkyBastard · · Score: 1

      But doesn't that eliminate one of the primary reasons of threading, that you can have multiple executable processes share the same memory within a program? Also, to have each thread keep its own copy of everything is extremely wasteful and stands a good chance of introducing concurrency bugs (e.g. thread A's copy of var foo = 3, but thread B's copy of var foo = 5, hilarity and core dumps ensue)

    39. Re:Nothing to see here.... by mini+me · · Score: 1

      And all future Apple products.

    40. Re:Nothing to see here.... by SoftwareArtist · · Score: 1

      (2) They run free() asynchronously, and batch up blocks of ~200 frees for bulk processing.

      That's essentially identical to what a concurrent garbage collector does, which Java has had for how many years now?

      --
      "I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."
    41. Re:Nothing to see here.... by AuMatar · · Score: 1

      You didn't get what I was saying. Its not that every thread gets its own memory space. Its that every thread gets its own free list, initialized at thread startup with some memory from the OS. So malloc would look like this in pseudo code

      pid=getpid()
      freelist=getFreeListFromPid(pid) //probably a has lookup for speed
      return freelist.allocate(size)

      This does require getpid() to be fast, preferably not requiring an OS call, but that's doable if the OS supports you a bit.

      So threads could read each other's memory as normal, no concurrency issues (or at least no copying issues- you would still need to lock things the same as you always would). Allocating memory would be lockless. Deallocating memory allocated on the same thread would be lockless. Deallocating memory allocated on another thread could require a lock, but that would be a rare occurence most likely.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    42. Re:Nothing to see here.... by b4dc0d3r · · Score: 1

      How does the thread know how much to reserve at startup time? Unless you know exactly how much you'll need, you'll run into the problems I mentioned. The specific problem being solved involves a program that makes heavy use of memory allocation, so you probably can't count on knowing how much you need and you'll have to go back and get more. More specifically, this is intended for single-threaded application running on multi-core processors.

      Atomic instructions are basically system-wide critical sections, and heavy allocation will make this a bottleneck for the application (not necessarily the system). For most normal programs with occasional allocation it's not going to matter how you implement it.

      Windows already does exactly this, reserving 1MB by default and this is controllable within the executable or when creating threads, so if you do know what you need you can certainly do this at thread level. So yes, it would be trivial to write a lockless malloc where the thread controls its own allocations and free list. There just isn't a benefit to doing that, and that's not the problem being solved here.

      http://msdn.microsoft.com/en-us/library/ms686774(VS.85).aspx

    43. Re:Nothing to see here.... by Anonymous Coward · · Score: 0

      What if they implemented TCMalloc or the lockless allocator in the "server" thread that runs on the unused core? I suspect there would still be some gains to be had, because even the free() in TCMalloc would not be... er... free. However, the gains may not be anywhere near a 20% speedup.

  4. Wait-free lock-free data structures by Anonymous Coward · · Score: 1, Informative

    You can use a wait-free lock-free data structure to add deallocated memory to a free list to reduce contention on the critical section in a heap. You can also use it to implement the idea presented in the paper.

    On Windows, you can use the Interlocked Singly Linked List data structure (see InitializeSListHead).

    You can also partition memory allocations over several heaps to reduce contention further.

    1. Re:Wait-free lock-free data structures by master_p · · Score: 1

      In C++, you can use size-based and type-based pool allocators, which is the fastest way to allocate memory from. Since most programs have a few hundred types max, using a specific allocator for each type is actually a very good idea.

      If combined with thread-specific pools, then it can provide good performance without busy loops and other tricks.

  5. Summary by mapuche · · Score: 1

    "The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."

    The release says this technique is for programs difficult to parellelize "by running the memory-management functions on a separate thread".

    1. Re:Summary by 14erCleaner · · Score: 2, Funny
      The speedup comes from using a memory-allocation library (PHKMalloc) that does extensive and expensive checking to avoid programmer errors, then basically hides most of its overhead in the second thread (so that the allocation thread would be mostly doing sanity checking). For most programs, this probably won't help performance any. It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.

      I once submitted an April Fool's joke to this effect to a moderated website; I claimed superlinear speedup for sorting by starting with a bubble sort, then "modifying it for parallelism" until it morphed into a quicksort. Alas, the moderator rejected it...

      --
      Have you read my blog lately?
    2. Re:Summary by beakerMeep · · Score: 1

      He didn't reject it, he just moved it to another processor.

      --
      meep
    3. Re:Summary by droopycom · · Score: 1

      It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.

      Actually its a very interesting trick.

      There are so many slow algorithm in the wild that having a simple method to speed them all up would be very useful.

      Ah, yes, this would not be a theoretical break-through, but a very practical one indeed... ... if their claims can be substantiated of course

    4. Re:Summary by Anonymous Coward · · Score: 0

      It's an old trick in parallel processing research: pick a slow algorithm, then speed it up via parallelism, rather than starting out with an efficient solution.

      Er, it's not a trick if the final result is faster.

      Case in point - a well-known physics engine was using the latest state-of-the-art algorithm for coarse collision detection, but it could not be put onto the PS3 SPUs as it was fundamentally serial.

      I reverted to the simplest possible algorithm, which could then be parallelized.

      Guess what? Overall performance of this work was doubled.

      Starting out with an "efficient" solution is the first mistake people make in parallel programming. "Efficient" on a single core is not the same as "efficient" on 8 cores.

  6. Wow, this is pretty clever by Omnifarious · · Score: 3, Interesting

    I wish I'd thought of it.

    Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.

    1. Re:Wow, this is pretty clever by nxtw · · Score: 3, Informative

      If you want faster AES, just upgrade your CPU.

    2. Re:Wow, this is pretty clever by Omnifarious · · Score: 1

      Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)

    3. Re:Wow, this is pretty clever by macemoneta · · Score: 4, Informative

      Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented. As it is, on even my fastest computers, multi-gigabyte transfers are very CPU bound over ssh with the ssh process pegged at 100% (and just 100%) CPU.

      That's already implemented in the high performance ssh patch, available here. Scroll down to the "Threaded CTR cipher mode" patch, if that's the only part you're interested in.

      I've applied it to the openssh package on my Fedora 12 system. As is, it provides about 40% increased throughput on my quad-core. You may be able to get more by tweaking it to increase the thread count.

      --

      Can You Say Linux? I Knew That You Could.

    4. Re:Wow, this is pretty clever by nxtw · · Score: 3, Insightful

      Well, the Intel AES instructions would benefit even more from parallelized AES CTR mode pre-computation than straight multiple cores, so that doesn't invalidate what I'm saying at all. :-)

      Are your storage and network devices that fast?

    5. Re:Wow, this is pretty clever by mirix · · Score: 1

      Of course, you need software that uses the instructions as well.

      That said, I've got one of the Via chips with hardware RNG on it, and once I loaded the module for it, /dev/random just spews data. It's an insane improvement over normal (software) /dev/random. I believe it has some other sorts of encryption friendly features, but I haven't played with it much, yet.

      --
      Sent from my PDP-11
    6. Re:Wow, this is pretty clever by sowth · · Score: 1

      I was thinking Soekris Engineering's vpn accelerator card would help, but it appears to only be able to do 250 Mbps. (You wanted 1 gigabit/s, right?)

      That card is really old too. I first read about it probably 10 years ago. I don't think it has changed in that time... I wonder if someone makes a faster accelerator? Then again, what about the GPU? Has anyone tried encryption with GPUs before? They've done other supercomputing tasks. A quick search says they have.

    7. Re:Wow, this is pretty clever by x2A · · Score: 1

      "Are your storage and network devices that fast?"

      Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    8. Re:Wow, this is pretty clever by maxwell+demon · · Score: 1

      I always store on /dev/null. I can tell you, it's super fast! However, I'm experiencing a lot of data loss; maybe the device is broken.

      --
      The Tao of math: The numbers you can count are not the real numbers.
    9. Re:Wow, this is pretty clever by nxtw · · Score: 1

      Depends what else the system is doing, surely. Having several clients connected to a server, you want to free up the processor as much as possible for servery duties, any savings you make go to those when running full pelt, or convert to energy savings when you're not.

      What does that have to do with anything? AES-NI enables AES implementations to encrypt and decrypt much faster than any network and storage system I've ever used can provide data. Parallelizing AES-NI would save energy only in the case when running multiple cores at partial load is more efficient than running just one core at near-full load (which could be the case if all cores run at the same clock speed), and would be faster only if your I/O systems are already faster than a single core can encrypt/decrypt with AES-NI.

    10. Re:Wow, this is pretty clever by Omnifarious · · Score: 1

      The other thing, I was thinking of getting an Intel CPU for my next firewall for that very reason, and I went to their website to try to go through the options and pick one. Turns out, they demand I use Flash. Oops. I guess I can't figure out which one I want, so I'm not buying one.

    11. Re:Wow, this is pretty clever by nxtw · · Score: 1

      Sucks for you.

    12. Re:Wow, this is pretty clever by Omnifarious · · Score: 1

      *sigh* And... (I forgot to put this in) Intel's new instructions only make things twice as fast with CBC mode because CBC mode can't be pipelined. CBC mode requires the results of the previous operation before doing the next.

      That also implies that if you're going to be using them to increase the speed of CTR mode you are best doing several blocks before you switch to doing something non-AES related. That also argues for pre-computing blocks in CTR mode. So really, the whole pre-computation thing should be done regardless of whether you're using Intel's new instructions or not.

    13. Re:Wow, this is pretty clever by Jeffrey+Baker · · Score: 1

      ssh -c arcfour

      Or, on certain builds

      ssh -c none

    14. Re:Wow, this is pretty clever by x2A · · Score: 1

      Right... so there's absolutely no chance of any possible benefits whatsoever? Impossible that there could be any context switching overhead savings reachable, impossible that correct scheduling could reduce stalls in tight running loops by doing heavy cacheline busting processing elsewhere? Even with the cost of some latency, there's -no- way to improve throughput by parallel processing it? What if there was (somehow) nothing else running on the system /at all/?

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    15. Re:Wow, this is pretty clever by nxtw · · Score: 1

      Right... so there's absolutely no chance of any possible benefits whatsoever?

      Perhaps if you are encrypting a single stream at a very high rate. But a server will be more likely serving many streams, which can already be performed on different cores.

      Impossible that there could be any context switching overhead savings reachable, impossible that correct scheduling could reduce stalls in tight running loops by doing heavy cacheline busting processing elsewhere? Even with the cost of some latency, there's -no- way to improve throughput by parallel processing it? What if there was (somehow) nothing else running on the system /at all/?

      Unless your time is worthless, it's probably cheaper to buy more CPUs.

    16. Re:Wow, this is pretty clever by x2A · · Score: 1

      "Perhaps if you are encrypting a single stream at a very high rate. But a server will be more likely serving many streams, which can already be performed on different cores"

      Sure, they could be, but if the code for a single stream can achieve higher throughput (at a possible cost of higher latency) by splitting the process in two and running them side by side, even by just a little, and if the other streams are encrypting using the same code, they could get that same, small benefit. If you don't need individual threads for each seperate stream, and can use a worker thread, at least for one side of the split if not both, the context switching overhead savings you get for a single stream will carry on for multiple.

      And if it works for that, even just a little, even if it seams like just putting more cpu power into the system would be cheaper, and if it's known about, then those savings are likely to be available elsewhere in a complete system too, which could shape the way the system is designed to allow it to run more efficiently on a larger number of simpler cores. Granted when Intel experimented with creating an architecture based on instruction level parallelism, it didn't work out quite as successfully as they'd hoped, but I think their cores were possibly too simple (at least with early Itaniums, I didn't really follow progress after that) and of course the compiler not generating code well enough to take full advantage of it, as you said, the software development manhours can end up costing more than just more processors.

      That's what gives this kind of research its value, it's looking at identifying and understanding the baby steps between here and where Intel tried to jump straight to so we can have fewer execution units stalling and wasting silicon space, along with all the circuitry to try and keep those units fed by rescheduling the instructions out of order etc because of how the instructions are scheduled by the code. Knowing these things can be useful, maybe only showing benefits when you have dual on-die memory controllers, maybe only when you have a single shared, maybe with shared cache, maybe only with seperate. Saying benefits are impossible doesn't really help anybody.

      --
      The revolution will not be televised... but it will have a page on Wikipedia
    17. Re:Wow, this is pretty clever by thePowerOfGrayskull · · Score: 1

      Of course, it's related to a similar fine-grained parallelism idea for crypto that I wish would be widely implemented, and that's offloading most of AES CTR mode onto a separate thread, or several separate processes since each block has a computation step that can be computed in advance in parallel with all the other blocks. I might start doing multi-gigabyte transfers over ssh if that were implemented.

      You may find this helpful, which does exactly what you're thinking of... it's basically a set of patches to OpenSSH. HPN-13 does do AES CTR processing onto multiple threads; in addition to its original functionality which was to include various network optimizations to increase throughput. http://www.psc.edu/networking/projects/hpn-ssh/ (I've been digging into it a bit very recently because someone reported it's not playing nicely with BBSSH)

    18. Re:Wow, this is pretty clever by thePowerOfGrayskull · · Score: 1

      Aw, damn - macemoneta beat me to it.

  7. 20%?! by temojen · · Score: 5, Insightful

    If most programs are spending 20% of their time on memory management, something is wrong.

    1. Re:20%?! by kisielk · · Score: 1

      From the paper:

      Previous studies show that some C programs spend up to one third of their execution time in dynamic memory management routines such as malloc and free

      You can check the PDF for the cited studies.

    2. Re:20%?! by naasking · · Score: 2, Insightful

      Not at all. 20% is a very typical overhead for dynamic memory management. Did you think malloc/free costs nothing?

    3. Re:20%?! by SpazmodeusG · · Score: 1

      I don't think that's what he was getting at. I think he means you can avoid that much malloc/free-ing.
      Memory pooling and allocating outside of tight loops comes to mind.

    4. Re:20%?! by naasking · · Score: 4, Informative

      I'm saying that 20% overhead for dynamic memory management is typical of even well-designed programs. Very few programs can take good advantage of efficient bulk-deallocating arenas/regions, and research has shown custom memory pooling schemes are generally no better than malloc/free.

    5. Re:20%?! by AuMatar · · Score: 1

      And OOP makes things worse in this area- it tends to have a lot of small, short life objects that need to be constructed and destructed. Java is particularly bad at this due to decisions like immutable strings. Those extra object allocations add up quickly.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    6. Re:20%?! by naasking · · Score: 1

      Memory management overhead in GC'd languages is typically around 30%, so not much worse than malloc/free which averages around 15-20%. You gain a lot of productivity for that 10-15% overhead tradeoff though.

    7. Re:20%?! by AuMatar · · Score: 1

      I've personally never seen much if any productivity gain from GC. If anything I've seen a loss- I find that managing memory helps me make better designs, and memory ownership problems are almost always the first sign that there's a major design flaw. Quite frankly memory bugs are rare among decent programmers- I find one a year or so in most places I've worked, and those usually come about due to someone trying to be a bit too clever minimizing memcpy calls. It doesn't even make the top 20 list for bug causes.

      But at any rate, I wasn't talking GC vs non-GC, but OOP vs non-OOP. There are non-OOP GC languages and OOP non-GCed languages. I was pointing Java out because particular design decisions in their standard library make the problem even worse- the large number of immutable objects, especially strings. The more objects you make, the more time you spend in constructors, destructors, new, and free regardless of language. A design requiring lots of small objects should be avoided if at all possible, and patterns like flyweight applied if not.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    8. Re:20%?! by Anonymous Coward · · Score: 0

      Today's forecast: a constant cache miss and some bubbles in the pipeline..

    9. Re:20%?! by Nadaka · · Score: 1

      Besides that java running with optimized options under hotspot can beat c++ in creating/destroying objects vs c++'s alloc/malloc by up to 4 times. In some cases it can push java to execute faster than c++ (well crafted c with structs would still beat them both though).

    10. Re:20%?! by dgatwood · · Score: 1

      Double the overhead is not much worse?

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    11. Re:20%?! by naasking · · Score: 1

      By doubling the memory management overhead in the general case, you now get to work with pervasive immutable data, generally not worry about memory leaks, and dramatically simplify the semantics of interface boundaries.

    12. Re:20%?! by naasking · · Score: 1

      You significantly understate the complexities of manual memory management. Any two programs with compatible interfaces written in a GC'd language will compose without leaks, where any two programs with compatible interfaces written with manual memory management will not necessarily compose. This requires diverting valuable resources from development to analyzing the safety of any composition of non-composable programs.

    13. Re:20%?! by Anonymous Coward · · Score: 0

      Define "very few". It depends on the context. Network client/server tasks easily benefit from simple arenas. This is because the work is usually broken down into small request/respond cycles. Any competent C programmer will use arenas to preallocate and bulk-free objects during the particular request/respond phase. Not just for performance, but because it makes memory management far easier, especially the job of growing and parsing input buffers (the actual task of tracking pointers and not leaking memory is usually quite easy regardless).

    14. Re:20%?! by Tenareth · · Score: 1

      Yes, most modern programmers do think that way based on most of the code I've seen in the past 10 years.

      --
      This sig is the express property of someone.
    15. Re:20%?! by RAMMS+EIN · · Score: 2, Insightful

      ``20% is a very typical overhead for dynamic memory management. Did you think malloc/free costs nothing?''

      Many people actually seem to think that, and that only automatic memory management is costly. Out in the real world, of course, managing memory costs resources no matter how you do it, and you can sometimes make huge performance gains by optimizing it. I've seen percentages of time spent on memory management anywhere from 99% in real programs. As always: measure, don't guess.

      --
      Please correct me if I got my facts wrong.
    16. Re:20%?! by AuMatar · · Score: 1

      No, I actually write programs in C++ as a career. Memory management is trivial. It just doesn't cause bugs, unless you hire really bad programmers. The number of bugs it does cause are trivial- under 1% of bugs are due to it. And I say this with over a decade of experience coding.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    17. Re:20%?! by GigaplexNZ · · Score: 1

      And a lot of those small, short life objects tend to sit on the stack, not heap, so malloc doesn't come into play there. That's one of the first annoyances I encountered in C#, I can't create intentionally short lived class objects on the stack.

    18. Re:20%?! by dgatwood · · Score: 1

      Not really. You get to not worry about memory leaks, perhaps, but the interface semantics should be largely unchanged. Something either allocates memory or it doesn't, and it's still important to know which is which even in a garbage collected environment because you take a substantial performance penalty if you're going around allocating and freeing memory when it isn't necessary.

      And if a library function keeps a reference to something, the library routine should be using a reference-counted object anyway, which means the difference between GC and non-GC is an extra call to "release" or the equivalent in the function that allocated the object initially. (BTW, IMHO, all functions that use retain/release semantics should always return retained objects. The alternative is madness.)

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    19. Re:20%?! by dumael · · Score: 1

      Only if your compiler/VM does escape analysis for stack allocation. Afaik, Sun's Hotspot and the JHC haskell compiler are capable of it, I don't know of any others.

    20. Re:20%?! by guyminuslife · · Score: 5, Funny

      I was aware that malloc() had a price tag attached, but free()? That's misleading advertising.

      --
      I don't believe in time. It's a grand conspiracy designed to sell watches.
    21. Re:20%?! by noidentity · · Score: 1

      If most programs are spending 20% of their time on memory management, something is wrong.

      The program would have to be spending 100% in memory allocation. But hey, every one of their test programs did so, so it can't be that uncommon!

    22. Re:20%?! by GigaplexNZ · · Score: 1

      Sorry, I seem to have made an error of omission. I was referring to small short lived C++ objects typically being on the stack. My point was that not all OOP languages automatically throw all their objects at some form of heap, which AuMatar seemed to be claiming.

    23. Re:20%?! by AuMatar · · Score: 1

      Throwing it on the stack speeds allocation/deallocation, but doesn't get rid of constructor and destructor overhead. It's better, but still inefficient. As for whether those small objects are mostly on the stack... most is an overstatement I think. And some languages don't allow you to choose- in fact I think only C++ does.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    24. Re:20%?! by Rockoon · · Score: 1

      Are you serious?

      Memory management is only trivial in trivial programs. I guess thats what you work on.

      Try working in a team on a monolith sometime, and you can bet your ass that you wished the language was GC'd, because otherwise you dont know jack about whats going on memory-wise without a very extensive review.

      Evidence: Firefox

      The memory leaks were so bad that they had to write memory leak detection routines to pinpoint the problems for them, and it STILL leaks memory to this day.

      ALL memory leaks are bugs.

      --
      "His name was James Damore."
    25. Re:20%?! by Anonymous Coward · · Score: 0

      And OOP makes things worse in this area- it tends to have a lot of small, short life objects that need to be constructed and destructed. Java is particularly bad at this due to decisions like immutable strings. Those extra object allocations add up quickly.

      And OOP makes things worse in this area- it tends to have a lot of small, short life objects that need to be constructed and destructed. Java is particularly bad at this due to decisions like immutable strings. Those extra object allocations add up quickly.

      Any decent JVM (such as HotSpot) optimizes that and will not free/reallocate memory but swap the objects in place. It is so much optimized that using short-lived objects is actually one of the favored techniques for Java programming.
      Also, immutability solves many more problems than it creates. Immutable strings is one of the things Java got right. In fact, immutability is the one property that allows safe objects reuse and removes the need for copy and thus memory waste. Get your facts straight.

      Of course, you might need to follow some best practices to make sure your code gets optimized the way you want (don't loop in a static method, tune your JVM for optimization and proper garbage collector set up, etc).

    26. Re:20%?! by Anonymous Coward · · Score: 0

      Yeah, let's LIE about what the research shows, shall we?

      Just read the fucking document you link. In 25% of the cases they looked at, a custom allocator was WAY faster (44%) than their state-of-the-art malloc/free. In 75% of cases the difference was marginal.

      So "very few programs" means one in fucking four, does it?

      Also, your idea that 8 examples constitute a demonstration of a general fact is incredibly stupid.

      I expect your claim that 20% is "typical" is, equally, based on a tiny non-representative sample. If you're spending 20% of your time in memory management your program is crap.

    27. Re:20%?! by Atzanteol · · Score: 1

      One of the greatest features of modern languages is that I no longer need to *care* where my objects are created. Sure, I don't feel as elite, but I got over that with time.

      --
      "Ignorance more frequently begets confidence than does knowledge"

      - Charles Darwin
    28. Re:20%?! by chocapix · · Score: 1

      It's free() as in speech(), not as in beer().

    29. Re:20%?! by naasking · · Score: 1

      Reference counting is GC.

    30. Re:20%?! by GigaplexNZ · · Score: 1

      You're perfectly entitled to not care where it is allocated. There is a good reason I brought it up though - this article is about speeding up dynamic allocation, which won't affect stack based objects at all.

    31. Re:20%?! by tibit · · Score: 1

      AFAIK, with most commonplace GC implementations, the short-lived objects have O(1) cost.

      If you do a lot of allocations and soon thereafter you don't reference them anymore, the GC cost - in a decent implementation - will be zero for deallocation, and essentially O(1) for allocation. That's not bad IMHO. For deallocation, if it's not referenced then it's not touched. Only stuff that's disposable has a cost, but then you don't really want to be churning through lots of those.

      --
      A successful API design takes a mixture of software design and pedagogy.
    32. Re:20%?! by AuMatar · · Score: 1

      No, it's only non-trivial if you don't know what you're doing.

      1)Use smart pointers
      2)All memory should be owned by either function or an object. Make sure an object deletes everything it creates, and the same for each function
      3)Avoid exceptions- they're nice for making high numbers on test coverage, but they make memory management troublesome. If you must use them, make sure to use smart pointers.
      4)If you find yourself breaking rule #2 (by returning malloced memory from a function, for example) make sure there isn't a better design. There usually is.
      5)If you must break rule 2, document the fuck out of it. This is where you're going to have a leak, if anywhere.

      I've written everything from embedded systems with multiple millions of LOC to back end web services all in C or C++. Memory management never caused problems- you'd get 1 or 2 bugs a year due to it tops. And on the embedded systems a leak of even a few bytes would have not only left it unusable (memory was tight), but would have been discovered since the system had no malloc and did all allocations from per job pools that were checked for leaks in debug mode.

      Memory management is just not that hard. Period.

      --
      I still have more fans than freaks. WTF is wrong with you people?
    33. Re:20%?! by dgatwood · · Score: 1

      Some garbage collection is based on reference counting, but reference counting by itself is most certainly not a form of GC.

      Consider a basic reference counting design:

      void *ref_alloc(size_t *size) {
      void *tmp = malloc(size + 16);
      uint64_t *count = (uint64_t *)tmp;
      *count = 1;
      return tmp+16;
      }

      void retain(void *obj) {
      void *tmp = obj - 16;
      uint64_t *count = (uint64_t *)tmp;
      (*count)++;
      }

      void release(void *obj) {
      void *tmp = obj - 16;
      uint64_t *count = (uint64_t *)tmp;
      if (--(*count)) return;
      free(tmp);
      }

      When the reference count reaches zero, the object goes away immediately. Because the creation and destruction are solely under programmatic control, pure reference counting is not considered garbage collection. There's no garbage collector, no run-time functionality to speak of, no notion of scope, and nothing is done without explicit action by the caller. It fails to qualify as garbage collection in basically every way.

      Admittedly, some reference counting schemes do qualify (e.g. the monstrosity that is Perl's GC), but reference counting schemes in general do not.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    34. Re:20%?! by naasking · · Score: 1

      I'm afraid you're mistaken for two reasons:

      1. "garbage collection" has always been synonymous with "automatic memory management"; your ref-counting scheme is thus GC, albeit not a tracing collector.
      2. Bacon et al. established that all GC algorithms are hybrids between reference counting and tracing. Ref-counting provides incrementality, tracing provides throughput and cycle detection.

      Your original point was, "BTW, IMHO, all functions that use retain/release semantics should always return retained objects. The alternative is madness." I agree, because you are basically stating exactly what I stated earlier, which is that automatic memory management/GC dramatically simplifies the semantics of abstractions and thus should be pervasive.

    35. Re:20%?! by Anonymous Coward · · Score: 0

      That is why I usually fake the free function and stick the free nodes into a list, then the new function will grab a node from the list if one exists instead of allocating a new one. It is easy to double the speed of a function with this technique.

  8. Might be particularly applicable to Java by Anonymous Coward · · Score: 2, Interesting

    Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules. Java already has some multi-threaded load balancing that occurs automatically, but this algorithm might yield some additional benefits.

    1. Re:Might be particularly applicable to Java by binarylarry · · Score: 2, Informative

      Actually, Java already does something very similar to this: http://en.wikipedia.org/wiki/Java_Memory_Model

      --
      Mod me down, my New Earth Global Warmingist friends!
    2. Re:Might be particularly applicable to Java by SpazmodeusG · · Score: 4, Interesting

      Actually i've found the opposite. Java tends to be really good at transparent memory re-use. From experience if i have ~1,000,000 objects of the same type with the some constantly being deleted and replaced Java will run that program faster than a non-memory pooled C implementation (of course the memory pooled C implementation will be faster again).

      In fact many of the benchmarks around that you see claiming Java is faster than C will use an example of a program that creates and destroys objects in a tight loop. The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.

    3. Re:Might be particularly applicable to Java by Anonymous Coward · · Score: 1, Informative

      Actually, Java does the complete opposite. Java allocates a large chunk of heap, and then internally allocates its own objects on that heap. because the majority of objects are short-lived, a generational garbage collector can be used. Also, some allocations are done on the stack. See here

    4. Re:Might be particularly applicable to Java by glwtta · · Score: 1

      Java tends to generate a far greater number of malloc/free operations than a typical C program, and so this algorithm might yield particularly good performance on Java modules.

      That's not my understanding of how Java works. In fact, I can't see any benefit from this approach in Java at all: new object allocation is extremely fast (trivial, really) and heap compacting / garbage collection is already parallelized, and much more optimized than what malloc/free can do.

      I may be wrong about this, but the only time Java actually uses malloc is: a) expanding the heap, and b) allocating direct buffers (there's probably some other io-related cases).

      --
      sic transit gloria mundi
    5. Re:Might be particularly applicable to Java by radish · · Score: 1

      Java does basically no mallocs. Obviously there's an initial malloc to allocate the heap to the JVM but after that it's all managed by the JVM itself, and it's been demonstrated as being much faster than a traditional malloc/free approach. Assuming you set your Xms and Xmx sizes correctly the system malloc implementation is basically irrelevant to Java execution speed.

      --

      ---- Den ene knappen er powerknapp, den andre er Bender voice knapp "Bite My Shiny Metal Ass"

    6. Re:Might be particularly applicable to Java by Nadaka · · Score: 1

      I would assume you are running hotspot with the server parameter? Anything else you do to bump java performance?

    7. Re:Might be particularly applicable to Java by Azarael · · Score: 1

      And how does this situation differ other than the fact that the alloc/free operations are done local to the JVM instead of making system calls? The fact that the JVM is doing the work doesn't magically make memory management easier.

      The other thing that I'm skeptical about is that the article seems to be contradicted by a more recent paper by the author that they are referencing (see Zorn http://portal.acm.org/citation.cfm?id=582419.582421). In the newer paper, Zorn et al. say that custom allocators are less efficient than a modern general purpose allocator.

    8. Re:Might be particularly applicable to Java by buchner.johannes · · Score: 1

      The C program will be as written with tons of calls to malloc/free, the Java program will simply reuse the same parts of memory again and again without any system calls. These benchmarks are a bit misleading as the C program isn't optimised with memory re-use whereas the Java Garbage collector tends to do that implicitly.

      What would the results be when using boehmgc?

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    9. Re:Might be particularly applicable to Java by SpazmodeusG · · Score: 1

      I'll get around to trying boehm gc one day but I'm pretty sure i'd see a speedup over a naive implementation (bulk allocations and frees are always faster than allocating in a loop) but it wouldn't be as good as an implementation that uses a memory pool (millions of the same type of object for a DMC compression algorithm - it's essentially made to be memory pooled).
      Similar to the Java scenario.

    10. Re:Might be particularly applicable to Java by GigaplexNZ · · Score: 1

      And how does this situation differ other than the fact that the alloc/free operations are done local to the JVM instead of making system calls? The fact that the JVM is doing the work doesn't magically make memory management easier.

      Yes it does. The JVM already owns the memory and nothing else can touch it, so it doesn't have to get low level locks or switch between user space and kernel space contexts.

    11. Re:Might be particularly applicable to Java by SpazmodeusG · · Score: 1

      That and the Xms memory option. If Xms is set high enough Java will do 1 call to malloc throughout the life of the program. Not that that's a huge improvement as Java doesn't call malloc much anyway.

    12. Re:Might be particularly applicable to Java by dumael · · Score: 1

      Sun's java still needs to perform it's own internal memory management for the main heap. The minor heap is bump allocated, and live data copied from that to the main heap.

    13. Re:Might be particularly applicable to Java by dumael · · Score: 1

      Except any decent allocator should only need to cross the kernel threshold when expanding the heap -or- releasing "excess" for some value worth of free memory back to the OS.

    14. Re:Might be particularly applicable to Java by GigaplexNZ · · Score: 1

      I may be using a different defintion of heap to you, I see the heap as the dynamically allocated memory from the kernel and any smart allocator just maintains a local pool of memory to reduce system calls (and the JVM was shown as an example that does this). Azarael asked how this is better than system calls (note: system calls are by definition calls to the kernel) and you seem to know the answer to that already.

      If the smart allocator is at the shared library level then it has 2 main options. It can be a globally smart allocator which means all processes use the same pool, but this again has contention issues and potential security issues. The other option is to have each process have it's own instance of the allocator but the downside is that if every application links against that, the system as a whole will use a lot more memory and systems that really do need a smart allocator tend to have a custom one anyway. It's very difficult to make a general purpose allocator that is fast and space efficient and supports a wide range of different sized allocations, but since the JVM knows a lot more about the code it is running/JITing it can make better optimisations based off assumptions a general purpose allocator can't safely make. Plus, the JVM can defragment the pool more safely than native allocators.

    15. Re:Might be particularly applicable to Java by CaptnMArk · · Score: 1

      Java allocation might be more optimized than malloc. But garbage collections is often slower than free, for large apps (practical observation).

    16. Re:Might be particularly applicable to Java by Tyler+Durden · · Score: 1

      I may be using a different defintion of heap to you, I see the heap as the dynamically allocated memory from the kernel and any smart allocator just maintains a local pool of memory to reduce system calls (and the JVM was shown as an example that does this).

      That's how malloc works. Malloc is not a system call. Occasionally it does make a system call (either brk() or mmap() on Unix) to add a block of memory to the running process when it needs it. If it has enough free space within the block already assigned to it, malloc just carves that up without making an additional system call.

      Or is there some point of yours I'm missing?

      --
      Happy people make bad consumers.
    17. Re:Might be particularly applicable to Java by Anonymous Coward · · Score: 0

      AFAIK the default (Hotspot) garbage collector is 'copying' and 'compacting', i.e. when it performs a collection, it copies the live objects from the source space into an empty target space (as opposed to removing the dead objects) and then releases the source space. This makes allocations of short-lived objects very fast (faster than malloc/free), since they get reclaimed automatically. Also, Hotspot recently added (experimental) support for Escape Analysis, which can be used to avoid heap allocation for temporary objects completely.

    18. Re:Might be particularly applicable to Java by glwtta · · Score: 1

      But garbage collections is often slower than free, for large apps (practical observation).

      Well, sure, but garbage collection in Java is already parallelized, and in a much more sophisticated way than the algorithm mentioned. It doesn't really even make much sense to compare garbage collection and free, what really matters is how Java allocation + garbage collection compares to malloc + free; and there you are basically getting fairly extensive optimization (without having to write any code for it) at the expense of predictability. Obviously the trade-off is different in different scenarios.

      I really doubt that the asynchronous malloc/free would perform better than the Java model, in the general case.

      --
      sic transit gloria mundi
  9. Uhm, isn't this just garbage collection? by GWBasic · · Score: 1

    Uhm, so what's the big deal? .Net's garbage collector runs in its own thread.

    1. Re:Uhm, isn't this just garbage collection? by jasmusic · · Score: 1

      Plus the allocator is lightning fast because allocation goes in a straight consecutive line rather than looking for empty slots in a predefined grid.

    2. Re:Uhm, isn't this just garbage collection? by Rockoon · · Score: 1

      It doesnt sit there spinning when it has nothing better to do?

      --
      "His name was James Damore."
    3. Re:Uhm, isn't this just garbage collection? by godrik · · Score: 1

      No it is not garbage collection at all, it is allocating in a separate thread. There is no detectino of unreachable memory which characterize garbage collectors. It is just a wrapper to malloc/free.

  10. You can malloc it but you can't use it by lordlod · · Score: 2, Insightful
    The article(s) are very scarce on details but it seems like the gains will be limited in most applications. Fundamentally you have to block until the malloc has finished before you can use it. So it helps if you malloc well ahead of time, but not if you malloc as you need it.

    A common simplified structure is:

    malloc memory
    use memory
    free memory

    With these new innovations you get:

    async malloc memory
    block until malloc finishes
    use memory
    async free memory

    And free shouldn't take a noticable amount of time.

    1. Re:You can malloc it but you can't use it by jasmusic · · Score: 1

      Looks just like Windows OVERLAPPED I/O.

    2. Re:You can malloc it but you can't use it by kisielk · · Score: 3, Informative

      The PDF of the paper has all the details. The article is just fluff.

    3. Re:You can malloc it but you can't use it by Anonymous Coward · · Score: 0

      You're a fucking moron.

    4. Re:You can malloc it but you can't use it by jasmusic · · Score: 1

      Wow that was out of left field, not even worth logging in for!

    5. Re:You can malloc it but you can't use it by Anonymous Coward · · Score: 0

      Wow that was out of left field

      Not at all. Remarks like:

      Looks just like Windows OVERLAPPED I/O.

      are in fact moronic. Even (perhaps especially) if you don't mean them.

      Also, I had a look at your previous posts; your moronitude seems to be generalized, rather than tech specific.

    6. Re:You can malloc it but you can't use it by godrik · · Score: 1

      you should RTFA. The main goal is to move sanity checks on the free() operation to a separate thread so that they can be done concurrently with the execution of the actual program. The goal is not asynchronous_malloc()

  11. It's programmers that need parallelization by w0mprat · · Score: 5, Insightful

    Because we learnt to program for a single threaded core with it's single processing pipeline since way back, using high level languages that pre-date the multi-threaded era, and it involves re-thinking how things are done on a fundamental level if we're ever to make proper use of 32, 64, 128 cores. Oh and we all know how many programmers are 'get off my lawn' types, myself included.
     
    If I still coded much anymore it would drive me to drink.

    --
    After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
    1. Re:It's programmers that need parallelization by __aasqbs9791 · · Score: 2, Funny

      ...If I still coded much anymore it would drive me to drink.

      Maybe that's my problem? If I started drinking maybe I could handle it [programming for other people] again.

    2. Re:It's programmers that need parallelization by Anonymous Coward · · Score: 0

      I think for most programmers it's not so much that they have problems mentally modeling for multithreading as it is that people have problems modeling in general, and multithreading shows their slop more clearly. It's like watching people code when you remove garbage collection from their programming environment. As with garbage collection, though, the fact that tools to help with multithreading will also probably hide poor design skills isn't a reason to be against them.
       
      Btw, the article is pretty worthless and as far as I could bother reading it, it sounds like its far from revolutionary.

    3. Re:It's programmers that need parallelization by Anonymous Coward · · Score: 0

      On my first job, 15 years ago, I wrote multi-threaded programs for OS/2 running on a dual 100MHz Pentium that had a price tag of a small apartment.

      It's nothing "new". Now get of my lawn.

    4. Re:It's programmers that need parallelization by Anonymous Coward · · Score: 0

      I don't know much programming yet (just bash scripting and a bit of assembly on linux) and certainly not how to program to take advantage of multi-core, but as a computer user I use it by confining processes to particular cpu cores so they don't interfere with what else I'm doing, for example doing video format conversion on one so it doesn't interfere with my web browsing. So I don't get "OMG super fast process because of multiple cores" but I do get a very substantial advantage. Sure, I could just use nice, but I find this method works very well for me. It also makes sure my system remains usable in the case of a process maxing out the cpu.

  12. So what's the big deal here? by Edgewize · · Score: 1

    As far as I can tell, the author of this paper just figured out a way to offload a bunch of memory management tasks to an idle CPU core, and then counted it as a performance gain. OK?

    So, monolithic single-threaded applications can be made faster on multi-core systems, at the cost of bulk-allocating more memory than is actually required, and only if there is enough idle CPU to run your memory-management thread in realtime on another core. I am not exactly blown away.

    The press release tries to trump it up as some kind of major advancement in performance, but it's not a performance gain at all. If you ran four copies of these single-threaded benchmarks simultaneously, you'd probably see a net performance decrease.

    1. Re:So what's the big deal here? by Zironic · · Score: 2, Insightful

      It's a performance gain because it's extremely rare that all your cores are maxed out at once, if you can distribute the computing power more evenly it's a performance gain in most circumstances even if the net computing power required increases.

  13. free() is probably more parallizable than malloc() by JessGras · · Score: 1

    Haven't ready the OP: shame on me.

    But as it is true that a program which calls malloc() is basically blocked until malloc() returns, the trick here could be that malloc() is made to be a lot dumber... but free() just puts pointers in a queue "to the other thread" and then the other thread can be more time consuming about being smart with making that space available to malloc() again.

    OK I'm expressing it badly. Obviously an app which needs memory is blocked until it gets the memory. But the same isn't true of releasing memory: you don't *need* your free()ed blocks to be immediately available. So maybe that's what they do: backload the joint malloc()+free() work into free(), make malloc() really dumb, and shift the free() work off to the other thread via a simple queue.

  14. Nothing new here by Anonymous Coward · · Score: 0

    The guys that developed SmartHeap http://www.microquill.com/ have been preaching this for years.

  15. Re:free() is probably more parallizable than mallo by JessGras · · Score: 5, Informative

    Now digesting the real paper at http://www.ece.ncsu.edu/arpers/Papers/MMT_IPDPS10.pdf, they do do a trick of making free() asynchronous to avoid blocking there, but they also do a kind of client-server thing, with a nontrivial but fast and dumb malloc client in the main thread.

    Not bad. They really tried a lot of different stuff, thought some stuff out carefully. This reviewer approves!

  16. How does this compare by Anonymous Coward · · Score: 0

    How does this compare to jemalloc(), for example? This paper seems to be either very overhyped or plain outright silly. What does this do that other methods to improve allocations like the Windows Low Fragmentation Heap (LFH) and BSD's jemalloc() don't?

    1. Re:How does this compare by julesh · · Score: 1

      How does this compare to jemalloc(), for example? This paper seems to be either very overhyped or plain outright silly. What does this do that other methods to improve allocations like the Windows Low Fragmentation Heap (LFH) and BSD's jemalloc() don't?

      It solves an entirely different problem: those systems improve multithreaded allocation performance by (effectively) reducing the amount of locking and/or number of cache misses per operation. This system improves *single* threaded performance by offloading work to a new thread.

  17. Obligatory Penny Arcade by Yvan256 · · Score: 2, Interesting
    1. Re:Obligatory Penny Arcade by supssa · · Score: 1

      Huhuh. How is it being a virgin?

      --
      Hatin' on products I don't like and getting modded up talking about tech I totally don't understand like it was 2005!
  18. Re:free() is probably more parallizable than mallo by droopycom · · Score: 1

    So basically, they apply garbage collection techniques to regular malloc/free program

    Not bad, all things considered...

  19. Read that title sensationalisticly wrong by Anonymous Coward · · Score: 0

    After I scrolled down and then back up the front page, I am genuinely disappointed that this was not a psychology-based article. Imagine slowpokes like me thinking all all the possibilities for a title I parsed as "Memory management technique speed APES by 20%" :(

  20. shameless plug by Anonymous Coward · · Score: 0

    a good friend of mine has done this and much better. check out http://newcodeinc.com/ for a threaded fast drop-in replacement for malloc. yes, it is closed source and commercial. i am slowly influencing him to open source it and gain more customers for support. it is currently being used by several large financial firms who love it.

  21. Multicore for word processing? by Anonymous Coward · · Score: 0

    Okay, I can understand multicore for web broswers. But word processors? What kind of insanity does it take to write a word processor that needs more than a late-90s PC to run smoothly, let alone multi-anything?

    If your word processor can't run smoothly on a single core, it's not the memory allocator's fault. Jesus.

  22. Does it matter anymore? by Tenareth · · Score: 2, Interesting

    There are very few developers left in the US that even know what memory management is.

    People don't even try anymore.

    --
    This sig is the express property of someone.
    1. Re:Does it matter anymore? by jasmusic · · Score: 3, Insightful

      Those developers can hold the rest of the software industry hostage for mad income. OS kernels don't write themselves.

    2. Re:Does it matter anymore? by alesplin · · Score: 1

      It matters to the swarms of new "iPhone programmers" dealing with memory management for the first time. I've seen lots of questions on StackOverflow dealing with memory management from people migrating from other languages.

  23. Re:free() is probably more parallizable than mallo by SirSlud · · Score: 1

    It's clear this is widely available - the devil is in the details. You can always abstract over a blocking system when you have threads/cores available. Those who work on fixed platforms like me kinda have to look to making the threading/locking more efficient. If you have the option of spending money on more potential parallel computing power for less, and can move up the chain, then solutions like this are more valuable.

    It's interesting watching people talk about this kinda stuff, because ultimately the value of these kinds of approaches really either come down to what might be practical on the next game console that MS or Sony will release, or what is immediately valuable to you if you're pretty malloc bound and are in a position to replace your hardware anytime you like.

    They mention word processors cause that's what turns heads. Ultimately, that's what R&D is .. its super specific, it's not immediately applicable to all, and in 20 years, it'll probably be standard.

    As it pertains to this article .. it's handling the caching for new/free on a thread. As other posters have pointed out, it's not a new academic point of interest, and like all other things like this, it's value will go from specific to becoming an "ok, I have the hardware and or implementation bandwidth for that."

    --
    "Old man yells at systemd"
  24. More useful is the inline protection algorithms by Chirs · · Score: 2, Interesting

    The thing that got my attention was the fact that once you offload the memory management onto the other core you can then do tracing, profiling, debugging and security analysis of the memory management portion (pre-zeroing, range analysis, double-free, etc.) with very little impact to the main thread because the additional work is done on the (otherwise mostly unused) memory management thread.

  25. processor affinity by SpaceLifeForm · · Score: 1
    --
    You are being MICROattacked, from various angles, in a SOFT manner.
  26. Parallelize word processor? by Khyber · · Score: 1

    given you could do most things a typical word processor can do on a 486, why in the world would you need to parallelize it?

    --
    Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
    1. Re:Parallelize word processor? by julesh · · Score: 1

      given you could do most things a typical word processor can do on a 486, why in the world would you need to parallelize it?

      I'd like to see an implementation of grammar-check-while-typing that works realistically quickly on a 486.

    2. Re:Parallelize word processor? by CarlosM7 · · Score: 1

      Memory defragmentation?

    3. Re:Parallelize word processor? by JasterBobaMereel · · Score: 1

      I'd like to see an implementation of grammar-check-while-typing that works ....at all

      All the implementations so far are wrong and annoying ...

      --
      Puteulanus fenestra mortis
  27. hi by nehaworld · · Score: 0, Offtopic

    It's interesting to read your post http://duiattorneyorangecountyca.com/

  28. Re:Nothing to see here....IPDPS paper by Anonymous Coward · · Score: 0

    Read the paper too. 100% agree with you, which of course makes sense since
    IPDPS is known for publishing subpar systems papers.

  29. libc by Anonymous Coward · · Score: 0

    While most programs do use libc others try to either avoid libc altogether (for example: http://blog.ksplice.com/2010/03/libc-free-world/ ) or use other "diet" versions of libc (for example: http://www.fefe.de/dietlibc/ )

    What about those programs? Will they get to enjoy what the article talks about?

  30. Not exactly news (at least 15years old) by Nagilum23 · · Score: 2, Interesting

    If I may quote: "Phkmalloc was written by Poul-Henning Kamp for FreeBSD in 1995-1996 and subsequently adapted by a number of operating systems, including NetBSD, OpenBSD, and several Linux distributions." Phkmalloc is the example implementation they used for the benchmarks..

  31. My technique: by dushkin · · Score: 0

    I have a wonderful memory management technique to speed up apps by 20%:

    MANAGE to put MOAR RAMZ.

    --
    o hai
  32. A talk from the Erlang developers address this by Sanoj,+the+developer · · Score: 1

    I saw a talk by the Erlang developers that also said that malloc was the bottleneck in a multicore system, so they wrote their own for Erlang virtual machine. See "Insights" on page 11: http://www.erlang-factory.com/upload/presentations/188/ErlangUserConference2009-PatrikNyblom.pdf Or watch the talk: http://www.erlang-factory.com/conference/ErlangUserConference2009/speakers/PatrikNyblom

  33. Web browsers? by Terrasque · · Score: 1

    Shouldn't web browsers be almost embaressingly easy to parallellize, or am I missing something here?

    Each page in its own thread / process, plus loading of page:

    1. dns resolve
    2. fetch html
    3. parse html
      3a. find css link, start thread for fetching and parsing css
      3b. find image link, start thread for fetching and parsing image
      3x. so on and so forth for other resources encountered.
    4. run js in seperate thread if needed
    5. run plugins in seperate threads / processes

    The lags I forsee here are network related, and as far as I can see, it should be rather easy to parallellize large chunks of it.

    But, as I haven't written my own web browser, I'm probably missing something fundamental here.

    --
    It's The Golden Rule: "He who has the gold makes the rules."
    1. Re:Web browsers? by JasterBobaMereel · · Score: 1

      ...the only things that take a significant time (besides network lag) are the things they are always trying to speed up (Plugins and Javascript/Java)

      The correct solution is to run a real application instead

      The workaround is to speed up the processing of scripting languages and VM languages that have no way of fully utilising the real hardware

      --
      Puteulanus fenestra mortis
    2. Re:Web browsers? by godrik · · Score: 1

      Shouldn't web browsers be almost embaressingly easy to parallellize, or am I missing something here?

      In parallel computing, nothing is embarassingly parallel. We should say pleasantly parallel since it makes our life easier. :)

      BTW, in my favourite slugish firefo, rendering is the part that takes time (probably mainly because it depends on cairo) and not fetching pages...

  34. Web browsers??? by Anonymous Coward · · Score: 0

    "The technique could be especially valuable for programs that are difficult to parallelize, such as word processors and Web browsers."

    What's so difficult about parallelizing Web browsers? I have 5 tabs open in my Google Chrome browser right now. Google Chrome runs each tab in it's own process. Seems to be taking advantage of my multi-core system quite well.

  35. Hmm... by Anonymous Coward · · Score: 0

    I can see how you can make gains when _DE_ allocating the memory, since there are no dependencies on the memory at that stage (by it's very nature, you've free'd it because nothing is using it), so I can see how deallocs can be totally decoupled and made to run asynchronously as a "fire and forget", that's an easy "win" if done right.

    But I'm much less convinced about allocations via a separate thread, due to the very fact that you're implicitly waiting for either successful allocation, NULL or an exception, depending on your memory allocation error handling model....

  36. libmtmalloc, libumem on Solaris by zm · · Score: 1

    there are many allocator implementations that behave way better than the single-thread-optimized default malloc. Hooray for another one. Here is one old article about that: http://developers.sun.com/solaris/articles/multiproc/multiproc.html IIRC even glibc uses a slab allocator these days, and is much better performance wise on multi processor machines.

    --
    Sig ?
  37. How is this different than the linux slab ... by guysmilee · · Score: 1

    How is this different than the linux slab allocator? http://en.wikipedia.org/wiki/Slab_allocation I guess I could RTFPdf ...

  38. Researchers learn about threading by Rasperin · · Score: 1

    News at 11.

    --
    WTF Slashdot, why do I have to login 50 times to post?
  39. It is just the opposite... by DaitanGio · · Score: 1

    It is concurrency to evolve to better memory management, not the opposite.
    A modern PC has a small amount of RAM, plenty of Hard disk space and network storage.
    Operating System evolved for managing CPU time and application memory. The application has little knowledge of memory management.

    In a similar way, database are a way of managing huge amount of information, in the same abstraction like Service Oriented Application can be seen like an abstraction of web communication.

    Take a look to
    http://gioorgi.com/2009/evolving-concurrency/

    --
    -- Giovanni Daitan Giorgi http://gioorgi.com http://www.siforge.org
  40. Re:free() is probably more parallizable than mallo by thePowerOfGrayskull · · Score: 1

    It also seems to me that the memory handling thread can also spend its idle time locating the next available chunk -- or several of them, based on size ranges. (This only makes sense with custom pooling.) That being said, I'm gonna go and read TFP to see if they've already talked about that...

  41. As usual Smalltalk's been there done that by itsybitsy · · Score: 2, Informative

    Not much to see in the article. Move along.

    IBM (not Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).

    One problem is that when you run out of memory space the application native threads (many in Smalltalk/MT) are blocked waiting for the one garbage collection thread to catch up. It all depends upon how much new memory is allocated depending on how much is freed up. They have a solution and are working to implement it. It's likely to use multiple native threads for the gc balancing out the freeing with the consumption. Another solution is to have each worker thread also switch into a gc thread in cases when it's starved for memory.

    Another solution is to use memory structures that don't require garbage collection. In other words, REUSE rather than RECYCLE.

    1. Re:As usual Smalltalk's been there done that by itsybitsy · · Score: 1

      OOPS, that should be:

      IBM (NOW Instantiations) Visual Age Smalltalk has run the garbage collector in a separate native thread for eons now, as has Smalltalk/MT (Multi-Threaded).

  42. As ever, video games way ahead of academia by jamie(really) · · Score: 1

    I always find it interesting when academics write papers about things that games programmers have been doing for years. Not that we had a client-server memory management thread mind you, because that would be fucking crazy, but preallocation of memory, and avoiding locks and syncs etc. Seriously, if you have a program that does 13 million allocations per second (see Table II), wouldn't a better memory manager be on your list?

  43. BS getting published by parallel_prankster · · Score: 1

    wow, being from NCSU and personally knowing most of the people involved in this project, this is nothing but a paper. There was nothing much in this work that has not been done already and this paper itself got published after getting rejected numerous times in most high quality conferences. There are so many issues with this technique I dont know where to begin with. But it suffices to say that these people have abandoned this and moved on. How did this article become so popular ? Well, a few days back some newspaper people came to us to ask for a paper that has been published but not presented and this seemed to be the only one. They have glorified this paper so much in their article it almost sounds like an invention !!! Wow, we truly live in an mis-information age