Slashdot Mirror


Scalable Nonblocking Data Structures

An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
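A minimal sketch of the coding style the summary describes, assuming a tiny three-state machine per array word (empty, holding a value, tombstoned). The class name `SlotArray`, its methods, and the states are hypothetical illustrations and are not taken from Click's library; only `AtomicReferenceArray.compareAndSet` is a real JDK call.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical illustration only; not code from Click's data structures.
final class SlotArray {
    // Sentinel for a removed value. A slot never goes back to empty.
    private static final Object TOMBSTONE = new Object();

    // One large array; every word is updated only through atomic compare-and-set.
    private final AtomicReferenceArray<Object> slots;

    SlotArray(int size) {
        slots = new AtomicReferenceArray<>(size);
    }

    // Transition empty (null) -> value. Succeeds for exactly one caller per slot.
    boolean claim(int i, Object value) {
        return slots.compareAndSet(i, null, value);
    }

    // Transition value -> tombstone. Succeeds only if the slot still holds `expected`.
    boolean remove(int i, Object expected) {
        return expected != null
                && expected != TOMBSTONE
                && slots.compareAndSet(i, expected, TOMBSTONE);
    }

    // Readers never block; a tombstoned slot reads as absent.
    Object get(int i) {
        Object v = slots.get(i);
        return v == TOMBSTONE ? null : v;
    }
}
```

Each slot moves through its states independently of every other slot, so threads working on different array words never wait on one another.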

25 of 216 comments

  1. why by damn_registrars · · Score: 4, Interesting

    Why is there fewer than one thread per core? The summary says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
    1. Re:why by Chris Burke · · Score: 4, Informative

      Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").

      It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.

      --

      The enemies of Democracy are
    2. Re:why by maraist · · Score: 5, Informative

      Message passing systems and MT systems solve different problems. Consider that Message Passing is a subclass of Multi-Processing; in general the amount of work is much larger than the data-set. But Multi-Threading often involves many micro-changes to a large message (the entire state of the process).

      Consider an in-memory database (MySQL Cluster (NDB), for example). You wouldn't want to pass the entire database around (or even portions of it) for each 'job'. Instead, you'd want at most partitions of the data, with a massive working set residing on each partition and inter-data operations done within it. Then your message passing is limited to interactions with data that isn't held in the local memory space (i.e. NUMA).

      With Terracotta you are breaking a sequential application into a series of behind-the-scenes messages which go from clustered node to clustered node as necessary (I'm not very well versed on this product, but I've reviewed it a couple times).

      Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. And thus, the more CPUs (that leverage MESI (sp?) CPU cache) the more efficient the overall architecture.

      Now, I can't imagine that a 768-CPU monster is that cost effective - your problem space is probably pretty limited. But a simultaneous 700-thread application is NOT hard to write in Java at all. I regularly create systems that have between 1,000 and 2,000 semi-active threads. But I try to keep a CPU-intensive pool down to near the number of physical CPUs (4, 8 or 16 as the case may be). Java has tons of tools to allow execution pools of configurable size.
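
      A minimal sketch of such a configurable execution pool, assuming the standard java.util.concurrent executors. The class name `CpuBoundPool` and the toy summing workload are hypothetical; the pool is sized from `Runtime.availableProcessors()`, which reports logical rather than physical processors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: keep CPU-heavy work on a pool sized to the processor count.
final class CpuBoundPool {
    // For each n in inputs, compute 1 + 2 + ... + n on the pool and return the grand total.
    static long runAll(int[] inputs) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int n : inputs) {
                tasks.add(() -> {
                    long sum = 0;
                    for (int i = 1; i <= n; i++) {
                        sum += i;
                    }
                    return sum;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("pool task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

      However many tasks are queued, at most `workers` of them run at once, which is the point of capping the CPU-intensive pool.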

      --
      -Michael
  2. Inspiration... by green-alien · · Score: 5, Informative

    The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [Practical Lock Freedom]

  3. Google Talk by jrivar59 · · Score: 5, Informative

    Google Talk by the author.

  4. Re:Sounds great! by Linker3000 · · Score: 4, Funny

    1988? Atomic?

    Is this something to do with a Blondie tour?

    --
    AT&ROFLMAO
  5. Re:768 Cores? by jlechem · · Score: 3, Funny

    Or better yet actually run Windows Vista. Zing!

    --
    Hold up, wait a minute, let me put some pimpin in it
  6. Google Tech Talk by Bou · · Score: 4, Informative

    Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 .

  7. Re:Java???? by Anonymous Coward · · Score: 4, Insightful

    700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

    Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

  8. Re:Sounds great! by moderatorrater · · Score: 4, Funny

    Before, data structures would only perform well in 50-100 threads. With this work, he has it up to over 700 threads, but it hasn't been load tested yet. There's a good chance that he's on the forefront of the next generation of data structures, and a good chance that his work will be included in the Java core (although that's not saying much considering).

  9. Re:Java???? by AKAImBatman · · Score: 3, Insightful

    700 threads in JAVA? Why not use C++

    Hmm... lemme think about that. Maybe because Java has decent threading support built into the language? Maybe because the platform is portable to any architecture? Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?

    "Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore. Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.
  10. Re:Java???? by Kupek · · Score: 3, Informative

    Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.

  11. Re:Java???? by Anonymous Coward · · Score: 5, Funny

    700 threads in assembler? Why not use JAVA, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

    Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

  12. Geek serendipity in a summary by ThreeGigs · · Score: 3, Funny

    and a finite-state machine built from the atomic update and logically replicated per array word.


    Now *that* is what I call geek speak.
  13. Re:Java???? by famebait · · Score: 4, Insightful

    There is no way anything less than _really_ good coders would get something like this to work with any semblance of efficiency. If you still evaluate coders by which language they use, chances are you're not really that good a programmer.

    --
    sudo ergo sum
  14. From the article: by Kingrames · · Score: 3, Funny

    "# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."

    Clearly, the data structures have been touched by his noodly appendage.

    --
    If you can read this, I forgot to post anonymously.
  15. Re:Java???? by Reverend528 · · Score: 3, Funny

    Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job
    Since when has writing code quickly ever been considered one of Java's strong points? Personally I'd take stdio over Java's alternative (file wrapped in a stream buffer wrapped in a buffered reader wrapped in an enigma) any day of the week.

    Sure, Java manages memory for you, but it's generally much easier to incorporate a garbage collector into C than it is to write Java without file I/O.

  16. Re:scalable noNBLocking data sTRructures .. :) by Seferino · · Score: 5, Interesting

    Good for us. Get the rabble away from Slashdot. Only true nerds should understand the contents. Let me add a few keywords to get rid of the softies: monads, higher-order type systems, return type, genericity.
    Your turn.

  17. Re:Sounds great! by mikael · · Score: 4, Informative

    The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).

    With such applications you will often have an array of variables that are accessible by all threads (e.g. current processing modes of the application).

    To preserve the integrity of the system, you need to allow only one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with a large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.

    The author uses the Load-Link/Store-Conditional pair of instructions to guarantee that the new value is written to all locations. Load-Link loads the value from memory. Store-Conditional only writes the value back if no other write requests have been performed on that location, otherwise it fails.

    Compare-And-Swap (CAS) only replaces the variable with a new value if the value of the variable matches a previously read old value.

    Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
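
    The "replace only if the old value still matches" step can be sketched in Java with `AtomicInteger.compareAndSet`. This is an illustration under assumptions, not the author's code; the class `CasDemo` and the "raise a shared maximum" task are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example of a compare-and-set retry loop.
final class CasDemo {
    // Raise `shared` to at least `candidate` without taking a lock.
    // Returns the value left in place from this caller's point of view.
    static int updateMax(AtomicInteger shared, int candidate) {
        while (true) {
            int seen = shared.get();               // read the current value
            if (candidate <= seen) {
                return seen;                       // nothing to write
            }
            if (shared.compareAndSet(seen, candidate)) {
                return candidate;                  // write succeeded: value was still `seen`
            }
            // Another thread changed the value in between; re-read and retry.
        }
    }
}
```

    Readers simply call `get()` and are never suspended; only a writer that loses a race has to loop again.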

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  18. Re:Java???? by JMZero · · Score: 5, Insightful

    Java is perfectly fast for real world applications, and I'd agree that the "Java is Slow" idea is outdated.

    But it's not conclusively faster than C++, at least not in a general sense. In terms of a small task involving lots of simple operations, you'll still often see a significant speed increase using C++. This is a good example. Now I'm sure there are more optimizations available on both sides - and plenty of stuff to tweak - but C++ is going to come out ahead by a significant margin on a lot of these tasks.

    A good example where the participants on both sides have some motivation is on TopCoder (where I spend a fair bit of time). Performance isn't usually the driving factor in language choice there - but sometimes it is, and when it is the answer is pretty much always C++ (unless it's a comparison between Java BigInteger and a naive implementation of the same in C++).

    Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance.

    To be clear - these are not usually "real world" tasks. As more memory and objects come into play, Java is going to do better and better. But these kinds of tasks still exist - there are still plenty of places where C++ is going to be the choice because of performance.

    In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong.

    --
    Let's not stir that bag of worms...
  19. Re:"2.5"? WTF? by badboy_tw2002 · · Score: 4, Informative

    He's built two working data structures and is working on a third (had to read the slides to figure that one out).

  20. Re:Sounds great! by Linker3000 · · Score: 3, Funny

    I hereby appoint you official summary explainer!

    Thanks

    --
    AT&ROFLMAO
  21. Re:Sounds great! by hey! · · Score: 4, Funny

    Which is great and all, but what we usually need is more of a summary executioner.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  22. Re:Sounds bogus? by Kupek · · Score: 4, Informative

    Locking in software has implications that locking at the hardware level does not.

    If a thread locks in software, any subsequent thread must block, waiting for the first thread to finish. If the thread is preempted, then the waiting threads wait needlessly. If the thread dies, then the waiting threads are hosed.

    Lock-free techniques prevent this problem, at the expense of more complicated algorithms and data structures. The basic structure of most lock-free algorithms is to read a value, do something to it, and then attempt to commit the changed value back to memory. The attempt fails if another thread has changed the value from underneath you, and you must try again. (This is detected through operations like compare-and-swap.) This allows greater concurrency and guarantees that the system as a whole will make progress, even if a thread is preempted or dies.
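
    A minimal sketch of that read, modify, attempt-to-commit structure, using a lock-free stack (commonly known as a Treiber stack) built on `AtomicReference`. It is a textbook illustration, not one of Click's data structures; the class name `LockFreeStack` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example: every update reads the head, builds a new head,
// and commits with compare-and-set, retrying if another thread got there first.
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        while (true) {
            Node<T> seen = head.get();                   // read
            Node<T> updated = new Node<>(value, seen);   // do something to it
            if (head.compareAndSet(seen, updated)) {     // attempt to commit
                return;
            }
            // Head changed underneath us; try again.
        }
    }

    // Returns null when the stack is empty.
    T pop() {
        while (true) {
            Node<T> seen = head.get();
            if (seen == null) {
                return null;
            }
            if (head.compareAndSet(seen, seen.next)) {
                return seen.value;
            }
        }
    }
}
```

    A thread that is preempted or dies in the middle of `push` or `pop` holds nothing that others need, so the remaining threads keep making progress.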

    Lock-free algorithms and data structures are a well-established area. What Click has done here is provide a Java implementation of some data structures that yield good performance on the manycore systems his company makes.

  23. Re:Sounds bogus? by Chris Burke · · Score: 5, Informative

    These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.

    That's not strictly true.

    First, most lock operations do not require a full bus lock. All you have to do is to ensure atomicity of the load and store. Which effectively means you have to 1) acquire the cache line in the modified state (you're the only one who has it here), and 2) prevent system probes from invalidating the line before you can write to it by NACKing those probes until the LOCK is done. Practically this means the locked op has to be the oldest on that cpu before it can start, which ultimately delays its retirement, but not by as much as a full bus lock. Also it has minimal effect on the memory system. The LOCK does not fundamentally add any additional traffic.

    Second, the way the value is propagated to other CPUs is the same as any other store. When the cache line is in the modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to all the CPUs requesting it, either invalidating or changing to Shared its own copy depending on the types of requests, coherence protocol, etc. If nobody else wants it, and it is eventually evicted from the CPU cache, it will be written to memory. This is the same, LOCK or no.

    Third, an explicit mutex requires at least two separate memory requests, possibly three: one to acquire the lock, one to modify the protected data, and possibly a third to release the lock. This is going to result in two cache misses for the other CPUs, one for the mutex and one for the data, which are both going to be in the modified state and thus only present in the cache of the CPU that held the mutex. In some consistency models, a final memory barrier is required to let go of the mutex to ensure all writes done inside the lock are seen (x86 not being one of them).

    Fourth, with fine enough granularity, most mutexes are uncontested. This means the overhead of locking the mutex is really just that, overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex variable for every element of your data array. This is wasteful of memory and bandwidth. Building your assumptions of atomicity into the structure itself means you use the minimal amount of memory (and thus mem bw), and have the maximal amount of concurrency.

    So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.
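
    To illustrate the fourth point, here is a rough sketch in which each array element is updated atomically in place, with no separate mutex per element. It assumes the JDK's `AtomicLongArray`; the class name `AtomicCounters` and the counting workload are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical example: atomicity lives in the array words themselves,
// so no per-element lock object is allocated or touched.
final class AtomicCounters {
    // Each thread performs `incrementsPerThread` increments, spread round-robin over the slots.
    static long[] countConcurrently(int slots, int threads, int incrementsPerThread) {
        AtomicLongArray counts = new AtomicLongArray(slots);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counts.incrementAndGet(i % slots);   // atomic update of one array word
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        long[] result = new long[slots];
        for (int i = 0; i < slots; i++) {
            result[i] = counts.get(i);
        }
        return result;
    }
}
```

    The only memory involved is the data array itself, and threads hitting different slots do not contend on any shared lock variable.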

    --

    The enemies of Democracy are