Scalable Nonblocking Data Structures
An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
Woah, yeah - sounds great from the summary.
Now WTF does that all mean!?
I have to RTFA!?
Damn!
AT&ROFLMAO
why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [Practical Lock Freedom]
Google Talk by the author.
700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Or... is this just a way to avoid having to get the really good coders who are more costly than the burn-bags?
If brevity is the soul of wit, then how does one explain Twitter?
Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 .
Congrads Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.
...
KeyWords:
concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures
davecb5620@gmail.com
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.
2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.
If it is not an inner loop then the locking is probably not a big issue any way.
3. Java, really, Java?
Now *that* is what I call geek speak.
There are 48 Vega2 processors per board, with 1 core per processor. The biggest setup stacks 16 of these boards to get to 768 cores.
Too bad this isn't being done in Perl.
It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.
This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more important, especially in Java; stalls due to array resizing are a headache in real time systems with dynamic memory. It works about the way you'd expect; there are flags indicating the location (old or new) of each array element as the array is copied. Making all this work correctly is touchy.
Concurrent algorithms like this are incredibly hard to write, and you need formal methods to get them right. The author of the paper still hasn't figured out to code a FIFO under these rules.
"# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."
Clearly, the data structures have been touched by his noodly appendage.
If you can read this, I forgot to post anonymously.
Anything (free software) like this on C/C++?
If Java surpasses C in performance, then why aren't the Unreal or Doom engines written in it? What you meant to say is: "for most applications, Java can surpass C++ performance."
Oversimplifying things helps no one, except maybe your moderation score if the moderators are that clueless.
Will someone please explain WTF "2.5" means in the summary?
Media that can be recorded and distributed can be recorded and distributed.
-kfg
Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java
Hmmm... one core per Java thread?
That sounds about right for Java apps...
Just my
3. [...] The FSM supports array resize and is used to control writes.
So their algorithm does not work without the interference of the Flying Spaghetti Monster?
Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads, otherwise if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server since you quickly reach the maximum thread count while 99% of your opteron system is pretty much sitting there doing nothing.
Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures, but if you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant time operation (just update the reference to point to a new location), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are, it is the way that you are interacting with the collection.
Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.
I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists is 1 CPU cycle.
Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
I work in the real-time embedded world and we've been doing this for years.
.....does it run Linux?
Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this.
Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks was LISP (I think it was Common LISP). I don't use LISP because I hate the syntax (and LISP fanboys regard it as a requirement of club membership that you like it). But the performance is quite impressive (with GC too), and that's what I would look into if I needed top performance from a high level language. Eat your heart out C++.
I use Java quite a bit, and the biggest drawback performance wise is the amount of memory chewed up by the GC schemes and JIT. It is otherwise quite fast. I often turn off JIT to save memory (at the expense of CPU), and I would like to see options to use a GC algorithm that minimizes memory consumption (at the expense of CPU).
Bah. Just use /dev/null. Unlimited concurrent writers. Scales to massive volumes.
Well... if I remember what I learned from OS and hardware classes right. LL/SC and CAS operations do involve locks at the hardware level. These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.
So I fail to see anything radically new in this project. It is not lock-free at all. It sounds more like bogus marketing pitch to me.
Can someone correct or explain to me?
They are really non-blocking, and use a standard Java API - the java.util.concurrent.atomic package. The description of the package indicates it was intended as the low level basis for non-blocking data structures. While most JVMs are written largely in C, the Java code gets compiled to machine language. Standard packages do not have to be implemented via JNI, and can be implemented directly by the JVM - just like GNU C implements many standard C library functions directly in the compiler for performance. In fact, java.util.concurrent.atomic would be a poor choice for implementing via JNI because the transition between C and Java code that happens on every JNI call would defeat the high performance purpose of the package.
Since you're apparently reading this - shoot me an email. I'd like to discuss if this could work with Jython.
I'm thinking using a single Honkin' Big Iron instead of a farm for a massive-use service.
TransactionKit is another lockless highly concurrent data structure, specifically a hash table.
But a FSM-controlled data access for multi-threading is a bit different. The closest I've seen is the implementation of dual-port RAM which uses hardware state machines to control access to the underlying shared RAM.
Oh, I've been doing embedded for over 20 years too.
Engineering is the art of compromise.
Someone posted the video and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive over-head. To hear Dr. Cliff Click talk it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself but when you see something elegant that seems natural afterward it's probably right.
The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago) and he's not getting bogged down with locking and synchronize... instead he has a very simple "think about it" approach that uses the state machine as a thinking aide. Whom ever posted the video... thank you that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?
[signature]
Want to show us one example we can compile and run on our PCs to show that Java is so uber-fast compared to C++?
Just one example of a program written in both C++ and Java where Java runs faster?
If the Java runs faster I guarantee it's badly written C++ and I'll "optimize the hell" out of it in about 10 minutes and beat your Java. I guarantee it.
The reason nobody bothers with C++/Java benchmarks is because nobody does serious work in Java any more.
These algorithms will always be tainted by Dr. Click's usage of the non-word "performant."
In using the word "performant" instead of the word "fast" or "fast enough" the author shows he is more interested in fellating his career than in advancing computer science or communicating his ideas.
What a douchebag.
it is the platform (i.e. JRE). Slow. As molassas. But then again, so is managed C++ on the CLR.
So it goes, the circular argument....
I am very small, utmostly microscopic.
The problem is with doing a copy on write (COW) of the hashtable on a resize operation. In conventional COW a thread makes the new copy in private storage and then does a compare and swap (or ll/sc) on a global pointer to the hashtable. Since resizing a hashtable is a rather expensive operation you can get a lot of threads all attempting to do the resize with a lot of private copies. All but one attempt will fail, making conventional COW rather wasteful.
I believe Cliff's technique involves allocating a new array and letting all threads accessing the hashtable attempt to populate the new hash array. Kind of a lazy initialization on a per array element basis. Once the new array is fully populated the old one can be discarded. There's probably some FSM logic involved. I haven't looked in detail.
With an army of 768 atomic powered, logically replicated finite state machines, I would rule the world!
Cliff, stop being a douchebag. The word you are looking for is "fast" or "speedy" or "fast enough."
There is no word as "performant". Pretending there is, using words just to obfuscate or make your dick seem bigger is a sign of pretentious douchebaggery.
STOP IT!
> Your turn.
:)
Catamorphisms. Linear Logic.
Back to you
Done with slashdot, done with nerds, getting a life.
This is how we've been teaching computer science to share memory since there was more then one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.
And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too!
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
So is this going to help get rid of the python global interpreter lock(GIL) or what?
I don't necessarily take issue with what you've said (I've got you on my friend list, so I'm fairly sure that your one of the professionals on this site whose insight I love to read... I'm a software development major in college, so I usually add programmers to my friends list when I genuinely value their insights...), but I would like to point out that I've seen another approach to your first point about locking busses. The Linux kernel, IIRC, does a sideways and backwards cache flush to RAM before masking IRQs on an SMP system. I think it defers top half of interrupts to another CPU and doesn't queue the bottom halves. I might be mistaken on that, though. It then 'takes over' the CPU for the process/thread, and when it's done, it flushes all the caches once again.
But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
Because coding a bad, poorly parallelizable algorithm in ASM vs C will give you at best a 2x speedup.
And coding a good, hyper parallelizable algorithm in Java vs C will give you at worst a 1/2x speed down, and at the same time an nx speed up, where n=768.
And I'm not even talking about the mess ASM would be.
Usually scalable algorithm are slower when the CPU count is low so the question is: how many CPU are needed so that the scalable algorithm is as good as the non scalable one?
Strangely this isn't talked about in this 'lightweight' article..
Amb. Calculus of constructions. Higher-order barbed bisimulation. Stochastic processes. :)
And you're cheating with your signature
Anything like this (Scalable Nonblocking Data Structures) but on C/C++?
And in the meantime, clever guys who decided not to spoil their lives with half-baked programming languages are doing things in a much more beautiful way...
Ezekiel 23:20
The presenter states that there are only a handful of people in the world savvy enough to program like this.
Does the 700 core hashtable go ping? You are not allowed to use it because you are not q-u-a-l-i-f-i-e-d!