Scalable Nonblocking Data Structures
An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
It means that the atomic operation required for lock-free data structures are now available for Java from this vendor. Welcome to 1988!
The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [Practical Lock Freedom]
Google Talk by the author.
700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Or... is this just a way to avoid having to get the really good coders who are more costly than the burn-bags?
If brevity is the soul of wit, then how does one explain Twitter?
1988? Atomic?
Is this something to do with a Blondie tour?
AT&ROFLMAO
Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 .
Before, data structures would only perform well in 50-100 threads. With this work, he has it up to over 700 threads, but it hasn't been load tested yet. There's a good chance that he's on the forefront of the next generation of data structures, there's a good chance that his work will be included in the java core (although that's not saying much considering).
Congrads Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.
...
KeyWords:
concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures
davecb5620@gmail.com
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.
2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.
If it is not an inner loop then the locking is probably not a big issue any way.
3. Java, really, Java?
Now *that* is what I call geek speak.
There are 48 Vega2 processors per board, with 1 core per processor. The biggest setup stacks 16 of these boards to get to 768 cores.
Too bad this isn't being done in Perl.
It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.
This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more important, especially in Java; stalls due to array resizing are a headache in real time systems with dynamic memory. It works about the way you'd expect; there are flags indicating the location (old or new) of each array element as the array is copied. Making all this work correctly is touchy.
Concurrent algorithms like this are incredibly hard to write, and you need formal methods to get them right. The author of the paper still hasn't figured out to code a FIFO under these rules.
"# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."
Clearly, the data structures have been touched by his noodly appendage.
If you can read this, I forgot to post anonymously.
Anything (free software) like this on C/C++?
If Java surpasses C in performance, then why aren't the Unreal or Doom engines written in it? What you meant to say is: "for most applications, Java can surpass C++ performance."
Oversimplifying things helps no one, except maybe your moderation score if the moderators are that clueless.
Will someone please explain WTF "2.5" means in the summary?
Media that can be recorded and distributed can be recorded and distributed.
-kfg
Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java
Hmmm... one core per Java thread?
That sounds about right for Java apps...
Just my
This still has limited applicability. Making a single data structure useful across hundreds of CPUs is impressive, but many problems can be more easily solved by using multiple structures - pipelining it, or using divide and conquer, or any of many other approaches.
The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).
With such applications you will often have a array of variables that are accessible by all threads (eg. current processing modes of the application).
To preserve the integrity of the system, you need to only allow one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.
The author uses the Load-Link/Store Conditional pair of instructions to guarantee that the new value is written to all locations. Load-Link loads the value from memory. Store-Conditional only writes the value back if no other write requests have been performed on that location, otherwise it fails.
Check-And-Set only replaces the variable with a new value if the value of the variable matches a previously read old value.
Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
3. [...] The FSM supports array resize and is used to control writes.
So their algorithm does not work without the interference of the Flying Spaghetti Monster?
Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads, otherwise if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server since you quickly reach the maximum thread count while 99% of your opteron system is pretty much sitting there doing nothing.
Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures, but if you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant time operation (just update the reference to point to a new location), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are, it is the way that you are interacting with the collection.
Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.
I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists is 1 CPU cycle.
Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
I work in the real-time embedded world and we've been doing this for years.
Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this.
Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks was LISP (I think it was Common LISP). I don't use LISP because I hate the syntax (and LISP fanboys regard it as a requirement of club membership that you like it). But the performance is quite impressive (with GC too), and that's what I would look into if I needed top performance from a high level language. Eat your heart out C++.
I use Java quite a bit, and the biggest drawback performance wise is the amount of memory chewed up by the GC schemes and JIT. It is otherwise quite fast. I often turn off JIT to save memory (at the expense of CPU), and I would like to see options to use a GC algorithm that minimizes memory consumption (at the expense of CPU).
I hereby appoint you official summary explainer!
Thanks
AT&ROFLMAO
Which is great and all, but what we usually need is more of a summary executioner.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
Bah. Just use /dev/null. Unlimited concurrent writers. Scales to massive volumes.
Well... if I remember what I learned from OS and hardware classes right. LL/SC and CAS operations do involve locks at the hardware level. These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.
So I fail to see anything radically new in this project. It is not lock-free at all. It sounds more like bogus marketing pitch to me.
Can someone correct or explain to me?
They are really non-blocking, and use a standard Java API - the java.util.concurrent.atomic package. The description of the package indicates it was intended as the low level basis for non-blocking data structures. While most JVMs are written largely in C, the Java code gets compiled to machine language. Standard packages do not have to be implemented via JNI, and can be implemented directly by the JVM - just like GNU C implements many standard C library functions directly in the compiler for performance. In fact, java.util.concurrent.atomic would be a poor choice for implementing via JNI because the transition between C and Java code that happens on every JNI call would defeat the high performance purpose of the package.
Since you're apparently reading this - shoot me an email. I'd like to discuss if this could work with Jython.
I'm thinking using a single Honkin' Big Iron instead of a farm for a massive-use service.
TransactionKit is another lockless highly concurrent data structure, specifically a hash table.
But a FSM-controlled data access for multi-threading is a bit different. The closest I've seen is the implementation of dual-port RAM which uses hardware state machines to control access to the underlying shared RAM.
Oh, I've been doing embedded for over 20 years too.
Engineering is the art of compromise.
Someone posted the video and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive over-head. To hear Dr. Cliff Click talk it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself but when you see something elegant that seems natural afterward it's probably right.
The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago) and he's not getting bogged down with locking and synchronize... instead he has a very simple "think about it" approach that uses the state machine as a thinking aide. Whom ever posted the video... thank you that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?
[signature]
Want to show us one example we can compile and run on our PCs to show that Java is so uber-fast compared to C++?
Just one example of a program written in both C++ and Java where Java runs faster?
If the Java runs faster I guarantee it's badly written C++ and I'll "optimize the hell" out of it in about 10 minutes and beat your Java. I guarantee it.
The reason nobody bothers with C++/Java benchmarks is because nobody does serious work in Java any more.
it is the platform (i.e. JRE). Slow. As molassas. But then again, so is managed C++ on the CLR.
So it goes, the circular argument....
I am very small, utmostly microscopic.
The problem is with doing a copy on write (COW) of the hashtable on a resize operation. In conventional COW a thread makes the new copy in private storage and then does a compare and swap (or ll/sc) on a global pointer to the hashtable. Since resizing a hashtable is a rather expensive operation you can get a lot of threads all attempting to do the resize with a lot of private copies. All but one attempt will fail, making conventional COW rather wasteful.
I believe Cliff's technique involves allocating a new array and letting all threads accessing the hashtable attempt to populate the new hash array. Kind of a lazy initialization on a per array element basis. Once the new array is fully populated the old one can be discarded. There's probably some FSM logic involved. I haven't looked in detail.
With an army of 768 atomic powered, logically replicated finite state machines, I would rule the world!
Cliff, stop being a douchebag. The word you are looking for is "fast" or "speedy" or "fast enough."
There is no word as "performant". Pretending there is, using words just to obfuscate or make your dick seem bigger is a sign of pretentious douchebaggery.
STOP IT!
Nothing vindicates a man's work more than attracting the obsessive attention of anonymous shitstains like yourself.
> Your turn.
:)
Catamorphisms. Linear Logic.
Back to you
Done with slashdot, done with nerds, getting a life.
Researchers don't work at the fringes of what can be done "more easily". They work at the fringes of what is currently possible.
Think about multi-tasking for a moment. If your problem is inherently sequential (e.g. everything is effectively sequentialised by a shared resource that can't be split, such as a piece of hardware), then you can use a single-process event loop. If your problem is inherently parallel (e.g. a web server), then you can use multiple forked processes. In otherwords: If your problem is easy, use an easy solution.
But not every job is easy. That's why we have multi-threading.
Similarly, these new data structures are designed to solve difficult problems.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
This is how we've been teaching computer science to share memory since there was more then one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.
And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too!
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
So is this going to help get rid of the python global interpreter lock(GIL) or what?
I don't necessarily take issue with what you've said (I've got you on my friend list, so I'm fairly sure that your one of the professionals on this site whose insight I love to read... I'm a software development major in college, so I usually add programmers to my friends list when I genuinely value their insights...), but I would like to point out that I've seen another approach to your first point about locking busses. The Linux kernel, IIRC, does a sideways and backwards cache flush to RAM before masking IRQs on an SMP system. I think it defers top half of interrupts to another CPU and doesn't queue the bottom halves. I might be mistaken on that, though. It then 'takes over' the CPU for the process/thread, and when it's done, it flushes all the caches once again.
But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
We outsourced that to China.
They said they'd provide nonblocking executions in a scalable manner.
Because coding a bad, poorly parallelizable algorithm in ASM vs C will give you at best a 2x speedup.
And coding a good, hyper parallelizable algorithm in Java vs C will give you at worst a 1/2x speed down, and at the same time an nx speed up, where n=768.
And I'm not even talking about the mess ASM would be.
Usually scalable algorithm are slower when the CPU count is low so the question is: how many CPU are needed so that the scalable algorithm is as good as the non scalable one?
Strangely this isn't talked about in this 'lightweight' article..
Amb. Calculus of constructions. Higher-order barbed bisimulation. Stochastic processes. :)
And you're cheating with your signature
In Computer Science terms, atomic operations are ones which prevent the 'he who writes last - wins' syndrome.
In multiprocessor architectures the use of resource locking normally does the job, although one of the drawbacks is forcing other threads of execution to wait until the lock is released. There are non-locking and wait-free methods that use low level atomic operations (like 'compare and save') and generally use data structures that support these methods (stacks, queues, sets and hashes).
Lodragan Draoidh
The more you explain it, the more I don't understand it. - Mark Twain
Anything like this (Scalable Nonblocking Data Structures) but on C/C++?
And in the meantime, clever guys who decided not to spoil their lives with half-baked programming languages are doing things in a much more beautiful way...
Ezekiel 23:20
The presenter states that there are only a handful of people in the world savvy enough to program like this.
Does the 700 core hashtable go ping? You are not allowed to use it because you are not q-u-a-l-i-f-i-e-d!