Scalable Nonblocking Data Structures
An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."
why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [Practical Lock Freedom]
Google Talk by the author.
If brevity is the soul of wit, then how does one explain Twitter?
1988? Atomic?
Is this something to do with a Blondie tour?
AT&ROFLMAO
Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 .
700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?
Before, data structures would only perform well in 50-100 threads. With this work, he has it up to over 700 threads, but it hasn't been load tested yet. There's a good chance that he's on the forefront of the next generation of data structures, there's a good chance that his work will be included in the java core (although that's not saying much considering).
Hmm... lemme think about that. Maybe because Java has decent threading support built into the language? Maybe because the platform is portable to any architecture? Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?
"Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore. Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.
Javascript + Nintendo DSi = DSiCade
Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.
700 threads in assembler? Why not use JAVA, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.
Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?
1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.
2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.
If it is not an inner loop then the locking is probably not a big issue any way.
3. Java, really, Java?
Now *that* is what I call geek speak.
There is no way anything less than _really_ good coders would get something like this to work with any semblance of efficiency. If you still evaluate coders by which language they use, chances are you're not really that good a programmer.
sudo ergo sum
Umm, because Azul runs the Java in hardware. It *is* optimized by being in Java.
"# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."
Clearly, the data structures have been touched by his noodly appendage.
If you can read this, I forgot to post anonymously.
Sure, Java manages memory for you, but it's generally much easier to incorporate a garbage collector into C than it is to write java without file I/O.
Badass Resumes
Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java
Hmmm... one core per Java thread?
That sounds about right for Java apps...
Just my
Good for us. Get the rabble away from Slashdot. Only true nerds should understand the contents. Let me add a few keywords to get rid of the softies: monads, higher-order type systems, return type, genericity.
Your turn.
This still has limited applicability. Making a single data structure useful across hundreds of CPUs is impressive, but many problems can be more easily solved by using multiple structures - pipelining it, or using divide and conquer, or any of many other approaches.
The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).
With such applications you will often have a array of variables that are accessible by all threads (eg. current processing modes of the application).
To preserve the integrity of the system, you need to only allow one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.
The author uses the Load-Link/Store Conditional pair of instructions to guarantee that the new value is written to all locations. Load-Link loads the value from memory. Store-Conditional only writes the value back if no other write requests have been performed on that location, otherwise it fails.
Check-And-Set only replaces the variable with a new value if the value of the variable matches a previously read old value.
Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
He never said that Java had surpassed C in speed, he said that Java had surpassed C++. C++ library classes are not the same as C library classes, and many C++ libraries (especially the ones outside STL and Boost) are woefully under optimized. Java has many more optimized libraries "packaged in" with the language itself.
Second, neither the Doom or Unreal engines are multi-threaded. Java has threading support built into the language. To get the same with C you'd have to use POSIX threads (killing Windows compatibility) or the Windows threading API (killing POSIX compatibility). With Java, you don't have to make that choice.
We all know what to do, but we don't know how to get re-elected once we have done it
Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads, otherwise if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server since you quickly reach the maximum thread count while 99% of your opteron system is pretty much sitting there doing nothing.
Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures, but if you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant time operation (just update the reference to point to a new location), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are, it is the way that you are interacting with the collection.
Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.
I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists is 1 CPU cycle.
Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
Java is perfectly fast for real world applications, and I'd agree that the "Java is Slow" idea is outdated.
But it's not conclusively faster than C++, at least not in a general sense. In terms of a small task involving lots of simple operations, you'll still often see a significant speed increase using C++. This is a good example. Now I'm sure there's more optimizations available on both sides - and plenty of stuff to tweak - but C++ is going to come out ahead by a significant margin on a lot of these tasks.
A good example where the participants on both sides have some motivation is on TopCoder (where I spend a fair bit of time). Performance isn't usually the driving factor in language choice there - but sometimes it is, and when it is the answer is pretty much always C++ (unless it's a comparison between Java BigInteger and a naive implementation of the same in C++).
Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance.
To be clear - these are not usually "real world" tasks. As more memory and objects come into play, Java is going to do better and better. But these kinds of tasks still exist - there's still plenty of places where C++ is going to be the choice because of performance.
In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong.
Let's not stir that bag of worms...
He's built two working data structures and is working on a third (had to read the slides to figure that one out).
Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this.
Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks was LISP (I think it was Common LISP). I don't use LISP because I hate the syntax (and LISP fanboys regard it as a requirement of club membership that you like it). But the performance is quite impressive (with GC too), and that's what I would look into if I needed top performance from a high level language. Eat your heart out C++.
I use Java quite a bit, and the biggest drawback performance wise is the amount of memory chewed up by the GC schemes and JIT. It is otherwise quite fast. I often turn off JIT to save memory (at the expense of CPU), and I would like to see options to use a GC algorithm that minimizes memory consumption (at the expense of CPU).
I hereby appoint you official summary explainer!
Thanks
AT&ROFLMAO
Which is great and all, but what we usually need is more of a summary executioner.
Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
In the problems that TFA addresses, I'd wager that most of the time is spent in dissemination barriers of some sort, since invariably the problems in parallel computing move into issues within the problem domain (which is ideally what we want, after all).
As for a JIT being inefficient compared to a static optimizing compiler, it depends so much on the code in question and on the platform, as to not be something you can make blanket statements about.
Let's hear from some HPC researchers on this. Get some real numbers.
-fb Everything not expressly forbidden is now mandatory.
Someone posted the video and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive over-head. To hear Dr. Cliff Click talk it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself but when you see something elegant that seems natural afterward it's probably right.
The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago) and he's not getting bogged down with locking and synchronize... instead he has a very simple "think about it" approach that uses the state machine as a thinking aide. Whom ever posted the video... thank you that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?
[signature]
Locking in software has implications that locking at the hardware level does not.
If a thread locks in software, any subsequent thread must block, waiting for the first thread to finish. If the thread is preempted, then the waiting threads wait needlessly. If the thread dies, then the waiting threads are hosed.
Lock-free techniques prevent this problem, at the expense of more complicated algorithms and data structures. The basic structure of most lock-free algorithms is read a value, do something to it, and then attempt to commit the changed value back to memory. The attempt fails if another thread has changed the value from underneath you, and you must try again. (This is detected through operations like compare-and-swap.) This allows greater concurrency and guarantees that the system as a whole will make progress, even if a thread is preempted or dies.
Lock-free algorithms and data structures is a well established area. What Click has done here is provide a Java implementation of some data structures that yield good performance on the manycore systems his company makes.
These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.
That's not strictly true.
First, most lock operations do not require a full bus lock. All you have to do is to ensure atomicity of the load and store. Which effectively means you have to 1) acquire the cache line in the modified state (you're the only one who has it here), and 2) prevent system probes from invalidating the line before you can write to it by NACKing those probes until the LOCK is done. Practically this means the locked op has to be the oldest on that cpu before it can start, which ultimately delays its retirement, but not by as much as a full bus lock. Also it has minimal effect on the memory system. The LOCK does not fundamentally add any additional traffic.
Second, the way the value is propagated to other CPUs is the same as any other store. When the cache line is in the modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to all the CPUs requesting it, either invalidating or changing to Shared its own copy depending on the types of requests, coherence protocol, etc. If nobody else wants it, and it is eventually evicted from the CPU cache, it will be written to memory. This is the same, LOCK or no.
Third, an explicit mutex requires at least two separate memory requests, possibly three: One to acquire the lock, and the other to modify the protected data. This is going to result in two cache misses for the other CPUs, one for the mutex and one for the data, which are both going to be in the modified state and thus only present in the cache of the cpu that held the mutex. In some consistency models, a final memory barrier is required to let go of the mutex to ensure all writes done inside the lock are seen (x86 not being one of them).
Fourth, with fine enough granularity, most mutexes are uncontested. This means the overhead of locking the mutex is really just that, overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex variable for every element of your data array. This is wasteful of memory and bandwidth. Building your assumptions of atomicity into the structure itself means you use the minimal amount of memory (and thus mem bw), and have the maximal amount of concurrency.
So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.
The enemies of Democracy are
> Your turn.
:)
Catamorphisms. Linear Logic.
Back to you
Done with slashdot, done with nerds, getting a life.
So your argument is that with C/C++ you can make something almost as portable, almost as re-usable, almost as fast to write, almost as easy to debug, using a library selection that's almost as complete as Java. And in return you might gain a tiny bit of speed increase?
And you're also postulating that there's more likely to be a fully standards-compliant C++ compiler with a standard thread interface on all those esoteric machines of which you speak than a basic JVM?
I understand it's cool to hate Java on Slashdot, but sometimes I think the Java bashers are trying a little too hard.
E pluribus unum
Researchers don't work at the fringes of what can be done "more easily". They work at the fringes of what is currently possible.
Think about multi-tasking for a moment. If your problem is inherently sequential (e.g. everything is effectively sequentialised by a shared resource that can't be split, such as a piece of hardware), then you can use a single-process event loop. If your problem is inherently parallel (e.g. a web server), then you can use multiple forked processes. In otherwords: If your problem is easy, use an easy solution.
But not every job is easy. That's why we have multi-threading.
Similarly, these new data structures are designed to solve difficult problems.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
I don't necessarily take issue with what you've said (I've got you on my friend list, so I'm fairly sure that your one of the professionals on this site whose insight I love to read... I'm a software development major in college, so I usually add programmers to my friends list when I genuinely value their insights...), but I would like to point out that I've seen another approach to your first point about locking busses. The Linux kernel, IIRC, does a sideways and backwards cache flush to RAM before masking IRQs on an SMP system. I think it defers top half of interrupts to another CPU and doesn't queue the bottom halves. I might be mistaken on that, though. It then 'takes over' the CPU for the process/thread, and when it's done, it flushes all the caches once again.
But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
When performance doesn't matter much I'd use perl instead of Java, or some other higher level language (e.g. Python).
:).
;).
Of course when your boss/customer requires Java, use Java
I'm mixed on Lisp- Lisp is kind of high level and kind of not... I guess I'm just too lazy - perl has CPAN = lots of wheels I don't have to reinvent - and document
Amb. Calculus of constructions. Higher-order barbed bisimulation. Stochastic processes. :)
And you're cheating with your signature