Slashdot Mirror


Scalable Nonblocking Data Structures

An anonymous reader writes "InfoQ has an interesting writeup of Dr. Cliff Click's work on developing highly concurrent data structures for use on the Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java. The basic idea is to use a new coding style that involves a large array to hold the data (allowing scalable parallel access), atomic update on those array words, and a finite-state machine built from the atomic update and logically replicated per array word. The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well."

216 comments

  1. why by damn_registrars · · Score: 4, Interesting

    why are there fewer than 1 thread per core? It says 768 cores, but only 700 threads. Does it need the rest of the cores just to manage the large number of threads?

    --
    Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
    1. Re:why by felipekk · · Score: 1

      Funny, the article doesn't even mention that. Bad summary?

    2. Re:why by Chris+Burke · · Score: 4, Informative

      Because one is a general statement ("supports 700+ threads"), and the other is a statement about a specific hardware setup ("in production with 768 cores").

      It was not meant to imply that the 768 processor system will use exactly 700 worker threads. It was meant to imply that the system breaks through the traditional scalability limits of 50-100 threads, thus the 700+.

      --

      The enemies of Democracy are
    3. Re:why by Anonymous Coward · · Score: 0

      Because the article got it wrong?

      We routinely run codes with 10,000's of threads; I've tested it with 100,000 *runnable* threads - works great.

      Cliff Click

    4. Re:why by Anonymous Coward · · Score: 2, Informative

      his implementation is in Java and the JVM adds some of its own threads like threads for garbage collection, compiler threads etc. so, some of the compute goes towards those threads.

    5. Re:why by drspliff · · Score: 1

      Each Vega2 processor has 48 cores, 768 cores in just 16 processors is pretty good and you can be certain a number of those are reserved for system use on such a large-scale machine; these are already fairly lightweight hardware threads and I can only presume more hardware threads per-core and you'd get some serious I/O starvation issues.

      How I'd love to have one of these boxes :)

    6. Re:why by MarkEst1973 · · Score: 1

      Surely, 640 threads ought to be enough for anybody.

      But seriously, have you seen the price of this beast? HUGE price tag. Why not build your own cluster?

      There are technologies today (like, say, Terracotta Server) that allow for easy distribution of work across a large number of JVMs.

      I suppose the companies that need all those cores and threads in one machine can afford the Honkin' Big Iron. For the rest of us, clustering is getting cheaper and cheaper these days.

    7. Re:why by cheater512 · · Score: 1

      Or its a development system with limited production.
      Chances are it is being used for R&D and not actually crunching numbers.

      When systems like that hit mass production then we have the data structures for them.
      Cant do that with a cluster.

    8. Re:why by blitzkrieg3 · · Score: 1

      If all of the threads need to be able to update one huge data structure, then eventually you reach a point of no return. The fact that the threads need to lock the data structure every time they do an update causes a bottle neck. Conventionally, 700 threads are no better than 50-100 because the additional ~600 threads would be waiting for the chance to update the data structure. In this case it doesn't matter that they are on separate processors, since those 600 threads aren't using cpu time anyway. Each additional thread added after 100 will go straight to the wait queue, doing effectively nothing.

      What this technology allows is atomic updates of data structures, which means that it will not be necessary for a thread to acquire a lock before updating the data structure. Thus, 700+ threads perform better than 50-100.

    9. Re:why by tsalaroth · · Score: 2, Funny

      I, for one, welcome our fake Anonymous Coward Thread-count Overlords!

    10. Re:why by Ed+Avis · · Score: 1

      Have you seriously investigated the hardware costs of building your own cluster? With equivalent specs to this one?

      Hint: it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors.

      --
      -- Ed Avis ed@membled.com
    11. Re:why by amnezick · · Score: 0

      because the other 68 are taken by java itself trying to interpret itself

      --
      mov ax,4c00h
      int 21h
    12. Re:why by maraist · · Score: 5, Informative

      Message passing systems and MT systems solve different problems. Consider that Message Passing is a subclass of Multi-Processing; in general the amount of work is much larger than the data-set. But Multi-Threading often involves many micro-changes to a large message (the entire state of the process).

      Consider an in-memory database. (Mysql-cluster (NDB), for example). You wouldn't want to pass the entire database around (or even portions of it around) for each 'job'. Instead, you'd like at most only partitions of the data where massive working-sets reside on each partition and do inter-data operations. Then your message passing is limited to only interactions that aren't held in the local memory space (i.e. NUMA).

      With Terracotta you are breaking a sequential application into a series of behind-the-scenes messages which go from clustered node to clustered node as necessary (I'm not very well versed on this product, but I've reviewed it a couple times).

      Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. And thus, the more CPUs (that leverage MESI (sp?) CPU cache) the more efficient the overall architecture.

      Now, I can't imagine that a 768CPU monster is that cost effective - you're problem space is probably pretty limited. But a simultaneous 700 thread application is NOT hard to write in java at all. I regularly create systems that have between 1,000 and 2,000 semi-active threads. But I try to keep a CPU-intensive pool down to near the number of physical CPUs (4, 8 or 16 as the case may be). Java has tons of tools to allow execution-pools of configureable size.

      --
      -Michael
    13. Re:why by SanityInAnarchy · · Score: 1

      it's not about how many CPUs you have or how fast they are, it's how fast the interlinks are between processors. That depends how well your project scales to a cluster, then. SETI or Folding, for example, won't really care how fast the interconnect is.

      Of course, the same programming techniques used to build this are absolutely not going to scale to a cluster. Probably vice versa, but I'm not convinced of that yet.
      --
      Don't thank God, thank a doctor!
    14. Re:why by SanityInAnarchy · · Score: 1

      Consider an in-memory database. OK.

      Instead, you'd like at most only partitions of the data where massive working-sets reside on each partition and do inter-data operations. Got it. Can't find a link, but I'm thinking specifically the hashing mechanism. Given a key, I can find which node should be caching that key.

      Thus for certain problems that do not nicely break down into small messages, you are indeed limited to single-memory-space hardware. I'm not sure I've seen such a problem. For example, the CPU cache alone is an example of what happens when you break a problem down into smaller chunks.

      I can see where a single memory space might do better, though.

      a simultaneous 700 thread application is NOT hard to write in java at all. Once you know how, I suppose. Consider that most programmers who use threads find ways to deadlock on one or two cores.

      The reason I'm drawn to message-passing systems is that pretty much any higher-level abstraction is a Good Thing, as far as threads are concerned. I've come to believe that threads are as harmful as GOTOs. Sure, we'll use them under the hood, but we really need something more structured on top of them.

      Also: Message-passing and shared memory are not mutually exclusive. If the message is being passed between, say, two Erlang "processes" on the same machine, I see no reason the contents of that message need to be copied, even if those "processes" are in different OS threads.
      --
      Don't thank God, thank a doctor!
    15. Re:why by Anonymous Coward · · Score: 0

      All I got from TFA is lot of cores and concurrent access to some very simple (hash tables,arrays) data structures. No context no mention about the memory architecture (NUMA,SMP..etc) no reason to be particularly excited about any of it except for the insane number of cores.

      Speaking in very general terms academia is quite good at showing examples which highlight how their snazzy new algorithms will change the world and quite bad at understanding the practical downsides which will preclude the above from ever actually taking place.

    16. Re:why by RockDoctor · · Score: 1

      Surely, 640 threads ought to be enough for anybody.
      I have a slightly sickening image of Bill Gates as mill overseer, using his whip to drive the children back under the looms in some sort of dark, satanic mill. 640 threads at 20 threads/inch would give you about 32inch wide cloth, which is adequate for anyone I'm sure.

      Will someone please take the expected jokes about "patching" or "darning" over the 640 thread limit?

      --
      Birds are not dinosaur descendants;birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
    17. Re:why by vikstar · · Score: 1

      All I need is Microsoft Outlook to use more than one thread, so that it doesn't lock up the entire program if I decide to download one email (with attachements). Traditional 50-100?... tell that to Microsoft.

      --
      The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.
    18. Re:why by maraist · · Score: 1

      Believe me, I prefer message passing system. I'm not knocking them in any way. When given the chance, I reorganize code into stateless (rather self-contained, context-free) serializable messages - in the case that they might need to one day transport onto an alternate clustered node.

      However, I've recently had to deal with a problem that required a massive in-memory priority heap (hundred million DB records with 100k in-mem group/category descriptors). The best you could do is partition the data-set across multiple clustered nodes then write in synchronization routines which guarantee you're pulling the next highest priority item (either from a local node or a remote node).. But in our case the working set could fit in memory, and clustering would have been a lot of extra potential for bugs - far outweighing the probability of race-condition bugs (which are far worse than dead-locks, because they're silent but deadly).

      A pure DB solution was deemed far too slow for many common mass-data manipulation events. Not to mention the transient nature of the priorities.

      In this case, a trivial fail-over detection with a semi-idle slave node was sufficient. Semi-idle in the sense that there were truely clusterable activities that occurred on the same hardware (namely the real-time responses to the prioritized messages).

      --
      -Michael
    19. Re:why by visible.frylock · · Score: 1

      I've been trying to understand this. Been reading the slides and some of the source, but I still don't get what he's referring to.

      Correct me if I'm wrong. You're saying that the problem being treated is that, instead of locking an entire shared object, the atomic logic is pushed into the object itself so as to allow more granularity and thus less blocking? IOW, locking a specified portion of the structure rather than the entire structure?

      --
      Billy Brown rides on. Yolanda Green bypasses Gary White.
    20. Re:why by blitzkrieg3 · · Score: 1

      I think the idea is to use special purpose instructions within the processor to make operations that used to require multiple operations only require one.

      Simple example, lets say you had a data structure that was 128 bits long and every member depended on the other members being correct. Typically, an update to that structure would require you to lock the entire structure, then update it piecemeal in 32 bit blocks. What they are trying to do is get an sse-like instruction to update it atomically, thus making a lock unnecessary.

      Disclaimer: I haven't read the source code and am going completely on the article and slashdot comments.

  2. Re:Sounds great! by Anonymous Coward · · Score: 1, Informative

    It means that the atomic operation required for lock-free data structures are now available for Java from this vendor. Welcome to 1988!

  3. Inspiration... by green-alien · · Score: 5, Informative

    The compare-and-swap approach is backed up by academic research: http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-579.pdf [Practical Lock Freedom]

    1. Re:Inspiration... by Monkius · · Score: 1

      Indeed there is a lot of published research--by Herlihy, Michael, Fraser, Sundell & Tsigas, and others, going back to 2001. It's a hot topic for everyone trying to scale on modern hardware, certainly this custom Java processor thing Click works on seems way out in left field, for me.

      --
      Matt
  4. Google Talk by jrivar59 · · Score: 5, Informative

    Google Talk by the author.

    1. Re:Google Talk by kestasjk · · Score: 2, Funny

      I wish I could call myself "Dr. Click" :-(

      --
      // MD_Update(&m,buf,j);
    2. Re:Google Talk by Anonymous Coward · · Score: 0

      That's a Google Tech Talk, not Google Talk. </pedantic>

    3. Re:Google Talk by Anonymous Coward · · Score: 0

      I suspect these data structures are all just a hoax. It's just that all his colleagues had so much fun saying "right, Click".

  5. Java???? by maz2331 · · Score: 0, Flamebait

    700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

    Or... is this just a way to avoid having to get the really good coders who are more costly than the burn-bags?

    1. Re:Java???? by Anonymous Coward · · Score: 4, Insightful

      700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

      Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

    2. Re:Java???? by AKAImBatman · · Score: 3, Insightful

      700 threads in JAVA? Why not use C++

      Hmm... lemme think about that. Maybe because Java has decent threading support built into the language? Maybe because the platform is portable to any architecture? Maybe because the JVM can "optimize the hell" out of the running Java code far better than you could "optimize the hell" out of your C++ by hand?

      "Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore. Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.
    3. Re:Java???? by Kupek · · Score: 3, Informative

      Java has a well-defined memory model. C++ (and C) do not; behavior depends on the hardware it is run on.

    4. Re:Java???? by Anonymous Coward · · Score: 5, Funny

      700 threads in assembler? Why not use JAVA, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread.

      Or... is this just a way to avoid having to get the really, really good coders who are more costly than the burn-bags?

    5. Re:Java???? by Anonymous Coward · · Score: 0

      Oh dear.

      Having achieved a certain level of competence - this is an assumption of mine, it may be entirely wrong - it looks like you're resistant to change of any nature.

      In 5 years, let me know how being the COBOL programmer of our age - and less productive by far than everyone around you - is working out.

    6. Re:Java???? by Anonymous Coward · · Score: 0

      700 threads in JAVA? Why not use C++, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread. Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job; plenty-o-examples upon request) and the 768-way Azul box is 10x cheaper than the same cycles that can run C code (by which I mean: same count of MIPs in a cache-coherent shared-memory UMA machine)?

      soapbox-off.... but you speak like this is 1995 and are acting as-if Java hasn't moved forward in the last 13 years.

      Cliff Click
    7. Re:Java???? by Brian+Gordon · · Score: 1

      I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)

    8. Re:Java???? by Daniel+Dvorkin · · Score: 1

      Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore.

      Okay, I'll agree that well-written Java code is generally performance-competitive with compiled code, but this is a pretty sweeping assertion. Do you have any evidence for it -- or is it just a little too convenient that "no one even bothers" with benchmarks?

      --
      The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
    9. Re:Java???? by Viol8 · · Score: 1

      "Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job; plenty-o-examples upon request)"

      Sorry , remind me what language the JIT compiler is written in. Java is it? No , thought not.

    10. Re:Java???? by phasm42 · · Score: 1

      I don't think architecture portability is a concern when you're writing for a 768-core supercomputer :)
      You don't actually write code specific to a 768-core computer. The code doesn't need more than 1 core; the idea is to make it scale well to 768 cores.
      --
      "No one likes working in a hamster wheel, and your shop smells of cedar shavings from here." - TaleSpinner
    11. Re:Java???? by Anonymous Coward · · Score: 0

      What are burn-bags? I honestly have no idea. Logically, I can infer that they are not the "really, really good coders", since otherwise the "really, really good coders" would be more costly than themselves.

    12. Re:Java???? by famebait · · Score: 4, Insightful

      There is no way anything less than _really_ good coders would get something like this to work with any semblance of efficiency. If you still evaluate coders by which language they use, chances are you're not really that good a programmer.

      --
      sudo ergo sum
    13. Re:Java???? by Anonymous Coward · · Score: 0

      Yes thought so.

      http://jikesrvm.org/

    14. Re:Java???? by tppublic · · Score: 2, Interesting
      "Why not use C++"

      Umm, because Azul runs the Java in hardware. It *is* optimized by being in Java.

    15. Re:Java???? by pohl · · Score: 1

      It's all machine language down at the bottom. So what's your point?

      --

      The "cue the foo posts in 3, 2, 1..." posts will commence with no subsequent foo posts in 3, 2, 1...

    16. Re:Java???? by arthurpaliden · · Score: 1

      Actually I like being a COBOL consultant/programmer I get to work when I want and I get paid really big bucks. Lots of COBOL still out there in mission critical processes.

    17. Re:Java???? by Anonymous Coward · · Score: 0

      "Java is Slow" is a mantra that is easily 5+ years out of date. Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore.
      Hahaha, oh yes, of course! Do you honestly believe that one C-derived language that's compiled to your target architecture can somehow magically outperform another C-derived language compiled to the same target architecture by "a wide margin"? Java compilers aren't anywhere near as mature as C++ compilers, so excuse my scepticism.

      I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.

      Anyone repeating the "Java is Slow" mantra is merely branding themselves ignorant.
      Or they've suffered through some of these super-duper Java applications that will save the world.
    18. Re:Java???? by Anonymous Coward · · Score: 0

      Yes, it does. A memory model that they're apparently flagrantly violating, as the Java memory model is quite clear that while reads and writes are atomic, they're not global to all threads.

      Namely, if Thread A writes to a value foo and Thread B reads from that value, Thread B will never get half-stale data. It will always get either the value before Thread A set foo, or the value after Thread A set foo. But the Java language is also quite clear that there's no way to know WHICH value it gets. Even if the read in Thread B occurs after the write in Thread A chronologically, Thread B may be looking at stale data.

      To get around that, you need to use locks. Which make these "nonblocking data structures" whatcha-call it, blocking, as the only way to ensure that you get the latest value is to lock all access to the data structure and then flush the memory off the CPU cache and into main memory. (The "volatile" keyword causes all read/writes to volatile values to create such a lock and force such a flush.)

      So if they're really non-blocking, the magic is not happening in Java. It's happening somewhere else, either in a modified virtual machine or in code linked via JNI - in other words, C.

    19. Re:Java???? by famebait · · Score: 1

      700 threads in JAVA? Why not use C++,

      To get the work done and working correctly before the hardware is obsolete.

      --
      sudo ergo sum
    20. Re:Java???? by Reverend528 · · Score: 3, Funny

      Because the code is written faster in Java, runs as fast as C code can (because the JIT does an equivalent job
      Since when has writing code quickly ever been considered one of Java's strong points? Personally I'd take stdio over Java's alternative (file wrapped in a stream buffer wrapped in a buffered reader wrapped in an enigma) any day of the week.

      Sure, Java manages memory for you, but it's generally much easier to incorporate a garbage collector into C than it is to write java without file I/O.

    21. Re:Java???? by Anonymous Coward · · Score: 0

      That's why I tell everyone at work to stay away from Java. If you want performance it's assembly or nothing, why waste time with trade-offs?

    22. Re:Java???? by Hairy+Heron · · Score: 1

      But above you talked about how JIT compilers sucked. So if we go by the usual logic of X Java program sucks -> Java sucks, then clearly the language that the JIT compiler is written in must by extension also suck, no?

    23. Re:Java???? by Profane+MuthaFucka · · Score: 0, Troll

      And still, after that beautiful rant, Java is still slow.

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    24. Re:Java???? by DarthStrydre · · Score: 1

      Namely, if Thread A writes to a value foo and Thread B reads from that value, Thread B will never get half-stale data. It will always get either the value before Thread A set foo, or the value after Thread A set foo. It would be nice if this was true... And it is for most things. However per the language spec...

      17.7 Non-atomic Treatment of double and long
      Some implementations may find it convenient to divide a single write action on a 64-bit long or double value into two write actions on adjacent 32 bit values. For efficiency's sake, this behavior is implementation specific; Java virtual machines are free to perform writes to long and double values atomically or in two parts.
      For the purposes of the Java programming language memory model, a single write to a non-volatile long or double value is treated as two separate writes: one to each 32-bit half. This can result in a situation where a thread sees the first 32 bits of a 64 bit value from one write, and the second 32 bits from another write. Writes and reads of volatile long and double values are always atomic. Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.

      VM implementors are encouraged to avoid splitting their 64-bit values where possible. Programmers are encouraged to declare shared 64-bit values as volatile or synchronize their programs correctly to avoid possible complications.

    25. Re:Java???? by raddan · · Score: 1

      700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread. Good idea, but-- assembly language is way too high-level. Why not take it a step further and just give programmers front panel switches? Bonus: you save money on keyboards!
    26. Re:Java???? by Jherek+Carnelian · · Score: 1

      700 threads in C++? Why not use assembler, actually optimize the hell out of the code, and get it down considerably. Or get a lot more done per thread Because the hardware platform his company sells actually has more than 700 cores - that means 700 simultaneous threads - not sequential, simultaneous.

      When you have a highly parallel system, it is counter-productive to try to make the workload sequential.
    27. Re:Java???? by Kupek · · Score: 1

      Even if the read in Thread B occurs after the write in Thread A chronologically, Thread B may be looking at stale data. If Thread B has stale data, then it's commit (through a compare-and-swap or similar atomic operation) should fail, and the thread should try the operation again.

      Take a look at the algorithms in the presentation. I've done a fair amount of lock-free programming, but all of it was in C. His algorithms look like standard lock-free practices, but since I'm not familiar with what correct lock-free Java code looks like, I can't say with confidence.

      I will, however, say that I would be surprised if Cliff Click made such the obvious error you assume he is making. I saw him talk at ISMM 2006, and I trust him more than random people on slashdot.

    28. Re:Java???? by JMZero · · Score: 5, Insightful

      Java is perfectly fast for real world applications, and I'd agree that the "Java is Slow" idea is outdated.

      But it's not conclusively faster than C++, at least not in a general sense. In terms of a small task involving lots of simple operations, you'll still often see a significant speed increase using C++. This is a good example. Now I'm sure there's more optimizations available on both sides - and plenty of stuff to tweak - but C++ is going to come out ahead by a significant margin on a lot of these tasks.

      A good example where the participants on both sides have some motivation is on TopCoder (where I spend a fair bit of time). Performance isn't usually the driving factor in language choice there - but sometimes it is, and when it is the answer is pretty much always C++ (unless it's a comparison between Java BigInteger and a naive implementation of the same in C++).

      Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance.

      To be clear - these are not usually "real world" tasks. As more memory and objects come into play, Java is going to do better and better. But these kinds of tasks still exist - there's still plenty of places where C++ is going to be the choice because of performance.

      In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong.

      --
      Let's not stir that bag of worms...
    29. Re:Java???? by neuromancer23 · · Score: 1

      Thanks for the awesome advice. x86 assembler is clearly the obvious choice for anyone writing a web or desktop application.

    30. Re:Java???? by cheater512 · · Score: 1

      Well the AVR32 architecture does support some Java byte codes.

      Mind you the manual does specifically state that you shouldn't use them inside interrupts. :)

    31. Re:Java???? by freedumb2000 · · Score: 1

      I just pictured 700+ worker(-threads) in front of a huge panel with 700+ switches. I think that is where assembly(-line) coding came from.

    32. Re:Java???? by fishbowl · · Score: 2, Informative


      In the problems that TFA addresses, I'd wager that most of the time is spent in dissemination barriers of some sort, since invariably the problems in parallel computing move into issues within the problem domain (which is ideally what we want, after all).

      As for a JIT being inefficient compared to a static optimizing compiler, it depends so much on the code in question and on the platform, as to not be something you can make blanket statements about.

      Let's hear from some HPC researchers on this. Get some real numbers.

      --
      -fb Everything not expressly forbidden is now mandatory.
    33. Re:Java???? by mikael · · Score: 1

      They develop Java hardware for internet applications . Presumably this is for commercial organisations where it is more important to get a functional application designed and assembled than to design/test and implement a completely optimised C++ application.

      --
      Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
    34. Re:Java???? by billcopc · · Score: 1

      As much as I love assembler, it's an offer vs demand thing. Very few people are actually proficient at assembler coding, but a crapload of morons can write a thousand lines of useless Java code a week, and they're willing to work for peanuts. Assembler guys want big bucks, mostly because their rarity gives them that power, along with the obscurity of the art.

      Let's face it: there are some low-level things you just can't do in any language, and asm coders know it! If it's only a speed issue, it's usually cheaper to throw more hardware at the problem and let the cheap idiotic factory coders chew on it.

      --
      -Billco, Fnarg.com
    35. Re:Java???? by nih · · Score: 1, Funny

      700 threads in JAVA? Why not use Brainfuck, actually optimize the hell out of...no wait, fs my head hurts now:(

      --
      I'm a rabbit startled by the headlights of life :(
    36. Re:Java???? by Anonymous Coward · · Score: 0

      Just imagine not using Java, but a decent and fast programming language instead and getting the same performance out of your application on a NORMAL system. Really, all trolling/hatred aside, it always pisses me of royally to see someone provide near-supercomputing setups to accelerate something which is INHERENTLY SLOW. Solve the actual problem instead and get Java out of the equation.

    37. Re:Java???? by Anonymous Coward · · Score: 0

      ERLANG should be great on this machine and it is very simple to scale an ERLANG system to use many cores, without worrying about side effects and such due to its functional nature.

    38. Re:Java???? by fpgaprogrammer · · Score: 1

      Software is for noobs. Consider Verilog.

    39. Re:Java???? by AKAImBatman · · Score: 1, Interesting

      Your link says: "Unfortunately, there's also a third conclusion. It seems that it's much, much easier to create a well performing program in Java. So, please consider it for a moment before you start recoding your Java program in C++ just to make it faster..."

      In any case, the author of that test failed. The test includes start-up times in the results. Java is always going to lose micro-benchmarks where start-up time is included. Why? Because the JVM load gets included in the benchmarks. As anyone who knows Java well can tell you, Java's startup time is extremely costly. Once the JVM is running, however, it produces much better performance. The end result is that the startup cost will eclipse the actual benchmark being run.

      I'm willing to bet you'll have a hard time finding a modern benchmark for Java that doesn't make that (simple!) mistake.

      However, I will admit that I was overextending my reach by suggesting that no one runs benchmarks anymore because of the strong performance of Java. The reality is (as one might expect) much more complex. It's not that Java performed poorly in benchmarks. Quite the contrary. By 2004, Java was cleaning up in benchmarks. The problem is that micro-benchmarking sucks. All you end up getting is the two sides fighting over who can produce the most optimized benchmark possible. Which invariably ends up defeating the purpose of benchmarking for "real-world" code.

      The long and short of it is that Java was proven to be more than fast enough, with the average case easily keeping pace or beating C++ code. Thus the statement by the author of the article you linked to:

      "Unfortunately, there's also a third conclusion. It seems that it's much, much easier to create a well performing program in Java. So, please consider it for a moment before you start recoding your Java program in C++ just to make it faster..."

    40. Re:Java???? by hattig · · Score: 1

      If the problem is inherently slow, then the speed of the language really isn't the issue. Most of the time will be spent waiting on network I/O or similar. In this case it's better to have more understandable code and a more verbose network protocol (e.g., using XML) than it is to optimise such a process away.

      For the vast number of places that Java is actually programmed, it is a very good choice of language because (a) it's well known, (b) it has loads of third party libraries, and (c) it performs good enough, threading is easy so you can just add cores to your solution, and so on.

      These are probably the same people that think that Java + Struts/Spring/Hibernate is hard and slow, and then try and reinvent the wheel in PHP.

    41. Re:Java???? by DarkOx · · Score: 1

      Because the cost of developing your code in ASM would be a gigantic leap in terms of time and provide little in terms of gain over the output of a good compiler. It would also destroy any potential for reuseability on other platforms.

      The parent on the other hand I can agree with, Java development is not much faster then C++ provided you are working on more esoteric things and can't take advantage of the huge library selection. The syntax is formal patterns are mostly the same after all. Developing a massively parallel application targert at multi-hundred core boxen most likely implies you are doing something rather esoteric. If you are going to make the investment in massive and specialized hardware it would seem to me you would also want to make the 'slight' extra investment and write a C++ or C application rather then a Java one. You can with a little thought up front produce fairly hardware agnostic C and C++ so you don't really lose the portability. Infact given we are talking about fairly specialed hardware and software platforms here there may not exist a usuable JVM for the next platform you want to move onto, so C or C++ might be more portable.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    42. Re:Java???? by Anonymous Coward · · Score: 0

      Because Java has a very well defined memory model, that describes how threads interact through memory.AFAIK, there is no similar definition for C++ or C. It depends on the underlaying hardware

    43. Re:Java???? by JMZero · · Score: 1


      The test includes start-up times in the results.

      This may be a significant factor for some of his smaller tests, but I don't think it jeopardizes the overall conclusion. For example, the Ackerman test was 15s vs. 60s. There isn't going to be more than a second of startup, and most of the tests are showing differences on the order of many seconds.

      In terms of "is it harder to write fast code in C++", I guess it depends on what you're good at. Looking at the changes he did to the C++ code, none of it is esoteric - but I'm also more interested in this kind of thing than most of the programmers I know. I think it's fair to say many coders would produce faster code for these functions in Java than C++.

      All in all, it's a subject that probably gets more attention than it deserves. Like you say, Java is plenty fast. Whenever I see performance problems at work, I never think "Ooh, let's rewrite this in a lower-level language" (and very seldom would that help). I think, let's look at the SQL, look at the bandwidth, look for other resource conflicts, or change overall approach.

      --
      Let's not stir that bag of worms...
    44. Re:Java???? by Jerry+Coffin · · Score: 1

      The long and short of it is that Java was proven to be more than fast enough, with the average case easily keeping pace or beating C++ code.

      More accurately, Java is usually more than fast enough, even though it's nearly never as fast as equivalent C++. When you get down to it, performance rarely matters much.

      --
      The universe is a figment of its own imagination.
    45. Re:Java???? by shish · · Score: 1

      I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.

      What's absurd about one bit of native binary code outperforming a different bit of native binary code? o_O

      --
      I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    46. Re:Java???? by SanityInAnarchy · · Score: 1

      Except that the whole reason you'd use one of these, instead of a much cheaper cluster of commodity hardware, is because you want to use shared memory and threads. There are problems which scale much better to shared memory than to shared-nothing -- or so I'm told.

      Oh, and Erlang is wannabe-functional. Go play with Haskell if you want a purely-functional language. No side effects means, among other things, the ability to have the parallelism done for you, instead of having to explicitly spawn a thread -- or an Erlang "process".

      Me, I'm working on a cute wrapper for Ruby objects to turn them into Erlang-style threads.

      --
      Don't thank God, thank a doctor!
    47. Re:Java???? by loubs001 · · Score: 1

      Because as the article says, Java has a well defined and correctly implemented memory model that provides certain essential guarentees about ordering and memory consistency when dealing with high levels of concurrency. C++ does not (in its current form) have any well specified memory model, making his techniques impractical, if not impossible, since the behaviour is unpredictable. To acheive concurrent data structures in C++, the only way to do it safely, correctly and in a portable way, is to use locks. For this reason, you can expect to get much better performance in highly concurrent environments with Java with than you would get with C++.

      The next version of C++ in development, is trying to provide such a memory model, based on Javas.

    48. Re:Java???? by Profane+MuthaFucka · · Score: 0, Troll

      Let me characterize the person who modded me down:

      1) He has a fancy computer that he bought at a department store, but has never cracked the box open. Check!
      2) He went to the local IT Professional Outfitters store and dropped $1500 on shoes, special pants, a shirt with weird zippers and pockets everywhere, harnesses, backpacks, laptop carts, mouse pads, and USB coffee warmers. All are colored hot fluorescent pink and glow under UV light. Check!
      3) He went to the bookstore and has the Java for Idiots, Java for Dummies, Learn Java in 21 Days, and 3D game programming in Java books. Check!
      4) He downloaded Safari and installed it onto his spanking new Vista machine. Check!
      5) He's got a JVM. Installed it from a CD from one of his books. Check!
      6) He has a 2 year degree from De Vry. Certificate on the wall. Check!
      7) He subscribed to all the Ziff Davis publications he could find. Check! Check! Check!

      Yessir! That motherfucker's a PROGRAMMER!

      --
      Fascism trolls keeping me up every night. When I starts a preachin', he HITS ME WITH HIS REICH!
    49. Re:Java???? by samkass · · Score: 2, Insightful

      So your argument is that with C/C++ you can make something almost as portable, almost as re-usable, almost as fast to write, almost as easy to debug, using a library selection that's almost as complete as Java. And in return you might gain a tiny bit of speed increase?

      And you're also postulating that there's more likely to be a fully standards-compliant C++ compiler with a standard thread interface on all those esoteric machines of which you speak than a basic JVM?

      I understand it's cool to hate Java on Slashdot, but sometimes I think the Java bashers are trying a little too hard.

      --
      E pluribus unum
    50. Re:Java???? by A+Numinous+Cohort · · Score: 1

      This thread discusses the merits of the STM features of Clojure, a JVM language. It's another way to do lock-free concurrent programming on the JVM, using appropriate data structures.

    51. Re:Java???? by Anonymous Coward · · Score: 0

      Are you joking? True garbage collection for C is a nightmare because you can not identify all of the pointers properly. Figuring what in memory is no longer needed, and following/updating all pointers pointing to it is impossible.

      Hint: You can use integers to hold pointers...

    52. Re:Java???? by mishagam · · Score: 1

      From what I seen on Topcoder, C++ is preferred to Java mostly because it allows more compact coding using preprocessor. Generally I never seen any example that the same algorithm would run in 2 sec in C++, but more that 2 sec and so failed in Java. Also notice that all Topcoder application is written in Java.

    53. Re:Java???? by TheLink · · Score: 2, Interesting

      When performance doesn't matter much I'd use perl instead of Java, or some other higher level language (e.g. Python).

      Of course when your boss/customer requires Java, use Java :).

      I'm mixed on Lisp- Lisp is kind of high level and kind of not... I guess I'm just too lazy - perl has CPAN = lots of wheels I don't have to reinvent - and document ;).

      --
    54. Re:Java???? by smellotron · · Score: 1

      I'm not even going to bother with the absurdity of the idea of JIT bytecode outperforming a compiled language on any architecture.
      What's absurd about one bit of native binary code outperforming a different bit of native binary code? o_O

      Well, nothing... but based on the GP's usage of the phrase "JIT bytecode", my best guess is that he's talking about Python (compiles source code to bytecode "JIT") as opposed to Java (compiles bytecode to machine code JIT).

    55. Re:Java???? by AnyoneEB · · Score: 1

      If speed if your concern, why would you switch from Java to C++ instead of just switching from JRE to gcj native? Is there really a significant difference between gcj and g++ in terms of speed? I guess Java has bounds checks which are probably not completely optimized out, but they could be disabled with --no-bounds-check.

      --
      Centralization breaks the internet.
    56. Re:Java???? by FlyingGuy · · Score: 1

      Call me old fashioned, but ummmmm, you malloc() it , you free() it.

      I mean when did this whole notion of malloc() it and forget it come to pass? Did Ron popeil suddenly start writing programming languages or something? I don't care how complex your code is, at some point that bit of memory is no longer required and at that point it is your job as the programmer, to clean up after yourself and free() it, no?.

      --
      Hey KID! Yeah you, get the fuck off my lawn!
    57. Re:Java???? by RAMMS+EIN · · Score: 1

      I think the deal is that, ultimately, lower-level languages (assembly is lower than C, which is lower than C++, which is lower than Java) allow you to gain better performance than you could achieve with higher-level languages.

      Whether you will actually achieve the highest possible performance depends on whether you actually write the most efficient code possible. This, in turn, depends on your skills and the time you have to complete the project.

      Thus, what you will see in practice is that there is a lot of variance in the performance of programs written in lower-level languages, and less variance in the performance of higher-level languages. Libraries also have a huge influence here: the more code you use from a library, the faster you will complete the project and the less variance there will be between your program that uses that library and someone else's program that uses that library.

      The final result will probably be that the fastest programs are in the lowest level languages (provided these programs exist at all), but, largely, the speed scales will overlap. It may even be that the slowest programs in a lower-level language are slower than the slowest programs in a higher-level language.

      Thus, it is not a priory obvious that lower-level programming languages will yield faster programs than higher-level ones.

      For another data point, have a look at OCaml. With garbage collection, polymorphism, classes and objects, and type inference, it's definitely a high-level language. In my experience, it's much nicer to program in than Java, and the implementation offers run-time performance that is about on par with Java and often a lot better.

      --
      Please correct me if I got my facts wrong.
    58. Re:Java???? by SanityInAnarchy · · Score: 1

      Not entirely sure what that has to do with my post, but thanks, that's interesting.

      I'm also not entirely sure I like transactions yet. Haven't thought hard enough about it, yet.

      --
      Don't thank God, thank a doctor!
    59. Re:Java???? by Dr.Ruud · · Score: 1
    60. Re:Java???? by thomas.galvin · · Score: 1

      Since when has writing code quickly ever been considered one of Java's strong points? Personally I'd take stdio over Java's alternative (file wrapped in a stream buffer wrapped in a buffered reader wrapped in an enigma) any day of the week. Yeah, Java's file I/O is kind of... verbose. A lot of things in Java are. But that's why pretty much everyone who codes in Java has some utility class with some version of writetoFile( byte[] data, File file ). Solve the problem once and forget about it. To me, that's less onerous than trying to figure out which of the thirteen thousand different implementations of a string class someone is using, but maybe that's just me.

      Though this is one of the things that they should look into fixing in newer versions of Java. They really need to go ahead and make a version that is extensively similar to, but not necessarily backwards-compatible with, previous version, that is more internally consistent and less verbose. I've been coding Java for almost a decade, and I still have to stop and think about whether I need to call size or length.
    61. Re:Java???? by The_reformant · · Score: 1

      I wrap my enigma in a factory class

      --
      I have discovered a truly remarkable sig which this post is too small to contain.
    62. Re:Java???? by JMZero · · Score: 1

      You're correct in that people don't generally choose C++ based on performance. But it does happen that non-C++ coders will switch for certain problems.

      As an example, As here is Kawigi (a fairly high level competitor who usually uses Java), switching to C++ so he can brute-force a problem instead of using DP.

      Notwithstanding that, all TopCoder problems (well, all correct ones) have a Java solution that runs in under 1 second - as that's the requirement on the reference solution. Where you'll see switches to C++ is to make other approaches feasible.

      --
      Let's not stir that bag of worms...
    63. Re:Java???? by ToasterMonkey · · Score: 1
      I don't know, doesn't much of Java's optimization come from being able to profile a running program continuously and recompile the 'hotspots' on the fly? Isn't the HotSpot VM _exactly_ what lead to claims of Java being as fast as C++ in the first place? The author of the benchmarks doesn't mention or seem to be aware of this.

      Wouldn't a more objective comparison show Java's throughput over time while running these tests? The way he munges the source bothers me as well. Instead of subjectively modifying the code so it seems 'fair', shouldn't the tests include little or no calls to standard libraries or other classes? Or, if you are going to compare those things, do a baseline as I suggest, then call the following tests what they are.. a Vector vs. HashMap or whatever comparison.

      Reasonably often you'll see people write an initial solution in Java, find it runs a bit too slow, and quickly port it to C++ (or pre-emptively switch to C++ if they think they'll be near the time limit). It's not uncommon to see a factor of two difference in performance. Where do you see this? The 2x gains and 'quick' Java->C++ ports that is, not the preemptive bit, I could care less about all this subjective garbage.

      In any case, your contention that Java is so much faster that nobody does benchmarks anymore is unsupported and wrong. That benchmark you provided is crap.

      Look, I'm just picking on you because it feels like you're comparing two cars by the HP number on the sticker instead of looking at a full dyno chart :P

    64. Re:Java???? by Lodragandraoidh · · Score: 1

      To expand on your comment, I would say there are certain tasks that are suitable for all the languages listed (and then some):

      Assembly - embedded systems/drivers for a given platform.

      C - larger systems programming tasks that need to be portable -- too big to be effectively built in Assembly without going crazy, too small for C++; you can embed Assembly instructions in C as needed - so you can have your cake and eat it too.

      C++ - for complex systems requiring near-real-time response (e.g. multiplayer video game).

      Ada - for complex systems requiring real-time control (e.g. avionics) without error.

      Java - for complex systems not requiring real-time control or response (e.g. GUI Office Suite that interfaces to user via http).

      Python - for simple or complex systems not requiring real-time control - but where speed of deployment/refactoring is more important than processing speed (e.g. embedded control language, or specialized control/reporting interface tied to a changing back-end system).

      It is thus irrelevant to talk about 'java is faster than C++' or any similar argument, without including a complete evaluation of the factors involved in a given project. The factors would include, the time required to complete the project, as well as performance factors.

      We can certainly agree that building a GUI http interface using Assembly would not be optimal, given the ease of using Java to do the same thing. Of course, there may be an instance when it would be more desirable (perhaps building it in an embedded system with severe limitations on memory or performance).

      As a developer, you should be able to reach into your bag and pull out whatever tool is best suited for a given job. One-trick ponies end up in the glue factory.

      Religious wars are counter productive. We should frame the discussion to examine the practicalities of a given set of projects.

      --

      Lodragan Draoidh
      The more you explain it, the more I don't understand it. - Mark Twain
    65. Re:Java???? by Funks · · Score: 1

      I'm pretty sure a large majority of the users for this technology are "FINANCIAL" customers, where latency matters. Stock trading companies come to mind - requires very high performance / w low latency. Yes, alot of them write their software in Java (not C++)

    66. Re:Java???? by JMZero · · Score: 1

      Where do you see this?

      I posted one example, and it would easy for you to find others. For myself, I don't feel any particular obligation to further establish this.

      As to the guy's benchmarks, they seem fairly reasonable to me. I don't think it's unreasonable to include things like vectors and hashmap implementations when talking about the speed of a language - and the benchmarks this guy was countering did the same thing.

      You're also confused on my motivation, I think.

      I don't hate Java. I was arguing against the idea that it's so much faster than C++ that nobody measures it anymore. Even if Java would outpace C++ if it was let to run for longer or something, we'd still be left with the conclusion that there are some tasks C++ is faster for (which is all I was trying to establish).

      Do you want to disagree with that?

      --
      Let's not stir that bag of worms...
    67. Re:Java???? by vyrus128 · · Score: 1
      "Now remember, kids:
      IF you malloc, SET it free!"

      ;-)

    68. Re:Java???? by FlyingGuy · · Score: 1

      "When I find myself in times of trouble, Dennis Richie comes to me, speaking words of wisdom, code in C, C, C, Ceeeeeee".

      With apologies to so many.....

      --
      Hey KID! Yeah you, get the fuck off my lawn!
    69. Re:Java???? by ToasterMonkey · · Score: 1

      As to the guy's benchmarks, they seem fairly reasonable to me. I don't think it's unreasonable to include things like vectors and hashmap implementations when talking about the speed of a language - and the benchmarks this guy was countering did the same thing. _All_ the benchmarks I've seen look like crap. Hypothetically speaking, _if_ the run time compiled/optimized code could theoretically outperform statically compiled/optimized code, but the overhead of a more robust standard library implementation hampered it's performance, wouldn't you be a _little_ curious of that? I think you and I both know how complicated and/or impossible a question like "Is Java faster than C++" is, so how could ten or fifteen little micro benchmarks answer it?

      You're also confused on my motivation, I think. I really don't care what your motivation is/was. I'm after facts.

      It's not uncommon to see a factor of two difference in performance. Is it common to see unbiased, scientific comparisons of the two running equivalent, real world tasks?
      Has it been done?
      Do we have anything besides junk micro-benchmarks? Is that just a phrase that's been repeated for years on the internet?
      I'm not kidding with you. When I hear stuff like that, I can't help but ask for details.

      I don't hate Java. Um... good? It's pretty silly to get all emotional over programming languages.

      I was arguing against the idea that it's so much faster than C++ that nobody measures it anymore. Which dumbass said that? Are you sure the idea wasn't that it's often fast enough where it's not worth the effort to port to C++ just to find out?
      But, if someone has, I'd be interested to know.

      Even if Java would outpace C++ if it was let to run for longer or something, we'd still be left with the conclusion that there are some tasks C++ is faster for (which is all I was trying to establish). Well, you're right, there are tasks that C++ can perform better than Java. Why didn't you just say that? ;)
      In the end, both can produce well optimized code, but I haven't read anything suggesting that the structure of either language enables fundamentally better optimizations than the other.
    70. Re:Java???? by JMZero · · Score: 1

      Are you sure the idea wasn't that it's often fast enough where it's not worth the effort to port to C++ just to find out?

      Here's the part of the original post I was replying to:

      Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore

      I suppose one could read that in a number of lights, but it seemed a bit off to me.

      --
      Let's not stir that bag of worms...
    71. Re:Java???? by alexo · · Score: 1

      Java surpassed C++ performance many years ago, and by such a wide margin that no one even bothers running benchmarks anymore.

      Do you have definite proof of that categorical statement or are you talking out of your ass?

      In general, I have found that most benchmarks in which Java surpasses C++ in speed were done by people who have either poor understanding of C++ (using sub-optimal constructs) or poor familiarity with the compiler (using sub-optimal optimization options).

      Here's what I found:
      • Przemyslaw Bruski in his take on the Java vs C++ benchmark writes:

        When looking at C++ code, I've noticed many performance problems, which probably may go unnoticed for people with strong Java background. These problems were not present in Java code. In other words, to make this benchmark fair, I had to make some modifications to the original code. In doing so, I tried to make the code as close to the original as possible, even if my personal coding style is completely different. The results?

        On Intel, all tests but two ended with C++ being better. Overall, C++ was twice as fast as Java.
        On Athlon, every individual test ended with C++ performing better than Java. Overall, C++ was three times faster than Java.

      • Nikolaus Gebhardt in his blog entry about C++ vs. Java 1.6 - A fair Benchmark writes:

        Knowing a bit about the C++ and Java optimizers, I was also able to revert the results completely. For example although using equivalent Java and C++ code, I was able to make the C++ version take 16 seconds for the nested loops-test while Java only needed 1 second by adding some virtual method calls which I knew Java could optimize. I also was able to make the C++ version run in only 78 ms while the equivalent Java version needed even 10 seconds by only using static loops - the C++ optimizer then would not even loop anything and just add the values together. This is partially what some of the wrong benchmarks out there are doing - even if not intentional. The results?

        To make it short: As I expected C++ outperforms Java in every single test. But interestingly Java is getting closer now

      • The Debian Computer Language Benchmarks Game shows that, on average Java is 2 to 2.5 times slower than C++.

      To summarize:
      1. That when benchmarking an algorithm in a programming language, you have to use the most efficient implementation for the language you are testing.
      2. That it does not matter that language X is slower than language Y as long as it is fast enough for the task at hand.
      3. That AKAImBatman was talking out of his ass.

      Best regards,
      Alex
  6. 768 Cores? by Shadow+Wrought · · Score: 2, Funny
    Call me a CPU luddite, but does this mean that I can lose 768 games of Solataire simultaneously?

    --
    If brevity is the soul of wit, then how does one explain Twitter?
    1. Re:768 Cores? by jlechem · · Score: 3, Funny

      Or better yet actually run Windows Vista. Zing!

      --
      Hold up, wait a minute, let me put some pimpin in it
    2. Re:768 Cores? by shervinemami · · Score: 1

      Do you seriously think Vista could handle 768 processors?

    3. Re:768 Cores? by _xeno_ · · Score: 1

      Sorry, but no: the Vista license only allows for using up to two cores. If you want to run Windows Vista today, your best bet remains liquid nitrogen cooling while overclocking.

      --
      You are in a maze of twisty little relative jumps, all alike.
  7. Re:Sounds great! by Linker3000 · · Score: 4, Funny

    1988? Atomic?

    Is this something to do with a Blondie tour?

    --
    AT&ROFLMAO
  8. Google Tech Talk by Bou · · Score: 4, Informative

    Click gave a Google Tech Talk last year on his lock-free hashtable as part of the 'advanced topics in programming languages' series. The one hour talk is available on Google Video here: http://video.google.com/videoplay?docid=2139967204534450862 .

  9. Re:Sounds great! by moderatorrater · · Score: 4, Funny

    Before, data structures would only perform well in 50-100 threads. With this work, he has it up to over 700 threads, but it hasn't been load tested yet. There's a good chance that he's on the forefront of the next generation of data structures, there's a good chance that his work will be included in the java core (although that's not saying much considering).

  10. scalable noNBLocking data sTRructures .. :) by rs232 · · Score: 1, Insightful

    Congrads Slashdot, you've managed to produce a story that is guaranteed to totally baffle the non-techie sector.

    KeyWords:

    concurrent data structures, hardware threads, java, large array, scalable parallel access, atomic update, words, finite-state machine, lock-free, data structures ...

    --
    davecb5620@gmail.com
    1. Re:scalable noNBLocking data sTRructures .. :) by Seferino · · Score: 5, Interesting

      Good for us. Get the rabble away from Slashdot. Only true nerds should understand the contents. Let me add a few keywords to get rid of the softies: monads, higher-order type systems, return type, genericity.
      Your turn.

    2. Re:scalable noNBLocking data sTRructures .. :) by Anonymous Coward · · Score: 0

      Ahem... OMG Pink Ponies!
      What?.. What do you mean I fail!?

      Noooo! Give me back my geek card!

    3. Re:scalable noNBLocking data sTRructures .. :) by Anonymous Coward · · Score: 0

      Your turn. Sex, drugs, rock'n'roll!
    4. Re:scalable noNBLocking data sTRructures .. :) by Anonymous Coward · · Score: 0

      I hope that Azul system with 700+ cores performs just as well with Erlang, now Erlang is the way to go not java. Erlang = pure message passing, no side effects (unless youre doing process dictionary, or ets), and you can have billions of parallel processes, distributed over networks if you wish...

    5. Re:scalable noNBLocking data sTRructures .. :) by Seferino · · Score: 1

      I'm curious. I mean, I've used Erlang but not for industrial work. While Erlang itself is fully message-passing/functional, its libraries are full of side-effects. Does anyone know what actual implementations use ?

    6. Re:scalable noNBLocking data sTRructures .. :) by Anonymous Coward · · Score: 0

      "monad"? Singular of "gonad"?

  11. false sharing? by shrimppoboy · · Score: 2, Interesting
    The brief description in the article sounds suspicious and incompetent.
    1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.

    2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.

    If it is not an inner loop then the locking is probably not a big issue any way.

    3. Java, really, Java?

    1. Re:false sharing? by Chirs · · Score: 1

      If the data set is much larger than the number of cpus, then it may be possible to arrange things such that the likelihood of two cpus hitting the same cacheline is pretty small.

      As for Java, in the article Dr. Click says it has a well-understood and well-implemented memory model.

    2. Re:false sharing? by julesh · · Score: 2, Informative

      The brief description in the article sounds suspicious and incompetent.
      1. A common killer in parallelization is false sharing. That is, threads on two processors fight over a cache-line even though they are accessing independent variables. A cache-line is typically bigger than an individual variable. The approach of using adjacent elements of an array for parallelism sounds naive. One needs to pad the array.


      The keyword in your statement is "typically". Click is working on the Azul processor, which is designed from the ground up for this kind of task. While I haven't been able to find details of its cache organisation, I'd say it's pretty safe money that it uses smaller than average cache lines. The hashtable structure described uses the contents of the array in pairs, i.e. 64 bits at a time. If each cache line were 64 bits wide, this would be efficient, no?

      2. Updating a shared variable, especially a non-scaler, in an inner loop is naive. One should reference local scalers in inner loops when parallelizing. Just once, should the thread update the shared variable. Don't reference non-scalers or shared variables in an inner loop. Don't lock in the inner loop, either, if you can avoid it.

      What if the bottleneck of your application is referencing shared data that needs to be updated in real time? There are many applications for which this is the case, and this kind of work is useful for anyone who is working on such an application. Just throwing an example out there: the scheduler for a massively parallel operating system would typically need to have many cores referring to a single queue of waiting threads with a high chance for collisions between read and write operations. I'm sure there are plenty more, too.

      3. Java, really, Java?

      Yes. Java. Really. Java is probably the most important language at the moment for enterprise application development, which is the environment where this kind of issue most frequently occurs, so developing solutions to these problems that work in Java is particularly important. Azul, Click's employer, specialises in high performance computers optimized for running Java software.

    3. Re:false sharing? by elFarto+the+2nd · · Score: 1

      And lets not forget Cliff Click's previous job, working for Sun on the Hotspot VM.

      Regards
      elFarto
  12. Geek serendipity in a summary by ThreeGigs · · Score: 3, Funny

    and a finite-state machine built from the atomic update and logically replicated per array word.


    Now *that* is what I call geek speak.
    1. Re:Geek serendipity in a summary by rrohbeck · · Score: 1

      and a finite-state machine built from the atomic update and logically replicated per array word.

      Now *that* is what I call geek speak. Atomic? Damn geeks! ZOMG, we're all going to die from radiation!!!11!eleven!
  13. I think you have it wrong by Anonymous Coward · · Score: 0

    There are 48 Vega2 processors per board, with 1 core per processor. The biggest setup stacks 16 of these boards to get to 768 cores.

    1. Re:I think you have it wrong by Anonymous Coward · · Score: 1, Interesting

      Well, actually, it's 16 Vega2 chips (each of which is termed a "processor"), each chip has 48 cores, making for 768 total cores. 2 chips per board, and 8 boards in a system. All chips are fully meshed (no switch, fabric, or bus, just a whole bunch of direct multi-GHz wires), so it's a flat SMP UMA machine.

      The new Vega3 (announced last week) takes it to 54 cores per chip, and 864 cores in a system (16 x 54 = 864 ...).

  14. Why not Perl? by Anonymous Coward · · Score: 0

    Too bad this isn't being done in Perl.

  15. Well, sort of lock-free. by Animats · · Score: 1, Informative

    It's not really "lock free". The algorithms in the slides still have WHILE loops wrapped around atomic compare-and-swap operations, so they are effectively spin locks, tying up the CPU while the other CPUs do something. However, the design is such that the WHILE loops shouldn't stall for too long.

    This concept has two parts - a way of constructing bigger atomic operations from hardware-supported word-sized atomic operations, and a scheme for resizing arrays while they're in use. The latter is more important, especially in Java; stalls due to array resizing are a headache in real time systems with dynamic memory. It works about the way you'd expect; there are flags indicating the location (old or new) of each array element as the array is copied. Making all this work correctly is touchy.

    Concurrent algorithms like this are incredibly hard to write, and you need formal methods to get them right. The author of the paper still hasn't figured out to code a FIFO under these rules.

    1. Re:Well, sort of lock-free. by egoots · · Score: 1
      A shared data structure is considered "lock-free" if its operations don't require mutual exclusion. So if a process is interrupted in the middle of an operation, it will not prevent the other processes from operating on that object.

      Technically, there are more formal definitions which state that it guarantees that some process in the system operating on the concurrent object, will make progress in a finite number of steps.

      An object is termed "wait-free" if each process will make progress in a finite number of steps.

    2. Re:Well, sort of lock-free. by Kupek · · Score: 1

      It is lock-free, but it is not wait-free. He explains the difference in his slides, and there's plenty of literature around for those who have access to Google.

    3. Re:Well, sort of lock-free. by master_p · · Score: 1

      "Lock-free" does not mean no locking at all, it means waiting threads spin instead of entering the blocked threads queue.

  16. From the article: by Kingrames · · Score: 3, Funny

    "# A Finite State Machine (FSM) built from the atomic update and logically replicated per array word. The FSM supports array resize and is used to control writes."

    Clearly, the data structures have been touched by his noodly appendage.

    --
    If you can read this, I forgot to post anonymously.
  17. C/C++ by Anonymous Coward · · Score: 0

    Anything (free software) like this on C/C++?

  18. Uh, no. by Anonymous Coward · · Score: 0

    If Java surpasses C in performance, then why aren't the Unreal or Doom engines written in it? What you meant to say is: "for most applications, Java can surpass C++ performance."

    Oversimplifying things helps no one, except maybe your moderation score if the moderators are that clueless.

    1. Re:Uh, no. by quanticle · · Score: 2, Informative

      He never said that Java had surpassed C in speed, he said that Java had surpassed C++. C++ library classes are not the same as C library classes, and many C++ libraries (especially the ones outside STL and Boost) are woefully under optimized. Java has many more optimized libraries "packaged in" with the language itself.

      Second, neither the Doom or Unreal engines are multi-threaded. Java has threading support built into the language. To get the same with C you'd have to use POSIX threads (killing Windows compatibility) or the Windows threading API (killing POSIX compatibility). With Java, you don't have to make that choice.

      --
      We all know what to do, but we don't know how to get re-elected once we have done it
    2. Re:Uh, no. by Anonymous Coward · · Score: 0

      Not only that, but with libraries like the Java Monkey Engine ( http://jmonkeyengine.com/index.php?option=com_content&task=view&id=68&Itemid=84 ), you really COULD write something like Unreal or Doom in it.

      Yes, java monkey engine has a stupid name, but it's a pretty decent library.

    3. Re:Uh, no. by DarkOx · · Score: 1

      disclaimer:I am C/C++ fanboi

      Java is not faster then C++ and it never will be. You can't fundamentally add the overhead of a byte code interpreter and somehome come out ahead. What Java does and does very well is save programers from themselves.

      Like most programers I am not God's gift to software development. Chances are pretty good that as any given software project I am working on becomes larger and starts to involve more objects, and or data structures the memory management in Java is going to be better then that I would do myself. The garbage collector in Java is very good. It was implemented by very good coders who had a very specific interest that they could focus on. Whatever application I am doing has some other focus which means that managing memory is simply a required part of the task not the aim of it. Its a safe bet that my data managment is not going to compete with what is in Java. Now for small projects its a different ball of wax.

      --
      Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    4. Re:Uh, no. by ChrisFedak · · Score: 1

      While I do love C++ (C is for OS programmers, which I am not) the truth is that JIT compilers perform optimizations that native compilers don't. They can inline things based on dynamic usage, they can inline library functions, they can even recompile based on the hardware the code is running on (similar to compiling a custom kernel in linux, but on the fly) . When all of this is working VM based languages can really fly.

      Some of these ideas will eventually make it to native C++ compilers. However, I can't see all of them making the transition since they depend so heavily on having an intermediate form of the code to work with.

      So, yes, bytecode can be faster than native code in certain situations. I think there's still a lot of work to be done with these compilers. "Certain situations" might well become "nearly all situations" in the future.

    5. Re:Uh, no. by Jerry+Coffin · · Score: 1

      Java is not faster then C++ and it never will be. You can't fundamentally add the overhead of a byte code interpreter and somehome [sic] come out ahead.

      Java doesn't necessarily involve a byte-code interpreter. JVMs have supplied Just In Time compilation for years now.

      That's not a panacea though: optimization is generally a difficult task, and good optimization can be quite difficult indeed -- many (perhaps most) NP-complete problems are basically ones of optimization. While optimizing code isn't necessarily identical to a traveling salesman problem (etc.), there is quite a bit in common.

      The result of that is that the really difficult optimizations are harder to justify in a JIT compiler. In a traditional compiled language, you compile a small number of times, and run many times. In a JIT environment, you compile far more often, so you can't amortize that cost over nearly as many executions.

      What Java does and does very well is save programers [sic] from themselves.

      I'm not convinced of this either -- Java adds garbage collection, which many people think will save the programmers from themselves. My experience indicates rather the opposite -- garbage collection makes very little difference to memory management, and lack of deterministic finalization makes most other resources considerably more difficult to manage.

      The garbage collector in Java is very good. It was implemented by very good coders who had a very specific interest that they could focus on.

      There is no one garbage collector in Java -- there's a garbage collector in each individual JVM. Some of them are better than others, and some are optimized for different loads than others. Languages that use manual memory management are similar: each implementation has its own, with its own characteristics.

      For that matter, quite a few people use garbage collectors with C and C++. These garbage collectors are generally implemented by extremely good coders as well. C and C++ place a few obstacles in their way, so it's a bit more difficult to implement them, but they work reasonably well nonetheless.

      In the end, however, I think the effect of garbage collection on speed of either execution or development seems to be quite exaggerated. In a few circumstances, it's really helpful. In others it's really harmful. Most of the time, it makes very little difference at all.

      --
      The universe is a figment of its own imagination.
    6. Re:Uh, no. by Anonymous Coward · · Score: 0

      Any decent C/C++ programmer will use an abstraction when dealing with threads. They will not have to make a choice between POSIX and Win32 threads: their code will be cross-platform compatible.

      Any decent C/C++ programmer who writes multi-threaded / multi-processor code will use a higher level abstraction like tasks which threads run. They won't be worrying about thread management.

    7. Re:Uh, no. by afidel · · Score: 1

      Java GC sucks in general. I use quite a number of different JVM's and the tuning advice I see most often and have experienced is don't allow any given JVM to go above ~1.5GB of ram or GC performance will bring down overall system performance. That's just stupid on modern machines with up to hundreds of GB of ram. Sure there are frontends that allow you to spawn additional JVM's automatically, but that just means there's more processes and more overhead.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    8. Re:Uh, no. by delt0r · · Score: 1

      Also a lot of "fast" C programs out there also have big asm blocks and/or massive amounts of align hints. That kinda stops it begin C.

      IMHO there are just a few areas where a asm/C block will trash any pure java implementation. Thats symmetric crypto and video codecs. Its easy to see why too. Cryto algos are often designed from the ground up with x86 asm in mind. Where MMX etc was introduced really for DCT type calculations.

      --
      If information wants to be free, why does my internet connection cost so much?
    9. Re:Uh, no. by quanticle · · Score: 1

      And, going back to my original point, neither cryptographic algorithms nor video codecs are multithreaded. In fact, video and audio encoding are some of the few things that really benefit from massively long pipelines and high clock speeds (which is why the old P4s used to do that sort of thing well).

      --
      We all know what to do, but we don't know how to get re-elected once we have done it
    10. Re:Uh, no. by quanticle · · Score: 1

      Well, if you're already abstracting away that much, why not go to Java where you don't have to worry about your abstraction being under-optimized?

      --
      We all know what to do, but we don't know how to get re-elected once we have done it
    11. Re:Uh, no. by nullchar · · Score: 1

      We use these JVM options on a box with 16GB, and we could even increase the Xmx=12G to more and it would be fine, but we wanted to leave some room for disk cache.

      JAVA_OPTS="-server -Xms3000m -Xmx12000m -Xss256k -XX:ThreadStackSize=256 -XX:-HeapDumpOnOutOfMemoryError -XX:-PrintGCDetails -XX:MaxPermSize=256m -XX:-PrintGCTimeStamps -Xloggc:/path-to-log-dir/gc.log -Dsun.net.inetaddr.ttl=10 -Dsun.net.inetaddr.negative.ttl=0 -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000"

      Use the GC log option to watch it in action. You can safely ignore the DNS cache stuff, but I like to add it to every JVM to fix that major annoyance.

      This is for Sun JDK 1.6 btw. We used different settings for JRockit.

    12. Re:Uh, no. by Anonymous Coward · · Score: 0

      Fucking kids. Do you make this shit up yourself or does you mom help you? Real programmers have been writing cross-platform code using POSIX threads that run on Windows for years.

  19. "2.5"? WTF? by adavies42 · · Score: 0

    The end result is a coding style that has allowed Click to build 2.5 lock-free data structures that also scale remarkably well.

    Will someone please explain WTF "2.5" means in the summary?

    --
    Media that can be recorded and distributed can be recorded and distributed.
    -kfg
    1. Re:"2.5"? WTF? by badboy_tw2002 · · Score: 4, Informative

      He's built two working data structures and is working on a third (had to read the slides to figure that one out).

    2. Re:"2.5"? WTF? by adavies42 · · Score: 1

      Huh. The most straightforward answer, and the one that seemed least likely to me....

      --
      Media that can be recorded and distributed can be recorded and distributed.
      -kfg
  20. one per by spoonist · · Score: 2, Funny

    Azul hardware (which is in production with 768 cores), supporting 700+ hardware threads in Java

    Hmmm... one core per Java thread?

    That sounds about right for Java apps...

    1. Re:one per by afidel · · Score: 1

      They are probably lightweight cores, much like those on the Sun coolthreads processors. Plus if you are paying for a system with 700+ cores you probably have an app that can keep 700+ threads busy =)

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  21. Bulk-Synchronous Parallel model, anyone ? by Seferino · · Score: 2, Insightful
    This is interesting indeed. When reading the summary, it made me think about BSPML, although the slides make it clear that there are a number of differences. Essentially
    • BSPML doesn't limit itself to FSM but has full expressive power, including exceptions -- some implementations of BSPML use monads to solve things that this work solves by scaling down to a FSM
    • BSPML doesn't support dynamic changes to the number of threads
    • many BSPML algorithms are provable
    • BSPML is typically compiled to fully native code
    • BSPML doesn't use processor-specific concurrency-specific optimizations
    • BSPML works on distributed systems.
    • Despite the differences, both models work by sharing code and operating on a high-speed concurrent pseudo-vector, in a completely lock-free model (although locks can be implemented on top of the model, as usual).
      Just my .02 â.
    1. Re:Bulk-Synchronous Parallel model, anyone ? by mritunjai · · Score: 1


      So where's the code ?

      (Yeah, I know about it, played with it... lots of noise, not enough code!)

      --
      - mritunjai
    2. Re:Bulk-Synchronous Parallel model, anyone ? by Seferino · · Score: 1

      One implementation here, another, more recent, there. I remember there is an Haskell version somewhere, but I can't remember its name.

  22. Re:Sounds great! by hasdikarlsam · · Score: 2, Insightful

    This still has limited applicability. Making a single data structure useful across hundreds of CPUs is impressive, but many problems can be more easily solved by using multiple structures - pipelining it, or using divide and conquer, or any of many other approaches.

  23. Re:Sounds great! by mikael · · Score: 4, Informative

    The author has developed a programming methodology class for parallel programming in Java. In this system, a single application can have 700+ separate threads running (user input, background tasks, dialog windows, scripts, automatic undo logging).

    With such applications you will often have a array of variables that are accessible by all threads (eg. current processing modes of the application).

    To preserve the integrity of the system, you need to only allow one thread to write to each variable at any time. If you have a single read/write lock for all the variables, you will end up with large number of threads queuing up in a suspended state waiting to read a variable, while one thread writes.

    The author uses the Load-Link/Store Conditional pair of instructions to guarantee that the new value is written to all locations. Load-Link loads the value from memory. Store-Conditional only writes the value back if no other write requests have been performed on that location, otherwise it fails.

    Check-And-Set only replaces the variable with a new value if the value of the variable matches a previously read old value.

    Using these methods (having the writer check for any changes) eliminates the need for suspending threads when trying to read shared variables.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  24. Quoting out of context can be fun... by Anonymous Coward · · Score: 0

    3. [...] The FSM supports array resize and is used to control writes.

    So their algorithm does not work without the interference of the Flying Spaghetti Monster?

  25. WTF? by neuromancer23 · · Score: 2, Informative

    Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads". Thousands of threads are typical of java applications even running on Pentium IIIs. Every J2EE Application server spawns a new thread for every request. It's part of the specification. The real issue with Java is the hard thread limit in most JVMs where even calling -Xss will not override the limit. These limits are both asinine and arbitrary. Linux can very easily handle the instantiation of millions of pthreads on a single core host, and we all have multi-core machines these days. The JVM ought to scale to millions of threads, otherwise if a JVM can only support 20k concurrent threads, there really isn't any reason to ever pay more than a few hundred dollars for a web server since you quickly reach the maximum thread count while 99% of your opteron system is pretty much sitting there doing nothing.

    Furthermore, the need to synchronize a collection is an indication of poor design (i.e. lack of encapsulation) that is better solved by writing better code that doesn't use disgusting global data structures, but if you really wanted to use a shared collection, all you would have to do is create an implementation that makes each object in the data structure lockable individually, like a linked list where any index could be locked independent of the list itself, and new capacity could be created and then added in bulk. But in the end, updating a collection is a simple constant time operation (just update the reference to point to a new location), so you should never have performance problems even when locking an entire collection to do the update. If you are having performance problems with that, then chances are, it is the way that you are interacting with the collection.

    Anyway, here's to hoping that people can learn basic programming concepts so that the world can move on to solving real problems.

    1. Re:WTF? by julesh · · Score: 1

      Nowhere in the article is it mentioned anywhere that they are running "700 hardware threads".

      Quoth the article: "On Azul's hardware it obtains linear scaling to 768 CPUs"

      That kind-of implies 768 hardware threads are in use.

    2. Re:WTF? by neuromancer23 · · Score: 1

      K. Thanks. I missed that in the article. I jumped over it to look at the performance metrics.

    3. Re:WTF? by TheLink · · Score: 1

      Linear scaling does not necessarily mean 700 threads though.

      The factor does not have to be 1.

      If it's 4, then it would be 4 threads for one core, and 2800 threads for 700 cores. And that would still be linear scaling.

      --
  26. No it is Lock Free by tbcpp · · Score: 2, Interesting

    I used to think this too until I saw the video by the article's author. By lock free we mean that if the thread that has the "lock" were to die, it would not stall out the entire program. With CAS updates, a crashing thread would simply die and cause no ill effects to the data structure. With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock. The maximum time a CAS "lock" can exists is 1 CPU cycle.

    --
    Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
    1. Re:No it is Lock Free by neuromancer23 · · Score: 1

      >> With Mutex style locks, if the locking thread crashes (or otherwise forgets to unlock the mutex) then the entire program grinds to a halt as other threads start waiting on the lock.

      Not in Java. I will give you $1000 if you can:

      1. Kill a thread prematurely.
      2. Cause it not to release a lock inside the JVM.

    2. Re:No it is Lock Free by tbcpp · · Score: 1

      That's probably true. However the point of CAS style "locks" (for they really aren't locks) still stands in that with this new method of doing Hash tables, it is possible to only lock that portion of the data structure for 1 instruction. In addition, correct me if I'm wrong, this should also keep a kernel call from happening. Java threads are kernel threads right? So this could cut down on the overhead dramatically. I would encourage everyone to watch the Google Talk by the author of the article. It's deep stuff, but incredibly interesting: http://video.google.com/videoplay?docid=2139967204534450862&q=hash+tables&ei=9JI8SMLhA5m05AKXgqXfAw

      --
      Man is the lowest-cost, 150-pound, nonlinear, all-purpose computer system which can be mass-produced by unskilled labor.
    3. Re:No it is Lock Free by egomaniac · · Score: 1

      "Kill" doesn't have to mean "thread terminates". Entering an infinite loop or deadlock are equally effective ways to "kill" a thread which will not cause it to release its locks, meaning you still have the problem even in Java.

      --
      ZFS: because love is never having to say fsck
    4. Re:No it is Lock Free by neuromancer23 · · Score: 1

      Well that's news to me. I thought that the term kill was related to the subject dying; i.e. (A) kills (B) then (B) is dead. Whereas a thread that is in an infinite loop or is dead-locked waiting on a lock to be released, lives forever. So basically, in your view of the English language, the word kill does not mean "to cause the death of" but the exact opposite, "to make immortal."

      http://dictionary.reference.com/browse/schizophrenia

    5. Re:No it is Lock Free by egomaniac · · Score: 1

      The threads are dead in the sense that they are nonfunctional. We call it a "dead"lock for a reason, numbnuts.

      --
      ZFS: because love is never having to say fsck
    6. Re:No it is Lock Free by neuromancer23 · · Score: 1

      Now you are changing the context of the original conversation. Thanks for the compliment tho

  27. So? by Anonymous Coward · · Score: 0

    I work in the real-time embedded world and we've been doing this for years.

  28. Benchmark from many years ago by CustomDesigned · · Score: 2, Insightful

    Yes, Java didn't surpass, but was competitive with C++ many years ago - at least as far as CPU goes. In fact, if you use gcj, you'll find that Java performance is *very* similar to C++ :-). For the IBM JVM, the turning point was JDK 1.1.6. I think having each thread allocate blocks of memory to dish out without lock contention was a big part of it. One interesting benchmark was this.

    Now, before you C/C++ or Java fanboys get too excited, the absolute hands down fastest language on most of the benchmarks was LISP (I think it was Common LISP). I don't use LISP because I hate the syntax (and LISP fanboys regard it as a requirement of club membership that you like it). But the performance is quite impressive (with GC too), and that's what I would look into if I needed top performance from a high level language. Eat your heart out C++.

    I use Java quite a bit, and the biggest drawback performance wise is the amount of memory chewed up by the GC schemes and JIT. It is otherwise quite fast. I often turn off JIT to save memory (at the expense of CPU), and I would like to see options to use a GC algorithm that minimizes memory consumption (at the expense of CPU).

    1. Re:Benchmark from many years ago by mishagam · · Score: 1

      I bet this happened only because LISP is more obscure language, and so only LISP fans compared it's speed. I am pretty sure read (in LISP Literature) that many AI projects were rewritten on C to gain speed. Also main LIsp structure - linked list - is inherently slowest possible structure, and gain speed on Lisp is only possible using unnatural for Lisp structures like vectors (and I bet trying to escape recursion - which is natural thing to do in Lisp).

    2. Re:Benchmark from many years ago by pherth · · Score: 1

      First of all, vectors and hash tables are not "unnatural" data structures in Common Lisp. The are fist class citizens like lists are. And for many applications, lists are actually the fastest data stucture. Of course if you want to access an element by index, lists are the wrong tool and no Lisper uses them in this fashion. Similiarly, while recursion is natural in Lisp, looping is in Common Lisp. So noone would recurse where you want to loop.

  29. Re:Sounds great! by Linker3000 · · Score: 3, Funny

    I hereby appoint you official summary explainer!

    Thanks

    --
    AT&ROFLMAO
  30. Re:Sounds great! by hey! · · Score: 4, Funny

    Which is great and all, but what we usually need is more of a summary executioner.

    --
    Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
  31. These kids today! by Anonymous Coward · · Score: 0

    Bah. Just use /dev/null. Unlimited concurrent writers. Scales to massive volumes.

  32. Sounds bogus? by hackingbear · · Score: 1

    Well... if I remember what I learned from OS and hardware classes right. LL/SC and CAS operations do involve locks at the hardware level. These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.

    So I fail to see anything radically new in this project. It is not lock-free at all. It sounds more like bogus marketing pitch to me.

    Can someone correct or explain to me?

    1. Re:Sounds bogus? by Anonymous Coward · · Score: 0

      What a troll here:

      Searching for CAS / Memory brings up something entirely different

      Equivocating between using CPUs and CPU cores

      Anyway, to answer the question, this is a multi-core processor so locks would not be sent over the memory bus but instead sent in L1 cache or something and the memory bus would not be the bottleneck.

      Your comments about using multiple CPUs are most likely valid though.

    2. Re:Sounds bogus? by Kupek · · Score: 4, Informative

      Locking in software has implications that locking at the hardware level does not.

      If a thread locks in software, any subsequent thread must block, waiting for the first thread to finish. If the thread is preempted, then the waiting threads wait needlessly. If the thread dies, then the waiting threads are hosed.

      Lock-free techniques prevent this problem, at the expense of more complicated algorithms and data structures. The basic structure of most lock-free algorithms is read a value, do something to it, and then attempt to commit the changed value back to memory. The attempt fails if another thread has changed the value from underneath you, and you must try again. (This is detected through operations like compare-and-swap.) This allows greater concurrency and guarantees that the system as a whole will make progress, even if a thread is preempted or dies.

      Lock-free algorithms and data structures is a well established area. What Click has done here is provide a Java implementation of some data structures that yield good performance on the manycore systems his company makes.

    3. Re:Sounds bogus? by Fourier · · Score: 1

      By convention, "lock-free" is sort of an ill-defined term that roughly means "no software-based locking operations, except maybe we'll ignore spin-locking, provided the algorithm always makes global progress while spinning occurs." In any case, there is no common definition for "lock-free" which implies the absence of hardware locking.

      I'm not sure there is anything radically new in this project. However, given the exceptional difficulty of writing correct lock-free algorithms, just about any project exploring this area is noteworthy.

    4. Re:Sounds bogus? by hackingbear · · Score: 1

      That's what I understood. Maybe you can be very creative and use very complicate way to avoid contentions except a few independent shared variable updates -- and then you scale your algorithm to hundreds or thousands of CPUs, you will eventually hit the memory bus / communication overhead as usual. MPP supercomputers have been trying all tricks to minimize this overhead but still a bottleneck. I just don't see what this project has achieved but to create maybe a faster concurrent hash map (or the likes) -- it sounds like there were some sort of fundamental breakthrough in computer science -- that's not the case. So it is more like a marketing bluff.

    5. Re:Sounds bogus? by Chris+Burke · · Score: 5, Informative

      These operations may need no OS system call, may use no explicit semaphore or lock, but the memory bus has to be locked briefly -- especially to guarantee all CPUs seeing the same updated value, it has to do a write-through and cannot just update the values in cache local to the CPU. And when you have large number of CPU cores running, the memory bus becomes the bottleneck by itself.

      That's not strictly true.

      First, most lock operations do not require a full bus lock. All you have to do is to ensure atomicity of the load and store. Which effectively means you have to 1) acquire the cache line in the modified state (you're the only one who has it here), and 2) prevent system probes from invalidating the line before you can write to it by NACKing those probes until the LOCK is done. Practically this means the locked op has to be the oldest on that cpu before it can start, which ultimately delays its retirement, but not by as much as a full bus lock. Also it has minimal effect on the memory system. The LOCK does not fundamentally add any additional traffic.

      Second, the way the value is propagated to other CPUs is the same as any other store. When the cache line is in the modified state, only one CPU can have a copy. All other CPUs that want it will send probes, and the CPU with the M copy will send its data to all the CPUs requesting it, either invalidating or changing to Shared its own copy depending on the types of requests, coherence protocol, etc. If nobody else wants it, and it is eventually evicted from the CPU cache, it will be written to memory. This is the same, LOCK or no.

      Third, an explicit mutex requires at least two separate memory requests, possibly three: One to acquire the lock, and the other to modify the protected data. This is going to result in two cache misses for the other CPUs, one for the mutex and one for the data, which are both going to be in the modified state and thus only present in the cache of the cpu that held the mutex. In some consistency models, a final memory barrier is required to let go of the mutex to ensure all writes done inside the lock are seen (x86 not being one of them).

      Fourth, with fine enough granularity, most mutexes are uncontested. This means the overhead of locking the mutex is really just that, overhead. Getting maximal granularity/concurrency with mutexes would mean having a separate mutex variable for every element of your data array. This is wasteful of memory and bandwidth. Building your assumptions of atomicity into the structure itself means you use the minimal amount of memory (and thus mem bw), and have the maximal amount of concurrency.

      So basically, while it isn't necessarily "radical" (practical improvements often aren't), it is definitely more than bogus marketing. There's a lot more to it than that.

      --

      The enemies of Democracy are
    6. Re:Sounds bogus? by dedazo · · Score: 1
      I worked on a high-concurrency application a few years ago (not multi-CPU). The problems inherent to today's x86 hardware vis-a-vis concurrency are very interesting indeed. We had a few thousand threads at any given point trying to access something that would pass as a hashtable with millions of key-value pairs. We ran into all sorts of problems until I created a lightweight structure that looked like a cross between a mutex and a Win32 critical section, which I derived from an old MSDN code sample written for Windows CE, of all things. Along with the excellent implementation of I/O completion ports introduced with Windows 2000, I'd have to say we did a good job. As far as I know all that code is still running 24/7/365 without problems and scales perfectly well across machines. This was all C++, by the way.

      Also, I'd like to just say "thank you". It's not often here on /. that one gets to read comments like these. I'd have to wonder why you're not modded up to +5.

      Cheers.

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
    7. Re:Sounds bogus? by dedazo · · Score: 1
      Gah, I hate replying to myself but I forgot my main point, which is to incorporate the blocking mechanism in the data structure itself. I would simply never have thought of that, and I think that's pretty much the novel thing about this whole idea.

      Talk about thinking outside of the box... probably the reason I don't work for Google =)

      --
      Web2.0: I love when people Flickr my cuil and digg my boingboing until my google is reddit and I start to yahoo
  33. java.util.concurrent.atomic by CustomDesigned · · Score: 1

    They are really non-blocking, and use a standard Java API - the java.util.concurrent.atomic package. The description of the package indicates it was intended as the low level basis for non-blocking data structures. While most JVMs are written largely in C, the Java code gets compiled to machine language. Standard packages do not have to be implemented via JNI, and can be implemented directly by the JVM - just like GNU C implements many standard C library functions directly in the compiler for performance. In fact, java.util.concurrent.atomic would be a poor choice for implementing via JNI because the transition between C and Java code that happens on every JNI call would defeat the high performance purpose of the package.

  34. Dr. Click by tsalaroth · · Score: 1

    Since you're apparently reading this - shoot me an email. I'd like to discuss if this could work with Jython.

    I'm thinking using a single Honkin' Big Iron instead of a farm for a massive-use service.

  35. TransactionKit by johne_ganz · · Score: 1

    TransactionKit is another lockless highly concurrent data structure, specifically a hash table.

  36. I doubt that very much by EmbeddedJanitor · · Score: 1
    Sure embedded folk have been using FSMs for years. Sure embedded folk have been using arrays for many years,

    But a FSM-controlled data access for multi-threading is a bit different. The closest I've seen is the implementation of dual-port RAM which uses hardware state machines to control access to the underlying shared RAM.

    Oh, I've been doing embedded for over 20 years too.

    --
    Engineering is the art of compromise.
  37. From the video... by Zarf · · Score: 2, Insightful

    Someone posted the video and that was great. In particular I really like the use of a finite state machine as a proof of correctness. That might be a novel approach in this day and age when everyone is in love with UML. It makes you wonder if many of these things aren't made too complex by adding too much cognitive over-head. To hear Dr. Cliff Click talk it seems so trivial in retrospect. I suppose this is how you know his solution is elegant... I seriously doubt I'd have thought of it myself but when you see something elegant that seems natural afterward it's probably right.

    The other thing is that his algorithm shows a remarkable departure from traditional concurrent programming (as I learned it a decade ago) and he's not getting bogged down with locking and synchronize... instead he has a very simple "think about it" approach that uses the state machine as a thinking aide. Whom ever posted the video... thank you that was very enlightening. Perhaps many of these concurrency problems just need some creativity after all?

    --
    [signature]
  38. Java vs C++ by Anonymous Coward · · Score: 0

    Want to show us one example we can compile and run on our PCs to show that Java is so uber-fast compared to C++?
    Just one example of a program written in both C++ and Java where Java runs faster?
    If the Java runs faster I guarantee it's badly written C++ and I'll "optimize the hell" out of it in about 10 minutes and beat your Java. I guarantee it.
    The reason nobody bothers with C++/Java benchmarks is because nobody does serious work in Java any more.

    1. Re:Java vs C++ by Anonymous Coward · · Score: 0

      24 hours later and not one example of "uber-fast" Java has been produced. I rest my case

  39. It's not the Java language that is the problem by gatkinso · · Score: 1

    it is the platform (i.e. JRE). Slow. As molassas. But then again, so is managed C++ on the CLR.

    So it goes, the circular argument....

    --
    I am very small, utmostly microscopic.
  40. Copy on write hashtables by Anonymous Coward · · Score: 0

    The problem is with doing a copy on write (COW) of the hashtable on a resize operation. In conventional COW a thread makes the new copy in private storage and then does a compare and swap (or ll/sc) on a global pointer to the hashtable. Since resizing a hashtable is a rather expensive operation you can get a lot of threads all attempting to do the resize with a lot of private copies. All but one attempt will fail, making conventional COW rather wasteful.

    I believe Cliff's technique involves allocating a new array and letting all threads accessing the hashtable attempt to populate the new hash array. Kind of a lazy initialization on a per array element basis. Once the new array is fully populated the old one can be discarded. There's probably some FSM logic involved. I haven't looked in detail.

  41. Evil Geniuses Unite! by slugabed · · Score: 0

    With an army of 768 atomic powered, logically replicated finite state machines, I would rule the world!

  42. Hey Cliff, there is no word as performant. by Anonymous Coward · · Score: 0

    Cliff, stop being a douchebag. The word you are looking for is "fast" or "speedy" or "fast enough."

    There is no word as "performant". Pretending there is, using words just to obfuscate or make your dick seem bigger is a sign of pretentious douchebaggery.

    STOP IT!

  43. Re:Performant is not a word by Anonymous Coward · · Score: 0

    Nothing vindicates a man's work more than attracting the obsessive attention of anonymous shitstains like yourself.

  44. Theory Pong by nuzak · · Score: 2, Funny

    > Your turn.

    Catamorphisms. Linear Logic.

    Back to you :)

    --
    Done with slashdot, done with nerds, getting a life.
    1. Re:Theory Pong by TuringTest · · Score: 1

      Damn you. I've read the Wikipedia's catamorphism page and now I know what it is! (I needed to reach the "a catamorphism is a generalization of the folds on lists known from functional programming to arbitrary abstract data types" phrase to understand what the heck what's going on, though).

      --
      Singularity: a belief in the "God" idea with the "demiurge" relation inverted.
  45. Re:Sounds great! by Pseudonym · · Score: 2, Insightful

    Researchers don't work at the fringes of what can be done "more easily". They work at the fringes of what is currently possible.

    Think about multi-tasking for a moment. If your problem is inherently sequential (e.g. everything is effectively sequentialised by a shared resource that can't be split, such as a piece of hardware), then you can use a single-process event loop. If your problem is inherently parallel (e.g. a web server), then you can use multiple forked processes. In otherwords: If your problem is easy, use an easy solution.

    But not every job is easy. That's why we have multi-threading.

    Similarly, these new data structures are designed to solve difficult problems.

    --
    sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  46. Welcome back by Duncan3 · · Score: 1

    This is how we've been teaching computer science to share memory since there was more then one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.

    And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too!

    --
    - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    1. Re:Welcome back by Zarf · · Score: 1

      This is how we've been teaching computer science to share memory since there was more then one thread. Anyone over 40 will immediately recognize this as just "how we do that". To all you younger viewers, welcome to multi-core/SMP circa 1980.

      And to all you industry people, if you'd stop firing everyone when they turn 30, you'd know this too! Can you provide a reference? I'd like to learn more.
      --
      [signature]
  47. What can you use it for? by Anonymous Coward · · Score: 0

    So is this going to help get rid of the python global interpreter lock(GIL) or what?

  48. Can also be done with a clean cache by Gazzonyx · · Score: 2, Interesting

    I don't necessarily take issue with what you've said (I've got you on my friend list, so I'm fairly sure that your one of the professionals on this site whose insight I love to read... I'm a software development major in college, so I usually add programmers to my friends list when I genuinely value their insights...), but I would like to point out that I've seen another approach to your first point about locking busses. The Linux kernel, IIRC, does a sideways and backwards cache flush to RAM before masking IRQs on an SMP system. I think it defers top half of interrupts to another CPU and doesn't queue the bottom halves. I might be mistaken on that, though. It then 'takes over' the CPU for the process/thread, and when it's done, it flushes all the caches once again.

    But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.

    --

    If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    1. Re:Can also be done with a clean cache by Chris+Burke · · Score: 1

      But, as you said, the lock doesn't require a bus lock. Please correct me if I'm wrong on this, it's been a few months since I've read that piece of code and I'm a little hazy on the exact details.

      Ooh, don't know anything about that code actually. A cache flush is a lot heavier weight than a normal lock, if you're only accessing one cache line that is, but I could see it being a good idea in a number of situations in OSes. Interesting.

      --

      The enemies of Democracy are
  49. Re:Sounds great! by TheLink · · Score: 1

    We outsourced that to China.

    They said they'd provide nonblocking executions in a scalable manner.

    --
  50. Stoopid by Nicolas+MONNET · · Score: 1

    Because coding a bad, poorly parallelizable algorithm in ASM vs C will give you at best a 2x speedup.
    And coding a good, hyper parallelizable algorithm in Java vs C will give you at worst a 1/2x speed down, and at the same time an nx speed up, where n=768.
    And I'm not even talking about the mess ASM would be.

  51. But what about performance? by renoX · · Score: 1

    Usually scalable algorithm are slower when the CPU count is low so the question is: how many CPU are needed so that the scalable algorithm is as good as the non scalable one?

    Strangely this isn't talked about in this 'lightweight' article..

    1. Re:But what about performance? by ZerdZerd · · Score: 1

      Your question is answered in this talk

      --
      I'm not insane! My mother had me tested.
  52. Ping by Seferino · · Score: 2, Funny

    Amb. Calculus of constructions. Higher-order barbed bisimulation. Stochastic processes.
    And you're cheating with your signature :)

    1. Re:Ping by nuzak · · Score: 1

      > Amb

      Pshaw. That may or may not be really easy or not.

      > Calculus of constructions

      You win.

      > Higher-order barbed bisimulation

      I know a few people on Folsom Street who charge quite a bit of money for this sort of thing!

      Truthfully, I only pick up terms from LtU -- Monads are about as far as I can actually comprehend. If I ever get through my copy of TAPL, I'll at least understand some more of the vocabulary to level up my theory.

      --
      Done with slashdot, done with nerds, getting a life.
    2. Re:Ping by Seferino · · Score: 1

      :)
      I admit I've only browsed quickly through TAPL. But I work in the field, so that's probably a major sin.

  53. Re:Sounds great! by Lodragandraoidh · · Score: 1

    In Computer Science terms, atomic operations are ones which prevent the 'he who writes last - wins' syndrome.

    In multiprocessor architectures the use of resource locking normally does the job, although one of the drawbacks is forcing other threads of execution to wait until the lock is released. There are non-locking and wait-free methods that use low level atomic operations (like 'compare and save') and generally use data structures that support these methods (stacks, queues, sets and hashes).

    --

    Lodragan Draoidh
    The more you explain it, the more I don't understand it. - Mark Twain
  54. C/C++ by Rainefan · · Score: 1

    Anything like this (Scalable Nonblocking Data Structures) but on C/C++?

  55. STM? by K.+S.+Kyosuke · · Score: 1

    And in the meantime, clever guys who decided not to spoil their lives with half-baked programming languages are doing things in a much more beautiful way...

    --
    Ezekiel 23:20
  56. Does it go ping? by Anonymous Coward · · Score: 0


    The presenter states that there are only a handful of people in the world savvy enough to program like this.

    Does the 700 core hashtable go ping? You are not allowed to use it because you are not q-u-a-l-i-f-i-e-d!