Slashdot Mirror


Revisiting Amdahl's Law

An anonymous reader writes "A German computer scientist is taking a fresh look at the 46-year old Amdahl's law, which took a first look at limitations in parallel computing with respect to serial computing. The fresh look considers software development models as a way to overcome parallel computing limitations. 'DEEP keeps the code parts of a simulation that can only be parallelized up to a concurrency of p = L on a Cluster Computer equipped with fast general purpose processors. The highly parallelizable parts of the simulation are run on a massively parallel Booster-system with a concurrency of p = H, H >> L. The booster is equipped with many-core Xeon Phi processors and connected by a 3D-torus network of sub-microsecond latency based on EXTOLL technology. The DEEP system software allows to dynamically distribute the tasks to the most appropriate parts of the hardware in order to achieve highest computational efficiency.' Amdahl's law has been revisited many times, most notably by John Gustafson."

16 of 54 comments

  1. Buzzword-heavy by Animats · · Score: 4, Insightful

    The article makes little sense. The site of the DEEP project is more useful. It has the look of an EU publicly funded boondoggle. Those have a long history; see Plan Calcul, the 1966 plan to create a major European computing industry. That didn't do too well.

    The trouble with supercomputers is that only governments buy them. When they do, they tend not to use them very effectively. The US has pork programs like the Alabama Supercomputer Center. One of their main activities is providing the censorware for Alabama schools.

    There's something to be said for trying to come up with better ways of making sequential computation more parallel. But the track record of failures is discouraging. The game industry beat their head against the wall for five years trying to get the Cell processors in the PS3 to do useful work. Sony has given up; the PS4 is an ordinary shared-memory multiprocessor. So are all the XBox machines.

    It's encouraging to see how much useful work people are getting out of GPUs, though.

    1. Re:Buzzword-heavy by cold+fjord · · Score: 3, Interesting

      The article makes sense, but I don't think the work appears to be especially innovative even if it could be very useful.

      It is more than governments that buy supercomputers. They are also used in industry for things like oil and gas exploration, economic modeling, and weather forecasts. Universities and research organizations also use them for a variety of purposes. Time on an actual supercomputer tends to be highly valuable and sought after. You may disagree with the use, but that is a different question from not being used effectively.

      The Secret Lives of Supercomputers, Part 1

      "It is probably the biggest trend in supercomputers -- the movement away from ivory-tower research and government-sponsored research to commerce and business," Michael Corrado, an IBM spokesperson, told TechNewsWorld. In 1997, there were 161 supersystems deployed in business and industry, but that figure grew to 287 by June 2008, he noted. "More than half the list reside in commercial enterprises. That's a huge shift, and it's been under way for years."

      Uses for supercomputers

      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    2. Re:Buzzword-heavy by cold+fjord · · Score: 3, Interesting
      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    3. Re:Buzzword-heavy by rioki · · Score: 3, Interesting

      You might want to read / view these slides: An Introduction to Modern GPU Architecture, especially slide 42.

      Modern GPUs are massively parallel in their execution. Yes, they work "only" on one image, but when rendering one scene the shaders work in parallel. For example, a fragment (aka per-pixel) shader will be run in parallel for each pixel, limited by the number of available shader units (aka cores). THIS is why you get the awesome performance: small, self-contained programs running in parallel.
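
      A minimal sketch of that idea, in Python rather than a real shading language. The `fragment_shader` function, the tiny framebuffer size, and the use of a thread pool to stand in for shader units are all assumptions made for illustration; the point is only that each pixel's result depends on nothing but its own inputs.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 4, 3  # tiny hypothetical framebuffer


def fragment_shader(pixel):
    """Hypothetical per-pixel program: uses only its own coordinates."""
    x, y = pixel
    # Simple gradient: red grows with x, green grows with y.
    return (x * 255 // (WIDTH - 1), y * 255 // (HEIGHT - 1), 0)


def render(workers=4):
    pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
    # No pixel reads another pixel's result, so the calls can run in any
    # order or all at once; `workers` stands in for the shader unit count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fragment_shader, pixels))


image = render()
```

      Rendering with one worker or with four gives the same image, which is what makes this kind of workload easy to spread across many units.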

    4. Re:Buzzword-heavy by smallfries · · Score: 2

      How dare you criticise the author - he is a physicist and he has stooped to coming and telling us computer science types how to do it properly!

      There is a deeply appropriate xkcd but I cannot be bothered to find it. Decoding the garbage in the pcworld story tells us that he is going to break Amdahl's Law by dynamically partitioning the workload between a fast single threaded processor and many slower parallel processors. I would guess that by failing to make a fair comparison they can claim that the portion running under the boosted clock somehow beats the bounds predicted by Amdahl's law. Sadly it does not, as the law is worded in terms of the proportion of the code that can be executed on the parallel architecture.

      It is quite possible that much of the hyperbole was added as sales pitch, which is a little unfortunate as the dynamic partitioning and the toolchain support are far more interesting anyway.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    5. Re:Buzzword-heavy by rgbatduke · · Score: 2

      Hey, don't disrespect physicists in parallel computing. Some of us actually understand how to do it properly and agree with what you state. Superlinear speedup is not precisely unknown, but it is rare and depends on architectural "tricks" that typically preserve Amdahl's law at a low level but apparently violate it at a higher level. In the naivest, stupidest example, if we counted processors instead of cores, even embarrassingly parallel code would exhibit superlinear speedup on a single processor system. Replace core count with internal ALUs, pipelines, SIMD/MIMD in the architecture, onboard vector units, etc, and one can get the same sort of thing per core for just the right code.

      I am deeply skeptical of any sort of toolset that purports to be able to either statically or dynamically partition a given set of upper level code to get superlinear speedup. I won't say it is impossible to build a set that "works" for some fraction of the parallelizable code in the Universe, but given the complex tradeoffs between computation and communication in different communication topologies and task partitionings, this is not a problem that has a simple universal solution. I suspect that in lots of cases an experienced human parallel programmer could spend half a day analyzing the code and architecture and beat the output from an automated tool (or tie it, as the USUAL rule is going to be no meaningful superlinear speedup for boring coarse grained parallel or embarrassingly parallel code).

      An interesting example of a tool that does this sort of tuning (semi-empirically!) and that works is ATLAS, the Automatically Tuned Linear Algebra Software. Basically it does a search of the space of partitionings and algorithms to determine the best combination of the two for performing basic linear algebra functions (BLAS) and then implements it in a transparent library. It is semi-empirical because it is nearly impossible to predict the overall effect of every combination of SSE support, clock speed, bus speed, core architecture -- it is a lot easier to just go and find out. But the problem ATLAS solves is comparatively simple relative to even static task partitioning on heterogeneous computational resources with variable costs for core-to-core communication, especially in today's multicore world where one has different speeds between cores on a processor, between processors in a system, between systems, between general purpose processors and special purpose e.g. GPU/vector processors, where the communication topology itself can have a major impact on the kind of parallel speedup any given task has.
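
      The "just go and find out" approach can be sketched in a few lines. This is not ATLAS code and does not use its API; `blocked_matmul`, `autotune`, the matrix size, and the candidate tile sizes are all made up for illustration. It simply times each candidate on the machine it runs on and keeps the fastest.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time each candidate tile size here and now; keep the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    # The winner is measured, not predicted, so it can differ per machine.
    return min(timings, key=timings.get), timings


best_block, timings = autotune()
```

      In pure Python the tile size barely matters, so the winner here is mostly noise; the shape of the search is the point, not the result.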

      This, then, is the interesting part as you note, and who knows, maybe they've built a sufficiently intelligent system to get nominally superlinear speedup (or hell, who cares, just getting close to optimal speedup sublinear or not) from a meaningful fraction of the space of possible parallelizable code. But God couldn't get superlinear speedup on fine grained synchronous parallel code with long range coupling out of any multi-node scalable parallel architecture available in the real world today, no matter how fancy the partitioning tool.

      rgb

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    6. Re:Buzzword-heavy by rgbatduke · · Score: 2

      Double ditto. I've written magazine articles on beowulf-style supercomputers I've built at home (I used to write a column for "Cluster World" magazine in the brief time that it existed, but I also wrote an article or two for more mainstream computer mags). I have also set up clusters for companies I've founded and helped others set up corporate clusters. Some of what are arguably the world's largest parallel supercomputers -- Google's cluster, for example -- are not government funded. Many others aren't government funded, they are built by companies that sell products to many entities, among them (perhaps) the government. Aerospace engineering companies all need supercomputers to do computational fluid dynamics on hull designs, for example. Ordinary engineering companies use them to do finite element analysis. Gaming clusters are by any sensible definition a highly parallelized, dynamically partitioning supercomputer.

      Ever since the invention of PVM and open source versions of MPI, anybody with a small pile of computational resources and a network has been able to implement a beowulf-style supercomputer built from them, an architecture so successful that nearly all of the supercomputers built in the world today are basically "beowulfs". I've helped a few dozen individuals (one at a time, not via my book or magazine articles) build beowulf clusters at home just to dink around with for fun, or to learn a new job skill, or to set up a learning cluster at a small community college or university. No government funding, often out of pocket funding or repurposing old computers that are lying around. Not all of these clusters could beat Moore's Law, which has inexorably eaten Amdahl's Lunch after a few years (that is, by the time they were built it was often the case that a single processor over the counter computer at the high end of clock and so on would beat the small cluster made of older systems) but there is no doubt that they were supercomputer architecture with substantial (but sublinear) speedup compared to single threaded execution times for a suitable parallelized chore.

      Besides, it is useful to remember that your cell phone would have been considered a munition a bit over a decade ago. A better thing to state is that everybody buys supercomputers because almost every processor based system from navigation systems in cars to cell phones to tablets to personal computers is, these days, a supercomputer. My i7 laptop has four cores, eight contexts, and exhibits linear speedup on in-cache embarrassingly parallel code out to eight simultaneous tasks because Intel has done a pretty amazing job of internally parallelizing the execution subsystems for the contexts. It beats the hell out of almost all of the small clusters I've ever built, including clusters with many, many more nodes. Build even a small stack of i7 systems on a gigabit or better network -- two, for example -- and you've got a sixteen core supercomputer with a complex communication topology (variable speeds and nonlinear thresholds, as the i7 does stop giving you the purely linear 8 way speedup for large enough tasks and drops down to a bit over four -- again an instance of "superlinear speedup" of parallel code even WITHOUT using a fancy tool if you simply count cores instead of contexts and ignore internal parallelism for certain kinds of code that permits a single core to be managing memory I/O for one task while executing the other).

      rgb

      --
      Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
    7. Re:Buzzword-heavy by Bengie · · Score: 2

      GPU cores are broken into groups. Each group must be doing the exact same instruction at the exact same time. Branches are horrible for performance as they will force some cores to stop computing altogether while waiting for the branch to finish.
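
      A toy model of that cost, in Python. The group size of 32 and the cycle counts are illustrative assumptions, and real hardware is more subtle than this; the sketch only captures the rule that a lock-step group pays for every side of a branch that any of its lanes takes.

```python
def warp_cycles(lane_conditions, then_cost, else_cost):
    """Cycles one lock-step group spends on an if/else (toy model).

    Every lane follows the same instruction stream, so if any lane takes
    a side of the branch, the whole group spends that side's cycles while
    the remaining lanes sit idle.
    """
    cycles = 0
    if any(lane_conditions):
        cycles += then_cost
    if not all(lane_conditions):
        cycles += else_cost
    return cycles


def utilization(lane_conditions, then_cost, else_cost):
    """Fraction of lane-cycles that do useful work."""
    lanes = len(lane_conditions)
    total = warp_cycles(lane_conditions, then_cost, else_cost) * lanes
    useful = sum(then_cost if c else else_cost for c in lane_conditions)
    return useful / total


uniform = [True] * 32                        # every lane takes the same path
divergent = [i % 2 == 0 for i in range(32)]  # lanes alternate paths

print(warp_cycles(uniform, 10, 10), utilization(uniform, 10, 10))
print(warp_cycles(divergent, 10, 10), utilization(divergent, 10, 10))
```

      In this model the divergent group takes twice as long and keeps only half its lanes busy, even though each lane does the same amount of work as before.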

      There are many concurrent algorithms that don't like to keep the execution path in perfect sync. This is where a many-core CPU will take out a GPU in performance. GPUs also have horrible random access and very small caches. Actually, the per core cache of GPUs has been going down over the years for both nVidia and AMD.

      GPUs are excellent for what they're good at, and horrible for everything else. If I could remember the link, there was a 100 GFLOPS Intel CPU kicking the crap out of a 2 TFLOPS nVidia GPU on transcoding, even though both code-paths were highly optimized for each architecture. It just so happened that the algorithm used did not play well with GPUs.

  2. SMBC by klapaucjusz · · Score: 2
    1. Re:SMBC by mysidia · · Score: 2

      then you can actually try guessing, and executing the next step on all possible outcomes of the previous step, then throwing away every result but one as the previous step completes.

      However... this requires power consumption, and it still does take time and tie up your infrastructure working on the 'guess'. Meanwhile, the previous step completes, and your CPUs are all still busy working on guessing the previous step, and you need additional sequential overhead to initiate and terminate the guessing process.

      You show 'CPU usage' on your 'idle cores', but it's 99% waste heat.
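
      Back-of-the-envelope version of that trade, as a Python sketch. The function name, the cost model, and every number below are assumptions chosen for illustration: one core runs the previous step while `outcomes` other cores each run the next step for one guessed result.

```python
def speculate(t_prev, t_next, outcomes, overhead):
    """Toy cost model for guessing every outcome of the previous step."""
    sequential_time = t_prev + t_next
    # Both steps overlap, plus the cost of starting and stopping the guesses.
    speculative_time = max(t_prev, t_next) + overhead
    useful_work = t_prev + t_next  # only one guess is kept
    total_work = t_prev + outcomes * t_next + overhead
    return {
        "speedup": sequential_time / speculative_time,
        "wasted_fraction": 1 - useful_work / total_work,
    }


result = speculate(t_prev=10, t_next=10, outcomes=100, overhead=1)
print(result)
```

      With these made-up numbers the wall-clock time drops by less than a factor of two while about 98% of the work done is thrown away.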

  3. Re:Xeon dream on by godrik · · Score: 3, Informative

    "Xeon Phi = unavailable vaporware"

    You know, I wrote a paper on SpMV for Xeon Phi and I got quite a lot of people from all over the world asking me for clarification and for code. So it seems to be quite widespread. You can actually buy some online, Google points to several vendors.

    "in order to discourage folks from porting big science applications to CUDA"

    There are two things wrong with this statement. First of all, I do not think scientists are discouraged from giving CUDA a shot. Just check any scientific conference and you'll see GPU and CUDA everywhere. Actually we see so much GPU programming that it is getting boring.
    Also, porting to CUDA is difficult and alien for most people. If we can get similar performance using a programming model people are used to, how is that not a good thing? What is so good about CUDA? It is just pretty much the only way to get good performance out of NVIDIA GPUs.

    The tradeoff between performance, hardware cost and developer cost is a difficult one. I say let's throw them all in the arena and see what stands.

    Disclaimer: my research is supported by both Intel and NVIDIA.

  4. Poor summary by Anonymous Coward · · Score: 5, Informative

    Amdahl's Law still stands. TFA is about changing the assumptions that Amdahl's Law is based on; instead of homogeneous parallel processing, you stick a few big grunty processors in for the serial components of your task, and a huge pile of basic processors for the embarrassingly parallel components. You're still limited by the fastest processing of non-parallel tasks, but by using a heterogeneous mix of processors you're not wasting CPU time (and thus power and money) leaving processors idle.
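
    A rough sketch of that argument in Python. The serial fraction, core speeds, and core counts are invented for illustration and are not DEEP's figures; `run_time` is a hypothetical helper that charges the serial part to one core and spreads the rest evenly.

```python
def run_time(serial_fraction, serial_speed, parallel_speed, parallel_cores):
    """Time for one unit of work: serial part on one core, rest spread out."""
    serial_time = serial_fraction / serial_speed
    parallel_time = (1 - serial_fraction) / (parallel_speed * parallel_cores)
    return serial_time + parallel_time


S = 0.1  # assumed serial fraction of the job

baseline = run_time(S, 1.0, 1.0, 1)        # one fast core does everything
many_slow = run_time(S, 0.25, 0.25, 256)   # homogeneous pile of slow cores
hetero = run_time(S, 1.0, 0.25, 256)       # fast core for serial, slow pile for the rest

print("homogeneous slow cores:", baseline / many_slow)
print("heterogeneous mix:     ", baseline / hetero)
print("Amdahl ceiling:        ", 1 / S)
```

    The mix gets much closer to the ceiling than the slow cores alone, but it never crosses it: the serial part still runs at the speed of the fastest single processor.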

  5. Understanding Amdahl's law by deadline · · Score: 2

    You can't cheat Amdahl's law any more than you can give birth in one month with nine women. The law is a rather simple idea similar to chemical kinetics, when you think about it, i.e. a rate-limiting step.

    If you are interested in a non-mathematical description of Amdahl's law have a look at http://www.clustermonkey.net/Parallel-Programming/parallel-computing-101-the-lawnmower-law.html

    --
    HPC for Primates. Read Cluster Monkey
  6. Repeat after me: by Mashdar · · Score: 4, Insightful

    Amdahl's Law only applies to individual algorithms. Amdahl's Law only applies to individual algorithms. Amdahl's Law only applies to individual algorithms.

    Besides which, Amdahl's law is an obvious truth unless you can make a process take negative time. All attempts to make Amdahl's Law sound fancy or complicated are a disservice. All attempts to pigeonhole Amdahl's Law into only applying to parallel design are a disservice. Any attempts to "revisit" it are either fallacious or focus on algorithm changes, which Amdahl made no attempt to address.

    Amdahl's law in a nutshell: If you spend 10% of your time on X and 90% of your time on Y, you will never get more than a 1/.9 speedup by optimizing X, even if you manage to make X instantaneous. Another way to put it is that if Y takes 9 seconds, you are never going to get the process under 9 seconds by modifying X...
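
    The same arithmetic as a small Python sketch. The function name and the 10-second total are assumptions added for illustration; the 10%/90% split is the one from the example above.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


total_seconds = 10.0  # assumed: X takes 1 second, Y takes 9 seconds

for factor in (2, 10, 1000, math.inf):
    speedup = amdahl_speedup(0.10, factor)
    print(factor, speedup, total_seconds / speedup)
```

    However large the factor gets, the speedup stays at or below 1/.9 and the run time stays at or above the 9 seconds spent on Y.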

  7. Not breaking Amdahl's law by sjames · · Score: 2

    This most certainly does NOT break Amdahl's law. It simply partitions the problem to use the cheap gear for the embarrassingly parallel portion of the workload and the expensive gear for the harder to parallelize workload.

    It necessarily cannot make a non-parallelizable portion (the serial part) run in parallel.

    Note that what part of the problem is serial depends on the hardware. The lower the latency and the higher the bandwidth of the interconnect, the more of the problem you can get to run effectively in parallel. However, there comes a point where the problem cannot be decomposed further. The atoms that remain after that may all be run at once, but the individual atom will run serially. No matter what you do, 5*(2+3) can go no faster than serially adding and then multiplying (yes, you could do two multiplications in parallel and then add, but you gain nothing for it).
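
    The 5*(2+3) example can be checked with a short Python sketch. The nested-tuple encoding and the `depth` and `evaluate` helpers are assumptions made for illustration; `depth` counts the longest chain of operations that must wait on each other.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def depth(expr):
    """Longest chain of dependent operations in a nested (op, left, right) tuple."""
    if not isinstance(expr, tuple):
        return 0
    _, left, right = expr
    return 1 + max(depth(left), depth(right))


def evaluate(expr):
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left), evaluate(right))


factored = ("*", 5, ("+", 2, 3))                # add, then multiply
distributed = ("+", ("*", 5, 2), ("*", 5, 3))   # two multiplies at once, then add

print(evaluate(factored), depth(factored))
print(evaluate(distributed), depth(distributed))
```

    Both forms give the same answer and both need two dependent steps, so the extra multiplier in the second form buys nothing.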

  8. Re:Xeon dream on by ImprovOmega · · Score: 2

    Optimizing CUDA is almost, but not quite, as arcane as optimizing assembly code by hand. It requires a deep knowledge of the underlying architecture: the addressing, the memory read patterns, the role of each of the tiers of memory and the cost of moving between tiers, the size restrictions on each buffer, and how to coalesce the whole mess into a coherent answer. I once got a 30% performance increase by offsetting the addressing on my memory buffers so that they didn't all start on 16-byte boundaries. It allowed the data to be read in parallel and avoided collisions from the different processes trying to access the same block at the same time. The problem is that most programmers aren't particularly hardware oriented, so CUDA comes with a steep learning curve if you want to do it well.
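
    A toy model of why staggering buffer start addresses can help, in Python. This is not the code described above and does not model any particular GPU: the bank count, bank width, buffer size, and worker count are all assumptions. It only counts how many simultaneous accesses land on the same memory bank.

```python
from collections import Counter

NUM_BANKS = 16   # assumed number of memory banks
BANK_WIDTH = 4   # assumed bytes served by one bank per access


def worst_conflict(addresses):
    """Largest number of simultaneous accesses that hit a single bank."""
    banks = Counter((addr // BANK_WIDTH) % NUM_BANKS for addr in addresses)
    return max(banks.values())


BUFFER_BYTES = 1024  # hypothetical per-worker buffer size
workers = range(16)

# Every buffer starts on the same kind of boundary: all first reads collide.
aligned = [w * BUFFER_BYTES for w in workers]
# Pad each buffer by one bank width so the starts fall on different banks.
staggered = [w * (BUFFER_BYTES + BANK_WIDTH) for w in workers]

print(worst_conflict(aligned), worst_conflict(staggered))
```

    In this model the aligned layout serializes all sixteen reads on one bank, while the staggered layout lets them proceed side by side.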