Slashdot Mirror


IEEE Says Multicore is Bad News For Supercomputers

Richard Kelleher writes "It seems the current design of multi-core processors is not good for the design of supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"

22 of 251 comments (clear)

  1. Time for vector processing again by suso · · Score: 5, Insightful

    Sounds like its time for supercomputers to go their own way again. I'd love to see some new technologies.

    1. Re:Time for vector processing again by jellomizer · · Score: 5, Interesting

      I've always felt there was something odd about the recent trend of Super Computers using common hardware. components. They have really loss their way in super computing by just making a beefed up PC and running a version of a common OS which could handle it. Or Clustering a bunch of PC's togeter. Multi-Core technology is good for desktop systems as it is meant to run a lot of relatively small apps Rarely taking advantage of more then 1 or 2 cores. per app.In other-words it allows Multi-Tasking without a penalty. We don't use super computers that way. We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days. Back in the early-mid 90's we had different processors for Desktop and Super Computers. Yes it was more expensive for the super computers but if you were going to pay millions of dollars for a super computer what the difference if you need to pay an additional $80,000 for more custom processors.

      --
      If something is so important that you feel the need to post it on the internet... It probably isn't that important.
    2. Re:Time for vector processing again by virtual_mps · · Score: 5, Insightful

      It's very simple. Intel & AMD spend about $6bn/year on R&D. The total supercomputing market is on the order of $35bn (out of a global IT market on the order of $1000bn) and a big chunk of that is spent on storage, people, software, etc., rather than processors. That market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay. Even if a company spent a huge amount of money developing a breakthrough architecture which dramatically outperformed existing hardware, the odds are that the commodity processors would catch up before that innovator recouped its development costs. Certainly they'd catch up before everyone rewrote their software to take advantage of the new architecture. The days when Seymour Cray could design a product which was cutting edge & saleable for a decade are long gone.

    3. Re:Time for vector processing again by AlpineR · · Score: 5, Insightful

      My supercomputing tasks are computation-limited. Multicores are great because each core shares memory and they save me the overhead of porting my simulations to distributed memory multiprocessor setups. I think a better summary of the study is:

      Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

    4. Re:Time for vector processing again by TapeCutter · · Score: 4, Informative

      "Multi-Core technology is good for desktop systems as it is meant to run a lot of relatively small apps Rarely taking advantage of more then 1 or 2 cores. per app.In other-words it allows Multi-Tasking without a penalty. We don't use super computers that way. We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days."

      Sorry but that's not entirely correct, most super computers work on highly parallel problems using numerical analysis techniques. By definition the problem is broken up into millions of smaller problems that make ideal "small apps", a common consequence is that the bandwidth of the communications between the 'small apps' becomes the limiting factor.

      "Back in the early-mid 90's we had different processors for Desktop and Super Computers."

      The earth simulator was refered to in some parts as 'computenick', it's speed jump over it's nearest rival and longevity at the top marked the renaissance of "vector processing" after it had been largely ignored during the 90's.

      In the end a supercomputer is a purpose built machine, if cores fit the purpose then they will be used.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
    5. Re:Time for vector processing again by knails · · Score: 5, Insightful

      No, proper spelling and grammar are important for everyone, not just english majors. With computers so important, if the computer professionals cannot use the language correctly, then who will? We cannot let ignorant people degrade the quality of language and therefore remove beauty and subtle distinctions between similar words just because they're too lazy to conform to standards. If a linguist misused/ignored computing standards, would you not correct them, even though it's not their chosen field of study?

      --
      "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
    6. Re:Time for vector processing again by postbigbang · · Score: 4, Insightful

      Look are deceptive.

      The problem with multicores relates to the fact that the cores are processors, but the relationship to other cores and to memory aren't fully 'cross-bar'. Sun did a multi-CPU architecture that's truly crossbar (meaning that there are no dirty cache problems and semaphor latencies) among the processors, but the machine was more of a technical achievement than a decent workhorse to use in day to day stuff.

      Still, cores are cores. More cores aren't better necessarily until you fix what they describe. And it doesn't matter what they look like at all. Like any other system, it's what's under the hood that count. Esoteric-looking shells are there for marketing purposes and cost-justification.

      --
      ---- Teach Peace. It's Cheaper Than War.
    7. Re:Time for vector processing again by necro81 · · Score: 5, Interesting

      A related problem to the speed of memory access is the energy efficiency of it. In an IEEE Spectrum Radio piece interviewing Peter Kogge, current supercomputers can spend many times more energy shuffling bits around than operating on them. Today's computer can do a double-precision (64-bit) floating point operation using about 100 picojoules. However, it takes upwards of 30 pJ per bit to get the 128 bits of data loaded into the floating point math unit of the CPU, and then moving the 64-bit result elsewhere.

      Actual math operations consume 5-10% of a supercomputer's total power, moving data from A to B is approaching 50%. Most optimization and innovation in the past few decades has gone into compute algorithms in the CPU core, and very little has gone into memory.

    8. Re:Time for vector processing again by hey! · · Score: 5, Interesting

      It may be true that "That market simply isn't large enough to support an R&D which will consistently outperform commodity hardware at a price people are willing to pay," that's not quite tantamount to saying "there is no possible rational justification for a larger supercomputer budget." There are considerable inflection points and external factors to consider.

      The market doesn't allocate funds the way a central planner does. A central planner says, "there isn't room in this budget to add to supercomputer R&D." The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes a long with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of that.

      There may also be reasons for public investment in R&D. Naturally the public has no reason to invest in commodity hardware research, but it may have reason to look at exotic computing research. Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability you could predict a hurricane's track with several times the precision you could today. It'd be quite reasonable to put a fair amount of public research funds into supercomputing in order to have the that ability in five to ten years' time.

      --
      Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
    9. Re:Time for vector processing again by knewter · · Score: 5, Funny

      Hey dipshit. When you mock someone's grammar, you'd sure as fuck better not mis-spell 'apostrophe'

      Idiot.

      I'll paste it a few times so you can look at your grotesque failure more:

      aprostrophe
      aprostrophe
      aprostrophe
      aprostrophe

      See how stupid that looks?

      --
      -knewter
    10. Re:Time for vector processing again by lysergic.acid · · Score: 4, Interesting

      well, supercomputing has always been about maximizing system performance through parallelism, which can only be done in three main ways: instruction level parallelism, thread level parallelism, and data parallelism.

      ILM can be achieved through instruction pipelining, which means breaking down instructions into multiple stages so that CPU modules can work in parallel and reduce idle time. for instance, in a RISC pipeline you break an instruction down into 5 operations:

      1. instruction fetch
      2. instruction decode / register fetch
      3. instruction execute
      4. memory access
      5. register write-back

      so while the first instruction is still in the decode stage the CPU is already fetching a second instruction. thus if fully-pipelined there are no stalls or wasted idle time, and a new instruction is loaded every clock cycle, resulting in a maximum of 5 parallel instructions being processed simultaneously.

      then there are superscalar processors, which have redundant functional units--for instance, multiple ALUs, FPUs, or SIMD (vector processing) units. and if each of these functional units are also pipelined, then the result is a processor with an execution rate far in excess of one instruction per cycle.

      thread level parallelism OTOH is achieved through multiprocessing (SMP, ASMP, NUMA, etc.) or multithreading. this is where multicore and multiprocessor systems come in handy. multithreading is generally cheaper to achieve than multiprocessing since fewer processor components need to be replicated.

      lastly, there's data level parallelism, which is achieved in the form of SIMD (Single Instruction, Multiple Data) vector processors. this type of parallelism, which originated from supercomputing, is especially useful for multimedia applications, scientific research, engineering tasks, cryptography, and data processing/compression, where the same operation needs to be applied to large sets of data. most modern CPUs have some kind of SWAR (SIMD Within A Register) instruction set extension like MMX, 3DNow!, SSE, AltiVec, but these are of limited utility compared to highly specialized dedicated vector processors like GPUs, array processors, DSPs, and stream processors (GPGPU).

  2. Well doh by Kjella · · Score: 4, Insightful

    If you make a simulation like that keeping the memory interface constant then of course you'll see diminishing returns. That's why we're still not running plain old FSBs as AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores and I'm sure something similar can and will be built as a supercomputer backplane. The header is more than a little sensationalist...

    --
    Live today, because you never know what tomorrow brings
    1. Re:Well doh by cheater512 · · Score: 4, Insightful

      There are limits however to what you can do.
      Its not like multi-processor systems where each cpu gets its own ram.

    2. Re:Well doh by Targon · · Score: 4, Interesting

      You have failed to notice that AMD is already on top of this and can add more memory channels to their processors as needed for the application. This may increase the number of pins the processor has, but that is to be expected.

      You may not have noticed, but there is a difference between AMD Opteron and Phenom processors beyond just the price. The base CPU design may be the same, but AMD and Intel can make special versions of their chips for the supercomputer market and have them work well.

      In a worst case, with the support from AMD or Intel, a new CPU with extra pins(and an increased die size) could add as many channels of memory support as required for the application. This is another area where spinning off the fab business might come in handy.

      And yes, this might be a bit expensive, but have you seen the price of a supercomputer?

  3. Yeah! by crhylove · · Score: 5, Funny

    Once we get to 32 or 64 core cpus that cost less than $100 (say, five years), I'd HATE to have a beowulf cluster of those!

    --
    I hold very few opinions. I hold information based on observation and fact. If you wish to disagree, please use facts.
  4. It's so obvious... by Alwin+Henseler · · Score: 4, Interesting

    That to remove the 'memory wall', main memory and CPU will have to be integrated.

    I mean, look at general-purpose computing systems past & present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz. system with a GB. RAM? Ever seen a GHz. CPU coupled with a single KB. of RAM? Why not? Because with very few exceptions, heavier compute loads also require more memory space.

    Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication, should be the parts closest together. Instead of doubling nummber of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up in little sections, with each section on top of a particular CPU core. I read sometime that semiconductor processes that are suitable for CPU's, aren't that good for memory chips (and vice versa) - don't know if that's true but if so, let the engineers figure that out.

    Ofcourse things are different with supercomputers. If you have a 1000 'processing units', where each PU would consist of say, 32 cores and some GB's RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, main memory that is part of other PU's would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

    Perhaps the trick would be to make access to memory found on one of the other PU's transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory. With some intelligent routing to migrate blocks of data closer towards the core(s) that access it? Maybe that could be done in hardware, maybe that's better done on a software level. Either way: the technology isn't the problem, it's an architectural / software problem.

  5. Memory by Detritus · · Score: 4, Insightful

    I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.

    --
    Mea navis aericumbens anguillis abundat
  6. Multiple CPUs? by Dan+East · · Score: 4, Insightful

    This doesn't quite make sense to me. You wouldn't replace a 64 CPU supercomputer with a single 64 core CPU, but would instead use 64 multicore CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single core CPUs of old. So eventually you'll get 4 cores from the price of 2, then get 8 cores from the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.

    --
    Better known as 318230.
  7. The problem allegedly being.. by Junta · · Score: 4, Informative

    For a given node count, we've seen increases in performance. The claimed problem is that for the workloads that concern these researchers, they don't see people mentioning significant enhancements to the fundamental memory architecture projected to follow the scale at which multi-core systems go. So you buy a 16 core chip system to upgrade your quad-core based system and hypothetically gain little despite the expense. Power efficencies drop and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't lower if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that impressive.

    All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. As memory architecture for the mass market is not considered as intrinsically exciting, I would wager there will be advancements that no one speaks to. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (like by a factor of 2). It means if Shanghai parts are considered satisfactory today to get respectable yield memory wise to support four cores, Nehalem, by a particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but the one metric is a highly important one in the supercomputer field.

    For all the worry over memory bandwidth though, it hasn't stopped supercomputer purchasers from buying into Core2 all this time. Despite improvements in their chipset, Intel Core2 still doesn't reach AMD performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core2 in general. Sure, Cray and IBM supercomputers in the Top2 used AMD, but from the time of its release, Core2 has decimated AMD supercomputer market share despite an inferior memory architecture.

    --
    XML is like violence. If it doesn't solve the problem, use more.
  8. Re:but.. by David+Gerard · · Score: 4, Funny

    Only for Office 2007.

    --
    http://rocknerd.co.uk
  9. Unganged channels = already non shared lanes today by DrYak · · Score: 5, Insightful

    The issue is with a single processor that has multiple cores.
    There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.

    No, sorry. That's how Phenom processor are *Already* working.

    Each physical CPU package has two 64-bit memory controllers, each controlling a separate bank of 64bits DDR-2 memory chips. (Each of the two bank in a dual channel mother board).

    Phenom have two mode of function :
    - Ganged : both memory controllers work in parallel, working as if they were a huge 128bits memory connection. That's how dual channel has worked since it was invented.
    That's good for system running few very bandwidth-hungry applications (for example : benchmarks)

    - Unganged : each memory controller work on its own. Thus you have two completely separate 64bits memory channel accessible at the same time. By correctly lying the applications in memory thanks to a NUMA-aware OS (anything better than Windows Vista), that means that two separate applications can simultaneously access each one's memory at the exact same moment, although at only half the bandwith *per process* (but still the same total of bandwidth for all processes running at the same time on a multi core chip).
    This is perfect for systems running lots of tasks in parallel, and is the default mode on most BIOSes I've seen.

    This gives a tremendous boost to heavily multi-tasked applications (a busy database server, for example), and it's what TFA's author are looking for.

    Probably that at some point in the future, Intel will follow the same trend with its QPI processors.

    Also, the future trend is to multiply the memory channels on the CPU: Intel has already planned Triple Channel DDR-3 for their high-end server Xeons (the first crop of QPI chips). AMD has announced 4 memory channels for their future 6- and 12- core chips targeting the G34 socket.

    So the net effect of Unganged Dual Channel is that today you already have 4 cores having a choice of 2 sets of memory lanes to choose among, and within 1 year, you'll have 6-to-12 cores sharing 4 sets of memory lanes.

    By the time you reach 32 cores on CPU, probably that almost each slot will have its own dedicated memory channel (probably with the help of some technology which communicates serially with fewer lines, like FB-DIMM). Or even weirder memory interfaces (who knows ? maybe DDR-6 will be able to give several simultaneous access to the same memory module).

    So, well, once again, it proves that running stupid simulations without taking into account that other technologies will improves beside the number of cores* yields stupid non realistic results.

    Shame on TFA's Author, because the trends to increase bandwith have already started. I little bit more background research would have avoided this kind of stupidity.
    But on the other hand, they would have missed the opportunity to publish an alarmist article with an eye catching title.

    --

    *: Although, yes, the number of cores you can slap inside the same package seems to be the "new megahertz" in the manufacturers' race, with some like Intel trying to increase this number faster without putting so much efforts on the rest.

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  10. Re:Kill all engineering then! by Shamenaught · · Score: 4, Insightful

    The phrase "By logical extension" is just another way of saying "This is a straw man argument"

    I believe that the point he was making was not that it's pointless to go beyond X86 hardware, but that it's more cost-effective to use consumer hardware. Consumer hardware is not necessarily X86 hardware. See IBM's Roadrunner, presently the fastest supercomputer in the world, which uses an advanced version of the PS3's processor (the PowerXCell 8i).

    In time, we'll probably see demand in consumer hardware for breaking past the boundaries and bottlenecks of multi-core processing, and so supercomputers will follow.

    --
    mysql> SELECT * FROM `places` WHERE `place` LIKE 'home`; Empty set (0.00 sec)