IEEE Says Multicore is Bad News For Supercomputers

Posted by timothy on Friday December 5, 2008 @12:04AM from the unexpected-downsides dept.

Richard Kelleher writes "It seems the current design of multi-core processors is not good for the design of supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"

Time for vector processing again by suso · 2008-12-05 00:06 · Score: 5, Insightful

Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.
1. Re:Time for vector processing again by jellomizer · 2008-12-05 00:40 · Score: 5, Interesting
  
  I've always felt there was something odd about the recent trend of supercomputers using common hardware components. They have really lost their way in supercomputing by just making a beefed-up PC and running a version of a common OS that can handle it, or by clustering a bunch of PCs together. Multi-core technology is good for desktop systems, as it is meant to run a lot of relatively small apps, rarely taking advantage of more than 1 or 2 cores per app. In other words, it allows multitasking without a penalty. We don't use supercomputers that way. We use them to run one app that takes huge resources, one that would take hours or years on your PC, and spit out results in seconds or days. Back in the early-to-mid '90s we had different processors for desktops and supercomputers. Yes, it was more expensive for the supercomputers, but if you are going to pay millions of dollars for a supercomputer, what's the difference if you need to pay an additional $80,000 for more custom processors?
  
  --
  If something is so important that you feel the need to post it on the internet... It probably isn't that important.
2. Re:Time for vector processing again by suso · 2008-12-05 00:49 · Score: 1
  
  Yes I agree and have the same odd feeling. The first time I read an article where I think Los Alamos was ordering a supercomputer with 8192 Pentium Pro processors in it, I was like WTF?
  I missed the days when super computers looked like alien technology or Raiders of the Lost Ark.
3. Re:Time for vector processing again by virtual_mps · 2008-12-05 01:07 · Score: 5, Insightful
  
  It's very simple. Intel & AMD spend about $6bn/year on R&D. The total supercomputing market is on the order of $35bn (out of a global IT market on the order of $1000bn) and a big chunk of that is spent on storage, people, software, etc., rather than processors. That market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay. Even if a company spent a huge amount of money developing a breakthrough architecture which dramatically outperformed existing hardware, the odds are that the commodity processors would catch up before that innovator recouped its development costs. Certainly they'd catch up before everyone rewrote their software to take advantage of the new architecture. The days when Seymour Cray could design a product which was cutting edge & saleable for a decade are long gone.
4. Re:Time for vector processing again by Anonymous Coward · 2008-12-05 01:17 · Score: 1, Funny
  
  I missed the days when super computers looked like alien technology or Raiders of the Lost Ark.
  How about supercomputers that look like alien technology from Kingdom of the Crystal Skull?
  George Lucas and Steven Spielberg raped supercomputing!
5. Re:Time for vector processing again by Retric · 2008-12-05 01:17 · Score: 3, Informative
  
  Modern CPUs have 8+ megabytes of L2/L3 cache on chip, so RAM is only a problem when your working set is larger than that. The problem supercomputing folks are having is that they want to solve problems that don't really fit in L3 cache, which creates significant problems, but they still need a large cache. However, because of speed-of-light issues, off-chip RAM is always going to be high latency, so you need to use some type of on-chip cache or stream lots of data to the chip.
  
  There are really only 2 options for modern systems when it comes to memory: you can have lots of cores and a tiny cache, like GPUs, or lots of cache and fewer cores, like CPUs (ignoring type-of-core issues, on-chip interconnects, etc.). So there is little advantage to paying 10x per chip to go custom vs. using more, cheaper chips when they can build supercomputers out of CPUs, GPUs, or something between them like the Cell processor.
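
Retric's point about working sets and cache can be put into rough numbers with the standard average-memory-access-time formula. This is only a sketch: the latencies and miss rates below are hypothetical values picked to make the arithmetic easy to follow, not measurements of any real chip.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected cost of a miss."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical latencies, chosen only for illustration.
CACHE_HIT_NS = 10.0
OFF_CHIP_PENALTY_NS = 100.0

# Working set mostly fits in cache: 1% of accesses go off chip.
fits_in_cache = average_access_time_ns(CACHE_HIT_NS, 0.01, OFF_CHIP_PENALTY_NS)

# Working set far larger than the cache: half of the accesses go off chip.
spills_to_ram = average_access_time_ns(CACHE_HIT_NS, 0.50, OFF_CHIP_PENALTY_NS)

print(f"fits in cache: {fits_in_cache:.1f} ns per access")
print(f"spills to RAM: {spills_to_ram:.1f} ns per access")
```

Under these assumed numbers the average access gets several times slower once the working set no longer fits, which is the situation the comment describes for supercomputing workloads.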
6. Re:Time for vector processing again by AlpineR · 2008-12-05 01:18 · Score: 5, Insightful
  
  My supercomputing tasks are computation-limited. Multicores are great because each core shares memory and they save me the overhead of porting my simulations to distributed memory multiprocessor setups. I think a better summary of the study is:
  Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.
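
AlpineR's two-sentence summary matches what is usually called the roofline model: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the work done per byte moved. The sketch below uses that model with made-up machine numbers (100 GFLOP/s, 25 GB/s); the function names are hypothetical and nothing here comes from the Sandia study itself.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def limiting_factor(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Name which roof the kernel hits first."""
    if bandwidth_gb_s * flops_per_byte < peak_gflops:
        return "communication"
    return "computation"


# Hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 25.0

# A streaming kernel doing 0.25 flops per byte is communication-limited,
# so doubling the compute peak changes nothing.
stream_now = attainable_gflops(PEAK, BANDWIDTH, 0.25)
stream_faster_cpu = attainable_gflops(2 * PEAK, BANDWIDTH, 0.25)

# A dense kernel doing 16 flops per byte is computation-limited,
# so doubling the memory bandwidth changes nothing.
dense_now = attainable_gflops(PEAK, BANDWIDTH, 16.0)
dense_faster_memory = attainable_gflops(PEAK, 2 * BANDWIDTH, 16.0)

print(limiting_factor(PEAK, BANDWIDTH, 0.25), stream_now, stream_faster_cpu)
print(limiting_factor(PEAK, BANDWIDTH, 16.0), dense_now, dense_faster_memory)
```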
7. Re:Time for vector processing again by Timothy+Brownawell · 2008-12-05 01:28 · Score: 1
  
  If they can make superconducting FETs that can be manufactured on ICs, I could see there being a very big difference that will last until they can reach liquid nitrogen temperatures (at which point it goes mainstream and cryogenics turns into a boom industry for a while).
8. Re:Time for vector processing again by timeOday · 2008-12-05 01:44 · Score: 3, Insightful
  
  IMHO this study is not an indictment against the use of today's multi-core processors for supercomputers or anything else. They're simply pointing out that in the future (as core counts continue to grow exponentially) some memory bandwidth advances will be needed. The implication that today's multi-core processors are best suited for games is silly; where they're really well utilized is in servers, and they work very well. The move towards commodity processors in supercomputing wasn't some kind of accident; it occurred because that's what currently gets the best results. I'd expect a renaissance in true supercomputing just as soon as it's justified, but I wouldn't hold my breath.
9. Re:Time for vector processing again by David+Gerard · 2008-12-05 01:53 · Score: 2, Interesting
  
  I eagerly await the Slashdot story about an Apple laptop with liquid nitrogen cooling. Probably Alienware will do it first.
  
  --
  http://rocknerd.co.uk
10. Re:Time for vector processing again by IceCreamGuy · 2008-12-05 01:57 · Score: 1
  
  Maybe they should all just be simulated at Sandia!
11. Re:Time for vector processing again by yttrstein · 2008-12-05 02:04 · Score: 3, Informative
  
  We still have different processors for desktops and supercomputers.
  
  http://www.cray.com/products/XMT.aspx
  
  Rest assured, there are still people who know how to build them. They're just not quite as popular as they used to be, now that a middle manager who has no idea what the hell they're talking about can go to an upper manager with a spec sheet that's got 8 thousand processors on it and say "look! This one's got a whole ton more processors than that dumb Cray thing!"
12. Re:Time for vector processing again by TapeCutter · 2008-12-05 02:04 · Score: 4, Informative
  
  "Multi-core technology is good for desktop systems, as it is meant to run a lot of relatively small apps, rarely taking advantage of more than 1 or 2 cores per app. In other words, it allows multitasking without a penalty. We don't use supercomputers that way. We use them to run one app that takes huge resources, one that would take hours or years on your PC, and spit out results in seconds or days."
  
  Sorry, but that's not entirely correct: most supercomputers work on highly parallel problems using numerical analysis techniques. By definition the problem is broken up into millions of smaller problems that make ideal "small apps"; a common consequence is that the bandwidth of the communications between the 'small apps' becomes the limiting factor.
  
  "Back in the early-to-mid '90s we had different processors for desktops and supercomputers."
  
  The Earth Simulator was referred to in some parts as 'Computenik'; its speed jump over its nearest rival and its longevity at the top marked the renaissance of "vector processing" after it had been largely ignored during the '90s.
  
  In the end a supercomputer is a purpose-built machine; if cores fit the purpose then they will be used.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
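
TapeCutter's remark that communication between the 'small apps' becomes the limiting factor can be illustrated with a surface-to-volume estimate for a regular 3D grid. This is a sketch under stated assumptions: the grid is cubic, it is split into equal cubic subdomains, every subdomain exchanges a one-cell-deep layer on all six faces, and the function name is hypothetical.

```python
def halo_to_interior_ratio(cells_per_side, ranks_per_side):
    """Cells exchanged with neighbours divided by cells computed, per subdomain."""
    if cells_per_side % ranks_per_side != 0:
        raise ValueError("the grid must divide evenly among the ranks")
    side = cells_per_side // ranks_per_side
    interior = side ** 3      # cells each rank computes
    halo = 6 * side ** 2      # one layer of cells on each of the six faces
    return halo / interior


# A 1024^3 grid split into progressively more, smaller pieces.
for ranks_per_side in (1, 8, 64):
    total_ranks = ranks_per_side ** 3
    ratio = halo_to_interior_ratio(1024, ranks_per_side)
    print(f"{total_ranks:>7} ranks: {ratio:.4f} exchanged cells per computed cell")
```

The ratio works out to 6 divided by the subdomain side length, so cutting the same problem into more pieces steadily raises the communication share of the work.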
13. Re:Time for vector processing again by LingNoi · 2008-12-05 02:14 · Score: 2, Insightful
  
  This is slashdot, our professions are computer related not literature based. You're on the wrong website.
14. Re:Time for vector processing again by Anonymous Coward · 2008-12-05 02:27 · Score: 2, Funny
  
  Because I'm sure inserting a random apostrophe into your code would make it run just fine...
15. Re:Time for vector processing again by knails · 2008-12-05 02:34 · Score: 5, Insightful
  
  No, proper spelling and grammar are important for everyone, not just English majors. With computers so important, if the computer professionals cannot use the language correctly, then who will? We cannot let ignorant people degrade the quality of language and therefore remove beauty and subtle distinctions between similar words just because they're too lazy to conform to standards. If a linguist misused/ignored computing standards, would you not correct them, even though it's not their chosen field of study?
  
  --
  "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
16. Re:Time for vector processing again by postbigbang · 2008-12-05 02:37 · Score: 4, Insightful
  
  Looks are deceptive.
  The problem with multicores relates to the fact that the cores are processors, but their connections to other cores and to memory aren't fully 'crossbar'. Sun did a multi-CPU architecture that's truly crossbar (meaning that there are no dirty cache problems and semaphore latencies) among the processors, but the machine was more of a technical achievement than a decent workhorse to use in day-to-day stuff.
  Still, cores are cores. More cores aren't necessarily better until you fix what they describe. And it doesn't matter what they look like at all. Like any other system, it's what's under the hood that counts. Esoteric-looking shells are there for marketing purposes and cost-justification.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
17. Re:Time for vector processing again by Vellmont · 2008-12-05 02:39 · Score: 1
  
  Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.
  
  I thought the same thing. Years ago with the massively-parallel architectures you could have said that massively-parallel architectures don't help inherently serial tasks.
  The other thing I wonder is how server and desktop tasks will drive the multi-core architecture. It may be the case that many of the common server and desktop tasks have massive I/O needs (gaming?). The current memory architectures aren't set in stone, but I also doubt they'll be driven by what supercomputer customers need.
  
  --
  AccountKiller
18. Re:Time for vector processing again by Johnny_Longtorso · 2008-12-05 02:39 · Score: 1
  
  You seem to have missed the point of COTS entirely. And you're way off the mark on the price differential - unless you're talking about an 8 CPU "supercomputer".
  The whole reason HPTC bled down to COTS product was the outrageous costs of more proprietary hardware AND the fact that COTS product performance and reliability were on a massive upswing.
  I work with LOTS of customers using HPTC - and very, very few of them are still running a single application. It's the nature of growth & development - there are older apps and there are newer apps.
  There's a 235 node 10G InfiniBand connected HP-based supercomputer humming right behind me as I type this....
  
  --
  Even casual involvement excludes total freedom by it's inherent nature. John Valby
19. Re:Time for vector processing again by necro81 · 2008-12-05 02:41 · Score: 5, Interesting
  
  A problem related to the speed of memory access is its energy efficiency. In an IEEE Spectrum Radio piece, Peter Kogge explained that current supercomputers can spend many times more energy shuffling bits around than operating on them. Today's computers can do a double-precision (64-bit) floating-point operation using about 100 picojoules. However, it takes upwards of 30 pJ per bit to get the 128 bits of data loaded into the floating-point math unit of the CPU and then move the 64-bit result elsewhere.
  Actual math operations consume 5-10% of a supercomputer's total power, while moving data from A to B is approaching 50%. Most optimization and innovation in the past few decades has gone into compute algorithms in the CPU core, and very little has gone into memory.
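
The per-operation figures necro81 quotes can be multiplied out directly. This sketch takes the numbers from the comment at face value and assumes the 30 pJ per bit applies both to the 128 operand bits and to the 64 result bits; since the comment says "upwards of", treat the result as an upper-end illustration. The constant names are mine.

```python
# Figures as quoted in the comment above.
FLOP_ENERGY_PJ = 100          # one double-precision floating-point operation
MOVE_ENERGY_PJ_PER_BIT = 30   # "upwards of" this much per bit moved
OPERAND_BITS = 128            # two 64-bit inputs
RESULT_BITS = 64              # one 64-bit output

# Assumption: the per-bit cost is charged for inputs and output alike.
movement_pj = MOVE_ENERGY_PJ_PER_BIT * (OPERAND_BITS + RESULT_BITS)
ratio = movement_pj / FLOP_ENERGY_PJ

print(f"data movement: {movement_pj} pJ, arithmetic: {FLOP_ENERGY_PJ} pJ")
print(f"moving the data costs about {ratio:.1f}x the operation itself")
```

On these assumptions a single operation that has to fetch both operands and store its result spends 5760 pJ on movement against 100 pJ on arithmetic.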
20. Re:Time for vector processing again by Methuselah2 · 2008-12-05 02:53 · Score: 2, Funny
  
  That does it...I'm not buying a supercomputer this Christmas!
21. Re:Time for vector processing again by X0563511 · 2008-12-05 02:54 · Score: 1
  
  It's a shame we don't compile written word. ...
  Programming is not literature, it's machine instructions.
  
  --
  For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
22. Re:Time for vector processing again by Pharmboy · 2008-12-05 02:54 · Score: 2, Informative
  
  Yes, that is what I want, a super computer designed by an English major...
  Please get over yourself. This is slashdot, not something important like a resume or will.
  
  --
  Tequila: It's not just for breakfast anymore!
23. Re:Time for vector processing again by DiegoBravo · 2008-12-05 02:56 · Score: 1
  
  >> I'd love to see some new technologies.
  Yeah, It would be nice to see that Quantum Computing ( http://en.wikipedia.org/wiki/Quantum_Computer ) finally adds a couple of arbitrary integers. Despite the many publications in the subject, it smells like the superstrings theory of computing. Hope that's not the case.
24. Re:Time for vector processing again by ipoverscsi · 2008-12-05 02:58 · Score: 2, Insightful
  
  Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.
  Computation is communication. It's communication between the CPU and memory.
  The problem with multicore is that, as you add more cores, the increased bus contention causes the cores to stall so that they cannot compute. This is why many real supercomputers have memory local to each CPU. Cache memory can help, but just adding more cache per core yields diminishing returns. SMP will only get you so far in the supercomputer world. You have to go NUMA for performance, which means custom code and algorithms.
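
The stall that ipoverscsi describes can be sketched with a very simple shared-bus model: every core needs a fixed memory bandwidth to stay busy, and once the combined demand exceeds what the bus supplies, the extra cores just wait. The numbers (4 GB/s per core, a 32 GB/s bus) and the function name are hypothetical, and the model ignores caches and any extra contention overhead, so it shows performance levelling off but not the decline the article also mentions.

```python
def aggregate_throughput(cores, per_core_demand_gb_s, bus_bandwidth_gb_s):
    """Total useful work, measured in units of one fully fed core."""
    total_demand = cores * per_core_demand_gb_s
    if total_demand <= bus_bandwidth_gb_s:
        return float(cores)                       # every core is kept busy
    return bus_bandwidth_gb_s / per_core_demand_gb_s  # the bus is saturated


# Hypothetical chip family sharing one 32 GB/s memory bus.
for cores in (4, 8, 16, 32):
    work = aggregate_throughput(cores, 4.0, 32.0)
    print(f"{cores:>2} cores -> throughput of {work:.0f} fully fed cores")
```

Giving each processor its own local memory, as in the NUMA designs the comment mentions, effectively raises the bandwidth term along with the core count.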
25. Re:Time for vector processing again by hey! · 2008-12-05 03:02 · Score: 5, Interesting
  
  It may be true that "that market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay," but that's not quite tantamount to saying "there is no possible rational justification for a larger supercomputer budget." There are considerable inflection points and external factors to consider.
  The market doesn't allocate funds the way a central planner does. A central planner says, "there isn't room in this budget to add to supercomputer R&D." The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes along with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of that.
  There may also be reasons for public investment in R&D. Naturally the public has no reason to invest in commodity hardware research, but it may have reason to look at exotic computing research. Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability you could predict a hurricane's track with several times the precision you could today. It'd be quite reasonable to put a fair amount of public research funds into supercomputing in order to have that ability in five to ten years' time.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
26. Re:Time for vector processing again by knewter · 2008-12-05 03:04 · Score: 5, Funny
  
  Hey dipshit. When you mock someone's grammar, you'd sure as fuck better not mis-spell 'apostrophe'
  Idiot.
  I'll paste it a few times so you can look at your grotesque failure more:
  aprostrophe
  aprostrophe
  aprostrophe
  aprostrophe
  See how stupid that looks?
  
  --
  -knewter
27. Re:Time for vector processing again by Dishevel · 2008-12-05 03:07 · Score: 2, Insightful
  
  I do not study literature. I do not like those that do. Come on though. Knowing the difference between adding an "s" in a plural or possessive situation is truly basic. If you want to sound like a complete idiot then don't mangle true English. Just speak Ebonics.
  
  --
  Why is it so hard to only have politicians for a few years, then have them go away?
28. Re:Time for vector processing again by knails · 2008-12-05 03:22 · Score: 3, Insightful
  
  Who said anything about a supercomputer?
  
  Language is a tool, and everyone who uses the tool needs to use it properly. HTML is a tool, and there are proper use standards for it. Some, however, choose not to use those standards, and it only makes a mess for everyone else who does use them. If you're going to use a tool, you need to learn to use it correctly; language is no exception.
  
  --
  "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
29. Re:Time for vector processing again by TheRaven64 · 2008-12-05 03:26 · Score: 1
  
  The same is true of some of IBM's offerings. They still run Linux on PowerPC, but they are just using Linux as an I/O scheduler and the PowerPC chips as I/O controllers.
  
  --
  I am TheRaven on Soylent News
30. Re:Time for vector processing again by Zebra_X · 2008-12-05 03:35 · Score: 2, Insightful
  
  Yeah, if you buy Intel chips. Despite being slower clock for clock than the new Intel chips, AMD's architecture was and is the way to go, which is of course why Intel has copied it (i7). If you properly architect the chips to contain all of the "proper" plumbing, then this becomes less of a problem. Unfortunately, Intel has for the past few years simply cobbled together "cores" that are nothing more than processors linked via a partially adequate bus. So when contention goes up they don't perform as well. Most users don't ever consistently utilize their CPU at 80%, so this hasn't really been a problem for the market at large. This is why AMD's solutions have scaled further and for less. As a result, companies like Cray have been utilizing Opteron chips for their newest supercomputers.
31. Re:Time for vector processing again by Waffle+Iron · 2008-12-05 03:55 · Score: 1
  
  Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.
  While supercomputers might have to come up with unique architectures again, vector processing isn't it. The issue here is the total bandwidth to a single shared view of a large amount of memory. Off-the-shelf PC cores with their SIMD units are already too fast for the available memory bandwidth; swapping those cores out for vector units won't do anything to solve the problem. (Especially if you were to go back to the original Cray approach of streaming vectors from main memory with no caches, which would just compound the problem.)
  What's really needed is big advances in memory architecture.
32. Re:Time for vector processing again by Anonymous Coward · 2008-12-05 03:57 · Score: 1, Funny
  
  How many times did you spell-check your post?
33. Re:Time for vector processing again by meson2439 · 2008-12-05 04:22 · Score: 1
  
  Lol... The spellchecker guy mis-spelled his own spelling. English grammar is a beast that contradicts itself. Unlike programming, where the rules are constant, English grammar is a variable that changes with time, location and context. Not very reliable, I would say. As a tool, English fails.
34. Re:Time for vector processing again by frieko · 2008-12-05 04:40 · Score: 2, Interesting
  
  I think the solution here is to go a bit more fine-grained when defining the "commodity". This seems to be what IBM is doing. Their current strategy is, "we have a sweet-ass core design, you're welcome to slap it on whatever chip you can dream up."
  
  Thus the "commodity" is the IP design, not the finished chip. If everybody else is doing a chip with 128 cores and one interconnect, they'll be happy to fab you a chip with one core and 128 interconnects.
35. Re:Time for vector processing again by David+Greene · 2008-12-05 04:45 · Score: 3, Informative
  
  Cray did not stream vectors from memory. One of the advances of the Cray-1 was the use of vector registers as opposed to, for example, the Burroughs machines which streamed vectors directly to/from memory.
  We know how to build memory systems that can handle large vectors. Both the Cray X1 and Earth Simulator demonstrate that. The problem is that those memory systems are currently too expensive. We are going to see more and more vector processing in commodity processors.
36. Re:Time for vector processing again by David+Greene · 2008-12-05 04:53 · Score: 1
  
  I need to correct a few things about your post.
  Clusters are not supercomputers. Supercomputing is not only or even primarily about processor technology. The network and memory architecture have a larger impact. Supercomputer vendors are doing plenty of architecture innovation. They're doing it in the memory, network and I/O layers.
  Supercomputers do not exist mainly to run single large apps. They are batch machines running multiple jobs simultaneously.
  Supercomputer customers are not willing to pay a premium for custom processors because CPU efficiency is not the most important thing. Memory and network performance is. That's why the article focuses on the memory bandwidth problems of many-core CPUs.
  What we're seeing with multi- and many-core processing is the continued commoditization of supercomputer technology. It started with the micros' adoption of strong scalar performance. The Cray-1 was not outstanding because it had vector processing. It was outstanding because it was the fastest scalar processor in the world at the time. Then the micros adopted vector processing (albeit with very short vectors, but that will change with Sandy Bridge and Haswell). Then they adopted multithreading. Now they're embracing multiprocessing.
  It's the nature of the business that technology trickles down from the high end to the consumer. If you look at the timeline of processor development, commodity CPUs have gained supercomputer features about 20-30 years after their first appearance in high-end computing. What hasn't happened is commodity adoption of high-end memory, network and I/O systems because the home user does not need them.
  
  --
37. Re:Time for vector processing again by flaming-opus · 2008-12-05 04:53 · Score: 1
  
  Back in the '90s, there were custom supercomputer processors (both vector and scalar) that were faster than desktop processors for all supercomputing tasks. This hit a wall as the desktop processors became faster than the custom processors, at least for some tasks. If you can get a processor that's faster for some tasks and slower for others, but costs 1/10th the price of the other, you're probably going to go with the cheap one. The world has petaflop computers because of the move to commodity parts. No one could afford to build 160,000-processor systems from YMP processors.
  btw, multi-cores are pretty terrible for desktop applications. They really excel for server transaction processing, but most desktop users haven't any use for more than 2 cores. A radical shift in programming is going to be needed before massively multi-core processors are any use to a desktop user.
38. Re:Time for vector processing again by DMalic · 2008-12-05 04:54 · Score: 2, Funny
  
  That's racist! I, myself, have consumed so much caffeine that I am now Korean, and I therefore converse in the language "l337".
39. Re:Time for vector processing again by DMalic · 2008-12-05 04:59 · Score: 1
  
  This isn't about CPU utilization. All of this bandwidth-limited stuff never seems to apply to CPU benchmarks until you get to, say, eight or sixteen core systems (used to partially affect four-core systems, back before quads..) - which don't affect most consumer systems. (Not to say it's not important). Intel's design was pretty FUBAR in some ways, but they kept it nice and fast for desktops.
40. Re:Time for vector processing again by Waffle+Iron · 2008-12-05 05:02 · Score: 1
  
  So it had a tiny 4 kilobyte, manually allocated cache (aka vector registers). But yes, they still had to be juggled by streaming vectors to and from memory. That approach still requires more memory bandwidth than current cache architectures, and it wouldn't solve the memory issues any more effectively than today's common designs, which just happen to stream cache blocks instead of vectors.
41. Re:Time for vector processing again by flaming-opus · 2008-12-05 05:03 · Score: 1
  
  The problem is that no idea doubles the rate at which supercomputers advance. Most of the ideas out there jump forward, but they do it once. Vectors, streams, reconfigurable computing: all of these buzzwords were once the next big thing in supercomputing. Today everyone is talking about GPGPUs. None of them go very far. How much engineering goes into the systems? How long does it take to get to market? How difficult is it to rewrite all the algorithms to take advantage of the new machine? What proportion of the codes see a real advantage on the new machine? Can your company stay afloat long enough to reap the rewards? (Remember that supercomputing is a tiny niche market compared to computing in general.)
  I've seen a lot of "game changing ideas" come along in the supercomputing world. Commodity computing is the only one left.
42. Re:Time for vector processing again by Anonymous Coward · 2008-12-05 05:06 · Score: 1, Insightful
  
  We can invest our time writing our posts to an exacting standard or we can get our points across and move on with our lives. We're not writing professionally. We're having a conversation. Do you know what "lot's" means? Of course. It means "lots" and someone made a typo. Cut people a little slack.
43. Re:Time for vector processing again by mikael · 2008-12-05 05:21 · Score: 1
  
  Early supercomputers were built from custom chips designed for specific applications along with a custom network topology. This may have reduced the energy demands of the system, but meant that the system was good for one application only.
  Also, different supercomputers would have different network topologies depending upon the application. It became immediately obvious that a single bus shared between a group of CPUs wasn't going to achieve peak performance, so different architectures were developed for each application: ring, star, open 2D grid, open 3D grid, toroidal, hypercube. In the end, it became cheaper to have intelligent network controllers that could be dynamically reconfigured into the particular topology required by the application.
  It wasn't good value for money for a university department if the system they had just paid millions to build and install was going to be scrapped entirely two years later for a new system with a different architecture, requiring all the software to be rewritten.
  At the same time, Intel and AMD chips gradually adopted the technologies used by supercomputer processors (large caches, superscalar instructions, floating-point processors). In the end it becomes cheaper to use a commodity processor than to design a 100-million-transistor CPU from scratch.
  With multi-core architectures, the single bus cache-snooping algorithms would have to be replaced with an internal programmable network topology as with the supercomputers.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
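
One way to see why mikael's list of topologies mattered is to compare the worst-case hop count (the network diameter) for the same number of nodes. The sketch below covers three of the named layouts using the standard textbook formulas; the function names are hypothetical, and the 1024-node size is just an example.

```python
def ring_diameter(nodes):
    """Worst-case hops on a bidirectional ring: halfway around."""
    return nodes // 2


def torus_2d_diameter(side):
    """Worst-case hops on a side x side grid with wraparound links."""
    return 2 * (side // 2)


def hypercube_diameter(nodes):
    """Worst-case hops on a hypercube: one hop per address bit."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("a hypercube needs a power-of-two node count")
    return nodes.bit_length() - 1


# The same 1024 nodes wired three different ways.
print("ring:     ", ring_diameter(1024))
print("2D torus: ", torus_2d_diameter(32))   # 32 x 32 = 1024 nodes
print("hypercube:", hypercube_diameter(1024))
```

The lower diameters come at the price of more links per node, which is part of the cost trade-off the comment describes.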
44. Re:Time for vector processing again by hey! · 2008-12-05 05:42 · Score: 1
  
  I'm talking about the scenario TFA proposes: that directions in technology cause supercomputing advancement to stall. In that case, expressed as a ratio, any advancement at all would be infinite. However, I don't expect improvements in supercomputing will go all the way down to zero.
  Now, why couldn't the rate of improvement double over some timescale from what it is now? I think it is because investors don't care a rat's ass about the rate of technological advance; they care about having something to sell that somebody else doesn't. What you're saying amounts to this: the people working in the field pretty much agree on what's needed to make a supercomputer go faster, so the focus of supercomputing development is on commodity concepts.
  However, if you believe that supercomputers could go, say, four times as fast as they do now in ten years and sixteen times as fast as they do now in twenty years, then it certainly is physically possible for them to become eight times as fast in ten years. The difference between four and sixteen in the initial scenario comes from ten additional years of investment, in both money and thought. There is not some kind of intrinsic limitation on the rate that technology gets better (hmm.. sounds like a premise for a sci-fi story).
  Provided that it is physically possible to reach some higher level of performance in the future, it is not the rate of improvement that is the limitation. The limitation is the rate of investment. And increasing the rate of improvement is non-linear in investment; investments are subject to diminishing returns. Even so, investment is not subject to a strict upper limit. The right person with the right ideas could attract additional investment, which would increase the rate of technological improvement.
  I'm not expecting it to happen. But I don't think you can rule out the idea that more investment will go into supercomputers. If supercomputing improvement stalls, it could go either way: some investors will abandon the field as a commodity field, others will look at it as a chance to get ahead by doing something different and proprietary.
  
  --
  Post may contain irony: discontinue use if experiencing mood swings, nausea or elevated blood pressure.
45. Re:Time for vector processing again by Duhavid · 2008-12-05 05:43 · Score: 1
  
  Lots of people think so.
  
  --
  emt 377 emt 4
46. Re:Time for vector processing again by pcarter7 · 2008-12-05 05:56 · Score: 1
  
  The days when Seymour Cray could design a product which was cutting edge & saleable for a decade are long gone.
  Yes, but this is largely because Cray is dead, not because it is impossible for someone similarly gifted to do what he did.
47. Re:Time for vector processing again by osu-neko · 2008-12-05 06:06 · Score: 1
  
  Esoteric-looking shells are there for marketing purposes and cost-justification.
  In other words, the important part. :)
  I'm only half-joking, of course. In a capitalistic system, "marketing purposes" alludes to the primary reason the computer is being built: to sell and thus make money.
  
  --
  "Convictions are more dangerous enemies of truth than lies."
48. Re:Time for vector processing again by knails · 2008-12-05 06:15 · Score: 1
  
  On the whole, English grammar doesn't change with time, but it does have a lot of nuances to represent its diverse origins and subtle meaning differences. Just because every rule has an exception doesn't mean you shouldn't at least try. If you forget a certain situation is an exception to a rule, I can forgive that with a simple informative correction, but plural vs. possessive? That's a different matter entirely.
  
  --
  "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
49. Re:Time for vector processing again by postbigbang · 2008-12-05 06:19 · Score: 1
  
  Certainly sales efforts are justified. But it's what the machine does, rather than its esoteric facade, that makes a difference. CEOs like Lambo looks. Nerds understand that it's how much you can actually productively crunch that makes the difference. Gimme crunch, as the aesthetics are somewhat meaningless. These are computers.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
50. Re:Time for vector processing again by digitalunity · 2008-12-05 06:25 · Score: 1
  
  Absolutely. A supercomputer's good looks are there to help the buyers make a purchasing decision. The people actually using the supercomputers don't necessarily even see them on a regular basis.
  
  --
  You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
51. Re:Time for vector processing again by gbjbaanb · 2008-12-05 06:44 · Score: 1
  
  Modern CPU's have 8+ Mega Bytes of L2/L3 cache on chip so RAM is only a problem when your working set it larger than that.
  Unfortunately, most modern apps require far more working set than that! The crappy .NET app I use at work had a working set of 700MB today.
  The other issue is that HPC applications generally require small amounts of processing on lots and lots of snippets of data, i.e. highly parallel processing. This means that memory bandwidth is a very significant bottleneck.
  
  you can have lot's of cores and a tiny cache like GPU's
  Incidentally, GPUs have a lot of cache: my graphics card has 512 MB of RAM.
52. Re:Time for vector processing again by 5pp000 · 2008-12-05 06:56 · Score: 3, Funny
  
  "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
  ... as long as you spell it right :)
  
  --
  Your god may be dead, but mine aren't!
53. Re:Time for vector processing again by David+Greene · 2008-12-05 06:59 · Score: 1
  
  I agree that advances in commodity memory architecture are needed to support large vectors. However, a vector register file is quite a bit different than a cache, just as a scalar register file is very different than a cache. They're architected to serve different needs.
  Vector registers can help the memory system over current cache-based designs. We wouldn't have SSE registers if they didn't. Software can control the stream presented to the memory system using registers, for example. If there's enough data reuse, vector registers help tremendously.
  Vector registers won't help streaming apps, but that's not what they're designed to do. A stream buffer or similar mechanism is better for that.
  We are going to see longer vectors in commodity processors. Sandy Bridge will have AVX. Haswell will have something more. What's not clear is how Intel plans to architect the memory system. That will be interesting to see.
  
54. Re:Time for vector processing again by daniel_gustafsson · 2008-12-05 07:02 · Score: 1
  
  You are correct. More common hardware is lots cheaper and can be used for more tasks.
55. Re:Time for vector processing again by nategoose · 2008-12-05 07:17 · Score: 1
  
  And that should be
  
  Faster computation doesn't help communication-limited tasks very much. Faster communication doesn't help computation-limited tasks very much.
56. Re:Time for vector processing again by lysergic.acid · 2008-12-05 07:24 · Score: 4, Interesting
  well, supercomputing has always been about maximizing system performance through parallelism, which can only be done in three main ways: instruction level parallelism, thread level parallelism, and data parallelism.
  ILP can be achieved through instruction pipelining, which means breaking down instructions into multiple stages so that CPU modules can work in parallel and reduce idle time. for instance, in a RISC pipeline you break an instruction down into 5 operations:
  instruction fetch
  instruction decode / register fetch
  instruction execute
  memory access
  register write-back
  so while the first instruction is still in the decode stage the CPU is already fetching a second instruction. thus if fully-pipelined there are no stalls or wasted idle time, and a new instruction is loaded every clock cycle, resulting in a maximum of 5 parallel instructions being processed simultaneously.
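  The fill-then-stream behaviour described above can be sketched as follows. this is an idealized model that assumes no stalls, hazards, or branch mispredictions, and the function names are hypothetical.

```python
# The five classic RISC stages, in the order listed above.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipelined_cycles(num_instructions: int, num_stages: int = len(STAGES)) -> int:
    """Cycles to retire all instructions when a new one enters every cycle."""
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles; each later one adds one.
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int = len(STAGES)) -> int:
    """Cycles when each instruction finishes every stage before the next starts."""
    return num_stages * max(num_instructions, 0)


def occupancy(cycle: int, num_instructions: int) -> dict:
    """Map stage name -> index of the instruction in that stage at this cycle."""
    in_flight = {}
    for stage_index, stage in enumerate(STAGES):
        instruction = cycle - stage_index
        if 0 <= instruction < num_instructions:
            in_flight[stage] = instruction
    return in_flight


print(pipelined_cycles(100), unpipelined_cycles(100))
print(occupancy(4, 100))  # all five stages busy once the pipeline has filled
```

  in this model the pipelined machine approaches one instruction per cycle as the instruction count grows, while never having more than five instructions in flight.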
  then there are superscalar processors, which have redundant functional units--for instance, multiple ALUs, FPUs, or SIMD (vector processing) units. and if each of these functional units is also pipelined, then the result is a processor with an execution rate far in excess of one instruction per cycle.
  thread level parallelism OTOH is achieved through multiprocessing (SMP, ASMP, NUMA, etc.) or multithreading. this is where multicore and multiprocessor systems come in handy. multithreading is generally cheaper to achieve than multiprocessing since fewer processor components need to be replicated.
  lastly, there's data level parallelism, which is achieved in the form of SIMD (Single Instruction, Multiple Data) vector processors. this type of parallelism, which originated from supercomputing, is especially useful for multimedia applications, scientific research, engineering tasks, cryptography, and data processing/compression, where the same operation needs to be applied to large sets of data. most modern CPUs have some kind of SWAR (SIMD Within A Register) instruction set extension like MMX, 3DNow!, SSE, AltiVec, but these are of limited utility compared to highly specialized dedicated vector processors like GPUs, array processors, DSPs, and stream processors (GPGPU).
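  as a rough illustration of the SWAR idea mentioned above, here is a sketch that adds four 8-bit lanes packed into one 32-bit word with a single pass of integer operations. it uses the well-known mask-and-xor trick to keep carries from crossing lane boundaries; it is not how MMX or SSE are implemented in hardware, and the helper names are invented for the example.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of each of the four byte lanes
HIGH1 = 0x80808080  # top bit of each lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def swar_add_u8(a, b):
    """Add four 8-bit lanes at once; each lane wraps around modulo 256."""
    # Adding only the low 7 bits of every lane cannot carry into the next lane.
    low_sum = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then fixed up separately with xor.
    return low_sum ^ ((a ^ b) & HIGH1)


print(unpack(swar_add_u8(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))))
```

  one "instruction" here operates on four data elements, which is the single instruction, multiple data pattern in miniature.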
57. Re:Time for vector processing again by PingPongBoy · 2008-12-05 07:27 · Score: 1
  
  I've always felt there was something odd about the recent trend of Super Computers using common hardware. components. They have really loss their way in super computing by just making a beefed up PC and running a version of a common OS which could handle it. Or Clustering a bunch of PC's togeter. Multi-Core technology is good for desktop systems as it is meant to run a lot of relatively small apps Rarely taking advantage of more then 1 or 2 cores. per app.In other-words it allows Multi-Tasking without a penalty. We don't use super computers that way. We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days. Back in the early-mid 90's we had different processors for Desktop and Super Computers. Yes it was more expensive for the super computers but if you were going to pay millions of dollars for a super computer what the difference if you need to pay an additional $80,000 for more custom processors.
  Your mind must still be stuck in the 90's.
  Let's see what happened since then.
  I've always felt there was something odd about the recent trend of Super Computers using common hardware components -- the components are not so common if you compare them to what was available even just months ago. High end computer parts evolve so fast that if you custom design a supercomputer, you can turn around and find that an off-the-shelf machine is breathing down your neck. The custom computer is hard to replicate because the next customer is always asking you for an even better processor. With ordinary components you can get a lot more customers because they don't bother you for custom everything.
  Or Clustering a bunch of PC's togeter -- This is an insensitive statement. People who want supercomputers but can't afford them make clusters. The clusters turn out to be quite scalable so we have really advanced cluster technology, which poorer people can afford to scale down from.
  We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days. -- What can one say? There are so many problems that require serious compute power that if a supercomputer can be multitasked, so much the better! PCs are designed to deliver multitasking, at some cost to straight-ahead speed.
  
  --
  Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
58. Re:Time for vector processing again by SpinyNorman · 2008-12-05 07:31 · Score: 1
  
  A crossbar switch in and of itself isn't the solution to increased memory access bandwidth. What you need is to increase parallelism of memory access per CPU, which means increasing the number of independent memory banks per CPU - one bank per N cores. You also then need (for global memory) to provide a way for each core group to access non-local memory banks (a shared bus would be sufficient).
  e.g. something like this:
  Cores 1 -4 = Memory bank 1 =+
  =================+
  Cores 5 -8 = Memory bank 2 =+
  =================+
  Cores 9-12 = Memory bank 3 =+
  =================+
  Cores 13-16 = Memory bank 4 =+
  =================+
  This type of memory/core architecture could easily map onto today's software. For example, have each process assigned to a single core group, such that pthread_create() assigns threads to cores only in that core group, and malloc() allocates memory only from the corresponding local memory bank. You'd presumably map part of the virtual address space onto the local memory bank and part onto the global address space (accessed via the shared bus/switch). The inter-process shared memory APIs (mmap, etc) would allocate memory from wherever they saw fit and then map it into the virtual address space accordingly.
  If you wanted to allow a single process to create threads across multiple core groups (maybe controlled via processor affinity), then you could introduce something like pthread_malloc() to allocate local per-thread memory, and have malloc() allocate per-process memory from the shared access space. You'd probably only want to have this cross-core-group thread allocation happen when explicitly requested via the thread library, so that existing binaries and recompiled applications would automatically only use the more efficient local memory, and only heavily threaded apps written to also use pthread_malloc() might ask to be spread across multiple core groups.
  The idea here would be that each thread would use local memory as much as possible, thereby achieving maximum parallelism of memory access. Global memory access would result in less parallelism since you'd usually be competing for memory access with the local core group, as well as for global memory bus access. You could choose to implement a crossbar switch in place of the global memory bus, but that would be optional.
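  A toy model of that layout, assuming the four-cores-per-bank grouping from the diagram and numbering cores and banks from 1. Nothing here is a real API: home_bank(), access_path() and bus_requests() are hypothetical names, and the model only counts which accesses stay local and which would have to cross the shared bus.

```python
CORES_PER_GROUP = 4  # assumption taken from the diagram: four cores per bank


def home_bank(core_id: int) -> int:
    """Bank that is local to a core (cores 1-4 -> bank 1, 5-8 -> bank 2, ...)."""
    return (core_id - 1) // CORES_PER_GROUP + 1


def access_path(core_id: int, bank: int) -> str:
    """Which route a memory access takes in this model."""
    return "local" if home_bank(core_id) == bank else "shared-bus"


def bus_requests(accesses) -> int:
    """Count the (core, bank) accesses that must cross the shared bus."""
    return sum(1 for core, bank in accesses if access_path(core, bank) == "shared-bus")


# Two local accesses and two that contend for the global bus.
trace = [(1, 1), (2, 3), (16, 4), (9, 1)]
print([access_path(core, bank) for core, bank in trace])
print(bus_requests(trace))
```

  The fewer entries in a trace that come back as "shared-bus", the closer the machine gets to four banks being read in parallel.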
59. Re:Time for vector processing again by Antimatt3r · 2008-12-05 07:38 · Score: 1
  
  Sounds like its time for supercomputers to go their own way again. I'd love to see some new technologies.
  The technology already exists. Companies like Sandvine, CloudShield, and Bivio Networks make supercomputer platforms.
60. Re:Time for vector processing again by postbigbang · 2008-12-05 07:40 · Score: 1
  
  You've made a good description of the variants in traditional state-dependent (Von Neumann) problem solving, although various parallelisms are the crux of the problem, and that's where multi-cores tend to bottleneck. Cross-bar relationships certainly beat the bottlenecks that are inherent to current multi-core designs. Everything is subject to the slave of the clock.
  Once dimensionality is conquered, supercomputing becomes even more interesting, IMHO. I like how GPUs can be slaved to each other, even though instruction and cache coherency problems aren't solved to any degree of fun yet.
  Multi-threading, however, is really more of an applied concept, and begins to achieve multi-dimensionality in ways that make my clocks clang. I wait longingly for CPU designers to step outside of the box even farther, then destroy the box as we know it. But that's rather abstract; until then, I spend a good part of my time cutting through vendor propaganda crap in search of optimizations and the reality of CPU/GxPU/DSP offerings, only to then be thwarted by the next announcement.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
61. Re:Time for vector processing again by mprinkey · 2008-12-05 07:43 · Score: 1
  
  OK, if you actually programmed vector processors, you'd know that the current CPU development is largely heading in the direction of vector processing--not with multicores, but with SSE4/AltiVec/MMX and especially with GPGPU. The approaches for making most common numerical algorithms work well on vector CPUs were sorted out on the original Cray machines in the 80s.
  One question that both AMD and Intel are asking is the target application for more CPU. Intel has their pet ray-tracing app, etc. For encoding/decoding video, numerical simulations, etc., there is excellent evidence that wide SIMD vector-type operations can give dramatic performance increases...see the aforementioned GPGPU. What Intel/AMD may decide is that more real world performance can be found by limiting core count, providing 8-wide or 16-wide SIMD instructions, and allowing the 20-year-old vectorizing compiler techniques to exploit that parallelism. Four cores with 16-wide SIMD instructions is probably far better than 16 cores with 4-wide SIMDs.
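  One of those old vectorizing techniques is strip mining: cut a long loop into strips that match the hardware vector width. A minimal sketch, in which each strip stands in for one vector instruction; the function name and the widths are chosen for illustration, and Python will not actually run the strip in parallel.

```python
def saxpy_strip_mined(a, x, y, width=16):
    """Compute a*x[i] + y[i] in strips of `width`; return (result, strip count)."""
    result = []
    vector_ops = 0
    for start in range(0, len(x), width):
        # One strip: on real hardware this would be a single SIMD multiply-add.
        xs = x[start:start + width]
        ys = y[start:start + width]
        result.extend(a * xi + yi for xi, yi in zip(xs, ys))
        vector_ops += 1
    return result, vector_ops


x = [float(i) for i in range(100)]
y = [1.0] * 100
_, ops_16_wide = saxpy_strip_mined(2.0, x, y, width=16)
_, ops_4_wide = saxpy_strip_mined(2.0, x, y, width=4)
print(ops_16_wide, ops_4_wide)
```

  Both configurations in the comment offer 64 lanes in total; the wide version simply issues fewer instructions per core, and the last strip handles the leftover elements.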
62. Re:Time for vector processing again by jd · 2008-12-05 08:09 · Score: 1
  
  Consider this. A multi-core processor will typically have one cache which is shared between cores. SMP systems, on the other hand, have one cache per CPU. For SIMD or MISD problems, a single cache is not a big penalty. For MIMD problems, it will kill you stone dead. Most "interesting" problems (the problems that eat Crays for breakfast, and enjoy a leisurely lunch of Blue Genes sautee'd with Information Silos in a brisk red hat) are MIMD. It's these problems that have people talking in hypercubes, Processor-in-Memory, MPI-enabled RAM, RDMA, and any other trick they can possibly come up with to shunt data around faster. When these guys can (and do) play with 60 gigabits per second bandwidth in both directions, they need CPU technology that won't take one look at that kind of hardware abuse and run screaming through the night.
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
63. Re:Time for vector processing again by virtual_mps · 2008-12-05 08:15 · Score: 1
  
  The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes a long with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of that.
  I really can't parse what you're trying to say. I'm guessing it's along the lines of "well, someone could have a radical idea and transform the supercomputer market". I won't preclude that. Until that happens, the market is what it is.
  
  Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability you could predict a hurricane's track with several times the precision you could today.
  Well, first, you'd have to be an idiot to make a prediction about the capabilities of computers in twenty years. Second, it's about leverage. Do you really think a small supercomputer-specific R&D expenditure will outperform the aggregate multibillion dollar generic computer R&D budget over decades? Smart money says you leverage that enormous budget and spend your money on tailoring the generic capabilities to maximize performance on your workload--which is how we ended up with clusters of commodity hardware. Yeah, yeah, maybe everything would be better if civilization spent an unbounded amount of money on R&D for task-specific supercomputers--but the reality is that the world is full of narrowly-focused geeks who want more money spent on their pet project and they can't all get it.
64. Re:Time for vector processing again by PitaBred · 2008-12-05 08:21 · Score: 1
  
  The people buying supercomputers aren't your typical PHB's. If you need a supercomputer, you know it, and you know enough to look at the whole architecture.
  
  --
  My blog. Good stuff (when I remember to update it). Read it.
65. Re:Time for vector processing again by virtual_mps · 2008-12-05 08:29 · Score: 1
  
  I'm talking about the scenario TFA proposes: that directions in technology cause supercomputing advancement to stall.
  Well, if that's what you want to talk about I don't have much to say. I reject the premise that in the future the fastest computers won't be able to process things faster than the fastest computers today. If you think about the history of the industry the premise actually seems absurd. Is it impossible? No. Is the possibility worth agonizing over? Also no.
66. Re:Time for vector processing again by virtual_mps · 2008-12-05 08:32 · Score: 1
  
  No, it's because the market has matured. Even Cray's final ideas hit an engineering brick wall (it happens). The R&D budget for the segment can't support as many dead ends as the R&D budget for the industry as a whole. Does that mean the end of R&D? No. But it does impact where you spend your money.
67. Re:Time for vector processing again by virtual_mps · 2008-12-05 08:35 · Score: 1
  
  But I'm wondering if there isn't some intermediate ground between using commodity CPUs and custom architectures. Seems like some company ought to be able to license a design from Intel or AMD, and modify it to better meet the needs of high performance computing. You might not gain a lot on memory bandwidth (that seems deeply tied to the architecture), but you might be able to speed up floating point trig functions, extend vector processing from 128-bit to 256-bit or longer, improve SIMD support for double precision, maybe integrate some FPGA-like circuitry...things which aren't economical for mainstream CPUs, but don't require a major redesign. It seems to me that unlike general users, HPC would benefit greatly from improved floating point precision and speed.
  No, because by the time your small team finishes that effort and gets it right, Intel is two generations ahead and their commodity part gets the same performance as yours, essentially for free (because the development was paid for by solitaire players). That's exactly the problem.
68. Re:Time for vector processing again by LingNoi · 2008-12-05 09:27 · Score: 1
  
  There's a time and place for everything, we're here to discuss new's, not the finer point's of the English language to people to whom English probably isn't even their native language.
  Stop wasting everyone's time having to read and skip your post's. Get on talking about why Multicore is bad.
  The only people who correct English here are dumbfuck's that have nothing to add about the article.
  Spelling mistake's added on purpose because it pisse's you off.
69. Re:Time for vector processing again by knails · 2008-12-05 09:36 · Score: 1
  
  Like I said, I didn't originally point out the errors, and, except for cases of "would of" and "irregardless", I never do, but I will always defend someone for doing so. If English isn't their native language, I'm actually more willing to point out errors, in a non-hostile way, of course, for I respect language correctness, and will always help anyone who is trying to learn.
  
  --
  "I disapprove of what you say, but I'll defend to the death your right to say it" -Voltaire
70. Re:Time for vector processing again by afidel · 2008-12-05 09:45 · Score: 1
  
  What you described is exactly how AMD's CPUs work today: there are two memory controllers for 4 cores on each chip. The problem is that outside data lines and memory modules are expensive. That's why Sun servers were always vastly more expensive per MIPS than Xeons: they were designed to do real work and so had much wider memory buses, whereas all the Xeons in a system went through a single FSB memory controller (well, that and lower volume meant the design work had to be spread over fewer units).
  The problem is that going forward we will add significantly more cores per memory controller. There's also the relationship between cores, L2, and L3 cache to consider. The L3 cache on a 32 core chip could become a VERY dirty area with certain workloads.
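  A back-of-the-envelope way to see why more cores per memory controller stops helping is a roofline-style bound: performance is capped by whichever is lower, the compute peak or what the memory system can feed. The sketch below uses made-up illustrative numbers, not measurements and not figures from the Sandia study, and it only shows the levelling off, not the decline the article also mentions.

```python
def attainable_gflops(cores, gflops_per_core, mem_bandwidth_gbs, flops_per_byte):
    """Upper bound on throughput: the lower of the compute and memory limits."""
    compute_bound = cores * gflops_per_core
    memory_bound = mem_bandwidth_gbs * flops_per_byte
    return min(compute_bound, memory_bound)


# Hypothetical chip: 10 GFLOP/s per core, 20 GB/s of shared memory bandwidth,
# and a workload that does 2 floating-point operations per byte fetched.
for cores in (2, 4, 8, 16, 32):
    print(cores, attainable_gflops(cores, 10.0, 20.0, 2.0))
```

  With these numbers the bound stops rising at four cores; everything past that is waiting on the same memory controller.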
  
  --
  There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
71. Re:Time for vector processing again by sjames · 2008-12-05 12:02 · Score: 1
  
  Actually, it makes a great deal of sense. The 'big iron' supercomputers were orders of magnitude more expensive than clusters. It wasn't a matter of paying $80,000 more for custom processors, it was more like paying $80,000,000 more. That's quite a chunk of change for something that will be obsolete in a few short years. I have seen an X-MP used as a coffee table and an old Cray-1 used as a stylish sofa; they would cost more in electricity to run for a month than a new, just-as-fast system (a big desktop PC!).
72. Re:Time for vector processing again by meson2439 · 2008-12-05 13:59 · Score: 1
  
  Just to demonstrate how English varies with time, here are some examples from 200 years ago: to-morrow, to-day. Using "can't" was also considered incorrect back then, and "onto" did not exist.
73. Re:Time for vector processing again by TapeCutter · 2008-12-05 15:00 · Score: 1
  
  The sort of thing they use in the LHC? - Thanks for the interesting post, I'm off to look up some of those acronyms.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
74. Re:Time for vector processing again by hardwarefreak · 2008-12-05 15:05 · Score: 1
  
  The first time I read an article where I think Los Alamos was ordering a supercomputer with 8192 Pentium Pro processors in it, I was like WTF?
  The system you're thinking of was called ASCI RED, and it was installed at Sandia, not Los Alamos:
  http://www.sandia.gov/ASCI/Red/
  http://www.top500.org/system/4428
77. Re:Time for vector processing again by hardwarefreak · 2008-12-05 15:26 · Score: 1
  
  The move towards commodity processors in supercomputing wasn't some kind of accident, it occurred because that's what currently gets the best results.
  You are totally misinformed. COTS clusters and the move toward x86 commodity CPUs in MPPs is the direct result of CPLANT, and it was all about initial hardware acquisition cost, not performance:
  http://www.cs.sandia.gov/cplant/
  CPLANT is the father of all current cluster supercomputers, which make up over 90% of all supers in the world. Anyone find it interesting that the CPU architecture that started the move to COTS supercomputers was actually Alpha, not x86? What's that saying? "Truth is often stranger than fiction"?
78. Re:Time for vector processing again by hardwarefreak · 2008-12-05 15:34 · Score: 1
  
  http://www.cray.com/products/XMT.aspx
  Did you know that the only customer to ever buy an XMT is the United States National Security Agency, commonly known as the NSA? And they have more than one. They're listed as "classified government agency" in the SEC 10K filings for system sales.
  Did you happen to read what this particular architecture really excels at? Pattern matching. Now, where do you think all those "illegally" obtained phone company records got dumped into for analysis?
79. Re:Time for vector processing again by adisakp · 2008-12-05 19:13 · Score: 1
  
  the relationship to other cores and to memory aren't fully 'cross-bar'
  In AMD multicore processors and the latest i7 Intel chips with integrated memory (rather than a shared FSB), the system memory is "stacked to the processor" already (one bank of memory per physical chip). You just need to make sure that your working set per core is allocated from the memory that is directly connected to the CPU that you are working on.
  
  FWIW, Windows already has a solution for this. You run each thread on a core with a set processor affinity so it doesn't get swapped around. Then you perform your memory management with NUMA allocations so your working set comes from the memory directly connected to your core.
  
  You will have some memory that is "shared" for communication and other uses but as long as your primary working set is directly attached to your CPU, all should be roughly equivalent to the "stacked memory" custom supercomputer architectures mentioned in the article.
  
  I'm sure that other OS's support NUMA allocation as well.
  
  FWIW, you can further increase performance by "chunking" data into segments that will fit into L1 for inner loops if possible so that multiple cores in a single die don't compete for L2/L3 and Memory Bus bandwidth as often.
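  
  A minimal sketch of that chunking idea, assuming a 32 KiB L1 data cache and 8-byte elements (both are assumptions; real sizes vary by chip). The function names are made up, and Python itself will not show the speedup, so only the loop structure is the point.

```python
def l1_chunks(num_elements, element_bytes, l1_bytes=32 * 1024):
    """Split an index range into (start, end) blocks that each fit in L1."""
    per_chunk = max(1, l1_bytes // element_bytes)
    return [(start, min(start + per_chunk, num_elements))
            for start in range(0, num_elements, per_chunk)]


def sum_of_squares_chunked(data, element_bytes=8):
    """Process the data one L1-sized block at a time."""
    total = 0
    for start, end in l1_chunks(len(data), element_bytes):
        block = data[start:end]
        # Do all the inner-loop work on this block before moving to the next.
        total += sum(value * value for value in block)
    return total


print(l1_chunks(10000, 8))
print(sum_of_squares_chunked(list(range(10))))
```

  In a compiled language the same blocking keeps the inner loop's working set inside L1, so the cores on one die hit the shared L2/L3 and the memory bus less often.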
  
  Finally, avoid sharing or overlapping data between multiple cores. Watch especially for "false sharing": two separate data elements that are not actually shared but sit in the same cache line, so that independent accesses to them cause cache updates/invalidations.
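  
  To make the cache-line point concrete, here is a small layout checker. It assumes 64-byte cache lines (typical for x86, but an assumption) and describes each variable as a hypothetical (name, byte offset, owning core) tuple; it only inspects the layout and does not measure any real cache traffic.

```python
CACHE_LINE = 64  # bytes; an assumption, check your own hardware


def line_of(offset):
    """Index of the cache line that a byte offset falls into."""
    return offset // CACHE_LINE


def falsely_shared(fields):
    """Return pairs of fields owned by different cores that share a cache line."""
    pairs = []
    for i, (name_a, offset_a, core_a) in enumerate(fields):
        for name_b, offset_b, core_b in fields[i + 1:]:
            if core_a != core_b and line_of(offset_a) == line_of(offset_b):
                pairs.append((name_a, name_b))
    return pairs


def padded_offsets(count):
    """Give every field its own cache line by spacing them CACHE_LINE apart."""
    return [i * CACHE_LINE for i in range(count)]


# Three per-core counters packed 8 bytes apart all land in line 0.
packed = [("c0", 0, 0), ("c1", 8, 1), ("c2", 16, 2)]
padded = [("c%d" % i, off, i) for i, off in enumerate(padded_offsets(3))]
print(falsely_shared(packed))
print(falsely_shared(padded))
```

  The padded layout wastes some memory, which is the usual trade for keeping each core's writes on its own line.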
80. Re:Time for vector processing again by dsanfte · 2008-12-06 03:18 · Score: 2, Informative
  
  Bad English isn't something you can keep locked out of sight in the back closet of society; it's like a termite infestation. Allow it a foothold and it'll spread everywhere. It's a higher-entropy state.
  There's a world difference between someone who's just writing casually (and goofing up), and someone else who is completely unable to grasp the tenets of grammar. The former are perfectly capable of writing well on a resume, as you say; the latter are functionally illiterate, and they should be told when their English isn't good enough to participate in a discussion. There's nothing wrong with that; it gives them the opportunity to improve.
  
  --
  occultae nullus est respectus musicae - originally a Greek proverb
81. Re:Time for vector processing again by Retric · 2008-12-06 06:32 · Score: 1
  
  Your .NET app did not need all of that 700MB at the same time. Chances are a lot of that data is very rarely accessed, and by carefully managing what is in 8MB of cache the CPU can dramatically cut down on how often it needs to access the RAM. (Think of it this way: reading a single byte from RAM can waste hundreds of cycles on the CPU, so it's something to be avoided.)
  
  Graphics cards need to access a much larger working set. They still have ~64 KB or so of cache per core because it's really useful, but 30 times a second they need to access most of that 512 MB of RAM, so an 8 MB cache would be wasted. The secret is setting up that memory so the GPU does not need to request each byte of memory, wait for it, and then send out the next request, but can instead load longer stretches of data and process them while waiting on the next chunk of data.
  
  PS: CPUs also load chunks of data from RAM to cache, and from L3 to L2/L1 cache, etc., because for the most part if you need a byte there is a good chance you are going to need the next one as well. The problem is that branching code and complex data structures break up how useful this can be.
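  
  A quick way to see how much that chunked loading helps, or doesn't: count how many distinct cache lines a run of accesses touches. This sketch assumes 64-byte lines and a hypothetical helper name; it counts lines only and does not model timing or prefetching.

```python
def lines_touched(num_accesses, stride_bytes, line_bytes=64):
    """Number of distinct cache lines hit by accesses spaced stride_bytes apart."""
    return len({(i * stride_bytes) // line_bytes for i in range(num_accesses)})


# 1024 accesses each time; only the spacing between them changes.
print(lines_touched(1024, 1))    # byte after byte: lines are reused heavily
print(lines_touched(1024, 8))    # one 8-byte value after another
print(lines_touched(1024, 64))   # pointer-chasing style: a new line every time
```

  Sequential access gets 64 useful bytes out of every line fetched, while a scattered data structure can end up fetching a whole line for each byte it actually wants.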
82. Re:Time for vector processing again by postbigbang · 2008-12-06 06:50 · Score: 1
  
  I'm with you until the last paragraph. It's the reason why crossbar relationships overcome the dirty inter-core state machine.
  Nonetheless, you argue well for NUMA-- except that the HPC seems to still have cache thrash. An exchange for another day, perhaps.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
83. Re:Time for vector processing again by Dralnu · 2008-12-06 09:08 · Score: 1
  
  What about the universities? A newly developed CPU could be a huge avenue for profit for them, and along with cheap (student) labor, they might come up with something better than what is being crapped out today, whether it is a new version of a previously discontinued system (ternary CPUs, for example), or something entirely new. The CPUs we are using today are based on tech from the 90's with some extra crap added onto them.
84. Re:Time for vector processing again by adisakp · 2008-12-06 09:33 · Score: 1
  
  Heavy use of False sharing (packed structures with independent member variables accessed by multiple cores) is a huge performance penalty on multicore. It's best to separate the variables or to shadow them on a per core (i.e. per thread / TLS) basis if possible to get the highest performance.
  
  Also, if you access shared control variables with atomic operations (compare-exchange / load-and-reserve + store-conditional), you can avoid undue cache cross talk (updates and invalidates) between the CPUs.
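  
  The per-thread shadowing pattern looks roughly like this. It is a structural sketch only: Python has no compare-exchange primitive and its interpreter lock hides the cache effects, so a plain lock stands in for the single atomic merge, and the function name is invented.

```python
import threading


def count_with_shadows(num_threads, increments_per_thread):
    """Each thread counts privately, then merges once into the shared total."""
    total = 0
    merge_lock = threading.Lock()

    def worker():
        nonlocal total
        local = 0  # private shadow of the shared counter; never shared
        for _ in range(increments_per_thread):
            local += 1
        # One synchronized update per thread instead of one per increment.
        with merge_lock:
            total += local

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


print(count_with_shadows(4, 1000))
```

  In C or C++ the shadow would live in thread-local storage or a cache-line-padded slot, and the merge would be a single atomic add or compare-exchange loop.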
85. Re:Time for vector processing again by postbigbang · 2008-12-06 09:50 · Score: 1
  
  I believe that Shanghai and the i7 multicores were built more with hypervisors, rather than clusters, in mind. Though I'm loath to cite Sun's cluster model and have even more difficulty with some of their processor architectures, I see their multicores as going down a more astute track.
  There are other cache semaphoring techniques and instruction pipeline optimization possibilities, but it's the compiler that gets it right-- or not. I wish aloud for a compiler analyzer that can take various constructs (as was done in the old 'Lifeboat' days) and examine objects for optimizations and executions--> and importantly, states.
  Cell CPUs also intrigue me a lot.... but the thread between us is now long. Dinner knows no wait states where I live.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
86. Re:Time for vector processing again by jd · 2008-12-07 10:23 · Score: 1
  SIMD - Single Instruction, Multiple Data. Useful with highly parallelizable problems.
  
  MISD - Multiple Instructions, Single Data. Radiosity in parallel with Raytracing would be an example of that.
  
  MIMD - Multiple Instructions, Multiple Data. Very common in high-end supercomputing.
  
  RDMA - Remote Direct Memory Access. Access another machine's memory as if it were your own. Bypasses the kernel and is very lightweight on the CPU(s), mostly done on the NIC. (NICs capable of RDMA are sometimes called RNICs.)
  
  PIM - Processor In Memory. Advanced functions (like message passing) embedded into the RAM chips themselves, bypassing the CPU(s) and the Operating System entirely. Very useful when you're shunting data around, rather than crunching it.
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
87. Re:Time for vector processing again by adisakp · 2008-12-08 08:50 · Score: 1
  
  Cell CPUs also intrigue me a lot....
  
  That's ironic because I have been programming the Cell processor at my job for the last three years. As a professional video game programmer, I do about half my coding on the PS3 Cell Processor :-)
  
  I do most of the rest of my coding on the XBOX 360 which is also multicore but is a bit easier than the Cell to program.
  
  If you want to learn Cell multiprocessor programming though, you can easily pick up a used PS3 for about $250 on Craigslist and set it up to run Linux. You don't get access to the GPU so hardware-accelerated graphics isn't possible but you can do a fair amount of Cell SPU multicore coding without too large an investment. If you plan on remotely targeting the PS3, I suggest setting up a simple VirtualBox VM on your home PC if you run Windows. Also, this site is a pretty good start for PS3 home brew performance optimizing for the Cell processor.
  
  You won't run into any of the false sharing issues (which are a big deal for XBOX 360 performance) since there's only one PPC core and the SPUs are basically "DMA" driven for access to main RAM.
88. Re:Time for vector processing again by postbigbang · 2008-12-08 10:42 · Score: 1
  
  Thanks for this info.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
89. Re:Time for vector processing again by flaming-opus · 2008-12-08 10:53 · Score: 1
  
  Not quite.
  see http://www.pnl.gov/science/images/highlights/computing/cray.pdf
  http://www.pnl.gov/topstory.asp?id=320
  http://www.cacr.caltech.edu/news/story.cfm?ID=30
90. Re:Time for vector processing again by hardwarefreak · 2008-12-08 12:03 · Score: 1
  
  Follow the funding trail. Most of the small single cabinet systems are "seed" machines, loaned by Cray at no cost. They do this, especially for brand new unproven architectures, in the hopes that 'early users' will find the machine is great and spread the word. *Then* Cray actually starts selling them. It's very difficult to get people to spend millions of dollars on a brand new architecture. NSA did. They had staff interfacing with Cray staff during development of the XMT. NSA has a huge budget, and the human resources to dedicate to making this new architecture hum. This is the point I was making. NSA bought whilst everyone else was borrowing the XMT machines. Maybe one or two of the small ones were actually purchased, but most were loaned. Those loans can turn into purchases.
  The single cabinet systems in those articles you link to are likely seed machines on loan, not purchased. The $900k mentioned in that NSF grant is awfully small to cover a single cabinet XMT *AND* the salaries of the principal researchers who won the grant.
91. Re:Time for vector processing again by brokenbeaker · 2008-12-12 11:33 · Score: 1
  
  you are an idiot
Well doh by Kjella · 2008-12-05 00:17 · Score: 4, Insightful

If you make a simulation like that, keeping the memory interface constant, then of course you'll see diminishing returns. That's why we're no longer running plain old FSBs: AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores, and I'm sure something similar can and will be built as a supercomputer backplane. The headline is more than a little sensationalist...

--
Live today, because you never know what tomorrow brings
1. Re:Well doh by cheater512 · 2008-12-05 00:34 · Score: 4, Insightful
  
  There are limits however to what you can do.
  It's not like multi-processor systems where each CPU gets its own RAM.
2. Re:Well doh by Anonymous Coward · 2008-12-05 00:53 · Score: 1, Insightful
  
  There are limits however to what you can do.
  It's not like multi-processor systems where each CPU gets its own RAM.
  I like where your brain is. What if they reworked the board layouts so that each proc had its own bank, or two, of RAM? 4 CPUs with 2 GB each. I sense amazing possibilities here. Dual north/south bridges. I mean, if you write a smart enough CPU set, you could do amazing things for the future.
3. Re:Well doh by Anonymous Coward · 2008-12-05 01:02 · Score: 1, Interesting
  
  Why is it important that each CPU/Core has its own RAM? How is that more efficient than a *huge* chunk of RAM that could be accessed by any CPU/Core?
  I understand there are risks -- concurrent access and the like -- but completely separate RAM seems like an extreme solution if this is the only problem it is trying to address.
4. Re:Well doh by cheater512 · 2008-12-05 01:16 · Score: 2, Informative
  
  The summary mentions that the path to the memory controller gets clogged.
  There is only so much bandwidth to go around.
5. Re:Well doh by jebrew · 2008-12-05 01:20 · Score: 2, Informative
  
  That's how a lot of boards are already done. The issue is with a single processor that has multiple cores.
  There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.
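  A toy model of cores sharing the same memory lanes is sketched below. All of the numbers (per-core peak, bus bandwidth, bytes moved per floating-point operation) are made-up assumptions, and the model only reproduces the levelling off, not the outright decline reported in TFA.

```python
def sustained_gflops(cores, peak_per_core=10.0, mem_bw_gbs=25.0, bytes_per_flop=1.0):
    """Toy model: throughput scales with cores until the shared bus saturates."""
    compute_limit = cores * peak_per_core        # GFLOP/s if memory were free
    memory_limit = mem_bw_gbs / bytes_per_flop   # GFLOP/s the shared bus can feed
    return min(compute_limit, memory_limit)


for n in (1, 2, 4, 8, 16, 32):
    print(n, "cores:", sustained_gflops(n), "GFLOP/s")
```

  With these assumed numbers the bus is saturated somewhere between 2 and 4 cores, and every core added after that contributes nothing.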
6. Re:Well doh by TheRaven64 · 2008-12-05 01:31 · Score: 1
  
  Actually, that is a closer approximation of the real problem. Cache coherency is a performance killer, so most supercomputer software is written around the CSP model and fits nicely with NUMA architectures. The Opteron and friends run a NUMA architecture but expose it to the programmer as a UMA model.
  
  --
  I am TheRaven on Soylent News
7. Re:Well doh by DRobson · 2008-12-05 01:42 · Score: 1
  
  With any pool of memory, there is some limit to latency and throughput. Thus, the more processing elements you throw at the problem the more they compete for this resource.
  Now, if you can separate this memory into discrete pools associated with each processing element, you have less contention (locally) hence the possibility of lower latency and higher throughput (locally).
  If you can design your algorithms such that multiple writers (and hopefully readers) to the same location happen infrequently, then you have a net win, despite higher costs addressing foreign memory.
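  The trade-off above can be put as a back-of-envelope expected-latency calculation. This is only a sketch: the function names are hypothetical and the 60 ns local / 150 ns remote figures are invented for illustration, not measurements of any real machine.

```python
def mean_access_ns(local_fraction, local_ns=60.0, remote_ns=150.0):
    """Expected latency when a given fraction of accesses hit the local pool."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def break_even_local_fraction(shared_ns, local_ns=60.0, remote_ns=150.0):
    """Local fraction at which split pools match a single shared pool's latency."""
    return (remote_ns - shared_ns) / (remote_ns - local_ns)


print(mean_access_ns(0.9))
print(break_even_local_fraction(100.0))
```

  Under these assumptions, split pools win as soon as the algorithm keeps more than about 56% of its accesses local, compared with a contended shared pool averaging 100 ns.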
8. Re:Well doh by Targon · 2008-12-05 01:43 · Score: 4, Interesting
  
  You have failed to notice that AMD is already on top of this and can add more memory channels to their processors as needed for the application. This may increase the number of pins the processor has, but that is to be expected.
  You may not have noticed, but there is a difference between AMD Opteron and Phenom processors beyond just the price. The base CPU design may be the same, but AMD and Intel can make special versions of their chips for the supercomputer market and have them work well.
  In a worst case, with the support from AMD or Intel, a new CPU with extra pins(and an increased die size) could add as many channels of memory support as required for the application. This is another area where spinning off the fab business might come in handy.
  And yes, this might be a bit expensive, but have you seen the price of a supercomputer?
9. Re:Well doh by gabebear · 2008-12-05 01:44 · Score: 2, Interesting
  
  If you read the article, what they are talking about is the growing disparity between CPU speed and access to memory. The stacked memory they are talking about is shoving the physical CPU dies and RAM chips closer together (in the same package) so that you can have a LOT more interconnects (less wire to screw things up). Build up a huge stack of these and you have a supercomputer cluster WITH fast access to all memory in the stack. For informatics applications you need random access to any bit of memory in the entire array and this might do it for them. The biggest problem is heat dissipation...
  
  The other options I see are creating some kind of super giant shared buffered RAM pool that has high latency but great throughput and then sticking as many cores as they can on a single motherboard(1000+), or for a wizard to find some caching algorithm that will let them stay on commodity hardware(a.k.a. use those extra cores to figure out what you are going to need and optimize for it).
  
  I'd put my money on them finding a wizard.
10. Re:Well doh by Lumpy · 2008-12-05 01:48 · Score: 1
  
  no but by using dual or quad cores with a crapload of ram each you do get a benefit.
  a 128 processor quad core supercomputer will be faster than a 128 processor single core supercomputer.
  you get a benefit.
  
  --
  Do not look at laser with remaining good eye.
11. Re:Well doh by Targon · 2008-12-05 01:53 · Score: 2, Interesting
  
  Multi-channel memory controller is my response to this. Remember how going to a dual-channel memory controller increased the available bandwidth to memory? Having support for even 32 banks of memory could be implemented if the CPU design and connections are there.
  You are thinking along the lines of current computers, not of the applications. People keep quoting the old statement that 640KB should be enough memory for anyone, but then repeat the same mistake they quote. Quantity of memory not only goes up, but the way to talk to that memory also evolves over time.
  We used to see the CPU to chipset to memory as the way personal computers would work. Since then, AMD moved to an integrated memory controller on their CPUs, and Intel is finally following the example that AMD set. A dual-channel memory controller used to be the exception, not the rule, but now the idea is very common. In time, a 32 channel memory controller will be the standard even in an average home computer. How those channels are used to talk to memory of course remains to be seen, but you get the idea.
12. Re:Well doh by bwcbwc · 2008-12-05 02:00 · Score: 2, Informative
  
  Actually that is part of the problem. Most of the architectures have core-specific L1 cache, and unless a particular thread has its affinity mapped to a particular core, a thread can jump from a core where its data is in the L1 cache to a core where its data is not present, and is forced to undergo a cache refresh from memory.
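  Pinning is one way to stop that kind of migration. The sketch below uses Python's `os.sched_getaffinity` / `os.sched_setaffinity`, which exist only on some platforms (Linux among them); the helper name `pin_to_one_cpu` is made up for this example, and picking the lowest-numbered allowed CPU is an arbitrary choice.

```python
import os


def pin_to_one_cpu():
    """Pin the caller to a single allowed CPU; return that CPU, or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are not available on this platform
    allowed = os.sched_getaffinity(0)   # pid 0 means "the caller"
    target = min(allowed)               # arbitrary: lowest allowed CPU
    os.sched_setaffinity(0, {target})
    return target


pinned = pin_to_one_cpu()
print("pinned to CPU:", pinned)
```

  Once pinned, the scheduler can no longer move the work to a core whose L1 is cold, though it also can no longer move it away from a core that is busy.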
  Also, regardless of whether a system is multi-processing within a chip (multi-core) or on a board (multi-CPU), the number of communication channels required to avoid communication bottlenecks goes up as O(n^2) in the number of cores.
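  The O(n^2) figure corresponds to a full mesh, where every core has a dedicated link to every other core. That topology is an assumption for this sketch; real interconnects often use rings, 2D meshes or crossbars precisely to avoid this growth.

```python
def full_mesh_links(n):
    """Point-to-point links needed so each of n cores reaches every other directly."""
    return n * (n - 1) // 2


for n in (2, 4, 8, 16, 32):
    print(n, "cores ->", full_mesh_links(n), "links")
```

  Doubling the core count roughly quadruples the link count: 8 cores need 28 links, 16 need 120, and 32 need 496.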
  So yes, we are probably seeing the beginning of the end of performance gains using general-purpose CPU interconnects and have to go back to vector processing. Unless we are somehow able to jump the heat dissipation barrier and start raising GHz again.
  
  --
  We are the 198 proof..
13. Re:Well doh by GooberToo · 2008-12-05 02:12 · Score: 1
  
  This really is a problem that doesn't exist. The issue at hand is if you have all cores cranking away you run out of bandwidth. Simple solution - don't run all cores and continue to scale horizontally as they currently do. So if you need 8-core CPUs and it has the bandwidth you need, only buy 8-core CPUs. If your CPUs run out of bandwidth at 16 cores (or whatever), then only buy up to 16-core CPUs, passing on the 32-core CPUs.
  Wow, that's a hard problem to solve. Next.
14. Re:Well doh by confused+one · 2008-12-05 03:15 · Score: 1
  
  You just described AMD's current memory architecture in multi-processor systems.
15. Re:Well doh by Timothy+Brownawell · 2008-12-05 05:27 · Score: 1
  
  A dual-channel memory controller used to be the exception, not the rule, but now the idea is very common. In time, a 32 channel memory controller will be the standard even in an average home computer. How those channels are used to talk to memory of course remains to be seen, but you get the idea.
  Where are you going to find enough pins for that, or the space for the DIMM slots?
  I'd expect either (1) completely on-die or in-package memory (and fixed memory per core) or (2) some sort of stackable chips (how would this interact with heatsinks?) where your CPU has a grid of contacts on top and the memory has a grid on the bottom, which should allow for much higher speeds because you're not driving some hugely long set of wires.
16. Re:Well doh by Timothy+Brownawell · 2008-12-05 05:47 · Score: 1
  
  So yes, we are probably seeing the beginning of the end of performance gains using general-purpose CPU interconnects and have to go back to vector processing. Unless we are somehow able to jump the heat dissipation barrier and start raising GHz again.
  That's what the superconducting FETs are for, just wait a few years / couple decades for them to get something that can be made on an IC and works at liquid nitrogen temperatures.
17. Re:Well doh by Targon · 2008-12-05 10:52 · Score: 1
  
  This has a fairly simple solution if you think about it. There is nothing that says that you can't increase the density of the connection to memory. With a greater density aka smaller size, you could have multiple banks of memory on the same memory module.
  The question I was addressing was about the whole idea of memory bandwidth and how to improve the connection to memory. Many people think one channel per memory module, which isn't going to give the greatest amount of bandwidth for the space provided. If the fab process can be improved for processors, the link to memory can also be shrunk. Why limit ourselves to the current "stick" method of having memory go in a slot when a socket supporting 1TB of RAM could be implemented in the same space on the motherboard(with a heat sink/fan if needed)?
Re:but.. by jellomizer · 2008-12-05 00:45 · Score: 1

Well, we are talking about CPU to RAM, not the hard drive. But it's a similar process: the RAM is an order of magnitude slower than the CPU. When the CPU talks to the RAM, the request goes over the bus to the RAM and back through the bus to the CPU. With single-core fast CPUs you can have a bus for each core, which is like adding more lanes to a highway: it allows more traffic, so while the CPU may still be waiting for the RAM, it will be faster because you are not waiting for your bits while another core's request for some other bits is being served.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Yeah! by crhylove · 2008-12-05 00:51 · Score: 5, Funny

Once we get to 32 or 64 core cpus that cost less than $100 (say, five years), I'd HATE to have a beowulf cluster of those!

--
I hold very few opinions. I hold information based on observation and fact. If you wish to disagree, please use facts.
1. Re:Yeah! by TehBlahhh · 2008-12-05 01:35 · Score: 1
  
  erm. A beowulf cluster won't perform worse than any individual machine in it. The limitation is the cpu to memory path, which becomes saturated when you have N cores. And in a beowulf, you'd have many many of those paths, meaning you would still get a speedup - BUT each individual machine is limited as per TFA.
2. Re:Yeah! by VeNoM0619 · 2008-12-05 05:42 · Score: 1
  
  You are both wrong, he said "+1Eienstien", and apparently there are none like him. He is unique in his own way
  
  --
  Disclaimer: I am not god.
  We may not be created equal
  But we can be treated equal.
3. Re:Yeah! by Lodragandraoidh · 2008-12-05 07:04 · Score: 1
  
  Once the Bus exceeds the speed of the network, then I will worry about this. Until then, a FDDI based Beowulf cluster with attached storage, will continue to outperform any monolithic supercomputer (are there any left that are not clusters nowadays?)
  
  --
  
  Lodragan Draoidh
  The more you explain it, the more I don't understand it. - Mark Twain
4. Re:Yeah! by forkazoo · 2008-12-05 09:22 · Score: 1
  
  erm. A beowulf cluster won't perform worse than any individual machine in it. The limitation is the cpu to memory path, which becomes saturated when you have N cores. And in a beowulf, you'd have many many of those paths, meaning you would still get a speedup - BUT each individual machine is limited as per TFA.
  Well, strictly speaking, a beowulf cluster *need not* perform worse, but there is no guarantee that it won't; fuckups happen. A moronic process migration controller may decide it is bored and move a process from one machine to another, and then you incur massive communication delays between processes compared to if they were all on the same machine. Different usage models have different best-cases, so if things are tuned wrong, you can accidentally waltz right into a worst case. Assume that everything is tuned for a workload where you have extremely high disk I/O for each process, so you want things running on as many machines as possible so that all disks are used. Then assume you run some fluid sim on that wrongly tuned cluster, and suddenly the interprocess communication requirements dominate compared to running on a single machine.
  Of course, you were responding to a joke, so it's probably silly to nit pick your nit pick of a goof.
5. Re:Yeah! by TapeCutter · 2008-12-05 15:12 · Score: 1
  
  Wonderfull! BTW: I thought about putting +1Albert because amoung the many words I can't spell is his last name, sure I could look it up but I am too lazy to swivel my chair and look at the biography sitting on the bookshelf.
  
  --
  And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.
So what does it mean for PCs? by theaveng · 2008-12-05 00:53 · Score: 3, Insightful

>>>"After about 8 cores, there's no improvement," says James Peery, director of computation, computers, information, and mathematics at Sandia. "At 16 cores, it looks like 2 cores."
>>>
That's interesting but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU larger than 8 cores is a waste of dollars, because it will actually run slower?

--
FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
1. Re:So what does it mean for PCs? by jebrew · 2008-12-05 01:35 · Score: 1
  
  No, you'll have a slew of desktop apps that will get split out amongst several cores. For the applications you're likely to run, the more cores the better (well, I'm sure there's an upper limit, but it's most likely much higher than 32).
2. Re:So what does it mean for PCs? by bigsexyjoe · 2008-12-05 01:58 · Score: 1
  
  What they are saying doesn't apply. The problem is specifically that a supercomputer usually runs one big demanding program. Right now, my task manager says that I am running 76 processes to look at the internet. So I could easily benefit from extra cores as each process being run could go on a separate core.
  
  --
  Democracy Now! - your daily, uncensored, corporate-free
3. Re:So what does it mean for PCs? by David+Gerard · 2008-12-05 01:58 · Score: 2, Funny
  
  It will only affect you if you're running ForecastFoxNG, where you can set the weather and the CPU will calculate where the butterfly should flap to get the effect you want (M-x butterfly).
  
  --
  http://rocknerd.co.uk
4. Re:So what does it mean for PCs? by Johann+Lau · 2008-12-05 02:42 · Score: 1
  
  Most of these processes hardly utilize the CPU though, and if you have 8 cores, a process can use at most 1/8 of your total CPU power, right? That might be nice when browsing and e-mailing and such, but when converting a big image or archiving lots of files it means most of the CPU will sit idle. So I tend to think 2 cores is kinda perfect: when a program does some heavy crunching, it can eat up 50% at most, and the other core can be used to run all those small trivial processes you mentioned smoothly.
5. Re:So what does it mean for PCs? by dreamchaser · 2008-12-05 03:19 · Score: 1
  
  Image processing was a bad example for you to use, as it lends itself well to multi-threaded operations.
6. Re:So what does it mean for PCs? by theaveng · 2008-12-05 03:46 · Score: 1
  
  I've only ever used a QuadCore PC once in my life:
  - Core 1 was 100% utilized.
  - Core 2 was only 25%.
  - Cores 3 and 4 were sitting idle doing nothing.
  It was clocked at 2000 megahertz and based upon what I observed, it doesn't look like I'm "hurting" myself by sticking with my "singlecore" 3100 megahertz Pentium. The multicores don't seem to be used very well by Windows, and my singlecore Pentium might actually be faster for my main purpose (web browsing/watching tv shows).
  
  --
  FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
7. Re:So what does it mean for PCs? by bigsexyjoe · 2008-12-05 03:59 · Score: 1
  
  Well, this is true. However, by that rationale they don't really need to develop microprocessors anymore because they are idle most of the time. However, there is some marginal advantage to increasing the percent of time an idle processor is available. Because even with two processors there will be a little bit of time that some processes spend waiting in a queue to be run.
  
  --
  Democracy Now! - your daily, uncensored, corporate-free
8. Re:So what does it mean for PCs? by Johann+Lau · 2008-12-05 04:48 · Score: 1
  
  That doesn't mean all image processing apps are multithreaded, does it? I also mentioned "archiving lots of files"... so let's just say "spikes of CPU usage" instead... switching to a browser tab with a complex webpage on it, for example, whatever... for the same total CPU power, the more cores you have, the lower the maximum possible speed for a single thread. It's neat when a single process cannot lock everything up, but you don't need more than two cores for that...
9. Re:So what does it mean for PCs? by VeNoM0619 · 2008-12-05 05:48 · Score: 1
  
  I still don't understand the problem... they are saying 1 computer with a limited throughput bus technology will be limited by adding more cores... well... duh? If it was a supercomputer, then chances are they will open the bandwidth to the processors/memory etc.
  
  Make it a car analogy: you can transport cargo, each car = core. You can only transport 2 cargo loads down a 2 lane road, adding a 3rd car makes them go slower... thinking the engineers won't consider expanding the road is idiotic.
  
  Adding cores isn't the problem... more cores = more power regardless, there's no limit, otherwise you are saying distributed computing has an upper limit?
  
  --
  Disclaimer: I am not god.
  We may not be created equal
  But we can be treated equal.
10. Re:So what does it mean for PCs? by dreamchaser · 2008-12-05 06:39 · Score: 1
  
  I think that depends upon the person and how they use their machine. I got a huge increase in performance going from 2 to 4 cores, but I often have some heavy multitasking going on.
11. Re:So what does it mean for PCs? by drachenstern · 2008-12-05 10:40 · Score: 1
  
  Are you assuming a fair scheduler, or are you assuming affinity based scheduling, or some amalgamation of the two?
  I'm personally just waiting on Microsoft to get affinity working as well on the desktop as it does on the server and to fix that little snag for us, offering permanent affinity settings in the UI... Perhaps Vista does that and I haven't drilled down looking for it?
  
  --
  2^3 * 31 * 647
12. Re:So what does it mean for PCs? by sjames · 2008-12-05 12:15 · Score: 1
  
  That's interesting but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU larger than 8 cores is a waste of dollars, because it will actually run slower?
  That really depends on how cache friendly the apps you run are.
It's so obvious... by Alwin+Henseler · 2008-12-05 00:55 · Score: 4, Interesting

That to remove the 'memory wall', main memory and CPU will have to be integrated.
I mean, look at general-purpose computing systems past & present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz system with a GB of RAM? Ever seen a GHz CPU coupled with a single KB of RAM? Why not? Because with very few exceptions, heavier compute loads also require more memory space.
Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up in little sections, with each section on top of a particular CPU core. I read sometime that semiconductor processes that are suitable for CPUs aren't that good for memory chips (and vice versa) - don't know if that's true but if so, let the engineers figure that out.
Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).
Perhaps the trick would be to make access to memory found on one of the other PU's transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory. With some intelligent routing to migrate blocks of data closer towards the core(s) that access it? Maybe that could be done in hardware, maybe that's better done on a software level. Either way: the technology isn't the problem, it's an architectural / software problem.
1. Re:It's so obvious... by theaveng · 2008-12-05 01:42 · Score: 1
  
  P.S. Your idea of putting memory on the CPU is certainly workable. The very first CPU to integrate memory was the 80486 (8 kilobyte cache), so the idea has been proven sound since at least 1990.
  
  --
  FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
2. Re:It's so obvious... by AlXtreme · 2008-12-05 01:44 · Score: 3, Informative
  
  You mean something like a CPU cache? I assume you know that every core already has a cache (L1) on multi-core systems, and shares a larger cache (L2) between all cores.
  The problem is that on/near-core memory is damn expensive, and your average supercomputing task requires significant amounts of memory. When the bottleneck for high performance computing becomes memory bandwidth instead of interconnect/network bandwidth you have something a lot harder to optimize, so I can understand where the complaint in IEEE comes from.
  Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...
  
  --
  This sig is intentionally left blank
3. Re:It's so obvious... by DRobson · 2008-12-05 02:00 · Score: 1
  
  Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...
  Even discounting price concerns, L1 caches can only increase a certain amount. As the capacity increases, so does the search time for the data, until you find yourself with access times equivalent to the next level down the cache hierarchy, thus negating use of L1. L1 needs to be /quite/ fast for it to be worthwhile.
4. Re:It's so obvious... by Funk_dat69 · 2008-12-05 02:58 · Score: 2, Insightful
  
  Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).
  It also sounds like you are describing the Cell Processor setup. Each SPU has local memory on-die - but cannot do operations on main memory (remote). Each SPU also has a DMA engine that will grab data from main memory and bring it into its local store. The good thing is you can overlap the DMA transfer and the computation so the SPUs are constantly burning through computation.
  This does help against the memory wall. And is a big reason why Roadrunner is so damn fast.
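  The overlap of DMA transfer and computation described above is usually called double buffering. Here is a rough timing model of it, assuming two local buffers, one DMA engine and fixed per-block costs; the function names and the time units are invented for the sketch, and this is not Cell SDK code.

```python
def serial_time(blocks, dma, compute):
    """Fetch a block, then compute on it, strictly one after the other."""
    return blocks * (dma + compute)


def double_buffered_time(blocks, dma, compute):
    """Fetch block i+1 into the spare buffer while computing on block i."""
    if blocks == 0:
        return 0
    # The first fetch cannot be hidden, and the last compute has nothing
    # left to overlap with; in between, the slower stage sets the pace.
    return dma + (blocks - 1) * max(dma, compute) + compute


print(serial_time(3, 2, 3), double_buffered_time(3, 2, 3))
```

  When compute per block takes at least as long as the transfer, almost all of the memory time disappears behind the computation; when the transfer is slower, the SPU still stalls and the memory wall is back.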
  
  --
  FUNK!
5. Re:It's so obvious... by TheRaven64 · 2008-12-05 03:36 · Score: 2, Informative
  
  More likely is going to something like the Cell's design. Cache is by definition hidden from the programmer, but on-die SRAM doesn't have to be cache, it can be explicitly-managed memory with instructions to bulk fetch from the slower external DRAM. For supercomputer applications, this would probably be more efficient, and lets you get rid of all of the cache coherency logic and use the space for more ALUs or SRAM.
  
  --
  I am TheRaven on Soylent News
6. Re:It's so obvious... by Cajun+Hell · 2008-12-05 04:04 · Score: 1
  
  Yes. Both a Commodore 64 and Commodore 128, although the 1 gigabyte RAM is typically used as a fast drive rather than as CPU-addressable DRAM.
  Are you sure that wasn't a megabyte of RAM?
  
  --
  "Believe me!" -- Donald Trump
7. Re:It's so obvious... by Cajun+Hell · 2008-12-05 04:08 · Score: 1
  
  The very first CPU to integrate memory was the 80486 (8 kilobyte cache), so the idea has been proven sound since at least 1990.
  I seem to recall the 68020 (1984?) having an instruction cache. (Though a lot smaller than 8k, if I recall.)
  
  --
  "Believe me!" -- Donald Trump
8. Re:It's so obvious... by theaveng · 2008-12-05 05:54 · Score: 1
  
  You are correct. The Motorola 68020 had 1/4 kilobyte of memory onboard, and was also a true 32-bit processor in 1984.
  I should have known. Motorola CPUs were always more-advanced than Intel. Of course I'm biased since I always preferred Amigas and Macs. ;-)
  
  --
  FOX NEWS.com should be BANNED from television and internet. Have the Congress take it over and give us Truespeak.
9. Re:It's so obvious... by Cajun+Hell · 2008-12-05 06:38 · Score: 1
  
  Motorola CPUs were always more-advanced than Intel. Of course I'm biased since I always preferred Amigas and Macs. ;-)
  Or maybe you're not biased, and preferred those machines because of their more advanced tech. :-)
  
  --
  "Believe me!" -- Donald Trump
10. Re:It's so obvious... by DoubleReed · 2008-12-05 09:15 · Score: 1
  
  Caches are SRAM, composed of purely logic transistors. (i.e. SRAM is made of the same stuff you build CPUs out of.)
  
  Main memory is DRAM, which is much much less expensive per bit due to much higher density. This higher density is achieved by using fewer parts per bit. DRAM is *not* made out of just logic transistors. One capacitor per bit is also required. So, it isn't just a matter of re-arranging parts that they already use in CPUs. ("Parts" meaning electrical components you can etch on a die, not anything discrete.)
  
  "The advantage of DRAM is its structural simplicity: only one transistor and a capacitor are required per bit, compared to six transistors in SRAM." (source wikipedia)
  
  Being able to put DRAM on the same die as a CPU would change the equation a little bit. Even if it didn't find its way into workstation grade CPUs, it would probably be useful for system on a chip applications / ASICs / FPGAs.
  
  Basically, anywhere that you only want to have a single chip for cost reasons. You can have flash, SRAM, voltage regulator, analog to digital and digital to analog conversion, all integrated into the same die as your CPU. But, if you want to have DRAM you need a second chip.
11. Re:It's so obvious... by lagomorpha2 · 2008-12-05 11:27 · Score: 1
  
  Being able to put DRAM on the same die as a CPU would change the equation a little bit. Even if it didn't find its way into workstation grade CPUs, it would probably be useful for system on a chip applications / ASICs / FPGAs.
  Good news! http://en.wikipedia.org/wiki/EDRAM "eDRAM stands for "embedded DRAM", a capacitor-based dynamic random access memory usually integrated on the same die or in the same package as the main ASIC or processor, as opposed to external DRAM modules and transistor-based SRAM typically used for caches."
12. Re:It's so obvious... by expatriot · 2008-12-05 11:40 · Score: 1
  
  Good point. Many of the posters above and below are confused about the difference between cache and on-chip RAM.
  Cache is great if you have huge programs. Smaller applications (such as might be used in a parallel element) work better with on-chip SRAM.
  The main difference is the way the memory is addressed. SRAM has its own address space; cache is a fast copy of conventional memory.
  There are some corner cases where you can lock down cache to make it function as SRAM, but generally on-chip SRAM is easier to optimize for than cache. Cache optimizes everything so needs (relatively) less care in programming.
  Of course the technology for the individual memory cells is the same. What is different is the addressing logic and (for cache) keeping track of whether the cache matches the external memory.
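  To make the addressing difference concrete, here is a minimal sketch in Python. The class names, the 4-line direct-mapped layout, and the word-sized lines are all assumptions for illustration, not a model of any particular chip. The scratchpad is just a small array with its own addresses; the cache has to carry tags and dirty bits so it can tell whether its copy still matches external memory.

```python
class Scratchpad:
    """On-chip SRAM with its own small address space (hypothetical model)."""

    def __init__(self, size):
        self.cells = [0] * size

    def read(self, addr):
        return self.cells[addr]          # no tags, no misses: always local

    def write(self, addr, value):
        self.cells[addr] = value


class DirectMappedCache:
    """Fast copy of a larger memory; must track which address each line holds."""

    def __init__(self, backing, lines):
        self.backing = backing           # the "external" memory (a list)
        self.lines = lines
        self.tags = [None] * lines       # which backing address is cached
        self.data = [0] * lines
        self.dirty = [False] * lines     # does the copy differ from backing?
        self.misses = 0

    def _fill(self, addr):
        idx = addr % self.lines
        if self.tags[idx] != addr:
            self.misses += 1
            if self.dirty[idx]:          # write the old line back first
                self.backing[self.tags[idx]] = self.data[idx]
            self.tags[idx] = addr
            self.data[idx] = self.backing[addr]
            self.dirty[idx] = False
        return idx

    def read(self, addr):
        return self.data[self._fill(addr)]

    def write(self, addr, value):
        idx = self._fill(addr)
        self.data[idx] = value
        self.dirty[idx] = True


memory = list(range(100, 116))           # 16 words of "external" memory
cache = DirectMappedCache(memory, lines=4)
cache.write(1, 7)                        # miss, then a dirty line for address 1
cache.read(5)                            # same line index: evicts and writes back
```

  All the bookkeeping in `_fill` is what the scratchpad never needs, which is roughly why it is easier to reason about when you are hand-optimizing a small kernel.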
13. Re:It's so obvious... by DoubleReed · 2008-12-06 05:14 · Score: 1
  
  Awesome :-)
Re:but.. by peragrin · 2008-12-05 01:01 · Score: 2, Interesting

So you're saying that next generation processors need a gig of cache. Plus 4gigs of ram.
I think what is really needed is new OS designs. Something that is no longer tied quite as close to the hardware. So that new hardware ideas can be tried.

--
i thought once I was found, but it was only a dream.
Memory by Detritus · 2008-12-05 01:02 · Score: 4, Insightful

I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.

--
Mea navis aericumbens anguillis abundat
Multiple CPUs? by Dan+East · 2008-12-05 01:03 · Score: 4, Insightful

This doesn't quite make sense to me. You wouldn't replace a 64 CPU supercomputer with a single 64 core CPU, but would instead use 64 multicore CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single core CPUs of old. So eventually you'll get 4 cores for the price of 2, then get 8 cores for the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.

--
Better known as 318230.
No.... by fitten · 2008-12-05 01:15 · Score: 1

Maybe they should do something like we did back when the Paragon (yes, that far back) had multiple CPUs on a node and the memory bandwidth wasn't enough to support them all simultaneously... Don't use some of the CPUs on the card (leave some idle) so that all the bandwidth is available to the one, or few, cores that need it. Alternatively, figure out a way (algorithms) to make sure that no more than one core is memory intensive at a time... take turns being bandwidth intensive. Or, just realize, as it's always been, that some solutions/algorithms just aren't optimal on commodity hardware.
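The "take turns" idea can be sketched with a counting semaphore. This is only an illustration of the scheduling pattern, under some loud assumptions: the names are made up, the "memory phase" is a stand-in list copy, and CPython threads do not actually run compute in parallel, so it shows the turn-taking and nothing about real bandwidth.

```python
import threading

MAX_MEMORY_HEAVY = 1                      # assumption: one bandwidth-heavy phase at a time
memory_turn = threading.Semaphore(MAX_MEMORY_HEAVY)

lock = threading.Lock()
active = 0                                # workers currently in their memory phase
peak_active = 0                           # highest value 'active' ever reached


def worker(data, results, i):
    global active, peak_active
    with memory_turn:                     # memory-intensive phase: wait for a turn
        with lock:
            active += 1
            peak_active = max(peak_active, active)
        chunk = list(data)                # stand-in for streaming through memory
        with lock:
            active -= 1
    # Compute phase: runs after the turn has been handed back.
    results[i] = sum(x * x for x in chunk)


data = range(10_000)
results = [0] * 4
threads = [threading.Thread(target=worker, args=(data, results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Raising `MAX_MEMORY_HEAVY` to 2 or 3 would correspond to letting "one, or few" cores hit memory at once, as in the first suggestion above.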
Re:Optical Computing? by arktemplar · 2008-12-05 01:15 · Score: 1

Well in a way it could be. I'd read the spectrum article some time back, but since I work in the field I can give some insight.
RAM latencies are a huge hit for applications that are based on random access. DRAMs etc. don't actually do random access the way you'd want: they take a long time to access one memory location, and then provide faster access to some successive elements. New processor architectures based on smart caches and intelligent memories could be a lot more useful. Basically, though, a rethinking of processor architecture is involved. In the end, electrical and computer engineering is still just that: engineering. There will always be tradeoffs.

--
blog plug -> The Darker Side of Light
Re:but.. by lloydchristmas759 · 2008-12-05 01:15 · Score: 1

And do you need a supercomputer to run a spellchecker ?

--
I'd give my right arm to be ambidextrous.
Re:but.. by KeithJM · 2008-12-05 01:16 · Score: 1

Isn't this already a problem in today's computers? The CPU isn't the bottleneck, the HDD is.
Generally this isn't true if you're talking about a supercomputer because of the tasks they'll be performing. You don't build supercomputers to be file servers (or even database servers, which can still use a lot of CPU)
The problem allegedly being.. by Junta · 2008-12-05 01:21 · Score: 4, Informative

For a given node count, we've seen increases in performance. The claimed problem is that for the workloads that concern these researchers, they don't see people mentioning significant enhancements to the fundamental memory architecture projected to follow the scale at which multi-core systems go. So you buy a 16 core chip system to upgrade your quad-core based system and hypothetically gain little despite the expense. Power efficiencies drop and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't lower if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that impressive.
All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. As memory architecture for the mass market is not considered as intrinsically exciting, I would wager there will be advancements that no one speaks to. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (like by a factor of 2). It means if Shanghai parts are considered satisfactory today to get respectable yield memory wise to support four cores, Nehalem, by a particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but the one metric is a highly important one in the supercomputer field.
For all the worry over memory bandwidth though, it hasn't stopped supercomputer purchasers from buying into Core2 all this time. Despite improvements in their chipset, Intel Core2 still doesn't reach AMD performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core2 in general. Sure, Cray and IBM supercomputers in the Top2 used AMD, but from the time of its release, Core2 has decimated AMD supercomputer market share despite an inferior memory architecture.

--
XML is like violence. If it doesn't solve the problem, use more.
1. Re:The problem allegedly being.. by amori · 2008-12-05 03:18 · Score: 2, Interesting
  
  Earlier this year, I had access to a large supercomputer cluster. Often I would run code on the supercomputer (with 1000+, 100, 10, 2 CPUs), and then I would try running it on my own dual core machine. Benchmarking the 2 CPUs for comparison purposes. More than anything, just the manner in which memory was being shared or distributed would influence the end results, tremendously. You really have to rethink how you choose to parallelize your vectors when dealing with supercomputers vs. multicore machines. As a researcher, I've found that I don't necessarily have the time to rewrite my code for both scenarios. I think this too might factor in heavily ...
Different Chip Architecture by rabun_bike · 2008-12-05 01:24 · Score: 1

You might see a super computer design around other RISC processors such as the ARM. A supercomputer using the ARM takes more chips perhaps but the power savings is substantial compared to the x86. Furthermore, companies like Nvidia with their Tesla platform are pushing into the supercomputing space with specialized chips that are purposefully designed to deal with large linear problem solving. Interestingly the Tesla chip is a multicore chip as well. http://www.nvidia.com/object/product_tesla_s1070_us.html
Re:but.. by nicolas.kassis · 2008-12-05 01:32 · Score: 1

Except that will obviously be slower due to the overhead of abstracting from the hardware. And that is already what most OS out there do. Linux on RISC does exist and so does Linux on * But there is a drawback to all that.
Re:Optical Computing? by thommym · 2008-12-05 01:36 · Score: 1

Someone who has already thought of this... http://www.nytimes.com/2008/03/24/technology/24wafer.html?_r=1&ref=technology

--
Don't feed the penguins
Re:Optical Computing? by nicolas.kassis · 2008-12-05 01:36 · Score: 1

Would optical get around such a barrier?
Physical space seems to be one of the major hurdles in CPU design today, due to leakage with the ever shrinking processes.
And i think it is about damn time that new silicon laser receiver thing (forgot the details) was put into implementation and testing.
IBM is already working on it. Stay tuned.
Re:but.. by David+Gerard · 2008-12-05 01:56 · Score: 4, Funny

Only for Office 2007.

--
http://rocknerd.co.uk
Re:So don't use conventional processors by David+Gerard · 2008-12-05 02:02 · Score: 1

So they can play GTA IV in their time off.

--
http://rocknerd.co.uk
Re:but.. by jellomizer · 2008-12-05 02:07 · Score: 1

Like on the 386 0 wait computers.

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Unganged channels = already non shared lanes today by DrYak · 2008-12-05 02:24 · Score: 5, Insightful

The issue is with a single processor that has multiple cores.
There's no real way to split the banks for each core, so the net effect is that you have 4-32 cores sharing the same lanes for memory.
No, sorry. That's how Phenom processors are *Already* working.
Each physical CPU package has two 64-bit memory controllers, each controlling a separate bank of 64-bit DDR-2 memory chips (each of the two banks on a dual-channel motherboard).
Phenoms have two modes of operation:
- Ganged: both memory controllers work in parallel, as if they were one huge 128-bit memory connection. That's how dual channel has worked since it was invented.
That's good for systems running a few very bandwidth-hungry applications (for example: benchmarks).
- Unganged: each memory controller works on its own. Thus you have two completely separate 64-bit memory channels accessible at the same time. By correctly laying out the applications in memory thanks to a NUMA-aware OS (anything better than Windows Vista), two separate applications can each access their own memory at the exact same moment, although at only half the bandwidth *per process* (but still the same total bandwidth for all processes running at the same time on a multi-core chip).
This is perfect for systems running lots of tasks in parallel, and is the default mode on most BIOSes I've seen.
This gives a tremendous boost to heavily multi-tasked applications (a busy database server, for example), and it's what TFA's authors are looking for.
Probably, at some point in the future, Intel will follow the same trend with its QPI processors.
Also, the future trend is to multiply the memory channels on the CPU: Intel has already planned Triple Channel DDR-3 for their high-end server Xeons (the first crop of QPI chips). AMD has announced 4 memory channels for their future 6- and 12- core chips targeting the G34 socket.
So the net effect of Unganged Dual Channel is that today you already have 4 cores having a choice of 2 sets of memory lanes to choose among, and within 1 year, you'll have 6-to-12 cores sharing 4 sets of memory lanes.
By the time you reach 32 cores per CPU, almost every slot will probably have its own dedicated memory channel (probably with the help of some technology which communicates serially with fewer lines, like FB-DIMM). Or even weirder memory interfaces (who knows? maybe DDR-6 will be able to give several simultaneous accesses to the same memory module).
So, well, once again, this proves that running stupid simulations without taking into account that other technologies besides the number of cores* will improve yields stupid, unrealistic results.
Shame on TFA's authors, because the trend to increase bandwidth has already started. A little bit more background research would have avoided this kind of stupidity.
But on the other hand, they would have missed the opportunity to publish an alarmist article with an eye catching title.
--
*: Although, yes, the number of cores you can slap inside the same package seems to be the "new megahertz" in the manufacturers' race, with some like Intel trying to increase this number faster without putting so much efforts on the rest.
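For what it's worth, the core-to-channel figures quoted above are easy to tabulate. A quick sketch using only the numbers from this comment (the labels are mine, and this says nothing about the actual bandwidth of each channel):

```python
# (label, cores, independent memory channels), figures as quoted in the comment
configs = [
    ("Phenom, unganged dual channel", 4, 2),
    ("announced G34 part, 6 cores", 6, 4),
    ("announced G34 part, 12 cores", 12, 4),
]


def cores_per_channel(cores, channels):
    """How many cores have to share each independent memory channel."""
    return cores / channels


for label, cores, channels in configs:
    print(f"{label}: {cores_per_channel(cores, channels):.1f} cores per channel")
```

Note that by this crude count the 12-core case has more cores per channel than today's quad cores, so whether the trend keeps up depends on how fast each channel gets as well as how many there are.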

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
Do you know anything about supercomputing tasks? by SaDan · 2008-12-05 02:30 · Score: 1

CUDA has zero benefit for supercomputing projects that cannot be broken into tiny bits and spread across multiple cores.
It's not just about memory, or clock speed.
as expected by tyler.willard · 2008-12-05 02:32 · Score: 2, Funny

"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
-Ken Batcher
Simple, if it doesn't work, don't use it. by JoeMerchant · 2008-12-05 02:35 · Score: 1

What's distressing here? That they have to keep building supercomputers the same way they always have? I worked with an ex IBM'er from their supercomputing algorithms department, he and I BSed about future chip performance a lot in the late 2006 - early 2007 timeframe. We were both convinced that the current approaches to CPU design were going to top out in usefulness at 8 to maybe 16 cores due to memory bandwidth.
I guess the guys at Sandia had to do a little more than BS about it before they published, but c'mon guys, this has been obvious for a while. And, if it's obvious to all of us out here, don't you think that Intel knew about it during their 2002 roadmap meetings?
Ok. soooo.... by Taibhsear · 2008-12-05 03:00 · Score: 1

Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.
So increase the bandwidth on the memory to something more suited to supercomputers then. Design and make a supercomputer for supercomputer purposes. You are scientists using supercomputers, not kids begging mom for a new laptop on christmas. Make it happen.
Well, duh.... by SpinyNorman · 2008-12-05 03:28 · Score: 2, Insightful

It's hardly any secret that CPU speed, even for single core processors, has been running ahead of memory bandwidth gains for years - that's why we have cache, and ever increasing amounts of it. It's also hardly any revelation to realize that if you're sharing your memory bandwidth between multiple cores then the bandwidth available per core is less than if you weren't sharing. Obviously you need to keep the amount of cache per core and the number of cores per machine (or, more precisely, per unit of memory subsystem bandwidth) within reasonable bounds to keep it usable for general purpose applications, else you'll end up in GPU-CPU (e.g. CUDA) territory where you're totally memory constrained and applicability is much less universal.
For cluster-based ("supercomputer") applications, partitioning between nodes is always going to be an issue in optimizing performance for a given architecture, and available memory bandwidth per node and per core is obviously a part of that equation. Moreover, even if CPU designers do add more cores per processor than is useful for some applications, no-one is forcing you to use them. The cost per CPU is going to remain approximately fixed, so extra cores per CPU essentially come for free. A library like pthreads, and different implementations of it (coroutine vs LWP based), gives you the flexibility over the mapping of threads to cores, and your overall across-node application partitioning gives you control over how much memory bandwidth per node you need.
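The sharing argument can be put into a back-of-the-envelope model. This is a roofline-style sketch with entirely hypothetical numbers (10 GFLOP/s per core, 20 GB/s shared, 1 flop per byte moved); it is not the Sandia simulation, and it only shows the levelling off, not the decline they report.

```python
def attainable_gflops(cores, peak_per_core, mem_bw_gbs, flops_per_byte):
    """Simple roofline-style bound: limited by compute or by shared memory."""
    compute_bound = cores * peak_per_core
    memory_bound = mem_bw_gbs * flops_per_byte
    return min(compute_bound, memory_bound)


# Hypothetical node: 10 GFLOP/s per core, 20 GB/s shared, 1 flop per byte moved.
for cores in (1, 2, 4, 8, 16, 32):
    total = attainable_gflops(cores, 10.0, 20.0, 1.0)
    print(f"{cores:2d} cores: {total:5.1f} GFLOP/s total, "
          f"{20.0 / cores:5.2f} GB/s per core")
```

With these made-up figures everything past two cores is idle silicon for this workload; a code with more flops per byte moved would push the knee further out, which is the partitioning control mentioned above.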
It is cost effective to have a low CPU utilization by thpr · 2008-12-05 03:53 · Score: 1

You *can* integrate memory and CPU on the same silicon die, but the overhead to do it in terms of additional error rates and processing tasks makes it economically inefficient to do so. (It is more economically efficient to build a larger cluster). It's possible (I'd even hedge likely) that we will see 3D packaging technology try to get around some of the latency. (More in my post a few years ago). However, the overall difference in speed between memory and processor is unlikely to change in the near future, so we need to continue to architect around that limitation at a system level, rather than a chip level.
The other thing I'd point out is that your analogy to "balanced" general purpose computing systems can (and should) fail for supercomputers... there is no rational reason to continue scaling in a linear fashion.
Then again, this is seriously old news. Trying to optimize a supercomputer to get anywhere close to 100% CPU utilization is known to be a problem. Others have already pointed out IBM's Blue Gene, and there's a reason it's a good example.
From Marc Snir, et al. at IBM, September, 2001 in a file called BlueGenePublic.pdf, which discussed the design philosophy for the Blue Gene supercomputer:
"Standard microprocessors are optimized for running as fast as possible one instruction stream...
Standard nodes suffer from 'von Neumann' bottleneck: computation speed increases much faster than memory access speed"
"Let's think from scratch.... in order to build general purpose systems that overcome constraints of conventional architectures.
Let's accept that significant improvements in cost/performance can be achieved by building an 'unbalanced' system" (emphasis added)
"CPU is a vanishingly small fraction of total system, in silicon area, or power
It is rational to build systems with a surfeit of compute power, so as to reduce memory requirements and reduce the need to move data around
It is cost effective to have a low CPU utilization"
Why didn't they consider multiple memory controllers? by Bartoki · 2008-12-05 04:04 · Score: 1

Use the same trick that RAID does: multiple memory modules in parallel. The SUN Niagara 2 processors have 4 memory controllers to feed the 4-6-8 cores (32-48-64 threads). The Tilera TILE64 processors also have 4 memory controllers to feed the 64 cores.
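A sketch of the RAID-style striping idea, assuming the simplest possible mapping: consecutive cache lines rotate across the controllers. The 64-byte line size and the modulo mapping are assumptions for illustration; the real Niagara 2 and TILE64 address mappings may well differ.

```python
from collections import Counter

LINE_BYTES = 64          # assumed cache-line size
CONTROLLERS = 4          # as on the chips mentioned above


def controller_for(addr):
    """Pick a controller by interleaving consecutive cache lines (assumed mapping)."""
    return (addr // LINE_BYTES) % CONTROLLERS


# A sequential sweep over 1 KiB touches every controller equally.
hits = Counter(controller_for(a) for a in range(0, 1024, LINE_BYTES))
print(dict(hits))
```

The catch is the same as with RAID: an access pattern whose stride is a multiple of `LINE_BYTES * CONTROLLERS` lands on one controller every time and gets no benefit from the other three.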
Re:Kill all engineering then! by Shamenaught · 2008-12-05 04:25 · Score: 4, Insightful

The phrase "By logical extension" is just another way of saying "This is a straw man argument"
I believe that the point he was making was not that it's pointless to go beyond X86 hardware, but that it's more cost-effective to use consumer hardware. Consumer hardware is not necessarily X86 hardware. See IBM's Roadrunner, presently the fastest supercomputer in the world, which uses an advanced version of the PS3's processor (the PowerXCell 8i).
In time, we'll probably see demand in consumer hardware for breaking past the boundaries and bottlenecks of multi-core processing, and so supercomputers will follow.

--
mysql> SELECT * FROM `places` WHERE `place` LIKE 'home`; Empty set (0.00 sec)
There are still vector processors out there. by flaming-opus · 2008-12-05 04:47 · Score: 2, Insightful

NEC still makes the SX9 vector system, and Cray still sells X2 blades that can be installed into their XT5 super. So vector processors are available, they just aren't very popular, mostly due to cost/flop.
A vector processor implements an instruction set that is slightly better than a scalar processor at doing math, considerably worse than a scalar processor at branch-heavy code, but orders of magnitude better in terms of memory bandwidth. The X2, for example, has 4 25gflop cores per node, which share 64 channels of DDR2 memory. Compare that to the newest xeons where 6 12 gflop processors share 3 channels of DDR3 memory. While the vector instruction set is well suited to using this memory bandwidth, a massively multi-core scalar processor could also make use of a 64-channel memory controller.
The problem is about money. These multicore processors are coming from the server industry. Web-hosting, database-serving, and middleware crunching jobs tend to be very cache-friendly. Occasionally they benefit from more bandwidth to real memory, but usually they just want a larger L3 cache. Cache is much less useful to supercomputing tasks, which have really large data-sets. The server-processor makers aren't going to add a 64-channel memory controller to server processors; it wouldn't do any good for their primary market, and it would cost a lot.
Of course, you could just buy real vector processors, right? Not exactly. Many supercomputing tasks work acceptably on quad-core processors with 2 memory channels. It's not ideal, but they get along. This has put a lot of negative market pressure on the vector machines, and they are dying away again. It's not clear if Cray will make a successor to the X2, and NEC has priced itself into a tiny niche market in weather forecasting, that is unapproachable by other supercomputer users, for price reasons.
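The X2 versus Xeon comparison above works out as follows, using only the figures quoted in this comment. Channel count is a rough proxy at best, since the two use different memory types (DDR2 vs DDR3), so treat the ratio as indicative rather than as a bandwidth measurement.

```python
def channels_per_gflop(cores, gflops_per_core, channels):
    """Memory channels available per GFLOP of peak compute."""
    return channels / (cores * gflops_per_core)


x2 = channels_per_gflop(cores=4, gflops_per_core=25, channels=64)    # Cray X2 node
xeon = channels_per_gflop(cores=6, gflops_per_core=12, channels=3)   # newest Xeon
print(f"X2:   {x2:.3f} channels per GFLOP")
print(f"Xeon: {xeon:.3f} channels per GFLOP")
print(f"ratio: {x2 / xeon:.1f}x")
```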
Multicore Is Doomed Unless... by Louis+Savain · 2008-12-05 04:48 · Score: 1

The continued increase in multicore processing power is doomed unless a solution to the memory bottleneck is found. We need a memory system that obviates the need for caching by completely eliminating bus contention in shared memory. This should be one of the primary research areas for companies like Intel and for government-funded research labs. We should pump billions of dollars into finding a solution to this problem over the next five years or ten years. I suspect that optical memory or quantum tunneling are promising areas of inquiry. This is what physicists should be focusing their efforts on instead of pursuing pipe dreams like quantum computing.
The number of cores per megabyte should double every 18 months so as to pursue the hypothetical ideal of one processor per byte. At that point we will have reached the end of the performance curve.
Multicore is a fallback plan by ClosedSource · 2008-12-05 05:12 · Score: 1

It's worth noting that multicore CPUs are just a plan B technology. What the market really wants is faster CPUs, but the current old technology can't deliver them, so CPU makers are trying to convince people that multicore is a good idea.
1. Re:Multicore is a fallback plan by Culture20 · 2008-12-05 07:49 · Score: 1
  
  In the days when the 1GHz barrier was breached for commodity hardware, I was using a dual CPU 300MHz workstation that seemed faster than the 1GHz single CPU boxes running the same version of redhat. It was then that I realized I always wanted more than one CPU in my personal machines to combat context switching. Multi-core may be bad for supercomputers, but it rocks for workstations.
Because single cores are just going to... by WolverineOfLove · 2008-12-05 05:18 · Score: 1

Even if this is the case, which sounds plausible... So what?

Somehow I have a sneaking suspicion that if multi-core has less performance in super computing than single-core... Companies will continue to manufacture specialized single-core processors for supercomputing.
And now for the local news by billcopc · 2008-12-05 05:43 · Score: 1

This just in:
* Intel sucks at making zillion-dollar computers
* AMD sucks at everything
* Supercomputer engineers are worried for their jobs
I realize these people have a legitimate complaint, but quite frankly if you're worried about a certain processor affecting your code, maybe you suck at programming ?! So what if the internal bandwidth is ho-hum ? These old dogs need to stop complaining and learn to adapt, else their overpaid jobs will be given to others who can.

--
-Billco, Fnarg.com
ADAPTABLE Chip Architecture by maz2331 · 2008-12-05 07:28 · Score: 1

Actually, my thinking is that rather than just tossing more cores at the problems, we should be looking at making the hardware adapt itself to the problem to be solved. IE: instead of just crunching "instructions" on data, we need hardware that effectively rewires itself to the problem at hand.
Something like an FPGA integrated into the architecture with huge gate/interconnect counts plus some "normal" cores may be a better approach. Done well, loops can be unrolled and executed in one clock cycle, entire memories can be created on-chip and then destroyed when no longer needed, etc.
Of course, this would require changing some programming techniques around, and require compilers far more advanced than we currently have available. Still, it should be at least a semi-achievable technology.
1. Re:ADAPTABLE Chip Architecture by rabun_bike · 2008-12-05 07:33 · Score: 1
  
  Sounds like a fun, interesting problem to me. I think one of the fundamental problems with supercomputers is that they generally are restricted to work on problems that can be broken down into individual pieces. Perhaps groups of FPGA chips might be adapted to portions of the problem and then aggregated as a whole to provide a solution?
Supercomputing is mostly a boondoggle by Animats · 2008-12-05 07:30 · Score: 1

Supercomputing is mostly a Government-funding boondoggle. The private sector buys few if any supercomputers.
Most of the US government applications are either related to nuclear weapons, or are busywork for underutilized nuclear weapons labs. Sandia, Los Alamos, and Livermore, lacking bombs to design, are looking for something else to justify their continued existence. To some extent, they're senior activity centers for old physicists. There's also "stockpile stewardship", which is an activity center for younger physicists. The idea is to keep some people around who can build an H-bomb if necessary, so that the technology isn't lost as people die off. Since that crowd isn't allowed to actually do much of anything, they want to simulate a lot. It's really a political problem. If the US and Russia allowed each other one underground bang each year, there would be less need for all this iffy simulation.
So don't worry too much about whining from Sandia about supercomputers. When Google or Amazon start complaining that multicore machines are choking in their server farms, it's time to listen.
Rather than laptops with zillions of CPU cores, we're probably going to see CPU chip real estate used for more cache, and maybe even main memory. The near future is the one-chip laptop that sells for $100 or so.
Re:Unganged channels = already non shared lanes to by jebrew · 2008-12-05 07:37 · Score: 1

Good call, I didn't see the Direct Connect(tm) stuff. I should try to keep more abreast of such things. Nice design too.
Re:but.. by PitaBred · 2008-12-05 08:26 · Score: 1

...so nothing ties the application to the hardware? You need to have SOMETHING there. That's basically the whole point of an OS... to tie itself to the hardware so the applications don't have to. If you want to try new hardware ideas, you need to write a new OS. There's no way around it. How in the hell did this get modded up?

--
My blog. Good stuff (when I remember to update it). Read it.
Multi-cores useless for any data intensive use? by boddhisatva · 2008-12-05 08:47 · Score: 1

What the article points out is that while the number of ALUs per chip has increased, memory to processor throughput has not. If you are working with large amounts of data (i.e. not factoring numbers) the processor is unable to keep the cores fed. Most supercomputer applications today involve large data sets. In one situation examined, an 8-core CPU performed about the same as a dual core, and with 16-64 cores performance degraded quickly to less than that of a dual core.
This is the memory bottleneck and is likely to be the case for database systems and other systems processing large data sets. The bottleneck needs a name. Any ideas?
If memory is the problem... by emil · 2008-12-05 08:50 · Score: 1
...then how about these approaches?
- Several memory controllers on the die?
- Will memristor memory be faster? Can it break 500MHz?
- Would a multiplexed memory controller, large scale interleaving, or some sort of new dedicated memory bus improve the situation?
- And, most importantly, which of these technologies are useful to gamers? Gaming seems to be driving supercomputing.
In any case, the article only refers to data mining. Perhaps these questions are better answered by Oracle, DB2, or other TPC score winners.
1. Re:If memory is the problem... by drinkypoo · 2008-12-05 09:13 · Score: 1
  
  Would a multiplexed memory controller, large scale interleaving, or some sort of new dedicated memory bus improve the situation?
  Ever looked up the Cray T3E?
  Doing so should answer all of these questions.
  
  --
  "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
2. Re:If memory is the problem... by postbigbang · 2008-12-05 09:17 · Score: 1
  
  Multiple memory controllers are ok, if they satisfy the problem of eliminating bottlenecks.
  Memristors and better FSB technology could help, see above.
  Multiplexed memory is intriguing, as are memory bus architectures that eliminate bottlenecks or achieve more rapid cache coherency and less state dependency in multi-core constructions.
  Gamers? Given the current nature of games, multi-player games suffer the most from I/O bandwidth problems. In terms of rendering, so as to aid player decision making process, it would seem that pseudoephedrine is really important. Just kidding.
  Data mining or large table processing/joins/rendering is a somewhat linear process, and while supercomputing components help, there are lots of optimizations yet to be squeezed in db activity. It's a whole other subject, fraught with fanbois and architectural whizzes/bigots.
  From an interdisciplinary perspective, I still hold that given current dimensionality, crossbar arrays with high speed interconnects are the most productive and reliable without changing lots of thinking.
  
  --
  ---- Teach Peace. It's Cheaper Than War.
Re:Grey Code addressing? by FourthAge · 2008-12-05 09:15 · Score: 1

There is less need for such a scheme today because sequential addresses are not sent across the memory bus. Instead, burst mode is used. A burst is specified by sending a start address and a size. This is necessary because the memory bus latency may be hundreds of clock cycles; bursts are the only way to achieve reasonable bandwidth in such conditions.
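To put rough numbers on why bursts matter, here is a small worked model. The 200-cycle latency and the 8 bytes per cycle peak rate are assumed figures, and the model charges the full latency to every request with nothing overlapped, which real memory controllers improve on.

```python
def effective_bandwidth(burst_bytes, latency_cycles, bytes_per_cycle):
    """Bytes delivered per cycle when each request pays the full latency once."""
    transfer_cycles = burst_bytes / bytes_per_cycle
    return burst_bytes / (latency_cycles + transfer_cycles)


# Assumed figures: 200-cycle latency, bus moves 8 bytes per cycle at peak.
for burst in (8, 64, 512, 4096):
    bw = effective_bandwidth(burst, 200, 8)
    print(f"{burst:5d}-byte burst: {bw:.2f} bytes/cycle ({bw / 8:.1%} of peak)")
```

The longer the burst, the more transfer time there is to amortize the fixed latency over, so single-word requests see only a tiny fraction of the peak rate.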

--
The tao of democracy: the government you can vote for is not the real government.
Sun ROCK Processor by turgid · 2008-12-05 09:18 · Score: 1

Sun's Niagara processors are not particularly suitable for supercomputers, but they have some innovative useful features for getting around the memory bandwidth problem.
Their truly innovative processor that should be superb for supercomputing is ROCK, if it ever sees the light of day.
As well as multiple cores, multiple threads per core (i.e. "contexts"), powerful floating-point cores and SIMD units, its killer feature is what sounds like a very clever kind of automatic speculative pre-fetching from main memory into cache.
Intel's chips always make me laugh a little. All that processing power and no memory or I/O bandwidth. They've only just caught up with AMD in that respect and now they're planning 80 cores with very little improvement in memory bandwidth...

--
Stick Men
ADD MORE CHANNELS by Chris+Snook · 2008-12-05 09:39 · Score: 1

There's a very simple solution for the memory and interconnect bandwidth bottlenecks, and that is to widen the channels. If you look at Intel's Nehalem roadmap, they're planning to move from triple-channel to quad-channel on the really monstrous chips coming out down the line. Likewise, AMD has been doing a lot of work on making HyperTransport channels more configurable, so you could allocate less bandwidth to an I/O bridge and more to the interconnect, if you're building a system where that's what's important.
If you're just adding more execution cores without changing what's around them, then this criticism holds, but the long-term significance of multi-core design is about VLSI, which lowers the latency between components. Latency is critical in supercomputing applications, so any time you can squeeze those gigaflops and their attached memory closer together, performance improves.
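The trade-off between core count and channel count can be put in numbers with a back-of-the-envelope sketch. It assumes bandwidth is split evenly across cores and scales linearly with channels; the function name and the 10 GB/s per-channel figure are hypothetical, not a specification of any Intel or AMD part.

```python
def bandwidth_per_core(channels, gb_per_s_per_channel, cores):
    """Average memory bandwidth available to each core, in GB/s.

    Assumes an even split across cores and perfect scaling with
    channel count, which real systems only approximate.
    """
    return channels * gb_per_s_per_channel / cores


PER_CHANNEL = 10.0  # GB/s, hypothetical figure for illustration only

for cores in (4, 8, 16, 32):
    triple = bandwidth_per_core(3, PER_CHANNEL, cores)
    quad = bandwidth_per_core(4, PER_CHANNEL, cores)
    print(f"{cores:2d} cores: triple-channel {triple:.2f} GB/s/core, "
          f"quad-channel {quad:.2f} GB/s/core")
```

In this model, going from three to four channels adds a third more bandwidth per core, while each doubling of the core count halves it.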

--
There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
Mod parent up by David+Gerard · 2008-12-05 10:00 · Score: 1

*applause*

--
http://rocknerd.co.uk
nVidia warned you by SoopahCell · 2008-12-05 10:48 · Score: 1

A previous Slashdot article included an nVidia executive saying Intel has been wrong on CPU design for a long time - that the critical design feature needs to be memory bandwidth, not CPU ticks or speed or any of the numbers they've so far focused on.
But I think this just shows supercomputer designers need to stop thinking about CPUs and start thinking about GPUs. Multicore is here and commoditized already, and if you can do your work on shaders then you're looking at not 8, 16, or 32 cores but 640 or 1280 cores to do your work, all with bus designs that put memory first.
Re:but.. by peragrin · 2008-12-05 10:58 · Score: 1

Yeah, but the OS is so tied into the hardware that you can't port apps out of the OS/hardware combo. Applications shouldn't care what they are running on, hardware or software.
The only system to even begin to accomplish that is Inferno.

--
i thought once I was found, but it was only a dream.
Re:Have they heard of CUDA? by kramulous · 2008-12-05 11:55 · Score: 1

GPUs don't do error detection/correction. Not a desirable feature for scientific models.

--
.
Re:Unganged channels = already non shared lanes to by poached · 2008-12-05 11:55 · Score: 1

bravo.
Re:Unganged channels = already non shared lanes to by Agripa · 2008-12-05 12:17 · Score: 1

I was just reading the AMD documentation on the Phenom. The part I found interesting was that Phenom uses the same 144 bit ECC code in both ganged and unganged mode. In the latter case, the ECC code is used across two 72 bit transfers from the same channel which optionally can be bitwise interleaved. 4 bit chipkill correction is lost in this case but detection still works.
Re:Parallel processing by drachenstern · 2008-12-05 12:44 · Score: 1

Do you mean for multiple-processor, single-box supercomputers, or for clusters? For clusters they already do, and for supercomputers-in-a-box they would have to.
In re: clusters, start here http://en.wikipedia.org/wiki/Message_Passing_Interface or http://en.wikipedia.org/wiki/OpenMP for more info.

--
2^3 * 31 * 647
Re:Why didn't theconsider multiple memory contolle by drachenstern · 2008-12-05 14:14 · Score: 1

erm, don't you still only have one RAID controller per system? Don't I/O requests still bottleneck at the DMA interface if they're faster ($DEITY forbid that happening) because there is only one bus? So how does your RAI[R|M] facilitate this system? You've still got to coordinate all that memory. Unless....
What if each RAM chip is assigned a write bit to one DMA style controller, but multiple or various other controllers have READ ONLY access to the RAM? I don't know if this is even feasible, but it's a thought...

--
2^3 * 31 * 647
CPU cores or CUDA cores, where to go? by Douglas+Goodall · 2008-12-06 04:29 · Score: 1

I bought a Mac Pro 8-core machine to learn multi-core programming, then I discovered NVIDIA CUDA programming and I am looking at buying a C1080 240-core GPU to learn to program that. The industry is manufacturing lots of multi-core devices, but parallel programming hasn't adapted to this new paradigm or provided the right tools to take advantage of these new technologies.
Vectors yes, but the bandwidth to use them? by flaming-opus · 2008-12-08 11:08 · Score: 1

Vector processing in commodity designs isn't enough. Of course we are going to see it; at this point it's not very expensive to add. Adding vector processing for increased flops is easy. The hard part is the bandwidth. One of the reasons the X1 processors were expensive was that they were custom, but so are the network chips in commodity-CPU supers, and they only add $1000/node. The real cost of X1-style memory is that you have 64 channels of memory, which is a lot of wires, DIMMs, memory parts, etc. There's a very real cost to all the memory components needed to get the kind of bandwidth you need to support a high-throughput vector pipeline.
The commodity processor vendors aren't going to do this sort of thing, as it adds to the cost of the chip, but provides nothing to the bulk of their customers who are running MySQL, Apache, or Half-Life.
The one hope I have is something like the Core 2 architecture, where DDR3 is used for desktop processors, and FB-DIMM is used for server parts. The two components share a lot of architecture, and only a few of the ASIC cells are different. If a CPU vendor were interested in the HPC market, they could design a CPU to use a standard memory channel for desktop/low-end server parts, and something more expensive, but higher bandwidth, for the HPC space. It would mean HPC-specific processors, but sharing most of the engineering with the commodity part. Maybe Cray could license them the design for their Weaver memory controller in the X2. It's kind of like the AMB on an FB-DIMM, but it includes 4 channels of DDR2 on each stick of memory.
Cores per memory controller. by flaming-opus · 2008-12-08 11:19 · Score: 1

I'd love to see each core on a massively multicore design get its own memory controller. I'm not holding my breath, however. If you think of a 32-core CPU, it's pretty unlikely that most supercomputer or cluster vendors are going to pay for 32 DIMMs for each CPU socket. So then you're talking about multiple memory channels per memory stick. You can still get ECC using 5 memory chips per channel, so you can imagine 4 channels fitting on a memory riser. Cray does this on the X2. Then 32 channels would only require 8 DIMMs, which is reasonable. Then what do you do for 64-core CPUs?
It's tricky, and the problem for the market is that it's expensive. Can you get the commodity CPU vendors interested in such a thing, given that most of their addressable market is not in the supercomputing space?
I think we're gonna see more cores in a CPU than there's bandwidth to use. They might increase the bandwidth a bit, but probably just enough to get good linpack numbers.
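The DIMM-count arithmetic above can be sketched as follows, assuming one memory channel per core unless stated otherwise. The helper name and its parameters are hypothetical; only the 32-core, 4-channels-per-riser figures come from the comment.

```python
import math


def dimms_needed(cores, channels_per_dimm, cores_per_channel=1):
    """Number of memory sticks needed per CPU socket.

    Assumes each group of `cores_per_channel` cores gets a dedicated
    channel and each stick carries `channels_per_dimm` channels.
    """
    channels = math.ceil(cores / cores_per_channel)
    return math.ceil(channels / channels_per_dimm)


print(dimms_needed(32, channels_per_dimm=1))  # one channel per stick
print(dimms_needed(32, channels_per_dimm=4))  # four channels per riser
print(dimms_needed(64, channels_per_dimm=4))  # the 64-core question
```

With one channel per stick, a 32-core socket needs 32 DIMMs; with four channels per riser it needs 8, and a 64-core socket needs 16 unless cores start sharing channels again.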
x86 processors were never designed for HPC by gupg · 2008-12-12 18:48 · Score: 1

Note the comment from Steve Conway from IDC:
Steve Conway, senior analyst with IDC for high performance computing issues, said this problem has been around for a while, and multi-core is only exacerbating it. "x86 processors were never designed for HPC," he told InternetNews.com. "Those processors were not designed to communicate with each other at a high speed. With these big systems, you have to move data over large territories.