Slashdot Mirror


The Potential of Science With the Cell Processor

prostoalex writes "High Performance Computing Newswire is running an article on a paper by computer scientists at the U.S. Department of Energy's Lawrence Berkeley National Laboratory. They have evaluated the processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The full paper is available from the Computer Science department at Berkeley."


  1. Cell + Linux = success by Anonymous Coward · · Score: 3, Funny

    OS X is closed source. This means that it is the work of the devil - its purpose is to make the end users eat babies.

    Linux is the only free OS. Yes the BSD licenses may appear more free, but as they have no restrictions, they are actually less free than the GPL. You see, restricting the end user more actually makes them more free than not putting restrictions on them. You must be a dumb luser for not understanding this.

    And you obviously dont have a real job. A real job involves being a student or professional academic. You see, academics are the ones who know all about productivity - if you work for a commercial organisation you obviously do not know anything about computers. Usability is stupid. Whats wrong with the command line? If you cant use the command line then you shouldnt be using a computer. vi should be the standard word processor - you are such a luser if you want to use Word. Installing software should have to involve recompiling the kernel of the OS. If you dont know how to do this, you are a stupid luser who should RTFM. Or go to a Linux irc channel or newsgroup. After all, they are soooo friendly. If you dont know how the latest 2.6 kernel scheduling algorithm works then they will tell you to stop wasting their time, but they really are quite supportive.

    Oh, and M$ is just as evil as Apple. Take LookOUT for instance. You could just as easily use Eudora. Who needs groupware anyway, a simple email client should be all we use (thats all we use as academics, why cant businesses be any different).

    And trend setters - Linux is the trend setter. It may appear KDE is a ripoff from XP, but thats because M$ stole the KDE code. We all know they have GPL'ed code hidden in there somewhere (but not the things that dont work, only the things that work could possibly have GPL'ed code in it).

    And Apple is the suxor because they charge people for their product. We all know that its a much better business model to give all your products away for free. If you charge for anything, then you are allied with M$ and will burn in hell.

  2. What about the compiler? by Watson+Ladd · · Score: 2, Insightful

    The paper did a lot of hand-optimization, which is irrelevant to most programmers. What gcc -O3 does is way more important than what an assembly wizard can do for most projects.

    --
    Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD
    1. Re:What about the compiler? by Anonymous Coward · · Score: 5, Insightful

      Hand optimization _is_ relevant to scientific programmers

    2. Re:What about the compiler? by TommyBear · · Score: 5, Insightful

      Hand optimizing code is what I do as a game developer and I can assure you that it is very relevant to my job.

    3. Re:What about the compiler? by suv4x4 · · Score: 2, Interesting

      The paper did a lot of hand-optimization, which is irrelevant to most programmers. What gcc -O3 does is way more important than what an assembly wizard can do for most projects.

      Actually bullshit. We're talking scientific applications here, and it's not uncommon that programs written to run on supercomputers *are* optimized by an assembly wizard to squeeze every cycle out of it.

    4. Re:What about the compiler? by maximthemagnificent · · Score: 1

      Hard science is exactly the sort of application that would employ an assembly programmer to optimize code.

    5. Re:What about the compiler? by C.A.+Nony+Mouse · · Score: 1

      That games can be written to run well on Cell is not news. That the same might be true for scientific code is.

      --
      J
    6. Re:What about the compiler? by Anonymous Coward · · Score: 2, Informative

      Insightful? Ah... no.

      Scientific users code to the bleeding edge. You give them hardware that blows their hair back and they will figure out how to use it. You give them crappy painful hardware (Maspar, CM*) that is hard to optimize for, then they probably won't use it.

      Assembly language optimization is not a big deal. Right now the biggest thing bugging me is that I have to rewrite a core portion of a code to use SSE, since SSE is so limited for integer support. As this is a small amount of work, and the potential gains are so large (about 4x), it doesn't make sense not to do this. Some of it will be hand coded and optimized assembler. This is how we have to program. Scientists need the fastest possible cycles, and as many of them as possible ... at least the ones I know need this. There are a few who do all their analysis on Excel spreadsheets. They don't need much in the way of speed. The rest of us do.

    7. Re:What about the compiler? by Watson+Ladd · · Score: 1

      Most projects, not some superexpensive code. Sure, fast APIs like BLAS will use hand-written assembler, but it takes a compiler to find those optimizations that are too complex to do by hand or hard to find while being easy to do. And the assembler advantage is negative on some RISC processors now due to advances in compiler design. So gcc -O3 might outperform asm, and then gcc -O3 is relevant, as nobody will want to use asm if gcc can outperform it. But I haven't seen anything about how true this is for the Cell.

      --
      Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD
    8. Re:What about the compiler? by samkass · · Score: 5, Insightful

      What seems to be more important than that is:

      "According to the authors, the current implementation of Cell is most often noted for its extremely high performance single-precision (32-bit) floating performance, but the majority of scientific applications require double precision (64-bit). Although Cell's peak double precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double precision performance."

      So the Cell is great because there's going to be millions of them sold in PS3's so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.

      --
      E pluribus unum
    9. Re:What about the compiler? by Anonymous Coward · · Score: 1, Insightful

      Methinks that the point was that if a GAME development company is going to fork over the cash for ASM wizards, a company spending a few hundred mil. building a super-computer might just consider doing the same. Maybe.

      And I know from Uni that many profs WILL hand optimize code for complex, much used algorithms. Then again, some will just use matlab.

    10. Re:What about the compiler? by JanneM · · Score: 3, Informative

      Hand optimizing code is what I do as a game developer and I can assure you that it is very relevant to my job.

      It makes sense for a game developer - and even more an embedded developer. You spend the time to optimize once, and then the code is run on hundreds of thousands or millions of sites, over years. The time you spend can effectively be amortized over all those customers.

      For scientific software the calculation generally changes. You write code, and that code is typically used in one single place (the lab where the code was written), and only run a comparatively few times, indeed sometimes only once.

      For a game developer to spend three months extra to shave a few seconds off one run of a piece of code makes perfect sense. For an embedded developer, using a couple of months' worth of development cost to be able to use a slower, cheaper chip, shaving a dollar off the production cost of perhaps tens of millions of gadgets, makes sense.

      For a graduate student (cheap as they are in the funny-mirror economics of science) to spend three months to make one single run of a piece of software run a few hours faster does not make sense at all.

      In fact, disregarding the inherent coolness factor of custom hardware, in most situations it just doesn't pay to make custom stuff for science when you can just run it for a little longer to get the same result. In fact, not infrequently have I heard about labs spending the time and effort to make custom stuff, but by the time they're done, the off the shelf hardware had already caught up.

      --
      Trust the Computer. The Computer is your friend.
    11. Re:What about the compiler? by penguin-collective · · Score: 2, Insightful

      Except for a tiny minority of specialists, most scientific programmers, even those working on large-scale problems, have neither the time nor the expertise to hand-optimize. Many of them don't even know how to use optimized library routines properly.

    12. Re:What about the compiler? by FromWithin · · Score: 2, Informative

      So the Cell is great because there's going to be millions of them sold in PS3's so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.

      Did you not read the last bit?

      On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance. If Cell were to include at least one fully utilizable pipelined double precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double.

      So it's really great already. If it was tweaked a bit, it would be ludicrously great.

    13. Re:What about the compiler? by cfan · · Score: 2, Interesting

      > So the Cell is great because there's going to be millions of them sold in PS3's so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.

      No, the Cell is great because, as the pdf shows, it has an incredible Gflops/Power ratio, even in its current configuration.

      For example, here are the Gflops (double precision) obtained in 2d FFT:

              Cell+   Cell   X1E    AMD64   IA64
      1K^2    15.9    6.6    6.99   1.19    0.52
      2K^2    26.5    6.7    7.10   0.19    0.11

      So a single, normal, Cell can be compared with the processor of a Cray (that uses 3 times more power and costs a lot more).

    14. Re:What about the compiler? by john.r.strohm · · Score: 4, Interesting

      Irrelevant to most C/C++ code wallahs doing yet another Web app, perhaps.

      Irrelevant to people doing serious high-performance computing, not hardly.

      I am currently doing embedded audio digital signal processing. On one of the algorithms I am doing, even with maximum optimization for speed, the C/C++ compiler generated about 12 instructions per data point, where I, an experienced assembly language programmer (although having no previous experience with this particular processor) did it in 4 instructions per point. That's a factor of 3 speedup for that algorithm. Considering that we are still running at high CPU utilization (pushing 90%), and taking into account the fact that we can't go to a faster processor because we can't handle the additional heat dissipation in this system, I'll take it.

      I have another algorithm in this system. Written in C, it is taking about 13% of my timeline. I am seriously considering an assembly language rewrite, to see if I can improve that. The C implementation as it stands is correct, straightforward, and clean, but the compiler can only do so much.

      In a previous incarnation, I was doing real-time video image processing on a TI 320C80. We were typically processing 256x256 frames at 60 Hz. That's a little under four million pixels per second. The C compiler for that beast was HOPELESS as far as generating optimal code for the image processing kernels. It was hand-tuned assembly language or nothing. (And yes, that experience was absolutely priceless when I landed on my current job.)

    15. Re:What about the compiler? by Angstroman · · Score: 1

      So the Cell is great because there's going to be millions of them sold in PS3's so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.

      The HPC world is substantially different from either gaming or "normal" application programming. The strong draw of the cell is that it is a production core with characteristics that are important to High Performance Computing, particularly power dissipation per flop. While conventional applications target getting the most out of a processor, HPC applications center on scalability in number of processors. This means running the largest number of processors for a given power/cooling supply, and maintaining the lowest latency in interprocessor communication. The latter is closely related to the physical layout of the processor array, which is also dependent upon cooling strategy. Hand coding, or at least hand optimization of the code, is reasonable for these applications. The resulting improvement can make possible calculations that would otherwise not be accomplished. As the number of processors increases substantially, the leading issue shifts from local execution speed to load balancing. Load balancing requires at least an initial "hand code" for a given architecture in any event.

      There are several application spaces for HPC. Some, like semantic network processing do not require double precision and can be mounted on cell processors as they stand. Those which are fundamentally based on massive differential equation solution would benefit from the double precision modification. The key point here is that the double precision pipeline unit is a modification, not a different core. It is likely that IBM can make such a change at a fraction of the cost of the original core development with benefits not only to the HPC community, but also to potential workstation use.

      The bottom line is that one can be easily misled trying to think of HPC architectures and programming from the familiar standpoint of game and web server development.

    16. Re:What about the compiler? by Gromius · · Score: 1

      I'm a particle physicist. Our computing needs are insane but massively parallel; basically the grid is being developed for us and us alone, although we figure that some other people might find a use for it. We spend the vast majority of our day-to-day job programming. And we're, with only a few exceptions, piss poor at it. Forget hand-optimized assembly, I'm currently fighting a losing battle to stop people using x = pow(y,2) (and I have found that in our base software package, one supposedly written by the experts). However, the solution usually is just to buy a faster machine to run it on.
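
      For anyone wondering what the pow(y,2) complaint looks like in practice, here is a minimal sketch in C. The helper name square is made up for illustration, and some compilers may well make this substitution on their own at higher optimization levels, so treat it as an illustration of the habit rather than a guaranteed speedup.

```c
/* Hypothetical helper (name invented for this sketch): squares a value
 * with a single multiplication instead of calling the general-purpose
 * pow() routine, i.e. x = square(y) rather than x = pow(y, 2). */
static double square(double y)
{
    return y * y;
}
```

      A call such as x = square(y); then replaces x = pow(y,2); wherever the exponent is a fixed small integer.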

    17. Re:What about the compiler? by Frumious+Wombat · · Score: 1

      Actually, for my field (Chemistry), what GCC -O3 does is irrelevant, except during the development phase of a program, or as a last resort for portability. We care about what the fastest native compiler we can find + optimized libraries does. The Cell will be no different; a few hand-optimized routines such as BLAS, FFTPack, etc, in libraries, then an auto-vectorizing Fortran-95 compiler on top. I will be interested in seeing how packages such as GAMESS or NWChem http://www.emsl.pnl.gov/docs/nwchem/nwchem.html/ behave once Fortran is available, and Cell shipped in something other than game consoles.

      On the other hand, the GROMACS guys http://www.gromacs.org/, who write hand-optimized code on a per-processor basis, ought to be stoked. It already runs well using single-precision, so it looks to be tailor-made to a Cell-based setup.

      --
      the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
    18. Re:What about the compiler? by statusbar · · Score: 1

      Jeez, that reminds me of the "Database Specialists" doing "SELECT * from mytable;" and then doing a java for() loop to find the rows they are interested in.. Then they complain about the database machine being too slow so they get it upgraded.

      How much do these new machines cost?

      How much does a competent programmer cost?

      Which one is the best option?

      --jeffk++

      --
      ipv6 is my vpn
    19. Re:What about the compiler? by JanneM · · Score: 1

      The data analysis would have taken a few weeks if it weren't for some clever optimisations. So I don't think the time I spent on that is wasted time.

      It's not wasted time if the time spent optimizing is less than the time saved. So for your example, assuming it wasn't algorithmic optimizations (which are orthogonal to doing funky assembler stuff), you may save a few days on a few weeks running time. So if the optimization took a couple of days of coding it may have been worth it. Otherwise it was not.

      And for scientific apps especially, you really do have to factor in the added cost of tweaking the software - you _always_ need to tweak, often over many cycles - when part of it is as opaque and difficult to understand as assembly optimizations are (which often implies explicit use of the semi-parallel features of modern CPUs today).

      --
      Trust the Computer. The Computer is your friend.
    20. Re:What about the compiler? by Shinobi · · Score: 1

      Well, that's where you're wrong. There are more people who hand-optimize than the academic world cares to admit, since admitting it would also mean admitting that the oh-so-sacred academic practices as well as compiler technology+libraries has some areas where they can't be applied efficiently.

    21. Re:What about the compiler? by adam31 · · Score: 3, Informative

      Actually bullshit.

      Actually, it's not bullshit. Simple C intrinsics code is the way to go to program the Cell... there's just no need for hand-optimized asm. Intrinsics has a poor rep on x86 because SSE sucks. 8 registers. A source operand must be modified on each instr, no MADD, MSUB, etc.

      But Cell has 128 registers and a full set of vector instructions. There's no danger of stack spills. As long as the compiler doesn't freak out about aliasing (which is easy), and it can inline everything, and you present it enough independent execution streams at once... the SPE compiler writes really, really nice code.

      The thing that does need to be hand-optimized still is the memory transfer. DMA can be overlapped with execution, but it has to be done explicitly. In fact, algorithms typically need to be designed from the start so that accesses are predictable and coherent and fit within ~180kb. (Generally, someone seeking performance would do this step long before asm code on any platform anyway...)
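
      The overlap of transfer and execution described above is usually arranged as double buffering. Below is a rough, portable sketch of the loop structure only. It assumes a plain memcpy as a stand-in for an asynchronous DMA "get", so nothing actually overlaps here; the names fetch_chunk, sum_double_buffered and CHUNK are invented for this sketch and are not part of any Cell SDK.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; hypothetical and tiny for illustration */

/* Stand-in for an asynchronous DMA "get". This is an assumption of the
 * sketch: it is a synchronous memcpy, so no real overlap happens. */
static void fetch_chunk(float *local, const float *remote, size_t n)
{
    memcpy(local, remote, n * sizeof *local);
}

/* Sum `total` floats (total is assumed to be a multiple of CHUNK) while
 * alternating between two local buffers: one is being filled while the
 * other is being processed. */
static float sum_double_buffered(const float *remote, size_t total)
{
    float buf[2][CHUNK];
    float sum = 0.0f;
    size_t nchunks = total / CHUNK;
    int cur = 0;

    if (nchunks == 0)
        return 0.0f;

    fetch_chunk(buf[cur], remote, CHUNK);          /* prime the first buffer */
    for (size_t c = 0; c < nchunks; c++) {
        int next = cur ^ 1;
        if (c + 1 < nchunks)                       /* start the next transfer */
            fetch_chunk(buf[next], remote + (c + 1) * CHUNK, CHUNK);
        for (size_t i = 0; i < CHUNK; i++)         /* compute on the current buffer */
            sum += buf[cur][i];
        cur = next;                                /* swap buffer roles */
    }
    return sum;
}
```

      With real asynchronous transfers, a wait for the pending transfer would go just before the buffers are swapped; the access pattern stays predictable because each chunk is requested one iteration ahead of its use.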

    22. Re:What about the compiler? by SeeMyNuts! · · Score: 1

      "a company spending a few hundred mil. building a super-computer might just consider doing the same"

      Well, if they hire the typical contractor to do the work, $10 million goes towards the computer, $90 million goes towards a coffee service, and $300 million goes towards per diem.

    23. Re:What about the compiler? by fbg111 · · Score: 1

      It's a good start, a good platform upon which to expand. I would bet IBM would be willing to make Cell+, given their traditional involvement in scientific computing. But you left a key part out of your quote, that even in its current form, Cell appears to be 8x faster and more power efficient than current Opterons and Itaniums in double-precision calculations. Doubling that by making a few modifications to the silicon is probably not out of the question, though whether this would allow Cell+ the price reductions of Cell's economy of scale is another question.

      "Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance. If Cell were to include at least one fully utilizable pipelined double precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double."

      --
      Flying is easy, just throw yourself at the ground and miss. -Douglas Adams
    24. Re:What about the compiler? by netwiz · · Score: 1

      it takes a compiler to find those optimizations that are too complex to do by hand or hard to find while being easy to do.

      Say what? Um, that type of optimization doesn't exist, unless the programmer is really untalented. Most of the big opportunities should stand out like a sore thumb on a trace. Once you know what's taking all the time in the code, you can look at the way it's put together to catch the low-hanging fruit. Generally, the first 10% of the work gets you 90% of the way there. Then there's all the corner-case work, the 90% that gets you the other 10%. In both cases, so long as you've the appropriate tools, finding opportunity for speedup is relatively easy, excepting in cases where a routine needs a complete reevaluation from the ground up. This can happen if the data model's not quite right, or there's significant resource blocking. These are a real bitch because they can cause you to completely redesign large sections of code, and there's absolutely nothing an optimizing compiler will do to help in those cases.

      I should point out that a well-designed program should not encounter the last two issues, as it suggests the problem wasn't well-enough understood beforehand. You have to know exactly what you want to do with a computer before you figure out how you're going to do it.

    25. Re:What about the compiler? by uarch · · Score: 1

      Be careful assuming a 3x cut in the number of instructions is a 3x speed increase.
      There's plenty of instances where that isn't true.

    26. Re:What about the compiler? by adam31 · · Score: 3, Informative

      I am also an experienced assembly programmer, and I too shared your mistrust of the compiler. However, I started SPE programming several months ago and I promise you that the compiler can work magic with intrinsics now. Knowledge of assembly is still helpful, because you need to have in mind what you want the compiler to generate... make sure it sees enough independent execution clumps that it can cover latencies and fill both the integer pipe and FP pipe, understand SoA vs AoS, etc. But you get to write with real variable names, not worry about scheduling/pairing of individual instructions or loop unrolling issues.
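
      For readers who have not met the SoA vs AoS distinction, here is a small sketch in plain C. The type and function names are invented for illustration, and N is an arbitrary small size. The point is only the memory layout: in the structure-of-arrays form, all x values are contiguous, which is the shape a vector unit can load directly.

```c
#include <stddef.h>

#define N 4  /* number of points; arbitrary for this sketch */

/* Array of Structures (AoS): x, y and z of one point sit next to each other. */
struct point_aos { float x, y, z; };

/* Structure of Arrays (SoA): all x values are contiguous, then all y, then all z. */
struct points_soa { float x[N], y[N], z[N]; };

/* Sum the x components from the AoS layout: the stride between
 * consecutive x values is sizeof(struct point_aos). */
static float sum_x_aos(const struct point_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Sum the x components from the SoA layout: consecutive x values are
 * adjacent in memory, so several fit into one vector register at once. */
static float sum_x_soa(const struct points_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

      Both functions return the same result; only the stride through memory differs.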

      Some of my best VU routines that I spent a couple weeks hand-optimizing, I re-wrote with SPE intrinsics in an afternoon. After some initial time figuring out exactly how the compiler likes to see things, it was a total breeze. My VU code ran in 700 usec while my SPE code ran in 30 usec (@ ~1.3 IPC! Good work, compiler).

      The real worry now is becoming DMA-bound. For example, assuming you're running all 8 SPEs full-bore, and you write as much data as you read. At 25.6 GB/s, you get 3.2 GB/s per SPE, so 1.6 GB/s in each direction (assuming perfect bus utilization), so @3.2 GHz, that's 0.5 Bytes/cycle. So, for a 16-byte register, you need to execute 32 instructions minimum or you're DMA-bound!
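
      The arithmetic in the paragraph above can be spelled out as a short C sketch. It uses only the figures quoted there (25.6 GB/s shared by 8 SPEs, equal read and write traffic, 3.2 GHz, 16-byte registers) and additionally assumes roughly one instruction per cycle; the function names are invented for this sketch.

```c
/* One-way DMA bandwidth available to each SPE, in bytes per clock cycle.
 * Assumes the bus is shared evenly and that reads and writes are equal. */
static double bytes_per_cycle(double bus_gb_per_s, int n_spes, double clock_ghz)
{
    double per_spe = bus_gb_per_s / n_spes;  /* 25.6 / 8 = 3.2 GB/s per SPE   */
    double one_way = per_spe / 2.0;          /* 1.6 GB/s in each direction    */
    return one_way / clock_ghz;              /* 1.6 / 3.2 = 0.5 bytes/cycle   */
}

/* Cycles needed to stream one register's worth of data. Under the assumed
 * one instruction per cycle, this is also the minimum number of
 * instructions per register needed to avoid being DMA-bound. */
static double min_instructions_per_register(double bytes_per_clk, double register_bytes)
{
    return register_bytes / bytes_per_clk;   /* 16 / 0.5 = 32 */
}
```

      Calling bytes_per_cycle(25.6, 8, 3.2) and feeding the result into min_instructions_per_register with 16 bytes reproduces the 0.5 bytes/cycle and 32-instruction figures.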

      Food for thought.

    27. Re:What about the compiler? by TopSpin · · Score: 1

      But it's only really great if a new custom variant is built.

      Cell had a specific problem domain to address during the design of the initial product. If Cell really is all that, there will be future revisions. These researchers are pointing out what is necessary to make Cell more viable to a broader base of users. They are putting themselves at the head of the line.

      They have evaluated the existing Cell, added their guesswork as to what could be done with modest changes and quantified the result relative to competitors. The best case outcome includes another $200+ million contract to build another massive compute grid. If someone shows up at the door with this paper in one hand and the funding resources to accomplish it in the other... Cell2 (or Cell+ as they call it) gets put on the drawing board.

      --
      Lurking at the bottom of the gravity well, getting old
    28. Re:What about the compiler? by pjabardo · · Score: 1

      Even though calculation changes, there are several core math kernels that almost every numerical application uses. One such kernel is BLAS (Basic linear algebra software). An optimized version can be much faster than the fortran standard implementation. I've code that uses optimized BLAS and it is 2 times faster than the fortran implementation. If the software is going to run for days this makes a big difference.

    29. Re:What about the compiler? by Anonymous Coward · · Score: 1, Insightful

      I heard about labs spending the time and effort to make custom stuff, but by the time they're done, the off the shelf hardware had already caught up

      Haha, dude, have you ever run tests that take weeks to complete? The FLOPS improvement shown in that paper is around a factor 8 compared to AMD64 machines. You jump from weeks to days in simulation time. That is HUGE.

      As for the development time, doing a basic optimisation will already give you a great boost in performance. You do not have to hand-code each and every instruction/function. As a side note, we already spend weeks on optimising pieces of code for SSE/SSE2/SSE3. I would guess using another set of assembler would not delay us too much. Especially if we can gain 8x performance.

      Our lab also does video coding, processing 8 times faster would mean that we can go from demonstrating our technology on 352x288 (CIF) sequences to demonstrating it on 720p (HD) sequences. That is if we keep it realtime, or we could process 8 CIF streams at once. Now that is WAY impressive.

    30. Re:What about the compiler? by Wolfier · · Score: 1

      What academics do with a new technology by hand is often what makes the things you do daily, like -O3, possible.
      Sometimes people DO use published research results to construct compilers.

      -O3 is more important when the optimization is just a means to an end - however, when optimization is an end in itself, it's easy to see the value of disciplined hand tuning.

    31. Re:What about the compiler? by m874t232 · · Score: 1

      Having seen a lot of scientific codes, I can assure you: most people writing scientific software don't even know about BLAS, and even if they do, they don't bother.

    32. Re:What about the compiler? by m874t232 · · Score: 1

      It's not wasted time if the time spent optimizing is less than the time saved.

      Wrong. A programmer hour is much more valuable than a machine hour.

      And this hasn't been lost on scientists and engineers--hence the popularity of software like MATLAB.

    33. Re:What about the compiler? by salad_fingers · · Score: 1

      What needs to be remembered is that Cell is a strictly in-order CPU, meaning that it cannot execute instructions out of order. Therefore, hand-optimization can only get you so much. If you are waiting on an add, you are waiting on an add; not much can be done in that sense. It really comes down to compiler logistics and how the various tasks of branch prediction, locality and "cache" come into play/are optimized. Remember that Cell has no real cache but a virtual one. It ultimately comes down to the compiler, which I am sure is pretty beefy in this case.

    34. Re:What about the compiler? by Anonymous Coward · · Score: 1, Insightful

      If a simulation will run for several months, saving a week's worth of run time is advantageous. That could translate into more time to do analysis, publishing sooner than a competitor, reduced overhead, etc.

      But as with any of the examples you gave, a cost benefit analysis needs to be considered.

      And with any optimization strategy, it is often better to use better data structures than to tune serial instruction streams.

      For the Cell, this might translate into reshaping data chunks to better fit the local processor environment.

    35. Re:What about the compiler? by plalonde2 · · Score: 1

      Oddly, on the Cell, most of the optimization is low-level algorithmic stuff. Yes, assembly gets you that last little boost, but most of the Cell optimizations I've worked with (for the last 15 months or so) have been data movement and data decomposition exercises. Breaking your data into SPU-sized chunks, or into SPU-streamable chunks is the hard part. It's also the part compilers are *useless* for.

    36. Re:What about the compiler? by jericho4.0 · · Score: 2, Informative

      Maybe true on our computers, but not on supercomputers.

      --
      "A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
    37. Re:What about the compiler? by try_anything · · Score: 1

      I think you're mixing up CS theorists with the scientists and engineers who just want to crunch a bunch of numbers and get the answers. You'd like the latter group; they write horrible code and aren't ashamed of it.

      You know the saying, "You can write Fortran in any language?" Scientists judge a language by how easy it is to write Fortran in it. That's why C is their second-favorite language.

    38. Re:What about the compiler? by m874t232 · · Score: 1

      Yes, which brings us back to only a "tiny fraction" of all scientific programmers dealing with optimizations. In fact, supercomputers are disappearing from all but a few niches.

    39. Re:What about the compiler? by Memnos · · Score: 1

      No Shit. For mathematically-intensive uses the Cell Processor blows everything else out of the water, by far. Given that faster might actually be better in such contexts, what would you buy? Granted, it's backed by a little-known company by the name of IBM, which may actually have its shit together on chip design.

      --
      I don't trust atoms -- they make up stuff.
    40. Re:What about the compiler? by pjabardo · · Score: 1

      I have to agree with you. Many (most?) people do not know about it. I think the problem is that many people in scientific computing have most of their background in science, not in computers, and they don't know about these libraries. Since it is usually very simple to implement a replacement, they don't bother to look for 'standard' solutions, and performance goes down. But even in this situation it is likely that the programmer is using BLAS even if he is not aware of it. R (www.r-project.org) uses BLAS and LAPACK and, if I'm not mistaken, Matlab does too. Several matrix libraries provide interfaces to BLAS (see uBlas from boost as one example). And when the problems start to grow, things like BLAS begin to gain importance. I do 3D fluid dynamic simulations using clusters (up to 64 processors) and many simulations take days. So a 50% gain matters. Most manufacturers provide these optimized kernels. Intel has MKL, AMD has ACML, DEC (there are still many alphas around) has DXML. There are several standalone implementations - atlas, Goto atlas, etc. Even the GNU Scientific Library uses blas (it has a C implementation that can be replaced by faster routines). I do recommend these libraries. You gain in performance, bugs and even portability.

    41. Re:What about the compiler? by Tough+Love · · Score: 2, Informative

      A programmer hour is much more valuable than a machine hour

      You forgot to take into account the team of scientists waiting for the machine to produce a result.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    42. Re:What about the compiler? by m874t232 · · Score: 1

      First of all, few scientists have the luxury of having dedicated programmers anymore. Second, the team of scientists can generally find other things to do with no problems (grant applications, etc.).

    43. Re:What about the compiler? by WindShadow · · Score: 1

      Actually, given the time it takes to run a big engineering calculation and the time it takes to hand-optimize, for long runs it makes lots of sense and is relevant to users. Also note that if this ever became more widely used, there's no reason gcc can't be taught to do a much better job for this hardware.

  3. What about the programmer? by Anonymous Coward · · Score: 5, Insightful

    "The paper did a lot of hand-optimization, which is irrelevent to most programmers. "

    But not to programmers who do science.

    "What gcc -O3 does is way more importent then what an assembly wizard can do for most projects."

    Not an insurmountable problem.

    1. Re:What about the programmer? by zCyl · · Score: 2, Insightful

      Hand optimization or writing portions of code in assembler is
      the last thing 85% of these people want to do. They don't want
      to be computing experts to do their science/research.


      When you're talking about reuseable modules like an FFT or matrix multiplication, then many scientists doing simulations would love to have a hand optimized FFT or matrix module to plug in as a simulation component. Even if they don't know a drop of assembly themselves, having the optimized module available can make a large difference in running time for big simulations.

    2. Re:What about the programmer? by fatphil · · Score: 1

      Some hardware/system companies have a small bunch of volunteers for this very task - firstly they select programmers which they believe have l33t programming skills, then they lend you a top-of-the-range model (and even are prepared to ship it all the way to Finland if the volunteer lives in Finland), and in return you promise to work on hand-optimising code for their platform, and publishing the results.

      FatPhil, in Finland. ;-)

      --
      Also FatPhil on SoylentNews, id 863
  4. Re:Xbox 2 is a "commodity" by MooUK · · Score: 2, Insightful

    I think you misunderstand what HPC actually is.

    High performance computing is that which you'd want to throw a huge Beowulf cluster at, or possibly a supercomputer or twenty. Not three small pathetic cores.

  5. Doesn't it easily scale up? by Poromenos1 · · Score: 1, Interesting

    Doesn't the Cell's design mean that it can very easily scale up, without requiring any changes in the software? Just add more computing CPUs (SPEs they are called, I think?) and the Cell runs faster without changing your software.

    I'm not entirely sure of this, can someone corroborate/disprove?

    --
    Send email from the afterlife! Write your e-will at Dead Man's Switch.
    1. Re:Doesn't it easily scale up? by owlstead · · Score: 1

      Yes, if there isn't any communication overhead between the processors. If you have 100 separate threads or processes with no (or almost no) communication between them, then the application is perfect for multiple CPUs. If a lot of communication is needed, then much less so. You cannot write an application for 8 cores with very fast communications and expect it to run on multiple processors without any modifications. That's why many parallel processor designs cost more for the networking part than for the processors themselves.
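
      A toy model of that trade-off, as a sketch only: it assumes the compute time divides evenly across processors while the communication cost grows linearly with the processor count. The function name and the cost numbers are made up for illustration, not measured on any real machine.

```python
def run_time(work: float, n_procs: int, comm_cost: float) -> float:
    """Toy model: compute time divides by n, communication grows with n.

    Both the linear communication term and the units are assumptions.
    """
    if n_procs < 1:
        raise ValueError("need at least one processor")
    return work / n_procs + comm_cost * (n_procs - 1)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        independent = run_time(100.0, n, comm_cost=0.0)   # no communication
        chatty = run_time(100.0, n, comm_cost=0.5)        # heavy communication
        print(f"{n:3d} procs: independent {independent:7.2f}  chatty {chatty:7.2f}")
```

      With no communication the time keeps falling as processors are added; with the made-up cost of 0.5 per extra processor it bottoms out and then gets worse again.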

    2. Re:Doesn't it easily scale up? by jacksonj04 · · Score: 1

      It should be best suited to things needing concurrent, but not parallel processing. For example you could be running several simulations at once, none of which are interdependent. When one is done, the processor can be handed another instruction without needing to wait for the results from everything else.

      The code will be the tricky bit.

      --
      How many people can read hex if only you and dead people can read hex?
  6. Not likely to be low cost CPUs by maraist · · Score: 1

    An interesting point is that most consoles sell their hardware at a loss. At least the XBox does. This means that there is no guarantee that IBM is willing to sell their CPUs at the same price that one would believe they cost for the PS3.

    Moreover, the scientific community is very likely to push their cell+ architecture and I'm sure IBM would be more than happy to help... For a massive price.

    So, when building an HPC system, you're likely to work around the best architecture (the more expensive cell+), and purchasers of the HPC will then have a cray-like proprietary system at enormous cost.

    Not that this is a bad thing, I just don't believe this "low cost" "high volume" statement.

    --
    -Michael
    1. Re:Not likely to be low cost CPUs by Oswald · · Score: 1

      Doesn't sound right. IBM isn't taking a loss on PS3 hardware. If anybody is, it's Sony, and they would be subsidizing the volume that would allow IBM to sell the chip (relatively) cheaply.

    2. Re:Not likely to be low cost CPUs by WindBourne · · Score: 1

      I just don't believe this "low cost" "high volume" statement.

      If not, then you are about the only one. Simply look at top500.org to see what low cost, high volume produces. My bet is that IBM is using Sony to get to high volume rather quickly. After that point, they will start using this in a number of their own systems. And you can bet that this will form the foundation of a very, very fast parallel arch for the top500. I also expect to see it upgraded to cell+ quickly.

      --
      I prefer the "u" in honour as it seems to be missing these days.
    3. Re:Not likely to be low cost CPUs by maraist · · Score: 1

      I just don't believe this "low cost" "high volume" statement. If not, then you are about the only one.

      Well, I'm just saying, I wouldn't bet money on IBM coming out with their Cell in high enough volume to provide ultra-low pricing as in the PowerPC or, obviously, x86 markets. History has shown time and time again that innovation is not what is important, dominance is. Alpha had a superb chip but was in no way marketable. Apple has always had a better design in computer hardware, but will likely never achieve any respectable market presence.

      My post is one of pessimism: that the PS3 will not be a sufficient vehicle to drive the Cell processor to an economy of scale (where you could afford to make only a few pennies of gross profit per CPU). Certainly anything could happen. But I have yet to see a video graphics processor (in the 10 years I've watched GPU progression) break out of its niche market (despite all these innovative ways of using their specialized mathematical processing power).

      This article was about the scientific community seeing a potential to use the Cell for what it was meant, but in a different venue than still-frame graphics rendering. That's fine, but any architectural decision requires taking the whole project into account, and I simply don't see the cost effectiveness of Cell processing until and unless it becomes ubiquitous. Otherwise what you have is an original Cray (prior to the Opteron or even Alpha chip): 100% custom. And you pay for it.

      Basically it's slightly cheaper than designing the chip yourself.. But my argument is... Not by much.

      --
      -Michael
  7. Lattice QCD people: by ettlz · · Score: 1

    Isn't Cell similar to things like QCDOC (from what my LQCD colleagues tell me, it's based on PowerPC, but are there similarities in the wider architecture, interconnects, etc.)? Have any plans to use it here?

    1. Re:Lattice QCD people: by Watson+Ladd · · Score: 1

      A little bit. The big difference is the Cell has SPE's which are like DSP's on the chip which are controlled by a PPC processor. QCDOC is a lot of PPC processors connected similarly. Also, memory is symmetric on QCDOC, while it is asymmetric on the Cell. The similarity is mostly in the kind of bus used. Think about one QCDOC node connected to seven QCDSP nodes and only the QCDOC node having a lot of memory and you will have the right idea. Ars Technica had a good review of the Cell.

      --
      Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD
    2. Re:Lattice QCD people: by Quiberon · · Score: 1

      QCDOC people ended up making BlueGene

  8. Re:Xbox 2 is a "commodity" by PhotoBoy · · Score: 1

    Except neither of those links point to anything that proves the Cell is good for High Performance Computing which is the point of the article. This isn't anything to do with 360 vs PS3. If MS wanted to design a CPU that could be scaled up for HPC they would have done, instead they just got IBM to customise a PPC chip for their games console because their goal is dominance in the living room, not to become the next Intel.

    To be honest I question the validity of this study anyway, I seem to recall lots of papers proclaiming the PS2's so-called "Emotion Engine" as the future of super computing and that never happened either. This is probably more hype paid for by Sony to make people believe the PS3 will be the second coming.

    Plus if you actually watch that whole interview with Carmack you linked to, he says the only advantage of the PS3 hardware is peak performance, which if it's anything like the PS2 will be limited by memory bandwidth. And everything I've seen of the PS3's RSX suggests it's just an nVidia 7800 GTX, which means the 360 should have the advantage graphically. With the PS3 having more CPU power but the 360 having more polygon power I suspect we'll end up with fairly similar looking games.

  9. WTF? by SmallFurryCreature · · Score: 4, Insightful
    First off you are talking about consoles being sold at a loss. NOT their components.

    IF IBM was the maker of the chip they would most certainly not sell them at a loss. Why should they? Sony might sell the console at a loss to recoup the loss from game sales but IBM has no way to recoup any losses.

    Then again, IBM is in a partnership with Sony and Toshiba, so the chip is probably owned by this partnership and Sony will just be making the chips it needs itself.

    So any idea that IBM is selling Cells at a loss is insane.

    Then the cost of the PS3 is mostly claimed to be in the Blu-ray drive tech. Not going to be of much interest to a science setup, is it? Even if they want to use a Blu-ray drive, they need just 1 in a 1000-Cell rig. Not going to break the bank.

    No, the Cell will be cheap because when you run an order of millions of identical CPUs, prices drop rapidly. There might even be a very real market for cheap Cells. Regular CPUs always have lesser quality versions. Not a problem for an Intel or AMD, who just badge them Celeron or whatever, but you can't do that with a console processor. All Cell processors destined for the PS3 must be of similar spec.

    So what to do with a Cell chip that has one of the cores defective? Throw it away OR rebadge it and sell it for blade servers? That is where Celerons come from (defective cache).

    We already know that the Cell processor is going to be sold for purposes other than the PS3. IBM has a line of blade servers coming up that will use the Cell.

    No, I am afraid that it will be perfectly possible to buy Cells and they will be sold at a profit just like any other CPU. Nothing special about it. They will, however, benefit greatly from the fact that they already have a large customer lined up. Regular CPUs need to recover their costs as quickly as possible because their success is uncertain. This is why regular top-end CPUs are so fucking expensive. But the Cell already has an order for millions, meaning the costs can be spread out in advance over all those units.

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.

    1. Re:WTF? by Kjella · · Score: 3, Insightful

      So what to do with a cell chip that has one of the cores defective? Throw it away OR rebadge it and sell it for blade servers?

      Use it. Seriously, that's why there's central + 7 of them, not 8. One is actually a spare so that unless it's either flawed in the central logic or two separate cores, the chip is still good. Good way to keep the yields up...
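
      A back-of-envelope sketch of why a spare helps yields. It assumes each SPE fails independently with the same probability and ignores defects anywhere else on the die; the 10% defect rate is a made-up number, not a real fab figure.

```python
from math import comb


def chip_yield(p_defect: float, spes_on_die: int = 8, spes_required: int = 7) -> float:
    """Probability that at least `spes_required` of `spes_on_die` SPEs work.

    Assumes independent, identically likely SPE defects (an idealization).
    """
    p_ok = 1.0 - p_defect
    return sum(
        comb(spes_on_die, k) * p_ok**k * p_defect ** (spes_on_die - k)
        for k in range(spes_required, spes_on_die + 1)
    )


if __name__ == "__main__":
    p = 0.10  # hypothetical per-SPE defect probability
    print(f"all 8 must work:   {chip_yield(p, 8, 8):.2%}")  # roughly 43%
    print(f"7 of 8 is enough:  {chip_yield(p, 8, 7):.2%}")  # roughly 81%
```

      Under those assumptions, tolerating one bad SPE nearly doubles the fraction of usable chips.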

      --
      Live today, because you never know what tomorrow brings
    2. Re:WTF? by epiphani · · Score: 1

      So what to do with a cell chip that has one of the cores defective? Throw it away OR rebadge it and sell it for blade servers? That is where celerons come from (defective cache)

      Actually, the cell has 8 SPU's on die. It only utilizes seven, specifically to handle the possibility of defective units. They throw the extra SPU on there to increase yields.

      --
      .
    3. Re:WTF? by jericho4.0 · · Score: 1

      The Cell in the PS3 has 7 SPEs. The Cell as used in other places will likely have the full 8 available.

      --
      "A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
  10. Re:Xbox 2 is a "commodity" by Anonymous Coward · · Score: 1

    "We also conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multi-core processors."

    Whether or not HPC is something you'd want to throw 20 or more supercomputers at in a Beowulf cluster, at least you know that the PS3 is really the only next-generation video game system because nobody concerned with raw performance and power efficiency would want to use the Xbox 2 in a HPC environment.

  11. Hmm by Poromenos1 · · Score: 1

    Yes, but the Cell is designed to process data in independent packages which are scheduled and sent to processors by the central unit, it's not a traditional multiprocessor system. Hmm, I guess that from the specs the processors could be communicating via the network instead of just buses as well, which would make what you say correct. I guess we should wait and see.

    --
    Send email from the afterlife! Write your e-will at Dead Man's Switch.
    1. Re:Hmm by owlstead · · Score: 1

      The cell architecture makes it easy to distribute workloads, that's true. But that's just the beginning of solving the parallel puzzle. The trick is to spread the workload in such a way that the communication overhead is minimal. Otherwise, it may be wiser to use a different architecture. My guess is that the cell processor is interesting to grid computing, but needs a serious platform, both hardware and software-wise to be viable for the more serious work. On the other hand, IBM should be big enough to handle this.

  12. When can we start Folding with it? by BartonOC · · Score: 1

    Sounds like this cpu would end up having great folding performance. I so hope the PS3 ends up being hackable and we get to throw Linux on it ;-)

    1. Re:When can we start Folding with it? by ahodes1 · · Score: 1

      Linux will be pre-installed on the PS3 HDD, no hacking needed: http://www.gamasutra.com/php-bin/news_index.php?story=9290

    2. Re:When can we start Folding with it? by Xymor · · Score: 1

      E3: Kawanishi Talks Homebrew Linux PS3 Development. There are also some talks on indie game development; just google PS3 + Linux.

  13. Re:Xbox 2 is a "commodity" by KitesWorld · · Score: 1

    at least you know that the PS3 is really the only next-generation video game system because nobody concerned with raw performance and power efficiency would want to use the Xbox 2 in a HPC environment.

    Not quite. What they're saying is that the Cell is better suited to parallel applications, like physics simulations, and that it is more scalable, i.e. easier to build supercomputers or distributed computing nodes from.

    However, that has no bearing upon what 'generation' the host console is, largely because a console has a pre-determined number of chips installed, and cannot be scaled without breaking its own specification. Remember, the fact that there are exactly n cores in a console is what makes that console a stable development platform (as opposed to the PC, where performance is different on each unit).

    You *could* argue that the console is using more modular technology, but that on its own doesn't tell you anything about overall performance, ease of development, stability, robustness, or any of the other metrics that you can really apply to a console. If 'older' technology can be used to provide those same metrics in a home console, then which is better simply becomes an issue of cost. If the older gear does the same job, but is cheaper to produce, then it is the better alternative from everything but a marketing standpoint. Expandability of the hardware in other platforms does not affect the quality of the platform in question.

  14. The ball is in the hands of developpers. by stengah · · Score: 2, Insightful

    The fact is that most scientists use high-level software (MATLAB, Femlab, ...) to do their simulations. Although these scientists may be interested in any potential speed-up to their workflow, they are not willing to invest any of their time in translating their whole codebase to asm-optimized C. Thus, the ball is in the hands of software developers, not scientists.

    --
    I'm jack's useless sig
    1. Re:The ball is in the hands of developpers. by infolib · · Score: 3, Informative

      The fact is that most scientists use high-level software (MATLAB, Femlab, ...) to do their simulations.

      Indeed, most scientists. They also know very little about profiling but since the simulation is used only maybe a hundred times that hardly matters.

      The cases we're talking about here are where thousands of processors grind the same program (or evolved versions of it) for years as the terabytes of data roll in. Such is the situation in weather modelling, high energy physics and several other disciplines. That's not a "program" in the usual sense, but rather a "research program" occupying a whole department including everyone from "domain-knowledge" scientists down to some very long-haired programmers who will not shy away from a bit of ASM. If you're a developer good at optimization and parallelism, there might just be a job for you.

      --
      Any sufficiently advanced libertarian utopia is indistinguishable from government.
    2. Re:The ball is in the hands of developpers. by Surt · · Score: 1

      In the article they mentioned that they had ported several scientific kernels to Cell, so presumably the porting work isn't going to be the core of the challenge. It sounds like the real work to be done will be convincing Sony to make modifications to the next generation of Cell processors to improve the double precision performance.

      --
      "Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
    3. Re:The ball is in the hands of developpers. by fitten · · Score: 1

      The fact is that most scientists use high-level software (MATLAB, Femlab, ...) to do their simulations. Although these scientists may be interested in any potential speed-up to their workflow, they are not willing to invest any of their time in translating their whole codebase to asm-optimized C. Thus, the ball is in the hands of software developers, not scientists.

      Isn't this the same argument as the Itanium proponents used? ...It's up to the compiler writers to make good compilers so the code runs well...

    4. Re:The ball is in the hands of developpers. by ceoyoyo · · Score: 1

      Those scientists are NOT high performance computing scientists.

      I do a bit of HPC. I wouldn't touch Matlab with a ten foot pole. Of course, I wouldn't touch Matlab with a ten foot pole for non-HPC stuff either.

  15. Ease of Programming? by MOBE2001 · · Score: 2, Interesting

    FTA: While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors,

    The Cell processor may be faster, but how easy is it to implement an optimizing development system that eliminates the need to hand-optimize the code? Is not programming productivity just as important as performance? I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind. I don't think it is a good idea to design a software model around a CPU. It is much wiser to design the CPU around an established model. In this vein, I don't see the Cell as a truly revolutionary processor because, like every other processor in existence, it is optimized for the algorithmic software model. A truly innovative design would have embraced a non-algorithmic, reactive, synchronous model, thereby killing two birds with one stone: solving the current software reliability crisis while leaving other processors in the dust in terms of performance. One man's opinion.

    1. Re:Ease of Programming? by adam31 · · Score: 1
      I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind.

      It's possible that this is the case, however IBM is actively working on compiler technology to abstract the complexity of an unshared memory architecture from developers whose goal isn't to squeeze the processor:

      When compiling SPE code, the compiler identifies data references in system memory that have not been optimized by using explicit DMA transfers and inserts code to invoke the software-cache mechanism before each such reference.

      So for developers who want performance, the architecture is ideal. 2 Megs of L1-speed memory, a 25 GB/s bus servicing 8 processors each with 128 128-bit registers. And for the rest, it's still a high-performance programmer-friendly development environment.

      Your point is not going unnoticed by IBM.
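
      To make the quoted software-cache idea a bit more concrete, here is a rough, read-only sketch. It is not IBM's implementation: the class name, the direct-mapped layout and the sizes are all invented for illustration, a Python list stands in for system memory, and `_fetch_line` stands in for a DMA transfer into local store.

```python
class SoftwareCache:
    """Direct-mapped, read-only software cache in front of a slow memory.

    Hypothetical illustration: `main_memory` plays the role of system memory
    and `_fetch_line` plays the role of an explicit DMA transfer.
    """

    def __init__(self, main_memory, line_size=4, num_lines=8):
        self.main_memory = main_memory
        self.line_size = line_size
        self.num_lines = num_lines
        self.tags = [None] * num_lines    # which memory line each slot holds
        self.lines = [None] * num_lines   # local copies of the data
        self.hits = 0
        self.misses = 0

    def _fetch_line(self, line_number):
        # Stand-in for a DMA transfer of one whole line into local store.
        start = line_number * self.line_size
        return self.main_memory[start:start + self.line_size]

    def load(self, address):
        """The check a compiler would insert before each memory reference."""
        line_number, offset = divmod(address, self.line_size)
        slot = line_number % self.num_lines
        if self.tags[slot] != line_number:   # miss: bring the line in
            self.misses += 1
            self.lines[slot] = self._fetch_line(line_number)
            self.tags[slot] = line_number
        else:                                # hit: data is already local
            self.hits += 1
        return self.lines[slot][offset]


if __name__ == "__main__":
    memory = list(range(64))
    cache = SoftwareCache(memory)
    total = sum(cache.load(i) for i in range(len(memory)))
    print(total, cache.hits, cache.misses)
```

      A sequential walk like the one above misses once per line and hits on the rest; a real mechanism would also have to handle writes and evictions, which this sketch leaves out.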

    2. Re:Ease of Programming? by Chris+Snook · · Score: 1

      I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind. I don't think it is a good idea to design a software model around a CPU. It is much wiser to design the CPU around an established model. In this vein, I don't see the cell as a truly revolutionary processor because, like every other processor in existence, it is optimized for the algorithmic software model. A truly innovative design would have embraced a non-algorithmic, reactive, synchronous model, thereby killing two birds with one stone: solving the current software reliability crisis while leaving other processors in dust in terms of performance.

      I've read this a dozen times, and can't figure out what the hell you're talking about.

      Anyway, as so many other people have pointed out, if 99% of your CPU cycles are spent doing matrix multiplication, and you can make matrix multiplication go 5 times faster with some assembly optimization, your application is now almost 5X faster, without touching 99% of your code. This really happens in scientific computation. It's the extremely friendly end of the spectrum of Amdahl's Law, and is why reusable libraries are very good.
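
      The arithmetic behind "almost 5X" can be checked with Amdahl's Law. A minimal sketch, where the function name is mine and the inputs are the 99% and 5x figures from the paragraph above:

```python
def amdahl_speedup(fraction_accelerated: float, kernel_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / kernel_speedup)


if __name__ == "__main__":
    # 99% of cycles in matrix multiplication, made 5 times faster.
    print(f"{amdahl_speedup(0.99, 5.0):.2f}x overall")  # about 4.81x
    # For contrast: the same kernel speedup when only half the time is in the kernel.
    print(f"{amdahl_speedup(0.50, 5.0):.2f}x overall")  # about 1.67x
```

      The untouched 1% is what keeps the result just under 5x; the lower the accelerated fraction, the faster the benefit evaporates.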

      --
      There's no failure quite as dissatisfying as a complete and total solution to the wrong problem.
    3. Re:Ease of Programming? by zCyl · · Score: 1

      Is not programming productivity just as important as performance?

      When you're talking about scientific computations which can sometimes take a month or more to do one run, then suddenly it can become worth it to sacrifice a bit of programmer time if it can make a substantial increase in performance. If you can do a run in a week instead of a month, then that makes a huge difference in what you can investigate. Often it's not a question of just buying more machines because sometimes you need to know the answer to the last run before starting the next one.

    4. Re:Ease of Programming? by jthill · · Score: 1
      how easy is it to implement an optimizing development system that eliminates the need to hand-optimize the code?
      Not much payoff optimizing development systems for slow hardware. Cray tout the X1E as offering "Unrivalled Vector Processing and Scalability for Extreme Performance". These guys smoked one for dinner, woke up the next day, rebuilt their code from the ground up a completely different way and smoked it again for lunch.

      It took them a month to figure out how to do that, on maybe $3K worth of hardware. Think anybody wants to teach a compiler how to get close? TFP:

      Having become experienced Cell programmers, the single precision time skewed stencil -- although virtually a complete rewrite from the double precision single step version -- required only a single day to code, debug, benchmark, and attain spectacular results of over 65 Gflop/s. This implementation consists of about 450 lines, due once again to unrolling and the heavy use of intrinsics.

      I'm just a fanboi in this territory, but last I looked the guys who don't quite need to do that just use pre-tuned libraries to get a nice chunk of what's possible. Who really cares how hard it is to tune those, once?

      And when they were just doodling, not thinking hard?

      These results are conservative given the naive 1D FFT implementation we used on Cell whereas the other systems in the comparison used highly tuned FFTW or vendor-tuned FFT implementations [...] Cell performance is nearly at parity with the X1E in double precision.

      They say DP arithmetic is apparently in there as an afterthought -- it's not really necessary for game-quality 3D, after all -- and they think they know how to tweak the pipeline for better than double the throughput.
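
      For anyone wondering what an untuned 1D FFT looks like, here is a textbook recursive radix-2 version. This is only a sketch of the general technique; the paper does not show its code, so nothing here should be read as what the authors actually ran on Cell.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dft(x):
    """O(n^2) reference transform, used only to check the FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
    worst = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
    print(f"largest difference from the direct DFT: {worst:.2e}")
```

      Tuned libraries such as FFTW do far more than this (cache-aware plans, SIMD, mixed radices), which is the gap the quoted passage is pointing at.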

      --
      IABCOT!

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    5. Re:Ease of Programming? by Lars+T. · · Score: 1
      The Cell processor may be faster but how easy is it to implement an optimizing development system that eliminates the need to hand-optimize the code? [...] I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind.

      Hunh? From a (assembler) programmer's POV we have something close to AltiVec/VMX vs. x86 and EPIC - and you ask which is easier?

      --

      Lars T.

      To the guy who modded me down from perfect to terrible Karma - Apple haters still suck

  16. 'designed', nothing by Szplug · · Score: 1

    All MP machines have: communication channels, and processors. If the designers envisioned it being used a certain way and optimized it for that, well, what of it? Maybe that's how the standard game API does things but, it's still processors and communication channels. It's more than likely you can get better performance out of it by adapting your problem for it specifically, minimizing communication and keeping processors busy as much as the problem allows, same as for all other MP systems.

    --
    Someday we'll all be negroes
  17. And why Apple going Intel was so sad by Anonymous Coward · · Score: 1, Insightful

    x86, the commodity, has registers from the days when RAM was faster than the CPU (ie 8-bit days)

    The tacked on FPU, MMX, SSE SIMD stuff whilst welcome still leaves few registers for program use

    The PowerPC on the other hand has a nice collection of regs, and as good if not better SIMD--The CELL goes a big step further

    More regs = more variables in the CPU = higher bandwidth of calculation
    be they regular regs or SIMD regs.
      That plus the way it handles cache
    Could be a pig to program without the right kind of compiler optimizing
    Would that mean game developers using FORTRAN 95?

    1. Re:And why Apple going Intel was so sad by uarch · · Score: 1

      x86-64 has bumped the number of general registers up to 16. Sure, it's still fewer than the 32 used by PowerPC, but the performance difference between the two will be negligible for most apps. In some cases more registers win. In other cases fewer registers win. (Think about saving registers on function calls, context switches, etc.)

      Besides, Cell in its current form wouldn't be a huge win in standard desktops. It's specialized for specific workloads and you wouldn't see the same performance gains across the board. In several areas which might be more common to a desktop PC you would probably see a drop in performance. This story doesn't really apply directly to Apple.

    2. Re:And why Apple going Intel was so sad by rrohbeck · · Score: 1

      Now when is AMD going to add a bunch of small slave CPUs? With limited 64-bit instruction set, few fast integer and FP execution units, with local SRAM, hooked up through DMA via HyperTransport?

      Ooh, the idea makes me drool.

  18. bang, buck, effort by penguin-collective · · Score: 3, Informative

    Over the last several decades, there have been lots of parallel architectures, many significantly more innovative and powerful than Cell. If Cell succeeds, it's not because of any innovation, but because it contains fairly little innovation and therefore doesn't require people to change their code too much.

    One thing that Cell has that previous processors didn't is that the PS3 tie-in and IBM's backing may convince people that it's going to be around for a while; most previous efforts suffered from the problem that nobody wanted to invest time in adapting their code to an architecture that was not going to be around in a few years anyway.

    1. Re:bang, buck, effort by m874t232 · · Score: 1

      Well, that's why the Subject says "bang, buck, effort", so, yes, I agree that bang for the buck matters.

      However, there is a problem with the PS3: the only chip that will be made in volume is the chip that goes into the PS3, and that will likely remain at its current clock frequency for a while. And that means that it will be obsolete pretty soon. Faster versions will be much smaller runs and hence much more expensive.

      So, I hope the high volume of the PS3 will help, but I wouldn't bet on it.

  19. single threaded vs multithreaded by abigsmurf · · Score: 1

    I thought the Cell's performance was mediocre if you only had a single task going on at a time. Given that scientific simulations aren't real time, they don't need to be hugely multithreaded, as it's better for each tick/frame/etc. of the simulation to be done one after the other.

    1. Re:single threaded vs multithreaded by be-fan · · Score: 1

      1) Cell's performance is mediocre on typical single-threaded applications (eg: AI). Not because it has inherently bad single-threaded performance, but because most single-threaded code happens to be integer code, and the SPE's integer and branching performance sucks.

      2) Most simulations are highly parallel. There are lots of cases where you can simulate many parts of the system simultaneously, and only synchronize state at certain points.
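
      A small sketch of that pattern, using a made-up 1D diffusion problem: within a time step each chunk of cells is updated independently from the old state, and the only synchronization is waiting for all chunks before the next step. The function names and the 0.25 coefficient are illustrative, and in standard CPython the threads will not make this faster; the point is the structure, not the speed.

```python
from concurrent.futures import ThreadPoolExecutor


def step_chunk(old, start, stop):
    """Advance cells [start, stop) one time step, reading only the old state."""
    n = len(old)
    new = []
    for i in range(start, stop):
        left = old[i - 1] if i > 0 else old[i]        # reflecting boundary
        right = old[i + 1] if i < n - 1 else old[i]   # reflecting boundary
        new.append(old[i] + 0.25 * (left - 2.0 * old[i] + right))
    return new


def simulate(state, steps, n_chunks=1):
    """Run `steps` time steps, splitting every step across `n_chunks` workers."""
    state = list(state)
    n = len(state)
    bounds = [(c * n // n_chunks, (c + 1) * n // n_chunks) for c in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for _ in range(steps):
            # Chunks are independent within a step: each reads the old state only.
            futures = [pool.submit(step_chunk, state, a, b) for a, b in bounds]
            # Synchronization point: collect every chunk before the next step.
            state = [value for f in futures for value in f.result()]
    return state


if __name__ == "__main__":
    initial = [0.0] * 16
    initial[8] = 100.0
    serial = simulate(initial, steps=10, n_chunks=1)
    chunked = simulate(initial, steps=10, n_chunks=4)
    print("same result:", serial == chunked)
```

      Splitting the work changes nothing about the answer, only about who computes which cells between synchronization points.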

      --
      A deep unwavering belief is a sure sign you're missing something...
  20. Re:Xbox 2 is a "commodity" by Darkfred · · Score: 2, Informative

    Did Sony pay you, or did Mr. Kutaragi come over to your house and type it for you?

    Have you seriously never seen anything like this before? As a professional ps2/360/ps3 developer I have to say that I was seriously underwhelmed by this demo. Every one of the effects has been used before. The original Xbox has every effect he mentioned. And HL2 has a significantly more complex lighting system and postprocessing effects.
    The demo appears to be a single high-poly character in a texture mapped box. The demoer admits that this is a cut-scene quality model. I believe this scene could be rendered on an original Xbox with similar 'visual' quality. Why not use some of those polys to make a realistic background? Black on PS2 looked better. And they couldn't even show a solid second of actual gameplay.
    I think it will be an amazing game, but the demo was no technical achievement. It was a hurried render test for an obviously incomplete engine. Bragging about poly count when your competition can push 1.5x-3x as many is not going to win them any points either.

    Regards,

    --
    ----- 70% of all statistics are completely made up.
  21. No, this is why we have subroutine libraries by golodh · · Score: 5, Interesting
    Although I agree with your point that crafting optimised assembly language routines is way beyond most users (and indeed a waste of time for all but an expert) there are certain "standard operations" that

    (a) lend themselves extremely well to optimisation

    (b) lend themselves extremely well to incorporation in subroutine libraries

    (c) tend to isolate the most compute-intensive low-level operations used in scientific computation

    SGEMM

    If you read the article, you will find (among others) a reference to an operation called "SGEMM". This stands for Single precision General Matrix Multiplication. This is the sort of routine that makes up the BLAS library (Basic Linear Algebra Subprograms) (see e.g. http://www.netlib.org/blas/). High performance computation typically starts with creating optimised implementations of the BLAS routines (if necessary handcoded at assembler level), sparse-matrix equivalents of them, Fast Fourier routines, and the LAPACK library.
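
    For readers who have not met it, this is roughly what SGEMM computes: C = alpha*A*B + beta*C. The sketch below is an unoptimised reference version and not the BLAS interface itself; the function name and the list-of-lists layout are assumptions for illustration, Python floats are double precision rather than single, and the real routine also takes transpose flags and leading dimensions and overwrites C in place.

```python
def sgemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            # Innermost loop walks a row of b: this is the part that
            # tuned libraries block for cache and vectorise.
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
product = sgemm_reference(1.0, A, B, 0.0, C)
scaled = sgemm_reference(2.0, A, B, 1.0, C)
```

    The triple loop is the whole algorithm; everything an optimised library adds is about the order in which those multiply-adds are performed.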

    ATLAS

    There is a general movement away from optimised assembly language coding for the BLAS, as embodied in the ATLAS software package (Automatically Tuned Linear Algebra Software; see e.g. http://math-atlas.sourceforge.net/). The ATLAS package provides the BLAS routines but produces fairly optimal code on any machine using nothing but ordinary compilers. How? If you run a makefile for the ATLAS package, it may take about 12 hours or so to compile (depending on your computer of course; this is a typical number for a PC). In this time the makefile will simply run through multiple switches for the BLAS routines and run test suites for all its routines for varying problem sizes. And then it picks the best possible combination of switches for each routine and each problem size for the machine architecture on which it's being run. In particular it takes account of the size of caches. That's why it produces much faster subroutine libraries than those produced by simply compiling e.g. the BLAS routines with an -O3 optimisation switch thrown in.
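
    The idea of timing variants and keeping the winner can be sketched in a few lines. This is not how ATLAS is actually implemented (it generates and compiles C code and searches far more parameters); it is a toy version assuming a single tunable, the block size of a blocked matrix multiply, and all names here are made up for the example.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices, visiting them in block x block tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time every candidate block size on this machine and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings


best_block, timings = autotune(n=32, candidates=[4, 8, 16, 32])
```

    Which block size wins depends on the machine it runs on, which is the whole point: the search is empirical rather than derived from a model of the hardware.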

    Specially tuned versus automatic?: MATLAB

    The question is of course: who wins? Specially tuned code or automatic optimisation? This can be illustrated with the example of the well-known MATLAB package. Perhaps you have used MATLAB on PCs, and wondered why its matrix and vector operations are so fast? That's because for Intel and AMD processors it uses a special (vendor-optimised) subroutine library (see http://www.mathworks.com/access/helpdesk/help/techdoc/rn/r14sp1_v7_0_1_math.html). For SUN machines, it uses SUN's optimised subroutine library. For other processors (for which there are no optimised libraries) MATLAB uses the ATLAS routines. Despite the great progress and portability that the ATLAS library provides, carefully optimised libraries can still beat it (see the Intel Math Kernel Library at http://www.intel.com/cd/software/products/asmo-na/eng/266858.htm).

    Summary

    In summary:

    -large tracts of Scientific computation depend on optimised subroutine libraries

    -hand-crafted assembly-language optimisation can still outperform machine-optimised code.

    Therefore the objections that the hand-crafted routines described in the article distort the comparison or are not representative of real-world performance are invalid.

    However ... it's so expensive and difficult that you only ever want to do it if you absolutely must. For scientific computation this typically means that you only consider "inner loop primitives" such as the BLAS routines, FFTs, SPARSEPACK routines etc. for this treatment, and that you just don't attempt to do that yourself.

    1. Re:No, this is why we have subroutine libraries by definate · · Score: 1
      Specially tuned versus automatic?: MATLAB


      I believe you're mistaking MATLAB for Matlock.
      --
      This is my footer. There are many like it, but this one is mine.
  22. Ran simulations, not code by jmichaelg · · Score: 5, Insightful
    Lest anyone think they actually ran "several scientific application kernels" on the Cell/AMD/Intel chips, what they actually did was run simulations of several different tasks such as FFT and matrix multiplication. Since they didn't actually run the code, they had to guess as to some parameters like DMA overhead. They also came up with a couple of hypothetical Cell processors that dispatched double precision instructions differently than how the Cell actually does it and present those results as well. They also said that IBM ran some prototype hardware that came within 2% of their simulation results, though they didn't say which hypothetical Cell the prototype hardware was implementing.

    By the end of the article, I was looking for their idea of a hypothetical best-case pony.

    1. Re:Ran simulations, not code by Keeper · · Score: 1

      By the end of the article, I was looking for their idea of a hypothetical best-case pony.

      That would be a sphere, right? :)

    2. Re:Ran simulations, not code by the_ed_dawg · · Score: 1
      Lest anyone think they actually ran "several scientific application kernels" on the Cell/AMD/Intel chips, what they actually did was run simulations of several different tasks such as FFT and matrix multiplication.
      Simulation makes computer architecture research possible because researchers don't have access to prototype hardware. If we insisted that all experiments run on real hardware, the only people who could possibly do research are Intel, AMD, and IBM because they have access to the fab and masks to make modifications. Worse, it would take months and tremendous financial resources to test whether an idea even works.

      Any good architecture course goes over how to properly configure simulation parameters to make a practical comparison. The guys at Berkeley spent time trying to tune those DMA numbers because they went to the trouble to make a comparison. They have some truly talented architects at Berkeley, so I'm sure they have the experience to guide their numbers.

      Of course, this is all a moot point, since the numbers are so far in Cell's favor that I doubt the DMA transfer rate would make a damn bit of difference.

      --
      There are two types of people: those prepared for the zombie apocalypse and those who will be eaten.
    3. Re:Ran simulations, not code by Sycraft-fu · · Score: 2, Insightful

      Hey it makes a real difference. There's a great quote that shows up on /. from time to time that goes along the lines of "The difference between theory and reality is that in theory there's no difference but in reality there is."

      Researchers are very good at simulating things that have little or nothing to do with reality. It all looks good in theory according to their formulas, but they fail to take something into account. As an example take the defunct Elbrus E2K computer chip. It was supposed to be an awesome processor that would kick the crap out of anything Intel or AMD offered. It was being designed by people with real computer experience; Elbrus made several Soviet supercomputers. Basically, the chip was to be their Elbrus 3 supercomputer reimplemented on one chip.

      Everything looked good in simulations... But obviously nothing has ever come of it. The E2K never hit the market, and it and followups have been nothing but vapourware. Why? Well again, because of the difference between theory and reality. The design was all well and good on a VHDL simulator, but the hard part of chip design is not developing some powerful stuff in VHDL, it's developing powerful stuff that can be actually fabbed to a real chip.

      So as with anything like this, I reserve judgement until I see real silicon. To me this looks like people getting overly excited about something that doesn't exist yet. Yes, the Cell is good in theory, we know that, that's not the issue. The issue is how will it really perform against other chips running real code. That we don't know, and won't know for some time. One simple issue that will have to be dealt with is compiler inefficiencies. Most scientific code isn't written in assembly, often it's Fortran. Well, if there's one thing Intel's got it's a rockin' Fortran compiler. So even if the Cell's units are actually more powerful in theory, if the code it gets isn't optimized it may not matter.

      Either way, any time I hear things about what an amazing jump forward some new tech will be, I am skeptical. It just generally seems that doesn't happen. Improvements happen in small jumps, not nearly an order of magnitude of increase (which is what they are claiming with the 8x faster stat).

    4. Re:Ran simulations, not code by adam31 · · Score: 1
      Sycraft-fu, I understand your skepticism, and I think it's unfortunate that they didn't publish physical timings. Your post has 3 main points: 1) Their simulations don't factor in something that will account for additional slow-down. 2) Their compilers aren't adapted, and that will contribute to slowdown. 3) Realistic improvements are incremental.

      1) The Cell is actually a pretty simple architecture. Once memory is transferred to SPE local store, performance is deterministic within a fraction of a %. The big question mark is the performance of the DMAC and XDR, in both bandwidth and latency. I feel like, because the paper consistently assumes 25.6 GB/s (theoretical max memory bandwidth), that will be the cause of unexpected slow-down. Achievable bandwidth should be somewhere around 18-24 GB/s, and that will only affect operations that are memory-bound. They assume 1000 cycles of latency, which should be sufficient in any case.
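
      To make the "only affects memory-bound operations" point concrete, here is a back-of-the-envelope model, assuming a kernel takes as long as the slower of its compute time and its memory-transfer time. The peak flop rate and the two example kernels are hypothetical numbers chosen for illustration, not figures from the paper; only the 25.6 and 18 GB/s bandwidths come from the discussion above.

```python
def kernel_time(flops, bytes_moved, peak_flops, bandwidth):
    """Estimated run time: whichever of compute or memory traffic is slower."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def slowdown(flops, bytes_moved, peak_flops, assumed_bw=25.6e9, achieved_bw=18e9):
    """How much slower the kernel gets if bandwidth drops from assumed to achieved."""
    return (kernel_time(flops, bytes_moved, peak_flops, achieved_bw)
            / kernel_time(flops, bytes_moved, peak_flops, assumed_bw))


HYPOTHETICAL_PEAK = 1e10  # flop/s, made up for the example

# Memory-bound kernel: few flops per byte moved.
memory_bound = slowdown(flops=1e9, bytes_moved=8e9, peak_flops=HYPOTHETICAL_PEAK)
# Compute-bound kernel: many flops per byte moved.
compute_bound = slowdown(flops=1e11, bytes_moved=1e9, peak_flops=HYPOTHETICAL_PEAK)
```

      Under these assumptions the memory-bound kernel slows down by the bandwidth ratio (25.6/18, a bit over 1.4x), while the compute-bound kernel is unaffected.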

      2) The fact is that their simulations were run using machine code generated from a real compiler. The language of the source code is irrelevant. More logical is to argue that other compilers have exhausted their potential, while Cell compilers are still in their youth. More straightforwardly, you can argue that a typical compiler has three main deficiencies: it doesn't appreciate the cost of spilling to the stack, it is a slave to correctness in the face of any potential pointer aliasing, and a compiler's nature is scalar processing. The three answers as far as Cell is concerned are: 128 vector registers minimize spilling, all aliasing can be hidden by the restrict keyword + local intrinsic variables, and SPEs are vector-only, with integer and FP sharing a register file. Never has an architecture been more ripe for compiler optimization.

      3) I don't know that this is more than an incremental step, at least as far as high performance computing technology is concerned. It is fundamentally different from AMD64, for example, but in a way that addresses major concerns. 128 registers per SPE, 25.6 GB/s bus, 256 KB of L1-speed memory per processor, all at minimal power consumption... plus they can be linked together on a 35 GB/s bus. The key is that if I ask you to point to the major architectural bottleneck... could you?

      I remember many years ago, I was listening to a talk given by a Pixar tech guy. He articulated that one of the primary benchmarks they used in how to construct their renderfarm was flops per meter cubed (based on performance - heat dissipation of rack space). The Cell isn't quite revolutionary, but it will make many companies re-evaluate their high-performance needs.

    5. Re:Ran simulations, not code by Sycraft-fu · · Score: 1

      No I can't point out for sure the major bottleneck, I don't claim to be a chip engineer. However I can point out one that might not have been considered by the simulation: The registers. While tons o' registers sounds like nothing but a boon, you have to remember that on any system you are likely to see today, you are going to be running a multi-tasking OS. Well, that of course means every time the OS switches tasks, all the registers need to be saved, so the task can resume properly when it switches back. Not a big deal if you are saving the 30 or so registers most processors have. Gets to be a little more problematic if you are talking a couple thousand, which it sounds like the Cell is. Even on a small scale, it does matter and for that reason Intel and AMD leave the vector registers used by SSE disabled unless the OS explicitly turns them on so only tasks that need them save them.

      Well, I can see this being a non-trivial source of slowdown. OSes task switch a lot. Even if you aren't running anything else, it still has lots of system processes and drivers running. Every time a driver pops an interrupt, you have to push everything your program is doing on the stack so it can run, then pop it all back off.

      Now I'm not saying this will be killer, or that there aren't ways of mitigating it (perhaps the OS can just save the state of 1 SPE to use for execution, the rest can just be suspended as they are) but it's one of those kinds of things that tends to fall by the wayside in simulations.

      I just find that when people talk theoretical numbers, they often fall woefully short of the actual performance you see in the real world. I see much in the way of hyping and little in the way of actual hardware demos. Until I see the silicon running in a real environment, I remain ever the skeptic for new products, especially ones that claim to be a major leap over what's come before them. Probably because so many times in the past, I've seen it not pan out.

      Also, I think you are a little overly optimistic on memory speeds. For example the fastest desktop systems these days (which actually have faster RAM speeds than servers due to lack of error correction) on x86 get memory speeds in the 5-6GB/sec range on a real computer, running a peak theoretical benchmark. For example on my system I get 4.8GB/sec, using DDR2 RAM rated to 5.3GB/sec at the speed it's running. You can get a little faster with a faster CPU and bus, but not much. That's on a bus with the theoretical max bandwidth of 10GB/sec (according to the test software at least).

      Talking about quadrupling that, well that's a hell of a feat. Where you'd even get RAM that can do that is a good question. Currently the fastest DDR2 DIMMs on the market are about 8.5GB/sec theoretical in dual channel configs and they aren't cheap. So to achieve memory numbers like you are talking about you now need much faster RAM than is on the market.

      Also please remember we are talking peak speeds here, 4.8GB/sec is what I measure running a simple RAM benchmark, not what the speed is running actual software.
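
      For anyone checking the arithmetic, peak memory bandwidth is just transfers per second times bus width times channel count. The sketch below assumes the 5.3GB/sec rating corresponds to one 64-bit channel at 667 million transfers per second (DDR2-667); that mapping is my assumption, not something stated by the benchmark software, and the function name is made up for the example.

```python
def peak_bandwidth(transfers_per_s, bus_bytes, channels=1):
    """Theoretical peak bandwidth in bytes per second."""
    return transfers_per_s * bus_bytes * channels


# Assumed configuration: one 64-bit (8-byte) channel at 667 MT/s.
rated = peak_bandwidth(667e6, 8)

measured = 4.8e9                      # the benchmark figure quoted above
efficiency = measured / rated         # fraction of the rated peak actually reached
cell_claim_ratio = 25.6e9 / rated     # how far the Cell's assumed 25.6 GB/s is above it
```

      So the benchmark reaches roughly nine tenths of the rated figure, and the 25.6 GB/s being discussed is a bit under five times that rating.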

      Now look, please don't think I'm trying to speak with authority here as to what the Cell's problems are. I don't. All I'm trying to point out is things underappreciated by theoretical tests. There are a LOT of things to consider on real hardware that can make the best theoretical plan not work out as well as you'd hoped.

    6. Re:Ran simulations, not code by egghat · · Score: 1

      Elbrus may have "failed" because market leader Intel chose to buy them.

      Bye egghat.

      --
      -- "As a human being I claim the right to be widely inconsistent", John Peel
    7. Re:Ran simulations, not code by adam31 · · Score: 1
      The point about context switching is a good one. Not only do all the registers need to be saved, but the entire 256 KB of local store! That's a hugely non-trivial feat, but I think performance applications will be written to avoid context switches entirely.
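
      A rough size estimate shows why the local store dominates. The sketch assumes 128 registers of 16 bytes (128 bits) each per SPE and that the state is written out at the full 25.6 GB/s theoretical memory bandwidth, which is optimistic; the function names are made up for the example.

```python
def spe_state_bytes(num_registers=128, register_bytes=16,
                    local_store_bytes=256 * 1024):
    """Bytes that must be saved to fully switch one SPE's context."""
    return num_registers * register_bytes + local_store_bytes


def save_time_seconds(state_bytes, bandwidth_bytes_per_s=25.6e9):
    """Best-case time to write that state to main memory (one direction only)."""
    return state_bytes / bandwidth_bytes_per_s


register_bytes_total = 128 * 16           # register file alone
per_spe = spe_state_bytes()               # registers plus local store
save_one_spe = save_time_seconds(per_spe)
```

      The register file is only 2 KB against 256 KB of local store, so under these assumptions a full save of one SPE costs on the order of ten microseconds each way, before counting the restore or any DMA setup.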

      The RAM is XDR. The IOIF (to talk to other Cells) connection is 2 FlexIO ports. The bus itself (called the EIB) is something like 300 GB/s. I agree that peak is never achievable, but it should be possible to get around 18 GB/s or so.

    8. Re:Ran simulations, not code by jthill · · Score: 1
      Full-system emulators are just that. They model bus contention and DRAM refresh and everything else. If anything at all shows up in the actual hardware that those emulators didn't predict, the engineers figure it out and fix it; they don't like not understanding the hardware they're building, and IBM aren't the only ones who've been doing things like this for a while now.

      The LBNL guys started with a simple model. Their model generally predicted performance within 2% of what the full emulator said. It was off by ~13% once, and that bugged them; it turned out the emulator knew about a dispatch interlock that they didn't.

      I believe their predictions are going to be dead on the mark.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
    9. Re:Ran simulations, not code by Sycraft-fu · · Score: 1

      The problem I see with the "let's just not context switch" idea is how do you do that, barring using the chip as a dedicated DSP? If you want to use it as a CPU, it's going to context switch. A lot. That's just how it works on a modern OS. If nothing else, the kernel wants to check on things periodically. I don't know how often most OSes reenter their kernel, but I'd bet it's multiple times per second. Then of course there's the hardware. Every time the hardware needs attention, which is again multiple times per second I'm sure, it'll fire an interrupt and you have to switch to it and execute its code to deal with whatever it needs done.

      Just sounds like a major potential slowdown. Now maybe you don't use it as a CPU for this reason, you have it as an add-in card. Ok, fair enough, but then comparing it to one of the chips being used as a CPU is a little disingenuous.

      As for the RAM, we'll see. Forgive me if I'm skeptical of Rambus's offerings, having seen their underdelivered first go at desktops. Again, stuff that was good in theory, but just failed to pan out.

    10. Re:Ran simulations, not code by ivan256 · · Score: 1

      However I can point out one that might not have been considered by the simulation: The registers. While tons o' registers sounds like nothing but a boon, you have to remember that on any system you are likely to see today, you are going to be running a multi-tasking OS. Well, that of course means every time the OS switches tasks, all the registers need to be saved, so the task can resume properly when it switches back. Not a big deal if you are saving the 30 or so registers most processors have. Gets to be a little more problematic if you are talking a couple thousand, which it sounds like the Cell is.

      When running HPTC tasks, processing units are reserved for exactly the reason you describe. Preemptive multi-tasking (if it's done at all, which isn't a given) is only done on a subset of the compute units (this may be a cluster node, a CPU, etc...) while the others are free to run the CPU bound task without worry of context switches. This is also the way the Cell architecture is intended to be used, which is why they can get away with having so many registers.

      Incidentally, even in general purpose computing on modern operating systems, it isn't uncommon to save only a subset of the registers, for performance reasons, depending on what kind of context change is occurring.

      Also, I think you are a little overly optimistic on memory speeds. For example the fastest desktop systems these days (which actually have faster RAM speeds than servers due to lack of error correction) on x86 get memory speeds in the 5-6GB/sec range on a real computer, running a peak theoretical benchmark. For example on my system I get 4.8GB/sec, using DDR2 RAM rated to 5.3GB/sec at the speed it's running. You can get a little faster with a faster CPU and bus, but not much. That's on a bus with the theoretical max bandwidth of 10GB/sec (according to the test software at least).

      Talking about quadrupling that, well that's a hell of a feat. Where you'd even get RAM that can do that is a good question. Currently the fastest DDR2 DIMMs on the market are about 8.5GB/sec theoretical in dual channel configs and they aren't cheap. So to achieve memory numbers like you are talking about you now need much faster RAM than is on the market.


      RAM performance is largely a function of how much money you're willing to spend. The memory in commodity servers and your desktop computer is slow because it is designed as much for the low transistor count as it is for the speed. There is already MUCH faster memory in your desktop computer, but only a very small amount because of how much die space it takes up. The DRAM you use as main system memory is very cost efficient because bits are stored with a single transistor and a capacitor. This makes it slow, however, because the charge in the capacitor has to be refreshed, and the bits cannot be accessed while this is occurring. There are other alternatives, however. SRAM uses 6 transistors, and thus is significantly more expensive, but can be had at speeds in the hundreds of gigabits per second in multi-channel configurations.

    11. Re:Ran simulations, not code by Bert64 · · Score: 1

      Well, actually on high-end servers the memory will still be faster overall due to a number of things:

      Interleaving
      NUMA (one memory controller per cpu)
      Wider memory bus width

      --
      http://spamdecoy.net - free throwaway anonymous email - avoid spam!
  23. 14 times slower vs 8 times faster by Kell_pt · · Score: 1
    On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance.

    So, that means that the Cell in its current design is 14/8 = 1.75 times slower for double precision than an Opteron/Itanium is for single precision. I searched around but couldn't find a good answer on what the ratio is between an Opteron/Itanium's single and double precision performance. If it's actually just 50% slower (as I think it is) then the Cell is still slower (currently 75%).

    So, does anyone know for sure what the ratio is between an Opteron/Itanium's single and double precision performance?
    --
    "I don't mind God, it's his fan club I can't stand!" E8
    1. Re:14 times slower vs 8 times faster by be-fan · · Score: 1

      The Opteron/Itanium's SP/DP performance is about the same.

      And you misread the statement. It said that Cell was 8 times faster than Opteron in DP.

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:14 times slower vs 8 times faster by Kell_pt · · Score: 1

      Aye, seems I misunderstood, thanks. That "despite" word in there makes a difference. :)
      Still, it would seem that Cell is 1.75 times (14/8) slower for double precision, although on average it's 8 times faster (which makes sense, because its single precision speed is enough to raise the average).

      --
      "I don't mind God, it's his fan club I can't stand!" E8
  24. Benchmark by roadrouter · · Score: 1
    I don't understand how they can compare the new Cell with an AMD64 or an Itanium and be so happy.

    Cell has 8 vector processors and something like a PPC to "control" all of them; it's made specially for FP operations. It's like comparing a GPU with a CPU; it doesn't make much sense.

  25. marketing by prurientknave · · Score: 1

    Check if this was sponsored by the same marketing team that was running ads that kept peddling the lackluster g4 as a supercomputer on the national watchlist.

  26. Femlab? by colinrichardday · · Score: 1

    Did you mean Fermilab, or am I not keeping up with scientific progress? :-)

  27. Ignore everything important? by Duncan3 · · Score: 2, Interesting

    I love how they manage to completely ignore all the other vector-type architectures already in the market, and just compare it to Intel/AMD which are not even designed for floating point performance.

    Scream "my computer beats your abacus" all you want.

    But then it is from Berkeley, so that's normal. ;)

    --
    - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
    1. Re:Ignore everything important? by jthill · · Score: 1

      I have to wonder whether the poster, the modder or both are actively committing slashdot self-parody, because this is just screamingly funny.

      --
      As always, all IMO. Insert "I think" everywhere grammatically possible.
  28. not a fair comparison by MonaLisa · · Score: 2, Insightful

    The authors discuss hand tuning and assembler coding for Cell, but not necessarily for the other processors. Their 2D FFT results, for example, are a factor of 10 slower than others I have seen. Also, for the IA64 and Opteron, the performance of many of these numerical kernels is highly dependent on the compiler used. The IA64 especially is very sensitive to compiler optimization to keep the 6 pipeline slots busy and also generate memory prefetch instructions at the right time to prevent stalling. As often seems to occur in these sorts of HPC comparisons, they spend a lot of time hand optimizing for a particular platform, and compare it to other platforms that have not necessarily received the equivalent effort. As has been noted above, how much time you have to spend developing, debugging, and tuning a code matters a lot. This is particularly true for research codes. Finally, who uses single precision for scientific computing anymore? Any field that I am aware of that would use large FFTs, large linear algebra solvers, etc. requires at least double precision to get anything meaningful.

  29. Re:I am buying a psp3 as soon as they are availabl by thejam · · Score: 1

    What are you running your renderer on now? Or is this power lust? You'll pay a heavy price, especially in your time. I regret doing this myself.

  30. Re:Xbox 2 is a "commodity" by Darkfred · · Score: 1

    I will try to clear up a little of your confusion.

    > You assumed that the MGS4 trailer was pre-rendered cutscene,
    > that obviously shows that you have little knowledge of the PlayStation
    > and MGS. MGS has NEVER used pre-rendered cutscenes.. blah blah blah

    I never said it was prerendered. You simply misunderstood the way these things work. In-game cut scenes use different models than the regular game. That is because the artists need more detailed control of the animations. They can be much more complex because artists can focus on the elements used in that specific cut-scene.
    Therefore, even when rendered in-game, cutscenes are a bad estimate of actual gameplay experience. This is why you so often see xbox cutscenes in commercials rather than actual gameplay. Sure it is rendered real time but it will always look the best possible quality.

    > The original Xbox CANNOT produce similar quality as MGS4. Snake's
    > hair alone would cause the original Xbox to be at its limitation.

    60,000 polys for hair alone! My GOD call the Nobel prize committee! Even if you wanted to waste this many polys on something that could be done with similar quality and 5k polys, what is so spectacular about 60k? The Xbox could do this at its native resolution without too much difficulty; it's not an impressive number. Even the PS2 could do it, although you would only be able to render hair and nothing else.

    > Finally, where did you hear that the Xbox360 can push 3x more polygons
    > than the PS3? Your ass? You are NOT a developer, and it is obvious from
    > your lack of knowledge in the subject.

    Well I didn't say 3x. It depends on what you are rendering. But the simplest limitation is the clock speed and the number of pipelines. I am not saying PS3 is worse, since it can do a lot more shader ops per second (3x as many). But it can only do them on half as many polys at a lower clock speed. This is all academic anyway since total performance is a combination of many things. But I deal with 400k poly models every day, and I just wasn't impressed by the demo.

    --
    ----- 70% of all statistics are completely made up.
  31. Re:Xbox 2 is a "commodity" by CronoCloud · · Score: 1

    The Emotion Engine was the future of HPC, the Cell is simply an extension of ideas and concepts tested out with the EE.

  32. You are SO wrong by Memnos · · Score: 1

    If the chip runs fast with some hand-optimization, then it will get done. Just follow the money. Sheesh!

    --
    I don't trust atoms -- they make up stuff.
  33. We really did run code for Stencil and SpMV by SWWilliams · · Score: 1
    The work for the paper was actually started a year ago, and the paper was finalized 6 months ago. During that period IBM began to release their matrix multiplication and FFT results. It seemed wasteful for us to duplicate their work, so we stopped those at the performance model.

    However, the stencil code and SpMV kernels were actually coded up and simulated for the paper. They were then run (exact same code) on real hardware (a 2.1GHz prototype machine) and those results were presented at the EDGE workshop last week. The hardware performance was pretty close to the simulator (the more computationally bound the kernel, the more accurate the simulator).

  34. You don't consider Cray's X1E a vector processor? by SWWilliams · · Score: 1

    The X1E MSP is certainly a vector processor, and we ran the same kernels on it and presented them in the paper. It would certainly not be considered a commodity processor though. We wanted a nice sample set of architectures: superscalar, VLIW, and vector.

  35. That's why F0rtran really doesn't matter here by billstewart · · Score: 1
    There's a lot of scientific programming that's complex, but a lot of it really involves doing lots of setup and transformation twiddling that hands big chunks of data to a standard package like a matrix multiplier or a Fourier Transformer or Linear Programmer etc. that really burns most of the CPU cycles. Or maybe you're doing graphics and it's a ray tracer / shader / lighter / etc., but you've still got one side of your program that's harder-to-parallelize complexity and another that's just raw standard number crunching.

    So if somebody writes a couple of dozen standard routines that crank the number-crunching part of the Cell processor well, and there's a halfway-adequate compiler for the conventional-processing side, you can still get a big win from a small budget.

    I did a lot of scientific-style programming on VAXes in the early-mid 80s, and my iPod Shuffle has more CPU, more disk-equivalent, faster I/O bus, and probably more RAM (? not sure, but all the non-shuffle versions do.) Our applications sped up by 2 orders of magnitude once we could get enough RAM :-)

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks