Slashdot Mirror


User: Troy+Baer

Troy+Baer's activity in the archive.

Stories
0
Comments
190

Comments · 190

  1. Re:Why Alpha's? Screaming FP performance, that's w on Linux Supercomputer Wins Weather Bid · · Score: 3

    If the G4 can sustain >1gflops, then why not build a cluster of G4s running LinuxPPC?

    I'm not convinced the G4 can sustain 1 GFLOP/s in any kind of real calculation -- it simply doesn't have enough memory bandwidth. The G4 uses the standard PC100 memory bus, AFAIK. That's 64 bits wide running at 100MHz = 800MB/s peak. So without help from the caches, the absolute best you can do on *any* PC100 based system is 200 MFLOP/s using 32-bit FP or 100 MFLOP/s using 64-bit FP. In practice you can only sustain about 300-350 MB/s out of the PC100 memory bus, so things get even worse. The caches will help quite a bit (maybe a factor of 2-4), but I have trouble imagining the G4 being able to sustain over 500 MFLOP/s even on something small like Linpack 100x100 because of the limited bandwidth and latency of the PC100 bus. Other processors that have similar peak FP ratings have much higher memory bandwidths; we've benchmarked an Alpha 21264 (1 GFLOP/s peak, ~400 MFLOP/s sustained) at about 1 GB/s memory bandwidth (that's measured, not peak), and a Cray T90 CPU (1.8 GFLOP/s peak, ~700 MFLOP/s sustained) at 11-13 GB/s (again, measured not peak).

    There's also the question of compilers. You have to have a compiler that recognizes vectorizable loops and generates the appropriate machine code to use the vector unit. Unless Motorola's feeling *really* magnanimous, I don't see that kind of technology making it into gcc (and g77, more importantly for scientific codes) any time soon. Otherwise, you're at the mercy of a commercial Fortran compiler vendor like Portland Group or Absoft. PGI hasn't shown any interest in PowerPC to this point, and Absoft currently does PPC compilers only for MacOS 8, not OSX or LinuxPPC.

    I'd love to be proven wrong on this, but based on my experience I don't see how you could do it.

    --Troy
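
The bandwidth arithmetic in the comment above can be sketched in a few lines. This is only an illustration, assuming one operand has to be streamed from memory per floating-point operation and that the caches give no help; the function names are invented for the example.

```python
def peak_bandwidth_mb_s(bus_bits, bus_mhz):
    """Peak bus bandwidth in MB/s: bytes per transfer times million transfers per second."""
    return bus_bits // 8 * bus_mhz


def flop_ceiling_mflops(bandwidth_mb_s, operand_bytes):
    """Upper bound on MFLOP/s if each operation needs one operand from memory."""
    return bandwidth_mb_s / operand_bytes


pc100 = peak_bandwidth_mb_s(64, 100)
print(pc100)                          # 800 MB/s peak
print(flop_ceiling_mflops(pc100, 4))  # 200.0 with 32-bit FP
print(flop_ceiling_mflops(pc100, 8))  # 100.0 with 64-bit FP
print(flop_ceiling_mflops(350, 8))    # 43.75 at the ~350 MB/s sustained figure
```

The last line shows why the sustained figure matters more than the peak: at 300-350 MB/s the uncached 64-bit ceiling drops well below 50 MFLOP/s.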
  2. Infrastructure costs (Re:BEOWOLF!) on Linux Supercomputer Wins Weather Bid · · Score: 2

    For $15,000,000 to buy an Alpha Beowulf, it sounds like they might have 2,500 nodes with a 'decent' Alpha system. But if they go really high end, they'll have about 750 nodes (For the 'killer' $20,000 Alpha machines).

    That doesn't include the cost of the Myrinet cards and switches, racks, 3rd party software, support people, power, cooling, etc. Believe me, if you're paying $15M for a machine, part of it better be going for support personnel and infrastructure. The configuration's probably more like 250-500 nodes with a corresponding number of Myrinet cards and switch ports, 30-75 racks (8 nodes/rack if you're lucky), a *buttload* of power and air conditioning, and 2-5 onsite support people working on it full time.

    --Troy
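
As a rough check on the rack estimate above, here is a small sketch that turns a node count into a rack count. The 8 nodes per rack figure is the "if you're lucky" density from the comment; the function name is made up, and lower densities would push the count toward the upper end of the 30-75 range.

```python
import math


def racks_needed(nodes, nodes_per_rack=8):
    """Number of racks required, rounding up for a partially filled rack."""
    return math.ceil(nodes / nodes_per_rack)


for nodes in (250, 500):
    print(nodes, "nodes ->", racks_needed(nodes), "racks")
```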
  3. Why Alpha's? Screaming FP performance, that's why on Linux Supercomputer Wins Weather Bid · · Score: 3

    I've installed Linux once on an Alpha box and the BIOS is truly impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?

    The big thing about the Alpha for people like NOAA (who run big custom number-crunching apps written in FORTRAN) is its stellar FP performance. A 500MHz 21264 Alpha peaks at 1 GFLOPS and can sustain 25-40% of that, because of the memory bandwidth available. A Pentium III Xeon at the same clock rate peaks at 500MFLOPS and can sustain 20-30% of that.

    That doesn't fly for everybody, though. Where I work, we have a huge hodgepodge of message-passed, shared-memory, and vector scientific codes, plus needs for some canned applications that aren't available on the Alpha. We picked quad Xeons for our cluster and bought the Portland Group's compiler suite to try to get some extra performance out of the Intel chips.

    --Troy
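
The peak and sustained percentages quoted above translate into absolute rates as follows. This is a minimal sketch using only the figures in the comment; the function name is an assumption for illustration.

```python
def sustained_mflops(peak_mflops, low_percent, high_percent):
    """Return the (low, high) sustained rate implied by a peak and a percent range."""
    return peak_mflops * low_percent / 100, peak_mflops * high_percent / 100


alpha_21264 = sustained_mflops(1000, 25, 40)  # 500MHz 21264, 1 GFLOPS peak
p3_xeon = sustained_mflops(500, 20, 30)       # 500MHz Pentium III Xeon

print("Alpha 21264:", alpha_21264)  # (250.0, 400.0)
print("PIII Xeon:  ", p3_xeon)      # (100.0, 150.0)
```

On these numbers, even the low end of the Alpha range is above the high end of the Xeon range.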
  4. Re:Linux and scaling... on First official SAP R/3 benchmarks on Linux · · Score: 2

    As far as 128 processor SGI's, it would be cool if SGI would contribute some of their IRIX multiprocessor code to the open source community. Would that be better than the current 2.3 stuff?

    Probably a little. SGI has some engineers working on improving Linux's SMP scalability, but Linus has seemed to be wary of accepting patches from them. In any case, the scalability of Intel-based SMP systems is currently limited by the memory bus, which is so narrow (800MB/s peak) that a single CPU can saturate it. The memory systems on SGI and Compaq/DEC Alpha SMP systems are often twice as fast or more and switched.

    Something to keep in mind about the Origin 2000 (SGI's 128-256 CPU boxes) is that they're not SMP systems. They're ccNUMA machines, and a lot of the "ccNUMAness" (including cache coherence, I think) is handled largely by the hardware. I wouldn't be surprised if you could boot the MIPS version of Linux on (for instance) an Origin with little or no modification. I don't know how well it would scale, though.

    --Troy
  5. Re:NEW FILES SYSTEMS??? on Linux 2.4 Feature Freeze · · Score: 4

    I sure would like to know what the news is with SGI porting their XFS journalling file system to Linux. This will be awesome.

    From what I've heard, XFS is going to require some heavy-duty changes to the VFS layer to allow for >2GB files on 32-bit systems. (It's also the VFS layer that keeps ext2 from allowing >2GB files, so this may end up killing two birds with one stone.) I don't think the SGI engineers working on XFS for Linux have any code that's fit for public consumption yet anyway.

    --Troy
  6. Speedshop is *very* nice... on SGI releases "Jessie" to the Open Source · · Score: 1

    I've used some of the command-line SpeedShop tools (ssrun/prof and ssusage) on our Origin, and they are quite nice. The way you can do profiling without having to recompile an instrumented version of your code is slick.

    What I would kill for, OTOH, is a version of perfex that runs under Linux. (Perfex gives you access to the CPUs' hardware performance counters, so you can directly measure things like MFLOPS and cache miss rates -- it's the bomb for doing performance tuning and optimization of scientific codes.) The various Linux performance counter drivers are almost to the point where you could port perfex to use them; I just wish the developers would come to a consensus on which one would go in the mainstream kernel. (Hint, hint.)

    --Troy
  7. Benchmark numbers? on Tom on the Athlon (And an Intel Conspiracy?) · · Score: 1
    Anybody know of sites with Athlon numbers for HPC type benchmarks (Linpack-100x100, stream_d, SPECint95, SPECfp95, etc.)? Tom's data isn't terribly useful for determining if Athlon's a good processor for Beowulf-style HPC clusters...
    --Troy
  8. Re:RAM limit a problem? on SGI Installing Beowulf · · Score: 1

    Linus has stated that he doesn't want to 'kludge' the kernel to support more memory than 32 bits can address (only in 32-bit CPUs, of course). This creates a limit of 2 gigabytes of addressable memory (Not 100% sure here).

    Actually, it's theoretically possible to address up to 4 GB. SGI's "bigmem" patch makes it possible to do this, but I think Linus has rejected the patch for the mainstream kernel. That wouldn't stop SGI from shipping kernels built with this patch, so long as they also distribute the source for it.

    Will this limit any of your applications?

    Not really. Our J90/SV1 and Origin system both have 16GB of memory, but I don't think we allow a single job to use more than 2-4 GB of memory. Large files are actually more of a problem than large memory; Gaussian can generate 20+ GB output files. Thankfully support of large files on 32-bit platforms seems to be coming along.

    What was your reasoning behind not choosing a similar solution from an Alpha vendor? (64-bit CPU, much more addresses)

    We have a fair amount of Alpha experience in-house already; several years ago we had a classroom cluster of DEC Alpha workstations, and we currently have a Cray T3E which is also Alpha-based. Our main concern with the Alpha was software availability, especially compilers. The Compaq Digital Fortran beta is a good start, but I would like to see the Portland Group and KAI compilers for Alpha Linux as well.

    --Troy
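
The 4 GB addressing limit and the large-file problem discussed above both come down to powers of two. A minimal sketch follows, assuming byte addressing and assuming the file-size limit comes from a signed 32-bit file offset (the comment does not spell this out); GB is treated as 2^30 bytes and the helper names are invented for the example.

```python
GIB = 2 ** 30


def addressable_bytes(address_bits):
    """Bytes reachable with an unsigned address of the given width."""
    return 2 ** address_bits


def max_signed_offset(offset_bits):
    """Largest file offset representable in a signed integer of the given width."""
    return 2 ** (offset_bits - 1) - 1


print(addressable_bytes(32) // GIB)  # 4 (GB addressable with 32 bits)
print(max_signed_offset(32))         # 2147483647, just under 2 GB

gaussian_output = 20 * GIB           # the 20+ GB output file from the comment
print(gaussian_output > max_signed_offset(32))  # True: does not fit a 32-bit offset
print(gaussian_output < max_signed_offset(64))  # True: fits a 64-bit offset
```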
  9. Well, if you're an OSU student... on SGI Installing Beowulf · · Score: 1

    Do they need to hire any programmers?

    We often hire OSU students as programmers, gofers, experimental test subjects (oops, did I say that out loud? :), etc. It's a great place to work. Stop by at the beginning of fall quarter or watch the OSU "green sheets".

    --Troy
  10. Design Decisions, System Details on SGI Installing Beowulf · · Score: 1
    Several people have commented that OSC might be coming out on the short end of the stick on this deal. We disagree, and here's why:

    1. Reliability: The most common hardware failures in cluster systems are hard drives and power supplies. The 1400Ls have redundant power supplies, and smaller numbers of nodes will generally have a lower component failure rate.

    2. Migration Path: The "cluster of SMPs" model is in use in several of the largest computers currently in use, including the ASCI Blue Mountain and Blue Pacific machines. Users should be able to develop code on our cluster and then move their code to these much larger platforms with little additional effort.

    3. Application Needs: We have several users with applications which need in excess of a GB of RAM and several GBs of temporary disk storage per node. Many of these applications are "legacy codes" from the vector machines which are difficult to parallelize using message passing approaches, but which can be parallelized relatively easily on SMP systems using compiler directives. The architecture we have selected allows this as well as multilevel parallel programming, using message passing between nodes and compiler directives within a node.

    4. Flexibility: We have users in virtually all scientific and engineering disciplines, most of whom (50-75%) write their own code. We need a cluster architecture which can accommodate a mix of serial, SMP parallel, and MPP parallel applications.

    There are also drawbacks to this approach, primarily related to memory bandwidth and the added cost for the quad processor nodes.

    Here is a slightly more detailed description of the new OSC Beowulf than was in the press release:


    32 compute nodes plus a front end node, each with:
        4 Pentium III Xeon 500MHz processors
        2 GB RAM
        18 GB SCSI-UW disk
        1 Fast Ethernet interface
        2 Myrinet interfaces
    8 16-port Myrinet switches
    various software:
        SGI's modified Red Hat distribution
        PBS queuing system
        Portland Group and KAI compilers
        AMBER (computational chemistry)
        Gaussian 98 (computational chemistry)
        Cactus (computational physics)

    We will be posting further details at http://oscinfo.osc.edu/hardware/ as things develop.

    Sincerely,
    --Troy Baer and Doug Johnson, OSC
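
The per-node figures in the description above add up to the following cluster totals. This is a back-of-the-envelope sketch: the 500 MFLOPS peak per 500MHz Pentium III Xeon is taken from comment 3 in this archive, the front end node is assumed to carry the same two Myrinet interfaces as the compute nodes, and the variable names are illustrative.

```python
COMPUTE_NODES = 32
FRONT_END_NODES = 1
CPUS_PER_NODE = 4
RAM_GB_PER_NODE = 2
DISK_GB_PER_NODE = 18
MYRINET_INTERFACES_PER_NODE = 2
MYRINET_SWITCHES = 8
PORTS_PER_SWITCH = 16
PEAK_MFLOPS_PER_CPU = 500  # assumption: figure quoted in comment 3

compute_cpus = COMPUTE_NODES * CPUS_PER_NODE
compute_ram_gb = COMPUTE_NODES * RAM_GB_PER_NODE
compute_disk_gb = COMPUTE_NODES * DISK_GB_PER_NODE
peak_gflops = compute_cpus * PEAK_MFLOPS_PER_CPU / 1000
host_ports = (COMPUTE_NODES + FRONT_END_NODES) * MYRINET_INTERFACES_PER_NODE
switch_ports = MYRINET_SWITCHES * PORTS_PER_SWITCH

print("compute CPUs:        ", compute_cpus)     # 128
print("compute RAM (GB):    ", compute_ram_gb)   # 64
print("compute disk (GB):   ", compute_disk_gb)  # 576
print("theoretical GFLOPS:  ", peak_gflops)      # 64.0
print("Myrinet host ports:  ", host_ports)       # 66
print("Myrinet switch ports:", switch_ports)     # 128
```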

  11. Some numbers... on SGI Installing Beowulf · · Score: 1
    Cray-1: 1 CPU, 80 MFLOPS theoretical peak, about 40-60 MFLOPS on real-world code.

    Cray YMP-8: 8 CPUs, 250 MFLOPS theoretical peak per CPU, about 150-200 MFLOPS on real-world code.

    Cray T94: 4 CPUs, 1800 MFLOPS theoretical peak per CPU, about 450-900 MFLOPS per CPU on real-world code.

    Cray T3E600/LC-136: 136 300MHz DEC Alpha 21164s (8 OS/command + 128 applications), 600 MFLOPS theoretical peak per CPU, about 90-150 MFLOPS per CPU on real-world code.

    SGI/Cray Origin 2000: 32 300MHz MIPS R12ks, 600 MFLOPS theoretical peak per CPU, about 160-200 MFLOPS per CPU on real-world code.

    OSC Mk.1 Beowulf node: 2 400MHz Pentium IIs, 400 MFLOPS theoretical peak per CPU, about 80-100 MFLOPS per CPU on real-world code.

    (Assuming 64-bit floating point throughout; the Intel chips don't suffer as much as you might think from this, as they do all FP internally with 80-bit precision and truncate to 32 or 64 bits.)

    If you've ever wondered why people pay big bucks for Cray vector machines, let me sum it up in three words: sustainable memory bandwidth. The T90 machines can sustain on the order of 13 GB/s memory bandwidth, and the J90/SV1 machines can sustain about 5 GB/s. By comparison, most workstation and PC systems can sustain about 300-500 MB/s on a good day with a tail wind.

    --Troy Baer
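
The peak and real-world figures listed above can be turned into a percent-of-peak comparison. A minimal sketch using only the per-CPU numbers from the comment; the dictionary layout and function name are assumptions for illustration.

```python
# system: (peak MFLOPS per CPU, low real-world MFLOPS, high real-world MFLOPS)
SYSTEMS = {
    "Cray-1": (80, 40, 60),
    "Cray YMP-8": (250, 150, 200),
    "Cray T94": (1800, 450, 900),
    "Cray T3E600/LC-136": (600, 90, 150),
    "SGI/Cray Origin 2000": (600, 160, 200),
    "OSC Mk.1 Beowulf node": (400, 80, 100),
}


def percent_of_peak(peak, low, high):
    """Sustained rate as a rounded percentage of theoretical peak."""
    return round(100 * low / peak), round(100 * high / peak)


for name, figures in SYSTEMS.items():
    low_pct, high_pct = percent_of_peak(*figures)
    print(f"{name:24s} {low_pct:3d}-{high_pct:d}% of peak")
```

On these figures the vector Crays sustain roughly 25-80% of peak, while the microprocessor-based systems sit between about 15% and 33%, which is consistent with the memory bandwidth argument in the comment.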

  12. Keep dreaming! :) on SGI Installing Beowulf · · Score: 1
    The new OSC Beowulf is going to be used as a compute engine -- I'm not even sure if the 1400Ls have graphics cards! I don't see our SGI MIPS graphics hardware going anywhere soon. :)

    --Troy Baer, Systems Engineer, OSC Science & Technology Support
  13. UNICOS's UDB? on UNIX Machines that don't use /etc/passwd · · Score: 1
    UNICOS (the Cray UNIX) has a system called the UDB (user database) which includes all of the information in /etc/passwd plus information on resource limits and some other stuff. There's still an /etc/passwd file, but I don't think it's used for much. Unfortunately I can't find any links on SGI's site describing how the thing works.

    --Troy

  14. Nobody has a complete MPI-2 implementation yet... on SGI Announces New Strategy and Alliance · · Score: 1

    I really like the direction that SGI has taken lately (aside from the foolish name change). Their focus on open source has been great. Now, as a Cray T3e programmer, I have a request: Please update the MPI libraries to MPI2 compliance.

    Nobody (that I know of, anyway) has a complete MPI-2 implementation. There are a few MPI parallel IO implementations out there (PVFS, ROMIO, and one from IBM whose name escapes me), and one of the free MPI implementations (LAM) has the new dynamic process allocation mechanism. Nobody's implemented the event model and some of the other stuff though, because a lot of it is hard to do and there hasn't been that much user demand. It's hard enough to convince users to port to MPI-1.

    If anybody knows of an MPI implementation that implements all of MPI-2, I'd love to see it.

    --Troy
  15. Re:Not very useful on All-Purpose Distributed Computing · · Score: 1

    Check out page 9 of paper.pdf. The TSIA paradigm does not allow inter-task communication ("During its execution, a task does not communicate with other tasks"). That pretty much settles the issue for most jobs. Burkhard says that almost all tasks can be completed without communication, I remain unconvinced.

    As well you should be. Saying that tasks can't communicate pretty much restricts you to embarrassingly parallel problems and what I call "serially parallel problems" (e.g. doing parameter studies with a single-threaded code by running all the different cases on separate CPUs/machines simultaneously). Real parallel programs need to communicate.

    --Troy
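
A "serially parallel" parameter study of the kind described above might look like the sketch below. It assumes a hypothetical `run_case` function standing in for one run of a single-threaded code; a thread pool is used only to keep the example portable, whereas a real study would put each case on its own CPU or machine. The point is that no case ever talks to another one.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(parameter):
    """Hypothetical stand-in for one independent run of a single-threaded code."""
    return parameter ** 2


cases = [1, 2, 3, 4]

# Each case runs on its own worker; the tasks never communicate with each other,
# which is exactly the restriction the quoted paper imposes.
with ThreadPoolExecutor(max_workers=len(cases)) as pool:
    results = list(pool.map(run_case, cases))

print(dict(zip(cases, results)))  # {1: 1, 2: 4, 3: 9, 4: 16}
```

A code that needs to exchange boundary data between tasks at every step cannot be expressed this way, which is the objection raised in the comment.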
  16. This is sort of what the grid projects are doing.. on All-Purpose Distributed Computing · · Score: 1

    There are a couple projects out there in the HPC community that are aimed at something like this. The main ones are the Globus project (mainly distributed computing services) and the Cactus project (an application framework). I saw presentations on both last week at the HPDC conference, and while they still have a lot of work to do both are rather impressive.

    --Troy

  17. Re:Still No X + PR = SERVER OUT! on SGI's Linux Server · · Score: 1

    Will they use the Cobalt Chipset?

    No.

    Will they use the same Motherboards?

    No.

    Will this simply be "adding a drive sled bay" to a visual workstation?

    No. It's not based on the VW architecture at all.

    Will they be cutting back on the Video and Audio abilities?

    Why does the price mentioned seem higher than the Visual Workstation (if you're just adding a sled, but taking out all the video and audio stuff)?

    Well, the 2+1 redundant power supplies, for one thing. The hot-plug drive bays, for another. The 11 fans, for another. From what I've seen, it's a fairly well-engineered machine.

    --Troy
  18. Blue Mountain isn't strictly ccNUMA... on SGIs Linux Future · · Score: 2

    It's a cluster of 48 128PE Origins tied together with an 800MB/s GSN (a.k.a. HiPPI) network. The individual Origin boxes are ccNUMA, of course, but the boxes communicate with each other using MPI over the GSN (much like Beowulf clusters do MPI over Ethernet or Myrinet or whatever). There used to be hardware specs on the web, but those disappeared after the China fiasco a few months back. This page gives some details.

    --Troy

  19. Boy, Brin must hate a lot of SF/fantasy... on David Brin on Star Wars: TPM · · Score: 1
    It seemed to me that one of his big issues from the first essay was that he has a problem with the idea of a "chosen one", a born leader or Messiah figure. He must dislike a whole hell of a lot of SF and fantasy fiction then, because this is an *extremely* common motif. Some characters I can think of like that, just off the top of my head are Paul Atreides (Dune), Aragorn (Lord of the Rings), hell even Neo from the Matrix.

    Maybe his point was that this motif is a tired cliche because it's so widely used, but if so he articulated it very poorly. Between that and the abomination that was the Postman, I doubt I'll be buying any of Brin's books any time soon.

    --Troy

  20. Re:XFS on High Density Storage · · Score: 1

    Cool, but I'm not running IRIX. Linux needs a successor to ext2.

    Did you miss the announcement that SGI's porting xfs to Linux? Apparently the only reason they haven't released code already is that their lawyers are still haggling over the license.

    --Troy
  21. Re:What filesystem to put on a 216GB drive? on High Density Storage · · Score: 1

    Why, xfs of course!

    --Troy

  22. Re:G3 vs. rackmount PII -- per-node costs and perf on 'Black Lab' Linux For G3 Clusters · · Score: 1

    I got similar results until I tried BlockMoveDataUncached(). It makes a helluva difference. It also makes you wonder how relevant stream_d really is.

    Where can I find docs on BlockMoveDataUncached()? Is it a NeXT/Apple proprietary call? It's certainly not ANSI C or POSIX...

    I've never had to tweak stream_d like this on anything that had a decent compiler. The code is so simple that it shouldn't be that hard to optimize. On x86, gcc 2.7.2.3, egcs/gcc 2.90.29, and pgcc 3.0.4 all give stream_d results within less than 1% of each other. On our Origin 2000, there's about a 10-15% penalty for using gcc 2.8.1 in place of the SGI C compiler 7.2.1. I've assembled a summary of stream_d results for various systems at http://www.osc.edu/~troy/stream_d.html (and yes, the Cray numbers there are real).

    --Troy
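
For readers unfamiliar with how a stream_d figure is derived, here is a minimal sketch of the bandwidth calculation. It assumes stream_d follows the usual STREAM convention: 8-byte doubles, two arrays touched by the copy and scale kernels, three by add and triad, and MB counted as 10^6 bytes. The function name and the sample timing are invented for illustration and are not measurements.

```python
BYTES_PER_DOUBLE = 8

# Number of arrays read or written per loop iteration in each STREAM kernel.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "add": 3, "triad": 3}


def stream_mb_per_s(kernel, n_elements, seconds):
    """Bandwidth in MB/s implied by one timed pass of a STREAM kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_DOUBLE * n_elements
    return bytes_moved / 1e6 / seconds


# Hypothetical timing: 2 million elements per array, 0.125 s per pass.
print(stream_mb_per_s("copy", 2_000_000, 0.125))   # 256.0
print(stream_mb_per_s("triad", 2_000_000, 0.125))  # 384.0
```

Because the kernels are this simple, the reported number depends almost entirely on how quickly the memory system can move the arrays, which is why compiler choice changes the result so little on most of the systems mentioned above.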
  23. PPC 604e FPU(s) on 'Black Lab' Linux For G3 Clusters · · Score: 1

    This is not true... the 604e has 2 fpu units and 2 integer units...

    That's not what Motorola's PowerPC 604e product summary says. It lists 3 integer units (2 single-cycle and 1 multi-cycle) and 1 floating point unit. If there's documentation to the contrary, I'd like to see it.

    --Troy
  24. Re:Benchmark comparisons? on 'Black Lab' Linux For G3 Clusters · · Score: 1

    BTW... RS/6000 is a product line, not a CPU. The RS/6000 line uses the PowerPC series (601,603,603e,604,604e) for integer math and Power, Power2, Power2sc, and Power3 processors for floating point math.

    Huh? I hope you don't mean to imply that RS/6000 systems have both a PPC chip and a Power chip, because they don't. They ship with either a Power series chip or a PowerPC chip.

    (As an aside, the ASCI Blue Pacific system at Livermore is all PowerPC 604e's rather than Power3s, according to this web page. Personally, I think this is why the machine is so much slower than ASCI Blue Mountain, even though Blue Pacific's theoretical peak is higher; the PPCs only have 1 FPU, while the Power3s and the R10ks in Blue Mountain have 2 FPUs.)

    --Troy
  25. SW Academy Awards on Phantom Menace Reviews · · Score: 1

    I think that A New Hope won best film because at the time, no one had ever made a movie quite like it.

    Star Wars did not win Best Picture at the '78 Oscars. It lost to Annie Hall. See this link on IMDB.