Domain: netlib.org
Stories and comments across the archive that link to netlib.org.
Comments · 145
-
ease of programming for, ease of using
So there is growing interest in the notion of "time to solution" as a combination of ease of programming for, ease of using, and of course running a data set on the machine.
Sure, you got it quite right. A computer that's useful for a scientist has tools that can find the eigenvectors of a matrix, calculate the positions of planets in the solar system, solve the Navier-Stokes equation to plot the shock waves around a supersonic wing. Exactly the kind of problems which are so easy to solve using Microsoft Windows... -
More obvious links
For those wanting to know how the figures are calculated, or wanting to calculate them for their own machine, the following links will be helpful:
- High Performance Linpack (Requires MPI and either BLAS or VSIPL)
- High Performance Computing Challenge - the ultimate in stress-testing software
Dependencies:
- LAMPI - MPI from the Government's laboratories at Los Alamos
- MPICH - another version of MPI
- ATLAS - a portable version of BLAS
- VSIPL - a heavy number-crunching image processing package
I doubt many Slashdotter machines will do well against the top 500, but it might be fun to do our own "top 500" (for sheer geek value and bragging rights).
-
Re:Your facts are wrong, wrong, wrong
Five things: Linpack was obsolete in the 80s, so Lapack was created.
A one program benchmark is essentially worthless, so spec tries to remedy this with 12 integer programs and 14 fp programs.
Rpeak is hardly a valid benchmark, since it can only be achieved with instructions doing basically nothing, and the erratic differences between Rpeak and Rmax on the top500 make it even more useless.
The G5 has similar architecture to the POWER4+, so I guess its SPECfp2000 would be around 1400 +/-100.
And finally, these benchmarks don't matter if you have the program that you intend the machine to run, say Photoshop, and it's faster on the Mac than x86.
This still leaves the results I gave that the G5 is as third as slow as an Opteron in both int and fp rates valid. -
Re:And for your particular problem...
Well, since you mention Numerical Recipes, it is obligatory to post one of the many rebuttals. Basically, the books have some okay discussions (and cover a very WIDE range of subjects) but their code is crap. I say that boldly, since I must maintain code that was originally developed using their C libraries. There have always been better alternatives, and especially these days when so much is available on the web.
http://en.wikipedia.org/wiki/Numerical_Recipes
http://en.wikipedia.org/wiki/GNU_Scientific_Library
http://en.wikipedia.org/wiki/Linear_least_squares
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://www.netlib.org/ -
Re:probably only running on the central powercore
The APUs are also very specialised, so you will not only have to allow access to the Cell from the OS (and manage those), but you also have to write the userland programs that take advantage of the APUs' strong points. That applies to every program you want to use the APUs, so the chance that this happens overnight/soon is pretty slim.
Humbug! Code up an optimized Blas for Lapack, and you will have a vast number of scientific apps ready to burn rubber.
I'm pretty excited about this story, because it means IBM has the intent to make a blade server from the Cell. The current state of the product isn't that important. 2.8 TFLOPS from a 7-blade rack sounds awfully good, even if that's just the theoretical max.
-
Overlooked open source model
In these discussions I never see mention of software produced in academia. Guys, research software was open source decades before the term was first invented. NSF or DOE give us money to do research that involves producing software, and when the software is finished it typically goes on a web site. Ftp archive before the web was invented. If you use it, kindly acknowledge us.
Unfortunately some research software is completely forgettable, but there is plenty that is high quality. Just a few names: Lapack http://www.netlib.org/lapack/ does linear algebra software, has been around forever, and is in fact part of the Intel and IBM scientific libraries. Atlas http://www.netlib.org/atlas/ gives highly optimized kernels. We suspect that vendors take this as a code base for their own optimizations. Petsc http://www-unix.mcs.anl.gov/petsc/petsc-2/ is orders of magnitude better than anything available commercially, is now probably 10 years old, still developed and supported, and used all over by engineers and scientists.
Just thought I'd mention that model. No "kindness of big corporations" needed.
Victor. -
Re:someone with CPU knowledge?
Yes.
http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
The Linpack benchmark used for that list is tens or hundreds of times more strenuous than the simple benchmark Sony and MS are using. My Athlon 1.3GHz benches 51MFLOPS with Linpack and an app or two open. -
Re:Observations on Apple's GCC4 release
If you're performing large (block-)matrix operations, you might want to take a look at ATLAS; ATLAS basically is an automatically tuned version of LAPACK and BLAS. Not only does it use vectorization on CPUs that support it, it also takes the sizes of your L1 and L2 caches into account, by reordering the matrix operations such that the cache hit rate is highest.
ATLAS doesn't depend much on the compiler's optimization abilities; most of it is generated as assembly. And it's a _very_ reliable thing: big commercial tools such as MATLAB are built on top of (slightly modified versions of) ATLAS.
As for you: the ATLAS build system definitely supports AltiVec.
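The cache reordering described above is usually called blocking or tiling. A minimal sketch in plain Python, assuming square matrices stored as lists of rows; the function names and the default block size of 32 are made up for illustration and are not taken from ATLAS:

```python
def matmul_tiled(a, b, n, block=32):
    """Multiply two n x n matrices (lists of rows) one tile at a time.

    Working on block x block tiles keeps the operands of the inner loops
    small enough to stay in cache; the arithmetic is unchanged.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):              # tile row of C
        for kk in range(0, n, block):          # tile of the shared dimension
            for jj in range(0, n, block):      # tile column of C
                for i in range(ii, min(ii + block, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def matmul_naive(a, b, n):
    """Textbook triple loop, used only to check the tiled version."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

In an interpreted language the cache effect is mostly hidden by interpreter overhead, so this only shows the loop structure, not the speedup a tuned BLAS gets from it.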
-
Linpack
On a serious note, try linpack (http://www.netlib.org/linpack/). You can run as many processes as required for your machine (i.e. 1 process per cpu) and with care you can use as much memory as you want (memory gets really hot and takes a lot more power than most people think). I do development testing for manufacturers who want us to sell their kit, and I have burnt out many power supplies (now the manufacturers believe me that they are underspec'd) by using a well tuned linpack run to overload the system. It will take a bit of compilation to get it right for your system I suspect, but it can really heat a room up (got exhaust temps >60 Celsius on some machines).
-
Re:LAPACK et. al.
Uhh.... LAPACK.
LinPack also exists. From Linpack's website, "LINPACK was designed for supercomputers in use in the 1970s and early 1980s. LINPACK has been largely superceded by LAPACK, which has been designed to run efficiently on shared-memory, vector supercomputers."
But, dude, why the hate? These are subroutines that solve specifically defined linear algebra problems. They happen to consume a large amount of cpu time for large scale problems. So why the reply? Or are you just trying to flamebait?
-
Re:Neat
That's not how Linpack works. Sure, increasing your number of nodes will give definite performance advantages to coarse-grained, embarrassingly parallel applications, but Linpack is not one of these applications. As well, Linpack should not be used as a guide for raw floating point performance, but is much better suited to gauge throughput.
Linpack does its benchmarks using a more fine-grained algorithm, creating lots of communications for Message Passing to share segments of dense matrices for rather large linear systems. Not only is the number of nodes a factor, but so is the interconnect speed. If that cluster was using GigE for its interconnect, its Linpack benchmarks would not be nearly as impressive. Haven't RTFA but it's likely that BlueGene/L is using Myrinet or InfiniBand for its interconnect (or possibly a more proprietary backplane style interconnect, though that cluster is way too big for that).
These latest generations of high-speed interconnects (esp. InfiniBand) have brought clusters close to shared-memory performance, and hence Linpack is more of a throughput test than anything else.
This description of the HPL benchmark (The "official" name for the Linpack benchmark) should provide some clarity as to how memory-dependent Linpack actually is:
The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution - Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths - Recursive panel factorization with pivot search and column broadcast combined - Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution with look-ahead of depth 1.
http://www.netlib.org/benchmark/hpl/
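The first keyword in that summary, the two-dimensional block-cyclic data distribution, can be sketched in a few lines. This is only an illustration of the mapping idea, assuming a process grid of p rows by q columns and a square block size nb; the function names are hypothetical and are not HPL's own:

```python
def block_owner(i, j, nb, p, q):
    """Return (process_row, process_col) owning global matrix entry (i, j).

    The matrix is cut into nb x nb blocks, and blocks are dealt out to the
    p x q process grid round-robin in both dimensions.
    """
    return (i // nb) % p, (j // nb) % q


def ownership_map(n, nb, p, q):
    """Owner of every nb x nb block of an n x n matrix, as a grid of tuples."""
    nblocks = (n + nb - 1) // nb   # number of blocks per dimension, rounded up
    return [[(bi % p, bj % q) for bj in range(nblocks)]
            for bi in range(nblocks)]
```

Because neighbouring blocks land on different processes, every process keeps some work as the factorization shrinks the active part of the matrix, which is also why the interconnect sees so much traffic.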
They took a lot of time to get Linpack to be less shared-memory dependent, like adding the swap-broadcast algorithm (which I'm fairly certain was absent in the old mainframe version of Linpack), to make it more "fair" to run on a cluster versus a shared memory set up. On a typical cluster, though, Linpack can still push your interconnect pretty hard, esp. if you are stuck on GigE. That said, Linpack has _lots_ of settings and parameters to "tune" the benchmark for your particular cluster.
My point: Linpack/HPL is not an overall flops benchmark for a cluster. It measures not only double precision CPU performance, but also the performance of a cluster's interconnect. -
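For readers who have not seen what the benchmark actually computes: at its core is a dense linear solve with row partial pivoting. The sketch below is a single-process Gaussian elimination that updates the right-hand side on the fly instead of storing the LU factors, so it is the same elimination but none of HPL's distribution, look-ahead or broadcast machinery. The function names are made up for illustration; the operation count is the one conventionally credited to an n x n solve.

```python
def solve_partial_pivot(a, b):
    """Solve a x = b by Gaussian elimination with row partial pivoting.

    a is a list of rows and b a list; both are copied, not modified.
    """
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest |entry| of column k to row k.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Floating point operations credited for one n x n solve."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
```

Dividing `linpack_flops(n)` by the measured wall-clock time gives the FLOPS figure; everything else in HPL is about doing this same elimination across many nodes.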
LINPACK usage?
I think of LAPACK as being much more up-to-date for benchmarking.
-
Re:hardly unfortunate
It is *possible* to write C that runs as fast as Fortran for heavy math. However, it involves hand-optimizing your C until this happens.
With libraries like SPOOLES I don't need to. One of the primary advantages of C is the availability of many libraries out there that have been developed over the years. Of course, this is in general. In specific domains (like particle physics) the great deal of Fortran code out there ensures that the language won't go away any time soon.
-
Re:Learning It?
My advice is to stay away from it unless you like learning languages of old.
Well, we won't be taking your advice, because you obviously haven't got a fucking clue. New Fortran language standards were issued in 1990, 1995 and 2003/4, keeping Fortran the language of choice for all numerically-intensive tasks.
C/C++ have attempted to replace Fortran, but have serious language problems (such as argument aliasing) that prevent compilers from optimizing code to the same degree as Fortran. Furthermore, contrary to your claim, there simply isn't a C/C++ library resource that comes anywhere close to the Fortran Netlib repository.
Mathematica is a nice product, but it deals with Math, and Math != Numerical Simulation. Mathematica is useless for serious computations, of the sort that would run on parallel computers.
-
Why FORTRAN makes people think FORTRAN-66 (or 77)
The p/o'd response basically sounds like "He's equating Fortran with FORTRAN-66 (or 77)".
I know that I do this too. When someone says "It's written in FORTRAN" I don't think Fortran-95, I think FORTRAN-77... and I'm usually right.
I suspect that there are two reasons for this:
- FORTRAN-77 was the big thing during FORTRAN's heyday, so most of the legacy FORTRAN code out there is FORTRAN-77.
- For a long time, the best Free Software FORTRAN compilers out there (g77, f2c) have been FORTRAN-77 compilers. g95 is still fairly young.
-
Actually, you're wrong
I don't know how much Dell's Tungsten cluster cost but those guys went online last year and got ranked #4 (just behind this Mac cluster) and they're #5 or something now. These bozos have spent a year fscking around with upgrades and from the theoretical #3 (as they were taken out since the cluster couldn't enter production) will have dropped to #7 or more in the next ranking....
Tungsten cost $12 million. Just for the hardware.
System X cost a total of $6 million, and it's still faster.
Not to mention that Virginia Tech was able to pull off a publicity coup and become #3 in the world, #2 in the US, and #1 academic for a paltry $5.2M. And they were "taken out" of the list voluntarily, because they dismantled the entire thing to replace it with Xserve G5s. With the renewed US focus on supercomputing, no one will likely EVER be able to hit #3 on this list for something anywhere close to $5.2M again.
Here's the current list:
http://www.netlib.org/benchmark/performance.pdf
Here's just the current top 20, as of 10/26/04:
http://das.doit.wisc.edu/misc/top500.jpg
Confusingly, you seem to have forgotten that if VT dropped on the list, and VT is still much faster than Tungsten, then Tungsten also dropped. Tungsten is currently #16. For $12 million. VT's 2.5 Tflops faster - a respectable standalone cluster's worth faster - for half the price. Plus VT got all the huge publicity and news articles, and attracted millions of dollars in funding and grants for their new supercomputer center. Not to mention bringing a whole new OS, platform, interconnect, and processor onto the scene, which will benefit everyone (competition and choice is good, right?).
Also, here's a really great cost/performance comparison of all the top clusters.
Nice try at trolling, but next time don't be so obvious and pathetic about it, especially when Tungsten looks like it clearly is the raw end of the deal, when you have to spend over twice as much money to get a cluster that performs significantly worse, and has worse power requirements. -
Here's the current list...
Prof. Jack Dongarra of UTK is the keeper of the official list in the interim between the twice-yearly Top 500 lists:
http://www.netlib.org/benchmark/performance.pdf See page 54.
And here's the current top 20 as of 10/26/04... -
Actually, VT will be #8 this time around
Prof. Jack Dongarra of UTK is the keeper of the official list in the interim between the twice yearly Top 500 lists:
http://www.netlib.org/benchmark/performance.pdf (see page 54)
There have been some new entries, including IBM's BlueGene/L, at 36Tflops, finally displacing Japan's Earth Simulator, and a couple other new entries in the top 5.
Here's just the top 16 as of 10/25/04:
http://das.doit.wisc.edu/misc/top500.jpg
No matter what anyone says, Virginia Tech pulled an absolute coup when they appeared on the list at the end of 2003: no one will likely EVER be able to be #3 on the Top 500 list for a mere US$5.2M...even if the original cluster didn't perform much, or any, "real" work, the publicity and recognition that came of it was absolutely more than worth it.
Also interesting is that there is also a non-Apple PowerPC 970 entry in the top 10, using IBM's JS20 blades... -
Re:65 TFlop is only an estimate
Virginia Tech achieved an "efficiency" of about 58%. Not great, especially as compared to earlier test builds of 128 nodes, but not among the worst performers, either. If a cluster has a low-latency, high-bandwidth interconnect, its Rmax score will approach its Rpeak score, although a certain fraction of the computing task cannot be parallelized.
According to Dongarra, a certain cluster using the Apple Xserve platform, composed of 1080 dual 2.3 GHz IBM PowerPC nodes with Mellanox InfiniBand and a Cisco Ethernet secondary fabric, scored 12050 GFlops on the Rmax test.
BlueGene          36.0 TF
Earth Simulator   35.9 TF
Red Thunder?      20.0 TF
Project Columbia  19.6 TF
ASCI Q            13.9 TF
VT's Terascale    12.050 TF -
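The remark that a fraction of the task cannot be parallelized is usually quantified with Amdahl's law. The sketch below is that textbook model, not anything the poster computed, and it ignores interconnect cost entirely; the function names and the sample serial fraction in the note are made up for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def parallel_efficiency(serial_fraction, processors):
    """Speedup divided by processor count: 1.0 means perfect scaling."""
    return amdahl_speedup(serial_fraction, processors) / processors
```

Even a tiny serial fraction matters at this scale: `parallel_efficiency(0.0003, 2160)` comes out at roughly 0.6 under this model, so a real machine's gap between Rmax and Rpeak need not be the interconnect's fault alone.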
Re:Code section destroyed by Slashdot.
Great, now try porting a complex Airy function to Java using those kinds of definitions. Welcome to writing numerical code in what amounts to assembly language.
-
Re:No, no, no!
nothing but gcc can compile the linux kernel anyway
So it's a bit silly to suggest using another compiler, isn't it? Personally I find /bin/true is even faster than Plan 9, when I don't care about the results. :-)
I don't think "embarrassingly parallel" means what you think it means, because compiling the kernel clearly is not such a problem. There are dependencies between tasks, only a limited number of tasks can be issued at a time and the whole thing ends up in two big links that can't be parallelized at all. In that benchmark, you end up with all but one CPU idle for the last second or so. -
Re:Check out Lisp
-
Re:NO Individual's Complaints
You are the biggest idiot I've seen in a while.
Well, thanks. It's nice to see some unbiased reports from the Apple users. That makes me the best in at least one area, right?
Care to post those benchmarks
OK. First thing, get Lapack. Then install ATLAS.
Run a matrix multiplication program. I tried to post it here, but got stuck on Slashdot's lame lameness filters, sorry about that, but the point is, multiply two matrices, at least 500x500. The matrices should be built of random numbers, like
/* fill both N x N matrices with random values in [0, 1] */
for (i = 0; i < N * N; i++) {
    AT[i] = (double)rand() / (double)RAND_MAX;
    BT[i] = (double)rand() / (double)RAND_MAX;
}
To get the time needed for each multiplication do this:
/* c1 is assumed to hold the matrix order N */
gettimeofday(&tv, &tz);
bs = tv.tv_sec;
bu = tv.tv_usec;
dgemm_(&opa, &opb, &c1, &c1, &c1, &alfa, AT, &c1, BT, &c1, &beta, CT, &c1);
gettimeofday(&tv, &tz);
/* elapsed time is ds + du / 1e6 seconds; du on its own may be negative */
du = tv.tv_usec - bu;
ds = tv.tv_sec - bs;
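The fragment stops at the two differences, so here is a small sketch of the step that would follow: turning the (seconds, microseconds) pairs into one elapsed time and then into a rate. The 2·m·n·k operation count is the usual convention for a matrix multiply; the function names are hypothetical and not part of the original post.

```python
def elapsed_seconds(bs, bu, es, eu):
    """Elapsed time between two (sec, usec) readings such as gettimeofday's.

    The microsecond difference may be negative; adding it to the second
    difference still gives the right answer, so no manual borrow is needed.
    """
    return (es - bs) + (eu - bu) * 1e-6


def dgemm_flops(m, n, k):
    """Multiply-adds counted as two operations each for an m x k by k x n product."""
    return 2.0 * m * n * k


def gflops(m, n, k, seconds):
    """Achieved rate in GFLOPS for one product that took `seconds`."""
    return dgemm_flops(m, n, k) / seconds / 1e9
```

For the 500x500 case suggested above, a run that takes a quarter of a second works out to `gflops(500, 500, 500, 0.25)`, i.e. 1 GFLOPS.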
Do it in each CPU which you want to check. Use each compiler you want to check. See the results. ATTENTION SLASHDOTS FUCKING MODERATORS: RUN THIS BENCHMARK BEFORE MODERATING ME EITHER TROLL, FLAMEBAIT OR REDUNDANT, OK?. Or, otherwise, fuck you, Apple moderators, I don't care. I don't need the mod points. I'll be 50+ and able to post at +2 after all the Apple (-1,Troll) points you give me, so I don't really care. The point is, for any of you who have the wits to run the benchmark, you'll realize that the "Apple is faster" stuff is a myth, believed only by those feeble minds who have paid an absurd price for a shitty Apple computer, which is unable to outperform a P4 computer.
Care to post ... benchmarks that do a bit more than linear algebra?
Not really. I don't care about how much time your computer spends doing Excel spreadshits. The really CPU-intensive tasks today can be reduced to linear-algebra problems. That's what people call "vector processing", or "digital signal processing" problems, or "neural networks", or whatever your CPU intensive number-crunching application is. The fact is that mathematicians have spent uncounted years transforming algorithms into floating point add/multiply operations, so that, when you really need CPU performance, what really matters today is how many add/multiply operations your CPU can do. Everything else is bullshit. However, since I've realized, from the Slashdot Apple moderators, how much bullshit people can swallow, I must agree that bullshit isn't unimportant at all. Long live Apple Marketing Bullshit! -
Re:For the price
It has done nothing of the sort. It got a great score on a very old benchmark: LINPACK benchmark results were first published in 1979 - LINPACK itself dates back to the early '70s.
It is at best a very rough approximation of a system's performance on a very specific type of problem. At worst, it is completely useless - there are so many different factors that contribute to LINPACK performance, and these factors often have *nothing* to do with the performance of a system on a real problem (in that the limiting factors in a *real* problem don't come into play).
For example: looking at the top 500 list, you'd think that the Earth Simulator was only three times faster than X. On its actual workload, however (simulating global weather patterns with resolutions of 100s of metres or less) it is very roughly 20 times faster than X. Why? It has about *60* times the memory bandwidth of X. But you'd never know that, looking at only the LINPACK figure.
Subtleties of benchmarking aside, the only other comment I'd like to make is that "Big Mac" hasn't proved anything: the machine was dismantled without any interesting scientific work performed on it whatsoever, as far as I can tell. I could be wrong here: can you point us to even a single application of this "very powerful" cluster?
I'm pretty sure that the only application so far has been some cheap PR for Apple (basically, the markdown on the 1000 G5 systems from new vs. the "ex-Big Mac" price, which is $200 less, if I remember correctly...). $200,000 is not a lot for the kind of publicity they ended up getting. (And continue to get, judging by your post, even though the machine doesn't exist any more!)
-
Re:RAD6000s seem closest to PPC601s
Sure, if all you had were 68k binaries-- on integer, the PPC601 emulator was half as fast as a comparably clocked 68030. But native applications were quite fast.
A Macintosh (7.8MHz 68000) rates about 0.40 Dhrystone MIPS. A PowerMac 7100 (66 MHz PPC601) gets about 129 MIPS. A 33 MHz 68040, perhaps 23. source
Of course, Dhrystone MIPS are pretty much obsolete as a benchmark.
-
Atlas :: empirically optimized blas/lapack
There is a very clever variation of this called ATLAS that optimizes BLAS and LAPACK routines "empirically". It has been around a while, originally (and perhaps still) a research project by Clint Whaley from Tennessee. (BLAS/LAPACK are numerical routines that do a slew of things people generally find useful for linear algebra and vectors.) These routines often sit "under" many mathematical packages like matlab/octave/maple/mathematica/scilab, as well as making up the core of many custom scientific computing packages (or even libraries like the "GNU Scientific Library").
Basically the gist is that ATLAS "empirically" (read: using an optimizer such as a GA, though empirical may actually mean brute force in this case) optimizes various parameters, for instance tuning the routine for the cache size of the processor. The cool thing about this is that they can get w/in 10% of hand-coded machine BLAS/LAPACK libraries w/out the pain!
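The brute-force reading of "empirical" can be sketched as: run the kernel with each candidate parameter, time it, keep the fastest. The code below is only an illustration of that idea, assuming a toy kernel with one tunable block size; `autotune` and `blocked_sum_of_squares` are hypothetical names, and ATLAS's real search is far more elaborate.

```python
import time


def blocked_sum_of_squares(data, block):
    """Toy kernel with one tunable parameter: process `data` in chunks."""
    total = 0.0
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        total += sum(x * x for x in chunk)
    return total


def autotune(kernel, candidates, repeats=3):
    """Time kernel(value) for each candidate and return (fastest, timings).

    The best of `repeats` runs is kept for each candidate to reduce noise.
    """
    timings = {}
    for value in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(value)
            best = min(best, time.perf_counter() - t0)
        timings[value] = best
    winner = min(timings, key=timings.get)
    return winner, timings
```

A call such as `autotune(lambda blk: blocked_sum_of_squares(data, blk), [16, 64, 256, 1024])` returns whichever block size was fastest on the machine it ran on, which is the whole point: the answer is measured, not predicted.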
-
Re:Two posts up...
From Netlib:
What is the Linpack's "Highly Parallel Computing" benchmark? The third benchmark is called the Highly Parallel Computing Benchmark and can be found in Table 3 of the Benchmark Report. (This is the benchmark use for the Top500 report). This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. http://www.netlib.org/benchmark/hpl/
Please note the words "highly-parallel" and "best performance" and the following phrase from the link from the quote:
Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per processor memory usage.
In a round-about way those quotes mean "if it does X FLOPs on 1 processor, it'll do 10X FLOPs on 10 processors because it's embarrassingly parallel." We're talking SETI- or RC5-type parallel. Regarding the 5% efficiency, please reference the following paper to see some numbers. In STREAM Triad, the Cray X1 has 24% efficiency, while a P4 had 3.4% of a peak rated at 2x the X1 processor. The paper was done by the Army HPC Research Center in Minneapolis, MN: http://www.ahpcrc.org/publications/X1CaseStudies/Cluster_CrayX1_Comparison_Paper.pdf -
Wrong
See http://www.netlib.org/benchmark/performance.pdf page 53.
1. Earth simulator
2. ASCI Q
3. Virginia Tech G5 cluster (9.555 Tflops and rising, $5.2M HARDWARE ONLY)
4. PNL Itanium2 cluster (8.633 Tflops, $24.5M HARDWARE ONLY)
So nope, not only will the PNL Itanium2 cluster not be #2, it will also be 1Tflop behind the Virginia Tech cluster, and it will have done it at almost 5 times the cost. Bravo! -
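The comparison in that post is simple enough to check. A minimal sketch using only the hardware costs and Tflops figures quoted in the list above; the helper name is made up for illustration:

```python
def cost_per_tflop(cost_musd, tflops):
    """Hardware cost in millions of US dollars per sustained Tflop."""
    return cost_musd / tflops


# (hardware cost in $M, Linpack Tflops) as quoted in the post
systems = {
    "Virginia Tech G5": (5.2, 9.555),
    "PNL Itanium2": (24.5, 8.633),
}

for name, (cost, tflops) in systems.items():
    print(f"{name}: ${cost_per_tflop(cost, tflops):.2f}M per Tflop")
```

On these numbers the raw price ratio is about 4.7 ("almost 5 times the cost"), and the gap per Tflop is a little wider, a bit over 5x, because the cheaper machine is also the faster one.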
Re:stupid question
Efficiency is the measured output as a percentage of the theoretical output. If the theoretical and measured output were the same, then the efficiency would be 100%. Efficiency beyond 100% is a perpetual motion machine, and even the US Patent Office won't let you submit a patent for one of these guys anymore.
In the supercomputer context there are 2 measurements of computing power in terms of FLOPS (Floating-Point Operations Per Second) on 64-bit "double precision" numbers: Rmax and Rpeak. Rpeak is the theoretical potential of the machine; it is estimated by taking the number of floating point ops/cycle * the clock rate of each processor * the number of processors. Rmax is the FLOPS measured by running the High Performance Linpack benchmark. -
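The Rpeak estimate and the efficiency ratio described above fit in two one-line functions. The sample figures in the note are an assumption stitched together from other comments in this archive (1080 dual-processor 2.3 GHz nodes, 4 FLOPs per clock for the G5, Rmax of 12050 GFlops), so treat the result as illustrative rather than official; the function names are made up.

```python
def rpeak_gflops(processors, clock_ghz, flops_per_cycle):
    """Theoretical peak: ops/cycle * clock rate * number of processors."""
    return processors * clock_ghz * flops_per_cycle


def efficiency(rmax_gflops, rpeak):
    """Measured Linpack rate as a fraction of theoretical peak."""
    return rmax_gflops / rpeak
```

With those assumed inputs, `rpeak_gflops(2160, 2.3, 4)` is about 19872 GFlops and `efficiency(12050, 19872)` is about 0.61, in the same neighbourhood as the "about 58%" quoted elsewhere for the original cluster.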
Re:The Mac cluster is still on top per CPU
The Mac cluster is still on top per CPU
From the same document the Mac proponents have been quoting from: Dongarra Doc
Table 3 - page 53:
Big Mac -> Rmax: 8164 Processors: 1936
Cray X1 -> Rmax: 2932.9 Processors: 252
Please be careful when making general statements. Thank you.
That said, yes, it has the highest per CPU performance of the machines with commodity processors. (that are listed, at least - including the year-old Xeons)
-
Now at 8.2 Tflop as of today (Oct 22)
See http://www.netlib.org/benchmark/performance.pdf page 53.
Since yesterday's release at 7.41 Tflop, the G5 cluster has already increased almost a Tflop, and is now ahead of the current #3 MCR Linux cluster, and about 0.5 Tflop behind a new Itanium 2 cluster. -
Re:No pussy-footing for NEC
just out of interest, can we get the benchmark program for these top500 scores?
I would want to relate their very impressive score with my minuscule AMD Athlon 2600 :)
I just did a bit of digging myself, and it appears as though the benchmark is called Linpack, and they have a Java version of it available. Definitely doesn't appear point and click; could anyone help me out with some figures on consumer class hardware - or is it simply not available?
-
Re:Macs ?
I am one of the designers of KLAT2 and KASY0, and the guy who ran the Linpack benchmarks on both. Over 3 years ago when we submitted our results for KLAT2 to the top500 list, there was no public indication that 64-bit floating point was required. It took them a while, but the top500 website now has a FAQ that indicates "full precision" is required, and they interpret that as 64-bit for most machines. FYI, 32-bit FLOPs are useful in many situations, and machines had been on the top500 list that had used 32-bit FLOPs. You might take a look at our KASY0 FAQ on GFLOPS. As a means to rank the top500, I think it is quite legitimate to require 64-bit FLOPS, but that doesn't make it "illegal" to use 32-bit Linpack FLOPS for other comparisons.
As for the G5, it won't need AltiVec to get good Linpack numbers due to the fused multiply-add capability in its dual floating point pipes. That's 4 FLOPs per clock peak! I hope VT was able to get Apple to leave out, and not charge for, the components not needed in a cluster node. The PCI-X slots in the G5 should allow VT to better use a high-speed cluster network technology. Commodity x86 boxes tend to only have 32-bit 33MHz PCI, limiting the usable link bandwidth between nodes to under a gigabit per second. For 64-bit Linpack GFLOPS per dollar, a cluster of G5s could be competitive. I look forward to seeing their results, and any similar work using the upcoming Athlon 64.
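To make the "4 FLOPs per clock" figure concrete, here is a rough sketch of how a theoretical peak (Rpeak) would be estimated from it. The 2.0 GHz clock, the processor count, and the measured rate are assumptions chosen for illustration only; they are not VT's published numbers:

```python
def rpeak_gflops(flops_per_cycle, clock_ghz, processors):
    """Theoretical peak: flops/cycle * clock rate * number of processors."""
    return flops_per_cycle * clock_ghz * processors


def efficiency(rmax_gflops, rpeak):
    """Fraction of the theoretical peak actually measured by Linpack."""
    return rmax_gflops / rpeak


G5_FLOPS_PER_CYCLE = 4   # two FP pipes, each doing a fused multiply-add
G5_CLOCK_GHZ = 2.0       # assumption: clock rate of the cluster nodes
PROCESSORS = 1000        # hypothetical processor count
MEASURED_GFLOPS = 6000   # hypothetical Rmax, for illustration only

peak = rpeak_gflops(G5_FLOPS_PER_CYCLE, G5_CLOCK_GHZ, PROCESSORS)
print(f"Rpeak: {peak:.0f} GFLOPS")
print(f"Efficiency: {efficiency(MEASURED_GFLOPS, peak):.0%}")
```

The gap between peak and measured rate is where the interconnect matters, which is the point of the PCI-X remark above.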
-
LINPACK, the one true benchmark
For decades, the number-crunching community has used the same benchmark, LINPACK. The standard problem is to solve a 100x100 system of linear equations. The latest results for hundreds of machines, updated through June 3, 2003, are here. Some highlights:
- IBM eServer pSeries 690 Turbo: 1462 MFlops/sec.
- Intel Pentium 4, 3.06GHz: 1414 MFlops/sec.
- Cray T94: 1129 MFlops/sec.
- Cray Y-MP EL: 41 MFlops/sec.
- Pentium Pro 200MHz: 38 MFlops/sec.
- Apple Macintosh: 0.0038 MFlops/sec.
- Palm Pilot III: 0.00081 MFlops/sec.
That's how much floating point work the CPU can do per unit time. That's the no-hype benchmark for CPUs.
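For anyone curious what the 100x100 problem looks like, here is a minimal sketch in pure Python. It is not the official LINPACK code: it uses plain Gaussian elimination with partial pivoting, a random matrix from Python's own generator, and the conventional operation count of 2/3 n^3 + 2 n^2. The function names are made up, and an interpreted version like this will land far below the compiled figures listed above:

```python
import random
import time


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies, leave the inputs alone
    b = b[:]
    for k in range(n):
        # Pick the row with the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            m = a[i][k] / row_k[k]
            row_i = a[i]
            for j in range(k + 1, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / a[i][i]
    return x


def linpack_like_mflops(n=100, seed=0):
    """Time one n x n solve and return (MFlops, max error in x)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]  # right-hand side chosen so x is all ones
    start = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    max_err = max(abs(xi - 1.0) for xi in x)
    return flops / elapsed / 1e6, max_err


if __name__ == "__main__":
    mflops, err = linpack_like_mflops()
    print(f"{mflops:.2f} MFlops, max error {err:.2e}")
```

Checking the error against the known all-ones solution guards against timing a solve that got the wrong answer.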
-
Netlib
-
right tool for the right job...
it all depends on what you're doing. For a lot of applied math (read as: PDE, linear algebra intensive, etc.) there are a lot of optimizations that are best done in code.
At UCLA's math department, the applied people have a QUIST-related research program that uses Fortran77, shell scripts, and C++ for various parts of their code to implement the Level-Set method to simulate the growth of thin films (atom-by-atom construction of electronic devices). The language choice seems to be due to legacy reasons: the grad student who started it so long ago used it, and the code has continued to grow ever since.
Although I wouldn't call it *math* research, here at MBI, bioinformatics programs run on the cluster *seem* to be written in C or C++ for the most part. I think it's more of the former because that's the interface for a number of bioinformatics libraries that we have licensed. Also, these things tend to be mixed heavily with shell and perl scripts, so the language is chosen only for ease of integration with support libraries.
For most all of my undergrad work, I saw everyone use matlab, Mathematica, and their relatives for their work. In grad school, it seems to depend more on the class and the religious leanings of the mathematician involved.
There's a class on scientific computing that uses VC++ with fortran libraries from netlib (leveraged by f2c) to solve some math implementation problem (tends to vary from year to year). Prof Anderson tends to be a junkyard warrior when it comes to math code generation. But then he's the mathematician's MacGyver. (side note: Prof Anderson is a wonderful teacher and researcher - check out his page for some handy software tools and papers. Also, look at 270B for tidbits of linear algebra optimizations).
The benefit of matlab-ish programs is that you can usually implement your math structure quickly. The downside is that if you want to use any advanced optimization, it's nearly impossible. On the other hand, if you don't have a numerical analysis background, then many of the things you try to do to optimize your code in more mundane languages are probably going to be *much* slower than matlab, et al.
All of this is assuming you're doing numerical analysis. If you're interested in abstract algebra, then I think you're stuck with maple. Good program, but I don't have a review on it since I did most of my work by pencil and paper. I did use it for one of my crypto classes and found its implementation of Z_n groups very nice... although I ended up just coding it in C++ anyway :)
Also, check out the R project as it is GNU matlab.
-
Re:Fortran
There is a healthy skepticism of "black box" programs and libraries so programs like Mathematica and Matlab are pretty much not used.
And as a result, physicists (who generally don't really know numerical analysis: witness Numerical Recipes) waste hundreds of man-years trying to reproduce the functionality of LAPACK, ARPACK, FFTW, etc., which were carefully written and tested by people who know what they are doing.
Moreover, none of these are black boxes. Full source is available at netlib. Matlab and Octave use these packages to perform basic operations, so you can't really call Matlab a black box either.
-
what I have used
For a text retrieval / linear algebra project (latent semantic indexing, et cetera), I used Matlab for quick testing and experimentation. Then, when it was time to write some stand-alone code to accompany my paper, I used the GNU Scientific Library (GSL). In addition to its own operations, it provides an interface to CBLAS (C Basic Linear Algebra Subprograms) which are pretty useful themselves. I considered LAPACK, but the documentation seemed less accessible. Both LAPACK and GSL are based on BLAS (Basic Linear Algebra Subprograms).
For a non-research project (nautilus shell simulation for the Ball State University math department), I used Maple and the Geometer's Sketchpad.
josh -
Re:It's nice
There are, believe it or not, beautiful pieces of Fortran IV out there --
You better believe it. If we're looking for long-lived code, we should start looking at Netlib. Netlib is full of beautiful code -- beautiful in part because it is beautifully written, but also because it expresses beautiful ideas. What makes the bazillions of lines of C++ and Javascript disposable is that it's mostly about ideas that might just as well be forgotten.
My nomination for the Beautiful Fortran Award would have to be ARPACK. It is clear, concise, elegant, efficient, and TOTALLY ROCKS. Eigenvalue problems are not easy, which only increases the beauty of any solution.
-
Re:Oldest working code...
No doubt there are zillions of lines of code still kicking and screaming within industry, but I'm more interested in code that is out in the wild, and still being used somewhat actively.
Any other contenders?
Try www.netlib.org. It's all mathematical libraries, the old stuff is all in Fortran, and it does still get used even though some of it goes back to the late 60's and early 70's. Completely debugged code is hard to find, and when you get your hands on it, you hang on to it forever.
-JS
-
Re:FLOPs
-
Has its place of course
Fortran still has (and will likely continue to have) its place in science and engineering, especially for simulation and computational purposes. For the reasons already listed, efficient and robust computational libraries like BLAS and LAPACK are written in Fortran.
Fortran was designed specifically for numerical calculations and thus is better suited for certain tasks than a language designed for system programming like C. Many engineering schools (mine included) require Fortran in the first year curriculum. They keep talking about switching to a more "modern" language like C or Matlab, but that has been mostly talk.
So go learn some Fortran if you will be doing computational programming. Even if you won't, it is one of the easiest languages for getting into programming.