It may take a few nanoseconds for the light to bounce around, but that light can be modulated at extremely high rates (rates that electrical wires cannot match). Managing latency is a well-understood problem, generally solved by using speculation, buffering, etc.
The extra bandwidth does indeed allow more in-flight memory accesses, but there are many problems involved with this.
First of all, there are inherent limits on the memory-level parallelism in applications. How many memory accesses are independent of each other? For example, code that manipulates hash tables or linked lists does not benefit from additional bandwidth, because the next memory access depends on the current one. Such code normally does things like:
LD R1, 0(R1)
do something on R1
LD R1, 0(R1)
etc
See the dependency on R1
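To make the dependency a bit more concrete, here is a rough sketch in Python (my own illustration, not taken from any real codebase; `Node`, `build_list`, `sum_linked_list` and `sum_array` are made-up names). Python obviously hides the hardware, so this only shows the shape of the data dependency: in the list walk the next address comes out of the current load, while in the array walk every address can be computed up front.

```python
class Node:
    """One element of a singly linked list (hypothetical helper)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_list(values):
    """Build a linked list holding `values` in order; returns the head (or None)."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sum_linked_list(head):
    """Dependent chain: the address of the next node is only known
    after the current node has been loaded (like LD R1, 0(R1))."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next  # next access depends on this one
    return total


def sum_array(values):
    """Independent accesses: the location of element i depends only on i,
    so in principle many of these loads could be in flight at once."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total
```

Both functions compute the same sum; the difference is only in whether the hardware could know the next address before the current load returns.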
Second, there are problems with the microarchitecture. Microprocessors contain two structures needed to handle off-chip memory accesses: the load/store queues (LSQ) and a special table that tracks these external accesses (called something like the miss address table, where "miss" refers to the corresponding cache miss). The LSQs track all in-flight loads and stores and are used to check dependencies between them. The largest implementation I remember can track a total of 48 loads (is it the P4?). The miss address table holds references to all off-chip accesses, and it is usually smaller. Thus, even if all the memory accesses are independent and can be issued in parallel, you cannot really take advantage of all that bandwidth. In theory you could issue hundreds of memory accesses during the time one off-chip access is in progress. In reality you will end up with 20 or 30 (not counting prefetches and the like).
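A back-of-the-envelope way to see this limit is Little's law: sustained bandwidth is roughly (outstanding misses) x (bytes per miss) / (latency). The sketch below is just that arithmetic. The 64-byte line size and the 100 ns off-chip latency are assumptions I picked for illustration, not measurements of any particular chip, and the function names are made up.

```python
import math


def sustained_bandwidth_gb_per_s(outstanding_misses, line_bytes=64, latency_ns=100.0):
    """Little's law estimate of achievable memory bandwidth.

    bytes per nanosecond is numerically the same as (decimal) GB/s.
    line_bytes and latency_ns are illustrative assumptions.
    """
    return outstanding_misses * line_bytes / latency_ns


def misses_needed(target_gb_per_s, line_bytes=64, latency_ns=100.0):
    """Number of misses that must be in flight to sustain target_gb_per_s."""
    return math.ceil(target_gb_per_s * latency_ns / line_bytes)
```

Under those assumed numbers, 20 to 30 outstanding misses give about 12.8 to 19.2 GB/s, while saturating a hypothetical 100 GB/s link would need 157 misses in flight, far more than the tables described above can track.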
I work at UPC and there has been a lot of hype here for machine #4, which is (or is going to be) a machine with more than 4500 PPC970s running Linux (nice work, IBM). I disagree with the claim that the Virginia Tech cluster is the first academic supercomputer. As far as I'm concerned, the Technical University of Catalonia (UPC) is also an academic institution.
Anyway, we now have Europe's fastest supercomputer. That's what matters. Ha! ;-)
You may also be interested in reading about loop quantum gravity, an alternative theory of everything. I'm no expert, so better refer to this reference that I looked up.
So the stuff may have been posted as a joke, but as you probably already know, it works just fine, which is not very typical of an April Fools' joke... The programming style is a la brainfuck, but does anybody know whether this is Turing-complete?
However, what I find most curious is the use of Haskell (a functional language) as the language of choice for the compiler. Perhaps the author wants to fool us?
The Glasgow Haskell Compiler may be found here. It's currently at 5.04.3. I used this one to compile the sources. Haskell is not an easy language to learn, so better not look into the source if you want to have a happy day! ;)
SPECfp is theoretically designed to be a cross-platform rating. However, there are always issues with compilers and the test system that make the results more indicative than conclusive. In the past some processors/compilers have even gone as far as detecting the algorithms used by specific SPECfp tests and returning precomputed values. This may be good for the chip's marketing but, of course, it makes the benchmark results completely useless. So you should always take care before making a decision.
sorry for not including references
Just a minor comment...
;-)
regards