Posted by
ryuzaki0
on from the cut-on-out-speed-one-up dept.
INicheI writes "According to Intel, the plans for a release of a 2GHz Xeon for dual processor servers have been cancelled. Instead Intel is planning to debut a 2.2GHz chip codenamed "Prestonia" that will be ready the first quarter of 2002. I would love to see Quake running on a 4.4GHz computer."
For pc-emulation
by
Baki
·
· Score: 3, Informative
Such extremely fast computers might be good for virtual-PC environments such as VMware. Your Windows-in-a-virtual-PC always takes a huge performance hit due to emulation, so much so that it isn't even possible to emulate 3D graphics hardware acceleration (DirectX) in such products.
Having an obscenely fast PC might make it possible to run Windows under Linux, and still have Windows, including DirectX, run with enough performance to do some serious gaming.
Re:what is it good for?
by
OmegaDan
·
· Score: 3, Informative
Everyone knows that nobody can really see the difference between 40fps and 100+fps
This is true but you've missed the point... FPS is a measurement of *average* framerate. Ultra fast cards are an attempt to raise the *worst case* performance of the card not the average case. A mere side-effect of this is raising the FPS.
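To make the average vs. worst-case distinction concrete, here is a minimal sketch. The frame times are made-up numbers, and the helper names `average_fps` and `worst_case_fps` are purely illustrative; nothing here comes from a real benchmark.

```python
# Hypothetical frame times in milliseconds for a short run (made-up numbers):
# nine quick frames and one slow one.
frame_times_ms = [10, 10, 10, 10, 10, 10, 10, 10, 10, 60]


def average_fps(times_ms):
    """Frames rendered divided by total elapsed seconds."""
    return len(times_ms) / (sum(times_ms) / 1000.0)


def worst_case_fps(times_ms):
    """Instantaneous frame rate of the single slowest frame."""
    return 1000.0 / max(times_ms)


print(f"average:    {average_fps(frame_times_ms):.1f} fps")
print(f"worst case: {worst_case_fps(frame_times_ms):.1f} fps")
```

With these numbers the average comes out near 67 fps while the slowest frame corresponds to under 17 fps, which is the hitch you would actually notice.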
Re:Stupid Question
by
Dwain_Snyders
·
· Score: 3, Informative
There are several advantages to a setup as described in this story... a dual-processor Xeon can have benefits on the desktop. Of course, I'd never push a Xeon processor in this environment as I honestly don't think it will be the overall best solution in the near future. With dual-Athlons and Durons on the horizon, I'd take a closer look at them before considering a dual Xeon system, if only for the price aspect. However, I will attempt to explain why the Xeon architecture is superior to a standard Pentium III and why it potentially matters on the desktop.
Intel produces a version of the Pentium II and III called the "Xeon", which contains up to 2 megabytes of L2 cache. The Xeon is used frequently in servers as it supports 8-way multi-processing, but on the desktop the Xeon does offer considerable speed advantages over the standard Pentium III when large amounts of data are involved.
Basically, the larger the working set of an application, that is, the amount of code and data in use at any given time, the larger the L2 cache needs to be. To keep costs low, Intel and AMD have both actually DECREASED the sizes of their L2 caches in newer versions of the Pentium III and Athlon, which I believe is a mistake. (AMD is working on this in the new chips - new technology will be used to increase the size of the L2 cache while retaining the full data-shuttle flexibility).
The top level cache, the L1 cache, is the most crucial, since it is accessed first for any memory operation. The L1 cache uses extremely high speed memory (which has to keep up with the internal speed of the processor), so it is very expensive to put on chip and tends to be relatively small: from 8K in the 486 to 128K in the Athlon.
The next step is the decoder, and this is one of the two major flaws of the P6 family. The 4-1-1 rule prevents more than one "complex" instruction from being decoded each clock cycle. Much like the U-V pairing rules for the original Pentium, Intel's documents contain tables showing how many micro-ops are required by every machine language instruction, and they give guidelines on how to group instructions.
Unlike main memory, the decoder is always in use. Every clock cycle, it decodes 1, 2, or 3 instructions of machine language code. This limits the throughput of the processor to at most 3 instructions per clock. For example, a 1 GHz Pentium III can execute at most 3 billion instructions per second, or 3000 MIPS. In reality, most programmers and most compilers write code that is less than optimal, and which is usually grouped for the complex-simple-complex-simple pairing rules of the original Pentium. As a result, the typical throughput of a P6 family processor is more like 2 instructions per clock. For example, 2000 MIPS for a 1 GHz processor. You'll notice that the Athlon outperforms the P3 family in this regard by a large margin.
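A rough sketch of why instruction grouping matters under the 4-1-1 rule. This is a deliberately simplified model, not Intel's actual decoder logic: it assumes decoder 0 takes any instruction of 1 to 4 micro-ops, decoders 1 and 2 take only single micro-op instructions, and a complex instruction arriving at decoder 1 or 2 waits for the next cycle. The function names and micro-op counts are made up for illustration.

```python
def decode_cycles(uops_per_instruction):
    """Count decode cycles under a simplified 4-1-1 model (an assumption)."""
    cycles = 0
    i = 0
    n = len(uops_per_instruction)
    while i < n:
        # Decoder 0 takes the next instruction, simple or complex.
        i += 1
        # Decoders 1 and 2 each accept one more instruction,
        # but only if it is a single micro-op.
        for _ in range(2):
            if i < n and uops_per_instruction[i] == 1:
                i += 1
            else:
                break
        cycles += 1
    return cycles


def mips(uops_per_instruction, clock_mhz=1000):
    """Instructions per clock times clock rate in MHz (1 GHz assumed)."""
    ipc = len(uops_per_instruction) / decode_cycles(uops_per_instruction)
    return ipc * clock_mhz


grouped = [2, 1, 1, 2, 1, 1]      # complex-simple-simple, fits 4-1-1
alternating = [2, 1, 2, 1, 2, 1]  # complex-simple pairs, Pentium style

print(mips(grouped))      # 3000.0
print(mips(alternating))  # 2000.0
```

In this toy model the same six instructions decode at 3 per clock when grouped complex-simple-simple, but only 2 per clock when they alternate, which lines up with the 3000 vs. 2000 MIPS figures above.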
I've certainly never seen it happen on real production code, but it's easy to make it happen, so it's not theoretical.
Recipe:
Write a program that operates on a dataset just smaller than 2x your CPU's L2 cache size. Time it on your single CPU box.
Add the second CPU, and break the program into two threads: one operates on the first half of the dataset, the other on the second half. Time it.
I'm sure a similar parallelization has just happened to occur at some point in the history of computing... =)
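Assuming "it" above means getting more than a 2x speedup from the second CPU, the recipe can be illustrated with a toy cost model. This is not a benchmark: the 256 KB cache size, the 10:1 miss-to-hit cost ratio, and the `run_time` helper are all assumptions picked for illustration, and real caches degrade far more gradually than this all-or-nothing model.

```python
def run_time(dataset_bytes, cpus, l2_bytes, hit_cost=1.0, miss_cost=10.0):
    """Toy model: each CPU repeatedly walks its own share of the data.

    If the share fits in that CPU's L2, every access costs hit_cost;
    otherwise every access costs miss_cost. All costs are made-up units.
    """
    share = dataset_bytes / cpus
    cost_per_byte = hit_cost if share <= l2_bytes else miss_cost
    return share * cost_per_byte


L2 = 256 * 1024          # assumed: 256 KB of L2 per CPU
DATASET = 2 * L2 - 4096  # just smaller than 2x the L2 size

t1 = run_time(DATASET, 1, L2)  # one CPU: working set overflows L2
t2 = run_time(DATASET, 2, L2)  # two CPUs: each half fits in its own L2

print(f"speedup: {t1 / t2:.1f}x")
```

The exact ratio printed is meaningless since the costs are invented; the point is only that halving the per-CPU working set can move it from "misses L2" to "fits in L2", so the speedup exceeds the CPU count.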
Re:At what point...
by
phil_was_here
·
· Score: 2, Informative
brings back memories of school at the medical center. this probably goes under what is called the temporal sensitivity of the human visual system. if you gradually increase the frequency of a blinking light you reach what is called the CFF or critical flicker frequency (or FFF) where the system can no longer detect that the light is flashing (it appears continuous). we got to do all these fun experiments. central vision (foveal cones) has a CFF of 50-70 Hz depending on the lighting conditions (state of adaptation); whereas, peripheral vision (rods) has a CFF of only 15-20 Hz.
another point is that the fovea is not that sensitive to changes in light amplitude (level); whereas, in the periphery small luminance changes can be detected. this accounts for being able to detect the flicker of fluorescent lights out of the corner of your eye... then when viewed foveally it stops flickering because here it is less sensitive. in summary we can say that peripheral vision fuses at low frequency and that it can detect flicker with small modulation. becoming a doctor was a lot of fun :p.
Re:Stupid Question
by
Anonymous Coward
·
· Score: 1, Informative
One of the reasons you can't generate optimal 4-1-1 code on the P6 family is that every instruction that stores to memory consists of at least 2 microops: one for the data and one for the address. Add to that the almost registerless nature of x86, especially when compiling position-independent code for shared libraries (which makes EBX unusable): if you need a frame pointer because you have a variable-sized array on the stack or use alloca(), you are left with 5 registers, with another one needlessly clobbered as soon as you need a division (EDX) or a modulo (EAX) by a non-power of 2. The other 3 are clobbered as soon as you use a string instruction...
Bottom line: with so few registers you have to keep a lot of live variables on the stack, spilling and reloading them like mad. Of course every spill is a store.
Also, when performing an operation between memory and register, you have 2 possibilities: either load into a free register and then perform a register-to-register operation, which is good for the decoders but may indirectly cause another spill because temporary registers are so hard to find, or use a memory-to-register operation, which takes 2 microops and can only be issued by the first decoder.
Actually the rules for the Pentium were simpler and more general: never use a register-to-memory operation, but use memory-to-register, especially if you can pair them, since in this case you win on register pressure and code size without any execution time or decoding penalty.
Actually the choice of AMD to extend the architecture to 16 registers is quite clever and solves a lot of the spill/reload problems: the increase from 8 to 16 is often in practice an increase from 6 to 14 or 5 to 13, multiplying the number of free registers by about 2.5. This is enough to solve the spill/reload problems on many, but not all, algorithms (with 64-bit addressing you try to keep more addresses in registers for code size reasons, while 32-bit addresses can easily be embedded in code).
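The "about 2.5" figure is easy to check with a few lines of arithmetic. The number of reserved registers (2 or 3, e.g. stack pointer plus PIC base and/or frame pointer) follows the counts given earlier in this comment; the helper name `free_register_gain` is just for illustration.

```python
def free_register_gain(total_before, total_after, reserved):
    """Return (free before, free after, ratio) for a given reserved count."""
    free_before = total_before - reserved
    free_after = total_after - reserved
    return free_before, free_after, free_after / free_before


for reserved in (2, 3):
    before, after, ratio = free_register_gain(8, 16, reserved)
    print(f"{reserved} reserved: {before} -> {after} free ({ratio:.2f}x)")
```

This prints ratios of 2.33x and 2.60x, so "about 2.5" is a fair summary: the fewer registers you had free to begin with, the bigger the relative win.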
Having hand-coded some critical subroutines myself on machines with 8 general purpose registers (x86), hybrid 8/16 (68k family; the address/data split is sometimes annoying, but at least any register can be used as an index for addressing, and with the scaled index from the 68020 onwards it becomes quite nice), 16 general purpose registers (VAX and IBM mainframes; the fact that R0 can't be used for addressing on the latter is irrelevant in practice) and 32 (PPC mostly, with some MIPS and Alpha), I can say that x86 is by far too limited, while I hardly ever ran out of registers on other architectures. 16 registers with a rich set of addressing modes is fine, although RISC machines with 32 registers and less choice of addressing modes are actually slightly easier to handle.
Bottom line: the x86 architecture sucks and requires astronomical bandwidth to the cache to run fast because it seriously lacks registers (and don't get me started on the floating-point stack).