Absolutely wrong. Anything that is going to be scaled or rotated (think matrix transformations, vector multiplication, and physics calculations) needs to have floating point representation, unless the processor architecture is incapable of it (the Game Boy Advance or other embedded platforms, say). You can have pseudo floating point with ints (the last x bits are behind the decimal point, say), but the software needs to do extra work: there are special cases for multiplication, etc. Most processors these days are designed to do floating point operations as fast as anything else, so adding unnecessary overhead that ignores basic functionality would be stupid.
That's a common misconception. You do not need floats for accuracy, but for dynamic range. In fact a 32-bit int is more precise than a float (24-bit mantissa), and an int64 is more precise than a double (53-bit mantissa). For cases where the dynamic range is known beforehand, it is beneficial to use fixed point (integer) numbers. This is especially true for normalized rotation matrices and vector data. AFAIK many GPUs even work with integer-only arithmetic, since the dynamic range can easily be derived at every stage. A place where you do need floats is color computations.
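To make the precision point concrete, here is a rough Python sketch. It rounds a value through a 32-bit float to show the 24-bit mantissa limit, and does a fixed point multiply in a Q16.16 layout. The Q16.16 format and all function names are my own assumptions for illustration, not anything from the posts above.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Assumed fixed point layout for this sketch: 16 integer bits, 16 fraction bits.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Convert a real number to Q16.16 (rounds to the nearest step of 1/65536)."""
    return int(round(x * ONE))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values.

    The raw product carries 32 fraction bits, so it is shifted back by 16.
    In C the intermediate product would need a 64-bit type.
    """
    return (a * b) >> FRAC_BITS

def to_real(a: int) -> float:
    """Convert Q16.16 back to a Python float for printing."""
    return a / ONE

if __name__ == "__main__":
    # 2**24 + 1 fits easily in a 32-bit int, but not in a float's mantissa.
    print(to_float32(16777217.0))
    print(to_real(fixed_mul(to_fixed(0.5), to_fixed(0.25))))
```

The shift after the multiply is the "extra work" the parent mentions; whether it costs more than a float multiply depends on the CPU.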
Fixed point arithmetic is probably slower than floating point on an x86, because shifts and imuls also slow
Of course I meant "are".. but otherwise I am totally serious about this - check out the execution timing of shifts and imuls. On a P4, shifts take multiple cycles and there is only one shifter; imuls have always been slower than fmuls and are not pipelined either.
OK, nice. But where is the actual advantage? Are the 64-bit features used anywhere, or is it just an updated version to cope with 64-bit addressing or something similar? The timing of the release seems to hint at this :)
Great idea - I really like it. However, the design could use some work. Uhm, but the site is totally dynamic and PHP based. Judging by the response times, total slashdotting is near...
But the worst part is: how dare you submit this to Slashdot without having Neal Stephenson in the authors list?
The design can ignore WAW and RAW dependencies regarding r0 and pipeline stalls due to related result bypassing are eliminated.
R0 is almost never used as a target register, so there are no potential hazards. So it does not help here either. For the other registers the mentioned problems are solved with register renaming.
NOP = add r0, r0, r0 (actually, add any two registers with the destination r0)
OR rx,rx,rx and AND rx,rx,rx are two more examples of NOPs.
All your arithmetic comparisons are based off subtractions/additions (which are really the same thing). Compare with 0 for a lot of stuff.
All MIPS compare-and-branch instructions except beq/bne (bltz, blez, bgtz, bgez) already compare with 0 implicitly. You do not need an explicit register for it.
Also, lots of assignments are to 0. Lots of if, for, while conditionals are to 0
You can clear any register with XOR rx,rx,rx. See above for the conditionals.
Also, there are few addressing modes, one is register + immediate offset. There is no register only addressing mode. How do you do direct addressing? r0 + immediate offset.
Immediate addressing only reaches the first and last 32 KB of memory, because the 16-bit offset is sign-extended. Therefore there is hardly any use for this. If you really need it, you can still initialize a register to zero.
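The 32 KB limit follows directly from the sign-extended 16-bit offset. A small Python sketch of the address calculation, assuming a 32-bit address space (the function names are made up for this example):

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a signed two's complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(base: int, imm: int) -> int:
    """Address of a load/store using base register + 16-bit immediate offset."""
    return (base + sign_extend16(imm)) & MASK32

if __name__ == "__main__":
    # With base = r0 = 0 only two windows of memory are reachable.
    print(hex(effective_address(0, 0x0000)), hex(effective_address(0, 0x7FFF)))
    print(hex(effective_address(0, 0x8000)), hex(effective_address(0, 0xFFFF)))
```

With a zero base, offsets 0x0000 to 0x7FFF land in the first 32 KB and 0x8000 to 0xFFFF wrap around to the last 32 KB.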
Hardcoding r0 to zero actually keeps you from having to add more instructions just to handle the special cases of zero.
Yes, but where are these cases? Except for the NEG example, nobody managed to bring up anything worthwhile...
The MIPS architecture already has a proper 'move' instruction without using r0: r12 = r8 | r8, or r12 = r8 | 0 (zero specified as immediate)
Right, I noticed it myself right after posting. There are also several other combinations, like rt = rs1 AND rs2 and the shift instructions.
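For anyone who wants to check the idioms from this subthread, here is a toy Python model of three 32-bit logic operations. It is only a sketch: the register names and the demo function are assumptions for illustration, and it models nothing beyond the bitwise results.

```python
MASK32 = 0xFFFFFFFF

def mips_or(rs: int, rt: int) -> int:
    return (rs | rt) & MASK32

def mips_and(rs: int, rt: int) -> int:
    return (rs & rt) & MASK32

def mips_xor(rs: int, rt: int) -> int:
    return (rs ^ rt) & MASK32

def demo(value: int):
    """Run move, NOP and clear idioms without ever reading r0."""
    regs = {"r8": value & MASK32, "r12": 0x12345678}

    regs["r12"] = mips_or(regs["r8"], regs["r8"])     # or  r12, r8, r8   -> move
    moved = regs["r12"]

    regs["r8"] = mips_and(regs["r8"], regs["r8"])     # and r8, r8, r8    -> NOP
    unchanged = regs["r8"]

    regs["r12"] = mips_xor(regs["r12"], regs["r12"])  # xor r12, r12, r12 -> clear
    cleared = regs["r12"]

    return moved, unchanged, cleared

if __name__ == "__main__":
    print([hex(v) for v in demo(0xDEADBEEF)])
```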
The r0 is frozen at 0 so you can do negations (for which ARM uses 'rsb' or reverse subtraction) and other things where zero must be the first argument.
OK, it looks like this is the only application for it. But still, setting r0=0 violates the principle of orthogonality and introduces a lot of redundancy into the instruction set. I can see that it looks academically clean to do it like that, but is it really superior to the other options?
- Adding a dedicated neg instruction. There are lots of redundant instruction encodings which could have been used. Disadvantage: maybe an additional mux in the instruction decoder, and it violates the idea of having only three-operand instructions.
- Emulating neg using two instructions. Disadvantage: Slowdown by less than 1%..
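One possible two-instruction sequence for the second option is NOR followed by an add of 1, since -x equals ~x + 1 in two's complement. The Python sketch below compares it against the r0-based subtraction on 32-bit values; the choice of this particular sequence is my assumption, the post above does not say which pair it had in mind.

```python
MASK32 = 0xFFFFFFFF

def nor(rs: int, rt: int) -> int:
    """nor rd, rs, rt: bitwise NOT of (rs OR rt), truncated to 32 bits."""
    return ~(rs | rt) & MASK32

def addiu(rs: int, imm: int) -> int:
    """addiu rd, rs, imm: add without overflow trap, truncated to 32 bits."""
    return (rs + imm) & MASK32

def neg_via_r0(rs: int) -> int:
    """sub rd, r0, rs: the single-instruction form that needs a zero register."""
    return (0 - rs) & MASK32

def neg_two_instructions(rs: int) -> int:
    """nor rd, rs, rs ; addiu rd, rd, 1: no zero register involved."""
    tmp = nor(rs, rs)      # tmp = ~rs
    return addiu(tmp, 1)   # ~rs + 1 == -rs (mod 2**32)

if __name__ == "__main__":
    for x in (0, 1, 5, 0x7FFFFFFF, 0x80000000):
        print(hex(x), hex(neg_via_r0(x)), hex(neg_two_instructions(x)))
```

The sequence also works when source and destination are the same register, which a "clear, then subtract" pair would not.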
Couldn't do any better than to choose the MIPS instruction set. I looked at it years ago and was impressed with its clean design
That's no wonder - it was refined during years of research by Hennessy and Patterson.
However, if you look closely you will notice that the instruction set also contains some obsolete legacy. For example, branch delay slots do not make any sense on out-of-order (OOO) architectures. It is also questionable whether spending quite a bit of instruction space on integer arithmetic both with and without overflow trapping is worth it. Maybe they could just have used the extra space for a proper move instruction, so that R0 is freed.
Totally redundant. How about checking the domain before blind karma whoring ? I have never seen eetimes slashdotted, and there are lots of eetimes stories on Slashdot.
The server is already quite slow, so here is the text:
IBM PowerPC Blade
Prototype from the IBM Development Lab in Böblingen, Germany
With the planned introduction of the PowerPC Blade, IBM will expand the performance of the IBM eServer BladeCenter and further extend its range of open source IBM eServers. To customers in the high performance computing sector (HPCS) the PowerPC Blade presents a very interesting and competitive extension of the IBM eServer BladeCenter and offers cost-effective computing power in the Unix and Linux area.
The PowerPC Blade offers outstanding performance and is superior to Intel Blades for certain applications in the High Performance Computing Sector. It is ideal for very computing intensive applications, for example in the area of simulation like meteorology or geological calculations. The PowerPC Blade integrates seamlessly into the IBM eServer BladeCenter architecture with all its software components. Power and Intel Blades can be mixed in a BladeCenter in any order depending on the software applications.
The new IBM PowerPC 970 is the heart of the PowerPC Blade. It is based on the 64-bit Power 4 architecture, which is also used in the processors of the IBM eServer pSeries. The 64-bit microprocessor
- Offers full symmetrical multi-processing
- Has high reliability (with parity L1, ECC L2 and parity-checked system bus)
- Is manufactured in the latest 0.13 micrometer copper/SOI CMOS technology
- Runs at frequencies ranging from 1.8 GHz to 2.5 GHz
Therefore the IBM PowerPC 970 is the fastest PowerPC so far.
Further technical highlights of the PowerPC 970:
- On-chip 512 KB L2 cache
- AltiVec (TM) vector/SIMD unit
- 6.4 GB/s I/O system bus throughput
The IBM eServer BladeCenter has been available since December 2002 and is currently delivered with Intel processor blades.
IBM offers a solution for modular computing with the space-saving BladeCenter. The IBM eServer BladeCenters distinguish themselves by their high reliability and extensive system management software. The flat servers create free space in the computing center and can simply be supplemented with additional server "slices" when needed.
Nice to get a head start on what we'll be cloning next year
That's the bitter truth. But I have to admit it isn't the worst way. Although I am an avid LaTeX user, I have to admit that I really like the user interface functionality of Office XP. However, the admiration stops when it comes to text editing in Word itself.. :(
I think his reference to code size is quite interesting, considering he works for Transmeta. The Crusoe translates x86 code to its own VLIW code during program execution. Since VLIW architectures usually use very redundant instruction encodings, the code density of native Crusoe code is most likely abysmal. (Who knows.. it's not public.) Therefore the Crusoe needs more bandwidth for code fetch than other CPUs.
The question is, exactly where does big code size hurt? Disk space? Certainly not, you can still decompress it at load time. Main memory? Unlikely, the bigger part of memory is still data. Instruction fetch bandwidth and L1 cache clogging? I guess so.. but that's exactly where the Crusoe fails, too. So what about his obsession with code size?
What on earth would you do with the low-level code for a 2600 game?
Funnily enough, you can find the source for many games on the web, reverse engineered by enthusiasts. There is still a lively scene of hobbyist developers hacking games for the VCS 2600.
Last I heard of FPGAs they were being touted in the same realm of likeliness as FMD or MRAM... I had no idea they are out there and working in reality already...
This was around 1985, right? However, the reference to MRAM is interesting, since nobody thought about this in 1985. (GMR was only discovered in 1988.)
But still, wouldn't it be cheaper and easier to use something a bit more modern to emulate that Commodore rather than use x amount of energy to run a 20MHz CPU and all the disk drives etc that normally attend a PC?
is a fairly common design project for undergraduate engineering students in control theory.
Yes, it's known as the "inverted pendulum problem" and is routinely modelled and solved in control systems classes.
.. maybe the interesting thing about the LegWay is that the author managed to build it without knowing proper control theory? The "controller" looks like a simple two-point (bang-bang) controller. A PID controller with properly tuned parameters would probably have improved the characteristics a lot.
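To show the difference between the two controller types, here is a minimal Python sketch of a two-point controller next to a textbook discrete PID. The gains, the time step and all names are placeholder assumptions; this is not the LegWay's code, and real gains would have to be tuned on the robot.

```python
class PID:
    """Textbook discrete PID: u = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first sample

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def two_point(error: float, u_max: float) -> float:
    """Two-point (bang-bang) controller: full power one way or the other."""
    return u_max if error > 0 else -u_max


if __name__ == "__main__":
    pid = PID(kp=2.0, ki=1.0, kd=0.5)  # placeholder gains
    for tilt_error in (1.0, 0.5, 0.1):
        print(two_point(tilt_error, 1.0), pid.update(tilt_error, dt=0.1))
```

The two-point output never shrinks as the error gets small, which is why such a controller tends to oscillate around the setpoint, while the PID output scales with the error and its history.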
First of all, the parameter you are speaking of is not the "gate height", but the gate oxide thickness. Dry oxidation allows very thin gate oxides, even thinner than what is used today. Manufacturing these oxides is a comparatively easy problem; however, decreasing the oxide thickness increases the amount of current tunneling through the gate. This is going to be quite a problem at 65nm and below.
To circumvent these problems there are a multitude of options under investigation, like high-k gate insulators, FinFETs and more..
The point is: the TGV once reached a maximum speed of more than 500km/h with a specially designed trainset on special rails, while 400km/h is the usual travelling speed for the Transrapid. I see quite a difference there. The TGV does not come close to 400km/h, let alone 500km/h, in everyday travel..
Eh, but I bet there are very few people in the world (outside Intel Corp) that know the IA32 instruction set better than Transmeta's favorite poster boy, Linus.
Well, I doubt Linus knows that much about IA32. After all, Linux is coded in C. However, Transmeta has Christian Ludloff of www.sandpile.org fame working for them. When it comes to the IA32 ISA, he is definitely the guy.
Artificial Diamonds ? I was not aware of the connection to "space technology". Most efforts looked rather earth-bound to me like the works of the FHG CVD Diamond group: IAF.
It depends on what you're trying to do. An awful lot of supercomputer sites *are* solving, more or less, very large matrices. In that case it means everything.
Just think about particle simulations. These are quite common in computational physics and require a lot of interconnect bandwidth. Of course, there is no benefit for applications like parallel raytracing..
Measuring MFlops does not mean a lot - even if it is from a "real life" benchmark. The TOP500 might look much worse for Linux clusters if more communication-latency-dependent benchmarks were used. Linpack, which works mainly on very large matrices, shifts the benchmark results a lot towards Linux cluster solutions.
A real supercomputer supports much faster I/O, higher interconnection bandwidth and lower interconnection latency.
And btw, the new Cray X1 delivers the performance of all but the largest Linux clusters in a single cabinet (820 GFlops peak, that is..). In terms of computing efficiency it makes even the Earth Simulator look pale. I am really looking forward to the next iteration of the TOP500, when the first X1 machines are included.
Nice one, thank you very much.
But you really should learn how to control the flash and white balance of your camera. Hey, and does the lady have a face ?
It is, because we can!
I smell vaporware, badly. Enjoy with care.