I run XP with cygwin. I get to use my favourite unix shells with all of the fancy completion and history buffering in windows.
I have perl for windows. I can automate tasks very easily.
And...best of all...I have Emacs for windows. I never use the mouse when I'm immersed in Emacs. My default config even turns off the menu bar.
Running all of this in Windows is far more palatable than trying to emulate Windows under Linux. This way I can run all of my Windows games and apps and have the ease of use of the unix interface. XP is stable enough for my needs as a user. It's not like I'm running a server.
And...when I really need Linux...I have it running a VNC server on another box. I simply VNC into it from my Windows XP notebook when I really need all of the bells and whistles that Linux offers.
This marketing-speak is silly. They will fetch a 256 bit VLIW word. By that logic, I guess the Itanium is a 128 bit machine since it fetches a 128 bit word. By convention, when someone says they have a 64 or 32 bit processor, they are referring to the datapath: the width of the ALUs and the number of bits used to address memory.
If a really low power processor were useful, then Intel or AMD would already have an ultra-low power product out the door to fill the market need.
Transmeta claims they can get equivalent performance at much lower power. This is a dubious claim given that their past products have fallen far short of this goal. Their customers are few and far between and the stock price has reacted accordingly.
My point was that the majority of developers don't write compilers or anything else that is dependent on knowing the low level details of the instruction set.
I would imagine that the vast majority of developers use either VC++ or gcc when programming in C/C++.
Do you program in assembly? Does anyone program in assembly anymore? Why should anyone care what the low level addressing modes are?
Modern computer architectural techniques have diminished the importance of instruction sets. Trace caches (P4) and/or pre-decoded instruction caches (Athlon) translate the "cruft" into risc ops.
Go to www.spec.org and look at what the fastest computers are. They happen to be x86 these days.
You're probably right in that most applications don't get optimized. This is likely because performance is good enough without the optimization.
But for applications that do care about performance (increasingly rare; it seems like games are all that's left), it would be easy enough to install Intel's Proton underneath Visual Studio and just recompile with the -Ox flag.
Run stuff optimized for P4 on it, Intel now has the advantage.
Given that Intel has approximately an 80% share of the market, it is likely that software developers will optimize for the common case, i.e. the Pentium 4. I expect to see newer apps perform better due to P4 optimizations.
The answer is a pretty complicated one and to explain that would require some basic knowledge that you just can't squeeze into a 30 second commercial. AMD has made noise about a marketing campaign that will educate the public, however so far it has been just that, noise.
Actually the answer isn't really complicated at all. All I have to ask myself is this...
-Does this machine-A run the set of applications that I care about faster than machine-B?
-Is the speedup of A relative to B worth the price differential?
I don't care how many megahertz, or how much QuantiSpeed nonsense, the marketroids spit out.
I tend to look at C as a shorthand for assembly code.
I may be in the minority here, but I think that programming should be taught from the bottom up. First understand computer architecture, which will explain assembly code. Then move on to C. Then move on to some higher level languages like Java. This will teach you to appreciate what's really going on in the machine and why the abstractions you are using exist.
Re:SPEC isn't a good benchmark for consumer CPU's (on "AthlonXP Released", Score: 1)
Why? Because it's all about how much the compilers can be optimized for it. Even worse, compilers highly optimized for SPEC often produce poor code for real-world applications. The fact is, very little software is optimized for SSE2 anyway, especially consumer software, which for the most part is written to the least common denominator. Without the special optimizations, Pentium 4s just don't compete well with Athlons.
Compilers are very important for processor performance. I would argue that spec is a good measure of processor performance on tomorrow's apps. This is because compilers will evolve to take advantage of the Pentium 4 architecture. Because Intel has 70-80% of the market, compilers will optimize for this common case. A lot of today's software has been optimized for a P6 type of core (PIII/PII/Athlon style architecture). This was hardly the case when the P6 architecture debuted in the Pentium Pro.
SPEC scores are a preview of what is to come for apps that are optimized for the P4 via compiler optimization. As for your comment about SSE2, that's just life. AMD has stated that they will support SSE2 in the K8. It takes a while for software to catch up to the hardware it runs on.
Well, are you going to run applications without a chipset/RAM/motherboard? This is just not realistic.
If you want to eliminate the system components from the test, then you'll have to make the benchmark run entirely out of the processor caches which is not at all realistic. This was the problem with SPEC95. A lot of the applications started fitting in the processor's caches so the benchmark became less useful. So, SPEC2000 was created with much larger datasets.
I wonder how much AMD may be "Osborning" themselves.
Unix admins cost more than MCSEs, too.
I guess it's true - you do get what you pay for.
Does that mean Microsoft is better because it costs more than Linux?
I agree with you that a good legal system is necessary for wealth generation. I've just finished reading Hernando de Soto's "The Mystery of Capital".
However, I have never seen this assertion validated by game theory. I would be extremely interested in a reference to a book or paper on this subject.
This is not a troll or criticism. I am genuinely interested in learning more.
Thanks in advance for your insight.
On FP. On the integer side, they are 2.4GHz Pentium 4s.
That all depends on how you measure "big". You could make a case either way.
MSFT is *twice* the size of IBM in terms of market capitalization.
Let's take a look at last year.
MSFT sales = 26.8B, EBITDA = 13.1B
IBM sales = 85.9B, EBITDA = 14.1B
(EBITDA = earnings before interest, taxes, depreciation and amortization)
So MSFT is massively more profitable. No wonder the market values MSFT higher. MSFT earns almost as much as IBM on less than a third of the revenue, which is really what counts at the end of the day. Not how many employees you have or products you produce, but the bottom line earnings.
Let's look at cash position.
IBM = 6.39B
MSFT = 38.2B!!!!
Literally...MSFT has so much money, they don't know what to do with it.
(Source = yahoo finance)
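For anyone who wants to check the margin comparison, here is a quick sketch of the arithmetic using only the figures quoted above (in billions of dollars). The function name is just something I made up for illustration.

```python
# Figures quoted above, in billions of dollars.
MSFT_SALES, MSFT_EBITDA = 26.8, 13.1
IBM_SALES, IBM_EBITDA = 85.9, 14.1


def ebitda_margin(sales, ebitda):
    """Fraction of each sales dollar that shows up as EBITDA."""
    return ebitda / sales


msft_margin = ebitda_margin(MSFT_SALES, MSFT_EBITDA)
ibm_margin = ebitda_margin(IBM_SALES, IBM_EBITDA)
revenue_ratio = MSFT_SALES / IBM_SALES

print(f"MSFT EBITDA margin: {msft_margin:.1%}")
print(f"IBM EBITDA margin:  {ibm_margin:.1%}")
print(f"MSFT revenue as a fraction of IBM's: {revenue_ratio:.2f}")
```

On these numbers MSFT keeps roughly half of every sales dollar as EBITDA, while IBM keeps about a sixth.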
Follow Moore's Law and assume that memory chip density will double every 18 months.
This means that the rate of growth of addressable bits is 1 bit per 18 months.
Do the arithmetic and we may see the 64 bit address limit getting hit within our lifetimes...assuming that we currently need 40 bits of addressable memory. That leaves 24 spare bits, which at 18 months per bit is about 36 years.
Memory density doubles every 18-24 months which means we need an extra address bit every 18-24 months. If 32 bits is required today, then we won't need more than 64 bits for another 48 to 64 years.
I'll be long retired by then. Your grandchildren can deal with it then.
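The back-of-the-envelope estimate in this thread can be sketched as follows. It assumes exactly one extra address bit per density doubling, which is the simplification both posts make; the function and parameter names are mine.

```python
def years_until_exhausted(bits_needed_today, address_bits, months_per_doubling):
    """Years until memory needs outgrow the address space.

    Assumes demand grows by exactly one address bit per doubling period.
    """
    spare_bits = address_bits - bits_needed_today
    return spare_bits * months_per_doubling / 12.0


# The two starting points argued above (32 and 40 bits needed today),
# at both ends of the 18-24 month doubling range.
for bits_today in (32, 40):
    for months in (18, 24):
        years = years_until_exhausted(bits_today, 64, months)
        print(f"{bits_today} bits today, doubling every {months} months: {years:.0f} years")
```

The answer swings between roughly 36 and 64 years depending on the starting point and the doubling period, which is why both sides of the argument can claim the arithmetic.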
Every time I hear this argument, I am reminded of Bill Gates saying..."640K should be enough for everybody"
Nobody can predict the future.
Thanks for your "insightful" advice.
Do you not think that with close to $30 billion a year at stake, Intel's engineers just might have considered those things and optimized them to the Nth frickin degree?
They optimize profit, and they do a damn good job of it. I would imagine that the profit equation has hundreds, if not thousands, of parameters that are wildly complex and interrelated.
Info from the Foveon FAQ on their website:
Question
When will the Sigma SD-9 be available?
Answer
Sigma will begin taking orders for the SD-9 digital SLR camera at the PMA show on Feb. 24, 2002. The company plans to begin shipping in May 2002.
Except the bandwidth out of the L3 is only 4GB/sec. P4 Rambus bandwidth is 3.2GB/sec and will be 4.2GB/sec when they increase the bus frequency to 533MHz.
Also, multimedia applications work on large datasets that stream. What does this mean? It means that you lose a lot of spatial and temporal locality where caches won't help.
1. Except that you're more likely to get a mispredict on a 10 stage pipeline than a 7 stage. Simply due to having more instructions in the pipe (Simple stats question). Not to mention the problems of CISC Latency (minor, but visible) due to Athlon using a RISC core and CISC2RISC decoder. Sure a 1.5GHz Athlon will take less of a time hit, however it's going to take more of those hits. Don't forget that the Athlon has the weakest Speculative Execution unit out of the three CPUs (G4, P4, Athlon XP), the XP is an improvement over T-Bird, but not there yet in efficiency.
The number of mispredicts is independent of the number of pipeline stages. I refer you to Mark Evers' PhD thesis from the University of Michigan. Prediction is correlated to the dynamic instruction path.
2. The Athlon is the poster child of why beefy FPUs rock, it's one of the main reasons the Athlon keeps up with the P4, despite the deficiency in clock speed. The G4 gains most of its performance here, with a huge advantage in SIMD power (200-300% over any x86 CPU, clock for clock). SIMD performance is where it's at for Multimedia, which is what these towers are aimed at. Show me a dual XP system doing MPEG-2 encoding at DVD quality in 1/2 time in software.
Most FP intensive applications will not fit in even multi-megabyte caches. I refer you to SPECfp2000. Look at how memory bandwidth tracks the scores. If Athlon's beefy FPU is so great, why does it have such weak scores compared to the P4? P4 has 3.2GB/sec bandwidth and Athlon has 2.1GB/sec, that's why. You need to feed the execution engine with operands that are NOT in the caches. Also, FP applications are *very* latency tolerant.
At the risk of sounding argumentative, how fast you can do ROTLs is a measure of how good your processor is, if doing ROTLs is what you need to do. Of course, Apple hypes this capability mercilessly, but they have some justification: SIMD (Single Instruction, Multiple Data) is, by definition, very good at situations when you want to do the same operation on masses of data -- which is exactly the case when you're dealing with large graphics or audio, the Mac's traditional strong points.
What I meant to say was that RC5 is not a good measure of general purpose computing power.
While I agree that the G4 has a strong SIMD engine, this is only useful if your memory system can supply the operands at a rate commensurate with the execution bandwidth.
Let's say I have a 1GHz G4 and I can execute 1 Altivec instruction per clock. The Altivec operands are 128 bits each. I need to read 2 operands and write 1 operand. So, I need 256 bits per clock of read bandwidth and 128 bits per clock of write bandwidth. Let's just consider read bandwidth. At 256 bits (32 bytes) per clock and 1 billion clocks per second, I need 32 Gbytes/sec of read memory bandwidth to feed this engine at the peak rate.
Unfortunately for Mac fans, they are still using a weak 133MHz SDRAM memory system which only provides ~1 Gbyte/sec of bandwidth. This is not a balanced system.
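Here is that arithmetic as a small sketch. The 1GHz clock, 128 bit operands and one instruction per clock come from the post; the 64 bit (8 byte) SDRAM bus width is my assumption for a standard PC133 DIMM, and the function names are made up.

```python
def simd_read_bandwidth_gb(clock_hz, operand_bits, reads_per_instr, instrs_per_clock=1):
    """Peak read bandwidth (Gbytes/sec) needed if every operand comes from memory."""
    bytes_per_clock = operand_bits * reads_per_instr * instrs_per_clock / 8
    return bytes_per_clock * clock_hz / 1e9


def bus_bandwidth_gb(bus_hz, bus_width_bits):
    """Peak bandwidth (Gbytes/sec) of a simple single-data-rate memory bus."""
    return bus_hz * bus_width_bits / 8 / 1e9


needed = simd_read_bandwidth_gb(clock_hz=1e9, operand_bits=128, reads_per_instr=2)
# Assumption: 133MHz SDRAM on a 64 bit wide bus.
available = bus_bandwidth_gb(bus_hz=133e6, bus_width_bits=64)

print(f"needed:    {needed:.1f} Gbytes/sec")
print(f"available: {available:.3f} Gbytes/sec")
print(f"shortfall: about {needed / available:.0f}x")
```

This is a worst case, since real code gets many operands from registers and caches, but it shows how far apart the peak execution rate and the DRAM bandwidth are.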
1) You assume that the G4 is more efficient than an Athlon when it comes to misprediction penalties.
I have seen many people make this mistake. Performance is measured in units of *time*.
Let's walk through an exercise using the numbers you have quoted:
You take a 7 cycle hit on a mispredict on an 800MHz G4. That's 8.75 nanoseconds of penalty.
You take a 10 cycle hit on a 1.5GHz Athlon. That's about 6.7 nanoseconds!!! The Athlon whips the G4 in mispredict penalty, even though it has a longer pipe.
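The conversion from cycles to time is easy to check. A minimal sketch, using the cycle counts and clock speeds quoted in this thread (the function name is mine):

```python
def mispredict_penalty_ns(penalty_cycles, clock_mhz):
    """Wall-clock cost of one branch mispredict, in nanoseconds."""
    cycle_time_ns = 1000.0 / clock_mhz  # one cycle at clock_mhz, in ns
    return penalty_cycles * cycle_time_ns


g4 = mispredict_penalty_ns(penalty_cycles=7, clock_mhz=800)
athlon = mispredict_penalty_ns(penalty_cycles=10, clock_mhz=1500)

print(f"G4 at 800MHz, 7 cycles:      {g4:.2f} ns")
print(f"Athlon at 1.5GHz, 10 cycles: {athlon:.2f} ns")
```

The longer pipe costs more cycles per mispredict, but each cycle is so much shorter that the Athlon still pays less time per mispredict.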
2) You assume that a beefy FPU and Altivec unit translates into higher performance.
Let's take this apart...
The Mac with G4 has lots of *execution* bandwidth for Altivec and FP. However...can the memory system keep up to feed the operands to this engine? I doubt it...They're still using a 133MHz single channel SDRAM memory solution with its wimpy ~1Gbyte/sec bandwidth. Their L3 SRAM cache (which is 2 Megs) only has 4Gbyte/sec bandwidth. This is pitiful considering that a 400MHz RAMBUS system gives you 3.2Gbyte/sec bandwidth to DRAM! The Northwood Pentium 4 processors that will come out with the 533MHz bus will have 4.2Gbyte/sec bandwidth to DRAM.
More than the mac's L3 cache!!!
If you look at the assembly code, you will see that the algorithm's critical path is littered with ROTL (Rotate Left) instructions. These chains of dependent instructions can also be parallelized with SIMD instructions. RC5 is not a measure of how good your processor is, it is a measure of how fast you can do ROTLs.
I believe that altivec provides a SIMD version of ROTL which is why G4s do well.
In contrast, x86 provides no MMX/SSE instructions for ROTL.
The Pentium-4 takes 4 clocks to do a ROTL. Yikes.
Athlon takes a single cycle for the ROTL.
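To show why rotates dominate, here is a rough sketch of the RC5 round structure in Python rather than assembly. It is only an illustration: the subkey table is a placeholder that the caller supplies (this is not the real key schedule), and the function names are mine.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32 bit value left by n bits (only the low 5 bits of n are used)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def rc5_rounds(a, b, subkeys):
    """RC5-32 style encryption rounds over an already-expanded subkey table.

    subkeys is a hypothetical placeholder list of 2 * rounds + 2 words.
    """
    a = (a + subkeys[0]) & MASK32
    b = (b + subkeys[1]) & MASK32
    rounds = (len(subkeys) - 2) // 2
    for i in range(1, rounds + 1):
        # Each rotate amount comes from the value just computed, so these
        # two rotates form one long chain of dependent instructions.
        a = (rotl32(a ^ b, b) + subkeys[2 * i]) & MASK32
        b = (rotl32(b ^ a, a) + subkeys[2 * i + 1]) & MASK32
    return a, b


print(hex(rotl32(0x80000001, 1)))
print(rc5_rounds(1, 0, [0, 1, 0, 0]))
```

Every round has two data-dependent rotates on the critical path, so the rotate latency is paid over and over. A key-search client can still keep a SIMD unit busy by running several independent keys side by side, one per vector lane.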