Okay. You gotta separate what the processor is capable of from what the compiler is capable of.
Camp 1: traditional branch prediction. x86 uses a branch predictor (plus a BTB to remember the target) to guess whether the branch is taken, then fetches from the guessed path and starts executing it.
Camp 2: EPIC style - execute both sides of the branch at the same time. Throw away the one that doesn't pan out.
I'll get to the compiler optimizations in a moment... but you can see how EPIC would improve single thread performance a ton, right? I mean, you'd always have the correct side already executed by the time you got the official result of your branch. Of course, the major downside is that you waste a ton of hardware doing this, but hey, you'll improve single thread performance!
The other bad thing about this is that there are only so many branches your hardware can do this with - which is why Itanium still has a BTB. Since roughly 1 in every 7 instructions is a branch, imagine yourself walking down your code path, and all of a sudden you're trying to pre-execute not 2 but 3, 4, 8 paths. You need more hardware support, which is limited, so at some point you have to give up and just return to standard branch prediction methods.
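A back-of-the-envelope sketch of why following both sides of every branch blows up. It assumes the 1-branch-per-7-instructions figure from above and that no branch resolves while the instructions are in flight; the function names and window sizes are made up for illustration.

```python
def unresolved_branches(instructions_in_flight: int, instrs_per_branch: int = 7) -> int:
    """Branches encountered in the window, assuming none has resolved yet."""
    return instructions_in_flight // instrs_per_branch


def live_paths(branches: int) -> int:
    """Paths to execute if the hardware follows both sides of every unresolved branch."""
    return 2 ** branches


for window in (7, 14, 21, 28, 56):
    n = unresolved_branches(window)
    print(f"{window:3d} instructions in flight -> {n} branches -> {live_paths(n)} paths")
```

Every extra unresolved branch doubles the work, so real hardware has to cap this somewhere and fall back on prediction.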
Okay, now compiler time. Compilers can take profiling data and knowledge of your processor and rearrange your basic blocks to optimize performance. Example: assume that if there's no history for a branch, the Pentiums will predict that the branch is not taken.
Er, wish I could draw here, but assume that basic block 1 (BB1) has a branch at the end of it. If it's not taken, it'll go to BB2; if it is, it'll go to BB3. In your executable the code will be arranged -> BB1, BB2, BB3. Assume the profiling information says the branch will be taken most of the time, hence executing BB3. Well, this is bad for the Pentium, since its branch prediction will assume that BB2 comes next, because the default is to assume the branch isn't taken. So the compiler can go back and reorder the basic blocks, and your code can look like this: BB1, BB3, BB2, so the default is to fall through to BB3. You have to change the branch (invert its condition) and add some jumps at the end of some of the basic blocks, but it will be faster.
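Since I can't draw, here's a tiny sketch of the BB1/BB2/BB3 example. It assumes a purely static "predict fall-through" guess, as above, and a made-up profile where the branch goes to BB3 nine times out of ten.

```python
def mispredicts(outcomes, fallthrough):
    """Count wrong guesses when the processor always assumes fall-through.

    outcomes: blocks actually executed after BB1, one entry per run.
    fallthrough: the block laid out immediately after BB1.
    """
    return sum(1 for nxt in outcomes if nxt != fallthrough)


# Hypothetical profile: BB3 follows BB1 nine times, BB2 once.
profile = ["BB3"] * 9 + ["BB2"]

original = ["BB1", "BB2", "BB3"]   # code order as written
reordered = ["BB1", "BB3", "BB2"]  # layout after profile-guided reordering

for layout in (original, reordered):
    print(layout, "->", mispredicts(profile, layout[1]), "mispredicts out of", len(profile))
```

With this profile the original layout guesses wrong 9 times out of 10 and the reordered one only once.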
Incidentally, I don't think modern compilers insert NOPs in code to cover stalls. I think the processor bubbles the pipeline itself. If compilers were to insert NOPs all over the place, your binaries would be pretty massive.
The Pentium 3/4 were both designed to be inefficient.
I think it might be more accurate to say that they were optimized for single thread performance. Well, the P4 was really optimized for clock speed. :P
And you're misrepresenting what the chip is doing. It would take its best guess at a branch and execute that code. That code would pretty much go through the same logic as normal instructions would, except that those instructions would get tagged with some flag that said, "don't commit these results until the branch outcome has been determined," and if it turned out to be the wrong path, the chip would throw out that work. But I think putting a 75% failure rate on that is a bit high. From what I remember, branch prediction has always been pretty successful, on the order of 95% to (now) maybe 98% accurate.
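To give a feel for why accuracy that high is plausible, here's a sketch of the textbook 2-bit saturating counter. This is an assumption for illustration, not a claim about what any particular Pentium implements, and the loop branch is hypothetical.

```python
def predict_accuracy(outcomes, state=2):
    """Accuracy of a single 2-bit saturating counter.

    States 0-1 predict not taken, states 2-3 predict taken.
    outcomes: iterable of booleans, True meaning the branch was taken.
    """
    outcomes = list(outcomes)
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move toward 3 on taken, toward 0 on not taken, saturating at the ends.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 99 times, falls out once, repeated 10 times.
loop_branch = ([True] * 99 + [False]) * 10
print(f"loop branch: {predict_accuracy(loop_branch):.1%}")

# Worst case for this scheme: a branch that alternates every time.
alternating = [True, False] * 50
print(f"alternating: {predict_accuracy(alternating):.1%}")
```

Loop-heavy code is the easy case: the counter only misses on the loop exit. Data dependent branches that flip unpredictably are where the accuracy goes.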
SMT will generally degrade the performance of each single thread that you're running. But if you take 2 threads and run them on an SMT processor, the overall throughput will be higher.
If you've stalled on a memory access, which I think is on the order of 100 clock cycles, then you might as well do something useful in those wasted clock cycles. Of course, sharing the core isn't free. On a design that switches threads on a stall, there's some cost to switching out contexts, probably about 10 cycles or so to swap out and swap back in; on true SMT both threads are in flight at once and compete for the same caches and execution units. Either way, it's that cost that decreases the performance of your single thread.
SMT was good for Netburst since clock speed was high and there was a lot of time to do other work while you're waiting for data to be fetched from memory - it doesn't necessarily have much to do with bad branch prediction, altho waiting for an instruction fetch from a mispredicted branch may or may not cause a pipeline stall depending on the code and the icache.
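A toy model of that tradeoff, using the ballpark numbers above (100-cycle memory stall, 10-cycle switch). It models the switch-on-stall flavor of multithreading, and the `work` values (cycles of useful work between stalls) are made up.

```python
def utilization_alone(work, stall=100):
    """Fraction of cycles doing useful work with one thread on the core."""
    return work / (work + stall)


def utilization_two_threads(work, stall=100, switch=10):
    """Fraction of useful cycles when two identical threads take turns.

    One round: thread A works, then the core switches to B, B works, and the
    core switches back. A can only resume once its stall has finished.
    """
    round_len = work + max(stall, switch + work + switch)
    return 2 * work / round_len


for work in (50, 200):
    alone = utilization_alone(work)
    shared = utilization_two_threads(work)
    per_thread = shared / 2
    print(f"work={work}: alone {alone:.2f}, two threads total {shared:.2f}, "
          f"each thread {per_thread:.2f}")
```

With short bursts of work the second thread is nearly free; with long bursts each thread gets slower than it would be alone, but the core as a whole still gets more done.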
hrm... why don't you just drill a little hole in the case, and reroute the SATA connector to a nice 3.5" SATA drive? That'll probably be plenty fast enough, and hey! you get to keep your ports!
Well, just so you know, cores will probably always be in powers of 2. I'm betting it'll be Core Quad next, and then they'll move away from that naming convention. It'll turn to Core 8, Core 16, Core 32. Kinda like 8 bit, 16 bit, 32 bit.
it's quite likely that devices in the far future will still be able to decode JPEG images.
Okay... well I don't doubt this statement is true. You just forgot to factor in cost.
Assume you have a perfectly preserved JPEG bitstream in 2100.
a) Will you be able to tell it's a JPEG?
b) Will you have some spec on how to decode a JPEG?
c) Assuming you have a spec, does the actual bitstream conform to the spec?
d) Are there ambiguities in the spec?
e) If there are, how do you handle these ambiguities when you recreate the actual pixels from the bitstream?
f) Are there any undocumented features/proprietary additions?
g) Again, how do you handle these additions?
Again, you can probably recreate the image, but it will take a lot of time, effort, and yes, money. I forget which NASA mission it was that recorded the entire mission in like 128k, but we lost the spec to the datastream. Could we interpret the bitstream into something useful? Sure. It would probably take a million bucks to do, and for what? Is it even worth that much? NASA didn't seem to think so. And that's only some 30 odd years later. 100 years from now, who the hell really knows what a computer system will look like? So much has changed in only 10 years.
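For what it's worth, question (a) is the cheap one today, as long as somebody still remembers the magic number. A minimal sketch (the function name is made up) that only checks for the JPEG start-of-image marker and says nothing about whether the rest of the stream is decodable:

```python
def looks_like_jpeg(data: bytes) -> bool:
    """True if the stream starts with the SOI marker (FF D8) followed by
    the FF byte that introduces the next marker."""
    return len(data) >= 3 and data[:2] == b"\xff\xd8" and data[2] == 0xFF


print(looks_like_jpeg(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # JPEG-style header
print(looks_like_jpeg(b"\x89PNG\r\n\x1a\n"))                # PNG signature
```

Questions (b) through (g) are where the time and money go.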
This thing might be trying to compete with chips like the Ultrasparc T1.
I was with you up until this point. There is no way Itanium is going to compete with the T1. They are in 2 totally separate spaces. The T1 will deal with a lot of data dependent software - DB accesses, web apps, etc. Itanium's only hope right now is highly parallelizable code that needs a lot of FP computing power. Also, the T1, I believe, is a much less complex chip - it's even in-order execution (I think). I don't think people will confuse the 2.
Agreed. The point I was trying to make was that realizing the benefits of compiler improvements requires updating your software, not replacing the processor. Obviously, recompiling the same software isn't going to be an advantage.
Ah... but you see, this is the problem. Improving compiler technology is extremely hard. Of course, the big hope in VLIW and EPIC architectures was that compiler technology would improve by some huge factor. This hasn't really panned out. Most code that we run is highly data dependent and branches way too frequently to parallelize much of anything. This is the same reason chips are moving to multiple cores now. It's hard to eke out that extra 3% single thread performance now - in the chip or in the compiler.
From your original post...
Most modern processors have to evaluate whether to insert a pipeline stall every single time that an instruction is executed. This is, essentially, wasted work because such a computation could be done by the assembler, however, it does spare the processor the burden of loading useless NOPs into the pipeline and the cache
Uh, this doesn't make any sense. Having the hardware insert bubbles for data dependencies/cache misses/etc. doesn't "burden" processors. The only burden is if you happen to load your instruction stream with a ton of useless NOPs. Now, I don't know IA-64 well, but somehow I doubt they removed all data dependency stalls - the code size explosion would be amazing. Your binaries would be huge.
Look at Itaniums performance on data dependent branches, it is underwhelming...
This is unfortunate; do you know what is limiting the chip here?
Data dependent branches - the holdup is that it's a serial stream of instructions. You can't parallelize code at all if each instruction is dependent on the instruction before it.
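A quick sketch of that point: give the machine unlimited execution units and one cycle per instruction, and the best it can do is still the length of the longest dependency chain. The instruction lists here are hypothetical.

```python
def critical_path(deps):
    """Minimum cycles to run the instructions with unlimited execution units.

    deps[i] lists the indices of earlier instructions that instruction i needs.
    Each instruction is assumed to take one cycle.
    """
    depth = []
    for needs in deps:
        depth.append(1 + max((depth[j] for j in needs), default=0))
    return max(depth, default=0)


# 8 instructions, each one needing the result of the one before it.
serial = [[]] + [[i] for i in range(7)]
# 8 instructions with no dependencies between them.
independent = [[] for _ in range(8)]

print("serial chain:", critical_path(serial), "cycles")
print("independent: ", critical_path(independent), "cycle")
```

The serial chain takes 8 cycles no matter how wide the machine is, while the independent instructions can all go in 1.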
Where, generally, does the compile-and-execute profile work improve things? Does it use the profiling output to hint the processor's branch predictor?
No, you feed the profiling information back to the compiler, which will use loop counts and branch results to unroll certain loops, spend more time software pipelining heavily used loops, and move basic blocks around to reduce branching and increase block sizes. Then you'll get faster code. Of course, it's not unheard of for Intel or AMD to make specific compiler optimizations to speed up SPEC. When I say specific, I mean very specific: if you see a unique-only-to-SPEC block of code, then compile it into the nice hand-optimized assembly. :P
I believe that retail stores look at some ratio of profit per unit of retail floor space. Obviously retail floor space is limited and you want to maximize your return on it.
It doesn't make sense to stock something that takes up a lot of floor space if 1) you don't profit much from it, or 2) the volume isn't high enough to make a decent profit.
So in a retail music store, I can see #2 coming into effect. You don't move a lot of the rarer stuff, tho it takes up the same floor space, hence the higher prices?
I could totally be off in my thinking tho.
So in terms of iTMS, floor space is not really limited, so this isn't an issue?
Dunno why you're calling grandparent an asshole post. It's real data. Plus the original discussion was about revenue, not stock prices, which are 2 separate things.
Tho also, grandparent was comparing the wrong categories. Apple's revenue grew year to year, then just had an operating loss that one year.
Itanium is dead. Alpha is dead. Sparc is dying. There will be no mobile Itanium.
Realize that all that will be left in 5 years is x86... everything else will be reduced to a niche market.
The promise of VLIW never materialized. Inherently, it was made to simplify CPU design and push off complexity into the compiler. Of course, it's really frickin' hard to make your compiler output really nice VLIW code. Itanium and other VLIW machines rely on this to run well. It's not going to happen.
Intel's advantage is that it has superior manufacturing vs AMD (plus marketing). Intel will get its advantage back once the Pentium M architecture is ported across their entire CPU line.
This is not necessarily a problem that you sit down and solve. It's a problem in which you come up with the method to solve the problem.
One day, a really advanced alien comes down to Earth and finds the 2 smartest guys on the planet, Tom and Harry. The alien tells them, "I'm going to destroy Earth, unless you guys can solve my riddle. I'm thinking of 2 numbers between 3 and 99. The sum of the 2 numbers is..." and he whispers the sum into Tom's ear. "And the product of the 2 numbers is..." and he whispers the product into Harry's ear.
"What are the 2 numbers?"
Being such smart fellas... they sat there and thought about it for a while...
Tom suddenly says, "Ah ha! You can't possibly know what the 2 numbers are!"
Harry then responds, "Ah ha! You can't possibly know what the 2 numbers are either!"
Then Tom says, "Ah ha! Now I know what the 2 numbers are!"
Then Harry says, "Ah ha! Now I know what the 2 numbers are also!"
What are the 2 numbers, and more importantly, how did they find out what they were?
Will people please please please please give up on this whole Cell-will-take-over-the-world mentality. Cell will always always be a highly specialized _graphics_ processor. It won't make a difference running general apps, and nobody's going to pick it up because IBM just decided to open up its software.
Besides, nvidia will come out with something in 1 year that will spank cell in the graphics dept.
Get over it people.
There's something called the Pentium M which runs quite nicely and isn't all that power hungry.
Plus, the clock speed war is over. We've reached the ceiling. Everyone is using performance numbers now to advertise their lineup.
Just cause most consoles have a diamond shaped button layout doesn't make it the end all be all of controller designs. And why should I have to get used to the layout? Why shouldn't the layout serve my purposes?
It's like saying, why do we need windowing systems when you can do everything from a command line? All computers have a command line. Just get used to it, average Joe user.
Gamecube has a usable, intuitive controller layout!
As someone who doesn't eat and breathe consoles all day, I find playing Xbox and Playstation to be a hard thing. I really dislike the OX-square-triangle or the XYAB button layout. There's no mapping that tells you which button X is. In a game like Halo, if I need to press Y to swap weapons, what button is Y? I have no clue. I have to actually look down at my controller. It's a distraction that really shouldn't be there.
In my short time on the Gamecube, not only would games tell you which button to press, but they'd show you the shape of the button. Very very easy to map to the controller, and no need to look away from the screen!
Mod parent up. Okay how many people would have made fun of Intel if they had come out with the Pentium 5? Quite a few I bet.
Really? 27W is a power hog? Man, you are pretty strict there.