Yonah is a 32-bit only chip. Driving more wires/pins/etc in 64-bit mode internally and externally burns more power. I doubt that the K8 core shuts off the upper 32 bits of various logic/flip flop/RAM/CAM structures while in 32-bit mode...if anyone has real information on this, that would be appreciated, but that's probably not public knowledge.
After thinking about this one for a bit, it's the static power leakage out of those structures that would burn the power in 32-bit mode...
Yonah is built on a 65nm process. Shrinking transistors from 90nm to 65nm gains you significant power reductions and performance increases.
Yonah just came out. Since AMD started being competitive with Intel, the performance crown (and now the performance/watt crown) has flipped between these two companies every time they release a new chip.
The real question is, does AMD have anything up its sleeve to match Yonah (and they better have it soon), or will Intel regain its dominance?
The "problem" with Multiscalar is that it requires compiler support to partition the program (AFAIK). It's not really a problem, I suppose, because Multiscalar is an academic project. But for the millions of existing programs out there, a compiler-driven thread-level speculation (TLS) system isn't going to buy you anything in terms of single-thread performance unless they are recompiled.
There are other academic projects that are attempting to do TLS dynamically, in hardware. PolyFlow at Illinois is one, Dynamic Multithreading (mentioned elsewhere in this story) is another, I'm sure there are others.
And spill code is at least twice as slow as using the extra registers for locals. Go read up on compilers and register allocation.
That said, other factors probably offset the additional registers (more bytes per instruction => less effective use of icache). Having more registers probably turns out to be a wash, or at best a very modest performance gain.
x86 processors since the Pentium and the Am586 have more registers than they expose, and when you perform a context switch, they can swap in the other registers, meaning they can cut the time of a context switch down a great deal.
Register renaming has nothing to do with context switches. The "invisible" registers are used to remove false dependencies in the instruction stream to increase Instruction-Level Parallelism (ILP) within a single thread. In fact, on a context switch, the architectural state exactly matches the physical state (no "invisible" registers are in use), and so the processor doesn't have to save any extra registers other than the architecturally-visible ones.
The details (skip if you're not interested):
loop:
movl (%ebx), %ecx
# Do something complicated with ECX
addl $4, %ebx
cmpl $100, %ebx
jl loop
In the above assembly, the instructions are dependent upon one another: you can't execute the addl until after the movl has read EBX, because the addl overwrites EBX. You can't start executing the next iteration of the loop until the current iteration is finished, because the movl at the top of the loop overwrites ECX. These restrictions only arise because you are reusing the registers EBX and ECX. If you could somehow use different "copies" of these registers, you could execute multiple iterations of the loop in parallel, and execute instructions inside the loop out of order.
Inside the processor, the instruction stream may be seen like this:
%r0 <- (%r1)
# Do something complicated with r0
%r2 <- %r1 + 4
cmpl $100, %r2
jl loop
%r3 <- (%r2)
# Do something complicated with r3
%r4 <- %r2 + 4
cmpl $100, %r4
jl loop
...
The processor has removed all false dependencies by using its internal, non-visible registers to remap each loop iteration's "instances" of EBX and ECX to different physical registers. This enables out-of-order execution: since the next "copy" of EBX has been renamed to be a different physical register (r2) than the original value of EBX (r1), the processor can execute the addl instruction LONG before it executes the "Do something complicated" portion of the loop.
This then allows the processor to execute multiple iterations of the loop in parallel (with branch speculation and recovery) by performing the addl instruction very soon after the loop begins, which will allow further iterations of the loop to run by calculating the "next" value of EBX. The processor has effectively performed loop unrolling in hardware.
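For anyone who wants to play with the idea, here is a rough sketch of the renaming step in Python. It is a toy model, not how any real core implements it: the `rename` function, the tuple encoding of instructions, and the endless supply of physical registers are all assumptions made for illustration, and the physical register numbers it hands out won't match the ones in my hand-written listing above.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Map architectural registers onto fresh physical registers.

    Each instruction is (dest, sources): dest is an architectural register
    name (or None), sources is a tuple of architectural register names.
    Hypothetical encoding, just enough to model the loop in the post.
    """
    fresh = count()  # assumption: physical registers never run out
    mapping = {r: f"r{next(fresh)}" for r in arch_regs}  # arch -> physical
    renamed = []
    for dest, sources in instructions:
        # Sources are looked up first, so they see the latest earlier write.
        phys_sources = tuple(mapping[s] for s in sources)
        phys_dest = None
        if dest is not None:
            # Every write gets a brand-new physical register, which is
            # what removes the false (name-only) dependencies.
            phys_dest = f"r{next(fresh)}"
            mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed

# One iteration: load through EBX into ECX, then bump EBX by 4.
iteration = [("ecx", ("ebx",)), ("ebx", ("ebx",))]
renamed = rename(iteration * 2, ["ebx", "ecx"])
for dest, sources in renamed:
    print(dest, "<-", sources)
```

The only true dependence left between the two iterations is that the second load reads the register written by the first addl. Everything else writes to its own register, so nothing has to wait just because a name was reused.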
Here's the real problem: I'll bet that Arstechnica pays Slashdot to have their article linked from the main page. Doing what you suggest would remove the article summary from the main page, and Slashdot would lose a revenue source.
I think the grandparent was referring to technical expertise necessary for building the foundations of a technical business, not "business skills" that can be acquired with experience in the business world. I'm not sure what sort of business(es) you're involved in, but the fields that interest many people basically require attending a university (or higher) in order to learn the concepts and practice applying them. Such fields include most math, science, engineering, and medical disciplines.
Now, it's entirely possible that someone who is extremely motivated would be able to read textbooks, read conference papers and journals, and set up experiments/projects on their own (given some modest amount of funding), learning the field on their own without university support. However, the university system seems to be a good way to have all these tools at your disposal much more easily, not to mention having support from professors and other students who are interested in similar things.
If the end goal is to make money AND you are interested in entrepreneurship, doing what you've done may be the right way to go. If your goals are to work in one of the fields I've mentioned above (because you're interested in one of them), going to a university seems to be the best option.
You missed the entire point of my argument, which, after reading it again, wasn't clear enough:
The Cell fundamentally requires program transformations to be performed by a compiler to make use of most of the chip. The only other CPU that comes somewhat close to that is Itanium.
Now, we can debate just how much performance loss is seen with unoptimized code on dynamically-scheduled out-of-order superscalars, and you have a point there: it can be significant. But not as significant as only using 1/8th (or 1/nth, where n is the number of processing elements) of a chip.
There are two kinds of performance we're talking about here: baseline CPU performance, and that few extra percent of performance you can get by doing some fancy compiler tricks.
The GP was contending that if a fancy compiler is required to achieve good baseline CPU performance (i.e., using all the SPEs on the Cell concurrently), the architecture in question won't be as successful as an architecture that can get good baseline CPU performance without special optimizations.
In modern CPUs, the out-of-order instruction window is what allows independent instructions to execute when their operands are ready, regardless of the schedule the compiler lays down in the binary. Sure, if you put a load and a use of that load right next to each other, the use is going to have to wait. But meanwhile, other instructions from earlier/later in the stream can execute. Dependencies are resolved on the fly via register renaming and memory disambiguation hardware.
On the other hand, Cell needs a compiler to figure out where the dependencies are and aren't so it can schedule code to execute independently on different SPEs. Today's compilers could produce code that would execute well on one SPE, but all the rest of them would sit unused. This sort of "optimization" (I wouldn't even call it that, I would call it program transformation) is difficult to do.
I realize the days of turning high-level languages into a fixed instruction sequence are long gone, but today's CPUs would get within, oh, say 80-85% of their optimal performance if a compiler did do that. The Cell, on the other hand, would see a slowdown of factors of 4 or 5 (vs. using all the SPEs) without using a parallelizing compiler or writing code in a completely different programming paradigm.
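To put rough numbers on the comparison, here is a back-of-the-envelope sketch. The `fraction_of_peak` helper is made up for this post, and it assumes peak throughput scales linearly with the number of processing elements in use, which is optimistic. The 0.80 figure is the low end of my own guess above, not a measurement.

```python
def fraction_of_peak(units_used, total_units, per_unit_efficiency=1.0):
    """Fraction of the chip's peak reached, assuming linear scaling (an assumption)."""
    return units_used / total_units * per_unit_efficiency

# Out-of-order superscalar running naively scheduled code: one core,
# at the low end of the 80-85% guess.
ooo_naive = fraction_of_peak(1, 1, per_unit_efficiency=0.80)

# Cell with code that only ever runs on one of n = 8 processing elements.
cell_one_element = fraction_of_peak(1, 8)

print(ooo_naive, cell_one_element)
```

Under those assumptions the single-element case sits at 1/8th of peak, far below what the out-of-order core gets from the same naive code.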
The blanket "Incorrect" in your response is itself incorrect unless Yonah is, in fact, NOT a 65nm chip and does not see ANY performance gains/power reduction (vs. a 90nm chip) due solely to transistor sizing. Furthermore, Athlon64's pipeline is similar in length and is clocked at similar frequencies, so neither chip has an advantage/disadvantage there. Your "shorter" is almost certainly comparing Yonah to Prescott's 31 stage pipeline; that comparison is irrelevant in this discussion.
Yonah being able to turn off unused sections of the chip to save power is certainly a significant feature.
The dual-core Yonah consumes less power at 100% usage than the Athlon64 3800+ X2 does when idle while competing with it performance-wise. A low-power mobile Intel chip competing with a high-end desktop chip from AMD. Good design, indeed.
The dual-core Intel chip realizes gains in performance and reductions in power consumption due to transistor sizing alone. Its 65nm process just plain beats a 90nm one. I wouldn't make any comparisons or conjectures about the quality of the architecture and design for each chip because of this difference in process.
Many of the cards with open-source drivers have a binary-only firmware component that gets loaded onto the card at module load time (Intel for sure has this, I think others do as well). This firmware component, which is in a binary format known only to the vendor and presumably runs on a proprietary microcontroller, is what manages things like tx/rx power and frequency.
Turning your wireless card into a police or airport radio jammer would certainly involve changing the frequency band and tx power to illegal ranges, something the firmware won't let you do. And, reverse engineering the firmware so that the entire chip can be user-controlled is probably impossible.
Yonah is produced on a 65nm process. That automatically buys it gains in speed and reductions in power compared to the 90nm-produced X2.
I also don't buy your claim that AMD has nothing in its roadmap (I'm using the term "roadmap" to mean the next ~3 years) that is competitive with Yonah or post-Yonah. Do you have a link to such a roadmap? Mobile chips are AMD's weak link; something tells me that a few people in AMD also know this and are currently working on new mobile stuff.
Here's a large problem with Itanium: It is an in-order architecture.
This means anytime it misses in L1, the entire machine stalls waiting for the data to come back from L2/L3/memory. This is fine for applications where the compiler can figure out all the data dependences and schedule the code to hide these cache misses (e.g. scientific applications). It is not good for your run-of-the-mill GUI programs like Word, Firefox, your favorite email reader, etc. Out-of-order architectures like Pentium Pro/II/III/4 and Athlon hide L1 misses a LOT better because other (independent) instructions can execute while the cache miss is going on.
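Here is a toy cycle count that shows the difference, sketched in Python. Both models are deliberately crude and are my own assumptions: the in-order one stalls completely on every miss as described above, and the out-of-order one is an idealized dataflow machine with an unlimited window and issue width. The 200-cycle miss penalty is made up for illustration.

```python
def in_order_cycles(program):
    """Stall-on-miss model: nothing starts until the previous instruction completes."""
    return sum(latency for latency, _deps in program)

def out_of_order_cycles(program):
    """Idealized dataflow model: an instruction starts once its inputs are ready."""
    finish = []
    for latency, deps in program:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    return max(finish, default=0)

MISS = 200  # assumed L1-miss penalty in cycles, purely illustrative

# Each entry is (latency, indices of the earlier instructions it depends on).
program = [
    (MISS, []),  # 0: load that misses in L1
    (1, [0]),    # 1: uses the value loaded by 0
    (MISS, []),  # 2: independent load that also misses
    (1, [2]),    # 3: uses the value loaded by 2
    (1, []),     # 4: unrelated ALU work
]

print(in_order_cycles(program), out_of_order_cycles(program))
```

In this model the out-of-order machine overlaps the two misses and the unrelated work, while the stalling machine pays for each miss in full. Real cores sit somewhere in between because windows and issue widths are finite.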
A few points brought up in the article that I'll respond to:
- Predication - Predication (conversion of if/else code with branches to branchless straight-line code using predicated instructions) is not limited to EPIC/Itanium architectures. Conditional movs (cmov) in x86/AMD64/EM64T are a watered-down version, but they suffice for a lot of simple situations such as the one the article brings up.
- Instruction Level Parallelism (ILP) - Sure, the Itanium can decode/execute/retire up to 6 instructions per clock. That's dependent on two things: a) the compiler finding 6 independent instructions to schedule every clock, b) no L1 cache misses occurring (remember, Itanium is in-order, cache miss = stall).
- ILP is dead anyway - CPU cores are much faster than memory. Any time you have to go to main memory for something, you take a HUGE hit in performance. Who cares if your CPU core executes 100,000 instructions in 0.00001 ns if it takes 100,000 cycles to bring a cache line in from memory? Memory bottlenecks are starting to dominate CPU performance (see this paper for more info), so single-thread performance is going to be dominated by how well the cores mitigate cache misses. Out-of-order cores can do this well (it's getting harder, read the paper), but it's difficult for in-order cores.
- Thread Level Parallelism (TLP) - Any benefits of TLP stated in the article will apply to dual-core out-of-order processors in the same way they will apply to Itanium processors.
- Power - Intel just came out with their dual-core mobile stuff. AMD will sometime before the summer. The article claims that performance per watt is superior for Itanium; that may have been true a year ago, but it's about to not be true.
- Floating point performance - Itanium is the fastest FP chip on the planet. However, a lot of consumer apps aren't floating point-intensive, they're non-FP apps like Word, Firefox, an email client. Performance of these apps, like I said before, is much more dependent on not having cache misses dominate performance. Plus, with SSE2/SSE3 taking over all the FP duties in the latest Athlon64/Xeon/P4s, and Intel and AMD concentrating their efforts on improving those functional units, I bet consumer-level FP performance goes up.
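On the predication point, here is a small Python sketch of the if/else-to-straight-line conversion. Python obviously doesn't emit cmov, so this only models the data flow: both candidate values are computed and a mask picks one, with no branch on the condition. The `select` and `branchless_max` helpers are hypothetical names, and using max as the example is my assumption, not necessarily the case the article had in mind.

```python
def select(cond, if_true, if_false):
    """Pick one of two integers without branching on cond (cmov-style data flow)."""
    mask = -int(bool(cond))  # all ones (-1) when cond holds, 0 otherwise
    return (if_true & mask) | (if_false & ~mask)

def branchy_max(a, b):
    # The ordinary if/else version, for comparison.
    if a > b:
        return a
    return b

def branchless_max(a, b):
    # Same result, but the comparison only feeds the mask.
    return select(a > b, a, b)

print(branchless_max(3, 7), branchless_max(-5, -9))
```

The trade-off is the usual one: the predicated form always does the work for both sides, which is fine for something this small and a loss when either side is expensive.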
Now, one predicted trend for the future is for all architectures to move to simple, cheap, in-order cores, and put a lot of them on the chip to give increases in TLP without using a hugely complicated, expensive, lots-of-power-and-chip-area out-of-order core. From what I can tell, Itanium is a hugely complicated, expensive, in-order core, not exactly what we need to put 16 cores on a chip. Intel could easily resurrect the original Pentium core, retrofit SSE/SSE2/SSE3 to it, maybe add some runahead execution stuff (from that paper I linked to above) or maybe two-pass pipelining to mitigate the cache misses, and voila: a cheap, in-order core.
Oh yeah, this is all academic anyway; backwards-compatibility (x86 has it, Itanium doesn't) is probably going to be the real driving force like it has been for the past 6 years.
I've donated to Wikipedia twice in a year. At this point, I've given probably four times the amount of money that I would for, say, Encarta. I love Wikipedia, but 1) I don't have a permanent copy of it on a DVD, like I would for Encarta, and 2) I feel like I'm being "forced" to buy the latest upgrade of Wikipedia when they set up these pleas for donations, since the performance of my encyclopedia directly depends on these fund drives.
I'm all for charitable organizations and such, but Wikipedia is a little bit of a different beast. Organizations like the Red Cross can keep asking for donations continuously because that's what they do - they give the money out because there is always a need. Wikipedia always has a need too; however, it being an encyclopedia, I want a usable product after some amount of donation.
The brute force megahertz wars ended years ago; Motorola/Freescale, IBM and now Intel realize this.
It's convenient that the same technique (transistor size reduction / process scaling) that Intel used in the megahertz wars to gain that extra GHz is now what gives Intel its lower power numbers. A processor produced using a 65nm process (Yonah) runs cooler than a 90nm one (Athlon X2) by default.
Wait to make this comparison until AMD's 65nm dual-core Turion comes out.
Everyone knows those battery-life claims are total crap. After having the system for about 6 months, your battery life will shrink to the standard < 2 hours.
One argument I've heard in favor of anonymous edits on sites like Slashdot and Wikipedia is that it lowers the barrier to entry. An example: "Ah, a misspelled word. I can just edit this? Cool." versus "Ah, a misspelled word. I have to sign up to edit this? Screw it."
And then after a few minor changes, a user will move on to get an actual account and start contributing regularly.
Two titles do not equal plenty. As the person trying to refute the assertion that there are very few native games for Linux, it was your responsibility to find and name more than two.
You've only served to underline the belief (or possibly fact) that there really aren't many native Linux games.
http://www.ussg.iu.edu/hypermail/linux/kernel/9906.0/0746.html
He is, of course, referring to all the research in the '80s and '90s on microkernels and IPC-based operating systems.
http://slashdot.org/comments.pl?sid=174483&cid=14
</shameless plug> :)
http://en.wikipedia.org/wiki/Byte#Abbreviation
PIN number
ATM machine
New on the list:
IE explorer
Engineers figured this out a long time ago. TFA says it's only 10% of current systems anyway.
What iPod user records with their iPod? What iPod user even knows what Vorbis or FLAC are? Who cares?