Yeah, I wasn't talking about cache "ways". I meant the standard issue width is 4 macro-ops. Just like in AMDland, where we retire a line of [up to] 3 macro-ops at once, Intel likely does that with 4.
Tom
It's called "selling out". Everyone has their price. The trick is to recognize it and work within it.
He came back spouting the virtues of MSFT because he basically sold his values and convictions [the good kind] for a paycheque and status.
Tom
Again only partially correct. While it's true you need more address decoder bits and area [e.g. longer wires] the actual data read is a single 64-bit value from a cache bank [bits discarded if smaller]. Both Intel and AMD pipeline their LSUs because they actually have multiple steps of work.
This is how RAW [read after write] works for instance... You'd have at least two cycles:
1. present address (write buffers, L1 and L2 pick up request)
This is important because cache coherency is important. You have to make sure that you're going to read the latest and greatest copy. It may be anywhere in the CPU.
2. read data (if in write buffer or L1) or stall (if in memory or L2)
3. either done or... wait...
X. read from L2 or memory
I don't know the exact design of either off by heart but that's the gist of an LSU's job.
A larger cache does mean a larger area [longer wires] so that's quite possibly the reason for 3 cycles instead of 2. But fundamentally the LSUs of the K8 and P4 are not the same, so even if the K8 had an 8KB cache it's possible that the delay would still be 3 cycles.
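The numbered steps above can be sketched roughly like this. It's a toy Python model, not the K8 or P4 design: the `load` function, the dict-based "write buffer" and "L1", and every cycle count are made up for illustration. The only point is the lookup order, where a pending store beats a stale L1 copy.

```python
# Toy model of a load: newest copy wins (write buffer > L1 > L2/memory).
# All cycle counts below are illustrative assumptions, not measured values.
ADDRESS_CYCLE = 1      # step 1: present the address
L1_READ_CYCLES = 1     # step 2: data comes from the write buffer or L1
L2_READ_CYCLES = 20    # step X: stall and go to L2 / memory instead


def load(addr, write_buffer, l1, backing):
    """Return (value, cycles) for a read of `addr`."""
    cycles = ADDRESS_CYCLE
    # A pending store is the latest copy, so it must be checked first.
    if addr in write_buffer:
        return write_buffer[addr], cycles + L1_READ_CYCLES
    if addr in l1:
        return l1[addr], cycles + L1_READ_CYCLES
    # Miss: stall, fetch from the next level and fill the L1.
    value = backing[addr]
    l1[addr] = value
    return value, cycles + L2_READ_CYCLES


write_buffer = {0x10: 99}             # store not yet written to the cache
l1 = {0x10: 1, 0x20: 2}               # 0x10 is stale in the L1
backing = {0x10: 1, 0x20: 2, 0x30: 3}

raw_hit = load(0x10, write_buffer, l1, backing)   # forwarded from the buffer
l1_hit = load(0x20, write_buffer, l1, backing)    # plain L1 hit
miss = load(0x30, write_buffer, l1, backing)      # stalls, then fills the L1
```

Even in this toy, the address has to be shown to every place the data might live before anything can be returned, which is why the access ends up pipelined over more than one cycle.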
Tom
They are the ones who "think of our children". :-)
Largely what they were supposed to do is make sure standards of broadcast technology were adhered to. E.g. licensing spectrum, making sure TV signals are in their respective bands, etc, etc.
This whole "policing morality" bullshit is not new but it's also a lot different now than say 30 years ago. Nobody would have given two shits about a nipple showing at a Super Bowl in 1978. That it happened in 2004 is a crying shame and we must fight this injustice!!!
Tom
Stupid ACs...
Let me explain it to you in terms you can understand.
1. There are asshats in every camp
2. Stopping the show because of said asshats is stupid
3. Grow up.
Linux and the OSS scene are just fine despite the asshats who are members. If we were to stop what we were doing to concentrate on fixing the asshats we'd never get shit done.
There are those who do and there are those who blog about those who do.
Tom
On behalf of those who are helpful, you're welcome.
It's all just a cycle of karma. I help you learn a Linux distro and you do something good for someone else as a result and so on.
Tom
As the author of cryptographic open source [a double whammy] I can attest that "there are many stupid people out there". Even when you write documentation [like say hundreds of Doxygen comments and a 130 page user manual] people still ignore it and ask you anyways.
I'd say a full 30% of my support emails (of which I can get quite a few at times) are from people who are completely and utterly lazy and don't read any of the documentation. They ask questions that are specifically answered in the text and often I just cite page numbers if I'm tired.
It's the standard usenet attitude. You'll get a question [say in sci.crypt] such as "Where can I find an implementation of AES?" and the answer is "Go fucking Google for it".
That's a symptom of the problem. People are just lazy and want everything handed to them on a silver platter. The problem is that you then get dictated your platform, how it works, and what you get to do with it. Can't figure out OpenOffice? Then get stuck with the single-platform MS Word and its closed proprietary document format. Can't figure out KDE? Then get stuck with Explorer, etc.
Learning stuff often involves research. Often when a new GFX feature runs into a bug, the answer is in a mailing list archive somewhere. A quick 10 mins of googling usually finds it and you're on your way.
The problem of Linux adoption is multi-faceted.
1. Yes, many projects lack documentation.
2. Yes, lots of users just don't read it anyways.
3. Lots of people are lazy and unwilling to spend the time to learn something.
4. There are enough OSS zealots and assholes to spoil the party.
5. Lots of anti-OSS FUD [like this], such as that from MSFT about TCO.
6. General misunderstandings of OSS, like how licensing works, who you contact for support, etc.
It's too easy to just point the finger at developers but that's naive and doesn't actually answer the question.
Tom
I actually picked that URL at random, I didn't know it pointed to anything in particular.
Not having my own personal website didn't mean I couldn't put a URL in my profile :-)
Tom
While lack of sufficient documentation is a common problem [even in the commercial side] I don't think it's the main barrier.
I got into Gentoo Linux [of all the distros] with minimal "Linux" knowledge. I knew the coreutils [e.g. ls, cp, cd] and bash from Cygwin but that was about it. It took me a few tries to get Gentoo going at first but now it's a breeze. I can do an install without referring to the manual, and what's more the install works :-)
I think the main barrier is people are really apathetic to change or improvement. They're not willing to learn and furthermore learn why things are better in the OSS world. They just assume "Linux is hard" and go on their way.
I mean the Gentoo install manual explains step by step how to install it. Other distros even have point-and-click installers.
And don't forget that it's a feedback system too. The more users a project has, the more support you're likely to see. So if you think, for instance, that Firefox is too hard to use and nobody uses it, chances are it won't get support. On the other hand, it actually has millions of users, so it gets upgrades and updates often. Other OSS projects are no different really.
The trick is to not give up when you get the slightest inconvenience. Which is even odder because most people will put up with WinXP ineptitudes but give up on Linux when the first device fails to start on bootup or something.
Tom
There are assholes in every camp. I'm sure I can just as easily find Windows and MacOS snobs [well the latter is a given].
I've personally helped a half dozen people switch to Gentoo. Not all of us are meanies [though I play one on TV].
This article is pure flamebait.
Tom
Who supports users? How about the author of the damn tool?
It's called personal responsibility.
Unfortunately all too many people want the credit for writing OSS [no matter how shoddy] but don't want the actual work of supporting it. How many OSS projects are known for their stellar documentation and 24 hour turnaround e-mail support?
Not that the commercial world is any better. I mean who do I write to, to get a behaviour in MS Word changed?
Tom
Answer: No. It would cause the real L1 to have L0.delay additional cycles of delay. This is also why most CPUs don't have L3s, even if you could make them out of on-chip DRAM. So unless your hit rate for the L0 was like 99% you'd expect to lose performance.
In both the Intel and AMD cases the L1 access is pipelined which is why it's multiple cycles. Intel merely has a shorter pipe to the LSU which is why they have [often] 2 cycle caches as opposed to the 3 cycle AMD has.
Tom
[speaking in general].
You'll find that most LSUs in modern processors are really their own independent units.
Think of a processor as a program with a bunch of threads and really efficient IPC [inter-process communication]. The Load-Store Unit [LSU] is just one of many things going on.
Both Intel and AMD have hardware prefetchers which examine memory usage and make fetches to system memory to bring stuff in [to L1 or L2 depending on the design].
Tom
Snooping between cores is fast but not super efficient. It can also send out snoops to the HT bus in MP systems if neither core owns the cache line. And as far as I know [from public info] a read has to hit the SRQ before one core can see something written by the other core. The latency to the L2 is ~20 cycles or so on its own, etc, etc, etc...
As for the other comments, the ALU is already wide enough. You're right about the SSE side. At best FPU opcodes are 2 [of 4] EX cycles giving a latency of 2 cycles. That's partly because of the scheduler though as it looks for things in steps of 2 cycles.
Tom
They don't call it OpenNetscape now do they?
Tom
You're missing the point.
Any task you can sufficiently isolate to different cores is a task you can thread. Otherwise, if there is a lot of interdependence, the concept won't work. Especially if they work in the same memory space. Keeping the caches sync'ed between cores would basically kill any benefit you think you can get.
Unlike, say, the Intel designs, the AMD "dual cores" are really two distinct independent cores with their own caches living in their own worlds.
So why would AMD spend money on researching a concept which is basically doomed to failure?
Tom
Well if it's the closed project it's opened up.
If it's a clean-room implementation then it's not strictly based on it.
Call it something else like Vzeeforefree!
Dunno just annoyed at people abusing the OSS blanket for publicity.
Tom
Bosses don't care if it's open source. They care about:
1. How much does it cost to license?
2. How much does it cost to set up?
3. What does it solve any better than what we already have?
Tom
What's with "open" in the name of all these projects? Is anyone really impressed by that anymore?
Tom
The problem is that CPUs are very independent once instructions get into the decoder window. The only way to stop it is to raise an exception or interrupt (e.g. APIC signal).
So just because you may have 4 cores in your box [say dual-core 2P] doesn't mean all of the cores can act as one logically to the OS in a meaningful and efficient manner.
The striping analogy would be to dispatch instructions in round-robin fashion to all the processors. The problem with that is that the architectural state has to be shared. Keeping that in sync across current cores would kill any sort of performance gain you might hope to obtain.
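To put a rough number on the shared-state problem, here's a small Python sketch that stripes a dependent instruction sequence across cores round-robin and counts how many source registers were last written on a different core. The `cross_core_reads` helper and the five-instruction "program" are invented for illustration; no real scheduler works this way.

```python
# Count how often a round-robin-dispatched instruction needs a register
# that was last written on a different core. Purely illustrative.

def cross_core_reads(program, num_cores):
    """program: list of (dest_reg, [src_regs]). Returns (cross, total)."""
    last_writer = {}          # register name -> core that last wrote it
    cross = total = 0
    for i, (dest, srcs) in enumerate(program):
        core = i % num_cores  # round-robin "striping" of instructions
        for reg in srcs:
            if reg in last_writer:
                total += 1
                if last_writer[reg] != core:
                    cross += 1  # value lives in the other core's registers
        last_writer[dest] = core
    return cross, total


# A short dependent chain (hypothetical instruction stream).
program = [
    ("eax", []),
    ("ebx", ["eax"]),
    ("ecx", ["eax", "ebx"]),
    ("eax", ["ecx"]),
    ("edx", ["eax", "ecx"]),
]

two_cores = cross_core_reads(program, 2)   # (4, 6): most reads cross cores
one_core = cross_core_reads(program, 1)    # (0, 6): everything stays local
```

With two cores, four of the six register reads in this toy stream need a value from the other core, and each of those would have to go through whatever slow path connects the cores.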
Tom
Um, actually you're wrong. The Core [64-bit stuff coming out] processors have a 4-way instruction window, which is already 1 wider than AMD's. That means they can issue up to 4 macro-ops per cycle. So processors are already using more pipes.
There are THREE FPU pipes. Therefore it is possible to add an adder to the multiplier pipe [or vice versa], then have the decoder be aware of this and feed stuff into either pipe. So technically you don't have to change the ICU at all to support more FPU resources.
As for the ALU performance I never said make it wider. They're vastly underutilized as it is. L1 cache stalls account for quite a bit of cycles even when there is a hit.
As for threading... that's an OS issue. Doing anything on the level the CPU will recognize is not feasible. You simply cannot extract architectural state fast enough. The best way to use two cores is with SMP aware software.
I haven't heard of any AMD projects to merge cores like this and in fact the emphasis has always been on SMP and NUMA aware development practices.
Tom
The problem is once an instruction gets down to the core of the ... er core ... it's hard to get it to another core.
So you can only load balance at the process/thread level.
Tom
"armchair"... whatever. I'd say I know a bit more about the K8 design than your average slashdotter.
The point is as it stands now the K8 cannot, repeat cannot, get a register from one core to another FASTER THAN THE L1 CACHE WORKS.
Now that we got that out of the way... realize that...
IPC OF 99% OF ALL CODE is less than 1 in most cases, and why is that? Aside from register contention there is the three-cycle latency of the L1. So it's trivial to stall an entire execution unit.
So AMD would see little benefit from tying the ALUs on core 1 (which can only access the registers local to it) to core 0 since they would just go unused most of the time.
The only possible benefit is the FPU of the second core but even then it's pushing it. Getting data from one core to the other is really slow.
AMD would benefit more from just adding another FPU adder or multiplier [or both] to a single core than by adding high speed super-wide busses between cores (which in terms of processors are "far away").
Tom
For those not in the know... reading a register from core 1 and loading it in core 0 would work like this:
1. core 1 issues a store to memory [dozens if not hundreds of cycles]
2. core 0 issues a read, the XBAR realises it owns the address and the SRQ picks up the read
3. core 0 has now read a register from core 1
It would be so horribly slow that accessing the L1 data cache as a place to spill would be faster.
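As a back-of-the-envelope comparison, here's a Python sketch with made-up cycle budgets. Only the three-cycle L1 load and the "dozens if not hundreds" magnitude come from the discussion; the exact numbers and the function names are assumptions picked to show the order of magnitude.

```python
# Hypothetical cycle budgets; only their rough magnitudes come from the text.
STORE_TO_MEMORY = 100   # step 1: "dozens if not hundreds of cycles" (assumed)
REMOTE_READ = 20        # steps 2-3: read serviced via XBAR/SRQ (assumed)
L1_STORE = 1            # assumed cost to issue a local store
L1_LOAD = 3             # three-cycle L1 load latency


def cross_core_move():
    """Cycles to get a register value from core 1 into core 0."""
    return STORE_TO_MEMORY + REMOTE_READ


def local_spill_and_reload():
    """Cycles to spill a register to the local L1 and read it back."""
    return L1_STORE + L1_LOAD


slowdown = cross_core_move() / local_spill_and_reload()
```

With these guesses the cross-core path comes out about 30x slower than a local spill. Change the constants however you like; the gap stays large unless the remote path gets down to a handful of cycles.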
The IPC of most applications is less than three and often around one. So more ALU pipes is not what K8 needs. It needs more access to the L1 data cache. Currently it can handle two 64-bit reads or one 64-bit store per cycle. It takes three cycles from issue to fetched.
Most stalls are because of [in order of frequency]
1. Cache hit latency
2. Cache miss latency
3. Decoder stalls (e.g. unaligned reads or instructions which spill over a 16-byte boundary)
4. Vectorpath instruction decoding
5. Branch misprediction
AMD making the L1 cache 2 cycle instead of 3 cycle would immediately yield a nice bonus in performance: load latency drops by a third, so L1 intense code could spend up to 33% fewer cycles waiting on loads. Unfortunately it's probably not feasible with the current LSU.
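Where the "up to 33%" comes from, as a quick Python sketch. It assumes a chain of dependent loads that all hit the L1, each waiting on the previous one; `chain_cycles`, `time_saved` and the `other_work` parameter are names invented here, not anything from AMD's docs.

```python
def chain_cycles(num_loads, l1_latency, other_work=0):
    """Cycles for a chain of dependent L1 hits, each waiting on the last."""
    return num_loads * (l1_latency + other_work)


def time_saved(num_loads, old_latency, new_latency, other_work=0):
    """Fraction of cycles saved by lowering the L1 latency."""
    old = chain_cycles(num_loads, old_latency, other_work)
    new = chain_cycles(num_loads, new_latency, other_work)
    return (old - new) / old


best_case = time_saved(1000, 3, 2)               # pure pointer chasing: 1/3
mixed = time_saved(1000, 3, 2, other_work=3)     # 3 cycles of ALU work per load: 1/6
```

The one-third saving is the ceiling, reached only when nothing but L1 latency sits on the critical path. As soon as other work is mixed in per load, the gain shrinks.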
But compared to "pairing" a core, die space is better used improving the LSU, adding more pipes to the FPU, etc.
Tom
The bus between the two cores is FAR TOO SLOW for this sort of operation. Moving [say] EAX from core 0 to core 1 would take hundreds of cycles.
So if the theory is to take the three ALU pipes from core 1 and pretend they're part of core 0... it wouldn't work efficiently. Also what instruction set would this run? I mean how do we address registers on the second core?
AMD would get more bang for buck by doing other improvements such as adding more FPU pipes, adding a 2nd multiplier to the integer side, increasing L1 bandwidth, etc.
This story is pure and utter bullshit.
Tom