but it's claimed to be able to create DVD movies that can be played in consumer DVD players.... my question was "how do they handle the DVD encryption of the bits they put on the media", not "how do they put the bits on the media", which is self-evident
I said: On the other hand a compiler could spend time optimizing this stuff more (maybe more inlining of those tiny calls at least).
Or in other words, it's a compiler issue. Which is pretty much exactly what he said in the first place
Umm - that's what 'on the other hand' means - I'm trying to argue both sides of the issue in order to be fair - it's one of those things one does to try and keep oneself intellectually honest. No issue (esp. not computer architecture!) is completely one-sided - you have to argue all sides to try and find a good place to stand. A really good architect has a gestalt of all the issues and some good idea of the 'sweet spot' he/she is trying to address - it's never easy :-).
My point was that there ARE some compiler related issues - BUT lots that are more language/methodology related
I think you're missing the point a bit - it takes years for silicon to come out - as I pointed out in my posting the call/ret issue you mention is a real problem - partly I suspect because the compiler writers don't take it into consideration (and in the x86 case it's probably made worse by a lack of tmp registers that end up pushing people into using the stack for PC manipulation).
The whole reason I was doing that work was so that we could discover and address exactly these sorts of issues.
The problem with branching to an unknown offset is - well - it's unknown - and you often don't know what it is until you've completed a data cache miss to get the address (that's a double pipe break in my mind). The solution can be things like branch target caches (which make a guess at the target and start a speculative stream).
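To make the branch target cache idea a bit more concrete, here is a minimal software sketch of one. It is a toy model only: the class name `BranchTargetCache`, the "guess the same target as last time" policy and the unbounded table are all assumptions made for illustration (real hardware uses small fixed-size tagged tables and cleverer policies).

```cpp
#include <cstdint>
#include <unordered_map>

// Toy model of a branch target cache (hypothetical, for illustration only).
// Assumed policy: an indirect branch is predicted to go where it went last time.
class BranchTargetCache {
public:
    // Returns true and fills `target` if there is a guess for this branch.
    bool predict(std::uint64_t branch_pc, std::uint64_t& target) const {
        auto it = last_target_.find(branch_pc);
        if (it == last_target_.end()) return false;  // never seen: no guess
        target = it->second;
        return true;
    }

    // Called once the real target is known (e.g. after the load completes).
    void update(std::uint64_t branch_pc, std::uint64_t actual_target) {
        last_target_[branch_pc] = actual_target;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> last_target_;
};
```

The point of the guess is that the front end can start fetching speculatively from the predicted target instead of sitting idle until the address load finishes; a wrong guess costs a flush, a right one hides the stall.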
Remember you're stuck with a primary architecture (x86) that is decades old - one that was optimized for compact code (at the expense of registers, because it was designed for systems with no icache and limited memory bandwidth) and that forces complex language constructs into overusing the cache/memory subsystem - of course you are going to get data/branch interactions
The other big problem in the architectures we see today is marketing - GHz is all - it's what Intel/AMD's marketing people know how to sell (it's a nice number that they can wave at their competition). It's driving their architects to longer and longer pipes to get the clock rates up... and making the CPI worse into the bargain. I'm sure it has a lot to do with why P4 is such a dog - I bet when their engineers are faced with the question "should we make virtual method calls run faster or shall we make it easier to up the clock speed" the clock speed wins every time - no one's saying let's make the pipes shorter (and clocks slower) so the actual programs will run faster
nope - virtual method calls are a language feature, and calling lots of little subroutines is a coding style encouraged by OO devotees.
On the other hand a compiler could spend time optimizing this stuff more (maybe more inlining of those tiny calls at least).
The one thing that is a compiler issue in my posting is the poor use of the call/return paradigm (for example using a jsr to make a call, but not ret to get back means the architecture can't do a hidden [branch target predictor] stack cache to keep the pipe moving - many modern cpus, even x86s do this)
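A rough sketch of why unpaired calls and returns hurt: the toy model below mimics a hidden return-address stack. The class name `ReturnStack` and its interface are assumptions for illustration, not a description of any particular CPU.

```cpp
#include <cstdint>
#include <vector>

// Toy model of a return-address stack predictor (hypothetical, simplified).
class ReturnStack {
public:
    // A call instruction pushes the address of the instruction after it.
    void on_call(std::uint64_t return_addr) { stack_.push_back(return_addr); }

    // A ret instruction pops a prediction; returns true if it was correct.
    bool on_ret(std::uint64_t actual_target) {
        if (stack_.empty()) { ++mispredicts_; return false; }
        std::uint64_t predicted = stack_.back();
        stack_.pop_back();
        if (predicted != actual_target) { ++mispredicts_; return false; }
        return true;
    }

    int mispredicts() const { return mispredicts_; }

private:
    std::vector<std::uint64_t> stack_;
    int mispredicts_ = 0;
};
```

If generated code makes a call but gets back with a plain jump instead of a ret, `on_ret` is never invoked for that call, the stale entry stays on top, and the next genuine return pops the wrong address and is mispredicted.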
(for the record I first wrote Smalltalk code in the 70s, I regularly code in C++...)
I'm a sometimes chip designer, sometimes programmer... a while back, while working on an unnamed CPU project, I did some low level performance analysis on a number of well known programs (maybe even the browser you're using now). Basically we were taking very long instruction/data traces and then modelling them against various potential CPU pipeline/TLB/cache architectures - we were looking for things that would help guide us to a better architecture for our CPU.
I found that quite early on I could figure out which language something was coded in from the cooked numbers pretty easily - OO (ie C++) coded stuff always had a really sucky CPI (clocks per instruction - a measure of architectural efficiency that includes pipe breaks, stalls and cache misses). I spent some time looking at this (since it seemed that C++ code would probably become more common in our CPU's lifetime). Basically C++ code sucked for three reasons. It took more icache misses (because the coding style encourages lots of subroutine calls, which tend to spread over the cache more, filling the I$ quickly). It took more pipe breaks (also due to the subroutine calls and returns - it turned out that some code generators did stuff that broke CPUs' return stack caches, causing many more mispredicts). And finally virtual method dispatches (basically load a pointer, save the PC on the stack and jump to the new pointer) tended to cause double pipe stalls that couldn't be predicted well at all - even though these weren't done much they were a real killer (if you've done a bit of modern CPU architecture you learn that with long pipes you live or die on your branch predictor's hit rate - these were very bad news).
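For anyone who hasn't looked at what a virtual dispatch involves, here is a small sketch contrasting an indirect (virtual) call with a direct one. The types `Shape` and `Square` and both function names are made up for illustration; what a given compiler actually emits will vary.

```cpp
#include <vector>

// Hypothetical types, for illustration only.
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;  // dispatched through the vtable
};

struct Square final : Shape {
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
    int side;
};

// Indirect: each call loads the object's vtable pointer, loads the function
// pointer from the table, then jumps to it. The CPU has to guess the target.
int total_area_virtual(const std::vector<const Shape*>& shapes) {
    int sum = 0;
    for (const Shape* s : shapes) sum += s->area();
    return sum;
}

// Direct: the concrete type is known (and final), so the compiler is free to
// call Square::area directly or inline it, leaving no indirect branch at all.
int total_area_direct(const std::vector<Square>& squares) {
    int sum = 0;
    for (const Square& s : squares) sum += s.area();
    return sum;
}
```

Both functions compute the same result; the difference is only in how predictable the control flow is for the hardware.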
In short C++, and more generally OO, result in code and coding styles that tend to make modern CPUs run less efficiently.
Anyway - you often hear about 'efficiency of programmers' etc etc for OO - I thought I'd add a data point from the other end of the spectrum.
was manager of the documentation dept. we had layoffs - it was tough - they had her fire over half her department..... then they fired her.... (scum! but then this is the same company whose president was let go from his next job for embezzling $750k)....
actually cooling in space is a big problem - you have heat input from direct exposure from the sun and from internal sources - but you can only radiate excess heat on the side away from the sun - no cool convective air flows to take heat away - keeping electronics and batteries cool enough to operate safely is a big worry
loh-tay - but be careful, it's an espresso drink made with a lot of milk (mmmm espresso...).
But you have to be careful - Starbucks is slowly inventing their own language - you don't order a 'large' latte - you order a 'venti'. When you ask the people serving there what this means they look dumbly at you - of course it's Italian for 20 - 20 what, you might ask? 20 ounces - I guess they assume all Italians order their espresso in ounces. Rumor has it SB is opening in Italy soon - I wonder if they'll be forced to rename their drinks (or dish up 20 litres or ml, neither of which would be what people expect)
If the processor itself is dealing with thread-local state, wouldn't you include more than one prefetch queue/pipeline, and match available pipelines to working threads just like any other register set or other thread-local stuff?
Ummm... maybe, maybe not.... in an out of order, register renaming CPU like an Athlon/Pentium/etc 'pipelines' are pretty amorphous - apart from the prefetch there's basically just a bunch of instructions waiting for chances to get done - you may have even speculatively gone down both sides of a conditional branch and intend to toss some of them depending on the branch being resolved (or even speculatively guessed at the results of a load and gone down that path....). Expanding this to SMT is a pretty simple process - you just expand the size of the 'name' that you rename things to and tag loads/stores to use a particular TLB mapping.
Now ifetch (and as a result decode) is a harder problem - ports into icaches are expensive - running 4 caches with associated decoders is possible. But remember the idea here is to use existing hardware that's unused some portion of the time - not to make the whole design 4 times larger, so more likely you're going to do something like provide some back pressure to the decoder logic giving information about how many micro-ops are waiting for each thread and use that to interleave fetch and decode from various threads.
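One way to picture that back pressure is a fetch arbiter that, each cycle, fetches for whichever thread has the fewest micro-ops already waiting. The sketch below is only one possible reading of the idea; the function name `pick_fetch_thread` and the "fewest waiting wins, lowest index breaks ties" policy are assumptions, not something stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// waiting_uops[i] = number of micro-ops currently queued for thread i.
// Returns the index of the thread to fetch/decode for next: the one with the
// fewest micro-ops waiting (first such thread on a tie).
// Precondition: waiting_uops is not empty.
std::size_t pick_fetch_thread(const std::vector<int>& waiting_uops) {
    auto it = std::min_element(waiting_uops.begin(), waiting_uops.end());
    return static_cast<std::size_t>(it - waiting_uops.begin());
}
```

A thread that is stalled (say on a cache miss) piles up waiting micro-ops and so stops being picked, which lets the single shared fetch/decode path go to threads that can actually make progress.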
Now IMHO the conditions that make SMT viable are somewhat transient - they may make sense for a particular architecture one year, and maybe not the next - it depends a lot on a confluence of technologies (for example I still think the CISC to RISC transition made sense mostly because of the level of integration available at the time and the sudden speed up of ifetch bandwidth over core). Apart from the super-computer (everything's a memory access) crowd SMT may be a passing fad - not worth breaking your ISA for or creating a new one with SMT as its raison d'etre (ie add a few primitives, don't go crazy).
(note to patent lawyers - I'm "skilled in the art" I find all the above is obvious)
what you describe is a form of threaded architecture - it's not a new idea (certainly it's been in the literature for way more than 5 yrs - in other forms it was in a variety of IO processors in the 60s) - the stuff being described in these articles is a more tightly coupled sort of threading where an out-of-order CPU can use register renaming etc to implement the multiple register sets.
Having no data dependence isn't necessarily a good thing - it tends to lead to needing caches and TLBs that are twice as big, or to having the existing caches/TLBs thrash - some SMT schemes assume compilers that do things like generate speculative threads and share data and address mappings closely in order not to choke.
the problem you're trying to solve is the long latencies to main memory and the fact that the CPU is idle for long periods when it has to wait for them. Basically if you've gone to the trouble of building a cool OO (out-of-order) CPU with register renaming, scoreboards etc etc then setting it up with an extra PC and the hardware to manage an extra thread is (theoretically) relatively easy - doing it for something like an x86 with state up the wazoo is probably rather harder.
Having gone down the route of doing a paper design for an SMT I know that one of the real problems with SMT in traditionally piped CPUs (ie not out-of-order) is that with today's deep pipelining the cost of thread switches is really high - often to the point of being useless.
The alternative (SMP) is good for other reasons - you can potentially reduce the size of the synchronous clock domains on a die - design time may be lower (build one and lay out 8). The downsides have to do with memory architectures (cross bars, buses, cache paths etc)
Seems to me the one thing that their second hand book biz would cut into would be their new book biz..... either they're strangling themselves long term to get short term profits.... or they've decided that they can make more money long term if part of their business is in a market where they have more control over price (rather than having it dictated by the publishers).
In some sense they may just be starting a large for-profit lending library..... or napstering the book industry they depend on (look Ma I made a new verb!)
Reading this it seems it's just a shrink to 0.15 micron plus a 64k L2 and a speedup from 667 to 700MHz. Same core and pipe as far as I can tell. The die is presumably smaller (and thus yield will hopefully be higher), so it should be cheaper to make.
But think about low end computer costs (VERY rough numbers - just to give you an idea of relative costs - remember BOM numbers at least double by the time they reach the customer):
$60 motherboard etc
$20 case/power supply
$40 memory
$40 disk
$40 CPU
$10 kbd/mouse
$70 monitor
$20 CD drive
The CPU's only about 13% of the total cost - a cheaper CPU doesn't buy you much in the low end CPU marketplace - but a faster one does - it's hard to compete here.
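For anyone who wants to check the arithmetic, here is a small sketch using the rough figures from the list above. The function names are just for illustration, and the half-price CPU mentioned afterwards is a hypothetical, not a real part.

```cpp
#include <numeric>
#include <vector>

// Rough BOM figures from the list above, in dollars, in list order:
// motherboard, case/PSU, memory, disk, CPU, kbd/mouse, monitor, CD drive.
inline std::vector<int> rough_bom() { return {60, 20, 40, 40, 40, 10, 70, 20}; }

// Total cost of all parts.
inline int bom_total(const std::vector<int>& bom) {
    return std::accumulate(bom.begin(), bom.end(), 0);
}

// Percentage of the total BOM represented by a given dollar amount.
inline double share_percent(const std::vector<int>& bom, int dollars) {
    return 100.0 * dollars / bom_total(bom);
}
```

With these numbers the total is $300 and the $40 CPU comes to 40/300, a bit over 13%. Even a hypothetical half-price ($20) CPU would only take about 7% off the BOM, which is why a cheaper CPU alone doesn't move the system price much.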
More integration (cpu/north&south bridges/graphics together) is probably the way to go if you want to win the low price point in this market - esp for someone like VIA who already owns all the IP to do it (Cyrix CPU - VIA core logic - S3 graphics)
More likely the web-pad market (if it ever exists) is the place to go with this
Growing up in NZ (the first place in the world Santa hits on his big night out) I got used to seeing the 'santa radar' on the xmas eve news.... flying up from the SOUTH pole.... I mean we just had those news reports about open water at the North Pole if that's true then he's toast and I know for certain he dropped goodies off with my kids just last night.... so either it's the South Pole.... or a scam.... you decide
There's a definite cycle in this part of the biz - companies come and go - I think it's mainly because the product life cycles are short compared to their design times (ie the chips they depend on) - this makes designing the best chip/card a pretty hit-and-miss operation - between the time you're committed to a silicon architecture and those chips are in boxes on the shelves the whole world can change around you - IMHO (having designed graphics accelerators in the past) there's a lot of luck involved - make the right guess and you're on top, miss it by even a little and you're toast - and in this biz you don't get to screw up twice.
As an example at a previous employer many many years ago we once bet the company on a [then] new and untested chip packaging technology - it worked and we had an accelerator design that walked all over the competition for almost 2 generations - made over $100M in sales off of it - but management wouldn't spend the money to do the short term re-engineering to keep our lead and we were toast - and by the time they figured it out it was of course too late....
I think that in the long run NVDA and ATI have more to worry about from Intel than anyone else (Intel's 810 is already hurting them both) - they now own the largest pieces of silicon in a PC outside of Intel's control - luckily for them Intel has already been burned by trying to go the graphics route and may be somewhat reluctant (just talk to the C&T people who were absorbed by the iBorg...)
ATI's still the biggest graphics company - but more the slow turtle than the nvda rabbit.....
More importantly - it's in our interest that there be multiple competing vendors - that means better drivers, faster and cheaper cards - so spread your money around - don't just buy from one manufacturer
It requires that you access the 'URL' with a 'keypad' (not a mouse).... and by my reading it could also cover FTP, and other remote file access protocols - in 1976 it was already dated (and I suspect suffers badly from the patent lawyer's attempt to try and write as wide a patent as possible)
So the superdrive lets YOU create DVDs .... anyone know how they do that? are some of the DVD 'secrets' compromised by this? :-)
2-3 years
(the KDE browser) is that it often shows web bugs (like the one at the top of every slashdot page ...)
it would just imply it had FEET ... (as opposed to meters ...)
The first online mass murders! (or is it mudderers?)
Amsat - the amateur satellite corp. - kind of an open-source satellite corp. run by radio hams - check out www.amsat.org
Specify .doc AS the standard .... and start a standardization process on it .... take it out of M$'s hands so that it becomes a non-issue
many use speech synthesis tools to read screen contents .... I smell an ADA suit ....