Sun Kills Rock CPU, Says NYT Report
BBCWatcher writes "Despite Oracle CEO Larry Ellison's recent statement that his company will continue Sun's hardware business, it won't be with Sun processors (and associated engineering jobs). The New York Times reports that Sun has canceled its long-delayed Rock processor, the next generation SPARC CPU. Instead, the Times says Sun/Oracle will have to rely on Fujitsu for SPARCs (and Intel otherwise). Unfortunately Fujitsu is decreasing its R&D budget and is unprofitable at present. Sun's cancellation of Rock comes just after Intel announced yet another delay for Tukwila, the next generation Itanium, now pushed to 2010. HP is the sole major Itanium vendor. Primary beneficiaries of this CPU turmoil: IBM and Intel's Nehalem X86 CPU business."
Well there's IBM. And they don't seem to be slowing down:
POWER 6
POWER 7
also:
http://www.theregister.co.uk/2008/07/11/ibm_power7_ncsa/
POWER 7 sounds like crazy town...
FUNK!
I decided to post anon as I worked at Sun during the tail end of Cheetah and the beginning of Rock.
Rock (aka Turd Rock from the Sun) was not the first turd from Sun. The last one was USIII (Cheetah). What happened there is that it got delayed and by then the L2 cache it had been designed for was not sufficiently larger than the competition's (I think the original idea was 1 or 2 MB configs), so the option was added to add really big L2 caches. One of the pie in the sky ideas early on was putting the L2 tags on the die for speed. So by then there was no room for more tags. You ended-up having a 512 byte L2 cache line size if I recall correctly if you had 8MB of L2 cache. Plus since when it was designed they addressed the problem of waiting around for a cache line to fill by making a special purpose wide fast bus for it they did not have much sectoring. There was either no sectors or only two, I cannot remember (by USIIIi all this broken L2 cache desing was rectified so I am fuzzy on when what was when). So say there were two. What would happen on a cache miss is that the 256 byte sector that needed would fill. when it was done, the instruction stream would continue (no amount of reordering would prevent a pipeline stall for filling 256 bytes) and the other sector would start filling. Now imagine that cache miss was for data. How often do you look at data structures that are 512 bytes big (common random access case)? Turns-out 64 bytes is a good real world figure that is ideal 95% of the time. Just think about how much memory bandwidth and time is being wasted. Now imagine that cache miss was for an instruction. 512 bytes is 16 instructions. Again in 95% of code there is a branch in less than 16 instructions.
So you might think how can something like this happen. The reason is that the the hardware people were their own kingdom, and the US people a fiefdom within. They #1 did not think like software engineers and came-up with pie in the sky ideas (like that L2 cache) which led to delays (another thing they could have done is made L1 caches that were physically tagged, but that is okay Sun engineers had been dealing with coloring for years already) and #2 did not simulate early on enough. When they did run simulations they had everything already worked-out on paper for up to 2MB L2 and things were good. Then they just did tweaks and did not run simulations again until much too late. The simulations showed that for almost all cases USIII was slower with 8MB L2 cache than with 2MB, think about that.
Rock was more of the same. In fact the simulation was done even later. The pie in the sky idea was the leap frogging prefetcher(they called it a hw scout). When they ran simulations after doing a bunch of work on it, they saw that the way typical code branched it was not all that good for the added memory bandwidth consumption. So they added a few tweaks to that, but it was hopeless. So they needed something else to make the chip worthwhile, transactional memory. Did they do it ala PPC et all with reservations on cache line boundaries, no they came up with a scheme with two new instructions and a status register. You did a chkpt instruction with a pc relative fail addr to jump to in case something was not guaranteed to be atomic. At the end you did the commit instruction. If something got in the way before everything got out the write buffer, you would arrive at the fail addr where you could check the cps register for info and nothing was committed. Can anyone else see how difficult this would be to get right? They were hardware guys and they did not see how hard of a problem it was? In fact the implementation they had had conditions like if an interrupt occurred or if you did a divide instruction you would end-up at the fail addr (yes if the other core on the die did it as well). My hunch is that the complexities of this transactional memory scheme is what delayed Rock for more than 2 years.
Another example was Jaguar USIV. For that one they decided that they could have less frequent pipe line stalls i