Slashdot Mirror


Intel Employees Speak Out On Rambus Debacle

Coupland writes: "A fascinating article from Electronic News Online discussing the fall-out within Intel caused by the Rambus nonsense. The troops seem to be breaking rank." This is definitely the most informative article I've seen on the Rambus / Intel relationship, and it includes a timeline that pretty much sums things up. (What it doesn't mention is the trouble that the constant cycle of delay and deny causes PC manufacturers like Dell, Gateway, etc.)

10 of 89 comments

  1. Finally. by JurriAlt137n · · Score: 4

    As far as I'm concerned this is one of the best things to have happened recently, at least from the perspective of the end user. Big corps like Intel going through this kind of trouble often rediscover some of the spirit they had when they were just starting up. Instead of being able to sit on their asses and enjoy the fact that they are the market leader, they will have to fight back, which can only result in better quality and performance for end users. It will also allow AMD to catch up even further, which might result in nicely balanced competition between the (currently) major chip-builders.

    People may get fired or quit of their own accord, and this is a bad thing for those people personally, but the fact that new people and new ideas will enter the company might bring some major improvements.

    --

    People replying to my sig annoy me. That's why I change it all the time.
    1. Re:Finally. by ToLu the Happy Furby · · Score: 3

      Fortunately for Intel, they didn't have to take any risks, since every single one of the things you mentioned was done by someone else first. Hell, the Alpha alone did all of them before Intel did. Not one of these technologies was "in its infancy" when Intel deployed them.

      The only risk Intel takes in deploying any of these technologies is the risk that Intel customers won't buy them. That's the risk every company takes when introducing a new model. While yes, it means Intel is taking risks, none of the risks Intel takes actually advance the state of the art.


      That's just because he came up with a bad list. Despite the fact that there are very few totally new ideas in the MPU industry (just as there are very few totally new software algorithms), Intel has indeed bet the farm (well, bet the product line) on some very radical design ideas, both in the past and the present.

      Some were successful, some crashed and burned. One design that was extraordinarily innovative and successful was the P6 core, introduced in 1995 with the PPro. In it, Intel managed to do "the impossible"--execute variable-length x86 code out-of-order, something that was supposed to be only possible with a fixed-length ISA and was even relatively state-of-the-art there. The way they did this was by essentially "emulating" x86 code by decoding it into internal "RISC-like" ops, which could be run OOO. While I doubt this was an entirely new idea, I'm not aware of any previous implementations of it, much less one as wildly successful as the P6.

      One design that was a horrid failure was the iAPX432, an MPU spread out over 3 chips which essentially operated in an object-oriented manner, rather than iteratively like, well, every other chip in history. Perhaps a sign of what was to follow was the fact that the 432's "assembly code" was actually built to closely model Ada, the government's ill-fated OO language. The 432 somehow managed to work, but performed a bit slower than mainstream MPUs from 5 years beforehand. Not too many sold. But there is no doubt that here Intel took a huge risk based on a very interesting idea.

      Nowadays Intel is engaged in exactly the same "risky" design behavior in an attempt to further the state of the art. The P4 contains several totally new innovations. Perhaps most prominent is the trace cache, an L1 instruction cache which instead of just dumbly storing instructions, orders them safely and unrolls loops, allowing branch- and dependency-free operation for large swaths of code. In addition, the trace cache stores those internal "RISC-like" ops, not x86 ops like a normal instruction cache; this takes the x86->"RISCop" decoder out of the critical path and should result in higher top-clock-speeds and excellent performance on small looped code which can fit in the L1 trace cache--3D engines, encryption, and FFT (i.e. audio/video encoding/decoding, voice recognition), for example. Trace caches are not a new idea; they've apparently been studied quite a bit in the literature. However, the P4 is the first commercial MPU to include one, and that's a substantial engineering innovation.

      Another innovation which is, from what I've heard, actually a totally new idea is the P4's double-pumped ALU and supporting hardware. While the idea of different pieces of hardware running at multiple speeds is of course not new, this is apparently the first time it's been worthwhile to implement it on-die in a commercial MPU. More impressive is the fact that Intel was actually able to get an ALU--one of the most studied logic circuits in history--to run at up to 4.0 GHz in current .18 um process technology. Apparently the way they did this is by implementing a new, lower-latency adding technique. This is the circuit-design equivalent of finding, for example, a faster sorting algorithm; it represents a very impressive achievement. While the double-pumped ALU will likely not have as large an effect on overall P4 performance as the trace cache, it should help out noticeably and it's definitely a radical design.

      On the other hand, we have Intel's upcoming IA-64 ISA, an attempt to move the VLIW philosophy from specialized DSP work into general-purpose computing. Again, VLIW is not a new idea, and the idea of a VLIW general-purpose MPU is not either. However, the Itanium is one of the first attempts to actually build one (Transmeta's Crusoe is the other).

      Furthermore, it represents quite a risk from a performance standpoint. The basic idea behind VLIW is to in effect take the RISC revolution one step further. While the RISC vs. CISC debate is often treated as a fair fight capable of producing one victor, the reality was quite different. (The following is essentially a synopsis of this excellent article on ArsTechnica.) Instead, each was the best ISA philosophy for the prevailing conditions at the time. CISC was the best design choice for its time--that is, up until the early 80's--and "pure RISC" the best for its time--from the mid 80's until the mid 90's.

      The main issues involved the evolution of storage capabilities and compiler technology. First, a broad comparison of what CISC and RISC actually mean: CISC refers to a category of ISAs in which a new instruction is conceived to take care of every possible situation. A (made-up) example of a CISC-like instruction is the following:

      CRAZY_OP, mem1, r1, mem2

      which does the following: load mem1 from memory, take r1 from a register on the chip, compute (mem1 - r1) / r1^2, and store that in mem2. And there actually were some CISC instructions which were nearly that crazy. The RISC philosophy, on the other hand, would break that one operation down into many--one to load mem1 to a register, one to subtract mem1-r1, one to multiply r1*r1, one to divide the two, and one to store the result, for a total of...lessee...5 instructions.
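
      A minimal sketch of the made-up CRAZY_OP example, assuming a toy machine whose memory and registers are plain Python dicts. The function names and the scratch registers t0..t3 are hypothetical, invented only for illustration; no real ISA is being modeled.

```python
# Toy model: memory and registers are plain dicts (an assumption for illustration).

def crazy_op(memory, regs, src, reg, dst):
    """CISC style: one instruction does the load, subtract, square, divide and store."""
    memory[dst] = (memory[src] - regs[reg]) / regs[reg] ** 2


def risc_sequence(memory, regs, src, reg, dst):
    """RISC style: the same work as five simple steps using scratch registers."""
    regs["t0"] = memory[src]              # 1. load mem1 into a register
    regs["t1"] = regs["t0"] - regs[reg]   # 2. subtract: mem1 - r1
    regs["t2"] = regs[reg] * regs[reg]    # 3. multiply: r1 * r1
    regs["t3"] = regs["t1"] / regs["t2"]  # 4. divide the two
    memory[dst] = regs["t3"]              # 5. store the result
    return 5                              # number of simple instructions issued


def run_both(mem1=20.0, r1=2.0):
    """Run both styles on separate copies of the toy machine and compare results."""
    cisc_mem, cisc_regs = {"mem1": mem1}, {"r1": r1}
    risc_mem, risc_regs = {"mem1": mem1}, {"r1": r1}
    crazy_op(cisc_mem, cisc_regs, "mem1", "r1", "mem2")
    count = risc_sequence(risc_mem, risc_regs, "mem1", "r1", "mem2")
    return cisc_mem["mem2"], risc_mem["mem2"], count
```

      Both paths leave the same value in mem2; the only difference the sketch exposes is one encoded instruction versus five.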

      What's the difference? Well, like I said, it came back to storage capabilities and compiler technology. Back in the 70's when CISC was the Right Thing To Do, storage was extremely expensive and thus very scarce. If chips back then had used my RISC design, such an operation would have taken 5 instructions to code; with my CISC design, it takes just one. Yes, the CISC design might need to reserve some extra bits in the opcode field in order to code for so many ridiculous instructions, but overall the compiled RISC code is going to take at least 4 times as much storage space as the CISC code. So even if you didn't expect to run into the above situation very often, it made sense to have an explicit code name for it whenever you did.

      As we hit the 80's, these storage issues rapidly eased, to the point where it wasn't such a hardship taking 4 times as much space to say the same thing every once in a while. Meanwhile, back in the CISC way of doing things, you actually needed to find some way to make your chip capable of performing all the goofy instructions that might be asked of it. In essence, it's almost like your assembly code is "compressed" to save storage space, and thus needs to be "decompressed" by the chip. This means complicated chip implementations, each trying to do more in each clock cycle--which means lower top clock speeds. The RISC chip may need more cycles to perform all 5 instructions, but since it only performs simple instructions, it can have a higher clock speed and thus come out ahead.

      But there's a problem with this too: people generally like to program in high-level languages. RISC is a low-level ISA philosophy. Thus you need to have good compilers, to be able to analyze high-level instructions and decompose them into all their composite parts for encoding in a RISC assembly language--often a more difficult process than in my example. Again, the compilers of the 70's weren't up to the task; only in the 80's did good enough compilers come along to enable this. In essence, we moved the "decompression" of a high-level instruction to its low-level constituent operations from inside the chip (CISC) to in the compiler (RISC).

      Thus, we went from CISC being a Good Thing to RISC being a Good Thing. The main issues were 1) code bloat not such a big deal and 2) move more instruction scheduling duties to the compiler.

      Since that time, we've moved from what I called "pure RISC" to what Hannibal in the article I'm summarizing calls "post-RISC". That is, people started realizing that with RISC operations being more-or-less uniform, a good way to make things go faster was to do more than one thing at a time, and that instead of sitting and waiting on a long memory access, etc., you could switch and do other stuff in the meantime. Thus we got superscalar and out-of-order execution, respectively.

      Moreover, we got deeper and deeper pipelines--sort of like assembly lines, in which each instruction goes through several stages, each 1 clock long, in its execution. This means we can clock the chip faster (less to do on each clock cycle), and get overall faster performance (think a fire brigade of 10 people each passing buckets a short distance, vs. one person running 10 times as far between buckets delivered). The problem is that, unlike buckets or trucks, code has dependencies; instruction 2 might take as its input the result of instruction 1, which is still in the pipeline--only halfway down the assembly line, as it were. Thus we need rescheduling logic to keep our pipeline stuffed--our assembly line filled--with instructions which don't depend on each other. Or, instruction 1 might be a branch instruction, which goes one way or another based on its result, so that we don't know "what comes next" until it is completely finished. Thus we use branch prediction, which uses some statistical methods to guess what comes next, and execute it accordingly, while aware that if when we get to the end of instruction 1 it turns out something else came next, we need to go over and do that instead.
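
      A rough sketch of the two problems described above, under simplifying assumptions that are not from the post: an in-order pipeline that issues one instruction per cycle with no result forwarding, and a two-bit saturating counter standing in for the unspecified "statistical methods" of branch prediction. Both function names are hypothetical.

```python
def pipeline_cycles(deps, depth):
    """Total cycles to run a program on a toy in-order pipeline.

    deps[i] is the index of the earlier instruction whose result instruction i
    needs, or None. Assumption: a dependent instruction cannot enter the
    pipeline until its producer has left the last of `depth` stages.
    """
    issue = []  # cycle at which each instruction enters the pipeline
    for dep in deps:
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            earliest = max(earliest, issue[dep] + depth)  # stall for the producer
        issue.append(earliest)
    return issue[-1] + depth if issue else 0


def two_bit_predictor_accuracy(outcomes, state=2):
    """Fraction of branches guessed correctly by a two-bit saturating counter.

    outcomes is a sequence of booleans (True = branch taken). Counter states
    0-1 predict "not taken" and states 2-3 predict "taken".
    """
    outcomes = list(outcomes)
    correct = 0
    for taken in outcomes:
        predicted = state >= 2
        correct += predicted == taken
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes) if outcomes else 0.0
```

      With a 5-stage pipeline, four independent instructions finish in 8 cycles while a chain of four dependent ones takes 20, which is the gap that rescheduling logic tries to close. A loop branch taken nine times and then not taken is guessed right nine times out of ten by the counter.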

      The result of all this out-of-order superscalar pipelined "post-RISC" stuff was much higher IPC (Instructions executed Per Clock), but also lots of complicated logic on MPUs to handle all the scheduling and dependency checking and prediction. Theoretically, just as all the complicated logic made CISC chips complicated and slow, all this complicated logic makes today's post-RISC chips too complicated, too large, too hot, and slower than they might otherwise be. [end summary]

      Thus, the basic idea behind VLIW is an extension of the idea behind the CISC->RISC transition. To wit: why not take all this complexity out of the MPU and put it back into the compiler? That way, we can get rid of all the unpleasantness once, at compile time--on the developer's time, not the user's. The way it does this is by trying to find all the parallelism, work out all the dependencies, and predict all the branches at compile time--in other words, to do all the scheduling at compile time. The way it communicates this to the chip, then, is to compile not to individual instructions for the chip to schedule, but rather into prescheduled "bundles"--or "Very Long Instruction Words"--which are supposed to be guaranteed to work well when run together in parallel.
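
      To make the compile-time bundling idea concrete, here is a minimal sketch of a greedy scheduler. It is an illustration only, not how any real VLIW compiler works: it assumes every instruction takes one cycle, tracks only true data dependencies (no register reuse hazards, no branches), and the names schedule_bundles and the instruction labels are hypothetical.

```python
def schedule_bundles(instrs, width):
    """Pack instructions into fixed-width bundles at "compile time".

    instrs is a list of (name, set_of_names_it_depends_on) in program order;
    every dependency must appear earlier in the list. Each instruction goes
    into the earliest bundle that comes after all of its producers and still
    has a free slot. Returns a list of bundles (lists of names).
    """
    bundles = []
    placed = {}  # instruction name -> index of the bundle it was put in
    for name, deps in instrs:
        # Must run strictly after every instruction it depends on.
        idx = max((placed[d] + 1 for d in deps), default=0)
        # Skip bundles that are already full.
        while idx < len(bundles) and len(bundles[idx]) >= width:
            idx += 1
        if idx == len(bundles):
            bundles.append([])
        bundles[idx].append(name)
        placed[name] = idx
    return bundles


# The five simple steps of the CRAZY_OP example from earlier in this comment.
CRAZY_STEPS = [
    ("load", set()),
    ("sub", {"load"}),
    ("mul", set()),
    ("div", {"sub", "mul"}),
    ("store", {"div"}),
]
```

      Scheduling CRAZY_STEPS with a width of 3 yields four bundles, with only "load" and "mul" sharing one. Even in this tiny case most slots go empty, which hints at why keeping a wide machine fed is hard.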

      Or rather, this is how VLIW works where it is normally used--in DSP type processors, running programs for which it is very easy to extract this sort of data at compile-time. Problem is, it is much more difficult to do with general-purpose programs, which is why it hasn't been done before. As you might guess, there's just too much you don't know at compile-time for you to get unambiguous scheduling information. Transmeta solves this problem by compiling at run-time, using their code-morphing software, essentially a JIT compiler. The problems with this are obvious and well known: namely, that the JIT compiler uses resources which would otherwise go to running the program, and that you don't get the VLIW benefit of doing all the optimization once and forgetting about it. (The code-morphing software caches, profiles, and further optimizes the code it's already run, but it is still always running, and doesn't save this information from session to session.) Indeed, you're essentially moving the scheduling problem from one which is done by specialized on-chip logic in different pipeline stages than the execution logic--and thus not competing for execution resources--to one which is run by the general-purpose execution logic; a shaky trade-off at best. On the other hand, by working in software you theoretically get more flexibility to schedule instructions than when doing the scheduling with a chip's fixed logic.

      The way IA-64 handles this problem is to have the compiler insert "hints" about which instructions look like they *might* be able to run in parallel, without dependencies; which way a branch is *likely* to go; which scheduling is *likely* to make good use of the chip's execution resources. The problem with this is, as the hints are inevitably going to be wrong, the chip needs its own analogues of much of the scheduling hardware it was trying to get rid of in the first place. In some ways, it's little more than a change in terminology: with OOE designs you have a smallish general register set with a large set of "rename registers", so that each instruction running in parallel essentially thinks it has a full copy of the general register set all to itself; with IA-64, you just have a huge general register set so that each parallel instruction has enough registers to work with.
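
      The "rename registers" half of that comparison can be sketched as follows. This is a simplified illustration, assuming an unlimited pool of physical registers and instructions given as (destination, sources) tuples; the function name and the p0, p1, ... naming are hypothetical, and real hardware also has to free and recycle physical registers.

```python
from itertools import count


def rename_registers(instrs):
    """Map architectural registers to fresh physical registers.

    instrs is a list of (dest, sources) using architectural names such as
    "r1". Every write gets a brand-new physical register and every read uses
    the most recent mapping, so two unrelated uses of the same architectural
    register no longer collide.
    """
    fresh = count()   # endless supply of physical register numbers (assumption)
    mapping = {}      # architectural name -> current physical name
    renamed = []
    for dest, sources in instrs:
        new_sources = []
        for src in sources:
            if src not in mapping:  # first sighting: value existed before this code
                mapping[src] = f"p{next(fresh)}"
            new_sources.append(mapping[src])
        mapping[dest] = f"p{next(fresh)}"  # a new write always gets a new register
        renamed.append((mapping[dest], tuple(new_sources)))
    return renamed


EXAMPLE = [
    ("r1", ("r2", "r3")),  # r1 = f(r2, r3)
    ("r4", ("r1",)),       # r4 = g(r1), a true dependency on the line above
    ("r1", ("r5", "r6")),  # r1 reused for unrelated work
]
```

      After renaming EXAMPLE, the third instruction writes a different physical register than the first, so it could run alongside the first two; the IA-64 approach described above gets a similar effect by simply exposing a much larger register set to the compiler.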

      The problem is, of course, that you haven't done what you set out to do--eliminate complex scheduling logic from the processor. Instead, you've just replaced it with similar but less-well understood versions of the same stuff. The end result is that the Itanium core, far from being small, simple and clocking fast, is huge, complex, unbalanced, and therefore capable of pitiful clock speeds. The die is ~300mm^2--roughly 3 times the size of a P3--yet only has room for a total of 16kb L1 and 96kb L2 cache, less than even a lowly Celeron. (Server level chips like Itanium generally need much *larger* caches than PC chips; Itanium is supplemented with a large off-chip L3 cache, but it is too high-latency to be much use.) Itanium was supposed to launch in early 1998 at 800MHz; it is only now yielding above 733MHz--again, Celeron territory.

      Furthermore, we run into trouble from an unexpected place--code bloat. Of course, it's not the same problem as in the 70's, when we used CISC ISA's to keep code small so that they could be stored at all; today's 100GB HD's testify to that. Rather the problem is that *bandwidth* to storage is very often the limiting factor with today's technologies, and that high-bandwidth storage--i.e. on-chip cache--is just as scarce as overall storage was in the 70's. With all its hints and bundling and exception codes to execute if the hints turn out wrong, IA-64 is much more bloated than x86 or RISC code, and thus those not-even-Celeron sized on-die caches are effectively even smaller.

      Of course, Itanium has more functional units than the P6 core, and if all these compiling tricks actually keep them full of instructions, it will perform much better per clock. Unfortunately, all indications are that even with the relaxed "hints instead of guarantees" rule, it's still just too difficult for today's compiler technology to keep this monster even remotely well-fed. Intel even had the gall to claim at their recent Intel Developers' Forum that the SPEC CPU benchmarks were "irrelevant" for Itanium's target market, offering instead a (hand-written in assembly) RSA encryption benchmark in which Itanium demolished a Sun USIII. Well, that's fine, except that a very cheap dedicated encryption chip can beat the Itanium at this game several times over for 1% the cost and power requirements. Of course, the SPEC benchmarks run exactly the sorts of programs used in Itanium's target market, and are the most relevant measure possible. And not coincidentally, they are extremely sensitive to compiler quality.

      So...to get back to our original topic, IA-64 is another huge risk--Intel has repeatedly called it a "bet-the-company thing"--which incorporates some very interesting, non-mainstream ideas in an attempt to radically advance the state of the art. And so far, it appears not to be working.

      Don't worry too much about Intel, though; from all indications, McKinley, the 2nd-generation IA-64 core, should perform just fine thank you. Interestingly enough, it was designed almost entirely by HP engineers. But it also must be emphasized that they have clearly learned from Intel's myriad mistakes with Itanium. (Everything about Itanium, from the pitiful tacked-on caches to the rather unnatural pipeline design--apparently an extra stage needed to be added late in the design process--indicates that this design was a "learner".) Plus, Itanium has been delayed so long that the almost-on-schedule McKinley is due out relatively soon--roadmaps have it as soon as Q4 2001 (dubious), and Q1 2002 might actually be reasonable. McKinley should clock just fine (although not as high as the CISC front-end P4), and has plenty of on-die cache. And in a year and a bit, the compilers might finally be ready too.

      So, Intel might just turn this risky strategy into gold. Maybe the "post-RISC" paradigm *will* run out of gas soon, and VLIW will speed past. The point is, for better or worse, Intel's MPU designers are not conservative in the least.

      AMD, on the other hand, has never introduced any significant new MPU design techniques that I can think of; instead they concentrate on implementing Intel's designs better than Intel. Indeed, their first PC MPUs had the same names as Intel's--AMD made a "386" and a "486", and possibly a "286" too, I don't remember. The much-vaunted K7 is really quite similar to the P6 core, just with more functional units, larger buffers, more decoders--more more. It's a better version of the P6 (though less power efficient), but it's not terribly innovative. Of course, AMD was in a precarious enough position market-wise that they didn't have the luxury of taking engineering risks. Intel being relatively secure (and perceiving themselves--Andy Grove's catchy business-trade bestsellers notwithstanding--as even more so), they can and do experiment with some wacky stuff. Some of it works, some of it doesn't, and for some of it they take their massive market power and force it to work.

  2. Bizarre... by pb · · Score: 3

    What's this with having parts of the contract blacked out? I've never heard of this. Is this a common practice?

    I've heard a few too many stories about heavy-handed tactics by Intel when dealing with their employees, or other corporations, so somehow it does not sadden me to see them trapped by RAMBUS. Maybe this will be a welcome breather to get some competition back into the industry.

    In any case, I'm pretty happy with my Athlon. :)
    ---
    pb Reply or e-mail; don't vaguely moderate.

  3. A question of approach by Bug2000 · · Score: 3

    Mistakes at this level can really be lethal. Intel imposed its top-down approach even though bottom-up surveys showed that it was a poor decision... But the war for market share tends to impose these top-down decisions as a matter of survival.

    In these times of huge mergers between giant companies, it is quite likely that this will happen again, and on a very large scale (ouch!!). AOL-Netscape was a first example, Intel-Rambus another one; who's next?

    --

    It's just that those who sing out of tune also have a heart
  4. Price of memory by dnnrly · · Score: 3
    It is entirely possible that this development could have an effect on the price of memory. There is something to be said for the theory that the perception that people are going to need more non-Rambus memory will lead to a shortage in supply and drive up the cost. Do any economists have any ideas?

    As for Intel destroying the trust between management and those engineers, I think this is pretty dire! Those engineers are going to start thinking twice before giving their honest opinions on things. That could very well lead to management not getting the information they need, such as appraisals of new technology. If the engineers think that management is willing to push for something, then people are just going to fold. Bear in mind that I use the word 'management' here loosely!

    dnnrly

  5. Finally... but by Aceticon · · Score: 5
    The problem is that the people they lost might very well be some of the better ones among them - this is not good for any company.

    The article seems to indicate that Rambus adoption was completely a high-level decision, and that the input from the lower levels (the engineering team) was not only disregarded but also, for those who persisted in voicing their disagreement with the technology, punished.
    Although I believe that choosing Rambus was a bad move, I think that:

    1. "outsourcing innovation" (to Rambus Inc)
    2. Ignoring or even suppressing internal opinions
    were by far the worst moves that Intel could've made.

    Think about this:

    1. It's more than obvious that Rambus Inc exists not to serve the interests of Intel, but to serve the interests of its own members and/or shareholders
    2. Ignoring the opinions that come from experience, and taking punitive measures against those at Intel who were brave enough to stick to their opinions, will just push out of Intel the most knowledgeable and daring - probably the same people who are more willing to voice/try new and innovative ideas - and leave Intel with fewer free spirits and more zombies. Zombies do not innovate.
    1. Re:Finally... but by cyber-vandal · · Score: 3

      Yes, I was amazed to see that the opinions of the technical experts were ignored by the management. Surely this sort of thing is rare in IT (snigger)

  6. I think this article is sensationalized by Anonymous Coward · · Score: 4

    --Now this may go against common opinion, but in a team atmosphere, Intel's so-called "disagree and commit" thing is a common requirement. In general it doesn't mean "shut up and do what management says", it means if the whole team agrees on a particular solution, then you can't have the few who disagree continuing to undermine what you're trying to accomplish.

    For example... pretend I have 10 designers working on an ASIC, and one thinks the protocol we are using sucks. The majority agreed that this thing has a good chance to perform and sell well, but this guy was the odd man out. Now... what do we do? Do we throw away the other 9 opinions and say: "ok, scrap this, we'll do what you want"?

    I've worked with guys like this before. Not only do they refuse to accept the team's decision, but they continue to profess their negative opinions at every chance possible.

    The only reason you guys are eating up the negative view of a single ex-employee is that, in this case, Rambus did have problems. Even though he may have been right about Rambus, it's still tough for me to believe that "employees got bad reviews because they spoke out against Rambus". Chances are, this guy got a bad review because he was being counter-productive.

    -This is the opinion of one guy, just like that article.

  7. Re:Gateway? by jht · · Score: 3

    Gateway uses Athlons in some of their consumer PC lines (like the Select series), but their "corporate" systems (the Enterprise, or E-series) all use Intel chips. They have one desktop (the E1400) that's i810-based, one (the E3400) that uses the i815 without the built-in video, and a model (the E4400) that uses the i820 and RDRAM.

    The difference is that the E-series have longer product lifecycles and offer more consistency in the devices that they use (for instance, they offer the same video card and Ethernet card throughout the product's lifespan). The lifecycle also runs longer - usually about 18 months compared to the 6-12 months that a consumer PC might be available.

    Most top-tier PC makers do something similar. The bleeding-edge and "cool" technologies go into consumer PC's (which small businesses also usually buy), and Big Business buys the managed systems (which are relatively boring, but consistent). Dell, as another example, has the Dimension PC's for home/small business, and they offer the Optiplex for their managed line (we used to be a Dell shop and switched to Gateway earlier this year).

    When I last discussed their roadmap plans with Gateway, they were starting to consider the possibility of adding an Athlon-based E-series PC, but it's still a little immature to them.

    - -Josh Turiel

    --
    -- Josh Turiel
    "2. Do not eat iPod Shuffle."
  8. Intel Lost Its Culture by casio · · Score: 4

    I left Intel after 15 years. (I've been out 10 months.) I think the main reason for disasters like Rambus and many of the other execution problems is that the traditional Intel culture has been allowed to slip away. Believe it or not, the internal culture revolved around responsibility and accountability. Around 6 years ago that started to change. Dissenting opinions were not welcomed (shoot the messenger), and too many decisions are being made too high in the chain (specifically technical decisions).