Slashdot Mirror


Next-Gen Processor Unveiled

A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential of reaching trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors typically are designed to sustain a maximum execution rate of four operations per cycle.

39 of 183 comments (clear)

  1. I want one by Normal+Dan · · Score: 4, Insightful

    But when are they likely to be ready?

    --
    A unique way to learn a language: http://languageloom.com
    1. Re:I want one by ackthpt · · Score: 5, Funny

      But when are they likely to be ready?



      • You know they'll be ready when Intel places large orders for aluminium for heatsinks.
      • You know they'll be ready when there's a sudden drop in prices of the current Hot CPUs, which are all proven but suddenly look like last month's pizza from under the couch.
      • You know they'll be ready when AMD hasn't said anything and they are suddenly shipping them, while Intel tells you in 9 mos. then suddenly says 3 mos. (and you can hear the whips cracking through the walls.)
      • You know they'll be ready when Microsoft doesn't have an operating system ready, but there are a dozen Linux distros good to go.
      --

      A feeling of having made the same mistake before: Deja Foobar
    2. Re:I want one by mrbluze · · Score: 4, Funny

      Well i won't buy one until the Super version comes out (STRIPS). Now that's a name that has appeal!

      --
      Do it yourself, because no one else will do it yourself. [beta blockade 10-17 Feb]
    3. Re:I want one by Anonymous Coward · · Score: 2, Funny

      Hsss, he ruins it! The fat hobbit ruins it!

  2. Hm... by imsabbel · · Score: 4, Insightful

    The article contains little more information than the blurb.
    But it seems to me that we called this great new invention "vector processors" 15 years ago, and there is a reason they arent around anymore.
    "Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalities" anybody?

    --
    HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
    1. Re:Hm... by superpulpsicle · · Score: 4, Interesting

      Come on now. It's a capitalist market. You can't just innovate your way to fame. Just like the list of 5 million other patents that never see the daylight.

    2. Re:Hm... by volsung · · Score: 5, Informative

      The vector processors never went away. They just became your graphics card: 128 floating point units at your command

      BTW, here is a real article on TRIPS.

    3. Re:Hm... by Anonymous Coward · · Score: 3, Informative

      Actually, it is more like the dataflow architectures from the 70s. Vector processors are a totally different kind of thing (SIMD).

      The idea is simple, instead of discovering instruction level parallelism by checking the dependencies and anti-dependencies by global names (registers), define the dependencies directly by relating to instructions themselves.

      > "Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalities" anybody?

      That equality does not exist. It is a wide parallel execution, not super-pipelining, ergo no huge branching penalties.
      Also, the architecture is more likely exploting the wide execution unit by predicating both branches and calculating them both.

    4. Re:Hm... by frank_adrian314159 · · Score: 5, Informative

      No, here are the real articles on TRIPS. These and many others can be found here.

      --
      That is all.
    5. Re:Hm... by SiJockey · · Score: 4, Informative

      The big difference in TRIPS is that stuff flying around out in memory can be squashed easily. The machine has aggressive branch prediction, efficient predication support in the ISA, and data dependence prediction. So, the 1024 instructions don't need to be long vectors streaming from memory. Squashing a mispredicted branch and restarting down the right path takes on the order of 10-20 machine cycles. Thanks for your comments and interest. -DB

      --
      --+-- Doug Burger, UT-Austin Computer Sciences
  3. Next-Gen Business Model Unveiled by Anonymous Coward · · Score: 3, Insightful

    1. Copy some university press release to your blog
    2. Make sure google ads show up at the top of the page
    3. Submit blog to slashdot
    4. Profit

    1. Re:Next-Gen Business Model Unveiled by Manos_Of_Fate · · Score: 2, Funny

      You forgot the embedded YouTube video...

      --
      Isn't enough that I ruined a pony, making a gift for you?
  4. Marketting hype? by faragon · · Score: 5, Informative

    Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.

    It's me or are they trying to reparaphrasing, euphemistically, the Out-of-order execution?

    1. Re:Marketting hype? by Aadain2001 · · Score: 4, Informative
      Based on the article, "TRIPS" is nothing more than a Out-Of-Order(OOO) SuperScalar based processor. So unless the article is grossly simplifying (possible), this is nothing but a PR stunt. And based on the quote from one of the Professors about building it on "nanoscale" technology (um, been doing that for years now), my vote is pure PR BS.

      And as an aside, the reason modern CPUs are designed to "only" issue 4 instructions per cycle instead of 16 is because after years of careful research and testing real work applications, 4 instructions is almost always the maximum number of instructions any program can concurrently issue, due to issues like branches, cache-misses, data dependencies, etc. Makes me question just how much these "professors" really know.

      --
      Space for rent, inquire within
    2. Re:Marketting hype? by Doches · · Score: 3, Interesting

      Branches are no problem for TRIPS -- in the EDGE architecture, both execution paths resulting from a branch are computed, unlike in classic architectures where the processor blocks (8086), skips ahead a single instruction before blocking (MIPS), or chooses a path using a branch predictor and executing it, possibly only to discard all instructions issued since the branch, if the predictor turns out wrong. EDGE architectures still lag on cache misses (or any memory hit) -- but that's fundamentally a problem with memory, not with the processor. Don't read the article, read the UT pdf.

    3. Re:Marketting hype? by smallfries · · Score: 5, Interesting

      Multiway branching is ancient, and it's not used much because it's very inefficient. At least half of the instruction stream after a branch will be canceled, two branches deep it is 75% and so on. No matter how much parallelism ou throw at this there are only marginal gains to made (exponential increase in number of execution units for a linear increase in depth). It still doesn't get around data dependencies which will be the major bottleneck if looking that far ahead in the instruction stream.

      Having read the articles that were easy to get to, and the abstract of the PhD student: this is buzzword bollocks. There is no innovation in what they have done. As other people have pointed out this is a vector / datastream architecture. It's not a very good one at that. Although it has the "potential" to scale to terraflops, so does my toothbrush. On a 130 process they can fit 2 cores with 32-wide dispatch clocked at 500Mhz. My 7800 is fab'ed on a 130 process with 24*4*4 = 384 operation wide vector dispatch. This prototype would hit about 16 billion ops/sec, versus 180 Gflop on the 7800. This is a long way from terraflops, and doesn't convince me that it can scale.

      As the 7800 is close to a systolic model there is a limited class of programs that can be executed; but those that are in that class exhibit (near)perfect parallelism and so have zero hit from memory access costs. Actually the internal bandwidth on the 7800 is a bottleneck for some computations but I'm just going for coarse detail here.

      Edge appears mix and match ideas from several parallel designs; every one of which suffers from hard code generation problems. I suspect that the only sample applications that hit 32 ops / cycle are media apps (or dataflow problems as they used to be called) which normal architectures run at high speed anyway.

      Interesting research, as it's always good to see people explore different designs, but it sounds overhyped and I believe that it has zero commercial appeal. Finally, as a sidenote, you are right about cache latencies being a memory defect rather than processor but there are ways around it. If you are willing to limit yourself to a certain class of applications (roughly the same one that executes well on most parallel architectures such as this, or GPUs) then you can completely avoid the latency. This provides a much bigger performance hike than any other technique as memory latency is a dominating factor in most runtimes. The only snag is that it is very hard to do, requires different fabrication technology (largely solved now), and lots of compiler advances... If you're interested then google for intelligent ram. It's about a decade of research now...

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    4. Re:Marketting hype? by SiJockey · · Score: 4, Informative

      Actually, there is much more parallelism (more than 4 ops/cycle) available in many of these applications, but you correctly observe that many of these ancillary features (branch mispredictions, cache misses, etc.) chip away at the achieved parallelism. The TRIPS ISA and microarchitecture (which is, as you correctly point out, a variant of an OOO "superscalar" processor) has numerous features to try to mitigate many of these features ... up to 64 outstanding cache misses from the 1,024-entry window, aggressive predication to eliminate many branches, a memory dependence predictor, and direct ALU-ALU communication for making data dependences more efficient. The most important difference is in the ISA, which allows the compiler to express dataflow graphs to directly to the hardware, which will work best (compared to convention) in ultra-small technologies where the wires are quite slow. To get a similar dependence graph in a RISC or CISC ISA, a superscalar processor must reconstruct it on the fly, instruction by instruction, using register renaming and issue window tag broadcasting. Thanks for reading.

      --
      --+-- Doug Burger, UT-Austin Computer Sciences
    5. Re:Marketting hype? by David+Greene · · Score: 4, Interesting

      ...this is buzzword bollocks.

      No, it isn't. The TRIPS group has done some really interesting things with compilers, for example. They've managed to have the compiler break up code into packets and schedule them on the processor array so that dependencies flow nicely across the grid. That is not an easy problem to tackle. This is very good research.

      I believe that it has zero commercial appeal.

      That's not the point of research. The point of research is to explore problems no one has tackled before, of course always with an eye toward future technology trends.

      --

  5. One trillion calculations per second by 2012 by xocp · · Score: 3, Informative
    A link to the U of Texas project website can be found here.

    Key Innovations:

    Explicit Data Graph Execution (EDGE) instruction set architecture
    Scalable and distributed processor core composed of replicated heterogeneous tiles
    Non-uniform cache architecture and implementation
    On-chip networks for operands and data traffic
    Configurable on-chip memory system with capability to shift storage between cache and physical memory
    Composable processors constructed by aggregating homogeneous processor tiles
    Compiler algorithms and an implementation that create atomically executable blocks of code
    Spatial instruction scheduling algorithms and implementation
    TRIPS Hardware and Software
    1. Re:One trillion calculations per second by 2012 by xocp · · Score: 3, Informative
      DARPA is the primary sponsor...
      Check out this writeup at HPC wire.

      A major design goal of the TRIPS architecture is to support "polymorphism," that is, the capability to provide high-performance execution for many different application domains. Polymorphism is one of the main capabilities sought by DARPA, TRIPS' principal sponsor. The objective is to enable a single processor to perform as if it were a heterogeneous set of special-purpose processors. The advantages of this approach, in terms of scalability and simplicity of design, are obvious.

      To implement polymorphism, the TRIPS architecture employs three levels of concurrency: instruction-level, thread-level and data-level parallelism (ILP, TLP, and DLP, respectively). At run-time, the grid of execution nodes can be dynamically reconfigured so that the hardware can obtain the best performance based on the type of concurrency inherent to the application. In this way, the TRIPS architecture can adapt to a broad range of application types, including desktop, signal processing, graphics, server, scientific and embedded.
  6. Must...resist...obvious...joke by jimicus · · Score: 2, Funny

    Imagine a beowulf cluster of these!

  7. Gets rid of the register-file by DrDitto · · Score: 5, Insightful

    The EDGE architecture gets rid of relying on a single register file to communicate results between instructions. Instead, a producer-consumer ISA directly sends results to one of 128 instructions in a superblock (sort of like a basic block, but larger). In this way, hopefully more instruction-level parallelism can be extracted because superscalars can't really go beyond 4-wide (8-wide is a stretch...DEC was attempting this before Alpha was killed). Nice concept, but it doesn't solve many pressing problems in computer architecture, namely the memory wall and parallel programmability.

  8. This is cool! by adubey · · Score: 3, Informative

    The link has NO information.

    The PDF here: has more information about EDGE.

    The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory on old skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next. No need to do register allocation while compiling. I'm still reading the PDF, this sounds like a really neat idea, though.

    The only question is, will this be so much better than existing ISA's to eventually replace them? -- even if only for specific applications like high-performance computing.

    1. Re:This is cool! by treeves · · Score: 2, Insightful

      If it's so cool, why did it take three years for us to hear about it? I'm really asking, not just trolling.

      --
      ...the future crusty old bastards are already drinking the Kool-Aid.
  9. Moore's law immortal? by Manos_Of_Fate · · Score: 3, Interesting

    It seems like for every "realist" claiming that Moore's law will soon hit a ceiling, I see another ZOMG Breakthrough! Lately, the question I've been asking myself is, "Will we ever surpass it?"

    --
    Isn't enough that I ruined a pony, making a gift for you?
  10. but... by Klintus+Fang · · Score: 4, Insightful

    The motivations for this technology provided in the article ignore some rather basic facts.

    They point out that current multi-core architectures put a huge burden on the software developer. This is true, but their claim that this technology will relieve that burden is dubious. They mention, for example, that current processing cores can typically only perform 4 simultaneous operations per-core, and imply that this is some kind of weakness. They completely fail to mention that the vast majority of applications running on those processors don't even use the 4 available scheduling resources in each core. In other words, the number of applications that would benefit from being able to execute more than 4 simultaneous instructions in the same core is vanishingly small. This is why most current processors have stopped at 3 or 4. Not because they haven't thought of pushing it beyond that, but because it is expensive, and because it yields very little return on the investment. Very few real-world users would see any performance benefit if the current cores on the market were any wider than 3 or 4. Most of those users aren't even using the 4 that are currently available.

    Certainly the ability to do 1024 operations simulatenously in a single core is impressive. But it is not an ability that magically solves any of the current bottlenecks in multi-threaded software design. Most software application developers have difficulty figuring out what to do with multiple-cores. Those same developers would have just as much (if not more) difficult figuring out what to do with a the extra resources in a core that can execute 1024 simultaneous operations.

    --
    In a minute there is time For decisions and revisions which a minute will reverse. -T.S. Eliot
    1. Re:but... by $RANDOMLUSER · · Score: 3, Informative

      Two words: loop unwinding. This critter is perfect to run all iterations of (certain) loops in parallel, which would be determinable at compile time.

      --
      No folly is more costly than the folly of intolerant idealism. - Winston Churchill
    2. Re:but... by Klintus+Fang · · Score: 3, Interesting

      of course loop unwinding works fine... when you have a long loop. it does though have two problems. 1) it only works when you have very long loops where there are very little dependencies between the consecutive iterations of the loop 2) even when it does work, it causes the code footprint of the application to be much bigger which means you end up putting a lot more stress on your cache pipeline, requiring bigger caches and a wider fetch engine. And that all aside, what about the vast majority of code segments where massive parrallelizable loops are not being executed? Loop unwinding isn't going to help at all for those.

      --
      In a minute there is time For decisions and revisions which a minute will reverse. -T.S. Eliot
  11. Re:TRIPS by glwtta · · Score: 3, Insightful

    I think it's mostly because the backronym is contrived and silly.

    --
    sic transit gloria mundi
  12. Re:Ix86 by convolvatron · · Score: 4, Insightful

    you are absolutely right. no one should ever do any research into
    something which doesn't ultimately look like an x86.

  13. nothing spectacular by CBravo · · Score: 4, Informative

    Right, let me begin by saying that after reading ftp://ftp.cs.utexas.edu/pub/dburger/papers/IEEECOM PUTER04_trips.pdf it actually became a bit more clear about what they were talking about.

    It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=103228 8&lastnode_id=0 to see what transport-triggered architectures are about. They are more power efficient, etc etc.

    Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) to an architecture. Generally, it can be scheduled in-time (there is a critical path somewhere) or it is scheduled with a certain degree (generally > .9 efficient) of optimality. I don't see the gain there in efficiency.

    Now here comes the shameless self-plug. If you want to gain efficiency in scheduling a node of an execution graph you have to know which node is more critical than the other. The critical nodes (the ones on the critical path) need to be scheduled to the fast/optimized processing units and the others can be scheduled to slow/efficient processing units (and they can get some communication delays without penalty). Look http://ce.et.tudelft.nl/publicationfiles/786_11_dh ofstee_v1.0_18july2003_eindverslag.pdf here for my thesis.

    --
    nosig today
  14. Re:Better support for concurrency in Languages by Anonymous Coward · · Score: 2, Insightful

    A lot of this is due to the fact that most popular languages right now do not support concurrency very well. Most common languages are stateful, and state and concurrency are rather antithetical to one another. The solution is to gradually evolve toward languages that solve this either by forsaking state (Haskell, Erlang) or by using something like transaction memory for encapsulating state in a way that is easy to deal with (Haskell's STM, Fortress (I think), maybe some others).

    Concurrency is not that hard to do well in the right setting.

    And before anyone claims that Haskell and Erlang are impractical, there are many examples of "real world" programs written in them.

    A few nice, and very useful ones are Yaws (for erlang) and Darcs (for Haskell). There are many others (even quake clones), which I won't bother listing, but people can find them easily if they look.

    Regarding concurrency, and its ease of use in these languages, I'm taking a machine learning class at the moment where most of the problems are computationally intensive, and could stand for improvement by making use of multiple cores. I do all of my assignments in Haskell, and not only are my solutions often shorter than those of my classmates (and they often work fine the first time they compile), but it's usually trivial to allow my application to scale nicely to as many cores as I can throw at it. It's worth mentioning, by the way, that most algorithms given in these classes are given under the assumption that people are using imperative languages, and even then, it's still easy. It takes a while to learn how to approach problems differently without mutable state, yes, but it's not as hard as some people make it out to be. I think it has more to do with the fact that people just don't like to learn anything new unless they absolutely are forced to do so, which is a pity.

    By the way, there is a nice presentation from Tim Sweeney on what he would like future programming languages to look like, and there's a lot in there about functional programming, concurrency, and expressive (re: dependent) types.

  15. Re:Welcome to 1994 by Wesley+Felter · · Score: 3, Informative

    EPIC (i.e. Itanium) is still based on centralized structures like register files. To create a 16-issue EPIC processor, you'd need a ~32R/16W port register file which would be virtually impossible to build because it would be so huge and power-hungry. Also, EPIC needs heroic compiler optimizations to overcome its in-order execution, while EDGE is naturally out-of-order.

  16. This is just an update from year ago... by coldmist · · Score: 4, Informative

    Here is the slashdot article from 2003 about this processor: link

    The specs have been updated to 1024 from 512, but that's about it.

    Another 3-5 years out?

    --
    Don't steal. The government hates competition.
  17. TRIPS may solve some problems by knowsalot · · Score: 2, Informative

    The big thing that all the commenters have missed that I've read so far is the fact that OOO execution is difficult not because it's hard to make many ALU's on a chip (vector design, anyone?) but because in a general-purpose processor the register file and routing complexity grows as N^2 in the number of units. That's bad. Every unit has to communicate with every other unit (via the register file or, more commonly, via bypasses to an OOO buffer for every stage prior to writeback). The issue being addressed here is wiring complexity which, as modern designers would tell you, is a much harder problem than designing fast logic. Routing is hard. Plunking down more ALU's is easy. If you eliminate the register file, and design your processor and ISA to feed instructions in a data-flow manner to thousands of ALU's then you may be able to vastly simplify routing requirements, thereby decreasing the length of your critical path electrical circuits, thereby allowing the processor to clock faster. (Data-flow execution is executing instructions when their data inputs are ready, rather than tracking the compiler-optimized order, which does not have the run-time information that the hardware has.) If you are clever about your compiler, and make your hardware wide enough, you can for example speculatively execute both sides of a branch until it is resolved, thus eliminating a certain percentage of pipeline stalls for branch mispredicts. Similarly, with data-prediction you can speculate during cache misses. The list goes on. This is a very new and different paradigm (ugly word) for CPUs which may lead to higher IPC. This isn't the single golden goose, but it's a very different way of looking at the problem of pushing more instructions through a processor at higher speeds.

    1. Re:TRIPS may solve some problems by knowsalot · · Score: 3, Interesting

      Oh, and before someone points this out for me, you have to imagine that the routing requirements are VASTLY improved. Imagine a grid of ALU's each connected by a single bus, (simple,) rather than 128 bypass busses all multiplexed in to each ALU. (chaos! don't forget the MUX logic!) You map one instruction to one (virtual) ALU, rather than one result to a (virtual) register. Then you pipeline/march each instruction with its partial data down the grid until all the inputs come in. Instructions continually cascade in the top of the grid, and commit out the bottom. But their results are available to feed other instructions as soon as they are computed! Never have to wait for a MUX or a bus or what-have-you. Plus, you can clock the whole thing EXTREMELY fast, because you don't have these wire-delays from difficult routing requirements.

  18. Don't dismiss it by er824 · · Score: 5, Informative

    I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.

    http://www.cs.utexas.edu/~trips/

    They have several papers available that motivate the rationale for a architecture.

    The designers of this architecture believed that conventional architectures were going to run into some physical limitations that were going to prevent them from scaling further. One of the issues they foresaw was that as feature size continued to shrink and die size continued to increase chips would become susceptible to, and ultimately constrained by wire delay. Meaning the amount of time it took to send a signal from one part of a chip to another would constrain the ultimate performance. To some extent the shift in focus to multi-core CPUS validates some of their beliefs.

    To address the wire delay problem the architecture attempts to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of using intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (Blocks) by the compiler. However it differs in how the operations within the block are scheduled.

    TRIPS does not depend on the compiler to schedule the operations making up a block like a VLIW architecture does. Instead the TRIPS compiler maps the individual operations making up a large TRIPS instruction block to a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a 3 dimensional execution substrate.

    By having the compiler assign data dependent instructions to execution units that are physically close to one another the communication overhead on the chip can be reduced. The individual operations wait for the operands to arrive at their assigned execution unit, once all of operations dependencies are available then the operation fires and its result is forwarded to any waiting instruction. In this way the operations making up the TRIPS are dynamically scheduled according to the data flow of the block and the amount of communications that have to occur across large distances are limited. Once an entire block is executed its can be retired and its results can be written to a register or memory.

    At the block level a TRIPS processor can still function much like a conventional processor. Blocks can be executed out of order, speculatively, or in parallel. They have also defined TRIPS as a polymorphous architecture meaning the configuration and execution dynamics can be changed to best leverage the available parallelism. If code is highly parallelizable it might make sense to allow bigger blocks mapped. However, by performing these type of operations at the level of a block instead of for each individual instruction the overhead is theoretically drastically reduced.

    There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks, when there is less data level parallelism available it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented but the designers have several papers available where they discuss how a TRIPS style architecture can be adapted to perform well on a wide gamut of software.

  19. One word: by robyannetta · · Score: 2, Funny

    Skynet

    --
    - Just my $0.02, take with a grain of salt, your mileage may vary.
  20. Compilers Are Not Magic, or Why IA64 Didn't Work by Erich · · Score: 2, Insightful
    TRIPS, like many other projects optimized to produce the largest number of PhD students possible, starts out with a premise something like:

    So, I have this big array of CPUs/ALUs/Functional Units, all I need to do is program them and I can be the computingest damn thing you ever saw!

    And it's true. You build a sea of ALUs and you sic some folks on hand coding all sorts of things to the machine, and you end up with some spectacular results.


    The problem is that we still can't get a compiler to do a good job at it, for the most part. We thought we could, and we threw every bell and whistle into IA64
    for a compiler-controlled architecture, and you've seen what we've ended up with. Many years later, the situation is still pretty much the same: the compiler
    can't do all that great of a job with these sorts of machines.


    Don't get me wrong, there are lots of good ideas in TRIPS or any of the various other academic projects like it, but I'm yet to be convinced that it's useful in
    any kind of real codebase that's not coded by hand by an army of graudate students. For some tasks, that's an acceptable model -- It's been the model in the world of
    signal processing for quite a while (though becoming less so daily) -- but for most mainstream applications it just won't fly.


    That, and it's hard for compilers to have knowledge about history. It's terribly important for optimization, and it's just hard to get into the compiler (though relatively
    easy to get into a branch predictor).

    --

    -- Erich

    Slashdot reader since 1997