Slashdot Mirror


Inside Intel's Next Generation Microarchitecture

Overly Critical Guy writes "Ars Technica has the technical scoop on Intel's next-generation Core chips. As other architectures move away from out-of-order execution, the from-scratch Core fully adopts it, optimizing as much code as possible in silicon, and relies on transistor size decreases--Moore's Law--for scalability."

22 of 116 comments

  1. Core Duo == Article Duo! by willith · · Score: 5, Funny

    Do we get two front page articles because the Core Duo has two cores? Goodie!!

    1. Re:Core Duo == Article Duo! by EtherAlchemist · · Score: 2, Funny


      totally, just sum it up in one post:



      New Intel architecture- Smaller, faster, better!



      New Intel architecture- Smaller, faster, better!

      --
      R(k)
  2. AMD Vs Intel: Round 9 by Sqwubbsy · · Score: 4, Interesting

    Ok, so I know I'm going to get a lot of AMD people agreeing with me and a lot of Intel people outright ripping me to shreds. But I'm going to speak my thoughts come hell or high water and you can choose to be a yes-man (or woman) with nothing to add to the conversation or just beat me with a stick.

    I believe that AMD had this technology [wikipedia.org] before Intel ever started in on it. Yes, I know it wasn't really commercially available on PCs but it was there. And I would also like to point out a nifty little agreement between IBM and AMD [pcworld.com] that certainly gives them aid in the development of chips. Let's face it, IBM's got research money coming out of their ears and I'm glad to see AMD benefit off it and vice versa. I think that these two points alone show that AMD has had more time to refine the multicore technology and deliver a superior product.

    As a disclaimer, I cannot say I've had the ability to try an Intel dual core but I'm just ever so happy with my AMD processor that I don't see why I should.

    There's a nice little chart in the article but I like AMD's explanation [amd.com] along with their pdf [amd.com] a bit better. As you can see, AMD is no longer too concerned with dual core but has moved on to targeting multi core.

    Do I want to see Intel evaporate? No way. I want to see these two companies go head to head and drive prices down. You may mistake me for an AMD fanboi but I simply was in agony in high school when Pentium 100s cost an arm and a leg. Then AMD slowly climbed the ranks to be a major competitor with Intel--and thank god for that! Now Intel actually has to price their chips competitively and I never want that to change. I will now support the underdog even if Intel drops below AMD just to ensure stiff competition. You can call me a young idealist about capitalism!

    I understand this article also tackles execution types and I must admit I'm not too up to speed on that. It's entirely possible that OOOE could beat out the execution scheme that AMD has going but I wouldn't know enough to comment on it. I remember that there used to be a lot of buzz about IA-64's OOOE [wikipedia.org] processing used on Itanium. But I'm not sure that was too popular among programmers.

    The article presents a compelling argument for OOOE. And I think that with a tri-core or higher processor, we could really start to see a big increase in sales using OOOE. Think about it, a lot of IA-64 code comes to a point where the instruction stalls as it waits for data to be computed (most cases, a branch). If there are enough cores to compute both branches from the conditional (and third core to evaluate the conditional) then where is the slowdown? This will only break down on a switch style statement or when several if-thens follow each other successively.
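
    A rough sketch of the "compute both branches" idea above, in Python rather than anything a CPU would actually do. Both arms of a conditional are started before the condition is known, and the losing result is thrown away. The names (`slow_condition`, `then_branch`, `else_branch`, `eager_if`, `live_paths`) are made up for illustration, and threads only stand in for cores here.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_condition(x):
    # Stand-in for a condition whose value arrives late (hypothetical).
    return x % 2 == 0

def then_branch(x):
    return x * 2

def else_branch(x):
    return x + 1

def eager_if(x):
    """Start the condition and both branches at once, keep only one result."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        cond = pool.submit(slow_condition, x)
        taken = pool.submit(then_branch, x)
        not_taken = pool.submit(else_branch, x)
        # The result of the branch that was not needed is simply discarded.
        return taken.result() if cond.result() else not_taken.result()

def live_paths(depth):
    """Paths to keep alive when `depth` unresolved conditionals are chained."""
    # Each unresolved conditional doubles the number of candidate paths.
    return 2 ** depth
```

    `live_paths` is the reason the scheme breaks down on successive if-thens: three chained conditionals already need eight paths' worth of work to keep one result.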

    In any case, it's going to be a while before I switch back to Intel. AMD has won me over for the time being.

  3. The real technology is... by LordRPI · · Score: 4, Funny

    Each core can be in two places at once!

    1. Re:The real technology is... by TheDreadSlashdotterD · · Score: 2, Funny

      Okay Paul, knock it off. We know you're the maudi.

      --
      I have nothing to say.
  4. Re:Let's be perfectly honest here... by lanky2004 · · Score: 2, Funny

    i want funny

  5. Since this is a dupe by TubeSteak · · Score: 3, Interesting

    Can someone summarize nicely and neatly, the practical difference(s) between out-of-order and in-order executions?

    Why is it important that Intel is embracing OOOE while everyone else is moving away?

    --
    [Fuck Beta]
    o0t!
    1. Re:Since this is a dupe by dlakelan · · Score: 5, Informative

      Out of order execution is where special silicon on the processor tries to figure out the best way to run your code by reordering the instructions to use more of the processor features at once.

      In order execution doesn't require all that special silicon and therefore frees up die space.

      So one approach is to try to make your one processor as efficient as possible at executing instructions.

      Another approach is to make your processor relatively simple, and get lots of them on the die so you can have many threads at once.

      I personally prefer the multiple cores, because I think there is plenty of room for parallelism in software. However, this guy is basically claiming that Intel is trying to get both: more cores and smarter cores. They're relying on Moore's law to shrink the size of their out-of-order execution logic so that they can get more smart cores on die.
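
      To make the in-order versus out-of-order difference concrete, here is a toy model, not how any real core works: one instruction may issue per cycle, each instruction is `(dest, sources, latency)`, and every register is written only once so renaming and hazards can be ignored. The program and the 3-cycle load latency are assumptions chosen for illustration.

```python
def run(program, out_of_order):
    """Return the cycle in which the last result of `program` is ready."""
    ready_at = {}              # register -> cycle its value becomes available
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        # An in-order core may only consider the oldest waiting instruction;
        # an out-of-order core may pick any waiting instruction.
        window = list(pending) if out_of_order else pending[:1]
        for instr in window:
            dest, sources, latency = instr
            # A source that has not been issued yet counts as "never ready".
            if all(ready_at.get(s, float("inf")) <= cycle for s in sources):
                ready_at[dest] = cycle + latency
                finish = max(finish, cycle + latency)
                pending.remove(instr)
                break          # at most one instruction issues per cycle
        cycle += 1
    return finish

# Two independent load-then-add pairs (hypothetical latencies).
PROGRAM = [
    ("r1", (), 3),         # load r1
    ("r2", ("r1",), 1),    # r2 = r1 + 1
    ("r3", (), 3),         # load r3
    ("r4", ("r3",), 1),    # r4 = r3 + 1
]

in_order_cycles = run(PROGRAM, out_of_order=False)
out_of_order_cycles = run(PROGRAM, out_of_order=True)
```

      In this model the in-order run stalls behind each load, while the out-of-order run starts the second load during the first stall. A compiler could get the same effect on an in-order core by emitting both loads first, which is exactly the trade-off being discussed.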

      --
      ((lambda (x) (x x)) (lambda (x) (x x))) http://www.endpointcomputing.com a scientific approach to custom computing.
    2. Re:Since this is a dupe by John_Booty · · Score: 5, Informative

      It's a philosophical difference. Should we optimize code at run-time (like an OOOE processor) or rely on the compiler to optimize code at compile time (the IOE approach)?

      The good thing about in-order execution is that it keeps the actual silicon simple and uses fewer transistors. This keeps costs down and engineers have more die space to "spend" on other features, such as more cores or more cache.

      The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

      (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there such as games and commercial multimedia software.)

      As a programmer, I like the idea of out-of-order execution and the concept of runtime optimization. Programmers are typically the limiting factor in any software development project. You want those guys (and girls) worrying about efficient, maintainable, and correct code... not CPU specifics.

      I'd love to hear some facts on the relative performance benefits of runtime/compiletime optimization. I know that some optimizations can only be achieved at runtime and some can only be achieved at compiletime because they require analysis too complex to tackle in realtime.

      --

      OtakuBooty.com: Smart, funny, sexy nerds.
    3. Re:Since this is a dupe by acidblood · · Score: 5, Informative
      Be careful when you speak of parallelism.

      Some software simply doesn't parallelize well. Processors like Cell and Niagara will take a very ugly beating from Core architecture based processors in that case.

      Then there's coarse-grained parallelism, tasks operating independently with modest requirements to communicate between themselves. For these workloads, cache sharing probably guarantees scalability. Going even further, there are embarrassingly parallel tasks which need almost no communication between different processes -- such is the case of many server workloads, where each incoming user spawns a new process, which is assigned to a different core each time, keeping all the cores full. This type of parallelism ensures that multicore (even when taken to the extreme, as in Sun's Niagara) will succeed in the server space. The desktop equivalent is multitasking, which can't justify the move to multicore alone.

      Now for fine-grained parallelism. Say the evaluation of an expression a = b + c + d + e. You could evaluate b + c and d + e in parallel, then add those together. The architecture best suited for this type of parallelism is the superscalar processor (with out-of-order execution to help extract extra parallelism). Multicore is powerless to exploit this sort of parallelism because of the overhead. Let's see:
      • There needs to be some sort of synchronization (a way for a core to signal the other that the computation is done);
      • The fastest way cores can communicate is through cache sharing -- L1 cache is fairly fast, say a couple of cycles to read and write, but I believe no shipping design implements shared L1 cache, only shared L2 cache;
      • An instruction has to go through the entire pipeline, from decode to write-back, before the result shows up in cache, whereas in a superscalar processor there exist bypass mechanisms which make available the result of a computation in the next cycle, regardless of pipeline length.

      Essentially, putting synchronization aside for the moment (which is really the most expensive part of this), it takes a few dozen cycles to compute a result in one core and forward it to another. Also, if this were done on a large scale, the communication channel between cores would become clogged with synchronization data. Hence it is completely impractical to exploit any sort of fine-grained parallelism in a multicore setting. Contrast this with superscalar processors, which have execution units and data buses especially tailored to exploit this sort of fine-grained parallelism.
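
      A back-of-the-envelope version of the a = b + c + d + e comparison above. The numbers are assumptions, not measurements: adds take 1 cycle with the result bypassed straight to the next instruction, and `FORWARD_LATENCY` of 24 cycles is just one reading of "a few dozen". Synchronization cost is left out entirely, which flatters the two-core case.

```python
ADD_LATENCY = 1        # cycles per add, result bypassed to the next instruction
FORWARD_LATENCY = 24   # assumed core-to-core forwarding cost in cycles

def serial_cycles():
    # ((b + c) + d) + e: three adds, each waiting on the previous one.
    return 3 * ADD_LATENCY

def superscalar_cycles():
    # b + c and d + e issue together, then one more add combines them.
    return 2 * ADD_LATENCY

def two_core_cycles(forward_latency=FORWARD_LATENCY):
    # Each core does one add, one partial sum crosses to the other core,
    # and that core performs the final add.
    return ADD_LATENCY + forward_latency + ADD_LATENCY
```

      Under these assumptions the two-core split only breaks even with the plain serial chain if forwarding costs a single cycle, which is far below anything a shared L2 cache can offer.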

      Unfortunately, this sort of fine-grained parallelism is the easiest to exploit in software, and mature compiler technology exists to take advantage of it. To fully exploit the power of multicore processors, the cooperation of programmers will be required, and for the most part they don't seem interested (can you picture a VB codemonkey writing correct multithreaded code?). I hope this changes as new generations of programmers are brought up on multicore processors and multithreaded programming environments, but the transition is going to be turbulent.

      Straying a bit off-topic... Personally, I don't think multicore is the way to go. It creates an artificial separation of resources: i.e. I can have 2 arithmetic units per core, so 4 arithmetic units on a die, but if the thread running on core 1 could issue 4 parallel arithmetic instructions while the thread running on core 2 could issue none, both of core 1's arithmetic units would be busy on that cycle, leaving 2 instructions for the next cycle, while core 2's units would sit idle, despite the availability of instructions from core 1 just a few millimeters away. The same reasoning is valid for caches and we see most multicore designs moving to shared caches, because it's the most efficient solution, even if it takes more work. It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functional

      --

      Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

    4. Re:Since this is a dupe by JollyFinn · · Score: 2, Interesting
      It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functionality via simultaneous multi-threading (SMT), a.k.a. hyperthreading. The secret is designing a processor for SMT from the start, not bolting it on a processor designed for single-threading as happened with the P4. I strongly believe that such a design would outperform any strict-separation multicore design with a similar transistor budget.

      Too bad it doesn't work that way. Lots of structures in a CPU are n^2 in complexity, where n is the width of the processor. Also, when moving information across the die takes more than 10 cycles, you need smaller structures, or instruction latencies will increase.

      Here's one example: the bypass path needs to connect the load port and every integer unit to every integer unit, so there are (n*n) connections between units, and the number of stages needed to select an input eventually hampers the clock speed. There is a practical limit on core size: if we go bigger, the clock speed penalties and latencies will cost more performance than the added core resources gain. SMT also hurts the cache hit rate, and that penalizes per-thread performance as well. When you add more execution units, the maximum distance between them grows, so the time needed per cycle increases because of the delay in moving data between execution units. Execution units are *NOT* the area where widening hurts most, but they are the easiest to explain. So you either accept 2-cycle latencies, go for much lower clock speeds, or increase the voltage, but power consumption is proportional to v^2, so no matter what, the efficiency goes down.
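
      The bypass example can be put into numbers with a tiny sketch. It assumes, for illustration only, that every producer (each integer unit plus each load port) needs a forwarding path to every integer unit; the function name and the single load port default are hypothetical.

```python
def bypass_paths(int_units, load_ports=1):
    """Count forwarding paths when every producer feeds every integer unit."""
    producers = int_units + load_ports
    return producers * int_units

# Path counts for a few hypothetical core widths.
paths_by_width = {n: bypass_paths(n) for n in (2, 4, 8)}
```

      Doubling the width from 4 to 8 integer units takes the count from 20 to 72 paths, so the wiring grows much faster than the number of units it serves.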

      I believe SMT isn't completely dead; it can make a comeback in Intel machines at some point, with SOME additional per-core resources. But from now on there will be multiple cores.

      To make it clear, the transistor budget right now is so large that putting it all into a single core isn't efficient, due to the need to move data inside the core and the n^2 complexities.

      --
      Emacs is good operating system, but it has one flaw: Its text editor could be better.
    5. Re:Since this is a dupe by naasking · · Score: 2

      That's where a project like LLVM comes in. Platform-neutral binaries via LLVM bytecode, and full processor-specific link-time native compilation+optimization when a binary is installed. Alternatively, you can JIT the bytecode at runtime. Developers just distribute LLVM bytecode binaries, and the installers/users do the rest. I think the LLVM approach is the future.

  6. The real problem with dupes by lordsid · · Score: 5, Insightful

    The real problem with dupes isn't the fact that there are the same two articles on the front page, nor the whines that come from it, or even the witty banter chiding the mods.

    If I see an article I've already read at the top of the page I QUIT READING.

    This has happened to me several times over the number of years I've read this site. Then I end up coming back and realizing it was a dupe and that I missed several interesting articles in between.

    SO FOR THE LOVE OF GOD READ YOUR OWN WEBSITE.

    --
    IMAGE VERIFICATION IS EVIL!
  7. Giving up the 'smart compiler' concept? by Gothmolly · · Score: 3, Interesting

    Wasn't the Achilles heel of the P4 and Itanium crappy code, that caused a pipeline stall on their very long pipes? Every time someone pointed out that AMD didn't have this problem, an Intel fanboy would reply that "with better compilers" you could avoid conditions where you'd have to flush the pipeline, thus maintaining execution speed.
    Well, those "better compilers" don't seem to be falling from the sky, and AMD is beating Intel in work/MHz because of it.
    Is Intel finally deciding "screw it, we'll make the CPU so smart, that even the crappiest compiled code will run smoothly" ?

    --
    I want to delete my account but Slashdot doesn't allow it.
  8. Moore's law isn't a law at all. by Nazo-San · · Score: 5, Interesting

    I just thought it should be stated for the record. Moore's law isn't a definite fact that cannot be disproven. It has been working so well up to now and will for a while yet that it is rather easy to seriously call it a law, but, we shouldn't forget that, in the end, there are physical limitations. I don't know how much longer we have until we reach them though. It could be five years, it could be twenty. It is there though and eventually we will hit the point where transistors will get no smaller no matter what kind of technology you throw at it. At that point, a new method must be put into place to continue growth. This is why I personally like reading Slashdot so much for articles on things like quantum computing and the like. Those may be pipe dreams perhaps, but, the point is, they are alternate methods that may have hope someday of becoming truly powerful and useful. Perhaps the eventual successor to the current system will arise soon? Let's keep an eye out for it with open minds though.

    Anyway, I do understand a bit about how it all works. OOOE has amazing potential, but, in the end the fact remains that you can only optimize things so much. The idea there is actually to kind of break up instructions in such a way that you can actually kind of multi-thread a task not originally designed for multi-tasking. A neat idea I must say, with definite potential. However, honestly, in the end the fact remains that you will run into a lot of instructions that it can't figure out how to break up or which actually can't be broken up to begin with. If they continue to run with this technology, they will improve upon both situations, but, in the end, the nature of machine instructions leads me to believe that this idea may not take them far to be brutally honest.

    Let's not forget that one of the biggest competitors in the processors that focus on SIMD is kind of fading now. Apple is going to x86 architecture with all their might (and I must say I'm impressed at how smoothly they are switching -- it's actually exciting most Apple fans rather than upsetting them) and I think I read they no longer will even be producing anything with PowerPC style chips, which I suppose isn't good for the people who make them (maybe they wanted to move on to something else anyway?) At this point it's looking like it's more and more just the mobile devices who benefit from this style of chip, which is primarily just due to the fact that between their lack of need for higher speeds and overall design to use what they have efficiently, they use very little power and do what they do well in a segment like that.

    Multi-threading, however, is a viable solution today and in the future as well. It just makes sense really. You start to run into the limitations as to how fast the processor is going to run, how many transistors you can squeeze on there at once, power and heat limitations, etc, however, if you stop at those limits and simply add more processors handling things, you don't really have to design the code all THAT well to take advantage of it and keep the growth continuing in its own way. I can definitely see multicore having a promising future with a lot of potential for growth because even when you hit size limitations for a single core you can still squeeze more in there. Plus, I wonder if multicore couldn't work in a multi-processor setup? If it can't today, won't it in the future? Who knows, there are limits on how far you can go with multi-core, but, those limits are further away than single core by far and I really feel like they are more promising than relying on smart execution on a single core running around the same speed. In the end, a well designed program will be splitting up instructions on a SMP/multicore system much like the OOOE will try to do. While the OOOE may be somewhat better at poorly designed programs (ignoring for a moment the advantages that multithreading provides to a multitasking OS since even on a minimal setup a bunch of other stuff is running in the background) overa

    1. Re:Moore's law isn't a law at all. by Kupek · · Score: 2, Informative

      OOOE has amazing potential
      Which has been realized for about the past 20 years. Exploiting Instruction Level Parallelism (which requires an out-of-order-execution processor) has gotten us to where we are today. We're reaching the limits of what ILP can buy us, so the solution is to put more cores on a chip.

      It may be possible to integrate OOOE into a multicore.
      It is possible, and every single Intel multicore chip has done it. Same with IBM's Power5s. For general-purpose multicore processors, that is the norm.

  9. Re:Israel by xenn · · Score: 3, Insightful
    Parent post smacks of anti-semitism.

    no it doesn't. only mentions country - not culture. are you suggesting that only semites live in Israel? or maybe only semites could obtain PhDs in Israel?

    I think your reference to semitism is plain OOO .

    actually, your "joke" about a checkpoint firewall implies racism.

  10. Re:Israel by pchan- · · Score: 2, Informative

    Intel Israel has been a strong development center for Intel for quite some time now. Traditionally, new chips have been designed in the U.S., and then the designs were sent to Israel for making them more power-efficient or improving performance. This situation got turned on its head. The American design team came up with the disaster known as the Netburst architecture (the highest clock P4 chips). Meanwhile, the Israel team was optimizing the Pentium-M (P3 and up) architecture and got its performance close to that of the Netburst chips at a lower clock rate and lower power consumption. Now Intel's top of the line chip was getting trounced by AMD's offering in both performance and power consumption, and further, AMD was announcing dual core chips years before Intel had planned to release any. In a way, Intel got lucky. They couldn't extend the Netburst architecture much more, the massively long pipelines on it made it terrible at executing general purpose code, and even hyperthreading didn't help it. It was generating massive amounts of heat at the frequency it was running and needed a huge cache. It was not ready for dual-cores. But the Pentium-M was. AMD's move to dual core saved Intel from competing in the megahertz race, just when the payoff from cranking the clock was starting to run out. They could now move from advertising clock rate to advertising dual cores. The Israel design team delivered the Core-Duo chip, and fast. Notice how these appeared in laptops first? That's what the Israel team was experienced with.

    Expect the Israel team to continue developing this line of processors, with the American developers going back to the drawing boards for the next generation product.

  11. dupe tagging solution! by hobotron · · Score: 2, Insightful



    Alright, mod me offtopic, but what if /. just took the beta tags and, if "dupe" showed up after a certain number of tags (or however they calculate it), had the story minimize to the non-popular story size that's in between main stories? I don't want dupes deleted, but this would be a simple solution that would get them out of the limelight.

    --
    There is truth in humor.
  12. Re:GHz by jawtheshark · · Score: 2, Interesting
    That was one thing that really annoyed me about the P4; a 2 GHz P4 was NOT more than twice as fast as a 850 MHz P3. It meant one couldn't compare CPUs with each other any more.

    You never could do that in the first place. Within a CPU family, it used to be possible. (With Intel's naming scheme today, I can't do it anymore either!) Compare a P-III 500MHz to a P-III 1GHz and you knew that the latter was approximately twice as fast. A 2GHz AMD Athlon XP was approximately twice as fast as a 1GHz AMD Athlon XP. I say approximately because cache sizes could influence these results. You never could compare a P-IV to a P-III or a P-IV to an AMD Athlon, except by falling back on benchmarks, and you *know* that all these benchmarks are pretty much artificial and can skew results in favour of a certain architecture.
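
    The rule of thumb above fits in a few lines. This is only a sketch of the clock-ratio estimate: the `(family, MHz)` tuples and the function name are made up for illustration, and cache size differences are ignored, so the result is approximate at best.

```python
def relative_speed(cpu_a, cpu_b):
    """Rough speed ratio a/b from clock alone; only valid inside one family."""
    family_a, mhz_a = cpu_a
    family_b, mhz_b = cpu_b
    if family_a != family_b:
        # Across families the clock says nothing; fall back on benchmarks.
        raise ValueError("clock rates are not comparable across families")
    return mhz_a / mhz_b

# Same family: the clock ratio is a usable first guess.
p3_ratio = relative_speed(("P-III", 1000), ("P-III", 500))
```

    Asking the same function to compare a 2 GHz P-IV with an 850 MHz P-III raises an error instead of returning a misleading number.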

    Really a long time ago, it was even dubious within the processor family: is a 486DX2/66 slower than a 486DX4/100? After all, the bus speed of the DX2 was 33MHz and the DX4 had a 25MHz bus. Back in the day such things had a major impact. (Even today it can have a big impact...)

    You can also recall the Pentium Pro (the CPU on which both the P-II and the P-III were based). It was a horrible performer for 16-bit code, but on 32-bit code it was pretty much king. Also don't forget the extremely fast cache that it had. A PPro200 with enough RAM can handle Windows 2000 without a hitch. (I know, I had one with 256Meg RAM) The P-II came out, with a less performant cache and it couldn't beat the PPro clock-for-clock. That's why the lowest P-II came at 233MHz. (Yeah, it also included the MMX instruction set, I know, I know...)

    In summary: within processor families you can compare, outside processor families you are pretty much SOL.

    Besides, I know I'm going to sound like someone saying "we have enough processor power", but my primary laptop is a P-III 500MHz mobile with 512Meg PC100 RAM. You know what? That baby runs pretty much everything I throw at it: Windows XP Pro SP2, OpenOffice 2.0.2, Firefox 1.5.0.1, Thunderbird 1.5, AVG Antivirus, PuTTY, Filezilla, Acrobat Reader 7, iTunes6, Quicktime, Media Player classic, Borland Delphi Personal, Eclipse 3.0, Tomcat and The GIMP (but I have to be patient when handling big images). Perhaps not all at the same time (I never tried), but I often run at least a selection of the above. Sure, sometimes I have to wait a few seconds for a program to start, but it's not as if I'm in that much of a hurry.
    If I need more oomph, I just switch to my own AMD Athlon MP 2400+ SMP machine (4Gig RAM) or to my wife's P-IV 2.6GHz Hyperthreading (2Gig RAM). Frankly, that doesn't happen often...

    --
    Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
  13. Re:Israel by jawtheshark · · Score: 2, Insightful
    I read that the main problem in the US is that science/math is considered unsexy. Most students want to go into business or law, because that's where the money is made. I guess it is a result of being an extremely capitalist society.

    One odd thing is that the US imports many scientists with attractive grants, resulting in an exodus of European scientists (probably from other countries too, I just know Europe). Of course, since September 11, getting a visa has become hard and thus fewer scientists are imported, which could result in a decline in science contributions from the US.
    That said: being a scientist in Europe is hard too because the lack of money. That's probably a result of being a socialist society ;-)

    I guess there has to be a middle way between the two systems.

    --
    Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
  14. Only on slashdot would that have been insightful... by JollyFinn · · Score: 2

    The poster obviously hasn't designed any CPUs, and doesn't know the physics related to semiconductor design.
    He's a programmer who doesn't need to think about those things.
    n^2 or n^3 algorithms (in terms of power and area) are used in MOST parts of the core. So when the guy recommends that in the next generation, instead of having 4 cores, we have a single core, he is suggesting one core that is twice as wide as one of those 4 cores.
    A large fraction of code is pointer chasing, and a large fraction of code has ILP equal to or lower than 1. There are just too many data dependencies.

    Cache latency works the same way: it depends on the cache's size, not on whether it is L1 or L2 or whatever. Physics says that drive strength and distance are what matter. The same happens inside a core: when you have quadrupled the core size, you need to drive all the instructions and data around; it's like Prescott. You spend huge resources moving instructions around in pipeline stages instead of doing computation, and you have to, since the distance traveled between different parts of the core is so much bigger now. The register renamers take a lot more die area and have longer latency, and so do the out-of-order queues. Then there are latencies between ALU instructions, and the latency INSIDE the logic that selects which instruction goes next grows, since the number of locations each instruction in the queue can go to has gone up, and the number of instructions in the queue has gone up too. The bad part is that this logic isn't linear; it's an n^2 algorithm in terms of width, so it's width^2*queue depth. So with his recommended doubling he gets 8 times the area and power consumption there.
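
    The "8 times" figure follows directly from the width^2*queue depth estimate. A quick check, where the 4-wide core with a 32-entry queue is a hypothetical baseline and the cost is in arbitrary units:

```python
def scheduler_cost(width, queue_depth):
    """Relative area/power of the select logic: width^2 * queue depth."""
    return width ** 2 * queue_depth

baseline = scheduler_cost(4, 32)    # hypothetical 4-wide core, 32-entry queue
doubled = scheduler_cost(8, 64)     # width and queue depth both doubled
growth = doubled / baseline
```

    Doubling the width alone already costs 4x under this estimate; the deeper queue needed to keep the wider core fed supplies the remaining factor of 2.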

    Of course, trying to educate masses of programmers here is a futile attempt; there are plenty of people who know nothing about the costs of doing something proposing solutions that the people who have designed CPUs for 20+ years have probably already dismissed as infeasible.

    --
    Emacs is good operating system, but it has one flaw: Its text editor could be better.