Slashdot Mirror


Inside Intel's Next Generation Microarchitecture

Overly Critical Guy writes "Arstechnica has the technical scoop on Intel's next-generation Core chips. As other architectures move away from out-of-order execution, the from-scratch Core fully adopts it, optimizing as much code as possible in silicon, and relies on transistor size decreases--Moore's Law--for scalability."

116 comments

  1. Core Duo == Article Duo! by willith · · Score: 5, Funny

    Do we get two front page articles because the Core Duo has two cores? Goodie!!

    1. Re:Core Duo == Article Duo! by wangf00 · · Score: 1

      So can we expect multiple dupes with multicore?

    2. Re:Core Duo == Article Duo! by EtherAlchemist · · Score: 2, Funny


      totally, just sum it up in one post:



      New Intel architecture- Smaller, faster, better!



      New Intel architecture- Smaller, faster, better!

      --
      R(k)
  2. Is this a PURPOSEFUL dupe? by Silverlancer · · Score: 0, Troll

    It even links to the same article...

  3. Re:Doooooop! by Anonymous Coward · · Score: 0

    Well, at least it's not verbatim...

    www.afterthought.cjb.cc

  4. AMD Vs Intel: Round 9 by Sqwubbsy · · Score: 4, Interesting

    Ok, so I know I'm going to get a lot of AMD people agreeing with me and a lot of Intel people outright ripping me to shreds. But I'm going to speak my thoughts come hell or high water and you can choose to be a yes-man (or woman) with nothing to add to the conversation or just beat me with a stick.

    I believe that AMD had this technology [wikipedia.org] before Intel ever started in on it. Yes, I know it wasn't really commercially available on PCs but it was there. And I would also like to point out a nifty little agreement between IBM and AMD [pcworld.com] that certainly gives them aid in the development of chips. Let's face it, IBM's got research money coming out of their ears and I'm glad to see AMD benefit off it and vice versa. I think that these two points alone show that AMD has had more time to refine the multicore technology and deliver a superior product.

    As a disclaimer, I cannot say I've had the ability to try an Intel dual core but I'm just ever so happy with my AMD processor that I don't see why I should.

    There's a nice little chart in the article but I like AMD's explanation [amd.com] along with their pdf [amd.com] a bit better. As you can see, AMD is no longer too concerned with dual core but has moved on to targeting multi core.

    Do I want to see Intel evaporate? No way. I want to see these two companies go head to head and drive prices down. You may mistake me for an AMD fanboi but I simply was in agony in high school when Pentium 100s cost an arm and a leg. Then AMD slowly climbed the ranks to be a major competitor with Intel--and thank god for that! Now Intel actually has to price their chips competitively and I never want that to change. I will now support the underdog even if Intel drops below AMD just to ensure stiff competition. You can call me a young idealist about capitalism!

    I understand this article also tackles execution types and I must admit I'm not too up to speed on that. It's entirely possible that OOOE could beat out the execution scheme that AMD has going but I wouldn't know enough to comment on it. I remember that there used to be a lot of buzz about IA-64's OOOE [wikipedia.org] processing used on Itanium. But I'm not sure that was too popular among programmers.

    The article presents a compelling argument for OOOE. And I think that with a tri-core or higher processor, we could really start to see a big increase in sales using OOOE. Think about it, a lot of IA-64 code comes to a point where the instruction stalls as it waits for data to be computed (in most cases, a branch). If there are enough cores to compute both branches from the conditional (and a third core to evaluate the conditional) then where is the slowdown? This will only break down on a switch style statement or when several if-thens follow each other successively.

    In any case, it's going to be a while before I switch back to Intel. AMD has won me over for the time being.

    1. Re:AMD Vs Intel: Round 9 by kestasjk · · Score: 1

      There's no way you could do branch prediction and processing on multiple cores, the latency would be too high for branches of a realistic size.

      --
      // MD_Update(&m,buf,j);
    2. Re:AMD Vs Intel: Round 9 by somersault · · Score: 1

      'no way' sounds a bit far fetched, but it does seem that Intel's idea of using massive pipelines to aid in certain calculations bombed a bit. Nothing wrong with trying new things though. Branch prediction does seem a waste on anything but a multithread/multicore processor, unless you're running calculations in spare processor cycles - but for most apps where performance matters, how likely is that to happen? You may as well try to predict every single possible thing the user is going to do next while they are idling on the desktop (and in fact I now remember that Vista is going to cache certain apps depending on what time of day you usually use them etc, ho ho ho..)

      --
      which is totally what she said
    3. Re:AMD Vs Intel: Round 9 by Dolda2000 · · Score: 1

      Really? Is it not weird, then, that Sun's octuple-core T1 processor outclassed the competition by at least 2:1, and normally closer to 3:1, in the last SPECweb round?

    4. Re:AMD Vs Intel: Round 9 by networkBoy · · Score: 1

      The massive pipelines work great, just not on things with lots of branches. This was a known issue to Intel, and was considered to be a worthwhile risk, as the expectation was that the CPU would scale to the high GHz. That the processor tops out at ~4GHz means that your gain to loss ratio of what the pipeline depth gets you has changed (or more accurately failed to improve as anticipated).

      All that said, there are several applications where the Intel Architecture whips AMD's, the top two being:
      MS Office and similar applications
      Compression Algos

      Both these applications have very few branches, and thus do not pay the price for instruction misses.
      Games OTOH, are essentially nothing but a decision tree, one that has never been trimmed, and thus punishes the IA like nothing else.
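
      For anyone who wants to play with the trade-off, here is a rough back-of-the-envelope sketch. It uses the usual textbook model where every mispredicted branch costs a fixed flush penalty that grows with pipeline depth. All the numbers (flush penalties, clock speeds, branch fraction, misprediction rates) and the names SHALLOW and DEEP are made-up illustrations, not measurements of any real Intel or AMD part.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction flushes are counted."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def ns_per_instruction(cpi, clock_ghz):
    """Average time per instruction in nanoseconds at the given clock."""
    return cpi / clock_ghz


# Hypothetical designs: (flush penalty in cycles, clock in GHz).
# The deep pipeline clocks higher but pays more for every misprediction.
SHALLOW = (12, 3.0)
DEEP = (30, 4.0)


def compare(mispredict_rate, branch_fraction=0.2, base_cpi=1.0):
    """Return ns per instruction for the shallow and the deep design."""
    results = {}
    for name, (penalty, clock) in (("shallow", SHALLOW), ("deep", DEEP)):
        cpi = effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty)
        results[name] = ns_per_instruction(cpi, clock)
    return results


if __name__ == "__main__":
    # A low misprediction rate stands in for predictable code,
    # a high one for branchy code such as game logic.
    for rate in (0.05, 0.15):
        print(rate, compare(rate))
```

      With these assumed inputs the deep design comes out ahead when few branches are mispredicted and behind when many are, which is the gain to loss ratio argument in miniature. Raise the deep design's clock in the sketch and it wins again, which was the original bet.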

      Cheers,
      -nB

      --
      whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
    5. Re:AMD Vs Intel: Round 9 by zsazsa · · Score: 1

      I'm neither an Intel nor an AMD fan. I generally dislike Intel due to the retarded Netburst architecture and many of their business practices, but from a purely technical standpoint I seriously think they're onto something with their next generation Core. I do think you're talking out of your ass, or just pasted an old comment from a different article. You may feel free to tell me I'm talking out of my own ass as well.

      I believe that AMD had this technology before Intel ever started in on it.

      What technology are you talking about exactly? That's just a link to a Wikipedia article.

      As a disclaimer, I cannot say I've had the ability to try an Intel dual core but I'm just ever so happy with my AMD processor that I don't see why I should.

      This is a discussion of Intel's new architecture, not the Pentium 4-Ds or current Core Duos. As an aside, I think Intel's branding of both the enhanced Pentium-M in laptops now, and the much different next generation architecture as "Core" has the potential to be very confusing.

      There's a nice little chart in the article but I like AMD's explanation along with their pdf a bit better. As you can see, AMD is no longer too concerned with dual core but has moved on to targeting multi core.

      See the "Core 1" and "Core 2" in the AMD chart? The "nice little chart" in the article was detailing how the next gen architecture's CPU core alone operates, not how multiple cores interface with each other, memory, and the system bus. As the article said, Intel is relying on multiple, not just dual, cores to let the architecture scale in the future. Check out the road map.

      It's entirely possible that OOOE could beat out the execution scheme that AMD has going but I wouldn't know enough to comment on it.

      K8 is an out of order execution architecture as well. So was K7. So was K6. So was P6, as the article said. This is nothing new, and in fact the article says Intel is being conservative by refining their OOOE unit instead of throwing it out like on new competing architectures like the UltraSPARC T1.

    6. Re:AMD Vs Intel: Round 9 by be-fan · · Score: 1

      The T1 doesn't do branch processing on multiple cores. It's just a giant array of fairly standard (and slow), in-order UltraSPARC II processors.

      --
      A deep unwavering belief is a sure sign you're missing something...
  5. The real technology is... by LordRPI · · Score: 4, Funny

    Each core can be in two places at once!

    1. Re:The real technology is... by TheDreadSlashdotterD · · Score: 2, Funny

      Okay Paul, knock it off. We know you're the maudi.

      --
      I have nothing to say.
    2. Re:The real technology is... by ploss · · Score: 1

      I think it's just having an Out-of-Order Experience...

      --
      What are the odds that some idiot will name his mutex ether-rot-mutex!
  6. Let's be perfectly honest here... by Timex · · Score: 1, Insightful

    It's like a landmark-- "Surf until you get to a geekish news site with anti-Microsoft bent and a couple dupes on the front page. When you get there, you're on Slashdot."

    --
    When politicians are involved, everyone loses.
    1. Re:Let's be perfectly honest here... by lanky2004 · · Score: 2, Funny

      i want funny

  7. I have a great idea by Anonymous Coward · · Score: 1, Funny
    Guys I have a great idea, let's all point out the fact that this article is a dupe!

    Seriously this is gonna be so cool, slashdot will never be the same again!

  8. No no, not good enough by Lisandro · · Score: 0, Troll

    Dupe articles with identical links? Meh. Bring it on. When are we getting dupes with identical summaries?

  9. The Secret of Slashdot by Anonymous Coward · · Score: 0

    We can now reveal, for the first time anywhere without a cover charge, the central secret of slashdot: Everything old is new again.

    But not after only 13 hours, hosehead. Gotta let the little sister sites have their turn at it, then you can reference them tomorrow. It's all one big, incestuous, irrelevant family.

  10. Israel by Anonymous Coward · · Score: 1, Interesting

    So apparently Intel had to go to Israel to find computer engineers to design their flagship architecture for the next 5+ years. With a population of only 7 million how is it that so many brilliant chip designers are in Israel?

    1. Re:Israel by Anonymous Coward · · Score: 0, Insightful

      Ummm. I think Israel probably has more PhDs per capita than any other country in the world. I also think that it is very possible that with a population of 7 million highly educated people there could be plenty of brilliant chip designers in Israel.

      Oh and ever hear of a Checkpoint Firewall?

      Parent post smacks of anti-semitism.

    2. Re:Israel by kfg · · Score: 1

      With a population of only 7 million how is it that so many brilliant chip designers are in Israel?

      So many of them came from Levittown, LI.

      KFG

    3. Re:Israel by xenn · · Score: 3, Insightful
      Parent post smacks of anti-semitism.

      no it doesn't. only mentions country - not culture. are you suggesting that only semites live in Israel? or maybe only semites could obtain PhDs in Israel?

      I think your reference to semitism is plain OOO .

      actually, your "joke" about a checkpoint firewall implies racism.

    4. Re:Israel by DSP_Geek · · Score: 1

      [flamebait]
      During the Middle Ages, while gentiles pushed their smart sons into the priesthood and celibacy, the smart Jews became rabbis and had lotsa kids.
      [/flamebait]

      The Izzies have had to become really smart because they're surrounded by people who'd like nothing better than to push them into the sea. As a matter of fact, when they got military gear from the States, the manufacturers often came back and asked them exactly *what* they did with the electronics; it might have had to do with the 88-2 kill ratio over the Bekaa Valley in the early 80s.

      http://www.airpower.maxwell.af.mil/airchronicles/apj/apj89/hurley.html

      It's not only Things That Go Fast And Explode, either: Morel of Israel also does a bang-up job improving speaker designs sourced from the Danish firm Dynaudio, to the point where some of their tweeters are considered among the best in the world.

      The same thing holds for chip designers, and don't forget the Russian Jewish exodus into Israel - just because the Soviet fab lines were a couple of Moore generations behind didn't mean their chip guys were slouches. The Israelis took over the Pentium III and designed the Pentium M, whence came Conroe. Motorola's (now Freescale's) recent DSPs are also Israeli. They know how to Make Stuff Work Better.

    5. Re:Israel by pchan- · · Score: 2, Informative

      Intel Israel has been a strong development center for Intel for quite some time now. Traditionally, new chips have been designed in the U.S., and then the designs were sent to Israel to make them more power-efficient or improve performance. This situation got turned on its head. The American design team came up with the disaster known as the Netburst architecture (the highest clock P4 chips). Meanwhile, the Israel team was optimizing the Pentium-M (P3 and up) architecture and got its performance close to that of the Netburst chips at a lower clock rate and lower power consumption. Now Intel's top of the line chip was getting trounced by AMD's offering in both performance and power consumption, and further, AMD was announcing dual core chips years before Intel had planned to release any. In a way, Intel got lucky. They couldn't extend the Netburst architecture much more, the massively long pipelines on it made it terrible at executing general purpose code, and even hyperthreading didn't help it. It was generating massive amounts of heat at the frequency it was running and needed a huge cache. It was not ready for dual-cores. But the Pentium-M was. AMD's move to dual core saved Intel from competing in the megahertz race, just when the payoff from cranking the clock was starting to run out. They could now move from advertising clock rate to advertising dual cores. The Israel design team delivered the Core-Duo chip, and fast. Notice how these appeared in laptops first? That's what the Israel team was experienced with.

      Expect the Israel team to continue developing this line of processors, with the American developers going back to the drawing boards for the next generation product.

    6. Re:Israel by Anonymous Coward · · Score: 0

      Why would there need to be a lot of talented chip designers in Israel? There only needs to be enough for Intel to hire to design a chip. They don't, technically speaking, even need to be from Israel. Since Intel has been doing business in Israel for decades, why would it seem even slightly unusual for them to have a talented design team or two in Israel?

      There's also this undue insinuation that people have that NetBurst is a big failure, or that Intel is essentially incapable of designing competitive processors. Intel has some of the brightest designers in the world working for it, even if the physical constraints holding back the NetBurst design were underestimated many years ago when Intel put the ball in motion to switch microarchitectures.

    7. Re:Israel by drachenstern · · Score: 1

      They also, Have A Lot Of Math Education Per Capita. We, OTOH here in the States, have enough to allow people to push the <10.00> buttons and hope to get the right change. In my high school (in state A) we were taught all the math basics (you know, trig, pre-cal, nothing too hard), including how to use a scientific calculator (such as the simple TI line), and where I am now in Uni (almost ten years later, in state B), people in my math classes cannot even find the pi button, or sin/cos/tan (yeah, i know, three keys in the middle, so hard to see). And these are the kids who just graduated high school, that stuff should be fresh in their minds.

      So much to ask for our country's education system, I guess. However, the good news is that there is a federal grant now available (for fall 06 term) for students in 3 & 4 year programs bach for Eng and Math and Sci as well as (some) For Lang. The country knows what part of the economy is off, they're just about ten years behind (as always)

      Not a US gov't basher, just sharing

      --
      2^3 * 31 * 647
    8. Re:Israel by jawtheshark · · Score: 2, Insightful
      I read that the main problem in the US is that science/math is considered unsexy. Most students want to go into business or law, because that's where the money is made. I guess it is a result of being an extremely capitalist society.

      One odd thing is that the US imports many scientists with attractive grants, resulting in an exodus of European scientists (probably from other countries too, I just know Europe). Of course, since September 11, getting a visa has become hard and thus fewer scientists are imported, which could result in a downfall of the science contributions from the US.
      That said: being a scientist in Europe is hard too because of the lack of money. That's probably a result of being a socialist society ;-)

      I guess there has to be a middle way between the two systems.

      --
      Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
    9. Re:Israel by Anonymous Coward · · Score: 0

      Andrew Grove of Intel: Time magazine's 1997 Man of the Year was born Andras Grof in Budapest, Hungary, in 1936, the son of a Jewish dairyman. When the German tanks rolled into Hungary, he and his family had to go into hiding and remain there for the duration of the war. The end of the war brought little relief to most of the country, and things took a turn for the worse in 1956 when the Soviet Army invaded and established Communist rule.
      source
      http://www.leadershipnow.com/leadershop/8414-4excerpt.html

    10. Re:Israel by Anonymous Coward · · Score: 0

      rest of the bio missed before....

      Following the Soviet invasion, Grove escaped from Hungary and departed for America on a rusty ship designed to transport American troops during the war. His arrival in America was anticlimactic, but he did receive a much-needed hearing aid from the International Rescue Committee, and hearing lost in a childhood illness was restored. Barely able to speak English, Grove obtained a job as a waiter and enrolled in classes at the City College of New York.

      Despite his initial difficulties, Grove went on to graduate at the top of his chemical engineering class in 1960. He continued his education at the University of California, Berkeley, and received a Ph.D. in chemical engineering in 1963. Following graduation, he took a job at Fairchild Semiconductor in research and development while continuing to teach at UC Berkeley.

      In 1968, Bob Noyce and Gordon Moore started a company called Intel, and Grove was one of their first two employees. Grove went on to become the firm's third CEO, following Intel's two founders in the role. In many ways, Grove's management shepherded Intel through what he later called the "Valley of the Shadow of Death," referring to the world's transition from mainframe computers to personal computers. Despite the turbulent environment, Intel ultimately emerged as one of the dominant firms of the Digital Age.

    11. Re:Israel by Anonymous Coward · · Score: 0

      are you suggesting that only semites live in Israel?

      I'm sorry, it is spelled "occupied Palestine", not "Isreal"

    12. Re:Israel by be-fan · · Score: 1

      Because Jews are really smart. No, seriously. Why do you think they're so rich? Studies have actually shown that there is a sub-population of Jews that gets Nobel Prizes vastly out of proportion with their numbers.

      --
      A deep unwavering belief is a sure sign you're missing something...
    13. Re:Israel by Anonymous Coward · · Score: 0

      Having worked for an Israeli software company and in Israel, my experience is that you have the exact same ratio of brilliant engineers to useless fucks there as everywhere else.

      My opinion of what happened here is that while the American design team were being worn down on a marketing-driven design (Netburst) and another one driven by NIH - Not Invented Here - (IA64), the Israelis were handed the mobile stuff that nobody cared about. That means they had little management oversight, did not have to deal with marketing views of what makes a "good" processor and, last but not least, had a good excuse to toss out stupid ideas without having the responsible parties (now promoted) throwing a tantrum (ah, yes it was a good idea, but see, this is for *mobiles*...).

      So basically the engineers could actually engineer, and there you go.

    14. Re:Israel by acidblood · · Score: 1
      AMD was announcing dual core chips years before Intel had planned to release any.

      Is this an attempt to prove the saying that if a lie is often repeated, it becomes true?

      Intel First to Ship Dual core

      I don't care how you spin it, your statement was a lie bordering on AMD fanboyism.
      --

      Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

    15. Re:Israel by homer_ca · · Score: 1

      "it might have had to do with the 88-2 kill ratio over the Bekaa Valley in the early 80s."

      For comparison, the US Navy lost 2 planes to Syrian SAMs in just one raid in '83.

    16. Re:Israel by thedletterman · · Score: 1

      i didn't appreciate the anti-semitism of this thread either, which dismisses the fact that Israel is one of the most high-tech economies in the world.

      --
      Any fool can criticise, condemn, and complain, and most fools do. - Benjamin Franklin
    17. Re:Israel by kiatoa · · Score: 1

      Damn, my mod points expired yesterday. I'd be modding this post "insightful++" if I could.

      --
      90% of the wealth is in 2% of the pockets. Bummer to be in the majority.
  11. Twice. by mnemonic_ · · Score: 1

    Bite me twice.

  12. What gives? by Sqwubbsy · · Score: 1, Funny

    If the editors can post a dupe story, why can't I post a dupe comment?
    The mods gotta loosen up a little. Sheesh.

    1. Re:What gives? by Anonymous Coward · · Score: 0

      Because the editors are normal people, and you are a fucking moron. Good enough reason?

  13. Since this is a dupe by TubeSteak · · Score: 3, Interesting

    Can someone summarize nicely and neatly, the practical difference(s) between out-of-order and in-order executions?

    Why is it important that Intel is embracing OOOE and everyone else is moving away.

    --
    [Fuck Beta]
    o0t!
    1. Re:Since this is a dupe by dlakelan · · Score: 5, Informative

      Out of order execution is where special silicon on the processor tries to figure out the best way to run your code by reordering the instructions to use more of the processor features at once.

      In order execution doesn't require all that special silicon and therefore frees up die space.

      So one approach is to try to make your one processor as efficient as possible at executing instructions.

      Another approach is to make your processor relatively simple, and get lots of them on the die so you can have many threads at once.

      I personally prefer the multiple cores, because I think there is plenty of room for parallelism in software. HOwever this guy is basically claiming that intel is trying to get both, more cores and smarter cores. They're relying on Moore's law to shrink the size of their out of order execution logic so that they can get more smart cores on die.

      --
      ((lambda (x) (x x)) (lambda (x) (x x))) http://www.endpointcomputing.com a scientific approach to custom computing.
    2. Re:Since this is a dupe by DerGeist · · Score: 1
      In software like video/audio processing, then yes, there is a veritable orgasm of parallelizable code. For most single programs a user wants to execute, the max speedup you can expect to see is about 1.2 with 2 cores versus one. (I'll spare you the computation, it's rather long, I had to do it in my Advanced Computer Architecture class and again in my High Performance Architecture class).

      Also don't forget that with multiple cores you're introducing a host of new problems such as scheduling, cache coherency, synchronization, etc. It's a complex thing to have more than one core working on the same problem. The best application for this type of multiprocessor system is multitasking. Before, when copying 40 GB of movies/tv/pr0n from your friend's removable HDD, your computer would tank, practically deadlocked. With a dual core machine, you'll barely notice anything is running in the background (I've tried it myself, it's awesome). You'll see the same effect installing software, torrenting, etc.

      So in general there isn't a ton of parallelizable code running in a single program, but there is a good amount of parallelizable code running on a machine at a given time.
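
      The computation isn't shown above, so this is only a guess at its shape: the usual way to get a number like that is Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the program that can run in parallel and n is the number of cores. Under that assumption, a 1.2x speedup on 2 cores corresponds to roughly a third of the work being parallelizable. The function names below are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: overall speedup when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def parallel_fraction_for(speedup, cores):
    """Invert Amdahl's law: parallel fraction implied by an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    p = parallel_fraction_for(1.2, 2)
    print("implied parallel fraction:", p)
    # The same program on more cores: the serial part caps the gain.
    for n in (2, 4, 8, 1000):
        print(n, "cores ->", amdahl_speedup(p, n))
```

      The second loop is the interesting part: with only a third of the work parallel, piling on cores never gets past 1.5x, which is why the multitasking case matters more than any single program.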

    3. Re:Since this is a dupe by Billly+Gates · · Score: 1

      An out of order execution executes out of order and an in order execution executes in order.

      Get with the program. Sheesh

    4. Re:Since this is a dupe by DrMrLordX · · Score: 1

      The article summary is strange. Nobody should be surprised by Intel's decision to base their next generation of CPUs on out-of-order execution. They've been doing that ever since the Pentium Pro. Outside of the Itanium, Intel has never gotten away from "embracing OOOE". I have no idea why they even brought up the subject of in-order and out-of-order execution.

    5. Re:Since this is a dupe by John_Booty · · Score: 5, Informative

      It's a philosophical difference. Should we optimize code at run-time (like an OOOE processor) or rely on the compiler to optimize code at compile time (the IOE approach)?

      The good thing about in-order execution is that it keeps the actual silicon simple and uses fewer transistors. This keeps costs down and engineers have more die space to "spend" on other features, such as more cores or more cache.

      The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

      (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there such as games and commercial multimedia software.)

      As a programmer, I like the idea of out-of-order execution and the concept of runtime optimization. Programmers are typically the limiting factor in any software development project. You want those guys (and girls) worrying about efficient, maintainable, and correct code... not CPU specifics.

      I'd love to hear some facts on the relative performance benefits of runtime/compiletime optimization. I know that some optimizations can only be achieved at runtime and some can only be achieved at compiletime because they require analysis too complex to tackle in realtime.

      --

      OtakuBooty.com: Smart, funny, sexy nerds.
    6. Re:Since this is a dupe by baywulf · · Score: 1

      OOOE breaks up the instruction stream's execution order so that as many execution units as possible are busy, thus maximizing performance. While this is done, the hardware checks data dependencies between instructions so that the correct results are still produced. For example, if there is an integer add followed by a fp multiply and then a branch, it could theoretically execute all three in parallel assuming enough execution units are available. But then lots of problems come up, such as if the fp multiply generates an exception then the branch should no longer execute, or if the branch was not predicted correctly then all instructions executed after it must be flushed. The hardware uses lots of extra logic and hidden registers to store results temporarily and commits them in the original order, so the same results are produced but it executes faster. Read up on the Tomasulo algorithm if you want to learn further.
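
      To make the dependency checking concrete, here is a toy model. It is not Tomasulo's algorithm, just a single-issue scheduler that either waits on the oldest instruction (in-order) or picks the oldest instruction whose inputs are ready (out-of-order). Everything in it is an assumption for illustration: the Instr type, the register names, the latencies, and the rule that each register is written only once (a stand-in for the renaming done with hidden registers). Exceptions, branch flushes and in-order commit are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str      # register written (each register is written once here)
    srcs: tuple    # registers read
    latency: int   # cycles until the result is available


def run(program, out_of_order):
    """Issue at most one instruction per cycle; return (finish cycle, issue order)."""
    # Results of instructions that have not issued yet are never ready.
    ready_at = {ins.dest: float("inf") for ins in program}
    pending = list(program)
    cycle, finish, order = 0, 0, []
    while pending:
        # In-order may only look at the oldest instruction;
        # out-of-order scans the whole window, oldest first.
        window = pending if out_of_order else pending[:1]
        for ins in window:
            # Registers never written by the program are live-in and ready at 0.
            if all(ready_at.get(src, 0) <= cycle for src in ins.srcs):
                pending.remove(ins)
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                order.append(ins.name)
                break
        cycle += 1
    return finish, order


# A slow load (think cache miss) followed by work that does not depend on it.
PROGRAM = [
    Instr("load", "r1", (), 10),
    Instr("add_a", "r2", ("r1",), 1),
    Instr("mul", "r3", ("r4", "r5"), 3),
    Instr("add_b", "r6", ("r3", "r4"), 1),
    Instr("add_c", "r7", ("r2", "r6"), 1),
]

if __name__ == "__main__":
    print("in-order:    ", run(PROGRAM, out_of_order=False))
    print("out-of-order:", run(PROGRAM, out_of_order=True))
```

      In this toy program the in-order run sits behind the load until it returns, while the out-of-order run gets the multiply and the add that depends on it done in the load's shadow and still produces the same dataflow.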

    7. Re:Since this is a dupe by distributed · · Score: 1
      Before, when copying 40 GB of movies/tv/pr0n from your friend's removable HDD, your computer would tank, practically deadlocked.

      I believe disk transfers are mostly done using DMA, the processor isnt really executing a loop for copying data (check ur cpu usage during a copy)... the deadlocking i think has prolly more to do with the IO interface being choked.

      You are right about the amount of available parallelism though, architects/designers simply dont know of any good way to use all the real estate on the chip to run single threaded code faster (curse the signal propagation delay)... and its much easier in terms of chip verification, design costs etc. to add a dupe on the chip. No matter how many cores we have on a chip... single threaded performance is always going to be important unless we shift to a more parallel programming model.

      One major problem with in-order architectures is that it's hard for them to do useful work during cache misses... they simply cant fetch out of order and have to just stall in almost all cases, whereas outta order can manage to use at least part of the miss penalty. But this does come at the cost of complexity and power.

      If only everyone was taught to think and program in parallel since birth...

      --
      [all generalizations are untrue except this one]
    8. Re:Since this is a dupe by TubeSteak · · Score: 1
      The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well. Imagine in a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.
      That may have mattered in previous iterations of CPU hardware, but haven't the last few generations of AMD & Intel CPUs used the same instruction sets?

      I'm pretty sure they cross licensed all the SSE and MMX instruction sets.
      Even the newer Via and Transmeta CPUs have SSE3. Intel never did use the 3DNow! set, even though SSE3 implements some of its features.
      --
      [Fuck Beta]
      o0t!
    9. Re:Since this is a dupe by Nazo-San · · Score: 1

      Hmm, if nothing else, this is kind of where the Cell concept comes in. Is it not supposed to actually dedicate one of the cores specifically to the control of the others? Sounds like a good idea to me. Anyway, what I'm thinking (and I've touched on this in another thread) is that with multicore being pushed so hard these days programmers might buckle down and actually program better. See, OOOE probably requires the chip to do most of the breaking up. I mean, if you break up the code first-hand, why do you need the chip to have a smart way to do it for you? But, if you think about it, modern things such as games could be designed a little better where they split things up more into threads. Yes, scheduling and such do indeed get tricky, and that's a problem that will have to be gotten used to, but, then this is why few designers actually make their own engine. Conceivably you can split up just about anything. Put AI in one thread, physics processing in another, sound in yet another, split graphics up a bit (even with hardware acceleration some CPU processing is required for these, though I might add that nVidia has already begun to run with the idea of splitting up graphics stuff and have multithreading support in their drivers already, albeit a little buggy I think.) I've already mentioned in a separate post that Intel thinks they can come up with a 10 core processor in the not so far future. That leaves quite a bit of leeway for the game doesn't it? Heck, even just the idea of four sounds like it would be a world of difference compared to two to me once you start to be able to move the overhead a little further away. Ok, the programming is not easy for a moment and will make life for the actual engine designers unpleasant for a while, but, then you can't tell me that various things haven't already done this in the past and that at least some of those hurdles haven't already been passed successfully?

      It really looks like people are getting serious about the idea of multithreading since multicore is less troublesome than SMP and AMD and Intel are both pushing it. I honestly believe programmers will buckle down and start optimizing their code a bit more.
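
      As a rough illustration of the "AI in one thread, physics in another, sound in yet another" idea, here is a minimal sketch. The subsystem functions and their return values are made up for the example, and in standard CPython the global interpreter lock keeps these threads from running Python bytecode truly in parallel, so this shows the structure of the split rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-frame subsystem updates; a real engine would do actual work here.
def update_ai(frame):
    return f"ai:{frame}"

def update_physics(frame):
    return f"physics:{frame}"

def mix_sound(frame):
    return f"sound:{frame}"

def run_frame(frame):
    """Run each subsystem in its own worker thread and collect the results."""
    subsystems = {"ai": update_ai, "physics": update_physics, "sound": mix_sound}
    with ThreadPoolExecutor(max_workers=len(subsystems)) as pool:
        futures = {name: pool.submit(fn, frame) for name, fn in subsystems.items()}
        # result() blocks until that subsystem has finished this frame.
        return {name: fut.result() for name, fut in futures.items()}

if __name__ == "__main__":
    print(run_frame(1))
```

      The blocking result() calls are the per-frame synchronization point, which is exactly where the scheduling gets tricky once the subsystems start depending on each other.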

      I don't know what you're talking about with the harddrive transfer thing though. The CPU isn't a limit when transferring files unless you're using some kind of weird program that has to compress the data first, then transfer, but, that only makes sense on a network. Your friend's external drive would have to compress on its end, not yours. And the only compression algorithm I remember reading about any time recently being actually symmetric was Monkey's Audio while things like a file transfer would probably be gzip at most even assuming the protocol did allow it. No, the harddrive has been the slowest critical component of a computer since the early days and that still remains true, especially with USB and its inconsistent speeds, but, even firewire can only manage so much with sustained transfers. I would accept an argument that explorer is a piece of crap, but, the fact is that I do a lot of transfers from partition to partition on my single internal harddrive (which really hurts since it has to keep jerking back and forth so its max speed at doing this is probably overall less than half of its max speed from harddrive to harddrive.) My single core processor doesn't stutter when I do this. Oh, sure, if I run something that needs to get a file off the harddrive, that program will freeze up, the harddrive will flare up even more and that light will stop blinking and become sustained for a moment (in other words, there is little doubt the CPU isn't the limiting factor here.) However, all the parts that end up running from memory run smoothly despite the transfer. Hmm, or does the USB bus use interrupts maybe? I suppose it could force the CPU to stop for a moment? Oh well, my fastest devices are little thumb drives like a cruzer micro, so I haven't managed a USB 2.0 transfer fast enough to stagger my single core cpu. Mind you, I'm not a

    10. Re:Since this is a dupe by acidblood · · Score: 5, Informative
      Be careful when you speak of parallelism.

      Some software simply doesn't parallelize well. Processors like Cell and Niagara will take a very ugly beating from Core architecture based processors in that case.

      Then there's coarse-grained parallelism, tasks operating independently with modest requirements to communicate between themselves. For these workloads, cache sharing probably guarantees scalability. Going even further, there's embarrassingly parallel tasks which need almost no communication between different processes -- such is the case of many server workloads, where each incoming user spawns a new process, which is assigned to a different core each time, keeping all the cores full. This type of parallelism ensures that multicore (even when taken to the extreme, as in Sun's Niagara) will succeed in the server space. The desktop equivalent is multitasking, which can't justify the move to multicore alone.

      Now for fine-grained parallelism. Say the evaluation of an expression a = b + c + d + e. You could evaluate b + c and d + e in parallel, then add those together. The architecture best suited for this type of parallelism is the superscalar processor (with out-of-order execution to help extract extra parallelism). Multicore is powerless to exploit this sort of parallelism because of the overhead. Let's see:
      • There needs to be some sort of synchronization (a way for a core to signal the other that the computation is done);
      • The fastest way cores can communicate is through cache sharing -- L1 cache is fairly fast, say a couple of cycles to read and write, but I believe no shipping design implements shared L1 cache, only shared L2 cache;
      • An instruction has to go through the entire pipeline, from decode to write-back, before the result shows up in cache, whereas in a superscalar processor there exist bypass mechanisms which make available the result of a computation in the next cycle, regardless of pipeline length.

      Essentially, putting synchronization aside for the moment (which is really the most expensive part of this), it takes a few dozen cycles to compute a result in one core and forward it to another. Also, if this were done on a large scale, the communication channel between cores would become clogged with synchronization data. Hence it is completely impractical to exploit any sort of fine-grained parallelism in a multicore setting. Contrast this with superscalar processors, which have execution units and data buses especially tailored to exploit this sort of fine-grained parallelism.
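
      To put rough numbers on the a = b + c + d + e example, here is a back-of-the-envelope sketch. The latencies are assumptions picked for illustration (1 cycle per add, 30 cycles standing in for "a few dozen" to forward a result between cores), not measurements of any real chip, and synchronization cost is left out entirely.

```python
# Assumed latencies in cycles; illustrative only.
ADD_LATENCY = 1           # one integer add
CROSS_CORE_LATENCY = 30   # stand-in for "a few dozen cycles" through a shared cache

def serial_cycles():
    """((b + c) + d) + e on one scalar pipeline: three dependent adds."""
    return 3 * ADD_LATENCY

def superscalar_cycles():
    """b + c and d + e issue together, then one final add via the bypass network."""
    return 2 * ADD_LATENCY

def multicore_cycles():
    """Each core does one add in parallel, one result crosses cores, then the final add."""
    return ADD_LATENCY + CROSS_CORE_LATENCY + ADD_LATENCY

if __name__ == "__main__":
    print("serial:     ", serial_cycles())
    print("superscalar:", superscalar_cycles())
    print("multicore:  ", multicore_cycles())
```

      Under these assumptions, splitting the expression across two cores comes out roughly ten times slower than just doing the three adds serially on one core, which is the point being made above.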

      Unfortunately, this sort of fine-grained parallelism is the easiest to exploit in software, and mature compiler technology exists to take advantage of it. To fully exploit the power of multicore processors, the cooperation of programmers will be required, and for the most part they don't seem interested (can you picture a VB codemonkey writing correct multithreaded code?) I hope this changes as new generations of programmers are brought up on multicore processors and multithreaded programming environments, but the transition is going to be turbulent.

      Straying a bit off-topic... Personally, I don't think multicore is the way to go. It creates an artificial separation of resources: i.e. I can have 2 arithmetic units per core, so 4 arithmetic units on a die, but if the thread running on core 1 could issue 4 parallel arithmetic instructions while the thread running on core 2 could issue none, both of core 1's arithmetic units would be busy on that cycle, leaving 2 instructions for the next cycle, while core 2's units would sit idle, despite the availability of instructions from core 1 just a few millimeters away. The same reasoning is valid for caches and we see most multicore designs moving to shared caches, because it's the most efficient solution, even if it takes more work. It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functional

      --

      Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

    11. Re:Since this is a dupe by be-fan · · Score: 1

      At a technical level, the difference between OOO and IO is thus: an OOO processor can issue, via a structure called a reservation station, instructions in an order other than what is in the code stream. So say the CPU decodes instructions A, B, C, and D, in that order. These instructions go into a reservation station. Each instruction in this structure sits there until all its source operands are available. That means if A, B, C, and D enter the RS, but B and C's operands are available before A and D's, B and C will execute before A and D.
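
      The A, B, C, D example can be sketched as a toy model. This is a simplification, not how any real reservation station is built: it assumes unlimited issue width, and the "cycle when operands become ready" numbers are invented for the example.

```python
def out_of_order_issue(program, ready_cycle):
    """Toy reservation station.

    program: instruction names in program order.
    ready_cycle: name -> first cycle in which all source operands are available.
    Returns a list of (cycle, name) in the order instructions actually issue.
    """
    station = list(program)   # everything waits in the station, oldest first
    issued = []
    cycle = 0
    while station:
        # Any instruction whose operands have arrived may issue this cycle.
        ready = [name for name in station if ready_cycle[name] <= cycle]
        for name in ready:
            station.remove(name)
            issued.append((cycle, name))
        cycle += 1
    return issued

if __name__ == "__main__":
    # B and C have their operands immediately; A and D wait until cycle 2.
    order = out_of_order_issue(["A", "B", "C", "D"], {"A": 2, "B": 0, "C": 0, "D": 2})
    print(order)
```

      With those made-up ready times, B and C issue in cycle 0 and A and D in cycle 2, even though the code stream listed A first.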

      The key use of OOO is twofold. First, it reduces the compiler's burden. Say the CPU is designed so the latency between two FPU additions is 4 clock cycles. With an in-order processor, the compiler would have to schedule code so there are 4 independent operations between each pair of dependent ones. If a dependent operation shows up in the code stream before the instruction it depends on has completed, the processor will just wait there while the instruction finishes, even if there are instructions in the queue behind the dependent one that could be executing in the meantime. In an OOO processor, if independent instructions arrive while a particular instruction is waiting for some operands, they'll be executed ahead of the waiting instruction.

      The other use of OOO is to cover the latencies to caches and memory. Typically, L1 caches have a latency of 3 cycles, while L2 caches have latencies of 10-30 cycles, and memory has latencies of upwards of 150 cycles. If an instruction arrives that requires results from memory, it could be waiting anywhere from 3-150 cycles for that data to arrive, depending on what is cached. In an in-order processor, this could cause the CPU to stall for a very long time doing nothing, even when there are instructions behind the waiting one that could be executing in the meantime.
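
      Here is a small sketch of that stall, under heavy simplifying assumptions: a single-issue machine, a made-up five-instruction program, 1-cycle ALU ops, and the 150-cycle memory latency mentioned above. The only difference between the two runs is whether the machine may look past the oldest waiting instruction.

```python
def run(program, out_of_order):
    """Toy single-issue pipeline model.

    program: list of (name, latency, deps) in program order.
    Returns the cycle in which the last result becomes available.
    """
    done_at = {}              # name -> cycle its result is available
    pending = list(program)
    cycle = 0
    while pending:
        # In-order hardware may only consider the oldest pending instruction.
        window = pending if out_of_order else pending[:1]
        for instr in window:
            name, latency, deps = instr
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[name] = cycle + latency
                pending.remove(instr)
                break         # single issue: at most one instruction starts per cycle
        cycle += 1
    return max(done_at.values())

# Hypothetical program: a load that misses all caches, one dependent op,
# and three independent ops queued up behind it.
PROGRAM = [
    ("load", 150, []),
    ("use", 1, ["load"]),
    ("i1", 1, []),
    ("i2", 1, []),
    ("i3", 1, []),
]

if __name__ == "__main__":
    print("in-order:    ", run(PROGRAM, out_of_order=False))
    print("out-of-order:", run(PROGRAM, out_of_order=True))
```

      In this toy case the in-order run finishes at cycle 154 and the out-of-order run at 151, because the three independent instructions execute in the shadow of the miss. The more independent work sits behind the load, the more of the miss gets hidden.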

      --
      A deep unwavering belief is a sure sign you're missing something...
    12. Re:Since this is a dupe by John_Booty · · Score: 1
      That may have mattered in previous iterations of CPU hardware, but haven't the last few generations of AMD & Intel CPUs used the same instruction sets?


      You can have two processors that implement the exact same instruction set, yet have entirely different performance characteristics.

      Of course, this happens even with complex out-of-order cores. With simpler, in-order cores, the difference really grows. You need to tightly couple your code (typically via compiler optimizations, unless you're hand-coding assembly) to a specific implementation of the x86 instruction set instead of merely writing good clean efficient code and letting your friendly out-of-order core figure out an efficient way to run it.

      There's no right or wrong answer. The simple, in-order approach definitely has some real strengths. You can achieve some stunning performance this way (at a large cost in man-hours) assuming the coders and compilers are up to the task.

      In-order cores are probably the right approach for markets without a diverse selection of CPUs. Game consoles come to mind.

      But with all the different CPUs floating around in the PC world, I think out-of-order cores are definitely the right approach. How many different processor architectures have we seen in the past 10 years? At least 20 or 30 if you count everything from AMD/Intel/Transmeta/VIA/whoever.
      --

      OtakuBooty.com: Smart, funny, sexy nerds.
    13. Re:Since this is a dupe by Nazo-San · · Score: 1

      I'm not really talking about direct access to the cache or anything like that. Just better design of the code so that it splits more things to begin with. With OOOE this isn't absolutely necessary (though it can't hurt to try to write it concentrating on writing code that you can be relatively positive will do well in OOOE) but with multithreading it does, admittedly, become necessary. While you can't directly control what the chip will be doing, you can control what you are sending to it to begin with so that it will be doing things more efficiently than if you just sent it a poorly written set of commands. Unfortunately, the biggest problem here is that so many want the compilers to do it for them, but, it's not really the compiler's job to make your code better. If they won't get more serious about it, then I suppose OOOE is better in the long run, but, I do honestly believe they will. I mean, heck, even people like nVidia are doing so and that's not even going to be affected nearly so much as, say video processing for example.

      Actually, if you think about it, programmers are kind of suffering from job loss due to outsourcing. Should programming suddenly begin to require a bit more expertise, it might buy some more time for the CS field before it becomes like some things such as the textile industry to Mexico. (Mind you, there are ups to every down with outsourcing, so it's not positive except from the point of view of the programmers who like their jobs, but, otherwise it's really more of a shift than a positive or negative change.)

      Oh well, in the end, these are just educated guesses. Theories if you prefer. We won't know anything for certain until they come out with a serious product as directly oriented towards this as the article implies (rather than the more minor OOOE used in the past.) By then we should be able to get benchmarks and see for ourselves where the truth lies.

      PS. Intel is making certain processors of a RISC type of nature like the ARM based processors, which is what my old Toshiba PocketPC uses. Actually, for a 400MHz XScale processor, it may not be the most blazing fast thing I've ever used, but, I'd say it runs normal enough tasks pretty well. Anyway, my point is, they know OOOE pretty well, so they do have a good head start here.

    14. Re:Since this is a dupe by rsbroad · · Score: 1

      Out of order execution, and also on-chip cache, help in speeding up programs like Windows, and also speeding up other programs that are compiled from high level languages like C or Basic.

      Neither feature improves the speed of assembly language programs. Out of order execution does not assist code that has been written to run fast.
      On-chip cache does not help such code as much as plain old on-chip memory would.

      Therefore Intel's and AMD's focus on on-chip complexity is to favor Windows Benchmark programs.

      The fastest possible code is sequential with no branches or subroutine calls. This type of code is not practical using a high level language.

      When executing an assembly language program, some very high speed on-chip memory can be very useful. But very high speed on-chip cache is not nearly so useful.

      Computers not running the Windows operating system will benefit from lower complexity and multiple cores: applications like weather simulation, or making "The Incredibles".

      Computers running Windows will be judged by Windows Benchmark scores, and will benefit from higher complexity processors and lower number of cores.

    15. Re:Since this is a dupe by JollyFinn · · Score: 2, Interesting
      It is only natural to extend this idea to the sharing of all resources on the chip. This is accomplished by putting them all in one big core and adding multicore functionality via simultaneous multi-threading (SMT), a.k.a. hyperthreading. The secret is designing a processor for SMT from the start, not bolting it on a processor designed for single-threading as happened with the P4. I strongly believe that such a design would outperform any strict-separation multicore design with a similar transistor budget.

      Too bad it doesn't work that way. Lots of structures in a CPU are n^2 in complexity, where n is the width of the processor. Also, when the traveling of information across a die takes more than 10 cycles, you need to have smaller structures; otherwise it will increase the latencies of instructions.

      Here's one example: the bypass path needs to connect the load port and every integer unit to every integer unit. So there are (n*n) connections between units, and the number of stages needed to select an input eventually hampers the clockspeed. There is a practical limit on core size: if we go bigger, the clockspeed penalties and latencies will cost more performance than the added core resources gain. SMT also hurts the cache hit rate, and that penalizes per-thread performance too. When you add more execution units, the maximum distance between execution units grows, so the time needed per cycle increases due to delays moving data between execution units. Execution units are *NOT* the area where widening hurts most, but they're still the easiest to explain. So then you either use 2-cycle latencies or go for much lower clockspeeds, or increase the voltage, but power consumption is proportional to v^2, so no matter what, the efficiency goes down.
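
      A quick sketch of how that (n*n) growth plays out. It counts only unit-to-unit bypass paths under the full-interconnect assumption used above; the widths are arbitrary example values, and real designs have other structures that scale differently.

```python
def bypass_connections(units):
    """Full bypass network: every unit's output can feed every unit's input."""
    return units * units

def total_connections(cores, units_per_core):
    """Bypass paths on a die made of identical cores with no cross-core bypassing."""
    return cores * bypass_connections(units_per_core)

if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"{n:2d} units in one core -> {bypass_connections(n):3d} bypass paths")
    print("one 8-wide core: ", total_connections(1, 8))
    print("two 4-wide cores:", total_connections(2, 4))
```

      With the same eight execution units on the die, one wide core needs 64 paths while two narrower cores need 32 between them, and the gap keeps widening as n grows.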

      I believe SMT isn't completely dead; it can make a comeback in Intel machines at some point, with SOME additional per-core resources. But from now on there are multiple cores.

      To make it clear, the transistor budget right now is so large that putting it all in a single core isn't efficient, due to the need to move data inside the core and the n^2 complexities.

      --
      Emacs is good operating system, but it has one flaw: Its text editor could be better.
    16. Re:Since this is a dupe by shaitand · · Score: 1

      BEGIN RANT *sighs* If only programmers today were concerned with efficient, correct, and maintainable code. In reality the lazy/money factor usually wins out nowadays. That is why you see 10 billion frameworks out there and every project uses a handful of them.

      Usually said programmers sell out efficient code claiming that the framework has been tested and worked on by a lot of people, blah blah blah. The truth is that two good programmers will churn out roughly the same number of bugs per 1000 lines of code. My homerolled 50-500 line custom library is going to have fewer bugs than your 10,000 line framework every time and the bugs it does have will be easier to find. It is a given that a homerolled library will be faster since it will be tuned with the app in mind instead of general use. It is easier to maintain because one does not have to sort through code from a dozen different sources. If a bug is found you can also simply fix it instead of having to try to convince the framework authors there is a flaw in their precious art and wait 6 months for them to release a new version.

      END RANT

    17. Re:Since this is a dupe by naasking · · Score: 2

      That's where a project like LLVM comes in. Platform-neutral binaries via LLVM bytecode, and full processor-specific link-time native compilation+optimization when a binary is installed. Alternatively, you can JIT the bytecode at runtime. Developers just distribute LLVM bytecode binaries, and the installers/users do the rest. I think the LLVM approach is the future.

    18. Re:Since this is a dupe by TheRaven64 · · Score: 1
      Imagine in a world where AthlonXPs, P4s, P-Ms, and Athlon64s were all highly in-order CPUs. Each piece of software out there in the wild would run on all of them but would only reach peak performance on one of them.

      Not really. The best case for any in-order processor is to have dependent instructions as far apart from each other as possible. From this state, no amount of re-ordering instructions by an OoO processor will give any performance benefit. Similarly, no in-order pipeline will be particularly disadvantaged by this. Out of order was important a decade ago when this was difficult to do, but much less so now.

      --
      I am TheRaven on Soylent News
    19. Re:Since this is a dupe by TheRaven64 · · Score: 1
      When I learned to code, I was taught that multiplication was expensive, and shifting was cheap. If at all possible, I should replace power-of-two multiplications with shifts. In some cases, it was even better to replace constant multiplications with sequences of shifts and adds. This was so common that (when I checked a year ago), GCC output shift/add sequences for all constant multiplications.

      The Athlon, while instruction-set compatible with previous CPUs, had two multipliers on chip and only one shifter. This meant that using shift/add sequences for multiplication was considerably slower than using multiply instructions.

      Just because two CPUs have the same instruction set, doesn't mean that the same code is optimal on both.
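
      For anyone who hasn't seen the shift/add trick, here is a minimal sketch of what such a sequence computes. It is written in Python purely to show the arithmetic; the function name is made up, and a compiler would emit the equivalent shifts and adds directly as machine instructions.

```python
def mul_by_shift_add(x, constant):
    """Multiply x by a non-negative integer constant using only shifts and adds.

    Each set bit k in the constant contributes (x << k) to the result,
    e.g. x * 10 == (x << 3) + (x << 1) because 10 is 0b1010.
    """
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift
        constant >>= 1
        shift += 1
    return result

if __name__ == "__main__":
    print(mul_by_shift_add(7, 10))
```

      A constant with many set bits turns into a long chain of shifts and adds, which is where a chip with one shifter and two multipliers would rather just multiply.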

      --
      I am TheRaven on Soylent News
    20. Re:Since this is a dupe by John_Booty · · Score: 1

      Wow, that sounds fascinating. Sounds like that achieves the best of all worlds with minimal drawbacks.

      I'd seen the odd reference to LLVM in the past, but I'd never seen a succinct description of its benefits until now. Thanks for the informative reply.

      --

      OtakuBooty.com: Smart, funny, sexy nerds.
    21. Re:Since this is a dupe by John_Booty · · Score: 1

      Not really. The best case for any in-order processor is to have dependent instructions as far apart from each other as possible. ...no in-order pipeline will be particularly disadvantaged by this.

      You're assuming that the definition of "dependent instructions" is the same for every in-order processor sharing the same instruction set. I think that's a highly suspect assumption!

      Different theoretical in-order x86 CPUs would surely differ in terms of execution units and other factors.

      --

      OtakuBooty.com: Smart, funny, sexy nerds.
    22. Re:Since this is a dupe by Raffaello · · Score: 1

      I think you mean "veritable orgy," not "veritable orgasm," unless of course you're processing the tail end of a porn video.

    23. Re:Since this is a dupe by DerGeist · · Score: 1

      "And where are the city's snowplows? Sold off to billionaire Montgomery Burns in a veritable orgasm of poor planning." -Kent Brockman

    24. Re:Since this is a dupe by DerGeist · · Score: 1
      Yeah my example was a bad choice, I was more thinking of IDE being such a load on the CPU. I'm sure any of us can think of a good example of doing something that hogs your CPU but would be almost unnoticeable in a dual core environment.

      And yes, dumping the pipeline due to page faults or cache misses is a big deal. Miss penalties are a huge deal in any system. Most nowadays just go do something else if a program faults (assuming there's something else to do).

    25. Re:Since this is a dupe by acidblood · · Score: 1
      Also when the traveling of information across a die takes more than 10 cycles you need to have smaller structures, it will increase latencies of instructions.

      Not sure what you mean here, but if you're talking about my estimate of the costs of exchanging information between cores, remember that this is due to the lack of bypass structures between cores, the need for explicit synchronization code, and the rather inefficient method of sharing data through the cache. Once hardware is dedicated to it, even in large die processors, this latency is dramatically reduced.

      Also, the argument that cores will increase in size along with distances between execution units in the core and so on, is flawed. For a processor to make sense economically, it can't go beyond a certain die size. Cost grows with area, but not only that, yields naturally decrease for larger dies at a fast rate, increasing the price even further. Also, with the ever increasing costs of plants, materials, etc. the balance is being tilted towards ever smaller cores. I will concede that, with increasing clock speeds, it's not enough for distances to stand still, they actually have to be reduced. But distances are hardly the bottleneck for clock speed, and even if a couple of critical paths are hampered by distance, just do what the P4 did which is to include extra pipeline stages for data propagation.

      Here's one example, the bypass path needs to connect load port and every integer unit to every integer unit So there is (n*n) Connections between units, and the number of stages it needs to go in selecting input hampers the clockspeed eventually.

      That's assuming a crossbar configuration. Hardly any kind of interconnect (switches, etc.) produced today uses a true crossbar scheme. One could try other, more cost efficient topologies, or perhaps something like a multi-ported queue with n input connections and n output connections. If it has enough capacity for 99% of real-world code, it's certainly good enough.

      Or you could have pairs/triples/n-tuples of execution units with interconnects only between themselves, and try to dispatch code with dependencies to interconnected execution units. Again, there might be a contrived piece of code which would require all execution units to be interconnected, but if most code doesn't, it's good enough.

      The thing is that not enough research has been done on SMT and SMT-friendly structures. If there's a large benefit to be had with SMT over multicore, SMT-friendly structures will inevitably begin to appear. If you were to ask a RISC advocate 15 years ago whether something like the P6 core (used in the Pentium Pro/II/III) could be done for a CISC processor (and x86 is CISC, of course), you'd probably be laughed at. Yet there's money to be made in x86, so engineers and researchers eventually overcame the barriers. Given enough interest and money, the same will be true of SMT.
      --

      Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

    26. Re:Since this is a dupe by be-fan · · Score: 1

      The fundamental problem is that the compiler doesn't know exactly what the dependencies will be at runtime. Branches and memory operations, which are extremely common in most software, create dependencies that the compiler cannot analyze at compile-time, but the processor can analyze at run-time. In the real world, in-order versus out-of-order isn't just a matter of code scheduling, but fundamentally limits the types of code you can run at high speed.

      --
      A deep unwavering belief is a sure sign you're missing something...
    27. Re:Since this is a dupe by JollyFinn · · Score: 1

      Well the problem you need to fix is called physics. The RC delay increases with process scaling.
      Basically, in every process generation you have to reduce the length of each wire by 0.7 or have half as many wires, in order to keep the delay per mm the same, since RC delay increases when wires are scaled smaller. The latency of moving data around increases all the time.
      Your transistor budget may go up, but the area that you can reach per cycle at a reasonable clockspeed goes down.

      Here's a hint: even under good conditions, if you put all your resources in one core, there would be a 16-cycle distance between the furthest parts of the core. That means the branch misprediction penalty is flushing the pipeline, which is probably 40 cycles or more, plus the extra 16 cycles that come from moving the information about the branch misprediction to the front end.
      Also the extra buffers to run the data take extra die area and power. Also the communication between processing units goes down.

      As for widening a processor for SMT, there was a design called EV8. It would have been a great CPU, but it wouldn't have been the most efficient for multithreaded workloads, since the core was 5x larger than the EV6 core in a given process. It was the ultimate SMT processor: it had double the resources of the EV6 core, but 5x the area. It was cancelled due to company politics, but it was still a great project.
      Now here comes the rebuttal of your first fallacy: the amount of resources available per die area isn't constant across CPUs of different widths; the smaller the core, the more efficient its use of power and die area.

      Now scaling makes the EV8-style core *LESS* feasible in a 0.45u process than it was in its 0.9 target, since transistor delay scales by 0.7 while wire delay worsens every generation. [Scaling down the wires makes them slower.] So overall, cores have to either become smaller or else we start DECREASING clockspeed when we improve transistor density.

      You don't want to run a 16-wide 1GHz core against 8 cores that are 4-wide and running at 4GHz. That's why people don't put all the transistors in a single core anymore. Read this 3-page article and you'll get my point. iacoma.cs.uiuc.edu/CS497/PIM2b.pdf
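
      The arithmetic behind that comparison, as a quick sketch. This is peak issue rate only, assuming every issue slot is filled every cycle and there are enough threads to keep all eight cores busy, which real code rarely manages on either design.

```python
def peak_issue_rate(cores, width, clock_ghz):
    """Theoretical peak, in billions of instructions per second."""
    return cores * width * clock_ghz

one_big_core = peak_issue_rate(cores=1, width=16, clock_ghz=1)
eight_small_cores = peak_issue_rate(cores=8, width=4, clock_ghz=4)

if __name__ == "__main__":
    print("1 x 16-wide @ 1GHz:", one_big_core)
    print("8 x  4-wide @ 4GHz:", eight_small_cores)
```

      On paper that is 16 versus 128, an 8x gap before any of the wire-delay arguments even come into it.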

      The communication latencies between cores come mostly not from the design of how they connect but from the area they consume as a whole. And you don't want the latencies that come from moving data around to happen everywhere inside a core.
      Also, more buffering and moving data around large areas consume more power than simply moving it inside a smaller area, so smaller cores consume disproportionately less power than a large core.

      --
      Emacs is good operating system, but it has one flaw: Its text editor could be better.
    28. Re:Since this is a dupe by thedletterman · · Score: 1
      "The bad thing about in-order execution is that your compiled, highly-optimized-for-a-specific-CPU code will only really perform its best on one particular CPU. And that's assuming the compiler does its job well.
      (Unless developers released multiple binaries or the source code itself. While we'd HAVE source code for everything in an ideal world, that just isn't the case for a lot of performance-critical software out there such as games and commercial multimedia software.)"

      This isn't an issue that couldn't be solved with binary build distributions. You missed the obvious on that one.

      "As a programmer, I like the idea of out-of-order execution and the concept of runtime optimization. Programmers are typically the limiting factor in any software development project. You want those guys (and girls) worrying about efficient, maintainable, and correct code... not CPU specifics."

      In the above quote, you placed the burden of efficiency on the compiler, and on this quote, you placed the burden of efficiency on the programmer. Which is responsible for the optimization of the resulting binary, the compiler or the language?

      --
      Any fool can criticise, condemn, and complain, and most fools do. - Benjamin Franklin
    29. Re:Since this is a dupe by thedletterman · · Score: 1

      Actually you didn't miss the binary build distribution, sorry about that. The idea of releasing the source code for every application seemed like nonsense. Binary build distribution is the status quo, and therefore doesn't even present a challenge in my mind. Which is why I assumed you missed this answer. Again, my apologies for misreading you.

      --
      Any fool can criticise, condemn, and complain, and most fools do. - Benjamin Franklin
    30. Re:Since this is a dupe by John_Booty · · Score: 1

      In the above quote, you placed the burden of efficiency on the compiler, and on this quote, you placed the burden of efficiency on the programmer. Which is responsible for the optimization of the resulting binary, the compiler or the language?

      You certainly made a great point here, though. To be honest, I'm not sure of the answer. I was banking on it being "both".

      I'm going on various (admittedly secondhand) things I've heard about Xbox360/PS3 development along with several whitepapers I've read. Creating code that fully utilizes the 360's three multithreaded cores or the PS3's multiple execution units seems to be quite the challenge.

      I'm thinking that instruction scheduling for a single-threaded in-order core could be handled by the compiler. Whereas vectorization and/or multithreading are still largely the domain of hand-tuned low-level code, despite compilers' slight inroads into this area.

      But I guess I really muddied the issue since this was a thread about in-order vs. out-of-order CPUs, not vectorization and multithreading.

      To some extent the concepts are intertwined (since the trend seems to be dropping out-of-order execution in favor of vector units and/or multithreading support) but not necessarily so.

      --

      OtakuBooty.com: Smart, funny, sexy nerds.
  14. The real problem with dupes by lordsid · · Score: 5, Insightful

    The real problem with dupes isn't the fact that there are the same two articles on the front page, nor the whines that come from it, or even the witty banter chiding the mods.

    If I see an article I've already read at the top of the page I QUIT READING.

    This has happened to me several times over the number of years I've read this site. Then I end up coming back and realizing it was a dupe and that I missed several interesting articles in between.

    SO FOR THE LOVE OF GOD READ YOUR OWN WEBSITE.

    --
    IMAGE VERIFICATION IS EVIL!
    1. Re:The real problem with dupes by IHSW · · Score: 1

      You say that under the assumption that you are in the majority that realises it's a dupe (you are in the minority). This article was probably posted for those that missed the previous article, and seeing the comments about dupes leads people to actually click on the link.

      I'm one of those people that just read summaries, and decide not to click on the link because it doesn't interest me. Seeing people say "dupe" leads me to think this article was worth posting twice.

      Or Ars wasn't pleased with the ad-clicks from the previous posting? I don't know.

    2. Re:The real problem with dupes by Anonymous Coward · · Score: 0

      > If I see an article I've already read at the top of the page I QUIT READING.

      Simple solution: read the article below to make doubly sure you've already seen those stories.

  15. No no, not good enought-Double dipping. by Anonymous Coward · · Score: 0

    Or worse. The paying customers get charged twice.

    1. Re:No no, not good enought-Double dipping. by Anonymous Coward · · Score: 0

      "Did you just double-dip that chip?!"

  16. Giving up the 'smart compiler' concept? by Gothmolly · · Score: 3, Interesting

    Wasn't the Achilles heel of the P4 and Itanium crappy code, that caused a pipeline stall on their very long pipes? Every time someone pointed out that AMD didn't have this problem, an Intel fanboy would reply that "with better compilers" you could avoid conditions where you'd have to flush the pipeline, thus maintaining execution speed.
    Well, those "better compilers" don't seem to be falling from the sky, and AMD is beating Intel in work/MHz because of it.
    Is Intel finally deciding "screw it, we'll make the CPU so smart, that even the crappiest compiled code will run smoothly" ?

    --
    I want to delete my account but Slashdot doesn't allow it.
    1. Re:Giving up the 'smart compiler' concept? by TheRaven64 · · Score: 1
      This was a problem with the Itanium, not the P4. The problem with the P4 was that the pipeline was very long and wide. A P4 could have 150 (from memory) instructions in-flight at once. On average, every 7th instruction is a branch. Every branch that is incorrectly predicted causes a pipeline flush (i.e. 150 instructions, at various stages of execution, are ignored). With a prediction rate of 95%, this means you will have an incorrect prediction every 20 branches. Since 20 branches means roughly 140 instructions, it was very uncommon for the pipeline to ever be full.
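
      The arithmetic above can be sketched roughly as follows. The constants are just the from-memory figures quoted in this comment (150 instructions in flight, one branch per 7 instructions, 95% prediction rate), not measured values, and the function name is made up for illustration.

```python
# Back-of-the-envelope model of the argument above. All figures are the
# poster's from-memory estimates, not measured values.
PIPELINE_CAPACITY = 150      # instructions in flight on a P4 (poster's recollection)
INSTRUCTIONS_PER_BRANCH = 7  # "every 7th instruction is a branch"
PREDICTION_RATE = 0.95       # assumed branch predictor accuracy


def instructions_between_flushes(instructions_per_branch, prediction_rate):
    """Average number of instructions executed between two mispredictions."""
    branches_per_mispredict = 1 / (1 - prediction_rate)
    return instructions_per_branch * branches_per_mispredict


run_length = instructions_between_flushes(INSTRUCTIONS_PER_BRANCH, PREDICTION_RATE)
print(f"~{run_length:.0f} instructions between flushes "
      f"(pipeline holds {PIPELINE_CAPACITY})")
```

      Under these assumptions the average run between flushes comes out a little short of the pipeline's capacity, which is the point being made.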

      The Itanium is a nice design. It moves the parallelism detection and re-ordering into the compiler. This should be more efficient, since it means that it only needs to be done once, at compile time, rather than every time an instruction is issued.

      --
      I am TheRaven on Soylent News
    2. Re:Giving up the 'smart compiler' concept? by Anonymous Coward · · Score: 0

      Misprediction penalties are part of the explanation.
      It seems that the P4 microarchitecture suffered from other oddities as well.

      See for example this x-bit-labs article about a feature called "Replay".
      http://www.xbitlabs.com/articles/cpu/display/replay.html

  17. Two cores? Me likee. by M0b1u5 · · Score: 1

    I just want a planet with two cores now.

    --
    How many escape pods are there? "NONE,SIR!" You counted them? "TWICE, SIR!"
  18. GHz by TheSHAD0W · · Score: 1

    Does this mean we're not going to be seeing mid-ten-digit clock rates any more? That was one thing that really annoyed me about the P4; a 2 GHz P4 was NOT more than twice as fast as a 850 MHz P3. It meant one couldn't compare CPUs with each other any more.

    1. Re:GHz by DrMrLordX · · Score: 1

      It does mean that the long-pipeline + high clock strategy of Netburst will be abandoned. Presler and Dempsey are the last of that ill-fated breed.

      However, Conroe has been announced to hit speeds as high as 3 GHz (or higher) for Intel's next Extreme Edition part. We may see speeds that high for the server version of Conroe (Woodcrest) as well.

    2. Re:GHz by Anonymous Coward · · Score: 0

      Beh. That was still extremely easy. It might not have been exactly twice the speed, but you could rather easily tell it was around that.

      Nowadays, a CPU is pretty much the most complicated thing to shop for. You'll see like 2 dozen of them around 3GHz. Some "plain" ones. Some with hyperthreading. Some with a larger cache. Some with a faster/slower bus (FSB). Some with 2 cores. Some with special instruction sets or features (SSE3, 64bit, etc). Some "extreme edition" ones. Different cores (northwood or whatever else). Different sockets (478/775/whatever). And that's just on the Intel side so far. AMD has a lot of different chips too. And it's not easy to compare between both kinds either.

      I mean, what sounds faster, a 2.6 GHz dual core P4 or a 3.2GHz single core? An Athlon64 3000+ Venice or a P4 3GHz 800FSB w/ HT? What if that Athlon64 has another core revision? Or against a P4 D 3GHz? Or whatever, with a little more cache but the slower FSB version? It very quickly gets confusing.

      You just can't tell anymore. Not easily at least. Clock rate is getting more irrelevant than ever. And it's not as simple as CPU Clock * IPC (instructions per cycle) anymore (as a measure of how fast it is). Different chips will perform better at different tasks. Sometimes being a 64bit chip can help (extra registers at least). Or more L2 can be quite a benefit in other tasks. Some stuff can run slower using HT. So you can't exactly compare directly anymore. You just can't simply rate using a "standard" index as they're often flawed in some way.
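
      For reference, the naive "clock * IPC" index mentioned above looks something like this. The two chips and their IPC figures are entirely made up for illustration; they are NOT benchmark results for any real part.

```python
# Naive "clock * IPC" throughput estimate. The chip descriptions and IPC
# figures below are hypothetical placeholders, not measurements.
def naive_throughput(clock_ghz, ipc):
    """Billions of instructions per second under the naive model."""
    return clock_ghz * ipc


hypothetical_chips = {
    "long-pipeline chip @ 3.2 GHz": (3.2, 1.0),
    "short-pipeline chip @ 2.0 GHz": (2.0, 1.6),
}

for name, (clock_ghz, ipc) in hypothetical_chips.items():
    print(f"{name}: {naive_throughput(clock_ghz, ipc):.1f} G instr/s")
```

      Both made-up chips get the same score, yet the index says nothing about cache size, FSB speed, HT or 64-bit registers, which is why a single number like this can't settle the comparison.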

      Even for a tech, buying a CPU has become a nightmare. What company? Which one is the fastest of the bunch at the price point you're willing to pay? What features does it have and that you need? (...)

      I spent an entire evening this week looking for a CPU upgrade for one of my PCs. Having to look up every single chip manually on various benchmarks. Enter those in a spreadsheet along with prices. And notes about the chip features and what not. I'm still not sure what to buy. My head still hurts. It's so complicated I'm thinking about giving up on the upgrade altogether, and waiting for a Conroe or whatever instead (either that or just walk into Best Buy and grab one semi-randomly). I don't even want to think about it anymore.

      Now imagine your grandma looking at CPUs. Do you think she'd have the slightest clue of what to buy? ...

    3. Re:GHz by jawtheshark · · Score: 2, Interesting
      That was one thing that really annoyed me about the P4; a 2 GHz P4 was NOT more than twice as fast as a 850 MHz P3. It meant one couldn't compare CPUs with each other any more.

      You never could do that in the first place. Within a CPU family, it used to be possible. (With Intel's naming scheme today, I can't do it anymore either!) Compare a P-III 500MHz to a P-III 1GHz and you knew that the latter was approximately twice as fast. A 2GHz AMD Athlon XP was approximately twice as fast as a 1GHz AMD Athlon XP. I say approximately because cache sizes could influence these results. You never could compare a P-IV to a P-III or a P-IV to an AMD Athlon, except by falling back on benchmarks and you *know* that all these benchmarks are pretty much artificial and can skew results in favour of a certain architecture.

      Really a long time ago, it was even dubious within the processor family: is a 486DX2/66 slower than a 486DX4/100? After all, the bus speed of the DX2 was 33MHz and the DX4 had a 25MHz bus. Back in the day such things had a major impact. (Even today it can have a big impact...)

      You can also recall the Pentium Pro (the CPU on which both the P-II and the P-III were based). It was a horrible performer for 16-bit code, but on 32-bit code it was pretty much king. Also don't forget the extremely fast cache that it had. A PPro200 with enough RAM can handle Windows 2000 without a hitch. (I know, I had one with 256Meg RAM) The P-II came out, with a less performant cache and it couldn't beat the PPro clock-for-clock. That's why the lowest P-II came at 233MHz. (Yeah, it also included the MMX instruction set, I know, I know...)

      In summary: within processor families you can compare, outside processor families you are pretty much SOL.

      Besides, I know I'm going to sound like someone saying "we have enough processor power", but my primary laptop is a P-III 500MHz mobile with 512Meg PC100 RAM. You know what? That baby runs pretty much everything I throw at it: Windows XP Pro SP2, OpenOffice 2.0.2, Firefox 1.5.0.1, Thunderbird 1.5, AVG Antivirus, PuTTY, Filezilla, Acrobat Reader 7, iTunes6, Quicktime, Media Player classic, Borland Delphi Personal, Eclipse 3.0, Tomcat and The GIMP (but I have to be patient when handling big images). Perhaps not all at the same time (I never tried), but I often run at least a selection of the above. Sure, sometimes I have to wait a few seconds for a program to start, but it's not as if I'm in that much of a hurry.
      If I need more oomph, I just switch to my own AMD Athlon MP 2400+ SMP machine (4Gig RAM) or to my wife's P-IV 2.6GHz Hyperthreading (2Gig RAM). Frankly, that doesn't happen often...

      --
      Ahhh...the great dumpster continuum. Many a free computer will be found there. -- sowth (748135)
  19. Same article again tomorrow please by Anonymous Coward · · Score: 0

    Hey, this is great promotion! I own some INTC stock, can you post this same article again tomorrow as well, and the next day, and the next...

  20. Moore's law isn't a law at all. by Nazo-San · · Score: 5, Interesting

    I just thought it should be stated for the record. Moore's law isn't a definite fact that cannot be disproven. It has been working so well up to now and will for a while yet that it is rather easy to seriously call it a law, but, we shouldn't forget that, in the end, there are physical limitations. I don't know how much longer we have until we reach them though. It could be five years, it could be twenty. It is there though and eventually we will hit that point to where transistors will get no smaller no matter what kind of technology you throw at it. At that point, a new method must be put into place to continue growth. This is why I personally like reading Slashdot so much for articles on things like quantum computing and the like. Those may be pipe dreams perhaps, but, the point is, they are alternate methods that may have hope someday of becoming truly powerful and useful. Perhaps the eventual successor to the current system will arise soon? Let's keep an eye out for it with open minds though.

    Anyway, I do understand a bit about how it all works. OOOE has amazing potential, but, in the end the fact remains that you can only optimize things so much. The idea there is actually to kind of break up instructions in such a way that you can actually kind of multi-thread a task not originally designed for multi-tasking. A neat idea I must say, with definite potential. However, honestly, in the end the fact remains that you will run into a lot of instructions that it can't figure out how to break up or which actually can't be broken up to begin with. If they continue to run with this technology, they will improve upon both situations, but, in the end, the nature of machine instructions leads me to believe that this idea may not take them far to be brutally honest.

    Let's not forget that one of the biggest competitors in the processors that focus on SIMD is kind of fading now. Apple is going to x86 architecture with all their might (and I must say I'm impressed at how smoothly they are switching -- it's actually exciting most Apple fans rather than upsetting them) and I think I read they no longer will even be producing anything with PowerPC style chips, which I suppose isn't good for the people who make them (maybe they wanted to move on to something else anyway?) At this point it's looking like it's more and more just the mobile devices who benefit from this style of chip, which is primarily just due to the fact that between their lack of need for higher speeds and overall design to use what they have efficiently, they use very little power and do what they do well in a segment like that.

    Multi-threading, however, is a viable solution today and in the future as well. It just makes sense really. You start to run into the limitations as to how fast the processor is going to run, how many transistors you can squeeze on there at once, power and heat limitations, etc. However, if you stop at those limits and simply add more processors handling things, you don't really have to design the code all THAT well to take advantage of it and keep the growth continuing in its own way. I can definitely see multicore having a promising future with a lot of potential for growth because even when you hit size limitations for a single core you can still squeeze more in there. Plus, I wonder if multicore couldn't work in a multi-processor setup? If it can't today, won't it in the future? Who knows, there are limits on how far you can go with multi-core, but, those limits are further away than single core by far and I really feel like they are more promising than relying on smart execution on a single core running around the same speed. In the end, a well designed program will be splitting up instructions on a SMP/multicore system much like the OOOE will try to do. While the OOOE may be somewhat better at poorly designed programs (ignoring for a moment the advantages that multithreading provides to a multitasking os since even on a minimal setup a bunch of other stuff is running in the background) overa

    1. Re:Moore's law isn't a law at all. by drachenstern · · Score: 1
      Google search for powerpc shows that they are IBM chips. Just thought you might like to know. They're not exactly looking to get out of the processor market, and I am even under the impression that they use PPC in their datacenter style servers, etc. Just so's use-all knows, s'all 'm sayin'.
      Plus, I wonder if multicore couldn't work in a multi-processor setup?
      Well, I work for a major computer manufacturer (think top 3, they also make very nice printers [market leaders you might say]) and the Enterprise class servers that we build all have multicore multiprocessor setups ([these three phrases should go without saying] - as of now, it seems like, and only since dual-core processors showed up on the market a few years back, seeing as how we have a custom Intel-hybrid processor and all). Granted these are machines which retail at the bottom for about 5-6 grand without many added parts or more mem or more proc, but the point I'm making is they all run WinXXXXXXXXX (fill in your own blanks here) as well as Linux and (need I say it) (our own custom) Unix (variant).

      Also, Intel has been on the multi-core bent for the past, oh, dozen years or so, more like twenty.

      So here's the thing, OOOE on die is more about having multiple loops segments that have different maths that all execute the same way. Most games would have that, Excel or Word would not. The reason why some things are faster is that they flat out ask the processor to do more calculations. And more cache per processor is generally a Good Thing[tm], but cache communication between cores is also good in some respects. AMD's architecture really allows for that more, I believe, because of the general construction of their memory pipe.

      Once again, from my particular vantage point, the AMDs are strong, but the Intels seem to be stronger still.
      --
      2^3 * 31 * 647
    2. Re:Moore's law isn't a law at all. by Nazo-San · · Score: 1

      Only thing I'm really in any disagreement at all about is the popularity of the PowerPC processors today (not even a year ago, but, specifically today.)

      It sounds like you're advocating the oft-mentioned point that games are the main thing that will benefit. Well, this is true, but, there are some business or non-gaming oriented things where people will see the differences as well, and these shouldn't be discounted either. Firstly, we're going to need those things like MMX I guess. MS is determined that one day Windows will be prettier than a nice sunset across the ocean scene complete with enough bubbles and other such crap that definitely wastes CPU power (gee, wouldn't you be so surprised to hear that I disable the XP theme service on my system?) They already promise even worse with Vista. Actually, frankly linux is following basically the same sort of course since it's trying to compete directly with Windows, and with less hardware support for video (especially for us poor ATi users -- I've just completely given up on trying to get the ATi proprietary drivers set up to where they work right, it's too much for us more amateur types) it probably actually relies more on the CPU overall for processing that interface right now.

      Besides useless waste on GUIs so average joe farmer can feel more relaxed while Outlook is quietly rebooting his computer without warning, there are some more legitimate uses that will see the benefits. For example, servers can do more CPU limited things, such as a web server adding compression support (individually zip compression streams are practically off the scale of a modern processor, but, when you add hundreds if not thousands of requests, databases, and other such things all going at once, it adds up.) Not to mention tasks that are even more CPU limited than gaming like encoding or video processing. I can safely say that I've seen more of a hit on my processor as far as heat and power consumption as well as how background processes start to act -- again, single core -- when I fire up the latest anime encode with my ffdshow set to resize to 1440x1080 for my CRT with denoising added for good measure. Whereas games haven't really pushed things nearly so hard most of the time. But, it's not a surprise to hear this, right? I mean, I have a real video card, an X850XT-PE and I've actually seen less benefit from overclocking my CPU in games than I did in ffdshow. In fact, on my old mobile Barton I saw this more firsthand where setting my memory asynchronous (and this was on a nforce 2 system, so that caused a latency war the likes of which causes games to jerk like insane) so my CPU could overclock even further benefitted ffdshow, but, lowering the CPU down to meet the memory left games running smoothly even though the CPU was no longer running so well. In truth, I honestly believe gamers actually aren't seeing the benefits they assume they'll get by running to buy a X2 4800+ or whatever considering that I really don't think any games have been CPU limited on my San Diego (let's call it 4100+ when factoring in the overclock.)

      I certainly have been pleased at how much faster encoding things goes (especially now that I have a small portable DVD player so I like to encode some of my old backups into low-res DVDs and fit half a season onto one DVD-R for watching as long as I have time for when away. Those encodes run slightly faster than realtime, and for MPEG2 that's rather decent, though it's obviously nothing compared to what those lucky people with the S939 Opterons that can hit 3.2+GHz probably would get.) Actually, I tend to upgrade my CPU more for the benefits on my anime watching than for gaming even though I play games as much as I watch anime. Now that I've hit a point where ffdshow can manage all I want in realtime and I can encode at speeds I'm still not used to I don't even want a X2 yet, nor the competing Intel dual cores. However, if I were to pick one, it would definitely be the AMDs due to that L2 thing. I really doubt the shared memory gets taken advantage of a lot in the long run.

    3. Re:Moore's law isn't a law at all. by Kupek · · Score: 2, Informative

      OOOE has amazing potential
      Which has been realized for about the past 20 years. Exploiting Instruction Level Parallelism (which requires an out-of-order-execution processor) has gotten us to where we are today. We're reaching the limits of what ILP can buy us, so the solution is to put more cores on a chip.

      It may be possible to integrate OOOE into a multicore.
      It is possible, and every single Intel multicore chip has done it. Same with IBM's Power5s. For general-purpose multicore processors, that is the norm.

    4. Re:Moore's law isn't a law at all. by drachenstern · · Score: 1

      Didn't mean to give the impression that I thought that the multi setups were better for games, I don't game that often [wait for collective slashdot sigh to dissipate]. I personally would rather see better multi-threaded application support, however, IIRC, the big programs out there are: Adobe, Autodesk (Engin minor) and SAP, etc. So now we're left with a bunch of not-necessary-that-they-run-at-all programs which may or may not be multithreaded, and programs like Word or Excel that it really wouldn't be helpful that much more often to have them be multi-threaded, although if they're not now, I expect version 12 will be (since it's designed for the system that is designed for multi- support from the manufac (Vista)). And you mentioned servers needing the better processor support, but in truth, wouldn't the servers be better served [no puns intended] by using one of the low-voltage server chips from either vendor? I agree that we need powerful servers and so-so desktop machines for the general public, however muscle cars sold well for a reason, and now the "HEMI"s on the market are making a profit, even though the engines look like a disgrace next to a REAL HEMI. Just my $.02 on this one though.

      I also do not use the XP themes; I too consider them to be a waste of precious CPU/mem.

      PowerPC has its place, but since this site is merely an Intel-is-better-than-AMD-is-better-than-Intel site, I will bow out of this before a flamewar gets started.

      Just thought I would clarify on those points, as I am still running a 450-k6/2 as my main devel (since it's not on the 'net) and my laptop for everything else. The laptop's not even over 2.0 anyways, so I'm not exactly racing for speed. Although any donations on a dual-xeon-multicore workstation are gladly accepted :D.

      --
      2^3 * 31 * 647
  21. Power6 delivers where Intel has failed... by KonoWatakushi · · Score: 1
    Not likely on the Intel side, but IBM has made good progress with the Power6. They have managed to keep the pipeline at 13 stages, while clocking it at 4-5GHz. This is in contrast to the P4 with its 31 stage pipeline and much higher power consumption. It seems that Intel has given up prematurely, or perhaps their process technology/ISA are not as amenable to such optimizations.

    Now, frequency isn't everything, but performance scaling is nearly linear if you hold the pipeline depth constant. (And scale the bandwidth, which has also been done..) For more information about Power6, take a look at:

    http://www.serverpipeline.com/showArticle.jhtml?articleId=180200700&pgno=1

  22. dupe tagging solution! by hobotron · · Score: 2, Insightful



    Alright, mod me offtopic, but what if /. just took the beta tags and, if "dupe" showed up after a certain number of tags (or however they calculate it), had the story minimize to the non-popular story size that's in between main stories? I don't want dupes deleted, but this would be a simple solution that would get them out of the limelight.

    --
    There is truth in humor.
    1. Re:dupe tagging solution! by Anonymous Coward · · Score: 0

      Submit that as a suggestion rather than as a thread they may never read in a topic they are less likely to read many in anyway (thanks to it being a dupe.)

    2. Re:dupe tagging solution! by AcidPenguin9873 · · Score: 1

      Here's the real problem: I'll bet that Arstechnica pays Slashdot to have their article linked from the main page. Doing what you suggest would remove the article summary from the main page, and Slashdot would lose a revenue source.

  23. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  24. So basically what your saying is... by Psiven · · Score: 1

    AMD = Pwned?

  25. SIMD going out of style? by Rezonant · · Score: 1

    No way Jose, SIMD isn't going out of style at all. What do you think the SPE:s of the Cell processor do best? SIMD. What did Intel put a LOT of resources into in its new Core architecture, theoretically doubling the speed of this part? SIMD. What is it that makes it possible for a PII300 to decode DivX, or a P4 3GHz/Athlon64 2GHz able to decode video in HDTV resolutions? SIMD. It wouldn't stand a chance with just regular scalar instructions. MMX/SSE2 are essential.

    1. Re:SIMD going out of style? by Nazo-San · · Score: 1

      You misunderstand me. I mean major processors fully relying on this sort of method such as the PowerPC. Of the things you mentioned, only the Cell is actually a truly modern thing (which, btw, I hear is basically a PowerPC style chip.) Instructions like MMX are definitely useful, but, does the processor rely on them almost exclusively? You see, the almost pure SIMD processors run far slower and rely on getting a lot of stuff done at once while the x86 architecture we're so used to runs blazing fast and gets by only doing a few things at once and sometimes even dragging down to just one thing at a time. Each system has its ups and downs really, and I don't really know why the SIMD focused processors seemed to be getting less popular lately in the non-mobile/embedded fields.

      PS. Decoding many video formats like DivX, if I recall correctly, doesn't actually get to take advantage of CPU optimizations very much. Oh, I'm sure there's a little, but, generally speaking, decoding video is going to be more a matter of just raw processing. Oh well, I don't have any benchmarks handy, though I can point out that I have personally played some relatively high resolution DivX files on a Pentium 2 running at 166, 233, and 266MHz (I just underclocked to 166 for support for old DOS games that crash on a too fast system, so when I remembered I'd switch to 266MHz for things like video watching. 233 was its stock speed, so I tested that first. And yes, I have an unlocked P2 chip even though it wasn't "hacked" to be unlocked.) P2 definitely lacks SSE2. Supposedly GeeXboX can play DVDs on a 400MHz Pentium 2 in fact, all without SSE2 (since I have only the 266MHz P2 and a 500MHz P3, I can't directly verify this, but, I can say that user posts in the forums would seem to support this.)

      Actually, if you want to see something interesting, take a look at the GeeXboX requirements yourself. Note that for the Macintosh they say only a G3 is actually required (albeit with a strong recommendation for a G4.) I haven't exactly searched terribly extensively, but, I see that they have Macintosh G3s running at 266MHz with a 66MHz bus (same as that P2 come to think of it.) Yet that thing is supposed to be able to manage DVD playback (albeit probably with occasional skips) when the P2 stuttered like crazy. No, I don't deny for a second that they have advantages when properly used.

  26. WHY PREVIEW??? by drachenstern · · Score: 1

    The initial "Google search" in the above should be Google search

    --
    2^3 * 31 * 647
  27. Intel people! LOOK HERE, good marketing idea by Anonymous Coward · · Score: 0

    Hey Intel peeps, listen to this idea. (Hope this is the right place to blabber about this.)

    OK, I have read from sources that the 2 reasons for making the Celeron and the A series (btw, the 300A, great processor guys) are to give people a not-so-costly performance solution and to compete with the low end market of AMD. So far it has worked.

    For a gamer, the only really logical processor (unless you have gobs of money to get a Xeon or a P3) is a Celeron: it's fast, it's Intel, hey, it works! BUT, now there's the new PIII, with the KNI/SSE (whatever you're calling it) SIMD/SIMD-fp instructions, which have been said by people from Intel to improve the speed of 3D and voice recognition software by as much as 30%. That leaves a great deal of heavy gamers wanting a PIII, but it's just too darn expensive! ($80-$160 vs $550-$750) The solution? Celeron SSE!

    Basically the Celeron version of the Pentium III: throw it at speeds of 400-500, price it at $180 - $275 and it will sell big. I have a Celeron processor and I love it, I play games heavily, and reading about all the SSE and its performance boost per clock speed makes me want to get a PIII, but it's too expensive to really consider making the means to get that kind of money. Now, a Celeron SSE = P3 core, 128k on-die full speed cache (or more!), the same PIII SSE/KNI/whatever, on a slot-1 PCB, and you have a big contender for the 'lower end SIMD fight' (see K6-III). I have no doubt a 'Celeron SSE' would beat the socks off a K6-3, since the K6-3 has on-die full speed cache and 3DNow! (their weaker SIMD) and is priced around $260 for the 450MHz. Clearly the PIII beats it, but it's not really what most gamers will get. The Celeron SSE would be a dream come true for many people, driving many more next-gen Intel processors.

    Aaron (a gaming enthusiast/hardware junky)

  28. Lots of n^2 was changed to n in submission. by JollyFinn · · Score: 1

    Submission changed n^2 complexities to n complexities.
    It's register rename, choosing which instruction goes next, etc. that increase as n^2 when the core changes.

    --
    Emacs is good operating system, but it has one flaw: Its text editor could be better.
    1. Re:Lots of n^2 was changed to n in submission. by fishybell · · Score: 1
      Your sig, while perhaps being factually correct, is extremely misleading.

      High risk in medical terminology means a statistically significant risk higher than average. This means that 1 in 6 babies have a risk that is outside the margin of error. Most likely this means that 1 in 6 babies have a 1 percent chance of brain damage. So roughly 0.16 percent of babies actually have some form of brain damage that can be attributed to coal pollution.

      I do agree that 1 out of every 600 babies damaged by pollution is a lot and that that number should be reduced to 0, but please don't try to spread misinformation. Give actual facts, not vague percents of percentages.
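
      For anyone checking the numbers, here is the multiplication spelled out. Both inputs are just the figures used in this comment (1 in 6 at "high risk", and a guessed 1 percent chance within that group), not data from any study.

```python
from fractions import Fraction

# Assumed inputs, taken from the comment above rather than from any source:
# 1 in 6 babies "at high risk", and a guessed 1 percent chance within that group.
at_risk = Fraction(1, 6)
chance_given_risk = Fraction(1, 100)

# Share of all babies affected under those two assumptions.
affected = at_risk * chance_given_risk
print(affected)
print(f"{float(affected):.2%}")
```

      That works out to 1 in 600, i.e. a bit under 0.17 percent.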

      --
      ><));>
  29. Japan Leads.. by Anonymous Coward · · Score: 0

    Japan's little known microprocessors, win hands down on energy consumption, as some were designed so. VIA added a lot, to come up with something reasonable. However, AMD knows what the market wants, and worked out a decent tradeoff mix whilst keeping x86.

    I would like to see the day where microcode can be loaded into processors, like the old days. However, trimming code bloat, to say match Z80, 8Mb desktop solutions, would go a long way.

  30. IA-64 is an in-order execution processor by Anonymous Coward · · Score: 0

    Not OOOE.

    1. Re:IA-64 is an in-order execution processor by homer_ca · · Score: 1

      Correct. It's a VLIW architecture that depends on the compiler to optimize the order of instructions to keep all the execution units busy.

  31. Its Wintel vs. Lamdix? by Anonymous Coward · · Score: 0

    Sounds to me like Intel will run generically compiled Windows code better whereas AMD may run specifically-compiled Linux code better. Guess which one will generate more "I am faster" headlines and make more money?

    This is by far Intel's most aggressive, and possibly best, design ever. They've taken their cue not to ignore performance per cycle from AMD, while keeping and extending much of what is good in their processor design.

  32. OOOE was 'adopted' in the Pentium Pro in 1995 by Anonymous Coward · · Score: 0

    Out Of Order Execution was first adopted by INTEL in 1995 when they released the Pentium Pro, their '6th Generation' architecture. INTEL's been using OOOE in all of their x86 CPUs since then.

  33. this thing looks a lil impressive to me by Anonymous Coward · · Score: 0

    well lemme see :3
    according to this page, conroe can do 6 fp instructions per cycle, and with simd that gives us 4 numbers per fp instruction, or 24 fp calculations per cycle. at 3 ghz this means 72 gflops per core, and as the conroe has two cores, this means 144 gflops, or even 288 if you count that quad core stuffs

    cell can only do 210 gflops, with no branch prediction, out of order execution or 4 mb L2 cache as Conroe has

    poor poor cell, beaten by an x86 before even being put into commercial use
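
The back-of-the-envelope math in the comment above can be written out as a short sketch. The inputs (6 FP instructions per cycle, 4-wide SIMD, 3 GHz) are the poster's own figures, not verified numbers, and the function name `peak_gflops` is made up for illustration. The result is a theoretical peak only and says nothing about sustained throughput.

```python
def peak_gflops(fp_instr_per_cycle, simd_width, clock_ghz, cores=1):
    """Theoretical peak GFLOPS: instructions/cycle * values/instruction * GHz * cores."""
    return fp_instr_per_cycle * simd_width * clock_ghz * cores


# Figures taken from the comment above (assumptions, not measured values).
per_core = peak_gflops(6, 4, 3.0)           # one core
dual = peak_gflops(6, 4, 3.0, cores=2)      # two cores
quad = peak_gflops(6, 4, 3.0, cores=4)      # hypothetical quad-core part

print(per_core, dual, quad)
```

With those inputs the three values come out to 72, 144 and 288 GFLOPS, matching the numbers in the comment.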

  34. Only on slashdot would that have been insightful... by JollyFinn · · Score: 2

    The poster obviously hasn't designed any CPUs, nor does he know about the physics related to semiconductor design.
    He's a programmer who doesn't need to think about those things.
    n^2 or n^3 algorithms (in terms of power and area) are used in MOST parts of the core. So when the guy recommends that in the next generation, instead of having 4 cores, we have a single core, he is suggesting one core that is twice as wide as one of those 4 cores.
    A large fraction of code is pointer chasing, and a large fraction of code has ILP equal to or lower than 1. There are just too many data dependencies.

    It's just like cache latency, which depends on the cache's size rather than on whether it is L1 or L2 or whatever. Physics says that drive strength and distance are what matter. The same happens inside the core: when you have quadrupled the core size you need to drive all the instructions and data around, like Prescott. You spend huge resources moving instructions around in pipeline stages instead of doing computation, and you have to, since the distance between different parts of the core is so much bigger now. The register renamers take a lot more die area and have longer latency, and so do the out-of-order queues. Then there are latencies between ALU instructions, and the latency INSIDE the logic selecting which instruction goes next grows, since the number of locations each instruction in the queue can go to has gone up, and the number of instructions in the queue has gone up too. The bad part is that the logic isn't linear; it's an n^2 algorithm in terms of width, so it's width^2 * queue depth. So with his recommended doubling he gets 8 times the area and power consumption there.
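
To put numbers on the scaling argument above, here is a minimal sketch assuming the cost model stated in the comment (area and power of the select logic proportional to width^2 * queue depth). The function names are made up for illustration, and the model ignores every other part of the core.

```python
def scheduler_cost(width, queue_depth):
    """Relative area/power of the issue-select logic under the
    width^2 * queue_depth model from the comment (arbitrary units)."""
    return width ** 2 * queue_depth


def relative_cost(width_factor, depth_factor):
    """How many times more expensive the logic gets when the issue width
    and queue depth are each scaled by the given factors."""
    return scheduler_cost(width_factor, depth_factor) / scheduler_cost(1, 1)


# Doubling both the issue width and the queue depth:
doubled = relative_cost(2, 2)
print(doubled)  # 2^2 * 2 = 8 times the baseline
```

Under this model, four baseline cores cost 4 units of select logic in total, while one core twice as wide and twice as deep costs 8, which is the point the comment is making.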

    Of course, trying to educate masses of programmers here is a futile attempt; there are plenty of people who know nothing about the costs of doing something, proposing solutions that people who have designed CPUs for 20+ years have probably already dismissed as infeasible.

    --
    Emacs is good operating system, but it has one flaw: Its text editor could be better.

  35. Not based on PM by Groo Wanderer · · Score: 1

    The Merom/Conroe/Woodcrest cores are NOT based on the PM, aka Banias/Dothan/Yonah. Even a cursory look at the architectures, from pipeline depth to functional units, will show they are totally different.

    Who keeps perpetuating this stupidity, and when did we as a culture lose the ability to look past shiny things shown to us by guys in lab coats? The cores rock, the previous cores rock, they are not the same.

    Just because Merom is more like the PM than the P4 means all of squat.

              -Charlie

  36. If you want a serious answer.......... by Anonymous Coward · · Score: 0

    A few factors. First, as one poster pointed out, the Israeli team has lots of experience designing low power processors. This started when they began designing system-on-a-chip processors (a project which was ultimately cancelled), while the guys in the US were concentrating on Netburst. They were able to build up a team with some really good people AND experience, which means a lot for a design team.

    Secondly, the Technion. Imagine if, in the US, Intel was located next to MIT, and had a job-sharing program where they hired the best and brightest (while paying for their education). That's what you get at the Israel Development Center in Haifa, for which the Technion is the major supplier of new grads to Intel. These guys are damn smart - and Intel has first dibs on them.

    Thirdly, Intel gets massive tax concessions from the Israeli government - so you can put a fab and a development center in the same country barely larger than New Jersey. Makes a massive difference in development time, etc.

    Fourth, the Israeli/Jewish culture stresses education and the Israeli guys are damn aggressive. In the Intel culture they've moved up like crazy because Intel has a "constructive confrontation" policy - a fairly aggressive work environment in which they do well, and once they get into c-level positions - can move work to Israel, where they have counterparts and are comfortable.

    Lastly, a huge number of the IDC guys are US-educated and raised. They would be in hot demand anywhere, but they want to live in Israel (for the sake of raising their families there, mostly), and Intel is the biggest game in town. So where do you go when you want to go back to Israel?

    If anyone else has worked at Intel or in Haifa, feel free to add or correct. And I agree with another poster - your comment was close to being racist, although I'm sure you didn't mean it that way....