AMD Showcases Quad-Core Barcelona CPU

Bit Slice by flyingfsck · 2007-02-09 19:37 · Score: 1

Things have come a long way since the heady days of bit slice processors. The first microcode I wrote was for an XOR operation - I could not think of anything simpler, that would actually do something useful...

--
Excuse me, but please get off my Pennisetum Clandestinum, eh!

But SSE is already 128 bits! by pammon · 2007-02-09 19:39 · Score: 1

Anyone know what "SSE128" means? SSE registers have been 128 bit from day one.

Re:But SSE is already 128 bits! by Zenki · 2007-02-09 19:45 · Score: 5, Informative

SSE+ operations up until now were operated on 64 bit at a time within the processor. SSE128 just means the new AMD chip will complete a SSE instruction in one pass.

This was pretty much the reason why most people only bothered with MMX optimizations in their applications.
Re:But SSE is already 128 bits! by larrystotler · 2007-02-09 19:50 · Score: 5, Informative

When Intel first added SSE to the Pentium 3 chips, they did it with a 64bit setup to save die size on the then 350nm parts. Even when they moved to the newer smaller designs, they left it that way. The Core2 was the first chip to incorporate a single issue SSE engine. Therefore, with the Core2, it loads the instruction, then executes it. With the other chips, you have to load the first part (if it's a full 128bit instruction, or if it's multiple instructions added together), save, load, save, add, execute. This is where the Core2 kicks butt. I've been saying that the Barcelona would move to that design, since it's the biggest reason Intel has been beating AMD in the benchmarks. This will re-level the playing field. There have been lots of articles about this. Google it.
Re:But SSE is already 128 bits! by pammon · 2007-02-09 19:54 · Score: 1, Interesting

SSE+ operations up until now were operated on 64 bit at a time within the processor
Hmm...do you mean specifically on AMD's hardware? That stopped being true for Intel starting with the Core, which has 1-cycle latency on SSE instructions.
Re:But SSE is already 128 bits! by adam31 · 2007-02-09 20:54 · Score: 2, Interesting

With the other chips, you have to load the first part (if it's a full 128bit instruction, or if it's multiple instructions added together), save, load, save, add, execute.

Please explain this. Do I understand correctly that you think some SSE instructions are 16 bytes? Issuing is one thing, and latency another. In most cases I've found AMD/Intel can issue 1 mulps/shufps/adds per cycle, the *ss instructions at 2 per (AMD sometimes 3 per cycle). If you mean that only the first 64-bits, 2 components, are computed in a cycle and the next 2 components in the next cycle, okay. Except that vmx does 4 component multiply-add in a single cycle, which would mean that SSE sucks at its GHz.
Re:But SSE is already 128 bits! by waaka! · 2007-02-09 22:00 · Score: 5, Informative

Hmm...do you mean specifically on AMD's hardware? That stopped being true for Intel starting with the Core, which has 1-cycle latency on SSE instructions.
Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency. Most of these instructions still take 3-5 cycles to generate results, which is similar to the Pentium M, but now a vector of results finishes every cycle, instead of every two or four cycles.

An important consequence of this is that if your instructions are poorly scheduled by the compiler (or assembly programmer) and the processor spends too much time waiting for results of previous operations, the advantages of single-cycle throughput mostly disappear.
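
The latency/throughput distinction above can be sketched with a toy cycle count. This is a rough model, not a description of any real core: the function name, the 4-cycle latency, and the instruction counts are assumptions chosen only for illustration.

```python
def cycles_needed(n_ops, latency, issue_interval, dependent):
    """Cycles to finish n_ops identical instructions on an idealised pipeline.

    latency        -- cycles from issue until the result is available
    issue_interval -- cycles between issuing two independent instructions
                      (1 means single-cycle throughput)
    dependent      -- True if every instruction needs the previous result
    """
    if n_ops == 0:
        return 0
    if dependent:
        # A dependent chain cannot overlap: each op waits for the one before.
        return n_ops * max(latency, issue_interval)
    # Independent ops overlap; the last one issues at (n_ops - 1) * interval
    # and completes `latency` cycles later.
    return (n_ops - 1) * issue_interval + latency


if __name__ == "__main__":
    for interval in (1, 2):
        for dependent in (False, True):
            print(interval, dependent, cycles_needed(100, 4, interval, dependent))
```

In this model, halving the issue interval roughly halves the time for independent instructions, while a fully dependent chain takes the same number of cycles either way.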
Re:But SSE is already 128 bits! by pammon · 2007-02-09 22:33 · Score: 4, Informative

Core2 has single-cycle throughput on most SSE instructions, not single-cycle latency
Well, certainly you won't be able to get a square root through in one clock cycle, but many/most of the simple integer arithmetic, bitwise, and MOV SSE instructions on the Core 2 really do have single cycle latency. source. None do on the AMD64, which supports the theory that SSE128 means more "new for us" than "new for everyone." Not to put AMD down - many of the other features sound promising (but the article is long on breathlessness and light on details, alas).
Re:But SSE is already 128 bits! by Lost+Race · 2007-02-10 09:03 · Score: 1

SSE first appeared in the Katmai (that's why SSE was also known as "KNI" or "Katmai New Instructions") which was produced in a 250 nm (0.25 micron) process. 250 nm was already pretty mature when the Katmai came out so I doubt they ever targeted the design for 350 nm production.

Re:Honestly... by mabinogi · 2007-02-09 19:43 · Score: 5, Insightful

I don't care if it's 65nm, 45nm or 10mm - that's a completely irrelevant (to me as a user and purchaser) implementation detail. I care about the results - how fast is it for my workloads? How much is it? How much power does it use?

Obsession about process size is sillier than obsession over clock speeds.

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

--
Advanced users are users too!

Dethrone? No. by NXprime · 2007-02-09 19:44 · Score: 2, Insightful

"Will AMD dethrone Intel again?" Dear AMD, meet Larrabee. http://www.theinquirer.net/default.aspx?article=37548 AMD might kick Intel in the nuts a little but definitely not dethrone.

Is dethroning Intel the point? by Weaselmancer · 2007-02-09 19:50 · Score: 5, Insightful

As long as AMD and Intel continue to chase each other in the x86 market, high end chips become low end in the span of six months. Just keep buying 6 months behind the press releases and you get great processors for next to nothing.

--
Weaselmancer
rediculous.

Re:Is dethroning Intel the point? by Anonymous Coward · 2007-02-09 20:56 · Score: 1, Funny

Ahh, I just spent $985 on a QX6700. In 6 months you're going to love it!

Huh? by sumdumass · 2007-02-09 19:59 · Score: 1

It is labeled a quad-core Opteron, but according to Infoworld's Tom Yeager, is really a redefinition of x86.

I don't get the surprise or disappointment here. It appears that the submitter thinks x86 isn't an Opteron or something. As far as I know, the Opteron is the same thing, i.e. an extension to the x86 that can handle 64-bit extensions.

Am I missing something or am I completely wrong?

Re:Honestly... by tedpearson · 2007-02-09 20:05 · Score: 1

... and you can wake me up when the processors themselves are smaller than a square millimeter. I need them for my new thumb-top computer design.

Well.... by Creepy+Crawler · 2007-02-09 20:06 · Score: 2, Interesting

Keeping in scientific fact, how much heat has to be generated for 1 MIPS?

The fact is, absolutely none. It has been shown that only the destruction of information via AND and like instructions creates entropy (heat). As long as you use only 3 types of gates (pass through, not, xor), you can create a heat-free CPU. Provided we do want to check for bit errors, we could maintain a very low heat via ECC-like checking. Estimates on that are 10^8 lower than present.

We could keep 98% of our efficiency of current day chips if we switched to this method.

--

Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37

Re:Well.... by Gori · 2007-02-09 20:16 · Score: 1

That sounds very interesting. Would you mind providing a link to the literature that discusses that? I have some trouble figuring out the thermodynamics of this. Perpetuum mobile and such, you know....

--
Complexity is a measure of our ignorance...
Re:Well.... by DimGeo · 2007-02-09 20:43 · Score: 1

If memory serves right, you need the constant 1 and xor to form a Boolean basis with xor.
Re:Well.... by Khyber · 2007-02-09 20:48 · Score: 2, Informative

Heat-free? Did you forget the Second Law? Or did you just forget about pure friction itself? Moving ANYTHING is going to involve friction. Nothing moves without SOME force, and friction will happen.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:Well.... by Creepy+Crawler · 2007-02-09 20:48 · Score: 4, Interesting

---That sounds very interesting. Would you mind providing a link to the literature that discusses that? I have some trouble figuring out the thermodynamics of this. Perpetuum mobile and such, you know....

Of course. It, at first, sounds too good, but here you go.

Rolf Landauer showed in 1961 that reversible logic operations could be performed without either using energy or taking heat out. The same could not be said for irreversible logic operations.

"Irreversibility and Heat Generation in the Computing Process" IBM Journal of Research and Development 5 (1961): 183-91, IBM PDF

___

In 1973, Charles Bennett proved that any computation could be derived from purely reversible computing.

Charles H. Bennett "Logical Reversibility of Computation" IBM PDF

___

Later on, Fredkin and Toffoli presented a review of the ideas of reversible computing. The essential idea is that you can save all intermediate states of an algorithm to get the answer, and then reverse the process so that no energy is used and no heat is generated. Fredkin also indicates that if we switched from irreversible to reversible computing, we would expect to lose no more than 1% efficiency.

International Journal of Theoretical Physics 21 (1982):219-53 PDF

___

And as an unsubstantiated claim, I remember hearing that due to heat/radiation sources, that volatile memory gains errors of 1 bit per billion with a time from 1 minute to 1 day ( I forget the exact time). To correct this would only require the entropy of deleting that incorrect bit. In other words, 10^8 or so magnitude heat shrinkage. But trust the stuff above.

(Many of these ideas were taken from "The Singularity is Near" by Ray Kurzweil from page 130)
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Well.... by Gori · 2007-02-09 21:20 · Score: 1

Very interesting indeed. Thanks for the pointers. I'm definitely going to read up on the papers you suggested in greater detail. After a quick scan, they are really interesting.

My research is in complex adaptive systems in a socio-technical setting, e.g. industrial network evolution etc. These things are about as far removed from equilibria and reversibility as they can be. They are all path dependent, and intractable ( as are all evolutionary processes). Think dissipative structures ala Prigogine etc.
So it is truly fascinating to think of a computation that can be fully reversible / adiabatic. Is there no work being performed ? If you reverse the computation, are you still "allowed" to know the answer ?

Goes to show how truly different information is, compared to matter and energy.

What I still do not understand is whether this can be practically built, and if so, why do we not have it already? Global warming and all of that?

Would it require radically different software? I would think so...

Anyway, your comment was one of the most enlightening ones I have ever seen on slashdot. Chapeau !

--
Complexity is a measure of our ignorance...
Re:Well.... by drgonzo59 · 2007-02-09 21:25 · Score: 4, Insightful

And how exactly is your reversible computing going to reduce the resistance of millions and millions of conductors to 0. You are confusing a theoretical issue relating to computer science (and very relevant to quantum computing) with a practical problem of a CPU design. Just moving information around _without_ deleting it will generate heat.

Or did you actually think that those "stupid" CPU designers for all these years, battling with heat dissipation, never thought of, oh.. simply replacing the nand gates with reversible Fredkin and Toffoli gates and 'poof' magically all the heat issues are gone, processors will run @ hundreds of GHz, the world's electrical power consumption will go down and the geeks won't be able to boast about their huge ass sinks anymore...
Re:Well.... by rbarreira · 2007-02-09 22:06 · Score: 1

If you reverse the computation, are you still "allowed" to know the answer ?

As far as I remember reading, outputting answers adds a bit of heat output to the equation, but doesn't prevent you from using reversible circuits.

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Re:Well.... by OrangeTide · 2007-02-09 23:10 · Score: 1

electrical resistance formulas cannot be derived from friction formulas. friction is a macroscopic thing that is the statistical accumulation of many microscopic effects.

--
“Common sense is not so common.” — Voltaire
Re:Well.... by alphamugwump · 2007-02-09 23:22 · Score: 1

I am not a physicist. However, I know that there are other physical processes that do not change entropy, for example, mechanical and some electrical ones. So, I suppose it makes sense that computational processes can be reversible, as all you are doing is twiddling bits anyway. However, just as mechanical systems must be ultimately powered by heat engines, computers must ultimately do IO. Sure, you could use your computer to find the trillionth Ramsey number, but that wouldn't have anything to do with the physical world. I imagine that as IO needs increase, we'll have a giant pile of supporting junk around a single, ultrafast processor. The processor would be relatively cool, but the IO would generate heat, due to the quantum effects of trying to measure stuff. Otherwise, you could get around the second law by predicting the future. ( I imagine that statistical thermodynamics and quantum mechanics must apply to macroscopic stuff too, like human populations). I think you already see this with things that really do generate enormous amounts of data, like particle accelerators. The energy for processing the data is not negligible, but it's a hell of a lot less than the energy to run the accelerator. Conversely, taking 3D data that the computer understands, and flattening it down to 2D without losing very much is also very intensive, relative to the amount of computation that goes into game physics, for example. But this is just me bullshitting.
Re:Well.... by caramelcarrot · 2007-02-09 23:55 · Score: 1

Surely constructing logic-destructive gates out of these non-destructive gates would have the same entropy effect... unless you somehow manage to program entirely in non-destructive functions (probably very wasteful in terms of storage/area)
Re:Well.... by chthon · 2007-02-10 00:16 · Score: 1

Yeah, I think there was an article back in 1993 or 1994 in Byte about such processors. It seems that in practice, the theory doesn't add up.
Re:Well.... by Enrique1218 · 2007-02-10 04:56 · Score: 1

Take it more fundamental than that. Temperature arises from the motion of atoms in a material. Switching transistors the way they do allows those electrons to really get those atoms moving.

--
You don't have to be smart to use a Mac, you just have to be smart enough to buy one
Re:Well.... by Creepy+Crawler · 2007-02-10 05:00 · Score: 1

---So it is truly fascinating to think of a computation that can be fully reversible / adiabatic. Is there no work being performed ? If you reverse the computation, are you still "allowed" to know the answer ?

My best estimate why they aren't being worked upon is that they require different equipment and a drastic change in software and engineering. That would explain the force against researching this.

---Goes to show how truly different information is, compared to matter and energy.

It goes to show that the laws of thermodynamics do play here too, but under the guise of entropy. Bits keep a small amount of heat. A destruction of a bit releases a very small amount of heat. However, with this happening on a 1GHz chip in which timing loops can be multiple times that, you have a very large amount of heat: on the order of tens of watts.
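
For scale, the minimum heat released per erased bit can be estimated from the Landauer bound, k·T·ln 2. The sketch below is only a back-of-the-envelope calculation: the 300 K temperature and the figure of 10^8 bits erased per cycle at 1 GHz are assumptions made up for illustration, not numbers from this thread or from any real chip.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K (exact SI value)


def landauer_joules_per_bit(temp_kelvin):
    """Minimum energy dissipated when one bit is erased at this temperature."""
    return BOLTZMANN * temp_kelvin * math.log(2)


def landauer_watts(erasures_per_second, temp_kelvin=300.0):
    """Lower bound on power for a given rate of bit erasures."""
    return erasures_per_second * landauer_joules_per_bit(temp_kelvin)


if __name__ == "__main__":
    clock_hz = 1e9          # assumed 1 GHz clock
    bits_per_cycle = 1e8    # hypothetical number of bits erased each cycle
    print(landauer_joules_per_bit(300.0))
    print(landauer_watts(clock_hz * bits_per_cycle))
```

Under those assumptions the bound comes out well under a milliwatt, which suggests that most of the tens of watts a real chip dissipates comes from sources other than the erasure limit itself.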

---What I still do not understand is whether this can be practically built, and if so, why do we not have it already? Global warming and all of that?

Well, I cannot see why we don't test it now. FPGAs are simply programmable gates, right? We can hold ourselves to a rule to use only "passthru", "nor", "not" to do an operation. Then we can implement it in the traditional way. Measuring heat from this shouldn't prove too difficult. Doing this should prove beyond a reasonable doubt (along with being repeatable) that it exists and can be made.

---Would it require radically different software? I would think so...

Absolutely. But there's no rule against calling an instruction, say "and", and having it go through a series of reversible operations instead. If I'm reading the strong implications correctly, we could emulate all of our current (chipmaker)-64 instructions via those 3 instructions with appropriate 'glue' together. But then again, this is probably the most difficult part.
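
As a rough illustration of running "and" through a reversible operation, here is a sketch using the Toffoli gate mentioned elsewhere in this thread. Note that this swaps in the Toffoli gate rather than the three gates named above, and the function names are made up for the example. The point is only that the inputs are carried through to the outputs, so nothing is erased and the step can be undone.

```python
from itertools import product


def toffoli(a, b, c):
    """Controlled-controlled-NOT: flips c only when both a and b are 1."""
    return a, b, c ^ (a & b)


def reversible_and(a, b):
    """AND computed without discarding the inputs.

    With the target bit initialised to 0, the third output is a AND b,
    while a and b are passed through unchanged.
    """
    return toffoli(a, b, 0)


if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(a, b, reversible_and(a, b))
    # Applying the gate twice restores the original bits.
    for bits in product((0, 1), repeat=3):
        assert toffoli(*toffoli(*bits)) == bits
```

The cost is the extra wire: every AND needs a spare bit set to 0 beforehand, and those bits have to be kept (or uncomputed) rather than thrown away.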
--
- Mod parent up! by Anonymous Coward (Score:1) Thurs, Nov 31, @13:37
Re:Well.... by jbengt · 2007-02-10 05:31 · Score: 1

I have no doubt that chips are currently much more inefficient than theoretically possible,
but
Thermal entropy is not the same as information entropy. Related as far as the math of statistics goes, if you're thinking about an ideal gas, but not physically the same thing.
Entropy is not heat. Heat is the flow of energy due to temperature differences. Entropy is a measure of the loss of the ability of energy to do useful work. Think of heat flowing until the temperatures reach equilibrium - the energy is not lost, but you can't use it to run an engine without a temperature difference.
Re:Well.... by Mr+Z · 2007-02-10 06:32 · Score: 1

It's not enough to merely limit yourself to NOT, XOR and pass-through, as traditional implementations still destroy information in a way. Traditional gates are made of switches: When you switch the input to an inverter (NOT gate) off, the output switches on by closing a switch to Vdd and opening the switch to ground. Some current flows from Vdd to the inputs of whatever gates the inverter is driving. When you switch the input to that inverter on, the switch to Vdd opens and the switch to ground closes. That charge that had been on the output of the gate driving other inputs now gets drained to ground. That's where a good portion of the power gets burned in a CMOS circuit. The new output destroys the old output.

And then there's other sources of inefficiency. While switching, there's a point where both switches are neither fully open nor fully closed, and you get direct flow-through from Vdd to GND. That adds to the power consumption as well. And finally, the switches aren't perfect. An "open" switch still lets a little current "leak" through. This is referred to as leakage power.

This site has some logic gates that are designed specifically for reversible computation. --Joe

--
Program Intellivision!
Re:Well.... by Mr+Z · 2007-02-10 06:54 · Score: 1

Oops, my bad. That site has abstract circuits and doesn't show the actual transistors. This paper shows some circuits.

--
Program Intellivision!
Re:Well.... by WhoBeDaPlaya · 2007-02-10 09:46 · Score: 1

Naw... he has a material with an infinite mean free path. ;)
Re:Well.... by julesh · 2007-02-10 11:16 · Score: 1

As long as you use only 3 types of gates (pass through, not, xor), you can create a heat-free CPU.

And how would such a CPU go about calculating an arithmetic result, say 5 * 7?
Re:Well.... by aminorex · 2007-02-10 12:37 · Score: 1

That's just wrong. Friction is a statistical feature of ensemble systems. At the level of quanta, electrodynamic systems are frictionless. And one doesn't need to be doing quantum computing (meaning TAQC or GLQC) to compute using quanta: All of our current transistor-based systems are doing so. The only issue is whether they are doing so at a scale that necessarily implies ensemble losses. Today, yes. Tomorrow? Probably no.

--
-I like my women like I like my tea: green-

The proof of the pudding.... by 15Bit · 2007-02-09 20:13 · Score: 1

....is in the eating. Until we see benchmarks everything is just ignorant speculation.

AMD64 is very fast by GreatDrok · 2007-02-09 20:17 · Score: 5, Interesting

In my own benchmarks (generic C integer and floating point scientific code) I have found that the Core Duo and Core 2 Duo aren't all that quick compared with an AMD64. Clock for clock the AMD64 Opterons we have are about 50% quicker than an equivalent Core 2 Duo for integer work. I know this doesn't agree with all the usual magazine benchmarks but they are heavily biased towards using SSE instructions where possible and it is SSE where the Core 2 Duo has been a real improvement over previous Intel designs and also bests the AMD chips. Hopefully, AMD has recognised this and the new SSE implementation will bring them back on par with Intel for these benchmarks but even today an AMD64 processor is a beast and more than a match for anything Intel produces.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:AMD64 is very fast by pjbass · 2007-02-09 20:30 · Score: 2, Informative

Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?

And pointing out that it isn't fair to compare because a Core2 duo already executes the full SSE instruction in one pass vs. the 2 clocks for a current AMD64 is the same as saying it's not fair to compare the on-die memory controller on AMD's vs. Intel's FSB. But people didn't seem to care when the numbers went in AMD's favor.

I'd really be interested in seeing your numbers, your programs, and what compiler options you passed when building on each platform (as well as type of memory, mobo chipset, etc., in each machine).
Re:AMD64 is very fast by GreatDrok · 2007-02-09 20:54 · Score: 5, Informative

"Care to publish your numbers that debunk all the other hardware sites that are typically AMD-biased anyways?"

OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best).

AMD Opteron 2.0Ghz (HP wx9300) - 205 Million calculations per second
Intel Core 2 Duo 2.66Ghz (Mac Pro) - 146 Million
Intel Core Duo 2.0 Ghz (MacBook Pro) - 94 Million
IBM G5 PPC 2.3 Ghz (Apple Xserve) - 81 Million
Motorola G4 PPC 1.42 Ghz (Mac mini) - 72 Million
Intel P4 2.0 Ghz (Dell desktop) - 61 Million
Intel PIII 1.0 Ghz (Toshiba laptop) - 45 Million

Interesting things about these numbers. The Core Duo is clearly a close relative of the PIII since the performance at 2Ghz is roughly twice that of the PIII at 1Ghz. The P4 at 2Ghz is really very poor indeed which isn't a huge surprise as it was never very efficient. The G4 PPC puts in a reasonable result easily beating the much higher clocked P4 (what, the Mac people were right? Shock!) although I have to say that the performance of the G5 is disappointing. The Core 2 Duo isn't a bad performer although it does have the highest clock speed of any processor in this set but it is seriously beaten by the Opteron. From these numbers, a Core 2 Duo at 2Ghz would be about half as quick as an Opteron at the same speed.
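
The clock-for-clock comparison can be sketched by dividing each result above by its clock speed. This assumes performance scales linearly with clock, which is only an approximation; the variable and function names are made up for the example.

```python
# (label, clock in GHz, million calculations per second) from the list above
RESULTS = [
    ("AMD Opteron", 2.0, 205),
    ("Intel Core 2 Duo", 2.66, 146),
    ("Intel Core Duo", 2.0, 94),
    ("IBM G5 PPC", 2.3, 81),
    ("Motorola G4 PPC", 1.42, 72),
    ("Intel P4", 2.0, 61),
    ("Intel PIII", 1.0, 45),
]


def per_ghz(results):
    """Million calculations per second per GHz, assuming linear scaling."""
    return {name: mcalc / ghz for name, ghz, mcalc in results}


if __name__ == "__main__":
    normalised = per_ghz(RESULTS)
    for name, value in sorted(normalised.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {value:6.1f}")
    ratio = normalised["Intel Core 2 Duo"] / normalised["AMD Opteron"]
    print(f"Core 2 Duo vs Opteron, clock for clock: {ratio:.2f}")
```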

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by GreatDrok · 2007-02-09 21:12 · Score: 1

"perhaps you need to write some more cache efficient code to test with. goto BLAS can feed the beast like no other."

goto BLAS uses SSE so doesn't count. It has already been acknowledged that the SSE implementation of Core 2 Duo is very good. The new AMD chips may address this but we won't know until we see the benchmarks. For non-SSE the Core 2 Duo is a little better than the Core Duo which was similar to the PIII/PII/PentiumPro clock for clock. The current Opteron is much quicker clock for clock for non-SSE work. Also, my test code is really just integer code and it works mainly in registers and uses very little memory so it is quite a good test of the raw performance of a CPU which is why I like it.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by timelessroguestar · 2007-02-09 21:48 · Score: 1

The P3 you list looks like a Coppermine; I suspect a P3 Tualatin would perform much better. At least every test and regular usage that I did seemed to indicate it was 33% faster than the Coppermine. It's interesting to see that the Core Duo is on par with a Coppermine; I can't see how they could have gone so wrong there. My own testing of the Core 2 seems to indicate it is about 74% faster than the Tualatin clock for clock. My user experience doesn't seem to deny it. I'd expect to see a 2.66 GHz Core 2 closer to 275 Million, which would be similar to the Opteron you list. I'm not sure what's wrong with this picture, but something doesn't add up here.
I'm not sure how valid any of our anecdotal tests are. It's very hard to stress a chip like the Core 2 properly. The task manager will report it at 100% utilization and it will reach X degrees of heat, but if you run something like the Thermal Analysis Tool (TAT) it will go way beyond that in temperature increase at 100% utilization, using all branches and transistors I would guess.

--
Timeless Rogue Star - Defile Convention - Transcend Time, Life, the Universe, and Everything.
Re:AMD64 is very fast by GreatDrok · 2007-02-09 22:03 · Score: 3, Informative

"The P3 you list looks like a Coppermine; I suspect a P3 Tualatin would perform much better."

Pretty sure it is a Tualatin since it is a 1Ghz PIII Mobile which I bought in early 2002 (http://www.theregister.co.uk/2001/01/31/chipzilla_readies_1ghz_mobile_piii would seem to support this).

Given that it is a Tualatin, the performance of the Core Duo at 2Ghz looks about right. From all the blurb I have read, the Core 2 Duo gets about 10% better performance clock for clock, except when it comes to SSE, where it is about twice as fast. So the performance figure of 146 million also looks pretty much on the mark: a 2Ghz Core 2 Duo should be able to manage about 110 million if you scale the figure for clock speed, and that is (surprise) ~10% quicker than the Core Duo at 2Ghz (94 million). So the basic integer performance of the Core 2 Duo is better than the Core Duo but doesn't compare with the 205 million the 2.0Ghz Opteron manages.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by kestasjk · 2007-02-09 22:11 · Score: 1

I hope the benchmarks don't get an advantage out of using 64-bit arithmetic.

--
// MD_Update(&m,buf,j);
Re:AMD64 is very fast by jez9999 · 2007-02-09 22:16 · Score: 1

Any numbers for an Athlon 64? I just bought a 3800+ single core and would like to be made really excited about it. :-P

Also which of these chips are single, and which are dual, and which are quad cores?

What's the point of dual and quad core, anyway? Anyone figured out why it's better than just having 2/4 CPUs?

--
== Jez ==
Do you miss Firefox? Try Pale Moon.
Re:AMD64 is very fast by waaka! · 2007-02-09 22:19 · Score: 3, Insightful

OK. I can't give you the code but it is my own implementation of a pretty standard bioinformatics sequence comparison program which doesn't use SSE/MMX type instructions and is single threaded. On all platforms it was compiled using gcc with -O3 optimisation. I have tried adding other optimisations but it doesn't really make much difference to these numbers (no more than a couple of percent at best).
When you say you've tried "adding other optimizations," are you referring only to other GCC optimization flags? If your program's algorithms have any moderate degree of parallelism and you haven't tried vectorization either by compiler (GCC and ICC can both do this) or by hand, the benchmark you've done is not unlike a race where no one is allowed to shift out of first gear. Can you go into any more specifics about how this program does sequence comparisons?

Also, the disappointing numbers from the G5 may be partially explained by the fact that its integer unit has higher latency than the other desktop processors in that list. The G5 isn't exactly known for blistering integer performance, anyway.
Re:AMD64 is very fast by NovaX · 2007-02-09 22:22 · Score: 2, Interesting

AMD64 is not a processor, it is an instruction set. So you need to clarify whether you compiled your programs using 32-bit or 64-bit x86 instructions. I am not a gcc user, but I'm assuming that it chooses the default architecture based on your environment settings, thus AMD64 on 64-bit Linux. Since you've included a PowerPC processor, it's really not obvious.

When the Core2 was released, benchmarks made it clear that Intel did not optimize for 64-bit performance. They have the architecture, but they pushed that task to a future refresh. AMD processors receive a significant performance boost in 64-bit mode, which could easily explain your numbers.

--

"Open Source?" - Press any key to continue
Re:AMD64 is very fast by GreatDrok · 2007-02-09 23:10 · Score: 1

"Any numbers for an Athlon 64? I just bought a 3800+ single core and would like to be made really excited about it. :-P"

Pretty much the same as the Opteron in this case. The program doesn't really hammer cache or main memory, just the CPU. Work out your clock speed as a percentage of 2Ghz and do the sums and that should be the number.

The Opteron, Core 2 Duo and Core Duo are all dual core chips in this test, the others single core although the G5 was a dual processor system. Since the program is single threaded there isn't any benefit to extra cores in this case but if you multithread your program you can utilise multiple CPUs and improve performance substantially. At the moment, the sweet spot seems to be dual core as modern operating systems are happy working with multiple CPUs. A quad system is really only necessary if you are maxing out a dual core system and need to run more tasks at once.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by High+Hat · 2007-02-09 23:11 · Score: 1

What's the point of dual and quad core, anyway? Anyone figured out why it's better than just having 2/4 CPUs?
Easy: Multiprocessor/core is not an "optional" or "high-end" feature in the multicore architectures, thus making SMP a commodity feature, which in turn results in lower prices on CPUs and Mainboards per processing core.
The other big advantage is shorter signal paths, which are faster - synchronizing processors that run at a couple of GHz that are connected by wires longer than a centimeter or so can be pretty challenging because of clock propagation issues. Also, shorter signal paths are much more energy efficient.
Re:AMD64 is very fast by GreatDrok · 2007-02-09 23:12 · Score: 1

"I hope the benchmarks don't get an advantage out of using 64-bit arithmetic."

Nope, straight 32 bit. If it had been 64 bit then the Core 2 Duo would also have seen a more significant boost versus its 32 bit predecessor not to mention the G5 should have been better than the G4 which it wasn't.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by GreatDrok · 2007-02-09 23:19 · Score: 1

I should say that this program was written a very long time ago originally. It implements an efficient but standard Smith and Waterman dynamic programming algorithm. I have done vectorisation of this algorithm in the past and the performance improvement was dramatic (about x20). With this test program though, it hasn't really benefited from extreme compiler optimisations. I do remember running it after compiling with ccc on an Alpha and seeing a 30% speedup so there is definitely room for improvement but most of these benchmarks are with gcc on x86 which isn't too bad. OK, not icc but I don't have access to that.
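
For readers who haven't met the algorithm, a minimal sketch of Smith-Waterman local alignment scoring follows. This is not the program benchmarked above: the scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions for illustration, and it returns only the best score, not the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Keeps only two rows of the dynamic programming matrix, so memory is
    proportional to len(b). Gap cost is linear (no affine gaps).
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "ACGT"))
```

The inner loop is all integer adds, compares, and maxes on small values held in a couple of rows, which fits the description of a test that mostly exercises the CPU rather than memory.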

As you say, the G5 was never really good at integer work but its floating point performance isn't too bad. If only HPaq hadn't killed Alpha. *sigh* What a chip.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by NovaX · 2007-02-10 00:09 · Score: 1

No, Core2 has worse 64-bit performance.

--

"Open Source?" - Press any key to continue
Re:AMD64 is very fast by ben+there... · 2007-02-10 00:58 · Score: 1

Intel Core 2 Duo 2.66Ghz (Mac Pro) - 146 Million

Where did you get a Mac Pro with a Core 2 Duo?

Should be LGA-771 2-socket Xeon Woodcrest, and not fit a LGA-775 C2D, right?
Re:AMD64 is very fast by jcupitt65 · 2007-02-10 01:21 · Score: 1

I have some benchmarks too:

http://www.vips.ecs.soton.ac.uk/index.php?title=Benchmarks

Again, plain C code, no SSE/whatever. It is threaded, which makes it slightly different. The source is there too.

Results:

Opteron 850, 2.4 GHz, 4 CPUs, 4.5s
Opteron 254, 2.7 GHz, 2 CPUs, 6.9s
P4 Xeon (64 bit), 3.6 GHz, 2 CPUs (4 threads), 7s
Core Duo, 2.0 GHz, 2 CPUs, 18.1s
P4 Xeon (32 bit), 3.0 GHz, 2 CPUs (4 threads), 19.7s
P4 (Dell desktop), 2.4 GHz, 1 CPU, 36.6s
PM (HP laptop), 1.8 GHz, 1 CPU, 58.5s

So I agree: an Opteron beats a Core Duo by about a factor of two. I blame a combination of gcc, the extra registers in 64-bit mode and AMD's better FPU.
Re:AMD64 is very fast by pdwalker · 2007-02-10 02:38 · Score: 1

My AMD64 servers with the older 1.6 GHz AMD CPUs outperform my 2.4 GHz Xeon CPUs by a factor of 2 to 1 consistently when running our applications under Tomcat.

There really is a huge difference in performance between the machines.
Re:AMD64 is very fast by ocbwilg · 2007-02-10 02:40 · Score: 1

What's the point of dual and quad core, anyway? Anyone figured out why it's better than just having 2/4 CPUs?

It's better than just having 2/4 CPUs because you can now get dual CPU functionality on consumer-level mainboards. You get SMP without having to shell out for workstation or server level hardware. Of course, if you do have workstation or server boards with 2 or 4 CPU sockets on it, then you can put dual or quad core CPUs in those sockets as well. So instead of having 2-way SMP with 2 sockets you can have 4-way or 8-way SMP with those two sockets.

For example, back in the day if you wanted a 4-way x86 box from HP you needed to get a DL580 or DL740 server. They were big (4-6 rack units tall), drew a lot of power, and were very expensive. Now you can get a pair of dual core CPUs in a DL360, their 1U "entry level" server, which costs a lot less. You can also get dual or quad cores in their 2 socket bladeservers. So you can get a lot higher density for less money, so if you are performing compute intensive tasks (or even if you're not) you can do a lot more with less.

Another advantage comes down to software licensing costs. I know that when buying CPU licenses for Microsoft products they count sockets and not cores. So if you need a SQL server with four cores running SQL Server Enterprise (which is like $30k per CPU for processor licensing) you can get a quad core CPU and only buy a single $30k license, versus 4 single cores and $120k in software licenses. Performance should be fairly similar. And the same goes for many other software vendors (VMware is another one that licenses per socket, not per core).

So basically you get more performance for less money all around.
Re:AMD64 is very fast by RightSaidFred99 · 2007-02-10 04:55 · Score: 1

Good god, how did this get modded as informative? "This just in, random poster redefines reality, AMD64 really faster than Core 2 Duo regardless of the tons of real world application performance data which completely contradicts this!!!"
Please people, get a grip. This guy's little application does tons of random memory reads. This is the one area where the Opteron still kicks ass because it has an IMC. The number of applications where this is useful is fairly small, and it's been known for a long time.
There's nothing "interesting" about this, it's just some fanboi claiming that performance in his application domain is applicable to general computing applications people care about, and it's not.
Re:AMD64 is very fast by GreatDrok · 2007-02-10 05:28 · Score: 2, Informative

"This guys little application does tons of random memory reads"

If only that was the case but actually it is very linear. The application can hold the whole of its memory requirements in cache these days so it hardly has to touch main memory and it was designed to do all the inner loop code using only registers. Heck, I doubled the size of the inner loop just to avoid a single register copy because it made a significant performance increase.

The reason I like this code is that it shows how many operations you can expect a chip to achieve when it isn't having to wait on main memory. It is an extremely compute intensive application with very little I/O. If it really was about random memory reads then I would be inclined to agree with you but it isn't, it loads blocks of memory into cache and chews through them linearly.

I am pretty processor agnostic. I did really like the Alpha but that is dead now and that is a real shame. I also object to being called a 'fanboi' as I personally don't own any AMD kit since I generally prefer Macs which means regardless of what you might think all my personal machines are PPC or Intel. If anything though, my application is an example of general computing applications in that it doesn't use any SSE tricks to increase performance. It is just code that anyone could write in C and compile on whatever machine they have to hand so the performance is pretty real world. Sure, I've spent a bit of time right in the core making the thing efficient but that is where the program spends 99% of its time so not doing so would be stupid.

How's this for something to make your head spin, we were benchmarking some Java code written by someone else the other day and found that Java under Windows XP Pro on one of these Opterons was no quicker than it was on my G4 1.5Ghz PowerBook but the same app under Linux on the same Opteron was 4x quicker. The guy running the machine under Windows is a Java developer and now wants Linux installing and will use Windows via VMware in future. Also interesting, the 2.66Ghz MacPro was about 30% slower than the Opteron under Linux running the same bytecode but still faster than Windows running the same code. Not my 'little application' but still seems to follow the same trend which I thought was interesting. Apart from the Windows thing. No idea what is wrong with Java under Windows unless Sun did it deliberately which wouldn't surprise me.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:AMD64 is very fast by midnighttoadstool · 2007-02-10 07:05 · Score: 1

I just compressed my camera's avi output to wmv with "Microsoft Movie Maker", it used both cores on my core 2 duo, and halved the time it usually takes. Since it can take several minutes for each movie I was quite pleased.
Re:AMD64 is very fast by pjbass · 2007-02-10 08:27 · Score: 1

One of the things that makes a big impact on performance, on any platform, is the type and speed of memory used. Looking at your list of platforms above, I see an HP Workstation used with the Opteron. Not having one in front of me to verify, but reading on HP's website what chipset and memory is available, you could get a very distinct increase in performance simply due to lower memory latency through the chipset and memory type.

What chipset(s) and memory were used in the Macs? Were they on par with a workstation-level machine from HP?

The G5 doesn't surprise me. The underlying architecture, a consumer version of the Power5, is a damn good architecture. But it needs code compiled to use what makes it fast, which is the altivec stuff. I remember watching a G4, right after it was released, smoke anything on the market with RC5 number crunching *only* if you ran the RC5 passes with altivec enabled. If you ran it with a standard optimization, it got crushed by the Tualatin P3s we had in the labs.

Bottom line is anyone can post benchmark results, but when they don't have a level playing field for their hardware, it's comparing apples to oranges. If the code has any specific code optimizations for any of the processor's respective strengths, then it's apples to oranges. If you post numbers like this, either be prepared to lay out what you specifically used (hardware, software), or be prepared to have long discussions asking for these things. This is why the hardware magazines on the net always use the same application with very well-published hardware setups, so they don't get questions and scrutiny like we have here. Apples to apples is what people like to see. Other data is interesting, but is subject to questions.

Cheers!
Re:AMD64 is very fast by MemoryDragon · 2007-02-10 10:07 · Score: 1

OK, thanks for the interesting result; an IBM Power5 would be more interesting in this comparison. But your Core Duo/Opteron comparison is somewhat flawed in its logic: you are basically comparing the single-core performance of those processors. If you split the problem up into multiple threads then the results will look entirely different, and that is what multiple cores are about: as many threads and processes as possible without significant slowdown. Given that modern operating systems host 20-100 processes, each of them having several threads, multicore can make an impact, also for scientific computing, although I think good vectorization units will go a longer way.
Re:AMD64 is very fast by MemoryDragon · 2007-02-10 10:15 · Score: 1

No, but after reading this thread, I came to the conclusion that most people in here are completely clueless about multiprocessing and multithreading, and that is what it is all about. If you dump one thread on a single-core machine and on a multicore machine, you will probably get better results on the single-core machine. If you raise the thread count, the single-core machine will reach its peak relatively early, while the multicore machine will reach its performance peak much later, and that is what it is all about. Of course this horizontal increase in performance will not go up linearly due to other factors, like memory, bus bandwidth, collisions in memory access (program-wise often marked as critical regions or semaphores in the code), and physical disk access (always the performance killer #1), but it will increase way more and will max out way later than on a single-core machine. This is highly interesting for scientific computing, where data can often be split into different domains which can be calculated at the same time; normally the limiting factor in this area is network or bus bandwidth.
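One common way to put numbers on "will not go up linearly" is Amdahl's law. That is an addition here, not something measured in this thread, and the 10% serial fraction below is an assumed figure purely for illustration.

```python
def amdahl_speedup(cores, serial_fraction):
    """Ideal speedup on `cores` cores when `serial_fraction` of the work
    cannot be parallelised (Amdahl's law). Ignores memory, bus and disk
    contention, so real machines will do worse than this.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumed workload: 10% serial, 90% perfectly parallel.
for n in (1, 2, 4, 8):
    print(n, "cores:", round(amdahl_speedup(n, 0.10), 2))
```

With that assumed 10% serial part, four cores give roughly a 3.1x speedup rather than 4x, and the gap widens as cores are added.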
Re:AMD64 is very fast by Ecuador · 2007-02-10 11:09 · Score: 1

About two years ago, my professor asked me to run benchmarks using our lab's codebase to decide what kind of cluster server we were going to buy. I ran two kinds of benchmarks: bioinformatics programs in C and text processing programs in Perl. It was clear that the Athlon/Opteron 64bit architecture was the way to go, with 80% better performance than Intel's offering at the time. Then Dell slashed their already much lower prices (compared to AMD-based HP and Sun) to less than half of their first offer, and my comments that the Opteron platform would be upgradable and easier to feed with power were disregarded by the purchasing department. Obviously I was right: the 3.4 Xeons required new air-conditioning and power installations, and thus lay disconnected for months. It is not that they used more power and produced more heat per CPU, but they came in twice the numbers since they were slower and cheaper.

Anyway, after that anecdote I just ran one of the bioinformatics benchmarks on some of my office's machines. It is a single-threaded edit distance implementation whose code I cannot post. But it is simple floating point with -O3 the only flag used. The results are:

Core 2 Duo 6400 @ 3.2GHz 3.308s
Athlon 64 512KB 3500+ @ 2.31 GHz 4.236s
P4 Northwood 512KB @ 3.2GHz 6.54s
Xeon 2MB Gallatin @ 3.4GHz 9.222s

Note the first two are running beyond spec (OC'd) and are running on SUSE 64bit (and compiling @ 64bit). The other two were not 64bit capable CPUs. This supports the parent's results, more obviously if we calculate the inverse of time per clock cycle and normalize, i.e. we calculate theoretical performance if they all ran at the same clock speed and give speed index 1 to the slowest of the bunch. We have:

Athlon 64 3.20
Core 2 Duo 2.96
P4 Northwood 1.50
Xeon Gallatin 1
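For anyone who wants to check the normalization, it can be reproduced from the clock speeds and times listed above. The dictionary layout and function name are just for illustration.

```python
# (clock in GHz, run time in seconds), as listed in the results above
results = {
    "Athlon 64": (2.31, 4.236),
    "Core 2 Duo": (3.2, 3.308),
    "P4 Northwood": (3.2, 6.54),
    "Xeon Gallatin": (3.4, 9.222),
}


def per_clock_index(results):
    """Speed index at equal clock speed, with the slowest chip set to 1."""
    # GHz * seconds is proportional to the clock cycles the run took.
    cycles = {name: ghz * secs for name, (ghz, secs) in results.items()}
    slowest = max(cycles.values())
    return {name: slowest / c for name, c in cycles.items()}


index = per_clock_index(results)
for name, value in sorted(index.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {value:.2f}")
```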

There you go. Now I could start playing with flags, but this is enough proof that for some things (especially in 64bit I guess) the Athlon 64 still leads Intel. However, the Core 2 Duo is a much better overclocker, which does defeat any AMD advantage, as seen in the non-normalized results. Also, I tried the text processing benchmark, in which AMD had a huge performance advantage as well, but the Core 2 Duo was about 30% faster at the same clock speed.

I really hope AMD comes up with an ace. Firstly because I don't really like Intel (like most almost monopolistic companies their behavior is neither pro-consumer nor pro-market). Secondly because competition is great! Do you guys remember the era before the Athlon, how much a non-celeron Intel CPU cost? Compare that to the great-at-last Core 2 Duo! (If they only had a better sounding name).

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
Re:AMD64 is very fast by Sterling+Christensen · 2007-02-10 11:45 · Score: 1

Those machines are probably using different types of ram...
Re:AMD64 is very fast by complete+loony · 2007-02-10 13:52 · Score: 1

Ah, that explains a lot. In all the benchmarks I've seen, the AMD processors have more raw CPU power and more raw speed when accessing memory, but the Intel chips have a far more effective memory cache. AMDs seem to have better numbers when everything fits into their cache, or when none of the work fits into Intel's cache.

--
09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
Re:AMD64 is very fast by akuma(x86) · 2007-02-11 08:39 · Score: 1

I would suggest using ICC instead of GCC if you want performance out of either Intel or AMD. Frankly, gcc isn't going to get you anywhere near icc.

It's not just SSE performance where the Core-2 shines. It's also integer. The fastest Core-2 SPECint is 50% faster than the fastest AMD SPECint. ICC was used. This makes sense because the Core-2 is a 4-wide issue with instruction-fusion (which makes it look effectively even wider) - whereas the Opteron is only 3-wide issue.

It makes very little sense that the Opteron would outperform on integer code.

Re:Honestly... by Khyber · 2007-02-09 20:27 · Score: 1

I would care. POWER CONSUMPTION/EFFICIENCY. If I want a space heater, I'll stick with a 3.4 GHz P4 with HyperThreading. I DON'T WANT ONE. As it is, for what I like doing and for what I want to do, current-gen processors work just fine. I can play my games, make my music, draw shit, upload data, and check out sites like this, while maintaining my bank account, talk with other people, and more, at the same time. I got over the clock speed thing the second I actually owned a G3. Granted, Windows emulation sucked balls, but what was made for that computer, it once again, did what I wanted it to do. Sometimes better, sometimes worse. YMMV. If anyone wants bragging rights of ANY sort, the technology to shrink the die-size is the way to go. Go read a book called "Nanotime," IIRC. THAT is what I'd like to see, minus superdense C4 that's equivalent to a mini tacnuke, and those insane driving laws and shit.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:Intel's Responds by pchan- · 2007-02-09 20:31 · Score: 4, Interesting

"Lets make a Octa-core processor!"

Oh, here's one. Though it's been out since before Intel had quad-core chips.

Inevitable question by DiamondGeezer · 2007-02-09 20:33 · Score: 1, Funny

But will it run Linux?

--
Tubby or not tubby. Fat is the question

Re:Inevitable question by OrangeTide · 2007-02-09 23:13 · Score: 1

I think I would cry if the answer was "No".

--
“Common sense is not so common.” — Voltaire
Re:Inevitable question by Anonymous Coward · 2007-02-09 23:17 · Score: 1, Funny

But will it run Linux?
.. And this in the one place that "Imagine a Beowulf cluster of these things!" would actually make sense!

Re:Dethrone? No. by robinvanleeuwen · 2007-02-09 20:35 · Score: 2, Interesting

No, but a good, hard, well-aimed, holding-nothing-back kick in the nuts can leave them impotent,
so they'll have to do some ugly procedures to survive it in the long run. A couple of identical
blows in the meantime could leave them sterile, so if the current setups begin to die out
and Intel has no more babies waiting, they will not be dethroned, but will be getting
an honourable mention in the history books.

--
If you don't like my sig then don't read it.

I neglected to mention something else... by Khyber · 2007-02-09 20:35 · Score: 2, Interesting

Clock speed doesn't mean crap anyways. It's all in the code. I see guitar tuning programs for the computer... TEN megs in size, slow as hell, and inaccurate! I believe APTuner is FAR smaller than most, faster and far more accurate. People just don't know how to code, plus the fastest ways to code are copyrighted, which they shouldn't be since they'd be utterly obvious to any programmer with that standard "ordinary" knowledge in that language, so one has to make workarounds that inevitably end up being slower. No more oldskool hacker ethic, now it's greed.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Re:I neglected to mention something else... by cduffy · 2007-02-09 21:08 · Score: 1

the fastest ways to code are copyrighted, which they shouldn't be since they'd be utterly obvious to any programmer with that standard "ordinary" knowledge in that language

An individual implementation can be copyrighted. A way of doing something can't be covered by copyright, and needs to be patented. That's what you meant, right?
Re:I neglected to mention something else... by lazarus+corporation · 2007-02-09 21:52 · Score: 1

Correct, so long as you're only talking about the US and not about Europe, where software patents don't restrict innovation (yet).

--
the lazarus corporation
Re:I neglected to mention something else... by Echnin · 2007-02-10 03:49 · Score: 1

EU != Europe

--
Lalala

Re:Honestly... by mabinogi · 2007-02-09 20:38 · Score: 4, Insightful

45nm is not inherently "better" than 65nm any more than 3Ghz is inherently "better" than 1Ghz. A smaller process size is a means to an end, it's not an end in itself.

The end is the delicate balance of improving performance / watt while increasing overall performance and keeping the price down. If AMD can deliver a chip that does a better job of that at 65nm than an Intel 45nm one, then the AMD chip is not somehow "worse" than the Intel one just because it doesn't use 45nm. That's just stupid.

I'm not saying AMD can do that, but I think that criticizing them for not being ready for 45nm yet is more than premature.
AMD's actually guilty of the same flawed logic though - their criticism of Intel's 4 core processor being just 2 dual cores stuck together is just as pointless. It doesn't matter; what matters is how well the processor meets the requirements of its target market.

--
Advanced users are users too!

GPU not CPU - Re:Dethrone? No. by dave1g · 2007-02-09 20:50 · Score: 2, Insightful

Read the article: that is an x86 GPU; it wouldn't be able to compete with general purpose CPUs.

Re:GPU not CPU - Re:Dethrone? No. by Fweeky · 2007-02-10 12:37 · Score: 1

"what the fuck is an x86 GPU???"

Roughly what Cell is for PowerPC; take an x86 core, rip out most of the OOE stuff and the whacky big caches, beef up the SIMD side, put a few dozen on a die with a really fast interconnect and there's your x86 GPU/Stream Processor.
Re:GPU not CPU - Re:Dethrone? No. by dave1g · 2007-02-11 12:35 · Score: 1

from the article: http://www.theinquirer.net/default.aspx?article=37548

"What are those cores? They are not GPUs, they are x86 'mini-cores', basically small dumb in order cores with a staggeringly short pipeline. They also have four threads per core, so a total of 64 threads per "CGPU". To make this work as a GPU, you need instructions, vector instructions, so there is a hugely wide vector unit strapped on to it. The instruction set, an x86 extension for those paying attention, will have a lot of the functionality of a GPU."

To answer your question more directly: an x86 GPU is just what it sounds like, a processor whose main purpose is to produce graphics. It also happens to use the x86 instruction set along with its various extensions and probably a few graphics-specific extensions.

In this same sense I can make an x86 DSP, an x86 sound card, etc. Now it might not be very good at those functions, but x86 is just a description of the ISA being used, nothing more.

Re:If its true by Anonymous Coward · 2007-02-09 20:54 · Score: 5, Funny

Three quad cores for the pasty-nerds under the sky,
Seven for the WoW-nerds in their halls of stone,
Nine for Diablo Men doomed to die,
One for the Dark Nerd on his dark throne
In the Land of Silicon where the corporations lie.
One quad core to rule them all, One quad core to find them,
One quad core to bring them all and in the darkness bind them
In the Land of Silicon where the corporations lie.
He paused, and then said in a deep voice,
This is the Master-quad core, the One quad core to rule them all.

Re:wrong section by Khyber · 2007-02-09 20:57 · Score: 1

Who says news can't come from an advertising section? It's still a source of information.....

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.

Wow by 2ms · 2007-02-09 21:03 · Score: 1

I don't think I've ever read such an admiring review of a CPU design. The last time I remember a chip sounding so fantastic was the Alpha or something like ten years ago. If a lot of all the new things really work the way they sound in theory, well then yeah I guess it's evident this Barcelona thing is really going to be something else.

The design for VW performance sounds extra interesting

Re:Honestly... by suv4x4 · 2007-02-09 21:13 · Score: 1

I don't care if it's 65nm, 45nm or 10mm - that's a completely irrelevant (to me as a user and purchaser) implementation detail.

This is Slashdot. We care about those details. You can read more about the "super fast, super cool, super cheap!" market speak on the company's official press releases section.

how do they fit a fourwheeler in the chip? by pseudosero · 2007-02-09 21:17 · Score: 2, Funny

I want a floating quad.

--
sometimes, nothing.

Re:how do they fit a fourwheeler in the chip? by Narishma · 2007-02-09 23:14 · Score: 1

Like this one? http://youtube.com/watch?v=gaFrYZrDWRM

--
Mada mada dane.

SSE128 means... by adam31 · 2007-02-09 21:22 · Score: 1

Does SSE128 mean some significant departure from the doomed SSE instruction set?

I'm not kidding. In the SSE I'm familiar with, one of the input registers is always an output register, which means its contents are destroyed. Another flaw is that there aren't enough registers... SSE uses 8, where 32 are commonly not enough when latency is longish (especially with SoA-style programming, where pragmatically a single vec3 occupies 3 128-bit registers).
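To make the SoA point concrete, here is a rough sketch in plain Python. There is no real SIMD here: the 4-element lists standing in for 128-bit registers and all the helper names are made up for illustration. Four vec3s are stored as three 4-wide lanes, so one batch of vectors ties up three registers, and a dot product of two batches needs six live inputs plus temporaries.

```python
LANES = 4  # four 32-bit floats per 128-bit register


def to_soa(vectors):
    """Transpose four (x, y, z) tuples into three 4-wide 'registers'."""
    assert len(vectors) == LANES
    xs, ys, zs = (list(lane) for lane in zip(*vectors))
    return xs, ys, zs


def mul(a, b):
    """Lane-wise multiply of two 4-wide 'registers'."""
    return [p * q for p, q in zip(a, b)]


def add(a, b):
    """Lane-wise add of two 4-wide 'registers'."""
    return [p + q for p, q in zip(a, b)]


def dot4(u, v):
    """Four dot products at once: three multiplies and two adds, no madd."""
    (ux, uy, uz), (vx, vy, vz) = u, v
    return add(add(mul(ux, vx), mul(uy, vy)), mul(uz, vz))


u = to_soa([(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 2, 3)])
v = to_soa([(1, 0, 0), (0, 1, 0), (0, 0, 1), (4, 5, 6)])
print(dot4(u, v))
```

With only 8 registers, keeping both batches and the intermediate products live at once already gets tight, which is the complaint.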

... or Madd. You know, multiply-add. Does it have that?

Re:SSE128 means... by ja · 2007-02-09 23:37 · Score: 1

According to http://www.theinquirer.net/default.aspx?article=35011 there is nothing breathtakingly "new" here. If you were hoping for dramatic changes like r1 += r2 * r3 or altivec style permute, it is simply not there.

--

send + more == money? ...
Re:SSE128 means... by philthedrill · 2007-02-10 03:27 · Score: 1

Does SSE128 mean some significant departure from the doomed SSE instruction set?

No. It means 128 bit SSE ops can be done in a single cycle instead of two (64-bit chunks).

In SSE I'm familiar with, one of the input registers is always an output register, which means its contents are destroyed

How is this different from regular x86 (non-SSE) instructions? They have two operands where one is a source and destination.

Another flaw is that there aren't enough registers... SSE uses 8

AMD64 specifies 16 SSE (XMM) registers, up from the 8 in IA32.

where 32 are commonly not enough when latency is longish

The trade off to have 32 registers was probably not worth the die space and extra complexity. Having 16 probably gave most of the benefit, and having 32 provided diminishing returns.
Re:SSE128 means... by edwdig · 2007-02-10 04:33 · Score: 1

The trade off to have 32 registers was probably not worth the die space and extra complexity. Having 16 probably gave most of the benefit, and having 32 provided diminishing returns.

At least with the general purpose registers, AMD wanted to go to 32, but couldn't do it without changing the instruction set. I'd assume the same thing applies to the SSE registers.
Re:SSE128 means... by philthedrill · 2007-02-10 04:53 · Score: 1

At least with the general purpose registers, AMD wanted to go to 32, but couldn't do it without changing the instruction set. I'd assume the same thing applies to the SSE registers.

How so? Unless I'm missing something here, I think the only cost is in the size of the register file and rename register set, but nothing ISA-related.

Great by Trogre · 2007-02-09 21:33 · Score: 2, Funny

So now I'll see four penguins at startup!

--
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife

Re:Great by gilboad · 2007-02-09 23:01 · Score: 1

... I'll see 8. :)
(2xSocketF machine).

Barcelona vs Itanium in single and double float ? by tchiwam · 2007-02-09 21:45 · Score: 2, Interesting

What really interests me is how it compares in single and double precision calculations. If AMD gets in the range of Itanium performance, will Intel follow and kill their own Itanium by boosting Core 2 FP?

Re:Honestly... by epine · 2007-02-09 21:49 · Score: 4, Insightful

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

Feature size has dominated progress (as measured either by raw performance or performance per watt) over an unbroken 30 year period. Do you recall the very passionate debates about RISC vs CISC? Did a RISC design at one feature size ever beat a CISC design at the next shrink? I think not. Design has never mattered anywhere near as much as feature size. Not that you can't get design wrong. But then you can get a shrink wrong, too, and end up with 1% yields. AMD managed briefly to remain competitive with Intel playing a full shrink behind when Intel did that rather stupid marketron-driven face-plant into the thermal wall (against good advice from their Israel team, who later came to the rescue with Core Duo).

With the recent skyrocket of leakage current, the holy grail of feature size is somewhat tarnished, but it still dominates the performance curve. You completely missed the relationship between feature shrinks and the performance crown. If Intel has better process technology than AMD (almost always) and AMD has a better design (most of the time since the Athlon was first launched) and both companies shrink every 18 months following the Moore projection (that unbroken 30 year historical trend) and AMD always shrinks 9 months behind Intel, then the performance crown will pass back and forth exactly as often as either company announces their next product.

So I agree with you: feature size has no importance to the customer who wants performance for their dollar. Except that you can set your clock by it and project ten years into the future effective performance levels of shrinks we haven't even seen yet. Except for that part, yeah, I'm with you.

Show us your source code by rbarreira · 2007-02-09 21:54 · Score: 1, Insightful

Well, until you show us your source code those numbers are as believable as anything else one might randomly type here...

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F

Re:Show us your source code by GreatDrok · 2007-02-09 22:08 · Score: 1

"Well, until you show us your source code those numbers are as believable as anything else one might randomly type here..."

I can't because the program is really large and it doesn't entirely belong to me (you know, work for people, they own your code).

You're right, I could just be making these numbers up and if you prefer to believe that then there is nothing I can do to change your mind. All I can say is that this is my own (admittedly anecdotal) experience.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"
Re:Show us your source code by rbarreira · 2007-02-09 22:30 · Score: 1

Actually I was thinking more about benchmarking/coding flaws than lying on your part.

--

The AACS key is NOT 0xF606EEFD628B1CA427BEA93A9CA9773F
Re:Show us your source code by GreatDrok · 2007-02-09 23:06 · Score: 2

"Actually I was thinking more about benchmarking/coding flaws than lying from your part."

Certainly a possibility. In my defense I would like to point out that all benchmarks are open to question. I know my own code, I know what it does and it doesn't do much but it does a lot of it so the performance figures are what they are. I originally wrote this code on an SGI, ported it to Linux on a 486, SPARC, Alpha, PPC and so on. It's old and simple but does real work. While I could make it faster using SSE and have done so with other code, that wasn't the purpose of these numbers. It was simply to see what the processors do using the same code, the same compiler and similar OSs (Linux v OSX in this case). Anyway, the PPC code from gcc is likely not particularly well optimised, especially for the G5, but for the x86 based chips it isn't too bad. All code was compiled 32 bit with just the basic optimisation a C programmer would put. Compiling with -m64 doesn't really help it much and on the Intel chips has been known to reduce performance so I stuck with 32 bit.

--
"I have the attention span of a strobe lit goldfish, please get to the point quickly!"

Re:grandparent was interesting, your comment isn't by drgonzo59 · 2007-02-09 22:32 · Score: 1, Offtopic

...the conversation started on the theoretical limits of computation.

Mr. Coward, in case you have not read the article, the conversation actually is about AMD's new processor, which is a real processor. That processor will generate some amount of heat ... real heat, not theoretical heat.

A few years ago people would've ...

What are you referring to? I never mentioned OOP, relativity, or satalites (sic) in my post.

To be fair, I suspect that the resulting information from the computation itself carries... well... information.

You are a coward but at least you are fair, that's good to know. A fair coward is better than a regular, run-of-the-mill coward, at least in my book. Yes, information carries information. Do you want a Nobel prize for that? Or perhaps an honorary PhD degree?

All good computer scientists know information has a direct corolation (sic) to entropy

And even better computer scientists know how to spell, or at least use a spell checker.

It seems like you need to listen to your own advice and : quit pointlessly spewing shit at people who have useful things to say.

Re:Honestly... by jmv · 2007-02-09 23:02 · Score: 2, Interesting

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

They care. Just moving the chip from 65 nm to 45 nm means you can produce twice as much on the same silicon wafer. Also, if a 65 nm chip performs well, then a 45 nm version of it (with slight modifications of course) will work even better.
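
As a rough check on the "twice as much" figure, here is a minimal sketch of the arithmetic. It assumes an ideal shrink where die area scales with the square of the feature size, which real process shrinks rarely achieve exactly; the function name is made up for illustration.

```python
def ideal_die_count_ratio(old_nm: float, new_nm: float) -> float:
    """Ratio of dies per wafer after an ideal shrink (area ~ feature size squared)."""
    return (old_nm / new_nm) ** 2

# Assumption: the same design, perfectly scaled from 65 nm to 45 nm.
ratio = ideal_die_count_ratio(65, 45)
print(f"{ratio:.2f}x as many dies on the same wafer")
```

Under that assumption the ratio comes out just over 2, which is in line with the claim above.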

--
Opus: the Swiss army knife of audio codec

Re:Honestly... by OrangeTide · 2007-02-09 23:03 · Score: 2, Interesting

Likely Intel has an edge because they are [almost] ready for 45nm process, while AMD is just getting started on 65nm.

But it is interesting to see the two companies approach the problem from different ends. Do you improve the silicon process or do you alter the architecture and instruction set? I bet you the best answer will be to do both.

Quad cores that actually share cache would be nice. These double duals kind of suck because architecturally they can never share cache, although AMD and Intel don't have many dual cores that can even share cache within themselves. (Although I think Intel is releasing one soon?)

--
“Common sense is not so common.” — Voltaire

Re:Honestly... by Dahamma · 2007-02-09 23:05 · Score: 1

You haven't really added much from your original post. Die shrink is an implementation detail, probably something you read that sounded "futuristic"... The real goals are performance, and (usually secondarily) power consumption. Doesn't really matter how they achieve those goals.

I agree with you on one point - I think as with your requirements, the goal for the average non-technical home consumer should be focused more on efficiency than multi-core 64 bit 4MB cache, etc. But not everyone spends 95% of their time browsing the web and typing in their casserole recipes. Try to run a parallel make on a large project and you immediately appreciate a processor with a high clock rate, multiple cores, and huge cache. I don't need to read sci fi to know what I need to do my REAL job.

Re:Intel's Responds by Anonymous Coward · 2007-02-09 23:08 · Score: 3, Informative

8 core (two quad core chips in a single package) is already on Intel's internal roadmaps.

(this was anonymous for a reason)

Re:cool - quad core !! by OrangeTide · 2007-02-09 23:16 · Score: 1

well if your cpu ever gets powerful enough to do some sort of extremely computationally expensive compression (like with fractals or something?). maybe you could squeeze a little bit more out of your slow link?

I like dial-up, nobody can call me (one phone line, disable call waiting), and I really only do IRC and text browsing. Honestly who wants to give the cable company or phone company $50 a month, those bastards are rich enough.

--
“Common sense is not so common.” — Voltaire

Re:Honestly... by l3v1 · 2007-02-09 23:18 · Score: 1

AMD's actually guilty of the same flawed logic though - their criticism of Intel's 4 core processor being just 2 dual cores stuck together is just as pointless. It doesn't matter; what matters is how well the processor meets the requirements of its target market.

I don't think it's about that. Intel quickly pumped out something that looks like a 4-core CPU, which took far less time than developing a new quad-core design, and that makes AMD seem to lag behind. So what can you do to explain it? Not much really, besides what they said above: Intel got there quicker because it's not really a quad-core design. Everyone will have their own multicore-designed CPUs sooner or later, so it won't really matter.

--
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.

32 vs 64 by mrnick · 2007-02-09 23:35 · Score: 1

It depends on what kind of data you are processing. If you are doing 32 bit calculations then you would want to compile your code for 32 bit, assuming your processor can handle it, as most 64 bit CPUs can. If you are using 64 bit calculations then of course the 64 bit CPU would outperform the 32 bit one, as you would have to do additional coding steps to simulate 64 bit on a 32 bit architecture: multiple 32 bit operations with bit shifting and the like.
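
The "multiple 32 bit operations with bit shifting" can be sketched roughly as follows. This is only an illustration of a 64-bit add built from 32-bit halves with a carry; the helper names are invented here, and a real compiler would emit an add followed by an add-with-carry instruction rather than anything like this Python.

```python
MASK32 = 0xFFFFFFFF

def split64(x):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32

def join64(lo, hi):
    """Recombine (low, high) 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo

def add64_with_32bit_ops(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values using only 32-bit-wide results plus a carry."""
    lo_sum = a_lo + b_lo
    lo = lo_sum & MASK32
    carry = lo_sum >> 32                 # 1 if the low halves overflowed
    hi = (a_hi + b_hi + carry) & MASK32  # wraps like a 64-bit register would
    return lo, hi

a, b = 0x00000001FFFFFFFF, 0x0000000000000001
result = join64(*add64_with_32bit_ops(*split64(a), *split64(b)))
print(hex(result))
```

One 64-bit add becomes two adds plus carry handling, which is the extra work a native 64-bit CPU avoids.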

If you took code that was written for 32 bit operations and compiled that code for 64 bit then the results would be very bad. The same issues were present when computers made the move from 16 to 32 bit.

I would assume that the code written in these tests is 32 bit. If so, then when evaluating a CPU one would have to take into consideration whether 64 bit calculations are needed. Also, before jumping the gun and making a rash decision, some research would need to be done to see if your operating system of choice supports true 64 bit operations. Many of the so-called 64 bit operating systems only use the 64 bit CPU to support memory beyond 4GB.

Nick Powers

--

Encryption: I may not agree with what you say, but I will defend your right to encrypt it...

Re:32 vs 64 by NovaX · 2007-02-10 00:06 · Score: 1

You're assuming that the only difference between 32-bit and 64-bit x86 instructions is the bit size. That's not true, and the most immediate gain from AMD64 is the extra registers. There are a lot of changes to the ISA that will dramatically skew the results. The only negative results you would get compiling 32-bit code to 64-bit would be: A) The cache can contain fewer entries; B) Platform assumptions, such as when performing pointer arithmetic, would break. His code is probably fairly clean since he was able to test it on both x86 and PowerPC processors.

I also disagree with you on Operating System support. The HP wx9300 is sold with 64-bit Linux or Windows XP x64, both of which have mature 64-bit support. Microsoft Windows became 64-bit clean with the Intel Itanium series and Linux with the DEC Alpha. Windows 2000, for instance, supports up to 32GB of addressable memory on 32-bit x86 by using PAE.

It's funny you talk about jumping the gun, but then shrug off your assumptions. The entire point is to find out what bad assumptions are being made, since so many people are surprised by his numbers.

--

"Open Source?" - Press any key to continue

Re:Honestly... by zCyl · 2007-02-10 00:01 · Score: 1

Just moving the chip from 65 nm to 45 nm means you can produce twice as much on the same silicon wafer. Also, if a 65 nm chip performs well, then a 45 nm version of it (with slight modifications of course) will work even better.

But how much does this really affect the retail price of a cpu? From randomly googling around, it looks like silicon wafer cost only translates to a dollar or two per cpu, so who cares if they can drop this expense by half? Surely other factors would be more important for cost, such as design expense, manufacturing time, and the cost of the fabrication equipment (which would increase if more manufacturing units are needed due to longer manufacturing time).

Power loss and speed seem to be complicated things, and I really doubt this continues to get better as the fabrication process keeps going to infinitesimally small sizes. It seems to be getting hard to keep the electrons confined at these small sizes, and that's going to have to start having a dominant negative impact at some point.

Re:Honestly... by Charcharodon · 2007-02-10 00:33 · Score: 1

I don't care if it's 65nm, 45nm or 10mm - that's a completely irrelevant (to me as a user and purchaser) implementation detail. I care about the results - how fast is it for my workloads? How much is it? How much power does it use?

The best way of looking at things. I started off with Intel and stuck with them up to about 1GHz, jumped ship and stayed with AMD until my X2 3800, and now I'm back to Intel with a Core 2 Duo 6600. We'll see in 1-2 years who I'll be with next. The same goes for video cards and soda: Pepsi vs Coke, Nvidia vs ATI. It doesn't matter; whichever gives me what I want at the lowest price wins.

Of course companies spend billions trying to convince you otherwise, but all products are just commodities and are practically the same in the end.

Re:Honestly... by Gr8Apes · 2007-02-10 00:37 · Score: 1

I guess you're still running that 4GHz P4 then?

--
The cesspool just got a check and balance.

SUNDAY SUNDAY SUNDAY by un1xl0ser · 2007-02-10 00:41 · Score: 1

Can we start rejecting 'scoops' that sound like a radio/TV demolition derby or monster-truck madness advertisement?

--
v4sw6PU$hw6ln6pr4F$ck 4/6$ma3+6u7LNS$w2m4l7U$i2e4+7en6a2X h

Junk article, full of inaccuracies. by barracg8 · 2007-02-10 00:43 · Score: 4, Informative

Each of Barcelona's four cores incorporates a new vector math unit referred to as SSE128

SSE has always been 128bit (the 64bit SIMD extensions were called MMX). AMD used to funnel the instructions through a 64bit execution unit by splitting the work into two halves; the new core has a full 128bit SSE pipeline so doesn't need to split the operations. Nothing new here, just a faster internal implementation. Can this deliver an 80% improvement in benchmark performance? - quite possibly. Take a look at the Core2 FP performance numbers - it also has a full 128bit implementation of SSE.
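
To make the "two halves" point concrete, here is a toy model, not a description of the actual hardware. It treats a 128-bit SSE register as four 32-bit lanes and counts passes through the execution unit; the function names and the pass counter are assumptions made for this sketch.

```python
def add_lanes(a, b):
    """Element-wise add of two equal-length lane lists."""
    return [x + y for x, y in zip(a, b)]

def sse_add_split(a, b):
    """Model a 64-bit-wide unit: a 4 x 32-bit add is done as two half-width passes."""
    low = add_lanes(a[:2], b[:2])    # pass 1: lower 64 bits
    high = add_lanes(a[2:], b[2:])   # pass 2: upper 64 bits
    return low + high, 2

def sse_add_full(a, b):
    """Model a 128-bit-wide unit: all four lanes in one pass."""
    return add_lanes(a, b), 1

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
split_result, split_passes = sse_add_split(a, b)
full_result, full_passes = sse_add_full(a, b)
print(split_result == full_result, split_passes, full_passes)
```

The results are identical either way; only the number of passes changes, which is why software sees the same instruction set but better throughput.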

And separating integer and floating-point schedulers also accelerates this thing called virtualization

Huh. Hardware virtualization affects how the processor handles certain instructions such as privileged operations. FP instruction execution is unaffected. Virtualized workloads will benefit no more than non-virtualized workloads. Separate issue queues are good but does it specifically benefit virtualization? - no.

Barcelona blacks out power to individual portions of the chip that are idled, from in-core execution units to on-die bus controllers. This hasn't made it into PCs before ...

Intel call this 'intelligent power capability'.
http://www.intel.com/technology/magazine/computing/core-architecture-0306.htm?iid=search&

Barcelona adds Level 3 cache, a newcomer to the x86

Xeons have featured L3 caches for years. http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Barcelona is genius, a genuinely new CPU that frees itself entirely of the millstone of the Pentium legacy.
Barcelona is a new CPU, not a doubling of cores and not extensions strapped on here and there.

Barcelona is an Opteron, with a doubling of cores and some extensions strapped on here and there.

I'm not meaning to detract from AMD here - the fact that they have still not had to make any radical changes to the opteron micro-architecture is a testament to the quality of the original design. They are slightly ahead of the game on virtualization - they're going to beat Intel to nested page tables - but other than that this chip is playing catchup. Overall this is going to be a very nice piece of kit to work with. But nothing radical and new here.

G.

Re:Junk article, full of inaccuracies. by ocbwilg · 2007-02-10 02:27 · Score: 2, Informative

Xeons have featured L3 caches for years. http://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors

Actually, if you go waaaay back to the Socket 7 days you could have L3 cache as well. The AMD K6 and K6-2 CPUs only had on-die L1, and the L2 cache was on the mainboard. But the K6-3 CPU had 256KB or 512KB of on-die L2 and was compatible with the same mainboards. So when you put that K6-3 in a Socket 7 mainboard the mainboard's cache actually functioned as L3. Sure it wasn't on-chip, but L3 cache is definitely nothing new to x86.
Re:Junk article, full of inaccuracies. by ScriptedReplay · 2007-02-10 04:27 · Score: 2, Informative
I fully agree, the article is mainly empty of information - it took words from AMD briefings and produced a meaningless salad.

Now, as far as some claims, in detangled order:
- FPU boost: this seems to be based on several things - one is the obvious widening of SSE2 issues. Others are increasing instruction fetch from 16B/cycle to 32B/cycle, making the FPU scheduler 128bit, unaligned loads and a doubling of cache bandwidth.
- Virtualization: Nested page tables and reduced switching times for the hypervisor.
- Power: CPU and northbridge on separate power planes so they can be in different power modes (clock+voltage); apparently, voltages of different cores are independent as well, so that should give lower power consumption when not at full load (with appropriate MB support) AFAIR this is better than what Intel has, but I might be mis-remembering.
- the extra cache is long overdue and one will have to see whether their way of managing it is smart enough (things like moving data from L3 to individual caches, but sometimes keeping code shared in L3)
There. More content than TFA and shamelessly copied from a 4-month old article for the benefit of all us non-RTFA people. And there's more actually.
Re:Junk article, full of inaccuracies. by Bj�rn · 2007-02-10 07:31 · Score: 1

Here is another article or post that has a relatively long list of the improvements in Barcelona.

--
Never express yourself more clearly than you are able to think. --Niels Bohr
Re:Junk article, full of inaccuracies. by mczak · 2007-02-10 09:59 · Score: 1

And separating integer and floating-point schedulers also accelerates this thing called virtualization

Separate issue queues are good but does it specifically benefit virtualization? - no.

True. Additionally, the article implies this is something new. However, all K8 chips (Opterons, Athlon 64) have always had separate schedulers for float and int instructions (in contrast to the Intel Core 2 chips, so AMD is touting that as an advantage; it's probably more of a design choice than a simple "better" or "worse" for either solution). There is a reason the codename of Barcelona is K8L! As you mentioned, it's certainly not somehow a completely new chip.
Re:Junk article, full of inaccuracies. by Bj�rn · 2007-02-10 10:36 · Score: 1

There is a reason the codename of Barcelona is K8L! As you mentioned, it's certainly not somehow a completely new chip.
From an article in The Inquirer:
"WE'VE BEEN HEARING the "K8L" codename for ages now, but we can say now, straight from the horse's mouth, K8L was never a codename for AMD's upcoming generation of chips."
If we are to believe the article, K8L was apparently the code name for the Turion64 where the L stands for Low-power. K9 was the X2 processors, so that would make the upcoming Barcelona the K10. The article also claims that the differences between the K10 and K9 and K8 are larger than what most people assume.

--
Never express yourself more clearly than you are able to think. --Niels Bohr
Re:Junk article, full of inaccuracies. by mczak · 2007-02-10 11:32 · Score: 1

Ok, so I guess K8L is not an "official" codename for Barcelona. It doesn't change the fact that everybody used it for this chip, and with good reason. Call it K10, but it's still an improved K8 (not that this is a bad thing - it is a good design, improve the areas where it is a bit weak, why reinvent the wheel).

Re:Honestly... by Gr8Apes · 2007-02-10 00:48 · Score: 1

This is way more than a mere quad core design. I was hoping to impart that. It is actually dropping some of the legacy x86 architecture internally, and adding big-iron features - the nested paging per core will be a huge plus for businesses that run multiple CPU machines with lots of virtualization for instance.

The separated schedulers for floating and integer math allows for more parallelism, another speed up.

The shared L3 and reduced-latency L2 caches should put Barcelona ahead of Clovertown's split caches, where only 2 cores share a cache. One of the features that made Core 2 Duos faster was that the cache was shared among all cores.

I think the real heart of the matter is going to be that these CPUs will far outshine Intel's best in multi CPU rigs, especially business type rigs. They should be on par for single CPU gaming machines, although I'm going to hedge and say AMD might be a little faster on games that can use more than 1 core. (Almost none at the moment, I know, but they'll start coming out soon)

--
The cesspool just got a check and balance.

I see FOUR penguins - Gentoo-Luc Picard by Anomalyst · 2007-02-10 01:50 · Score: 1

The STNG episode is similar to Orwell's 1984 where O'Brien, the torturer, shows four fingers before Winston's face, O'Brien increases the pain until Winston says that he sees five fingers, finally, Winston actually imagines five fingers.

--
There is no right to feel safe thru security vaudeville at the expense of everyone's freedom, privacy and tax money.

Re:Honestly... by foobsr · 2007-02-10 01:56 · Score: 1

A quick check reveals that there is more to it than size: IPC, clock rate, wire delay .

CC.

--
TaijiQuan (Huang, 5 loosenings)

Barcelona??? by master_p · 2007-02-10 01:56 · Score: 1, Funny

Rumours have it that their next CPU model will be named 'Real Madrid'...

AMD64 is faster in floating/scientific operations by ubuwalker31 · 2007-02-10 02:00 · Score: 1

Check out this benchmark from an article on extremetech:
http://www.extremetech.com/image_popup/0,1694,iid=146710,00.asp
The actual article:
http://www.extremetech.com/article2/0,1697,2014647,00.asp

This isn't surprising at all. I believe it's been well known that the FX chips crunch numbers faster than the Core 2 Duo chips. Notice how the FX-62 almost ties or beats the X6800 in some tests. However, in the benchmarks that matter for most personal computer users, the Core 2 Duo is the right choice, blending the right amounts of power in the right places. The FX looks like the marathon sprinter in some areas, but the Core 2 is the triathlete in others.

Re:Honestly... by Josef+Meixner · 2007-02-10 02:05 · Score: 1

AMD's actually guilty of the same flawed logic though - their criticism of Intel's 4 core processor being just 2 dual cores stuck together is just as pointless. It doesn't matter; what matters is how well the processor meets the requirements of its target market.

But AMD is right, it is no quad core, it is a multichip package of two dual cores. So calling it "quad core" is pure marketing speak because INTEL is lagging behind AMD again. They simply don't have a real quad core yet. So they use those multichip processors to hide the fact that they are behind and even try to create the illusion that it is AMD which is behind.

Realistically you have to compare an INTEL quad core with an AMD dual processor dual core setup. But that is not fair either, because now the AMD rig has two independent memory systems and so has a big advantage on big data sets. But it also needs much more space. So basically the multichip module from INTEL only saves you space on the mainboard, technically it is a dual processor setup with a bad memory path.

And that could hurt AMD's real quad cores, because if everybody in the marketplace associates quad cores with INTEL's multichip module they might severely underestimate the performance AMD's processor (hopefully) has. And that would hurt AMD's sales.

Re:grandparent was interesting, your comment isn't by balloonhead · 2007-02-10 02:26 · Score: 1

Very aggressive tone you have there. Most likely to get modded flamebait (as you have done) and get people to disregard what you say.

Mr. Coward, in case you have not read the article, the conversation actually is about AMD's new processor, which is a real processor. That processor will generate some amount of heat ... real heat, not theoretical heat.

The conversation might have started out as that, but this thread has gone somewhere else. This is a natural part of any discussion. It does not make it offtopic or irrelevant. Part of the discussion of increasing CPU speeds naturally turns to the increased heat production, and he quite reasonably pointed out some research which, albeit theoretical at this stage, may become more relevant when heat production outstrips cooling capacity for a home computer.

What are you referring to? I never mentioned OOP, relativity, or satalites (sic) in my post.

Do you not understand the example of how theoretical ideas sometimes become working, practical items? Because it is relevant to the discussion at hand (particularly in light of you being a tool about it).

You are a coward but at least you are fair, that's good to know. A fair coward is better than a regular, run-of-the-mill coward, at least in my book. Yes, information carries information. Do you want a Nobel prize for that? O perhaps an honorary PhD degree?

Do you have anything useful to say? Because there's nothing in that rather feeble insult that really has any substance. I can't see the relevance that posting AC has to this discussion one way or another. Criticise opinions, not anonymity (or lack of pseudonymity which is not much different anyway). I don't think that his own admission of making a slightly redundant statement really carried any expectation of winning a Nobel prize. Maybe you want a Nobel prize for making up such impotent arguments? No? Then stfu because it's such a pointless non-statement and it doesn't add anything other than to make you come across as a dick.

And even better computer scientists know how to spell, or at least use a spell checker.

You only made one typo in your post, but it's still one too many when you are criticising other people's. Poor spelling isn't a particularly good way to get a point across, I will be the first to agree. But who cares? It doesn't add or detract from the underlying content.

Your pompous little diatribe however is a little different - as explored above there is no actual content in it other than you having a little wank over your nerdy agenda.

--
This idea was invented by Shampoo.

Re:Honestly... by Targon · 2007-02-10 02:35 · Score: 2, Interesting

The problem I have with performance/watt is that it distorts the true "value" to the system owner. You NEED to break it down, because while power usage is important, the real issue comes down to "is the higher performance WORTH the extra power the chip draws". I personally don't CARE about performance/watt, except when the power draw is excessive, and I believe that is how MOST people will look at it.

Most laptop processors have a higher performance/watt than desktop processors because they are designed with battery life in mind. What people want is a processor that goes faster, but doesn't suck a huge amount of power to get that performance increase. The Pentium 4 got a LOT of flak toward the end with Prescott because the power demand was so far above the benefits that extra power provided. If it were only ten percent more than an Athlon 64 at the time, then no one would have been bothered by it, unless you are talking about a data center where the price for electric power is a very important consideration.

The only reason the whole fab process improvements is even brought up is because Intel is afraid of AMD. Intel has amazing resources when it comes to money and the ability to pay a lot more into their R&D, but in spite of this, AMD was seen as the performance leader before the Core 2 Duo came out, and AMD has the potential to come back and beat Intel again once K8L is released. It goes to show that if you spend some time looking at how to improve the overall system design and how things fit together, performance will go up a LOT.

Re:Honestly... by Waffle+Iron · 2007-02-10 02:53 · Score: 1

it looks like silicon wafer cost only translates to a dollar or two per cpu,

It's not the cost of the wafer itself. It's the fact that hundreds of fabrication steps have to be applied individually to each wafer. So the cost of each fabrication step is roughly multiplied by the number of wafers you need to process. Fewer wafers allow for lower costs.
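
A back-of-the-envelope sketch of that argument is below. Every number in it is hypothetical and chosen only to show the shape of the calculation, and it assumes for simplicity that every fabrication step costs the same per wafer.

```python
import math

def total_processing_cost(chips_needed, dies_per_wafer, steps, cost_per_step):
    """Processing cost when each fabrication step is applied to every wafer."""
    wafers = math.ceil(chips_needed / dies_per_wafer)
    return wafers * steps * cost_per_step

# Hypothetical figures: 100,000 chips, 300 steps, 10 cost units per step per wafer.
cost_before = total_processing_cost(100_000, dies_per_wafer=200, steps=300, cost_per_step=10)
cost_after = total_processing_cost(100_000, dies_per_wafer=400, steps=300, cost_per_step=10)
print(cost_before, cost_after)
```

Doubling the dies per wafer halves the number of wafers that must go through every step, so the processing bill is halved even though the raw silicon was never the expensive part.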

Paging Tables by Doc+Ruby · 2007-02-10 03:16 · Score: 4, Informative

Nested paging tables is a per-core feature that will light the afterburners on x86 hardware virtualization. A paging table holds the map that translates virtual memory addresses to physical memory addresses, and each CPU core has only one. Virtual machines have to load and store their page tables as they get and lose their slice of the CPU. AMD solved the problem with nested paging tables. Simplified, each VM maintains its own paging table that stays fixed in place. Instead of loading and saving paging tables as your system flips from VM to VM, your system just supplies Barcelona with the ID of the virtual machine being activated. The CPU core flips page tables automatically and transparently. This is another feature that's implemented for each core.
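
A very simplified sketch of the idea described above, under loud assumptions: real hardware walks multi-level tables and caches the results in TLBs, whereas this toy uses flat dictionaries, and all class and function names are invented for illustration. It shows the two-stage lookup that nested paging is generally understood to perform (guest virtual to guest physical, then guest physical to host physical) and how switching VMs only means supplying a different ID.

```python
PAGE_SIZE = 4096

class VirtualMachine:
    """Holds one VM's tables; both stay in place across VM switches."""
    def __init__(self, guest_table, nested_table):
        self.guest_table = guest_table    # guest virtual page -> guest physical page
        self.nested_table = nested_table  # guest physical page -> host physical page

def translate(vms, active_vm_id, guest_virtual_addr):
    """Two-stage translation for whichever VM is currently active."""
    vm = vms[active_vm_id]
    page, offset = divmod(guest_virtual_addr, PAGE_SIZE)
    guest_phys_page = vm.guest_table[page]
    host_phys_page = vm.nested_table[guest_phys_page]
    return host_phys_page * PAGE_SIZE + offset

# Two hypothetical VMs using the same guest addresses, backed by different host pages.
vms = {
    1: VirtualMachine({0: 5}, {5: 42}),
    2: VirtualMachine({0: 5}, {5: 77}),
}
print(translate(vms, 1, 0x10), translate(vms, 2, 0x10))
```

Nothing is copied when the active VM changes; the same guest address simply resolves through a different set of tables.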

Context-switching has long been the weakest design point for x86 in "PCs", especially servers. x86 arch is rooted in single-user, single-threaded, single-context apps. The in-core registers that CPU operations execute directly against have to be swapped out for each context switch. In *nix, that means every time a different process gets a timeslice, it's got to execute two slow copies between registers and at best cache RAM, at worst offchip RAM (over some offchip bus). If the register count is larger than the bus width (even onchip), that's another multiple on that slow cycle. That context-switch overhead can be larger than the timeslice allocated to each process's "turn" in the schedule for lower-latency / higher-response (lower "nice") processes, approaching realtime.

Unix was designed for multiusers, context-switching from the beginning. The chips it's run on coevolved with it. Linux arrived when x86 CPUs ran fast enough that context-switching was OK, but still a big waste compared with, say, MicroVAX multiple register sets. Windows architecture is rooted in the x86 architecture that DOS was designed for, though perhaps Vista has finally lost all of the old design baggage originated in the 8088/8086, but its long history of UI multitasking means it's context-switching all the time, which will gain in speed. The MacOS switch to BSD means it's got lots of power bound up in the context switches that could be released with Barcelona.

So while low-level benchmarks might show something like 80% FPU improvement, the high level (application) performance could improve quite a lot more. Recompiling apps to machine code that exploits more registers without the context-switching penalties could find multiples, especially apps with realtime multimedia that run concurrently with other apps. Intel's hyperthreading already gets past some of these bottlenecks in distributing tasks among multiple cores, but the Barcelona paging tables go even deeper, for likely extra performance (on top of Barcelona's own hyperthreading and new L3 cache).

Aside from the marketing "vapormarks" we'll surely see out of AMD (and their sockpuppets) before it's actually released "midyear", I'm looking forward to seeing how this thing really runs in multitasking apps. I'm expecting "like a greased snake across a griddle".

--
make install -not war

Re:Like a repeat all over again... by drsmithy · 2007-02-10 03:41 · Score: 1

Intel comes up with some hare-brained scheme that "More is better!". (like Viagra) They design something new and decide to make it faster (or in this case just glue more of them together). Back in the day it was the "GHz" now it's all about how many "Cores" you got. This tactic seems to suit Intel quite well and dethrones AMD for about a year and a half... During this time AMD massively redesigns their chips to integrate new, emerging technologies. The gamers and server operators of the world sit by their AMD chips knowing that they might not have the fastest chips for the time being but they are more technologically advanced.

It's so cute when kids who have only been watching x86 CPU development for the last few years try to sound authoritative !

Re:Dethrone? No. by Anonymous Coward · 2007-02-10 03:57 · Score: 1, Insightful

Which part of "2009 timeframe for Larrabee" did you fail to understand? Both AMD and Intel have other things in store for 2 years away and it's not clear which is going to be better. AMD seems to be moving in the same direction of many microcores with its ATi GPUs (google for it on inquirer) and currently they have more experienced engineers on it, so it's a bit premature to discount them. Intel on the other hand has to execute a first-gen GPU on par with ATi's. And I'm curious what nVidia is thinking of all this.

Re:Like a repeat all over again... by RightSaidFred99 · 2007-02-10 05:01 · Score: 1

Yeah, that is pretty cute. I know _I_ certainly want to claim that my CPU is more "technologically advanced" and don't really care about performance. I'm sure server operators are the same way. I'd take the most technologically advanced CPU in the world even if it only ran at 80286 speeds, because then I get nerd bragging rights!!

"Hey, boss, we need to buy another 100 machines to support these validation runs. Or we could buy 80 machines of this other brand which will accomplish the same thing and save a lot of money, but they're just not as technologically advanced!"

Quad Core by Gary+W.+Longsine · 2007-02-10 05:36 · Score: 2, Interesting

Actually, from the important perspective of the difficulty of building a new machine around it, the Intel "dual-dual" core chips really are quad core -- they drop into the same socket as the previous dual core chip, placing four cores into the socket. That certainly helped speed the time to market for the chip.

--
If you mod me down, I shall become more powerful than you could possibly imagine.

Very nice... by mr_flea · 2007-02-10 08:11 · Score: 1

...but does it have an IOMMU?

Re:Honestly... by Deliveranc3 · 2007-02-10 09:13 · Score: 1

I don't care if it's 65nm, 45nm or 10mm - that's a completely irrelevant (to me as a user and purchaser) implementation detail. I care about the results - how fast is it for my workloads? How much is it? How much power does it use?

Obsession about process size is sillier than obsession over clock speeds.

If AMD can produce a better performing chip at 65nm, then who the hell cares if Intel - or anyone else - move to a 45nm process?

If you move to a smaller transistor size you get more processors per wafer (thus it gets much, much cheaper) and you lose less of your wafer to small defects (again, cheaper). Moving data around smaller chips is faster (perhaps better for your workload, possibly not).
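
The defect point can be illustrated with the simple Poisson yield model, where the fraction of good dies is exp(-defect density x die area). This is a textbook simplification rather than anything a fab would rely on, and the defect density and die areas below are made-up numbers.

```python
import math

def poisson_yield(defects_per_cm2, die_area_cm2):
    """Fraction of dies expected to have zero defects under a Poisson model."""
    return math.exp(-defects_per_cm2 * die_area_cm2)

# Hypothetical: 0.5 defects per square cm, and a shrink that halves the die area.
big_die = poisson_yield(0.5, 2.0)
small_die = poisson_yield(0.5, 1.0)
print(f"{big_die:.1%} good dies before, {small_die:.1%} after")
```

With the same defect density, the smaller die both fits more copies on the wafer and loses a smaller share of them.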

It will also use less power to change transistor state and will use less voltage (though increased leakage until recently minimized this effect).

Now there are dozens of other characteristics but by saying "we've moved to a lower die size" they are pretty much addressing all of your "simple" concerns.

Re:single threaded makes no sense... by MemoryDragon · 2007-02-10 10:09 · Score: 1

Ahem, you have to see it differently: you can launch way more programs in parallel without having a huge impact on performance. Multicores can go a long way, especially if you do VM stuff and Java development where you have to juggle threads by the dozens...

Re:Like a repeat all over again... by canuck57 · 2007-02-10 10:15 · Score: 2, Interesting

I will not be surprised if AMD dethrones Intel again. It is a classical Intel vs. AMD battle...

I am not sure Intel ever did beat out AMD.

I went down to Best Buy where the Intel rep was hard peddling a Core 2 Duo machine and compared his $1500 machine to an AMD X2 clearance one for $600. I had nothing to do that day but be a clown, so I went and got a DVD with software on it, and said these are both XP right? Copy the contents to the hard drive and compress it. I am going to measure it. The Core 2 Duo results were almost the same, less than 1% different, at more than twice the price. Not only that, I put my hand over the back of the machine to see how warm the exhaust was. AMD was noticeably cooler. So I walked out with an AMD X2.

So while my benchmark and testing were less than perfect, it is an end-to-end test factoring in everything. I still bought AMD. Nice machine too, runs much cooler than my P4 2.8HT. Certainly a lot faster.

perfectly fine for a CPU benchmark by r00t · 2007-02-10 10:38 · Score: 1

We're not testing the compiler. IMHO, turning optimization OFF would be a fine idea, or at least unobjectionable.

The only important thing is that the compiler choices and options are fair. Using gcc on the Opteron and icc on the Core Duo would not be fair. Using gcc everywhere, with the same options, is completely fair.

One can also define "fair" as "all systems tweaked to the max", but this is rather difficult to do right. (see also: OS benchmarks, where the benchmarker knows all the ways to tweak the OS he uses most often)

Java is slow on x86 by r00t · 2007-02-10 10:49 · Score: 1

Java is big-endian, like the SPARC and G4.

Java has strictly-defined floating-point math that is incompatible with the x86. An x86 chip must save floating-point values out to memory to force the exponent to be the right size.

JIT/emulation systems in general, including Java, do better with more registers. The G4 has about 6x as many once you exclude registers that are unavailable. (about 5 for x86, but at least 30 for the G4)
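
For anyone unsure what the byte-order point refers to: Java's class file format and its DataOutputStream write multi-byte values most-significant byte first, while x86 stores them least-significant byte first. A small sketch of the two layouts and of the swap needed to go between them (the value and the helper name swap32 are just illustration):

```python
import struct

value = 0x0A0B0C0D

big = struct.pack(">I", value)     # big-endian layout: 0A 0B 0C 0D
little = struct.pack("<I", value)  # little-endian layout (x86): 0D 0C 0B 0A

def swap32(raw: bytes) -> bytes:
    """Reverse the byte order of a 4-byte value."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    return raw[::-1]

if __name__ == "__main__":
    print(big.hex(" "), "->", swap32(big).hex(" "))
```

Whether that swap costs anything noticeable in practice depends on the workload; the sketch only shows what the conversion is.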

that's what we all run though, and it can be OK by r00t · 2007-02-10 10:55 · Score: 2, Interesting

The proper fix is to run multiple copies of the benchmark.

I'm using Linux, with single-threaded apps, but so what? I run lots of things at once:

X, window manager, xterm, editor -- that is 4, plus the kernel

X, xterm, tar, gzip -- that is 4, plus the kernel

X, xterm, make, bash, cc1, cc1, cc1, gas, gas, ld... -- that's a lot of things!
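
A minimal sketch of "run multiple copies of the benchmark", assuming a trivially small stand-in workload (the WORKLOAD string is hypothetical; substitute the real benchmark command). It starts N copies at once and times how long it takes for all of them to finish:

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a single-threaded benchmark: a CPU-bound loop
# executed in a fresh interpreter process.
WORKLOAD = "sum(i * i for i in range(200_000))"

def run_copies(n_copies: int) -> float:
    """Start n_copies of the workload together; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen([sys.executable, "-c", WORKLOAD]) for _ in range(n_copies)]
    for proc in procs:
        if proc.wait() != 0:
            raise RuntimeError("a benchmark copy failed")
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} copies: {run_copies(n):.2f} s")
```

On a machine with enough cores the wall-clock time should stay roughly flat as the copy count goes up to the core count; on a single core it should grow roughly linearly.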

vector units mostly sit idle by r00t · 2007-02-10 11:02 · Score: 1

The main use of vector units is running crappy Windows gamer "benchmarks" and MacOS Photoshop "benchmarks". The games don't even use the vector units all that much. It's just the benchmarks that use the vector units.

In the real world, vector units aren't good for much at all. You can do radar processing with them, but that isn't exactly a desktop app. Linux can use them for software RAID.

Re:Intel's Responds by XMode · 2007-02-10 11:21 · Score: 1

You allude to having a little more insight than most on this subject, but let me just say... DUH! I'm fairly sure that after 4 cores (which Intel can do now) 8 cores will be next, and once they have that under their belt, 16 cores here we come!

Re:Intel's Responds by julesh · 2007-02-10 11:31 · Score: 1

Oh, here's one. Though it's been out since before Intel had quad-core chips.

Not to mention its availability as an open-source chip design.

But: it may be 8 core, but it lacks out-of-order execution, and each core can't perform at full speed without 4 threads, so it's not much better in terms of performance than a 32-core processor running at a quarter of its speed. This isn't bad for some workloads, but for others it's a nightmare.

Re:Honestly... by aminorex · 2007-02-10 12:28 · Score: 1

> ...will work even better...

If it works at all... maybe. Meandering capacitances between traces, variable leakage currents and higher susceptibility to electromigration mean that a substantial rework of the masks is required in order to move from 65 to 45nm.

Yes, you can make it faster, and yes you can make it cheaper, but you can't make it as reliable, nor can you make it use a higher proportion of its power for computationally useful activities -- not without a substantial change in architecture, using error correction, reversible logic, self-timed circuits, &c.

So far Intel's superior process tech and AMD's superior architecture have passed the performance crown back and forth for several years, but it remains to be seen how long this condition will remain cyclically stable. I rather expect architecture to trump process in the next iteration, as AMD's partnership with IBM is likely to bring process nearer to parity, while Intel's architecture group is simply not competitive -- and revamping their staffing enough to make the kind of revolutionary change needed to recapture the lead would be too risky.

--
-I like my women like I like my tea: green-

2009. by DrYak · 2007-02-10 12:29 · Score: 1

The Larrabee is planned for 2009.

By then, both ATI/AMD and nVidia have plans to introduce their own "multiple chips on single card" family.
The current (for nVidia) / next (for ATI/AMD) line of DirectX 10 cards will be the last of the single super-GPU designs, where different models in the line differ only by the clock speed and the disabled pipelines.

From G90 and R700 onward, which will probably appear around 2008-2009 too, graphics cards will look the same way as the multi-chip cards pioneered by 3dfx (Voodoo 4 : 1 VSA, Voodoo 5 5000/5500 : 2 VSA, Voodoo 5 6000 : 4 VSA, AAlchemy's custom PCI boards : Fucking-8 VSA on a single oversized card).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:Intel's Responds by Gr8Apes · 2007-02-10 13:33 · Score: 1

Until they hit that wonderful FSB wall....

I know they're working on a solution to that.

--
The cesspool just got a check and balance.

Re:Dethrone? No. by I+Like+Pudding · 2007-02-10 13:40 · Score: 1

Intel applied concerted effort to kicking its own nuts during the P4 years. If AMD couldn't deliver a knockout blow then, it sure as hell won't now that Intel has a real lineup.

Opcodes by DrYak · 2007-02-10 14:31 · Score: 1

nothing ISA-related

Not the instruction set, but its binary encoding.

Let's say a system encodes some instructions in a single byte, where the high nibble is the command (0Bxh are all moves) and the low nibble is the register (0x3h all manipulate register C; therefore opcode 0B3h moves something into register C).
You can move this machine from 8 registers to 16. But if you want to extend to 32 registers, you'll have to use 2 bytes to code them (first byte command, second byte register; or the same opcodes as before in the second byte, with the first byte being a prefix that says "use the upper 16 registers").
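
The toy encoding above can be sketched like this. Everything here is hypothetical and only mirrors the example: MOVE = 0Bh comes from the "0Bxh are all moves" case, and the prefix value F0h is an arbitrary choice that costs the machine one command slot:

```python
EXT_PREFIX = 0xF0  # hypothetical prefix byte meaning "use the upper 16 registers"
MOVE = 0xB         # command nibble from the "0Bxh are all moves" example

def encode(command: int, register: int) -> bytes:
    """Encode one instruction: 1 byte for registers 0-15, prefix + 1 byte for 16-31."""
    if not 0 <= command <= 0xE:
        # Command Fh is given up so that F0h can never be mistaken for an opcode.
        raise ValueError("command must be in 0..0Eh")
    if 0 <= register < 16:
        return bytes([(command << 4) | register])
    if 16 <= register < 32:
        return bytes([EXT_PREFIX, (command << 4) | (register - 16)])
    raise ValueError("register must be in 0..31")

def decode(code: bytes) -> tuple[int, int, int]:
    """Return (command, register, encoded length in bytes)."""
    offset, bank = 0, 0
    if code[0] == EXT_PREFIX:
        offset, bank = 1, 16
    byte = code[offset]
    return byte >> 4, (byte & 0x0F) + bank, offset + 1

if __name__ == "__main__":
    for reg in (3, 19):
        raw = encode(MOVE, reg)
        print(f"move to r{reg}: {raw.hex()} ({len(raw)} byte(s)) -> {decode(raw)}")
```

The old one-byte opcodes keep working unchanged, and only code that touches the upper registers pays the extra byte. That is the trade-off the prefix approach makes.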

Now taking the example of the x86 to AMD64 extension, maybe moving from 8 to 32 registers would have increased the average opcode size from 3 bytes to 5 bytes. Which would have the following disadvantages:
- Same code eats up a lot more memory in 64-bit mode.
- Less code in cache
- Maybe a slower instruction decoding stage in the pipeline
- Code starts to look a lot different from legacy code (opcodes for 32-bit instructions are not the same binary as their equivalent counterparts in 64 bits), which may eat up more silicon real estate (having 2 different instruction decoders instead of a single one. For example, when 32 bits was introduced with the 386, most of the opcodes in 16-bit mode and 32-bit mode were the same; only a prefix was available to specify when code used the "other" size (16 bits in 32-bit code or vice versa) )
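
To put a number on "less code in cache", here is a back-of-envelope sketch using the 3-byte and 5-byte averages guessed at above. The 64 KiB instruction cache size is an assumption for illustration, not a measured figure for any particular chip:

```python
def instructions_in_cache(cache_bytes: int, avg_instruction_bytes: int) -> int:
    """How many average-sized instructions fit in an instruction cache."""
    return cache_bytes // avg_instruction_bytes

if __name__ == "__main__":
    CACHE_BYTES = 64 * 1024  # assumed L1 instruction cache size
    short = instructions_in_cache(CACHE_BYTES, 3)
    long_ = instructions_in_cache(CACHE_BYTES, 5)
    print(f"3-byte average: {short} instructions")
    print(f"5-byte average: {long_} instructions ({long_ / short:.0%} of the 3-byte figure)")
```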

Some ARMs have dual-mode opcodes: either opcodes coded on 32 bits (4 bytes per opcode) or Thumb mode (2 bytes per opcode). In Thumb mode, not all functionality is available (not all registers are usable for some instructions, and only jumps are conditional, not arithmetic). To get the full capability, 32-bit mode has to be used.
For ARM (the R means RISC), the instruction set is small, the decoding isn't complicated and doesn't eat much space, so adding a second "Thumb" mode for denser code isn't that hard and doesn't eat much space (especially because Thumb mode is less complex).
For AMD64, maybe it's not that easy, because the x86 opcodes are already a mess (thanks to all the legacy, including some dating back to the 8080 8-bit predecessor [not binary compatible, but at least source-code compatible]), and implementing alternate opcodes, completely rethought from the beginning, for 64-bit mode code may have used too much silicon (it's already nice that they cleaned up the memory model for 64-bit mode).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:Honestly... by zCyl · 2007-02-10 14:35 · Score: 1

It's not the cost of the wafer itself. It's the fact that hundreds of fabrication steps have to be applied individually to each wafer. So the cost of each fabrication step is roughly multiplied by the number of wafers you need to process. Fewer wafers allow for lower costs.

But are those costs equal for the 45nm versus 65nm versions?

16 finally... by DrYak · 2007-02-10 14:36 · Score: 1

16 cores here we come!

And then finally, the Joe 6-pack user will be able to run his current work (browsing for porn?), the resource-eating Microsoft OS, and the huge pile of spambot/spyware/virus/Sony rootkit/trojans/etc., all together, without the main task having to be switched out by the multitasking scheduler in favor of the others, and thus for the first time won't observe such a massive slowdown 2 days after the anti-virus license expired...

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:Dethrone? No. by Gr8Apes · 2007-02-10 15:34 · Score: 1

Therein lies the humorous part. Intel's "real lineup" is nothing more than a few minor tweaks and the application of modern processes to 10 year old tech. You might be able to whip the buggy harder, add a second horse, but eventually the locomotive just wears you out as it passes you. (Even if it's quarter-scale)

As for delivering a knockout blow with the P4, I wouldn't count my chickens just yet. AMD's offerings still smoke Intel's in the server market, esp. anything over 2 CPUs.

--
The cesspool just got a check and balance.

Not for you I guess. Who are you, anyway? by fm6 · 2007-02-10 16:34 · Score: 1

If all you want is a cheap powerful processor to put into your gaming PC, then yeah, you don't care who dethrones who. But if you work in the industry (and right now I'm writing documentation for an Opteron-based HPC system) then the Intel-AMD struggle is very interesting indeed. Every time AMD scores an upset over Intel, the whole marketplace changes.

And while you may be content to buy cheap technology 6 months after it's introduced, not everybody who buys hardware has that luxury. If you're spending 6 or 7 figures for a rack full of high-performance computers, then every little twitch in performance or pricing makes a big difference to your bottom line.

Re:Not for you I guess. Who are you, anyway? by Weaselmancer · 2007-02-12 04:38 · Score: 1

Well, you're a good example of the other side of the fence. But not everyone is running a high demand server room. I'd imagine that guys who have your purchase priorities are the exception and not the rule. The consumer market seems to be the lion's share of cpu sales.

But what you wrote is interesting. Do people who run these high demand systems really upgrade the cpu every time AMD or Intel manages a tiny delta of performance increase? Seems terribly wasteful if they do. Bleeding edge cpu cost seems to be around 800-1200 dollars American. That's a lot to throw down every time someone manages to push an additional 100MHz out of some chip. The exception IMHO would be whenever there is an architecture change, like the recent glut of dual core chips, or the currently hyped quadcore CPUs. Other than that though, it seems like an awfully expensive way to do business.

--
Weaselmancer
rediculous.

Re:Not for you I guess. Who are you, anyway? by fm6 · 2007-02-12 16:49 · Score: 1

Do people who run these high demand systems really upgrade the cpu every time AMD or Intel manages a tiny delta of performance increase?

Probably not. But they don't take the "wait six months until the price comes down" attitude either. They can't afford to.
And the people who take these chips and turn them into actual systems do upgrade every time AMD or Intel does that tiny delta. No, that's wrong: they upgrade before. Because they have OEM agreements that give them access to the chip maker's secret development schedule. So long before you hear about a new chip, system manufacturers are busy designing the systems that will use them.

Re:Honestly... by Waffle+Iron · 2007-02-10 18:22 · Score: 1

The costs don't need to be equal to pay off. Keeping other factors constant, anything less than (65/45)^2 (about 2X) would cost less per CPU.
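
Spelling out that arithmetic as a sketch: an ideal shrink from 65nm to 45nm fits (65/45)^2 times as many dies on a wafer, so the shrink is cheaper per CPU as long as the processed-wafer cost goes up by less than that factor. The wafer cost and die count below are invented illustration values, and yield is held constant:

```python
def area_scale(old_nm: float, new_nm: float) -> float:
    """Ideal factor by which dies-per-wafer grows when feature size shrinks."""
    return (old_nm / new_nm) ** 2

def cost_per_die(wafer_cost: float, dies_per_wafer: float) -> float:
    return wafer_cost / dies_per_wafer

if __name__ == "__main__":
    scale = area_scale(65, 45)
    print(f"break-even wafer cost ratio: {scale:.2f}x")

    OLD_WAFER_COST = 5000.0  # hypothetical cost of a fully processed 65nm wafer
    OLD_DIES = 300.0         # hypothetical dies per 65nm wafer
    for ratio in (1.0, 1.5, 2.0, 2.5):
        new_cost = cost_per_die(OLD_WAFER_COST * ratio, OLD_DIES * scale)
        old_cost = cost_per_die(OLD_WAFER_COST, OLD_DIES)
        verdict = "cheaper" if new_cost < old_cost else "more expensive"
        print(f"45nm wafer costs {ratio:.1f}x: {new_cost:.2f} vs {old_cost:.2f} per die ({verdict})")
```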

Re:Honestly... by jlfose · 2007-02-12 06:45 · Score: 1

If 65nm, 45nm or 10mm consumed the same amount of power or were equally susceptible to random errors, then from a consumer standpoint I would agree with you. But that is not the case. The smaller process allows an equal amount of computing to be done with less electrical power. For instance, let's say the 65nm processor uses 50 watts and the 45nm processor uses 45 watts. That may not seem like much, but let's assume that you now apply that to 1 million PCs that use the proc 24x7 for three years. This results in:

- Lbs CO2: 134,405,700
- Tons CO2: 67,202
- Equivalent in cars: 3,875
- Equivalent gallons of gas: 6,870,139
- Acres of trees needed to offset: 6,109.63
- Mature trees needed to offset: 2,986,793
- Trees planted to offset: 716,830

If you just want to look at what a person would save for running 1 new proc for the same 3 year period, it results in:

- Equivalent gallons of gas: 6.87
- Mature trees needed to offset: 2.99

So even if you just purchase 1 chip, it will save you about $18 over the 3 years and will be more eco-friendly. In addition, lower power chips tend to be higher in reliability, so the average consumer is less likely to have to pay for a repair over the same 3 year period, which also has a tangible value. So all in all you should care about the new processes.
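
The energy side of those figures can be reproduced with a few lines. The 50 W and 45 W draws, the 24x7 duty cycle and the three years come from the comment; the electricity price and the CO2 factor are assumptions back-derived from its "$18" and "134,405,700 lbs" totals, not independent data:

```python
HOURS_3_YEARS = 24 * 365 * 3  # 24x7 for three years

# Assumed values, chosen to match the totals quoted in the comment.
PRICE_PER_KWH = 0.137                         # dollars per kWh (assumption)
LB_CO2_PER_KWH = 134_405_700 / 131_400_000    # back-derived emission factor

def saved_kwh(old_watts: float, new_watts: float, hours: int = HOURS_3_YEARS) -> float:
    """Energy saved by one processor over the period, in kWh."""
    return (old_watts - new_watts) * hours / 1000.0

if __name__ == "__main__":
    per_chip = saved_kwh(50, 45)
    print(f"per chip: {per_chip:.1f} kWh, about ${per_chip * PRICE_PER_KWH:.2f}")
    fleet = per_chip * 1_000_000
    print(f"1 million PCs: {fleet / 1e6:.1f} GWh, about {fleet * LB_CO2_PER_KWH:,.0f} lbs CO2")
```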

Slashdot Mirror