AMD Delays Hammer

Not surprising by Anonymous Coward · 2002-09-12 16:13 · Score: 4, Funny

AMD = All Microprocessors Delayed

Comment removed by account_deleted · 2002-09-12 16:22 · Score: 4, Insightful

Comment removed based on user account deletion

Re:Comment non-sense by Paul+Komarek · 2002-09-12 16:38 · Score: 5, Insightful

I, for one, am hoping to replace our Alphas with cpus from the AMD Hammer series. We're about to buy a bunch of P4-based machines despite the problems we've had with certain tight loops in scientific code performing 80 times slower than a similarly clocked Athlon (according to Athlon advertised "speed", not actual clock). No, I'm not exaggerating, and this has been verified independently -- the P4 cpu has some huge weak spots that really suck if you hit them. If Hammer were out and working properly, we probably wouldn't buy the P4 machines to hold us over.

We need 64 bit machines to accommodate massive memory for our research. I'm really hoping the Hammer can provide a relatively inexpensive and *commoditized* 64 bit platform for us to work on, compared to existing 64 bit (workstation/server) platforms. And I want it yesterday. Actually, I want it last year.

I have no idea what the editors or submitter meant, of course.

-Paul Komarek

The real reason by PaxTech · 2002-09-12 17:04 · Score: 5, Funny

They're waiting so they can ship the new chip bundled with Duke Nukem Forever. ;)

--
All movements for social change begin as missions, evolve into businesses, and end up as rackets.

Re:Comment non-sense by Paul+Komarek · 2002-09-12 17:28 · Score: 4, Interesting

I can probably send you some test code (same for anyone else who asks), but I'll have to check with my advisor first. The smallest I've made the test code is a bit under 300 lines. It's been run on Alpha 21264 EV67, Athlon C, Athlon XP, P4, and P-III, and one other Pentium-ish platform. At least two (I believe it's actually three) profilers have been run to find the bottleneck; it appears to be the floating point unit stalling for data.

Here are the timings. Note that these are just via "time" on GNU/Linux or a wall clock on Windows (or something -- I didn't do the Windows tests).

P4 dual Xeon 1.7GHz/gcc: 82 seconds
P3 1000/msvc: 18 seconds
Athlon C 600/msvc: 2 seconds
P3 1000/msvc, using floats and sse: 2 seconds
Alpha 667/gcc: 2 seconds
Athlon XP 1900+: 0.88 seconds

I guess the Athlon's clock was closer to the P4's clock than I recalled in my original post. Either way, the slowdown on the Pentiums can be easily seen.
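
For anyone who prefers ratios to raw seconds, here is a quick sketch that divides each run by the fastest one. The numbers are the ones listed above; the function name is just for illustration, and since compilers, operating systems, and clocks differ between runs, the ratios are only a rough guide.

```python
# Wall-clock timings in seconds, copied from the list above.
timings = {
    "P4 dual Xeon 1.7GHz/gcc": 82.0,
    "P3 1000/msvc": 18.0,
    "Athlon C 600/msvc": 2.0,
    "P3 1000/msvc, using floats and sse": 2.0,
    "Alpha 667/gcc": 2.0,
    "Athlon XP 1900+": 0.88,
}


def slowdowns(times):
    """Return each run time divided by the fastest run time."""
    fastest = min(times.values())
    return {name: seconds / fastest for name, seconds in times.items()}


ratios = slowdowns(timings)

# Print from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:36s} {ratio:6.1f}x")
```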

-Paul Komarek

we believed it was hammer time by deft · 2002-09-12 17:34 · Score: 5, Funny

but it turns out you can't touch this.

--

There's nothing Intelligent about Intelligent Design.

Re:Current Athlons by packeteer · 2002-09-12 18:21 · Score: 4, Informative

In case anyone doesn't know what "double pumped" or "DDR FSB" means, let me explain. The clock that sets how often data is transferred ticks over and over to keep the pace. On an Athlon, data is transferred twice for every tick. On a Pentium 4 it's four times a tick. Right now most Athlons run at 133MHz "DDR FSB". Mine already runs at 166MHz (overclocked, of course) and let me tell you, it's sweet. I can't wait to see everyone have access to 166MHz FSB Athlons.

--
unzip; strip; touch; finger; mount; fsck; more; yes; unmount; sleep

Re:Good by Billly+Gates · 2002-09-12 18:52 · Score: 5, Informative

Oops, I forgot to include this from the FAQ.

Q: Can Linux, FreeBSD or another open source OS run on "Palladium" hardware?

A: Virtually anything that runs on a Windows-based machine today will still run on a "Palladium" machine (there are some esoteric exceptions[1]). If you currently have a machine that runs both Linux and Windows, you would be able to have that same functionality on a "Palladium" machine.

The exceptions are here

[1] These exceptions include the following:

1.) Some debuggers may need to be updated to work in the "Palladium" environment, but they can still work.

2.) Some special performance tools may need to be updated.

3.) Software that writes directly to TCPA hardware will need to be updated.

4.) Memory scrub routines (at the hardware level) will need attention.

5.) Third-party crash dump software may need to be updated.

6.) BIOS mode hibernation features will need to be updated to work with "Palladium."

It's these six reasons why Palladium is still beta and why AMD is probably waiting before releasing Hammer.

--
http://saveie6.com/

short rant and a question by Erpo · 2002-09-12 19:00 · Score: 5, Interesting

Everyone always makes the same really annoying mistake when it comes to athlon fsbs. Athlon front side busses do not run at 200MHz and 266MHz. They offer bandwidth equivalent to 200MHz and 266MHz by using both sides of the clock (DDR) on 100MHz and 133MHz fsbs. All new athlons use 133MHz DDR fsbs. The hammers will support 166MHz DDR memory busses, offering performance equivalent to 333MHz SDR memory.
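
To make the arithmetic concrete, here is a small sketch of how the marketing numbers come out of the real clock. It assumes a 64-bit (8-byte) data bus and uses the nominal clock figures; the function names are made up for illustration. The real clocks are closer to 133.33MHz and 166.67MHz, which is why 166MHz DDR gets sold as "333" even though 166 x 2 is 332.

```python
def effective_rate_mhz(clock_mhz, transfers_per_clock):
    """'Marketing MHz': real clock times data transfers per clock cycle."""
    return clock_mhz * transfers_per_clock


def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock, bus_bytes=8):
    """Theoretical peak in MB/s; bus_bytes=8 assumes a 64-bit data bus."""
    return clock_mhz * transfers_per_clock * bus_bytes


# DDR means two transfers per clock (rising and falling edge).
for clock in (100, 133, 166):
    rate = effective_rate_mhz(clock, 2)
    peak = peak_bandwidth_mb_s(clock, 2)
    print(f"{clock}MHz DDR -> '{rate}MHz' equivalent, about {peak} MB/s peak")
```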

However, the notion of "fsb" is a little blurred with the hammer. Hammers will be directly connected to dimm banks and have integrated memory controllers, so the speed of the fsb will no longer be a determining factor in memory bandwidth. (* see mp note below) The traditional fsb to the traditional northbridge will be replaced by a "high speed" hypertransport link to a chip that connects to the agp slot, and has another (slower) hypertransport link to what could be called the south bridge. This "south bridge" will then connect the pci bus, serial ports, hard drives, usb ports, and any other devices that need to talk to the processor or main memory.

*What does this mean for MP systems? Well, that's actually the really cool part. By moving the memory controller onto the processor and providing communication between processors over a hypertransport link (3.2GB/sec for dual, 6.4GB/sec for quad and above), memory bandwidth actually increases as more cpus are added! This is in contrast to a normal MP system where as more cpus are added, there is increased competition for a fixed resource (main memory) which is already the bottleneck in many single processor applications.
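
A rough sketch of that scaling argument, assuming each cpu gets one 64-bit channel of 166MHz DDR (that channel width is my assumption, not something AMD has confirmed here) and ignoring any traffic that has to cross the hypertransport links. These are ideal upper bounds, not measurements.

```python
# Assumed per-controller peak: 166 MHz DDR on a 64-bit (8-byte) channel.
CONTROLLER_MB_S = 166 * 2 * 8  # 2656 MB/s, nominal


def shared_memory_per_cpu(n_cpus, total_mb_s=CONTROLLER_MB_S):
    """Traditional MP: every cpu competes for one fixed memory bus."""
    return total_mb_s / n_cpus


def per_cpu_controller_total(n_cpus, per_cpu_mb_s=CONTROLLER_MB_S):
    """One memory controller per cpu: peak aggregate grows with cpu count."""
    return n_cpus * per_cpu_mb_s


for n in (1, 2, 4):
    print(
        f"{n} cpu(s): shared bus gives {shared_memory_per_cpu(n):.0f} MB/s each, "
        f"per-cpu controllers give {per_cpu_controller_total(n)} MB/s in total"
    )
```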

That's my rant on terminology. Here's the question:

I'm no kernel hacker, and I certainly don't know anything about writing schedulers, but it seems like this would require a change in how processes are handled in hammer mp systems. In traditional mp systems, every processor has equal access to main memory. If a process gets moved from one cpu to another, there's initial overhead to do the moving, but after that it can still get to its areas in memory without any problems. On a hammer mp system, migrating a process from one cpu to another would mean that in order to access its memory it would have to reach out of its cpu's hypertransport link, into another cpu's memory controller (which may or may not be busy) and into the attached ram. Considering there would not be enough bandwidth available on the 3.2GB/sec hypertransport bus (in the case of a dp system) for both processors to reach into each other's 166MHz DDR memory at the same time without suffering a performance hit, it seems like there would definitely be an advantage to keeping processes close to their data.
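
Independent of what the scheduler does, a process can be kept on one cpu by hand from user space. A minimal sketch, assuming a Linux system and a Python version that exposes `os.sched_getaffinity` and `os.sched_setaffinity` (the helper name `pin_to_one_cpu` is hypothetical). Note this only controls where the process runs; it says nothing about which cpu's memory its pages land in.

```python
import os


def pin_to_one_cpu(pid=0):
    """Restrict pid (0 = the calling process) to one cpu it may already use.

    Returns the chosen cpu number, or None if this platform does not
    provide the Linux-only affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)  # set of cpu numbers we may run on
    cpu = min(allowed)                   # arbitrary choice: lowest allowed cpu
    os.sched_setaffinity(pid, {cpu})     # the kernel will now keep us on it
    return cpu


pinned = pin_to_one_cpu()
if pinned is None:
    print("cpu affinity calls are not available on this platform")
else:
    print(f"now restricted to cpu {pinned}")
```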

What changes would this require to scheduling and process management code, if any? Has this already been addressed, or are there people working on it in the linux kernel?

Re:short rant and a question by Erik+Hensema · 2002-09-12 22:50 · Score: 4, Informative

Essentially this would be a NUMA system (non-uniform memory access). As far as I know Linux 2.6 will have support for these systems.

In a real NUMA machine there would be a hierarchy of clusters of processors. Each cluster functions a bit like a traditional SMP system, but the clusters are interconnected over "low"-bandwidth busses. This makes memory accesses across clusters slower than direct accesses into the clusters' memory.

Both the VM and the scheduler will have to know about this.

Another point with NUMA systems is the possibility of gaps in the main memory (discontiguous memory). Kernel hackers are currently working on support for that (discontigmem patch, merged in 2.5.34).

--
This is your sig. There are thousands more, but this one is yours.

Why Pentium IVs are slow by stewartjm · 2002-09-12 19:13 · Score: 4, Informative

The P4's x87 FPU and x86 ALU are just plain slow compared to P3s and Athlons. Though I am surprised your code is running 82x slower. I'd expect more like 2-8x slower for compute bound code. You can get a somewhat sensationalistic overview of why it's so slow at this link.

If you want more in-depth numbers you can compare appendix C of the Intel Pentium 4 Optimization Manual with chapter 29 of Agner Fog's Pentium/II/III Optimization Manual. You can see the Athlon numbers in Appendix F of AMD's Athlon Optimization Manual.

If you want to do number crunching with Pentium 4s your best bet is to use the SSE2 instructions/registers. You should be able to get a noticeable speedup by using the Intel C++ compiler and telling it to use SSE2 instructions. If you want to eke out max performance you'll have to use assembly language. Though you can probably get most of the way there using the Intel C++ Compiler's SSE2 intrinsics.

I'm curious as to why your code is so much slower on a P4 than on an Athlon. The best way to find out would be to look at the assembly code that gcc is producing. You can do that by using gcc's -S option. If you'd like, send me the C code and the output from -S and I'll see if I see anything obvious.

I'm somewhat paranoid about posting my email address. My paranoia seems to work, as I've received no more than the occasional spam in the last few years. My email address is my slashdot user name at woh.rr.com.

They're having clock speed issues with Hammers... by Heretic2 · 2002-09-12 19:25 · Score: 5, Interesting

You ever notice how all the Hammers are clock speed locked at 800MHz? Yeah, there's a reason for that. They're having problems cranking the clock speed up. For 800MHz they're fast as hell, beating a P4 with twice the frequency, but they're not gonna release them until they clock faster than current Athlons, so they're trying different types of transistors and whatnot.

How the hell do I know that??? Look where I live, take a guess...The birds outside my window know things.

Re:is there a real difference? by kimmo · 2002-09-12 22:14 · Score: 5, Informative

Latency.

With single data rate a new address can be sent every clock for all memory requests.

With double data rate a new address can be sent only with every other "clock", while the data transmission rate stays the same. Effectively this means transferring double the data for each request, while the number of requests doesn't change.

This isn't a very serious problem, since what usually gets transferred is not single bytes or bus-wide data but whole cachelines of 32/64 bytes. They will generate 4/8 sequential burst requests, nullifying much of the potential latency problem of "halfclocked" address generation.
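
The 4/8 figures fall out of simple division. A small sketch, assuming a 64-bit (8-byte) data bus, with function names made up for illustration:

```python
def burst_length(cacheline_bytes, bus_bytes=8):
    """Number of bus-wide transfers needed to move one cacheline."""
    return cacheline_bytes // bus_bytes


def clocks_per_burst(cacheline_bytes, transfers_per_clock, bus_bytes=8):
    """Real clock cycles the data bus is busy with one cacheline."""
    return burst_length(cacheline_bytes, bus_bytes) / transfers_per_clock


for line in (32, 64):
    print(
        f"{line}-byte cacheline: {burst_length(line)} transfers, "
        f"{clocks_per_burst(line, 2):.0f} clocks at double data rate"
    )
```

Even the smaller 32-byte line keeps a double data rate bus busy for two full clocks, so one address per clock is already more than back-to-back cacheline bursts can use.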

OK, so why can't the addresses be sent the way the data is? That's another question, which someone with more knowledge might explain. Maybe it would complicate things too much, since the request-answer mechanism would have to be pipelined to accept new requests before previous requests are served. Or maybe the physical bus has some limitations, like using the same pins for address/data, which would simply make it impossible to send new addresses simultaneously (on the falling edge of the clock) while receiving data.
