Bulldozer Server Benchmarks Not Promising
New submitter RobinEggs writes "Some reviews of Bulldozer's server performance have arrived. Ars Technica has the breakdown, and the results are pretty ugly. Apparently Bulldozer fares just as poorly with servers as with desktops. From the article: 'One reason for the underwhelming performance on the desktop is that the Bulldozer architecture emphasizes multithreaded performance over single-threaded performance. For desktop applications, where single-threaded performance is still king, this is a problem. Server workloads, in contrast, typically have to handle multiple users, network connections, and virtual machines concurrently. This makes them a much better fit for processors that support lots of concurrent threads. ... It looks as though the decisions that hurt Bulldozer on the desktop continue to hurt it in the server room. Although the server benchmarks don't show the same regressions as were found on the desktop, they do little to justify the design of the new architecture.' It's probably much too early to start editorializing about the end of AMD, or even to say with certainty that Bulldozer has failed, but my untrained eye can't yet see any possible silver lining in these new processors."
Bulldozers do not make good servers. Use a computer. Problem solved.
How many more years will slashdot have an off-by-one error on your Score in your profile?
And yet 3 supercomputers with those Opterons were ordered in the last 4 weeks? And in a month, one of them, which is being revamped from the #3 supercomputer position in the world, will be the #1 supercomputer in the world when complete? Were the people at Lockheed Martin also morons for choosing an Opteron-based supercomputer?
Why was an article that is apparently written to bash AMD included on Slashdot despite its apparent bias?
Read radical news here
The standard of writing at "Ars Technica" has declined far more than AMD's performance relative to Intel.
Recall the Itanium from Intel and HP. It started out with great hype more than ten years ago. When the first benchmarks came, no one wanted to believe them. Still, that particular architecture is about to die.
Unfortunately, Bulldozer may end up with a similar fate. The big difference is that Intel had its regular desktop cpu line-up to finance the Itanium disaster. If nothing can be much improved on the AMD cpu side, can the shrinking graphics card business save AMD?
I hope so.
I always liked AMD CPUs, mostly for offering almost equal computing power for less money, but looking at the benchmarks (desktop or server, it doesn't matter) that no longer seems to be true.
I really don't get the conclusion.
Bulldozer is faster than the Xeon chip on all CPU benchmarks that can generate enough threads to fill all cores.
Each Bulldozer core is as fast as a core on an Opteron 6100.
It looks exactly like the CPU I want in my web/db server, and my supercomputer.
We need healthy competition for Intel, to keep pushing tech forward and prices down. Sadly, AMD simply has not performed over the last year or two, with no real answers to Intel's Core i series.
Can we really call Bulldozer an 8-core processor? Its real-world benchmarks would seem to suggest otherwise. I guess the question should be: is modern computing still so integer-dependent that it would benefit from Bulldozer's twinned integer units? I thought we all switched to full-fat floating-point operations over 15 years ago, when the Pentium hit the mainstream and everyone finally had an on-die FPU in their PC.
On a server, I would expect bus throughput to be a deciding factor. I'm not crunching fancy scientific data, mostly ferrying bits from disk to network and back. Having extra cores allows more simultaneous transfers by handling more handshakes and thus connections, but beyond that it's all DMA copies from memory to I/O.
-Billco, Fnarg.com
When someone says that a CPU was designed around multiple threads, I think virtualization. Yeah, you can argue that servers are multithreaded in that they have to handle multiple users connecting, but that's bull. I can write a badly threaded application that doesn't effectively use the multiple cores...
So how do these cpus perform with something like ESX running on them?
Scott
That's perfect for running BOINC though, which is very good at using multiple cores at their full capacity. Useless for the business, but great for contributing to science projects :-)
I stopped reading right there. When people start talking about the performance of a server on the desktop, it is pretty clear that they lack even the most basic understanding of what they are talking about.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
I thought the justification for the new architecture was that it adds more cores without adding as many transistors as a more traditional architecture, not that it made the individual cores faster. If it doesn't "show the same regressions as were found on the desktop" and we've got more cores at less size per core, then surely this is a win for the new architecture?
Bulldozer chips are in short supply because of strong sales. Because AMD cannot immediately meet Opteron demand, it is keeping 8150 supply low and binning those chips as Opterons instead, which leaves the desktop market undersupplied. Read the informative thread below.
http://www.overclock.net/t/1171264/compared-3-different-bulldozer-fx-8120s-want-to-know-the-difference/10
Bulldozer 8150s have been in short supply on Newegg and Amazon. Sometimes they are out of stock, and you can't even put them on a watchlist.
Way too high sales for a 'failed' processor?
Read radical news here
One element has me curious about how these benchmarks were prepared: Is the benchmark software compiled on the target platform/cpu combination with all available optimisations of that platform?
Many of these benchmarks ship a binary, library, or set thereof written for a single target platform (the one the benchmark's original developers were working on): usually pre-compiled, usually for Intel, on an Intel system, by an Intel compiler, with Intel optimisations, or at least two of those four. The same binary is then run on whatever systems have a compatible architecture. This has a high potential to produce skewed results on non-Intel platforms, as not all manufacturers use the same optimisations.
While this specific processor may not be as great as it should have been, I feel that benchmarks in themselves are usually flawed and must be taken with a grain of salt until real-world software that isn't in a lab-style environment is attempted on it.
Maybe it's early, but I had a hard time seeing the comparisons they were trying to make. Also, when Ars compared pricing (X system is 400k and Y system is 600k), what the hell was that? Stats like that are usually accompanied by a link or cite for said system. It said the benchmarks were "here", but I didn't see any. I'd like to see benchmark details such as the OS. It may be too early to judge, as this is the first-generation chip. And will Bulldozer perform better under the next iteration of Windows (if that was the control)?
TPC-C was run on Windows 2008; see http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501
AnandTech tested on Windows 7.
It is known that Windows 7 and 2008 are not optimized for Bulldozer, especially at the task scheduling level.
So we do not yet know the real power of the Bulldozer architecture in the Windows world.
See http://hexus.net/tech/news/cpu/32394-bulldozer-benchmarks-correct-definitive which unfortunately only has very few benchmarks.
You can also look at the phoronix site, where Bulldozer is tested on Linux.
Or maybe you could think before you post. Talking about desktop performance for a processor designed for a server is like talking about the performance of a race car for trips to the grocery store. Newsflash: Things that are used in applications for which they were not designed are not as good as the performance of other things that were designed for said application!
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Anandtech.com provides much more knowledgeable and professional reviews. They had this to say about AMD's new chip: "Unfortunately, with the current power management in ESXi, we are not satisfied with the Performance/watt ratio of the Opteron 6276. The Xeon needs up to 25% less energy and performs slightly better. So if performance/watt is your first priority, we think the current Xeons are your best option. The Opteron 6276 offers a better performance per dollar ratio. It delivers the performance of $1000 Xeon (X5650) at $800. Add to this that the G34 based servers are typically less expensive than their Intel LGA 1366 counterparts and the price bonus for the new Opteron grows. If performance/dollar is your first priority, we think the Opteron 6276 is an attractive alternative." http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200/14
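To put rough numbers on that quote, here is a minimal sketch of the arithmetic. It assumes, as the review says, roughly equal performance at $800 versus $1000, and it takes "up to 25% less energy" at its upper bound. The function name and the normalisation to 1.0 are illustrative choices, not anything from AnandTech.

```python
def ratio_per_unit(perf_a, cost_a, perf_b, cost_b):
    """Return how many times better A's performance-per-cost is than B's."""
    return (perf_a / cost_a) / (perf_b / cost_b)

# Performance normalised to 1.0 for both chips (the quote calls them roughly equal).
OPTERON_PERF = XEON_PERF = 1.0

# Prices quoted in the review, in US dollars.
opteron_vs_xeon_per_dollar = ratio_per_unit(OPTERON_PERF, 800.0, XEON_PERF, 1000.0)

# Energy normalised so the Opteron uses 1.0 and the Xeon "up to 25% less" (0.75).
# This is the upper bound of the quoted range, so the result is a best case for the Xeon.
xeon_vs_opteron_per_watt = ratio_per_unit(XEON_PERF, 0.75, OPTERON_PERF, 1.0)

print(f"Opteron perf/dollar advantage: {opteron_vs_xeon_per_dollar:.2f}x")
print(f"Xeon perf/watt advantage (upper bound): {xeon_vs_opteron_per_watt:.2f}x")
```

On those assumptions the Opteron comes out about 1.25x ahead per dollar and the Xeon at most about 1.33x ahead per watt, which is why the review splits its recommendation by priority.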
For those people who do lots of media transcoding and 3D rendering, either as part of work or just on their own time, I feel that the 62xx series are fantastic. I mean, under $6,000 for a 64-core workstation with 128GB of RAM, and the capacity to add a high-end video card? Consider me sold. It's like a cluster array in the bedroom, without having to worry about the networking headache.
Yes, performance falls behind in a few sectors, but compared to where computers were 3 years ago (my last large build), the 62xx chips pull ahead in every category.
Just because something isn't the fastest doesn't mean that it isn't fast enough xD
Hell, it's tempting to build such a system just for giggles and bragging rights.
One 2.1GHz 62xx core is still faster than my old Athlon 64 3000+, and that ran my games for ages and ages xD
Why would he even mention the idea of editorializing about the end of AMD?
If one failed product meant the end of a company, we'd have no companies. Intel has screwed up a lot in the past, and they're still around...
I do suspect that Bulldozer is going down remarkably like NetBurst (NetBurst made design compromises for marketable massive clock gains, Bulldozer similarly makes compromises to boost the now-marketable core count), and time may prove that wrong, but this article was crap.
It looked like they cherry-picked some benchmarks from the world at large with no control. As pointed out in the article, the tpmC benchmark had massive storage differences, and the cost delta means there were probably node count differences. There are so many things in play that it is impossible to derive any sort of statement specifically about the processors. The article, however, uses that as a point to show AMD is more expensive and make AMD look bad, but in the same breath says better SSDs probably drove the benefit, to steal AMD's thunder. He can't have it both ways. I'm inclined to believe the storage architecture was the key in terms of cost and performance, given the nature of the test.
Later, the article says AMD should have just done 16-core Magny-Cours. Clearly AMD should hire him as he is a genius who *must* have considered all the complexities and figured out a way to achieve that core density when no one else in the industry has. No one pretends for a second that a bulldozer module matches 2 'real' cores, but they can't just wave their wand and make a 16-core package of the old architecture. Bulldozer is all about trying to ascertain the 'important' bits of a core and share other bits in the hopes the added resource gives most of the benefit of an additional core without the downsides that make it impossible to do that many cores on a socket.
XML is like violence. If it doesn't solve the problem, use more.
Bulldozer can't consistently beat Phenom X6 in desktop workloads.
It can't consistently beat Magny-Cours in server workloads.
It doesn't seem to be any more power-efficient than AMD's last generation, despite being built on a smaller process node (32nm vs 45nm).
At what point does AMD simply admit Bulldozer is a failure, pull the plug, and write off the sunk costs? Putting good money after bad is a classic business mistake that has killed many companies.
AMD should continue improving their existing cores on the 32nm process (they already have some of the work done with Llano) and forget about their "revolutionary new" architecture which is basically this decade's Prescott.
Or, heck, see if it's possible to scale up the Bobcat cores for mainstream desktop use. Don't forget, Intel's very successful Core 2 Duo came from a previous design (Pentium M) that had been reserved to laptops. AMD will probably have more luck increasing performance (both raw clock and IPC) on Bobcat than trying to tame the heat, insane transistor count, and long pipeline of Bulldozer.
>but my untrained eye can't yet see any possible silver lining in these new processors.
Maybe you need to buy some new glasses then or are you just another of the Intel trolls
After clicking on links I finally found some benchmarks. As usual, they were bullshit. Can't these people think of a test that can put them through real hoops? I used to throw 60G pcap files (1 minute of traffic) at machines to determine if the hardware could run our IPS software. The machine with the fewest millions of threads not yet processed won. The application opened a thread for every packet that traversed a 1G nic. The content of each packet was then sent (branched) through the appropriate inspections simultaneously; one thread for each protocol check, one thread for each header check, one thread for each regular expression on the body, making a potential (65,535^2^10k + 4^252^200) new threads per second. No branch prediction can be used in this kind of test because the traffic is never predictable so every path for every packet must be traversed completely. Note: the 10k and 200 are the number of rules (regular expressions) applied to the packets.
Having to work for a living is the root of all evil.
The US Office of Management and Budget (OMB) has a virtual to physical server target of 15:1.
Every large business, and most medium sized ones, are going to try to (at least) match that target.
(Although memory seems to be a bigger constraint.)
They're still not likely to use all the cores unless they have some peculiar workload. They'll run out of RAM and IO (on a single server) first.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
Has anybody released any compilation tests for a big project on Bulldozer?
Start moving some of that crap to the GPU side of Bulldozer. There are a few things that the GPU could be dedicated to with OpenCL and such.
In a server, it's essentially wasted silicon unless fully utilized.
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
C'mon Ars... you can do better than that. Give us some chart pr0n to gloss over.
There are no poor processors, only poor software...
There you are, staring at me again.
I just read the article about "AMDs Wegschaufler" in c't (http://www.heise.de/ct/inhalt/2011/25/158/), and instead of just talking about bla-bla benchmarks they actually tested the CPU themselves. What did Ars[e] Technica do to get such bad results? Their crap about a "bulldozer server benchmark catastrophe" doesn't even relate to real life anymore, because the numbers I saw were quite good. Yes, a Sandy Bridge server CPU from Intel could change that, but until it is here, the Bulldozer server CPU has the performance crown.
Caches that are too small, or have too much latency will cause this problem. You can get a sense for this by locking a thread to a core but this doesn't tell you that the OS is the problem.
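For anyone who wants to try the "lock a thread to a core" experiment, here is a minimal sketch, assuming a platform where Python exposes os.sched_setaffinity (Linux does; it is not available everywhere). The busy_work loop and the time_pinned helper are hypothetical names made up for illustration, and a toy loop like this will not stress the caches the way a real workload does.

```python
import os
import time

def busy_work(n=200_000):
    """A small CPU-bound loop to time."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def time_pinned(cpu, n=200_000):
    """Pin this process to one CPU, time busy_work, then restore the old mask.

    Returns elapsed seconds, or None where the affinity calls are unavailable
    (os.sched_setaffinity exists only on some Unix platforms, e.g. Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    original = os.sched_getaffinity(0)   # 0 means "the calling process"
    os.sched_setaffinity(0, {cpu})
    try:
        start = time.perf_counter()
        busy_work(n)
        return time.perf_counter() - start
    finally:
        # Always restore the original CPU mask, even if the work raises.
        os.sched_setaffinity(0, original)

if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        for cpu in sorted(os.sched_getaffinity(0)):
            print(f"cpu {cpu}: {time_pinned(cpu):.4f}s")
    else:
        print("CPU affinity is not exposed by os on this platform")
```

Comparing pinned timings against an unpinned run gives a rough sense of how much migration between cores costs, though as noted above it does not by itself prove the OS scheduler is at fault.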
At first blush, the cache architecture looks poor. Most of us took a 'wait and see' approach. But now that the benchmarks are public, there's a lot of head scratching going on trying to understand why AMD thought that this would be a good idea.
There has been a lot of discussion about BD's cache architecture on realworldtech. Pretty much without exception the opinion was that BD's poor performance was due to the cache architecture and implementation.
If they can fix whatever is killing the performance of the on-chip caches (previous reviews indicated that the L3 appears to be a bottleneck) and/or figure out how to get the clock speeds up (it supposedly has deeper pipelines than K10, so it theoretically should clock higher), Bulldozer could still be a competitive part. This is, of course, dependent on AMD surviving for long enough to do that. I wonder how long they can get by on their GPU revenues?
AMD's company spokesman Mike Silverman put it best recently: "We will all need to let go of the old 'AMD versus Intel' mind-set, because it won't be about that anymore." - San Mateo County Times, AMD aiming to emerge from Intel's shadow
The problem is AMD is trying to solve a problem that regular developers cover their ears and go "la la la, doesn't exist"
You see the problem in Chrome
You see the problem in how people still recommend pre-fork Apache httpd
You see the problem in how Perl and PHP misbehave because all the extension shit added to it isn't threadsafe
You see it in pretty much every game made between 2003 (when HT started being available) and now (when HT/Multicores are standard) with the exception of games running on Source/Unreal engines.
What are we seeing? We're seeing developers still designing software with "1 core, one job", and over time processors have been decreasing in speed, not increasing. This is what AMD has done now, decreased the per-core performance while increasing the parallelism. They can catch up, but they better catch up quick.
Games are the least efficient use of multicores. One game I tore apart with debugging tools, the entire game except for sound and DRM runs in one large loop. Sound runs in another thread and uses 1% of it, the DRM runs in another thread. But the video capture runs inside the same game loop. God this game would run at 60fps until you punched video capture and then it dropped to like 15fps, and still peg to one CPU core.
There is no reason to use the prefork model in Apache httpd, but it's still as prevalent as the "don't use PNGs in web pages" web mythos. My finely tuned worker-mode httpd easily serves thousands of requests, the bottleneck being PHP, which has to go through mod_fastcgi to php-fpm... which acts like prefork.
The problem is that developers do not design things to use threads, and those that do don't parallelize, they just run "time insensitive" bits in other cores.
Zlib, libpng, and the JPEG libraries are used in almost everything, yet these libraries are stuck as single-core libraries because they were all initially written before 2003. If Google wants WebP and WebM to be useful, they should embrace multithreading, not keep covering their ears on it like they do with Chrome.
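One way around a single-threaded compression library is to parallelise above it: split the input into chunks and compress each chunk independently. Below is a minimal sketch using Python's standard zlib module and a thread pool. The chunk size, worker count, and function names are arbitrary choices for illustration, and it relies on CPython's zlib releasing the GIL during compression, which is my understanding of the current implementation rather than a guarantee.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for illustration

def compress_chunks(data, workers=4, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and compress each one independently."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the chunks can be rejoined later.
        return list(pool.map(zlib.compress, chunks))

def decompress_chunks(compressed):
    """Inverse of compress_chunks: decompress each chunk and concatenate."""
    return b"".join(zlib.decompress(c) for c in compressed)

if __name__ == "__main__":
    payload = b"bulldozer " * 500_000
    packed = compress_chunks(payload)
    assert decompress_chunks(packed) == payload
    print(len(payload), "bytes ->", sum(len(c) for c in packed),
          "bytes in", len(packed), "chunks")
```

The trade-off is that the output is a list of separate zlib streams rather than one standard stream, and compressing chunks in isolation usually costs a little compression ratio because each chunk starts with an empty dictionary.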
This performance test shows Bulldozer totally destroying other computer architectures:
http://www.youtube.com/watch?v=TrLAeDOcMIs
I have read that Bulldozer is having compiler issues in the desktop space. Apparently, the current GCC, Microsoft, Intel, etc. compilers are having problems with acceleration, core allocation, etc. Fixes are on the way, and some compilers, such as Open64 5.0, will apparently drastically improve Bulldozer performance. Could the same problem be occurring here?
OK, before anyone else says it, I meant "no doubt" not "know doubt". Mea culpa ;-)
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
There's a lot of discussion here about the possibility that current benchmark results are not accurate due to poor OS thread scheduling.
But as Peter points out in one of his responses on Ars,
"These are highly multithreaded workloads, some of which were affinitized to specific hardware threads. They are best case. They are already as optimized as they are ever going to get. The Opteron 6200 SPEC JBB2005 scores, for example, run multiple JVM instances (one per hardware thread) with each instance bound to its specific hardware thread. There was no bouncing between cores, no suboptimal operating system scheduling, no inadvertent cache thrashing of shared data. The workload has zero shared data. It is a perfect case for processor scaling, and yet the result called into question the very existence of Bulldozer. A 16-core K10.5 would have been cheaper to develop, would have come to market faster, would have used about the same number of transistors, would have performed just as well, and would have eliminated the performance regressions."
is it cheaper?
An article about benchmarks without benchmark charts; that's the first time I've seen that. I'm really impressed.
I have seen a number of applications do GPGPU because it has a *lot* of theoretical potential. I saw quite a few places spend a lot of money assuming they'd sort it out. Most (not all) found that the advertised benefit was not feasible to use with their workload. In some cases it was because the development cost was high, but in many cases they found they really *couldn't* execute in that context no matter the cost.
From the other end, even Intel is making great strides in CPU capability. When people painfully started doing transcode on GPGPU, they got some pretty dramatic results. Then Sandy Bridge brought a transcoding engine along that blew all the GPGPU transcode work out of the water. Despite having indisputably weak GPUs, Intel is able to deliver potent responses to GPGPU usage of the GPU chips.
Either way, GPGPU has no bearing on Bulldozer; the architecture doesn't seem particularly more amenable to GPGPU. With Bulldozer, AMD is gambling that somehow (between Piledriver and OS advances) the limitations hurting their performance today will be alleviated. Intel had a similar sort of behavior around NetBurst (except with an assumption of IA64 taking over as a long-term strategy), and it didn't pan out for them. It may or may not pan out for AMD.
XML is like violence. If it doesn't solve the problem, use more.
Geeze, I just fired up Task Manager and, except for a few low-level system processes, almost every app running has more than one thread. More than 50% have more than 10 threads. New features in C++ and C# make it very easy to support multiple threads in an app even when you don't think it's needed, like parallel data collections and queries. I think companies have to stop blaming the myth of single-threaded desktop apps for their CPUs' lousy performance; I don't think there is a single desktop app that runs in only one thread these days. I also think people who benchmark hardware should start getting a clue about the software they are testing on, maybe take a course on software development or something.
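The thread count that Task Manager shows can be reproduced for a single process. Here is a minimal sketch using Python's threading module: it starts ten worker threads and reads back how many threads the process has alive. The helper name and the choice of ten workers are made up for illustration. The workers here just block on an event, so the count on its own says nothing about how many cores the process is actually keeping busy.

```python
import threading

def count_with_idle_workers(n_workers=10):
    """Start n idle threads and report the live thread count before and during."""
    stop = threading.Event()
    # Each worker simply blocks until the event is set, doing no CPU work.
    workers = [threading.Thread(target=stop.wait) for _ in range(n_workers)]
    before = threading.active_count()
    for w in workers:
        w.start()
    during = threading.active_count()
    stop.set()            # release the waiting workers
    for w in workers:
        w.join()
    return before, during

if __name__ == "__main__":
    before, during = count_with_idle_workers()
    print(f"threads before: {before}, with idle workers: {during}")
```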
I haven't thought of anything clever to put here, but then again most of you haven't either.