Bulldozer Server Benchmarks Not Promising
New submitter RobinEggs writes "Some reviews of Bulldozer's server performance have arrived. Ars Technica has the breakdown, and the results are pretty ugly. Apparently Bulldozer fares just as poorly with servers as with desktops. From the article: 'One reason for the underwhelming performance on the desktop is that the Bulldozer architecture emphasizes multithreaded performance over single-threaded performance. For desktop applications, where single-threaded performance is still king, this is a problem. Server workloads, in contrast, typically have to handle multiple users, network connections, and virtual machines concurrently. This makes them a much better fit for processors that support lots of concurrent threads. ... It looks as though the decisions that hurt Bulldozer on the desktop continue to hurt it in the server room. Although the server benchmarks don't show the same regressions as were found on the desktop, they do little to justify the design of the new architecture.' It's probably much too early to start editorializing about the end of AMD, or even to say with certainty that Bulldozer has failed, but my untrained eye can't yet see any possible silver lining in these new processors."
Bulldozers do not make good servers. Use a computer. Problem solved.
How many more years will Slashdot have an off-by-one error on your Score in your profile?
And yet 3 supercomputers with those Opterons were ordered in the last 4 weeks? And in a month, one of them, which is being revamped from the #3 supercomputer position in the world, will be the #1 supercomputer in the world when complete? Were Lockheed Martin also morons to choose an Opteron-based supercomputer?
Why was an article that is apparently written to bash AMD included on Slashdot despite its apparent bias?
Read radical news here
The standard of writing at "Ars Technica" has declined far more than AMD's relative performance to Intel.
Recall the Itanium from Intel and HP. It started out with great hype more than ten years ago. When the first benchmarks came, no one wanted to believe them. Still, that particular architecture is about to die.
Unfortunately, Bulldozer may end up with a similar fate. The big difference is that Intel had its regular desktop cpu line-up to finance the Itanium disaster. If nothing can be much improved on the AMD cpu side, can the shrinking graphics card business save AMD?
I hope so.
I always liked the AMD CPUs, mostly for almost equal computing power for less money, but at the moment that doesn't really seem to be true anymore when I look at the benchmarks (doesn't matter if desktop or server).
I really don't get the conclusion.
Bulldozer is faster than the Xeon chip on all CPU benchmarks which can generate enough threads to fill all cores.
Each Bulldozer core is as fast as a core on an Opteron 6100.
It looks exactly like the cpu I want in my web/db server, and my supercomputer.
We need healthy competition to Intel, to keep pushing tech forward and prices down. Sadly AMD simply has not performed over the last year or two, with no real answers to Intel's I series.
I thought we all switched to full-fat floating-point operations over 15 years ago when the Pentium hit the mainstream and everyone finally had an on-die FPU in their PC
It's application dependent. I doubt much FP stuff gets done in cryptography, routing, and many simulations.
When someone says that a CPU was designed around multiple threads, I think virtualization. Yeah, you can argue that servers are multithreaded in that they have to handle multiple users connecting, but that's bull. I can write a badly threaded application that doesn't effectively use the multiple cores...
So how do these CPUs perform with something like ESX running on them?
Scott
That's perfect for running BOINC though, which is very good at using multiple cores at their full capacity. Useless for the business, but great for contributing to science projects :-)
Bulldozer chips are in short supply due to sales. Because AMD is not able to immediately meet Opteron demand, it is keeping 8150 supply low, binning the chips as Opterons instead, and therefore leaving the desktop market undersupplied. Read the informative thread below.
http://www.overclock.net/t/1171264/compared-3-different-bulldozer-fx-8120s-want-to-know-the-difference/10
Bulldozer 8150s have been in short supply on Newegg and Amazon. Sometimes they are out of stock, and you can't even put them on a watchlist.
Way too high sales for a 'failed' processor?
One element has me curious about how these benchmarks were prepared: Is the benchmark software compiled on the target platform/cpu combination with all available optimisations of that platform?
Many of these benchmarks have a binary/library, or set thereof, that is written for a single target platform (the platform the original developers of the benchmark were working on). It is usually pre-compiled, usually for Intel, on an Intel system, by an Intel compiler, with Intel optimisations, or at least two of the four. This same binary is then run on whatever systems with compatible architectures are being tested. This has a high potential to produce skewed results on non-Intel platforms, as not all manufacturers use the same optimisations.
While this specific processor may not be as great as it should have been, I feel that benchmarks in themselves are usually flawed and must be taken with a grain of salt until real-world software that isn't in a lab-style environment is attempted on it.
Maybe it's early, but I was having a hard time seeing the comparisons they were trying to make. Also, when Ars was comparing pricing (X system is 400k and Y system is 600k), what the hell was that? Usually stats like that would be accompanied by a link or cite to said system. It said benchmarks were "here", but I didn't see any. I'd like to see benchmark details such as the OS. It may be too early to judge, as this is the first-generation chip. And will Bulldozer perform better under the next iteration of Windows (if that was the control)?
Windows does not (yet) know how to properly schedule threads on that hardware. This has caused issues with all the benchmarks, not unlike what happened when Intel Hyperthreading was first released. Once the proper support is added to the OS kernels, the results should be much better.
TPC-C was run on Windows Server 2008; see http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501
AnandTech tested on Windows 7.
It is known that Windows 7 and 2008 are not optimized for Bulldozer, especially at the task scheduling level.
So we do not know the real power of the Bulldozer architecture in the Windows world yet.
See http://hexus.net/tech/news/cpu/32394-bulldozer-benchmarks-correct-definitive which unfortunately only has very few benchmarks.
You can also look at the Phoronix site, where Bulldozer is tested on Linux.
AnandTech.com provides much more knowledgeable and professional reviews. They had this to say about AMD's new chip: "Unfortunately, with the current power management in ESXi, we are not satisfied with the Performance/watt ratio of the Opteron 6276. The Xeon needs up to 25% less energy and performs slightly better. So if performance/watt is your first priority, we think the current Xeons are your best option. The Opteron 6276 offers a better performance per dollar ratio. It delivers the performance of $1000 Xeon (X5650) at $800. Add to this that the G34 based servers are typically less expensive than their Intel LGA 1366 counterparts and the price bonus for the new Opteron grows. If performance/dollar is your first priority, we think the Opteron 6276 is an attractive alternative." http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200/14
I'm suspicious that Bulldozer is going down remarkably like NetBurst (NetBurst made design compromises for marketable massive clock gains, Bulldozer similarly makes compromises to boost the now-marketable core count), though time may prove that wrong. Either way, this article was crap.
It looked like they cherry-picked some benchmarks from the world at large with no control. As pointed out in the article, the tpmC benchmark had massive storage differences, and the cost delta means there were probably node count differences. There are so many things in play that it is impossible to derive any sort of statement specifically about the processors. The article, however, uses the cost to make AMD look bad by showing it is more expensive, but in the same breath says better SSDs probably drove the benefit, to steal AMD's thunder. He can't have it both ways. I'm inclined to believe the storage architecture was the key in terms of cost and performance given the nature of the test.
Later, the article says AMD should have just done 16-core Magny-Cours. Clearly AMD should hire him as he is a genius who *must* have considered all the complexities and figured out a way to achieve that core density when no one else in the industry has. No one pretends for a second that a bulldozer module matches 2 'real' cores, but they can't just wave their wand and make a 16-core package of the old architecture. Bulldozer is all about trying to ascertain the 'important' bits of a core and share other bits in the hopes the added resource gives most of the benefit of an additional core without the downsides that make it impossible to do that many cores on a socket.
XML is like violence. If it doesn't solve the problem, use more.
Bulldozer can't consistently beat Phenom X6 in desktop workloads.
It can't consistently beat Magny-Cours in server workloads.
It doesn't seem to be any more power-efficient than AMD's last generation, despite being built on a smaller process node (32nm vs 45nm).
At what point does AMD simply admit Bulldozer is a failure, pull the plug, and write off the sunk costs? Putting good money after bad is a classic business mistake that has killed many companies.
AMD should continue improving their existing cores on the 32nm process (they already have some of the work done with Llano) and forget about their "revolutionary new" architecture which is basically this decade's Prescott.
Or, heck, see if it's possible to scale up the Bobcat cores for mainstream desktop use. Don't forget, Intel's very successful Core 2 Duo came from a previous design (Pentium M) that had been reserved for laptops. AMD will probably have more luck increasing performance (both raw clock and IPC) on Bobcat than trying to tame the heat, insane transistor count, and long pipeline of Bulldozer.
Tech Report demonstrated this to be the case by setting the thread affinity on their tests, so they were locked to specific cores, using only one core per module. They saw as much as a 30% improvement in the single-threaded or lightly threaded benchmarks. Other sources, including AMD itself, have demonstrated as much as 10% improvement in performance by using a better thread scheduler. AMD has whitepapers discussing this issue.
As for changing the OS kernels... Windows 8 already has the changes. Windows 7 and Server 2008 may get them in a future update (Service Pack?). Linux kernel support is ready and is available in a kernel patch. Compiler support is now included in VS 2010. So it's not necessarily a flop, but it might be a short while before the full capability of the architecture is realized.
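For anyone who wants to try the one-core-per-module experiment described above, here is a minimal sketch in Python. It is not what Tech Report ran (their tests were on Windows); it uses the Linux-only os.sched_setaffinity call. The helper names are made up for illustration, and the big assumption is that the two cores of module k show up as logical CPUs 2k and 2k+1. Check your actual topology before trusting that.

```python
import os


def one_core_per_module(cpu_ids, cores_per_module=2):
    """Keep only the first logical CPU of each module.

    ASSUMPTION: the cores of module k are numbered k*cores_per_module
    onward (e.g. 2k and 2k+1). Real numbering depends on the OS and BIOS.
    """
    return {cpu for cpu in cpu_ids if cpu % cores_per_module == 0}


def pin_to_one_core_per_module():
    """Restrict the current process to one core per module (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # call not available on this platform
    allowed = os.sched_getaffinity(0)  # CPUs this process may run on now
    # Fall back to the current set if the filter leaves nothing.
    chosen = one_core_per_module(allowed) or allowed
    os.sched_setaffinity(0, chosen)  # pid 0 means "the calling process"
    return chosen
```

Run a lightly threaded benchmark with and without calling pin_to_one_core_per_module() first, and compare the timings.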
Every large business, and most medium sized ones, are going to try to (at least) match that target.
(although memory seems to be a bigger constraint.)
Wrong. The 386SX had a 16-bit memory bus (vs. 32-bit on the DX). It had no FPU; that was a separate socket.
I run: Windows, OS X, Linux, FreeBSD. Just because you have a hammer, doesn't mean everything is a nail.
If this is a serious production application, consider optimizing your software. Firstly, spawning endless threads is rarely an efficient use of resources. Once the thread count exceeds the number of hardware threads the CPU can run concurrently, each extra thread adds only scheduling and context-switching overhead. The degree to which this overhead can be reduced is application dependent, but it is often worth chasing.
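A rough illustration of the thread-count point, as a sketch only: instead of one thread per work item, cap a pool at the number of hardware threads the OS reports. The handle function is a hypothetical stand-in for the real per-item work. Note that in standard CPython the GIL keeps threads from speeding up CPU-bound work, so this only demonstrates the bounded-pool pattern, not a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def handle(item):
    # Hypothetical placeholder for the real per-item work.
    return item * 2


def process_all(items, max_workers=None):
    """Process items with a pool capped at the hardware thread count."""
    # os.cpu_count() may return None, hence the final fallback to 1.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() queues the items; only `workers` threads ever exist.
        return list(pool.map(handle, items))
```

With this shape, 10,000 work items still run on at most os.cpu_count() threads, and the results come back in input order.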
Additionally, applying 10,000 and 200 rules at a rate of one thread per rule per packet is probably not a sensible strategy. Consider merging the rules and using one or more state machines to process the packet data once. That way only one set of reads occurs per incoming packet, and the rest of the code is executed in tight state machines that cache efficiently.
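To make the merged-rules idea concrete, here is a minimal sketch that assumes each rule is just a literal byte string to look for in the packet (real rule sets are usually richer than that). It swaps in the Aho-Corasick algorithm as the state machine: all rules are merged into one automaton, and the packet is then read once, byte by byte. The function names are made up for illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Merge literal byte patterns into one Aho-Corasick automaton."""
    goto = [{}]    # goto[state][byte] -> next state
    fail = [0]     # failure link for each state
    out = [set()]  # rule indices matched on reaching each state
    for idx, pattern in enumerate(patterns):
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state].add(idx)
    # Breadth-first pass to fill in failure links, shallow states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def scan(packet, automaton):
    """Return indices of all rules that occur anywhere in the packet."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for byte in packet:  # each packet byte is read exactly once
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        hits |= out[state]
    return hits
```

Usage: build the automaton once from the full rule list, then call scan(packet, automaton) per packet. The per-packet cost grows with the packet length rather than with the number of rules.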
Finally, did you mean 65535*2*10k + 4*252*200?
If ^ is interpreted as a power symbol, 65535^2^10k+4^252^200 results in a fantastically large number that all the supercomputers in the world could not process in real-time.
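For a sense of scale, a quick sketch of both readings. It assumes "10k" means 10,000 and, for the power reading, that ^ associates to the right (as ** does in Python). The power value is far too large to build directly, so the code only estimates how many digits its digit count has, using logarithms.

```python
import math

K = 1000  # ASSUMPTION: "10k" means 10 * 1000

# Reading 1: plain multiplication.
product_reading = 65535 * 2 * (10 * K) + 4 * 252 * 200

# Reading 2: powers, right-associative: 65535 ** (2 ** 10000) + ...
# digits(x ** y) is about y * log10(x), so the number of digits OF THAT
# digit count is about log10(y) + log10(log10(x)), rounded down, plus 1.
digits_of_digit_count = math.floor(
    10 * K * math.log10(2) + math.log10(math.log10(65535))
) + 1

print(product_reading)        # about 1.3 billion
print(digits_of_digit_count)  # length of the digit count, first term only
```

The multiplication reading gives 1,310,901,600. Under the power reading, just writing down how many digits the first term has would itself take 3011 digits.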