Smarter Thread Scheduling Improves AMD Bulldozer Performance
crookedvulture writes "The initial reviews of the first Bulldozer-based FX processors have revealed the chips to be notably slower than their Intel counterparts. Part of the reason is the module-based nature of AMD's new architecture, which requires more intelligent thread scheduling to extract optimum performance. This article takes a closer look at how tweaking Windows 7's thread scheduling can improve Bulldozer's performance by 10-20%. As with Intel's Hyper-Threading tech, Bulldozer performs better when resource sharing is kept to a minimum and workloads are spread across multiple modules rather than the multiple cores within them."
that's the truth. unless i can buy an AMD server for a lot cheaper i'm not going to try and take on the risk of performance issues
So basically they suck. I shouldn't need to tweak my os thread scheduler just so a cpu can suck less. AMD needs to fix their shit instead of lame excuses.
Perhaps I'm remembering incorrectly, but I thought part of the Bulldozer hype was that it had two 'real' cores and not hyperthreading, with only a few resources shared? Yet now it turns out that you have to treat it like a hyperthreading CPU or performance sucks.
I still don't understand why AMD didn't just set the hyperthreading bit in the CPU flags, so Windows would presumably just treat it like a hyperthreading CPU in the first place.
The article basically says "if you schedule threads to use fewer modules, dynamic turbo will clock those modules up, giving you a performance boost."
so... anybody who is already clocking their entire cpu at top stable clock speed isn't going to get a boost out of thread scheduler modifications.
I take it back. apparently that's what page 1 says. There is a page 2 and it says something else entirely.
Sure, the scheduling change improves performance by 10-20% for certain tasks, but that still makes it 30-50% slower than an i7, and with more power consumption.
I can't fault AMD for not having full third-party support for their custom features, since Intel had a head-start with hyperthreading, but if it will still be an inferior product even after support is added then I'm not going to buy it.
You mean integer based instructions. Floating point is still not as good with the AMD chips (unless using the new instructions)
I would have been content if they had shrunk the X6 core down to 32nm, slapped 2 of them on a chip and sold it as a 12 core. They could have released it a year ago.
Intel did just that with their first quad core, and the consumer wasn't concerned about philosophical discussions on its cores. Heck, I'm typing this message on a Kentsfield chip right now and even after all these years it's a great processor.
Worse, what this shows is that AMD's idea that you only need one FPU for every two integer units (how Bulldozer is laid out) results in a 20% performance drop.
Why does this sound like Barcelona? Granted, Bulldozer doesn't seem to have the same breadth of architectural flaws, but still. God I miss the days when AMD came out with the X2 series... There is just no way AMD can compete with Sandy Bridge. With Ivy Bridge coming up, things are not looking good for AMD. After Barcelona they needed to catch up a bit; instead, the performance gap to Intel's offerings seems to be growing.
This is really more of an OS-level problem. CPU scheduling on multiprocessors needs some awareness of the costs of an interprocessor context switch. In general, it's faster to restart a thread on the same processor it previously ran on, because the caches will have the data that thread needs. If the thread has lost control for a while, though, it doesn't matter. This is a standard topic in operating system courses. An informal discussion of how Windows 7 does it is useful.
Windows 7 generally prefers to run a thread on the same CPU it previously ran on. But if you have a lot of threads that are frequently blocking, you may get excessive inter-CPU switching.
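To make the "prefer the CPU it last ran on" idea concrete, here is a toy sketch in Python. It is not how the Windows 7 scheduler is actually implemented; the function name, the `ran_recently` flag and the "lowest-numbered idle CPU" fallback are all assumptions made up for illustration.

```python
def pick_cpu(last_cpu, idle_cpus, ran_recently):
    """Toy soft-affinity choice (hypothetical, not the Windows algorithm).

    last_cpu     -- CPU id the thread last ran on
    idle_cpus    -- set of CPU ids that are currently idle
    ran_recently -- True if the thread's data is probably still in cache
    """
    if not idle_cpus:
        return None  # nothing free; the thread has to wait
    # Warm cache and the old CPU is free: go back to it.
    if ran_recently and last_cpu in idle_cpus:
        return last_cpu
    # Otherwise the cache is cold or the old CPU is busy, so any idle CPU
    # will do. Picking the lowest id is an arbitrary choice for this sketch.
    return min(idle_cpus)
```

With many threads that block frequently, the second branch gets taken a lot, which is the inter-CPU switching described above.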
On top of this, the Bulldozer CPU adjusts the CPU clock rate to control power consumption and heat dissipation. If some cores can be stopped, the others can go slightly faster. This improves performance for sequential programs, but complicates scheduling.
Manually setting processor affinity is a workaround, not a fix.
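For anyone who wants to try the workaround anyway, here is a rough Python sketch of building an affinity mask that uses one core per module before doubling up. It assumes logical cores 2k and 2k+1 share module k on a 4-module, 8-core part; that numbering is an assumption, so check how your OS enumerates the cores. The helper names are invented for this example.

```python
def spread_across_modules(n_threads, n_modules=4, cores_per_module=2):
    """Pick logical core ids, one core per module first, then siblings.

    Assumes cores m*cores_per_module .. m*cores_per_module+1 share module m.
    """
    total = n_modules * cores_per_module
    if not 0 <= n_threads <= total:
        raise ValueError("n_threads must be between 0 and %d" % total)
    # Visit slot 0 of every module, then slot 1 of every module, and so on.
    order = [m * cores_per_module + slot
             for slot in range(cores_per_module)
             for m in range(n_modules)]
    return sorted(order[:n_threads])


def affinity_mask(cores):
    """Turn a list of core ids into a bitmask (bit i set = core i allowed)."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


if __name__ == "__main__":
    cores = spread_across_modules(4)
    print(cores, hex(affinity_mask(cores)))
```

On Windows the hex mask can be handed to `start /affinity`; on Linux the core list can be passed to `os.sched_setaffinity(0, cores)`. Either way it only pins one process by hand, which is why it stays a workaround.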
http://hardware.slashdot.org/story/11/09/13/1336210/amd-breaks-overclocking-record-with-bulldozer AMD already showed how to speed things up on their Bulldozer line
Windows is not exactly known for its multi-processor (multi-core) scalability.
Repeat the test with a real OS (Linux, Solaris...) and I'll be interested, especially Solaris x86 since it is known to be the best at scaling on parallel hardware.
http://www.overclock.net/amd-cpus/1141562-practical-bulldozer-apps.html
also, if you set your cpuid to GenuineIntel in some of the benchmark programs, you will get surprising results
try changing cpuid=genuineintel for +47% INCREASE IN SCORES.
changing cpuid to GenuineIntel nets 47.4% increase in performance:
http://www.osnews.com/story/22683/Intel_Forced_to_Remove_quot_Cripple_AMD_quot_Function_from_Compiler_
PCMark/Futuremark rigged bentmark to favor intel:
http://www.amdzone.com/phpbb3/viewtopic.php?f=52&t=135382#p139712 http://arstechnica.com/hardware/reviews/2008/07/atom-nano-review.ars/6
intel cheating at 3DMark vantage via driver: http://techreport.com/articles.x/17732/2
relying on bentmarks to "measure performance" is a fool's errand. dont go there.
The idiocy here is that they've not succeeded in making Bulldozer faster, they've succeeded in making one very specific benchmark run faster with very specific scheduler settings for that exact one benchmark. Give it some different code to run and this'll degrade performance.
This an architecture designed for a ten year run, much like the original P6, which underwhelmed everyone with (at most) half a brain.
Just how long do you think the OS can remain task agnostic as we head down the road to eight and sixteen core processors? Why plan for the future when we can languish on easy-street for another year or two? When the PC came out, some people complained they "would have preferred" a superior and more reliable electronic typewriter.
I'm quite certain the correct design approach is to resource a CPU regarding TDP as your performance wall. If eight floating point units require more TDP than your chip provides, what point is there in providing eight such units? And even if the math in the first spin from the new architecture could have gone the other way on some of these matters, in no time at all you're up hard against it, if you glance a few weeks further down the roadmap.
It's a bizarre conceit in any other walk of life that you can get away with not knowing the workload on the path to optimal resource assignment. Half of the human brain is devoted to power management. The glucose demand of the human brain is one of the big reasons why we were a late addition to mother nature's species road map. The brain doesn't operate from a baseline glucose guzzle equally able to handle any task that might come up. Much of what we perceive as quick reaction is only possible because the brain decided to fire up the necessary circuit 400ms beforehand.
I think it is time for some reverse engineering of the benchmark programs, to see what exactly is happening.
Here's Agner Fog's page about this issue.
The Intel compiler (for many years and many versions) has generated multiple code paths for different instruction sets. Using the lame excuse that they don't trust other vendors to implement the instruction set correctly, the generated executables detect the "GenuineIntel" CPU vendor string and deliberately cripple your program's performance by not running the fastest codepaths unless your CPU was made by Intel. So e.g. if you have an SSE4-capable AMD CPU, it will run the SSE2 codepath instead of the SSE4 codepath that comparable Intel chips will run.
Over the years, MANY libraries (including several from Intel) have been compiled and shipped with this compiler, with the result that the applications compiled with those libraries, including many benchmarks, also suffer from the same performance sabotage.
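As a rough illustration of the difference between dispatching on the vendor string and dispatching on what the CPU actually supports, here is a toy model in Python. It is not the Intel compiler's real dispatcher; the path names and the simplified feature names ("sse2", "sse3", "sse4") are assumptions for the sketch. Only the vendor strings "GenuineIntel" and "AuthenticAMD" are the real CPUID values.

```python
# Hypothetical code paths, fastest first. Real dispatchers distinguish many
# more instruction-set levels than this.
PATHS_FASTEST_FIRST = ["sse4", "sse3", "sse2"]


def best_supported_path(features):
    """Return the fastest path whose instruction set the CPU reports."""
    for path in PATHS_FASTEST_FIRST:
        if path in features:
            return path
    return "generic"


def dispatch_by_vendor(vendor, features):
    """Behaviour described above: fast paths only when the vendor matches."""
    if vendor == "GenuineIntel":
        return best_supported_path(features)
    # Any other vendor is held to the baseline path, whatever it supports.
    return "sse2" if "sse2" in features else "generic"


def dispatch_by_features(vendor, features):
    """Vendor-neutral alternative: trust the feature flags alone."""
    return best_supported_path(features)


if __name__ == "__main__":
    amd = ("AuthenticAMD", {"sse2", "sse3", "sse4"})
    print(dispatch_by_vendor(*amd), dispatch_by_features(*amd))
```

In this model, changing only the vendor string is enough to move the same CPU from the baseline path to the fastest one, which is the effect people report when they spoof the CPUID.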
>relying on bentmarks to "measure performance" is a fool's errand. dont go there.
And yet, that's what you're doing.
The correct phrase is: Relying on benchmarks that are not relevant to your application is a fool's errand.
In theory it actually has the equivalent of a 128-bit wide FPU for every integer unit. Though I hear rumours that they may have not put as much effort into making the classic x87 FPU instructions run fast and that harmed them in some of the non-SSE-supporting benchmarks that a lot of the reviews used.
One fun side note: notice how that link says "it will fail to recognize future Intel processors with a family number different from 6". Intel have conspicuously kept the family number reported by CPUID at 6 on their new processors in order not to trigger a fallback to the non-Intel pathway that AMD processors get to use, presumably because they know how much that'll harm them in benchmarks and how bad the reviews will look.
The correct phrase is: Relying on benchmarks that are not relevant to your application is a fool's errand.
Yes, yes, yes.
The Bulldozer architecture is heavily optimized for highly threaded applications with a heavy reliance on integer operations. This is well represented by today's server workloads, not today's desktop applications. But more importantly for AMD's future, this also represents the trending path of tomorrow's applications. A great example of this is Battlefield 3, where the 8150 outperforms the i7-2600K. Unfortunately, this also means that whether Bulldozer or Sandy Bridge is faster today depends on the application. From the above test we can almost assuredly guess that BF3 does more integer work, while Civ 5 does more floating point work.
However, less obvious than the multithreading issue is the push away from using the CPU for floating point operations. This is one both Intel and AMD have been slowly gambling on for quite some time: putting floating point operations on the GPU. AMD has just taken a more "committed" approach to this. It's also something that may pay off big time.
As an aside, as a server administrator today I would buy Bulldozer over Sandy Bridge based processors in a heartbeat. Most of the "scale out" boxes such as web caches, database servers, etc., are highly multithreaded, integer-driven workloads. In this case Bulldozer is going to destroy Sandy Bridge, plain and simple. Also, to those citing supercomputers: those tend to be floating point driven, as they are generally for simulation.
Those 8150 vs. 2600K numbers are sans discrete graphics. It shows the 8150 integrating a stronger graphics core. It's still a pretty crummy graphics core. It's just better than the basic-equipment one that Intel put in the 2600k. The 2600K is much older than the 8150, though, and hardly Intel's top-end part. It's just the one that AMD chose to compare against. They do that. Bring out a half-assed chip, then pick an Intel part they can beat and beat it. While Intel's other parts stand on the other side of the cafeteria wondering why the fat kid is picking on the nerdy kid instead of someone their own size.
So, if you're in the small niche where you want to play Battlefield 3 on a computer with no discrete graphics, the 8150 is probably your choice. Or you could wait a couple of weeks for Intel's next part. Or upgrade to new-release discrete graphics, which will kick that integrated graphics solution's ass using either CPU.
So what I'm saying is, AMD's market is lamers, not gamers. And they seem to be making a little money at it. For now. Bulldozer has yet to dominate a quarterly report, so its issues, which comprise both performance and manufacturability, have yet to reveal themselves as a significant driver of the bottom line. Q3 was good to them, Q4 not likely to be. 12Q1 could be their doom if they don't find a miracle.
I'm sure there will be plenty of fanbois in this discussion, but this is the person you chose to call out?
He seems to be more or less right. This kind of scheduling might help many processor-intensive tasks, but this kind of scheduling isn't available to the majority of software, and it's obvious that Windows isn't smart enough to do it either. Unless AMD gets Windows support, or BIOS trickery as mentioned at the end of the article... these chips will be under-utilized.
But no, you come here to say, in essence, "people with contrary opinions suck," no matter how reasonable they may be. It is you who reveals yourself to be a fanboy, if that's all you have to add.
The Bulldozer architecture is heavily optimized for highly threaded applications with a heavy reliance on integer operations. This is well represented by today's server workloads, not today's desktop applications. But more importantly for AMD's future, this also represents the trending path of tomorrow's applications.
No it doesn't. AMD marketing would like you to believe every application is going to go massively parallel, but they're not the ones who actually have to write the software. It is not easy to thread all types of software (the low hanging fruit has already been picked), and it can be hard or impossible to get gains beyond two or three threads for many types of code.
A great example of this is Battlefield 3, where the 8150 outperforms the i7-2600K.
Er, what? The 2600k outperformed the 8150 when both were clocked at stock speeds. The 8150 won in the overclocked test.
I shouldn't use the terms "outperformed" or "won", however, as this test was very sloppily done. Carefully read the description. Apparently the BF3 beta lacks facilities for repeatable benchmarking, so they just played on a live server with real players. This guarantees lots of noise and poor repeatability. And you can see that noise in the data: the BD average frame rate went up when overclocked, but the i5 and i7 averages went slightly down. That shouldn't ever happen in a CPU benchmark, not when you're raising the clock speed of the processor by over 1 GHz.
In fact, note that all the average FPS results fall in the range ~50.5 to 54 regardless of CPU type and clock, and the overclocked i7-2600K loses to every non-OC result, including itself! This test isn't CPU limited in any way. It's just a fancy way of generating random numbers with no correlation to CPU performance.
Unfortunately, this also means that whether Bulldozer or Sandy Bridge is faster today depends on the application. From the above test we can almost assuredly guess that BF3 does more integer work, while Civ 5 does more floating point work.
You're assuming games load up all cores. Few games use more than 2 or 3 cores, even recent releases. This is for two related reasons. First, it's hard to scale game logic to huge numbers of cores. Second, game developers know that it's only recently that new PC sales ticked over to a dual-core on average, much less quad or better, and they want to spend most of their time working on things which will benefit all their customers rather than putting out a lot of effort to help only the 5% or less who own high end hardware.
(This is also why SLI/CrossFire has always been plagued by poor game support, and ATI & Nvidia have had to try various schemes to incentivize developers to spend time on multi-GPU.)
Benchmarkers who managed to actually create CPU-limited gaming tests almost universally found that Sandy Bridge stomped all over Bulldozer. SB's per-thread performance is much better, which is a better match to today's (and tomorrow's) software.
As an aside, as a server administrator today I would buy Bulldozer over Sandy Bridge based processors in a heartbeat. Most of the "scale out" boxes such as web caches, database servers, etc., are highly multithreaded, integer-driven workloads. In this case Bulldozer is going to destroy Sandy Bridge, plain and simple.
I think you're getting a bit ahead of yourself. Client BD certainly hasn't destroyed client SB, even in highly multithreaded integer workloads (it's more like it wins some, loses others), so what reason is there to believe anything will be different in servers? Keep in mind that Intel has yet to even release its high end SB CPU and platform (6-core/12-thread desktop, 6C/12T server, and 8C/16T server); all existing SB CPUs are mainstream desktop/notebook (4C/8T max with integrated gr
I'm sure there will be plenty of fanbois in this discussion, but this is the person you chose to call out?
Half truths are the most insidious kind. So somebody discovers that Bulldozer likes a certain kind of scheduling and runs faster with it. That is not "one benchmark", that is an interesting optimization technique. Completely fair, and nobody can claim that Intel fails to benefit from optimizations directed at their exact architecture as well.
Have you got your LWN subscription yet?
I don't really understand what the fuck that link is supposed to prove other than you have been published. So what? I have been published before as well. Does being published mean you know what you are talking about? Nope. Logic flows from evidence and rational argument, and I see no evidence nor rational argument in your post.
That brings me to an interesting point: /. is just "the ramblings of socially-inept, technology-literate news-mongers".
The problem being that what you just said is a half truth: what they discovered is not that Bulldozer likes a certain kind of scheduling and runs faster with it. Instead, what they discovered is that when running exactly one benchmark, Bulldozer likes a certain kind of scheduling and runs faster with it. This says nothing about the effect on other benchmarks, and, Windows devs not being stupid (though many would argue they are), I'm quite sure that in speeding one thing up, they'll have slowed several others down.
Sounds like sticking with GCC or some other neutral compiler is the best option.
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
No.. Sticking with the compiler most likely to be used in the programs you care about is the best option. If that's GCC it's GCC, if it's ICC, it's ICC.
Perhaps the Intel compiler situation might explain some of my experiences. I have a system (Mac Pro) with a quad-core Bloomfield Xeon @ 3.2GHz and also a system with a quad-core Phenom II also @ 3.2GHz. On most things I run, the Xeon is faster. However, the same is not true for the software I write. If I implement, say, the edit distance algorithm in C to compare two DNA molecules, and compile for x86-64 with gcc, the Phenom II is about 10% faster than the Xeon for a single thread. Interestingly, the Xeon is faster if compiled for 32bit arch. Then, a string processing program I have in Perl runs at about the same speed per thread on both CPUs. The Xeon does get an advantage for more than 4 threads due to HT, but of course I could switch to the cheap Phenom X6 if I needed such workloads...
Overall, for the custom software I use daily for work, the $5000 Xeon Mac Pro machine is not faster than the $500 Phenom II system...
Interestingly, the Xeon is faster if compiled for 32bit arch.
Obviously compared to the Phenom also running a 32bit binary.
No it doesn't. AMD marketing would like you to believe every application is going to go massively parallel, but they're not the ones who actually have to write the software. It is not easy to thread all types of software (the low hanging fruit has already been picked), and it can be hard or impossible to get gains beyond two or three threads for many types of code.
Yeah... No. Data dependencies are actually fairly low in almost every real world application. Highly threaded applications are rare outside of the enterprise (read Google/Amazon/transaction processing etc.) world because they are modestly hard to write and there is a significant lack of developers experienced with making them. However, as more performance is required, everyone is moving that way. Sure, simple word editing may not require that many threads, but anything that needs performance will move in that direction. It is not hype. Also, Battlefield 3 loads all cores of an i7-2600K to 50% on ultra.
In fact, note that all the average FPS results fall in the range ~50.5 to 54 regardless of CPU type and clock, and the overclocked i7-2600K loses to every non-OC result, including itself! This test isn't CPU limited in any way ...
You're assuming games load up all cores. Few games use more than 2 or 3 cores, even recent releases. This is for two related reasons. First, it's hard to scale game logic to huge numbers of cores. Second, game developers know that it's only recently that new PC sales ticked over to a dual-core on average, much less quad or better, and they want to spend most of their time working on things which will benefit all their customers rather than putting out a lot of effort to help only the 5% or less who own high end hardware.
Games are not the only thing in the world. People who actually do computing as a profession, for example those of us who have to compile kernels on a regular basis, care about stuff like highly multithreaded integer-based workloads. I also game, so as long as I can get comparable performance in games it doesn't really matter to me. This will be the case for many people.
I think you're getting a bit ahead of yourself. Client BD certainly hasn't destroyed client SB, even in highly multithreaded integer workloads (it's more like it wins some, loses others), so what reason is there to believe anything will be different in servers? Keep in mind that Intel has yet to even release its high end SB CPU and platform (6-core/12-thread desktop, 6C/12T server, and 8C/16T server); all existing SB CPUs are mainstream desktop/notebook (4C/8T max with integrated graphics). And the 8150 is the exact same chip AMD is going to be selling as a server CPU. (Unlike Intel since Nehalem, AMD tries to use the same chip design for server and client, because AMD doesn't have the volumes to justify taping out a different chip for each market.) Worse yet for AMD, every review I've seen where they measured power showed BD using a lot more juice, at least 50W more in every case (just spot checked one review and it was ~75W more load power than i7-2600K). That's not going to go over so well in the server space.
The 8150 isn't the chip that AMD is offering for servers. AMD offers different chips for server/client, unless you seriously think the 12-core Magny-Cours is for clients? They have an 8-core/16-thread offering from the Bulldozer architecture for servers too. The Intel SB server processors have been available to my company (admittedly a very large one) for quite some time as well. I am assuming you literally don't know what you're talking about at this point on the server front.