Smarter Thread Scheduling Improves AMD Bulldozer Performance
crookedvulture writes "The initial reviews of the first Bulldozer-based FX processors have revealed the chips to be notably slower than their Intel counterparts. Part of the reason is the module-based nature of AMD's new architecture, which requires more intelligent thread scheduling to extract optimum performance. This article takes a closer look at how tweaking Windows 7's thread scheduling can improve Bulldozer's performance by 10-20%. As with Intel's Hyper-Threading tech, Bulldozer performs better when resource sharing is kept to a minimum and workloads are spread across multiple modules rather than the multiple cores within them."
Bulldozer sucks at multitasking, but it's great if programmers utilize parallel programming techniques (which they don't use right now anyway--multicore processors are pretty much explicitly for improving multitasking performance due to this).
that's the truth. unless i can buy an AMD server for a lot cheaper i'm not going to try and take on the risk of performance issues
So basically they suck. I shouldn't need to tweak my os thread scheduler just so a cpu can suck less. AMD needs to fix their shit instead of lame excuses.
Perhaps I'm remembering incorrectly, but I thought part of the Bulldozer hype was that it had two 'real' cores and not hyperthreading, with only a few resources shared? Yet now it turns out that you have to treat it like a hyperthreading CPU or performance sucks.
I still don't understand why AMD didn't just set the hyperthreading bit in the CPU flags, so Windows would presumably just treat it like a hyperthreading CPU in the first place.
The article basically says "if you schedule threads to use fewer modules, dynamic turbo will clock those modules up, giving you a performance boost."
so... anybody who is already clocking their entire cpu at top stable clock speed isn't going to get a boost out of thread scheduler modifications.
I take it back. apparently that's what page 1 says. There is a page 2 and it says something else entirely.
Sure, the scheduling change improves performance by 10-20% for certain tasks, but that still makes it 30-50% slower than an i7, and with more power consumption.
I can't fault AMD for not having full third-party support for their custom features, since Intel had a head-start with hyperthreading, but if it will still be an inferior product even after support is added then I'm not going to buy it.
I would have been content if they had shrunk the X6 core down to 32nm, slapped 2 of them on a chip and sold it as a 12 core. They could have released it a year ago.
Intel did just that with their first quad core, and the consumer wasn't concerned about philosophical discussions on its cores. Heck, I'm typing this message on a Kentsfield chip right now and even after all these years it's a great processor.
Why does this sound like Barcelona? Granted, Bulldozer doesn't seem to have the same breadth of architectural flaws, but still. God, I miss the days when AMD came out with the X2 series... There is just no way AMD can compete with Sandy Bridge. With Ivy Bridge coming up, things are not looking good for AMD. After Barcelona they needed to catch up a bit; however, the performance difference compared with Intel's offerings seems to be increasing.
This is really more of an OS-level problem. CPU scheduling on multiprocessors needs some awareness of the costs of an interprocessor context switch. In general, it's faster to restart a thread on the same processor it previously ran on, because the caches will have the data that thread needs. If the thread has lost control for a while, though, it doesn't matter. This is a standard topic in operating system courses. An informal discussion of how Windows 7 does it is useful.
Windows 7 generally prefers to run a thread on the same CPU it previously ran on. But if you have a lot of threads that are frequently blocking, you may get excessive inter-CPU switching.
On top of this, the Bulldozer CPU adjusts the CPU clock rate to control power consumption and heat dissipation. If some cores can be stopped, the others can go slightly faster. This improves performance for sequential programs, but complicates scheduling.
Manually setting processor affinity is a workaround, not a fix.
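For anyone who wants to try the affinity workaround by hand, here is a rough sketch in Python. It assumes logical cores are numbered so that cores 2k and 2k+1 share module k, which is an assumption about how the OS enumerates them and not something the article confirms. The function names are made up for illustration. It picks one core per module and builds the hex mask that Windows' start /affinity takes.

```python
def one_core_per_module(n_cores, cores_per_module=2):
    """Return the first logical core of each module.

    ASSUMPTION: cores are enumerated module by module, so cores
    0 and 1 share module 0, cores 2 and 3 share module 1, etc.
    """
    return list(range(0, n_cores, cores_per_module))


def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed logical core."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


cores = one_core_per_module(8)      # hypothetical 4-module, 8-core chip
mask = affinity_mask(cores)
print(cores, format(mask, "X"))     # prints: [0, 2, 4, 6] 55
```

The hex value is what you would hand to start /affinity on Windows 7. On Linux, os.sched_setaffinity(0, cores) should do the same job for the current process. Either way it is still the workaround described above, not a scheduler fix.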
http://hardware.slashdot.org/story/11/09/13/1336210/amd-breaks-overclocking-record-with-bulldozer AMD already showed how to speed things up on their Bulldozer line
Windows is not exactly known for its multi-processor (multi-core) scalability.
Repeat the test with a real OS (Linux, Solaris...) and I'll be interested, especially Solaris x86 since it is known to be the best at scaling on parallel hardware.
http://www.overclock.net/amd-cpus/1141562-practical-bulldozer-apps.html
Also, if you set your CPUID vendor string to GenuineIntel in some of the benchmark programs, you will get surprising results.
Changing the CPUID to GenuineIntel nets a 47.4% increase in scores:
http://www.osnews.com/story/22683/Intel_Forced_to_Remove_quot_Cripple_AMD_quot_Function_from_Compiler_
PCMark/Futuremark rigged bentmark to favor Intel:
http://www.amdzone.com/phpbb3/viewtopic.php?f=52&t=135382#p139712 http://arstechnica.com/hardware/reviews/2008/07/atom-nano-review.ars/6
Intel cheating at 3DMark Vantage via driver: http://techreport.com/articles.x/17732/2
relying on bentmarks to "measure performance" is a fool's errand. dont go there.
Type start /? & see this part - the "help/manpage" for the start command!
(Mind you - I am on Windows 7 64 bit here):
SEPARATE Start 16-bit Windows program in separate memory space.
SHARED Start 16-bit Windows program in shared memory space.
* The "bug" being there ARE NO 16-bit WINDOWS SUBSYSTEMS IN 64-bit Windows..., only 32-bit subsystems...
APK
P.S.=> No "Huge Bug", but a misleading statement in the start commands' help outputs... apk
This is an architecture designed for a ten year run, much like the original P6, which underwhelmed everyone with (at most) half a brain.
Just how long do you think the OS can remain task agnostic as we head down the road to eight and sixteen core processors? Why plan for the future when we can languish on easy-street for another year or two? When the PC came out, some people complained they "would have preferred" a superior and more reliable electronic typewriter.
I'm quite certain the correct design approach is to resource a CPU regarding TDP as your performance wall. If eight floating point units require more TDP than your chip provides, what point is there in providing eight such units? And even if the math in the first spin from the new architecture could have gone the other way on some of these matters, in no time at all you're up hard against it, if you glance a few weeks further down the roadmap.
It's a bizarre conceit in any other walk of life that you can get away with not knowing the workload on the path to optimal resource assignment. Half of the human brain is devoted to power management. The glucose demand of the human brain is one of the big reasons why we were a late addition to mother nature's species road map. The brain doesn't operate from a baseline glucose guzzle equally able to handle any task that might come up. Much of what we perceive as quick reaction is only possible because the brain decided to fire up the necessary circuit 400ms beforehand.
With Windows it's always wait until the next version of Windows to get new features. MS could easily hotfix Windows 7 with a Bulldozer-aware scheduler. They won't, because they want something to drive sales of their "new and improved" OS. This is just one of the various reasons why I use Linux. In about a month Linux (kernel 3.2) will roll out a Bulldozer-aware scheduler and I'm set.
Interestingly enough, if you look at the current Bulldozer benchmarks on Linux, it's performing quite nicely (even without the Bulldozer tuned scheduler).
It is time for some reverse engineering of the benchmark programs I think to see what exactly is happening.
Here's Agner Fog's page about this issue.
The Intel compiler (for many years and many versions) has generated multiple code paths for different instruction sets. Using the lame excuse that they don't trust other vendors to implement the instruction set correctly, the generated executables detect the "GenuineIntel" CPU vendor string and deliberately cripple your program's performance by not running the fastest codepaths unless your CPU was made by Intel. So e.g. if you have an SSE4-capable AMD CPU, it will run the SSE2 codepath instead of the SSE4 codepath that comparable Intel chips will run.
Over the years, MANY libraries (including several from Intel) have been compiled and shipped with this compiler, with the result that the applications compiled with those libraries, including many benchmarks, also suffer from the same performance sabotage.
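To make the difference concrete, here is a toy model of the two dispatch strategies in Python. It is not Intel's code. The function names, the path names and the set-of-strings representation of CPU features are all invented for illustration. The vendor-gated version only hands out the fast path when the vendor string is "GenuineIntel", while the feature-based version looks only at what the CPU says it supports.

```python
# Code paths ordered from fastest to slowest; names are illustrative only.
PATHS = ["sse4", "sse3", "sse2"]
BASELINE = "sse2"


def feature_based_dispatch(vendor, features):
    """Pick the fastest path the CPU reports support for (vendor ignored)."""
    for path in PATHS:
        if path in features:
            return path
    return "generic"


def vendor_gated_dispatch(vendor, features):
    """Same selection, but non-Intel CPUs are capped at the baseline path."""
    if vendor != "GenuineIntel":
        return BASELINE if BASELINE in features else "generic"
    return feature_based_dispatch(vendor, features)


amd = ("AuthenticAMD", {"sse2", "sse3", "sse4"})
intel = ("GenuineIntel", {"sse2", "sse3", "sse4"})
for vendor, feats in (amd, intel):
    print(vendor,
          vendor_gated_dispatch(vendor, feats),
          feature_based_dispatch(vendor, feats))
```

With identical feature sets, the vendor-gated dispatcher sends the AMD chip down the sse2 path and the Intel chip down the sse4 path, which is the behaviour complained about above. The feature-based one treats both the same.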
".ssakcaj ,pu kcuf eht tuhS" - by Anonymous Coward ANOTHER "ne'er-do-well" /. OFF-TOPIC TROLL on Friday October 28, @02:54PM (#37872146)
"???"
Uhm... Could we get a translation of that off-topic "troll-speak/trolllanguage" of yours, please?
* And, you're an off-topic troll - no questions asked...SEE MY SUBJECT LINE ABOVE, because the use of profanity on your part? Doesn't make for an intelligent reply!
APK
P.S.=> Yes, it must have just been another off-topic, done-nothing-of-significance-with-his-life troll spewing his off-topic b.s. again & not contributing to the ongoing conversations. Oh well - No biggie!
("ReVeRsE-PsYcHoLoGy", for trolls - Courtesy of this code by "yours truly" in less than 1 second flat):
---
#TrollTalkComReversePsychologyKiller.py (Ver #2 by APK)
def reverse(s):
    """Return s with its characters in reverse order."""
    trollstring = ""  # defined before the try so the return below always works
    try:
        for apksays in s:
            # prepend each character, which reverses the string
            trollstring = apksays + trollstring
    except TypeError:
        print("error/abend in reverse function")
    return trollstring

s = ""
print(reverse(s))
try:
    s = "Insert whatever 'trollspeak/trolllanguage' gibberish occurs here..."
    s = reverse(s)
    print(s)
except Exception as e:
    print(e)
---
... apk
>relying on bentmarks to "measure performance" is a fool's errand. dont go there.
And yet, that's what you're doing.
The correct phrase is: Relying on benchmarks that are not relevant to your application is a fool's errand.
One fun side note: notice how that link says "it will fail to recognize future Intel processors with a family number different from 6". Intel have conspicuously kept the family number reported by CPUID at 6 on their new processors in order not to trigger a fallback to the non-Intel pathway that AMD processors get to use, presumably because they know how much that'll harm them in benchmarks and how bad the reviews will look.
AMD lost the game a while back. Intel's getting ready to introduce a new chipset and architecture which will blow whatever AMD currently has out of the water.
For a while, AMD was beating the crud out of Intel, but well, yeah.
>The correct phrase is: Relying on benchmarks that are not relevant to your application is a fool's errand.
Yes, yes, yes.
The Bulldozer architecture is heavily optimized for highly threaded applications with a heavy reliance on integer operations. This is well represented by today's server workloads, not today's desktop applications. But more importantly for AMD's future, this also represents the trending path of tomorrow's applications. A great example of this is Battlefield 3, where the 8150 outperforms the i7-2600K. Unfortunately, this also means that whether Bulldozer or Sandy Bridge is faster today depends on the application. From the above test we can almost assuredly guess that BF3 does more integer work, while Civ 5 does more floating point work.
However, less obvious than the multithreading issue is the push away from using the CPU for floating point operations. This is one both Intel and AMD have been slowly gambling on for quite some time: putting floating point operations on the GPU. AMD has just taken a more "committed" approach to this. It's also something that may pay off big time.
As an aside, as a server administrator today I would buy Bulldozer over Sandy Bridge based processors in a heartbeat. Most of the "scale out" boxes such as web caches, database servers, etc., are highly multithreaded, integer-driven workloads. In this case Bulldozer is going to destroy Sandy Bridge, plain and simple. Also, to those citing supercomputers: those tend to be floating point driven, as they are generally for simulation.
Those 8150 vs. 2600K numbers are sans discrete graphics. It shows the 8150 integrating a stronger graphics core. It's still a pretty crummy graphics core. It's just better than the basic-equipment one that Intel put in the 2600k. The 2600K is much older than the 8150, though, and hardly Intel's top-end part. It's just the one that AMD chose to compare against. They do that. Bring out a half-assed chip, then pick an Intel part they can beat and beat it. While Intel's other parts stand on the other side of the cafeteria wondering why the fat kid is picking on the nerdy kid instead of someone their own size.
So, if you're in the small niche where you want to play Battlefield 3 on a computer with no discrete graphics, the 8150 is probably your choice. Or you could wait a couple of weeks for Intel's next part. Or upgrade to new-release discrete graphics, which will kick that integrated graphics solution's ass using either CPU.
So what I'm saying is, AMD's market is lamers, not gamers. And they seem to be making a little money at it. For now. Bulldozer has yet to dominate a quarterly report, so its issues, which comprise both performance and manufacturability, have yet to reveal themselves as a significant driver of the bottom line. Q3 was good to them, Q4 not likely to be. 12Q1 could be their doom if they don't find a miracle.
http://slashdot.org/comments.pl?sid=2498664&cid=37872868
That is not how hyperthreading works at all. Practically all x86 chips since the original Pentium have out-of-order cores. With hyperthreading, each core has a certain number of execution units and store buffers and so on, and it simply shares these resources between two hardware threads. Instructions from both threads get decoded and allocated to the right type of execution unit as soon as one is available.
It's not going to run as well as two completely dedicated cores, because the two hardware threads are competing for the available resources. Sometimes one of them will need an execution unit and it won't be available because the other one is using it, etc. That's why one hardware thread may see a speedup if you don't schedule anything on the other hardware thread for that core.
However, hyperthreading is still a damn better deal than only having ONE hardware thread in that core, and letting the execution units sit there unused a greater percentage of the time. For the same number of transistors, you can get more execution done in the same amount of time with hyperthreads.
>The Bulldozer architecture is heavily optimized for highly threaded applications with a heavy reliance on integer operations. This is well represented by today's server workloads, not today's desktop applications. But more importantly for AMD's future, this also represents the trending path of tomorrow's applications.
No it doesn't. AMD marketing would like you to believe every application is going to go massively parallel, but they're not the ones who actually have to write the software. It is not easy to thread all types of software (the low hanging fruit has already been picked), and it can be hard or impossible to get gains beyond two or three threads for many types of code.
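One standard way to put numbers on this is Amdahl's law, which the comment above doesn't name but which fits the argument: if a fraction p of the work can run in parallel, n threads give a speedup of at most 1 / ((1 - p) + p / n). A quick sketch, with the parallel fractions picked purely for illustration:

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)


# The fractions below are made-up examples, not measurements.
for p in (0.5, 0.9):
    row = [round(amdahl_speedup(p, n), 2) for n in (1, 2, 4, 8)]
    print(p, row)
```

With half the work stuck in serial code, the speedup can never reach 2x no matter how many cores you throw at it, so going from 4 to 8 threads buys very little. That is the "hard to get gains beyond two or three threads" situation, assuming this simple model applies to the code in question.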
>A great example of this is Battlefield 3, where the 8150 outperforms the i7-2600K.
Er, what? The 2600k outperformed the 8150 when both were clocked at stock speeds. The 8150 won in the overclocked test.
I shouldn't use the terms "outperformed" or "won", however, as this test was very sloppily done. Carefully read the description. Apparently the BF3 beta lacks facilities for repeatable benchmarking, so they just played on a live server with real players. This guarantees lots of noise and poor repeatability. And you can see that noise in the data: the BD average frame rate went up when overclocked, but the i5 and i7 averages went slightly down. That shouldn't ever happen in a CPU benchmark, not when you're raising the clock speed of the processor by over 1 GHz.
In fact, note that all the average FPS results fall in the range ~50.5 to 54 regardless of CPU type and clock, and the overclocked i7-2600K loses to every non-OC result, including itself! This test isn't CPU limited in any way. It's just a fancy way of generating random numbers with no correlation to CPU performance.
>Unfortunately, this also means that whether Bulldozer or Sandy Bridge is faster today depends on the application. From the above test we can almost assuredly guess that BF3 does more integer work, while Civ 5 does more floating point work.
You're assuming games load up all cores. Few games use more than 2 or 3 cores, even recent releases. This is for two related reasons. First, it's hard to scale game logic to huge numbers of cores. Second, game developers know that it's only recently that new PC sales ticked over to a dual-core on average, much less quad or better, and they want to spend most of their time working on things which will benefit all their customers rather than putting out a lot of effort to help only the 5% or less who own high end hardware.
(This is also why SLI/CrossFire has always been plagued by poor game support, and ATI & Nvidia have had to try various schemes to incentivize developers to spend time on multi-GPU.)
Benchmarkers who managed to actually create CPU-limited gaming tests almost universally found that Sandy Bridge stomped all over Bulldozer. SB's per-thread performance is much better, which is a better match to today's (and tomorrow's) software.
>As an aside, as a server administrator today I would buy Bulldozer over Sandy Bridge based processors in a heartbeat. Most of the "scale out" boxes such as web caches, database servers, etc., are highly multithreaded, integer-driven workloads. In this case Bulldozer is going to destroy Sandy Bridge, plain and simple.
I think you're getting a bit ahead of yourself. Client BD certainly hasn't destroyed client SB, even in highly multithreaded integer workloads (it's more like it wins some, loses others), so what reason is there to believe anything will be different in servers? Keep in mind that Intel has yet to even release its high end SB CPU and platform (6-core/12-thread desktop, 6C/12T server, and 8C/16T server); all existing SB CPUs are mainstream desktop/notebook (4C/8T max with integrated gr
If the case stated above is true, it is possible that it is simply not realising the AMD CPU has the required SSE support because of a weak check.
Scratch that, the guys who wrote the article already figured it out.
Benchmarks, benchmarketing, if you want some non-rigged results take a look at Geekbench. I see that an AMD system was top of the chart for over 9 months and after many thousands of submissions from Intel and IBM users: http://browse.geekbench.ca/geekbench2/top
Sounds like sticking with GCC or some other neutral compiler is the best option.
comment first, facts later. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
No.. Sticking with the compiler most likely to be used in the programs you care about is the best option. If that's GCC it's GCC, if it's ICC, it's ICC.
Phoronix has a set of benchmarks showing bulldozer is quite competitive when running under Linux.
"i dont give a flying fuck about how the win(d)blowz. ... called fglrx.
i want a working paper weight with my bulldozer
u can shove your chinese mass manufactured LED mouse up you HP printer-enabled
keyboard (knock on wood) spooled windows powered internet.
hear that AMD? you're a m$ subdivision just like adobe-flash-youtube"
Perhaps the Intel compiler situation might explain some of my experiences. I have a system (Mac Pro) with a quad-core Bloomfield Xeon @ 3.2GHz and also a system with a quad-core Phenom II also @ 3.2GHz. On most things I run, the Xeon is faster. However, the same is not true for the software I write. If I implement, say, the edit distance algorithm in C to compare two DNA molecules, and compile for x86-64 with gcc, the Phenom II is about 10% faster than the Xeon for a single thread. Interestingly, the Xeon is faster if compiled for 32bit arch. Then, a string processing program I have in Perl runs at about the same speed on both CPUs per thread. The Xeon does get an advantage for more than 4 threads due to HT, but of course I could switch to the cheap Phenom X6 if I needed such workloads...
Overall, for the custom software I use daily for work, the $5000 Xeon Mac Pro machine is not faster than the $500 Phenom II system...
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
>Interestingly, the Xeon is faster if compiled for 32bit arch.
Obviously compared to the Phenom also running a 32bit binary.
>No it doesn't. AMD marketing would like you to believe every application is going to go massively parallel, but they're not the ones who actually have to write the software. It is not easy to thread all types of software (the low hanging fruit has already been picked), and it can be hard or impossible to get gains beyond two or three threads for many types of code.
Yeah... No. Data dependencies are actually fairly low in almost every real world application. Highly threaded applications are rare outside of the enterprise (read Google/Amazon/transaction processing etc.) world because they are modestly hard to write and there is a significant lack of developers experienced with making them. However, as more performance is required everyone is moving that way. Sure, simple word editing may not require that many threads, but anything that needs performance will move in that direction. It is not hype. Also, Battlefield 3 loads all cores of an i7-2600K to 50% on ultra.
>In fact, note that all the average FPS results fall in the range ~50.5 to 54 regardless of CPU type and clock, and the overclocked i7-2600K loses to every non-OC result, including itself! This test isn't CPU limited in any way ...
>You're assuming games load up all cores. Few games use more than 2 or 3 cores, even recent releases. This is for two related reasons. First, it's hard to scale game logic to huge numbers of cores. Second, game developers know that it's only recently that new PC sales ticked over to a dual-core on average, much less quad or better, and they want to spend most of their time working on things which will benefit all their customers rather than putting out a lot of effort to help only the 5% or less who own high end hardware.
Games are not the only thing in the world. People who actually do computing as a profession, for example those of us who have to compile kernels on a regular basis, care about stuff like highly multithreaded, integer-based workloads. I also game, so as long as I can get comparable performance in games it doesn't really matter to me. This will be the case for many people.
>I think you're getting a bit ahead of yourself. Client BD certainly hasn't destroyed client SB, even in highly multithreaded integer workloads (it's more like it wins some, loses others), so what reason is there to believe anything will be different in servers? Keep in mind that Intel has yet to even release its high end SB CPU and platform (6-core/12-thread desktop, 6C/12T server, and 8C/16T server); all existing SB CPUs are mainstream desktop/notebook (4C/8T max with integrated graphics). And the 8150 is the exact same chip AMD is going to be selling as a server CPU. (Unlike Intel since Nehalem, AMD tries to use the same chip design for server and client, because AMD doesn't have the volumes to justify taping out a different chip for each market.) Worse yet for AMD, every review I've seen where they measured power showed BD using a lot more juice, at least 50W more in every case (just spot checked one review and it was ~75W more load power than i7-2600K). That's not going to go over so well in the server space.
The 8150 isn't the chip that AMD is offering for servers. AMD offers different chips for server/client, unless you seriously think the 12-core Magny-Cours is for clients? They have an 8-core/16-thread offering from the Bulldozer architecture for servers too. The Intel SB server processors have been available to my company (admittedly a very large one) for quite some time as well. I am assuming you literally don't know what you're talking about at this point on the server front.