Slashdot Mirror


Smarter Thread Scheduling Improves AMD Bulldozer Performance

crookedvulture writes "The initial reviews of the first Bulldozer-based FX processors have revealed the chips to be notably slower than their Intel counterparts. Part of the reason is the module-based nature of AMD's new architecture, which requires more intelligent thread scheduling to extract optimum performance. This article takes a closer look at how tweaking Windows 7's thread scheduling can improve Bulldozer's performance by 10-20%. As with Intel's Hyper-Threading tech, Bulldozer performs better when resource sharing is kept to a minimum and workloads are spread across multiple modules rather than the multiple cores within them."

29 of 196 comments

  1. Re:Weird by Sloppy · · Score: 2

    I thought part of the Bulldozer hype was that it had two 'real' cores and not hyperthreading,

    No, the hype is that it blurs the distinction between cores and hyperthreading. It's both and neither.

    --
    As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
  2. But does it actually make a difference? by robot256 · · Score: 2

    Sure, the scheduling change improves performance by 10-20% for certain tasks, but that still makes it 30-50% slower than an i7, and with more power consumption.

    I can't fault AMD for not having full third-party support for their custom features, since Intel had a head-start with hyperthreading, but if it will still be an inferior product even after support is added then I'm not going to buy it.

    1. Re:But does it actually make a difference? by HarrySquatter · · Score: 2

      An i7-2600K is only 15% more expensive, has a 25% lower TDP, and blows away the FX-8150 in most of the benchmarks. Even with this tweak it'll still barely compete, and the 2600K has half as many real cores and a lower clock speed.

  3. Re:Weird by laffer1 · · Score: 2

    It's not like hyper threading. For integer operations, the AMD chips are much better. What AMD doesn't have is two floating point units so that's what gets bogged down. There are two instruction decoders and two units to handle integer math, but one floating point unit per component.

    AMD's approach is faster for some workloads. The problem is that they didn't design it around how most people currently write software.

    I would have preferred AMD to implement hyper threading as it would have greatly simplified things for OS developers. It's getting to a point where kernels have to know about CPU families in order to get the performance they need. They also have to know the workload.

    For instance, if I'm trying to save power in a laptop, it's best with the new AMD chips to give all the instructions to the first two logical CPUs, which sit in the same module. Then the others can go into an enhanced sleep state. However, this is slower than distributing to different physical cores. I'm even having trouble with terminology with these chips.

    With Intel chips, it's best to keep the same processes on nearby cores to take advantage of cache (for those that are really 2 CPUs on the same package) but to avoid scheduling them on two threads on the same core. Again the power issue comes into play with Intel chips, as other cores could go into C1E state or similar.

    AMD did add special instructions to the bulldozer chips that speed up floating point, but compilers and applications have to take advantage of them. Microsoft's Visual Studio does not yet.
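
The trade-off laffer1 describes (pack threads into one module to save power, or spread them across modules for speed) can be sketched as two placement policies. This is only an illustration, not how any real scheduler is written: the function name `place_threads` is made up, and it assumes logical CPUs are numbered so that CPUs 0 and 1 share module 0, CPUs 2 and 3 share module 1, and so on.

```python
def place_threads(n_threads, modules=4, cores_per_module=2, policy="spread"):
    """Return the logical CPU ids chosen for n_threads runnable threads.

    Assumption: CPUs 0 and 1 share module 0, CPUs 2 and 3 share module 1, etc.
    """
    total = modules * cores_per_module
    if not 0 <= n_threads <= total:
        raise ValueError("n_threads must be between 0 and %d" % total)
    if policy == "pack":
        # Fill module 0 completely, then module 1, so idle modules can sleep.
        order = list(range(total))
    elif policy == "spread":
        # Use the first core of every module before touching any second core.
        order = [m * cores_per_module + c
                 for c in range(cores_per_module)
                 for m in range(modules)]
    else:
        raise ValueError("unknown policy: %r" % policy)
    return order[:n_threads]


print(place_threads(4, policy="spread"))  # one thread per module
print(place_threads(4, policy="pack"))    # two modules fully loaded, two idle
```

With four threads on a four-module chip, "spread" leaves no module shared, while "pack" leaves two modules completely idle, which is the power-versus-speed choice in the comment above.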

  4. Re:no one got fired buying intel by Antisyzygy · · Score: 3, Informative

    AMD servers are way cheaper, and there are no performance issues most admins can't handle. What do you mean by performance? If you mean slower, then yes, but if you mean reliability then they are about the same. Why else do universities almost exclusively use AMD processors in their clusters for cutting edge research? I can see your point if you are only buying 1-3 servers, but you start saving shitloads of money when it's a server farm.

    --
    That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
  5. Re:So basically... by h4rr4r · · Score: 2

    So then SSDs suck because you have to tweak the IO scheduler(elevator)?

  6. Re:no one got fired buying intel by QuantumRiff · · Score: 3, Informative

    A Dell R815 with 2 twelve-core AMD processors (although they were not Bulldozer ones), 256GB of RAM, and a pair of hard drives was $8k cheaper than a similarly configured Dell R810 with 2 10-core Intel processors when we ordered a few weeks ago. That difference in price is enough to buy a nice Fusion-IO drive, which will make much, much more of a performance impact than a small percentage higher CPU speed.

    --

    What are we going to do tonight Brain?
  7. Re:no one got fired buying intel by KhazadDum · · Score: 2

    Agreed. To further expound upon parent's point: if you really know your performance needs and requirements, and the initial extra cost of Intel chips is lower than the revenue gained from that extra couple percent of performance, then go Intel. Otherwise, it's usually a cost versus preference piss fest. And last I checked, in a down economy, cost is king.

  8. Re:Weird by 0123456 · · Score: 2

    It's not like hyper threading. For integer operations, the AMD chips are much better. What AMD doesn't have is two floating point units so that's what gets bogged down. There are two instruction decoders and two units to handle integer math, but one floating point unit per component.

    Ah, so this benchmark is floating point and that's why it's faster across multiple cores?

    I can't really see AMD convincing Microsoft to invest a lot of effort into dynamically tracking which threads use floating point and which don't and reassigning them appropriately. Maybe a flag on the thread to say whether it's using floating point or not at creation time would be viable, but then app developers won't bother to set it.

  9. It's a Windows limitation by Animats · · Score: 3, Informative

    This is really more of an OS-level problem. CPU scheduling on multiprocessors needs some awareness of the costs of an interprocessor context switch. In general, it's faster to restart a thread on the same processor it previously ran on, because the caches will have the data that thread needs. If the thread has lost control for a while, though, it doesn't matter. This is a standard topic in operating system courses. An informal discussion of how Windows 7 does it is useful.

    Windows 7 generally prefers to run a thread on the same CPU it previously ran on. But if you have a lot of threads that are frequently blocking, you may get excessive inter-CPU switching.

    On top of this, the Bulldozer CPU adjusts the CPU clock rate to control power consumption and heat dissipation. If some cores can be stopped, the others can go slightly faster. This improves performance for sequential programs, but complicates scheduling.

    Manually setting processor affinity is a workaround, not a fix.
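
For anyone who wants to try the affinity workaround anyway, the mask that restricts a process to one core per module can be computed as below. This is a sketch under the assumption that the two cores of a module are adjacent logical CPUs (0/1, 2/3, ...); the helper names are made up for illustration.

```python
def one_core_per_module_mask(modules=4, cores_per_module=2):
    """Build a bitmask with only the first logical CPU of each module set.

    Assumption: the cores of a module are numbered consecutively.
    """
    mask = 0
    for m in range(modules):
        mask |= 1 << (m * cores_per_module)
    return mask


def cpus_in_mask(mask):
    """List the logical CPU ids whose bit is set in an affinity mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


mask = one_core_per_module_mask()
print(hex(mask), cpus_in_mask(mask))
```

For a four-module chip this yields 0x55, i.e. CPUs 0, 2, 4 and 6. As far as I know, a hex mask of this form is what Windows' `start /affinity` option takes, and on Linux the same CPU set could be passed to Python's `os.sched_setaffinity`.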

  10. Re:So basically... by fuzzyfuzzyfungus · · Score: 3, Interesting

    So basically they suck. I shouldn't need to tweak my os thread scheduler just so a cpu can suck less. AMD needs to fix their shit instead of lame excuses.

    I've got some very bad news for you: While I have no particular knowledge of, or interest in, today's architecture pissing match, the days when the OS was allowed to ignore architectural details and expect things to just work optimally are good and over(if they ever existed in the first place).

    Dynamic processor clocks? Why should I have to deal with some performance governor shit when Intel can just make a CPU that either uses almost no power at 3GHz or runs like a bat out of hell at 800MHz? Oh, because they actually can't. Sorry. Multiple cores? WTF? Why do they expect me to program in parallel for 2 3GHz cores instead of just giving me a 6GHz core? Oh, because they actually can't. Sorry. NUMA? Memory access times already blow! Now you want to make them unpredictable? Well, we can either repeal the speed of light and restrict every system to a single memory controller or deal with nonuniform access times and cry into our 128GB of RAM... The list just goes on. Hyperthreading can provide anything from less than zero improvement, if it increases contention for resources that were already being fully used, to fairly substantial improvement, if the CPU was being starved at times under a single thread. Now the Bulldozer cores have implemented something between full multi-core(with 100% duplication of resources per core) and hyperthreading(with virtually zero additional resources for the HT 'core'). Shockingly, performance depends on whether the two semi-independent cores are stepping on one another's shared toes or not...

    Even if, in this specific instance, AMD happens to have fucked up and made the wrong architectural choice, that doesn't change the fact that you can't escape architectural oddities unless you are willing to stay quite far from the forefront of performance, or deal with some sort of hardware/firmware abstraction layer that ends up being at least as complex as the OS-level hackery would have been, but more likely to be vendor specific and have its cost spread across far fewer units. It certainly isn't the case that all architectural deviations are good, some are ghastly hacks best forgotten, some are perfectly OK ideas dragged down by products that overall aren't much good; but the path of progress has been liberally sprinkled with oddities that have to be accounted for somewhere in the overall stack.

  11. Re:So basically... by fuzzyfuzzyfungus · · Score: 2

    So then SSDs suck because you have to tweak the IO scheduler(elevator)?

    How can you even Dream of trusting any drive that isn't good enough for solid, proven, CHS addressing?

  12. It was already beating all intel in highly threade by unity100 · · Score: 5, Interesting
    applications, like Photoshop CS5 or TrueCrypt, including some more:

    http://www.overclock.net/amd-cpus/1141562-practical-bulldozer-apps.html

    also, if you set your cpuid to genuineintel in some of the benchmark programs, you will get surprising results :

    try changing cpuid=genuineintel for +47% INCREASE IN SCORES.

    changing cpuid to GenuineIntel nets 47.4% increase in performance:
    http://www.osnews.com/story/22683/Intel_Forced_to_Remove_quot_Cripple_AMD_quot_Function_from_Compiler_

    PCMark/Futuremark rigged bentmark to favor intel:
    http://www.amdzone.com/phpbb3/viewtopic.php?f=52&t=135382#p139712 http://arstechnica.com/hardware/reviews/2008/07/atom-nano-review.ars/6

    intel cheating at 3DMark vantage via driver: http://techreport.com/articles.x/17732/2

    relying on bentmarks to "measure performance" is a fool's errand. dont go there.

  13. Re:So basically... by Anonymous Coward · · Score: 3, Interesting

    You did when it was initially launched. Windows 2000's scheduler does not cope well with hyperthreading /at all/ by default. You saw similar things when dual core CPUs were launched. Now hyper threading and multicore are standard and OSs are aware of these cases.

    It's already been pointed out that windows 8's scheduler is bulldozer aware and performs much better than windows 7. I would not be surprised to see a patch from Microsoft that specifically addresses scheduler performance improvements for bulldozer CPUs. We've seen similar things in the past.

    By the way, I'm seeing this unusual phrase "Esoteric Tweaking" showing up a lot out of nowhere. It smells of astroturf. Could Intel be afraid?

    Could it be that the Bulldozer architecture, with its uneven FPU-to-integer core ratio, is the key to significant future scaling above and beyond what 1:1 can offer?

  14. Re:So basically... by Runaway1956 · · Score: 3, Funny

    "User". That summarizes half of the nonsense being posted here. This is a techie forum, isn't it? Techies tweak when no tweaking is needed. If you're a "user", then you're not even authorized to be in a server room. GTFO and STAY OUT!

    (listens for door slamming as the dweeb runs out)

    I just hate it when children blurt out their juvenile bullshit, interrupting the adults. Happens all the time . . .

    --
    "Windows is like the faint smell of piss in a subway: it's there, and there's nothing you can do about it." - Charlie Br
  15. Re:So basically... by DeadCatX2 · · Score: 2

    Uh...what? Users don't have to do anything to the scheduler. That's the responsibility of the operating system. A Service Pack will be released and you won't have to do shit, so your argument is moot.

    Besides, if your argument is "We shouldn't have to optimize schedulers", then you're a little late, because schedulers are most definitely optimized for their associated hardware.

    --
    :(){ :|:& };:
  16. Re:So basically... by washu_k · · Score: 3, Informative

    No, it's because AMD is lying to the OS. The "8 core" BD is not really 8-core. It only has 4 cores with some duplicated integer resources. Basically a better version of hyper-threading, but not a proper 8 core design.

    The problem is that the BD says to Windows "I have 8 cores" and thus Windows schedules assuming that is true. If BD said "I have 4 cores with 8 threads" then Windows would schedule it just like it does with Intel CPUs and performance would improve just like in the FA.

    There shouldn't need to be any OS level tweaks because Windows already knows how to schedule for hyper-threading optimally. If BD reported its true core count properly then no OS level changes would be needed.

  17. Re:no one got fired buying intel by Kjella · · Score: 4, Interesting

    Well, it doesn't seem to apply when you get up to supercomputing levels at least. I checked the TOP500 list and it's 76% Intel, 13% AMD. As for Bulldozer, it has serious performance/watt issues even though the performance/price ratio isn't all that bad for a server. On the desktop, Intel hasn't even bothered to make a response except to quietly add a 2700K to their pricing table, with the 2600K left untouched. On the business side (where after all margins fund future R&D), Sandy Bridge's 216mm² is much smaller than Bulldozer's 315mm². Intel can produce almost 50% more chips in the same die area; in practice the yields probably favor Intel even more, because the risk of critical defects goes up with size. Honestly, I don't think Intel has felt less challenged since the AMD K5 days...

    --
    Live today, because you never know what tomorrow brings
  18. Re:Weird by DamonHD · · Score: 2

    A T1 is still working well for me: at most about 1 thread on my entire Web server system is doing any FP at all, and in places I switched to some light-weight integer fixed-point calcs instead. That now serves me well with the same code running on a soft-float (ie no FP h/w) ARMv5.

    So, for applications where integer performance and threading is far more important than FP, maybe AMD (and Sun) made the right decision...

    Rgds

    Damon

    --
    http://m.earth.org.uk/
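
For readers unfamiliar with the integer fixed-point trick DamonHD mentions, here is a minimal sketch. It is not his code: the 16.16 format and the helper names are assumptions chosen for illustration. Values are stored as integers scaled by 2^16, so multiplication and division need only integer operations and a shift.

```python
FRAC_BITS = 16          # assumed format: 16 integer bits, 16 fractional bits
ONE = 1 << FRAC_BITS    # the fixed-point representation of 1.0


def to_fixed(x):
    """Convert a number to fixed point (done once, e.g. when loading constants)."""
    return int(round(x * ONE))


def fixed_mul(a, b):
    """Multiply two fixed-point values using integer ops only."""
    return (a * b) >> FRAC_BITS


def fixed_div(a, b):
    """Divide two fixed-point values using integer ops only."""
    return (a << FRAC_BITS) // b


def to_float(a):
    """Convert back to a float, for display only."""
    return a / ONE


product = fixed_mul(to_fixed(1.5), to_fixed(2.25))
print(product, to_float(product))
```

Python integers never overflow, so the intermediate product is safe here; in C the same multiply would need a 64-bit intermediate before the shift.
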
  19. Re:So basically... by turgid · · Score: 3, Insightful

    Unfortunately, the Wintel world has thrived on this philosophy for 20 years.

  20. Re:So basically... by 0123456 · · Score: 2

    After 6 months to 1 year, the process will be significantly more mature and the Bulldozer chips will be serious contenders to Intel offerings.

    AMD just have to survive six months to a year of selling poorly-performing CPUs that have twice as many transistors as the competition.

  21. Re:So basically... by Kjella · · Score: 3, Interesting

    There shouldn't need to be any OS level tweaks because Windows already knows how to schedule for hyper-threading optimally. If BD reported its true core count properly then no OS level changes would be needed.

    Except that hyperthreading quite obviously has one fast thread and one slow thread filling the gaps. In AMD's solution both cores in a module are equal, but they share some resources. To use a car analogy, the Intel solution is a one-lane road with pullouts, where the hyperthread sneaks from one pullout to the other while there's no traffic, while the AMD solution is a two-lane road with one-lane chokepoints. Both sorta allow cars to travel simultaneously, but I don't think the optimization would be the same.

    --
    Live today, because you never know what tomorrow brings
  22. Re:It was already beating all intel in highly thre by yuhong · · Score: 2

    It is time for some reverse engineering of the benchmark programs I think to see what exactly is happening.

  23. No need, everyone knows... by Anonymous Coward · · Score: 2, Informative

    Here's Agner Fog's page about this issue.

    The Intel compiler (for many years and many versions) has generated multiple code paths for different instruction sets. Using the lame excuse that they don't trust other vendors to implement the instruction set correctly, the generated executables detect the "GenuineIntel" CPU vendor string and deliberately cripple your program's performance by not running the fastest codepaths unless your CPU was made by Intel. So e.g. if you have an SSE4-capable AMD CPU, it will run the SSE2 codepath instead of the SSE4 codepath that comparable Intel chips will run.

    Over the years, MANY libraries (including several from Intel) have been compiled and shipped with this compiler, with the result that the applications compiled with those libraries including many benchmarks, also suffer from the same performance sabotage.
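
The difference between vendor-gated dispatch and dispatch on the CPU's reported features can be sketched like this. It is an illustration of the behaviour described above, not Intel's actual dispatcher: `pick_codepath`, the path names, and the `vendor_gated` switch are all made up, and the SSE2/SSE4 split simply mirrors the example in the comment.

```python
def pick_codepath(vendor, features, vendor_gated=True):
    """Choose a code path for a CPU.

    vendor:   CPUID vendor string, e.g. "GenuineIntel" or "AuthenticAMD"
    features: set of supported instruction sets, e.g. {"sse2", "sse4"}
    """
    fastest_first = ["sse4", "sse2", "generic"]
    if vendor_gated and vendor != "GenuineIntel":
        # Vendor-gated dispatch: other vendors are held to the slower paths
        # no matter what the CPU says it supports.
        allowed = ["sse2", "generic"]
    else:
        # Feature-based dispatch: trust the reported feature flags.
        allowed = fastest_first
    for path in allowed:
        if path == "generic" or path in features:
            return path
    return "generic"


amd = {"sse2", "sse4"}
print(pick_codepath("AuthenticAMD", amd))                      # gated
print(pick_codepath("AuthenticAMD", amd, vendor_gated=False))  # by features
```

Under the gated policy an SSE4-capable AMD chip lands on the SSE2 path, while the same feature set reported under "GenuineIntel" gets the SSE4 path.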

  24. Re:So... by makomk · · Score: 2

    In theory it actually has the equivalent of a 128-bit wide FPU for every integer unit. Though I hear rumours that they may have not put as much effort into making the classic x87 FPU instructions run fast and that harmed them in some of the non-SSE-supporting benchmarks that a lot of the reviews used.

  25. Re:no one got fired buying intel by yuhong · · Score: 2

    And I think the "slower" part will be solved with Interlagos, and Intel will have only Westmere-EX (Xeon E7) to compete since Sandy Bridge-EP is not even released yet. Now compare the already-released pricing of Opteron 6200 CPUs with Intel's current Xeon 7500/E7 pricing, and guess what will happen.

  26. Re:So basically... by makomk · · Score: 2

    Except that's not quite right either, because classic hyperthreading only gets about 10-20% improvements at most from using two threads rather than one, whereas Bulldozer appears to be closer to 80-90% even for stuff that makes heavy use of the shared resources.
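
To see what those percentages mean for scheduling, here is a toy model, not a measurement. It assumes a module (or hyper-threaded core) running one thread delivers 1.0 unit of throughput and that a second thread on the same unit adds the fractional gain quoted above; the function names are invented for the example.

```python
def module_throughput(threads_on_module, second_thread_gain):
    """Throughput of one module under the toy model (1 thread = 1.0 unit)."""
    if threads_on_module == 0:
        return 0.0
    if threads_on_module == 1:
        return 1.0
    if threads_on_module == 2:
        return 1.0 + second_thread_gain
    raise ValueError("a module runs at most two threads")


def total_throughput(placement, second_thread_gain):
    """placement lists how many threads run on each module."""
    return sum(module_throughput(n, second_thread_gain) for n in placement)


spread = [1, 1, 1, 1]   # four threads, one per module
packed = [2, 2, 0, 0]   # four threads squeezed onto two modules
for gain in (0.2, 0.8):
    print(gain, total_throughput(spread, gain), total_throughput(packed, gain))
```

With an 80% second-thread gain the model gives 4.0 units spread versus 3.6 packed; with a hyper-threading-like 20% gain the packed figure drops to 2.4. Real workloads will not be this tidy, but it shows why placement matters less on Bulldozer than on hyper-threading and still matters.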

  27. Re:So basically... by washu_k · · Score: 2

    No, that is not correct. Hyper-threading gives each thread the same amount of resources, assuming they can use them equally. The only difference between hyper-threading and a BD module is that the BD module has a dedicated integer execution unit and L1 D cache for each thread. Everything else is shared just like in Intel cores. It is simply a better hyper-threading, not real cores.

  28. Re:So basically... by beelsebob · · Score: 2

    This is actually exactly what you wouldn't want in a design: when you're designing a threading model, whether at the application level, the OS level or the CPU level, you absolutely do not want thread starvation. Designing it in is just dumb, which is why Intel didn't.