Windows 7 On Multicore — How Much Faster?
snydeq writes "InfoWorld's Andrew Binstock tests whether Windows 7's threading advances fulfill the promise of improved performance and energy reduction. He runs Windows XP Professional, Vista Ultimate, and Windows 7 Ultimate against Viewperf and Cinebench benchmarks using a Dell Precision T3500 workstation, the price-performance winner of an earlier roundup of Nehalem-based workstations. 'What might be surprising is that Windows 7's multithreading changes did not deliver more of a performance punch,' Binstock writes of the benchmarks, adding that the principal changes to Windows 7 multithreading consist of increased processor affinity and 'a wholly new mechanism that gets rid of the global locking concept and pushes the management of lock access down to the locked resources,' permitting Windows 7 to scale up to 256 processors without performance penalty but delivering little performance gain for systems with only a few processors. 'Windows 7 performs several tricks to keep threads running on the same execution pipelines so that the underlying Nehalem processor can turn off transistors on lesser-used or inactive pipelines,' Binstock writes. 'The primary benefit of this feature is reduced energy consumption,' with Windows 7 requiring 17 percent less power to run than Windows XP or Vista."
Suck it, nerds.
Nooo! I was hoping that power consumption would continue to increase! Sooner or later our PCs would require 1.21GW!
It pays to be obvious, especially if you have a reputation for being subtle.
What is surprising is that power consumption could be so significantly reduced. This story could have come out with an entirely different spin if the headline were simply, "Windows 7 Reduces Power Consumption by 17%."
I disagree - user-mode code, whether it's separated into threads or processes, still relies very heavily on kernel scheduling decisions. It may sound simple enough, but if you study the decisions the kernel has to make (such as which thread to wake first, from a set of 8 all waiting on the same semaphore), you can find lots of ways to get it wrong. We now take it for granted because thousands of man-years have been spent on solutions.
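To make the "which thread to wake first" decision concrete, here is a toy sketch of three possible wake policies for threads blocked on one semaphore. Everything in it (the Waiter fields, the policy names, the sample values) is invented for illustration; it is not how the Windows scheduler or any real kernel picks a waiter.

```python
from dataclasses import dataclass

@dataclass
class Waiter:
    tid: int          # hypothetical thread id
    priority: int     # higher means more urgent
    last_cpu: int     # CPU the thread last ran on (cache may still be warm there)
    enqueued_at: int  # logical time the thread started waiting

def pick_fifo(waiters):
    """Wake whoever has waited longest: fair, but ignores priority and caches."""
    return min(waiters, key=lambda w: w.enqueued_at)

def pick_priority(waiters):
    """Wake the highest-priority waiter; break ties by longest wait."""
    return max(waiters, key=lambda w: (w.priority, -w.enqueued_at))

def pick_affinity(waiters, free_cpu):
    """Prefer a waiter that last ran on the CPU that just became free."""
    warm = [w for w in waiters if w.last_cpu == free_cpu]
    return pick_fifo(warm or waiters)

waiters = [
    Waiter(tid=1, priority=5, last_cpu=0, enqueued_at=10),
    Waiter(tid=2, priority=9, last_cpu=1, enqueued_at=20),
    Waiter(tid=3, priority=5, last_cpu=2, enqueued_at=5),
]
print(pick_fifo(waiters).tid, pick_priority(waiters).tid, pick_affinity(waiters, 0).tid)
```

Each policy picks a different thread from the same wait queue, and each is "wrong" for some workload, which is the point: there is no free choice here.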
Sam ty sig.
So you've got a Linux fan and not a Windows fan. Not surprising on this site.
While actual performance may not be faster, perceived performance almost certainly is. It "feels" snappier and seems to respond better, due to some optimizations in locking and in the graphics subsystem that allow visual feedback in one app to not be blocked or held up by work going on in another app.
- Spryguy
There are three kinds of people in this world: those that can count and those that can't
Windows 7 (like all modern versions of Windows) does nothing with the BIOS at all - the BIOS ceases running as soon as Windows starts booting. You don't even need to *have* a BIOS to run Win7. And, if a power cycle fixes the issue, it clearly is not a BIOS problem.
If the device drivers for your motherboard have a bug - which sounds more like the cause of your issue - then that isn't a Microsoft problem at all, since they didn't write the drivers. Contact Abit for support.
Lots of things affect performance. One of the big things is cache usage. An L1 cache miss costs around 10 cycles these days. An L2 cache miss costs 200 or more. If you move a process (or thread) between cores on the same die with a shared L2 cache, then every load or store instruction for a little while will cause an L1 cache miss. If you move it between processors with no shared cache, then every access will cause an L2 cache miss. If a thread lands on a different processor every time it is scheduled then, given that a typical scheduling quantum is only 10ms, it will spend most of its time loading data from main memory into cache. This will show up as 100% CPU usage, but the thread will only be getting something like 10% of the maximum theoretical throughput for that CPU. Improve processor affinity, and you can easily see a large speedup relative to this.
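A back-of-envelope way to see where a figure like 10% can come from, as a rough sketch only. The 200-cycle miss penalty is the number from the comment above; the fraction of instructions that touch memory (0.3), the cold-cache miss rate (0.15) and the one-cycle-per-instruction baseline are assumptions picked for illustration, not measurements.

```python
def relative_throughput(mem_fraction, miss_rate, miss_penalty_cycles, base_cpi=1.0):
    """Fraction of peak throughput left once cache-miss stalls are counted.

    mem_fraction        -- share of instructions that are loads/stores (assumed)
    miss_rate           -- share of those accesses that miss the cache (assumed)
    miss_penalty_cycles -- stall cycles per miss (about 200 for an L2 miss, per the comment)
    base_cpi            -- cycles per instruction with a perfect cache (assumed)
    """
    stall_cycles = mem_fraction * miss_rate * miss_penalty_cycles
    return base_cpi / (base_cpi + stall_cycles)

# Thread stays on one core: cache is warm, essentially no extra misses.
warm = relative_throughput(0.3, 0.0, 200)
# Thread keeps migrating: a sizeable share of accesses go to main memory.
cold = relative_throughput(0.3, 0.15, 200)

print(f"warm cache: {warm:.0%} of peak")
print(f"cold cache: {cold:.0%} of peak")
```

With those assumed inputs the migrating thread averages ten cycles per instruction instead of one, so it gets roughly a tenth of peak while the CPU still reports itself as fully busy.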
I am TheRaven on Soylent News
Agreed. A 17% reduction in power consumption doing the same tasks is nothing to scoff at...
Suck it, transitive property.
not surprising because the OS really can't do that much to improve (or mess up) the performance of user-mode code that isn't making many OS calls anyways.
Others have already mentioned scheduling and cache thrashing, I'd like to add memory management. There are lots of ways memory management choices can degrade performance, sometimes drastically.
One example is page sizes and the TLB - each CPU has a hardware TLB, which is like a cache of virtual-page-to-physical-page address maps. Hardware TLB look-ups are fast, but the TLB is only of limited size, and when a virtual address is not in the hardware TLB, the OS has to take a fault and walk its own software-maintained TLB that holds the complete list of virt2phys translations. That's a couple of orders of magnitude slower than getting it from the hardware TLB.
One way to reduce TLB misses is to use larger pages. So an OS that is smart enough to automagically coalesce 4K pages into 4MB (or larger, depending on the hardware) pages can significantly improve TLB performance. In a pathological case, that could result in a 100x-1000x speed-up; in typical cases where it is going to make a difference you'll probably see a ~10% performance improvement.
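The reason larger pages help comes down to simple arithmetic on "TLB reach", the amount of memory the hardware TLB can map at once. A small sketch, assuming a 64-entry TLB and a 100MB working set (both numbers are made up for illustration; real TLB sizes vary by CPU):

```python
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries, page_size):
    """Memory covered by a TLB holding `entries` translations of `page_size` bytes."""
    return entries * page_size

def pages_needed(working_set, page_size):
    """Number of pages (and so translations) needed to map a working set."""
    return -(-working_set // page_size)  # ceiling division

TLB_ENTRIES = 64          # assumed TLB size
WORKING_SET = 100 * MIB   # assumed working set

for page_size in (4 * KIB, 4 * MIB):
    reach = tlb_reach_bytes(TLB_ENTRIES, page_size)
    pages = pages_needed(WORKING_SET, page_size)
    print(f"page size {page_size // KIB} KiB: reach {reach // KIB} KiB, "
          f"{pages} translations for the working set")
```

With 4K pages the assumed TLB covers only 256 KiB of a working set that needs 25,600 translations; with 4MB pages the same 64 entries cover 256 MiB and the whole working set fits in 25 translations.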
Another related example is how shared memory is handled. Every page of virtual memory has a PTE which, at the most basic level, contains the virt2phys translation. When shared memory is used, a decision must be made: are the PTEs shared, or does each process get a separate copy of the PTEs for the shared memory? The downside of sharing PTEs is that the shared memory must be mapped at exactly the same virtual address in each process that uses it, so if one of those processes already has something else at that address, it won't be able to use the shared memory. The downside of using separate copies of PTEs is that you can really suck up a lot of memory for just the PTE list: imagine 50 processes that all share one chunk of 100MB of memory. If they all get their own PTE copies for that 100MB, it's the equivalent of mapping 5GB of memory. If a PTE itself takes up 32 bytes then, with 4K pages, that's at least 40MB of PTEs just to manage that 100MB of memory. A 40% overhead is huge, and then there is the issue of hardware TLB misses which, depending on the implementation, may have to search all PTEs in the system, so the more PTEs, the worse a TLB miss will hurt performance.
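For anyone who wants to check the 40MB figure, the arithmetic can be sketched as below. The 32-byte PTE, 50 processes and 100MB region come from the comment; the 4K page size is an assumption (it matches the page size mentioned earlier in the thread), and the function name is made up.

```python
KIB = 1024
MIB = 1024 * KIB

def pte_bytes(shared_bytes, page_size, pte_size, processes, share_ptes):
    """Memory spent on PTEs for one shared region.

    share_ptes=True  -> one set of PTEs used by every process
    share_ptes=False -> each process keeps its own private copy
    """
    pages = -(-shared_bytes // page_size)   # ceiling division
    copies = 1 if share_ptes else processes
    return pages * pte_size * copies

region = 100 * MIB
private = pte_bytes(region, 4 * KIB, 32, 50, share_ptes=False)
shared = pte_bytes(region, 4 * KIB, 32, 50, share_ptes=True)

print(f"private PTE copies: {private} bytes ({private / region:.0%} of the region)")
print(f"shared PTEs:        {shared} bytes ({shared / region:.1%} of the region)")
```

Private copies come to 25,600 pages x 32 bytes x 50 processes, about 40MB, while a single shared set is under 1MB. Larger pages shrink both numbers by the same factor.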
When information is power, privacy is freedom.