Slashdot Mirror


Windows 7 On Multicore — How Much Faster?

snydeq writes "InfoWorld's Andrew Binstock tests whether Windows 7's threading advances fulfill the promise of improved performance and energy reduction. He runs Windows XP Professional, Vista Ultimate, and Windows 7 Ultimate against Viewperf and Cinebench benchmarks using a Dell Precision T3500 workstation, the price-performance winner of an earlier roundup of Nehalem-based workstations. 'What might be surprising is that Windows 7's multithreading changes did not deliver more of a performance punch,' Binstock writes of the benchmarks, adding that the principal changes to Windows 7 multithreading consist of increased processor affinity and 'a wholly new mechanism that gets rid of the global locking concept and pushes the management of lock access down to the locked resources,' permitting Windows 7 to scale up to 256 processors without performance penalty, but delivering little performance gain for systems with only a few processors. 'Windows 7 performs several tricks to keep threads running on the same execution pipelines so that the underlying Nehalem processor can turn off transistors on lesser-used or inactive pipelines,' Binstock writes. 'The primary benefit of this feature is reduced energy consumption,' with Windows 7 requiring 17 percent less power to run than Windows XP or Vista."

2 of 349 comments

  1. Re:I disagree with *you* by TheRaven64 · Score: 5, Informative

    Lots of things affect performance. One of the big ones is cache usage. An L1 cache miss costs around 10 cycles these days; an L2 cache miss costs 200 or more. If you move a process (or thread) between cores on the same die with a shared L2 cache, then every load or store instruction for a little while will cause an L1 cache miss. If you move it between processors with no shared cache, then every access will cause an L2 cache miss. If every time you schedule a thread it lands on a different processor, then, given that a typical scheduling quantum is only 10ms, your thread will spend most of its time loading data from main memory into cache. This will show up as 100% CPU usage, but the thread will only be getting something like 10% of the maximum theoretical throughput for that CPU. Improve processor affinity, and you can easily see a large speedup relative to this.
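
    A rough way to sanity-check that 10% figure is a back-of-the-envelope model. The sketch below is only that: the function name and every input (working-set size, line size, clock speed) are assumptions picked for illustration, and it pessimistically treats every miss as fully serialized with no prefetching.

```python
# Back-of-the-envelope model: how much of a scheduling quantum is left for
# useful work if a migrated thread must first refill its cache from memory?
# All figures are illustrative assumptions, not measurements.

def useful_fraction(working_set_bytes, line_bytes, miss_cycles, clock_hz, quantum_s):
    """Fraction of one quantum left after refilling the working set.

    Assumes every cache line is missed exactly once and that misses are
    fully serialized (no prefetching, no overlapping misses), so this is
    a pessimistic estimate.
    """
    lines = working_set_bytes // line_bytes
    refill_cycles = lines * miss_cycles
    quantum_cycles = clock_hz * quantum_s
    return max(0.0, 1.0 - refill_cycles / quantum_cycles)


# Hypothetical machine: 8 MiB working set, 64-byte lines, 200-cycle miss,
# 3 GHz clock, 10 ms quantum.
fraction = useful_fraction(8 * 1024 * 1024, 64, 200, 3e9, 0.010)
print(f"useful fraction of the quantum: {fraction:.1%}")
```

    With those made-up inputs the thread gets roughly 13% of the quantum for real work, which is the same ballpark as the figure above. A smaller working set or overlapping misses would make the picture much less grim.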

    --
    I am TheRaven on Soylent News
  2. Re:Not Really by Jah-Wren Ryel · Score: 5, Informative

    not surprising because the OS really can't do that much to improve (or mess up) the performance of user-mode code that isn't making many OS calls anyways.

    Others have already mentioned scheduling and cache thrashing; I'd like to add memory management. There are lots of ways memory management choices can degrade performance, sometimes drastically.

    One example is page sizes and the TLB. Each CPU has a hardware TLB, which is like a cache of virtual-page-to-physical-page address mappings. Hardware TLB look-ups are fast, but the TLB has limited size, and when a virtual address is not in the hardware TLB, the OS has to take a fault and walk its own software-maintained page tables, which hold the complete list of virt2phys translations. That's a couple of orders of magnitude slower than getting it from the hardware TLB.
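
    To see why even a small miss rate matters, here is a minimal sketch of the usual weighted-average cost calculation. The cycle counts and the 99% hit rate are assumptions chosen only to match the "couple of orders of magnitude" gap, not measured values, and the function name is made up.

```python
# Expected cost of one address translation, given a TLB hit rate.
# The cycle counts below are illustrative assumptions only.

def average_translation_cycles(hit_rate, hit_cycles, miss_cycles):
    """Weighted average of the fast (TLB hit) and slow (table walk) paths."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# Hypothetical: a hit costs 1 cycle, a miss costs 100 (two orders of magnitude).
avg = average_translation_cycles(0.99, 1, 100)
print(f"average cycles per translation at 99% hits: {avg:.2f}")
```

    Under those assumptions, a 1% miss rate already nearly doubles the average translation cost compared with always hitting.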

    One way to reduce TLB misses is to use larger pages. So an OS that is smart enough to automagically coalesce 4K pages into 4MB (or larger, depending on the hardware) pages can significantly improve TLB performance. In a pathological case, that could result in a 100x-1000x speed-up; in typical cases where it is going to make a difference, you'll probably see a ~10% performance improvement.
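
    The effect of page size is easiest to see as "TLB reach": how much memory the TLB can map at once. A small sketch follows, assuming a hypothetical 64-entry TLB that can hold either page size (real hardware often has separate, smaller TLBs for large pages); the helper names are made up for illustration.

```python
# TLB reach = number of entries * page size.
# ENTRIES is a hypothetical figure; real TLB sizes vary by CPU.

KIB = 1024
MIB = 1024 * KIB
ENTRIES = 64


def tlb_reach(entries, page_bytes):
    """Bytes of address space the TLB can map without taking a miss."""
    return entries * page_bytes


def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so translations) needed to cover a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


small_reach = tlb_reach(ENTRIES, 4 * KIB)
large_reach = tlb_reach(ENTRIES, 4 * MIB)
print(f"4K pages:  reach {small_reach // KIB} KiB, "
      f"{pages_needed(100 * MIB, 4 * KIB)} pages to cover 100 MiB")
print(f"4MB pages: reach {large_reach // MIB} MiB, "
      f"{pages_needed(100 * MIB, 4 * MIB)} pages to cover 100 MiB")
```

    With the same number of entries, going from 4K to 4MB pages multiplies the reach by 1024, so a working set that thrashed the TLB before may fit entirely.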

    Another related example is how shared memory is handled. Every page of virtual memory has a PTE which, at the most basic level, contains the virt2phys translation. When shared memory is used, a decision must be made: are the PTEs shared, or does each process get a separate copy of the PTEs for the shared memory? The downside of sharing PTEs is that the shared memory must be mapped at exactly the same virtual address in each process that uses it, so if one of those processes already has something else at that address, it won't be able to use the shared memory. The downside of using separate copies of PTEs is that you can really suck up a lot of memory for just the PTE list. Imagine 50 processes that all share one chunk of 100MB of memory: if they all get their own PTE copies for that 100MB, it's the equivalent of 5GB worth of PTEs. If a PTE itself takes up 32 bytes, then with 4K pages that's about 40MB of PTEs just to manage that 100MB of memory. A 40% overhead is huge, and then there is the issue of hardware TLB misses which, depending on the implementation, may have to search all PTEs in the system, so the more PTEs there are, the worse a TLB miss will hurt performance.
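
    The arithmetic behind that 40MB figure can be spelled out as a short sketch. It assumes 4K pages (the comment does not state a page size) and the 32-byte PTE mentioned above; the function name and parameters are made up for illustration.

```python
# PTE overhead for a shared-memory region, with shared vs. per-process PTEs.
# Assumes 4 KiB pages; the 32-byte PTE size comes from the example above.

KIB = 1024
MIB = 1024 * KIB


def pte_overhead_bytes(region_bytes, page_bytes, pte_bytes, processes, shared_ptes):
    """Total bytes of PTEs needed to map one shared region."""
    pages = -(-region_bytes // page_bytes)  # ceiling division
    copies = 1 if shared_ptes else processes
    return pages * pte_bytes * copies


region = 100 * MIB
private = pte_overhead_bytes(region, 4 * KIB, 32, 50, shared_ptes=False)
shared = pte_overhead_bytes(region, 4 * KIB, 32, 50, shared_ptes=True)

print(f"per-process PTEs: {private:,} bytes ({private / region:.0%} of the region)")
print(f"shared PTEs:      {shared:,} bytes ({shared / region:.1%} of the region)")
```

    Per-process copies come to 40,960,000 bytes, about 39% of the 100MB region, while a single shared set is 50 times smaller.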

    --
    When information is power, privacy is freedom.