Slashdot Mirror


Microsoft Advice Against Nehalem Xeons Snuffed Out

Eukariote writes "In an article outlining hidden strife in the processor world, Andreas Stiller has reported the scoop that Microsoft advised against the use of Intel Nehalem Xeon (Core i7/i5) processors under Windows Server 2008 R2, but was pressured by Intel to refrain from publishing this advisory. The issue concerns a bug causing spurious interrupts that locks up the Hypervisor of Server 2008. Though there is a hotfix, it is unattractive as it disables power savings and turbo boost states. (The original German-language version of the article is also available.)"

45 of 154 comments

  1. Broken processors by Anonymous Coward · · Score: 5, Insightful

    The processors are clearly broken, and anyone who bought them should get a refund or an exchange. End of story.

    1. Re:Broken processors by Anonymous Coward · · Score: 2, Interesting

      We use them with Oracle VM (Xen), and they work ok.

    2. Re:Broken processors by hattig · · Score: 4, Insightful

      It's pretty serious.

      Server requirements of CPUs include virtualisation and power savings (saving power in the data centre is a top priority for companies now).

      This CPU cannot do both at the same time, at least with Windows Server 2008's Hypervisor. Presumably it is being sold with both items listed as features however. I agree with the OP - the CPUs are broken as sold and advertised.

    3. Re:Broken processors by Bengie · · Score: 5, Informative

      so much FUD.

      #1. MS classified this interrupt as "unreliable" for all previous hypervisors and randomly decided to use it for this version of their hypervisor.

      #2. ONLY MS uses this interrupt, not vmware or anyone else.

      #3. Intel's new Xeons still use less power and outperform AMD and any previous CPUs. It's still the best CPU, even if you use the "workaround".

    4. Re:Broken processors by agnosticnixie · · Score: 2, Insightful

      Or the processor exposes an issue with the OS...

    5. Re:Broken processors by Waynelson · · Score: 5, Informative

      I don't know if anyone actually read the KB article on the Microsoft website, but it appears that you don't lose the power-saving features with the hotfix installation. The loss of those features only occurs when you directly modify the registry to disable some of the ACPI C-states as a quick fix. Either that or I'm reading the KB article wrong.

    6. Re:Broken processors by countach · · Score: 2, Insightful

      So you've missed the entire trend towards power saving in the data center?

    7. Re:Broken processors by Anpheus · · Score: 2, Informative

      The hotfix fixes the problem and allows the use of power saving states.

      Done!

  2. Re:AMD is looking better and this is the type of s by Anonymous Coward · · Score: 2, Insightful

    amd is incapable of having bugs in the convoluted exception path?

  3. Spurious Interrupt, eh? by chebucto · · Score: 4, Funny

    Maybe Xeons are what end up being used on the UESG Marathon. I mean, half of the terminal messages on that ship are subject to the same bug. Just look at this typical example:

    http://marathon.bungie.org/story/nawmanhesclose.html#M3.13.1.1

    --
    The English word fart is one of the oldest words in the English vocabulary.
  4. What about for Windows 7? by Faizdog · · Score: 3, Interesting

    This story is interesting and timely because I plan on buying a new desktop in the next 2 weeks, just waiting for the right deal to come out, hopefully on Cyber Monday. While not getting a server, I will be getting Windows 7. I had been planning on an i7, but now am hesitant. Is there a problem with these processors for home use/gaming purposes under Windows 7? Or would I be better off going with a Quad Core?

    --
    -"Those who fought today will die tommorow."-
    1. Re:What about for Windows 7? by Viros · · Score: 5, Informative

      I've got an i7 920 on my desktop and run Windows 7 for gaming/home use purposes and it works fine. Don't let the problems with the server software dissuade you from a very good processor for home and gaming use. The kind of stuff you're describing doing will never run into anything close to the problems from this article.

    2. Re:What about for Windows 7? by the+linux+geek · · Score: 3, Informative

      No, this only applies to the Hyper-V component of Server 2008 R2. Normal people do not use Windows Server for "home use/gaming purposes" (cue a dozen replies of people talking about how cool they are because they use pirated copies for said purpose), so it's not a big deal. Also, Core i5/i7 is already a quad core; I assume you mean Core 2 Quad.

    3. Re:What about for Windows 7? by Anonymous Coward · · Score: 5, Funny

      No problems at all. I'm running an i7 920 with 12 GB of RAM and Windows 7 64-Bit Ultimate. I've been playing BF2, GTA4, COD:MW/MW2, Batman: AA and others without any problem. Not to mention running 2 or 3 VMWare sessions, putty sessions, winscp, IE8, pidgin and streaming TV through Windows Media Center all at the same time.

      Okay you have a big penis (not literally). We get it.

    4. Re:What about for Windows 7? by cwebster · · Score: 2, Insightful

      Actually no, IE8 is the only program you mentioned that actually needs an i7 920 and 12 gigs of RAM to properly execute.

      The rest of your post is like a word problem, "Sally has 5 fish, 2 turtles and a cat. How many cats does Sally have?." That is to say, completely irrelevant to the question at hand.

      Using putty to justify a multiple core machine, quite hardcore...

    5. Re:What about for Windows 7? by omfglearntoplay · · Score: 2, Interesting

      Be sure to avoid any of the HP i7 processor models. They have a major motherboard problem, as I have (un)luckily learned myself by buying one. Try googling "HP i7 crash freeze", etc. Or go here to see gobs of users with problems:

      http://h30434.www3.hp.com/psg/board?board.id=lockups

      However, I hear that homemade PCs on i7 platforms are fine, as are recent Dells. I'm going the Asus motherboard route to rectify my problem. It's just grand though, b/c they are way expensive, then I need a new case (b/c HP uses the lefty cases), and I need a new heat sink/fan (b/c HP is proprietary). So to fix my $1200 HP, I have to spend 5 or 6 hundred. WEEEE!

  5. First Rev of New Architecture by bill_mcgonigle · · Score: 4, Interesting

    Many of the benchmarking sites have also posted some poor results - I was thinking this might be a generation to skip, but now I wonder if a flaw has been discovered that could be fixed with a microcode upload. Might help the benchmarks too if it was a hidden variable.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    1. Re:First Rev of New Architecture by Darkness404 · · Score: 2, Informative

      A generation to skip for servers (or move to AMD for a generation), but Core i7s are amazing for home/gaming use. For just about anything other than virtualization and server-specific stuff, nothing AMD has to offer compares with Core i7s and CPUs with the same architecture.

      --
      Taxation is legalized theft, no more, no less.
  6. Windows specific? by Anonymous Coward · · Score: 5, Funny

    It sounds like Microsoft should retract the advice and issue a warning that no OS should be run on a processor with such spurious interrupts?

    Or is this the sort of crappy hardware kernels are supposed to put up with, in which case it should be Intel advising against running Windows on its hardware?

    Int€l bashing..check
    M$ bashing...check
    now i just sit back and watch the karma roll in

    1. Re:Windows specific? by Anonymous Coward · · Score: 3, Funny

      Uh, guy? That symbol you used is a "C" with two lines through it, not an "E". Get it right.

    2. Re:Windows specific? by onefriedrice · · Score: 4, Funny

      I think you just missed out on the joke. It's unlikely the OP meant to show any kind of disrespect (heaven forbid) for the wonderful, lovely Euro, so try not to be so defensive huh? Relax.

      --
      This author takes full ownership and responsibility for the unpopular opinions outlined above.
  7. VMWare may also be a problem by Virtucon · · Score: 2, Informative

    I've been experiencing problems with intermittent lockups under VMWare as well. DL370-G6 boxes. HP has given us BIOS fixes and is even shipping new boxes, but if there's a suspected problem with MS' hypervisor, I wonder if this is the same issue?

    --
    Harrison's Postulate - "For every action there is an equal and opposite criticism"
    1. Re:VMWare may also be a problem by Glasswire · · Score: 2, Interesting

      Is it in response to a documented problem with VMWare ESX that HP is trying to remedy with a specific BIOS change, or is HP just flailing around suggesting BIOS updates as a fix to a problem they don't yet understand? There are 100s of possible reasons why you're having VMWare lockup issues - the ONLY similarity to the MSFT issue that you seem to have is that they are both hypervisors running on Nehalem procs. Pretty thin. What does VMWare think the problem is?

  8. Please Explain Further by Anonymous Coward · · Score: 5, Informative

    I read the article, I read the MS support report, and I read the Intel advisory. And I don't think that the summary is correct.

    The summary says that the hotfix disables power savings and turbo boost. But my reading of the MS report is that an affected system has two options, (1) a workaround, and (2) the hotfix. The difference is that the workaround disables advanced power savings and is known to be stable without side effects, but the hotfix actually fixes the problem with the vector table, presumably by following the instructions provided in the Intel advisory note.

    Said another way, the hotfix doesn't disable power savings and doesn't disable turbo boost.

    I expect that this is another fine example where Slashdot editors misunderstand a situation. Someone prove me wrong.

    1. Re:Please Explain Further by RDaneel2 · · Score: 4, Informative

      I just saw your post as I was finishing researching mine... and I certainly agree with you that the summary is wrong.

      The Microsoft KB article is quite explicit that the workaround is what disables the sleep states, leading to higher power usage - the hotfix itself does not exhibit this problem.

    2. Re:Please Explain Further by Anonymous Coward · · Score: 5, Informative

      Your explanation is exactly how I interpreted the KB article. I think Slashdot was going for some sensationalistic journalism. :-)

      Taken from TFA:
      You can disable the Advanced Configuration and Power Interface (ACPI) C-states by using a BIOS firmware option on the computer. If the firmware does not include this option, a software workaround is available. You can disable the ACPI C2-state and C3-state by setting a registry key. To do this, follow these steps:

            1. At a command prompt, run the following command:
                  reg add HKLM\System\CurrentControlSet\Control\Processor /v Capabilities /t REG_DWORD /d 0x0007c044
            2. Restart the computer.

      Note The computer idle power consumption will increase significantly if the deeper ACPI C-states (processor idle sleep states) are disabled. Windows Server 2008 R2 uses these deeper C-states on the Xeon 5500 series as a key energy saving feature.

      To continue to benefit from these energy saving states, remove this registry key after you install the hotfix that this article describes. To remove this registry key, follow these steps:

            1. At a command prompt, run the following command:
                  reg delete HKLM\System\CurrentControlSet\Control\Processor /v Capabilities /f
            2. Restart the computer.
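
For anyone wanting to script the KB steps quoted above across several hosts, here is a minimal Python sketch that only builds the two `reg` command lines as argument lists. The key path, value name and data are copied from the quoted KB text; the function names are made up for illustration, and nothing is executed.

```python
# Hypothetical helpers: they only build the argument lists for the two
# reg.exe commands quoted from the KB article. Nothing is run here.
KEY = r"HKLM\System\CurrentControlSet\Control\Processor"
VALUE_NAME = "Capabilities"
WORKAROUND_DATA = "0x0007c044"  # value given in the quoted KB text


def workaround_command():
    """Command that disables the deeper C-states (the workaround)."""
    return ["reg", "add", KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", WORKAROUND_DATA]


def rollback_command():
    """Command that removes the value again once the hotfix is installed."""
    return ["reg", "delete", KEY, "/v", VALUE_NAME, "/f"]


if __name__ == "__main__":
    print(" ".join(workaround_command()))
    print(" ".join(rollback_command()))
```

On a real Windows Server box you would presumably hand either list to `subprocess.run` from an elevated prompt and then restart, as the KB says. Keeping the two commands side by side makes the point of this subthread visible: the workaround and its removal are separate from the hotfix itself.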

    3. Re:Please Explain Further by oldhack · · Score: 5, Funny

      Your explanation is exactly how I interpreted the KB article. I think Slashdot was going for some sensationalistic journalism. :-)

      NO WAY!

      --
      Fuck systemd. Fuck Redhat. Fuck Soylent, too. Wait, scratch the last one.
  9. Isn't it really a bug in Windows Server? by tomhudson · · Score: 5, Insightful

    FTFA:

    For the integrated hypervisor of Windows Server 2008 R2, Microsoft has bravely resorted to a timer function that they themselves had classified as unreliable for former processors: the timer of the Advanced Programmable Interrupt Controller (APIC). Unlike, for example, the CPU timer (Time Stamp Counter, TSC) - which by now is comparatively resistant to power-saving, SpeedStep and turbo-boost modes, but is also virtualised by virtual machines - the APIC timer can also trigger interrupts. Unfortunately, right now, the Nehalem has too many of those, so that the hypervisor falters and then stops, returning the message "Clock_Watchdog_Time-out".

    So yes, if you depend on something that generates an interrupt whose code path may be suspended in certain power-saving modes, don't be surprised if it doesn't get serviced promptly. It looks more like a bug in Windows Server.

    Back in the old days, when you issued a CLI instruction, you made sure your routine didn't do too much work before issuing an STI, because that code isn't re-entrant (it's directly modifiable by the hardware, which is why you have to use the "volatile" keyword to make sure that compilers didn't "optimize away" any loops, etc). Kind of hard to guarantee that if you're putting that portion of the hardware to sleep between interrupts. As the article points out, disabling those power-saving modes fixes the problem.

    1. Re:Isn't it really a bug in Windows Server? by AcidPenguin9873 · · Score: 4, Interesting

      I don't think so. Here's the text from the Intel erratum:

      During a complex set of conditions, if the APIC timer is being used to generate interrupts, unexpected interrupts not related to the APIC timer may be signaled when a core exits the C6 power state. The APIC timer stops counting in C6 and as such isn't typically used to generate interrupts when the C6 core power state is enabled. Implication: Unexpected interrupt vectors could be sent from the APIC to a logical processor.

      Interrupts not related to the APIC timer being caused by the APIC timer is not a software problem, it's a hardware problem. I could understand your argument if the APIC timer was generating too many interrupts upon C6 exit, or something else related to messed-up APIC timekeeping near power management events, but this is unrelated interrupts being generated.

      I don't know the details, but I would assume Microsoft is using the APIC timer in its hypervisor for a reason. Maybe it's because the hypervisor is required to virtualize all the other timekeeping mechanisms for the guest.

    2. Re:Isn't it really a bug in Windows Server? by Anonymous Coward · · Score: 2, Insightful

      This article is gibberish. The TSC does not generate interrupts. As a clocksource, the TSC is unreliable because while the frequency is fixed within a socket, it can skew across sockets particularly when dealing with multi-node systems.
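
To make the cross-socket skew point concrete, here is a toy Python model (no real hardware access). It assumes two sockets whose counters tick at the same rate but started from different offsets; the class name and the 500-tick skew are invented for illustration.

```python
# Toy model, not real hardware access: each "socket" has its own counter
# that ticks at the same rate but started at a different offset.
class SocketTSC:
    def __init__(self, offset_ticks):
        self.offset_ticks = offset_ticks

    def read(self, true_ticks):
        # What an rdtsc-style read would return on this socket.
        return true_ticks + self.offset_ticks


def elapsed(first, second, t0, t1):
    """Naive elapsed time: read on `first` at t0, then on `second` at t1."""
    return second.read(t1) - first.read(t0)


socket_a = SocketTSC(offset_ticks=0)
socket_b = SocketTSC(offset_ticks=-500)  # assumed skew, purely illustrative

same_socket = elapsed(socket_a, socket_a, t0=1000, t1=1100)
migrated = elapsed(socket_a, socket_b, t0=1000, t1=1100)
print(same_socket, migrated)
```

A thread that stays on one socket measures 100 ticks; one that migrates between the two reads measures a negative interval, which is the kind of thing that makes a raw per-socket counter awkward as a system-wide clocksource.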

  10. Re:AMD is looking better and this is the type of s by CAIMLAS · · Score: 2, Insightful

    I wouldn't say "AMD is better", necessarily. I will say, however, that the Xeons seem to have been plagued from the very beginning with problems like this. They're just fringe enough to not get enough run-in testing, and the bugs don't get as quickly found as they do with the more mainstream/many users processors.

    --
    ~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
  11. Re:AMD is looking better and this is the type of s by amorsen · · Score: 3, Insightful

    Read the link. 5 pages of errata, and that's just headlines. Modern processors are very complicated, and they will have bugs.

    The major difference between Intel and AMD when it comes to errata is that Intel learned its lesson about secrecy from the Pentium FPU fiasco. Since then they have had a very open approach to processor bugs. AMD hasn't had such a PR disaster and isn't quite as open. That doesn't mean they are particularly less buggy.

    --
    Finally! A year of moderation! Ready for 2019?
  12. Re:Damn pesky kids by tomhudson · · Score: 2, Funny

    Nothing to see here. Move along. What? Nevermind where I work.

    Sorry, didn't get the message - running with interrupts disabled due to too many interrupts - so Im goo@#@!%!!#)(MN!NO CARRIER

    I for one welcome our non-interrupted cpu overlords, because in Soviet Russia, interrupts disable YOU!

  13. Actual errata by crow · · Score: 2, Informative

    From the PDF file linked from the Intel site, I think it's AAK36, as it's the only one that mentions the word "spurious." This has to do with writing to the interrupt vector table when a local interrupt is pending. That doesn't look terribly serious from my perspective. If I'm mistaken and it's a different erratum, please reply with the correction.

    1. Re:Actual errata by crow · · Score: 2, Informative

      AAK36 for the Xeon version. AAN31 is the code for the i7 and i5 version. It's the same erratum, just a different code number for different chips.

    2. Re:Actual errata by YesIAmAScript · · Score: 2, Informative

      I don't think it's either of them. The top one about changing vectors would be unlikely to happen in commercial software like Windows, because they would have handlers installed for all interrupts already.

      I think the issue really is the watchdog. MS is using the APIC timer during the C6 state and, as the 119 erratum says, the APIC counter stops during C6. So some interrupt that is supposed to fire to reset the watchdog doesn't fire, and thus the watchdog goes off (as indicated by the error code).

      So the 119 erratum is related only inasmuch as it mentions that the APIC counter doesn't increment during the C6 state (which is also probably documented elsewhere).

      There really isn't enough info in this article to know for sure what is up. That didn't stop the slashdot editors from going off half-cocked though.
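
The hypothesis in this comment can be sketched as a toy Python simulation. Everything here is invented for illustration (the function name, the tick counts, the idea of a single timer petting a single watchdog); it is not how Hyper-V or the APIC is actually programmed, just the "timer stops in C6, so the watchdog starves" idea in miniature.

```python
# Toy simulation of the hypothesis in this comment. All numbers and names
# are made up; this is not how Hyper-V or the APIC is actually programmed.
def watchdog_fires(states, timer_period=3, watchdog_limit=5):
    """Return the tick at which the watchdog expires, or None.

    states: one entry per wall-clock tick, "C0" (running) or "C6" (deep sleep).
    The timer only counts while the core is not in C6; every `timer_period`
    counted ticks it "interrupts" and resets the watchdog.
    """
    timer_count = 0
    since_reset = 0
    for tick, state in enumerate(states):
        since_reset += 1              # the watchdog follows wall-clock time
        if state != "C6":
            timer_count += 1          # the timer only advances outside C6
            if timer_count % timer_period == 0:
                since_reset = 0       # timer interrupt resets the watchdog
        if since_reset > watchdog_limit:
            return tick
    return None


always_awake = ["C0"] * 20
with_deep_sleep = ["C0"] * 3 + ["C6"] * 10
print(watchdog_fires(always_awake), watchdog_fires(with_deep_sleep))
```

With the core always awake the watchdog never expires; once the core sits in the simulated C6 state longer than the watchdog limit, it does. Whether that is really what happens on Nehalem is exactly what the thread is arguing about.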

      --
      http://lkml.org/lkml/2005/8/20/95
  14. Performance, complexity & bugs by Alwin+Henseler · · Score: 3, Insightful

    No, it's more like [hardware manufacturer of your choice] AND [software manufacturer of your choice] are incapable of making products that are both complex, and bug-free.

    And for some reason, 'high performance' often equals 'complex'.

  15. No evidence of problem in Xen or VMWare -MSFT bug by Glasswire · · Score: 2, Insightful

    Looks like it's a Microsoft coding problem if there is no problem in Xen or VMWare ESX Hypervisors (post on VMware above is far from useful).
    And poster didn't read the MSFT article very closely. The hotfix doesn't preclude the energy saving sleep states, it's the workaround that inhibits their use.

  16. Re:AMD is looking better and this is the type of s by lukas84 · · Score: 2, Informative

    Xeon is just a marketing name. The Xeon 3400 are identical with the i5-7xx, i7-8xx CPUs, the Xeon 3500 are identical with the i7-9xx CPUs and the Xeon 5500 CPUs are basically i7-9xx with two QPI Links.

    For example, this issue also affects all i5 and i7 CPUs.

  17. Re:AMD is looking better and this is the type of s by lukas84 · · Score: 2, Informative

    It's a processor bug exposed by a new hypervisor technique used by MS and nobody else.

    I'm not sure why you want to blame this on MS.

  18. Re:Inverted perceptions and Llanelli by daveime · · Score: 2, Funny

    Thousand(s) implies at least two thousand.

    Ergo, you use each program on average for 43.2 seconds. Is this because they *all* suck, or you simply have the attention span of a concussed duckling ?

  19. AMD looking better? Bullshit by TopSpin · · Score: 5, Informative

    AMD has also built parts with equally screwed up timers, particularly TSC clock skew on multi-cores. Timers are just messed up on x86 from either company. This nonsense goes back years. There are now at least four distinct general purpose clock sources that must be present on modern systems: tsc, acpi_pm, hpet and pit (as labeled by the Linux kernel). There will probably be further proliferation in the future as ALL of the existing timers are inadequate in subtle ways. Implementations from both manufacturers have been plagued with bugs that require nasty work-arounds; google "clocksource tsc unstable", "pm-timer bug" or "athlon x2 tsc" for some examples. This nonsense that Microsoft has stumbled upon is just the latest in a long and colorful history of failure that we'll now have to add to the list.

    Computers are supposed to keep time. Today that means high resolution clocks that work correctly regardless of power saving, concurrency, etc. Using these crucial timers is not supposed to cause spurious interrupts, bus contention or other subtle problems. People that must work with this stuff are thoroughly fed up with this ever growing pile of half-baked bullshit.
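
If you want to see which clock sources your own Linux box offers and which one the kernel picked, a minimal Python sketch follows. It assumes the usual sysfs location under `/sys/devices/system/clocksource/clocksource0`; the helper names are made up, and on systems without that directory it just returns an empty result.

```python
from pathlib import Path

# Usual sysfs location on Linux; absent on other systems.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_clocksources(text):
    """Split the space-separated contents of available_clocksource."""
    return text.split()


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available), or (None, []) if sysfs is not there."""
    try:
        current = (sysfs_dir / "current_clocksource").read_text().strip()
        available = parse_clocksources(
            (sysfs_dir / "available_clocksource").read_text())
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    print(read_clocksources())
```

The parsing is split from the file access so it can be tried on any string; what the kernel actually lists varies by machine and kernel version.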

    --
    Lurking at the bottom of the gravity well, getting old
  20. Re:AMD is looking better and this is the type of s by mysidia · · Score: 3, Insightful

    It's the equivalent to writing a program against the Windows API, not testing it, and calling the API buggy when you find that it is failing in the wild.

    The API may not match the spec perfectly, but it's your software that's buggy.

    Intel can revise the proc, or revise the spec to be in agreement.

    MS is trying to use an APIC interrupt for timing that isn't normally used for that purpose.

    It's the equivalent of attaching an alarm clock to your electric car's engine, and complaining when the idling speed deviates due to a power saving feature.

    Nehalem processors were out long before 2008 R2 or the newest Hyper-V release.

    Intel Nehalem is a processor with features very attractive to users of virtualization; it's one of the most common procs to be used in new server deployments.

    There is absolutely no excuse for MS not extensively testing and qualifying including stress-testing their Hypervisor on Nehalem CPUs before releasing the code.

    It would be like you or me writing a piece of desktop software today (in 2009), designed for use with Windows, and extensively testing it on Windows '98 and XP, but not discovering a frequent crash on Vista, that almost always occurs as soon as starting the program.

  21. MS KB DOES NOT say hotfix breaks power save by George_Ou · · Score: 3, Interesting

    Folks, this is a very irresponsible headline at Slashdot. The Microsoft article does NOT say the hotfix breaks power saving, and it doesn't even mention turbo; it presents an either-or solution. Microsoft always offers workarounds as an ALTERNATIVE to the hotfix for people who don't want to apply hotfixes. The Microsoft KB article even tells you that if you want to keep using those power states, you should run the hotfix and make a certain modification to the registry.

    This post makes it sound like some kind of cover up and that the fix causes major CPU slowdowns, and that it's on the level of the AMD Barcelona TLB bug where the fix actually did cause a significant performance drop. This does not appear to be true. The real story is that all CPUs have hundreds of errata, and it's the job of the software maker to work around it, and that is what Microsoft is doing with their hotfix and registry hack. They're also telling you if you aren't experiencing any problems, don't bother applying the hotfix.

  22. KB Link by woan · · Score: 2, Informative

    I didn't see a link to the KB article in question. I assume this is the one: http://support.microsoft.com/kb/975530