Microsoft Advice Against Nehalem Xeons Snuffed Out
Eukariote writes "In an article outlining hidden strife in the processor world, Andreas Stiller has reported the scoop that Microsoft advised against the use of Intel Nehalem Xeon (Core i7/i5) processors under Windows Server 2008 R2, but was pressured by Intel to refrain from publishing this advisory. The issue concerns a bug that causes spurious interrupts and locks up the hypervisor of Server 2008 R2. Though there is a hotfix, it is unattractive as it disables power savings and turbo boost states. (The original German-language version of the article is also available.)"
The processors are clearly broken, and anyone who bought them should get a refund or an exchange. End of story.
AMD is incapable of having bugs in the convoluted exception path?
Maybe Xeons are what end up being used on the UESC Marathon. I mean, half of the terminal messages on that ship are subject to the same bug. Just look at this typical example:
http://marathon.bungie.org/story/nawmanhesclose.html#M3.13.1.1
This story is interesting and timely because I plan on buying a new desktop in the next two weeks; I'm just waiting for the right deal to come out, hopefully on Cyber Monday. I'm not getting a server, but I will be getting Windows 7. I had been planning on an i7, but now I'm hesitant. Is there a problem with these processors for home use/gaming under Windows 7? Or would I be better off going with a Quad Core?
Many of the benchmarking sites have also posted some poor results. I was thinking this might be a generation to skip, but now I wonder if a flaw has been discovered that could be fixed with a microcode update. It might help the benchmarks too, if it was a hidden variable.
It sounds like Microsoft should retract the advice and issue a warning that no OS should be run on a processor with such spurious interrupts?
Or is this the sort of crappy hardware kernels are supposed to put up with, in which case it should be Intel advising against running Windows on its hardware?
Int€l bashing..check
M$ bashing...check
now i just sit back and watch the karma roll in
I've been experiencing intermittent lockups under VMware as well, on DL370-G6 boxes. HP has given us BIOS fixes and is even shipping new boxes, but if there's a suspected problem with MS' hypervisor, I wonder if this is the same issue?
I read the article, I read the MS support report, and I read the Intel advisory. And I don't think that the summary is correct.
The summary says that the hotfix disables power savings and turbo boost. But my reading of the MS report is that an affected system has two options, (1) a workaround, and (2) the hotfix. The difference is that the workaround disables advanced power savings and is known to be stable without side effects, but the hotfix actually fixes the problem with the vector table, presumably by following the instructions provided in the Intel advisory note.
Said another way, the hotfix doesn't disable power savings and doesn't disable turbo boost.
I expect that this is another fine example where Slashdot editors misunderstand a situation. Someone prove me wrong.
FTFA:
So yes, if you depend on something that generates an interrupt whose code path may be suspended in certain power-saving modes, don't be surprised if it doesn't get serviced promptly. It looks more like a bug in Windows Server.
Back in the old days, when you issued a CLI instruction, you made sure your routine didn't do too much work before issuing an STI, because that code isn't re-entrant (the memory it touches can be modified directly by the hardware, which is why you have to use the "volatile" keyword to make sure the compiler doesn't "optimize away" any loops, etc.). That's hard to guarantee if you're putting that portion of the hardware to sleep between interrupts. As the article points out, disabling those power-saving modes fixes the problem.
I wouldn't say "AMD is better", necessarily. I will say, however, that the Xeons seem to have been plagued from the very beginning with problems like this. They're just fringe enough not to get enough run-in testing, and their bugs don't get found as quickly as they do with more mainstream processors that have many more users.
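User-space C can't issue CLI/STI, but POSIX signals give a rough analog of the pattern described above: blocking a signal plays the role of CLI, unblocking it plays the role of STI, and a `volatile sig_atomic_t` flag is what the handler and the main code share. This is a minimal sketch assuming a POSIX system; SIGUSR1 and the function names `install_tick_handler` and `simulate_masked_tick` are hypothetical stand-ins, not anything taken from Windows, Hyper-V or Intel's errata.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Shared with the handler. volatile makes the compiler reload it on every
   loop iteration instead of caching it in a register; sig_atomic_t makes
   reads and writes atomic with respect to the handler. */
static volatile sig_atomic_t tick_pending = 0;

/* Stand-in for an interrupt service routine: do as little as possible. */
static void on_tick(int signo)
{
    (void)signo;
    tick_pending = 1;
}

/* Install the handler once. Returns 0 on success, -1 on failure. */
int install_tick_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL) == 0 ? 0 : -1;
}

/* Raise one "interrupt" while it is masked, then unmask it and wait.
   Returns the flag value seen while masked (expected 0), or -1 on error. */
int simulate_masked_tick(void)
{
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    assert(tick_pending == 0);                       /* no leftover event */

    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)   /* the "CLI" step */
        return -1;

    /* Keep this critical section short: the signal just stays pending. */
    if (raise(SIGUSR1) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    int seen_while_masked = tick_pending;            /* handler has not run */

    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)   /* the "STI" step */
        return -1;

    /* Spin until the handler has run. Without volatile, the compiler could
       hoist the load out of the loop and spin forever. */
    while (!tick_pending)
        ;
    tick_pending = 0;                                /* consume the event */
    return seen_while_masked;
}
```

Call `install_tick_handler()` once, then `simulate_masked_tick()`; it should return 0, because the handler can't run while the signal is blocked. POSIX guarantees that a pending signal which has just been unblocked is delivered before `sigprocmask` returns, so the spin loop ends immediately here; against truly asynchronous hardware that loop is where `volatile` earns its keep.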
Read the link. 5 pages of errata, and that's just headlines. Modern processors are very complicated, and they will have bugs.
The major difference between Intel and AMD when it comes to errata is that Intel learned its lesson about secrecy from the Pentium FPU fiasco. Since then they have had a very open approach to processor bugs. AMD hasn't had such a PR disaster and isn't quite as open. That doesn't mean they are particularly less buggy.
Sorry, didn't get the message - running with interrupts disabled due to too many interrupts - so Im goo@#@!%!!#)(MN!NO CARRIER
I for one welcome our non-interrupted cpu overlords, because in Soviet Russia, interrupts disable YOU!
From the PDF file linked from the Intel site, I think it's AAK36, as it's the only one that mentions the word "spurious." This has to do with writing to the interrupt vector table when a local interrupt is pending. That doesn't look terribly serious from my perspective. If I'm mistaken and it's a different erratum, please reply with the correction.
No, it's more like [hardware manufacturer of your choice] AND [software manufacturer of your choice] are incapable of making products that are both complex, and bug-free.
And for some reason, 'high performance' often equals 'complex'.
Looks like it's a Microsoft coding problem if there is no problem in the Xen or VMware ESX hypervisors (the post about VMware above is far from useful).
And the poster didn't read the MSFT article very closely. The hotfix doesn't preclude the energy-saving sleep states; it's the workaround that inhibits their use.
Xeon is just a marketing name. The Xeon 3400s are identical to the i5-7xx and i7-8xx CPUs, the Xeon 3500s are identical to the i7-9xx CPUs, and the Xeon 5500 CPUs are basically i7-9xx with two QPI links.
For example, this issue also affects i5 and i7 CPUs.
It's a processor bug exposed by a new hypervisor technique used by MS and nobody else.
I'm not sure why you want to blame this on MS.
Thousand(s) implies at least two thousand.
Ergo, you use each program on average for 43.2 seconds. Is this because they *all* suck, or do you simply have the attention span of a concussed duckling?
AMD has also built parts with equally screwed-up timers, particularly TSC clock skew on multi-cores. Timers are just messed up on x86 from either company, and this nonsense goes back years. There are now at least four distinct general-purpose clock sources that must be present on modern systems: tsc, acpi_pm, hpet and pit (as labeled by the Linux kernel). There will probably be further proliferation in the future, as ALL of the existing timers are inadequate in subtle ways. Implementations from both manufacturers have been plagued with bugs that require nasty workarounds; google "clocksource tsc unstable", "pm-timer bug" or "athlon x2 tsc" for some examples. This nonsense that Microsoft has stumbled upon is just the latest in a long and colorful history of failure that we'll now have to add to the list.
Computers are supposed to keep time. Today that means high-resolution clocks that work correctly regardless of power saving, concurrency, etc. Using these crucial timers is not supposed to cause spurious interrupts, bus contention or other subtle problems. People who must work with this stuff are thoroughly fed up with this ever-growing pile of half-baked bullshit.
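The clock sources named above can be inspected on a running Linux system through sysfs, in `/sys/devices/system/clocksource/clocksource0/available_clocksource` and `current_clocksource`. This is a minimal C sketch assuming Linux with sysfs mounted; the helper names `read_clocksource_attr` and `clocksource_listed` are hypothetical. The whole-word match matters because a plain substring search would report `hp` as present in a list that only contains `hpet`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Linux sysfs directory that exposes the clocksource selection. */
#define CS_DIR "/sys/devices/system/clocksource/clocksource0/"

/* Read one attribute (e.g. "current_clocksource") into buf and strip the
   trailing newline. Returns 0 on success, -1 if the file is missing
   (non-Linux system, sysfs not mounted) or empty. */
int read_clocksource_attr(const char *attr, char *buf, size_t len)
{
    char path[256];
    assert(len > 0);
    snprintf(path, sizeof path, CS_DIR "%s", attr);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Return 1 if `name` appears as a whole word in the space-separated list
   `list` (the format of available_clocksource), otherwise 0. */
int clocksource_listed(const char *list, const char *name)
{
    size_t n = strlen(name);
    if (n == 0)
        return 0;
    for (const char *p = strstr(list, name); p != NULL; p = strstr(p + 1, name)) {
        int starts = (p == list) || p[-1] == ' ';
        char end = p[n];
        int ends = (end == '\0' || end == ' ' || end == '\n');
        if (starts && ends)
            return 1;
    }
    return 0;
}
```

For example, read `available_clocksource` into a buffer and call `clocksource_listed(buf, "hpet")`. On a machine where the kernel has marked the TSC unstable, `current_clocksource` will typically no longer read `tsc`.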
It's the equivalent of writing a program against the Windows API, not testing it, and calling the API buggy when you find that it is failing in the wild.
The API may not match the spec perfectly, but it's your software that's buggy.
Intel can revise the proc, or revise the spec to be in agreement.
MS is trying to use an APIC interrupt for timing that isn't normally used for that purpose.
It's the equivalent of attaching an alarm clock to your electric car's engine, and complaining when the idling speed deviates due to a power saving feature.
Nehalem processors were out long before 2008 R2 or the newest Hyper-V release.
Intel Nehalem is a processor with features very attractive to users of virtualization; it's one of the most common processors used in new server deployments.
There is absolutely no excuse for MS not extensively testing and qualifying (including stress-testing) their hypervisor on Nehalem CPUs before releasing the code.
It would be like you or me writing a piece of desktop software today (in 2009), designed for use with Windows, and extensively testing it on Windows '98 and XP, but not discovering a frequent crash on Vista that almost always occurs right after the program starts.
Folks, this is a very irresponsible headline on Slashdot. The Microsoft article does NOT say the hotfix breaks power saving, and it doesn't even mention turbo; it says the fixes are an either/or choice. Microsoft always offers workarounds as an ALTERNATIVE to the hotfix for people who don't want to apply hotfixes. The Microsoft KB article even tells you that if you want to keep using those power states, you should apply the hotfix and make a certain modification to the registry.
This post makes it sound like some kind of cover-up, that the fix causes major CPU slowdowns, and that it's on the level of the AMD Barcelona TLB bug, where the fix actually did cause a significant performance drop. That does not appear to be true. The real story is that all CPUs have hundreds of errata, and it's the job of the software maker to work around them, which is what Microsoft is doing with their hotfix and registry hack. They're also telling you that if you aren't experiencing any problems, you shouldn't bother applying the hotfix.
I didn't see a link to the KB article in question. I assume this is the one: http://support.microsoft.com/kb/975530