Sandy Bridge Chipset Shipments Halted Due To Bug
J. Dzhugashvili writes "Early adopters of Intel's new Sandy Bridge processors, beware. Intel has discovered a flaw in the 6-series chipsets that accompany the new processors. The flaw causes Serial ATA performance to 'degrade over time' in 'some cases.' Although Intel claims 'relatively few' customers are affected, it has stopped shipments of these chipsets and started making a revised version of the silicon, which won't be ready until late February. Intel expects to lose $300 million in revenue because of the problem, and it's bracing for repair and replacement costs of $700 million."
How many heads will roll?
I've been anxiously waiting for Dell to release its Mobile Precision Workstation with a Sandy Bridge processor.
I had been cursing Dell for their slowness, but I guess it was a blessing.
Thank you for the "scare quotes." I wasn't sure what slant I should have "read into" this "summary."
They patented slowly degrading performance over time many years ago. It's a key feature built into Windows.
...it looks like they learned from the Pentium FP fiasco and are handling this one correctly.
make imaginary.friends COUNT=100 VISIBLE=false
Isn't that the same line they fed the public about the Pentium FP bug?
I don't recall seeing any complaints online about degraded SATA performance, so it looks like Intel caught this internally and took the appropriate action before the issue became widespread in the wild. The bug sucks but it just goes to show how difficult it can be to test complex hardware under all situations. Kudos to Intel for being proactive... they have learned from the FDIV bug fiasco, and some other companies with fruity logos might learn from the example.
AntiFA: An abbreviation for Anti First Amendment.
At least I don't have to prove I _need_ high speed SATA performance to get a replacement... clearly SATA is more important than _DIVISION_...
The more I learn about Windows the more I am surprised it runs at all
I RTFA and for the life of me I can't figure out whether it's "the longer the uptime, the worse the degradation, and a reboot starts the process over" or "you will use this and it will get worse and worse until the chip burns out."
I hope to god it's the first one... If not, this might beat the floating point error by a mile!
Obviously, silicon bugs happen, barely anything makes it out of the fab without an 'errata' list as long as your leg; but the "may gradually degrade over time" part kind of freaks me out.
If it were a "due to a design error, setting register xyz to 0xDEADBEEF causes Serious Badness, chipset drivers are being patched to Never Do That on rev.1 chipsets and future chipsets will be amended" situation, that would be unfortunate; but so it goes. Fully deterministic errors, like the classic division bug, may be problematic; in some cases bad enough to qualify the product as just plain defective; but once known they can be mitigated by not stepping on them. Something that "sometimes" "gradually decreases" performance, on a bus with error correction, though, sounds a lot like a physical problem where some sort of silicon/electrical issue causes error rates to increase and thus retries/corrections to increase in frequency, and user-visible performance to go down. That makes me nervous. It sounds less like a deterministic-error problem and more like a "certain physical components are actually degrading much faster than expected" problem...
Can anybody think of an explanation for how a hardware bug would cause behavior that gradually changes over time (in a manner that couldn't be dealt with by a driver update) that doesn't involve the alarming possibility of gradually increasing error rates and/or early death of onboard SATA ports?
Seems to me they had issues the last time they rushed a product to beat AMD, as well. Ghosts of the i820 MTH fiasco.
Apparently the problem is with SATA ports 2-5, at least for mobile motherboards. Every desktop board is affected.
>>>I'm still running a dual-core athlon 64. Processor/memory upgrades became overrated a few years ago.
Agreed. I am still running a Pentium 4 at 3000 megahertz. When I experienced slowdown, I just doubled the memory and that eliminated the main problem (hard drive/virtual memory swapping). The only thing my P4 doesn't do is HD video, but I'm okay with that since my 700k connection doesn't do HD either.
Now my Pentium 3(?) 700 MHz laptop is long in the tooth, and often runs too slow for my taste, but it is just a laptop. I don't use it much except for travel.
As for Intel:
A 1 billion dollar loss is a major suck. I doubt it will end up costing that much, though. When the original Pentium developed a floating-point bug, most users did not ask for a replacement because it was not something they needed. That helped Intel save $$$ and probably the same will happen with this chipset too.
Information wants to be expensive AND wants to be free. So you have Value vs. Cheap distribution fighting each other.
Depends on your workload. I'm typing this on a still-just-as-adequate-as-when-I-bought-it A64, plays games and everything; but when I put on my work hat, the fact that we can get more VMs into the same physical volume and power consumption with every generation (and, for annoyingly expensive software that is licensed per-socket, get substantially improved performance for peanuts hardware money) is reason to cheer...
Should have been article title.
Well, I guess this vindicates my decision to stick with MFM hard disks.
Please read my Canon EOS tech blog at http://www.everyothershot.com
The systems with the affected support chips have only been shipping since January 9th and the company believes that relatively few consumers are impacted by this issue.
Important details about shipment date lost in transcription.
Could they pretty please make a change that allows me to use the new H.264 encoding instructions without being forced to rely upon their nice but not nice enough video display capabilities? I'd LOVE to use the encoder speedups but if I'm forced to use their CPU as my GPU I may be forced to skip it. Everything I've read says that this is what I'll be forced to do - YUCK!
Build it, Drive it, Improve it! Hybridz.org
...as you write the spec: $1.00
...after you ship a few: $1,000,000,000.00
I would have thought Intel would consider that to be a feature. Certainly it seems to describe every system I've worked on for the past two decades...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
I'd been exclusively building and using AMD systems for my past 2-3 machine builds, but I just built a new computer over the weekend. Just my luck it was a Sandy Bridge CPU with a P67 chipset motherboard. I suppose I'll go fill out my registration for the motherboard tonight and wait for Gigabyte to contact me regarding a recall.
If there's anything more important than my ego around here, I want it caught and shot immediately.
I'm still running a dual-core athlon 64. Processor/memory upgrades became overrated a few years ago.
I'm running a four core Phenom II black edition at home. Basically dirt cheap. I don't call that overrated at all, far from it. Oh, and 4 Gig, all of which I happily use. Actually that amount of memory will look small in the future. Say 6 months from now.
Have you got your LWN subscription yet?
Damn that Will I Am!
"For every expert, there is an equal and opposite expert"
$700 million? Talk about taking a hit! Someone's getting fired.
Alright, competitors, time to shine! Let's go get... uh... guys? ... you there?
It's always confirmation bias!
I wonder how AMD will be taking Intel's total loss of a billion dollars?
VIA had a chipset bug on their old KT-series motherboard bridge chips that would lock up the machine if a certain sequence of bytes and a couple of signals on the ATA bus interface hit simultaneously. That condition was legal (if rare) as far as the ATA bus spec was concerned and shouldn't have caused the lockup, but it did. It was one of the conditions we had to insert escape code for when building optical (CD/DVD) drives otherwise people who bought our drives would bitch at us when the inevitable lockups happened. All the other manufacturers of IDE-bus devices did the same sort of workarounds and VIA did eventually fix the bug but it still left millions of motherboards out there with the problem chips on them.
OTOH we caught the VHDL bug that occasionally switched off the DRAM refresh controller in the testing lab before the design got sent to production, a relief for everyone concerned.
What's the big appeal of Sandy Bridge anyway ? I still haven't figured out where it fits in the market... mind you, I type this on my dual nehalem, which is still king of the mountain after a year, so I really don't get what the fuss is about. Is Sandy Bridge significantly faster than the original i3/i5 cop-outs ? Or is this a mythical "bang for the buck" platform where everything costs twice as much as AMD ?
I've been building a lot of systems, and Intel dominates the high end, but in my view they haven't sold a decent value processor since the E2xx0 Core 2's. In the desktop market there's really just 3 segments that matter: sub-$500, 500 to 1000, and balls-to-the-wall nutjobs like myself, and AMD has the bottom two tiers in a fierce headlock.
-Billco, Fnarg.com
AMD HAS SATA 6Gbps on all ports. Intel does not, and now they can't even get SATA right?
'relatively few' people are affected, yet it's gonna cost Intel $700 million in replacements ?
Either these are very expensive replacements, or Intel has a different idea on how many 'few' are compared to the rest of us.
So many companies would still be denying having an issue. I say Kudos to them for owning up and acting fast. Everybody makes mistakes, but few take responsibility.
(If at first you don't succeed, do it different next time!)
So the Mini, MacBook, and MacBook Pros under $1800 will be stuck on Core 2 for most of 2011?
"Slow performance degradation over time" on SATA controllers? Who wants to bet this is due to some "misapplied security" scheme such as DRM or something related to the TPM?
-- Sig down
Any delay to Intel brings AMD's release of their new Bulldozer architecture a little bit closer to Sandy Bridge. Things will be interesting for the CPU market in 2011 to say the least.
simple, fast homepage with your links: http://www.ngumbi.com/
If this was Apple, they'd just say you're using it wrong.
What do I know, I'm just an idiot, right?
What is interesting is that the 2 SATA 6Gbps ports that the Intel boards have are the ones unaffected by this! The problem is only with the other 4 ports. I bet they were thinking that going with mostly the ol' 3Gbit ports would be safe and save some money... Woops!
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
If anyone remembers the i820 MTH (DRAM to RAMBUS translator Hub) fiasco, this is it again. Basically, what happened back then is that as soon as the drivers (INF files) were installed, the chipset experienced a high error rate over time. With the i820 MTH, "over time" was roughly one hour once affected. Initial builds of such systems seemed to have no problems; it wasn't until the customer kept calling in about the same problem that it was discovered to be a chipset issue.
I'm glad that Intel caught this early, but it's very likely that affected parts from 3rd party motherboard vendors already went out, so I'd avoid buying anything with Sandy Bridge until late 2011.
Will.I.Am, at his first public Intel press event since being hired, was quoted saying "The problem with the Sandy Bridge Chipset seems to be the dirty bit. BZZZZT BOOM BOOM.... BZZZZT BZZZT BZZZT BOOM BOOOM...." The rest of his comments weren't heard by anyone at the event due to the sudden loud and obnoxious music blaring from all corners...
using System.Awesome;
There have been quite a few products worth spending money on in the last few years if your workload consists of applications that are highly multithreaded and/or use a large amount of memory. Servers especially have gotten phenomenally better since three years ago with Intel retiring their aged, performance-sapping FSB architecture for an IMC + point-to-point link architecture, Intel getting rid of high-latency and hot FB-DIMMs for normal DDR3, AMD releasing the MCM Opteron 6100s with 2-3 times the number of cores per socket as they had in early 2008, and the adoption of 2 Gbit DIMMs. Low-power/embedded users also are in luck as products like AMD's Fusion APUs and dual-core 1 GHz ARM Cortex A9s are much better than the early 945GC + single-core Atoms, 130 nm AMD Geodes, or ~600 MHz single-core ARM CPUs available three years ago. Also, the rise of NAND SSDs has done a lot to increase computer performance over the last three years as well. Other than that, most desktop CPUs and GPUs aren't phenomenally better or faster than those around three years ago. There has been some improvement, but nothing like what we saw during any three years of the 1990s.
Just "gittin-r-done," day after day.
Maybe power savings and the costs associated with running high-power machines are not a factor for you, but processor/memory upgrades are very often not overrated. It just depends on what you call an upgrade. The last 5 machines I have bought have more than paid for themselves in power savings. They happen to be faster, but the biggest upgrade was in moving from an average power usage of ~180 watts to an average power usage of ~40 watts, as well as improving the machine's ability to go into standby.
By "over time" they mean that every time a set of circumstances crop up for the bug to manifest, roll some dice. Eventually you'll get snake eyes and the bug will bite you. From this article:
"On its conference call to discuss the issue, Intel told me that it hasn’t been made aware of a single failure seen by end users. Intel expects that over 3 years of use it would see a failure rate of approximately 5 - 15% depending on usage model. Remember this problem isn’t a functional issue but rather one of those nasty statistical issues, so by nature it should take time to show up in large numbers (at the same time there should still be some very isolated incidents of failure early on)."
So it's not like the chip is dissolving or some such.
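To put the "roll some dice" picture into numbers, here is a rough sketch. It assumes a constant, independent chance of failure in each period, which is my simplification and not anything Intel has stated (real wear-out usually gets more likely with age). The function names are made up for illustration.

```python
def per_period_failure_probability(cumulative: float, periods: int) -> float:
    """Constant per-period failure probability p that satisfies
    1 - (1 - p) ** periods == cumulative.

    Assumes every period is an independent roll with the same odds.
    """
    if not 0.0 <= cumulative < 1.0:
        raise ValueError("cumulative must be in [0, 1)")
    if periods <= 0:
        raise ValueError("periods must be positive")
    return 1.0 - (1.0 - cumulative) ** (1.0 / periods)


def cumulative_failure_probability(p: float, periods: int) -> float:
    """Chance of at least one failure after `periods` independent rolls."""
    return 1.0 - (1.0 - p) ** periods


if __name__ == "__main__":
    months = 36  # the three-year window quoted above
    for cumulative in (0.05, 0.15):
        p = per_period_failure_probability(cumulative, months)
        print(f"{cumulative:.0%} over {months} months -> about {p:.3%} per month")
```

Under that assumption the monthly odds come out to a fraction of a percent, which fits with nobody having seen a failure in the first few weeks of shipments.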
On the good news front, from this article:
"If you've already built a Sandy Bridge system, fortunately, there are some obvious workarounds available. Most enthusiast-class motherboards these days ship with extra SATA ports driven by auxiliary SATA controller chips from third-party suppliers like Marvell, and those ports aren't at risk for this problem. As we've noted, the two 6Gbps SATA ports on the 6-series chipset aren't, either. For a great many users, sidestepping this problem should be as simple as moving their storage device connections to the other ports. Given the relatively strong performance that we've seen out of Intel's SATA 6Gbps controller, we'd recommending attaching any fast, primary storage devices like SSDs or 7,200-RPM drives to the 6Gbps SATA ports if possible. Other drives, like large and slow-rotating HDDs, should be fine on the third-party controllers. Just be careful to ensure that you have all the right drivers installed and the boot order in the BIOS set correctly before making the move, so you don't cause yourself the headache of an unbootable system."
So it's not the huge deal that it seemed to be at first. Your 6Gbps ports are fine. It's your 3Gbps ports that are pooched. But if your board has a secondary controller like the Marvell controller, just move your drives to those ports (or plunk down $20 and get a PCIe SATA board) and Bob's your uncle.
That being said though - dammit. I JUST ordered one of these boards last night from Newegg. I've always been an AMD fan, but I figured just this once I'd try Intel since they've been making some really great cpus lately. Haven't upgraded in five years and BANG - this hits.
If you'd like to make some quick cash, go to Vegas and place a few bets. Then have me root for the team you'd like to lose.
Weaselmancer
Ridiculous.
My educated guess is that the SATA Input/Output Pads have a digital timing compensation circuit that tries to center the data sampling window (e.g., the clock edge where data is sampled). Since the appropriate data sampling window that won't cause a setup/hold violation changes with process variation and temperature it needs to have lots of potential settings in a large window and may need automatic tracking.
Probably someone didn't design that window large enough to center the data sampling timing offset (or the step size isn't small enough, or the auto-adjustment circuit that tracks temperature and adjusts the window appropriately has an algorithmic flaw in some cases, etc.). It might be okay now (in early production tests), but as the part ages, the required data sampling window can shift significantly, and if the chip can't adjust the data sampling window appropriately, then data errors are inevitable.
As a silly example, let's say a hw engineer put in a clock trim circuit that could adjust +-100ps in steps of 10ps. No driver update can make that adjustment -110ps.
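The same silly example as a few lines of Python, purely as an illustration. The +-100ps range and 10ps step come from the example above; the function name and the rounding behaviour are assumptions of this sketch, not a description of any real Intel circuit.

```python
def best_trim_ps(required_offset_ps: float,
                 max_trim_ps: float = 100.0,
                 step_ps: float = 10.0) -> tuple[float, float]:
    """Return (applied_trim_ps, residual_error_ps) for a hypothetical
    clock trim circuit that only supports multiples of step_ps within
    +-max_trim_ps.
    """
    # Nearest step the hardware could select if its range were unlimited.
    steps = round(required_offset_ps / step_ps)
    # Clamp to what the circuit physically offers.
    max_steps = int(max_trim_ps // step_ps)
    steps = max(-max_steps, min(max_steps, steps))
    applied = steps * step_ps
    return applied, required_offset_ps - applied


if __name__ == "__main__":
    for needed in (37.0, -100.0, -110.0):
        applied, residual = best_trim_ps(needed)
        print(f"need {needed:+.0f} ps -> applied {applied:+.0f} ps, residual {residual:+.0f} ps")
```

Once the required offset drifts past the end of the range, the residual is whatever the hardware cannot reach, and no software can change that limit.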
Conversely, if the hw control algorithm that tracks temperature and adjusts the window has a positive temperature coefficient over time (say it gets slower), but the actual I/O circuit has a negative coefficient over time (say it gets faster), after a while that feedback algorithm may become unstable. That might not be fixable with a driver update either (if the control algorithm is in hw).
Of course, I have no real information, but it's my guess, having designed high-speed I/Os in the past...
Yes, I am an ASIC designer. Transistor design varies wildly based on the process type but below is what can happen for a low-power CMOS process. Also SATA uses clock recovery and there are many things that can go wrong there.
One case that can cause reliability issues is metastability. If you're not familiar, metastability is what happens when more than one input of a register changes within a certain time interval. The usual case is the clock input arriving too close to a change on the data input (called a setup violation), but it can happen with reset pins as well. The internal circuit of the flop isn't designed to handle these cases, so the output voltage is not forced to either low or high, but hovers somewhere in between for a time before ultimately drifting one way or the other. Now if your clock and data are such that this case happens every clock, the output may in fact never settle and stay in an undetermined state.
Besides the logical effect, there is a reliability effect as well. When the flop output is between high and low, the feedback circuit inside causes a direct short to ground that lasts as long as the output is unresolved. This can cause hot-electron effects and electromigration; local melting may also be possible, but that typically causes a hard failure. Both of these cause electrical property changes which can be read about here.
I should point out that some designs have special registers that can handle metastability. They do this by increasing the register size about 10 times so that the transient current during metastability can be handled reliably.
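For a feel of how rare (or not) metastability events are, designers commonly use the textbook synchronizer estimate MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data). A minimal sketch follows. Every number in it is invented for illustration and has nothing to do with the actual 6-series silicon.

```python
import math


def synchronizer_mtbf_seconds(resolve_time_s: float,
                              tau_s: float,
                              window_s: float,
                              f_clk_hz: float,
                              f_data_hz: float) -> float:
    """Textbook mean time between metastability failures.

    resolve_time_s: time the flop is given to settle before its output is used
    tau_s:          settling time constant of the flop (process dependent)
    window_s:       width of the vulnerable window around the clock edge
    f_clk_hz:       sampling clock frequency
    f_data_hz:      average rate of asynchronous data transitions
    """
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


if __name__ == "__main__":
    # Illustrative values only (assumptions, not measured data).
    for resolve_ns in (0.5, 1.0, 2.0):
        mtbf = synchronizer_mtbf_seconds(resolve_ns * 1e-9, 50e-12, 20e-12, 1e9, 1e8)
        print(f"resolve time {resolve_ns} ns -> MTBF about {mtbf:.3g} s")
```

The exponential term is the point: a little more settling time buys many orders of magnitude, and a little less costs just as much.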
Not that it matters to myself. I'm still running a dual-core athlon 64. Processor/memory upgrades became overrated a few years ago.
There has been nothing new worth spending money on, hardware wise, for the last 3 years at least.
Agreed, but 2011 changes that for some folks. I waited for Sandy Bridge instead of moving to AMD's six cores (or waiting for Fusion). The i5's Quick Sync decreased my video encode time for a feature length film down to just under 8 minutes at stock speeds.
Actually I'd be surprised if it didn't cost them every penny plus of that money. After all there is a pretty big difference between a certain math type being screwed up in edge corner cases and having to reboot daily like it is Win98 all over again just to keep your nice new SATA from running like a 66MHz IDE.
So if I was Intel I'd be figuring in on a one to one replacement of EVERY SB board out there, plus disposal costs.
ACs don't waste your time replying, your posts are never seen by me.
First of all, this should be in the summary: the first two 6G ports on the board are unaffected; it's only the 3G ports that are affected. Most users aren't even going to be using more than 2 ports. Secondly, most motherboards use third-party controllers for additional ports (mine uses 2 other controllers and has 6 non-Intel ports). If you have any of these, then unless you are using 10 drives in your computer it isn't going to be a problem: a simple cable switch to other ports and you are done.
Lastly, it is probably going to be much easier, cheaper, and faster to simply pop in a SATA controller card than getting an RMA.
I blame spaghetti.
http://images.encyclopediadramatica.com/images/0/00/Spagsouth01.jpg
Unfortunately it's not over time during a session. This problem is degradation over the life of the machine, with as high as 15% complete failure of the SATA 3Gb/s ports after three years.
http://en.wikipedia.org/wiki/Electromigration
It's a problem that becomes more and more of an issue as the feature size (nanometer figure) on chips gets smaller and smaller.
It is as far as I know the only way in which silicon chips degrade over time. It cannot be reversed in software.
Intel had this problem with their early Core 2 but I don't know that it ever became fully public like this. Intel released a patch to be placed in BIOS which changed a few settings in the chip to prevent the problem. This couldn't reverse it after it failed, but it could prevent or at least make it less likely if applied before the chip failed.
It is a design and/or chip process (production) problem and cannot be fixed in software after it happens.
I hope some of these recalled MB's become available at bargain basement prices. I'd happily buy one and use a PCIe SATA card if it could save me ~75% compared to the present MSRP.
The only thing my P4 doesn't do is HD video, but I'm okay with that since my 700k connection doesn't do HD either.
You can download videos overnight to watch in the morning, or buy them on bluray. I'm guessing you're not playing recent games either. Nothing wrong with either position of course, but I think you'll find it's not the industry that's changed to make it not worth upgrading, it's you getting older.
Now my Pentium 3(?) 700 MHz laptop is long in the tooth, and often runs too slow for my taste, but it is just a laptop. I don't use it much except for travel.
And you don't like to watch video or play games while travelling?
I am trolling
FWIW Intel's 2x SATA III (6 Gbps) ports are a lot faster than AMD's.
>>>older
Cheaper. I try to avoid spending money on upgrades when the computer (or car or TV) is still working. And yes I play videogames but usually on my Gamecube or PS2, rather than my laptop.
As for videos, youtube and hulu don't demand much processing power. The 700 MHz P3 can handle it just fine.
Too bad Intel's chips have limited PCIe lanes, and you need to go to the high-end i7 CPUs just to get more than 20 lanes plus the DMI bus (which has the speed of 4 PCIe lanes). For Sandy Bridge that may be a $400 CPU!
AMD lets you USE ANY CPU in an AM3 board, with a choice of chipsets that have better PCIe lane setups. 890FX has 38 + 4 SB link. 890GX and other 800s have 22 + 4 SB link.
790FX has 38 + 4 SB link lanes. 790X and 790GX have 22 + 4 SB link. 785G has 20 + 4 SB link. 785E has 22 + 4 SB link, and most of the other 700 ones have the same.
I think the AMD systems use a NEC USB 3 controller.. At least my Gigabyte 890FX does.
Gee, good thing Intel ran NVIDIA out of the chipset business. And that the feds didn't come down on Intel like a ton of bricks.
My old PC died yesterday night. Today I did all the research and bought a sandy bridge PC while I was at work, instead of slacking and reading slashdot- so I missed this. Damn it! I hope cyberpowerpc lets you cancel orders . . .
Remember the FDIV bug? For certain floating point divisions, the original Pentium gave results that were slightly incorrect. Now this really didn't matter to most people. This was back in the day when floating point was not used a lot (many 486s had no FPU at all), the error only happened on certain calculations, and it wasn't a large error. So realistically, most people would have been fine with a defective chip, never would have caused a problem.
Intel reasoned along those lines and thus decided that to get a replacement chip, you'd need to demonstrate some kind of need for it. Show them you were doing something where it'd actually cause problems, not just that you wanted a new one...
BIG mistake. People were extremely pissed off at that. It was a major PR disaster for Intel and in the end they offered replacements to all chips, so they had to spend the money anyhow, AND they'd already gotten egg on their face.
This time they are making sure it isn't a problem. They are replacing things, even if it wouldn't really matter, even if a SATA card would bypass the problem. They are keeping people happy and avoiding bad PR/lawsuits.
Sounds like Intel fessed up yesterday and stated it was a problem with a bias circuit in the PLL clocking tree. The bias circuit apparently caused a transistor to remain in a high-leakage state (which over time will induce a failure mode). What makes it silly is that Intel is apparently saying this circuit wasn't in the initial designs, was added later, isn't actually needed, and will be disabled in the future... Back to the future!
It's not a bug at all. It's a design fault, and a purely hardware one at that. (It's *not* a bug in software and it's not a bug in firmware.) Also, performance does not degrade over time; the ports themselves degrade, not performance. The cause is one particular transistor in the 3Gbps clocking circuits that has very thin gate oxide and is accidentally driven at a higher bias voltage (due to a fault). This gate oxide breaks down over time, leaking more and more current.
Thus, performance does not degrade (RTFA: it doesn't talk about performance degradation, only port degradation), but apparently error rates do increase over time. Eventually the port itself becomes non-functional.
Some people are going to be stuck with these boards with no way of knowing when they will die. But the problem is progressive and, in theory, detectable. So a service that periodically checks the error rate of the affected SATA ports and plots their degradation would be very helpful. You may have to do a bandwidth test to find this if there is no low-level way, and since testing may itself make the problem worse, you would have to keep the test short. It would also be useful for confirming whether your board has this issue or some other unrelated SATA issue.
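A rough sketch of the trend-checking half of such a service. It assumes you already have some way to read a cumulative link error count for a port (for example a drive's SMART interface CRC error counter, if it exposes one; that is an assumption, and the readout itself is deliberately left out). The function names and the sample numbers are made up.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (hours since first reading, cumulative error count)


def interval_rates(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn cumulative counts into (interval midpoint, errors per hour) points."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("sample times must be strictly increasing")
        rates.append(((t0 + t1) / 2.0, (c1 - c0) / (t1 - t0)))
    return rates


def slope(points: List[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_degrading(samples: List[Sample], threshold: float = 0.0) -> bool:
    """True if the error rate itself is trending upward faster than threshold."""
    return slope(interval_rates(samples)) > threshold


if __name__ == "__main__":
    readings = [(0, 0), (10, 1), (20, 3), (30, 7)]  # invented numbers
    print("error rate rising:", is_degrading(readings))
```

A steady trickle of errors gives a flat rate and no alarm; it is the rate climbing from one reading to the next that would point at a port on its way out.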
Another option to help is to write a modified SATA driver with a performance-to-reliability slide bar. In performance mode it runs like normal, and as you slide it toward reliability it adds delays in accessing the SATA devices, so that sudden high transfer rate requests are less likely to damage the transistor.