Slashdot Mirror


Sandy Bridge Chipset Shipments Halted Due To Bug

J. Dzhugashvili writes "Early adopters of Intel's new Sandy Bridge processors, beware. Intel has discovered a flaw in the 6-series chipsets that accompany the new processors. The flaw causes Serial ATA performance to 'degrade over time' in 'some cases.' Although Intel claims 'relatively few' customers are affected, it has stopped shipments of these chipsets and started making a revised version of the silicon, which won't be ready until late February. Intel expects to lose $300 million in revenue because of the problem, and it's bracing for repair and replacement costs of $700 million."

39 of 212 comments

  1. Turns out they violated a Microsoft Patent by Anonymous Coward · · Score: 4, Funny

    They patented slowly degrading performance over time many years ago. It's a key feature built into Windows.

  2. Intel caught this one first? by CajunArson · · Score: 4, Insightful

    I don't recall seeing any complaints online about degraded SATA performance, so it looks like Intel caught this internally and took the appropriate action before the issue became widespread in the wild. The bug sucks but it just goes to show how difficult it can be to test complex hardware under all situations. Kudos to Intel for being proactive... they have learned from the FDIV bug fiasco, and some other companies with fruity logos might learn from the example.

    --
    AntiFA: An abbreviation for Anti First Amendment.
    1. Re:Intel caught this one first? by alen · · Score: 2

      there was a rumor that new MacBook Pro's were going to be released tomorrow. if true it could have been Apple QA catching this at the last minute

    2. Re:Intel caught this one first? by Anonymous Coward · · Score: 2, Interesting

      If it's released tomorrow then Apple QA (early adopters) haven't received the new MacBook Pro yet.

    3. Re:Intel caught this one first? by petermgreen · · Score: 3, Interesting

      Someone else linked to a post that claims it only affects ports 2-5 (the 3Gbps ports), not ports 0-1 (the 6Gbps ports). Most systems won't be stressing ports 2-5 that heavily.

      Plus if this is indeed a gradual degradation issue it may be that most people simply haven't been using the systems long enough for it to become noticeable yet.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
    4. Re:Intel caught this one first? by Ecuador · · Score: 5, Informative

      According to Anand's coverage, Intel said that they started getting customer complaints after they had shipped about 100k units, and their engineers managed to duplicate the problem early last week, the cause of which they figured out in a couple of days.

      Source : http://www.anandtech.com/show/4142/intel-discovers-bug-in-6series-chipset-begins-recall

      --
      Violence is the last refuge of the incompetent. Polar Scope Align for iOS
    5. Re:Intel caught this one first? by alvinrod · · Score: 2

      Yeah, but there's always a rumor about new MacBook Pro's being released tomorrow.

      If there wasn't, some rumor site is probably going to source our posts as possible evidence of a release tomorrow.

    6. Re:Intel caught this one first? by Zebedeu · · Score: 2

      Yes, and it was Steve Jobs who personally found and reported the bug to Intel while testing the new MacBook model.

      In fact, he's not even sick, he just needed time to concentrate on finding out why his beloved new mac model was behaving weirdly.

      Look for news of his holy return tomorrow.

    7. Re:Intel caught this one first? by vivek7006 · · Score: 2

      Mod Parent up. Steve Jobs just replied to my email from his iPad confirming exactly what the parent just said!

  3. Bracing for costs? this is an improvement... by maroonhat · · Score: 2

    At least I don't have to prove I _need_ high speed SATA performance to get a replacement... clearly SATA is more important than _DIVISION_...

    --
    The more I learn about Windows the more I am surprised it runs at all
  4. Over time? by Caviller · · Score: 3, Interesting

    I RTFA and I for the life of me can't figure out if it's a "The longer the uptime the worse the degrading...and a reboot will start the process over?" or "You will use this and it will get worse and worse until the chip burns out..."

    I hope to god it's the first one...If not this might beat the floating point error by a mile!

    1. Re:Over time? by gstrickler · · Score: 2

      I can't imagine a hardware bug that would manifest only as degraded performance after extended uptime. Anything of that nature could probably be worked around with a software fix that periodically reset the controller. Therefore, I think it's safe to assume it's literally the SATA logic degrading with age, which would require a chip level change.

      --
      make imaginary.friends COUNT=100 VISIBLE=false
    2. Re:Over time? by LWATCDR · · Score: 2

      Because you paid for X SATA connectors and/or you do not want to waste a slot and/or you have put the motherboard in a 1U rack-mount case or a slim HTPC case.
      And of course it could be in a laptop.

      It is broken so Intel is going to do the right thing and fix it. This is a good thing.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    3. Re:Over time? by ShnowDoggie · · Score: 3, Informative

      The problem in the chipset was traced back to a transistor in the 3Gbps PLL clocking tree. The aforementioned transistor has a very thin gate oxide, which allows you to turn it on with a very low voltage. Unfortunately in this case Intel biased the transistor with too high of a voltage, resulting in higher than expected leakage current. Depending on the physical characteristics of the transistor the leakage current here can increase over time which can ultimately result in this failure on the 3Gbps ports.

      ~http://www.anandtech.com/show/4143/the-source-of-intels-cougar-point-sata-bug

  5. Does anybody else think this sounds ominous? by fuzzyfuzzyfungus · · Score: 3, Interesting

    Obviously, silicon bugs happen, barely anything makes it out of the fab without an 'errata' list as long as your leg; but the "may gradually degrade over time" part kind of freaks me out.

    If it were a "due to a design error, setting register xyz to 0xDEADBEEF causes Serious Badness, chipset drivers are being patched to Never Do That on rev.1 chipsets and future chipsets will be amended" that would be unfortunate; but so it goes. Fully deterministic errors, like the classic division bug, may be problematic; in some cases bad enough to qualify the product as just plain defective; but once known they can be mitigated by not stepping on them. Something that "sometimes" "gradually decreases" performance, on a bus with error correction, though, sounds a lot like a physical problem where some sort of silicon/electrical issue causes error rates to increase and thus retries/corrections to increase in frequency, and user-visible performance to go down. That makes me nervous. It sounds less like a deterministic error problem and more like a certain physical components are actually degrading much faster than expected problem...

    Can anybody think of an explanation for how a hardware bug would cause behavior that gradually changes over time(in a manner that couldn't be dealt with with a driver update) that doesn't involve the alarming possibility of gradually increasing error rates and/or early death of onboard SATA ports?
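
The link between rising error rates and falling user-visible performance can be sketched with a toy model. This is only an illustration under stated assumptions: every corrupted frame is simply resent until it gets through, errors are independent, and the 300 MB/s figure is the nominal payload rate of a 3Gbps SATA link. The function name and the error rates are hypothetical and do not come from Intel.

```python
def effective_throughput(link_mb_s: float, frame_error_rate: float) -> float:
    """Expected useful throughput when each corrupted frame is resent.

    Assumption: errors are independent, so the number of transmissions per
    frame is geometric with mean 1 / (1 - p). Useful throughput therefore
    scales by (1 - p).
    """
    if not 0.0 <= frame_error_rate < 1.0:
        raise ValueError("frame_error_rate must be in [0, 1)")
    return link_mb_s * (1.0 - frame_error_rate)


# Hypothetical error rates, chosen only to show the trend.
for p in (0.0, 0.01, 0.1, 0.5):
    print(f"frame error rate {p:4.2f}: {effective_throughput(300.0, p):6.1f} MB/s")
```

In this model nothing fails outright; the port just gets slower as the error rate creeps up, which matches the "degrades over time" wording.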

    1. Re:Does anybody else think this sounds ominous? by pclminion · · Score: 5, Interesting

      It sounds less like a deterministic error problem and more like a certain physical components are actually degrading much faster than expected problem...

      Well, obviously. Could be a wire that was made too thin, or some component that overheats and slowly damages itself. I'm not sure why you think it's "ominous" though. It's a physical object that apparently has a design defect that causes it to wear out. I've seen ominous things before, but that usually involves a shadowy figure standing in a doorway with something that looks oddly like a machete, but dammit I can't really see clearly in this low light...

  6. Some additional information by dc29A · · Score: 3, Informative

    Apparently the problem is with SATA ports 2-5, at least for mobile motherboards. Every desktop board is affected.

  7. Re:Sucks to be them! by fuzzyfuzzyfungus · · Score: 2

    Depends on your workload. I'm typing this on a still-just-as-adequate-as-when-I-bought-it A64, plays games and everything; but when I put on my work hat, the fact that we can get more VMs into the same physical volume and power consumption with every generation(and, for annoyingly expensive software that is licensed per-socket, get substantially improved performance for peanuts hardware money) is reason to cheer...

  8. Sandy Bridge Closed Due to Erosion by BondGamer · · Score: 5, Funny

    Should have been article title.

  9. SATA early adoptors by jolyonr · · Score: 4, Funny

    Well, I guess this vindicates my decision to stick with MFM hard disks.

    --


    Please read my Canon EOS tech blog at http://www.everyothershot.com
    1. Re:SATA early adoptors by Skater · · Score: 2

      I actually was a SATA "early adopter" back in early 2004 when I bought a computer. Had taken a "computer sabbatical" for several years in hawaii and was just purchasing a new one finally. Configured it without a floppy disk, which I was glad to never have to use again. Then a few months later when good ole' Windows XP takes a crap on me, I goto reinstall it, and XP requires a floppy drive to install the drivers for my sata drive... No thumb drive, burnt CD, nothing else, just a fucking floppy which I purposely left off my system build. f micorosft. fml

      I had a similar problem recently. First, I'm surprised you haven't gotten flamed yet like I did when I mentioned it. I said that I was having trouble getting it to work, but Linux was working fine on the machine, and I wasn't missing Windows XP - apparently that's worthy of flames about how I was stupid for trying to use an 8 year old OS on a new machine and I'd have as much trouble with Linux, etc.

      Anyway, it required slipstreaming XP on to a new DVD then reinstalling using that. It's not so bad as long as you have another machine on which to do it (I had my old laptop that I dug out of the closet for this purpose). You can mess with a bunch of options, but the first time I did it, I got a really wacky system setup without things like sound, so I recommend going with the normal settings. Just put SP3 and your drivers on the disc, and configure everything the old-fashioned way.

    2. Re:SATA early adoptors by sxeraverx · · Score: 2

      Successful troll is successful.

  10. Not all chipset affected by Anonymous Coward · · Score: 2, Interesting

    The systems with the affected support chips have only been shipping since January 9th and the company believes that relatively few consumers are impacted by this issue.

    Important details about shipment date lost in transcription.

  11. Re:Given Intel's reaction... by fuzzyfuzzyfungus · · Score: 4, Insightful

    Or this one is much more serious... The Pentium FP one was a big issue because of how cagey Intel was about it(and was a genuine problem for users who had purchased it for certain FP heavy operations); but it was a deterministic logical bug: as long as you avoided a fairly specific set of trigger conditions, it would stay safely contained(for certain customers, doing so would likely be so onerous as to qualify as unacceptable; but for everybody else not so scary).

    What makes the hair on the back of my neck stand up about this one is the "may gradually degrade" stuff. That makes it sound much less like the "100% of people who do X get bitten/0% of others do" logical bugs and more like the "component degradation in the field can be unpredictable, except at a population level" type of bug that, say, happened to Nvidia not too long back...

  12. That's a Bug? by Greyfox · · Score: 2

    I would have thought Intel would consider that to be a feature. Certainly it seems to describe every system I've worked on for the past two decades...

    --

    I'm trying to teach myself to set people on fire with my mind... Is it hot in here?

  13. Re:First Stalin, now this. You Georgians, I swear. by Chemicles · · Score: 4, Informative

    Or, they could be actual quotes from the company's actual press release.

  14. Re:Given Intel's reaction... by Zocalo · · Score: 2

    Yeah, it does seem that way, which makes a change. Even though Intel has been here before, it's still good to see a company just 'fessing up and dealing with a mistake like this for a change - unlike Dell's blatant denials about their faulty motherboards and Apple's "Antenna Gate".

    It's such a shame that they didn't also learn from the much earlier lesson about building on a foundation of rock instead of sand. If only they'd gone with "Rocky Bridge"... :)

    --
    UNIX? They're not even circumcised! Savages!
  15. Why Sandy Bridge ? by billcopc · · Score: 2

    What's the big appeal of Sandy Bridge anyway ? I still haven't figured out where it fits in the market... mind you, I type this on my dual nehalem, which is still king of the mountain after a year, so I really don't get what the fuss is about. Is Sandy Bridge significantly faster than the original i3/i5 cop-outs ? Or is this a mythical "bang for the buck" platform where everything costs twice as much as AMD ?

    I've been building a lot of systems, and Intel dominates the high end, but in my view they haven't sold a decent value processor since the E2xx0 Core 2's. In the desktop market there's really just 3 segments that matter: sub-$500, 500 to 1000, and balls-to-the-wall nutjobs like myself, and AMD has the bottom two tiers in a fierce headlock.

    --
    -Billco, Fnarg.com
    1. Re:Why Sandy Bridge ? by Anonymous Showered · · Score: 3, Informative

      Sandy Bridge is the successor to Nehalem. It uses less power and is more efficient.

      The current P67 boards (LGA 1155) are for the mainstream market, e.g. Best Buy, Futureshop, Fry's, Staples, etc. They're basically "high-end" for the middle-class.

      Wait until LGA 2011 comes out (successor to 1366). You'll be thinking of switching then. :)

    2. Re:Why Sandy Bridge ? by PitaBred · · Score: 4, Informative

      Yes. Sandy Bridge i7-2600K CPUs are approaching the speeds of the i7-980X, while costing 1/3rd as much. You can build an insanely fast machine for under $1000 with Sandy Bridge, including graphics card.

    3. Re:Why Sandy Bridge ? by DoofusOfDeath · · Score: 3, Informative

      What's the big appeal of Sandy Bridge anyway ?

      For some of us (including me), the big deal is that Sandy Bridge adds a new set of instructions called "AVX" instructions, which let us do more floating-point operations at the same time. For some scientific apps this can nearly double the performance of the overall app.
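
A rough back-of-the-envelope sketch of where "nearly double" could come from: AVX registers are 256 bits wide versus 128 bits for SSE, so each instruction can carry twice as many floating-point values. How much of that reaches the whole app depends on how much of its runtime is vectorizable, which is estimated here with Amdahl's law. The 95% fraction is a made-up figure for illustration, and the function names are hypothetical.

```python
SSE_BITS = 128  # width of an SSE (XMM) register
AVX_BITS = 256  # width of an AVX (YMM) register


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of values of a given size that fit in one vector register."""
    return register_bits // element_bits


def amdahl_speedup(vector_fraction: float, vector_gain: float) -> float:
    """Overall speedup when only part of the runtime gets faster."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_gain)


for name, bits in (("float32", 32), ("float64", 64)):
    print(f"{name}: SSE {lanes(SSE_BITS, bits)} lanes, AVX {lanes(AVX_BITS, bits)} lanes")

# Assumption: 95% of runtime is vectorizable and AVX doubles that part.
print(f"overall speedup: {amdahl_speedup(0.95, 2.0):.2f}x")
```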

    4. Re:Why Sandy Bridge ? by Theovon · · Score: 3, Informative

      There are a number of really good articles on the advances in Sandy Bridge. For instance:

      http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937
      http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed

      To summarize some of the things I remember off the top of my head:

      The design is basically area-equivalent to the Nehalem designs, but they've made certain structures more space efficient to make room to enlarge others. For instance, they've made the branch predictor use fewer bits for the same prediction accuracy. This and other improvements have allowed them to enlarge critical structures that affect things like the instruction window size. The instruction window pertains to the number of decoded but not executed instructions outstanding. A larger instruction window allows you to (a) find more instruction-level parallelism because you're more likely to find independent instructions that can be executed simultaneously, and (b) absorb the effect of some high latency operations, like L2 cache misses -- you can effectively hide much of the latency by continuing to look for and perform unrelated work during the stall.

      In Nehalem and before, they had a structure that unified the reservation station, register file, and reorder buffer. Logically, this makes sense, but it also makes that area very power hungry, and you can never turn it off. In Sandy Bridge, they've split those structures, so they can be clock-gated separately. Also, instead of accumulating dependency results in the reservation station, they're stored in a single centralized physical register file, and pointers are held in the RS. This saves a lot of space, since now instructions traveling around the processor just need to carry the pointer. (This does add some latency and writing required to fetch those results from the RF when they're finally needed.)

      It's explicitly stated that Sandy Bridge is not a major revolution in processor design. Compared to Nehalem, you might think of it representing a large collection of efficiency improvements that work together to make a processor that is faster (clock for clock efficiency) and more power efficient.

      Many of these improvements lead to the larger instruction window. IMHO, this is a critical improvement. A Sun engineer once described modern processing as being a race between last-level cache misses. You have an L2 miss, and you quickly run out of work to do, and the processor stalls until that outstanding read arrives. Meanwhile, you've accumulated a hundred cycles or so of pending work, which gets blasted through, and execution continues perhaps a little while until you have another L2 miss. Processors like Nehalem can execute four or more instructions per cycle (peak), but the effective AVERAGE instructions per clock is less than 1. These high-latency L2 misses are primarily responsible for that. Besides adding on-die memory controllers, which reduce the latency, Sandy Bridge lengthens the instruction window so as to absorb more of that latency, so that stall time is less.
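
The "peak of four, average under one" point above can be made concrete with a toy stall model. It is a sketch only: every number below (instruction count, miss count, miss latency, cycles of latency hidden by the window) is a hypothetical value picked for illustration and not a measurement of Nehalem or Sandy Bridge.

```python
def average_ipc(instructions: int, peak_ipc: float, misses: int,
                miss_latency_cycles: float, hidden_cycles_per_miss: float) -> float:
    """Average instructions per clock under a simple stall model.

    Assumptions: the core runs at peak IPC when it is not stalled, and each
    miss stalls it for whatever part of the miss latency the instruction
    window could not cover with independent work.
    """
    busy_cycles = instructions / peak_ipc
    exposed = max(0.0, miss_latency_cycles - hidden_cycles_per_miss)
    stall_cycles = misses * exposed
    return instructions / (busy_cycles + stall_cycles)


# Hypothetical workload: 1000 instructions, 10 misses of 200 cycles each.
smaller_window = average_ipc(1000, 4.0, 10, 200.0, 50.0)
larger_window = average_ipc(1000, 4.0, 10, 200.0, 100.0)
print(f"smaller window: {smaller_window:.2f} IPC")
print(f"larger window:  {larger_window:.2f} IPC")
```

Even with a peak of four instructions per cycle, a handful of long misses drags the average below one in this model, and hiding more of each miss is what pulls it back up.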

  16. Re:"Relatively Few" customers affected? by RightSaidFred99 · · Score: 2

    Yes, if by "line" you mean "unarguably true statement".

  17. Quote from Intel's Director of Creative Media by A Guy From Ottawa · · Score: 2

    Will.I.Am, at his first public Intel press event since being hired, was quoted saying "The problem with the Sandy Bridge Chipset seems to be the dirty bit. BZZZZT BOOM BOOM.... BZZZZT BZZZT BZZZT BOOM BOOOM...." The rest of his comments weren't heard by anyone at the event due to the sudden loud and obnoxious music blaring from all corners...

    --

    using System.Awesome;

  18. Comment removed by account_deleted · · Score: 2

    Comment removed based on user account deletion

  19. Re:How many heads will roll? by Andrewkov · · Score: 2

    I heard Will I Am is so disgusted he's canceling his contract..

  20. The most reasonable explanation by slew · · Score: 4, Insightful

    My educated guess is that the SATA Input/Output Pads have a digital timing compensation circuit that tries to center the data sampling window (e.g., the clock edge where data is sampled). Since the appropriate data sampling window that won't cause a setup/hold violation changes with process variation and temperature it needs to have lots of potential settings in a large window and may need automatic tracking.

    Probably someone didn't design that window large enough to center the data sampling timing offset (or the step size isn't small enough or the auto adjustment circuit that tracks temperature and adjusts the window appropriately has an algorithmic flaw in some cases, etc). It might be okay now (in early production tests), but as the part ages, the required data sampling window can shift significantly, and if the chip can't adjust the data sampling window appropriately, then data errors are inevitable.

    As a silly example, let's say a hw engineer put in a clock trim circuit that could adjust +-100ps in steps of 10ps. No driver update can make that adjustment -110ps.

    Conversely, if the hw control algorithm that tracks temperature and adjusts the window has a positive temperature coefficient over time (say gets slower), but the actual I/O circuit has a negative coefficient over time (say gets faster), after a while, that feedback algorithm may become unstable, and that might not be fixable with a driver update either (if the control algorithm is in hw).

    Of course, I have no real information, but it's my guess having designed high speed I/Os in the past...
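
The clock trim example above (+-100ps in 10ps steps) can be written out as a small sketch. The function names are hypothetical; the point is only that a driver can pick any setting the hardware offers, but cannot reach past the end of the range.

```python
def best_trim_ps(required_ps: float, limit_ps: int = 100, step_ps: int = 10) -> int:
    """Closest offset a trim circuit with a fixed range and step can produce."""
    steps = round(required_ps / step_ps)          # nearest available step
    max_steps = limit_ps // step_ps               # the range is hardwired
    steps = max(-max_steps, min(max_steps, steps))
    return steps * step_ps


def residual_error_ps(required_ps: float, limit_ps: int = 100, step_ps: int = 10) -> float:
    """Timing error left over after the best available trim is applied."""
    return required_ps - best_trim_ps(required_ps, limit_ps, step_ps)


# As the part ages, suppose the required offset drifts past the range.
for needed in (-60, -100, -110, -130):
    print(f"need {needed:5d} ps -> trim {best_trim_ps(needed):5d} ps, "
          f"residual {residual_error_ps(needed):6.1f} ps")
```

Inside the range the residual is zero; once the requirement drifts to -110ps the best setting is still -100ps, and the leftover error only grows from there.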

  21. Re:How many heads will roll? by Anonymous Coward · · Score: 2, Funny

    32 Intel people will be fired.... unless Hyperthreading is enabled, in which case 64.

  22. Apparently, not the problem by slew · · Score: 2

    Sounds like Intel fessed up yesterday and stated it was a problem with a bias circuit in the PLL clocking tree. A bias circuit apparently caused a transistor to remain in a high leakage state (which over time will induce a failure mode). What makes it silly is that Intel is apparently saying this circuit wasn't in the initial designs, was added but is not needed, and will be disabled in the future... Back to the future!