Intel's Atom C2000 Chips Are Bricking Products -- And It's Not Just Cisco Hit (theregister.co.uk)
Thomas Claburn, reporting for The Register: Intel's Atom C2000 processor family has a fault that effectively bricks devices, costing the company a significant amount of money to correct. But the semiconductor giant won't disclose precisely how many chips are affected nor which products are at risk. In its Q4 2016 earnings call earlier this month, chief financial officer Robert Swan said a product issue limited profitability during the quarter, forcing the biz to set aside a pot of cash to deal with the problem. "We were observing a product quality issue in the fourth quarter with slightly higher expected failure rates under certain use and time constraints, and we established a reserve to deal with that," he said. "We think we have it relatively well-bounded with a minor design fix that we're working with our clients to resolve." Coincidentally, Cisco last week issued an advisory warning that several of its routing, optical networking, security and switch products sold prior to November 16, 2016 contain a faulty clock component that is likely to fail at an accelerated rate after 18 months of operation. Cisco at the time declined to name the supplier of that component.
"A crash reduces
your expensive computer
to a simple stone."
The headline says products are bricked, but are they?
It seems like a clock on the CPU is failing. Those CPUs are soldered on, so replacing them is not easy and you could fairly say that the device is bricked. But Intel claim that there is a "board level" fix. I wonder if they mean replace the CPU, or if there is some other bodge that can mitigate the problem.
I can't imagine how a bodge would prevent a clock failing or replace it once failed. It sounds like there is a silicon-level fault with the CPU, with the clock generation circuit inside it.
Intel for the past decade has dropped the ball. Missing the boat on mobile and failing to push x86 chips into mobile phones has weakened its entire platform, which really needs to be an "everywhere" platform. It has been clear for a while that mobile would be the majority of CPUs for a decade; why Intel has not pushed x86 into more phones is beyond me. It's totally incompetent, especially given that x86 binary compatibility between desktop and mobile could be a selling point.
are we not cheeto?
WE ARE DEVO!
(sorry) ;)
--
"It is now safe to switch off your computer."
So much for knocking ARM out of the embedded processor market.
Time to return them ALL then, no?
The 8080 never had these problems.
Puma 6 issues (atom powered) - www.dslreports.com/shownews/The-Arris-SB6190-Modem-Puma-6-Chipset-Have-Some-Major-Issues-138411
Once you get a replacement CPU from Intel, it's easy to upgrade your system.
Get a small screwdriver, and insert it in the gap under the chip near pin 1. Gently rock the CPU out of its DIP socket; you may have to alternate pulling at each end of the chip.
The new chip's legs will be slightly splayed for use with automatic pick-and-place machines. You may need to gently bend them inwards before proceeding. Making sure that pin 1 is aligned with the marker on the motherboard silkscreen, gently push the new CPU straight down into the DIP socket. Your system is fixed!
If they don't want to say which products are affected, I will assume it is all of them.
Can't post to The Register, since they don't have ACs.
Anyway, the issue is damage to the LPC (low-pin-count) bus clock line. This is a secondary bus where you hang old ISA-style devices, like the system FLASH. If the FLASH is the only thing on there, it will mostly render the system unbootable (so, stuff that never gets power-cycled would just keep going). But LPC can generate interrupts, and one often hangs other crap off that bus, such as i2c controllers for hot-swap bays, motherboard management controllers, and other sensors. In that case, you can expect severe runtime misbehavior.
The issue is caused by *continuous degradation due to use*, so repairing it is easy, if costly: replace the motherboard with a new one under warranty (or even out of warranty in jurisdictions, such as Brazil, where this kind of "stealth" manufacturing defect is not subject to warranty time limits). That "resets" the counter. This is your zero-day solution to the issue.
Depending on time-to-market for the new stepping (hardware revision) B1/C0 of the Atom C2000, you might need an interim solution, which is the "platform-level change", i.e. redesigned board with extra components that work around Intel's hardware design error. As soon as you have these, you start using these to replace any boards returned due to the defect, or start a "recall" to preemptively replace boards.
Depending on the total cost of the board plus other components, you keep the old boards you replaced around, and when revision B1/C0 of the Atom C2000 is out, you BGA-replace them in a factory (about US$ 25 per board in large volumes, if that much), maybe replace any liquid electrolytic capacitors and other crap that ages badly, and use the boards either as new or as refurbished, depending on your corporate/regulatory ethics. This kind of repair almost always really resets the board's MTBF. If Intel supplies the replacement Atoms at no charge, the cost of repair might well be far less than the cost of the production run for boards you'd want to keep around for warranty services, anyway.
Mind you, at 1.5 years per failure, legislation or a contract that forces more than one replacement will be rare... so, let's hope they don't replace a faulty board with a brand-new, virgin, but-still-timebombed board. You'd have trouble getting it replaced a second time if it fails after the warranty period.
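To put rough numbers on that warranty timing, here is a minimal Python sketch. It treats the 18-month figure from the Cisco advisory as the point where failure becomes likely, which is a simplification: real failures are probabilistic, not scheduled. The function names, the dates and the 36-month warranty are all made up for illustration.

```python
from datetime import date

FAILURE_ONSET_MONTHS = 18  # assumption: the figure quoted in the Cisco advisory


def add_months(d: date, months: int) -> date:
    """Shift d forward by whole months (day clamped to 28 to stay valid)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def replacement_outlives_warranty(purchased: date, warranty_months: int,
                                  replaced: date) -> bool:
    """True if a same-stepping replacement board only reaches its own
    18-month risk window after the original warranty has ended."""
    warranty_end = add_months(purchased, warranty_months)
    risk_start = add_months(replaced, FAILURE_ONSET_MONTHS)
    return risk_start > warranty_end


# Hypothetical case: bought Jan 2015, 36-month warranty,
# first board swapped in Sep 2016 (about 20 months in).
print(replacement_outlives_warranty(date(2015, 1, 15), 36, date(2016, 9, 15)))
```

In that made-up case the second board's risk window opens in March 2018, two months after the warranty ends in January 2018, which is exactly the situation described above.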
So far, Netgear's storage line, ReadyNAS, has some affected systems, including the ReadyNAS 3100 family. And possibly some other networking products, like their wireless controller. Those use various C2000 family chips.
They were supposed to fail after warranty period, not before
Don't power it off, or it might not power back on.
I have Supermicro with a 2750F-O dammit
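If you're not sure what is inside a box, a quick check on Linux is to look at the CPU model name. A minimal Python sketch, assuming C2000-family parts report a model name along the lines of "Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz" in /proc/cpuinfo; the pattern and function names are my own. A match only tells you the chip is in the family, not that your particular unit is affected or failing, since Intel hasn't said which parts are at risk.

```python
import re

# Assumption: C2000-family parts show up as "Atom(TM) CPU  C2xxx" in the
# model name string; the exact format may differ on your system.
C2000_PATTERN = re.compile(r"Atom\(TM\)\s+CPU\s+C2\d{3}\b")


def is_atom_c2000(model_name: str) -> bool:
    """Return True if the model name looks like an Atom C2000-family part."""
    return C2000_PATTERN.search(model_name) is not None


def local_cpu_model(path: str = "/proc/cpuinfo") -> str:
    """Return the first 'model name' value from a Linux cpuinfo file, or ''."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass  # not Linux, or the file is unreadable
    return ""


if __name__ == "__main__":
    model = local_cpu_model()
    print(model or "model name not found", "->", is_atom_c2000(model))
```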
and let marketers take over the company.
They had issues with this dating back to the 80s, but the Pentium and onward is where they really ruined themselves with it.
Plus putting their engineers (both hardware and software) on Don Quixote or MacGyver (the original!) style quests to make an oversized peg fit into an undersized hole. And around the time the engineers figured out an ingenious way to make it work: PRODUCT CANCELLED. After putting months or years of overtime into a project like that, of course they are going to find fewer and fewer competent engineers working for them. The smart ones cut their losses after 5-20 years, just enough to build up a cushion for a career change. The rest are interchangeable and either aren't willing to risk jumping ship to somewhere better, or can't get somewhere better to hire them.
Intel's C2000 series was a dream come true for low-power servers. I have an 8-core C2758 Atom server at home (from SuperMicro), and it is really a beast, given that it draws less than 10W total system power at idle or low utilization (excluding HDDs, of course, but including the MB, CPU and RAM).
But they have dropped the ball, now in two ways:
- There was no update to the Atom server line in the last two years. They probably do not want to cannibalize their other offerings (8 core CPU with AES and VT extensions is more than enough to host several VMs). But they also left the market empty (there is still no competition).
- Now we learn that the chips are faulty. Without any replacement option (the chip is soldered onto a particularly expensive motherboard), I'll just have to wait for it to fail.
Given they did not provide any useful performance increase in the last two generations, heck make it three, I'm really disappointed with Intel at the moment. No more mobile chips, no more low-power server chips, and a three-year-old i7 4790K can easily compete against a recent 7700K with only a 10% drop in performance.
Wow! You have a lot of insight into the technical issue AND the strategic implications/solutions. I suspect that you "run things" somewhere with mastermind skill. Thanks for sharing your thoughts!
Thanks for the sarcasm :-P Yeah, things are *never* that simple, but it is still a nice simplified view of how it *should* go.