e1000e Bug Squashed — Linux Kernel Patch Released
ruphus13 writes "As mentioned earlier, there was a kernel bug in the alpha/beta version of the Linux kernel (up to 2.6.27 rc7), which was corrupting (and rendering useless) the EEPROM/NVM of adapters. Thankfully, a patch is now out that prevents writing to the EEPROM once the driver is loaded, and this follows a patch released by Intel earlier in the week. From the article: 'The Intel team is currently working on narrowing down the details of how and why these chipsets were affected. They also plan on releasing patches shortly to restore the EEPROM on any adapters that have been affected, via saved images using ethtool -e or from identical systems.' This is good news as we move towards a production release!"
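For anyone who wants an image on hand before trying an -rc kernel, here is a rough sketch of saving one with `ethtool -e`. The `raw on` option is a real ethtool flag that emits the EEPROM as raw bytes. The interface name `eth0`, the output path, and the helper names `eeprom_dump_command` and `save_eeprom_image` are assumptions made up for illustration. This is not the restore procedure Intel says it is working on.

```python
import hashlib
import subprocess
from pathlib import Path


def eeprom_dump_command(interface):
    """Build the ethtool invocation that dumps an adapter's EEPROM as raw bytes."""
    # "raw on" asks ethtool for binary output instead of a hex listing.
    return ["ethtool", "-e", interface, "raw", "on"]


def save_eeprom_image(interface, destination):
    """Dump the EEPROM of `interface` into `destination` and return its SHA-256.

    Hypothetical helper: it only saves a copy and does not write to the adapter.
    """
    result = subprocess.run(
        eeprom_dump_command(interface), check=True, capture_output=True
    )
    Path(destination).write_bytes(result.stdout)
    # The digest makes it easy to confirm later that the saved copy is intact.
    return hashlib.sha256(result.stdout).hexdigest()
```

Something like `save_eeprom_image("eth0", "eth0-eeprom.bin")` would then leave a binary copy on disk. Reading the EEPROM typically needs root.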
Hwwaa? Oh yes...the kernel doesn't corrupt your EEPROM anymore!
Obligatory blog plug: http://www.caseybanner.ca/
Linus isn't very happy with Intel here:
http://lkml.org/lkml/2008/9/29/368
On Mon, 29 Sep 2008, Arjan van de Ven wrote:
>
> we have a patch to save/restore now, in final testing stages
> (obviously we want to be really careful with this)
Btw, the _real_ bug is clearly in the hardware design that allows you to
brick those things without apparently even having a lock bit.
I'm hoping Intel doesn't treat this as just a software bug. Some hw
designer should be thinking hard about which orifice they put their head
up in.
It used to be that you could fry some monitors by feeding them
out-of-range signals. The _monitors_ got fixed.
Linus
It's newsworthy because it was a bug that actually bricked hardware.
Hey, look! It's Bono's brother.
An alpha/beta of the most recent linux kernel patch had a bug fixed, and it hits the front page?
They have not fixed the bug that caused the e1000e ethernet cards to get bricked. This is at least a two part bug. The EEPROM should not have been writable and Something Is Happening to cause bad writes to happen. What that "Something" is, no one knows yet, though it appears they are getting close.
Linus is an absolute, total anal retentive with regard to fixing bugs by understanding and fixing the root cause[1], not just papering over it. This patch papers over it for the moment, because the bug hasn't been isolated yet, but it allows more people to participate in testing, because the side effects were really nasty: this was a true bricking of the ethernet card.
This stage isn't newsworthy for Slashdot.[2] It must be a slow news day.
[1] This is a Good Thing.
[2] Nor will the real bug fix when it comes. A bug is found, a bug is fixed. Life goes on.
Yes, they released a patch so that the NVM can't be overwritten after the e1000e driver is loaded. But from what I can tell, they still don't know what is/was responsible for the overwriting.
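To make the idea concrete, here is a toy model of "no writes once the lock is set". It is only a sketch of the concept, not how the e1000e patch or the hardware actually does it. `LockableNvm` and its methods are hypothetical names.

```python
class LockableNvm:
    """Toy non-volatile memory with a one-way write lock (illustration only)."""

    def __init__(self, contents):
        self._cells = bytearray(contents)
        self._locked = False

    def lock(self):
        # One-way latch: this model has no unlock, like a lock bit that
        # only a reset could clear.
        self._locked = True

    def write(self, offset, value):
        # Every write is refused once the latch is set, no matter who asks.
        if self._locked:
            raise PermissionError("NVM is write-protected")
        self._cells[offset] = value

    def read(self, offset):
        return self._cells[offset]
```

A "driver" would call `lock()` as soon as it loads. After that, a stray write raises an error and leaves the contents alone. Note that this stops the damage without saying anything about where the stray writes come from.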
FWIW, I'm almost positive that modern CPUs have debug traps for this exact sort of thing...you can trap arbitrary I/O writes via SMM or something...obviously I'm not in the debug loop, but I don't see why this has been so hard to figure out...
I know this is News For Nerds and all that, but isn't this a tad specific?
That's what sections are for. See the little Tux Icon over there? We all care about Linux. Besides, it's a VERY IMPORTANT BUG. A showstopper, so to speak. And keep in mind that a lot of people in here are kernel freaks. They want to test-drive the latest versions of the kernel. And one of the reasons why people keep coming here (and not to digg) is precisely for this kind of news.
Thanks, ruphus13.
Try erasing the BIOS on the main board and you will be more accurate in your comparison.
This bug actually flashed the firmware for the network controller and hosed access to it in some unexplained sort of way. That is noteworthy because of its rarity. If it were simply hosing something more common and readily diagnosable, like a boot sector, it would be different. It isn't often that software is associated with hardware damage, either purposefully or accidentally.
BTW, I know there are recovery methods for a hosed BIOS. That isn't the point. Simply installing an operating system shouldn't hose it, nor should it hose hardware. Imagine all the people who just thought their card was broken and went for a refund under warranty, or the bad name Intel or Linux received for the "faulty shipment of devices" or for the ability to break a device. This is a card that would work in Windows; boot Linux in a dual-boot setup, and it would stop working in both Windows and Linux, without any errors or any indication that the card was even capable of being seen by the mainboard.
3com used to be that way too. I'm not exactly sure what it was, but the 3c905s rocked and would move data quite a bit faster than any other card at the time. I know they had full-blown data processors on the cards, but I assume the others would too. I used to go to computer shows just to pick them up for $10-$20 used, because they had the same effect on data performance as you would see in rendering when going from an S3 Trident video adapter to a GeForce video card. I became seriously convinced at a LAN party, where I had an AMD Athlon 800 system running Windows 98SE with 256 MB of memory and we had to pull a 100 meg file from a file server to get the updates in sync to play a game. I started pulling the file last because I was helping others find it, I was on the tail end of the third tier of uplinked switches, and I had the file installed while others were still transferring it. The funny part is that people with their brand new Windows XP 1.4 and 1.8 GHz-plus systems were still slower, and the only thing I can attribute it to is the NIC.
Intel caught up with 3com in this aspect and despite my older fascinations with 3com, I'm actually an Intel fan in this one respect now.
Linus has a very good analogy here -- in fact, I love the fact that on the rare occasions I have to set modelines myself, I can pretty much put whatever I want, knowing that if it doesn't work, I can just ctrl+alt+backspace and try again.
But the conclusion does bother me: We're basically saying that all software is buggy, or that we're incapable of preventing this kind of thing from happening (in software). This is true of most modern OS designs -- monolithic kernels do make it possible for pretty much any driver to accidentally ruin any other driver's day.
The proposed workaround, then, is to prevent that memory from being written -- and to prevent this in hardware, for no other reason than to avoid having to write it into every kernel that might potentially allow buggy code to run in Ring 0.
I don't like either solution. Hardware shouldn't be brickable from software, or at least, not so easily. But software shouldn't need hardware to coddle it, either -- why is the SSD in this laptop emulating a hard disk?
Don't thank God, thank a doctor!