Intel Gigabit NIC Packet of Death

Posted by Soulskill on Wednesday February 6, 2013 @09:03AM from the how-to-break-things dept.

An anonymous reader sends this quote from a blog post about a very odd technical issue and some clever debugging: "Packets of death. I started calling them that because that’s exactly what they are. ... This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network. Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead. Nothing but a power cycle would bring it back. ... While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE. ... With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death — and kill client machines behind firewalls!"
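
The capture-narrowing step described in the quote (stop the capture when the interface drops, then look at the last packet out) can be sketched roughly as below. This is a minimal illustration, not the blog author's tooling: the `(timestamp, bytes)` frame record, the function names, and the `offset`/`value` parameters are all hypothetical, and the summary itself does not state which byte position or value is involved.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical capture record: (timestamp in seconds, raw frame bytes).
Frame = Tuple[float, bytes]


def last_frame_before(frames: Iterable[Frame], drop_time: float) -> Optional[Frame]:
    """Return the latest frame stamped at or before the link drop, if any."""
    candidates = [f for f in frames if f[0] <= drop_time]
    return max(candidates, key=lambda f: f[0]) if candidates else None


def has_byte_at(frame: bytes, offset: int, value: int) -> bool:
    """True if the frame is long enough and carries `value` at `offset`.

    Both `offset` and `value` are placeholders supplied by the caller.
    """
    return len(frame) > offset and frame[offset] == value
```

Comparing the frame returned by `last_frame_before` across several failed captures is one way to spot the "always a specific length" pattern the quote mentions.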

Ouch by Anonymous Coward · 2013-02-06 09:07 · Score: 5, Insightful

I think an actual summary would have been a vast improvement over TFS.
1. Re:Ouch by whois · 2013-02-06 09:26 · Score: 5, Insightful
  
  It's pretty bad even by slashdot standards:
  'Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller.'
  This statement is worse than useless, it's a waste of space and a waste of your time to read it (I'm sorry I quoted it). The next sentence is okay but then they go back to 'Literally the link lights on the switch and interface would go out. It was dead.'
  Literally, this is a waste of the word literally. And it being dead was implied by everything stated above. The rest is informative but still in a conversational style that makes it hard to read, and it's lacking in details such as:
  What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?
2. Re:Ouch by chevelleSS · 2013-02-06 09:42 · Score: 4, Informative
  
  If you read further down in the article, you would know that they worked with Intel and were given a patch to fix this issue. Brandon
3. Re:Ouch by el+borak · 2013-02-06 09:48 · Score: 4, Informative
  What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?
  I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.
  
  Ethernet controller? 82574L
  
  Reported? Yes, and Intel supplied an EEPROM fix.
  --
  An imperfect plan executed violently is far superior to a perfect plan. -- George Patton
4. Re:Ouch by kelemvor4 · 2013-02-06 10:03 · Score: 2
  What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?
  I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.
  
  Ethernet controller? 82574L
  Reported? Yes, and Intel supplied an EEPROM fix.
  It's Slashdot. Most people don't even read the whole summary before asking questions like that.
5. Re:Ouch by WarJolt · 2013-02-06 11:22 · Score: 5, Funny
  
  Less /. bashing more Intel bashing please.
6. Re:Ouch by sirsnork · 2013-02-06 11:48 · Score: 4, Informative
  
  Intel NICs are held in high regard because a) they are fixed when a problem is found, and b) the bugs are documented.
  You should have a look through some of the CPU errata on Intel's site. It'll open your eyes as to just how many bugs a desktop CPU has even once it's shipped.
  
  --
  
  Normal people worry me!
7. Re:Ouch by Xtifr · 2013-02-06 12:23 · Score: 5, Funny
  
  Then GP's on the wrong site. Here at slashdot, we're proud of our editors' inability and unwillingness to do anything that could actually be described as editing. Cuz writin' good isn't sumpin' real nurdz car about. U shld just B glad it ain't all writ in 1337-5p34|<, and STFU, n00b!
  At least, that's the impression I've always had of what the so-called "editors" seem to believe. :)
8. Re:Ouch by sumdumass · 2013-02-06 21:17 · Score: 2
  
  Too bad 3com isn't around any more. Well, not in any meaningful way. They used to rock this world.
This is why the equipment should be heterogeneous by eksith · 2013-02-06 09:10 · Score: 4, Insightful

Whether it's your brand of switch, motherboard or even memory, never have the same across all machines if you can help it. The only time I'd recommend the same brand would be hard drives (due to concurrency issues), but then at least try to get them from different batches. If your lot of mobos will only handle one brand of memory for whatever reason even when CAS latency is identical, then have two machines doing whatever it is you need to be doing.
One kind of anything makes it easier to kill you swiftly in the end, whether it's by a ping of death or a biological disease.

--
If computers were people, I'd be a misanthrope.
QOTD by jlv · 2013-02-06 09:12 · Score: 4, Funny

``Life is too short to be spent debugging Intel parts.''
-- Van Jacobson
1. Re:QOTD by ACluk90 · 2013-02-06 10:26 · Score: 5, Funny
  
  Maybe that was what the guys at Intel thought.
Re:Online Income by Anonymous Coward · 2013-02-06 09:16 · Score: 5, Funny

http://www.cloud65.com/ just as Marcus answered I didnt know that a mother can profit $8765 in four weeks on the computer. did you read this webpage
I think the NIC packet of death might be just what you need.
Re:This is why the equipment should be heterogeneo by Anonymous Coward · 2013-02-06 09:17 · Score: 3, Informative

Agreed. OP clearly has no experience managing large server installations.
Re:Online Income by WilyCoder · 2013-02-06 09:18 · Score: 4, Funny

Listen here my friend, has anyone really been far even as decided to use even go want to do look more like?
Re:This is why the equipment should be heterogeneo by vlm · 2013-02-06 09:19 · Score: 4, Insightful

the drivers are certified to work
LOL this is a firmware bug, you can lock up the hardware even with no OS booted. Hilarious.

and you get real support
Yeah I love being told to reinstall windows on my linux boxes. Those guys sure are helpful !

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Re:This is why the equipment should be heterogeneo by LordLimecat · 2013-02-06 09:20 · Score: 5, Interesting

One kind of thing makes it a zillion times easier to recognize a problem when it crops up, and makes it so you only ever have to troubleshoot an issue once.
How much more awful would it be if something similar happened next week on more computers, and he had to troubleshoot it all over again-- not even knowing whether the machines had NICs in common?
"Everything blew up" is a problem. "Everything blew up, I don't know why, and it will take 3 weeks to find a solution" is a huge problem. "Everything blew up AGAIN, and it will take another 3 weeks because our environment is heterogeneous" means you are out of a job.
Re:Three Strikes... I'll Pass by v1 · 2013-02-06 09:23 · Score: 5, Informative

Oh, I think this is at least slightly interesting. I remember the "ping of death" (and pissing off a few Windows heads in my sights) back in th' day.
This is basically a DoS attack on hardware. The fact that it can get through someone's firewall makes it a bit more effective. Having your ethernet port check out every five minutes (requiring a reboot to fix) just because someone down the hall (or in Bulgaria) wants to be an ass is definitely annoying and something I'd like to know is a possibility when troubleshooting screwy network problems.
I just got done swapping out a gigabit switch that was being wonky and slow for no obvious reason. I don't mind so much when hardware keels over and dies, but when it throws symptoms that don't immediately suggest where the problem is, those are the real time wasters. And we've come to rely on hardware generally being more reliable than software. So if my ethernet was going out when I VOIP'ed, I might have spent (wasted) a lot of my time troubleshooting the VOIP software.

--
I work for the Department of Redundancy Department.
Re:Three Strikes... I'll Pass by localman57 · 2013-02-06 09:25 · Score: 4, Insightful

It's actually a pretty good write-up with a nice trace of his troubleshooting. If my customers gave me bug reports that included a tenth of the level of detail he does in the article, I'd be over the moon.
Re:This is why the equipment should be heterogeneo by PRMan · 2013-02-06 09:32 · Score: 4, Insightful

or just buy premade servers from dell or HP. they aren't that much more expensive, the drivers are certified to work and you get real support
...and you're guaranteed that every shipment will have radically different hardware, despite having identical model numbers.

--
Peter predicted that you would "deliberately forget" creation 2000 years ago...
Re:This is why the equipment should be heterogeneo by eksith · 2013-02-06 09:48 · Score: 4, Informative

There's a good reason a lot of our equipment is slightly older. No, we don't use ancient stuff, but it's not 100% top of the line made yesterday either. And that's because each time a new mobo, memory and storage combo that looks like it's worth purchasing comes to market, the first thing we do is run a few sample sets under everything we can throw at it. Usually problems are narrowed down within the first couple of weeks or so, but that's why we have separate people just for testing equipment.
Now admittedly, it's getting harder with this economy so we have some people doing double duty on occasion (I've had to do a bit too when the flu came rolling in), but testing goes on for as long as we think is necessary before the combo goes live. We avoid a lot of the headaches that come with large deployments by keeping changes isolated to maybe 10-15 nodes at a time. It's a slow and steady rollout of mostly similar systems (maybe 3-4 identical) that helps us avoid down time.
We're not Google and we don't pretend to be, but common sense goes a long way to avoiding hiccups like "everything blew up". I think the biggest issue was when Hurricane Sandy hit and we weren't sure if the backup generators would come online (this is a big problem with things that need fuel and oil, but stay off for a long time), so we brought in a generator truck for that too, just in case. Again, avoiding one of anything.

--
If computers were people, I'd be a misanthrope.
Nice debugging by Dishwasha · 2013-02-06 09:58 · Score: 4, Interesting

I for one definitely appreciate the diligence of Kristian Kielhofner. Many years ago I was supporting a medium-sized hospital whose flat network kept having intermittent issues (and we all know intermittent issues are the worst to hunt down and resolve). Fortunately I was on-site that day and at the top of my game, and after doing some Ethereal sleuthing (what Wireshark was called at the time), I happened to discover a NIC that was spitting out bad LLC frames. Doing some port tracking in the switches, we were able to isolate which port it was on, which happened to be at their campus across the street. Of all possible systems, the offending NIC was in their PACS. After pulling the PACS off the network for a while the problem went away and we had to get the vendor to replace the hardware.
Re:This was fixed years ago by omnichad · 2013-02-06 10:01 · Score: 4, Insightful

That's not the same bug. I'd explain, but that's what you get for saying "I wish this guy had done his homework."
Counterfeit Intel NIC? Apparently not. by Zemplar · 2013-02-06 10:22 · Score: 4, Interesting

I'm glad Mr. Kielhofner contacted Intel about this issue and had Intel confirm the bug.

Some years ago I had been diagnosing similar server NIC issues, and after many hours digging, Intel was able to determine the fault was due to the four-port server NIC being counterfeit. Damn good looking counterfeit part! I couldn't tell the difference between a real Intel NIC and counterfeit in front of me. Only with Intel's document specifying the very minor outward differences between a real and known counterfeit could I tell them apart.

Intel NIC debugging step #1 = verify it's a real Intel NIC!
Replacements by phorm · 2013-02-06 10:39 · Score: 3, Insightful

Errrr, no. Have you ever tried to deal with replacements and/or issues within a large organization where everything is different? It's hellish.
Try tracking an issue across an enterprise architecture when all the architecture is DIFFERENT. You also don't want to mix RAM, and drivers can be a real b**** for different motherboards. Oh, and RMA'ing things: not fun.
Different brands of RAM. Yeah, you try a rack full of servers playing mix'n'match and see how well that works.
Lastly... how many vendors/brands of enterprise gear do you think are out there, and for the ones that do exist, how well do you think they talk together? Maybe you're happy mixing HP Procurves with your Cisco stuff but I don't recommend it, and for some stuff there aren't a lot of vendors to choose from anyhow.
Re:Other NIC models by ewieling · 2013-02-06 11:02 · Score: 2

The Intel 82580 does not appear to have the same issue. All our network problems went away when we put in some cards based on that chip in our systems which used the Intel 82574L for the onboard LAN. Customers stopped screaming, sales stopped screaming, management stopped screaming and I was able to get some sleep.

--
I really shouldn't have used someone else's email address for this account.
Re:This is why the equipment should be heterogeneo by Anonymous Coward · 2013-02-06 11:06 · Score: 2, Informative

You will have to update the kernel, though. The Linux e1000 and e1000e drivers have a fuckload of hardware bug workarounds, and the ASPM thing did hit some people recently. You *must* have ASPM L0s and L1 disabled on the Intel NIC *and* its parent PCIe bridge, and the kernel driver usually will only be able to disable it on the NIC itself. The NIC can hang if the BIOS is crap and leaves ASPM L0s or L1 enabled on the bridge, or if it has a crap NIC EEPROM image that causes issues with the 128b/256b maximum PCIe packet size. (That last one can be fixed by Linux, *if* you give it a specific parameter. No idea why it isn't automatic, since it is major utter braindamage by the BIOS that is known to hang the box hard sometimes.)
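
One way to check the condition this comment describes (ASPM off on both the NIC and its parent bridge) is to look at the link-control lines in `lspci -vv` output. The sketch below only parses text that has already been captured. It assumes the `LnkCtl: ASPM ...;` line layout commonly printed by `lspci -vv`, which may vary between pciutils versions, and the function names are made up for illustration.

```python
import re
from typing import Dict, List, Optional

# Assumed layout of the link-control line, e.g.
#   "LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes ..."
LNKCTL_ASPM = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_states(lspci_vv: str) -> Dict[str, str]:
    """Map each PCI address to the ASPM state on its LnkCtl line."""
    states: Dict[str, str] = {}
    current: Optional[str] = None
    for line in lspci_vv.splitlines():
        # Device header lines start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            current = line.split()[0]
            continue
        match = LNKCTL_ASPM.search(line)
        if match and current is not None:
            states[current] = match.group(1).strip()
    return states


def aspm_enabled(states: Dict[str, str]) -> List[str]:
    """List the devices whose reported ASPM state is anything but Disabled."""
    return [dev for dev, state in states.items() if state != "Disabled"]
```

Feeding it the saved output of `lspci -vv` (usually run as root so the capability blocks are visible) and checking that neither the NIC nor its upstream bridge shows up in `aspm_enabled` is a quick sanity check, not a substitute for the driver and BIOS fixes discussed above.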
Really?!?! by Anonymous Coward · 2013-02-06 11:07 · Score: 4, Funny

So by "bring down" you didn't just mean bring down, implying it was brought down, but you meant "BRING DOWN" (notice the caps), implying it was brought down (notice the italics). Such a critical distinction. If it was merely "brought down" this would hardly have been an issue. You could have simply ignored the dead router. As it stands, being brought down, this is a real problem, and you cannot ignore the dead router. Good job!
Re:This was fixed years ago by LordLimecat · 2013-02-06 11:18 · Score: 2

The Ubuntu bug had to do with bad drivers and/or firmware; when the affected distro was installed on a computer with the affected NIC (which was the completely different e1000), it would render that NIC unusable, even after reboots.
This bug appears to be triggered by receiving a crafted packet, remotely, and is fixable with a reboot. It also affects a different NIC.
no public fix by SuperBanana · 2013-02-06 11:26 · Score: 2

Too bad Intel gave a fix to them (a fix they ultimately couldn't use), but hasn't to anyone else.
Too bad Intel has also apparently known about the problem for months now.
"Intel has been aware of this issue for several months. They also have a fix. However, they haven't publicized it because they don't know how widespread it is."
Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.

--
Please help metamoderate.
1. Re:no public fix by TheLink · 2013-02-06 15:41 · Score: 2
  
  Maybe the spooks told them to keep the bug unfixed in the wild ;).
2. Re:no public fix by BitZtream · 2013-02-07 01:51 · Score: 2
  
  Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.
  Yes, that would be ideal for everyone. If it just silently went away, that also means it wasn't much of a problem ... that means it didn't get used as a massive exploit. That's good.
  There is no scenario where 'going away' is a bad thing, unless you're just all angsty and looking for a reason to tell 'the man' how much he sucks.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
Re:Counterfeit Intel NIC? Apparently not. by countach · 2013-02-06 12:13 · Score: 4, Insightful

It's just an Intel support strategy. Release NICs with random and minor outward differences. When you have a support issue, say that it is counterfeit. Really cuts support costs!
MAC address take-down by Anonymous Coward · 2013-02-06 13:19 · Score: 4, Interesting

A number of years ago I discovered that you can take down many routers, and Windows / Linux hosts by sending an ARP response that says "IP 0.0.0.0 is at MAC FF:FF:FF:FF:FF:FF". When you direct this packet to the access point in a wireless network, this makes the SSID broadcast disappear and the whole device go down. Never posted this until now, I wonder if this still works on modern devices.
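
For anyone who wants to watch for the kind of reply described here rather than send one, a detection-side sketch follows. It parses a raw 28-byte Ethernet/IPv4 ARP payload (the standard ARP field layout) and flags a reply whose sender fields claim 0.0.0.0 at the broadcast MAC. The function name is hypothetical, and whether any given device actually misbehaves on such a packet is the commenter's claim, not something this code verifies.

```python
import struct

ARP_REPLY = 2                 # ARP opcode for a reply
BROADCAST_MAC = b"\xff" * 6   # FF:FF:FF:FF:FF:FF
ZERO_IP = b"\x00" * 4         # 0.0.0.0


def is_suspicious_arp_reply(arp: bytes) -> bool:
    """Flag an ARP reply announcing 0.0.0.0 at the broadcast MAC."""
    if len(arp) < 28:
        return False
    # Fixed header: hardware type, protocol type, address lengths, opcode.
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", arp[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False  # not Ethernet/IPv4 ARP
    sender_mac = arp[8:14]
    sender_ip = arp[14:18]
    return oper == ARP_REPLY and sender_mac == BROADCAST_MAC and sender_ip == ZERO_IP
```

The input is the ARP payload only, with the 14-byte Ethernet header already stripped by whatever capture tool is in use.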
Re:Three Strikes... I'll Pass by TheLink · 2013-02-06 15:22 · Score: 3, Insightful

Hardware is just what you call something YOU don't configure/patch much even if someone else does :).

To a PHB everything might be hardware. To a HDD maker HDDs aren't hardware, same for CPU makers and their CPUs.
Re:Firmware updates motherfucker, do you speak the by Cramer · 2013-02-06 18:56 · Score: 3, Insightful

And where do I get this mythical "firmware update" for the NIC CHIP? I'm sure the chip has code in it, but I've never even heard of a utility from Intel to update the in-chip code in a NIC. (It's called "microcode", not firmware.)
Re:Counterfeit Intel NIC? Apparently not. by kakaburra · 2013-02-06 21:54 · Score: 2

say that it is counterfeit. Really cuts support costs!
Also cuts sales volume. If someone tells me there are lots of counterfeits that look almost like original, I'd stop buying.