Slashdot Mirror


Intel Gigabit NIC Packet of Death

An anonymous reader sends this quote from a blog post about a very odd technical issue and some clever debugging: "Packets of death. I started calling them that because that’s exactly what they are. ... This customer location, for some reason or another, could predictably bring down the ethernet controller with voice traffic on their network. Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller. The system and ethernet interfaces would appear fine and then after a random amount of traffic the interface would report a hardware error (lost communication with PHY) and lose link. Literally the link lights on the switch and interface would go out. It was dead. Nothing but a power cycle would bring it back. ... While debugging with this very patient reseller I started stopping the packet captures as soon as the interface dropped. Eventually I caught on to a pattern: the last packet out of the interface was always a 100 Trying provisional response, and it was always a specific length. Not only that, I ended up tracing this (Asterisk) response to a specific phone manufacturer’s INVITE. ... With a modified HTTP server configured to generate the data at byte value (based on headers, host, etc) you could easily configure an HTTP 200 response to contain the packet of death — and kill client machines behind firewalls!"

137 comments

  1. Ouch by Anonymous Coward · · Score: 5, Insightful

    I think an actual summary would have been a vast improvement over TFS.

    1. Re:Ouch by mythosaz · · Score: 1

      The summary is pretty much word-for-word copy-pasta from his blog... minus any of the useful formatting.

    2. Re:Ouch by whois · · Score: 5, Insightful

      It's pretty bad even by slashdot standards:

      'Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller.'

      This statement is worse than useless, it's a waste of space and a waste of your time to read it (I'm sorry I quoted it). The next sentence is okay but then they go back to 'Literally the link lights on the switch and interface would go out. It was dead.'

      Literally, this is a waste of the word literally. And it being dead was implied by everything stated above. The rest is informative but still in a conversational style that makes it hard to read, and it's lacking in details such as:

      What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?

    3. Re:Ouch by chevelleSS · · Score: 4, Informative

      If you read further down in the article, you would know that they worked with Intel and were given a patch to fix this issue. Brandon

    4. Re:Ouch by Anonymous Coward · · Score: 0

      Yes, and it's horrible as a concise and readable summary of the article.

    5. Re:Ouch by radiumsoup · · Score: 0

      tl;dr

    6. Re:Ouch by el+borak · · Score: 4, Informative

      What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?

      I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.

      • Ethernet controller? 82574L
      • Reported? Yes, and Intel supplied an EEPROM fix.
      --
      An imperfect plan executed violently is far superior to a perfect plan. -- George Patton
    7. Re:Ouch by noc007 · · Score: 1

      82574L was the Intel NIC.

      I'm surprised that Intel NICs are held in such high regard, yet there are some really detrimental bugs.

      CSB:
      I just bought a three port daughterboard for a Jetway ITX mobo I am planning on using as a pfSense FW. Their Gen2 daughterboard uses this chip, but thankfully I didn't spend the extra $50 on the Gen2 compatible board and went with a Gen1 that uses 82541PI. Hopefully that one doesn't have the same issue.

    8. Re:Ouch by kelemvor4 · · Score: 2

      What model of Ethernet controller was tested. What Firmware version are they using? Has the problem been reported to Intel?

      I realize you found the article difficult to read, but it wasn't that long. 2/3 of your questions were addressed in the article.

      • Ethernet controller? 82574L
      • Reported? Yes, and Intel supplied an EEPROM fix.

      It's Slashdot. Most people don't even read the whole summary before asking questions like that.

    9. Re:Ouch by Anonymous Coward · · Score: 0

      For the record, GP seems to be complaining about the /. summary, not the article at all. He's saying that the summary should, well, summarize.

    10. Re:Ouch by Anonymous Coward · · Score: 1

      The 82541 has worse bugs and worse performance. Besides, the 82574L is used instead of the RealTek RTL 81xx and its ilk. The RTL81xx crap is MUCH worse, as it is unfixable: slow, dumb, and it requires severe performance-reducing measures that dumb it down to Fast Ethernet-like levels of hardware assistance just to survive without causing rogue PCI master transactions (aka rogue DMA over whatever is after the packet buffer). You cannot even use that RTL LOM NIC with jumbo frames without risking PCIe stalls every once in a blue moon.

      Anyway, this looks like the usual ASPM brokenness in that generation of Intel NICs, which is usually only a problem when the motherboard has ubershit firmware (BIOS/EFI/NIC EEPROM) that doesn't implement the published errata fixes properly (i.e. disable ASPM L0s and L1 for the Intel 82574L device and its upstream PCIe bridge).

    11. Re:Ouch by chronokitsune3233 · · Score: 0

      From a psychological perspective, I think the author most likely is often misunderstood offline. As a result, reinforcing the idea being expressed is a subconscious necessity, based upon interpersonal encounters.

      Or maybe his wife/girlfriend has been complaining at him about the fact that they never understand each other, which is natural since men and women think and react differently.

      --
      I have been a captive in America my entire life. Everybody and everything uses customary units instead of metric.
    12. Re:Ouch by WarJolt · · Score: 5, Funny

      Less /. bashing more Intel bashing please.

    13. Re:Ouch by sirsnork · · Score: 4, Informative

      Intel NICs are held in high regard because a) they are fixed when a problem is found, and b) the bugs are documented.

      You should have a look through some of the CPU errata on Intel's site. It'll open your eyes as to just how many bugs a desktop CPU has even once it's shipped.

      --
      Normal people worry me!
    14. Re:Ouch by Xtifr · · Score: 5, Funny

      Then GP's on the wrong site. Here at slashdot, we're proud of our editors' inability and unwillingness to do anything that could actually be described as editing. Cuz writin' good isn't sumpin' real nurdz car about. U shld just B glad it ain't all writ in 1337-5p34|<, and STFU, n00b!

      At least, that's the impression I've always had of what the so-called "editors" seem to believe. :)

    15. Re:Ouch by Anonymous Coward · · Score: 0

      Slashdot editors are too busy desperately hunting for Australian news to post to waste time doing their fucking jobs.

    16. Re:Ouch by Anonymous Coward · · Score: 0

      It's pretty bad even by slashdot standards:

      'Let me elaborate on that for a second. When I say “bring down” an ethernet controller I mean BRING DOWN an ethernet controller.'

      This statement is worse than useless, it's a waste of space and a waste of your time to read it (I'm sorry I quoted it).

      reminds me of some teenage girls on the station this morning.

      "Oh migod like you know?....I know!"

    17. Re:Ouch by sumdumass · · Score: 2

      Too bad 3com isn't around any more. Well, not in any meaningful way. They used to rock this world.

    18. Re:Ouch by Mike+Frett · · Score: 1

      Intel needs to be bashed, imo. They had that recent Core CPU bug, this network bug, and they are ending the option to buy the mobo and CPU separately. You see those articles everywhere, and AMD has yelled at the top of its lungs that it isn't doing that and will keep offering the same things to enthusiasts building their own machines. I don't know why people act like Intel is the be-all and end-all; they definitely are not, yet the articles fail to mention that. There is absolutely nothing wrong with AMD components.

    19. Re:Ouch by Anonymous Coward · · Score: 0

      Let me elaborate on that for a second. When I say "literally" I mean LITERALLY. Literally I actually mean the thing that I said.

    20. Re:Ouch by Anonymous Coward · · Score: 0

      Nobody buys expensive stuff just because it works. Duh.

    21. Re:Ouch by tigersha · · Score: 1

      Most people do not read the TITLE of the article before they start to bash. Never mind the summary.

      --
      The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
  2. This is why the equipment should be heterogeneous by eksith · · Score: 4, Insightful

    Whether it's your brand of switch, motherboard or even memory, never have the same across all machines if you can help it. The only time I'd recommend the same brand would be hard drives (due to concurrency issues), but then at least try to get them from different batches. If your lot of mobos will only handle one brand of memory for whatever reason even when CAS latency is identical, then have two machines doing whatever it is you need to be doing.

    One kind of anything makes it easier to kill you swiftly in the end, whether it's by a ping of death or a biological disease.

    --
    If computers were people, I'd be a misanthrope.
  3. Sping Break '13 by Anonymous Coward · · Score: 0

    Crazy sping flashbacks :)

  4. They don't do VOIP by Anonymous Coward · · Score: 0

    With a certain manufacturer's VOIP phone system a single properly crafted packet will force all phones to reset and reboot. There are little "issues" like that all throughout networking and computers. Find and patch is the order of the day

  5. QOTD by jlv · · Score: 4, Funny

    ``Life is too short to be spent debugging Intel parts.''
                                    -- Van Jacobson

    1. Re:QOTD by ACluk90 · · Score: 5, Funny

      Maybe that was what the guys at Intel thought.

  6. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0, Insightful

    That's the dumbest sh*t I've ever heard .. idiot

  7. Three Strikes... I'll Pass by Anonymous Coward · · Score: 0

    Anonymous submitter... strike one
    Summary linked to random blog... strike two
    Sensationalist language in summary... strike three

    I think I'll take a pass on this story.

    I would be very surprised if this actually contains useful news and isn't someone trying to be an attention whore....

    1. Re:Three Strikes... I'll Pass by v1 · · Score: 5, Informative

      Oh, I think this is at least slightly interesting. I remember the "ping of death" (and pissing off a few windows heads in my sights) back in th' day.

      This is basically a DoS attack on hardware. The fact that it can get through someone's firewall makes it a bit more effective. Having your ethernet port check out every five minutes (requiring a reboot to fix) just because someone down the hall (or in Bulgaria) wants to be an ass is definitely annoying and something I'd like to know is a possibility when troubleshooting screwy network problems.

      I just got done swapping out a gigabit switch that was being wonky and slow for no obvious reason. I don't mind so much when hardware keels over and dies, but when it throws symptoms that don't immediately suggest where the problem is, those are the real time wasters. And we've come to rely on hardware generally being more reliable than software. So if my ethernet was going out when I VOIP'ed, I might have spent (wasted) a lot of my time troubleshooting the VOIP software.

      --
      I work for the Department of Redundancy Department.
    2. Re:Three Strikes... I'll Pass by Anonymous Coward · · Score: 0
      Anonymous user... strike one
      Didn't read the Article... strike two
      Arrogant attitude in post... strike three

      I think I'll flash a mirror at your bourgeoisie logic, since I'm bored.

    3. Re:Three Strikes... I'll Pass by localman57 · · Score: 4, Insightful

      It's actually a pretty good write-up with a nice trace of his troubleshooting. If my customers gave me bug reports that included a tenth of the level of detail he does in the article, I'd be over the moon.

    4. Re:Three Strikes... I'll Pass by Anonymous Coward · · Score: 0

      The problem with hardware being more reliable than software is this: Hardware these days doesn't do anything without software.

    5. Re:Three Strikes... I'll Pass by TheLink · · Score: 3, Insightful

      Hardware is just what you call something YOU don't configure/patch much even if someone else does :).

      To a PHB everything might be hardware. To a HDD maker HDDs aren't hardware, same for CPU makers and their CPUs.

    6. Re:Three Strikes... I'll Pass by Anonymous Coward · · Score: 0

      It's actually a pretty good write-up with a nice trace of his troubleshooting. If my customers gave me bug reports that included a tenth of the level of detail he does in the article, I'd be over the moon.

      I'd rather just replace the NIC....

      after a powercycle of course.

  8. Re:Online Income by Anonymous Coward · · Score: 5, Funny

    http://www.cloud65.com/ just as Marcus answered I didnt know that a mother can profit $8765 in four weeks on the computer. did you read this webpage

    I think the NIC packet of death might be just what you need.

  9. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 3, Informative

    Agreed. OP clearly has no experience managing large server installations.

  10. Re:Online Income by WilyCoder · · Score: 4, Funny

    Listen here my friend, has anyone really been far even as decided to use even go want to do look more like?

  11. Re:This is why the equipment should be heterogeneo by vlm · · Score: 4, Insightful

    the drivers are certified to work

    LOL this is a firmware bug, you can lock up the hardware even with no OS booted. Hilarious.

    and you get real support

    Yeah I love being told to reinstall windows on my linux boxes. Those guys sure are helpful !

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
  12. Re:Online Income by localman57 · · Score: 1

    I think the literal flying dagger of death might be just what he needs. And Marcus too. But before that, we'll teach him how to use capitalization and punctuation. Because it would be morally wrong to kill him before he understood these things.

  13. Re:This is why the equipment should be heterogeneo by LordLimecat · · Score: 5, Interesting

    One kind of thing makes it a zillion times easier to recognize a problem when it crops up, and makes it so you only ever have to troubleshoot an issue once.

    How much more awful would it be if something similar happened next week on more computers, and he had to troubleshoot it all over again-- not even knowing whether the machines had NICs in common?

    "Everything blew up" is a problem. "Everything blew up, I don't know why, and it will take 3 weeks to find a solution" is a huge problem. "Everything blew up AGAIN, and it will take another 3 weeks because our environment is heterogeneous" means you are out of a job.

  14. Firmware updates motherfucker, do you speak them? by Anonymous Coward · · Score: 0

    Always update all the firmware on a box before it hits production. Been there with packets of death. Always upgrade firmware. Always. Always. Always.

  15. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    This would drive up costs of support, training and troubleshooting.

  16. PAM SLAM? by tekrat · · Score: 1

    Wasn't there an old program (Nuke 'em on the Mac I think), that would send out-of-band data (whatever that was), and it would crash the TCP/IP stack on Windows NT 3.51? There was another program on Linux called Pam Slam or something like that, that would also bring down NT servers... Very popular in the early days of the web to bring down your competitor's website.

    --
    If telephones are outlawed, then only outlaws will have telephones.
    1. Re:PAM SLAM? by Anonymous Coward · · Score: 1

      As I recall, it was "WinNuke", and it was best known for killing Windows 9x systems (though it seemingly also killed Windows 3.1 and early versions of Windows NT).

  17. Re:This is why the equipment should be heterogeneo by PRMan · · Score: 4, Insightful

    or just buy premade servers from dell or HP. they aren't that much more expensive, the drivers are certified to work and you get real support

    ...and you're guaranteed that every shipment will have radically different hardware, despite having identical model numbers.

    --
    Peter predicted that you would "deliberately forget" creation 2000 years ago...
  18. Re:This is why the equipment should be heterogeneo by Zeromous · · Score: 1, Funny

    if ($uid -ge 1000000) || ($uid == "Anonymous Coward"; then
            cat $foo > /dev/null
    else
            cat $foo > $file
    fi

    --
    ---Up Up Down Down Left Right Left Right B A START
  19. Re:This is why the equipment should be heterogeneo by datapharmer · · Score: 1, Interesting

    I'm guessing you didn't buy them with Linux on them... or prove it was a hardware issue. They have no reason to support something they didn't ship. Sure, the support varies, but their pro server support is actually decent if you get the right person on the other end. I had a case where teaming 2 NICs caused Windows to eat crap and die inexplicably, and getting it back up was quite the ordeal. I couldn't even keep it stable long enough to unteam or remove the drivers (even in safe mode). Fortunately they did have documentation on the problem: a Broadcom driver had a problem with a particular firmware set when teaming was used. I managed to flash the firmware update from a USB flash drive, which got me to the point where I could at least boot into safe mode and delete the drivers, then get a working older version of the driver from Dell's site up and running and teaming reconfigured. This was on a PowerEdge R610, btw. I feel bad for the poor sap who ran into this first, and having Dell support saved me unnecessary downtime, especially since there is no mention of this problem anywhere on Broadcom's website. That said, for 99% of the issues I've ever run into, having on-site spares and a good internal KB has been far more effective than paying for Dell's support, but if it is free with the server why not use it...

    --
    Get a web developer
  20. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    The Dell / HP servers use the same chipset which is affected with the same drivers, dumbass.

  21. Re:Online Income by Redmancometh · · Score: 1

    This hurt my brain.

  22. Re:This is why the equipment should be heterogeneo by eksith · · Score: 4, Informative

    There's a good reason a lot of our equipment is slightly older. No, we don't use ancient stuff, but they're not 100% top of the line made yesterday either. And that's because each time a new mobo, memory and storage combo that looks like it's worth purchasing comes to market, the first thing we do is run a few sample sets under everything we can throw at it. Usually problems are narrowed down within the first couple of weeks or so, but that's why we have separate people just for testing equipment.

    Now admittedly, it's getting harder with this economy so we have some people doing double duty on occasion (I've had to do a bit too when the flu came rolling in), but testing goes on for as long as we think is necessary before the combo goes live. We avoid a lot of the headaches that come with large deployments by keeping changes isolated to maybe 10-15 nodes at a time. It's a slow and steady rollout of mostly similar systems (maybe 3-4 identical) that helps us avoid down time.

    We're not Google and we don't pretend to be, but common sense goes a long way to avoiding hiccups like "everything blew up". I think the biggest issue was when hurricane Sandy hit and we weren't sure if the backup generators would come online (this is a big problem with things that need fuel and oil, but stay off for a long time), so we brought in a generator truck for that too, just in case. Again, avoiding one of anything.

    --
    If computers were people, I'd be a misanthrope.
  23. Other NIC models by Dishwasha · · Score: 1

    I would be curious to know if other versions like the Intel 82576 have the same vulnerability. Maybe we should crowd source this and people can post what they've tested with and received the same behavior.

    1. Re:Other NIC models by tippe · · Score: 1

      FWIW, the 82580 doesn't seem to have this problem (that, or we have up-to-date EEPROMs that fix the issue...)

    2. Re:Other NIC models by ewieling · · Score: 2

      The Intel 82580 does not appear to have the same issue. All our network problems went away when we put in some cards based on that chip in our systems which used the Intel 82574L for the onboard LAN. Customers stopped screaming, sales stopped screaming, management stopped screaming and I was able to get some sleep.

      --
      I really shouldn't have used someone else's email address for this account.
  24. Re:This is why the equipment should be heterogeneo by rot26 · · Score: 1

    Are We the Imperial We or the Editorial We?

    Curious.

    --
    To ensure perfect aim, shoot first and call whatever you hit the target
  25. Nice debugging by Dishwasha · · Score: 4, Interesting

    I for one definitely appreciate the diligence of Kristian Kielhofner. Many years ago I was supporting a medium-sized hospital whose flat network kept having intermittent issues (and we all know intermittent issues are the worst to hunt down and resolve). Fortunately I was on-site that day and at the top of my game, and after doing some Ethereal sleuthing (what Wireshark was called at the time), I happened to discover a NIC that was spitting out bad LLC frames. Doing some port tracking in the switches, we were able to isolate which port it was on, which happened to be at their campus across the street. Of all possible systems, the offending NIC was in their PACS. After pulling the PACS off the network for a while the problem went away and we had to get the vendor to replace the hardware.

  26. Similar happened to me a few years back. by Anonymous Coward · · Score: 0

    I had a similar issue on my home network. My primary desktop would occasionally have blue screens for no apparent reason, and at odd times of day. I finally figured out that it only occurred when my HTPC was on. I remember thinking it was impossible, but as soon as I swapped out the HTPC's NIC, it never occurred again. I don't know if it was a driver issue or hardware, but it seemed like that network card was sending some sort of bizarrely malformed packet that caused my other machine to crash.

    I don't tell people about it much, because they look at me like I'm nuts.

  27. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    Actually, I think he's using the Corporate We.

  28. Re:This is why the equipment should be heterogeneo by eksith · · Score: 1

    Editorial, I assure you. :)

    --
    If computers were people, I'd be a misanthrope.
  29. Re:This was fixed years ago by omnichad · · Score: 4, Insightful

    That's not the same bug. I'd explain, but that's what you get for saying "I wish this guy had done his homework."

  30. Re:Online Income by Anonymous Coward · · Score: 0

    Syntax error.

  31. Intel deserves to suffer by Anonymous Coward · · Score: 0

    You have no idea how much time I wasted trying to fix intermittent issues with the same chip glued on to my motherboard.

    Update your bios and turn off all goddamn ASPM shit in the bios. Kernel options don't do shit.

  32. Counterfeit Intel NIC? Apparently not. by Zemplar · · Score: 4, Interesting

    I'm glad Mr. Kielhofner contacted Intel about this issue and had Intel confirm the bug.

    Some years ago I had been diagnosing similar server NIC issues, and after many hours digging, Intel was able to determine the fault was due to the four-port server NIC being counterfeit. Damn good looking counterfeit part! I couldn't tell the difference between a real Intel NIC and counterfeit in front of me. Only with Intel's document specifying the very minor outward differences between a real and known counterfeit could I tell them apart.

    Intel NIC debugging step #1 = verify it's a real Intel NIC!

  33. Re:This is why the equipment should be heterogeneo by Katmando911 · · Score: 1

    or just buy premade servers from dell or HP. they aren't that much more expensive, the drivers are certified to work and you get real support

    ...and you're guaranteed that every shipment will have radically different hardware, despite having identical model numbers.

    Sad but true. It makes it a PITA when dealing with disk images from one server to another.

  34. Replacements by phorm · · Score: 3, Insightful

    Errrr, no. Have you ever tried to deal with replacements and/or issues within a large organization where everything is different? It's hellish.

    Try tracking an issue across an enterprise of architecture when all the architecture is DIFFERENT. You also don't want to mix RAM, and drivers can be a real b**** for different motherboards. Oh, and RMA's things, not fun.

    Different brands of RAM. Yeah, you try a rack full of servers playing mix'n'match and see how well that works.

    Lastly... how many vendors/brands of enterprise gear do you think are out there, and for the ones that do exist, how well do you think they talk together? Maybe you're happy mixing HP Procurves with your Cisco stuff but I don't recommend it, and for some stuff there aren't a lot of vendors to choose from anyhow.

    1. Re:Replacements by eksith · · Score: 1

      See my reply to LordLimecat above.

      All machines get a barcode that let us pull up every component that went in, vendors, dates of installation and who touched what. For memory, I think we have 3 different vendors. Mobos are usually Asus and Supermicro with one or two Tyan. HDs are Samsung and WD with a couple more for SSDs that are special cases. Speaking of cases, we have Supermicro again and NORCO (for storage) primarily with a few Antec cases here and there.

      L3 switches are Cisco and Netgear, L2 is Netgear and Trendnet.

      --
      If computers were people, I'd be a misanthrope.
    2. Re:Replacements by bobbied · · Score: 1

      Let me guess, the Cisco Switches are the Small Business Series, right?

      Hey, the Cisco Small Business stuff isn't all that bad... At least the older stuff from a few years ago is OK. While I admit the Cisco/Netgear hardware suffers from a higher failure rate, you can easily buy two of them for every main line Cisco switch and have some change left over. Now I'm not saying they are *easier* to configure, but most of us don't make a habit of changing switch configurations all the time.

      --
      "File to fit, pound to insert, paint to match" - Aircraft Maintenance 101
  35. Re:This was fixed years ago by Anonymous Coward · · Score: 0

    I am curious, and not the same coward. Care to explain?
    P.S. I had no homework ;)

  36. "No reason" by SuperKendall · · Score: 1

    They have no reason to support something they didn't ship.

    They shipped you hardware. Therefore they need to support THE HARDWARE.

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  37. Re:This is why the equipment should be heterogeneo by ls671 · · Score: 1

    line 2: -ge: command not found

    with $uid set to 1000001:
    line 2: 1000001: command not found

    The condition is always false and the user never goes to /dev/null

    I guess you need to use brackets, in bash at least...

    --
    Everything I write is lies, read between the lines.
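
For reference, the bracket fix suggested above could look something like the following. This is only a sketch in POSIX shell: the function name `comment_destination` and its arguments are hypothetical, and the 1000000 cutoff is simply carried over from the joke snippet. The string comparison is tested separately from the numeric one so that `-ge` never sees a non-numeric value.

```shell
# Hypothetical helper: print where a comment from poster $1 would be written.
# $2 is the normal output file; the 1000000 cutoff comes from the joke snippet.
comment_destination() {
    uid=$1
    file=$2
    # Only compare numerically when uid is all digits, so -ge never errors.
    case $uid in
        ''|*[!0-9]*) is_number=no ;;
        *)           is_number=yes ;;
    esac
    if [ "$uid" = "Anonymous Coward" ] || { [ "$is_number" = yes ] && [ "$uid" -ge 1000000 ]; }; then
        printf '%s\n' /dev/null
    else
        printf '%s\n' "$file"
    fi
}
```

The original intent would then read roughly as `cat "$foo" > "$(comment_destination "$uid" "$file")"`.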
  38. Re:This is why the equipment should be heterogeneo by James-NSC · · Score: 1

    Not to speak for OP, but there is a hint of logic in there. It wouldn't apply at farms where hegemony translates into resiliency, but it would apply in situations where resiliency results in the ability to withstand faults without replacing anything. Military and other tier one instances come to mind.

    "Over specialize and you breed in weakness"
    - Major Kusanagi Motoko

  39. That's crazy stuff by Just+Brew+It! · · Score: 1

    Intel NICs have (or at least had...) a very good reputation for performance and stability. Maybe this is a sign that their QA is starting to slip?

    1. Re:That's crazy stuff by SIGBUS · · Score: 1

      Maybe this is a sign that their QA is starting to slip?

      It wouldn't surprise me (but the problem may go well beyond Intel). Of the three motherboards that have ever failed on me over 30+ years, two were Intel (a D101GGC and a DG43NB). Neither of them were ever run from crap PSUs, and both had blown capacitors. Even weirder, the caps were all from respected capacitor firms (Nippon Chemi-Con on the DG43NB and Matsushita on the D101GGC). I guess it's a Good Thing that Intel is exiting the motherboard business.

      The third board that failed was an Abit KT7-RAID... one of the early infamous examples of capacitor plague.

      --
      Oh, no! You have walked into the slavering fangs of a lurking grue!
    2. Re:That's crazy stuff by Just+Brew+It! · · Score: 1

      Maybe whoever was building their boards for them got some counterfeit caps. In my experience, the worst brands for capacitor problems were MSI, Abit, and FIC. ECS was notorious as well, but I never owned one personally. But apparently nobody was immune; I've had a couple of Asus boards that developed cap issues, as well as other random gear from that era (Netgear Ethernet switches/routers, etc.)

  40. Re:This is why the equipment should be heterogeneo by James-NSC · · Score: 1

    %$#@! auto-correct, I meant: homogeneous

  41. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 2, Informative

    You will have to update the kernel, though. The Linux e1000 and e1000e drivers have a fuckload of hardware bug workarounds, and the ASPM thing did hit some people recently. You *must* have ASPM L0s and L1 disabled on the Intel NIC *and* its parent PCIe bridge, and the kernel driver usually will only be able to disable it on the NIC itself. So the NIC can hang if the BIOS is crap and leaves ASPM L0s or L1 enabled on the bridge, or if it has a crap NIC EEPROM image that causes issues with the 128b/256b maximum PCIe packet size. (That last one can be fixed by Linux, *if* you give it a specific parameter. No idea why it isn't automatic, since it is major utter braindamage by the BIOS that is known to hang the box hard sometimes.)
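
Whether ASPM is actually still enabled on the NIC or its bridge can be checked from a running system. A rough sketch, assuming a Linux box with `lspci` from pciutils, and assuming its `-vv` output prints one `LnkCtl:` line per PCIe device reading either `ASPM ... Enabled` or `ASPM Disabled`. The function name is made up for illustration.

```shell
# Read `lspci -vv` text on stdin and print the header line of every device
# whose LnkCtl line does not say "ASPM Disabled".
# Assumption: device header lines start at column 0 with a hex bus address,
# and detail lines (including LnkCtl) are indented.
aspm_enabled_devices() {
    awk '
        /^[0-9a-f]/ { dev = $0 }                           # remember current device
        /LnkCtl:/ && /ASPM/ && !/ASPM Disabled/ { print dev }
    '
}
```

Typical use would be `sudo lspci -vv | aspm_enabled_devices`, looking for the 82574L and the bridge above it in the output. The kernel's `pcie_aspm=off` boot parameter is one general way to ask Linux to turn ASPM off. Whether that is enough on a given board is a separate question, and it is not necessarily the parameter the comment above has in mind.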

  42. Really?!?! by Anonymous Coward · · Score: 4, Funny

    So by "bring down" you didn't just mean bring down, implying it was brought down, but you meant "BRING DOWN" (notice the caps), implying it was brought down (notice the italics). Such a critical distinction. If it was merely "brought down" this would hardly have been an issue. You could have simply ignored the dead router. As it stands, being brought down, this is a real problem, and you cannot ignore the dead router. Good job!

    1. Re:Really?!?! by Anonymous Coward · · Score: 0

      The article has that bit and the emphasis really illustrates the sheer astonishment that the author experienced while debugging this problem. I know I wouldn't stop going "WTF? I mean What The FUCKING Fuck?" and that wouldn't even come close to expressing my amazed bewilderment if I came across a bug like that. The NIC, without any interaction with the host machine, loses link and can't be brought up again without powercycling the machine, and this is triggered when it sees one of two numbers in a particular position somewhere in the middle of an Ethernet frame, unless it has seen particular other numbers in that spot before, but not any other number. If you manage to find a bug like that, you absolutely positively have to find a way to express how unbelievably far beyond weird that is.
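
To make the trigger logic described above concrete, here is a toy Python model. It is only a sketch of the reported behaviour, not anything resembling the real firmware. The offset and byte values are recalled from the linked blog post and should be treated as assumptions, and `ToyNic` and `frame_with` are hypothetical names invented for this example.

```python
# Offset and byte values are assumptions recalled from the blog post,
# not verified constants; adjust them before relying on this.
TRIGGER_OFFSET = 0x47F
KILL_VALUES = {0x32, 0x33}   # ASCII '2' and '3'
INOCULATE_VALUES = {0x34}    # ASCII '4'


class ToyNic:
    """Toy model of the reported behaviour, not real NIC firmware."""

    def __init__(self):
        self.link_up = True
        self.inoculated = False

    def receive(self, frame):
        """Feed one frame (bytes); return True while the link is still up."""
        if not self.link_up or len(frame) <= TRIGGER_OFFSET:
            return self.link_up
        value = frame[TRIGGER_OFFSET]
        if value in INOCULATE_VALUES:
            self.inoculated = True
        elif value in KILL_VALUES and not self.inoculated:
            self.link_up = False  # in the report, only a power cycle recovers
        return self.link_up


def frame_with(value, length=0x500):
    """Build a dummy all-zero frame with `value` at the trigger offset."""
    frame = bytearray(length)
    frame[TRIGGER_OFFSET] = value
    return bytes(frame)
```

Feeding `frame_with(0x32)` to a fresh `ToyNic` drops the link, while feeding `frame_with(0x34)` first leaves it up afterwards, which is the "unless it has seen particular other numbers in that spot before" part.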

  43. Re:This is why the equipment should be heterogeneo by Just+Brew+It! · · Score: 1

    Whether it's your brand of switch, motherboard or even memory, never have the same across all machines if you can help it. The only time I'd recommend the same brand would be hard drives (due to concurrency issues), but then at least try to get them from different batches.

    ...and then along comes something like the Seagate 7200.11 firmware bug from a few years back, which caused all drives of several related models to self-brick after a period of time.

  44. Re:This was fixed years ago by LordLimecat · · Score: 2

    The Ubuntu bug had to do with bad drivers and/or firmware; when the affected distro was installed on a computer with the affected NIC (which was the completely different e1000), it would render that NIC unusable, even after reboots.

    This bug appears to be triggered by receiving a crafted packet, remotely, and is fixable with a reboot. It also affects a different nic.

  45. Re:Online Income by couchslug · · Score: 1

    I see that beautiful 4chan meme lives.

    Your post made my day.

    --
    "This post is an artistic work of fiction and falsehood. Only a fool would take anything posted here as fact."
  46. no public fix by SuperBanana · · Score: 2

    Too bad Intel gave a fix to them (a fix they ultimately couldn't use), but hasn't to anyone else.

    Too bad Intel has also apparently known about the problem for months now.

    "Intel has been aware of this issue for several months. They also have a fix. However, they haven't publicized it because they don't know how widespread it is."

    Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.

    1. Re:no public fix by TheLink · · Score: 2

      Maybe the spooks told them to keep the bug unfixed in the wild ;).

    2. Re:no public fix by sumdumass · · Score: 1

      I don't think it is too bad. I'm going to use this for the router log in screen and expose it to the world.

      Imagine how many script kiddies and infected drones will go down when they probe my ports and try to connect. I can imagine the look on their faces when they lose connection for trying to "hack" into someone's network.

    3. Re:no public fix by BitZtream · · Score: 2

      Bullshit. I bet they were hoping to very quietly roll it into a driver update and have it all go away.

      Yes, that would be ideal for everyone. If it just silently went away, that also means it wasn't much of a problem ... that means it didn't get used as a massive exploit. That's good.

      There is no scenario where 'going away' is a bad thing, unless you're just all angsty and looking for a reason to tell 'the man' how much he sucks.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  47. previous less lethal but similar tg3 bug by Anonymous Coward · · Score: 1

    I ran into a tg3 bug where the tg3 firmware took the byte value that it expected for a destination port number and redirected the UDP packets with that value at that location to the BMC/SMDC/IPMI card (as designed). The issue was that the firmware did not appear to understand that a UDP datagram could be up to 64k, so up to 40 1500-byte packets, and was always looking for the destination port in all packets (not just the first, as it should have been), so if the data in the packet matched the expectations, those packets never got to the OS.

    This caused a client to have to move their network port on the machine to the 2nd port (on 200 machines), which did not have the firmware bug in it, and this caused us to find another odd firmware bug. If one uses jumbo frames and explicitly routes to a certain set of nodes with smaller packets (to correct someone else's network bug where they sometimes report the wrong MTU size), then the firmware feature that puts packets together nicely "helps" you: it puts the 6 1500's (that the route explicitly broke up) back together and attempts to send them on, as the firmware does not have the complicated set of rules around MTU size that the OS does.

    The Broadcom guy I talked to (and he was definitely off-shored) was an ID10T, and claimed there was nothing wrong...even though we could generate a valid Linux UDP NFS packet every time that would never get to the OS and completely stop NFS from working. The client found it because one of their data streams was running into this feature pretty consistently if the file offsets lined up such that the data had the certain magic value that the fw expected at the right location.

    And at the end of the day the real issue is the firmware is poorly documented and appears to be poorly tested and reviewed and is terribly important for stability.
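
The first bug above (matching the destination-port position in every fragment instead of only the first) can be sketched in a few lines of Python. This is a toy model under stated assumptions: port 623 is the usual IPMI/RMCP UDP port but the post does not name the port involved, offsets are plain byte offsets rather than the 8-byte units real IP uses, and all function names are hypothetical.

```python
import struct

BMC_PORT = 623       # assumed: usual IPMI/RMCP UDP port; the post names no port
FRAG_PAYLOAD = 1480  # IP payload bytes per fragment with a 1500-byte MTU


def fragment_udp(src_port, dst_port, data):
    """Split one UDP datagram into (byte_offset, payload) fragments."""
    header = struct.pack("!HHHH", src_port, dst_port, 8 + len(data), 0)
    datagram = header + data
    return [(off, datagram[off:off + FRAG_PAYLOAD])
            for off in range(0, len(datagram), FRAG_PAYLOAD)]


def looks_like_bmc_port(payload):
    """True if bytes 2..3 of the payload equal the BMC port number."""
    return len(payload) >= 4 and struct.unpack("!H", payload[2:4])[0] == BMC_PORT


def buggy_steals(offset, payload):
    """Described bug: check the port position in every fragment."""
    return looks_like_bmc_port(payload)


def fixed_steals(offset, payload):
    """Only the first fragment (offset 0) actually carries the UDP header."""
    return offset == 0 and looks_like_bmc_port(payload)
```

With NFS-style data that happens to contain the bytes for 623 at the right spot in a later fragment, `buggy_steals` swallows that fragment while `fixed_steals` lets it through, and a datagram missing one fragment can never be reassembled by the OS.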

  48. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 1

    You do realize both HP and Dell commonly come with the very Intel NICs being discussed here, don't you?

    So for not much more money you get the same failures, the drivers are certified but still fail, and the real support clearly never detected or patched this problem. Yeay?

  49. Is this a bug? Maybe not by WindBourne · · Score: 1

    Maybe Intel, or possibly the nation where the manufacturing happens, added code to the chip to respond to a highly unlikely sequence. Then, when you need to kill a large number of computers, simply hit various web servers, sending in the required packet. Now, if a nation is protected by a firewall, well, then this approach will not be that useful. However, if other nations do not have a centralized firewall/router, then it can be used to take down a nation.

    --
    I prefer the "u" in honour as it seems to be missing these days.
  50. Re:This is why the equipment should be heterogeneo by JustOK · · Score: 1

    Or the Nintendo Wii?

    --
    rewriting history since 2109
  51. Re:This was fixed years ago by dissy · · Score: 1

    I wish this guy had done his homework. This was fixed a long time ago:
    http://blogs.computerworld.com/when_linux_does_well_the_e1000e_ethernet_bug_fixed

    I am amazed that you got a patched e1000 driver working with a 82574L based piece of hardware... mighty impressive hacking! You should have written up a report on how you managed it for the rest of us to study as homework.

  52. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    One kind of anything makes it easier to kill you swiftly in the end, whether it's by a ping of death or a biological disease.

    So you're saying we should develop technology similar to Sixth Day? Otherwise, if something happens to you, we need another eksith.

  53. Re:Counterfeit Intel NIC? Apparently not. by countach · · Score: 4, Insightful

    It's just an Intel support strategy. Release NICs with random and minor outward differences. When you have a support issue, say that it is counterfeit. Really cuts support costs!

  54. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 1

    There is a big difference between corporate support and customer support.
    Apparently you never worked in a corporate environment. If the problem can't be resolved by phone, HP, Lenovo, hell even Dell would send a technician on site.

  55. It's not that hard to imagine. by Anonymous Coward · · Score: 0

    Early on, I had the task of implementing an RFC from scratch. To my surprise the RFC was not any kind of EBNF; it was just some English "talking" which left some room for interpretation, or at least didn't exclude a lot of possibilities. The result was that my implementation worked perfectly, posting well-formed requests to the server and receiving and processing the expected payload back, except for the fact that any Netscape server I aimed it at was immediately taken down and, like the article said, when I say *down* I mean needs to be rebooted. Since at that time Netscape had 95% of the server market or more, I essentially had an internet-wide, universal death ray which, if I were nefariously inclined, I might have been able to leverage into an early retirement.

  56. Re:This is why the equipment should be heterogeneo by Culture20 · · Score: 1

    What you want is some homogeneity in sections, but heterogeneity between sections, so you're not brought completely to your knees when a bug like this is exploited, but you still have copies of hardware for part-swapping tests or frankensteining old servers.

  57. Re:This is why the equipment should be heterogeneo by wbr1 · · Score: 1

    tl;dr: monocultures suck.

    --
    Silence is a state of mime.
  58. Re:This is why the equipment should be heterogeneo by UnknownSoldier · · Score: 1

    Sounds like you've found the balance point between bleeding edge (things are broken/buggy) and outdated (no longer supported/available).

    I wish more people would favor this approach. It would save money and time down the road, i.e. Planned Upgrade Path.

  59. Re:This is why the equipment should be heterogeneo by antdude · · Score: 1

    Also, cheaper with older ones. I also don't buy the latest stuff. I want the stable and cheap ones. Also, older stuff has its issues worked out and known. I stopped being first in line unless I get paid to use and test. :P

    --
    Ant(Dude) @ Quality Foraged Links (AQFL.net) & The Ant Farm (antfarm.ma.cx / antfarm.home.dhs.org).
  60. MAC address take-down by Anonymous Coward · · Score: 4, Interesting

    A number of years ago I discovered that you can take down many routers, and Windows / Linux hosts by sending an ARP response that says "IP 0.0.0.0 is at MAC FF:FF:FF:FF:FF:FF". When you direct this packet to the access point in a wireless network, this makes the SSID broadcast disappear and the whole device go down. Never posted this until now, I wonder if this still works on modern devices.

  61. Re:This is why the equipment should be heterogeneo by viperidaenz · · Score: 1

    They have no reason to support something not in an SLA. What they support is only loosely related to what they shipped you.

  62. Re:Counterfeit Intel NIC? Apparently not. by Anonymous Coward · · Score: 0

    and you trusted who, to determine that it was a counterfeit? - Intel you say?

    Right. To paraphrase Steven Wright: "Everything in my apartment was replaced with an exact duplicate"

  63. Re:Online Income by Anonymous Coward · · Score: 0

    It's a really good troll post, because many of the three words subphrases are actually grammatically valid, so your brain wants to believe that its just a somewhat broken sentence, hopefully just reading more and harder will make the earlier parts fit together correctly. But then the (strategically placed, I might add) really bad errors just really really hurt. To make things better, the end of the sentence is very close to reasonable, so if you get there, you think it might make sense to read it again. Which just amplifies the pain. Ugh.

  64. Triggering some monitor mode? by cpghost · · Score: 1

    Maybe, just maybe, some frames could trigger an internal monitoring or debugging mode on the controller? Sometimes, manufacturers would want to remotely diagnose hardware, and that could be a way to do it. Of course, it could also be something else, much more sinister like, say, some obscure government backdoor. Not saying that this applies to this particular case, but since most silicon designs aren't open source, we can't be sure there's no such thing in there, lurking, waiting to be activated.

    --
    cpghost at Cordula's Web.
  65. Re:This was fixed years ago by whoever57 · · Score: 1

    This bug appears to be triggered by receiving a crafted packet, remotely, and is fixable with a reboot. It also affects a different nic.

    According to TFA (I know, WTF, I actually read TFA?), a reboot does not fix this problem, but a power cycle does.

    --
    The real "Libtards" are the Libertarians!
  66. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    the 82574 is already 5 years old!

    http://ark.intel.com/products/36920/Intel-82574IT-Gigabit-Ethernet-Controller

  67. Re:Online Income by Anonymous Coward · · Score: 0

    You’ve got to be kidding me. I’ve been further even more decided to use even go need to do look more as anyone can. Can you really be far even as decided half as much to use go wish for that? My guess is that when one really been far even as decided once to use even go want, it is then that he has really been far even as decided to use even go want to do look more like. It’s just common sense.

  68. Re:This is why the equipment should be heterogeneo by cusco · · Score: 1

    Eventually. After you run their diagnostic, which spends 6-10 hours checking every sector of every hard drive among other things, send them the resulting diagnostic file, and wait until they decide that the bad memory you told them about was really the issue after all, THEN the clock starts running on the premium "4 hour guaranteed" support.

    --
    "Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
  69. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    In his case that is a feature.

  70. Re:This is why the equipment should be heterogeneo by aiht · · Score: 1

    line 2: -ge: command not found

    with $uid set to 1000001:
    line 2: 1000001: command not found

    The condition is always false and the user never goes to /dev/null

    I guess you need to use brackets, in bash at least...

    You're right, but bash doesn't even enter the picture:
    $ ls -l `which [`
    -rwxr-xr-x 1 root root 35264 Nov 20 06:25 /usr/bin/[

    The program is called [ and it complains if its last argument is not a ], so you need the square brackets no matter which shell you use.
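
A quick way to see that `[` really is a standalone program with ordinary arguments is to run it without any shell at all. The Python sketch below does that with `subprocess`; it assumes a `[` executable exists on your PATH (as in the `ls` output above), and `run_bracket` is just a name picked for this example.

```python
import shutil
import subprocess


def run_bracket(*args):
    """Run the external `[` directly (no shell); return (exit status, stderr)."""
    exe = shutil.which("[")
    if exe is None:
        raise FileNotFoundError("no standalone '[' executable on PATH")
    proc = subprocess.run([exe, *args], capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()


if __name__ == "__main__":
    print(run_bracket("1000001", "-ge", "1000000", "]"))  # comparison is true
    print(run_bracket("5", "-ge", "1000000", "]"))        # comparison is false
    print(run_bracket("5", "-ge", "1000000"))             # closing ] left off
```

The first two calls should differ only in exit status (0 for true, 1 for false), and the third should fail with a status above 1 and a complaint about the missing `]` on stderr, with no shell involved anywhere.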

  71. Re:Counterfeit Intel NIC? Apparently not. by Anonymous Coward · · Score: 0

    That would be straight fraud. Maybe worth a Funny mod, but +5 Insightful? What the hell, moderators?

  72. Re:Firmware updates motherfucker, do you speak the by Cramer · · Score: 3, Insightful

    And where do I get this mythical "firmware update" for the NIC CHIP? I'm sure the chip has code in it, but I've never even heard of a utility from Intel to update the in-chip code in a nic. (it's called "microcode", not firmware)

  73. Disable Power Management by zandeez · · Score: 1

    IIRC, this is a known issue for certain chipsets; disabling power management in the BIOS for the PCI-E port the interface is attached to is the known work-around.

  74. Re:Counterfeit Intel NIC? Apparently not. by kakaburra · · Score: 2

    say that it is counterfeit. Really cuts support costs!

    Also cuts sales volume. If someone tells me there are lots of counterfeits that look almost like original, I'd stop buying.

  75. Re:This is why the equipment should be heterogeneo by fuzzywig · · Score: 1

    No. In my old job we had the four hour gold support on our servers (all 5 of them, we weren't a huge customer).
    It typically took less than five minutes on the phone with a knowledgeable techie before they escalated and sent us an engineer. Usually the parts and engineer would arrive within two hours.
    Dell consumer support might be shite, but their business support is bloody good.

  76. Re:Online Income by Anonymous Coward · · Score: 0

    If you're happy and you know it

    Syntax error.

  77. Re:This was fixed years ago by Anonymous Coward · · Score: 0

    Haha, i was the same coward. I tricked you into doing my homework.

  78. Re:This is why the equipment should be heterogeneo by tigersha · · Score: 1

    Same here with HP. Technician driving 250 km through the Black Forest and heavy traffic, there in 3 hours.
    And we were an NGO with one server only.

    --
    The dangers of excessive individualism are nothing compared to the oppressiveness of excessive collectivism
  79. Re:This is why the equipment should be heterogeneo by Zeromous · · Score: 1

    Was just end of day, I'm totally checked out pseudo code- relax gents. But I should have known better if I was to be snarky in bash:

    if [ $uid -ge 1000000 ] || [ $uid == "Anonymous Coward" ]; then
                    cat $foo > /dev/null
    else
                    $foo > $file
    fi

    --
    ---Up Up Down Down Left Right Left Right B A START
  80. Re:This is why the equipment should be heterogeneo by Zeromous · · Score: 1

    the second condition should probably be another variable too like $uid_cn or such.

    --
    ---Up Up Down Down Left Right Left Right B A START
  81. Re:This is why the equipment should be heterogeneo by cusco · · Score: 1

    You probably had a different issue than we did (random almost-daily blue screens) on one of 72 identically configured and imaged R510 servers we deployed as NVRs in half a dozen data centers. The tech that came took out all the RAM and reseated it (which I had already done) and told me the problem was gone. He'd put the DIMMs back in the wrong slots though, entailing a new set of diagnostics, another visit to put the same damn bad DIMM back in its original slot, three more days of blue screens, and finally another visit to replace it.

    My current headache is a R210 that apparently had the wrong image put on it and blue screens out of the box. Took them 4 1/2 weeks to get us the replacement, and it has the same incorrect image. The techs on site shipped it to us, I put the correct image on it and it's happy. It's now back on its way to the customer site to get installed.

    Still better than my previous experiences with Compaq business support, though.

    --
    "Think about how stupid the average person is. Now, realise that half of them are dumber than that." - George Carlin
  82. Packet captures by n7ytd · · Score: 1

    I started stopping the packet captures as soon as the interface dropped

    Yes, that's usually when my packet captures stop, too.

  83. Re:This is why the equipment should be heterogeneo by Zero__Kelvin · · Score: 1

    "LOL this is a firmware bug, you can lock up the hardware even with no OS booted. Hilarious."

    Since no OS is grabbing the data from the buffer how do you plan on shifting the bytes through it in order to trigger the flaw? While it is not entirely outside the realm of possibility, it is unlikely this bug would be exposed without a driver on the host interacting with chipset and putting the firmware through its paces.

    --
    Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
  84. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    I think you will find that Dell NICs tend to be Broadcom.

  85. ibtimes.co.uk website fix by kelemvor4 · · Score: 1

    Rules for adblock plus:
    ibtimes.co.uk##.ibt_con_artaux.f_rht
    ibtimes.co.uk###bg_header
    ibtimes.co.uk##.fb-like.fb_edge_widget_with_comment.fb_iframe_widget
    ibtimes.co.uk##.twitter-follow-button.twitter-follow-button
    ibtimes.co.uk##IMG[style="border:0;width:20px;height:20px; margin-top:-10px;"]
    ibtimes.co.uk###scrollbox
    ibtimes.co.uk###taboola-grid-3x2
    ibtimes.co.uk##.f_lft.morebox
    ibtimes.co.uk###wrap_bottom
    ibtimes.co.uk##.bk_basic.bk_disqus

  86. likely talking about the NVRAM on the NIC by Chirs · · Score: 1

    When the NIC powers up the first thing it does is load a bunch of default settings from a chunk of nonvolatile memory (also sometimes called the EEPROM).

    You can reprogram the EEPROM using tools from the vendor, or if the driver supports it you can do it under Linux using ethtool.

  87. seem still stable to me...good linux drivers by Chirs · · Score: 1

    I've worked a fair bit with the latest 1-Gig and 10-Gig parts (i350 and 82599). They seem pretty decent and stable, good enough for telecom use, though like all chips they do have a list of errata.

    The developers are fairly active about updating the linux drivers in the core kernel as well as on sourceforge. The new chips (the 10-gig one especially) are very flexible but this means the drivers are getting a lot more complex than they used to be. (The programming manual for the 82599 is 900 pages.)

  88. The offending packet by cloudshark · · Score: 1

    We posted the offending packet on CloudShark, with links to all of Kristian's articles. Check it out here: http://appliance.cloudshark.org/news/cloudshark-in-the-wild/intel-packet-of-death-capture/

  89. Re:This is why the equipment should be heterogeneo by drsmithy · · Score: 1

    ...and you're guaranteed that every shipment will have radically different hardware, despite having identical model numbers.

    Bought a lot of HP and Dell hardware, never seen this happen. What server models and what hardware ?

  90. Don't bash by Anonymous Coward · · Score: 0

    Don't bash the guy, be thankful that some people still care enough to share findings.

  91. Re:This is why the equipment should be heterogeneo by Anonymous Coward · · Score: 0

    ...and then along comes something like the Seagate 7200.11 firmware bug from a few years back, which caused all drives of several related models to self-brick after a period of time.

    Failure analysis geekery / nitpicking alert!

    It was very far from all such drives. In fact, despite the PR shitstorm, relatively few users experienced a bricking. A Seagate employee provided some technical information about it:

    http://it.slashdot.org/comments.pl?sid=1098793&cid=26542735

    The drive firmware implemented a fixed-size rotating health log with 320 entries. During operation, it occasionally overwrote the oldest entry with a new one. My impression is that log rollover might take several days or more during typical operation. If the drive happened to be shut down while the current log entry was the 320th one (just about to roll over), the buggy firmware would brick on the next powerup. In other words, they had a classic forehead-slapper of a fencepost bug in the startup log scanning code.

    That gave a low percentage chance to brick on any given power cycle (less than 1 in 320, influenced by the frequencies of log rollover and power cycling), but it wasn't guaranteed to happen. The only way to make it inevitable would've been to power cycle much faster than the frequency of writing log entries, such that you'd be guaranteed to power down at each possible offset, in sequence. But essentially nobody cycles 3.5" desktop drives that much. (On the flip side, a more common use model is 24/7 operation, in which case these drives would never brick.)

    Seagate probably wishes they would've been so lucky as to have that bug hit 100% of the drives on a fixed time schedule. They'd likely have caught it before shipping any drives, and even if not, still would've begun seeing it on internal engineering test drives weeks or months before end users. Instead, it was the kind of low probability failure which they never saw at all because the dice never came up right -- for them. Once millions of units are in customer hands, though, there are millions of dice being rolled...
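
For a rough feel of the odds, here is a small Python sketch of the failure model described above. It assumes the fatal case is a shutdown with the log pointer on the last of the 320 slots and that the pointer position at shutdown is uniformly random, which gives the 1-in-320 upper bound; the function names are made up for this example and the real firmware logic is not public here.

```python
LOG_ENTRIES = 320  # size of the rotating log, per the description above


def bricks_on_next_powerup(entries_written):
    """True if shutdown happens with the log pointer on the last slot.

    Assumption: the pointer is entries_written modulo LOG_ENTRIES and the
    fatal position is the final slot; the exact trigger is paraphrased.
    """
    return entries_written % LOG_ENTRIES == LOG_ENTRIES - 1


def chance_per_power_cycle():
    """Fraction of pointer positions that are fatal (uniform assumption)."""
    fatal = sum(bricks_on_next_powerup(n) for n in range(LOG_ENTRIES))
    return fatal / LOG_ENTRIES


def chance_after(power_cycles):
    """Probability of at least one fatal shutdown in N independent cycles."""
    p = chance_per_power_cycle()
    return 1 - (1 - p) ** power_cycles


if __name__ == "__main__":
    for cycles in (0, 10, 100, 1000):
        print(cycles, "power cycles ->", round(chance_after(cycles), 4))
```

Under these assumptions a drive that is never power cycled has zero risk, and even a hundred power cycles leaves most drives alive, which fits the "millions of dice being rolled" picture better than "all drives self-brick".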

  92. Re:This is why the equipment should be heterogeneo by Kynde · · Score: 1

    And while it's tempting to write ==, it should be just a =

    I won't admit how many years of unix/linux it took for me to notice that, given that few shells bother to complain about it.

    --
    1 Earth is warming, 2 It's us, 3 it's royally bad, 4 we need to take action NOW
  93. Re:This is why the equipment should be heterogeneo by ls671 · · Score: 1

    Makes sense, I always wondered why you needed spaces after the [ and before the ]. Makes sense if they are program arguments. Good one!

    But in reality, bracket support has been built into bash for eons, mostly for optimization purposes. The same is true of other functionality where the legacy executable is still present on the system but not needed.

    ~# which [
    /usr/bin/[
    ~# ls /usr/bin/ttt
    ls: cannot access /usr/bin/ttt: No such file or directory
    ~# mv /usr/bin/[ /usr/bin/ttt
    ~# which [
    ~# if [ 1 = 1 ] ; then echo true; fi
    true
    ~# mv /usr/bin/ttt /usr/bin/[

    --
    Everything I write is lies, read between the lines.