HP Server Killer Firmware Update On the Loose
OffTheLip (636691) writes "According to a Customer Advisory released by HP and reported on by the Channel Register website, a recently released firmware update for the ubiquitous HP Proliant server line could disable the network capability of affected systems. Broadcom NICs in G2-G7 servers are identified as potentially vulnerable. The release date for the firmware was April 18 so expect the number of systems affected to go up. HP has not released the number of systems vulnerable to the update."
So, don't upgrade the firmware?
And this is why I wait at least 2 months before installing firmware updates (unless it's a major security issue). It's not uncommon for a firmware update to be pulled shortly after being published. The 2-month delay is generally more than enough time to ensure the update is solid.
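For what it's worth, the waiting rule above is easy to automate. Here is a minimal sketch, assuming a 60-day soak window and a manual "security fix" override; the function name, the parameters, and the dates are all made up for illustration and are not part of any HP tooling.

```python
from datetime import date, timedelta


def ready_to_install(released: date, today: date,
                     soak_days: int = 60, security_fix: bool = False) -> bool:
    """Return True once an update has aged past the soak window.

    A security fix skips the wait, matching the exception in the rule above.
    soak_days=60 is an assumed stand-in for "about 2 months".
    """
    if security_fix:
        return True
    return today - released >= timedelta(days=soak_days)


# Illustrative dates only.
print(ready_to_install(date(2014, 4, 18), date(2014, 4, 25)))  # False: one week old
print(ready_to_install(date(2014, 4, 18), date(2014, 6, 20)))  # True: 63 days old
print(ready_to_install(date(2014, 4, 18), date(2014, 4, 25), security_fix=True))  # True
```

The check only answers "has it aged long enough"; whether the update was pulled in the meantime still has to be checked by hand.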
Life is not for the lazy.
We pushed a firmware update this morning to the firewall and it's been smooth sai#*($$#[NO CARRIER]
Good people go to bed earlier.
...don't flash it.
Do admins routinely flash firmware updates in the absence of some identified need? I could see flashing an update if I was suffering from a known problem, or if the vendor identified a security flaw in a previous release. I could see flashing it if necessary to install new hardware.
I just don't see why a server admin would flash a firmware update as if it were Patch Tuesday. In the absence of a security vulnerability or production issue there is no reason to treat a firmware change as an expedited change and not perform full testing before deploying it. That isn't to say that doing some testing of security patches/etc isn't wise - but I can see why it would get rushed.
You don't flash firmware unless it is for an important issue.
Or at least not until it has been out quite some time so that other people have done your testing for you.
Aren’t they also the ones who limit their firmware updates only to customers who have support contracts? I guess you get what you pay for..
http://slashdot.org/story/14/02/05/0258244/hp-to-charge-for-service-packs-and-firmware-for-out-of-warranty-customers
Known for reliable oscillators and calculators, and then they made a line of laser printers that lasted for a while; great engineers behind all that stuff too. Yes, I remember them. How are they doing now post-Carly? (HP's calculators put Rockwell's to shame. I can still remember the Rockwell jingle from over the radio, "big green numbers, and little rubber feet.")
You can't be ahead of the curve, if you're stuck in a loop.
It didn't brick my server, but it screwed up the device list in Windows and caused a cluster not to see the one node where I upgraded the drivers/firmware. It put a null device driver into the device manager; I had to delete it and then all was OK. Just $250 to MS to figure this out, since I didn't think it was an HP issue.
On the server the network worked and all, but the NICs weren't "seen" by Windows and so the clustering was screwed up.
That's why you implement AT LEAST the change management part of ITIL. Some executive/manager will have to sign on it, either in the "do" or "don't do" checkbox.
And you fucking test these things in non-critical boxes first!
It's probably a mercy killing. Some of those poor servers were probably forced to run HP/UX. I'd want to die, if I had to run HP/UX...
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
As a network engineer, I can see being involved in arguments between the server platform support teams (read: off-shore) and the network engineering teams (read: on-shore). It'll be like this:
"We need network support on a call."
"Hello. What's wrong?"
"The entire network is down for everyone!!!!! You need to fix this!!!!! The support we get from you is horrible!!!! AAHHHHhhhhhhh!!!!!!"
"OK. What changed? What was being done at the time the entire network disappeared for everyone?"
"We (15 people on the call - it apparently takes that many) were doing nothing (to do nothing)."
"OK, well, I'm on the cores and I can see a lot of traffic, other servers, the outside world, etc. You need to define the 'everything being down' part."
"Well, we were in the middle of doing a firmware update on server xxx01 and...."
"OK, so, you lied to me about doing nothing. What did you update?"
"The NIC card to improve performance for..."
"And now you're wondering why the network is down..."
It'll go this way for some time until the next couple of layers of management get involved: lots of yelling, me sending pictures of the network working. I should write a script for this call. I know it'll be coming.
I had a server's network die on me yesterday.
Did a reinstall, and as usual used the latest SPP to upgrade everything before putting the DL380 G7 back in production.
After the flash, no network anymore on the onboard dual network cards.
This is about as bad as it gets, but still should not be the end of the world.
Always have a reasonably quick way to revert if stuff like this happens.
[I know none of the sysadmins in here need to hear this...]
Company X releases patch, patch then retracted, how did this make the front page?
Don't manufacturers test their updates? It's not like they couldn't keep some of the stuff they sell for said testing...
I've got better things to do tonight than die.
Is the article suggesting that the Broadcom NICs that HP used in the old Proliants actually _did_ work before this update?
That goes against years of experience in the field with those things.
With HP only one problem persists. Namely, it is what the money can't do.
This is one of the first rules of administering servers -- unless it's an absolute necessity, let someone else find these firmware bugs.
This is especially true now that firmware controls so much in modern hardware. I've had business PCs that have gone through more than 10 EFI revisions in their 18-month lifecycle, and all the release notes show that they fix surprisingly low-level things.
The unfortunate trend is that these firmware bugs are more and more prevalent. It seems like manufacturers are skimping on QA and testing. I'm not surprised that HP is affected -- their maintenance applications and documentation look like they're now written by an offshore team. So, I wouldn't be surprised if the EEs and SEs sitting in Houston have to write specs and have their offshore counterparts hack up the firmware changes. Worse, since they're getting the NICs from Broadcom, it's engineers --> offshore team --> Broadcom --> Broadcom's offshore team, making it even more likely that confusion will be introduced.
Funny enough, Dell has the same issue. Using their "lifecycle controller" or other in-OS methods to update the Broadcom firmware will disable the network card. The only supported way to do the upgrade is to boot from a Dell-provided ISO image (the OMSA live CD), then put the firmware on a USB stick (or virtual floppy) and install from there. I tried with the lifecycle controller and the interface flat out vanished. They said there is some sort of recovery procedure involving downgrading back to the previous firmware and then upgrading again, but I haven't tried it yet.
I deployed a DL380p Gen8 last year, and it gave me heart failure.
Under Red Hat, I needed to change the IP address, so I modified the file /etc/sysconfig/network-scripts/ifcfg-eth0 then did a "service network restart"
Alas, the box did not come up on the new IP. Got to the console which was blank and unresponsive. Power cycled, and the RAID array was GONE (and let's just say this was EXTREMELY inconvenient timing).
Support was able to walk us through some BIOS disk recovery that (thankfully) worked. But I'll never change the IP address on a Proliant without a full reboot.
Their last update bricked Broadcom blades of the G1 variety.
That's what users are for.
TL;DR version: A perfectly good firmware update was turned into a disaster by a Program Manager checking the wrong boxes on a release form.
Long version...
(1) The firmware is developed by Broadcom. HP doesn't have the source code. They may not even have anybody left on their staff who would be qualified to work on it (HP laid off most of their NIC engineering team years ago).
(2) HP uses a lot of different Broadcom NICs in various ProLiant servers. This firmware update was intended for a subset of those servers, and was tested on that intended subset.
(3) HP is responsible for packaging the firmware for release. Somebody screwed up during that process, and set the package metadata so the update got installed on a much wider range of servers.
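The failure described in (2) and (3) is the kind of thing a release gate could catch: compare what the package metadata claims to target against what was actually tested. The sketch below is hypothetical. The function names and the model labels are invented for illustration, and it does not reflect HP's real packaging format or release process.

```python
from typing import Iterable, List, Set


def untested_targets(declared: Iterable[str], tested: Set[str]) -> List[str]:
    """Return targets the package metadata claims but that were never tested."""
    return sorted(set(declared) - tested)


def check_release(declared: Iterable[str], tested: Set[str]) -> None:
    """Refuse to release a package whose metadata is wider than the tested set."""
    extra = untested_targets(declared, tested)
    if extra:
        raise ValueError("package targets untested hardware: " + ", ".join(extra))


# Hypothetical labels, not real HP or Broadcom identifiers.
tested = {"nic-model-a", "nic-model-b"}
declared = ["nic-model-a", "nic-model-b", "nic-model-c"]

print(untested_targets(declared, tested))  # ['nic-model-c']
try:
    check_release(declared, tested)
except ValueError as err:
    print(err)  # package targets untested hardware: nic-model-c
```

A gate like this only helps if the tested set is recorded by whoever ran the tests, separately from the release form it is meant to check.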