Slashdot Mirror


Monitoring the Health of Your Penguin?

codepunk asks: "I work for a large manufacturing firm in the midwest, working on a migration from Windows to Linux in the data center. We just completed installation of two full Oracle RAC 9i clusters. We are also in the process of configuring two clusters for our manufacturing floor's Linux desktop rollout. The machines that make up our data center are all Compaq Proliant Series machines. To facilitate hardware maintenance, we badly need a monitoring solution. HP offers Insight Manager as well as the Compaq Health Agents. This solution would seem like a natural fit, but the drivers it installs are binary only. We have never managed to get them to work correctly and are really concerned about the stability of our systems with these modules loaded. We are not opposed to buying hardware in the future from a vendor that provides a more open solution. We are also not opposed to buying an open third-party solution. Slashdot, what do you use for Linux system hardware monitoring?"

7 of 45 comments

  1. Loggerithim by gphat · · Score: 4, Informative

    I use Loggerithim.

  2. BB is really good by nsebban · · Score: 4, Informative

    Big Brother is often a good choice.

    --
    ____
    nico
    Nico-Live
  3. Also by szysz · · Score: 4, Interesting

    You can also try the 'Just for Fun' Network Management System;
    it's open source and extensible to fit any of your needs.

    I'm the main developer.

    --
    - Smells Like Open Source Code
  4. Yet Another by jacobmj · · Score: 4, Informative

    I think you can probably get the same end result with the projects mentioned above, but you might check out Nagios. It has a nice web interface and provides a wide array of monitoring options.

  5. Making up funny headlines for articles is hard. by cyberkreiger · · Score: 4, Funny

    To monitor the health of a penguin, a self-respecting geek would set up something much like this.

    --
    Stumbling in the dark
    I hear slavering of jaws
    Eaten by a grue.
  6. Compaq hardware monitoring by FattMattP · · Score: 5, Informative
    I also use Compaq hardware and I'm in the same position as you. People are suggesting programs that monitor applications and such which isn't what you are asking about.

    Just to clarify for other readers, he's asking about the Compaq health and wellness drivers, which are binary-only kernel modules and daemons. They monitor things like the power supplies, temperature, whether the case was opened, and the speed and health of every fan in the system, as well as memory errors and the state of the hard drives. They provide information that things like Nagios and Big Brother won't be aware of, because the information isn't in /proc without these drivers.

    That being said, you'd do well to subscribe to the Compaq and Linux mailing list. There are some solutions to getting those Compaq drivers working with versions of Red Hat that are newer than what Compaq supports. I haven't had the time to try any of the suggestions out on one of our servers yet.

    Also, since everyone else is thinking you want application monitoring, I'd recommend Nagios.

    --
    Prevent email address forgery. Publish SPF records for y
  7. Coincidence! Here's what I've found... by __aaaaxm1522 · · Score: 5, Informative
    We're in the middle of rolling out two new HP/Compaq DL380 servers, and have run into the same problem as you.

    There are a variety of agents and monitoring tools that make up the Insight Management toolset. We've found that some of the tools are better than others.

    Pretty much the only *essential* piece is the cpqhealth drivers and daemons. These poll the health of the onboard systems such as fans, CPUs, disk arrays, etc., and log to syslog when there is a fault. Unfortunately, the open source lm_sensors and cpqarrayd packages don't talk to the hardware in the new G3 DL380s, so cpqhealth is your only option. You can find it on HP's support site, as part of the hpasm package for Linux... I grabbed my copy for RedHat 7.3 from here.

    cpqhealth comes with pre-built modules for RedHat 7.3, 8.0 and a few other distros (SuSE, for example). But I've found that even the most up-to-date stuff from HP's site only supports the kernels shipped on the RedHat CD, and nothing newer. Luckily, cpqhealth (part of the hpasm package) does allow you to build new modules. You'll need a compiler on the machine. Take a peek at this script: /opt/compaq/cpqhealth/custom_cpqhealth.sh - it will build a new cpqhealth RPM for you, containing the drivers and daemons necessary to log hardware faults to the syslog (as well as to take action on them).

    The script will break when you first run it - it will look for the following two files:

    /opt/compaq/cpqhealth/cpqasm/S10cpqasm
    /opt/compaq/cpqhealth/hpuid

    Both of these are missing in the most recent hpasm release. Create the S10cpqasm file yourself (it's just a startup script that gets dumped in /etc/init.d - a simple touch of that file is fine for now - you can put a proper one together later), and copy hpuid from /bin (where it gets installed when you install the hpasm RPM).
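    A minimal sketch of that workaround in Python, assuming the paths named above. The function name and its parameters are hypothetical, made up for illustration; they are not part of hpasm.

```python
import shutil
from pathlib import Path


def prepare_cpqhealth_build(base_dir, hpuid_source):
    """Create the two files the build script looks for, if they are missing.

    base_dir     -- directory holding the build script, e.g. /opt/compaq/cpqhealth
    hpuid_source -- installed copy of hpuid, e.g. /bin/hpuid
    Returns the list of paths that were created.
    (Hypothetical helper for illustration only.)
    """
    base = Path(base_dir)
    created = []

    # Placeholder startup script: an empty file, equivalent to a plain `touch`.
    init_script = base / "cpqasm" / "S10cpqasm"
    if not init_script.exists():
        init_script.parent.mkdir(parents=True, exist_ok=True)
        init_script.touch()
        created.append(init_script)

    # Copy hpuid into the directory where the build script expects it.
    hpuid_target = base / "hpuid"
    if not hpuid_target.exists():
        shutil.copy2(hpuid_source, hpuid_target)
        created.append(hpuid_target)

    return created


# On a real system (as root) the call would look like:
#   prepare_cpqhealth_build("/opt/compaq/cpqhealth", "/bin/hpuid")
```

    Existing files are left alone, so a proper S10cpqasm written later will not be overwritten by a second run.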

    Once done, you'll have an RPM that installs the following:

    two kernel modules: cpqasm.o, cpqevt.o

    two daemons: /opt/compaq/cpqhealth/cpqasm/casmd
    /opt/compaq/cpqhealth/cpqevt/cevtd

    Make sure the kernel modules and daemons get loaded, and you'll now get warnings when a fan fails, a disk in the RAID array dies, etc.
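    One way to confirm the modules are loaded is to look for them in /proc/modules. The sketch below assumes the modules show up there as cpqasm and cpqevt (the .o file names without the extension); the function name is hypothetical.

```python
def missing_modules(proc_modules_text, required=("cpqasm", "cpqevt")):
    """Return the required module names that do not appear in /proc/modules.

    proc_modules_text -- the full text of /proc/modules; the first
                         whitespace-separated field on each line is the
                         module name.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


# On a real system:
#   with open("/proc/modules") as fh:
#       print(missing_modules(fh.read()))
```

    An empty list means both modules are present. The daemons (casmd and cevtd) would need a separate check against the process list.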

    Even better - unlike the rest of the HP/Compaq Insight stuff, this doesn't use SNMP, doesn't install a web server that listens to 0.0.0.0, and seems to work quite well.

    Other annoying things I've discovered about the rest of the HP/Compaq toolset:

    • Dumb OS-detection routines - the Storage Monitor Agent greps /etc/issue for "RedHat" + "7.3"...
    • Over-reliance on SNMP.
    • Unclear documentation - the agents indicate they'll work with the snmpd shipped with RedHat, whereas HP's site indicates you need to install the HP-modded snmpd.
    • Corporate schizophrenia - some daemons called hpxxxx, others called cpqxxxx ... some scripts fail, looking for the old filename (HP renamed Compaq daemons, forgot to update script).
    • Check ports 2301 and 2381 - One or more of the agents installs webservers there, for "remote management". They listen to 0.0.0.0 and have no IP-filtering ability. So make sure that it's firewalled off. netstat -anp is your friend.
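    The port check in the last item can be scripted. This is a rough sketch that scans the text printed by netstat -anp for TCP listeners on 2301 or 2381 bound to 0.0.0.0. It assumes the usual Linux net-tools column layout, and the function name is hypothetical.

```python
def exposed_listeners(netstat_output, ports=(2301, 2381)):
    """Return (port, program) pairs for TCP listeners bound to 0.0.0.0.

    netstat_output -- text as printed by `netstat -anp`, assumed to use the
                      columns: proto, recv-q, send-q, local address,
                      foreign address, state, pid/program.
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, non-TCP sockets, and anything not in LISTEN state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        address, _, port = fields[3].rpartition(":")
        if address == "0.0.0.0" and port.isdigit() and int(port) in ports:
            # The pid/program column is only filled in when run as root.
            program = fields[6] if len(fields) > 6 else "-"
            found.append((int(port), program))
    return found


# On a real system:
#   import subprocess
#   text = subprocess.run(["netstat", "-anp"], capture_output=True, text=True).stdout
#   print(exposed_listeners(text))
```

    Anything this returns is listening on every interface and should be firewalled off or shut down.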

    In the end, we just installed cpqhealth on the boxes to warn us of hardware problems, and will use RRDtool for our other monitoring requirements.