Monitoring the Health of Your Penguin?

Posted by Cliff on Wednesday March 5, 2003 @03:41AM from the stocking-the-digital-medicine-cabinet dept.

codepunk asks: "I work for a large manufacturing firm in the Midwest, working on a migration from Windows to Linux in the data center. We just completed installation of two full Oracle RAC 9i clusters. We are also in the process of configuring two clusters for our manufacturing floor's Linux desktop rollout. The machines that make up our data center are all Compaq Proliant Series machines. In order to facilitate hardware maintenance we are in bad need of a monitoring solution. HP offers Insight Manager as well as the Compaq Health Agents. This solution would seem like a natural, but the drivers installed by these solutions are binary only. We have never managed to get these to work correctly and are really concerned about the stability of our systems with these modules loaded. We are not opposed to buying hardware in the future from a vendor that provides a more open solution. We are also not opposed to buying an open third-party solution. Slashdot, what do you use for Linux system hardware monitoring?"

45 comments

Loggerithim by gphat · 2003-03-05 03:47 · Score: 4, Informative

I use Loggerithim.
BB is really good by nsebban · 2003-03-05 03:49 · Score: 4, Informative

Big Brother is often a good choice.

--
____
nico
Nico-Live
1. Re:BB is really good by zeugma-amp · 2003-03-05 04:37 · Score: 2, Insightful
  
  We use BB where I work. It monitors health of over 100 servers and does a pretty good job of it.
  
  You didn't mention how deep you want monitoring to go. Do you want to monitor the state of any individual files or processes?
  
  Anything you can monitor via perl/shell scripts can be reported by BB.
  
  --
  This is an ex-parrot!
Also by szysz · 2003-03-05 03:57 · Score: 4, Interesting

You can also try the 'Just for Fun' Network Management System;
it's open source and extensible to fit any of your needs.

I'm the main developer.

--
- Smells Like Open Source Code
You beat me to it. by drivers · 2003-03-05 04:00 · Score: 1

Big Brother is pretty good... extensible too.
1. Re:You beat me to it. by nyamada · 2003-03-05 05:17 · Score: 2, Informative
  
  Surprised that no one has mentioned Nagios. Used to be called Netsaint. We've used it to monitor about 10 servers for about 3 years now. It's good for monitoring almost anything about your hardware you can think of, as long as you have the ability to get back the extended information from your motherboard (temperature, etc). It's open source, with many plugins & writing plugins for it is very simple.
Yet Another by jacobmj · 2003-03-05 04:05 · Score: 4, Informative

I think you can probably get the same end result with the projects mentioned above, but you might check out Nagios. It has a nice web interface and provides a wide array of monitoring options.
Making up funny headlines for articles is hard. by cyberkreiger · 2003-03-05 04:09 · Score: 4, Funny

To monitor the health of a penguin, a self-respecting geek would set up something much like this.

--
Stumbling in the dark
I hear slavering of jaws
Eaten by a grue.
Moaning Goat by Johnny_Longtorso · 2003-03-05 04:10 · Score: 3, Interesting

http://www.xiph.org/mgm/

--
Even casual involvement excludes total freedom by it's inherent nature. John Valby
Binary-only monitoring. by dentar · 2003-03-05 04:15 · Score: 2, Insightful

Throughout my career working with Compaq servers, I've noticed that Compaq's monitoring stuff, at least for SCO platforms, is "just ok." However, every now and then we find that someone's server gets bogged down by these daemons and device drivers, which are linked into the kernel when Compaq's EFS is loaded.

AMI Megaraid adapters on HP have a monitoring daemon that sometimes bogs down under SCO as well.

I don't know how their Linux versions perform, if they exist, but Compaq's tools for SCO have been hit and miss.

The best of the lot IMHO is Compaq's SCSI monitoring, which is really nothing more than regurgitating of the firmware-based logs, which is where all that stuff belongs anyway.

--
-- I am. Therefore, I think!
Wow - not at all flamebait by Johnny_Longtorso · 2003-03-05 04:16 · Score: 0, Offtopic

Yes, Slowlaris and Windoze are the way to go! So sayith the Consultant.

Those that can't do Consult.

--
Even casual involvement excludes total freedom by it's inherent nature. John Valby
1. Re:Wow - not at all flamebait by Anonymous Coward · 2003-03-05 09:35 · Score: 0
  
  That's nice. Instead of making a point, you start name calling.
  
  It is a fact that Sun's hardware (as well as hardware from IBM and other real companies) is hotswappable, and redundant. Hard drive failed? nic burnt out? no problem.
  
  With Windows datacenter, you have a cluster of computers load levelled. If one goes down, the others are unaffected, and users are none the wiser. Same with VMS. Linux with beowulf or mosix have a long ways to go before they can be mentioned in the same sentence without causing people to shit their pants laughing.
2. Re:Wow - not at all flamebait by Acheron · 2003-03-05 13:06 · Score: 2, Interesting
  
  Responding even though AC here has obviously been living under the bridge a bit too long and is rather off-topic to boot.
  
  Load levelling/failover such as you're speaking of in Windows Datacenter is definitely possible in Linux. Please visit http://linux-ha.org/ if you're interested in learning more about some of those types of applications.
  
  I'm not sure what your point is with hardware. Quality hardware is available from many vendors, including hardware which supports Linux. Yes, IBM, Sun, etc have systems which provide good hardware redundancy and replaceability, but so does Dell.
  
  Your point about comparing a load leveled Windows Datacenter cluster to a beowulf cluster is comparing apples to oranges. The Windows Datacenter is a cluster of machines providing redundancy of service for each other, the beowulf is a cluster of machines acting as a single large processing unit. Completely different balls of wax.
  
  I guarantee you that I could build with Linux and Dell a cluster that would be just as reliable as your windows datacenter, plus it would cost less and probably perform better.
  
  Feel free to visit again. Careful though, I hear there's heretics around who don't religiously praise ANY hardware or ANY software at all!
but... by pizza_milkshake · 2003-03-05 04:18 · Score: 0, Offtopic

i don't own a penguin, how should i know?
1. Re:but... by Hubert_Shrump · 2003-03-05 04:27 · Score: 1
  
  C'mon, get in the /. spirit!
  
  When in doubt, point at google and show the parent what a hater you can be.
  
  We all win!
  
  --
  Keep your packets off my GNU/Girlfriend!
read! by .@. · 2003-03-05 04:19 · Score: 1

Errr...folks? You're posting system monitoring software. He wanted HARDWARE monitoring solutions.

--
.@.
1. Re:read! by bellings · 2003-03-05 04:30 · Score: 0, Flamebait
  
  That's the problem with Ask Slashdot -- you get answers from Slashdot readers.
  
  I have no idea why anyone would post a link to freeware PHP projects to display pretty graphs of SNMP data when the question was about Linux Kernel Modules to monitor Firmware.
  
  Probably because there are hundreds of shitty free linux web tools to display pretty graphs of your network performance, and nearly zero linux tools for hardware monitoring. That should tell you something about the type of projects linux is currently being used for, and the type of projects you should use linux for.
  
  I'm not saying that linux doesn't have a niche, and that it fills that niche very well. But outside that, it starts to fall down. True high-availability, single-point-of-failure servers like the original poster is using is one of those places where linux falls, unfortunately.
  
  --
  Slashdot is jumping the shark. I'm just driving the boat.
Compaq hardware monitoring by FattMattP · 2003-03-05 04:26 · Score: 5, Informative

I also use Compaq hardware and I'm in the same position as you. People are suggesting programs that monitor applications and such which isn't what you are asking about.
Just to clarify for other readers, he's asking about the Compaq health and wellness drivers which are binary only kernel modules and daemons that monitor things like the power supplies, temperature, if the case was opened, the speed and health of every fan in the system as well as things like memory errors and the state of the hard drives. They provide information that things like Nagios and Big Brother won't be aware of because the information isn't in /proc without these drivers.
That being said, you'd do well to subscribe to the Compaq and Linux mailing list. There are some solutions to getting those Compaq drivers working with versions of Red Hat that are newer than what Compaq supports. I haven't had the time to try any of the suggestions out on one of our servers yet.
Also, since everyone else is thinking you want application monitoring, I'd recommend Nagios.

--
Prevent email address forgery. Publish SPF records for y
1. Re:Compaq hardware monitoring by RedHat+Rocky · 2003-03-05 04:59 · Score: 3, Informative
  
  I use lm_sensors with my x86 systems (non-Compaq), with Nagios handling the monitoring. Nagios is incredibly easy to extend, creating a plugin for the health stuff shouldn't take more than a simple shell or perl script.
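
  As a rough illustration of how small such a plugin can be, here is a minimal Python sketch (rather than the shell or perl mentioned above). It relies on the standard Nagios plugin convention of one status line plus an exit code of 0, 1, 2 or 3. The thresholds, the TEMP label and the idea that the reading arrives as text on the command line are assumptions made for the example; how the value is actually read (lm_sensors or a vendor tool) is left out.

```python
"""Nagios-style check: classify a temperature reading against thresholds."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(temp_c, warn_c, crit_c):
    """Return the Nagios status code for a temperature in degrees Celsius."""
    if temp_c >= crit_c:
        return CRITICAL
    if temp_c >= warn_c:
        return WARNING
    return OK


def check(raw_value, warn_c=60.0, crit_c=75.0):
    """Return (exit_code, status_line) for a reading supplied as text.

    The default thresholds are illustrative assumptions, not vendor values.
    """
    try:
        temp_c = float(raw_value)
    except (TypeError, ValueError):
        return UNKNOWN, "TEMP UNKNOWN - could not parse reading %r" % (raw_value,)
    code = classify(temp_c, warn_c, crit_c)
    return code, "TEMP %s - %.1f C" % (LABELS[code], temp_c)


def main(argv):
    """Hypothetical entry point: argv[1] holds the reading, if present."""
    reading = argv[1] if len(argv) > 1 else None
    code, line = check(reading)
    print(line)
    return code
```

  Saved as a script, it could finish with sys.exit(main(sys.argv)) so that Nagios picks up the exit code.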
  
  What's really needed here is for Compaq to open the specs on their health monitoring interface. This is what limits me from running Gentoo on my Sun SPARCs as well; I need to know the temps and fans are all ok.
  
  --
  Anything is possible given time and money.
Coincidence! Here's what I've found... by __aaaaxm1522 · 2003-03-05 04:44 · Score: 5, Informative
We're in the middle of rolling out two new HP/Compaq DL380 servers, and have run into the same problem as you.
There are a variety of agents and monitoring tools that make up the Insight Management toolset. We've found that some of the tools are better than others.
Pretty much the only *essential* tool that's required is the cpqhealth drivers and daemons. These poll the health of the onboard systems such as fans, CPUs, disk arrays, etc., and will log to syslog when there is a fault. Unfortunately, the open source lm_sensors and cpqarrayd packages don't talk to the hardware in the new G3 DL380s, so cpqhealth is your only option. You can find it on HP's support site, as part of the hpasm package for Linux... I grabbed my copy for RedHat 7.3 from here.
cpqhealth comes with pre-built modules for RedHat 7.3, 8.0 and a few other distros (SuSE for example). But I've found that even the most up to date stuff from HP's site only supports the kernels shipped on the RedHat cd, and nothing newer. Luckily, cpqhealth (part of the hpasm package) does allow you to build new modules. You'll need a compiler on the machine. Take a peek at this script: /opt/compaq/cpqhealth/custom_cpqhealth.sh - it will build a new cpqhealth RPM for you, containing the drivers and daemons necessary to log hardware faults to the syslog (as well as to take action on them).
The script will break when you first run it - it will look for the following two files:
/opt/compaq/cpqhealth/cpqasm/S10cpqasm
/opt/compaq/cpqhealth/hpuid
Both of which are missing in the most recent hpasm release. Create the S10cpqasm file yourself (it's just a startup script that gets dumped in /etc/init.d - a simple touch of that file is fine for now - you can put a proper one together later), and copy hpuid from /bin (where it gets installed when you install the hpasm RPM).
Once done, you'll have an RPM that installs the following:
two kernel modules: cpqasm.o, cpqevt.o
two daemons: /opt/compaq/cpqhealth/cpqasm/casmd
/opt/compaq/cpqhealth/cpqevt/cevtd
Make sure the kernel modules and daemons get loaded, and you'll now get warnings when a fan fails, disk in the RAID array dies, etc.
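
One way to confirm the modules really are loaded is to look for them in /proc/modules, where the first field of each line is the module name. The Python sketch below does only that. It assumes the modules show up under the names cpqasm and cpqevt (taken from the .o file names above); that naming is an assumption, and checking that the daemons are running is not covered.

```python
# Module names inferred from cpqasm.o / cpqevt.o; treat them as an assumption.
REQUIRED_MODULES = ("cpqasm", "cpqevt")


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not currently loaded."""
    present = loaded_modules(proc_modules_text)
    return [name for name in required if name not in present]


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules as text."""
    with open(path) as handle:
        return handle.read()
```

Calling missing_modules(read_proc_modules()) from cron and mailing a non-empty result would be one simple way to notice when a kernel update has left the modules behind.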
Even better - unlike the rest of the HP/Compaq Insight stuff, this doesn't use SNMP, doesn't install a web server that listens to 0.0.0.0, and seems to work quite well.
Other annoying things I've discovered about the rest of the HP/Compaq toolset:
- Dumb OS-detection routines - the Storage Monitor Agent greps /etc/issue for "RedHat" + "7.3"...
- Over-reliance on SNMP.
- Unclear documentation - the agents indicate they'll work with the snmpd shipped with RedHat, whereas HP's site indicates you need to install the HP-modded snmpd.
- Corporate schizophrenia - some daemons called hpxxxx, others called cpqxxxx ... some scripts fail, looking for the old filename (HP renamed Compaq daemons, forgot to update script).
- Check ports 2301 and 2381 - One or more of the agents installs webservers there, for "remote management". They listen to 0.0.0.0 and have no IP-filtering ability. So make sure that it's firewalled off. netstat -anp is your friend.
In the end, we just ended up installing cpqhealth on the boxes to warn us of hardware problems, and will use RRDtool for our other monitoring requirements.
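
The port check mentioned above (2301 and 2381 listening on 0.0.0.0) can be automated by scanning netstat output. The following Python sketch assumes the usual Linux `netstat -an` column layout (proto, recv-q, send-q, local address, foreign address, state) and only looks at IPv4 wildcard listeners; the function name is made up for the example.

```python
# Ports named in the post as carrying the remote-management web servers.
MANAGEMENT_PORTS = frozenset({2301, 2381})


def exposed_listeners(netstat_text, ports=MANAGEMENT_PORTS):
    """Return the sorted ports from `ports` with a TCP listener on 0.0.0.0."""
    found = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, non-TCP sockets and anything not in LISTEN state.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN":
            continue
        host, _, port = fields[3].rpartition(":")
        if host == "0.0.0.0" and port.isdigit() and int(port) in ports:
            found.add(int(port))
    return sorted(found)
```

An empty result only says that nothing is bound to the wildcard address on those ports; it does not replace firewalling them off.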
BMC and Candle by pcs305 · 2003-03-05 04:44 · Score: 2, Informative

If you have money to throw around...
BMC has a good few Linux server management products. BMC Patrol is one of them.
And so does Candle: Omegamon XE for Linux
Big Blue by duffbeer703 · 2003-03-05 04:44 · Score: 3, Informative

We started using Director on xSeries hardware, and it seems to work a lot better than Insight Manager.

The SNMP traps (or Tivoli TEC events in our case) are a lot more intuitive and useful than the crap that Insight Manager sent out, and the agent seems to be more reliable. You can use a "Director Server" or a product like Nagios or OpenView for alerting.

Keep in mind that "Systems Management" in the pc/unix world is a black hole that consumes time & money. You might be better off using something like VMware GSX server on big Intel hardware or even an IBM zseries and use virtual hosts.

One of the advantages of big, expensive hardware is that it often comes with service processors that phone home or page you when problems occur. At our shop, several admins have been surprised when an IBM repairman calls to schedule replacement of a failed disk or fan that they were unaware of -- the service call was generated automatically.

--
Conformity is the jailer of freedom and enemy of growth. -JFK
Insight by mwood · 2003-03-05 04:50 · Score: 1

Compaq's Insight is the best I've ever used, although that was on Netware. I'm surprised you have so much trouble with it. (Now, if you had said OpenManage....)
RMON, SNMP, perl, and an extensible system by snopes · 2003-03-05 04:52 · Score: 3, Insightful

Here's how I'd suggest approaching the problem. Look into the platform MIBs. Find out what you can query values for. You should at least be able to get some binary type "fan working", "power supply working", etc. type stuff. Then get yourself an easily extensible monitoring system. Frankly, BigBrother is antiquated and a pain to manage. Other recommendations made here are reasonable, but I'd suggest mon. It's not a monitoring system per se. It is a scheduling framework with concepts of monitor and alert built in. Many monitors and alerts are available, but best of all it's really easy to write your own. For such things (for most things), I like perl.
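
To make the "scheduling framework with monitors and alerts" idea concrete, here is a small Python sketch of the pattern. It is not mon's code or configuration format, and every name in it is hypothetical: monitors are plain callables that return True when healthy, alerts are callables invoked for each failure, and a loop reruns the checks at a fixed interval.

```python
import time


def run_checks(monitors, alerts):
    """Run every monitor once and fire every alert for each failure.

    monitors: dict mapping a name to a zero-argument callable that returns
              True when the thing it watches is healthy.
    alerts:   list of callables taking (name, detail).
    Returns the names of the monitors that failed, in sorted order.
    """
    failed = []
    for name, monitor in sorted(monitors.items()):
        detail = "check reported failure"
        try:
            healthy = bool(monitor())
        except Exception as exc:  # a crashing monitor counts as a failure
            healthy = False
            detail = "monitor raised %s" % (exc,)
        if not healthy:
            failed.append(name)
            for alert in alerts:
                alert(name, detail)
    return failed


def run_schedule(monitors, alerts, interval_s, rounds):
    """Run `rounds` passes, sleeping `interval_s` seconds between passes."""
    history = []
    for index in range(rounds):
        history.append(run_checks(monitors, alerts))
        if index + 1 < rounds:
            time.sleep(interval_s)
    return history
```

A monitor here could wrap an SNMP query for a "fan working" value from the platform MIB, and an alert could send mail or a page; both are left as plain callables so they are easy to write and swap.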
Open Source Hardware Monitoring by tarus · 2003-03-05 04:55 · Score: 3, Informative

You face a similar problem to pretty much any hardware specific driver issue when it comes to Linux: the O/S tends to be ignored by the vendor.

Open-source tools tend to be (gasp) based on open protocols, whereas hardware tends to have its own specific, closed methods for determining state (such as temperature, etc.). The only real way to solve the problem is to reverse engineer the available methods or patronize those vendors that offer either an open solution or wider selection of supported O/S's. I believe that Compaq embeds some code from BMC Software for monitoring low level hardware information, so it is doubtful you will ever see the source for it.

Off the top of my head, only Dell's OpenManage is available for Linux.

If you can find a way to access the information from the command line, you can always use net-snmp to integrate it into an SNMP agent that can be accessed by most management products.

Good luck, and if you get it working you may want to check out OpenNMS as your monitoring solution. It supports CIM out of the box (as well as Dell OpenManage).
big brother / SNIPS / mrtg / NUT by Anonymous Coward · 2003-03-05 04:56 · Score: 1, Informative

I use a combo of big brother, SNIPS, MRTG and NUT(Network UPS tools). I have a demonstration page here:

http://monitor.aphroland.org/

(best viewed in 1600x1200 or higher)

Instant network/system status from my home network. Plenty of performance monitoring stuff, availability tracking, etc. Works wonderfully. Took forever to set up though; most of that was determining what to monitor and how to monitor it. These days I can set it up in a few hours.

I also use PureSecure(www.demarc.com) for NIDS and HIDS. That data isn't available at the above url though some of my snort activity is.

please don't slashdot it as it puts a somewhat
heavy load on my system :)

I hacked up a couple of the tools (SNIPS/NUT) to display much more compact information so it would all fit on 1 page. With the exception of NUT, the other 3 tools are individually accessible from .aphroland.org, e.g. snips.aphroland.org, bb.aphroland.org, etc.
Re:Coincidence! Here's what I've found... by Jenova · 2003-03-05 05:03 · Score: 1

Wow, looks like quite a number of people having issues with the HPaq agents.

My colleague has almost given up trying to make the compaq agents work on his RH 7.2 DL 360 G2 servers.

I think he's gotten to the point of loading some of the drivers but still getting some device not found errors when loading the modules.
Penguin health... by Anonymous Coward · 2003-03-05 05:23 · Score: 0, Funny

Do penguins have prostate glands that need checking?
1. Re:Penguin health... by codepunk · 2003-03-05 09:41 · Score: 1
  
  Actually I was looking for a little more than asking him to turn his head and cough..
  
  --
  
  Got Code?
InterMapper by rakerman · 2003-03-05 06:05 · Score: 2, Informative

InterMapper can monitor anything that responds to SNMP or other TCP/IP queries. For the central monitoring server, it is available now for Mac and Windows, and in beta for Linux.
Nagios by RichiP · 2003-03-05 06:52 · Score: 1

Nagios does hardware monitoring. I just looked at their homepage and there's a piece of hardware that monitors temperature, etc. Who knows what else it monitors...?
OSDN whoring time by Anonymous Coward · 2003-03-05 07:27 · Score: 1, Informative

You asked slashdot, but didn't ask freshmeat?! OSDN is your one-stop shop for all your computing needs! OSDN is great! OSDN is good!

# Topic :: System :: Networking :: Monitoring :: Hardware Watchdog (90 projects)

Bow down before OSDN! Huzzah!
HP SmartStart CDs now Linux based by __aaaaxm1522 · 2003-03-05 09:36 · Score: 2, Interesting

Just noticed something else this afternoon. I normally don't bother using the HP/Compaq SmartStart CDs to configure my servers, preferring to do the OS installation by hand. For those of you not "in the know", HP/Compaq servers (at least the Proliant models) ship with a CD that will walk a new user through hardware config, OS install, etc. It automatically sets certain firmware/BIOS settings based on your chosen OS, etc... and helps you load the necessary drivers into your freshly installed OS.

Anyhow, back to what I noticed: The old SmartStart CDs were Windows based (yup, bootable Windows, or a subset thereof, on a CD). The SmartStart CDs that shipped with my new DL380 G3s are *Linux* based. They boot into a web browser from which all system config is managed. CTRL-ALT-F1 gets you a bash prompt. X is running on F3. Window manager is icewm. Browser is phoenix. PHP seems to be present as well, as there is a /php.ini file ... haven't looked much further yet.

cat /proc/version shows that it's running 2.4.18-4smp, and is the kernel shipped with RH 7.3 (or an update).

Very interesting, as this Linux-based tool helps people install Windows on their servers. For fun, I asked it to walk me through an OS install - it noticed that I had configured my OS in the BIOS as Linux (yes, there is a BIOS Linux-specific option), it told me that it couldn't assist a Linux install, only Windows. :)
1. Re:HP SmartStart CDs now Linux based by __aaaaxm1522 · 2003-03-05 10:01 · Score: 1
  
  Not to confuse anyone with the above - just because the SmartStart CD doesn't allow you to run the "Assisted Install" wizard for a Linux installation doesn't mean you can't install Linux on the box. Just do a regular OS install. :)
2. Re:HP SmartStart CDs now Linux based by weeboo0104 · 2003-03-05 14:01 · Score: 1
  
  "it couldn't assist a Linux install, only Windows."
  
  Of course! If you are trying to install Windows, you MUST be in dire need of assistance!
  
  --
  It is easier to build strong children than to repair broken men. -Frederick Douglass
Compaq RILO Boards and mozilla by codepunk · 2003-03-05 09:50 · Score: 1

And another thing: has anyone successfully gotten the junk RILO boards to work with a browser other than IE? The thing runs java but the idiots have a browser detect script that keeps me from logging in with mozilla.

--

Got Code?
Mon and Nagios by buttahead · 2003-03-05 10:46 · Score: 2, Insightful

Mon is what I use. It is very extensible, but also is fairly good out of the box. I monitor ~90 servers (many in remote data centers) with no problems. I write all sorts of monitors that are run on the remote servers via ssh. It is open source, and free.
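
A monitor that runs on a remote server over ssh can be little more than a wrapper that treats the remote command's exit status as the result, since ssh exits with the status of the remote command (or 255 if the connection itself fails). The Python sketch below shows that shape. The host and command are placeholders, the function names are invented for the example, and it assumes key-based login is already set up so BatchMode never prompts.

```python
import subprocess


def ssh_command(host, remote_command, timeout_s=10):
    """Build the ssh argument list for a non-interactive remote check."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never prompt for a password
        "-o", "ConnectTimeout=%d" % timeout_s,      # give up on dead hosts
        host,
        remote_command,
    ]


def run_remote_check(host, remote_command, runner=subprocess.run):
    """Return True when the remote command exits with status 0.

    `runner` is injectable so the logic can be exercised without a network.
    """
    result = runner(
        ssh_command(host, remote_command),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0
```

For example, run_remote_check("db1.example.com", "test -d /proc/driver") would report healthy only if ssh connected and the test succeeded; both arguments are made up.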

Nagios seems to be good as well, although I haven't used it myself.
Volution by chamont · 2003-03-05 15:31 · Score: 2, Informative

I know everyone hates Caldera/SCO around here, but Volution is solid, feature-rich, and made by a Linux company still actually in business. Monty
Is that even rational? by g4dget · 2003-03-05 18:01 · Score: 1

I wonder whether "health monitoring" of computers is even a rational thing to do. Just like with medical health, too much diagnosis and knowledge can be harmful: it can lead to overtreatment, unnecessary worries, and lots of extra expenses. The kinds of problems that hardware sensors detect (fan, overheating, CPU voltage off, etc.) are usually very easy to check on a case-by-case basis when one suspects a hardware problem (intermittent or fatal).
I have found that having a few spares on hand, being able to swap out machines quickly, and having a good backup and mirroring strategy has always been the best insurance. Beyond that, I don't want to be bothered by little aches and pains from a computer until it breaks.
Zabbix by Anonymous Coward · 2003-03-05 18:30 · Score: 1, Informative

Zabbix is Open Source software that will probably fit perfectly into your environment. Zabbix has native high-performance agents for virtually every OS. All collected information and configuration data is stored in the RDBMS. This is a big advantage! Zabbix also allows you to look at everything you monitor from a business point of view (SLA, service availability, etc). It has maps, alerts, triggers, user-based permissions and more. Check the screenshots to get an overview of the functions it offers.
Crystal Darksite by B1 · 2003-03-06 03:55 · Score: 3, Informative

You might want to take a look at the Darksite remote management card from Crystal Computing. I don't work for them, but I had the chance to try one out, and I liked what I saw.

Basically, it's a single-board computer that sits in a PCI slot in your server, and monitors its vitals (hardware / software). It runs completely independently of your server, except for an optional OS agent that can monitor things like memory utilization, CPU activity, etc. (Yes, there is an agent available for Linux).

It has a web-based administration interface, and can send you alerts and warnings via Email or pager, even if the main server locks up hard for some reason--in that case, you can perform a remote reset or even cycle the power, all from a web based interface.

It's a pretty nifty card--you should take a look.
1. Re:Crystal Darksite by codepunk · 2003-03-06 06:01 · Score: 1
  
  Thank you, very interesting. They mention an OS agent;
  I wonder if it comes with source. If I am not mistaken it does not measure raid drive status, so it is probably not a silver bullet either.
  
  --
  
  Got Code?
Please read the article. by Anonymous Coward · 2003-03-06 05:47 · Score: 0

Instead of making a point, you start name calling.

Mr. Troll, I think you need to check your hypocrisy meter. It's broken.

It is a fact that Sun's hardware (as well as hardware from IBM and other real companies) is hotswappable, and redundant.

Yes, and this is relevant how?

The article mentions they are using Proliant servers. Perhaps you haven't stuck your head out of your proprietary-hardware hole recently, but these servers also provide hotswappable and redundant hardware. NIC burnt out? Hard drive failed? Power supply failed? Fan failed? CPU failed? No problem.

The fact that you haven't kept up with the pace of innovation is no reason to act like an ass.
Hardware Monitoring by o517375 · 2003-03-12 02:59 · Score: 1

This might be of interest to you.

http://www.linuxjournal.com/article.php?sid=6721