Monitoring the Health of Your Penguin?
codepunk asks: "I work for a large manufacturing firm in the Midwest, working on a migration from Windows to Linux in the data center. We just completed installation of two full Oracle RAC 9i clusters. We are also in the process of configuring two clusters for our manufacturing floor's Linux desktop rollout. The machines that make up our data center are all Compaq ProLiant series machines. In order to facilitate hardware maintenance, we are in bad need of a monitoring solution. HP offers Insight Manager as well as the Compaq Health Agents. This solution would seem like a natural, but the drivers installed by these solutions are binary only. We have never managed to get these to work correctly and are really concerned about the stability of our systems with these modules loaded. We are not opposed to buying hardware in the future from a vendor that provides a more open solution. We are also not opposed to buying an open third-party solution. Slashdot, what do you use for Linux system hardware monitoring?"
Throughout my career working with Compaq servers, I've noticed that Compaq's monitoring stuff, at least for SCO platforms, is "just ok." However, every now and then we find that someone's server gets bogged down by these daemons and device drivers, which are linked into the kernel when Compaq's EFS is loaded.
AMI MegaRAID adapters on HP have a monitoring daemon that sometimes bogs down under SCO as well.
I don't know how their Linux versions perform, if they exist, but Compaq's tools for SCO have been hit and miss.
The best of the lot IMHO is Compaq's SCSI monitoring, which is really nothing more than a regurgitation of the firmware-based logs, which is where all that stuff belongs anyway.
-- I am. Therefore, I think!
We use BB (Big Brother) where I work. It monitors the health of over 100 servers and does a pretty good job of it.
You didn't mention how deep you want monitoring to go. Do you want to monitor the state of any individual files or processes?
Anything you can monitor via perl/shell scripts can be reported by BB.
This is an ex-parrot!
Here's how I'd suggest approaching the problem. Look into the platform MIBs and find out which values you can query. You should at least be able to get binary status such as "fan working", "power supply working", etc. Then get yourself an easily extensible monitoring system. Frankly, BigBrother is antiquated and a pain to manage. Other recommendations made here are reasonable, but I'd suggest mon. It's not a monitoring system per se; it is a scheduling framework with the concepts of monitor and alert built in. Many monitors and alerts are available, but best of all, it's really easy to write your own. For such things (for most things), I like perl.
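To make the "write your own" part concrete, here is a rough sketch of a monitor in Python rather than perl. It assumes the convention, as I understand it, that a mon monitor receives host names as arguments, exits non-zero on failure, and prints a one-line summary of the failed hosts followed by detail lines. The sensor names, the "ok" status string, and read_sensors() are hypothetical placeholders; the real values depend on what your platform MIB actually exposes.

```python
# Hypothetical status string; real values depend on the platform MIB.
OK_STATES = {"ok"}


def summarize(readings):
    """Turn {host: {sensor: state}} into (exit_code, output_text).

    exit_code is 0 when every sensor is healthy, 1 otherwise.
    The first output line lists the failed hosts; the remaining
    lines give the failing sensors for each of those hosts.
    """
    failed = {}
    for host, sensors in readings.items():
        bad = sorted(name for name, state in sensors.items()
                     if state not in OK_STATES)
        if bad:
            failed[host] = bad
    if not failed:
        return 0, ""
    lines = [" ".join(sorted(failed))]
    for host in sorted(failed):
        lines.append(host + ": " + ", ".join(failed[host]))
    return 1, "\n".join(lines)


def read_sensors(host):
    """Placeholder: replace with an SNMP query against the platform MIB."""
    raise NotImplementedError("no SNMP query wired in for " + host)


def main(hosts):
    """Check each host given on the command line; return the exit code."""
    code, text = summarize({host: read_sensors(host) for host in hosts})
    if text:
        print(text)
    return code
```

To use it as a script, you would finish with something like sys.exit(main(sys.argv[1:])) once read_sensors() returns real data. Keeping the pass/fail logic in summarize() means you can test it without touching the network.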
Mon is what I use. It is very extensible, but also is fairly good out of the box. I monitor ~90 servers (many in remote data centers) with no problems. I write all sorts of monitors that are run on the remote servers via ssh. It is open source, and free.
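For anyone wondering what a monitor run over ssh might look like, here is a minimal sketch in Python. It is not my actual setup, just an illustration. It assumes key-based ssh logins are already configured and that the remote check script exits 0 when healthy; the function names and the check path in the usage note are hypothetical.

```python
import subprocess


def build_ssh_command(host, remote_command, timeout=10):
    """Build an ssh command line that never prompts for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # fail instead of prompting
        "-o", "ConnectTimeout=%d" % timeout,   # give up on dead hosts
        host,
        remote_command,
    ]


def run_remote_check(host, remote_command, timeout=10, runner=subprocess.run):
    """Run a check on a remote host; return (healthy, output).

    healthy is True only when the remote command exits 0.
    runner can be swapped for a fake in tests.
    """
    cmd = build_ssh_command(host, remote_command, timeout)
    try:
        result = runner(cmd, capture_output=True, text=True,
                        timeout=timeout + 5)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    output = (result.stdout or result.stderr).strip()
    return result.returncode == 0, output
```

A call such as run_remote_check("web1", "/usr/local/bin/check_disk") would return a pass/fail flag plus whatever the remote script printed, which maps directly onto a monitor's exit code and detail text.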
Nagios seems to be good as well, although I haven't used it myself.