Network Monitoring and Alerting?
SpamMonkey asks: "At work I am trying to implement a central monitoring and alerting service. We have in excess of 250 Windows servers, approx 15 AIX servers and another 30 Linux servers (mainly SLES/Suse). My investigation into systems that will allow us to monitor critical areas on each of these systems has so far led me to a clustered Linux server running Nagios with passive and active checks. What I'm curious about though is how Slashdot readers are carrying out their own jobs and how they can comfortably sit back, without having to repeatedly check that various systems are still operational and how to cut down their own response times when something goes wrong."
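For anyone new to the active/passive split mentioned in the question: an active check is run by the Nagios server, while a passive check is run elsewhere and its result is submitted to Nagios. What follows is only a rough sketch. The thresholds, host name and service description are made-up assumptions. The status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the PROCESS_SERVICE_CHECK_RESULT line format are the standard Nagios plugin and external-command conventions.

```python
import time

# Standard Nagios plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_disk(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status code, message) pair.

    The warn/crit thresholds are illustrative assumptions, not recommendations.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.0f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.0f}% used"
    return OK, f"DISK OK - {percent_used:.0f}% used"


def passive_result_line(host, service, code, output, now=None):
    """Format one passive service check result as an external command line."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"


# Hypothetical host and service names, for illustration only.
code, message = check_disk(95)
print(passive_result_line("aix01", "Disk /", code, message))
```

In a real setup the formatted line would be written to the Nagios external command file or relayed by an agent such as NSCA. Which route fits depends on your install, so that part is left out.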
Big Brother: used at my Fortune 500 company to monitor thousands of Win/Linux/Solaris boxes. http://www.bb4.org/
At work, we use a program called JFFNMS (Just For Fun Network Monitoring System). It's come a long way from its inception, and has the ability to run on *nix and Windows.
We tried Cacti and just didn't like it. I've looked at Nagios, but not in detail.
Find JFFNMS at www.jffnms.org
Alcohol & calculus don't mix. Never drink & derive.
Profiler Rx
.sigs are for post^Hers.
Rule #1: Do not add to the problem.
This is an entire category of Operations Management and can encompass everything. Don't take it lightly and don't be afraid to start small. The first thing you need to do is categorize what you want to monitor into individual sections and work on the easiest stuff first. By the time you work up to the tough stuff, you'll have an idea of what's available and what your capabilities are, and hopefully the easy stuff can be quickly rewritten/integrated into a better solution. Don't miss the critical stuff in a morass of junk alerts. Sample consideration (everything that moves in the data center):
Hardware:
---------
1) Server
2) Storage
3) Network
4) Power
5) Environment
6) Security
Software:
---------
1) OS
2) Applications
3) Security
Events:
-------
1) Failures
2) Alerts
3) Misfires
4) Security
Triggers:
---------
1) Notification/False positives
2) Action plans/Event handling
3) Documentation, Documentation, Documentation
4) Reporting aka analysis and CYA
5) Security
Once you're done building it, start over. The last tier is the most visible, e.g. delegating a RAID rebuild page to the opcenter flunky without proper documentation is a Career Limiting Move (CLM). Building the best monitoring system is a fucking waste unless you pay attention to it. The most apt cliches for monitor normalisation are all military: war rooms, bridges, weapons hot, communications channels, etc. View everything as SNAFU and work from there.
argus is not bad, is pretty simple to set up, and scales reasonably well.
it's also pretty flexible so you can plug it into just about any paging system and monitor just about any service you can imagine.
it handles hierarchies so you don't get 300 pages at once, etc. and it has a simple, fast, clean web interface which isn't bloated with gigabytes of shiny widgets and is even perfectly usable via lynx.
it has a few rough edges but the overall ease of use and simplicity make up for it.
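To illustrate the hierarchy point (not argus's actual implementation or config, just a minimal sketch of the general idea): if every monitored node records its parent, you page only for failed nodes that have no failed ancestor. The topology and names below are hypothetical, and the function assumes the parent map is a tree.

```python
from typing import Dict, List, Optional, Set


def pages_to_send(parent: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the failed nodes that have no failed ancestor.

    `parent` maps each node to its upstream dependency (None for the top).
    """
    to_page = []
    for node in sorted(down):
        ancestor = parent.get(node)
        seen = {node}  # guard against an accidental loop in the map
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                suppressed = True  # an upstream failure already explains this one
                break
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
        if not suppressed:
            to_page.append(node)
    return to_page


# Hypothetical topology: two web servers behind one switch.
parent = {
    "site-router": None,
    "switch-a": "site-router",
    "web1": "switch-a",
    "web2": "switch-a",
    "db1": "site-router",
}
print(pages_to_send(parent, {"switch-a", "web1", "web2"}))  # ['switch-a']
```

With the switch down you get one page for the switch instead of three, which is the whole point of modelling dependencies.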
Our big hitch is that we only have a satellite connection to each of the remote sites, so we can't do real-time monitoring. Things like HP OpenView NNM are out of the question because they use too much bandwidth.
You might be interested in Intermapper for its Remote component. You can run a monitoring system at each location yet administer them centrally. Your remote datastream will basically be the set of events that's interesting to the Human In Charge.
It's commercial software, so you have to weigh the cost of the software vs. getting it right yourself.
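If you go the roll-your-own route, the bandwidth-saving idea is simple enough to sketch: poll locally at each site and send only state changes over the satellite link. This is not Intermapper's API, just an illustration of the idea. The host names and the "ok"/"down" status strings are assumptions, and a host seen for the first time is assumed to have been "ok".

```python
from typing import Dict, Iterable, List, Tuple


def state_changes(
    samples: Iterable[Tuple[str, str]],
    last_state: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (host, old, new) for every sample whose status differs from the last one seen.

    `last_state` is updated in place so it can be reused across polling rounds.
    """
    events = []
    for host, status in samples:
        previous = last_state.get(host, "ok")  # assumption: unseen hosts start as "ok"
        if status != previous:
            events.append((host, previous, status))
        last_state[host] = status
    return events


# Hypothetical local poll results at one remote site.
polls = [
    ("pump1", "ok"),
    ("pump2", "ok"),
    ("pump1", "down"),
    ("pump1", "down"),
    ("pump1", "ok"),
]
print(state_changes(polls, {}))
# [('pump1', 'ok', 'down'), ('pump1', 'down', 'ok')]
```

Five local samples turn into two events on the wire. The repeated "down" and the steady "ok" readings never leave the site.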
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)