Network Monitoring and Alerting?
SpamMonkey asks: "At work I am trying to implement a central monitoring and alerting service. We have in excess of 250 Windows servers, approx 15 AIX servers and another 30 Linux servers (mainly SLES/Suse). My investigation into systems that will allow us to monitor critical areas on each of these systems has so far led me to a clustered Linux server running Nagios with passive and active checks. What I'm curious about though is how Slashdot readers are carrying out their own jobs and how they can comfortably sit back, without having to repeatedly check that various systems are still operational and how to cut down their own response times when something goes wrong."
Used at my Fortune 500 company to monitor thousands of Win/Linux/Solaris. http://www.bb4.org/
Rule #1: Do not add to the problem.
This is an entire category of Operations Management and can encompass everything. Don't take it lightly and don't be afraid to start small. The first thing you need to do is categorize what you want to monitor into individual sections and work on the easiest stuff first. By the time you work up to the tough stuff, you'll have an idea of what's available and what your capabilities are, and hopefully the easy stuff can be quickly rewritten/integrated into a better solution. Don't miss the critical stuff in a morass of junk alerts. Sample consideration (everything that moves in the data center):
Hardware:
---------
1) Server
2) Storage
3) Network
4) Power
5) Environment
6) Security
Software:
---------
1) OS
2) Applications
3) Security
Events:
-------
1) Failures
2) Alerts
3) Misfires
4) Security
Triggers:
---------
1) Notification/False positives
2) Action plans/Event handling
3) Documentation, Documentation, Documentation
4) Reporting aka analysis and cya
5) Security
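To make the "Notification/False positives" item concrete, here is a rough sketch of one common way to keep junk alerts down: only page after a check has failed several times in a row, and page once per incident rather than on every failed poll. Everything here is made up for illustration (the AlertFilter class, the db01 host, the disk check, the threshold of 3); Nagios has its own retry settings for this, so treat it as the idea rather than a recipe.

```python
from collections import defaultdict


class AlertFilter:
    """Hypothetical helper: page only after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # (host, check) -> number of consecutive failed results seen so far
        self.failures = defaultdict(int)

    def record(self, host, check, ok):
        """Record one check result; return True if someone should be paged."""
        key = (host, check)
        if ok:
            # A recovery resets the counter, so a single blip never pages.
            self.failures[key] = 0
            return False
        self.failures[key] += 1
        # Fire exactly once, at the moment the threshold is first reached.
        return self.failures[key] == self.threshold


# Illustrative run: one blip, a recovery, then a sustained failure.
alert_filter = AlertFilter(threshold=3)
results = [False, True, False, False, False, False]
pages = [alert_filter.record("db01", "disk", ok) for ok in results]
print(pages)
```

The single early failure never pages because the recovery resets the count; only the third consecutive failure does, and the fourth stays quiet so the on-call person is not paged again for the same incident.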
Once you're done building it, start over. The last tier is the most visible, e.g. delegating a RAID rebuild page to the opcenter flunky without proper documentation is a Career Limiting Move (CLM). Building the best monitoring system is a fucking waste unless you pay attention to it. The most apt cliches for monitor normalisation are all military: war rooms, bridges, weapons hot, communications channels, etc. View everything as SNAFU and work from there.
Rule #1: Do not add to the problem.