Network Monitoring and Alerting?

Posted by Cliff on Monday February 28, 2005 @10:55AM from the centralized-fault-detection dept.

SpamMonkey asks: "At work I am trying to implement a central monitoring and alerting service. We have in excess of 250 Windows servers, approx 15 AIX servers and another 30 Linux servers (mainly SLES/Suse). My investigation into systems that will allow us to monitor critical areas on each of these systems has so far led me to a clustered Linux server running Nagios with passive and active checks. What I'm curious about though is how Slashdot readers are carrying out their own jobs and how they can comfortably sit back, without having to repeatedly check that various systems are still operational and how to cut down their own response times when something goes wrong."

7 of 59 comments

Simple by A+beautiful+mind · 2005-02-28 11:32 · Score: 1, Interesting

No sitting back.

Security/monitoring is a process not a product. When you finish the checklist you start over.

Also, I would recommend trying to cut back on the Windows servers somehow; maybe mention the right words when the licenses expire, or when it's that time of year, to upgrade to a "free" solution?

--
It takes a man to suffer ignorance and smile
Be yourself no matter what they say
Also curious by RabidMonkey · 2005-02-28 11:35 · Score: 5, Interesting

We are in the middle of a large-scale Linux project: replacing 900 SCO Unix servers with Linux. We are wondering the same thing: what monitoring tools can be put in to watch these servers?

Our big hitch is that we only have a satellite connection to each of the remote sites, so we can't do real-time monitoring. Things like HP OpenView NNM are out of the question because they use too much bandwidth.

Our solution (and the reason I'm working late right now) is to build a custom suite of tools that does batch reporting every night by polling logs and custom programs, then sends the results back in a handy XML file. We take that file, dump it into a large Informix database, and then we can do whatever we want to create reports.

It's a little more work than just installing a package, but we're getting EXACTLY what we want out of the product. It works with our unusual communications and configuration, and it's modular, so I can add whatever monitoring/checks I want by writing a new ksh script. All the output is standardized, and all the parsing is done at the office by a very clever XML parser one of the db guys wrote.
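A minimal sketch of the modular batch-report idea described above, written in Python rather than ksh purely for illustration. The `CHECKS` registry, the `check` decorator, the `disk_root` check with its 90% threshold, and the XML element names are all hypothetical; the poster's actual tools are not shown in this thread.

```python
import shutil
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional, Tuple

CheckFn = Callable[[], Tuple[str, str]]

# Hypothetical registry: each check returns (status, detail).
CHECKS: Dict[str, CheckFn] = {}


def check(name: str) -> Callable[[CheckFn], CheckFn]:
    """Register a function as a named check, so new checks are drop-in."""
    def register(fn: CheckFn) -> CheckFn:
        CHECKS[name] = fn
        return fn
    return register


@check("disk_root")
def disk_root() -> Tuple[str, str]:
    # Example check; the 90% threshold is an arbitrary assumption.
    usage = shutil.disk_usage("/")
    pct = round(100 * usage.used / usage.total)
    return ("ok" if pct < 90 else "warn", f"{pct}% used")


def build_report(site_id: str, checks: Optional[Dict[str, CheckFn]] = None) -> str:
    """Run every check and return one standardized XML document as a string."""
    checks = CHECKS if checks is None else checks
    root = ET.Element("report", site=site_id)
    for name, fn in sorted(checks.items()):
        try:
            status, detail = fn()
        except Exception as exc:
            # A failing check is reported, not allowed to abort the batch.
            status, detail = "error", str(exc)
        node = ET.SubElement(root, "check", name=name, status=status)
        node.text = detail
    return ET.tostring(root, encoding="unicode")
```

Because every check emits the same `(status, detail)` shape, the receiving side only needs one parser, and one small file per site per night keeps the load on a thin link predictable.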

I think for what you're looking for, there are things like Big Brother, MRTG, HP OpenView, umm ... the IBM equivalent (a Tivoli product, I think?). No need to reinvent the wheel.

But I'd love some feedback from people who are working in a bandwidth-sparse shop like me ... how do you balance your need to know everything about every box all the time (well, the business and management want to know, at least) vs. the tiny amount of data you can push without interfering with mission-critical applications running over that same link?

--
We emerge from our mother's womb an unformatted diskette; our culture formats us. - Douglas Coupland
1. Re:Also curious by MistabewM · 2005-02-28 11:39 · Score: 1, Interesting
  
  I would consider piggybacked systems in your case: cluster-type servers where one server monitors the other. If one goes down, alert the "local" tech staff, or alert someone... If you want immediate reporting, you can have it jump onto the GSM/TDMA network and start blasting SMS text messages to your wireless phone.
  
  If you are using a realtime link you can just have it alert through the NOC.
  
  --
  "A learning experience is one of those things that says, 'You know that thing you just did? Don't do that.'" - DNA
MANY ways to take care of event notification by jgaynor · 2005-02-28 11:41 · Score: 4, Interesting

So you're concerned about letting things go on auto-pilot and missing an alarm . . .

Why not slap a modem into the head Nagios box and have it page you when things fail? Don't worry about having to wear a beeper - you can page most cellphones via your carrier's SMS gateway (still dial-capable).

Too much hassle? How about AIM? YIM? Jabber? Email?

If you're TRULY the teeth-jittering, chain-smoking NOC type, buy some X10 crap and build a physical network alarm interface like I did :). Either way, Nagios has ASSLOADS of event notification options.
Monitoring Tools by Plake · 2005-02-28 11:45 · Score: 3, Interesting

Personally, I've used an array of the free monitoring tools and find most of them to be decent. For larger-scale monitoring you'd want something that lets the clients push data to the monitoring systems, so the monitoring systems do far less work.
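As one illustration of the push model, here is a hedged sketch of how a client-side script might format a result for Nagios's passive-check interface, using the `PROCESS_SERVICE_CHECK_RESULT` external command line format. The function name `passive_result` and the host and service names in the usage note are made up for this example, and how the line reaches the Nagios server (NSCA or some other transport) is left out.

```python
import time
from typing import Optional

# Standard plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
VALID_CODES = (0, 1, 2, 3)


def passive_result(host: str, service: str, code: int, output: str,
                   now: Optional[float] = None) -> str:
    """Format one passive service check result as an external command line."""
    if code not in VALID_CODES:
        raise ValueError(f"invalid return code: {code}")
    timestamp = int(time.time() if now is None else now)
    # The command must stay on a single line, so flatten any newlines.
    clean = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}"
```

For example, `passive_result("web01", "disk", 2, "root 95% full")` yields a line that a server configured to accept passive checks can process without ever polling the client itself.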

Here's a couple of the monitoring solutions:

OpenNMS
Mon
Big Brother

For system information polling I'd go with:

Cacti. Hands down, this is the best polling system out there, and it's simple to set up and run.

--
Check out Mon and Mon.cgi
Re:Why? by Asgard · 2005-02-28 11:50 · Score: 2, Interesting

If the monitoring server goes down, then no pages will go out. In this case the poster may have meant 'redundant servers' instead of a true cluster. One can configure a Nagios environment such that two systems check on each other, and if the primary Nagios instance goes down, the secondary Nagios instance will take over monitoring and alerting.
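The takeover logic can be sketched roughly as below. This is not Nagios configuration, only an illustration of the idea: a standby stays quiet while it keeps hearing from the primary and starts alerting once the primary has been silent too long. The class name `StandbyMonitor` and the 180-second timeout are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Secondary monitor that takes over when the primary goes silent."""
    timeout: float = 180.0   # seconds of silence before takeover (assumed value)
    last_seen: float = 0.0   # time of the most recent heartbeat from the primary
    active: bool = False     # True while the standby is doing the alerting

    def heartbeat(self, now: float) -> None:
        # The primary answered a check: record it and stand down.
        self.last_seen = now
        self.active = False

    def tick(self, now: float) -> bool:
        # Called periodically; returns True while the standby should alert.
        if now - self.last_seen > self.timeout:
            self.active = True
        return self.active
```

The timeout trades speed for stability: too short and a single lost check causes both servers to page at once, too long and an outage of the primary goes unnoticed.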
Zabbix by Anonymous Coward · 2005-03-01 00:12 · Score: 1, Interesting

Check Zabbix if you're looking for a solution which is free, supports all platforms, and easy to deploy. Look at screenshots.

My company has been using it for several months in a mixed Windows/Unix environment (~180 servers) with great success. We use nearly all the features Zabbix provides (notifications, graphs, network maps, SLA monitoring, cool screens). Very useful stuff indeed. We tried Nagios before, but found it complex and hard to maintain. Besides, the performance of Nagios was disappointing (not enough tuning?).

Rog