Slashdot Mirror


Network Monitoring and Alerting?

SpamMonkey asks: "At work I am trying to implement a central monitoring and alerting service. We have in excess of 250 Windows servers, approx 15 AIX servers and another 30 Linux servers (mainly SLES/Suse). My investigation into systems that will allow us to monitor critical areas on each of these systems has so far led me to a clustered Linux server running Nagios with passive and active checks. What I'm curious about though is how Slashdot readers are carrying out their own jobs and how they can comfortably sit back, without having to repeatedly check that various systems are still operational and how to cut down their own response times when something goes wrong."

17 of 59 comments

  1. Big Brother by one8zero · · Score: 4, Informative

    Used at my Fortune 500 company to monitor thousands of Win/Linux/Solaris. http://www.bb4.org/

    1. Re:Big Brother by fireshipjohn · · Score: 4, Informative

      Or for the open source version ..

      Big Sister.

      http://bigsister.graeff.com/

  2. Nagios by codejnki · · Score: 2, Insightful
    I implemented Nagios at my work and am very happy with it. We are mostly a Citrix shop, so rogue processes on Windows servers are the bane of things running smoothly. We monitor average CPU load on all our application servers.


    All in all I'm monitoring about 200 different processes across our network as well as running MRTG on the same box. Never once felt I needed to cluster.

    --
    "War doesn't determine who's right, just who's left"

    Steven Wright

    1. Re:Nagios by dubious9 · · Score: 2, Insightful

      I like Nagios, but it can be difficult to set up. To obtain the same level of functionality as MS MOM, I had to write a number of plugins. Plugin writing is really easy if you know some Perl; it's basically just creating a CLI utility that outputs in a certain format that Nagios can understand.
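To make the "CLI utility that outputs in a certain format" point concrete, here is a minimal sketch of a Nagios-style plugin, written in Python rather than the Perl mentioned above. It relies on the standard plugin convention: one status line on stdout (optionally followed by `|` and performance data) and an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The disk check itself, the function names and the thresholds are illustrative assumptions, not anything from the comment.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, status_line).

    warn_pct and crit_pct are illustrative thresholds, not recommendations.
    """
    if free_pct < crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif free_pct < warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the "|" is optional performance data.
    return code, f"DISK {label} - {free_pct:.1f}% free | free_pct={free_pct:.1f}%"


def main(path="/"):
    """Print one status line and return the exit code for that status."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {path}: {exc}")
        return UNKNOWN
    code, line = evaluate(100.0 * usage.free / usage.total)
    print(line)
    return code

# To use this as a real plugin, end the file with:
#   import sys; sys.exit(main())
```

Keeping the threshold logic in `evaluate` separate from the system call makes the status mapping easy to test without touching a real disk.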

      Remote services can be checked with ease, but stuff that needs to query the local system, disk space or load for example, needs a different setup. I ran it through SSH, but I'm not sure what kind of load that would put on 200+ machines and the network. SSH was never meant to be lightweight. You could also do some MRTG hacking and SNMP, but it's a lot of work. If the network were mostly Linux machines, I'd go with Nagios. However, since this problem is mostly with Windows machines, I'd use MS MOM as the main tool, and go from there.

      --
      Why, o why must the sky fall when I've learned to fly?
  3. Best type of network monitoring by antifoidulus · · Score: 2, Funny

    is still someone calling you up at 3 am screaming at you that the network is down and yelling at you to get your ass in there asap. Works every time!

  4. It's Easy. by MistabewM · · Score: 2, Funny

    You blame the new hire. If there is more than one new hire you blame the one that spends his waking life playing everquest or .

    As for preventive measures, you put the new hire in charge of the task and sit back and relax.

    1st rule of management is misdirection.

    --
    "A learning experience is one of those things that says, 'You know that thing you just did? Don't do that.'" - DNA
  5. Also curious by RabidMonkey · · Score: 5, Interesting

    We are in the middle of a large scale Linux project .. replacing 900 SCO Unix servers with Linux. We are wondering the same thing .. what monitoring tools can be put in to watch these servers?

    Our big hitch comes from the fact that we only have a satellite connection to each of the remote sites, so we can't do real time monitoring, so things like HP Openview NNM are out of the question - they use too much bandwidth.

    Our solution (and the reason I'm working late right now) is to build a custom suite of tools that does batch reporting every night by polling logs and custom programs, then sends it back in a handy XML file. We take that file, dump it into a large Informix database, and then we can do whatever we want to create reports.

    It's a little more work than just installing a package, but we're getting EXACTLY what we want out of the product. It works with our very unique communications and configuration, and it's modular so I can add whatever monitoring/checks I want by writing a new ksh script. All the output is standardized and all the parsing is done at the office by a very clever XML parser one of the db guys wrote.
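A rough sketch of the nightly batch idea described above: each site collects its check results, packs them into one small XML document, and the office parses that document later. This is Python, not the ksh scripts and custom parser the comment refers to, and the element names, attribute names and the host name `store-0042` are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_report(hostname, results):
    """Serialize check results into one compact XML document.

    results is a list of (check_name, status, detail) tuples.
    The <report>/<check> layout is a hypothetical schema.
    """
    root = ET.Element("report", host=hostname)
    for name, status, detail in results:
        check = ET.SubElement(root, "check", name=name, status=status)
        check.text = detail
    return ET.tostring(root, encoding="unicode")


def parse_report(xml_text):
    """Office-side counterpart: turn the XML back into (host, results)."""
    root = ET.fromstring(xml_text)
    checks = [
        (c.get("name"), c.get("status"), c.text or "")
        for c in root.findall("check")
    ]
    return root.get("host"), checks


# One nightly report from a single remote site.
nightly = build_report(
    "store-0042",
    [("disk", "ok", "12% used"), ("backup", "fail", "tape not found")],
)
```

Sending one summarized document per night, instead of polling each check live, is what keeps the traffic over a thin link small and predictable.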

    I think for what you're looking for, there are things like Big Brother, MRTG, HP Openview, umm ... the IBM equivalent (a Tivoli product I think?). No need to reinvent the wheel.

    But I'd love some feedback for people who are working in a bandwidth sparse shop like me ... how do you balance your need to know everything about every box all the time (well, the business and management want to know at least) vs the tiny amount of data you can push without interfering with mission critical applications running over that same link?

    --
    We emerge from our mother's womb an unformatted diskette; our culture formats us. - Douglas Coupland
    1. Re:Also curious by BlurredWeasel · · Score: 2, Insightful

      Sorry to be shilling for the company I work for, but we can do exactly what you need. Indicative Service Directory can collect constant data, but then only upload to the central collection server once a day, or even less frequently if desired.

  6. MANY ways to take care of event notification by jgaynor · · Score: 4, Interesting

    So you're concerned about letting things go on auto-pilot and missing an alarm . . .

    Why not slap a modem into the head Nagios box and have it page you when things fail? Don't worry about having to wear a beeper - you can page most cellphones via your carrier's SMS gateway (still dial-capable).

    Too much hassle? How about AIM? YIM? Jabber? Email?

    If you're TRULY the teeth jittering, chain smoking NOC type, buy some x10 crap and build a physical network alarm interface like I did :). Either way, nagios has ASSLOADS of event notification options.

  7. Monitoring Tools by Plake · · Score: 3, Interesting
    Personally I've used an array of the free monitoring tools and find most of them to be decent. For larger sized monitoring you'd want something that can have the clients push data to the monitoring systems so they do far less work.
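One concrete form of the push model is the passive check in Nagios, which the original question mentions. The client runs the check itself and submits only the result, using the external command `PROCESS_SERVICE_CHECK_RESULT`. The sketch below just formats that command line. The function name and the host and service names in the usage note are hypothetical, and how the line reaches the server (NSCA is one common route) is outside the sketch.

```python
def passive_result_line(host, service, return_code, output, timestamp):
    """Format a Nagios passive service check result.

    Layout of the external command:
    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    # Each command must sit on a single line, so fold any newlines.
    clean = output.replace("\n", " ")
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean}"
    )
```

For example, `passive_result_line("web01", "Disk C", 1, "85% used", 1100000000)` produces `[1100000000] PROCESS_SERVICE_CHECK_RESULT;web01;Disk C;1;85% used`.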

    Here's a couple of the monitoring solutions:

    Opennms

    Mon

    Big Brother

    For system information polling I'd go with:

    Cacti: hands down, this is the best polling system out there, and it's simple to set up and run.

  8. Re:Why? by Asgard · · Score: 2, Interesting

    If the monitoring server goes down then no pages will go out. In this case the poster may have meant 'redundant servers' instead of a true cluster. One can configure a Nagios environment such that two systems check on each other, and if the primary Nagios instance goes down the secondary Nagios instance will take over monitoring and alerting.
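The takeover decision described above can be sketched as a simple heartbeat rule: the secondary stays quiet while it keeps seeing the primary, and starts alerting once the primary has been silent for too long. This is not how Nagios itself implements redundancy; the class name, method names and the 180-second timeout are assumptions chosen only to illustrate the logic.

```python
class StandbyMonitor:
    """Tracks heartbeats from a primary monitor and decides when to take over."""

    def __init__(self, timeout_s=180):
        self.timeout_s = timeout_s   # hypothetical grace period in seconds
        self.last_heartbeat = None   # time of the last successful check of the primary

    def record_heartbeat(self, now):
        """Call whenever a check of the primary succeeds."""
        self.last_heartbeat = now

    def is_active(self, now):
        """True when the standby should be sending alerts itself."""
        if self.last_heartbeat is None:
            return True  # never heard from the primary: assume it is down
        return (now - self.last_heartbeat) > self.timeout_s
```

The timeout trades duplicate pages against missed ones: set it too short and both servers may page during a brief network blip, too long and an outage of the primary goes unnoticed.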

  9. polling and whatnot by bendsley · · Score: 2, Informative

    At work, we use a program called JFFNMS (Just For Fun Network Monitoring System). It's come a long way from its inception, and has the ability to run on *nix and Windows.

    We tried Cacti and just didn't like it. I've looked at Nagios, but not in detail.

    Find JFFNMS at www.jffnms.org

    --
    Alcohol & calculus don't mix. Never drink & derive.
  10. Profiler Rx by Tek-Tools by blunte · · Score: 2, Informative
    This thing rocks.

    Profiler Rx

    --
    .sigs are for post^Hers.
  11. Break it down and start with the small stuff by bolix · · Score: 3, Informative

    Rule #1: Do not add to the problem.

    This is an entire category of Operations Management and can encompass everything. Don't take it lightly and don't be afraid to start small. The first thing you need to do is categorize what you want to monitor into individual sections and work on the easiest stuff first. By the time you work up to the tough stuff, you'll have an idea of what's available and what your capabilities are, and hopefully the easy stuff can be quickly rewritten/integrated into a better solution. Don't miss the critical stuff in a morass of junk alerts. Sample consideration (everything that moves in the data center):

    Hardware:
    ---------
    1) Server
    2) Storage
    3) Network
    4) Power
    5) Environment
    6) Security

    Software:
    ---------
    1) OS
    2) Applications
    3) Security

    Events:
    -------
    1) Failures
    2) Alerts
    3) Misfires
    4) Security

    Triggers:
    ---------
    1) Notification/False positives
    2) Action plans/Event handling
    3) Documentation, Documentation, Documentation
    4) Reporting aka analysis and cya
    5) Security

    Once you're done building it, start over. The last tier is the most visible: delegating a RAID rebuild page to the opcenter flunky without proper documentation is a Career Limiting Move (CLM), and building the best monitoring system is a fucking waste unless you pay attention to it. The most apt cliches for monitor normalisation are all military: Warrooms, Bridges, Weapons Hot, Communications channels, etc. View everything as SNAFU and work from there.

    Rule #1: Do not add to the problem.

  12. try argus... by bani · · Score: 2, Informative

    argus is not bad, is pretty simple to set up, and scales reasonably well.

    it's also pretty flexible so you can plug it into just about any paging system and monitor just about any service you can imagine.

    it handles hierarchies so you don't get 300 pages at once, etc. and it has a simple, fast, clean web interface which isn't bloated with gigabytes of shiny widgets and is even perfectly usable via lynx.
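The hierarchy point is worth a small illustration: if a switch dies, every host behind it looks dead too, and only the switch deserves a page. The sketch below shows the general idea of suppressing alerts for nodes with a failed ancestor. It is not argus's actual implementation, and the function names and topology are hypothetical.

```python
def has_down_ancestor(node, parents, down):
    """Walk up the parent chain and report whether any ancestor is down."""
    seen = set()
    current = parents.get(node)
    while current is not None and current not in seen:
        if current in down:
            return True
        seen.add(current)  # guards against accidental cycles in the map
        current = parents.get(current)
    return False


def pages_to_send(parents, down):
    """Return only the failed nodes that have no failed ancestor.

    parents maps each node to its parent (None for the root);
    down is the collection of nodes currently failing their checks.
    """
    down = set(down)
    return sorted(n for n in down if not has_down_ancestor(n, parents, down))


# Hypothetical topology: two web servers behind a switch, a db on the router.
topology = {
    "router": None,
    "switch": "router",
    "web1": "switch",
    "web2": "switch",
    "db1": "router",
}
```

With this topology, a failed switch that takes `web1` and `web2` with it yields a single page for `switch` instead of three.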

    it has a few rough edges but the overall ease of use and simplicity make up for it.

  13. Re:Zabbix by egon · · Score: 2, Insightful

    I've played with Nagios a fair bit. When I saw the Zabbix article in the recent Linux Magazine, I thought I'd check it out. The graphing looked much easier to do, and I liked the concept of "screens".

    Having used it for a couple weeks now, I would call it "software with potential". It's not quite there yet, and has the feeling of being immature software. This is not a dig - quite the contrary. I think that zabbix has a *lot* of potential, but I think it needs a little more time before it's ready.

    Just my $.02.

    --
    Give a man a match, you keep him warm for an evening.
    Light him on fire, he's warm for the rest of his life
  14. Intermapper Remote by bill_mcgonigle · · Score: 2, Informative

    Our big hitch comes from the fact that we only have a satellite connection to each of the remote sites, so we can't do real time monitoring, so things like HP Openview NNM are out of the question - they use too much bandwidth.

    You might be interested in Intermapper for its Remote component. You can run a monitoring system at each location yet administer them centrally. Your remote datastream will basically be the set of events that's interesting to the Human In Charge.

    It's commercial software, so you have to weigh the cost of the software vs. getting it right yourself.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)