Slashdot Mirror


What Would You Want In a Large-Scale Monitoring System?

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers and applications. The final product must be very easy to understand, as it will also be used by help desk support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
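
To put the 5 or 10 second polling concern in numbers, here is a minimal back-of-the-envelope sketch in Python. The 5,000 devices and the 5 and 10 second intervals come from the summary; the one check per device, the 300 second comparison interval, and the 200 ms average check latency are assumptions chosen purely for illustration.

```python
import math


def required_checks_per_second(devices: int, checks_per_device: int,
                               interval_s: float) -> float:
    """Average check rate needed to run every check once per interval."""
    return devices * checks_per_device / interval_s


def workers_needed(rate_per_s: float, avg_latency_ms: int) -> int:
    """Concurrent workers needed if each check blocks for avg_latency_ms.

    This is Little's law: work in flight = arrival rate x time per item.
    """
    return math.ceil(rate_per_s * avg_latency_ms / 1000)


if __name__ == "__main__":
    # 5,000 devices is from the summary; one check per device and
    # 200 ms per check are assumed values, not measurements.
    for interval in (5, 10, 300):
        rate = required_checks_per_second(5000, 1, interval)
        workers = workers_needed(rate, 200)
        print(f"{interval:>3} s interval: {rate:.1f} checks/s, {workers} workers")
```

The point of the sketch is only that shortening the interval scales the sustained check rate linearly, which is why pollers that fork a script per check tend to struggle at this size.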

342 comments

  1. I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · · Score: 5, Funny

    Publish them in DNS, and have the NSA monitor them for me!

    --
    "Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
    1. Re:I Name My Devices After Al Qaeda Members by just_another_sean · · Score: 2, Funny

      I was going to suggest he should ask the UK government, but I like your idea better.

      --
      Creationist Textbook Stickers Declared Unconstitutional by CowboyNeal
    2. Re:I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · · Score: 5, Funny

      The only drawback to this comes in the form of UAVs.

      --
      "Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
    3. Re:I Name My Devices After Al Qaeda Members by Hurricane78 · · Score: 4, Funny

      No need to. We are doing it anyway.

      Your NSA.

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    4. Re:I Name My Devices After Al Qaeda Members by ionix5891 · · Score: 1

      What Would You Want In a Large-Scale Monitoring System?

      uncle sam is that you?

    5. Re:I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · · Score: 1

      Well, yeah. NOW you are, after that!

      --
      "Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
    6. Re:I Name My Devices After Al Qaeda Members by Moblaster · · Score: 1

      Just use a series of tubes. It's really the easiest approach.

    7. Re:I Name My Devices After Al Qaeda Members by ObsessiveMathsFreak · · Score: 1

      Good idea, except they stopped monitoring when they found out your sites are all still under construction.

      --
      May the Maths Be with you!
    8. Re:I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · · Score: 1

      I said Al Qaeda members - not CIA employees.

      Whoops! I thought there was a difference. Oh, well.

      --
      "Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
    9. Re:I Name My Devices After Al Qaeda Members by bilbobugginz · · Score: 1

      I thought he said "large scale" (something on the scale of > 1000)

    10. Re:I Name My Devices After Al Qaeda Members by Bob_Who · · Score: 1

      Yeah. Give the BOTS and Trojans a day off.

    11. Re:I Name My Devices After Al Qaeda Members by Anonymous Coward · · Score: 0

      Brilliant!

      Data access is conveniently provided through any of their Windows security holes.

      Everyone wins.

    12. Re:I Name My Devices After Al Qaeda Members by mysidia · · Score: 1

      The other drawback is, although they may be monitoring them, they're not going to send you notifications about their status.

      They may even have a hand in the hosts going down.

      And the only alert you actually get is when they come and visit you, to take you away for interrogation, followed by covert, involuntary relocation to Guantanamo.

    13. Re:I Name My Devices After Al Qaeda Members by Hurricane78 · · Score: 1, Interesting

      I get "Insightful" for THIS, but my other much more important comments get nothing or even "Troll"??

      Slashdot has gotten seriously weird / fucked up lately...

      People, please learn that even if you completely disagree, it still is no troll, but very insightful. Because without it you would not be able to disagree with it and come up with your own point of view in the first place! And sometimes making you disagree is just the point of a good argument!

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    14. Re:I Name My Devices After Al Qaeda Members by Hurricane78 · · Score: 1

      Nope.

      But hey, why is your e-mail address's domain owned by a domain squatter?

      --
      Any sufficiently advanced intelligence is indistinguishable from stupidity.
    15. Re:I Name My Devices After Al Qaeda Members by Anonymous Coward · · Score: 0

      Use McAfee NUBA (Securify) or Cisco MARS.

  2. Hyperic HQ by Anonymous Coward · · Score: 2, Informative

    Hyperic HQ may be worth checking out.

    1. Re:Hyperic HQ by Anonymous Coward · · Score: 0

      Case of one, but I have not had good luck with Hyperic being able to consistently detect whether a server or application is truly "Available" on even 5 minute granularity. FreeNATS could also be looked at.

    2. Re:Hyperic HQ by Anonymous Coward · · Score: 0

      Yes, it's true. Hyperic is a very good option. But it's very expensive (well, at least it was a year ago) under the Corporate license (the only one that scales well to those 2500+ monitored nodes)...

    3. Re:Hyperic HQ by Anonymous Coward · · Score: 0

      InterMapper works, but it's licensed per device. It is very n00b friendly, as it allows you to map your entire network and how the devices interconnect. You can also monitor a device via ping or SNMP, and for services like HTTP and MySQL you can monitor the service itself.

    4. Re:Hyperic HQ by Doug+Neal · · Score: 1

      Last time I looked into Hyperic HQ (around last November I think) it seemed to be all talk and no trousers. I was also put off by the big fat Java agent that you have to install on the servers you want to monitor.

  3. rule based DSS by ecklesweb · · Score: 1

    Don't assume that you can successfully diagnose the problem based on your understanding of the indicators. You don't know my institutional context. Instead, give me a decision support system that I can use by adding rules that key off the monitored indicators and inject some of our own expertise into the diagnostic process.

    1. Re:rule based DSS by mysidia · · Score: 1

      Huh? Monitoring systems don't diagnose problems, they report symptoms that strongly suggest or actually indicate problems.

      (Depending on the severity of conditions observed, e.g. host not responding to ping, or last 5 SNMP queries show load average statistics much higher than usual, and process count is in excess of 1000 processes)
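
The three example conditions above (no ping reply, the last 5 load samples far above normal, more than 1000 processes) can be sketched as a small symptom reporter. This is an illustrative Python sketch only: the function names, the baseline comparison, and the factor of 3 are assumptions, and how the samples are collected (ping, SNMP) is deliberately left out.

```python
from statistics import mean
from typing import List, Sequence


def load_is_abnormal(recent: Sequence[float], baseline: Sequence[float],
                     factor: float = 3.0) -> bool:
    """True when every recent sample exceeds factor x the baseline mean."""
    if not recent or not baseline:
        return False  # not enough data to call anything abnormal
    threshold = factor * mean(baseline)
    return all(sample > threshold for sample in recent)


def symptoms(ping_ok: bool, load_samples: List[float],
             baseline_load: List[float], process_count: int,
             max_processes: int = 1000) -> List[str]:
    """Report observed symptoms; it does not try to diagnose a cause."""
    found = []
    if not ping_ok:
        found.append("host not responding to ping")
    # Only the last 5 polled values are considered, as in the comment above.
    if load_is_abnormal(load_samples[-5:], baseline_load):
        found.append("load average well above baseline for the last 5 polls")
    if process_count > max_processes:
        found.append(f"process count {process_count} exceeds {max_processes}")
    return found
```

The output is a list of symptoms for a human (or a rule layer) to interpret, which matches the distinction drawn above between reporting and diagnosing.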

  4. OpenNMS by Anonymous Coward · · Score: 5, Informative

    That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.

    OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.

    http://www.opennms.org/

    Enjoy.

    1. Re:OpenNMS by Ryan+Amos · · Score: 1

      Mod parent up, OpenNMS rules.

    2. Re:OpenNMS by mu51c10rd · · Score: 4, Interesting

      I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.

    3. Re:OpenNMS by Cato · · Score: 4, Insightful

      I've only tried OpenNMS. It looks very powerful, but wasn't at all hard to get installed and configured on Ubuntu - it figures out the type of node it has discovered and shows useful data through SNMP, and can also do uptime monitoring, and is generally very scalable and configurable if needed.

    4. Re:OpenNMS by abigor · · Score: 1

      I'll second this.

    5. Re:OpenNMS by Anonymous Coward · · Score: 0

      Why do all free NMS systems have crappy performance (execute system()/shell scripts to check absolutely anything) or require you to learn a whole new application specific poorly designed language to do even simple tasks?

      I just want something that looks reasonable (A configuration GUI), works, a sane person can learn to use reasonably well in less than an hour that scales well with strong SNMP discovery.

      After spending less than 2 minutes toying around with the OpenNMS demo server, I was greeted with the following:

      org.opennms.web.event.EventIdNotFoundException: The event id must be an integer.
      at org.apache.jsp.event.detail_jsp._jspService(detail_jsp.java:70)
      at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
      at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:328)
      at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:315)
      at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
      at org.extremecomponents.table.filter.AbstractExportFilter.doFilter(AbstractExportFilter.java:49)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
      at org.opennms.web.StoreRequestPropertiesFilter.doFilter(StoreRequestPropertiesFilter.java:71)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:265)
      at org.acegisecurity.intercept.web.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:107)
      at org.acegisecurity.intercept.web.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:72)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
      at org.acegisecurity.ui.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:166)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
      at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
      at org.acegisecurity.wrapper.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:81)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
      at org.acegisecurity.ui.basicauth.BasicProcessingFilter.doFilter(BasicProcessingFilter.java:173)
      at org.acegisecurity.util.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:275)
      at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFi

    6. Re:OpenNMS by Anonymous Coward · · Score: 0

      Too bad it's Java! That's an automatic no-no!

    7. Re:OpenNMS by Euan+Buchanan · · Score: 3, Informative

      OpenNMS offers excellent training and support. My company flew Tarus down to Australia for a week where he implemented OpenNMS across our four sites in the first day, then spent four days with myself and a colleague training us on its features. The price, including business class flight from America (it's a lonnnng flight and we wanted him semi-conscious on arrival) was absolutely trivial when compared against just the licensing costs of a comparable proprietary product such as HP OpenView. Highly recommended.

    8. Re:OpenNMS by Ranger+Rick · · Score: 2, Informative

      That would definitely be a bug. I'll look into it... :)

      --

      WWJD? JWRTFM!!!

    9. Re:OpenNMS by mysidia · · Score: 2, Insightful

      MRTG is kind of limited. I would suggest using Cacti instead; it provides useful features like graph zooming and detailed RRD configurations like number of DSes, number of steps and rows per DS in a RRD.

      I would try installing the opennms software yourself, and don't rely on that clearly broken demo of the software. The breakage you are seeing of the demo is not at all representative of what the software is normally like.

      OpenNMS is not bug free, but neither are any of the viable competitors close to 'bug free'. When setup properly and JVM/memory settings tuned, including proper setup and tuning of the PostgreSQL database per PostgreSQL best practices, with physical resources suitable to the load (e.g. about 4gb of physical RAM and 2x2Ghz CPU plus 80gb disk space to monitor 3000 interfaces), polling each service every 30 seconds, OpenNMS runs like a champ.

      It's pretty darn stable. And vastly outperforms other Open Source solutions.

      I've hardly ever seen exception conditions like that one, and generally, it would just be the webui having problems.

    10. Re:OpenNMS by inKubus · · Score: 1

      I'd recommend a suite of stuff, not just NMS. It's good, but nagios can be made so compact and small. MRTG/rrdtool is the only solution for gathering stats. There's a ton of third party visualization stuff for all monitoring systems. When you're talking about a monitoring system, you have to decide the facets you're looking to cover. Number one is a current status of systems, ie: a table of systems and their ping, load, whatever. Number two is automation. Can the system kick off escalating notifications when a host is down? Can the system kick off corrective actions, such as restarting httpd or power cycling a switch? Can it integrate with your asset management and help desk system? Number three is data. Can it provide long-term data for peak analysis? For visualization?

      We're using nagios right now and it seems quite good for all these things. Yes, it's a little hard to set up, and takes some time to get right. But it has all the features necessary to monitor a network. Even better, it's easy to write plugins, scripts, and integrate with other existing systems.

      What about network security monitoring? Traffic monitoring? And you need something like syslog-ng to collect your logs so you can take action when something goes down.

      Also, I think Zabbix is pretty good also.

      --
      Cool! Amazing Toys.
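
The "escalating notifications when a host is down" idea from the comment above can be sketched as a time-based policy. This is a minimal Python illustration, not how Nagios or any other tool implements it; the contact names and the 0/15/60 minute steps are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # downtime, in minutes, at which this step fires
    notify: str         # who is notified (hypothetical contact names)


# Hypothetical policy: widen the audience the longer the host stays down.
POLICY = [
    EscalationStep(0, "on-call help desk"),
    EscalationStep(15, "on-call sysadmin"),
    EscalationStep(60, "IT manager"),
]


def contacts_due(down_minutes: int,
                 policy: List[EscalationStep] = POLICY) -> List[str]:
    """Everyone who should have been notified after the given downtime."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.notify for s in ordered if down_minutes >= s.after_minutes]
```

A scheduler would call `contacts_due` on each polling cycle and send only to contacts not yet notified for the current outage; that bookkeeping is omitted here.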
    11. Re:OpenNMS by ta+bu+shi+da+yu · · Score: 1

      I'm curious if anyone's used EMC's Smarts Service Assurance Manager (SAM)? Or nLayers' (also owned by EMC) Application Discovery Manager (ADM)?

      --
      XML is like violence. If it doesn't solve the problem, use more.
    12. Re:OpenNMS by Anonymous Coward · · Score: 0

      I have built and managed an OpenNMS system for about 8k devices and it worked beautifully. Has some nice integration with the tools you mentioned above as well(cacti, hyperic,etc).

      I know we are talking Opensource but for what its worth do not choose any BMC products, at all, for monitoring systems and network devices.

    13. Re:OpenNMS by mu51c10rd · · Score: 1

      Can't say I have used EMC's software, but if their support is anything like their SAN or backup support, I'll pass. I have used HP SIM, solid product, but only good if you have all HP equipment and don't care about the application layer much.

    14. Re:OpenNMS by mu51c10rd · · Score: 1

      I'll second your last statement. Most bugs I find in openNMS are in the web interface. Backend code seems solid...maybe they just need to bring on a good web developer?

    15. Re:OpenNMS by ProfFalcon · · Score: 1

      I looked at many different commercial options here at work and purchased EMC Smarts. The root cause analysis is very helpful. It has saved us a lot of time tracking down some outages we've had here. It can tell you, for instance, that a specific port on a switch is down or flapping which is causing problems.

      Most of the other tools we looked at would tell you that all of the servers at a remote facility were down but Smarts will take it one step further and identify the root cause so you do not spend time figuring out if one of the routers on one side or the other is down, if it is the link itself, a firewall, etc. It is all information you could tell on your own but none of the other tools even went to the detail necessary to track the problem down using only the information presented in the tool. Smarts goes even further and specifically points at the problem point.

      There are a bunch of other modules you can buy to help you automatically model application/system dependencies to find out which business units are impacted by an outage, what systems/applications would be impacted by a DB outage, etc. Other modules can track application performance in all of the steps from workstation all the way through the network into each server and DB using just network monitoring or through synthetic transactions.

      It is not cheap by any stretch of the imagination but implementation is fairly easy with its autodiscovery providing huge value. If you want to use it to its fullest, it will take some learning and a bit of time from a good administrator.

      Before anyone asks, I was unsuccessful getting open source tools seriously considered. I had implemented OpenNMS very successfully and was using it unofficially to monitor for outages and track system availability.

      Doing this right is not a light task if you want all of the detail necessary to properly manage a large-scale network. We are getting into the level of detail of monitoring server memory utilization, disk space utilization, CPU, switch/router port utilization, etc. It is taking at least one full-time administrator just to manage it. I wish you luck on developing a new tool. I would encourage you to look at some of the existing tools before trying to build your own. Just implementing a tool that has been around a long time is a huge process. Building and implementing....

      --
      Simply stating [Citation Needed] does not automatically make you insightful or brilliant.
    16. Re:OpenNMS by ta+bu+shi+da+yu · · Score: 1

      Hey. I guess now is the wrong time to confess I work for EMC? :-)

      --
      XML is like violence. If it doesn't solve the problem, use more.
    17. Re:OpenNMS by Ranger+Rick · · Score: 1

      Yeah, the web UI code is the cruftiest part of OpenNMS, and it's next on our list to tackle/modernize. It's last in the list since the most important part is the backend, and notifications. Day-to-day, the web UI is more important to managers that want to see pretty graphs, and the notification system is for the folks doing real work responding to issues. ;)

      We've already started taking steps towards that, implementing a RESTful interface for the backend parts of the system. Now we need to make a nice UI that takes advantage of it...

      --

      WWJD? JWRTFM!!!

    18. Re:OpenNMS by Akatosh · · Score: 1

      Ya, I tried to use the EMC package for a while. I didn't care for having to use their proprietary programming language to extend or customize it. It was also expensive for what it did. There always seemed to be missing functionality like 'so how do we make it query a radius server to see if it's up?' oh that'll be in the NEXT version, you should upgrade for the low low price of only $300,000. Repeat. Oh, we're sorry, you need another $150,000 module to do that, no wait.. time to upgrade again! Another $200k! Oh doesn't support that feature either, maybe you can write a way to monitor it in our proprietary programming language that no one knows and isn't documented? Oh your annual service contract is $75k a year.

      Here's my opinion on the EMC product line.

    19. Re:OpenNMS by guile*fr · · Score: 1

      I'm following OpenNMS with some interest. (mostly for network)
      In the backend, I think it lacks a way to monitor via snmp the status of modules in a network chassis

    20. Re:OpenNMS by Ranger+Rick · · Score: 1

      It may not have default configs for everything, but it should be able to monitor just about anything available through SNMP through configuration, or the SNMP poller, or the BGP monitor.

      --

      WWJD? JWRTFM!!!

    21. Re:OpenNMS by guile*fr · · Score: 1

      I thought that the snmp poller only polled interfaces status, if so it doesn't fit the bill.
      If it can poll any oid and test the result against arbitrary values i should look harder, i'm afraid that passive status check could be a serious performance hit.

    22. Re:OpenNMS by Anonymous Coward · · Score: 0

      It can, but it currently uses a custom MIB definition format for stats. I recently manually converted MIB stats information from a standard MIB format for some Riverbed HTTP accelerators to fit into the openNMS configuration file format. It took a while because I was misunderstanding something about the oid tree boundary between the device class and its stats, but I did manage to get it working. So it is definitely extensible, it's just that there's a limited set of devices supported in the stock configuration.

  5. A more interesting question by drsmithy · · Score: 5, Insightful

    What limitations exist in current solutions that justify developing a new one from scratch ?

    1. Re:A more interesting question by Meshach · · Score: 3, Insightful

      What limitations exist in current solutions that justify developing a new one from scratch ?

      Exactly! Too often people just jump in and redo everything without actually investigating what needs to be fixed. The quote from George Santayana, "Those who cannot learn from history are doomed to repeat it," seems very appros here.

      --
      "Maybe this world is another planet's hell"
      Aldous Huxley
    2. Re:A more interesting question by glassware · · Score: 3, Informative

      He said he was asked to "develop a new solution" - which most likely means he gets to pick and choose what to implement, whether parts of it are custom developed or off the shelf. I would imagine a good solution would be a core product plus custom built extensions for the features he needs that the product doesn't implement itself.

    3. Re:A more interesting question by Krneki · · Score: 2, Informative

      Exactly, I need a good core product that I'll evolve over time.

      --
      Love many, trust a few, do harm to none.
    4. Re:A more interesting question by Anonymous Coward · · Score: 0

      Limitations? There is no need for limitations; sometimes you do something just because you can...

    5. Re:A more interesting question by drsmithy · · Score: 1

      He said he was asked to "develop a new solution" [...]

      From TFSummary:

      Today I have changed employer and I have been asked to develop a new monitoring solution from scratch [...]

      Where I come from, "from scratch" doesn't mean "configure existing solutions to my needs".

    6. Re:A more interesting question by drsmithy · · Score: 1

      Exactly, I need a good core product that I'll evolve over time.

      Why do you want to "evolve" it (by which I'm assuming you mean modify in depth rather than "configure") at all ? What's missing that you need ?

      System monitoring isn't exactly a fresh and new field. There are numerous well-established and quite comprehensive products already out there.

    7. Re:A more interesting question by Krneki · · Score: 1

      Mostly it is SNMP and WMI hacking.

      Not all vendors respect open standards, so you have to guess how to get the info from a UPS, printer, ....

      --
      Love many, trust a few, do harm to none.
    8. Re:A more interesting question by dave562 · · Score: 1

      The article mentions that he is starting a job at a new employer. The systems that he listed are systems that he has experience with. It seems to me that he's open to the possibility that, despite having had experience with numerous systems, there might be a better way to do things than he has done them in the past.

    9. Re:A more interesting question by abigor · · Score: 3, Interesting

      The big questions are:

      Will your solution need to support snmp v3?

      Do the devices you talk to have published oids?

      Do you need source code to extend it?

      If yes to these, OpenNms is a great bet.

    10. Re:A more interesting question by ArsonSmith · · Score: 2, Funny

      Yea, from scratch, first I'd develop the tools needed to mine the raw materials of silicon, iron, and other needed elements. Then I'd refine them and produce the needed components for memory and processors and storage. as well as develop the new networking, power, form factor etc... Then start working on the boot code and a core kernel, hmm should it be micro/macro or hybrid...? Then I'd start working on interface tools or user space or something along those lines. Once I got this part done I'd start gathering information on what was needed to be monitored. Then develop the required protocols to monitor those things.

      On second thought maybe it'd be easier to not start from scratch and build on the tools others have created as a basis and customize from there.

      --
      Paying taxes to buy civilization is like paying a hooker to buy love.
    11. Re:A more interesting question by timeOday · · Score: 2, Insightful

      If only you read more than the first sentence of TFSummary: "I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager."

      Obviously by "from scratch" he means his company has nothing in place he has to build on; he is free to build a system on whatever tools he likes.

    12. Re:A more interesting question by Krneki · · Score: 1

      Will your solution need to support snmp v3?
      No, monitoring is inside the intranet, so there is no need for SNMP v3, besides they are configured for read only and access from 1 IP only.

      Do the devices you talk to have published oids?
      Some do, some don't.

      I'll check OpenNms

      --
      Love many, trust a few, do harm to none.
    13. Re:A more interesting question by afidel · · Score: 1

      You almost ALWAYS want to do custom monitoring, for WhatsUp for my environment we developed scripts that monitor for the existence of individual processes through WMI or make sure that no JAVA process is over ~1.2GB because garbage collection goes crazy on most JVM's at that point, etc. I have about 100 custom monitors in my relatively small environment (~200 servers and about 100 network devices). Whether you consider custom script development configuration or modifications is up to you.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
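
As an illustration of the kind of custom monitor described above (flagging any Java process above roughly 1.2 GB), here is a minimal Python sketch. The original comment used WMI under WhatsUp; this sketch swaps in parsing of Linux `ps -eo pid,rss,comm` style output instead, which is an assumption made to keep the example self-contained. The function name and the exact limit constant are likewise hypothetical.

```python
from typing import List, Tuple

# ~1.2 GiB expressed in KiB, the unit Linux ps uses for the RSS column.
LIMIT_KIB = int(1.2 * 1024 * 1024)


def oversized_java(ps_output: str,
                   limit_kib: int = LIMIT_KIB) -> List[Tuple[int, int]]:
    """Return (pid, rss_kib) for each java process above the limit.

    Expects lines shaped like 'PID RSS COMMAND', for example the output
    of `ps -eo pid,rss,comm` on Linux (an assumed data source).
    """
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header row and malformed lines
        pid, rss, command = int(parts[0]), int(parts[1]), parts[2].strip()
        if command == "java" and rss > limit_kib:
            hits.append((pid, rss))
    return hits
```

A monitoring system would run something like this as a plugin and raise an alert whenever the returned list is non-empty.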
    14. Re:A more interesting question by hedronist · · Score: 1

      [ ... ] I'd refine them and produce the needed components for memory and processors and storage. as well as develop the new networking, power, form factor etc... [ ... ]

      I worked on the National Semiconductor Mesa project in the early 80's and we came damn close to doing this: new CPU and MMU architecture (and therefore new silicon), new mobo architecture, new boot code, new language (sort of - e-pascal), new OS, new utilities, new apps, etc., etc.

      Kids, Just Say No.
      Don't go there.
      Don't let your friends go there.
      That way lies only madness and tears.

    15. Re:A more interesting question by Nefarious+Wheel · · Score: 1

      Where I come from, "from scratch" doesn't mean "configure existing solutions to my needs".

      It's a variable. When I cook pancakes from scratch, I don't grow the wheat or grind the flour or milk the cow. No doubt there would be at least a subjective improvement if I did (see the online WEC for options there) but I'm perfectly happy to call any recipe that doesn't come from a commercial pancake mix "from scratch". The nice thing about open source is that you can change the mix of ingredients.

      Wow what a metaphor. Now I'm hungry.

      --
      Do not mock my vision of impractical footwear
    16. Re:A more interesting question by PONA-Boy · · Score: 1

      I have used MRTG, CACTI, and PRTG in production use for a little more than 500 unique devices. I started piloting OpenNMS but never got too deep into it before I moved on so I cannot comment on its usefulness. I CAN, however, say that PRTG (especially in the new v7) is the whole shebang for us.

      There used to be a PRTG product and an IPCheck product from Paessler but -now- the PRTG product has pretty-much everything in one convenient package. You can monitor routers and switches (that Cacti was so very, very good at doing) but it also can monitor servers, printers, and other hardware. Anything you can expose an OID for, you can check it with PRTG. We have the NetFlow version, as well, so I capture all the NetFlow streams from our boundary routers and core routing switches. All of the critical Windows/Linux boxes in-house are also a breeze to setup and monitor.

      For the price, the product is really a great deal. The commercial support is good, too, not to mention the large volume of customer/vendor forums on their website. I highly recommend it, esp. considering the short deployment time it required.

      --
      +that's funny...I don't FEEL tardy.+
    17. Re:A more interesting question by Anonymous Coward · · Score: 0

      I find it's not so much the limitations, as the fact that all the ones I've used are tedious to configure and maintain. Take nagios - to write custom tests, you have to faff about either writing snmp traps, setting up ssh authorized keys and custom scripts client-side or fiddle about with the nagios client itself. One of the things I like about hobbit is that you can configure it server-side, and merely installing the client at the client side is enough to monitor anything on it. Then again, hobbit's web interface looks like it was designed by a five year old.

      Then again, at least I can design the front page for hobbit. The front page of nagios looks like it wasn't designed at all, and it doesn't draw graphs either. Hyperic was quite nice at first, until I found myself writing system tests in *java*. Java, that language known for its powerful selection of sysadmin tools. Perl or Python would've done. Perhaps even *cough* ruby, but Java? Java doesn't understand file permissions.

      OpenNMS is a mess, big brother is a joke, all the commercial ones draw pretty graphs but can't do much else except force you to use internet explorer through abuse of activex. erm..

      must admit.. i've not tried zabbix yet.

    18. Re:A more interesting question by Anonymous Coward · · Score: 0

      seems very appros here.

      The word you were looking for there is "apropos", not "appros".

      HTH. HAND.

    19. Re:A more interesting question by growse · · Score: 1

      Oh god - zabbix - nooooooo.

      I had a horrible experience trying to get zabbix to work. Initially, it works great and looks useful and pretty. But then the zabbix-server process just stops reporting on stuff for no apparent reason at all. The processes are still there, they're just not updating anything. The log files are also giving no indication as to what's going on. Worst of all, you have no way of knowing when or why it stops working, so I ended up writing a cron job to restart it every 6 hours. Then I ditched it.

      It's got great functionality, but crappy reliability. And I like my monitoring systems to be reliable.
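      Rather than restarting blindly every 6 hours, a watchdog could restart only when the data has actually gone quiet. A minimal sketch of that freshness check, assuming you can obtain the timestamp of the most recently stored value from somewhere (the function name, the 10-minute threshold and the sample timestamps are all made up for illustration):

```python
from datetime import datetime, timedelta

def is_stale(last_update, now, max_age=timedelta(minutes=10)):
    """True when the newest recorded value is older than max_age."""
    return now - last_update > max_age

# Hypothetical timestamps; in practice last_update would be read from
# wherever the monitoring server records its most recent value.
now = datetime(2009, 7, 1, 12, 0)
print(is_stale(datetime(2009, 7, 1, 11, 55), now))  # False: updated 5 minutes ago
print(is_stale(datetime(2009, 7, 1, 9, 0), now))    # True: silent for 3 hours
```

      A cron job could call something like this and restart or alert only when it returns True, which at least tells you when the server stopped updating.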

      --
      There is nothing interesting going on at my blog
    20. Re:A more interesting question by Anonymous Coward · · Score: 0

      I have run Nagios for a long time. Over the years I have tried Cacti, Zabbix, Zenoss, among others and find myself continually returning to Nagios' reliability and stability. Nagios has a clean interface, is easy to set up using templates and always tells me when something is wrong. I never have to worry about false positives or it dying. I have come to learn that for me there really is one way to do it. I run a grapher using the perfstats gathered from the checks and have as good or better insight into things than cacti or MRTG. The downside is that people complain about the old-style interface; I tell them I don't really look at the interface much, as Nagios alerts me when there is a problem. Silencing alerts or putting hosts in downtime is a few simple clicks and that is good enough for me. I don't need my monitoring system to use an ajax interface. I need it to work.

      anon cuz mod points....

    21. Re:A more interesting question by neurovish · · Score: 1

      When was this?
      I had that problem with zabbix for awhile too. I never did figure out why it died or what the trigger was, but I had a cron setup to poke it with a stick every week. That was kind of the only thing keeping me from bringing it out of the sandbox and using it for anything really important. I'm running 1.6.4 now on a Gentoo server (that's the sandbox) and haven't had problems since.

    22. Re:A more interesting question by jra · · Score: 1

      "Those who do not understand UNIX are condemned to reinvent it. Poorly." Henry Spencer at UTzoo

  6. Bash monitoring by Foofoobar · · Score: 0, Offtopic

    I built a small program that updates all bash profiles with timestamps, compares changes every few ticks, and saves them to a database where each user is given a unique ID; it also records a parent-child relationship when users su/sudo, to show heightened privileges. Very useful, as sysadmins are known to wipe their bash histories, and this kept a centralized history with relationships.

    --
    This is my sig. There are many like it but this one is mine.
    1. Re:Bash monitoring by Anonymous Coward · · Score: 0

      export HISTFILE=/dev/null

    2. Re:Bash monitoring by karnal · · Score: 1

      Sounds like you need a centralized syslog server. It could do more for you than just log commands.....

      --
      Karnal
    3. Re:Bash monitoring by totally+bogus+dude · · Score: 1

      If monitoring everything your sysadmins do is important because they have a habit of wiping their histories, I think you need new sysadmins.

  7. Before I get flamed... by jwilki1 · · Score: 4, Interesting

    I am going through this right now, and I am using or have used all the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.

    1. Re:Before I get flamed... by jmulvey · · Score: 1

      I foresee myself in your shoes in the next few months. Application-level awareness of our key Microsoft applications (Exchange, MOSS, AD, etc.) is very high on our need list, and so SCOM is a natural best-of-breed pick. However, I *really* want a single integrated solution that also covers our unix/linux systems. Is unix/linux monitoring part of your requirements? If so, could you briefly describe the capabilities (and requirements on the monitored systems) that SCOM currently has in this regard?

    2. Re:Before I get flamed... by Anonymous Coward · · Score: 2, Informative

      SCOM R2 integrates native Unix and Linux agents. Supported systems are:

      HP-UX 11i v2 and v3 (PA-RISC and IA64)
      Sun Solaris 8 and 9 (SPARC) and Solaris 10 (SPARC and x86)
      Red Hat Enterprise Linux 4 (x86/x64) and 5 (x86/x64) Server
      Novell SUSE Linux Enterprise Server 9 (x86) and 10 SP1 (x86/x64)
      IBM AIX v5.3 and v6.1

      For application awareness, you can check the BridgeWays management packs.

    3. Re:Before I get flamed... by blincoln · · Score: 3, Informative

      As an aside, SCOM is a good product, but be sure you have (and are willing to invest) the time to configure it to match your environment. Just because it's also made by MS and has management packs for all of their products doesn't mean you can just flip the on switch and have everything monitored. You will almost certainly be flooded with useless alerts, and not alerted for things that you do care about.

      --
      "...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
    4. Re:Before I get flamed... by Krneki · · Score: 3, Informative

      I saw SCOM 1 year ago; the hardware requirement for just the client was higher than for the whole Cacti server.

      Unless they start to optimize the mess I'm not sure I want to use it.

      --
      Love many, trust a few, do harm to none.
    5. Re:Before I get flamed... by Anonymous Coward · · Score: 0

      SCOM 2007 R2 can monitor Linux/Unix natively (built-in) and websites. You can also buy add-on packs from different companies (Veeam, BridgeWays, the list goes on) to monitor VMware, the LAMP stack, Oracle, etc.

      We are currently monitoring about 6,000 devices and will be expanding that quite a bit. It takes a lot of tuning so that you don't spam yourself but it has agents for just about anything you could want to monitor.

    6. Re:Before I get flamed... by icedivr · · Score: 1

      You've touched on the real value of SCOM - the functionality in the management packs would take years to recreate from scratch. I find the SNMP monitoring to be pretty lacking, so we use Operations Manager for server, OS, and application-level monitoring, and WhatsUp Gold for monitoring important L2/L3 devices. I don't recall ever having a network outage in the datacenter, so I don't miss the ability to create dependencies between server and network.

  8. Splunk It! by Anonymous Coward · · Score: 0

    There's only one way! Splunk it!

    1. Re:Splunk It! by Master+of+Transhuman · · Score: 1

      Yeah, I played with Splunk on my last client.

      Killed it in about ten minutes of futzing around with it. Reliability? Fail.

      Performance? You need dual Xeons. Fail.

      Support? Asked a question on the forums, got told to RTFM - which, by the way, is incomprehensible. Fail.

      Splunk sucks rocks. All I wanted was some relatively simple Windows event log monitoring. Ended up going with Network Event Monitor (which is also heavy on the hardware needed, but it actually runs and is relatively simple to set up, although limited in its filter creation).

      --
      Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
  9. Zabbix by ender- · · Score: 5, Informative

    You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.

    1. Re:Zabbix by TooMuchToDo · · Score: 5, Informative

      We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.

    2. Re:Zabbix by Achromatic1978 · · Score: 1

      You manage 2500 servers but a 150GB database is "big"? *confused*

    3. Re:Zabbix by TooMuchToDo · · Score: 1

      For a monitoring database, yes. We really have no need for the historical data.

    4. Re:Zabbix by gpmidi · · Score: 1

      I've used Zabbix at work and currently use it at home. Works great. -Paul

    5. Re:Zabbix by Achromatic1978 · · Score: 1

      Relativity is everything. :) You can't toss it after a given period of time? (From within the program, I mean)

    6. Re:Zabbix by TooMuchToDo · · Score: 1

      I believe you can, it's just a low priority item for us to configure it to purge sooner.

    7. Re:Zabbix by Anonymous Coward · · Score: 0

      We use Zabbix to monitor 6768 hosts with 1-8 checks per host at a frequency of 30 sec to once a day. I just can't even begin to say how much we like it. It is by far the best for our environment!

    8. Re:Zabbix by Anonymous Coward · · Score: 0

      We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.

      I second zabbix!

      Slightly off topic:

      My zabbix DB (MySQL) is only at 25G, but it does seem to grow quite fast. Out of curiosity, how do you back your DB up? If I dump the DB, zabbix stalls out, so that's not an option. I figure replication will work, but wanted to know what you're doing with a DB much larger than mine.

      TIA!

    9. Re:Zabbix by BlueBlade · · Score: 4, Informative

      We're using Zabbix at work and I'm doing daily backups of the database with a simple mysqldump command. Since the tables are InnoDB and not MyISAM, you can use the --single-transaction switch. That way, the dump reads from a consistent snapshot of the db taken at the start of the backup process, and writes can keep going in the meantime (they still happen and commit normally; the dump just doesn't see them). Granted, our DB isn't that big (10GB only), but it's been working fine and restore tests also seem to work fine.

      Here's the daily cron:

      mysqldump -u blah -pblah --single-transaction --opt --skip-lock-tables zabbix | gzip > /backup/zabbix_db.sql.gz

      --
      Religion is the best example of mass psychosis
    10. Re:Zabbix by StarHeart · · Score: 1

      Sounds like it is either using MyISAM for tables, or you aren't using the --single-transaction option of mysqldump for InnoDB.

      --
      Havoc Penington, the bane of my Linux desktop.
    11. Re:Zabbix by Anonymous Coward · · Score: 0

      I use Zabbix, Love it.

      Number of hosts 337
      Number of items 43389
      Number of triggers 17847

      Has all the features you want, plus a few. I would highly recommend it.

    12. Re:Zabbix by ender- · · Score: 1

      I'm not the subby, but you guys are certainly giving me some good ammo to bring to management when I do get around to testing Zabbix. :) Thanks!

    13. Re:Zabbix by Anonymous Coward · · Score: 0

      Scalability. Zabbix fails horribly. In a high performance environment it melts, there is not a chance submitter will be able to use it with 5000 machines. Even with a lot of support from zabbix developers, it was a complete waste of time. We spent months trying and could not get past the most basic checks and not even for all machines. Stability on the gathering nodes was poor. This is not an enterprise class solution.

    14. Re:Zabbix by Anonymous Coward · · Score: 0

      You may want to take a look into percona's xtrabackup. It is a free tool capable of InnoDB backups.

    15. Re:Zabbix by Atacama93 · · Score: 1

      --single-transaction and --lock-tables are mutually exclusive, and --single-transaction already turns table locking off, so you don't need to include --skip-lock-tables. Also, --opt is on by default, so you don't need it, either. Of course, it doesn't hurt to include them, other than making the command longer.
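      For anyone who would rather drive the backup from a script than a one-line cron, here is a rough Python sketch of the same dump using only --single-transaction. The function names, the database name "zabbix" and the user "backup" are placeholders, and it assumes the password comes from an option file such as ~/.my.cnf rather than the command line:

```python
import gzip
import shutil
import subprocess

def build_dump_command(database, user):
    """Return the mysqldump argv for a non-locking InnoDB dump."""
    return [
        "mysqldump",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot, no table locks (InnoDB only)
        database,
    ]

def dump_to_gzip(database, user, target_path):
    """Stream the dump through gzip into target_path."""
    command = build_dump_command(database, user)
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        with gzip.open(target_path, "wb") as out:
            shutil.copyfileobj(proc.stdout, out)
    # Leaving the Popen context waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"mysqldump exited with status {proc.returncode}")

# Placeholder names; nothing is executed until dump_to_gzip() is called.
print(" ".join(build_dump_command("zabbix", "backup")))
```

      Calling dump_to_gzip("zabbix", "backup", "/backup/zabbix_db.sql.gz") would then do roughly what the cron line does, with the bonus that a failed dump raises an error instead of silently leaving a truncated file.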

    16. Re:Zabbix by Anonymous Coward · · Score: 0

      We use Zabbix and love it. We monitor around 2,000 machines in our datacenters and remote locations. It's got great configurable maps and lots of goodies and will monitor about anything on the network.

  10. GKrellM by Areyoukiddingme · · Score: 5, Funny

    You can pry my GKrellM from my cold, dead hands!

    Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!

    *cough*

    1. Re:GKrellM by funkatron · · Score: 1

      Interesting! Do the mods not know what GKrellM is?

      --
      "Welcome to our world. We are the wasted youth. And we are the future too." Yes, I know these are stupid lyrics.
    2. Re:GKrellM by gmuslera · · Score: 1

      There are several desktop applets that show what happens in more "serious" (or at least massive) monitoring solutions. Nagstamon shows nagios alarms (and lets you ssh/vnc or even see nagios reports on problematic hosts right there), ZApplet shows Zenoss alarms/warnings too.

    3. Re:GKrellM by Areyoukiddingme · · Score: 1

      To be fair, one of the two mods knew it was +11 Funny... er I mean +1 Funny...

      One of today's articles has a thread deploring people who read the article, the summary, the title, or the posts before posting. So in my defense, I only read 4 words of the title and posted, so I saw Would Want Monitoring System. Naturally I thought of GKrellM.

      Now if I had read 4 different words, I'd have thought it was spam and deleted it. "What You Want Large"

    4. Re:GKrellM by Areyoukiddingme · · Score: 1

      So what you're saying is, GKrellM needs plugins for Zenoss and Nagios and whatever else?

      Damnit, now my +1 Funny is starting to sound practical. Quick, somebody add another +1 Funny! This has to be stopped!

    5. Re:GKrellM by Anonymous Coward · · Score: 0

      Actually, 5000 devices only requires 4 Krell meters.

  11. Nagios, Munin, GKrellm by Anonymous Coward · · Score: 0

    I've played with quite a few in the past. For your application, I'd stick with Nagios... the new version has plenty of scalability features. It has a pretty steep learning curve and just about all the configuration is text-based, but I've always found it well worth the investment in time. I currently use it to monitor services on only a couple hundred devices, and there are plenty of plugins to make it more useful. It's not great for creating and visualizing large 2D and 3D maps, but it has all the necessary hooks for it and with a bit of scripting you should be able to generate more useful views and reports of your farm.

    Corps seem to buy into the commercial HP Openview a lot, but no one I've talked to that uses it seems to like it.

    On a few of my servers, I also like to run Munin... it tracks and displays a little bit more information than Nagios, such as graphs of sensors, uptime, UPS stats, etc. It's come in handy on several occasions when Nagios had simply shown me that "the server went down", but the information from Munin showed that "the server room temperature started climbing up to 90F starting at 2AM".

    For real-time monitoring, I really like GKrellm, which has a server/client mode of operation. It wouldn't be practical to have it up all the time, but it would be sweet to set up the daemon and have a link from Nagios launch a gkrellm client to a remote server, where you can see the effects of anything you do in real-time (rather than waiting for Nagios to refresh in 5-10 minutes).

  12. The Dangers of averaging by Anonymous Coward · · Score: 5, Insightful

    MRTG does it right...most of the others do it wrong
    When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems
    So your 380Mbps peak that you had an hour ago is fine on today's graph
    But tomorrow, when you look at "yesterday's" graph...the peak is down to 100Mbps
    and next week, when you look at "last week's" graph...there's a little 50Mbps peak

    Damnit... I want to keep information on my peaks for capacity planning!
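    The effect is easy to reproduce. The following Python sketch is only an illustration with made-up sample values (one 380 Mbps spike in otherwise quiet traffic), not how any particular tool stores its data; it rolls the same samples up once with an average and once with a maximum:

```python
from statistics import mean

def consolidate(samples, bucket_size, func):
    """Roll up consecutive groups of `bucket_size` samples with `func`."""
    return [
        func(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]

# Hypothetical 5-minute readings in Mbps: quiet traffic with one short spike.
samples = [20, 20, 380, 20, 20, 20, 20, 20]

averaged = consolidate(samples, 4, mean)  # what an average-only rollup keeps
peaks = consolidate(samples, 4, max)      # what a max rollup keeps

print(averaged)  # the spike is diluted into the bucket average
print(peaks)     # the spike survives
```

    With these numbers the averaged series tops out at 110 while the max series still shows 380, so keeping a max rollup next to the average is what preserves the peaks for capacity planning.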

    1. Re:The Dangers of averaging by counterplex · · Score: 1

      Damnit... I want to keep information on my peaks for capacity planning!

      AVERAGE isn't the only archiving function you can use with rrdtool. For your purposes, you should create an additional RRA with an archiving function of MAX. http://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html#IRRA_CF_cf_arguments

      --
      $x = ($x * 10) % 10 >= 5 ? 1 + int $x : int $x
    2. Re:The Dangers of averaging by Anonymous Coward · · Score: 1, Informative

      RTFM !

      http://oss.oetiker.ch/mrtg/doc/mrtg-reference.en.html

      WithPeak

    3. Re:The Dangers of averaging by Anonymous Coward · · Score: 0

      Huh?

      I agree that you want to keep the peaks for capacity planning.

      But MRTG does averaging the same way every other tool does and loses the peaks.

    4. Re:The Dangers of averaging by Anonymous Coward · · Score: 0

      MRTG does it right...most of the others do it wrong
      When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems
      So your 380Mbps peak that you had an hour ago is fine on today's graph
      But tomorrow, when you look at "yesterday's" graph...the peak is down to 100Mbps
      and next week, when you look at "last week's" graph...there's a little 50Mbps peak

      Damnit... I want to keep information on my peaks for capacity planning!

      MRTG does it right indeed, especially with:

      WithPeak[_]: ymwd

    5. Re:The Dangers of averaging by JSC · · Score: 1

      You could try RTG. It's a non-averaging alternative to MRTG. I used it at a large telecom provider I used to work for to monitor several thousand circuits. I kept a year's worth of data on hand (MySQL database instead of RRD). It works VERY well. It takes a bit more configuration than MRTG, but if you want to keep NON-averaged data, it's a good choice.

      --
      Time's fun when you're having flies. - Kermit the Frog
  13. I would like by Anonymous Coward · · Score: 1, Funny

    Twitter client, facebook integration, google maps mashup.
    And a pony.

    Thanks

  14. Zenoss by KerberosKing · · Score: 4, Informative

    I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for Openview. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.

    You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.

    1. Re:Zenoss by ckdake · · Score: 1

      A +1 to ZenOSS from me. Quality product and commercial support is great.

    2. Re:Zenoss by NuclearRampage · · Score: 2, Interesting

      A little tough to set up new SNMP devices, I thought, but overall a great product. Even the free version gets you quite far.

    3. Re:Zenoss by rawler · · Score: 5, Informative

      ZenOSS may be great, but a word of warning. We've had 3 failed attempts at implementing it in our shop. What we tried to achieve was mainly host and service-monitoring, with some slight network-monitoring on the side. Nothing fancy, just some 20 hosts, maybe 30 network-devices, and a variety of services.

      One of the major parts we've found missing in most open-source solutions is proper event management (receiving syslog + SNMP traps, and applying some intelligence to them regarding flow control, dispatching, archival and that stuff). ZenOSS is, on paper and throughout the initial evaluation, one of the best open source tools to do this.

      However, during our three attempts to get it up and running, we always hit some major obstacle (usually after a while in operation) that forced us to start over from scratch. The problems were always in the same category: strange, unexplainable errors, often hard to reproduce, which made for a very flaky experience. Some of them were service checks showing both false positives and false negatives. In the last one, ZenOSS refused to import new SNMP MIBs, complaining about an IP address that could not be found anywhere in the config. Grepping eventually showed the IP was present only somewhere in the opaque Zope database, where it could not easily be removed, and we could not even find out what the address was for. (It was something auto-discovered in a remote network segment out of our control, but advertised throughout the routers.)

      So, while ZenOSS can do all kinds of things, and does a LOT of things really well, it's extremely complex and not on a solid foundation in all parts (for example, all network objects live in an inaccessible Zope database that the devs themselves recommend not touching, since that may upset things more). If you plan on implementing ZenOSS, I would not go without the support, which I assume is great, since there seem to be quite a few dark pits to fall into on your own.

      I don't know why we had so many obstacles and strange problems when others seem to have a smooth ride. One explanation may be what became the final nail in the coffin for ZenOSS in our deployment. When I started asking around about these problems (and ZenOSS has a really helpful community, no problems there), I realised that many users had run into similar problems, but their solution was to just keep daily backups and revert to a backup when it happened. For us, the monitoring data is the basis for a lot of third-party agreements, and losing even days' worth of monitoring and logging is completely unacceptable. We do back up everything, but only for rare disasters, and we must be able to rely on the monitoring system giving us a clear view through those disasters.

    4. Re:Zenoss by jon3k · · Score: 1

      Zenoss's commercial support prices are hilarious, I mean, literally, hilarious. The CHEAPEST support (silver) is $100 per managed host (including virtualized hosts) most expensive (platinum) is $180 per node. So your 5,000 hosts would be $500,000-$900,000 per year in support.

      Yes. Seriously.

      The other problem I have with Zenoss is that the reporting is basically nonexistent. It may sound like I'm being hyper-critical, but it's only because I've looked at Zenoss and I so wanted it to be the NMS for me (I particularly like the fact that it's both open source and written in python) but at this point I just don't think it's going to work.

      We use What's Up Gold from ipswitch right now, but we're only monitoring a few hundred hosts. It's slow, runs on windows, requires ms sql, but it's surprisingly full featured and gets the job done I suppose. Oh and it's $900.

    5. Re:Zenoss by KerberosKing · · Score: 1

      From the same page you linked:
      >** Volume pricing:Volume discounts, site licenses, and corporate-wide licensing are available for environments larger than 1,000 managed resources. Contact Zenoss Sales to learn more.

      So I call BS on the $500,000-$900,000/year number you gave.

    6. Re:Zenoss by LodCrappo · · Score: 1

      You missed the volume and site licensing options available for networks of 1000 nodes+.

      Their pricing seems in line with their competition; a quick search finds the following pricing for HP offerings:

      Network Node Manager, $6,000; OpenView Operation, $17,995; OpenView Internet Services, starts at $12,449 for 5 targets. Additional targets: 5, $2,038; 25, $10,207; 250, $65,160.

      I am not familiar with how their product line works, and I'm sure they also have volume licensing agreements for large customers, but using your same logic applied to what I could quickly glean from this article, it would seem HP's product would cost $6000 + $17,995 + $12,449 + (4995 / 250 * $65,160) = $1,338,340.80. So Zenoss is quite the bargain :)

      Silly of me to even do the math, since both of our numbers are very wrong.
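      For anyone who wants to poke at that back-of-the-envelope figure, here it is as a small Python sketch. The prices are just the list prices quoted above (not verified against HP), the constant and function names are invented for the example, and the 250-target pack is pro-rated the same naive way:

```python
# List prices quoted in the comment above (USD); not verified against HP.
NETWORK_NODE_MANAGER = 6_000
OPENVIEW_OPERATION = 17_995
INTERNET_SERVICES_BASE = 12_449   # includes 5 targets
PACK_OF_250_TARGETS = 65_160

def openview_list_total(targets):
    """Naive total: base products plus pro-rated 250-target packs."""
    extra_targets = targets - 5
    extra_cost = extra_targets * PACK_OF_250_TARGETS / 250
    return (NETWORK_NODE_MANAGER + OPENVIEW_OPERATION
            + INTERNET_SERVICES_BASE + extra_cost)

total = openview_list_total(5_000)
print(f"${total:,.2f}")  # $1,338,340.80
```

      Changing the target count or swapping in a discounted pack price shows how sensitive the total is to the per-target line, which is the part volume licensing would change.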

      --
      -Lod
    7. Re:Zenoss by jon3k · · Score: 1

      Are we comparing software prices to annual support costs? Those are ANNUAL costs for Zenoss. Even a 50% discount on 5,000 nodes would be a quarter of a million dollars for their LOWEST level of support.

      Next, Zenoss's offering is not comparable to OpenView, sorry. HP OpenView is a massively more mature and robust product, and is really in a totally different league than Zenoss.

    8. Re:Zenoss by Ranger+Rick · · Score: 4, Interesting

      And this is why we (OpenNMS) don't play the per-node game. It's not any harder to run OpenNMS when managing 1000 nodes than when managing 100; you only need to scale hardware appropriately. Per-node pricing is an artificial limitation.

      We also don't play the "you get a special price behind closed doors" game. Our support prices are public, fair, and the same for everyone -- and that's only if you need commercial support -- our prices are $0 if you don't need or want support.

      If you do the math, it's $0 for the software, plus $14,995/year for support for any number of nodes, and the software is 100% open-source and fully capable of replacing or exceeding OpenView. ;)

      --

      WWJD? JWRTFM!!!

    9. Re:Zenoss by hJordanH · · Score: 1

      I implemented Zenoss for application, systems and network monitoring of close to 1000 devices. We have the collectors distributed across each colocation, with multiple collectors in some colos. My project was so successful that my company's CTO committee implemented it across all of our other business units, and none of them have found anything that it cannot monitor. If there's something that you can't do, a plug-in can be written. It's replaced our inventory system, IPAM solution (lightly), and application, network and systems monitoring systems, and thanks to the "device class" architecture we've shortened deployment time for monitoring and enforced consistency in monitoring across the board. We do pay for support, which we've found to be a requirement.

    10. Re:Zenoss by Anonymous Coward · · Score: 0

      And yet they guarantee you'll save 50% or more on licensing compared to HP, CA, IBM and BMC. Sounds like those are the vendors with outrageous pricing. I guess the market will decide.

    11. Re:Zenoss by jon3k · · Score: 1

      Just for fun. Zenoss, highest level of support for 5k nodes is $900,000 per annum. Let's throw in a 25% discount which gives us $675K. So, in two years of Zenoss support you could buy the complete OpenView suite at full retail list prices. Let's assume the same discount rate for OpenView and we get right at an even $1M. So the ROI for purchasing OpenView (an obviously superior product) is only 17 months.

      The very concept that HP OpenView is cheaper than the support for an open source project literally makes my stomach turn.

    12. Re:Zenoss by hotfireball · · Score: 1

      Well, I was into bugfixing Zenoss and was not impressed by how it is developed and implemented. And it really sucks at performance. For example, you can find stuff in the code like: SELECT * FROM sometable just for selecting all the nodes (ouch!).

      Also it is on top of Zope-2 with all the consequences: you need ZEO for redundancy (don't try this at home) etc

    13. Re:Zenoss by hotfireball · · Score: 2, Insightful

      Zenoss also provided terribly wrong RPM packaging. I don't know how they are now, but exactly 1 year ago it was that wrong. For example, they could simply wipe out some files in /etc where your setup was already done, without any warning or notice. So it is always better to set it up from source.

      I've also looked at Zabbix and got the impression that it is sort of like a bicycle with square wheels. Same for Groundworks (a re-packaged Nagios) and Nagios itself. The only thing I find really worth paying attention to: OpenNMS. It also has its own gotchas, but so far it is better than the others.

    14. Re:Zenoss by LodCrappo · · Score: 1

      I wasnt able to find what support on Openview costs, it seems to be available but pricing is a big secret or I suck at google. or both.

      zenoss gets a head start with $0 vs $1.3 million (or 650,000 if we assume a similar 50% discount).

      then you add support and really we dont know how it all turns out unless someone knows support costs for openview, but dont get sick yet, i'd bet openview manages to keep a healthy lead.

      --
      -Lod
    15. Re:Zenoss by LodCrappo · · Score: 1

      well that assumes free support on openview. not really HP's style, especially with enterprise products. support on their sans is roughly 30% of purchase price per year, thats the only frame of reference i have. don't be sick on your shoes ;)

      --
      -Lod
    16. Re:Zenoss by jon3k · · Score: 1

      If we assume 50% for Zenoss we need to assume the same for HP, who is very aggressive on their pricing in general ($work is a 100% HP shop for systems, desktop and datacenter). In which case, you can buy HP OpenView outright for the price of one year of Zenoss support.

      Now, unless HP charges as much for support as they do for the software (obviously they don't) then the second year OpenView becomes cheaper than Zenoss. Which is just hilarious since HP OpenView is lauded constantly for being absurdly overpriced -- because it is, just like Zenoss's support prices.

    17. Re:Zenoss by LodCrappo · · Score: 1

      i just told some lies on a web form and got emailed some zenoss propaganda.

      zenoss licensing cost for 5000 devices is $350,000/yr
      (that much i'd guess is pretty true, its a zenoss marketing doc)

      they claim HPs solution costs $2,000,000 up front and $354,000 each year after the first (hp throws in the first 350k value with your $2mil purchase! call now and you'll get the miracle slicer. but wait, theres more).

      they go on to claim it costs over $1 million to implement HPs stuff but only $90k to do zenoss, blah blah whatever.

      well we know zenoss is stretching those numbers any way it can.. but how far is anybody's guess. at face value its $1,050,000 zenoss vs $2,708,000 hp at three years. even if its exaggerated by 50%, and assuming equal everything else you'd only match prices in year 4. Dont know what the lifecycle and upgrade costs are with hp's product, dont know if 50% is any where close to the amount zenoss is skewing the number.. its possible it could go either way but still i lean towards hp being somewhat to much more expensive. and supposedly zenoss will guarantee 50% savings, but you have to already have bought HP to qualify, so whats the point there i dont know.
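      for what its worth, the face-value three-year totals can be checked with a few lines of Python. This is just a sketch using the figures claimed in the marketing doc (so treat the inputs as claims, not facts); the helper name and its parameters are made up for the example:

```python
def cumulative_cost(upfront, annual, years, first_year_included=False):
    """Total spend after `years`; optionally the first annual fee is bundled."""
    billed_years = years - 1 if first_year_included else years
    return upfront + annual * max(billed_years, 0)

# Figures as claimed in the Zenoss marketing document; treat them as claims.
zenoss = cumulative_cost(upfront=0, annual=350_000, years=3)
hp = cumulative_cost(upfront=2_000_000, annual=354_000, years=3,
                     first_year_included=True)

print(zenoss)  # 1050000
print(hp)      # 2708000
```

      plugging in a discounted upfront or annual figure for either side is an easy way to see how far the claims would have to be off before the ordering flips.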

      --
      -Lod
    18. Re:Zenoss by jon3k · · Score: 1

      Those seem like at least realistic assumptions to me, at which point you have to wonder - which would you choose? The one with a decade of engineering from HP behind it or Zenoss? I've worked with Zenoss and I've seen OpenView and I made up my mind before I finished writing this post :)

    19. Re:Zenoss by Politas · · Score: 1

      How do you compare to Smarts InCharge's (http://www.nsai.net/products/incharge-asm.shtml) automatic root cause and impact analysis?

      --

      Politas

    20. Re:Zenoss by AtlantuX · · Score: 1

      SMARTS is cool stuff, but even intra-domain correlation is not working properly. Making it impossible to correlate OSPF/BGP notifications with failing hardware that was detected by another domain manager (e.g. the networkprotocolmanager) - resulting in not so good rootcause analysis. But anyways, RCA is a big problem when running over MPLS clouds, even for SMARTS.

    21. Re:Zenoss by Anonymous Coward · · Score: 1, Funny

      Ok, OpenNMS has a three-digit Slashdot userid, they win!

    22. Re:Zenoss by Ranger+Rick · · Score: 1

      Honestly, I'm not very familiar with SMARTS, but we don't do a ton of root cause analysis out of the box.

      We do have an integration with Drools to be able to do correlation and root cause analysis, but we don't have much in the way of default configuration for it at the moment.

      --

      WWJD? JWRTFM!!!

  15. Spiceworks? by BagOBones · · Score: 1

    http://www.spiceworks.com/
    Not sure how far it scales but I have played with it on some small installations, very easy to manage.

    I have used Cacti but never felt it was mature or robust enough for very large environments.

    SCOM (System Center Operations Manager) is what we are deploying now for our enterprise; however, I would be afraid to manage it on my own, as it is a large system unto itself, yet very powerful.

    --
    EA David Gardner -"... but the consumers have proven that actually what they want is fun."
  16. A couple of other options by AFresh1 · · Score: 3, Informative

    I use Nagios and some custom rolled scripts myself.

    For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga.

    Reconnoiter also looked pretty kewl, but they haven't released anything yet, but it looks like they are planning it to be very scalable.

  17. No humans being monitored! by Hurricane78 · · Score: 1

    That is what I would want! ^^

    --
    Any sufficiently advanced intelligence is indistinguishable from stupidity.
    1. Re:No humans being monitored! by cenc · · Score: 3, Funny

      I find a human monitoring system to be the most reliable. There is always someone to fire, if something goes wrong.

  18. OpsView by imemyself · · Score: 1

    I've really been impressed with OpsView. Can't say how well it scales on huge networks (but there are options for having multiple servers). It's based on Nagios, but it's a lot less of a pain to configure and has a pretty good web interface. The only thing I don't really like is its graphing functionality. I use Cacti for monitoring bandwidth/server load/etc. But for availability checking OpsView does a fantastic job. I'm using it to monitor maybe twenty devices, including Linux and Windows servers, and HP/Cisco network devices. I tried Zenoss as well, but it seemed awkward to work with. For instance, with Opsview/nagios it's easy to add a check to verify that a DNS server is correctly resolving a record in a particular zone. I remember it was going to be a pain to monitor some of the things I wanted to with Zenoss. Maybe I'm biased because I used plain old Nagios for a while before I tried OpsView and Zenoss.
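
    A minimal sketch of that kind of DNS check, written in the style of a Nagios plugin (exit code 0 for OK, 2 for CRITICAL). The function names are made up for illustration, and the default resolver here is the host's own resolver rather than a specific DNS server; querying one named server would need a DNS library, which is left out.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def system_resolve(name):
    """Resolve via the host's configured resolver (assumption: not a specific server)."""
    return socket.gethostbyname_ex(name)[2]


def check_record(name, expected_ips, resolve=system_resolve):
    """Return (exit_code, message) the way a Nagios plugin would report it."""
    try:
        got = sorted(resolve(name))
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return CRITICAL, f"DNS CRITICAL - {name} did not resolve: {exc}"
    if got == sorted(expected_ips):
        return OK, f"DNS OK - {name} -> {', '.join(got)}"
    return CRITICAL, f"DNS CRITICAL - {name} -> {got}, expected {sorted(expected_ips)}"


def main(argv):
    """Hypothetical CLI: check_dns_record.py NAME IP [IP...]; returns the exit code."""
    code, message = check_record(argv[1], argv[2:])
    print(message)
    return code
```

    A real plugin would end with sys.exit(main(sys.argv)) so that Nagios sees the exit code.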

    --
    Every time you post an article on Slashdot, I kill a server. Think of the servers!
    1. Re:OpsView by vevel · · Score: 1

      I would agree with your assessment on Opsview http://opsview.org/ . It is working well for me so far. I recently did a nearly painless build (via apt-get install blah) of it on a Debian box, and they also have a VM available.
      I'm not sure why the NMIS / MRTG combo doesn't do the trick for your trending / graphing needs -- I've used plain old NMIS http://www.sins.com.au/nmis/ (which opsview includes) to do a lot of the things I have done in the past with Cacti. If there's other stuff you're getting out of cacti these days, I'd be interested in hearing that. These are all basically frontends to RRDtool, if memory serves.

      Opsview has a clean (IMO) interface (no goofy Windows-like dropdown like groundworks), and does monitoring (agentless or agent-ful/agenty), trending, pseudo-useful but mgmt-pleasing network visualization (via nagviz), alerting, custom hoopla, etc.

      My additional need has been configuration management for network devices, which is where RANCID http://www.shrubbery.net/rancid/ comes in. Rancid also allows a lot of nice (expect-based) mass-configuration of network devices (e.g., changing snmp passwords globally). Command-line required. There is also a (somewhat weak) 'looking-glass' plugin that comes with NMIS (and I think opsview) so that you could tie in viewing of RANCID configs from the same NMIS/opsview dashboard.

      My only complaint with opsview at the moment is that the integration with MRTG and NMIS isn't very tight. You just click over to their dashboards. On the plus side, device/host configuration is shared, which is fantastic. (Also, you don't have to install them separately, which is actually a pretty big win.) Another good thing -- if you're talking 5000 devices, agents and distributed monitoring are there for you.

  19. GAS/Plexos by Anonymous Coward · · Score: 0

    Talk to these guys http://netfuel.com. Excellent client server monitoring, started out as a trading app monitoring tool and grew. Scales to thousands of nodes and has lots of options including several API's for integration.

  20. Nagios Might Work by hax4bux · · Score: 2, Interesting

    I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).

    If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.

    The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.

    MRTG also integrates w/Nagios, which can be useful.

    Good luck.

    1. Re:Nagios Might Work by Omish-Man · · Score: 1

      In addition to rolling your own Nagios UI, there is Nagvis http://www.nagvis.org/ for Maps. I was using a much earlier version, but the latest versions look much easier to implement and very promising. I have also been known to use Cacti with Nagios watching the RRD files every few min so I can still get my performance monitoring while letting Cacti do all the work. I had Nagios check the Cacti RRDs because Nagios has better options for communicating events.
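
      As a rough sketch of the simplest form of such a check, the snippet below only tests whether an RRD file is still being updated, using its modification time. Reading the stored values would need the rrdtool bindings and is not shown. The thresholds are assumptions based on a 5-minute polling cycle, and the function name is made up.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_rrd_freshness(path, warn_seconds=600, crit_seconds=1200, now=None):
    """Return (exit_code, message) based on how long ago the RRD was written."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        return UNKNOWN, f"RRD UNKNOWN - cannot stat {path}: {exc}"
    if age >= crit_seconds:
        return CRITICAL, f"RRD CRITICAL - {path} last updated {age:.0f}s ago"
    if age >= warn_seconds:
        return WARNING, f"RRD WARNING - {path} last updated {age:.0f}s ago"
    return OK, f"RRD OK - {path} updated {age:.0f}s ago"
```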

    2. Re:Nagios Might Work by Techman83 · · Score: 1

      And instead of rolling your own UI you could try Centreon. It's a GUI frontend for Nagios.

      --
      # cat /dev/mem | strings | grep -i cat
      Damn, my RAM is full of cats. MEOW!!
  21. Argus! by Jeremy+Kister · · Score: 1
    --

    Jeremy Kister
    http://jeremy.kister.net./

  22. ZenOSS all the way by Midnight+Warrior · · Score: 5, Interesting
    We use ZenOSS exclusively at work and have enjoyed every minute of it. Pros include:
    • 2D map with status of all nodes or submaps, organized by network
    • Application monitoring, with more advanced maps available for purchase (Oracle, JBoss, Cisco) for those things you already paid a lot of money for
    • Performance monitoring via SNMP or other data sources using RRDtool internally which includes graphs linked to each other during zoom in/out or panning
    • Nagios plugins already do some of the heavy lifting
    • Built-in support for watching Windows servers (any metric accessible via WMI)
    • Access control using at least LDAP and Active Directory
    • Secondary data collectors for those networks which are too big for just one central source
    • Highly customizable through Python
    • It has so, so much more than pathetic commercial solutions like OpenView

    Cons:

    • You have to keep your eye on the back end database
    • It still takes a long, long time to tune it to remove noise events
    • If you don't know Python, it can be tough in a few places
    • Proper support is not cheap
    1. Re:ZenOSS all the way by glsiii · · Score: 1

      There are plenty of tweaks that help speed things up greatly, including disabling the section of code that calculates overall system availability (if that's not important to you). We dump all of our syslog into ZenOSS, so the tables got quite large before the retention rules kicked in.

    2. Re:ZenOSS all the way by isorox · · Score: 1

      * 2D map with status of all nodes or submaps, organized by network

      That's not a "pro". My organisation isn't a network shop, and 99% of the faults we have are nothing to do with networks, either physical or virtual. More use would be a map of the building showing the max/min/average temperature in each apps room, but ultimately I don't really care about things that are normal and working, only things that are abnormal or not working.

      Access control using at least LDAP and Active Directory

      Any system using a webserver should be able to do that. Most proprietary ones we've tried can't; they have their own user management (despite running off IIS).

  23. FreeNATS by Anonymous Coward · · Score: 0

    Some basic functionality, maybe not at the development level you're looking for yet.

  24. 5000 seat network? by Anonymous Coward · · Score: 0

    Shurely Shome cash is available to actually PAY somebody to sort this out, rather than ask for free help on /.? Hic - burp.

    1. Re:5000 seat network? by Anonymous Coward · · Score: 0

      "Shurely Shome cash is available to actually PAY somebody to sort this out"

      That's exactly what they are paying the questioner for, and he is doing it.

  25. For me, that's easy by neokushan · · Score: 1

    I want lots of buttons and dials! And flashing lights!

    --
    +1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
  26. The mistake by vlm · · Score: 3, Insightful

    The mistake is trying to monitor thousands of devices on a 2-D map. It'll look pretty to the suits, but be useless for the users. Nothing but endless slow clicky clicky clicky.

    Give them a text screen of what's currently down ... that'll work.

    --
    "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
    1. Re:The mistake by Krneki · · Score: 1

      No, a text screen doesn't give the idea to the help desk of what zones are affected.

      I want the help desk to know what the problem is before our clients call us.

      --
      Love many, trust a few, do harm to none.
    2. Re:The mistake by Anonymous Coward · · Score: 0

      First, nothing you can do will mean that the helpdesk has any idea.

      Second, management won't like you letting the helpdesk know when things are broken. Telling customers that things are broken costs money.

    3. Re:The mistake by Krneki · · Score: 1

      I don't agree with you. When a customer calls in with a problem, it is nice for a change to know what he is talking about.

      --
      Love many, trust a few, do harm to none.
    4. Re:The mistake by Tdawgless · · Score: 1

      I definitely agree with this. I work for a hosting provider that has about 67,000 servers. We use a highly modified version of Nagios to monitor them (at customer request, so only a majority of the servers are being monitored, not all).

    5. Re:The mistake by t0rkm3 · · Score: 1

      Regional naming conventions with a simple map of the important nodes accomplishes this task easily in a network of 50,000+ nodes.

  27. MOM by aquilah · · Score: 1

    I've only used MOM but for what it's worth the diagramming capabilities are much improved with the new visio plugin. Previously you could export your diagrams from OpsMgr to visio, but with the new plugin the visio diagrams reflect live health state. You can also create whatever diagram you want in visio and then tie it to monitored objects living in OpsMgr (for example rack diagrams)

  28. Bling by Anonymous Coward · · Score: 0

    "The final product must be very easy to understand as it will be used also by help support to diagnose problems during the night".

    If you can develop a monitoring solution that night time support personnel can understand to diagnose a problem quickly and properly, I am going to nominate you for a peace prize. BTW, give it lots of rapidly updating graphs and eye candy, you know, bling the sh1t out of it. Management types love that.

  29. Pay $20k to stat by Anonymous Coward · · Score: 0

    For the hassle of developing your own solution, you could just cough up the $20k for a statseeker box (hardware and software included in that price too) and then you can monitor up to 200,000 interfaces with ease. We poll 186,000 interfaces twice every 60 seconds and we have 2 years of data taking up only 17 gig of space, and the box is still speedy. I've used nagios, cacti and the like for years and they are great for smaller deployments, but statseeker smokes them in almost every aspect. I know they have upgraded options if you need more than 200K interfaces, but so far we haven't hit the limit.

  30. Intellipool Network Monitor by Anonymous Coward · · Score: 0

    Have you checked out Intellipool Network Monitor? They've got a new version coming out (version 4, I believe). It has maps with drill down and all that stuff, and the distributed edition can distribute workload over several gateways, so monitoring large scale networks is not an issue.

    http://www.intellipool.se/forum/index.php?automodule=blog&blogid=1&showentry=170

  31. Similar Slashdot thread by Cato · · Score: 2, Informative

    Here's a similar thread from a while back that covers most of the options: http://linux.slashdot.org/article.pl?sid=07/03/05/1812247

    1. Re:Similar Slashdot thread by Krneki · · Score: 1

      Thanks, I'll check the article.

      --
      Love many, trust a few, do harm to none.
    2. Re:Similar Slashdot thread by Anonymous Coward · · Score: 0

      You're hoping to slashdot OpenNMS again?

  32. Roll your own... by hofmny · · Score: 2, Interesting

    I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.

    I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.

    The system did push and pull operations. First, the system was built in PHP.
    In order to push commands to the servers I used the PEAR SSH2 class for communication when it became stable. Another option (and what I did back in 2003) was to use exec and other command line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command line output from PHP "true" rootly powers. The problem was I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real time input/output, so I designed the system to work by creating an SSH2 key pair on my master monitoring server and put its public key on each of our external servers for passwordless SSH.
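
    The original was PHP; as an illustration only, here is a Python sketch of the same push idea over key-based SSH. The function names, the key path and the "root" default are assumptions. BatchMode=yes makes ssh fail rather than prompt for a password, which is what a non-interactive push needs.

```python
import subprocess


def build_ssh_command(host, remote_command, key_path, user="root"):
    """Build an argv list for a non-interactive, key-based ssh call."""
    return [
        "ssh",
        "-i", key_path,             # private key of the monitoring server
        "-o", "BatchMode=yes",      # fail instead of prompting for a password
        "-o", "ConnectTimeout=10",
        f"{user}@{host}",
        remote_command,
    ]


def push(hosts, remote_command, key_path, runner=subprocess.run):
    """Run one command on every host; return {host: (exit_code, stdout)}."""
    results = {}
    for host in hosts:
        proc = runner(
            build_ssh_command(host, remote_command, key_path),
            capture_output=True,
            text=True,
        )
        results[host] = (proc.returncode, proc.stdout)
    return results
```

    The runner parameter exists only so the loop can be exercised without a real network.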

    The pull part of the system simply had a PHP script running on a cron per server, that would deliver information about the health of the server, its running processes, etc, to the main SMS server every 5 minutes. All load activity for all servers was logged as well to MySQL. The push operations were used to update those scripts, as well as restart Daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to where we could perform a FULL Database Update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc, in under an hour. We would then monitor the servers using the SMS's main screen which showed real time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to rollback the update, that was a simple mouse click away too.
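
    A comparable per-server reporting script might look like the Python sketch below (again, the original was PHP). The payload fields and the server URL are invented for illustration, and os.getloadavg() is only available on Unix-like systems.

```python
import json
import os
import socket
import time
import urllib.request


def collect_health():
    """Gather a small health snapshot of the local machine (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "load": {"1m": load1, "5m": load5, "15m": load15},
    }


def send_report(url, report, timeout=10):
    """POST the snapshot as JSON to the central server (URL is hypothetical)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(report).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

    Run from cron every 5 minutes, a script like this would call send_report(central_url, collect_health()).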

    I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.

  33. Things that go "ping"? by nine-times · · Score: 0, Troll

    What would I want in a monitoring system? The first thing that pops into my head is "lots and lots of knobs." The kind where when you turn them you get a nice satisfying click. And blinking lights. Lots of switches. Things that go "ping" at regular intervals would be nice. Oh! And a nice big screen that says, "All systems nominal" all the time.

    1. Re:Things that go "ping"? by Anonymous Coward · · Score: 0

      What would I want in a monitoring system? The first thing that pops into my head is "lots and lots of knobs." The kind where when you turn them you get a nice satisfying click. And blinking lights. Lots of switches. Things that go "ping" at regular intervals would be nice. Oh! And a nice big screen that says, "All systems nominal" all the time.

      You're in the wrong field. You want to go into audio, so you can get your hands on one of these

  34. How about SolarWinds Orion? by dakaix · · Score: 1

    This doesn't seem to have been suggested already, but we use SolarWinds Orion. It's cheaper than many of the big systems, such as HP OpenView - and much simpler to use and operate.

    The basic Orion package, which you can get for $2000 for up to 100 servers, will pull the usual CPU/RAM/Disk/Network statistics via SNMP. Built in is a mapping engine, that allows you to take a network map, and drop active elements onto it for live interfaces and device information. In a NOC environment, you can show this on a screen and it'll even sound an alarm when a system Alert fires through the website.

    You can then bolt on additional modules, such as their Application Performance Monitor. It has ready to use templates for common business applications: Exchange, Apache, IIS etc. You can also create your own, mixing SNMP, WMI and User Experience monitors. User Experience monitors, for example, allow you to actively poll HTTP/FTP/DNS/SMTP/IMAP/POP etc. services to ensure they are not only UP but responding as they should to requests.

    For scaling, you can tack on Additional Pollers to spread polling load across them. You can also use hot-standby pollers to resume the work of a failed poller.

    Just my 2 cents, and not a corporate plug - just a very content user!

    1. Re:How about SolarWinds Orion? by Tarwn · · Score: 1

      I was going to suggest ipMonitor (which Solarwinds acquired a few years ago). They just changed the licensing so it's $1000-$2000 for an unlimited number of monitors. The system has a basic mapping engine, NOC environment, sound/email/pager/exe/etc alarms, complex SNMP, WMI, and SQL alarms for everything from basic up/down to user experience types of alarms, etc. I never had to look into distributed pollers and I don't think it supports that. The biggest con I ran into was that the reporting engine seemed slow and there was no way to get at the raw data to export it out to other tools.

      Other pros include wizards (enter an IP to scan and it will suggest a list of monitors for that device by scanning for SNMP, WMI, etc), ability to create dependency chains so that you wouldn't receive alerts from 5 devices when their shared switch to the backbone went down (you would just receive the ones for the switch), ability to create "Smart" groups which were basically dynamic groups that include all devices/monitors that meet a set of search criteria, etc.

      Not sure how it scales, but we had this running against 100+ servers and network switches from a little virtual server w/ maybe 1GB of RAM and it didn't seem to be hurting. We also used Cacti for another perspective into traffic flows and such, to give us another dimension to use when troubleshooting (monitoring + flows + logs).

      --
      Whee signature.
  35. Pandora FMS by guruevi · · Score: 1

    As one of the core devs and large user, I can tell you it scales well, develops easy and has a lot prefab. The system does everything you're asking for. Let me know if you need help or paid support.

    --
    Custom electronics and digital signage for your business: www.evcircuits.com
  36. not sure if this is helpful, but... by sneakyimp · · Score: 1

    I'm a software developer and, sadly, my knowledge of hardware systems isn't always what it should be. When I write an application to run on a server and it starts to get slow, I want to know where the bottleneck is. Is my application CPU-bound? I/O-bound? Memory-bound? Do I need more memory? Faster storage? More cores or faster processor speed? Is it the network that's causing the problem? I can usually figure this out using various linux command-line programs like netstat and top and all that, but I would sure love a big fat GUI to make it more graphic. I found something like this once and couldn't remember what it was called. It required all kinds of diagnostic utilities be manually installed.

    Ideally, you could view a machine and get some quick idea of where the bottlenecks lie. Maybe that's asking a bit much, but the closer you can get to a single control panel where I could see see all my machines in a list with a status indicator and then drill down machine-by-machine, the happier I would be. It would be even cooler if the machines could contact me when they experience times of overload so that I could get a feel for when the trying times are so I can watch them more closely. I'm imagining a daemon that runs on each server and an admin gui that can speak to that daemon somehow. It would also be nice to have hooks so that I can easily report performance profiling information to the GUI from within my application.

    The Activity Monitor utility found on Macs is pretty close to what I'm imagining.

    1. Re:not sure if this is helpful, but... by Krneki · · Score: 2, Informative

      I use Cacti for this and it's fantastic.

      --
      Love many, trust a few, do harm to none.
    2. Re:not sure if this is helpful, but... by neurovish · · Score: 1

      sysstat will give you the data you're looking for, and kSar will put it into a GUI.
      You won't need all kinds of diagnostic utilities to be manually installed.

  37. JMX Support by Cyberax · · Score: 1

    What I'd like to see is a good monitoring support for JMX-capable Java services.

    It'd be nice to set up an alarm based on time spent in garbage collector in a JVM running our application, for example.
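
    The alarm logic itself is small. A sketch, assuming the monitoring system already samples the cumulative GC time in milliseconds that a JVM exposes through its GarbageCollectorMXBean (CollectionTime); the function names and the 10% threshold are illustrative:

```python
def gc_time_fraction(prev_ms, curr_ms, interval_seconds):
    """Fraction of wall-clock time spent in GC between two counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The counter is cumulative; a smaller value means the JVM restarted.
    delta_ms = curr_ms - prev_ms if curr_ms >= prev_ms else curr_ms
    return delta_ms / (interval_seconds * 1000.0)


def gc_alarm(prev_ms, curr_ms, interval_seconds, threshold=0.10):
    """True when GC consumed more than `threshold` of the sampling interval."""
    return gc_time_fraction(prev_ms, curr_ms, interval_seconds) > threshold
```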

    1. Re:JMX Support by Intelopment · · Score: 1

      Cyberax, Check out dynaTrace. They have just what you're talking about. Deep dive into the JVM (or CLR).

    2. Re:JMX Support by glsiii · · Score: 2, Informative

      You certainly want to check ZenOSS out then. Our instance monitors JVMs across our deployment for everything from heap size to open file descriptors to uptime for the jvm specifically-- all of which can be alerted on if desired.

    3. Re:JMX Support by Anonymous Coward · · Score: 0

      OpenNMS does that.

  38. Nagios, Munin, GKrellm by rwa2 · · Score: 1

    I think Nagios should provide a good start.. they've recently added a lot of scalability features. Though it has a high learning curve and all of its configuration is done in text, I've always found it worth the time and effort. I currently use it to monitor services on a couple hundred machines.

    Munin is a bit simpler, but I like the graphs it provides which occasionally are more useful than the data Nagios provides. In some cases, Nagios might tell me that a server went down, but I'd look at Munin and see that the server room temperature spiked to 90F before then. Also it's neat to see the uptime graphs for the year.

    While it might not be practical to use GKrellm all the time, I'd find it useful for real-time feedback. You might set something up where you can launch a gkrellm client to a server of interest while you're working on it. Then you can see the effects of things you do without waiting for Nagios to refresh in 5-10 minutes.

  39. Hobbit+Cacti+Smokeping by adary · · Score: 1

    That is the solution that I have implemented for our little environment that consists of about 50-ish Solaris (8 and 10) servers, 80-ish Windows servers, about 500 Linux servers, and 40-odd Cisco switches. Hobbit handles all host monitoring: availability, services, and a bunch of custom scripts written for it to check various aspects of our HPC grid, plus the SMS sending through an old Nokia connected to the comm port of a Solaris box. Smokeping is there to check latency, and Cacti primarily for network traffic volume, plus a custom module for FlexLM licenses. Works like a charm.

  40. Yep. This is the one. by Anonymous Coward · · Score: 0

    It requires a substantial investment in learning how to set it up and how to use it, but then so do the big commercial products like OpenView and Tivoli.

  41. The Dude by Anonymous Coward · · Score: 0

    The Dude is what we have been toying with lately.

    http://www.mikrotik.com/thedude.php

  42. SNMPc by Stile+65 · · Score: 1

    It's *not* open-source, but it IS inexpensive. When I worked at a NOC, we used it to monitor hundreds of routers, switches, mainframes, Tandem systems, UNIX boxes, etc. It takes SNMP traps and displays them graphically on a 2D map, and the 2D map is very nicely implemented. You can have your top level view made up only of groups of devices, so if a group goes red you double-click that group to view its members and see which device actually has the error. IIRC, you can nest groups, so it ends up being a fairly scalable solution when you talk about screen space.
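
    The nested-group behaviour described here is a status rollup: a group shows the worst state of anything inside it. A small sketch of that idea follows; the data layout and colour names are assumptions for illustration, not SNMPc's actual model.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def rollup(node):
    """Return the worst status in a tree.

    A device is a status string; a group is a dict mapping names to
    devices or further groups. An empty group counts as green.
    """
    if isinstance(node, str):
        return node
    worst = "green"
    for child in node.values():
        status = rollup(child)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst


# Hypothetical top-level map with nested groups.
network = {
    "HQ": {"core": {"rtr1": "green", "sw1": "red"}, "edge": {"fw1": "green"}},
    "Branch": {"rtr9": "yellow"},
}
```

    With this layout the top-level view only needs rollup(network["HQ"]) and rollup(network["Branch"]); drilling into a red group repeats the same call one level down.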

    --
    I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
  43. I use Nagios but also recommend OpenNMS by WML+MUNSON · · Score: 1

    I use Nagios, but on a smaller scale than what you describe. I love the system, but I would imagine it being difficult to maintain on a larger scale. Nagios itself requires manual configuration unless you use a separate front-end like Centreon, which is also far from perfect.

    A friend of mine has been toying with OpenNMS for the last few months, and he's pretty happy with it although he reports that it's still got some minor issues that need to be worked out. It's FCAPS compliant, and I get the impression that it might be the better option for handling a large installation. There's a new version scheduled for release soon, so we'll see what that brings to the table.

    There's also recently been an announcement of a Nagios fork, scheduled for release sometime around October. I forget the site or project name but I'm sure a bit of Googling will locate their site for you.

  44. Wikipedia chart (from hell?) and reading rec by vevel · · Score: 1

    This sounds like the perfect opportunity to harness the power of app partisans to fix the wikipedia article comparing monitoring software. See http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems . Some good info there. And probably bad info. But certainly has a good list of applications. Also, if you like nagios (and he seems to me to be fair to a lot of packages, including ossim), you might check out some of David Josephsen's articles (or Nagios book), etc.. His site is http://www.skeptech.org/ . A decent design article is here -- Best Practices for Designing a Nagios Monitoring System -- http://www.informit.com/articles/printerfriendly.aspx?p=705685 .

  45. I Hate War Rooms by afabbro · · Score: 4, Interesting

    I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.

    What you want in large-scale monitoring is:

    • The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".
    • I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.
    • I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.
    • I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.
    • Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.
    • I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.
    • I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.
    • I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.
    • I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.
    • I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.

    Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
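
    Two of the items above, timed escalation and "notice a problem once", reduce to very little logic. A sketch under assumed names (the contact chain, the 15-minute step and the class are all illustrative, not taken from any particular product):

```python
def who_to_page(minutes_open, acknowledged, chain, step_minutes=15):
    """Pick the contact for an unacknowledged problem; None once acknowledged."""
    if acknowledged or not chain:
        return None
    # Move one step down the chain per interval, stopping at the last contact.
    step = min(minutes_open // step_minutes, len(chain) - 1)
    return chain[step]


class ProblemTracker:
    """Remember open problems so each one pages once, not on every check."""

    def __init__(self):
        self.open_since = {}

    def report(self, key, minute):
        """Return True only the first time a problem is seen."""
        if key in self.open_since:
            return False
        self.open_since[key] = minute
        return True

    def resolve(self, key):
        """Forget a problem so a later recurrence pages again."""
        self.open_since.pop(key, None)
```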

    --
    Advice: on VPS providers
    1. Re:I Hate War Rooms by pudding7 · · Score: 1

      Hell yeah. Build it, I'll come.

    2. Re:I Hate War Rooms by Anonymous Coward · · Score: 0

      i think you forgot the pony..

      but seriously, this sounds great, but open source exists because some people make the connection between "i want X" and taking action to make X happen. What are you willing to do to make this happen? Seems like some of the existing open source projects have a framework that could provide these things.. maybe spend a few weeks time coding (or a few weeks paychecks sent to someone else) and you'll have some of this. Find a few likeminded folks and you might have the whole list checked off in some months time.

      or are you a "free as in beer" kind of oss user, spending time listing things wrong with oss but doing nothing to fix them?

    3. Re:I Hate War Rooms by Anonymous Coward · · Score: 0

      I live in Kansas and "X" is always down. Or at least after I threw out my O'Reilly books.
      I so used to like "nohup xterm -sl 400 -bg blue -fg goldenrod &"

    4. Re:I Hate War Rooms by Krneki · · Score: 1

      This is solved by defining dependency.

      And I agree, a spam of 100 warnings in 1 minute doesn't help anybody. Luckily most of the monitoring solutions I have implemented allow you to filter this nonsense.

      --
      Love many, trust a few, do harm to none.
    5. Re:I Hate War Rooms by lems1 · · Score: 1

      For a second I thought this was going to end with "and this is why we use Nagios". A properly configured Nagios system can do what you mention here. Heck, you can combine it with Cfengine and have a self-healing network environment.

      I've done this very solution you mention with Nagios. Granted, it's not easy to configure at first. However, it's free software! We could make it so it sucks less to configure it and release the code back to the community.

      --
      This sig can be distributed under the LGPL license
    6. Re:I Hate War Rooms by Anonymous Coward · · Score: 0

      And just what mythical product is it that does all of this?

    7. Re:I Hate War Rooms by rossz · · Score: 2, Informative

      The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".

      When you have your parent/child relationships and your dependencies set up properly, Nagios does this very well. A properly configured Nagios system will alert you only for that switch that died, not for the 200 services behind that switch that you can't reach.
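
      A simplified model of that idea, not Nagios's actual implementation: a host that fails its check while all of its parents are also not up is treated as unreachable, so only the root cause is reported as down. Host names and function names are made up, and the parent graph is assumed to have no cycles.

```python
def classify(hosts_up, parents):
    """Map each host to 'UP', 'DOWN' or 'UNREACHABLE'.

    hosts_up: {host: bool} raw check results.
    parents:  {host: [parent, ...]}; hosts without an entry have no parents.
    """
    state = {}

    def resolve(host):
        if host in state:
            return state[host]
        if hosts_up[host]:
            state[host] = "UP"
        else:
            host_parents = parents.get(host, [])
            # A failed host behind failed parents is a symptom, not a cause.
            if host_parents and all(resolve(p) != "UP" for p in host_parents):
                state[host] = "UNREACHABLE"
            else:
                state[host] = "DOWN"
        return state[host]

    for host in hosts_up:
        resolve(host)
    return state


def hosts_to_alert(hosts_up, parents):
    """Only hosts that are DOWN (root causes) should generate an alert."""
    states = classify(hosts_up, parents)
    return sorted(h for h, s in states.items() if s == "DOWN")
```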

      I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.

      There's a "cluster" plugin for Nagios available, but I consider it a hack for something that should be inherently supported.

      I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.

      Nagios could be improved here. I can set it up to fire off a script when a hard failure is detected and do something different, e.g. HUP apache, but there isn't a way to directly configure alternative test options.

      I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.

      You can configure Nagios for how many times you want an alert to fire.

      Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.

      Supported in Nagios.

      I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.

      That would require specialty software to be running on the system being monitored. Not exactly feasible when dealing with every type of equipment imaginable.

      I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.

      Every monitoring system I've used supports this.

      I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.

      That would require some kind of custom hook into your ticketing system. The monitoring system needs to have an open API for injecting commands so that anyone could write their own script. I know Nagios can do this. I don't know about others.
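
      A sketch of what such a hook might emit, assuming the ticketing system can call a script. The function name and the ticket values are hypothetical; the line it builds follows the Nagios external command SCHEDULE_HOST_DOWNTIME, which is written to the command file named by command_file in nagios.cfg.

```python
import time


def schedule_host_downtime(host, start, duration_s, author, comment, now=None):
    """Build one Nagios external command line for a fixed downtime window.

    Field order: host;start;end;fixed;trigger_id;duration;author;comment
    """
    now = int(time.time()) if now is None else int(now)
    end = start + duration_s
    fixed = 1       # window is exactly start..end
    trigger_id = 0  # not triggered by another downtime entry
    return (
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"{fixed};{trigger_id};{duration_s};{author};{comment}\n"
    )
```

      The hook would append the returned line to the command file; where that file lives depends on the local installation, so the path is left out here.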

      I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.

      Nagios does reports, but I feel they have lots of room for improvement.

      I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.

      This would require information you suggested above regarding the API.

      I'm a big fan of Nagios, but realize it has room for improvement.

      --
      -- Will program for bandwidth
    8. Re:I Hate War Rooms by turbidostato · · Score: 2, Informative

      "What you want in large-scale monitoring is:"

      Let's see:

      "The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A."

      Nagios does this. I know, I configured mine that way.

      "Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down"."

      Nagios can do that, though I never deployed it that way.

      "I want my monitoring solution to understand HA and service degradation."

      Nagios does this. I know, I configured mine that way.

      "I want programmable rules about what happens when X is down or Y is down."

      I don't know exactly what you mean, but if you mean the ability to automatically try to recover from a failing state, I think Nagios can do that. Not that I would want to go down that path: I'm quite averse to "self-healing" systems and prefer early and meaningful alerts, and then applying brains to it.

      "I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc."

      Nagios does this. I know, I configured mine that way.

      "I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards."

      Nagios does this. I know, I configured mine that way.

      "Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice."

      Nagios does this. I know, I configured mine that way.

      "I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed."

      I don't know exactly what you mean, but I know you can extract from Nagios the very same information it is managing through, e.g., a socket, or push it to a database for further inspection. Apart from this, it's open source, you know.

      "I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them."

      Nagios does this. I know, I configured mine that way (well, to auto-lift the "I'm working on it" after Nagios detects the service online again... and it automatically closes the previously opened ticket on our service desk too).

      "I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out."

      Nagios does this. I know, I configured mine that way.

      "I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice."

      Nagios does this. I know, I configured mine that way. It is the "little charts and graphs" where Nagios is quite lacking, in fact.

      "I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately."

      Nagios does this. I know, I configured mine that way.

      "These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas"

      You didn't research this very thoroughly. Please note that I'm not affiliated with Nagios in any way; I'm just a satisfied user.

    9. Re:I Hate War Rooms by phish · · Score: 1

      You should check out Hyperic (www.hyperic.com). We designed it with a lot of this in mind.

      -javier

    10. Re:I Hate War Rooms by RobiOne · · Score: 1

      Yes, there is a solution to this problem. BixData. http://www.bixdata.com/

      This is the next generation monitoring solution. Self-installing,organizing,adjusting,correcting,tuning, P2P, 3D, n-cube OLAP, scalable, ...

      It's like something they'd have on StarTrek. A huge advancement in management science. I had a chance to use it a while ago, and I can only say you have to try it to experience the 'awesome'. Hardly compares to the current day popular systems people are still struggling with.

      There's even a free version community edition for 30 hosts.. no registration required either, and they love critical feedback. So don't forget to give them some.

      I'm not affiliated with BixData, just love the efficiency and thoughtfulness.

      --
      -- Robi
    11. Re:I Hate War Rooms by Anonymous Coward · · Score: 0

      Intelligent monitoring of applications/services instead of plain hardware bits is imho the next evolution/holy grail of monitoring, and currently a vastly unexplored area.

      We had exactly the same problem when looking at a monitoring solution for a major airport, and were pretty disappointed to find out how lacking all the existing tools were in setting up nice looking charts that the drones in support could understand without having to be trained beforehand on every single application's internal data flow.

      The ultimate goal was for the drone to be able to wake up only the appropriate dba/coder/net guy in the middle of the night and not all of them at once, plus have a clear understanding of which red icon means 'move your ass now or the ceo will call in 5 mins' and which one means 'just wait until a new spare part is in, as no one cares really'.

      We ended up running a non oss solution from xech.it

    12. Re:I Hate War Rooms by amorsen · · Score: 1

      Nagios is dead slow during large outages. If it loses contact with a few hundred devices (out of a few thousand monitored), it will be 15 minutes at least before it has figured out the scope of the problem, and the same again when the problem is solved. Throwing hardware at the problem doesn't help.

      --
      Finally! A year of moderation! Ready for 2019?
    13. Re:I Hate War Rooms by WuphonsReach · · Score: 1

      Which version of Nagios? I thought part of v3 was that checks could be done in parallel. (Or maybe I need to go back and read the documentation.)

      (My only complaint with Nagios at the moment is convincing it to play nicely in a SELinux enforcing environment.)

      --
      Wolde you bothe eate your cake, and have your cake?
    14. Re:I Hate War Rooms by amorsen · · Score: 1

      V3.0.4 currently

      --
      Finally! A year of moderation! Ready for 2019?
    15. Re:I Hate War Rooms by virex · · Score: 1

      I actually have most of this with Nagios. We are monitoring over 400 nodes with over 1200 services. I set up a SOAP Perl script to create a ticket in our Numara FootPrints support database. From there we use FootPrints escalations to email after 15 minutes of downtime, and again after 1 hour. It will also email when the ticket is closed (host is back up). This allows us to do complex associations of users (say, Bill gets all email issues, Tom gets all issues at this location, etc.). It also allows us to do some very complex reports and graphs using the FootPrints interface.
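
      The 15-minute and 1-hour escalation steps can be sketched roughly as below. This is plain Python, not FootPrints or Nagios configuration, and the recipient names and function name are made up for illustration.

```python
# Hypothetical escalation ladder: (seconds of downtime, who gets emailed).
ESCALATION_STEPS = [
    (15 * 60, "first-line"),
    (60 * 60, "manager"),
]


def due_notifications(down_for_s, already_sent):
    """Return recipients whose step has been reached and who were not yet notified."""
    return [
        who
        for after_s, who in ESCALATION_STEPS
        if down_for_s >= after_s and who not in already_sent
    ]
```

      Tracking who has already been notified is what keeps each step from firing more than once per outage.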

    16. Re:I Hate War Rooms by ckaminski · · Score: 1

      +1 I love this post. This is exactly what I was striving for with Nagios at my last installation. Smart enough to know that the router on this side of the frame circuits was DOA, and to stop bitching about the FTP, Web and email servers on the other side of the circuits.

      Never really got there, though.

    17. Re:I Hate War Rooms by afabbro · · Score: 1

      Sounds like a lot of people think Nagios is capable of some of the things I outlined. I haven't looked at it in quite a while, so that could certainly be the case. Great news if that's so.

      Wait...I think I just witnessed a constructive exchange of information on Slashdot. What the heck is wrong with us!?!? Let's fix that ASAP and get back on track: Emacs sucks and only girly men use it.

      --
      Advice: on VPS providers
  46. No reliability issues? by Colin+Smith · · Score: 1

    Which revision?

    I tried it for a couple of months, and rather liked it, but it'd simply stop monitoring stuff, triggers wouldn't fire reliably, etc.

    --
    Deleted
    1. Re:No reliability issues? by TooMuchToDo · · Score: 1
      1.6

      Haven't had any monitoring problems, we interface to a master Remedy system for stuff due to ITIL, and we have triggers that run scripts that try to fix mundane problems (which usually works).

    2. Re:No reliability issues? by growse · · Score: 1

      I had the same issue. Features are great. Reliability sucks.

      I guess if you're enterprisey, you could deploy the servers as resilient pairs (I believe it supports that), but I couldn't afford that for my smallish system.

      --
      There is nothing interesting going on at my blog
    3. Re:No reliability issues? by neurovish · · Score: 1

      Which revision?

      I tried it for a couple of months, and rather liked it, but it'd simply stop monitoring stuff, triggers wouldn't fire reliably, etc.

      Try out 1.6.4. I had those problems with every version up until this one. It's been stable for the past 3 months and hasn't needed the cronjob that I set up to do a weekly restart of the server processes.

  47. Don't be like Tivoli, OpenView, etc by duffbeer703 · · Score: 2, Insightful

    Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.

    --
    Conformity is the jailer of freedom and enemy of growth. -JFK
  48. Intermapper by ChiefArcher · · Score: 1

    Big fan of InterMapper (www.dartware.com)... It can use Nagios plugins as well.
    It's fairly cheap. We monitor about 1250 devices at the moment with it, and it can be set all the way down to 5 seconds.
    Server and Client are both in Java... so more or less it runs on any platform.

    They give out 30 day demo keys.

    1. Re:Intermapper by Anonymous Coward · · Score: 0

      The server's actually written in C++ (one reason why it's fast enough to monitor a lot of devices), but it's available for all platforms (win/mac/linux/solaris/bsd)

  49. Cacti w/plugins by Linegod · · Score: 2, Interesting

    I use Cacti, with THold and weathermap plugins.

    But then I'm biased.

    --
    -- I care not for your foolish signatures.
    1. Re:Cacti w/plugins by Krneki · · Score: 1

      Me too, but the key here is to set polling interval to 5-10 sec.

      --
      Love many, trust a few, do harm to none.
  50. InterMapper by Anonymous Coward · · Score: 0

    It's not free, but if you want something with maps that is very easy to set up, and for regular users to work with, take a look at it.

  51. Well.. by Anonymous Coward · · Score: 0

    As long as it has more cow bell!!

  52. Foglight by RobbieJ · · Score: 1

    Foglight from Quest Software covers most of the requirements out of the box and is script friendly. It's all Java based, though it isn't itself an open source project.
    There is also a community around customization that might be worth checking out over at www.foglight.org.

  53. If you would write one from scratch... by Yaa+101 · · Score: 1

    Then I advise you to have all the connection checks done by a machine dedicated to this task.
    Run the statistical checks on each of the 5000 machines themselves, 24 hours a day, pushing/pulling data at intervals to that dedicated machine.
    I advise 2 or more measuring machines, each on a separate network, each doing the same task and synchronizing their data for redundancy.

  54. Intellipool Network Manager by Knightman · · Score: 1

    You can always try INM (http://www.intellipool.se/).

    It's quite feature rich and it's worth a look.

    --
    --- Reality doesn't care about your opinions, it happens anyway and if you are in the way you'll get squished.
  55. OpenNMS by ckaminski · · Score: 2, Insightful

    Was a step above Nagios in terms of reliability (I didn't have to restart the server four times a day just to keep it running), and did much better at autodiscovery.

    The fact that it is also NRPE compatible was a plus - I could use all the Nagios plugins and check scripts I'd written.

    I was also planning on using it to launch a more aggressive webmin-style management solution - since OpenNMS built this great database of data about my devices and hosts, I could use it to do actual management - change data/settings.

    Cons: It's a Java/Tomcat tool, as much as that is really a con. It's not like you need to run Jboss or Websphere to use it (though I suppose you could).

  56. What I Lack in Open Source Monitoring Solutions by rawler · · Score: 5, Insightful

    I just did a quick survey and evaluation of the open source monitoring market for my company, and found a few shortcomings/frustrations in a few areas where none of the evaluated systems seems to get it 100% right.

    Transparent Planned Design
    Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.

    Event management
    Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/SMS based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.

    Modularity/Seamless Integration
    Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g., is project X missing an event manager, or is the built-in one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Black-box solutions are OK, as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution we evaluated does everything we need it to, and we end up struggling with manual routines to compensate.

    Complexity
    There are a few really neat systems that do almost everything one can ask for (short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they're then often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.

    The Perfect Monitoring System
    After evaluating all options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (Although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrates well, and together fits our needs, which I personally see as a bigger problem.

    What I would really want to see in the world of open source monitoring is an ecosystem of monitoring apps with an overarching design/architecture. Design a framework where the different entities and steps in the monitoring are clearly defined and interfaced with each other, but which still allows for differing implementations and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700 Mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an appliance, but in The Perfect Monitoring System, it should be clear to the average-skilled hacker how to integrate it.

    1. Re:What I Lack in Open Source Monitoring Solutions by RobiOne · · Score: 1

      I see you haven't evaluated BixData http://bixdata.com/.. see my other comment above.

      --
      -- Robi
    2. Re:What I Lack in Open Source Monitoring Solutions by Krneki · · Score: 1

      So, what would you suggest I should use?

      --
      Love many, trust a few, do harm to none.
    3. Re:What I Lack in Open Source Monitoring Solutions by rawler · · Score: 1

      Sorry, I don't have a suggestion. That's basically the point of my posting: none of the current (open source, that we found) monitoring systems really serves our needs. (Although yours may be a different story.)

      Currently, we are doing a closer evaluation of Zabbix, which is so far looking OK for the parts it does solve (although to be fair, we haven't yet reached the full-implementation-attempt where ZenOSS started failing us).

      Zabbix however, lacks the event-handling parts we need (listen to and manage remote syslogs and SNMP traps), so we are attempting a combination with rsyslog + phplogcon and a bunch of own-written scripts for dispatching to cell-phone / e-mail based on a rule-set. It will be a big hairy ball of glue and duct-tape and will be lacking a lot of highly valued features, but in time, we may end up with a solution we can work with.

    4. Re:What I Lack in Open Source Monitoring Solutions by rawler · · Score: 1

      Hmm, and it seems you didn't really read the part in my posting about event-handling. Syslog-handling and SNMP trap decoding is _very_ important for us, since we have a lot of black-box equipment whose only interface is SNMP, and half of the information comes as traps and can't be polled.

      It is true that we did not evaluate BixData, but judging from the website, it also does not match what we're looking for. If you remove the SNMP/syslog requirements, there are more solutions that do roughly what BixData seems to do, like Zabbix, or maybe even Nagios would be a serious contender in that game.

    5. Re:What I Lack in Open Source Monitoring Solutions by Anonymous Coward · · Score: 0

      You must not have looked very hard at opennms.

    6. Re:What I Lack in Open Source Monitoring Solutions by Krneki · · Score: 1

      Good luck :)

      --
      Love many, trust a few, do harm to none.
    7. Re:What I Lack in Open Source Monitoring Solutions by Anonymous Coward · · Score: 0

      Right on the spot.

      We evaluated a lot of commercial/open source solutions, but they were lacking in the architecture/model area.

      It was never clear what the underlying conceptual model was and how it was mapped to the architecture implementation.

    8. Re:What I Lack in Open Source Monitoring Solutions by Anonymous Coward · · Score: 0

      I think you've found what many huge enterprises find. It takes a host of monitoring apps to give you what you need. Cisco Works, HP OVO, Nagios, Cacti, eHealth, etc. all working together. Spendy, but for now it seems to be the only plausible solution.

    9. Re:What I Lack in Open Source Monitoring Solutions by Ranger+Rick · · Score: 1

      FYI, I work for OpenNMS, so I can't answer for all systems, but I can tell you how we stack up against your requirements:

       

      Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth.

      OpenNMS was started by guys who did OpenView, NetCool, and other consulting for years and were tired of crappy tools that were hard to integrate with, so it was designed with scalability and "enterprise-ness" from the start. We've got folks monitoring hundreds of thousands of data points every 5 minutes from a single box. At this point the biggest bottleneck is not the code, but the I/O capabilities of your monitoring host, and how much data it can save to disk in a given amount of time.

       

      Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/SMS based on configurable rules?

      OpenNMS can do this, with a combination of our syslog daemon (which turns syslog messages into events), the event translator (which can parse those events and let you look for certain patterns to make more specific/different events), and alarms, which collapse multiple events of the same type into a single thing which you would then use to send notifications (which can span various groups, duty schedules, and notification types).
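
      To make the "collapse multiple events into a single alarm" step concrete, here is a small Python sketch. It is not OpenNMS code; the class names, the choice of (source, event type) as the reduction key, and the sample values in the note are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    key: tuple
    first_seen: float
    last_seen: float
    count: int = 1


class AlarmTable:
    """Collapse repeated events with the same key into one alarm."""

    def __init__(self):
        self.alarms = {}

    def ingest(self, source, event_type, timestamp):
        """Record an event. Return True only when it opens a new alarm."""
        key = (source, event_type)
        alarm = self.alarms.get(key)
        if alarm is None:
            self.alarms[key] = Alarm(key, timestamp, timestamp)
            return True  # first occurrence: this is the one to notify on
        alarm.count += 1
        alarm.last_seen = timestamp
        return False  # repeat: counted, but no new notification
```

      Feeding the table three identical linkDown events from one switch yields a single alarm with a count of 3, and only the first call asks for a notification.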

       

      Modularity/Seamless Integration

      OpenNMS has a number of ways to integrate external systems:

      * traps - OpenNMS can receive SNMP traps and turn them into events internally

      * event socket - OpenNMS has an event socket that you can push XML to that become events internally

      * syslog (as mentioned earlier)

      * "passive status" which lets you essentially "push" polled data instead of querying it from a remote device

      I'm a coder, I don't do any of our field-implementation consulting, so there are probably more ways to integrate that I've forgotten, but basically at this point, there's nothing you'd want to integrate that couldn't be integrated with just a little glue scripting.

      That said...

       

      The Perfect Monitoring System

      There is no perfect monitoring system. Everyone (including me... <g>) starts out thinking "eh, network management can't be that complicated" but it turns out everyone has wildly different networks, different needs, and in the end, will get the most out of different solutions. Any network management tool that says it can solve everyone's problems is lying. There are absolutely situations where some tool would work better for your specific needs than OpenNMS, but we've worked hard to provide a platform that eases integration, to cover as many of those needs as possible. So far, all of the stuff you've mentioned is doable in OpenNMS. Not all of it would happen out of the box, but all of the things you're wanting are possible due to our flexible integration points.

      --

      WWJD? JWRTFM!!!

    10. Re:What I Lack in Open Source Monitoring Solutions by rawler · · Score: 1

      Hi, we skimmed through OpenNMS during our initial survey of the Open Source market. Back then (roughly 9 months ago) we discarded OpenNMS since it did not seem to match our scope. The Network monitoring requirement is for us very low, we need much more host and service-monitoring, and OpenNMS did not seem like the right tool back then.

      Now, after a tip earlier in this article and reading a little deeper about OpenNMS, it definitely seems like a strong candidate, and I will investigate further when I get back to work.

      Just one question: I can't easily figure out from the website exactly HOW host monitoring is attempted, i.e. detecting hard drive failure, filesystem conditions, load, etc. I realise it could be set up as a big bunch of services and monitored individually, but that is exactly one of the things we're trying to get away from in our current solution (Argus).

      Regarding "The Perfect Monitoring System", I hear there will be one, slated for simultaneous release with Duke Nukem Forever. ;)

    11. Re:What I Lack in Open Source Monitoring Solutions by mjhuot · · Score: 1

      Quick disclaimer: I am a long-time user of and contributor to OpenNMS. I would post your questions to the OpenNMS discuss list; there are many ways to tackle your requirements, some using SNMP, some with other solutions. You'll find that the OpenNMS community is very, very resourceful. See our video for the SourceForge Community Choice Award: http://sourceforge.net/community/cca09

    12. Re:What I Lack in Open Source Monitoring Solutions by neurovish · · Score: 1

      You may have already done this, but the Zabbix dev team are pretty good about listening to their users when it comes to implementing new features. Send them some mail or post on the forums about adding in the syslog event handling and SNMP traps (I thought it already did the SNMP traps though). I've been using tenshi on a centralized log server along with Zabbix to handle alerts from syslog messages.

    13. Re:What I Lack in Open Source Monitoring Solutions by mberkay · · Score: 1

      Take a look at RapidInsight. It is an open source integration, automation and presentation solution for IT management.

      Transparent Planned Design
      Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.

      RapidInsight is developed as a platform first. Applications, from data model to UI, are built using the platform, hence it matches your "Transparent Planned Design". The platform uses other common open source projects and tools wherever possible (Groovy, Grails, Compass/Lucene, etc.) to make it as easy as possible for others to modify/enhance the applications provided and build new ones. The current version of RapidInsight is built on the third iteration of the platform.

      Event management
      Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/SMS based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.

      RapidInsight provides robust event management capabilities: enrichment, filtering, automated action, event lifecycle, archiving, full-text search, etc. RapidInsight can process syslog files and SNMP traps, provides an XML/HTTP API, and has plugins for other open source (Hyperic, OpenNMS) and proprietary (Netcool, Smarts) management systems. It provides a web-based user interface (Ajax) that is capable of handling very large data sets. Events can be manipulated using the Groovy scripting language.

      Modularity/Seamless Integration
      Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g., is project X missing an event manager, or is the built-in one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Black-box solutions are OK, as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution we evaluated does everything we need it to, and we end up struggling with manual routines to compensate.

      Seamless integration was a primary design requirement for RapidInsight, which is reflected in many aspects of the solution. For example, the integration layer abstracts the differences and provides consistent methods to work with external systems, databases, etc. The available applications all use the integration services provided by the solution.

      Complexity
      There are a few really neat systems that do almost everything one can ask for (short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they're then often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.

      This is a tough one. You're right, trying to scratch everyone's itch can drive a solution towards complexity. We struggle with this all the time. I'm too close to the solution to tell how well/bad

  57. Use the Tivoli architecture and rewrite it by mveloso · · Score: 1

    As you've discovered, the free systems will fall over and die once they're past a certain size. I've worked with Tivoli customers that have tens of thousands of servers, and for all of its problems, Tivoli is scalable.

    The way they do it is, obviously, divide and conquer. There are specific ways that they do it. I'm mixing up the architecture and terminology on purpose, because the Tivoli terminology will confuse you.

    * there are agents on every box that do the monitoring
    * they report to a region
    * those regions report to a top-level region

    That doesn't mean that you can't have a poll engine somewhere, poking machines. What it means is that if you do have a poll engine, it manages a specific number of machines and reports results upwards if necessary.
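    A minimal sketch of that agent -> region -> top-level rollup, in plain Python. The status names, the severity ordering, and the function names are all made up for illustration; none of this is Tivoli's actual API or terminology.

```python
from collections import Counter

# Hypothetical severity ordering; not Tivoli terminology.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def summarize_region(name, agent_reports):
    """Collapse per-host agent reports into one regional summary."""
    counts = Counter(report["status"] for report in agent_reports)
    worst = max(counts, key=SEVERITY.get, default="ok")
    return {"region": name, "worst": worst, "counts": dict(counts)}


def summarize_top_level(region_summaries):
    """Merge regional summaries; the top level never sees individual hosts."""
    total = Counter()
    for summary in region_summaries:
        total.update(summary["counts"])
    worst = max(total, key=SEVERITY.get, default="ok")
    return {"worst": worst, "counts": dict(total)}


east = summarize_region("east", [
    {"host": "web1", "status": "ok"},
    {"host": "db1", "status": "critical"},
])
west = summarize_region("west", [
    {"host": "web2", "status": "ok"},
    {"host": "web3", "status": "warning"},
])
overall = summarize_top_level([east, west])
```

    The point of the design is that each level only handles summaries from the level below, so no single box has to talk to every machine.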

    Tivoli has a bunch of other stuff that makes things like this easier, like profile-based management and lightweight (relatively) endpoints. You can simulate that using a centralized source control system that everything pulls from: configuration of monitoring, etc. comes from those config files, and all your agents pull their configs depending on criteria like their hostname, IP, or by looking in some file for what they're supposed to get. This also becomes your shared filesystem of sorts, because you can pull monitoring binaries from it as well as config files.
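    A rough sketch of the "agents pick their config by hostname" part, assuming the repository is already checked out locally. The rule table, the glob patterns, and the file paths are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical rules: (hostname glob, profile file in the checked-out repo).
PROFILE_RULES = [
    ("db-*", "profiles/database.conf"),
    ("web-*", "profiles/webserver.conf"),
    ("*", "profiles/default.conf"),
]


def profiles_for(hostname, rules=PROFILE_RULES):
    """Return every profile whose pattern matches this host, in rule order."""
    return [path for pattern, path in rules if fnmatch(hostname, pattern)]


db_profiles = profiles_for("db-01")
other_profiles = profiles_for("mail-01")
```

    Here `profiles_for("db-01")` picks up both the database profile and the default one, while a host that matches nothing specific gets only the default.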

    Management of alerts is always a problem. Having worked on an EMS I can tell you that all the free ones suck, so it doesn't matter which one you pick. Spend some money and buy the BMC Event Manager.

    Besides that, avoid UDP - it fails when you need it the most. And don't do management by exception - it's for lazy admins. Instead, do some kind of thresholding on your stuff, so you can tell before it fails. MBE gets you there 5 minutes before your users call. Real monitoring allows you to ignore the problem for weeks, or at least blame someone else for not acting when the systems finally do fail.
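    To make the thresholding idea concrete, here is a small sketch: classify a reading against warning/critical levels, and extrapolate linearly to guess how long until a disk fills. The threshold values and function names are illustrative assumptions, and a straight-line estimate is the crudest possible trend.

```python
def classify(value, warn, crit):
    """Map a reading to ok/warning/critical using two thresholds."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def hours_until_full(samples, full=100.0):
    """Linear estimate from the first and last (hour, percent_used) samples.

    Returns None when usage is flat or shrinking.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0 or v1 <= v0:
        return None
    rate = (v1 - v0) / (t1 - t0)  # percent per hour
    return (full - v1) / rate


state = classify(76.0, warn=75, crit=90)
eta = hours_until_full([(0, 70.0), (24, 76.0)])
```

    With these numbers the disk is only at "warning", but the trend says roughly four days of headroom, which is the kind of notice management by exception never gives you.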

    1. Re:Use the Tivoli architecture and rewrite it by Krneki · · Score: 2, Interesting

      I don't believe in agents. I refuse to install anything not needed on the server. SNMP should be enough for all the information; unfortunately, this is not the case. So I use WMI and NetBIOS querying.

      --
      Love many, trust a few, do harm to none.
    2. Re:Use the Tivoli architecture and rewrite it by vrmlguy · · Score: 1

      Besides that, avoid UDP - it fails when you need it the most.

      When a network gets congested, TCP keeps resending packets. Even with back-off, you start getting more packets just when you need fewer. UDP avoids this issue. You should only use TCP for remediation tools, not for reporting.

      --
      Nothing for 6-digit uids?
    3. Re:Use the Tivoli architecture and rewrite it by Anonymous Coward · · Score: 0

      I understand how you feel about agents, but even with SNMP or WMI/NetBIOS (for the Windows-only world) you can't gather all pertinent information regarding applications. For example, let's say you have a Java application: how are SNMP and WMI/NetBIOS (or any agentless technology) going to give you details about what is occurring within the JVM? They can't; the Sun/Oracle API doesn't allow for remote monitoring. Sure, you can monitor the application server, but not the packages/classes/methods within the JVM to determine what is causing an application to perform slowly. The same holds true for many applications: while you can get some information remotely, it only scratches the surface. Agents are often needed for deep-dive troubleshooting and diagnostics.

      That is, if you are concerned with more than standard server-based metrics (i.e. CPU/disk/memory/etc.). If that's all you are looking at, you're missing the boat.

    4. Re:Use the Tivoli architecture and rewrite it by afidel · · Score: 1

      I agree 100%, I don't like to run AV or backup agents when I don't need to and no other agents are getting on my boxes. If it can't figure out how to get the information through SNMP/WMI then the product doesn't even pass my sniff test. In my experience agents are just one more thing to blow up the box, eat resources, and have to be maintained and tested.

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
    5. Re:Use the Tivoli architecture and rewrite it by richrumble · · Score: 1

      We have similar goals with Clearsite; it's like Cacti but Cisco-centric. We're going back to the drawing board this year based on what we've learned, and our product will be a lot like OSSIM, only better ;) Google: xinn.org contact if you'd like to discuss further. -rich Xinn.org

    6. Re:Use the Tivoli architecture and rewrite it by RobiOne · · Score: 1

      Haven't evaluated BixData http://bixdata.com/ ? See my previous post..

      --
      -- Robi
    7. Re:Use the Tivoli architecture and rewrite it by ta+bu+shi+da+yu · · Score: 1

      Oh man. WMI is evil! How do you get it through firewalls? The amount of configuration just to get it working is really quite silly. Hint to Microsoft: next time you set up a distributed monitoring tool that needs to go through firewalls, don't use DCOM.

      Then again, this is the company that decided that a good idea would be to add binary blobs into their Exchange RPC mail protocol, which is what, by default, MAPI is using. This leads to articles on troubleshooting like this one. Ugh. And yet, MAPI is often seen as more secure than IMAP. Sorry, I digress. WMI is evil, that's all you need to know. Just use SNMP.

      --
      XML is like violence. If it doesn't solve the problem, use more.
    8. Re:Use the Tivoli architecture and rewrite it by Sharp+Rulez · · Score: 0

      There are surely some free alternatives out there, but on the off-the-shelf, non-free side you cannot ignore IBM/Tivoli Netcool.

      The Netcool suite was purchased by IBM a few years ago. That portfolio came from Micromuse, which was really telecom-oriented. The suite can easily monitor the required 5000 nodes. Across the suite, there are multiple products that cover most NMS needs, from basic alarm/stats collection such as receiving/parsing SNMP traps, SNMP polling, and node discovery, up to higher-level business requirements like alarm management, AJAX or Java GUIs for alarm display, dashboarding, etc.

      Feel free to have a look: http://www-01.ibm.com/software/tivoli/sw-atoz/indexN.html

    9. Re:Use the Tivoli architecture and rewrite it by Anonymous Coward · · Score: 0

      Posting anonymously because I work for said entity - These opinions are mine and not IBM's etc etc usual disclaimer.

      There are other options. I assume with the talk of agents you mean ITM? There is a sister product to ITM (the two are slowly integrating, but not combining now) that can handle much higher event flow than BMC or anything else on the market. May I suggest you investigate Tivoli Netcool/OMNIbus?

      It has slightly different paradigms from ITM, "agents" do not necessarily reside on endpoints, though they can.

    10. Re:Use the Tivoli architecture and rewrite it by Krneki · · Score: 1

      I do monitoring only inside the LAN and I use SNMP as much as I can, but sometimes you are out of options.

      --
      Love many, trust a few, do harm to none.
    11. Re:Use the Tivoli architecture and rewrite it by Anonymous Coward · · Score: 0

      There is an SNMP interface to WMI

  58. Tried them all... by Anonymous Coward · · Score: 0

    Same size company. Currently using Nagios, Cacti, Kbox, MOM, MARS, ADM.... god knows what else. Beta testing a product called Nimsoft and it's actually really nice.

  59. Use cricket! by Anonymous Coward · · Score: 0

    Hello!

    I used to work for an ISP.
    For monitoring and polling more than 8000 devices, I used Cricket (a set of CGI Perl scripts).
    Just search for it on SourceForge; it's a couple of powerful Perl scripts that help you create RRDs.
    Then later you can use its own CGI scripts to visualize them on the web... or use a third-party script.

    The thing about Cricket was that it polled more than 10k objects in very little time, and had some interesting threshold alerts (mail, SMS, web).

    The best practice out there is to merge something between cricket and cacti/openNMS.

    hope my experience helps you.
    regards,
    Nicolas Royo

  60. How about... by yacoob · · Score: 5, Informative

    Needed features in random order:
    * Scalability - few k machines is minimum. This probably means smart, decentralized collection and aggregation of data.
    * Flexible whitebox monitoring - for given class of devices, I should be able to configure how to fetch this device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
    * Flexible blackbox monitoring - for given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how results of that action should be interpreted (ok/nok, time to complete, etc.).
    * Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) of the monitoring data.
    * Some language to easily calculate derivative values from the data above.
    * Interface for defining graphs, using collected data.
    * ...and a system for annotating the above. Raw data is neat, annotated data is even better.
    * Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
    * (nice to have) HTTP server with a simple HTML templating, to allow for easy creation of arbitrary dashboards.
    * (if you have the above) predefined templates for most of common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
    * hooks/libraries to use collected data "outside" of the system

    I realize that's a lot, but boy, such system would be very useful and flexible.
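    As a rough illustration of the tag-and-aggregate item in that list, here is a sketch using only the standard library. The sample layout (a dict with "tags" and "value") and the tag names are assumptions for the example, not any existing tool's data model.

```python
import statistics
from collections import defaultdict


def aggregate_by_tag(samples, tag):
    """Group samples by one tag and compute min/max/mean/sum per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tags"][tag]].append(sample["value"])
    return {
        key: {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "sum": sum(values),
        }
        for key, values in groups.items()
    }


samples = [
    {"tags": {"machine": "web1", "segment": "dmz"}, "value": 0.5},
    {"tags": {"machine": "web2", "segment": "dmz"}, "value": 1.5},
    {"tags": {"machine": "db1", "segment": "core"}, "value": 4.0},
]
by_segment = aggregate_by_tag(samples, "segment")
```

    The same function works for any tag (machine, source, network segment), which is most of what "easy" means here.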

    --
    -- we're here you're not
  61. Monitoring on scale by C_Kode · · Score: 2, Interesting

    If you're monitoring something of that scale, you should probably look into a professional solution.

    I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using Nagios. Though I've heard there has been some improvement.

  62. Host/Server-focused or Network-focused? by Etcetera · · Score: 1

    That seems to be one split between the various different monitoring systems out there. Either it's intended for the network guys and its only understanding of host/server metrics is what it can poll out of SNMP, or it's SA-focused but has few of the broad, large-scale network features that the network guys want.

    Personally (as an SA), I've been very satisfied with Xymon (nee Hobbit, which was a fork/rebuild of Big Brother). Performance is great, even with 5000+ devices, it's got an open and simple-to-parse protocol, and an incredibly extensible architecture. As an SA, being able to script up a monitor and throw it into a data stream as plain text makes it very easy to develop new tests (or add simple monitoring/logging/rrdgraphing) out of pre-existing scripts. Don't limit yourself to what SNMP gives you if you're dealing with servers, services, and higher-level app testing. KISS: http://en.wikibooks.org/wiki/Hobbit_Design_Document
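    For a feel of how little it takes to throw a test result into the stream, here is a sketch in Python. It assumes the status message form `status HOST.TEST COLOR text`, with dots in the hostname written as commas, and the usual Xymon port 1984; check the Xymon documentation before relying on either. The server name in the usage comment is made up.

```python
import socket


def build_status_message(hostname, test, color, text):
    """Build a Xymon-style status line (format assumed, see note above)."""
    return "status {}.{} {} {}".format(
        hostname.replace(".", ","), test, color, text
    )


def send_status(server, message, port=1984, timeout=5.0):
    """Send one message over TCP; closing the socket ends the message."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(message.encode("utf-8"))


message = build_status_message("www.example.com", "backup", "green", "Backup OK")
# send_status("xymon.example.com", message)  # hypothetical server, not run here
```

    Any pre-existing script that can decide on a color and print some text can be turned into a monitor this way.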

    Whatever you do, pick something you can easily customize: Hack together three different monitoring systems to come up with a best-of-both worlds solution. Everyone's monitoring needs are different, after all.

  63. 1 COTS & 1 OSS Suggestion by mars+soup+eel · · Score: 1

    For COTS I'd go with CA (formerly Concord) eHealth. Their SNMP agent is lightweight and flexible, and the product scales out very well. It's also reasonably straightforward to deploy and configure, and quite expandable. If you want an open source alternative that will grow with you and/or offer support options down the road, I'd give Zenoss a shot. Steer clear of HP, BMC, or IBM solutions due to complexity and/or price.

  64. Clu by dlapine · · Score: 1

    If you want a monitor that can display useful information about thousands of nodes on a single display try clumon. We use it for our 1000+ node clusters. The software was developed in-house but is available under the University of Illinois/NCSA Open Source License Copyright (noticeware). If you're just going to use this in-house, the license shouldn't be an issue.

    You can see a sample clumon display of a working cluster at NCSA Linux Cluster Monitor. The clumon page for that cluster shows you the job status of each individual node (if the node is colored, it has a job assigned), the load on the machine (the height of the line is proportional to the load, and red tips show loads over 1.0 per cpu) and the service status (green underline is ready, yellow/black stripes is offline, and red is unexpected offline/no comms). If you mouse over a node, a status box pops up with more information on that specific node.

    As this was designed for a cluster with the Torque resource manager, it won't be exactly what you need, but since you are willing to write a monitor from scratch, it might be a really useful starting point. Design-wise, this monitor allows the engineer or manager to see what's going on in general, with problem areas being immediately obvious, and without being overly cluttered.

    The open source Performance Co-Pilot software runs on each node to collect information, which is polled by the central server. Back end is MySQL. The dynamic display is PHP.

    Straightforward, useful and very configurable.

    --
    The Internet has no garbage collection
  65. s/develop/deploy/ by seanadams.com · · Score: 1

    He's just asking what to use

  66. Noone ftw!!! by Anonymous Coward · · Score: 0

    Noone

    NOOOOOONE!!!!

    But in a non-AC mode I would say something that agreed with you and possibly even offer my services in the actual coding of the said system. Just send me a message at /dev/null if you start this project.

  67. LANREV by Anonymous Coward · · Score: 0

    LANREV

    Seriously. That is what you are describing. That is what I use for well over 5000 devices.

    http://www.lanrev.com/

  68. Real world alarm capability by happyslayer · · Score: 1

    I know I'm late to the party, but I haven't seen anyone bring this one up yet: Real-world alarm/notification capability (pager, buzzer, a machine that goes bing, something like that)

    My reasoning: I run a small IT business with various support contracts. I, and probably quite a few others, can't afford to pay someone to sit at a monitor and watch a screen (or a bunch of screens) whilst tied to a desk.

    Most of the monitoring solutions (Nagios, others) are capable of off-site notification, but it's the "last yard" that's the problem--how to tell someone, even a non-techy, there's a problem so he can call in the cavalry. Despite Verizon's "largest 3G network" claim, a lot of my clients and workers in Silicon Holler don't have cell coverage...so SMS, pagers, etc. aren't all that reliable. But we do have office staff who could be around to listen for an alarm, and we have a solid internet connection...so calling for help via the network is viable, but not paying someone to be otherwise unproductive because they can't go anywhere else.

    I even started developing my own ATMEGA based solution...still working on it, and I think it's completely doable. If I ever get it up and running, I'll publish the plans, code, and scripts/software under GPL and let someone else worry about the marketing.

    --
    Never confuse movement with action. --Hemingway
    1. Re:Real world alarm capability by happyslayer · · Score: 1

      Real-world alarm/notification capability (pager, buzzer, a machine that goes bing, something like that)

      ...sorry, I mentioned "pager" as a real world alarm, then panned the idea--I meant "pager" as in "Mr. Jones, please check the server logs..."

      --
      Never confuse movement with action. --Hemingway
    2. Re:Real world alarm capability by afidel · · Score: 1

      Skytel shows good coverage except in the very southern fringe of the Lexington area, have you tried them?

      --
      There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
  69. Go with what you know by Anonymous Coward · · Score: 0

    I use a combined nagios + cacti solution with over 1,000 devices, around 10,000 nagios checks, and 50,000 cacti data points. It's all tied into our host allocation system via some scripting and templating, so it requires very little day-to-day management, and it looks like it would scale to at least your specified size with some more hardware thrown at it.

    The biggest issue I have with scaling is that my devices are in a half a dozen different colos. Network latency hits me harder than anything else. Geographically distributed polling's pretty easily solved with nagios, less easily with cacti, but if you search on the forums there are some people who've done just that. If all your devices are in the same colo, then it's even easier.

    Go with what you know, and make it do what you need it to do.

  70. One word "Reconnoiter" by Anonymous Coward · · Score: 0

    http://labs.omniti.com/trac/reconnoiter

    This project has so much potential and has been developed from the ground up. If you are going to build a system from scratch may I suggest getting involved in the development of this wonderful project.

  71. MRTG by Anonymous Coward · · Score: 0

    MRTG,
    I know, I know...It's old but still works great.

    I have almost 6000 devices monitored, running on an older HP 6000r server, 6 way, 4 gigs ram and 3 nic cards.

    -js

  72. zabbix is excellent by steveatmarz · · Score: 1

    very non invasive, monitors just about anything and graphs via open libraries. opensource and pretty easy to get started.

    --
    Steve Maher freeunixtraining.com
  73. OpenNMS by Anonymous Coward · · Score: 0

    OpenNMS can easily handle a network that big. It is easy to use once it's up and running.
    http://www.opennms.org

  74. Groundwork by riffraff · · Score: 1

    We are using http://www.groundworkopensource.com/ for our monitoring. It is working pretty well, and we can use existing Nagios scripts with it.

  75. Zabbix is easy to maintain and flexible by bigtrike · · Score: 3, Interesting

    Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web gui. It's not perfect, but it keeps improving and the attitude of the developers is friendly unlike some of the other projects.

  76. Hobbit Monitor - aka - Xymon by nljackson · · Score: 1

    I implemented The Hobbit Monitor where I work. Actually, it's called Xymon now because of a copyright complaint (who knew?!) but I digress...

    We monitor basic to complex information of about 5000 machines and a handful of NAS devices. It is a server/client setup, and highly customizable: an evolution of the Big Brother monitor from days of yonder. The histories can go back indefinitely, and all the configuration is done by flat-files: helpful if you like to roll-your-own automatic configuration tool.

    It is pretty basic out of the box, but the way it is implemented makes it very easy to track whatever you want and write your own tests: from simple bash and perl scripts, to c programs with api hooks into your applications.

    We didn't go with Nagios because initial testing showed it was very chatty and the interface was unintuitive. I happen to like the easy 'smiley face good, frowning face bad' for taking a quick glance at our infrastructure.

    1. Re:Hobbit Monitor - aka - Xymon by surprise_audit · · Score: 1

      I like Hobbit/Xymon too. I had it running for about 8 years quite happily on an old DL380, single 733MHz cpu, 512Mb memory. I think the peak traffic it took was about 3500 status messages for 500 hosts. Almost all the messages were generated by scripts running on the server, grabbing web pages from all those hosts.

  77. System Monitoring by Anonymous Coward · · Score: 0

    I use SolarWinds Orion NPM and the Application Plugin. I have the ability to monitor all aspects of our network, servers, VMware environment, the works.

    It is by no means cheap ($5000 or so to start) but I found haggling with the sales rep usually gets that price reduced.

    Yes, it is a Windows-based system, utilizing SNMP and WMI, but I've never once had a problem with it, and their user forums and tech support are awesome.

    Just my 2 cents

  78. Why one thing? Ganglia and Nagios by dbIII · · Score: 1

    Ganglia for performance monitoring (it's for clusters after all and has been shown to handle 5000+ machines with very little overhead) and Nagios for host/service down alerts.

    1. Re:Why one thing? Ganglia and Nagios by jfp51 · · Score: 1

      We are using Ganglia for our Rocks cluster and have been very happy with it. Some Rocks installations are huge and it apparently scales very well.

  79. Zabbix by fishbowl · · Score: 1

    Zabbix.

    http://www.zabbix.com/

    If it leaves something to be desired, please tell us what.

    --
    -fb Everything not expressly forbidden is now mandatory.
  80. Depends on your requirements by Anonymous Coward · · Score: 0

    If all you want is Up/Down and a few core metrics (CPU, Memory, etc) or have nothing but 'nix servers and network devices to monitor then any of the OSS solutions can fill the need. If you work in a Windows world with loads of Microsoft apps, then it seems to me that Operations Manager 2007 is the best solution. I doubt any of their competitors can provide so much support for MS products and I would rather not build it all myself.

  81. Zope database is solid by Grincho · · Score: 2, Informative

    To be fair, I wouldn't say the Zope database (ZODB) is not a "solid foundation". It's one of the best parts of the Zope stack and, in 3 years of dozens of clients using it in Zenoss, Plone, and other apps, I've never had it corrupt or lose any data. It's a proper DB--ACID, MVCC, and all that--and you can even lop transactions off the storage to go back in time. Don't expect it to be a relational DB with the ad hoc query tools typical thereof; it's an object DB, with the aim of persisting graphs of Python objects transparently.

    Now, if you aren't familiar with it, the ZODB can indeed seem opaque, but, just like any DB, there are tools to read and modify it. At the highest level, just stick "manage" after your Zenoss URL, e.g. http://example.com/zport/dmd/manage . That'll get you into the web-based Zope Management Interface (colloquially, "the ZMI"), where you can poke around at any object that someone's bothered to write a UI for. Deeper than that, you can connect to ZEO (a server that brokers access to the ZODB over a socket) and mess with the object graph using normal Python. When you're done, "import transaction; transaction.commit()". (The Zenoss developers are probably trying to scare you away from such digging around in fear that you'll violate their objects' invariants and leave them a real mess to solve.)

    Now, I don't say that Zope isn't scary; it has over 10 years of scary stored up in it. But the ZODB is a cuddly, loving part.

    Cheers!

    1. Re:Zope database is solid by rawler · · Score: 1

      Yeah, maybe saying ZODB not being solid is not being completely fair, and a bit imprecise.

      The thing I experienced as a problem with our ZenOSS attempts wasn't ZODB by itself. We had no reason to believe ZODB by itself was corrupted, just that something had gone wrong, and was recorded in the ZODB.

      The problem I see with ZODB is not its own merits, but how it tends to get used. If I develop software for working with some established RDBMS backend, I'm pretty much forced to remember that someone else may very well enter and change data out of my control, change data structures, add columns for their own private use and so on. The SQL database is basically a shared database, which means I must be more wary about how I insert and fetch stuff, and think about the corner cases a bit more. (And also preferably document and be explicit about my database model.)

      With the ZODB option, or basically any database that isn't readily thought of as shared with other apps or directly accessible by the end user (SQLite falls into this category as well, despite being an RDBMS and SQL), I more easily assume that what is found in the database is only what my app put there. Thus, I'm less forced to expect oddities in the database (such as a reference to a missing key, as was the case in my last issue with ZenOSS). This all leads to my implementation being more prone to a lot of assumptions, such as the "object invariants" you mention, and we all know what offspring assumptions yield.

      So, the problem here is not the stability, or lack of features, or any problem technically caused by ZODB itself, but that since it's not something that even an advanced user is likely to poke around in, it easily leads to naive implementations, which makes all problems more or less a black-box-problem (even if you have the source-code, and technically CAN poke around freely in the database).

      So, the problem is one of mindset, and a little about having easy tools to directly access the database, but not any technical fault in ZODB itself.

  82. Not sure of scale...but try Spiceworks? by huckda · · Score: 1

    I've used Spiceworks for multiple smaller sites and it works well...
    Cacti was a pain to configure for every client(tried it first)...IMO

    --
    "Just Smile and Nod." --Huck
  83. Years ahead of the competition by Anonymous Coward · · Score: 0

    The question is, are you looking at developing and implementing the policy or the software itself?

    If you are in charge of developing and implementing the policy I would look at the Compuware offering http://www.compuware.com/solutions/it_service_management.asp

    I don't know if you are looking for just a reporting dashboard or the tools to gather server/network/application information too. I could go into more detail as to how the solution works, but it would be a waste of time not knowing what is in place.

    Anyways, take a look... it's impressive what can be done with the Compuware Vantage offering. According to Gartner/Forrester, Compuware is 2+ years ahead of the rest of the ITSM market when it comes to aligning business with IT.

  84. SolarWinds Orion by bigal123 · · Score: 1

    I would have to agree with the other poster who suggested SolarWinds Orion network monitor. You can monitor network switches, each port, servers, apps on the server, other devices with SNMP strings, things that don't support that... You can import multiple maps, etc. At our site we have a US map in Orion, a state map, then campus maps for a few sites, then building maps, then down to the server room. Custom views and alerts: my login to Orion shows different stuff than our telecom person's login. And no, I don't work for them. http://www.solarwinds.com/

  85. up.time by OldManOz · · Score: 1

    We use up.time from http://www.uptimesoftware.com/ We monitor all aspects of our enterprise using it, including: network devices, OS (Windows, Linux, NetWare), applications (LDAP, DNS, MS Exchange, Oracle, etc.), performance, and hardware monitoring. We have much more than 5000 elements being monitored. We had a huge number of separate systems monitoring each flavor of component, including MOM, Nagios, NetMon, and many others. We wanted one pane of glass to see the whole infrastructure and be able to show the "service" availability. Obviously, each specialist system can monitor its own key element in some ways better, e.g. MOM can monitor things within Exchange better, but for a single monitoring system this one won our evaluation process. Check it out.

  86. Intermapper may fit the bill by Anonymous Coward · · Score: 0

    Give Intermapper a try: http://www.intermapper.com/

    Works great, simple to use. We end up relying on it more than our other deployed tools (Cacti, OV, NNM). It's also very reasonably priced.

  87. Zenoss it is by F.O.Dobbs · · Score: 0

    Bias alert, I'm the Zenoss Community Manager.

    Zenoss was written with the intention of making it easy to monitor and manage tens of thousands of network devices remotely. By using templates and device classes, once you have a single machine monitored the way you like, you can apply that to thousands of other devices, making individual changes as necessary. Zenoss handles network hardware, servers (Linux, Unix and Windows), databases, applications and just about anything else you need to monitor. There's a network map and a Google map mashup for mapping. No need to start from scratch: there's already an Open Source (GPLv2) Python-based solution with a large community, installers for Linux and OSX, and a VMware image to get started (plus source for everything else). Lots of documentation and frequent releases, with commercial support available. If you're coming from Nagios or Cacti, you can reuse any custom plugins you've developed.

  88. Zabbix by myz24 · · Score: 1

    I vote Zabbix. Here's why.

    1) Free but offers paid support if you need it
    2) Can use agents, snmp or simple checks like ping
    3) Agents can be extended with your own scripts and such. If a check isn't built in you can add it. For example, I added a very simple script for checking if MySQL replication had stopped or failed.
    4) Templates make it easy to add a metric, and a trigger based on that metric, to any host attached to that template
    5) Triggers can be configured to minimize false positives (multiple dropped packets before sending an alert).
    6) You can graph item, group of items or an aggregate value of items in a host group
    7) Create your own maps
    8) Create custom screens that group simple or complex graphs or whatever else you want onto a single page
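    The idea behind point 5, sketched in plain Python rather than Zabbix trigger syntax: only alert once the last few checks have all failed. The function name and the default of three failures are arbitrary choices for the example.

```python
def should_alert(results, required_failures=3):
    """True only if the most recent `required_failures` checks all failed.

    `results` is a list of booleans, oldest first, where True means the
    check (e.g. a ping) succeeded.
    """
    recent = results[-required_failures:]
    return len(recent) == required_failures and not any(recent)


flapping = should_alert([False, True, False, False])
really_down = should_alert([True, False, False, False])
```

    A single dropped packet, or a host that flaps once, never reaches the pager; a host that stays down does.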

    There are some things to know about Zabbix though. You need to put some thought into items to get accurate values. Is the value you are getting from a device in bits or bytes for example. You can use custom multipliers to convert values into what you want to see.
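    The bits-versus-bytes trap in numbers: SNMP interface counters such as ifInOctets count bytes, so a multiplier of 8 is needed to graph bits per second. A small sketch of the arithmetic (the function name is made up, and counter wrap-around is deliberately ignored):

```python
def bits_per_second(prev_octets, curr_octets, interval_s):
    """Convert two octet-counter readings into an average bit rate.

    Multiplies the byte delta by 8; does not handle counter wrap.
    """
    return (curr_octets - prev_octets) * 8 / interval_s


rate = bits_per_second(1_000_000, 1_750_000, 60)
```

    Here 750,000 bytes over 60 seconds works out to 100,000 bits per second; forget the multiplier and every graph is off by a factor of eight.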

    Honestly, Zabbix is incredibly flexible and this flexibility also gives it a steep learning curve but once you get hosts entered and the templates situated the way you want it becomes very easy to add new hosts down the line. The biggest tip I can give is to make sure you spend a lot of time thinking out how to setup your templates. Zabbix includes a number of them and you'll want to customize them. One thing I found that wasn't a good idea is to make a template and then attach it to a template. It's much easier to join a host to multiple templates.

    http://www.zabbix.com/

  89. To go with your monitoring system: splunk by k8to · · Score: 1

    As far as the actual question goes, I think a patchwork of tools that you understand well and that have proven themselves reliable is often a better choice than the all-singing all-dancing approach. The patchwork does take more time to roll out and configure, but if the tools are simple and easily managed, it is probably the better choice for large environments. However, at the 5000 device level, I'm not sure if you're at that break-even point. I've only personally deployed nagios, cacti, and similar tools on the small scale.

    .

    But more usefully, I'd recommend a tool that is *not* network monitoring to go along with it. Monitoring is great for seeing what is going on at the moment within a domain of events, but once you find out about a problem, how do you then dig into it? I really recommend feeding all your monitoring data and *other* IT data as well into a system that lets you investigate all of them. I think Splunk is a pretty good way to do this. It's a search engine into all the time-series data in your environment, so you can learn things like what *else* broke at that time, who was logged in, and so on, all pretty quickly. It's commercial, but reasonably priced, and can be used free at some data levels.

    Caveat, they pay my bills, so I may be biased.

    --
    -josh
  90. Zenoss, Hyperic HQ, Splunk by katapult · · Score: 1

    Zenoss for general device monitoring. Hyperic HQ for app monitoring. Splunk for scanning log files.

  91. Nagios with Centreon by str8edge · · Score: 1

    Nagios with Centreon. Centreon is a decent front end to Nagios, with commercial support if required.

  92. What are your company's needs by Anonymous Coward · · Score: 0

    This is a typical solution architecture project where I would do the following exercise:

    -Assess your environment's needs (other than 5k+ devices and 2D maps): integration with other reporting tools, notification (do you need SMS or just email?), access control (by group, user, etc.), scalability, clustering (did you forecast growth? consolidation?), etc.
    -Budget? Enterprise supported solution or OSS (there are other factors to consider for both options).
    -Analyze what's out there in terms of options: must have, nice to have.
    -Break down your list of candidates.
    -Internal knowledge of the candidates, learning curves (project - installation, operations team training).
    -Other key elements here.

    There's a matrix that you need to do here in order to prepare for your solution.
    Most people will suggest what they're comfortable with or what they've used.

    You need to analyze your needs and what's out there. This process will likely take you more than the time allocated for the actual installation and environment preparation.

    Good luck!

  93. The Dude by Anonymous Coward · · Score: 0

    We use "The Dude" http://www.mikrotik.com/thedude.php where I work and it works great. It is a Windows program but can be run under Wine. You can have multiple monitoring servers, so you should be able to scale out. We currently monitor over 200 on a shared server with no issue.

  94. Was in this skin before, made my choice. Sharing. by hotfireball · · Score: 1

    OK, guys, I ran into this a while ago and had to choose what to do. Here is a list of what I've tried (meaning really-really tried, and even looked at the source code) and my short opinion as a result. Disclaimer: this was my own personal research and practice, so I might sound different from others. Any suggestions are welcome. :-) Also I want (yeah, I am picky):

    1. Visibility and information NOC needs ASAP.
    2. Scalability and clustering.
    3. Extensibility. E.g. to provide SLA's the way I need with the information I need.
    4. Performance.
    5. Elegance in code and infrastructure. I truly hate hacks and hackmen, providing quick-n-dirty so-called "solutions".
    6. Integration, integration, integration.
    7. Insert your boss's dream here.

    So! Here it is:

    1. Nagios (http://www.nagios.org/). Solution that works nearly acceptable if you kill enough time in it. However, things I disliked in Nagios:
      • Scalability is very questionable and difficult to get right, if it is possible at all.
      • No database. It is just a flat text file that is re-written and re-read every N seconds.
      • The ugly monitoring screen is completely acceptable for sysadmins, but is really bad for simple operators on the night shift who just have to call a tech for help. For this, I had to write my own screen that stays blank if there are no problems and otherwise shows only errors/warnings. That was quite difficult to get right, because of the flat text file bottleneck mentioned above.
      • It is written in C and thus all integration with other stuff is ugly and not very elegant (e.g. I want WSDL online instead of pipe from/to Perl script etc).
      • Latency. 5-15 minutes is what you usually get if you have 2K servers.

      I am sure they are not leaders in monitoring technology, and I even doubt they are leaders in monitoring in general. However, this worked for me and I wish Nagios all the best.

    2. Groundwork (http://www.groundworkopensource.com/). In short, the same Nagios, just in better packaging. Not really, but still Nagios. Basically you have all of Nagios's problems:
      • Too much information, but too little of what you actually need.
      • Scalability sucks. Just sucks, hands down.
      • Quite a lot of wheel reinventing: while lots of stuff is already there over SNMP, you still need to write ad-hoc scripts. I don't really understand why it is that way.
      • You can write any script in any language and run it remotely. Don't you see any problem with security here?.. I see a problem in trusting in-house code that was developed year[s] ago by folks I've never met (they're gone already). The same goes for the newcomer after me.
    3. Zabbix. Just a better Nagios. Yes, it is really better. However, at enterprises you see lots of Java stuff, and Java is monitored and managed via JMX. Having JMX work with all this Zabbix PHP stuff through a quite fugly hack (see the source code for how they've done it) was a dealbreaker. However, it is good for those who think that Java is just yet another operating system. :-)
    4. Zenoss. They're improving and I have to say improving nicely. However, it is Zope and ZEO for redundancy. Also their code implementation is not the best, and they still struggle with correct packaging (e.g. wiping out all your configuration in /etc and replacing it with their own, completely flushing your stuff, etc.). Since it is Zope-2, you have to make sure you have all the patches for it, the exact required and clean Python version, etc.
    5. Some proprietary things like TIBCO Hawk (omg, stay away) and HP developed stuff.
    6. OpenNMS. Chosen!

    Personally I recommend going with OpenNMS. I'm not going to say it is excellent: it also has its gotchas. For example, don't even think about installing it on slow machines with low memory and/or on LVM. Also I would love to see it working with other databases... It is written in Java and it wants plenty of resources. However, once you give it what it wants, you will see the best of it. Integr

  95. one size does not fit all by Anonymous Coward · · Score: 0

    for basic metrics and health, ganglia is very good. we use it exclusively to monitor a similar scale backtesting (compute) cluster, the footprint it makes is negligible. we've even modded it to do more high frequency measurements where appropriate in our production environment, it's proven very flexible and easy to mod for those purposes. very scalable where appropriate, and detailed where that is appropriate, with some work. we inject a lot of other application level metrics into it for workhorse servers. having one view and metrics wrapped up together is a nice feature.

    we use nagios for categorized system metrics ("what's the load on the netapps and who's beating on them now?") and i rely on hobbit (or whatever they call it now) for ready-for-business metrics. those eventually come down to "this stuff is supposed to be running on this host today, is everything necessary for that to happen in a good state?". it's invaluable for start of day troubleshooting, mixing together a lot of different greenlight systems into a single snapshot with state transition history.

    basically, i don't really think you need a single holistic system to monitor everything you need, a lot of these things excel in their specific domain and the whole interrelated system may require many views into its health to monitor it effectively. i don't advocate looking for a one-size-fits-all system.

  96. How in the world do I get in touch with you? by camargobp · · Score: 1

    Hey, I don't know how I can get in touch with you, since our emails are protected on Slashdot. I think I left a comment on your blog, but it was the best guess I had.

    1. Re:How in the world do I get in touch with you? by Krneki · · Score: 1

      What do you need to tell me?

      --
      Love many, trust a few, do harm to none.
  97. SolarWinds by Anonymous Coward · · Score: 0

    I can write 100 pages of BS or I can just come out and say it.
    You want a solution that is easy and powerful, with information to help IT track down issues, with the plus that it is 100% scalable and costs nothing just to try.

    I recommend a group of programs made by SolarWinds.

    I am the SolarWinds admin for the company I work for. I have monitoring set up across 3 data centers in the Midwest, soon to be 4. Right now I am monitoring around 4000 items, which covers everything in the buildings: if it has an IP address, I am monitoring the device.

  98. Intermapper / Zenoss owns all by Anonymous Coward · · Score: 0

    Intermapper is a solid monitoring / map system, nice selection of probes, very good value too

    It only solves your real time needs, use Zenoss for the rest - trap management, cap planning etc

    To this day I still can't understand why the whole network monitoring domain has so many gaps and such low quality.
    Have you ever tried a Cisco product? utter shit ....

    I digress.....

  99. Things You better consider by javakev · · Score: 1

    Nowadays you have to consider a monitoring tool or tools that can do the following:

    1. Event correlation (don't page me or turn my entire dashboard red if I lose 2 servers in a 20-server load-balanced pool).
    2. Application mapping, dashboards, and portals.
    3. User experience monitoring mapped to hardware (eliminates finger pointing and shortens problem identification).
    4. Ability to publish reports (tailored to the person's skill level; the higher up the food chain, the less they will understand very technical graphs).
    5. Historical comparisons (we now have to justify clearly why we need the upgrade or the latest and greatest).
    6. Database management: on a large-scale monitoring solution, all this data can pile up quickly.

    What tool you use is not as important as what you do with the information to quickly resolve issues and provide data on the health of your infrastructure.
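
    The event correlation point (losing 2 servers out of a 20-server pool should not page anyone) can be sketched as a pool-level check instead of a per-host one. This is only an illustration: the function name and the 25% / 50% thresholds are assumptions picked for the example, not values from any particular product.

```python
def pool_alert_level(total, down, warn_fraction=0.25, crit_fraction=0.5):
    """Return an alert level for a load-balanced pool based on the share of members down.

    The thresholds are illustrative assumptions; tune them to the pool's real headroom.
    """
    if total <= 0:
        raise ValueError("pool must contain at least one server")
    if not 0 <= down <= total:
        raise ValueError("down must be between 0 and total")

    fraction_down = down / total
    if fraction_down >= crit_fraction:
        return "CRITICAL"
    if fraction_down >= warn_fraction:
        return "WARNING"
    return "OK"


print(pool_alert_level(20, 2))   # OK: 10% of the pool is down, nobody gets paged
print(pool_alert_level(20, 10))  # CRITICAL: half the pool is gone
```

    Individual host failures can still be logged or ticketed; the idea is only that paging is driven by the state of the service, not of each member.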

  100. My experience with MOM/SCOM, cacti and nagios by Anonymous Coward · · Score: 0

    I just went through the same exact thing you did.

    The shop I work in is mostly MOM and the new version, SCOM. For monitoring Windows stuff it seems to do well. However, when we threw things like VMWare, Linux, Solaris and pretty much any other OS at it, it had major disadvantages. Sure, there were 3rd-party plugins for VMWare (really expensive) and some for Linux (Quest, and talk about dirty and very unstable)... that's when I came in.

    I first tried Cacti because I had some experience using it. Cacti is nice if you want pretty graphs and host monitoring. Service monitoring, on the other hand, is a total pain and not good enough for a NOC (war room or whatever you want to call it). Yes, we tried Thold and it worked OK, but Thold assumes too many things and does not allow the NOC to acknowledge anything.

    Skip forward to 4 months ago, after being told we need something in place. I then looked at the other free solutions out there. Nagios was what I wanted, but man, setting up nagios with 300 hosts and around 1000 service checks is a beating. I then looked for nagios frontends and tried just about every one out there. Centreon came through with flying colors. Once you get used to it and understand how it works, it's great. Here are some things I have done so far with Nagios/Centreon:

    Alerts are sent out via email with comments attached to them so the NOC will know how to respond
    With ACL setup, I now have the screens in the noc grouped
    I have a graph for just about every service
    Our managers log in to nagios and run monthly reports showing service downtime statistics (sort of an SLA if you will)

    Nagios is very powerful; if you don't see a plugin for what you want, it's very easy to hack one together. If you can write a shell script to report useful information, you can then use NRPE to send that information to nagios at the interval you want (with graphs).

    I even went as far as converting Cacti scripts over to work with Nagios.

    Here are some of the interesting things we monitor:

    I/O statistics - I wrote a plugin that runs iostat 2 times for 30 seconds, writes the output to a text file and averages the numbers; nagios then grabs the data for very accurate I/O statistics (tps, r/sec, kBread/sec, srvtime, etc.).
    Oracle - This is where cacti lacked. I found a nagios script called check_oracle_health; we use it to check if Oracle is alive, and it can also run custom SQL. With the custom SQL I run whatever our DBA wants to watch.
    Tomcat - We connect using check_jmx to gather all sorts of stats.
    Swap - We watch pagein/pageout and alert after so many minutes here.
    We have one application that runs on a Linux machine and sometimes spins. I wrote a script that keeps count of CPU time. If CPU time is X, I restart the service automatically and then alert nagios that this happened. Nagios gets the message, then shortly afterwards checks whether the service came back online and alerts if it did not.
    VMWare - I monitor and have graphs for memory on ESX servers and their virtual machines. I can also gather stats at a cluster level.
    Many other custom shell scripts, perl scripts to do various things along with all the freebies that you can find on the net.
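
    For anyone who has not written a plugin before, the I/O statistics check above boils down to averaging a few samples and mapping the result to a plugin state. Below is a minimal sketch of that last step, not the poster's actual plugin: the function name, the "tps" label, the sample values and the thresholds are all made up, and it does not run iostat itself. What it does rely on are the standard Nagios plugin conventions: states 0 (OK), 1 (WARNING), 2 (CRITICAL), 3 (UNKNOWN), and performance data after a pipe in the form label=value;warn;crit.

```python
from statistics import mean

# Standard Nagios plugin states, normally returned as the process exit code.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(samples, warn, crit, label="tps"):
    """Average the samples and return (state, one-line plugin output with perfdata)."""
    if not samples:
        return UNKNOWN, f"{label.upper()} UNKNOWN - no samples collected"

    avg = mean(samples)
    if avg >= crit:
        status = CRITICAL
    elif avg >= warn:
        status = WARNING
    else:
        status = OK

    text = f"{label.upper()} {STATUS_NAMES[status]} - average {label}={avg:.1f}"
    perfdata = f"{label}={avg:.1f};{warn};{crit}"
    return status, f"{text} | {perfdata}"


# Two made-up samples, standing in for two parsed iostat runs.
status, message = evaluate([110.0, 100.0], warn=150, crit=300)
print(message)
# A real plugin would finish with sys.exit(status) so that Nagios sees the state.
```

    The perfdata part after the pipe is what graphing add-ons pick up, which is how a check like this ends up with a graph for free.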

    A lot of people discount nagios and I think the majority of those that do get either frustrated or truly don't understand the power of it.

    Another great thing about our setup is that we now have 2 pollers. One machine runs the Mysql db, Centreon and Nagios. This machine does about 1/4th of the polling. We have another machine that does 3/4's of the polling and reports back to the other.

    Also don't be afraid to mix things in NRPE checks to get the results that you want. I found one script that was almost perfect for us, but it reported too much data. So I ran this script's output through sed to strip out the information I did not want. It works like a charm with no noticeable overhead. Again, if you are somewhat versed in shell scripting you can use that knowledge to do all sorts of stuff. Nagios is pretty much what you want it to be if you get past that learning curve.

  101. BixData. by RobiOne · · Score: 1

    If you haven't evaluated BixData (http://bixdata.com/) yet, you're missing out. No nag/reg required, free use for less than 30 hosts.

    Does not include kitchen sink. Only the next generation advanced monitoring system that can handle physical and virtual as well as the hypervisors! VMware friendly.

    On their science page http://www.bixdata.com/science, they say Bix is Borg. And they're not kidding.

    "BixData is profoundly different. It took science fiction to provide the metaphor. Bix is Borg - 'an inter-connected collective' (self-organizing p2p) that 'assimilates' new life forms (cross-platform virtual machine), functions with a single hive-mind (n-cube datastore) and adapts through self-learning (cybernetic feedback loop) - all in pursuit of perfection. Resistance is Futile."

    Just... Wow.

    --
    -- Robi
  102. MikroTik's The Dude by Guardn · · Score: 1

    I use it at home, and used it at my previous job at a big datacenter with several hundred monitored devices. It worked very well and the map is superb. I haven't seen anything comparable. Especially the realtime traffic status of network links (which fade to red when overloaded) is a great diagnostic tool.

  103. This would never fly by Anonymous Coward · · Score: 0

    If you are in a heterogeneous environment, this is a non-starter.

    Great job security and resume enhancement, though!

  104. The one thing most linux-based tools lack.... by Anonymous Coward · · Score: 0

    Is the ability to monitor windows perfmon stats, even on tools that have been ported over to windows (looking at you, cacti). Seriously, most of the more useful sql server stats are reported only through perfmon, and snmp won't capture that data. Sure, sql server tracks that internally, but if you have a lot of servers, then you need external monitoring. There's such a wealth of information available through perfmon, isn't there any OSS tool that will monitor that stuff?

  105. Big Brother and Project Observer by charnov · · Score: 1

    Big Brother (or Sister) uses push agents, so you are not generating vast amounts of SNMP polls, and you get instant feedback on a stupid-simple dashboard.
    http://www.bb4.org/

    Project Observer is super easy to set up for SNMP and can auto-discover Cisco gear (with CDP). A good, simple SNMP monitor but it has serious scaling limitations.
    http://www.observernms.org/

    Nagios for hard core up/down monitoring with good flap detection and Cacti for performance monitoring.

    OCS Inventory for push software distribution and inventory control.

    Or you could drop some serious cash and just get Unicenter TNG and go bald from ripping your hair out.

    Seriously, though, try a bunch of things and see what actually works for your team.

    --
    [RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
  106. Osmius by joselu · · Score: 1

    Give Osmius a try.

    It is fast and scalable, and it integrates the business view in the tool, as well as a real GIS (you can use it or not), SLA management, and BI and data mining views. Everything related to monitoring is made in C++ using the multiplatform near-real-time framework ACE, so it's really fast.
    The Web console is based on Tomcat (J2EE) and you can manage, deploy, configure or update agents from there.
    You can develop new events and new agents to monitor "whatever" you want: stock shares, temperatures, web transactions,... it's up to you. And Osmius is really open: there's no open core and no closed features, and you can access the documentation, the analysis info, the data model material, and (of course) the code.
    We are now (I'm on the development team) doing the final tests and dealing with the latest bugs, and also testing scalability and the behaviour under the stress of receiving millions of events every day, and thousands per second. It's working properly.
    The D day is July the 30th. Every comment and suggestion, and even hard criticism, is very welcome.
    We want Osmius to be:
    • Helpful to technical and business staff.
    • Easy to understand.
    • Robust.

    Let's see if we made it.

  107. Odd Request. by Narcogen · · Score: 1

    Well, I suppose it depends.

    How many large scales are you planning on monitoring?

  108. So many to choose - plan beforehand by cheros · · Score: 1

    I think you have already gathered from the large number of responses that the problem has been "over solved" - there are many options to choose from. We provide a rip-n-replace service for HP Openview users (banks and trading exchanges), which, including any coding required and 24/7 support, comes out at about 20% of those annual costs in year 1 and below 10% in subsequent years, but there's no point in telling you what Open Source product we use - you need to do your homework so you arrive at an answer that you and your boss understand yourselves.

    You will probably find a number of answers to your criteria - TRY THEM. Give the ones that seem viable in terms of support, community, code quality and your own ability to make it work for your company a good try - most you can even do in parallel. Only after a live test can you decide what you're going with, because you will be investing time in tuning it for your own needs - this is not the time you want to waste. Good preparation is 80% of the work in monitoring; otherwise you will spend your time monitoring the monitoring system instead.

    Good luck :-)

    --
    Insert .sig here. Send no money now. Owner may sue, contents will settle. Batteries not included.
    1. Re:So many to choose - plan beforehand by Krneki · · Score: 1

      I tried about 20 of them. The problem is you need 2-4 days to configure each one, just to find out it doesn't provide that little detail you must have. :(

      --
      Love many, trust a few, do harm to none.
  109. Re:Was in this skin before, made my choice. Sharin by Krneki · · Score: 1

    Thanks for the info.

    --
    Love many, trust a few, do harm to none.
  110. Cue biased post by Frizzon · · Score: 1

    Both Yacoob and Afabbro have some great lists above (especially Afabbro's list!). A combination of these features would be an ultimate system.

    I work for a company called iQuate - I have been (sometimes literally) developing a monitoring system for about 7 years. We do many of the things mentioned in the 2 lists, but not all (I wish!). We have a product called iQRMS which integrates several functions, the largest of which is monitoring.

    - It is agentless - it uses about 30 different protocols (including SNMP obviously!) to connect to remote machines, so it can be deployed very quickly and gives a pretty "true" picture of client connectivity (which sometimes an agent based approach will not).
    - It is horizontally scalable (you can have many scanning services on many computers and they will load balance between them).
    - It has failover built in - when 1 or more of the scanning services die, the others redistribute the load.
    - It has intelligent aggregation of data, recording max, min and average values for any monitor over time - for up to 6 years - in such a way that it doesn't just eat disk and kill performance (that one took a while to crack...)
    - It has pretty graphs and in-depth reports on events
    - It supports complex (or simple!) escalation rules to control who gets told about what, when and how often when events happen
    - It integrates with a helpdesk (its own or others)
    - It allows you to create templates of monitors using different protocols to get a wider picture of an issue
    - It is easy to understand and designed with 24x7 operations in mind (hence all that failover/scalability)
    - It doesn't cost the earth
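
    To illustrate the aggregation bullet above (keeping max, min and average per period instead of every raw sample), here is a minimal sketch of a rollup. It is not how iQRMS does it; the function name, the bucket size and the sample data are assumptions chosen for the example.

```python
from collections import defaultdict


def rollup(samples, bucket_seconds):
    """Collapse (unix_timestamp, value) samples into per-bucket (min, max, avg) tuples.

    Returns a dict keyed by the start timestamp of each bucket.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)

    return {
        start: (min(values), max(values), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    }


# Three made-up readings rolled up into hourly buckets.
raw = [(0, 1.0), (60, 3.0), (3600, 10.0)]
hourly = rollup(raw, bucket_seconds=3600)
print(hourly)  # {0: (1.0, 3.0, 2.0), 3600: (10.0, 10.0, 10.0)}
```

    Storing three numbers per bucket instead of every sample is what keeps multi-year history from eating the disk; the trade-off is that the exact shape of short spikes inside a bucket is lost, although the max still records that they happened.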

    It also doesn't do some of (1 of) the things Timothy mentions at the start of the post (gratz on the new job btw!) - specifically it doesn't create a 2D map of the environment, although there are some plans to implement that in future. It treats and represents devices in the network as groups of hosts - it doesn't display them in relation to physical layout...

    Maybe it's worth having a look at it Tim, I can certainly vouch for the support being excellent (but like I say above - I'm biased :))

    JK

  111. But then again, I'm biased by Frizzon · · Score: 1

    Both Yacoob and Afabbro have some great lists above (especially Afabbro's list!). A combination of these features would be an ultimate system.

    I work for a company called iQuate - I am the CTO and have been (sometimes literally) developing a monitoring system for about 7 years. We do many of the things mentioned in the 2 lists, but not all (I wish!). We have a product called iQRMS which integrates several functions, the largest of which is monitoring.

    - It is agentless - it uses about 30 different protocols (including SNMP obviously!) to connect to remote machines, so it can be deployed very quickly and gives a pretty "true" picture of client connectivity (which sometimes an agent based approach will not).
    - It is horizontally scalable (you can have many scanning services on many computers and they will load balance between them).
    - It has failover built in - when 1 or more of the scanning services die, the others redistribute the load.
    - It has intelligent aggregation of data, recording max, min and average values for any monitor over time - for up to 6 years - in such a way that it doesn't just eat disk and kill performance (that one took a while to crack...)
    - It has pretty graphs and in-depth reports on events
    - It supports complex (or simple!) escalation rules to control who gets told about what, when and how often when events happen
    - It integrates with a helpdesk (its own or others)
    - It allows you to create templates of monitors using different protocols to get a wider picture of an issue
    - It is easy to understand and designed with 24x7 operations in mind (hence all that failover/scalability)
    - It doesn't cost the earth

    It also doesn't do some of (1 of) the things Timothy mentions at the start of the post (gratz on the new job btw!) - specifically it doesn't create a 2D map of the environment, although there are some plans to implement that in future. It treats and represents devices in the network as groups of hosts - it doesn't display them in relation to physical layout...

    Maybe it's worth having a look at it Tim, I can certainly vouch for the support being excellent (but like I say above - I'm biased :))

    JK

  112. Pandora FMS can monitor anything by villa · · Score: 1

    Do you know Pandora FMS? Pandora Flexible Monitoring System is a general-purpose monitoring tool. It was born in 2002 in the IT department of an international finance corporation. The ultimate goal of Pandora FMS is to be an adaptable platform for any organization, able to collect events of any type, generate alarms through a metric system, and represent the collected events in graphs, reports or maps.

    Pandora FMS can detect a network interface going down, a defacement of your website, a memory leak in one of your server applications, a delay in your website when the customer pays, or the movement of any value on the NASDAQ new technology market. Pandora FMS can show you the state of your servers, systems, applications, communications, or the sales level of your commercial team.

    Pandora FMS is extremely modular and decentralized. The most important component, where everything is stored, is the database (right now only MySQL is supported). Every single component of Pandora FMS can be replicated and work under a pure HA system (Active/Passive) or under a clustered system (Active/Active with balanced load).

    Pandora FMS can gather information locally with software or hardware agents:

    - Pandora FMS has specific agent software that runs on any operating system (GNU/Linux, AIX, Solaris, HP-UX, BSD/IPSO, and Windows 2000, XP and 2003), gathering data and sending it to a Data server.
    - Pandora FMS has a specific hardware agent to which any sensor can be connected. Using it, it is possible to monitor temperature, light, movement, smoke, ...

    Pandora FMS can also gather information remotely, without installing software or hardware agents:

    - With the Network Server, Pandora FMS can monitor any kind of service or port via TCP query, any device via SNMP, and any communication latency or state via ICMP.
    - With the Plugin Server, Pandora FMS can monitor any kind of system with complex code. It is compatible with Nagios plugins.
    - With the WMI Server, Pandora FMS can monitor any Windows machine via WMI.
    - With the Web Server, Pandora FMS can monitor web applications via complex checks. There are two kinds of webchecks: a check for response time and a check for availability. Of course, webchecks do not just make a simple HTTP request to say whether it works or not; they can perform fully complex web operations, like logging in, choosing a parameter from a menu, entering text into a form, expecting a specific response at each step, and making sure that all programmed steps are done correctly before saying "web application response is OK".
    - With the Prediction Server, Pandora FMS can detect trends. It implements a statistical data forecast based on past data (up to almost 30 days, in four temporal references).
    - With the SNMP Console, Pandora can monitor any device via SNMP traps.

    You can visit pandorafms.com. Bye

    1. Re:Pandora FMS can monitor anything by techwrench · · Score: 1

      Have to second this. PandoraFMS is great at monitoring. I am not a big fan of installable sensors for clients, but Pandora's work pretty well.
      I use Open-Audit for detailed views of my monitored devices. Each Windows and Linux box needs a script run on it to enable it to be monitored, but especially on Windows boxes, a wealth of information is to be had.

      --
      It's You and I against the World... When do we attack?
  113. Druid by Anonymous Coward · · Score: 0

    If you already know Nagios, do not reinvent the wheel: use it together with the NagVis and Business Process View plugins. You will get nice maps and easy correlation from the technology point of view to the business one.

  114. OpsView gets my vote! by PaladinDude · · Score: 1

    I was a long time Nagios user but the manual config changes and management of it was just getting too much, I've recently switched to running a clustered OpsView setup monitoring 2 geographically separated sites and around 1000 devices. It "just works". It is easy to configure and manage, the data warehousing/searching/reporting feature is great, graphing is excellent, dashboard and nagvis elements let you present data nicely and there's even a scheduled reporting tool to email the management a PDF full of pretty graphs every month. Because it's built on Nagios there's a plugin for monitoring just about anything you can think of and it's free! I don't work for OpsView, I'm just a fan!

  115. What I'd like in a monitoring system by Anonymous Coward · · Score: 0

    Well, firstly, I'd like it to work. That is all.

  116. A way to defeat it, you insensitive prying clod. by EWAdams · · Score: 1

    You openly admit that you monitor thousands of people's PC's without their consent and probably without their knowledge? I would be ashamed. Collected any good blackmail material yet?

    --
    I piss off bigots.
  117. Give customers your personal phone number by shish · · Score: 1

    Most monitoring systems only check each server once every 5 minutes, giving an average of 2.5 minutes between error and alert; with a customer, you'll be out of bed at 3 in the morning in a matter of seconds :-)
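
    The 2.5 minute figure is just half the polling interval: if a failure is equally likely at any moment between two polls, the expected wait until the next poll is interval / 2 and the worst case is the full interval. A tiny sketch of that arithmetic, with a made-up helper name and ignoring check run time and notification latency:

```python
def detection_delay(poll_interval_s):
    """Return (average, worst-case) seconds between a failure and the next poll.

    Assumes the failure time is uniformly distributed within the polling interval.
    """
    return poll_interval_s / 2, poll_interval_s


print(detection_delay(300))  # (150.0, 300): 2.5 minutes on average, 5 at worst
```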

    --
    I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
    1. Re:Give customers your personal phone number by Krneki · · Score: 1

      The goal is to receive an SMS alert before the client notices a problem.

      --
      Love many, trust a few, do harm to none.
  118. You need a Mono based monitoring application by Anonymous Coward · · Score: 0

    Don't know why. No doubt someone will be able to explain why it's essential.

  119. Nagios + PNP + NagVis and Cacti by mnslinky · · Score: 1

    I've found Nagios and NagVis a solid solution. NagVis is a plugin/addon for Nagios which allows you to create a Heads-Up display with status information on your own network diagram. It has an interactive map, which is 100% customizable. When kept in a browser window, it will play a sound during an event and flash the icon for a host indicating the problem.

    PNP adds graphing of performance data to Nagios. It allows you to click through the nagios interface directly to the graphs for a given host or process. It will graph anything that has performance data output.

    Finally, Cacti is a great solution for things which you may not roll into your Nagios installation. We use it for monitoring network bandwidth utilization, mostly.

  120. Zabbix is ok here, and recommended by Macka · · Score: 1

    [ this a general comment for anyone interested in Zabbix ]

    I'm currently working on a project in a hospital. We chose Zabbix (1.6.4), and a colleague and I have set it up. It's currently monitoring 169 hosts (a mixture of Windows, Tru64 Unix, Linux and UPSes) with 7,108 items monitored and graphed and 1,704 trigger alarms. And this is really small potatoes compared to what Zabbix can handle. We've not had any issues with the Zabbix server processes. It runs just fine.

    What I really like the most is its graphing capabilities: the ability to zoom in on a section of a graph by just dragging and selecting the time interval you want. This also works with "Favourite Screens" containing multiple graphs, where selecting a time window on one graph automatically zooms the rest of the graphs on that screen at the same time.

    There's an excellent Firefox plugin too that gives you a summary icon at the bottom of the browser, a single bar to display the most recent event, and clicking on the icon slides a tabbed pop-up frame into view that contains all the information you'd expect to see on the Dashboard.

    Like any monitoring tool of this size though, getting all the data in for the systems you want to monitor is time consuming. My advice would be to concentrate first on getting a few systems just how you want them, then work out whether it's faster for you to clone them and modify the system-specific information, or to export the XML for those systems, use those as templates for other systems/hosts of the same type and work with XML import.

    Zabbix gets a big thumbs up from me.

  121. TRAFip and SLAview by Anonymous Coward · · Score: 0

    These tools are very powerful for network monitoring. They can monitor large networks, with NetFlow traffic analysis (TRAFip).
    The SLAview system is used for monitoring and managing the performance of key elements that are critical to the proper functioning of telecommunication networks.
    There is a very interesting 2D map, with weather map included.
    Site: www.telcomanager.com

    enjoy

  122. Daniel Negri - BANCOOB - BRAZIL by Anonymous Coward · · Score: 0

    Here, I developed a solution that integrates Nagios and a proprietary solution -> Java EE (EJB and message server) -> Adobe Flex (client interface). Nagios sends information to the Java server, and the server pushes it in real time to the Flex clients, like a broadcast chat.

    I work at the third-largest bank in Brazil, and we have a lot of work to do. That's great!

    Thanks,
    _______________________________
    DANIEL GOMES NEGRI
    (Analista de Sistemas)
    Consultor em Arquitetura RIA - Rich Internet Application
    Certified ScrumMaster

    daniel.negri@hotmail.com
    (62) 9218-7315 / (62) 8135-8339
    GECAN - Gerência de Canais de Atendimento
    BANCOOB - Banco Cooperativo do Brasil S/A

  123. If you have Linux servers, don't use OpManager! by scarolan · · Score: 2, Informative

    We used OpManager in production for over a year. It has terrible Linux support. None of their built-in plugins worked properly for monitoring even basic parameters like disk space, free memory, CPU usage, etc. When we pointed this out to their support people, they said we should build our own plugins with SNMP OIDs. Um....no. Not for the amount of money we paid for that steaming POS. We finally kicked OpManager to the curb about a month ago, and now have our entire environment, Windows and Linux servers alike, monitored with Nagios. Nagios scales well; we are currently watching several hundred hosts and about 3500 services.

    OpenNMS is also a good tool, its ability to map servers back to switch ports is extremely handy.

  124. Castlerock SNMPc by smackmywhammy · · Score: 1

    Unfortunately (or not), Windoze based. My experience with it started out unstable and feature poor at version 4, but it kept the relatively inexpensive (core, support, and add-on) price tags, and features kept getting better, and stability continues to improve at version 7.1. Remote windows and java consoles, remote pollers, SNMPv3, easy custom MIB compiles, functional dependencies, device grouping, custom alarms, restricted console views, packaged third party paging and email, custom tool integration, easy maps, acceptable (to me) TCP service monitoring and third party script support. Reporting is also integrated, or use the up-featured SQL add-on. I'm using it for just shy of a couple thousand devices on a single modest server. It's been able to accommodate every NMS feature I need, and a great many wants. My only real gripes are: console authentication still doesn't have a RADIUS, LDAP, or AD hook, and I'd like a Linux port for the backend. Other than that, it's shamefully simple to get new staff up and running, and it requires very little care and feeding. Good luck with your search.

  125. Have you looked at these? by Anarke_Incarnate · · Score: 1

    HypericHQ would be good
    PandoraFMS is another option.
    ZenOSS and Zabbix are popular too

  126. Monitoring Grids by allenw · · Score: 1

    The biggest problem we've seen from a monitoring perspective is that most systems really do have a hard time scaling to large levels while staying usable. [A common trick (and one we employ) is to have a multi-tier monitoring system in place, where one monitoring stack monitors the monitoring stack that is actually watching the service/hosts.]

    Once one gets past that hurdle, the tricky part is dealing with the "it is OK if X% of my machines are down". Most monitoring systems that I've dealt with are based around the view that they are monitoring a single host/single service and not a collection of hosts where it is OK if chunks disappear. For those types of problems, one still ends up writing a lot of custom smarts it seems.
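
    The "it is OK if X% of my machines are down" logic can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the `GroupStatus` type and the 10%/25% thresholds are hypothetical, and the per-host up/down results are assumed to come from whatever checks you already run.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GroupStatus:
    total: int
    down: int
    down_pct: float
    state: str  # "OK", "WARNING" or "CRITICAL"


def evaluate_group(host_up: Dict[str, bool],
                   warn_pct: float = 10.0,
                   crit_pct: float = 25.0) -> GroupStatus:
    """Collapse per-host up/down results into one status for the whole group."""
    total = len(host_up)
    if total == 0:
        # An empty group cannot be judged healthy; treat it as a failure.
        return GroupStatus(0, 0, 0.0, "CRITICAL")
    down = sum(1 for up in host_up.values() if not up)
    down_pct = 100.0 * down / total
    if down_pct >= crit_pct:
        state = "CRITICAL"
    elif down_pct >= warn_pct:
        state = "WARNING"
    else:
        state = "OK"
    return GroupStatus(total, down, down_pct, state)
```

    Calling `evaluate_group({"node01": True, "node02": False, ...})` once per collection gives a single alertable state for the group, so losing a handful of nodes need not page anyone.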

  127. Oracle Grid Control by johnnyR · · Score: 0

    I work for a division of a very large corp and we just finished rolling out Oracle Grid Control. It blew away all the other tools we tested. It has agents for everything: SQL Server, Cisco, NetApp, EMC, all *nix... everything. Highly recommend it.

    --
    The gun is good - Zardoz
  128. Hobbit / Xymon with Devmon by gudmo · · Score: 2, Informative

    The solution is real simple. If you can program in anything then Hobbit/Xymon with Devmon is your only choice.
    Create your own Weather Map for 2D, you never need a full 2D map of 5000 hosts... Less is more.

    1. Free
    2. Fully customizable
    3. Easy administration
    4. Offers clients for all the major OS (And quite a few minor ones)
    5. Large support base (Users with high technical level)
    6. Nice author (Replies to comments and considers all ideas)
    7. You can write a test for anything you can think of and easily add it into hobbit
    8. Offers client/server monitoring, remote monitoring, script monitoring, snmp monitoring(devmon) or scripts

    The possibilities with Hobbit are endless

    Personally I use Hobbit to monitor over 2400 devices, including Cisco hardware, AIX, Windows Servers, VMware Clusters, Exchange, Sharepoint etc.etc.etc.etc.
    I've never encountered a system I could not monitor with Hobbit (Or scripts that send their results into hobbit).

  129. I Like SolarWinds by aynov · · Score: 1

    For the price, I am a big fan of SolarWinds Network Performance Monitor (NPM) and ipMonitor. Together, these give me the ability to track and monitor as many devices as I want. They both have network discovery, and can monitor network devices, servers, workstations, and applications. I have not tried the Application Monitor add-on for NPM as a replacement for ipMonitor, but it looks like it would work very well. I have written several C# scripts to augment ipMonitor for the custom applications I need to monitor. The only downside I can see for you is that these are Windows-based products, and NPM requires Microsoft SQL 2005. I love the map capabilities in NPM and the graphs it makes. I am able to alert on any OID I collect data on, which is a plus. Also, NPM has thousands of MIBs already installed, which makes finding an OID much easier. Best of all, NPM supports OID tables, which makes my monitoring very dynamic. I have, for example, created an alert on disk partitions getting more than 95% full. I do not have to worry about making a monitor for each possible partition, or even worry about how many partitions are on a server. I just monitor the OID table. As long as I have set up SNMP properly (of course) I see all the partitions I care about. ipMonitor I use mostly for application monitoring. In this respect, it is very nice, since I can execute custom scripts. For your scenario, I would seriously look into NPM. This is a very easy product to learn, very powerful, and can be fairly cheap to implement.
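
    To illustrate why alerting on a whole table is convenient, here is a rough sketch of the idea. It is not how NPM implements it. The `StorageRow` type and `full_partitions` function are hypothetical, the rows are assumed to have been polled already, and the fields are loosely modelled on the description/size/used columns of the standard host resources storage table.

```python
from typing import Iterable, List, NamedTuple


class StorageRow(NamedTuple):
    descr: str  # e.g. mount point or drive letter
    size: int   # total allocation units
    used: int   # used allocation units


def full_partitions(rows: Iterable[StorageRow],
                    threshold_pct: float = 95.0) -> List[str]:
    """Return one alert message per table row above the threshold."""
    alerts = []
    for row in rows:
        if row.size <= 0:
            # Skip rows that report no capacity (some agents list such entries).
            continue
        pct = 100.0 * row.used / row.size
        if pct > threshold_pct:
            alerts.append(f"{row.descr} is {pct:.1f}% full")
    return alerts
```

    Because the function walks whatever rows the table returns, a partition added to a server later is covered without defining a new monitor.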

  130. Monitoring System... by Anonymous Coward · · Score: 0

    The answer to the tool lies in what you are attempting to monitor and how the data is to be used to address the problem(s) that you are trying to solve....

    I don't see any requirements here - probably you have them, but not listed...
    I don't see any expectations of what is to be delivered - you probably have them but not listed either....

    I would suggest that having the requirements and expectations will help you create the solution - and make the best tool rise to the top....

  131. Easy? Well, here's easy: by Anonymous Coward · · Score: 0

    I write that sort of stuff in awk. Hire me.

  132. Try Kaseya by Anonymous Coward · · Score: 0

    Have you looked at Kaseya?

    www.kaseya.com

  133. 5K instruments in the orchestra: U Need a Maestro by Luc+Dijon · · Score: 1

    In an orchestra, you need different instruments, musicians and a Maestro who knows the whole score and can perform it. Nagios does not play like Cacti, MOM, Quest, etc., even if they play in the same sandbox. What I mean is that in your environment you may already have monitoring solutions specific to each system management discipline: databases, servers, network devices and so on. They all have their own user, admin and configuration consoles, and maybe their own agents too. You need to integrate these monitoring solutions (at least the most important, valuable and easy ones). If you drop all of them and go for a single one from scratch, it will sound like a ring tone in an airport hall at 10:00 AM. You need a chief, a manager, a Maestro: an Event Manager (centralized, or distributed if needed). The role of this Event Manager is to take the most valuable information from these monitoring tools and to consolidate all the alerts using a single syntax and semantics. Consolidating the alerts reported by the monitoring tools (enrichment, correlation, filtering, reaction) is like rehearsing again and again with the orchestra before the day of the concert. This Event Management Solution will be the Maestro. All the musicians (sysadmins) will recognize their part and their instrument (their monitoring tool) when the music plays => Investigate an Enterprise Event Management Solution OVER your current monitoring instruments. Look for the most flexible one (easy to integrate, easy to script), the one able to speak through multiple and SIMPLE protocols, the one that will serve you the root cause of all your troubles on a golden plate when the poultry starts making noise. There are some of them on the market (open source world included). Music, Maestro!
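
    A toy sketch of what "a single syntax & semantics" plus filtering could look like. Everything here is hypothetical: the two input formats (`nagios_like`, `snmp_trap_like`), the field names and the severity scale are invented for illustration and do not describe any particular event manager.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    source: str
    host: str
    check: str
    severity: int  # 0 = ok, 1 = warning, 2 = critical


def normalize(source: str, raw: Dict[str, str]) -> Event:
    """Translate a tool-specific alert into the common Event shape."""
    if source == "nagios_like":
        levels = {"OK": 0, "WARNING": 1, "UNKNOWN": 1, "CRITICAL": 2}
        return Event(source, raw["host_name"].lower(), raw["service"],
                     levels[raw["state"]])
    if source == "snmp_trap_like":
        severity = 2 if raw["level"] == "major" else 1
        return Event(source, raw["agent"].lower(), raw["trap"], severity)
    raise ValueError(f"unknown source: {source}")


def consolidate(events: Iterable[Event]) -> List[Event]:
    """Drop OK events and keep only the worst event per (host, check)."""
    worst: Dict[Tuple[str, str], Event] = {}
    for ev in events:
        if ev.severity == 0:
            continue
        key = (ev.host, ev.check)
        if key not in worst or ev.severity > worst[key].severity:
            worst[key] = ev
    # Most severe first, then a stable alphabetical order.
    return sorted(worst.values(), key=lambda e: (-e.severity, e.host, e.check))
```

    Each existing tool keeps doing its own job; only a thin adapter per tool is needed to feed the consolidated view.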

  134. ClearSite NMS was a good start by richrumble · · Score: 1

    We have similar goals with our project Clearsite.sourceforge.net. We've learned our lessons and think we can begin taking on the likes of SolarWinds, OSSIM, ZenOss, SpiceWorks etc... We made the mistake of being too geared toward one vendor (Cisco), but no longer. We're making the software work for us; we're not working for the software. We're creating a Snort interface that highlights the portion of the packet that trips the content rule, lets you note FPs, lets you highlight the portion of the packet that's a FP, and adds it to the rule once you click submit. Say some user-agent rule goes off, but it's your own app: highlight the user-agent your app uses, click submit, and content:!"user-agent: xyz"; gets added to a display filter and/or the actual sig itself. Or a Snort rule is triggered for BitTorrent being used: a cron job connects via WMI, SNMP or SSH to the host, effectively runs a netstat -abn and figures out the process and location of the executable that triggered the rule. Failing to get such a result back might further point to a FP or to a machine not under your control. If there is no contact, check the MAC address DB to see if it's one of yours; if not, snmp set fa0/22 disable. Proactive. Naturally there are more checks and balances in there, but that's where we're heading with just the Snort portion. Again, making the software work for us. As always we'll use our very popular ajax search for everything we can. http://clearsite.blogspot.com/search?updated-min=2007-01-01T00%3A00%3A00-08%3A00&updated-max=2008-01-01T00%3A00%3A00-08%3A00&max-results=3 -rich (google: xinn.org contact)

  135. Referential integrity by Grincho · · Score: 1

    You bring up an excellent point: ZODB doesn't do any referential or data-type integrity checking; it's pretty much just a dumb (though rather concurrent and durable) graph store. Thus, ZODB-using apps have to take care of data integrity themselves or else interpose another layer (which you'd want to do in a "shared" situation like you mention).

    I guess that's the tradeoff ZODB makes: really fast and agile development (no schemas to maintain, etc.) in exchange for no particular constraint enforcement. In practice, the latter is mitigated (and lots of painful debugging saved) through use of constraint-enforcement frameworks like Archetypes, but that still makes me queasy in a multi-app situation, as you'd have to make sure everybody uses the framework.

    Personally, I'm both a ZODB and a Postgres wonk. What I'd love to see is the best of both worlds: a language-agnostic graph DB with internal constraint enforcement and, as my pony, a declarative ad hoc querying language. :-)

  136. Just my $0.02. by wr37chd00d · · Score: 2, Informative

    We are using a combination of Cacti and mon to monitor about 200 devices, both network gear and PC servers. Cacti is used to graph performance data(bandwidth, cpu, mem, temp) and maps for the visually inclined, while mon is used to do the actual service monitoring and alerting.

    I won't comment on Cacti, since it has been mentioned here already, though I will say that you CAN change the default behavior of "sample averaging" by increasing the size of the RRD database. There are discussions on the Cacti forum/wiki that cover this topic.

    Mon on the other hand, I didn't see mentioned at all, so here's my blurb on that. The core of mon is a scheduler written in perl, which handles running monitor tests (also perl, or any script/program that can exit with a 1/0) and then alerting (also perl, or other languages, and can do more than just sending mail or paging) when necessary, based on the configuration for that service. Like most open source projects, it is extremely flexible if you can make the initial time investment to set up your tests and dependencies correctly; once this is done, the tests/alerts can be reused or further modified. There are quite a few monitor tests and alert scripts already included, along with some handy tools for interaction through a web browser (via moncgi), generating dependency trees, generating reports, and more. There's also a perl module, Mon::Client, that provides an API for interacting with the mon scheduler. The downside, besides configuring it with a text file (m4 can be helpful here), is that there hasn't been any activity since 2007 (according to the CVS repo on sourceforge).
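
    As a rough sketch of a monitor test in the "any script/program that can exit with a 1/0" style, here is a TCP port check in Python. Only the exit-status convention (0 for success, non-zero for failure) comes from the description above; the argument layout (port first, then hosts), the function names and the summary line of failed hosts are assumptions for illustration, not taken from mon's documentation.

```python
import socket
from typing import Callable, List, Sequence


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_hosts(hosts: Sequence[str], port: int,
                 probe: Callable[[str, int], bool] = tcp_probe) -> List[str]:
    """Return the hosts for which the probe failed, in input order."""
    return [h for h in hosts if not probe(h, port)]


def main(argv: Sequence[str]) -> int:
    """Return the exit status: 0 if every host answered, 1 otherwise."""
    if len(argv) < 2:
        print("usage: tcp_check.py PORT HOST [HOST ...]")
        return 1
    port, hosts = int(argv[0]), list(argv[1:])
    failed = failed_hosts(hosts, port)
    if failed:
        # One summary line naming the failed hosts.
        print(" ".join(failed))
        return 1
    return 0
```

    To use it as a standalone script, end the file with `raise SystemExit(main(sys.argv[1:]))` after importing `sys`. The probe is passed in as a parameter so the logic can be exercised without touching the network.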

    Probably not the solution for an extremely large number of hosts, though resource-wise, it could handle it, but maybe someone else might be able to benefit from it. If you need very specific tests(number of BGP routes, verifying NH on routes, customer redundancy) and smart alert logic, it's worth looking at.

  137. DRDs by WizADSL · · Score: 1

    I want DRDs.....

  138. Use a monitoring framework by Anonymous Coward · · Score: 1, Informative

    I would suggest you give GroundWork a go. It is an amalgamation of the best open source monitoring tools previously mentioned in these comments, such as Nagios and Cacti, but they are fully integrated into one interface, which reduces complexity.

    GroundWork Open Source uniquely combines the most mature and successful open source projects available today into a single package. These amalgamated projects have been downloaded more than 4 million times over the last decade and have a strong codebase and a strong community behind them.

    Combining these projects into a single package that is commercially supported gives you a simplified deployment experience, a single console for managing and monitoring, and a comprehensive view of your IT operations efficiency.

    Other GroundWork Monitor advantages include open APIs and an open event manager so the information collected by GroundWork Monitor can be bi-directionally shared with your other ITSM systems such as asset tracker, ticketing system or a CMDB.

    But the best part about GroundWork Monitor is the low cost for the enterprise IT management system and its fast return on investment and value.

    GroundWork amalgamates and supports established, mature projects including:

            * Nagios® - for event handling and notification
            * SNMP and SNMP-TT - for network management protocol
            * RRDtool - for underlying data collection and management
            * BIRT - for ad-hoc reporting
            * Ganglia - for grid and cluster monitoring
            * Cacti - for graphing and trending

    Each of these projects and more are included with each GroundWork Monitor edition tier. GroundWork Monitor has three editions: Enterprise, Professional and Community Edition. All editions are based off of the same codebase so trade-up is easy.

  139. Try NETMON by Anonymous Coward · · Score: 1, Informative

    It's based on Debian, you are able to run scripts as an action, can page, email, has a Postgresql backend (Very easy to backup). It can do SNMP v1-3, Syslog, Traps, Portmonitoring graphs, monitor CIFS/NFS volumes, Linux/Windows services, etc

    http://www.netmon.ca

  140. Depends on your needs by cryogaze · · Score: 1

    I recently went through the process of evaluating new monitoring software for my company as well, and which product you go with really depends on what your needs are. I have also worked with many of the products that you noted, and they all have their strengths and weaknesses. If you aren't familiar with it, OpenNMS really is a great product. I have used it for years, have attended the training offered by the OpenNMS Group, and also had the commercial support for it, which was fantastic. Tarus and the rest of the guys with the OpenNMS Group are all fantastic to work with, and the community support is awesome as well. If you are used to working with open source products and are familiar with RRD (which you likely are from using Cacti) then OpenNMS is absolutely worth a try. If you have some budget to work with and you want a more commercial solution that is MS Windows based, I would suggest you take a look at SolarWinds Orion. Orion has a great monitoring solution for a good price. SolarWinds also offers several modules for Orion for things such as high level application monitoring, VoIP monitoring, network device configuration management, IP address space management, etc. Like I said, it all really comes down to your needs and where your personal comforts are.

  141. Loose coupling is key by James+Youngman · · Score: 1

    Totally separate the data collection from the user interface. Keep both of those totally separate from the system that selects and delivers the alerts. Make sure the system as a whole won't make the problem worse (e.g. if you lose a major piece of infrastructure, will it send you 300 alerts?)
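
    One way to keep a single infrastructure failure from turning into 300 alerts is to alert only on the topmost failed device. A minimal sketch, assuming you maintain a map from each device to its upstream parent; the `root_causes` function and the topology format are hypothetical.

```python
from typing import Dict, Optional, Set


def root_causes(down: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the down nodes that have no down ancestor."""
    roots = set()
    for node in down:
        ancestor = parent.get(node)
        seen = {node}  # guards against accidental cycles in the map
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                # Something upstream is already down; suppress this alert.
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
        if not suppressed:
            roots.add(node)
    return roots
```

    Because the function only needs the set of failed devices and a topology map, it can live in the alert-selection layer without knowing anything about collection or display.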

  142. Re:5K instruments in the orchestra: U Need a Maest by mberkay · · Score: 1

    The orchestra and the Maestro is one of the best analogies I've heard to describe the role of the Event Management system. RapidInsight aspires to be the Maestro over all other management tools, consolidating IT management information from the different tools used to monitor systems, networks and applications, as well as fault, performance, config/change, tickets, etc.

  143. MonALISA by Anonymous Coward · · Score: 0

    have a look at MonALISA

    http://monalisa.cern.ch