What Would You Want In a Large-Scale Monitoring System?
Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-scripts solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers and applications alike. The final product must be very easy to understand, as it will also be used by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
Publish them in DNS, and have the NSA monitor them for me!
"Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
Hyperic HQ may be worth checking out.
Don't assume that you can successfully diagnose the problem based on your understanding of the indicators. You don't know my institutional context. Instead, give me a decision support system that I can use by adding rules that key off the monitored indicators and inject some of our own expertise into the diagnostic process.
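One way to picture the "add our own rules" idea is a small rule book that maps monitored indicators to site-specific advice. This is only a sketch: the `RuleBook` class, the indicator names (`io_wait_pct`, `hour`, `disk_used_pct`) and the advice strings are all made up for illustration and do not come from any particular product.

```python
class RuleBook:
    """Site-specific diagnostic rules keyed off monitored indicators."""

    def __init__(self):
        self.rules = []  # list of (name, condition, advice)

    def add(self, name, condition, advice):
        # condition is any callable that takes the indicator dict and returns a bool
        self.rules.append((name, condition, advice))

    def diagnose(self, indicators):
        # Return the name and advice of every rule whose condition holds
        return [(name, advice)
                for name, condition, advice in self.rules
                if condition(indicators)]


# Hypothetical institutional knowledge, expressed as rules
book = RuleBook()
book.add("backup-window",
         lambda m: m["io_wait_pct"] > 40 and 1 <= m["hour"] < 4,
         "Nightly backup is running; high I/O wait is expected.")
book.add("disk-nearly-full",
         lambda m: m["disk_used_pct"] >= 90,
         "Rotate logs on this host before paging the DBA.")
```

Calling `book.diagnose({...})` with the current readings returns only the advice that applies, so the night shift sees local context instead of a bare indicator.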
That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.
OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.
http://www.opennms.org/
Enjoy.
What limitations exist in current solutions that justify developing a new one from scratch?
I am going through this right now and am using and have used all the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.
You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.
You can pry my GKrellM from my cold, dead hands!
Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!
*cough*
MRTG does it right...most of the others do it wrong
When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems
So your 380Mbps peak that you had an hour ago is fine on today's graph
But tomorrow, when you look at "yesterday's" graph...the peak is down to 100Mbps
and next week, when you look at "last week's" graph...there's a little 50Mbps peak
Damnit... I want to keep information on my peaks for capacity planning!
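To make the complaint concrete, here is a rough sketch of a roll-up that stores the maximum next to the average for each coarse bucket, much like keeping a MAX archive alongside the AVERAGE one in an RRD-style store. The function name and the sample numbers are invented for the example.

```python
from statistics import mean


def consolidate(samples, bucket_size):
    """Roll fine-grained samples up into coarser buckets,
    keeping both the average and the maximum of each bucket."""
    rolled = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        rolled.append({"avg": mean(bucket), "max": max(bucket)})
    return rolled


# Hypothetical 5-minute samples in Mbps with one short 380 Mbps burst
five_minute = [40, 50, 380, 45, 60, 55, 50, 45, 40, 50, 55, 50]

# One hour = twelve 5-minute samples
hourly = consolidate(five_minute, bucket_size=12)
```

The hourly average comes out well under 100 Mbps, which is exactly the flattened peak described above, while the stored maximum still shows the 380 Mbps burst for capacity planning.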
Twitter client, facebook integration, google maps mashup.
And a pony.
Thanks
I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for Openview. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.
You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.
http://www.spiceworks.com/
Not sure how far it scales but I have played with it on some small installations, very easy to manage.
I have used Cacti but never felt it was mature or robust enough for very large environments
We are now deploying SCOM (System Center Operations Manager) for our enterprise. However, I would be afraid to manage it on my own, as it is a large system unto itself, yet very powerful.
EA David Gardner -"... but the consumers have proven that actually what they want is fun."
I use Nagios and some custom rolled scripts myself.
For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga.
Reconnoiter also looked pretty kewl, but they haven't released anything yet, but it looks like they are planning it to be very scalable.
That is what I would want! ^^
Any sufficiently advanced intelligence is indistinguishable from stupidity.
I've really been impressed with OpsView. Can't say how well it scales on huge networks (but there are options for having multiple servers). It's based on Nagios, but it's a lot less of a pain to configure and has a pretty good web interface. The only thing I don't really like is its graphing functionality. I use Cacti for monitoring bandwidth/server load/etc. But for availability checking OpsView does a fantastic job. I'm using it to monitor maybe twenty devices, including Linux and Windows servers, and HP/Cisco network devices. I tried Zenoss as well, but it seemed awkward to work with. For instance, with OpsView/Nagios it's easy to add a check to verify that a DNS server is correctly resolving a record in a particular zone. I remember it was going to be a pain to monitor some of the things I wanted to with Zenoss. Maybe I'm biased because I used plain old Nagios for a while before I tried OpsView and Zenoss.
Every time you post an article on Slashdot, I kill a server. Think of the servers!
I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).
If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.
The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.
MRTG also integrates w/Nagios, which can be useful.
Good luck.
http://argus.tcp4me.com/
Jeremy Kister
http://jeremy.kister.net./
Cons:
I want lots of buttons and dials! And flashing lights!
+1 IDisagreeSoHeMustBeATrollOrAnAstroturferOrAShill
The mistake is trying to monitor thousands of devices on a 2-D map. It'll look pretty to the suits, but be useless for the users. Nothing but endless slow clicky clicky clicky.
Give them a text screen of what's currently down ... that'll work.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
I've only used MOM but for what it's worth the diagramming capabilities are much improved with the new visio plugin. Previously you could export your diagrams from OpsMgr to visio, but with the new plugin the visio diagrams reflect live health state. You can also create whatever diagram you want in visio and then tie it to monitored objects living in OpsMgr (for example rack diagrams)
Here's a similar thread from a while back that covers most of the options: http://linux.slashdot.org/article.pl?sid=07/03/05/1812247
I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.
I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.
The system did push and pull operations. First, the system was built in PHP.
In order to push commands to the servers I used the PEAR SSH2 class for communication when it became stable. Another option (and what I did back in 2003) was to use exec and other command line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command line output from PHP "true" rootly powers. The problem was I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real time input/output, so I designed the system to work by creating an SSH2 key pair on my master monitoring server and putting its public key on each of our external servers for passwordless SSH.
The pull part of the system simply had a PHP script running on a cron per server, that would deliver information about the health of the server, its running processes, etc, to the main SMS server every 5 minutes. All load activity for all servers was logged as well to MySQL. The push operations were used to update those scripts, as well as restart daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to where we could perform a FULL Database Update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc, in under an hour. We would then monitor the servers using the SMS's main screen which showed real time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to rollback the update, that was a simple mouse click away too.
I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.
This doesn't seem to have already been suggested, but we use SolarWinds Orion. It's cheaper than many of the big systems, such as HP OpenView - and much simpler to use and operate.
The basic Orion package, which you can get for $2000 for up to 100 servers, will pull the usual CPU/RAM/Disk/Network statistics via SNMP. Built in is a mapping engine, that allows you to take a network map, and drop active elements onto it for live interfaces and device information. In a NOC environment, you can show this on a screen and it'll even sound an alarm when a system Alert fires through the website.
You can then bolt on additional modules, such as their Application Performance Monitor. It has ready to use templates for common business applications: Exchange, Apache, IIS etc. You can also create your own, mixing SNMP, WMI and User Experience monitors. User Experience monitors, for example, allow you to actively poll HTTP/FTP/DNS/SMTP/IMAP/POP etc. services to ensure they are not only UP but responding as they should to requests.
For scaling, you can tack on Additional Pollers to spread polling load across them. You can also use hot-standby pollers to resume the work of a failed poller.
Just my 2 cents, and not a corporate plug - just a very content user!
Sounds like you need a centralized syslog server. It could do more for you than just log commands.....
Karnal
As one of the core devs and large user, I can tell you it scales well, develops easy and has a lot prefab. The system does everything you're asking for. Let me know if you need help or paid support.
Custom electronics and digital signage for your business: www.evcircuits.com
I'm a software developer and, sadly, my knowledge of hardware systems isn't always what it should be. When I write an application to run on a server and it starts to get slow, I want to know where the bottleneck is. Is my application CPU-bound? I/O-bound? Memory-bound? Do I need more memory? Faster storage? More cores or faster processor speed? Is it the network that's causing the problem? I can usually figure this out using various linux command-line programs like netstat and top and all that, but I would sure love a big fat GUI to make it more graphic. I found something like this once and couldn't remember what it was called. It required all kinds of diagnostic utilities be manually installed.
Ideally, you could view a machine and get some quick idea of where the bottlenecks lie. Maybe that's asking a bit much, but the closer you can get to a single control panel where I could see all my machines in a list with a status indicator and then drill down machine-by-machine, the happier I would be. It would be even cooler if the machines could contact me when they experience times of overload so that I could get a feel for when the trying times are so I can watch them more closely. I'm imagining a daemon that runs on each server and an admin gui that can speak to that daemon somehow. It would also be nice to have hooks so that I can easily report performance profiling information to the GUI from within my application.
The Activity Monitor utility found on Macs is pretty close to what I'm imagining.
What I'd like to see is a good monitoring support for JMX-capable Java services.
It'd be nice to set up an alarm based on time spent in garbage collector in a JVM running our application, for example.
I think Nagios should provide a good start.. they've recently added a lot of scalability features. Though it has a high learning curve and all of its configuration is done in text, I've always found it worth the time and effort. I currently use it to monitor services on a couple hundred machines.
Munin is a bit simpler, but I like the graphs it provides which occasionally are more useful than the data Nagios provides. In some cases, Nagios might tell me that a server went down, but I'd look at Munin and see that the server room temperature spiked to 90F before then. Also it's neat to see the uptime graphs for the year.
While it might not be practical to use GKrellm all the time, I'd find it useful for real-time feedback. You might set something up where you can launch a gkrellm client to a server of interest while you're working on it. Then you can see the effects of things you do without waiting for Nagios to refresh in 5-10 minutes.
That is the solution that i have implemented for our little environment that consists of about 50-ish solaris (8 and 10) servers, 80-ish windows servers, about 500 linux servers, and 40-odd cisco switches. Hobbit handles all host monitoring: availability, services, and a bunch of custom scripts written for it to check various aspects of our HPC grid, plus the SMS sending through an old nokia connected to the comm port of a solaris box. Smokeping is there to check latency, and cacti primarily for network traffic volume, and a custom module for FlexLM licenses. Works like a charm
It's *not* open-source, but it IS inexpensive. When I worked at a NOC, we used it to monitor hundreds of routers, switches, mainframes, Tandem systems, UNIX boxes, etc. It takes SNMP traps and displays them graphically on a 2D map, and the 2D map is very nicely implemented. You can have your top level view made up only of groups of devices, so if a group goes red you double-click that group to view its members and see which device actually has the error. IIRC, you can nest groups, so it ends up being a fairly scalable solution when you talk about screen space.
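The nested-group behaviour described here, where a group goes red when any member does, boils down to propagating the worst status up a tree. A minimal sketch, assuming a made-up dict layout for groups and devices and a three-colour scale:

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def rollup(node):
    """Return the worst status among a node and all of its descendants."""
    worst = node.get("status", "green")
    for child in node.get("children", []):
        child_status = rollup(child)
        if SEVERITY[child_status] > SEVERITY[worst]:
            worst = child_status
    return worst


# Hypothetical top-level map: groups contain groups, which contain devices
site = {
    "name": "HQ",
    "children": [
        {"name": "core-routers", "children": [
            {"name": "rtr-1", "status": "green"},
            {"name": "rtr-2", "status": "red"},
        ]},
        {"name": "unix-boxes", "children": [
            {"name": "ux-1", "status": "yellow"},
        ]},
    ],
}
```

The top-level view only needs `rollup()` of each group; the operator drills into whichever group is not green.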
I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
I use Nagios, but on a smaller scale than what you describe. I love the system, but I would imagine it being difficult to maintain on a larger scale. Nagios itself requires manual configuration unless you use a separate front-end like Centreon, which is also far from perfect.
A friend of mine has been toying with OpenNMS for the last few months, and he's pretty happy with it although he reports that it's still got some minor issues that need to be worked out. It's FCAPS compliant, and I get the impression that it might be the better option for handling a large installation. There's a new version scheduled for release soon, so we'll see what that brings to the table.
There's also recently been an announcement of a Nagios fork, scheduled for release sometime around October. I forget the site or project name but I'm sure a bit of Googling will locate their site for you.
This sounds like the perfect opportunity to harness the power of app partisans to fix the wikipedia article comparing monitoring software. See http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems . Some good info there. And probably bad info. But certainly has a good list of applications. Also, if you like nagios (and he seems to me to be fair to a lot of packages, including ossim), you might check out some of David Josephsen's articles (or Nagios book), etc.. His site is http://www.skeptech.org/ . A decent design article is here -- Best Practices for Designing a Nagios Monitoring System -- http://www.informit.com/articles/printerfriendly.aspx?p=705685 .
I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.
What you want in large-scale monitoring is:
Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
Advice: on VPS providers
Which revision?
i tried it for a couple of months, and rather like it, but it'd simply stop monitoring stuff, triggers wouldn't fire reliably etc.
Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.
Conformity is the jailer of freedom and enemy of growth. -JFK
Big fan of intermapper (www.dartware.com) ... It can use nagios plugins as well.
It's fairly cheap.. We monitor about 1250 devices at the moment with it... polling can be set all the way down to 5 seconds.
Server and Client are both in Java... so more or less it runs on any platform.
They give out 30 day demo keys.
I use Cacti, with THold and weathermap plugins.
But then I'm biased.
-- I care not for your foolish signatures.
Foglight from Quest Software covers most of the requirements out of the box and is script friendly. It's all Java based, though it isn't itself an open source project.
There is also a community around customization that might be worth checking out over at www.foglight.org.
Then I advise you to have all the connection checks done by a machine dedicated to this task.
Run the statistical checks on each of the 5000 machines themselves, 24 hours a day, pushing/pulling the data at intervals to that dedicated machine.
I advise 2 or more measuring machines each on a separate network each doing the same task and synchronizing their data for redundancy.
You can always try INM (http://www.intellipool.se/).
It's quite feature rich and it's worth a look.
--- Reality doesn't care about your opinions, it happens anyway and if you are in the way you'll get squished.
Was a step above Nagios in terms of reliability (I didn't have to restart the server four times a day just to keep it running), and did much better at autodiscovery.
That fact that it is also NRPE compatible was a plus - I could use all the Nagios plugins and check scripts I'd written.
I was also planning on using it to launch a more aggressive webmin-style management solution - since OpenNMS built this great database of data about my devices and hosts, I could use it to do actual management - change data/settings.
Cons: It's a Java/Tomcat tool, as much as that is really a con. It's not like you need to run Jboss or Websphere to use it (though I suppose you could).
I just did a quick survey and evaluation of the open source monitoring market for my company, and found shortcomings/frustrations in a few aspects where none of the evaluated systems seems to get it 100% right.
Transparent Planned Design
Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.
Event management
Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/SMS based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.
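The "simple logic" asked for here can be sketched in a few lines. This leaves out the syslog receiver and the MIB decoding entirely and only shows the squashing of repeats and the rule-based dispatch. The class name, the 300-second window, the event fields and the destination names are all assumptions made for the example.

```python
class EventSquasher:
    """Collapse repeats of the same (host, message) seen within a time window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # (host, message) -> {"first", "last", "count"}

    def add(self, ts, host, message):
        """Return True for a new event, False if it was squashed into a repeat."""
        key = (host, message)
        entry = self.seen.get(key)
        if entry is not None and ts - entry["last"] <= self.window:
            entry["last"] = ts
            entry["count"] += 1
            return False
        self.seen[key] = {"first": ts, "last": ts, "count": 1}
        return True


def route(event, rules):
    """Return the destination of every rule whose predicate matches the event."""
    return [destination for predicate, destination in rules if predicate(event)]


# Hypothetical dispatch rules: critical events go to SMS, everything goes to mail
RULES = [
    (lambda e: e["severity"] == "critical", "sms"),
    (lambda e: True, "mail"),
]
```

Only events for which `add()` returns True would be passed on to `route()`, so a flapping link produces one notification plus a repeat counter rather than a flood.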
Modularity/Seamless Integration
Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g. is project X missing an event manager, or is the built-in one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Blackbox solutions are OK as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution evaluated does everything we need it to, and we end up struggling with manual routines to compensate for it.
Complexity
There are a few really neat systems that do almost everything one can ask for. (Short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they are then often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.
The Perfect Monitoring System
After evaluating all options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (Although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrates well, and together fits our needs, which I personally see as a bigger problem.
What I would really want to see in the world of Open Source Monitoring is an eco-system of monitoring apps with an overarching design/architecture. Design a framework where different entities and steps in the monitoring are clearly defined and interfaced with each other, but which still allows for differing implementations, and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an application, but in The Perfect Monitoring System, it should be clear for the average-skilled hacker how to integrate it.
As you've discovered, the free systems will fall over and die once they're past a certain size. I've worked with Tivoli customers that have tens of thousands of servers, and for all of its problems, Tivoli is scalable.
The way they do it is, obviously, divide and conquer. There are specific ways that they do it. I'm mixing up the architecture and terminology on purpose, because the Tivoli terminology will confuse you.
* there are agents on every box that do the monitoring
* they report to a region
* those regions report to a top-level region
That doesn't mean that you can't have a poll engine somewhere, poking machines. What it means is that if you do have a poll engine, it manages a specific number of machines and reports results upwards if necessary.
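The agent, region, top-level region layering can be sketched roughly as follows. This is not Tivoli's actual design or terminology, just an illustration of the divide-and-conquer idea: each region absorbs routine readings from its own agents and passes only threshold breaches upward. The class and field names are invented.

```python
class Region:
    """A collection point that handles its own agents and escalates upward."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.alerts = []

    def report(self, host, metric, value, threshold):
        """Called by an agent. Only threshold breaches travel upward."""
        if value < threshold:
            return False  # normal reading, absorbed at this level
        self.forward({"host": host, "metric": metric,
                      "value": value, "via": [self.name]})
        return True

    def forward(self, alert):
        # Record the alert here, then hand a copy to the parent region (if any)
        self.alerts.append(alert)
        if self.parent is not None:
            upward = {**alert, "via": alert["via"] + [self.parent.name]}
            self.parent.forward(upward)


# Hypothetical two-level hierarchy
top = Region("top")
east = Region("east", parent=top)
west = Region("west", parent=top)
```

The top-level region never sees the thousands of normal readings, only the handful of alerts along with the path they took to get there.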
Tivoli has a bunch of other stuff that makes things like this easier, like profile-based management and lightweight (relatively) endpoints. You can simulate that using a centralized source control system that everything pulls from - configuration of monitoring, etc. comes from those config files, and your agents pull their configs depending on criteria, like their hostname, IP, or by looking in some file for what they're supposed to get. This also becomes your shared filesystem of sorts, because you can pull monitoring binaries from it as well as config files.
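A poor man's version of profile-based config selection might look like the sketch below, where each agent works out which config files to pull from the central repository based on its own hostname. The patterns and file names are hypothetical, and the actual checkout from source control is left out.

```python
from fnmatch import fnmatch

# Hypothetical mapping of hostname patterns to config files in the central repo.
# Every host gets base.conf; role-specific files are layered on top.
PROFILES = [
    ("db-*", "database.conf"),
    ("web-*", "webserver.conf"),
    ("*", "base.conf"),
]


def profiles_for(hostname, profiles=PROFILES):
    """Return the config files this host should pull, in listed order."""
    return [conf for pattern, conf in profiles if fnmatch(hostname, pattern)]
```

An agent on `web-07` would pull `webserver.conf` and `base.conf`; adding a new class of machine is then a one-line change in the repository rather than a visit to each box.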
Management of alerts is always a problem. Having worked on an EMS I can tell you that all the free ones suck, so it doesn't matter which one you pick. Spend some money and buy the BMC Event Manager.
Besides that, avoid UDP - it fails when you need it the most. And don't do management by exception - it's for lazy admins. Instead, do some kind of thresholding on your stuff, so you can tell before it fails. MBE gets you there 5 minutes before your users call. Real monitoring allows you to ignore the problem for weeks, or at least blame someone else for not acting when the systems finally do fail.
Needed features in random order:
* Scalability - a few k machines is the minimum. This probably means smart, decentralized collection and aggregation of data.
* Flexible whitebox monitoring - for a given class of devices, I should be able to configure how to fetch the device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
* Flexible blackbox monitoring - for a given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how the results of each action should be interpreted (ok/nok, time to complete, etc.).
* Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
* Some language to easily calculate derivative values from the data above.
* Interface for defining graphs, using collected data.
* ...and a system for annotating the above. Raw data is neat, annotated data is even better.
* Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
* (nice to have) HTTP server with simple HTML templating, to allow for easy creation of arbitrary dashboards.
* (if you have the above) predefined templates for most of the common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
* hooks/libraries to use collected data "outside" of the system
I realize that's a lot, but boy, such system would be very useful and flexible.
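The tag-and-aggregate item on that list is easy to sketch. The sample layout (a dict with `tags` and `value`) is an assumption for the example, the percentile uses the simple nearest-rank method, and the standard deviation is the population one.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def aggregate(samples, tag):
    """Group samples by the given tag and summarise each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tags"][tag]].append(sample["value"])
    return {
        key: {
            "min": min(vals),
            "max": max(vals),
            "mean": mean(vals),
            "stddev": pstdev(vals),
            "p95": percentile(vals, 95),
            "sum": sum(vals),
        }
        for key, vals in groups.items()
    }


# Hypothetical load readings tagged by machine and network segment
samples = [
    {"tags": {"machine": "a", "segment": "dmz"}, "value": 10},
    {"tags": {"machine": "b", "segment": "dmz"}, "value": 20},
    {"tags": {"machine": "c", "segment": "dmz"}, "value": 30},
    {"tags": {"machine": "d", "segment": "lan"}, "value": 5},
]
by_segment = aggregate(samples, "segment")
```

The same call with `"machine"` as the tag gives per-host summaries, which is the kind of flexibility the list is asking for.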
-- we're here you're not
If you're monitoring something of that scale, you should probably look into a professional solution.
I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using it. Though I've heard there has been some improvement.
That seems to be one split between the various different monitoring systems out there. Either it's intended for the network guys and its only understanding of host/server metrics is what it can poll out of SNMP, or it's SA-focused but has few of the broad, large-scale network features that the network guys want.
Personally (as an SA), I've been very satisfied with Xymon (nee Hobbit, which was a fork/rebuild of Big Brother). Performance is great, even with 5000+ devices, it's got an open and simple-to-parse protocol, and an incredibly extensible architecture. As an SA, being able to script up a monitor and throw it into a data stream as plain text makes it very easy to develop new tests (or add simple monitoring/logging/rrdgraphing) out of pre-existing scripts. Don't limit yourself to what SNMP gives you if you're dealing with servers, services, and higher-level app testing. KISS: http://en.wikibooks.org/wiki/Hobbit_Design_Document
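To show how little is needed to "throw a monitor into the data stream", here is a sketch of building and sending a status report. The message layout (`status HOST.TEST COLOR text`, with dots in the hostname written as commas, sent over TCP to port 1984) is how I understand the Xymon client protocol; check it against the xymon(1) man page for your version before relying on it. The host, test name and server below are made up.

```python
import socket


def status_message(host, test, color, text):
    """Build a one-line status report for a custom test.

    Dots in the hostname are written as commas so that the last dot
    separates the host from the test name.
    """
    return f"status {host.replace('.', ',')}.{test} {color} {text}\n"


def send(message, server, port=1984):
    """Deliver the message to the monitoring server over plain TCP."""
    with socket.create_connection((server, port), timeout=5) as conn:
        conn.sendall(message.encode("utf-8"))


# Hypothetical custom test reporting a queue depth
msg = status_message("web1.example.com", "queue", "red", "depth=900")
# send(msg, "monitor.example.com")  # uncomment with a real server name
```

Any existing script that can print a colour and a line of text can be wrapped this way, which is the extensibility argument being made above.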
Whatever you do, pick something you can easily customize: Hack together three different monitoring systems to come up with a best-of-both worlds solution. Everyone's monitoring needs are different, after all.
For COTS I'd go with CA (formerly Concord) eHealth. Their SNMP agent is light-weight, flexible and the product scales out very well. It's also reasonably straight forward to deploy and configure and quite expandable. If you want an open source alternative that will grow with you and/or offer support options down the road I'd give Zenoss a shot. Steer clear of HP, BMC, or IBM solutions due to complexity and/or price.
If you want a monitor that can display useful information about thousands of nodes on a single display try clumon. We use it for our 1000+ node clusters. The software was developed in-house but is available under the University of Illinois/NCSA Open Source License Copyright (noticeware). If you're just going to use this in-house, the license shouldn't be an issue.
You can see a sample clumon display of a working cluster at NCSA Linux Cluster Monitor. The clumon page for that cluster shows you the job status of each individual node (if the node is colored, it has a job assigned), the load on the machine (the height of the line is proportional to the load, and red tips show loads over 1.0 per cpu) and the service status (green underline is ready, yellow/black stripes is offline, and red is unexpected offline/no comms). If you mouse-over a node, a status box pops up with more information on that specific node.
As this was designed for a cluster with the Torque resource manager, it won't be exactly what you need, but since you are willing to write a monitor from scratch, it might be a really useful starting point. Design-wise, this monitor allows the engineer or manager to see what's going on in general, with problem areas being immediately obvious, and without being overly cluttered.
The open source Performance Co-Pilot software runs on each node to collect information, which is polled by the central server. Back end is MySQL. The dynamic display is PHP.
Straightforward, useful and very configurable.
The Internet has no garbage collection
He's just asking what to use
I know I'm late to the party, but I haven't seen anyone bring this one up yet: Real-world alarm/notification capability (pager, buzzer, a machine that goes bing, something like that)
My reasoning: I run a small IT business with various support contracts. I, and probably quite a few others, can't afford to pay someone to sit at a monitor and watch a screen (or a bunch of screens) whilst tied to a desk.
Most of the monitoring solutions (Nagios, others) are capable of off-site notification, but it's the "last yard" that's the problem--how to tell someone, even a non-techy, there's a problem so he can call in the cavalry. Despite Verizon's "largest 3G network" claim, a lot of my clients and workers in Silicon Holler don't have cell coverage...so SMS, pagers, etc. aren't all that reliable. But we do have office staff who could be around to listen for an alarm, and we have a solid internet connection...so calling for help via the network is viable, but not paying someone to be otherwise unproductive because they can't go anywhere else.
I even started developing my own ATMEGA based solution...still working on it, and I think it's completely doable. If I ever get it up and running, I'll publish the plans, code, and scripts/software under GPL and let someone else worry about the marketing.
Never confuse movement with action. --Hemingway
very non invasive, monitors just about anything and graphs via open libraries. opensource and pretty easy to get started.
Steve Maher freeunixtraining.com
We are using http://www.groundworkopensource.com/ for our monitoring. It is working pretty well, and we can use existing Nagios scripts with it.
Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web gui. It's not perfect, but it keeps improving and the attitude of the developers is friendly unlike some of the other projects.
I implemented The Hobbit Monitor where I work. Actually, it's called Xymon now because of a copyright complaint (who knew?!) but I digress...
We monitor basic to complex information of about 5000 machines and a handful of NAS devices. It is a server/client setup, and highly customizable: an evolution of the Big Brother monitor from days of yonder. The histories can go back indefinitely, and all the configuration is done by flat-files: helpful if you like to roll-your-own automatic configuration tool.
It is pretty basic out of the box, but the way it is implemented makes it very easy to track whatever you want and write your own tests: from simple bash and perl scripts, to c programs with api hooks into your applications.
We didn't go with Nagios because initial testing showed it was very chatty and the interface was unintuitive. I happen to like the easy 'smiley face good, frowning face bad' for taking a quick glance at our infrastructure.
Ganglia for performance monitoring (it's for clusters after all and has been shown to handle 5000+ machines with very little overhead) and Nagios for host/service down alerts.
Zabbix.
http://www.zabbix.com/
If it leaves something to be desired, please tell us what.
-fb Everything not expressly forbidden is now mandatory.
To be fair, I wouldn't say the Zope database (ZODB) is not a "solid foundation". It's one of the best parts of the Zope stack and, in 3 years of dozens of clients using it in Zenoss, Plone, and other apps, I've never had it corrupt or lose any data. It's a proper DB--ACID, MVCC, and all that--and you can even lop transactions off the storage to go back in time. Don't expect it to be a relational DB with the ad hoc query tools typical thereof; it's an object DB, with the aim of persisting graphs of Python objects transparently.
Now, if you aren't familiar with it, the ZODB can indeed seem opaque, but, just like any DB, there are tools to read and modify it. At the highest level, just stick "manage" after your Zenoss URL, e.g. http://example.com/zport/dmd/manage . That'll get you into the web-based Zope Management Interface (colloquially, "the ZMI"), where you can poke around at any object that someone's bothered to write a UI for. Deeper than that, you can connect to ZEO (a server that brokers access to the ZODB over a socket) and mess with the object graph using normal Python. When you're done, "import transaction; transaction.commit()". (The Zenoss developers are probably trying to scare you away from such digging around in fear that you'll violate their objects' invariants and leave them a real mess to solve.)
Now, I don't say that Zope isn't scary; it has over 10 years of scary stored up in it. But the ZODB is a cuddly, loving part.
Cheers!
I've used Spiceworks for multiple smaller sites and it works well...
Cacti was a pain to configure for every client(tried it first)...IMO
"Just Smile and Nod." --Huck
I would have to agree with the other poster who suggested SolarWinds Orion network monitor. You can monitor network switches (down to each port), servers, apps on the server, other devices with SNMP strings, things that don't support that... you can import multiple maps, etc. At our site, Orion has a US map, a state map, then campus maps for a few sites, then building maps, then down to the server room. Custom views and alerts: my login to Orion shows different stuff than our telecom person's login. And no, I don't work for them. http://www.solarwinds.com/
We use up.time from http://www.uptimesoftware.com/ We monitor all aspects of our enterprise using it, including: network devices, OS (Windows, Linux, NetWare), applications (LDAP, DNS, MS Exchange, Oracle, etc.), performance, and hardware monitoring. We have much more than 5000 elements being monitored. We had a huge number of separate systems monitoring each flavor of component, including MOM, Nagios, NetMon, and many others. We wanted one pane of glass to see the whole infrastructure and be able to show the "service" availability. Obviously, each specialist system can monitor its own key element in some ways better, e.g. MOM can monitor things within Exchange better, but for a single monitoring system this one won our evaluation process. Check it out.
I vote Zabbix. Here's why.
1) Free but offers paid support if you need it
2) Can use agents, snmp or simple checks like ping
3) Agents can be extended with your own scripts and such. If a check isn't built in you can add it. For example, I added a very simple script for checking if MySQL replication had stopped or failed.
4) Templates, makes it easy to add a metric and create a trigger based on that metric to any host attached to that template
5) Triggers can be configured to minimize false positives (e.g. requiring multiple dropped packets before sending an alert)
6) You can graph item, group of items or an aggregate value of items in a host group
7) Create your own maps
8) Create custom screens that group simple or complex graphs or whatever else you want onto a single page
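Point 5 in the list above is worth a quick illustration. What follows is a minimal sketch of the idea in plain Python, not Zabbix's own trigger syntax: only raise an alert once several checks in a row have failed. The function name `should_alert` and the `threshold` parameter are made up for this example.

```python
def should_alert(results, threshold=3):
    """Return True once `threshold` consecutive checks have failed.

    `results` is an iterable of booleans, oldest first; True means the
    check (for example a ping) succeeded. Hypothetical helper for
    illustration only, not part of any Zabbix API.
    """
    consecutive_failures = 0
    for ok in results:
        if ok:
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= threshold:
                return True
    return False
```

A single dropped packet between good replies never reaches the threshold, so it stays quiet; three failures in a row trip it.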
There are some things to know about Zabbix though. You need to put some thought into items to get accurate values. Is the value you are getting from a device in bits or bytes for example. You can use custom multipliers to convert values into what you want to see.
Honestly, Zabbix is incredibly flexible, and this flexibility also gives it a steep learning curve, but once you get hosts entered and the templates situated the way you want, it becomes very easy to add new hosts down the line. The biggest tip I can give is to make sure you spend a lot of time thinking out how to set up your templates. Zabbix includes a number of them and you'll want to customize them. One thing I found that wasn't a good idea is to make a template and then attach it to a template. It's much easier to join a host to multiple templates.
http://www.zabbix.com/
As far as the actual question goes, I think a patchwork of tools that you understand well and have proven themselves reliable is often a better choice than the all-singing all-dancing approach. The patchwork does take more time to roll out and configure, but if the tools are simple and easily managed, it is probably the better choice for large environments. However, at the 5000 device level, I'm not sure if you're at that break-even point. I've only personally deployed nagios, cacti, and similar tools on the small scale.
But more usefully, I'd recommend a tool that is *not* network monitoring to go along with it. Monitoring is great for seeing what is going on at the moment within a domain of events, but once you find out about a problem, how do you then dig into it? I really recommend feeding all your monitoring data and *other* IT data as well into a system that lets you investigate all of them. I think Splunk is a pretty good way to do this. It's a search engine into all the time-series data in your environment, so you can learn things like what *else* broke at that time, who was logged in, and so on, all pretty quickly. It's commercial, but reasonably priced, and can be used free at some data levels.
Caveat, they pay my bills, so I may be biased.
-josh
Zenoss for general device monitoring. Hyperic HQ for app monitoring. Splunk for scanning log files.
Nagios with Centreon. Centreon is a decent front end to Nagios, with commercial support if required.
OK, guys, I ran into this a while ago and had to choose what to do. Here is a list of what I've tried (meaning really, really tried, even looking at the source code) and my short opinion as a result. Disclaimer: this was my own personal research and practice, so I might sound different from others. Any suggestions are welcome. :-) Also I want (yeah, I am picky):
So! Here it is:
I am sure they are not any leaders in monitoring technology. Also I even doubt they are leaders in monitoring in general. However, this worked for me and I wish Nagios all the best.
Personally I recommend going with OpenNMS. I'm not going to say it is excellent: it also has its gotchas. For example, don't even think about installing it on slow machines with low memory and/or on LVM. I would also love to see it working with other databases... It is written in Java and wants plenty of resources. However, once you give it those, you will see the best of it. Integr
Yeah, I played with Splunk on my last client.
Killed it in about ten minutes of futzing around with it. Reliability? Fail.
Performance? You need dual Xeons. Fail.
Support? Asked a question on the forums, got told to RTFM - which, by the way, is incomprehensible. Fail.
Splunk sucks rocks. All I wanted was some relatively simple Windows event log monitoring. Ended up going with Network Event Monitor (which is also heavy on the hardware needed, but it actually runs and is relatively simple to set up, although limited in its filter creation).
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
Hey, I don't know how I can get in touch with you since our emails are protected on Slashdot. I think I left a comment on your blog, but that was the best guess I had.
Nowadays you have to consider a monitoring tool or tools that can do the following:
1. Event correlation (don't page me or turn my entire dashboard red if I lose 2 servers in a 20-server load-balanced pool)
2. Application mapping, dashboards, and portals
3. User experience monitoring mapped to hardware (eliminate finger pointing and shorten problem identification)
4. Ability to publish reports (reports tailored to the person's skill level; the higher up the food chain, the less they will understand very technical graphs)
5. Historical comparisons (we now have to justify clearly why we need the upgrade or the latest and greatest)
6. In a large-scale monitoring solution you have to manage your database; all this data can pile up quickly
What tool you use is not as important as what you do with the information to quickly resolve issues and provide data on the health of your infrastructure.
If you haven't evaluated BixData (http://bixdata.com/) yet, you're missing out. No nag/reg required, free use for less than 30 hosts.
Does not include kitchen sink. Only the next-generation advanced monitoring system that can handle physical and virtual as well as the hypervisors! VMware friendly.
On their science page http://www.bixdata.com/science, they say Bix is Borg. And they're not kidding.
"BixData is profoundly different. It took science fiction to provide the metaphor. Bix is Borg - 'an inter-connected collective' (self-organizing p2p) that 'assimilates' new life forms (cross-platform virtual machine), functions with a single hive-mind (n-cube datastore) and adapts through self-learning (cybernetic feedback loop) - all in pursuit of perfection. Resistance is Futile."
Just... Wow.
-- Robi
I use it both at home and at my previous job at a big datacenter with several hundred monitored devices. Worked very well and the map is superb. I haven't seen anything comparable. Especially the realtime traffic status of network links (which fade to red when overloaded) is a great diagnostic tool.
If monitoring everything your sysadmins do is important because they have a habit of wiping their histories, I think you need new sysadmins.
Big Brother (or Sister) which uses push agents so you are not generating vast amount of SNMP polls and you get instant feedback on a stupid simple dashboard.
http://www.bb4.org/
Project Observer is super easy to set up for SNMP and can auto-discover Cisco gear (with CDP). A good, simple SNMP monitor but it has serious scaling limitations.
http://www.observernms.org/
Nagios for hard core up/down monitoring with good flap detection and Cacti for performance monitoring.
OCS Inventory for push software distribution and inventory control.
Or you could drop some serious cash and just get Unicenter TNG and go bald from ripping your hair out.
Seriously, though, try a bunch of things and see what actually works for your team.
[RIAA] says its concern is artists. That's true, in just the sense that a cattle rancher is concerned about its cattle.
Fast, scalable, and integrates the business view in the tool, as well as a real GIS (you can use it or not), SLA management, and BI and data mining views. Everything related to monitoring is written in C++ using ACE, the multiplatform near-real-time framework, so it's really fast.
The Web console is based on Tomcat (J2EE) and you can manage, deploy, configure or update agents from there.
You can develop new events and new agents to monitor "whatever" you want: stock shares, temperatures, web transactions... it's up to you. And Osmius is really open: there's no open core, no closed features, and you can access the documentation, the analysis info, the data model stuff, and (of course) the code.
We are now (I'm on the development team) doing the final tests and dealing with the latest bugs, and also testing scalability and behaviour under the stress of receiving millions of events every day, thousands per second. It's working properly.
The D day is July the 30th. Every comment and suggestion, and even hard criticism, is very welcome.
We want Osmius to be:
Let's see if we made it.
Well, I suppose it depends.
How many large scales are you planning on monitoring?
I think you have already gathered from the large amount of responses that the problem has been "over solved" - many options to choose from. We provide a rip-n-replace service for HP Openview users (banks and trading exchanges), which, including any coding required and 24/7 support comes out at about 20% of those annual costs in year 1 and below 10% in subsequent years, but there's no point in telling you what Open Source product we use - you need to do your homework so you arrive at an answer that you and your boss understand yourself.
You will probably find a number of answers to your criteria - TRY THEM. Give the ones that seem viable in terms of support, community, code quality and your own ability to make it work for your company a good try - most you can even do in parallel. Only after a live test can you decide what you're going with, because you will be investing time in tuning it for your own needs - this is not the time you want to waste. A good preparation is worth 80% of the work for monitoring, or you will spend time monitoring the monitoring system instead which is a waste of your time.
Good luck :-)
Thanks for the info.
Love many, trust a few, do harm to none.
Both Yacoob and Afabbro have some great lists above (especially Afabbros list!). A combination of these features would be an ultimate system.
:))
I work for a company called iQuate - I am the CTO and have been (sometimes literally) developing a monitoring system for about 7 years. We do many of the things mentioned in the 2 lists, but not all (I wish!). We have a product called iQRMS which integrates several functions, the largest of which is monitoring.
- It is agentless - it uses about 30 different protocols (including SNMP obviously!) to connect to remote machines, so it can be deployed very quickly and gives a pretty "true" picture of client connectivity (which sometimes an agent based approach will not).
- It is horizontally scalable (you can have many scanning services on many computers and they will load balance between them).
- It has failover built in - when 1 or more of the scanning services die, the others redistribute the load.
- It has intelligent aggregation of data, recording max, min and average values for any monitor over time - for up to 6 years - in such a way that it doesn't just eat disk and kill performance (that one took a while to crack...)
- It has pretty graphs and in-depth reports on events
- It supports complex (or simple!) escalation rules to control who gets told about what, when and how often when events happen
- It integrates with a helpdesk (its own or others)
- It allows you to create templates of monitors using different protocols to get a wider picture of an issue
- It is easy to understand and designed with 24x7 operations in mind (hence all that failover/scalability)
- It doesn't cost the earth
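For readers wondering what the aggregation bullet above means in practice: the post doesn't say how iQRMS does it, so this is only a generic sketch of rolling raw samples up into fixed time buckets that keep min, max and average. The function name `rollup` and the one-hour default bucket are assumptions for illustration.

```python
from collections import defaultdict


def rollup(samples, bucket_seconds=3600):
    """Collapse (timestamp, value) samples into per-bucket min/max/avg.

    Generic illustration only; not how any particular product stores data.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of its bucket.
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }
```

Once the raw samples for a bucket have been summarised they can be dropped, which is what keeps long histories from eating disk.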
It also doesn't do some of (1 of) the things Timothy mentions at the start of the post (gratz on the new job btw!) - specifically it doesn't create a 2D map of the environment, although there are some plans to implement that in future. It treats and represents devices in the network as groups of hosts - it doesn't display them in relation to physical layout...
Maybe it's worth having a look at it Tim, I can certainly vouch for the support being excellent (but like I say above - I'm biased)
JK
Do you know Pandora FMS? Pandora Flexible Monitoring System is a general-purpose monitoring tool. It was born in 2002 in the IT department of an international finance corporation. The ultimate goal of Pandora FMS is to be an adaptable platform for any organization, able to collect events of any type, generate alarms through a metric system, and represent the collected events in graphs, reports or maps.
Pandora FMS can detect a network interface going down, a defacement of your website, a memory leak in one of your server applications, a delay in your website when the customer pays, or the movement of any value on the NASDAQ new technology market. Pandora FMS can show you the state of your servers, systems, applications and communications, or the sales level of your commercial team.
Pandora FMS is extremely modular and decentralized. The most important component, where everything is stored, is the database (right now only MySQL is supported). Every single component of Pandora FMS can be replicated and work under a pure HA system (Active/Passive) or under a clustered system (Active/Active with balanced load).
Pandora FMS can gather information locally with software or hardware agents:
- Pandora FMS has specific agent software that runs on any operating system (GNU/Linux, AIX, Solaris, HP-UX, BSD/IPSO, and Windows 2000, XP and 2003), gathering data and sending it to a Data server.
- Pandora FMS has a specific hardware agent to which any sensor can be connected. Using it, it is possible to monitor temperature, light, movement, smoke, ...
Pandora FMS can also gather information remotely, without installing software or hardware agents:
- With the Network Server Pandora FMS can monitor any kind of service or port via TCP query, any devices via SNMP, and any communication latency or state via ICMP.
- With the Plugin Server Pandora FMS can monitor any kind of system with complex code. It is compatible with Nagios Plugins.
- With the WMI Server Pandora FMS can monitor any Windows via WMI.
- With the Web Server Pandora FMS can monitor web applications via complex checks. There are two kinds of webchecks: a check for response time and a check for availability. Of course, webchecks do not just make a simple HTTP request to say whether it works or not; they can perform fully complex web operations, like logging in, choosing a parameter from a menu, entering text into a form, expecting a specific response at each step, and making sure that all programmed steps are done correctly before saying "web application response is OK".
- With the Prediction Server Pandora FMS can detect trends. It builds a statistical forecast from past data (up to almost 30 days, in four time references).
- With the SNMP Console Pandora can monitor any device via SNMP traps.
You can visit pandorafms.com.
Bye
I was a long time Nagios user but the manual config changes and management of it was just getting too much, I've recently switched to running a clustered OpsView setup monitoring 2 geographically separated sites and around 1000 devices. It "just works". It is easy to configure and manage, the data warehousing/searching/reporting feature is great, graphing is excellent, dashboard and nagvis elements let you present data nicely and there's even a scheduled reporting tool to email the management a PDF full of pretty graphs every month. Because it's built on Nagios there's a plugin for monitoring just about anything you can think of and it's free! I don't work for OpsView, I'm just a fan!
You openly admit that you monitor thousands of people's PC's without their consent and probably without their knowledge? I would be ashamed. Collected any good blackmail material yet?
I piss off bigots.
Most monitoring systems only check each server once every 5 minutes, giving an average of 2.5 minutes between error and alert; with a customer, you'll be out of bed at 3 in the morning in a matter of seconds :-)
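The 2.5-minute figure follows if a failure is equally likely at any moment in the polling interval and is only noticed at the next poll. Here is a quick check of that arithmetic in Python, assuming instantaneous checks and no notification latency (both simplifications); `mean_detection_delay` is just an illustrative name.

```python
def mean_detection_delay(interval_seconds, steps=1000):
    """Average wait until the next poll, for failures spread evenly
    across one polling interval (sampled at the midpoint of each step)."""
    total = 0.0
    for i in range(steps):
        failure_time = (i + 0.5) * interval_seconds / steps
        total += interval_seconds - failure_time  # time left until next poll
    return total / steps


# Mean delay in minutes for a 5-minute (300-second) polling interval.
print(mean_detection_delay(300) / 60)
```

The average comes out at half the interval, so about 2.5 minutes for a 5-minute poll, and the worst case is the full 5 minutes.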
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
I've found Nagios and NagVis a solid solution. NagVis is a plugin/addon for Nagios which allows you to create a Heads-Up display with status information on your own network diagram. It has an interactive map, which is 100% customizable. When kept in a browser window, it will play a sound during an event and flash the icon for a host indicating the problem.
PNP adds graphing of performance data to Nagios. It allows you to click through the nagios interface directly to the graphs for a given host or process. It will graph anything that has performance data output.
Finally, Cacti is a great solution for things which you may not roll into your Nagios installation. We use it for monitoring network bandwidth utilization, mostly.
[ this a general comment for anyone interested in Zabbix ]
I'm currently working on a project in a Hospital. We chose Zabbix (1.6.4) and myself and a colleague have set it up. It's currently monitoring 169 hosts (mixture of Windows, Tru64 Unix, Linux and UPSes) with 7108 items monitored and graphed and 1,704 trigger alarms. And this is really small potatoes compared to what Zabbix can handle. We've not had any issues with the Zabbix server processes. It runs just fine.
What I really like the most is its graphing capabilities: the ability to zoom in on a section of a graph by just dragging and selecting the time interval you want. This also works with "Favourite Screens" containing multiple graphs, where selecting a time window on one graph automatically zooms the rest of the graphs on that screen at the same time.
There's an excellent Firefox plugin too that gives you a summary icon at the bottom of the browser, a single bar to display the most recent event, and clicking on the icon slides a tabbed pop-up frame into view that contains all the information you'd expect to see on the Dashboard.
Like any monitoring tool of this size though, getting all the data in for the systems you want to monitor is time consuming. My advice would be to concentrate first on getting a few systems just how you want them, then work out whether it's faster for you to clone them and modify the system-specific information, or to export the XML for those systems, use those as templates for other systems/hosts of the same type, and work with XML import.
Zabbix gets a big thumbs up from me.
We used OpManager in production for over a year. It has terrible Linux support. None of their built-in plugins worked properly for monitoring even basic parameters like disk space, free memory, CPU usage, etc. When we pointed this out to their support people, they said we should build our own plugins with SNMP OIDs. Um....no. Not for the amount of money we paid for that steaming POS. We finally kicked OpManager to the curb about a month ago, and have our entire environment, Windows and Linux servers being monitored with Nagios. Nagios scales well, we are currently watching several hundred hosts and about 3500 services.
OpenNMS is also a good tool, its ability to map servers back to switch ports is extremely handy.
Unfortunately (or not), Windoze based. My experience with it started out unstable and feature poor at version 4, but it kept the relatively inexpensive (core, support, and add-on) price tags, and features kept getting better, and stability continues to improve at version 7.1. Remote windows and java consoles, remote pollers, SNMPv3, easy custom MIB compiles, functional dependencies, device grouping, custom alarms, restricted console views, packaged third party paging and email, custom tool integration, easy maps, acceptable (to me) TCP service monitoring and third party script support. Reporting is also integrated, or use the up-featured SQL add-on. I'm using it for just shy of a couple thousand devices on a single modest server. It's been able to accommodate every NMS feature I need, and a great many wants. My only real gripes are: console authentication still doesn't have a RADIUS, LDAP, or AD hook, and I'd like a Linux port for the backend. Other than that, it's shamefully simple to get new staff up and running, and it requires very little care and feeding. Good luck with your search.
HypericHQ would be good
PandoraFMS is another option.
ZenOSS and Zabbix are popular too
The biggest problem we've seen from a monitoring perspective is that most systems really do have a hard time scaling to large levels while staying usable. [A common trick (and one we employ) is to have a multi-tier monitoring system in place, where one monitoring stack monitors the monitoring stack that is actually watching the service/hosts.]
Once one gets past that hurdle, the tricky part is dealing with the "it is OK if X% of my machines are down". Most monitoring systems that I've dealt with are based around the view that they are monitoring a single host/single service and not a collection of hosts where it is OK if chunks disappear. For those types of problems, one still ends up writing a lot of custom smarts it seems.
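To give a flavour of the "custom smarts" in question, here is a minimal sketch of a pool-level check that tolerates a fraction of hosts being down. It is not taken from any particular monitoring product; the names `pool_status` and `max_down_fraction` and the OK/WARNING/CRITICAL labels are assumptions for illustration.

```python
def pool_status(host_up, max_down_fraction=0.10):
    """Classify a pool of interchangeable hosts.

    `host_up` maps hostname -> bool (True means the host is up).
    Returns "OK", "WARNING", "CRITICAL", or "UNKNOWN" for an empty pool.
    """
    if not host_up:
        return "UNKNOWN"
    down = [host for host, up in host_up.items() if not up]
    fraction_down = len(down) / len(host_up)
    if fraction_down == 0:
        return "OK"
    if fraction_down <= max_down_fraction:
        return "WARNING"  # degraded but within tolerance: no need to page
    return "CRITICAL"
```

With a 20-host pool and the default tolerance, losing 2 hosts is only a warning, while losing 3 is critical.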
The solution is real simple. If you can program in anything then Hobbit/Xymon with Devmon is your only choice.
Create your own Weather Map for 2D, you never need a full 2D map of 5000 hosts... Less is more.
1. Free
2. Fully customizable
3. Easy administration
4. Offers clients for all the major OS (And quite a few minor ones)
5. Large support base (Users with high technical level)
6. Nice author (Replies to comments and considers all ideas)
7. You can write a test for anything you can think of and easily add it into hobbit
8. Offers client/server monitoring, remote monitoring, script monitoring, snmp monitoring(devmon) or scripts
The possibilities with Hobbit are endless
Personally I use Hobbit to monitor over 2400 devices, including Cisco hardware, AIX, Windows Servers, VMware Clusters, Exchange, Sharepoint etc.etc.etc.etc.
I've never encountered a system I could not monitor with Hobbit (Or scripts that send their results into hobbit).
For the price, I am a big fan of SolarWinds Network Performance Monitor (NPM) and ipMonitor. Together, these give me the ability to track and monitor as many devices as I want. They both have network discovery, and can monitor network devices, servers, workstations, and applications. I have not tried the Application Monitor add-on for NPM as a replacement for ipMonitor, but it looks like it would work very well. I have written several C# scripts to augment ipMonitor for the custom applications I need to monitor. The only downside I can see for you is that these are Windows-based products, and NPM requires Microsoft SQL 2005. I love the map capabilities in NPM and the graphs it makes. I am able to alert on any OID I collect data on, which is a plus. Also, NPM has thousands of MIBs already installed, which makes finding an OID much easier. Best of all, NPM supports OID tables, which makes my monitoring very dynamic. I have, for example, created an alert on disk partitions getting more than 95% full. I do not have to worry about making a monitor for each possible partition, or even worry about how many partitions are on a server. I just monitor the OID table. As long as I have set SNMP up properly (of course), I see all the partitions I care about. ipMonitor I use mostly for application monitoring. In this respect, it is very nice, since I can execute custom scripts. For your scenario, I would seriously look into NPM. This is a very easy product to learn, very powerful, and can be fairly cheap to implement.
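The table-based alert described above boils down to evaluating one rule against every row of a table, however many rows a server happens to have. A minimal sketch with plain dictionaries standing in for SNMP table rows follows. This is not SolarWinds' API, and the field names `mount`, `size_units` and `used_units` are hypothetical, loosely modelled on the size and used columns of the HOST-RESOURCES-MIB storage table.

```python
def partitions_over(rows, limit_percent=95.0):
    """Return (mount, used_percent) for every row above the limit.

    `rows` is a list of dicts, one per table row; no per-partition
    configuration is needed because the rule is applied to all rows.
    """
    alerts = []
    for row in rows:
        if row["size_units"] == 0:
            continue  # skip pseudo-filesystems that report zero size
        used_percent = 100.0 * row["used_units"] / row["size_units"]
        if used_percent > limit_percent:
            alerts.append((row["mount"], round(used_percent, 1)))
    return alerts
```

Adding a partition to a server just adds a row, and the same rule picks it up on the next poll.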
In an orchestra, you need different instruments, musicians, and a Maestro who knows the whole score and can render it. Nagios does not play like Cacti, MOM, Quest, etc., even if they play in the same sandbox. What I mean is that your environment may already have monitoring solutions specific to each systems management discipline: databases, servers, network devices... Each has its own user, admin and configuration consoles, and maybe its own agents. You need to integrate all these monitoring solutions (at least the most important, valuable and easy ones). If you drop them all and go for a single one from scratch, it will sound like a ring tone in the hall of an airport at 10:00 AM. You need a chief, a manager, a Maestro: an Event Manager (centralized, or distributed if needed). The role of this Event Manager is to take the most valuable monitoring information from these tools and consolidate all their alerts using a single syntax and semantics. Consolidation (enrichment, correlation, filtering, reaction) of the alerts reported by the monitoring tools to an Event Management Solution is like rehearsing again and again with the orchestra for the day of the concert. This Event Management Solution will be the Maestro. All the musicians (sysadmins) will recognize their score and their instrument (their monitoring tool) when the music plays. => Investigate an Enterprise Event Management Solution OVER your current monitoring instruments. Look for the most flexible one (easy to integrate, easy scripting), the one able to speak through multiple and SIMPLE protocols, the one that will serve you the root cause of all your troubles on a gold plate when the poultry starts making noise. There are some of them on the market (Open world included).
Music Maestro !
We have similar goals with our project Clearsite.sourceforge.net. We've learned our lessons and think we can begin taking on the likes of SolarWinds, OSSIM, ZenOss, SpiceWorks etc... We made the mistake of being too geared toward one vendor (Cisco), but no longer. We're making the software work for us; we're not working with the software. Creating a Snort interface that highlights the portion of the packet that trips the content rule, being able to note FPs, highlight the portion of the packet that's a FP, and have it added to the rule once you click submit. Some user-agent rule goes off, but it's your own app: highlight the user-agent your app uses, click submit, and content:!"user-agent: xyz"; gets added to a display filter and/or the actual sig itself. A Snort rule is triggered for BitTorrent being used: a cron job connects via WMI, SNMP or SSH to the host, effectively runs a netstat -abn, and figures out the process and location of the executable that triggered the rule; failing to get such a result back might further point to a FP or a machine not under your control. If no contact, check the MAC address DB to see if it's one of yours; if not, snmp set fa0/22 disable. Proactive. Naturally there are more checks and balances in there, but that's where we're heading with just the Snort portion. Again, making the software work for us. As always we'll use our very popular ajax search for everything we can. http://clearsite.blogspot.com/search?updated-min=2007-01-01T00%3A00%3A00-08%3A00&updated-max=2008-01-01T00%3A00%3A00-08%3A00&max-results=3 -rich (google: xinn.org contact)
You bring up an excellent point: ZODB doesn't do any referential or data-type integrity checking; it's pretty much just a dumb (though rather concurrent and durable) graph store. Thus, ZODB-using apps have to take care of data integrity themselves or else interpose another layer (which you'd want to do in a "shared" situation like you mention).
I guess that's the tradeoff ZODB makes: really fast and agile development (no schemas to maintain, etc.) in exchange for no particular constraint enforcement. In practice, the latter is mitigated (and lots of painful debugging saved) through use of constraint-enforcement frameworks like Archetypes, but that still makes me queasy in a multi-app situation, as you'd have to make sure everybody uses the framework.
Personally, I'm both a ZODB and a Postgres wonk. What I'd love to see is the best of both worlds: a language-agnostic graph DB with internal constraint enforcement and, as my pony, a declarative ad hoc querying language. :-)
We are using a combination of Cacti and mon to monitor about 200 devices, both network gear and PC servers. Cacti is used to graph performance data (bandwidth, CPU, memory, temperature) and to provide maps for the visually inclined, while mon does the actual service monitoring and alerting.
I won't comment on Cacti, since it has been mentioned here already, though I will say that you CAN change the default behavior of "sample averaging" by increasing the size of the RRD database. There are discussions on the Cacti forum/wiki that cover this topic.
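As a rough illustration of what "increasing the size" means in numbers: an RRD archive that keeps one un-averaged sample per polling step needs one row per step for the whole retention period. The sketch below only does that arithmetic; the step and retention values are made-up examples, not recommendations from the Cacti forum.

```python
def rra_rows(step_seconds: int, retention_days: int) -> int:
    """Rows an archive needs to keep one un-averaged sample per step."""
    return retention_days * 86400 // step_seconds

# Illustrative values only: 5-minute polling kept for a year,
# and 10-second polling kept for 30 days.
one_year_at_5_min = rra_rows(300, 365)   # 105120 rows
thirty_days_at_10_s = rra_rows(10, 30)   # 259200 rows
```

The shorter the polling interval, the faster the row count grows, which is worth checking before applying it to every data source.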
Mon, on the other hand, I didn't see mentioned at all, so here's my blurb on it. The core of mon is a scheduler written in Perl, which handles running monitor tests (also Perl, or any script/program that can exit with a 1/0) and then alerting (also Perl or other languages, and able to do more than just send mail or page) when necessary, based on the configuration for that service. Like most open source projects, it is extremely flexible if you can make the initial time investment to set up your tests and dependencies correctly; once this is done, the tests/alerts can be reused or further modified. There are quite a few monitor tests and alert scripts already included, along with some handy tools for interaction through a web browser (via moncgi), generating dependency trees, generating reports, and more. There's also a Perl module, Mon::Client, that provides an API for interacting with the mon scheduler. The downside, besides configuring it with a text file (m4 can be helpful here), is that there hasn't been any activity since 2007 (according to the CVS repo on SourceForge).
It's probably not the solution for an extremely large number of hosts (though resource-wise it could handle it), but maybe someone else can benefit from it. If you need very specific tests (number of BGP routes, verifying NH on routes, customer redundancy) and smart alert logic, it's worth looking at.
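Since a monitor test can be any program that exits with a 1/0, here is a rough sketch of one in Python rather than Perl. The conventions it assumes (hosts passed as command-line arguments, failed hosts printed on the first output line, port 80 as the default) are my assumptions for illustration, so check the mon documentation before relying on them.

```python
import socket
import sys

def tcp_probe(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_hosts(hosts, probe=tcp_probe):
    """Run probe on each host; return (exit_code, summary_line).

    exit_code is 0 when every host passes and 1 when any host fails.
    summary_line lists the failed hosts, separated by spaces.
    """
    failed = [h for h in hosts if not probe(h)]
    return (1 if failed else 0), " ".join(failed)

def main(argv):
    """Check the hosts named in argv[1:], print the summary, return the exit code."""
    code, summary = check_hosts(argv[1:])
    print(summary)
    return code
```

Ending the file with sys.exit(main(sys.argv)) turns it into a standalone script. Passing the probe in as a parameter keeps the pass/fail logic testable without touching the network.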
I want DRDs.....
I would suggest you give GroundWork a go. It is an amalgamation of all the best open source monitoring tools previously mentioned in these comments, such as Nagios and Cacti, but fully integrated into one interface, which reduces complexity.
GroundWork Open Source uniquely combines the most mature and successful open source projects available today into a single package. These amalgamated projects have been downloaded more than 4 million times over the last decade and have a strong codebase and a strong community behind them.
Combining these projects into a single package that is commercially supported gives you a simplified deployment experience, a single console for managing and monitoring, and a comprehensive view of your IT operations efficiency.
Other GroundWork Monitor advantages include open APIs and an open event manager so the information collected by GroundWork Monitor can be bi-directionally shared with your other ITSM systems such as an asset tracker, a ticketing system or a CMDB.
But the best part about GroundWork Monitor is the low cost for the enterprise IT management system and its fast return on investment and value.
GroundWork amalgamates and supports established, mature projects including:
* Nagios® - for event handling and notification
* SNMP and SNMP-TT - for network management protocol
* RRDtool - for underlying data collection and management
* BIRT - for ad-hoc reporting
* Ganglia - for grid and cluster monitoring
* Cacti - for graphing and trending
Each of these projects and more are included with each GroundWork Monitor edition tier. GroundWork Monitor has three editions: Enterprise, Professional and Community Edition. All editions are based off of the same codebase so trade-up is easy.
It's based on Debian. You can run scripts as an action, it can page and email, and it has a PostgreSQL backend (very easy to back up). It can do SNMP v1-3, syslog, traps and port monitoring graphs, and it can monitor CIFS/NFS volumes, Linux/Windows services, etc.
http://www.netmon.ca
I recently went through the process of evaluating new monitoring software for my company as well, and which product you go with really depends on what your needs are. I have also worked with many of the products that you noted, and they all have their strengths and weaknesses. If you aren't familiar with it, OpenNMS really is a great product. I have used it for years, have attended the training offered by the OpenNMS Group, and also had the commercial support for it, which was fantastic. Tarus and the rest of the guys at the OpenNMS Group are all fantastic to work with, and the community support is awesome as well. If you are used to working with open source products and are familiar with RRD (which you likely are from using Cacti), then OpenNMS is absolutely worth a try. If you have some budget to work with and you want a more commercial solution that is MS Windows based, I would suggest you take a look at SolarWinds Orion. Orion is a great monitoring solution for a good price. SolarWinds also offers several modules for Orion for things such as high-level application monitoring, VoIP monitoring, network device configuration management, IP address space management, etc. Like I said, it all really comes down to your needs and where your personal comforts are.
Totally separate the data collection from the user interface. Keep both of those totally separate from the system that selects and delivers the alerts. Make sure the system as a whole won't make the problem worse (e.g. if you lose a major piece of infrastructure, will it send you 300 alerts?)
The Orchestra and the Maestro is one of the best analogies I've heard to describe the role of the Event Management system. RapidInsight aspires to be the Maestro over all other management tools, consolidating IT management information from the different tools used to monitor systems, networks and applications, and also fault, performance, config/change, tickets, etc.