What Would You Want In a Large-Scale Monitoring System?
Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.
OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.
http://www.opennms.org/
Enjoy.
You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.
I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for Openview. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.
You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.
He said he was asked to "develop a new solution" - which most likely means he gets to pick and choose what to implement, whether parts of it are custom developed or off the shelf. I would imagine a good solution would be a core product plus custom built extensions for the features he needs that the product doesn't implement itself.
I use Nagios and some custom rolled scripts myself.
For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga.
Reconnoiter also looked pretty kewl. They haven't released anything yet, but it looks like they are planning for it to be very scalable.
Needed features in random order:
* Scalability - few k machines is minimum. This probably means smart, decentralized collection and aggregation of data.
* Flexible whitebox monitoring - for a given class of devices, I should be able to configure how to fetch this device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
* Flexible blackbox monitoring - for a given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how the results of that action should be interpreted (ok/nok, time to complete, etc.).
* Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
* Some language to easily calculate derivative values from the data above.
* Interface for defining graphs, using collected data.
* ...and a system for annotating the above. Raw data is neat, annotated data is even better.
* Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
* (nice to have) HTTP server with a simple HTML templating, to allow for easy creation of arbitrary dashboards.
* (if you have the above) predefined templates for most of common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
* hooks/libraries to use collected data "outside" of the system
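To make the blackbox and tag/aggregate items a bit more concrete, here is a rough Python sketch using only the standard library. Everything in it is an assumption made up for illustration: the names tcp_check and aggregate_by_tag, the sample layout (a dict with "tags" and "value"), and the choice of a plain TCP connect as the probe. It is not how any of the tools mentioned in this thread work.

```python
import socket
import statistics
import time
from collections import defaultdict


def tcp_check(host, port, timeout=2.0):
    """Blackbox probe (hypothetical helper): try to open a TCP connection.

    Returns (ok, seconds), i.e. the ok/nok result and the time to complete.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        ok = False
    return ok, time.monotonic() - start


def aggregate_by_tag(samples, tag):
    """Group samples by one tag and summarise their 'value' per group.

    The sample layout {"tags": {...}, "value": float} is an assumption.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tags"][tag]].append(sample["value"])
    return {
        key: {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "sum": sum(values),
        }
        for key, values in groups.items()
    }
```

You would run something like tcp_check("some-host", 22) from each collector, store the elapsed time as the sample value with tags such as machine and network segment, and then call aggregate_by_tag(samples, "segment") to get per-segment numbers for graphs or alerts.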
I realize that's a lot, but boy, such system would be very useful and flexible.
-- we're here you're not
As an aside, SCOM is a good product, but be sure you have (and are willing to invest) the time to configure it to match your environment. Just because it's also made by MS and has management packs for all of their products doesn't mean you can just flip the on switch and have everything monitored. You will almost certainly be flooded with useless alerts, and not alerted for things that you do care about.
"...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
I saw SCOM 1 year ago; the hardware requirement for just the client was higher than for the whole Cacti server.
Unless they start to optimize the mess I'm not sure I want to use it.
Love many, trust a few, do harm to none.