What Would You Want In a Large-Scale Monitoring System?
Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
What limitations exist in current solutions that justify developing a new one from scratch?
MRTG does it right...most of the others do it wrong
When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems
So your 380Mbps peak that you had an hour ago is fine on today's graph
But tomorrow, when you look at "yesterday's" graph...the peak is down to 100Mbps
and next week, when you look at "last week's" graph...there's a little 50Mbps peak
Damnit... I want to keep information on my peaks for capacity planning!
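To make the complaint concrete, here is a minimal Python sketch of a rollup that stores the maximum alongside the average for each consolidated bucket, so the peak survives. The function name `consolidate`, the bucket size, and the sample values are all made up for illustration; this is not how any particular tool implements it (RRDtool-style stores expose a similar idea as separate AVERAGE and MAX consolidation functions).

```python
from statistics import mean

def consolidate(samples, bucket_size):
    """Roll raw samples up into buckets, keeping both the average and the peak."""
    rolled = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        rolled.append({"avg": mean(bucket), "max": max(bucket)})
    return rolled

# Hypothetical traffic samples in Mbps: mostly quiet, with one short burst.
samples = [40, 45, 50, 380, 42, 43]

rolled = consolidate(samples, bucket_size=6)
print(rolled)  # the average flattens the burst; the max still shows 380
```

Plotting only `avg` reproduces the shrinking-peak effect described above; keeping `max` per bucket costs one extra number per consolidated point.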
I've only tried OpenNMS. It looks very powerful, but wasn't at all hard to install and configure on Ubuntu. It figures out the type of each node it discovers, shows useful data through SNMP, can also do uptime monitoring, and is generally very scalable and configurable if needed.
I just did a quick survey and evaluation of the open source monitoring market for my company, and found shortcomings and frustrations in a few areas where none of the evaluated systems seems to get it 100% right.
Transparent Planned Design
Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.
Event management
Does anyone know of a solution that can both receive syslog messages and decode traps with a given MIB, and then apply some simple logic, such as squashing repeats, displaying events on a web page with archival options, and dispatching to mail/SMS based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.
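The "simple logic" part could look something like the Python sketch below. It only covers squashing repeats and rule-based dispatch; receiving syslog and decoding traps against a MIB are left out. Every name here (`Event`, `EventManager`, the channel strings, the 300-second window) is a hypothetical choice for illustration, not taken from any existing product.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    source: str    # e.g. "syslog" or "snmptrap"
    host: str
    severity: str  # e.g. "info", "warning", "critical"
    message: str

@dataclass
class EventManager:
    window: float = 300.0                       # seconds in which repeats are squashed
    rules: list = field(default_factory=list)   # (predicate, channel) pairs
    _sent: dict = field(default_factory=dict)   # key -> time of last dispatch
    _squashed: dict = field(default_factory=dict)  # key -> number of squashed repeats

    def add_rule(self, predicate, channel):
        self.rules.append((predicate, channel))

    def handle(self, event, now=None):
        """Return the channels to notify, or [] if the event is a squashed repeat."""
        now = time.time() if now is None else now
        key = (event.host, event.message)
        last = self._sent.get(key)
        if last is not None and now - last < self.window:
            self._squashed[key] = self._squashed.get(key, 0) + 1
            return []
        self._sent[key] = now
        return [channel for predicate, channel in self.rules if predicate(event)]

mgr = EventManager(window=300)
mgr.add_rule(lambda e: e.severity == "critical", "sms")
mgr.add_rule(lambda e: True, "mail")

ev = Event("snmptrap", "core-sw1", "critical", "linkDown ifIndex=3")
print(mgr.handle(ev, now=0))    # first occurrence: dispatched
print(mgr.handle(ev, now=60))   # repeat inside the window: squashed
print(mgr.handle(ev, now=400))  # window has expired: dispatched again
```

Rules are plain predicates here so that new conditions (host groups, time of day) can be added without touching the manager itself.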
Modularity/Seamless Integration
Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g., is project X missing an event manager, or is the built-in one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Black-box solutions are OK as long as they serve all my needs and have clear interfaces for combining with other solutions that serve related needs, but sadly no solution we evaluated does everything we need, and we end up struggling with manual routines to compensate.
Complexity
There are a few really neat systems that do almost everything one can ask for (short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and they also do a lot of things we didn't want. Since they're often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.
The Perfect Monitoring System
After evaluating all the options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrate well and together fit our needs, which I personally see as a bigger problem.
What I would really want to see in the world of open source monitoring is an ecosystem of monitoring apps with an overarching design/architecture: a framework where the different entities and steps in monitoring are clearly defined and interfaced with each other, but which still allows for differing implementations and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700 Mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an appliance, but in The Perfect Monitoring System, it should be clear to the average-skilled hacker how to integrate it.
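As a rough illustration of what "clearly defined and interfaced" might mean in practice, here is a small Python sketch of a plugin-style check interface. Everything in it is an assumption made up for this example: the `Check` base class, the `register` decorator, the `REGISTRY` dict, and `VideoStreamCheck`, whose bitrate is passed in as a plain number standing in for a real stream analyzer.

```python
from abc import ABC, abstractmethod

class Check(ABC):
    """One pluggable probe; the framework relies only on this interface."""
    name: str

    @abstractmethod
    def run(self) -> dict:
        """Return a result of the form {"ok": bool, "detail": str}."""

REGISTRY = {}  # check name -> check class, filled in by @register

def register(cls):
    """Class decorator that makes a check discoverable by name."""
    REGISTRY[cls.name] = cls
    return cls

@register
class VideoStreamCheck(Check):
    """Hypothetical site-specific check; measured_mbit stands in for a real analyzer."""
    name = "video_stream"

    def __init__(self, measured_mbit=700.0, min_mbit=650.0):
        self.measured_mbit = measured_mbit
        self.min_mbit = min_mbit

    def run(self):
        ok = self.measured_mbit >= self.min_mbit
        return {"ok": ok, "detail": f"{self.measured_mbit:.0f} Mbit measured"}

def run_all(checks):
    """Run every check instance and collect results keyed by check name."""
    return {check.name: check.run() for check in checks}

print(run_all([VideoStreamCheck()]))
```

The point of the sketch is that the unforeseen check only has to implement `run()`; scheduling, event handling, and notification could then live behind similarly small interfaces.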