What Would You Want In a Large-Scale Monitoring System?
Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-scripts solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
Publish them in DNS, and have the NSA monitor them for me!
"Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
Hyperic HQ may be worth checking out.
That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.
OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.
http://www.opennms.org/
Enjoy.
What limitations exist in current solutions that justify developing a new one from scratch?
I am going through this right now and am using and have used all the above mentioned solutions. We are leaning towards System Center Operation Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.
You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.
You can pry my GKrellM from my cold, dead hands!
Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!
*cough*
MRTG does it right...most of the others do it wrong
When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems
So your 380Mbps peak that you had an hour ago is fine on today's graph
But tomorrow, when you look at "yesterday's" graph...the peak is down to 100Mbps
and next week, when you look at "last week's" graph...there's a little 50Mbps peak
Damnit... I want to keep information on my peaks for capacity planning!
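To make the peak-loss complaint concrete, here's a rough sketch of the same samples rolled up two ways, once by averaging and once by keeping the maximum (RRDtool offers MAX next to AVERAGE as a consolidation function for exactly this reason). The sample values and the consolidate helper are made up for illustration; this is not how any particular tool stores its data.

```python
from statistics import mean

def consolidate(samples, bucket_size, func):
    """Roll fine-grained samples up into coarser buckets using func."""
    return [func(samples[i:i + bucket_size])
            for i in range(0, len(samples), bucket_size)]

# Hypothetical 5-minute readings in Mbps: one short burst to 380.
samples = [40, 50, 380, 60, 45, 55, 50, 40, 60, 50, 45, 55]

hourly_avg = consolidate(samples, 12, mean)  # the burst is averaged away
hourly_max = consolidate(samples, 12, max)   # the burst survives the rollup

print("average:", hourly_avg)
print("max:    ", hourly_max)
```

If you keep both series, the average still tells you about typical load while the max series is the one to use for capacity planning.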
I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for OpenView. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.
You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.
I use Nagios and some custom rolled scripts myself.
For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga.
Reconnoiter also looked pretty kewl. They haven't released anything yet, but it looks like they are planning for it to be very scalable.
I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).
If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.
The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.
MRTG also integrates w/Nagios, which can be useful.
Good luck.
Cons:
The mistake is trying to monitor thousands of devices on a 2-D map. It'll look pretty to the suits, but be useless for the users. Nothing but endless slow clicky clicky clicky.
Give them a text screen of what's currently down ... that'll work.
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Here's a similar thread from a while back that covers most of the options: http://linux.slashdot.org/article.pl?sid=07/03/05/1812247
I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.
I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.
The system did push and pull operations. First, the system was built in PHP.
In order to push commands to the servers I used the PEAR SSH2 class for communication once it became stable. Another option (and what I did back in 2003) was to use exec and other command line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command line output from PHP "true" rootly powers. The problem was I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real time input/output, so I designed the system to work by creating an SSH2 key pair on my master monitoring server and putting its public key on each of our external servers for passwordless SSH.
The pull part of the system simply had a PHP script running on a cron per server, that would deliver information about the health of the server, its running processes, etc, to the main SMS server every 5 minutes. All load activity for all servers was logged as well to MySQL. The push operations were used to update those scripts, as well as restart daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to where we could perform a FULL Database Update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc, in under an hour. We would then monitor the servers using the SMS's main screen which showed real time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to rollback the update, that was a simple mouse click away too.
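For anyone who wants a feel for what such a per-server cron script might look like, here's a minimal sketch in Python rather than PHP. It is not the SMS code: the function names and the report fields are hypothetical, and it only builds the JSON payload, leaving the delivery to the central server (HTTP POST, scp, whatever you prefer) out of the picture.

```python
import json
import os
import shutil
import socket
import time

def collect_health(path="/"):
    """Gather a few basic health figures for this host (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "load": [load1, load5, load15],
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }

def to_payload(report):
    """Serialize the report so it can be shipped to the central server."""
    return json.dumps(report, sort_keys=True)

if __name__ == "__main__":
    # Run from cron every 5 minutes and pipe the output to your transport.
    print(to_payload(collect_health()))
```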
I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.
I use Cacti for this and it's fantastic.
Love many, trust a few, do harm to none.
I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.
What you want in large-scale monitoring is:
Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.
Conformity is the jailer of freedom and enemy of growth. -JFK
I use Cacti, with THold and weathermap plugins.
But then I'm biased.
-- I care not for your foolish signatures.
Was a step above Nagios in terms of reliability (I didn't have to restart the server four times a day just to keep it running), and did much better at autodiscovery.
The fact that it is also NRPE compatible was a plus - I could use all the Nagios plugins and check scripts I'd written.
I was also planning on using it to launch a more aggressive webmin-style management solution - since OpenNMS built this great database of data about my devices and hosts, I could use it to do actual management - change data/settings.
Cons: It's a Java/Tomcat tool, as much as that is really a con. It's not like you need to run Jboss or Websphere to use it (though I suppose you could).
I just did a quick survey and evaluation of the open source monitoring market for my company, and found a few shortcomings/frustrations in a few aspects where none of the evaluated systems seems to get it 100% right.
Transparent Planned Design
Many solutions out there seem to have been developed in what can only be described as an "organic" process. I.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web-interface, got some more features, then something central was ripped out and replaced to allow yet more features and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.
Event management
Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web-page with archival-options, and dispatching to mail/sms based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.
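The "simple logic" part of that wish list is small enough to sketch. The following is only an illustration of squashing repeats and dispatching by rule, with syslog reception and MIB decoding left out entirely; the Event fields, the rule table, and the destination names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    host: str
    severity: str  # e.g. "info", "warning", "critical"
    message: str

def squash_repeats(events):
    """Collapse consecutive identical events into (event, count) pairs."""
    squashed = []
    for ev in events:
        if squashed and squashed[-1][0] == ev:
            squashed[-1] = (ev, squashed[-1][1] + 1)
        else:
            squashed.append((ev, 1))
    return squashed

# Hypothetical routing rules, checked in order; the first match wins.
RULES = [
    (lambda ev: ev.severity == "critical", "sms"),
    (lambda ev: ev.severity == "warning", "mail"),
]

def route(event, rules=RULES, default="archive"):
    """Return the destination for an event, or the default if no rule matches."""
    for predicate, destination in rules:
        if predicate(event):
            return destination
    return default

events = [
    Event("sw1", "critical", "link down"),
    Event("sw1", "critical", "link down"),
    Event("web3", "info", "cron finished"),
]
for ev, count in squash_repeats(events):
    print(route(ev), f"x{count}", ev.host, ev.message)
```

Keeping the rules as plain data means they could just as well be loaded from a config file, which is the "configurable" part of the request.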
Modularity/Seamless Integration
Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. I.e. is project X missing an event manager, or is the builtin one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Blackbox solutions are OK as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution evaluated does everything we need it to, and we end up struggling with manual routines to compensate for it.
Complexity
There are a few really neat systems that do almost everything one can ask for. (Short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they are then often not very modular, it is hard to get them to stop doing the things we don't want, or change the things we need implemented slightly differently. Also the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large more or less opaque systems is not a fun task.
The Perfect Monitoring System
After evaluating all options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (Although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrates well, and together fits our needs, which I personally see as a bigger problem.
What I would really want to see in the world of Open Source Monitoring is an eco-system of monitoring apps with an overarching design/architecture. Design a framework where different entities and steps in the monitoring are clearly defined and interfaced with each other, but that still allows for differing implementations, and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an appliance, but in The Perfect Monitoring System, it should be clear for the average-skilled hacker how to integrate it.
You certainly want to check ZenOSS out then. Our instance monitors JVMs across our deployment for everything from heap size to open file descriptors to uptime for the jvm specifically-- all of which can be alerted on if desired.
Needed features in random order:
* Scalability - a few k machines is the minimum. This probably means smart, decentralized collection and aggregation of data.
* Flexible whitebox monitoring - for a given class of devices, I should be able to configure how to fetch this device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
* Flexible blackbox monitoring - for a given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how results of that action should be interpreted (ok/nok, time to complete, etc.).
* Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
* Some language to easily calculate derivative values from the data above.
* Interface for defining graphs, using collected data.
* ...and a system for annotating the above. Raw data is neat, annotated data is even better.
* Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
* (nice to have) HTTP server with a simple HTML templating, to allow for easy creation of arbitrary dashboards.
* (if you have the above) predefined templates for most of common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
* hooks/libraries to use collected data "outside" of the system
I realize that's a lot, but boy, such system would be very useful and flexible.
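As a rough idea of what the tag-and-aggregate item could look like in practice, here's a small sketch. The sample layout (a value plus a dict of tags) and the aggregate_by_tag helper are assumptions made for the example, not taken from any existing product, and percentiles are left out to keep it short.

```python
from collections import defaultdict
from statistics import mean, pstdev

def aggregate_by_tag(samples, tag):
    """Group sample values by one tag and compute summary statistics per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tags"][tag]].append(sample["value"])
    return {
        key: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stddev": pstdev(values),  # population standard deviation
            "sum": sum(values),
        }
        for key, values in groups.items()
    }

# Hypothetical CPU-load samples tagged by network segment and machine.
samples = [
    {"value": 10, "tags": {"segment": "dmz", "machine": "web1"}},
    {"value": 30, "tags": {"segment": "dmz", "machine": "web2"}},
    {"value": 5, "tags": {"segment": "lan", "machine": "db1"}},
]

for segment, stats in aggregate_by_tag(samples, "segment").items():
    print(segment, stats)
```

The same function works for any tag, so aggregating per machine or per source is just a different second argument.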
-- we're here you're not
If you're monitoring something of that scale, you should probably look into a professional solution.
I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using Nagios. Though I've heard there has been some improvement.
I find a human monitoring system to be the most reliable. There is always someone to fire, if something goes wrong.
Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web GUI. It's not perfect, but it keeps improving and the attitude of the developers is friendly, unlike some of the other projects.
I don't believe in agents. I refuse to install anything not needed on the server. SNMP should be enough for all the information, unfortunately this is not the case. So I use WMI and netbios querying.
To be fair, I wouldn't say the Zope database (ZODB) is not a "solid foundation". It's one of the best parts of the Zope stack and, in 3 years of dozens of clients using it in Zenoss, Plone, and other apps, I've never had it corrupt or lose any data. It's a proper DB--ACID, MVCC, and all that--and you can even lop transactions off the storage to go back in time. Don't expect it to be a relational DB with the ad hoc query tools typical thereof; it's an object DB, with the aim of persisting graphs of Python objects transparently.
Now, if you aren't familiar with it, the ZODB can indeed seem opaque, but, just like any DB, there are tools to read and modify it. At the highest level, just stick "manage" after your Zenoss URL, e.g. http://example.com/zport/dmd/manage . That'll get you into the web-based Zope Management Interface (colloquially, "the ZMI"), where you can poke around at any object that someone's bothered to write a UI for. Deeper than that, you can connect to ZEO (a server that brokers access to the ZODB over a socket) and mess with the object graph using normal Python. When you're done, "import transaction; transaction.commit()". (The Zenoss developers are probably trying to scare you away from such digging around in fear that you'll violate their objects' invariants and leave them a real mess to solve.)
Now, I don't say that Zope isn't scary; it has over 10 years of scary stored up in it. But the ZODB is a cuddly, loving part.
Cheers!
We used OpManager in production for over a year. It has terrible Linux support. None of their built-in plugins worked properly for monitoring even basic parameters like disk space, free memory, CPU usage, etc. When we pointed this out to their support people, they said we should build our own plugins with SNMP OIDs. Um....no. Not for the amount of money we paid for that steaming POS. We finally kicked OpManager to the curb about a month ago, and have our entire environment, Windows and Linux servers being monitored with Nagios. Nagios scales well, we are currently watching several hundred hosts and about 3500 services.
OpenNMS is also a good tool; its ability to map servers back to switch ports is extremely handy.
The solution is real simple. If you can program in anything then Hobbit/Xymon with Devmon is your only choice.
Create your own Weather Map for 2D, you never need a full 2D map of 5000 hosts... Less is more.
1. Free
2. Fully customizable
3. Easy administration
4. Offers clients for all the major OS (And quite a few minor ones)
5. Large support base (Users with high technical level)
6. Nice author (Replies to comments and considers all ideas)
7. You can write a test for anything you can think of and easily add it into hobbit
8. Offers client/server monitoring, remote monitoring, script monitoring, snmp monitoring(devmon) or scripts
The possibilities with Hobbit are endless
Personally I use Hobbit to monitor over 2400 devices, including Cisco hardware, AIX, Windows Servers, VMware Clusters, Exchange, Sharepoint etc.etc.etc.etc.
I've never encountered a system I could not monitor with Hobbit (Or scripts that send their results into hobbit).
We are using a combination of Cacti and mon to monitor about 200 devices, both network gear and PC servers. Cacti is used to graph performance data(bandwidth, cpu, mem, temp) and maps for the visually inclined, while mon is used to do the actual service monitoring and alerting.
I won't comment on Cacti, since it has been mentioned here already, though I will say that you CAN change the default behavior of "sample averaging" by increasing the size of the RRD database. There are discussions on the Cacti forum/wiki that cover this topic.
Mon on the other hand, I didn't see mentioned at all, so here's my blurb on that. The core of mon is a scheduler written in perl, which handles running monitor tests(also perl or any script/program that can exit with a 1/0) and then alerting(also perl, or other languages, and can do more than just sending mail or paging) when necessary, based on the configuration for that service. Like most open source projects, it is extremely flexible if you put in the initial time investment to set up your tests and dependencies correctly, and once this is done, the tests/alerts can be reused or further modified. There are quite a few monitor tests and alert scripts already included, along with some handy tools for interaction through a web browser(via moncgi), generating dependency trees, generating reports, and more. There's also a perl module, Mon::Client, that provides an API for interacting with the mon scheduler. The downside, besides configuring it with a text file(m4 can be helpful here), is there hasn't been any activity since 2007(according to the CVS repo on sourceforge).
Probably not the solution for an extremely large number of hosts, though resource-wise, it could handle it, but maybe someone else might be able to benefit from it. If you need very specific tests(number of BGP routes, verifying NH on routes, customer redundancy) and smart alert logic, it's worth looking at.
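The monitor/alert split described above is easy to picture with a small sketch. This is not mon itself and doesn't use its config format: it just runs each check as an external command, treats exit status 0 as "ok" and anything else as a failure (my assumption, following the usual Unix convention), and hands failures to an alert callback. The service names and commands are hypothetical stand-ins.

```python
import subprocess
import sys

def run_monitor(command, timeout=10):
    """Run a monitor command; returns (ok, detail) where ok means exit status 0."""
    try:
        result = subprocess.run(command, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return result.returncode == 0, result.stdout.strip()

def check_services(services, alert):
    """Run every service's monitor and call alert(name, detail) on failure."""
    failed = []
    for name, command in services.items():
        ok, detail = run_monitor(command)
        if not ok:
            alert(name, detail)
            failed.append(name)
    return failed

# Hypothetical checks; real ones would be ping, SNMP or BGP test scripts.
services = {
    "always-ok": [sys.executable, "-c", "print('fine')"],
    "always-bad": [sys.executable, "-c",
                   "import sys; print('3 BGP routes missing'); sys.exit(1)"],
}

failed = check_services(
    services,
    alert=lambda name, detail: print(f"ALERT {name}: {detail}"),
)
```

Because a check is just a command with an exit status, tests written in any language can be reused as-is, which is the main appeal of this style.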