What Would You Want In a Large-Scale Monitoring System?

Posted by timothy on Wednesday July 8, 2009 @09:30AM from the detect-curse-words-and-fine-a-quarter-apiece dept.

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I changed employers and have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by help desk support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but I usually use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"

I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · 2009-07-08 09:31 · Score: 5, Funny

Publish them in DNS, and have the NSA monitor them for me!

--
"Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
1. Re:I Name My Devices After Al Qaeda Members by Philip+K+Dickhead · 2009-07-08 09:48 · Score: 5, Funny
  
  The only drawback to this comes in the form of UAVs.
  
  --
  "Speaking the Truth in times of universal deceit is a revolutionary act." -- George Orwell
2. Re:I Name My Devices After Al Qaeda Members by Hurricane78 · 2009-07-08 09:52 · Score: 4, Funny
  
  No need to. We are doing it anyway.
  Your NSA.
  
  --
  Any sufficiently advanced intelligence is indistinguishable from stupidity.
OpenNMS by Anonymous Coward · 2009-07-08 09:34 · Score: 5, Informative

That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.
OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.
http://www.opennms.org/
Enjoy.
1. Re:OpenNMS by mu51c10rd · 2009-07-08 09:49 · Score: 4, Interesting
  
  I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.
2. Re:OpenNMS by Cato · 2009-07-08 09:57 · Score: 4, Insightful
  
  I've only tried OpenNMS. It looks very powerful, but wasn't at all hard to get installed and configured on Ubuntu - it figures out the type of node it has discovered and shows useful data through SNMP, and can also do uptime monitoring, and is generally very scalable and configurable if needed.
A more interesting question by drsmithy · 2009-07-08 09:35 · Score: 5, Insightful

What limitations exist in current solutions that justify developing a new one from scratch?
Before I get flamed... by jwilki1 · 2009-07-08 09:37 · Score: 4, Interesting

I am going through this right now, and I am using or have used all of the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.
Zabbix by ender- · 2009-07-08 09:37 · Score: 5, Informative

You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.

--
Nothing to see here
1. Re:Zabbix by TooMuchToDo · 2009-07-08 09:44 · Score: 5, Informative
  
  We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.
2. Re:Zabbix by BlueBlade · 2009-07-08 11:18 · Score: 4, Informative
  
  We're using Zabbix at work and I'm doing daily backups of the database with a simple mysqldump command. Since the tables are InnoDB and not MyISAM, you can use the --single-transaction switch. That way, it takes a virtual snapshot of the db at the start of the backup process and the writes can still keep going (they still happen and commit normally; the dump just keeps reading from the snapshot taken when its transaction started). Granted, our DB isn't that big (10GB only), but it's been working fine and restore tests also seem to work fine.
  Here's the daily cron:
  mysqldump -u blah -pblah --single-transaction --opt --skip-lock-tables zabbix | gzip > /backup/zabbix_db.sql.gz
  
  --
  Religion is the best example of mass psychosis
GKrellM by Areyoukiddingme · 2009-07-08 09:38 · Score: 5, Funny

You can pry my GKrellM from my cold, dead hands!
Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!
*cough*
The Dangers of averaging by Anonymous Coward · 2009-07-08 09:40 · Score: 5, Insightful

MRTG does it right... most of the others do it wrong.
When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems.
So your 380Mbps peak that you had an hour ago is fine on today's graph.
But tomorrow, when you look at "yesterday's" graph... the peak is down to 100Mbps,
and next week, when you look at "last week's" graph... there's a little 50Mbps peak.
Damnit... I want to keep information on my peaks for capacity planning!
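
A minimal sketch of the effect described above, assuming made-up traffic numbers and a hypothetical `consolidate` helper (it is not taken from MRTG or any other tool). It rolls twelve 5-minute samples into one hourly value twice, once by averaging and once by keeping the maximum, which is the same idea as storing a MAX archive next to the AVERAGE one in RRDtool.

```python
from statistics import mean


def consolidate(samples, bucket_size, func):
    """Collapse each run of `bucket_size` samples into one value using `func`."""
    return [
        func(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical link traffic in Mbps at 5-minute resolution:
# a quiet hour with a single short burst.
five_minute = [40] * 12
five_minute[5] = 380

hourly_avg = consolidate(five_minute, 12, mean)  # the burst is smeared out
hourly_max = consolidate(five_minute, 12, max)   # the burst survives

print("hourly average:", round(hourly_avg[0], 1))
print("hourly maximum:", hourly_max[0])
```

Keeping both series costs twice the storage per roll-up level, but the maximum is the one that answers capacity-planning questions.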
Zenoss by KerberosKing · 2009-07-08 09:40 · Score: 4, Informative

I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for OpenView. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.
You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.
1. Re:Zenoss by rawler · 2009-07-08 10:26 · Score: 5, Informative
  
  ZenOSS may be great, but a word of warning. We've had 3 failed attempts at implementing it in our shop. What we tried to achieve was mainly host and service-monitoring, with some slight network-monitoring on the side. Nothing fancy, just some 20 hosts, maybe 30 network-devices, and a variety of services.
  One of the major parts we've found missing in most open-source solutions is proper event management (receiving syslog + SNMP traps, and applying some intelligence to them regarding flow control, dispatching, archival and that stuff). On paper, and throughout our initial evaluation, ZenOSS is one of the best open source tools for this.
  However, during our three attempts to get it up and running, we always encountered some major obstacle (usually after a while of operation), forcing us to start over from scratch. The problems were always in the same category: strange and unexplainable errors, often hard to reproduce, which in general resulted in a very flaky experience. Some of the problems were service checks showing both false positives and false negatives. In the last one, ZenOSS refused to import new SNMP MIBs, complaining about an IP address that could not be found anywhere in the config. Grepping ultimately found the IP only somewhere in the opaque Zope database, where evidently it could not easily be removed, nor could we even find out exactly what the IP address was for. (It was something auto-discovered in a remote network segment out of our control, but advertised throughout the routers.)
  So, while ZenOSS can do all kinds of things, and does a LOT of things really well, it's extremely complex and not in all parts on a solid foundation (for example, all network objects live in a non-accessible Zope database that the devs themselves recommend not touching, since it may upset things more). If you plan on implementing ZenOSS, I would not go without the support, which I assume is great, since there seem to be quite a few dark pits to fall into on your own.
  I don't know why we had so many obstacles and strange problems when others seem to have a smooth ride. Maybe one explanation is what was the final nail in the coffin for ZenOSS in our deployment. When I started asking around about these problems (and ZenOSS has a really helpful community, no problems there), I realised that many users claimed to have run into problems similar to ours, but their solution was to just keep daily backups and revert to a backup when they hit these problems. For us, the monitoring data is the basis for a lot of third-party agreements, and losing even days' worth of monitoring and logging is completely unacceptable. We do back up everything, but only for rare disasters, and we must be able to rely on the monitoring system giving us a clear view through those disasters.
2. Re:Zenoss by Ranger+Rick · 2009-07-08 12:46 · Score: 4, Interesting
  
  And this is why we (OpenNMS) don't play the per-node pricing game. It's not any harder to run OpenNMS when managing 1000 nodes than when managing 100, you only need to scale hardware appropriately. Per-node pricing is an artificial limitation.
  We also don't play the "you get a special price behind closed doors" game, our support prices are public, fair, and the same for everyone -- and that's only if you need commercial support -- our prices are $0 if you don't need or want support.
  If you do the math, it's $0 for the software, plus $14,995/year for support for any number of nodes, and the software is 100% open-source and fully capable of replacing or exceeding OpenView. ;)
  
  --
  WWJD? JWRTFM!!!
ZenOSS all the way by Midnight+Warrior · 2009-07-08 09:52 · Score: 5, Interesting
We use ZenOSS exclusively at work and have enjoyed every minute of it. Pros include:
- 2D map with status of all nodes or submaps, organized by network
- Application monitoring, with more advanced maps available for purchase (Oracle, JBoss, Cisco) for those things you already paid a lot of money for
- Performance monitoring via SNMP or other data sources using RRDtool internally which includes graphs linked to each other during zoom in/out or panning
- Nagios plugins already do some of the heavy lifting
- Built-in support for watching Windows servers (any metric accessible via WMI)
- Access control using at least LDAP and Active Directory
- Secondary data collectors for those networks which are too big for just one central source
- Highly customizable through Python
- It has so, so much more than pathetic commercial solutions like OpenView
Cons:
- You have to keep your eye on the back end database
- It still takes a long, long time to tune it to remove noise events
- If you don't know Python, it can be tough in a few places
- Proper support is not cheap
I Hate War Rooms by afabbro · 2009-07-08 10:33 · Score: 4, Interesting
I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.
What you want in large-scale monitoring is:
- The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".
- I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.
- I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.
- I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.
- Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.
- I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.
- I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.
- I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.
- I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.
- I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.
Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
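
The first wish on the list (one alert for router A instead of fifty for the hosts behind it) can be sketched as follows. This is a minimal illustration, assuming a simple tree-shaped dependency map without cycles; the `root_causes` function and the node names are hypothetical and do not come from any product mentioned in this thread.

```python
def root_causes(down, parents):
    """Return the failed nodes that have no failed ancestor.

    `down` is the set of unreachable nodes. `parents` maps each node to the
    node it depends on (None for the top of the tree). The map is assumed
    to be a tree, so walking upwards always terminates.
    """
    roots = set()
    for node in down:
        ancestor = parents.get(node)
        # Walk up until we hit a failed ancestor or run out of parents.
        while ancestor is not None and ancestor not in down:
            ancestor = parents.get(ancestor)
        if ancestor is None:
            roots.add(node)
    return roots


# Hypothetical topology: two hosts sit behind router-a, one host stands alone.
parents = {
    "router-a": None,
    "host-x": "router-a",
    "host-y": "router-a",
    "host-z": None,
}

down = {"router-a", "host-x", "host-y"}
alerts = root_causes(down, parents)
print(sorted(alerts))
```

Only the nodes returned by `root_causes` would page someone; the rest can be recorded as consequences of the same incident.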
--
Advice: on VPS providers
What I Lack in Open Source Monitoring Solutions by rawler · 2009-07-08 11:10 · Score: 5, Insightful

I just did a quick survey and evaluation of the open source monitoring market for my company, and found shortcomings/frustrations in a few aspects where none of the evaluated systems seems to get it 100% right.
Transparent Planned Design
Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.
Event management
Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/sms based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.
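
The "simple logic" part of that wish list can be sketched in a few lines. This is a minimal illustration only: it assumes events have already been received and decoded into a timestamp, a source and a message (no syslog listener or MIB decoding is shown), and the `EventSquasher` class, the `route` function and the severity numbers are all hypothetical.

```python
class EventSquasher:
    """Collapse repeated events that share a (source, message) key.

    A repeat is an event with the same key arriving within `window_seconds`
    of the previous occurrence of that key.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.active = {}  # key -> {"first": ts, "last": ts, "count": n}

    def receive(self, timestamp, source, message):
        """Record an event; return True if it is new and should be dispatched."""
        key = (source, message)
        entry = self.active.get(key)
        if entry is not None and timestamp - entry["last"] <= self.window:
            entry["last"] = timestamp
            entry["count"] += 1
            return False
        self.active[key] = {"first": timestamp, "last": timestamp, "count": 1}
        return True


def route(severity, rules):
    """Return the destination of the first rule whose minimum severity is met."""
    for min_severity, destination in rules:
        if severity >= min_severity:
            return destination
    return None


squasher = EventSquasher(window_seconds=300)
results = [squasher.receive(t, "sw1", "link down") for t in (0, 60, 120, 1000)]

# Hypothetical rules, highest threshold first: severity 5+ goes to SMS,
# severity 3+ goes to mail, anything lower is only archived.
rules = [(5, "sms"), (3, "mail")]
print(results, route(4, rules))
```

The repeat counter kept per key is what a web page or archive would display instead of hundreds of identical rows.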
Modularity/Seamless Integration
Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g. is project X missing an event manager, or is the built-in one not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? I'm OK with black-box solutions, as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution evaluated does everything we need it to, and we end up struggling with manual routines to compensate for it.
Complexity
There are a few really neat systems that do almost everything one can ask for. (Short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they are then often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.
The Perfect Monitoring System
After evaluating all options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrate well and together fit our needs, which I personally see as a bigger problem.
What I would really want to see in the world of Open Source Monitoring is an ecosystem of monitoring apps with an overarching design/architecture. Design a framework where different entities and steps in the monitoring are clearly defined and interfaced with each other, but which still allows for differing implementations, and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an appliance, but in The Perfect Monitoring System, it should be clear to the average-skilled hacker how to integrate it.
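
One possible shape for such clearly defined interfaces is sketched below. It is only an illustration of the idea, assuming two hypothetical extension points (`Check` produces results, `Sink` consumes them); none of these names come from an existing monitoring project, and the video check uses a stand-in measurement function rather than real stream analysis.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Result:
    check: str
    ok: bool
    detail: str = ""


class Check(ABC):
    """Anything that can be measured implements this one method."""

    name = "unnamed"

    @abstractmethod
    def run(self) -> Result:
        ...


class Sink(ABC):
    """Anything that consumes results (alerting, archival, graphing)."""

    @abstractmethod
    def handle(self, result: Result) -> None:
        ...


class BitrateCheck(Check):
    """Hypothetical site-specific check: is the video stream fast enough?"""

    name = "video-bitrate"

    def __init__(self, measure, minimum_mbit):
        self.measure = measure  # callable returning the current rate
        self.minimum = minimum_mbit

    def run(self) -> Result:
        rate = self.measure()
        return Result(self.name, rate >= self.minimum, f"{rate} Mbit")


class ListSink(Sink):
    """Trivial sink that just remembers what it was given."""

    def __init__(self):
        self.results = []

    def handle(self, result: Result) -> None:
        self.results.append(result)


def run_all(checks, sinks):
    """The core loop knows only the two interfaces, not the implementations."""
    for check in checks:
        result = check.run()
        for sink in sinks:
            sink.handle(result)


sink = ListSink()
run_all([BitrateCheck(lambda: 650, minimum_mbit=700)], [sink])
print(sink.results[0])
```

The point of the split is that an unforeseen need, like the video analysis, only has to implement `Check`; everything downstream keeps working unchanged.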
How about... by yacoob · 2009-07-08 11:20 · Score: 5, Informative

Needed features in random order:
* Scalability - a few thousand machines is the minimum. This probably means smart, decentralized collection and aggregation of data.
* Flexible whitebox monitoring - for a given class of devices, I should be able to configure how to fetch the device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
* Flexible blackbox monitoring - for a given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how the results of each action should be interpreted (ok/nok, time to complete, etc.).
* Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
* Some language to easily calculate derivative values from the data above.
* Interface for defining graphs, using collected data.
* ...and a system for annotating the above. Raw data is neat, annotated data is even better.
* Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
* (nice to have) HTTP server with a simple HTML templating, to allow for easy creation of arbitrary dashboards.
* (if you have the above) predefined templates for most of common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
* hooks/libraries to use collected data "outside" of the system
I realize that's a lot, but boy, such a system would be very useful and flexible.
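
The tagging and aggregation item can be sketched as follows. This is a minimal illustration with made-up samples; the record layout and the `aggregate_by_tag` function are hypothetical, and percentiles are left out to keep it short.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_by_tag(samples, tag):
    """Group sample values by one tag and summarise each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["tags"][tag]].append(sample["value"])
    return {
        key: {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stddev": pstdev(values),  # population standard deviation
            "sum": sum(values),
        }
        for key, values in groups.items()
    }


# Hypothetical samples, each tagged with its machine and network segment.
samples = [
    {"value": 10, "tags": {"machine": "web1", "segment": "dmz"}},
    {"value": 30, "tags": {"machine": "web2", "segment": "dmz"}},
    {"value": 5, "tags": {"machine": "db1", "segment": "core"}},
]

summary = aggregate_by_tag(samples, "segment")
print(summary["dmz"])
```

Because the grouping key is just a tag name, the same data can be summarised per machine, per segment or per source without changing how it was collected.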

--
-- we're here you're not