Slashdot Mirror


Server Monitoring With Munin And Monit

hausmasta writes "In this article I will describe how to monitor your server with munin and monit. munin produces nifty little graphics about nearly every aspect of your server (load average, memory usage, CPU usage, MySQL throughput, eth0 traffic, etc.) without much configuration, whereas monit checks the availability of services like Apache, MySQL, Postfix and takes the appropriate action such as a restart if it finds a service is not behaving as expected. The combination of the two gives you full monitoring: graphics that let you recognize current or upcoming problems (like "We need a bigger server soon, our load average is increasing rapidly."), and a watchdog that ensures the availability of the monitored services."

27 of 124 comments

  1. But can I run this on Windows? by Steve_Jobs_HNIC · · Score: 5, Funny

    .... been waiting a while to say that.

    1. Re:But can I run this on Windows? by hackstraw · · Score: 2, Interesting

      Can it run on Windows .... been waiting a while to say that.

      Dunno. Don't care either, but it might. It's based on rrdtool, which does run on Windows. I don't know if this article is a slashvertisement, or just void of information. I've linked to rrdtool, and here is the munin homepage.

      There are _tons_ of these things running around. In my opinion, rrdtool is one of the best tools that has come to computing in a long time. It's awesome. Other packages that use rrdtool are cricket, ganglia, and many others. I believe that the rrdtool site has a listing of some of these.

      For those not familiar with it, rrdtool is a database that is designed for time series data. It's kinda like a smart FIFO where it loses detail the further back in time you go by storing running averages. I have rolled my own monitoring stuff with rrdtool and perl to monitor CPU, load, temperatures, you name it. One of the cool things about rrdtool is that the database is fixed in size. rrdtool is not easy to initially set up and work with, but the effort is definitely worth it.
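To make the "smart FIFO" idea concrete, here is a toy sketch in Python. It is not rrdtool's file format or API; the class names `RoundRobinArchive` and `ToyRRD` and the row counts are made up for illustration. Each archive is a fixed-size ring, and the coarser archive stores one average per several raw samples, so older history survives only in averaged form while the total size never grows.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size ring of consolidated samples (toy stand-in for one archive)."""

    def __init__(self, rows, steps_per_row):
        self.rows = deque(maxlen=rows)      # oldest rows fall off automatically
        self.steps_per_row = steps_per_row  # raw samples averaged into one row
        self._pending = []

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps_per_row:
            # Consolidate: keep only the average of the pending raw samples.
            self.rows.append(sum(self._pending) / len(self._pending))
            self._pending = []


class ToyRRD:
    """Two archives: a fine one (recent detail) and a coarse one (longer history)."""

    def __init__(self):
        self.archives = [
            RoundRobinArchive(rows=4, steps_per_row=1),
            RoundRobinArchive(rows=4, steps_per_row=3),
        ]

    def update(self, value):
        for archive in self.archives:
            archive.update(value)


db = ToyRRD()
for load in range(1, 13):          # twelve hypothetical load samples: 1.0 .. 12.0
    db.update(float(load))

fine, coarse = db.archives
print(list(fine.rows))    # only the four most recent raw samples survive
print(list(coarse.rows))  # four averages, each covering three raw samples
```

After twelve updates the fine archive has forgotten the first eight samples, but the coarse archive still covers all twelve as averages, which is the trade-off described above.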

      Basically, if you're a sysadmin in 2006 and you do not have rrdtool-based monitoring going on, well, maybe the job is not for you. It's that important and good. A simple click on a link to a webpage with an rrdtool graph can demonstrate to even the pointiest of pointy PHBs that you need more equipment or that a trend is going on or whatever.

      This is the kind of stuff I would like to see more talked about here on slashdot.

    2. Re:But can I run this on Windows? by Vancorps · · Score: 2, Interesting
      I'll be setting up the linux tools on the db servers, have to find out if it works with Oracle alright.

      As for the Windows servers, the monitoring is nothing new. Microsoft Operations Manager, or MOM, has been around for 6 years now and is exceedingly friendly to both set up and use. It also works with all servers and workstations, flagging alerts like low disk space or high CPU utilization so you can see if some new virus is coming at you. They even have agents for Linux and OS X.

      I'll have to check out rrdtool though, it's new to me. Most of the linux boxes I have in production are only doing one task and there aren't that many servers: 20 in total that I manage, so it's fairly easy to check availability and go over the logs real quick manually. Time is always against me, but now that it's summer I should have time to get my house in order.
  2. Cacti by mtenhagen · · Score: 4, Insightful

    How is this different from cacti?

    1. Re:Cacti by isolationism · · Score: 3, Informative

      Munin isn't at all different from Cacti, really, except that Cacti is 100% web based and perhaps a bit more mature (I use Cacti and like it a lot more than at least 4-5 other similar products out there). Cacti won't do service-testing though; maybe this is a good walkthrough for people who just want something up and running in 15 minutes (I wouldn't know, I'm not inclined to read the whole thing since a cursory glance shows there's nothing here that I don't have a running alternative for already).

  3. Re:RTFA! by remembertomorrow · · Score: 2, Informative

    He was simply playing on the "But does it run on Linux?" post that appears in tons of threads. He doesn't need to RTFA. :)

    --
    Registered Linux user #421033
  4. Insignificant in the trails of NAGIOS? by pl1ght · · Score: 2, Interesting

    I'm not sure I follow why this is newsworthy. NAGIOS is OSS and is an extremely mature product with a community writing modules and plugins, etc., to monitor any aspect you want of your servers/routers/networks/room temperatures, I mean anything. Why would anyone bother?

    1. Re:Insignificant in the trails of NAGIOS? by Stinking+Pig · · Score: 3, Interesting

      Because in software-land, "mature" is rapidly followed by "obsolete." I love Nagios, but I'm hesitant to recommend it to anyone who's not comfortable spending a week on building and configuring software.

      Packages for it are often broken or from the old 1.3 tree, which makes for confusion when following examples that use 2.0 syntax.

      Configuration is extremely challenging to start from scratch with, especially if you want to do anything custom.

      There are a number of external dependencies, particularly if you want to compile the plugins.

      That said, Nagios still whips the pants off quite a few commercial monitoring products I've evaluated.

      --
      "Nothing was broken, and it's been fixed." -- Jon Carroll
  5. Hobbit by Anonymous Coward · · Score: 2, Informative

    Don't forget about the big brother clone, hobbit.

    SF.net at: http://hobbitmon.sourceforge.net/
    Live example at: http://www.hswn.dk/hobbit/

  6. Automatic restarts are bad by Erik+Hensema · · Score: 5, Insightful
    • A restart usually kills hanging processes, making the actual cause of the hang impossible to determine afterwards.
    • Automatic restarts make some admins lazy. Instead of debugging the problem, they accept that apache/whatever service is restarted once a day.

    However, making graphs and monitoring your services is a very good thing. Graphs are invaluable in determining trends, such as memory leaks or steadily increasing load. Monitoring saves lots of downtime and unhappy customers ;-)

    Personally I use nagios for monitoring and DIY scripts for graphing. The latter mostly because I started making graphs before decent off-the-shelf software was available ;-)

    PS. what's this subject got to do with debian?

    --

    This is your sig. There are thousands more, but this one is yours.

    1. Re:Automatic restarts are bad by Jeff+DeMaagd · · Score: 3, Insightful

      Point taken, but I think an automatic restart is necessary to minimize intrusions into off-work time with maintenance and such. If the service hangs and there's no one there to tend to it, then it will stay hung until someone notices. This is not good if you want to keep going and not lose potential business while the site is down.

      Anyway, I'm glad I'm not a server admin. I'd like to live my private life NOT being on-call.

    2. Re:Automatic restarts are bad by Burv · · Score: 2, Insightful
      Good points. However, I think there's something to be said for automating things to increase uptime and lessen the load on the sysadmin, especially if it's critical that the service be available and you always go through the same checks (e.g. check /var/adm/messages, look at the process table, load, etc.). There's also a tradeoff between knowing the details of what caused the problem and getting back up quickly if, every minute your server is down, your company is or could be losing money, as for someplace like ebay.

      Oh, and I think these packages are installed as part of debian, either by default or optionally. That's why the article mentioned apt-get.

  7. Restarting services... by fimbulvetr · · Score: 2, Insightful

    It always bothers me when people use utilities to restart services that die/have been killed. Shouldn't a daemon be designed to run indefinitely? Doesn't the fact that a process died mean that something is wrong and needs to be fixed? For instance, if my apache daemon dies because the logfile is larger than it can handle, what good is restarting it going to do? It's just going to beat the crap out of a server - process dies - watcher daemon starts it up - process dies...etc.
    Or, if the OOM killer kills my ftp server because he's hogging the memory, doesn't that mean I have bigger problems than just doing a restart (I need more memory, the ftp server has a mem leak, etc.)?

    None of my hundreds of critical daemons die for no reason whatsoever - all of them require some type of human interaction if they have died. It doesn't happen very often, maybe once every several months.

    Not that I care about this software in general, I use hobbit for my trending/graphing/service availability, but I hate to see bad admin'ing, even if I'm not involved.
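The "process dies - watcher starts it - process dies" loop can be bounded. Below is a minimal Python sketch of one way to do that, assuming a sliding-window restart budget; the class name `RestartPolicy`, the limit of 3 restarts, and the 300-second window are all hypothetical and not taken from monit or any other tool. Once the budget is spent, the watcher stops restarting and a human has to look at the root cause.

```python
class RestartPolicy:
    """Allow at most `max_restarts` within `window` seconds, then give up."""

    def __init__(self, max_restarts=3, window=300.0):
        self.max_restarts = max_restarts
        self.window = window
        self._history = []  # timestamps of recent restarts

    def should_restart(self, now):
        # Forget restarts that have fallen out of the sliding window.
        self._history = [t for t in self._history if now - t < self.window]
        if len(self._history) >= self.max_restarts:
            return False  # looks like a crash loop: stop and page a human
        self._history.append(now)
        return True


policy = RestartPolicy(max_restarts=3, window=300.0)
# Hypothetical crash times in seconds: four crashes in quick succession, one much later.
decisions = [policy.should_restart(t) for t in (0, 10, 20, 30, 400)]
print(decisions)  # [True, True, True, False, True]
```

The fourth crash inside the window is refused, which is the point where an alert should go out instead of another restart.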

    1. Re:Restarting services... by NevarMore · · Score: 3, Interesting

      Egads! My education is useful!

      We're discussing such issues in a class I'm taking on software fault tolerance. In discussing selective restarts and backup processes, Apache is frequently cited as an example of how software should fail gracefully, consistently, and then handle that failure itself. The lecture slides can be found here: http://wwwse.inf.tu-dresden.de/index.php?language=English&site=courses&course=ss06vl02

      Apache has some memory leaks in it. It is not bad, it happens, especially in a piece of software like that which is expected to run constantly and NEVER fail. So what the Apache software does is every so often, or when it detects that its memory usage is getting out of hand, it fires up a second copy of itself and then kills itself letting the new not-yet-leaky copy take over.

      So to you (IT/admin) that daemon may run forever, but that's because my people (CS/developer) did our jobs (for once) and ensured that the application cleaned up its own messes.
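As a rough illustration of the recycling idea, here is a toy Python model. It is not Apache's code; the `Worker` and `Supervisor` classes and the request limit are invented for the example. The supervisor retires a worker after a fixed number of requests and hands the next request to a fresh one, so whatever the old worker leaked goes away with it. In a real Apache prefork setup the comparable knob would be a per-child request limit such as the `MaxRequestsPerChild` directive.

```python
class Worker:
    """Toy request handler that is retired after a fixed number of requests."""

    def __init__(self, worker_id, max_requests):
        self.worker_id = worker_id
        self.max_requests = max_requests
        self.handled = 0

    def handle(self, request):
        self.handled += 1
        return f"worker {self.worker_id} served {request}"

    @property
    def exhausted(self):
        return self.handled >= self.max_requests


class Supervisor:
    """Replaces the current worker once it has served its quota."""

    def __init__(self, max_requests_per_worker):
        self.max_requests = max_requests_per_worker
        self.spawned = 0
        self.worker = self._spawn()

    def _spawn(self):
        self.spawned += 1
        return Worker(self.spawned, self.max_requests)

    def dispatch(self, request):
        if self.worker.exhausted:
            # Start a fresh worker; the old one (and anything it leaked) is dropped.
            self.worker = self._spawn()
        return self.worker.handle(request)


sup = Supervisor(max_requests_per_worker=2)
log = [sup.dispatch(f"req{i}") for i in range(5)]
print(log[-1])      # worker 3 served req4
print(sup.spawned)  # 3
```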

  8. no but use perfmon by badriram · · Score: 2, Informative

    Performance monitor is one of the best utilities on windows. It is very detailed, and most MS apps have additional counters for other detailed views. It also does remote logging, basic graphing, alerts etc.

  9. Orca by otisg · · Score: 2, Insightful

    I'm a happy user of Orca, which I use to graph all kinds of aspects of the system that runs Simpy's cluster.

    --
    Simpy
  10. Seems a lot less clunky than Nagios or Cacti by Burv · · Score: 3, Informative
    I've tried both Nagios and Cacti for years. They work great, are very feature rich, and seem to have a strong community.

    The one thing that annoys me about them is that, out of the box, they don't have much configured, and to install/configure stuff, you have to jump through a lot of hoops.

    In the case of cacti, it's mostly through a web-based GUI, which is OK if you have one server with one thing you want to measure, say %CPU usage, but if you want to do it for a server farm or even a couple of machines, it's a pain in the butt. They do have a templating system, but you still have to do a lot through the GUI. I've posted on their forums before to this effect, and they have suggestions for making changes like this en masse, but again, it doesn't work out of the box. Bottom line, the designers of cacti seem to be focused on the Web GUI, which is kinda nice for newbies, but a huge pain for people like me who like to script things.

    It's the same thing with Nagios, although at least they let you change text files for the settings. Although the number (about 20) of files is reflective of how feature rich it is, it also makes it a hassle to set up. Here's an article at samag.com that illustrates the process you need to go through... imagine this for a couple hundred servers, and you can see how arduous setting up nagios could be.

    So, although munin may not be as mature and well known as cacti, and monit not as popular as nagios, I think they're still worth trying out.

  11. These Guys ROCK! by thehunger · · Score: 2, Informative

    I don't know anything about Munin, but the guys that wrote Munin absolutely rock! The company is Linpro, and they've been doing Linux and open source for over 10 years now. They do hosted management, remote management, development and Linux and OSS training. They have also begun to package Linux and OSS based solutions for groupware, voip, management etc.

    The point is, they've been doing server management for years (using Nagios) and wrote Munin to -complement- it, not compete with it.

    Check them out, they absolutely rock.

  12. practical experience by routerguy666 · · Score: 2, Insightful

    I've tried a number of these monitoring apps as they've come out. To date, I still can't find a combination better than MRTG and Nagios. If you know a bit about SNMP and how to find the OID of what you are interested in (and where to get mibs), it's hard to find a simpler, cleaner pair of monitoring products.

    Although in all honesty, Nagios' only real benefit is the ability to send out alerts. I'm more fortunate than others, I know, in that I've had the resources available to build redundancy in at every level of our production networks so when something does die (and with modern platforms this is becoming a once every two years event) it doesn't create a major catastrophe.

    Other than that, all the trending info I want/need on bandwidth, cpu, disk space, user loads, etc, etc, I can pull out of any device via snmp and track it with MRTG. Plus each MRTG release doesn't require me to rewrite umpteen config files to match the author's latest greatest idea of how they should be formatted (my only real gripe about nagios/netsaint).

    In the end I guess you use what you are familiar with, and I cut my teeth on these.

  13. Add OpenNMS by nrc · · Score: 3, Informative

    Add OpenNMS to the list of stuff that this duplicates or overlaps with. Not that anyone in OSS needs permission to reinvent the wheel. You've got an itch - you scratch as it pleases you.

  14. JFFNMS, BB, Hobbit,etc by falzbro · · Score: 2, Informative

    Since we're on the subject, others have mentioned Nagios and MRTG of course. Be sure to check out JFFNMS (Just for fun). Horrible name for what it does, since it's quite powerful. For Big Brother users, I would recommend checking out Hobbit Monitor as a replacement of the server portion. It's compatible with the BB client, but has far more features and includes some basic MRTG graphs.

    I have yet to find an all in one integrated open source solution for monitoring (cpu, processes, port reachability), alerts (email, sms, etc). The closest I've found is JFFNMS, but writing alert rules and such is difficult to say the least.

    While on the subject, if it's not too terribly off-topic, what do people use to bill based on network usage (MRTG, RRD)? Both claim that you should NOT bill off of that information, but I have yet to find any other open source solution.

    --falz

    1. Re:JFFNMS, BB, Hobbit,etc by fimbulvetr · · Score: 2, Informative

      Back when I admin'd an ISP that billed by usage, we used mrtg and the mrtg 95th percentile scripts. On more than one occasion, we had customers inquire about our billing. Fortunately, most of our customers were technically literate, so I stepped through the code and procedures with them. All of them were happy with the explanations and were satisfied after they saw the methods. That's not to say mrtg and the 95th percentile scripts are bulletproof, but they held up under our scrutiny.

      http://www.seanadams.com/95/
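For anyone who has not seen 95th percentile billing, here is a small Python sketch of the usual idea: sort the usage samples for the billing period, ignore the top 5%, and bill at the highest remaining value. This uses the nearest-rank definition of a percentile as an assumption; the scripts linked above may differ in detail, and the function name `percentile_95` and the sample figures are hypothetical.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the smallest sample that has at least
    95% of all samples at or below it."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Hypothetical billing period of 100 samples (in Mbps).
short_burst = [10.0] * 96 + [90.0] * 4   # bursts in 4% of samples
long_burst = [10.0] * 94 + [90.0] * 6    # bursts in 6% of samples

print(percentile_95(short_burst))  # 10.0
print(percentile_95(long_burst))   # 90.0
```

Bursts that stay within the top 5% of samples do not raise the bill, while slightly longer ones do, which is usually the part customers ask about.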

  15. Re:swatch? by Whanana · · Score: 2, Informative

    This sounds a lot like Nagios. From TFA I couldn't see anything Munin and Monit would do that you can't do on Nagios with a few plugins. Just a plug - Nagios is beautiful, it makes nice graphical representations of load, hits, throughput, and about anything else you can think of.

  16. Very nice! by ngunton · · Score: 2, Insightful

    I hadn't heard of this before. I liked the sound of pretty graphs, and I particularly liked how easy the article made it sound to install and get things working. So I tried it (I'm running Sarge AMD64 on the server) and it worked fine. In fact, it was up and running in a couple of minutes. Very nice!

    I have to say it is refreshing to see something that "just works" out of the box with sensible defaults. Truth be told, I am sick and tired of these holier-than-thou OSS zealots who keep pushing bloated, complex toolkits which have every option under the sun, but it doesn't all "just work" out of the install, no, that would be too easy wouldn't it. You have to read through reams of distributed, fragmented documentation, forum posts and other sources to get the damn thing working properly, not to mention cobbling together all these !@#$ing plugins that are sooooo wonderful and yet just end up being a pain in the butt because you have to track them all down individually. Why can't geeks grasp a simple fact: People don't necessarily have the time or inclination to spend days learning the arcane innards of your toolkit. I don't care if people say "well if you can't be bothered taking the time then you're not a real admin" or whatever, if I had to spend a lot of time on every package tuning it and writing a sendmail.cf-esque config file just to get it working *the way it should by default* then I'm probably just going to look for something else. That something else may be simpler and not as "pure" as your baby, but you know what? I'll use it, because it *just works* and does *most* things in a simple intuitive way. That's why MySQL became successful, and why PostgreSQL didn't - sure, PostgreSQL was more powerful (in theory anyway) and had a bunch more features, but it isn't optimized out of the box. Whenever I see people complain about how slow PostgreSQL turns out to be when they finally try it, the inevitable reply is "Well, you need to spend time tuning it - if you don't do that then you don't deserve to be running a server". Whatever. As far as I'm concerned these "Tuning required by default" and "You aren't a *real* x if you don't learn these reams of config options just to get it working" people just don't get it. 
    Make it work out of the box with sensible defaults, and let people delve into stuff further *if they want to*, not by requirement.

    I think the snobs are like this because they did go and learn all that stuff, and so they feel deep down that they have to justify that it was all worth it by putting down those who have a life and don't feel like dedicating days and weeks of effort to getting some stupid software package to function in the most basic way.

    So, great job Munin. My hat is off to you - I have a graphical monitoring system for my server, and it took me about two minutes to get it working. Fantastic.

  17. What Digg Uses by philovivero · · Score: 2, Informative

    At Digg, we use Nagios to alert (with all the warts that go along with that). We use Cacti to monitor and graph. It's a relatively nice front-end to RRDtool.

    I'm the MySQL DBA and I spent a long, long time (in concert with Peter Zaitsev of MySQL AB fame) tweaking the existing Cacti MySQL templates to add InnoDB graphing support (and a new MemcacheD set of graphing templates) and put them all over here: my mysqlUtils page.

    I'd never heard of this pair of monitoring/alerting software before. Hopefully it improves on the state of monitoring and alerting, because I feel Nagios and Cacti (and Ganglia) leave a fair bit to be desired.

    (By the way, that page includes a fair bit of other utilities, too, not just Cacti templates)

  18. Munin and restarts. by jafo · · Score: 2, Informative

    Munin is nice because it's just so simple to install and configure it. We used to use some scripts I had written to track server statistics, but have entirely switched to munin. However, munin also has some "monitoring" capabilities, which I usually disable. I wish they just stuck to graphing and didn't try to add monitoring to munin.

    Also, generating a lot of graphs can impact the system load. Not that you shouldn't use it, but I have definitely seen times where the system was getting hit particularly hard and munin seemed to be using up a lot of resources at the same time. You probably don't want to install it on an already overloaded system...

    Also, munin's design is such that if the system gets hit particularly hard, munin may not be able to run and capture this information. It doesn't lock itself into memory, or run at an escalated priority, so if the system is thrashing particularly hard, you often will get empty samples in munin instead of getting pointers to whether the problem was due to high load, high disc activity, high swap activity, etc... So it's really better suited to long-term capacity planning than to tracking down short-term load problems.

    As far as setting up service restarts, I totally agree that it's the lazy way out. The ideal solution is to track the problem to root cause and prevent it from happening. However, unlike the other respondent, I'm fine with that.

    As a sys admin, your job is to keep the system and services available. A brain-dead restart of Apache or bind once a week is much preferable to leaving it down for hours from 3am to 9am and then trying to track down a bug in bind or some random PHP application.

    So, by all means fix the real cause if possible. However, I recommend setting up automatic restarts with alerts going to appropriate people so you can keep an eye on when restarts happen. For one of my machines an apache restart happens about once every 2 weeks, and a bind restart happens once every other month. I'm not particularly inclined to spend significant resources debugging bind to prevent a 60 second outage of one of my two name servers once every 60 days. At least not today, I have other higher priority tasks to work on.

    Sean

  19. Munin And Monit? by AndroidCat · · Score: 2, Funny

    Shouldn't that be Hugin and Munin?

    --
    One line blog. I hear that they're called Twitters now.