Slashdot Mirror


Bayesian Tail

flok writes "We all know anti-spam software that uses Bayesian filtering. The results with these are amazingly good. So that got me thinking: why not create a tool which monitors logfiles and uses a Bayesian filter to determine which events to display and which not? That's why I created btail. Btail is just that: it monitors a logfile and filters it with a Bayesian filter. The results exceeded my own expectations!"

24 of 63 comments

  1. Cool idea but may be dangerous by PhilippeT · · Score: 2, Insightful

    This is a cool idea, but I wouldn't want to use it to filter logs on important systems... every line may be crucial.

    Anyhow, credit for a decent idea.

    --
    A psychopath can't tell the difference between right and wrong. A sociopath knows the difference - he just doesn't care.
    1. Re:Cool idea but may be dangerous by dougmc · · Score: 4, Insightful
      This is a cool idea, but I wouldn't want to use it to filter logs on important systems... every line may be crucial.
      Perhaps, but doesn't the same apply to your email? Every email may be crucial as well -- but if you miss a crucial email because it was buried in spam, isn't the effect the same as if it was caught by an overzealous spam filter?
    2. Re:Cool idea but may be dangerous by cpuffer_hammer · · Score: 4, Insightful

      Why not use it to colorize, or to rebuild the logs in HTML?

    3. Re:Cool idea but may be dangerous by flok · · Score: 3, Informative

      If you need something that colorizes and/or does regular expression filtering, merging with other (log-)files, multiple windows, etc. etc. then maybe multitail might come in handy.

      Initially I wanted to integrate btail into multitail, but multitail is bloated enough already :-)

      --

      www.vanheusden.com - home of Multitail, HTTPing, CoffeeSaint, EntropyBroker, rsstail, bsod, listener, nagcon, nagi
    4. Re:Cool idea but may be dangerous by GlassHeart · · Score: 2, Insightful
      The far more important difference is that we cannot control the generation of incoming email, which is why we are reduced to filtering as intelligently as possible.

      Server logs are not the same at all. The administrator has some control over the logs that get generated, and the programmer has full control. There isn't supposed to be the equivalent of email spam at all, because useless messages should just be filtered or redirected at the source. Leaving everything at "verbose" and relying on filtering just doesn't seem like the right approach to the problem.

      It is a cute idea, though, and probably applicable to some specific cases (no source code, etc).

    5. Re:Cool idea but may be dangerous by lars_stefan_axelsson · · Score: 2, Informative
      Why not use it to colorize, or to rebuild the logs in HTML?

      I published a paper, with GPL source code (you need Python etc) a few months back using visualisation (colorisation) to lend the user insight into the operation of a Bayesian classifier.

      It actually works pretty well, and the idea could be applied to other uses of the Naive Bayesian classifier.
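
      To make the colorisation idea concrete, here is a minimal Python sketch that tints each word of a log line by how strongly it leans "bad" or "good". It is not the code from the paper mentioned above; the function names, the toy word counts, the add-one smoothing and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math

# ANSI escape sequences for terminal colours.
RED, GREEN, RESET = "\033[31m", "\033[32m", "\033[0m"


def log_odds(token, bad_counts, good_counts):
    """Log of P(token | bad) / P(token | good), with add-one smoothing."""
    bad_total = sum(bad_counts.values())
    good_total = sum(good_counts.values())
    vocab = len(set(bad_counts) | set(good_counts))
    p_bad = (bad_counts.get(token, 0) + 1) / (bad_total + vocab)
    p_good = (good_counts.get(token, 0) + 1) / (good_total + vocab)
    return math.log(p_bad / p_good)


def colorize(line, bad_counts, good_counts, threshold=0.5):
    """Red for words that lean bad, green for words that lean good."""
    out = []
    for word in line.split():
        score = log_odds(word.lower(), bad_counts, good_counts)
        if score > threshold:
            out.append(RED + word + RESET)
        elif score < -threshold:
            out.append(GREEN + word + RESET)
        else:
            out.append(word)  # neutral or unknown word stays uncoloured
    return " ".join(out)


# Hypothetical word counts, as if collected from labelled log lines.
bad_counts = {"disk": 4, "error": 5, "on": 1}
good_counts = {"session": 5, "opened": 5, "on": 1}

print(colorize("disk error on sda", bad_counts, good_counts))
```

      The point of colouring instead of hiding is that nothing is dropped: the reader still sees every line, and can tell at a glance which words drove the classifier's opinion.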

      --
      Stefan Axelsson
  2. examples by rogueuk · · Score: 3, Interesting

    Do you have any examples of what type of stuff it learns to filter and what it learns to show? The btail site is kind of lacking on what it outputs versus what it filters.

  3. Site getting sluggish already by Kiaser+Zohsay · · Score: 4, Informative
    Blockquote from the readme.txt:


    Step 1. compile & install

    make install

    Step 2. configure btail

    Default configuration file:
    db_bad = .btail_db_bad
    db_good = .btail_db_good
    db_conf = .btail_db_conf
    logfile = /var/adm/messages

    db_... are the database files which are filled by blearn. They are
    used as reference when btail calculates if an event is bad or good.
    logfile is the logfile which you want to monitor. As you see, one
    needs a separate configuration file AND databases(!) for each file
    to monitor.

    Step 3. learn logging

    blearn -g good_logging
    blearn -b bad_logging

    good_logging should contain events which are considered ok.
    bad_logging should contain logging of events you want to see, e.g.
    disk errors, invalid loggings, etc.

    Step 4. use btail

    btail

    This will read the logfile defined in btail.conf and emit events
    which are considered not-ok by the bayesian filter.

    --- folkert@vanheusden.com


    Still very preliminary at this point, but shows promise. Now, to build and try it out!
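
    For anyone wondering what the learn-then-filter split in the readme amounts to, here is a minimal Python sketch of a two-class naive Bayes filter. It is not btail's actual implementation, whose internals are not shown here; the class name `LogFilter`, the whitespace tokenizer, the add-one smoothing and the sample log lines are all assumptions made for illustration.

```python
import math
from collections import Counter


class LogFilter:
    """Toy two-class naive Bayes filter for log lines (illustrative only)."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.lines = {"good": 0, "bad": 0}

    def learn(self, label, lines):
        """Rough analogue of the learning step: count words per class."""
        for line in lines:
            self.counts[label].update(line.lower().split())
            self.lines[label] += 1

    def _log_score(self, label, tokens):
        vocab = len(set(self.counts["good"]) | set(self.counts["bad"]))
        total = sum(self.counts[label].values())
        # Class prior estimated from the number of training lines.
        score = math.log(self.lines[label] / sum(self.lines.values()))
        for token in tokens:
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((self.counts[label][token] + 1) / (total + vocab))
        return score

    def is_bad(self, line):
        """Rough analogue of the tail step: emit the line if 'bad' wins."""
        tokens = line.lower().split()
        return self._log_score("bad", tokens) > self._log_score("good", tokens)


f = LogFilter()
f.learn("good", ["session opened for user root",
                 "session closed for user root"])
f.learn("bad", ["disk error on sda",
                "invalid login for user admin"])

for line in ["disk error on sdb", "session opened for user bob"]:
    if f.is_bad(line):
        print(line)
```

    With this toy training set only the disk error line is printed; the session line scores higher as "good" and is suppressed.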
    --
    I am not your blowing wind, I am the lightning.
  4. If this were Trek... by AndroidCat · · Score: 5, Insightful
    01:37 Overheat in plasma injector #1.
    01:56 Plasma injector #1 offline, switching to #2 backup.
    02:23 Overheat in plasma injector #2.
    02:44 Failure to shutdown plasma injector #2.
    02:58 Overheat in reactor core.
    03:20 Containment weakening.
    03:25 Containment weakening.
    03:30 Containment weakening.
    03:35 Five minutes to containment failure.
    03:40 FIVE SECONDS TO WARP CORE BREACH!!!

    Better be careful to train the filter about those warnings that don't happen very often, but when they do, you really want to know about them.
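
    One hedge against that failure mode is to show a line whenever most of its words were never seen in training, no matter what the classifier thinks. The Python sketch below illustrates the idea only; the function names, the 0.5 cutoff and the tiny vocabulary are assumptions, and nothing here is taken from btail.

```python
def unseen_ratio(line, vocabulary):
    """Fraction of the line's words that never appeared in training."""
    tokens = line.lower().split()
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


def should_show(line, vocabulary, classifier_says_bad, max_unseen=0.5):
    """Show what the classifier flags, plus anything mostly unfamiliar."""
    return classifier_says_bad or unseen_ratio(line, vocabulary) > max_unseen


# Hypothetical vocabulary gathered from routine, already-trained messages.
vocabulary = {"overheat", "in", "plasma", "injector", "offline"}

print(should_show("FIVE SECONDS TO WARP CORE BREACH!!!", vocabulary, False))
print(should_show("Overheat in plasma injector #2.", vocabulary, False))
```

    The first call returns True because every word is new to the filter; the second returns False because only one word in five is unfamiliar, so the decision is left to the classifier.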

    --
    One line blog. I hear that they're called Twitters now.
    1. Re:If this were Trek... by aoteoroa · · Score: 2, Interesting

      True. But if the Star Trek error log resembled real life then it might look more like:
      01:37 [error] Overheat in plasma injector #1.
      01:37 [warning] Cargo bay door 2 is open.
      01:38 [warning] Cargo bay door 2 is open.
      01:38 [warning] Oxygen sensor on deck 2 not responding.
      01:39 [warning] Cargo bay door 2 is open.
      01:40 [warning] Cargo bay door 2 is open.
      01:41 [warning] Oxygen sensor on deck 2 not responding.
      01:56 [error] Plasma injector #1 offline, switching to #2 backup.

      In other words real interesting errors in the logs can get hidden by a bunch of trivial log entries.

      I use tail all the time when developing PHP applications. PHP logs errors to the Apache log file, so I type:
      tail -f /var/log/apache/mysite.com-error.log
      to track changes to the Apache logs as I test the PHP pages.

      But the truth of the matter is that I am only interested in PHP errors, and not broken links or missing images. So if I can train btail to pay attention to PHP errors like:
      [Wed Dec 29 10:58:04 2004] [error] PHP Fatal error: Call to undefined function: badFunction() in /home/aoteoroa/www/pages/info.php on line 1

      and ignore file not found errors like:
      [Wed Dec 29 11:16:22 2004] [error] [client 192.168.0.2] File does not exist: /home/aoteoroa/www/pages/info-over.gif
      it would make my job just a little bit easier.
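
      A filter trained on raw lines like these would mostly memorise timestamps and file names, so it usually helps to normalise the variable fields first. The Python sketch below shows one way to do that for the two sample lines; the regular expressions and the `<ip>`, `<path>` and `<num>` placeholders are assumptions for illustration, not anything btail is known to do.

```python
import re

# Leading Apache-style timestamp such as "[Wed Dec 29 10:58:04 2004] ".
TIMESTAMP = re.compile(
    r"^\[[A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4}\] ")
IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
PATH = re.compile(r"/\S+")
NUMBER = re.compile(r"\b\d+\b")


def normalize(line):
    """Drop the timestamp, mask variable fields, return lowercase tokens."""
    line = TIMESTAMP.sub("", line)
    line = IP.sub("<ip>", line)
    line = PATH.sub("<path>", line)
    line = NUMBER.sub("<num>", line)
    return re.findall(r"<\w+>|[a-z]+", line.lower())


php_error = ("[Wed Dec 29 10:58:04 2004] [error] PHP Fatal error: "
             "Call to undefined function: badFunction() in "
             "/home/aoteoroa/www/pages/info.php on line 1")
missing_file = ("[Wed Dec 29 11:16:22 2004] [error] [client 192.168.0.2] "
                "File does not exist: /home/aoteoroa/www/pages/info-over.gif")

print(normalize(php_error))
print(normalize(missing_file))
```

      After normalisation the two kinds of line share only `error` and `<path>`, so words such as `php`, `fatal` and `undefined` are left to carry the "show me" signal while `file`, `not` and `exist` carry the "ignore" signal.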

  5. Re:What I would like to see by tonkdude · · Score: 4, Informative

    I currently use CRM114, and on the mailing list someone (Evan Prodromou) has created a program that does just this using the CRM114 language. It is called "Monkeyplexer", based on the idea that you could train a monkey to sort your mailbox into folders.

    Pop over to the CRM114 site and search the general list archives for monkeyplexer to find the discussions about it.

    Here is the last version announcement that I could find in my mailbox:

    monkeyplexer is a tool for automatically sorting incoming email messages into appropriate folders. A new version of monkeyplexer, 0.7, is now available. http://bad.dynu.ca/~evan/monkeyplexer/monkeyplexer-0.7.tar.gz

    This version includes the following changes:
    • You can specify which mailboxes to use, instead of which mailboxes to exclude. This can save some typing and some time at runtime, at the expense of dynamically updating the list.
    • You can tell the monkeytrainer to only train messages that were received in the last few weeks, days, hours, minutes -- whatever.
    • The monkeyplexer remembers which messages have been trained for which folders. If you train a message for a different folder, the monkeyplexer will automatically forget the first folder before training for the new one.

    Thanks to everyone who has installed monkeyplexer already. I hope this new version helps some people out. I find it easier and more accurate.

    ~ESP

  6. Bayesian is good for almost everything by Ki+Master+George · · Score: 4, Interesting

    Bayesian filtering could be used for lots of things outside of spam. One example could possibly be Wikis, determining spam from ham modifications (well, yes, it is spam here). I've had some other ideas that involve Bayesian, but they've escaped me for the moment.

    --
    Before you walk a mile in someone's shoes, you should insult them so you know how they are and what they're doing.
    1. Re:Bayesian is good for almost everything by dasunt · · Score: 2, Interesting

      Bayesian filtering could be used for lots of things outside of spam. One example could possibly be Wikis, determining spam from ham modifications (well, yes, it is spam here). I've had some other ideas that involve Bayesian, but they've escaped me for the moment.

      • Email sorting filters: imagine a Bayesian setup that can decide if a new mail should be sorted into "work", "friends", "ebay", "amazon", "project", etc.
      • Interest filters: Run Slashdot stories and comments through your own trained Bayesian sorting system and filter out the stories you probably don't want to see. Do the same for news.google.com, CNN, or Usenet.
      • Music sorter: Can Bayesian filters be taught to understand music (pitch, amplitude, etc.)? If so, can they sort on it? If I see a song playing in xmms, can I use my nifty bayesian_sort plugin to play more songs that sound like that for the rest of the day? Consider tying it in to the 'next' button -- if I don't play a song completely, I probably don't want to hear songs like that for the next few days.
      • IM secretary: Add a 'secretary' feature to your IM client. When you enable it, it will show you only messages that it thinks you want to see.

      There are a ton of possibilities available.
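
      The email sorting idea only needs the spam/ham classifier extended from two classes to one class per folder. A minimal Python sketch, assuming a whitespace tokenizer and add-one smoothing; the class name `FolderSorter` and the training snippets are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class FolderSorter:
    """Toy multi-class naive Bayes: one class per mail folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word counts
        self.doc_counts = Counter()              # folder -> messages seen

    def train(self, folder, text):
        self.word_counts[folder].update(text.lower().split())
        self.doc_counts[folder] += 1

    def classify(self, text):
        tokens = text.lower().split()
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            total = sum(counts.values())
            # Log prior for the folder plus smoothed log likelihoods.
            score = math.log(self.doc_counts[folder] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = folder, score
        return best


sorter = FolderSorter()
sorter.train("work", "meeting agenda for monday project review")
sorter.train("ebay", "your auction bid was outbid")
sorter.train("friends", "party on saturday bring pizza")

print(sorter.classify("new bid on your auction"))
print(sorter.classify("pizza party saturday"))
```

      With this toy data the first message lands in "ebay" and the second in "friends". A real sorter would need far more training mail per folder before its guesses were worth trusting.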

  7. Re:What I would like to see by rmohr02 · · Score: 2, Informative

    POPFile is exactly what you're looking for.

  8. Re:This code belongs on by rmohr02 · · Score: 2, Insightful

    Give him a break--it is the first release, and I doubt he's had much feedback yet.

  9. Well, no it doesn't ... by Chromodromic · · Score: 4, Insightful

    All due respect, you're being a bit hard on the guy. He's not doing badly here.

    The [brackets] used in the usage message are standard in the Unix world for specifying an optional or default argument. Just look at any man page. So that, actually, is pretty straightforward. The name of the default config file would likely also be spelled out in the man page, which I would expect, so that's not confusing.

    As for changing the if construct into a switch, well, I'm trusting the accuracy of your excerpt, but I didn't find his code to be very difficult to read, to be honest, and certainly not a candidate for DailyWTF, which typically contains laughably horrible code.

    As far as other code may go, the guy states that this is in a nascent stage, so jumping on his source files seems like a bit of an easy shot :|

    --
    Chr0m0Dr0m!C
  10. Re:This code belongs on by Hard_Code · · Score: 4, Insightful
    That aside, your code would be easier to read (slashcode's broken formatting notwithstanding) if you used a switch construct.
    Speak for yourself. Given that the switch cases are all mutually exclusive, and disregarding the default case, there are only 2 paths, switch is more obfuscatory than clarifying in my opinion.
    --

    It's 10 PM. Do you know if you're un-American?
  11. Reinvent the Wheel Much? by runswithd6s · · Score: 4, Informative
    (Stage Left) Enters the Controllable Regex Mutilator, crm114, with a noticeable strut. He's been there, done that.
    CRM114 is a system to examine incoming e-mail, system log streams, data files or other data streams, and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires. Criteria for categorization of data can be by satisfaction of regexes, by sparse binary polynomial matching with a Bayesian Chain Rule evaluator, a Hidden Markov Model, or by other means. Accuracy of the SBPH/BCR classifier has been seen in excess of 99 per cent, for 1/4 megabyte of learning text. In other words, CRM114 learns, and it learns fast.
    --
    assert(expired(knowledge)); /* core dump */
  12. Why learning with supervision? by MoobY · · Score: 2, Interesting

    I thought this app learned everything that was in the log, and then only showed the new, out-of-the-ordinary log entries that didn't quite fit in with the rest. This would allow it to pick freak events out of the log and show them to the user. How different would such an app be from the proposed btail? And how confident would you be about such an unsupervised log analyzer?
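
    A rough Python sketch of that unsupervised variant: mask the digits in each line to get a "template", count how often each template has been seen, and show a line only while its template is still rare. This is a swapped-in frequency heuristic rather than a Bayesian filter, and the names `template`, `NoveltyTail` and `min_seen` are assumptions for illustration.

```python
import re
from collections import Counter


def template(line):
    """Collapse every run of digits so timestamps and IDs do not matter."""
    return re.sub(r"\d+", "#", line).strip()


class NoveltyTail:
    """Show a line only while its template has been seen fewer than min_seen times."""

    def __init__(self, min_seen=3):
        self.seen = Counter()
        self.min_seen = min_seen

    def observe(self, line):
        """Return True if the line is still novel, then count it."""
        key = template(line)
        is_novel = self.seen[key] < self.min_seen
        self.seen[key] += 1
        return is_novel


tail = NoveltyTail(min_seen=2)
log = [
    "01:37 [warning] Cargo bay door 2 is open.",
    "01:38 [warning] Cargo bay door 2 is open.",
    "01:39 [warning] Cargo bay door 2 is open.",
    "01:40 [warning] Cargo bay door 2 is open.",
    "01:56 [error] Plasma injector #1 offline, switching to #2 backup.",
]
shown = [line for line in log if tail.observe(line)]
print("\n".join(shown))
```

    Here the first two cargo bay warnings and the injector error are shown, and the repeats are dropped. The obvious weakness is the one raised elsewhere in this thread: an error that repeats often enough becomes "normal" and disappears.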

    --
    --- Sigmentation Fault - Comments Dumped
  13. Re:Sure... by I_Love_Pocky! · · Score: 3, Funny

    Why would you run this on an MS system? The critical errors are so common that btail would discard them with the rest of the log file.

  14. Hey! by Jeremiah+Cornelius · · Score: 2, Funny

    Now even geeks can get a little tail!

    --
    "Flyin' in just a sweet place,
    Never been known to fail..."
  15. Here's how to make this a lot more useful by Julian+Morrison · · Score: 3, Interesting

    Step 1: Allow the option to automatically discover and load canned training packages, eg: a directory under /etc. Make it automatically pick the right training file to use when called with a logfile (so eg: btail httpd.conf knows to look for the training for httpd.conf files).

    Step 2: Include btail with major distros

    Step 3: Any package for an app that generates logs can come with a ready-made canned training package, which gets dropped into the /etc directory.

    That way, you could apt-get a package, start btail-ing its logfiles immediately without the need to tediously train the filter first. Training would still be possible, to personalise the filter.
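
    Step 1 could be as simple as a lookup by logfile name. The Python sketch below assumes a hypothetical layout in which each canned training package is a file named after the logfile plus a `.train` suffix; neither that suffix nor the directory names are part of btail.

```python
from pathlib import Path


def find_training(logfile, search_dirs):
    """Return the first canned training file matching the logfile's name.

    Looks for "<basename>.train" in each directory in order, so a
    per-user directory can be listed before a system-wide one to let
    personalised training override the packaged defaults.
    """
    name = Path(logfile).name
    for directory in search_dirs:
        candidate = Path(directory) / (name + ".train")
        if candidate.is_file():
            return candidate
    return None


# Hypothetical search path: user overrides first, then packaged defaults.
search_dirs = [Path.home() / ".btail.d", Path("/etc/btail.d")]
print(find_training("/var/adm/messages", search_dirs))
```

    On a machine without those directories this prints None, which is the cue to fall back to training by hand.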

  16. Bayesian by inertia187 · · Score: 2, Interesting

    Bayesian tail might be neat. I like the idea of broadening the use, but I'd much rather see bayesian filters used on my in-box for more than just spam. I envision a filter that would sort out e-mails based on subject matter. This would have the net effect of improving the filter technology because it's trying to sort e-mails you actually want to look at.

    We all know that if the filter makes a mistake and hides a message in the Spam box, chances are you'll miss it, and another chance to train the filter has been lost. But if an e-mail that was intended to land in the Irate Customer box instead lands in the Clueless Customer box, the likelihood of noticing it is much greater.

    --
    A programmer is a machine for converting coffee into code.
  17. Bayesian AIM bot by duncangough · · Score: 3, Interesting

    I love Bayes stuff - and there's a very nice Python module written by divmod.

    I was playing around with AIML to cobble together a basic chat bot when I realised that I could use a Bayesian parser to radically cut down the amount of AIML that I needed to write. AIML is an XML style of chat bot responses; it's clever in that it's highly recursive, but the downside is that you need to create a rule for every eventuality.

    By adding in a bit of Bayesian guessing before the AIML parser got its hands on the conversation, I'm able to keep the AIML files very focused and give the chat bot a bit more sparkle - you don't have to train him about everything. After a while he realised that 'yo', 'hi' and 'hello' are all the same thing, so he just guesses that you're saying hello and pulls out the correct response from the AIML file (rather than creating an AIML rule to deal with all the variations on 'hello').

    If you're interested I'd strongly recommend installing GrokitBot. You can get the source and a bit more explanation at my site, Suttree.com

    Playaholics : Free Online Games