Slashdot Mirror


Text-Mining Your E-mail

Misha writes "Everyone has had weeks or months that called for better organization of the Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorizes incoming messages by topic as they arrive. It's nice to see that, besides NLP research, some great algorithmic advances are being made, as seen in this paper. Perhaps one of them Perl monkeys will quickly hack up such a background tool." Note: it's a PostScript file.

11 of 217 comments

  1. PS-PDF Document format conversion by Misha · · Score: 5, Informative
    --



    I was thinking of how to intentionally fail my drug test... It would make a good memoir story someday.
  2. Yet another reason for.. by Dr+Caleb · · Score: 4, Informative
    Lotus Notes.

    It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.

    --
    "History doesn't repeat itself, but it does rhyme." Mark Twain
    1. Re:Yet another reason for.. by Dr+Caleb · · Score: 3, Informative
      How do you figure that?

      Lotus Notes (5.0.5), as installed on my system, is 127M (no modem files etc.) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M without PPT, and that's just the Program Files\Microsoft Office directory.

      Lotus Notes is pretty clean, so most of its files are in 1 directory, not spread out over umpteen directories like Office.

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
  3. Remembrance Agent by Tekmage · · Score: 5, Informative

    It's more general than e-mail, but in the wearable computing community there's a little application called Remembrance Agent, written by Bradley Rhodes, that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.

    I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)

    Check out Brad's site for more insight into the work he did and is doing.

    --
    --The more you know, the less you know.
  4. procmail! [Re:The ultimate spam blocker?] by Styx · · Score: 5, Informative
    I use procmail, with weighted scoring.
    First, I sort out mail from the mailing lists I read.
    Then, mail from friends and people I correspond with a lot.
    Finally, I have a weighted scoring recipe:

    # Flags: H and B match the conditions against both headers and body; the trailing colon asks for a lockfile on mail/Trash.
    :0 HB:
    * -199^0
    # Start from a score of -199; the mail is filed in mail/Trash only if the score is above 0 at the end of the recipe.
    * 50^1 ^(From|To):.*@hotmail\.com
    * 50^1 ^(From|To):.*@yahoo\.com
    * 50^1 ^(From|To):.*@aol\.com
    * 50^1 ^(From|To):.*@msn\.com
    * 50^1 ^(From|To):.*@excite\.com
    * 50^1 ^(From|To):.*@netscape\.net
    * 50^1 ^(From|To):.*@yahoo\.co\.uk
    # Most mail to and from these domains is spam, so score it.
    * 100^1 opt-out
    * 50^1 opt-in
    * 200^1 OTCBB
    * 50^1 viagra
    * 50^1 zyban
    * 50^1 propecia
    * 75^1 FREE
    * 75^1 GUARANTEED
    * 75^1 LEGAL
    * 50^2 MILLIONAIRE
    * 50^1 100%
    # Words I only see in spam.
    mail/Trash

    This works quite well for me. If any spam gets through, I look for words that I don't get in normal mail and add them to the scoring.
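
For readers who don't use procmail, the scoring arithmetic in a recipe like this can be sketched in a few lines of Python. This is only an illustration, not procmail itself: the `RULES` list, `score` and `is_spam` are hypothetical names, only a handful of the patterns are copied over, and matching is assumed to be case-insensitive. Each `w^x` condition adds `w` for the first match, `w*x` for the second, `w*x*x` for the third, and so on; the message is filed when the total ends up above 0.

```python
import re

# (weight, exponent, pattern) triples, modeled on a few lines of the recipe.
# These names are illustrative assumptions, not part of procmail.
RULES = [
    (100, 1, r"opt-out"),
    (50, 1, r"viagra"),
    (75, 1, r"FREE"),
    (50, 2, r"MILLIONAIRE"),
]
BASE_SCORE = -199  # plays the role of the "* -199^0" line


def score(text, rules=RULES, base=BASE_SCORE):
    """Return the weighted score of a message text."""
    total = base
    for weight, exponent, pattern in rules:
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        # The n-th match (counting from 0) contributes weight * exponent**n.
        total += sum(weight * exponent ** n for n in range(hits))
    return total


def is_spam(text):
    """A message is filed as spam only when its final score is above 0."""
    return score(text) > 0
```

With these rules, a single "free" in an ordinary message is nowhere near enough to overcome the -199 starting score; it takes several spam words together to cross 0.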

    --
    /Styx
    1. Re:procmail! [Re:The ultimate spam blocker?] by bruckie · · Score: 4, Informative

      Or you could just use SpamAssassin, which is designed specifically to do this and has many more rules that have been created by others.

      --Bruce

      --
      There are 10 kinds of people in the world: those who understand binary, and those who don't.
  5. Since 5.0 it can by barzok · · Score: 3, Informative

    Message rules are very easy to set up and manage. No agents.

  6. VM & EMACS by pmz · · Score: 3, Informative

    I have enjoyed using the VM module for Emacs. It allows sorting your entire Inbox into separate categorized mailboxes via regular expressions. Basically with one shift-A keystroke, my entire day's worth of mailing list stuff gets whisked away into a half-dozen different files. After this, I feel really sorry for people trapped in the Outlook dungeons!
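
The idea of sorting by regular expression can be sketched outside Emacs as well. The Python below is a rough illustration, not VM's actual mechanism or configuration: the `RULES` table, the folder names and both function names are made up for the example. Each rule names a header, a pattern and a target folder, and the first matching rule wins.

```python
import re
from email import message_from_string

# Hypothetical (header, pattern, folder) rules; the first match wins.
RULES = [
    ("List-Id", r"linux-kernel", "lists/lkml"),
    ("Subject", r"\[perl\]", "lists/perl"),
    ("From", r"@example\.com", "work"),
]


def choose_folder(raw_message, rules=RULES, default="inbox"):
    """Pick a folder for one raw RFC 822 message based on its headers."""
    msg = message_from_string(raw_message)
    for header, pattern, folder in rules:
        value = msg.get(header, "")
        if re.search(pattern, value, flags=re.IGNORECASE):
            return folder
    return default


def sort_messages(raw_messages):
    """Group raw messages by the folder chosen for each of them."""
    folders = {}
    for raw in raw_messages:
        folders.setdefault(choose_folder(raw), []).append(raw)
    return folders
```

Calling `sort_messages` on a day's worth of raw messages returns a dictionary from folder name to the messages that belong there, which is roughly what the one-keystroke sort accomplishes.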

  7. Re:What I want by nosferatu-man · · Score: 5, Informative

    Welcome to Gnus. Have a sandwich.

    (jfb)

    --
    To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
  8. Done already by Matts · · Score: 5, Informative

    "Perhaps even one of them Perl monkeys will quickly hack such a background tool."

    Been done already. Check out Mail::Miner.

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
  9. Not new, but cool. by jefferson · · Score: 3, Informative

    There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.

    The paper is here.

    J.
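
To make the approach concrete, here is a minimal sketch of a Naive Bayes classifier that uses word counts as features. It is not the code from the study mentioned above: the `NaiveBayes` class, the `tokenize` helper and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts (illustrative sketch)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training messages
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label in self.doc_counts:
            # Log prior: fraction of training messages with this label.
            logp = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Add-one smoothing so unseen words do not zero out a label.
                count = self.word_counts[label][w] + 1
                logp += math.log(count / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Usage is two steps: call `train(text, label)` on messages that have already been filed by hand, then call `classify(text)` on new mail. The labels can be "spam" and "ham", or folder names for topic sorting.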