Slashdot Mirror


Text-Mining Your E-mail

Misha writes "There have been a number of weeks/months in anyone's life that called for better organization of the Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorizes incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.

18 of 217 comments

  1. PS-PDF Document format conversion by Misha · · Score: 5, Informative
    --



    I was thinking of how to intentionally fail my drug test... It would make a good memoir story someday.
  2. Yet another reason for.. by Dr+Caleb · · Score: 4, Informative
    Lotus Notes.

    It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.

    --
    "History doesn't repeat itself, but it does rhyme." Mark Twain
    1. Re:Yet another reason for.. by Dr+Caleb · · Score: 3, Informative
      How do you figure that?

      Lotus Notes (5.0.5), as installed on my system, is 127M (no modem files etc) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M, without PPT, and that's just the Program Files\Microsoft Office directory.

      Lotus Notes is pretty clean, so most of its files are in 1 directory, not spread out over umpteen directories like Office.

      --
      "History doesn't repeat itself, but it does rhyme." Mark Twain
  3. Too much information. by abucior · · Score: 5, Funny

    Personally, I'd prefer that I simply get less email. The fact that we need NLP tools to pre-screen our email for us just shows how information-overloaded our society has become. What I really need is a tool at the sender's end that can pre-screen my email and tell the sender "Don't send this. He just doesn't care!"

  4. look by Joe+the+Lesser · · Score: 4, Funny

    Now we all know that most email is delivered promptly by gremlins, but gremlins are hungry and will eat a few bytes here and there.

    They also leave waste in the form of spam.

    So, I propose that we turn to gnomes to deliver the mail instead, as they are much cleaner, and can be satiated by attaching a file like 'Hamburger.txt'.

    --
    "I only speak the truth"
    Karma: null(Mostly affected by an unassigned variable)
  5. The joys of owning a domain by CaptainPhong · · Score: 5, Insightful
    I've found the most joy from owning my own domains, and a lot of it has to do with e-mail sorting/filtering as much as the traditional benefits (a permanent www.yourdomain.com web site address and yourname@yourdomain.com e-mail address).

    Every time you sign up for some mailing list or discussion group, create a new e-mail account or alias for just those mailings. Bam, it's automatically sorted out by itself with extreme ease. If you have limited bandwidth (or are checking, say, on your palm) sometimes, just check your important addresses frequently, and reserve your mailing lists for a once-per-day check.

    If some site asks for your e-mail address to download a piece of software, or to register, make up a new alias and give that to them. If you start getting tons of crap at that address, you can just remove that alias, and they get it all bounced back in their stupid spamming faces.

    Give one address to your cow-orkers just for work stuff. Give a different one to your Mom and other techno-nots that blocks all attachments. Give another one to your friends with brains that goes unfiltered. For people you don't want to talk to, give them the address of an autoresponder tied to Eliza.

    Be a *Happy Camper* and let your addresses be *Bubbles* and you be just *You*.

    --
    ... "Give me a woman who loves beer and I will conquer the w
  6. Remembrance Agent by Tekmage · · Score: 5, Informative

    It's more general than e-mail, but in the wearable computing community, there's a little application called Remembrance Agent, written by Bradley Rhodes, that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.

    I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)

    Check out Brad's site for more insight into the work he did and is doing.

    --
    --The more you know, the less you know.
  7. One use to rule them all by Col.+Panic · · Score: 3, Funny

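    use strict;
    use warnings;

    my $pr0n   = qr/adult/i;
    my $spam   = qr/viagra/i;
    my $urgent = qr/penis enlargement/i;
    my $inbox  = "/home/mail";    # assumed to hold one message per file
    my (@MUST_SEE, @RAINY_DAY);

    opendir(my $dh, $inbox) or die "Damn! No fun for me: $!\n";
    my @list = grep { -f "$inbox/$_" } readdir($dh);
    closedir($dh);

    foreach my $file (@list) {
        # Slurp the whole message so the patterns run against its text.
        open(my $fh, '<', "$inbox/$file") or next;
        my $text = do { local $/; <$fh> };
        close($fh);
        next unless defined $text;

        if    ($text =~ $spam)   { unlink("$inbox/$file"); }   # delete spam
        elsif ($text =~ $pr0n)   { push @MUST_SEE,  $file; }   # keep for later
        elsif ($text =~ $urgent) { push @RAINY_DAY, $file; }   # save for a rainy day
    }
    # or something like that ...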

  8. Re:Link to a postscript file? by SuiteSisterMary · · Score: 4, Funny
    why not just bundle a damn interpreter with the OS and have a minimal frontend on it for screen viewing?
    Gee, wouldn't that be illegally using their monopoly to muscle out third party developers? Why, if the OS had a PS viewer built in, nobody would ever buy one! Businesses would go bankrupt!
    --
    Vintage computer games and RPG books available. Email me if you're interested.
  9. Postscript document by Tim+Ward · · Score: 3, Interesting

    Somewhat to my astonishment when I clicked on the link up popped a box asking me to confirm Postscript Renderer options! I had no idea that I had anything on this box that could read Postscript.

    Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.

    But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.

  10. procmail! [Re:The ultimate spam blocker?] by Styx · · Score: 5, Informative
    I use procmail, with weighted scoring.
    First, I sort out mail from the mailing lists I read.
    Then, mail from friends, and people I correspond with a lot.
    Finally, I have a weighted scoring recipe:

    :0 Bh
    * -199^0
    #Assign an initial value of -199; mail gets filtered if the score is above 0 at the end of the recipe.
    * 50^1 ^(From|To):.*@hotmail.com
    * 50^1 ^(From|To):.*@yahoo.com
    * 50^1 ^(From|To):.*@aol.com
    * 50^1 ^(From|To):.*@msn.com
    * 50^1 ^(From|To):.*@excite.com
    * 50^1 ^(From|To):.*@netscape.net
    * 50^1 ^(From|To):.*@yahoo.co.uk
    #Most mail to and from these domains is spam, so score it.
    * 100^1 opt-out
    * 50^1 opt-in
    * 200^1 OTCBB
    * 50^1 viagra
    * 50^1 zyban
    * 50^1 propecia
    * 75^1 FREE
    * 75^1 GUARANTEED
    * 75^1 LEGAL
    * 50^2 MILLIONAIRE
    * 50^1 100%
    #Words I only see in spam.
    mail/Trash

    This works quite well for me. If any spam gets through, I try to find some words that I don't get in normal mail and add them to the scoring.
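
    For anyone who doesn't read procmail, the scoring idea in the recipe above can be sketched roughly like this in Python. This is only an illustration, not procmail's implementation: the function names are made up, only a handful of the rules are copied over, and it assumes the usual procmail behaviour where the first match of a `w^x` condition adds w, the second adds w*x, the third w*x*x, and matching ignores case.

```python
import re

# (weight, exponent, pattern) triples copied from a few of the recipe's rules.
RULES = [
    (100, 1, r"opt-out"),
    (50, 1, r"opt-in"),
    (200, 1, r"OTCBB"),
    (50, 1, r"viagra"),
    (75, 1, r"FREE"),
    (50, 2, r"MILLIONAIRE"),
]
INITIAL_SCORE = -199  # same starting value as the recipe


def score(text, rules=RULES, initial=INITIAL_SCORE):
    """Sum the weighted contributions of every rule that matches the text."""
    total = initial
    for weight, exponent, pattern in rules:
        hits = len(re.findall(pattern, text, re.IGNORECASE))
        # The k-th match (counting from 0) contributes weight * exponent**k.
        total += sum(weight * exponent ** k for k in range(hits))
    return total


def is_spam(text):
    """Mail is filtered when the final score is above 0."""
    return score(text) > 0
```

    With the starting value of -199, a single weak hit is never enough; a message needs several hits (or one heavy one such as OTCBB) before it lands in the trash folder.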

    --
    /Styx
    1. Re:procmail! [Re:The ultimate spam blocker?] by bruckie · · Score: 4, Informative

      Or you could just use SpamAssassin, which is designed specifically to do this and has many more rules that have been created by others.

      --Bruce

      --
      There are 10 kinds of people in the world: those who understand binary, and those who don't.
  11. Since 5.0 it can by barzok · · Score: 3, Informative

    Message rules are very easy to set up and manage. No agents.

  12. Re:What's wrong with IMAP ? by statusbar · · Score: 5, Interesting

    DBMAIL looks cool; once it supports postgresql it would be awesome.

    I have been disappointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.

    So last year I bit the bullet and wrote a 50-line Python program which imported all my mbox and Maildir format archives into a simple postgresql database. 600 megs worth over the last 4 years.

    And another simple 50 line php program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much much smaller) imap folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial strength email database system and I am surprised that it hasn't happened sooner in the open source world.

    I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.
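
    A rough sketch of the import-then-query idea, for the curious. This is not the original program: it assumes SQLite (from the Python standard library) in place of postgresql so that it runs without a server, and the table name, column names and function names are all invented for the example.

```python
import mailbox
import sqlite3
from email import message_from_string

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS messages ("
    "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, date TEXT, body TEXT)"
)


def body_text(msg):
    """Return the first text/plain part as a string ('' if there is none)."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            return payload.decode("utf-8", errors="replace")
    return ""


def import_messages(messages, conn):
    """Insert an iterable of email messages; return how many were stored."""
    conn.execute(SCHEMA)
    count = 0
    for msg in messages:
        conn.execute(
            "INSERT INTO messages (sender, subject, date, body) VALUES (?, ?, ?, ?)",
            (str(msg.get("From", "")), str(msg.get("Subject", "")),
             str(msg.get("Date", "")), body_text(msg)),
        )
        count += 1
    conn.commit()
    return count


def import_mbox(path, conn):
    """Import every message from an mbox file at the given path."""
    return import_messages(mailbox.mbox(path), conn)


def search(conn, term):
    """Return (sender, subject) pairs whose subject or body contains term."""
    like = f"%{term}%"
    cur = conn.execute(
        "SELECT sender, subject FROM messages "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (like, like),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    demo = message_from_string("From: a@example.com\nSubject: hi\n\nlunch on friday?\n")
    import_messages([demo], conn)
    print(search(conn, "lunch"))
```

    Once the mail is in a table like this, anything SQL can express (date ranges, joins against an address book, counts per sender) is a query away, which is the capability argument made above.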

    Jeff

    --
    ipv6 is my vpn
  13. VM & EMACS by pmz · · Score: 3, Informative

    I have enjoyed using the VM module for Emacs. It allows sorting your entire Inbox into separate categorized mail boxes via regular expressions. Basically with one shift-A keystroke, my entire day's worth of mailing list stuff gets whisked away into a half-dozen different files. After this, I feel really sorry for people trapped in the Outlook dungeons!
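
    The sorting itself is a simple idea: a list of regular-expression rules, with the first match deciding the folder. A minimal Python sketch of that idea follows. It is not how VM is implemented, and the rules, addresses and folder names are made up for illustration.

```python
import re
from email import message_from_string

# Hypothetical (header, pattern, folder) rules; the first match wins.
RULES = [
    ("List-Id", r"linux-kernel", "lkml"),
    ("Subject", r"^\[perl\]", "perl"),
    ("From", r"@example\.com\b", "work"),
]


def folder_for(msg, rules=RULES, default="inbox"):
    """Return the folder of the first rule whose pattern matches its header."""
    for header, pattern, folder in rules:
        value = str(msg.get(header, ""))
        if re.search(pattern, value, re.IGNORECASE):
            return folder
    return default


def sort_messages(messages, rules=RULES):
    """Group messages into a dict mapping folder name to list of messages."""
    folders = {}
    for msg in messages:
        folders.setdefault(folder_for(msg, rules), []).append(msg)
    return folders


if __name__ == "__main__":
    demo = message_from_string("From: Bob <bob@example.com>\nSubject: status\n\nhi\n")
    print(folder_for(demo))
```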

  14. Re:What I want by nosferatu-man · · Score: 5, Informative

    Welcome to Gnus. Have a sandwich.

    (jfb)

    --
    To spur "enterprise Linux," Big Bang, the distributed two-phase commit.
  15. Done already by Matts · · Score: 5, Informative

    "Perhaps even one of them Perl monkeys will quickly hack such a background tool."

    Been done already. Check out Mail::Miner.

    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
  16. Not new, but cool. by jefferson · · Score: 3, Informative

    There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.

    The paper is here.
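
    For a feel of how little code the basic technique needs, here is a minimal multinomial Naive Bayes sketch in Python with word counts as features. It is not the code from the project: the tokenizer, the class name and the add-one (Laplace) smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training messages seen
        self.vocab = set()

    def train(self, text, label):
        """Add one labelled message to the word and message counts."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log posterior (None if untrained)."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label in self.doc_counts:
            # Log prior: fraction of training messages carrying this label.
            logp = math.log(self.doc_counts[label] / total_docs)
            # Add-one smoothing keeps unseen words from zeroing the product.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                logp += math.log((self.word_counts[label][word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

    Training is just calling train() once per labelled message; the labels can be "spam"/"ham" or one per mail folder, since nothing in the code limits it to two classes.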

    J.