Text-Mining Your E-mail
Misha writes "There have been a number of weeks/months in anyone's life that called for a better organization of your Inbox. filtering and folders work, but it'd be nice to have an text-mining tool running in the background that categorized incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being done, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.
Somewhat to my astonishment when I clicked on the link up popped a box asking me to confirm Postscript Renderer options! I had no idea that I had anything on this box that could read Postscript.
Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.
But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.
DBMAIL looks cool, once it supports postgresql it would be awesome.
I have been dissapointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.
So last year I bit the bullet and wrote a 50 line python program which imported all my mbox and Maildir format archives into a simple postgresql database. 600 megs worth over the last 4 years.
And another simple 50 line php program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much much smaller) imap folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial strength email database system and I am surprised that it hasn't happened sooner in the open source world.
I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.
Jeff
ipv6 is my vpn
Spam filtering is one possible application of this type of tool, but the more useful involves taking the mail you *do* want, and sorting it into logical buckets. For instance, let's say work on several open source projects, belong to a couple organizations, and have a real-life job. You could toss a filter in your email that scans each incoming message and throws it in the proper bucket. This allows you to logically separate your mail to reduce confusion of each non-overlapping category.
:-)
Procmail only goes so far, it's really only useful for simple header scanning.. I could really see a good scanner utility being a valuable tool. Maybe Google should share some of their technology..
Josh Woodward
The post you're ranting against was a reply to one that suggests filtering is not what we should do. That spam needs to be "killed at the source". Which means legally preventing someone from creating any mail in the first place.
Say what you will about spammers, but that IS censorship.
('Course there's plenty of people here who believe that censorship is fine in this case, but that's not what you're arguing, so I won't either.)