Organizing Large Volumes of Email?
Trixter asks: "Like most nerds, I receive a large volume of email that I archive in several files and directories in a filesystem. This is inefficient, especially when it comes to searching for an old or obscure bit of information. I can imagine several better ways to organize email for archival and lookup, but has anyone already done this? I want to try avoiding reinventing the wheel for the tenth time this year. By 'better ways', I'm talking about all solutions--from the Perl monger 'one 10-line script will do the trick' perl script to parse up a long mbox-format file into little bits for intelligent grepping, to maybe an elegant 'mbox-format file to SQL database' loader/translator script and a series of SQL statements to support searches. Please, help me organize my gigabytes-long, decade-long email archive!"
I've got everything I've ever written on a computer -- email and everything else -- since around 1981.
It ain't always pretty but I'll tell you the secret to my success. Plain text. ASCII. If you want to be able to read what you've written now ten years from now, keep it ASCII.
Thanks to a basic format, I was able to convert my TRS-80 Model I tapes to Model 4 disks and my Model 4 disks to the 20-meg drive in my first Tandy 1000. From there it has been easy. My new hard drive is always huge compared to the last one, so my old data usually takes up a third of the new disk. No big deal.
I read all my email with 'mail', thus protecting myself from viruses and funky email formats (Eudora, Outlook, CCMail, etc.). At the end of the month, my mailbox is dated, rotated and gzipped. The header information (Date, From and Subject) is added to a master index file along with the filename where the message can be found.
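For anyone who wants to try the same monthly routine, here is a rough Python sketch of it. It is not the poster's actual script: the function name archive_mailbox, the mail-<stamp>.gz naming, and the tab-separated index layout (archive name, Date, From, Subject) are all assumptions made for illustration. It also does no mailbox locking, so it should only be run when nothing is delivering mail.

```python
import gzip
import mailbox
import shutil
from pathlib import Path


def _clean(value):
    """Collapse whitespace so each header fits on one index line."""
    return " ".join(str(value or "").split())


def archive_mailbox(mbox_path, archive_dir, index_path, stamp):
    """Index, gzip, and empty one mbox file; return the archive path.

    Hypothetical layout: archives are named mail-<stamp>.gz and the
    index holds one tab-separated line per message.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive = archive_dir / f"mail-{stamp}.gz"

    # Append archive name, Date, From and Subject to the master index.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        with open(index_path, "a", encoding="utf-8") as index:
            for msg in box:
                headers = [_clean(msg[h]) for h in ("Date", "From", "Subject")]
                index.write("\t".join([archive.name] + headers) + "\n")
    finally:
        box.close()

    # Compress the month's mail, then empty the live mailbox.
    # No locking is done here: an assumption that no delivery is in progress.
    with open(mbox_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    open(mbox_path, "wb").close()
    return archive
```

A call such as archive_mailbox("mbox", "old-mail", "index.tsv", "2001-01") would leave old-mail/mail-2001-01.gz behind and add one index line per message. Both the archive and the index stay plain text once gunzipped, which is the point of the whole approach.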
I've got a few ugly scripts that will search by keyword so I can find old stuff.
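The scripts themselves aren't shown, so the following is only a guess at what such a search could look like in Python. It assumes a tab-separated index file whose columns are archive filename, Date, From and Subject, and a directory of gzipped mailboxes; the names search_index and grep_archives are made up for this sketch.

```python
import gzip
from pathlib import Path


def search_index(index_path, keyword):
    """Return index rows whose Date, From, or Subject contains keyword."""
    needle = keyword.lower()
    hits = []
    with open(index_path, encoding="utf-8") as index:
        for line in index:
            row = line.rstrip("\n").split("\t")
            # Column 0 is the archive filename; match only on the headers.
            if any(needle in field.lower() for field in row[1:]):
                hits.append(row)
    return hits


def grep_archives(archive_dir, keyword):
    """Yield (archive name, line number, line) for matching lines in *.gz files."""
    needle = keyword.lower()
    for path in sorted(Path(archive_dir).glob("*.gz")):
        # errors="replace" keeps odd encodings in old mail from stopping the scan.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as text:
            for number, line in enumerate(text, start=1):
                if needle in line.lower():
                    yield path.name, number, line.rstrip("\n")
```

The index lookup is the cheap first pass: it names the archive to open. The full grep over every gzipped file is the slow fallback for when the keyword only appears in a message body.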
Yes, I'm living in the stone age. Those of you able to read your email going back to the early 1980s feel free to throw stones.
I think putting the stuff in a database would be a bad idea. When you change platforms, there will be maintenance. When you change databases, there will be maintenance. With plain ASCII text, you know you'll always be able to read it and you never have to upgrade. (Okay, by using gzip, there may be some extra effort on my part. A few years ago, I moved everything from compress to gzip. That's not the same thing as going from Oracle to Sybase, however.)
Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.
InitZero