Slashdot Mirror


Organizing Large Volumes of Email?

Trixter asks: "Like most nerds, I receive a large volume of email that I archive in several files and directories in a filesystem. This is inefficient, especially when it comes to searching for an old or obscure bit of information. I can imagine several better ways to organize email for archival and lookup, but has anyone already done this? I want to avoid reinventing the wheel for the tenth time this year. By 'better ways', I'm talking about all solutions--from the Perl monger's 'one 10-line script will do the trick' approach that parses a long mbox-format file into little bits for intelligent grepping, to maybe an elegant 'mbox-format file to SQL database' loader/translator script and a series of SQL statements to support searches. Please, help me organize my gigabytes-long, decade-long email archive!"
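
For illustration, the 'mbox-format file to SQL database' loader the question describes might look roughly like the Python sketch below. It assumes SQLite as the database, and the function name `load_mbox`, the `messages` table and its columns are hypothetical choices, not anything the asker specified.

```python
import mailbox
import sqlite3


def load_mbox(mbox_path, db_path):
    """Copy the key headers and raw bytes of every message in an mbox file into SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               date TEXT, sender TEXT, subject TEXT, raw BLOB)"""
    )
    box = mailbox.mbox(mbox_path, create=False)
    try:
        with conn:  # commit on success, roll back on error
            for key in box.iterkeys():
                msg = box.get_message(key)
                conn.execute(
                    "INSERT INTO messages (date, sender, subject, raw) "
                    "VALUES (?, ?, ?, ?)",
                    (
                        str(msg.get("Date", "")),
                        str(msg.get("From", "")),
                        str(msg.get("Subject", "")),
                        box.get_bytes(key),  # message as stored, without the "From " separator line
                    ),
                )
    finally:
        box.close()
    return conn
```

A search is then an ordinary query, for example `conn.execute("SELECT date, subject FROM messages WHERE sender LIKE ?", ("%kevin%",))`.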

3 of 24 comments

  1. A few ideas by kevin42 · · Score: 3
    I've got the same problem, and I've handled it by sorting my email into different mbox files based on content and/or who it's to/from. Then I used a bit of Perl to take messages older than a cut-off date and archive them into a directory with an identical layout (i.e. the same folder configuration).

    So I have something like this:

    Pending
    Misc
    Friends
    Pre-1998\Pending
    Pre-1998\Misc
    Pre-1998\Friends
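
    The Perl itself isn't shown, but the archiving step could be sketched in Python roughly as follows. The function name `archive_old_messages` and its parameters are hypothetical. The sketch assumes each folder is a single mbox file, that `cutoff` is a timezone-aware datetime, and that no mail client is writing to the folder while it runs (it does no locking).

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def archive_old_messages(live_root, archive_root, folder, cutoff):
    """Move messages dated before `cutoff` from live_root/folder to archive_root/folder.

    Returns the number of messages moved.
    """
    src_path = Path(live_root) / folder
    dst_path = Path(archive_root) / folder  # same folder name under the archive root
    dst_path.parent.mkdir(parents=True, exist_ok=True)

    src = mailbox.mbox(str(src_path), create=False)
    dst = mailbox.mbox(str(dst_path))  # created if it does not exist yet
    moved = 0
    try:
        for key in list(src.keys()):
            msg = src[key]
            date_header = msg["Date"]
            if date_header is None:
                continue  # no Date header: leave the message where it is
            try:
                when = parsedate_to_datetime(str(date_header))
            except (TypeError, ValueError):
                continue  # unparseable Date header: leave it in place
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)  # assumption: treat naive dates as UTC
            if when < cutoff:
                dst.add(msg)
                src.remove(key)
                moved += 1
        dst.flush()  # write the archive before rewriting the live folder
        src.flush()
    finally:
        dst.close()
        src.close()
    return moved
```

    Calling it once per folder, e.g. `archive_old_messages("Mail", "Mail/Pre-1998", "Misc", datetime(1998, 1, 1, tzinfo=timezone.utc))`, would reproduce the layout above (the `Mail` root is a made-up path).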

    Then I use hypermail to create an HTML archive of everything nightly, and put it into a password-protected directory on my webserver. Then I use a regular web-based search engine for searching.

    Right now I'm playing around with doing all my mail via a web interface (using aeromail->imap) so I can access it securely (SSL) anywhere. It's working pretty well; I just need to figure out a good way to notify me when I get a new message (I'm thinking of an ICQ bot that sends me a message or something...)

    I hope that helps. I'd be happy to work with anyone who wants to create a better shrink-wrapped system for managing large amounts of old email. To me it's important that, however it's stored, it is very portable, since I've changed email clients a lot over the years.

  2. Don't sort it out if you don't have to by human+bean · · Score: 3
    As mentioned elsewhere, ASCII text files are the best way of dealing with long term storage of email.

    The other necessary item is the fastest file content search utility you can lay hands on.

    Indices, tables of contents, folders, categories, catalogs, and directories will always have misfiles after a certain period of time, and besides, who wants to categorise all that mail anyway? Dump it into a few (time-based?) directories and simply search for things when you need them.
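
    As a rough illustration of "dump it and search", here is a Python sketch that walks a directory tree of uncompressed mbox files and reports every message containing a string. The function name `search_mboxes` is hypothetical, and the match is a case-insensitive scan of the raw message bytes, so base64 or quoted-printable encoded bodies are not decoded first. A fast grep over the same files does the same job.

```python
import email
import mailbox
from pathlib import Path


def search_mboxes(root, needle):
    """Yield (path, subject) for each message under `root` whose raw bytes contain `needle`."""
    needle_bytes = needle.lower().encode("utf-8")
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        box = mailbox.mbox(str(path), create=False)
        try:
            for key in box.iterkeys():
                raw = box.get_bytes(key)
                if needle_bytes in raw.lower():
                    # Parse only the matching messages to pull out a subject line.
                    subject = email.message_from_bytes(raw).get("Subject", "")
                    yield str(path), str(subject)
        finally:
            box.close()
```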

    --

    *whup* "Get along, little electrons. Heeyah!"

  3. Keep it Simple by InitZero · · Score: 5

    I've got everything I've ever written on a computer -- email and else -- since around 1981.

    It ain't always pretty but I'll tell you the secret to my success. Plain text. ASCII. If you want to be able to read what you've written now ten years from now, keep it ASCII.

    Thanks to a basic format, I was able to convert my TRS-80 Model I tapes to Model 4 disks and my Model 4 disks to the 20-meg drive in my first Tandy 1000. From there it has been easy. My new hard drive is always huge compared to the last one so my old data usually takes up a third of the new disk. No big deal.

    I read all my email with 'mail', thus protecting myself from viruses and funky email formats (Eudora, Outlook, CCMail, etc.). At the end of the month, my mailbox is dated, rotated and gzipped. The header information (Date, From and Subject) is added to a master index file along with the filename where the message can be found.
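
    The monthly rotation described here could be sketched in Python along these lines. The function name `rotate_mailbox`, the `mail-<stamp>.gz` naming scheme and the tab-separated index layout are all assumptions for illustration; the post does not say how the real index is formatted. The sketch also assumes nothing is delivering to the mailbox while it runs.

```python
import gzip
import mailbox
import shutil
from pathlib import Path


def rotate_mailbox(mbox_path, archive_dir, index_path, stamp):
    """Index, gzip and empty one mbox file.

    `stamp` is a label such as "1999-12" used in the archive filename.
    Returns the path of the gzipped archive.
    """
    mbox_path = Path(mbox_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive = archive_dir / f"mail-{stamp}.gz"

    # Append one index line per message: Date, From, Subject, archive filename.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        with open(index_path, "a", encoding="utf-8", errors="replace") as index:
            for msg in box:
                fields = [str(msg.get(h, "")) for h in ("Date", "From", "Subject")]
                # Collapse folded headers and stray tabs so each message stays on one line.
                fields = [" ".join(f.split()) for f in fields]
                index.write("\t".join(fields + [archive.name]) + "\n")
    finally:
        box.close()

    # Compress the mailbox byte for byte, then start the new month empty.
    with open(mbox_path, "rb") as raw, gzip.open(archive, "wb") as gz:
        shutil.copyfileobj(raw, gz)
    mbox_path.write_bytes(b"")
    return archive
```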

    I've got a few ugly scripts that will search by keyword so I can find old stuff.
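
    The scripts aren't shown, but a keyword search over a master index might look something like this Python sketch. It assumes one tab-separated line per message holding Date, From, Subject and the archive filename; that layout and the function name `search_index` are hypothetical.

```python
def search_index(index_path, keyword):
    """Return (date, sender, subject, filename) tuples whose From or Subject mentions `keyword`."""
    keyword = keyword.lower()
    hits = []
    with open(index_path, encoding="utf-8", errors="replace") as index:
        for line in index:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip lines that do not match the assumed layout
            date, sender, subject, filename = fields
            if keyword in sender.lower() or keyword in subject.lower():
                hits.append((date, sender, subject, filename))
    return hits
```

    Each hit names the gzipped file to open, so only the months that matter ever need decompressing.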

    Yes, I'm living in the stone age. Those of you able to read your email going back to the early 1980s feel free to throw stones.

    I think putting the stuff in a database would be a bad idea. When you change platforms, there will be maintenance. When you change databases, there will be maintenance. With plain ASCII text, you know you'll always be able to read it and you never have to upgrade. (Okay, by using gzip, there may be some extra effort on my part. A few years ago, I moved everything from compress to gzip. That's not the same thing as going from Oracle to Sybase, however.)

    Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.

    InitZero