Best Way To Archive Emails For Later Searching?
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
Time to delete them all
I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...
You are a hostile lawyer's fantasy come true. ;-)
See subject.
If you want an "email format" why not mbox? Many things currently support that as an import option.
If you want a database, why not SQLite? It's about as open as can be, backwards compatibility is almost a religion and should have no problem with hundreds of thousands of entries.
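To make the SQLite idea concrete, here is a minimal sketch using only Python's standard `mailbox` and `sqlite3` modules. The table layout and the names `index_mbox`, `plain_text` and `search` are made up for illustration, and it uses plain `LIKE` matching rather than a real full-text index, so treat it as a starting point and not a finished archiver.

```python
import mailbox
import sqlite3


def plain_text(msg):
    """Concatenate the text/plain parts of a message, decoded leniently."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)


def index_mbox(mbox_path, db_path):
    """Copy headers and plain-text bodies from an mbox file into SQLite."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: one row per message.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, date TEXT, body TEXT)"
    )
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            conn.execute(
                "INSERT INTO mail (sender, subject, date, body) VALUES (?, ?, ?, ?)",
                (
                    str(msg.get("From", "")),
                    str(msg.get("Subject", "")),
                    str(msg.get("Date", "")),
                    plain_text(msg),
                ),
            )
        conn.commit()
    finally:
        box.close()
    return conn


def search(conn, term):
    """Return (subject, sender) rows whose subject or body contains term."""
    like = f"%{term}%"
    return conn.execute(
        "SELECT subject, sender FROM mail "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (like, like),
    ).fetchall()
```

Attachments and HTML parts are ignored here; the original mbox stays the archive of record and the database is only a lookup aid. If your SQLite build includes a full-text extension, swapping the `LIKE` query for that would be the obvious next step.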
I second that. Invest in Google Apps to benefit from additional services as well.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to review all of the correspondence after the fact for demographic analysis, often two years after the event when the final reports are being written. Saying that this sort of behavior is odd or not normal is either trolling or not understanding how the world works when you're not just a drone.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Yes! The thing that appeals to me the most about using Gmail is that searching through 5+GB of old emails won't make everything in my machine slow to a crawl. Even with the free Gmail account, you can up the storage to 20GB for $5/year, and that extra space is available from other Google services connected to the same account.
If you want to have more flexibility, sign up for a Backupify account, which can back up Gmail pretty well. As a bonus, when Backupify stores your backups they are kept in plain text format, so you can always pull these and move them elsewhere without having to worry about issues with Gmail's storage formats.
Pedro
----
The Insomniac Coder
I recommend mbox (MBX) format.
1. The format is text based and not likely to become unreadable anytime in the foreseeable future.
2. There is no shortage of tools for manipulating mbox.
3. It's easily indexed by full text search applications (MS Search included with Windows)
The Outlook tools save dialogue has an Apple export option, which is actually the mbox format.
In terms of archival access I recommend an IMAP server with a folder hierarchy based on month/year. Your mail client should be configured to leave the messages on the server (not attempt to download via IMAP). This somewhat future-proofs migration to different mail clients.
The only issue is that IMAP searches are out of the question, so you will need to do searches offline with a full text indexing/search application to first find the general folder location of the message you are seeking.
If your computer has lots of memory, then why not just use grep and write a small shell script to forward the message from the archival file to your inbox so that formatting, etc. is preserved? If you're doing lots of searches, the disk cache will keep most of it in RAM even if it's a few GB.
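For anyone who wants the grep approach but with hits reported per message instead of per file, a rough Python equivalent using the standard `mailbox` module might look like this. The function name `grep_mbox` is invented for the sketch. Like grep, it matches the raw stored bytes, so text inside base64 or quoted-printable parts will be missed.

```python
import mailbox


def grep_mbox(mbox_path, term):
    """Yield (key, subject) for each message whose raw text contains term."""
    needle = term.lower().encode("utf-8")
    box = mailbox.mbox(mbox_path)
    try:
        for key in box.iterkeys():
            # Case-insensitive match on the undecoded message bytes.
            raw = box.get_bytes(key)
            if needle in raw.lower():
                subject = box.get_message(key).get("Subject", "")
                yield key, str(subject)
    finally:
        box.close()
```

Something like `for key, subject in grep_mbox("archive-2001.mbox", "quotation"): print(key, subject)` would list the hits (the file name is just an example); the key can then be used to pull the full message out and copy it to wherever your client can open it.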
I have been archiving my mails for the past 10 years. My method has been to download the mails in mbox format once a year and use a combination of mairix to search through the mails and either mutt or thunderbird to see the actual mails.
At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (approximately 1K). We set up an Ubuntu server with postfix and dovecot IMAP over SSL, using Maildir.
Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:
INBOX
|- YYYY
   |- MM
      |- DD
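The post doesn't say how messages get sorted into those folders, so purely as an illustration: one way to derive a folder name like that from a message's Date header, in Python. The function name `archive_folder`, the "/" separator and the "undated" fallback are all assumptions of this sketch, not part of the setup described.

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, root="INBOX"):
    """Map an email Date header to a root/YYYY/MM/DD folder name."""
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Missing or unparseable dates go to a catch-all folder (assumption).
        return f"{root}/undated"
    # Uses the date as written by the sender, without converting time zones.
    return f"{root}/{when:%Y}/{when:%m}/{when:%d}"
```

For a compliance archive you might prefer the delivery time recorded by your own server over the sender's Date header, since the latter can be wrong or forged.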
The auditors use Evolution to connect to the archive server and search the emails. It takes a little while to load a day of emails for the first time, but once it's properly loaded, searching is really fast. The server is not that powerful: it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage though.
Hope this helps.
--I need a beer, nice and cold...
I migrated all my old personal emails to gmail using IMAP. You can use this to migrate between different on-disk formats like maildir, mbox and pst. I had all my email in yahoo and pulled it down using POP to a maildir, then used an IMAP mail client to copy it across to gmail. Then I regularly back them up from gmail to an on-disk maildir format using mbsync. I picked maildir because it's open and seemed better designed than the alternative, mbox. It's not completely standardized though. I've seen PSTs become corrupt so I try and stay away.
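If you'd rather convert between the on-disk formats directly instead of round-tripping through an IMAP client, Python's standard `mailbox` module can read and write both maildir and mbox. A minimal sketch of going from a maildir to a single mbox file (the function name `maildir_to_mbox` is made up for the example, and PST is not covered by this module):

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Append every message in a Maildir to an mbox file; return the count."""
    source = mailbox.Maildir(maildir_path, create=False)
    dest = mailbox.mbox(mbox_path)
    copied = 0
    dest.lock()
    try:
        for msg in source:
            # mboxMessage() carries over flags such as "read" where it can.
            dest.add(mailbox.mboxMessage(msg))
            copied += 1
        dest.flush()
    finally:
        dest.unlock()
        dest.close()
        source.close()
    return copied
```

Only the top-level maildir is copied here; subfolders would need a loop over `source.list_folders()`. With hundreds of thousands of messages, this is still likely to be quicker than dragging folders around in a GUI client.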
stay frosty and alert
I have to say that PSTs can be convenient. However, I have seen many corrupted PSTs over the years, and yes I know that there are tools to fix this, but the name of the game here is to actually get your emails out with a minimum of fuss. Also, as to compatibility, I know MS has arbitrarily changed the format of Word. There is nothing to stop them from doing the same to the PST format, and there are several versions of that in existence now. Add this to the fact that as the PSTs get bigger, performance drops off. As a really easy expedient solution, using PSTs will work, but not well. As a long-term solution to the problem, however, I think they will only compound the issues.
I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.
Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.
It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...
DO NOT DELETE YOUR ARCHIVES, EVER!***
*** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.
http://slashdot.org/~GuyFawkes/journal
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important.
While the value you place on being able to retrieve critical pieces of information may be valid, your choice of storage medium is not. An email system is not a file server or database. Most index poorly, if at all, making searches horribly inefficient. And as has already been observed, it may be quite likely that those same things you value will be more than offset by their value to a hostile litigant.
just because I can.
That's a big assumption. You are asking Slashdot, so I'm thinking you can't. Especially because IMAP never occurred to you.
What about the privacy of those you correspond with? If they send an email to a gmail account that is one thing, but you are unilaterally deciding to have them participate in the targeted advertising profiling.
The only sane alternatives are, as far as I'm concerned:
Maildir isn't quite as well supported as mbox, but I suppose it's sometimes more convenient to grep these since you get a hit on the particular mail you're searching for, not the mbox file which contains that mail and a thousand others.
I use gzipped mbox files. One thing I have considered doing is to convert away Quoted-Printable MIME encoding and use Latin 1 (or UTF-8) everywhere. That would make the mboxes easier to use with standard tools like text editors and grep.
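For the Quoted-Printable idea, Python's standard `email` package will undo the transfer encoding and the charset in one go. A small sketch, assuming you already have one message's raw bytes in hand (for example from a decompressed mbox); the name `readable_body` is invented here:

```python
import email
from email import policy


def readable_body(raw_message):
    """Return the text/plain body of a raw message as a decoded str."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    # get_content() reverses quoted-printable/base64 and applies the charset.
    return part.get_content() if part is not None else ""
```

Calling `readable_body(raw).encode("utf-8")` gives bytes you could write back out as UTF-8. Note that rewriting the stored messages this way changes them from what was actually sent, so it may be safer to keep the originals and grep a decoded copy.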
I would never use a database for this. It serves no purpose, except as an invitation for the fuckup fairy. The searches you'd want to do are free-text searches anyway.
Say what?
It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem. For e-mail, you can store it all on one small (these days) hard disk placed in a drawer somewhere, with space to spare -- even with all the spam! And the process of figuring out how to better organize it and archive it going forward will be a useful learning exercise that might have applications elsewhere (e.g., at work, where people might be asking exactly the same question).
It's no worse than deciding to tidy up your office or study area and figuring out a system to better keep track of things so you can find them later.
I mean, heck, the President of the United States had the same fricking problem: how to properly archive e-mail, a problem discussed here numerous times. As a common problem -- personally and in business -- listening to other people's solutions before digging into it yourself is an efficient way to deal with it.
Huh? How does a server help with a local archive of emails? Do any of these servers help with importing emails (pre-mbox ARPANET emails for instance, or dbx emails for a more modern example)? Does it provide fast searching (including .doc and .ppt awareness)? This may be a storage approach but does not begin to deal with the question raised.
It's obvious, upload them to gmail!
(only half kidding)
Flappinbooger isn't my real name
Computers, hard drives, backups, electricity, rack space, and maintenance are all free! Fuck! Tell me where you shop for this stuff.
Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.
Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.
I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.
Watch for Penguins, they eat Apples and throw rocks at Windows.