Best Way To Archive Emails For Later Searching?
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
Alphabetically!
Great minds think alike; fools seldom differ.
An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.
Time to delete them all
I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...
You are a hostile lawyer's fantasy come true. ;-)
See subject.
This is slashdot. We save computers older than your dad just to use them as alarm clocks. Please leave.
It isn't particularly platform independent (because no one is paying much attention to Windows), but Not Much offers threads and full text search:
http://notmuchmail.org/
Nerd rage is the funniest rage.
Print then scan
rewriting history since 2109
While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
MailSteward on the Mac.
SQL database. Good, Inexpensive, works w/many tens of thousands of emails & more.
http://mailsteward.com/
If you want an "email format" why not mbox? Many things currently support that as an import option.
If you want a database, why not SQLite? It's about as open as can be, backwards compatibility is almost a religion and should have no problem with hundreds of thousands of entries.
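For anyone curious what the SQLite route might look like, here is a minimal sketch using Python's standard `sqlite3` and `email` modules. The table layout and the function names (`open_archive`, `add_message`, `search`) are invented for illustration, not any established schema, and the search is a plain `LIKE` scan over the raw message text rather than a real full-text index.

```python
import sqlite3
from email import message_from_string


def open_archive(path=":memory:"):
    """Open (or create) a one-table archive. The schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " message_id TEXT, sender TEXT, subject TEXT, date TEXT,"
        " raw TEXT NOT NULL)"
    )
    return conn


def add_message(conn, raw):
    """Store the raw message plus a few headers pulled out for convenience."""
    msg = message_from_string(raw)
    conn.execute(
        "INSERT INTO messages (message_id, sender, subject, date, raw)"
        " VALUES (?, ?, ?, ?, ?)",
        (msg["Message-ID"], msg["From"], msg["Subject"], msg["Date"], raw),
    )
    conn.commit()


def search(conn, term):
    """Return (sender, subject) for messages whose raw text contains term.

    SQLite's LIKE is case-insensitive for ASCII by default; this is a full
    table scan, so it is only a starting point for a large archive.
    """
    rows = conn.execute(
        "SELECT sender, subject FROM messages WHERE raw LIKE ? ORDER BY id",
        ("%" + term + "%",),
    )
    return rows.fetchall()
```

Keeping the untouched raw message in its own column means the archive can always be exported back to mbox or Maildir later, whatever happens to the extra columns.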
I use mbox format files and grep.
IMO, one can't get much more portable than that.
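If grep isn't handy on one of your platforms, roughly the same thing can be done with Python's standard `mailbox` module. This is only a sketch: `search_mbox` is a made-up helper name, and like grep it matches against the raw message bytes, so text hidden inside base64-encoded parts will not be found.

```python
import mailbox
from email import message_from_bytes


def search_mbox(path, needle):
    """Return (Date, Subject) for each message whose raw bytes contain needle.

    The comparison is case-insensitive and works on the undecoded message,
    much like `grep -i` over the mbox file would.
    """
    needle_b = needle.lower().encode("utf-8")
    hits = []
    box = mailbox.mbox(path, create=False)  # raises if the file is missing
    try:
        for key in box.keys():
            raw = box.get_bytes(key)
            if needle_b in raw.lower():
                msg = message_from_bytes(raw)
                hits.append((msg["Date"], msg["Subject"]))
    finally:
        box.close()
    return hits
```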
Maildir.
And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread
I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
If this is really important to you, and you want it all to work across multiple workstations/OSes, your best bet will be to store it all in IMAP. If you have the means and motivation to run this yourself, I would recommend Dovecot. If you don't have the means and motivation, then you can use a service like Gmail to run your IMAP although you give up certain freedoms in doing so. For example, I use Dovecot coupled with Maildir++ as the physical storage format - as a result I can (if I wanted to) change to any email client I wish very quickly, use different email clients at the same time, etc.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal, is either being a troll, or not understanding how the world works when you're not just a drone.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Maildir storage format is resistant to bit-rot because it stores each message in a separate file, and uses filesystem directories for mail folders. It's widely supported by user agents (mail readers) and IMAP/POP3/SMTP servers, so you'll never be stranded by the actions of a single software vendor. Finally, it's easily searched using everyday unix tools - find, grep, sed, awk, etc., and you can use the full-text search engine of your choice for speedy searches.
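One thing plain grep misses is text inside base64 or quoted-printable parts. As a rough sketch of a decoded search over a Maildir, assuming Python's standard `mailbox` and `email` modules (the helper names `body_text` and `search_maildir` are invented here, and only text parts are searched, not headers or attachments):

```python
import mailbox
from email import policy
from email.parser import BytesParser


def body_text(raw):
    """Return the decoded text of every text/* part in a raw message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            try:
                parts.append(part.get_content())
            except (LookupError, UnicodeError):
                continue  # unknown or broken charset: skip this part
    return "\n".join(parts)


def search_maildir(path, needle):
    """Return the Maildir keys of messages whose decoded text contains needle."""
    needle = needle.lower()
    box = mailbox.Maildir(path, create=False)  # raises if the directory is missing
    hits = []
    try:
        for key in box.keys():
            if needle in body_text(box.get_bytes(key)).lower():
                hits.append(key)
    finally:
        box.close()
    return hits
```

This reads every message on every search, so for a few hundred thousand emails a proper index is still the better tool; the point is only that one file per message makes this kind of scripting easy.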
I never thought of turning an ancient host into an alarm clock.
Once however, I did hollow out an SGI case and turn it into a refrigerator.
The case was just too damned pretty to throw away.
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
citadel at www.citadel.org is a full pop3/imap server with full-text indexing.
Thunderbird can use server-side searches to find messages, and I find that works pretty well.
blog.sam.liddicott.com
And now the poster becomes an advertiser's dream come true in addition to being a hostile lawyer's dream come true. ;-)
Remember that from Google's perspective gmail is a tool to better profile you for targeted advertising. Make sure you are OK with that before giving them access to all your emails.
Starting with GMail I have kept every e-mail since 6/22/2004. I also brought over many e-mails I had in my saved folders from long before that. Am I insane? No. I have found this archive incredibly useful for any variety of uses even 6 years later.
Nothing like having your wife ask, "man, I wish we still had the recipe for deviled eggs we made in college. Too bad it was back in 2001." "No problem honey, hold."
Pulled that out a couple weeks ago for a picnic. Yum yum!! was right.
I recommend mbox (MBX) format.
1. The format is text based and not likely to become unreadable anytime in the foreseeable future.
2. There is no shortage of tools for manipulating mbox.
3. It's easily indexed by full text search applications (MS Search included with Windows)
The Outlook tools save dialogue has an Apple export option which is actually the mbox format.
In terms of archival access I recommend an IMAP server with a folder hierarchy based on month/year. Your mail client should be configured to leave the messages on the server (not attempt to download via IMAP). This somewhat future-proofs migration to different mail clients.
The only issue is that IMAP searches are out of the question, so you will need to do searches offline with a full text indexing/search application to first find the general folder location of the message you are seeking.
If your computer has lots of memory then why not just use grep and write a small shell script to forward the message from the archival file to your inbox so that formatting, etc. is preserved? If you're doing lots of searches, the disk cache will hold most of it in RAM even if it's a few GB.
At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (about 1,000). We set up an Ubuntu server with Postfix and Dovecot IMAP over SSL, using Maildir.
Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:
INBOX
|- YYYY
   |- MM
      |- DD
The auditors use Evolution to connect to the archive server and search the emails. It takes a little while to load a day of emails for the first time, but once it's properly loaded, searching is really fast. The server is not that powerful: it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage, though.
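For what it's worth, a folder name in that INBOX/YYYY/MM/DD style can be derived straight from each message's Date header. A minimal sketch in Python; the function name `archive_folder` and the "/" separator are made up here for illustration, and the real hierarchy delimiter depends on how the IMAP server is configured.

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, root="INBOX", sep="/"):
    """Map an RFC 2822 Date header to a root/YYYY/MM/DD folder name.

    The date is taken in the sender's own UTC offset; an unparseable
    header raises an exception rather than guessing a folder.
    """
    when = parsedate_to_datetime(date_header)
    return sep.join([root, f"{when:%Y}", f"{when:%m}", f"{when:%d}"])
```

For example, `archive_folder("Tue, 22 Jun 2004 09:15:00 -0400")` gives `INBOX/2004/06/22`.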
Hope this helps.
--I need a beer, nice and cold...
Parent is +informative and/or +interesting, not troll. Fucking brain dead moderators these days. Sheesh.
it suggested a linux solution and made the windows weenies realize how useless their os is. by extension they realized how tiny their penises are and then they finally understood why they like Micro Soft because it describes them perfectly. so they got mad and said "i'll mod it down, yeah, that'll teach them a lesson and make me feel like a real man again!"
KMail has an excellent .pst converter that will pull out your old Outlook mail. Once you have it in KMail, you can drag and drop it into any of the supported formats: mbox, maildir, etc. If you have already established filters, you can let them sort things out. If not, you can use a manual search by to, from, mailing list, subject, etc. From there you can run your IMAP.
I carry everything around on my laptop and use kmail instead of using imap. With full drive encryption and xscreensaver, I don't have any worry about losing private information and know that my ISPs have better collections of my email anyway, despite what they say about size limits. I could use Gmail's imap instead of my own but prefer to suck my gmail out with kmail's imap support. Until US networks get more reasonable, I want my mail with me instead of on my own server and I would not advise anyone to leave their mail on someone else's server without having a copy yourself.
Because your question is all about search, I have to plug Kmail again. With proper organization of your mail into subfolders for friends, family, lists, companies and projects, mail searches are quick, even on modest hardware like my ancient PIII laptop. Searching everything takes a little longer, but it is not such a burden. Evolution may do as well but something about Gnome turns me off. The only downside is that the 3.5 branch does not seem to be able to search through encrypted mail but I imagine there's some gpg-agent fix for that I'm not aware of.
Friends don't help friends install M$ junk.
I migrated all my old personal emails to Gmail using IMAP. You can use this to migrate between different on-disk formats like maildir, mbox and PST. I had all my email in Yahoo and pulled it down using POP to a maildir, then used an IMAP mail client to copy it across to Gmail. Then I regularly back them up from Gmail to an on-disk maildir format using mbsync. I picked maildir because it's open and seemed better designed than the alternative, mbox. It's not completely standardized though. I've seen PSTs become corrupt so I try to stay away.
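If the mail is already sitting in mbox files, the mbox-to-maildir step doesn't strictly need a mail client in the middle. Here is a minimal sketch with Python's standard `mailbox` module; `mbox_to_maildir` is a made-up helper name, and it copies the raw bytes as they are, so read/replied flags are not carried over and any ">From " quoting in bodies is left untouched.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)  # raises if the mbox is missing
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for key in src.keys():
            # get_bytes() returns the message without the mbox "From " line.
            dst.add(src.get_bytes(key))
            copied += 1
    finally:
        dst.close()
        src.close()
    return copied
```

The source file is only read, never modified, so it can be kept around until the new copy has been checked.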
stay frosty and alert
What do they say?
:(".
June 2001 - "Dave, can't go out tonight. I got a date with that fat chick.YEAH!"
Sept 2001 - "Dave, she's told me she's pregnant."
Jan 2002 - "Dave, will you be the best man at the wedding?"
Shhhh - Dave's the real father (AC doesn't know)..
I have to say that PSTs can be convenient. However, I have seen many corrupted PSTs over the years, and yes, I know that there are tools to fix this, but the name of the game here is to actually get your emails out with a minimum of fuss. Also, as to compatibility, I know MS has arbitrarily changed the format of Word. There is nothing to stop them from doing the same to the PST format, and there are several versions of that in existence now. Add to this the fact that as the PSTs get bigger, performance drops off. As a really easy, expedient solution, using PSTs will work, but not well. As a long-term solution to the problem, however, I think they will only compound the issues.
I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.
Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.
It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...
DO NOT DELETE YOUR ARCHIVES, EVER!***
*** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.
http://slashdot.org/~GuyFawkes/journal
...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).
El_Muerte_TDS has just pointed me towards mairix, a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.
Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here.
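The symlink trick is simple enough to sketch in a few lines. To be clear, this is not mairix's code, just an illustration of the idea in Python: given a list of message files that some search has matched, build a throwaway Maildir whose entries are symlinks back to the originals. The name `build_virtual_maildir` is made up, and it assumes a filesystem that allows symlinks.

```python
import os
from pathlib import Path


def build_virtual_maildir(message_files, results_dir):
    """Fill results_dir/cur with symlinks to the given message files.

    Returns the number of links created. Links from an earlier search are
    removed first, so the directory always reflects the latest results.
    """
    results = Path(results_dir)
    for sub in ("cur", "new", "tmp"):
        (results / sub).mkdir(parents=True, exist_ok=True)
    for old in (results / "cur").iterdir():
        if old.is_symlink():
            old.unlink()
    count = 0
    for src in message_files:
        src = Path(src).resolve()
        # Entries in cur/ conventionally carry a ":2,<flags>" suffix.
        name = src.name if ":2," in src.name else src.name + ":2,"
        os.symlink(src, results / "cur" / name)
        count += 1
    return count
```

Because the result is an ordinary Maildir, any client or IMAP server that can read one can show the search hits as a folder without copying a single message.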
Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.
Moderation Total: -1 Troll, +3 Goat
Yes, it is not free, and yes, this suggestion will bring out the trolls, but you might want to consider Lotus Notes/Domino. It is ~$140 for the system, and ~$40 a year maintenance (Includes all upgrades) cost per user, but IBM isn't going anywhere any time soon.
It has good full text indexing, you can keep your mail on a client, and on the server, with incredibly flexible replication rules for what is stored where.
It supports IMAP, so it talks well to most clients.
The iPhone syncs seamlessly with it via ActiveSync, and an Android client is in beta as we speak.
It includes an http client, and the http client even offers offline access. That's right. You can use the http client, and still read your mail and write emails that will be sent the next time you make a connection.
It also has folders, but you can put any email into as many folders as you want, so you have the best of both Outlook folders and Gmail tags.
It supports auto-processing rules for automatic filing of data, as well as being a full development environment if you want to get really fancy.
It is brain dead easy to set up and maintain.
The server runs on Linux and Windows, and the client runs on Linux, Windows and Mac.
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important.
While the value you place on being able to retrieve critical pieces of information may be valid, your choice of storage medium is not. An email system is not a file server or database. Most index poorly, if at all, making searches horribly inefficient. And as has already been observed, it may be quite likely that those same things you value will be more than offset by their value to a hostile litigant.
just because I can.
That's a big assumption. You are asking slashdot, so I'm thinking you can't. Especially because imap never occurred to you.
What about the privacy of those you correspond with? If they send an email to a gmail account that is one thing, but you are unilaterally deciding to have them participate in the targeted advertising profiling.
I've been using hMailServer for a few years now, and it's free. It's an ISP solution, but has some great IMAP facilities like shared storage. I have over 120GB of IMAP data and it's not twitching. It has a MySQL Lite backend, and capabilities for web, POP, IMAP, AD integration, etc. Would recommend it.
When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal is either being a troll, or not understanding how the world works when you're not just a drone.
This sort of behavior is odd and not normal. If you want to keep your email, then that's fine, but thinking that it's "vitally important" is odd and I think without question points to some "OCD with some component of Asperger's". If you don't then maybe you need to re-evaluate.
I am however interested in how you pull demographic analysis out of emails? I mean, hopefully you're not suggesting that you go and chomp on the text to pull out fields of data?
So on the one hand, you think my saving email for later access and analysis is not useful, but then, you want to know why it is useful?
I run a research laboratory where we do two things, one is work on restoring sight to the blind, the other is to organize a conference every two years. The primary demographic analysis I need to do is to analyze the country-of-origin for email traffic pertinent to the conference. This has helped to raise many tens of thousands of dollars of support for the conference by demonstrating various aspects of the global attendance to funding agencies.
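As a rough illustration only (and not necessarily how it is done in my lab), one crude proxy for country of origin is the top-level domain of each sender's address. The sketch below assumes Python and a list of From headers already pulled out of the archive; `tally_sender_tlds` is a made-up name, and generic domains like .com or .org obviously say nothing about location.

```python
from collections import Counter
from email.utils import parseaddr


def tally_sender_tlds(from_headers):
    """Count messages per top-level domain of the sender address."""
    counts = Counter()
    for header in from_headers:
        _, addr = parseaddr(header)
        domain = addr.rpartition("@")[2]
        if "." in domain:  # skip headers with no usable domain
            counts[domain.rsplit(".", 1)[1].lower()] += 1
    return counts
```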
Being able to access my email and locate attachments, review discussions, find references, remember addresses, etc., in other words, to recall what someone once wrote to me, has resulted in millions of dollars of grant money to fund my research. Without the ability to review email that is, at times, years old, that would not be possible. Having rich access to my email stream has allowed me to fund my lab, and therefore feed and house my family and the people who work for me, publish high-impact papers, receive numerous awards, get coverage in the international press, etc., or, put better, to run the daily business of a research lab at a high-profile university. While the tools I use are good, they leave a lot to be desired, and having a better system would make me more productive.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea,
I think that GMail could be the panacea here. I mean, if you're just trying to make sure it lasts and you can search it with ease, then GMail can do it better than you can.
I dislike GMail for my professional correspondence for a number of reasons: (1) it does not allow me to readily use my university affiliation address (and since that's a top university, that makes a difference whether people like it or not), (2) I do not have ownership of my email, (3) the lack of a good filing / archiving interface makes it hard to associate different threads together, or to limit searches (I intensely dislike the tagging feature), (4) GMail has only a rudimentary ability to edit text since it's browser-based.
I do use GMail for my personal correspondence, but that's mostly because it's the best of a bunch of poor, but free, services. It does have the best searching features, but falls down in a lot of other ways. It also would be against my employer's policies to store HIPAA-regulated email offsite. So GMail is not a panacea. Thanks for the suggestion, though.
It's obvious, upload them to gmail!
(only half kidding)
Flappinbooger isn't my real name
Computers, hard drives, backups, electricity, rack space, and maintenance are all free! Fuck! Tell me where you shop for this stuff.
As anyone who actually uses IMAP can tell you, it bogs down quickly on large mailboxes, violating the poster's requirement b)
Not true. Not absolutely false, either. IMAP is an access protocol, not a storage or indexing mechanism, and there is nothing inherent in IMAP that dooms it to be slow in handling large mailboxes. Different combinations of client and server, configurations, and mailbox content and usage can make huge differences in performance. Tens of thousands of messages in a single IMAP folder on a memory-lean server that uses Maildir storage on a UFS or ext2 filesystem with atimes enabled is going to suck horribly, especially with a client that doesn't cache heavily or maintain its own indices. Make that an mbox, and it will work great until you start trying to change it every couple of seconds.
Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.
Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.
I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.
Watch for Penguins, they eat Apples and throw rocks at Windows.
Do you know what IMAP is? He's going to need some central storage, but the mail access is platform independent. Yes, if he wants the IMAP server to be his own, then he'll have to pick one OS to serve from, but every non-shit mail application, from desktop to mobile, has IMAP support, and it hands down gives him what he wants if he takes the time to organize and set it all up.