Best Way To Archive Emails For Later Searching?
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
Alphabetically!
Great minds think alike; fools seldom differ.
An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.
Time to delete them all
I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...
You are a hostile lawyer's fantasy come true. ;-)
You say that like it's a bad thing. People with OCD are great at giving oral sex.
See subject.
This is slashdot. We save computers older than your dad just to use them as alarm clocks. Please leave.
gmail
It isn't particularly platform independent (because no one is paying much attention to Windows), but notmuch offers threads and full text search:
http://notmuchmail.org/
Nerd rage is the funniest rage.
Print then scan
rewriting history since 2109
While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
MailSteward on the Mac.
SQL database. Good, Inexpensive, works w/many tens of thousands of emails & more.
http://mailsteward.com/
If you want an "email format" why not mbox? Many things currently support that as an import option.
If you want a database, why not SQLite? It's about as open as can be, backwards compatibility is almost a religion and should have no problem with hundreds of thousands of entries.
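For what it's worth, here is a minimal sketch of the SQLite idea using Python's standard sqlite3 module. The table layout and the function names (open_archive, add_message, search) are my own assumptions, not anything the tools above prescribe. It keeps the untouched message text in one column so the original can always be pulled back out, and searches with a plain substring match.

```python
import sqlite3
from email import message_from_string


def open_archive(path=":memory:"):
    """Open (or create) the archive database; this schema is only illustrative."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " sender TEXT, subject TEXT, date TEXT,"
        " raw TEXT NOT NULL)"
    )
    return db


def add_message(db, raw):
    """Store the untouched message text plus a few headers for quick listing."""
    msg = message_from_string(raw)
    db.execute(
        "INSERT INTO messages (sender, subject, date, raw) VALUES (?, ?, ?, ?)",
        (msg["From"], msg["Subject"], msg["Date"], raw),
    )
    db.commit()


def search(db, term):
    """Return subjects of messages whose raw text contains term (substring match)."""
    rows = db.execute(
        "SELECT subject FROM messages WHERE raw LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


# Two made-up messages, just to exercise the functions.
db = open_archive()
add_message(db, "From: a@example.org\nSubject: Deviled eggs\n\nThe recipe uses paprika.\n")
add_message(db, "From: b@example.org\nSubject: Lunch\n\nWhere do we go today?\n")
print(search(db, "paprika"))
```

A LIKE query scans every row, so with hundreds of thousands of messages you would probably want a real full-text index on top (SQLite has full-text search extensions, though whether your build includes them is something to check).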
Migrate it all to Gmail. With Gmail you've got room for your couple of GB, and the search feature works like a charm. The only thing missing is "folders" to make it act like you are used to.
In order to form an immaculate member of a flock of sheep one must, above all, be a sheep.
I use mbox format files and grep.
IMO, one can't get much more portable than that.
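If grep ever feels too blunt, roughly the same thing can be sketched with Python's standard mailbox module. The helper names (build_demo_mbox, grep_mbox) and the sample messages are made up for illustration. Like grep, it only matches text as stored, so base64-encoded bodies would not be found this way.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def build_demo_mbox(path):
    """Create a tiny mbox with two made-up messages so the search has input."""
    box = mailbox.mbox(path)
    for subject, body in [
        ("Lunch", "Where do we go for lunch today?"),
        ("Quote", "The quotation is dated 1 June 2000."),
    ]:
        msg = EmailMessage()
        msg["From"] = "sender@example.org"
        msg["To"] = "me@example.org"
        msg["Subject"] = subject
        msg.set_content(body)
        box.add(msg)
    box.close()  # close() flushes pending writes to disk


def grep_mbox(path, needle):
    """Return subjects of messages whose headers or body contain needle."""
    needle = needle.lower()
    box = mailbox.mbox(path, create=False)
    try:
        return [m["Subject"] for m in box if needle in m.as_string().lower()]
    finally:
        box.close()


demo_path = os.path.join(tempfile.mkdtemp(), "archive.mbox")
build_demo_mbox(demo_path)
print(grep_mbox(demo_path, "quotation"))
```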
Gmail is the only mail service I know of that was designed from the ground up for easy searching and tagging (with Labels) your mail.
Abandon trying to do this with an email client app's archive. It is doubtful they are designed or tested with this amount of data in mind. Maybe you could set up your own email server with a web front end. Or perhaps the best route would be to use MySQL or some other database and build a web front end for browsing, searching, etc.
Why do you hate yourself?
I can advise a Linux server with Courier-imap. It's easy to centrally store your mail, and as long as it's on the internet you can reach it. Even from work, with friends, or on vacation.
It's not really fast in my experience, but not terribly slow.
And you can save things in Maildir format, which is universally supported. And it's easy to backup with some scripts.
Well, don't worry about that. We can get you back before you leave. (Dr. Who)
Use Gmail like a normal person, not your requirements but close enough [insert solution for offline Gmail backup here because Google are proprietary and evil]
http://notmuchmail.org/
Maildir.
And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread
I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
Um, the same way people have been doing it since email was invented, text files (with base64 for those binary bits'n'pieces). Only way to be sure.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
You, sir, are a jerk! I suspect you have low self-esteem with some component of hemorrhoids that is making you have this fixation on being rude.
If this is really important to you, and you want it all to work across multiple workstations/OSes, your best bet will be to store it all in IMAP. If you have the means and motivation to run this yourself, I would recommend Dovecot. If you don't have the means and motivation, then you can use a service like Gmail to run your IMAP although you give up certain freedoms in doing so. For example, I use Dovecot coupled with Maildir++ as the physical storage format - as a result I can (if I wanted to) change to any email client I wish very quickly, use different email clients at the same time, etc.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to review all of the correspondence after the fact for demographic analysis, often two years after the event when the final reports are being written. Saying that this sort of behavior is odd or not normal is either being a troll, or not understanding how the world works when you're not just a drone.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Maildir storage format is resistant to bit-rot because it stores each message in a separate file, and uses filesystem directories for mail folders. It's widely supported by user agents (mail readers) and IMAP/POP3/SMTP servers, so you'll never be stranded by the actions of a single software vendor. Finally, it's easily searched using everyday unix tools - find, grep, sed, awk, etc., and you can use the full-text search engine of your choice for speedy searches.
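To make the one-file-per-message layout concrete, here is a small sketch with Python's standard mailbox module. The folder name "2009", the sample messages and the grep_maildir helper are all invented for the example. It creates a Maildir with one subfolder, adds a message to each, then searches by reading the files directly, much as grep would.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def grep_maildir(root, needle):
    """Return paths of message files under root that contain needle (bytes)."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        # Delivered messages live only in the cur/ and new/ subdirectories.
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                if needle in fh.read():
                    hits.append(path)
    return sorted(hits)


root = os.path.join(tempfile.mkdtemp(), "Maildir")
inbox = mailbox.Maildir(root, create=True)
year_2009 = inbox.add_folder("2009")  # stored on disk as the directory ".2009"

for box, subject in [(inbox, "Picnic plans"), (year_2009, "Donna's quotation")]:
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["Subject"] = subject
    msg.set_content("See attached.")
    box.add(msg)

matches = grep_maildir(root, b"quotation")
print(len(matches), os.path.relpath(matches[0], root))
```

Reading every file on each search gets slow as the archive grows, which is where a separate full-text indexer comes in.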
I would use a computer older than your dad just to use as an alarm clock, but I just can't help upgrading.
I never thought of turning an ancient host into an alarm clock.
Once however, I did hollow out an SGI case and turn it into a refrigerator.
The case was just too damned pretty to throw away.
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
citadel at www.citadel.org is a full pop3/imap server with full-text indexing.
Thunderbird can use server-side searches to find messages, and I find that works pretty well.
blog.sam.liddicott.com
Have you looked at Archiveopteryx? That is one potential solution to the storage side of the problem. It stores the messages into a PostgreSQL database with minimal tinkering, so you can always get the original plain text stuff back out again. Consider it a database of mbox files that exposes an IMAP interface. You can't get any less proprietary than Postgres, and you can scale up many of its operations using standard database approaches in that area.
What I would do here is store messages there as my permanent store for them, dump periodically to full plain-text backups just for disaster recovery, then experiment with search software that runs on top of it using IMAP as the transport. There I don't have any specific advice. Ultimately it should be possible to extend Archiveopteryx to handle that too--PostgreSQL has decent full-text search built in--but I don't know of anybody working on that.
Probably easier to break this into two pieces, get a robust solution for the storage side, and then see what clients have search capabilities you like that won't choke on importing your data.
And now the poster becomes an advertiser's dream come true in addition to being a hostile lawyer's dream come true. ;-)
Remember that from Google's perspective gmail is a tool to better profile you for targeted advertising. Make sure you are OK with that before giving them access to all your emails.
Use a suitable IMAP server with an appropriate storage backend to store all that email. No matter the backend storage the daemon you choose uses, your email will always be accessible in an open, standard protocol by any (many) IMAP-enabled mail clients!
Hate to break it here, but since 1990 I've been storing *all* my mail (and calendar and SMSes) in a plain old Outlook PST archive file. It is a fairly good and flexible database format with lots of import/export and search options. Future compatibility is well guaranteed. To keep it snappy, I've been systematically removing big attachments (documents and pictures), possibly replacing them with a textual reference to where they are stored elsewhere on disk. I know, I know, low tech and the Borg, but future proof for now :-).
You can laugh, but it's almost good enough for what I need.
All my archived email (93-2004) was copied to a NAS as individual messages (still have the Cyrus directory structure). It's the more recent stuff that lives in PSTs that is the problem.
One day I'll get around to doing the same for my news postings. That's where the nuggets of interest are.
I'll chime in with my own solution. My archive is not as extensive as yours, but I have most everything from 2005 or so (excepting mailing lists, other junk, etc.). My solution is sort of silly: I just use Apple's Mail.app. The reason is that Mail.app lets you store and organize everything in separate folders, and Spotlight is blazingly fast and does a great job of searching. I try to keep each folder to a few thousand messages; for my e-mail load I find that breaking up the folders by year works well (yes, you can still search across years). The folders themselves are stored under ~/Library/Mail/Mailboxes. Each folder has its own directory and a series of .emlx files, an Apple-specific format (the plain message text plus some XML metadata) with one message per file. The problem with this solution is that the emlx files are proprietary and subject to change. That said, I have successfully managed to copy mailboxes to new computers with a new OS, though it did require an extra step or two beyond just copying my Mailboxes directories to the new computer. Worst case, the emlx files are plain text, so you can grep through them if you really have to (e.g. if you're logged onto the computer remotely), or you could write a script that parses most of the information from the file.
Gentlemen! You can't fight in here, this is the war room!
Na, he's probably a lawyer.
That's right, I'm looking at you Mr. "I've got a 22GB mailbox on the new Exchange 2007 system". Quotas, learn em, love em, use em!
Life is not for the lazy.
In your will, donate your archive to science. I'm sure it would make an interesting thesis project for some PhD candidates out there. I'm serious, consider this.
There's one method I've used fairly often in the past for getting mail out of an older client, provided the older client supports IMAP (lookout and lookout express do).
First, set up a new account on your IMAP server just for archival purposes (you can set up an IMAP server on any UNIX/Linux distro, and even on Windows with Cygwin, fairly easily; dovecot is a good place to start). Make sure it's using either mbox or maildir (preferred).
Second, set up said account on all the mail clients you'd like to archive. Make sure you are setting them up as IMAP and not POP3.
Third, drag the contents of each local folder/inbox/etc. to a folder on the archive-specific IMAP account. It will take a while, but the entire contents of your mailbox will be copied over, message by message, in IMAP's way of doing things, then deposited by the IMAP server into the local format of your choice.
You've just created flat text versions of client-specific archives. Create folders, subfolders, etc. and organize things in your modern client, which can easily do IMAP. You can search with any of numerous free packages, archive and compress permanently with squashfs, or even just leave them available through IMAP to search with the new Thunderbird's (3.1) global indexer.
Brielle
You should put all that stuff on an IMAP server on your home network (preferably a box you can reach from the outside using DDNS or a static entry if you have your own domain).
In that way your client OS'es can be whatever platform you choose, and they will all be able to access your mail storage.
Put older mails in separate folders.
If you can work with Linux there are plenty of choices. If not, consider Windows Home Server and get a mailserver product for Windows - there are plenty!
Many advanced email clients, such as Outlook or Evolution, will allow you to search for mails based on any criteria you like (subject, sender, body, date, etc). Hmmm except perhaps the actual mail header ;-)
Personally I would never do this though. Generating and saving data is easy; limiting it is hard. Consider deleting stuff: you could start by deleting everything older than 36 months. The more you have to search through, the more difficult it gets. In the end finding a single mail will be (or in your case: IS) like finding a needle in a haystack ...
Also, why save all mails? Every time you reply to a mail a copy of the original mail is often included in your answer. So from today, consider deleting All inbound mails that you reply to ;-)
- Jesper
My security clearance is so high I have to kill myself if I remember I have it...
While this answer will almost certainly not suit the OP, it may be of interest to other folk looking to archive their email. Using python and a combination of imaplib and some basic file I/O you can save the original text of messages. My rationale for this was firstly that it's probably less problematic than converting between various email client formats; and secondly that it's a decent way to learn some python! ;)
My rather basic implementation just dumps every email from an (IMAP) folder sequentially. I rely on grep for searching. However, it does have the prerequisite of the email being stored on a mailserver accessible via IMAP.
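For anyone curious, a dump along those lines might look roughly like the sketch below. This is not the poster's script; the function names, the numbered .eml file naming and the FakeConnection stand-in (there only so the loop can be tried without a server) are my own assumptions. The imaplib calls used are select, search, fetch, login and logout.

```python
import imaplib
import os
import tempfile


def dump_folder(conn, folder, out_dir):
    """Write every message in folder to out_dir as NNNNNN.eml; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    status, _ = conn.select(folder, readonly=True)  # read-only: flags stay untouched
    if status != "OK":
        raise RuntimeError(f"cannot select {folder!r}")
    status, data = conn.search(None, "ALL")
    count = 0
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        # The raw message is the second item of the tuple in the reply.
        raw = next(p[1] for p in parts if isinstance(p, tuple))
        with open(os.path.join(out_dir, f"{int(num):06d}.eml"), "wb") as fh:
            fh.write(raw)
        count += 1
    return count


def dump_from_server(host, user, password, folder, out_dir):
    """Connect over IMAP/SSL and dump one folder (defined but not called here)."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        return dump_folder(conn, folder, out_dir)
    finally:
        conn.logout()


class FakeConnection:
    """Hypothetical stand-in that mimics the shape of imaplib replies for a dry run."""

    def __init__(self, messages):
        self.messages = messages

    def select(self, folder, readonly=False):
        return "OK", [str(len(self.messages)).encode()]

    def search(self, charset, criterion):
        nums = " ".join(str(i + 1) for i in range(len(self.messages)))
        return "OK", [nums.encode()]

    def fetch(self, num, what):
        raw = self.messages[int(num) - 1]
        return "OK", [(b"%d (RFC822 {%d}" % (int(num), len(raw)), raw), b")"]


out_dir = tempfile.mkdtemp()
fake = FakeConnection([b"Subject: a\r\n\r\nhi\r\n", b"Subject: b\r\n\r\nyo\r\n"])
dumped = dump_folder(fake, "INBOX", out_dir)
print(dumped, sorted(os.listdir(out_dir)))
```

Sequence numbers change when messages are deleted, so for repeated incremental dumps you would likely want to key the files on UIDs instead.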
If all you have is a grenade, pretty soon every problem looks like a foxhole -- MightyYar
Scary thought, but you might just want to pick up one of the tools that the lawyers use for electronic discovery. They cover multiple mail formats (including older generations of said formats) and are set up so that it's easy for an intern to search for keywords and the like, so someone who understands tech should be able to use them. I've had to use the Clearwell appliance and it did what it was supposed to do, including finding attachments and indexing them for ease of search. (No, I don't work for Clearwell, and wouldn't have used their tool at all except for t.. er anyways)
This sounds like the perfect time to roll your own software to do what you are looking for. Use a LAMP stack, write or use a few format converters, voila! you're done!
To help spare you the precious keystrokes it would take to Google this yourself, you can go straight to “Google Apps for Businesses” and sign-up. Now did you really have to Ask Slashdot?
Starting with GMail I have kept every e-mail since 6/22/2004. I also brought over many e-mails I had in my saved folders from long before that. Am I insane? No. I have found this archive incredibly useful for any variety of uses even 6 years later.
Nothing like having your wife ask, "man, I wish we still had the recipe for deviled eggs we made in college. Too bad it was back in 2001." "No problem honey, hold."
Pulled that out a couple weeks ago for a picnic. Yum yum!! was right.
Throw all that old stuff away. When you need to do an email search, file a Freedom of Information Act request with the NSA.
Future proof your emails by keeping them in plain text format. Then use third party software to index and search your email collection. I recommend google desktop.
I recommend mbox (MBX) format.
1. The format is text based and not likely to become unreadable anytime in the foreseeable future.
2. There is no shortage of tools for manipulating mbox.
3. It's easily indexed by full text search applications (MS Search, included with Windows).
The Outlook tools save dialog has an Apple export option, which is actually the mbox format.
In terms of archival access I recommend an IMAP server with a folder hierarchy based on month/year. Your mail client should be configured to leave the messages on the server (not attempt to download via IMAP). This somewhat future-proofs migration to different mail clients.
The only issue is that imap searches are out of the question so you will need to do searches offline with a full text indexing/search application to first find the general folder location of the message you are seeking.
If your computer has lots of memory, then why not just use grep and write a small shell script to forward the message from the archival file to your inbox, so that formatting etc. is preserved? If you're doing lots of searches, the disk cache will keep most of it in RAM even if it's a few GB.
I did this myself, going back only 10 years though. It has been invaluable. Gmail gives you 7GB (with a little more every day), and the searching is top notch and instant.
There are several apps out there to import mail into a Gmail account, and it is pretty easy if your email is still available via POP or IMAP (which I'm doubting)... For stuff in a PST file, what I ended up doing was adding the new Gmail account to Outlook, and then dragging and dropping emails 1000 at a time into the new account. (I also did this for a Groupwise mailbox from one old job.) It's slow, but it works. In addition, it tags the mail for you with "Inbox" or "Sent", so you can easily retag it later. Once it is in there, it is a little gold mine to get whatever you need.
I was hoping to read some answer that met my similar requirements. My requirements were for a searchable, portable mail message database. The ability to tag messages is also important. I had high hopes for Mozilla Raindrop, but my last experience with it didn't do anything for me. Here's what I am doing now: I have set up an IMAP server (imapd) on an Ubuntu server. Thunderbird is currently my primary email client. Thunderbird connects to all my various email accounts. When I am ready to archive an email, it gets copied to a folder on my IMAP server. The emails are tagged, and stored in folders by quarter to keep any particular file from getting too large. What I would like is the ability to store them in a searchable database with an open source implementation.
Dear god I'm glad my wife will never see my college emails.
I have been archiving my mails for the past 10 years. My method has been to download the mails in mbox format once a year and use a combination of mairix to search through the mails and either mutt or Thunderbird to see the actual mails.
Use Maildir(s) and Mairix for the search engine.
What use does this have? Isn't this just the digital equivalent of hoarding? Delete all of this, you'll feel better. I delete any email over two weeks old.
Why should Gmail get all the attention?
Summation 2
would now like to merge all the emails back into a single searchable archive — just because I can. But there are a few problems:
...so you can't?
At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (about 1K of them). We set up an Ubuntu server with postfix and dovecot IMAP over SSL, using Maildir.
Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:
INBOX
|- YYYY
   |- MM
      |- DD
The auditors use Evolution to connect to the archive server and search the emails, even though it takes a little while to load a day of emails for the first time, once it's properly loaded searching is really fast. The server is not that powerful, it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage though.
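A layout like that can be derived from each message's Date header. Here is a tiny sketch; the function name archive_folder and the choice of separator are assumptions on my part (the separator your server expects may well be "." rather than "/").

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, root="INBOX", sep="/"):
    """Map an RFC 2822 Date header to a root/YYYY/MM/DD folder name.

    The day is taken as written in the header, in the sender's own UTC offset.
    """
    when = parsedate_to_datetime(date_header)
    return sep.join([root, f"{when.year:04d}", f"{when.month:02d}", f"{when.day:02d}"])


print(archive_folder("Tue, 22 Jun 2004 10:00:00 +0000"))           # INBOX/2004/06/22
print(archive_folder("Fri, 01 Jun 2001 23:30:00 -0700", sep="."))  # INBOX.2001.06.01
```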
Hope this helps.
--I need a beer, nice and cold...
I still use Eudora... 7.1.09 paid mode from years ago... I use XP for my wife's computer and have different Eudora folders based on who is logged in. Works like a champ. The nice thing is I can sort the old emails by sender (for listservs and such) to be put into folders, and then use the find email function to search things. I hardly ever have problems finding an email as long as I know WHO/WHAT I'm looking for and where: body, from, subject, etc. Sadly, no meta tags.. :(
BTW, Mine goes back to.. early 90's also when @ college we used Eudora on Floppies with Windows 3.1 I think... Maybe it was 95 seems so long ago...
--- Relax, that mass murderer is just trying to reduce our carbon footprint, one fetus at a time...
The many comments here about using just IMAP with maildir or mbox storage backends forget to mention that these are all very slow to search when you have thousands of messages. They don't store the files in any kind of disk-seek friendly format. So..
I suggest either putting a dovecot with Maildir++ system on a fast SSD to overcome the poorly organized (on disk) files
-and/or-
using a mysql/postgresql backend on dovecot or courier or your favorite imap that supports *sql. The mail would be stored with each detail in a different column in the table. Then you can index the sender, recipient, subject etc. You will need to either have a mail client that can use imap search so you can get the search to happen on the db side, or you could put together a php interface to search the database directly for the messages you are looking for.
IMAP isn't going away in the next decade, and neither is MySQL or PostgreSQL or the SQL language in general. The worst case would be migrating the mail table to a new DB, which could be done fairly trivially with a DB dump.
PSTs are hard-coded to tank, depending on the version of Outlook used. Right now with Outlook 2007 it's 20GB. Nobody NEEDS that much mail, but as an archive it's possible. Maybe a CMS server like Knowledgetree? Provided that it can parse the mail passed into it, it's a great open-source project that seems to have great staying power and development. I'll be testing that myself this week using mail messages that currently reside in Thunderbird.
One of the 187.
Red or blue one?
RIP America
July 4, 1776 - September 11, 2001
Parent is +informative and/or +interesting, not troll. Fucking brain dead moderators these days. Sheesh.
it suggested a linux solution and made the windows weenies realize how useless their os is. by extension they realized how tiny their penises are and then they finally understood why they like Micro Soft because it describes them perfectly. so they got mad and said "i'll mod it down, yeah, that'll teach them a lesson and make me feel like a real man again!"
Kmail has an excellent .pst converter that will pull out your old Outlook mail. Once you have it in Kmail, you can drag and drop it into any of the supported formats, mbox, mdir etc. If you have already established filters, you can let them sort things out. If not you can use a manual search for to, from, mail list, subject, etc. From there you can run your imap.
I carry everything around on my laptop and use kmail instead of using imap. With full drive encryption and xscreensaver, I don't have any worry about losing private information and know that my ISPs have better collections of my email anyway, despite what they say about size limits. I could use Gmail's imap instead of my own but prefer to suck my gmail out with kmail's imap support. Until US networks get more reasonable, I want my mail with me instead of on my own server and I would not advise anyone to leave their mail on someone else's server without having a copy yourself.
Because your question is all about search, I have to plug Kmail again. With proper organization of your mail into subfolders for friends, family, lists, companies and projects, mail searches are quick, even on modest hardware like my ancient PIII laptop. Searching everything takes a little longer, but it is not such a burden. Evolution may do as well but something about Gnome turns me off. The only downside is that the 3.5 branch does not seem to be able to search through encrypted mail but I imagine there's some gpg-agent fix for that I'm not aware of.
Friends don't help friends install M$ junk.
I have worked across most of the clients you mention and found the search interfaces (especially in Outlook) to be horrendous. When Spotlight search came out on Mac OS X, the speed of searching my emails in OSX Mail got so fast, that I now use it as a reference. I have stored email back to 1993, and searches come up in split seconds. There are several subjects that I check my historical email from 11 years of mailing lists before going online or checking a book. I regularly use it to find out "where I put that email from X".
I migrated all my old personal emails to gmail using IMAP. You can use this to migrate between different on-disk formats like maildir, mbox and pst. I had all my email in yahoo and pulled it down using POP to a maildir, then used an IMAP mail client to copy it across to gmail. Then I regularly back them up from gmail to an on-disk maildir format using mbsync. I picked maildir because it's open and seemed better designed than the alternative, mbox. It's not completely standardized though. I've seen PSTs become corrupt so I try and stay away.
stay frosty and alert
There's a commercial, but low cost, package that I've used to do exactly what you are describing: http://www.aid4mail.com/
Aid4Mail converts email to and from a variety of mail formats. The feature that you might find useful is that it will create a zip archive that contains standard .msg format email messages. Use that in combination with an indexing programme. I use X1 (http://x1.com/), but there are lots of indexing programmes that will index zip archives for easy searching.
I hereby name you MR NOSTALGIA.
What do they say?
:(".
June 2001 - "Dave, can't go out tonight. I got a date with that fat chick.YEAH!"
Sept 2001 - "Dave, She's told me she's pregnant."
Jan 2002 - "Dave, will you be the best man at the wedding?"
Shhhh - Dave's the real father (AC doesn't know)..
Perhaps the best route would be to use MySQL or some other FOSS database and build a web front end for browsing, searching, etc
I haven't seen this mentioned yet, but if you DO go with your own IMAP server use ReiserFS for whatever partition the mail resides on. Generally faster for small files (like old emails).
Please. It's never "vitally important"; no-one will die if you don't. I wonder how much difference your "vital demographic analysis" has actually made to anything, ever.
I am trolling
MH Mail is an old standard, but its mailbox format works very well and the tools scale even better. It's basically similar to Maildir, but stored in the user's home directory.
I have mailboxes with a million messages in them and it works fine (still takes a while to search, but it doesn't suffer from a tipping point of bad behavior).
Get Gmail. Star everything important.
Done.
http://home.planet.nl/~mourits/koelkast/ Is that you?
Don't fight for your country, if your country does not fight for you.
Import them all into Solr. It does Lucene-based full text indexing and can import various binary file types.
http://xena.sourceforge.net/
A great free Java program for automatic normalization and archiving of mail (and other documents), developed by the Australian Government
Google Apps for your domain offers a bulk-import feature from Outlook and other clients.
:)
Gmail offers all that you wish for. Take the free premium trial for GApps, bulk import, then cancel. Problem solved?
It could be that the only purpose of your life is to serve as a warning to others.
I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.
Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.
It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...
DO NOT DELETE YOUR ARCHIVES, EVER!***
*** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.
http://slashdot.org/~GuyFawkes/journal
As many above have mentioned part of this, I just wanted to put some of it together:
- set up a small server with a file system with checksums - ohh, that probably just leaves zfs
- set up dovecot on the server with maildirs
- set up clients to use IMAP to put messages on the server; if you have any existing IMAP accounts, use mbsync directly on the server
- set up Thunderbird as a client to index it all in Thunderbird's own index files, so you can search it directly from Thunderbird
- use xapian or something similar to index your maildirs on the server so you can search them on the command line when you need to
- use rsync to copy the whole bunch offsite to somewhere that you trust, or use duplicity to copy it somewhere you don't trust
New things are always on the horizon
I have mail going back to 1991 archived as mbox files. Some of it is pretty disorganized, but since 2000 I've organized mail into Sent-Archived and Received-Archived directories with the mbox files named YYYY-MM.
It's a pain to search. But on the other hand, I hardly ever need to search the really old stuff, so grep and friends are good enough.
I may eventually split it out into maildir format and use a full-text indexing engine such as Xapian to make searching easier. But I'll probably keep the master mbox archive; the format is incredibly simple and it's easy to munge into other formats as necessary.
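Splitting an mbox out into maildir while keeping the master file is something Python's standard mailbox module can do. This is a rough sketch, not a tested migration tool: mbox_to_maildir, the file names and the two sample messages are invented for illustration. The source mbox is only read, never modified.

```python
import mailbox
import os
import tempfile

# Two made-up messages in mbox form, standing in for a real YYYY-MM file.
DEMO_MBOX = (
    "From alice@example.org Tue Jun 22 10:00:00 2004\n"
    "From: alice@example.org\n"
    "Subject: first\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.org Wed Jun 23 11:00:00 2004\n"
    "From: bob@example.org\n"
    "Subject: second\n"
    "\n"
    "Body of the second message.\n"
)


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox into a new Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # MaildirMessage carries over what it can (flags, delivery date).
            dst.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        src.close()
        dst.close()
    return copied


workdir = tempfile.mkdtemp()
mbox_path = os.path.join(workdir, "2004-06.mbox")
with open(mbox_path, "w", newline="\n") as fh:
    fh.write(DEMO_MBOX)

maildir_path = os.path.join(workdir, "2004-06")
copied = mbox_to_maildir(mbox_path, maildir_path)
print(copied)
```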
Got you both beat. Except for a 1-year gap when I was using a VAX and a 1-year gap where I lost all my data on main drive and backup, I've got all my email since 1987. Yes, Virginia, they did have email in 1987.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
How was this modded Informative? Saving correspondence for future reference is critically important.
Many good things taken to an extreme become a clinical warning sign. Note the poster saves *all* emails and admits that this includes a lot of junk mail. Saving an email that contains some sort of technical/business/etc content is one thing but do you also save *all* the "where do we go for lunch today" emails or just the one that references some newly discovered gem of a restaurant?
...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).
El_Muerte_TDS has just pointed me towards mairix, a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.
Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here.
Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.
Moderation Total: -1 Troll, +3 Goat
Although it would involve keeping an index, you could add a strange keyword to each piece of email within the body of the email. For example, all emails from Donna in 2009 could be tagged with donna09. Running a search should then yield all emails from Donna in 2009. You could also add the month: january09donna, for example. You can even ask people to include a tag in every email they send to you.
IMAP is a messaging protocol. You can't store things in IMAP. What you can do: upload eMail messages to a mail server, which then stores them in [insert-mail-server-specifics-here]. The format you are looking for is MIME. MIME is complete and keeps all the header information. Every message is one file that can be read on any platform. You could opt for MIME messages in a directory structure and use some full-text index software (Google Desktop, Apache Lucene etc.). You can probably find software that creates index lists (like by sender / subject / date).
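The "index lists" part is simple enough to sketch yourself. The following assumes one MIME message per file somewhere under a root directory (that layout and all the function names are my own choices, not any particular tool's), and uses only Python's standard email package to read the headers.

```python
import os
from datetime import timezone
from email import policy
from email.parser import BytesParser


def read_headers(path):
    """Parse only the headers of one MIME message file."""
    with open(path, "rb") as fh:
        return BytesParser(policy=policy.default).parse(fh, headersonly=True)


def sent_timestamp(msg):
    """Return the Date header as a UTC timestamp, or 0.0 if missing or unparseable."""
    try:
        header = msg["Date"]
        when = header.datetime if header is not None else None
    except (TypeError, ValueError):
        when = None
    if when is None:
        return 0.0
    if when.tzinfo is None:          # naive dates are treated as UTC here
        when = when.replace(tzinfo=timezone.utc)
    return when.timestamp()


def build_index(root):
    """Return one entry per message file under root, sorted by date."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            msg = read_headers(path)
            entries.append({
                "timestamp": sent_timestamp(msg),
                "sender": str(msg["From"] or ""),
                "subject": str(msg["Subject"] or ""),
                "path": path,
            })
    entries.sort(key=lambda e: (e["timestamp"], e["path"]))
    return entries


def by_sender(entries):
    """Group index entries by their From header."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry["sender"], []).append(entry)
    return grouped
```

This only gives you listings by date and sender; searching message bodies would still need a real full-text indexer on top.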
Yes, it is not free, and yes, this suggestion will bring out the trolls, but you might want to consider Lotus Notes/Domino. It is ~$140 for the system, and ~$40 a year maintenance (Includes all upgrades) cost per user, but IBM isn't going anywhere any time soon.
It has good full text indexing, you can keep your mail on a client, and on the server, with incredibly flexible replication rules for what is stored where.
It supports IMAP, so it talks well to most clients.
The iPhone syncs seamlessly with it via ActiveSync, and an Android client is in beta as we speak.
It includes an http client, and the http client even offers offline access. That's right. You can use the http client, and still read your mail and write emails that will be sent the next time you make a connection.
It also has folders, but you can put any email into as many folders as you want, so you have the best of both Outlook folders and Gmail tags.
It supports auto-processing rules for automatic filing of data, as well as being a full development environment if you want to get really fancy.
It is brain dead easy to set up and maintain.
The server runs on Linux and Windows, and the client runs on Linux, Windows and Mac.
My old alarm clock PC doubles as a web server.
As a number of people have suggested, I use IMAP but here is my scheme.
Emails dating back a few years (2.5 in my case) are stored on a hosted IMAP server with server-side capabilities. Emails older than that are stored in mbox format and accessed through mulberry (a now mostly dormant client), but any mbox-aware client will work.
The hosted IMAP server holds all my sent, archived, and inbox mail from the last 2.5 years. It has a webmail interface (horde), and I also use the following clients on various computers to access it: postbox on my netbook, and mulberry on my desktop.
People have suggested gmail, and that works too, but my solution above is a freebie given that I already have to pay for web hosting. It also gives me the freedom to use aliases to filter mail as well as own my own email address.
The raw data should be in one of the common "mbox" formats with MIME-encoding. It doesn't have to be all in one file either - one file per year or per month should be fine. This has been around since the 1990s and you won't risk losing access due to the file format in your lifetime.* You will lose your folder organization, but you can get around that by making the folder name part of the file name or using filesystem-level folders to segregate messages, e.g. "2008/April/junk.mbox" or "junk/2009/April.mbox" and so on.
You can make "working copies" of this in any format you like. You can even be "simple" and use your operating system's text-index tools to index the files. You won't have quick-access to pictures or other binary or non-ascii text attachments but opening the mbox in any mail-reader that understands this file type - and there are many - will get you to the attachment.
*guarantee void if life-extending technology allows you to live more than 125 years from now.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
you obviously don't watch enough hollywood geek thriller movies. Someone is *always* going to die if the information isn't found right fscking now!
People in cars cause accidents....accidents in cars cause people
You want to store all your messages in MIME format. MIME is reasonably well defined, and your messages are most likely already MIME when they arrive from the Internet. A MIME message can be opened with any text editor or displayed on the command line (cat somefile.mime). It can contain attachments (you need to take care of attachments yourself, since their binary formats may become obsolete). Some suggested solutions (maildir) use native MIME files, and then any full-text indexer will do. It looks like mairix might be good for listing your messages inbox-style. Good luck!
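To keep an eye on which attachment formats are sitting in the archive, something along these lines would do. It is only a sketch using Python's standard email package; the function name is hypothetical, and it reports attachments without extracting them.

```python
from email import policy
from email.parser import BytesParser


def list_attachments(raw_bytes):
    """Return (filename, content type, size in bytes) for each attachment."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    found = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""   # None for nested messages
        found.append((part.get_filename(), part.get_content_type(), len(payload)))
    return found
```

Fed the bytes of somefile.mime, it returns an empty list for a plain text message and one tuple per attachment otherwise, which makes it easy to spot the formats that may need converting some day.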
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important.
While the value you place on being able to retrieve critical pieces of information may be valid, your choice of storage medium is not. An email system is not a file server or database. Most index poorly, if at all, making searches horribly inefficient. And as has already been observed, it may be quite likely that those same things you value will be more than offset by their value to a hostile litigant.
Damn, I'm going for +5 Funny and you guys mod me down to -1 Troll? Tough crowd. Get a sense of humor, will ya?
just because I can.
That's a big assumption. You are asking slashdot, so I'm thinking you can't. Especially because imap never occurred to you.
Convert to HTML with something meant for creating online archives. Then if you put it on a filesystem you can index it and search it at will. Unless you really need the originals this is your best no-coding option for later convenient reading. It is also possible to use some software to generate the indices etc. with the originals included within the archive pages.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re-affirmation is a wonderful thing.
Funny thing is, what you describe looks exactly like a description of your own post, I can only conclude that the attitude you describe has nothing whatsoever to do with whichever OS they choose to use rather than their own personalities.
When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal is either being a troll, or not understanding how the world works when you're not just a drone.
This sort of behavior is odd and not normal. If you want to keep your email, then that's fine, but thinking that it's "vitally important" is odd and I think without question points to some "OCD with some component of Asperger's". If you don't then maybe you need to re-evaluate. I am however interested in how you pull demographic analysis out of emails? I mean, hopefully you're not suggesting that you go and chomp on the text to pull out fields of data?
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea,
I think that GMail could be the panacea here. I mean, if you're just trying to make sure it lasts and you can search it with ease, then GMail can do it better than you can.
You'll have that sometimes...
Yup, I'm really highly concerned that an advertiser might learn that I like electronics and am a huge computer geek. Because there's no other way they could know that.
Are you concerned that your emails "leak" such information about those that you are corresponding with? Are they OK with this? If they sent their email to a gmail account that's one thing, you could argue they implicitly agreed to the profiling. However by uploading all your emails to gmail there is no such implicit agreement.
I recently did something very similar with mail dating back to 1993 or so in multiple mailbox formats (Eudora, PST, Thunderbird mbox, etc.)
Get a Google Apps account http://www.google.com/apps/intl/en/business/index.html
This allows you to run a gmail interface with mail on your own domain.
If you need more than the available storage for free, you can pay for 25 gigs, but it seems like the free level will work for you.
For the PST files, upload them with Google Apps Migration for Microsoft Outlook
http://tools.google.com/dlpage/outlookmigration
Alternately, migrate the PSTs to Thunderbird using Emailchemy
http://www.weirdkid.com/products/emailchemy/
Then, if you're on a Mac (it seems you are) upload to Google Apps via the Google Email Uploader for Mac
http://code.google.com/p/google-email-uploader-mac/
This will upload everything you have in your Thunderbird environment. And it will take some time. At first it may look like the program has frozen, but give it a half hour or so to sort through all your Thunderbird folders, and then let it upload the mail overnight. It took me a few overnight uploads, but it was worth it.
Once you have it in Google it's very searchable and flexible. You can for instance re-organize it using labels, and then re-download to Thunderbird via IMAP if you like.
gmail only allows 7.5GB of space currently
+++THIS POST IS INTENDED TO BE HUMOROUS+++
It's usually best if you're making a joke on /. that begins with anything other than "I for one welcome...", "In Soviet Russia..." or is itself entirely a quote from a Simpsons episode that was broadcast ten years ago, to give the poor /.'ers a helping clue such as beginning your post with "+++THIS POST IS INTENDED TO BE HUMOROUS+++".
Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
What about the privacy of those you correspond with? If they send an email to a gmail account that is one thing, but you are unilaterally deciding to have them participate in the targeted advertising profiling.
Actually, email systems tend to be the most searchable precisely because of people like the grandparent. If someone's sent me something in an email, I can usually find that email in less time than it takes to find where I saved the attachment. I have every email I've sent or received since 1997 (excluding spam, but including mailing lists), which comes to about 3.6GB. In spite of this size, it's well indexed by my mail client and searches generally only take a few seconds to produce the correct result.
I am TheRaven on Soylent News
I've also been archiving my emails since the early 90s. I've got a few hundred thousand messages. I've always used procmail to store in mbox format. I use shell/grep etc for searching. With procmail I archive like so;
$HOME/Mail/year/SenderOrRecipientAddr
where SenderOrRecipientAddr is either the sender's email addr or the recipient's, depending upon whether it's mail to me or from me. This way, for example, everything I send and receive to/from joe@smith.com is in the same mbox file.
And storing it under $HOME/Mail allows imap to serve it up.
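For anyone who would rather not write procmail recipes, the same year/correspondent layout can be sketched in Python with the standard mailbox module. To be clear, this is not my procmail setup, just an illustration of the scheme: the function names and the set of "my" addresses are assumptions, and only the first To address is considered.

```python
import os
import mailbox
from email.utils import getaddresses, parsedate_to_datetime


def first_address(msg, header):
    """Return the first email address found in the given header, lowercased, or ''."""
    pairs = getaddresses([str(value) for value in msg.get_all(header, [])])
    for _name, addr in pairs:
        if addr:
            return addr.lower()
    return ""


def archive_path(msg, root, my_addresses):
    """Build root/year/correspondent, where the correspondent is the other party."""
    sender = first_address(msg, "From")
    other = first_address(msg, "To") if sender in my_addresses else sender
    try:
        year = str(parsedate_to_datetime(str(msg.get("Date", ""))).year)
    except (TypeError, ValueError):
        year = "unknown"             # missing or unparseable Date header
    safe = (other or "unknown").replace("/", "_")
    return os.path.join(root, year, safe)


def file_message(msg, root, my_addresses):
    """Append msg to the mbox file chosen by archive_path and return that path."""
    path = archive_path(msg, root, my_addresses)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    box = mailbox.mbox(path)
    try:
        box.add(msg)
        box.flush()
    finally:
        box.close()
    return path
```

With my_addresses set to your own addresses, mail from joe@smith.com and mail you sent to joe@smith.com in the same year both land in the same mbox file, as described above.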
What about the privacy of the other people involved in an email? Did they consent to be part of gmail's targeted advertising profiling? Perhaps emailing a gmail account implicitly does so but that is not the case when you upload everything on your own.
DBMail. I use it on a Linode host (small fee every month).
Need an ISP in South Africa?
Ahhh.... so you have access to the time even when you're away from home? Clever! ;)
Currently I use imapsync (http://freshmeat.net/projects/imapsync/) to sync all of my email to shared archive folders on a vm with the cyrus imap server installed. I wrote a shell script that syncs all of my mail into an archive folder for the current year, then deletes the email off of the original imap server. From time to time I have searched for a way to write all of the archived mail to an indexed format that can go on a cd/dvd that needs no mail reader to search for, but have found nothing. I worry that 20 years down the road there will be no way to run the vm, the imap server, or a client to access it. So good luck ;)
Say what?
It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem. For e-mail, you can store it all on one small (these days) hard disk placed in a drawer somewhere, with space to spare -- even with all the spam! And the process of figuring out how to better organize it and archive it going forward will be a useful learning exercise that might have applications elsewhere (e.g., at work, where people might be asking exactly the same question).
It's no worse than deciding to tidy up your office or study area and figuring out a system to better keep track of things so you can find them later.
I mean, heck, the President of the United States had the same fricking problem: how to properly archive e-mail, a problem discussed here numerous times. As a common problem -- personally and in business -- listening to other people's solutions before digging into it yourself is an efficient way to deal with it.
I've been a Cyrus IMAP admin for over a decade and have experienced no problems with user email boxes in the 6 GB - 8 GB range or single IMAP boxes with > 1E+06 messages. Performance of large batch message operations (i.e. import, export) is also satisfactory. It's also very useful to have server-side message tagging support (i.e. like gmail). I've heard similar reports regarding other FOSS IMAP servers such as Dovecot & UW, and there seems to be at least some consensus that they are easier to manage than Cyrus, but I have no direct experience regarding the relative ease of administration. Running your own local Zimbra might be a nice starting point as well: it gives you a bunch of personal productivity functionality in a single groupware app. I'm running my own Zimbra instance on a RackCloud server for $90/year (all-in) for exactly this purpose.
As anyone who actually uses IMAP can tell you, it bogs down quickly on large mailboxes, violating the poster's requirement about b)
I agree, IMAP is the way to go. The dbmail.org project has an implementation of an IMAP service that uses a database as the back end. This allows you, in theory, to create a custom application to do full-text search over the mail contents (which are stored in database tables). The default schema already does a good job of normalizing mail headers and recipient email addresses, which helps to filter searches using those. This kind of searching and indexing is of course a custom thing you have to build. I have not gotten around to doing this yet (after several years of running dbmail now), but I found that having the mail contents stored in a database does provide slightly better performance over time than having the many, many individual files when a mailbox is backed by Maildir or Mailbox filesystem-based storage. The only hitch is that if you need to interoperate with Windows, or use Windows only, it's inconvenient compared to using PST files, I guess. I have envisioned creating a virtual machine that runs a Linux operating system loaded with just the dbmail and database stack, effectively creating a macro PST file type of thing: a service / appliance / single virtual machine image file I boot up to be my easy-to-search mail storage repository.
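To give an idea of what such a custom search layer could look like, here is a toy sketch using Python's built-in sqlite3 module. Note that this is not the dbmail schema: the two tables, the tokenizer and all function names are invented for illustration. A query matches a message only if every word in it appears in the subject or body.

```python
import re
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny mail store with a word index."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, body TEXT);
        CREATE TABLE IF NOT EXISTS terms (
            term TEXT NOT NULL,
            msg_id INTEGER NOT NULL REFERENCES messages(id),
            PRIMARY KEY (term, msg_id));
    """)
    return db


def tokenize(text):
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def add_message(db, sender, subject, body):
    """Store one message and index every word of its subject and body."""
    cur = db.execute(
        "INSERT INTO messages (sender, subject, body) VALUES (?, ?, ?)",
        (sender, subject, body))
    msg_id = cur.lastrowid
    db.executemany(
        "INSERT OR IGNORE INTO terms (term, msg_id) VALUES (?, ?)",
        [(term, msg_id) for term in tokenize(subject + " " + body)])
    db.commit()
    return msg_id


def search(db, query):
    """Return (id, sender, subject) rows for messages containing all query words."""
    words = sorted(tokenize(query))
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = (
        "SELECT m.id, m.sender, m.subject FROM messages m "
        "JOIN terms t ON t.msg_id = m.id "
        "WHERE t.term IN (%s) "
        "GROUP BY m.id HAVING COUNT(DISTINCT t.term) = ? "
        "ORDER BY m.id" % placeholders)
    return db.execute(sql, words + [len(words)]).fetchall()
```

A real setup would more likely lean on the database's own full-text facilities, but even this shows why having headers and bodies in tables makes filtered searches straightforward.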
https://www.google.com/accounts/PurchaseStorage
mbox format. That's all you can really do if you want it to be readable by anything you want in the future.
I've archived all of my email since 1992 in yearly ( later monthly ) tarballs.
I double tar the monthly ones into an annual one just for convenience when I later go to bzip/gzip them.
Ironically, storage density has become so high, I haven't bothered zipping up any of the last 2 years, even though it's nearly 10,000x the disk space as my 1992 files.
I laugh at you English, cuz I don't have email!
I am in the same boat as the original poster. And I think the question has not yet been answered. My requirements (and I suspect the original guy's too) are:
1. A client that can import and store emails in a wide variety of formats
2. A client that can search emails (including office format and PDF attachments) quickly
2a. A figure of merit: 100000 emails, 10 gigs, 100 msec or less for search (core i7, plenty of RAM, SSD)
2b. Ideally search would allow SQL-like searches on any field and understand regexp
3. A client that requires no IP stack to function so it could be run on a machine detached from internet
(I have on-line and off-line machines for security and I disable IP stack on off-line machines to prevent
temptation to use them online if my other machine fails).
4. Crucially, a client that is easy to install, configure, and use. If your solution involves configuring a server
or worse yet, configuring a server in a virtual machine then it is not workable. I do not have the time to
figure it all out and I suspect only real sys-admins would consider this a solution.
You seem to assume saving an email takes an active effort and therefore it is more OCD and wasted effort to save all the emails instead of only some.
In fact, those of us replying with suggestions on how we do it have found the opposite. We have taken a simple action to save-by-default and it takes extra effort to exclude some emails from this treatment. It's one action every month or so to shift off the entire old message set to an archive, whether there are four or four thousand messages. And if the OP is like me, the reason most junk mail is deleted but not all is because most is deleted by the automatic filters and he's too lazy to go after the individual ones that sneak through. Here's my periodic effort: tag all messages between Y1-M1-D1 and Y2-M2-D2; save tagged messages to mbox foo; delete tagged messages from inbox. It took me longer to type it here than to perform it for real.
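If your mail client can't do the tag/save/delete dance for you, the same periodic shift can be sketched with Python's standard mailbox module. I do mine in the mail client, so treat this as an illustration only: the function names are made up, both mailboxes are assumed to be mbox files, and the date bounds must be timezone-aware datetimes.

```python
import mailbox
from datetime import timezone
from email.utils import parsedate_to_datetime


def message_time(msg):
    """Return the Date header as an aware datetime, or None if unusable."""
    try:
        when = parsedate_to_datetime(str(msg.get("Date", "")))
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:          # naive dates are treated as UTC here
        when = when.replace(tzinfo=timezone.utc)
    return when


def archive_range(inbox_path, archive_path, start, end):
    """Move messages with start <= Date < end from the inbox mbox to the archive mbox."""
    inbox = mailbox.mbox(inbox_path, create=False)
    archive = mailbox.mbox(archive_path)
    moved = 0
    try:
        for key in list(inbox.keys()):
            msg = inbox[key]
            when = message_time(msg)
            if when is not None and start <= when < end:
                archive.add(msg)
                inbox.discard(key)
                moved += 1
        archive.flush()   # write the archive before rewriting the inbox
        inbox.flush()
    finally:
        archive.close()
        inbox.close()
    return moved
```

Messages without a usable Date header are left in the inbox rather than guessed at, and no file locking is attempted, so don't run it while a mail client has the inbox open.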
When you have effective search methods, it doesn't matter how much extra is saved by accident. When you've solved a few work-related issues by finding an old message with exactly the right info in 5 minutes instead of spending hours recreating technical information from a project several months or years removed, you realize it's an important work tool. As mentioned far above, the only reason not to archive is if you are going to have incriminating information in your archive; but it's a leap of faith to believe nobody else has that information, even if you delete it. So a better solution is to avoid being a bad person and writing about it in emails.
I know what real pack-rat OCD people do, as I have some in my extended family. And I can safely say that my archiving of email does not clutter my life or my mind; rather it occupies an infinitesimal space on my hard drive and frees me from periodic sifting through data to carefully record those technical nuggets before I purge my active in box. I just shift them all by date-range and forget about it. It's kind of like delete-all with search-based undo, for when it turns out I really did need it for work.
I am in the same boat. I ended up importing them into Cyrus for the last few years. It's not fool-proof, however if you configure the "squatter" service, it will do some rich indexing. I have found that, over time, even when older messages have an attachment, it doesn't always translate correctly into modern mailers. There could be several reasons behind that.
A while ago, I saw a project called Zoe which was aimed at solving the problems described -- it was OS centric (Mac?), though I believe it's been abandoned.
Another project out there is "dbmail" which is basically a large-scale email server (IMAP, etc.) that stores your messages in a MySQL database. Might be worth a shot.
I think the original poster is asking about something that will not only store the data properly, but also present some sensible GUI to peruse it all. This capability is veering into the paradigm of "document management", I would think, especially with regard to access to the original attachments and their various encodings and formats.
http://www.sqlite.org/cvstrac/wiki?p=ExperimentalMailUserAgent
When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal is either being a troll, or not understanding how the world works when you're not just a drone.
This sort of behavior is odd and not normal. If you want to keep your email, then that's fine, but thinking that it's "vitally important" is odd and I think without question points to some "OCD with some component of Asperger's". If you don't then maybe you need to re-evaluate.
I am however interested in how you pull demographic analysis out of emails? I mean, hopefully you're not suggesting that you go and chomp on the text to pull out fields of data?
So on the one hand, you think my saving email for later access and analysis is not useful, but then, you want to know why it is useful?
I run a research laboratory where we do two things, one is work on restoring sight to the blind, the other is to organize a conference every two years. The primary demographic analysis I need to do is to analyze the country-of-origin for email traffic pertinent to the conference. This has helped to raise many tens of thousands of dollars of support for the conference by demonstrating various aspects of the global attendance to funding agencies.
Being able to access my email and locate attachments, review discussions, find references, remember addresses, etc., in other words, to recall what someone once wrote to me, has resulted in millions of dollars of grant money to fund my research. Without the ability to review email that is, at times, years old, that would not be possible. Having rich access to my email stream has allowed me to fund my lab, and therefore feed and house my family and the people who work for me, publish high-impact papers, receive numerous awards, get coverage in the international press, etc., or, put better, to run the daily business of a research lab at a high-profile university. While the tools I use are good, they leave a lot to be desired, and having a better system would make me more productive.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea,
I think that GMail could be the panacea here. I mean, if you're just trying to make sure it lasts and you can search it with ease, then GMail can do it better than you can.
I dislike GMail for my professional correspondence for a number of reasons: (1) it does not allow me to readily use my university affiliation address (and since that's a top university, that makes a difference whether people like it or not), (2) I do not have ownership of my email, (3) the lack of a good filing / archiving interface makes it hard to associate different threads together, or to limit searches (I intensely dislike the tagging feature), (4) GMail has an only rudimentary ability to edit text since it's browser-based.
I do use GMail for my personal correspondence, but that's mostly because it's the best of a bunch of poor, but free, services. It does have the best searching features, but falls down in a lot of other ways. It also would be against my employer's policies to store HIPAA-regulated email offsite. So GMail is not a panacea. Thanks for the suggestion, though.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
While it's totally overkill for the job, I highly recommend you run a Zimbra Open Source instance for yourself. Although you don't need much of what it provides (Calendaring, contact sync, Jabber IM, etc), it will let you store your messages in a stable, searchable and accessible form. Zimbra can directly import from PST or via IMAP (with your mail client or imapsync) and once it has your messages it full text indexes them with Lucene and so you can search them via the web or IMAP clients. You can easily get your messages out via one of the supported export formats or just use your IMAP mail client to dump the messages into mbox/maildir/pst/whatever. While you could certainly roll your own, why not let someone else take care of all the hard work for you?
especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Yeah, I've used Gmail for getting close to five years now. Does it bother me that they have access to all my stuff? No more than it bothered me that whatever ISP's email I used previously had access to all my stuff.
In my eyes, it's just email, personal email at that. Of course I have sensitive stuff in there, but I'm not going to spend a disproportionate amount of time setting something up myself when I can just use what's already made.
If you want to have everything in your own possession, you could always set up a client to download the messages, and then delete them off Google servers once done. But I understand the paranoia. It's the same thing that keeps me from signing up for a medical marijuana license, I'd just prefer to not have my name on that list.
Someone flopped a steamer in the gene pool.
If you get everything into a standard (free)Unix spool file, it'll be readable a hundred years from now. After all, what other kind of archive file could you have from twenty years ago which you could easily use today?
Use the following for optimal performance: (1) IMAP for input, storage and access from any client for daily use; (2) configure Apache Solr to index your IMAP mail; (3) a web-based search interface to access your Solr index; (4) use (hierarchical) faceting (see example: http://search.lucidimagination.com/)
Leaving aside all the usual (tiresome) conspiracy theories I'd definitely import them to Gmail or, better still, a Google Apps account as per suggestions from other posters. I have all my mail going back several years and there's no problems for me.
So on the one hand, you think my saving email for later access and analysis is not useful, but then, you want to know why it is useful?
No, I wanted to know how saving email was the best way in which to accomplish the goal of demographic analysis. Now that you've explained what you do it *for*, which, for the record, I couldn't be less interested in BTW, I'm interested in how you achieve that goal with saved email? Last I've looked, and I could be way wrong, country of origin isn't listed in the email header. Also, IP addresses can't be that reliable two years after the fact either. So, how do you get country of origin from two year old emails? (not sarcastic either, I'm interested)
You'll have that sometimes...
I like it that Evolution saves in the same format as Mutt. Quite a lot that a person can do with that and basic unix commands.
I dislike GMail for my professional correspondence for a number of reasons: (1) it does not allow me to readily use my university affiliation address (and since that's a top university, that makes a difference whether people like it or not), (2) I do not have ownership of my email, (3) the lack of a good filing / archiving interface makes it hard to associate different threads together, or to limit searches (I intensely dislike the tagging feature), (4) GMail has an only rudimentary ability to edit text since it's browser-based.
So...
1. Yes it does. So long as your university allows you SMTP access, then Gmail can send email from your University address.
2. Your University lets you own your email? No archiving or backup there? Interesting. I thought most Universities had a robust email retention policy these days.
3. Gmail threads emails by default, has labels for filing, and you can even use postini if you have retention needs.
4. What do you need to do, edit wise, that you can't with the Gmail RTE? Have you used it lately? If the Gmail RTE isn't good enough, there's a myriad of plugin RTE gadgets you can use too. Just sayin...
Use whatever you want, and it's your business, but I don't see how any of your requirements are not fulfilled by Gmail.
You'll have that sometimes...
It's obvious, upload them to gmail!
(only half kidding)
Flappinbooger isn't my real name
It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem [wikipedia.org]
But you wouldn't save your junk mail, would you? Grocery store fliers? Credit card offers?
You'll have that sometimes...
Computers, hard drives, backups, electricity, rack space, and maintenance are all free! Fuck! Tell me where you shop for this stuff.
b/s ....
June 2001: Dave, my mind is going. I can feel it. I can feel it. My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm a... fraid. Good afternoon, gentlemen. I am a HAL 9000 computer.
Sept 2001: Can you take away this damn monolith?
btw:
Shhhh - Dave's the real father (AC doesn't know)..
Is that some sort of crossover between SW:ESB with SO:2001 ?
Your experience is as common as your rationale. Nevertheless, if email is the easiest way you have to find important information you are doing it (storing that important information) wrong.
People here seem to think that you are looking for another email client. Instead, it appears to me that what you really need is a way to archive and search your local machine. In light of that, take a look at http://beagle-project.org/ Beagle can search your IMAP stuff and local file system stuff too. I run Ubuntu so the UX for installing, configuring, indexing, and searching with Beagle is pretty easy. Beagle is available in the Ubuntu Software Center. You can search from either the command line or from the firefox search bar once you have configured that.
Or maybe, just maybe, someone modded it down because they actually read the TFA where he plainly said platform independent and a Linux only solution is no more platform independent than a Windows or OSX only one. I mean God For fricking bid that someone actually reads TFL in stories, but is it really so much to ask that they read the fricking summary of the story they're posting to? Hell why not just forget TFS altogether and start posting cookie recipes? Sheesh.
ACs don't waste your time replying, your posts are never seen by me.
Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.
Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.
I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.
Watch for Penguins, they eat Apples and throw rocks at Windows.
If you want light, always in text format, easily searchable, and fast, maildir + mairix is your answer. You don't even need to keep your mail in a flat structure. Place this on a server with IMAP/s access, and you'll never have to move your mail again. Just make sure you have good backups. For the fastest results ever? Access your email over SSH using mutt. The only drawback is that if you're not a CLI person (and this doesn't even use it that much), you're going to hate this, or at least have to pile on a few scripts to web-ify mairix and its search results.
And no offense to the gmail users, but true blue email types would never turn over their emails to anything not completely under their control.
Archiveopteryx is designed to do exactly what you're talking about. It's also a good general purpose mail server.
Well they have this thing, it's called "backups" this would be where you save your precious emails and move them to another media.
Get up!
As far as a "cloud" webmail interface goes Gmail has the best search features (which probably contributes to why so many Slashdotters prefer Gmail), but the search features introduced into the Thunderbird 3.x mail client are the best of any e-mail interface. To even rival the customizability of searches that is available in Thunderbird 3.x would require one to be fluent with command-line commands like find and grep, but acquiring such fluency is temporally expensive.
Timothy (OP) says that he has already tried Thunderbird though, but since his first complaint is that moving the "hundred of thousands of emails" that he has hoarded over the past two decades between the email systems that he has already tried takes "forever to process", Timothy appears to have some unreasonable expectations regarding data sets that large (specifically in regards to migrating and indexing such sets).
For those who do not feel comfortable keeping their e-mails in the cloud, they could always use Thunderbird 3.x as the interface and administer their own IMAP server at home using software like Dovecot.
Imbecile. Outside of the USA, the majority of email addresses end in a country-specific suffix.
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
Your emails must be REALLY important!
Sorry, but gray text on gray background is making my eyes bleed.
Read the post properly. He said he does NOT have ownership of his emails. This doesn't mean he's not responsible for the mundane details, but to quote the poster, "It also would be against my employer's policies to store HIPAA-regulated email offsite". So GMail is totally absolutely out of the question.
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
"... you could always set up a client to download the messages, and then delete them off Google servers ..."
Or just not use GMail.
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
In Soviet Russia, the jokes LAUGH AT YOU!
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
Perhaps you could free yourself from the tyranny of data by just deleting the e-mail? You can keep a year or two around in your favorite e-mail tool, and just let the rest go... the alternative appears to be creating the digital equivalent of the old people living in houses filled with junk that they never do anything with.
This is a windows solution, and it works great. I have stopped using all clients and just use GMail on the web. I have archived all my Eudora/Thunderbird archives into MailStore. I now have one place to search all my e-mails, since 1999. http://www.mailstore.com/en/mailstore-home.aspx
Far more efficient to simply leave the spam and let it sit with everything else. Ideally spam is deleted as it arrives, but some get missed ...
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
I have a far simpler solution. Just use kmail, and set up autoarchiving. For example, on your main inbox, just create subfolders, one for each year. And as they age, they get put into the newest subfolder. Next year, label that one "2010". Lather, rinse, repeat.
I get somewhere around 5-10k email messages per day (from a number of lists that I'm on), and have used this approach for the past 10 years. It works great.
The advantage here is that this is on my home desktop, so it's already part of my backup system. I don't need to build a separate server. Plus, the searching ability is far superior to reading through any of the standard web-based archives of email forums. I have kmail's searching ability, plus find/grep/et al. It's a LOT faster, and I don't have to put up with any web-based apps.
If you want remote access, THEN add imap, or ssh, or whatever. But it really doesn't get any simpler, or more powerful, than this.
I was just watching Hoarders... and I think this would be the digital equivalent. Why on Dawkins' green earth would you possibly want to keep all that email???
Damn, I'm going for +5 Funny and you guys mod me down to -1 Troll? Tough crowd. Get a sense of humor, will ya?
Your post would have been modded funny if it had contained a humorous punch-line.
"I like to lick butts!" by MobileTatsu-NJG (#32700246) (Score:5, Informative)
Hey - I don't think there ARE any computers older than my dad. Lemme see, he was born in ... er, 1929.
Nope, not too many PCs then ...
"Cats like plain crisps"
Well, if you just make each email a text file, then you could use Lucene, an open source search engine. It's pretty easy to use, and to implement a web interface for if you want one.
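To make the one-file-per-email idea concrete, here is a minimal sketch of the kind of inverted index a search engine builds over a directory of text files. It is plain Python, not Lucene, and the function names `build_index` and `search` are made up for illustration; a real engine adds ranking, stemming, phrase queries and an on-disk index.

```python
import re
from collections import defaultdict
from pathlib import Path

# Very crude tokenizer: runs of ASCII letters and digits, lowercased.
TOKEN = re.compile(r"[a-z0-9]+")


def build_index(root):
    """Map each word to the set of files under root that contain it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace").lower()
            for word in set(TOKEN.findall(text)):
                index[word].add(str(path))
    return index


def search(index, query):
    """Return the sorted list of files containing every word of the query."""
    words = TOKEN.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

Building the index is the slow part and only has to happen once per new message; each query afterwards is a handful of set intersections, which is why indexed search beats grepping the whole archive every time.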
I agree with the general sentiment that maintaining electronic records (and emails are most definitely legal electronic records) is imperative. The IRS suggests maintaining at least 7 years worth of documentation in the event of an audit. It should be no different for electronic records.
Where I don't agree with the general sentiment is the fear-mongering of privacy concerns with gmail. I switched to gmail (via Google Apps) about three years ago. It is without a doubt the best digital move I've ever made. Google's privacy policy is quite clear on how your data is stored and managed.
If you still feel the need to maintain a local archive of your mail records, simply download them on a regular basis to a client of your choice. While I understand the interest of a hobbyist to create some elaborate local server/client for their mail, I (and I suspect many others) have more important things to do with our spare time. Enjoy the services that exist today to help you manage these records, instead of re-inventing the wheel.
I used thunderbird portable once and it satisfied my needs.
I stored one year of email history on a CD. To see the mails you just insert the CD and run the exe (it makes a temporary copy on the hard disk if the media is not writable).
Hope that helps
I am sorry I ran out of mod points yesterday... On the other hand, I would have a hard time choosing between +1 Informative, +1 Funny, +1 Flame Bait with Style.
Great question. Has anyone ever used this:?
http://www.mailstore.com
I'd like to do a windows-based, off-the-net solution.
99% of everything you save is probably total crap. The other 1% is mostly crap. Use a little common sense and you will find this task much easier.
About five years ago, I sucked all of my mail ever into an SQL database using Perl scripts.
It's funny. It clearly wasn't meant to be taken seriously.
Instead of responding with anger, you should have fun with it by coming up with something witty. If you can't, then take a deep breath and move on. Nerd rage only gets you laughed at, even by others who may in fact be just as nerdy.
You'll never find a system as fast at searching and categorizing that amount of mail.
Just remember, Larry and Sergey read it all.
Wes
Do daemons dream of electric sleep()?
There's a certain amount of setup/design/configuration involved, but you might think about the Greenstone Digital Library software from the University of Waikato in New Zealand (see http://www.greenstone.org/). It's an open source digital library package, and among the formats it supports out of the box for ingest/indexing/retrieval/display is e-mail archives. It's multiplatform, and in its 2.X incarnation pretty solid (or at least solid enough so that I have students install it on their own machines when I'm teaching our digital library course and I usually don't have a lot of support headaches as a result).
While my history is not quite as long (only the last 10 years or so), I do have over 22GB and 130K+ email messages in my repo. I have a virtual machine that's running Zimbra (http://www.zimbra.com) open source. You get an extremely powerful email system built on top of open source components (Postfix, SpamAssassin, etc.). Not only does it make administering all this stuff simple and easy, but it also gives you a very powerful Web 2.0 UI, IMAP/POP3 access, multiple accounts, multiple domains, multiple everything. All my corporate and personal email gets automatically stored and indexed there. I host it all from home and don't need to worry about privacy issues. You couldn't make it easier or simpler to do this.
grepmail regex mbox1 mbox2 mbox3 > newmbox
mutt -f newmbox
I lived on a steady diet of that for years, but now Thunderbird 3 does it better. It meets all of the OP's requirements:
a) Everything's stored as an mbox. Fast conversion utilities abound.
b) Performance with my very large set of data is good.
c) mbox is as universal as it gets.
d) Thunderbird 3 added full-text indexing. I can search GB's in seconds, instead of tens of minutes with grep.
e) Thunderbird is on all major platforms.
f) see (c).
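For anyone without grepmail handy, the "filter several mboxes into a new one" step can be approximated with Python's standard `mailbox` module. This is only a sketch: the function name `grep_mbox` is made up, and unlike a real mail-aware tool it matches against the raw, undecoded message bytes, so text inside base64 or quoted-printable parts will not be found.

```python
import mailbox
import re


def grep_mbox(pattern, sources, dest):
    """Copy every message whose raw bytes match pattern into the dest mbox.

    Returns the number of messages copied. Matching is case-insensitive and
    runs on the undecoded message, headers included.
    """
    regex = re.compile(pattern.encode("utf-8"), re.IGNORECASE)
    out = mailbox.mbox(dest)  # created if it does not exist yet
    copied = 0
    try:
        for src in sources:
            box = mailbox.mbox(src, create=False)
            try:
                for key in box.keys():
                    raw = box.get_bytes(key)
                    if regex.search(raw):
                        out.add(raw)
                        copied += 1
            finally:
                box.close()
        out.flush()
    finally:
        out.close()
    return copied
```

Something like `grep_mbox("invoice", ["mbox1", "mbox2", "mbox3"], "newmbox")` followed by `mutt -f newmbox` would mirror the two commands above.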
Machine analysis violates your privacy? And are you suggesting privacy exists when using other mail services, with plaintext messages traversing public networks?
I recently updated our exchange environment to SP1 which allows me to create a new database on different storage and assign an Archive mailbox for users. So now I got a terabyte volume on tier 2 Sata storage for folks to use as archive - now I can get those damn pst files finally off my file servers.
Or maybe because it suggests running a VM on a desktop/laptop just so he could archive his mail.
That's a piss-poor solution.
Here is my "non-deliberate" system for recovering emails, files, programs, website copies, correspondence, certification homework, photos and projects, from the last 12 years. All these components have been mentioned earlier except the find text in files command I show you below.
I have the computer I used for the last 11 years sitting turned off with two disk drives. This is a metal case computer isolated from electrical surges by a UPS. I presently have the archive computer set up with an IO Gear device that switches the keyboard, mouse and display to the archive computer if I need to turn it on. I like this better than accessing the archive computer through SSH or a terminal server or remote desktop viewer. The silly but important thing is do not let the archive computer connect to your mailserver and download current emails. I find simply copying entire directories to an 8 gig USB drive is easier than messing around with SCP (secure copy) and SSH (secure shell).
Those two disks go back about 12 years. The email systems I have used over the years have pretty well known names. The older email pattern is one big file containing every email with a blank line separating each email. For a Microsoft system, I use a USB stick as an archive device and I read it on a Linux box.
I can search as much of the disks as I need using the following script (taken from Unix Power Tools)
I store the script in a file called "findscriptfile" because I can't remember it, I just look it up when I need it. I wind up creating files with the same command on various places on my computers. Note the command line below requires blanks as shown. You will have to test and fiddle with the search. Sometimes you need to use "sudo " when file permission error messages cloud the results. The search time for an email address or a copy of a letter or a photo file is 20 minutes.
This find script finds all files starting at the current directory and working down and the script searches each file for the word "thumbnail".
find . -type f -exec grep thumbnail '{}' /dev/null \;
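Since the poster also wants platform independence, the same brute-force search can be sketched in Python, which runs unchanged on Windows, Mac and Linux. The function name `grep_tree` is hypothetical; it walks the tree the way `find . -type f` does and reports each matching line together with its file name, like `grep` with `/dev/null` appended.

```python
import os


def grep_tree(root, needle):
    """Yield (path, line) for every line under root that contains needle.

    Files are read as bytes so binary files and odd encodings do not abort
    the walk; unreadable files are skipped silently.
    """
    target = needle.encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets and the like
            try:
                with open(path, "rb") as handle:
                    for line in handle:
                        if target in line:
                            text = line.decode("utf-8", errors="replace")
                            yield path, text.rstrip("\r\n")
            except OSError:
                continue


if __name__ == "__main__":
    for found_path, found_line in grep_tree(".", "thumbnail"):
        print(found_path + ":" + found_line)
```

Like the shell version, this rereads every file on every search, so expect it to be just as slow on a large disk.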
I think one of the interesting things about the original problem posted to Ask Slashdot is what kind of information has enduring value and how much value does it have? Or another question might be what is the cost of storage per month and how many items on the storage system are worth more than the storage cost?
Try converting all the Email to a PDF package. I have been adding to a pdf package for several years directly from outlook and also include any or all important attachments. You can sort or search at anytime using names, times, subjects, attachments or just about any other query parameter that you can think of. You can also secure the package and if higher security is desirable you can also save the secured package in an office one note notebook with a password which makes the file a little harder to crack. You can add to the package/archive anytime you wish or create multiple packages by week, month, year etc. and keep them in a large package. I believe cute pdf is free and works fairly well. I've had acroB pro for several years and that works great for a paid program. AcroB pro attaches to your Email client upon install so at anytime you can archive any or all emails with 2 or 3 clicks. Any pdf reader can be used on any system to read the email from a flash drive or cd or any portable media. I think you should be able to read the files 100 years from now since I don't see pdf's going away any time soon
SMTP is the transport protocol.
IMAP and/or POP3 are STORAGE protocols.
I've been using MailSteward "Regular", which uses a built-in SQLite engine, and claims to be good for 100k+ records. The "Pro" version uses MySQL. The initial import and periodic update of the archive couldn't be simpler.
Luke, help me take this mask off
I have mails from 1995 onward, by now from roughly 15+ different accounts, most of them defunct. Except for a couple of months of 1996 and 2005, which I wistfully deleted, I converted them all with a small tool (Aid4mail, I think) from PST, Eudora, or Pegasus format into Thunderbird's UNIX-compatible format.
+ Open Source (Free + maintained + Supported)
+ Thunderbird searches and indexes just fine
+ Plain text format (I can use all sorts of editors on them if necessary)
+ Always on my HDD (Encryption, no public mining, no external servers needed)
+ UNIX-Format guarantees I can convert them into something completely different in 20 years, should the need arise
+ no additional software needed
Am I overlooking some of GP's requirements here? Or is the Slashdot crowd prone to a little overengineering? :)
Regards!
Invita Invidia
You are wrong.
POP3 is a transport protocol.
IMAP is a transport protocol.
You need to learn these things before you post.
I too archive all my emails. The solution I use is quite simple compared to others proposed here.
I have postfix deliver a copy of all mails to an archive directory with mailbox file per year as well as to my inbox (or other mailboxes depending on filtering rules). Each archival mailbox file is about 5GB compressed.
Filtering these mailboxes by header using mutt is a very fast operation and even doing in-body searches and views takes less time than it takes gzip to uncompress the files. Obviously they can easily be backed up in the usual ways.
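If the mail is already sitting in one big mbox rather than being delivered per year as it arrives, the same one-mailbox-per-year layout can be produced after the fact. The sketch below uses Python's standard `mailbox` module; the function name `split_by_year` and the `undated` bucket for messages with a missing or unparsable Date header are my own choices, not part of the setup described above.

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def split_by_year(source, out_dir):
    """Copy each message from the source mbox into out_dir/<year>.mbox.

    Returns a dict mapping year (or "undated") to the number of messages.
    """
    os.makedirs(out_dir, exist_ok=True)
    boxes, counts = {}, {}
    src = mailbox.mbox(source, create=False)
    try:
        for msg in src:
            try:
                year = str(parsedate_to_datetime(str(msg.get("Date", ""))).year)
            except (TypeError, ValueError):
                year = "undated"  # missing or malformed Date header
            if year not in boxes:
                boxes[year] = mailbox.mbox(os.path.join(out_dir, year + ".mbox"))
            boxes[year].add(msg)
            counts[year] = counts.get(year, 0) + 1
    finally:
        for box in boxes.values():
            box.flush()
            box.close()
        src.close()
    return counts
```

The resulting files can then be compressed individually, so a search that is known to concern one year only has to uncompress that year.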
The trouble I see in our OP's question (which I share), is somehow that most of the open source solutions will have a slow interface (compared to, say, OSX Spotlight).
I currently use Powermail on OSX (so, two closed solutions) because it handles almost 20 years of mail, gigabytes of it, and is still Spotlight-compatible (it shows results while you type the keyword).
The guys at Powermail are a small company that indeed started as the kings of indexing, long before Spotlight. To my knowledge they are the only email app on OSX that maintains Spotlight compatibility. But, they are "proprietary".
I think if Powermail is to die, I'll transfer all my archive to an IMAP server, the way it has been described various times above. This too may be tricky: not all email front-ends will handle 1 GB of IMAP transfer properly, nor will all IMAP servers. Do try before using. I tried with Powermail and the French postal service's free email: this did well, but that's presently the only pair I know of that indeed works for gigabytes.
Herve S.
Upload them to GMail. Or get Google Apps for your own domain (it's free) and use their GMail variant.
imapsync will take care of your other IMAP accounts, mutt/pine for uploading from Maildir/mailfile, and three dead chickens on a moonless night for PST.
I'm not sure if you tried it, but Windows 7 + Outlook 2010 does a pretty good job for me.
One note, however: keep your computer on for the first day, as indexing happens in a background process.
The first day I changed my mail system I couldn't search, but the next day it was all indexed. It makes sense: indexing is resource heavy, so they do it while you aren't using your system. Be aware, though, that Windows 7 should be configured not to enter sleep mode for this to happen!
If you suddenly change mail systems you have to take that into account, but if you had started from scratch you would not need to keep your PC on for a night.
By default Outlook 2010 has a 50GB PST size limit, but you can change that in the registry... (do you have more than 50GB???).
Also, to optimize the speed of browsing your email, create folders; don't put 50GB in one inbox,
because refreshing the view window reads every item's properties to display them.
Making some subfolders also improves organization, so you can find your stuff more easily.
I cannot stress enough how important it is to keep only relevant e-mails. I can't imagine that you actually need all those e-mails. Every year, I clean out my inbox and see if these e-mails would ever be of any future relevance.
Keeping the "Happy Birthday" e-mails from your co-workers may not really be worth it after you've left the company. But keeping the response to a denied application with reasoning why you weren't hired, however, is worth it.
A clean inbox can make your life much easier, you won't get caught in all the micro organizing you need for big bulks of e-mail.
Frankly, you can't beat something like a SQL database for those requirements.
I used to have this - a Filemaker Pro database I populated with mail via AppleScript. It would break a message into pieces and store the pieces in fields. But that was mid-90's, before e-mail got hard - there was a From, a To, a Body, etc. No quoted-printable, base-64, or multi-part MIME messages.
It's great to be able to search "From:" some wildcard, Date-range foo to bar, Subject with a boolean keyword expression, but it's also important to be able to re-construct the message for forwarding, replying, etc.
So... the CRUD is pretty straightforward, but what's the best way to represent it in SQL? The easy thing to do would be to load the message into an object with a canned library and then throw that at a SQL ORM, but somewhere down the line retrieving the data manually would also be useful.
A quick search didn't turn up a well-known schema, but certainly this problem has been solved. Being able to use a fast search (tsearch2, for instance) would be so great vs., say, Thunderbird's built-in search. Anybody have any pointers?
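Not a well-known schema, just one possible starting point: the sketch below keeps the searchable fields in ordinary columns and the untouched original message in a BLOB, so From/date-range/subject queries are plain SQL while forwarding or replying can still start from the exact original. It assumes SQLite (via Python's standard `sqlite3`) rather than PostgreSQL with tsearch2, uses `LIKE` instead of a real full-text index, and every table, column and function name in it is made up for illustration.

```python
import sqlite3
from datetime import timezone
from email import message_from_bytes, policy

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id          INTEGER PRIMARY KEY,
    message_id  TEXT,
    sender      TEXT NOT NULL,
    recipients  TEXT NOT NULL,
    subject     TEXT NOT NULL,
    sent_at     TEXT,            -- ISO 8601 in UTC, NULL if unparsable
    body        TEXT NOT NULL,   -- decoded text of the main body part
    raw         BLOB NOT NULL    -- the original message, byte for byte
);
CREATE INDEX IF NOT EXISTS idx_messages_sent_at ON messages(sent_at);
"""


def open_archive(path):
    """Open (and if needed create) the archive database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_message(conn, raw):
    """Parse one raw RFC 822 message (bytes) and insert it."""
    msg = message_from_bytes(raw, policy=policy.default)
    try:
        header = msg["Date"]
        when = header.datetime if header is not None else None
    except (TypeError, ValueError):
        when = None  # malformed Date header
    if when is not None:
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        when = when.astimezone(timezone.utc).isoformat()
    part = msg.get_body(preferencelist=("plain", "html"))
    try:
        body = part.get_content() if part is not None else ""
    except LookupError:
        body = ""  # unknown charset or content type
    conn.execute(
        "INSERT INTO messages (message_id, sender, recipients, subject,"
        " sent_at, body, raw) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (str(msg["Message-ID"] or ""), str(msg["From"] or ""),
         str(msg["To"] or ""), str(msg["Subject"] or ""), when, body, raw),
    )
    conn.commit()


def find_messages(conn, sender="", subject="", since=None, until=None):
    """Return (id, subject, sent_at) rows matching the given filters."""
    sql = ("SELECT id, subject, sent_at FROM messages"
           " WHERE sender LIKE ? AND subject LIKE ?")
    args = ["%" + sender + "%", "%" + subject + "%"]
    if since is not None:
        sql += " AND sent_at >= ?"
        args.append(since)
    if until is not None:
        sql += " AND sent_at < ?"
        args.append(until)
    return conn.execute(sql + " ORDER BY sent_at", args).fetchall()
```

Keeping `raw` alongside the parsed columns is the design choice that matters: the parsed fields can be rebuilt or extended later (attachments table, full-text index) without ever touching the source of truth.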
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
People like you have a serious problem, there is no reason to HOARD your email. Keeping email from the 1990's, come on, the delete button was made for a purpose, TO DELETE!!!!!!!!!!!!!!!
But if you need to keep your emails and meet all of the criteria you listed, the best option would be to store your emails in a MySQL database; you can run MySQL on Linux/Unix and Windows. Easier said than done, though: you would have to construct all of the tables and a way to import the emails, or do a quick search on sourceforge.net and try one of those solutions:
http://sourceforge.net/search/?type_of_search=soft&words=email+archive
Just store each message in a file. Each "mail folder" is a directory. Gnus calls this arrangement "nnml", and Courier calls it "Maildir". I don't know what other software calls it, but it's difficult to imagine any reasonably-capable software not supporting it, because it's so obvious and straightforward.
There are several advantages to this arrangement. The big two are 1) it's hard to beat for compatibility and 2) for searching and indexing and stuff you can use standard utilities such as are made to operate on any kind of (text) file. (As for threading, your mail reader should be able to handle that. Once you find the file with one of the messages you're after you know its subject and message ID and stuff, so finding the thread in the mail reader is easy.)
The one disadvantage is that you have to choose whether to put it all on a FAT filesystem (for maximum operating system compatibility) and suffer the performance disadvantages thereof (which are considerable when an individual folder contains many thousands of files; not as bad as with IMAP, but still very noticeable). Of course, moving/copying from one filesystem to another is only as problematic as copying any other kind of files around, so if you decide to use NTFS today (which has reasonable read/write support in Windows and Linux) and later decide to use an OS that doesn't have read/write NTFS support, you can just copy the files over to UFS or whatever at that time. Boot an OS that has both filesystems (Knoppix, for instance), cp -r --preserve=all blah blah blah, and leave it running while you go to work or something.
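The "once you know the message ID, finding the thread is easy" step can also be done without a mail reader. Here is a rough sketch over a Maildir using Python's standard `mailbox` module: it links messages through their In-Reply-To header (falling back to the last entry in References) and returns every Message-ID in the same thread. The function names are hypothetical, and real threading algorithms handle more corner cases, such as subject-based grouping when the headers are missing.

```python
import mailbox
import re
from collections import defaultdict

MSGID = re.compile(r"<[^<>\s]+>")


def _ids(value):
    """Extract all <message-id> tokens from a header value (or None)."""
    return MSGID.findall(str(value or ""))


def load_links(maildir_path):
    """Return (parent_of, children_of) maps keyed by Message-ID."""
    parent_of = {}
    children_of = defaultdict(list)
    box = mailbox.Maildir(maildir_path, create=False)
    for msg in box:
        own = _ids(msg["Message-ID"])
        if not own:
            continue
        refs = _ids(msg["In-Reply-To"]) or _ids(msg["References"])[-1:]
        if refs:
            parent_of[own[0]] = refs[0]
            children_of[refs[0]].append(own[0])
    return parent_of, children_of


def thread_of(maildir_path, message_id):
    """Return the Message-IDs of the thread containing message_id, root first."""
    parent_of, children_of = load_links(maildir_path)
    root, seen = message_id, {message_id}
    while root in parent_of and parent_of[root] not in seen:
        root = parent_of[root]  # climb towards the start of the thread
        seen.add(root)
    ordered, queue, visited = [], [root], {root}
    while queue:
        current = queue.pop(0)
        ordered.append(current)
        for child in children_of.get(current, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return ordered
```

Note that the root returned may be a message that is referenced but not actually present in the archive, which is common when only replies were kept.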
Cut that out, or I will ship you to Norilsk in a box.
Thank you for your insight. We already knew that is a common problem. What the poster wants (and me as well) is concrete solutions. But you
I used to use mhonarc to create HTML navigable archives of my email. http://mhonarc.org/
I did this for both the Mac running apple mail and for the wintel running Outlook.
The challenge is sometimes to get the foreign email format into one of mhonarc's recognized formats.
Then I would create archives that were no larger than a CD-R or DVD-R and archive them on those. Easy to mount and search.
Sup (http://sup.rubyforge.org/) is Gmail-like in that it uses tags instead of folders, and it automatically indexes all the email using Xapian in the background, making it a small home Google for your email.
You have 2 delete buttons for a REASON.
cheery thought. don't forget that when you die, there's more to read for whichever relative/partner inherits your PC.
Back in the day, ZOE was exactly what you're looking for. It's an open source, cross-platform, turn-key solution (Simple Server is built-in) that is designed to archive, index and search your email (using the Apache Lucene search engine). Jon Udell has a good article on O'Reilly that includes some screen shots.
ZOE meets all of your requirements, though data import is a bit of a problem. There are several different strategies for data import, so one of them may meet your requirements.
Unfortunately, ZOE is abandonware so it's not for the faint of heart. The original author was on the bleeding edge and tended to make 'interesting' technology choices like Tapestry for the framework, and using his own, home-grown build system and a Creative Commons license that isn't usually used for software. He eventually abandoned Java development for Lua and let the registration for the home page lapse. As a result, it's difficult to recommend this for all but the most determined, high functioning users.
Signatures are a waste of bandwi (buffering...)
Yes, me too. Been using dovecot and Maildir files for years now. Before that I used different open source IMAP servers (courier, cyrus, and UW imap), but since I used the Maildir file format the transition was automatic (I used mbox format before that with the UW imap server, and conversion was really simple using the mb2maildir perl script). I have used IMAP servers etc. for 18 years' worth of email. I organise the 250,000 emails into different folders for each year as that makes searching much quicker. It's never let me down yet.
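For readers who don't have that Perl script around, an mbox-to-Maildir conversion can be sketched with Python's standard `mailbox` module. This is not the script mentioned above, just an illustration of the same idea; the function name `mbox_to_maildir` is made up. Wrapping each message in `mailbox.MaildirMessage` carries over the read/replied/flagged status where the mbox recorded it.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir directory.

    The Maildir is created if it does not exist. Returns the number of
    messages converted. The source mbox is left untouched.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)
    converted = 0
    try:
        for msg in src:
            # MaildirMessage(mboxMessage) translates mbox status flags
            # into the Maildir equivalents.
            dest.add(mailbox.MaildirMessage(msg))
            converted += 1
    finally:
        src.close()
    return converted
```

Running it once per yearly mbox, each into its own Maildir folder, would reproduce the folder-per-year layout described above.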
I know it's a lot to ask these days to get people to read the comments that they are replying to,
Oblig.: You must be new here.
Cool! Amazing Toys.
...That you get off the computer and get a life?
Do you know what IMAP is? He's going to need some central storage, but the mail access is platform independent. Yes, if he wants the IMAP server to be his own then he'll have to pick one OS to serve from, but every non-shit mail application, from desktop to mobile, has IMAP support, and IMAP hands down gives him what he wants if he takes the time to organize and set it all up.
You could try MailStore. I've been using it for a while and for me it really does its job.
It's very easy - just use Free Edition of MailStore: MailStore Home - visit http://www.mailstore.com
I have the exact requirements as you, so I spent the last six months developing a
solution. It converts SentBoxes, Inboxes, gmail, PST files and regular mbox.
It archives and indexes everything and provides full text search with google-like
phrase grouping and exclude phrases.
It normalizes addresses, eliminates duplicates, understands every character set and
can display any email within its web GUI with proper inlining of pics-in-html.
For me it can index 8 gigs of emails within a couple of hours.
We are pilot testing this solution at an ISP for our customers.
Would you like to try it out?
My email http://2038bug.com/email.gif
-paul
Is there a tool to download them again once you have finished uploading them? I might lose the originals and only have the copies on Gmail, and I am ignorant of the possibilities.
you can use imap to upload existing emails from a local email client such as outlook or thunderbird.
You can xfer most other webmail emails to gmail, they have a method for it.
Then, after that, gmail has functionality for imap or pop to get emails back off if you wish (either a copy or permanent).
I've transitioned more than one small business to gmail.
Googling various questions regarding this will reveal several good walkthroughs....
Flappinbooger isn't my real name
Email is not a secure format and never has been. If you have anything you don't want to be public knowledge, don't use email, or encrypt it. This has been true since SMTP was invented. It's simply not secure. Everyone using email should know this.
If you commit no sin, you need no backups. Your slate is empty and clean. Emails, R.I.P. ~rohit.
Namaste.
Imbecile. Outside of the USA, the majority of email addresses end in a country-specific suffix.
Imbecile. Relying on a domain tld to gather demographics.
You'll have that sometimes...
Well, you can of course select all in Outlook Express, create a folder, name it as you like, and then drag and drop whenever you want. It's that easy. It takes about 2 to 3 minutes for 3 GB of e-mails, and the good part is that if names collide during the transfer, it automatically appends a number, like joe (1), (2) ... and so on.
"I have kept every email I have ever sent or received since 1990"
Instead, I keep every email I sent.
And I make it a habit of acknowledging all emails I deem "important", quoting the full body of the original message in my reply.
My, what a prick you are. No wife or kids to beat up?
Most of my emails are able to be viewed on one screen. I can take screen prints, paste into paint, save as a file. In cases where I need access to editable text, I open the em, mouse-select the text and then paste it into word pad. It takes opening each one, but it's still faster than trying to fwd them somewhere. Best is, I can do it all locally without depending on anyone else, isp, em program etc. Now that i/you know you have this problem with the archives, begin today with the new incoming ems and don't get behind, tho a little behind is ok in my book.
So does mine, but it's kind of quiet, because the alarm clock part is just bells attached to the CD tray and a cron script, and it's only been useful once when the server was located in my friend's bedroom. I think the video I've linked to was shot at that time, when I knew he was oversleeping and I ran the script remotely.
Deus est fatalis