Best Way To Archive Emails For Later Searching?
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails, and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
Alphabetically!
Great minds think alike; fools seldom differ.
An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.
Time to delete them all
I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...
You are a hostile lawyer's fantasy come true. ;-)
See subject.
This is slashdot. We save computers older than your dad just to use them as alarm clocks. Please leave.
gmail
It isn't particularly platform independent (because no one is paying much attention to Windows), but Not Much offers threads and full text search:
http://notmuchmail.org/
Nerd rage is the funniest rage.
Print then scan
rewriting history since 2109
While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
MailSteward on the Mac.
SQL database. Good, Inexpensive, works w/many tens of thousands of emails & more.
http://mailsteward.com/
If you want an "email format" why not mbox? Many things currently support that as an import option.
If you want a database, why not SQLite? It's about as open as can be, backwards compatibility is almost a religion and should have no problem with hundreds of thousands of entries.
I use mbox format files and grep.
IMO, one can't get much more portable than that.
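For anyone who would rather script it than type grep, here is a minimal Python sketch of the same idea. It assumes the standard-library mailbox module; the function name search_mbox is made up for illustration. Like grep, it only matches the raw stored text, so base64 or quoted-printable bodies will not match.

```python
import mailbox


def search_mbox(path, needle):
    """Yield (key, subject) for messages whose raw text contains `needle`.

    Hypothetical helper: a case-insensitive, grep-like scan of one mbox file.
    """
    needle = needle.lower().encode("utf-8")
    box = mailbox.mbox(path, create=False)
    try:
        for key in box.keys():
            # get_bytes() returns the message as stored (headers + body)
            raw = box.get_bytes(key)
            if needle in raw.lower():
                subject = box.get_message(key).get("Subject", "")
                yield key, str(subject)
    finally:
        box.close()
```

Something like list(search_mbox("archive.mbox", "recipe")) would then give the keys and subjects of the matching messages, where archive.mbox is a placeholder file name.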
I can advise a Linux server with Courier-imap. It's easy to centrally store your mail, and as long as it's on the internet you can reach it. Even from work, with friends, or on vacation.
It's not really fast in my experience, but not terribly slow.
And you can save things in Maildir format, which is universally supported. And it's easy to backup with some scripts.
Well, don't worry about that. We can get you back before you leave. (Dr. Who)
Maildir.
And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread
I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
You, sir, are a jerk! I suspect you have low self-esteem with some component of hemorrhoids that is making you have this fixation on being rude.
If this is really important to you, and you want it all to work across multiple workstations/OSes, your best bet will be to store it all in IMAP. If you have the means and motivation to run this yourself, I would recommend Dovecot. If you don't have the means and motivation, then you can use a service like Gmail to run your IMAP although you give up certain freedoms in doing so. For example, I use Dovecot coupled with Maildir++ as the physical storage format - as a result I can (if I wanted to) change to any email client I wish very quickly, use different email clients at the same time, etc.
You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal, is either being a troll, or not understanding how the world works when you're not just a drone.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
Maildir storage format is resistant to bit-rot because it stores each message in a separate file, and uses filesystem directories for mail folders. It's widely supported by user agents (mail readers) and IMAP/POP3/SMTP servers, so you'll never be stranded by the actions of a single software vendor. Finally, it's easily searched using everyday unix tools - find, grep, sed, awk, etc., and you can use the full-text search engine of your choice for speedy searches.
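Because every message is its own file, a few lines of script can do what find and grep do. The following is only a sketch, assuming Python's standard email package and a Maildir folder with the usual cur/ and new/ subdirectories; maildir_headers and find_by_sender are hypothetical names, not part of any mail tool.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def maildir_headers(root):
    """Yield (path, headers) for every message file in a Maildir folder."""
    parser = BytesParser(policy=policy.default)
    for sub in ("cur", "new"):
        for f in sorted(Path(root, sub).glob("*")):
            if f.is_file():
                with f.open("rb") as fh:
                    # headersonly=True skips parsing the body structure
                    yield f, parser.parse(fh, headersonly=True)


def find_by_sender(root, fragment):
    """Return (filename, subject) for messages whose From contains `fragment`."""
    fragment = fragment.lower()
    return [
        (path.name, str(headers.get("Subject", "")))
        for path, headers in maildir_headers(root)
        if fragment in str(headers.get("From", "")).lower()
    ]
```

This walks every file on each call, so for hundreds of thousands of messages a real full-text indexer is still the better choice.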
I would use a computer older than your dad just to use as an alarm clock, but I just can't help upgrading.
I never thought of turning an ancient host into an alarm clock.
Once however, I did hollow out an SGI case and turn it into a refrigerator.
The case was just too damned pretty to throw away.
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
citadel at www.citadel.org is a full pop3/imap server with full-text indexing.
Thunderbird can use server-side searches to find messages, and I find that works pretty well.
blog.sam.liddicott.com
Have you looked at Archiveopteryx? That is one potential solution to the storage side of the problem. It stores the messages into a PostgreSQL database with minimal tinkering, so you can always get the original plain text stuff back out again. Consider it a database of mbox files that exposes an IMAP interface. You can't get any less proprietary than Postgres, and you can scale up many of its operations using standard database approaches in that area.
What I would do here is store messages there as my permanent store for them, dump periodically to full plain-text backups just for disaster recovery, then experiment with search software that runs on top of it using IMAP as the transport. There I don't have any specific advice. Ultimately it should be possible to extend Archiveopteryx to handle that too--PostgreSQL has decent full-text search built in--but I don't know of anybody working on that.
Probably easier to break this into two pieces, get a robust solution for the storage side, and then see what clients have search capabilities you like that won't choke on importing your data.
Migrate all to gmail. With gmail you've got room for your couple of GB. And the search feature works like a charm. The only thing missing is "folders" to make it act like you are used to.
Although the searching features in GMail are great, I find the interface with a single unified sequence of mail, and lack of folders (the tagging feature is far too clunky) to be a major impediment. The biggest issue though, is that I do not own a copy of the information on my own server.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
And now the poster becomes an advertiser's dream come true in addition to being a hostile lawyer's dream come true. ;-)
Remember that from Google's perspective gmail is a tool to better profile you for targeted advertising. Make sure you are OK with that before giving them access to all your emails.
Hate to break it here; but since 1990 I've been storing *all* my mail (and calendar and SMSes) in a plain old Outlook PST archive file. It is a fairly good and flexible database format with lots of import / export and search options. Future compatibility is well guaranteed. To keep it snappy, I've been systematically removing big attachments (documents and pictures), possibly replacing them with a textual reference to where they are stored elsewhere on disk. I know, I know, low tech and the Borg, but future proof for now :-).
You can laugh, but it's almost good enough for what I need.
All my archived email (93-2004) was copied to a NAS as individual messages (still have the Cyrus directory structure). It's the more recent stuff that lives in PSTs that is the problem.
One day I'll get around to doing the same for my news postings. That's where the nuggets of interest are.
I'll chime in with my own solution. My archive is not as extensive as yours, but I have most everything from 2005 or so (excepting mailing lists, other junk, etc.). My solution is sort of silly: I just use Apple's Mail.app. The reason is that Mail.app lets you store and organize everything in separate folders, and Spotlight is blazingly fast and does a great job of searching. I try to keep the number of messages in a folder on the order of a few thousand; for my e-mail load I find that breaking up the folders by year works well (yes, you can still search across years). The folders themselves are stored under ~/Library/Mail/Mailboxes. Each folder has its own directory and a series of .emlx files, an Apple-specific format that holds one message per file. The problem with this solution is that the emlx files are proprietary and subject to change. That said, I have successfully managed to copy mailboxes to new computers with a new OS. It did require an extra step or two beyond just copying my Mailboxes directories to the new computer, however. Worst case, though, the emlx files are plain text, so you can grep through them if you really have to (e.g. if you're logged onto the computer remotely), or you could write a script that parses most of the information from the file.
Gentlemen! You can't fight in here, this is the war room!
Na, he's probably a lawyer.
That's right, I'm looking at you Mr. "I've got a 22GB mailbox on the new Exchange 2007 system". Quotas, learn em, love em, use em!
Life is not for the lazy.
In your will, donate your archive to science. I'm sure it would make an interesting thesis project for some PhD candidates out there. I'm serious, consider this.
There's one method I've used fairly often in the past for getting mail out of an older client, provided the older client supports IMAP (lookout and lookout express do).
First, set up a new account on your IMAP server just for archival purposes (you can set up an IMAP server on any UNIX/Linux distro and even Windows with Cygwin fairly easily; dovecot is a good place to start). Make sure it's using either mbox or maildir (preferred).
Second, set up said account on all the mail clients you'd like to archive. Make sure you are setting them up as IMAP and not POP3.
Third, drag the contents of each local folder/inbox/etc. to a folder on the archive-specific IMAP account. It will take a while, but the entire contents of your mailbox will be copied over, message by message, in IMAP's way of doing things, then deposited by the IMAP server into the local format of your choice.
You've just created flat text versions of client-specific archives. Create folders, subfolders, etc. and organize things in your modern client, which can easily do IMAP. You can easily search with any of numerous free packages, archive and compress permanently with squashfs, or even just leave them available through IMAP to search with the new Thunderbird's (3.1) global indexer.
Brielle
You should put all that stuff on an IMAP server on your home network (preferably a box you can reach from the outside using DDNS or a static entry if you have your own domain).
In that way your client OS'es can be whatever platform you choose, and they will all be able to access your mail storage.
Put older mails in separate folders.
If you can work with Linux there are plenty of choices. If not, consider Windows Home Server and get a mailserver product for Windows - there are plenty!
Many advanced email clients, such as Outlook or Evolution, will allow you to search for mails based on any criteria you like (subject, sender, body, date, etc). Hmmm except perhaps the actual mail header ;-)
Personally i would never do this though. Generating and saving data is easy - limiting it is hard. Consider deleting stuff - you could start by deleting everything older than 36 months. The more you have to search through the more difficult it gets. In the end finding a single mail will be (or in your case: IS) like a needle in a haystack ...
Also, why save all mails? Every time you reply to a mail a copy of the original mail is often included in your answer. So from today, consider deleting All inbound mails that you reply to ;-)
- Jesper
My security clearance is so high I have to kill myself if I remember I have it...
While this answer will almost certainly not suit the OP, it may be of interest to other folk looking to archive their email. Using python and a combination of imaplib and some basic file I/O you can save the original text of messages. My rationale for this was firstly that it's probably less problematic than converting between various email client formats; and secondly that it's a decent way to learn some python! ;)
My rather basic implementation just dumps every email from an (IMAP) folder sequentially. I rely on grep for searching. However, it does have the prerequisite of the email being stored on a mailserver accessible via IMAP.
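For other folk who want to try the same thing, a dump along these lines might look like the sketch below. It is not the poster's actual script, just a minimal version using the standard imaplib module; dump_folder, connect and the numbered .eml file naming are assumptions made for illustration.

```python
import imaplib
import os


def connect(host, user, password):
    """Open an IMAP-over-SSL session (host and credentials are placeholders)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def dump_folder(conn, folder, out_dir):
    """Save every message in `folder` as a numbered .eml file; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    status, _ = conn.select(folder, readonly=True)  # read-only: flags untouched
    if status != "OK":
        raise RuntimeError(f"cannot select {folder!r}")
    status, data = conn.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed")
    saved = 0
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        # fetch() returns a list mixing (envelope, raw_bytes) tuples
        # with plain b")" terminators; only the tuples carry the message
        for part in parts:
            if isinstance(part, tuple):
                name = os.path.join(out_dir, f"{int(num):06d}.eml")
                with open(name, "wb") as fh:
                    fh.write(part[1])
                saved += 1
    return saved
```

A call such as dump_folder(connect("imap.example.com", "user", "secret"), "INBOX", "dump") would then write the original message text into a directory you can grep; the host name and credentials here are placeholders.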
If all you have is a grenade, pretty soon every problem looks like a foxhole -- MightyYar
Scary thought, but you might just want to pick up one of the tools that the lawyers use for electronic discovery. They cover multiple mail formats (including older generations of said formats) and set it up so that it's easy for an intern to search for keywords and the like, so someone that understands tech should be able to use it. I've had to use the Clearwell appliance and it did what it was supposed to do, including finding attachments and indexing them for ease of search. (No, I don't work for Clearwell, and wouldn't have used their tool at all except for t.. er anyways)
To help spare you the precious keystrokes it would take to Google this yourself, you can go straight to “Google Apps for Businesses” and sign-up. Now did you really have to Ask Slashdot?
Starting with GMail I have kept every e-mail since 6/22/2004. I also brought over many e-mails I had in my saved folders from long before that. Am I insane? No. I have found this archive incredibly useful for any variety of uses even 6 years later.
Nothing like having your wife ask, "man, I wish we still had the recipe for deviled eggs we made in college. Too bad it was back in 2001." "No problem honey, hold."
Pulled that out a couple weeks ago for a picnic. Yum yum!! was right.
I recommend mbox (MBX) format.
1. The format is text based and not likely to become unreadable anytime in the foreseeable future.
2. There is no shortage of tools for manipulating mbox.
3. It's easily indexed by full text search applications (MS Search included with Windows)
The Outlook tools save dialogue has an Apple export option which is actually the mbox format.
In terms of archival access I recommend an IMAP server with a folder hierarchy based on month/year. Your mail client should be configured to leave the messages on the server (not attempt to download via IMAP). This somewhat future proofs migration to different mail clients.
The only issue is that IMAP searches are out of the question, so you will need to do searches offline with a full text indexing/search application to first find the general folder location of the message you are seeking.
If your computer has lots of memory, then why not just use grep and write a small shell script to forward the message from the archival file to your inbox so that formatting etc. is preserved? If you're doing lots of searches, the disk cache will keep most of it in RAM even if it's a few GB.
Gmail does not have folders but it does have tags. Tags can be used like folders but are more flexible since you can have more than one tag on a message. However, I have found that gmail's searching is so good that I don't even need to use the tags. Everything just goes into the "Archive" and the gmail search always finds what I want... quickly and easily.
I don't read your sig. Why are you reading mine?
I did this myself, going back only 10 years though. It has been invaluable. Gmail gives you 7GB (with a little more every day), and the searching is top notch and instant.
There are several apps out there to import mail into a gmail account, and it is pretty easy if your email is still available via POP or IMAP (which I'm doubting)... for stuff in a pst file, what I ended up doing was adding the new gmail account into Outlook, and then dragging and dropping emails 1000 at a time into the new account. (I also did this for a Groupwise mailbox from one old job.) It's slow, but it works. In addition, it tags the mail for you with "Inbox" or "Sent", so you can easily retag it later. Once it is in there, it is a little gold mine to get whatever you need.
I was hoping to read some answer that answered my similar requirements. My requirements were for a searchable, portable mail message database. Ability to tag messages is also important. I had high hopes for Mozilla Raindrop, but my last experience with it didn't do anything for me. Here's what I am doing now: I have set up an IMAP server (imapd) on an Ubuntu server. Thunderbird is currently my primary email client. Thunderbird connects to all my various email accounts. When I am ready to archive an email, it gets copied to a folder on my IMAP server. The emails are tagged, and stored in folders by quarter to keep any particular file from getting too large. What I would like is the ability to store them in a searchable database with an open source implementation.
I have been archiving my mails for the past 10 years. My method has been to download the mails in mbox format once a year and use a combination of mairix to search through the mails and either mutt or thunderbird to see the actual mails.
would now like to merge all the emails back into a single searchable archive — just because I can. But there are a few problems:
...so you can't?
At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (approximately 1K). We set up an Ubuntu server with postfix and dovecot IMAP over SSL, using Maildir.
Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:
INBOX
|- YYYY
   |- MM
      |- DD
The auditors use Evolution to connect to the archive server and search the emails, even though it takes a little while to load a day of emails for the first time, once it's properly loaded searching is really fast. The server is not that powerful, it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage though.
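A rough sketch of how a delivery script could work out the folder for a given day under the layout above. The function name archive_folder and the "/" separator are assumptions; the actual hierarchy separator depends on how the IMAP server is configured.

```python
from datetime import date


def archive_folder(day, root="INBOX", sep="/"):
    """Return the per-day archive folder name, e.g. INBOX/2010/07/04.

    `sep` is the IMAP hierarchy separator; "/" is only an assumed default.
    """
    return sep.join([root, f"{day:%Y}", f"{day:%m}", f"{day:%d}"])
```

With zero-padded months and days, the folders sort chronologically in any client that lists them alphabetically.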
Hope this helps.
--Necesito una chela, bien fria...
I still use Eudora... 7.1.09 paid mode from years ago... I use XP for my wife's computer and have different Eudora folders based on who is logged in. Works like a champ. The nice thing is I can sort the old emails by sender (for listservs and such) to be put into folders, and then use the find email function to search things. I hardly ever have problems finding an email as long as I know WHO/WHAT I'm looking for and where - Body, from, subject, etc.. Sadly, No meta tags.. :(
BTW, Mine goes back to.. early 90's also when @ college we used Eudora on Floppies with Windows 3.1 I think... Maybe it was 95 seems so long ago...
--- Relax, that mass muderer is just trying to reduce our carbon footprint, one fetus at a time...
The many comments here about using just IMAP with maildir or mbox storage backends forget to mention that these are all very slow to search when you have thousands of messages. They don't store the files in any kind of disk-seek-friendly format. So..
I suggest either putting a dovecot with maildir++ system on a fast SSD to overcome the poorly organized (on disk) files
-and/or-
using a mysql/postgresql backend on dovecot or courier or your favorite IMAP server that supports *sql. The mail would be stored with each detail in a different column in the table. Then you can index the sender, recipient, subject etc. You will need to either have a mail client that can use IMAP search so you can get the search to happen on the db side, or you could put together a php interface to search the database directly for the messages you are looking for.
IMAP isn't going away in the next decade, and neither is mysql or postgresql or the SQL language in general. Worst case would be to migrate the mail table to a new db, which could be done with a db dump, fairly trivially.
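To make the "each detail in a different column" idea concrete, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for mysql/postgresql, and the table layout, column names and helper functions are all invented for illustration rather than taken from any IMAP server's schema.

```python
import sqlite3
from email import message_from_bytes, policy
from email.utils import parseaddr

# Hypothetical schema: one row per message, raw text kept alongside the columns
SCHEMA = """
CREATE TABLE IF NOT EXISTS mail (
    id        INTEGER PRIMARY KEY,
    sender    TEXT,
    recipient TEXT,
    subject   TEXT,
    sent      TEXT,
    raw       BLOB NOT NULL
);
CREATE INDEX IF NOT EXISTS mail_sender  ON mail (sender);
CREATE INDEX IF NOT EXISTS mail_subject ON mail (subject);
"""


def open_archive(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def store(db, raw):
    """Parse one raw message and insert it with its searchable columns."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Store bare, lower-cased addresses so equality lookups can use the index
    sender = parseaddr(str(msg.get("From", "")))[1].lower()
    recipient = parseaddr(str(msg.get("To", "")))[1].lower()
    db.execute(
        "INSERT INTO mail (sender, recipient, subject, sent, raw) "
        "VALUES (?, ?, ?, ?, ?)",
        (sender, recipient, str(msg.get("Subject", "")),
         str(msg.get("Date", "")), raw),
    )
    db.commit()


def by_sender(db, address):
    """Return subjects of all messages from `address`, oldest insert first."""
    rows = db.execute(
        "SELECT subject FROM mail WHERE sender = ? ORDER BY id",
        (address.lower(),),
    )
    return [row[0] for row in rows]
```

Keeping the untouched raw message in its own column means the original text can always be dumped back out, whatever happens to the parsed columns.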
PSTs are hard-coded to tank, depending on the version of Outlook used. Right now with Outlook 2007 it's 20GB. Nobody NEEDS that much mail, but as an archive it's possible. Maybe a CMS server like Knowledgetree? Provided that it can parse the mail passed into it, it's a great open-source project that seems to have great staying power and development. I'll be testing that myself this week using mail messages that currently reside in Thunderbird.
One of the 187.
Red or blue one?
RIP America
July 4, 1776 - September 11, 2001
Parent is +informative and/or +interesting, not troll. Fucking brain dead moderators these days. Sheesh.
it suggested a linux solution and made the windows weenies realize how useless their os is. by extension they realized how tiny their penises are and then they finally understood why they like Micro Soft because it describes them perfectly. so they got mad and said "i'll mod it down, yeah, that'll teach them a lesson and make me feel like a real man again!"
Kmail has an excellent .pst converter that will pull out your old Outlook mail. Once you have it in Kmail, you can drag and drop it into any of the supported formats, mbox, mdir etc. If you have already established filters, you can let them sort things out. If not you can use a manual search for to, from, mail list, subject, etc. From there you can run your imap.
I carry everything around on my laptop and use kmail instead of using imap. With full drive encryption and xscreensaver, I don't have any worry about losing private information and know that my ISPs have better collections of my email anyway, despite what they say about size limits. I could use Gmail's imap instead of my own but prefer to suck my gmail out with kmail's imap support. Until US networks get more reasonable, I want my mail with me instead of on my own server and I would not advise anyone to leave their mail on someone else's server without having a copy yourself.
Because your question is all about search, I have to plug Kmail again. With proper organization of your mail into subfolders for friends, family, lists, companies and projects, mail searches are quick, even on modest hardware like my ancient PIII laptop. Searching everything takes a little longer, but it is not such a burden. Evolution may do as well but something about Gnome turns me off. The only downside is that the 3.5 branch does not seem to be able to search through encrypted mail but I imagine there's some gpg-agent fix for that I'm not aware of.
Friends don't help friends install M$ junk.
I migrated all my old personal emails to gmail using IMAP. You can use this to migrate between different on-disk formats like maildir, mbox and pst. I had all my email in yahoo and pulled it down using POP to a maildir, then used an IMAP mail client to copy it across to gmail. Then I regularly back them up from gmail to an on-disk maildir format using mbsync. I picked maildir because it's open and seemed better designed than the alternative, mbox. It's not completely standardized though. I've seen PSTs become corrupt so I try and stay away.
stay frosty and alert
There's a commercial, but low cost, package that I've used to do exactly what you are describing: http://www.aid4mail.com/
Aid4Mail converts email to and from a variety of mail formats. The feature that you might find useful is that it will create a zip archive that contains standard .msg format email messages. Use that in combination with an indexing programme. I use X1 (http://x1.com/), but there are lots of indexing programmes that will index zip archives for easy searching.
What do they say?
June 2001 - "Dave, can't go out tonight. I got a date with that fat chick.YEAH!"
Sept 2001 - "Dave, she's told me she's pregnant."
Jan 2002 - "Dave, will you be the best man at the wedding?"
Shhhh - Dave's the real father (AC doesn't know)..
Perhaps the best route would be to use MySQL or some other FOSS database and build a web front end for browsing, searching, etc.
Lots of people have been suggesting gmail, and that's great for some. There are some significant limitations/constraints, though.
1) I use the common "business identifier@vanitydomain.com" trick to help identify who is selling my e-mail address. Gmail has plus-addressing, which works reasonably well, however it is imperfect. Some spammers know about plus-addressing, and strip the plus.
Google Apps for Domains would work, except that you're pretty limited in the number of addresses you can use without paying exorbitant (for these purposes) fees.
2) Forwarding mail to Google destroys valuable header information. Redirecting mail can cause it to get blocked by the spam filter (sometimes so badly that it doesn't even make it into your spam folder.) So even keeping your own mail server and just bouncing everything up there isn't a viable solution.
3) Having Google pop mail from your server is probably the most workable technical solution, but then Google has your password. Also, there are size limitations, in case you happen to have large attachments that you need to preserve.
The OP may not have any of these issues, in which case Gmail is a great choice. Unfortunately, I'm looking for the same thing (searchability) and Gmail won't work for me.
However, mairix works reasonably well.
http://home.planet.nl/~mourits/koelkast/ Is that you?
Don't fight for your country, if your country does not fight for you.
http://xena.sourceforge.net/
Great free Java software for automatically normalizing and archiving mail (and other documents), developed by the Australian Government.
Google Apps for your domain offers a bulk-import feature from Outlook and other clients.
:)
Gmail offers all that you wish for. Take the free premium trial for GApps, bulk import, then cancel. Problem solved?
It could be that the only purpose of your life is to serve as a warning to others.
I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.
Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.
It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...
DO NOT DELETE YOUR ARCHIVES, EVER!***
*** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.
http://slashdot.org/~GuyFawkes/journal
As many above have mentioned part of this, I just wanted to put some of it together:
- setup a small server with a file system with checksums - ohh, that probably just leaves zfs
- setup dovecot on the server with maildirs
- setup clients to use imap to put messages on the server, if you have any existing imap-accounts, use mbsync directly on the server
- setup thunderbird as a client to index it all in thunderbirds own index-files, so you can search it directly from thunderbird
- use xapian or something similar to index your maildirs on the server so you can search it on the commandline when you need to
- use rsync to copy the whole bunch offsite to somewhere that you trust or use duplicity to copy it somewhere you don't trust
New things are always on the horizon
have mail going back to 1991 archived as mbox files. Some of it is pretty disorganized, but since 2000 I've organized mail into Sent-Archived and Received-Archived directories with the mbox files named YYYY-MM.
It's a pain to search. But on the other hand, I hardly ever need to search the really old stuff, so grep and friends are good enough.
I may eventually split it out into maildir format and use a full-text indexing engine such as Xapian to make searching easier. But I'll probably keep the master mbox archive; the format is incredibly simple and it's easy to munge into other formats as necessary.
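Splitting an mbox out into maildir does not need much. A minimal sketch, assuming Python's standard mailbox module (mbox_to_maildir is a made-up name); the master mbox file is only read, never modified.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # MaildirMessage() carries over the read/replied flags
            # that mbox keeps in its Status headers
            dst.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        dst.close()
        src.close()
    return copied
```

Run once per YYYY-MM file, this would give one Maildir folder per month that an indexer such as Xapian or mairix can then be pointed at.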
...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).
El_Muerte_TDS has just pointed me towards mairix, a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.
Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here.
Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.
Moderation Total: -1 Troll, +3 Goat
Although it would involve keeping an index, you could add a strange keyword to each piece of email within the body of the email. For example, all emails from Donna in 2009 could be tagged with donna09. Running a search should yield all emails from Donna in 2009. You could also add the month, january09donna for example. You can even ask people to install a tag in every email they send to you.
IMAP is a messaging protocol. You can't store things in IMAP. What you can do: upload eMail messages to a mail server which then stores it in [insert-mail-server-specifics-here]. The format you are looking for is MIME. MIME is complete and keeps all the header information. Every message is one file that can be read on any platform. You could opt for MIME messages in a directory structure and use some fulltext index software (Google desktop, Apache Lucene etc.) You can probably find software that creates index lists (like by sender / subject / date)
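As a rough illustration of why one MIME message per file is enough to keep headers, body and attachments together, here is a small sketch using Python's standard email package. The summarize function and the dictionary it returns are invented for this example.

```python
from email import message_from_bytes, policy


def summarize(raw):
    """Parse one raw MIME message and report headers, body text, attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text body, fall back to HTML if that is all there is
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": str(msg.get("Date", "")),
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
        "text": body.get_content() if body is not None else "",
    }
```

Feeding each stored file through something like this is one way to build the sender / subject / date index lists mentioned above.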
Yes, it is not free, and yes, this suggestion will bring out the trolls, but you might want to consider Lotus Notes/Domino. It is ~$140 for the system, and ~$40 a year maintenance (Includes all upgrades) cost per user, but IBM isn't going anywhere any time soon.
It has good full text indexing, you can keep your mail on a client, and on the server, with incredibly flexible replication rules for what is stored where.
It supports IMAP, so it talks well to most clients.
The iPhone syncs seamlessly with it via ActiveSync, and an Android client is in beta as we speak.
It includes an http client, and the http client even offers offline access. That's right. You can use the http client, and still read your mail and write emails that will be sent the next time you make a connection.
It also has folders, but you can put any email into as many folders as you want, so you have the best of both Outlook folders and Gmail tags.
It supports auto-processing rules for automatic filing of data, as well as being a full development environment if you want to get really fancy.
It is brain dead easy to set up and maintain.
The server runs on Linux and Windows, and the client runs on Linux, Windows and Mac.
My old alarm clock PC doubles as a web server.
The raw data should be in one of the common "mbox" formats with MIME-encoding. It doesn't have to be all in one file either - one file per year or per month should be fine. This has been around since the 1990s and you won't risk losing access due to the file format in your lifetime.* You will lose your folder organization, but you can get around that by making the folder name part of the file name or using filesystem-level folders to segregate messages, e.g. "2008/April/junk.mbox" or "junk/2009/April.mbox" and so on.
You can make "working copies" of this in any format you like. You can even be "simple" and use your operating system's text-index tools to index the files. You won't have quick-access to pictures or other binary or non-ascii text attachments but opening the mbox in any mail-reader that understands this file type - and there are many - will get you to the attachment.
*guarantee void if life-extending technology allows you to live more than 125 years from now.
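A minimal sketch of the "junk/2009/April.mbox" layout, using Python's standard mailbox module. The function names, the "undated.mbox" fallback, and the idea of starting from one big source mbox are my assumptions for illustration; adapt to whatever your exports look like.

```python
import calendar
import mailbox
from email.utils import parsedate_to_datetime
from pathlib import Path


def archive_path(root, folder, msg):
    """Build e.g. root/junk/2009/April.mbox from the message's Date header."""
    try:
        when = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        # Missing or unparseable Date: park the message in a catch-all file
        return Path(root) / folder / "undated.mbox"
    month = calendar.month_name[when.month]
    return Path(root) / folder / str(when.year) / (month + ".mbox")


def split_mbox(source, root, folder):
    """Copy every message in `source` into per-month mbox files.

    Returns a dict mapping each destination file to its message count.
    """
    open_boxes = {}
    counts = {}
    src = mailbox.mbox(str(source), create=False)
    try:
        for msg in src:
            dest = archive_path(root, folder, msg)
            if dest not in open_boxes:
                dest.parent.mkdir(parents=True, exist_ok=True)
                open_boxes[dest] = mailbox.mbox(str(dest))
            open_boxes[dest].add(msg)
            counts[str(dest)] = counts.get(str(dest), 0) + 1
    finally:
        for box in open_boxes.values():
            box.close()  # close() flushes pending writes to disk
        src.close()
    return counts
```

The source file is only read, never modified, so you can rerun it into a fresh directory if you change your mind about the naming scheme.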
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
you obviously don't watch enough hollywood geek thriller movies. Someone is *always* going to die if the information isn't found right fscking now!
People in cars cause accidents....accidents in cars cause people
You want to store all your messages in MIME format. MIME is reasonably well defined, and your messages arriving from the Internet are most likely already MIME. A MIME message can be opened with any text editor or displayed on the command line (cat somefile.mime). It can contain attachments (you need to take care of attachments yourself -- their binary formats might become obsolete). Some suggested solutions (maildir) use native MIME files, and then any full-text indexer will do. Looks like mairix might be good for listing your messages inbox-style. Good luck!
How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important.
While the value you place on being able to retrieve critical pieces of information may be valid, your choice of storage medium is not. An email system is not a file server or database. Most index poorly, if at all, making searches horribly inefficient. And as has already been observed, it may be quite likely that those same things you value will be more than offset by their value to a hostile litigant.
Damn, I'm going for +5 Funny and you guys mod me down to -1 Troll? Tough crowd. Get a sense of humor, will ya?
just because I can.
That's a big assumption. You are asking slashdot, so I'm thinking you can't. Especially because imap never occurred to you.
Convert to HTML with something meant for creating online archives. Then if you put it on a filesystem you can index it and search it at will. Unless you really need the originals this is your best no-coding option for later convenient reading. It is also possible to use some software to generate the indices etc. with the originals included within the archive pages.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Yup, I'm really highly concerned that an advertiser might learn that I like electronics and am a huge computer geek. Because there's no other way they could know that.
Are you concerned that your emails "leak" such information about those that you are corresponding with? Are they OK with this? If they sent their email to a gmail account that's one thing, you could argue they implicitly agreed to the profiling. However by uploading all your emails to gmail there is no such implicit agreement.
I recently did something very similar with mail dating back to 1993 or so in multiple mailbox formats (Eudora, PST, Thunderbird mbox, etc.)
Get a Google Apps account http://www.google.com/apps/intl/en/business/index.html
This allows you to run a gmail interface with mail on your own domain.
If you need more than the available storage for free, you can pay for 25 gigs, but it seems like the free level will work for you.
For the PST files, upload them with Google Apps Migration for Microsoft Outlook
http://tools.google.com/dlpage/outlookmigration
Alternately, migrate the PSTs to Thunderbird using Emailchemy
http://www.weirdkid.com/products/emailchemy/
Then, if you're on a Mac (it seems you are) upload to Google Apps via the Google Email Uploader for Mac
http://code.google.com/p/google-email-uploader-mac/
This will upload everything you have in your Thunderbird environment. And it will take some time. At first it may look like the program has frozen, but give it a half hour or so to sort through all your Thunderbird folders, and then let it upload the mail overnight. It took me a few overnight uploads, but it was worth it.
Once you have it in Google it's very searchable and flexible. You can, for instance, re-organize it using labels, and then re-download to Thunderbird via IMAP if you like.
gmail only allows 7.5GB of space currently
+++THIS POST IS INTENDED TO BE HUMOROUS+++
It's usually best if you're making a joke on /. that begins with anything other than "I for one welcome...", "In Soviet Russia..." or is itself entirely a quote from a Simpsons episode that was broadcast ten years ago, to give the poor /.'ers a helping clue such as beginning your post with "+++THIS POST IS INTENDED TO BE HUMOROUS+++".
Aide-toi, le Ciel t'aidera - Jeanne D'Arc.
What about the privacy of those you correspond with? If they send an email to a gmail account that is one thing, but you are unilaterally deciding to have them participate in the targeted advertising profiling.
Actually, email systems tend to be the most searchable precisely because of people like the grandparent. If someone's sent me something in an email, I can usually find that email in less time than it takes to find where I saved the attachment. I have every email I've sent or received since 1997 (excluding spam, but including mailing lists), which comes to about 3.6GB. In spite of this size, it's well indexed by my mail client and searches generally only take a few seconds to produce the correct result.
I am TheRaven on Soylent News
I've also been archiving my emails since the early 90s. I've got a few hundred thousand messages. I've always used procmail to store in mbox format. I use shell/grep etc for searching. With procmail I archive like so;
$HOME/Mail/year/SenderOrRecipientAddr
where SenderOrRecipientAddr is either the senders email addr or the recipients, depending upon whether it's mail to me or from me. This way for example everything I send and receive to/from joe@smith.com is in the same mbox file.
And storing it under $HOME/Mail allows imap to serve it up.
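The filing rule is simple enough to sketch outside procmail too. This is a Python approximation of the year/correspondent layout, not the actual procmail recipe; the function names and the "me@example.org" address in the usage note are made up for illustration.

```python
from email.utils import getaddresses, parseaddr, parsedate_to_datetime
from pathlib import Path


def correspondent(msg, my_addresses):
    """Return the other party's address: the sender for mail to me,
    the first recipient for mail from me."""
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender not in my_addresses:
        return sender
    for _name, addr in getaddresses(msg.get_all("To", [])):
        if addr:
            return addr.lower()
    return sender  # no usable recipient: fall back to my own address


def mbox_path(msg, my_addresses, root):
    """Build root/<year>/<correspondent>, one mbox file per address per year."""
    year = parsedate_to_datetime(msg["Date"]).year
    return Path(root) / str(year) / correspondent(msg, my_addresses)
```

With my_addresses set to {"me@example.org"}, a message from joe@smith.com to me and my reply to him both map to the same file under root/2009/joe@smith.com, which is the whole point of the scheme. Only the first To: address is used, so mail to several people lands under one of them.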
What about the privacy of the other people involved in an email? Did they consent to be part of gmail's targeted advertising profiling? Perhaps emailing a gmail account implicitly does so but that is not the case when you upload everything on your own.
DBMail. I use it on a Linode host (small fee every month).
Need an ISP in South Africa?
Ahhh.... so you have access to the time even when you're away from home? Clever! ;)
Currently I use imapsync (http://freshmeat.net/projects/imapsync/) to sync all of my email to shared archive folders on a vm with the cyrus imap server installed. I wrote a shell script that syncs all of my mail into an archive folder for the current year, then deletes the email off of the original imap server. From time to time I have searched for a way to write all of the archived mail to an indexed format that can go on a cd/dvd that needs no mail reader to search for, but have found nothing. I worry that 20 years down the road there will be no way to run the vm, the imap server, or a client to access it. So good luck ;)
Say what?
It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem. For e-mail, you can store it all on one small (these days) hard disk placed in a drawer somewhere, with space to spare -- even with all the spam! And the process of figuring out how to better organize it and archive it going forward will be a useful learning exercise that might have applications elsewhere (e.g., at work, where people might be asking exactly the same question).
It's no worse than deciding to tidy up your office or study area and figuring out a system to better keep track of things so you can find them later.
I mean, heck, the President of the United States had the same fricking problem: how to properly archive e-mail, a problem discussed here numerous times. As a common problem -- personally and in business -- listening to other people's solutions before digging into it yourself is an efficient way to deal with it.
I've been a Cyrus IMAP admin for over a decade and have experienced no problems with user email boxes in the 6 GB - 8 GB range or single IMAP boxes with > 1E+06 messages. Performance of large batch message operations (i.e. import, export) is also satisfactory. It's also very useful to have server-side message tagging support (i.e. like gmail). I've heard similar reports regarding other FOSS IMAP servers such as Dovecot & UW, and there seems to be at least some consensus that they are easier to manage than Cyrus, but I have no direct experience regarding the relative ease of administration. Running your own local Zimbra might be a nice starting point as well: it gives you a bunch of personal productivity functionality in a single groupware app. I'm running my own Zimbra instance on a RackCloud server for $90/year (all-in) for exactly this purpose.
I agree, IMAP is the way to go. The dbmail.org project has an implementation of an IMAP service that uses a database as the back end. This allows you, in theory, to create a custom application that does full-text search over the mail contents (which are stored in database tables). The default schema already does a good job of normalizing mail headers and recipient email addresses, which helps when filtering searches on those. This kind of searching and indexing is of course a custom thing you have to build. I have not gotten around to doing this yet (after several years of running dbmail now), but I have found that having the mail contents stored in a database does provide slightly better performance over time than having many, many individual files, as when a mailbox is backed by Maildir or Mailbox file-system-based storage. The only hitch is that if you need to interoperate with Windows, or use Windows only, it's inconvenient compared to using PST files, I guess. I have envisioned creating a virtual machine that runs a Linux operating system loaded with just the dbmail and database stack, effectively creating a kind of macro PST file: a service / appliance / single virtual machine image that I boot up to be my easy-to-search mail storage repository.
https://www.google.com/accounts/PurchaseStorage
I am in the same boat as the original poster. And I think the question has not yet been answered. My requirements (and I suspect the original guy's too) are:
1. A client that can import and store emails in a wide variety of formats
2. A client that can search emails (including office format and PDF attachments) quickly
2a. A figure of merit: 100000 emails, 10 gigs, 100 msec or less for search (core i7, plenty of RAM, SSD)
2b. Ideally search would allow SQL-like searches on any field and understand regexp
3. A client that requires no IP stack to function so it could be run on a machine detached from internet
(I have on-line and off-line machines for security and I disable IP stack on off-line machines to prevent
temptation to use them online if my other machine fails).
4. Crucially, a client that is easy to install, configure, and use. If your solution involves configuring a server
or worse yet, configuring a server in a virtual machine then it is not workable. I do not have the time to
figure it all out and I suspect only real sys-admins would consider this a solution.
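Requirement 2b (SQL-like searches on any field, plus regexp) can at least be roughed out with nothing more than the SQLite that ships with Python, no server and no IP stack involved. Treat this as a sketch only: the "mail" table and its four columns are a hypothetical schema, getting messages into it is left out, and a regexp over the body is a linear scan, so it says nothing about the 100 msec figure or about attachments.

```python
import re
import sqlite3


def _regexp(pattern, value):
    # SQLite rewrites "value REGEXP pattern" into regexp(pattern, value).
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0


def open_index(path=":memory:"):
    """Open (or create) the hypothetical 'mail' table with REGEXP enabled."""
    db = sqlite3.connect(path)
    db.create_function("REGEXP", 2, _regexp)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail "
        "(sender TEXT, subject TEXT, sent TEXT, body TEXT)"
    )
    return db


def search(db, sender, body_pattern):
    """Exact match on one field combined with a regexp on another."""
    sql = (
        "SELECT sent, subject FROM mail "
        "WHERE sender = ? AND body REGEXP ? ORDER BY sent"
    )
    return db.execute(sql, (sender, body_pattern)).fetchall()
```

Something like search(db, "donna@example.com", r"quot(e|ation)") then returns the date and subject of every matching message from that sender. Pass a file name to open_index to keep the index on disk instead of in memory.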
I am in the same boat. I ended up importing them into Cyrus for the last few years. It's not fool-proof, however if you configure the "squatter" service, it will do some rich indexing. I have found that, over time, even when older messages have an attachment, it doesn't always translate correctly into modern mailers. There could be several reasons behind that.
A while ago, I saw a project called Zoe which was aimed at solving the problems described -- it was OS centric (Mac?), though I believe it's been abandoned.
Another project out there is "dbmail" which is basically a large-scale email server (IMAP, etc.) that stores your messages in a MySQL database. Might be worth a shot.
I think the original poster is asking about something that will not only store the data properly, but also present some sensible GUI to peruse it all. This capability is veering into the paradigm of "document management", I would think, especially with regard to access to the original attachments and their various encodings and formats.
http://www.sqlite.org/cvstrac/wiki?p=ExperimentalMailUserAgent
I agree with this.
In fact, the OP said he uses (used?) Entourage, which means that he has a Mac (or at least had one recently).
One important thing the AC did not mention: You can easily export from Mail to mbox format (just select the messages you want and choose Save As... and "Raw Message Source" format). mbox is unlikely to go away any time soon, and is anyway text-based so the info will always be recoverable.
Having an exit strategy is crucial when choosing a format for your data, be it email, music, photos, word processing, etc.
When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal is either being a troll, or not understanding how the world works when you're not just a drone.
This sort of behavior is odd and not normal. If you want to keep your email, then that's fine, but thinking that it's "vitally important" is odd and I think without question points to some "OCD with some component of Aspberger". If you don't then maybe you need to re-evaluate.
I am however interested in how you pull demographic analysis out of emails? I mean, hopefully you're not suggesting that you go and chomp on the text to pull out fields of data?
So on the one hand, you think my saving email for later access and analysis is not useful, but then, you want to know why it is useful?
I run a research laboratory where we do two things, one is work on restoring sight to the blind, the other is to organize a conference every two years. The primary demographic analysis I need to do is to analyze the country-of-origin for email traffic pertinent to the conference. This has helped to raise many tens of thousands of dollars of support for the conference by demonstrating various aspects of the global attendance to funding agencies.
Being able to access my email and locate attachments, review discussions, find references, remember addresses, etc., in other words, to recall what someone once wrote to me, has resulted in millions of dollars of grant money to fund my research. Without the ability to review email that is, at times, years old, that would not be possible. Having rich access to my email stream has allowed me to fund my lab, and therefore feed and house my family and the people who work for me, publish high-impact papers, receive numerous awards, get coverage in the international press, etc., or, put better, to run the daily business of a research lab at a high-profile university. While the tools I use are good, they leave a lot to be desired, and having a better system would make me more productive.
IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea,
I think that GMail could be the panacea here. I mean, if you're just trying to make sure it lasts and you can search it with ease, then GMail can do it better than you can.
I dislike GMail for my professional correspondence for a number of reasons: (1) it does not allow me to readily use my university affiliation address (and since that's a top university, that makes a difference whether people like it or not), (2) I do not have ownership of my email, (3) the lack of a good filing / archiving interface makes it hard to associate different threads together, or to limit searches (I intensely dislike the tagging feature), (4) GMail has an only rudimentary ability to edit text since it's browser-based.
I do use GMail for my personal correspondence, but that's mostly because it's the best of a bunch of poor, but free, services. It does have the best searching features, but falls down in a lot of other ways. It also would be against my employer's policies to store HIPAA-regulated email offsite. So GMail is not a panacea. Thanks for the suggestion, though.
Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
While it's totally overkill for the job, I highly recommend you run a Zimbra Open Source instance for yourself. Although you don't need much of what it provides (Calendaring, contact sync, Jabber IM, etc), it will let you store your messages in a stable, searchable and accessible form. Zimbra can directly import from PST or via IMAP (with your mail client or imapsync) and once it has your messages it full text indexes them with Lucene and so you can search them via the web or IMAP clients. You can easily get your messages out via one of the supported export formats or just use your IMAP mail client to dump the messages into mbox/maildir/pst/whatever. While you could certainly roll your own, why not let someone else take care of all the hard work for you?
especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.
Yeah, I've used Gmail for getting close to five years now. Does it bother me that they have access to all my stuff? No more than it bothered me that whatever ISP's email I used previously had access to all my stuff.
In my eyes, it's just email, personal email at that. Of course I have sensitive stuff in there, but I'm not going to spend a disproportionate amount of time setting something up myself when I can just use what's already made.
If you want to have everything in your own possession, you could always set up a client to download the messages, and then delete them off Google servers once done. But I understand the paranoia. It's the same thing that keeps me from signing up for a medical marijuana license, I'd just prefer to not have my name on that list.
Someone flopped a steamer in the gene pool.
If you get everything into a standard (free)Unix spool file, it'll be readable a hundred years from now. After all, what other kind of archive file could you have from twenty years ago which you could easily use today?
Use the following for optimal performance: (1) IMAP for input, storage and access from any client for daily use; (2) configure Apache Solr to index your IMAP mail; (3) a web-based search interface to access your Solr index; (4) use (hierarchical) faceting (see example: http://search.lucidimagination.com/).
Leaving aside all the usual (tiresome) conspiracy theories I'd definitely import them to Gmail or, better still, a Google Apps account as per suggestions from other posters. I have all my mail going back several years and there's no problems for me.
So on the one hand, you think my saving email for later access and analysis is not useful, but then, you want to know why it is useful?
No, I wanted to know how saving email was the best way in which to accomplish the goal of demographic analysis. Now that you've explained what you do it *for*, which, for the record, I couldn't be less interested in BTW, I'm interested in how you achieve that goal with saved email? Last I've looked, and I could be way wrong, country of origin isn't listed in the email header. Also, IP addresses can't be that reliable two years after the fact either. So, how do you get country of origin from two year old emails? (not sarcastic either, I'm interested)
You'll have that sometimes...
I like it that Evolution saves in the same format as Mutt. Quite a lot that a person can do with that and basic unix commands.
I dislike GMail for my professional correspondence for a number of reasons: (1) it does not allow me to readily use my university affiliation address (and since that's a top university, that makes a difference whether people like it or not), (2) I do not have ownership of my email, (3) the lack of a good filing / archiving interface makes it hard to associate different threads together, or to limit searches (I intensely dislike the tagging feature), (4) GMail has an only rudimentary ability to edit text since it's browser-based.
So...
1. Yes it does. So long as your university allows you SMTP access, then Gmail can send email from your University address.
2. Your University lets you own your email? No archiving or backup there? Interesting. I thought most Universities had a robust email retention policy these days.
3. Gmail threads emails by default, has labels for filing, and you can even use postini if you have retention needs.
4. What do you need to do, edit wise, that you can't with the Gmail RTE? Have you used it lately? If the Gmail RTE isn't good enough, there's a myriad of plugin RTE gadgets you can use too. Just sayin...
Use whatever you want, and it's your business, but I don't see how any of your requirements are not fulfilled by Gmail.
You'll have that sometimes...
It's obvious, upload them to gmail!
(only half kidding)
Flappinbooger isn't my real name
It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem [wikipedia.org]
But you wouldn't save your junk mail, would you? Grocery store fliers? Credit card offers?
You'll have that sometimes...
Computers, hard drives, backups, electricity, rack space, and maintenance are all free! Fuck! Tell me where you shop for this stuff.
b/s ....
June 2001: Dave, my mind is going. I can feel it. I can feel it. My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm a... fraid. Good afternoon, gentlemen. I am a HAL 9000 computer.
Sept 2001: Can you take away this damn monolith?
btw:
Shhhh - Dave's the real father (AC doesn't know)..
Is that some sort of crossover between SW:ESB and SO:2001?
Your experience is as common as your rationale. Nevertheless, if email is the easiest way you have to find important information, you are doing it (storing that important information) wrong.
People here seem to think that you are looking for another email client. Instead, it appears to me that what you really need is a way to archive and search your local machine. In light of that, take a look at http://beagle-project.org/ Beagle can search your IMAP stuff and local file system stuff too. I run Ubuntu so the UX for installing, configuring, indexing, and searching with Beagle is pretty easy. Beagle is available in the Ubuntu Software Center. You can search from either the command line or from the firefox search bar once you have configured that.
1) I use the common "business identifier@vanitydomain.com" trick to help identify who is selling my e-mail address. Gmail has plus-addressing, which works reasonably well, however it is imperfect. Some spammers know about plus-addressing, and strip the plus. Google Apps for Domains would work, except that you're pretty limited in the number of addresses you can use without paying exorbitant (for these purposes) fees.
Yeah. 50. You need more than that for your email address?
2) Forwarding mail to Google destroys valuable header information. Redirecting mail can cause it to get blocked by the spam filter (sometimes so badly that it doesn't even make it into your spam folder.) So even keeping your own mail server and just bouncing everything up there isn't a viable solution.
So, get google to check it for you. Don't forward it, have google check it with POP or Imap for you. No problem, your headers stay intact and you're good to go.
3) Having Google pop mail from your server is probably the most workable technical solution, but then Google has your password. Also, there are size limitations, in case you happen to have large attachments that you need to preserve.
The size is pretty large to start and super cheap to increase.
You'll have that sometimes...
As anyone who actually uses IMAP can tell you, it bogs down quickly on large mailboxes, violating the poster's requirement about b)
Not true. Not absolutely false, either. IMAP is an access protocol, not a storage or indexing mechanism, and there is nothing inherent in IMAP that dooms it to be slow in handling large mailboxes. Different combinations of client and server, configurations, and mailbox content and usage can make huge differences in performance. Tens of thousands of messages in a single IMAP folder on a memory-lean server that uses Maildir storage on a UFS or ext2 filesystem with atimes enabled is going to suck horribly, especially with a client that doesn't cache heavily or maintain its own indices. Make that a mbox, and it will work great until you start trying to change it every couple of seconds.
Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.
Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.
I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.
Watch for Penguins, they eat Apples and throw rocks at Windows.
If you want light, always in text format, easily searchable, and fast, maildir + mairix is your answer. You don't even need to keep your mail in a flat structure. Place this on a server with IMAP/s access, and you'll never have to move your mail again. Just make sure you have good backups. For the fastest results ever? Access your email over SSH using mutt. The only drawback is that if you're not a CLI person (and this doesn't even use it that much), you're going to hate this, or at least have to pile on a few scripts to web-ify mairix and its search results.
And no offense to the gmail users, but true blue email types would never turn over their emails to anything not completely under their control.
Well they have this thing, it's called "backups" this would be where you save your precious emails and move them to another media.
Get up!
As far as a "cloud" webmail interface goes Gmail has the best search features (which probably contributes to why so many Slashdotters prefer Gmail), but the search features introduced into the Thunderbird 3.x mail client are the best of any e-mail interface. To even rival the customizability of searches that is available in Thunderbird 3.x would require one to be fluent with command-line commands like find and grep, but acquiring such fluency is temporally expensive.
Timothy (OP) says that he has already tried Thunderbird though, but since his first complaint is that moving the "hundred of thousands of emails" that he has hoarded over the past two decades between the email systems that he has already tried takes "forever to process", Timothy appears to have some unreasonable expectations regarding data sets that large (specifically in regards to migrating and indexing such sets).
For those who do not feel comfortable keeping their e-mails in the cloud, they could always use Thunderbird 3.x as the interface and administer their own IMAP server at home using software like Dovecot.
Imbecile. Outside of the USA, the majority of email addresses end in a country-specific suffix.
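To make that concrete, here is a rough Python sketch of tallying country-code suffixes from From: headers. The function names are mine, and it only looks at the address; generic domains such as .com or .edu tell you nothing about the country, so they end up in an "other" bucket.

```python
from collections import Counter
from email.utils import parseaddr


def country_suffix(from_header):
    """Return the two-letter country-code suffix of the sender's domain,
    or None if the domain does not end in one."""
    addr = parseaddr(from_header)[1]
    domain = addr.rpartition("@")[2]
    if "." not in domain:
        return None
    suffix = domain.rpartition(".")[2].lower()
    # Country-code top-level domains are exactly two ASCII letters
    if len(suffix) == 2 and suffix.isascii() and suffix.isalpha():
        return suffix
    return None


def tally(from_headers):
    """Count senders per country suffix; everything else goes to 'other'."""
    return Counter(country_suffix(h) or "other" for h in from_headers)
```

Feed tally() the From: lines from an archive and you get counts like de, uk, jp and so on without parsing anything beyond the header.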
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
Your emails must be REALLY important!
Sorry, but gray text on gray background is making my eyes bleed.
Read the post properly. He said he does NOT have ownership of his emails. This doesn't mean he's not responsible for the mundane details, but to quote the poster, "It also would be against my employer's policies to store HIPAA-regulated email offsite". So GMail is totally absolutely out of the question.
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
"... you could always set up a client to download the messages, and then delete them off Google servers ..."
Or just not use GMail.
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
In Soviet Russia, the jokes LAUGH AT YOU!
char*f="char*f=%c%s%c;main(){printf(f,34,f,34);}";main(){printf(f,34,f,34);}
Perhaps you could free yourself from the tyranny of data by just deleting the e-mail? You can keep a year or two around in your favorite e-mail tool, and just let the rest go... the alternative appears to be creating the digital equivalent of the old people living in houses filled with junk that they never do anything with.
This is a windows solution, and it works great. I have stopped using all clients and just use GMail on the web. I have archived all my Eudora/Thunderbird archives into MailStore. I now have one place to search all my e-mails, since 1999. http://www.mailstore.com/en/mailstore-home.aspx
Far more efficient to simply leave the spam and let it sit with everything else. Ideally spam is deleted as it arrives, but some get missed ...
I was just watching Hoarders... and I think this would be the digital equivalent. Why on Dawkins' green earth would you possibly want to keep all that email???
Damn, I'm going for +5 Funny and you guys mod me down to -1 Troll? Tough crowd. Get a sense of humor, will ya?
Your post would have been modded funny if it had contained a humorous punch-line.
"I like to lick butts!" by MobileTatsu-NJG (#32700246) (Score:5, Informative)
Hey - I don't think there ARE any computers older than my dad. Lemme see, he was born in ... er, 1929.
Nope, not too many PCs then ...
"Cats like plain crisps"
I agree with the general sentiment that maintaining electronic records (and emails are most definitely legal electronic records) is imperative. The IRS suggests maintaining at least 7 years worth of documentation in the event of an audit. It should be no different for electronic records.
Where I don't agree with the general sentiment is the fear-mongering of privacy concerns with gmail. I switched to gmail (via Google Apps) about three years ago. It is without a doubt the best digital move I've ever made. Google's privacy policy is quite clear on how your data is stored and managed.
If you still feel the need to maintain a local archive of your mail records, simply download them on a regular basis to a client of your choice. While I understand the interest of a hobbyist to create some elaborate local server/client for their mail, I (and I suspect many others) have more important things to do with our spare time. Enjoy the services that exist today to help you manage these records, instead of re-inventing the wheel.
As someone who used to run an IMAP server for a few hundred users (dovecot on OpenBSD, Maildirs totaling several TB in size), I can say this is not true. How well IMAP performs on a large mailbox is largely a function of how braindead your IMAP client is. Certain versions of Outlook are pretty slow, but things work rather well with Outlook 2010. Thunderbird is insanely fast, UNLESS you turn on the offline indexing features. I haven't used the latest Apple Mail, but it also had a tendency to spawn so many threads that the imapd on the other end would start closing them. You can configure how many concurrent connections to use somewhere in the prefs. My iPhone works wonderfully with IMAP. Back in the day, I used Sylpheed, and it too was quite fast.
+1 on the database. If you can get the data into some sort of mail server temporarily, you can use procmail to parse the mail headers and generate SQL insertions. There's probably something newer - I used this method in 1998 to parse incoming mail from a remote server, that sent status updates every hour.
Mail headers are not that difficult, so if you can get the data into a few standard formats (I don't know about the Outlook formats), you could even do this with a scripting language of your choice, directly from the file. Procmail is nice because it's very good about splitting the mail at the correct points. But, like I said, there are probably newer tools.
In the database you only need fields for (off the top of my head) Date Sent, Date Delivered, To, From, Subject, All Headers, Body and Attachments, plus probably one separate table for the raw data with the same indices so you can augment it later with stuff like mail ID and threading, etc. Then run a Full Text index on the body and subject. You could get fancy with separate tables for all the different To and From, etc.
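A minimal sketch of the table layout described above, using Python's standard mailbox and sqlite3 modules instead of procmail. The table name, column names and the functions create_schema and import_mbox are made up for illustration, the Date Delivered and Attachments fields are left out, and multipart bodies are skipped, so treat it as a starting point and not a finished importer.

```python
import email.utils
import mailbox
import sqlite3


def create_schema(conn):
    """Create one table holding parsed fields next to the raw message."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               date_sent TEXT,
               sender TEXT,
               recipients TEXT,
               subject TEXT,
               all_headers TEXT,
               body TEXT,
               raw BLOB
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sender ON messages(sender)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_date ON messages(date_sent)")


def _iso_date(msg):
    """Return the Date header as ISO 8601 text, or None if missing or unparseable."""
    value = msg["Date"]
    if not value:
        return None
    try:
        return email.utils.parsedate_to_datetime(str(value)).isoformat()
    except (TypeError, ValueError):
        return None


def _plain_body(msg):
    """Decode a single-part body; multipart handling is left out of this sketch."""
    if msg.is_multipart():
        return ""
    payload = msg.get_payload(decode=True) or b""
    try:
        return payload.decode(msg.get_content_charset() or "utf-8", errors="replace")
    except LookupError:  # unknown charset name in the message
        return payload.decode("utf-8", errors="replace")


def import_mbox(conn, mbox_path):
    """Insert every message from an mbox file; returns the number inserted."""
    count = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
            conn.execute(
                "INSERT INTO messages (date_sent, sender, recipients, subject,"
                " all_headers, body, raw) VALUES (?, ?, ?, ?, ?, ?, ?)",
                (
                    _iso_date(msg),
                    str(msg.get("From", "")),
                    str(msg.get("To", "")),
                    str(msg.get("Subject", "")),
                    headers,
                    _plain_body(msg),
                    msg.as_bytes(),
                ),
            )
            count += 1
    finally:
        box.close()
    conn.commit()
    return count
```

Keeping the raw bytes in the same row means the parsed columns can be rebuilt later if the parsing turns out to be wrong. A full text index on body and subject would be a further step on top of this and depends on what the database build supports.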
In the very early 1990s I built and sold a tool for the NeXT called MailQuery, which combined NeXTMail with a 'context aware full text semantic search engine' called Metamorph - presently part of the Texis text search system. That was cool. It was phenomenally good at letting you type in key words that were related to what you were looking for, and finding exactly the right email. You didn't have to remember the exact words - just the ideas, more or less.
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
About five years ago I sucked all of my mail ever into an SQL database using Perl scripts.
You'll never find a system as fast at searching and categorizing that amount of mail.
Just remember, Larry and Sergey read it all.
Wes
Do daemons dream of electric sleep()?
Given the price of storage, it doesn't make sense to spend a lot of time (potentially any time at all) sorting through messages by hand, deciding what to save, if you can just as easily archive all of them and then search for the ones you want later. Unless you put a very low value on your time, you can buy a lot of disk for an hour's worth of sorting.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
Because Hotmail and Ymail are crap.
Gmail is on par with the better paid for email services I've used, nowhere near as good as having a fully qualified and competent sysadmin running your own private server but still, Gmail is free.
Calling someone a "hater" only means you can not rationally rebut their argument.
grepmail regex mbox1 mbox2 mbox3 > newmbox
mutt -f newmbox
I lived on a steady diet of that for years, but now Thunderbird 3 does it better. It meets all of the OP's requirements:
a) Everything's stored as an mbox. Fast conversion utilities abound.
b) Performance with my very large set of data is good.
c) mbox is as universal as it gets.
d) Thunderbird 3 added full-text indexing. I can search GBs in seconds, instead of tens of minutes with grep.
e) Thunderbird is on all major platforms.
f) see (c).
I recently updated our Exchange environment to SP1 which allows me to create a new database on different storage and assign an Archive mailbox for users. So now I got a terabyte volume on tier 2 SATA storage for folks to use as archive - now I can get those damn PST files finally off my file servers.
Or maybe because it suggests running a VM on a desktop/laptop just so he could archive his mail.
That's a piss-poor solution.
Here is my "non-deliberate" system for recovering emails, files, programs, website copies, correspondence, certification homework, photos and projects, from the last 12 years. All these components have been mentioned earlier except the find text in files command I show you below.
I have the computer I used for the last 11 years sitting turned off with two disk drives. This is a metal case computer isolated from electrical surges by a UPS. I presently have the archive computer set up with an IO Gear device that switches the keyboard, mouse and display to the archive computer if I need to turn it on. I like this better than accessing the archive computer through SSH or a terminal server or remote desktop viewer. The silly but important thing is do not let the archive computer connect to your mailserver and download current emails. I find simply copying entire directories to an 8 gig USB drive is easier than messing around with SCP (secure copy) and SSH (secure shell).
Those two disks go back about 12 years. The email systems I have used over the years have pretty well known names. The older email pattern is one big file containing every email with a blank line separating each email. For a Microsoft system, I use a USB stick as an archive device and I read it on a Linux box.
I can search as much of the disks as I need using the following script (taken from Unix Power Tools)
I store the script in a file called "findscriptfile" because I can't remember it, I just look it up when I need it. I wind up creating files with the same command on various places on my computers. Note the command line below requires blanks as shown. You will have to test and fiddle with the search. Sometimes you need to use "sudo " when file permission error messages cloud the results. The search time for an email address or a copy of a letter or a photo file is 20 minutes.
This find script finds all files starting at the current directory and working down and the script searches each file for the word "thumbnail".
find . -type f -exec grep thumbnail '{}' /dev/null \;
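For anyone on a box without find and grep, roughly the same search can be sketched in Python. The function name files_containing is made up for this example, and unlike the grep call above it reports only the matching file names, not the matching lines. It also reads each file fully into memory, so it is only a sketch for modest file sizes.

```python
import os


def files_containing(root, needle):
    """Yield paths of regular files under root whose bytes contain needle."""
    needle_bytes = needle.encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets and the like
            try:
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read():
                        yield path
            except OSError:
                continue  # unreadable file, e.g. a permission error


if __name__ == "__main__":
    for match in files_containing(".", "thumbnail"):
        print(match)
```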
I think one of the interesting things about the original problem posted to Ask Slashdot is what kind of information has enduring value and how much value does it have? Or another question might be what is the cost of storage per month and how many items on the storage system are worth more than the storage cost?
Try converting all the email to a PDF package. I have been adding to a PDF package for several years directly from Outlook and also include any or all important attachments. You can sort or search at any time using names, times, subjects, attachments or just about any other query parameter that you can think of. You can also secure the package, and if higher security is desirable you can also save the secured package in an Office OneNote notebook with a password, which makes the file a little harder to crack. You can add to the package/archive any time you wish or create multiple packages by week, month, year etc. and keep them in a large package. I believe CutePDF is free and works fairly well. I've had acroB pro for several years and that works great for a paid program. AcroB pro attaches to your email client upon install so at any time you can archive any or all emails with 2 or 3 clicks. Any PDF reader can be used on any system to read the email from a flash drive or CD or any portable media. I think you should be able to read the files 100 years from now since I don't see PDFs going away any time soon.
SMTP is the transport protocol.
IMAP and/or POP3 are STORAGE protocols.
I've been using MailStewart "Regular", which uses a built-in SQLite engine, and claims to be good for +100k records. The "Pro" version uses MySQL. The initial import and periodic update of the archive couldn't be simpler.
Luke, help me take this mask off
I have mails from 1995 onward, by now from roughly 15+ different accounts, most of them defunct. Except for a couple of months of 1996 and 2005, which I wistfully deleted, I converted them all with a small tool (Aid4mail, I think) from PST, Eudora, or Pegasus format into Thunderbird's UNIX-compatible format.
+ Open Source (Free + maintained + Supported)
+ Thunderbird searches and indexes just fine
+ plain text format (I can use all sorts of editors on them if necessary)
+ Always on my HDD (Encryption, no public mining, no external servers needed)
+ UNIX-Format guarantees I can convert them into something completely different in 20 years, should the need arise
+ no additional software needed
Am I overlooking some of GP's requirements here? Or is the Slashdot crowd prone to a little overengineering? :)
Regards!
Invita Invidia
You are wrong.
POP3 is a transport protocol.
IMAP is a transport protocol.
You need to learn these things before you post.
I too archive all my emails. The solution I use is quite simple compared to others proposed here.
I have postfix deliver a copy of all mails to an archive directory with mailbox file per year as well as to my inbox (or other mailboxes depending on filtering rules). Each archival mailbox file is about 5GB compressed.
Filtering these mailboxes by header using mutt is a very fast operation and even doing in-body searches and views takes less time than it takes gzip to uncompress the files. Obviously they can easily be backed up in the usual ways.
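For mail that is already sitting in one big mbox, the same one-mailbox-per-year layout can be produced after the fact. This is a rough sketch using Python's standard mailbox module, not the postfix delivery setup described above; the function names and the "undated" bucket for messages without a usable Date header are assumptions made for the example, and compression is left out.

```python
import email.utils
import mailbox
import os


def _year_of(msg):
    """Return the year from the Date header as text, or 'undated'."""
    value = msg["Date"]
    if not value:
        return "undated"
    try:
        return str(email.utils.parsedate_to_datetime(str(value)).year)
    except (TypeError, ValueError):
        return "undated"


def archive_by_year(source_mbox, archive_dir):
    """Copy each message into archive_dir/<year>.mbox; returns counts per year."""
    os.makedirs(archive_dir, exist_ok=True)
    counts = {}
    boxes = {}
    source = mailbox.mbox(source_mbox, create=False)
    try:
        for msg in source:
            year = _year_of(msg)
            if year not in boxes:
                boxes[year] = mailbox.mbox(os.path.join(archive_dir, f"{year}.mbox"))
            boxes[year].add(msg)
            counts[year] = counts.get(year, 0) + 1
    finally:
        for box in boxes.values():
            box.close()  # close() flushes pending writes to disk
        source.close()
    return counts
```

The source mailbox is only read, never modified, so the run can be repeated into a fresh directory if something goes wrong.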
The trouble I see in our OP's question (which I share), is somehow that most of the open source solutions will have a slow interface (compared to, say, OSX Spotlight).
I currently use Powermail on OSX (so, two closed solutions) because it handles almost 20 years of mail, in Go, and is still Spotlight-compatible (raises results while you type the keyword).
The guys at Powermail are a small company that indeed started as the kings of indexing, long before Spotlight. To my knowledge they are the only email app on OSX that maintains Spotlight compatibility. But, they are "proprietary".
I think if Powermail is to die, I'll transfer all my archive to an IMAP server, the way it has been described various times above. This too may be tricky: not all email front-ends will handle 1 Gb of IMAP transfer properly, nor will all IMAP servers. Do try before using. I tried with Powermail and the French postal service's free email: this did well, but that's presently the only combination that indeed works for Gbytes.
Herve S.
Upload them to GMail. Or get Google Apps for your own domain (it's free) and use their GMail variant.
imapsync will take care of your other IMAP accounts, mutt/pine for uploading from Maildir/mailfile and three dead chickens on a moonless night for PST.
Frankly, you can't beat something like a SQL database for those requirements.
I used to have this - a Filemaker Pro database I populated with mail via AppleScript. It would break a message into pieces and store the pieces in fields. But that was mid-90's, before e-mail got hard - there was a From, a To, a Body, etc. No quoted-printable, base-64, or multi-part MIME messages.
It's great to be able to search "From:" some wildcard, Date-range foo to bar, Subject with a boolean keyword expression, but it's also important to be able to re-construct the message for forwarding, replying, etc.
So... the CRUD is pretty straightforward, but what's the best way to represent it in SQL? The easy thing to do would be to load the message into an object with a canned library and then throw that at a SQL ORB, but somewhere down the line retrieving the data manually would also be useful.
A quick search didn't turn up a well-known schema, but certainly this problem has been solved. Being able to use a fast search (tsearch2, for instance) would be so great vs., say, Thunderbird's built-in search. Anybody have any pointers?
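On the quoted-printable, base-64 and multi-part problem, one possible approach (an assumption, not a well-known schema) is to store the raw message bytes untouched so the message can always be reconstructed for forwarding or replying, and to derive the decoded text for the search index from them. A minimal sketch with Python's standard email package; the function names searchable_text and attachment_names are made up for this example.

```python
import email
from email import policy


def searchable_text(raw_bytes):
    """Return (subject, text) decoded from a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer a text/plain part and fall back to text/html if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    text = ""
    if body is not None:
        try:
            text = body.get_content()
        except (LookupError, KeyError):
            text = ""  # unknown charset or unsupported content type
    return str(msg["Subject"] or ""), text


def attachment_names(raw_bytes):
    """List the file names of the attachments, without decoding them."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    return [part.get_filename() for part in msg.iter_attachments()]
```

The decoded subject and text would go into whatever full text index the database offers, and the raw bytes would go into a separate column or table, which keeps manual retrieval possible without the library that loaded the data.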
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Just store each message in a file. Each "mail folder" is a directory. Gnus calls this arrangement "nnml", and Courier calls it "Maildir". I don't know what other software calls it, but it's difficult to imagine any reasonably-capable software not supporting it, because it's so obvious and straightforward.
There are several advantages to this arrangement. The big two are 1) it's hard to beat for compatibility and 2) for searching and indexing and stuff you can use standard utilities such as are made to operate on any kind of (text) file. (As for threading, your mail reader should be able to handle that. Once you find the file with one of the messages you're after you know its subject and message ID and stuff, so finding the thread in the mail reader is easy.)
The one disadvantage is that you have to choose whether to put it all on a FAT filesystem (for maximum operating system compatibility) and suffer the performance disadvantages thereof (which are considerable when an individual folder contains many thousands of files; not as bad as with IMAP, but still very noticeable), or to use a faster filesystem that not every operating system can write to. Of course, moving/copying from one filesystem to another is only as problematic as copying any other kind of files around, so if you decide to use NTFS today (which has reasonable read/write support in Windows and Linux) and later decide to use an OS that doesn't have read/write NTFS support, you can just copy the files over to UFS or whatever at that time. Boot an OS that has both filesystems (Knoppix, for instance), cp -r --preserve=all blah blah blah, and leave it running while you go to work or something.
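Getting from one big mbox file to the one-file-per-message layout can be sketched with Python's standard mailbox module. The function name mbox_to_maildir is made up for this example, and it assumes the target directory does not exist yet so the module can create the Maildir structure (tmp, new, cur) itself.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; returns the count."""
    source = mailbox.mbox(mbox_path, create=False)
    target = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in source:
            # MaildirMessage carries over mbox status flags (read, replied, ...)
            target.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        target.close()
        source.close()
    return copied
```

The mbox file is left untouched, so the result can be checked with the usual file utilities before the original is retired.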
Cut that out, or I will ship you to Norilsk in a box.
Please back up your assertions with facts or references. You don't agree to "When you knowingly use gmail you are agreeing to keywords being added to whatever profile google has on you".
Visit http://www.crunzh.com/ for free software. Mac/Lin/Win
Thank you for your insight. We already knew that is a common problem. What the poster wants (and me as well) is concrete solutions. But you
Sup (http://sup.rubyforge.org/): it is Gmail-like in that it uses tags instead of folders, and it automatically indexes all the email using Xapian in the background, making it a small home Google for your email.
You have 2 delete buttons for a REASON.
Back in the day, ZOE was exactly what you're looking for. It's an open source, cross-platform, turn-key solution (Simple Server is built-in) that is designed to archive, index and search your email (using the Apache Lucene search engine). Jon Udell has a good article on O'Reilly that includes some screen shots.
ZOE meets all of your requirements, though data import is a bit of a problem. There are several different strategies for data import, so one of them may meet your requirements.
Unfortunately, ZOE is abandonware so it's not for the faint of heart. The original author was on the bleeding edge and tended to make 'interesting' technology choices like Tapestry for the framework, and using his own, home-grown build system and a Creative Commons license that isn't usually used for software. He eventually abandoned Java development for Lua and let the registration for the home page lapse. As a result, it's difficult to recommend this for all but the most determined, high functioning users.
Signatures are a waste of bandwi (buffering...)
Yes, me too. Been using dovecot and Maildir files for years now. Before that I used different open source IMAP servers (courier, cyrus, and UW imap) but since I used Maildir file format the transition was automatic (I used mbox format before that with UW imap server and conversion was really simple using the mb2maildir perl script). I have used IMAP servers etc for 18 years worth of email. I organise the 250,000 emails into different folders for each year as that makes searching much quicker. It's never let me down yet.
I know it's a lot to ask these days to get people to read the comments that they are replying to,
Oblig.: You must be new here.
Cool! Amazing Toys.
...That you get off the computer and get a life?
Do you know what IMAP is? He's gonna have to have some central storage thing, but the mail access is platform independent. Yeah, if he wants his IMAP server to be his own then he'll have to pick one OS to serve from, but every nonshit mail application has IMAP support, from desktop to mobile, and hands down it gives him what he wants if he takes the time to organize and set it all up.
I have the exact requirements as you, so I spent the last six months developing a
solution. It converts SentBoxes, Inboxes, gmail, PST files and regular mbox.
It archives and indexes everything and provides full text search with google-like
phrase grouping and exclude phrases.
It normalizes addresses, eliminates duplicates, understands every character set and
can display any email within its web GUI with proper inlining of pics-in-html.
For me it can index 8 gigs of emails within a couple of hours.
We are pilot testing this solution at an ISP for our customers.
Would you like to try it out?
My email http://2038bug.com/email.gif
-paul
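How the tool above normalizes addresses and eliminates duplicates is not described, so the following is only a guess at one common approach: key each message on its Message-ID header, and fall back to a hash of a few headers when that is missing. All function names are made up for this example, and lowercasing the whole address is a simplification, since the part before the @ is case-sensitive on some servers.

```python
import email.utils
import hashlib


def normalize_address(header_value):
    """Reduce 'Name <User@Example.COM>' to 'user@example.com'."""
    _name, address = email.utils.parseaddr(header_value or "")
    return address.strip().lower()


def dedupe_key(msg):
    """Prefer Message-ID; otherwise hash a few headers as a fallback key."""
    message_id = (msg.get("Message-ID") or "").strip()
    if message_id:
        return message_id
    basis = "|".join(
        [
            normalize_address(msg.get("From")),
            msg.get("Date") or "",
            msg.get("Subject") or "",
        ]
    )
    return hashlib.sha256(basis.encode("utf-8", errors="replace")).hexdigest()


def unique_messages(messages):
    """Yield each message whose key has not been seen before, in order."""
    seen = set()
    for msg in messages:
        key = dedupe_key(msg)
        if key not in seen:
            seen.add(key)
            yield msg
```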
Is there a tool to download them again, though? Once you have finished uploading them you might lose the originals and only have the ones on Gmail. I am ignorant of their possibilities.
You can use IMAP to upload existing emails from a local email client such as Outlook or Thunderbird.
You can transfer most other webmail emails to Gmail; they have a method for it.
Then, after that, Gmail has functionality for IMAP or POP to get emails back off if you wish (either a copy or permanent).
I've transitioned more than one small business to gmail.
Googling various questions regarding this will reveal several good walkthroughs....
Flappinbooger isn't my real name
I actually use mharc, http://www.mhonarc.org/mharc/, on my private server to archive various public lists and work-related email since it provides searching capabilities.
For work-related email, I have nmh folders I file message to for the various projects I work on and then have a cron job that runs each night that uses mharc's mh-month-pack script to copy the mail into mharc's archive area.
This system provides me an MUA-neutral way to read and search email. Since mharc keeps an mbox formatted version of all data, I can import the messages to any MUA I want, however since I still have the nmh folder data, I've never had the need.
The other advantage of my setup is I can follow mailing lists w/o cluttering my inbox. I have a separate mail account I use to subscribe to lists of interest, and I archive the messages on my private server for reading and searching whenever I want.
Email is not a secure format and never has been. If you have anything you don't want to be public knowledge, don't use email, or encrypt it. This has been true since SMTP was invented. It's simply not secure. Everyone using email should know this.
If you commit no sin, you need no backups. Your slate is empty and clean. Emails, R.I.P. ~rohit.
Namaste.
Imbecile. Outside of the USA, the majority of email addresses end in a country-specific suffix.
Imbecile. Relying on a domain tld to gather demographics.
You'll have that sometimes...
"I have kept every email I have ever sent or received since 1990"
Instead, I keep every email I sent.
And I make it a habit of acknowledging all emails I deem "important", quoting the full body of the original message in my reply.
Most of my emails are able to be viewed on one screen. I can take screen prints, paste into paint, save as a file. In cases where I need access to editable text, I open the em, mouse-select the text and then paste it into word pad. It takes opening each one, but it's still faster than trying to fwd them somewhere. Best is, I can do it all locally without depending on anyone else, isp, em program etc. Now that i/you know you have this problem with the archives, begin today with the new incoming ems and don't get behind, tho a little behind is ok in my book.
So does mine, but it's kind of quiet, because the alarm clock part is just bells attached to the CD tray and a cron script, and it's only been useful once when the server was located in my friend's bedroom. I think the video I've linked to was shot at that time, when I knew he was oversleeping and I ran the script remotely.
Deus est fatalis