Ask Slashdot: Best Way To Archive and Access Ancient Emails?
An anonymous reader writes "I started using email in the early 90s and have lost most of that first decade due to ignorance, botched backups, and so on. But since about 2000, I've got most — if not all — of my email in some form or other. I run Linux, so this has mainly been in a mix of various programs: Kmail, Evolution, Thunderbird. The past 2-3 years are still on the IMAP servers. My problem is that I only rarely NEED to look back to email of 5 years ago. But sometimes it's nice. Or I just want to reminisce about something...or find an old attachment that I was sent. But I do not want to be clogging my current email client of choice with vast backups and even more, I don't know if it will even easily convert. The file structures are different, some are mbox, others maildir, etc., and I would ideally like a way to 1) store and archive these emails, 2) access them, and 3) search by Sender, Subject, Date, Attachments. Is there anything I can do or do I just have to keep legacy applications on hand for this? Should I keep trying to upgrade and pull old files into the new applications? Any help or suggestions about what YOU do would be great."
Personally, I use getmail and dovecot for my mail, not just archived. Everything's available, sorted and filtered on retrieval. I even added dspam to catch what google misses. I think I wrote a script to get the old mbox files into maildir via dovecot's processing, but it all worked and continues to work. Multiple email accounts aren't a problem.
Just IMAP it all.
I went IMAP in 1997 and have never looked back.
I've also used IMAP as a temporary conversion measure for people switching e-mail clients so even if you aren't sure, it makes a good first step.
I don't understand the concern about too many e-mails. I can access my email back to 1992. With multiple folders it shouldn't be a problem and with modern indexing a search shouldn't be an issue.
Use the IMAP server - if you have control and/or space available.
I just have a single large archive IMAP folder into which everything that isn't spam gets pushed. You could optionally create subfolders for time ranges (every 1-2 years, whatever works for you). With dovecot and its good indexing support on the backend, searching has been quick. If you break the archive out by time range, the searches will be quicker still; you could then also create a virtual mailbox combining them all for when a search really needs to span time (and take a good chunk longer).
There are scripts/utilities available to push mbox, etc. into an IMAP folder; push everything there and use it.
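For anyone who would rather roll their own than hunt for a utility, here is a minimal Python sketch of the "push an mbox into an IMAP folder" idea, using only the standard library (mailbox and imaplib). The function name push_mbox_to_imap and the default folder name "Archive" are made up for illustration, and it assumes you pass in an already logged-in connection. Treat it as a starting point, not a tested migration tool.

```python
import imaplib
import mailbox
import time
from email.utils import parsedate_to_datetime


def push_mbox_to_imap(mbox_path, conn, folder="Archive"):
    """Append every message in an mbox file to an IMAP folder.

    `conn` is an already logged-in imaplib.IMAP4 / IMAP4_SSL object.
    Returns the number of messages uploaded.
    """
    # If the folder already exists the server just answers NO; imaplib
    # returns that status instead of raising, so this is safe to repeat.
    conn.create(folder)
    count = 0
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # Keep the original Date as the IMAP internal date when it parses;
            # fall back to "now" for a missing or broken Date header.
            try:
                when = parsedate_to_datetime(str(msg["Date"])).timestamp()
            except (TypeError, ValueError):
                when = time.time()
            conn.append(folder, r"(\Seen)",
                        imaplib.Time2Internaldate(when), msg.as_bytes())
            count += 1
    finally:
        box.close()
    return count
```

Usage would be something like conn = imaplib.IMAP4_SSL("imap.example.org"); conn.login(user, password); push_mbox_to_imap("old.mbox", conn), where the host name is a placeholder.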
I have all my personal email from 1998 in a Maildir directory with Dovecot as the server on a dual core Atom server running Centos. About 900 MB worth.
Plenty fast.
Trying to figure out what formats will be available in the future is pretty hard; it's easier to see what formats have been around a long time and are still in use.
As such, two formats come up readily:
mbox http://en.wikipedia.org/wiki/Mbox and maildir http://en.wikipedia.org/wiki/Maildir
Convert all your mail to maildir and keep it on your home filesystem. Whenever you need access, from wherever you are, connect to your home VPN and then to the filesystem. I have an account in Thunderbird where I can search, or do whatever I want to it. Seems to work well.
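If you want to do the mbox-to-maildir conversion yourself, Python's standard mailbox module can read one and write the other. This is only a rough sketch; the function name mbox_to_maildir and the paths are placeholders, and you should run it on copies of your archives first.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path)
    dest = mailbox.Maildir(maildir_path, create=True)
    count = 0
    try:
        for msg in src:
            # Wrapping in MaildirMessage carries over the read/replied/flagged
            # state that mbox keeps in the Status and X-Status headers.
            dest.add(mailbox.MaildirMessage(msg))
            count += 1
    finally:
        dest.flush()
        src.close()
    return count
```

The source mbox is only read, never modified, so a failed run can simply be repeated into a fresh target directory.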
Had the same need 20 years ago when migrating from VAX/VMS to Unix. The old emails were saved in a not quite readable format, but I figured I could recover them if necessary. In the end, never bothered. Yes, there are a few (actually, only two) that I'd like to resurrect now, but life moves on.
Translate it _all_ to IMAP services, in MAILDIR format if available. I've repeatedly been faced with clients, partners, and colleagues who use their email as their institutional memory and need to migrate to a new service. There are few technologies as straightforward, and robust, as a simple IMAP server running a light, uncluttered IMAP daemon such as "dovecot", without the complex and unnecessary requirements of a Cyrus IMAP daemon, and most _definitely_ without the complex support requirements of an Exchange, Zimbra, or other corporate grade mail service.
The primary technology difficulty of this approach is in slurping the mail from your numerous external sources and getting it into the consistent layout. Use folders, not database folders but actual directory folders to separate them. Split them by year to reduce the size of the bulkiest folders (which MAILDIR does very well). The secondary difficulty is a robust offsite backup policy, so that a hardware or system error does not lose this personal treasure trove of data.
I'm a big fan of throwing together a DB when I want to store things categorically like that and want fast searches. If you are up to the task, hunt down some tools/roll your own so that you have a nice relational database and some stored procedures for getting what you want when you need it.
You could export your emails to some parsable format, write an importer to extract the basics that you want to keep (from/to/subject/body, attachments/entire binary blob/etc) and then bulk insert that mess into a MySQL/SQL Server instance tucked away somewhere locally or "in the cloud" (EC2, Azure). Just another option, as I'm sure you'll see many here. At least with this route you are in full control of how you index, what you can search, encryption, performance, level of backups, etc. Maybe not the best way for some but I know if I had over 100000 emails that I wanted searchable very very quickly with advanced SQL like searching, this would be a cool way to do it (time permitting). Good luck! And to the pedantry to ensue...Yes. Good day.
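To make the database idea concrete, here is a small Python sketch of such an importer. It uses SQLite from the standard library as a stand-in for MySQL or SQL Server (my substitution, so the example is self-contained); the table layout, the column names and the function name import_mbox are all invented for illustration. It stores sender, recipient, subject, date, an attachment flag and the raw message as a blob.

```python
import mailbox
import sqlite3
from email.utils import parsedate_to_datetime

SCHEMA = """
CREATE TABLE IF NOT EXISTS mail (
    id INTEGER PRIMARY KEY,
    sender TEXT, recipient TEXT, subject TEXT,
    sent_at TEXT,            -- ISO 8601 text, NULL if the Date header is unusable
    has_attachment INTEGER,  -- 1 if any MIME part carries a filename
    raw BLOB                 -- the whole message, unchanged
);
CREATE INDEX IF NOT EXISTS mail_sender ON mail(sender);
CREATE INDEX IF NOT EXISTS mail_sent_at ON mail(sent_at);
"""


def import_mbox(db, mbox_path):
    """Load every message of an mbox file into the mail table; return the count."""
    db.executescript(SCHEMA)
    rows = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            try:
                sent_at = parsedate_to_datetime(str(msg["Date"])).isoformat()
            except (TypeError, ValueError):
                sent_at = None
            has_att = int(any(part.get_filename() for part in msg.walk()))
            rows.append((str(msg["From"] or ""), str(msg["To"] or ""),
                         str(msg["Subject"] or ""), sent_at, has_att,
                         msg.as_bytes()))
    finally:
        box.close()
    db.executemany(
        "INSERT INTO mail (sender, recipient, subject, sent_at, has_attachment, raw)"
        " VALUES (?, ?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)
```

After db = sqlite3.connect("mail.db") and an import, a query like SELECT subject FROM mail WHERE sender LIKE '%alice%' AND has_attachment = 1 covers the sender/attachment searches the submitter asked for.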
'We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.' RPF
I use plain text or HTML if it has embedded pictures. Works great.
Check out Zimbra Desktop, it may be able to handle all the old formats. I use it to download my yahoo mail without paying for premium yahoo garbage in order to back it all up. It's open source and has a linux version.
one per year, done.
Best method of storing and searching old email? Gmail. It can import from pop and imap so you can point it at your other inboxes and let it get on with it. You can upload from other mail clients to Google's imap server. Obviously it's amazing at searching through the archives.
Best method if you're concerned about Gmail's privacy? I'm still working on that one.
A latent existence
I keep mail archives going as far back as 1996 on my home box in mh format. Sylpheed (my usual mail client), alpine (used over ssh), and nmh (occasionally used in scripting fashion) can all access it, plus I've got the usual Unixy goodness of grep and find and so on. It's a robust and simple setup.
I pull mail from my server onto my home box via POP. Why anyone wants their e-mail archives on a box that's not under their physical control is beyond my comprehension.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
I have email there from 2005 or earlier and I can get it on pretty much any device I want.
Design a MySQL database for storing your mail messages, keying on sender, subject, date, and presence of attachments (bonus points for storing the attachments as blobs rather than as external files). Then write a perl script that'll automatically parse all your incoming email and convert it to database entries. I suppose if you're lazy the script could just monitor your mail spool, but it'd be better to just have it listen for incoming connections and handle the mail directly.
Next, make copies of that script, modifying as necessary to process all your old mail archives.
Oh, and you'll need to write another perl script to access all new mail - not from your mail spool, but from this database. You should probably name this system after some animal too. If you absolutely MUST have a graphical interface on it, don't use anything newer than TCL+Tk - but going with curses would be a better choice.
Oh - it has to be GPLv3, or we'll hate you and probably mailbomb your machine.
What - isn't that the Slashdot way?
#DeleteChrome
You don't need all those e-mails. Keep the few you actually care about (copy and paste the text into a regular file, and save any attachments you want), and get on with your life.
People that keep every e-mail are weird. Quit living in the past.
I need to archive emails that I can search later - but with a twist. These are employees who've left the company. I can't keep 'em on at Google Apps 'cause I have to pay for that by user. So I use IMAP (making sure to set Chats to be shown in the IMAP list), create an account in Thunderbird, and slurp it all on to the local machine. It keeps all the folders, although it doesn't seem to be smart enough to figure out multiple labels, so it looks like it downloads the same email multiple times, once for its folder, and once for "All Mail." Then I delete the account at Google. You just have to be sure to click through all the folders in Thunderbird and make sure it is done downloading before you blow the Google account away.
You can even read them in a text editor, every half decent email client can use them and there are free or cheap converters for the email clients that are not half decent.
http://notmuchmail.org is Gmail for people that don't trust Google. Works great with your existing IMAP server using offlineimap.
As soon as gmail made IMAP available, everything went there. I used to get my stuff via POP and saved it all going back to the early 90s. When IMAP went live on gmail, I let it chug away for hours and hours until it was synced and all my archived stuff was stored on my gmail account. They've been bumping up the limit faster than my mail's built up so I'm now at 3.9 gigs used of 10.1 available, holding about twenty years of email. I have email clients on a desktop and couple laptops that I fire up every couple of months to sync with gmail and keep local stores in the event that google screws up and loses my data. (I like to think I'd be smart enough to disconnect from the internet before accessing the local clients if my gmail account ever went blank but I've got multiple copies just in case I forget.)
I know that won't work for email fiends who pile up a gig a month but it works for me. I don't even bother sorting my email any more. It's faster to just search. Not like the old days when it would take my email client half an hour to slog through all the messages. :)
Set up a local courier IMAP server and copy mails there, and archive the Maildirs...each message will be a file and you can use tools like grep to search the Maildirs
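grep works fine on a Maildir for plain text, but it can't tell you which messages have attachments. A small Python sketch using the standard mailbox module can filter on sender and on attachments without any server running. The function name find_messages and its parameters are my own invention; the "has a filename" test for attachments is a simplification.

```python
import mailbox


def find_messages(maildir_path, sender=None, with_attachment=False):
    """Yield (key, From, Subject) for Maildir messages matching the filters."""
    box = mailbox.Maildir(maildir_path, create=False)
    for key, msg in box.iteritems():
        from_hdr = str(msg["From"] or "")
        # Case-insensitive substring match on the From header.
        if sender and sender.lower() not in from_hdr.lower():
            continue
        # Treat any MIME part that declares a filename as an attachment.
        if with_attachment and not any(p.get_filename() for p in msg.walk()):
            continue
        yield key, from_hdr, str(msg["Subject"] or "")
```

For example, list(find_messages("/home/me/Maildir", sender="bob", with_attachment=True)) would list everything Bob ever sent with a file attached (the path is a placeholder).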
Does this classify you as a hoarder? :) I know I am!
Call it "Lexi Diamond - Ronda Rousey mud wrestle" and share it on a torrent and soon the whole world will back it up for you....
Seriously though, even if you were a previous email hoarder, you will likely be able to comfortably archive all your emails *and* the tools needed to access them on a USB stick. Start by finding all the tools you need, source included, and place them on your storage medium. Compress it. Send it to the cloud.
Mail files can be stored by year (easy enough to do with awk or other mail tools). It will be a lot smaller than some may think when you compare the size of your mail spool to the typical Library of Congress (10 Terabytes around 2002). Newegg currently has a 3TB drive for $140...
Just sweep it all into the Trash Bin, breathe deep, and move on with your life confident in the impermanence of all things.
Namaste!
Left MS Windows for Linux Mint and never looked back!
Vote for Bernie in 2016!
Take your pick.
Troll is not a replacement for I disagree.
I wouldn't posit this as the best way, but it's what I do. I keep my archival mail on a local filesystem arranged in directories, stored in the old-school mbox format. I run Dovecot under OS X for IMAP access to those messages from anywhere; when I need to search through the whole collection, I use mairix (an indexing and retrieval system).
Just delete some goddamn email.. hoarder!
"My immediate reaction is "WTF? What kind of moron doesn't make things 64-bit safe to begin with?" Linus
Simple. Archive mail by the year as it gets too big. Use mutt's search for the basic searching and maildir-utils for the heavy lifting.
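Splitting by year is easy to script if your client won't do it for you. Below is a hedged Python sketch that reads one big mbox and writes one mbox per year based on each message's Date header. The function name split_mbox_by_year and the "unknown.mbox" bucket for undated mail are my own choices, not anything mutt or maildir-utils provides.

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def split_mbox_by_year(mbox_path, out_dir):
    """Write each message to out_dir/<year>.mbox; return {label: count}."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {}  # label -> open mailbox.mbox
    counts = {}
    src = mailbox.mbox(mbox_path)
    try:
        for msg in src:
            try:
                label = str(parsedate_to_datetime(str(msg["Date"])).year)
            except (TypeError, ValueError):
                label = "unknown"  # missing or unparseable Date header
            if label not in outputs:
                outputs[label] = mailbox.mbox(
                    os.path.join(out_dir, label + ".mbox"))
            outputs[label].add(msg)
            counts[label] = counts.get(label, 0) + 1
    finally:
        for box in outputs.values():
            box.flush()
            box.close()
        src.close()
    return counts
```

The original file is left untouched, so you can compare message counts before deleting anything.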
To those saying keeping email forever is hoarding: not if it's done right. You'd be surprised how useful it is to go back and find an email from four years ago.
.... "News for nerds. Stuff that matters" these days?
Oh, and stick it all in imap.
heh - i have all my email going back to '98 in Outlook Express. Best email program ever! It's nearly perfect for what i want. (Any way to get it to do inline spell checking, ie, underlines misspelled words as you type?) Still running it on an XP box. Been using Windows Live Essentials a bit for Win8, it's not horrific, but lacks some of the characteristics..hope MS injects some of the OE spirit into it..
I have been using Mailstore for this purpose for the last few years. Works for my gmail, hosted exchange and my old, unlamented exchange system. Faster to find things with their query than Thunderbird/Outlook search. And the price was right. Before I retired, I kept separate email archives for my major clients -- made it easy to cleanup the file when the relationship wound down. This no longer matters. Everything ends up in Mailstore now -- except the immense quantities of spam. Works for me -- your mileage may vary.
An old open-source tool called hypermail may be what you're looking for. It parses mbox files and produces HTML pages with the emails sorted by thread, author, subject, date, etc. http://hypermail-project.org/
Eudora still runs on my Win7 box. I have email going back to at least the early '90s. All plaintext and easily searchable.
Upload it to one or several Google accounts and you have a permanent searchable archive.
I use Thunderbird.
My mailboxes are all IMAP, so I found a use for the Local Inbox in Thunderbird that I always thought was a useless feature.
At the end of the year, I create a subfolder labeled by Year, and I download all copies from the year before the last (eg, my last download was of 2011 emails), then I purge them from the IMAP server to save space. This way I still have universal access to my last year's emails but easily searchable archives available at home.
If you keep regular backups of your /home dir then you need not worry about losing them.
I'd say follow the same rules as any archiving of media:
:)
Pick one format and migrate all of your messages to that: In this case, I'd say mbox. Thunderbird and most other mail programs read it and you can get most of your mail into mbox format via IMAP/Thunderbird from whatever mail client can read your old ones. You can store your mbox files locally in Thunderbird and gain Thunderbird's searching (for instance) without the need for an actual back-end. I was able to read some mail stored in Netscape Mail because it was just mbox files and opening them in Thunderbird was a breeze.
Most importantly: Every 5-10 years, re-evaluate your storage choice. Is Thunderbird still around? Is mbox still pretty well regarded? If you find you need to migrate again, do it! If both are still active / supported, then hold onto 'em. The only way to perpetually maintain media access is to make sure your choices are still valid on a regular basis. This is true for any media: As the old formats go obsolete (cassette tape, VHS), you need to migrate that data to the next readily accessible format (CDs, DVDs; FLACs, MPEG(?)).
I think the biggest problem is that you have a mish-mash of stored files right now. You'll save yourself a headache in the future by tearing the band-aid off now and taking the time to get all of your mail into one format. Then, in the future, when you need to convert, it'll be many steps easier since you won't have to visit Slashdot and find out what to do about your mail again next time.
Does there exist any program that's basically a lightweight Windows auto-starting (but 99.999% asleep and inert unless you're actively using it) background service that does nothing besides act like an abstraction layer between some kind of reasonable file-based mailstore roughly analogous to an Outlook .pst file (AFAIK, canonical Maildir is a physical impossibility under Windows) and any IMAP-compatible email client?
I don't care about being able to access it from anywhere besides my local PC... binding to localhost, and refusing to talk to anything external to my PC is fine. I've just had it with the mess Thunderbird's developers made of their local mailstore right around the time it completely went to hell ~4 years ago (well, and the mess they made with Thunderbird in general). For years, I just moved mailstore files around. Then, for some insane reason, it seems like Thunderbird's files just kind of exploded and proliferated... and worse, did so in ways that seem to screw up and confuse newer versions if you try to make them use files from an older version. If I could just run a semi-fake local IMAP server on my PC to abstract my mail storage away from Thunderbird itself, I could try other mail clients without having to worry about how I'm going to get my mail into them (a remote IMAP server is out of the question... I literally have gigabytes of email, some that literally came from Eudora Pro more than 18 years ago and just got converted and converted as I went along).
I thought about the usual option of an ARM-based mini-server or an old laptop, but I also have zero-tolerance for server slowness. Stutters and hiccups are bad enough without adding a server that has the resources & performance of a 500MHz Pentium III (on a good day) into the equation. At least if I'm running it locally & it spends 99.999% of its time harmlessly asleep, when I *do* go to access it, it'll have the full resources of a quadcore 3.2GHz i7 behind it for nearly-instantaneous response. The problem with running a full-blown IMAP4 server on my PC is that it's going to always be soaking up ram, and running at a higher background level (anticipating constant remote users it'll never actually see). I just want something that runs as fast and hard as it needs to and can when called upon to do so, then goes and silently hides in the corner until the next time I speak to it.
I run qmail for sending/receiving mail (on Gentoo; netqmail package), using maildir, of course. On top of that, I run the Courier IMAP server on my internal network (with TLS encryption). Until a few months ago I used Mutt as a client (console-based), but I've moved to using Roundcube (web-based email), which I initially installed for my wife, and have been happy with it. I also have some automatic filtering to folders via Maildrop (another Courier utility; it looks at a ~/.mailfilter file to route mail).
Roundcube/the IMAP server's search is OK most of the time - I keep my inbox small and move older mail to sub-folders - when I want to do advanced searches or search large mailboxes I log in and grep through folders of interest; this works well with the maildir format with one file per message. Maildir was also quite resilient when I had a HD crash and needed to recover some lost mail (block scan for blocks that look like mail headers found most missing items, and I do better backups now - mail is under ~/.maildir and gets backed up automatically).
I would move older messages to maildir (there are plenty of mbox converters, and almost anything non-proprietary should be convertible to mbox or maildir via existing programs or a short perl script) - even if at some point maildir dies off entirely, which seems unlikely, converting it to another format will always be trivial due to its simplicity and it has the advantages mentioned above of being able to search easily with grep etc.
Convert to maildir (one message per file). Then upload into an object storage service, such as Amazon S3 or something based on Openstack Swift. Those services are designed to handle millions of objects in a single, flat namespace.
Since S3 and Swift both publish over HTTP, you can use some sort of simple text document indexing service (local Solr installation, maybe? Dunno). Object storage is perfect for archiving and storage of emails, but I don't know of any commercial or open source implementations around that yet.
I use PSTs and nightly backup.
Sure, you can use GMail or the amorphous cloud for your purposes, but quite frankly, remember - if it's not in your possession, it's not as secure as it could be.
No, I don't have world-ending secrets in my possession, but yes, I do get paranoid about my data.
Striking fear in the authors of godawful fanfiction, I am here, appearing in darkness, Tuxedo Jack!
Force feed? WTF are you talking about? Dovecot can use any major mail format. Just set MAILDIR if it's in a non-standard directory. So the whole procedure is:
yum install dovecot
vim /etc/dovecot.conf (only if using a nonstandard mail location)
service dovecot restart
set username and password in GUI client
I never will understand why some people feel the need to post on topics they don't have the slightest clue about.
because LinkedIn is sending me suggestions that I know people, that I know for a fact I only corresponded with a few times ten years ago on another old email account, and know no other way
Just print them all out. Why?
1. Reliability. Paper lasts longer and is not subject to bit rot, computer crashes, or system failures.
2. Redundancy. Just make extra copies.
3. Disaster recovery. Ship a second set of copies to another city.
4. Indexing and retrieval? Boost the economy and create jobs by hiring a nephew/niece/virtual assistant.
I have them backed up in my gmail account as well, but basically I have an IMAP folder called Archive and all of my old mail is in there. The search in mutt ignores those folders unless I am in those folders. Tada!
As a side point I have 51GB of mail!!! (1988 onward)
MH stores each email as a plain text file, each folder as a directory. It uses the unix filesystem as its database. It's very quick and has tools to re-order a folder quickly.
In addition, MH has tools to convert mail formats. It was designed in the days of low cpu power and small disks. It also lent itself well to being wrapped by other tools like xmh, exmh and mh-e so you don't have to learn the raw MH commands.
Yes, IMAP is cool, but don't discount MH. Plus the O'Reilly MH book is free as a PDF.
Oh, some IMAP servers and mail clients use MH format or something derived from it.
The 500 MHz Pentium and the Core i7 will have roughly the same performance in this use case because IO is the bottleneck. The speed is the speed of the disk and filesystem.
To be more specific, a Pentium has a throughput of around 2 GB/s. Compare to 10 MB/s for a 7200 RPM drive doing random access on small files, 100 MB/s on large ones.
So it's entirely reasonable to use a small low power Linux system like a Western Digital World Edition network drive or the ARM based stuff you mentioned for IO bound applications such as a file server or IMAP. You won't lose any appreciable performance.
I gathered up all my historical email records a few years ago, and used Aid4Mail to convert all the various mailbox formats to the common format I use today. Choose a format that's convenient for you, and standardize on it. Here's the product website: http://www.aid4mail.com/
Find a mail program that you like and stick with it. Important factors to consider:
* How it stores the mail and attachments: mbox or other ASCII format good, proprietary binary format like PST, bad.
* How well it manages years and years worth of 10's or 100's of emails a day.
* How gracefully it fails from data corruption (this is where storage configurations that keep eggs in separate baskets are a very good thing indeed)
* Something with good importers. If not, there are 3rd party programs and services that claim to be able to convert from any mail client to another.
Personally, I've used Eudora for the last ten years (v7.x when it was still maintained by Qualcomm, not that godawful travesty Mozilla cobbled together, which is just Thunderbird dolled up to look slightly like Eudora but function nothing like it) and have scarcely considered anything else.
Yes, it's a bit goofy, requires some advanced trickery at times and the configuration screens might as well all be labeled "Miscellaneous", but it more than makes up for it...
No other mail client can come close to the MDI that lets you view endlessly configurable summaries of any number of mailboxes at a glance.
It stores all mail in plain text (close to mbox, but not quite ... though close enough that you can grab an mbox file and trick Eudora into thinking it's a native file without any manual editing) and dumps all attachments as normal files into a single directory. Yes, that directory becomes kind of ... well, huge, so it's kind of klunky in that way, but it does ensure that you can access those files at any time without having to deal with any interface beyond the operating system.
Mail consumes pretty much just what that amount of text and files would. Meta data and configuration takes up little else.
I have 10+ years (~3GB) of email and there are zero performance issues. It does offer the option of indexing mailboxes for faster searches as well.
It is truly the geek's mail client. I love this mail program so much, I will use it in a VM when it eventually becomes incompatible, but it works problem-free on Windows 7. And even in the event that it were somehow unusable, I'd still have access to all my mail; after all, it's just a bunch of flat text files.
Try Mailstore http://www.mailstore.com/.
Our family / family business has run, with increasing formality, email servers in various flavours since the mid-90's. These servers have processed messages including everything from lots (like really lots -- in the tens of thousands at least) of family pictures to (no doubt) lots of personal email of the many dozens of staff who have worked with us over the years. In general, the server settings have always been set to "retain everything", including full Exchange journalling, because there was no way to delete things without risking losing some important pictures someone sent to someone else.
I'm not too worried about the business activity traffic, because anything recent is well replicated in many other places -- primarily in various cached Outlook data files. But where family members threw away their old machines, the only copies of these important things are in the server journals we have archived. Is there some solution that can rationalize these millions of messages into some sort of structure?
In addition, I presume that this can only be done for individuals who actually want old items to be retrieved from the archives, as anyone else would be protected by privacy rights.
No. Well...maybe. Actually, yes. It really just depends.
Try IMAPSize
Since your email service is IMAP based, it will make an off-line cache which you can periodically sync.
You can search it in various ways, and it will let you take backups.
You can even use it to migrate your mail history to another IMAP server.
Enjoy.
I've got 16 years' worth of email in a multitude of formats, including all of those that are excoriated as unreliable, fatally flawed, or Satan's preferred means of communication with our world.
I have them on O-L-D CD-ROMS, DVDs, saved to two cloud services, on my personal server, and tar'd/zipped/RAR'd/Stuff'd in some of the same places, and up to three copies in each of these places. I wonder if the .sitf files will even decompress, but no point in deleting them.
I've really only used a very few mail clients though; pine, elm, Eudora, GroupWise, Outlook in so many versions, POTP. My own servers have been the usual evolution of Sendmail, Dovecot, and now dbMail. And I use Yahoo! and Gmail as mirrors. Yahoo! Mail is my spam bucket as well as second realtime mirror, Gmail I use as a mirror and for some primary communications. my 'personal' email has been the same since 1996, but I've had three work emails.
And my email archives are close to unusable, of course. I guess I should try and take some of this advice.
And when I do rifle through the really old stuff, I need to put it through a spam filter. Some of that old spam people would pay for today. Not the IRC stuff.
I really should take a week and make sense of it. Naw, crap, who am I kidding? 1996?
deleting the extra space after periods so i can stay relevant, yeah.
If you just want "archival, for the next 5-10 years, then redo it all over again as technology changes" then the other answers in this thread are what you want.
If you want "archival, for 20+ years, without having to do it over every 5-10 years" then some form of human-readable plain text or at least represented-as-plain-text-for-attachments is what you want. Make sure all file attachments are in well-documented formats (e.g. JPEG) so someone will be able to write a decoder for them 20 years from now if one isn't readily available. If they aren't, be sure to store file-format information with your archive.
If you want "archival, for 200+ years" then you want all of the above, stored on archival media that are likely to be readable 200+ years from now along with a description of how to interpret html and file attachments. Archival paper, archival microfilm, archival "etched onto plastic but microscopically" media, etc. are what you want.
If you want "archival, for 20,000+ years" then talk to the people who are working on how to label long-term (10,000+ -year) nuclear waste storage dumps, they may have some ideas that work.
If you want "archival, 2M+ years" then I'm out of ideas. Look me up in 2M+1 years and tell me what you found that worked for you.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
http://www.mailstore.com/en/mailstore-home.aspx
Works well, searches are quick, and it's local.
Unfortunately it's Windows only, but it may work fine under Wine.
Music the Paint dancefloor the canvas your body the brush
I've got six years of archived email. I like Thunderbird's archiving scheme. You can have it automatically create archives as a calendar year goes by.
If you're not actually working with the old e-mails, and you don't mind waiting a few moments to search them, just keep the raw e-mails, in raw transmissible e-mail format, and be done.
They are nothing more than a whack of text files at that point. And they are properly formatted with headers and everything.
Want to search? Full text search and you're done. Want to search by subject only? Simple regex search /^Subject\:.*?cucumber/ finds "cucumber" only on the subject line (yeah yeah, header folding exists, this isn't a regex lesson).
Every e-mail client from the birth of the first one until the death of the last one supports raw e-mail formats. And you can probably just pipe them all to sendmail and send them all again.
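For those who do care about header folding: a short Python sketch can let the standard email parser unfold (and RFC 2047-decode) the Subject before the regex runs. The function name search_subjects is made up, it assumes one raw message per file under a directory, and unlike the regex above it matches case-insensitively, which is my own choice.

```python
import re
from email import policy
from email.parser import BytesHeaderParser
from pathlib import Path


def search_subjects(root, pattern):
    """Return sorted paths of message files under root whose Subject matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    parser = BytesHeaderParser(policy=policy.default)
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # Only the headers are parsed; the body is left as an opaque blob.
            headers = parser.parse(fh)
        subject = headers["Subject"]  # unfolded and decoded by policy.default
        if subject and regex.search(str(subject)):
            hits.append(path)
    return sorted(hits)
```

So search_subjects("/archive/mail", "cucumber") finds the word on the subject line even when the header was wrapped across two lines, and ignores messages that only mention it in the body.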
All of that said, I'm a big proponent of forgetting the past. Hoarding is consistent with many psychological problems.
I've been using MailSteward on OSX. The starter version handles 15k or so entries using SQLite before it starts to bog, while the trade up is a front end to MySQL.
Luke, help me take this mask off
All you need to do is export to UNIX MBX, then you can convert to anything
for windows, thebat! can doo this shiznits
Store it all as plain text files (mbox format?), and write a quick script to send it all to an ElasticSearch index.
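Here is roughly what that "quick script" could look like in Python, stopping short of the network call. It turns an mbox into the newline-delimited JSON that Elasticsearch's _bulk endpoint accepts (one action line plus one document line per message). The index name "mail", the document field names and the function name mbox_to_bulk_ndjson are assumptions of mine, and the action line assumes a version of Elasticsearch that does not require a document type.

```python
import json
import mailbox


def mbox_to_bulk_ndjson(mbox_path, index_name="mail"):
    """Return an NDJSON string for Elasticsearch's _bulk endpoint."""
    lines = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # Use the first text/plain part as the searchable body.
            body = ""
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True) or b""
                    charset = part.get_content_charset() or "utf-8"
                    try:
                        body = payload.decode(charset, errors="replace")
                    except LookupError:  # charset name Python doesn't know
                        body = payload.decode("utf-8", errors="replace")
                    break
            doc = {
                "from": str(msg["From"] or ""),
                "to": str(msg["To"] or ""),
                "subject": str(msg["Subject"] or ""),
                "date": str(msg["Date"] or ""),  # raw header text, not parsed
                "body": body,
            }
            lines.append(json.dumps({"index": {"_index": index_name}}))
            lines.append(json.dumps(doc))
    finally:
        box.close()
    return "\n".join(lines) + "\n" if lines else ""
```

You would then POST the result to the cluster's /_bulk URL with Content-Type: application/x-ndjson, using whatever HTTP client you like; for a big archive, send it in batches rather than one giant request.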
Seriously... Just delete that old, stale useless data and move on. Stop being OCD and hoarding that old crap...
The problem is that a throwaway email might become critically important later on. There is no way to know in advance what is important and what is not.
True story: while deployed in the Army, our communications guy could not find a piece of equipment which was very important and very pricey. He had been signing the monthly inventory forms saying he had it, assuming it was in a cabinet. He could not find any paperwork showing it was signed out - it had just disappeared sometime in the last 3 months and no one had seen it.
On a long shot, I started searching my email - since I keep every last one. Sure enough, about 2 months prior, there was a throwaway email from him to the effect that he was going to turn in item X for repair since it was acting flaky. He checked at the contractor mentioned in that email, and it was sitting on the shelf waiting for pickup.
Support microSD: in a post 9/11 world, it is unwise to carry your data on media that you cannot comfortably swallow.
tar tzvf | grep
I've got email going back 20 years, and this has not failed me. Maybe you need to sub 'x' for 't' and use less, but don't over-engineer this problem.
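If you would rather not unpack anything, the same idea fits in a few lines of Python: a sketch that greps the contents of every file inside a (possibly compressed) tar archive. The function name grep_tar and the archive name in the usage note are made up for illustration.

```python
import re
import tarfile


def grep_tar(archive_path, pattern):
    """Yield (member name, line number, line) for matching lines in a tar."""
    regex = re.compile(pattern.encode())
    # "r:*" lets tarfile detect gzip/bzip2/xz compression on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    text = line.rstrip(b"\r\n").decode("utf-8", "replace")
                    yield member.name, lineno, text
```

Usage would be along the lines of `for name, n, line in grep_tar("mail-2003.tar.gz", "cucumber"): print(name, n, line)`.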
100 REM PISS OFF CODE FASCISTS 200 GOTO 100
And this is what passes for snarky comment, now-a-days?
Back in my day, we'd walk a mile uphill in the snow and get our 9600 baud modems connected before chiming in from the peanut gallery.
100 REM PISS OFF CODE FASCISTS 200 GOTO 100
I just put the original clay tablets in shoeboxes and stack them in my garage.
Sheesh, evil *and* a jerk. -- Jade
You should probably get Dr Daniel Jackson or Samantha Carter or Dr Rodney McKay to translate it into english first.
Parchment -no less- does it for my ancient emails.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I had to archive the emails since 1996. They were in multiple formats - Outlook Express dbx, Mailbox from Netscape Navigator and Thunderbird, Outlook.
I converted all of them in .eml format. It's a simple, text format that can be read by the OS and easily parsed by any program and script. Much better than mbox or something else. Then I renamed all of them according to a rule - YYYYMMDDhhmm [From] [Subject]
Now I can easily find any email. I can browse them using the file system, I can search them using the OS or via a script. Windows indexes them and extracts the metadata so any search is very quick.
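For the curious, the renaming rule can be sketched in Python roughly like this. It is not the script I actually used; the function names, the use of the bare sender address for [From], the 80-character subject limit and the all-zero fallback for unparseable dates are assumptions made for the example.

```python
import re
from email import policy
from email.parser import BytesParser
from email.utils import parseaddr, parsedate_to_datetime
from pathlib import Path


def _safe(text):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|\r\n]+', "_", text).strip()


def archive_name(eml_path):
    """Build 'YYYYMMDDhhmm [From] [Subject].eml' from a message's headers."""
    with open(eml_path, "rb") as fh:
        msg = BytesParser(policy=policy.default).parse(fh, headersonly=True)
    try:
        stamp = parsedate_to_datetime(str(msg["Date"])).strftime("%Y%m%d%H%M")
    except (TypeError, ValueError):
        stamp = "000000000000"  # missing or unparseable Date header
    _, sender = parseaddr(str(msg["From"] or ""))
    subject = str(msg["Subject"] or "no subject")
    return f"{stamp} [{_safe(sender)}] [{_safe(subject)[:80]}].eml"


def rename_all(directory):
    """Rename every .eml file in a directory; skip name collisions."""
    for path in sorted(Path(directory).glob("*.eml")):
        target = path.with_name(archive_name(path))
        if target != path and not target.exists():
            path.rename(target)
```

Because the timestamp comes first, a plain alphabetical directory listing is also a chronological one.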
Delete them
I switched to Office365 for this. $8/month for unlimited storage and good bandwidth.
I used to use mbox format on a regular Linux web host (pair networks) from 1995 which worked fine. But they weren't scaling up their storage allowance as technology progressed, so as my archive grew bigger, I was paying too much for it. Tipping point was about 2005.
Next I switched to maildir format on an openSUSE box running in my basement, 1TB RAID2 hard disk backed up automatically to a 1TB USB drive, and also mirrored (using unison) to an offsite machine. My email archive was important to me and I never wanted to lose it.
But this was a pain. It was a pain to administer the system, a pain to make sure I had spare disks for when they failed, a pain to be sure my software RAID would even work, a pain to make sure my firewall was always open to inbound IMAPS, a pain to periodically move email from pair networks to this archive every year.
Also I provide email for my family on the other side of the Atlantic, and this basement server wasn't suitable for them (not enough uptime when e.g. rewiring my house).
Office365 wound up being cheaper for my family than pair networks. It has an "unlimited" $8/month plan for me, and 25gb $4/month for my family. It has a decent enough webclient and great (fast) online search, far faster than any searching I did with mbox or maildir servers. I feel more secure with its reliability and uptime. And being a Microsoft employee (C#/VB language design team, unrelated to Office) I use Windows devices and email clients, which generally work better with Exchange than IMAP.
Zip disks.
Use www.weirdkid.com/products/emailchemy to ensure your mails are "normalized" to a homogenous format ( rfc2822 ). Then, find a Linux solution akin to www.mailsteward.com to manage your archive.
I have a number of messages from the mid 80s that are in MMDF or PMDF format as well as mbox, but they are on a reel-to-reel tape and my new computer doesn't have any place for the tape to go.
Can anyone in Melbourne read a 9 track tape?
I don't trust anyone else for mail than myself.
I have all mails dating back to the early 90s in monthly mbox files. Every year or so I gzip -9 those mbox files.
For archival purpose the first rule in my procmailrc is "write copy of email to archive". I _never_ delete mails from the archive.
Looking up mails is done by "grepmail" which constructs a new temporary mbox file from old mbox files by searching header and/or body.
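For readers without grepmail at hand, the idea can be approximated in Python. This is only a sketch of the technique, not how grepmail itself works: it splits each (optionally gzipped) mbox on lines starting with "From ", and copies every message that matches a pattern into a new mbox file. The function names are invented for the example, and it assumes "From " lines inside message bodies were escaped when the mbox was written.

```python
import gzip
import re


def iter_mbox_blocks(lines):
    """Group byte lines into messages, each starting at a 'From ' line."""
    block = []
    for line in lines:
        if line.startswith(b"From ") and block:
            yield b"".join(block)
            block = []
        block.append(line)
    if block:
        yield b"".join(block)


def grep_mboxes(paths, pattern, out_path):
    """Copy messages matching pattern (headers or body) into out_path."""
    regex = re.compile(pattern.encode(), re.IGNORECASE)
    hits = 0
    with open(out_path, "wb") as out:
        for path in paths:
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rb") as fh:
                for block in iter_mbox_blocks(fh):
                    if regex.search(block):
                        out.write(block)
                        if not block.endswith(b"\n"):
                            out.write(b"\n")
                        hits += 1
    return hits
```

The resulting file can then be opened with any mbox-capable reader, e.g. mutt -f on the output path.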
That's approx. 5GB of gzipped mbox files for my private emails.
mbox files are much better for archival and compression than maildirs are.
This all fits well into my email setup with mutt, procmail, bogofilter, gpg, and fetching mail with UUCP to my laptop. It all sounds very '70s, but I am used to having access to all my emails without network connectivity. Open the lid, read/write mails, answer mailing lists, etc. The next time I have IP connectivity, e.g. a low-bandwidth GSM connection, I exchange mails: compressed, failure-resilient, etc. UUCP has long been forgotten, and people try to get the same comfort with offlineimap etc ...
A UNIX system be it a laptop should have email connectivity.
Flo f@zz.de
Anybody having a Greasemonkey script to filter out this asshole?
It's getting tedious lately, my Finger hurts from scrolling.
You had better choose a safe and well-tested format. You know that DVD, HDD, SSD or similar modern storage technologies are not reliable enough or do not have a historic proof of reliability. Choose something with centuries of history. You know, my grandma's love letters and photos survived WWII and all the bombings. And birth records on paper go back to the Middle Ages. But you may turn for more sound technology to ancient Mesopotamia.
Personally, I'm lazy. I've been using Pine (now Alpine) directly on a mail server for all my mail since 1995 (on my own servers since '97). Old habits die hard.
It works great over really low bandwidth connections (though sometimes high latency can be annoying), you can view any attachments you need automagically with X11 forwarding via SSH, and you don't care at all about which machine you're accessing it from. Also you get to read the TEXT in your mails & not HTML, most of which is useless garbage when it comes to emails (for the 0.1% of HTML mail I do actually need to read as HTML, such as tables, Linx often gets the job done, & if not I just bounce it to my gmail account, which is pretty much full of spam otherwise).
When various folders get Too Big (or I move on to another job, or whatever) I move them into an "archive" folder (& I have an "old-archive" folder for the really ancient stuff) and bzip2 them. I archive my inbox files at the beginning of every year too. When I need to find something old, I just bzgrep for it. After an archiving session (which takes all of 5 minutes) the whole thing gets backed up from my mail server to my NAS at home.
Did I mention that my backup MX is a SparcStation 20 and still works just fine for all this? Of course I don't keep much on it but if my main server dies I can still send & receive mail just fine.
Note that this is not exactly something I sat down & spent time thinking about, I just started moving mail out of the way like this when I left college & built a couple of OpenBSD mail & DNS servers, and kept doing it as it works well enough.
"An anonymous reader writes" - What? I've been on Slashdot for a while and enough is enough. "An anonymous reader" shouldn't be able to submit articles. "Anonymous" cowards are already trolling Slashdot to dead. If you don't have the guts to post under your username, then why should you have the right to post anything?
Just my opinion.
Ugh. Drop all that stuff. Who needs it? My gmail folder has 20 messages in it. Lighten your (psychic) load.
I would convert everything to Maildir and either use mutt directly on them and/or run a local Imap server and rely on its searching capabilities.
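The conversion itself does not need much. A minimal sketch with Python's standard mailbox module, assuming the old mail is already in mbox files (the function name and the paths in the usage note are placeholders):

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    # create=True makes the Maildir (with tmp/new/cur) if it does not exist.
    dst = mailbox.Maildir(maildir_path, create=True)
    count = 0
    try:
        for msg in src:
            # MaildirMessage() carries over mbox status flags such as "read".
            dst.add(mailbox.MaildirMessage(msg))
            count += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return count
```

Calling `mbox_to_maildir("old/Inbox.mbox", "Maildir/archive-2003")` once per legacy file leaves a directory that mutt or a local IMAP server can read.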
I have more than 13 years' worth of archived mail; I keep two bzip-compressed mbox files for each month: Sent-YYYY-MM.bz2 and Received-YYYY-MM.bz2
Searching is a bit slow, but I hardly ever have to search that far back so I don't mind. More recent mail (going back about a year) stays on the IMAP server. Also, my company produces an email archiving product that lets me search very quickly based on sender, recipient, subject, full-text body search, etc. which is great for mail going back up to about two years.
I fail to see the problem. I have mails going back a decade or more all stored in maildir on an imap server. Done. I've changed clients several times, servers several times, no problem.
So what's the problem that makes an "ask slashdot" necessary?
Assorted stuff I do sometimes: Lemuria.org
Easy to back up the files, and a documented format.
This one will probably get buried because of the sheer weight of comments in this thread... but here goes.
I had the same quandary about four years ago; mail going back a decade at that point which I wanted to keep around. It was in various clients, as in your case. What I did was build a POSTFIX / IMAP server using (at the time) Gentoo. I then attached those clients and simply copied all the archived email up, one client at a time. I then went about building a SquirrelMail front end which did great for a while.
The problem as you can probably ascertain was search. It was tough to trawl through all those emails... but last year I converted my entire email system to Zimbra and simply did an IMAP import of all the data from my old IMAP server to the Zimbra database. While Zimbra still stores everything in MBX format (I think), it also uses MYSQL to store index data. It also happens that Zimbra has a really nice web front end, and everything's really nicely integrated. Now I have email going back 15 years or thereabouts, all searchable in pretty swift order. I added the Zimbra Desktop app to my laptops and I even have a local cache. As for backups, I have a Linode running a custom kernel and the ZFS filesystem, and nightly I have a script on my server that backs up the entire Zimbra store using "zfs send / zfs recv". Since my entire email store is around 9GB it isn't terribly expensive... and I use the same Linode for hosting a hub for my OpenVPN network... which means all my computers can communicate privately from anywhere in the world across a constantly up VPN tunnel.
And for those who think you don't need to keep all that email... bully for you. I have had to refer to decade old emails before in order to provide better service to my customers. My email archive also came in very handy during the divorce from my ex wife for reasons you can probably imagine but I'd rather not get into. That's also handy stuff to keep around... just in case.
Wow, lots of unhelpful 'dump it' stuff here. Allow me to restate the question in a fashion that might draw a decent response.
I need a mail archiving solution. I have lots of mail/mailboxes, some years old that must be retained due to policy or legal requirement. I want to off load it from my client(s) and server(s) for performance and backup reasons, but I need to be able to go back and extract messages for evidence, discovery, or what-have-you.
What's an efficient and inexpensive way for me to archive my mail away from my email client and server, yet keep it available, should I need to search for something? Ideally, the solution would not be home built/custom. A COTS solution seems like a better idea for the sake of ongoing compatibility.
There are many solutions for Outlook/Exchange, but they don't support Evolution and they are also very expensive. What are my options for non-Microsoft systems?
You could just leak them online. They'd be around forever.
IANAL, but those emails pose a HUGE legal liability if you ever get sued. You might think it's innocent enough -- maybe a cat picture or something -- but you have no idea how creative a lawyer can be. Perhaps he'll try to claim copyright infringement or something.
You need to take the complete opposite approach. You should only be archiving emails that have a clear need to be retained. I realize you cannot always know that in advance. However, in the rare occurrences I didn't have an email I needed, I was able to get the information another way. IMHO, you are far better off risking not having an email than a sh!t storm of legal woes from having too many emails.
I did this a bit over a year ago with my Gmail account. I wanted to have a local backup "just in case" I had problems with Google in the future (I think the Google+ push-out contributed to my motivation). I know it's not entirely analogous to your scenario because I've only got a single source, but the process should be adaptable.
I created a loopback ZFS filesystem (gmail.backup) and a script that runs every day at 11am (cron! w00t!). The script
1. Mounts the ZFS loopback;
2. Marks the ZFS as read-write;
3. Runs offlineimap (python script) to copy gmail IMAP to local IMAP
4. Takes a snapshot;
5. Marks the ZFS as read-only;
6. Unmounts the ZFS
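The six steps above can be sketched roughly as follows. This is not the actual script; the dataset name tank/gmail.backup, the function names, and the choice of `zfs set readonly=...` for the read-write/read-only steps are assumptions, and offlineimap is assumed to be configured already. It defaults to a dry run that only prints the commands.

```python
import datetime
import subprocess


def backup_plan(dataset, today=None):
    """Return the command list for one backup run, in order."""
    today = today or datetime.date.today()
    snapshot = f"{dataset}@{today:%Y%m%d}"
    return [
        ["zfs", "mount", dataset],                # 1. mount
        ["zfs", "set", "readonly=off", dataset],  # 2. read-write
        ["offlineimap"],                          # 3. sync IMAP to local
        ["zfs", "snapshot", snapshot],            # 4. snapshot
        ["zfs", "set", "readonly=on", dataset],   # 5. read-only
        ["zfs", "unmount", dataset],              # 6. unmount
    ]


def run_backup(dataset, dry_run=True):
    """Print the plan, or execute it and stop at the first failure."""
    for cmd in backup_plan(dataset):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Run `run_backup("tank/gmail.backup")` to see the plan before letting cron execute it with dry_run=False.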
To access the archives, I use mutt pointed at the local imap folders (i.e., "mutt -f gmail/INBOX"). To get a "point in time" picture of my email, I point mutt at the relevant snapshot ("mutt -f gmail.snapdir/20120630/INBOX").
I haven't dug into the specifics of retrieving emails beyond the mutt interface, but since each message (maildir format) is a file, I'm assuming that if/when the need arises that I'll be able to get what I need. In the meantime, I've got almost nine years of messages synced to my local system and updated every day.
http://notmuchmail.org/
Recoll http://www.lesbonscomptes.com/recoll/ allows you to search your messages, and it handles most storage formats. You will get duplicates if you index live and backup messages, but you can filter by path in an advanced search.
recoll runs locally, with your messages stored in mbox or most other common email formats, so the archive remains private and out of hosted services, if you want privacy (and / or speed).
(email is just a tiny part of desktop search - you get a keyword index of every document in the search path, with stemming)
I have had the same issue, email archives that are complete from the mid-90s and sporadic emails from the 1980s. What I've been doing is archiving most of the messages in text files in mbox format, one file per month, and I gzip them after a certain period of time to conserve space.
Unfortunately 'grep' and similar utilities have been insufficient to do decent searches on them. What I ended up doing is building my own search utility in python. It allows me to specify multiple search terms, regular expressions or strings, search blocks of files (e.g. in this case finding blocks that are delimited by a starting '^From ' line), as well as automatically descending into directories, tar files, gzipped files, etc. With this I can easily run a search across any set of files that I desire (even if I've tarred and compressed them) and get out resulting output that I can read with a mail reader program such as Mutt. I've found it to be extremely useful for this, as well as almost all other search tasks that I do.
If you are interested in using it, I've made it available on github. It's at https://github.com/bruceisrael/search
Dovecot handles all the formats you mentioned, mbox, maildir, etc...
Then access everything w/IMAP.
I keep everything in mbox format...going back to 1999....
Things are very hierarchical. I don't keep everything. List mails
go into list-boxes and I read them like newsgroups.
I have multiple levels of personal mail.....sorta like google's circles...
but unrelated to that...
Keep it all in /home/lpq/mail ... about 5.1G of it...
I don't bother sorting or categorizing or anything. I just have procmail send a copy to an archive file which I rotate once a year, and I index it all with mairix: http://www.rpcurnow.force9.co.uk/mairix/ . I can search on date, sender, subject, body, etc, and in a few seconds I have what I need.
I am presently working on this myself... I have pulled down all my webmail into Outlook 2010.
As long as the view is not set to "show in conversations", you can ctrl-a the "all mail" list, and then go to
File > Save as > text only
It takes a while, and Outlook looks like it is going to lock up, but it will dump out 25,000+ messages into one pretty large text file (22mb). Still legible, full message headers, no attachments though.
I put it in Evernote (I am guessing OneNote would work too), but the .txt files have to be broken down to 4mb or so files, as Evernote refuses to deal with files longer than 5 million characters. Evernote will accept up to 25mb files according to their FAQ, but if it's trying to index the file I think the rules change.
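Breaking the dump into pieces can be scripted. A minimal sketch, assuming the export is already loaded as one string; the function name split_text and the 4,000,000-character limit in the usage note are illustrative choices, not an Evernote requirement. It breaks at line ends where it can and only cuts mid-line when a single line is longer than the limit.

```python
def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, breaking at line ends."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # A single over-long line is cut hard so no chunk exceeds the limit.
        while len(line) > max_chars:
            if current:
                chunks.append("".join(current))
                current, size = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when this line would overflow the current one.
        if size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each element of `split_text(dump, 4_000_000)` can then be written to its own numbered .txt file.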
Anyway, I have 15 years of webmail stored away there. Text is a decent archive solution for me in this case.