Ask Slashdot: Best Way To Archive and Access Ancient Emails?
An anonymous reader writes "I started using email in the early 90s and have lost most of that first decade due to ignorance, botched backups, and so on. But since about 2000, I've got most — if not all — of my email in some form or other. I run Linux, so this has mainly been in a mix of various programs: Kmail, Evolution, Thunderbird. The past 2-3 years are still on the IMAP servers. My problem is that I only rarely NEED to look back to email of 5 years ago. But sometimes it's nice. Or I just want to reminisce about something...or find an old attachment that I was sent. But I do not want to be clogging my current email client of choice with vast backups and even more, I don't know if it will even easily convert. The file structures are different, some are mbox, others maildir, etc., and I would ideally like a way to 1) store and archive these emails, 2) access them, and 3) search by Sender, Subject, Date, Attachments. Is there anything I can do or do I just have to keep legacy applications on hand for this? Should I keep trying to upgrade and pull old files into the new applications? Any help or suggestions about what YOU do would be great."
Personally, I use getmail and dovecot for my mail, not just archived. Everything's available, sorted and filtered on retrieval. I even added dspam to catch what google misses. I think I wrote a script to get the old mbox files into maildir via dovecot's processing, but it all worked and continues to work. Multiple email accounts aren't a problem.
Just IMAP it all.
I went IMAP in 1997 and have never looked back.
I've also used IMAP as a temporary conversion measure for people switching e-mail clients so even if you aren't sure, it makes a good first step.
I don't understand the concern about too many e-mails. I can access my email back to 1992. With multiple folders it shouldn't be a problem and with modern indexing a search shouldn't be an issue.
Use the IMAP server - if you have control and/or space available.
I just have a single large archive IMAP folder into which everything that isn't spam gets pushed. You could optionally create subfolders for time ranges (every 1-2 years, whatever works for you). With Dovecot and its good indexing support on the backend, quick searching has been great. If you break the archive out by time, the searches will be quicker; you could then also create a virtual mailbox combining them all for when a search really needs to span time (and takes a good chunk longer).
There are scripts/utilities available to push mbox, etc. into an IMAP folder, push everything there and use it.
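A minimal sketch of such a script, using only Python's standard `imaplib` and `mailbox` modules. The function name `upload_mbox` and the folder name "Archive" are made up for illustration, and the sketch assumes you pass in a connection that is already logged in. It lets the server stamp each message with the upload time rather than trying to preserve the original delivery date.

```python
import imaplib
import mailbox


def upload_mbox(imap, mbox_path, folder="Archive"):
    """Append every message in an mbox file to one IMAP folder.

    `imap` is an already logged-in imaplib.IMAP4 or IMAP4_SSL object.
    Returns the number of messages the server accepted.
    """
    # If the folder already exists the server answers NO, which is harmless here.
    imap.create(folder)
    uploaded = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # flags=None and date_time=None: no flags, server uses the current time.
            status, _ = imap.append(folder, None, None, message.as_bytes())
            if status == "OK":
                uploaded += 1
    finally:
        box.close()
    return uploaded
```

Typical use would be something like `imap = imaplib.IMAP4_SSL("mail.example.org")`, then `imap.login(user, password)` and `upload_mbox(imap, "old-kmail.mbox")`, where the host and file name are placeholders.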
I have all my personal email from 1998 in a Maildir directory with Dovecot as the server on a dual core Atom server running Centos. About 900 MB worth.
Plenty fast.
Trying to figure out what formats will be available in the future is pretty hard; it's easier to see which formats have been around a long time and are still in use.
As such, two formats come up readily:
mbox http://en.wikipedia.org/wiki/Mbox and maildir http://en.wikipedia.org/wiki/Maildir
Convert all your mail to maildir and keep it on your home filesystem. Whenever you need access, from wherever, connect to your home VPN and then to the filesystem. I have an account in Thunderbird where I can search, or do whatever I want with it. Seems to work well.
Had the same need 20 years ago when migrating from VAX/VMS to Unix. The old emails were saved in a not quite readable format, but I figured I could recover them if necessary. In the end, never bothered. Yes, there are a few (actually, only two) that I'd like to resurrect now, but life moves on.
Translate it _all_ to IMAP services, in MAILDIR format if available. I've repeatedly been faced with clients, partners, and colleagues who use their email as their institutional memory and need to migrate to a new service. There are few technologies as straightforward and robust as a simple IMAP server running a light, uncluttered IMAP daemon such as "dovecot", without the complex and unnecessary requirements of a Cyrus IMAP daemon, and most _definitely_ without the complex support requirements of an Exchange, Zimbra, or other corporate grade mail service.
The primary technology difficulty of this approach is in slurping the mail from your numerous external sources and getting it into the consistent layout. Use folders, not database folders but actual directory folders to separate them. Split them by year to reduce the size of the bulkiest folders. (which MAILDIR does very well). The secondary difficulty is a robust offsite backup policy, so that a hardware or system error does not lose this personal treasure trove of data.
I'm a big fan of throwing together a DB when I want to store things categorically like that and want fast searches. If you are up to the task, hunt down some tools/roll your own so that you have a nice relational database and some stored procedures for getting what you want when you need it.
You could export your emails to some parsable format, write an importer to extract the basics that you want to keep (from/to/subject/body, attachments/entire binary blob/etc.) and then bulk insert that mess into a MySQL/SQL Server instance tucked away somewhere locally or "in the cloud" (EC2, Azure). Just another option, as I'm sure you'll see many here. At least with this route you are in full control of how you index, what you can search, encryption, performance, level of backups, etc. Maybe not the best way for some, but I know if I had over 100000 emails that I wanted searchable very very quickly with advanced SQL-like searching, this would be a cool way to do it (time permitting). Good luck! And to the pedantry to ensue...Yes. Good day.
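A rough sketch of that importer idea. Note the swap: it uses SQLite from Python's standard library instead of the MySQL or SQL Server the post mentions, purely so the example is self-contained. The table layout and the names `open_index` and `add_message` are invented for illustration, not taken from any existing tool.

```python
import email
import sqlite3
from email import policy
from email.utils import parsedate_to_datetime


def open_index(db_path=":memory:"):
    """Create (or open) a small mail index; the schema is a hypothetical example."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        " id INTEGER PRIMARY KEY, sender TEXT, subject TEXT,"
        " sent TEXT, has_attachment INTEGER, raw BLOB)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS mail_sender ON mail(sender)")
    conn.execute("CREATE INDEX IF NOT EXISTS mail_sent ON mail(sent)")
    return conn


def add_message(conn, raw_bytes):
    """Parse one raw RFC 822 message and store its key headers plus the full blob."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    try:
        sent = parsedate_to_datetime(msg["Date"]).isoformat()
    except (TypeError, ValueError):
        sent = None  # missing or unparseable Date header
    has_attachment = any(
        part.get_content_disposition() == "attachment" for part in msg.walk()
    )
    conn.execute(
        "INSERT INTO mail (sender, subject, sent, has_attachment, raw)"
        " VALUES (?, ?, ?, ?, ?)",
        (str(msg["From"] or ""), str(msg["Subject"] or ""), sent,
         int(has_attachment), raw_bytes),
    )
    conn.commit()
```

Searching is then plain SQL, for example `SELECT sender, sent FROM mail WHERE subject LIKE '%invoice%' AND has_attachment = 1`. Keeping the raw message as a blob means nothing is lost if the parsed columns turn out to be wrong.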
'We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.' RPF
I use plain text or HTML if it has embedded pictures. Works great.
Best method of storing and searching old email? Gmail. It can import from POP and IMAP, so you can point it at your other inboxes and let it get on with it. You can also upload from other mail clients to Google's IMAP server. Obviously it's amazing at searching through the archives.
Best method if you're concerned about Gmail's privacy? I'm still working on that one.
A latent existence
I keep mail archives going as far back as 1996 on my home box in mh format. Sylpheed (my usual mail client), alpine (used over ssh), and nmh (occasionally used in scripting fashion) can all access it, plus I've got the usual Unixy goodness of grep and find and so on. It's a robust and simple setup.
I pull mail from my server onto my home box via POP. Why anyone wants their e-mail archives on a box that's not under their physical control is beyond my comprehension.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
Design a MySQL database for storing your mail messages, keying on sender, subject, date, and presence of attachments (bonus points for storing the attachments as blobs rather than as external files). Then write a perl script that'll automatically parse all your incoming email and convert it to database entries. I suppose if you're lazy the script could just monitor your mail spool, but it'd be better to just have it listen for incoming connections and handle the mail directly.
Next, make copies of that script, modifying as necessary to process all your old mail archives.
Oh, and you'll need to write another perl script to access all new mail - not from your mail spool, but from this database. You should probably name this system after some animal too. If you absolutely MUST have a graphical interface on it, don't use anything newer than TCL+Tk - but going with curses would be a better choice.
Oh - it has to be GPLv3, or we'll hate you and probably mailbomb your machine.
What - isn't that the Slashdot way?
#DeleteChrome
You don't need all those e-mails. Keep the few you actually care about (copy and paste the text into a regular file, and save any attachments you want), and get on with your life.
People that keep every e-mail are weird. Quit living in the past.
So can anyone with a subpoena. And you can bet Google would be running their advertising stuff on that.
There is no way I would put my life on a public server like that.
I need to archive emails that I can search later - but with a twist. These are employees who've left the company. I can't keep 'em on at Google Apps 'cause I have to pay for that by user. So I use IMAP (making sure to set Chats to be shown in the IMAP list), create an account in Thunderbird, and slurp it all on to the local machine. It keeps all the folders, although it doesn't seem to be smart enough to figure out multiple labels, so it looks like it downloads the same email multiple times, once for its folder, and once for "All Mail." Then I delete the account at Google. You just have to be sure to click through all the folders in Thunderbird and make sure it is done downloading before you blow the Google account away.
You can even read them in a text editor, every half decent email client can use them and there are free or cheap converters for the email clients that are not half decent.
http://notmuchmail.org is Gmail for people that don't trust Google. Works great with your existing IMAP server using offlineimap.
As soon as gmail made IMAP available, everything went there. I used to get my stuff via POP and saved it all going back to the early 90s. When IMAP went live on gmail, I let it chug away for hours and hours until it was synced and all my archived stuff was stored on my gmail account. They've been bumping up the limit faster than my mail's built up so I'm now at 3.9 gigs used of 10.1 available, holding about twenty years of email. I have email clients on a desktop and couple laptops that I fire up every couple of months to sync with gmail and keep local stores in the event that google screws up and loses my data. (I like to think I'd be smart enough to disconnect from the internet before accessing the local clients if my gmail account ever went blank but I've got multiple copies just in case I forget.)
I know that won't work for email fiends who pile up a gig a month but it works for me. I don't even bother sorting my email any more. It's faster to just search. Not like the old days when it would take my email client half an hour to slog through all the messages. :)
Set up a local courier IMAP server and copy mails there, and archive the Maildirs...each message will be a file and you can use tools like grep to search the Maildirs
Call it "Lexi Diamond - Ronda Rousey mud wrestle" and share it on a torrent and soon the whole world will back it up for you....
Seriously though, even if you were a previous email hoarder, you will likely be able to comfortably archive all your emails *and* the tools needed to access them on a USB stick. Start by finding all the tools you need, source included, and place them on your storage medium. Compress it. Send it to the cloud.
Mail files can be stored by year (easy enough to do with awk or other mail tools). It will be a lot smaller than some may think when you compare the size of your mail spool to the typical Library of Congress (10 Terabytes around 2002). Newegg currently has a 3TB drive for $140...
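For anyone who would rather not do the by-year split in awk, here is one possible sketch using Python's standard `mailbox` module instead. The function name `split_mbox_by_year` and the `<year>.mbox` naming are assumptions for illustration; messages whose Date header is missing or unparseable land in `undated.mbox`.

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def split_mbox_by_year(src_path, out_dir):
    """Copy each message of an mbox into out_dir/<year>.mbox; return counts per year."""
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    outputs = {}
    src = mailbox.mbox(src_path, create=False)
    try:
        for message in src:
            try:
                year = str(parsedate_to_datetime(message["Date"]).year)
            except (TypeError, ValueError):
                year = "undated"
            if year not in outputs:
                outputs[year] = mailbox.mbox(os.path.join(out_dir, year + ".mbox"))
            outputs[year].add(message)
            counts[year] = counts.get(year, 0) + 1
    finally:
        for box in outputs.values():
            box.close()  # close() flushes pending writes
        src.close()
    return counts
```

The source file is only read, never modified, so it can be kept until the per-year files have been checked.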
Just sweep it all into the Trash Bin, breathe deep, and move on with your life confident in the impermanence of all things.
Namaste!
Left MS Windows for Linux Mint and never looked back!
Vote for Bernie in 2016!
Take your pick.
Troll is not a replacement for I disagree.
It is email. AKA over the web. AKA public.
And someone with a subpoena can get your records off of your ISP, or just come into your house and take it off your computers.
Troll is not a replacement for I disagree.
I wouldn't posit this as the best way, but it's what I do. I keep my archival mail on a local filesystem arranged in directories, stored in the old-school mbox format. I run Dovecot under OS X for IMAP access to those messages from anywhere; when I need to search through the whole collection, I use mairix (an indexing and retrieval system).
Just delete some goddamn email.. hoarder!
"My immediate reaction is "WTF? What kind of moron doesn't make things 64-bit safe to begin with?" Linus
Simple. Archive mail by the year as it gets too big. Use mutt's search for the basic searching and maildir-utils for the heavy lifting.
To those saying keeping email forever is hoarding: not if it's done right. You'd be surprised how useful it is to go back and find an email from four years ago.
.... "News for nerds. Stuff that matters" these days?
Oh, and stick it all in imap.
heh - i have all my email going back to '98 in Outlook Express. Best email program ever! It's nearly perfect for what i want. (Any way to get it to do inline spell checking, ie, underlines misspelled words as you type?) Still running it on an XP box. Been using Windows Live Essentials a bit for Win8, it's not horrific, but lacks some of the characteristics..hope MS injects some of the OE spirit into it..
I have been using Mailstore for this purpose for the last few years. Works for my gmail, hosted exchange and my old, unlamented exchange system. Faster to find things with their query than Thunderbird/Outlook search. And the price was right. Before I retired, I kept separate email archives for my major clients -- made it easy to cleanup the file when the relationship wound down. This no longer matters. Everything ends up in Mailstore now -- except the immense quantities of spam. Works for me -- your mileage may vary.
An old open-source tool called hypermail may be what you're looking for. It parses mbox files and produces HTML pages with the emails sorted by thread, author, subject, date, etc. http://hypermail-project.org/
Eudora still runs on my Win7 box. I have email going back to at least the early '90s. All plaintext and easily searchable.
A subpoena won't get you into a house. That requires a search warrant, which requires probable cause of a crime.
Completely different.
Except the gmail server and the ISP server might be located in different places.
Upload it to one or several Google accounts and you have a permanent searchable archive.
I use Thunderbird.
My mailboxes are all IMAP, so I found a use for the Local Inbox in Thunderbird that I always thought was a useless feature.
At the end of the year, I create a subfolder labeled by year and download all the mail from the year before last (e.g., my last download was of 2011 emails), then purge it from the IMAP server to save space. This way I still have universal access to my last year's emails but easily searchable archives available at home.
If you keep regular backups of your /home dir then you need not worry about losing them.
I'd say follow the same rules as any archiving of media:
:)
Pick one format and migrate all of your messages to that: In this case, I'd say mbox. Thunderbird and most other mail programs read it and you can get most of your mail into mbox format via IMAP/Thunderbird from whatever mail client can read your old ones. You can store your mbox files locally in Thunderbird and gain Thunderbird's searching (for instance) without the need for an actual back-end. I was able to read some mail stored in Netscape Mail because it was just mbox files and opening them in Thunderbird was a breeze.
Most importantly: Every 5-10 years, re-evaluate your storage choice. Is Thunderbird still around? Is mbox still pretty well regarded? If you find you need to migrate again, do it! If both are still active / supported, then hold onto 'em. The only way to perpetually maintain media access is to make sure your choices are still valid on a regular basis. This is true for any media: As the old formats go obsolete (cassette tape, VHS), you need to migrate that data to the next readily accessible format (CDs, DVDs; FLACs, MPEG(?)).
I think the biggest problem is that you have a mish-mash of stored files right now. You'll save yourself a headache in the future by tearing the band-aid off now and taking the time to get all of your mail into one format. Then, in the future, when you need to convert, it'll be many steps easier since you won't have to visit Slashdot and find out what to do about your mail again next time.
For a criminal case, yes. Not for a civil case.
OTOH, for a subpoena to issue for private papers like that the court typically must already know what's in those papers, more or less. You cannot use a subpoena to simply go hunting for evidence to help make a claim. This is why it's so hard to catch illegal dumping of toxic waste, for example. Cancer cluster plaintiffs can't just go ask a judge for a subpoena to scour corporate records. They need a whistleblower to say, "yeah... they did it, and it's documented, and it's in a filing cabinet at such-and-such address". _Then_ you can get a subpoena.
Does there exist any program that's basically a lightweight Windows auto-starting (but 99.999% asleep and inert unless you're actively using it) background service that does nothing besides act like an abstraction layer between some kind of reasonable file-based mailstore roughly analogous to an Outlook .pst file (AFAIK, canonical Maildir is a physical impossibility under Windows) and any IMAP-compatible email client?
I don't care about being able to access it from anywhere besides my local PC... binding to localhost, and refusing to talk to anything external to my PC is fine. I've just had it with the mess Thunderbird's developers made of their local mailstore right around the time it completely went to hell ~4 years ago (well, and the mess they made with Thunderbird in general). For years, I just moved mailstore files around. Then, for some insane reason, it seems like Thunderbird's files just kind of exploded and proliferated... and worse, did so in ways that seem to screw up and confuse newer versions if you try to make them use files from an older version. If I could just run a semi-fake local IMAP server on my PC to abstract my mail storage away from Thunderbird itself, I could try other mail clients without having to worry about how I'm going to get my mail into them (a remote IMAP server is out of the question... I literally have gigabytes of email, some that literally came from Eudora Pro more than 18 years ago and just got converted and converted as I went along).
I thought about the usual option of an ARM-based mini-server or an old laptop, but I also have zero-tolerance for server slowness. Stutters and hiccups are bad enough without adding a server that has the resources & performance of a 500MHz Pentium III (on a good day) into the equation. At least if I'm running it locally & it spends 99.999% of its time harmlessly asleep, when I *do* go to access it, it'll have the full resources of a quadcore 3.2GHz i7 behind it for nearly-instantaneous response. The problem with running a full-blown IMAP4 server on my PC is that it's going to always be soaking up ram, and running at a higher background level (anticipating constant remote users it'll never actually see). I just want something that runs as fast and hard as it needs to and can when called upon to do so, then goes and silently hides in the corner until the next time I speak to it.
I run qmail for sending/receiving mail (on Gentoo; netqmail package), using maildir, of course. On top of that, I run the Courier IMAP server on my internal network (with TLS encryption). Until a few months ago I used Mutt as a client (console-based), but I've moved to using Roundcube (web-based email), which I initially installed for my wife, and have been happy with it. I also have some automatic filtering to folders via Maildrop (another Courier utility; it looks at a ~/.mailfilter file to route mail).
Roundcube/the IMAP server's search is OK most of the time - I keep my inbox small and move older mail to sub-folders - when I want to do advanced searches or search large mailboxes I log in and grep through folders of interest; this works well with the maildir format with one file per message. Maildir was also quite resilient when I had a HD crash and needed to recover some lost mail (block scan for blocks that look like mail headers found most missing items, and I do better backups now - mail is under ~/.maildir and gets backed up automatically).
I would move older messages to maildir (there are plenty of mbox converters, and almost anything non-proprietary should be convertible to mbox or maildir via existing programs or a short perl script) - even if at some point maildir dies off entirely, which seems unlikely, converting it to another format will always be trivial due to its simplicity and it has the advantages mentioned above of being able to search easily with grep etc.
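As an alternative to the perl route, a short sketch of the mbox-to-maildir conversion using Python's standard `mailbox` module. The function name `mbox_to_maildir` is hypothetical. Wrapping each message in `mailbox.MaildirMessage` lets the library carry over read/replied/flagged status where the mbox recorded it.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return how many were copied."""
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)  # makes tmp/, new/ and cur/
    copied = 0
    try:
        for message in src:
            # MaildirMessage(...) translates mbox Status flags to Maildir flags.
            dest.add(mailbox.MaildirMessage(message))
            copied += 1
    finally:
        dest.close()
        src.close()
    return copied
```

Afterwards each message is its own file under `new/` or `cur/`, which is what makes the grep approach described above practical.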
I use PSTs and nightly backup.
Sure, you can use GMail or the amorphous cloud for your purposes, but quite frankly, remember - if it's not in your possession, it's not as secure as it could be.
No, I don't have world-ending secrets in my possession, but yes, I do get paranoid about my data.
Striking fear in the authors of godawful fanfiction, I am here, appearing in darkness, Tuxedo Jack!
Force feed? WTF are you talking about? Dovecot can use any major mail format. Just set MAILDIR if it's in a non-standard directory. So the whole procedure is:
yum install dovecot
vim /etc/dovecot.conf (only if using a nonstandard mail location)
service dovecot restart
set username and password in GUI client
I never will understand why some people feel the need to post on topics they don't have the slightest clue about.
MH stores each email as a plain text file, each folder as a directory. It uses the unix filesystem as its database. It's very quick and has tools to re-order a folder quickly.
In addition, MH has tools to convert mail formats. It was designed in the days of low cpu power and small disks. It also lent itself well to being wrapped by other tools like xmh, exmh and mh-e so you don't have to learn the raw MH commands.
Yes, IMAP is cool, but don't discount MH. Plus the O'Reilly MH book is free as a PDF.
Oh, some IMAP servers and mail clients use MH format or something derived from it.
The 500 Mhz Pentium and the Core i7 will have roughly the same performance in this use case because IO is the bottleneck. The speed is the speed of the disk and filesystem.
To be more specific, a Pentium has a throughput of around 2 GB/s. Compare to 10 MB/s for a 7200 RPM drive doing random access on small files, 100 MB/s on large ones.
So it's entirely reasonable to use a small low power Linux system like a Western Digital World Edition network drive or the ARM based stuff you mentioned for IO bound applications such as a file server or IMAP. You won't lose any appreciable performance.
I gathered up all my historical email records a few years ago, and used Aid4Mail to convert all the various mailbox formats to the common format I use today. Choose a format that's convenient for you, and standardize on it. Here's the product website: http://www.aid4mail.com/
Find a mail program that you like and stick with it. Important factors to consider:
* How it stores the mail and attachments: mbox or other ASCII format good, proprietary binary format like PST, bad.
* How well it manages years and years worth of 10's or 100's of emails a day.
* How gracefully it fails from data corruption (this is where storage configurations that keep eggs in separate baskets are a very good thing indeed)
* Something with good importers. If not, there are 3rd party programs and services that claim to be able to convert from any mail client to another.
Personally, I've used Eudora for the last ten years (v7.x when it was still maintained by Qualcomm, not that godawful travesty Mozilla cobbled together, which is just Thunderbird dolled up to look slightly like Eudora but function nothing like it) and have scarcely considered anything else.
Yes, it's a bit goofy, requires some advanced trickery at times and the configuration screens might as well all be labeled "Miscellaneous", but it more than makes up for it...
No other mail client can come close to the MDI that lets you view endlessly configurable summaries of any number of mailboxes at a glance.
It stores all mail in plain text (close to mbox, but not quite ... though close enough that you can grab an mbox file and trick Eudora into thinking it's a native file without any manual editing) and dumps all attachments as normal files into a single directory. Yes, that directory becomes kind of ... well, huge, so it's kind of klunky in that way, but it does ensure that you can access those files at any time without having to deal with any interface beyond the operating system.
Mail consumes pretty much just what that amount of text and files would. Meta data and configuration takes up little else.
I have 10+ years (~3GB) of email and there are zero performance issues. It does offer the option of indexing mailboxes for faster searches as well.
It is truly the geek's mail client. I love this mail program so much, I will use it in a VM when it eventually becomes incompatible, but it works problem-free on Windows 7. And even in the event that it were somehow unusable, I'd still have access to all my mail; after all, it's just a bunch of flat text files.
Try Mailstore http://www.mailstore.com/.
If you get a pissy enough opposing counsel and a judge to cooperate, warrants can be issued. Trust me, being on the wrong side of it. Of course, this will vary by your jurisdiction, IANAL, and especially not yours.
Serious? Seriousness is well above my pay grade.
Our family / family business has run, with increasing formality, email servers in various flavours since the mid-90's. These servers have processed messages including everything from lots (like really lots -- in the tens of thousands at least) of family pictures to (no doubt) lots of personal email of the many dozens of staff who have worked with us over the years. In general, the server settings have always been set to "retain everything", including full Exchange journalling, because there was no way to delete things without risking losing some important pictures someone sent to someone else.
I'm not too worried about the business activity traffic, because anything recent is well replicated in many other places -- primarily in various cached Outlook data files. But where family members threw away their old machines, the only copies of these important things are in the server journals we have archived. Is there some solution that can rationalize these millions of messages into some sort of structure?
In addition, I presume that this can only be done for individuals who actually want old items to be retrieved from the archives, as anyone else would be protected by privacy rights.
No. Well...maybe. Actually, yes. It really just depends.
I've got 16 years' worth of email in a multitude of formats, including all of those that are excoriated as unreliable, fatally flawed, or Satan's preferred means of communication with our world.
I have them on O-L-D CD-ROMS, DVDs, saved to two cloud services, on my personal server, and tar'd/zipped/RAR'd/Stuff'd in some of the same places, and up to three copies in each of these places. I wonder if the .sitf files will even decompress, but no point in deleting them.
I've really only used a very few mail clients though; pine, elm, Eudora, GroupWise, Outlook in so many versions, POTP. My own servers have been the usual evolution of Sendmail, Dovecot, and now dbMail. And I use Yahoo! and Gmail as mirrors. Yahoo! Mail is my spam bucket as well as second realtime mirror, Gmail I use as a mirror and for some primary communications. my 'personal' email has been the same since 1996, but I've had three work emails.
And my email archives are close to unusable, of course. I guess I should try and take some of this advice.
And when I do rifle through the really old stuff, I need to put it through a spam filter. Some of that old spam people would pay for today. Not the IRC stuff.
I really should take a week and make sense of it. Naw, crap, who am I kidding? 1996?
deleting the extra space after periods so i can stay relevant, yeah.
If you just want "archival, for the next 5-10 years, then redo it all over again as technology changes" then the other answers in this thread are what you want.
If you want "archival, for 20+ years, without having to do it over every 5-10 years" then some form of human-readable plain text or at least represented-as-plain-text-for-attachments is what you want. Make sure all file attachments are in well-documented formats (e.g. JPEG) so someone will be able to write a decoder for them 20 years from now if one isn't readily available. If they aren't, be sure to store file-format information with your archive.
If you want "archival, for 200+ years" then you want all of the above, stored on archival media that are likely to be readable 200+ years from now along with a description of how to interpret html and file attachments. Archival paper, archival microfilm, archival "etched onto plastic but microscopically" media, etc. are what you want.
If you want "archival, for 20,000+ years" then talk to the people who are working on how to label long-term (10,000+ -year) nuclear waste storage dumps, they may have some ideas that work.
If you want "archival, 2M+ years" then I'm out of ideas. Look me up in 2M+1 years and tell me what you found that worked for you.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
http://www.mailstore.com/en/mailstore-home.aspx
Works well, searches are quick, and it's local.
Unfortunately it's Windows-only, but it may work fine under Wine.
Music the Paint dancefloor the canvas your body the brush
I've got six years of archived email. I like Thunderbird's archiving scheme. You can have it automatically create archives as a calendar year goes by.
If you're not actually working with the old e-mails, and you don't mind waiting a few moments to search them, just keep the raw e-mails, in raw transmissible e-mail format, and be done.
They are nothing more than a whack of text files at that point. And they are properly formatted with headers and everything.
Want to search? Full text search and you're done. Want to search by subject only? A simple regex search, /^Subject\:.*?cucumber/, finds "cucumber" only on the subject line (yeah yeah, header folding exists, this isn't a regex lesson).
Every e-mail client from the birth of the first one until the death of the last one supports raw e-mail formats. And you can probably just pipe them all to sendmail and send them all again.
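If header folding does bite, one way around it is to let a mail parser unfold the header before matching. The sketch below uses Python's standard `email` package and assumes one raw message per file under some directory (as in a Maildir or a folder of .eml files); the function name `find_by_subject` is made up for illustration.

```python
import os
import re
from email import policy
from email.parser import BytesHeaderParser


def find_by_subject(root, pattern):
    """Return sorted paths under `root` whose Subject header matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    parser = BytesHeaderParser(policy=policy.default)
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                headers = parser.parse(handle)  # reads headers only, skips body parsing
            subject = headers["Subject"]
            # policy.default hands back the header already unfolded and decoded.
            if subject is not None and regex.search(str(subject)):
                hits.append(path)
    return sorted(hits)
```

Something like `find_by_subject("/path/to/archive", "cucumber")` would then match a subject even when "cucumber" sits on a continuation line, while ignoring the word in message bodies.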
All of that said, I'm a big proponent of forgetting the past. Hoarding is consistent with many psychological problems.
I've been using MailSteward on OSX. The starter version handles 15k or so entries using SQLite before it starts to bog, while the trade up is a front end to MySQL.
Luke, help me take this mask off
It is email. AKA over the web.
No, e-mail is not over the web. The SMTP protocol is older than and not a part of the HTTP protocol.
There are some e-mail clients that use the web for the presentation layer, but that has nothing to do with e-mail itself.
AKA public.
Also patently false. Most e-mail servers today use SSL to communicate. Even if you sniff the line, you can't get the content of my e-mail.
Store it all as plain text files (mbox format?), and write a quick script to send it all to an ElasticSearch index.
The problem is that a throwaway email might become critically important later on. There is no way to know in advance what is important and what is not.
True story: while deployed in the Army, our communications guy could not find a piece of equipment which was very important and very pricey. He had been signing the monthly inventory forms saying he had it, assuming it was in a cabinet. He could not find any paperwork showing it was signed out - it had just disappeared sometime in the last 3 months and no one had seen it.
On a long shot, I started searching my email - since I keep every last one. Sure enough, about 2 months prior, there was a throwaway email from him to the effect that he was going to turn in item X for repair since it was acting flaky. He checked at the contractor mentioned in that email, and it was sitting on the shelf waiting for pickup.
Support microSD: in a post 9/11 world, it is unwise to carry your data on media that you cannot comfortably swallow.
tar tzvf | grep
I've got email going back 20 years, and this has not failed me. Maybe you need to sub 'x' for 't' and use less, but don't over-engineer this problem.
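If you would rather grep inside the archived files without unpacking anything to disk, the same idea can be sketched in Python with the standard tarfile module. The function name and the output tuple are my own invention; this is just one way to do it, not a replacement for the one-liner.

```python
import re
import tarfile


def grep_tar(archive_path, pattern):
    """Yield (member name, line number, line) for matching lines in a tar archive."""
    regex = re.compile(pattern.encode("utf-8"), re.IGNORECASE)
    # Mode "r:*" lets tarfile detect gzip/bzip2/xz compression by itself
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fileobj = tar.extractfile(member)
            for lineno, line in enumerate(fileobj, start=1):
                if regex.search(line):
                    text = line.decode("utf-8", errors="replace").rstrip("\r\n")
                    yield member.name, lineno, text
```

Everything is read as a stream, so a multi-gigabyte archive never has to be extracted.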
100 REM PISS OFF CODE FASCISTS 200 GOTO 100
And this is what passes for a snarky comment nowadays?
Back in my day, we'd walk a mile uphill in the snow and get our 9600 baud modems connected before chiming in from the peanut gallery.
100 REM PISS OFF CODE FASCISTS 200 GOTO 100
I just put the original clay tablets in shoeboxes and stack them in my garage.
Sheesh, evil *and* a jerk. -- Jade
You should probably get Dr Daniel Jackson or Samantha Carter or Dr Rodney McKay to translate it into english first.
Parchment -no less- does it for my ancient emails.
I hadn't the slightest objection to his spending his time planning massacres for the bourgeoisie... (P.G. Wodehouse)
I had to archive the emails since 1996. They were in multiple formats - Outlook Express dbx, Mailbox from Netscape Navigator and Thunderbird, Outlook.
I converted all of them to .eml format. It's a simple text format that can be read by the OS and easily parsed by any program or script. Much better than mbox or something else. Then I renamed all of them according to a rule - YYYYMMDDhhmm [From] [Subject]
Now I can easily find any email. I can browse them using the file system, I can search them using the OS or via a script. Windows indexes them and extracts the metadata so any search is very quick.
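The naming rule is easy to script. Below is a rough Python sketch that builds a "YYYYMMDDhhmm [From] [Subject]" file name from a raw message. Several details are my assumptions rather than the rule as actually used above: it takes the sender's address rather than the display name, replaces characters that Windows rejects in file names with "_", caps each field at 80 characters, and falls back to a row of zeros when the Date header is missing or unparsable.

```python
import email
import re
from email import policy
from email.utils import parseaddr, parsedate_to_datetime


def _clean(text, limit=80):
    """Replace characters that are illegal or awkward in file names."""
    return re.sub(r'[\\/:*?"<>|\r\n]+', "_", text).strip()[:limit]


def archive_name(raw_bytes):
    """Build 'YYYYMMDDhhmm [From] [Subject].eml' from a raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    try:
        # Uses the time as written in the Date header (no zone conversion)
        stamp = parsedate_to_datetime(msg["Date"]).strftime("%Y%m%d%H%M")
    except (TypeError, ValueError):
        stamp = "000000000000"  # assumed placeholder for a bad or missing Date
    sender = parseaddr(str(msg["From"] or ""))[1] or "unknown"
    subject = str(msg["Subject"] or "no subject")
    return f"{stamp} [{_clean(sender)}] [{_clean(subject)}].eml"
```

Because the timestamp comes first, a plain alphabetical directory listing is also a chronological one.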
I switched to Office365 for this. $8/month for unlimited storage and good bandwidth.
I used to use mbox format on a regular Linux web host (pair networks) from 1995 which worked fine. But they weren't scaling up their storage allowance as technology progressed, so as my archive grew bigger, I was paying too much for it. Tipping point was about 2005.
Next I switched to maildir format on an openSUSE box running in my basement, 1TB RAID2 hard disk backed up automatically to a 1TB USB drive, and also mirrored (using unison) to an offsite machine. My email archive was important to me and I never wanted to lose it.
But this was a pain. It was a pain to administer the system, a pain to make sure I had spare disks for when they failed, a pain to be sure my software RAID would even work, a pain to make sure my firewall was always open to inbound IMAPS, a pain to periodically move email from pair networks to this archive every year.
Also I provide email for my family on the other side of the Atlantic, and this basement server wasn't suitable for them (not enough uptime when e.g. rewiring my house).
Office365 wound up being cheaper for my family than pair networks. It has an "unlimited" $8/month plan for me, and 25gb $4/month for my family. It has a decent enough webclient and great (fast) online search, far faster than any searching I did with mbox or maildir servers. I feel more secure with its reliability and uptime. And being a Microsoft employee (C#/VB language design team, unrelated to Office) I use Windows devices and email clients, which generally work better with Exchange than IMAP.
Use www.weirdkid.com/products/emailchemy to ensure your mails are "normalized" to a homogeneous format ( rfc2822 ). Then, find a Linux solution akin to www.mailsteward.com to manage your archive.
I have a number of messages from the mid 80s that are in MMDF or PMDF format as well as mbox but they are on a reel to reel tape and my new computer doesn't have any place for the tape to go.
Can anyone in Melbourne read a 9 track tape?
Personally, I'm lazy. I've been using Pine (now Alpine) directly on a mail server for all my mail since 1995 (on my own servers since '97). Old habits die hard.
It works great over really low bandwidth connections (though sometimes high latency can be annoying), you can view any attachments you need automagically with X11 forwarding via SSH, and you don't care at all about which machine you're accessing it from. Also you get to read the TEXT in your mails & not HTML, most of which is useless garbage when it comes to emails (for the 0.1% of HTML mail I do actually need to read as HTML, such as tables, Linx often gets the job done, & if not I just bounce it to my gmail account, which is pretty much full of spam otherwise).
When various folders get Too Big (or I move on to another job, or whatever) I move them into an "archive" folder (& I have an "old-archive" folder for the really ancient stuff) and bzip2 them. I archive my inbox files at the beginning of every year too. When I need to find something old, I just bzgrep for it. After an archiving session (which takes all of 5 minutes) the whole thing gets backed up from my mail server to my NAS at home.
Did I mention that my backup MX is a SparcStation 20 and still works just fine for all this? Of course I don't keep much on it but if my main server dies I can still send & receive mail just fine.
Note that this is not exactly something I sat down & spent time thinking about, I just started moving mail out of the way like this when I left college & built a couple of OpenBSD mail & DNS servers, and kept doing it as it works well enough.
So I take it that you only send and receive email with people running their own private mail server; certainly not with anyone using an email provider that 'anyone with a subpoena' could access.
"An anonymous reader writes" - What? I've been on Slashdot for a while and enough is enough. "An anonymous reader" shouldn't be able to submit articles. "Anonymous" cowards are already trolling Slashdot to dead. If you don't have the guts to post under your username, then why should you have the right to post anything?
Just my opinion.
Well it is as public as any Gmail account.
Troll is not a replacement for I disagree.
Ugh. Drop all that stuff. Who needs it? My gmail folder has 20 messages in it. Lighten your (psychic) load.
I have more than 13 years' worth of archived mail; I keep two bzip-compressed mbox files for each month: Sent-YYYY-MM.bz2 and Received-YYYY-MM.bz2
Searching is a bit slow, but I hardly ever have to search that far back so I don't mind. More recent mail (going back about a year) stays on the IMAP server. Also, my company produces an email archiving product that lets me search very quickly based on sender, recipient, subject, full-text body search, etc. which is great for mail going back up to about two years.
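For the slow-but-simple part, here is a sketch of searching monthly bzip2-compressed mbox files in Python, with the file name used to narrow the date range so that only the relevant months get decompressed. It assumes the Sent-YYYY-MM.bz2 / Received-YYYY-MM.bz2 naming described above; the function names and parameters are made up for the example.

```python
import bz2
import re
from pathlib import Path


def iter_messages(path):
    """Yield raw messages (bytes) from a bzip2-compressed mbox file."""
    current = []
    with bz2.open(path, "rb") as fh:
        for line in fh:
            # An mbox message starts at a line beginning with "From "
            # (body lines that start that way are normally stored as ">From ")
            if line.startswith(b"From ") and current:
                yield b"".join(current)
                current = []
            current.append(line)
    if current:
        yield b"".join(current)


def search_months(directory, pattern, first="0000-00", last="9999-99",
                  prefix="Received"):
    """Yield (file name, raw message) for matches in months first..last."""
    regex = re.compile(pattern.encode("utf-8"), re.IGNORECASE)
    for path in sorted(Path(directory).glob(f"{prefix}-????-??.bz2")):
        month = path.name[len(prefix) + 1:len(prefix) + 8]  # "YYYY-MM"
        if not first <= month <= last:
            continue  # skip files outside the requested range
        for raw in iter_messages(path):
            if regex.search(raw):
                yield path.name, raw
```

Passing first="2003-01" and last="2003-12" limits the work to twelve files, which is usually the difference between a coffee break and a few seconds.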
I fail to see the problem. I have mails going back a decade or more all stored in maildir on an imap server. Done. I've changed clients several times, servers several times, no problem.
So what's the problem that makes an "ask slashdot" necessary?
Assorted stuff I do sometimes: Lemuria.org
This one will probably get buried because of the sheer weight of comments in this thread... but here goes.
I had the same quandary about four years ago; mail going back a decade at that point which I wanted to keep around. It was in various clients, as in your case. What I did was build a POSTFIX / IMAP server using (at the time) Gentoo. I then attached those clients and simply copied all the archived email up, one client at a time. I then went about building a SquirrelMail front end which did great for a while.
The problem as you can probably ascertain was search. It was tough to trawl through all those emails... but last year I converted my entire email system to Zimbra and simply did an IMAP import of all the data from my old IMAP server to the Zimbra database. While Zimbra still stores everything in MBX format (I think), it also uses MYSQL to store index data. It also happens that Zimbra has a really nice web front end, and everything's really nicely integrated. Now I have email going back 15 years or thereabouts, all searchable in pretty swift order. I added the Zimbra Desktop app to my laptops and I even have a local cache. As for backups, I have a Linode running a custom kernel and the ZFS filesystem, and nightly I have a script on my server that backs up the entire Zimbra store using "zfs send / zfs recv". Since my entire email store is around 9GB it isn't terribly expensive... and I use the same Linode for hosting a hub for my OpenVPN network... which means all my computers can communicate privately from anywhere in the world across a constantly up VPN tunnel.
And for those who think you don't need to keep all that email... bully for you. I have had to refer to decade old emails before in order to provide better service to my customers. My email archive also came in very handy during the divorce from my ex wife for reasons you can probably imagine but I'd rather not get into. That's also handy stuff to keep around... just in case.
We (Roaring Penguin Software Inc.) have an anti-spam system that has an archiving add-on if you're looking for commercial software. It's built on PostgreSQL, so supports searching including full-text body searches.
Searching is done via a Web interface; we don't have specific integration with particular email clients.
There were enough really good solutions proposed above:
1) Standardize on one format - preferably maildir(1)
2) Convert all your emails into rfcxxxx (I forgot the number, but you can look it up) and copy to maildir-format
3) on Linux or other *nix-based systems, you can use many tools to search
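Steps 1 and 2 can be sketched with Python's standard mailbox module, which reads mbox and writes Maildir. This is only an illustration under the assumption that the old mail is already in mbox files; mail stuck in other client formats would need converting to mbox or raw messages first. The function name is hypothetical.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # Wrapping in MaildirMessage carries over read/replied/flagged status
            dest.add(mailbox.MaildirMessage(msg))
            copied += 1
        dest.flush()
    finally:
        src.close()
        dest.close()
    return copied
```

The source mbox is left untouched, so the conversion can be re-run into a fresh directory if something looks off.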
(1) I have 15 years of email, about 60GB, roughly 120,000 Emails (sent + received). I use Mac OSX, so I have stored them in Mail.app - because Mail.app uses something like maildir-format and I will never lose my emails, even when I switch to another client.
Every time a year ends, I create two new folders under Archives/Inbox and Archives/Sent respectively with the year in four digits, e.g.:
Archives/Inbox/2012
Archives/Sent/2012
Then I move the emails to the respective folder. From then on, I exclude these entries from "standard default search"; Only when I purposefully want to search in them, I choose to do so.
This has worked quite well for fifteen years now - and before Mail.app, I used to use PowerMail, Eudora, Outlook Express, Mutt, Pine, and so on - now I standardized on Mail.app with its maildir-structure and am happy.
Oh, and forgot to mention:
I would suggest NOT to use any (commercial) solution that stores your emails in some weird BLOB from which there is no export possibility at one point. As long as any (commercial) solution supports something like maildir, you will be fine - anything else will be a sure guarantee that you won't be able to read your emails anymore once the solution-provider is gone and there is no documentation about their storage format.
Lastly: on backups - don't look for anything that is email-specific - I mean with that: treat your emails like any other important file/data that you have. There's nothing wrong with being paranoid with regards to backups (I have a 4-level-backup system for my emails, photos, music, and other important documents... the only thing I'm missing at the moment is an off-site backup solution for these...)
I have had the same issue, email archives that are complete from the mid-90s and sporadic emails from the 1980s. What I've been doing is archiving most of the messages in text files in mbox format, one file per month, and I gzip them after a certain period of time to conserve space.
Unfortunately 'grep' and similar utilities have been insufficient to do decent searches on them. What I ended up doing is building my own search utility in python. It allows me to specify multiple search terms, regular expressions or strings, search blocks of files (e.g. in this case finding blocks that are delimited by a starting '^From ' line), as well as automatically descending into directories, tar files, gzipped files, etc. With this I can easily run a search across any set of files that I desire (even if I've tarred and compressed them) and get out resulting output that I can read with a mail reader program such as Mutt. I've found it to be extremely useful for this, as well as almost all other search tasks that I do.
If you are interested in using it, I've made it available on github. It's at https://github.com/bruceisrael/search
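To give a feel for the block-search idea, here is a small Python sketch. It is not the tool linked above and does not reflect how that tool is written; it only illustrates splitting files into blocks at '^From ' lines, requiring every search term to match within a block, and descending into directories and gzip/bzip2 files. Tar archives are not handled, and all names are made up for the example.

```python
import bz2
import gzip
import re
from pathlib import Path

# Pick a decompressing opener from the file suffix; anything else is plain text
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def iter_blocks(path, start=r"^From "):
    """Yield blocks of text, each beginning at a line that matches `start`."""
    start_re = re.compile(start)
    opener = OPENERS.get(Path(path).suffix, open)
    block = []
    with opener(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if start_re.search(line) and block:
                yield "".join(block)
                block = []
            block.append(line)
    if block:
        yield "".join(block)


def search_blocks(root, terms, start=r"^From "):
    """Yield (path, block) for blocks that match every regex in `terms`."""
    regexes = [re.compile(t, re.IGNORECASE | re.MULTILINE) for t in terms]
    root = Path(root)
    if root.is_file():
        paths = [root]
    else:
        paths = sorted(p for p in root.rglob("*") if p.is_file())
    for path in paths:
        for block in iter_blocks(path, start):
            if all(r.search(block) for r in regexes):
                yield path, block
```

Since each hit is a complete message starting with its "From " line, writing the hits to a file gives you something a mail reader can open as an mbox.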
Dovecot handles all the formats you mentioned, mbox, maildir, etc...
Then access everything w/IMAP.
I keep everything in mbox format...going back to 1999....
Things are very hierarchical. I don't keep everything. List mails
go into list-boxes and I read them like newsgroups.
I have multiple levels of personal mail.....sorta like google's circles...
but unrelated to that...
Keep it all in /home/lpq/mail ... about 5.1G of it...
I don't bother sorting or categorizing or anything. I just have procmail send a copy to an archive file which I rotate once a year, and I index it all with mairix: http://www.rpcurnow.force9.co.uk/mairix/ . I can search on date, sender, subject, body, etc, and in a few seconds I have what I need.