Ask Slashdot: Best (or Better) Ways To Archive Email?
An anonymous reader writes: I've been using email since the early '90s and have probably half a million emails in various places and accounts. Some of them are currently in .tar files, others in the original folders from obsolete or I-don't-use-them-anymore mail clients. Some IMAP, some POP3. You get the picture. I don't often need to access emails older than a year or two, but when I do, I have found that my only hope for the truly archived ones is to guess what Grep combo might find the right text in the file ... and then pick through the often unformatted, unwrapped, super ugly text until I find the email address or info that I'm searching for. Because of this, I tend to at-all-costs leave emails on servers or at least in the clients so that I can more easily search and find.
My question is whether there's any way to safely store them in a way that I can actually use them later, offline, in a way that allows for easy date searches, email address searches, and so on. Thunderbird for example has 'Archive' as an option, but if I migrate to a different client I assume that won't work anymore. So what ways do people archive emails effectively? Or is this totally a lost cause and I should keep limping along with grep?
MailStore Home is the de facto best free method I've found: http://www.mailstore.com/en/mailstore-home-email-archiving.aspx
Sure, they might be useful at some point, but do you really need your emails from 20 years ago? Life is temporary. All things decay. Attachment causes suffering.
`OfflineImap` (for fetching into a local maildir), then `mu` for indexing and searching.
As for converting your already-archived mail into maildir format, that's a little more tricky. Once they're in maildir format, you can just use `tar` to compress the ones you don't currently need to access.
mairix is another good solution for searching them, once you've got them in local mbox/mh/maildir spools. I think back when I was converting to maildir I scripted mutt to copy them in, but it's obviously harder if you've got them in proprietary formats.
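For mail that is already sitting in mbox files, the conversion to maildir can be sketched with Python's standard `mailbox` module. This is only a minimal sketch, not what either poster above used: the function name `mbox_to_maildir` and the paths are made up for illustration, and it assumes the old archives really are mbox (proprietary formats need their own exporter first).

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a maildir; return the count."""
    # create=False: fail loudly if the source archive is missing.
    src = mailbox.mbox(str(mbox_path), create=False)
    # create=True: make the maildir (cur/new/tmp) if it does not exist yet.
    dest = mailbox.Maildir(str(maildir_path), create=True)
    copied = 0
    try:
        for message in src:
            # Wrapping in MaildirMessage carries over read/replied flags
            # from the mbox Status headers where it can.
            dest.add(mailbox.MaildirMessage(message))
            copied += 1
        dest.flush()
    finally:
        src.close()
        dest.close()
    return copied


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python convert.py old-archive.mbox ~/Maildir/archive
    print(mbox_to_maildir(sys.argv[1], sys.argv[2]), "messages copied")
```

The source mbox is only read, never modified, so it is safe to keep the original file around until you have checked the result.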
I just put them in a mail folder. Make a new email account for them if you want. Then you still get the benefit of being able to access them on-demand anywhere through IMAP.
https://www.mailarchiva.com/
Works pretty well.
I'm sorry, I can't hear you over the sound of how awesome I am.
I know that Thunderbird has a plug-in that supports exporting email messages into .eml files where you can have the filename show date and subject and such. But it's not that easy to use.
https://addons.mozilla.org/en-...
I have personally been archiving with two programs, Outlook and eM client.
Outlook because it provides a .PST file where you have a database that's easy to search through (in outlook) plus I can archive calendar, contacts, and tasks.
eM Client, which is free to use for two email accounts (at a time, you can always delete and add another). It's like the Thunderbird plugin (exports to .eml) above but much more intuitive and works really well.
I remember having a similar problem years ago with E-mail in several systems and getting annoyed that everything was in different formats in different E-mail clients. I fixed the problem by setting up my own IMAP server. An IMAP server is a mail server that's compatible with virtually ALL E-mail clients, but what's important about them is that they act as mail stores, unlike POP3, so you can upload mail to an IMAP server without screwing up formatting or anything. Then once you get all your E-mail up to your IMAP server, you can choose to just store it there (just remember to back it up now and then) or you can redownload it all into a Mail folder in Thunderbird (back up Thunderbird's mail store folder for protection). Thunderbird probably isn't going away in the foreseeable future, but if it does, sometime down the road you can reuse your IMAP server to transfer it to another mail client.
Her server actually lasted longer than the one she was "supposed to" use. Contrary to popular myth, the office server was not designed for high-security or anything else special. It probably had lowest-bidder quality, and backups either failed or were lost. (A separate procedure was used for classified stuff.)
Table-ized A.I.
One option might be to set up a local IMAP server on your machine and archive your mail there. Then any mail client that talks IMAP could access it.
Thunderbird's nice in that it uses the standard maildir format (one file per message, mail folders are just directories under the root of the tree) for its local copy. Most IMAP servers understand and can use that format so you can just dump a copy of the local mail store into the IMAP server's user mail directory (or if that doesn't work, use the Unix movemail command to suck everything up from the local mail store and send it to the IMAP server) and be set. The message files are text so grepping for content's still an option of last resort. There are database-based solutions that have more options for tagging and searching, but they tend to cost money and once your mail's in them it's more of a headache to get it back out when you want to change software (this is an archive, it's inevitable that your current software will be unsuitable/unavailable at least once before the archive becomes old enough to be irrelevant).
My very unideal solution is to archive individual relevant emails under 'relevant emails' folder as plain text files. Otherwise, I don't retain emails and intentionally purge them. This way, when becomes taboo in near or far future, it won't be easy to dig through my digital trash and establish long-term pattern of 'abuse', allowing me to pretend that I am also outraged at these people still practicing such barbarism. Like not recycling your urine for drinking water. Who doesn't do that in 2035?!
With modern hard drive sizes I don't see the need for compression. Without compression you can use any good free text search tool. I have kept a good proportion of my email since about 1990, and it's all in Thunderbird. (Messages from earlier clients I just emailed to myself en masse).
Thunderbird has pretty good search capability, but as I am still running on Windows 7 I use Copernic Desktop Search, which has some useful features. (It indexes and searches files, and handles Firefox as well as Thunderbird). With this kind of volume, I do think an indexing tool is better than grep unless you want to have a lot of coffee breaks.
I am sure that there are many other solipsists out there.
I don't understand why emails are not more often stored as one-file-per-message, with a time-stamp as the start of the file name (YYYY-MM-DD etc.).
Some file systems are wasteful for lots of small files by padding actual space into large discrete chunks, but they should remedy that rather than stuff all messages into one big file.
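A rough sketch of that one-file-per-message layout, assuming the mail currently lives in an mbox file. The names `eml_name` and `export_mbox` and the exact filename pattern are invented here for illustration; only the Python standard library is used.

```python
import mailbox
import re
from email.utils import parsedate_to_datetime
from pathlib import Path


def eml_name(message, index):
    """Build a sortable filename: YYYY-MM-DD_HHMMSS_<index>_<subject>.eml"""
    raw_date = message["Date"]
    try:
        # Uses the clock time written in the Date header (sender's offset).
        stamp = parsedate_to_datetime(str(raw_date)).strftime("%Y-%m-%d_%H%M%S")
    except (TypeError, ValueError):
        stamp = "0000-00-00_000000"  # missing or unparseable Date header
    subject = str(message["Subject"] or "no-subject")
    # Keep only filesystem-friendly characters in the subject part.
    subject = re.sub(r"[^A-Za-z0-9]+", "-", subject)[:40].strip("-")
    return f"{stamp}_{index:04d}_{subject}.eml"


def export_mbox(mbox_path, out_dir):
    """Write each message of an mbox as its own .eml file; return the names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path), create=False)
    written = []
    try:
        for index, message in enumerate(box):
            name = eml_name(message, index)
            # as_bytes() leaves out the mbox "From " separator line.
            (out / name).write_bytes(message.as_bytes())
            written.append(name)
    finally:
        box.close()
    return sorted(written)
```

The running index in the name is there so two messages with the same timestamp and subject do not overwrite each other. With names like these, a plain directory listing is already sorted by date.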
Table-ized A.I.
I go through >10 year old emails all the time. "Hey, I remember talking to a professor about this algorithm." "Where did I go camping that year?" "What was my order number for that game I bought ages and ages ago, since they accept them for free copies of the remake?" "I'm trying to gather information on something, but the person I talked to has long since died and their site isn't on archive.org." It's only going to happen more and more often for older and older stuff.
Email is also really convenient for backing up work that's under the ten megabyte range...manuscripts, source code, etc. If someone doesn't have a proper backup system or it's not easy to use from the system they're on at the moment, emailing something to themselves is quick and easy. Old work gets rescued from floppies all the time, and surely there's some fascinating, ancient projects backed up in emails that people have long since forgotten about.
I've been using email since the early 1980's, 1982 specifically. I was using "mail" then, later mailx, later whizbang graphical clients.
I still have tar archives of emails from a PDP-11. I can still read them today. Why? Because open formats. Tar archives from the dawn of time can still be read on a modern Linux system today. Once you start locking things up in proprietary formats such as used by Outlook, it gets harder to read them once that format dies. Not impossible, but certainly a bigger PITA.
Tar will probably still be here long after I am gone, so from my POV it is a format with suitable longevity. The underlying messages were encoded in plain old (mbox, I think) mail format, which is also still readable by modern mail clients, and even if it wasn't, it's plain old ASCII, so "less" would suffice in a pinch. Stay away from weird binary / closed formats!
Sounds like you are somewhat tech savvy. Dump all of your emails into files, or any basic store mechanism then load the whole thing into Solr and let it be your search engine.
Use the Thunderbird archive.
Thunderbird for example has 'Archive' as an option, but if I migrate to a different client I assume that won't work anymore.
Nope! :-)
I have about 10 years of email in Thunderbird. It keeps data in the mbox format which is a well supported open standard. The files are human readable and can be grepped. There's lots of 3rd-party tools that support mbox. Thunderbird builds indexes (maybe those are proprietary) which are good enough that I can search that decade of email in a few seconds. (Maybe that is only searching by subject, to, and from. Message body searches might take longer). I remove attachments from old mail though, because that eats up space and is not valuable. If I needed the attachment, I saved it somewhere more appropriate.
The Thunderbird archive feature merely moves the mail into separate mbox folders to keep the main file from getting too big. It doesn't make them proprietary.
The hard part might be moving existing mail into that format from whatever it is in now.
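Because mbox is an open format, the date and address searches the original question asks for can be done without any particular mail client. Here is a minimal sketch using Python's standard `mailbox` module; `search_mbox` and its parameters are hypothetical names, and it scans the file linearly rather than using an index, so it will be slow on very large archives.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parseaddr, parsedate_to_datetime


def message_date(message):
    """Return the Date header as an aware UTC datetime, or None if unusable."""
    raw = message["Date"]
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:  # "-0000" dates come back naive; treat as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def search_mbox(mbox_path, sender=None, since=None, until=None):
    """Yield (date, from_address, subject) for messages that match.

    sender is a case-insensitive substring of the From address;
    since/until are timezone-aware datetimes (since inclusive, until exclusive).
    """
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for message in box:
            address = parseaddr(str(message["From"] or ""))[1].lower()
            if sender and sender.lower() not in address:
                continue
            when = message_date(message)
            if since or until:
                if when is None:
                    continue  # cannot date-filter an undated message
                if since and when < since:
                    continue
                if until and when >= until:
                    continue
            yield when, address, str(message["Subject"] or "")
    finally:
        box.close()


if __name__ == "__main__":
    # Hypothetical example: everything from slashdot since the start of 2015.
    start = datetime(2015, 1, 1, tzinfo=timezone.utc)
    for when, address, subject in search_mbox("archive.mbox", "slashdot", start):
        print(when, address, subject)
```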
I just save them in my Bluewave offline mail reader! .qwk is the way to go!
Do not look into laser with remaining eye.
Every time I switched mail clients or computers, I made sure to import all mail from the old to the new program. Messages that were made in my first mail account (in Eudora, on Macintosh System 7) are still accessible in my current Mac (Apple Mail, OSX 10.10). I don't need it often, but when I do, it's one search away.
...Solving the wrong problem.
eMail is not a storage medium; it is for short communiques, and sometimes those lead to threads while an issue is threshed through. But using your eMail system for historical storage is like buying a small automobile for long-haul freight. Or, using Twitter to negotiate a contract.
Decide what of all your data you intend keep, and find a useful, generic tool for storage and retrieval, irrespective of content.
I'm periodically annoyed by some people who still respond to emails that I wrote 15 years ago as if it was only yesterday. Delete the old emails and move on in life.
This is what I do, run IMAP locally (Dovecot). Every year or so, I create a folder called sentbox_2013/ and move all the sent emails from 2013 there. My regular sentbox contains the last 14-20 months or so.
I also have a folder called archive/ which holds the few messages I think I'll actually need again.
Regarding whether it's a good idea or a bad idea to keep them in terms of legal disputes and such:
Having the documents will allow someone to prove what was actually said. If you're a shady character, managing your business like it was Enron, you probably do not want to keep the evidence around. On the other hand, if you're working for the Software Freedom Law Center communicating with people who appear to be violating the GPL, you probably want to save your communications: if the truth is clearly on your side, you may want to be able to prove what's true.
If you're naturally very upfront and ethical in what you do and say, emails may be more likely to help you than hurt you.
When you move, if you find a carton from the previous move unopened, discard it without opening. Follow the same rule and throw away the old emails. There is nothing of value in it.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
In addition to rolling your own imap, as has already been suggested, you can/should also do this.
If you are a Windows and Outlook user, (and if not, Google and torrents are your friend) burn a wet weekend learning the mysteries of those two plus acrobat pro. Get a clean install on a fast PC with plenty of memory and an ssd.
Import all your old crap into outlook (look it up)
Install acrobat pro including outlook plugin... Trivially use this to create searchable PDFs including attachments.
Put all your mail on an imap server. You'll be able to access it with any mail client. Set up the imap server as the archive destination for TBird. Now all your mail is archived in the imap server and is accessible.
You don't trust your email host? That's fair. Run your own imap server on your NAS or even your desktop machine. Everything stays right there on your own media and is still future-proof with regard to changing clients. If you need to change servers you just use your favorite email client to transfer mail from one to another.
I have everything online at my email provider. In my case, "everything" goes back to the mid-90s. I recently switched hosting providers and did just as I described: Set up separate accounts in TBird with the old and new providers. Select all in a folder on the old provider, drag to a folder on the new provider. (Well, actually I had to do it in chunks of under 5000 messages or TBird would get all crashy on me. But you get the idea.) It was kind of tedious to move hundreds of thousands of messages, but it was merely tedious. It wasn't problematic.
Chelloveck
I give up on debugging. From now on, SIGSEGV is a feature.
It sounds like you've made a Category 6C blunder by providing a solution to a different problem.
Nobody has the time to sift through two decades of emails and pick out the important things. Even if they did, the custom database thing to put them in will definitely not be cross platform, necessitating keeping a copy of the original mess of mbox/tar/etc files around to dig through.
After trying several solutions I settled on Mairix. Searches are screaming fast (less than a second to search several hundred thousand emails), indexing is fast, it's reliable (no problems in the 5+ years I've been using it), and the search language is easy and flexible.
* I use procmail to send a copy of everything to an archive, rotated monthly
* The archive is therefore just a handful of mbox files
* I have a cron job to run "mairix -Q" every 5 minutes, and "mairix -p" nightly
* I have this in my .bashrc: "function search() { mairix -o $$ $* && mutt -f ~/Mail/$$ ; rm ~/Mail/$$ ; }"
* And here's my .mairixrc:
base=~/Mail
database=~/.mairixdb
mbox=archive-*
mformat=mbox
omit=spam
With the above, I can find:
* everything from slashdot in the last two months: search f:slashdot d:2m-
* any emails I sent containing "squishy" in the body: search f:subreality b:squishy
* messages with "password" or "passwd" or similar in the subject: search s:passw=
* get a quick summary of the search language: search -h
It's so good that I download all my email from my work Gmail account so I can search it... sometimes Google's search just isn't precise enough to find what I need.
I just leave them in the inbox or whatever folder they end up in according to my sorting scripts. I'm using claws-mail as a client.
Works for me.
eMail is not a storage medium
Of course it is.
eMail is no different than paper mail.
Solving the wrong problem
Depending on your "problem" you are obliged by law to store them and have them accessible for 10 years, minimum. Depending on the situation, up to 30 years.
Or, using Twitter to negotiate a contract. ... Oki, done!
And what would be wrong with that? With 90% of my business partners: I have no contract at all. All we do is negotiation: can you do that? Yes I can! What is your price/timeframe? Something like X/Y
That easily runs via twitter or Skype or even IRC.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
I recommend the one-mail-per-file, and one-directory-per-folder, idea. It's not exactly, well, new - but it beats everything else by miles. /under your control/.
Yes, this means you keep your mail local. This is a good thing, as this means
grep? Yes, works. Easily.
glimpseindex? Yes, works. Easily.
Anything else? Yes, works. Easily.
I keep all my mail from 1998 onwards (when I switched from a certain commercial provider with a proprietary email system) in that way. And it Just Works.
('course, Gmail/MSNmailorwhateveritscalledtoday are out. Who cares. exmh (and mutt in a pinch) FTW.)
I use a combination of mutt + offlineimap + notmuch for mail, local archiving and a very powerful search.
I've been on this setup the past 6years or so. If mutt isn't your thing this approach is modular so you could simply sync with offlineimap and index/search with notmuch.
Have a squat over at the hobo house.
I asked a similar question to Slashdot about a month ago, where I wanted to stash E-mail and have it accessible if I'm on the road.
I looked at a few options. Using a virtual machine, an offsite storage provider, and so on.
What I have wound up doing is buying a NAS. Synology or QNAP are good companies for this. The NAS I bought was a basic one, but it supports RAID 1, which is critical. It also gets backed up automatically via a script that goes in via SSH, creates a tar file, pipes it to zbackup which has a repository on another NAS. zbackup is ideal for backups of E-mail, and having another machine pull the backups helps deal with ransomware, once the bad guys start hitting devices.
I then enabled the mail server functionality, which gave me an implementation of dovecot and roundcube. This not just gave me IMAP access, but access via the web (SSL). Using the onboard firewalling, I limited the IP range that the NAS talks with, to just the IP range of the commercial VPN service I use (which is a small provider, run by some competent admins.) This way, for an attacker to even get to an open port forwarded past the router to the machine, they have to have an account with that small VPN provider.
For me, this has worked well. I have access to my E-mail over IMAP or the web. Since the NAS doesn't send or receive mail directly (mail just gets copied to it when archived), it doesn't need SMTP access in or out.
Caveat: Focus on security when setting this up. Ideally, you could use the NAS's built in eCryptFS capability to protect the IMAP maildir directories so physical theft of the NAS doesn't mean your critical E-mails belong to someone else. From there, put the NAS in its own DMZ, blocking all outgoing traffic except for it checking for OS updates, and only allowing incoming traffic to the TLS-based ports, preferably with heavy IP restrictions. For backups, do a pull based system, so if the NAS gets infected, the bad guys can only put garbage in the backups, and not attack previously stored data.
I go through >10 year old emails all the time. "Hey, I remember talking to a professor about this algorithm." "Where did I go camping that year?" "What was my order number for that game I bought ages and ages ago, since they accept them for free copies of the remake?" "I'm trying to gather information on something, but the person I talked to has long since died and their site isn't on archive.org." It's only going to happen more and more often for older and older stuff.
Agree. Or I'm like "What was the flight I took last year from JFK to LAX? It worked good with my connections". Even recently for work I noticed that when I ordered software from one supplier, I got an email, copied to the local vendor, with the serial number. I had another package we bought from them (by someone else that since left) where I could track down the PO, but not the serial number. I emailed the guy copied on my email, and he could dig up the copy he was CC'd on.
Email is also really convenient for backing up work that's under the ten megabyte range...manuscripts, source code, etc. If someone doesn't have a proper backup system or it's not easy to use from the system they're on at the moment, emailing something to themselves is quick and easy.
Critical University term end reports I remember regularly emailing copies to myself. If my computer exploded, or I accidentally overwrote everything and hit save, I could restore to a known good copy.
Old work gets rescued from floppies all the time, and surely there's some fascinating, ancient projects backed up in emails that people have long since forgotten about.
I'm surprised about work getting rescued from floppies. Back when floppies were a thing, I was shocked at how many people relied on them to hold ALL their school work. Given they were the most unreliable storage format ever invented (and at the time, '99 or so, hard drives were relatively reliable), very frequently people would lose an entire term's worth of work. I remember once I was able to recover the auto-save copy off a nearly corrupt floppy of someone's large term end project. They were almost in tears.
The early version of PST files has a file limit of something like 2GB, at which point the whole database has a risk of becoming corrupt. So it is worth breaking it down into bitesize chunks (yearly?) that are easier to manage and archive.
Yes, it's so tragic whenever we see these stories about a lonely old hacker found dead in his apartment, trapped under a toppled pile of bits.
Get a grip. Our digital closets are growing much faster than our digital hoards. Space and indexing technologies are growing faster than our compulsion to accumulate plaintext. Keeping email is not a problem.
Always used mbox format, got 7 years emails right here, immediately accessible,
before that on an old hard drive, same format, easy to load, backed up in annual mbox files.
Easy job for grep, or just open with Thunderbird and sort/search.
Go well
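For anyone starting from one big mbox, the annual-files layout can be sketched like this in Python. It is an illustration only: `split_by_year` and the `archive-YYYY.mbox` naming are assumptions made here, not something the poster described.

```python
import mailbox
from email.utils import parsedate_to_datetime
from pathlib import Path


def message_year(message):
    """Return the year from the Date header as a string, or 'undated'."""
    raw = message["Date"]
    if raw is None:
        return "undated"
    try:
        return str(parsedate_to_datetime(str(raw)).year)
    except (TypeError, ValueError):
        return "undated"


def split_by_year(mbox_path, out_dir):
    """Copy messages into out_dir/archive-<year>.mbox; return {year: count}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counts = {}
    boxes = {}  # one open mbox per year seen so far
    src = mailbox.mbox(str(mbox_path), create=False)
    try:
        for message in src:
            year = message_year(message)
            if year not in boxes:
                boxes[year] = mailbox.mbox(str(out / f"archive-{year}.mbox"))
            boxes[year].add(message)
            counts[year] = counts.get(year, 0) + 1
    finally:
        src.close()
        for box in boxes.values():
            box.close()  # close() flushes pending writes to disk
    return counts
```

Messages are appended, so running it twice against the same output directory would duplicate mail; point it at an empty directory. The source file is left untouched.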
A tip: Dovecot has a nice sync tool http://wiki2.dovecot.org/Tools... Perfect to get your email from different IMAP sources to your own system. It can also change mailbox format etc. Combine that with Dovecot itself to give you IMAP access and you have access. You can also use it to keep it in sync with an off site archive.
Dovecot does have full body search, but it is quite CPU intensive. No problem if you just run it for a few users and accept that it may take a while on a large amount of emails. Not too great if you're hosting for lots of users.
---