Ask Slashdot: Best (or Better) Ways To Archive Email?
An anonymous reader writes: I've been using email since the early '90s and have probably half a million emails in various places and accounts. Some of them are currently in .tar files, others in the original folders from obsolete or I-don't-use-them-anymore mail clients. Some IMAP, some POP3. You get the picture. I don't often need to access emails older than a year or two, but when I do, I have found that my only hope for the truly archived ones is to guess what Grep combo might find the right text in the file ... and then pick through the often unformatted, unwrapped, super ugly text until I find the email address or info that I'm searching for. Because of this, I tend to at-all-costs leave emails on servers or at least in the clients so that I can more easily search and find.
My question is whether there's any way to safely store them in a way that I can actually use them later, offline, in a way that allows for easy date searches, email address searches, and so on. Thunderbird for example has 'Archive' as an option, but if I migrate to a different client I assume that won't work anymore. So what ways do people archive emails effectively? Or is this totally a lost cause and I should keep limping along with grep?
MailStore Home is the de facto best free method I've found: http://www.mailstore.com/en/mailstore-home-email-archiving.aspx
Sure, they might be useful at some point, but do you really need your emails from 20 years ago? Life is temporary. All things decay. Attachment causes suffering.
A half a million messages? That's not much. You should use notmuch.
notmuchmail.org
She will hold onto it for you!
rm -rf *
I've always either manually PDF'd the ones I felt were worthwhile, or used an archival utility that did it for me. They go in hierarchical folders based on time/date.
I'm a Mac user, so Spotlight gives me enough search functionality. Also, I can use various PDF utilities to join together PDFs that should stay connected.
`OfflineImap` (for fetching into a local maildir), then `mu` for indexing and searching.
As for converting your already-archived mail into maildir format, that's a little more tricky. Once they're in maildir format, you can just use `tar` to compress the ones you don't currently need to access.
mairix is another good solution for searching them, once you've got them in local mbox/mh/maildir spools. I think back when I was converting to maildir I scripted mutt to copy them in, but it's obviously harder if you've got them in proprietary formats.
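For mail that is already sitting in mbox files, the conversion step can be sketched with Python's standard mailbox module. This is only a minimal sketch, not what the poster scripted with mutt: the function name and the paths are made up for illustration, and mail in proprietary formats would still have to be exported to mbox first.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a maildir; return the count."""
    source = mailbox.mbox(str(mbox_path), create=False)
    # create=True makes the cur/, new/ and tmp/ subdirectories if they are missing.
    dest = mailbox.Maildir(str(maildir_path), create=True)
    copied = 0
    try:
        for message in source:
            # Converting to MaildirMessage carries over read/replied flags
            # where the mbox recorded them in Status headers.
            dest.add(mailbox.MaildirMessage(message))
            copied += 1
    finally:
        dest.close()
        source.close()
    return copied
```

Something like mbox_to_maildir("old-inbox.mbox", "Maildir") leaves the original mbox untouched, so it can be kept as a fallback until the maildir copy has been checked.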
You already have your solution: store all the email on an IMAP server, then connect to it with whatever client you desire and do your searches. You can connect a client such as Thunderbird to multiple accounts and copy your messages to the one IMAP server. Thunderbird's archive feature just copies your emails to date-based folders for organization purposes; it's all still on the IMAP server. IMAP is client independent, so if your current client is discontinued you just pick up another IMAP client and keep going. You'll just need to keep backups of the IMAP server / data and migrate it as updates are needed.
Now, I would encourage you to question why you really need 25 year old emails. Delete junk you really don't need.
I have Personal Storage Table (.pst) files from the late 90's that open in Outlook 2016, and in a free viewer thing (systools?) I have at home, with no issues... great folder support, tagging support, sorting, filtering, file system search speed, etc. Owned by MSFT, but openly published, with free-forever licensing.
I just put them in a mail folder. Make a new email account for them if you want. Then you still get the benefit of being able to access them on-demand anywhere through IMAP.
https://www.mailarchiva.com/
Works pretty well.
I'm sorry, I can't hear you over the sound of how awesome I am.
I know that Thunderbird has a plug-in that supports exporting email messages into .eml files, where you can have the filename show date and subject and such. But it's not that easy to use.
https://addons.mozilla.org/en-...
I have personally been archiving with two programs, Outlook and eM client.
Outlook because it provides a .PST file where you have a database that's easy to search through (in outlook) plus I can archive calendar, contacts, and tasks.
eM Client, which is free to use for two email accounts (at a time, you can always delete and add another). It's like the Thunderbird plugin (exports to .eml) above but much more intuitive and works really well.
A mental health practitioner is your best option here.
I remember having a similar problem years ago with E-mail in several systems and getting annoyed that everything was in different formats in different E-mail clients. I fixed the problem by setting up my own IMAP server. An IMAP server is a mail server that's compatible with virtually ALL E-mail clients, but what's important is that, unlike POP3, it acts as a mail store, so you can upload mail to an IMAP server without screwing up formatting or anything. Then once you get all your E-mail up to your IMAP server, you can choose to just store it there (just remember to back it up now and then) or you can redownload it all into a Mail folder in Thunderbird (back up Thunderbird's mail store folder for protection). Thunderbird probably isn't going away in the foreseeable future, but if it does, sometime down the road you can reuse your IMAP server to transfer it to another mail client.
One option might be to set up a local IMAP server on your machine and archive your mail there. Then any mail client that talks IMAP could access it.
Thunderbird's nice in that it uses the standard maildir format (one file per message, mail folders are just directories under the root of the tree) for its local copy. Most IMAP servers understand and can use that format, so you can just dump a copy of the local mail store into the IMAP server's user mail directory (or if that doesn't work, use the Unix movemail command to suck everything up from the local mail store and send it to the IMAP server) and be set. The message files are text, so grepping for content's still an option of last resort. There are database-based solutions that have more options for tagging and searching, but they tend to cost money, and once your mail's in them it's more of a headache to get it back out when you want to change software (this is an archive; it's inevitable that your current software will be unsuitable/unavailable at least once before the archive becomes old enough to be irrelevant).
My very unideal solution is to archive individual relevant emails under 'relevant emails' folder as plain text files. Otherwise, I don't retain emails and intentionally purge them. This way, when becomes taboo in near or far future, it won't be easy to dig through my digital trash and establish long-term pattern of 'abuse', allowing me to pretend that I am also outraged at these people still practicing such barbarism. Like not recycling your urine for drinking water. Who doesn't do that in 2035?!
With modern hard drive sizes I don't see the need for compression. Without compression you can use any good free text search tool. I have kept a good proportion of my email since about 1990, and it's all in Thunderbird. (Messages from earlier clients I just emailed to myself en masse).
Thunderbird has pretty good search capability, but as I am still running on Windows 7 I use Copernic Desktop Search, which has some useful features. (It indexes and searches files, and handles Firefox as well as Thunderbird). With this kind of volume, I do think an indexing tool is better than grep unless you want to have a lot of coffee breaks.
I am sure that there are many other solipsists out there.
I don't understand why emails are not more often stored as one-file-per-message, with a time-stamp as the start of the file name (YYYY-MM-DD etc.).
Some file systems are wasteful for lots of small files by padding actual space into large discrete chunks, but they should remedy that rather than stuff all messages into one big file.
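For anyone who wants to try the one-file-per-message layout, here is a rough sketch that explodes an mbox file into individual .eml files whose names start with the date. The naming scheme, the function names and the choice of Python's standard mailbox module are all assumptions made for illustration, not an established convention.

```python
import mailbox
import re
from email.utils import parsedate_to_datetime
from pathlib import Path


def dated_filename(message, index):
    """Build a name like '2013-05-01_120000_00042_Some-subject.eml'."""
    try:
        # The timestamp is taken as written in the Date header (not converted to UTC).
        stamp = parsedate_to_datetime(str(message["Date"])).strftime("%Y-%m-%d_%H%M%S")
    except (TypeError, ValueError):
        # Missing or unparseable Date header: these sort to the front.
        stamp = "0000-00-00_000000"
    # Keep only characters that are safe in file names on most systems.
    subject = re.sub(r"[^A-Za-z0-9]+", "-", str(message["Subject"] or "no-subject"))
    return f"{stamp}_{index:05d}_{subject.strip('-')[:40]}.eml"


def explode_mbox(mbox_path, out_dir):
    """Write each message of an mbox to its own file; return the file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path), create=False)
    written = []
    try:
        for index, message in enumerate(box):
            target = out / dated_filename(message, index)
            target.write_bytes(message.as_bytes())
            written.append(target.name)
    finally:
        box.close()
    return written
```

The running index in the name keeps two messages with the same timestamp and subject from overwriting each other, and a plain directory listing then sorts the archive chronologically.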
Table-ized A.I.
I've been using email since the early 1980's, 1982 specifically. I was using "mail" then, later mailx, later whizbang graphical clients.
I still have tar archives of emails from a PDP-11. I can still read them today. Why? Because open formats. Tar archives from the dawn of time can still be read on a modern Linux system today. Once you start locking things up in proprietary formats such as used by Outlook, it gets harder to read them once that format dies. Not impossible, but certainly a bigger PITA.
Tar will probably still be here long after I am gone, so from my POV it is a format with suitable longevity. The underlying messages were encoded in plain old (mbox, I think) mail format, which is also still readable by modern mail clients, and even if it wasn't, it's plain old ASCII, so "less" would suffice in a pinch. Stay away from weird binary / closed formats!
Sounds like you are somewhat tech savvy. Dump all of your emails into files, or any basic store mechanism then load the whole thing into Solr and let it be your search engine.
Use the Thunderbird archive.
Thunderbird for example has 'Archive' as an option, but if I migrate to a different client I assume that won't work anymore.
Nope! :-)
I have about 10 years of email in Thunderbird. It keeps data in the mbox format, which is a well-supported open standard. The files are human readable and can be grepped. There are lots of 3rd-party tools that support mbox. Thunderbird builds indexes (maybe those are proprietary) which are good enough that I can search that decade of email in a few seconds. (Maybe that is only searching by subject, to, and from. Message body searches might take longer). I remove attachments from old mail though, because that eats up space and is not valuable. If I needed the attachment, I saved it somewhere more appropriate.
The Thunderbird archive feature merely moves the mail into separate mbox folders to keep the main file from getting too big. It doesn't make them proprietary.
The hard part might be moving existing mail into that format from whatever it is in now.
The easiest way to convert is by uploading the mail to an IMAP Server and then using a tool of your choice to download the mails to one of the standard mail storage formats (or if you can access the mail server files and it stores mail as mbox files or in maildir format, get it directly from there).
This is not a normal problem to have. Filthy hoarder.
I don't see why you cannot just store them in mail client of your choice? You are probably better off using a local client over a webmail client (although gmail would be happy to import all of your mail and index it for you either over imap from a mail client or using one of the various loader tools out there), but I have never had any trouble importing old mail archives into thunderbird or outlook. If you set your folders or labels or whatever your client uses for organization correctly you should be able to search by account fairly easily.
gzip -9 mbox
Attachments cause suffering.
There, fixed that for you....
-email Buddha
I just save them in my Bluewave offline mail reader! .qwk is the way to go!
Do not look into laser with remaining eye.
JustAnotherOldGuy here, posting from an undisclosed location....
What I did was spend a day or so writing a script that extracted the emails from the various mail files they were in, and stuffed them all in a big honkin' MySQL database (just one table, no need to get fancy). It's about 500K rows all told. A simple interface lets me search any/all of the fields (subject, to/from email, body, etc.) and locate what I want without too much trouble. Yeah, it was a bit of a pain to do but it was worth it in the end.
I only need to search it once in a while but when I do it's a lifesaver.
This approach worked for me, it may or may not work for you.
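The single-table idea can be sketched in a few lines. To be clear about the assumptions: this is not the poster's script. It swaps MySQL for SQLite so that it runs with nothing but the Python standard library, it only reads mbox files, and the table layout and function names are invented for illustration.

```python
import mailbox
import sqlite3
from email.utils import parseaddr, parsedate_to_datetime


def body_text(message):
    """Return the first text/plain part as a string ('' if there is none)."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the header
                return payload.decode("utf-8", errors="replace")
    return ""


def load_mbox(db, mbox_path):
    """Insert every message of an mbox into one table; return the row count."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail "
        "(sent TEXT, sender TEXT, recipient TEXT, subject TEXT, body TEXT)"
    )
    box = mailbox.mbox(str(mbox_path), create=False)
    rows = []
    for message in box:
        try:
            sent = parsedate_to_datetime(str(message["Date"])).isoformat()
        except (TypeError, ValueError):
            sent = None  # missing or unparseable Date header
        rows.append((
            sent,
            parseaddr(str(message["From"] or ""))[1].lower(),
            str(message["To"] or ""),
            str(message["Subject"] or ""),
            body_text(message),
        ))
    box.close()
    db.executemany("INSERT INTO mail VALUES (?, ?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


def search(db, text):
    """Case-insensitive substring match over all of the text fields."""
    like = f"%{text}%"
    return db.execute(
        "SELECT sent, sender, subject FROM mail "
        "WHERE subject LIKE ? OR sender LIKE ? OR recipient LIKE ? OR body LIKE ? "
        "ORDER BY sent",
        (like, like, like, like),
    ).fetchall()
```

With db = sqlite3.connect("mail.db"), one load_mbox(db, path) call per archive file followed by search(db, "invoice") is the whole interface. LIKE scans every row, so for half a million messages SQLite's full-text search extension would probably be the next thing to look at.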
Every time I switched mail clients or computers, I made sure to import all mail from the old to the new program. Messages that were made in my first mail account (in Eudora, on Macintosh System 7) are still accessible in my current Mac (Apple Mail, OSX 10.10). I don't need it often, but when I do, it's one search away.
mailpiler.org
...Solving the wrong problem.
eMail is not a storage medium; it is for short communiques, and sometimes those lead to threads while an issue is threshed through. But using your eMail system for historical storage is like buying a small automobile for long-haul freight. Or, using Twitter to negotiate a contract.
Decide what of all your data you intend keep, and find a useful, generic tool for storage and retrieval, irrespective of content.
Ever since the NSA added a RESTful API to their database,
it's been pretty trivial for me to search my email history.
CAP === 'unprimed'
I'm periodically annoyed by some people who still respond to emails that I wrote 15 years ago as if it was only yesterday. Delete the old emails and move on in life.
This is what I do: run IMAP locally (Dovecot). Every year or so, I create a folder called sentbox_2013/ and move all the sent emails from 2013 there. My regular sentbox contains the last 14-20 months or so.
I also have a folder called archive/ which holds the few messages I think I'll actually need again.
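The yearly rotation can be automated along these lines. This is a sketch under assumptions: it presumes the sent mail lives in a maildir (the poster doesn't say which storage format their Dovecot uses), it works on the files directly rather than through IMAP, and the function name is made up.

```python
import mailbox
from email.utils import parsedate_to_datetime


def rotate_year(maildir_path, year):
    """Move messages dated `year` into a sentbox_<year> subfolder; return the count."""
    box = mailbox.Maildir(str(maildir_path), create=False)
    # add_folder() creates the subfolder on disk (as ".sentbox_<year>") if needed.
    archive = box.add_folder(f"sentbox_{year}")
    moved = 0
    for key in list(box.keys()):
        message = box[key]
        try:
            sent = parsedate_to_datetime(str(message["Date"]))
        except (TypeError, ValueError):
            continue  # leave messages without a usable Date header where they are
        if sent.year == year:
            archive.add(message)  # copy first, then remove the original
            box.remove(key)
            moved += 1
    return moved
```

Running it against a copy of the mail directory first, with the IMAP server stopped, seems the prudent way to try it.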
Regarding whether it's a good idea or a bad idea to keep them in terms of legal disputes and such:
Having the documents will allow someone to prove what was actually said. If you're a shady character, managing your business like it was Enron, you probably do not want to keep the evidence around. On the other hand, if you're working for the Software Freedom Law Center communicating with people who appear to be violating the GPL, you probably want to save your communications: if the truth is clearly on your side, you may want to be able to prove what's true.
If you're naturally very upfront and ethical in what you do and say, emails may be more likely to help you than hurt you.
When you move, if you find a carton from the previous move unopened, discard it without opening. Follow the same rule and throw away the old emails. There is nothing of value in it.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
There are 4 different formats for saving mail succinctly described in the following Mutt configuration page:
http://mutt.blackfish.org.uk/storage/
Your ability to open old mail files and parse the content depends on the format. If your 10+ years of emails are in different formats you have a challenge. Migrating different formats is not perfect and the altered files can be challenged when presented in court. If you're really going to spend hours on this, I would pick a format and try to migrate copies to that common format. Use the new format for the searches but keep the originals as presentable evidence.
The value of a consistent format is that it makes searching easier. I use mh format which Sylpheed/Claws-Mail/Mutt can all search by mailbox.
In addition to rolling your own imap, as has already been suggested, you can/should also do this.
If you are a Windows and Outlook user, (and if not, Google and torrents are your friend) burn a wet weekend learning the mysteries of those two plus acrobat pro. Get a clean install on a fast PC with plenty of memory and an ssd.
Import all your old crap into outlook (look it up)
Install acrobat pro including outlook plugin... Trivially use this to create searchable PDFs including attachments.
Put all your mail on an imap server. You'll be able to access it with any mail client. Set up the imap server as the archive destination for TBird. Now all your mail is archived in the imap server and is accessible.
You don't trust your email host? That's fair. Run your own imap server on your NAS or even your desktop machine. Everything stays right there on your own media and is still future-proof with regard to changing clients. If you need to change servers you just use your favorite email client to transfer mail from one to another.
I have everything online at my email provider. In my case, "everything" goes back to the mid-90s. I recently switched hosting providers and did just as I described: Set up separate accounts in TBird with the old and new providers. Select all in a folder on the old provider, drag to a folder on the new provider. (Well, actually I had to do it in chunks of under 5000 messages or TBird would get all crashy on me. But you get the idea.) It was kind of tedious to move hundreds of thousands of messages, but it was merely tedious. It wasn't problematic.
Chelloveck
I give up on debugging. From now on, SIGSEGV is a feature.
It sounds like you've made a Category 6C blunder by providing a solution to a different problem.
Nobody has the time to sift through two decades of emails and pick out the important things. Even if they did, the custom database thing to put them in will definitely not be cross platform, necessitating keeping a copy of the original mess of mbox/tar/etc files around to dig through.
After trying several solutions I settled on Mairix. Searches are screaming fast (less than a second to search several hundred thousand emails), indexing is fast, it's reliable (no problems in the 5+ years I've been using it), and the search language is easy and flexible.
* I use procmail to send a copy of everything to an archive, rotated monthly
* The archive is therefore just a handful of mbox files
* I have a cron job to run "mairix -Q" every 5 minutes, and "mairix -p" nightly
* I have this in my .bashrc: "function search() { mairix -o $$ $* && mutt -f ~/Mail/$$ ; rm ~/Mail/$$ ; }"
* And here's my .mairixrc:
base=~/Mail
database=~/.mairixdb
mbox=archive-*
mformat=mbox
omit=spam
With the above, I can find:
* everything from slashdot in the last two months: search f:slashdot d:2m-
* any emails I sent containing "squishy" in the body: search f:subreality b:squishy
* messages with "password" or "passwd" or similar in the subject: search s:passw=
* get a quick summary of the search language: search -h
It's so good that I download all my email from my work Gmail account so I can search it... sometimes Google's search just isn't precise enough to find what I need.
Use your delete key. Seriously. You don't need 25 years of every old email you've received.
This is just pack-rat hoarding behavior you're engaging in, and shame on Slashdotters for trying to enable it.
Barracuda
I just leave them in the inbox or whatever folder they end up in according to my sorting scripts. I'm using claws-mail as a client.
Works for me.
eMail is not a storage medium
Of course it is.
eMail is no different from paper mail.
Solving the wrong problem
Depending on your "problem" you are obliged by law to store them and have them accessible for 10 years, minimum. Depending on the situation, up to 30 years.
Or, using Twitter to negotiate a contract. ... Oki, done!
And what would be wrong with that? With 90% of my business partners: I have no contract at all. All we do is negotiation: can you do that? Yes I can! What is your price/timeframe? Something like X/Y
That easily runs via Twitter or Skype or even IRC.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Why waste your own time and bandwidth?
I recommend the one-mail-per-file, and one-directory-per-folder, idea. It's not exactly, well, new - but it beats everything else by miles. /under your control/.
Yes, this means you keep your mail local. This is a good thing, as this means
grep? Yes, works. Easily.
glimpseindex? Yes, works. Easily.
Anything else? Yes, works. Easily.
I keep all my mail from 1998 onwards (when I switched from a certain commercial provider with a proprietary email system) in that way. And it Just Works.
('course, Gmail/MSNmailorwhateveritscalledtoday are out. Who cares. exmh (and mutt in a pinch) FTW.)
I have about a dozen different IMAP accounts synced up with offlineimap and can search and filter through about 5GB of emails in under a second thanks to mu4e's indexing and rich filtering syntax.
And because all the emails are stored on my server, they're all incrementally backed up as part of my daily system backup process.
I use a combination of mutt + offlineimap + notmuch for mail, local archiving and a very powerful search.
I've been on this setup the past 6years or so. If mutt isn't your thing this approach is modular so you could simply sync with offlineimap and index/search with notmuch.
Have a squat over at the hobo house.
Some options:
1) Upload them to gmail - it has very strong search ability, and can do message bodies as well as metadata.
2) Put them in an imap server, like dovecot. Last I heard, dovecot could index e-mail metadata, but not message bodies
3) Put them in a Maildir or MH folder, and index them using lucene or pyindex or whatever. I use http://stromberg.dnsalias.org/~strombrg/pyindex.html , but I'm biased - I wrote it. This can do message bodies as well as metadata.
I already save it, now I want to index it and make some sense of it. It is impressively valuable.
I don't mind deleting the spam. :D
JJ
auto rename each email according to date, time, other party and subject, and save as txt file.
I asked a similar question to Slashdot about a month ago, where I wanted to stash E-mail and have it accessible if I'm on the road.
I looked at a few options. Using a virtual machine, an offsite storage provider, and so on.
What I have wound up doing is buying a NAS. Synology or QNAP are good companies for this. The NAS I bought was a basic one, but it supports RAID 1, which is critical. It also gets backed up automatically via a script that goes in via SSH, creates a tar file, pipes it to zbackup which has a repository on another NAS. zbackup is ideal for backups of E-mail, and having another machine pull the backups helps deal with ransomware, once the bad guys start hitting devices.
I then enabled the mail server functionality, which gave me an implementation of dovecot and roundcube. This not just gave me IMAP access, but access via the web (SSL). Using the onboard firewalling, I limited the IP range that the NAS talks with, to just the IP range of the commercial VPN service I use (which is a small provider, run by some competent admins.) This way, for an attacker to even get to an open port forwarded past the router to the machine, they have to have an account with that small VPN provider.
For me, this has worked well. I have access to my E-mail over IMAP or the web. Since the NAS doesn't send or receive mail directly (mail just gets copied to it when archived), it doesn't need SMTP access in or out.
Caveat: Focus on security when setting this up. Ideally, you could use the NAS's built in eCryptFS capability to protect the IMAP maildir directories so physical theft of the NAS doesn't mean your critical E-mails belong to someone else. From there, put the NAS in its own DMZ, blocking all outgoing traffic except for it checking for OS updates, and only allowing incoming traffic to the TLS-based ports, preferably with heavy IP restrictions. For backups, do a pull based system, so if the NAS gets infected, the bad guys can only put garbage in the backups, and not attack previously stored data.
.. and it is the standard format for storing emails that has been around since email was invented. Some proprietary mail applications like to use their own custom format ( I'm looking at you Micro$oft Outlook ), but Thunderbird still uses the standard format and so will always be usable.
I had email going back to 1990. Backups in various formats, including QWK, BlueWave, CSV, PST, elm and a dozen or so email accounts I rarely (if ever) used any more. I use gmail, and wanted it all accessible to me on gmail.
I started by converting files. I found a utility that exported all of my QWK/BlueWave emails to CSV files (it also put attachments in a folder, linking to the file in the message. Very few attachments in the really old stuff...). Next I used Outlook to import the csv files from each of those accounts, each to its own pst file (this was just a smart move). A little hand work to add the 250 attachments back to the files they belonged in. I then created the matching folder structure on my gmail account and copied the messages over.
Next, I fired up Thunderbird and imported all of the .elm messages into that. After the import, I used IMAP to create the directory structure and copied over all of the mail to gmail.
Next came the CSV files. For that, I used Outlook's import feature to bring those emails in. Again, I then used IMAP to create the directory structure and import into gmail.
Same went for the PST files. Opened them up one-by-one, created the structure and moved the mail to gmail.
The initial move took time. It took me a week or so to import all 8 gig of email - most of it was waiting for processes to complete. But once it was done, it was done. But like most, I want it in more than one place. So I use getmail to backup my gmail account. This runs every 2 hours (I can tolerate a potential 2-hour loss).
opening Outlook with the .pst file and connecting Outlook to gmail via IMAP. Folder by folder, I copied the email (all 6GB of it) over to gmail using its pseudo-IMAP.
Thunderbird is the next generation of Eudora. Thunderbird stores attachments within the email messages. Eudora stores the attachments as external files. Depending on your requirements, with Eudora you'll get much smaller mailboxes at the end. They're all flat text (without all the attachments), so they zip nicely. Eudora has a good search engine within as well.
Access them with whatever tool which can do it: Mutt, Dovecot imap, whatever. They'll index that for you, but remember: index is a throwaway convenience, the original is the mails in mbox or maildir format.
Simple, no lockin to any stupid software insisting on reinventing wheels.
Don't trust any software insisting on "converting" your mails.
Always used mbox format, got 7 years emails right here, immediately accessible,
before that on an old hard drive, same format, easy to load, backed up in annual mbox files.
Easy job for grep, or just open with Thunderbird and sort/search.
Go well
The market is now shifting from Veritas EV to MS Exchange 2013, which has an archiving feature for free, and even to Office 365. Done quite a few migrations this year.
You can also look into EMC SourceOne and CommVault.
A tip: Dovecot has a nice sync tool http://wiki2.dovecot.org/Tools... Perfect to get your email from different IMAP sources to your own system. It can also change mailbox format etc. Combine that with Dovecot itself to give you IMAP access and you have access. You can also use it to keep it in sync with an off site archive.
Dovecot does have full body search, but it is quite CPU intensive. No problem if you just run it for a few users and accept that it may take a while on a large amount of email. Not too great if you're hosting for lots of users.
---
Does anyone have any actual experience in this? I've got clients with mailboxes containing 500,000 to 1 million messages. Archiving isn't very hard. Searching is a fucking BITCH!
500,000 messages is a HUGE amount of email; it takes forever to archive/index/re-index in a way that allows full text searches. I have yet to find anything, other than commercial SQL-based archive and discovery systems, that works reliably or is worth a shit. It's costly too, because not only do you have to pay a massive fee for the software, there are also the additional server and storage requirements.
I'd love to find a solution. I see these idiotic posts on Slashdot about using IMAP and Thunderbird(!?). Are you kidding? You don't understand the problem.
That much mail is an issue to leave on the server. Storage requirements, performance issues, indexing and recovery times are all major problems that make leaving all the messages on the server highly undesirable. Putting the messages into an offline archive is tedious and beyond slow or difficult to search, as the OP explained. The archive needs to be online or nearline. It needs to perform well and be quickly searchable with near instant retrieval. But, it will be disused for most of its life so it needs to be cost/resource effective. AND, unlike most of the commercial offerings, it needs some form of standards compliance. Exporting 5,000 messages for discovery in a proprietary format or as individual unsearchable image PDFs is not useful. It should support import/export in multiple formats including mail client readable and searchable documents like PDF-A.
Check out MailVault (mailvault.in). It is super easy to setup and use, runs on Windows & Linux, has a Google-like search (via a web interface) for ALL your email, or if you prefer, you can access your mail right from within your email client via the built-in IMAP server.
It can import from a variety of sources (mbox, maildir, emls, thunderbird, pst, pop3, imap, smtp) - so you should be able to archive your old, as well as your current email.
Assuming this is for personal use, you can run it on your laptop, back-up onto a portable hard-disk (it can also automatically make a secondary backup on the external disk), and you'll always have access to all your email, whenever you want.
I love grep, but to handle many years worth of email, I prefer MailVault :)
There's a standard that has been around forever and is the only one that's guaranteed to be openable forever - standard Unix mail spool files and mbox files.
Why, you ask, will they be around forever? Because they're text files which can be viewed, searched and processed by anyone who knows the few rules about how they work. It's not like sed, awk and grep are somehow going to be replaced by the vendor with new programs that no longer support ASCII.
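To make "the few rules" concrete, here is a rough sketch of an mbox reader that uses nothing mbox-specific from any library. It assumes the simplest variant of the format: a line starting with "From " opens a new message, and body lines that originally started with "From " were stored as ">From ". Real-world files differ in how they quote such lines, so treat this as an illustration only; the function names are made up.

```python
import email
from email.utils import parsedate_to_datetime


def iter_mbox_messages(lines):
    """Yield each message in an mbox stream as a list of its lines."""
    current = None
    for line in lines:
        if line.startswith("From "):
            # Separator line: it starts a new message and is not part of it.
            if current is not None:
                yield current
            current = []
        elif current is not None:
            # Undo the ">From " quoting described above.
            current.append(line[1:] if line.startswith(">From ") else line)
    if current is not None:
        yield current


def find(lines, sender=None, since=None):
    """Return (date, from, subject) for messages matching sender and/or date."""
    hits = []
    for raw in iter_mbox_messages(lines):
        msg = email.message_from_string("".join(raw))
        if sender and sender.lower() not in str(msg["From"] or "").lower():
            continue
        if since is not None:
            try:
                when = parsedate_to_datetime(str(msg["Date"]))
            except (TypeError, ValueError):
                continue  # no usable Date header, so it cannot match a date filter
            if when.date() < since:
                continue
        hits.append((str(msg["Date"] or ""), str(msg["From"] or ""), str(msg["Subject"] or "")))
    return hits
```

With a file opened as f (reading with errors="replace" is probably wise for old mail), find(f, sender="slashdot", since=datetime.date(2015, 1, 1)) gives the kind of address and date search the original question asked about, straight off the text file.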
Use MailVault. It is super easy to setup and use, runs on Windows & Linux, has a Google-like search (via a web interface) for ALL your email, or if you prefer, you can access your mail right from within your email client via the built-in IMAP server.
It can archive email from a variety of sources (mbox, maildir, emls, thunderbird, pst, pop3, imap, smtp) - so you should be able to archive your old, as well as your current email.
Assuming this is for personal use, you can run it on your laptop, back-up onto a portable hard-disk (it can also automatically make a secondary backup on the external disk), and you'll always have access to all your email, whenever you want.
I love grep, but to store and manage many years worth of email, I prefer MailVault :)