Slashdot Mirror


Best Way To Archive Emails For Later Searching?

An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"

29 of 385 comments

  1. Delete by Anonymous Coward · · Score: 2, Insightful

    Time to delete them all

  2. A Lawyer's Fantasy ... by perpenso · · Score: 4, Insightful

    I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...

    You are a hostile lawyer's fantasy come true. ;-)

    1. Re:A Lawyer's Fantasy ... by tareko · · Score: 1, Insightful

      I don't get the difference between this person and all the people of old whose many personal and mundane letters litter collections everywhere and make historical accounts more rich and precise. I bet he can track the births and deaths of countless relationships through those emails, which is itself of tremendous worth.

    2. Re:A Lawyer's Fantasy ... by ShakaUVM · · Score: 2, Insightful

      >>You are a hostile lawyer's fantasy come true. ;-)

      We've won a couple lawsuits because I save all of my email.

      We had a contract to do a workshop with Maricopa County - the same people whose Sheriff is under investigation by the FBI right now, and of Immigration Law fame. And who have a lot of other shady things going on right now, but I digress.

      I'd traded a series of emails with them planning the workshop. Everything was all set. Then, about a week before the workshop, they say they don't need me to come after all. Ok, sure. So I try to reschedule with them. Nope, sorry, you didn't show up to the workshop, so you breached the contract. I sent them a copy of all the emails. Nope, sorry.

      Filed a lawsuit. They wouldn't settle. Showed the email trail of everything. Got a check for over $30k. Didn't have to do the work. (Of course, I'd have preferred if everyone had just done as they'd said, and it was much more of a hassle to sue than to just do the damn work.)

      Lawsuits are often won over who has the best documentation. If you do your work honestly, having full email records is probably going to help you more than hurt you in lawsuits.

  3. Google Mail. by sidragon.net · · Score: 3, Insightful

    See subject.

    1. Re:Google Mail. by Anonymous Coward · · Score: 1, Insightful

      Uh, privacy would be the reason.

  4. Mbox or SQLite by Anonymous Coward · · Score: 2, Insightful

    If you want an "email format" why not mbox? Many things currently support that as an import option.

    If you want a database, why not SQLite? It's about as open as can be, backwards compatibility is almost a religion, and it should have no problem with hundreds of thousands of entries.
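For what it's worth, the two suggestions combine easily. Below is a minimal sketch in Python, assuming the archive is already one mbox file; the table layout and the names `index_mbox`, `search` and `demo` are made up for illustration. It uses a plain `LIKE` query rather than a full-text index, so it shows the shape of the idea, not a tuned setup.

```python
import mailbox
import sqlite3
import tempfile
from email.message import EmailMessage
from pathlib import Path


def body_text(msg):
    """Concatenate the text/plain parts of a message into one string."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label in an old message
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)


def index_mbox(mbox_path, db_path):
    """Copy sender, subject, date and plain-text body into a SQLite table."""
    box = mailbox.mbox(str(mbox_path), create=False)
    db = sqlite3.connect(str(db_path))
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail "
        "(sender TEXT, subject TEXT, date TEXT, body TEXT)"
    )
    with db:  # one transaction for the whole import
        for msg in box:
            db.execute(
                "INSERT INTO mail VALUES (?, ?, ?, ?)",
                (str(msg.get("From", "")), str(msg.get("Subject", "")),
                 str(msg.get("Date", "")), body_text(msg)),
            )
    box.close()
    return db


def search(db, term):
    """Return subjects of messages whose body contains term (substring match)."""
    rows = db.execute(
        "SELECT subject FROM mail WHERE body LIKE ? ORDER BY rowid",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


def demo():
    """Build a two-message mbox in a temp dir, index it, and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "archive.mbox"
        box = mailbox.mbox(str(path))
        for subject, text in [("Workshop plan", "See the contract attached."),
                              ("Lunch", "Pizza on Friday?")]:
            msg = EmailMessage()
            msg["From"] = "alice@example.org"
            msg["Subject"] = subject
            msg.set_content(text)
            box.add(msg)
        box.close()
        db = index_mbox(path, ":memory:")
        hits = search(db, "contract")
        db.close()
        return hits
```

The mbox stays the archive of record here; the database is only a disposable index that can be rebuilt from it at any time.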

  5. Re:Gmail? by siliconbits · · Score: 2, Insightful

    I second that. Invest in Google Apps to benefit from additional services as well.

  6. Re:Psychiatric consultation! by pz · · Score: 4, Insightful

    You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.

    How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd or not normal is either being a troll, or not understanding how the world works when you're not just a drone.

    IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.

    --

    Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
  7. Re:Gmail? by pvera · · Score: 3, Insightful

    Yes! The thing that appeals to me the most about using Gmail is that searching through 5+GB of old emails won't make everything in my machine slow to a crawl. Even with the free Gmail account, you can up the storage to 20GB for $5/year, and that extra space is available from other Google services connected to the same account.

    If you want to have more flexibility, sign up for a Backupify account, which can back up Gmail pretty well. As a bonus, when Backupify stores your backups they are kept in plain text format, so you can always pull these and move them elsewhere without having to worry about issues with Gmail's storage formats.

    --
    Pedro
    ----
    The Insomniac Coder
  8. Store them in mbx format by Anonymous Coward · · Score: 2, Insightful

    I recommend mbox (MBX) format.

    1. The format is text based and not likely to become unreadable anytime in the foreseeable future.

    2. There is no shortage of tools for manipulating mbox.

    3. It's easily indexed by full text search applications (MS Search included with Windows)

    The Outlook tools save dialogue has an Apple export option which is actually the mbox format.

    In terms of archival access I recommend an IMAP server with a folder hierarchy based on month/year. Your mail client should be configured to leave the messages on the server (not attempt to download via IMAP). This somewhat future proofs migration to different mail clients.

    The only issue is that IMAP searches are out of the question, so you will need to do searches offline with a full text indexing/search application to first find the general folder location of the message you are seeking.

    If your computer has lots of memory, why not just use grep and write a small shell script to forward the message from the archival file to your inbox so that formatting, etc. is preserved? If you're doing lots of searches, the disk cache will keep most of it in RAM even if it's a few GB.
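The grep idea can be sketched in Python as well, which has the advantage of handing back whole messages rather than matching lines. This is only an illustration: `grep_mbox` and `demo` are hypothetical names, and the match runs against the raw message bytes, so (like grep) it will miss text hidden inside base64 or Quoted-Printable encoded parts.

```python
import email
import mailbox
import re
import tempfile
from email.message import EmailMessage
from pathlib import Path


def grep_mbox(mbox_path, pattern):
    """Return the raw bytes of every message whose raw text matches pattern."""
    regex = re.compile(pattern.encode("utf-8"), re.IGNORECASE)
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        hits = []
        for key in box.keys():
            raw = box.get_bytes(key)  # headers and body, without the "From " line
            if regex.search(raw):
                hits.append(raw)
        return hits
    finally:
        box.close()


def demo():
    """Write a tiny mbox to a temp dir and grep it for a phrase."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "2010-06.mbox"
        box = mailbox.mbox(str(path))
        for subject, text in [("Workshop dates", "Is June 1st still OK?"),
                              ("Invoice", "Payment is due.")]:
            msg = EmailMessage()
            msg["From"] = "bob@example.org"
            msg["Subject"] = subject
            msg.set_content(text)
            box.add(msg)
        box.close()
        hits = grep_mbox(path, r"june 1st")
        return [email.message_from_bytes(raw)["Subject"] for raw in hits]
```

Because each hit is the complete original message, it can be re-delivered or appended to an inbox with headers, attachments and formatting intact.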

    1. Re:Store them in mbx format by Sancho · · Score: 2, Insightful

      I find that Maildir works better than mbox for my purposes. Roughly all of the same pros, plus:
      4) Doesn't require locking your entire mailbox to modify one message.
      5) Resistant to file/inode corruption (will likely only corrupt one message instead of several.)
      6) Can essentially use shell tools to copy individual messages.

      One thing that's neat to do with maildir mailboxes is to search using grep+xargs and copy the messages you find into a new maildir mailbox (named, perhaps, searchresults). Then you have a handy mailbox populated with your search results. I imagine one could even do this using procmail, so that you could populate the mailbox remotely.
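A rough Python equivalent of the grep+xargs trick, using the standard `mailbox` module instead of shell tools. The function names and the `searchresults` folder are illustrative, and the match is a case-insensitive substring test on the raw message bytes, so encoded bodies will not match.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def copy_matches(src_dir, dest_dir, needle):
    """Copy every message containing needle into a second Maildir."""
    src = mailbox.Maildir(str(src_dir), create=False)
    dest = mailbox.Maildir(str(dest_dir), create=True)
    needle_bytes = needle.lower().encode("utf-8")
    copied = 0
    for key in src.keys():
        raw = src.get_bytes(key)
        if needle_bytes in raw.lower():
            dest.add(raw)  # originals stay untouched in the source Maildir
            copied += 1
    src.close()
    dest.close()
    return copied


def demo():
    """Fill a throwaway Maildir, then collect the hits in 'searchresults'."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_dir = Path(tmp) / "archive"
        archive = mailbox.Maildir(str(archive_dir), create=True)
        for subject in ["Contract draft", "Lunch?", "Signed contract"]:
            msg = EmailMessage()
            msg["Subject"] = subject
            msg.set_content("See subject.")
            archive.add(msg)
        archive.close()
        results_dir = Path(tmp) / "searchresults"
        copied = copy_matches(archive_dir, results_dir, "contract")
        results = mailbox.Maildir(str(results_dir), create=False)
        return copied, len(results)
```

Any Maildir-aware client pointed at the results folder then shows the search hits as an ordinary mailbox.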

  9. mbox +mutt/thunderbird+mairix by Anonymous Coward · · Score: 1, Insightful

    I have been archiving my mails for the past 10 years. My method has been to download the mails in mbox format once a year and use a combination of mairix to search through the mails and either mutt or thunderbird to see the actual mails.

  10. We have something similar at Work by juanca · · Score: 3, Insightful

    At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (about 1K). We set up an Ubuntu server with Postfix and Dovecot IMAP over SSL, using Maildir.

    Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:

    INBOX
    |- YYYY
       |- MM
          |- DD
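As an illustration of that layout, here is a small Python sketch that maps a message's Date header to a per-day folder name. The function name, the `/` separator and the `undated` fallback folder are assumptions for the example; a real IMAP server may use a different hierarchy delimiter.

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, root="INBOX"):
    """Map an RFC 2822 Date header to a root/YYYY/MM/DD folder name."""
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Unparseable or missing dates go to a catch-all folder (an assumption).
        return f"{root}/undated"
    # Uses the date as written in the header, without converting time zones.
    return f"{root}/{when:%Y}/{when:%m}/{when:%d}"
```

For example, `archive_folder("Tue, 01 Jun 2010 09:30:00 -0700")` gives `INBOX/2010/06/01`.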

    The auditors use Evolution to connect to the archive server and search the emails. It takes a little while to load a day of emails for the first time, but once it's properly loaded, searching is really fast. The server is not that powerful: it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage though.

    Hope this helps.

    --
    --Necesito una chela, bien fria...
  11. IMAP with maildir backend by Fat Cow · · Score: 2, Insightful

    I migrated all my old personal emails to gmail using IMAP. You can use this to migrate between different on-disk formats like maildir, mbox and pst. I had all my email in yahoo and pulled it down using POP to a maildir, then used an IMAP mail client to copy it across to gmail. Then I regularly back them up from gmail to an on-disk maildir format using mbsync. I picked maildir because it's open and seemed better designed than the alternative, mbox. It's not completely standardized though. I've seen PSTs become corrupt so I try and stay away.

    --
    stay frosty and alert
  12. Re:POO (Plain Old Outlook) by dakohli · · Score: 2, Insightful

    I have to say that PSTs can be convenient. However, I have seen many corrupted PSTs over the years, and yes, I know that there are tools to fix this, but the name of the game here is to actually get your emails out with a minimum of fuss. Also, as to compatibility, I know MS has arbitrarily changed the format of Word. There is nothing to stop them from doing the same to the PST format, and there are several versions of that in existence now. Add to this the fact that as the PSTs get bigger, performance drops off. As a really easy expedient solution, using PSTs will work, but not well. Using them as a solution for the problem, however, will I think only compound the issues in the long run.

  13. DO NOT DELETE. by GuyFawkes · · Score: 5, Insightful

    I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.

    Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.

    It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...

    DO NOT DELETE YOUR ARCHIVES, EVER!***

    *** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.

    --
    http://slashdot.org/~GuyFawkes/journal
    1. Re:DO NOT DELETE. by cervo · · Score: 3, Insightful

      This can also work against you. Most big companies have record retention policies that include when to delete e-mails, because those same archives that saved you can bite you in the butt. Also in reality you should be innocent until proven guilty anyway, although I know civil court works differently. But if there is anything you did, maybe an e-mail to another woman that can be spun as evidence you had another girlfriend (even if it was a harmless e-mail just saying hi), then it could bite you.

      Plus no one is 100% squeaky clean. Maybe you admitted you were speeding to someone. Maybe you bought porn website memberships (which could be spun as the reason for a break up, or that you are an unfit parent). Maybe you admitted you were a little too drunk to drive but did it anyway. Maybe you ordered a set of army knives and have the receipt and that gets spun as you have weapons all over the place that could endanger the kids....

      Anyway, just saying that too many records could bite you too. Especially if someone from court gets an order for all of them. Then they can be pulled out of context and could be very damaging. Even medical issues could be in the e-mail archives from correspondence with doctors, confirmations of appointments, etc... If that data ever got out it could be damaging to buying insurance as well.

    2. Re:DO NOT DELETE. by afabbro · · Score: 2, Insightful

      Alternatively, spend more time on your personal relationships and home life than maintaining your email archives.

      --
      Advice: on VPS providers
    3. Re:DO NOT DELETE. by Anonymous Coward · · Score: 1, Insightful

      "If one would give me six lines written by the hand of the most honest man, I would find something in them to have him hanged."

      --Armand Jean du Plessis, Cardinal et Duc de Richelieu

  14. Re:Psychiatric consultation! by Jawnn · · Score: 2, Insightful

    How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important.

    While the value you place on being able to retrieve critical pieces of information may be valid, your choice of storage medium is not. An email system is not a file server or database. Most index poorly, if at all, making searches horribly inefficient. And as has already been observed, it may be quite likely that those same things you value will be more than offset by their value to a hostile litigant.

  15. just because I can. by socsoc · · Score: 3, Insightful

    just because I can.

    That's a big assumption. You are asking Slashdot, so I'm thinking you can't. Especially because IMAP never occurred to you.

  16. What about the privacy of those you email with? by perpenso · · Score: 2, Insightful

    What about the privacy of those you correspond with? If they send an email to a gmail account that is one thing, but you are unilaterally deciding to have them participate in the targeted advertising profiling.

  17. Re:Maildir by jgrahn · · Score: 2, Insightful

    Maildir storage format is resistant to bit-rot because it stores each message in a separate file, and uses filesystem directories for mail folders. It's widely supported by user agents (mail readers) and IMAP/POP3/SMTP servers, so you'll never be stranded by the actions of a single software vendor. Finally, it's easily searched using everyday unix tools - find, grep, sed, awk, etc., and you can use the full-text search engine of your choice for speedy searches.

    The only sane alternatives are, as far as I'm concerned:

    • a collection of mbox files
    • a collection of gzipped mbox files
    • a collection of Maildir folders
    • a collection of tarred and gzipped Maildir folders

    Maildir isn't quite as well supported as mbox, but I suppose it's sometimes more convenient to grep these since you get a hit on the particular mail you're searching for, not the mbox file which contains that mail and a thousand others.

    I use gzipped mbox files. One thing I have considered doing is to convert away Quoted-Printable MIME encoding and use Latin 1 (or UTF-8) everywhere. That would make the mboxes easier to use with standard tools like text editors and grep.
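The decoding step itself is short in Python. This is a sketch only: `qp_to_utf8` is a made-up helper that handles one already-extracted body, and it assumes the caller knows the original charset. A real conversion would also have to rewrite the Content-Transfer-Encoding and charset headers of each message.

```python
import quopri


def qp_to_utf8(encoded, charset="iso-8859-1"):
    """Decode a Quoted-Printable body and re-encode the text as UTF-8 bytes."""
    raw = quopri.decodestring(encoded)       # undo =XX escapes and soft line breaks
    text = raw.decode(charset, errors="replace")
    return text.encode("utf-8")
```

With that, a body stored as `caf=E9 au lait` becomes plain `café au lait`, which grep and a text editor can match directly.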

    I would never use a database for this. It serves no purpose, except as an invitation for the fuckup fairy. The searches you'd want to do are free-text searches anyway.

  18. Re:Psychiatric consultation! by Anonymous Coward · · Score: 1, Insightful

    Say what?

    It's the modern equivalent of saving all your personal letters and other correspondence. What the heck is abnormal about that? In the old days you'd have a bundle of letters stored in the attic somewhere. But this doesn't result in heaps of paper or file cabinets full of it that get in your way, as it does for people with a genuine mental problem. For e-mail, you can store it all on one small (these days) hard disk placed in a drawer somewhere, with space to spare -- even with all the spam! And the process of figuring out how to better organize it and archive it going forward will be a useful learning exercise that might have applications elsewhere (e.g., at work, where people might be asking exactly the same question).

    It's no worse than deciding to tidy up your office or study area and figuring out a system to better keep track of things so you can find them later.

    I mean, heck, the President of the United States had the same fricking problem: how to properly archive e-mail, a problem discussed here numerous times. As a common problem -- personally and in business -- listening to other people's solutions before digging into it yourself is an efficient way to deal with it.

  19. Re:IMAP by Compuser · · Score: 2, Insightful

    Huh? How does a server help with a local archive of emails? Do any of these servers help with importing emails (pre-mbox ARPANET emails, for instance, or dbx emails for a more modern example)? Does it provide fast searching (including .doc and .ppt awareness)? This may be a storage approach but does not begin to deal with the question raised.

  20. Re:It's obvious - Gmail by flappinbooger · · Score: 4, Insightful

    It's obvious, upload them to gmail!

    (only half kidding)

    --
    Flappinbooger isn't my real name
  21. Hold the phone! by Anonymous Coward · · Score: 2, Insightful

    Computers, hard drives, backups, electricity, rack space, and maintenance are all free! Fuck! Tell me where you shop for this stuff.

  22. Re:RETARD MODERATION by halltk1983 · · Score: 5, Insightful

    Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.

    Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.

    I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.

    --
    Watch for Penguins, they eat Apples and throw rocks at Windows.