Slashdot Mirror


Best Way To Archive Emails For Later Searching?

An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"

29 of 385 comments

  1. It's obvious by Mikkeles · · Score: 3, Funny

    Alphabetically!

    --
    Great minds think alike; fools seldom differ.
  2. IMAP by klingens · · Score: 5, Informative

    An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.

    1. Re:IMAP by 19thNervousBreakdown · · Score: 3, Informative

      Seconding this. I've been using Dovecot with Maildir on EXT3 for the last few years--my mailbox is about 25k messages, which I keep all in a single folder and use IMAP tags to organize into different virtual folders, much like Gmail's system but without the privacy concerns.

      Dovecot's supplementary indexes make everything extremely fast (tags, dates, etc.), and anything they don't catch, Thunderbird does. I can search my entire mailbox for a single word in less than a second. I lose my Thunderbird indexes whenever I move to a new computer, but rebuilding them is just a matter of leaving the client up for a few hours.

      --
      <xml><I><am><so><damn>Web 2.0</damn></so></am></I></xml>
  3. A Lawyer's Fantasy ... by perpenso · · Score: 4, Insightful

    I have kept every email I have ever sent or received since 1990 with the exception of junk mail (though I kept a lot of that as well) ...

    You are a hostile lawyer's fantasy come true. ;-)

  4. Google Mail. by sidragon.net · · Score: 3, Insightful

    See subject.

  5. Re:Psychiatric consultation! by balaband · · Score: 5, Funny

    This is slashdot. We save computers older than your dad just to use them as alarm clocks. Please leave.

  6. Print by JustOK · · Score: 4, Funny

    Print then scan

    --
    rewriting history since 2109
  7. Gmail? by spiffydudex · · Score: 5, Informative

    While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
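For the "push over IMAP" step, here is a minimal Python sketch, assuming the old mail has already been exported to an mbox file. `upload_mbox` and the `Archive` folder name are illustrative, not anything Gmail prescribes; `conn` is expected to behave like a logged-in `imaplib.IMAP4_SSL` connection, and the target folder has to exist already.

```python
import imaplib
import mailbox
import time
from email.utils import parsedate_to_datetime


def upload_mbox(conn, mbox_path, folder="Archive"):
    """Append every message in an mbox file to an IMAP folder; return the number accepted."""
    box = mailbox.mbox(mbox_path, create=False)
    accepted = 0
    try:
        for msg in box:
            try:
                # Keep the original date so the archive sorts sensibly.
                when = parsedate_to_datetime(msg["Date"]).timestamp()
            except (TypeError, ValueError, OverflowError, OSError):
                when = time.time()  # missing or unparseable Date header
            typ, _ = conn.append(
                folder, None, imaplib.Time2Internaldate(when), msg.as_bytes()
            )
            if typ == "OK":
                accepted += 1
    finally:
        box.close()
    return accepted


# Hypothetical usage (host name and credentials are placeholders):
#   conn = imaplib.IMAP4_SSL("imap.example.com")
#   conn.login(user, password)
#   upload_mbox(conn, "old-mail.mbox")
```

This sends one APPEND per message, so a few hundred thousand emails will still take a while.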

    1. Re:Gmail? by pvera · · Score: 3, Insightful

      Yes! The thing that appeals to me the most about using Gmail is that searching through 5+GB of old emails won't make everything in my machine slow to a crawl. Even with the free Gmail account, you can up the storage to 20GB for $5/year, and that extra space is available from other Google services connected to the same account.

      If you want to have more flexibility, sign up for a Backupify account, which can back up Gmail pretty well. As a bonus, when Backupify stores your backups they are kept in plain text format, so you can always pull these and move them elsewhere without having to worry about issues with Gmail's storage formats.

      --
      Pedro
      ----
      The Insomniac Coder
  8. mbox + grep by Anonymous Coward · · Score: 5, Funny

    I use mbox format files and grep.

    IMO, one can't get much more portable than that.
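If grep alone gets awkward (it matches lines, not messages), the same idea can be sketched in Python with the standard `mailbox` module. `grep_mbox` is a made-up helper name, and like grep it only sees the raw text, so base64-encoded bodies won't match.

```python
import mailbox
import re


def grep_mbox(path, pattern):
    """Yield (subject, sender) for each message whose raw text matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            # as_string() is the message as stored: headers plus undecoded body.
            if regex.search(msg.as_string()):
                yield msg.get("Subject", ""), msg.get("From", "")
    finally:
        box.close()
```

Something like `for subject, sender in grep_mbox("archive.mbox", "quotation"): print(sender, subject)` then lists whole messages rather than matching lines (the file name is just an example).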

  9. Maildir by alexhs · · Score: 4, Informative

    Maildir.

    And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread

    --
    I have discovered a truly marvelous proof of killer sig, which this margin is too narrow to contain.
    1. Re:Maildir by El_Muerte_TDS · · Score: 3, Informative

      mairix is a useful addition to a maildir setup: http://www.rpcurnow.force9.co.uk/mairix/

  10. Good IMAP Server by caffeinejolt · · Score: 5, Informative

    If this is really important to you, and you want it all to work across multiple workstations/OSes, your best bet will be to store it all in IMAP. If you have the means and motivation to run this yourself, I would recommend Dovecot. If you don't have the means and motivation, then you can use a service like Gmail to run your IMAP although you give up certain freedoms in doing so. For example, I use Dovecot coupled with Maildir++ as the physical storage format - as a result I can (if I wanted to) change to any email client I wish very quickly, use different email clients at the same time, etc.
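If part of the old archive is sitting in mbox files, getting it into a Maildir that a server like Dovecot can serve might look roughly like this. It is a sketch using Python's standard `mailbox` module; `mbox_to_maildir` is an illustrative name, and PST files would first need exporting to mbox with some other tool.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return how many were copied."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # Converting to MaildirMessage carries over read/replied/flagged status.
            dst.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        src.close()
        dst.close()
    return copied
```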

  11. Re:Psychiatric consultation! by pz · · Score: 4, Insightful

    You, sir, are a mental case! I suspect you have OCD with some component of Asperger's that is making you have this fixation on doing all this work to save ancient bits of information.

    How was this modded Informative? Saving correspondence for future reference is critically important. I have many times needed to refer back to messages that are years old, in order to pull up a vital bit of information that was suddenly relevant. I have needed to pull up an attachment from an email a few months old, or view the exact wording of correspondence, check the date of a quotation, etc., more times than I can count, so searching and retrieval are both vitally important. When I run events, I need to be able to post-hoc review all of the correspondence for demographic analysis, often done two years after the event when the final reports are being written. Saying that this sort of behavior is odd, or not normal is either being a troll, or not understanding how the world works when you're not just a drone.

    IMO, this is one of the best Slashdot questions ever, and I am greatly anticipating hearing some good answers, especially if they don't include suggesting GMail as a panacea, as I want to have the email text and attachments in my possession.

    --

    Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
  12. Maildir by roderickm · · Score: 4, Interesting

    Maildir storage format is resistant to bit-rot because it stores each message in a separate file, and uses filesystem directories for mail folders. It's widely supported by user agents (mail readers) and IMAP/POP3/SMTP servers, so you'll never be stranded by the actions of a single software vendor. Finally, it's easily searched using everyday unix tools - find, grep, sed, awk, etc., and you can use the full-text search engine of your choice for speedy searches.
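Because every message is its own file, a search tool can be a short script. The sketch below uses only the Python standard library and, unlike plain grep, decodes quoted-printable and base64 bodies before matching. `search_maildir` is an illustrative name, and it only looks at the top-level `cur` and `new` directories, not subfolders.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def search_maildir(root, needle):
    """Return paths of message files whose subject or decoded body contains needle."""
    needle = needle.lower()
    hits = []
    for sub in ("cur", "new"):
        for path in sorted(Path(root, sub).glob("*")):
            with open(path, "rb") as fh:
                msg = BytesParser(policy=policy.default).parse(fh)
            body = msg.get_body(preferencelist=("plain", "html"))
            try:
                text = body.get_content() if body is not None else ""
            except (LookupError, KeyError):
                text = ""  # unknown charset or odd content type: match on subject only
            if needle in f"{msg['subject'] or ''}\n{text}".lower():
                hits.append(path)
    return hits
```

This reparses every file on every search, so for hundreds of thousands of messages a real indexer is the better fit.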

  13. Re:Psychiatric consultation! by Cylix · · Score: 4, Interesting

    I never thought of turning an ancient host into an alarm clock.

    Once however, I did hollow out an SGI case and turn it into a refrigerator.

    The case was just too damned pretty to throw away.

    --
    "You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
  14. citadel by samjam · · Score: 3, Informative

    citadel at www.citadel.org is a full pop3/imap server with full-text indexing.

    Thunderbird can use server-side searches to find messages, and I find that works pretty well.
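A server-side search is also easy to script against any IMAP server. A sketch with Python's `imaplib`, where `server_search` is an illustrative name and `conn` is assumed to be a logged-in connection; it handles ASCII search terms only.

```python
import imaplib  # conn is expected to behave like imaplib.IMAP4 / IMAP4_SSL


def server_search(conn, needle, folder="INBOX"):
    """Ask the server for messages containing needle; return their sequence numbers."""
    # IMAP quoted strings need backslashes and double quotes escaped.
    quoted = '"' + needle.replace("\\", "\\\\").replace('"', '\\"') + '"'
    typ, _ = conn.select(folder, readonly=True)
    if typ != "OK":
        raise RuntimeError(f"cannot select {folder!r}")
    # TEXT matches headers and body; the work is done on the server.
    typ, data = conn.search(None, "TEXT", quoted)
    if typ != "OK":
        raise RuntimeError("search failed")
    return [int(num) for num in data[0].split()]
```

How fast this is depends entirely on whether the server keeps a full-text index.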

  15. An Advertiser's Fantasy ... by perpenso · · Score: 5, Interesting

    And now the poster becomes an advertiser's dream come true in addition to being a hostile lawyer's dream come true. ;-)

    Remember that from Google's perspective gmail is a tool to better profile you for targeted advertising. Make sure you are OK with that before giving them access to all your emails.

  16. We have something similar at Work by juanca · · Score: 3, Insightful

    At work, we needed to archive (for compliance purposes) all the inbound/outbound email messages of our users (about 1K). We set up an Ubuntu server with postfix and dovecot IMAP over SSL, using Maildir.

    Our users generate about 20K email messages daily, and we store each day in its own directory, something like this:

    INBOX
            |- YYYY
                          |- MM
                                    |- DD
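
The folder for a given message can be derived from its Date header. A minimal Python sketch: `archive_folder`, the `INBOX` root and the `/` separator are illustrative assumptions (the real separator depends on the server's namespace configuration), and this is not necessarily how the setup described here does its filing.

```python
from email.utils import parsedate_to_datetime


def archive_folder(date_header, root="INBOX", sep="/"):
    """Map a Date header to a per-day folder name such as INBOX/2010/06/01."""
    # The day is taken in the header's own time zone, not the server's.
    when = parsedate_to_datetime(date_header)
    return sep.join([root, f"{when.year:04d}", f"{when.month:02d}", f"{when.day:02d}"])
```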

    The auditors use Evolution to connect to the archive server and search the emails. It takes a little while to load a day of emails for the first time, but once it's properly loaded, searching is really fast. The server is not that powerful: it's a VM with 2 CPUs and 2GB of RAM. You do need a lot of storage though.

    Hope this helps.

    --
    --Necesito una chela, bien fria...
  17. Re:RETARD MODERATION by Anonymous Coward · · Score: 5, Funny

    Parent is +informative and/or +interesting, not troll. Fucking brain dead moderators these days. Sheesh.

    it suggested a linux solution and made the windows weenies realize how useless their os is. by extension they realized how tiny their penises are and then they finally understood why they like Micro Soft because it describes them perfectly. so they got mad and said "i'll mod it down, yeah, that'll teach them a lesson and make me feel like a real man again!"

  18. Re:Not Much by koiransuklaa · · Score: 3, Informative

    +1

    Notmuch can manage absolutely insane amounts of email without any artificial 'archiving'. Of course, if you are looking for a program that does something other than tagging and searching (like sending, composing or receiving email), you need to look elsewhere.

  19. Re:Psychiatric consultation! by ciderbrew · · Score: 3, Funny

    What do they say?

    June 2001 - "Dave, can't go out tonight. I got a date with that fat chick. YEAH!"
    Sept 2001 - "Dave, She's told me she's pregnant."
    Jan 2002 - "Dave, will you be the best man at the wedding :(".


    Shhhh - Dave's the real father (AC doesn't know)..

  20. DO NOT DELETE. by GuyFawkes · · Score: 5, Insightful

    I can't tell you the number of times I nearly deleted my archived data, going back to 1997 in my case, not just e-mail either.

    Then I got falsely accused of everything except 9-11 as part of a separation / child custody battle that started with a nuclear attack out of the blue.

    It is amazing how much of that old data is relevant in such cases, "He did x on 1st June 2000 at our house!" and you have data showing you were 200 miles away doing something you had completely forgotten, with someone you haven't spoken to or seen for 7 years, at the time...

    DO NOT DELETE YOUR ARCHIVES, EVER!***

    *** unless of course you are a bad person and they incriminate you, in which case you'd better avoid everyone else who archives data.

    --
    http://slashdot.org/~GuyFawkes/journal
    1. Re:DO NOT DELETE. by cervo · · Score: 3, Insightful

      This can also work against you. Most big companies have record retention policies that include when to delete e-mails. Because those same archives that saved you can bite you in the butt. Also in reality you should be innocent until proven guilty anyway, although I know civil court works differently. But if there is anything you did, maybe an e-mail to another woman that can be spun as evidence you had another girlfriend (even if it was a harmless e-mail just saying hi) then it could bite you.

      Plus no one is 100% squeaky clean. Maybe you admitted you were speeding to someone. Maybe you bought porn website memberships (which could be spun as the reason for a break up, or that you are an unfit parent). Maybe you admitted you were a little too drunk to drive but did it anyway. Maybe you ordered a set of army knives and have the receipt and that gets spun as you have weapons all over the place that could endanger the kids....

      Anyway just saying that too many records could bite you too. Especially if someone from court gets an order for all of them. Then they can be pulled out of context and could be very damaging. Even medical issues could be in the e-mail archives from correspondence with doctors, confirmations of appointments, etc... If that data ever got out it could be damaging to buying insurance as well.

  21. Echo chamber... by MrNemesis · · Score: 4, Informative

    ...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).

    El_Muerte_TDS has just pointed me towards mairix, a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.

    Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here.
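
The symlink trick is simple enough to sketch in a few lines of Python. This is only the general idea, not mairix's actual code; `build_result_folder` is an illustrative name, and it assumes the matching message files have already been found by whatever indexer you use.

```python
import os
from pathlib import Path


def build_result_folder(matches, result_dir):
    """Create a Maildir whose cur/ holds symlinks to the matched message files."""
    result = Path(result_dir)
    for sub in ("cur", "new", "tmp"):
        (result / sub).mkdir(parents=True, exist_ok=True)
    # Drop the links left over from the previous search.
    for old in (result / "cur").iterdir():
        if old.is_symlink():
            old.unlink()
    for src in matches:
        src = Path(src).resolve()
        # Maildir file names are meant to be unique, so reuse them for the links.
        os.symlink(src, result / "cur" / src.name)
    return result
```

The originals are never copied or modified, so the results folder can be thrown away and rebuilt at will. Symlinks may need extra privileges on Windows.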

    Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.

    --
    Moderation Total: -1 Troll, +3 Goat
  22. Domino by Belial6 · · Score: 4, Funny

    Yes, it is not free, and yes, this suggestion will bring out the trolls, but you might want to consider Lotus Notes/Domino. It is ~$140 for the system, and ~$40 a year maintenance (Includes all upgrades) cost per user, but IBM isn't going anywhere any time soon.

    It has good full text indexing, you can keep your mail on a client, and on the server, with incredibly flexible replication rules for what is stored where.
    It supports IMAP, so it talks well to most clients.

    The iPhone syncs seamlessly with it via ActiveSync, and an Android client is in beta as we speak.

    It includes an http client, and the http client even offers offline access. That's right. You can use the http client, and still read your mail and write emails that will be sent the next time you make a connection.

    It also has folders, but you can put any email into as many folders as you want, so you have the best of both Outlook folders and Gmail tags.

    It supports auto-processing rules for automatic filing of data, as well as being a full development environment if you want to get really fancy.

    It is brain dead easy to set up and maintain.

    The server runs on Linux and Windows, and the client runs on Linux, Windows and Mac.

  23. just because I can. by socsoc · · Score: 3, Insightful

    just because I can.

    That's a big assumption. You are asking slashdot, so I'm thinking you can't. Especially because imap never occurred to you.

  24. Re:It's obvious - Gmail by flappinbooger · · Score: 4, Insightful

    It's obvious, upload them to gmail!

    (only half kidding)

    --
    Flappinbooger isn't my real name
  25. Re:RETARD MODERATION by halltk1983 · · Score: 5, Insightful

    Virtualbox is platform independent, and he also mentioned using a VM. Once all the email is on the IMAP server in the VM, you could easily attach to it with a client that runs on any platform.

    Also, IMAP servers are platform independent, as they can run on OSX, Windows, Linux, BSD, and almost any other popular OS I can think of. It's just that Linux distros are common, easy to set up, and light enough on resources that they would be easy to set up in a VM, and without the licensing costs of OSX or Windows, it becomes price comparable to lesser solutions.

    I know it's a lot to ask these days to get people to read the comments that they are replying to, but maybe, just maybe, someone complaining about a lack of reading comprehension should take more time to read.

    --
    Watch for Penguins, they eat Apples and throw rocks at Windows.