Slashdot Mirror


Ask Slashdot: Handling and Cleaning Up a Large Personal Email Archive?

First time accepted submitter txoof writes "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size, from a few tens of megs to nearly 500 megs for 2010. The archive is for storage only. It is a mirror of my Gmail account. The archives are both sent and received mail compressed in a hierarchy of weekly, monthly and yearly mbox files. I've chosen mbox for a variety of reasons, but mostly because it is the simplest to implement with fetchmail. After inspecting some of the archives, I've noticed that the larger files are a result of attachments sent by well-meaning family members. Things like baby pictures, wedding pictures, etc. What I would like to do from this point forward is strip out all of the attachments and save only the text of the emails. What would be a sane way to do that using simple tools like fetchmail?"

41 of 167 comments

  1. Why bother? by grumbel · · Score: 4, Insightful

    Storage is cheap and 500MB is hardly worth worrying about. The damage done by reducing that amount will likely be far larger than any temporary benefit you might get. If you want it smaller so that you can search faster, look for a tool that is better at searching and indexing the mail instead of trying to cut the mail into pieces.

    1. Re:Why bother? by AliasMarlowe · · Score: 3, Interesting

      Exactly this, and even if it's a few GB. It's just too small an amount to bother about.

      Agreed. 500MB is trivial, especially if it includes a bunch of large attachments. I just checked my email directory at home, and it's 2.7GB in size. It's on a network drive and Thunderbird accesses it more or less instantly; there is no discernible lag in showing the content of any mail folder, even though the hierarchy of folders is complicated and some folders are large. The network drive is backed up automatically three times a week, so its risk of loss is tolerably low. With modern email clients, the penalty of huge email directories should be tiny.

      --
      Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    2. Re:Why bother? by txoof · · Score: 2

      Storage is cheap, but backing it up to S3 is less cheap. I looked through a bunch of the mail and discovered that what I really wanted to save was the text. The rest is backed up on Google. If I lost it all, it wouldn't be a tragedy, but the mail between my wife and me before we were married and the messages between my family are the things I treasure most, not the photos that I can find on facebook/flickr/gmail/picasa/etc.

      Finding a way to save some space and some bucks is worthwhile for me. After a lot of googling, I eventually landed on a script by Mike Leonetti that did most of the work of stripping MIME attachments for me. I had to tweak it to work with fetchmail and procmail, but I eventually kludged it into working. I'm just testing it out now and hopefully it will do the job. Perhaps others would be interested. You can find a copy here: Stripping Mime Attachments.

      If anyone has a better solution, I'm definitely interested as my Perl fu is pretty weak and this solution is a pretty huge kludge.

      --
      This one's tricky. You have to use imaginary numbers, like eleventeen... --Hobbes
    3. Re:Why bother? by erroneus · · Score: 2

      True-true.

      But on a related note, I have often longed for a "generic email database format" which could be a universal format for all email programs out there in some way. Pretty much a dream which is long overdue... about 10 years past due. Perhaps there is already something like that and it has escaped me all these years, but I seriously hate migrating email from one format to another. Not long ago, I was helping someone recover some old email (Outlook Express) and contacts which were in Japanese and not in UTF-8 format. It turned out that Windows Live Mail didn't do a good job of importing that format/language of email at all, no matter what I did. Fortunately, I was able to access Outlook Express on an old Windows XP VM I had, and it worked out okay by exporting OE to MS Outlook in a PST file.

      Still... it would be nice if there were some universal email archive format which all email programs can use. And you know? The content of email hasn't changed since it was created long ago. Why can't we do something as seemingly obvious as this?

    4. Re:Why bother? by ArundelCastle · · Score: 2

      nearly 500 megs from 2010.

      OP did not specify how much space is being used total, but everyone is taking the 500MB as the main sticking point. *facepalm*

      The point being it will get larger in the future, even if OP never runs the risk of exceeding Gmail's quota.

      Far as I can tell this is a TMI question about fetchmail and attachments. Wish I could help.

    5. Re:Why bother? by houghi · · Score: 3, Interesting

      Why bother indeed. When I look at my mail folders, I try to recall the last time I actually searched my personal mail for something older than one year.

      Mails that I keep are orders I placed and passwords that I requested. All the rest I delete after one year.

      I already do a lot of deleting after reading. E.g. most mailing lists will be deleted almost immediately. Things I keep are bug reports I filed, till they are closed.

      This is something I do in real life as well. If I have not used something in a year and there is no emotional value, I will throw it away. Even though it is technically possible to keep everything, I see no reason to do so.

      --
      Don't fight for your country, if your country does not fight for you.
    6. Re:Why bother? by J4 · · Score: 2

      "snobbishly"

      WTF? It's his own personal email.

    7. Re:Why bother? by grcumb · · Score: 5, Insightful

      I have often longed for a "generic email database format" which could be a universal format for all email programs out there in some way. Pretty much a dream which is long overdue... about 10 years past due. Perhaps there is already something like that and it has escaped me all these years, but I seriously hate migrating email from one format to another.

      Take a look at Maildir. It's not perfect, but it is generic, simple and easily transferred from one location to another.

      RANT: Over the course of my (far too many) years of working in technology, I've often been amazed just how enamoured everyone is with databases. There are some things that databases do well, granted, but just because something needs an index doesn't mean it needs a relational database. /RANT.

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    8. Re:Why bother? by grumbel · · Score: 4, Insightful

      My advice is to keep your archives, but take the time to filter out the stuff you really don't need or want any more.

      The problem with that is that it's extremely hard to judge what you will find valuable 20 years down the road.

      Simple example: Old TV recordings on VHS. I have all of Star Trek: TNG on VHS, labeled, sorted, with the commercials cut out. All nice and dandy you might think.

      You know which part I would love to rewatch? Now, some 15 years later? The commercials, exactly that part which I deleted. All the episodes I can get easily on DVD or on BluRay without problems, with higher quality and everything, but the stuff between the episodes? Nope, that's not available. Here and there a bit of stuff shows up on Youtube, but raw uncut TV from 15 years ago simply isn't easily available.

      There will also be obsolete software, video and flash attachments that were funny five years ago, and other junk.

      Yeah, and exactly that stuff might turn out to be extremely valuable years down the line, as your copy of it might be the only copy left or at least the only copy accessible to you.

      I have absolutely nothing against sorting, indexing and organizing the data, I quite welcome that, but that should be done as a layer on top of the data, not by hacking and slashing the original data itself.

    9. Re:Why bother? by batkiwi · · Score: 2

      Maildir is exactly what you say, a generic email database.

      It's not a relational database, as email isn't really relational in nature, but solves most/all of the problems you need to solve around storing emails. The only big "miss" in maildir is that attachments are stored inside the main message, making pass-through deduplication difficult/impossible.

      (many storage devices now can auto deduplicate files that are identical, so if you get the same image in 15 different emails due to reply-to-all etc you only store the image once).
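      To get a feel for how much of an archive is the same attachment repeated, here is a rough Python sketch that hashes every attachment in an mbox file and reports content that appears more than once. The function name `duplicate_attachments` and the rule "a part with a filename is an attachment" are assumptions made for illustration; nothing here is provided by Maildir or by a deduplicating storage device.

```python
import hashlib
import mailbox
from collections import defaultdict


def duplicate_attachments(mbox_path):
    """Map SHA-256 digest -> list of (message key, filename) for repeated attachments."""
    seen = defaultdict(list)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            for part in msg.walk():
                filename = part.get_filename()
                if filename is None:
                    continue  # assumption: parts without a filename are body text
                data = part.get_payload(decode=True)
                if data is None:
                    continue  # multipart containers carry no bytes of their own
                digest = hashlib.sha256(data).hexdigest()
                seen[digest].append((key, filename))
    finally:
        box.close()
    # Keep only content that shows up more than once.
    return {digest: refs for digest, refs in seen.items() if len(refs) > 1}
```

      Hashing the decoded bytes rather than the encoded text means the same image is recognised even when two senders wrapped it differently.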

      I think email clients (and servers for imap-searching) should keep a relational attribute-based index of emails (so that you can instantly pull up all emails from "bob," or all emails on oct 31), but that's an internal implementation note and not the actual mail store.
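      As a rough illustration of such an attribute index kept beside the mail store, the Python sketch below reads an mbox file and fills a small SQLite table keyed by sender and by calendar day. The table layout, the `build_index` name and the "MM-DD" day format are all assumptions for the example, not how any particular client or IMAP server does it.

```python
import mailbox
import sqlite3
from email.utils import parseaddr, parsedate_to_datetime


def build_index(mbox_path, db_path=":memory:"):
    """Index sender address, day of year (MM-DD) and subject for each message."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail "
        "(key INTEGER, sender TEXT, day TEXT, subject TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS mail_sender ON mail (sender)")
    conn.execute("CREATE INDEX IF NOT EXISTS mail_day ON mail (day)")
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            sender = parseaddr(str(msg.get("From", "")))[1].lower()
            try:
                date = parsedate_to_datetime(str(msg.get("Date")))
                day = date.strftime("%m-%d")
            except (TypeError, ValueError):
                day = None  # missing or unparseable Date header
            subject = str(msg.get("Subject", ""))
            conn.execute(
                "INSERT INTO mail VALUES (?, ?, ?, ?)", (key, sender, day, subject)
            )
    finally:
        box.close()
    conn.commit()
    return conn
```

      The mail itself stays untouched in the mbox; the table only stores the message key, so it can be thrown away and rebuilt at any time.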

    10. Re:Why bother? by icebike · · Score: 4, Interesting

      Ask yourself: When are you ever going to read all those emails again? When is *anybody* ever going to read them again?

      As soon as:
      1) you divorce
      2) you get arrested for ANYTHING
      3) They arrive with a search warrant for any reason
      4) You sue or are sued
      5) You run for office
      6) You get hacked

      Seriously, I keep VERY little historical Email. Very little.
      I am not so vain that I believe there is any historical significance, and have never needed to go back more than a couple months for anything.

      Just delete it. It's safer that way.

      --
      Sig Battery depleted. Reverting to safe mode.
    11. Re:Why bother? by Anonymous Coward · · Score: 3, Informative

      That's easy. (Old school) Eudora uses the mbx format, but separates the attachments from the mails.

    12. Re:Why bother? by pla · · Score: 2

      WTF? It's his own personal email.

      Poor choice of words, perhaps, but I completely understand the sentiment. I've had some form of email since around 1991, and despite my OCD-like "completionist" tendencies, I never thought to archive it all until sometime around 2003.

      Now, considering the tiny actual disk space those early emails would have taken, I sorely regret my earlier habit of read-respond-delete.

      These days, I delete spam (and some large attachments), and nothing else. And some day, I'll probably regret deleting even the spam... But since I get around a 10:1 ratio of spam, I can't realistically keep it all.

    13. Re:Why bother? by simcop2387 · · Score: 3, Interesting

      I'm at about 12GB myself, and that's one of the two big reasons that I keep the mail in maildir format and connect all clients to it via imap. Using a real mail server has kept that from happening to me (again) for years now. The other reason is that it makes it really easy to change clients to play around, or access it from lots of places.

    14. Re:Why bother? by pla · · Score: 2

      That's true, but like the person posting the article it would be nice to have a convenient way to strip out all the attachments and have a "text only" archive with the attached images and other files stripped out (small and quickly searchable), and a "content/media only" archive of the attachments in the form of plain files rather than encoded within e-mail messages.

      Search for everything. Sort by attachment. Select all (that have attachments). Save attachment(s). Delete attachment(s). Done.
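      For an mbox archive without a GUI client, the same save-then-delete pass might look roughly like the Python sketch below. It writes a text-only copy of the archive and drops each attachment into a directory as a plain file. The names `split_attachments` and `attach_dir` are made up for the example, only top-level MIME parts are examined, and any part that carries a filename is assumed to be an attachment.

```python
import mailbox
import os


def split_attachments(src_path, dst_path, attach_dir):
    """Write a text-only copy of src_path to dst_path; save attachments as files.

    Returns the number of attachments saved. The source mbox is not modified.
    """
    os.makedirs(attach_dir, exist_ok=True)
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    saved = 0
    try:
        for key, msg in src.iteritems():
            if msg.is_multipart():
                kept = []
                for part in msg.get_payload():
                    name = part.get_filename()
                    if name is None:
                        kept.append(part)  # assumption: unnamed parts are body text
                        continue
                    data = part.get_payload(decode=True) or b""
                    # Prefix with the message key so equal names do not collide.
                    out_name = "%s-%s" % (key, os.path.basename(name))
                    with open(os.path.join(attach_dir, out_name), "wb") as fh:
                        fh.write(data)
                    saved += 1
                msg.set_payload(kept)  # only changes the in-memory copy
            dst.add(msg)
        dst.flush()
    finally:
        src.close()
        dst.close()
    return saved
```

      Because the original file is left alone, the result can be checked before anything is deleted, e.g. `split_attachments("2010.mbox", "2010-text.mbox", "2010-files")` with placeholder file names.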

    15. Re:Why bother? by wvmarle · · Score: 2

      I have some 12 GB of mail, mainly business related, lots of attachments, dating some 8 years back.

      Quite regularly (once a month or so) I am looking for some e-mail that I received a good half year to a year ago, to look up some detail about an old deal or offer.

      And sometimes I have to look up something that's a bit older than that. Two, three times so far I have been searching through e-mails that dated five, six years back, pretty much the beginning of the archive then. And that usually also had to do with the attachments.

      It is totally unpredictable what you could need in the future, which is why I don't delete any of it. Well, that is except the sent folder which I now and then trim (though the last time I did that is also two years ago or so by now), as almost everything that I send out comes back to me as quotes in a reply mail. And that's enough for me.

    16. Re:Why bother? by dbIII · · Score: 2

      7) New girlfriend
      "Why did you never send me emails like those ones you sent her?" she asked.
      "You didn't even have an email address before you moved in" just wasn't a good enough excuse.

  2. Re:Isn't there a way... by BitHive · · Score: 4, Interesting

    You have. Thunderbird includes archival folders and a Lucene search engine.

  3. Re:Isn't there a way... by zmughal · · Score: 5, Informative

    There is DBMail.

  4. Why keep it? by Anonymous Coward · · Score: 2, Insightful

    If you're not following Sarbanes-Oxley, just delete it. Fuck the pack-rat mentality.

  5. Re:500 Mb only? by optimism · · Score: 5, Insightful

    Many people have a larger email store than you.

    It is not a sign of status.

    More likely, it is a sign of your incompetence to filter and save relevant data.

    Congratulations.

    Now back to the OP, who perhaps is smarter than you, since he has just 500MB of email to back up.

  6. IMAP by spinkham · · Score: 2

    IMAP is another potential answer.

    I run Dovecot locally, and it stores every mail I've ever received, indexed for quick searches.

    This way I can get my mail with all history and a fast search index on all my devices also.

    --
    Blessed are the pessimists, for they have made backups.
  7. Read then purge ... by MacTO · · Score: 5, Insightful

    There is probably some email that you need to keep, but chances are that you don't need to keep most of your email. So just read, respond, then purge (when appropriate).

    As others have pointed out, disk space isn't really a concern in this day and age. But managing data that you don't need is a concern. A minute spent filing, backing up, etc. of unnecessary data is a minute wasted. Add enough of those minutes together, and it may amount to a good chunk of your life that could have been spent doing more interesting/productive things.

    As a side note, I notice that people sometimes get attached to things that don't really matter to them. I've known people who have lost all of their data due to circumstances beyond their control, then they became very distressed about that loss of data. The problem is that only a tiny fraction of that data was actually valuable, but they were worrying about all of the data. In some cases it was so traumatic to them that they spent more time worrying about the irrelevant stuff than the stuff that they would need to continue on in the future. So if you don't keep the irrelevant stuff, you can focus on what is relevant.

    1. Re:Read then purge ... by vadim_t · · Score: 3, Insightful

      It creeps me out how young geeks hand out all their personal data to the first free provider they happen to come across.

      Yeah, it's a bit of a pain sometimes, but the benefit of having the data where I want it, dealt with how I want it, outweighs the cost IMO. It also makes for good system administration practice if you have an interest in that kind of thing.

  8. What's the motivation? by Just+Brew+It! · · Score: 3, Insightful

    Even at today's post-Thailand-flood inflated hard drive prices, your entire e-mail history occupies less than a dollar's worth of disk space. I fail to see the issue.

  9. mutt, ruby libraries by subreality · · Score: 2

    For my own mail archives I just use mutt and weed things a bit by hand. I find that 90% of the mbox size is in fewer than a dozen attachments, so I can hand-filter those out in ten minutes once a year. Beyond that disk is too cheap to care and time is too valuable to make a really comprehensive solution. So what I do:

    'mutt -f archive.mbox'
    ':set pager_index_lines=6' (Lets you see the message index split above the body)
    'o' (Order), 'z' (siZe), End (last entry), Enter (Open).
    while(mbox.size > acceptable_size)
    {
            'v' (View attachments)
            'jjj' (down a few times to the attachment I want to nuke)
            'd' (Delete)
            while(more attachments) { 'd' (Delete more attachments) }
            'q' (Quit back to the message view)
            'k' (previous message)
    }
    'q' (Quit back to index)
    '$' (Sync changes to disk)
    'q' (Quit mutt)

    Note the 'j' and 'k' are vi-style up/down. The arrow keys work too if you're not a home row junkie like me.

    I don't know a good fully automated way to do this that's ready to slice it right out of the box. If you want to roll your own, just pick up a library like RMail or TMail for Ruby, or the equivalent for the language you prefer. That's 80% of the work done, but you'll still probably find a dozen corner cases involving oddly named HTML alternatives that look like binary attachments, or terribly malformed spam.
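    If Python is the language you prefer, its standard `mailbox` module can at least do the "sort by size" step of the mutt session above without any hand work. This is only a sketch of that one step, with a made-up function name (`largest_messages`); it finds the candidates and leaves the actual deleting to you.

```python
import mailbox


def largest_messages(mbox_path, count=12):
    """Return up to `count` (size_in_bytes, key, subject) tuples, biggest first."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        sizes = []
        for key in box.iterkeys():
            # get_bytes returns the raw stored message, so its length is
            # roughly what the message costs on disk.
            size = len(box.get_bytes(key))
            subject = str(box[key].get("Subject", "(no subject)"))
            sizes.append((size, key, subject))
    finally:
        box.close()
    sizes.sort(reverse=True)
    return sizes[:count]
```

    The default of twelve matches the "fewer than a dozen attachments" observation; if most of the bulk really sits in a handful of messages, the top of this list is where to look first.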

  10. Re:Isn't there a way... by icebraining · · Score: 2

    Sup uses Xapian, it's pretty fast too.

  11. Delete it! by Lazy+Jones · · Score: 2

    Google keeps a permanent copy anyway...

    --
    "I love my job, but I hate talking to people like you" (Freddie Mercury)
  12. Re:500 Mb only? by icebraining · · Score: 4, Insightful

    Who still uses e-mail?

    People who get stuff done instead of being interrupted every 5m? And who want to receive messages even while offline? And have decent systems for archiving, tagging and searching them?

  13. I have all email going back to 1980... by neurocutie · · Score: 2

    back to ARPA mail and UUCP mail days...

    for a while I used Eudora and every month religiously took each piece of email and filed it away in suitable mail folders. After Eudora started declining and I got too busy, I stopped that, but even now, religiously every month I clean out my mailbox of all junk and unwanted attachments (trimming 60-100MB to usually 20-30MB) and then stack that month's email away as a single mbox file, and start fresh with a new Inbox.

    the old mailbox files are on an IMAP server, from which I can easily read emails from at least 10 years ago -- older with a little more effort. As each is a single mbox file, I can also run greps on them. Seems to be an okay way to keep the stuff, some of which has proven to be important over the years....

    another big help: all semi-junky and non business emails I let Hotmail do the work (vendor stuff, Amazon orders, etc). Have been using Hotmail since before MS bought it. Works well as a place to direct mostly junky vendor stuff.

  14. Procmail by massysett · · Score: 4, Funny

    Google for "procmail remove attachments":
    http://osdir.com/ml/mail.procmail/2002-11/msg00091.html

    That will get you started. You can do most anything with Procmail after you figure out the rather odd configuration file format.
    Make sure you have it backed up first because it's also quite easy to destroy data with Procmail.
    After you spend a lot of time futzing with Procmail scripts and sed and formail and the like, you'll wonder why you didn't go on Amazon or Newegg and buy a $10 flash drive that will hold all your mail several times over.

  15. 500mb? by nurb432 · · Score: 2

    Amateur. When you get to 8+ gb then we can talk about 'large archive'. Until then, just stick it on a CD.. you don't even need a DVD for that.

    --
    ---- Booth was a patriot ----
  16. Re:500 Mb only? by bmo · · Score: 2

    Why delete when disk space even today is 14 cents in "Salesman Gigabytes"?

    Someone back there said he has 16 GB of mail going back over a decade. That's what, Two Bucks And A Quarter?

    It's less than a cup of coffee at Starbucks or even a Large at Dunkin' Donuts. It is fully irrational to worry about this.

    Anyone worrying about personal mbox size has OCD issues. Full stop.

    --
    BMO

  17. Email archiver by Dupple · · Score: 2

    Enables you to save everything offline as a PDF. Personally I don't get the question or see the point. My archive is about 6 Gig, all backed up, all searchable. Anyway, the company that makes the software is www.spotdocuments.com. Just back up.

    --
    Watch those corners
  18. Something Like This? by pscottdv · · Score: 5, Informative

    We all think you're crazy, but here it is:

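    #!/usr/bin/env python3
    # Copy an mbox, keeping only the first MIME part of each multipart message.
    from mailbox import mbox, mboxMessage

    orig_mb = mbox("path/to/orig/mbox")  # placeholder path: source archive
    new_mb = mbox("path/to/new/mbox")    # placeholder path: stripped copy

    for key, msg in orig_mb.iteritems():
        # The first part is usually the text body; it brings its own
        # Content-Type and Content-Transfer-Encoding headers along.
        part = msg.get_payload(0) if msg.is_multipart() else msg
        new_msg = mboxMessage(part)
        new_msg.set_from(msg.get_from())
        part_headers = {name.lower() for name in part.keys()}
        for name, value in msg.items():
            lowered = name.lower()
            # Skip the container's Content-* headers; they describe the
            # multipart wrapper that is being thrown away.
            if lowered not in part_headers and not lowered.startswith("content-"):
                new_msg[name] = value
        new_mb.add(new_msg)
    new_mb.flush()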

    --

    this signature has been removed due to a DMCA takedown notice

  19. Meta-Facepalm! by Anonymous Coward · · Score: 3, Insightful

    FTS: "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010."

    So, the total space required thus far is definitely less than (8 * 0.5 GB) = 4 GB. A USB flash drive with that small a capacity is practically classified as electronic waste these days.

    Even if his or her annual e-mail archive size doubled every year for the next 10 years, it would only be 1+2+4+8+16+32+64+128+256+512=1023 GB.
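    For anyone who wants to plug in their own starting size or growth rate, the sum above is easy to reproduce. A small Python sketch (the function name is made up, and the 1 GB first year and yearly doubling are simply the figures from the paragraph above):

```python
def cumulative_archive_gb(first_year_gb, years, growth=2.0):
    """Total size after `years` years when each year's archive is `growth` times the last."""
    total = 0.0
    yearly = first_year_gb
    for _ in range(years):
        total += yearly
        yearly *= growth
    return total


print(cumulative_archive_gb(1, 10))  # 1 + 2 + ... + 512 -> 1023.0
```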

    A 3 TB hard drive he buys *today* for $100 would probably solve his "problem" for 10 more years.

    Hopefully, in the year 2021, we will have tiny 3 PB SSD drives for $100... But maybe we will be ruled by an A.I. by that time, if we haven't already destroyed ourselves with viruses, nanomachines, robots, nuclear weapons, etc.

  20. Re:Isn't there a way... by cras · · Score: 2

    Email isn't stored in SQL, because typically it's rather pointless. Full text search indexing doesn't require SQL, and it's more efficient without SQL anyway. There are some good use cases for storing emails in SQL database, but efficiency isn't one of them.

  21. Re:I have all email going back to October 2000. by tomtomtom · · Score: 2

    2000? Hell, I have my email back to the early 1980s.

    The real problem is that back then it was OK to put all messages in one file, and having one message per file is far more useful for searching with grep.

    Actually I find this less of an issue. Check out grepmail and mboxgrep. I use these pretty regularly and they're very useful for doing e.g. grepmail 'foo.*bar[a-z]' ~/Mail/mbox.gz >/tmp/messages; mutt -f /tmp/messages
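    If neither tool is installed, roughly the same search can be sketched with Python's standard library. This is not how grepmail or mboxgrep work internally; it is a plain regex match against each raw message, so base64-encoded bodies will not match. The function name is invented, and reopening the temporary file by name is assumed to work as it does on Linux and other POSIX systems.

```python
import gzip
import mailbox
import re
import shutil
import tempfile


def grep_mbox_gz(gz_path, pattern):
    """Return the messages in a gzip-compressed mbox whose raw text matches `pattern`."""
    regex = re.compile(pattern.encode("utf-8"))
    hits = []
    # mailbox.mbox needs a real file, so decompress to a temporary one first.
    with tempfile.NamedTemporaryFile(suffix=".mbox") as tmp:
        with gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
        tmp.flush()
        box = mailbox.mbox(tmp.name, create=False)
        try:
            for key in box.iterkeys():
                if regex.search(box.get_bytes(key)):
                    hits.append(box[key])
        finally:
            box.close()
    return hits
```

    Something like `grep_mbox_gz("mbox.gz", r"foo.*bar[a-z]")` (placeholder file name) then returns message objects whose headers can be printed or which can be written to a new mbox for mutt.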

  22. Re:I want that with a GUI access app by icebraining · · Score: 2

    You're in good luck! The same author is developing a similar system in a client/server model, with the server (called Heliotrope) doing the actual work. You just need to write the Linux and Android clients. A client already exists, but it's ncurses based.

  23. findbigmail.com by matty619 · · Score: 2

    Is a project a friend of mine started. Interfaces w/ gmail's API, quite easy to use.

    www.findbigmail.com

  24. There is value in personal history by erice · · Score: 2

    Ask yourself: When are you ever going to read all those emails again? When is *anybody* ever going to read them again?

    1) I needed to order ink recently. Now, I don't print much but I vaguely remembered a good supplier that I had used in the past. But what was it called? A few moments of grepping and I found it: in a confirmation email from three years ago.

    2) I met a woman on a Meetup hike recently whom I seemed to have met before. Was this the blind date from four years ago? The smoking gun was in an email from 2007.

    3) I've had occasional need to look up old acquaintances. While I might have created a contact file at various points, odds are I have forgotten what I named it or where I put it. But I am quite sure the information is in email.

    The real treasure is email that is ten or more years old. You think you remember what happened in the right order? Trust me. You don't. An email archive is like a diary except it is less work and more complete.