Slashdot Mirror


Ask Slashdot: Handling and Cleaning Up a Large Personal Email Archive?

First time accepted submitter txoof writes "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010. The archive is for storage only. It is a mirror of my Gmail account. The archives are both sent and received mail compressed in a hierarchy of weekly, monthly and yearly mbox files. I've chosen mbox for a variety of reasons, but mostly because it is the simplest to implement with fetchmail. After inspecting some of the archives, I've noticed that the larger files are a result of attachments sent by well-meaning family members: things like baby pictures, wedding pictures, etc. What I would like to do from this point forward is strip out all of the attachments and only save the text of the emails. What would be a sane way to do that using simple tools like fetchmail?"

18 of 167 comments

  1. Why bother? by grumbel · · Score: 4, Insightful

    Storage is cheap and 500MB is hardly worth worrying about. The damage done by reducing that amount will likely be far larger than any temporary benefits you might get. If you want it smaller so that searching is faster, look for a tool that is better at searching and indexing the mails instead of trying to cut the mail into pieces.

    1. Re:Why bother? by AliasMarlowe · · Score: 3, Interesting

      Exactly this, and even if it's a few GB. It's just too small an amount to bother about.

      Agreed. 500MB is trivial, especially if it includes a bunch of large attachments. I just checked my email directory at home, and it's 2.7GB in size. It's on a network drive and Thunderbird accesses it more-or-less instantly; there is no discernible lag in showing the content of any mail folder - the hierarchy of folders is complicated, but some folders are large. The network drive is backed-up automatically three times a week, so its risk of loss is tolerably low. With modern email clients, the penalty of huge email directories should be tiny.

      --
      Those who can make you believe absurdities can make you commit atrocities. - Voltaire
    2. Re:Why bother? by houghi · · Score: 3, Interesting

      Why bother indeed. When I look at my mail folders, I try to think of the last time I actually searched my personal mail for something older than one year.

      Mails that I keep are orders I placed and passwords that I requested. All the rest I delete after one year.

      I already do a lot of deleting after reading, e.g. most mailing lists will be deleted almost immediately. Things I keep are bug reports I filed, till they are closed.

      This is something I do in real life as well. If I have not used something in a year and there is no emotional value, I will throw it away. Even though it is technically possible to keep everything, I see no reason to do so.

      --
      Don't fight for your country, if your country does not fight for you.
    3. Re:Why bother? by grcumb · · Score: 5, Insightful

      I have often longed for a "generic email database format" which could be a universal format for all email programs out there in some way. Pretty much a dream which is long overdue... about 10 years past due. Perhaps there is already something like that and it has escaped me all these years, but I seriously hate migrating email from one format to another.

      Take a look at Maildir. It's not perfect, but it is generic, simple and easily transferred from one location to another.
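
      For anyone who wants to try Maildir on an existing mbox archive, Python's standard mailbox module can do the copy. A minimal sketch, assuming the archive is a single uncompressed mbox file; the function name mbox_to_maildir and both paths are made up for illustration:

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)       # fail if the mbox is missing
    dst = mailbox.Maildir(maildir_path, create=True)  # creates cur/, new/, tmp/
    copied = 0
    try:
        for message in src:
            # Converting to MaildirMessage carries over flags such as "read".
            dst.add(mailbox.MaildirMessage(message))
            copied += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return copied


if __name__ == "__main__":
    # Hypothetical paths; replace with your own.
    print(mbox_to_maildir("archive.mbox", "Maildir"))
```

      The original mbox is only read, never modified, so the conversion can be repeated or thrown away without risk to the archive.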

      RANT: Over the course of my (far too many) years of working in technology, I've often been amazed just how enamoured everyone is with databases. There are some things that databases do well, granted, but just because something needs an index doesn't mean it needs a relational database. /RANT.

      --
      Crumb's Corollary: Never bring a knife to a bun fight.
    4. Re:Why bother? by grumbel · · Score: 4, Insightful

      My advice is to keep your archives, but take the time to filter out the stuff you really don't need or want any more.

      The problem with that is that it's extremely hard to judge what you will find valuable 20 years down the road.

      Simple example: Old TV recordings on VHS. I have all of Star Trek: TNG on VHS, labeled, sorted, with the commercials cut out. All nice and dandy you might think.

      You know which part I would love to rewatch? Now, some 15 years later? The commercials, exactly that part which I deleted. All the episodes I can get easily on DVD or on BluRay without problems, with higher quality and everything, but the stuff between the episodes? Nope, that's not available. Here and there a bit of stuff shows up on Youtube, but raw uncut TV from 15 years ago simply isn't easily available.

      There will also be obsolete software, video and flash attachments that were funny five years ago, and other junk.

      Yeah, and exactly that stuff might turn out to be extremely valuable years down the line, as your copy of it might be the only copy left or at least the only copy accessible to you.

      I have absolutely nothing against sorting, indexing and organizing the data, I quite welcome that, but that should be done as a layer on top of the data, not by hacking and slashing the original data itself.
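
      Such a layer on top can be as simple as a small SQLite index built from the mbox, leaving the mbox itself untouched. A rough sketch, assuming Python's standard mailbox and sqlite3 modules; build_index, the table layout, and the "has a filename" test for attachments are illustrative choices, not an established tool:

```python
import mailbox
import sqlite3


def build_index(mbox_path, db_path):
    """Index an mbox into SQLite without modifying the mbox itself."""
    box = mailbox.mbox(mbox_path, create=False)
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "key INTEGER PRIMARY KEY, sender TEXT, subject TEXT, "
        "date TEXT, attachments INTEGER)"
    )
    try:
        for key, msg in box.items():
            # Count MIME parts that carry a filename, i.e. likely attachments.
            attachments = sum(1 for part in msg.walk() if part.get_filename())
            db.execute(
                "INSERT OR REPLACE INTO mail VALUES (?, ?, ?, ?, ?)",
                (
                    key,
                    str(msg.get("From", "")),
                    str(msg.get("Subject", "")),
                    str(msg.get("Date", "")),
                    attachments,
                ),
            )
        db.commit()
    finally:
        box.close()
    return db
```

      A query such as "SELECT subject FROM mail WHERE attachments > 0" then finds the heavy mails, and the index can be deleted and rebuilt at any time because the original data was never changed.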

    5. Re:Why bother? by icebike · · Score: 4, Interesting

      Ask yourself: When are you ever going to read all those emails again? When is *anybody* ever going to read them again?

      As soon as:
      1) you divorce
      2) you get arrested for ANYTHING
      3) They arrive with a search warrant for any reason
      4) You sue or are sued
      5) You run for office
      6) You get hacked

      Seriously, I keep VERY little historical Email. Very little.
      I am not so vain that I believe there is any historical significance, and have never needed to go back more than a couple months for anything.

      Just delete it. It's safer that way.

      --
      Sig Battery depleted. Reverting to safe mode.
    6. Re:Why bother? by Anonymous Coward · · Score: 3, Informative

      That's easy. (Old school) Eudora uses the mbx format, but separates the attachments from the mails.
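
      The same separation can be approximated for an mbox archive by copying the attachments out to a directory. A sketch only, assuming Python's standard mailbox module; detach_attachments and its file naming scheme are hypothetical, and this does not reproduce Eudora's own format:

```python
import mailbox
import os


def detach_attachments(mbox_path, out_dir):
    """Write each attachment found in an mbox to out_dir; return saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.items():
            for index, part in enumerate(msg.walk()):
                filename = part.get_filename()
                if not filename:
                    continue  # no filename given, so not treated as an attachment
                data = part.get_payload(decode=True)  # bytes, transfer-decoded
                if data is None:
                    continue  # multipart containers have no payload of their own
                # Prefix with message key and part index to avoid name clashes;
                # basename() drops any directory components in the name.
                safe = "%d-%d-%s" % (key, index, os.path.basename(filename))
                path = os.path.join(out_dir, safe)
                with open(path, "wb") as handle:
                    handle.write(data)
                saved.append(path)
    finally:
        box.close()
    return saved
```

      The mbox is only read here; removing the attachments from the mails themselves is a separate step.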

    7. Re:Why bother? by simcop2387 · · Score: 3, Interesting

      I'm at about 12GB myself, and that's one of the two big reasons that I keep the mail in maildir format and connect all clients to it via imap. Using a real mail server has kept that from happening to me (again) for years now. The other reason is that it makes it really easy to change clients to play around, or access it from lots of places.

  2. Re:Isn't there a way... by BitHive · · Score: 4, Interesting

    You have. Thunderbird includes archival folders and a Lucene search engine.

  3. Re:Isn't there a way... by zmughal · · Score: 5, Informative

    There is DBMail.

  4. Re:500 Mb only? by optimism · · Score: 5, Insightful

    Many people have a larger email store than you.

    It is not a sign of status.

    More likely, it is a sign of your incompetence to filter and save relevant data.

    Congratulations.

    Now back to the OP, who perhaps is smarter than you, since he has just 500MB of email to back up.

  5. Read then purge ... by MacTO · · Score: 5, Insightful

    There is probably some email that you need to keep, but chances are that you don't need to keep most of your email. So just read, respond, then purge (when appropriate).

    As others have pointed out, disk space isn't really a concern in this day and age. But managing data that you don't need is a concern. A minute spent filing, backing up, etc. of unnecessary data is a minute wasted. Add enough of those minutes together, and it may amount to a good chunk of your life that could have been spent doing more interesting/productive things.

    As a side note, I notice that people sometimes get attached to things that don't really matter to them. I've known people who have lost all of their data due to circumstances beyond their control, then they became very distressed about that loss of data. The problem is that only a tiny fraction of that data was actually valuable, but they were worrying about all of the data. In some cases it was so traumatic to them that they spent more time worrying about the irrelevant stuff than the stuff that they would need to continue on in the future. So if you don't keep the irrelevant stuff, you can focus on what is relevant.

    1. Re:Read then purge ... by vadim_t · · Score: 3, Insightful

      It creeps me out how young geeks hand out all their personal data to the first free provider they happen to come across.

      Yeah, it's a bit of a pain sometimes, but the benefit of having the data where I want it, dealt with how I want it, outweighs the cost IMO. It also makes for good system administration practice if you have an interest in that kind of thing.

  6. What's the motivation? by Just+Brew+It! · · Score: 3, Insightful

    Even at today's post-Thailand-flood inflated hard drive prices, your entire e-mail history occupies less than a dollar's worth of disk space. I fail to see the issue.

  7. Re:500 Mb only? by icebraining · · Score: 4, Insightful

    Who still uses e-mail?

    People who get stuff done instead of being interrupted every 5m? And who want to receive messages even while offline? And have decent systems for archiving, tagging and searching them?

  8. Procmail by massysett · · Score: 4, Funny

    Google for "procmail remove attachments":
    http://osdir.com/ml/mail.procmail/2002-11/msg00091.html

    That will get you started. You can do most anything with Procmail after you figure out the rather odd configuration file format.
    Make sure you have it backed up first because it's also quite easy to destroy data with Procmail.
    After you spend a lot of time futzing with Procmail scripts and sed and formail and the like, you'll wonder why you didn't go on Amazon or Newegg and buy a $10 flash drive that will hold all your mail several times over.

  9. Something Like This? by pscottdv · · Score: 5, Informative

    We all think you're crazy, but here it is:

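    #!/usr/bin/env python3
    # Copy an mbox, keeping only the first text/plain part of multipart mails.
    from mailbox import mbox, mboxMessage

    orig_mb = mbox('path/to/orig/mbox')  # placeholder: the existing archive
    new_mb = mbox('path/to/new/mbox')    # placeholder: the stripped copy

    # Headers that describe the body; take these from the part that is kept.
    BODY_HEADERS = ('content-type', 'content-transfer-encoding')

    for msg in orig_mb:
        part = msg
        if msg.is_multipart():
            # First text/plain part anywhere in the MIME tree, or None.
            part = next((p for p in msg.walk()
                         if p.get_content_type() == 'text/plain'), None)
        new_msg = mboxMessage()
        new_msg.set_from(msg.get_from())
        # items() keeps repeated headers such as Received intact.
        for header, value in msg.items():
            if header.lower() not in BODY_HEADERS:
                new_msg[header] = value
        if part is not None:
            for header in BODY_HEADERS:
                if part[header] is not None:
                    new_msg[header] = part[header]
        new_msg.set_payload('' if part is None else part.get_payload())
        new_mb.add(new_msg)
    new_mb.flush()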

    --

    this signature has been removed due to a DMCA takedown notice

  10. Meta-Facepalm! by Anonymous Coward · · Score: 3, Insightful

    FTS: "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010."

    So, the total space required thus far is definitely less than (8 * 0.5 GB) = 4 GB. A USB flash drive with that small a capacity is practically classified as electronic waste these days.

    Even if his or her annual e-mail archive size doubled every year for the next 10 years, it would only be 1+2+4+8+16+32+64+128+256+512=1023 GB.
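
    The sum is easy to check. A small sketch, assuming (as above) 1 GB in the first year and a doubling every year after that; the function name is made up for illustration:

```python
def cumulative_archive_gb(first_year_gb, years):
    """Return per-year sizes and their total when the size doubles each year."""
    sizes = [first_year_gb * 2 ** n for n in range(years)]
    return sizes, sum(sizes)


sizes, total = cumulative_archive_gb(1, 10)
print(sizes[-1])  # 512, the size of the tenth year's archive
print(total)      # 1023, which equals 2**10 - 1
```

    The total of a doubling series is always one unit short of the next doubling, which is why ten years of growth still comes in just under 1 TB.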

    A 3 TB hard drive he buys *today* for $100 would probably solve his "problem" for 10 more years.

    Hopefully, in the year 2021, we will have tiny 3 PB SSD drives for $100... But maybe we will be ruled by an A.I. by that time, if we haven't already destroyed ourselves with viruses, nanomachines, robots, nuclear weapons, etc.