Slashdot Mirror


Ask Slashdot: Handling and Cleaning Up a Large Personal Email Archive?

First time accepted submitter txoof writes "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010. The archive is for storage only. It is a mirror of my Gmail account. The archives are both sent and received mail compressed in a hierarchy of weekly, monthly and yearly mbox files. I've chosen mbox for a variety of reasons, but mostly because it is the simplest to implement with fetchmail. After inspecting some of the archives, I've noticed that the larger files are a result of attachments sent by well-meaning family members. Things like baby pictures, wedding pictures, etc. What I would like to do from this point forward is strip out all of the attachments and only save the texts of the emails. What would be a sane way to do that using simple tools like fetchmail?"

5 of 167 comments

  1. Re:Isn't there a way... by zmughal · · Score: 5, Informative

    There is DBMail.

  2. Re:500 Mb only? by optimism · · Score: 5, Insightful

    Many people have a larger email store than you.

    It is not a sign of status.

    More likely, it is a sign of your incompetence to filter and save relevant data.

    Congratulations.

    Now back to the OP, who perhaps is smarter than you, since he has just 500MB of email to back up.

  3. Read then purge ... by MacTO · · Score: 5, Insightful

    There is probably some email that you need to keep, but chances are that you don't need to keep most of your email. So just read, respond, then purge (when appropriate).

    As others have pointed out, disk space isn't really a concern in this day and age. But managing data that you don't need is a concern. A minute spent filing, backing up, etc. of unnecessary data is a minute wasted. Add enough of those minutes together, and it may amount to a good chunk of your life that could have been spent doing more interesting/productive things.

    As a side note, I notice that people sometimes get attached to things that don't really matter to them. I've known people who have lost all of their data due to circumstances beyond their control, then they became very distressed about that loss of data. The problem is that only a tiny fraction of that data was actually valuable, but they were worrying about all of the data. In some cases it was so traumatic to them that they spent more time worrying about the irrelevant stuff than the stuff that they would need to continue on in the future. So if you don't keep the irrelevant stuff, you can focus on what is relevant.

  4. Re:Why bother? by grcumb · · Score: 5, Insightful

    I have often longed for a "generic email database format" which could be a universal format for all email programs out there in some way. Pretty much a dream which is long over-due... about 10 years past-due. Perhaps there is already something like that and it has escaped me all these years but I seriously hate migrating email from one format to another.

    Take a look at Maildir. It's not perfect, but it is generic, simple and easily transferred from one location to another.
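
    For anyone who wants to try that, here is a rough sketch of moving an existing mbox archive into a Maildir with Python's standard mailbox module. The function name mbox_to_maildir and both paths are made up for illustration, and it assumes the mbox has already been decompressed and that the parent directory of the Maildir exists.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    # create=False: fail loudly if the source mbox is missing.
    src = mailbox.mbox(mbox_path, create=False)
    # create=True: make the Maildir (with cur/, new/ and tmp/) if needed.
    dst = mailbox.Maildir(maildir_path, create=True)
    count = 0
    try:
        for msg in src:
            # Wrapping in MaildirMessage converts the mbox status flags
            # into their Maildir equivalents where one exists.
            dst.add(mailbox.MaildirMessage(msg))
            count += 1
    finally:
        src.close()
        dst.close()
    return count
```

    Each message ends up as its own file, so rsync or a plain cp will move the whole thing from one machine to another.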

    RANT: Over the course of my (far too many) years of working in technology, I've often been amazed just how enamoured everyone is with databases. There are some things that databases do well, granted, but just because something needs an index doesn't mean it needs a relational database. /RANT.

    --
    Crumb's Corollary: Never bring a knife to a bun fight.

  5. Something Like This? by pscottdv · · Score: 5, Informative

    We all think you're crazy, but here it is:

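    #!/usr/bin/env python
    from mailbox import mbox, mboxMessage

    # Placeholder paths; substitute your own mbox files.
    orig_mb = mbox('path/to/orig/mbox')
    new_mb = mbox('path/to/new/mbox')

    for key, msg in orig_mb.iteritems():
        new_msg = mboxMessage()
        payload = msg.get_payload()
        if msg.is_multipart():
            # Keep only the first part, which is usually the text body.
            payload = payload[0].get_payload()
        # items() keeps repeated headers (e.g. Received) with their own values.
        for header, value in msg.items():
            new_msg[header] = value
        new_msg.set_payload(payload)
        new_mb.add(new_msg)
    new_mb.flush()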
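
    A slightly more defensive variation on the same idea, as a sketch only: it walks every MIME part, keeps just the text/plain ones, and drops the old MIME headers so the result no longer claims to be multipart. The names text_only and strip_mbox are hypothetical, and it assumes Python 3.5+ and an uncompressed mbox.

```python
import mailbox

# MIME headers that described the old multipart body; set_payload() below
# writes fresh ones for the text-only body.
_MIME_HEADERS = ("content-type", "content-transfer-encoding", "mime-version")


def text_only(msg):
    """Return a new mboxMessage holding only the text/plain parts of msg."""
    texts = []
    for part in msg.walk():
        if part.is_multipart() or part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue  # a .txt file sent as an attachment is still an attachment
        raw = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(raw.decode(charset, errors="replace"))
        except LookupError:  # charset name Python does not know
            texts.append(raw.decode("utf-8", errors="replace"))
    new_msg = mailbox.mboxMessage()
    if isinstance(msg, mailbox.mboxMessage):
        new_msg.set_from(msg.get_from())  # keep the original "From " line
    for name, value in msg.items():
        if name.lower() not in _MIME_HEADERS:
            new_msg[name] = value
    new_msg.set_payload("\n".join(texts), charset="utf-8")
    return new_msg


def strip_mbox(src_path, dst_path):
    """Write a text-only copy of every message in src_path to dst_path."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for msg in src:
            dst.add(text_only(msg))
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

    Something like strip_mbox('path/to/orig/mbox', 'path/to/new/mbox') would then write the stripped copy and leave the original alone. HTML-only mail comes out with an empty body under this sketch, so check a few messages before deleting anything.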

    --

    this signature has been removed due to a DMCA takedown notice