Ask Slashdot: Handling and Cleaning Up a Large Personal Email Archive?
First time accepted submitter txoof writes "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size, from a few tens of megs to nearly 500 megs from 2010. The archive is for storage only. It is a mirror of my Gmail account. The archives are both sent and received mail compressed in a hierarchy of weekly, monthly and yearly mbox files. I've chosen mbox for a variety of reasons, but mostly because it is the simplest to implement with fetchmail. After inspecting some of the archives, I've noticed that the larger files are a result of attachments sent by well-meaning family members. Things like baby pictures, wedding pictures, etc. What I would like to do from this point forward is strip out all of the attachments and only save the text of the emails. What would be a sane way to do that using simple tools like fetchmail?"
Storage is cheap and 500MB is hardly worth worrying about. The damage done by reducing that amount will likely be far larger than any temporary benefits you might get. If you want it smaller so that you can search faster, look for a tool that is better at searching and indexing the mails instead of trying to cut the mail into pieces.
You have. Thunderbird includes archival folders and a Lucene search engine.
There is DBMail.
Many people have a larger email store than you.
It is not a sign of status.
More likely, it is a sign of your incompetence to filter and save relevant data.
Congratulations.
Now back to the OP, who perhaps is smarter than you, since he has just 500MB of email to back up.
There is probably some email that you need to keep, but chances are that you don't need to keep most of your email. So just read, respond, then purge (when appropriate).
As others have pointed out, disk space isn't really a concern in this day and age. But managing data that you don't need is a concern. A minute spent filing, backing up, etc. of unnecessary data is a minute wasted. Add enough of those minutes together, and it may amount to a good chunk of your life that could have been spent doing more interesting/productive things.
As a side note, I notice that people sometimes get attached to things that don't really matter to them. I've known people who have lost all of their data due to circumstances beyond their control, then they became very distressed about that loss of data. The problem is that only a tiny fraction of that data was actually valuable, but they were worrying about all of the data. In some cases it was so traumatic to them that they spent more time worrying about the irrelevant stuff than the stuff that they would need to continue on in the future. So if you don't keep the irrelevant stuff, you can focus on what is relevant.
Who still uses e-mail?
People who get stuff done instead of being interrupted every 5m? And who want to receive messages even while offline? And have decent systems for archiving, tagging and searching them?
Dilbert RSS feed
Google for "procmail remove attachments":
http://osdir.com/ml/mail.procmail/2002-11/msg00091.html
That will get you started. You can do most anything with Procmail after you figure out the rather odd configuration file format.
Make sure you have it backed up first because it's also quite easy to destroy data with Procmail.
After you spend a lot of time futzing with Procmail scripts and sed and formail and the like, you'll wonder why you didn't go on Amazon or Newegg and buy a $10 flash drive that will hold all your mail several times over.
Penny - plain text accounting
We all think you're crazy, but here it is:
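#!/usr/bin/env python3
# Copy an mbox, keeping the headers and the first text/plain part of each message.
from mailbox import mbox, mboxMessage

orig_mb = mbox('path/to/orig/mbox')
new_mb = mbox('path/to/new/mbox')
for key, msg in orig_mb.iteritems():
    part = msg
    if msg.is_multipart():
        # Prefer the first text/plain leaf; fall back to the first leaf of any type.
        leaves = [p for p in msg.walk() if not p.is_multipart()]
        plain = [p for p in leaves if p.get_content_type() == 'text/plain']
        part = (plain or leaves or [msg])[0]
    new_msg = mboxMessage()
    new_msg.set_from(msg.get_from())  # keep the original "From " separator line
    for header, value in msg.items():
        # The container's MIME headers describe the multipart body; take them from the kept part.
        if header.lower() not in ('content-type', 'content-transfer-encoding'):
            new_msg[header] = value
    for header in ('Content-Type', 'Content-Transfer-Encoding'):
        if part[header] is not None:
            new_msg[header] = part[header]
    new_msg.set_payload(part.get_payload())
    new_mb.add(new_msg)
new_mb.flush()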
this signature has been removed due to a DMCA takedown notice
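For anyone who wants to try the attachment-stripping idea on a copy of an archive first, here is a minimal sketch of the same approach split into two functions so the message logic can be tested without touching any files. The names strip_attachments and strip_mbox, and the placeholder text used when no plain-text part exists, are made up for illustration and are not from any post in this thread. It assumes uncompressed mbox files, keeps only the first text/plain part that is not itself marked as an attachment, and throws away everything else (including HTML-only bodies), so run it on a backup.

```python
import mailbox
from email.message import Message

# Headers that describe the body; these are taken from the kept part, not the container.
BODY_HEADERS = ("content-type", "content-transfer-encoding", "content-disposition")


def strip_attachments(msg: Message) -> Message:
    """Return a new message with msg's headers and only its first text/plain part."""
    part = msg
    if msg.is_multipart():
        leaves = [p for p in msg.walk() if not p.is_multipart()]
        text_parts = [
            p for p in leaves
            if p.get_content_type() == "text/plain"
            and p.get_content_disposition() != "attachment"
        ]
        part = text_parts[0] if text_parts else None

    stripped = Message()
    for name, value in msg.items():
        if name.lower() not in BODY_HEADERS:
            stripped[name] = value

    if part is None:
        # Hypothetical placeholder for messages that carried no plain-text body.
        stripped["Content-Type"] = 'text/plain; charset="us-ascii"'
        stripped.set_payload("[attachments removed; no plain-text part]\n")
        return stripped

    for name in ("Content-Type", "Content-Transfer-Encoding"):
        if part[name] is not None:
            stripped[name] = part[name]
    stripped.set_payload(part.get_payload())
    return stripped


def strip_mbox(src_path: str, dst_path: str) -> int:
    """Write a stripped copy of the mbox at src_path to dst_path; return message count."""
    src = mailbox.mbox(src_path, create=False)  # raise instead of creating an empty file
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for msg in src:
            new_msg = mailbox.mboxMessage(strip_attachments(msg))
            new_msg.set_from(msg.get_from())  # keep the original "From " line
            dst.add(new_msg)
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

Calling strip_mbox with an old archive path and a new output path leaves the original file untouched, so you can compare sizes and spot-check a few messages before deleting anything.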