Ask Slashdot: Handling and Cleaning Up a Large Personal Email Archive?
First time accepted submitter txoof writes "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010. The archive is for storage only. It is a mirror of my Gmail account. The archives are both sent and received mail compressed in a hierarchy of weekly, monthly and yearly mbox files. I've chosen mbox for a variety of reasons, but mostly because it is the simplest to implement with fetchmail. After inspecting some of the archives, I've noticed that the larger files are a result of attachments sent by well-meaning family members. Things like baby pictures, wedding pictures, etc. What I would like to do from this point forward is strip out all of the attachments and only save the texts of the emails. What would be a sane way to do that using simple tools like fetchmail?"
Storage is cheap and 500MB is hardly worth worrying about. The damage done by reducing that amount will likely be far larger than any temporary benefit you might get. If you want it smaller so that you can search faster, look for a tool that is better at searching and indexing the mails instead of trying to cut the mail into pieces.
Then don't go back. EVER !!
My email archive dates back to 1999 and is 2GByte in size which isn't much considering the attachments.
I "handle" it by making a backup of it.
I do not clean it up. I do clean around it by deleting mail archives that contain mails that have no personal value.
I do not delete personal mail since it is precious, like photos. In 2011 nobody has to delete his personal mail.
This news is stupid
Surely a tool exists to keep email in a SQL database, so the envelope fields, plain text, and attachments are separately searchable. I have email back to 1996 with the same frustrations.
One would think that Thunderbird would have done that a decade or more ago, but no. Nor do any of the standard IMAP servers seem to support SQL (MySQL, Postgres) as a backend. This seems like a serious project waiting to happen. Or have I overlooked an obvious solution?
I am confused why you would want to save only text and not attachments?
What's the point of having a note that says : here are the pictures of your long lost relative : and not have the picture in your archive?
It's about the attachment in most cases isn't it?
I use MailStore Portable version on a USB key. Works very well for me, and is free for home / personal use.
http://www.mailstore.com/en/mailstore-home.aspx
Yes, I know it isn't what you asked, but if you know a little SQL you can create a simple database with a few tables: mails ordered by date, relations between them based on the headers (to follow up responses) and various types of attachments.
Accessing it with SQL is no more complicated than with fetchmail (unless your fetchmail isn't the same fetchmail I remember ;) and as an extra you can create a simple web page with search options, and point relatives to it when they ask you for that photo of the dog they sent you five years ago.
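For what it's worth, a minimal sketch of that idea in Python, using only the standard library (mailbox and sqlite3). The table layout, the column names and the function name load_mbox are my own assumptions, not anything the poster specified. Text parts without a filename are treated as the body and every other part is stored as an attachment.

```python
import mailbox
import sqlite3

# Hypothetical schema: one row per mail, one row per attachment.
SCHEMA = """
CREATE TABLE IF NOT EXISTS mails (
    id          INTEGER PRIMARY KEY,
    message_id  TEXT,
    in_reply_to TEXT,   -- lets you follow up responses
    sent        TEXT,
    sender      TEXT,
    subject     TEXT,
    body        TEXT
);
CREATE TABLE IF NOT EXISTS attachments (
    id        INTEGER PRIMARY KEY,
    mail_id   INTEGER REFERENCES mails(id),
    filename  TEXT,
    mime_type TEXT,
    data      BLOB
);
"""


def _text(value):
    """Headers may be missing or Header objects; store plain strings or NULL."""
    return None if value is None else str(value)


def load_mbox(mbox_path, db):
    """Copy every message of one mbox file into the two tables."""
    db.executescript(SCHEMA)
    for msg in mailbox.mbox(mbox_path, create=False):
        body_chunks, attachments = [], []
        for part in msg.walk():
            if part.is_multipart():
                continue  # containers carry no data of their own
            data = part.get_payload(decode=True) or b""
            filename = part.get_filename()
            if filename is None and part.get_content_type() == "text/plain":
                charset = part.get_content_charset() or "utf-8"
                try:
                    body_chunks.append(data.decode(charset, errors="replace"))
                except LookupError:  # charset name Python does not know
                    body_chunks.append(data.decode("utf-8", errors="replace"))
            else:
                attachments.append((filename, part.get_content_type(), data))
        cur = db.execute(
            "INSERT INTO mails (message_id, in_reply_to, sent, sender, subject, body)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (_text(msg["Message-ID"]), _text(msg["In-Reply-To"]), _text(msg["Date"]),
             _text(msg["From"]), _text(msg["Subject"]), "\n".join(body_chunks)),
        )
        db.executemany(
            "INSERT INTO attachments (mail_id, filename, mime_type, data)"
            " VALUES (?, ?, ?, ?)",
            [(cur.lastrowid, name, mime, blob) for name, mime, blob in attachments],
        )
    db.commit()
```

Something like load_mbox("2010.mbox", sqlite3.connect("mail.db")) would fill the database (both file names are placeholders), after which finding the dog photo is a SELECT on attachments joined to mails.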
This problem seems almost too simple, text-only and only up to 500MB per year.
I have a much tougher problem, a mailbox that is growing about 5GB per year that I still need searchable. And, stripping out the attachments is not okay, I need a way to still access them since many of them are receipts in PDF or edits on documents where the e-mail trail is the only record of changes over time. Thus ideally, the attachments should be indexed as well.
I guess you could do what many apache.org sites do: use mod_mbox to make a web-accessible version of your mail folders, possibly pre-processing them with an mbox splitting tool to get them into bite-sized chunks.
Then, overlay a search tool like Lucid Imagination (which is what lucene.apache.org uses) or any other local web indexer of your choosing in order to build searchability.
Nothing to see here. Move along.
Deleted
If you're not following Sarbanes-Oxley, just delete it. Fuck the pack-rat mentality.
https://addons.mozilla.org/en-US/thunderbird/addon/attachmentextractor/
I've used it to process large batches, it's pretty robust.
Did you even try Google yet? http://lmgtfy.com/?q=strip+attachments+munpack
You can probably also do it with procmail or perl or whatever scripting language you prefer. Let me know if you can't find the search box on Google and I'll post some more lmgtfy.com links for you.
look dude... we have 2011 where people confuse TB and GB and... and people like you are seriously full of shit...
just delete everything (including the baby pictures) of personal mail your ever received.. because all of it sure was wasted on you...
500MB is a fucking joke... i have porn movies captured from VHS tapes that is larger than 500MB... and you want to delete baby pictures... damn you suck...
Many people have a larger email store than you.
It is not a sign of status.
More likely, it is a sign of your incompetence to filter and save relevant data.
Congratulations.
Now back to the OP, who perhaps is smarter than you, since he has just 500MB of email to back up.
What possible reason could you have to save personal emails from that long ago? And you want to save the text, but not the attachments? Years from now you'll read an email that says: Here's the pix from xmas, enjoy!
This doesn't answer your question but may be helpful - dovecot supports (imap and pop3) reading gzipped mbox files. Keeping my archives gzipped brought them to a manageable size.
You insensitive clod!
My gmail inbox is using 2.7GB, or roughly 34%. I know someone using more than 70%. They provide a way to get more room for a reason.
Just keep it all, and as other people have said, index it.
fetchmail + procmail sorted into different folders in maildir format.
I don't auto-strip attachments on large mails, just sort them out from the rest, but it would be easy to add.
I have a maildir folder for inbox, outbox, notme (e.g., stuff addressed to distribution lists), large (here go all the mails with massive attachments).
large, for me, is manageable to go through manually. It only has a few tens of messages / yr.
If I had more to go through, procmail could call a simple script to strip the attachments on all the mails that are large enough to currently get sent to the "large" folder. I have something set up like this already to train a Bayesian filter on mail dropped to certain folders.
Here is the relevant procmail (pretty simple to do):
# if the message is huge, probably don't want to archive it, even if directly
# to me
:0:
* > 200000
${DUBIOUS}.large/
IMAP is another potential answer.
I run Dovecot locally, and it stores every mail I've ever received, indexed for quick searches.
This way I can get my mail with all history and a fast search index on all my devices also.
Blessed are the pessimists, for they have made backups.
That takes the subject of any email with an attachment and moves it out of the .mbox into a photo archive
(my guess is that your attachments are mostly photos with some videos)
After you strip the photos from the mbox, gzip it and you should be fine.
You'll have a compressed archive of correspondence, and an easily browsed directory of photographs.
For extra points, use the program ``touch'' to date the attachment files to their original received-by date.
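A rough Python sketch of that extract-and-touch step, in case it helps. The function name extract_attachments and the file naming scheme are made up for illustration, and it uses the message's Date header as a stand-in for the received-by date (an assumption; the Received headers would be closer to what the poster describes).

```python
import mailbox
import os
from email.utils import parsedate_to_datetime


def extract_attachments(mbox_path, out_dir):
    """Save every named attachment into out_dir, dated like its message."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for msg in mailbox.mbox(mbox_path, create=False):
        try:
            stamp = parsedate_to_datetime(msg["Date"]).timestamp()
        except (TypeError, ValueError):
            stamp = None  # missing or unparseable Date header
        for part in msg.walk():
            filename = part.get_filename()
            if part.is_multipart() or not filename:
                continue
            # Drop any directory components the sender included, and prefix a
            # running counter so two "IMG_0001.jpg" files do not collide.
            base = os.path.basename(filename.replace("\\", "/")) or "unnamed"
            target = os.path.join(out_dir, "%06d_%s" % (len(saved), base))
            with open(target, "wb") as handle:
                handle.write(part.get_payload(decode=True) or b"")
            if stamp is not None:
                os.utime(target, (stamp, stamp))  # what `touch` would do
            saved.append(target)
    return saved
```

This only copies the attachments out; it does not touch the mbox, so stripping and gzipping the archive is still a separate step.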
There is probably some email that you need to keep, but chances are that you don't need to keep most of your email. So just read, respond, then purge (when appropriate).
As others have pointed out, disk space isn't really a concern in this day and age. But managing data that you don't need is a concern. A minute spent filing, backing up, etc. of unnecessary data is a minute wasted. Add enough of those minutes together, and it may amount to a good chunk of your life that could have been spent doing more interesting/productive things.
As a side note, I notice that people sometimes get attached to things that don't really matter to them. I've known people who have lost all of their data due to circumstances beyond their control, then they became very distressed about that loss of data. The problem is that only a tiny fraction of that data was actually valuable, but they were worrying about all of the data. In some cases it was so traumatic to them that they spent more time worrying about the irrelevant stuff than the stuff that they would need to continue on in the future. So if you don't keep the irrelevant stuff, you can focus on what is relevant.
I had a practice of burning each year worth of email on one CDR in mbox format, but I have found that I never actually need to refer to those old messages. Also in general there's so much data these days that I've found it best to just archive the cream of the crop.
Even at today's post-Thailand-flood inflated hard drive prices, your entire e-mail history occupies less than a dollar's worth of disk space. I fail to see the issue.
For my own mail archives I just use mutt and weed things a bit by hand. I find that 90% of the mbox size is in fewer than a dozen attachments, so I can hand-filter those out in ten minutes once a year. Beyond that disk is too cheap to care and time is too valuable to make a really comprehensive solution. So what I do:
'mutt -f archive.mbox'
':set pager_index_lines=6' (Lets you see the message index split above the body)
'o' (Order), 'z' (siZe), End (last entry), Enter (Open).
while(mbox.size > acceptable_size)
{
'v' (View attachments)
'jjj' (down a few times to the attachment I want to nuke)
'd' (Delete)
while(more attachments) { 'd' (Delete more attachments) }
'q' (Quit back to the message view)
'k' (previous message)
}
'q' (Quit back to index)
'$' (Sync changes to disk)
'q' (Quit mutt)
Note the 'j' and 'k' are vi-style up/down. The arrow keys work too if you're not a home row junkie like me.
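If you would rather see the candidates before opening mutt, the "order by size" step can be approximated with a few lines of Python. This is only a sketch: the function name largest_messages and the default of twelve entries (taken from the "fewer than a dozen" remark) are my own choices.

```python
import email
import mailbox


def largest_messages(mbox_path, count=12):
    """Return (size_in_bytes, subject) pairs for the biggest messages, largest first."""
    box = mailbox.mbox(mbox_path, create=False)
    sizes = []
    for key in box.keys():
        raw = box.get_bytes(key)  # the message exactly as stored in the mbox
        subject = email.message_from_bytes(raw)["Subject"]
        sizes.append((len(raw), str(subject or "(no subject)")))
    box.close()
    sizes.sort(reverse=True)
    return sizes[:count]
```

Printing the result for archive.mbox gives a short hit list to work through by hand in mutt.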
I don't know a good fully automated way to do this that's ready to slice it right out of the box. If you want to roll your own, just pick up a library like RMail or TMail for Ruby, or the equivalent for the language you prefer. That's 80% of the work done, but you'll still probably find a dozen corner cases involving HTML alternatives with odd names that look like binary attachments, or terribly malformed spam.
With the attachments. Storage is cheap and it takes only seconds to find anything as is. Seriously, if you're concerned about file size and the time to search some simple emails, perhaps your computer is just too old? (Your media attachments shouldn't be adding to the search time, so that is a lousy excuse.)
The Eudora Mail User Agent (i.e. email client) stores attachments in a directory as binaries but yet keeps the text of emails intact. Thus you should be able to import the email into Eudora, then when you export it the attachments should be stripped.
This is also exactly why I don't use Eudora anymore, because attachments get stripped off when exporting the email (or at least that's the way email export or import from/to Eudora worked last I used it).
Now, although this explains one way attachments can be stripped from email, I don't recommend doing that, because it alters the email. Generally I want email intact because otherwise what you're storing might refer to an "attached file", but yet not even knowing what the filename is that the email refers to. Plus it's actually useful to have sent mail attachments intact too, because it means you get to see what version of what file you sent to someone at the time.
There are some other interesting options; the KMail MUA has an option of "delete attachment" when right-clicking on an email attachment, which does delete the attachment but not the reference to it, so you at least know the filename of what used to be attached. I just sent myself a test email and deleted the attachment and then viewed the email raw, but unfortunately Slashdot's filter won't let me send the result. But if you do that yourself and look at it, it should give you an idea how to re-form emails to strip attachments but not the references.
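Here is one way the "strip the attachment but keep the reference" idea could look in Python with the standard email package. It is a sketch, not what KMail actually does: the function name strip_attachments and the wording of the placeholder note are invented, and any part that has a filename or a non-text type is treated as an attachment.

```python
import email


def strip_attachments(msg):
    """Replace each attachment in msg with a one-line text note, in place."""
    if msg.is_multipart():
        for part in msg.get_payload():
            strip_attachments(part)
        return msg
    if msg.get_content_maintype() == "text" and not msg.get_filename():
        return msg  # ordinary body text: keep untouched
    name = msg.get_filename() or "unnamed"
    # Keep the note pure ASCII so no transfer encoding is needed.
    name = name.encode("ascii", "replace").decode("ascii")
    note = "[attachment removed: %s (%s)]\n" % (name, msg.get_content_type())
    # Drop the MIME headers that described the old binary content...
    for header in ("Content-Type", "Content-Transfer-Encoding", "Content-Disposition"):
        del msg[header]
    # ...and turn the part into plain text recording what used to be here.
    msg["Content-Type"] = "text/plain; charset=us-ascii"
    msg.set_payload(note)
    return msg
```

Because it works on a single parsed message, the same function could be called from a procmail filter on new mail or from a loop over an existing mbox.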
I'm not sure if anyone has mentioned this, but Stanford has been working on an email client to help understand and visualize an archive of 50,000 emails - It lets you pull out the images, browse emails by 'sentiment-mapped' values and graph the patterns of activity over the full lifetime of the archive. You can see the project page here: http://mobisocial.stanford.edu/muse/
In Mac OS X's built-in mail application, you can use:
Message -> Remove Attachments ... so all you need to do is find a Mac and put your email on an IMAP server.
Google keeps a permanent copy anyway...
"I love my job, but I hate talking to people like you" (Freddie Mercury)
Who still uses e-mail?
People who get stuff done instead of being interrupted every 5m? And who want to receive messages even while offline? And have decent systems for archiving, tagging and searching them?
Dilbert RSS feed
We're worrying about 500MB?
Even at today's outrageous price-fixed (you know it's true) hard drive prices, you're talking 14 cents a GB. For your situation, we're talking 7 cents.
You're complaining about 7 cents worth of storage space. And to cut down on this you want to mangle the archive?
You're tight on space? Buy another drive, burn to CD/DVD.
For those of us who grew up with a Corvus shoebox hard disk costing thousands on the Apple ][ network, this is a ridiculous "ask slashdot" question.
--
BMO
back to ARPA mail and UUCP mail days...
for a while I used Eudora and every month religiously took each piece of email and filed it away in suitable mail folders. After Eudora started declining and I got too busy, I stopped that, but even now, religiously every month I clean out my mailbox of all junk and unwanted attachments (trimming 60-100MB to usually 20-30MB) and then stack that month's email away as a single mbox file, and start fresh with a new Inbox.
the old mailbox files are on an IMAP server that I can easily read emails from at least 10 years ago -- older with a little more effort. As single mbox files each, I can do greps on them also. Seems to be an okay way to keep the stuff, some of which has proven to be important over the years....
another big help: all semi-junky and non business emails I let Hotmail do the work (vendor stuff, Amazon orders, etc). Have been using Hotmail since before MS bought it. Works well as a place to direct mostly junky vendor stuff.
Google for "procmail remove attachments":
http://osdir.com/ml/mail.procmail/2002-11/msg00091.html
That will get you started. You can do most anything with Procmail after you figure out the rather odd configuration file format.
Make sure you have it backed up first because it's also quite easy to destroy data with Procmail.
After you spend a lot of time futzing with Procmail scripts and sed and formail and the like, you'll wonder why you didn't go on Amazon or Newegg and buy a $10 flash drive that will hold all your mail several times over.
Penny - plain text accounting
Amateur. When you get to 8+ gb then we can talk about 'large archive'. Until then, just stick it on a CD.. you don't even need a DVD for that.
---- Booth was a patriot ----
Why delete when disk space even today is 14 cents in "Salesman Gigabytes"?
Someone back there said he has 16 GB of mail going back over a decade. That's what, Two Bucks And A Quarter?
It's less than a cup of coffee at Starbucks or even a Large at Dunkin' Donuts. It is fully irrational to worry about this.
Anyone worrying about personal mbox size has OCD issues. Full stop.
--
BMO
You will be very sorry you deleted those pictures. Don't do that. Even right now, you could make many people very happy by giving them one of those digital picture frames that display a different stored photo every several seconds, loaded with your pictures of the people important to the recipient.
Okay, so most of the people here have wasted your time trying to convince you that "storage is cheap" or that there isn't a good reason to store all that e-mail, let alone try to organize it all. I'm with you, not them. It's the fricking 2000s. It should be easier to archive this stuff and organize it if you *want* to.
I've always wanted to do something about my messy mail archive of mbox files (dates back to the 1990s), but I dreaded the thought of coding something up from scratch given all the quirks of e-mail formatting. I had high hopes your post would elicit some sage advice from the readers of /., but so far I don't see much other than the good mutt+ruby solution. In frustration, I've started looking but I haven't found much either. For what it's worth, here's what I've got so far:
1) There are plenty of commercial solutions that promise to do everything for a low price (e.g., MailSteward for OS X looks pretty good and has a free trial up to 15000 messages). Maybe. But I'm cheap and will exhaust the fully free solutions before spending money. Most of them are more focused on mailbox conversion/migration (e.g., Emailchemy) than actual filtering/archiving.
2) Free / some assembly required:
archivemail - mostly for date-selection of messages and archiving/compressing. Doesn't help with attachments. Python.
archmbox - more capable than archivemail. Can do filtering based on date, header field matches, etc., copy selected messages and compress to archive. Perl. Closer.
MHonArc - converts mbox to HTML files with links to attachments. Meant for mailing list archiving, but it should work the same for a personal mailbox. Perl. There's also an OS X front end for it.
The HTML approach isn't ideal, but that could be a convenient way to browse through the archives (e.g., toss it all up on a password-protected web site and your mail archive is available anywhere, like your own personal and backed-up GMail), and a contributed program in the MHonArc distribution can turn an MHonArc archive back *into* an mbox file, which might let you do some modifications to the HTML files and linked attachments with scripts and then backconvert them after.
I haven't tested any of these, but I think I'll try MHonArc and see how it goes.
My entire mail store is over 16 Gb. I have single mbox files that are larger than 2 Gb.
My entire mail store is over 1TB.
I have single LZMA compressed mbox.xz files that are larger than 16 Gb.
Photos are one of the most treasured things in many families. Keep in mind it's highly unlikely Aunt Petunia is keeping great backups of her photos, and when it all goes south, you might be one of the family members who actually has a photo of a relative who has passed on that she wants to print when her hard drive gives up the ghost.
Enables you to save everything off line as a pdf. Personally I don't get the question or see the point. My archive is about 6 Gig, all backed up all searchable. Anyway the company that makes the software is www.spotdocuments.com Just back up
Watch those corners
We all think you're crazy, but here it is:
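#!/usr/bin/env python3
# Copy an mbox, keeping only the first leaf part (usually the text body)
# of each message and dropping every other part.
from mailbox import mbox, mboxMessage

orig_mb = mbox('path/to/orig/mbox')  # placeholder paths
new_mb = mbox('path/to/new/mbox')
for msg in orig_mb:
    # Descend to the first non-multipart part.
    part = msg
    while part.is_multipart() and part.get_payload():
        part = part.get_payload()[0]
    new_msg = mboxMessage()
    new_msg.set_from(msg.get_from())
    for header, value in msg.items():
        # The outer MIME headers describe the multipart wrapper we discard.
        if header.lower() not in ('content-type', 'content-transfer-encoding'):
            new_msg[header] = value
    for header in ('Content-Type', 'Content-Transfer-Encoding'):
        if part[header] is not None:
            new_msg[header] = part[header]
    new_msg.set_payload(part.get_payload())
    new_mb.add(new_msg)
new_mb.flush()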
this signature has been removed due to a DMCA takedown notice
for Android and desktop Linux. I also want it accessible over the internet with the email archive hosted on my own server.
Ctrl-A, Shift-Del
FTS: "I have a personal email archive that goes back to 2003. The early archives are around 2 megabytes. Every year the archives have grown significantly in size from a few tens of megs to nearly 500 megs from 2010."
So, the total space required thus far is definitely less than (8 * 0.5 GB) = 4 GB. A USB flash drive with that small a capacity is practically classified as electronic waste these days.
Even if his or her annual e-mail archive size doubled every year for the next 10 years, it would only be 1+2+4+8+16+32+64+128+256+512=1023 GB.
A 3 TB hard drive he buys *today* for $100 would probably solve his "problem" for 10 more years.
Hopefully, in the year 2021, we will have tiny 3 PB SSD drives for $100... But maybe we will be ruled by an A.I. by that time, if we haven't already destroyed ourselves with viruses, nanomachines, robots, nuclear weapons, etc.
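The doubling estimate above is easy to check. A tiny sketch, assuming (as the post does) that the first of the ten years comes in at 1 GB and each later year doubles; the variable names are mine.

```python
# Yearly archive size in GB over ten years of doubling: 1, 2, 4, ..., 512.
yearly_gb = [2 ** year for year in range(10)]
total_gb = sum(yearly_gb)

# Compare against a "3 TB" drive counted the way drive makers do (1 TB = 1000 GB).
drive_gb = 3 * 1000
print(total_gb, "GB needed,", drive_gb - total_gb, "GB to spare")
```

So even under that pessimistic growth rate the ten-year total stays at roughly a third of the drive.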
I happen to be in the process of reading a book on hoarding to help my mother through some moderate hoarding issues that she is having; the problem described in this post sounds exactly like some of the underlying causes of hoarding in general.
If you keep looking for a purely technical solution to the problem, you're probably not ever going to solve it, and it will keep escalating despite whatever technical stop-gap you're able to come up with.
As others have said, the headache you will have if you do want to come back (potentially years later) to that one email you know you had only to find your attachment-stripping program has foobar'd the whole archive up (or that you need the attachment after all) probably isn't worth the hassle for saving 500MB per year this year (even taking into account reasonable growth rates - I'd note that bandwidth per $, which will be the factor limiting your email size, has been growing rather more slowly than storage capacity per $ over the past decade and things are likely to continue that way).
If the problem is that you have significant duplication between emails (e.g. the same attachment being emailed several times), gzip and bzip2 may well miss the opportunity to de-dupe this because the distance between duplicated sections is large. One solution to consider if this is an issue may be to use something which is better at compressing over long distances. I would suggest trying something like lrzip to compress tarballs of the annual sets of mbox files before archiving those.
Of course, if you just have lots of attachments which *aren't* duplicated (which is probably more likely), that won't really help much.
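Before bothering with lrzip it might be worth measuring how much duplication there really is. A rough sketch: hash every named attachment and count the bytes that occur more than once. The function name duplicate_attachment_bytes is hypothetical, and only exact byte-for-byte duplicates are counted (a re-encoded or resized photo would not match).

```python
import hashlib
import mailbox
from collections import defaultdict


def duplicate_attachment_bytes(mbox_paths):
    """Return (total_bytes, duplicate_bytes) over all named attachments."""
    occurrences = defaultdict(int)  # sha256 digest -> times seen
    sizes = {}                      # sha256 digest -> decoded size in bytes
    for path in mbox_paths:
        for msg in mailbox.mbox(path, create=False):
            for part in msg.walk():
                if part.is_multipart() or not part.get_filename():
                    continue
                data = part.get_payload(decode=True) or b""
                digest = hashlib.sha256(data).hexdigest()
                occurrences[digest] += 1
                sizes[digest] = len(data)
    total = sum(sizes[d] * n for d, n in occurrences.items())
    # Every copy beyond the first is space a long-range compressor could win back.
    duplicated = sum(sizes[d] * (n - 1) for d, n in occurrences.items())
    return total, duplicated
```

If the second number is small compared with the first, long-distance compression is unlikely to buy much.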
Don't act so smug - the 500MB figure was for one year only. He never gave us the full amount for all his emails.
BTW, the "R" in "RTFS" means "READ".
These are a few of the tools that I use (Unix/Linux, of course):
formail (part of the procmail distribution) is very useful for rewriting mailboxes.
uuexplode is useful for discovering and yanking out attachments.
grepmail is REALLY useful for discovering messages which match certain criteria.
csplit is useful for more than mail, but it also has applications with mailboxes.
My personal email archive goes back to 1996, and is still only 262MB.
My Google archive uses 164MB.
I've no idea what my Yahoo account uses.
But 500GB of email?!?!?!?! Are your relatives sending you entire videos as attachments or accidentally copying their entire music archives?
I do not fail; I succeed at finding out what does not work.
Deal with the superfluous attachments first and then see how you feel. Attachments are often unnecessary baggage.
You are talking about $0.035 in storage costs
Literally. I bought 3TB for $200.00 at Fry's yesterday. You probably have more than 500M of "Angry Birds" on your cell phone.
If the point is to pull the data off your gmail account and not have it stored there (maybe you want to migrate away, maybe you are trying to get us to design a product for your file hosting service adjunct to gmail, whatever), fetchmail is a terrible tool, particularly since gmail permits IMAP4 access, and you don't have to worry about decoding headers in precedence order.
In general, if you want to leave the email on your gmail account, and delete only attachments, sorry, but the mail is not stored that way on the gmail server, messages are stored as units, and attachments are not separate things, they are different sections in the same flat file using MIME encoding.
You would have to pull the mail down with IMAP4, process it, and push it back up, again with IMAP4. You would then need to take additional steps via IMAP4 commands to make it not look like newly arrived mail to the gmail server, or you're going to see them all as new messages the next time you log into gmail (basically, after the put, you will need to mark the message read again).
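The pull, process, push back, mark-as-read cycle described above could be sketched with Python's imaplib roughly as follows. The function names, and the transform callback that does the actual stripping, are placeholders of mine. It assumes conn is an already logged-in imaplib.IMAP4_SSL object and that the server lists INTERNALDATE before the message body in its FETCH reply; if it does not, the original arrival date is simply not carried over.

```python
import imaplib


def replace_message(conn, mailbox_name, uid, new_bytes, when=None):
    """Upload new_bytes flagged as already read, then delete the original."""
    status, _ = conn.append(mailbox_name, r"(\Seen)", when, new_bytes)
    if status != "OK":
        raise RuntimeError("APPEND failed; original message left untouched")
    conn.uid("STORE", uid, "+FLAGS", r"(\Deleted)")
    conn.expunge()


def rewrite_mailbox(conn, mailbox_name, transform):
    """Apply transform(raw_bytes) -> raw_bytes to every message in a mailbox."""
    conn.select(mailbox_name)
    _, found = conn.uid("SEARCH", None, "ALL")
    for uid in found[0].split():
        # BODY.PEEK[] fetches the full message without setting \Seen on it.
        _, fetched = conn.uid("FETCH", uid.decode(), "(INTERNALDATE BODY.PEEK[])")
        envelope, raw = fetched[0]
        new_raw = transform(raw)
        if new_raw == raw:
            continue  # nothing was stripped, leave the stored copy alone
        # Reuse the original arrival date so the message sorts where it used to
        # (None if the server put INTERNALDATE somewhere else in the reply).
        when = imaplib.Internaldate2tuple(envelope)
        replace_message(conn, mailbox_name, uid.decode(), new_raw, when)
```

The re-uploaded copies get new UIDs, which is the POP3 side effect described in the next paragraph. How Gmail treats deletion over IMAP depends on the account's settings, so this is something to try on a throwaway label first.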
If you additionally use POP3 access, the message IDs as reported by POP3 will change, and since it does its "leave on server" functionality by maintaining a local database of message IDs, it's going to be seen as new email there. There's no getting around that, unless you own the source code to your POP3 client and are willing to do correlation after the put operation on the IMAP4 connection to translate the ID.
You also realize that if you are doing this for reasons of quasilegality of message contents, those messages aren't really gone, right? People accidentally delete things all the time and want them back, and that same recovery process in that case would be usable in discovery by subpoena for the email provider (there are in fact additional requirements for ISPs under Patriot II to maintain records of sent and received emails for up to two years).
The bottom line here is that you are engaging in a pretty useless exercise here, unless you are trying to hide illegal activity or build a service and have us design it for you. In either case, good luck, you'll be writing a lot of code to get what you want.
-- Terry
Is a project a friend of mine started. Interfaces w/ gmail's API, quite easy to use.
www.findbigmail.com
I have an issue with some of my old Thunderbird mail archives. They are infected with various viri (w32 swen.A, N32 Netsky.T, Trojan Zbot) The anti-virus software I've tried just wants to delete the entire file not clean it. Would like to clean the files without risking infection but haven't been able to find a way to clean an offline mail file. Any ideas? Thanks.
I keep my email in maildir format (the default for Claws-Mail), and rotate every six months. The whole process is entirely manual, but since any given step only takes a few seconds, it works fine for me.
Emails are sorted on receipt according to source or content via ordinary filters. Every email I receive that's worth keeping goes into a catch-all folder after reading. I probably should be preserving the sorting when I move something into that catch-all, but I don't receive enough email to bother with it yet.
The start of the current six-month range is always part of that folder's name, e.g. "2011-07-01 to Current". Every six months, I rename the folder to add the end date (e.g. "2011-01-01 to 2011-06-30") and move it into a separate storage folder (still within Claws-Mail's folder tree). Then, I simply create a new Archive folder for the new 6-month period. Fill, rinse, repeat every six months.
When the mood strikes me (roughly every couple of years), I'll compress the latest six-month block(s) and move the results into a long-term storage directory. I generally keep only the most recent few years' worth of emails at hand, so the oldest stuff gets deleted from time to time, leaving only the compressed files. If I need to search the older stuff, it's a small matter of extracting to a temp/work directory, doing whatever needs done, and deleting.
On top of that, I run an incremental backup of my home directory and storage areas to a USB-connected disk every so often (the time between backups varies - usually once a fortnight or more often). So eventually, every email I decide to save ends up with one online and at least two offline backups. Since I use Gmail, technically they serve as an off-site backup of the most recent stuff (until I delete it anyway).
I figure with this setup, it's easy to find whatever I need, and it would take a pretty big screwup to actually lose an email.
I check my email twice a day, with rare exceptions.
If there is information in an email message, I generally write it. Longhand. With a fountain pen on high quality paper (because that is all I use.)
If it is worth keeping, it is worth writing down on archival paper with archival ink. It's the rare email message which doesn't get deleted while being read.
I actually cannot relate to anyone who thinks it's necessary or wise to retain email messages longer than the amount of time required to reply and/or delete.
How about procmail (for new mail) or formail (for iterating through messages in existing mailboxes) and a perl-Email-MIME-Attachment-Stripper script?
Strip (or separate and save) the attachments from a mail message.
$ du -hs ~/Mail
1.9G    /home/spugglefink/Mail
Ask yourself: When are you ever going to read all those email again? When is *anybody* ever going to read them again.
1) I needed to order ink recently. Now, I don't print much but I vaguely remembered a good supplier that I had used in the past. But what was it called? A few moments of grepping and I found it: in a confirmation email from three years ago.
2) I met a woman on a Meetup hike recently that I seem to have met before. Was this the blind date from four years ago? The smoking gun was in an email from 2007.
3) I've had occasional need to look up old acquaintances. While I might have created a contact file at various points, odds are I have forgotten what I named it or where I put it. But I am quite sure the information is in email.
The real treasure is email that is ten or more years old. You think you remember what happened in the right order? Trust me. You don't. An email archive is like a diary except it is less work and more complete.
If you use, or have access to a Mac, the Apple Mail client has for some years had a Remove Attachments option in the Message menu. Simply select all your mail in a folder with Cmd-A and select that menu option and it'll do exactly what you want. I use it regularly to prune my database.
It's a Unix system - I know this.
Look...
You've never had to look back thru your email for stuff
You don't feel the need to have email from 1994 searchable
You've been lucky enough to never have to find a contact that you lost
BECAUSE
You're too young to 'have a past'.
You've never held a job with any type of seniority
You've never held a job that was important
You don't have friends or intimate relationships
You, are not interesting or important enough to temporally affect
anything beyond a few dozens of hours
and you could probably disappear or die in your hovel and |except|
for bills and rent not getting paid and perhaps a (more?) foul odor
than before...
NO ONE WOULD MISS YOU
But see... that's why this question exists... because we do
have friends, family, loved ones, jobs that are or were important
and people would miss us, if we fuckin took a 1hr nap in the
middle of the day! So, do us a favor and STFU... because us
important folk with friends want to save our emails til the
angels blow their trumpets.
Thanks for playing!
-@|
That is 8 years x 500MB = 4GB... still got 3.5 GB. So why in the world would you do that??? oh.. you worry about losing your family's baby pictures... C'mon man.. they won't fail. And if you want to keep your work mail there, you should consider changing to a more reliable provider or upgrading to a pro gmail account.
...and forward everything there.
Or set up (an) imap server(s) on additional machine(s) and use imapsync, or similar, to back everything up.
I have all my emails since 1990 or 1991 :-/ and it's true that I use less and less emails...
I saved them in mbox format and used a script to remove binary attachments (yeah, cat pictures and things like this). It's mainly text and can be highly compressed.
"Science will win because it works." - Stephen Hawking
I still have the tapes, but not the desire to read them. Since the mid-1990s I've had cloud email (hotmail) and haven't really lost anything.
Use my M$ Outlook method.
1) Select all.
2) Delete.
done.
Email is extremely convenient for file transfer, but I prefer not to have my mail store so bloated.
What happens when my friends want to start emailing me movies?
I haven't figured out the true nature / fundamentals involved here.
http://www.mhonarc.org/ is a perl script that will recurse through folders of emails, unpack the attachments and generate html versions of the mail, with links. A simple find can then delete the attachments, or you could just keep them as the result will be (mostly) smaller.
I love Thunderbird but it becomes slow when handling large mail archives (I have about 10GB), mail folders become corrupt and I feel it even loses emails. I'm looking forward to Thunderbird 10 (or later) when it is able to store e-mails in a proper DB (e.g. SQLite). Until then I recommend DB Mail.