How Mailinator Compresses Its Email Stream By 90%
An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
Is how I feel after reading the referenced article.
I'm not going to RTFA, so let me guess: they split it up and bzip2 the remaining files? Maybe they split the headers from the rest of the email and compress those separately?
She's Italian? All this time I thought she was a mustachioed Jewish transvestite.
Why do the men I love always lie to me?
The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm
I wonder if you could do anything interesting with fuzzy hashing and diff.
Let's post the mostly-irrelevant part of the article as the summary and leave out anything that really matters
One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter.
What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
This is called a solid archive; what the author wants is a non-solid archive.
What about LZO compression and friends? They are used for indexable large files in Hadoop HDFS. They could come in handy here too. No mention of that?
Just delete the emails that are not on the compressing dictionary.
I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.
When information is power, privacy is freedom.
If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.
What the summary does not say was that, email number 502,922 is special cased and is stored in plain text at the head of the compression dictionary. So it will trivially fetch email number 502,922.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
That service is pretty cool. Never realized there was something out there like that.
Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over!
TFA right there.
Maybe next time you could actually read TFA.
Dilbert RSS feed
Y'know, if anyone on /. had a girlfriend, I'm not sure why they would mind her checking out their butt? Perhaps you should rethink this troll.
Just code "Prince of Nigeria" for 1 bit and you've got (17*8==136):1 compression. Continue with that line of thinking... "expand your manhood", "Pass this along to a friend", "Dear beloved in Christ", etc.
Related anecdote: Way back when, as relatively innocent SW listeners, some friends and I thought it would be awesome to listen in on phone calls. They were all over; radiotelephone, on C-band satellite, etc. You just had to figure out where they were. Well, after about an hour of actual listening, we determined without a shadow of a doubt that exposing yourself to other people's phone calls seemed like a sure way to reduce your own intelligence by orders of magnitude. It was outright painful. Of course, now that cellphones are everywhere and exposing yourself to (half of) other people's phone calls is simply a matter of standing in line at Subway, we all know this.
But... compared to phone calls, general email is sure to be worse, judging by my spam folder and the ratio of spam to actual mail.
Of course, if you'd like to go down another step, Twitter has utterly worthless yuck in sets of up to 140 characters waiting to leave nasty little footprints all over your brain, lol...
And then there is the ultimate collection of drivel: Facebook.
I've fallen off your lawn, and I can't get up.
I run a similar (though waaaay less popular) site - http://dudmail.com/
My mail is stored on disk in a mysql db so I don't have quite the same memory constraints as this.
I had originally created this site naively stashing the uncompressed source straight into the db. For the ~100,000 mails I'd typically retain this would take up anywhere from 800mb to slightly over a gig.
At a recent rails camp, I was in need of a mini project so decided that some sort of compression was in order. Not being quite so clever I just used the readily available Zlib library in ruby. This is just run straight over each email as it comes in with no reference to any previous emails, so very "dumb" compared to mailinator, but still fairly effective.
This took about 30 minutes to implement (in about 6 lines of code) - and then a couple of hours to test and debug. An obvious bug (very large emails were causing me to exceed the BLOB size limit and truncating the compressed source) was the main problem there...
I didn't quite reach 90% compression, but my database is now typically around 200-350mb. So about 70-80% compression. I didn't reach 90% compression, but I did manage to implement it in about 6 lines of code =)
If anybody responsible for the site comes this way, thank you for the excellent (and free) service.
Tries and Bloomfilters are wonderful algorithms, because they are simple, if you want something a tad bit more complicated use Locality-Sensitive Hashing to find similar documents from a big set of documents.
Read the article and don't try to be so clever next time.
Ok, I read the article and it was interesting. But the summary is IMO appalling. That said, it is Slashdot and I should have known better...
Thanks for the tip.
Write boring code, not shiny code!
Simply remove all the '>'+space that are mysteriously inserted at the beginning of content lines that start with "From " ... a silly legacy of Berkeley mailbox format From the days when only Real Programmers would use consistent escaping/de-escaping or short-term guaranteed unique hash delimiters to parse mailboxes, and Real Programmers were in Short Supply.
It's only a tiny part of the stream, but we could lose a little on every transaction but make up for it in volume. Until it adds up to 90%
Clearly, "Paul Tyma" is an alias for Hans Doofenshmirtz.
And anyone who blocks them. They know they're a scummy site and REALLY want to be sending you spam.
That or a site that uses a service that blocks over 4,000 disposable e-mail address domains might just want a more persistent identifier to cut down on sockpuppet registrations.