How Mailinator Compresses Its Email Stream By 90%
An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
Is how I feel after reading the referenced article.
She's Italian? All this time I thought she was a mustachioed Jewish transvestite.
Why do the men I love always lie to me?
The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm
I don't know that I understand entirely or have read it this fast, but basically no. He used algorithms stored into arrays and some database engineering voodoo.
Close, actually. Very close. They split the headers it's true, but he compresses the emails together so that emails that are exactly (or almost exactly) the same ("get viagra now!" or a newsletter) don't have to be stored in different places in memory. Only large emails get LZMA (much better than bzip fyi).
I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.
When information is power, privacy is freedom.
Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
Actually there is even more to it than that.
Actually, the core of his compression scheme seems to be constructing an LZW dictionary but using line patterns instead of bits or letters. The reason it works is because he jumps ahead of all those arrangements that would have probably been made letter by letter a, an, and, and , and h, (etc) which is what makes lzw and it's variants slow.
It was clever, but any Theory of Information course should tell you that choosing your "default" symbols correctly is very important. He did just that (:
Just code "Prince of Nigeria" for 1 bit and you've got (17*8==136):1 compression. Continue with that line of thinking... "expand your manhood", "Pass this along to a friend", "Dear beloved in Christ", etc.
Related anecdote: Way back when, as relatively innocent SW listeners, some friends and I thought it would be awesome to listen in on phone calls. They were all over; radiotelephone, on C-band satellite, etc. You just had to figure out where they were. Well, after about an hour of actual listening, we determined without a shadow of a doubt that exposing yourself to other people's phone calls seemed like a sure way to reduce your own intelligence by orders of magnitude. It was outright painful. Of course, now that cellphones are everywhere and exposing yourself to (half of) other people's phone calls is simply a matter of standing in line at Subway, we all know this.
But... compared to phone calls, general email is sure to be worse, judging by my spam folder and the ratio of spam to actual mail.
Of course, if you'd like to go down another step, Twitter has utterly worthless yuck in sets of up to 140 characters waiting to leave nasty little footprints all over your brain, lol...
And then there is the ultimate collection of drivel: Facebook.
I've fallen off your lawn, and I can't get up.
Deduplication is compression.