How Mailinator Compresses Its Email Stream By 90%

← Back to Stories (view on slashdot.org)

How Mailinator Compresses Its Email Stream By 90%

Posted by Soulskill on Tuesday February 21, 2012 @10:10AM from the ever-faster-ever-smaller dept.

An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"

15 of 75 comments (clear)

Min score:

Reason:

Sort:

Really Dumb by Anonymous Coward · 2012-02-21 10:16 · Score: 5, Funny

Is how I feel after reading the referenced article.
1. Re:Really Dumb by Tyrannosaur · 2012-02-21 11:11 · Score: 3, Insightful
  
  Alright, I apologize. I was in the wrong. It also came off significantly more sarcastic than I meant it too. The point that I tried (and failed) to make was that there really is nothing that should make anyone feel dumb, it's really just a lack of learning that can be fixed. Thank you for calling me out.
Re:Beware by Ethanol-fueled · 2012-02-21 10:20 · Score: 3, Funny

She's Italian? All this time I thought she was a mustachioed Jewish transvestite.

Why do the men I love always lie to me?
Because nobody RTFA by Tyrannosaur · 2012-02-21 10:25 · Score: 5, Informative

The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm
1. Re:Because nobody RTFA by MagicM · 2012-02-21 10:37 · Score: 4, Informative
  
  If you enjoyed reading that, you might also enjoy reading this and the follow-up about efficiently storing a dictionary of words and dealing with memory v.s processing trade-offs.
Re:You cut off at the good part. by poetmatt · 2012-02-21 10:27 · Score: 2

I don't know that I understand entirely or have read it this fast, but basically no. He used algorithms stored into arrays and some database engineering voodoo.
Re:You cut off at the good part. by Tyrannosaur · 2012-02-21 10:27 · Score: 2

Close, actually. Very close. They split the headers it's true, but he compresses the emails together so that emails that are exactly (or almost exactly) the same ("get viagra now!" or a newsletter) don't have to be stored in different places in memory. Only large emails get LZMA (much better than bzip fyi).
Mailinator Rocks by Jah-Wren+Ryel · 2012-02-21 10:40 · Score: 3, Informative

I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.

--
When information is power, privacy is freedom.
It's not the algorithm, it's the data by Hentes · 2012-02-21 10:52 · Score: 4, Informative

Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
1. Re:It's not the algorithm, it's the data by Lehk228 · 2012-02-21 11:29 · Score: 2
  
  how unique is most email? registration emails, newsletters, h3rb4l v14g|2a, p3n15 3nl4|2g3m3n7, buy stock ____, etc
  
  --
  Snowden and Manning are heroes.
2. Re:It's not the algorithm, it's the data by firewrought · 2012-02-21 11:41 · Score: 3, Informative
  
  Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
  The accomplishment here is that he determined a very tactical set of strategies for solving a real world problem of large scale. No, it didn't take a math PhD with some deep understanding of Fourier analysis to invent this algorithm, but it most certainly took a software developer who was knowledgeable, creative, and passionate for his task. So yeah... it's not the 90% compression that's impressive, it's the real-time performance that's cool.
  
  --
  -1, Too Many Layers Of Abstraction
Re:You cut off at the good part. by Nemesisghost · 2012-02-21 10:58 · Score: 4, Informative
Actually there is even more to it than that.
1. 1. He broke each email up by lines(or multiple small lines, until the smallest unit was greater than 64 bytes).
2. 2. He compared each line to the set of line from previous emails, building a dictionary of lines that each has in common.
3. 3. Finally, if a "line" in the email is large enough he will LZMA compress it against a sliding dictionary of only the most recent emails
Re:You cut off at the good part. by errandum · 2012-02-21 12:06 · Score: 2

Actually, the core of his compression scheme seems to be constructing an LZW dictionary but using line patterns instead of bits or letters. The reason it works is because he jumps ahead of all those arrangements that would have probably been made letter by letter a, an, and, and , and h, (etc) which is what makes lzw and it's variants slow.
It was clever, but any Theory of Information course should tell you that choosing your "default" symbols correctly is very important. He did just that (:
Easy compression: by fyngyrz · 2012-02-21 12:11 · Score: 2

Just code "Prince of Nigeria" for 1 bit and you've got (17*8==136):1 compression. Continue with that line of thinking... "expand your manhood", "Pass this along to a friend", "Dear beloved in Christ", etc.
Related anecdote: Way back when, as relatively innocent SW listeners, some friends and I thought it would be awesome to listen in on phone calls. They were all over; radiotelephone, on C-band satellite, etc. You just had to figure out where they were. Well, after about an hour of actual listening, we determined without a shadow of a doubt that exposing yourself to other people's phone calls seemed like a sure way to reduce your own intelligence by orders of magnitude. It was outright painful. Of course, now that cellphones are everywhere and exposing yourself to (half of) other people's phone calls is simply a matter of standing in line at Subway, we all know this.
But... compared to phone calls, general email is sure to be worse, judging by my spam folder and the ratio of spam to actual mail.
Of course, if you'd like to go down another step, Twitter has utterly worthless yuck in sets of up to 140 characters waiting to leave nasty little footprints all over your brain, lol...
And then there is the ultimate collection of drivel: Facebook.

--
I've fallen off your lawn, and I can't get up.
Re:You cut off at the good part. by smellotron · 2012-02-21 13:38 · Score: 3, Interesting

He essentially used deduplication instead of real compression

Deduplication is compression.