How Mailinator Compresses Its Email Stream By 90%
An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
Is how I feel after reading the referenced article.
She's Italian? All this time I thought she was a mustachioed Jewish transvestite.
Why do the men I love always lie to me?
The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm
I don't know that I understand entirely or have read it this fast, but basically no. He used algorithms stored into arrays and some database engineering voodoo.
Close, actually. Very close. They split the headers it's true, but he compresses the emails together so that emails that are exactly (or almost exactly) the same ("get viagra now!" or a newsletter) don't have to be stored in different places in memory. Only large emails get LZMA (much better than bzip fyi).
Just delete the emails that are not on the compressing dictionary.
I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.
When information is power, privacy is freedom.
If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.
What the summary does not say was that, email number 502,922 is special cased and is stored in plain text at the head of the compression dictionary. So it will trivially fetch email number 502,922.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
That service is pretty cool. Never realized there was something out there like that.
Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
Actually there is even more to it than that.
Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over!
TFA right there.
Maybe next time you could actually read TFA.
Dilbert RSS feed
One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter.
What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
This is called a solid archive; what the author wants is a non-solid archive.
Seriously.
7zip. LZMA2. Whatever speed/compression setting you want (I always roll 9). Non-solid mode or a solid block size limited to whatever size you want (or whatever number of files you want, or both).
LZMA2 automagically does it's dictionary thing, and the non-solid nature does it per file, or if you limit solid block size it does it per group of n files or per group of files that fit in size x or both. If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
Of course, this sort of shit is already supported by 7zip. Use PPMD or whatever method/filter you think is good for text, then send that shit to LZMA2.
This story is basically "I did something needlessly complex because I didn't RTFM for 7zip".
7z a -t7z emails.7z emailDirectory\ -m0=PPMd:mem=30:o=12 -m1=LZMA2 -mt=4 -mx=9 -ms=1024f256m
Add all files from "emailDirectory\" to "emails.7z" using the PPMd compressor with 1 GB of memory required (for compressing and decompressing) and a model order of 12, then pass it through LZMA2. Try to use 4 cores, use the "ultra" compression options, and make each block contain a maximum of 1024 files and have a maximum of 256 MB.
Actually, the core of his compression scheme seems to be constructing an LZW dictionary but using line patterns instead of bits or letters. The reason it works is because he jumps ahead of all those arrangements that would have probably been made letter by letter a, an, and, and , and h, (etc) which is what makes lzw and it's variants slow.
It was clever, but any Theory of Information course should tell you that choosing your "default" symbols correctly is very important. He did just that (:
Just code "Prince of Nigeria" for 1 bit and you've got (17*8==136):1 compression. Continue with that line of thinking... "expand your manhood", "Pass this along to a friend", "Dear beloved in Christ", etc.
Related anecdote: Way back when, as relatively innocent SW listeners, some friends and I thought it would be awesome to listen in on phone calls. They were all over; radiotelephone, on C-band satellite, etc. You just had to figure out where they were. Well, after about an hour of actual listening, we determined without a shadow of a doubt that exposing yourself to other people's phone calls seemed like a sure way to reduce your own intelligence by orders of magnitude. It was outright painful. Of course, now that cellphones are everywhere and exposing yourself to (half of) other people's phone calls is simply a matter of standing in line at Subway, we all know this.
But... compared to phone calls, general email is sure to be worse, judging by my spam folder and the ratio of spam to actual mail.
Of course, if you'd like to go down another step, Twitter has utterly worthless yuck in sets of up to 140 characters waiting to leave nasty little footprints all over your brain, lol...
And then there is the ultimate collection of drivel: Facebook.
I've fallen off your lawn, and I can't get up.
Did you RTFA?
He said 7zip was too slow/CPU-intensive, and got worse compression with a solid archive (85%) than his custom solution (90%). AFAICT, going non-solid and backing off the compression setting would make it even worser, right?
And W/R/T this:
If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
Which would more-or-less do what he's accomplishing, with two very big differences:
Only large emails get LZMA (much better than bzip fyi).
For text Bzip2 is actually quite good. It is also substantially faster than LZMA, which means he may have been able to hit his 10MB/s mark and compress everything. Further, Bzip2 actually operates in blocks (max 900kb) using up to 6 dictionaries. I'd actually assume pretty much all compression algorithms at least support a mode amenable to streaming, if it's not baked in from the get go. In general, more dictionaries are actually better, if you can get away with the overhead. A super giant dictionary, for general compression, would perform quite poorly. There is a trade-off is between dictionary overhead and the number of dictionaries.
Deduplication is compression.
Wow, I hope you have a good THAC0!
I run a similar (though waaaay less popular) site - http://dudmail.com/
My mail is stored on disk in a mysql db so I don't have quite the same memory constraints as this.
I had originally created this site naively stashing the uncompressed source straight into the db. For the ~100,000 mails I'd typically retain this would take up anywhere from 800mb to slightly over a gig.
At a recent rails camp, I was in need of a mini project so decided that some sort of compression was in order. Not being quite so clever I just used the readily available Zlib library in ruby. This is just run straight over each email as it comes in with no reference to any previous emails, so very "dumb" compared to mailinator, but still fairly effective.
This took about 30 minutes to implement (in about 6 lines of code) - and then a couple of hours to test and debug. An obvious bug (very large emails were causing me to exceed the BLOB size limit and truncating the compressed source) was the main problem there...
I didn't quite reach 90% compression, but my database is now typically around 200-350mb. So about 70-80% compression. I didn't reach 90% compression, but I did manage to implement it in about 6 lines of code =)
If anybody responsible for the site comes this way, thank you for the excellent (and free) service.
Tries and Bloomfilters are wonderful algorithms, because they are simple, if you want something a tad bit more complicated use Locality-Sensitive Hashing to find similar documents from a big set of documents.
Meh. Deduplication is replacing identical byte sequences with a more optimal representation (it's a vocabulary transformation), while deduplication is replacing identical byte sequences with a singular token (it's a storage transformation). While the approaches are very similar, the difference is in the size of the atom and the scope of the data set.
Compression in computing already has a very specific meaning (it's a stream operation), and in general technical people do not like overloading (cue the "copyright infringement is not theft" or "a kilobyte is (not) 1000 bytes" rants). So stop trying to conflate the two terms, they do differ.
I agree with your sentiment wholeheartedly: precision in language and conversation is important.
Compression is an aspect of Information Theory or Entropy. From this perspective, it is the reduction of redundant bits of information in a given corpus (which is usually a stream because that's natural, but I don't know that there is an inherent requirement). All software which compresses data does so by finding and eliminating duplicate chunks of bits. The term "data deduplication" seems to have arisen in the tech industry to refer to compression algorithms on filesystems which use very large chunks of bits (files/blocks) compared to prior tools. But that's really just a marketing label. It's still a reduction in redundant bits of information stored, ergo it is still compression. The "technical people" in this conversation—Information Theorists—will see that deduplication:compression::square:shape, and no overloading has occurred.
In short, don't diss your roots.
Did you RTFA?
He said 7zip was too slow/CPU-intensive, and got worse compression with a solid archive (85%) than his custom solution (90%). AFAICT, going non-solid and backing off the compression setting would make it even worser, right?
And W/R/T this:
If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
Which would more-or-less do what he's accomplishing, with two very big differences:
You can tune the performance however you want, and use whatever filters in whatever order you want.
If you would RTFA for 7Zip, you would realize that filters can have multiple output streams. You can have an "already compressed" stream that skips the LZMA2 compressor, you can have a "requires compression" stream that gets hit by LZMA2 afterward, you can have debug/control streams, whatever the fuck you want.
I simply gave a basic example of how to use 7zip with 2 encoding methods. PPMd is specifically for text and I guarantee you it's doing a much better job that this guy's custom shit. You can control performance, block size, memory constraints, whatever the fuck you want with 7zip. If you want to skip over emails below a given size for performance reasons, throw in a filter that does this. Hell, it can be adaptive and use your CPU / memory usage / estimate of desired time to complete as a parameter. If you want to have an adaptive dictionary, you can do that too. All you need to fucking do is have your filter pass out another stream. There's zero reason for a time-sensitive dictionary, however, because maintaining one is more work than simply adding on to an existing dictionary. You only get a benefit if you're dumb enough to try to rebuild the entire dictionary periodically.
And anyone who blocks them. They know they're a scummy site and REALLY want to be sending you spam.
That or a site that uses a service that blocks over 4,000 disposable e-mail address domains might just want a more persistent identifier to cut down on sockpuppet registrations.