How Mailinator Compresses Its Email Stream By 90%

← Back to Stories (view on slashdot.org)

How Mailinator Compresses Its Email Stream By 90%

Posted by Soulskill on Tuesday February 21, 2012 @10:10AM from the ever-faster-ever-smaller dept.

An anonymous reader writes "Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over! Everyone out! 85% is good enough. Actually — there were two problems with that result. One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter. That works great on static files — but Mailinator is not a static file. Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"

75 comments

Min score:

Reason:

Sort:

Really Dumb by Anonymous Coward · 2012-02-21 10:16 · Score: 5, Funny

Is how I feel after reading the referenced article.
1. Re:Really Dumb by Anonymous Coward · 2012-02-21 10:34 · Score: 0
  
  Yes, I too, am deeply humbled!
2. Re:Really Dumb by JosephTX · 2012-02-21 10:56 · Score: 1, Flamebait
  
  The guy was just saying he doesn't understand, not that he has some sort of aversion to it. The fact that you need to be an ass about it should make everyone question your intelligence as well.
3. Re:Really Dumb by Tyrannosaur · 2012-02-21 11:11 · Score: 3, Insightful
  
  Alright, I apologize. I was in the wrong. It also came off significantly more sarcastic than I meant it too. The point that I tried (and failed) to make was that there really is nothing that should make anyone feel dumb, it's really just a lack of learning that can be fixed. Thank you for calling me out.
4. Re:Really Dumb by beelsebob · 2012-02-21 11:23 · Score: 0
  
  Yep... Compression algorithm designed with knowledge of what it's compressing does better than compression algorithm without... Also, the Pope is Catholic.
5. Re:Really Dumb by shaitand · 2012-02-21 12:34 · Score: 1
  
  I used to think that the difference between the bright ones and the not so bright was really just education. That intelligence was really a measure of curiosity and drive to learn. Attempting to teach people material has cured me of this fantasy.
  For every person who just needs education or needs the material presented to them in another way that clicks for them there are thousands who simply don't and won't get it no matter how it is presented.
  Not that I'm saying the fp is in one group or the other.
6. Re:Really Dumb by Anonymous Coward · 2012-02-21 23:17 · Score: 0
  
  I really doubt the second affirmation, being a Catholic myself.
You cut off at the good part. by Culture20 · 2012-02-21 10:17 · Score: 0, Flamebait

I'm not going to RTFA, so let me guess: they split it up and bzip2 the remaining files? Maybe they split the headers from the rest of the email and compress those separately?
1. Re:You cut off at the good part. by poetmatt · 2012-02-21 10:27 · Score: 2
  
  I don't know that I understand entirely or have read it this fast, but basically no. He used algorithms stored into arrays and some database engineering voodoo.
2. Re:You cut off at the good part. by Tyrannosaur · 2012-02-21 10:27 · Score: 2
  
  Close, actually. Very close. They split the headers it's true, but he compresses the emails together so that emails that are exactly (or almost exactly) the same ("get viagra now!" or a newsletter) don't have to be stored in different places in memory. Only large emails get LZMA (much better than bzip fyi).
3. Re:You cut off at the good part. by Anonymous Coward · 2012-02-21 10:35 · Score: 0
  
  Duplicate or similar emails are dropped to save space.
  Could they not have included that in the /. summary?
4. Re:You cut off at the good part. by Nemesisghost · 2012-02-21 10:58 · Score: 4, Informative
  Actually there is even more to it than that.
  
  1. He broke each email up by lines(or multiple small lines, until the smallest unit was greater than 64 bytes).
  2. He compared each line to the set of line from previous emails, building a dictionary of lines that each has in common.
  3. Finally, if a "line" in the email is large enough he will LZMA compress it against a sliding dictionary of only the most recent emails
5. Re:You cut off at the good part. by errandum · 2012-02-21 12:06 · Score: 2
  
  Actually, the core of his compression scheme seems to be constructing an LZW dictionary but using line patterns instead of bits or letters. The reason it works is because he jumps ahead of all those arrangements that would have probably been made letter by letter a, an, and, and , and h, (etc) which is what makes lzw and it's variants slow.
  It was clever, but any Theory of Information course should tell you that choosing your "default" symbols correctly is very important. He did just that (:
6. Re:You cut off at the good part. by GNUALMAFUERTE · 2012-02-21 12:18 · Score: 0
  
  Close enough. He essentially used deduplication instead of real compression (well, he also used some compression in the end). He essentially split messages into headers/body, then used line deduplication. Then compressed any messages above a certain threshold.
  It's an interesting article, you should read it.
  
  --
  WTF am I doing replying to an AC at 5 A.M on a Friday night?
7. Re:You cut off at the good part. by tixxit · 2012-02-21 13:28 · Score: 1
  
  Only large emails get LZMA (much better than bzip fyi).
  For text Bzip2 is actually quite good. It is also substantially faster than LZMA, which means he may have been able to hit his 10MB/s mark and compress everything. Further, Bzip2 actually operates in blocks (max 900kb) using up to 6 dictionaries. I'd actually assume pretty much all compression algorithms at least support a mode amenable to streaming, if it's not baked in from the get go. In general, more dictionaries are actually better, if you can get away with the overhead. A super giant dictionary, for general compression, would perform quite poorly. There is a trade-off is between dictionary overhead and the number of dictionaries.
8. Re:You cut off at the good part. by smellotron · 2012-02-21 13:38 · Score: 3, Interesting
  
  He essentially used deduplication instead of real compression
  
  Deduplication is compression.
9. Re:You cut off at the good part. by Anonymous Coward · 2012-02-21 23:07 · Score: 1
  
  Meh. Deduplication is replacing identical byte sequences with a more optimal representation (it's a vocabulary transformation), while deduplication is replacing identical byte sequences with a singular token (it's a storage transformation). While the approaches are very similar, the difference is in the size of the atom and the scope of the data set.
  Compression in computing already has a very specific meaning (it's a stream operation), and in general technical people do not like overloading (cue the "copyright infringement is not theft" or "a kilobyte is (not) 1000 bytes" rants). So stop trying to conflate the two terms, they do differ.
10. Re:You cut off at the good part. by smellotron · 2012-02-22 00:27 · Score: 1
  
  in general technical people do not like overloading
  
  I agree with your sentiment wholeheartedly: precision in language and conversation is important.
  
  Compression in computing already has a very specific meaning (it's a stream operation)
  
  Compression is an aspect of Information Theory or Entropy. From this perspective, it is the reduction of redundant bits of information in a given corpus (which is usually a stream because that's natural, but I don't know that there is an inherent requirement). All software which compresses data does so by finding and eliminating duplicate chunks of bits. The term "data deduplication" seems to have arisen in the tech industry to refer to compression algorithms on filesystems which use very large chunks of bits (files/blocks) compared to prior tools. But that's really just a marketing label. It's still a reduction in redundant bits of information stored, ergo it is still compression. The "technical people" in this conversation—Information Theorists—will see that deduplication:compression::square:shape, and no overloading has occurred.
  In short, don't diss your roots.
11. Re:You cut off at the good part. by Anonymous Coward · 2012-02-22 01:52 · Score: 0
  
  Exactly. RLE is one of the most basic forms of data compression. Its generally one of the first to be taught to students. Deduplication is in fact compression. Its one of the oldest methods and frequently one of the easiest to implement.
Re:Beware by Ethanol-fueled · 2012-02-21 10:20 · Score: 3, Funny

She's Italian? All this time I thought she was a mustachioed Jewish transvestite.

Why do the men I love always lie to me?
Because nobody RTFA by Tyrannosaur · 2012-02-21 10:25 · Score: 5, Informative

The end result is that he made his own compression-for-emails, where it scans strings in every email and stores the same strings in memory, with the emails storing only pointers to the strings.
For large emails (he says >20k as an estimate), he applies LZMA on top of that, with a sliding dictionary based on the emails from the last few hours or so.
All in all a very good read for someone (like me) who has an interest in data compression but knows little about it yet. I like to read other people's thought processes.
Reminds me of another good read I found in someone's ./ comment about "compressing" random data: http://www.patrickcraig.co.uk/other/compression.htm
1. Re:Because nobody RTFA by MagicM · 2012-02-21 10:37 · Score: 4, Informative
  
  If you enjoyed reading that, you might also enjoy reading this and the follow-up about efficiently storing a dictionary of words and dealing with memory v.s processing trade-offs.
2. Re:Because nobody RTFA by Tyrannosaur · 2012-02-21 10:56 · Score: 1
  
  I did very much enjoy reading that, thank you. The genius of many people trying to solve their own specific problems never ceases to amaze me. (I was also regrettably unaware of what a "trie" was until now- learn something new every day)
3. Re:Because nobody RTFA by LordLimecat · 2012-02-21 11:02 · Score: 1
  
  Patrick appears to be in the wrong in that article, incidentally. He acknowledges the fact that filesystems use more space for multiple files than for a single file, but doesnt seem to understand that the only way his "compressor" could have worked is by using a delimiter or marker of some kind to indicate where data was stripped, and that said delimiter must take up some amount of space.
  His rebuttal is basically "its not my fault that modern filesystems have to use non-zero space to store information".
  It is semi-clever, but its sneaky of him to try to act like the injured party when he is so wrong that even a compression non-guru like myself can notice his error.
4. Re:Because nobody RTFA by Tyrannosaur · 2012-02-21 11:08 · Score: 1
  
  I don't think he was ever trying to get anyone to believe he's not wrong, he seems to just be having a bit of fun by making a work-around that *almost* looks like it works
5. Re:Because nobody RTFA by mr_stinky_britches · 2012-02-21 11:22 · Score: 1
  
  [ TP neglected to mention the ejohn articles are covering compression using Javascript/Node.js ]
  
  --
  Censorship is obscene. Patriotism is bigotry. Faith is a vice. Slashdot 2.0 sucks.
6. Re:Because nobody RTFA by LordLimecat · 2012-02-21 11:32 · Score: 1
  
  He would have done a much better job checking if the original.dat could be found to have a square, cube, etc number in the first N characters. I mean, if the first 520 hex characters comprise a hex number that you can take the cube (or higher root) of, you would be able to use that root as a magic number, and the operation to "exponentize" it again would contain the "hidden information". With a large enough number and a large enough root, the difference between the two might be large enough to net you some space savings.
  One might argue, however, that the "hidden information" was in the processor instructions to perform the mathematical operation, but I bet that it would win the challenge :) And the beauty of it is that if the first N characters isnt "root-able", you can always check N+1
  NB: what is the verb form of "to raise N to the X power"? Exponentiate? What about "to take the N-th root of"?
7. Re:Because nobody RTFA by LordLimecat · 2012-02-21 11:50 · Score: 1
  
  Im tempted to try to write a compressor based on this now to try to win that challenge:
  
  for N=200 to bytesInFile Do (
  if (
  IsInt( cubeRoot(readBytes(N))
  ) Then (
  Output("Magic number is " & cubeRoot(readBytes(N))
  TruncateBytesFromFile(N)
  )
  )
  Decompressor would be astonishingly small, just an append cube(MagicNumber) to the original file.
  In fact, as I think about this, given enough CPU time and a large enough file (lets say 50MB), there is almost no file that you could not compress-- set the minimum length hex number to check, and look for roots that yield integers starting with X^0.33, and counting down to X^0.01. Eventually you would find a root that would work, and with the size of the numbers you would be working with the space savings would be incredible. You could even write a general decompressor, and make the first 20 bytes of the file what the magic number was, and what the exponent was.
  A quick check (cubing 0x9999 9999 9999) reveals that you could drop from 47 bytes to 12 bytes if your first "hit" was at byte 47. Imagine if your first hit was at 200 bytes :)
  Can anyone comment on how this is working, and where the information is being hidden in this scheme?
8. Re:Because nobody RTFA by Anonymous Coward · 2012-02-21 11:59 · Score: 0
  
  He makes it quite clear that he knows it will, even must, take up more disk space, and why -- this wasn't about creating a compression algorithm, it was about exploiting poorly-written rules.
  It's like arbitraging bastards -- nobody really likes 'em, but the world is less inconsistent because of them.
9. Re:Because nobody RTFA by Onymous+Coward · 2012-02-21 12:01 · Score: 1
  
  Part of the problem is that Mike Goldman makes as if to outline precise technical constraints on the problem (data file of such size, you tell me this, I send you that, you send me those, they output so-and-so) but includes without being explicit the spirit of the bet. The challenge is about compression, yes, but if you start to give precise constraints on how the bet can be won, you start to imply that any activity within the constraints is fair game.
  The confusion here is about the nature of human communication and phrasing wagers, not compression.
  Mike Goldman ought to acknowledge that the spirit is meant to be a constraint in the challenge. He should acknowledge that he was not explicit regarding this in his specification of constraints, and he should update the challenge to reflect it.
  Patrick Craig needs to understand that achieving dataset size reduction via compression is the actual objective and a valid constraint, even if unstated. And, yes, also that what he did was not compress the data, he just transformed the characters elided from the original file into something else stored outside the files' contents, into the fact that the input files to his "decompressor" are distinct entities.
  Meanwhile, I'm writing a decompressor that will use its own name as the source for the "decompressed" result. Granted, it'll be a large name, but the upshot is that it should be a tiny program and the "compressed" file it processes can be 0 bytes long. A substantial savings.
10. Re:Because nobody RTFA by Anonymous Coward · 2012-02-21 17:45 · Score: 0
  
  Algorithms = money saving in the programming world usually. It's a shame it's such a lost art. Going through college, they were barely mentioned, workplace, barely. Nobody thinks algorithmically anymore. thus the overload of shitty code being produced.
11. Re:Because nobody RTFA by Zironic · 2012-02-21 19:04 · Score: 1
  
  The problem is that your odds of finding a decent sized cube in a random number are pretty abysmal.
fartbox by Anonymous Coward · 2012-02-21 10:27 · Score: 0

I wonder if you could do anything interesting with fuzzy hashing and diff.
Hey guys by Anonymous Coward · 2012-02-21 10:31 · Score: 0

Let's post the mostly-irrelevant part of the article as the summary and leave out anything that really matters
Author confused? by Anonymous Coward · 2012-02-21 10:32 · Score: 0

One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter.
What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
This is called a solid archive; what the author wants is a non-solid archive.
1. Re:Author confused? by Anonymous Coward · 2012-02-21 10:40 · Score: 0
  
  What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
  That's what he described, he just called it a "fixed dataset". So he used the wrong term for the right thing.
2. Re:Author confused? by sexconker · 2012-02-21 12:05 · Score: 1, Interesting
  
  One was that, LZMA, like many compression algorithms build their dictionary based on a fixed dataset. As it compresses it builds a dictionary of common sequences and improves and uses that dictionary to compress everything thereafter.
  What?! LZMA keeps a dictionary of recent data, not a "fixed dataset".
  Its a big, honking, several gigabyte cache of ever changing email. If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.'"
  This is called a solid archive; what the author wants is a non-solid archive.
  Seriously.
  7zip. LZMA2. Whatever speed/compression setting you want (I always roll 9). Non-solid mode or a solid block size limited to whatever size you want (or whatever number of files you want, or both).
  LZMA2 automagically does it's dictionary thing, and the non-solid nature does it per file, or if you limit solid block size it does it per group of n files or per group of files that fit in size x or both. If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
  Of course, this sort of shit is already supported by 7zip. Use PPMD or whatever method/filter you think is good for text, then send that shit to LZMA2.
  This story is basically "I did something needlessly complex because I didn't RTFM for 7zip".
  7z a -t7z emails.7z emailDirectory\ -m0=PPMd:mem=30:o=12 -m1=LZMA2 -mt=4 -mx=9 -ms=1024f256m
  Add all files from "emailDirectory\" to "emails.7z" using the PPMd compressor with 1 GB of memory required (for compressing and decompressing) and a model order of 12, then pass it through LZMA2. Try to use 4 cores, use the "ultra" compression options, and make each block contain a maximum of 1024 files and have a maximum of 256 MB.
3. Re:Author confused? by Anonymous Coward · 2012-02-21 12:52 · Score: 1
  Did you RTFA?
  He said 7zip was too slow/CPU-intensive, and got worse compression with a solid archive (85%) than his custom solution (90%). AFAICT, going non-solid and backing off the compression setting would make it even worser, right?
  And W/R/T this:
  
  If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
  Which would more-or-less do what he's accomplishing, with two very big differences:
  
  Your way, every file gets compressed with LZMA2 -- forcing you to back off compression all the time to keep peak CPU usage acceptable. His custom solution uses an LZMA2 pass over selected (largest) emails, and just skips over some when it's backlogged -- the overall compression level is thus adaptive to load.
  Even more significantly, your approach completely ignores that mailinator is a giant ring buffer -- as new mails come in, old ones are deleted. Since you generate a dictionary for all the mails in the directory, what happens if a burst of "v1agr4" spam was in the system then, but next a new burst of "c1ali5" spam comes in, and the "v1agr4" gets dropped? Your dictionary is now mismatched to the new data, and a new one must be generated, forcing you to reprocess all the email that's still retained. His LRU cache handles this automatically.
4. Re:Author confused? by smellotron · 2012-02-21 13:50 · Score: 1
  
  (I always roll 9)
  
  Wow, I hope you have a good THAC0!
5. Re:Author confused? by sexconker · 2012-02-22 05:31 · Score: 1
  Did you RTFA?
  He said 7zip was too slow/CPU-intensive, and got worse compression with a solid archive (85%) than his custom solution (90%). AFAICT, going non-solid and backing off the compression setting would make it even worser, right?
  And W/R/T this:
  
  If you have a lot of duplication across files so far apart that they won't share a dictionary under LZMA2, you can get some improvement by first creating a master dictionary (across all files, ignore non-solid mode or solid block limits) for those duplicated chunks and then writing down all the pointer locations for them, then sending the rest of the data to LZMA2 to be compressed.
  Which would more-or-less do what he's accomplishing, with two very big differences:
  
  Your way, every file gets compressed with LZMA2 -- forcing you to back off compression all the time to keep peak CPU usage acceptable. His custom solution uses an LZMA2 pass over selected (largest) emails, and just skips over some when it's backlogged -- the overall compression level is thus adaptive to load.
  Even more significantly, your approach completely ignores that mailinator is a giant ring buffer -- as new mails come in, old ones are deleted. Since you generate a dictionary for all the mails in the directory, what happens if a burst of "v1agr4" spam was in the system then, but next a new burst of "c1ali5" spam comes in, and the "v1agr4" gets dropped? Your dictionary is now mismatched to the new data, and a new one must be generated, forcing you to reprocess all the email that's still retained. His LRU cache handles this automatically.
  You can tune the performance however you want, and use whatever filters in whatever order you want.
  If you would RTFA for 7Zip, you would realize that filters can have multiple output streams. You can have an "already compressed" stream that skips the LZMA2 compressor, you can have a "requires compression" stream that gets hit by LZMA2 afterward, you can have debug/control streams, whatever the fuck you want.
  I simply gave a basic example of how to use 7zip with 2 encoding methods. PPMd is specifically for text and I guarantee you it's doing a much better job that this guy's custom shit. You can control performance, block size, memory constraints, whatever the fuck you want with 7zip. If you want to skip over emails below a given size for performance reasons, throw in a filter that does this. Hell, it can be adaptive and use your CPU / memory usage / estimate of desired time to complete as a parameter. If you want to have an adaptive dictionary, you can do that too. All you need to fucking do is have your filter pass out another stream. There's zero reason for a time-sensitive dictionary, however, because maintaining one is more work than simply adding on to an existing dictionary. You only get a benefit if you're dumb enough to try to rebuild the entire dictionary periodically.
LZO compression and friends by Anonymous Coward · 2012-02-21 10:33 · Score: 0

What about LZO compression and friends? They are used for indexable large files in Hadoop HDFS. They could come in handy here too. No mention of that?
Pfff easy by M0j0_j0j0 · 2012-02-21 10:36 · Score: 1

Just delete the emails that are not on the compressing dictionary.
Mailinator Rocks by Jah-Wren+Ryel · 2012-02-21 10:40 · Score: 3, Informative

I use mailinator all the time, it is fantasticly useful. Sometimes I encounter a website that won't accept mailinator addresses, some even go to the effort of tracking the alternate domains he uses and blocking them too. I find mailinator so useful that when a website refuses mailionator addresses, I just won't use that website.
The Mailinator Man's blog is also pretty good, the guy is articulate and has a knack for talking about interesting architectural stuff. This latest entry is just another in a great series, if you like this sort of stuff and haven't read his previous entries you should take the time to read through them.

--
When information is power, privacy is freedom.
1. Re:Mailinator Rocks by Anonymous Coward · 2012-02-21 10:51 · Score: 0
  
  Second that.
  Love mailinator. No i don't want to recieve emails forever from your site i will only be using for the next 8 minutes to look at one specific thing once.
  Fuck off and here's a throwaway email address!
  And anyone who blocks them. They know they're a scummy site and REALLY want to be sending you spam. That's a place you don't want to use anyway. Any information you wanted is likely bullshit made up to get your email signup.
2. Re:Mailinator Rocks by Anonymous Coward · 2012-02-21 11:17 · Score: 0
  
  Mailinator clones whose domains are blocked less often are spambog and mailcatch. I'm sure there are more.
3. Re:Mailinator Rocks by Anonymous Coward · 2012-02-21 11:44 · Score: 0
  
  Can't you just point any domain you own (or subdomain) to mailinator? I thought I read that he said that sometime in the past.
  Kinda like how you can run Gmail on your own domains as well. But free.
4. Re:Mailinator Rocks by KhabaLox · 2012-02-21 11:54 · Score: 1
  
  They have a pretty awesome FAQ too.
  
  --
  Ceci n'est pas un sig.
He has a simple solution for that problem. by 140Mandak262Jamuna · 2012-02-21 10:41 · Score: 1

If I compressed a million emails, and then some user wanted to read email #502,922 — I'd have to "seek" through the preceding half-million or so to build the dictionary in order to decompress it. That's probably not feasible.
What the summary does not say was that, email number 502,922 is special cased and is stored in plain text at the head of the compression dictionary. So it will trivially fetch email number 502,922.

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Well, now I know what a mailinator is. by PerfectionLost · 2012-02-21 10:50 · Score: 1

That service is pretty cool. Never realized there was something out there like that.
It's not the algorithm, it's the data by Hentes · 2012-02-21 10:52 · Score: 4, Informative

Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
1. Re:It's not the algorithm, it's the data by Anonymous Coward · 2012-02-21 11:26 · Score: 0
  
  Which makes for an interesting case study of applying an algorithm to solve a problem in specific conditions. Also, you'd be surprised how little the registration e-mail means in comparison to the follow-up. Just one word: Spam. (although that can be compressed with surprising efficiency just by putting strings "penis", "pills" and "enlarge" in the dictionary)
2. Re:It's not the algorithm, it's the data by Lehk228 · 2012-02-21 11:29 · Score: 2
  
  how unique is most email? registration emails, newsletters, h3rb4l v14g|2a, p3n15 3nl4|2g3m3n7, buy stock ____, etc
  
  --
  Snowden and Manning are heroes.
3. Re:It's not the algorithm, it's the data by firewrought · 2012-02-21 11:41 · Score: 3, Informative
  
  Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
  The accomplishment here is that he determined a very tactical set of strategies for solving a real world problem of large scale. No, it didn't take a math PhD with some deep understanding of Fourier analysis to invent this algorithm, but it most certainly took a software developer who was knowledgeable, creative, and passionate for his task. So yeah... it's not the 90% compression that's impressive, it's the real-time performance that's cool.
  
  --
  -1, Too Many Layers Of Abstraction
4. Re:It's not the algorithm, it's the data by Tablizer · 2012-02-21 11:42 · Score: 1
  
  A common email phrase dictionary based on typical spam....I mean content, would resemble:
  * Viagra
  * prince
  * you won!
  * sexy
  * Nigerian
  * Please send your check to
  * Free free free!
  * $9.95
  * Survive Armageddon
  Just use compressed tokens for those and most spam, I mean emails, would be just a few bytes.
  
  --
  Table-ized A.I.
5. Re:It's not the algorithm, it's the data by isorox · 2012-02-21 15:05 · Score: 1
  
  Mailinator can achieve high compression rates because most people use it for registration emails. Those mails differ from each other in only a few words, making the data set highly redundant, and easily compressible.
  Which reminds me, Facebook is backed up onto a single LTO-5
6. Re:It's not the algorithm, it's the data by Anonymous Coward · 2012-02-21 17:51 · Score: 0
  
  So you're saying this will be directly applicable to corporate email systems, where the typical email is an exact duplicate of a dozen previous emails concatenated together with a little bit of text at the top? Sounds perfect!
7. Re:It's not the algorithm, it's the data by Anonymous Coward · 2012-02-21 20:49 · Score: 0
  
  Thank you, Captain Obvious!
This is how much i've read by Hsien-Ko · 2012-02-21 10:59 · Score: 1, Funny

Paul Tyma, creator of Mailinator, writes about a greedy algorithm to analyze the huge amount of email Mailinator receives and finds ways to reduce its memory footprint by 90%. Quoting: 'I grabbed a few hundred megs of the Mailinator stream and ran it through several compressors. Mostly just stuff I had on hand 7z, bzip, gzip, etc. Venerable zip reduced the file by 63%. Not bad. Then I tried the LZMA/2 algorithm (7z) which got it down by 85%! Well. OK! Article is over!

TFA right there.
Re:News? by icebraining · 2012-02-21 11:30 · Score: 1

Maybe next time you could actually read TFA.

--
Dilbert RSS feed
Re:Beware by Anonymous Coward · 2012-02-21 11:35 · Score: 0

Y'know, if anyone on /. had a girlfriend, I'm not sure why they would mind her checking out their butt? Perhaps you should rethink this troll.
Easy compression: by fyngyrz · 2012-02-21 12:11 · Score: 2

Just code "Prince of Nigeria" for 1 bit and you've got (17*8==136):1 compression. Continue with that line of thinking... "expand your manhood", "Pass this along to a friend", "Dear beloved in Christ", etc.
Related anecdote: Way back when, as relatively innocent SW listeners, some friends and I thought it would be awesome to listen in on phone calls. They were all over; radiotelephone, on C-band satellite, etc. You just had to figure out where they were. Well, after about an hour of actual listening, we determined without a shadow of a doubt that exposing yourself to other people's phone calls seemed like a sure way to reduce your own intelligence by orders of magnitude. It was outright painful. Of course, now that cellphones are everywhere and exposing yourself to (half of) other people's phone calls is simply a matter of standing in line at Subway, we all know this.
But... compared to phone calls, general email is sure to be worse, judging by my spam folder and the ratio of spam to actual mail.
Of course, if you'd like to go down another step, Twitter has utterly worthless yuck in sets of up to 140 characters waiting to leave nasty little footprints all over your brain, lol...
And then there is the ultimate collection of drivel: Facebook.

--
I've fallen off your lawn, and I can't get up.
A similar result (with much less effort...) by smylie · 2012-02-21 14:09 · Score: 1

I run a similar (though waaaay less popular) site - http://dudmail.com/
My mail is stored on disk in a mysql db so I don't have quite the same memory constraints as this.
I had originally created this site naively stashing the uncompressed source straight into the db. For the ~100,000 mails I'd typically retain this would take up anywhere from 800mb to slightly over a gig.
At a recent rails camp, I was in need of a mini project so decided that some sort of compression was in order. Not being quite so clever I just used the readily available Zlib library in ruby. This is just run straight over each email as it comes in with no reference to any previous emails, so very "dumb" compared to mailinator, but still fairly effective.
This took about 30 minutes to implement (in about 6 lines of code) - and then a couple of hours to test and debug. An obvious bug (very large emails were causing me to exceed the BLOB size limit and truncating the compressed source) was the main problem there...
I didn't quite reach 90% compression, but my database is now typically around 200-350mb. So about 70-80% compression. I didn't reach 90% compression, but I did manage to implement it in about 6 lines of code =)
1. Re:A similar result (with much less effort...) by Anonymous Coward · 2012-02-22 05:21 · Score: 0
  
  But using Blobs for storing email isn't usually a good idea in a db. More easier to save it as a separate file referenced by the record on the db.
2. Re:A similar result (with much less effort...) by Barbara,+not+Barbie · 2012-02-22 07:04 · Score: 1
  
  But using Blobs for storing email isn't usually a good idea in a db. More easier to save it as a separate file referenced by the record on the db.
  
  Not really - each file uses up an inode, so there's 4k gone per file. A better solution would be to store an arbitrary number of emails in each file, compressed and then concatenated, and just store the filename:offset:length of each one in the db. Each individual email is quickly recovered, way fewer inodes used.
  
  --
  Let's call it what it is, Anti-Social Media.
3. Re:A similar result (with much less effort...) by smylie · 2012-02-22 08:15 · Score: 1
  
  yeah, this was going to be my original approach. (I had a previous project where I had stored images in a db, which showed the limitations of this approach).
  However, I ended up chucking them in the database for simplicity. I'm able to just move database dumps from production to dev and that's a complete snapshot of the application - no need to worry about also having to sync an emails directory. It also means I don't have to worry about error handling for when an email body is not found (if the db record is there, the email body is there), or making sure emails get purged off disk correctly.
  As I'm only ever retrieving one email at a time, performance has not been an issue, and as I'm only keeping a gig or so of mail, disk space hasn't really been a concern either.
  Obviously if your dealing with many more emails than I need to, then this may need to be revisited, but in this case, I'm happy with my quick and easy implementation=)
I love Mailinator by XahXhaX · 2012-02-21 17:50 · Score: 1

If anybody responsible for the site comes this way, thank you for the excellent (and free) service.
Simple and Good by emj · 2012-02-21 20:52 · Score: 1

Tries and Bloomfilters are wonderful algorithms, because they are simple, if you want something a tad bit more complicated use Locality-Sensitive Hashing to find similar documents from a big set of documents.
Re:News? by Anonymous Coward · 2012-02-21 23:11 · Score: 0

Read the article and don't try to be so clever next time.
Re:News? by Pieroxy · 2012-02-22 00:39 · Score: 0

Ok, I read the article and it was interesting. But the summary is IMO appalling. That said, it is Slashdot and I should have known better...
Thanks for the tip.

--
Write boring code, not shiny code!
compress +90% From From by Anonymous Coward · 2012-02-24 00:51 · Score: 0

Simply remove all the '>'+space that are mysteriously inserted at the beginning of content lines that start with "From " ... a silly legacy of Berkeley mailbox format From the days when only Real Programmers would use consistent escaping/de-escaping or short-term guaranteed unique hash delimiters to parse mailboxes, and Real Programmers were in Short Supply.
It's only a tiny part of the stream, but we could lose a little on every transaction but make up for it in volume. Until it adds up to 90%
Tyma Evil, Inc. by Anonymous Coward · 2012-02-27 03:58 · Score: 0

Clearly, "Paul Tyma" is an alias for Hans Doofenshmirtz.
To cut down on sockpuppetry by tepples · 2012-03-01 05:56 · Score: 1

And anyone who blocks them. They know they're a scummy site and REALLY want to be sending you spam.
That or a site that uses a service that blocks over 4,000 disposable e-mail address domains might just want a more persistent identifier to cut down on sockpuppet registrations.