Using gzip As A Spam Filter

Posted by timothy on Monday January 27, 2003 @01:15AM from the showing-some-adaptability dept.

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
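As a rough illustration of the quoted claim, here is a minimal Python sketch that uses the standard `zlib` module (the same DEFLATE algorithm gzip uses) to compare how well a repetitive text and a varied one compress. The sample strings and the `compression_ratio` helper are made up for this example and do not come from the Kuro5hin article.

```python
import random
import string
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more repetition."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical samples: one highly repetitive text, one with little repetition.
repetitive = "Buy now! Limited offer! " * 40
rng = random.Random(0)  # fixed seed so the run is repeatable
varied = "".join(rng.choices(string.ascii_lowercase + " ", k=len(repetitive)))

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The repetitive text should come out with a much smaller ratio, which is the signal the article proposes to exploit.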

14 of 268 comments

Re:Text of the full article by Anonymous Coward · 2003-01-27 01:24 · Score: 5, Insightful

> The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.

The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.
Same old problem... by artemis67 · 2003-01-27 01:34 · Score: 5, Insightful

Filtering is not a true spam solution. All it takes is one false positive, a Really Important Email that gets accidentally deleted, to totally destroy the value of any filtering system.

Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan every day that the chance of my missing a legitimate one is very high.
Re:HTML by ^BR · 2003-01-27 01:43 · Score: 1, Insightful

You're a moron that didn't read the article.
The idea is to have a corpus of spam and a corpus of ham, to append the new message to each in turn, and to classify the message according to which corpus it compresses better with.
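The corpus approach described in the comment above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two corpora are tiny invented strings, `extra_bytes` and `classify` are hypothetical helper names, and `zlib` stands in for gzip. A real filter would need far larger corpora, and zlib's 32 KB window limits how much of a big corpus can actually be referenced.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Bytes the message adds when it is compressed together with the corpus."""
    base = corpus.encode("utf-8")
    combined = base + b"\n" + message.encode("utf-8")
    return compressed_size(combined) - compressed_size(base)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the corpus alongside which the message costs fewer extra bytes."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Invented toy corpora, for illustration only.
SPAM = ("Buy cheap pills now. Lose weight fast with our miracle pills. "
        "Click here for a free cable descrambler. "
        "Limited offer, act now and save money today.")
HAM = ("The meeting is moved to Thursday afternoon. "
       "Please review the attached patch before the release. "
       "I fixed the compiler warning in the parser module yesterday.")

print(classify("act now and save money today on cheap pills", SPAM, HAM))
print(classify("Please review the attached patch before the meeting", SPAM, HAM))
```

Measuring the extra bytes rather than the total compressed size keeps the comparison fair when the two corpora differ in length.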
Compression algorithms as filters... by Jugalator · 2003-01-27 01:46 · Score: 4, Insightful

.. sounds like a poor idea to me. Yes, you can measure the amount of redundancy in a message, but:

a) Spammers might not always use messages redundant enough to be detectable from regular text.

b) If I happened to use some words a little too often, especially when writing mails discussing technical stuff or posting computer code fragments, would that be classified as spam?

I think this is a nice filter when sorting out more or less repetitive mails (spam or not) from novels, but a filter based on a spam database sounds better to me.

--
Beware: In C++, your friends can see your privates!
Re:How to stop spam.... by Jugalator · 2003-01-27 01:49 · Score: 2, Insightful

Still, you use hotmail (aka "spammer's heaven") here on Slashdot. But thanks for the tip, perhaps we should start trying it out? :-)

--
Beware: In C++, your friends can see your privates!
Stopping Spam by Inflatable+Hippo · 2003-01-27 01:55 · Score: 4, Insightful

> stupid filtering isnt gonna get you rid of spam... go complain at spammers upstream providers...

Filters only work to a limited extent, and so might shutting down the spammers, if it were possible.

But neither is going to solve this problem.

The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.

The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.

However, if "genuine" mail is certified and mailers use certification validity as a filtering criterion, then it simplifies the game hugely.

Your mailer can spot the people you've genuinely given your address to, and naturally "distrust" uncertified (effectively anonymous) mail or mail whose certificate has been revoked or is unknown to you.

The "only" things standing in the way of this are:

1. Slow adoption of certification/encryption in mass market mailers. Usually poor or missing.
2. Cost/difficulty of getting a valid certificate (e.g. with Verisign).
3. The pain of typing a password every time you send a mail.
4. It only works if everyone joins in.

But nothing's for free and this strikes at the heart of email's usability.

I'm continually surprised by the lack of certification use at least by large corporations and governments, but I suppose it removes plausible deniability :-)
1. Re:Stopping Spam by iamchris · 2003-01-27 02:45 · Score: 2, Insightful
  
  Think about this: Why do I get 1000's of spam emails per month and I get 10's of pieces of junk snail mail/month? Simple: It costs nearly nothing to send millions of spam messages, while it costs a bundle to send junk snail mail.
  
  A simple solution would be to find a way to charge per email...
  
  Now, I certainly wouldn't pay per email. But, I shouldn't complain when someone abuses a messaging system that allows millions of messages to be sent out for nearly no cost. I use that system too, on a much smaller scale, for personal and legitimate business use.
  
  All I can do is ignore as much of the mail as I can, and BOYCOTT anything that is sold via spam.
  
  Ag.
Re:Spam Conference talk by Matts · 2003-01-27 02:56 · Score: 2, Insightful

Actually it's the other way around. DNSBLs (not RBLs - that's a specific term for MAPS' list) are fine for personal users, and even for some businesses, but generally they have way too high a false positive rate for any kind of generic filtering. The SpamAssassin team has done lots of research into this, see for example the slide at the very end of my talk.

No, for a large scale service you need much lower rates of false positives than any of the DNSBLs provide right now. They're fine as inputs into heuristic or statistical systems, but on their own they are just not accurate enough.

--

Matt. Want XML + Apache + Stylesheets? Get AxKit.
Re:I can't figure this out... by Motherfucking+Shit · 2003-01-27 03:01 · Score: 4, Insightful

> If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic.

This would make sense if the only people implementing spam filters were end users. Unfortunately, the logic breaks down when you consider that some ISPs do the filtering on behalf of their customers. It breaks down further when you factor in the number of situations in which a) the customer might not even know that the filtering is happening, or b) the customer blindly trusts the ISP's filtering system.

Take Yahoo, for example. They're a popular webmail service and they also do spam filtering to some extent on inbound email. I would say that, in general, people who use Yahoo mail are not necessarily the type of people who "know better" than to buy spamvertised products. That's not a slam on Yahoo, nor on the people who use Yahoo mail, it's just the way the demographics work out. The ratio of ripe targets to clued-in antispammers is simply better at Yahoo than it is on other domains.

To that end, Yahoo's spam filters aren't helping the spammers any. A spammer's goal is to get his ad in front of as many potential targets as possible, and Yahoo is full of potential targets. But if Yahoo's filters catch the spammer's message and route it straight to everyone's Bulk Mail folder, there's (thousands|millions) of "targets" who will never see the message.

So no, I can't agree that filtering helps the spammers any, at least not the big spammers who are after volume. There's probably a bit of "collateral assistance" in that people who would report the spam may never see it, but I'd say that benefit is cancelled out by the number of possible targets lost to filters.

--
"BSD: Free as in speech. Linux: Free as in beer. Windows 10: Free as in herpes." --Man On Pink Corner in #52607549.
Correction by misof · 2003-01-27 03:09 · Score: 2, Insightful

> The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

There is a minor problem with this sentence. And with this whole gzip business. It is misleading. Words, phrases? You cannot force gzip to match words; gzip tries to exploit every similarity it finds, even at the character level. E.g., if your "spam dictionary" contains the words sex and pants, mail about sextants will have a good compression ratio. And there is no way to prevent this. That's why the Bayesian filters (operating on words) outperform gzip by a league. That's one of several reasons why I think this article belongs not on /. but in a wastebin instead. It simply presents a worse approach to doing something. Interesting idea, yes, but that's all.

(Just FYI: it has been proved that the Burrows-Wheeler algorithm used by bzip2 exploits all such repetitions in the input file nearly optimally, within some small ratio. Hence, it is even worse to use it as a spam filter :-)
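The sex/pants/sextants point can be made concrete with a small Python sketch. This is not gzip's actual implementation: `lz_matches` is a hypothetical, simplified greedy matcher in the spirit of LZ77, using a minimum match length of 3 as DEFLATE does. It only shows that substring matching ignores word boundaries, whereas a word-level comparison finds nothing in common.

```python
def lz_matches(dictionary: str, message: str, min_len: int = 3) -> list[str]:
    """Greedy scan: at each position, take the longest substring of
    `message` that also occurs anywhere in `dictionary`."""
    matches = []
    i = 0
    while i < len(message):
        length = 0
        # Grow the candidate while the longer substring still occurs.
        while (i + length < len(message)
               and message[i:i + length + 1] in dictionary):
            length += 1
        if length >= min_len:
            matches.append(message[i:i + length])
            i += length
        else:
            i += 1  # too short to be worth a back-reference; emit a literal
    return matches


spam_dictionary = "sex pants"
message = "sextants"

# Character-level matching finds fragments that cross word boundaries.
print(lz_matches(spam_dictionary, message))

# Word-level matching, as a word-based filter would do, finds no overlap.
print(set(message.split()) & set(spam_dictionary.split()))
```

Here the matcher reuses "sex" and "ants" from the dictionary even though the message shares no whole word with it.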
Re:Sequitur Most Likely Superior by A55M0NKEY · 2003-01-27 03:12 · Score: 2, Insightful

But your rule list is now getting big and still has to be stored. Compression is about minimizing the amount of stuff that has to be stored to recreate the original. It would be nice to have a few simple, very reusable rules that you can use to generate the original with a very few commands.

--
Eat at Joe's.
Re:I can't figure this out... by stilwebm · 2003-01-27 03:38 · Score: 2, Insightful

It's true that the sellers want that. However, you may have noticed spammers are not always the sellers. The seller is looking for someone to do some "email marketing" for them. They are looking for wide coverage. They want to see things like "your email can be sent to 30 million unique email addresses," which means a few million that might get through, a few thousand that will actually get read, and maybe a few purchases. Spammers are just creepy marketers who want to make it sound like emailing as many people as possible is better, and should cost the seller more. Since they use open relays and random forged "From" email addresses, they never see what email gets blocked. Using images in HTML email they can get an idea of how many emails were read (this is why you should turn off images in email). While the spammer makes a commission on every sold item, they also make money selling lists and marketing services.

The numbers are part of their pissing contest, and the pool is your inbox. Spammers are not that bright, but their customers are much, much more stupid.
Re:Spam Conference talk by archeopterix · 2003-01-27 03:39 · Score: 4, Insightful

MLD, gzip, neural networks, bayesian filtering and probably a bunch of other spam-filtering methods are all based on the following scheme: get a (big) number of spam messages, a number of non-spam messages (preferably specific to the current user of the filter) and use a learning algorithm on these to produce an automatic classifier.
What bothers me about this method is that you can never be 100% sure what the learning algorithm will actually learn. My friends seldom send me HTML mail. Most of my spam is HTML. A learning algorithm will probably learn that HTML mail is spam, especially if it never gets HTML "ham" during its training period. Then if one of my clueless friends sends me an HTML message, it will not go through and this is clearly bad.
I will never trust an automatic filter so as to delete a message marked as "spam" without reading, but I think it can still be useful for ranking messages, so that spam gets read less often and deleted faster.
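The worry above about a learner picking up "HTML means spam" can be reproduced with a toy word-based naive Bayes classifier in Python. Everything here is an assumption made for illustration: the training messages are invented, the classifier uses add-one smoothing and equal priors, and it is far simpler than any real Bayesian spam filter.

```python
import math
from collections import Counter


def train(messages: list[str]) -> Counter:
    """Count whitespace-separated tokens across all training messages."""
    counts = Counter()
    for m in messages:
        counts.update(m.lower().split())
    return counts


def log_likelihood(message: str, counts: Counter, vocab_size: int) -> float:
    """Log-probability of the message's tokens, with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in message.lower().split())


def classify(message: str, spam_counts: Counter, ham_counts: Counter) -> str:
    """Equal priors assumed: compare the two smoothed likelihoods."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam_score = log_likelihood(message, spam_counts, vocab_size)
    ham_score = log_likelihood(message, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "ham"


# Invented training data: every spam is HTML, no ham is.
spam_counts = train([
    "<html> <body> buy pills now </body> </html>",
    "<html> <body> free offer click here </body> </html>",
])
ham_counts = train([
    "lunch on friday",
    "see you at the meeting tomorrow",
])

# The same friendly text, sent as plain text and then as HTML.
print(classify("lunch friday", spam_counts, ham_counts))
print(classify("<html> <body> lunch friday </body> </html>",
               spam_counts, ham_counts))
```

With this training set the HTML tags outweigh the friendly words, so the HTML copy is flagged while the plain-text copy is not, which is why including HTML ham in the training data matters.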
Nope by I+Am+The+Owl · 2003-01-27 04:14 · Score: 2, Insightful

Doesn't work for the Lameness Filter, won't work for spam.

--

--sdem