Using gzip As A Spam Filter
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
But isn't the spam quite varied, i.e. without long repetitive sequences? Yes, the same post may come in several times, but the text in each is quite varied; e.g. longer xxx, bigger xxx or yyy, heftier yyy and zzz.
Sounds very much like that lameness filter on Slashdot that refuses to accept a post if its contents can be compressed easily... of course, it's quite simplistic compared to gzip.
Sure, this sounds like a nice academic activity, but really ... In the real world, use the right tool for the right job. I tend to think word redundancy does not correlate directly to spaminess.
That's because most spam includes large amounts of HTML.
My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" does.
Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method, instead he was looking at MLD - Minimum Length Description. This is a technique to discover features in corpora that allow you to describe the classification of a corpus in the minimum number of details.
Basically it's a way to discover features in emails using compression techniques, so rather than having us SpamAssassin developers have to carefully and manually examine emails to see what's new and interesting about them, MLD techniques can automatically detect these features.
Jason Rennie's web page (talk and paper available) about this is here. Please do read it as it's extremely interesting.
The one downside of it is that Jason said at the end of his talk that it's extremely slow at doing the feature detection. When asked how slow he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).
In comparison to Bayesian techniques the MLD technique presents a great deal of interest - primarily because I work for a company doing spam filtering at the internet level, and so we can't feasibly do personal training which is what makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
Its not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.
for example Two Emails:
1 (ham) "You have won a brand new Convertible, from the competition you entered."
and
2 (spam) "A brand new convertible to be won, have you entered?"
Ham would match about 80% with spam.
Word matching is a blunt instrument as mentioned. The English language is far too complex for simple calculations, this fact should be taken into consideration, when applying a 'Spam Likelihood' rating to Emails.
A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.
A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.
The right thing to do would be to combine them.Ttake a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
Save your wrists today - switch to Dvorak
When the spam is filtered at user-account level, you can only do it by parsing a single mail in some way and determine if it's spam or not. It's like trying to tell whether a movie is bad by looking at one picture. If the spam could be filtered at the server level, by comparing mails that are received into to different accounts, you could really tell which ones are part of a mass-mail (spam).
One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make difference to the comparing script.
Mailing lists might cause a problem too but wouldn't it be easier to allow the mailing lists you belong to than trying to block the ones you don't belong to?
As an example of how Sequitur works, the string 'abcabdabcabd' produces the following grammar rules:
- 2 c 2 d
- a b
Representing the original string then is the sequence:1 1
The usage counts of the rules are available as output options.
Seastead this.
You know, I noticed something peculiar. If you're from a non-English speaking country, like I am, you can filter the spam by looking at the language of the subject. In my case, if it is English it is almost certainly spam.
Do English-speaking people receive spam in foreign languages?
It's inefficient to have so much memory overhead.
Besides, if I were a spammer, I could pad it with high entropy data at the end to make up for my warbling.
A mailing list I am on gets almost exclusively korean spam.
Or it did get it, I don't know anymore, I unsubscribed. The admin refused to make it a subscription-only list as it was easier for newbies to post questions to it, and they could get quicker answers.
Unfortunately all that happened was newbies posted messages to the list and lost any replies from the few dedicated people left, in the deluge of spam.
Otherwise, it's mostly just more english spam
Filtering is not a true spam solution. All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.
One of the side effects of spam is that there are no "Really Important Emails" any more. Spam and spam filters have degraded the reliability of email to such an extent that you'd have to be crazy to send anything Really Important by email.
Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day?
That's a good point. The solution is to get less spam. You can do that by changing email addresses frequently (a really inconvenient solution that I don't recommend), or by getting spammers shut down (or yourself listwashed by the spammers).
Let the spammers know that if they send something to you, they'll lose money, and they won't send you so much spam. SpamCop reporting makes this easy. If you want to be listwashed, don't munge your address when you send reports. (This is an option with SpamCop.)
Some people claim that you'll get more spam or get listbombed or something if you send complaints without munging; that's not my experience. I get 20-30 spams per day, total, at all of my 4 publicly available email addresses. (Ninety to 95 percent of them get caught by the SpamCop filters, which have almost never caught valid email.)
Reminds me off a program I helped with for a short time in college called "Siff" (ftp://ftp.cs.arizona.edu/reports/1993/TR93-33.ps) , which would find similar files by taking small fingerprints (32-bit hashes) of 50 byte sequences and finding groups of files that shared a lot of them. It works surprisingly well, even when the files were modified extensively.
I've often thought since that large mailhubs (yahoo, hotmail, etc) could automatically filter junk mail efficiently by a similar method, perhaps by limiting the delivery rate/fingerprint or just flagging high-occurence hashes as suspect (and then rating each mail by how many of its fingerprints are among this group, too many without an ADV: or bulk-mail tag would cause a mail to be marked as SPAM).
I wonder if it'd be possible to have a network of smaller hubs accomplish the same thing, perhaps even using an encrypting checksum instead of a simple hash so that individuals could contribute without anyone being able to recreate their original messages?
A couple of posts above state that spammers will "just adjust their tactics." Talk like this always puzzles me; on the spammer's side, does this not help him? If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't loose touch with any significant chunk of my target demographic. Of course, this applies with the exception of corporate environments or similiar situations where Joe Insecure has someone else managing spam.
Can anyone share some +5 Insight on the matter?
Bored with karma, be a fan/freak
Another moron the tdisn't read the article.
I actually read the article.
The proposal is not to see how compressible is the message but to use a compression tool to see how lookalike the message is to a corpus of spam.
Yes, and compression ratio is used to determine this.
Save your wrists today - switch to Dvorak
If you're looking for the mathematically perfect zero fault spam solution in a world full of Msft and human beings, forget it.
What happens when I start getting 600-800 a day?
Start another account and don't give it to strangers who might sell it. Only give it to the person or persons who are going to send that really important email message. Throw in a few random numbers so if one gets leaked to spammers you can track the source (i.e., I gave my employment agency (obviously an important contact) chuck369, and nobody else. Now if chuck369 starts getting spam we know employment agency leaked it). Use 'throw away' accounts for untrusted contacts who might leak it to spammers.
try { do() || do_not(); } catch (JediException err) { yoda(err); }
Once you had the information, you could adjust the threshold of the test for optimal results, and figure out which tests were the best value.
In any case he result is that you end up with screening tests that have a lot of false positives, backed up by more expensive tests applied to all the positives to find the real problems.
You could do the same thing with spam. You'd need to assign a cost to the false negatives (missing the job offer), and the false positives (deleting spam that passed the filter), and adjust the filter accordingly. (Assuming the cost of the tests, in cpu, are negligible, which is different from the medical example.).
-- ac at work
RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html
"And we have seen and do testify that the Father sent the Son to be the Savior of the World"
1 John 4:14
I received a nice piece of spam the other day. I didn't read it but I usually scroll to the bottom to see if they have the mandatory (in some places mandatory I think) unsubscribe method. This method sure gets me mad -
To unsubscribe by postal mail, please send your request to:
P.O Box 272521
Boca Raton, FL 33427
Ref # XXXXXX -- scd
(XXXX.. replaced real reference number)
It seems that the unsubscription method doesn't have to be by email - just as long as it's by something and it's there. They musn't be specific in the law. Of course, no one is going to go write a letter by snail mail to unsubscribe to spam, although sending them some dog shit through the mail is tempting. I forgot the site that provides that service. Hrmm I should change my sig.
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
Bayesian only refers to how you use the probabilities.
Now gzip implements similar ideas to LZW compression, which uses variable sized prefixes, which is quite different from an 1-order Markov model. For example, and order 1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.
You need to expand on your step 4.
When I started getting spam, I wanted it to stop. I realized I couldn't just disable the email address because there might be somebody out there counting on it to contact me. I could disable it and send an autoreply with my current email address, but then spammers would just be able to look at the reply. I needed some solution where people could send me email even if the address they used had been disabled, but spammers wouldn't be able to get my current address. I decided to put a contact form on my website. Now I autorespond to a disabled email address with the contact form url. In addition, I was able to remove email addresses from my website which was a large source of spam.
Not being able to find a contact form that was secure, I ended up writing my own and releasing it under the GPL. You can get it at http://ostermiller.org/contactform/.
I also realized that no matter how hard you try, your email address will leak to spammers. Ever giving an email address only to your closest friends and family will not prevent it from leaking out. Between the klez virus, gift certificates, invitation, greeting card, and crushlink websites, even my most personal email address leaked to spammers. You can't be afraid to disable an email address and send your friends the new one. Now even if I missed a friend, they can still get a message to me.
Here is a code snippet from the comment:
-- I was raised on the command line, bitch
The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.
I have to discourage your optimism a bit. IF the public-key encryption ever finds its way to the general public (I hope and think so), there are two possibilities:
a) Your public key will be available for the general public -- this is how it will probably work. If someone wants to send you an e-mail, he obtains your public key in a trusted way (e.g. from a trusted key server), encrypts the message and sends it. If the spammer wants to send you spam, once he gets your e-mail address, he does exactly the same. Obtains your public key, encrypts the spam and sends it. The only difference with today's situation: it will be impossible to filter spam on the server side (only to block some spamming IP addresses, no server-side spam filters).
b) You give your public key only to your friends you trust. This is exactly the approach "everything coming from an address, that's not in my address book, has to be spam." and even contradicts the basic idea: it's your public key...
if all the email clients had a little button saying "This is Spam" and if you click it the mail gets sent to some nice spam black list agency. They'd wait for about 10 people to do this, then verify it for the spam it is and then A) black list the spammer and B) send anti-spam email (subject: spam sender here ) nice and easy :)
d0rk! Ignoring the fact that I was being sarcastic and artistic license would have permitted me to specify /dev/my_ass let me just say this: before you make statements trying to make people look stupid you should probably have a clue what your talking about.
/dev/srandom, this device is the source for _s_ecure random data on OpenBSD and it's probably available some other places as well. Some random trivia (pun intented), checking around I noticed: AIX and Solaris both don't typically have /dev/random at all.
/dev/srandom you could try the following:
/dev/srandom /dev/zero
While true that your measly Linux machine has no
But anyway, back to your question: if you're sad you don't have
ln -s
This reminds me that about a year ago, three italian scientists came up with a way to find species relatedness by using the zip algorithm. One takes the sequence of bacteria 1, and then attaches a little bit of bacteria X sequence to the end of that. Again, one attaches a bit of bacteria X sequence to the end of bacteria 2. And then zipping is done on this concatenation. The final compression size of just the bacteria X part ended up telling us the homology (or relatedness) of bacteria X to bacteria 1 or 2.
But from reading all these posts, perhaps a Bayesian method would work just as well. There seems to be no inherent advantage to using zip. One still needs a reference piece of work (non-spam email, or bacteria 1) for comparing entropies or probabilities. Of interest also is that the researchers applied their method to generating an accurate language tree of Indoeuropean languages (grouped by relatedness of course.)
The ref & abstract of above paper is here:
Phys. Rev. Lett. 88, 048702 (2002)
Dario Benedetto,1 Emanuele Caglioti,1 and Vittorio Loreto2,3
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification. ©2002 The American Physical Society
... can be universal. The principles used actually have their roots in the theories put forward by R. Solomonoff and Kolmogorov (links below). Any given string of bits can be assigned a "complexity" which is proportional to the length of the shortest program that can generate that string. It isn't usually computable BUT the size of the output file of a compression algorithm can be shown to be a reasonable if crude approximation. The beauty is that this approach (minimum description length or MDL) is clustering email in a very fundamental way without the bias' that can be introduced with assumptions required by Bayesian techniques and arguably making use of all the information (vice a subset chosen by the Bayesian user) contained in the email. Yes, the answers can be the same but the MDL approach is universal and the same classifier without modification could be used for broader clustering tasks i.e beyond binary classification of junk/not_junk to multi-class classification junk/best friend/mom/dad/wife/work/etc.
_ 42/Issue_04/o n Program - http://www.cs.cityu.edu.hk/~cssamk/gencomp/GenComp ress1.htm
As an aside, since it could be fully automated it would be interesting to run the such an algorithm with a graphical display, say a 2D plot of compression size vs time of day just to see what shakes out.
By the way, the problematic portion for bioinformatics apps is the compression. DNA sequences often exhibit _expansion_ when put through the common compression schemes. Li has come up with a compression scheme that is more optimal called GenCompress.
Kolmogorov Complexity - http://www.idsia.ch/~marcus/kolmo.htm
Minimum Description Length - http://www3.oup.co.uk/computer_journal/hdb/Volume
Bioinformatics app - http://www.cs.ucsb.edu/~mli/sam.ps
GeneCompressi
"Consensus" in science is _always_ a political construct.