Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
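
The idea in the quoted sample can be tried out with a few lines of Python. This is only a minimal sketch, not the article's actual filter: it uses the standard `zlib` module (the same DEFLATE compression that gzip uses), and the helper name `compression_ratio` and the two sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        # Nothing to compress; treat empty input as incompressible.
        return 1.0
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical samples: one highly repetitive, one with little repetition.
repetitive = "buy now! " * 50
varied = "The quick brown fox jumps over the lazy dog near the quiet riverbank at dawn."

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The repetitive string compresses to a much smaller fraction of its original size than the varied one, which is the signal the summary describes. Very short inputs can come out larger after compression because of the fixed format overhead, so the ratio is only meaningful on reasonably long texts.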

2 of 268 comments

  1. Re:Text of the full article by Anonymous Coward · Score: 5, Insightful

    > The current fad among spam filters is word-counting, with various statistical heuristics applied to the results.

    The current fad is in fact Bayesian filtering, sophisticated statistical analysis.

    gzip used this way can be viewed as a very poor Bayesian analysis with substantially lower effectiveness. Let's just skip the half-assed attempt and go straight to the real thing.

  2. Same old problem... by artemis67 · Score: 5, Insightful

    Filtering is not a true spam solution. All it takes is one false positive, with a Really Important Email accidentally deleted, to totally destroy the value of any filtering system.

    Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.

    Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan every day that the possibility of me missing a legitimate one is very high.