Slashdot Mirror


Using gzip As A Spam Filter

captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."

3 of 268 comments

  1. Text of the full article by Glog · · Score: -1, Redundant

    Originally posted on kuro5hin.org

    By KWillets
    Sun Jan 26th, 2003 at 07:03:35 AM EST

    While many people see gzip as a compression tool, it also makes a credible spam filter. Here's how.

    I was reading through a bioinformatics book the other day, and was reminded of a useful shortcut for comparing a text against various corpora. A number of researchers have simply fed DNA sequence data into the popular Ziv-Lempel compression algorithm, to see how much redundancy it contains.
    Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
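
    A minimal Python sketch of this idea, using the standard-library gzip module. The function name compression_ratio and the sample inputs are made up for illustration, and "ratio" is taken here to mean compressed size divided by original size, which is only one of several common definitions.

```python
import gzip
import os


def compression_ratio(data: bytes) -> float:
    """Return compressed size / original size (lower means more repetition)."""
    return len(gzip.compress(data)) / len(data)


# Illustrative inputs only: a heavily repeated phrase vs. random bytes.
repetitive = b"buy now, limited offer! " * 200
random_bytes = os.urandom(len(repetitive))

print(f"repetitive text: {compression_ratio(repetitive):.3f}")
print(f"random bytes:    {compression_ratio(random_bytes):.3f}")
```

    The repeated phrase should shrink to a small fraction of its size, while the random bytes, which contain almost no repeated strings, should barely compress at all.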

    A related technique allows us to measure how much a given "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text, and gzip them together, the test text will get a better compression ratio if it has more fragments, words, or phrases in common with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm searches the preceding input for repetitions, it tends to map pieces of the test text to previous occurrences in the corpus, thereby achieving a high "appended compression ratio" if the test text is similar to what it's appended to.
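
    The appended measurement can be sketched in Python roughly as follows. The helper name appended_cost and the toy corpus are assumptions made for the example, not anything from the article. One caveat: the DEFLATE format used by gzip can only point back 32 KiB, so with a large corpus the test text is mostly matched against the tail of the corpus rather than all of it.

```python
import gzip


def appended_cost(corpus: bytes, test_text: bytes) -> int:
    """Extra compressed bytes incurred by appending test_text to corpus."""
    return len(gzip.compress(corpus + test_text)) - len(gzip.compress(corpus))


# Toy data, invented for illustration.
corpus = b"the quick brown fox jumps over the lazy dog. " * 50
similar = b"the lazy dog jumps over the quick brown fox. "
dissimilar = b"Zq9#xW2@pL7!vK4$mN8^bT1&cR5*yH3(jF6)uG0+sD-eA"

print("similar:   ", appended_cost(corpus, similar))
print("dissimilar:", appended_cost(corpus, dissimilar))
```

    The text that reuses phrases from the corpus should cost noticeably fewer extra bytes than the one that shares nothing with it.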

    In this case, we wish to compare an incoming email message against two possible corpora: spam and non-spam (ham). If we maintain archives of both, we can compare the appended compression ratios relative to each, to judge how similar a new message is to spam or ham.

    As a simple test, I downloaded some sample spam and ham from the SpamAssassin archive. I removed headers from the messages (to focus on message text only), and created spam and ham "training sets" 1-2 megabytes in size. I then tested spam and ham messages not in the training sets for their compressed sizes when appended.

    Compression was measured as follows:

    $ cat spam.txt new-message-body.txt | gzip -c | wc -c
    $ cat ham.txt new-message-body.txt | gzip -c | wc -c

    The byte counts output were compared to the compressed sizes of spam.txt and ham.txt alone (without new-message-body.txt appended), to see how many compressed bytes the new message body consumed in each case.
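
    The same subtraction can be sketched in Python. The function names gzipped_size and bytes_consumed are hypothetical, and the demo writes tiny stand-in files so that it runs without the SpamAssassin data. Sizes should be close to what the gzip command reports, though not necessarily byte-identical.

```python
import gzip
import tempfile
from pathlib import Path


def gzipped_size(data: bytes) -> int:
    # compresslevel=6 is the gzip command's default level.
    return len(gzip.compress(data, compresslevel=6))


def bytes_consumed(corpus_path, message_path) -> int:
    """Compressed size of corpus+message minus compressed size of corpus alone."""
    corpus = Path(corpus_path).read_bytes()
    message = Path(message_path).read_bytes()
    return gzipped_size(corpus + message) - gzipped_size(corpus)


# Demo with tiny stand-in files (invented content, not the real archives).
with tempfile.TemporaryDirectory() as tmp:
    spam = Path(tmp, "spam.txt")
    ham = Path(tmp, "ham.txt")
    new = Path(tmp, "new-message-body.txt")
    spam.write_bytes(b"click here to claim your free prize today. " * 40)
    ham.write_bytes(b"the meeting notes are attached, see you friday. " * 40)
    new.write_bytes(b"claim your free prize today. click here. ")
    vs_spam = bytes_consumed(spam, new)
    vs_ham = bytes_consumed(ham, new)

print("bytes consumed vs spam.txt:", vs_spam)
print("bytes consumed vs ham.txt: ", vs_ham)
```

    With real archives, one would call bytes_consumed("spam.txt", "new-message-body.txt") and bytes_consumed("ham.txt", "new-message-body.txt") and compare the two numbers.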

    The results for "ham" messages were the most dramatic. The average compressed size of a ham message appended to spam was 38% higher than when appended to other ham. For spam messages, the same comparison yielded a compressed size 6% smaller when appended to spam vs. ham, so in both cases, compressing a message with others of its kind yielded a smaller file, on average.

    Individual results were also quite clear: while some spam messages compressed slightly better when mixed with ham, ham messages still maintained a margin of 15% or more between the most spamlike ham and the most hamlike spam. I would put the threshold somewhere around 110%: if a message's size when gzipped with spam is less than 110% of its size when compressed with ham, it's probably spam.
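
    As a rough sketch, the 110% rule could be written as a small Python predicate. The names looks_like_spam, appended_cost, and SPAM_THRESHOLD are assumptions for this example, and the corpora below are invented toy strings, so the output says nothing about real-world accuracy.

```python
import gzip

SPAM_THRESHOLD = 1.10  # the "around 110%" cut-off suggested above


def appended_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes incurred by appending message to corpus."""
    return len(gzip.compress(corpus + message)) - len(gzip.compress(corpus))


def looks_like_spam(message: bytes, spam_corpus: bytes, ham_corpus: bytes,
                    threshold: float = SPAM_THRESHOLD) -> bool:
    """True if the size with spam is below threshold times the size with ham."""
    spam_cost = appended_cost(spam_corpus, message)
    ham_cost = appended_cost(ham_corpus, message)
    return spam_cost < threshold * ham_cost


# Toy corpora and messages, invented for illustration.
spam_corpus = b"click here to claim your free prize now! limited offer, act today. " * 30
ham_corpus = b"hi team, the meeting notes are attached. see you at lunch on friday. " * 30

spam_like = b"limited offer, act today. click here to claim your free prize now! "
ham_like = b"see you at lunch on friday. hi team, the meeting notes are attached. "

print("spam_like flagged:", looks_like_spam(spam_like, spam_corpus, ham_corpus))
print("ham_like flagged: ", looks_like_spam(ham_like, spam_corpus, ham_corpus))
```

    Because the threshold sits above 100%, a message that compresses about equally well against both corpora is still flagged, which matches the article's observation that some spam compressed slightly better with ham.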

    In conclusion, gzip is a fairly blunt instrument for spam detection, but the effectiveness of its relatively blind repetition-finding is worth noting. The current fad among spam filters is word-counting, with various statistical heuristics applied to the results. Algorithms like LZ and gzip go beyond word matching, finding entire phrases and paragraphs of repetition, but do not attempt to measure their statistical significance. More sophisticated approaches, which combine phrase matching with statistical analysis, may be more effective.

  2. Slashdot needs this... by FearUncertaintyDoubt · · Score: -1, Redundant

    Just to weed out the flood of duplicate stories.

  3. Will it catch this? by grub · · Score: -1, Redundant

    IMMEDIATE ATTENTION NEEDED :
    HIGHLY CONFIDENTIAL


    FROM: GEORGE WALKER BUSH
    DEAR SIR / MADAM,


    I AM GEORGE WALKER BUSH, SON OF THE FORMER PRESIDENT OF THE UNITED STATES OF
    AMERICA GEORGE HERBERT WALKER BUSH, AND CURRENTLY SERVING AS PRESIDENT OF
    THE UNITED STATES OF AMERICA. THIS LETTER MIGHT SURPRISE YOU BECAUSE WE HAVE
    NOT MET NEITHER IN PERSON NOR BY CORRESPONDENCE. I CAME TO KNOW OF YOU IN MY
    SEARCH FOR A RELIABLE AND REPUTABLE PERSON TO HANDLE A VERY CONFIDENTIAL
    BUSINESS TRANSACTION, WHICH INVOLVES THE TRANSFER OF A HUGE SUM OF MONEY TO
    AN ACCOUNT REQUIRING MAXIMUM CONFIDENCE.


    I AM WRITING YOU IN ABSOLUTE CONFIDENCE PRIMARILY TO SEEK YOUR ASSISTANCE IN
    ACQUIRING OIL FUNDS THAT ARE PRESENTLY TRAPPED IN THE REPUBLIC OF IRAQ. MY
    PARTNERS AND I SOLICIT YOUR ASSISTANCE IN COMPLETING A TRANSACTION BEGUN BY
    MY FATHER, WHO HAS LONG BEEN ACTIVELY ENGAGED IN THE EXTRACTION OF PETROLEUM
    IN THE UNITED STATES OF AMERICA, AND BRAVELY SERVED HIS COUNTRY AS DIRECTOR
    OF THE UNITED STATES CENTRAL INTELLIGENCE AGENCY.


    IN THE DECADE OF THE NINETEEN-EIGHTIES, MY FATHER, THEN VICE-PRESIDENT OF
    THE UNITED STATES OF AMERICA, SOUGHT TO WORK WITH THE GOOD OFFICES OF THE
    PRESIDENT OF THE REPUBLIC OF IRAQ TO REGAIN LOST OIL REVENUE SOURCES IN THE
    NEIGHBORING ISLAMIC REPUBLIC OF IRAN. THIS UNSUCCESSFUL VENTURE WAS SOON
    FOLLOWED BY A FALLING OUT WITH HIS IRAQI PARTNER, WHO SOUGHT TO ACQUIRE
    ADDITIONAL OIL REVENUE SOURCES IN THE NEIGHBORING EMIRATE OF KUWAIT, A
    WHOLLY-OWNED U.S.-BRITISH SUBSIDIARY.


    MY FATHER RE-SECURED THE PETROLEUM ASSETS OF KUWAIT IN 1991 AT A COST OF
    SIXTY-ONE BILLION U.S. DOLLARS ($61,000,000,000). OUT OF THAT COST,
    THIRTY-SIX BILLION DOLLARS ($36,000,000,000) WERE SUPPLIED BY HIS PARTNERS
    IN THE KINGDOM OF SAUDI ARABIA AND OTHER PERSIAN GULF MONARCHIES, AND
    SIXTEEN BILLION DOLLARS ($16,000,000,000) BY GERMAN AND JAPANESE PARTNERS.
    BUT MY FATHER'S FORMER IRAQI BUSINESS PARTNER REMAINED IN CONTROL OF THE
    REPUBLIC OF IRAQ AND ITS PETROLEUM RESERVES.


    MY FAMILY IS CALLING FOR YOUR URGENT ASSISTANCE IN FUNDING THE REMOVAL OF
    THE PRESIDENT OF THE REPUBLIC OF IRAQ AND ACQUIRING THE PETROLEUM ASSETS OF
    HIS COUNTRY, AS COMPENSATION FOR THE COSTS OF REMOVING HIM FROM POWER.
    UNFORTUNATELY, OUR PARTNERS FROM 1991 ARE NOT WILLING TO SHOULDER THE BURDEN
    OF THIS NEW VENTURE, WHICH IN ITS UPCOMING PHASE MAY COST THE SUM OF 100
    BILLION TO 200 BILLION DOLLARS ($100,000,000,000 - $200,000,000,000), BOTH
    IN THE INITIAL ACQUISITION AND IN LONG-TERM MANAGEMENT.


    WITHOUT THE FUNDS FROM OUR 1991 PARTNERS, WE WOULD NOT BE ABLE TO ACQUIRE
    THE OIL REVENUE TRAPPED WITHIN IRAQ. THAT IS WHY MY FAMILY AND OUR
    COLLEAGUES ARE URGENTLY SEEKING YOUR GRACIOUS ASSISTANCE. OUR DISTINGUISHED
    COLLEAGUES IN THIS BUSINESS TRANSACTION INCLUDE THE SITTING VICE-PRESIDENT
    OF THE UNITED STATES OF AMERICA, RICHARD CHENEY, WHO IS AN ORIGINAL PARTNER
    IN THE IRAQ VENTURE AND FORMER HEAD OF THE HALLIBURTON OIL COMPANY, AND
    CONDOLEEZA RICE, WHOSE PROFESSIONAL DEDICATION TO THE VENTURE WAS
    DEMONSTRATED IN THE NAMING OF A CHEVRON OIL TANKER AFTER HER.


    I WOULD BESEECH YOU TO TRANSFER A SUM EQUALING TEN TO TWENTY-FIVE PERCENT
    (10-25 %) OF YOUR YEARLY INCOME TO OUR ACCOUNT TO AID IN THIS IMPORTANT
    VENTURE. THE INTERNAL REVENUE SERVICE OF THE UNITED STATES OF AMERICA WILL
    FUNCTION AS OUR TRUSTED INTERMEDIARY. I PROPOSE THAT YOU MAKE THIS TRANSFER
    BEFORE THE FIFTEENTH (15TH) OF THE MONTH OF APRIL.


    I KNOW THAT A TRANSACTION OF THIS MAGNITUDE WOULD MAKE ANYONE APPREHENSIVE
    AND WORRIED. BUT I AM ASSURING YOU THAT ALL WILL BE WELL AT THE END OF THE
    DAY. A BOLD STEP TAKEN SHALL NOT BE REGRETTED, I ASSURE YOU. PLEASE DO BE
    INFORMED THAT THIS BUSINESS TRANSACTION IS 100% LEGAL. IF YOU DO NOT WISH TO
    CO-OPERATE IN THIS TRANSACTION, PLEASE CONTACT OUR INTERMEDIARY
    REPRESENTATIVES TO FURTHER DISCUSS THE MATTER.


    I PRAY THAT YOU UNDERSTAND OUR PLIGHT. MY FAMILY AND OUR COLLEAGUES WILL BE
    FOREVER GRATEFUL. PLEASE REPLY IN STRICT CONFIDENCE TO THE CONTACT NUMBERS
    BELOW.


    SINCERELY WITH WARM REGARDS,


    GEORGE WALKER BUSH


    Switchboard: 202.456.1414 Comments: 202.456.1111 Fax: 202.456.2461 Email:
    president@whitehouse.gov

    --
    Trolling is a art,