Using gzip As A Spam Filter
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
Originally posted on kuro5hin.org
By KWillets
Sun Jan 26th, 2003 at 07:03:35 AM EST
While many people see gzip as a compression tool, it also makes a credible spam filter. Here's how.
I was reading through a bioinformatics book the other day, and was reminded of a useful shortcut for comparing a text against various corpora. A number of researchers have simply fed DNA sequence data into the popular Ziv-Lempel compression algorithm, to see how much redundancy it contains.
Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
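As a rough illustration of that last point, the following Python sketch compares how well a repetitive string and a random-looking string of the same length compress. It uses Python's zlib module, which implements the same DEFLATE method as gzip. The compression_ratio helper and both sample strings are made up for this example, and "ratio" is taken here to mean compressed size divided by original size, so a lower number means more repetition was found.

```python
import random
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size (lower = more repetition)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)


# A highly repetitive sample: the same phrase over and over.
repetitive = "buy now and save big today! " * 40

# A same-length sample with almost no repeated phrases
# (pseudo-random letters, fixed seed so the run is repeatable).
rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

print("repetitive:", round(compression_ratio(repetitive), 3))
print("varied:    ", round(compression_ratio(varied), 3))
```

The repetitive sample should come out with a far lower ratio, since nearly all of it can be replaced by back-references.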
A related technique allows us to measure how much a given "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text, and gzip them together, the test text will get a better compression ratio if it has more fragments, words, or phrases in common with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm scans the preceding input for repetitions, it tends to map pieces of the test text to previous occurrences in the corpus, thereby achieving a high "appended compression ratio" if the test text is similar to what it's appended to.
In this case, we wish to compare an incoming email message against two possible corpora: spam and non-spam (ham). If we maintain archives of both, we can compare the appended compression ratios relative to each, to judge how similar a new message is to spam or ham.
As a simple test, I downloaded some sample spam and ham from the SpamAssassin archive. I removed headers from the messages (to focus on message text only), and created spam and ham "training sets" 1-2 megabytes in size. I then tested spam and ham messages not in the training sets for their compressed sizes when appended.
Compression was measured as follows:
$ cat spam.txt new-message-body.txt | gzip - | wc -c
$ cat ham.txt new-message-body.txt | gzip - | wc -c
The file sizes output were compared to the compressed sizes of spam.txt and ham.txt without new-message-body.txt appended, to see how many bytes were consumed by the new-message-body.
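The same measurement can be sketched in Python, for readers who would rather not shell out to gzip. This is only an approximation of the procedure described above: it uses zlib (the same DEFLATE compression, without the gzip file header), and the appended_cost name and the tiny in-memory corpora are invented for illustration rather than taken from the SpamAssassin archive.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes after DEFLATE compression at zlib's default level."""
    return len(zlib.compress(data))


def appended_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes consumed by the message when it is
    compressed after the corpus, relative to the corpus alone."""
    return compressed_size(corpus + message) - compressed_size(corpus)


# Toy corpora, made up for this example.
spam_corpus = (
    b"Click here to claim your free prize now.\n"
    b"Lowest prices on pills, order today.\n"
    b"You have been selected for a cash reward.\n"
)
ham_corpus = (
    b"The meeting is moved to Thursday at ten.\n"
    b"Please review the attached patch before lunch.\n"
    b"Thanks for the notes, see you next week.\n"
)

message = b"You have been selected to claim your free prize now.\n"

print("cost when appended to spam:", appended_cost(spam_corpus, message))
print("cost when appended to ham: ", appended_cost(ham_corpus, message))
```

One caveat when scaling this up: DEFLATE only looks back 32 KiB for matches, so with a corpus of a megabyte or more, it is mostly the tail end of the corpus that the appended message can be matched against directly.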
The results for "ham" messages were the most dramatic. The average compressed size of a ham message appended to spam was 38% higher than when appended to other ham. For spam messages, the same comparison yielded a compressed size 6% smaller when appended to spam vs. ham, so in both cases, compressing a message with others of its kind yielded a smaller file, on average.
Individual results were also quite clear: while some spam messages compressed slightly better when mixed with ham, a margin of 15% or more still separated the most spamlike ham from the most hamlike spam. I would put the threshold somewhere around 110%; if a message's size when gzipped with spam is less than 110% of its size when gzipped with ham, it's probably spam.
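A threshold rule of this kind might be sketched as follows. This is a minimal Python illustration, not the code used for the test above: the looks_like_spam function, its threshold parameter (defaulting to 1.10 for the roughly 110% figure), and the toy corpora are all assumptions, and zlib stands in for the gzip command.

```python
import zlib


def appended_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes the message adds when compressed after the corpus."""
    return len(zlib.compress(corpus + message)) - len(zlib.compress(corpus))


def looks_like_spam(message: bytes, spam_corpus: bytes, ham_corpus: bytes,
                    threshold: float = 1.10) -> bool:
    """Flag the message when its cost against spam is below
    threshold times its cost against ham."""
    # Clamp to 1 so a zero or negative difference cannot break the division.
    spam_cost = max(appended_cost(spam_corpus, message), 1)
    ham_cost = max(appended_cost(ham_corpus, message), 1)
    return spam_cost / ham_cost < threshold


# Toy corpora and messages, made up for this example.
spam_corpus = (
    b"Click here to claim your free prize now.\n"
    b"Lowest prices on pills, order today.\n"
    b"You have been selected for a cash reward.\n"
)
ham_corpus = (
    b"The meeting is moved to Thursday at ten.\n"
    b"Please review the attached patch before lunch.\n"
    b"Thanks for the notes, see you next week.\n"
)

spam_message = b"You have been selected to claim your free prize now.\n"
ham_message = b"Thanks for the notes, please review the attached patch before lunch.\n"

print(looks_like_spam(spam_message, spam_corpus, ham_corpus))
print(looks_like_spam(ham_message, spam_corpus, ham_corpus))
```

Putting the cutoff above 100% leans toward calling borderline messages spam, in line with the figures above, where ham stood out far more sharply than spam did. The right value would need tuning against real mail.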
In conclusion, gzip is a fairly blunt instrument for spam detection, but the effectiveness of its relatively blind repetition-finding is worth noting. The current fad among spam filters is word-counting, with various statistical heuristics applied to the results. Algorithms like LZ and gzip go beyond word matching, finding entire phrases and paragraphs of repetition, but do not attempt to measure their statistical significance. More sophisticated approaches, which combine phrase matching with statistical analysis, may be more effective.
Just to weed out the flood of duplicate stories.
IMMEDIATE ATTENTION NEEDED :
HIGHLY CONFIDENTIAL
FROM: GEORGE WALKER BUSH
DEAR SIR / MADAM,
I AM GEORGE WALKER BUSH, SON OF THE FORMER PRESIDENT OF THE UNITED STATES OF
AMERICA GEORGE HERBERT WALKER BUSH, AND CURRENTLY SERVING AS PRESIDENT OF
THE UNITED STATES OF AMERICA. THIS LETTER MIGHT SURPRISE YOU BECAUSE WE HAVE
NOT MET NEITHER IN PERSON NOR BY CORRESPONDENCE. I CAME TO KNOW OF YOU IN MY
SEARCH FOR A RELIABLE AND REPUTABLE PERSON TO HANDLE A VERY CONFIDENTIAL
BUSINESS TRANSACTION, WHICH INVOLVES THE TRANSFER OF A HUGE SUM OF MONEY TO
AN ACCOUNT REQUIRING MAXIMUM CONFIDENCE.
I AM WRITING YOU IN ABSOLUTE CONFIDENCE PRIMARILY TO SEEK YOUR ASSISTANCE IN
ACQUIRING OIL FUNDS THAT ARE PRESENTLY TRAPPED IN THE REPUBLIC OF IRAQ. MY
PARTNERS AND I SOLICIT YOUR ASSISTANCE IN COMPLETING A TRANSACTION BEGUN BY
MY FATHER, WHO HAS LONG BEEN ACTIVELY ENGAGED IN THE EXTRACTION OF PETROLEUM
IN THE UNITED STATES OF AMERICA, AND BRAVELY SERVED HIS COUNTRY AS DIRECTOR
OF THE UNITED STATES CENTRAL INTELLIGENCE AGENCY.
IN THE DECADE OF THE NINETEEN-EIGHTIES, MY FATHER, THEN VICE-PRESIDENT OF
THE UNITED STATES OF AMERICA, SOUGHT TO WORK WITH THE GOOD OFFICES OF THE
PRESIDENT OF THE REPUBLIC OF IRAQ TO REGAIN LOST OIL REVENUE SOURCES IN THE
NEIGHBORING ISLAMIC REPUBLIC OF IRAN. THIS UNSUCCESSFUL VENTURE WAS SOON
FOLLOWED BY A FALLING OUT WITH HIS IRAQI PARTNER, WHO SOUGHT TO ACQUIRE
ADDITIONAL OIL REVENUE SOURCES IN THE NEIGHBORING EMIRATE OF KUWAIT, A
WHOLLY-OWNED U.S.-BRITISH SUBSIDIARY.
MY FATHER RE-SECURED THE PETROLEUM ASSETS OF KUWAIT IN 1991 AT A COST OF
SIXTY-ONE BILLION U.S. DOLLARS ($61,000,000,000). OUT OF THAT COST,
THIRTY-SIX BILLION DOLLARS ($36,000,000,000) WERE SUPPLIED BY HIS PARTNERS
IN THE KINGDOM OF SAUDI ARABIA AND OTHER PERSIAN GULF MONARCHIES, AND
SIXTEEN BILLION DOLLARS ($16,000,000,000) BY GERMAN AND JAPANESE PARTNERS.
BUT MY FATHER'S FORMER IRAQI BUSINESS PARTNER REMAINED IN CONTROL OF THE
REPUBLIC OF IRAQ AND ITS PETROLEUM RESERVES.
MY FAMILY IS CALLING FOR YOUR URGENT ASSISTANCE IN FUNDING THE REMOVAL OF
THE PRESIDENT OF THE REPUBLIC OF IRAQ AND ACQUIRING THE PETROLEUM ASSETS OF
HIS COUNTRY, AS COMPENSATION FOR THE COSTS OF REMOVING HIM FROM POWER.
UNFORTUNATELY, OUR PARTNERS FROM 1991 ARE NOT WILLING TO SHOULDER THE BURDEN
OF THIS NEW VENTURE, WHICH IN ITS UPCOMING PHASE MAY COST THE SUM OF 100
BILLION TO 200 BILLION DOLLARS ($100,000,000,000 - $200,000,000,000), BOTH
IN THE INITIAL ACQUISITION AND IN LONG-TERM MANAGEMENT.
WITHOUT THE FUNDS FROM OUR 1991 PARTNERS, WE WOULD NOT BE ABLE TO ACQUIRE
THE OIL REVENUE TRAPPED WITHIN IRAQ. THAT IS WHY MY FAMILY AND OUR
COLLEAGUES ARE URGENTLY SEEKING YOUR GRACIOUS ASSISTANCE. OUR DISTINGUISHED
COLLEAGUES IN THIS BUSINESS TRANSACTION INCLUDE THE SITTING VICE-PRESIDENT
OF THE UNITED STATES OF AMERICA, RICHARD CHENEY, WHO IS AN ORIGINAL PARTNER
IN THE IRAQ VENTURE AND FORMER HEAD OF THE HALLIBURTON OIL COMPANY, AND
CONDOLEEZA RICE, WHOSE PROFESSIONAL DEDICATION TO THE VENTURE WAS
DEMONSTRATED IN THE NAMING OF A CHEVRON OIL TANKER AFTER HER.
I WOULD BESEECH YOU TO TRANSFER A SUM EQUALING TEN TO TWENTY-FIVE PERCENT
(10-25 %) OF YOUR YEARLY INCOME TO OUR ACCOUNT TO AID IN THIS IMPORTANT
VENTURE. THE INTERNAL REVENUE SERVICE OF THE UNITED STATES OF AMERICA WILL
FUNCTION AS OUR TRUSTED INTERMEDIARY. I PROPOSE THAT YOU MAKE THIS TRANSFER
BEFORE THE FIFTEENTH (15TH) OF THE MONTH OF APRIL.
I KNOW THAT A TRANSACTION OF THIS MAGNITUDE WOULD MAKE ANYONE APPREHENSIVE
AND WORRIED. BUT I AM ASSURING YOU THAT ALL WILL BE WELL AT THE END OF THE
DAY. A BOLD STEP TAKEN SHALL NOT BE REGRETTED, I ASSURE YOU. PLEASE DO BE
INFORMED THAT THIS BUSINESS TRANSACTION IS 100% LEGAL. IF YOU DO NOT WISH TO
CO-OPERATE IN THIS TRANSACTION, PLEASE CONTACT OUR INTERMEDIARY
REPRESENTATIVES TO FURTHER DISCUSS THE MATTER.
I PRAY THAT YOU UNDERSTAND OUR PLIGHT. MY FAMILY AND OUR COLLEAGUES WILL BE
FOREVER GRATEFUL. PLEASE REPLY IN STRICT CONFIDENCE TO THE CONTACT NUMBERS
BELOW.
SINCERELY WITH WARM REGARDS,
GEORGE WALKER BUSH
Switchboard: 202.456.1414 Comments: 202.456.1111 Fax: 202.456.2461 Email:
president@whitehouse.gov
Trolling is a art,