Using gzip As A Spam Filter
captainclever writes "Kuro5hin have an interesting article on detecting spam using gzip." Here's a sample: "Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text."
Originally posted on kuro5hin.org
By KWillets
Sun Jan 26th, 2003 at 07:03:35 AM EST
While many people see gzip as a compression tool, it also makes a credible spam filter. Here's how.
I was reading through a bioinformatics book the other day, and was reminded of a useful shortcut for comparing a text against various corpora. A number of researchers have simply fed DNA sequence data into the popular Ziv-Lempel compression algorithm, to see how much redundancy it contains.
Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
A related technique allows us to measure how much a given, "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text, and gzip them together, the test text will get a better compression ratio if it has more fragments, words, or phrases in common with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm scans the entire input for repetitions, it tends to map pieces of the test text to previous occurrences in the corpus, thereby achieving a high "appended compression ratio" if the test text is similar to what it's appended to.
In this case, we wish to compare an incoming email message against two possible corpora: spam and non-spam (ham). If we maintain archives of both, we can compare the appended compression ratios relative to each, to judge how similar a new message is to spam or ham.
As a simple test, I downloaded some sample spam and ham from the SpamAssassin archive. I removed headers from the messages (to focus on message text only), and created spam and ham "training sets" 1-2 megabytes in size. I then tested spam and ham messages not in the training sets for their compressed sizes when appended.
Compression was measured as follows:
$ cat spam.txt new-message-body.txt |gzip - |wc -c
$ cat ham.txt new-message-body.txt |gzip - |wc -c
The file sizes output were compared to the compressed sizes of spam.txt and ham.txt without new-message-body.txt appended, to see how many bytes were consumed by the new-message-body.
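The subtraction described above can be sketched in a few lines of Python. This is only an illustration, assuming zlib's DEFLATE as a stand-in for the gzip command (same algorithm, different header, and the header cancels out in the subtraction). The function names and the toy corpora are made up here and are not from the article.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data (zlib default level)."""
    return len(zlib.compress(data))

def appended_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes the message costs when appended to the corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)

# Toy corpora, invented for illustration only
spam = b"Buy cheap pills now! Limited offer, click here to order today. " * 20
ham = b"Attached are the meeting notes; let me know if Tuesday works for the review. " * 20
message = b"Limited offer: click here to order cheap pills today."

# The message reuses phrases from the spam corpus, so it should cost
# fewer bytes when appended to spam than when appended to ham.
print(appended_cost(spam, message), appended_cost(ham, message))
```

The two printed numbers correspond to the "bytes consumed by the new-message-body" in each pipeline.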
The results for "ham" messages were the most dramatic. The average compressed size of a ham message appended to spam was 38% higher than when appended to other ham. For spam messages, the same comparison yielded a compressed size 6% smaller when appended to spam vs. ham, so in both cases, compressing a message with others of its kind yielded a smaller file, on average.
Individual results were also quite clear: while some spam messages compressed slightly better when mixed with ham, ham messages still maintained a margin of 15% or more between the most spamlike ham, and the most hamlike spam. I would put the threshold somewhere around 110%; if a message's size when gzipped with spam is less than 110% of its size when compressed with ham, it's probably spam.
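The 110% rule of thumb is simple arithmetic, and a rough Python sketch might look like this. The function name and parameters are hypothetical, and the threshold is the author's estimate from one small test, not a tuned value. Integer arithmetic is used to avoid floating-point edge cases at the boundary.

```python
def classify(cost_with_spam: int, cost_with_ham: int, threshold_pct: int = 110) -> str:
    """Apply the suggested rule of thumb.

    cost_with_spam and cost_with_ham are the extra compressed bytes the
    message adds when appended to each corpus. A message counts as spam if
    its cost next to spam is below threshold_pct percent of its cost next to ham.
    """
    if cost_with_spam * 100 < threshold_pct * cost_with_ham:
        return "spam"
    return "ham"

# Ratios taken from the averages reported above, scaled to a 100-byte baseline
print(classify(138, 100))  # average ham message: 38% larger next to spam
print(classify(94, 100))   # average spam message: 6% smaller next to spam
```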
In conclusion, gzip is a fairly blunt instrument for spam detection, but the effectiveness of its relatively blind repetition-finding is worth noting. The current fad among spam filters is word-counting, with various statistical heuristics applied to the results. Algorithms like LZ and gzip go beyond word matching, finding entire phrases and paragraphs of repetition, but do not attempt to measure their statistical significance. More sophisticated approaches, which combine phrase matching with statistical analysis, may be more effective.
Fr1st Pr0st! spam sux too.
- News For Turds
First Post
Forget about gzip all the 'cool' geeks use grep! :)
This article will make much more sense if you look at the raw data in tabular form.
Ostrich!
Hey if you compress all of your mail with gzip then it all looks like foreign spam anyway!
But isn't the spam quite varied, i.e. without long repetitive sequences? Yes, the same post may come in several times, but the text in each is quite varied; e.g. longer xxx, bigger xxx or yyy, heftier yyy and zzz.
Sounds very much like that lameness filter on Slashdot that refuses to accept a post if its contents can be compressed easily... of course, it's quite simplistic compared to gzip.
Slashdot reporting on a 'new' feature of gzip found by a user of k5 which they have been using themselves for quite some time now (lameness filter does exactly the same thing... sigh)
you can gzip and grep your funny ass on those coded mails...
jeez what kinda moronish idea is this fuck...
spammers send more and more base64 coded shit.
the world has to learn that plain and stupid filtering isnt gonna get you rid of spam. it filters spam but doesnt tackle the problem at its roots. go complain at spammers upstream providers, and at the backbone/upstream providers of the companies being spammed for..
fuck all those COOLSTATS.COM, PDHOST.COM, viagra selling FUCKERS and all the spammers...
SPAMMERS MUST DIE !!!!!!
http://www.paulgraham.com/spam.html
http://www.archive.org/details/ThePowerOfNightmares
So.. does this mean that we'll be seeing e-mail specific programs from companies that make software like gzip and such?
Ah am not a crook! (\(-__-)/)
Sure, this sounds like a nice academic activity, but really ... In the real world, use the right tool for the right job. I tend to think word redundancy does not correlate directly to spaminess.
That's because most spam includes large amounts of HTML.
My friends do not use HTML in email. Ads for "Crimescene Cocksuckers" does.
Cool! Now I can compress all that useless crap with non-useless crap, compare them, then collect more until it uses just as much space as it did when it was non-compressed! ;)
"I kill you! You no good 56'ing!"
Slashdot can use it to filter out duplicate stories.
I would still rather see a law that would sentence the spammers to death without parole... At least there would be higher barrier of entry to spamming.
Save your wrists today - switch to Dvorak
Thanks, I know what compression is.
Anything from mid-level management or the marketing department would immediately be marked as spam and trashed. Maybe not very important in the first place, but you'd at least need to be able to say "yeah, I saw the memo on the TPS reports."
SIG: HUP
In comments submitted on Kuro5hin, a question (see comment) is raised on whether or not Slashdot employs a similar technique (as presented in the article) to foil spam-flooders.
It seems you're caught up in technology too much to use read...no, forgot, you can't.
Jason Rennie gave an extremely interesting talk about this at the MIT Spam Conference this month, although he wasn't using quite as direct a method. Instead he was looking at MDL (Minimum Description Length), a technique for discovering features in corpora that let you describe the classification of a corpus in the minimum number of details.
Basically it's a way to discover features in emails using compression techniques, so rather than having us SpamAssassin developers have to carefully and manually examine emails to see what's new and interesting about them, MDL techniques can automatically detect these features.
Jason Rennie's web page (talk and paper available) about this is here. Please do read it as it's extremely interesting.
The one downside of it is that Jason said at the end of his talk that it's extremely slow at doing the feature detection. When asked how slow he said that on a reasonably small corpus it took 4 months (although he said it was written in Perl, so a C port is probably a good plan).
In comparison to Bayesian techniques the MDL technique presents a great deal of interest - primarily because I work for a company doing spam filtering at the internet level, and so we can't feasibly do personal training which is what makes Bayesian techniques so great (see the talk I gave at the MIT spam conference). Without the personal training Bayes is only about 90-95% effective, so it should be interesting to see where these techniques lead us.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
It's not simply the words that are used in a mail, but the way they are used (the order) that gives a sentence its meaning.
For example, two emails:
1 (ham) "You have won a brand new Convertible, from the competition you entered."
and
2 (spam) "A brand new convertible to be won, have you entered?"
Ham would match about 80% with spam.
Word matching is a blunt instrument, as mentioned. The English language is far too complex for simple calculations, and this should be taken into consideration when applying a 'Spam Likelihood' rating to emails.
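The "about 80%" figure can be checked with a quick word-set comparison. This is a rough sketch assuming that "match" means the share of the spam message's distinct words that also appear in the ham message; the original comment does not say how the number was computed, so that reading is a guess.

```python
import re

def words(text: str) -> set:
    """Lower-cased set of distinct words, ignoring punctuation and order."""
    return set(re.findall(r"[a-z']+", text.lower()))

ham = "You have won a brand new Convertible, from the competition you entered."
spam = "A brand new convertible to be won, have you entered?"

shared = words(ham) & words(spam)
overlap = len(shared) / len(words(spam))

print(sorted(shared))
print(len(shared), "of", len(words(spam)), "spam words also occur in the ham message")
```

Word order plays no part in this comparison, which is exactly the weakness being pointed out.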
... and try to do that with /bin/echo !!!!
Usually I don't compress my spam.
;-)
I delete it.
This will save me a lot more space
Privacy is terrorism.
A Bayesian spam filter uses an underlying order-0 Markov model of email messages. gzip uses an underlying order-1 Markov model.
A Bayesian filter uses words as "symbols." gzip uses bytes as symbols.
The right thing to do would be to combine them. Take a gzip-style Markov model, using bytes as symbols and conditional probabilities, and plug it into a Bayesian filter. That would (1) make the filter more powerful and (2) make the filter applicable to any sort of data, arbitrary binary or readable text. Negligible computational overhead, sharper discrimination.
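One way to read that proposal is an order-1 model over bytes whose likelihoods feed a Bayes-style log-odds score. The sketch below is a guess at what is meant, not a description of how gzip or any existing filter works. The class and function names, the add-one smoothing, and the toy corpora are all assumptions made for illustration.

```python
import math
from collections import Counter

class ByteBigramModel:
    """Order-1 Markov model over bytes, with add-one smoothing."""

    def __init__(self, corpus: bytes):
        self.pair_counts = Counter(zip(corpus, corpus[1:]))
        self.prev_counts = Counter(corpus[:-1])

    def log_likelihood(self, text: bytes) -> float:
        """Sum of log P(byte | previous byte); the first byte is not scored."""
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            # Smoothed conditional probability over the 256 possible byte values
            p = (self.pair_counts[(prev, cur)] + 1) / (self.prev_counts[prev] + 256)
            total += math.log(p)
        return total

def spam_log_odds(message: bytes, spam_model, ham_model, prior_spam: float = 0.5) -> float:
    """Bayes rule in log form: prior odds plus the likelihood ratio."""
    prior = math.log(prior_spam) - math.log(1 - prior_spam)
    return prior + spam_model.log_likelihood(message) - ham_model.log_likelihood(message)

# Toy corpora, invented for illustration only
spam_model = ByteBigramModel(b"Buy cheap pills now! Limited offer, click here to order today. " * 20)
ham_model = ByteBigramModel(b"Attached are the meeting notes; let me know if Tuesday works for the review. " * 20)

print(spam_log_odds(b"cheap pills now", spam_model, ham_model))
print(spam_log_odds(b"meeting notes", spam_model, ham_model))
```

A positive score leans spam and a negative score leans ham. Because the symbols are bytes, the same code runs on binary attachments as well as text.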
Filtering is not a true spam solution. All it takes is one false positive, where a Really Important Email gets accidentally deleted, to totally destroy the value of any filtering system.
Given that, the alternative to having tagged emails automatically deleted is to collect them in a folder and scan the message senders and subject lines. If you're doing that, then the spammer is getting a pitch through to you in the subject line. This therefore does not lessen the incentive for the spammer, but simply causes him to change tactics and put his best pitch in his subject line.
Right now, I get 60-80 spams a day. What happens when I start getting 600-800 a day? Again, filtering starts to break down, because I have SO MANY messages to scan everyday that the possibility of me missing a legitimate one is very high.
Obviously it wouldn't be a big problem for the spammers to run their creative gems through gzip, and alter the content until they achieve lower compression ratio. Even including a bunch of garbage after the message might do the trick. I believe equivalent analysis can be done cheaper with non-gzip tools...
Save your wrists today - switch to Dvorak
When the spam is filtered at user-account level, you can only do it by parsing a single mail in some way and determining if it's spam or not. It's like trying to tell whether a movie is bad by looking at one picture. If the spam could be filtered at the server level, by comparing mails that are received into different accounts, you could really tell which ones are part of a mass-mail (spam).
One problem with this is the right to open other people's mail. But you could use some basic scrambling (rot-13) to make sure that no one sees the inside. It wouldn't make a difference to the comparing script.
Mailing lists might cause a problem too but wouldn't it be easier to allow the mailing lists you belong to than trying to block the ones you don't belong to?
As an example of how Sequitur works, the string 'abcabdabcabd' produces the following grammar rules:
1 -> 2 c 2 d
2 -> a b
The original string is then represented by the sequence: 1 1
The usage counts of the rules are available as output options.
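To check that grammar by hand, here is a small Python sketch that expands it back into the string and counts how often each rule is used. It reads the rules as rule 1 = "2 c 2 d", rule 2 = "a b", and the top-level sequence as "1 1". Labelling the top-level sequence as rule 0 is a convention chosen here, and this is not Sequitur's own code or output format.

```python
from collections import Counter

# Integers are rule references, strings are terminal symbols.
GRAMMAR = {
    0: [1, 1],            # top-level sequence
    1: [2, "c", 2, "d"],
    2: ["a", "b"],
}

def expand(symbol, grammar) -> str:
    """Recursively replace rule references with their right-hand sides."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])

def usage_counts(grammar) -> Counter:
    """How many times each rule is referenced from any right-hand side."""
    return Counter(s for rhs in grammar.values() for s in rhs if isinstance(s, int))

print(expand(0, GRAMMAR))
print(dict(usage_counts(GRAMMAR)))
```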
Seastead this.
It's about using the compression ratio as a measure of similarity between the message and a spam database, and not about redundancy within the message.
Do you mean that each time you can find dupes, that's spam ? Oh my god, poor /. ...
It's also been determined that the discussion in a typical Slashdot story compresses to less than half of a percent of its original size.
Karma: Good (despite my invention of the Karma: sig)
SPAM filtering will never work as good as done manually.. actually I've got spammed with porn.. and besides i really dont care about gettin' spammed i really dont care gettin' spammed with porn.. like i don't care about updates of virus killer x,y,z..
/. spams other servers.. by a unique way called /.'ing ...etc... etc.. etc.. it goes on.. and on... and on.. anyways.. and ppl get fucked..
i *do* care about using linux, regular backups and whitelist for my important emails only friend use to write to..
dont believe this anti spam hype celebrated here... most spam killers were written by ppl who're spammers themselves.. even
What an idea!
I could use this to avoid those people who keep saying the same thing all the time, over and over again...
Now, how can I convince my mother to use e-mail?
I just use one of those new fangled file compression utilities that you can apply recursively to the compressed output, resulting in any arbitrary degree of compression one desires.
After at most 10 applications of said compression utility, all emails look like this:
"1"
I never see any spam.
-josh
-Mark
1: Get an email account with unlimited addresses.
2: when registering use a unique address e.g. slashdot@mydomain.com
3: Make sure you check/uncheck the give my email address to mailing lists.
4: If ever you get spam to that account get litigious.
Use something like mailinglists@mydomain.com, and block anything that doesn't come from mailing lists you've subscribed to.
thank God the internet isn't a human right.
It's inefficient to have so much memory overhead.
Besides, if I were a spammer, I could pad it with high entropy data at the end to make up for my warbling.
The only slight problem was that he doesn't drive :-)
Imagine women with WORKING spam filters.. they would never get fucked again!!!
So don't mess with the science of spam..
.. sounds like a poor idea to me. Yes, you can measure the amount of redundancy in a message, but:
a) Spammers might not always use messages redundant enough to be detectable from regular text.
b) If I happened to use some words a little too often, especially when writing mails discussing technical stuff or posting computer code fragments, would that be classified as spam?
I think this is a nice filter when sorting out more or less repetitive mails (spam or not) from novels, but a filter based on a spam database sounds better to me.
Beware: In C++, your friends can see your privates!
K5 has been having troubles with speed over the past couple of weeks. I'm sure this will make it much better.
It's good to know that you're using your power for good and not for evil. Oh wait, last time I checked blindly flooding a community run site into oblivion by sending 250,000 people our way is evil.
Thank you very much Timothy.
Wouldn't the CPU resources consumed by this process make it useless in the real world? I can't imagine compressing all of our incoming mail just to check if it's spam or not, the CPU usage would skyrocket (and it's already high with all the av and routing filters that we have). Right now we're using RBL filtering along with some content based filters and catching 99.9% which is a whole lot.
Another moron that didn't read the article.
The proposal is not to see how compressible is the message but to use a compression tool to see how lookalike the message is to a corpus of spam.
A couple of posts above state that spammers will "just adjust their tactics." Talk like this always puzzles me; on the spammer's side, does this not help him? If I'm selling a combination weight loss drug/mail order bride/penis enlarger/cable descrambler for only three payments of $49.99 in such a manner that every spam blocker in the world filters me, logically I'm only being filtered by people who know better than to buy my "product," thus not irritating them, in effect helping to slow regulation, and I don't lose touch with any significant chunk of my target demographic. Of course, this applies with the exception of corporate environments or similar situations where Joe Insecure has someone else managing spam.
Can anyone share some +5 Insight on the matter?
Bored with karma, be a fan/freak
> stupid filtering isnt gonna get you rid of spam... go complain at spammers upstream providers...
:-)
Filters only work to a limited extent, and so might shutting down the spammers, if it were possible.
But neither is going to solve this problem.
The only solution I can think of is wide-spread adoption of PGP (or equivalent) aware mailers and certification of mail.
The problem with mail addresses is that you have no control over their spread. If I give one to a company it'll usually leak out in the end and it's open season on my inbox.
However if "genuine" mail is certified and mailers use certification validity as a filtering criterion then it simplifies the game hugely.
Your mailer can spot the people you've genuinely given your address to, and naturally "distrust" uncertified (effectively anonymous) mail or mail whose certificate has been revoked or is unknown to you.
The "only" things standing in the way of this are:
1. Slow adoption of certification/encryption in mass market mailers. Usually poor or missing.
2. Cost/difficulty of getting a valid certificate (e.g. with Verisign).
3. The pain of typing a password every time you send a mail.
4. It only works if everyone joins in.
But nothing's for free and this strikes at the heart of email's usability.
I'm continually surprised by the lack of certification use at least by large corporations and governments, but I suppose it removes plausible deniability.
This will get modded down, but here goes In Soviet Russia, gzip spams j00! As said above, this is interesting, but not particularly precise... The only way to stop spam, is to prosecute the spammers - just like the anti fax spam legislation did, we need the same principle applied to email... I'm not a supporter of more laws, and more legal BS, but I think in this case it's an acceptable trade-off ;beer;
fits_in_little_blue_can ? "spam" : "ham"
Just to weed out the flood of duplicate stories.
So if the message compresses very well together with spam, then it's similar to spam, and if it doesn't, then it's not similar to spam.
Once you had the information, you could adjust the threshold of the test for optimal results, and figure out which tests were the best value.
In any case the result is that you end up with screening tests that have a lot of false positives, backed up by more expensive tests applied to all the positives to find the real problems.
You could do the same thing with spam. You'd need to assign a cost to the false negatives (missing the job offer), and the false positives (deleting spam that passed the filter), and adjust the filter accordingly. (Assuming the cost of the tests, in cpu, are negligible, which is different from the medical example.)
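Adjusting the threshold by cost could be sketched like this in Python. Everything here is hypothetical: the function name, the scores, and the cost figures. To sidestep the false positive / false negative naming, the two kinds of mistake are called "lost ham" (a real message thrown away) and "missed spam" (spam that got through).

```python
def best_threshold(scored, cost_lost_ham, cost_missed_spam):
    """Pick the cutoff with the lowest total cost on labelled examples.

    scored is a list of (score, is_spam) pairs; a message is flagged as
    spam when its score is >= the threshold.
    """
    # Every observed score is a candidate, plus "flag nothing at all"
    candidates = sorted({score for score, _ in scored}) + [float("inf")]

    def total_cost(threshold):
        lost_ham = sum(1 for s, is_spam in scored if s >= threshold and not is_spam)
        missed_spam = sum(1 for s, is_spam in scored if s < threshold and is_spam)
        return lost_ham * cost_lost_ham + missed_spam * cost_missed_spam

    return min(candidates, key=total_cost)

# Made-up spam scores with known labels
data = [(0.1, False), (0.3, False), (0.45, True), (0.6, False), (0.7, True), (0.9, True)]

# Losing the job offer is far worse than reading one spam: cutoff moves up
print(best_threshold(data, cost_lost_ham=100, cost_missed_spam=1))
# Spam is the bigger nuisance: cutoff moves down
print(best_threshold(data, cost_lost_ham=1, cost_missed_spam=10))
```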
-- ac at work
RBL blocks a lot of stuff that isn't spam. It's probably a better idea to go with bayesian filtering. You can read up on it here: http://www.paulgraham.com/better.html
"And we have seen and do testify that the Father sent the Son to be the Savior of the World"
1 John 4:14
Unfortunately, using this my girlfriend would never get any of my emails.
"I'm sorry. Really, really, really, really sorry. I'm so very, very, very sorry. I'm sorry..."
Have two folders,
1->check all the time, from ONLY those who I accept in my list.
2->rest of the stuff. Spam+Unknown senders...
Now have quick graphical interface selecting which is spam and which is not.
3->not spam add list.
4->spam. Add address for filtering 2nd folder.
Add spam message for VERY liberal filter.
(If message is almost exact with a previously received spam ignore it.)
Emacs is good operating system, but it has one flaw: Its text editor could be better.
It is official; HP confirms: Algerbraic is dying!
One more crippling bombshell hit the already beleaguered Algerbraic community when HP confirmed that Algerbraic calculator usage has dropped yet again, now down to less than a fraction of 1 percent of all professionals. Coming on the heels of a recent hpcalc.org survey which plainly states that algerbraic notation has lost more market share, this news serves to reinforce what we've known all along. Algerbraic is collapsing in complete disarray, as fittingly exemplified by failing dead last [hpcalc.org] in the recent HPcalc.org speed trials.
You don't need to be a Kreskin [amdest.com]to predict alberbraic's future. The hand writing is on the wall: Algerbraic faces a bleak future. In fact there won't be any future at all for algerbraic because it is dying. Things are looking very bad for algerbraic. As many of us are already aware, it continues to lose market share. Red ink flows like a river of blood.
TI's algerbraic calculator development team is the most endangered of them all, having lost 93% of its core engineers. The sudden and unpleasant departures of long time algerbraic's developers Casio and Sharp only serve to underscore the point more clearly. There can no longer be any doubt: Algerbraic is dying.
Let's keep to the facts and look at the numbers.
RPN supporter Jean-Yves Avenard states that there are 70000 propfessional users of calculators. How many users of algerbraic are there? Let's see. The number of RPN versus algerbraic posts on comp.sys.hp48 is roughly in ratio of 500 to 1. Therefore there are about 70000/500 = 14 algerbraic users. Sharp DAL (Direct Algerbraic logic) posts on Usenet are about half of the volume of plain algerbraic posts. Therefore there are about 7 users of DAL. A recent article put DAL at about 50 percent of the algerbraic market. This is consistent with the number of DAL Usenet posts.
Due to the troubles of mismatched brackers, excessive keystrokes and so on, algerbraic went out of favor with TI and was taken over by Casio who sell another troubled calculator. Now Casio is also dead, its corpse turned over to cheap chinese calculator manufactures.
All major surveys show that alg has steadily declined in market share. Algerbraic is very sick and its long term survival prospects are very dim. If Algerbraic is to survive at all it will be among vintage calcululator collectors. Algerbraic continues to decay. Nothing short of a miracle could save it at this point in time. For all practical purposes, Algerbraic is dead.
Fact: Algerbraic is dying
If the lamer filter can't stop these posts, what can it do?!
...You see, it's just that we're putting the new cover sheets on all TPS reports from now on, so if you could just go ahead and do that for me, that would be great. And I'll make sure you get another copy of that memo.
"We are far too easily pleased." --C.S. Lewis
I received a nice piece of spam the other day. I didn't read it but I usually scroll to the bottom to see if they have the mandatory (in some places mandatory I think) unsubscribe method. This method sure gets me mad -
To unsubscribe by postal mail, please send your request to:
P.O Box 272521
Boca Raton, FL 33427
Ref # XXXXXX -- scd
(XXXX.. replaced real reference number)
It seems that the unsubscription method doesn't have to be by email - just as long as it's by something and it's there. They musn't be specific in the law. Of course, no one is going to go write a letter by snail mail to unsubscribe to spam, although sending them some dog shit through the mail is tempting. I forgot the site that provides that service. Hrmm I should change my sig.
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
The site is not down. It has not been slashdotted. Are you just Karma whoring? I think so.
This is just a blatant copyright violation.
-- Hulver's site
Oh man, my girl-friend will never see a love letter from me anymore, because I just say:
"I love You, I love You, I love You
The fact is, that unless your SPAM corpus and HAM corpus are both under 32k, this won't work. Gzip is fast because it only has a 32k sliding window, meaning that it only searches for like strings in a 32k window around what you're currently compressing. Hate to break it to you, but 32k is not enough for a corpus. I think Bzip2 uses something larger (900k?), but I forget what it is.
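The window limit is easy to see for yourself. Here is a rough Python sketch, assuming zlib's DEFLATE (default 32 KiB window) behaves like command-line gzip for this purpose. It appends a second copy of a random 4 KiB block either 8 KiB or 64 KiB after the first copy and measures how many extra compressed bytes the second copy costs. The sizes and names are arbitrary choices for illustration.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

def random_bytes(n: int) -> bytes:
    """Incompressible filler, so only the repeated block can be matched."""
    return bytes(rng.getrandbits(8) for _ in range(n))

block = random_bytes(4096)
near_gap = random_bytes(8 * 1024)   # first copy stays inside the 32 KiB window
far_gap = random_bytes(64 * 1024)   # first copy has slid out of the window

def repeat_cost(gap: bytes) -> int:
    """Compressed bytes added by a second copy of block placed after gap."""
    return len(zlib.compress(block + gap + block)) - len(zlib.compress(block + gap))

print("repeat 8 KiB back: ", repeat_cost(near_gap), "bytes")
print("repeat 64 KiB back:", repeat_cost(far_gap), "bytes")
```

The nearby repeat should cost far less than the distant one, which costs about as much as fresh random data. If that carries over to the article's setup, only the last 32 KiB or so of a 1-2 megabyte corpus can actually be matched against the appended message.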
I'll be happy with spam assassin until I get CRM114 (and mailfilter) trained and working.
My Slashdot account is old enough to drink...
German newsticker heise had a similar article a year ago, although it does not cover spam explicitly.
The article has a link to another article published in "Physical Review Letters" which deals with the topic of identifying content/author by applying compression algorithms.
The underlying idea is that LZ77 compressed data is near to the entropy of a message.
If I were a spammer, I couldn't care less if some nerd using string entropy calculation filters out my spam, because said nerd using weird home grown filtering is also more likely to a.) not reply anyway b.) submit my open relays to blackhole lists c.) complain to my ISP etc. etc.
If I were a spammer I'd concentrate more on trying to get average users to open my mail even though they've learned that Cindy's "Haven't seen you in ages, JOE23" Emails aren't real. And how to circumvent whatever anti-spam measures come installed in JOE23's AOL software.
Anyways, some geek in his dorm room is not likely to have enough money to buy penis prosthetics anyway and can also figure out how to jerk off to free thumbnail-pics.
If spammers started padding their mail with high entropy data I would set up a filter that filters out mails based on how close the character recognition is to standard English HTML-formatted mails, and discards random junk.
But then spammers would start not just using high entropy material from /dev/srandom (really nerdy spammers themselves, who know not to trust /dev/random) but generating random characters with similar characteristics as English.
Then the antispammer would have to use fuzzy-logic spell-checking and the spammer would have to start using random words out of the dictionary and finally spammers would be left with no other option than to send me really nice personalized eCards that say "Happy Birthday!" with a little singing chicken, because I haven't found a way to filter those yet. I can only filter spam with mammals
Bayesian only refers to how you use the probabilities.
Now gzip implements similar ideas to LZW compression, which uses variable sized prefixes, which is quite different from an order-1 Markov model. For example, an order-1 Markov model is not allowed to depend on more than the current and immediately preceding symbol.
Who needs all of these complicated schemes? I just filter the sending domains as they come. Filter every sender containing "specials", "optin", "offer", "special", "deal", "email", "reward", "value", "promotion" and "super", and all subject lines starting with "friend", and 85% is taken care of right away. So far my formula has had no false positives.
To counteract that you could also create a second zip containing legitimate emails. The spam mails (even with randomness) should compress better in the Spam zip than the other....
This is exactly the point that it makes.
I came to this realization driving home from work one night. My immediate follow-up thought was, why not make email addresses disposable, with a nice automated interface to control which ones will fwd to your "real" mailbox? I had worked out a rough framework for how I'd implement this at a site-wide level by the time I got home, only to discover that I wasn't the first one to come up with the idea. A quick google search on "disposable email address" found about half a dozen services that do (more or less) what I'd hashed out.
Doesn't solve everything, but it does give you a lot more control when choosing what to put in the "email" form when you buy something online
But then spammers would start not just using high entropy material from /dev/srandom (really nerdy spammers themselves, who know not to trust /dev/random)
/dev/srandom? I only have /dev/urandom!
/dev/urandom is the device that (historically) wouldn't wait for additional entropy. /dev/random is the "more random" one. Nowadays they are both essentially the same on all but the most archaic operating systems.
d00d! Where can I download
BTW,
IMMEDIATE ATTENTION NEEDED :
HIGHLY CONFIDENTIAL
FROM: GEORGE WALKER BUSH
DEAR SIR / MADAM,
I AM GEORGE WALKER BUSH, SON OF THE FORMER PRESIDENT OF THE UNITED STATES OF
AMERICA GEORGE HERBERT WALKER BUSH, AND CURRENTLY SERVING AS PRESIDENT OF
THE UNITED STATES OF AMERICA. THIS LETTER MIGHT SURPRISE YOU BECAUSE WE HAVE
NOT MET NEITHER IN PERSON NOR BY CORRESPONDENCE. I CAME TO KNOW OF YOU IN MY
SEARCH FOR A RELIABLE AND REPUTABLE PERSON TO HANDLE A VERY CONFIDENTIAL
BUSINESS TRANSACTION, WHICH INVOLVES THE TRANSFER OF A HUGE SUM OF MONEY TO
AN ACCOUNT REQUIRING MAXIMUM CONFIDENCE.
I AM WRITING YOU IN ABSOLUTE CONFIDENCE PRIMARILY TO SEEK YOUR ASSISTANCE IN
ACQUIRING OIL FUNDS THAT ARE PRESENTLY TRAPPED IN THE REPUBLIC OF IRAQ. MY
PARTNERS AND I SOLICIT YOUR ASSISTANCE IN COMPLETING A TRANSACTION BEGUN BY
MY FATHER, WHO HAS LONG BEEN ACTIVELY ENGAGED IN THE EXTRACTION OF PETROLEUM
IN THE UNITED STATES OF AMERICA, AND BRAVELY SERVED HIS COUNTRY AS DIRECTOR
OF THE UNITED STATES CENTRAL INTELLIGENCE AGENCY.
IN THE DECADE OF THE NINETEEN-EIGHTIES, MY FATHER, THEN VICE-PRESIDENT OF
THE UNITED STATES OF AMERICA, SOUGHT TO WORK WITH THE GOOD OFFICES OF THE
PRESIDENT OF THE REPUBLIC OF IRAQ TO REGAIN LOST OIL REVENUE SOURCES IN THE
NEIGHBORING ISLAMIC REPUBLIC OF IRAN. THIS UNSUCCESSFUL VENTURE WAS SOON
FOLLOWED BY A FALLING OUT WITH HIS IRAQI PARTNER, WHO SOUGHT TO ACQUIRE
ADDITIONAL OIL REVENUE SOURCES IN THE NEIGHBORING EMIRATE OF KUWAIT, A
WHOLLY-OWNED U.S.-BRITISH SUBSIDIARY.
MY FATHER RE-SECURED THE PETROLEUM ASSETS OF KUWAIT IN 1991 AT A COST OF
SIXTY-ONE BILLION U.S. DOLLARS ($61,000,000,000). OUT OF THAT COST,
THIRTY-SIX BILLION DOLLARS ($36,000,000,000) WERE SUPPLIED BY HIS PARTNERS
IN THE KINGDOM OF SAUDI ARABIA AND OTHER PERSIAN GULF MONARCHIES, AND
SIXTEEN BILLION DOLLARS ($16,000,000,000) BY GERMAN AND JAPANESE PARTNERS.
BUT MY FATHER'S FORMER IRAQI BUSINESS PARTNER REMAINED IN CONTROL OF THE
REPUBLIC OF IRAQ AND ITS PETROLEUM RESERVES.
MY FAMILY IS CALLING FOR YOUR URGENT ASSISTANCE IN FUNDING THE REMOVAL OF
THE PRESIDENT OF THE REPUBLIC OF IRAQ AND ACQUIRING THE PETROLEUM ASSETS OF
HIS COUNTRY, AS COMPENSATION FOR THE COSTS OF REMOVING HIM FROM POWER.
UNFORTUNATELY, OUR PARTNERS FROM 1991 ARE NOT WILLING TO SHOULDER THE BURDEN
OF THIS NEW VENTURE, WHICH IN ITS UPCOMING PHASE MAY COST THE SUM OF 100
BILLION TO 200 BILLION DOLLARS ($100,000,000,000 - $200,000,000,000), BOTH
IN THE INITIAL ACQUISITION AND IN LONG-TERM MANAGEMENT.
WITHOUT THE FUNDS FROM OUR 1991 PARTNERS, WE WOULD NOT BE ABLE TO ACQUIRE
THE OIL REVENUE TRAPPED WITHIN IRAQ. THAT IS WHY MY FAMILY AND OUR
COLLEAGUES ARE URGENTLY SEEKING YOUR GRACIOUS ASSISTANCE. OUR DISTINGUISHED
COLLEAGUES IN THIS BUSINESS TRANSACTION INCLUDE THE SITTING VICE-PRESIDENT
OF THE UNITED STATES OF AMERICA, RICHARD CHENEY, WHO IS AN ORIGINAL PARTNER
IN THE IRAQ VENTURE AND FORMER HEAD OF THE HALLIBURTON OIL COMPANY, AND
CONDOLEEZA RICE, WHOSE PROFESSIONAL DEDICATION TO THE VENTURE WAS
DEMONSTRATED IN THE NAMING OF A CHEVRON OIL TANKER AFTER HER.
I WOULD BESEECH YOU TO TRANSFER A SUM EQUALING TEN TO TWENTY-FIVE PERCENT
(10-25 %) OF YOUR YEARLY INCOME TO OUR ACCOUNT TO AID IN THIS IMPORTANT
VENTURE. THE INTERNAL REVENUE SERVICE OF THE UNITED STATES OF AMERICA WILL
FUNCTION AS OUR TRUSTED INTERMEDIARY. I PROPOSE THAT YOU MAKE THIS TRANSFER
BEFORE THE FIFTEENTH (15TH) OF THE MONTH OF APRIL.
I KNOW THAT A TRANSACTION OF THIS MAGNITUDE WOULD MAKE ANYONE APPREHENSIVE
AND WORRIED. BUT I AM ASSURING YOU THAT ALL WILL BE WELL AT THE END OF THE
DAY. A BOLD STEP TAKEN SHALL NOT BE REGRETTED, I ASSURE YOU. PLEASE DO BE
INFORMED THAT THIS BUSINESS TRANSACTION IS 100% LEGAL. IF YOU DO NOT WISH TO
CO-OPERATE IN THIS TRANSACTION, PLEASE CONTACT OUR INTERMEDIARY
REPRESENTATIVES TO FURTHER DISCUSS THE MATTER.
I PRAY THAT YOU UNDERSTAND OUR PLIGHT. MY FAMILY AND OUR COLLEAGUES WILL BE
FOREVER GRATEFUL. PLEASE REPLY IN STRICT CONFIDENCE TO THE CONTACT NUMBERS
BELOW.
SINCERELY WITH WARM REGARDS,
GEORGE WALKER BUSH
Switchboard: 202.456.1414 Comments: 202.456.1111 Fax: 202.456.2461 Email:
president@whitehouse.gov
Trolling is a art,
Agreed that this is not the best way to filter spam... it is fraught with peril.
What I was suggesting is that ISPs actually employ these methods... thus the average user will not even know they were spammed. (Most ISPs employ a troop of Geeks who know how to do: "strings /dev/random")
Personally I prefer an active approach (such as ASK), preferably one with features that have a minimal impact on legitimate users. I still receive about 30 spam mails a day, but with a combination of my ISP's anti-spam system and my active spam protection, I see only about 1 every month.
There was a paper published in PRL a couple of years ago that wanted to identify languages using gzip (Benedetto et al: Language Trees and Zipping). It sure sounded cool, but was quickly forgotten when Joshua Goodman took a closer look (link is down at the moment, probably IIS, Text version in Google Cache).
Where's a good easy to digest description of this? It's pretty interesting.
Eat at Joe's.
The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.
There is a minor problem with this sentence. And with this whole gzip business. It is misleading. Words, phrases? You cannot force gzip to match words; gzip tries to exploit every similarity it finds, even at the character level. E.g., if your "spam dictionary" contains the words sex and pants, mail about sextants will have a good compression ratio. And there is no way to prevent this. That's why the Bayesian filters (operating on words) outperform gzip by a league. That's (one of several reasons) why I think this article belongs not on /. but in a wastebin instead. It simply presents a worse way of doing something. Interesting idea, yes, but that's all.
(Just FYI: it has been proved that the bzip2 algorithm, due to Burrows and Wheeler, exploits all such repetitions in the input file nearly optimally -- within some small ratio. Hence, it is even worse to use it as a spam filter :-)
-- I was raised on the command line, bitch
find repeat posts on slashdot!
If all the email clients had a little button saying "This is Spam", and clicking it sent the mail to some nice spam blacklist agency, they'd wait for about 10 people to do this, then verify it for the spam it is, and then A) blacklist the spammer and B) send anti-spam email (subject: spam sender here). Nice and easy :)
d0rk! Ignoring the fact that I was being sarcastic and artistic license would have permitted me to specify /dev/my_ass, let me just say this: before you make statements trying to make people look stupid you should probably have a clue what you're talking about.
While true that your measly Linux machine has no /dev/srandom, this device is the source for _s_ecure random data on OpenBSD and it's probably available some other places as well. Some random trivia (pun intended), checking around I noticed: AIX and Solaris both don't typically have /dev/random at all.
But anyway, back to your question: if you're sad you don't have /dev/srandom you could try the following:
ln -s /dev/srandom /dev/zero
and I can freely distribute these addresses, because when I get spam (not free pr0n) sent to freeporn.com@phor.net, I can just block them.
in your AIM profile, you can also link to %n@phor.net which is their screenname. Then you can trace them easily.
-- I was raised on the command line, bitch
The other day I hacked together a script, similarity, which uses gzip compression to work out how similar two files are. I find this useful when searching for almost-duplicate files.
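For anyone who wants to try the idea, here is a minimal sketch of how such a comparison might work. It is not the poster's script: the function names are made up for illustration, and it assumes the normalized compression distance as the similarity measure, with Python's zlib standing in for gzip.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes at zlib's highest compression level."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance.

    Close to 0 for near-duplicates, close to 1 for unrelated data.
    """
    ca, cb = csize(a), csize(b)
    return (csize(a + b) - min(ca, cb)) / max(ca, cb)


def file_distance(path_a: str, path_b: str) -> float:
    """Distance between the contents of two files (hypothetical helper)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return ncd(fa.read(), fb.read())
```

Because deflate only looks back 32 KB, this is likely to work best on files well under that size; for larger files a compressor with a bigger window would probably be needed.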
-- Ed Avis ed@membled.com
Just to keep on bickering (sorry, bad habit): strings /dev/random wouldn't work 'cause my super duper filter checks for the proper distribution of letters, i.e. more e's than q's and, 'cause it's spam, lots of HTML thingies.
You're right on the money, though, as far as filtering at the ISP is concerned; that's where the most benefit would be for the end-user. I see two problems, though.
First, the ISP has to pay for the bandwidth of the incoming email and spend money on filtering, but then isn't rewarded with more time/bandwidth consumed by their clients. Secondly, I think they'd be deathly afraid of inadvertently filtering out some false positives and being sued.
Think what would happen if some marketing department tries to send their customer the rough draft of a mailing and it keeps getting eaten by the ISP's spam filter.
This is something that is easily implementable, backwards-compatible (you don't *need* to read the MIME attachment to check for validity) and trustworthy.
Negative side effects are that if manual password entry is disabled, viruses can use your mail. (A counter measure would be to have the e-mail specify if the password was cached or manually entered)
Please let me know if this has been implemented in a mail program yet.
-- I was raised on the command line, bitch
Is anyone else considering just blocking ALL email coming from hotmail? I know it sounds draconian, and I actually have 3 or 4 friends that would be put out but it seems that about half of my spam these days is coming from hotmail accounts.
Perhaps if the word got out that people were blocking hotmail accounts they would clean things up a bit.
Another major source of spam here is .br. Since I don't speak Spanish or whatever that gobbledygook is, I have a rule that autodeletes everything coming from .br.
Every wrong attempt discarded is a step forward - T. Edison
There's a good chance that a Bayesian filter will do better than you...out of about 8000 emails, Paul Graham says he missed exactly two legitimate emails, both of which were kinda marginal anyway, and he filtered out 99.7% of his spam.
Sorry, but I don't see how this is anything different from just another spin on Bayesian Statistical filtering of spam that everyone's been playing with.
It's hardly patentable. But it is interesting to see. But, once you look at it, not surprising.
That's why the Sheisterizer 0.98 BETA ( by Cuisinart ) was created!
Using the Sheisterizer you too can turn out incomprehensible and threatening sounding letters in your own kitchen ( or wherever you keep your computer ) for a fraction of the cost and effort it used to take.
Sheisterizer's legalese generator is guaranteed to produce the most convoluted and obfusticated prose possible liberally sprinkled with obscure and tedious-to-look-up jargon and outdated phraseology plus intimidating references to laws including the DMCA. Latin quotes of ancient Charlemagnian law are used to illustrate the applicability of irrelevant environmental laws to magnify the Quid Pro Quo nature of the implied infringement of the copyrights of ExtremeGonzoPorn film company by using quotes from six different movies created on dates ranging from 3483-3504 ( Chinese Zodiac Calendar ) and DMCA violations in connection with the utilization of a computing device ( Cogito Ergot Rye & Dewey Chetham and Howe L.S.D ) to circumvent the advanced delete on subject read and data streamed on screen button depression situation of the named illustrious and magnanimous institution of higher sloth. Not to mention the manufacture of illicit Schedule I substances and violations of the Mann Act and various and sodomy laws in Louisiana's more conservative parishes. Which is why the state of Rhode Island will probably revoke their driver's licenses and extradite them to Saudi Arabia to stand trial for jaywalking.
The Sheisterizer simulated Neural Net A.I. guarantees that each sentence is impossible to decipher and meaningless but intimidating.
Beta testers that upgrade to the full release will get the opportunity to beta test Sheisterizer 2 which will include the new Illegal (Registered Trademark) lawyerizing encryption for your sensitive files too!
Eat at Joe's.
I'm mostly guessing it's Russian. I don't recognize it as any other language and it usually comes from an unmasked .ru domain.
In Soviet Russia...
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
No problem, it's not bickering if I'm wrong... -grin-
A friend of mine once sent me mail using Caesar's algorithm (ROT13), which I got pretty easily... then he decided to make a random scrambling... so I proceeded to adapt my program using the probabilistic distribution of letters in English (I used texts from HHG as my source reference -grin-) to automatically descramble any such texts. I've also written software that defines recursive rules for any type of language structure, and added the letter distribution probability to this... and voilà!
PS: Here was the order of probabilities:
Fortune : etaoinsrhlducmygfwpbvkxjqz
HHG1 : etaoinhsrdlucgmwyfpbvkxzjq
HHG2 : etaoinhsrdlucgmwfypbvkzxjq
HHG3 : etaiohnsrdlucwgmfypbkvxjzq
HHG4 : etaoihnsrdluwcgyfmpbkvxjzq
Chaucer : ethoansridlywufmgcbpkvqjxz
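A rough sketch of how an ordering like the ones above could be computed, and how it might seed a first guess at a scrambled alphabet. This is not the poster's program; both function names are hypothetical, and real descrambling would need refinement beyond this rank-for-rank guess.

```python
from collections import Counter
import string


def letter_order(text: str) -> str:
    """Return the letters a-z that occur in `text`, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def first_guess_key(ciphertext: str, reference_order: str) -> dict:
    """Map each ciphertext letter to the reference letter of the same frequency rank."""
    return dict(zip(letter_order(ciphertext), reference_order))
```

Feeding `letter_order` a large English text should give something resembling the rows above, and `first_guess_key(scrambled, "etaoinsrhlducmygfwpbvkxjqz")` gives a starting substitution table that usually still needs fixing up by hand or with a dictionary.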
PS: I can email you a copy of these programs if you want... just email me at lailoken on freeshell.org
On the topic of ISPs implementing this... if they do get a false positive, then the source user will get a bounce, and the sender can always find another method to get it to you. Really important stuff mostly doesn't get sent by email anyway...
Besides one can always have opt-out policies regarding spam filtering... thus protecting the ISPs... my ISP does this.
Once again you are right about the last point, but an active approach would solve this...
Don't use this filtering if you're a high school teacher or something else that involves getting messages from teenagers..
[E-mail from skittles9333@some.email marked as spam and deleted] So like, I was like sick, and like, I didn't go to school today. So like, I was told like, that Jim like said, that like you might like, have some homework due like tomorrow. Could you like, tell me what like that homework would like be?
I have no idea why but I receive a lot of spam in korean.
Quit playing Monopoly with Bill. Switch to one of many non-Microsoft products today.
One good thing about foreign-language spam.. You can't read all the words, so you get to look at all the hott asians
Doesn't work for the Lameness Filter, won't work for spam .
--sdem
Not really very often, although since I have an email account on a German provider I have gotten a slight bit of German spam. I think a lot of it comes from "sign up" sites, unless you have a strongly public-visible website with your email address on the main page (damn trafficmagnet ads) - most companies in other countries probably aren't going to bother to pick up your email address if they don't expect you to understand the language.
Since a large portion of popular sites online are in English, it stands to reason that when you sign in with your email address on an English site, it gets added to an English spam list. Since I don't sign up on any Korean/Swiss/etc. sites, they haven't gotten my email address yet (or don't care about it).
That being said, people in N. America and English-speaking countries do get a lot of spam in English from foreign servers - which is where IP range blocklists and spamassassin come in handy.
What about having a filter check all your accounts at once? If you're receiving the same email on more than one account, chances are it's spam.
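A minimal sketch of that cross-account check, assuming the messages have already been fetched as (account, body) pairs. The function names are invented for illustration, and exact-match hashing like this would miss spam that is personalized per recipient.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash of the body with case and whitespace differences ignored."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def seen_on_multiple_accounts(messages):
    """messages: iterable of (account, body) pairs.

    Returns the fingerprints that arrived on two or more distinct accounts.
    """
    accounts_by_print = defaultdict(set)
    for account, body in messages:
        accounts_by_print[fingerprint(body)].add(account)
    return {fp for fp, accounts in accounts_by_print.items() if len(accounts) > 1}
```

Mailing lists you subscribed to from two addresses would trip this as well, so it probably works better as one signal among several than as a delete rule.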
This must be my lucky day! I get only 0s! What're the odds to that??!!11!11!!! LOL!
This reminds me that about a year ago, three Italian scientists came up with a way to find species relatedness by using the zip algorithm. One takes the sequence of bacteria 1, and then attaches a little bit of bacteria X sequence to the end of that. Likewise, one attaches a bit of bacteria X sequence to the end of bacteria 2. Then each concatenation is zipped. The final compressed size of just the bacteria X part ended up telling us the homology (or relatedness) of bacteria X to bacteria 1 or 2.
But from reading all these posts, perhaps a Bayesian method would work just as well. There seems to be no inherent advantage to using zip. One still needs a reference piece of work (non-spam email, or bacteria 1) for comparing entropies or probabilities. Of interest also is that the researchers applied their method to generating an accurate language tree of Indo-European languages (grouped by relatedness, of course).
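The append-and-zip procedure described above can be sketched in a few lines. This is only an illustration under assumptions: zlib stands in for zip, the function names are made up, and the "size of just the X part" is estimated as the growth in compressed size when X is appended.

```python
import zlib


def appended_size(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes the message costs when appended to the corpus."""
    return len(zlib.compress(corpus + message, 9)) - len(zlib.compress(corpus, 9))


def closer_corpus(message: bytes, corpora: dict) -> str:
    """Name of the corpus to which the message appends most cheaply."""
    return min(corpora, key=lambda name: appended_size(corpora[name], message))
```

The same two functions cover both cases in this thread: pass `{"bacteria 1": ..., "bacteria 2": ...}` for the sequence comparison, or `{"spam": ..., "ham": ...}` for mail.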
The ref & abstract of above paper is here:
Phys. Rev. Lett. 88, 048702 (2002)
Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto
In this Letter we present a very general method for extracting information from a generic string of characters, e.g., a text, a DNA sequence, or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution, and language classification. ©2002 The American Physical Society
Of course, if you write in prose you will have a problem. I guess prayers, with repetitive phrasing would also be filtered out.
Tisha Hayes
At the risk of feeding a troll.... That was exactly what I was pointing out, that this aspect of the implementation does nullify this person's point! That by comparing the message with both spam and ham you reduce the possibility that spammers can get around this technique by just adding random noise.
Perhaps you should read people's comments more carefully before making stupid replies!
... can be universal. The principles used actually have their roots in the theories put forward by R. Solomonoff and Kolmogorov (links below). Any given string of bits can be assigned a "complexity" which is proportional to the length of the shortest program that can generate that string. It isn't usually computable BUT the size of the output file of a compression algorithm can be shown to be a reasonable if crude approximation. The beauty is that this approach (minimum description length or MDL) is clustering email in a very fundamental way without the biases that can be introduced with assumptions required by Bayesian techniques, and arguably making use of all the information (versus a subset chosen by the Bayesian user) contained in the email. Yes, the answers can be the same but the MDL approach is universal and the same classifier without modification could be used for broader clustering tasks, i.e. beyond binary classification of junk/not_junk to multi-class classification junk/best friend/mom/dad/wife/work/etc.
As an aside, since it could be fully automated it would be interesting to run such an algorithm with a graphical display, say a 2D plot of compression size vs time of day, just to see what shakes out.
By the way, the problematic portion for bioinformatics apps is the compression. DNA sequences often exhibit _expansion_ when put through the common compression schemes. Li has come up with a compression scheme that is more optimal called GenCompress.
Kolmogorov Complexity - http://www.idsia.ch/~marcus/kolmo.htm
Minimum Description Length - http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
Bioinformatics app - http://www.cs.ucsb.edu/~mli/sam.ps
GeneCompression Program - http://www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm
"Consensus" in science is _always_ a political construct.
All it takes is for one false positive on a Really Important Email and be accidentally deleted to totally destroy the value of any filtering system.
Huh? You drive around in cars all the time, in spite of the fact that if that system fails (which it not infrequently does) in the wrong way and at the wrong time...you die.
Technology occasionally fails. The only way to avoid technological failure is to avoid technology. (You'll still have failures: they just won't be technological ones.)
If you anticipate receiving a communication so Really Important that the consequences of accidentally spam-filtering it are catastrophic, you shouldn't be using e-mail anyhow. I would guess that my personal spam filtering has about the same average false-positive rate as the rate of drops of mail by my software, hardware, and upstream mail providers. At least with the spam-filtered messages, I can save them around and do post-mortem on them.
I'm a native English speaker, but because my first name is Turkish (Kaan), and many of my email addresses are based on my first name, I constantly get singled out as being a Turk and thus interested in Turkish spam. I get 5-10 pieces of Turkish spam every day (which, if you're curious, is just like English spam - phone cards, herbal crap, toner and computer parts, etc. - only it's written in Turkish).
I also get spam in various Asian languages (I've recognized Chinese mostly), but I have no idea why.
Well, we keep getting these anti-spam software stories on Slashdot, and I thought it was finally time to post my Sendmail ruleset.
Using this system of RBLs and header checks, I'm able to whitelist certain users/domains/IPs, as well as block serious offenders. In the past few months, I've received one piece of spam (which was subsequently unceremoniously blocked). The worst offender is the Klez virus, which actually sends valid headers (more or less) and is thus harder to filter with my ruleset.
Also, my ruleset will return a 553 error during the SMTP conversation... no accept-then-delete here. As an alternative, you might wish to use a more robust filter, such as Exim SpamAssassin at SMTP time.
Without further ado, here's the URL for my ruleset:
www.doorbot.com/guides/sendmail/antispam/
I ask that you go easy on my bandwidth as best you can... I'm on a 128kbit upload DSL.
This node at everything2 has a good description of the catfight this paper generated.
My amazing wife - Artist, Author, Philosopher - Laurie M
dude, you are so 1337.
I think that, with high probability, nobody will send you a mail SO full of $59.99 or $9.99 or $10.99 offers.
The trick to remove spam is to delete mails that contain more than 2 '9's in a row, possibly preceded by a $ sign.
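One way this rule might be written as a regular expression. The exact pattern is a guess at what was meant: it assumes "more than 2" means three or more 9s, and that a decimal point or comma between them does not break the run, so "9.99" counts.

```python
import re

# Three or more 9s in a row, allowing a decimal point or comma after each one,
# so "9.99" counts as a run of three. This reading of the rule is an assumption.
NINES = re.compile(r"(?:9[.,]?){3,}")


def has_run_of_nines(text: str) -> bool:
    """True if the text contains a run of at least three 9s."""
    return NINES.search(text) is not None
```

Under this reading "$59.99" and "$9.99" are caught but "$10.99" is not, and anything mentioning the year 1999 is caught too, which hints at how many false positives a rule like this would produce.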
-- There are two kind of sysadmins: Paranoids and Losers. (adapted from D. Bach)
Several knowledgeable people pointed out that the first try was limited by gzip's 32k window size, so I did a quick run with bzip2, which uses a 900k block, and put the results here. Somewhat different, but still a spread between spam/ham.
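The window limit is easy to see with a small experiment. The sketch below uses Python's zlib (the same deflate format gzip uses) and pseudo-random filler so that the only compressible thing is a repeated 2,000-byte passage; the sizes and seeds are arbitrary choices for the demonstration.

```python
import random
import zlib


def pseudo_random_bytes(n: int, seed: int) -> bytes:
    """Deterministic, essentially incompressible filler for the demonstration."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


message = pseudo_random_bytes(2_000, seed=1)
filler = pseudo_random_bytes(40_000, seed=2)

# Same content in both cases, but the earlier copy of the message sits either
# right before the final copy or more than 32 KB before it.
near = len(zlib.compress(filler + message + message, 9))
far = len(zlib.compress(message + filler + message, 9))

print("repeat within window:", near)
print("repeat beyond window:", far)
```

When the earlier copy is more than 32 KB back, deflate cannot refer to it, so the second copy costs roughly its full size again. With a corpus of any real size, only its last 32 KB can influence how well an appended message compresses.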
And, of course, do try this at home.
---- "If we have to go on with these damned quantum jumps, then I'm sorry that I ever got involved" - Erwin Schrodinger
Remember:
Tell them what you are going to tell them.
Tell them it again.
Let them know what you just told them.
Hmmm...
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
largenay ouray enispay inay ivefay easy inutesmay!
Unfortunately, I was never very good at latin...
Sorry, can't find references, but similar techniques have been used by a team of Italian researchers to determine which real life Dutch author published a book using a pseudonym. Something to do with an award for beginning authors, or suchlike.
Bahco.
-- The best way to accelerate a computer running Windows is at 9.8 m/s^2.
I guess this will really press spam out of existence
Ok, I decided to try it out and run my own statistics on it.
The good news is that with bzip2 it performs about the same as spamassassin. On my K6-200 BSD system it takes about the same time to process an email message as spamassassin does. Both take too much time for my taste but that is another issue. Performance is proportional to the size of the corpus.
It's the statistics that bothers me. There is no point in comparing the means (in ambiguous terms) without the standard deviation between groups.
So here is my data. I created a spam and ham corpus from half of my emails. Then wrote a quick script to pipe the other half through the program.
Basically the variance kills compressing with a spam corpus as a test because there is too much overlap between the ranges. More than half of my spam was within one standard deviation of the ham. The separation between distributions compressing with the ham corpus is ok but not that great.
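For anyone repeating the experiment, the overlap check described here could be computed along these lines. The helper names are hypothetical and the numbers at the bottom are made up purely to exercise the code; they are not the poster's measurements.

```python
from statistics import mean, stdev


def summarize(ratios):
    """Return (mean, sample standard deviation) of a list of compression ratios."""
    return mean(ratios), stdev(ratios)


def fraction_within_one_sd(values, reference):
    """Fraction of `values` within one standard deviation of the mean of `reference`."""
    m, sd = summarize(reference)
    return sum(1 for v in values if abs(v - m) <= sd) / len(values)


# Made-up numbers, only to show the calculation; not real results.
ham_ratios = [0.30, 0.34, 0.38, 0.42, 0.46]
spam_ratios = [0.40, 0.44, 0.52, 0.58]
print(summarize(ham_ratios))
print(fraction_within_one_sd(spam_ratios, ham_ratios))
```

If a large share of the spam lands inside the ham's one-standard-deviation band, a single threshold on the ratio cannot separate the two groups cleanly, which is the point being made above.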
That's not draconian, try my email filter rules:
1. If incoming email matches an email address in my address book, move to friends folder.
2. Otherwise, delete it.
I see no spam and get no false positives.
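The two rules above amount to a whitelist lookup. A minimal sketch, assuming the address book is a set of lower-cased addresses and that the decision is made from the From header alone (the function name and return values are invented for illustration):

```python
from email.utils import parseaddr


def route(from_header: str, address_book: set) -> str:
    """Return "friends" if the sender is in the address book, otherwise "delete"."""
    _, address = parseaddr(from_header)
    return "friends" if address.lower() in address_book else "delete"
```

For example, `route("Alice <Alice@Example.com>", {"alice@example.com"})` returns `"friends"`. The From header is trivially forged, so this keeps out strangers rather than impostors, and anyone not yet in the address book is silently dropped.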
Most of my non techie friends greatly enjoy sending HTML mail, whether using Outlook or sendmail, but they sure never promise me a bigger penis or firmer breast using 100% natural herbal pills.
HTML is definitely not a classifier of spam, at most one of computer illiteracy.
It depends on if you're trying to stop spam or go on some crusade to punish people who enable spamming. I think it's ridiculous to block mail from someone because they use the same ISP as someone who sells spamming software, and I certainly wouldn't want some unaccountable 3rd party doing it on my 'behalf', especially since it doesn't benefit me at all. (And, in fact, it actually harms me since I'm losing legitimate email.)
autopr0n is like, down and stuff.
And I think a lot of people would delete the first one as well. I would expect the sweepstakes people to call me.
autopr0n is like, down and stuff.
Is thinking Hotmail then writing sendmail a precursor sign for some mental disease?
The redundancy arises when compressing the email and a body of text you know contains SPAM...
IANAL but write like a drunk one.
Looks like either Dr. Seuss, or members of Monty Python...
The day Microsoft creates a product that doesn't suck, it will be known as the Microsoft Vaccuum Cleaner!
For me lately, I get about a 50/50 mix of English and Brazilian spam, with the occasional (maybe 10% of total spam) "gibberish" Asian character-set mail.
Caveat Emptor is not a business model.