Filter-foiling Gibberish Becoming A Spam Staple

Posted by timothy on Tuesday January 13, 2004 @02:16PM from the re:-claire-yum-donut-manhattan-regrets-cute dept.

hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."

Spamkiller doesn't care by Frisky070802 · 2004-01-13 14:19 · Score: 5, Interesting

My McAfee SpamKiller ignores the white noise, and simply nukes all the mail containing viagra, etc.

--
Mencken had it right. So glad that's old news.
1. Re:Spamkiller doesn't care by letxa2000 · 2004-01-13 16:16 · Score: 5, Interesting
  
  The encoding V*I*A*G*R*A would break out to the letters V I A G R and A.
  V: 76.9% Spam score
  I: 47.2% spam score
  A: 68.8% spam score
  G: 72.2% spam score
  R: 72.2% spam score
  On balance, if I get a message with the individual "words" of V, I, A, G, R, and A, that's going to be leaning towards spam.
  That's the beauty of Bayesian. Anything the spammers do will eventually come back and bite them in the butt. Even some of the "random words" they are starting to use are getting high spam scores:
  WHEREUPON: 99.9999%
  NEOCONSERVATIVE: 99.9999%
  LIBERAL: 74.3%
  LIBERTY: 84.0%
  MEGATON: 99.9999%
  METHANE: 99.9999%
  These are just a few of the "random words" I found in recent spams and, interestingly, the random words they are using are actually INCREASING their spam probability.
  Statistically, it's a lost cause for the spammers, they just don't realize it yet.
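The way per-token scores like the ones above add up can be sketched roughly as follows. This is a minimal illustration that assumes the common naive-Bayes combination rule (product of the probabilities against product of their complements). The comment does not say which rule the poster's filter uses, and the function name `combine` is made up for this example. The five letter scores are the ones quoted above.

```python
import math

def combine(probs):
    """Combine per-token spam probabilities, assuming tokens are independent."""
    # Clamp so that a probability of exactly 0.0 or 1.0 never produces log(0).
    clamped = [min(max(p, 1e-6), 1 - 1e-6) for p in probs]
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1 - p) for p in clamped)
    # Same value as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    return 1 / (1 + math.exp(log_ham - log_spam))

# Scores quoted in the comment for the single-letter "words".
letter_scores = {"V": 0.769, "I": 0.472, "A": 0.688, "G": 0.722, "R": 0.722}
score = combine(letter_scores.values())
print(f"combined spam probability: {score:.3f}")
```

Under that assumed rule, several mildly spammy tokens reinforce one another, which is consistent with the poster's point that the split-up letters still lean towards spam.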
The problem with this technique by pclminion · 2004-01-13 14:27 · Score: 5, Interesting

The problem with this technique for foiling spam filters is that Bayesian filters only examine words which occur in the dictionary of commonly used words. A Bayesian filter is individually trained on your personal mail. If the "red herring" words in the spam don't occur in your personal dictionary, they will be ignored by the filter and have no impact on its decision.
For example, take the word "Byzantine." This is a very non-spammish word. However, if you've never received a legitimate email containing the word "Byzantine," your Bayesian filter will not have it in its dictionary, and the word will be ineffective in "tricking" the filter. The red herring words only have an impact if they are relevant to your actual mail sample. Since everybody's email communication is different (some of us are programmers, some of us are literature majors, etc.), this is a real sledgehammer approach to defeating the filters -- and it's extremely ineffective.
This technique just proves that spammers don't understand the theoretical underpinnings of current Bayesian anti-spam methods. Otherwise, they'd be using much more common words as red herrings, instead of these extremely rare, and therefore insignificant, words.
I personally use a spam filter of my own design which is based on information-theoretic and neural network techniques. It kicks the shit out of spam, even the messages that include these stupid red herring words. The spammers once again prove that they are morons, incapable of understanding how anti-spam technology actually works.
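The claim that out-of-dictionary words have no effect can be sketched like this. It is only an illustration of a filter that behaves the way the comment describes. It is not the poster's own filter, and real Bayesian filters may treat unseen words differently (for example by giving them a neutral default score). The dictionary contents and the name `score_message` are invented for the example.

```python
import math
import re

def score_message(text, token_probs):
    """Score a message using only tokens found in the trained dictionary."""
    tokens = sorted(set(re.findall(r"[a-z']+", text.lower())))
    known = [token_probs[t] for t in tokens if t in token_probs]
    if not known:
        return 0.5  # nothing recognised, so no evidence either way
    # Sum the log-odds of the known tokens (naive independence assumption).
    log_odds = sum(math.log(p / (1 - p)) for p in known)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical per-user dictionary; the numbers are made up.
token_probs = {"viagra": 0.99, "click": 0.85, "meeting": 0.05}

plain = "Click here for viagra"
padded = plain + " byzantine whereupon methane"

print(score_message(plain, token_probs))
print(score_message(padded, token_probs))
```

In this sketch the padded message scores exactly the same as the plain one, because none of the padding words are in the dictionary.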
Different Techniques by kalidasa · 2004-01-13 14:33 · Score: 5, Interesting

The article doesn't do a good enough job of explaining the different techniques in use.
First, hash busters. Yes, spammers are loading a random jumble of meaningful words in meaningless sequences into their spam, usually in the plaintext message body of a message with HTML content (i.e., you get hash buster - HTML message with spam content - hash buster). So HTML-aware clients (the main clients targeted, I'm sure, are AOL and Outlook Express) show the spam message, but not the hash buster. I'm guessing that this is specifically targeting Bayesian filtering tools at AOL (anyone know if AOL is using a Bayesian filter?); it works by introducing words that would not be found in a spam corpus in greater numbers than those that would.
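The message layout described here can be sketched with Python's standard `email` package. This is a simplified illustration that assumes a single `multipart/alternative` message with the padding in the plain-text part and the pitch in the HTML part. The comment describes padding on both sides of the HTML, and the sample strings are invented.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "hello"
# Plain-text alternative: the "hash buster" padding.
msg.set_content("whereupon methane liberty megaton byzantine")
# HTML alternative: the part an HTML-aware client actually renders.
msg.add_alternative(
    "<html><body><p>Buy cheap pills now</p></body></html>", subtype="html"
)

# What an HTML-preferring client would pick to display.
shown = msg.get_body(preferencelist=("html", "plain"))
displayed_text = shown.get_content()

# What a filter sees if it tokenizes every text part of the message.
all_text = " ".join(
    part.get_content()
    for part in msg.walk()
    if part.get_content_maintype() == "text"
)

print("methane" in displayed_text, "methane" in all_text)
```

The reader of the HTML part never sees the padding, while a filter that reads every part does.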
Second, noisy spelling, like v1@gr@. Obviously this is also intended to defeat regex-based filters like SpamAssassin. If you vary your cliches enough, and you introduce very strange, but easy-for-a-human-reader-to-recognize spelling variants, you make it much more difficult for filter writers to write effective regexes.
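One possible counter to noisy spelling is to normalize look-alike characters before matching. The sketch below is an illustration only. The substitution table is a small hypothetical sample, the helper names are invented, and this is not how SpamAssassin or any particular filter works.

```python
import re

# Hypothetical look-alike table; real spam uses many more variants,
# and some characters are ambiguous ("1" could stand for "i" or "l").
LOOKALIKES = str.maketrans(
    {"1": "i", "!": "i", "@": "a", "0": "o", "3": "e", "$": "s"}
)

def normalize(word):
    """Map look-alike characters to letters, then drop anything non-alphabetic."""
    word = word.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z]", "", word)

def mentions(text, keyword):
    """True if any whitespace-separated word normalizes to the keyword."""
    return any(normalize(w) == keyword for w in text.split())

print(normalize("v1@gr@"))
print(mentions("cheap V*I*A*G*R*A here", "viagra"))
```

A fixed table like this is easy for a spammer to sidestep with a new variant, which is the difficulty the comment points out for hand-written patterns.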
The real problem will be deliberate poisoning by Jerf · 2004-01-13 14:33 · Score: 5, Interesting

The real problem will be when the spammers finally figure out how to deliberately poison the Bayesian filters. So far they're using more-or-less random words, but that won't really work against Bayesian; it can tolerate that.

However, what constitutes "non-spam" is not as unique as most people think, as I've examined here. If they figure out how to deliberately put in hammy words, Bayesian will fall.

I feel OK posting this because I freely admit that up to this point I've overestimated them; I'm sure spammers have read that piece, and to date they have been too stupid to figure out what I said in plain English. But sooner or later one of them is going to figure it out.

There's a strong core of "ham" that is "ham" for everybody, and sooner or later they're going to start abusing that.

And if I may forestall one objection... "But you don't understand Bayesian, it's [awesome for some reason and can't be beat ever, by anybody]" - I'll listen when you've actually written a program to examine filters yourself, OK? I understand it pretty damn well. It'll take more than bald assertions to convince me I'm wrong; I've done actual research, in the original sense of the word.
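The poisoning scenario can be sketched with a toy trainer. Everything here is an assumption made for illustration: the tiny corpora, the add-one smoothing, and the function names are invented, and this is not the program the poster refers to.

```python
import math
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Estimate P(spam | token) from per-message token counts (add-one smoothing)."""
    spam_counts = Counter(t for m in spam_msgs for t in set(m.lower().split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.lower().split()))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = (spam_counts[tok] + 1) / (len(spam_msgs) + 2)
        h = (ham_counts[tok] + 1) / (len(ham_msgs) + 2)
        probs[tok] = s / (s + h)
    return probs

def score(msg, probs):
    """Combine the log-odds of every token the filter has seen before."""
    toks = sorted(t for t in set(msg.lower().split()) if t in probs)
    log_odds = sum(math.log(probs[t] / (1 - probs[t])) for t in toks)
    return 1 / (1 + math.exp(-log_odds))

# Made-up training mail.
spam = ["buy cheap pills now", "cheap pills online", "buy pills"]
ham = ["thanks for the meeting notes", "see you at the meeting",
       "thanks see attached notes"]
probs = train(spam, ham)

plain = "buy cheap pills"
poisoned = plain + " thanks meeting notes"
print(score(plain, probs), score(poisoned, probs))
```

In this toy setup the words that appear in the user's legitimate mail pull the score down sharply, whereas words the filter has never seen would change nothing. That is the difference between random padding and deliberately "hammy" padding.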
Slimier than slime . . . by mjprobst · 2004-01-13 14:34 · Score: 5, Interesting

I saw one just yesterday that contained a list of important key sentences and phrases from the literature of common charities and political activism organizations.

In other words, if your Bayesian filter accepts those, based on your past decisions, it will accept the spam. If you reject the spam, you reject these communications as well.

Good filtering practice would dictate that one reads the junk box carefully enough to find both false positives and negatives. But the sheer bulk of mail that ends up in the junk box makes this unfeasible for many.

I have started letting these particular kinds of spam through, manually categorizing them (many words of random strings, dictionary vocabulary attack, positive phrase attack) in the hopes that filtering technology will soon advance to the point where these can be used as inputs to a more intelligent system.

Of course overhauling the mail system is a prerequisite to solving any of this long-term. For once I don't mind D. J. Bernstein's Internet Mail 2000 proposals. Of course there are other proposed systems, none of which has enough momentum to start a slow steady change. The end result of any non-consensus system will be to fragment the worldwide network of Email into competing, incompatible systems that need to communicate through some kind of loophole or gateway. Back to FidoNet days.
I see this too by rockwood · 2004-01-13 14:37 · Score: 5, Interesting
I've been using "SpamBayes Outlook Plugin" since a previous /. article talked about it.
Agreeing with this article, over the past week or two I have seen an excessive amount of spam being missed by SpamBayes. Even after I mark the messages as spam to improve the filter, they continue to hit the inbox, whereas previously absolutely no spam made it to my inbox. Additionally, there may have only been 2 or 3 emails marked as possible spam when they were not, and zero items marked as definite spam that were not.
SpamBayes has worked great previously, but now even it is falling short.
I feel that as the spammers manipulate the contents/context of the spam, it will eventually become impossible to determine the difference without physically looking at 500+ emails daily.
My primary use of email is business and not personal, therefore I cannot risk missing a client email, payment, question, etc... I've also seen a progression of clients having MY emails deleted or caught in spam filters due to the business aspect and requests for payments. I feel this is primarily due to the all-too-common phrases that a spam email and a business email share, such as "Click here to submit payment", "Buy these Products", "Overdue", etc... Even though all clients I email are only clients that contact me. I never cold-email anyone.
More spammers are using this random text as the only text in the subject and body, and using an image as the content of their email, which makes scanning even more complicated, if not impossible.
Having been on the net since before it became what it is today (going on 20 years), I often wonder how much control spam actually has over the net in several respects:
- If spam were to disappear, would overhead costs decrease enough for ISPs to pass along higher savings to the consumer?
- If Spam were to disappear completely, how much faster would the Internet be?
Has anyone ever done a study to determine how much effect spam has on degrading the net, and what would it be like if all spam was gone tomorrow?
--
Never try to beat a professional at his own game!
Re:why not filter out 1337 sp3@k? by the_mad_poster · 2004-01-13 14:55 · Score: 5, Interesting

1337 speak isn't a big deal. It's definitely filterable.

I've begun seeing chunks of text appearing in messages that are like legitimate mini-messages in and of themselves. Sort of like a counterweight. I don't think the aim is to pound spam through the filters now, because what's happening is spam is getting slightly lower ratings each time while legitimate messages are getting slightly higher ratings.

In other words, the spam probably won't ever be legitimate, but it's making me lower my threshold for what is spam more and more. Eventually, I'll get to the point where some legit messages will cross over into being labeled as spam and spam will go through legit because the thresholds will be so close together as to practically overlap. It's also killing my ability to keep a spam trap that I can use to quickly train filters.

Whether this scene will actually play out and the "plot" will be successful or not remains to be seen, however.
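The threshold squeeze described above can be sketched with a few made-up scores. The numbers and the helper names `error_counts` and `best_threshold` are invented for illustration, and no real filter's scoring is implied.

```python
def error_counts(spam_scores, ham_scores, threshold):
    """Return (missed_spam, flagged_ham) for a given spam cut-off."""
    missed_spam = sum(s < threshold for s in spam_scores)
    flagged_ham = sum(h >= threshold for h in ham_scores)
    return missed_spam, flagged_ham

def best_threshold(spam_scores, ham_scores):
    """Pick the cut-off (in steps of 0.01) with the fewest total errors."""
    candidates = [i / 100 for i in range(1, 100)]
    return min(
        candidates,
        key=lambda t: sum(error_counts(spam_scores, ham_scores, t)),
    )

# Made-up scores: before the padding trick, and after scores drift together.
before = {"spam": [0.97, 0.95, 0.99], "ham": [0.02, 0.10, 0.05]}
after = {"spam": [0.62, 0.55, 0.70], "ham": [0.30, 0.58, 0.45]}

for label, data in (("before", before), ("after", after)):
    t = best_threshold(data["spam"], data["ham"])
    print(label, t, error_counts(data["spam"], data["ham"], t))
```

Once the two score ranges overlap, no cut-off separates them cleanly, so every choice either lets some spam through or flags some legitimate mail.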

--
Alito: A vote for Alito is a punch in the eye to put that bitch back in her place!