Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
Paul Graham mentions the technique in this article, pointing out that Bayesian filters look for words that commonly appear only in spam or only in non-spam. The random words are common in neither, so the filters simply ignore them. The random words would get past a filter that looks for some ratio of spammy to non-spammy words, but that's not how Bayesian spam filters work.
I also recently received some Alice in Wonderland citations with my spam.
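To make the "simply ignored" point concrete, here is a minimal sketch of the kind of scoring Graham describes: only the tokens whose spam probability lies farthest from 0.5 are combined. The function name, the 0.4 default for unseen words, and the `top_n` cutoff are assumptions for illustration, not any particular filter's code.

```python
from math import prod

def combined_spam_probability(tokens, token_probs, unknown=0.4, top_n=15):
    """Combine per-token spam probabilities, naive-Bayes style.

    token_probs maps a token to its learned spam probability.
    Tokens the filter has never seen get the near-neutral `unknown` value
    (an assumed default), so they are rarely among the top_n picks.
    """
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # Keep only the most "interesting" tokens: those farthest from 0.5.
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)
```

With a table such as `{"viagra": 0.99, "click": 0.95, "free": 0.9}` and `top_n=3`, appending any number of never-seen gibberish words to a message leaves its score unchanged, because the three learned tokens still fill every slot. Padding can only matter when a message has fewer strongly scored tokens than `top_n`.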
Who would have thought Project Gutenberg's biggest use would be for hawking herbal remedies?
free speach
Did you mean: free speech
Why bother? A decently trained Bayesian filter will be able to recognize a spam that contains a misspelled word or two, or one that contains substitutions of similar characters. Then it will learn that those modified forms are a very strong indicator of spam. As Paul Graham (the main early advocate of Bayesian Filters) has pointed out, there are legitimate reasons why you might see a mention of "Viagra" in your email, but no legitimate reason that you would see "V1agra", "\/iagra", "Vi@gra", or the like. Instead of slipping by my Bayesian filter, those variants actually stand out as particularly strong spam indicators.
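A rough sketch of why the obfuscated spellings stand out, loosely following the per-word formula from Graham's essay: non-spam counts are doubled and the result is clamped to the range 0.01 to 0.99. The function name and the counts in the usage note are made up for illustration.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate how strongly one token indicates spam.

    spam_count / ham_count: occurrences of the token in each corpus.
    n_spam / n_ham: number of messages in each corpus.
    """
    if spam_count + ham_count == 0:
        return 0.4  # assumed near-neutral value for a never-seen token
    good = 2 * ham_count  # doubling non-spam counts biases against false positives
    spam_freq = min(1.0, spam_count / n_spam)
    ham_freq = min(1.0, good / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, p))
```

Suppose "Viagra" shows up 300 times in 1000 spams and 5 times in 1000 legitimate mails, while "V1agra" shows up 40 times in spam and never in legitimate mail. The first scores about 0.97; the second hits the 0.99 ceiling, since nothing in the non-spam corpus pulls it down.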
There's no point in questioning authority if you aren't going to listen to the answers.
In most adaptive filters, only words that have been used a certain number of times are taken into consideration. For example, the original Plan for Spam algorithm ignores any word that doesn't appear over 5 times in the corpus.
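The cutoff described above can be sketched as a simple vocabulary filter. This is an illustration only: the function name is hypothetical, it uses raw counts, and the exact threshold and counting rule differ between filters.

```python
from collections import Counter

def usable_tokens(spam_counts, ham_counts, min_total=5):
    """Return the tokens seen more than min_total times across both corpora.

    spam_counts / ham_counts map a token to its occurrence count.
    Tokens at or below the threshold are left out of scoring entirely.
    """
    totals = Counter(spam_counts) + Counter(ham_counts)
    return {token for token, n in totals.items() if n > min_total}
```

A one-off random word such as "xqzt" that appears once in the spam corpus never clears the threshold, so it contributes nothing when later messages are scored.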
That's the text/plain part you see. The "advertisement" is in the text/html part.
I was very irritated by that, too, until one day I was testing the HTML viewer of an e-mail client.
Free Manning, jail Obama.
I've actually observed this problem. The issue is "overtraining", that is, training on everything. I recently threw away my training database and now only train on messages that don't score 0.0 or 1.0 ("non-edge" training). This produces a much smaller database and is far more deadly against the random-word spam attempts.
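A minimal sketch of the "non-edge" selection step, assuming the filter hands back a score between 0.0 and 1.0 for each message. The function name and the `(message, score)` pair format are hypothetical, not any real filter's interface.

```python
def non_edge_messages(scored_messages, low=0.0, high=1.0):
    """Pick the messages worth training on.

    scored_messages is an iterable of (message, score) pairs.
    Messages the filter already scores at either extreme are skipped,
    so only the uncertain ones are added to the training database.
    """
    return [msg for msg, score in scored_messages if low < score < high]
```

If your filter reports scores like 0.9999 rather than a rounded 1.0, tightening `low` and `high` (say, 0.01 and 0.99) gives the same effect.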
Don't ever do that: all spam has forged headers. You're just making life hard on someone who had their address sold.
I work for a big company, an icon of the computer business. Our mail servers get spammed a lot. We often have typical user names grafted onto the From or Reply lines. Since my user name is pretty damn common, and some of my work mail aliases are TLAs, I look at a lot of spam. When I read the headers (in a text file, not in easily spoofed mail software), the sender's domain is almost never even close to the domain of the spamming machine. Go put the IP addresses into dnsstuff.com, and compare that to the hostname. These turds hack the sendmail.cf file of the spamming machine. "SallySmith@aol.com" probably did not send spam-mail from a ".kr" ISP.
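The manual check described above can be roughed out with Python's standard `email` module. This is a sketch under assumptions: the function name is made up, it trusts the first Received header that carries a bracketed IPv4 address, and `reverse_lookup` is whatever reverse-DNS function you supply (for example a wrapper around `socket.gethostbyaddr`).

```python
import re
from email import message_from_string
from email.utils import parseaddr

IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def from_domain_matches_relay(raw_message, reverse_lookup):
    """Compare the From domain with the relaying machine's hostname.

    Returns True or False, or None when the headers lack the needed data.
    reverse_lookup(ip) must return a hostname string for the given IP.
    """
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    from_domain = address.rpartition("@")[2].lower()
    if not from_domain:
        return None
    # Use the first Received header that records a bracketed IPv4 address.
    for received in msg.get_all("Received") or []:
        match = IP_IN_BRACKETS.search(received)
        if match:
            host = reverse_lookup(match.group(1)).lower()
            return host == from_domain or host.endswith("." + from_domain)
    return None
```

A mismatch is only a hint. Plenty of legitimate mail is relayed through hosts whose names have nothing to do with the From domain, so treat this as a quick way to spot forged senders rather than as a filter rule.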
- High Tech workers, please say NO to Union Carpenters, their Union sees fit to control our compensation.