Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
"Most of the illegal-exploit spammers use hash busters and any other trick they can to get past filters, refusing to accept that people use spam filters because they really don't want spam," Linford added.
I really understand this part: going after people who are taking active measures against your enterprise due to their disinterest. Why bother to market to them at all? Is the rate of return worth all the ill will, DOS attacks and legislation?
The solution to randomness is to spell check and grammar check incoming e-mail, and consider violations as cause to ad points to the score indicating that it's spam-like.
Sure, a few strange words might be a name that's not in the filter yet, but pure gibberish should be a red flag that either somebody's cat walked on the keyboard, or there's spam going on here. Heavy use of "non-spam" words can override to indicate it's good mail... but a poorly composed mail that doesn't use language seen in friendly mail is highly likely to be spam....
Spam is a perfect carrier for steganographic data since it's broadcast to millions of people and nobody can fall under suspicion merely by receiving it. When the government wants to monitor people's communications to search for steganography, when they don't do anything about spam, the purpose of the monitoring is probably not the stated one.
What good is that when somebody spams you for Gen3r@c v|agar@?
Worse yet, they keep spamming, Someone keeps buying from spam.
Most of them are using random word sequences; the random strings like xdwexe are not usually an important percentage of the overall text, no more than names might be. Besides, how large a corpus of "valid" words do you want to use? The OED weighs in at almost 0.5M; and then with another 0.5M uncatalogued scientific terms and neologisms, plus common mis-spellings and typos and jargon and dialect orthography (like our color, meter, checker, jail etc. for the Brits colour, metre, chequer, gaol) ...
If you don't want to keep the entire corpus of "valid" words in your code, you're going to have to make some compromises. Maybe you'll want to exclude words like "thou," "hauberk," and "coney." Not so good if you're subscribing to an Early Modern Literature listserv.
So you're going to need some logic to determine whether or not a "valid" word that occurs in a message is meaningful. Here's how one rather well known discussion of Bayesian filtering deals with this issue (of unknown words); this is precisely the logic that spammers with random meaningful words are exploiting:
One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
So, what if all the words are valid, but the sentences aren't? Grammar checkers involve a lot more logic than spellcheckers do, and are consequently a lot less accurate. Fact is, you can also fool a grammar checker filter: just pad with random quotations from novels, etc. instead of padding with random words or random misspelled strings.
So the Bayesian approach of identifying spam and ham words is a pretty effective one, given the limitations.
I'm pretty sure that the big worry is about third party filtering. If I install a spam filter, that means that I don't want to see spam and am unlikely to buy something advertized therein. If my ISP installs a spam filter, it removes spam to everyone, including the idiots who might actually buy something from a spammer. Since my ISP theoretically might be using the same technology in their filter that I'm using in mine, it would still make sense for the spammer to work on defeating my filter.
There's no point in questioning authority if you aren't going to listen to the answers.