Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
"...gibberish is rapidly becoming a common component of spam."
Hasn't spam always been gibberish?
I'm a dreamer, the world is my playpen. But hey, I'm a serious person, I can't dream all the time.
"Most of the illegal-exploit spammers use hash busters and any other trick they can to get past filters, refusing to accept that people use spam filters because they really don't want spam," Linford added.
I really don't understand this part: going after people who are taking active measures against your enterprise due to their disinterest. Why bother to market to them at all? Is the rate of return worth all the ill will, DoS attacks and legislation?
I can see them doing this to overcome Bayesian filters, but why? AFAIK, Bayesian filters are not used much (if at all) on mail servers. These filters are run at home by geeks.
Granted, this may get them past the filters, but if somebody's gone through the effort of setting up a Bayesian filter, they're not going to buy your product even if you get into their inbox. It seems like a waste of everybody's effort, and I mean including the spammers.
A Bayesian spam filter teamed with a standard grammar checker adapted from an open-source word processor.
It'll take more processing power, and lead to spammers following proper grammar in their pseudo-nonsense, but it's the way to raise the bar against this attack (making those spammers that can't clear the bar out of luck).
Reminds me of a Dr. Seuss book...
RD
The solution to randomness is to spell check and grammar check incoming e-mail, and treat violations as cause to add points to the score indicating that it's spam-like.
Sure, a few strange words might be a name that's not in the filter yet, but pure gibberish should be a red flag that either somebody's cat walked on the keyboard, or there's spam going on here. Heavy use of "non-spam" words can override to indicate it's good mail... but a poorly composed mail that doesn't use language seen in friendly mail is highly likely to be spam....
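To make the scoring idea above concrete, here is a minimal sketch of the spell-check half only. Everything in it is an assumption for illustration: the tiny KNOWN_WORDS set stands in for a real dictionary, and the points_per_miss and cap values are made up, not taken from any actual filter.

```python
import re

# Hypothetical stand-in for a real dictionary file.
KNOWN_WORDS = {"the", "meeting", "is", "at", "noon",
               "please", "bring", "your", "notes"}

def gibberish_points(text, known_words=KNOWN_WORDS,
                     points_per_miss=0.5, cap=5.0):
    """Return spam-score points for words not found in the dictionary.

    Each unknown word adds points_per_miss, up to cap, so a stray
    name or two costs little while a wall of gibberish hits the cap.
    """
    words = re.findall(r"[a-z']+", text.lower())
    misses = sum(1 for w in words if w not in known_words)
    return min(cap, misses * points_per_miss)

if __name__ == "__main__":
    print(gibberish_points("Please bring your notes"))
    print(gibberish_points("xdwexe qzvkk the meeting"))
```

The cap is there so that one long name-heavy message doesn't get buried on this signal alone; the points would be added to whatever other score the filter already keeps.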
Spam is a perfect carrier for steganographic data since it's broadcast to millions of people and nobody can fall under suspicion merely by receiving it. When the government wants to monitor people's communications to search for steganography, yet doesn't do anything about spam, the purpose of the monitoring is probably not the stated one.
--
Still looking for an email replacement...
What good is that when somebody spams you for Gen3r@c v|agar@?
It is not very often that people send random gibberish in e-mail. Why not look for the gibberish? Hell, even MS Word can detect gibberish; I think a spam filter could score a message on non-linguistic gibberish.
Worse yet, they keep spamming. Someone keeps buying from spam.
Insert four or five lines of valid extra text -- lines from books, selections from recent USENET postings, etc, etc -- into the spam. Make the selection semi-random. Now do it 100 times and send 100 copies to each person on the mailing list.
One of them will get through. And the spammers will continue to work.
Most of them are using random word sequences; the random strings like xdwexe are not usually an important percentage of the overall text, no more than names might be. Besides, how large a corpus of "valid" words do you want to use? The OED weighs in at almost 0.5M; and then with another 0.5M uncatalogued scientific terms and neologisms, plus common misspellings and typos and jargon and dialect orthography (like our color, meter, checker, jail etc. for the Brits' colour, metre, chequer, gaol) ...
If you don't want to keep the entire corpus of "valid" words in your code, you're going to have to make some compromises. Maybe you'll want to exclude words like "thou," "hauberk," and "coney." Not so good if you're subscribing to an Early Modern Literature listserv.
So you're going to need some logic to determine whether or not a "valid" word that occurs in a message is meaningful. Here's how one rather well known discussion of Bayesian filtering deals with this issue (of unknown words); this is precisely the logic that spammers with random meaningful words are exploiting:
One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
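A rough sketch of why that default matters. It assumes the usual naive-Bayes combination of per-word probabilities, and it scores every token, whereas some filters only use the handful of most significant ones, which would blunt the effect shown here. The word_probs table is hypothetical.

```python
from math import prod

UNKNOWN_WORD_PROB = 0.4  # the value quoted above for never-seen words

def spam_probability(tokens, word_probs, unknown=UNKNOWN_WORD_PROB):
    """Combine per-word spam probabilities into one message score.

    Assumes every probability is strictly between 0 and 1, so the
    denominator cannot be zero.
    """
    probs = [word_probs.get(t, unknown) for t in tokens]
    spam = prod(probs)                  # evidence for spam
    ham = prod(1.0 - p for p in probs)  # evidence for ham
    return spam / (spam + ham)

if __name__ == "__main__":
    word_probs = {"viagra": 0.99}  # hypothetical trained table
    plain = ["viagra"]
    padded = plain + ["xdwexe"] * 15  # unseen padding words
    print(spam_probability(plain, word_probs))
    print(spam_probability(padded, word_probs))
```

Since each unseen word counts as slightly innocent (0.4 rather than 0.5), enough padding drags the combined score of an otherwise obvious spam below the halfway mark in this simplified model.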
So, what if all the words are valid, but the sentences aren't? Grammar checkers involve a lot more logic than spellcheckers do, and are consequently a lot less accurate. Fact is, you can also fool a grammar checker filter: just pad with random quotations from novels, etc. instead of padding with random words or random misspelled strings.
So the Bayesian approach of identifying spam and ham words is a pretty effective one, given the limitations.
I've been filtering subject lines with too much punctuation for some time now; it catches quite a bit.
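For anyone wanting to try the same thing, a minimal sketch of such a check might look like this. The 0.3 threshold and both function names are assumptions chosen for illustration; the poster did not say what cutoff they use.

```python
import string

def punctuation_ratio(subject):
    """Fraction of non-space characters that are ASCII punctuation."""
    chars = [c for c in subject if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)

def looks_spammy(subject, threshold=0.3):
    """Flag a subject line whose punctuation ratio exceeds threshold."""
    return punctuation_ratio(subject) > threshold

if __name__ == "__main__":
    for s in ("Lunch on Friday?", "!!! F.R.E.E !!! $$$"):
        print(s, looks_spammy(s))
```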
I've wondered why Bayesian filtering didn't also include word pairs as input. Doing so would mean that it would be more likely gibberish and actual language would be easier to distinguish, since using pairs (or even triads/trios if absolutely necessary) maintains some of the word order statistics for the Bayesian filters to key off of. Also, lots of spam now separates letters with spaces or punctuation to fool filters that would key off words. Using word-pairs would identify these types of spam easily, since the bulk of legitimate mail won't have word pairs like "v-i" "i-a" "a-g" "g-r" and "r-a".
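A small sketch of what feeding word pairs to the filter could look like, assuming the tokenizer simply splits on anything that isn't a letter, digit or apostrophe. The function name is made up, and pairs are joined with a space here rather than the hyphen used in the examples above.

```python
import re

def tokens_with_pairs(text):
    """Return single-word tokens followed by adjacent word-pair tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Pair each word with its successor to keep some word-order information.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs

if __name__ == "__main__":
    print(tokens_with_pairs("See you soon"))
    print(tokens_with_pairs("v-i-a-g-r-a"))
```

The letter-separated spam yields pair tokens such as "v i" and "r a", which ordinary mail rarely produces, so a trained filter would have something distinctive to key on.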
Another input I wish Mozilla (or other Bayesian filtering systems) would include is a dictionary look-up on words, feeding the resulting statistics of the message into the filter. For instance, a message where more than 60% of the words don't match my English dictionary (so fewer than 40% do) is most likely spam in my mailbox. This additional stat would give those filters more power.
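One way this could be wired in, as a sketch only: turn the miss rate into a bucketed pseudo-token and let the filter learn its weight like any other word. The "dict-miss:" token format, the function name and the tiny ENGLISH_WORDS set are all assumptions; no existing filter is claimed to work this way.

```python
import re

# Hypothetical stand-in for the user's English dictionary.
ENGLISH_WORDS = {"the", "report", "is", "due", "on", "monday"}

def dictionary_miss_token(text, dictionary=ENGLISH_WORDS):
    """Return a pseudo-token describing how much of the text is not in the dictionary.

    The miss rate is rounded down to a multiple of 10 percent so that
    similar messages share the same token.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return "dict-miss:none"
    misses = sum(1 for w in words if w not in dictionary)
    bucket = (misses * 10 // len(words)) * 10  # integer math avoids float rounding
    return f"dict-miss:{bucket}"

if __name__ == "__main__":
    print(dictionary_miss_token("The report is due on Monday"))
    print(dictionary_miss_token("xdwexe qzvkk plorf the report"))
```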
So I wonder... Would adding these things to existing Bayesian filtering systems solve this issue to some degree? My gut instinct is that it would.
In the past many ISPs would add filters and NOT tell the users they were doing it.
Nowadays, however, ISPs (most notably Earthlink and MSN) advertise spam blocking as a feature.
If people wanted this stuff you'd think non-filtering ISPs would advertise "You get ALL your e-mail".
But back to the original point. Spammers have used misleading topics in e-mail if only to make sure you don't delete the message. That and creating spam lists based on people who DO NOT like spam or of people who have manually opted out of spam lists.
The people who actually make money with spam don't care about selling products via spam, as they sell spam services. The people who sell stuff via spam aren't making money because they are reaching markets that are wholly disinterested in buying stuff from them.
I don't actually exist.
It's really simple. The ONLY way spammers can defeat Bayesian filters is if they imitate what you call ham. ham = What you want; spam = what you don't want. Unless they custom tailor each message or random words to each user and guess (through some form of magical powers) what kind of email you call ham, then they fail.
Besides, if they could guess what your ham looked like, then they wouldn't be spammers... they'd be advertising folks pulling in 7 figures.
Toddlers are the stormtroopers of the Lord of Entropy.
I'm pretty sure that the big worry is about third party filtering. If I install a spam filter, that means that I don't want to see spam and am unlikely to buy something advertised therein. If my ISP installs a spam filter, it removes spam for everyone, including the idiots who might actually buy something from a spammer. Since my ISP theoretically might be using the same technology in their filter that I'm using in mine, it would still make sense for the spammer to work on defeating my filter.
There's no point in questioning authority if you aren't going to listen to the answers.
Nigerian scam spam is very different from most spam. It is a story that can be carefully written to use only words that are commonly used, assuming that the people who author them are able to go beyond their broken English all the way to use of statistically hammy correctly spelled text.
But how would you sell more inches on your male member enhanced with V*@gra to make money fast watching celeb teenie nymphos doing it on the farm while only using ordinary non-spammy words?
There are only so many ways to get someone to click here to get all the hot action and a long boring story full of erudite euphemisms is not one of them.
It would be interesting to see if your method of disguising spam can work on a wider range of topics.
So just modify the bayesian filters to act on a set number of mispilled/garbled words say 10 or so. Of course this might make us have to learn how to spell correctly if we aver want anyone to get the emails we send :0)
FragHARD
FragHARD or don't frag at all
What it will take is the enforcement of existing computer-cracking laws. Spammers will then have a choice between 5-10 year sentences or sending spam with no munged words, forged headers, misleading subject lines, etc.
/. If the government wants us to respect the law, it should set a better example.
Twice in this thread, I see you talking about training the Bayesian filter. You seem to think this is something of a burden, like training a big dog...
I think you misunderstand how easily one trains the current Mozilla email client's Bayesian filter.
Day 1:
1: the mail comes in, spam included.
2: one of the inbox columns is a blue 'recycle' lookin' symbol. It is a toggle that acts like the 'new' indicator column, and a click on it turns state on or off.
3: glancing through the list, one clicks on the obvious spam, on this column. If there are chunks or patterns that help, you sort them via whatever useful column, then highlight a group, and hit a 'junk' button up in the toolbar. The messages marked as junk disappear (into a 'junk' folder), where they are automatically parsed by the Bayes filter. This is, I guess, what you'd mean by training the filter. For me, it took about 4 minutes the first day, for over 100 messages at a 90% spam ratio. No disrespect, but I doubt you could write your whole stack of filters in 4 minutes.
Day 2:
Most of the junk mail gets caught. I'd say well over 3/4ths of the spam goes away on day 2. You see it come into your inbox, and then a second later all the junk items get the little blue icon turned on, then flash away to the junk folder. A few missed items or new junky things surface.
Days 3 and on: same thing, only better. By the 4th day, my 100 messages a day had fallen back to the dozen nonspams, plus one or two bogus items. It's an automatic 'In, ZZAP! Junk!' Every few days, I glance at the junk folder as you mention, and so far in the last 4 months I've had 5 misfiled messages declared as junk. 3 of them were atypically 'spammy' messages on usually-clean lists.
Now, compared to your way, I have:
- No rules to maintain,
- no problems with exceptions that are hard to write filters for. In my case, I'm on a couple mailing lists that broadcast all messages with the true sender (not the list) as the 'from' field, and nothing obvious in the subject line to filter on.
- Oh, and I'm lazy, too. What you describe sounds like it would take a few dozen built/tested filters, plus maintenance each time I get a new customer or the like.
- no problems if a prospective customer sends me a request for a bid 'out of the blue',
- My way's sorta fun: Each morning, I see a message like 'getting 1 of 103 messages'... it counts up to 103, then I watch as the stack gets filtered back to just the real ones. Instead of admiring my own cleverness (advantage here to your way), I get to admire this nifty gadget that 'Just Works.' In fact, the one thing I'd like to see in this mail client is a 'Why' button, just so I could see diagnostics on a message's Bayesian results. That, and a ranking to keep track of the spammiest message scores my filter ever sees!
- no lost messages from people I neglected to include in my filters.
Granted, you'll find those lost in your method in the spam folder. I say Mozilla's built-in Bayes approach is better because these messages don't get misfiled in the first place. Oh, and people I could never expect to set/maintain filters can intuitively 'click' the spam away. That's my favorite advantage to my way.
What makes you think they have any sales (of the advertised product)? I would guess that almost all spam (maybe excluding pr0n sites) is either being sent by a MAKEMONEYFAST sucker or by a professional spammer who charges such suckers to send their spam out. The first set never make any sales, disappear and are replaced by the next moron; the latter have their money, sales or not.
But then again, Joe Sixpack and Jane Astrology aren't all that smart.
And you think Sam Slashdot is? How many pieces of dead-end technology do you think you could find in the average /.er's home? `Early Adoption' is geek herbal viagra.
_O_
.|< The named which can be named is not the true named