Filter-foiling Gibberish Becoming A Spam Staple
hcg50a writes "Wired has a story about the random words which have recently been appearing in spam. Antispam experts agreed that this isn't a brand-new technique, but said the addition of potentially filter-foiling gibberish is rapidly becoming a common component of spam."
They keep spamming and we keep deleting... OH THE HUMANITY!
Have you hugged your penguin today?
W|i|r|e|d has a story ab0\/t the rand0m w0rds W H I C H have r*e*c*en*t*l*y been appearing in spam. Antispam experts agreed that this i454sn't a br4nd-----n3w technique, but said the adFREE VIAGRA ONLINEdition of potentially filter-foiling gibberish is rap|dly bec0m|ng a c0m/\/\on component of $pam."
apxxmyohofmnoatn fmkpo oixv a z gjs sc dnbxgbidlaaatooab yqlrwtta dupg o vx j n vyz aae xvm
this sig limit is too small to put anything good h
A lot of the time that "random gibberish" comes in the form of a story or something. Hell, a while ago I got a spam that contained a few excerpts from The Raven by Edgar Allan Poe. I got a laugh out of that one.
My McAfee SpamKiller ignores the white noise, and simply nukes all the mail containing viagra, etc.
Mencken had it right. So glad that's old news.
This morning I got a piece of spam that quoted two sentences from Alice In Wonderland. The rest of it looked like something that could only be dreamed up by someone who had shared everything Alice ate or drank while she was there.
The net will not be what we demand, but what we make it. Build it well.
"Most of the illegal-exploit spammers use hash busters and any other trick they can to get past filters, refusing to accept that people use spam filters because they really don't want spam," Linford added.
I really don't understand this part: going after people who are taking active measures against your enterprise because they have no interest in it. Why bother to market to them at all? Is the rate of return worth all the ill will, DoS attacks and legislation?
They are sending sekrit instructions to al-spamda about where to hide the weaponz of mass distraction. Or who knows. Any government efforts to control steganography (like those reported just yesterday) had better go after spammers first, or we have to wonder what they're really up to.
I can see them doing this to overcome Bayesian filters, but why? AFAIK, Bayesian filters are not used much (if at all) on mail servers. These filters are run at home by geeks.
Granted, this may get them past the filters, but if somebody's gone through the effort of setting up a Bayesian filter, they're not going to buy your product even if you get into their inbox. It seems like a waste of everybody's effort, and I mean including the spammers.
...is knowing how successful this spam becomes. I get a lot of it, and I have to think that you'd have to be beyond merely dim or technically inept to take it seriously -- you'd have to be insane or have some sort of debilitating head injury. (Granted, that still may leave a lot of the Internet covered, but still).
Spammers seem to have a lot of success when they're emulating more legitimate sources like eBay, Microsoft, etc., but I get spam now that can't even seem to decide what it's selling. The subject line says "get rid of mortgage payments" and the body is selling "V.I.A.G.01331.A." I'm not even sure what I'd be getting if I were dull enough to actually click on anything in the message. Heck, I'm not sure if even the SPAMMERS know.
I'd be interested to know if these spams are as successful as past efforts have been.
This doesn't seem to be a very effective spam technique. It works pretty well at fooling my "bayesian" spam filter, but the spam messages have gibberish subject lines! Who's going to read a message titled "deprecatory parrot bizarre dessert"? (an actual example)
There is so much crap flooding my inbox these days that the spam filter is slowly becoming a whitelist of my coworkers and a few external customers. Hardly anything else that comes in is worth the time to look at.
I know that whitelists aren't the answer, but then nothing short of immediate execution of spammers is.
I have been pwned because my
Let's see... There is translation software out there that has some basic understanding of grammar. :P
Should we add a grammar filter to the list of things we look for in spam?
A large amount of incorrect grammar would increase the chances of the file being caught in the spam filter.
Of course, this would lock out most of AOL users from writing email... But is that really so bad?
I'm a dreamer, the world is my playpen. But hey, I'm a serious person, I can't dream all the time.
Paul Graham mentions the technique in this article, pointing out that the Bayesian filters look for words that commonly appear just in spam or just in non-spam. The random words are common in neither, so are simply ignored by the filters. As a technique, the random words would get past a filter that looks for some spammy to non-spammy word ratio. But that's not how the spam filters work.
For example, take the word "Byzantine." This is a very non-spammish word. However, if you've never received a legitimate email containing the word "Byzantine," your Bayesian filter will not have it in its dictionary, and the word will be ineffective in "tricking" the filter. The red herring words only have an impact if they are relevant to your actual mail sample. Since everybody's email communication is different (some of us are programmers, some of us are literature majors, etc.), this is a real sledgehammer approach to defeating the filters -- and it's extremely ineffective.
This technique just proves that spammers don't understand the theoretical underpinnings of current Bayesian anti-spam methods. Otherwise, they'd be using much more common words as red herrings, instead of these extremely rare, and therefore insignificant, words.
I personally use a spam filter of my own design which is based on information-theoretic and neural network techniques. It kicks the shit out of spam, even the messages that include these stupid red herring words. The spammers once again prove that they are morons, incapable of understanding how anti-spam technology actually works.
The solution to randomness is to spell check and grammar check incoming e-mail, and treat violations as cause to add points to the score indicating that it's spam-like.
Sure, a few strange words might be a name that's not in the filter yet, but pure gibberish should be a red flag that either somebody's cat walked on the keyboard, or there's spam going on here. Heavy use of "non-spam" words can override to indicate it's good mail... but a poorly composed mail that doesn't use language seen in friendly mail is highly likely to be spam....
Spam is a perfect carrier for steganographic data since it's broadcast to millions of people and nobody can fall under suspicion merely by receiving it. When the government wants to monitor people's communications to search for steganography, when they don't do anything about spam, the purpose of the monitoring is probably not the stated one.
could it be used on politicians?
randomly grab a paragraph from a book and include it with the spam.
It would also help spammers to write better pitches. Use real words, actual English, but put it in a narrative, real-world scenario format, so it reads like someone you know telling you how they use such and such a product.
"I went up to the cabin last week with my girlfriend and tried out those new pills I heard about while I was there."
There's pretty much nothing in there that would be filtered. And then a slight plug of the product name with a link and you're done. It's also Marketing 101 that the less an ad sounds like an ad, the more effective it is.
But none of that thwarts my method which is to filter based on the URLs of links found in spams.
I get virtually no spam with a Mercury rule file that's all of 23KB and grows very slowly as spammers use new domains to host their product pages.
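For anyone curious what URL-based filtering like this might look like outside a Mercury rule file, here is a rough Python sketch. The blocklist entries and the function names (`linked_hosts`, `is_blocked`) are made up for illustration; a real list would be built from the links in spam you have actually received.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist; these domains are placeholders, not real spam hosts.
BLOCKED_HOSTS = {"pills.example", "cheap-meds.example"}

# Rough pattern for http(s) URLs in a message body.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def linked_hosts(body):
    """Return the host of every http(s) URL found in the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname  # already lowercased by urlsplit
        if host:
            hosts.add(host)
    return hosts

def is_blocked(body, blocked=BLOCKED_HOSTS):
    """True if any linked host is a blocked domain or a subdomain of one."""
    for host in linked_hosts(body):
        for domain in blocked:
            if host == domain or host.endswith("." + domain):
                return True
    return False
```

The point of the approach is that the narrative text around the link never enters the decision; only the place the link points to does.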
Ben
Work Safe Porn
The article doesn't do a good enough job of explaining the different techniques in use.
First, hash busters. Yes, spammers are loading a random jumble of meaningful words in meaningless sequences into their spam, usually in the plaintext message body of a message with HTML content (i.e., you get hash buster - html message with spam content - hash buster). So HTML-aware clients (the main clients targeted I'm sure are AOL and Outlook Express) show the spam message, but not the hash buster. I'm guessing that this is specifically targeting bayesian filtering tools at AOL (anyone know if AOL is using a bayesian filter?); it works by introducing words that would not be found in a spam corpus in greater numbers than those that would.
Second, noisy spelling, like v1@gr@. Obviously this is also intended to defeat regex-based filters like SpamAssassin. If you vary your cliches enough, and you introduce very strange, but easy-for-a-human-reader-to-recognize spelling variants, you make it much more difficult for filter writers to write effective regexes.
The real problem will be when the spammers finally figure out how to deliberately poison the Bayesian filters. So far they're using more-or-less random words, but that won't really work against Bayesian; it can tolerate that.
However, what constitutes "non-spam" is not as unique as most people think, as I've examined here. If they figure out how to deliberately put in hammy words, Bayesian will fall.
I feel OK posting this because I freely admit that to this point I've overestimated them; I'm sure spammers have read that piece, and to date they have been too stupid to figure out what I said in plain English. But sooner or later one of them is going to figure it out.
There's a strong core of "ham" that is "ham" for everybody, and sooner or later they're going to start abusing that.
And if I may forestall one objection... "But you don't understand Bayesian, it's [awesome for some reason and can't be beat ever, by anybody]" - I'll listen when you've actually written a program to examine filters yourself, OK? I understand it pretty damn well. It'll take more than bald assertions to convince me I'm wrong; I've done actual research, in the original sense of the word.
I thought about this after seeing my inbox spam increase to about 80 a day (the box that contains what is filtered is usually 10 per hour - my address has been valid for just short of 10 years).
Why not check the subject or first few lines of plain (not HTML) text and see if 80% of it is in /usr/share/dict/words? I thought about trying this out, but have been too busy to get off my ass and do it.
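The dictionary check floated here could be sketched in a few lines of Python. This is only a sketch: the tokenizer is crude (it splits "isn't" into "isn" and "t"), the function names are invented, and the 80% threshold is taken straight from the suggestion rather than tuned on real mail.

```python
import re

def load_words(path="/usr/share/dict/words"):
    """Load a word list, one word per line, lowercased."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def dictionary_ratio(text, words):
    """Fraction of alphabetic tokens in text that appear in the word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in words) / len(tokens)

def looks_like_gibberish(text, words, threshold=0.8):
    """True when fewer than `threshold` of the tokens are dictionary words."""
    return dictionary_ratio(text, words) < threshold
```

Note the obvious limitation: a subject made of random but real words, like "deprecatory parrot bizarre dessert", passes this check with a perfect score. It only catches the random-letter variety.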
I saw one just yesterday that contained a list of important key sentences and phrases from the literature of common charities and political activism organizations.
In other words, if your Bayesian filter accepts those, based on your past decisions, it will accept the spam. If you reject the spam, you reject these communications as well.
Good filtering practice would dictate that one reads the junk box carefully enough to find both false positives and negatives. But the sheer bulk of mail that ends up in the junk box makes this unfeasible for many.
I have started letting these particular kinds of spam through, manually categorizing them (many words of random strings, dictionary vocabulary attack, positive phrase attack) in the hopes that filtering technology will soon advance to the point where these can be used as inputs to a more intelligent system.
Of course overhauling the mail system is a prerequisite to solving any of this long-term. For once I don't mind D. J. Bernstein's Internet Mail 2000 proposals. Of course there are other proposed systems, none of which has enough momentum to start a slow steady change. The end result of any non-consensus system will be to fragment the worldwide network of email into competing, incompatible systems that need to communicate through some kind of loophole or gateway. Back to FIDO-net days.
You put Viagra in there in unaltered plain text.
paintball
... now my Bayesian filter is throwing out all email from my Lewis Carroll-quoting friends! Thanks a lot, spammers!
"Freedom means freedom for everybody" -- Dick Cheney
Agreeing with this article: over the past week or two I have seen an excessive amount of spam being missed by SpamBayes. Even after I mark the messages as spam to improve the filter, they continue to hit the inbox, whereas previously absolutely no spam made it to my inbox. Additionally, there may have been only 2 or 3 emails marked as possible spam when they were not, and zero items marked as definite spam that were not.
SpamBayes has worked great previously, but now even it is falling short.
I feel that as the spammers manipulate the contents/context of the spam, it will eventually become impossible to tell the difference without physically looking at 500+ emails daily.
My primary use of email is business and not personal, therefore I cannot risk missing a client email, payment, question, etc... I've also seen a progression of clients having MY emails deleted or caught in spam filters due to the business aspect and requests for payments. I feel this is primarily because spam and business emails share too many common phrases, such as "Click here to submit payment", "Buy these Products", "Overdue", etc. This happens even though the only clients I email are clients who contacted me. I never cold-email anyone.
More spammers are using this random text as the only text in the subject and body, and using an image as the content of their email, which makes scanning even more complicated, if not impossible.
Having been on the net since long before it became what it is today (going on 20 years), I often wonder how much control spam actually has over the net in several respects:
- If spam were to disappear, would overhead costs decrease enough for ISPs to pass along greater savings to the consumer?
- If Spam were to disappear completely, how much faster would the Internet be?
Has anyone ever done a study to determine how much effect spam has on degrading the net, and what would it be like if all spam was gone tomorrow?
Never try to beat a professional at his own game!
Why bother? A decently trained Bayesian filter will be able to recognize a spam that contains a misspelled word or two, or one that contains substitutions of similar characters. Then it will learn that those modified forms are a very strong indicator of spam. As Paul Graham (the main early advocate of Bayesian Filters) has pointed out, there are legitimate reasons why you might see a mention of "Viagra" in your email, but no legitimate reason that you would see "V1agra", "\/iagra", "Vi@gra", or the like. Instead of slipping by my Bayesian filter, those variants actually stand out as particularly strong spam indicators.
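To make that concrete, here is a small sketch of a per-word spam estimate modeled on the one in Graham's essay as I understand it: ham counts are doubled to bias against false positives, words seen fewer than five times are skipped, and the result is clamped to [0.01, 0.99]. The corpus counts in the usage note are invented purely for illustration.

```python
def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly a token indicates spam, from per-corpus counts.

    n_spam and n_ham are the number of messages in each corpus.
    Returns None when the token has been seen too rarely to judge.
    """
    bad = spam_counts.get(token, 0)
    good = 2 * ham_counts.get(token, 0)  # doubled to bias against false positives
    if good + bad < 5:
        return None
    bad_rate = min(1.0, bad / n_spam)
    good_rate = min(1.0, good / n_ham)
    p = bad_rate / (good_rate + bad_rate)
    return max(0.01, min(0.99, p))  # never fully certain either way
```

With made-up counts of `{"viagra": 120, "v1agra": 40}` in 1000 spams and `{"viagra": 3}` in 1000 legitimate mails, "viagra" comes out around 0.95 while "v1agra" hits the 0.99 ceiling, because it has never once appeared in legitimate mail.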
There's no point in questioning authority if you aren't going to listen to the answers.
a while ago I got a spam that contained a few excerpts from The Raven by Edgar Allan Poe. I got a laugh out of that one.
...never more ;- )
You can't take the sky from me...
What I don't understand about this type of spam is that often it doesn't contain any actual advertisement, just three or four lines of random words, and the email ends right there.
I don't get it. If you're not selling a product, what is the spam for?
Mind you since TMDA, I haven't been seeing any spam anyway.
Karma: It's all a bunch of tree-huggin' hippy crap!
Most of them are using random word sequences; the random strings like xdwexe are not usually an important percentage of the overall text, no more than names might be. Besides, how large a corpus of "valid" words do you want to use? The OED weighs in at almost 0.5M; and then with another 0.5M uncatalogued scientific terms and neologisms, plus common mis-spellings and typos and jargon and dialect orthography (like our color, meter, checker, jail etc. for the Brits colour, metre, chequer, gaol) ...
If you don't want to keep the entire corpus of "valid" words in your code, you're going to have to make some compromises. Maybe you'll want to exclude words like "thou," "hauberk," and "coney." Not so good if you're subscribing to an Early Modern Literature listserv.
So you're going to need some logic to determine whether or not a "valid" word that occurs in a message is meaningful. Here's how one rather well known discussion of Bayesian filtering deals with this issue (of unknown words); this is precisely the logic that spammers with random meaningful words are exploiting:
One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.
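A rough sketch of how that default plays out when the per-word probabilities are combined. The 0.4 for unknown words comes from the passage quoted above; taking only the 15 most "interesting" words (those furthest from 0.5) is, as far as I recall, also from the same essay. The word probabilities in the usage note are hypothetical.

```python
from math import prod

UNKNOWN = 0.4     # probability for never-seen words, per the quoted passage
MAX_TOKENS = 15   # how many of the most extreme words get combined

def spam_score(tokens, probs):
    """Combine the most extreme per-word probabilities into one score in [0, 1]."""
    scored = [probs.get(t, UNKNOWN) for t in set(tokens)]
    # "Most interesting" means furthest from the neutral value 0.5.
    scored.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = scored[:MAX_TOKENS]
    spammy = prod(top)
    hammy = prod(1.0 - p for p in top)
    return spammy / (spammy + hammy)
```

Take a three-word spam whose words each score 0.99. Padding it with twenty never-seen words fills the remaining twelve slots with 0.4s and pulls the score down a little, but it stays above 0.99. Unknown padding nudges the result; it takes words the filter already trusts to really move it.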
So, what if all the words are valid, but the sentences aren't? Grammar checkers involve a lot more logic than spellcheckers do, and are consequently a lot less accurate. Fact is, you can also fool a grammar checker filter: just pad with random quotations from novels, etc. instead of padding with random words or random misspelled strings.
So the Bayesian approach of identifying spam and ham words is a pretty effective one, given the limitations.
It's old fashioned, and some of you will probably make fun of me for using it, but hey, I'm old school. FYI, here's my method:
;)
1. Create manual spam filters (NOT Bayesian filters) in your inbox called "Friends and Family", "Work", "Services", "logfiles", and any others you find you need. Each category applies to a broad type of email address you'll receive email from. Then create a subdirectory in your inbox for each of these filters (named the same way, naturally).
2. For each filter, build a list of people who are allowed to email you. For example, your ISP, your bank, and your phone company would probably be added to services. Just add the email address they send their messages from to the list.
3. For each filter, have the filter move messages matching the filter (From equals ) to the correct subdirectory for the filter. Then stop processing for that message, so it doesn't get interpreted by other filters. Think of this as an analogy for ipfilter or ipfw in your firewall setup -- only you're filtering emails instead of packets.
4. Finally, DELETE EVERYTHING ELSE in the very last filter.
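The four steps above amount to an ordered rule list where the first match wins and everything else hits a default-deny rule. A minimal Python sketch of that logic, with made-up folder contents and addresses (your mail client's own filter UI would normally do this for you):

```python
from email.utils import parseaddr

# Hypothetical rules; checked in order, first match wins (like firewall rules).
RULES = [
    ("Friends and Family", {"mom@example.org", "pat@example.net"}),
    ("Work", {"boss@corp.example"}),
    ("Services", {"billing@isp.example", "alerts@bank.example"}),
]
DEFAULT_FOLDER = "Deleted Items"  # step 4: everything else

def route(from_header, rules=RULES):
    """Return the folder a message belongs in, based on its From header."""
    _, address = parseaddr(from_header)
    address = address.lower()
    for folder, allowed in rules:
        if address in allowed:
            return folder  # stop processing once a rule matches
    return DEFAULT_FOLDER
```

For example, `route("Mom <Mom@Example.org>")` lands in "Friends and Family", while a sender on no list falls through to "Deleted Items".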
You USE this approach by doing a quick scan of the deleted items folder to see if anything is interesting. If not, just clean out those deleted items. It's a one step operation, much easier than selectively deleting a hundred emails one at a time.
Then, you scan each of the folders you set up, IF the folder has picked up an email, focusing only on your REAL email.
This approach has saved me a HUGE amount of work lately. My life is a whole lot easier, and it's way easier than trying to train a Bayesian filter. If I don't know you, you can't get too much of my attention.
It's all about being on the list, sort of like getting into a nightclub...
Farewell! It's been a fine buncha years!
Just block the domain name/ip of the hosted images. Most spams I get come from random IPs but usually have common IP/domain name for the hosted images e.g.
hostz300001.com/ads/viagra.jpg
Or whatever. I've cut down from 50 spams to about 3 or so a day by doing that.
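In case it helps anyone, a sketch of pulling the image hosts out of an HTML message with Python's standard library. The class and function names are mine, and it only handles absolute image URLs, which is what these spams seem to use. The resulting set can be checked against whatever blocklist you keep.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ImageHostCollector(HTMLParser):
    """Collect the host of every <img src=...> in an HTML body."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        host = urlsplit(src).hostname  # None for relative or empty sources
        if host:
            self.hosts.add(host)

def image_hosts(html):
    """Return the set of hosts serving images referenced by the HTML."""
    collector = ImageHostCollector()
    collector.feed(html)
    collector.close()
    return collector.hosts
```

A message is then rejected when `image_hosts(body) & blocked_hosts` is non-empty, no matter which random IP delivered it.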
I bet a bayesian filter would work nicer but unfortunately I'm too lazy to mod the mail setup [that isn't mine] to get one installed..
Tom
Someday, I'll have a real sig.
I've been filtering subject lines with too much punctuation for some time now; it catches quite a bit.
1337 speak isn't a big deal. It's definitely filterable.
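A punctuation filter on subject lines can be as simple as the following sketch. The 0.3 threshold is an arbitrary assumption for illustration, not a number from my actual setup, and only ASCII punctuation is counted.

```python
import string

def punctuation_ratio(subject):
    """Fraction of non-space characters in the subject that are ASCII punctuation."""
    chars = [c for c in subject if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)

def too_much_punctuation(subject, threshold=0.3):
    """True when punctuation makes up more than `threshold` of the subject."""
    return punctuation_ratio(subject) > threshold
```

Something like "V.I.A.G.R.A. !!! $$$" is two-thirds punctuation and gets flagged, while "Re: lunch on Friday?" sits well under the threshold.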
I've begun seeing chunks of text appearing in messages that are like legitimate mini-messages in and of themselves. Sort of like a counter weight. I don't think the aim is to pound Spam through the filters now, because what's happening is spam is getting slightly lower ratings each time while legitimate messages are getting slightly higher ratings.
In other words, the spam probably won't ever be legitimate, but it's making me lower my threshold for what is spam more and more. Eventually, I'll get to the point where some legit messages will cross over into being labeled as spam and spam will go through legit because the thresholds will be so close together as to practically overlap. It's also killing my ability to keep a spam trap that I can use to quickly train filters.
Whether this scene will actually play out and the "plot" will be successful or not remains to be seen, however.
Alito: A vote for Alito is a punch in the eye to put that bitch back in her place!
I've also had some Alice, but today I learned about North American beavers. I had no idea they were so large.
That's exactly why you need to ENL4R9E `/U0R P3N1S!!!1!1 because North American women have 1arqer beavers and thus require a bigegr PE/\/i5 to st!mu1ate them.
if you can write me a regex that filters that out 80% of the time with 0 false positives, i will pay you 6 figures a year to sit on a chair in my museum as one of life's "mysteries".
Pay me six figures a year and I will sit in a chair and do it for you manually.
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?
Examples from my corpus:
VIAGRA: 99.797%
V!AGRA: 99.9999%
AGRA: 99.9999% (from things like VI.AGRA)
IAGRA: 99.9999%
PORN: 98.573%
P0RN: 99.9999%
PR0N: 99.9999%
Plus, the trick is looking for things that give away spam that aren't just words. I call them "characteristics." For example:
Various pharmacy related terms: 99.9999%
HTML using % escape sequences: 98.789%
Http:// references that don't use www: 85.538%
=?ISO- in Subject: 99.9999%
Suspicious domains (BIZ, BR, PRO, etc.): 99.174%
1 "Adult Term": 70.8%
2 "Adult Terms": 85.7%
5+ "Adult Terms": 99.9999%
5+ HTML Comments: 92.0%
10+ HTML Comments: 98.3%
30+ HTML Comments: 99.9999%
In short, there are so many aspects of a message you can analyze and make "Characteristics" that my Bayesian filter can often make a decision entirely based on the characteristics without even looking at some of the terms used within the message. But if the characteristics aren't damning enough, the content virtually always is.
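To give a flavor of how such "characteristics" can be produced, here is a simplified Python sketch that turns a few of the traits listed above into pseudo-tokens, which can then be trained and scored like any other word. The token names, the regexes, and the function names are illustrative assumptions, not my actual implementation.

```python
import re

def bucket(count, thresholds):
    """Return the largest threshold reached by count, or None if none is reached."""
    reached = [t for t in thresholds if count >= t]
    return max(reached) if reached else None

def characteristics(subject, html_body):
    """Turn structural traits of a message into pseudo-tokens for the filter."""
    feats = set()
    if "=?iso-" in subject.lower():
        feats.add("char:subject-iso-encoded")
    # Count HTML comments and bucket them the same way as the list above.
    comments = len(re.findall(r"<!--.*?-->", html_body, flags=re.DOTALL))
    level = bucket(comments, (5, 10, 30))
    if level is not None:
        feats.add(f"char:html-comments-{level}+")
    # http:// references whose host does not start with "www."
    if re.search(r"http://(?!www\.)", html_body, flags=re.IGNORECASE):
        feats.add("char:http-no-www")
    # %xx escape sequences inside a URL
    if re.search(r"http://\S*%[0-9a-f]{2}", html_body, flags=re.IGNORECASE):
        feats.add("char:percent-escapes")
    return feats
```

Because each characteristic is just another token, the filter learns its weight from the corpus the same way it learns "P0RN"; nothing has to be hand-scored.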
What if spam, and the spammers' software, was actually being used by a third party in a surreptitious manner to send/receive messages? Kinda like plaintext stego. Maybe the software used by spammers is backdoored by this third party - he sends instructions to the machine(s), maybe via a virus or something simpler, the spammers send their messages, but "unknown" to them the spams have this garbage at the end. The spammer doesn't really care, maybe he bitches at whatever passes as tech support for the spam software. Most people who receive the spam see the stuff as garbage, or filter busters. But a certain group of the third party's friends - they have special email software that downloads these spams, and strips the garbage out, decodes it, and reassembles it into the real message. Maybe each spam only contains the equivalent of a couple of characters after decoding (maybe the garbage is actually packets telling order in the sequence, and other info to reconstruct the message) - but over a week or so, an entire message could be sent...
What is the possibility of that? Occam's Razor suggests otherwise, and filter busters are probably what the stuff is - but...what if...?
Reason is the Path to God - Anon
I'm worried about spammers realizing that they can effectively negate the usefulness of filters without breaking a sweat (spammers, please don't read the following). If they switched from super-short fake messages to mock-real messages (a paragraph or two long, a legit-sounding subject, etc.) and they all sent out millions a day, everyone would be forced to turn off their filters. There would be no effective way to distinguish those fake messages from real messages for most people (without a whitelist/blacklist system, which does more harm than good for most).
In such a situation, email would grind to a halt. Anyone who kept trying to train their filters would just end up blocking most legit emails, and those who don't train for it or turn off would be flooded with real and fake messages they can't distinguish between. The messages would even be profitable, so long as your "friend" included a link to some "cool website" that happens to sell [fill in spam product here]. Go ahead and train your filter to block emails containing URLs. Hah! Maybe if you don't have a job, friends, or buy things over the internet you can, but for most it's just not going to work.
G