The Growing Field Guide To Spam Techniques
Aneusomy writes "From Activestate: 'Compiled by Dr. John Graham-Cumming, a leading anti-spam researcher and member of the ActiveState Anti-Spam Task Force, the ActiveState Field Guide to Spam is a selection of the tricks spammers use to hide their messages from filters, providing examples taken from real-world spam messages.' The hope is that Activestate and others can contribute to continually expand this guide, so that anti-spam filters improve."
Linux and Linus Torvalds are more responsible and liable for spam than any other single entity. Personally I use IIS 6.0 which is secured against any external threat.
I use Thunderbird, and found it to be a good system.
Before that I used PopFile, but it blocked some good mail. That was reason enough to drop it.
Just a thought, but....
Making the methods used to intercept and filter spam public means spammers will always be one step ahead. If they know the strategy of those trying to stop them, that only helps them.
Is there a better way?
The first generation of spam filters was crude and simplistic - they would delete an email based on the sender, or maybe one or two key words. This isn't effective because spammers rarely use their own email addresses in the "Reply to" field, and deleting all email which contains the words "marketing" or "investment opportunity" is likely to delete legitimate email. Besides, spammers can easily get around this by altering words in such a way as to defeat filters (V*I*A*G*R*A is easily read by a human, but a filter looking for "viagra" and "viagara" would not stop it).
The best spam filters today use Bayesian filtering to eliminate spam: you train the filter by giving it a pile of email and telling it these are genuine, and another pile and saying these are spam. The filter then looks through the mail and gives certain words a weighting - if most spam contains big red lettering with words like "investment", "click here to be removed" and "penis enlargement", then email containing them would score highly and be given a higher probability of being marked spam. Email containing your name, or words relating to your life or work, would be given a lower probability of being called spam.
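As a rough illustration of the train-then-weight idea described above, here is a minimal naive Bayes sketch in Python. It is a toy under stated assumptions, not how PopFile, SpamAssassin, or any real filter is implemented: the class name TinyBayes, the tokenizer, and the add-one smoothing are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    """Toy two-class naive Bayes filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one message to the 'spam' or 'ham' pile."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return the estimated probability that the text is spam."""
        if not all(self.messages.values()):
            raise ValueError("train at least one spam and one ham message first")
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.counts[label].values())
            # Prior: fraction of training messages carrying this label.
            score = math.log(self.messages[label] / total_msgs)
            for word in tokenize(text):
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log(
                    (self.counts[label][word] + 1) / (total_words + len(vocab))
                )
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


bayes = TinyBayes()
bayes.train("click here to be removed investment opportunity", "spam")
bayes.train("penis enlargement click here", "spam")
bayes.train("meeting notes for the project on monday", "ham")
bayes.train("lunch on monday with the project team", "ham")
```

With that tiny training set, bayes.spam_probability("click here for an investment") comes out well above 0.5 and bayes.spam_probability("project meeting on monday") well below it. A real filter would train on thousands of messages and would probably also look at headers, not just body words.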
And for crying out loud, "spam" is not an acronym so stop writing it in upper case!
Sorry, but my karma just ran over your dogma.
I've definitely noticed that my spamassassin filters are getting less effective. Six months ago, it was rare to see a spam that didn't get caught. Now maybe 10-20% get through.
As I use a sensible email client that doesn't render HTML by default, I can't even read the text of the spams anyway.
Most of the tricks in the article (yes, I read it) require the mail to be in HTML format. If they were not, filters would be much more effective.
I don't remember ever receiving an e-mail that actually had any content requiring it to be HTML. It would be pretty simple to set up a mail server to bounce any incoming (or outgoing for that matter) HTML mail with a friendly notice that the server does not accept HTML mail, and to please try again using ASCII. The problem is that there are plenty of people who have no idea what they are supposed to do at that point.
Also I wonder if it could be effective for filters to detect whether such obfuscation is used rather than try to parse the contents and filter based on that. Many of the methods used are pretty obvious if you try to detect that specifically.
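To make the "detect the obfuscation itself" idea concrete, here is a small Python sketch that flags a message when it shows signs of hiding its text, without trying to work out what the text says. The three patterns and their names are assumptions picked for the example, not a list taken from the guide or from any real filter.

```python
import re

# Hypothetical patterns chosen for this example; real spam uses many more tricks.
SIGNALS = {
    # Letters separated by punctuation, e.g. V*I*A*G*R*A
    "split_letters": re.compile(r"\b(?:[A-Za-z][^A-Za-z0-9\s]){3,}[A-Za-z]\b"),
    # CSS that sets the font size to zero
    "zero_font": re.compile(r"font-size\s*:\s*0(?![.\d])", re.I),
    # An HTML comment dropped into the middle of a word
    "comment_in_word": re.compile(r"[A-Za-z]<!--.*?-->[A-Za-z]", re.S),
}


def obfuscation_signals(raw_message):
    """Return the names of all obfuscation patterns found in the raw message."""
    return [name for name, pattern in SIGNALS.items() if pattern.search(raw_message)]
```

For example, obfuscation_signals("Buy V*I*A*G*R*A now") reports split_letters, while an ordinary sentence such as "See you at 10:30 for coffee." reports nothing. The signals could be used on their own or fed into a scoring filter as extra evidence.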
This post is free (as in cheese in a mousetrap).
Bayesian filters are all well and good, and are -- for now -- effective. But given these tricks, the only really reliable approach I've found is IP blacklists for repeat offenders. If your machine is used to spam me, and my complaint letter is not answered in a satisfactory way (i.e. an email saying "We are sorry. The spammer has been cut off") I don't accept mail from you any more.
And if you're on ATTBI, or Comcast, or PBI.net, or BT Openworld, or Chello, or any number of large ISPs with too much tolerance for spammers, and you're not on my whitelist, I can't read your emails.
And I don't care. Get an ISP that doesn't shelter spammers.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
I've often had spam get past every one of my filters, simply by being an innocuous subject (something like "Hi there, how's it going") and then a message body completely empty of any content.
I thought that was a pretty impressive attempt by those nifty spammers. Cut out all the bits of spam I ignore (such as offering me crap, giving me html email, popups etc) but keeping the bits I really hate (getting pissed off at receiving spam at all)
Well done kids, hope you keep it up!
All of these spamming techniques seem to involve visual tricks, because a human sees the rendered HTML very differently from the way the filter sees the plain text. Things like zero-height fonts, or white-on-white text, or just using one big image etc. etc.
So how about this: I think every single one of these tricks would be defeated by using this process for filtering spam:
1. Render the html to an image (not on the screen, just behind the scenes)
2. Feed the image into OCR
3. Then scan the OCR text for spam
Sure OCR is not perfect, but since these techniques are imprecise already, maybe it would work well.
Although I guess processing power is a limiting factor, but maybe someday this will be worth doing.
-- the only thing we have to fear is really scary things
You mean the "Search Pattern Assessment Model" method?
i had a friend who recently turned to the dark side and now boasts that his circle of friends include the biggest spammers in the world.
and believe it or not, the biggest break these guys have had in the past year has been help from people on the "inside".
to give you an example, an ex-AOL employee has written these guys a little proggy to send messages that make the AOL mailservers think that the mail originated on the inside of the network (which means that none of it is spam checked or filtered.)
their usual 10% deliverability to AOL.com suddenly went to 100%. make no mistake -- that was worth millions to 'em.
Why not have your spam filter render the HTML in an offscreen buffer (using existing browser/plugin APIs), then pull the straight text out of the rendered document and run the filter on that?
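A full offscreen render needs a browser engine, but a much cruder approximation of "pull out only the text a human would see" can be sketched with Python's standard html.parser. This is an assumption-laden stand-in, not a renderer: it only drops text inside elements whose inline style says display:none or font-size:0, it ignores stylesheets, colours and images, and the class and function names are made up for the example.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Inline styles treated as "invisible" in this sketch (an assumption, not a full list).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|font-size\s*:\s*0(?![.\d])", re.I)


class VisibleText(HTMLParser):
    """Collect text that is not nested inside an element hidden by inline style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hidden_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once inside a hidden element, every nested element is hidden too.
        if self.hidden_depth or HIDDEN_STYLE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with hidden spans and comments removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Feeding it 'Fr<span style="display:none">iendly gr</span>ee<!-- x -->ting' yields "Freeting", which is what the reader would see, and that string is what the filter would then score. Badly nested HTML can confuse the depth counter, so treat this as a sketch only.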
who can possibly resist if the word "Free" is in red and bold? Well, me for starters. Still, this one line of the article is taken from the opening, describing a more serious problem: the fact that much spam uses so-called 'enchanted email', that is, HTML mail. For all the other bad things about that, the one thing I find most sinister is that it is easy to have the HTML code pull a picture or something from a remote server, thus making it easy to validate your e-mail address (logically, if you open the mail, the address they sent it to is active). In short, banning 'enchanted email' would lessen the amount of spam, as well as the bandwidth it steals.
Apart from that I got a chuckle out the fact that spammers now seem to be speaking 1337;
Ze Foreign Accent
What: Replace letters with numbers or use nonsense accents
Example from the wild:
V1DE0 T4PE M0RTG4GE
Fántástìç -- eárn mõnéy thrôugh unçõlleçted judgments
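Both tricks in that example can be partly undone before matching. The sketch below, in Python, strips accents with Unicode decomposition and maps common digit-for-letter swaps back to letters. The substitution table is an assumption made for the example, and the function also mangles genuine numbers, so its output is only useful for keyword matching, not for display.

```python
import unicodedata

# Hypothetical digit/symbol-to-letter table; extend as needed.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)


def normalize(text):
    """Strip accents, undo simple leetspeak substitutions, and lowercase."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(LEET).lower()
```

With this, normalize("V1DE0 T4PE M0RTG4GE") gives "video tape mortgage", and the accented line from the example comes out as plain ASCII letters.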
The best spam filter - with the fewest false positives - is the one most people of common sense have between their ears. Anything else is merely sorting your mail according to a fixed set of rules.
Everything in the world is controlled by a small, evil group to which, unfortunately, no one you know belongs.
One purpose of hiding text is to fool anti spam filters.
Let's say that everything between '[/]' is visually hidden. I can send you the message:
Fre[dom for th]e pen[ and th]is enl[ist l]argement.
The 'filter' will see:
Fredom for the pen and this enlist largement.
The user will see:
Free penis enlargement.
Cheers,
--fred
I still favour going after the people paying the spammers rather than the spammers themselves...unlike the big spam rings, they at least have to be locatable, otherwise they'd never be able to sell you stuff.
When I am king, you will be first against the wall.
From what I gathered, it demonstrates two things:
Firstly, the techniques spammers use to format the text in the email so that the end user is still able to read it.
Secondly, it demonstrates how, using the above approach, they try to stop spam filters from working. For example, instead of having an email titled "Free viagra" you could write it as "F*r*e*e V*i*a*g*r*a" in an attempt to stop a spam filter from spotting Viagra as easily in the title. In the body of the email you could write the html in such a way that deciphering any words is quite tricky, eg writing Viagra as (font size="2")V(font size="2")iag(font size="2")ra(/font) etc. Spotting all variants of 'hiding' such words is certainly not as simple as you might first think.
It certainly gave me an interesting insight into the problem that it is, and how the spammers are trying and continually evolving their techniques to ensure they can carry on.
I think the purpose is to vary the hidden text to fool anti-spam systems which rely on blocking mail based on signatures of the message body.
If you send 150,000 messages which say "Free Porn Here" systems such as Britemail are going to quickly generate one signature for the mail and block most of it. If however you have the following example (using the fictional HTML HIDE tag)
Free [HIDE] from your meeting at 10:30 [/HIDE] porn [HIDE] cate suggested meeting for coffee [/HIDE] here [HIDE] I will be in work late today [/HIDE]
The message is still displayed in the browser as "Free porn here". However, filters such as those used by Mac Mail and Mozilla may not pick it up as junk because the hidden words look like real email. If you change the hidden sentences every 100 emails then the signature based spam blocking systems won't pick it up as every signature is different and (in this example) you are using real words.
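Here is a small Python sketch of why varying the hidden text defeats a whole-body signature. It uses a plain SHA-256 hash as a stand-in, since how Britemail or any other commercial product actually computes its signatures is not described here, and it treats the fictional [HIDE] tag literally.

```python
import hashlib
import re

# The fictional HIDE tag from the example above.
HIDE = re.compile(r"\[HIDE\].*?\[/HIDE\]", re.S)


def signature(body):
    """Hash the whitespace-normalized body (a stand-in for a real signature)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def displayed_text(body):
    """Return what the reader sees once the hidden spans are dropped."""
    return " ".join(HIDE.sub(" ", body).split())


batch_1 = "Free [HIDE] from your meeting at 10:30 [/HIDE] porn [HIDE] cate suggested meeting for coffee [/HIDE] here"
batch_2 = "Free [HIDE] I will be in work late today [/HIDE] porn here"
```

signature(batch_1) and signature(batch_2) differ, so a blocklist of body hashes treats them as unrelated messages, yet displayed_text() returns "Free porn here" for both. Hashing the displayed text instead of the raw body gives the same signature for the two batches, which is one argument for stripping hidden content before fingerprinting.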
One of the best solutions to this I have seen is KMail, which displays HTML mail as text and lets you click a button to render it as HTML. This doesn't stop the spam, but does give you the ability not to see many images you would rather not see at 10am on a Monday morning, and allows you to stop web bugs (remote images referenced in the HTML which can be used to indicate successful message delivery).
And for crying out loud, "spam" is not an acronym so stop writing it in upper case!
Actually writing it uppercase suggests that you are crying it out loud.
Scitne aliquis remedium potimum crapulae?
I helped this lady out who had a 100% opt-in mailing list, but some people weren't getting their mailings... We came to find out the emails were being flagged as spam, so I set up a dummy email account for her that took every inbound message, sent it through spamassassin (with verbose reports, etc) - and then sent the email back to her.
Now she can see if there's a problem with the headers, the content of the email, etc - so she tunes the email to get the lowest spamassassin score. (You know, the last major version of spamassassin took off points if you put your email client header as being Mozilla! Hah.. That one is gone now)..
This lady definitely isn't a spammer tho, just someone with a small mailing list of 100% opted-in people.
I'm sure spammers do the same thing. I would.
Hormel Foods has this to say on the subject
"We do not object to use of this slang term to describe UCE (unsolicited commercial email), although we do object to the use of our product image in association with that term. Also, if the term is to be used, it should be used in all lower-case letters to distinguish it from our trademark SPAM, which should be used with all uppercase letters."
so....
"SPAM" is Pork and Ham
"spam" is unsolicited email
"SPAM SPAM SPAM SPAM
SPAM SPAM SPAM SPAM
Lovely SPAM, wonderful SPAM!"
is a Monty Python song
It isn't that this new one that I saw was all that amazing an idea; it is such an obvious idea that I don't know why I hadn't seen it until recently.
They send the mail as you. Fake the headers and make it look like it is from you. To you. From you.
I had our local setup here allowing in anything that was from our domain. Now I have to stop that.
I suppose the spammers saw that people were allowing their own domains and set it up that way.
On a side note and not all that related, I've noticed that I am getting (about once a week) an e-mail from a bank - citibank, or wells fargo, telling me that my loan application has not been approved, see details attached.
Now, I haven't been applying for loans, and the file attached is a *.pif file... which are notorious for being viruses, and not a format that a bank will send you.
Not to mention that looking at the headers, they usually come from attbi.com which is cable modems, and I have seen through Compuserve as well - which aren't exactly how banks usually do business.
There are some odd things afoot now, in the Villa Straylight.
Someone is paying the spammers to spam. They usually have a URL in the email. Set up a screen saver to DDOS the payer. FOLLOW THE MONEY, make it bad to buy spam.
Someone please explain. People who have spam filters on don't want to receive spam, and will most likely just ignore/delete any spam that does get through. Why do the spammers waste so much time trying to get past the filters? Is it to reach the unwashed masses behind ISP filters?
I have on occasion misclassified mail myself, both ways. A few spams (unsolicited bulk emails) have been full enough of content that I found interesting that only after reading them did I realize they were not from anybody I knew. Conversely, I have a couple of times received mail which was for me, and was genuine, but so poorly formatted (lots of obnoxious html, strange subject and so on) that I deleted it as spam and only later came to understand it was a serious message.
The point is, not even I can do spam classification 100% correctly. It would be a tall order indeed to have an automated tool do it. But does this matter? There are two issues: discarded genuine mail, and non-caught spam.
Discarded genuine mail is not really as big a problem as people make it out to be. Mail is inherently not guaranteed; messages do fall between the cracks now and again. Swallowed by a buggy server, lost in limbo as a network connection goes down, never having a chance due to a misspelt or obsolete address, sent on a wild goose chase due to a temporary DNS error. Mail does disappear. Everybody knows that - or should know. Mistaking a mail for spam is just another crack for it to fall into. As long as the rate is low there really is no problem. And those sending mail that can easily be mistaken for spam will wise up eventually, as they see a disproportionate amount of their email get lost in the ether.
Missing spam is no real problem either. The big issue is having fifty spam in your inbox every morning, with another fifty arriving during the day. Having one or two a day, on the other hand, is not that painful.
The point is, it is not a binary system: A spam system that misses two spams a day is better than one that misses five, and vastly better than having no system at all. Similarly, one that classifies one genuine message out of a thousand as spam is no disaster. Not good, but not a reason to shut it all down either. If reliability is _that_ important, what are you doing using email in the first place?
Filtering isn't perfect. It won't ever be perfect. That's quite alright. Saying a technique is worthless because it makes an occasional mistake is throwing out the baby with the bathwater.
Trust the Computer. The Computer is your friend.
Sexual Propaganda Aimed at Men
This space available.
This will all be blindingly obvious to most readers of /., but just for the record:
Don't use your personal email address for anything online. Don't post to usenet with it, don't use it to register for anything, don't ever use it where there's any chance of it being sold to a third party or picked up by a web crawler. Use a free throwaway web-based account like hotmail or yahoo, that's what they're for. I have a verizon.net primary email address, and I've never received a single piece of spam from it.
However, I still have a forward-only email address from my university circa 1992. Back then, there was no spam and that address has to be on every spammer's list on the planet. I still get a legitimate email every year or two, but spam outnumbers these by at least 10,000 to 1. SpamAssassin does a surprisingly good job of identifying the garbage.
I also use a proxy to surf the web, as well as a large hosts file that reroutes requests to adservers to 127.0.0.1:80, combined with a utility that returns a transparent 1x1 gif to any request on port 80. And of course I use mozilla to block pop-ups and whatnot. I'm so used to surfing in this way that I always recoil in horror when I have to use IE on a naked, unprotected box. How on earth can anyone stand it?
As for more traditional types of spam such as telemarketers, there's the national do not call list. It's free, so there's nothing to lose. You'll also want to check out the many excellent resources at the Junkbusters website. One of the most useful features is a Junkbusters Declare page, which builds custom form letters for you that you can use to opt out of Direct Marketing Association junkmail, as well as telling your financial institutions, etc., not to sell your name to third parties. I used it, it's painless, and my privacy is protected.
Of course, it would be much better if we didn't have to jump through hoop after hoop just to get through the day without being pestered by morons.
After a while, SpamAssassin's false negatives and positives drove me to the Tagged Message Delivery Agent (TMDA).
;)
TMDA has flexible whitelist and blacklist capabilities. But the big win is that it can be set to autoreply to anyone not on the whitelist, and require them to reply back before allowing the email to get through. Of course, very few spammers have valid return email addresses...
This may seem drastic, but in fact it has made life soooo much easier. It also helps you to "automagically" get off those email lists you signed up for a long time ago, don't really care about, and are too lazy (or lost the info) to sign yourself off
The only sad thing is that no longer do Russian women want to extend my length or give me free money or viagra, and I am no longer in contact with Ms. Sesse Seiko from Uganda...
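The whitelist-plus-challenge flow described in this comment can be sketched in a few lines of Python. This is not TMDA's code or configuration; the ChallengeGate class, its method names, and the token format are all hypothetical, and sending the actual autoreply is left out.

```python
import secrets


class ChallengeGate:
    """Toy challenge-response gate: unknown senders must confirm before delivery."""

    def __init__(self, whitelist=(), blacklist=()):
        self.whitelist = {addr.lower() for addr in whitelist}
        self.blacklist = {addr.lower() for addr in blacklist}
        self.pending = {}  # token -> (sender, held message)

    def receive(self, sender, message):
        """Return 'deliver', 'drop', or 'challenge:<token>' for an incoming message."""
        sender = sender.lower()
        if sender in self.blacklist:
            return "drop"
        if sender in self.whitelist:
            return "deliver"
        # Hold the message and issue a token the sender must echo back.
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return "challenge:" + token

    def confirm(self, token):
        """Release a held message and whitelist its sender; None if token is unknown."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.whitelist.add(sender)
        return message
```

A sender whose return address is forged never sees the challenge, so the held message is never released. The same property is why mailing lists nobody bothers to confirm for quietly stop arriving.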
The key difference is that KMail does this on a per message basis, whereas in Mozilla this is set once in Preferences and I suspect the same is true in Evolution. Thus looking at a HTML message I just received I get the following in a box at the top of the message;
"Note: This is an HTML message. For security reasons, only the raw HTML code is shown. If you trust the sender of this message then you can activate formatted HTML display for this message by clicking here."
The HTML code follows and a single click turns it into a fully rendered message, or an alternate click consigns it to the trash can.
It may be possible to add this as a mozilla mail / thunderbird toolbar, and as Thunderbird takes off I hope we will see this type of quick prefs bar develop to the same extent they have been developed for the mozilla browser component.
Why DON'T spammers remove us from their lists when we ask? They're working REALLY REALLY hard (with all the filtering, header forging, etc.) to send mail to people that don't want it. If they would just target their email to those who had indicated that they wanted it, and removed us that had indicated they didn't, they'd save themselves a lot of grief, as measured in legal and technical hassle.
Granted, it's easier for them to ignore the "remove me"s, but is the trouble saved in 'not removing' >= the trouble spent in 'getting past spam filters'?
Besides, if the mails were targeted to those that THOUGHT their penis was small and needed extension....doesn't that mean it's not spam anymore? And wouldn't that make their click-through (or whatever) rate higher, therefore making their own attractiveness as a bulk emailer greater to their customers?
I'm just thinkin' here...
This article highlights why I have stopped using filters altogether. End-user filters address the symptom, not the cause. The problem with even the best filter is the mail is already there, taking up space, hogging bandwidth, and the filter is churning CPU cycles to hopefully deal with it. My mail server uses 3 RBLs (blacklists), and one I have programmed myself (rbl.restongeek.com). I get no false positives, and only a trickle of spam that gets through. I also get some small pleasure reviewing my server logs of the rejected mail, where the reject happened before any of the actual data was transmitted (see my /. journal for a sample).
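For anyone who has not looked at how a DNS blacklist lookup works: the client reverses the octets of the connecting IPv4 address, appends the blacklist's zone, and does an ordinary DNS lookup; an answer means the address is listed. A minimal Python sketch follows. The zone dnsbl.example.org is a placeholder, not a real list, and treating every lookup failure as "not listed" is a simplification.

```python
import socket


def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ipv4)
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4, zone):
    """Return True if the blacklist zone has an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4, zone))
    except socket.gaierror:
        # No record (or the lookup failed), so treat the address as not listed.
        return False
    return True
```

A mail server can run this check as soon as the connection arrives and refuse the session before any message data is sent, which is the bandwidth saving mentioned above.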
Of the anti-spam legislation currently being proposed, the most important clauses are those that deal with forged headers and illegal use of other servers (relay rape). Once such laws are in place, blacklists will become even more effective, because spammers will have fewer places to run and hide (if they sell something from the U.S.A.).
One final piece to the solution is to get ISPs to act responsibly, and block egress traffic on port 25 for dynamic IP addresses (look up many of my previous posts for more detail on this point). Again, combined with blacklists, this will reduce spam tremendously-- not just in your inbox, but your (and your ISP's) bandwidth.
Some time ago a new way of filtering spam was discovered. The solution is simple, yet brilliant - we already have those "To confirm you're not a script, please type the text shown in this image" at various websites to guard against form-submitting bots. Apply this to email (bounce back all emails with an image attached) and all the spam is gone! Not that it is a perfect solution (I wish there was...) as I see 2 minor flaws in this system : ;)
1. It introduces a delay in communication - a confirmation letter has to be sent and a reply received.
2. Not all recipients at the other end are *that smart* to understand "what the hell this image means and what am I supposed to do with it?"
On the other hand, it can serve as a lameness filter.
But still a promising technology. I've searched the web and came up with both subscription services (Mailblocks) and client-side apps (Icemile). The last one is free and I think I'll stick with it.
What's awesome about the author (Dr. John Graham-Cumming) is that he not only knows his stuff, but he puts it out in his open source software called PopFile, written in Perl.
PopFile can be located at http://popfile.sourceforge.net.
I am currently using PopFile, with an accuracy of 98.26% from nearly 8,000 messages. It's the best I've ever used, and it's free!
GeekWares - Buy and Download Today!
You can use the metaphone algorithm (I use PHP so, http://us3.php.net/manual/en/function.metaphone.php) which has come in handy.. Just strip all HTML and de-urlencode then run this on the msg, it totally ignores numbers and punctuation and any letters that are not in (a-z A-Z). You will need to have a database pre-made full of metaphone values from a dictionary then start a comparison and you can get a general feel for the msg.
I took all the words used in a product called spamassassin and used that to do a comparison.. Coupled with bayes filtering I imagine this would be pretty much the best way to filter mail.
It is kind of an interesting approach based on what mail "sounds" like vs what it actually contains.. If you filter on the straight contents these guys will just keep coming up with different ways of encoding and generally being twitchy.
However, their mail will *always* have that "buy this!" kind of sound.
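Python has no metaphone function in its standard library, so the sketch below swaps in Soundex, a much cruder phonetic code, purely to show the "match on sound, not spelling" idea. It is not the poster's code and not equivalent to PHP's metaphone(); like the approach described above, it throws away digits and punctuation before coding.

```python
# Standard Soundex consonant groups.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word, ignoring non-letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate repeated codes
        digit = CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code after a vowel counts
    return (code + "000")[:4]
```

Both soundex("viagra") and soundex("V*1*A*G*R*A") come out as V260, so a lookup table of codes for known spam words would catch the punctuated variant. Soundex collides far more often than metaphone, so expect more false matches than the approach described in the comment.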
I built a system a while back that was processing all double bounces from three servers and handled around 50k/day spams and came up with some interesting results.
If anyone is interested I'll dig up the code and place it on my site with the rest of the stuff there.
anime+manga together at last.. in real time.
Because this isn't a reliable test.
1. Most of the SPAM sent today has this little problem, where the sending server does not resolve to the IP which is listed in the header.
Pay attention to your email some time. Lots of legitimate email doesn't match, either. Many companies and most hosting companies use one server for incoming mail - the server the MX record points to - and another for outgoing - one which doesn't have an MX record.
2. It will permit people to first map a domain to an IP. (Makes it harder for a SPAMMER because now he needs to register a domain.) Once the domain is used to SPAM it can then be blocked. All blocked domains can be easily maintained in a list and shared by ISPs.
Except that most spammers don't use servers under their control, anyway, so this test wouldn't work.
3. Time is money. Moving domains from one ISP to another does not help the SPAMMER. The domain is blocked and the IP is identified. The SPAMMER has to be able to activate multiple domains, multiple DNS servers and such. The patterns will be easier to identify and it will be easier to block SPAM by either Blocking the Domain or the DNS server or all the IPs of a certain offending ISP.
Which also doesn't work, because the spammers don't use their own servers.
4. In order to acquire a domain a payment transaction must occur. This can be traced if it's a credit card. ISPs who accept cash without ID or who continually HOST SPAMMERS can be blocked. The work involved to acquire a domain may increase the costs of a domain but I am sure that this will enable people to assign responsibility.
A theory beloved of fascists and quick-fix pipe dreamers, but never actually proven to work in the real world. In fact, I don't know where this has ever worked, period.
While this system is not perfect and, yes it may cause some headaches for most, having sendmail match the MX record to the IP of the sending server would eliminate almost 100% of all the SPAM that I have encountered in the last 3 months. We would still need to keep the existing anti-spam practices in place.
Then what's the freaking point? For me, and for most people I know, this would block about 40% of all *email*, spam and non-spam. The other 60% also includes spam and regular email, so you're not doing anything positive. And the current techniques, constantly improving as more and better filtering techniques become available (e.g. Bayes) already stop 99.9% of the spam I or my users receive. What else do you need? Why make sweeping changes like this to catch .1% or less of spam, particularly with the damage it would do to legitimate email?
Amazing how all the people making these "brilliant" suggestions couldn't manage a real-world mailserver to save their soul. Running Sendmail on your home Linux box doesn't make you a mail admin.
+4, insightful?
I beg to differ!
While this system is not perfect and, yes it may cause some headaches for most, having sendmail match the MX record to the IP of the sending server would eliminate almost 100% of all the SPAM that I have encountered in the last 3 months.
You're right, this system is not perfect, and would cause a *lot* of headaches for almost all users (or at least, us admins).
Firstly, it creates a lot of technical headaches..
The way I see it, the only way I could send email under your proposed system would be through a relay whose IP address was the same as the server listed in the domain's MX record, right?
So, in order to send email from myaddress@somedomain.com, my MTA has to have the same IP address as somedomain.com's mail exchanger?
Not. Gonna. Work.
I send mail from several different physical locations (home, work, etc), as several different addresses/domains. This means in order to send email as my home address while I'm at work, I'd have to send through my home ISP's mail relay. Which I can't do, because I'm not on their network (and they don't have an open relay, to prevent *spam*).
I also send email as being from a couple of domains I own, but I send this email thru whatever system I happen to be on (ISP or work, whatever), as my domain just points at things, rather than running a full-time MTA just to deliver my email..
Not to mention the fact that most ISPs I can think of would have more than one server in charge of mail, and it would be possible, if not likely, that the outgoing mail relay is a different machine than the one that accepts incoming mail (ie, the one in the MX record).
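The objection can be shown with a toy version of the proposed check in Python. The DNS data here is made up (documentation-range addresses and example.com), and a real implementation would query DNS rather than read a dictionary; the point is only what the rule accepts and rejects.

```python
# Hypothetical DNS data for illustration; a real check would look up MX records.
MX_IPS = {"example.com": {"192.0.2.10"}}


def passes_mx_check(sender_domain, connecting_ip, mx_ips=MX_IPS):
    """Accept only if the connecting IP is one of the sender domain's MX hosts."""
    return connecting_ip in mx_ips.get(sender_domain, set())


# Mail leaving through the same machine that receives mail passes.
inbound_host_ok = passes_mx_check("example.com", "192.0.2.10")
# Legitimate mail leaving through a separate outbound relay is rejected.
outbound_relay_ok = passes_mx_check("example.com", "192.0.2.25")
```

Here inbound_host_ok is True and outbound_relay_ok is False, even though both machines belong to the same domain, which is the situation described above for ISPs that split incoming and outgoing mail across different servers.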
But let's just assume, for argument's sake, that everything was working as you outline. Everyone sends mail thru a relay whose IP corresponds to the domain they're sending from.
All I need to do to send spam is get an account at an ISP, let's say I get username foo at ISP isp.com. Now I dial up, and send a big bunch of spam, from false.address@isp.com. So your domain/mx/ip check works ok, but it's still a false address. Sure, my IP address will be in the headers, but how different is that from the current situation?
Next you'll be suggesting that to combat terrorism, before getting on a plane passengers should have to pass a 1/2 hour series of tests with questions like 'are you a terrorist?' and 'Is this flight for: a) business; b) pleasure; or c) terrorism?'
Not going to make it any harder for the terrorists (except the really dumb ones), but a big pain in the ass for Joe Citizen.
(sorry, in a bit of a ranting mood)