Smart Spam Filtering For Forums and Blogs?

Posted by timothy on Sunday December 28, 2008 @09:48AM from the world-will-beat-a-path-to-your-door dept.

phorm writes "While filtering for spam on email and related media seems to be fairly productive, there is a growing issue with spam on forums, message boards, blogs, and other such sites. In many cases, sites use prevention methods such as captchas or question-and-answer challenges to try to restrict input to human visitors. However, even with such safeguards (and especially with most forms of captcha being cracked fairly often these days), spammers are becoming an increasing nuisance. While searching for plugins or extensions to SpamAssassin and the like, I have had little luck finding anything not tied into the email framework. Google searches for PHP-based spam filtering tend to come up with mostly commercial and/or email-related filters. Does anyone know of a good system for filtering spam in general messages? Preferably such a system would be FOSS, and something with a daemon component (accessible by port or socket) to offer quick response times."


Akismet by seifried · 2008-12-28 09:49 · Score: 4, Informative

Akismet
Second that! by _merlin · 2008-12-28 10:03 · Score: 5, Informative

Akismet is the best thing for blog spam prevention ever. I can't believe you've never stumbled across it before. It uses statistical analysis to identify spam, and the more people use it, the better it gets. If everyone used it, the blog spammers would just disappear because their attacks would be completely ineffective.
1. Re:Second that! by seifried · 2008-12-28 10:09 · Score: 4, Informative
  
  Add to which it has an API/etc. It really is what you should be using.
2. Re:Second that! by Indefinite,+Ephemera · 2008-12-28 10:21 · Score: 5, Interesting
  
  The difficulty in evaluating Akismet - I speak not as a user but as someone who ended up apparently blacklisted and having to try their appeals system - is that everyone I see praising it is by definition the kind of person who pays attention to the filter and therefore will train it effectively. Since your average wordpress.com user more likely lets false positives pile up, I'd love to know how effective it is for people who don't wonder how effective it is.
3. Re:Second that! by _merlin · 2008-12-28 10:55 · Score: 4, Informative
  
  I've used it for a few years now. In that time, it has caught tens of thousands of spam comments. It has missed about ten spam comments (i.e. allowed them through). It has misidentified two legitimate comments as spam. Yes, I realise I'm keeping an eye on it, and someone who doesn't may not notice that it's causing problems for them. But the stats are pretty good in my case. I'm aware of the allegations of corruption and using it to gag people, but that hasn't affected me yet.
4. Re:Second that! by sfbanutt · 2008-12-28 15:49 · Score: 4, Informative
  
  I just noticed a handy Akismet stats link in the latest version. I've been running Akismet since October 2006. In that time there have been 26,575 comments on my blog, of which 26,302 were spam(!). It missed 25 spam comments that had to be manually moderated and passed 273 legit comments. There have been no false positives. Personally, I think that's a pretty darn good record.
  
  --
  I've wrestled with reality for 35 years and I'm happy to say, I finally won out - Elwood P. Dowd
DIY or it will be broken by loony · 2008-12-28 10:09 · Score: 5, Interesting

Any method you use can be broken. Your only chance is to reduce the likelihood that your site is worth the effort.
Basically, if you use a common solution, whether FOSS or commercial, then there will be a thousand other sites that use it too. This attracts attackers because they know that once they break it, they can reuse the attack everywhere.
However, if you handcode something, no matter how primitive, it likely lasts a lot longer because nobody bothers hacking into your site...
Of course that doesn't work if you have a large site like myspace - there, a single site is worth the effort by itself.
Anyway, two things usually work: a really fast-moving animated GIF, and silly tasks where you ask people to identify items.
I help out with a site that randomly takes five pictures of cats and dogs and asks you to identify which of the images contains the highest number of kittens... We barely ever get spam through, and with almost 20K attempted submissions by non-humans a day, that makes us pretty happy.
Peter.
1. Re:DIY or it will be broken by dattaway · 2008-12-28 10:59 · Score: 4, Interesting
  
  However, if you handcode something, no matter how primitive, it likely lasts a lot longer because nobody bothers hacking into your site...
  Simply renaming the .php files worked 100% for me.
2. Re:DIY or it will be broken by KermodeBear · 2008-12-28 15:12 · Score: 4, Informative
  
  I have a very simple, small site that I run that allows small comments. It was fine until the spam bots found it. Anyways, I just added a simple question about the background color of the site, which must be correct in order for the comment to be posted. I haven't had a single issue since (except for the occasional troll, but what can you do about that).
  The nice thing about something like this, a handmade thing, is that the spammers won't bother 'breaking' it. As the parent mentions, the spammers are attacking the common solutions - so a little home grown bit will work wonders.
  
  --
  Love sees no species.
4 Tests Stopped 30,000 Comments For Me by WebmasterNeal · 2008-12-28 10:11 · Score: 5, Interesting

I have a series of 4 tests to block spam on my website. So far it has stopped over 30,000 attempts in the last year.

Test one is, does the last name = the first name. For some reason almost all spammers do this.

Second, do they use a keyword from a list of about 15 words.

Third, do they fill out a hidden inputbox? This is sort of the reverse captcha.

Finally do they use more than 4 "http" in a post. Almost all comment spam is an SEO effort to increase their pagerank.
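The four tests above can be sketched in a few lines of Python. This is only an illustration, not the poster's actual code: the keyword list, the field names, and the function name `looks_like_spam` are all made up here, and the post does not say which 15 words it uses.

```python
import re

# Hypothetical keyword list; the post only says "about 15 words".
SPAM_KEYWORDS = {"viagra", "casino", "mortgage"}
MAX_LINKS = 4  # the post rejects comments with more than 4 "http"


def looks_like_spam(first_name, last_name, body, hidden_field):
    """Return the name of the first failed test, or None if all four pass."""
    # Test 1: first name and last name are identical (ignoring case).
    if first_name.strip().lower() == last_name.strip().lower():
        return "same_name"
    # Test 2: a blacklisted keyword appears as a whole word in the body.
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    if words & SPAM_KEYWORDS:
        return "keyword"
    # Test 3: the hidden input (the "reverse captcha") must stay empty.
    if hidden_field.strip():
        return "honeypot"
    # Test 4: more than MAX_LINKS occurrences of "http" in the body.
    if body.lower().count("http") > MAX_LINKS:
        return "too_many_links"
    return None
```

Returning the name of the failed test rather than a bare boolean makes it easy to log which rule is doing the work. Note that two empty name fields also count as "same name" in this sketch.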

--
"During My Service In The United States Congress, I Took The Initiative In Creating The Internet." -Al Gore
1. Re:4 Tests Stopped 30,000 Comments For Me by Magic5Ball · 2008-12-28 13:02 · Score: 5, Interesting
  
  Background: One of my sites is a custom job which kills a spam comment every 3 seconds or so, and has done so consistently for the past four years.
  OP's suggestions are very good, especially limiting the number of 'http's. We've given up on the keyword lists since they are costly to maintain and aren't as effective as some other methods.
  Currently, the most effective kill rules for us are:
  1) We write the client's IP address, the ID of the thing being commented on, and random stuff to a cookie from the legitimate page from which the client clicked the "post reply" link. If the IP address doesn't match, or if the ID is missing, or if the parameter for the random junk isn't in the cookie, then fail. This rule traps non-browser scripts and limits spam throughput, but does not affect humans.
  2) The client's IP address is a hidden form variable. If that IP address does not match the IP from which the POST originates, fail. This rule traps the browser-based scripts, and operators who proxy through botnets for testing.
  These two rules catch all but about two spam-like messages a month (spam operator not using proxies to test their scripts), and have mislabeled two legitimate messages (from a local ISP's poorly-configured proxy) in the last three years.
  There are other things at play, such as salted hashes of the above, and some other heuristics on hidden and unused fields which sort and categorise the spam for our own research (including point of origin, topic, etc.). One finding is that IP/geographic blacklists are ineffective. I'll post new findings and methods in another two years.
  I'm also evil in that the apparent failure modes are non-deterministic, and include such things as random HTTP response codes, random modes of connection failure, and spam messages that apparently go through, but are only visible for the IP that posted them, or for one minute after they are posted.
  Your move, "RosarioRush".
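A rough sketch of the idea behind rules 1 and 2, combined with the salted hashes mentioned above: bind the client IP, the item ID, and some random junk into one signed token when the page is served, then check it when the POST arrives. This is an assumption about how such a scheme could look, not the poster's implementation; `SECRET_KEY`, `issue_token`, and `verify_token` are hypothetical names, and the token could live in a cookie or a hidden field.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"change-me"  # hypothetical server-side secret, never sent to clients


def _sign(payload):
    """HMAC-SHA256 of the payload, keyed with the server secret."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(client_ip, item_id):
    """Build a token when the page with the reply form is served."""
    nonce = secrets.token_hex(8)  # the "random stuff"
    payload = f"{client_ip}|{item_id}|{nonce}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token, post_ip, item_id):
    """Check the token that comes back with the POST."""
    parts = token.split("|")
    if len(parts) != 4:
        return False
    ip, token_item, nonce, sig = parts
    expected = _sign(f"{ip}|{token_item}|{nonce}")
    # Constant-time comparison; compare bytes so odd input cannot raise.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    # The IP that posts must be the IP that loaded the form, for the same item.
    return ip == post_ip and token_item == str(item_id)
```

A script that posts directly without loading the form has no valid token, and a token harvested through one proxy fails when replayed from another address. This sketch assumes item IDs never contain the `|` separator.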
  
  --
  There are 1.1... kinds of people.
"I am a robot" field by casualsax3 · 2008-12-28 10:23 · Score: 4, Informative

The ZSNES boards employ a neat trick: http://board.zsnes.com/phpBB2/profile.php?mode=register&agreed=true
It's got a field that says "I am a robot" checked off by default. A human should obviously see that and uncheck it. Those registrations that come in with it checked are blackholed. It's definitely cut down on the SPAM accounts since they enabled it.
1. Re:"I am a robot" field by slimjim8094 · 2008-12-28 10:41 · Score: 4, Funny
  
  Great idea, and it has the side-effect of keeping idiots out too :)
  
  --
  I have developed a truly marvelous proof of this comment, which this signature is too narrow to contain.
2. Re:"I am a robot" field by Anonymous Coward · 2008-12-28 12:46 · Score: 5, Funny
  
  And the robots. Here I am, brain the size of a planet, and I keep getting banned from forums. *sigh*
Hidden Input Box by waldoj · 2008-12-28 10:31 · Score: 5, Informative

Third, do they fill out a hidden inputbox? This is sort of the reverse captcha.
This is really a very good test. As others have mentioned in this thread, it's the sort of thing that spammers will circumvent if it becomes widespread, but for now it's great.
There's something else I've found to be really quite effective: deliberately misnaming my form fields. For instance, give the input field that's labelled "First Name" an input name of "phone number." Humans don't use input names to determine what text to enter, but spambots do. Then check that input: if the first name field contains a phone number, you know you've got yourself a spammer.
I've used solely the combination of these two things to run one of my websites for two years now, and I get a vanishingly small amount of spam.
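For anyone who wants to try the two checks, here is a minimal server-side sketch. The field names `phone_number` and `website` and the phone-number pattern are assumptions made for the example; the real names on the site are not given.

```python
import re

# Seven or more characters drawn only from digits and phone punctuation.
PHONE_LIKE = re.compile(r"^[\d\s()+.\-]{7,}$")


def is_bot_submission(form):
    """form is a dict of submitted field names to string values."""
    # Hypothetical hidden input: humans never see it, so it must be empty.
    if form.get("website", "").strip():
        return True
    # The visible "First Name" box is deliberately named "phone_number".
    # A bot that trusts the input name fills in something phone-like.
    first_name = form.get("phone_number", "").strip()
    return bool(PHONE_LIKE.match(first_name))
```

A real person typing "Alice" into the box labelled First Name passes, while a bot that submits "555-123-4567" for a field called `phone_number` gets flagged.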
Message board spam. by JWSmythe · 2008-12-28 10:32 · Score: 4, Informative

I had a similar problem in the comments area of my site. It was all fun and games, until one day I checked, and there were something like 1000 spams for every real message.
I wrote my own system to deal with it. It's not very hard, assuming you know how your site works (of course you do, right?)
I ended up making two blacklists. One was for words and phrases. The spammers tend to post (and repost, and repost) the same crap. My blacklist rules had some simple regular expressions that I could run queries with. Like, "%http://%spamsite%" and "%v%gra%". You get the idea. The second list was IPs that were known spammers.
At the time, I allowed both anonymous comments, and comments from logged in users. I eventually did away with the anonymous comments, as they were a headache. This was the best cure.
So, when my script ran (once a minute), if it matched a message, it would delete the message, and append the IP to the IP blacklist. If it was posted by a user account, the user account got suspended, so they could no longer log in, nor post.
After its detection and cleanup run, it then ran back over the IP list and pruned out every post by that IP. Sometimes they'll do practice runs saying silly things like "nice site". I thought they were real user compliments at first, until I saw the same posting verbatim coming from the same IP to multiple news stories, and then that IP would start spamming later.
Some people will argue that the IP cleanup run was not nice, polite, or even fair. People use proxies. Sure, they do. We got a lot of abuse from anonymous proxies, and no real messages from them. The spammers didn't seem to like to use AOL.
When I implemented this, I posted a very brief description of what I was starting ("We're starting advanced anti-spam protection"), with an apology for real messages that were deleted. I never received one complaint about real comments disappearing.
How brutally you do it is really up to you. I built my method by manually doing it for a while, and then letting the script do it on its own. Occasionally, I would have to go in and add new words and/or site names to the words blacklist.
I noticed the spammers hit more common software more often. It's worth it for them to make automated systems to abuse a piece of software that's deployed on tens of thousands of sites. When I rewrote my site from scratch, abuse dropped to 0 for a long time. Now, they manually submit "news" items which are just ads for their own sites. It appears to be manual, and since we won't run them as news stories (our editorial staff decides what does or doesn't show up as news, and if it needs to be edited first), they give up pretty quickly.
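The two-pass cleanup described above might look roughly like this in Python. It is a sketch under assumptions: comments are plain dicts with `ip` and `body` keys, the `%` patterns are treated as SQL LIKE wildcards and translated to regexes, and account suspension is left out. The helper names are made up.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern (only % wildcards) into a regex."""
    parts = [re.escape(p) for p in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE | re.DOTALL)


# The two example patterns from the post.
WORD_BLACKLIST = [like_to_regex(p) for p in ("%http://%spamsite%", "%v%gra%")]


def cleanup(comments, ip_blacklist):
    """Return the comments to keep; adds offending IPs to ip_blacklist."""
    # Pass 1: any comment matching a pattern gets its IP blacklisted.
    for c in comments:
        if any(rx.match(c["body"]) for rx in WORD_BLACKLIST):
            ip_blacklist.add(c["ip"])
    # Pass 2: prune every comment from a blacklisted IP, including
    # innocent-looking practice runs like "nice site".
    return [c for c in comments if c["ip"] not in ip_blacklist]
```

Be aware that a loose pattern like "%v%gra%" also matches harmless text such as "very grateful", which is one reason the word list needs manual tending.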

--
Serious? Seriousness is well above my pay grade.
My 3 tests also work by lalena · 2008-12-28 11:02 · Score: 5, Interesting

I have implemented something similar, but I haven't been checking the number of blocked messages. All I know is that I used to get spam, and now I haven't gotten any for years. I use this for forums and the Contact Us page.

My rules are:
1) The text boxes for things like name and subject are actually called junk.
2) There are hidden textboxes called name and subject (one hidden by JavaScript and one by CSS); if they are populated, the post is ignored.
3) A third hidden field is the result of a simple javascript math equation that is checked on the server side. If the value is wrong, the post is thrown out.

As others have said, if your site is small these types of things are good enough to prevent spam because the spammers won't bother to figure it out. These concepts would never work for any of the larger sites or 3rd party forum software.
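The server side of the three rules above could be sketched like this. The junk field names (`junk1`, `junk2`), the check field `js_check`, and the arithmetic itself are placeholders invented for the example; the page's JavaScript is assumed to compute the same expression and write it into the hidden field.

```python
# Hypothetical expression; the page's JavaScript must compute the same value.
EXPECTED_MATH = 7 * 6 + 3


def accept_post(form):
    """Return the cleaned fields, or None if the post should be thrown out."""
    # Rule 2: the hidden boxes called "name" and "subject" are traps.
    if form.get("name", "").strip() or form.get("subject", "").strip():
        return None
    # Rule 3: a client that did not run the JavaScript has the wrong value.
    if form.get("js_check", "") != str(EXPECTED_MATH):
        return None
    # Rule 1: the real values arrive under meaningless names.
    return {"name": form.get("junk1", ""), "subject": form.get("junk2", "")}
```

A bot that fills in every field it finds trips rule 2, and one that posts without executing scripts trips rule 3, so a legitimate browser is the only client that passes both.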
1. Re:My 3 tests also work by lalena · 2008-12-28 11:05 · Score: 4, Informative
  
  As a follow up to myself, I didn't come up with these ideas on my own. I read them on Slashdot a couple of years ago.
Re:gmail by siyavash · 2008-12-28 12:12 · Score: 5, Insightful

"Do not allow registrations with gmail.com email addresses"
That is one of the most stupid things I heard this year.
Bad Idea by erlehmann · 2008-12-28 12:15 · Score: 4, Insightful

As someone who once used text browsers, I can only advise everyone not to do this - it breaks accessibility at a fundamental level: I got banned from a forum once because they mislabeled fields.
What does work really well for comment spam, however, is a simple question like "What is the name of Barack Obama?".
Re:Better than Askimet? by ceejayoz · 2008-12-28 12:43 · Score: 4, Informative

I seem to get Mollom captchas on every site that uses it. My IP, user agent, etc. are almost completely static. My comments are grammatically correct, never spammy, etc.
If their system hasn't identified me as safe by now, there's something wrong.
In contrast, to my knowledge Akismet has never flagged me. My comments go straight up on blogs using it. On my personal site, I've had maybe 10 false positives out of several thousand caught.
Mollom, IMO, has a long way to go.
Re:Better than Askimet? by darkpixel2k · 2008-12-28 15:15 · Score: 4, Insightful

read my sig
That'll work, right up until the spam bots are told to ignore spampoison.com, or the person who is running the spam bots decides to put spampoison.com into his hosts file and point it to 127.0.0.1.

Lame solution.

--
There's no place like ::1 (I've completed my transition to IPv6)