Smart Spam Filtering For Forums and Blogs?
phorm writes "While filtering for spam on email and other related media seems to be fairly productive, there is a growing issue with spam on forums, message boards, blogs, and other such sites. In many cases, sites use prevention methods such as captchas or question-answer values to try to restrict input to human-only visitors. However, even with such safeguards, and especially with most forms of captcha being cracked fairly often these days, it seems that spammers are becoming an increasing nuisance in this regard. While searching for plugins or extensions to SpamAssassin etc., I have had little luck finding anything not tied into the email framework. Google searches for PHP-based spam filtering tend to come up with mostly commercial and/or more email-related filters. Does anyone know of a good system for filtering spam in general messages? Preferably such a system would be FOSS, and something with a daemon component (accessible by port or socket) to offer quick response times."
Akismet
Or am I misunderstanding what FOSS really is about?
Yea, design an email system that is immune to spam and make the ISPs responsible for blocking spam, phishing and such attacks ..
So, I know people sometimes DRTFA. It happens. Life is busy. But, you know, it's always good to RTFS because it has fancy little tidbits of information such as:
While searching for plugins or extensions to spamassassin etc I have had little luck finding anything not tied into the email framework.
Akismet is the best thing for blog spam prevention ever. I can't believe you've never stumbled across it before. It uses statistical analysis to identify spam, and the more people use it, the better it gets. If everyone used it, the blog spammers would just disappear because their attacks would be completely ineffective.
mollom
I discovered this one through Drupal. I thought it was completely free, but apparently it isn't for high-traffic sites.
I think all your user-generated content is sent to them and checked for spamminess against the other submissions they are receiving, and they give you back a rating.
Any method you use can be broken. Your only chance is to reduce the likelihood that your site is worth the effort.
Basically, if you use a common solution, whether FOSS or commercial, then there will be a thousand other sites that use it too. This attracts attackers because they know that once they hack it, they can reuse the hack everywhere.
However, if you handcode something, no matter how primitive, it likely lasts a lot longer because nobody bothers hacking into your site...
Of course that doesn't work if you have a large site like myspace - there, a single site is worth the effort by itself.
Anyway, there are two things that usually work: a really fast-moving animated GIF, and silly challenges where you ask people to identify items.
I help out with a site that randomly picks five pictures of cats and dogs and asks you to identify which image contains the most kittens... We barely ever get spam through, and with almost 20K attempted submissions by non-humans a day, that makes us pretty happy.
Peter.
I have a series of 4 tests to block spam on my website. So far it has stopped over 30,000 attempts in the last year.
Test one is, does the last name = the first name. For some reason almost all spammers do this.
Second, do they use a keyword from a list of about 15 words.
Third, do they fill out a hidden inputbox? This is sort of the reverse captcha.
Finally do they use more than 4 "http" in a post. Almost all comment spam is an SEO effort to increase their pagerank.
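For anyone who wants to try the four tests above, here is a minimal Python sketch. The field names (first_name, last_name, body, website_hidden) and the keyword list are assumptions made up for illustration; the post does not give its actual list of roughly 15 words.

```python
import re

# Hypothetical keyword list; the post mentions about 15 words but does not list them.
SPAM_KEYWORDS = {"viagra", "casino", "payday"}
MAX_HTTP = 4  # more than this many "http" occurrences counts as spam


def is_spam(form):
    """Return True if a submitted form (dict of field name -> text) fails any of the four tests."""
    first = form.get("first_name", "").strip().lower()
    last = form.get("last_name", "").strip().lower()
    body = form.get("body", "").lower()

    # Test 1: first name identical to last name.
    if first and first == last:
        return True
    # Test 2: any blacklisted keyword appears as a whole word in the body.
    words = set(re.findall(r"[a-z0-9']+", body))
    if words & SPAM_KEYWORDS:
        return True
    # Test 3: the hidden input (invisible to humans) was filled in.
    if form.get("website_hidden", "").strip():
        return True
    # Test 4: more than four "http" occurrences in the post.
    if body.count("http") > MAX_HTTP:
        return True
    return False
```

The tests are ordered cheapest first, and any single failure is enough to reject the submission.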
"During My Service In The United States Congress, I Took The Initiative In Creating The Internet." -Al Gore
Project Honeypot's HTTPBL has been good to me:
See: www.projecthoneypot.org/httpbl.php
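As far as I understand the http:BL documentation, a lookup is an ordinary DNS query for your access key, followed by the visitor's IP with its octets reversed, under dnsbl.httpbl.org. The answer is an address of the form 127.days.threat.type, where type is a bitmask. The sketch below only builds the query name and decodes an answer; the access key shown in the usage note is a placeholder, and the bit meanings should be checked against the current API page before relying on them.

```python
def httpbl_query_name(access_key, ip):
    """Build the DNS name to look up for an IPv4 visitor address."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address")
    # Octets are reversed, as with other DNS blacklists.
    return ".".join([access_key] + octets[::-1] + ["dnsbl.httpbl.org"])


def parse_httpbl_answer(answer):
    """Decode an answer such as '127.3.25.5' into its fields."""
    first, days, threat, visitor_type = (int(part) for part in answer.split("."))
    if first != 127:
        raise ValueError("not an http:BL answer")
    return {
        "days_since_last_activity": days,
        "threat_score": threat,
        # Assumed bit meanings: 1 = suspicious, 2 = harvester, 4 = comment spammer.
        "suspicious": bool(visitor_type & 1),
        "harvester": bool(visitor_type & 2),
        "comment_spammer": bool(visitor_type & 4),
    }
```

In use, you would pass httpbl_query_name("abcdefghijkl", visitor_ip) to socket.gethostbyname() and treat a failed lookup as "not listed".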
The fastest way is probably to just slow down user registration. Permit anonymous posting, but make it moderated/screened by default (i.e. not visible to other users until the forum owner flags it as OK). When a user goes to register (so that their posts become visible immediately), do not send the confirmation e-mail right away. Batch your confirmations up and send them out twice a day at odd times (i.e. not midnight and noon, but something like 3:47 am and 3:47 pm). You could do it 4 times a day, but not much more often than that, since the idea is to introduce a delay in the registration process. Make sure to tell the user on the registration screen what sort of time frame to expect for the confirmation.
Ordinary users who plan on using the forum long-term won't be inconvenienced much by this. Spammers... won't tolerate the delay; they want to get their message in fast and get out. With their automated scripts they might not even notice that things are failing. Also, don't include a direct confirmation link in the e-mail. Include a URL to a form and make the user copy and paste the confirmation number from the e-mail. That'll be trivial for humans, but not easy for an automated script to handle without human assistance.
None of that will stop a determined spammer, but most of them are more interested in volume than anything else and they won't bother spending time/effort on just one forum when they could hit 10 others instead.
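The batching idea could be sketched roughly like this in Python. The two send times come from the example above; the function names and the 8-character code format are just assumptions for illustration.

```python
import secrets
from datetime import datetime, time, timedelta

# Odd send times taken from the example above; must be listed in ascending order.
BATCH_TIMES = (time(3, 47), time(15, 47))


def next_batch_time(now):
    """Return the next scheduled send time strictly after `now`."""
    for batch in BATCH_TIMES:
        candidate = datetime.combine(now.date(), batch)
        if candidate > now:
            return candidate
    # Past the last batch of the day: use the first batch tomorrow.
    return datetime.combine(now.date() + timedelta(days=1), BATCH_TIMES[0])


def new_confirmation_code():
    """Random code the user must copy and paste into the confirmation form."""
    return secrets.token_hex(4)  # 8 hex characters
```

The registration page can show next_batch_time(datetime.now()) so that users know roughly when to expect the e-mail.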
There is a well-working semi-dynamic plugin for WordPress. It has served me well. It is called YAWASP and you can find it here: http://wordpress.org/extend/plugins/yawasp/. The author also describes the common problems and shortfalls of traditional captcha-like methods.
how IT is changing the world - http://max.zamorsky.name
It's got a field that says "I am a robot" checked off by default. A human should obviously see that and uncheck it. Those registrations that come in with it checked are blackholed. It's definitely cut down on the spam accounts since they enabled it.
Arguably, it is Mollom. Especially if you are using Drupal.
Akismet is 'rotting on the vine' in many ways, including development updates. Mollom is a commercial web service, with a free version for non-profit and small-volume sites/users.
The Drupal module is explained here:
http://drupal.org/project/mollom
The Mollom site:
http://mollom.com/
"Flyin' in just a sweet place,
Never been known to fail..."
This is really a very good test. As others have mentioned in this thread, it's the sort of thing that spammers will circumvent if it becomes widespread, but for now it's great.
There's something else I've found to be really quite effective: deliberately misnaming my form fields. For instance, give the input field that's labelled "First Name" an input name of "phone number." Humans don't use input names to determine what text to enter, but spambots do. Then check that input: if the first name field contains a phone number, you know you've got yourself a spammer.
I've used solely the combination of these two things to run one of my websites for two years now, and I get a vanishingly small amount of spam.
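The misnamed-field check could look something like this on the server side. This is only a sketch: the input name phone_number and the "at least 7 digits" rule are assumptions, not anything taken from a real site.

```python
import re

# Loose idea of what a phone number looks like: digits plus common separators.
PHONE_LIKE = re.compile(r"^\+?[\d\s().-]{7,}$")


def looks_like_bot(form):
    """The input shown to humans as "First Name" is submitted under the name phone_number."""
    value = form.get("phone_number", "").strip()
    digit_count = sum(ch.isdigit() for ch in value)
    # A human types a name here; a bot reading the input name types a phone number.
    return bool(PHONE_LIKE.match(value)) and digit_count >= 7
```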
I had a similar problem in the comments area of my site. It was all fun and games, until one day I checked, and there were something like 1000 spams for every real message.
I wrote my own system to deal with it. It's not very hard, assuming you know how your site works (of course you do, right?)
I ended up making two blacklists. One was for words and phrases. The spammers tend to post (and repost, and repost) the same crap. My blacklist rules had some simple wildcard patterns that I could run queries with, like "%http://%spamsite%" and "%v%gra%". You get the idea. The second list was IPs that were known spammers.
At the time, I allowed both anonymous comments, and comments from logged in users. I eventually did away with the anonymous comments, as they were a headache. This was the best cure.
So, when my script ran (once a minute), if it matched a message, it would delete the message, and append the IP to the IP blacklist. If it was posted by a user account, the user account got suspended, so they could no longer log in, nor post.
After its detection and cleanup run, it then ran back over the IP list and pruned out every post by that IP. Sometimes they'll do practice runs saying silly things like "nice site". I thought they were real user compliments at first, until I saw the same posting verbatim coming from the same IP to multiple news stories, and then that IP would start spamming later.
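A rough Python sketch of that two-pass cleanup, for illustration only. It assumes the two example rules are SQL LIKE-style wildcards (% matches anything) and that posts are plain dicts with "ip" and "text" keys; suspending user accounts is left out.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern, where % matches any run of characters, into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE | re.DOTALL)


# The two example rules from the post.
WORD_RULES = [like_to_regex(p) for p in ("%http://%spamsite%", "%v%gra%")]


def cleanup(posts, ip_blacklist):
    """Return the surviving posts; ip_blacklist (a set) is updated in place."""
    # Pass 1: a post matching any rule puts its IP on the blacklist.
    for post in posts:
        if any(rule.match(post["text"]) for rule in WORD_RULES):
            ip_blacklist.add(post["ip"])
    # Pass 2: prune every post from a blacklisted IP, including earlier
    # innocent-looking "practice run" posts.
    return [post for post in posts if post["ip"] not in ip_blacklist]
```

Note that a rule as loose as "%v%gra%" also matches harmless text such as "very grateful", so real rules would need to be tighter than this example.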
Some people will argue that the IP cleanup run was not nice, polite, or even fair. People use proxies. Sure, they do. We got a lot of abuse from anonymous proxies, and no real messages from them. The spammers didn't seem to like to use AOL.
When I implemented this, I posted a very brief description of what I was starting ("We're starting advanced anti-spam protection"), with an apology for real messages that were deleted. I never received one complaint about real comments disappearing.
How brutally you do it is really up to you. I built my method by manually doing it for a while, and then letting the script do it on its own. Occasionally, I would have to go in and add new words and/or site names to the words blacklist.
I noticed that spammers hit the more common software more often. It's worth it for them to make automated systems to abuse a piece of software that's deployed on tens of thousands of sites. When I rewrote my site from scratch, abuse dropped to 0 for a long time. Now, they manually submit "news" items which are just ads for their own sites. It appears to be manual, and since we won't run them as news stories (our editorial staff decides what does or doesn't show up as news, and whether it needs to be edited first), they give up pretty quickly.
Serious? Seriousness is well above my pay grade.
I have implemented something similar, but I haven't been checking the number of blocked messages. All I know is that I used to get spam, and now I haven't gotten any for years. I use this for forums and the Contact Us page.
My rules are:
1) The text boxes for things like name and subject are actually called junk.
2) There are hidden text boxes called name and subject (one hidden by JavaScript and one by CSS); if they are populated, the post is ignored.
3) A third hidden field holds the result of a simple JavaScript math equation, which is checked on the server side. If the value is wrong, the post is thrown out.
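The server-side half of these rules might look roughly like this. The field name math_result and the way the expected answer reaches the function are assumptions for the sketch; the real form may differ.

```python
def accept_post(form, expected_answer):
    """Apply the hidden-field rules to a submitted form (dict of input name -> text)."""
    # Rule 2: "name" and "subject" are hidden traps; humans never see or fill them.
    if form.get("name") or form.get("subject"):
        return False
    # Rule 3: the hidden math field is filled in by JavaScript in the browser.
    # A bot that does not run scripts sends a missing or wrong value.
    if form.get("math_result") != str(expected_answer):
        return False
    return True
```

The real name and subject arrive under the junk-named inputs (rule 1), so they are read from those fields once the post is accepted.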
As others have said, if your site is small these types of things are good enough to prevent spam because the spammers won't bother to figure it out. These concepts would never work for any of the larger sites or 3rd party forum software.
The comment- and trackback-spam blocking techniques in Pivot blogging software are, from my limited personal experience, 100% effective. There's even an extension that uses the enormous Project Honeypot database (http:BL) to weed out IP addresses of identified harvesters and comment spammers. That's just for entertainment, though, since the basic techniques are completely effective.
"Do not allow registrations with gmail.com email addresses"
That is one of the most stupid things I heard this year.
As someone who once used text browsers, I can only advise everyone not to do this - it breaks accessibility at a fundamental level: I got banned from a forum once because they mislabeled fields.
What does work really well for comment spam, however, is a simple question like "What is the name of Barack Obama?"
I rarely see spam here...or is it just quickly modded down to oblivion?
This is not exactly a new proposal, and it has been shot down on Slashdot before. One major problem is that a lot of spam is through botnets and the spammers would not get charged the e-mail fees, people with zombied computers would. I suppose this would make people with zombied computers notice, but why would they agree to sign up for such a service in the first place? Also, tying e-mail to payment means that the payment is probably traceable to a real person, which a lot of people do not want.
Centralization breaks the internet.
... 90% of all spam would be eliminated.
One major problem is that a lot of spam is through botnets and the spammers would not get charged the e-mail fees, people with zombied computers would.
That's a non-issue.
Want to block a ton of spam? Reject any inbound SMTP connections that have no reverse DNS record, then use regular expressions on those that do to refuse connections from dynamic/home/dsl/dial_up/etc. hosts. (I tried to post the regexes, but Slashdot whined about "Lameness filter encountered. Post aborted!")
Stop talking to dynamic IPs and about 90% of the world's spam will immediately vanish.
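Since the regexes never made it into the post, here is a sketch of the general approach in Python with a made-up pattern. The pattern below is purely illustrative and is not the poster's; a real mail server would normally do this in its own configuration rather than in a script.

```python
import re
import socket

# Illustrative pattern only; the poster's actual regexes were not posted.
DYNAMIC_HOST = re.compile(
    r"(^|[.\-])(dyn(amic)?|dsl|adsl|dial[-_]?up|pool|dhcp|cable)([.\-\d]|$)",
    re.IGNORECASE,
)


def default_resolver(ip):
    """Return the reverse DNS (PTR) hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def should_reject(ip, resolve=default_resolver):
    """Reject hosts with no reverse DNS, or whose hostname looks dynamically assigned."""
    hostname = resolve(ip)
    if hostname is None:
        return True
    return bool(DYNAMIC_HOST.search(hostname))
```

The resolve parameter exists so that the check can be tried out with a fake lookup function instead of real DNS.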
I forgot to mention these 2 plugins:
SABRE: against spam registrations on your blog (http://wordpress.org/extend/plugins/sabre)
and
Simple Trackback Validation: a trackback validation tool for WordPress (http://wordpress.org/extend/plugins/simple-trackback-validation/).
It's no less trivial than getting a Hotmail account, a Yahoo! account, or any of the many thousands of free webmail providers out there.
Even so, I suspect that the majority of casual Internet users today actually have that sort of email account, based on personal experience. If you start blocking them, you're blocking most legit users, too. Unless it's a technical forum - and even in this case it's silly to block GMail, as many techies use that.
Why? I for one don't have one - I use my GMail one everywhere - and I contribute to a lot of forums.
Translate, please. Explicitly email what where, and how is that going to help?