How to Get Rid of Referrer Spam?
wikinerd asks: "I have recently opened my own community website. Everything was fine until spammers found it, which happened quite quickly. As usual they filled up my mailboxes, but SpamAssassin can take care of that when it is needed. Then, they discovered my blog and my wikis and employed their bots to fill them up with spam comments. I solved this problem by moderating all comments. Now, however, they have employed another evil trick: referrer spam. They caused my webserver statistics to grow by orders of magnitude by making their stupid websites show up on my referrer lists. Unfortunately, my webserver usage statistics are now full of viagra, poker, casino, porn, spyware, and pharmacy sites. I am afraid that this is a problem I cannot solve with the knowledge and the tools I have at the moment. So, I came here to ask Slashdot readers: How can I fight referrer spam, and what tools are available in a GNU/Linux environment to ensure clean and spam-free usage statistics?"
I'll assume you're using Apache and have access to the .conf, or know someone who does.
First, you need to set up the log you'll use for statistics to exclude requests marked with a "badreferer" environment variable.
CustomLog logs/access_log-www.example.com combined env=!badreferer
The following requires Apache's SetEnvIf module. You can put these lines in .conf, or even in .htaccess so you can change them without a restart. If you don't have/want SetEnvIf, you can also use mod_rewrite (E=badreferer:1 at the end of your RewriteRule) to do the same thing.
#Blacklist (adjust as you need)
SetEnvIfNoCase Referer ".*(credit|hold-em|holdem|mortgage|money|cash|gb.com|4free|teen|pussy|discount|inkjet|fuck|hasfun|casino|gambling|poker|porn|sex|paris|nude|xxx|hilton|adminshop|devaddict|iaea|peng|just-deals|pisx|tecrep-inc|learnhow|phentermine|terashells|psxtreme|freakycheats).*" badreferer
#Whitelist (optional)
SetEnvIfNoCase Referer ".*(google|yahoo|alltheweb|search|excite|aol.com|lycos|msn|altavista|XXXX).*" !badreferer
Additionally, you can use the same blocks to deny them access to your site:
<Limit GET HEAD POST>
Order Allow,Deny
Allow from All
Deny from env=badreferer
</Limit>
<LimitExcept GET HEAD POST>
Order Deny,Allow
Deny from All
</LimitExcept>
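If you can't touch the Apache config at all, the same blacklist/whitelist idea can be applied to the log file before it reaches your stats program. What follows is only a rough sketch, assuming the "combined" log format used above; the shortened keyword lists and the names is_bad_referer and clean_log are made up for illustration.

```python
import re

# Hypothetical, shortened keyword lists; adjust them the same way as the
# SetEnvIfNoCase patterns above.
BAD_REFERER = re.compile(r"(credit|hold-em|holdem|mortgage|casino|poker|phentermine)",
                         re.IGNORECASE)
GOOD_REFERER = re.compile(r"(google|yahoo|altavista)", re.IGNORECASE)

# In the "combined" format the last two quoted fields are referrer and user agent.
COMBINED_TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def is_bad_referer(referer):
    """Return True if the referrer hits the blacklist and not the whitelist."""
    if GOOD_REFERER.search(referer):
        return False  # the whitelist wins, like the "!badreferer" line
    return bool(BAD_REFERER.search(referer))


def clean_log(lines):
    """Yield only the log lines whose referrer is not flagged."""
    for line in lines:
        match = COMBINED_TAIL.search(line)
        referer = match.group("referer") if match else ""
        if not is_bad_referer(referer):
            yield line
```

You would feed the output of clean_log to webalizer or awstats instead of the raw log. Unlike the Deny rules, this does nothing about the bandwidth the spammers use up.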
Could Wikinerd or Cliff post an example of how these appear in Wikinerd's blog? I have a guestbook myself that gets filled with things that say "great site" from some dumb address like cara@aol.com, and then it is filled with a bunch of keyword HTML links to randomly-generated .info sites (5544f45.info, etc) that all go to one of those useless spammy search engines.
Don't blame Durga. I voted for Centauri.
I hope I'm not being too rude, but seriously, I googled for referrer spam and bam... the first result had some decent advice. This was just the first thing that came up. Add the word "apache" to your query and you will get some very helpful results. Besides, this is Slashdot... not a trove of reliable information/advice. Just start using Apache to block the Mallorys. Also, if you're still posting any kind of statistics or referrers publicly, stop. Spammers wouldn't do this if bloggers didn't publish that kind of abusable data.
-Turkey
You could write a module that would check entries from your referrer log.
The best way to check if it's spam would be with a Bayesian filter.
Sure, it will take some coding and training the filter, but this seems to me like the best option.
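To give an idea of how little code that takes, here is a minimal sketch of a naive Bayes scorer for referrer URLs. It is not a tuned filter: the tokenizer, the add-one smoothing and the class name ReferrerFilter are all assumptions for illustration, and you would have to train it on your own logs.

```python
import math
import re
from collections import Counter


def tokens(referer):
    """Split a referrer URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", referer.lower())


class ReferrerFilter:
    """Tiny naive Bayes classifier over referrer tokens (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, referer, label):
        """Record one referrer as 'spam' or 'ham'."""
        self.counts[label].update(tokens(referer))
        self.docs[label] += 1

    def spam_score(self, referer):
        """Log-odds of spam versus ham, with add-one smoothing."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            for tok in tokens(referer):
                prob = (self.counts[label][tok] + 1) / (total + vocab)
                score += sign * math.log(prob)
        return score

    def is_spam(self, referer):
        return self.spam_score(referer) > 0
```

Train it with a few hundred referrers you have sorted by hand, then run every new referrer through is_spam before it gets into the stats.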
--> Insert Funny Sig Here
Take off and nuke 'em from orbit.
Just to be sure.
How am I supposed to fit a pithy, relevant quote into 120 characters?
At least for WordPress. It's called Spam Karma. I'm lazy, Google for it.
If Spam Karma finds questionable words in comments -- it's configurable, and it comes with a good default list -- it sends users to a captcha. If they fail at the captcha -- and they're not on a strongbad keyword list like "viagra" and "vegas poker" -- the comments are sent for moderation.
Works great for me. Nope, the URL in my profile is not my blog anymore. It's on my own server, it's in Portuguese and I ain't gonna expose my server to a slashdotting.
I just password protect the directory with the server stats.
There is a patch you can apply, available here that will prevent referer spam from showing up in reports.
Obviously you should not be publishing referrers unless you have a way to filter them (see other comments), but since you *are* getting spammed, you could take a moment out to fight back a bit; e.g., you can run up the spammer's bandwidth charges.
http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
per googleblog:
Q: How does a link change?
A: Any link that a user can create on your site automatically gets a new "nofollow" attribute. So if a blog spammer previously added a comment like
Visit my <a href="http://www.example.com/">discount pharmaceuticals</a> site.
That comment would be transformed to
Visit my <a href="http://www.example.com/" rel="nofollow">discount pharmaceuticals</a> site.
--
Just add this for all anon or unapproved links... and make a note on your page so spammers know not to bother.
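If your blog software doesn't do it for you, the transformation is easy enough to script. This is a quick regex-based sketch, not a real HTML parser, and the function name add_nofollow is made up; it replaces any rel attribute the commenter supplied with rel="nofollow".

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute (quoted or unquoted) so it can be dropped.
REL_ATTR = re.compile(r"""\s+rel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(html):
    """Force rel="nofollow" onto every <a> tag in user-submitted HTML."""
    def fix(match):
        attrs = REL_ATTR.sub("", match.group(1))
        return "<a" + attrs + ' rel="nofollow">'
    return A_TAG.sub(fix, html)
```

Run it over comment bodies (or referrer links) on the way out to the page, not on the stored text, so you can change your mind later.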
At my own homepage (codesweep.com):
A) The code for it is homemade, would be a pain in the butt to re-tool a bot for little old me vs. all the livejournal, blogger, etc. sites out there...
B) I'm so insignificant out there with such low traffic the spammers probably wouldn't care anyways
C) If the spammers do start caring, I can code my blog around them to defeat them. So far it hasn't created a problem, but the stronger the problem the stronger my response will be...
...in bed
Comment spam can be easily stopped by requiring a password. You can even publish the password right on the website so humans see it and bots don't. I did it for Movable Type and it was pretty easy. As for referrer spam, it seems to me that the only way it is fruitful is if your log files are publicly visible and parsed by Google (etc.), unless I don't understand referrer spam. So why not just remove all links to your logfiles, add a robots.txt file, and maybe even password protect the place where your logfiles are stored? I would assume that a referrer spambot wouldn't even try to target your page unless it knew your referrer logs were linked off your page...
I've taken to filtering my e-mail with whois and by protocol deviations. I can see how I could be wrong, but I'm guessing that the same approach can be thrown at the referrer spammers, on the assumption that:
1> The headers their clients send are different than those of ordinary clients.
2> That the properties revealed by whois are different for refer spammer clients than for ordinary clients.
3> That the whois properties for the spam refer sites are different than those of legitimate sites.
I'll bet that ignoring input from/referring to China and Korea is a good start, and that the bogus sites will tend to cluster in identifiable networks.
If you protect your stats with Apache (or whatever) authentication, then robots can't find your stats via Google or other search engines, and they will probably stop spamming you. I find that every time I unprotect the stats for openphoto.net I get referrer spammed to death.
$0.02,
_Michael.
Captcha any referral that's not white-listed.
Captcha access to the referral log.
I added /stats/ to my robots.txt.
The stats pages no longer show up on any search engines, so a) The spammers get no 'pagerank' from those links (which is what they do it for) and b) they can't find the stats pages.
I was getting shitloads of referer spam; within a week (as soon as google updated) it dropped to nothing. I've had no referer spam AT ALL since then.
Perhaps they'll start just crawling the entire web, but it appears that at the moment they do a google search to find pages that post their referer stats.
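If you want to check that the rule does what you think before waiting a week for Google, Python's standard robotparser can read the file for you. A small sketch, assuming the two-line robots.txt below is what you actually serve; the helper name is_blocked is made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of robots.txt: keep all well-behaved bots out of /stats/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
"""


def is_blocked(path, agent="Googlebot"):
    """Return True if a rule-following bot may not fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(agent, path)
```

Keep in mind this only covers bots that honor robots.txt; it is no substitute for a password on the stats directory.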
455fe10422ca29c4933f95052b792ab2
Our initial attempt to solve this was to complain to the ISP of the referrer spammers. That did no good. The ISP was willing to listen, but not to act.
We did manage to actually track down the jerks who were doing the referrer spam. They told us that they were attempting to create links back to their sites for better search engine placement.
Our workaround was twofold. For various reasons we wanted to keep our webalizer stats externally accessible. So we asked bots (the ones that follow the rules, at least) not to index our external stats, and we modified webalizer to not form links back to the referrers.
We edited our robots.txt file to exclude legit bots from our stats:
We also patched webalizer v2.01-10 to no longer form URLs to referrers. Now only a plain text line without the leading http:// shows up in the table. The original referrer spammers gave up when they lost the links back to their sites.
The bottom of the 0.basic.patch prevents webalizer from forming links back to referrers. See README-FIRST for details on this patch set.
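For anyone generating their own referrer tables rather than running webalizer, the same idea can be sketched in a few lines. This is not the patch itself, only an illustration of the effect it describes; the function name plain_referrer is hypothetical.

```python
import html
import re

# Matches a leading URL scheme such as http:// or https://.
SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def plain_referrer(referrer):
    """Return the referrer as escaped plain text with no scheme and no link."""
    text = SCHEME.sub("", referrer.strip())
    # Escape HTML so a crafted referrer cannot inject its own markup.
    return html.escape(text)
```

Print the result into the table cell as-is instead of wrapping it in an anchor tag, and the spammer gets no link out of it.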
chongo (was here)
My first suggestion would be to stop publishing the referrer links.
But if you have to, then put "rel=nofollow" in the link itself. This makes Google (and other search engines) discard the link when calculating search rankings.
Go here for more info.
It was originally intended for comment spam, but just add the same rel="nofollow" to your referrer lists. Read about it. Granted, this won't prevent it, but if everyone starts doing this, this technique will become useless for spammers.
Be relentless!
I installed mod_security on my server a few weeks ago with a few simple regexes to cover the more prolific referrer spammers recorded by awstats. Set the mod_security default action to deny,status:412. Then in httpd.conf I set the ErrorDocument for the 412 code to an empty file.
Now when the referer spammer hits my site, they get denied and get nothing back. Bandwidth wasted serving up pages to referer spammers is cut to virtually nil. The spammers are still there banging away and a few still get by though. The list of referrers needs to be monitored so that new mod_security rules can be added as required. That's no different than using mod_rewrite to deny the referrer spammers though.
"For I am a Bear of Very Little Brain, and Long Words Bother Me"
I believe the problem with spam lies with the stupid lusers that actually click on the links and purchase stuff from them. Let's take a look at some of the latest spam...
Porn: anybody that wants good porn knows to look at p2p solutions (just look in the right spots, it's all there for free)
viagra, etc: if you don't know that it doesn't work, you're an idiot
free stuff: nothing in life is free
special service: there are always strings attached
correct your account information: if you get your identity "stolen" in a scam, you don't even belong using a computer in the first place. Perhaps also get rid of your credit cards, because they might be "stolen" when you write down your card number and PIN and leave it at an internet cafe for a bunch of geeks; basically the same outcome.
Now that we've classified 75% of all spam, let's move on.
There are several ways to solve the problem in weblogs. The main ones include using a combination of the AHBL from sosdg (list of proxies, iirc) and the logging of IPs from comments. This way anybody who uses a known proxy won't be able to post, and you can ban IPs that post annoying comments anyway. This can help a lot.
- The next step is to reformat all links to include the rel="nofollow" thing mentioned above.
- Use apache2 and linux for hosting your site, (a tad offtopic) this will just keep you more secure in general (NO TROLLING WARS PLEASE!)
- Go after the source: help the sosdg (http://www.sosdg.org) by giving them some computer resources or whatever else they could use to track down open proxies, known spammers, etc. and help take them down!
The sosdg took some of the biggest spammers in Spain down by blocking them until their ISPs folded and got rid of the spammers. Surprisingly enough, the sosdg and their blacklists have spoiled the riches of many spammers, both by emails, comment stoppers, etc.
- Use one of those Python scripts so that each time a comment is to be left, the person has to type in the numbers and letters shown in the image.
Probably the best method is to use a combination of all of these. I hope this helps
If you're going to quote a source, quote it correctly. "I say we take off and nuke the entire site from orbit. It's the only way to be sure." It's also nice to reference the original source, in this case Ripley, from the movie "Aliens", and the web source you used to verify it, in this case http://en.wikiquote.org/wiki/Aliens
I wrote a PHP script that only offers an email address or allows form submission if the client can answer a simple question correctly. Seems to work well, except once about a year ago when someone was still trying to enter 'Clinton' as the name of the president. Here's an example