Stopping Spambots: A Spambot Trap
Neil Gunton writes "Having been hit by a load of spambots on my community site, I decided to write a Spambot Trap which uses Linux, Apache, mod_perl, MySQL, ipchains and Embperl to quickly block spambots that fall into the trap. "
FS !!
Looking at my Day Job and personal web site, other than the very cool technical achievement of the trap (I'll have to see if I can rewrite this for my Checkpoint FW system), there was one thing I learned about good design from this article:
Eliminate mailto - makes sense. You should have an http based "send me a message system" - force a live person to type stuff in instead of letting a program pick out addresses.
Eliminating mailto alone would probably help with most of my spam problems (as I have my "contact me" address right on the first page).
52 Weeks, 52 Religions with John Hummel
Looks like you should've written some code to handle an overload from slashdot too!
"I have a truly marvelous demonstration of this proposition which this bandwidth is too narrow to transmit."
www.timcoleman.com is a total waste of your time. Never go there.
You can stop a SpamBot, but can you stop a /.'ing?
My mother always used to tell me: If you can't find anything nice to say, say something bad about Windows.
Well, his idea of removing "mailto:"s is an obvious one...
I dunno, most of this stuff sounds like common sense work for someone who's got a well-trafficked web site. The badhosts_loop looks like an interesting addition, though...
On the surface, it almost looks like this system could be built up to act like a SPEWS for web servers.
Aww, FSCK!
Come to the University of Mars! Classes starting soon!
Looks fine in Mozilla 0.9.9, too...
ok then your [sic] infringing on my copyright! Could you as [sic] me next time before STEALING my comments for your own?
Why on Earth would you like to block a spambot? So it doesn't get any more useful addresses? /give/ it a next page. With a nicely formatted word1word2num1num2@word1word2.com, where words and nums are random.
No way, man.
If you realize you're serving to a bot, go on serving. Each time the bot follows the "next page" link, you
Give it thousands, millions of addresses this way.
hmm, just a wild guess, but does this technique involve using the http-referrer to see if there are too many clients coming from just a particular address (which would obviously be a *bad* thingy), and subsequently block them too?
might explain why we can't see it no more
I want it too!!! it seems to work pretty good!
As it turns out, I really haven't received that much mail to this address. About the only mail I've ever received to it is someone from trafficmagnet.net, who tells me that I'm not listed on a few search engines and that I can pay them to have my site listed. I need to send her a nasty reply saying that I don't care about being listed on Bob's Pay-Per-Click Search Engine, and that if she had actually read the page, she would have noticed that she was sending mail to an invalid address. Besides, the web server is for my inline skate club and we don't have a $10/month budget to pay for search engine placement.
I think I've received more spam from my Usenet posting history, from my other web site, and from my WHOIS registrations than I've received from the skate club web site.
From the website:
The Problem: Spambots Ate My Website
s/Spambots/Slashdot/
Hold back the excitement, people, it's another episode of story recycling.
This site is pretty handy, now that I'm on the topic. Also make sure to check out RobotCop. Out for Apache now, coming soon for IIS and Zeus!
My PHP spider-trap - See an infinity of email addresses and links in action!
The only problem with the idea of using entirely http based "send me a message systems" is that some people, like myself, would much rather have an actual email address to use instead of having to use 50 different layouts and 50 different configurations and 50 different methods of communicating with someone or a company. Every html based contact system has its own quirks and problems; I'd rather just need to learn my email program's issues instead.
Removing mailto: links is a bad solution to the problem. It might be the only solution, but it is bad.
I hate the editor in my web browser. No spell check (and a quick read of this message will prove how disastrous that is for me), no good editing ability, and other problems. By contrast my email client has an excellent editor, and a spell checker. Let me pull up a real mail client when I want to send email, please!
In addition, I want people to contact me, and not everyone is computer literate. I hang out in antique iron groups, I expect people there to be up on the latest in hot tube ignition technology, not computer technology. To many of them computers are just a tool, and they don't have time to learn all the tricks to make it work, they just learn enough to make it do what they want, and then ignore the rest. Clicking on a mailto: link is easy and does the right thing. Opening up a mail client, and typing in some address is error prone at best.
Removing mailto: links might be the only solution, but I hope not. So I make sure to regularly use spamcop.
And install it in Hawaii. Those Somoans even eat that sh*t at nice restaurants!
Yuck
Eww
QUICHE!?
Cookbook!?
My $0.02 will always be worth more than your €0.02, so
Fine in Konqueror. Why, is it somehow broken in other browsers?
loply.com
At first glance this might be a good idea, but it will be a resource burden on your system.
Not a good way to stop spammers.
------
Return the bells of Balangiga
Return the bells of Balangiga.
Those other browsers must suck
This isn't such a good idea - for every random (non-existent) domain that you generate, a root DNS server will be queried when an email is sent to this address, which increases the load on the root servers, which is generally a bad thing. How about instead, returning pages with the email address abuse@domain-that-spambot-is-coming-from all over them...
After the Battle Creek incident with ORBZ, the maintainer changed the way it worked. Instead of proactively checking for open relays, he now runs a 'honeypot'-like system: a unique email address that isn't directly visible on the site but may still be harvested by a spambot. Any server that sends email to that address is automatically added to The List. Mail server admins who believe they should not be on this list can argue their case to have their server removed.
"Pinky, you've left the lens cap of your mind on again." - P&TB
"I can see my house from here!" - ST:
Whats wrong with MySQL? It does everything the website claims it does.
loply.com
This gives me an idea for a spam version of a roach motel (Spam gets in, but it never gets out).
I wonder what it would take to create an open relay server that would fool spammers into using it.
Ideas would be welcome. This could be just the revenge I've been looking for!!!
Sig: "That's not a duck!"
But, if you send him the message once with your return address, he'll know you're for real and when he replies you can use your regular mailer.
$0.02USD,
-l
Help cure AIDS, cancer, and more. Donate your unused computer time to worldcommunitygrid.org. Join Team Slashdot!
Superior Labs spambot_trap mirror
-Spack
I think the spammer will filter abuse@ ...
Here's a tip for those of you writing spambot traps... How about not blindly responding to the faked Return-Path address?
Now that should be illegal. You people whine about your 10 spams a day, try 10,000 from 2000 different email addresses. Idiot postmasters should be caught and jailed.
formmail itself (even the most recent version) can still be abused by spammers to use your webserver as a bulk mail relay - see the advisory at:
http://www.monkeys.com/anti-spam/formmail-advis
It's a shame he didn't suggest the more robust formmail replacement at nms which is maintained, and attempts to close all the known bugs and insecurities.
If he really wants to make the thing run faster, turn those varchars into regular chars. And index index index!
Give a man a match, you keep him warm for an evening.
Light him on fire, he's warm for the rest of his life
Add a couple of sleep(20); into the cgi script that generates the bot fodder. The bot will still stay busy waiting for your webserver's response, but your script will exactly consume zero resources.
For additional kicks, set up a DNS teergrube.
Say no to software patents.
Wow, this guy slashdotted himself..
Stopping Spambots: A Spambot Trap
Using Linux, Apache, mod_perl, Perl, MySQL, ipchains and Embperl
Copyright 2002 by Neil Gunton
This document describes my experiences with spambots on my websites, and the techniques I have developed to stop them dead. I assume the reader has basic familiarity with Linux, Apache, mod_perl, Perl, MySQL and firewall rules using ipchains - each of these topics could fill a book, so I won't talk about installation or basic configuration. I will, however, provide full scripts and instructions on using these within the context of these tools. If you'd like some basic pointers on getting set up using these tools, then you could take a look at my short series of three Linux Network Howto articles.
The Problem: Spambots Ate My Website
I have a website, http://www.crazyguyonabike.com, which has bicycle tour journals, message boards and guestbooks. I started noticing around the end of 2001 that the site was getting hit a lot by spambots. You can spot this sort of activity by looking for very rapid surfing, strange request patterns, and non-browser User-Agents.
Another distinctive behavior was that the spambots would follow only those links which had certain keywords which would seem promising if you're looking for email addresses: "guestbook", "journal", "message", "post" and so on. On each of the pages in my site there were many other links in the navbars, but only links with these keywords were being followed. Also, robots.txt was never even being read, let alone followed. Moreover, the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests.
This was very annoying. For one thing, these things were picking off email addresses from my website (at that point, I was letting people who posted on my message boards decide for themselves whether they wanted their email addresses to be visible or not). But quite apart from that, it was taking up resources, and was just plain rude. I hate spam. I resent my webserver having to play host to people whose obvious goal is to cynically exploit the co-operative protocols of the internet to their own selfish, antisocial gain. So, I decided to do something about it.
The first thing I did was to look at the User-Agent fields which were being used by the bots. There were a variety, including variations on the following:
I searched the internet for references to these strings, but all I found was a slew of website statistics analysis logs. This meant that these particular spambots obviously got around. It was also discouraging, because there was no mention anywhere of what these things actually were. I was surprised that there seemed to be no discussion whatsoever of something that seemed to be pandemic. Then I found a couple of other websites with guestbooks that had actually been defiled by these spambots: (if you follow these links and you don't see a lot of empty messages left by the above user agents, then that means the webmaster of the site has finally found a way to stop it, so good for them...)
I reckon the spambots didn't really intend to leave empty messages. They just tend to want to follow links with the keyword 'post'. So if the guestbook posting form has no preview or confirmation page, then the spambot would leave a message simply by following this link! My guestbooks and message boards have a preview page, which is probably why I hadn't had any of this.
Anyway, I started thinking about what kind of program this thing was. First of all, it comes from all kinds of different IP addresses. I couldn't quite believe that this many different IP addresses were all intentionally using the same software, of which I could find absolutely no mention anywhere on the Web. This made me think it might be some kind of virus/trojan/worm or whatever that silently installed itself on people's computers, and then used the CPU and bandwidth to surf the Web without the owner being aware of it. I thought that if this was the case, then it must be sending the results somewhere - and if we could find out where, then we could go about shutting the operation down.
But I have had no luck at all in getting any help from the sysadmins at the ISPs I have contacted. A typical exchange was the one with a guy at Cox internet, which was where a persistent offending IP address was sourced. He just couldn't be bothered, and eventually told me that spidering was not against the law, or their terms of service. I asked whether actions which were blatantly and obviously geared toward the generation of spam were against their terms of use, but he never replied to that. I had no more luck anywhere else: nobody had heard of this thing. I even sent an email to CERT, but got no response. So, I turned instead to thinking about how I could erase these pests from my life as much as possible. This document is about my quest to stop spambots (not just this one, but ALL spambots) from abusing my website. Hopefully it will be useful to you.
Overview of the Spambot Trap
There are three main parts to the technique which I outline here:
There are various components to the Spambot Trap, including the badhosts_loop Perl script, the BlockAgent.pm module, ipchains config, MySQL database, httpd.conf, robots.txt, and your HTML files. These are all covered in the sections below.
Banishing 'mailto:'
The first and most urgent thing you need to do is to get email addresses off your website altogether. This means, unfortunately, banishing the venerable mailto: link. It's a real shame that perfectly good mechanisms should be removed because of abuse, but that's just the way the world is these days. You need to be defensive, and assume that the spammers will try to take advantage of your resources as much as possible.
It's an arms race
The important thing that you need to realize is that no matter what blocks we put in place, this game is an arms race. Eventually the spambot writers will develop smarter bots which circumvent our techniques. Therefore you want to have a failsafe, which will prevent email addresses from getting into the hands of the spambot even if all else fails. The only real way to do that is to completely remove all email addresses from your website.
Contact forms
You should replace the mailto: links with links to a special form where people can type their name, email address and message. A CGI can then deliver the email, and your email address never has to be disclosed. There are a number of different mailer scripts out there - just be careful to check for vulnerabilities which could allow malicious users to use the form to send email to third parties (i.e. spam, ironically enough) using your server. The formmail script is popular, but an earlier version had such a vulnerability (since fixed). The Embperl package has a simple MailFormTo command to send an email from a form.
Since I have seen guestbooks out there which have been extensively defiled by spambots, I would add that you should have a preview screen on your contact forms. This will ensure that an email doesn't get fired off simply by a spambot following the 'post' or 'contact' link (which it will likely try to do).
Alternatives to totally banishing mailto:
There are alternatives to completely removing email addresses, but they all depend on the stupidity of the spambot, and so could be compromised by a new generation of pest. These include:
MySQL
Download badhosts MySQL database dump
We need to set up a MySQL database, where we store records of the hosts which are to be blocked. This doesn't have to be MySQL, but I use it because it's extremely fast, and very appropriate for this kind of application. You need to create a new database, called 'badhosts'. You then create a table, again called 'badhosts', with the following structure:
Field        Type                            Comment
ip_address   varchar(20) not null, indexed   The IP address of the host to be blocked
user_agent   varchar(255) not null           The HTTP User-Agent of the spambot, for reference
expire_days  int unsigned not null           How many days is this block for. Doubled every time a new block has to be created for a particular IP address
created      datetime not null               When this block was created
expiry       datetime not null, indexed      When this block expires
You could use the dump provided above to load directly into your database:
shell> mysqladmin create badhosts
shell> mysql badhosts < badhosts.dump
That's about it! The fields which are marked as 'indexed' are the only ones which need indexes, because they are searched on to see if a particular IP address has been previously blocked, and also to see which blocks should be removed because they've expired. If you have access privileges set on your MySQL databases, then you need to allow the Apache user (usually 'nobody') access. The other script that will require access is badhosts_loop, which runs as root.
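To make the table layout and the two indexed lookups concrete, here is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for MySQL. The in-memory database, the index names, the sample row and the helper names active_blocks and previously_blocked are all assumptions for illustration; the dump mentioned above remains the authoritative schema.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL 'badhosts' database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE badhosts (
    ip_address  VARCHAR(20)  NOT NULL,
    user_agent  VARCHAR(255) NOT NULL,
    expire_days INTEGER      NOT NULL,
    created     DATETIME     NOT NULL,
    expiry      DATETIME     NOT NULL
);
-- Only the two searched columns get indexes (index names are hypothetical).
CREATE INDEX idx_badhosts_ip     ON badhosts (ip_address);
CREATE INDEX idx_badhosts_expiry ON badhosts (expiry);
""")

# A made-up row, using a documentation-only IP address.
conn.execute(
    "INSERT INTO badhosts VALUES (?, ?, ?, ?, ?)",
    ("192.0.2.10", "ExampleBot/1.0", 1,
     "2002-04-11 11:16:11", "2002-04-12 11:16:11"),
)


def active_blocks(now):
    """Return the IP addresses whose block has not yet expired at `now`."""
    rows = conn.execute(
        "SELECT ip_address FROM badhosts WHERE expiry > ? ORDER BY ip_address",
        (now,),
    )
    return [row[0] for row in rows]


def previously_blocked(ip):
    """Return True if this IP address has ever been blocked."""
    row = conn.execute(
        "SELECT COUNT(*) FROM badhosts WHERE ip_address = ?", (ip,)
    ).fetchone()
    return row[0] > 0
```

Both helpers filter on exactly the columns marked 'indexed', which is why those are the only two that need an index.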
Next, we look at the script that populates this database.
BlockAgent.pm
Download BlockAgent.pm
Download bad_agents.txt
The BlockAgent.pm Apache/mod_perl module is taken from the excellent book "Writing Apache Modules with Perl and C" by Lincoln Stein & Doug MacEachern (O'Reilly). This script basically acts as an Apache authentication module which checks the HTTP User-Agent header against a list of known bad agents. If there's a match, then a 403 'Forbidden' code is returned. The script compiles and caches a list of subroutines for doing the matches, and automatically detects when the 'bad_agents.txt' file has changed. I have found that it has no noticeable impact on the performance of the webserver. This script is useful in the case where you know for certain that a certain User-Agent is bad; there's no point in letting it go anywhere on your site, so it's a good first line of defense. We'll cover how to add this module to your website a little later, along with the rest of the configuration settings in the section on httpd.conf.
Of course, one of the first arguments you'll see with regard to this method of blocking spambots is that it's easy to circumvent, by simply passing in a User-Agent string which is identical to the major browsers out there. This is perfectly true, but don't ask me why the spambot writers haven't done this - maybe it's a question of pride or ego, they want to see their baby out there on record in Web server logs. I honestly don't know. The main point is that at present, the User-Agent header CAN be used very effectively to block most bad agents. But, I have added more features so that we can also block agents which look ok, but behave badly by going somewhere they shouldn't - the Spambot Trap. More on that soon.
You'll notice that the bad_agents.txt file which I have supplied here is very comprehensive. A good strategy here is probably to save the full version somewhere (perhaps as bad_agents.txt.all), and just keep the ones you actually encounter in the bad_agents.txt file. Then you keep the list shorter, and more relevant to what actually hits you. For example, my bad_agents.txt file currently has the following lines in it, because these are the spambots that I see most frequently:
You'll notice from this that BlockAgent.pm is very flexible, being able to take full advantage of the excellent regular expression capabilities of Perl. This means you can capture a lot of different agents with just one line. For example, the very first line catches all the variations of the agent which passes in random strings of capital letters, e.g. FHASFJDDJKHG or UYTWHJVJ. The spambot obviously thinks it's being pretty smart by looking different each time, but by using an easily identifiable pattern, it shoots itself in the foot. Hah.
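The matching idea can be sketched in a few lines. This is Python rather than the Perl of BlockAgent.pm, and it is only an illustration: the pattern list is hypothetical (the all-capitals pattern is my guess at how the random-letters agent described above could be caught, and ExampleHarvester is a made-up name), and status_for simply returns 403 or 200 instead of real Apache handler codes.

```python
import re

# Hypothetical patterns; the real list lives in bad_agents.txt.
BAD_AGENT_PATTERNS = [
    r"^[A-Z]+$",          # agents that send nothing but a run of capital letters
    r"ExampleHarvester",  # made-up name, not taken from the author's file
]

# Compile once and reuse, in the spirit of the module's cached matchers.
COMPILED = [re.compile(pattern) for pattern in BAD_AGENT_PATTERNS]


def status_for(user_agent):
    """Return 403 if the User-Agent matches a bad pattern, else 200.

    200 here just stands for 'let the request through'.
    """
    for pattern in COMPILED:
        if pattern.search(user_agent):
            return 403
    return 200
```

One regular expression covers every variation of the random-capitals agent, which is the point of keeping the list as patterns rather than literal strings.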
The original version of the BlockAgent.pm script is well explained in the O'Reilly book, but I've added an extra hook that checks to see whether the client is accessing any of the spambot trap directories. If it is, then we add an entry to the MySQL database (you could use another relational database if you want, as long as it's accessible from Perl DBI).
The first time an IP address is blocked, an expiry of one day is set. If the same host subsequently comes in and falls into the trap again, then the expiry time is doubled. And so on. This way, the block gets longer and longer, in proportion to how persistently the spambot revisits our website. Once the IP address is blocked, the spambot can't even connect to our web server, since we use 'Deny' in the ipchains rule. This means that no acknowledgement is given to any packets coming in from the badhost, and as far as they know, our server has just gone away. Hopefully, after this happens for long enough, our server will be taken off the spambot's "visit" list. Another nice little side-effect of this is that the spambot will probably have to wait for a while before giving up each connection attempt. Anything that makes them waste more time is ok by me!
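The doubling rule is simple enough to show as arithmetic. The following is a minimal Python sketch, not the code from BlockAgent.pm; the function name next_block and the use of None for a first offence are assumptions.

```python
from datetime import datetime, timedelta


def next_block(previous_expire_days, now):
    """Compute (expire_days, expiry) for a new block.

    previous_expire_days is None for a first offence, otherwise the
    expire_days value of the host's most recent block.
    """
    if previous_expire_days is None:
        expire_days = 1                         # first block: one day
    else:
        expire_days = previous_expire_days * 2  # repeat offender: double it
    return expire_days, now + timedelta(days=expire_days)
```

So a host caught repeatedly is shut out for 1, 2, 4, 8, ... days, and a persistent spambot quickly ends up blocked for months.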
BlockAgent.pm notifies the badhosts_loop script that something has happened by touching a file called /tmp/badhosts.new. The badhosts_loop file checks this file every few seconds and if it has changed then it knows that a new record's been added to the database, and it needs to re-generate the blocks list.
The BlockAgent.pm script is our alarm system. It's what tells us that something happened. In order to act on this information, we need to be able to add rules to the ipchains firewall. We'll cover this next.
ipchains
Download sample ipchains config file
The ipchains module (here's the HOWTO doc) is a very nice way of providing a good level of basic network security to your server. If you haven't already set it up (or its successor, iptables), then you really should. It's a very easy way to configure who can and cannot have access to your machine. A good resource for learning about this is "Building Linux and OpenBSD Firewalls", by Wes Sonnenreich and Tom Yates (Wiley). This is where I learned about ipchains, and it's on their excellent explanations and examples that I based my own config file. Another is "Linux Firewalls" by Ziegler (New Riders), which seems to have a more recent 2nd edition that covers iptables too.
The example ipchains config file given here is complete, but the bit which is most important to us is that we create a chain called 'blocks'. This is our own custom chain, which we can then add rules to. The badhosts_loop script will flush this chain and build it back up whenever a spambot falls in your trap. Once the spambot's IP address is on the blocks list, that host cannot connect to your server at all.
Remember to restart ipchains after you've changed the config file. Next, we'll look at the script that actually adds the firewall rules.
badhosts_loop
Download badhosts_loop script
You run this script in the background, as root. It has to be run as root, because only root has the ability to add rules to the firewall. The script spends most of its time sleeping. It wakes up every five seconds or so and does a quick check on /tmp/badhosts.new. If this file has been changed since the last time it looked, then it goes and re-generates the firewall blocks list with all the current (non-expired) blocks. If nothing else happens, then the script will automatically do this at least once a day, to ensure that blocks really do expire even if there is no new activity.
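The wake-up check can be sketched as follows. This is Python rather than the Perl of badhosts_loop, and the names flag_changed, poll_once and the state dictionary are assumptions; the real script would call something like poll_once every five seconds, with a regenerate callback that rebuilds the firewall chain.

```python
import os


def flag_changed(path, last_mtime):
    """Return (changed, mtime) for the flag file at `path`.

    A missing file counts as unchanged.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return False, last_mtime
    return mtime != last_mtime, mtime


def poll_once(path, state, regenerate, now, max_age=24 * 60 * 60):
    """Run one wake-up: regenerate if the flag file changed or a day has passed.

    `state` holds 'mtime' (last seen modification time, or None) and
    'last_run' (time of the last regeneration). Returns True if
    `regenerate` was called.
    """
    changed, state["mtime"] = flag_changed(path, state["mtime"])
    if changed or now - state["last_run"] >= max_age:
        regenerate()
        state["last_run"] = now
        return True
    return False
```

The max_age branch is what guarantees that blocks expire even when no new spambot has touched the flag file.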
You should probably add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically started up on reboot:
This will start the script looping in the background. The script automatically checks to see if it is already running, by attempting to lock /var/lock/badhosts_loop.lock. If the file is already locked then the script will exit with an error message. If you want to just run the script once, without looping, then just omit the '--loop' option. This can be useful for testing.
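For readers unfamiliar with lock-file guards, here is a minimal Python sketch of the idea. It assumes an flock-style advisory lock on a Unix system; I am not claiming this is the exact mechanism badhosts_loop uses, and acquire_lock is a hypothetical name.

```python
import fcntl


def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file object (keep it alive for as long as the lock
    is needed) or None if the file is already locked by another holder.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A second copy of the script would get None back and could exit with an error message. The lock disappears automatically when the first process closes the file or dies, so a crash never leaves a stale lock behind.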
Logging is done to /var/log/badhosts_loop.log by default. Every time the script generates the blocks list, it writes a list of all the blocks to the log. This is a good place to monitor if you're interested in what hosts are being blocked. Here's an example of the log output:
EDITOR: SNIPPED
Thu Apr 11 16:09:07 2002: Flushing blocks chain: Generating blocks list:
Adding 63.148.99.247 (1) 2002-04-11 11:16:11 to 2002-04-12 11:16:11 Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
The log shows the IP address which is being added, then (in brackets) the number of days the block is effective for (doubling each time), then the start and end dates of this block, and finally the name of the User-Agent which committed the crime. This can be useful for quickly seeing whether you need to add a new one to the bad_agents.txt file.
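If you want to post-process the log, the 'Adding' lines are regular enough to parse. Below is a small Python sketch based only on the sample line shown above; the regular expression and the name parse_block_line are my own, and other line types in the real log are simply ignored.

```python
import re

# Matches lines of the form:
#   Adding <ip> (<days>) <start datetime> to <end datetime> <user agent>
LINE = re.compile(
    r"^Adding (?P<ip>\S+) \((?P<days>\d+)\) "
    r"(?P<start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) to "
    r"(?P<end>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<agent>.*)$"
)


def parse_block_line(line):
    """Return a dict of fields for an 'Adding ...' line, or None otherwise."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["days"] = int(fields["days"])
    return fields
```

Collecting the agent field across many lines is a quick way to spot candidates for bad_agents.txt.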
This is a pretty stable script that should just sit there and chug quietly, not taking up much in the way of resources. Checking for a file being changed every five seconds is not a big deal in Unix, so you shouldn't even notice it.
Now you have to create the trap itself - the spambot_trap directory.
spambot_trap/ Directory
Download gzipped tarball of sample spambot_trap directory
View the sample directory
You can create this directory anywhere on your server. We will create an alias in httpd.conf to access it. I put mine in /www/spambot_trap/. The point is, this doesn't have to be a real directory under your webserver directory root. If you use the Alias directive, then multiple websites can access the same spambot_trap directory, potentially through different aliases. You can use the sample tarball as a starting point; it has subdirectories and links which the spambots I have seen find irresistible. You should create your own image file for the unblock_email.gif file, to have a valid email address of your own.
The spambot_trap and spambot_trap/guestbook/ directories are not used directly to spring the trap. This is because I wanted to have a warning level, a lead-in, where real users would be able to realize they are getting into dangerous waters and could then back out. You're going to be placing hard-to-click links on your web pages which lead into the real trap, and there's always a chance that a real user will accidentally click on one of these. So, some of the links will point into the warning level. I have made a GIF image which contains a warning text. Why an image? Mainly because spambots can't understand images, and I didn't want to give big clues like "WARNING!!! DO NOT ENTER" in plain text. So, the user sees the warning, the spambots don't. If the spambot proceeds into any of the subdirectories (email, contact, post, message), then the trap is sprung and the host is blocked.
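The two-level layout boils down to a simple decision per request path. Here is a hedged Python sketch of that decision, not the check from BlockAgent.pm: it assumes the trap is reached through the '/squirrel' alias used later in this document, and the function name classify is hypothetical.

```python
TRAP_PREFIX = "/squirrel"  # the alias is an assumption; pick your own
SPRING_WORDS = {"email", "contact", "post", "message"}


def classify(path):
    """Return 'outside', 'warning' or 'trap' for a request path."""
    if path != TRAP_PREFIX and not path.startswith(TRAP_PREFIX + "/"):
        return "outside"
    # Split the remainder of the path into directory names.
    parts = [part for part in path[len(TRAP_PREFIX):].split("/") if part]
    if any(part in SPRING_WORDS for part in parts):
        return "trap"     # one of the bait subdirectories: block the host
    return "warning"      # lead-in level: show the warning image only
```

For example, /squirrel/guestbook/ only reaches the warning level, while /squirrel/guestbook/post/ springs the trap.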
You also need to try to stop good spiders (e.g. google) from falling into the spambot trap and being blocked. To do this, we utilize the robots.txt file.
robots.txt
Download sample robots.txt
This should allow good robots (such as google) to surf your site without falling into the spambot trap. Most bad spambots don't even check the robots.txt file, so this is mainly for protection of the good bots.
You'll see that we list a bunch of directories under '/squirrel'. This could be anything; you'll set an alias later in httpd.conf. In fact, you may even want this to be dynamically generated (see later, under Embperl), so that you can quickly change the name of the spambot trap directory if the spambots adapt and start avoiding it. At present, a static setup should work just fine, however.
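You can check how a well-behaved robot would interpret such a file with Python's standard urllib.robotparser module. The robots.txt content below is a simplified assumption (one rule covering the whole '/squirrel' tree rather than the individual directories in the sample file), and ExampleGoodBot is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for the sample robots.txt (assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /squirrel/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleGoodBot"):
    """Return True if a robots.txt-abiding robot may fetch `path`."""
    return parser.can_fetch(agent, path)
```

A polite spider sees that everything under the trap alias is off limits and never gets blocked, while the rest of the site stays crawlable.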
Next, we need to look at the bait - links within your HTML files which lead the spambot into the trap.
Your HTML Files
Download sample HTML code
Download sample transparent 1 pixel image for hiding the trap
Here's an example of HTML with links into the spambot trap:
<HTML>
<BODY BGCOLOR="beige">
<A HREF="/squirrel/guestbook/message/"></A> <A HREF="/squirrel/guestbook/post/"><IMG SRC="/guestbook.gif" WIDTH=1 HEIGHT=1 BORDER=0></A>
Body of the page here
<TABLE WIDTH=100%> <TR>
<TD ALIGN=RIGHT> <A HREF="/squirrel/guestbook/"> <SMALL><FONT COLOR="beige">guestbook</FONT></SMALL> </A></TD>
</TR>
</TABLE>
</BODY>
</HTML>
Spambots tend to be stupid. You'd think they would check for empty links (which don't show up in a real browser), but they don't seem to. Sure, they may get smarter, but in the meantime you might as well pick the low-hanging fruit. So, the very first thing in the body of your HTML should be an empty link which goes straight into the trap proper - not the warning level, but the actual trap itself. This is because there is no way for someone using a real browser to click on this link, and good spiders will ignore it anyway because it's in the robots.txt file.
We also use a one pixel big transparent GIF (a favorite web bug technique) to anchor a link to the trap, just in case the spambot is smart enough to avoid empty links. If we put this as the very first thing in the body, then it'll be pretty hard for a real user to click on, since it's only one pixel in size. But a spambot will quite happily go there!
Finally, there is an example of a non-graphic, text based link. This will be placed on the right side of the screen by the table, and the text will appear in the same color as the background (in this example, beige). The link does not go straight into the trap, but into the warning level, because with this one there is a bigger chance that real people could click on it accidentally. The link may be invisible, but it's still there, and someone could find it. So, they get to see a nice warning, and they should back off from there. But the spambot won't. By the way, we have the link going to /squirrel/guestbook/ rather than just /squirrel/ because some of the spambots seem to specifically follow links with certain keywords, e.g. 'guestbook', 'message', 'post', etc.
You can sprinkle these links all around your HTML files. I put them in every single one, since I use Embperl templates which make that sort of thing very easy.
Embperl
Download sample dynamic robots.txt using Embperl
Download sample dynamic HTML code using Embperl
The point of this is to make it easier to change the spambot trap directory without having to edit a whole bunch of files. We pass an environment variable to Perl from httpd.conf (see below), which says what the trap directory is called. We then use this in Embperl to substitute into the HTML and robots.txt files at request time. Thus if we wanted to change the name of the trap from 'squirrel' to 'badger', then we only need to change httpd.conf, restart apache, and we're done. All the links in the HTML are dynamic, as is robots.txt (see the samples above).
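The substitution itself is nothing more than string templating driven by one variable. Here is a rough Python sketch of the idea (the real thing is done with Embperl at request time); the function name trap_links, the 'squirrel' fallback and the exact robots.txt and link text are assumptions.

```python
import os


def trap_links(environ=os.environ):
    """Build the robots.txt body and a bait link from SPAMBOT_TRAP_DIR."""
    trap = environ.get("SPAMBOT_TRAP_DIR", "squirrel")
    return {
        # Generated robots.txt keeps good robots out of the trap.
        "robots": f"User-agent: *\nDisallow: /{trap}/\n",
        # Empty bait link that leads straight into the trap.
        "link": f'<A HREF="/{trap}/guestbook/post/"></A>',
    }
```

Because both outputs come from the same variable, renaming the trap can never leave robots.txt and the bait links out of step.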
Now, we bring it all together in the Apache configuration file.
httpd.conf
Download sample httpd.conf directives
Download sample startup.pl script (used in httpd.conf)
You need to have mod_perl installed before you can use BlockAgent.pm. You should take a look at the sample given above, and integrate these directives into your own virtual hosts. The most important lines are:
Alias /squirrel /www/spambot_trap
PerlSetEnv SPAMBOT_TRAP_DIR squirrel
You should set the 'squirrel' name to whatever you'd like for your website; you'll then access the trap using a URL something like http://www.yourdomain.com/squirrel/guestbook/message. This will spring the trap. You also need to set up the BlockAgent.pm access handler:
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile
This ensures that all accesses to your website will go through BlockAgent.pm first. You should choose your own location for the bad_agents.txt file.
Finally, you might want to install Embperl so that you can embed Perl into your HTML code (always executed on the server side, never seen on the client side):
# Set EmbPerl handler for main directory
# Handle HTML files with Embperl
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI
# Handle robots.txt with Embperl
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI
That about does it. You should now have the setup which will allow you to block spambots. You'll probably be interested in monitoring what happens...
Monitoring
Download sample script for monitoring web server logs
This simple script just tails the badhosts_loop log. You'll have fun (I do) seeing what comes on your site and promptly falls into the trap, and then SPLAT. No more spambot. Heh heh heh.
Conclusions
This setup works pretty well for me at the moment. I've no doubt there are flaws in my design, but it seems stable and is "good enough" for the time being. If you can see any improvements then I'd love to hear about them. To finish up, here's a summary of the strengths and potential weaknesses of the Spambot Trap system.
Strengths
Weaknesses
Possible Future Enhancements
If you can think of any more potential problems (or unrecognised strengths!) then I'd be happy to hear about it. I'd also like to hear about any comments on this document.
I've found that a lot of people just won't send email if there's not a link to facilitate it. I've become rather fond of using JavaScript to write the address to the page. Spambots read the source, so they don't piece the address together, but *most* browsers will still do it right. Just use something like:
<script>document.write("<A CLASS=\"link\" HREF=\"mailto:" + "myname" + String.fromCharCode(64) + "mydomain" + "\">email me</A>");</script>
Seems to work fine. Anyone know of any reason it shouldn't, or have any other way to keep down spam without totally removing the Mailto: ? I know this won't work with *every* browser, but it beats totally removing mail links. And I don't think spammers can get it without having a human actually look at the page...
do not read this line twice.
looks like /. ate his website, not spambots :)
The Problem: Spambots Ate My Website
Spambot: (noun) - A software program that browses websites looking for email addresses, which it then "harvests" and collects into large lists. These lists are then either used directly for marketing purposes, or else sold, often in the form of CD-ROMs packed with millions of addresses. To add insult to injury, you may receive a spam email which is asking you to buy one of these lists yourself. Spambots (and spam) are a pestilence which needs to be stamped out wherever it is found.
I have a website, http://www.crazyguyonabike.com, which has bicycle tour journals, message boards and guestbooks. I started noticing around the end of 2001 that the site was getting hit a lot by spambots. You can spot this sort of activity by looking for very rapid surfing, strange request patterns, and non-browser User-Agents.
After looking at the server logs, I realized a couple of things: Firstly, the spambots came from many different IP addresses, so this precluded the simple option of adding the source IP to my firewall blocks list. Secondly, there seemed to be a common behavior between the bots - even if this was the first visit from a particular IP address (or even a particular network, so no chance of just being a different proxy) they would come straight into the middle of my website, at a specific page rather than the root. This means that the spambots obviously had some kind of database of pages, which had presumably been built up from previous visits, before I'd noticed the activity, and this database was being shared between a large number of different hosts, each of which was apparently running the same software.
Another distinctive behavior was that the spambots would follow only those links which had certain keywords which would seem promising if you're looking for email addresses: "guestbook", "journal", "message", "post" and so on. On each of the pages in my site there were many other links in the navbars, but only links with these keywords were being followed. Also, robots.txt was never even being read, let alone followed. Moreover, the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests.
This was very annoying. For one thing, these things were picking off email addresses from my website (at that point, I was letting people who posted on my message boards decide for themselves whether they wanted their email addresses to be visible or not). But quite apart from that, it was taking up resources, and was just plain rude. I hate spam. I resent my webserver having to play host to people whose obvious goal is to cynically exploit the co-operative protocols of the internet to their own selfish, antisocial gain. So, I decided to do something about it.
The first thing I did was to look at the User-Agent fields which were being used by the bots. There were a variety, including variations on the following:
DSurf15a 01
PSurf15a VA
SSurf15a 11
DBrowse 1.4b
PBrowse 1.4b
UJTBYFWGYA (and other strings of random capital letters)
I searched the internet for references to these strings, but all I found was a slew of website statistics analysis logs. This meant that these particular spambots obviously got around. It was also discouraging, because there was no mention anywhere of what these things actually were. I was surprised that there seemed to be no discussion whatsoever of something that seemed to be pandemic. Then I found a couple of other websites with guestbooks that had actually been defiled by these spambots: (if you follow these links and you don't see a lot of empty messages left by the above user agents, then that means the webmaster of the site has finally found a way to stop it, so good for them...)
http://www.virtualglasgow.com/guestbook.html
http://www.donotenter.com/guestbook/gbook.html
I reckon the spambots didn't really intend to leave empty messages. They just tend to want to follow links with the keyword 'post'. So if the guestbook posting form has no preview or confirmation page, then the spambot would leave a message simply by following this link! My guestbooks and message boards have a preview page, which is probably why I hadn't had any of this.
Anyway, I started thinking about what kind of program this thing was. First of all, it comes from all kinds of different IP addresses. I couldn't quite believe that this many different IP addresses were all intentionally using the same software, of which I could find absolutely no mention anywhere on the Web. This made me think it might be some kind of virus/trojan/worm or whatever that silently installed itself on people's computers, and then used the CPU and bandwidth to surf the Web without the owner being aware of it. I thought that if this was the case, then it must be sending the results somewhere - and if we could find out where, then we could go about shutting the operation down. But I have had no luck at all in getting any help from the sysadmins at the ISPs I have contacted. A typical exchange was the one with a guy at Cox internet, which was where a persistent offending IP address was sourced. He just couldn't be bothered, and eventually told me that spidering was not against the law, or their terms of service. I asked whether actions which were blatantly obviously geared toward the generation of spam were against their terms of use, but he never replied to that. I had no more luck anywhere else: nobody had heard of this thing. I even sent an email to CERT, but got no response. So, I turned instead to thinking about how I could erase these pests from my life as much as possible. This document is about my quest to stop spambots (not just this one, but ALL spambots) from abusing my website. Hopefully it will be useful to you.
Overview of the Spambot Trap
There are three main parts to the technique which I outline here:
1. Banish visible email addresses from your websites altogether, or else obfuscate them so they can't be harvested. Examples of how to do this are given. This is your fail-safe, in case the spambots figure out a way around your other defences. Even if they manage to cruise your website on their very best behavior, they still should not be able to harvest email addresses!
2. Block known spambots: Certain User-Agents are just known to be bad, so there's no reason to let them come on your site at all. True, spambots could in theory spoof the User-Agent, but the simple reality is that a lot of them don't. We use an enhanced version of the BlockAgent.pm module from the O'Reilly mod_perl book. This extension adds offending IP addresses to a MySQL (or other relational) database, which is picked up by the third part of our cunning system...
3. Set a Spambot Trap, which blocks hosts based on behavior. We set a trap for spambots, which normal users with browsers and well-behaved spiders should not fall into. If the bot falls in the trap, then its IP address is quickly blocked from all further connections to the webserver.
This works using a persistent, looping Perl script called badhosts_loop, which checks every few seconds for additions to a 'badhosts' database. This script then adds a 'DENY' rule for each bad host to the ipchains firewall. Blocks have an expiry, which is initially set to one day. If a host falls in the trap again after the block expires, then that IP is blocked again - and the expiration time is doubled to 2 days. And so on. This algorithm ensures that the worst offenders get progressively more blocked, while one-time offenders don't stick around in our firewall rules eating up resources.
There are various components to the Spambot Trap, including the badhosts_loop Perl script, the BlockAgent.pm module, ipchains config, MySQL database, httpd.conf, robots.txt, and your HTML files. These are all covered in the sections below.
Banishing 'mailto:'
The first and most urgent thing you need to do is to get email addresses off your website altogether. This means, unfortunately, banishing the venerable mailto: link. It's a real shame that perfectly good mechanisms should be removed because of abuse, but that's just the way the world is these days. You need to be defensive, and assume that the spammers will try to take advantage of your resources as much as possible.
It's an arms race
The important thing that you need to realize is that no matter what blocks we put in place, this game is an arms race. Eventually the spambot writers will develop smarter bots which circumvent our techniques. Therefore you want to have a failsafe, which will prevent email addresses from getting into the hands of the spambot even if all else fails. The only real way to do that is to completely remove all email addresses from your website.
Contact forms
You should replace the mailto: links with links to a special form where people can type their name, email address and message. A CGI can then deliver the email, and your email address never has to be disclosed. There are a number of different mailer scripts out there - just be careful to check for vulnerabilities which could allow malicious users to use the form to send email to third parties (i.e. spam, ironically enough) using your server. The formmail script is popular, but an earlier version had such a vulnerability (since fixed). The Embperl package has a simple MailFormTo command to send an email from a form.
Since I have seen guestbooks out there which have been extensively defiled by spambots, I would add that you should have a preview screen on your contact forms. This will ensure that an email doesn't get fired off simply by a spambot following the 'post' or 'contact' link (which it will likely try to do).
Alternatives to totally banishing mailto:
There are alternatives to completely removing email addresses, but they all depend on the stupidity of the spambot, and so could be compromised by a new generation of pest. These include:
Write out email addresses in a non-email format, e.g. instead of writing 'username@domain.com' you would write 'username at domain dot com', or something similar. It would only take some spambot with a little more intelligence to be able to scan these patterns and pick up "likely" addresses, so this strategy is a little risky. Any consistent method you choose to write out email addresses could in theory be analyzed and decoded by a savvy bot.
Add stuff to the email address to make it invalid, but so that a human could easily know what to do to make it work. An example of this is writing 'username@_NO_SPAM_domain.com'. You need to remove the "_NO_SPAM_" part to make the email address valid. You can have some kind of explanation to make it clear what people have to do to use the address. Personally, I don't like this - you're depending on a level of sophistication on the part of your users which is risky. In my experience, there are a lot of very 'novice' level users out there, who only know how to click on a link. They don't know how to edit an email address. Heck, I've had people come to my site by typing the URL into Google, rather than the 'Location' box of their browser. Also, people don't read instructions.
Make graphics images which contain the email address. Spambots usually don't download graphics, and even if they did, they probably couldn't decode the bits to get the text. However, they could do it in theory, since software for doing OCR (optical character recognition, getting text from scanned documents) has been around for a while. A downside to this approach is that the user has to manually copy down the email address, since it can't be cut'n'pasted. Also, you can't put a mailto: link on the image, otherwise you're back to square one. But you could put a link to a contact form, with an argument in the link telling your server internally what email address to use. For example, the link could say "contact.cgi?to=23", where '23' is some database key to the actual email address. But the downside here is that you still need to generate the image, which is a bit of a pain in the ass if you have a lot of them. You can do it automatically, if you're willing to put the work in and write the scripts. There are some very nice graphics generation packages out there on CPAN for Perl.
[Image: an example of an email address presented as a graphic]
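To make the risk of the first alternative ('username at domain dot com') concrete, here is a minimal sketch in Python of how little code a harvester would need to undo it. This is purely illustrative: the pattern and the function name are hypothetical, not taken from any real spambot.

```python
import re

# Hypothetical pattern: matches spellings such as "username at domain dot com".
# A real harvester might be looser or stricter; this is only a sketch.
OBFUSCATED = re.compile(
    r"\b([\w.+-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)\b",
    re.IGNORECASE,
)

def decode_addresses(text):
    """Return addresses recovered from 'user at domain dot tld' spellings."""
    found = []
    for user, domain in OBFUSCATED.findall(text):
        # Turn each " dot " separator back into a literal "."
        domain = re.sub(r"\s+dot\s+", ".", domain, flags=re.IGNORECASE)
        found.append(f"{user}@{domain}")
    return found

print(decode_addresses("Write to username at domain dot com for details."))
```

Any other consistent spelling can be attacked the same way, which is why these alternatives only work as long as the bots stay lazy.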
MySQL
Download badhosts MySQL database dump
We need to set up a MySQL database, where we store records of the hosts which are to be blocked. This doesn't have to be MySQL, but I use it because it's extremely fast, and very appropriate for this kind of application. You need to create a new database, called 'badhosts'. You then create a table, again called 'badhosts', with the following structure:
Field        Type                            Comment
ip_address   varchar(20) not null, indexed   The IP address of the host to be blocked
user_agent   varchar(255) not null           The HTTP User-Agent of the spambot, for reference
expire_days  int unsigned not null           How many days is this block for. Doubled every time a new block has to be created for a particular IP address
created      datetime not null               When this block was created
expiry       datetime not null, indexed      When this block expires
You could use the dump provided above to load directly into your database:
shell> mysqladmin create badhosts
shell> mysql badhosts < badhosts.dump
That's about it! The fields which are marked as 'indexed' are the only ones which need indexes, because they are searched on to see if a particular IP address has been previously blocked, and also to see which blocks should be removed because they've expired. If you have access privileges set on your MySQL databases, then you need to allow the Apache user (usually 'nobody') access. The other script that will require access is badhosts_loop, which runs as root.
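As a rough illustration of the two lookups those indexes serve, here is a sketch in Python. It uses an in-memory SQLite database as a stand-in for MySQL so that it runs on its own; the column types are approximated (SQLite has no 'int unsigned'), and the index and function names are made up for the example. The two sample rows are modelled on the log output shown in the badhosts_loop section.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE badhosts (
    ip_address  VARCHAR(20)  NOT NULL,
    user_agent  VARCHAR(255) NOT NULL,
    expire_days INTEGER      NOT NULL,
    created     DATETIME     NOT NULL,
    expiry      DATETIME     NOT NULL
);
CREATE INDEX idx_badhosts_ip ON badhosts (ip_address);
CREATE INDEX idx_badhosts_expiry ON badhosts (expiry);
""")

# Rows are (ip_address, user_agent, expire_days, created, expiry).
conn.executemany(
    "INSERT INTO badhosts VALUES (?, ?, ?, ?, ?)",
    [
        ("68.5.99.89", "DSurf15a 01", 8, "2002-04-04 14:08:11", "2002-04-12 14:08:11"),
        ("24.234.28.85", "DBrowse 1.4b", 8, "2002-04-07 10:43:42", "2002-04-15 10:43:42"),
    ],
)

def active_blocks(conn, now):
    """Return the IPs whose block has not yet expired at time `now`."""
    rows = conn.execute(
        "SELECT ip_address FROM badhosts WHERE expiry > ? ORDER BY ip_address",
        (now,),
    )
    return [ip for (ip,) in rows]

def was_blocked_before(conn, ip):
    """True if this IP already has at least one block on record."""
    row = conn.execute(
        "SELECT COUNT(*) FROM badhosts WHERE ip_address = ?", (ip,)
    ).fetchone()
    return row[0] > 0

print(active_blocks(conn, "2002-04-13 00:00:00"))
```

The first query is the one that benefits from the index on expiry, the second from the index on ip_address.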
Next, we look at the script that populates this database.
BlockAgent.pm
Download BlockAgent.pm
Download bad_agents.txt
The BlockAgent.pm Apache/mod_perl module is taken from the excellent book "Writing Apache Modules with Perl and C" by Lincoln Stein & Doug MacEachern (O'Reilly). This script basically acts as an Apache access control module which checks the HTTP User-Agent header against a list of known bad agents. If there's a match, then a 403 'Forbidden' code is returned. The script compiles and caches a list of subroutines for doing the matches, and automatically detects when the 'bad_agents.txt' file has changed. I have found that it has no noticeable impact on the performance of the webserver. This script is useful in the case where you know for certain that a certain User-Agent is bad; there's no point in letting it go anywhere on your site, so it's a good first line of defense. We'll cover how to add this module to your website a little later, along with the rest of the configuration settings in the section on httpd.conf.
Of course, one of the first arguments you'll see with regard to this method of blocking spambots is that it's easy to circumvent, by simply passing in a User-Agent string which is identical to the major browsers out there. This is perfectly true, but don't ask me why the spambot writers haven't done this - maybe it's a question of pride or ego, they want to see their baby out there on record in Web server logs. I honestly don't know. The main point is that at present, the User-Agent header CAN be used very effectively to block most bad agents. But, I have added more features so that we can also block agents which look ok, but behave badly by going somewhere they shouldn't - the Spambot Trap. More on that soon.
You'll notice that the bad_agents.txt file which I have supplied here is very comprehensive. A good strategy here is probably to save the full version somewhere (perhaps as bad_agents.txt.all), and just keep the ones you actually encounter in the bad_agents.txt file. Then you keep the list shorter, and more relevant to what actually hits you. For example, my bad_agents.txt file currently has the following lines in it, because these are the spambots that I see most frequently:
^[A-Z]+$
^.Browse\s
^.Eval
^EO Browse
^.Surf
^Microsoft.URL
^Mozilla\/3.0.+Indy Library
^Zeus.*Webster
You'll notice from this that BlockAgent.pm is very flexible, being able to take full advantage of the excellent regular expression capabilities of Perl. This means you can capture a lot of different agents with just one line. For example, the very first line catches all the variations of the agent which passes in random strings of capital letters, e.g. FHASFJDDJKHG or UYTWHJVJ. The spambot obviously thinks it's being pretty smart by looking different each time, but by using an easily identifiable pattern, it shoots itself in the foot. Hah.
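The matching itself is easy to try out. The following is a sketch in Python rather than the Perl of BlockAgent.pm (these particular patterns behave the same way in both regex dialects); the function name is hypothetical, and the real module builds its matchers from the bad_agents.txt file instead of a hard-coded list.

```python
import re

# The patterns quoted above, compiled once up front.
BAD_AGENT_PATTERNS = [
    r"^[A-Z]+$",
    r"^.Browse\s",
    r"^.Eval",
    r"^EO Browse",
    r"^.Surf",
    r"^Microsoft.URL",
    r"^Mozilla\/3.0.+Indy Library",
    r"^Zeus.*Webster",
]
COMPILED = [re.compile(p) for p in BAD_AGENT_PATTERNS]

def is_bad_agent(user_agent):
    """True if the User-Agent string matches any of the bad-agent patterns."""
    return any(p.search(user_agent) for p in COMPILED)

for agent in ["UJTBYFWGYA", "DSurf15a 01", "PBrowse 1.4b", "Mozilla/5.0 (X11; Linux)"]:
    print(agent, "->", "403 Forbidden" if is_bad_agent(agent) else "allowed")
```

The first three agents come from the list of User-Agents seen in my logs; the last is just a stand-in for an ordinary browser.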
The original version of the BlockAgent.pm script is well explained in the O'Reilly book, but I've added an extra hook that checks to see whether the client is accessing any of the spambot trap directories. If it is, then we add an entry to the MySQL database (you could use another relational database if you want, as long as it's accessible from Perl DBI).
The first time an IP address is blocked, an expiry of one day is set. If the same host subsequently comes in and falls into the trap again, then the expiry time is doubled. And so on. This way, the block gets longer and longer, in proportion to how persistently the spambot revisits our website. Once the IP address is blocked, the spambot can't even connect to our web server, since we use 'DENY' in the ipchains rule. This means that no acknowledgement is given to any packets coming in from the badhost, and as far as they know, our server has just gone away. Hopefully, after this happens for long enough, our server will be taken off the spambot's "visit" list. Another nice little side-effect of this is that the spambot will probably have to wait for a while before giving up each connection attempt. Anything that makes them waste more time is ok by me!
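The doubling rule is simple enough to write down. This is a sketch in Python under the assumption that the length of the previous block is looked up from the expire_days column; the function name and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

INITIAL_EXPIRE_DAYS = 1  # the first block lasts one day

def next_block(previous_expire_days, now):
    """Return (expire_days, expiry) for a new block.

    previous_expire_days is None for a first offence; otherwise it is the
    length in days of this IP's most recent block.
    """
    if previous_expire_days is None:
        days = INITIAL_EXPIRE_DAYS
    else:
        days = previous_expire_days * 2
    return days, now + timedelta(days=days)

days = None
when = datetime(2002, 4, 1, 12, 0, 0)
for offence in range(1, 5):
    days, expiry = next_block(days, when)
    print(f"offence {offence}: blocked {days} day(s), until {expiry}")
    when = expiry + timedelta(hours=1)  # the bot comes back an hour after expiry
```

A host that keeps coming back goes from 1 day to 2, 4, 8 and so on, while a one-off visitor drops out of the firewall rules after a single day.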
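BlockAgent.pm notifies the badhosts_loop script that something has happened by touching a file called /tmp/badhosts.new. The badhosts_loop script checks this file every few seconds, and if it has changed then it knows that a new record has been added to the database, and that it needs to re-generate the blocks list.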
The BlockAgent.pm script is our alarm system. It's what tells us that something happened. In order to act on this information, we need to be able to add rules to the ipchains firewall. We'll cover this next.
ipchains
Download sample ipchains config file
The ipchains module (here's the HOWTO doc) is a very nice way of providing a good level of basic network security to your server. If you haven't already set it up (or its successor, iptables), then you really should. It's a very easy way to configure who can and cannot have access to your machine. A good resource for learning about this is "Building Linux and OpenBSD Firewalls", by Wes Sonnenreich and Tom Yates (Wiley). This is where I learned about ipchains, and it's on their excellent explanations and examples that I based my own config file. Another is "Linux Firewalls" by Ziegler (New Riders), which seems to have a more recent 2nd edition that covers iptables too.
The example ipchains config file given here is complete, but the bit which is most important to us is that we create a chain called 'blocks'. This is our own custom chain, which we can then add rules to. The badhosts_loop script will flush this chain and build it back up whenever a spambot falls in your trap. Once the spambot's IP address is on the blocks list, that host cannot connect to your server at all.
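To show what "flush and build back up" amounts to, here is a sketch in Python that only assembles the command lines and prints them; nothing is executed, since that would need root and a machine running ipchains. The -F, -A, -s and -j DENY options are standard ipchains usage, but exactly how badhosts_loop invokes them is an assumption, and the function name is hypothetical.

```python
CHAIN = "blocks"  # the custom chain created in the config file

def rebuild_commands(active_ips):
    """Return the ipchains commands that flush and repopulate the chain."""
    commands = [["ipchains", "-F", CHAIN]]  # flush the existing rules
    for ip in active_ips:
        # Append a rule that silently drops everything from this source.
        commands.append(["ipchains", "-A", CHAIN, "-s", ip, "-j", "DENY"])
    return commands

for cmd in rebuild_commands(["68.5.99.89", "24.234.28.85"]):
    print(" ".join(cmd))
```

Rebuilding the whole chain each time keeps the logic simple: expired blocks disappear just by not being re-added.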
Remember to restart ipchains after you've changed the config file. Next, we'll look at the script that actually adds the firewall rules.
badhosts_loop
Download badhosts_loop script
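You run this script in the background, as root. It has to be run as root, because only root has the ability to add rules to the firewall. The script spends most of its time sleeping. It wakes up every five seconds or so and does a quick check on /tmp/badhosts.new. If this file has been changed since the last time it looked, then it goes and re-generates the firewall blocks list with all the current (non-expired) blocks. If nothing else happens, then the script will automatically do this at least once a day, to ensure that blocks really do expire even if there is no new activity.
You should probably add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically started up on reboot:
/path/to/badhosts_loop --loop &
This will start the script looping in the background. The script automatically checks to see if it is already running, by attempting to lock /var/lock/badhosts_loop.lock. If the file is already locked then the script will exit with an error message. If you want to just run the script once, without looping, then just omit the '--loop' option. This can be useful for testing.
Logging is done to /var/log/badhosts_loop.log by default. Every time the script generates the blocks list, it writes a list of all the blocks to the log. This is a good place to monitor if you're interested in what hosts are being blocked. Here's an example of the log output: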
Thu Apr 11 16:09:07 2002:
Flushing blocks chain:
Generating blocks list:
Adding 68.5.99.89 (8) 2002-04-04 14:08:11 to 2002-04-12 14:08:11 DSurf15a 01
Adding 24.234.28.85 (8) 2002-04-07 10:43:42 to 2002-04-15 10:43:42 DBrowse 1.4b
The log shows the IP address which is being added, then (in brackets) the number of days the block is effective for (doubling each time), then the start and end dates of this block, and finally the name of the User-Agent which committed the crime. This can be useful for quickly seeing whether you need to add a new one to the bad_agents.txt file.
This is a pretty stable script that should just sit there and chug quietly, not taking up much in the way of resources. Checking for a file being changed every five seconds is not a big deal in Unix, so you shouldn't even notice it.
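For the curious, the wake-up check boils down to comparing a file's modification time. Here is a sketch of that loop in Python (the real script is Perl). The function names are hypothetical, the lock file handling is left out for brevity, and regenerate stands for whatever rebuilds the blocks list.

```python
import os
import time

FLAG_FILE = "/tmp/badhosts.new"    # touched by BlockAgent.pm
POLL_SECONDS = 5                   # "every five seconds or so"
MAX_IDLE_SECONDS = 24 * 60 * 60    # regenerate at least once a day

def flag_mtime(path):
    """Modification time of the flag file, or None if it does not exist yet."""
    try:
        return os.path.getmtime(path)
    except FileNotFoundError:
        return None

def should_regenerate(current_mtime, last_seen_mtime, last_run, now):
    """Decide whether the blocks list needs rebuilding on this wake-up."""
    if current_mtime is not None and current_mtime != last_seen_mtime:
        return True                            # a new record was added
    return now - last_run >= MAX_IDLE_SECONDS  # daily sweep so blocks expire

def loop(regenerate):
    """Poll forever, calling regenerate() whenever a rebuild is due."""
    last_seen, last_run = flag_mtime(FLAG_FILE), time.time()
    regenerate()                               # build the list once at start-up
    while True:
        time.sleep(POLL_SECONDS)
        current, now = flag_mtime(FLAG_FILE), time.time()
        if should_regenerate(current, last_seen, last_run, now):
            regenerate()
            last_seen, last_run = current, now
```

Each wake-up costs one stat() call, which is why the polling is not noticeable.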
Now you have to create the trap itself - the spambot_trap directory.
spambot_trap/ Directory
Download gzipped tarball of sample spambot_trap directory
View the sample directory
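You can create this directory anywhere on your server. We will create an alias in httpd.conf to access it. I put mine in /www/spambot_trap/. The point is, this doesn't have to be a real directory under your webserver's document root. If you use the Alias directive, then multiple websites can access the same spambot_trap directory, potentially through different aliases. You can use the sample tarball as a starting point; it has subdirectories and links which the spambots I have seen find irresistible. You should create your own image file to replace unblock_email.gif, so that it shows a valid email address of your own.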
The spambot_trap and spambot_trap/guestbook/ directories are not used directly to spring the trap. This is because I wanted to have a warning level, a lead-in, where real users would be able to realize they are getting into dangerous waters and could then back out. You're going to be placing hard-to-click links on your web pages which lead into the real trap, and there's always a chance that a real user will accidentally click on one of these. So, some of the links will point into the warning level. I have made a GIF image which contains a warning text. Why an image? Mainly because spambots can't understand images, and I didn't want to give big clues like "WARNING!!! DO NOT ENTER" in plain text. So, the user sees the warning, the spambots don't. If the spambot proceeds into any of the subdirectories (email, contact, post, message), then the trap is sprung and the host is blocked.
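The distinction between the warning level and the trap proper can be sketched as a small classifier. This is Python for illustration only, not the hook in BlockAgent.pm; the function name is hypothetical, and the rule "any of the four subdirectory names anywhere below the alias springs the trap" is my reading of the layout described above.

```python
TRAP_DIR = "squirrel"  # the alias name used in the examples
TRAP_SUBDIRS = {"email", "contact", "post", "message"}

def classify(path, trap_dir=TRAP_DIR):
    """Return 'trap', 'warning' or 'normal' for a request path."""
    parts = [p for p in path.split("?")[0].split("/") if p]
    if not parts or parts[0] != trap_dir:
        return "normal"
    # Any trap subdirectory name below the alias springs the trap.
    if any(p in TRAP_SUBDIRS for p in parts[1:]):
        return "trap"
    return "warning"

for path in ["/journal/", "/squirrel/guestbook/", "/squirrel/guestbook/post/"]:
    print(path, "->", classify(path))
```

Only the 'trap' result would lead to a database entry and a firewall block; a 'warning' hit just shows the image.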
You also need to try to stop good spiders (e.g. google) from falling into the spambot trap and being blocked. To do this, we utilize the robots.txt file.
robots.txt
Download sample robots.txt
This should allow good robots (such as google) to surf your site without falling into the spambot trap. Most bad spambots don't even check the robots.txt file, so this is mainly for protection of the good bots.
You'll see that we list a bunch of directories under '/squirrel'. This could be anything; you'll set an alias later in httpd.conf. In fact, you may even want this to be dynamically generated (see later, under Embperl), so that you can quickly change the name of the spambot trap directory if the spambots adapt and start avoiding it. At present, a static setup should work just fine, however.
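As a quick sanity check that a robots.txt really keeps polite robots out of the trap, here is a sketch in Python using the standard library's robots.txt parser. It generates a simplified file with a single Disallow line for the whole trap prefix, which is an assumption on my part: the downloadable sample lists the individual directories instead. The function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def robots_txt(trap_dir):
    """Build a robots.txt that keeps well-behaved robots out of the trap."""
    return "\n".join([
        "User-agent: *",
        f"Disallow: /{trap_dir}/",
        "",
    ])

text = robots_txt("squirrel")
print(text)

# Check the result the way a polite crawler would interpret it.
parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("Googlebot", "/squirrel/guestbook/post/"))
print(parser.can_fetch("Googlebot", "/journal/"))
```

Because the trap name is a parameter, the same function covers the case where you rename the trap directory later.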
Next, we need to look at the bait - links within your HTML files which lead the spambot into the trap.
Your HTML Files
Download sample HTML code
Download sample transparent 1 pixel image for hiding the trap
Here's an example of HTML with links into the spambot trap:
<HTML>
<BODY BGCOLOR="beige">
<A HREF="/squirrel/guestbook/message/"></A>
<A HREF="/squirrel/guestbook/post/"><IMG SRC="/guestbook.gif" WIDTH=1 HEIGHT=1 BORDER=0></A>
Body of the page here
<TABLE WIDTH=100%>
<TR>
<TD ALIGN=RIGHT>
<A HREF="/squirrel/guestbook/">
<SMALL><FONT COLOR="beige">guestbook</FONT></SMALL>
</A>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Spambots tend to be stupid. You'd think they would check for empty links (which don't show up in a real browser), but they don't seem to. Sure, they may get smarter, but meantime you might as well pick the low hanging fruit. So, the very first thing in the body of your HTML should be an empty link which goes straight into the trap proper - not the warning level, but the actual trap itself. This is because there is no way for someone using a real browser to click on this link, and good spiders will ignore it anyway because it's in the robots.txt file.
We also use a one pixel big transparent GIF (a favorite web bug technique) to anchor a link to the trap, just in case the spambot is smart enough to avoid empty links. If we put this as the very first thing in the body, then it'll be pretty hard for a real user to click on, since it's only one pixel in size. But a spambot will quite happily go there!
Finally, there is an example of a non-graphic, text based link. This will be placed on the right side of the screen by the table, and the text will appear in the same color as the background (in this example, beige). The link does not go straight into the trap, but into the warning level, because with this one there is a bigger chance that real people could click on it accidentally. The link may be invisible, but it's still there, and someone could find it. So, they get to see a nice warning, and they should back off from there. But the spambot won't. By the way, we have the link going to /squirrel/guestbook/ rather than just /squirrel/ because some of the spambots seem to specifically follow links with certain keywords, e.g. 'guestbook', 'message', 'post', etc.
You can sprinkle these links all around your HTML files. I put them in every single one, since I use Embperl templates which make that sort of thing very easy.
Embperl
Download sample dynamic robots.txt using Embperl
Download sample dynamic HTML code using Embperl
The point of this is to make it easier to change the spambot trap directory without having to edit a whole bunch of files. We pass an environment variable to Perl from httpd.conf (see below), which says what the trap directory is called. We then use this in Embperl to substitute into the HTML and robots.txt files at request time. Thus if we wanted to change the name of the trap from 'squirrel' to 'badger', then we only need to change httpd.conf, restart apache, and we're done. All the links in the HTML are dynamic, as is robots.txt (see the samples above).
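The idea is the same in any templating system. As an illustration only, here is the substitution sketched in Python instead of Embperl: the trap name is read from the SPAMBOT_TRAP_DIR environment variable and dropped into the bait links from the HTML example. The function name and the fallback to 'squirrel' are assumptions made for the sketch.

```python
import os
from string import Template

# Bait links from the HTML example, with the trap name as a placeholder.
BAIT = Template(
    '<A HREF="/$trap/guestbook/message/"></A>\n'
    '<A HREF="/$trap/guestbook/post/">'
    '<IMG SRC="/guestbook.gif" WIDTH=1 HEIGHT=1 BORDER=0></A>'
)

def render_bait(environ=os.environ):
    """Fill in the trap directory from SPAMBOT_TRAP_DIR at request time."""
    trap = environ.get("SPAMBOT_TRAP_DIR", "squirrel")
    return BAIT.substitute(trap=trap)

print(render_bait({"SPAMBOT_TRAP_DIR": "badger"}))
```

Changing the one variable changes every link, which is the whole point of making the pages dynamic.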
Now, we bring it all together in the Apache configuration file.
httpd.conf
Download sample httpd.conf directives
Download sample startup.pl script (used in httpd.conf)
You need to have mod_perl installed before you can use BlockAgent.pm. You should take a look at the sample given above, and integrate these directives into your own virtual hosts. The most important lines are:
Alias /squirrel /www/spambot_trap
PerlSetEnv SPAMBOT_TRAP_DIR squirrel
You should set the 'squirrel' name to whatever you'd like for your website; you'll then access the trap using a URL something like http://www.yourdomain.com/squirrel/guestbook/message. This will spring the trap. You also need to set up the BlockAgent.pm access handler:
<Location />
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile /www/conf/bad_agents.txt
</Location>
This ensures that all accesses to your website will go through BlockAgent.pm first. You should choose your own location for the bad_agents.txt file.
Finally, you might want to install Embperl so that you can embed Perl into your HTML code (always executed on the server side, never seen on the client side):
# Set EmbPerl handler for main directory
<Directory "/www/vhosts/www.yourdomain.com/htdocs/">
# Handle HTML files with Embperl
<FilesMatch ".*\.html$">
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI
</FilesMatch>
# Handle robots.txt with Embperl
<FilesMatch "^robots\.txt$">
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI
</FilesMatch>
</Directory>
That about does it. You should now have the setup which will allow you to block spambots. You'll probably be interested in monitoring what happens...
Monitoring
Download sample script for monitoring web server logs
This simple script just tails the badhosts_loop log. You'll have fun (I do) seeing what comes on your site and promptly falls into the trap, and then SPLAT. No more spambot. Heh heh heh.
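If you want to do more than watch the log scroll by, the 'Adding ...' lines are easy to pick apart. This is a sketch in Python, not the sample monitoring script; it assumes the log format shown in the badhosts_loop section, and the function name is hypothetical.

```python
import re

# Matches lines such as:
#   Adding 68.5.99.89 (8) 2002-04-04 14:08:11 to 2002-04-12 14:08:11 DSurf15a 01
LOG_LINE = re.compile(
    r"^Adding (?P<ip>\S+) \((?P<days>\d+)\) "
    r"(?P<start>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) to "
    r"(?P<end>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) (?P<agent>.*)$"
)

def parse_block(line):
    """Return a dict describing one 'Adding ...' line, or None for other lines."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["days"] = int(entry["days"])
    return entry

sample = "Adding 24.234.28.85 (8) 2002-04-07 10:43:42 to 2002-04-15 10:43:42 DBrowse 1.4b"
print(parse_block(sample))
```

From there it is a short step to counting blocks per User-Agent, which tells you what to add to bad_agents.txt.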
Conclusions
This setup works pretty well for me at the moment. I've no doubt there are flaws in my design, but it seems stable and is "good enough" for the time being. If you can see any improvements then I'd love to hear about them. To finish up, here's a summary of the strengths and potential weaknesses of the Spambot Trap system.
Strengths
Does not rely exclusively on the HTTP User-Agent header, but at the same time allows us to block agents which we know to be bad.
Does not rely on the spambot abusing the robots.txt file. Many spambots don't even load it. But the robots.txt file will protect "good" robots from falling into the spambot trap. So, for example, googlebot will be just fine.
The blocks happen based on behavior, rather than trusting anything the spambot tells us about itself (e.g. User-Agent). Thus we don't rely on any prior knowledge of the spambots in order to block them; an entirely new one that we've never seen before will still fall in the trap and be duly blocked.
Once a spambot is blocked, then it cannot connect to your server again at all for the duration of the block. If it tries to connect, it won't even get a 'connection refused' error, because the firewall rule just quietly drops all the packets from the bad hosts. The ipchains firewall is very effective, and more efficient at blocking hosts than anything you could put together with Apache. So, you save on server resources. If you're wondering whether the block lists might get large, I have found that with the constant expiring of one day blocks, the active block list has never been more than about 20 IP addresses at a time, out of a list (so far) of 100 distinct hosts.
The blocks initially expire after one day. This means that one-off offenders are quickly removed from the firewall rules. On the other hand, repeat offenders get progressively longer and longer blocks (doubled each time). This means that the more abusive a host is, the more it will be blocked. It also means that if a bot is coming in from multiple IP addresses (through a proxy), then each of the individual IP addresses will probably not go on to be blocked for too long. Thus you won't be blocking everyone in AOL. On the other hand, if you continue to get hit from the same network, then it's obviously a source of trouble and should be blocked. If it's a major network like AOL, which you really don't want to block, then you need to take the IP addresses and times of the abuse, and send it to the sysadmin at the ISP concerned. There's really not a lot else you can do. I haven't seen this in reality, though. In my experience, the spambots come in from all sorts of different IP addresses, and the ones that are very persistent over time are mostly static IPs from DSL and small ranges of IPs from cable modems. These are the people with the always-on, high bandwidth capabilities which are needed for large scale email harvesting.
The system uses a relational database to manage the blocks, and so it is very scalable, and potentially you could share the database between multiple servers. If any one server gets a spambot, then the offending IP address can automatically also be blocked at all the other servers. Also, the fact that we don't delete expired blocks means that we can keep track of the history of the blocks, and perhaps perform analyses which would lead to more permanent ipchains blocks of entire subnets, if desired.
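The doubling of block lengths described above can be sketched in a few lines. This is only an illustration in Python, not the code the trap itself uses; the function names and the idea of passing in an offence count (for example, the number of rows already stored for that IP) are assumptions.

```python
from datetime import datetime, timedelta

# First offence earns a one-day block, as described in the article.
BASE_BLOCK = timedelta(days=1)


def block_duration(offence_count: int) -> timedelta:
    """Length of the block for the Nth offence (1-based), doubling each time."""
    if offence_count < 1:
        raise ValueError("offence_count must be >= 1")
    return BASE_BLOCK * (2 ** (offence_count - 1))


def block_expiry(blocked_at: datetime, offence_count: int) -> datetime:
    """Time at which a block that started at blocked_at should be lifted."""
    return blocked_at + block_duration(offence_count)
```

With this scheme a fourth offence already earns an eight-day block, while a one-off visitor drops out of the firewall rules after a day.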
Weaknesses
It would be possible for the spambots to get wise, and start following the robots.txt file rules. Then the spambot could in theory surf your entire site (or at least the bits allowed by robots.txt) without falling into the trap. However this also means that you can control where the spambot goes, which is the whole point of robots.txt. If you want, you can allow google into one part of the site, but exclude all others. Still, you should remove all email addresses from your site as the fail-safe.
It's possible that a spambot could come in through a proxy such as AOL, which means you'll be blocking multiple AOL IP addresses. This is not very nice, and I'm not sure what the solution is at the moment. All I can say is that it hasn't happened yet, and the worst offenders on my site all have static IPs. They seem to come in from cable and DSL connections mostly.
I don't know how feasible this would be, but it may be possible to conduct a "denial of service" type attack on your webserver by making many requests to the spambot trap directory from different IP addresses. I think, however, that you actually need to have those IP addresses (rather than spoofing them) in order to set up a real TCP connection with the web server. I don't know how likely this is, but it comes more under the "attack" category than spambots. If someone tries this on your site, then it's definitely something that can be pursued with legal means. It's no longer just a petty annoyance, but rather a hostile action which must be chased down. Also, the motivation is totally different - the spammers don't want to do this kind of thing. They just want their email addresses. The DDOS attacks are notoriously difficult to track, but I think in the couple of years that have passed since the first ones brought down Amazon and Yahoo!, there has been some progress made. Anyhow, I just wanted to bring the idea into the light of day. If anyone has any clues about it then I'd be glad to know.
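The robots.txt point in the first weakness is easy to check for yourself. The sketch below uses Python's standard urllib.robotparser to show what a well-behaved robot sees; the 'squirrel' directory is the example name from the setup section, and the robots.txt content here is an assumed minimal version, not the file shipped with the trap.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt: every robot is told to stay out of the trap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /squirrel/
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if a robot honouring robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

A robot that asks this question before every request never reaches the trap, whether it is googlebot or a spambot that has got wise; that is why removing addresses from the pages remains the fail-safe.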
Possible Future Enhancements
Spot large numbers of blocks occurring on a particular subnet, and automatically consolidate blocks into a single one which blocks the entire subnet (e.g. 128.123.31.0/24).
More interactive tools to allow removal of blocks
Analysis tools which can tell us something about patterns of abuse from particular networks.
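As a rough idea of how the first enhancement might work, here is a Python sketch using the standard ipaddress module. The threshold of three hosts per subnet and the /24 prefix are arbitrary assumptions for illustration, and the sketch handles IPv4 addresses only.

```python
import ipaddress
from collections import Counter


def consolidate(blocked_ips, threshold=3, prefix=24):
    """Collapse blocked IPv4 hosts into subnet blocks where enough hosts share one.

    Hosts in a subnet with at least `threshold` offenders are replaced by a
    single rule for the whole subnet; all others keep a /32 rule of their own.
    """
    addrs = {ipaddress.ip_address(ip) for ip in blocked_ips}

    def subnet_of(addr):
        return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)

    per_subnet = Counter(subnet_of(a) for a in addrs)
    rules = set()
    for a in addrs:
        net = subnet_of(a)
        if per_subnet[net] >= threshold:
            rules.add(net)
        else:
            rules.add(ipaddress.ip_network(f"{a}/32"))
    return sorted(rules)
```

Three offenders in 128.123.31.x would come out as the single rule 128.123.31.0/24, which could then be fed to the firewall in place of the individual entries.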
If you can think of any more potential problems (or unrecognised strengths!) then I'd be happy to hear about them. I'd also like to hear any comments on this document.
1q2w3e4r5t6y7u8i9o0pqawsedrftgthyjukilo;p'azsxdcf
My setup (catches some of the more commonly used spambots) uses mod_rewrite to send spammers to a trap.
Setup details at http://www.bero.org/NoSpam/isp.php
This message is provided under the terms outlined at http://www.bero.org/terms.html
I think a better idea was one that I heard a while back. This guy set up a script to constantly create new pages with randomly created garbage email addresses and links to new random pages with new random garbage email addresses, ad infinitum. Sure, you'll get a few more hits from the spambot, but it'll keep crawling your script-based hierarchy and keep polluting its database with email addresses that don't exist!
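A page generator along those lines might look like the sketch below. It is not the script mentioned above, just a minimal Python illustration; the function names and the ?page= link format are made up, and the addresses are deliberately placed under the reserved .invalid top-level domain so that none of them can belong to a real person.

```python
import random
import string


def fake_address(rng):
    """Build a random address under .invalid, a TLD reserved never to resolve."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    host = "".join(rng.choice(string.ascii_lowercase) for _ in range(10))
    return f"{user}@{host}.invalid"


def garbage_page(seed, n_addresses=20, n_links=5):
    """Return one HTML page of fake mailto links plus links to further pages."""
    rng = random.Random(seed)  # seeding makes each page reproducible
    lines = ["<html><body>"]
    for _ in range(n_addresses):
        addr = fake_address(rng)
        lines.append(f'<a href="mailto:{addr}">{addr}</a><br>')
    for _ in range(n_links):
        next_page = rng.randrange(10**9)
        lines.append(f'<a href="?page={next_page}">more</a>')
    lines.append("</body></html>")
    return "\n".join(lines)
```

Serving garbage_page(n) for whatever page number the bot requests gives it an endless tree to crawl without storing anything on disk.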
Have your page linked on slashdot! Page gets slashdotted, problem solved.
You can generate the code for your own email address here or, if you want some source code, then you can find an implementation of it here.
Avantslash - View Slashdot cleanly on your mobile phone.
1) Put a link such as: mailto:dedicatedaddress@wherever.com?Subject= [Question] About your site (or whatever)
2) Trash any email sent to dedicatedaddress that doesn't have the [Question] tag in the subject.
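Step 2 could be automated with something like the following Python sketch, which parses a raw message with the standard email package and checks the subject for the tag. The function name is hypothetical and how the result gets wired into a mail filter is left open.

```python
from email import message_from_string

# The tag that the mailto: link pre-fills into the subject line.
REQUIRED_TAG = "[Question]"


def keep_message(raw_message: str) -> bool:
    """Return True only if the message's Subject header carries the tag."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    return REQUIRED_TAG in subject
```

Mail harvested by a bot arrives with whatever subject the spammer chose, so it fails the check; mail from a visitor who clicked the link passes unless they edited the subject.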
Hope this helps.
-- B.
This sig does in fact not have the property it claims not to have.
Why is this a bad thing? They are owned by Verisign.
How about instead, returning pages with the email address abuse@domain-that-spambot-is-coming-from all over them...
This is also a good idea. In fact, I have a script which does a traceroute to the IP of the bot, and then looks up the admin contact using whois for the last couple of hops, and returns these. Oh, and for additional fun, throw in a couple of addresses of especially loved "friends"...
Say no to software patents.
Write some of your email address using html code for the ascii characters, like &#114; for "r".
(Yes, I've posted about this before, but it does work for me.) Browsers render it so users get the address they want, but spambots try to grab it from the raw html and get something meaningless.
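For anyone who would rather not encode an address by hand, here is a small Python sketch of the same trick. It encodes every character rather than just some of them, which is a choice made for simplicity; the function names are made up for the example.

```python
import html


def entity_encode(address: str) -> str:
    """Replace each character with its decimal HTML entity, e.g. r -> &#114;"""
    return "".join(f"&#{ord(ch)};" for ch in address)


def mailto_link(address: str) -> str:
    """Build a mailto link whose raw HTML never contains the plain address."""
    encoded = entity_encode(address)
    return f'<a href="mailto:{encoded}">{encoded}</a>'


def browser_view(encoded: str) -> str:
    """Decode entities the way a browser would before showing the text."""
    return html.unescape(encoded)
```

A bot that pattern-matches the raw HTML for something@something finds no @ sign at all, while a browser decodes the entities and shows the normal address. A bot that decodes entities first is of course not fooled.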
Add a couple of sleep(20); calls to the cgi script that generates the bot fodder. The bot will still stay busy waiting for your webserver's response, but your script will consume exactly zero resources.
Zero resources, except for memory.
A much better solution would be to point the bot at a set of "servers" with IP addresses where you're running a stateless tarpit.
Tarsnap: Online backups for the truly paranoid
How does your solution require linux? And why in gods name would you want to run a webserver using linux and mysql... do you just want a slow webserver or what?
The page is already slashdotted. Here is a little script that traps bots (and others) that use your robots.txt
to find directories to look through. Requires an .htaccess file with mod_rewrite turned on.
robots.txt
#################
User-agent: *
Disallow: /dont_go_here
Disallow: /images
Disallow: /cgi-bin
dont_go_here/index.php
############
<?php
// Record when and from where the trap was hit
$now = date("h:ia m/d/Y");
$IP = getenv("REMOTE_ADDR");
$host = getenv("REMOTE_HOST");
$your_email_address = "you@whatever";
// Build a mod_rewrite rule matching exactly this address (dots escaped, anchored at both ends)
$ban_code =
"\n".
'# '."$host banned $now\n".
'RewriteCond %{REMOTE_ADDR} ^'.preg_quote($IP).'$'."\n".
'RewriteRule ^.*$ denied.html [L]'."\n\n";
// Append the rule to .htaccess and notify the site owner
$fp = fopen("/path/to/.htaccess", "a");
fwrite($fp, $ban_code);
fclose($fp);
mail($your_email_address, "Spambot Whacked!", "$host banned $now\n");
?>
Nothing, MySQL and mSQL are dope as shite! I've never used anything but. Don't get me wrong, Oracle is great for massive databases I'm sure, however I don't want to pay 80k for an Oracle server that's so damn complicated I'd have to spend another 5 - 10k on schooling. It doesn't need to be that complicated, however it justifies the 70 - 80k per year you have to spend on your developer. All in all, not worth it. If ya can't do it with MySQL or mSQL you're a poor programmer.
From the website: Wpoison is a free tool that can be used to help reduce the problem of bulk junk e-mail on the Internet in general, and at sites using Wpoison in particular.
It solves the problems of trapped spambots sucking up massive bandwidth/CPU time, as well as sparing legitimate spiders (say, google) from severe confusion.
Actually, I've done this w/a bot trap on my site at home. It's a perl script that generates a bunch of weird-sounding text w/some fake email addresses at the bottom and a bunch of database-query-looking links back to the original page.
The bots don't fall for it anymore. Some dorks in Washington state decided to make a couple requests a second to it once, but in the two years I've had it up, they're the only ones.
A pretty good article, but being able to install modules into Apache may not be the best situation for everyone who wants to stop Spambots..
Shameless plug, but I've got an ongoing series in the Apache section of /. that deals with easy ways that administrators *and* regular users can keep Spambots off their sites:
Stopping Spambots with Apache
and
Stopping Spambots II - The Admin Strikes Back
Just some more options and choices to help people out!
fine in windows ie5 / ie6 /ns3 and even webTV !
no, it's broken on ie6
I like that idea...look up the originating host, and make links back to abuse@, root@, webmaster@, and whatever else you can think of. Clog their mailservers. The problem is, it would be simple enough (if it's not already in place) to have your spam bot ignore addresses for your own domain.
do not read this line twice.
It would be possible for the spambots to get wise, and start following the robots.txt file rules. Then the spambot could in theory surf your entire site (or at least the bits allowed by robots.txt) without falling into the trap. However this also means that you can control where the spambot goes, which is the whole point of robots.txt. If you want, you can allow google into one part of the site, but exclude all others.
I'd read robots.txt and just go where google was allowed to go...
None of that Perl nonsense, either. All in pure C on a BSD host, with a damn good attention to potential overflows. That was also the site which had my own custom MTA (I only knew sendmail, so it seemed a wise decision), demanded full W3C compliance (we would test it on about 10 platforms), and got used as evidence in the DoJ case against Microsoft.
Sigh, those were the days. Now, all I see is rehashing of old ideas. So, in my view this news is 6 years old -- perhaps even a record for Slashdot?
Give it thousands, millions of addresses this way.
Liberally sprinkled postmaster@127.0.0.1 and abuse@127.0.0.1.
If you use images for email addresses, what are people using text browsers supposed to do? Even worse is using them on the "warning" pages - someone with a text browser would have no idea what the image said and therefore nothing to stop them falling into the trap and getting firewalled.
And of course if he uses ALT text for the images, then he has the same problem he was trying to avoid, of creating something the spambots can read.
I agree. And, come on, how much technology do you need?
This is my solution to stopping spambots. It's in a JavaServlet technology and I am posting it here to prevent my company's site from being slashdotted. It does not prevent the spammer from harvesting emails it just slows them down.... a lot :) If everyone had a script like this, spambots would be unusable.
Feel free to use the code in anyway you please (LGPL like and stuff)
Put robots.txt in your root folder. Content:
User-agent: *
Disallow:
Put StopSpammersServlet.java in WEB-INF/classes/com/parsek/util:
package com.parsek.util;
import java.io.File;
import java.io.StringWriter;
import javax.servlet.ServletContext;
import java.net.URL;
import java.util.Enumeration;
import java.lang.reflect.Array;
public class StopSpammersServlet extends javax.servlet.http.HttpServlet {
private static String[] names = { "root", "webmaster", "postmaster", "abuse", "abuse", "abuse", "bill", "john", "jane", "richard", "billy", "mike", "michelle", "george", "michael", "britney" };
private static String[] lasts = { "gates", "crystal", "fonda", "gere", "crystal", "scheffield", "douglas", "spears", "greene", "walker", "bush", "harisson" };
private String[] endns = new String[7];
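private static long getNumberOfShashes(String path) {
int i = 1;
// getPathInfo() returns null when the request has no extra path; treat that as depth 1
if (path == null) return i;
java.util.StringTokenizer st = new java.util.StringTokenizer(path, "/");
while(st.hasMoreTokens()) { i++; st.nextToken(); }
return(i);
}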
public void doGet (javax.servlet.http.HttpServletRequest request,
javax.servlet.http.HttpServletResponse response)
throws javax.servlet.ServletException, java.io.IOException {
response.setContentType("text/html; charset=UTF-8");
java.io.PrintWriter out = response.getWriter();
try {
ServletContext servletContext = getServletContext();
endns[0] = "localhost";
endns[1] = "127.0.0.1";
endns[2] = "2130706433";
endns[3] = "fbi.gov";
endns[4] = "whitehouse.gov";
endns[5] = request.getRemoteAddr();
endns[6] = request.getRemoteHost();
String query = request.getQueryString();
String path = request.getPathInfo();
out.println("<html>");
out.println("<head>");
out.println("<title>Members area</title>");
out.println("</head>");
out.println("<body>");
out.println("<p>Hello random visitor. There is a big chance you are a robot collecting mail addresses and have no place being here.");
out.println("Therefore you will get some random generated email addresses and some random links to follow endlessly.</p>");
out.println("<p>Please be aware that your IP has been logged and will be reported to proper authorities if required.</p>");
out.println("<p>Also note that browsing through the tree will get slower and slower and gradually stop you from spidering other sites.</p>");
response.flushBuffer();
long sleepTime = (long) Math.pow(3, getNumberOfShashes(path));
do {
String name = names[ (int) (Math.random() * Array.getLength(names)) ];
String last = lasts[ (int) (Math.random() * Array.getLength(lasts)) ];
String endn = endns[ (int) (Math.random() * Array.getLength(endns)) ];
String email= "";
double a = Math.random() * 15;
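// NOTE: the original chain of if(a < ...) tests was lost when this comment was posted
// (the text after each '<' was stripped); the three variants below are an assumed stand-in.
if (a < 5) email = name;
else if (a < 10) email = name + "." + last;
else email = name.charAt(0) + last;
email = email + "@" + endn;
out.print("<a href=\"mailto:" + email + "\">" + email + "</a><br>");
response.flushBuffer();
Thread.sleep(sleepTime);
} while (Math.random() < 0.9); // threshold lost in posting; 0.9 is an assumption
out.print("<br>");
do {
int a = (int) (Math.random() * 1000);
out.print("<a href=\"" + a + "/\">" + a + "</a> ");
Thread.sleep(sleepTime);
response.flushBuffer();
} while (Math.random() < 0.9); // threshold lost in posting; 0.9 is an assumption
out.println("</body>");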
out.println("</html>");
} catch (Exception e) {
out.write("<pre>");
out.write(e.getMessage());
e.printStackTrace(out);
out.write("</pre>");
}
out.close();
}
}
Put this in your WEB-INF/web.xml
<servlet><servlet-name>stopSpammers</servlet-name>
<servlet-class>com.parsek.util.StopSpammersServlet</servlet-class>
</servlet>
<servlet-mapping>
<servlet-name>stopSpammers</servlet-name>
<url-pattern>/members/*</url-pattern>
</servlet-mapping>
Here you go. No PHP, no Apache, no MySQL, no Perl, just one servlet container.
Ciao
boky
SAUCE
Yeah, the spambot would probably (or could probably) filter abuse... But why not auto-generate an email to abuse@spammers.isp.com and send the appropriate logs that prove the use?
However, the instructions for installing Wpoison more or less assume that one has a single website to protect. I have around 20 virtual hosts. So instead of creating a renamed cgi-bin in every DocumentRoot, I added a single
ScriptAlias /runme/ "/var/www/cgi-bin/"
to httpd.conf and then linked it like this:
<A HREF="/runme/addresses.ext"><IMG SRC="pixel.gif" BORDER=0></A>
I also added a single transparent pixel to the link to keep it invisible but still fool the spiders. Add the runme directory as excluded in the robots.txt and you should be on your way. Muhahahah, and so on.
Money for nothing, pix for free
Win XP with IE 6 has problems.
Can't get it to work. Damn Klerck!
How about sending a parameter to a page which redirects to the mailto: protocol?
For example:
index.html
<a href="filename.php?x=info">E-Mail Me</a>
filename.php
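<?php
// Keep only characters valid in a simple mailbox name, so the value cannot inject extra headers
$x = preg_replace('/[^A-Za-z0-9._-]/', '', $_GET['x'] ?? '');
// PHP concatenates strings with '.', not '+'
header("Location: mailto:" . $x . "@mydomain.tld");
?>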
Jamie will never "fix" anything, unless you call using a patch someone sends him "fixing". Trouble is, he wants to understand the fix before he applies it, and it could take forever for the slashdot perl weenies to understand a few lines of code. Let me be clear, this is not a shot at perl - even if it is a crappy language.
I had the text too, but I kept getting blocked by the Slashdot filter when I tried to repost it. I even tried using Striff Tummel. What's your trick to weed out the stuff that triggers the Slashdot filter?
do a reverse-dns lookup on the host, and do word1word2@reverse.lookup
that'd be great!!!!!!!!
With syntax like &#D;, where D is the decimal number of the desired character. The following URL has a few other syntaxes to thwart spambots.
http://www.w3.org/TR/html401/charset.html#entities
For example, &#64; is @ and &#46; is . (a period). It's a really easy method to obfuscate addresses from spambots, which don't parse the HTML anyway. Normal web browsers always parse the pages and use the parsed addresses for additional processing (like mailto: links).
There's a spam-blacklist, so how about a spambot-blacklist?
You'd have a standardized spambot trap (like the one described in the article) on various webservers. The new spambot info could go into a "New SpamBots" database (which wouldn't be blocked). Once a day, the webserver would connect up with a central database and submit the new spambot info it's obtained. Then the server would download a mirror of the updated "SpamBots" database which it would use to block spambots.
The centralized SpamBots database would take all of the new SpamBot info every day and analyze them in some manner as to detect abuse of the system (ensuring that only true spambots are entered). E-mails could be fired off to the abuse/postmaster/webmaster for the offending IP address. Finally, the new SpamBot info would be integrated into the regular SpamBot database.
This way you'd be able to quickly limit the effectiveness of the Spambot-traps across many websites.
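One simple form the "analyze them in some manner" step could take is to accept an IP only when several independent servers have reported it. The Python sketch below shows that idea; the (server, ip) report format and the threshold of two reporters are assumptions, and a real system would still need some way to decide which submitting servers to trust.

```python
from collections import defaultdict


def merge_reports(reports, min_reporters=2):
    """Return IPs reported by at least `min_reporters` distinct servers.

    `reports` is an iterable of (server_id, ip) pairs collected in one day.
    Repeated reports from the same server count only once.
    """
    reporters = defaultdict(set)
    for server_id, ip in reports:
        reporters[ip].add(server_id)
    return sorted(
        ip for ip, servers in reporters.items() if len(servers) >= min_reporters
    )
```

Requiring agreement between servers means a single misconfigured (or malicious) trap cannot get an address onto the shared list by itself.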
My sci-fi novel, Ghost Thief, is now available from Amazon.com.
just to write a simple spam trap
Especially loved "friends"...
Like hotline@mpaa.org, cdreward@riaa.org, senator@hollings.senate.gov for example?
By using real domains, you're doing a real disservice to those who host them.
Dear Spambot Authors,
Thanks again for your interest. I hope that we were able to help you write the spambots of the future that will be able to detect and sidestep as many of the above protection schemes as possible. We tried to work all of our knowledge into one convenient thread for your development team to peruse.
Thanks for your interest in SlashDot, home of too much information.
------
Why on Earth would you like to block a spambot? So it doesn't get any more useful addresses? /give/ it a next page. With a nicely formatted word1word2num1num2@word1word2.com, where words and nums are random.
No way, man.
If you realize you're serving to a bot, go on serving. Each time the bot follows the "next page" link, you
Give it thousands, millions of addresses this way.
This would be good to do with known bad addresses, but random addresses only add more unknowing people to the list. You may add 1000 email addresses to the list and slow them down, but if even 10 of those email addresses are real, you've added to the problem. The bad addresses will be taken out as they are found to be bad, and the good ones will be left in. You've signed JoeRandomUser@RandomDomain.com up for all the spam he can handle, even if he has gone to great lengths to keep his email address off the spam lists. In theory this sounds like a great idea, until you're the guy getting your email address randomly fed to the bots.
"Information wants to be expensive" - Stewart Brand, the same guy who said "Information wants to be free"
Try out the Book of Infinity. It's a CGI that generates an infinite trail of gibberish links. It could easily be modified to add gibberish e-mail addresses to each page.
The author of the Spambot traps alludes to his website as www.crazyguyonabike.com. I recently ran across this site and found it quite interesting. He has a journal of his cross-country bicycle trip from New York City to the state of Washington.
It's well written and a quite humorous read.
$5 / month hosted VPS on linux = awesome!
There is another solution: Usually these SpamBots are not able to execute JavaScript...
As described at http://www.joemaller.com/js-mailer.shtml you can combine JavaScript and images to protect your mail. I've had very good experiences with this one....
But, as stated on the Website: this game is an arms race...
function SeedFakeEmail($Email)
{
echo "\n<font size=\"-5\" style=\"display:none\"><a
href=\"mailto:$Email\"> Please don't email $Email</a></font>";
}
SeedFakeEmail("uce@ftc.gov");
SeedFakeEmail("listme@dsbl.org");
SeedFakeEmail("hotline@mpaa.org");
SeedFakeEmail("cdreward@riaa.org");
SeedFakeEmail("senator@hollings.senate.gov");
Put that in your pageheader and smoke it!
No, they're not...what are you some kind of fucking idiot?!?! Don't overload the root nameservers. This is a Bad Thing. See, this is what slashdot readers get...no idea of how things work.
On rahga.com, I use a custom perl script with a html-based form that is programmed only to send messages to me. Here it is.
On stuff like my FAQs, I use igPay Latin Encoded Email: ahgaray atyay ahgaray otday omcay
I guess if you're some pussy who's not man enough to use a real database manager then MySQL might seem usable.
You probably "program" in perl, too.
How about instead, returning pages with the email address abuse@domain-that-spambot-is-coming-from all over them...
Most spambots know better than to send their crap to email addresses containing things like abuse, root, postmaster, .edu, or .gov.
Also, in regard to the problem of root servers being queried every time a @randomdomain.com is looked up, could you not just use random IP addresses?
I pledge allegiance to the flag...
of the Corporate States of America...
Before announcing a new useful project to the Slashdot community, create a Freshmeat/Sourceforge page first, thereby eliminating the need for my host to shut me down for excessive bandwidth.
Take a look at these two bits of code from http://www.slickhosting.com/contact.shtml :
<A HREF="mailto:hosting%40slickhosting.com"
onMouseOver="window.status='mailto:hostingslickhosting.com';return true;"
onMouseOut="window.status='';">hostingslickhosting.com</A>
<!-- Spam trap
abuse@ (your domain) HREF="mailto:abuse@ (your domain) "
root@ (your domain) HREF="mailto:root@ (your domain) "
postmaster@ (your domain) HREF="mailto:postmaster@ (your domain) "
uce@ftc.gov HREF="mailto:uce@ftc.gov"
-->
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Make a big Macromedia Flash site. Let the bots eat that: this is the thing a lot of companies do.
new thought: make a site written in .doc format.
don't worry, google will adapt. They read even pdf and .doc files.
Odds are high that this system, should it become sufficiently widespread to be useful, would be vulnerable to poisoning by spammers spoofing spambot traps and causing legitimate IPs (such as Googlebot or large blocks of Net users) to be incorrectly blocked. There are countermeasures against this, but my guess is that the resulting arms race would not result in an adequately-usable system for enough of the time to be worth it. (Remember, the blacklist must update with reasonable frequency for both additions AND expirations, and must have a VERY low rate of false-positives). The authentication of "legitimate" submitters is a serious weakness of such a system. Nice thought, though...
"My strength is as the strength of ten men, for I am wired to the eyeballs on espresso."
Win 2k with IE 6 as well
postmaster@127.0.0.1 and abuse@127.0.0.1
Good idea but, I'm sure spam software has been rejecting 127.0.0.1 for many years.
How about a few people volunteering real FQDNs that all resolve to 127.0.0.1? I realize that people would be volunteering horsepower and bandwidth for DNS lookups, but it would be in the name of dramatically reducing spam. Then, keep a list of all the "loopback FQDN's" and let the rest of us feed those FQDN's into spam-trap generators. Eventually, there would be so many real-looking spam trap email addresses that the spam software wouldn't be able to keep up with the list of loopback FQDN's.
To take it to the next level, you could hide the list of "loopback FQDN's" by making a reverse DNS lookup against a couple of volunteered IP addresses return a random FQDN from the list of loopback FQDN's at the time that the spamtrap page is dynamically generated.
Spammers would never know the entire list of FQDN's that resolve to loopback.
Intelligent Life on Earth
Way too much work. Here's similar Escapade [escapade.org] code:
<QUIET ON>
<html><head><title>Members area</title></head><body>
<p>Hello random visitor. There is a big chance you are a robot collecting mail
addresses and have no place being here.
Therefore you will get some random generated email addresses and some random links
to follow endlessly.</p>
<p>Please be aware that your IP has been logged and will be reported to proper
authorities if required.</p>
<DBOPEN "SpamFood", "localhost", "login", "password">
<FOR I=1 TO 100 STEP 1>
<SQL select * from names order by rand() limit 1>
<LET FN="$Name">
</SQL>
<SQL select * from lasts order by rand() limit 1>
<LET LN="$Last">
</SQL>
<SQL select * from addresses order by rand() limit 1>
<LET AD="$Address">
</SQL>
<a href="mailto:$FN.$LN@$AD">$FN.$LN@$AD</a> <br>
</FOR>
</body>
</html>
-- Ed Carp, N7EKG erc@pobox.com PGP KeyID: 0x0BD32C9B What I'm up to: http://intuitives.mine.nu
I don't stop spambots, I feed them. I feed them phony email addresses and addresses of spammers (gathered from places such as my fake /cgi-bin/formmail.pl). I use
http://www.devin.com/sugarplum/, mentioned before on /. to dish it out!
is that some of the fake emails it generates will be real.
but not /.
We've recently set up a Spam Troll-box using Vipul's Razor on our new Tux4Kids dev server (you can find our troll box here).
;-)
A troll-box gives Spam-bots a place to send their spam. When this box intercepts the spam, it reports it to the Vipul's Razor network, and everyone else on this network becomes aware of that spam (if they are also using Vipul's Razor to filter, which, chances are they are, it will filter that spam if they get it).
If Vipul's Razor isn't enough, one can even use something like SpamAssassin in conjunction with Vipul's Razor to get even better results.
Of course, this isn't cutting off Spam-bots at their source... but if enough sites were to cut them off at their source, then I'd imagine the Spam-bot authors would get wise to this and devise a way around it. Whereas with something like a Spam Troll-box, the Spam-bots seem to still be working to those running the Spam bots.
Well, I didn't trust (1), and (3) just got me a voice mail box instead of a person I could chew out, which I didn't use. That left (2), and I had a wicked idea:
I hit 2, and input the number that I should call if I was interested in the fax (which appeared in BIG text right above the little text). Their own response number should start eventually getting faxes from them or, as I tend to experience, hangups.
Cute story, I know, but what does this have to do with defeating spambots?
I went to the page indicated...
And I scrolled to the bottom, and looked at the source code, and noted two faaaaaascinating things:
First, the HTML on that page is rather clean; I can see no evidence of anti-spambot code on their page.
And second, the "Contact Us" link at the bottom is a mailto:.
By all appearances, their page is vulnerable to their own spambot.
So I had the thought... what if those generated-random-email-address pages were geared to produce not-so-random email addresses? What if the email addresses on those generated-page traps were geared to generate random email addresses at the domains of the various spambot-- (err, I mean) harvester producing companies? Let them see what it's like when less than discerning spammers use their software for evil. Hundreds of Viagra-substitutes! Thousands of hangover cures! Tens of thousands of opportunities to refinance their home mortgage!
This is just an off-the-top-of-my-head idea. Opinions?
You cannot truly appreciate Dilbert until you read it in the original Klingon.
I've used some of Matt's code on my personal site, and never thought to ask the question "Gee, are these things just an exploit waiting to happen?"
I don't have much traffic, but that's certainly not the point. I really appreciate knowing about the existence of nms.
- Leo
You don't use science to show that you're right, you use science to become right.
An old (by web standards) trap can be found at http://spiders.must.die.net/h/b/e/index.htm, although I've never found its real beginning myself.
:)
The prose on those pages also makes good beer drinking reading.
It is at least internally consistent: It's not a spambot, so it doesn't fall into spam traps :)
Stopping spambots is fantastic, but this is a defensive measure. Aren't there offensive measures people can use? What about a 'honeypot' approach? Perhaps you set up a bogus site with zillions and zillions of easy-to-find but totally bogus email addresses. Then let the spammers download 10-15GB of worthless addresses that will (hopefully) choke their email pipe. Make it "ugly" enough out there and just maybe a few of the less dedicated might decide it's not worth it.
Any other offensive measures possible?
I tried this, and it doesn't work. Although it slows down the 'bot, on a standard linux system the sleep(1) call [in Perl] takes up an enormous amount of CPU time, leading me to believe that it is implemented as a spin lock. Also, the serving process would not disappear even if the socket connection is closed on the client end (ie - getting out of the browser and rebooting my machine), so eventually a large number of CPU-hungry processes accumulate and suck up all your resources.
My ISP was none too happy about this when I tried it. If any kind sysadmin can figure this out and tell me how to fix it, I'd be grateful...
$5 / month hosted VPS on linux = awesome!
Here's a list of non-existent e-mail addresses for those damn spam bots. GO GET EM BOYS!
/. it :)
Don't go here
Please don't
This would be good to do with known bad addresses, but random addresses only add more unknowing people to the list....You've signed JoeRandomUser@RandomDomain.com up for all the spam
A solution to this is to generate only hotmail.com addresses.
"I've got them on the list! They'll none of them be missed!"
What about requiring all of your users to go through a terms of service page before accessing any parts of your site?
The page could have a form with "Accept TOS" and "Reject TOS" buttons. I wonder how many spambots would submit a form?
And to catch spambots that did submit the form, your TOS could have some clauses that make it a violation for evil spiders (ones that don't honor "robots.txt") to use the site. Maybe you could make||lose a few bucks suing the spambotters who go through the TOS and still harvest your email addresses.
Speaking of spam, I've come across this new program called mailwasher. You can check your mail while it's still on the server, and then - get this - fake a bounced message. There are probably other programs that do this, but this is the first one I've heard of.
Anyway, AFAIK, it's WinBlows only, and available at http://www.mailwasher.com, although right now it seems the site is down, all I get is a 404!
Didn't spot two of my favorite techniques (although they're probably somewhere in that pile).
But I still wish someone would make an Apache mod that lets you devote a single process to tying up a specified number of spambot connections.
Rather than filling the spider with a whole bunch of (potentially valid) addresses and loading your server with bogus clients you don't want, just make it difficult for them to extract the addresses.
I wrote a bit of PHP a few months ago that applied some spamproofing à la Slashdot (only a bit less aggressive) that some might find useful.
Highlighted Source
Raw Source
It performs the following munging, depending on what you specify:
freaky@aagh.net
freaky (at) aagh (dot) net
freaky@aagh.N0SPAM.net.SPAMN0
freaky@aag&#104;.net
random one of the above
random with entity encoding
all of the above
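The PHP source is only linked, not quoted, in this thread, so the following is just a rough Python sketch of what the listed munging modes might look like. The function names, the `rate` parameter, and the `tag`/`suffix` defaults are assumptions made for illustration and are not taken from the poster's script.

```python
import random

def munge_words(addr):
    """freaky@aagh.net -> freaky (at) aagh (dot) net"""
    return addr.replace("@", " (at) ").replace(".", " (dot) ")

def munge_nospam(addr, tag="N0SPAM", suffix="SPAMN0"):
    """freaky@aagh.net -> freaky@aagh.N0SPAM.net.SPAMN0"""
    user, domain = addr.split("@", 1)
    host, _, tld = domain.rpartition(".")
    return f"{user}@{host}.{tag}.{tld}.{suffix}"

def munge_entities(addr, rate=0.3, rng=random):
    """Swap some characters for decimal HTML entities (h -> &#104;).

    A browser renders the result exactly like the plain address.
    """
    return "".join(f"&#{ord(c)};" if rng.random() < rate else c for c in addr)

def munge_random(addr, rng=random):
    """Pick one of the munging styles at random."""
    return rng.choice([munge_words, munge_nospam, munge_entities])(addr)
```

Entity encoding is the only style here that stays clickable for a human without any manual cleanup, which is presumably why it is combined with the others rather than replacing them.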
http://www.xemu.org/mirrors/spambot_trap.html
There are "scanner" traps that start up a session and just drop it (not telling the scanner), which ties it up until the scanner software times out.
How about writing something for these spambots using a special web server that slowly responds to its requests (sends out a small packet every 10 seconds) so it won't time out and won't consume much cpu time, and just feeds it a line or two lines of junk with each packet. Have it randomly generate a never ending supply of useless information to keep the spambot happy. While it's busy with the useless site, it's not bothering other people nor is it getting any real addresses.
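A slow-drip server like the one described above could be sketched with Python's asyncio, where every stalled connection is a sleeping coroutine rather than a process, so idle connections cost almost no CPU. This is a minimal sketch under assumptions: the names `junk_line`, `tarpit` and `main`, the port, and the 10-second delay are all illustrative, and the handler does not even parse the request.

```python
import asyncio
import random
import string

def junk_line(rng=random, width=60):
    """Return one line of meaningless lowercase words ending in a newline."""
    words, length = [], 0
    while length < width:
        size = rng.randint(3, 9)
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(size))
        words.append(word)
        length += len(word) + 1
    return " ".join(words) + "\n"

async def tarpit(reader, writer, delay=10.0, max_lines=None):
    """Drip junk to one client, pausing `delay` seconds between lines."""
    writer.write(b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n")
    sent = 0
    try:
        while max_lines is None or sent < max_lines:
            writer.write(junk_line().encode("ascii"))
            await writer.drain()          # raises if the client has gone away
            sent += 1
            await asyncio.sleep(delay)    # yields to other connections
    except ConnectionError:
        pass                              # client hung up; just clean up
    finally:
        writer.close()

async def main(host="127.0.0.1", port=8081):
    """Serve the tarpit until interrupted (host and port are placeholders)."""
    server = await asyncio.start_server(tarpit, host, port)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(main())` starts it. Because a dropped connection surfaces as an exception on the next write, abandoned clients are cleaned up instead of piling up.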
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.
All of these methods for removing spambots from collecting email addresses do nothing to prevent humans from collecting addresses. Now, if you can find a way to protect your mail address from a human who enters it into a database, you have a solution.
Nothing will prevent a packet sniffer from grabbing your email address from an unencrypted SMTP connection...
There are plenty of clever ways to protect your email address, but there are far more clever ways to defeat anti-spam collectors. Imagine an internet worm which sent your addressbook to a spam collector. That would be a serious mail collecting agent. It wouldn't be legal, but it would work.
I can devise worse ways at the drop of a hat.
So don't waste my time talking about spam, or else the irony will be that all we are talking about is spam. If you don't get that, you should watch the original Monty Python skit.
-Mike
..howabout a glue trap?
:-)
1. Publish false mailto: addresses on your web pages in the same colour font as your background
2. Change them to visible, valid addresses by munging them with DHTML properties and a JavaScript include file (sorry, Lynx users)
3. When a recognizable spam-bot comes in, refuse to load the javascript include file. mod_setenvif and mod_rewrite should help out here.
4. When a probable spam-bot comes in, serve up the page reaalllly slowly, don't close the connection until it goes in CLOSE_WAIT. This ties up sockets on the remote machine and reduces its ability to troll OTHER sites. You can do this by writing a handler for your base directory, checking the browser, and returning DECLINED for friendly people. That should be in, I think the "post read" phase.
5. When a recognized bad address comes through to your mail server (from step 1), slooooow the SMTP transaction down as much as you can (same idea as step 4), and throw an error at the end of the 354 DATA section a few times (to force him to come back!), etc. (Some sendmail internals hacking required here, although it would be much easier to hack if you don't have any real mail and just ran a script from inetd.)
6. Those fake email addresses. Make them all point to a common MX or group of MXes that you control the DNS for. Make sure those MX records aren't used by anything legitimate. Slooooow your in.named down for requests to that domain. A cool side effect, besides tying up sockets on the spammer's end: IIRC some OSs can only make one resolver request at a time -- this'll effectively block all of his outbound spam traffic while he's trying to look up your MX record! Also, make sure the TTL is set to about 10 seconds, just to make sure he comes back to the glue trap very often.
How's *that* for spam countermeasures? I wish I had time to write it.
Do daemons dream of electric sleep()?
http://www.neilgunton.com/spambot_trap/
KDE's KMail can bounce mail, too, manually or via filter.
Memory plus an Apache child. Any solution which causes Apache to be put to sleep artificially can and likely will be used as a very effective DoS against your site. Unfortunately.
dci@cia.gov. ..
tridge@fbi.gov
gbush@whitehouse.gov
I think the bots don't fall for your fake pages and mail addresses because not enough people link to your page. If you had a little more traffic there, more bots would fall for it. Write something interesting and get slashdotted, would you?
You don't need to do that.
MX records do that for you.
You can actually have email@mydomain.com when you don't have a box providing an ip for mydomain.com
MX records say "hey, you, all the email for [this domain] is handled by [that host]" - as such, you could easily tell your DNS provider to set the MX for any number of hosts to 127.0.0.1
Desperation is a stinky cologne
On the other hand, you'd be surprised at just how much spam is delivered to security@. I thought they'd be smart enough to avoid obvious admin addresses until I started seeing it come in.
I like my women like my coffee... pale and bitter.
Or billgates@microsoft.com
Video Game cheats, hints a
How about the email addresses of everyone in Congress, plus all the politicians in Russia, Korea, and the other countries with lots of open relays? (Perhaps excluding those who have tried to do something about spam.)
The mailto:address@foo.com?Subject=bar syntax was introduced by Netscape 2.0.
Nathan
By random I don't think they mean JoeRandomUser@RandomDomain.com. I think they mean random like [output of crypt]@[output of crypt].com. It's pretty unlikely that a legitimate address is going to look like kjd73i3h@3hvcfh93.com (which was just me pushing keys). Spambots probably don't care that an address doesn't "look" like a legitimate address; they're just there to harvest everything.
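To make the "random like [output of crypt]" idea concrete, here is a small Python sketch that produces addresses of that keyboard-mash shape. The helper names are hypothetical. Note one deliberate substitution: where the comment uses .com, the sketch defaults to the reserved `.invalid` top-level domain, so that a generated address can never belong to a real person; pass a different `tld` to change that.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits

def random_label(length=8):
    """Return a random string such as 'kjd73i3h'."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def fake_address(tld="invalid"):
    """Return one random-looking address, e.g. kjd73i3h@3hvcfh93.invalid."""
    return f"{random_label()}@{random_label()}.{tld}"

def fake_addresses(count, tld="invalid"):
    """Return a list of `count` freshly generated addresses."""
    return [fake_address(tld) for _ in range(count)]
```

Calling `fake_addresses(50)` on each page view would give a harvester a different page every time it comes back.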
Apparently, of the rich, by the rich, for the rich.
...so that you can leave them out of your HTML source:
j s
http://artificeeternity.com/includes/linkwrite.
Instructions for use are included in comments. The script fragment that replaces mailto: links in the page will actually shorten your code -- it only requires entering the username and domain once. Also, the @ sign is added in by the script, so the address itself never appears in your HTML.
That's a really great idea, returning email addresses with the "abuse@..." address for the domain they are coming from. Can you post the script you use to do this?
Ideally you would actually create a spam trap account for this task and use a procmail recipe to briefly explain what you're doing in the forwarded message. That way the raw forwarded headers can't be misinterpreted as your server sending the spam.
I do this very thing and have had great luck with it. I seed multiple addresses on key pages so that uce@ftc.gov is guaranteed to receive a number of these pieces of spam. I also send this spam to the newsgroup bot for news.admin.net-abuse.sightings, a newsgroup filled with forwarded spam LARTs for us anti-spammers to search for patterns or previous spamming evidence. You just add "nanas-sub@cybernothing.org" to your recipient list and prepend the forwarded subject line with "(email)". That's it!
Now however, I have changed the URL I use to link to it to be:
/cgi-bin/spambot_trap/guestbook/journal/message :-).
so that all the spambots he mentions will follow it
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
Way too much work. Here's similar Escapade [escapade.org] code:
Not similar enough. That makes 300 queries per hit against your database, and I don't think you even used prepared statements. His code slowed their software to a crawl by sleeping. Yours will slow your software to a crawl by excessive database traffic.
Just to make sure it gets said: The email address that's listed here on /. is a spamtrap. Don't use it! My user name in my domain is the same as my user name here. I didn't intend for that address to become a spamtrap, but it was soaking up so much spam it seemed wise to put it to good use.
Warning: This signature may offend some viewers.
A situation that forced me to install Mozilla.
Can defeating MS really be *that* easy?
Hey... Here's something I found out a few days ago:
http://www.mailutilities.com/aee/
Elcomsoft, who are the makers of the Advanced Ebook processor (remember Sklyarov?), also make various email utilities. Although some look like they might have legitimate uses, at least one looks to have *no* legitimate use. (When a tool is designed to scan web pages for email addy's, and DESIGNED to pull out real names&email from web forums...)
Read the above URL and the rest of the site yourself and draw your own conclusion.
Yes it is... Compile a list of email addresses of congress/senate/judges ("Law makers"/"Law enforcers") and let the spambots eat them up!
Compile a list of email addresses of congress/senate/judges ("Law makers"/"Law enforcers") and let the spambots eat them up!
You mean like this?
My personal favorite:
krypt@mars:~$ ping warez.dal.net
PING warez.dal.net (127.0.0.1): 56 octets data
64 octets from 127.0.0.1: icmp_seq=0 ttl=255 time=0.4 ms
DJ kRYPT's Free MP3s!
- <a HREF="mailto:abc(insert 1000 characters here)@blahblahblah.com">
have any detrimental effect?
"This form has been used already 0 seconds ago. You can not use a form and hit the back button to use it again."
Is slashdot falling apart or what?
It's actually http://www.mailwasher.net/.
I'd really like to, but unfortunately, I can't get the script past that lame lameness filter... Yes, I know, I shouldn't have used Perl... If any of the editors are reading this, please consider making that filter less strict. Thanks!
Say no to software patents.
Here is what I do on my website to protect my email address:
Javascript:
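function sendmail()
{
    // Build the address from fragments so the complete mailto: link never appears in the page source
    var address = 'mail';
    address += 'to:';
    address += 'webmaster';
    address += '@';
    address += 'domain';
    address += '.com';
    // Navigate to the mailto: URL; open() can leave a blank window behind
    window.location.href = address;
}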
Usage:
<a href="JavaScript:sendmail()">webmaster</a>
This could be expanded to pass the values needed to build up the email address.
Can I claim that all the spam these jerks send me are an attempt at a DoS attack?
The man who trades freedom for security does not deserve nor will he ever receive either. - Benjamin Franklin
Now that I've read all your countermeasures, I have created the ultimate Spam Bot to get by these traps. ;]
OK, so we've got spambot prevention. Now we need some effective form of "Slashbot" protection. I envision a webserver that will detect a high number of referrals from Slashdot and put the server into "low bandwidth" mode, serving pages stripped of formatting and graphics (with links to graphics, of course) in order that content may be delivered in an efficient manner.
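The detection half of that idea can be sketched as a sliding-window counter over the Referer header. This is only an illustration in Python: the class name, the threshold of 100 hits, and the 60-second window are made-up values, and how the server actually switches to stripped-down pages is left out.

```python
from collections import deque
import time

class ReferrerSurgeDetector:
    """Track recent hits referred by one site and flag a surge."""

    def __init__(self, needle="slashdot.org", threshold=100, window=60.0):
        self.needle = needle          # substring to look for in the referrer
        self.threshold = threshold    # hits within the window that count as a surge
        self.window = window          # window length in seconds
        self.hits = deque()           # timestamps of matching requests

    def record(self, referrer, now=None):
        """Note one request; return True while low-bandwidth mode should be on."""
        now = time.monotonic() if now is None else now
        if referrer and self.needle in referrer.lower():
            self.hits.append(now)
        # Forget hits that have aged out of the window.
        while self.hits and now - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.threshold
```

A request handler would call `record()` with each request's Referer value and serve the lightweight version of the page whenever it returns True; the mode switches itself off once the referred traffic drops back under the threshold.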
Give me my freedom, and I'll take care of my own security, thank you.
Cute, except your choice of Java opens you up to a trivial DOS.
Just start opening exceptionally deep URLs in parallel. Thanks to Java's low-powered IO system, you can suck up tens of thousands of threads that way, clogging your scheduler.
This would be better implemented as a second server, listening on a different port, written in a language that lets you write event-based state-machine IO. This will cut your memory usage per tar-pitted connection down by 20-30 TIMES, and won't put stress on your scheduler.
NOTE: hacks like Weblogic's native performance pack won't help you here, since you sleep.
i pee chains
that would hurt!
that would hurt even more!
I always enter postmaster@warez.slashdot.org in spamforms
I switched to mozilla because of slashdot too, but it was to get rid of the BFAs.
What if a specific page was actually a script that would forever generate fake e-mail addresses?
But you can do better than that - Give them FQDNs that resolve to Open Relay sites, and use Round-Robin DNS if you can. If you've got your own domain, you can spare plenty of FQDNs, like mail2.mydomain.com.
Depending on how you set up the round-robin, and where the relay machines get their DNS resolution done, you may be able to make them run in a tight little loop around the Korean broadband, or burn expensive international bandwidth between China and Sweden.
Or you could give them random names at various spammer and spamhaus sites, or FQDNs that resolve to the addresses of spammers or spamhausen, or remove-me addresses of other spammers. They may filter out their own, and don't give them obvious addresses like abuse@ or postmaster@, but surely they won't recognize most of them, especially the latest Corrupt Nigerian Official trying to launder embezzled money.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
they would come straight into the middle of my website, at a specific page rather than the root. This means that the spambots obviously had some kind of database of pages, which had presumably been built up from previous visits
That's possible I guess, but mayhap a search engine done told em where to go.
the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests.
Or it's processing...
I guess I'll actually finish reading it all now.. sorry... pet peeve, distracted, must stop picking
If you're not messing with DNS, though, there are lots of addresses that can cause trouble:
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
What? Code some thing that works better than, but not well, with any thing Microsoft. That whole native format thing. Silly isn't it.
*SRU
And somewhere out there is a far nastier variant on a teergrube that can keep a typical smtp session up for hours with only a few kilobits/minute, using tricks like setting TCP windows very small, NAKing lots of packets so TCP retransmits them, etc. (It basically works by saying "No, SMTP/TCP/IP isn't a set of protocol drivers in my Linux kernel, it's a definition of a set of messages and there's no reason I should use a bunch of well-tuned efficient reliable kernel routines when I can send raw IP packets myself designed for maximal ugliness.")
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
There are a whole bunch of script kiddies who wouldn't like to get any spam and might break somebody's electronic kneecaps if they get too annoyed. You wouldn't wanna do that, it'd be rude. Don't bug Emmanuel Goldstein himself, and he and many of the other people there are good guys, and surely if you've got any pretenses to being 31337 you can go hang out on the hacker irc channels and find Usual Suspekts :-)
Spambots tend to avoid web*|admin*|root|support|help@somedotDOTorg
This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.
Golly gee, let's see here. Ways to thwart the spambots.
You can URL-encode and un-mailto your address.
But spambots can still read most plaintext email addresses from the text itself...
Then encode your email address into a piece of javascript.
But many normal users don't have javascript turned on...
Then write your email address into a GIF or PNG.
But certain types of disabled people and lynx users won't be able to view those images...
This author would argue that those two are one and the same. But still, you can also obfuscate your address for the user to figure out, providing directions on how to unobfuscate it. (NOSPAM.bob@NOSPAM.hoser.com)
But there are many users who are too dumb to unobfuscate the address...
Then write a web page with a form for sending the message... the email address remains hidden.
But this is insecure / stupid / not fully supported by Mosaic 0.13beta...
Then whoever can't use one of the above methods can go sod off. I plan to use most of these, grouped together into one contact.html page on my personal web site. If there are a couple of users in the world out of thousands who can't contact me due to technical or mental limitations, then dang them to heck for all I care.
You see, it's a balancing act of preferences. Would you prefer to let (literally) a couple users slip through the cracks, or would you rather get bombed by potentially hundreds of spambots? Your choice...
Me again.
:)
HTML ate some code. If using this, change those weird ifs to this:
if(a < 3) { email = name; } else
if(a < 4) { email = last; } else
if(a < 5) { email = name.charAt(0) + "." + last; } else
if(a < 6) { email = name + "." + last; } else
if(a < 7) { email = name + "." + last.charAt(0); } else
if(a < 8) { email = last + "." + name; } else
if(a < 9) { email = last + name; } else
if(a < 10) { email = name + last; } else
if(a < 11) { email = name.charAt(0) + last; } else
if(a < 12) { email = last + name.charAt(0); } else
if(a < 13) { email = name + last.charAt(0); } else
if(a < 14) { email = last.charAt(0) + name; } else
if(a < 15) { email = last + name.charAt(0); }
email = email + "@" + endn;
Sorry.
boky
Ufff...
It's not my day; change those weird if(a...) to if(a<...), of course.
Thank you.
boky
next time, post it in MIME or uuencode.
` `=&5S=%]U= 65N8V]D92YT7 @\657U@ =R:#5EQFE*6)_>VU"4 8..]1_/.VU;*OR2, ZJCPDEG9:":'D'6:F=X*S;1UP=C2@$SK3F@OW1"A\A@KE/\R"` )!Y'102P4&``````$`
for example:
_=_
_=_ Part 001 of 001 of file test.zip
_=_
begin 666 test.zip
M4$L#!!0````(`*VIC2P%90R7\````#H!```1``
M>'1-C$UK@T`41?>"_V$J+Z)5,I!%*9U.DW5
M:'EPN9Q[>`?!#79JI#1TMG$4Q7%:/
,`0`_````'P$`````
`
end
(The original is off of http://perl.plover.com/obfuscated/)
Try to get it past the lameness filter:
@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
Hey! It worked. wow.
Why don't you try using the PREVIEW button next time?
Jerkoff.
Good idea for ISP's that stick to the abuse@ standard, but not much good for ISP's like Clueless & Witless, who ignore abuse@ and who use spamcomplaints@ instead. You could use a script to query rfc-ignorant.org for the right abuse@ address, but that would waste CPU and bandwidth. In any case, most spambots will ignore addresses ending in .gov and .mil, and a lot will not follow links onto .cgi pages, so I use Wpoison.cgi as a "virtual" inside a php page on my site.
Since we're discussing filtering and dropping packets into the ether.. whatever happened to Blackhole (ing) software?
The last I heard (years ago) was that a company was actively using it, and being sued by other companies for blocking their mail.
Anyone?
You have a sick, twisted mind. Please subscribe me to your newsletter.