Captain+Large+Face · Slashdot Mirror

Suicidal on Stopping Spambots: A Spambot Trap · 2002-04-12 01:39 · Score: 0, Redundant

Wow, this guy slashdotted himself..

Stopping Spambots: A Spambot Trap

Using Linux, Apache, mod_perl, Perl, MySQL, ipchains and Embperl

This document describes my experiences with spambots on my websites, and the techniques I have developed to stop them dead. I assume the reader has basic familiarity with Linux, Apache, mod_perl, Perl, MySQL and firewall rules using ipchains - each of these topics could fill a book, so I won't talk about installation or basic configuration. I will, however, provide full scripts and instructions on using these within the context of these tools. If you'd like some basic pointers on getting set up using these tools, then you could take a look at my short series of three Linux Network Howto articles.

Contents

The Problem: Spambots Ate My Website
Overview of the Spambot Trap
Banishing 'mailto:'
MySQL
BlockAgent.pm
ipchains
badhosts_loop
spambot_trap/ Directory
robots.txt
Your HTML Files
- Embperl
httpd.conf
Monitoring
Conclusions
- Strengths
- Weaknesses
- Possible future enhancements

The Problem: Spambots Ate My Website

Spambot: (noun) - A software program that browses websites looking for email addresses, which it then "harvests" and collects into large lists. These lists are then either used directly for marketing purposes, or else sold, often in the form of CD-ROMs packed with millions of addresses. To add insult to injury, you may receive a spam email which is asking you to buy one of these lists yourself. Spambots (and spam) are a pestilence which needs to be stamped out wherever it is found.

I have a website, http://www.crazyguyonabike.com, which has bicycle tour journals, message boards and guestbooks. I started noticing around the end of 2001 that the site was getting hit a lot by spambots. You can spot this sort of activity by looking for very rapid surfing, strange request patterns, and non-browser User-Agents.

Another distinctive behavior was that the spambots would follow only those links which had certain keywords which would seem promising if you're looking for email addresses: "guestbook", "journal", "message", "post" and so on. On each of the pages in my site there were many other links in the navbars, but only links with these keywords were being followed. Also, robots.txt was never even being read, let alone followed. Moreover, the bot would come in, scan pages rapidly for maybe a few seconds, and then stop for a while. So it was obviously making at least some attempt to circumvent blocks based on frequency/quantity of requests.

This was very annoying. For one thing, these things were picking off email addresses from my website (at that point, I was letting people who posted on my message boards decide for themselves whether they wanted their email addresses to be visible or not). But quite apart from that, it was taking up resources, and was just plain rude. I hate spam. I resent my webserver having to play host to people whose obvious goal is to cynically exploit the co-operative protocols of the internet to their own selfish, antisocial gain. So, I decided to do something about it.

The first thing I did was to look at the User-Agent fields which were being used by the bots. There were a variety, including variations on the following:

DSurf15a 01
PSurf15a VA
SSurf15a 11
DBrowse 1.4b
PBrowse 1.4b
UJTBYFWGYA (and other strings of random capital letters)

I searched the internet for references to these strings, but all I found was a slew of website statistics analysis logs. This meant that these particular spambots obviously got around. It was also discouraging, because there was no mention anywhere of what these things actually were. I was surprised that there seemed to be no discussion whatsoever of something that seemed to be pandemic. Then I found a couple of other websites with guestbooks that had actually been defiled by these spambots: (if you follow these links and you don't see a lot of empty messages left by the above user agents, then that means the webmaster of the site has finally found a way to stop it, so good for them...)

http://www.virtualglasgow.com/guestbook.html
http://www.donotenter.com/guestbook/gbook.html

I reckon the spambots didn't really intend to leave empty messages. They just tend to want to follow links with the keyword 'post'. So if the guestbook posting form has no preview or confirmation page, then the spambot would leave a message simply by following this link! My guestbooks and message boards have a preview page, which is probably why I hadn't had any of this.

Anyway, I started thinking about what kind of program this thing was. First of all, it comes from all kinds of different IP addresses. I couldn't quite believe that this many different IP addresses were all intentionally using the same software, of which I could find absolutely no mention anywhere on the Web. This made me think it might be some kind of virus/trojan/worm or whatever that silently installed itself on people's computers, and then used the CPU and bandwidth to surf the Web without the owner being aware of it. I thought that if this was the case, then it must be sending the results somewhere - and if we could find out where, then we could go about shutting the operation down. But I have had no luck at all in getting any help from the sysadmins at the ISPs I have contacted. A typical exchange was the one with a guy at Cox internet, which was where a persistent offending IP address was sourced. He just couldn't be bothered, and eventually told me that spidering was not against the law, or their terms of service. I asked whether actions which were blatantly geared toward the generation of spam were against their terms of use, but he never replied to that. I had no more luck anywhere else: nobody had heard of this thing. I even sent an email to CERT, but got no response. So, I turned instead to thinking about how I could erase these pests from my life as much as possible. This document is about my quest to stop spambots (not just this one, but ALL spambots) from abusing my website. Hopefully it will be useful to you.

Overview of the Spambot Trap

There are three main parts to the technique which I outline here:

Banish visible email addresses from your websites altogether, or else obfuscate them so they can't be harvested. Examples of how to do this are given. This is your fail-safe, in case the spambots figure out a way around your other defences. Even if they manage to cruise your website on their very best behavior, they still should not be able to harvest email addresses!
Block known spambots: Certain User-Agents are just known to be bad, so there's no reason to let them come on your site at all. True, spambots could in theory spoof the User-Agent, but the simple reality is that a lot of them don't. We use an enhanced version of the BlockAgent.pm module from the O'Reilly mod_perl book. This extension adds offending IP addresses to a MySQL (or other relational) database, which is picked up by the third part of our cunning system...
Set a Spambot Trap, which blocks hosts based on behavior. We set a trap for spambots, which normal users with browsers and well-behaved spiders should not fall into. If the bot falls in the trap, then its IP address is quickly blocked from all further connections to the webserver.
This works using a persistent, looping Perl script called badhosts_loop, which checks every few seconds for additions to a 'badhosts' database. This script then adds a 'DENY' rule for each bad host to the ipchains firewall. Blocks have an expiry, which is initially set to one day. If a host falls in the trap again after the block expires, then that IP is blocked again - and the expiration time is doubled to 2 days. And so on. This algorithm ensures that the worst offenders get progressively more blocked, while one-time offenders don't stick around in our firewall rules eating up resources.
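The doubling rule is simple enough to sketch in a few lines. This is a Python illustration rather than the Perl used by the actual scripts, and the function name next_block and its arguments are assumptions made up for the example:

```python
from datetime import datetime, timedelta

def next_block(previous_expire_days, now):
    """Return (expire_days, expiry) for a host that just fell into the trap.

    previous_expire_days is None for a first offence, otherwise the length
    in days of the host's most recent block.
    """
    expire_days = 1 if previous_expire_days is None else previous_expire_days * 2
    return expire_days, now + timedelta(days=expire_days)

now = datetime(2002, 4, 11, 11, 16, 11)
first = next_block(None, now)   # first offence: one day
third = next_block(2, now)      # previous block was 2 days, so this one is 4
```

A host that keeps coming back is blocked for 1, 2, 4, 8 days and so on, while a one-time offender drops out of the firewall after a single day.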

There are various components to the Spambot Trap, including the badhosts_loop Perl script, the BlockAgent.pm module, ipchains config, MySQL database, httpd.conf, robots.txt, and your HTML files. These are all covered in the sections below.

Banishing 'mailto:'

The first and most urgent thing you need to do is to get email addresses off your website altogether. This means, unfortunately, banishing the venerable mailto: link. It's a real shame that perfectly good mechanisms should be removed because of abuse, but that's just the way the world is these days. You need to be defensive, and assume that the spammers will try to take advantage of your resources as much as possible.

It's an arms race

The important thing that you need to realize is that no matter what blocks we put in place, this game is an arms race. Eventually the spambot writers will develop smarter bots which circumvent our techniques. Therefore you want to have a failsafe, which will prevent email addresses from getting into the hands of the spambot even if all else fails. The only real way to do that is to completely remove all email addresses from your website.

Contact forms

You should replace the mailto: links with links to a special form where people can type their name, email address and message. A CGI can then deliver the email, and your email address never has to be disclosed. There are a number of different mailer scripts out there - just be careful to check for vulnerabilities which could allow malicious users to use the form to send email to third parties (i.e. spam, ironically enough) using your server. The formmail script is popular, but an earlier version had such a vulnerability (since fixed). The Embperl package has a simple MailFormTo command to send an email from a form.

Since I have seen guestbooks out there which have been extensively defiled by spambots, I would add that you should have a preview screen on your contact forms. This will ensure that an email doesn't get fired off simply by a spambot following the 'post' or 'contact' link (which it will likely try to do).
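To make the third-party relay problem concrete, here is a minimal Python sketch of the delivery step of a contact form. It is not formmail or Embperl's MailFormTo; the address webmaster@example.com and the function name build_contact_email are placeholders invented for the example. The two points it tries to show are that the recipient is fixed on the server and never read from the form, and that line breaks in header fields are refused so a visitor can't smuggle in extra recipients:

```python
from email.message import EmailMessage

# Hypothetical address; it lives only on the server and never comes from the form.
SITE_OWNER = "webmaster@example.com"

def build_contact_email(name, sender, message):
    """Build the message for a contact form, refusing header injection."""
    for field in (name, sender):
        # A line break in a header field is how extra headers get injected.
        if "\r" in field or "\n" in field:
            raise ValueError("line breaks are not allowed in header fields")
    msg = EmailMessage()
    msg["To"] = SITE_OWNER
    msg["From"] = SITE_OWNER
    msg["Reply-To"] = sender
    msg["Subject"] = "Contact form message from " + name
    msg.set_content(message)
    return msg
```

The message would only be built and sent after the visitor has confirmed it on the preview screen, so a bot that merely follows the 'contact' link never triggers a delivery.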

Alternatives to totally banishing mailto:

There are alternatives to completely removing email addresses, but they all depend on the stupidity of the spambot, and so could be compromised by a new generation of pest. These include:

Write out email addresses in a non-email format, e.g. instead of writing 'username@domain.com' you would write 'username at domain dot com', or something similar. It would only take some spambot with a little more intelligence to be able to scan these patterns and pick up "likely" addresses, so this strategy is a little risky. Any consistent method you choose to write out email addresses could in theory be analyzed and decoded by a savvy bot.
Add stuff to the email address to make it invalid, but so that a human could easily know what to do to make it work. An example of this is writing 'username@_NO_SPAM_domain.com'. You need to remove the "_NO_SPAM_" part to make the email address valid. You can have some kind of explanation to make it clear what people have to do to use the address. Personally, I don't like this - you're depending on a level of sophistication on the part of your users which is risky. In my experience, there are a lot of very 'novice' level users out there, who only know how to click on a link. They don't know how to edit an email address. Heck, I've had people come to my site by typing the URL into Google, rather than the 'Location' box of their browser. Also, people don't read instructions.
Make graphics images which contain the email address. Spambots usually don't download graphics, and even if they did, they probably couldn't decode the bits to get the text. However, they could do it in theory, since software for doing OCR (optical character recognition, getting text from scanned documents) has been around for a while. A downside to this approach is that the user has to manually copy down the email address, since it can't be cut'n'pasted. Also, you can't put a mailto: link on the image, otherwise you're back to square one. But you could put a link to a contact form, with an argument in the link telling your server internally what email address to use. For example, the link could say "contact.cgi?to=23", where '23' is some database key to the actual email address. But the downside here is that you still need to generate the image, which is a bit of a pain in the ass if you have a lot of them. You can do it automatically, if you're willing to put the work in and write the scripts. There are some very nice graphics generation packages out there on CPAN for Perl.
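Two of the ideas above, the spelled-out address and the "contact.cgi?to=23" lookup, can be sketched in Python as follows. The names spell_out, RECIPIENTS and recipient_for are hypothetical, and the dictionary stands in for whatever database table would really hold the addresses:

```python
def spell_out(address):
    """Rewrite 'user@domain.com' as 'user at domain dot com'."""
    local, _, domain = address.partition("@")
    return local + " at " + domain.replace(".", " dot ")

# Hypothetical server-side table mapping contact-form keys to real addresses.
RECIPIENTS = {23: "username@domain.com"}

def recipient_for(key):
    """Resolve the 'to' argument of contact.cgi?to=23 without exposing the address."""
    return RECIPIENTS.get(int(key))
```

The lookup keeps the real address entirely on the server, which is why it holds up better than any rewriting scheme a smarter bot could learn to decode.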

MySQL

Download badhosts MySQL database dump

We need to set up a MySQL database, where we store records of the hosts which are to be blocked. This doesn't have to be MySQL, but I use it because it's extremely fast, and very appropriate for this kind of application. You need to create a new database, called 'badhosts'. You then create a table, again called 'badhosts', with the following structure:

Field        Type                           Comment
ip_address   varchar(20) not null, indexed  The IP address of the host to be blocked
user_agent   varchar(255) not null          The HTTP User-Agent of the spambot, for reference
expire_days  int unsigned not null          How many days is this block for. Doubled every time a new block has to be created for a particular IP address
created      datetime not null              When this block was created
expiry       datetime not null, indexed     When this block expires

You could use the dump provided above to load directly into your database:

shell> mysqladmin create badhosts
shell> mysql badhosts < badhosts.dump

That's about it! The fields which are marked as 'indexed' are the only ones which need indexes, because they are searched on to see if a particular IP address has been previously blocked, and also to see which blocks should be removed because they've expired. If you have access privileges set on your MySQL databases, then you need to allow the Apache user (usually 'nobody') access. The other script that will require access is badhosts_loop, which runs as root.
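For readers who want to poke at the table layout without a MySQL server, here is a rough Python sketch that uses the standard sqlite3 module as a stand-in. It is not the contents of badhosts.dump; the index names and the active_blocks helper are made up for the example, and the datetime columns are declared as TEXT because SQLite has no native datetime type:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE badhosts (
    ip_address  VARCHAR(20)  NOT NULL,
    user_agent  VARCHAR(255) NOT NULL,
    expire_days INTEGER      NOT NULL,
    created     TEXT         NOT NULL,  -- 'YYYY-MM-DD HH:MM:SS'
    expiry      TEXT         NOT NULL   -- 'YYYY-MM-DD HH:MM:SS'
);
CREATE INDEX idx_badhosts_ip ON badhosts (ip_address);
CREATE INDEX idx_badhosts_expiry ON badhosts (expiry);
""")

# One row, using the values from the article's own log example.
conn.execute(
    "INSERT INTO badhosts VALUES (?, ?, ?, ?, ?)",
    ("63.148.99.247", "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)",
     1, "2002-04-11 11:16:11", "2002-04-12 11:16:11"),
)

def active_blocks(conn, now):
    """Return the IP addresses whose block has not yet expired at 'now'."""
    rows = conn.execute(
        "SELECT ip_address FROM badhosts WHERE expiry > ? ORDER BY ip_address",
        (now,),
    )
    return [row[0] for row in rows]
```

The two indexed columns are exactly the ones the queries filter on: ip_address when checking for an earlier block, expiry when working out which blocks are still live.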

Next, we look at the script that populates this database.

BlockAgent.pm

Download BlockAgent.pm

Download bad_agents.txt

The BlockAgent.pm Apache/mod_perl module is taken from the excellent book "Writing Apache Modules with Perl and C" by Lincoln Stein & Doug MacEachern (O'Reilly). This script basically acts as an Apache access control module which checks the HTTP User-Agent header against a list of known bad agents. If there's a match, then a 403 'Forbidden' code is returned. The script compiles and caches a list of subroutines for doing the matches, and automatically detects when the 'bad_agents.txt' file has changed. I have found that it has no noticeable impact on the performance of the webserver. This script is useful in the case where you know for certain that a certain User-Agent is bad; there's no point in letting it go anywhere on your site, so it's a good first line of defense. We'll cover how to add this module to your website a little later, along with the rest of the configuration settings in the section on httpd.conf.

Of course, one of the first arguments you'll see with regard to this method of blocking spambots is that it's easy to circumvent, by simply passing in a User-Agent string which is identical to the major browsers out there. This is perfectly true, but don't ask me why the spambot writers haven't done this - maybe it's a question of pride or ego, they want to see their baby out there on record in Web server logs. I honestly don't know. The main point is that at present, the User-Agent header CAN be used very effectively to block most bad agents. But, I have added more features so that we can also block agents which look ok, but behave badly by going somewhere they shouldn't - the Spambot Trap. More on that soon.

You'll notice that the bad_agents.txt file which I have supplied here is very comprehensive. A good strategy here is probably to save the full version somewhere (perhaps as bad_agents.txt.all), and just keep the ones you actually encounter in the bad_agents.txt file. Then you keep the list shorter, and more relevant to what actually hits you. For example, my bad_agents.txt file currently has the following lines in it, because these are the spambots that I see most frequently:

[A-Z]+$
.Browse\s
.Eval
EO Browse
.Surf
Microsoft.URL
^Mozilla\/3.0.+Indy Library
Zeus.*Webster

You'll notice from this that BlockAgent.pm is very flexible, being able to take full advantage of the excellent regular expression capabilities of Perl. This means you can capture a lot of different agents with just one line. For example, the very first line catches all the variations of the agent which passes in random strings of capital letters, e.g. FHASFJDDJKHG or UYTWHJVJ. The spambot obviously thinks it's being pretty smart by looking different each time, but by using an easily identifiable pattern, it shoots itself in the foot. Hah.
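As a rough illustration of how such a pattern list gets applied, here is a Python sketch using the lines shown above. It is not BlockAgent.pm itself: the function name is_bad_agent is made up, and matching each pattern case-sensitively anywhere in the string is an assumption about the module's behavior, not something taken from its source:

```python
import re

# The pattern lines listed above, compiled once up front.
BAD_AGENT_PATTERNS = [re.compile(p) for p in (
    r"[A-Z]+$",
    r".Browse\s",
    r".Eval",
    r"EO Browse",
    r".Surf",
    r"Microsoft.URL",
    r"^Mozilla\/3.0.+Indy Library",
    r"Zeus.*Webster",
)]

def is_bad_agent(user_agent):
    """Return True if any pattern matches somewhere in the User-Agent string."""
    return any(p.search(user_agent) for p in BAD_AGENT_PATTERNS)
```

With these patterns, "UJTBYFWGYA" and "DSurf15a 01" are rejected, while an ordinary browser string such as "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" passes through.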

The original version of the BlockAgent.pm script is well explained in the O'Reilly book, but I've added an extra hook that checks to see whether the client is accessing any of the spambot trap directories. If it is, then we add an entry to the MySQL database (you could use another relational database if you want, as long as it's accessible from Perl DBI).

The first time an IP address is blocked, an expiry of one day is set. If the same host subsequently comes in and falls into the trap again, then the expiry time is doubled. And so on. This way, the block gets longer and longer, in proportion to how persistently the spambot revisits our website. Once the IP address is blocked, the spambot can't even connect to our web server, since we use 'DENY' in the ipchains rule. This means that no acknowledgement is given to any packets coming in from the bad host, and as far as they know, our server has just gone away. Hopefully, after this happens for long enough, our server will be taken off the spambot's "visit" list. Another nice little side-effect of this is that the spambot will probably have to wait for a while before giving up each connection attempt. Anything that makes them waste more time is ok by me!

BlockAgent.pm notifies the badhosts_loop script that something has happened by touching a file called /tmp/badhosts.new. The badhosts_loop script checks this file every few seconds, and if it has changed then it knows that a new record has been added to the database and that it needs to re-generate the blocks list.

The BlockAgent.pm script is our alarm system. It's what tells us that something happened. In order to act on this information, we need to be able to add rules to the ipchains firewall. We'll cover this next.

ipchains

Download sample ipchains config file

The ipchains module (here's the HOWTO doc) is a very nice way of providing a good level of basic network security to your server. If you haven't already set it up (or its successor, iptables), then you really should. It's a very easy way to configure who can and cannot have access to your machine. A good resource for learning about this is "Building Linux and OpenBSD Firewalls", by Wes Sonnenreich and Tom Yates (Wiley). This is where I learned about ipchains, and it's on their excellent explanations and examples that I based my own config file. Another is "Linux Firewalls" by Ziegler (New Riders), which seems to have a more recent 2nd edition that covers iptables too.

The example ipchains config file given here is complete, but the bit which is most important to us is that we create a chain called 'blocks'. This is our own custom chain, which we can then add rules to. The badhosts_loop script will flush this chain and build it back up whenever a spambot falls in your trap. Once the spambot's IP address is on the blocks list, that host cannot connect to your server at all.
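The flush-and-rebuild step might look something like the following Python sketch. It only builds the command lines and does not run them; the function name blocks_chain_commands is made up, and it assumes the usual ipchains options (-F to flush a chain, -A to append a rule, -s for the source address, -j for the target), which may not be exactly what the real badhosts_loop script issues:

```python
def blocks_chain_commands(ip_addresses, chain="blocks"):
    """Return the ipchains invocations that flush and refill the custom chain."""
    # Start from an empty chain so expired blocks disappear automatically.
    commands = [["ipchains", "-F", chain]]
    for ip in ip_addresses:
        # One DENY rule per currently blocked source address.
        commands.append(["ipchains", "-A", chain, "-s", ip, "-j", "DENY"])
    return commands
```

Rebuilding the whole chain each time keeps the firewall in step with the database without having to track which individual rules to delete.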

Remember to restart ipchains after you've changed the config file. Next, we'll look at the script that actually adds the firewall rules.

badhosts_loop

Download badhosts_loop script

You run this script in the background, as root. It has to be run as root, because only root has the ability to add rules to the firewall. The script spends most of its time sleeping. It wakes up every five seconds or so and does a quick check on /tmp/badhosts.new. If this file has been changed since the last time it looked, then it goes and re-generates the firewall blocks list with all the current (non-expired) blocks. If nothing else happens, then the script will automatically do this at least once a day, to ensure that blocks really do expire even if there is no new activity.
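The wake-up check can be sketched in Python like this. The function names notify and needs_rebuild are invented for the example, and the flag file path is passed in rather than hard-coded; the real script is Perl and may track its state differently:

```python
import os
from pathlib import Path

DAY_SECONDS = 24 * 60 * 60

def notify(flag_path):
    """What the trap handler does: bump the flag file's modification time."""
    Path(flag_path).touch()

def needs_rebuild(flag_path, last_seen_mtime, last_rebuild, now):
    """Decide whether the loop should regenerate the blocks list on this wake-up.

    Returns (rebuild, mtime) so the caller can remember the mtime it saw.
    """
    try:
        mtime = os.stat(flag_path).st_mtime
    except FileNotFoundError:
        mtime = last_seen_mtime  # no flag file yet, so nothing new was reported
    changed = mtime != last_seen_mtime
    overdue = now - last_rebuild >= DAY_SECONDS  # the once-a-day safety net
    return changed or overdue, mtime
```

The loop itself would call needs_rebuild, regenerate the blocks if told to, then sleep for five seconds or so and repeat.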

You should probably add the following line to your /etc/rc.local file (or equivalent), so that the script is automatically started up on reboot:

/path/to/badhosts_loop --loop &

This will start the script looping in the background. The script automatically checks to see if it is already running, by attempting to lock /var/lock/badhosts_loop.lock. If the file is already locked then the script will exit with an error message. If you want to just run the script once, without looping, then just omit the '--loop' option. This can be useful for testing.
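The "already running" check can be illustrated with a non-blocking file lock. This is a Python sketch for Unix systems using fcntl.flock; the function name acquire_lock is made up, and whether the real Perl script uses flock or some other locking call is an assumption:

```python
import fcntl

def acquire_lock(lock_path):
    """Return an open, exclusively locked file, or None if the lock is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

The caller keeps the returned file object open for as long as it runs; the lock goes away when the file is closed or the process exits, so a crashed script never leaves a stale lock behind.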

Logging is done to /var/log/badhosts_loop.log by default. Every time the script generates the blocks list, it writes a list of all the blocks to the log. This is a good place to monitor if you're interested in what hosts are being blocked. Here's an example of the log output:

EDITOR: SNIPPED

Thu Apr 11 16:09:07 2002: Flushing blocks chain: Generating blocks list:

Adding 63.148.99.247 (1) 2002-04-11 11:16:11 to 2002-04-12 11:16:11 Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

The log shows the IP address which is being added, then (in brackets) the number of days the block is effective for (doubling each time), then the start and end dates of this block, and finally the name of the User-Agent which committed the crime. This can be useful for quickly seeing whether you need to add a new one to the bad_agents.txt file.
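If you want to process the log rather than just read it, the "Adding" lines are regular enough to pick apart with a regular expression. This Python sketch is based only on the single example line shown above, so the exact format is an assumption, and parse_block_line is a made-up name:

```python
import re

LOG_LINE = re.compile(
    r"^Adding (?P<ip>\S+) \((?P<days>\d+)\) "
    r"(?P<start>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) to "
    r"(?P<end>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<agent>.*)$"
)

def parse_block_line(line):
    """Return a dict with ip, days, start, end and agent, or None for other lines."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["days"] = int(record["days"])
    return record
```

Collecting the agent field from these records gives a quick list of User-Agents that are candidates for bad_agents.txt.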

This is a pretty stable script that should just sit there and chug quietly, not taking up much in the way of resources. Checking for a file being changed every five seconds is not a big deal in Unix, so you shouldn't even notice it.

Now you have to create the trap itself - the spambot_trap directory.

spambot_trap/ Directory

Download gzipped tarball of sample spambot_trap directory

View the sample directory

You can create this directory anywhere on your server. We will create an alias in httpd.conf to access it. I put mine in /www/spambot_trap/. The point is, this doesn't have to be a real directory under your webserver directory root. If you use the Alias directive, then multiple websites can access the same spambot_trap directory, potentially through different aliases. You can use the sample tarball as a starting point; it has subdirectories and links which the spambots I have seen find irresistible. You should create your own image file for the unblock_email.gif file, to have a valid email address of your own.

The spambot_trap and spambot_trap/guestbook/ directories are not used directly to spring the trap. This is because I wanted to have a warning level, a lead-in, where real users would be able to realize they are getting into dangerous waters and could then back out. You're going to be placing hard-to-click links on your web pages which lead into the real trap, and there's always a chance that a real user will accidentally click on one of these. So, some of the links will point into the warning level. I have made a GIF image which contains a warning text. Why an image? Mainly because spambots can't understand images, and I didn't want to give big clues like "WARNING!!! DO NOT ENTER" in plain text. So, the user sees the warning, the spambots don't. If the spambot proceeds into any of the subdirectories (email, contact, post, message), then the trap is sprung and the host is blocked.
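The two-level layout boils down to a simple rule on the request path, sketched here in Python. The function name classify is hypothetical, and treating any path component named email, contact, post or message as the trap proper is an interpretation of the description above, not code from BlockAgent.pm:

```python
TRAP_SUBDIRS = {"email", "contact", "post", "message"}

def classify(path, trap_dir="spambot_trap"):
    """Return 'trap', 'warning' or 'normal' for a request path."""
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != trap_dir:
        return "normal"
    if any(p in TRAP_SUBDIRS for p in parts[1:]):
        return "trap"      # host gets blocked
    return "warning"       # visitor only sees the warning image
```

So /spambot_trap/guestbook/ only shows the warning, while /spambot_trap/guestbook/post/ springs the trap.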

You also need to try to stop good spiders (e.g. google) from falling into the spambot trap and being blocked. To do this, we utilize the robots.txt file.

robots.txt

Download sample robots.txt

This should allow good robots (such as google) to surf your site without falling into the spambot trap. Most bad spambots don't even check the robots.txt file, so this is mainly for protection of the good bots.

You'll see that we list a bunch of directories under '/squirrel'. This could be anything; you'll set an alias later in httpd.conf. In fact, you may even want this to be dynamically generated (see later, under Embperl), so that you can quickly change the name of the spambot trap directory if the spambots adapt and start avoiding it. At present, a static setup should work just fine, however.
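Here is a small Python sketch of the idea, using the standard urllib.robotparser module to check what a well-behaved robot would conclude. The generated file is a simplification (a single Disallow line for the whole trap directory rather than the list of subdirectories in the sample), and robots_txt and www.example.com are made up for the example:

```python
from urllib.robotparser import RobotFileParser

def robots_txt(trap_dir):
    """Build a minimal robots.txt that keeps all robots out of the trap directory."""
    return "User-agent: *\nDisallow: /" + trap_dir + "/\n"

# Parse it the way a polite spider would.
parser = RobotFileParser()
parser.parse(robots_txt("squirrel").splitlines())

trap_allowed = parser.can_fetch(
    "Googlebot", "http://www.example.com/squirrel/guestbook/post/")
site_allowed = parser.can_fetch(
    "Googlebot", "http://www.example.com/journal/")
```

Because the directory name is just a parameter, switching from 'squirrel' to something else means regenerating one small file.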

Next, we need to look at the bait - links within your HTML files which lead the spambot into the trap.

Your HTML Files

Download sample HTML code

Download sample transparent 1 pixel image for hiding the trap

Here's an example of HTML with links into the spambot trap:

<HTML>
<BODY BGCOLOR="beige">
<A HREF="/squirrel/guestbook/message/"></A>
<A HREF="/squirrel/guestbook/post/"><IMG SRC="/guestbook.gif" WIDTH=1 HEIGHT=1 BORDER=0></A>
Body of the page here
<TABLE WIDTH=100%>
<TR>
<TD ALIGN=RIGHT>
<A HREF="/squirrel/guestbook/">
<SMALL><FONT COLOR="beige">guestbook</FONT></SMALL>
</A></TD>
</TR>
</TABLE>
</BODY>
</HTML>

Spambots tend to be stupid. You'd think they would check for empty links (which don't show up in a real browser), but they don't seem to. Sure, they may get smarter, but meantime you might as well pick the low hanging fruit. So, the very first thing in the body of your HTML should be an empty link which goes straight into the trap proper - not the warning level, but the actual trap itself. This is because there is no way for someone using a real browser to click on this link, and good spiders will ignore it anyway because it's in the robots.txt file.

We also use a one pixel big transparent GIF (a favorite web bug technique) to anchor a link to the trap, just in case the spambot is smart enough to avoid empty links. If we put this as the very first thing in the body, then it'll be pretty hard for a real user to click on, since it's only one pixel in size. But a spambot will quite happily go there!

Finally, there is an example of a non-graphic, text based link. This will be placed on the right side of the screen by the table, and the text will appear in the same color as the background (in this example, beige). The link does not go straight into the trap, but into the warning level, because with this one there is a bigger chance that real people could click on it accidentally. The link may be invisible, but it's still there, and someone could find it. So, they get to see a nice warning, and they should back off from there. But the spambot won't. By the way, we have the link going to /squirrel/guestbook/ rather than just /squirrel/ because some of the spambots seem to specifically follow links with certain keywords, e.g. 'guestbook', 'message', 'post', etc.

You can sprinkle these links all around your HTML files. I put them in every single one, since I use Embperl templates which make that sort of thing very easy.

Embperl

Download sample dynamic robots.txt using Embperl

Download sample dynamic HTML code using Embperl

The point of this is to make it easier to change the spambot trap directory without having to edit a whole bunch of files. We pass an environment variable to Perl from httpd.conf (see below), which says what the trap directory is called. We then use this in Embperl to substitute into the HTML and robots.txt files at request time. Thus if we wanted to change the name of the trap from 'squirrel' to 'badger', then we only need to change httpd.conf, restart apache, and we're done. All the links in the HTML are dynamic, as is robots.txt (see the samples above).
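The same substitution idea, shown in Python instead of Embperl purely for illustration: the trap name is read from the SPAMBOT_TRAP_DIR environment variable and dropped into the bait links. The names BAIT and render_bait and the fallback value 'squirrel' are assumptions made for this sketch:

```python
import os
from string import Template

# The two hidden links from the sample HTML, with the trap name as a placeholder.
BAIT = Template(
    '<A HREF="/$trap/guestbook/message/"></A>\n'
    '<A HREF="/$trap/guestbook/post/">'
    '<IMG SRC="/guestbook.gif" WIDTH=1 HEIGHT=1 BORDER=0></A>'
)

def render_bait(environ=os.environ):
    """Fill in the bait links using the configured trap directory name."""
    trap = environ.get("SPAMBOT_TRAP_DIR", "squirrel")
    return BAIT.substitute(trap=trap)
```

Calling render_bait({"SPAMBOT_TRAP_DIR": "badger"}) produces links under /badger/ without touching any page source.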

Now, we bring it all together in the Apache configuration file.

httpd.conf

Download sample httpd.conf directives

Download sample startup.pl script (used in httpd.conf)

You need to have mod_perl installed before you can use BlockAgent.pm. You should take a look at the sample given above, and integrate these directives into your own virtual hosts. The most important lines are:

Alias /squirrel /www/spambot_trap
PerlSetEnv SPAMBOT_TRAP_DIR squirrel

You should set the 'squirrel' name to whatever you'd like for your website; you'll then access the trap using a URL something like http://www.yourdomain.com/squirrel/guestbook/message. This will spring the trap. You also need to set up the BlockAgent.pm access handler:

PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile /www/conf/bad_agents.txt

This ensures that all accesses to your website will go through BlockAgent.pm first. You should choose your own location for the bad_agents.txt file.

Finally, you might want to install Embperl so that you can embed Perl into your HTML code (always executed on the server side, never seen on the client side):

# Set EmbPerl handler for main directory
# Handle HTML files with Embperl
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI
# Handle robots.txt with Embperl
SetHandler perl-script
PerlHandler HTML::Embperl
Options ExecCGI

That about does it. You should now have the setup which will allow you to block spambots. You'll probably be interested in monitoring what happens...

Monitoring

Download sample script for monitoring web server logs

This simple script just tails the badhosts_loop log. You'll have fun (I do) seeing what comes on your site and promptly falls into the trap, and then SPLAT. No more spambot. Heh heh heh.
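For anyone who wants to do the same from their own code, following a log comes down to remembering how far you have read. This Python sketch is not the sample monitoring script; read_new_lines is a made-up helper that returns whatever was appended since the last call:

```python
def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for whatever was appended since 'offset'."""
    with open(path, "r") as log:
        log.seek(offset)              # resume where the previous call stopped
        data = log.read()
        return data.splitlines(), log.tell()
```

Calling it in a loop with a short sleep, and feeding the returned offset back in each time, gives much the same effect as tailing the file.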

Conclusions

This setup works pretty well for me at the moment. I've no doubt there are flaws in my design, but it seems stable and is "good enough" for the time being. If you can see any improvements then I'd love to hear about them. To finish up, here's a summary of the strengths and potential weaknesses of the Spambot Trap system.

Strengths

Does not rely exclusively on the HTTP User-Agent header, but at the same time allows us to block agents which we know to be bad.
Does not rely on the spambot abusing the robots.txt file. Many spambots don't even load it. But the robots.txt file will protect "good" robots from falling into the spambot trap. So, for example, googlebot will be just fine.
The blocks happen based on behavior, rather than trusting anything the spambot tells us about itself (e.g. User-Agent). Thus we don't rely on any prior knowledge of the spambots in order to block them; an entirely new one that we've never seen before will still fall in the trap and be duly blocked.
Once a spambot is blocked, then it cannot connect to your server again at all for the duration of the block. If it tries to connect, it won't even get a 'connection refused' error, because the firewall rule just quietly drops all the packets from the bad hosts. The ipchains firewall is very effective, and more efficient at blocking hosts than anything you could put together with Apache. So, you save on server resources. If you're wondering whether the block lists might get large, I have found that with the constant expiring of one day blocks, the active block list has never been more than about 20 IP addresses at a time, out of a list (so far) of 100 distinct hosts.
The blocks initially expire after one day. This means that one-off offenders are quickly removed from the firewall rules. On the other hand, repeat offenders get progressively longer and longer blocks (doubled each time). This means that the more abusive a host is, the more it will be blocked. It also means that if a bot is coming in from multiple IP addresses (through a proxy), then each of the individual IP addresses will probably not go on to be blocked for too long. Thus you won't be blocking everyone in AOL. On the other hand, if you continue to get hit from the same network, then it's obviously a source of trouble and should be blocked. If it's a major network like AOL, which you really don't want to block, then you need to take the IP addresses and times of the abuse, and send it to the sysadmin at the ISP concerned. There's really not a lot else you can do. I haven't seen this in reality, though. In my experience, the spambots come in from all sorts of different IP addresses, and the ones that are very persistent over time are mostly static IPs from DSL and small ranges of IPs from cable modems. These are the people with the always-on, high bandwidth capabilities which are needed for large scale email harvesting.
The system uses a relational database to manage the blocks, and so it is very scalable, and potentially you could share the database between multiple servers. If any one server gets a spambot, the offending IP address can automatically also be blocked at all the other servers. Also, the fact that we don't delete expired blocks means that we can keep track of the history of the blocks, and perhaps perform analyses which would lead to more permanent ipchains blocks of entire subnets, if desired.
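To make the expiry rule above concrete, here is a minimal sketch in Python (not the Perl used by the real scripts) of how a block length could be derived from the number of earlier blocks kept in the history. The function names and the idea of passing in a simple count are assumptions for illustration only; the one-day starting length and the doubling come from the description above.

```python
from datetime import datetime, timedelta

# First offence gets a one-day block, as described in the text.
BASE_BLOCK = timedelta(days=1)


def block_duration(previous_blocks: int) -> timedelta:
    """Length of the next block for a host already blocked `previous_blocks` times.

    Hypothetical helper: the duration doubles with every earlier block,
    so 0 earlier blocks -> 1 day, 1 -> 2 days, 2 -> 4 days, and so on.
    """
    if previous_blocks < 0:
        raise ValueError("previous_blocks must not be negative")
    return BASE_BLOCK * (2 ** previous_blocks)


def next_expiry(previous_blocks: int, now: datetime) -> datetime:
    """Time at which a block created at `now` would expire."""
    return now + block_duration(previous_blocks)


def is_active(expiry: datetime, now: datetime) -> bool:
    """A block stays in the firewall rules only until its expiry time."""
    return now < expiry
```

Because expired blocks are kept in the database rather than deleted, the count of earlier blocks for a host is always available when a new block is created.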

Weaknesses

It would be possible for the spambots to get wise, and start following the robots.txt file rules. Then the spambot could in theory surf your entire site (or at least the bits allowed by robots.txt) without falling into the trap. However, this also means that you can control where the spambot goes, which is the whole point of robots.txt. If you want, you can allow Google into one part of the site, but exclude all others. Still, you should remove all email addresses from your site as the fail-safe.
It's possible that a spambot could come in through a proxy such as AOL, which means you'll be blocking multiple AOL IP addresses. This is not very nice, and I'm not sure what the solution is at the moment. All I can say is that it hasn't happened yet, and the worst offenders on my site all have static IPs. They seem to come in from cable and DSL connections mostly.
I don't know how feasible this would be, but it may be possible to conduct a "denial of service" type attack on your webserver by making many requests to the spambot trap directory from different IP addresses. I think, however, that you actually need to have those IP addresses (rather than spoofing them) in order to set up a real TCP connection with the web server. I don't know how likely this is, but it comes more under the "attack" category than spambots. If someone tries this on your site, then it's definitely something that can be pursued with legal means. It's no longer just a petty annoyance, but rather a hostile action which must be chased down. Also, the motivation is totally different - the spammers don't want to do this kind of thing. They just want their email addresses. The DDOS attacks are notoriously difficult to track, but I think in the couple of years that have passed since the first ones brought down Amazon and Yahoo!, there has been some progress made. Anyhow, I just wanted to bring the idea into the light of day. If anyone has any clues about it then I'd be glad to know.

Possible Future Enhancements

Spot large numbers of blocks occurring on a particular subnet, and automatically consolidate blocks into a single one which blocks the entire subnet (e.g. 128.123.31.0/24).
More interactive tools to allow removal of blocks.
Analysis tools which can tell us something about patterns of abuse from particular networks.
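The first enhancement in the list above could be sketched roughly as follows, in Python rather than Perl. This is only an illustration under stated assumptions: the function name, the /24 prefix length default and the threshold of three hosts are all hypothetical choices, not part of the existing system.

```python
import ipaddress
from collections import Counter


def consolidate(blocked_ips, prefix_len=24, threshold=3):
    """Group blocked IPv4 addresses by subnet (hypothetical helper).

    Returns a pair: the subnets (as strings such as '128.123.31.0/24')
    containing at least `threshold` distinct blocked hosts, and the
    addresses that still need individual blocks.
    """
    def subnet_of(ip):
        # strict=False lets us pass a host address and get its network back.
        return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)

    # Count distinct hosts per subnet.
    counts = Counter(subnet_of(ip) for ip in set(blocked_ips))
    busy = {net for net, n in counts.items() if n >= threshold}

    remaining = [ip for ip in blocked_ips if subnet_of(ip) not in busy]
    return [str(net) for net in sorted(busy)], remaining


if __name__ == "__main__":
    ips = ["128.123.31.5", "128.123.31.77", "128.123.31.200", "10.0.0.1"]
    subnets, singles = consolidate(ips)
    print(subnets)  # subnets that would get one consolidated block
    print(singles)  # hosts that keep their individual blocks
```

Each subnet returned would then get a single firewall rule in place of the individual host rules, while the remaining hosts keep their own blocks.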

If you can think of any more potential problems (or unrecognised strengths!) then I'd be happy to hear about them. I'd also like to hear any comments on this document.

Re:AppleScript for Google API on Google Releases Web APIs · 2002-04-12 01:06 · Score: 1

Is there a problem with the slashdot script? Spaces seem to appear in the middle of long unbroken lines. Is this an anti-lamer tactic or something?

More Advanced Features? on Google Releases Web APIs · 2002-04-12 00:53 · Score: 5, Interesting

I think I speak for most when I ask if you can have your results back in the "interesting" language sets:

Staggering Potential on Google Releases Web APIs · 2002-04-12 00:36 · Score: 5, Insightful

Whilst the potential of a regular Google search is large enough, when you consider the Google search modifiers, the potential becomes staggering. Imagine using the following features:

Business Address Lookup
File Type Specific Search (.PDF etc..) (filetype:)
Stock Quotes
Cached Links (/. Favourite) (cache:)
Similar Pages (related:)
Linked Sites (link:)
Site Specific (site:)
Maps

Does anyone happen to know if you can use the other sections of Google (e.g. news, images etc.)?

Is Google the best company ever or what?!

Re:Truth in Advertising on Another Go At Making Spam Cost Money · 2002-04-09 21:01 · Score: 2

I think this URL was actually featured in the original WebPagesThatSuck book, as an example of extremely bad naming of domains...

Re:thank-god for archive.org on The Periodic Table of Comic Book Elements · 2002-04-08 21:05 · Score: 1

Some good old fashioned Karma whoring...

Archive.org

Interesting Question! on Review: BZFlag 3D Tank Game · 2002-04-05 01:31 · Score: 3, Funny

Have you ever been walking down the street, minding your own business, and suddenly look down to find something you hadn't expected?

Yes, an open manhole cover. But I'm feeling much better now.

Re:Disturbing on First Human Clone Eight Weeks Along · 2002-04-04 23:21 · Score: 1

The DNS used for the sheep was adult dna

Compulsive typing?

Thanks for clearing up the reasoning!

Disturbing on First Human Clone Eight Weeks Along · 2002-04-04 22:59 · Score: 5, Insightful

It is ever-so-slightly worrying that the doctor in question, Severino Antinori, admitted in a press conference that Dolly, the cloned sheep, was suffering from premature aging. His defence, that the experiments were not conducted well, and that sheep cloning is vastly different to human cloning, does not inspire confidence.

This child (presuming it survives) is nothing more than a guinea pig for Dr. Antinori's ego. Will this child be able to live a normal life? No. Look at Dolly -- how many tests do you think she goes through on a daily basis?

Whilst I am reluctant to encourage animal testing, would it not be better for those in the same field as Dr. Antinori to perfect cloning of non-humans before moving onto humans? It seems the doctor is in a hurry to stake his name in history. If he is not careful, he'll get his wish, but it will appear closer to Josef Mengele than Marie Curie.

Is This Significant? on FDA Approves Implantable Microchips · 2002-04-04 21:46 · Score: 1

I don't see how this is significant. Just because the FDA believe this product does not fall within their jurisdiction does not immediately mean that the US will become a "big brother" state (any more than at present, anyway).

If the US government passed a law dictating that everyone had to wear a registered chip at all times, then that would be slightly more worrying.

Syncing to Outlook on Bad Review for the Zaurus · 2002-04-04 03:43 · Score: 1

I know Windows is still the most popular OS out there, but why didn't he try syncing a Linux-based address book? If you're using a Linux handheld, is it not likely you'll be using a Linux desktop?

I bet the CE handhelds aren't marked down for failing to sync with Linux or Mac address books - but then you get into the popularity argument again.

Relationship Aid on Making Your Room Quiet · 2002-04-03 22:09 · Score: 1

It sounds like someone should start a service to match this machine to the pitch of your girlfriend's or boyfriend's voice. The benefits for relationships are enormous...

Re:/me runs out to the store, buy open and return on Sony Intentionally Crashes Customers' Computers · 2002-04-03 21:24 · Score: 1

How about:

Parental Advisory
Contains Cheesy Lyrics

Keywords on Carnivore Update · 2002-04-02 03:27 · Score: 5, Funny

FBI Headquarters, Director's Office, Present:

DATA ANALYST: Good Afternoon, Sir. Here is the latest report from Carnivore.

FBI DIRECTOR: Who the fuck is this Bernard Shifman?

DATA ANALYST: He's a moron spammer, sir. We're trying to get his e-mails excluded as we speak.

Practicality on GPS Wristwatch for Kids · 2002-03-27 23:34 · Score: 2, Interesting

I hope I'm not being stupid, but there seems to be a serious flaw to this system.

How do the parents go about the process of finding their lost child? I'd imagine the parents would call up the company requesting the geographical location of their child? But how do the parents (or the company) know their own geographical location? Directions are always relative to the start point (in this case the parents), so it seems to me that you're really going to need two sets of GPS systems.

When you add the variable of the child moving about, this is going to add extra problems. It may well be useful near your home, where the company can give you a street name, but what about when you're away from home?

Excellent on GPS Wristwatch for Kids · 2002-03-27 22:40 · Score: 5, Funny

I'm always losing my watch, so this would be fantastic.. All I need now is one for my keys.

Relay on Dateline: Abuja; Nigeria Fights Email Scam · 2002-03-27 22:36 · Score: 1

I just did a test over at abuse.net, and it seems like nigeriafraudwatch.org erm, relays mail (well, one of fifteen tests did, anyway...).

Note that this isn't necessarily so, as this was just the public statement by the server, and may be different to the internal rules..

Initial Page on Dateline: Abuja; Nigeria Fights Email Scam · 2002-03-27 22:14 · Score: 1

Welcome to NigerianFraudWatch.org

This service, operated by the Government of the Federal Republic of Nigeria through the Nigeria High Commission in the United Kingdom, is dedicated to the tracking of advanced fee fraud perpetrated by Nigerian organised crime syndicates.

The most common type of fraud is 'advance fee'. This is known as 419 fraud after the penal code in Nigeria that makes it illegal. 'Black money' fraud also has a high profile. These groups also carry out highly organised housing, social security and other grant frauds. Unfortunately, the profits of these crimes are often used to finance drug trafficking, resulting in crime and death in destination countries.

'419' Letters are distributed by post, fax and email. In a typical '419' (advanced fee fraud) letter, the author purports to be a senior government or central bank official who has managed to over inflate a contract, generating a personal profit. In return for help smuggling money out of the country, the recipient is offered a percentage, usually between 10% and 30% of contract value. At first no money is requested but once a victim has been drawn in, requests are made of the victim to fund legal and administrative costs. Victims have lost hundreds of thousands of dollars in some cases.

In a Word... on Can GnuPG Deliver? · 2002-03-27 21:50 · Score: 1

-----BEGIN PGP MESSAGE-----
Version: 2.7.1

hPUDEUi8JOKxuccBB0iZOYpD+moBqlme8h14BafQpYQThtIHyo Z5oSR0u1rYlUT6EmxWyim4m/wVSUMouBkbcZ4S2nDTg5lA8z6x mIfLKU1NTWk2EtaKXQKeRRb0tUpJkcmPgzjuQuwC6zylttGvkj w5Dg3QaVpSzprZliBLOli6pBXX3aE72nUdsOeQLgvmKJQNJ5C1 jKfkY4rxZkptOp2+YTvek9OLEoMoa8fvmcUFps5V+wd1eRJ0qm jCP7N3lMgvtxdtTekzDDOlvS6GdOdYfPiUx+BkRhPfy6e00RSX 4u5+in8Gl9VwjuetkZLkRRoQx0sfZqYAAADXOtvhsuOFRvn7Vl 96yr2wTb9R0j2ZpQVc8z6fOTC6iK/jl2DuvouAG17ZNudi+3QP gk6lTnx4yuWqmUgms1miZfuvjZNr8uVgvZYkwFLiKNqN5PKttI 8QYl4HSwELSIwIecoyhAcQI2MfAjeA2vjw1NvbYnXkpWF+ZZMG QFJIVbcwQa6ALotfQQ0ZmcTCrYajD+wwRbpqIPSjVeyHohYNDF UO9fi2cNRbC5k28e5qxnXA3E0fAPw1yVFUG0dUnyHbpozEThEd LwCKGLmsIySn3cp4RGq/v3I==CJeA

-----END PGP MESSAGE-----

Yes..

Bogus on Wall Street Embraces Linux · 2002-03-27 21:45 · Score: 1

This article is obviously bogus. Everyone knows Gordon Gekko makes all the decisions on Wall Street.

One Essential Use of Paper on The Myth of the Paperless Office · 2002-03-27 21:36 · Score: 1

Well, computers may have taken over most of the need for paper in the office, but one thing remains a bastion of the office of our fathers, and our father's fathers...

I'm talking, of course, about the Paper Aeroplane! Imagine trying to fold up your desktop or laptop to throw it across the room; you just can't get the crisp edges!

Is there anything legal that is more satisfying than hurtling something with aerodynamics rivalling the most modern spaceship straight at somebody's head? Certainly not in a crowded office...

How Many Releases on One DVD To Rule Them All · 2002-03-27 04:58 · Score: 1

Who does Peter Jackson think he is? Ridley Scott?

Seriously, why do studios have to exploit fans of a particular film? Why don't they release one DVD set with pan and scan AND widescreen AND all the extra footage?

If someone wants to buy some bookend sculptures, send them to a bloody furniture store.

One DVD to Rule Them All on One DVD To Rule Them All · 2002-03-27 04:38 · Score: 2, Funny

One DVD to rule them all
One DVD to find them
One DVD to bring them all
And in the darkness bind them

So, naturally, I'll wait for that one..

Rolling Stones on Corporate Anthems Go Corporate · 2002-03-27 02:40 · Score: 1

I remember when Windows 95 was launched in the UK, it was accompanied by "Start it Up" by the Rolling Stones. Perhaps M$ ought to change it to a 10 minute remix entitled "Start it Up (Again)".

Re:what about... on Face Recognition On Mobile Phones · 2002-03-27 02:36 · Score: 1

Why would you shave off your beard just to grow a goatee? :-P


Comments · 474