How are You Preventing Mailto-Link Harvesting?
mixwhit asks: "In our ever-increasing effort against spam, we are now considering replacing all mailto: links on our website with something unharvestable (i.e. 'user (at) address', JavaScript mailto links, character entity evasion, etc.). Obviously this won't stop the spam, but it seems prudent to stop the harvesting so that the spam may slow down someday (year 2024 maybe?). What are others doing with this issue? We would prefer to preserve mailto link clickability, but also only want to make this adjustment once."
One suggestion I would make is to put your email address in an image. People can read it, but harvesters can't (unless they download the image and run OCR on it). Any barrier you can place in front of the spammer, without blocking people honestly interested in communicating with you, is probably a good thing.
Just use a mail form instead of mailto: links. Once you reply to feedback mail, the sender has your address and you can correspond normally. Meanwhile, evil spambots can't harvest an address that isn't shown anywhere.
Vista:XPSP2::ME:98SE
People fighting for those who have difficulty seeing have been complaining about sites that make a person type in a number displayed in an image to verify that they're not a bot. They say it imposes undue hardship on sight-impaired folks. That may not be a legal fight your company would like to enter.
I can see both sides of this. Can't say I know where to stand though.
Yep, I never spell check.
What makes you think "user at mail dot foo dot com" is unharvestable? The web archives of all the development mailing lists at gcc.gnu.org use that scheme, and we still get spam to unique addresses used only for sending mail to those lists.
It's a handy technique, and useful, but it's certainly not foolproof.
You cannot apply a technological solution to a sociological problem. (Edwards' Law)
Any method of munging the address must still be clickable within the visitor's browser. If it is clickable, it can be harvested. JavaScript and HTML encoding may stop most of the bots, but bots exist that can slurp the address no matter how much JavaScript you wrap it in.
I use a PHP email form that never sends the address to the client accessing it. Short of hacking the server and looking at the PHP script in plain text, there is no way to harvest the address. I have no need to let the public know my address. If they want to email me, they can use the form or my site's message board.
I don't want the guy getting slashdotted, so I won't link his site. If you really want the script I use (available in PHP or ASP), go to hotscripts.com and search for dbmaster's mail form.
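<script>
<!--
// Build the address from pieces so it never appears whole in the page source
var u = "sales";
var d = "example";
var t = "com";
var a = u + '@' + d + '.' + t;
// Write a normal, clickable mailto: link into the page
document.write('<a href="mailto:' + a + '">' + a + '</a>');
//-->
</script>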
Just use this. Life is good, eh?
Meanwhile, I'm keeping an eye out for the next technology to replace email. IM was promising about five years ago, but went to hell faster than email.
Quoth the original message...
Err, doesn't this exactly not meet the given criteria? The guy wants links to be clickable. If you hide the image, you can only get as far as, say:
But that's just as easily harvestable as it would have been if you left the visible text as the plain address. What's the point?
It's the contents of the href attribute that need to be obscured, not the visible text (or image, or video clip, or whatever). You can't embed an image in the href text, so I don't see how this suggestion gains us anything at all.
---
The suggestion I like best is to encapsulate the address as HTML entities. Currently, this is enough to fend off the average address harvesting software, though if the practice catches on, I assume that the harvesters would start to take this into account -- at which point I don't know what the solution should be...
Barring that, it seems like the only way to provide an address will be to use literal text such as "write to us at foo at bar.com" and hope people just get it.
Alternatively, shy away from giving out your address, and provide a form where visitors can submit comments. This could allow you to filter out some of the incoming traffic (hint: if you're going to use "off the shelf" software for this, use NMS instead of Matt Wright's ancient Formmail.PL script, it's much safer). Avoiding any publication of email addresses might piss Jakob Nielsen off, but under the circumstances I think it's probably a reasonable approach to the situation -- it's way too easy for a public address to get abused...
DO NOT LEAVE IT IS NOT REAL
I've been looking at a couple of different techniques over the past year or so. They are closely tied into the Roxen Webserver, and probably won't work with Caudium, or any other webserver.
The first technique I used (described here) was a simple RXML macro, that defined a tag called <cloak>. It would check to see if the client was on a list of known robots. If the client was a robot, a graphic version of the email address would be returned. If the client looked like a normal browser, then the address would be entity encoded, and returned as a mailto link.
Shortly after I set that up, I realized that entity encoding was pretty much useless - that if a web browser can figure out the address, so can a spam bot.
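To see how thin the entity-encoding layer is, here is a rough sketch of the decoding step any bot author could add. The function names are made up for illustration, and this is not taken from any real harvester; it only shows that undoing numeric character references takes a couple of regular expressions.

```javascript
// Decode numeric character references such as &#64; (decimal) and &#x40; (hex).
function decodeNumericEntities(html) {
  return html
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

// After decoding, an entity-encoded mailto: link looks like any other link.
function findMailtoAddresses(html) {
  const decoded = decodeNumericEntities(html);
  const pattern = /href\s*=\s*["']mailto:([^"'?]+)/gi;
  return Array.from(decoded.matchAll(pattern), (m) => m[1]);
}
```

Anything the browser can decode without help from the server can be decoded the same way by a script.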
My second attempt appears to be working well. I wrote a Roxen module called mailcloak which takes addresses, and replaces them with a graphic link to a dynamically generated form to send an email to that address.
As an example, the code <mailcloak> maileater@ofdoom.com</mailcloak> would be replaced with a graphical version of the address maileater@ofdoom.com and a link to this page.
It also has support for finding and cloaking bare addresses in pages, and I'll probably add support for rewriting mailto tags sometime in the next few weeks.
You have to consider the trade-off of the inconvenience of your readers/customers with the amount of spam you get.
I have a few websites with my email address all over them, in mailto links. I "mask" the email very lightly, by escaping most of the characters, and it has worked beautifully.
Here is a webpage that will quickly convert your mailto link into a form that bots will miss.
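For anyone who would rather do the conversion themselves, a minimal sketch of this kind of escaping might look like the following. The function names are hypothetical, and mixing decimal and hexadecimal references is just one possible scheme, not necessarily what the linked page does.

```javascript
// Encode every character as a numeric character reference,
// alternating decimal (&#64;) and hexadecimal (&#x40;) forms.
function entityEncode(text) {
  return Array.from(text, (ch, i) => {
    const code = ch.codePointAt(0);
    return i % 2 === 0 ? `&#${code};` : `&#x${code.toString(16)};`;
  }).join("");
}

// Build a clickable link whose source contains no literal "mailto:" or "@".
// Both the href and the visible text are encoded.
function maskedMailtoLink(address) {
  return `<a href="${entityEncode("mailto:" + address)}">${entityEncode(address)}</a>`;
}
```

Browsers decode the references when they parse the page, so the link still opens the visitor's mail client as usual.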
Could a bot be written that would be able to harvest these email messages? YES. But would it be worth the spammer's time to code it? NO, so it probably won't happen.
Put yourself in the spammer's shoes (or slime-covered bedroom slippers). Why would you want to go to a lot of work to build a bot that will harvest the email addresses of the very people you don't want to get your spam, because they will report you to spamcop, harass your ISP, and even hack your computer and post some very unattractive pictures of you on the internet?
No, they want the chumps, and they want to find them without needing to check every webpage for dozens of patterns.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
No kidding. Comcast gives us seven email addresses, so I set one up for each of us. My three-month-old gets spam, and nobody has EVER used that account (except me sending a test email when I first set it up). These scum just take a brute-force approach to generating email addresses, and don't care how many are undeliverable. The messages come with opt-out buttons, but all those do is confirm they found a valid address, and they never send from the same address twice, so adding them to a filter list doesn't work either. Bayesian filtering on the content is the only way to go.
If all this should have a reason, we would be the last to know.
I recommend that you use a form that does NOT have the user's email address in a hidden input. Just have the user's ID, then on the server, find the address based on that ID and send the message accordingly. I know you want to keep the mailto: link thing happening, but if you do that, harvesters will always find a way to decode whatever you're doing.
Alternatively, to keep it transparently usable by end-users, you can do something like this:
<a href="mailto:false@false.com" onmouseover="var a = 'in.com'; this.href = 'mailto:real@doma' + a;">email me</a>
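The ID-based form recommended at the top of this post could be sketched on the server side roughly as follows. The recipient table, the field names (`recipientId`, `senderEmail`, `message`) and the function names are all assumptions for illustration; the returned object would be handed to whatever mailer your server actually uses.

```javascript
// Hypothetical recipient table. It lives only on the server,
// so the addresses never appear in any page sent to the browser.
const RECIPIENTS = new Map([
  ["sales", "sales@example.com"],
  ["support", "support@example.com"],
]);

// Look up the address for the ID submitted by the form.
// Unknown IDs are rejected instead of falling back to client-supplied data.
function resolveRecipient(id) {
  const address = RECIPIENTS.get(String(id));
  if (address === undefined) {
    throw new Error("unknown recipient id");
  }
  return address;
}

// Build the message from the submitted form fields.
// Any "to" or "mailto" field in the form is deliberately ignored.
function buildMessage(form) {
  return {
    to: resolveRecipient(form.recipientId),
    replyTo: form.senderEmail,
    body: form.message,
  };
}
```

Because the destination is chosen from a fixed table, the form cannot be pointed at arbitrary addresses either.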
I suspect you're using an ad-blocking browser or proxy, which has blocked the image itself but has left a large (clickable) white space that would be the image if you hadn't blocked it. That's the behavior Firebird shows for me, blocking ads.osdn.com. If you're using Mozilla or Firebird, and you right-click on the "background" I think you'll find "block images from this server" or "block images from ads.osdn.com" checked.
* And remember, it's spelled N-e-t-s-c-a-p-e, but it's pronounced "Mozilla."
As soon as any reasonable number of people start using the same scheme (and particularly if it's a mailto: designed to still be machine-readable) someone will take the time to harvest that kind of obfuscated address. It's just a matter of the cost/benefit ratio being high enough to make it worthwhile.
I think you're right: as more websites use automated obfuscation, the spammers will need to decode it to get to their victims. But as long as most websites aren't doing what I'm doing, I know they don't want to target the techies.
Here's another POV, though -- I'm considering the *other* cost/benefits ratio. I want my users to be able to easily email me, and giving them a simple mailto: link is the best way to do that. We'll have to wait and see.
Right now, it seems to be costing nothing, since I'm only getting spammed on the standard "guessed" names at my domains, like "sales@" and "webmaster@". But 5 spams a day would still be worth the trouble.
If the bots do start to really catch up (they may... I'm hoping enforced laws will start to catch up over the next few years!), at some point I might move on to the next-least-inconvenient masking method, which is probably randomized JavaScript masking. I.e., the mailto: link is generated by custom JavaScript that builds the address across a few lines of code. That would prevent users w/o JavaScript from using the link, though, which is a cost I want to avoid.
There are only 10 types of people: those who understand decimal, those who don't, and, uh, 8 other types I forget.
I actually just use the Unicode character reference for the @ symbol (&#64;). It seems that most of the time the harvesters just read the HTML source, and don't actually decode HTML entities or Unicode. Thus the harvester will get user&#64;example.com, a non-valid address, but a user on your site will see user@example.com and the mailto: link will function normally.
In my mail server I redirect the random addresses to a single e-mail. Then when I get spammed, I can trace it back to an IP, and contact the hosting company or ISP that it originated from.
Visit blue.aginet.com for my other GPL'd code. Feel free to use the source code in this example. I only ask that you give me credit if it's used for a commercial purpose.
Scott Wolf Senior Software Engineer Slingpage
I have a 1-pixel transparent GIF link at the very top of my page that links to /guestbook/jackhole. In my robots.txt file I have "User-agent: * Disallow: /jackhole/ Disallow: /jackhole/guestbook/". When a harvester traverses this link, their IP is added to a text file via a PHP script that I wrote and they immediately get a 403 page.
Each page of my site checks against this text file so the mailbot gets a 403 page for almost all pages/sites that I host. To deal with false positives there is a mailto link on the 403 page that goes to a TMDA address. At the very least it saves me bandwidth.
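The trap logic itself is small. Here is a minimal sketch of the idea, with an in-memory set standing in for the text file of IPs; the names `TRAP_PREFIX`, `blockedIps` and `statusFor` are invented for this example and are not from the original PHP script.

```javascript
// Path of the hidden trap link, as named in the post above.
const TRAP_PREFIX = "/guestbook/jackhole";

// Stands in for the text file of harvester IPs; a real setup would persist this.
const blockedIps = new Set();

// Decide which HTTP status a request should get.
// Hitting the trap blocks the IP immediately and for every later request.
function statusFor(ip, path) {
  if (path.startsWith(TRAP_PREFIX)) {
    blockedIps.add(ip);
  }
  return blockedIps.has(ip) ? 403 : 200;
}
```

Well-behaved crawlers that honor robots.txt never request the trap path, so only clients that ignore the exclusion rules end up on the list.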
This script requires a mail daemon that delivers user+anything@example.org to user@example.org.
#!/usr/bin/perl -w
use Socket; # Load socket functions
use CGI qw(:standard); # Load CGI standard functions
my $name = "harvestbait"; # yourname
my $domain = "example.org"; # yourdomain.tld
my $ipaddr = $ENV{'REMOTE_ADDR'}; # Get the requester's IP
$ipaddr = unpack 'H*', inet_aton($ipaddr); # Convert the IP to hex
my $date = `/bin/date +%H%M%m%d`; # Get a compact timestamp
chomp($date); # Get rid of the newline char
my $addr = $name."+".$ipaddr.$date."@".$domain; # Make email addy from bits
print header, # Print HTTP header
start_html(-meta=>{'robots'=>'noindex'}, # HTML header; "robots" is the meta name well-behaved crawlers look for
-title=>'Send me an email!'), # Page title
q(You can send me an email by clicking ), # Page content
a({href=>"mailto:$addr"},"here"), # The time+ip tagged mailto:
q(. No junk mail please! ^_^), # More content
end_html; # End the HTML document
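When spam later arrives at one of these tagged addresses, the tag can be turned back into the harvester's IP and visit time. A rough sketch of that decoding, assuming exactly the layout the Perl script above produces (8 hex digits of IPv4 address followed by HHMMmmdd); the function name is made up:

```javascript
// Decode an address like harvestbait+c000020a14300615@example.org
// back into the requester's IPv4 address and the timestamp fields.
function decodeTaggedAddress(address) {
  const match = /^[^+@]+\+([0-9a-f]{8})(\d{2})(\d{2})(\d{2})(\d{2})@/i.exec(address);
  if (!match) {
    return null; // not one of our tagged addresses
  }
  const [, hexIp, hour, minute, month, day] = match;
  // Each pair of hex digits is one octet of the IPv4 address.
  const ip = hexIp.match(/../g).map((byte) => parseInt(byte, 16)).join(".");
  return { ip, hour, minute, month, day };
}
```

The timestamp has no year in it, so matching it against server logs only works while the logs are reasonably fresh.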
This one doesn't use JavaScript at all. And it's only 4k.
Obfusticated Email Link Creator
It does mixed dec and hex. Creates links like this. But check the underlying code....
It's a Tripod site, so don't /. it.....
already have a lot of trouble with that picture-of-the-email-address thing. It is a neat solution, but it lacks portability, to state it another way.
-- There are two kind of sysadmins: Paranoids and Losers. (adapted from D. Bach)
I have a unicode converter that works really well. It will put your email address into a form like:
...
& # 105;& # 032;& # 100;& # 111;& # 032;& # 105;& # 116;& # 032;& # 116;& # 104;& # 105;& # 115;& # 032;& # 119;& # 097;& # 121;
For the past three years or so, the spammers haven't caught on to this, and they are unlikely to do so given the few people who take the effort to put this measure into place.
P.S. It's not just mailto links that are being harvested here. They'll scrape anything with an @ or a "at" or
http://tinyurl.com/4ny52
they spam :m ain
info@yourdomain
sales@yourdomain
help@yourdo
webmaster@yourdomain
postmaster@yourdomain
etc.etc.
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Also, don't munge.
Help us build a better map!
Wait... this provides some nice opportunities to cause them a major headache by including malicious JavaScript code on a page only seen by a bot not following the robots exclusion protocol (to prevent a "real" search engine spider from visiting the page) by linking to that page using some hidden link from your home page...
Believe it or not, this actually works. These days most harvester programs still don't read Unicode. Once I started doing this, I saw a great reduction in spam. It won't work forever, of course -- eventually the spambots will read Unicode, and the game will be over for this technique. But in the meantime, it's easy enough to do a search and replace of every "@" symbol.
If you want to convert your whole address, E-cloaker is a neat little free program for converting text to Unicode.
I haven't checked the stats recently, but Netscape 4.x and earlier do not support Unicode. Pretty much all browsers can handle the HTML entities given in other examples. You may not care.
"Beg the question" is a shortening of "beggaring the question"--ie. answering a question with the question itself. "Why don't parallel lines cross? Because lines that never cross are parallel!"
If you look at the definition for beggar, you'll see that one of the definitions is "One who assumes in argument what he does not prove." (Source: Webster's Revised Unabridged Dictionary, (C) 1996, 1998 MICRA, Inc.) In fact, this meaning of beggar has survived as a submeaning of 'beg.' This link on dictionary.reference.com supports my point. Look at definitions 3a and 3b.
So, the parent poster to your post is quite correct. His statement was not a hypothesis, but rather closer to fact, based on accepted usage.
Granted, standard American usage seems to treat "beg the question" as a synonym for "raise the question", but that's a rather incorrect usage, IMHO.
--JoeProgram Intellivision!
My problem with mail forms is that I don't have a record of any messages sent or any information if things go wrong with the delivery. Black hole for information == bad.
That being said, if you have a copy sent to the sender as well it's not as evil.
UserAdvocate: The voice of the user
Hey, guess what.
I was able to use your form to send myself spam!
That's right.
I entered my e-mail address, a from address, and the mail went through.
Essentially, your web page is providing the equivalent of an open relay.
You need to remove the "mailto" field, as that allows the form to be used to send a message to anybody. Once that's gone, your form should be secure again.
Karma: Chevy Kavalierma.
this provides some nice opportunities to cause them a major headache by including malicious JavaScript code on a page only seen by a bot not following the robots exclusion protocol
A lot of people do that with a malicious honeypot page. It just outputs X phony, but real-looking, mailto links, where X is a member of the set of Very Large Integers.
(note to /. math freaks: yes I know there's no set called Very Large Integers. It's a joke. Laugh.)
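A honeypot page of that sort could be generated with something like the sketch below. The function names are invented for this example, and it is only one way to do it. One deliberate choice here: the fake addresses use the reserved `.invalid` top-level domain, so no real mailbox or mail server ever receives the resulting spam. That makes them a little less real-looking than random .com names, but it keeps the collateral damage at zero.

```javascript
// Produce a random lowercase word of the given length.
function randomWord(length) {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  let word = "";
  for (let i = 0; i < length; i++) {
    word += letters[Math.floor(Math.random() * letters.length)];
  }
  return word;
}

// Generate `count` phony mailto links, one per line.
// Every address ends in .invalid, which can never resolve to a real host.
function phonyMailtoLinks(count) {
  const links = [];
  for (let i = 0; i < count; i++) {
    const address = `${randomWord(7)}@${randomWord(9)}.invalid`;
    links.push(`<a href="mailto:${address}">${address}</a>`);
  }
  return links.join("\n");
}
```

Link the page only from a hidden link and exclude it in robots.txt, as described earlier in the thread, so legitimate search engines never index the junk.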
I am disrespectful to dirt! Can you see that I am serious?!