Google Cans Comment Spam
fthiess writes "Comment spam is in many ways even more annoying than regular email spam, since you generally have to do more than just hit the delete button to get rid of it. Its defining characteristic is that spammers abuse websites where the public can add content (blogs, wikis, forums, and even top referrer lists) to increase their own ranking in search engines. It seems, however, that the days of content spam are numbered: today Google announced that, in partnership with MSN Search and Yahoo!, it has implemented a way to block content spam." (More below.)
"Briefly, you just change your blogging/wiki/forum/etc. software so that any hyperlinks in publicly-contributed text have a new rel=nofollow attribute added to any anchor tags. Google, MSN, and Yahoo! will now no longer index any such links, so the motive for content spamming disappears. Especially hopeful is the fact that a slew of makers of blogging software, including Six Apart, have announced they are supporting the new attribute."
It's nice to see Google, MSN, and Yahoo cooperating on this effort.
The NSA: The only part of the US government that actually listens.
Don't forget to put that attribute in your track-back links either :)
Simon.
Hmmm...if a malicious program adds the tag to links served by a compromised html server, you could have an interesting and different sort of denial of service attack, although it would be slow to take effect.
There is too much custom blog software out there, and many of its programmers don't read Slashdot (or may not read it today), don't check with Google, Yahoo, or MSN, and aren't concerned with their blogging software messing with page ranks. There are also way too many people who will not upgrade their blog software because it is not worth the hassle. There are still people who run Windows 3.1, an Apple ][, or a Commodore 64; expecting people to upgrade their software is not going to happen any time soon. Maybe most will be upgraded in 20-30 years. But some people still write blogging software that will not even check to see if the HTML is parsed; they just want to make it quick and easy.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Forums and Blogs often contain very useful links. What about them? What about all those sites that are *only* linked to from blogs and forums, and that actually are great and useful sites?
I just don't trust anything that bleeds for five days and doesn't die.
Comment spam is mostly used to get a better search engine ranking. A blog which uses this attribute on link tags is far less interesting to comment spammers, so chances are the moderators will have to delete less spam.
This is your sig. There are thousands more, but this one is yours.
RTFA. Slashdot could modify slashcode to automatically add the attribute to all links posted in comments. Comment spammers can't do anything about it, so they'll move away to other sites.
No normal links (i.e. not in visitor contributed content) should have the attribute. So slashdot will still be full of normal links; only the links in the comments will have the attribute.
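For anyone wondering what the change to the comment code might look like, here is a minimal sketch in Python. It is not how Slashcode or any particular blog package does it; `add_nofollow` is a hypothetical helper, and a regex is a shortcut that will trip over unusual markup (a real implementation would more likely hook into the software's own HTML sanitizer). It is meant to be run only over visitor-contributed text, never over the site's own links.

```python
import re

# Matches an opening anchor tag and captures its attribute text.
_ANCHOR = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute (quoted or unquoted value).
_REL = re.compile(r"(?<![\w-])rel\s*=\s*(\"[^\"]*\"|'[^']*'|[^\s>]+)", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" present on every <a> tag."""

    def fix(match: re.Match) -> str:
        attrs = match.group(1)
        rel = _REL.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return f'<a{attrs} rel="nofollow">'
        # Keep any existing rel values and add nofollow if it is missing.
        values = rel.group(1).strip("\"'").split()
        if "nofollow" not in (v.lower() for v in values):
            values.append("nofollow")
        new_rel = 'rel="' + " ".join(values) + '"'
        return "<a" + attrs[: rel.start()] + new_rel + attrs[rel.end():] + ">"

    return _ANCHOR.sub(fix, comment_html)


print(add_nofollow('See <a href="http://example.com/">this</a>.'))
```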
well, it does, kinda
people spam comment boards on sites with high pageranks.
Google's logic here is: If a high-ranked site links to site X, X's ranking also gets higher. If your site is spam/ad-ridden, this is step 3. Profit!
With rel=nofollow in place, this tactic no longer works.
No Revenues -> No reason to spam
QED
Exercise caution when modding this message up: the author acts like a jerk when his karma is excellent.
This is not a solution as far as I'm concerned.
Why stop the indexing of relevant links from blogs to make Google's life easier?
99% of the links posted in comments are relevant and would be beneficial to index. Why stop this for the 1% of jackasses out there?
The domains contained in the links from blogspam are well known, and there are plenty of blacklists out there. Why doesn't googleyahoomsn just remove these sites from its database? It's such an easy solution. I believe they already do this in some circumstances for link trading systems whose only goal is to get higher pagerank.
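To make the blacklist idea concrete, here is a rough sketch of the lookup, assuming the blacklist is just a set of domain names. The domains and the `is_blacklisted` helper are made up for illustration; nothing here reflects how any search engine actually filters its index.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries; a real list would come from a maintained source.
BLACKLIST = {"spam-pills.example", "cheap-poker.example"}


def is_blacklisted(url: str, blacklist: set = BLACKLIST) -> bool:
    """True if the URL's host, or any parent domain of it, is on the blacklist."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Check "www.spam-pills.example", then "spam-pills.example", then "example".
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


links = ["http://www.spam-pills.example/buy", "http://slashdot.org/"]
print([u for u in links if not is_blacklisted(u)])
```

The obvious weakness is that new domains are cheap, so the list is always playing catch-up.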
Even with the flattest organizational structure, the largest company in the world can still be big. I am sure most employees don't report to Ballmer or Gates. I don't know how flat it is, but say, for example, that the MSN Search team belongs to the MSN team, which reports to Gates. Then there is the FrontPage team, which works for the Office team, which belongs to the software development team, which reports to Gates. For a company that size, that structure is very flat. But there is still middle management involved, and information may not spread across teams.
If it's merely a question of "rank", shouldn't the attribute be norank instead of nofollow? I expect a link tagged "nofollow" to, well, not get followed.
Belief is the currency of delusion.
There are a lot of people out there who understand the PageRank system, and complain that if they add outgoing links on their site then their previous PageRank will be "leaked" to other sites, rather than their own internal pages.
Well, luckily Google has now released a way for people to link to each other without leaking PageRank. Yes, the nofollow relation. So, now everyone can link to each other, and no-one gets any benefit out of it whatsoever.
This tag is not a bad idea, but I think the good things it could stamp out weren't considered anywhere near as much as the few bad things it can stamp out..
True, it's a long term solution which is not gonna do any good in the short run. The short term solution is to make it impossible for the spammers to attack your blog in the first place. Change the names of the files that handle comment posting etc... (and of course change the code that points to such pages) and most automated spam bots are lost. If you really want to be secure, implement an intermediate page where it asks explicit permission before posting (tick a checkbox and click "yes, submit") and you can be pretty sure you're safe from comment spam.
Right now I'm testing with the first and easier of the two solutions: just change the names of the scripts around and change the code pointing to them. So far no spam, but then again, this test has only been running for a week or so.
Install windows on my workstation? You crazy? Got any idea how much I paid for the damn thing?
Sure, that's great for humans using a graphical browser, with images turned on, and 20/20 vision. But that doesn't cover all internet users. What about text browsers? What about screen readers?
This is the age of internet accessibility folks, and it's exactly why I refuse to use Captcha tests on my own blog - instead, I currently filter all comments and trackbacks through wp-spamassassin. Haven't had a single problem yet, although it's early days.
The rel="nofollow" trick sounds promising for killing off the PageRank cheats, but it won't stop humans clicking the links...
If only it was so simple. Links are the basis of the pagerank algorithm. This isn't just some random coding task, but at the very forefront of computer science. Anyone can code some random condition for a good page, but the trick is to invent something that is feasible to compute for billions of pages and that gives even remotely good results for a good proportion of them.
It was a real advance when the google founders invented the pagerank algorithm. Before then, the state of the art was based on counting the words appearing on a page, and you had to go through several pages of search engine results to get something even approximately relevant (remember when AltaVista was the hot stuff?)
Right now every major search engine uses some variation of the pagerank algorithm -- the google founders were generous enough to publish the theory behind its operation, back when they were just graduate students at Stanford. Even AltaVista uses a related algorithm now. This is why it wasn't just google that was working on this rel=nofollow stuff, but other search engine people too.
In brief, it takes a lot of genius, luck, and experimentation to find a better algorithm. I'm sure people at google are working on it, but we're talking about real research-level stuff here, it's not something you can guarantee your success at.
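For the curious, here is a toy version of the published idea: a simplified power-iteration PageRank in Python. This is the textbook form, not whatever any search engine runs today. The damping factor of 0.85 is the commonly cited value, and treating a nofollow link as if it simply did not exist is an assumption made for the sake of the sketch. The page names are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to a list of (target, nofollow) pairs."""
    pages = set(links)
    for outs in links.values():
        pages.update(target for target, _ in outs)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    # Assumption: nofollow links are dropped before ranks are computed.
    followed = {p: [t for t, nf in links.get(p, []) if not nf] for p in pages}

    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = followed[p]
            if outs:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[p] / len(outs)
                for t in outs:
                    new[t] += share
            else:
                # A page with no followed links spreads its rank over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


spammed = {"blog": [("good", False), ("spam", False)], "good": [("blog", False)], "spam": []}
tagged = {"blog": [("good", False), ("spam", True)], "good": [("blog", False)], "spam": []}
print(pagerank(spammed)["spam"] > pagerank(tagged)["spam"])
```

In this toy graph the spam page ranks lower once the blog's link to it is tagged, which is the whole point of the attribute.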
But somehow I don't think spammers really care if a blog uses this system or not. It's probably easier to just spam all blogs than to figure out which are useless. Just like email spammers don't care much if an address is valid or not.
Some people think that adding spam filters to an email account reduces the spam sent, while it only reduces the amount of spam received. This solution does neither.
However, all efforts to fight spam should be welcomed and supported. Despite my pessimism, it will be interesting to see how it turns out.
What about not linking to people such jackasses would want to annoy? Same result (no extra pagerank), just simpler.
Your abuse scheme seems a bit convoluted to me, or am I missing something?
blah
I think this last paragraph is important. "nofollow" is not on the official list of link-types. If blog authors wish to use this attribute in anchor elements, they need to define it properly (or at least properly reference a definition).
Remember back in the 90's when Netscape and MS were breaking standards right and left so that their browsers would have an edge on the competition? That was the wrong way to do it, and it created the mess we're in now with sloppy HTML spewed all over the web and designers unable to use compliant designs because the most popular browser doesn't even try to support standards (an example here). Google is doing this the right way. They went back and read the HTML specification to see if it was already capable of doing what they needed. It does? Great! Let's utilize the standard!
Granted, HTML these days has a much better design than it did in the pre-4.0 specifications. Back when Netscape and MS were at each other's throats the document format was actually incapable of doing a lot of things that designers wanted to do on the web. But HTML is a very mature format these days.
Sysadmin Geeks who have to clean up the messes left by shoddy Microsoft products, day after day, hate their products because they make extra work for us. We hate Outlook, IE, and IIS because of their penchant for spreading worms and viruses. We hate service packs which break more than they fix. We hate FrontPage because of the non-standard, blecherous, broken HTML it spews forth. We hate the general lackadaisical attitude Microsoft has about security and quality in general.
Libertarian-minded geeks hate Microsoft for their flagrant disregard for the law and the courts. We hate them for the way they blatantly infringe on other companies' patents and lawyer their way out of it. We hate the way they bankrupt or buy out anyone making a product which actually competes with them. We hate the way they use puppet companies (SCO, BSA) as hired thugs to bully other companies on their behalf.
Anti-corporate geeks hate Microsoft because it's a prime example of corporate greed run amok and of the dangers of unfettered capitalism.
Why is it that the proponents of "one nation under God" are so eager to get rid of "liberty and justice for all"?
Whenever I'm searching for technical information, a couple of sites always come up that are useless to me. They have a question/answer format, questions are left in the clear for search engines, while the answers require registration. What I need is a way to filter those sites out from my searches, so that they simply don't show up in any result set. Hmm might be a good excuse to play with writing Firefox plug-ins... :-)
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
This is pretty much useless. Within a week of opening up comments on my blog, I was getting blogspam. I went to war immediately; the first thing I did was to submit all comments to an approval queue. No spam has appeared on my blog since. I noted this fact in an article, and in comments around the comment submission form, and the result of POSTing a comment tells you it's been submitted for approval.
But this did nothing to stop the flood of incoming blogspam.
I blocked, and still block, a few of the repeat offending IPs. But these days, my comment log looks like:
[Thu Jan 20 02:03:39 2005] Rejected spam from 213.121.209.14: carroll
[Thu Jan 20 02:18:59 2005] Rejected spam from 61.221.15.131: cleotilde
[Thu Jan 20 05:08:55 2005] Rejected spam from 211.57.209.225: tera
[Thu Jan 20 05:09:07 2005] Rejected spam from 61.221.15.131: lawanda
[Thu Jan 20 05:09:30 2005] Rejected spam from 66.160.17.189: deangelo
[Thu Jan 20 05:09:41 2005] Rejected spam from 193.251.169.174: raymonde
[Thu Jan 20 05:10:03 2005] Rejected spam from 66.250.69.7: tynisha
[Thu Jan 20 05:11:02 2005] Rejected spam from 211.57.209.225: corrie
[Thu Jan 20 05:37:47 2005] Rejected spam from 85.64.61.191: Online Poker
[Thu Jan 20 08:14:10 2005] Rejected spam from 211.250.80.2: heike
So, blocking by IP is pretty useless. I was in no mood to try word filters or statistical filters or any such, so I simply added a hidden field to each page, based on the time the page was requested and a secret token. When a comment is submitted, it is rejected if the hidden field is not present, or if it is from a time that is too old. This immediately blocked 95% of comment spam.
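Roughly, the hidden field can be built like this. This is a sketch in Python rather than my actual code: the secret, the one-hour limit, and the function names are placeholders, and signing the timestamp with an HMAC is just one reasonable way to combine the request time with a secret token.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"   # placeholder secret; keep the real one out of the page source
MAX_AGE = 3600          # placeholder limit: reject forms older than one hour


def make_token(now=None):
    """Build the hidden field value: the request time plus a signature over it."""
    ts = str(int(time.time() if now is None else now))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"


def check_token(token, now=None):
    """Accept a comment only if the field is present, untampered, and recent."""
    if not token or ":" not in token:
        return False
    ts, sig = token.split(":", 1)
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(ts) <= MAX_AGE


token = make_token()
print(check_token(token), check_token("1106200000:forged"))
```

The page handler would write `make_token()` into a hidden input, and the comment handler would call `check_token()` on whatever comes back before doing anything else.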
A few people were persistent, fetching a page and then posting back to it. So I checked my referrer logs; it seems blogs to spam are found by Googling for typical strings, and the spam is posted in an expected format. So I made the Subject field mandatory.
I now have a close to 100% spam block rate. Why would I add a "nofollow" tag to my links, when spammers won't stop spamming just because their spam isn't being read (they don't stop now, and their spam isn't even being accepted!) and when real comments would suffer from it?