Webmasters Pounce On Wiki Sandboxes

Posted by simoniker on Monday June 7, 2004 @04:53AM from the fold-spindle-mutilate dept.

Yacoubean writes "Wiki sandboxes are normally used to learn the syntax of wiki posts. But webmasters may soon deluge these handy tools with links back to their sites, not to get clicks, but to increase their Google PageRank. One such webmaster recently demonstrated this successfully. Isn't it time for Google to finally put some work into refining their results to exclude tricks like this? I know all the bloggers and wiki maintainers would sure appreciate it."

Oh well by SpaceCadetTrav · 2004-06-07 04:57 · Score: 5, Informative

Google and others will just lower/diminish the value of links from Wiki pages, just like they did to those open "Guest Book" pages on personal sites.

--
Life in Orange County
google works by mwheeler01 · 2004-06-07 05:00 · Score: 3, Informative

Google does tweak their ranking system on a regular basis. When a problem becomes evident (and it looks like this one just has), they do something about it. That's why they're Google.

--
Pretty widgets? What pretty widgets?
I've seen this by goon+america · 2004-06-07 05:04 · Score: 3, Informative

I just reverted some pages on my Wikipedia watch list that had been edited by a Google spam bot to link all sorts of words back to its mother site. There were lots of mistakes; it looked like the script they were using hadn't been tested that well yet. (Would post an example, but Wikipedia is completely fuxx0red at the moment.)
This may become a big problem for sites like this. The only solution might be one of those annoying "write down the letters in this generated GIF" humanity tests.
Not a big deal by arvindn · 2004-06-07 05:06 · Score: 4, Informative

Recently the Chinese Wikipedia suffered a spam attack, with a distributed network of bots editing articles to add links to a Chinese internet marketing site. In response, the latest version of MediaWiki (the software that runs the Wikipedias and sister projects) has a feature to block edits matching a regex (so you can prevent links to a specific domain). Wikis generally have more protection against spamming than weblogs. So I wouldn't worry.
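
As a rough illustration of the regex idea (not MediaWiki's actual code), an edit filter could look something like the sketch below. The blocked domains and the function name are made up for the example.

```python
import re

# Hypothetical blocklist: these domains are placeholders, not real spam sources.
BLOCKED_LINK_PATTERN = re.compile(
    r"https?://(?:[\w-]+\.)*(?:spam-example\.test|cheap-links\.example)\b",
    re.IGNORECASE,
)


def edit_is_allowed(new_text: str) -> bool:
    """Return False if the submitted text links to a blocked domain."""
    return BLOCKED_LINK_PATTERN.search(new_text) is None
```

A wiki would call something like this before saving an edit and refuse the save when it returns False.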
Re:Cyberneighborhood Not-Watch? by Random+Web+Developer · 2004-06-07 05:10 · Score: 5, Informative

There is a robots meta tag for this that you can put in the header of a single page (robots.txt matches on paths, so it needs subdirs), but unfortunately most webmasters are too ignorant to realize the power of these:

http://www.robotstxt.org/wc/meta-user.html

--
Artists against online scams http://www.aa419.org/
Re:Cyberneighborhood Not-Watch? by Random+Web+Developer · 2004-06-07 05:13 · Score: 2, Informative

The problem with wikis is that they use one template for all pages, including the sandbox; everything is wiki.pl?PageName or something like that. You would have to dive into the code instead of just "using" the wiki.
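
A minimal sketch of what that code change might amount to, assuming the shared template can call a helper with the page name. The page names and the function are hypothetical, not taken from any particular wiki engine.

```python
# Hypothetical names of pages that should stay out of search indexes.
NOINDEX_PAGES = {"SandBox", "WikiSandBox"}


def robots_meta_tag(page_name: str) -> str:
    """Build the robots meta tag the shared template emits for one page."""
    if page_name in NOINDEX_PAGES:
        content = "noindex,nofollow"
    else:
        content = "index,follow"
    return f'<meta name="robots" content="{content}">'
```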

--
Artists against online scams http://www.aa419.org/
Re:Naughty behaviour by Doesn't_Comment_Code · 2004-06-07 05:22 · Score: 2, Informative

I'm looking for a clean, fast, non-buggy alternative to the google giant. Preferably open source.

Any suggestions?

The only big one I know of right now is Nutch. It is an open source search engine that is in the later stages of development, but hasn't produced a large, usable site yet.

nutch.org

Since it will be open source, you will be able to read the ranking algorithms and change/abuse them as you see fit.

This one http://search.mnogo.ru/ is also available.

--

Slashdot Syndrome: the sudden, extreme urge to correct someone in order to validate one's self.
Re:Why just wikis? by ichimunki · 2004-06-07 05:22 · Score: 5, Informative

The real problem with Wikis is that the link will remain there, even after it has been removed from the current page, because most Wikis have a revision history feature. So what's needed is careful set up in the robots.txt file and other HTML clues for the web crawlers to exclude anything but the most current version of a page (and to skip over the other 'action' pages, like edits, etc).
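
For illustration only, here is one way such a robots.txt could be checked with Python's standard urllib.robotparser. The URL layout (old revisions under /history/, edit forms under /edit/) and the bot name are assumptions; a real wiki's paths will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical URL layout: current pages under /wiki/, old revisions under
# /history/, edit forms under /edit/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /history/
Disallow: /edit/
"""


def crawler_may_fetch(path: str, user_agent: str = "ExampleBot") -> bool:
    """Report whether a well-behaved crawler may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

This only helps with crawlers that honor robots.txt, and it relies on revisions and edit forms living under their own path prefixes.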

My wiki got hit by this stupid link, but not in the sandbox. Of course, recovering the previous version of the page is easy... it's wiping out any trace of the lameness that gets trickier. I suppose the easiest way to defeat this would be to require simple registration in order to edit Wiki pages.

What else can we do? Alter the names of the submit buttons and some of the other key strings involved in Editing?

--
I do not have a signature
visual security code for sign-up by Saeed+al-Sahaf · 2004-06-07 05:26 · Score: 4, Informative

Most bulletin boards (including phpBB, upgrade!) and blogs (including Slashdot) now feature a visual security code for sign-up. But, of course, this does not prevent hand entry of spam...

--
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
1. Re:visual security code for sign-up by Bitsy+Boffin · 2004-06-07 06:51 · Score: 2, Informative
  
  Except that the images ("turing numbers" as they are often called) are dynamically generated from random character sequences, and probably with equally random distortions.
  
  You'd be pretty lucky to hit the exact same image twice.
  
  --
  NZ Electronics Enthusiasts: Check out my Trade Me Listings
2. Re:visual security code for sign-up by smagruder · 2004-06-07 13:53 · Score: 2, Informative
  
  Check out the Visual Confirmation mod in the /contrib folder in your phpBB installation. Read the README.html file for installation instructions.
  
  --
  Steve Magruder, Metro Foodist
Re:"Finally"?? by jdavidb · 2004-06-07 05:38 · Score: 2, Informative

I checked, and I've got documented evidence of this. On April 25 last year, I reported that earthlink.net was showing up as the top search result for queries involving various religious words, including "Bear Valley Bible Institute." The Church of Scientology (which owns Earthlink) was clearly engaging in something to distort the page rank of earthlink. I had noticed this for a long time before I recorded it.

On that same day, I reported the problem to Google via their feedback mechanism. I note today that the problem is gone.

Now if I can just do something about the "Church Of Christ at eBay Low Priced Church Of Christ. Huge Selection! (aff)" ads I keep getting on Google, I'll be happy... ;)

--
Secession is the right of all sentient beings.
Re:Why just wikis? by Anonymous Coward · 2004-06-07 05:42 · Score: 2, Informative

Just set your robots.txt to exclude the user list. Or if you don't have many friends and family, send yourself an 'approve member' email. Then start training your spam filter on fake accounts.
Re:Cyberneighborhood Not-Watch? by Random+Web+Developer · 2004-06-07 05:47 · Score: 2, Informative

As most spam posts have several links in them, WordPress allows setting a threshold: a comment with X or more links gets queued for moderation.
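
The idea can be sketched in a few lines. This is not WordPress's code; the function name, the default threshold, and the crude way of counting links are all assumptions.

```python
import re

# Counts anything that looks like the start of an http(s) URL.
LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def needs_moderation(comment: str, max_links: int = 2) -> bool:
    """Hold the comment when it contains max_links or more links."""
    return len(LINK_PATTERN.findall(comment)) >= max_links
```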

--
Artists against online scams http://www.aa419.org/
Re:Sure, that will work by Short+Circuit · 2004-06-07 05:58 · Score: 2, Informative

I know you're being sarcastic, but one way to prevent forged IP addresses is to require the user to "preview" their comment before posting.

--
tasks(723) drafts(105) languages(484) examples(29106)
Re:Why just wikis? by boa13 · 2004-06-07 06:10 · Score: 3, Informative

So what's needed is careful set up in the robots.txt file and other HTML clues for the web crawlers to exclude anything but the most current version of a page (and to skip over the other 'action' pages, like edits, etc).

It has probably already been done in any wiki software worth its salt. Here's what MoinMoin does for example:

* It has a regexp of HTTP_USER_AGENTS which should receive a FORBIDDEN for anything except viewing a page. The default setting includes many known bots (including Google) and utilities such as wget.
* Most pages contain the appropriate robot meta tag, with the relevant noindex and/or nofollow settings.
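
The user-agent check in the first point might look roughly like this. It is a sketch of the technique, not MoinMoin's code; the pattern, the action name "show", and the function are placeholders.

```python
import re

# Hypothetical pattern: a real deployment would maintain its own list.
ROBOT_AGENTS = re.compile(r"googlebot|wget|slurp|crawler|spider", re.IGNORECASE)


def http_status(user_agent: str, action: str) -> int:
    """Return 403 for robot user agents on any action except plain viewing."""
    if action != "show" and ROBOT_AGENTS.search(user_agent):
        return 403
    return 200
```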

In addition to that, the webmaster can of course set up a robots.txt file, and actually should do so because there are tools out there which don't understand the robot meta tags (or they don't want to take a performance hit) and the user agent of which can easily be changed by the user... wget comes to mind.

Of course, it shouldn't be too hard to add regexps to prevent certain links from being added, or certain hostnames or IPs from altering the site (editing pages, reverting them, deleting them).
It's already been invented. by herrvinny · 2004-06-07 06:12 · Score: 3, Informative

The Robots Exclusion Protocol (i.e. robots.txt).
Here's Google's stance on the subject (boils down to: if you don't want it indexed, put in a damn robots.txt file).
Hell, even Google News uses robots.txt
Clean sandbox daily. by chiph · 2004-06-07 06:24 · Score: 2, Informative

As any cat owner will tell you, you need to clean the sandbox out periodically. In the case of a Wiki, overnight would probably be a good idea.
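
A nightly cleanup could be as small as the sketch below, assuming a wiki that stores each page as a plain text file (many don't). The default text and the function name are invented, and old revisions would still keep the spam links unless the history is cleaned too.

```python
from pathlib import Path

# Hypothetical default content for a freshly cleaned sandbox page.
DEFAULT_SANDBOX_TEXT = "Feel free to experiment on this page.\n"


def reset_sandbox(sandbox_file: Path, default_text: str = DEFAULT_SANDBOX_TEXT) -> bool:
    """Restore the sandbox page; return True if anything was changed."""
    current = None
    if sandbox_file.exists():
        current = sandbox_file.read_text(encoding="utf-8")
    if current == default_text:
        return False
    sandbox_file.write_text(default_text, encoding="utf-8")
    return True
```

Run from cron or a similar scheduler once a night.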

Chip H.
Re:Why just wikis? by Eivind · 2004-06-07 06:33 · Score: 4, Informative

It's working almost *too* well. Not only are SCO the number one hit for "litigious bastards", but they're also the number one hit for "litigious" or "bastards" alone.
Then again maybe that mostly says something about their popularity.
Re:Which is why I thought it was real time by allism · 2004-06-07 07:48 · Score: 3, Informative

You didn't imagine it, but perhaps a clearer understanding of the technique can be achieved by reviewing the previous discussions. Here's a link to the Slashdot article that discussed this last January.

--
Denver Isuzu Suzuki
Re:mod parent up by Anonymous Coward · 2004-06-07 08:38 · Score: 1, Informative

Update your spellchecker. It's algorithm.