Webmasters Pounce On Wiki Sandboxes
Yacoubean writes "Wiki sandboxes are normally used to learn the syntax of wiki posts. But
webmasters may soon deluge these handy tools with links back to their site, not to get clicks, but to increase Google page rank. One such webmaster recently demonstrated this successfully. Isn't it time for Google finally to put some work into refining their results to exclude tricks like this? I know all the bloggers and wiki maintainers would sure appreciate it."
Google and others will just lower/diminish the value of links from Wiki pages, just like they did to those open "Guest Book" pages on personal sites.
Life in Orange County
Google does tweak its ranking system on a regular basis. When a problem becomes evident (and it looks like this one just has), they do something about it... that's why they're Google.
Pretty widgets? What pretty widgets?
This may become a big problem for sites like this. The only solution might be one of those annoying "write down the letters in this generated gif" humanity tests.
Recently the Chinese Wikipedia suffered a spam attack, with a distributed network of bots editing articles to add links to some Chinese internet marketing site. In response, the latest version of MediaWiki (the software that runs the Wikipedias and sister projects) has a feature to block edits matching a regex (so you can prevent links to a specific domain). Wikis generally have more protection against spamming than weblogs. So I wouldn't worry.
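For anyone wondering what a regex-based edit block looks like, here is a minimal sketch in Python. It is not MediaWiki's actual code: the function name, the list name, and the domains are all made up for illustration.

```python
import re

# Hypothetical blocklist; these domains are placeholders, not real spam sources.
BLOCKED_LINK_PATTERNS = [
    r"https?://([a-z0-9-]+\.)*spam-example\.test\b",
    r"https?://([a-z0-9-]+\.)*cheap-widgets\.example\b",
]

# One combined pattern, matched case-insensitively.
BLOCKED_RE = re.compile("|".join(BLOCKED_LINK_PATTERNS), re.IGNORECASE)


def edit_is_allowed(new_text: str) -> bool:
    """Return False if the submitted page text links to a blocked domain."""
    return BLOCKED_RE.search(new_text) is None
```

A wiki would call something like edit_is_allowed() on the submitted text before saving and reject the edit when it returns False.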
There is a robots meta tag for this that you can put in the head section of a single page (robots.txt works on path prefixes, so it needs subdirs), but unfortunately most webmasters are too ignorant to realize the power of these:
http://www.robotstxt.org/wc/meta-user.html
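A rough sketch of how a wiki script could emit that tag per page, in Python. The page names, the "view" action name, and the function itself are assumptions for illustration, not taken from any particular wiki engine.

```python
# Hypothetical set of pages that should never be indexed or have links followed.
NO_INDEX_PAGES = {"SandBox", "WikiSandBox"}


def robots_meta_tag(page_name: str, action: str = "view") -> str:
    """Build the robots meta tag for one rendered page.

    Anything other than a plain view of a normal page (edit forms,
    history, diffs, the sandbox) is marked noindex,nofollow.
    """
    if action != "view" or page_name in NO_INDEX_PAGES:
        content = "noindex,nofollow"
    else:
        content = "index,follow"
    return f'<meta name="robots" content="{content}">'
```

Since the tag is generated per request, this works even when every page is served by the same script and template.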
Artists against online scams http://www.aa419.org/
The problem with wikis is that they use one template for all pages, including the sandbox; everything is wiki.pl?PageName or something like that. You would have to dive into the code instead of just "using" the wiki.
I'm looking for a clean, fast, non-buggy alternative to the google giant. Preferably open source.
Any suggestions?
The only big one I know of right now is Nutch. It is an open source search engine that is in the later stages of development, but hasn't produced a large, usable site yet.
nutch.org
Since it will be open source, you will be able to read the ranking algorithms and change/abuse them as you see fit.
This one http://search.mnogo.ru/ is also available.
Slashdot Syndrome: the sudden, extreme urge to correct someone in order to validate one's self.
The real problem with Wikis is that the link will remain there, even after it has been removed from the current page, because most Wikis have a revision history feature. So what's needed is careful set up in the robots.txt file and other HTML clues for the web crawlers to exclude anything but the most current version of a page (and to skip over the other 'action' pages, like edits, etc).
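A minimal sketch of that kind of robots.txt, checked with Python's standard urllib.robotparser. It assumes a hypothetical wiki that serves edit, history and diff pages under their own path prefixes; a wiki that puts everything behind a single script with query strings would need different rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: /view/PageName is the current version,
# while /edit/, /history/ and /diff/ are the "action" pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /edit/
Disallow: /history/
Disallow: /diff/
"""


def crawler_may_fetch(url: str, user_agent: str = "*") -> bool:
    """Check a URL against the sample rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules a well-behaved crawler still reads the current page but never sees old revisions, so a spam link that survives only in the history earns nothing.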
My wiki got hit by this stupid link, but not in the sandbox. Of course, recovering the previous version of the page is easy... it's wiping out any trace of the lameness that gets trickier. I suppose the easiest way to defeat this would be to require simple registration in order to edit Wiki pages.
What else can we do? Alter the names of the submit buttons and some of the other key strings involved in Editing?
I do not have a signature
Most BB boards (including phpBB, upgrade!) and blogs (including Slashdot) now feature the visual security code for sign-up. But, of course, this does not prevent hand entry of spam...
"Who are in control, they are not in control of anything - they don't even control themselves!" - Glen Beck
I checked, and I've got documented evidence of this. On April 25 last year, I reported that earthlink.net was showing up as the top search result for queries involving various religious words, including "Bear Valley Bible Institute." The Church of Scientology (which owns Earthlink) was clearly engaging in something to distort the page rank of earthlink. I had noticed this for a long time before I recorded it.
On that same day, I reported the problem to Google via their feedback mechanism. I note today that the problem is gone.
Now if I can just do something about the "Church Of Christ at eBay Low Priced Church Of Christ. Huge Selection! (aff)" ads I keep getting on Google, I'll be happy... ;)
Secession is the right of all sentient beings.
Just set your robots.txt to exclude the user list. Or if you don't have many friends and family, send yourself an 'approve member' email. Then start training your spam filter on fake accounts.
As most spam posts have several links in them, WordPress allows setting a threshold: a comment with X or more links gets queued for moderation.
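The idea fits in a few lines of Python. This is only a sketch of the technique, not WordPress's code; the function name and the default threshold are made up.

```python
import re

# Count anything that looks like the start of an http(s) link.
LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def needs_moderation(comment: str, threshold: int = 3) -> bool:
    """Return True if the comment holds `threshold` or more links."""
    return len(LINK_RE.findall(comment)) >= threshold
```

Comments that trip the check would be held for a human to approve instead of being published right away.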
I know you're being sarcastic, but one way to prevent forged IP addresses is to require the user to "preview" their comment before posting.
tasks(723) drafts(105) languages(484) examples(29106)
So what's needed is careful set up in the robots.txt file and other HTML clues for the web crawlers to exclude anything but the most current version of a page (and to skip over the other 'action' pages, like edits, etc).
It has probably already been done in any wiki software worth its salt. Here's what MoinMoin does for example:
* It has a regexp of HTTP_USER_AGENTS which should receive a FORBIDDEN for anything except viewing a page. The default setting includes many known bots (including Google) and utilities such as wget.
* Most pages contain the appropriate robots meta tag, with the relevant noindex and/or nofollow settings.
In addition to that, the webmaster can of course set up a robots.txt file, and actually should do so because there are tools out there which don't understand the robot meta tags (or they don't want to take a performance hit) and the user agent of which can easily be changed by the user... wget comes to mind.
Of course, it shouldn't be too hard to add regexps to prevent certain links from being done, or certain hostnames or IPs from altering the site (editing pages, reverting them, deleting them).
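To illustrate the user-agent check from the first bullet above, here is a minimal Python sketch. It is not MoinMoin's actual code or configuration: the pattern, the "show" action name, and the function are assumptions.

```python
import re

# Hypothetical pattern for known crawlers and download tools.
BOT_UA_RE = re.compile(r"googlebot|slurp|wget|crawler|spider", re.IGNORECASE)


def status_for_request(user_agent: str, action: str) -> int:
    """Return the HTTP status for a request.

    Matching agents may only view pages; any other action
    (edit, diff, history, ...) gets 403 Forbidden.
    """
    if action != "show" and BOT_UA_RE.search(user_agent):
        return 403
    return 200
```

As noted above, a user agent string is trivial to fake, so a check like this only keeps honest bots out of the action pages.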
The Robots Exclusion Protocol (i.e. robots.txt).
Here's Google's stance on the subject (boils down to: if you don't want it indexed, put in a damn robots.txt file).
Hell, even Google News uses robots.txt
As any cat owner will tell you, you need to clean the sandbox out periodically. In the case of a Wiki, overnight would probably be a good idea.
Chip H.
Then again maybe that mostly says something about their popularity.
You didn't imagine it, but perhaps a clearer understanding of the technique can be achieved by reviewing the previous discussions. Here's a link to the Slashdot article that discussed this last January.
Denver Isuzu Suzuki
Update your spellchecker. It's algorithm.