The Anti-Thesaurus: Unwords For Web Searches

Posted by timothy on Monday November 19, 2001 @07:17PM from the these-are-not-the-words-you're-looking-for dept.

Nicholas Carroll writes: "In the continual struggle between search engine administrators, index spammers, and the chaos that underlies knowledge classification, we have endless tools for 'increasing relevance' of search returns, ranging from much ballyhooed and misunderstood 'meta keywords,' to complex algorithms that are still far from perfecting artificial intelligence. Proposal: there should be a metadata standard allowing webmasters to manually decrease the relevance of their pages for specific search terms and phrases."
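
A minimal sketch of how an indexer might honor such metadata, assuming a hypothetical `<meta name="nonwords">` tag with comma-separated terms (no such standard exists; the tag name, the `adjusted_score` helper, and the `penalty` factor are all illustrative assumptions):

```python
from html.parser import HTMLParser


class MetaTermParser(HTMLParser):
    """Collect terms from <meta name="keywords"> and from the
    hypothetical <meta name="nonwords"> tag (not a real standard)."""

    def __init__(self):
        super().__init__()
        self.terms = {"keywords": set(), "nonwords": set()}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.terms:
            content = attr.get("content") or ""
            # Assumption: terms are comma-separated, as in meta keywords.
            self.terms[name].update(
                t.strip().lower() for t in content.split(",") if t.strip()
            )


def adjusted_score(base_score, query, html, penalty=0.1):
    """Scale a page's base relevance down when the query is a listed nonword."""
    parser = MetaTermParser()
    parser.feed(html)
    if query.strip().lower() in parser.terms["nonwords"]:
        return base_score * penalty
    return base_score


# Hypothetical page that opts out of two search terms.
page = '<html><head><meta name="nonwords" content="dmca, printing"></head></html>'
print(adjusted_score(1.0, "DMCA", page))    # demoted by the penalty factor
print(adjusted_score(1.0, "family", page))  # left unchanged
```

Demoting rather than excluding keeps the page findable for people who really want it, which seems closest to the "decrease the relevance" wording of the proposal.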

21 of 148 comments

Sounds Good But... by TMacPhail · 2001-11-19 19:20 · Score: 3, Insightful

This sounds like a good plan, but I don't think anyone would be willing to risk having their page show up lower in a search when someone was intending to find it. Plus, anyone who finds the page in a search by accident is just a new potential customer.
1. Re:Sounds Good But... by Krimsen · 2001-11-19 19:39 · Score: 4, Interesting
  
  You are basing this on the assumption that all people are consumers and all they are searching for are goods and services. What if I am searching the web for info on the DMCA and someone's webpage was called "DMCA", short for "David, Michael, Cathy and Andrea" (or whatever)? If they find that a lot of people are coming across the page accidentally, they can lower the relevance of the page in searches for "DMCA"...
2. Re:Sounds Good But... by jaavaaguru · 2001-11-20 00:27 · Score: 3, Insightful
  
  If David, Michael, Cathy and Andrea were paying per megabyte for the bandwidth used by their site (for instance, if they required what some ISPs consider to be premium services such as ASP or PHP), they would not want everyone who was looking for DMCA information to view their site, since that would most likely more than double their bandwidth consumption. With a frequently searched-for word such as DMCA being used as a nonword for their site, they are saving both their own money and the performance of their ISP's network and servers. Another example would be if someone's surname is the same as that of a commercial organisation. They do not want all of that organisation's customers wandering into their site by accident.
  
  --
  Follow me
How about this? by NitsujTPU · 2001-11-19 19:27 · Score: 4, Insightful

Just shitlist any site that is obviously reaching for hits? If a porn site has the words "Alan Turing" in its metadata and doesn't mention anything about Turing later in the site, list them as not being allowed to participate in your search.

Hell, an engine that did that would almost be useful.
1. Re:How about this? by H310iSe · 2001-11-19 20:19 · Score: 3, Funny
  
  from webmonkey on search engine foolin' software:
  
  You can guess why: Search engine developers buy copies of the same software, learn how to recognize its output, and then demote your site or block it altogether when they spot that pattern in your pages.
  
  no hard "this site was banned" but it seems there are some who do demote/block if they catch you putting garbage in your keyword list.
  
  PS if any porn site puts 'alan turing' in their keywords I would actually want to go there - shows some imagination to say the least, gotta give them props for that...
  
  --
  closed minded is as closed minded does
2. Re:How about this? by 21mhz · 2001-11-19 21:48 · Score: 4, Informative
  
  This is where Google's PageRank(tm) system chimes in: an Alan Turing biography linked to by half a hundred sites, each with its own decent rating, will undoubtedly be rated higher than a porn site that just listed "alan turing britney spears anthrax riaa cowboyneal" in its meta keywords and is linked to by a handful among millions of similar sites. Use the great cross-linking fabric of the Web, Luke.
  
  Disclaimer: I'm in no way associated with Google.
  
  --
  My exception safety is -fno-exceptions.
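
The link-based rating 21mhz describes can be illustrated with a rough sketch. This is a simplified textbook power iteration, not Google's actual implementation; the toy link graph, the page names, and the `damping` value are assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank per page, given a dict mapping page -> pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: three sites link to the biography,
# only one spam site links to the porn page.
links = {
    "site_a": ["turing_bio"],
    "site_b": ["turing_bio"],
    "site_c": ["turing_bio", "site_a"],
    "spam_site": ["porn_page"],
    "turing_bio": [],
    "porn_page": [],
}
ranks = pagerank(links)
print(ranks["turing_bio"] > ranks["porn_page"])
```

Note that meta keywords play no part in this score at all; only who links to whom matters, which is why stuffing the keyword list does not help the porn page here.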
You know this is going to happen by Satai · 2001-11-19 19:28 · Score: 4, Funny
I can see it now. To Do lists are being written up as we speak...
1. Increase relevance for Penis Enlargement.
2. Decrease relevance for Bullshit.
I search for 'slash' and 'dot' and end up *here*?! by Overcoat · 2001-11-19 19:31 · Score: 3, Interesting

Is the phenomenon of people naming their website something that has nothing to do with the content of the website so widespread that it necessitates a new metadata tag and the consequent alteration of search engines to recognize it?
Google seems to do a good enough job of filtering out irrelevant responses as it is.
mod_rewrite is your friend by Dr.+Awktagon · 2001-11-19 19:35 · Score: 4, Insightful

Well, it's not as good/effective an idea as what this fellow is suggesting, but you can have a lot of fun with people based on their Referer fields. For instance, use it to just bounce them back to their queries, or bounce them to a different query (one for porn sites is always fun), or bounce them to a more relevant page, or fuck with them however you like. If you've ever had to set up Apache to block people from linking your images, you already know how to do it.
Bad planning by ahoehn · 2001-11-19 19:37 · Score: 5, Funny

Not such a bright idea to whine about too much traffic on your website and then get a link to your site from a slashdot article.

--
Mod my comments down. It'll be fun.
Better Metadata by nyjx · 2001-11-19 19:48 · Score: 4, Interesting

While the idea would probably do some good if widely adopted, what's really needed is to reduce the need for text-based indexing of web sites by increasing the amount of explicit semantic information about their content.
Marking up pages with information about the meaning of the terms on them is the main thrust of the work on semantic web - see http://www.daml.org/ (for DAML - the DARPA Agent Markup Language), http://www.semanticweb.org/ (One of the main information sources) and finally the new W3C activity on the subject: http://www.w3.org/2001/sw/.
How far, how fast it will go is another matter but there's certainly a lot of interest in creating a more "machine readable" web.

--
.sig
That's not going to help bandwidth by Rosco+P.+Coltrane · 2001-11-19 20:16 · Score: 3, Funny

If you replace <meta name="keywords" content="mickey mouse"> with <meta name="nonwords" content="bestiality mouse-fucking zoophilia kinky ...">, you might draw more Disney lovers and fewer perverts to your site, but I suspect your HTML file will grow quite a lot bigger ...

--
"A door is what a dog is perpetually on the wrong side of" - Ogden Nash
1. Re:That's not going to help bandwidth by Kanasta · 2001-11-20 10:07 · Score: 3, Insightful
  
  Yes, unless the same Disney lovers use filtering software, which probably won't be incredibly impressed by the number of banned words in your HTML...
Re:Proposal won't work: No incentive! by Nate+Eldredge · 2001-11-19 20:29 · Score: 5, Interesting

I work as a sysadmin for a computer science department. Until recently, the system staff would frequently get messages along the lines of
From: frankie3327@aol.com
To: staff@cs.here.edu
Subject: help!

i have a lexmark 4590 and it wont print in color. it only makes streaks. also the paper always jams. how do i fix it? please reply soon!
The senders never had any connection to the college or the department. We'd reply telling them we had no idea what they were talking about, and that they should seek help elsewhere. It was rather annoying.
We eventually figured it out. The department web site maintains a collection of help documents for users of the systems. One of them talked about how to use the department's printers, what to do if you have trouble, etc. At the bottom it listed staff@cs.here.edu as the contact address for the site.
You've probably guessed it by now. That page came up as one of the top few hits when you searched for "printing" on one of the major search engines (I forget which one). Apparently lusers would find this page, notice that it didn't answer their question, but latch on to the staff email address at the bottom, as if we were an organization dedicated to helping people worldwide with their printers. Furrfu!
I think we reworded the page to emphasize that it only applied to the college, and we haven't received any more emails lately. But if we could have kept search engines from returning it, that would have been even better. Since in our case the page was intended for internal use, we don't care whether anyone can find it from the Internet. Our real users know where to look for it.
So in answer to your question: When a search engine returns a page that doesn't answer the user's question, the user will often complain to the webmaster. That's a clear incentive to the webmaster not to have the page show up where it's not relevant. Also, it's not the goal of every site simply to be read by millions of people; some would rather concentrate on those to whom it's useful.
Re:Proposal won't work: No incentive! by Ex+Machina · 2001-11-19 21:17 · Score: 4, Informative

But if we could have kept search engines from returning it, that would have been even better. Since in our case the page was intended for internal use, we don't care whether anyone can find it from the Internet. Our real users know where to look for it.

http://www.robotstxt.org/wc/exclusion.html
A part they left out of the story; by vectus · 2001-11-19 21:29 · Score: 4, Funny

Webmasters, however, should be careful with these new "anti-words", as when they mix with their word counterpart, a gigantic explosion results.
robots.txt ? by Atrax · 2001-11-19 21:40 · Score: 3, Informative

did you have the page disallowed for search engines? if something is for internal use only, you really ought to have dropped in a robots.txt to exclude it altogether.

if more people used robots.txt, a lot of 'only useful to internal users' sites would drop right off the engines, leaving relevant results for the rest of the world...
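
A small sketch of what such an exclusion looks like and how a well-behaved crawler would interpret it, using Python's standard `urllib.robotparser`. The `/help/` path and the `ExampleBot` agent name are assumptions for illustration, not taken from the site discussed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of an internal help area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /help/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the given crawler may fetch the path under these rules."""
    return parser.can_fetch(agent, path)


print(allowed("/help/printing.html"))  # internal help page
print(allowed("/index.html"))          # public front page
```

This only works for crawlers that choose to honor robots.txt, and it removes the page from results entirely rather than lowering its relevance for particular terms.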

just a thought......

--
Screw you all! I'm off to the pub
The Semantic Web by mike_sucks · 2001-11-19 21:46 · Score: 5, Interesting

Surely this kind of issue is what Tim Berners-Lee and the W3C are trying to address with the Semantic Web.

The problem with content on the web today is that while it is perfectly readable by humans, it is incomprehensible to machines. If Tim and Co get their way, and I for one would love to see the Semantic Web catch on, then we can get rid of kluges like the Anti-Thesaurus, HTML meta keywords and the like.

--
-- "So, what's the deal with Auntie Gerschwitz et all?"
1. Re:The Semantic Web by Alomex · 2001-11-20 03:04 · Score: 3, Insightful
  
  Surely this kind of issue is what Tim Berners-Lee and the W3C are trying to address with the Semantic Web.
  
  Indeed, but how close are they to achieving anything of significance? AI has been working on a Universal Ontology for ages and gotten nowhere.
  
  The fact that Berners-Lee agrees that it would be a "cool thing to have" does not make it any more likely to happen (by the way, TB-L first proposed the semantic web almost five years ago).
What about !keyword? by Ed+Avis · 2001-11-19 23:22 · Score: 3, Informative

I thought we already had this by prefixing keywords with a ! sign. For example, the BSD FAQ used to have the line:
Keywords: FAQ 386bsd NetBSD FreeBSD !Linux

Presumably the same could be done for <meta name="keywords"> in HTML.
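
A rough sketch of how an indexer could read such a line, assuming the convention described above (a leading "!" marks a term the document does not want to match). The `split_keywords` helper is hypothetical, and I am not aware of a search engine that actually implemented this.

```python
def split_keywords(line):
    """Split a 'Keywords:' line into wanted terms and '!'-prefixed unwanted terms."""
    _, _, value = line.partition(":")
    wanted, unwanted = [], []
    for token in value.split():
        if token.startswith("!") and len(token) > 1:
            unwanted.append(token[1:].lower())
        else:
            wanted.append(token.lower())
    return wanted, unwanted


wanted, unwanted = split_keywords("Keywords: FAQ 386bsd NetBSD FreeBSD !Linux")
print(wanted)    # ['faq', '386bsd', 'netbsd', 'freebsd']
print(unwanted)  # ['linux']
```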

--
-- Ed Avis ed@membled.com
mod_rewrite reference, examples by Dr.+Awktagon · 2001-11-20 05:20 · Score: 3, Informative

Well some docs are here, and the mod_rewrite reference is here.

Here is a goofy example that does a redirect back to their google query, except with the word "porn" appended to it. As an added bonus, it only does it when the clock's seconds are an even number. (Or do the same test to the last digit of their IP address). Replace the plus sign before "porn" with about 100 plus signs and they won't see the addition because each plus sign becomes a space. The "%1" refers to their original query.
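RewriteEngine On
# Only fire when the current second ends in an even digit
RewriteCond %{TIME_SEC} [02468]$
# Visitor arrived from a Google search results page
RewriteCond %{HTTP_REFERER} google\.com/search [NC]
# Capture the visitor's original query as %1
RewriteCond %{HTTP_REFERER} [?&]q=([^&]+)
RewriteRule . http://www.google.com/search?q=%1+porn [R=temp,L]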
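
To see what the two Referer conditions actually capture, here is a small Python sketch that applies the same regular expressions to a sample Referer string. The function name and the sample URLs are made up for illustration; it mirrors the patterns only, not mod_rewrite's full behavior.

```python
import re


def search_query_from_referer(referer):
    """Return the raw q= value if the Referer looks like a Google search URL."""
    # Same pattern as the first HTTP_REFERER condition, case-insensitive like [NC].
    if not re.search(r"google\.com/search", referer, re.IGNORECASE):
        return None
    # Same pattern as the second condition; group 1 corresponds to %1.
    match = re.search(r"[?&]q=([^&]+)", referer)
    return match.group(1) if match else None


print(search_query_from_referer("http://www.google.com/search?hl=en&q=alan+turing"))
print(search_query_from_referer("http://example.com/page?q=alan+turing"))
```

The captured value is still URL-encoded (spaces appear as plus signs), which is why it can be pasted straight back into a new search URL.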

Here's another one that checks the User-Agent for a URL, and then redirects to it. This keeps most spiders and stuff off your pages since they usually put their URLs in the User-Agent:
RewriteEngine On
# Capture a URL embedded in the User-Agent string as %1
RewriteCond %{HTTP_USER_AGENT} "(http://[^ )]+)"
# Send the client to that URL with a permanent redirect
RewriteRule . %1 [R=permanent,L]

Anything you can think of is possible. I think you can even hook it into external scripts.