Publishers Seek Change in Search Result Content
explosivejared writes "The Washington Post is running a story on the fight between publishers and search engines over just what exactly is allowed to be shown by search results. From the article: 'The desire for greater control over how search engines index and display Web sites is driving an effort launched yesterday by leading news organizations and other publishers to revise a 13-year-old technology for restricting access. Currently, Google, Yahoo and other top search companies voluntarily respect a Web site's wishes as declared in a text file known as robots.txt, which a search engine's indexing software, called a crawler, knows to look for on a site ... [new] proposed extensions, known as Automated Content Access Protocol, partly grew out of those disputes. Leading the ACAP effort were groups representing publishers of newspapers, magazines, online databases, books and journals. The AP is one of dozens of organizations that have joined ACAP.'"
When I submitted this I added that a lot of times the more I see in a search result, the more likely I am to hit that website. I know going in that the search engine isn't going to have the full story. It's a summary. That being said, I submitted this to point out the misstep I think publishers are taking. Search engines and aggregators drive their business, and usually they do it for free. I don't understand why anyone would think it would be a good idea to mess with that. Hopefully someone can explain this to me, as the stuff in the article led me to believe the publishers are making a big mistake.
I got a catholic block.
Hmm, I wonder how long before someone opens a search engine that indexes only what is "hidden" (yeah, really...) by the ACAP settings.
Just don't do it in the US or someone will tell the judge: "The defendant knowingly circumvented the DRM - which is called ACAP - of our online newspaper".
ACAP - Anonymous Coward Anonymously Posting
My one lasting legacy on the web ...
...
Back in 1993, when I was teaching myself Perl in my spare time (while working for a -- cough -- UNIX company called The Santa Cruz Operation -- no relation to the current Utah asshats of that name), I was practicing by working on a spider. Now, back then SCO's Watford engineering centre was connected to the internet by a humongous 64kbps leased line. And I was working with a variety of sources on robots, and it just so happened that because I was doing a deterministic depth-first traversal of the web (hey, back then you could subscribe to the NCSA "what's new on the web" bulletin and visit all the interesting new websites every day before your coffee cooled), I kept hitting on Martin Kjoster's website. And Martin's then employers (who were doing something esoteric and X.509 oriented, IIRC) only had a 14.4kbps leased line. (Yes, you read that right: a couple of years later we all had faster modems, but this was the stone age.)
Eventually Martin figured out that I was the bozo who kept leeching all his bandwidth, and contacted me. Throttling and QoS stuff was all in the future back then, so he went for a simpler solution: "Look for a text file called /robots.txt. It has a list of stuff you are not to pull in. Obey it, or I yell at your sysadmins." And so, I guess, my first attempt at a spider was also the first spider to obey the embryonic robot exclusion protocol. Which Martin subsequently generalized and which got turned into a standard.
So if you're wondering why robots.txt is rather simplistic and brain-dead, it's because it was written to keep this rather simplistic and brain-dead perl n00b from pillaging Martin's bandwidth.
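For anyone who hasn't looked at how simple the check is, here is a rough sketch of a spider filtering its URL list against a robots.txt, using Python's standard urllib.robotparser. The robots.txt contents, the "toy-spider" user agent, the example.com URLs and the allowed_urls helper are all made up for illustration; this is not the original Perl spider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical crawl queue.
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
    "http://example.com/cgi-bin/search",
]

print(allowed_urls(ROBOTS_TXT, "toy-spider", candidates))
```

Nothing enforces any of this. The spider has to choose to run the check before it fetches, which is why the article calls compliance voluntary.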
Ah, the good old days when you could accidentally make someone invent a new protocol before breakfast.
That was largely my thought. It makes very little sense that anybody would blind-click on a link in this day and age. I personally depend upon the summaries to decide whether or not to click. If I don't get a summary I don't click.
It would make far more sense for these institutions to just take their sites completely off of the search engines via robots.txt and save up those slots in the search results for sites that want traffic. Or perhaps limit it to just the front page, but I think that one can still do that with a competently crafted robots.txt as well.
Seems to be a lot of people slightly upset over this. But I think this is a good thing. They already have the ability to stop search engines from indexing at all. Now they have much more fine-grained control. They can also make their results more useful by setting expiry dates. Presumably they'll also be able to be more specific about what the summary says, and it might actually be more useful.
Now some sites will probably want to over control, but they'll lose out.
Specifically, this seems geared towards sites like Google News that aggregate stories and then publish snippets of them on their home page.
Personally, I don't really see the problem. You either want your site spidered or you don't. You don't get to control the presentation of the data that is spidered, only the search engines get to do that.
So the thing here is that Google takes its ordinary web spider, applies a little magic to it, and then displays the results as a news page. Big deal.
You either want your site spidered or you don't. You can't have your cake and eat it too.
Those things are stupid. Were I Google, I'd put up something on my website that made them consent to my terms, or forgo indexing entirely. I can't blame them for wanting more control, but I don't think they should get it. I don't trust them at all.
If the news and book sites wanted to keep the search engines out, they would just set up their robots.txt files to block all access. Then they would never show up on Google. They don't want to do that because they know it would be death to them. Google doesn't supply any content, but it does supply a service: It's the first place people go to find out information. If they need more than a summary, they can click on links from the summary page to get details. People aren't going to go to ten websites to look for something if they can start at one place.
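To make "block all access" concrete, here is a small sketch of what that robots.txt looks like, next to one that shuts out only a single crawler, checked with Python's standard urllib.robotparser. The crawler names "newsbot" and "searchbot", the example.com URL and the can_crawl helper are hypothetical, chosen only for the demonstration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Report whether robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Shuts every compliant crawler out of the whole site.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

# Shuts out one crawler ("newsbot", a made-up name) and leaves the rest alone.
# An empty Disallow line means nothing is off limits.
BLOCK_ONE_CRAWLER = """\
User-agent: newsbot
Disallow: /

User-agent: *
Disallow:
"""

URL = "http://example.com/story.html"  # hypothetical article URL

print(can_crawl(BLOCK_EVERYONE, "searchbot", URL))     # False
print(can_crawl(BLOCK_ONE_CRAWLER, "newsbot", URL))    # False
print(can_crawl(BLOCK_ONE_CRAWLER, "searchbot", URL))  # True
```

Two lines are enough to vanish from every search engine that honors the file, so the fact that publishers don't use them says a lot about how much they want the traffic.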
You are right: If the search engines disappeared, the big news services wouldn't care. Actually, they would probably enjoy it, because people would go to the New York Times, Washington Post, and other big-name sites rather than seeing these smaller sites with better reporting and commentary. But you contradict yourself as well. You say that if the search engines disappeared, the internet would just create more, but then you say that if the big news services stopped providing news, the search engines would die. No they wouldn't. The internet would create more, filling the need.
If the news sites want to control their content better, fine. But I guarantee you the next whine you will hear from them is how Google isn't directing traffic to their websites and it must all be retribution by Google for being made to limit what it displays, rather than people clicking on sites where they can read the summary.
The organized synthesis and presentation of this content is, in itself, useful content. The number of people using news aggregators should have clued you in on this.