Publishers Seek Change in Search Result Content
explosivejared writes "The Washington Post is running a story on the fight between publishers and search engines over just what exactly is allowed to be shown by search results. From the article: 'The desire for greater control over how search engines index and display Web sites is driving an effort launched yesterday by leading news organizations and other publishers to revise a 13-year-old technology for restricting access. Currently, Google, Yahoo and other top search companies voluntarily respect a Web site's wishes as declared in a text file known as robots.txt, which a search engine's indexing software, called a crawler, knows to look for on a site ... [new] proposed extensions, known as Automated Content Access Protocol, partly grew out of those disputes. Leading the ACAP effort were groups representing publishers of newspapers, magazines, online databases, books and journals. The AP is one of dozens of organizations that have joined ACAP.'"
When I submitted this I added that a lot of times the more I see in a search result, the more likely I am to hit that website. I know going in that the search engine isn't going to have the full story. It's a summary. That being said, I submitted this to point out the misstep I think publishers are taking. Search engines and aggregators drive their business, and usually they do it for free. I don't understand why anyone would think it would be a good idea to mess with that. Hopefully someone can explain this to me, as the stuff in the article led me to believe the publishers are making a big mistake.
I got a catholic block.
Hmm, I wonder how long before someone opens a search engine that indexes only what is "hidden" (yeah, really...) by the ACAP settings.
Just don't do it in the US or someone will tell the judge: "The defendant knowingly circumvented the DRM - which is called ACAP - of our online newspaper".
ACAP - Anonymous Coward Anonymously Posting
Personally, I think that it's useful for Google and other search engines to show what's truly relevant when you're searching for a page. The fact is, I'm more likely to click on a search result if I can see some of the actual content, and more specifically, the actual text or images that I was looking for. If they don't show me what I want to see, I won't see the rest of it. If it only shows some text that they decide I should see, then it makes it much harder to determine what I'm actually looking at. Even as it is now, when results come up that are ambiguous, I find myself less likely to click on them. I readily admit that robots.txt is getting old and isn't really enough any more, but I'm not sure if what they're proposing is the right answer. Additionally, if Google were to implement a new method of searching using ACAP, then what would happen to the sites using the old methods? Would they not be indexed? What if I want all my material to be shown and I don't feel like going through and choosing every little detail about what to allow and not to allow? It's an idea worth looking at, but it's nowhere near a finished, usable idea.
I really wish that the AP and other similar entities would realize that no matter the legal backing of their terms and conditions of redistribution, very few people actually care, and people care less every day. At Burger King, they provide a copy of the newspaper. Does the AP get money for every reader? I think not. This is just as ridiculous as it would be if they tried to make Burger King pay for every person who reads the newspaper while in the restaurant.
Shiny. Let's be bad guys.
http://www.the-acap.org/project-documents.php
At first glance it appears to be a set of extensions to robots.txt that allow newspapers to specify things like:
This article will disappear from our site in N days, so it better disappear from search engines at the same time
Don't frame this article
Don't extract images or thumbnails from this article
If you show a cached copy of this article, it better include the original ads
etc.
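To make that list concrete, here is a rough sketch of those restrictions as a data model. This is NOT the ACAP syntax (I haven't checked the spec that closely); `ArticlePolicy`, its field names, and `may_show` are all made up for illustration, and the dates are arbitrary.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class ArticlePolicy:
    """Hypothetical per-article restrictions, one field per bullet above."""
    published: date
    keep_days: Optional[int] = None   # drop from results after N days; None = keep
    allow_framing: bool = True        # "don't frame this article"
    allow_thumbnails: bool = True     # "don't extract images or thumbnails"
    cache_requires_ads: bool = False  # cached copy must carry the original ads


def may_show(policy: ArticlePolicy, today: date) -> bool:
    """True while a cooperating search engine may still list the article."""
    if policy.keep_days is None:
        return True
    return today < policy.published + timedelta(days=policy.keep_days)


# Arbitrary example: an article that should vanish 30 days after publication.
policy = ArticlePolicy(
    published=date(2024, 1, 1),
    keep_days=30,
    allow_framing=False,
    allow_thumbnails=False,
    cache_requires_ads=True,
)
still_listed = may_show(policy, date(2024, 1, 30))
expired = not may_show(policy, date(2024, 1, 31))
```

Note that every one of these only works if the search engine chooses to honour it, same as robots.txt today.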
If you don't want anything to be indexed or archived, it needs to be behind a secure connection or NOT POSTED AT ALL.
Here's a tip:
If you don't want something to become public knowledge -- accessible by anyone -- then don't put it on the internet.
Modern copyright is theft of culture from everyone and it retards the progress of the useful arts and sciences.
My one lasting legacy on the web ...
Back in 1993, when I was teaching myself Perl in my spare time (while working for a -- cough -- UNIX company called The Santa Cruz Operation -- no relation to the current Utah asshats of that name), I was practicing by working on a spider. Now, back then SCO's Watford engineering centre was connected to the internet by a humongous 64kbps leased line. And I was working with a variety of sources on robots, and it just so happened that because I was doing a deterministic depth-first traversal of the web (hey, back then you could subscribe to the NCSA "what's new on the web" bulletin and visit all the interesting new websites every day before your coffee cooled), I kept hitting on Martijn Koster's website. And Martijn's then employers (who were doing something esoteric and X.509 oriented, IIRC) only had a 14.4kbps leased line. (Yes, you read that right: a couple of years later we all had faster modems, but this was the stone age.)
Eventually Martijn figured out that I was the bozo who kept leeching all his bandwidth, and contacted me. Throttling and QoS stuff was all in the future back then, so he went for a simpler solution: "Look for a text file called /robots.txt. It has a list of stuff you are not to pull in. Obey it, or I yell at your sysadmins." And so, I guess, my first attempt at a spider was also the first spider to obey the embryonic robot exclusion protocol. Which Martijn subsequently generalized and which got turned into a standard.
So if you're wondering why robots.txt is rather simplistic and brain-dead, it's because it was written to keep this rather simplistic and brain-dead perl n00b from pillaging Martijn's bandwidth.
Ah, the good old days when you could accidentally make someone invent a new protocol before breakfast.
As I understand it the main purpose of robots.txt is to prevent crawlers from consuming excessive amounts of network resources, not to "protect content". It's not a contract; it's not legally-binding; it's a request that automated web agents choose to follow if they want to be polite, or rather a description of how to be polite in the context of a certain site. (Nobody wants crawlers to be indexing dynamically-generated pages, for instance.) As an example, the physics preprint archive arXiv.org has a rather sternly-worded warning: "Follow our robots.txt file or you'll wander off into terabytes of dynamically-generated files, chewing up lots of our bandwidth, and we'll have to ban you to protect our bandwidth bill." That's what it's for, not "protecting content".
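For anyone who hasn't seen how a polite crawler actually uses the file, here is a minimal sketch with Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user-agent, and the example.org URLs are made up for illustration; this is not arXiv's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: keep every crawler out of a
# dynamically generated area and allow everything else.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def make_parser(rules_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Ask whether the rules permit this fetch. Purely advisory:
    nothing stops a crawler that never bothers to call this."""
    return parser.can_fetch(agent, url)


parser = make_parser(RULES)
allowed_page = may_fetch(parser, "ExampleBot", "http://example.org/abs/1234")
blocked_page = may_fetch(parser, "ExampleBot", "http://example.org/cgi-bin/search?q=x")
```

The check lives entirely on the crawler's side, which is the whole point: the file describes what polite behaviour looks like, it doesn't enforce anything.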
Banning Google from visiting a page and then summarizing its result on a search page is pretty much equivalent to Slashdot banning me from saying "There's this article at goatpron.slashdot.org/whatever that has a description of goat bestiality that I think you might find interesting".
As long as the summaries are sufficiently short so that they fall under the fair use exception (which Google search results surely do), Google can keep on doing what they're doing.
When information is power, privacy is freedom.
Note that robots.txt, favicon.ico and /w3c/p3p have been raised as issues for the W3C Technical Architecture Group:
http://www.w3.org/2001/tag/group/track/issues/36
See Tim B-L's original mail here:
http://lists.w3.org/Archives/Public/www-tag/2003Feb/0093
One can only hope that any new efforts keep this issue in mind (hint: stop polluting *everyone's* namespace!).