Publishers Seek Change in Search Result Content
explosivejared writes "The Washington Post is running a story on the fight between publishers and search engines over just what exactly is allowed to be shown by search results. From the article: 'The desire for greater control over how search engines index and display Web sites is driving an effort launched yesterday by leading news organizations and other publishers to revise a 13-year-old technology for restricting access. Currently, Google, Yahoo and other top search companies voluntarily respect a Web site's wishes as declared in a text file known as robots.txt, which a search engine's indexing software, called a crawler, knows to look for on a site ... [new] proposed extensions, known as Automated Content Access Protocol, partly grew out of those disputes. Leading the ACAP effort were groups representing publishers of newspapers, magazines, online databases, books and journals. The AP is one of dozens of organizations that have joined ACAP.'"
When I submitted this I added that a lot of times the more I see in a search result, the more likely I am to hit that website. I know going in that the search engine isn't going to have the full story. It's a summary. That being said, I submitted this to point out the misstep I think publishers are taking. Search engines and aggregators drive their business, and usually they do it for free. I don't understand why anyone would think it would be a good idea to mess with that. Hopefully someone can explain this to me, as the stuff in the article led me to believe the publishers are making a big mistake.
I got a catholic block.
sites can say what they want shown in the result...
the day i can start blacklisting results.
i get enough ads already.
Hmm, i wonder how long before someone opens a search engine that indexes only what is "hidden"(yeah, really...) by the ACAP settings.
Just don't do it in the US or someone will tell the judge: "The defendant knowingly circumvented the DRM - which is called ACAP - of our online newspaper".
ACAP - Anonymous Coward Anonymously Posting
from TFA ""The free riding deprives AP of economic returns on its investments," he said."
same old rule applies; never trust anyone who uses business terms like ROI, for he cares not for you or society, but only for what he can remove from your wallet, without getting arrested over it.
Personally, I think that it's useful for Google and other search engines to show what's truly relevant when you're searching for a page. The fact is, I'm more likely to click on a search result if I can see some of the actual content, and more specifically, the actual text or images that I was looking for. If they don't show me what I want to see, I won't see the rest of it. If it only shows some text that they decide I should see, then it makes it much harder to determine what I'm actually looking at. Even as it is now, when results come up that are ambiguous, I find myself less likely to click on them. I readily admit that robots.txt is getting old and isn't really enough any more, but I'm not sure if what they're proposing is the right answer. Additionally, if Google were to implement a new method of searching using ACAP, then what would happen to the sites using the old methods? Would they not be indexed? What if I want all my material to be shown and I don't feel like going through and choosing every little detail about what to allow and not to allow? It's an idea worth looking at, but it's not anywhere near a finished, usable idea.
I really wish that the AP and other similar entities would realize that no matter the legal backing of their terms and conditions of redistribution very few people actually care, and people care less every day. At Burger King, they provide a copy of the newspaper. Does the AP get money for every reader? I think not. This is just as ridiculous as it would be if they tried to make Burger King pay for every person who reads the newspaper while in the restaurant.
Shiny. Let's be bad guys.
http://www.the-acap.org/project-documents.php
At first glance it appears to be a set of extensions to robots.txt that allow newspapers to specify things like:
This article will disappear from our site in N days, so it better disappear from search engines at the same time
Don't frame this article
Don't extract images or thumbnails from this article
If you show a cached copy of this article, it better include the original ads
etc.
If you don't want anything to be indexed or archived, it needs to be behind a secure connection or NOT POSTED AT ALL.
Here's a tip:
If you don't want something to become public knowledge -- accessible by anyone -- then don't put it on the internet.
Modern copyright is theft of culture from everyone and it retards the progress of the useful arts and sciences.
My one lasting legacy on the web ...
Back in 1993, when I was teaching myself Perl in my spare time (while working for a -- cough -- UNIX company called The Santa Cruz Operation -- no relation to the current Utah asshats of that name), I was practicing by working on a spider. Now, back then SCO's Watford engineering centre was connected to the internet by a humongous 64kbps leased line. And I was working with a variety of sources on robots, and it just so happened that because I was doing a deterministic depth-first traversal of the web (hey, back then you could subscribe to the NCSA "what's new on the web" bulletin and visit all the interesting new websites every day before your coffee cooled), I kept hitting on Martin Kjoster's website. And Martin's then employers (who were doing something esoteric and X.509 oriented, IIRC) only had a 14.4kbps leased line. (Yes, you read that right: a couple of years later we all had faster modems, but this was the stone age.)
Eventually Martin figured out that I was the bozo who kept leeching all his bandwidth, and contacted me. Throttling and QoS stuff was all in the future back then, so he went for a simpler solution: "Look for a text file called /robots.txt. It has a list of stuff you are not to pull in. Obey it, or I yell at your sysadmins." And so, I guess, my first attempt at a spider was also the first spider to obey the embryonic robot exclusion protocol. Which Martin subsequently generalized and which got turned into a standard.
So if you're wondering why robots.txt is rather simplistic and brain-dead, it's because it was written to keep this rather simplistic and brain-dead perl n00b from pillaging Martin's bandwidth.
Ah, the good old days when you could accidentally make someone invent a new protocol before breakfast
You would think an article about ACAP would provide a link to it.
As I understand it the main purpose of robots.txt is to prevent crawlers from consuming excessive amounts of network resources, not to "protect content". It's not a contract; it's not legally-binding; it's a request that automated web agents choose to follow if they want to be polite, or rather a description of how to be polite in the context of a certain site. (Nobody wants crawlers to be indexing dynamically-generated pages, for instance.) As an example, the physics preprint archive arXiv.org has a rather sternly-worded warning: "Follow our robots.txt file or you'll wander off into terabytes of dynamically-generated files, chewing up lots of our bandwidth, and we'll have to ban you to protect our bandwidth bill." That's what it's for, not "protecting content".
Banning Google from visiting a page and then summarizing its result on a search page is pretty much equivalent to Slashdot banning me from saying "There's this article at goatpron.slashdot.org/whatever that has a description of goat bestiality that I think you might find interesting".
As long as the summaries are sufficiently short so that they fall under the fair use exception (which Google search results surely do), Google can keep on doing what they're doing.
I understand completely. I too would like to stop my nosy neighbors from peering at me out of their window when I leave my house in the morning. My plan is to implement "pay per stare" at some point in the future but they aren't gonna pay if they can get their jollies for free. I blame the "Sun" and "street lamps" and "glass" and other devices that interfere with my ability to effect sole distribution over the intellectual property that is my personal image. Well, at the very least, I should be able to sue torch/flashlight manufacturers into oblivion and then use my deserved winnings to tackle the big boys 150 gigameters away.
There are two kinds of people: 1) those who start arrays with one and 1) those who start them with zero.
from an HTML page I like a well written title
Don't want it shown, then hide it. Lazy fuckers should learn about structuring content rather than bitch about search engines.
A feeling of having made the same mistake before: Deja Foobar
Note that robots.txt, favicon.ico and /w3c/p3p have been raised as issues for the W3C Technical Architecture Group:
http://www.w3.org/2001/tag/group/track/issues/36
See Tim B-L's original mail here:
http://lists.w3.org/Archives/Public/www-tag/2003Feb/0093
One can only hope that any new efforts keep this issue in mind (hint: stop polluting *everyone's* namespace!).
Doesn't it make more sense to show your content on a search engine? Help people find what they're looking for? And get a view of what's being searched? Or does this apply more to paid publications?
These days I find most things on the web by searching, not by following links. If these people want to cut themselves off from the world by refusing to allow search engines to catalog them, why not? People whose work is inaccessible to most because their publishers refuse to let it be on search engines will soon decide that they no longer need a publisher.
CPU cycles and storage are cheap now, you'll just have to run your own search engine server..
I know my position is very un-slashdotish, but there is nothing wrong with content producers wanting to control how their content, in particular the stuff they paid to generate, is indexed. It's not that they don't want you to see the content, it's that they want to control how you see that content. They want it wrapped in their page, with ads, and not summarized on a search page. Egads, what if you read the summary and decided not to visit the site after all?
Fine. But as we all know, we probably have a few sites that we bookmark and visit often. We probably get a lot of news from RSS. But a lot of people are directed to sites via search engines. So if a content producer, say a newspaper, doesn't want its content indexed, then fine. It will only result in a LOSS of traffic to their site.
Look, content producers have to make money. They have people to pay, stuff to print, etc. They have expenses. It is truly sad that rather than trying to figure out how to make content relevant and useful, some content producers simply want to continue analog methods in a digital world.
Gee, just a thought, but what about a way to display a summary and an ad chosen by the content producer along with the summary? Advertisers would spend lots for that kind of exposure.
If you put it on the internet, and users are meant to access it, why should search engines differentiate any content based on probably arbitrary criteria? If pay sites restrict content and give out special logins for paying users, search engines cannot index it and the content is kept 'private'. If a site has non-restricted content (that is, content not restricted by a special login), then why shouldn't it be indexed? It would be a disadvantage to the end user, because they cannot find the content as easily (especially if the web site's search engine sucks) and it would be a disservice to the content provider, since their site would be less likely to show up in search results. What is the point? Is this the same thing as people disabling right-click on certain web sites to try and prevent you from 'stealing' content, the same content that is available in your cache, and that would be illegal to use if the content is copyrighted anyway? Is this the same thing as people embedding pictures in flash for the same reason?
If all of this results in less usable, less indexable content and more annoyances, just to restrict the way content is accessed and viewed, then that's not the web anymore, that's not really in the spirit of the internet... why not just stick to print or something? And then have it in a special store where you can only buy it with some currency you made up, with an exchange rate you control? Oh, and have a special door for the store that can only be opened with a special device you have to order! Er, anyway... I hope you can understand my point.
Twinstiq, game news
Once Google admitted it can and will/does filter search results, it opened the floodgates for stuff like this.
Don't say i didn't tell you so....
---- Booth was a patriot ----
Bustin' ACAP in Google's ass.
If these publishers want to own the search engines, then they should build their own! These engines do them a favor. This is no different than the music publishers trying to control the bands and how they get paid.
I prefer the "u" in honour as it seems to be missing these days.
I think the mistake we're making here is that we're assuming most folks consume their news like we do. Sorry to generalize but I believe most of us seek to become informed and thoroughly review and critique what we read. However, most people are satisfied with tidbits and in fact want nothing more. For example, the macabre are satisfied with a headline like "Multiple Car Accident Kills 50" and a thumb of the pile up... the nosy like "Brad Wears Ugly Glasses For the First Time" and a thumb... etc. Yes those are terrible headlines and hyperbole to make my point. Imagine a search engine unlike Google which provides summaries of multiple sources offering these tidbits in a single page without the source's ads? Oh wait http://www.ask.com/ and perhaps others although I'm stating solely that they have such a type of offering and not that they do so violating any rules.
I'm against most tactics that appear to be an organization seeking to squash an alternative or new and unknown element they think is encroaching on their bottom line and this move smells of it but feel it's a rare case of smoke without an actual fire. Just wanted to throw that out there while I seek more info on this tidbit.
That's just my POV... no more, no less.
Seems to be a lot of people slightly upset over this. But I think this is a good thing. They already have the ability to stop search engines from indexing at all. Now they have much more fine-grained control. They can also make their results more useful by setting expiry dates. Presumably they'll also be able to be more specific about what the summary says, which might actually be more useful.
Now some sites will probably want to over control, but they'll lose out.
A bunch of publishing organizations have gathered together and are attempting to create an Internet standard for restricting searchable content.
They haven't involved Google, Yahoo, or Microsoft in the process. In fact, the only search company they mention in their FAQ is Exalead, whom I didn't think I'd even heard of (though now I think I may have once downloaded their desktop trial product).
This is going to be implemented how?
In related news, I have issued a new policy for how I (and anyone who joins my club) am to be treated in airport security lines. I will be publishing this policy on my home page, and I am certain it will win widespread adoption among travelers.
Q:Have you discussed this with security administrators?
A:In addition to the many travelers who have co-signed the new policy, we have an agreement-in-principle from Madge, the security and commissary chief at the fourth-largest regional airport in greater Bozeman.
There's a lot more to be said about the Coward part of that. When you post here on slashdot, Anonymous Coward just means you don't wish to be specifically identified. There's not a whole lot of coward, there, just a joke from way back. With this, the publishers are being incredibly cowardly: They're publishing material, but hiding it from view. There's no good reason that publicly accessible content on their site should be 'hidden', and that's the most cowardly thing of all.
Added complexity (and user control) most likely introduces new possibilities for abuse, which is already a major problem for search engines these days.
*cough*
Specifically, this seems geared towards sites like Google News that aggregate stories and then publish snippets of them on their home page.
Personally, I don't really see the problem. You either want your site spidered or you don't. You don't get to control the presentation of the data that is spidered, only the search engines get to do that.
So the thing here is that Google takes its ordinary web spider, applies a little magic to it, and then displays the results as a news page. Big deal.
You either want your site spidered or you don't. You can't have your cake and eat it too.
My blog
> The desire for greater control over how search engines index and display Web sites
Then design your sites better. Seriously. When I was on the team that launched http://jacksonville.com/, we spent a decent amount of time thinking about how to optimize our site for search engines, and that was 10 years ago. Too much showing? Not enough showing? Spend more time developing and designing your site instead of trying to emulate your print product (ahem ... *cough http://nytimes.com/ cough*)
Bark less. Wag more.
That being said, I submitted this to point out the misstep I think publishers are taking. Search engines and aggregators drive their business, and usually they do it for free. I don't understand why anyone would think it would be a good idea to mess with that.
This being Slashdot, I predict that huge numbers of people will now arrive in this thread and say that you're absolutely right, the search engines are providing a great service, and the publishers should just suck it up because they'd die without them.
The thing is, they're completely wrong. It's actually the other way around, for the simple reason that news aggregators produce no useful content of their own.
For you or me, as someone who wants to know what's happening today, we can do one of two obvious things using a web browser. We can visit a specific news site we already know about (or at least guess at a URL), or we can start with an aggregator like Google News. Either way, many people will only read the headlines and summary for most stories. Either way, someone had to go out and get the information to write that story. But in one case, the people who brought the knowledge to the public get the page hit, while in the other, the search engine gets the hit in exchange for ripping much of the value of the other sites' content and the people who actually provided the content get nada.
It's common at this point for someone to pipe up with a fair use argument, but again, they are wrong, for the simple reason that while the headlines and summaries on news aggregators may only be small excerpts from the entire article, they represent a very significant chunk of the value. You can easily determine this by observing the proportion of users who look something up on an aggregator and never follow through to read any article in more detail; I don't know exactly what the answer is, but I'll wager it's a substantial proportion, perhaps even the majority.
Another common argument is that the news sites would die without input from search engines, but again I can't believe this is really true. When I reach lunchtime at work, I do not visit Google to find the BBC News web site, I just type in news.bbc.co.uk. (Actually, I visit the bookmark, but the first time that's what I typed.) Google, or any other news aggregator, is wholly unnecessary to my finding the main news site. Even without that, I could easily have guessed that the BBC News web site could be reached at www.bbc.co.uk/news or news.bbc.co.uk, either of which would have got me there immediately. The site is advertised via the BBC's other media as well. A significant proportion of the links I e-mail to and receive from friends and family are direct links to stories on the site.
Basically, if every search engine on the planet disappeared tomorrow, I rather doubt the big news services would care. As with everything else to do with search engines, they are just a middleman service, and one that is entirely expendable. If they weren't around, the Internet community would just develop an alternative or five, probably rather quickly, just as it always does.
On the other hand, if the big news services stopped providing news tomorrow, aggregator services like Google News would be completely dead, because they provide absolutely no value in themselves. They simply scrounge content from one source and visitors from another, and insert themselves as a middle man to cream off some of the profits.
The very fact that one service could survive quite happily without the other, while the other would die immediately without the first, tells us everything we need to know about the merits and public service benefits of each. That being the case, I find it hard to argue with the publishers' position that the news aggregators are basically ripping them off, and I don't really have much sympathy with the two most common counter-arguments people seem to be making in this Slashdot discussion.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
...there's nothing quite like watching the traditional, embattled news sources "innovate" themselves right out of existence. They were slow to respond to the web and didn't understand it when they first did (they've gotten better), and now they're going to ACAP themselves into obscurity. Way to go guys! You're the bleeding edge of reporting!
Those things are stupid. Were I Google, I'd put up something on my website that made them consent to my terms, or forgo indexing entirely. I can't blame them for wanting more control, but I don't think they should get it. I don't trust them at all.
If search engine caching of their content is hurting these publishers, then they would use currently-supported methods to keep crawlers out:
User-agent: *
Disallow: /
Oh, but that's right, they do want to be indexed in search engines because it increases their revenue.
So, what's the problem, again?
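For anyone who wants to see what those two lines actually do, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot` and the example.com URL are placeholders, and the check only works because the crawler chooses to run it; nothing on the server enforces it.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above: every crawler, every path.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and example.com are made-up placeholders.
STORY = "https://example.com/news/story.html"
print(allowed(BLANKET_BLOCK, "ExampleBot", STORY))  # blanket block in place
print(allowed("", "ExampleBot", STORY))             # empty robots.txt
```

A polite crawler would fetch the site's real /robots.txt and run a check like this before every request.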
What is it that they're losing by having this information indexed on Google again? It's just a summary; if they want to read the article, they still have to go to the target site. If Google didn't index their content, they'd get a lot fewer readers. It's not like people are going to Google news and then deciding that they got enough information from the summary to not bother reading the article.
i love it when you karma whore, willy. it gives me a woody every time.
I was curious, so I tried it on a few sites. Interesting stuff. Error logs, admin login pages, cms logins, "test" pages,... I didn't try logging in to any of them, but I do wonder how hard it would be. I'm guessing not very.
If these guys want anybody to pay attention, they should submit their protocol as an RFC. Their "standards document" is badly written. It has statements like "Features that are ready for implementation now, but only for use in crawler communication by prior arrangement, are labelled with an amber spot. These represent a minority of extensions for which there are possible security vulnerability or other issues in their implementation on the web crawler side, such as creating possible new opportunities for cloaking or Denial of Service attack." One such problem is that they stuck in a redirect mechanism that directs the crawler to pull data from another domain. Then they put in mealy-mouthed phrases like "It is recommended that, if possible, the URI should normally specify a resource within the crawled resource and not external to it, as this is less likely to present technical and security difficulties to the crawler.". This reads like something from a committee that doesn't have to make it work. They need to formally address the issue of the security scope of a robots.txt file, not hand-wave around the problem.
That's no good. Somebody competent and familiar with IETF procedures will have to overhaul this.
If you run over some old lady, it's not the car's fault. Nor is it the car maker's fault. It's your fault for using the car poorly.
I found out about this story yesterday on (ta-da!) Google News. A little more searching (again via Google) led me to their web site. Interestingly, I could not find any information there about who constitutes the ACAP membership. The ACAP site lacks a search tool ... (surprise, surprise) so back to ... Google for more searching which eventually leads to this page. No doubt Yahoo or MSN search would have led to the same findings. The Wikipedia article has a short list of the main suspects; doubtless there are others, like AP.
I just want to know who to add to my /etc/hosts file so that I don't accidentally view any of their sacred content. They don't want me to be able to find their stuff? Fine, I'll be happy to give 'em what they want.
If you want your life to be different, live it differently.
This might be because I'm slightly colorblind, but the colors on the ACAP page make my eyes bleed.
If they don't like it, remove them from the index. Watch how fast they shut their pie-holes then.
Yes, I am a smart ass; it's better than the alternative.
In my opinion, google would be insane to agree to any restriction other than telling the sites "if they don't want to be in google, we let any site opt out already". Google has all of the power - if a site doesn't exist in google, it does not exist.
Ok, regardless of the legal fairness, I'd think removing those previews would actually reduce the likelihood of me visiting such a site.
Almost never do I see a Google result and say "Ok, I know all I needed to, not going to click." More often, I see one and say "Gee, looks like that site won't be very helpful, let's move on to the next one." I can only imagine my response would be like that, only more so, towards anyone who could allow Google to index them without allowing Google to summarize them.
Don't thank God, thank a doctor!
What do they not understand about *DO NOT CRAWL*? robots.txt is just fine. If it ain't broke, don't try to fix it. So now I have to have a .robotaccess to go along with .htaccess?
...makes me less likely to click through to their real story. Most of the major outlets seem to give only 1 or 2 sentences per feed item. That's so little information that I find I'm not interested in the story. I end up just browsing headlines. They've got to give enough that people want to read the articles. The traditional newspaper editing goal of cut-the-article-off-at-any-sentence-and-it's-still-complete is at odds with how and why I seek media coverage these days.
Let's just not pay attention to them, even that much.
Don't thank God, thank a doctor!
I think it would have been better named Content Retrieval Access Protocol.
If a site complains or uses ACAP - Google should just drop them.
The Google "site death penalty" - you become (rightfully) irrelevant.
I wish I could set my Google preferences to exclude sites that use "noarchive" or "nosnippet".
Like those journals that feed Google the whole content but just give surfers a subscription page. Such as Blackwell-Synergy - I keep submitting them to Google's spam page since they do that - in direct violation of Google rules.
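In case anyone wants to roll their own filter: those directives normally live in a robots meta tag in the page head, so spotting them is easy. A rough sketch using only the standard library; the `PAGE` string is a made-up sample, and the helper names are mine, not anything Google offers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page, not taken from any real site.
PAGE = '<html><head><meta name="robots" content="noarchive, nosnippet"></head></html>'
found = robots_directives(PAGE)
print("noarchive" in found, "nosnippet" in found)
```

This only looks at meta tags in the markup; a site could also send the same directives some other way, which this sketch would miss.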
Just because it CAN be done, doesn't mean it should!
This is to clear out a mistake in moderation that I have no idea how to clear out otherwise...
I'm sorry, but I do not quite understand: What is the net value (no pun intended) of yesterday's (let alone yester-week's) news?
What do these news-aggregators win by disallowing us, the people, from finding out what has happened last week/month/year?
They have pushed their news to as many people as they could get it to, but somehow want this news to be gone in a week? Why?
What's the next step? Newspaper that dissolves in a week? Special software on a reader's computer that will erase all downloaded articles older than a set date?
http://www.the-acap.org/download.php?ACAP-TF-CrawlerCommunications-Part1-V1.0.pdf
http://www.the-acap.org/download.php?ACAP-TF-CrawlerCommunications-Part2-V1.0.pdf
In brief, part 1 extends the robots.txt file, while part 2 extends the robot-related meta-tags. They allow spiders to be identified by both User Agent info and purpose (news, images, reviews). They add an "include" statement that can direct specific search engines to specific files, for example sending googlebot to robots/googlebot.txt; besides reducing bandwidth, this can confine any damage caused by coding errors. They also allow more granularity of indexing: You can specify if data from an old cache copy can be presented to a user, or if only the most recent copy should be used, and you can specify if links, snippets, thumbnails, or full content (i.e. a frame containing the originating site) can be shown to the search engine user. They add better retention controls; you can specify how long an engine should keep information (N days, until YYYY-MM-DD, or just until the next time the spider visits). And finally, they add a crude macro facility, so you don't have to create huge files that repeat themselves.
All in all, I don't see anything that's especially bad, and a lot of it is stuff that arguably should have been in robots.txt from the beginning.
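To make the retention part concrete, here is how a crawler might evaluate those three styles of rule. This is a sketch under my own assumptions: the `Retention` class and its field names are invented for illustration and are NOT the ACAP syntax, which you would have to take from the PDFs above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Retention:
    """Hypothetical model of the three retention styles; not ACAP syntax."""
    max_days: Optional[int] = None   # keep for N days after crawling
    until: Optional[date] = None     # keep until a fixed calendar date
    until_recrawl: bool = False      # keep only until the next crawl


def may_keep(rule: Retention, crawled_on: date, today: date,
             recrawled: bool = False) -> bool:
    """Return True if a copy crawled on crawled_on may still be kept today."""
    if rule.max_days is not None and today > crawled_on + timedelta(days=rule.max_days):
        return False
    if rule.until is not None and today > rule.until:
        return False
    if rule.until_recrawl and recrawled:
        return False
    return True


crawled = date(2007, 11, 30)
print(may_keep(Retention(max_days=7), crawled, date(2007, 12, 5)))
print(may_keep(Retention(max_days=7), crawled, date(2007, 12, 10)))
```

Whether the expiry day itself counts as inside or outside the window is a choice I made arbitrarily here (inside); a real implementation would follow whatever the spec says.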
Nothing for 6-digit uids?
Will that someone also ignore the current robots.txt content? Because anyone using ACAP will soon migrate to a simple "Disallow: *" (in an attempt to influence your decision on whether to use the ACAP extensions), and then you'll find yourself indexing dynamically-generated, "infinite" web pages.
Nothing for 6-digit uids?
Not all media are opt in.
Radio transmissions, for example. I can receive transmissions from France and Ireland if I use a directional aerial.
And for the internet, it CANNOT WORK unless the default is opt in: how do you get an agreement to view content? Ask. But you can't ask, your browser is asking. So how does it ask? Make a request, but that request is copyrighted and the response (even !NO!) is copyrighted too. Then again, you don't get "the" copy either. Each ISP will have to make a copy. So how do they do it? Then there's your local cache, which reduces redundant traffic (and therefore helps the ISP, the provider and the consumer) but requires agreement first.
So how can it work if it is a default "opt in"?
If you want it somehow different, DON'T USE HTTP. Requiring logins, using (S)FTP or making up your own protocol will all allow you to work with opt-in.
What if Google says "OK, if you use those extensions, you're not getting indexed"? Copiepresse tried to sue Google for NOT including their content. Is there anything that says "you MUST index"?
I'd like to chime in here and mention that I have used Google to read a story that had been dropped from a news site. Google didn't provide a cache link, the site refused to acknowledge that it had ever published the story (they sent it down the memory hole), but I had a couple phrases quoted elsewhere that I wanted to check context against.
So I did multiple Google searches looking for phrase segments, each one getting me one or two more words before or after the phrase. Eventually I was able to reconstruct two or three paragraphs. It took a few more searches to determine their proper order.
I'm wary of giving any site the ability to prevent such Google forensics automatically.
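For the curious, the stitching step is simple enough to sketch. This is not what I actually ran back then (I did it by hand); it is a minimal illustration that assumes the snippets are already in reading order and that each one overlaps the previous by at least one word. The function names and sample phrases are made up.

```python
from typing import List, Optional


def merge_overlap(left: List[str], right: List[str]) -> Optional[List[str]]:
    """Join two word lists if a suffix of left equals a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


def reconstruct(snippets: List[str]) -> str:
    """Chain overlapping snippets into one passage; assumes reading order."""
    words = snippets[0].split()
    for snippet in snippets[1:]:
        merged = merge_overlap(words, snippet.split())
        if merged is None:
            raise ValueError(f"no overlap with: {snippet!r}")
        words = merged
    return " ".join(words)


# Made-up fragments standing in for search result snippets.
SNIPPETS = [
    "the site refused to",
    "refused to acknowledge that",
    "acknowledge that it had",
]
print(reconstruct(SNIPPETS))
```

Working out the order when you don't know it is the harder part, and this sketch doesn't attempt it.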
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
so if you decide that, although you don't want it in the PD, you can't decide to unprint it so that it won't.
When you print a book, you cannot stop me lending it, whether you wish it or not.
Now, when you print it on the internet, you've lost most of your control. An example would be "if there are extensions to robots.txt, don't read ANYTHING". Copiepresse SUED Google for NOT indexing their site.
So how is ACAP going to help?
Why is it better than "require a login" or "don't publish in a PUBLIC PLACE"?
"The thing is, they're completely wrong. It's actually the other way around, for the simple reason that news aggregators produce no useful content of their own."
That's not actually true. The act of aggregation itself creates information: it brings news articles -- and news sources - to our attention which we wouldn't otherwise know about. What they create is metadata, and that's hugely valuable.
In a perfect world where all sites used and respected HTML meta tags, or Dublin Core markups or something, and did so thoroughly, sensibly and never abused them or lied about the categorisation of content, perhaps we could get by without needing third-party search and aggregation services. Maybe. And in a perfect world where sites didn't store information in silly non-browseable formats like, eg, PDF instead of HTML, we wouldn't need things like Google Cache to make them halfway readable. But we don't, and that's why aggregation exists.
I think an aggregator shouldn't strip out all links back to the original source, and should make it clear that there is more at the site to be investigated. But don't they do that anyway? One of the reasons why I don't read blogs via RSS yet is that I feel claustrophobic if I can't see the surrounding context of a blog: the skin, the about page, the comments (especially the comments). I guess it amazes me that there would be people who would *only* read, say, Google News or antiwar.com headlines and not read the full article.
You are not a brain: http://books.google.com/books?id=2oV61CeDx-YC