Web Caching: Google vs. The New York Times
An anonymous reader writes "The Google cache is a popular feature among karma fetishists. Many stories with links to the NY Times attract comments pointing to Google's copy of the article. This gives readers access to the content without registering. C|Net reports that Google is in talks with the NY Times to close this backdoor. The article raises some general concerns regarding the caching of web content. Shouldn't the NY Times simply tell Google not to cache their site?"
I'd love to see their user database, just to count the number of Mickey Mice and Elmer Fudds on there. Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?
When I am king, you will be first against the wall.
IANTrolling here, but I find Google more and more useless by the day. Sometime back, I pointed out how Google seems to have a soft corner for articles and sites that affect big firms such as Microsoft.
In fact, several of Slashdot's own articles on Microsoft aren't available from Google News, although Slashdot is listed as a 'news' source. A couple of MS-related Slashdot articles (on the Oregon bill - March 6th and May) have been removed, but much pro-MS content pre-dating March is still referenced.
Google seems to be aping the other Gorilla, despite all the posturing, and Microsoft's so-called attempts to categorise it as a competitor, when in fact, Google appears to be an ally!
If you keep throwing chairs, one day you'll break windows....
The reason that the NYT doesn't just tell Google not to cache them is visitors. Let's face it, even though the registration is a bitch, the content on the NYT website is fairly decent. They have good articles often enough that geeks went through the effort of finding out how to read without registering. If they have Google not cache them, and they close the Google News loophole, then they won't appear on Google News any longer. And Google News is used by many more people than you think.
Hey, we get quite a few visitors from this google news. Let's change it so we get 0 visitors from it.
Duh.
The GeekNights podcast is going strong. Listen!
the nytimes website needs google for the traffic google brings into their pages, so they can't turn away their spiders. but then, they don't want the spiders either because of copyright violations. why should this be google's problem anyway?
Actually, the link to "robots.txt" raises an interesting point. Why is the NY Times even in "discussions" about this, other than to gain some column inches? It's entirely up to the NYT whether to let Google's robots index their site, isn't it? I would have thought that Google's robots would be well behaved in this respect and simply move on to the next site if they were told to go away by robots.txt.
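For anyone who hasn't seen how a well-behaved robot consults robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URLs are made up for illustration; this is not the NYT's actual file, and it says nothing about how Google's crawler is really implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: tells Googlebot to stay out entirely,
# and keeps every other robot out of /private/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `allowed("Googlebot", "http://www.example.com/2003/07/story.html")` returns False with this file, while the same URL is allowed for any other robot. Note that robots.txt is purely advisory: it only works because the robot chooses to check it.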
UNIX? They're not even circumcised! Savages!
You are the new editor of the New York Times, the "Newspaper of Record" for the United States, if not the world. You are, of course, the new editor because the previous editor had to resign, taking the blame for an individual reporter's flagrant disregard for the awe-inspiring credibility of your institution. In the process of rebuilding your credibility, should you:
A) Insist that unaffiliated digital libraries restrict access to or simply eliminate all records of your "Newspaper of Record", or
B) Realize that maybe right about now is not particularly the best time to be saying to the world, "Please forget what we published last week."
Well, I guess that NYT (and many others) allowing Google News to log in and index their content means that they like the traffic it brings. For whatever reason, NYT wants you to register, and they have a right to, just as they hold the copyright: they allow Google to show the snippet, but not the whole article without their consent.
And that is the reason for an index, to find the original.
It is good to see they are working this out together, though, without NYT going to court as the first step. This is a far better way than the popular shoot-first-ask-questions-later attitude most media companies have...
That's the thing - it's not free depending on your definition. By my own definition, you're giving them valuable information, and they get to keep it and use it as they will, including spamming if they feel like it (or spam from any company which buys them out, they sell it to if they're feeling bankrupt, etc). It's practically misadvertising of a service, but it's accepted now, so everyone gets away with it.
If it really were free, why would you need to register in the first place?
Since when is content published in the WWW about privacy?
It's just like a government that wants to control which newspapers may be archived for history research.
--
Karma 50, and all I got was this lousy T-Shirt.
I was thinking the same thing. I can't recall seeing a NYT article linked from here with the Google cache banner across the top; what I do see a lot are the partner links. Google already provides for register-only news sites (financial times?) by putting a [reg only] tag beside the article. Why the NYT has chosen not to use this up until now is a tad strange, and it looks like someone has picked up the wrong end of the stick.
Brand recognition is not always a good thing. When I think NY times I think "that annoying registration website". They are free to do what they want, but it leaves me cold.
My Karma: ran over your Dogma
StrawberryFrog
Here we have the NYT, one of the premier news organizations in the world, offering its articles for free on the same day that they are published. Yet a large number of people, of this online community at least, refuse to provide even a minimal amount of information (and no money) so that the newspaper can try to make its online presence profitable.
I think the spam fears are a red herring, I've been registered with the times for over 2 years. I've never gotten spam that I think is traceable from them. I get a daily email of the day's headlines (and with the click of a box I could discontinue this).
Why should the RIAA change its business model to a pennies per song method when there is such a blatant example of the online community refusing to go directly to the source for even free material?
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
The facts are that a commercial company A (Google) is making a profit from unauthorised copying of other people's content without permission, meaning company B (you) has to spend money (webmaster) or take proactive steps to remove your content from their databases. Google are not an ISP or a government agency, so really they have no business taking other people's content without asking.
I don't know what planet you're on, but I profit when my site is listed in Google. People spend an inordinate amount of time and money to make sure their site is listed in the best way possible. Are you going to exclude what could possibly be a huge source of revenue for you? But maybe you have some obscure site you don't want anybody to be able to search for. So, given the amount of time it takes to build even the simplest site, is it really that much trouble to upload a robots.txt file, or add a robots meta tag with noindex, noarchive, nofollow in it?
Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
Lately there have been discussions here on Slashdot about fair use: about 30-second clips of music on the net, and thumbnails of images, being fair use. I can agree that that's fair use of content.
But think about Google's cache: A page in Google's cache isn't a part of - or a summary of content - but it is the entire content of a page. If this isn't breach of copyright, I don't know what is.
Google's cache gives more food for thought as well: Let's say I wrote something about someone on my web site, and this person sued me. A judge decides against me and gives me a fine, and orders me to remove the content. But even if I do so, the inflammatory words would still be accessible through Google's cache.
Now, some of you may argue that I could just write Google and ask them to remove the page. But the point is that if this is legal, just about anyone can cache my site. If enough search engines cache content, I most probably would never be able to find every site that provided cached versions of my site.
I'm not sure as to what constitutes fair use of content in the US, but in my country at least (Norway) I'm almost certain that Google's cache mechanism would be judged a breach of copyright laws.
"And you wonder why you get ads that have absolutely no interest for you? And why advertisers have to shout louder and louder to get through a mass of untargeted ads?"
What ads? I ignore or block such ads out of principle. Maybe if they provided ads for something worthwhile (instead of "shock the monkey" deceptive scam links), they would not be ignored. Maybe instead of shouting gibberish louder and louder, they should provide good ads for worthwhile products and services.
Actually, from the text of the article, they say that they want it so that when you click on a link in google, you get the registration page of the NYT.
A robots.txt would stop Google from indexing the site altogether. They don't want that to happen. They want a Google search to show NYT web pages, but they just want to make sure that when the user tries to view it, they have to register with NYT first. That means that Google must still index the page, but not allow access through the cache. Plus, it must direct to a sign-on page rather than the page itself, but that is something that I'm sure the NYT itself could handle, like I think it does now anyway.
The New York Times wants Google to continue ranking their stories but they want Google to do them the special favor of only pointing to their registration page:
"We are working with Google to fix that problem--we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital,
If I were Google, I'd tell them such advertising services would cost them a great deal of money. That or simply drop the New York Times right into the bit bucket. It will cost Google programming time to make it happen and computing time to keep it going. If every site on the web required this kind of custom treatment, Google's task would be much more difficult and it might be easier for them to drop it.
Dropping the NYT from Google is fine by me. People who don't understand the implications of digital publishing don't deserve readership. If they won't let librarians make digital copies, libraries should drop them too. What's next, the New York Times sends cease and desist orders to everyone who runs a proxy? It's like the NYT is trying to make their digital publication harder to share than their paper one was. A paper copy can be shared by an entire office and that's what a proxy does. A paper copy can be indexed and archived by a librarian, and Google did not even do that much. One day the paper version won't be available. If librarians can't keep their own copies of the digital version for verification, the publication will have no credibility. If the New York Times wants to continue charging advertisers for eyeballs, they had better remember that their credibility is based in part on widespread availability.
Friends don't help friends install M$ junk.
The technology has changed the way that things work but the law has not kept up with it. To start with, we continue to talk about "copyright". Controlling copying of information makes sense when the distribution mechanism is trucks moving bales of paper around. Once you start sending bits around, everything is copied. From the article:
And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.
When you have the newspaper delivered to your door, the content basically comes for free (the cost of a newspaper doesn't pay for much more than printing and handling). However, you get to keep the content as long as you like, chop it into bits and what not. Libraries have archives of newspapers going back years and you get to see them for free. What's the right mechanism as we move forward? The "pay per view" model that content providers want to shove down our throats courtesy of the DMCA is not pretty and when it starts to affect the average Joe I suspect it will be booed out of favor pretty quickly. But what is the right mechanism to make sure content providers get paid something and that we, the citizens, get something for our money?
It is? In this sense? We managed without it being mainstream quite happily until a year or two back.
In your opinion. Others have different opinions. We have a legal system to resolve differences of opinion. Go figure. :-)
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
As many others have emphasised, it is easy to turn off the Google cache for whatever pages you wish. But, in the case of the NYT, there is a further factor. They must have special code within their system to recognise the Google spider and allow it access without registration. Either that, or there is some other prior agreement allowing access. Given that, they can scarcely claim extra work to support Google. I believe the whole thing is mainly to get some free publicity for their site. I suppose the other possibility is that they want the page accessible from Google News but not the regular search engine cache.
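The kind of "special code" guessed at above could be as simple as a User-Agent check in front of the registration wall. This is purely a sketch of the idea: the function names, the `googlebot` token list, and the session flag are all assumptions for illustration, and nothing here is known about how the NYT actually does it.

```python
# Hypothetical list of substrings that identify crawlers we want to let in.
CRAWLER_TOKENS = ("googlebot",)


def is_known_crawler(user_agent: str, tokens=CRAWLER_TOKENS) -> bool:
    """Case-insensitive substring match against the User-Agent header."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


def needs_registration(user_agent: str, has_session: bool) -> bool:
    """Decide whether to redirect this request to the registration page."""
    if has_session:
        # Already registered and logged in: serve the article.
        return False
    # Let recognised crawlers through so the page can be indexed;
    # everyone else gets the registration page.
    return not is_known_crawler(user_agent)
```

The obvious weakness is that the User-Agent header is trivially spoofed, so a real site would probably want something stronger (checking the requesting address, or a prior arrangement with the search engine).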
The NYT needs to call off the lawyers and seriously think about how they brought this on themselves.
There are so many models for running a news site that avoid this problem (Salon) that calling out the lawyers is just childish and inappropriate. If a site wants to be indexed by a search engine, then they should be aware of what that means, and if they don't like how a particular search engine functions, then they should take measures to change their own site to prevent what they don't want indexed, or cached, from being accessed.
I know that finding pages on Google that I cannot access would be infuriating, and I hope that Google realizes that many of their users would agree.
Read, L
[Set Cain on fire and steal his lute.]
So which is the real real world? The one where you spend the afternoon on your porch reading a book to your mate, or the one where you sit in front of a television and "reap the rewards" of advertising, so you can buy more stuff, presumably?
I am not saying my world is universally better than your world, but it is just as real.
V
Every time a cached link is clicked, pay sites like the New York Times can receive notice from Google (easy to automate this) that one of their pages (which is cached in Google) has been accessed, and all advertisements in the cache have been displayed (Google caches Ads in the page as well as the contents). This allows the website to "offload" traffic and at the same time keeping the books on the number of times their Ads have been viewed so that they can send the accounting record to their paid Advertisers.
Google would find this very simple to implement, and paid sites would find this very beneficial (borrowing Google's enormous bandwidth and server capabilities for free) and at the same time should solve most of their concerns. After all, Google's cache isn't sufficient for proper access to ALL the paid content at the New York Times, as the cache is temporary in nature. Also, it's too spotty in coverage to be considered reliable enough for really digging into a paid site's entire content.
Using Google like this is akin to using Google as a window into the pay-site's house of content. You can see part of a room, but not the whole interior. Now, every time someone peeks, the House gets notified and can get paid for it. The more windows Google adds to the House, the more chances the House gets paid.
If what you're posting comes from an article page's <head> section, you seem to be pasting more than you intended. Directives to ban archiving of ads aren't an editorial issue, but a business decision -- cached ads screw up the bookkeeping and, by extension, the bottom line on the balance sheet.
The practice of restricting caching of ad content is, presumably, common across the industry -- it's not just NYT that has an interest in forcing this.
The (apparent) <meta name="ROBOTS" content="NOARCHIVE"> tag you cite should be wholly separate from the ad server code.
(Signed, a former employee of NYT digital...)
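To make the NOARCHIVE tag concrete, here is a rough sketch of how an indexer might read the robots meta tag from a page and decide separately whether to index it and whether to keep a cached copy. It uses Python's standard `html.parser`; the class name, helper name, and sample page are invented for illustration, and it is not a description of what Google's crawler actually does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values keep their case.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the lowercased robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Made-up page carrying the tag quoted in the comment above.
PAGE = ('<html><head><meta name="ROBOTS" content="NOARCHIVE"></head>'
        '<body>story text</body></html>')

directives = robots_directives(PAGE)
may_index = "noindex" not in directives      # page can still appear in results
may_cache = "noarchive" not in directives    # but no cached copy is offered
```

With the sample page, `may_index` is True and `may_cache` is False, which is the combination a site would want if it wishes to stay searchable without a cached copy being served.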
DO NOT LEAVE IT IS NOT REAL
But you know, you DON'T have to give a real name or email for NYT or JPost, or most of the others, they don't send you your pass and uid, you know.
It's not the spam that's the problem; if you use your head, you get no spam. It's the hassle of logging on.
I have an NYT account. Do I care if they know what I read on their site? About as much as I care when the next "American Idol" rerun is on (which is to say, not at all.) Why on earth are you fuckers so paranoid about this? I see absolutely nothing wrong with tracking as long as it's limited to the originating site.
Get over it, for God's sake.
- A.P.
"Remember when the U.S. had a drug problem, and then we declared a War On Drugs, and now you can't buy drugs anymore?"
people are using technology for what it was intended to do!
At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?
When your server and pages say it's alright (or don't say that it's not alright.) The standards for the web are very clear on this, but non techie companies (and some judges) don't seem to get this.
This reminds me of the issues of "deep linking" that everybody was suing over a couple of years ago. That's exactly what the web was designed to do, but these johnny-come-lately companies put sites up, and expect people to stop using the technology for what it was designed for.
If only the EFF was as well funded as the ACLU...
How about if the Times got over their registration fetish?
From the Times Subscriber Agreement:
What is meant by "exploit"???
From the "Forums and Discussions" section:
What is meant by "abusive"???
And how about this:
Interpretation: The user/poster is entirely responsible for the content of their post, which the Times may alter in any way. Yikes!!! Granted, this applies only to content submitted to the Times, but the wording seems pretty scary.
The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
..to censor their cache. Those that don't want their content cached should fix their web servers and firewalls first. My web site prohibits known web crawler bots, and Google doesn't cache it. No problem! I didn't have to harass Google about it and they don't have to break their own promise to not be evil.
-- I am. Therefore, I think!
Try the Pragma: no-cache and/or Cache-control HTTP headers.
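For reference, a minimal sketch of sending those headers from a server, using Python's standard `http.server`. The handler name, port, and body are placeholders. One caveat: these headers are aimed at browsers and HTTP proxies, and whether a search engine's stored copy pays attention to them is a separate question that this sketch does not answer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Headers asking HTTP caches not to store or reuse the response.
# Pragma is the older HTTP/1.0 form; Cache-Control is the HTTP/1.1 form.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache, no-store, must-revalidate",
    "Pragma": "no-cache",
    "Expires": "0",
}


class NoCacheHandler(BaseHTTPRequestHandler):
    """Serves a placeholder page with the no-cache headers attached."""

    def do_GET(self):
        body = b"Today's story goes here.\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in NO_CACHE_HEADERS.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), NoCacheHandler).serve_forever()
```

Run `serve()` and request the page with any HTTP client to see the headers in the response.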
Yeah, I always like to try abuse@domain for sites that require registration. Kinda mean to the postmaster, but if I "opt-out" and they still send something then they're spammers anyways.
Nothing to see here; Move along.