Web Caching: Google vs. The New York Times



Posted by timothy on Sunday July 13, 2003 @09:58PM from the right-to-be-annoying dept.

An anonymous reader writes "The Google cache is a popular feature among karma fetishists. Many stories with links to the NY Times attract comments pointing to Google's copy of the article. This gives readers access to the content without registering. C|Net reports that Google is in talks with the NY Times to close this backdoor. The article raises some general concerns regarding the caching of web content. Shouldn't the NY Times simply tell Google not to cache their site?"

20 of 518 comments

Free registration by Zog+The+Undeniable · 2003-07-13 22:00 · Score: 5, Insightful

I'd love to see their user database, just to count the number of Mickey Mice and Elmer Fudds on there. Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

--
When I am king, you will be first against the wall.
1. Re:Free registration by presroi · 2003-07-13 22:06 · Score: 5, Insightful
  
  Maybe we can agree that the NYT is a well-written, serious and interesting newspaper. Not just for New Yorkers but also for people from Sweden, Japan or New Jersey.
  
  Where would the limit be? How would you feel if you had to register for every web page which is linked to at /. (I confess, I usually click on every /.-story link)?
  
  hmm, to answer your question:
  maybe the point of registration is the signing of a contract on how to use this content. Dunno.
2. Re:Free registration by whm · 2003-07-13 22:06 · Score: 5, Informative
  
  Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?
  
  User tracking. While cookies can do this loosely, requiring a login does this much more effectively. I know I log in with the same username each time I visit the site (if it's not cached). There's very little reason not to. This gives the NYT a much better indication of how many active and repeat members they have visiting their site. They can then target ads to users much more effectively, and market their userbase to advertisers much more solidly than they could with more rudimentary user tracking methods.
  
  There may be other purposes, but this seems like a large part of it.
3. Re:Free registration by cobbaut · 2003-07-13 22:34 · Score: 5, Interesting
  
  Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?
  
  I always use a different address to register online in the form of website@mydomain.
  I registered with the NYT in 1999, I never received a single spam on this address.
  
  --
  European Linux user, living in Antwerp
4. Re:Free registration by Anonymous Coward · 2003-07-13 22:41 · Score: 5, Insightful
  
  And on top of everything else, it annoys users more than just about anything else aside from spam. Can't recall exactly how many other people I know who go to see a NYT article, find the rego page, and ignore it to go find a better news source without the hassle.
  
  If they're tracking what their users do, they're affecting their user pool in a pretty negative way just by using this method.
5. Re:Free registration by digitalunity · 2003-07-13 22:41 · Score: 5, Funny
  
  (I confess, I usually click on every /.-story link)
  
  This is *Slashdot*. We don't read articles. Please, either read the article or post a comment; you cannot do both.
  
  --
  You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
6. Re:Free registration by JanneM · 2003-07-13 23:05 · Score: 5, Funny
  
  You gave them an actual, working, email address? How ...quaint.
  
  Me, I'm a 66-year-old single woman with no income and no education, living in a nonexistent Swedish town with a very rude (in Swedish) name. I figured that any site advertiser that wants to target this person must be desperate enough that their ads may actually be amusing.
  
  --
  Trust the Computer. The Computer is your friend.
7. Re:Free registration by NexusTw1n · 2003-07-14 01:33 · Score: 5, Insightful
  
  I always find it ironic when people on slashdot complain about being "tracked" on NYTimes webpages or other sites that require registration.
  
  Most people have registered to use /., and have therefore provided a valid email address. So you can't have a moral objection to giving your email addy to websites you frequent.
  
  Even if you don't register, your IP address is logged and monitored via the sophisticated anti-troll system. Try to post more than 10 times in one day as an AC, or post as an AC in reply to a post you modded, and slashcode will react.
  
  So even as an AC you aren't really totally anonymous on slashdot, yet I don't see anyone who complains about NY Times links complaining about that. The only people who complain are the trolls that forced these features to be added to the code.
  
  So why do we have this tedious bitching about the NY times every time a link is posted?
  
  I registered a couple of years ago. I've never received a single spam to NYTimes@mydomain.com, which was the email addy I used. I've never had to log in because the login cookie has remained in Opera since I registered. How hard is it to log in and then forget about it forevermore?
  
  The only reason I haven't forgotten I've registered is the continual complaints on slashdot from people who are obsessed with privacy on the net unless karma is involved. NY Times doesn't spam registered users, and any user tracking is less sophisticated than slashcode's vital anti-troll features. So bear that in mind when tomorrow's NY Times story appears and the same old complaints are dragged out yet again.
  
  --
  It has become appallingly obvious that our technology has exceeded our humanity. --Albert Einstein
God damnit... by tangent3 · 2003-07-13 22:03 · Score: 5, Funny

Now we can't karma whore by linking to the google cache?
And out comes the lawyers... by Anonymous Coward · 2003-07-13 22:11 · Score: 5, Interesting

Don't you just hate it when promising new technology is curbed by outdated laws?

Here in Denmark we had a service similar to news.google.com for Danish newspapers. The newspaper organisation sued the service for parasitizing their databases (which is prohibited in Denmark). The service was shut down half a year ago and we no longer have that kind of service.

Of course newspapers should be allowed to publish their stuff without others copying it but they refused to even use a "robots.txt" (which the news service respected) to stop indexing.

If you publish your stuff on the internet and don't tell people that they should not index it, cache it or what do I know - then you better expect them to do that. Let us put those lawyers back where they belong.
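The "robots.txt" opt-out mentioned above can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of how a well-behaved crawler might consult the file before indexing; the rules, the `ExampleNewsBot` user-agent, and the `news.example` URLs are all hypothetical and not taken from any real newspaper or news service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a newspaper could publish. The bot name and
# paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: ExampleNewsBot
Disallow: /

User-agent: *
Disallow: /archive/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://news.example/today/story.html"
    old = "http://news.example/archive/old.html"
    print(may_fetch(ROBOTS_TXT, "ExampleNewsBot", story))  # blocked site-wide
    print(may_fetch(ROBOTS_TXT, "OtherBot", story))        # falls under "*"
    print(may_fetch(ROBOTS_TXT, "OtherBot", old))          # /archive/ is off limits
```

Note that robots.txt is purely advisory: it only works when the crawler chooses to honor it, which the poster says the Danish service did.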
Anyone above this post hasn't read the article. by banal+avenger · 2003-07-13 22:16 · Score: 5, Interesting

The Internet Archive, which I just used minutes ago to find a handy page removed years ago, is an interesting counterpart to the Google cache. I often wonder how it has survived this long without a major lawsuit. It also reminds me how crappy the web looked 5 years ago.

At any rate, caching is an important force on the internet, and isn't one that should be limited in any legal way, including litigation.
actually... by Draghkhar · 2003-07-13 22:20 · Score: 5, Interesting

Actually the NYT has already begun using Google's NOARCHIVE option to prevent content caching. Here's an excerpt from this morning's front page story's source:

<!-- ADX SETUP: page: www.nytimes.com/yr/mo/day/international/worldspecial/14IRAQ.html positions: Top5,Middle1,Right3,Middle5,Right,Travel7,Travel11,Bottom1A,Bottom3A,Right5,Right6,Right7,Right8,Bottom8,Bottom7,Inv1,Inv2,Inv3,Frame4,Right4 kwds: politics+and+government;international+relations;iraq;suggested%5ftopnews;suggested%5finternational;suggested%5fworldspecial;suggested%5fmiddleeast -->

<meta name="ROBOTS" content="NOARCHIVE">

Kind of makes me wonder what's the point of the story, since it even says there's an easy way for concerned parties to opt out of the cache.
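As a rough sketch of how a crawler that honors this tag might detect it, the snippet below scans a page for a robots meta tag using Python's standard `html.parser`. The class and function names are hypothetical, and this is not a description of how Googlebot actually processes pages; it only shows that the check is a simple, case-insensitive attribute lookup.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def allows_cached_copy(html: str) -> bool:
    """Return False if the page opts out of caching via NOARCHIVE."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="ROBOTS" content="NOARCHIVE"></head></html>'
    print(allows_cached_copy(page))
```

Because the tag lives in the page's HTML, it has to be present on every page the site wants kept out of the cache.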
There's no such thing as free registration by pslam · 2003-07-13 22:24 · Score: 5, Insightful

Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?
That's the thing - it's not free, depending on your definition. By my own definition, you're giving them valuable information, and they get to keep it and use it as they will, including spamming if they feel like it (or spam from any company that buys them out, or that they sell it to if they're going bankrupt, etc.). It's practically misadvertising of a service, but it's accepted now, so everyone gets away with it.
If it really were free, why would you need to register in the first place?
Re:Erm...cache? by Neophytus · 2003-07-13 22:32 · Score: 5, Insightful

I was thinking the same thing. I can't recall seeing a NYT article linked from here with the Google cache banner across the top; what I do see a lot are the partner links. Google already provides for register-only news sites (Financial Times?) by putting a [reg only] tag beside the article. Why the NYT has chosen not to use this up until now is a tad strange, and it looks like someone has picked up the wrong end of the stick.
Free registration and the RIAA by mike_mgo · 2003-07-13 22:36 · Score: 5, Insightful

It's articles like this that make me think that the recording and movie industries are right to go after online piracy with everything they've got.
Here we have the NYT, one of the premier news organizations in the world, offering its articles for free on the same day that they are published. Yet a large number of people, in this online community at least, refuse to provide even a minimal amount of information (and no money) so that the newspaper can try to make its online presence profitable.
I think the spam fears are a red herring, I've been registered with the times for over 2 years. I've never gotten spam that I think is traceable from them. I get a daily email of the day's headlines (and with the click of a box I could discontinue this).
Why should the RIAA change its business model to a pennies per song method when there is such a blatant example of the online community refusing to go directly to the source for even free material?
Re:Free registration..some implications by gilroy · 2003-07-13 22:38 · Score: 5, Informative

Blockquoth the poster:

and lastly, once a site requires registration, even if free, Copyright ptohibits [sic] quoting entire articles on the web.

Actually, registration is not required to protect a work. Creating a work automatically protects it under copyright law -- no need for registration, user fees, or that little (c) thingy. At least in countries respecting the Berne Convention.

--
The Mongrel Dogs Who Teach
Re:Google - more useless everyday by dhodell · 2003-07-13 22:51 · Score: 5, Interesting

Just FYI, this behavior is due to the fact that Googlebot has a sort of "built-in" mechanism to ignore (or at least rank lower) forum-type sites. Since /. is primarily a "news headline and discussion" site, Google will not rank it as highly as one that seems to be more "on-topic". This is because there is no guarantee that any URLs or email addresses within the page have anything to do with the actual page content itself.

Outside of user posts, /. has little genuine unique content. It summarizes a lot of headlines; this content is not unique.

Other (large) factors determine the way Google ranks pages, including the "PageRank" feature. There are lots of documents about the way Google ranks sites; I suggest checking them out. The best way is probably to Google for it :).

Anyway, this is a bit more on-topic:

I highly appreciate Google's caching feature, and don't see how it can be taken as "bad".

This is what's "bad" about Google and what I expect that, at some point, will come to haunt them. For instance, if I want to get serial numbers without porn popups, I can usually search for something like "Office XP Serial Number Serialz Warez" or something similar. Within the first couple of pages, I will probably find my serial number in the text of the page description.

If not, it's on the page, oftentimes without a popup, since the serial/crack page itself is the one linked.

Want to find X-Win32? How about doing "* * * * * * xwin32*.exe" - let's get some directory listings containing this filename.

No doubt this proves that Google is more than just a search machine... but I think that their superior techniques will definitely come back to haunt them in the future. NYT is way off target with bitching about their caching features... you can turn this off easily, and there are a plethora of scripts one can use to break out of Google's cache and send someone to the main site (or, perhaps, login area in the case of NYT).

But, in other news, Google might need to watch out...

--
Kind regards, Devon H. O'Dell
meta tags ? by matrix0040 · 2003-07-13 23:28 · Score: 5, Informative

well, can't they just use meta tags to prevent archiving of their pages? <META NAME="robots" CONTENT="noarchive">, from http://www.google.com/bot.html
No pity for the NYT... by qtp · 2003-07-14 00:01 · Score: 5, Insightful

The NYT needs to call off the lawyers and seriously think about how they brought this on themselves.

There are so many models for running a news site that avoid this problem (Salon) that calling out the lawyers is just childish and inappropriate. If a site wants to be indexed by a search engine, then they should be aware of what that means, and if they don't like how a particular search engine functions, then they should take measures to change their own site to prevent what they don't want indexed, or cached, from being accessed.

I know that finding pages on Google that I cannot access would be infuriating, and I hope that Google realizes that many of their users would agree.

--
Read, L
Google's cache copy - the larger issue by Everyman · 2003-07-14 02:36 · Score: 5, Interesting

The question is framed very narrowly by Slashdot, so this discussion misses the larger issues. The cache copy is an issue in Google's main index for many webmasters. The Google News situation is a subset of a larger problem; the cached link doesn't exist in Google News. Google News is a much narrower issue. I'd like to bring up the issue of full-text caching done by Google in their main index.

My problem with the cache is that it gives Google a competitive advantage that is unfair, and furthers their monopoly. This is especially unfair since it is most likely illegal -- assuming that you could ever get a good test case into court, or get a class action lawsuit going by some webmasters, publishers, or search engines.

To add to the attractiveness of the cache copy, consider what Google has done:

1) The cache copy makes it possible to highlight the search terms, whether or not you have the toolbar installed.

2) The download time for the cache copy from Google's servers is always faster than from the original website.

3) You never get a 404 "not found" or a DNS lookup failure for the cache copy.

4) The link to the page recommended by Google for bookmarking at the top of the cache copy is a link to Google's copy, not to the original page.

5) How about all that Google branding on the top of the cache copy? Priceless.

I feel the cache should be opt-in, not opt-out. The only way you can avoid it right now is to place a "noarchive" meta on every page in your site. On some file types, such as .txt files, there's no place to insert a "noarchive" and Google goes ahead and caches it anyway.

The cache copy tends to keep eyeballs on google.com, and increases their searches. You may have noticed that many major news sites won't link to other websites in their stories anymore, but rather just mention the relevant site without putting a link behind it. That's because they don't want eyeballs wandering off of their page. A wandering eyeball may not come back and look at more ads. That's basically one of the big reasons behind the cache copy as well -- it keeps eyeballs from wandering as much as they would without the cache.

All the Google partners -- AOL, Earthlink, Yahoo, Netscape -- don't include the cache links, and I assume that this is the reason. They don't want people wandering off to Google and staying there.

As new competition is organizing to challenge Google's monopoly, from places such as Overture (Alltheweb and AltaVista), Yahoo (Inktomi), AskJeeves/Teoma and Microsoft, these engines have to consider whether to fight Google on the cache copy, or offer their own cache copy even if they think it is illegal. There isn't really any middle ground on this.

Many observers with legal expertise feel that while the snippets are "fair use" of a website's content, offering the full text in a cache version is not. Copyright law requires "express permission," but Google only offers an incomplete and inconvenient opt-out. I suspect that the legal departments of these other engines are more inclined to challenge Google rather than launch into their own violations of copyright law.