Slashdot Mirror


Web Caching: Google vs. The New York Times

An anonymous reader writes "The Google cache is a popular feature among karma fetishists. Many stories with links to the NY Times attract comments pointing to Google's copy of the article. This gives readers access to the content without registering. C|Net reports that Google is in talks with the NY Times to close this backdoor. The article raises some general concerns regarding the caching of web content. Shouldn't the NY Times simply tell Google not to cache their site?"

28 of 518 comments

  1. Re:Free registration by whm · · Score: 5, Informative

    Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

    User tracking. While cookies can do this loosely, requiring a login does this much more effectively. I know I log in with the same username each time I visit the site (if it's not cached). There's very little reason not to. This gives the NYT a much better indication of how many active and repeat members they have visiting their site. They can then target ads to users much more effectively, and market their userbase to advertisers much more solidly than they could with more rudimentary user tracking methods.

    There may be other purposes, but this seems like a large part of it.

  2. It raises 2 questions .. by Mr_Silver · · Score: 4, Informative
    such as:
    1. When will slashdot stop linking to articles that require a registration?
    2. When will slashdot consider implementing caching for pages that, by linking to, they manage to take off the internet?
    Sure, the 2nd question has been answered in the FAQ. Except it was written three years ago and Google manages this just fine. Maybe time for a second look?

    On the topic of site updates, has anyone noticed that 90% of the links on http://slashdot.org/code.shtml don't work any more?

    Hell, the link to an Avantgo version of Slashdot points to a website which has been broken for over 2 years.

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
  3. Erm...cache? by DennyK · · Score: 4, Informative

    The article talks about Google's caching of articles that have expired to the NYT archives (which you have to pay to access). What most /. folks use to link to current NYT articles are the Google partner links, which simply bypass the free registration. I'd assume these links only work as long as an article hasn't been archived yet, so the karma whores are safe; I doubt the NYT's Google partner links will be going away any time soon... ;)

    DennyK

  4. Re:Free registration..some implications by jkrise · · Score: 3, Informative

    Actually, free reg requires a valid email id. It thus filters most bogus registrations. Secondly, news sites are planning to go the 'pay' way in about a couple of years. Getting readers to register would give more accurate estimates of readership.

    And lastly, once a site requires registration, even if free, Copyright ptohibits quoting entire articles on the web. This indeed could be the prime reason for this.

    --
    If you keep throwing chairs, one day you'll break windows....
  5. Re:Google - more useless everyday by cioxx · · Score: 4, Informative
    Sometime back, I pointed out how Google seems to have a soft corner for articles and sites that affect big firms such as Microsoft.


    "Google News is highly unusual in that it offers a news service compiled solely by computer algorithms without human intervention. While the sources of the news vary in perspective and editorial approach, their selection for inclusion is done without regard to political viewpoint or ideology. While this may lead to some occasionally unusual and contradictory groupings, it is exactly this variety that makes Google News a valuable source of information on the important issues of the day." source

    Remove your tinfoil hat please. There is no conspiracy. Google News features articles from Newsmax, Electronic Intifada, Islam Online, Al Jazeera, World Net Daily, etc. If there were any filtering going on, these sites would have been off the radar a long time ago.

    Also, Slashdot is not a professional journalistic site. It's a News-based comment board where people come to share their opinion. In a perfect world Slashdot doesn't even belong on Google News.

  6. Sweet irony by Amomynos Coward · · Score: 3, Informative

    In case the CNET article is /.'ed, here's a link to the Google cached page.

  7. Re:God damnit... by anonymous loser · · Score: 4, Informative

    The *real* karma whores link to http://archive.nytimes.com anyway.

    NYTimes have futzed around with it a bit, but if you play with it, it still gives you registration-free access to their content, it just takes a couple of clicks nowadays.

  8. Re:NY Times likes accuracy by MonTemplar · · Score: 4, Informative

    What he said! Remember, the first two W's are for World Wide.

    The only people who seem to have a problem with webpage caching are either legal flacks working in CYA Mode, or webmasters who can't be bothered to mark up their pages and add robots.txt files to make sure that only public information goes out of their websites.

    --
    -MT.
  9. Re:Free registration..some implications by gilroy · · Score: 5, Informative
    Blockquoth the poster:

    and lastly, once a site requires registration, even if free, Copyright ptohibits [sic] quoting entire articles on the web.

    Actually, registration is not required to protect a work. Creating a work automatically protects it under copyright law -- no need for registration, user fees, or that little (c) thingy. At least in countries respecting the Berne Convention.
  10. Demographics by autopr0n · · Score: 4, Informative

    I've never been sent a single spam from the NYT. The reason they want this is for demographics. A) it tells them who their web readers are, and B) it tells their advertisers who their web readers are. And it also allows them to show ads for products people would be most interested in.

    --
    autopr0n is like, down and stuff.
  11. Um... by autopr0n · · Score: 2, Informative

    Wired the magazine and Wired the website are totally separate companies. The website is owned by Lycos, and the magazine by Conde Nast.

    --
    autopr0n is like, down and stuff.
    1. Re:Um... by broeman · · Score: 3, Informative

      Nope, sir, you are wrong. Wired Magazine is indeed commercialized on the Wired website. Nobody talked about company relations before you did. And I still see the Lycos bar when I am on Wired Magazine's homepage.

      --

      (yes this can be compared with sex)
  12. Re:Free registration by btlzu2 · · Score: 1, Informative

    Maybe we can agree that the NYT is a well-written, serious and interesting newspaper.
    Ermmm...not really

    --
    Zed's dead baby. Zed's dead.
  13. meta tags ? by matrix0040 · · Score: 5, Informative

    Well, can't they just use meta tags to prevent archiving of their pages?

    <META NAME="robots" CONTENT="noarchive">

    from
    http://www.google.com/bot.html
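
    As a rough illustration of how a crawler might honor the tag quoted above, here is a minimal sketch using Python's standard html.parser. The class name RobotsMetaParser and the helper allows_archiving are hypothetical names made up for this example; this is not how Google's crawler is implemented, only one way to read the directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def allows_archiving(html):
    """Return False if the page carries a robots 'noarchive' directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="robots" CONTENT="noarchive"></head></html>'
    print(allows_archiving(page))
```

    Note that the tag has to appear on each page that should stay out of the cache, which is different from a single site-wide robots.txt file.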

  14. NYT is a new york / regional paper by Anonymous Coward · · Score: 1, Informative

    The NYT is a local / regional paper when you get right down to it.

  15. No! by Anonymous Brave Guy · · Score: 2, Informative

    If the information is being copied and circumventing the NYT's usual requirements for access, then this is not the NYT's problem, it's Google's. A good question might be how Google's robots can actually circumvent that access in the first place, but I'm sure someone's thought of that somewhere I haven't noticed yet...

    OTOH, Google is quite at liberty not to list the NYT in its results if it so wishes, which presumably wouldn't be the outcome the NYT would be hoping for (and would presumably get if employing robots.txt).

    The moral onus here is clearly on Google to ensure that if they are changing the way information is presented then they do so in a manner acceptable to the provider of that information. Or did you expect the NYT to contact anyone in the world who might be interested in caching their site? The "we don't need any legal recourse" argument is pretty weak too; it basically assumes that everyone in the world (a) knows about and (b) obeys robots.txt, which is clearly nowhere close to correct.

    All in all, if both companies are looking for a constructive solution to this problem that benefits all concerned, it seems pretty sensible for them to get around the table, discuss what they want to happen, and make it so.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  16. Re:Free registration by yelvington · · Score: 4, Informative

    NYT doesn't spam. And the percentage of net.morons who register using cartoon names is remarkably low.

    I don't work for the New York Times, but for another media company, and I'm in a position to understand the reasons for registration:

    1. Metrics. Registration supports the generation of accurate data on demographics and usage (reach, frequency) in a crosstabulated view. This is important in analytical processes to support site management and design as well as in the sale of advertising, which provides the revenue that makes the site possible.

    2. Ad targeting. Run-of-site, untargeted Internet advertising is nearly worthless on the open market (supply/demand), but advertising that is highly targeted remains highly valuable. When combined with proper analytical software and usage data, registration data can -- for example -- let me target 25- to 34-year-olds in a particular ZIP code who have been looking at real estate listings. And I can deliver that advertising anywhere on my site, such as on sports pages that otherwise would contain "junk" ad inventory. This is (measurably!) much more efficient and effective, and I can charge fairly high CPM prices. Importantly, this can be accomplished without providing any personal data to the advertiser, protecting the anonymity of the user.

    3. Reduction in traffic. Reduction is actually desirable in many cases. Not all customers are good customers, and not all traffic is good traffic.

    On the Google issue: I used robots.txt to block Google from indexing the AP content on our 27 newspaper sites, because I have no desire to be the unpaid provider of wire stories for Google News so that they can be read by users outside our markets. Additionally, I have used a router block to prevent several commercial Web clipping services from having access of any sort to any of our sites.

  17. Re:Art Spam by ajs318 · · Score: 2, Informative

    To find out where spam is coming from, get an e-mail account with Virtual Hosting. This is where you get an entire subdomain {or a domain if you pay for it} to yourself, and your e-mail address is in the form anything@mysubdomain.myisp.co.uk. Then you just need to give a different prefix for each site you visit -- e.g. nyt_resp@mysubdomain.myisp.co.uk, and so on.

    If you want to put your e-mail address on your web site, use this to automagically mung your address.
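
    The per-site prefix scheme described above can be sketched in a few lines of Python. The function names site_address and leaked_by, and the issued lookup table, are hypothetical and exist only for this illustration; the domain is the placeholder from the comment.

```python
def site_address(prefix, domain="mysubdomain.myisp.co.uk"):
    """Build the address handed to one particular site."""
    return f"{prefix}@{domain}"


def leaked_by(recipient, issued):
    """Look up which site was given the address that spam arrived on.

    `issued` maps each prefix to the site it was registered with.
    Returns None when the prefix was never handed out.
    """
    prefix = recipient.split("@", 1)[0].lower()
    return issued.get(prefix)


if __name__ == "__main__":
    issued = {"nyt_resp": "nytimes.com"}
    print(site_address("nyt_resp"))
    print(leaked_by("nyt_resp@mysubdomain.myisp.co.uk", issued))
```

    Whenever spam arrives, the prefix of the recipient address points back at the site it was originally given to.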

    --
    Je fume. Tu fumes. Nous fûmes!
  18. Truism? by mrd_yaddayadda · · Score: 2, Informative
    Actually, that's a good point. It doesn't seem google actually PRODUCES any sort of content on their own.

    Actually, that's a pointless point. Of course Google doesn't produce anything; they are a metadata service: search engines and collators for websites, for news, for images and who knows what else.

    The issue is whether or not they should be able to collate data that is in some way secured. And on that I'm offering no opinion, mainly because I can see all sides of this and the hats are all too grey for me to tell apart.
  19. Re:Problem is potentially bigger than caching Re:Y by sketerpot · · Score: 2, Informative
    Basically all I am saying is that there should be a movement similar to Open Source not only for software products, but for journalistic content.

    There is. How about the Creative Commons?

  20. Two CIA Companies Cooperating... by Anonymous Coward · · Score: 1, Informative

    Not exactly news. Both are part of our state-controlled media.

    Did you know you need a security clearance to work for Google?

    That their Usenet policy allows them to drop any posts from other ISPs that they don't like?

    BeOS Stock Scandal with Microsoft Brewing

  21. You are welcome to use xxxxdd@xxxx.com any time. by Futurepower(R) · · Score: 3, Informative


    Your comment was confusing to me until I realized that you are talking about giving NYT an actual email address. Why would you do that? Isn't that why we have hotmail.com? Give an address that does not exist or a throw-away address.

    Last week I was registering at a web site and I put in xx@xx.com for the address. The system responded, "This address has already been registered." So then I put in xxx@xxx.com. The system responded, "This address has already been registered." So I entered xxxx@xxxx.com. Same response. Finally I awoke fully and entered some Ds, xxxxdd@xxxx.com, and the system accepted my "registration".

  22. Re:Free registration by mysticgoat · · Score: 3, Informative

    You've brought out some very good information in a well-written way. Thank you. I'll cover much of the same ground from the satisfied user's viewpoint.

    1. NYT and spam: there is no relationship between these. That's my experience after years of subscription, and a number of other people on this thread report the same thing. The Yahoo portal news service is also good this way (and gives me Reuters: an excellent supplement to NYT).
    2. The metrics thing: I provided NYT with true demographics when I signed up, because I know that will help them deliver product more efficiently and sell their advertising.

      I want that. I like the service NYT provides, and so I want them to succeed. I very much want them to continue to provide me with a free subscription-- and I'm willing to help them hold their costs down and maximize their advertising revenues.

    3. Focused advertising: I don't like ads, but I'm willing to put up with their presence in exchange for a service like NYT.

      NYT has done a good job of keeping the impact of the ads low: the ads don't get in the way of reading the stories and they don't slow page loading significantly (since I'm on a slow rural dial-up, that's very important). If NYT starts to charge me, I'll be less tolerant of the ads. If the advertising starts slowing down the page loading, I'll drop my subscription. There are a number of other news services-- CNN, ABC, etc-- that I don't use because the advertising burden slows page loading or otherwise gets in the way.

      As to focused ads-- I'm all for that. I'd rather ignore stuff that's somewhat pertinent to my life than ignore crap I'd never buy. An ad for reading glasses is pertinent to me, but an ad for skateboards is crap-- I was long past skateboarding age before the first ones hit the street. Reading glasses are something me and my cohorts have to live with, and we talk about them. Nobody in my circle of friends has a skateboard and I don't recall ever talking about them. (Of course skateboards would be a problem for me and my neighbors: I don't think they do well on gravel and road apples.)

      And sometimes the advertising actually works-- sometimes it makes me aware of a product or company that I'll want to talk over with my buddies, and maybe try out. That is much more likely with focused ads. As I recall, my first awareness of the existence of fold-up reading glasses in a hard case (suitable for hiking, bicycling, and other hip pocket activities) was from an advertisement. Now I've got a couple of pairs of them. Neat.

    About Google's archive, NYT, and slashdot: Something I hope NYT considers is that the Google archive gives it (and at least some of its ads) exposure in demographic groups that it would otherwise never reach. Such as the tinfoil hat superparanoid geek crowd. While there is no way to develop metrics on this, nor any way to market this to advertisers seeking targeted audiences, this exposure is certainly more beneficial than harmful. Besides, every once in a while somebody matures a little and puts away their tinfoil hat-- and then is a likely candidate for the kind of news service NYT provides.

    So I think it would be very hard for NYT or Google to assess whether the Google cache is harmful or beneficial.

  23. Re:Free registration by Rob Riggs · · Score: 3, Informative
    There is a significant difference between logging in to a site in order to participate in conversation and logging in simply to read news. At /., posting requires an identity, since anonymous postings are mostly ignored. However, there is absolutely no requirement that one log in to /. in order to read the stories. Your analogy is broken. Privacy should be a choice. At /. one has that choice, with the NYT one does not.

    Another point is that anonymity is one of /.'s greatest strengths. Some of the most insightful and interesting posts have been from "insiders" posting anonymously.

    NY Times... user tracking is less sophisticated than slashcode's vital anti troll features.

    Care to back this statement up?

    ...continual complaints on slashdot from people who are obsessed with privacy on the net unless karma is involved

    You seem to be quite willing to give up those rights. And that's OK. But there are people here that feel that privacy is a rather important right. That should be respected as well. Enough people actually thought that privacy was a right of such importance that it is enumerated in the Universal Declaration of Human Rights (see Article 12).

    --
    the growth in cynicism and rebellion has not been without cause
  24. Re:Free registration by HBI · · Score: 2, Informative

    Maybe we can agree that the NYT is a well-written, serious and interesting newspaper.

    I won't agree with that statement. I will agree it's a well-written, serious, interesting work of fiction.

    To back that up, we can point at this, illustrating a bit how a reporter who falsified stories en masse (Jayson Blair) and an executive editor who tolerated same (Howell Raines) were kept on board because of a weird form of affirmative action (in the former case) and a personal friendship with the publisher (Arthur Sulzberger, Jr.) in the latter.

    If you want more information on the Jayson Blair-authored stream of fictional articles appearing in the NYT's pages, just Google to your heart's content.

    You can trust it as a newspaper again if you like, but I'm certainly not going to.

    --
    HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
  25. Copying webpages.... by IamLarryboy · · Score: 2, Informative

    happens every time you go to a website. Creating a copy of the content is the primary means of internet communication. I don't see how Google caching the pages is any different than me viewing it in my browser. It's not like Google takes the credit for the content. If it were so, there would be no way for any web search to work without owning all the searchable content.

  26. Re:Free registration by zcat_NZ · · Score: 3, Informative

    Adding one little line of code to every one of the myriad of pages on the New York Times website is not a small deal. It's going to involve a lot of paperwork, testing, and coding on the part of a lot of people.

    But it's not one line of text on EVERY page. It's one line of text in /robots.txt, a file that is independent of the rest of the site and never even accessed by ordinary browsers.

    It's probably simpler for Google to create a registry of "do not cache" pages on their end. And it's more their responsibility, anyway, being the ones who created the cache in the first place.

    Google already have exactly such a registry, and they don't even wait for sites to contact them. Their robot -asks- the site (via the recognised standard '/robots.txt' file) if they object to being indexed and/or cached. Most other search engines look for the same file and handle it the same way.

    This is (from my perspective) far better than having to individually register your site with the several hundred search engines that might try to index it.
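
    To make the "robot asks the site" step concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URLs and the helper name can_crawl are all assumptions invented for this illustration; they are not the NYT's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out entirely, and keep every
# other robot out of /archive/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /archive/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://example.com/2003/07/14/story.html"
    print(can_crawl(ROBOTS_TXT, "Googlebot", story))
    print(can_crawl(ROBOTS_TXT, "OtherBot", story))
```

    A well-behaved crawler would run a check like this before requesting each page; as noted elsewhere in the thread, nothing forces a robot to do so.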

    --
    455fe10422ca29c4933f95052b792ab2
  27. Re:can you elaborate? by anonymous loser · · Score: 2, Informative

    It's easy. Let's say you want to read this article (which is the top story ATM):

    "Iraqi Council to Seek U.N. Seat; One G.I. Killed in Baghdad"
    The URL is:

    http://www.nytimes.com/login.asp?URL=http://www.nytimes.com/2003/07/14/international/worldspecial/14CND-IRAQ.html
    (or something like that)

    Well, instead just substitute archive.nytimes.com:
    http://archive.nytimes.com/login.asp?http://www.nytimes.com/2003/07/14/international/worldspecial/14CND-IRAQ.html

    You will get a message that says something like "authorization error" and the browser then takes you back to the front page. However, when you click on the same story, you will get taken to the content rather than a login page.