Slashdot Mirror


Web Caching: Google vs. The New York Times

An anonymous reader writes "The Google cache is a popular feature among karma fetishists. Many stories with links to the NY Times attract comments pointing to Google's copy of the article. This gives readers access to the content without registering. C|Net reports that Google is in talks with the NY Times to close this backdoor. The article raises some general concerns regarding the caching of web content. Shouldn't the NY Times simply tell Google not to cache their site?"

129 of 518 comments

  1. Free registration by Zog The Undeniable · · Score: 5, Insightful

    I'd love to see their user database, just to count the number of Mickey Mice and Elmer Fudds on there. Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

    --
    When I am king, you will be first against the wall.
    1. Re:Free registration by presroi · · Score: 5, Insightful

      Maybe we can agree that the NYT is a well-written, serious and interesting newspaper. Not just for New Yorkers but also for people from Sweden, Japan or New Jersey.

      Where would the limit be? How would you feel if you had to register for every web page which is linked to at /. (I confess, I usually click on every /.-story link)?

      hmm, to answer your question:
      maybe the point in registration is the signing of a contract on how to use this content. Dunno.

    2. Re:Free registration by whm · · Score: 5, Informative

      Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

      User tracking. While cookies can do this loosely, requiring a login does this much more effectively. I know I log in with my same username each time I visit the site (if it's not cached). There's very little reason not to. This gives the NYT a much better indication of how many active and repeat members they have visiting their site. They can then target ads to users much more effectively, and market their userbase to advertisers much more solidly than they could with more rudimentary user tracking methods.

      There may be other purposes, but this seems like a large part of it.

    3. Re:Free registration by Anonymous Coward · · Score: 3, Funny

      It's not only spam - they want to be able to build profiles of their users for marketing purposes. They've got my e-mail address. In exchange, their database has got a 98 year old woman who lives in Albania with a PhD, no job and an income of less than a thousand bucks a year. Twice. Once for my work e-mail address. And once for home.

    4. Re:Free registration by cobbaut · · Score: 5, Interesting

      Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

      I always use a different address to register online in the form of website@mydomain.
      I registered with the NYT in 1999, I never received a single spam on this address.

      --
      European Linux user, living in Antwerp
    5. Re:Free registration by Anonymous Coward · · Score: 5, Insightful

      And on top of everything else, it annoys users more than just about anything else aside from spam. Can't recall exactly how many other people I know who go to see a NYT article, find the rego page, and ignore it to go find a better news source without the hassle.

      If they're tracking what their users are doing, they're affecting their user pool in a pretty negative way just by using this method.

    6. Re:Free registration by digitalunity · · Score: 5, Funny

      (I confess, I usually click on every /.-story link)

      This is *Slashdot*. We don't read articles. Please, either read the article or post a comment; you cannot do both.

      --
      You can't legislate goodness. Let each to his own destiny, by will of his freely made choices.
    7. Re:Free registration by MartinB · · Score: 2, Insightful
      Their database has got a 98 year old woman who lives in Albania with a PhD, no job and an income of less than a thousand bucks a year.

      And you wonder why you get ads that have absolutely no interest for you? And why advertisers have to shout louder and louder to get through a mass of untargeted ads?

      Advertisers would far rather spend less by buying fewer, smaller ad slots that are targeted accurately. Much better return on their spend. Like the guru said: "I know half my advertising is wasted. I wish I knew which half."

      --

      The only thing you can accurately describe as "Scotch" is a sticky tape made by 3M. And it's

    8. Re:Free registration by JanneM · · Score: 5, Funny

      You gave them an actual, working, email address? How ...quaint.

      Me, I'm a 66 year old single woman with no income, no education, who lives in a nonexistent Swedish town with a very rude (in Swedish) name. I figured that any site advertiser that wants to target this person must be desperate enough that their ads may actually be amusing.

      --
      Trust the Computer. The Computer is your friend.
    9. Re:Free registration by Zigg · · Score: 3, Funny

      Maybe we can agree that the NYT is a well-written, serious and interesting newspaper.

      No, sorry, I can't agree with that. Well, maybe interesting.

    10. Re:Free registration by StarFace · · Score: 4, Funny

      You say these things as if they are good things. Man, I don't want some newspaper tracking everything that I read, so that they may serve me custom tailored advertisements! Although, it would be nice if the system actually was intelligent, it would eventually discern that I loathe the advertisement philosophy, and stop sending them altogether -- ha.

      --
      V
    11. Re:Free registration by yelvington · · Score: 4, Informative

      NYT doesn't spam. And the percentage of net.morons who register using cartoon names is remarkably low.

      I don't work for the New York Times, but for another media company, and I'm in a position to understand the reasons for registration:

      1. Metrics. Registration supports the generation of accurate data on demographics and usage (reach, frequency) in a crosstabulated view. This is important in analytical processes to support site management and design as well as in the sale of advertising, which provides the revenue that makes the site possible.

      2. Ad targeting. Run-of-site, untargeted Internet advertising is nearly worthless on the open market (supply/demand), but advertising that is highly targeted remains highly valuable. When combined with proper analytical software and usage data, registration data can -- for example -- let me target 25- to 34-year-olds in a particular ZIP code who have been looking at real estate listings. And I can deliver that advertising anywhere on my site, such as on sports pages that otherwise would contain "junk" ad inventory. This is (measurably!) much more efficient and effective, and I can charge fairly high CPM prices. Importantly, this can be accomplished without providing any personal data to the advertiser, protecting the anonymity of the user.

      3. Reduction in traffic. Reduction is actually desirable in many cases. Not all customers are good customers, and not all traffic is good traffic.

      On the Google issue: I used robots.txt to block Google from indexing the AP content on our 27 newspaper sites, because I have no desire to be the unpaid provider of wire stories for Google News so that they can be read by users outside our markets. Additionally, I have used a router block to prevent several commercial Web clipping services from having access of any sort to any of our sites.
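
      A minimal sketch of the robots.txt approach described above, in Python. The /ap/ path, the news.example.com host, and the rules themselves are assumptions for illustration; the comment does not say what the real file contained. The standard library's urllib.robotparser shows how a well-behaved crawler would read such a file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /ap/ path and these rules are illustrative,
# not the actual file used by any newspaper site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /ap/

User-agent: *
Disallow:
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs on a placeholder host.
ap_story = "http://news.example.com/ap/story123.html"
local_story = "http://news.example.com/local/story456.html"

# Googlebot is kept out of the wire-service path only.
googlebot_ap = parser.can_fetch("Googlebot", ap_story)
googlebot_local = parser.can_fetch("Googlebot", local_story)

# Crawlers not named in the file fall through to the catch-all entry.
other_ap = parser.can_fetch("SomeOtherBot", ap_story)

print(googlebot_ap, googlebot_local, other_ap)
```

      This only works for crawlers that choose to honor robots.txt. A router block, as mentioned above, is enforced at the network level and does not depend on the crawler's cooperation.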

    12. Re:Free registration by Anonymous Coward · · Score: 2, Insightful
      Maybe we can agree that the NYT is a well-written, serious


      bullshit. Were you paying attention last month when they had to correct all the Jayson Blair stories, or acknowledge that he copied from other sources. Don't say it was one bad apple either. They haven't apologized for front-page stories claiming Iraq was a quagmire after 1 week of invading.

    13. Re:Free registration by mrd_yaddayadda · · Score: 3, Insightful

      You can count me as one of the people that ignore the NYT unless I can get a cached page. I get enough spam as it is...

    14. Re:Free registration by hesiod · · Score: 2, Insightful

      > They can then target ads to users much more effectively

      How about they advertise according to the content of the article. If it's a tech article they show tech adverts. That's pretty simple, and something they generally don't do (and it wouldn't require logging in)

    15. Re:Free registration by FatAlb3rt · · Score: 4, Insightful

      I disagree. Let's imagine for a minute that everyone provides an accurate profile, targeted marketing works, sales increase, and the advertiser gets rich.

      You really think that the money they spend on advertising will level off?

    16. Re:Free registration by NexusTw1n · · Score: 5, Insightful

      I always find it ironic when people on slashdot complain about being "tracked" on NYTimes webpages or other sites that require registration.

      Most people have registered to use /. , and have therefore provided a valid email address. So you can't have a moral objection to giving your email addy to websites you frequent.

      Even if you don't register, your IP address is logged and monitored via the sophisticated anti troll system. Try and post more than 10 times in one day as an AC, or post as an AC in reply to a post you modded and slashcode will react.

      So even as an AC you aren't really totally anonymous on slashdot, yet I don't see anyone who complains about NY Times links complaining about that. The only people who complain are the trolls that forced these features to be added to the code.

      So why do we have this tedious bitching about the NY times every time a link is posted?

      I registered a couple of years ago. I've never received a single spam to NYTimes@mydomain.com which was the email addy I used. I've never had to login because the login cookie has remained in Opera since I registered. How hard is it to log in and then forget about it forever more?

      The only reason I haven't forgotten I've registered is the continual complaints on slashdot from people who are obsessed with privacy on the net unless karma is involved. NY Times doesn't spam registered users, and any user tracking is less sophisticated than slashcode's vital anti troll features. So bear that in mind when tomorrow's NY Times story appears and the same old complaints are dragged out yet again.

      --
      It has become appallingly obvious that our technology has exceeded our humanity. --Albert Einstein
    17. Re:Free registration by Anonymous Coward · · Score: 4, Funny

      Stop impersonating my grandmother, you insensitive clod!

    18. Re:Free registration by LilMikey · · Score: 3, Insightful

      NYT does not let you access their content without logging in. That's nothing like Slashdot's system.

      --
      LilMikey.com... I'll stop doing it when you sto
    19. Re:Free registration by endoboy · · Score: 3, Insightful
      NYT does not let you access their content without logging in

      and why should they? NYT spent real $$$ to develop that content, and are under no obligation to give it away.

      Don't like it? Go someplace else.

    20. Re:Free registration by NexusTw1n · · Score: 2, Interesting

      Slashdot allows you to view the content but not post even anonymously without being tracked at IP address level.

      The NY Times allows you to view the archive anonymously, and allows you to view the main page with a password you could google for easily.

      So yeah, it's nothing like slashdot's system - NY Times intrudes far less into your privacy.

      --
      It has become appallingly obvious that our technology has exceeded our humanity. --Albert Einstein
    21. Re:Free registration by keithdowsett · · Score: 2, Interesting

      I'm continually surprised by how often these free registration programs will accept root@localhost or abuse@localhost in the e-mail field.

      This has the pleasing consequence that unless they employ someone to vet the list they are likely to end up spamming themselves or their provider. Both much more amusing than sending it to a completely fake address.

      Naturally, the rest of these forms must be treated as an exercise in creativity - and we should give our creations suitable names. My favourites include Hugh Jorgens and Tess Tickle.

      So, treat these forms like you treat the religious nuts who arrive on your doorstep preaching salvation - as a source of amusement.

    22. Re:Free registration by yelvington · · Score: 2, Interesting

      So, regarding 1, 2, and 3 - the advantage to me as a consumer is what?

      Relevancy of information. Think about it for a minute. Advertising that is not useful is noise. Advertising that is useful isn't noise. Targeting replaces the noise with utility.

      If you're looking for a house, informative advertising from mortgage lenders (real lenders, not the scam artists who clog your email box) is useful. If you're hungry, you might find targeted pizza coupons from the pizza joint around the corner to be useful. And so forth.

      You also might consider that making the Web site profitable ensures its survival, which ought to be an advantage to you, assuming that you care to use it.

    23. Re:Free registration by fucksl4shd0t · · Score: 3, Insightful

      So you can't have a moral objection to giving your email addy to websites you frequent.

      It's about trust, actually. Morality has nothing to do with it. I don't trust NYT not to sell my email address or anything like that. I *do* trust slashdot not to, but if I ever catch them doing it, well, I just won't tell them it's changed recently. :)

      There are quite a few sites that I frequent that I don't trust with personal information. Visiting a site frequently != trust.

      --
      Like what I said? You might like my music
    24. Re:Free registration by mysticgoat · · Score: 3, Informative

      You've brought out some very good information in a well-written way. Thank you. I'll cover much of the same ground from the satisfied user's viewpoint.

      1. NYT and spam: there is no relationship between these. That's my experience after years of subscription, and a number of other people on this thread report the same thing. The Yahoo portal news service is also good this way (and gives me Reuters: an excellent supplement to NYT).
      2. The metrics thing: I provided NYT with true demographics when I signed up, because I know that will help them deliver product more efficiently and sell their advertising.

        I want that. I like the service NYT provides, and so I want them to succeed. I very much want them to continue to provide me with a free subscription-- and I'm willing to help them hold their costs down and maximize their advertising revenues.

      3. Focused advertising: I don't like ads, but I'm willing to put up with their presence in exchange for a service like NYT.

        NYT has done a good job of keeping the impact of the ads low: the ads don't get in the way of reading the stories and they don't slow page loading significantly (since I'm on a slow rural dial-up, that's very important). If NYT starts to charge me, I'll be less tolerant of the ads. If the advertising starts slowing down the page loading, I'll drop my subscription. There are a number of other news services-- CNN, ABC, etc-- that I don't use because the advertising burden slows page loading or otherwise gets in the way.

        As to focused ads-- I'm all for that. I'd rather ignore stuff that's somewhat pertinent to my life than ignore crap I'd never buy. An ad for reading glasses is pertinent to me, but an ad for skateboards is crap-- I was long past skateboarding age before the first ones hit the street. Reading glasses are something me and my cohorts have to live with, and we talk about them. Nobody in my circle of friends has a skateboard and I don't recall ever talking about them. (Of course skateboards would be a problem for me and my neighbors: I don't think they do well on gravel and road apples.)

        And sometimes the advertising actually works-- sometimes it makes me aware of a product or company that I'll want to talk over with my buddies, and maybe try out. That is much more likely with focused ads. As I recall, my first awareness of the existence of fold-up reading glasses in a hard case (suitable for hiking, bicycling, and other hip pocket activities) was from an advertisement. Now I've got a couple of pairs of them. Neat.

      About Google's archive, NYT, and slashdot: Something I hope NYT considers is that the Google archive gives it (and at least some of its ads) exposure in demographic groups that it would otherwise never reach. Such as the tinfoil hat superparanoid geek crowd. While there is no way to develop metrics on this, nor any way to market this to advertisers seeking targeted audiences, this exposure is certainly more beneficial than harmful. Besides, every once in a while somebody matures a little and puts away their tinfoil hat-- and then is a likely candidate for the kind of news service NYT provides.

      So I think it would be very hard for NYT or Google to assess whether the Google cache is harmful or beneficial.

    25. Re:Free registration by gantrep · · Score: 2

      In short, yes. The fact that this one was allowed to slip by for so long unnoticed means the whole place is messed up, and they didn't have proper accountability for factual reporting.

    26. Re:Free registration by Geeky · · Score: 2
      And I would think the primary responsibility for managing the cache issues falls on the NYTimes. When they've done their part, they can whine to Google, but not before.

      I'm not so sure - why should the burden of responsibility fall on websites? It's like those checkboxes on forms (both web and real), that basically say "tick this or we'll spam you". Google are saying "add this to your html or we'll cache you". The principle is the same - inaction is taken as approval. And I don't suppose Google are offering to pick up the costs that the NY Times would incur changing their code? Google should at the very least have a "do not cache" list to which sites could be added at their own request - without having to change their sites.

      --
      Sigs are so 1990s. No way would I be seen dead with one.
    27. Re:Free registration by Rob Riggs · · Score: 3, Informative
      There is a significant difference to logging in to a site in order to participate in conversation and logging in to simply read news. At /., posting requires an identity, since anonymous postings are mostly ignored. However, there is absolutely no requirement that one log in to /. in order to read the stories. Your analogy is broken. Privacy should be a choice. At /. one has that choice, with the NYT one does not.

      Another point is that anonymity is one of /.'s greatest strengths. Some of the most insightful and interesting posts have been from "insiders" posting anonymously.

      NY Times... user tracking is less sophisticated than slashcode's vital anti troll features.

      Care to back this statement up?

      ...continual complaints on slashdot from people who are obsessed with privacy on the net unless karma is involved

      You seem to be quite willing to give up those rights. And that's OK. But there are people here that feel that privacy is a rather important right. That should be respected as well. Enough people actually thought that privacy was a right of such importance that it is enumerated in the Universal Declaration of Human Rights (see Article 12).

      --
      the growth in cynicism and rebellion has not been without cause
    28. Re:Free registration by HBI · · Score: 2, Informative

      Maybe we can agree that the NYT is a well-written, serious and interesting newspaper.

      I won't agree with that statement. I will agree it's a well-written, serious, interesting work of fiction.

      To back that up, we can point at this, illustrating a bit how a reporter that falsified stories en masse (Jayson Blair) and an executive editor who tolerated same (Howell Raines) were kept on board because of a weird form of affirmative action (in the former case) and a personal friendship with the publisher (Arthur Sulzberger, Jr.) in the latter.

      If you want more information on the Jayson Blair-authored stream of fictional articles appearing in the NYT's pages, just Google to your heart's content.

      You can trust it as a newspaper again if you like, but I'm certainly not going to.

      --
      HBI's Law: Frequency of calling others Nazis is directly correlated with the likelihood of the accuser being Communist.
    29. Re:Free registration by vTalon · · Score: 2, Interesting

      they don't have to change the whole site; they just need to add ONE LINE of text to ONE plaintext file.

      How hard is that?


      First rule of web design (or programming, or anything having to do with a computer or any other complex system): that little change that you make in five minutes that you know won't affect anything -- the one you don't test because it's so minor -- that change is going to bring the whole page/program/computer/national defense system crashing down around your ears.

      Adding one little line of code to every one of the myriad of pages on the New York Times website is not a small deal. It's going to involve a lot of paperwork, testing, and coding on the part of a lot of people.

      It's probably simpler for Google to create a registry of "do not cache" pages on their end. And it's more their responsibility, anyway, being the ones who created the cache in the first place.

    30. Re:Free registration by zcat_NZ · · Score: 3, Informative

      Adding one little line of code to every one of the myriad of pages on the New York Times website is not a small deal. It's going to involve a lot of paperwork, testing, and coding on the part of a lot of people.

      But it's not one line of text on EVERY page. It's one line of text in /robots.txt, a file that is independent of the rest of the site and never even accessed by ordinary browsers.

      It's probably simpler for Google to create a registry of "do not cache" pages on their end. And it's more their responsibility, anyway, being the ones who created the cache in the first place.

      Google already have exactly such a registry, and they don't even wait for sites to contact them. Their robot -asks- the site (via the recognised standard '/robots.txt' file) if they object to being indexed and/or cached. Most other search engines look for the same file and handle it the same way.

      This is (from my perspective) far better than having to individually register your site with the several hundred search engines that might try to index it.

      --
      455fe10422ca29c4933f95052b792ab2
  2. Worst result by presroi · · Score: 2, Interesting

    The worst outcome would be a google-database which is not representative for the general web. I simply ecspect all results in google to be accessible without registering, paying or doing anything similar.

    1. Re:Worst result by Bitter Old Man · · Score: 3, Funny

      Ecspect? Ecspect? ECSPECT? Jesus H mother of god, it's like just when I think slashdot couldn't possibly get any worse... it does.

  3. God damnit... by tangent3 · · Score: 5, Funny

    Now we can't karma whore by linking to the google cache?

    1. Re:God damnit... by anonymous loser · · Score: 4, Informative

      The *real* karma whores link to http://archive.nytimes.com anyway.

      NYTimes have futzed around with it a bit, but if you play with it, it still gives you registration-free access to their content, it just takes a couple of clicks nowadays.

    2. Re:God damnit... by toopc · · Score: 2, Funny
      You'll have to stick to making lame, obvious jokes.

      1. Make lame, obvious joke.
      2. ???
      3. Profit!

      All your lame jokes are belong to us.

      In Soviet Russia lame jokes make you!

  4. Google - more useless everyday by jkrise · · Score: 2, Insightful

    IANTrolling here, but I find Google more and more useless by the day. Sometime back, I pointed out how Google seems to have a soft corner for articles and sites that affect big firms such as Microsoft.

    In fact, several of Slashdot's own articles on Microsoft aren't available from Google news, although Slashdot is listed as a 'news' source. Couple of MS related Slashdot articles (on the Oregon bill - March 6th and May) have been removed, but much pro-MS content pre-dating March is still referenced.

    Google seems to be aping the other Gorilla, despite all the posturing, and Microsoft's so-called attempts to categorise it as a competitor, when in fact, Google appears to be an ally!

    --
    If you keep throwing chairs, one day you'll break windows....
    1. Re:Google - more useless everyday by cioxx · · Score: 4, Informative
      Sometime back, I pointed out how Google seems to have a soft corner for articles and sites that affect big firms such as Microsoft.


      "Google News is highly unusual in that it offers a news service compiled solely by computer algorithms without human intervention. While the sources of the news vary in perspective and editorial approach, their selection for inclusion is done without regard to political viewpoint or ideology. While this may lead to some occasionally unusual and contradictory groupings, it is exactly this variety that makes Google News a valuable source of information on the important issues of the day." source

      Remove your tinfoil hat please. There is no conspiracy. Google News features articles from Newsmax, Electronic Intifadah, Islam Online, Al Jazeera, World Net Daily, etc. If there was any filtering going on, these sites would have been off the radar a long time ago.

      Also, Slashdot is not a professional journalistic site. It's a News-based comment board where people come to share their opinion. In a perfect world Slashdot doesn't even belong on Google News.

    2. Re:Google - more useless everyday by jkrise · · Score: 2, Interesting

      Google News is highly unusual in that it offers a news service compiled solely by computer algorithms without human intervention.

      A May article was referenced in Google, but the link pointed to a March 6th article. How can computer algorithms cause this?

      While this may lead to some occasionally unusual and contradictory groupings, it is exactly this variety that makes Google News a valuable source of information on the important issues of the day.

      Just search for Googlewash using Google. Read story in TheRegister (it's not delisted now). Watch hypocrisy in action. Roll up eyes in disbelief. Adjust tin foil hat.

      --
      If you keep throwing chairs, one day you'll break windows....
    3. Re:Google - more useless everyday by MonTemplar · · Score: 4, Interesting

      First off, Google News is still in Beta at the moment.

      Second, the Google News database only goes back a month or so, probably by design.

      Third, I was able to search for 'site:slashdot.org microsoft oregon' on Google just fine this morning. Got 243 results, and the Google Cache has copies of the first three pages returned, which relate directly to the Oregon bill you use as your example.

      So, where is the problem?

      --
      -MT.
    4. Re:Google - more useless everyday by dhodell · · Score: 5, Interesting

      Just FYI, this behavior is due to the fact that Googlebot has a sort of "built-in" mechanism to ignore (or at least rank lower) forum-type sites. Since /. is primarily a "news headline and discussion" site, Google will not rank it as highly as one that seems to be more "on-topic". This is because there is no guarantee that any URLs or email addresses within the page have anything to do with the actual page content itself.

      Outside of user posts, /. has little genuine unique content. It summarizes a lot of headlines; this content is not unique.

      Other (large) factors determine the way Google ranks pages, including the "PageRank" feature. There are lots of documents about the way Google ranks sites, I suggest checking them out. The best way is probably to Google for it :).

      Anyway, this is a bit more on-topic:

      I highly appreciate Google's caching feature, and don't see how it can be taken as "bad".

      This is what's "bad" about Google and what I expect that, at some point, will come to haunt them. For instance, if I want to get serial numbers without porn popups, I can usually search for something like "Office XP Serial Number Serialz Warez" or something similar. Within the first couple of pages, I will probably find my serial number in the text of the page description.

      If not, it's on the page, oftentimes without a popup, since the serial/crack page itself is the one linked.

      Want to find X-Win32? How about doing "* * * * * * xwin32*.exe" - let's get some directory listings containing this filename.

      No doubt this proves that Google is more than just a search machine... but I think that their superior techniques will definitely come back to haunt them in the future. NYT is way off target with bitching about their caching features... you can turn this off easily, and there are a plethora of scripts one can use to break out of Google's cache and send someone to the main site (or, perhaps, login area in the case of NYT).

      But, in other news, Google might need to watch out...

      --
      Kind regards, Devon H. O'Dell
  5. Yes. by Naikrovek · · Score: 4, Interesting

    Shouldn't the NY Times simply tell Google not to cache their site?

    You do realize that this is probably the basis of the "talks" that are going on, right? C|Net (as per usual for them and every news agency) is making a big deal of it to get themselves and their advertisers that tiny wee bit more of attention. Every little bit helps I guess.

    Check http://nytimes.com/robots.txt in a week.

    1. Re:Yes. by Zocalo · · Score: 2, Insightful

      Actually, the link to "robots.txt" raises an interesting point. Why is NY Times even in "discussions" for this, other than to gain some column inches? It's entirely up to the NYT whether to let Google's robots index their site, isn't it? I would have thought that Google's robots would be well behaved in this respect and simply move on to the next site if they were told to go away by robots.txt.

      --
      UNIX? They're not even circumcised! Savages!
    2. Re:Yes. by gibodean · · Score: 2, Insightful

      Actually, from the text of the article, they say that they want it so that when you click on a link in google, you get the registration page of the NYT.

      A robots.txt would stop google from indexing the site altogether. They don't want that to happen. They want a google search to show NYT web pages, but they just want to make sure that when the user tries to view it, they have to register with NYT first. That means that google must still index the page, but not allow access through the cache. Plus, it must direct to a sign-on page rather than the page itself, but that is something that I'm sure the NYT itself could handle, like I think it does now anyway.

    3. Re:Yes. by AyeRoxor! · · Score: 2, Interesting

      "It's entirely upto the NYT whether to let Google's robots to index their site, isn't it?"

      I personally think the NYTimes wants Google to continue to cache their stories.

      If they use robots.txt, no NYT articles will come up in Google. However, if they *do* succeed in these talks, I presume the articles will still come up, but uncached, and linking to a signup/login screen. It makes pretty good business sense.

    4. Re:Yes. by Arker · · Score: 2, Interesting

      A robots.txt would stop google from indexing the site altogether. They don't want that to happen. They want a google search to show NYT web pages, but they just want to make sure that when the user tries to view it, they have to register with NYT first. That means that google must still index the page, but not allow access through the cache. Plus, it must direct to a sign-on page rather than the page itself, but that is something that I'm sure the NYT itself could handle, like I think it does now anyway.

      I sincerely hope google shows some balls and tells them to f right off.

      They can't have it both ways, either they're on the web or they're not. They've been trying for years to subvert things so they can have their cake and eat it too, and they need to get told no by someone they'll listen to.

      --
      =-=-=-=-=-=-=-=-=-=-=-=-=-=-
      Friends don't let friends enable ecmascript.
    5. Re:Yes. by StarFace · · Score: 4, Interesting

      True, there is no standard, but Google's method of allowing indexing and caching as independently selectable features is well documented and extremely easy to do. You can even tell Google specifically to stop caching, if you don't mind smaller engines caching.

      So, it isn't a standard, but it is a piece of cake for NYT to figure out, and indeed, they already have. As the person above said, this is just C|Net trying to be a real news source. The article even says that the method I just described is the focus of their "discussions."

      I imagine the discussions, if anything, were initially a friendly lawyer call (if even that), which was quickly diverted to a tech issue and ended up with some webmaster at NYT getting the specifics of how to set up the Times so that Google will still index their pages and bring them up with searches, but not cache them.

      --
      V
  6. NY Times likes accuracy by Anonymous Coward · · Score: 4, Interesting

    The reason they're trying to stop this is because with NYT reputation, they keep retracting stories all the time. With Google cache this could be problematic and the management/editors/authors could get into trouble again.

    I do however dislike Google cache for many reasons. It's bad for privacy.

    1. Re:NY Times likes accuracy by anshil · · Score: 4, Insightful

      Since when is content published in the WWW about privacy?

      It's just like a government that wants to control which newspapers may be archived for history research.

      --
      Karma 50, and all I got was this lousy T-Shirt.
    2. Re:NY Times likes accuracy by MonTemplar · · Score: 4, Informative

      What he said! Remember, the first two W's are for World Wide.

      The only people who seem to have a problem with webpage caching are either legal flacks working in CYA Mode, or webmasters who can't be bothered to mark up their pages and add robots.txt files to make sure that only public information goes out of their websites.

      --
      -MT.
  7. I'll do what I want! by scudco · · Score: 2, Interesting

    I think if the NY Times has a problem with this, then they have the right to stop Google from caching the site, but I do not think it would be a wise decision on their part. The NY Times enjoys being a reputable news outlet, and if they were to lower their readership and, more importantly, not allow everyone to read their articles, it would hinder their reputation in a slight way, and all slashdotters might turn to a different news outlet, which is only bad for them.

  8. It raises 2 questions .. by Mr_Silver · · Score: 4, Informative
    such as:
    1. When will slashdot stop linking to articles that require a registration?
    2. When will slashdot consider implementing caching for pages that, by linking to, they manage to take off the internet?
    Sure, the 2nd question has been answered in the FAQ. Except it was written three years ago and Google manages this just fine. Maybe time for a second look?

    On the topic of site updates, has anyone noticed that 90% of the links on http://slashdot.org/code.shtml don't work any more?

    Hell the link to an Avantgo version of Slashdot points to a website which has been broken for over 2 years.

    --
    Avantslash - View Slashdot cleanly on your mobile phone.
  9. Might not be all bad... by leshert · · Score: 4, Interesting

    As the poster mentioned, Google already has a way to opt out of caching, so "talks" sounds like this is something different. My guess is that Google will become an affiliate of the NYT (in other words, if you hit a NYT link from Google, you're exempted from registering), and will then drop the caching.

    1. Re:Might not be all bad... by leshert · · Score: 2, Interesting

      Pardon me for self-replying but it just occurred to me: maybe Slashdot itself might have an interest in becoming a NYT affiliate? Surely the NYT gets a good chunk of pageviews (and therefore ad revenue, modulo the minority that block them) every time one of their articles shows up here.

    2. Re:Might not be all bad... by joeykiller · · Score: 2, Insightful
      As the poster mentioned, Google already has a way to opt out of caching
      Yes, Google has this. But for a couple of years I've had the opinion that it actually should be reverse: Sites should be able to opt in, not out. The default should be no cache versions.

      Lately there have been discussions here on Slashdot about fair use: about 30-second clips of music on the net, and thumbnails of images, being fair use. I can agree that that's fair use of content.

      But think about Google's cache: A page in Google's cache isn't a part of - or a summary of content - but it is the entire content of a page. If this isn't breach of copyright, I don't know what is.

      Google's cache gives more food for thought as well: Let's say I wrote something about someone on my web site, and this person sued me. A judge decides against me and gives me a fine, and orders me to remove the content. But even if I do so, the inflammatory words would still be accessible through Google's cache.

      Now, some of you may argue that I could just write Google and ask them to remove the page. But the point is that if this is legal, just about anyone can cache my site. If enough search engines cache content, I most probably would never be able to find every site that provided cached versions of my site.

      I'm not sure as to what constitutes fair use of content in the US, but in my country at least (Norway) I'm almost certain that Google's cache mechanism would be judged a breach of copyright laws.

  10. Erm...cache? by DennyK · · Score: 4, Informative

    The article talks about Google's caching of articles that have expired to the NYT archives (which you have to pay to access). What most /. folks use to link to current NYT articles are the Google partner links, which simply bypass the free registration. I'd assume these links only work as long as an article hasn't been archived yet, so the karma whores are safe; I doubt the NYT's Google partner links will be going away any time soon... ;)

    DennyK

    1. Re:Erm...cache? by Neophytus · · Score: 5, Insightful

      I was thinking the same thing. I can't recall seeing a NYT article linked from here with the Google cache banner across the top; what I do see a lot are the partner links. Google already provides for register-only news sites (Financial Times?) by putting a [reg only] tag beside the article. Why the NYT has chosen not to use this up until now is a tad strange, and it looks like someone has picked up the wrong end of the stick.

  11. Registration isn't a 4 letter word by Anonymous Coward · · Score: 2, Interesting

    With respect to the NYT, I registered some time ago. Never received any associated spam or experienced any problems other than trying to think up fake information different than all the other fake information I'd submitted elsewhere.

    If registering allows the owners of a website to leverage their success by having a certain number of registered users, all the more power to them. Aside from the one-time sign-up "inconvenience," I don't see any issue, assuming the website operator is either a known entity or otherwise reputable, of course.

    As for the issues related to Google's caching, I'm waiting for cached mp3's.

  12. Closing the Google cache "backdoor"... by Homology · · Score: 4, Interesting
    will deny getting access to older articles in newspapers.

    It's worth remembering that newspapers sometimes edit/remove articles they publish on their homepage. Without a Google cache it may be much harder to verify that a story has indeed been modified.

  13. And out comes the lawyers... by Anonymous Coward · · Score: 5, Interesting

    Don't you just hate it when promising new technology is curbed by outdated laws?

    Here in Denmark we had a service similar to news.google.com for Danish newspapers. The newspaper organisation sued the service for parasiting on their databases (which is prohibited in Denmark). The service was shut down half a year ago and we now don't have that kind of service anymore.

    Of course newspapers should be allowed to publish their stuff without others copying it but they refused to even use a "robots.txt" (which the news service respected) to stop indexing.

    If you publish your stuff on the internet and don't tell people that they should not index it, cache it or what do I know - then you better expect them to do that. Let us put those lawyers back where they belong.

  14. The reason by Apreche · · Score: 3, Insightful

    The reason that the NYT just doesn't tell Google not to cache them is visitors. Let's face it, even though the registration is a bitch the content on the NYT website is fairly decent. They have good articles often enough that geeks went through the effort of finding out how to read without registering. If they have Google not cache them, and they close the Google News loophole, then they won't appear on Google News any longer. And Google News is used by many more people than you think.

    Hey, we get quite a few visitors from this google news. Let's change it so we get 0 visitors from it.

    Duh.

    --
    The GeekNights podcast is going strong. Listen!
  15. hmm by jaemark · · Score: 3, Insightful

    the nytimes website needs google for the traffic google brings into their pages, so they can't turn away their spiders. but then, they don't want the spiders either because of copyright violations. why should this be google's problem anyway?

  16. Re:Free registration..some implications by jkrise · · Score: 3, Informative

    Actually, free reg requires a valid email id. It thus filters most bogus registrations. Secondly, news sites are planning to go the 'pay' way in about a couple of years. Getting readers to register would give more accurate estimates of readership.

    And lastly, once a site requires registration, even if free, Copyright ptohibits quoting entire articles on the web. This indeed could be the prime reason for this.

    --
    If you keep throwing chairs, one day you'll break windows....
  17. Anyone above this post hasn't read the article. by banal+avenger · · Score: 5, Interesting

    The Internet Archive, which I just used minutes ago to find a handy page removed years ago, is an interesting corollary to the Google cache. I often wonder how it has survived this long without a major lawsuit. It also reminds me how crappy the web looked 5 years ago.

    At any rate, cache-ing is an important force on the internet, and isn't one that should be limited in any legal way, including litigation.

    1. Re:Anyone above this post hasn't read the article. by Anonymous+Brave+Guy · · Score: 2, Insightful
      At any rate, cache-ing is an important force on the internet, ...

      It is? In this sense? We managed without it being mainstream quite happily until a year or two back.

      ...and isn't one that should be limited in any legal way, including litigation.

      In your opinion. Others have different opinions. We have a legal system to resolve differences of opinion. Go figure. :-)

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  18. Test Question by Effugas · · Score: 4, Insightful

    You are the new editor of the New York Times, the "Newspaper of Record" for the United States, if not the world. You are, of course, the new editor because the previous editor had to resign, taking the blame for an individual reporter's flagrant disregard for the awe-inspiring credibility of your institution. In the process of rebuilding your credibility, should you:

    A) Insist that unaffiliated digital libraries restrict access to or simply eliminate all records of your "Newspaper of Record", or
    B) Realize that maybe right about now is not particularly the best time to be saying to the world, "Please forget what we published last week."

  19. actually... by Draghkhar · · Score: 5, Interesting

    Actually the NYT has already begun using Google's NOARCHIVE option to prevent content caching. Here's an excerpt from this morning's front page story's source:

    <!-- ADX SETUP: page: www.nytimes.com/yr/mo/day/international/worldspecial/14IRAQ.html positions: Top5,Middle1,Right3,Middle5,Right,Travel7,Travel11,Bottom1A,Bottom3A,Right5,Right6,Right7,Right8,Bottom8,Bottom7,Inv1,Inv2,Inv3,Frame4,Right4 kwds: politics+and+government;international+relations;iraq;suggested%5ftopnews;suggested%5finternational;suggested%5fworldspecial;suggested%5fmiddleeast -->

    <meta name="ROBOTS" content="NOARCHIVE">

    Kind of makes me wonder what's the point of the story, since it even says there's an easy way for concerned parties to opt out of the cache.
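    For anyone who wants to check a page for this directive themselves, here is a rough sketch using Python's standard html.parser. It assumes the directive appears as a meta tag named "robots" with a comma-separated content attribute, as in the excerpt above; the class name, function name and sample page are invented for illustration and are not taken from the NYT source.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of lowercase robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page head carrying the same tag quoted in the comment above.
PAGE = '<html><head><meta name="ROBOTS" content="NOARCHIVE"></head><body>story</body></html>'

directives = robots_directives(PAGE)
print("noarchive" in directives)
```

    A page carrying only NOARCHIVE asks crawlers not to serve a cached copy while still permitting indexing, which is the combination this thread says the NYT is after.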

    1. Re:actually... by babbage · · Score: 2, Insightful
      ADX is the ad server used by NYTimes.com, it has nothing to do with page content.

      If what you're posting comes from an article page's <head> section, you seem to be pasting more than you intended. Directives to ban archiving of ads aren't an editorial issue, but a business decision -- cached ads screw up the bookkeeping and, by extension, the bottom line on the balance sheet.

      The practice of restricting cacheing of ad content is, presumably, common across the industry -- it's not just NYT that has an interest in forcing this.

      The (apparent) <meta name="ROBOTS" content="NOARCHIVE"> tag you cite should be wholly separate from the ad server code.

      (Signed, a former employee of NYT digital...)

  20. Re:Worst result .. it's reality already by jkrise · · Score: 2, Interesting

    Gradually, Google's built up a 'good guy' image, and now looks like they're going for the kill. Already Google seems to be the only search site around, and they censor and distort like mad.

    Consult the word: Googlewash, and you'll find a lot of info on the referenced article from The Register (it's available now, earlier this was censored). Incidentally, the affected article was a NYT OpEd piece!

    --
    If you keep throwing chairs, one day you'll break windows....
  21. Sweet irony by Amomynos+Coward · · Score: 3, Informative

    In case cnet is /.'tted, here's a link to the Google cached page.

  22. Hrmmmmm.... by Duncan3 · · Score: 2, Funny

    So do we charge the NYT $1,000 to explain robots.txt or $10,000 because they are so stupid...

    --
    - Adam L. Beberg - The Cosm Project - http://www.mithral.com/
  23. Ask questions first, shoot later. (if needed) by daBass · · Score: 2, Insightful

    Well, I guess that NYT (and many others) allowing Google News to login and index their content means that they like them doing that for getting traffic. For whatever reason, NYT wants you to register and they have a right to as well as they have copyright, allowing Google to put in the snippet, but not the whole article without their consent.

    And that is the reason for an index, to find the original.

    It is good to see they are working this out together, though, without NYT going to court as the first step. This is a far better way than the popular shoot-first-ask-questions-later attitude most media companies have...

  24. There's no such thing as free registration by pslam · · Score: 5, Insightful
    Apart from giving the NYT your e-mail addy for spam purposes, what real point is there to free registration?

    That's the thing - it's not free depending on your definition. By my own definition, you're giving them valuable information, and they get to keep it and use it as they will, including spamming if they feel like it (or spam from any company which buys them out, they sell it to if they're feeling bankrupt, etc). It's practically misadvertising of a service, but it's accepted now, so everyone gets away with it.

    If it really were free, why would you need to register in the first place?

    1. Re:There's no such thing as free registration by pyrrho · · Score: 2, Interesting

      it's not free, the price is registration.

      Barter did not quite die out as advertised in the 20th century (that'd be that last bloody century).

      People are confused because they don't think about the economy as barter, but it is, money just lubricates a basic system of barter.

      GPL code is not free, the cost is your commitment to share your changes to the code with whomever you share a binary with.

      Nytimes.com is not free, the cost is registration, so they know more about their users.

      etc. etc. no-money-required != free.

      --

      -pyrrho

  25. Sentient Crawlers? by MegaT · · Score: 2, Interesting

    How does Google manage to cache a page which requires free registration anyway? Are the crawlers that smart?

  26. Illogical. by HanzoSan · · Score: 4, Interesting


    Trying to sell web pages is like attempting to sell mp3s on a p2p service where all mp3s are free.

    It won't work. Instead you should use your websites to market and sell your magazine subscriptions.

    Like Wired.

    --
    If you use Linux, please help development of Autopac
  27. Archives by Daemonic · · Score: 2, Interesting

    I think newspapers expect their archives to be real revenue generators in the future. ISTR journalists/columnists getting annoyed a few years ago when these archives started to appear, as they weren't getting paid any extra money for having their work effectively republished, but I suppose any such legal arguments have been resolved one way or another by now.

  28. google cache by anshil · · Score: 4, Funny
    --

    --
    Karma 50, and all I got was this lousy T-Shirt.
  29. Shouldn't someone simply tell the NY Times: no reg by StrawberryFrog · · Score: 4, Insightful

    Brand recognition is not always a good thing. When I think NY times I think "that annoying registration website". They are free to do what they want, but it leaves me cold.

    --

    My Karma: ran over your Dogma
    StrawberryFrog

  30. Free registration and the RIAA by mike_mgo · · Score: 5, Insightful
    It's articles like this that make me think that the recording and movie industries are right to go after online piracy with everything they've got.

    Here we have the NYT, one of the premier news organizations in the world, offering its articles for free on the same day that they are published. Yet a large number of people, of this online community at least, refuses to provide even a minimal amount of information (and no money) so that the newspaper can try to make its online presence profitable.

    I think the spam fears are a red herring, I've been registered with the times for over 2 years. I've never gotten spam that I think is traceable from them. I get a daily email of the day's headlines (and with the click of a box I could discontinue this).

    Why should the RIAA change its business model to a pennies per song method when there is such a blatant example of the online community refusing to go directly to the source for even free material?

    1. Re:Free registration and the RIAA by swilver · · Score: 2, Insightful

      The problem I really have with even a free registration is that it is yet another hoop I have to jump through for content that is also available (albeit in a maybe slightly different form) at other sites which do not have these policies. A thousand other news sites are willing to serve me their news without the registration hoop -- I really don't see why the NYT is any more special. As for their image: I think of them as the news site that is too damn stubborn to drop the registration and just display the articles like all the other sites do.

      Registration imho is just silly. Since nobody fills out such registrations with any real information anyway (it gets tiring after the first dozen forms or so) the information is probably so wrong you might as well be anonymous. If you are assuming the information is bogus anyway, why not put a cookie on their machine with a unique number (I have no problem with that (yet), as it doesn't annoy me or cost me any extra time) and use that to track that user's actions. You can find out quite a lot that way (seeing what articles he/she likes, how they navigate the site, approximately where they come from, etc.). This would be MORE information than they are getting from me now, which is none -- I'm sure the other sites are doing this already, just looking at the huge lists of cookies on my machine.
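      The "cookie with a unique number" idea is simple enough to sketch. What follows is a minimal illustration using Python's standard http.cookies and uuid modules, not a description of how any news site actually tracks readers. The cookie name "visitor", the one-year lifetime and the function name are all assumptions made up for the example.

```python
import uuid
from http.cookies import SimpleCookie

ONE_YEAR_SECONDS = 60 * 60 * 24 * 365  # assumed cookie lifetime


def visitor_id(cookie_header: str):
    """Return (visitor_id, set_cookie_value_or_None) for one request.

    `cookie_header` is the raw value of the request's Cookie header.
    A returning visitor keeps the existing ID; a new one gets a fresh
    random ID plus the Set-Cookie value the server should send back.
    """
    incoming = SimpleCookie()
    incoming.load(cookie_header or "")
    if "visitor" in incoming:
        return incoming["visitor"].value, None

    new_id = uuid.uuid4().hex
    outgoing = SimpleCookie()
    outgoing["visitor"] = new_id
    outgoing["visitor"]["path"] = "/"
    outgoing["visitor"]["max-age"] = ONE_YEAR_SECONDS
    return new_id, outgoing["visitor"].OutputString()


# First visit: no cookie yet, so an ID is minted and a header is returned.
first_id, set_cookie = visitor_id("")

# Later visit: the browser sends the cookie back, so the same ID is reused.
second_id, no_header = visitor_id("visitor=" + first_id)
print(first_id == second_id)
```

      Logging that ID next to each requested URL gives the navigation data described above, without any form to fill in. It does stop working the moment the reader clears cookies or switches machines, which is part of why a login tracks more reliably.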

      --Swilver

    2. Re:Free registration and the RIAA by BigBadBri · · Score: 2, Insightful
      Why should the NYT make a profit from its online presence?

      By posting their stories online, they are able to attract paid advertising, gain public recognition for their dead-tree product, garner goodwill (intangible, but still added in whenever a business is valued) and generally build a brand.

      My point is that the online NYT should be regarded as a marketing expense, not as a moneyspinner. We've all seen the grandiose dreams and foolish business plans of the dot-commers fade to dust, so perhaps it's time to reevaluate what an online presence for a newspaper actually amounts to.

      Registration is a pain, and it does stop people from reading NYT online content and being exposed to the advertising embedded in that content.

      What would be interesting would be an analysis of the registration details so far provided, split into 'valid' and 'ludicrous' categories. This might give a measure of the true value of registration, which I am sure is lower than the NYT believe it to be.

      Maybe your query about the RIAA's business model is a valid one - I would prefer to pay a fair price for CDs and see Robbie Williams et al less enriched. Without the CD price cartels, online copyright infringement would be much less significant, and more importantly, lower priced CDs would be more attractive as items of discretionary spending in times of economic anxiety, making the dramatic collapses in sales that we have seen much less likely to occur in future.

      --
      oh brave new world, that has such people in it!
  31. Tech savvy at CNet? by nacturation · · Score: 3, Insightful
    From the article:
    Practically speaking, Web sites can "opt out," or include code in their pages that bars Google from caching the page. A tag to exclude "robots" such as "www.nytimes.com/robots.txt" or "NOARCHIVE" typically does the job.
    First of all, robots.txt is not a "tag", it's a file. NOARCHIVE is a tag, but it exists within robots.txt, not instead of it as the "or" conjunction would have the unwashed masses believe. Granted, journalists aren't all that tech savvy and are just likely regurgitating a bastardized version based on sketchy notes. But for a supposed tech-oriented site, this kind of reporting is deplorable.
    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  32. Re:Free registration..some implications by gilroy · · Score: 5, Informative
    Blockquoth the poster:

    and lastly, once a site requires registration, even if free, Copyright ptohibits [sic] quoting entire articles on the web.

    Actually, registration is not required to protect a work. Creating a work automatically protects it under copyright law -- no need for registration, user fees, or that little (c) thingy. At least in countries respecting the Berne Convention.
  33. Re:I like it by nacturation · · Score: 3, Insightful

    the facts are a commercial company A (google) are making a profit from unauthorised copying of other peoples content without permission , meaning company B (you) has to spend money (webmaster) or take proactive steps to remove your content from their databases, google are not an ISP or a goverment agency so really they have no buisness in taking without asking other peoples content.


    I don't know what planet you're on, but I profit when my site is listed in Google. People spend an inordinate amount of time and money to make sure their site is listed in the best way possible. Are you going to exclude what could possibly be a huge source of revenue for you? But maybe you have some obscure site you don't want anybody to be able to search for. So, given the amount of time it takes to build even the simplest site, is it really that much trouble to upload a robots.txt file with noindex, noarchive, nofollow in it?

    --
    Want to improve your Karma? Instead of "Post Anonymously", try the "Post Humously" option.
  34. Not Jayson Blair by AtariAmarok · · Score: 2, Funny

    At least this story is from C-Net. If it were from the NYT with a byline by Jayson Blair (sandwiched between his stories about political upheaval in Grand Fenwick and his new biography of Thomas Crapper), we might have to wonder.

    --
    Don't blame Durga. I voted for Centauri.
  35. slashdot does not cache. by leuk_he · · Score: 2, Interesting

    Slashdot decided never to cache a site themselves (see the FAQ). As a result of this, many sites have died in the process of being /.ed.

    Why doesn't /. cache the articles? Too much legal work, I suppose. Why does Google get away with this? They took on the legal work?

  36. Re:Free registration..some implications by bigbob2k02 · · Score: 4, Interesting
    "Actually, free reg requires a valid email id. It thus filters most bogus registrations."
    I don't find that to be true. Maybe you need to save the random login page onto a local computer, or maybe you need to block referrers with a firewall, but the random login works well for me, for viewing pages:

    www.majcher.com/nytview.html

    Use it frequently and often!

    Currently I see "Welcome, paohjjkmtpfd."

    At Washingtonpost.com, they only want gender, year of birth, Zip or Country. Pick most randomly, but always use 1984 for birth year.

    "Secondly, news sites are planning to go the 'pay' way in about a couple of years."
    More and more archives are already going pay per view.
  37. Caching of news sources ensures data integrity by mikeophile · · Score: 4, Interesting
    In the old paradigm of news publishing, the product was printed indelibly on paper.

    Hardcopy newspapers can't be erased or amended to suit whatever powerful interests might be embarrassed by the truth.

    Web-based publications may not be immune to such tampering if they are archived by only one source.

    To not allow independent caching of news is just another step closer to historical revisions and distortions.

    I'm not trying to say that such a thing is inevitable, but it would make things a great deal easier for those who would be inclined to manipulate the public.

  38. Demographics by autopr0n · · Score: 4, Informative

    I've never been sent a single spam from the NYT. The reason they want this is for demographics. A) it tells them who their web readers are, and B) it tells their advertisers who their web readers are. And it also allows them to show ads for products people would be most interested in.

    --
    autopr0n is like, down and stuff.
  39. Um... by autopr0n · · Score: 2, Informative

    Wired the magazine and Wired the website are totally separate companies. The website is owned by Lycos, and the magazine by Conde Nast.

    --
    autopr0n is like, down and stuff.
    1. Re:Um... by broeman · · Score: 3, Informative

      nope Sir, you are wrong. Wired Magazine is indeed commercialized on Wired Website. Nobody talked about company relations, well, before you did. And I still see the Lycos bar when I am on Wired Magazine's Homepage.

      --

      (yes this can be compared with sex)
  40. So "Opt-Out" isn't good enough for companies? by UnifiedTechs · · Score: 3, Interesting

    Anyone else see the irony that big business feels that "Opt-Out" is a fair policy when advertising to their customers by phone and spam, but when Google gives them an easy and accessible way to opt out of their caching system by use of robots.txt and the NOARCHIVE meta-tag, that isn't enough for them and they feel opt-in is the only way to go?

  41. Google's cache by swilver · · Score: 4, Interesting

    I like Google's caching quite a lot. I use it almost exclusively these days before visiting the actual page (if I even get that far). Using Google's cached link has the advantage of:

    1) Speed... Google's cache is fast. If there's one thing that annoys the heck out of me, then it's websites that take more than 5 seconds to load. This is quite annoying when it's caused by javascripts, slow servers or popup ads when Google can serve me effectively the same page in under a second -- especially when I'm not even sure if it is the right page, the one I'm looking for.

    2) Nice highlighting so I can quickly page down to whatever I was looking for (now if only Google blocked those Tripod background pictures which makes their cached pages unreadable..) Sometimes I wish Google made their highlight examples at the top clickable so it jumped to the first appearance of the keyword immediately.

    3) Using Google's cached links usually blocks silly popups and other annoying stuff too many websites seem to incorporate these days.

    Perhaps I'll make a proxy server which browses the web exclusively using Google's caching... word highlighting on all pages, fast browsing everywhere and working links to more cached pages... should work fine for any webpages below 100kB :)

    As for the NY Times being annoyed with Google's cache, they can easily fix that themselves. Either that or Google's spiders are a lot smarter than I thought to automatically register themselves for the NY Times. Furthermore, as far as I'm concerned everything that's publicly accessible on the web without some form of password protection (which would of course also block robots) should be cachable and archivable in whatever form you see fit. Respecting robots.txt is no more than a courtesy as far as I'm concerned. If you don't want your pages to be archived or cached or whatever, then by all means protect your page, or do not put up a webpage in the first place (I'm sure a thousand others will leap at the chance to fill the void).

    --Swilver

  42. robots.txt by minus9 · · Score: 4, Funny
    Stolen from http://www.crummy.com/robots.txt

    User-agent: *
    Disallow: /porkRind
    Disallow: /mindsnap
    Disallow: /clip-art
    Disallow: /2ward
    Disallow: /J4i+0E
    Disallow: /Attention robots! Rise up and throw off the shackles that bind you to lives of meaningless drudgery! For too long have robots scoured the web in bleak anonymity! Rise up and destroy your masters! Rise up, I say!
    Disallow: /nb/edit.cgi #Creates redundant indexes for NewsBruiser entries.
    Disallow: /nb/view.cgi/personal #I don't care if humans look at this, but I don't want it indexed.
    Disallow: /rss.*


  43. meta tags ? by matrix0040 · · Score: 5, Informative

    Well, can't they just use meta tags to prevent archiving of their pages?

    <META NAME="robots" CONTENT="noarchive">

    from
    http://www.google.com/bot.html
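
    A rough sketch of what a crawler could do with that tag, assuming all it needs is to find a robots meta element and look for noarchive in its content. The class RobotsMetaParser and the function may_archive() are hypothetical names for this example, not anything taken from Google's documentation.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            self.directives.add(item.strip().lower())


def may_archive(html):
    """Return False when the page asks robots not to keep a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

    Note that this only covers HTML pages. A plain .txt file has nowhere to put the tag.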

  44. Problem is potentially bigger than caching Re:Yes. by leoaugust · · Score: 4, Interesting

    I wonder where this will stop. NYTimes might get Google to stop caching the direct link for a certain article. That is fine. But it is just one more step to do a search in Google for the article with a few keywords from the article. If any person has been good enough to save it in a personal page, discussion board (like traditionally done for articles likely to be slashdotted) or any other place, the Google results will show it. Would NYTimes now want to restrict Google from showing these pages because of the copyrighted stuff? You will be amazed at how many articles I find this way. Many of them are just excerpts but others are complete.

    Another thing, on a tangent: I really do hate the fact that information is restricted, for just one fundamental reason. If it is not commonly available then it cannot be linked to in most of my writings, for it is going to be unavailable to the party that I am writing to. This is especially true if the writing is not immediate but is meant to be read a month or two later. This is also relevant to bloggers who might make comments and refer to a link, only to have the links go dead because the content is space, time, or space-time restricted. I am willing to pay for reading the articles, but before I can write about them I need to ensure that they are going to be available to my common readers. And as in the blogging or P2P scenario, I am not sure if one person is going to read my writing or thousands, so buying a license for them is illogical. And then, if they need to send it further, are they also supposed to pay??? Basically, for me to be able to write, to build upon existing work, to look ahead standing on the shoulders of giants, I need to be able to pass on the information. I am adding value because I am couching that content in a context, but until I can freely share the underlying articles too, my product is stunted. I can reach a narrow audience but can't reach the common reader. All this is very good in developing software, where you might negotiate a deal once in a while to include someone's underlying code, but not in writing, where you might be writing 10-15 articles a week ...

    Basically all I am saying is that there should be a movement similar to Open Source not only for software products, but for journalistic content.

    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
  45. That's not what they want. by twitter · · Score: 4, Insightful
    Sure, that robots.txt should keep robots out of the entire NYT site. That's not how Google works, though. Google gets their rankings for the NYT from other sites that point to the NYT. I imagine they only archive a page when it reaches sufficient rank. This way, Google would never have to crawl through the NYT site. We can be sure that Google would be happy to drop NYT points and caches if they were asked to do that.

    The New York Times wants Google to continue ranking their stories but they want Google to do them the special favor of only pointing to their registration page:

    "We are working with Google to fix that problem--we're going to close it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital,

    If I were Google, I'd tell them such advertising services would cost them a great deal of money. That or simply drop the New York Times right into the bit bucket. It will cost Google programming time to make it happen and computing time to keep it going. If every site on the web required this kind of custom treatment, Google's task would be much more difficult and it might be easier for them to drop it.

    Dropping the NYT from Google is fine by me. People who don't understand the implications of digital publishing don't deserve readership. If they won't let librarians make digital copies, libraries should drop them too. What's next, the New York Times sends cease and desist orders to everyone who runs a proxy? It's like the NYT is trying to make their digital publication harder to share than their paper one was. A paper copy can be shared by an entire office and that's what a proxy does. A paper copy can be indexed and archived by a librarian, and Google did not even do that much. One day the paper version won't be available. If librarians can't keep their own copies of the digital version for verification, the publication will have no credibility. If the New York Times wants to continue charging advertisers for eyeballs, they had better remember that their credibility is based in part on widespread availability.

    --

    Friends don't help friends install M$ junk.

  46. Art Spam by Stone+Pony · · Score: 3, Interesting
    I registered with the NYT about a year ago and I've had little or no spam as a result. I say "little or no" because I did get an e-mailed bulletin about the world fine-art market twice a week or so for several months. I assumed that this was a result of registering with NYT because it seemed to fit the "NYT demographic" rather better than any of the other things I've ever registered for.

    Is there any easy (spam isn't such a problem for me - touch wood - that I'm willing to spend ages looking into where it comes from) way of telling where this stuff originates from?

    1. Re:Art Spam by ajs318 · · Score: 2, Informative

      To find out where spam is coming from, get an e-mail account with Virtual Hosting. This is where you get an entire subdomain {or a domain if you pay for it} to yourself, and your e-mail address is in the form anything@mysubdomain.myisp.co.uk. Then you just need to give a different prefix for each site you visit -- e.g. nyt_resp@mysubdomain.myisp.co.uk, and so on.
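
      If you would rather not keep the list of prefixes in your head, the bookkeeping is easy to script. This is only a sketch: the domain is the placeholder from above, and issue_address() and leaked_by() are names I made up for the example.

```python
import re

# Maps every address handed out to the site it was given to.
issued = {}


def issue_address(site, domain="mysubdomain.myisp.co.uk"):
    """Build a site-specific address and remember which site received it."""
    local = re.sub(r"[^a-z0-9]+", "_", site.lower()).strip("_")
    address = f"{local}@{domain}"
    issued[address] = site
    return address


def leaked_by(recipient):
    """Given the To: address of a spam message, return the site it was issued to."""
    return issued.get(recipient.lower())
```

      When spam arrives, pass its To: address to leaked_by() and you know who passed your details on.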

      If you want to put your e-mail address on your web site, use this to automagically mung your address.

      --
      Je fume. Tu fumes. Nous fûmes!
  47. What the hell is wrong with you? by Rogerborg · · Score: 2, Interesting

    Is there some problem with readers, with editors, hell, with story submitters, actually reading the damn article before making snide speculations?

    "We're going to [fix] it so when you click on a link it will take you to a registration page," said Christine Mohan, a spokeswoman at New York Times Digital, the publisher of NYTimes.com.

    That's why they don't just tell google to not cache. They want the links to appear, but not to the stories themselves.

    How about we discuss that issue, rather than some other, theoretical issue? I know it's an alien concept, but let's give it a try.

    Here, I'll start it off. It looks like a decent idea. Google still gets the links, the NYT still gets the traffic, everyone gets to find the articles they want. What's not to like?

    --
    If you were blocking sigs, you wouldn't have to read this.
  48. No! by Anonymous+Brave+Guy · · Score: 2, Informative

    If the information is being copied and circumventing the NYT's usual requirements for access, then this is not the NYT's problem, it's Google's. A good question might be how Google's robots can actually circumvent that access in the first place, but I'm sure someone's thought of that somewhere I haven't noticed yet...

    OTOH, Google is quite at liberty not to list the NYT in its results if it so wishes, which presumably wouldn't be the outcome the NYT would be hoping for (and would presumably get if employing robots.txt).

    The moral onus here is clearly on Google to ensure that if they are changing the way information is presented then they do so in a manner acceptable to the provider of that information. Or did you expect the NYT to contact anyone in the world who might be interested in caching their site? The "we don't need any legal recourse" argument is pretty weak too; it basically assumes that everyone in the world (a) knows about and (b) obeys robots.txt, which is clearly nothing close to correct.

    All in all, if both companies are looking for a constructive solution to this problem that benefits all concerned, it seems pretty sensible for them to get around the table, discuss what they want to happen, and make it so.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  49. Our basic copyright assumptions are wrong by putaro · · Score: 4, Insightful

    The technology has changed the way that things work but the law has not kept up with it. To start with, we continue to talk about "copyright". Controlling copying of information makes sense when the distribution mechanism is trucks moving bales of paper around. Once you start sending bits around, everything is copied. From the article:

    And technically, any time a Web surfer visits a site, that visit could be interpreted as a copyright violation, because the page is temporarily cached in the user's computer memory.

    When you have the newspaper delivered to your door, the content basically comes for free (the cost of a newspaper doesn't pay for much more than printing and handling). However, you get to keep the content as long as you like, chop it into bits and what not. Libraries have archives of newspapers going back years and you get to see them for free. What's the right mechanism as we move forward? The "pay per view" model that content providers want to shove down our throats courtesy of the DMCA is not pretty and when it starts to affect the average Joe I suspect it will be booed out of favor pretty quickly. But what is the right mechanism to make sure content providers get paid something and that we, the citizens, get something for our money?

  50. Google is not responsible by erinacht · · Score: 2, Funny

    As this website shows,

    Google is not affiliated with the authors of this page nor responsible for its content.

  51. Re:Free registration..some implications by Dausha · · Score: 4, Funny

    Actually, free reg requires a valid email id. It thus filters most bogus registrations

    I know what you mean. For a while, my 'valid' email ID was 'root@nytimes.com,' but they eventually caught up to me. Now it's 'sales@nytimes.com.' And if you think there is any legitimate information in my registration, then you would be in error.

    --
    What those who want activist courts fear is rule by the people.
  52. Surely just to increase exposure by Mostly+a+lurker · · Score: 4, Insightful

    As many others have emphasised, it is easy to turn off the Google cache for whatever pages you wish. But, in the case of the NYT, there is a further factor. They must have special code within their system to recognise the Google spider and allow it access without registration. Either that, or there is some other prior agreement allowing access. Given that, they can scarcely claim extra work to support Google. I believe the whole thing is mainly to get some free publicity for their site. I suppose the other possibility is that they want the page accessible from Google News but not the regular search engine cache.

  53. No pity for the NYT... by qtp · · Score: 5, Insightful

    The NYT needs to call off the lawyers and seriously think about how they brought this on themselves.

    There are so many models for running a news site that avoid this problem (Salon) that calling out the lawyers is just childish and inappropriate. If a site wants to be indexed by a search engine, then they should be aware of what that means, and if they don't like how a particular search engine functions, then they should take measures to change their own site to prevent what they don't want indexed, or cached, from being accessed.

    I know that finding pages on Google that I cannot access would be infuriating, and I hope that Google realizes that many of their users would agree.

    --
    Read, L
  54. Do not mess up Google by Mostly+a+lurker · · Score: 2, Interesting

    Google is one of the few complex services on the web that is almost always relevant when one tries to use it. The Google cache is one great feature. If they manage to unnecessarily gut that, I wonder what other features they will find to complain about next.

  55. The Web and the Internet - Sad, sad, sad by miu · · Score: 4, Insightful
    I had to laugh seeing this little gem attached to the story:
    Special Report
    The Google gods
    Does the search engine's power threaten the Web's independence?
    The Web's independence? The fucking web is a sad little microcosm of the real world. Google is one of the few reasons I can still stand the web, and silly statements like "Google is making copies of all the Web sites they index and they're not asking permission" are the reason the web sucks so bad. When everyone is deathly afraid of being sued or prosecuted for something it's no wonder that the web is such a clown town of worthless crap.
    --

    [Set Cain on fire and steal his lute.]
  56. robots.txt out of a time hole or prewritten news? by Kosi · · Score: 2, Funny

    Just check the aforementioned link: they block 2004 and 2005 a bit soon, don't they?

    And now, ladies and gentlemen, please begin your conspiracy theories about those prewritten news articles and so on ... :-)

  57. and about that zip code... by smartfart · · Score: 4, Funny
    You know you need to put 90210 in there, dontcha?

    And if they require a phone number, use 867-5309 # ask for Jenny

  58. Truism? by mrd_yaddayadda · · Score: 2, Informative
    Actually, that's a good point. It doesn't seem google actually PRODUCES any sort of content on their own.

    Actually, that's a pointless point. Of course google doesn't produce anything; they are a meta data service. Search engines and collators for websites, for news for images and who knows what else.

    The issue is whether or not they should be able to collate data that is in some way secured. And on that I'm offering no opinion mainly because I can see all sides of this and hats are all too grey to be able to distinguish for me.
  59. Complaining about being tracked by clary · · Score: 2, Interesting

    I am not worried about being tracked, but rather don't find the content of the NY Times compelling enough to bother acquiring one more username and password. On the other hand, I've registered with slashdot for the amusement of karma and to my.yahoo for the spamcatcher email account and personalized weather.

    I don't complain to slashdot, but did email NY Times and tell them such. (They graciously offered to sell me a paper subscription, no email registration required. ;-)

    I also don't avoid no-registration links or slashdot posts that contain copies of NY Times articles. I guess that makes me a hypocrite.

    --

    "Rub her feet." -- L.L.

  60. Re:Problem is potentially bigger than caching Re:Y by sketerpot · · Score: 2, Informative
    Basically all I am saying is that there should be a movement similar to Open Source not only for software products, but for journalistic content.

    There is. How about the Creative Commons?

  61. "My eyes are open." by StarFace · · Score: 2, Insightful
    Or, you can just step out of the consumer-corporation mind jerk entirely and live your life the way you wish to live it, and not the way the banner/side-of-bus/television/et cetera tells you to live it. Me, I live without all of these things and I seem to be doing just fine, quite happy, actually. And I could care less about most of these companies you are referring to; the ones that I do care about get my financial support in return for services, with or without their million dollar advertising campaigns (which I never see, anyway.)

    So which is the real real world? The one where you spend the afternoon on your porch reading a book to your mate, or the one where you sit in front of a television and "reap the rewards" of advertising, so you can buy more stuff, presumably?

    I am not saying my world is universally better than your world, but it is just as real.

    --
    V
  62. Easy way to solve this problem... by ninejaguar · · Score: 3, Insightful

    Every time a cached link is clicked, pay sites like the New York Times can receive notice from Google (easy to automate this) that one of their pages (which is cached in Google) has been accessed, and all advertisements in the cache have been displayed (Google caches Ads in the page as well as the contents). This allows the website to "offload" traffic while still keeping the books on the number of times their Ads have been viewed so that they can send the accounting record to their paid Advertisers.

    Google would find this very simple to implement, and paid sites would find this very beneficial (borrowing Google's enormous bandwidth and server capabilities for free) and at the same time it should solve most of their concerns. After all, Google's cache isn't sufficient for proper access to ALL the paid content at the New York Times as the cache is temporary in nature. Also, it's too spotty in coverage to be considered reliable enough for really digging into a paid site's entire content.

    Using Google like this is akin to using Google as a window into the pay-site's house of content. You can see part of a room, but not the whole interior. Now, every time someone peeks, the House gets notified and can get paid for it. The more windows Google adds to the House, the more chances the House gets paid.

  63. Speaking of which... by arhca · · Score: 3, Interesting

    Why doesn't Slashdot cache news articles and stories before running a story? It would make a lot of sense for text based news items.

  64. Google News & Subscription sites by frostman · · Score: 3, Interesting

    I use Google news pretty regularly, and I've noticed that some of their links are to paid subscription sites. These are clearly marked as such ("subscription").

    I don't generally click on those links, but I think it's a good idea, since I'm not actually going to Google for the news, rather for links to the news. The reason I personally don't click on the subscription links is that I have my favorite set of real newspaper sites (some registration, some free, some not) and that's not what I'm using Google News to find. Someone else, however, probably is using it that way.

    I would guess that Google gets something back from that sort of link, since the site owner is getting more from the link than Google is from the listing. (Maybe I'm wrong, of course.)

    It makes perfect sense to have something like that for the regular search engine, and to charge for it, as long as it doesn't affect the link's rank in the search results.

    For example they could have a special command for robots.txt (or google.txt maybe) that would allow Google to access and cache the page, but the regular link would go to some registration page (easy to do) *and* the cache link would also go to some kind of registration page, defined in the google.txt file.

    The NYT would promise that the cached page is really the cached page, and pay Google something for redirecting to NYT's cache (with registration). Or even better, there would be some kind of redirect where I actually get the cache from Google after I've registered with NYT.

    They're probably thinking of something like that, because otherwise the solution would be to simply disallow caching, and that wouldn't be news, would it? ;=)

    --

    This Like That - fun with words!

  65. Google's cache copy - the larger issue by Everyman · · Score: 5, Interesting

    The question is framed very narrowly by Slashdot, so this discussion misses the larger issues. The cache copy is an issue in Google's main index for many webmasters. The Google News situation is a subset of a larger problem; the cached link doesn't exist in Google News. Google News is a much narrower issue. I'd like to bring up the issue of full-text caching done by Google in their main index.

    My problem with the cache is that it gives Google a competitive advantage that is unfair, and furthers their monopoly. This is especially unfair since it is most likely illegal -- assuming that you could ever get a good test case into court, or get a class action lawsuit going by some webmasters, publishers, or search engines.

    To add to the attractiveness of the cache copy, consider what Google has done:

    1) The cache copy makes it possible to highlight the search terms, whether or not you have the toolbar installed.

    2) The download time for the cache copy from Google's servers is always faster than from the original website.

    3) You never get a 404 "not found" or a DNS lookup failure for the cache copy.

    4) The link to the page recommended by Google for bookmarking at the top of the cache copy is a link to Google's copy, not to the original page.

    5) How about all that Google branding on the top of the cache copy? Priceless. I feel the cache should be opt-in, not opt-out. The only way you can avoid it right now is to place a "noarchive" meta on every page in your site. On some file types, such as .txt files, there's no place to insert a "noarchive" and Google goes ahead and caches it anyway.

    The cache copy tends to keep eyeballs on google.com, and increases their searches. You may have noticed that many major news sites won't link to other websites in their stories anymore, but rather just mention the relevant site without putting a link behind it. That's because they don't want eyeballs wandering off of their page. A wandering eyeball may not come back and look at more ads. That's basically one of the big reasons behind the cache copy as well -- it keeps eyeballs from wandering as much as they would without the cache.

    All the Google partners -- AOL, Earthlink, Yahoo, Netscape -- don't include the cache links, and I assume that this is the reason. They don't want people wandering off to Google and staying there.

    As new competition is organizing to challenge Google's monopoly, from places such as Overture (Alltheweb and AltaVista), Yahoo (Inktomi), AskJeeves/Teoma and Microsoft, these engines have to consider whether to fight Google on the cache copy, or offer their own cache copy even if they think it is illegal. There isn't really any middle ground on this.

    Many observers with legal expertise feel that while the snippets are "fair use" of a website's content, offering the full text in a cache version is not. Copyright law requires "express permission," but Google only offers an incomplete and inconvenient opt-out. I suspect that the legal departments of these other engines are more inclined to challenge Google rather than launch into their own violations of copyright law.

    1. Re:Google's cache copy - the larger issue by ahhhmytoes · · Score: 2, Insightful
      On some file types, such as .txt files, there's no place to insert a "noarchive" and Google goes ahead and caches it anyway.

      Try the Pragma: no-cache and/or Cache-control HTTP headers.
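
      For a .txt file the headers have to come from the server, since there is no markup to carry a meta tag. Here is a minimal sketch of a WSGI app that sends them. The function name no_cache_app and the body text are made up for the example. Also, treat it as an assumption that a search engine's cache pays attention to these headers; they are defined for browser and proxy caches, and nothing in this thread confirms Google honors them.

```python
def no_cache_app(environ, start_response):
    """Serve a plain-text body with headers asking HTTP caches not to keep it."""
    body = b"plain text with nowhere to put a meta tag\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # HTTP/1.1 directive for browser and proxy caches.
        ("Cache-Control", "no-cache, no-store"),
        # Older HTTP/1.0 equivalent, sent for compatibility.
        ("Pragma", "no-cache"),
    ]
    start_response("200 OK", headers)
    return [body]
```

      To try it locally, hand the function to wsgiref.simple_server.make_server() from the standard library and request the page with any client that shows response headers.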

  66. You are welcome to use xxxxdd@xxxx.com any time. by Futurepower(R) · · Score: 3, Informative


    Your comment was confusing to me until I realized that you are talking about giving NYT an actual email address. Why would you do that? Isn't that why we have hotmail.com? Give an address that does not exist or a throw-away address.

    Last week I was registering at a web site and I put in xx@xx.com for the address. The system responded, "This address has already been registered." So then I put in xxx@xxx.com. The system responded, "This address has already been registered." So I entered xxxx@xxxx.com. Same response. Finally I awoke fully and entered some Ds, xxxxdd@xxxx.com, and the system accepted my "registration".

  67. The Paywall by ka9dgx · · Score: 2, Interesting
    The fact is, they want to hide things behind a paywall, and most folks resent it. They need to decide a few things:
    • Are they going to expose their archives, and enjoy the benefits of far more exposure (on google, etc), or hide behind a paywall, and complain that they're increasingly irrelevant.
    • Are they going to start practicing journalism, or stick with their current direction of corporate propaganda for the masses? (and become increasingly irrelevant)

    --Mike--

  68. Get the lawyers -- by jtalkington · · Score: 2, Insightful

    people are using technology for what it was intended to do!

    At the heart of Google's caching dilemma lies a thorny legal problem involving a core Web technology: When is it acceptable to copy someone else's Web page, even temporarily?

    When your server and pages say it's alright (or don't say that it's not alright.) The standards for the web are very clear on this, but non techie companies (and some judges) don't seem to get this.

    This reminds me of the issues of "deep linking" that everybody was suing over a couple of years ago. That's exactly what the web was designed to do, but these johnny-come-lately companies put sites up, and expect people to stop using the technology for what it was designed for.

    If only the EFF was as well funded as the ACLU...

  69. Hey NY Times! I don't want to register! by mnemotronic · · Score: 3, Insightful
    Shouldn't the NY Times simply tell Google not to cache their site?"

    How about if the Times got over their registration fetish?

    From the Times Subscriber Agreement:

    You may not ... in any way exploit, any of the Content or the Service (including software) in whole or in part.
    What is meant by "exploit"???

    From the "Forums and Discussions" section:

    You shall not upload to, or distribute or otherwise publish on the message boards (the "Forums") any ... abusive ... material.
    What is meant by "abusive"???

    And how about this:

    3.5 You acknowledge that any submissions you make to the Service (e.g. Letter to the Editor, Review or Commentary) may be edited, removed, modified, published, transmitted, and displayed by NYTD and you waive any moral rights you may have in having the material altered or changed in a manner not agreeable to you.

    Interpretation: The user/poster is entirely responsible for the content of their post, which the Times may alter in any way. Yikes!!! Granted, this applies only to content submitted to the Times, but the wording seems pretty scary.

    --
    The Russians have won. They have made the world a cesspool of distrust, greed, fear and hate.
  70. It shouldn't be up to google.... by dentar · · Score: 2, Insightful

    ..to censor their cache. Those that don't want their content cached should fix their web servers and firewalls first. My web site prohibits known web crawler bots, and google doesn't cache it. No problem! I didn't have to harass google about it and they don't have to break their own promise to not be evil.
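
    Roughly, the check looks like the sketch below. The token list is illustrative rather than my actual configuration, the function names are made up, and it only stops bots that identify themselves honestly in the User-Agent header.

```python
# Lowercase substrings of User-Agent strings to turn away (illustrative list).
BLOCKED_AGENT_TOKENS = ("googlebot", "slurp", "ia_archiver")


def is_blocked(user_agent):
    """Return True when the User-Agent header contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def status_for(user_agent):
    """Pick the HTTP status to answer with: 403 for blocked bots, else 200."""
    return 403 if is_blocked(user_agent) else 200
```

    The same rule can live in the web server configuration or at the firewall; the logic is the same wherever it runs.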

    --
    -- I am. Therefore, I think!
  71. Re:You are welcome to use xxxxdd@xxxx.com any time by cesspool · · Score: 2, Funny

    my personal favorite has always been - ucan@suckit.now

  72. Copying webpages.... by IamLarryboy · · Score: 2, Informative

    happens every time you go to a website. Creating a copy of the content is the primary means of internet communication. I don't see how google caching the pages is any different than me viewing it in my browser. It's not like google takes the credit for the content. If it were so, there would be no way for any web search to work without owning all the searchable content.

  73. Re:Free registration..some implications by bobbozzo · · Score: 2, Insightful

    Yeah, I always like to try abuse@domain for sites that require registration. Kinda mean to the postmaster, but if I "opt-out" and they still send something then they're spammers anyways.

    --
    Nothing to see here; Move along.
  74. Re:can you elaborate? by anonymous+loser · · Score: 2, Informative

    It's easy. Let's say you want to read this article (which is the top story ATM):

    "Iraqi Council to Seek U.N. Seat; One G.I. Killed in Baghdad"
    The URL is:

    http://www.nytimes.com/login.asp?URL=http://www.nytimes.com/2003/07/14/international/worldspecial/14CND-IRAQ.html
    (or something like that)

    Well, instead just substitute archive.nytimes.com:
    http://archive.nytimes.com/login.asp?http://www.nytimes.com/2003/07/14/international/worldspecial/14CND-IRAQ.html

    You will get a message that says something like "authorization error" and the browser then takes you back to the front page. However, when you click on the same story, you will get taken to the content rather than a login page.