Caching Content and the Shrinking Web?
"I run a small discussion-oriented site patterned after Slashdot; small story blurbs and discussion center around links to external content. From time to time we post our own content, but the vast majority involves links to articles on other sites. This structure obviously relies heavily on the external pages being available for our visitors so they can understand the issue or viewpoint being highlighted.
Just before the new year, I took a look back at story entries that had been posted throughout 2002 and found it interesting to note that a large portion of the linked content was no longer available/had moved/etc. In the short term, this is not an issue; most outside material tends to remain available for the length of an active discussion. The problem I see is visitors coming to the site by way of search engines to stories whose linked content no longer exists. Without the background provided by the referenced story link, the discussion or quick blurb may not make sense or may not fulfill the request that brought the visitor to us.
I know I am not alone in this quandary and that others must have run into this before. While I respect the copyright of the external content providers and do not wish to get into the whole issue of lost advertising revenue for them if I were to cache a local copy, I'm curious what other users are doing to mitigate this problem."
it chooses who stays and who will go... google
Here is the content I shamelessly mirrored without permission from the original author. Now all those meta-karma-whore flamers can jump up to my ass and sue me for plagiarism.
Caching Content and the Shrinking Web?
Posted by Cliff on 02:55 AM -- Friday March 14 2003
from the keeping-the-context-intact dept.
kill-hup asks: "I know the issue of caching linked pages has been discussed many times here on Slashdot, but the majority of those discussions centered around the 'Slashdot Effect' knocking remote content servers off-line. How does the ethic/legality issue change, if any, when we're talking about information that once was available but now has moved or disappeared from the provider's site?"
"I run a small discussion-oriented site patterned after Slashdot; small story blurbs and discussion center around links to external content. From time to time we post our own content, but the vast majority involves links to articles on other sites. This structure obviously relies heavily on the external pages being available for our visitors so they can understand the issue or viewpoint being highlighted.
Just before the new year, I took a look back at story entries that had been posted throughout 2002 and found it interesting to note that a large portion of the linked content was no longer available/had moved/etc. In the short term, this is not an issue; most outside material tends to remain available for the length of an active discussion. The problem I see is visitors coming to the site by way of search engines to stories whose linked content no longer exists. Without the background provided by the referenced story link, the discussion or quick blurb may not make sense or may not fulfill the request that brought the visitor to us.
I know I am not alone in this quandary and that others must have run into this before. While I respect the copyright of the external content providers and do not wish to get into the whole issue of lost advertising revenue for them if I were to cache a local copy, I'm curious what other users are doing to mitigate this problem."
Most content is removed as a result of being slashdotted: the company providing the web hosting service decides it is better to remove it and cancel the associated accounts to avoid an excessive bandwidth bill next month.
:)
If you can see this, you can realize that we are among the bloody murderers who killed that content.
Ethically, we need to keep the channels of knowledge open. If it was public knowledge at one time, it must remain so. Otherwise, we begin to foster an Orwellian world where any number of Ministries of Truth can hide history and rewrite it as needed. A web page is a record of the world at a given time. Just as libraries keep old journals for reference, we need to be able to reference the web of the past.
Legally, I fear that litigation like Scientology vs. the Wayback Machine will begin to erode this protection. Having a monopoly on knowledge gives an entity the power to bring the masses into submission. We must let truth prevail.
Donate background CPU time to fight cancer.
If your discussion were around the coffee table about a magazine article, and you were writing down your notes on paper and then paper-clipping them to the article (cut out from the magazine, of course) and storing them away in a binder, would you have any qualms about this at all? At ALL?
To make the case even more clear-cut, imagine if the magazine you are cutting from was completely free to the readers and got all their revenue from ads sold.
Would you even care if you cut the ads out alongside the article? No, you would probably even go out of your way to cut them OUT in the real-world example.
Why is it different when it is on the internet?
"Your superior intellect is no match for our puny weapons!"
I've looked up my past personal sites, and realize how much they suck. Including the brief period where I was enamoured with IE 4.0 (MS had me on their free CD circuit).
As far as the commercial sites go, I think, inasmuch as bits and pieces are used as "fair use," and people aren't selling things that belong to someone else, I don't see a problem.
One of the more interesting things I've seen is what Art Bell and his webmaster did when Bell "retired" from broadcasting (let's see how long this one lasts...hmmph). They put out a CD that had some neat extra features, and authorization methods which allow you to access the website through the webmaster's site. Pretty cool, IMHO.
Mirror in case it's slashdotted and removed (Score:1)
by jsse (254124) on Friday March 14, @12:09AM (#5509874)
(http://slashdot.org/)
Once it's posted, it's public information. Sites that try to prevent others from caching their pages are living in an unrealistic dreamworld that doesn't include ISP proxies, browser caches, and multiple hops through routers.
In other words, they're morons. Just cache the data privately and ignore what you think the rest of the world thinks about it.
I think there is a deeper problem being alluded to here, that of loss of intellectual property. Copyright, as is often pointed out, has two sides: the copyright owner gets to exercise control over their asset, but in the end that asset becomes public property.
It has long been law and/or practice in most countries that in order to publish a book (or any copyrightable material) a copy must be lodged with the state archive (in the US, the Library of Congress). Making a commercial gain from a work usually requires publication, which means that most works are available in such libraries.
But the web changes that. Publication becomes a lot more informal, and there is no requirement or even encouragement to archive. How, in such a scenario, can we protect against publicly accessible information disappearing forever? This material has been published and, at some point, the copyright will expire; it should fall into the public domain. But it most likely won't: over time it will be taken away, and never seen again.
Consider the loss we would face if a valuable repository like Slashdot vanished. Deride it all you like - this is nevertheless a meeting place of (amongst others) some very experienced people with insightful comments, leading to a wealth of information gathered on topics that are discussed. It is not at all uncommon to find a Slashdot discussion when searching for technical information.
archive.org is a start in the process of archiving to prevent this sort of loss -- but how can we move to tackle the problem in a proactive manner?
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
Something, in this case a webpage, once made public, is likely to be copied to some sort of personal space that is not under the control of the publisher, no matter how much they protest their copyright.
Once in this personal space however, there is no obligation to share. And so information pools in the corners of the 'net unable to benefit any but a select/fortunate few for fear of persecution.
Bottom line, if you don't want it preserved for posterity and must maintain rigid control, don't publish publicly. If you do publish, expect that it is now copyrighted public domain material.
So why not bring the cache/mirror out into the open? It happens anyway.
So much to do, so little bandwidth.
--
Try Mozilla
I'm no lawyer, but I think it's ok to copy content from a news or opinion site as long as you cite your source. In other words, I *think* you're on solid ground if you copy the entire text of a news article and append the date and the place you copied it from.
There are a couple of reasons why an author might have a problem with you doing this: Firstly, if you draw customers (and therefore ad revenue) away from their site, they won't like it. So, what I suggest is that, at the time you open a discussion thread, you cache the article but don't link to your cache at that time. Link to the original. Later on, if the article becomes unavailable, you can add a link to the cache (I'd keep the original link too though).
Secondly, if you surround the cached copy with content from your own site - if you put your own banner ads at the top or your site's menus along the sides etc. - if you do that, you make it seem like the author wrote the article for you, or gave you permission to publish it. You make it seem like you have some relationship with the author and that just isn't true. So, I'd suggest that the cached copy open in its own window and contain nothing but the article text.
I think these two suggestions are just common courtesy and journalistic integrity.
If the author still demands that the cached content be removed, I think you should take it down. In its place, you could put a report or review of the article. You can't copy directly from the article as that's plagiarism. But you can quote the important lines and cite it. Think of it as your own journalistic report. If they still have a problem with that, you should tell them "the content of my site accurately reflects my opinions regarding an article you published and constitutes fair use of that article." If they persist beyond that - get a lawyer and counter sue for harassment.
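For what it's worth, the suggestions above (keep a private copy, link only to the original while it is up, show the copy bare with its source and date, and pull it if the author asks) could be sketched roughly like this in Python. Everything here is an assumption for illustration: `CachedArticle`, `links_to_show`, `render_cached_copy`, and the `/cache?src=` URL scheme are made-up names, and checking whether the original is still reachable is left to the caller.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class CachedArticle:
    """A privately stored copy of a linked article (hypothetical structure)."""
    source_url: str                   # where the copy was taken from
    copied_on: date                   # when the copy was taken
    text: str                         # article text only: no banners or menus
    takedown_requested: bool = False  # set if the author asks for removal


def cache_url(article: CachedArticle) -> str:
    """Build a made-up local URL for the cached copy."""
    return "/cache?src=" + article.source_url


def links_to_show(article: CachedArticle, original_is_reachable: bool) -> List[str]:
    """Return the links a story page should display.

    The original link is always kept. The cache link is added only when the
    original is gone and the author has not asked for the copy to be removed.
    """
    links = [article.source_url]
    if not original_is_reachable and not article.takedown_requested:
        links.append(cache_url(article))
    return links


def render_cached_copy(article: CachedArticle) -> str:
    """Render the bare copy: a citation line plus the text, nothing else."""
    citation = f"Copied from {article.source_url} on {article.copied_on.isoformat()}"
    return citation + "\n\n" + article.text
```

The point of keeping `links_to_show` separate from storage is that the copy exists from day one, but visitors never see it unless the original has gone away.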
Either you accept the missing articles (bad choice) or you cache them.
The answer seems pretty clear cut to me. Google does caching well, so I'd just copy them. Or you could even just link to the Google cache, but that could still change.
A Multiplayer Strategy Game for Mac OS X, Windows, and Linux
Hypertextual information, posted publicly once, can and should always be preserved, especially if it is relevant to another story, as links are used as jump-points here at Slashdot.
However, because this is hypertext, another procedure needs to be followed: Content needs to be maintained. Because of the fluid nature of the web, which makes the link possible in the first place, some special actions (i.e., actions not taken with archival of books, magazines, newspapers, etc.) need to be taken.
Here, assuming I had ultimate control over the whole thing, is what I would do:
Auto-caching any 'all rights reserved' site to prevent the Slashdot effect isn't OK, unless you have the permission of the owner.
The reason for this is that the owner has put up the information with the expectation that the content will be viewed on his site but with the realization that anyone may link to it.
To undercut the owner's expectation of the content being their exclusive contribution to the web isn't OK. To link to a cache instead of the original document is, thus, not OK.
However, Slashdot (or whoever) may maintain a copy of the document on their computer for future use, in case the document is removed.
Which means:
If either the document is inaccessible (because of the Slashdot effect or because the document was taken down) then the cached document should be provided on Slashdot's server.
But the author should be contacted, both to inform them of the action and to find out why the document was taken down (inaccuracy?) and whether the owner can provide, or has provided, another copy of the document (perhaps revised). If there is an inaccuracy and the document has been permanently removed, Slashdot should continue the caching but note the situation and attempt to correct any errors. If the document is at another URI, Slashdot should remove the cached document and link to the new URI.
A special situation might arise where a revised document is at a new URI; in this case Slashdot may provide a cache of the original and also link to the revised document. This would both provide a way to see the new, accurate, revised document and to see what it was exactly that hundreds of Slashdotters were posting to (since quotations might have been extracted verbatim and used as a jump point).
However, users should take note of the copyright statement: if something is public domain or under a looser-than-typical license that suggests the author wouldn't mind wholesale caching, then cache the document from the get-go, but it would be appropriate to fully cite the author and the URI from which the cache was taken.
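Boiled down, the rules above amount to a small decision table. Here is a rough Python sketch of it; the `Action` names, the `decide` function, and its parameters are all hypothetical, and judging whether a license is "looser than typical" or whether a relocated document was revised is left to a human.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    LINK_ORIGINAL = "link to the original; keep a private copy only"
    SERVE_CACHE = "serve the cached copy and contact the author"
    LINK_NEW_URI = "remove the cached copy and link to the new URI"
    CACHE_AND_REVISED = "serve the cached original and link to the revised document"
    CACHE_FROM_START = "cache from the get-go, citing the author and source URI"


def decide(permissive_license: bool,
           original_reachable: bool,
           new_uri: Optional[str] = None,
           revised: bool = False) -> Action:
    """Pick what to show for one linked document."""
    # Public domain or a looser-than-typical license: cache up front.
    if permissive_license:
        return Action.CACHE_FROM_START
    # 'All rights reserved' and still online: never link to the cache.
    if original_reachable:
        return Action.LINK_ORIGINAL
    # Gone from the old address, but the owner has put it somewhere else.
    if new_uri is not None:
        return Action.CACHE_AND_REVISED if revised else Action.LINK_NEW_URI
    # Slashdotted or taken down, with no replacement known.
    return Action.SERVE_CACHE
```

For example, `decide(False, False)` gives `Action.SERVE_CACHE`, which is the "document is inaccessible" case: serve the copy and write to the author.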
I've always wondered how the /. effect is different to spam.
Nobody opts into the /. effect but they still receive it, so why are people up in arms about nobody opting into spam but still receiving it?
Both claim to be beneficial to the, shall we say, victim. "Information about special offers is useful", "They get more people looking at their banners".
Both use up bandwidth and cause charges to the victim.
Spam is often redundant - "I've seen this bloody spam a hundred times" - and we all know how redundant slashdot can be...
Both can be defended by saying that if you publish your address (site or email) then people can use it.