Archiving Web Pages - Legal or Illegal?

It SHOULD be legal by Anonymous Coward · 2003-06-30 08:00 · Score: 4, Interesting

Well, it should be legal/allowed. If you don't want it read and archived, don't put it on the Web.

Everything should go, except for things like malicious alteration and theft (taking stuff and claiming it is yours)

Re:It SHOULD be legal by lightspawn · 2003-06-30 08:39 · Score: 5, Interesting

Well, it should be legal/allowed. If you don't want it read and archived, don't put it on the Web.

You know, I've been wondering about Java/Shockwave games. Certainly most kids would love a CD full of those games, and many companies have many different games online which mostly disappear a few months later.

Is anybody archiving these? Do we need to start?

Would the companies object?

You can play The Hitchhiker's Guide to the Galaxy on Douglas Adams' web site. As it happens, if you know what you're doing you can also download the .z5 file and play it offline on any zip interpreter. Would the copyright owners object to it? I own that Infocom 33-game collection and all 5 books; the reason the game wasn't included in the collection is copyright hassles. Am I "entitled" to play it offline?

This ties in to today's "is ROM collecting wrong" story, except in this case you're actually offered the games, under mostly unclear terms.

RTFF by kalidasa · 2003-06-30 08:04 · Score: 5, Informative

Archive.org FAQ

How can I remove my site's pages from the Wayback Machine?
The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.
See our exclusion policy.
You can find exclusion directions at exclude.php. If you cannot place the robots.txt file, opt not to, or have further questions, email wayback2@archive.org.

In other words, by your NOT including a robots.txt file, you are implicitly granting them permission to cache your content. Also, the content is cached as it was published, complete with the appropriate markings, and is only publicly accessible content, so you'd be hard pressed to argue there is any economic harm from the caching, which means there would likely be no damages from a successful copyright suit, which means a copyright suit would be pretty damned unlikely.

IANAL.
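
For readers unfamiliar with the mechanism the FAQ describes, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page. The robots.txt content, the example.com URLs and the helper name may_fetch are all hypothetical; "ia_archiver" is assumed here to be the user-agent name the Archive's exclusion directions refer to, and nothing below is the Archive's actual crawler code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. The "ia_archiver"
# user-agent name is an assumption for illustration, not quoted from the FAQ.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for url in ("http://example.com/index.html",
                    "http://example.com/private/notes.html"):
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

Note that this only shows the opt-out side: a crawler that never reads robots.txt is not stopped by it, which is why other posters suggest a password for anything that should stay private.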

Re:RTFF by jgoemat · 2003-06-30 08:40 · Score: 2, Insightful

No, but you do have a door. People are free to drive by your house and take a picture of it, or anything else out in public view. That reminds me of the girl that lost the lawsuit against Girls Gone Wild. If someone doesn't want their web page to be archived or cached, they can "put up a door" by using "robots.txt". If they really don't want to let the public at large see it, lock the door by protecting the content with a password. If they want to make absolutely sure no one sees it that they don't specifically show it to, they should save it as an HTML file on their personal computer and not even publish it on the web.

Re:RTFF by sir_cello · 2003-06-30 10:29 · Score: 2, Insightful

You don't properly understand the legal process.

In a copyright case, the courts first establish whether infringement has taken place, and this is determined irrespective of economic issues. It is determined purely on issues of subsistence, ownership, duration, etc - in terms of the statutory provisions and the existing case law. It is only then that exceptions (such as fair use, and specific exemptions - say - for public archives and libraries) are considered.

Then, finally, when remedies are considered (e.g. damages), the economic harm is taken into account. No damages may be awarded if there is no economic harm, but you still have the right to prevent the party from using an infringing copy of your work.

This is because copyright is a right on the work conferred to you. You can choose how you exercise that right, and that may include you refusing to allow others to use your work even in situations where it does no economic or moral harm to you (I mean, basically, you have the right and you can do damn well what you like with it!). There are of course some "essential facility" copyright cases where courts have ruled that an owner must license or allow use of a work, but these do not come along often (e.g. the Magill case in the EU).

You can argue that this is not a good way to do it, but the facts are that this is how it works now, and in terms of new technologies such as the Internet, it is not likely to change immediately.

Re:RTFF by ScuzzMonkey · 2003-07-01 02:52 · Score: 2, Interesting

In this case, the first remedy is provided by the potential violator...

Yes, but it places the burden in the wrong place and so is not likely to be considered an adequate remedy by the courts. More properly, the violator should be seeking permission prior to re-distributing the content, rather than essentially saying to the copyright holder "Stop me before I copy again!"

I'm not sure I think that caching sites should be subject to traditional copyright law--it has some nasty implications for anyone who cuts traffic loads using a proxy server (insert humorous image of AOL Time Warner suing themselves for caching their own content) and really strikes me as yet another area where technology outstrips law, but if they are subject to it, their chosen remedy isn't likely to hold much water.

--
No relation to Happy Monkey

My 9/11 Archive by limekiller4 · 2003-06-30 08:05 · Score: 4, Interesting

On the day of 9/11, I began to think that maybe a lot of things would be online that would disappear on the next update, forever. We tend to think of 1880 newspaper clippings as being perishable, not online media, but the opposite is true. So all day on 9/11 I archived news sites and about two hundred blogs using "wget -p".

Over the next week I archived some 4,600 blogs. They've kind of been sitting around waiting for me to weed through and organize. I've also been wgetting 30 or so large news sites' front page every 15 minutes or so on the hunch that I'll grab something emerging even if I'm AFK. Well ...what can I do with this data?

The answer(s) to this question will definitely be of use to me. Thanks for asking it. Slash, thanks for posting it.
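
As a rough illustration of the kind of periodic front-page capture described above, here is a minimal sketch in Python rather than wget. The function names, the archive directory layout and the timestamp format are all hypothetical, and unlike "wget -p" this fetches only the HTML and not images or other page requisites.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def snapshot_path(root: Path, url: str, when: time.struct_time) -> Path:
    """Build a per-host, timestamped file path for one capture (hypothetical layout)."""
    host = urlparse(url).netloc or "unknown-host"
    stamp = time.strftime("%Y%m%dT%H%M%SZ", when)
    return root / host / f"{stamp}.html"


def save_snapshot(root: Path, url: str) -> Path:
    """Fetch url once and store the raw bytes under a UTC-timestamped name."""
    target = snapshot_path(root, url, time.gmtime())
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target


def capture_forever(root: Path, urls: list, interval_seconds: int = 15 * 60) -> None:
    """Capture every URL, then sleep; failures on one site do not stop the rest."""
    while True:
        for url in urls:
            try:
                save_snapshot(root, url)
            except OSError as error:  # URLError and timeouts are OSError subclasses
                print(f"skipped {url}: {error}")
        time.sleep(interval_seconds)
```

Keeping the capture time in the file name means each run adds a new file instead of overwriting the last one, which is what makes the collection usable as a record later.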

--
My .02,
Limekiller

An idea by revmoo · 2003-06-30 08:06 · Score: 4, Insightful

Here's a thought, a rather complicated one, but I think it just might do the trick...

DON'T POST THINGS YOU DON'T WANT PEOPLE TO SEE ON A PUBLIC NETWORK.

It's quite simple really.

--
I would expect such blatant racism on Fark, but on Slashdot? Mods please ban this asshole.

It might be useful to note... by stienman · 2003-06-30 08:10 · Score: 3, Informative

It might be useful to note that the archive servers are located outside the US, and that they act on requests to have information and websites removed from their archive. (IIRC). I would state that the Archive serves a compelling public interest, both in the sense of free speech, and in the basic idea of keeping a history or record of the internet. The archive is a museum of sorts.

Google, on the other hand, is gathering data for its search engine, and, of necessity, must have what essentially amounts to a copy of each web page in its stores in order to provide this service. If one does not want to have their data in Google, they simply use robots.txt, and Google does not spider, cache, or store any data from that site if robots.txt is filled out. However, the site owner also denies themselves the ability to be listed, for 'free', in Google's search pages. This could be thought of as the cost of being listed.

So I don't think either of those two situations have any problems defending themselves. An anonymizer could also be seen as providing a useful, protected service. An anonymizer is nothing more than a proxy service, and many ISPs use proxies now, not to mention caches and many other tools that store website information or meta information without notifying or requesting explicit permission to do so - they request implicit permission by sending a GET command.

-Adam

Re:It might be useful to note... by simoniker · 2003-06-30 09:42 · Score: 2, Informative

Actually, the Internet Archive's main Wayback Machine servers are located in a co-location center in San Francisco, so it's not correct to say they're located outside the US. There is a mirror of the Archive's web content at the Library of Alexandria in Egypt, however - maybe that's what you're thinking of?

In any case, the Archive's work with the Library of Congress and, increasingly, national libraries who want to archive the Web content of their countries, proves that the establishment also thinks Web archiving is a vital thing to do for posterity. But the rights issues are definitely tricky.

Email? by Anonymous Coward · 2003-06-30 08:10 · Score: 2, Funny

We do not accept email from lawyers as a legitimate form of communication.

Email from lawyers is /dev/null'd.
As for the waking up in the middle of the night...
Um, turn off the ringer? Stop sleeping in the NOC? Maybe invest in a second phone line for your business instead of using mom's POTS line.

Be Happy by Apreche · 2003-06-30 08:19 · Score: 2, Insightful

I'd be damn happy if someone made backups and mirrors of a site I made. People will visit my site without using bandwidth I pay for. Also, if disaster strikes I can get my site back because someone else was kind enough to back me up. The more the merrier.

--
The GeekNights podcast is going strong. Listen!

Honestly... by lptport1 · 2003-06-30 09:04 · Score: 2, Informative

This sounds sort of cynical to me, but it strikes me that the people who might be concerned about that don't comprehend the word "cache" and therefore never click on that link in the search results...

Thus, never discovering that their site has been archived somewhere else. That, and Google has a rather chunky disclaimer-type-deal at the top--I'm sure it's in response to just that behaviour.

*copy* right by ccady · 2003-06-30 09:06 · Score: 4, Interesting

(FWIW, IANAL) Web site content is copyrighted. Therefore, you have a right to make your own personal copy, and backup copies, but it is not legal to redistribute those copies without the site owner's permission. I cannot imagine that the Wayback machine or the Google cache is legal. They are blatantly disregarding the site owners' copyright.

That said, I think the law should be changed or at least clarified, because it is patently (pun intended) obvious that those services are doing a vast social good, and should be encouraged.

--
J'aime mieux les méchants que les imbéciles, parce qu'ils se reposent. -- Alexandre Dumas

Re:*copy* right by stanwirth · 2003-06-30 09:19 · Score: 2, Interesting

Web site content is copyrighted. Therefore, you have a right to make your own personal copy, and backup copies, but it is not legal to redistribute those copies without the site owner's permission. I cannot imagine that the Wayback machine or the Google cache is legal. They are blatantly disregarding the site owners' copyright.

That would imply that every ISP running a public squid cache is breaking the law, and Akamai's entire business model is based on illegal content-smuggling. I really don't think so!

Re:*copy* right by limekiller4 · 2003-06-30 09:45 · Score: 2, Informative

stanwirth writes:
"...and Akamai's entire business model is based on illegal content-smuggling. I really don't think so!"

Akamai caches sites of people who pay them to cache them, so that would be one hell of a lawsuit. I know this because I worked for them for a few years.

--
My .02,
Limekiller

Re:*copy* right by anthony_dipierro · 2003-06-30 09:50 · Score: 2, Informative

(FWIW, IANAL)

Obviously.

Re:*copy* right by SeanAhern · 2003-06-30 10:03 · Score: 4, Informative

Mod parent up! This link to the US Code is very useful in this context.

Heck, it's so useful that I'm going to quote some of it here:

TITLE 17 > CHAPTER 5 > Sec. 512.

Sec. 512. - Limitations on liability relating to material online

(a) Transitory Digital Network Communications. -

A service provider shall not be liable for monetary relief, or, except as provided in subsection (j), for injunctive or other equitable relief, for infringement of copyright by reason of the provider's transmitting, routing, or providing connections for, material through a system or network controlled or operated by or for the service provider, or by reason of the intermediate and transient storage of that material in the course of such transmitting, routing, or providing connections, if -

(1) the transmission of the material was initiated by or at the direction of a person other than the service provider;

(2) the transmission, routing, provision of connections, or storage is carried out through an automatic technical process without selection of the material by the service provider;

(3) the service provider does not select the recipients of the material except as an automatic response to the request of another person;

(4) no copy of the material made by the service provider in the course of such intermediate or transient storage is maintained on the system or network in a manner ordinarily accessible to anyone other than anticipated recipients, and no such copy is maintained on the system or network in a manner ordinarily accessible to such anticipated recipients for a longer period than is reasonably necessary for the transmission, routing, or provision of connections; and

(5) the material is transmitted through the system or network without modification of its content.

(b) System Caching. -

(1) Limitation on liability. -

A service provider shall not be liable for monetary relief, or, except as provided in subsection (j), for injunctive or other equitable relief, for infringement of copyright by reason of the intermediate and temporary storage of material on a system or network controlled or operated by or for the service provider in a case in which -

(A) the material is made available online by a person other than the service provider;

(B) the material is transmitted from the person described in subparagraph (A) through the system or network to a person other than the person described in subparagraph (A) at the direction of that other person; and

(C) the storage is carried out through an automatic technical process for the purpose of making the material available to users of the system or network who, after the material is transmitted as described in subparagraph (B), request access to the material from the person described in subparagraph (A),

if the conditions set forth in paragraph (2) are met.

(2) Conditions. -

The conditions referred to in paragraph (1) are that -

(A) the material described in paragraph (1) is transmitted to the subsequent users described in paragraph (1)(C) without modification to its content from the manner in which the material was transmitted from the person described in paragraph (1)(A);

(B) the service provider described in paragraph (1) complies with rules concerning the refreshing, reloading, or other updating of the material when specified by the person making the material available online in accordance with a generally accepted industry standard data communications protocol for the system or network through which that person makes the material available, except that this subparagraph applies only if those rules are not used by the person described in paragraph (1)(A) to prevent or unreasonably impair the intermediate storage to which this subsection applies;

legality by sir_cello · 2003-06-30 10:18 · Score: 2, Informative

There are limited provisions in copyright law (at least in the UK, and I expect elsewhere in the world) for public libraries and archives. But these are indeed limited provisions and do not apply to a random commercial organisation that decides to provide such a service.

Firstly, in the general case of search engines providing indexing of content, this is legal and there are legal cases to back it up (in the UK: antiquesportfolio) so long as the indexes are not copies.

Secondly, in the case of USENET groups and mailing lists, then in the process of submitting a message to the mailing list or group, you have given an implicit license for the message to be reproduced within the nature of the particular technology at hand. This means if at a later date you object to a message in a mailing list that you wrote in the past, you don't really have the ability to retract it. In all cases, anyone deciding to use the material in another way (e.g. creating a commercial CDROM of USENET material for a marked-up price) would be violating your (and others') copyright. However, if they were providing that CDROM as a distribution service for USENET itself (e.g. "get your monthly USENET CDROM") then this is probably within the bounds of legality as it is still transfer via the USENET system, and the cost is likely to reflect media/distribution costs rather than some specific aim to make a commercial product out of your material.

Finally, in the specific case of copies of websites, yes this is a violation of copyright - but as far as I know this has not been tested in a court of law. The use of the Robots Exclusion Protocol and the NOARCHIVE, NOINDEX and NOFOLLOW elements allow a weasel argument suggesting that it is inherent in the WWW itself (as a new form of media / technology) that search engine indexing and archiving / caching is legal unless you specifically disallow it with this mechanism. It may also be the case that if this archiving / caching was carried out for profit or at a price greater than fair for distribution/media then a party is making an economic gain out of your material and this suggests an inequitable violation of your economic rights.

Another point to remember is that in WIPO treaties that resulted in DMCA provisions, as enacted in the UK and EU, there are specific fair use allowances for intermediate copies of a copyright work as necessary for the telecommunications medium itself (this would seem to allow things like store-and-forward systems, and caching).
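
To make the NOARCHIVE, NOINDEX and NOFOLLOW mechanism mentioned above concrete, here is a minimal sketch of how an indexer or cache might read those directives from a page's robots meta tag. The class name, function name and sample page are hypothetical, and real crawlers may handle more cases (per-crawler meta names, HTTP headers) than this does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the lowercased robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page whose owner permits indexing but not cached copies.
SAMPLE_PAGE = """<html><head>
<meta name="ROBOTS" content="NOARCHIVE, NOFOLLOW">
<title>Example</title></head><body>Hello</body></html>"""

if __name__ == "__main__":
    found = robots_directives(SAMPLE_PAGE)
    print("may keep a cached copy:", "noarchive" not in found)
```

A cache that honours this convention would still index the sample page but would not offer a stored copy of it, which is the opt-out the "weasel argument" relies on.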

Not so black and white as most here are saying! by Anonymous+Brave+Guy · 2003-06-30 10:40 · Score: 2, Interesting

In other words, by your NOT including a robots.txt file, you are implicitly granting them permission to cache your content.

Riiiiight. See you in court.

As I've just posted elsewhere, it is quite feasible that a site owner could be damaged if caches maintain information after the original site has been changed or taken down. For example, if updated information is placed on the original, this leaves the "cached" versions out of date and misleading anyone who reads them thinking they're seeing a perfect copy of the real thing.

There is also the issue of a site owner's right to know who is visiting them. Many popular web sites can and do collect information about how visitors move around their sites, the browsers and resolutions they use, etc. If the information on the site is being offered according to the normal conventions of the Internet, it is only fair to provide them the feedback normally returned by the conventions of the Internet. This information is valuable to them when they come to revise the site. Ultimately it is also in the site visitors' best interests for the site owner to have accurate information available, so that if they want to make the effort to improve usability, support minority browsers that some of their visitors use or whatever, they can do so.

On a related note, there are questions of advertising revenue etc. if a site is supported by sponsors who pay per-hit. It's not at all guaranteed that they will get their fair amount of sponsorship if most of those hits are seeing a web cached version.

This whole issue isn't nearly as black and white as the "information should be free" crowd are inevitably shouting already.

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.

shrinkwrap/acceptable use policy? by lpq · 2003-06-30 18:46 · Score: 2, Interesting

Some people are arguing robots.txt as the determiner, however remember
the court case that a company *lost* because it copied the data of a
competitor site and set its prices lower.

This is equivalent to Kroger hiring a few clerks to go down each day and
take prices of various objects on their wifi-equipped phones/handhelds in
a Safeway store so Kroger can undercut prices.

What, you didn't read the fine print on the Safeway door that says no price
comparisons or making up price lists? Or what...were they supposed to look
for a robots.txt file behind the Safeway door?

There seems to be a general lack of common sense here (especially on the
part of the judge that ruled against the company scanning for competing
prices). If it is allowed in the real world, it shouldn't be different in
the computer world without a lot of sound reasoning behind why it should be
different. The fact that Safeway could have a 3-page acceptable use policy
that I accept when my body presence opens the door, is ludicrous.

Now you talk about advertising losses -- what about whichever major network
it was that deleted a competing major network's logo, bought and paid for on
a tall building in Times Square for New Year's Eve? The competing network
modified the image in real time and inserted its own logo for the price of
an SGI workstation -- a heck of a lot cheaper. Legal? Not legal? Can you say
a real life image is "copyright", and if two people take a picture of the
same real life picture, is one the rightful owner? What if one or both alter
the "real life picture", have they violated someone's rights? Reality's
rights (ok, in this case it would have been the network that paid to rent
the entire side of the building), but is it really a matter of who owns what
you see? If a picture is taken of what you see, who owns the picture?

This is a complete mishmash of conflicting legal decisions with computer
copying, caching, alteration and adding to the mess. What if I load a page
but I don't load the images? Have I violated copyright because I either
chose or cannot load the images? What if I selectively blocked them based
on their IP or name? If I don't load flash player, am I violating a
copyright on a site by not viewing the flash content advertising?

Random judges in random jurisdictions are going to be making random calls on
right/wrong that will collide with each other and with what makes sense in
the real world.

I'm not sure what the collective approach should be -- should I be required to
watch TV advertising or am I stealing programming if I go to the loo during
a panty spot? If I block popups am I stealing computer time.....

This is all just one big gigantic growing mass of living worms that promises to be one of the larger headaches of times to come.

Any unified field theories to solve this mess? :-)

Re:That's not what I've read. by damiam · 2003-07-01 01:44 · Score: 2, Insightful

Paper can potentially last a long time (the US Constitution is still intact, for example). However, the average paper archive the size of a CD (which would physically be quite substantial) would require enough upkeep to make the cost of storing and maintaining it much greater than the cost of burning a new copy of the CD every ten or twenty years.

--
It's hard to be religious when certain people are never incinerated by bolts of lightning.

Re:Not so black and white as what you are saying! by Anonymous+Brave+Guy · 2003-07-02 00:06 · Score: 2, Interesting

Damaged in what way? Aren't there archives of newspapers, journals, and magazines? And if time-sensitive information is present on a website, does the public have a right to see what was previously there?

If I put up information on a web site, for free, as a volunteer, then the public has no rights whatsoever, either legally or morally. Why the hell should they? They didn't do anything to earn them.

If you have a specific example related to this problem, I would love to hear it.

I'll give you a couple of examples where real damage can be done. There are certainly several other instances, but I hope these will suffice for now.

There have been cases where someone published some material on a subject that interested them on a web site, but later wanted to publish work based on it in something like a journal or a book. (Disclosure: I am currently in a similar position myself.)

Now, publishers get very nervous about publishing material that has previously been available in another form. If you're arguing that by putting it up on the web an author effectively forfeits all rights to control their work -- i.e., that the usual principles of copyright shouldn't apply for some reason in this medium -- then you're basically saying that anyone who might ever want to publish original material they wrote shouldn't ever make anything available on the web first. Given how much both the public and the author can potentially get out of that, provided that reasonable controls are in place -- there was a Slashdot story about a new programming book citing a preprint temporarily placed on the web just a few days ago -- this seems to be needlessly counterproductive to me.

Secondly, a bit closer to home, consider a company that has a critical story about it published on Slashdot. That company is likely to get a lot of traffic to its web site if the site is linked, and might well want to put up a rebuttal of any points made against it. It's only fair that visitors who go to check out the Slashdot story also see the company's response.

Now, we all know that Slashdot articles have seriously criticised businesses in the past, sometimes with justification, sometimes without. We all know that web sites get Slashdotted. We all know that people post links here to Google caches of sites, or just copy whole pages and post them here. In this sort of case, someone could suffer serious harm to their reputation because the audience of Slashdot only get to read things supporting a critical claim, without seeing (or even being aware of) a response from the criticised party in their defence.

Nicking someone's material and posting it here is blatant copyright infringement, and just because it's done by an AC and Slashdot claims that all posts are the responsibility of their authors doesn't necessarily make it legal. It amazes me, given a few of the things that get posted around here, that no-one has ever really attempted to sue Slashdot over this. Certainly things like circumventing the NYT's "free reg required" are very dicey, and given that everyone (including those running Slashdot) knows that it happens, I don't see how they'd have much of a defence.

In my personal opinion, and looking at the actual US law that's been quoted here, it seems that web sites caching material are also likely to be in breach of copyright laws for much the same reasons, doing much the same damage in some cases, and potentially subject to much the same penalties.

Right - just like WalMart has the right to pat down and run a credit check on everyone who walks through their doors.

No, it doesn't. But it has the right to refuse entry to anyone who doesn't provide the information it requires. Banks do this if you try to enter before removing your crash helmet. Bars do it if you look under-age and can't produce ID.

While a site admin might like to know everythin

--
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
