Wayback Machine Safe, Settlement Disappointing
Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."
http://www.archive.org/
"Dave, don't mess with the man with the wayback machine."
I want a search engine that only indexes items excluded in the robots.txt file
There are shills on slashdot. Apparently, I'm one of them.
What's really disappointing is that it's apparently cheaper to pay lawyers to settle a case than it is to defend your right to ignore optional guidelines like robots.txt in US courts.
If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.
--
make install -not war
Thought I'd go karma slutting; maybe have a load of karma hit you too. ;-)
If you go directly to their site, you get a version of their site that looks like it's from 1995.
which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place
So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place? Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all? How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with. And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
...Don't put it on the Internet. In fact, don't even type it into a computer, or write it down.
People shouldn't put anything on the Internet that they wouldn't want their worst enemy, boss, NSA, or grandmother to see. Obviously since the porn industry exists online, few people follow this rule, but it's a good one nonetheless.
I enjoy Archive.org and when I get nostalgic about my websites of the past, it's there to show me a glimpse into history.
Saskboy's blog is good. 9 out of 10 dentists agree.
Good thing this isn't anything to do with the Bush Administration, else they'd have retroactively classified all this stuff as 'Top Secret' and then charged the Wayback Machine with treason under the Patriot Act ....
and then the machine would find itself held without trial or charges in Gitmo until it turned to rust.
Sometimes you gotta laugh to keep from crying.....Hopefully you're laughing at this
For the life of me I can't figure out what ringing a bell and publishing something online have in common. Maybe if we didn't use digital clocks we could turn back the sands of time and use a different mixed metaphor instead?
I Am My Own Worst Enemy
....even if Wayback did respect the robots.txt (which I was under the impression that they generally do), any pages archived before the robots.txt was placed on the server aren't going to automatically disappear -- they are still there. You have to directly ask them to remove the previously archived pages if you don't want them to be accessible.
"Every great cause begins as a movement, becomes a business, and eventually degenerates into a racket." -- Eric Hoffer
Laughter is induced by the ironic or unexpected. Unfortunately, I fully expect what you say would be how things would play out
Is that some sites that used to exist had no robots.txt file, yet still get blocked
After a certain domain was no longer in use for years some adware search rank linkpharm whatever it is added a robots.txt file to a "hijacked" domain.
One can now get formerly accessible sites removed from archive.org. EVEN IF THE ORIGINAL OWNER NEVER INTENDED TO.
perpetually dwelling in the -1 pits
Check out their robots.txt: http://www.healthcareadvocates.com/robots.txt They ONLY restrict the Internet Archive from accessing their web site, but don't restrict any other spider... Haven't they heard of Google's cache?
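For anyone wondering what "only restricts one spider" looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text below is an assumption written for illustration (it is not copied from the site above), and the allowed() helper and example.com URL are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, for illustration only: it names a single
# crawler (ia_archiver) and says nothing about any other user agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"  # hypothetical URL
    for bot in ("ia_archiver", "Googlebot"):
        print(bot, "allowed:", allowed(bot, page))
```

A file like this excludes the named crawler and leaves every other spider free to fetch (and cache) the same pages.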
Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't. You can technically ignore any and all standards, but there will be sanctions. In the case of robots.txt, these sanctions can very well be court ruling against you, because robots.txt is an established standard for regulation of the interaction between automated clients and webservers. As such it is an effective declaration of the rights that a server operator is willing to give to automated clients in contrast to human clients. This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation and if the author explicitly denies robot access, then the service operator knowingly redistributes the work against the author's will.
Even if you don't fear the legal system, disregarding robots.txt can quickly get you in trouble. There are junk scripts which feed bots endlessly, and there are automatic blocklists for misbehaving bots. If people program their bots to ignore robots.txt, these and possibly more proactive self-defense mechanisms will become the norm. Is that the net you want? Maybe obeying robots.txt is the better alternative, don't you think?
So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place?
You never intended to make your search results publicly available. These guys intentionally made their web page publicly available.
Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all?
That's a better point, but the question is whether the Wayback Machine "republished copyrighted material". If they instead archived material available in the public domain, it is a different matter entirely, regardless of what the creators of that material want.
How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with.
If you don't want something to be used freely, don't release it into the public, unless there are legal protections in place. If it's the GPL, people are legally forbidden from incorporating it into a (publicly released) closed source product. If it's the LGPL, people can do so. If you don't like that, don't release it publicly.
And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
This is a fact of Internet life and always has been. This isn't different from other activities of 9-year-olds or anyone else in the public sphere. If you streak through a mall naked and someone snaps your picture, too bad: you can't make the photos disappear.
I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.
First, some background. I have a weblog I've been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, despite the fact that they forwarded with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.
Flash forward to earlier this week. I've made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post... and to my horror, discover that it's missing half a sentence at the beginning of a paragraph and I don't remember the sense of what I originally wrote!
My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I'm able to pull up the summary with a link to the original location. I click on it... and I see:
Excluded by robots.txt (or words to that effect).
Now this is a page that was not blocked at the time that ia_archiver spidered it, but that was later blocked. The Wayback machine retroactively blocked access to the page based on the robots.txt content. I searched through the documentation and couldn't determine whether the data had actually been removed or just blocked, so I decided to alter my site's robots.txt file, fire off a request for clarification, and see what happened.
As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.
In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.
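The behaviour summarised above can be sketched as a small model. To be clear, this is only a guess at the logic as observed from the outside, not the Internet Archive's actual code; snapshot_viewable() and the example paths are hypothetical, and the only real name is ia_archiver, the spider mentioned earlier.

```python
from urllib.robotparser import RobotFileParser

ARCHIVE_BOT = "ia_archiver"  # the spider named in the comment above


def snapshot_viewable(path, archived_paths, current_robots_txt):
    """Model of the observed behaviour (an assumption, not real archive code).

    A snapshot is shown only if the page was archived in the first place
    AND the site's *current* robots.txt does not exclude that path.
    """
    if path not in archived_paths:
        return False  # never crawled, so there is nothing to unblock
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(ARCHIVE_BOT, path)


if __name__ == "__main__":
    archived = {"/old/review.html"}  # hypothetical archived path
    blocking = "User-agent: *\nDisallow: /old/\n"
    print(snapshot_viewable("/old/review.html", archived, blocking))
    print(snapshot_viewable("/old/review.html", archived, ""))
```

The point of the model is that the check runs against today's robots.txt, not the one in force when the page was spidered, which is why removing the rule brought the old copy back.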
Many people think of the Wayback Machine as being a tool for history and nostalgia. However, consider copyright expiration (IANAL, etc.). Many web pages have items like "Copyright 1995-2006 Blah". Some of the content was created as early as 1995. Assuming, of course, that items created in modern times eventually have their copyright expire, we will need a record of the content of these pages at that time.
As more content moves online, the idea of publishing a work becomes blurred. Revisions years later can effectively update the copyright of the work, if the reader cannot distinguish when the content was created. So the Wayback Machine will hopefully provide that resource. The amount of potentially public-domain content there is huge.
As a side note, it will be interesting to note when the first GPL programs (for example) lose their copyright. Of course, by then, the languages will seem more than archaic.
"The universe seems neither benign nor hostile, merely indifferent." --Carl Sagan
First, let me get two points out of the way: 1) IANAL, and 2) I wholeheartedly agree with the aims of Wayback and support that organisation. I am playing devil's advocate here.
In the UK Computer Misuse laws, there is the concept of unauthorised access. It is an offence to access data on a computer system without authorisation.
Typically it is assumed that access to data held on a publicly available website, without notice to the contrary, is authorised. A notice displayed stating that you should not look at the data unless you are me is sufficient to make you aware that you should not access it. Similarly, a robots.txt file is the place to explicitly define what data is unauthorised for access by automated spider systems. Anyone writing such a system can be reasonably expected to know that robots.txt contains such information and should therefore have the spider check that to see if access to the data is unauthorised. Failure to check that does not magically make the access any more legal. I would imagine that the US has similar provisions.
The creation of a robots.txt file after the spider has collected the information will not make the previous access and data collection illegal, nor should it affect the presentation of that data. Copyright law may have an impact though.
Pretty much every time we have a discussion about the legality of web/Usenet archive sites, the only argument with any legal weight that's given for what would otherwise be a clear infringement of copyright is that the rightsholder is implicitly consenting to certain uses by making the material available on that medium. The degree to which this holds in general is debatable, and AFAIK has never been tested in any major court case in any jurisdiction. However, even if robots.txt is voluntary, it's a clear statement of intent. There is no way you can claim implicit permission to copy the material when the supplier explicitly indicated, using a recognised mechanism, that they did not want it copied.
That makes comments like this one by Doc Ruby and this one by saskboy seem a little presumptuous, IMNSHO.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
shouldn't be copyrightable - there is nowhere more "public domain" than the Internet. Same with radio/TV - anyone who makes use of the public airwaves should sacrifice any claim to copyright for that privilege. If someone wants to control their works through copyright, they should use controlled, private distribution.
I'll no doubt have lawyers (and lawyer wannabes) protesting - but that only follows the literal and common sense meaning of "public domain," instead of the legal rationalization which has been brought about by those who want to have their cake, and eat it too.
"National Security is the chief cause of national insecurity." - Celine's First Law
"Publishing" something online is hardly a reason for that information to stay online and be available indefinitely. After all, the latest AOL fiasco has just shown us that not all information should be available in perpetuity. With technology getting ever simpler it becomes trivial to expose online documents that were never meant to be seen by others. Claiming that any exposure is grounds for the information to be available to anyone in perpetuity is clearly wrong.
For a simple example, say your personal diary with your private thoughts and writings somehow falls out of your bag and ends up on the street. It is available for anyone to read. Would you agree to have its content published and disseminated to all the world in newspapers or some such? Or would you rather someone returned it to you quietly and the information stayed private?
Archiving information produced by other people without their express consent is wrong and, potentially, harmful. This is one case where I strongly believe copyright law should be applied and enforced.
robots.txt is not about whether accesses are "authorized" or not. Because the web server will still serve up the content if the robot asks for it! If you only want "authorized" users accessing the content, you should put some sort of access control mechanism where users have to type a password or something. Not only will that keep the robot out, but it demonstrates a clear intent to keep the robot out.
robots.txt is more of a "please don't look at this" request to spiders. If the spider asks for the content anyway and your server happily sends it, then you can't claim this is "unauthorized" access.
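To make the "request, not access control" point concrete, here is a toy sketch. Everything in it is hypothetical (the page table, the serve() and crawl() helpers, the "examplebot" agent name); the only real API used is Python's urllib.robotparser. The "server" hands over whatever it is asked for, and the only thing that keeps a page from being fetched is the crawler choosing to check the rules first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site: two pages plus a robots.txt asking bots to skip /private/.
SITE_PAGES = {
    "/index.html": "welcome",
    "/private/notes.html": "internal notes",
}
SITE_ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"


def serve(path):
    """Toy server: it never consults robots.txt, it just answers requests."""
    return SITE_PAGES.get(path)


def crawl(paths, user_agent, honor_robots):
    """Fetch each path; a polite crawler skips excluded paths by choice."""
    parser = RobotFileParser()
    parser.parse(SITE_ROBOTS_TXT.splitlines())
    fetched = {}
    for path in paths:
        if honor_robots and not parser.can_fetch(user_agent, path):
            continue  # compliance happens here, on the client side
        fetched[path] = serve(path)
    return fetched


if __name__ == "__main__":
    paths = list(SITE_PAGES)
    print(sorted(crawl(paths, "examplebot", honor_robots=True)))
    print(sorted(crawl(paths, "examplebot", honor_robots=False)))
```

Nothing on the server side changes between the two runs; a password check inside serve() would be the kind of mechanism that actually keeps a robot out.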
IIRC, this was in response to a situation where someone was suing HA; the plaintiff's law firm hammered archive.org and was able to get some of the pages that they were interested in. HA then sued the archive for copyright infringement, because they had changed their robots.txt to prevent the information from getting to the plaintiff's attorneys. The problem with this whole thing is that adding the robots file after the lawsuit is akin to destroying evidence during a trial, and they should have been found in contempt of court. Expecting the archive to delete the data is unrealistic, since unless they are serving the data there is no copyright violation. I don't see why the plaintiff's lawyer didn't serve the archive with a subpoena for the information, the way Gmail users have had their "deleted" email subpoenaed.
"Consider an analogous situation in real life: You are walking in the park and someone asks you for a dollar. You decline, but the beggar keeps asking. You're saying that accepting your first denial as binding is "voluntary" and the beggar can keep bugging you as long as he likes. If that happened to me twice, I'd have the asshole arrested, and that's exactly what you're going to see online if people don't behave, especially when their behaviour leads to copyright violations which would have been avoided if they had followed the robot exclusion standard."
And yet no one sees the analogy between the above and those "please do not copy" reminders on artists' web pages. Maybe we can pull out that old slash-standby (you locking up MY culture with your robots.txt file).
Their policy is pretty simple, and direct, and involves minimal interaction with a human. (A bonus.)
Put in a robots.txt.
Direct Wayback to index what you want or don't.
THAT DIRECTION IS APPLIED TO FILES ON THEIR SITE FROM PREVIOUS VERSIONS.
Meaning, if you deny all, and their bot sees it, all of your stuff is supposed to get deleted from the archive.
If they didn't do that they violated their own policy.
True, there can be complications (such as switching domain names) that might keep any given text in there without interaction.
What they do is a great and tremendously useful tool. But not entirely out of the "gray area" for copyright problems.
You missed the best part of the quote.
It may still be voluntary today, but who knows what the future will bring?
I, for one, welcome our robot.txt overlords.
I just pooped your party.
The US has copyright laws, and lots of people rely on them, including open source projects.
The robots.txt file is a clear indication of the conditions under which a copyright holder gives you access to their copyrighted materials. As such, it is not "voluntary".
In addition to probably being in violation of copyright law, it is simply rude for companies to ignore robots.txt files; if the Internet Archive does this, they are badly behaved.
If courts should decide that robots.txt files can be ignored at will, then more sites will require registration, click-through licenses, and those annoying "try to read this" safeguards, making life more miserable for all of us.
The best thing for everybody, including the Internet Archive, would be for the robots.txt standard to be enforced strongly by courts.
Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, you should modify your robots.txt and re-submit your site, and when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:
http://www.archive.org/about/faqs.php#2
http://web.archive.org/web/20050305142910/http://
Let's review the text here, just in case someone from archive.org scurries to change it:
Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive
By not honoring those directives, are they not engaging in both copyright infringement and fraud?
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
He makes an interesting point about how archives make it hard to delete previous illicit activity.
Shick's Law: There is no problem a good miracle can't solve.
Since robots.txt is an access control mechanism, bypassing it is illegal under the DMCA.
It's full of lies. It told me that the Internet was invented by Larry Roberts, working for DARPA. Everyone knows it was Al Gore...
It's more like: bum asks you for a dollar. You give him one. Two weeks later you decide you don't want to give handouts any more, so you write on your forehead "no soliciting". Next you go to court and claim that writing "no soliciting" on your forehead means you not only won't give more handouts, but the bum who you PREVIOUSLY gave a dollar to, now has to return it.
See: that company DID NOT HAVE a robots.txt directive active when the Wayback Machine archived it. They put the robots directive up two weeks later, once they realized that the archived file showed they were doing bad stuff that would embarrass them.
http://www.whitehouse.gov/robots.txt
think about it-- anything on this list IS NOT on google..
why???
every day http://en.wikipedia.org/wiki/Special:Random
I really hate that. When I want to find some info about some hardware made by a long-defunct company, I find old usenet posts referencing their website, which is now taken over by some scumbag who has filled it full of porn and viagra ads. I go to the Wayback Machine and find ALL the history of the site is inaccessible because the current owner of the domain has blocked them in the robots.txt, despite the fact the owners of the original site have no relation to them at all.
we see things not as they are, but as we are.
-- anais nin
From the Ars article, it would seem that one of the arguments from Healthcare Advocates was that looking at a cached version of an out-dated website was a violation of the DMCA.
That's just crazy.
If you have a "member's area" or some other area not intended for public consumption, then I'd imagine you have a reasonable expectation of privacy. That's what it all comes down to - if you post legal (not illegal or stolen such as your examples) information on a website, for all to see, then you really can't complain later when an archive company shows you what you had up there.
I saw your other post on that credit card scammer thing, and that's outside the scope of this argument. Obviously illegal content should not be reproduced.
You're the second jackass to accuse me of imitating the idiot pres, and I'd be insulted if I took you at all seriously.
One sign of diminished intelligence is a fondness for quoting fallacy definitions without really understanding them. Though I can't blame you in this case. "Argument from adverse consequences (putting pressure on the decision maker by pointing out dire consequences of an "unfavorable" decision)" is so vague as to be meaningless. "Captain, slow down! There are icebergs ahead!" "Oh, stop arguing from adverse consequences!"
2) Robots.txt being a mandatory instruction to retroactively get rid of any archives collected before the robots.txt directive went up. That is much harder to justify.
Do you understand the difference?
Why should the Internet be different than print media?
Has anyone (other than the Government) ever gone to the Library of Congress and successfully demanded that they destroy print media in their archives? How about digital media?
The answer is no.