Wayback Machine Safe, Settlement Disappointing
Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."
What's really disappointing is that it's apparently cheaper to pay lawyers to settle a case than it is to defend your right to ignore optional guidelines like robots.txt in US courts.
If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.
--
make install -not war
which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place
So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place? Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all? How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with. And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
...Don't put it on the Internet. In fact, don't even type it into a computer, or write it down.
People shouldn't put anything on the Internet that they wouldn't want their worst enemy, boss, NSA, or grandmother to see. Obviously since the porn industiry exists online, few people follow this rule, but it's a good one none the less.
I enjoy Archive.org and when I get nostalgic about my websites of the past, it's there to show me a glimpse into history.
Saskboy's blog is good. 9 out of 10 dentists agree.
Is that some sites that used to exist had no robots.txt file, yet still get blocked
After a certain domain was no longer in use for years some adware search rank linkpharm whatever it is added a robots.txt file to a "hijacked" domain.
One can now get formerly accessible sites removed from archive.org. EVEN IF THE ORIGINAL OWNER NEVER INTENDED TO.
perpetually dwelling in the -1 pits
Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't. You can technically ignore any and all standards, but there will be sanctions. In the case of robots.txt, these sanctions can very well be court ruling against you, because robots.txt is an established standard for regulation of the interaction between automated clients and webservers. As such it is an effective declaration of the rights that a server operator is willing to give to automated clients in contrast to human clients. This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation and if the author explicitly denies robot access, then the service operator knowingly redistributes the work against the author's will.
Even if you don't fear the legal system, disregarding robots.txt can quickly get you in trouble. There are junk-scripts which feed bots endlessly and there are blocklisting automatisms against unbehaving bots. If people program their bots to ignore robots.txt, these and possibly more proactive self-defense mechanisms will become the norm. Is that the net you want? Maybe obeying robots.txt is the better alternative, don't you think?
I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.
First, some background. I have a weblog I've been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, despite the fact that they forwarded with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.
Flash forward to earlier this week. I've made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post... and to my horror, discover that it's missing half a sentence at the beginning of a paragraph and I don't remember the sense of what I originally wrote!
My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I'm able to pull up the summary with a link to the original location. I click on it... and I see:
Excluded by robots.txt (or words to that effect).
Now this is a page that was not blocked at the time that ia_archiver spidered it, but that was later blocked. The Wayback machine retroactively blocked access to the page based on the robots.txt content. I searched through the documentation and couldn't determine whether the data had actually been removed or just blocked, so I decided to alter my site's robots.txt file, fire off a request for clarification, and see what happened.
As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.
In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.
Pretty much every time we have a discussion about the legality of web/Usenet archive sites, the only argument with any legal weight that's given for what would otherwise be a clear infringement of copyright is that the rightsholder is implicitly consenting to certain uses by making the material available on that medium. The degree to which this holds in general is debatable, and AFAIK has never been tested in any major court case in any jurisdiction. However, even if robots.txt is voluntary, it's a clear statement of intent. There is no way you can claim implicit permission to copy the material when the supplier explicitly indicated, using a recognised mechanism, that they did not want it copied.
That makes comments like this one by Doc Ruby and this one by saskboy seem a little presumptuous, IMNSHO.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
shouldn't be copyrightable - there is nowhere more "public domain" than the Internet. Same with radio/TV - anyone who makes use of the public airwaves should sacrifice any claim to copyright for that priviledge. If someone wants to control their works through copyright, they should use controlled, private distribution.
I'll no doubt have lawyer (and lawyer wannabees) protesting - but that only follows the literal and common sense meaning of "public domain," instead of the legal rationalization which has been brought about by those who want to have their cake, and eat it too.
"National Security is the chief cause of national insecurity." - Celine's First Law
I don't know, maybe I just don't expect my local newspaper to look like my highschool newspaper.
Inital impressions go a long way. It may seem silly to some people, but in buisness it can mean the difference between people taking you seriously and buying your product, or not.
Uhh, yeah, the summary linked to www.archive.org/web/web.php or something like that, which is the site in question. I know everyone likes to rag on the editors, but they aren't horrible, at least not all the time.