Wayback Machine Safe, Settlement Disappointing
Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."
http://www.archive.org/
I want a search engine that only indexes items excluded in the robots.txt file
There are shills on slashdot. Apparently, I'm one of them.
What's really disappointing is that it's apparently cheaper to pay lawyers to settle a case than it is to defend your right to ignore optional guidelines like robots.txt in US courts.
If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.
--
make install -not war
If you go directly to their site, you get a version of their site that looks like it's from 1995.
which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place
So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place? Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all? How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with. And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
...Don't put it on the Internet. In fact, don't even type it into a computer, or write it down.
People shouldn't put anything on the Internet that they wouldn't want their worst enemy, boss, NSA, or grandmother to see. Obviously since the porn industiry exists online, few people follow this rule, but it's a good one none the less.
I enjoy Archive.org and when I get nostalgic about my websites of the past, it's there to show me a glimpse into history.
Saskboy's blog is good. 9 out of 10 dentists agree.
....even if Wayback did respect the robots.txt (which I was under the impression that they generally do), any pages archived before the robots.txt was placed on the server aren't going to automatically disappear -- they are still there. You have to directly ask them to remove the previously arvhived pages if you don't want them to be accessible.
"Every great cause begins as a movement, becomes a business, and eventually degenerates into a racket." -- Eric Hoffer
Is that some sites that used to exist had no robots.txt file, yet still get blocked
After a certain domain was no longer in use for years some adware search rank linkpharm whatever it is added a robots.txt file to a "hijacked" domain.
One can now get formerly accessible sites removed from archive.org. EVEN IF THE ORIGINAL OWNER NEVER INTENDED TO.
perpetually dwelling in the -1 pits
Check out their robots.txt: http://www.healthcareadvocates.com/robots.txt They ONLY restrict Internet Archive, from accessing their web site, but don't restrict any other spider... Haven't they heard of Google's cache?
There's only one metaphor - "you can't unring a bell", so there is no mixed metaphor.
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't. You can technically ignore any and all standards, but there will be sanctions. In the case of robots.txt, these sanctions can very well be court ruling against you, because robots.txt is an established standard for regulation of the interaction between automated clients and webservers. As such it is an effective declaration of the rights that a server operator is willing to give to automated clients in contrast to human clients. This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation and if the author explicitly denies robot access, then the service operator knowingly redistributes the work against the author's will.
Even if you don't fear the legal system, disregarding robots.txt can quickly get you in trouble. There are junk-scripts which feed bots endlessly and there are blocklisting automatisms against unbehaving bots. If people program their bots to ignore robots.txt, these and possibly more proactive self-defense mechanisms will become the norm. Is that the net you want? Maybe obeying robots.txt is the better alternative, don't you think?
So by the logic, if I didn't want AOL to release my search information I shouldn't be mad as it's my fault to have used them in the first place?
You never intended to make your search results publicly available. These guys intentionally made their web page publicly available.
Or that if I want my copyrighted information to not be republished by someone else, I should just simply not publish at all?
That's a better point, but the question is whether the Wayback Machine "republished copyrighted material". If they instead archived material available in the public domain, it is a different matter entirely, regardless of what the creators of that material want.
How about, if I don't want my GPL code resold by someone in a closed source product I should just know better and not put it out in the open to begin with.
If you don't want something to be used freely, don't release it into the public, unless there are legal protections in place. If it's the GPL, people are legally forbidden from incorporating it into a (publicly released) closed source product. If it's the LGPL, people can do so. If you don't like that, don't release it publicly.
And that if I post something stupid when I'm 9 we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
This is a fact of Internet life and always has been. This isn't different from other activities of 9-year-olds or anyone else in the public sphere. If you streak through a mall naked and someone snaps your picture, too bad: you can't make the photos disappear.
I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.
First, some background. I have a weblog I've been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, despite the fact that they forwarded with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.
Flash forward to earlier this week. I've made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post... and to my horror, discover that it's missing half a sentence at the beginning of a paragraph and I don't remember the sense of what I originally wrote!
My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I'm able to pull up the summary with a link to the original location. I click on it... and I see:
Excluded by robots.txt (or words to that effect).
Now this is a page that was not blocked at the time that ia_archiver spidered it, but that was later blocked. The Wayback machine retroactively blocked access to the page based on the robots.txt content. I searched through the documentation and couldn't determine whether the data had actually been removed or just blocked, so I decided to alter my site's robots.txt file, fire off a request for clarification, and see what happened.
As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.
In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.
Many people think of the Wayback Machine as being a tool for history and nostalgia. However, consider copyright expiration (IANAL, etc.). Many web pages have items like "Copyright 1995-2006 Blah". Some of the content was created as early as 1995. Assuming, of course, that items created in modern times eventually have their copyright expire, we will need a record of the content of these pages at that time.
As more content moves online, the idea of publishing a work becomes blurred. Revisions years later can effectively update the copyright of the work, if the reader cannot distinguish when the content was created. So the Wayback Machine will hopefully provide that resource. The amount of potentially public-domain content there is huge.
As a side note, it will be interesting to note when the first GPL programs (for example) lose their copyright. Of course, by then, the languages will seem more than archaic.
"The universe seems neither benign nor hostile, merely indifferent." --Carl Sagan
Pretty much every time we have a discussion about the legality of web/Usenet archive sites, the only argument with any legal weight that's given for what would otherwise be a clear infringement of copyright is that the rightsholder is implicitly consenting to certain uses by making the material available on that medium. The degree to which this holds in general is debatable, and AFAIK has never been tested in any major court case in any jurisdiction. However, even if robots.txt is voluntary, it's a clear statement of intent. There is no way you can claim implicit permission to copy the material when the supplier explicitly indicated, using a recognised mechanism, that they did not want it copied.
That makes comments like this one by Doc Ruby and this one by saskboy seem a little presumptuous, IMNSHO.
If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
shouldn't be copyrightable - there is nowhere more "public domain" than the Internet. Same with radio/TV - anyone who makes use of the public airwaves should sacrifice any claim to copyright for that priviledge. If someone wants to control their works through copyright, they should use controlled, private distribution.
I'll no doubt have lawyer (and lawyer wannabees) protesting - but that only follows the literal and common sense meaning of "public domain," instead of the legal rationalization which has been brought about by those who want to have their cake, and eat it too.
"National Security is the chief cause of national insecurity." - Celine's First Law
You missed the best part of the quote.
The US has copyright laws, and lots of people rely on it, including open source projects.
The robots.txt file is a clear indication of the conditions under which a copyright holder gives you access to their copyrighted materials. As such, it is not "voluntary".
In addition to probably being in violation of copyright law, it is simply rude for companies to ignore robots.txt files; if the Internet Archive does this, they are badly behaved.
If courts should decide that robots.txt files can be ignored at will, then more sites will require registration, click-through licenses, and those annoying "try to read this" safeguards, making life more miserable for all of us.
The best thing for everybody, including the Internet Archive, would be for the robots.txt standard to be enforced strongly by courts.
Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, that you should modify your robots.txt and re-submit your site, and when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:
http://www.archive.org/about/faqs.php#2
http://web.archive.org/web/20050305142910/http://
Let's review the text here, just in case someone from archive.org scurries to change it:
Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive
By not honoring those directives, are they not engaging in both copyright infringement and fraud?
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50