Wayback Machine Safe, Settlement Disappointing
Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because a trial would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."
http://www.archive.org/
"Dave, don't mess with the man with the wayback machine."
Thought I'd go karma slutting; maybe a load of karma will hit you too. ;-)
If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.
I don't see that happening any time soon -- http://www.yourcongress.com/ViewArticle.asp?article_id=1671
....even if Wayback did respect the robots.txt (which I was under the impression they generally do), any pages archived before the robots.txt was placed on the server aren't going to automatically disappear -- they are still there. You have to directly ask them to remove the previously archived pages if you don't want them to be accessible.
"Every great cause begins as a movement, becomes a business, and eventually degenerates into a racket." -- Eric Hoffer
There's only one metaphor - "you can't unring a bell", so there is no mixed metaphor.
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
The robots exclusion standard was primarily designed to exclude robots from the parts of the server's namespace that robots can't handle, like (practically) infinite url trees or shop sites. You don't want bots to crawl a neverending swamp of dynamically generated content that points to ever more dynamically generated content. You also don't want bots to order stuff or vote for comments when they crawl the scripts (the webmonkey should have used POST, not GET, but if he chose to use robots.txt instead, you're going to at least get an angry call). There are many more reasons to exclude robots from certain url prefixes. If you're operating a robot, follow that standard, for your own good. Some servers are actively hostile if you don't follow robots.txt.
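For anyone writing a bot, here is a minimal sketch of what "following the standard" can look like, using Python's standard-library urllib.robotparser. The robots.txt rules, the bot name "examplebot", and the example.com URLs are all made up for illustration; a real crawler would fetch the site's actual /robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a shop site might publish to keep bots away from
# its cart and voting scripts (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /vote
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/cart/add?id=1",
        "http://example.com/vote?comment=5",
    ):
        verdict = "fetch" if allowed(ROBOTS_TXT, "examplebot", url) else "skip"
        print(verdict, url)
```

Note that the check happens entirely on the bot's side. Nothing stops a bot that never calls it, which is why the standard is a courtesy rather than a lock.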
robots.txt is not about whether accesses are "authorized" or not, because the web server will still serve up the content if the robot asks for it! If you only want "authorized" users accessing the content, you should put some sort of access control mechanism in place where users have to type a password or something. Not only will that keep the robot out, but it demonstrates a clear intent to keep the robot out.
robots.txt is more of a "please don't look at this" request to spiders. If the spider asks for the content anyway and your server happily sends it, then you can't claim this is "unauthorized" access.
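To make the difference concrete, here is a rough sketch of the "password or something" approach: a server-side check in the style of HTTP Basic authentication that refuses to serve a protected path unless the request carries the right credentials. The username, password, path prefix, and function names are all hypothetical, and a real site would use its web server's or framework's built-in authentication rather than hand-rolled code like this.

```python
import base64
import hmac

# Hypothetical credentials and protected path prefix, for illustration only.
USERNAME = "member"
PASSWORD = "s3cret"
PROTECTED_PREFIX = "/members/"


def is_authorized(auth_header):
    """Return True if the Authorization header holds the expected Basic credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(auth_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        # Malformed base64 or non-UTF-8 bytes: treat as not authorized.
        return False
    expected = f"{USERNAME}:{PASSWORD}"
    # Constant-time comparison of the "user:password" strings.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def status_for(path, auth_header=None):
    """Return the HTTP status the server would send for this request."""
    if path.startswith(PROTECTED_PREFIX) and not is_authorized(auth_header):
        return 401  # Unauthorized: the content is never sent.
    return 200


if __name__ == "__main__":
    good = "Basic " + base64.b64encode(b"member:s3cret").decode("ascii")
    print(status_for("/index.html"))            # public page
    print(status_for("/members/photos"))        # no credentials supplied
    print(status_for("/members/photos", good))  # correct credentials supplied
```

Unlike robots.txt, this decision is made by the server, so a spider that ignores polite requests still gets nothing back.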
So if my content is behind a protected "members area", then it is still public domain and should be freely available? If I am a photographer, and my site clearly states that all images are copyrighted as of a certain date and that use of them without my permission is forbidden, that means nothing? If someone uses images of me without my permission, which they got from a website or protected members area, how is it that I can get them removed by complaining? If they are public domain, then it should be my tough luck, right?
If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?
If I post on a forum site that I am selling stolen credit card info and bank info, my post should not be touched, because it is public domain and it should be freely available?
"I love deadlines. I love the whooshing sound they make as they fly by." -D. Adams
Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, you should modify your robots.txt and re-submit your site, and when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:
http://www.archive.org/about/faqs.php#2
http://web.archive.org/web/20050305142910/http://
Let's review the text here, just in case someone from archive.org scurries to change it:
Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive
By not honoring those directives, are they not engaging in both copyright infringement and fraud?
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
How about we have a look at what the RFC drafts (it's not even official) say about robots.txt:
"Web site administrators must realise this method is voluntary, and is not sufficient to guarantee some robots will not visit restricted parts of the URL space."
"It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it."
It's really that simple: robots.txt is not a security tool, it's a guideline, nothing else. If you don't want robots to collect your data, simply don't send it to them.
It's a straightforward copyright violation, yep, but that has nothing to do with robots.txt, since having it or not doesn't make it any less of a violation.
Phishers do not deal with security. Phishers deal with unsuspecting and uneducated internet users. I'm sorry you are so scared to do it, but really.. go ahead and visit http://paypal-protect.org./ It is a phishing site that we are attempting to take down. Go ahead and log in with a bogus email and garbage password. It doesn't check anything beforehand. It simply takes you into a site that, aside from the URL, does look like Paypal. You are then asked to provide everything. Name, address, social security number, even the PIN for your credit card. It won't even allow you to proceed without your PIN. Then, after you submit your information (which is then sent to whoever is running the scam), you are redirected to the actual Paypal site.
Now, if a poor sap fell for it, anything that sap could have done online that involved money, the phisher can do.
You want to try to make the distinction about "If you reveal your info".. well, what if I worked at the gas station you frequent, and I copied your credit card info and the CVV2 number from the back when you made a purchase? OOPS, it was YOUR fault for actually buying something. According to you, the only way to be safe is to isolate yourself from the world and make everything you need from scratch. No one should be responsible for protecting your interests.
If I grabbed your info from your trash, it's your fault, right? Because you didn't incinerate your trash, right?
You are wrong to claim that everything posted on the internet is public domain. That is an assumption you are attempting to back up with obfuscation. What is posted on the internet is no different from what is on the shelf in a library, what is on TV, and what is on the radio. You have the right to enjoy it. You do not have the right to rebroadcast it without permission.