Wayback Machine Safe, Settlement Disappointing

Posted by ryuzaki0 on Thursday August 31, 2006 @10:08AM from the get-me-out-of-here-mr-wizard dept.

Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

12 of 182 comments

Simple post by Kagura · 2006-08-31 10:12 · Score: 3, Informative

http://www.archive.org/
Jimmy James says. .. by Anonymous Coward · 2006-08-31 10:12 · Score: 1, Informative

"Dave, don't mess with the man with the wayback machine."
Exclusion policy.... by Anonymous Coward · 2006-08-31 10:16 · Score: 1, Informative

The whole exclusion policy
Thought I'd go karma slutting; maybe have a load of karma hit you too. ;-)
Re:Autolawyers by hackstraw · 2006-08-31 10:24 · Score: 2, Informative

If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.

I don't see that happening any time soon -- http://www.yourcongress.com/ViewArticle.asp?article_id=1671
But.... by Stanislav_J · 2006-08-31 10:29 · Score: 2, Informative

....even if Wayback did respect the robots.txt (which I was under the impression that they generally do), any pages archived before the robots.txt was placed on the server aren't going to automatically disappear -- they are still there. You have to directly ask them to remove the previously archived pages if you don't want them to be accessible.

--
"Every great cause begins as a movement, becomes a business, and eventually degenerates into a racket." -- Eric Hoffer
Re:metaphorically speaking by LordNimon · 2006-08-31 10:41 · Score: 2, Informative

There's only one metaphor - "you can't unring a bell", so there is no mixed metaphor.

--
And the men who hold high places must be the ones who start
To mold a new reality... closer to the heart
Re:Isn't ignoring robots.txt unauthorised access? by Anonymous Coward · 2006-08-31 11:53 · Score: 1, Informative

The robots exclusion standard was primarily designed to exclude robots from the parts of the server's namespace that robots can't handle, like (practically) infinite url trees or shop sites. You don't want bots to crawl a neverending swamp of dynamically generated content that points to ever more dynamically generated content. You also don't want bots to order stuff or vote for comments when they crawl the scripts (the webmonkey should have used POST, not GET, but if he chose to use robots.txt instead, you're going to at least get an angry call). There are many more reasons to exclude robots from certain url prefixes. If you're operating a robot, follow that standard, for your own good. Some servers are actively hostile if you don't follow robots.txt.
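
For anyone operating a robot, the check described above might look roughly like the sketch below. It uses Python's standard `urllib.robotparser`; the robots.txt contents, the `shop.example` URLs, the `examplebot` name, and the `allowed_urls` helper are all made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a shop site; every path here is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /vote
Disallow: /calendar/
"""

def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots_txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]

candidates = [
    "http://shop.example/index.html",
    "http://shop.example/cart/add?item=42",        # would "order stuff"
    "http://shop.example/vote?comment=7&dir=up",   # would vote via GET
    "http://shop.example/calendar/2006/08/31",     # endless generated tree
]
crawlable = allowed_urls(ROBOTS_TXT, "examplebot", candidates)
print(crawlable)
```

Nothing forces the robot to run this filter; the check only has an effect if the crawler chooses to perform it before each request.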
No it isn't. by Anonymous Coward · 2006-08-31 12:09 · Score: 1, Informative

robots.txt is not about whether accesses are "authorized" or not. Because the web server will still serve up the content if the robot asks for it! If you only want "authorized" users accessing the content, you should put some sort of access control mechanism where users have to type a password or something. Not only will that keep the robot out, but it demonstrates a clear intent to keep the robot out.

robots.txt is more of a "please don't look at this" request to spiders. If the spider asks for the content anyway and your server happily sends it, then you can't claim this is "unauthorized" access.
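
To make the difference concrete, here is a minimal sketch of a server-side password check, loosely modeled on HTTP Basic authentication. The `/members/` prefix, the credentials, and the `status_for` function are hypothetical; a real server would use its own access control configuration rather than this code.

```python
import base64

# Hypothetical protected prefix and credentials, for illustration only.
PROTECTED_PREFIX = "/members/"
VALID_CREDENTIALS = {"alice": "s3cret"}

def status_for(path, authorization=None):
    """Return the HTTP status a server enforcing a password check would send."""
    if not path.startswith(PROTECTED_PREFIX):
        return 200  # public content: served to anyone, robot or not
    if authorization is None or not authorization.startswith("Basic "):
        return 401  # no credentials supplied: the server refuses
    try:
        decoded = base64.b64decode(authorization[len("Basic "):]).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return 401  # malformed header
    user, _, password = decoded.partition(":")
    return 200 if VALID_CREDENTIALS.get(user) == password else 401

token = base64.b64encode(b"alice:s3cret").decode("ascii")
print(status_for("/index.html"))                        # public page
print(status_for("/members/photos"))                    # spider without a password
print(status_for("/members/photos", "Basic " + token))  # authorized user
```

The point of the sketch is that the refusal happens on the server, whatever the client decides to do, whereas robots.txt relies entirely on the client's cooperation.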
Re:Info published on the Internet... by phulegart · 2006-08-31 12:37 · Score: 5, Informative

so if my content is behind a protected "members area" then it is still public domain and should be freely available? If I am a photographer, and my site clearly states that all images are copyright of a certain date and that use of them without my permission is forbidden, that means nothing? If someone uses images of me without my permission, that they got from a website or protected members area, how is it that I can get them removed by complaining? If they are public domain, then it should be my tough luck, right?

If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

If I post on a forum site that I am selling stolen credit card info and bank info, my post should not be touched, because it is public domain and it should be freely available?

--
"I love deadlines. I love the whooshing sound they make as they fly by." -D. Adams
Wrong, wrong, wrong by kimvette · 2006-08-31 13:55 · Score: 3, Informative

As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, you should modify your robots.txt and re-submit your site; when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:

http://www.archive.org/about/faqs.php#2

http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html

Let's review the text here, just in case someone from archive.org scurries to change it:

Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive

To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt) and then submit your site below.

The robots.txt file will do two things:

1. It will remove all documents from your domain from the Wayback Machine.

2. It will tell the Internet Archive's crawler not to crawl your site in the future.

To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

User-agent: ia_archiver

Disallow: /
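
As a rough illustration of how a cooperating crawler would read those two directives, the sketch below feeds them to Python's standard `urllib.robotparser`. It assumes the directives sit on consecutive lines in the actual file, since a blank line ends a record; the page path and the `SomeOtherBot` name are placeholders, and this says nothing about how the Internet Archive's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, written as a single robots.txt record.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://www.yourdomain.com/some/page.html"  # placeholder path

# The named crawler is excluded from the whole site...
ia_allowed = parser.can_fetch("ia_archiver", page)
# ...while a robot with any other name is not affected by this record.
other_allowed = parser.can_fetch("SomeOtherBot", page)

print(ia_allowed, other_allowed)
```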

Robots.txt is the most widely used method for controlling the behavior of automated robots on your site (all major robots, including those of Google, Alta Vista, etc. respect these exclusions). It can be used to block access to the whole domain, or any file or directory within. There are a large number of resources for webmasters and site owners describing this method and how to use it. Here are a few:

http://www.global-positioning.com/robots_text_file/index.html

http://www.webtoolcentral.com/webmaster/tools/robots_txt_file_generator

http://pageresource.com/zine/robotstxt.htm

Once you have put a robots.txt file up, submit your site (www.yourdomain.com) on the form on http://pages.alexa.com/help/webmasters/index.html#crawl_site.

The robots.txt file must be placed at the root of your domain (www.yourdomain.com/robots.txt). If you cannot put a robots.txt file up, submit a request to wayback2@archive.org.

By not honoring those directives, are they not engaging in both copyright infringement and fraud?

--
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
Re:A world without cooperation by grumbel · 2006-08-31 14:40 · Score: 2, Informative

Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't.

How about we have a look at what the RFC drafts (it's not even official) say about robots.txt:

"Web site administrators must realise this method is voluntary, and is not sufficient to guarantee some robots will not visit restricted parts of the URL space."

"It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it."

It's really that simple: robots.txt is not a security tool, it's a guideline, nothing else. If you don't want robots to collect your data, simply don't send it to them.

This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation

It's a straightforward copyright violation, yep, but that has nothing to do with robots.txt, since having it or not doesn't make it any less a violation.
Re:Info published on the Internet... by phulegart · 2006-08-31 15:32 · Score: 2, Informative

Phishers do not deal with security. Phishers deal with unsuspecting and uneducated internet users. I'm sorry you are so scared to do it, but really.. go ahead and visit http://paypal-protect.org./ It is a phishing site that we are attempting to take down. Go ahead and login with a bogus email and garbage password. It doesn't check anything beforehand. It simply takes you into a site that, aside from the URL, does look like Paypal. You are then asked to provide everything. Name, address, social security, even your PIN number for your credit card. It won't even allow you to proceed without your PIN. Then, after you submit your information (which is then sent to whoever is running the scam), you are redirected to the actual paypal site.

Now, if a poor sap fell for it, anything that sap could have done online that involved money, the phisher can do.

You want to try to make the distinction about "If you reveal your info".. well, what if I worked at the gas station you frequent, and I copied your credit card info and CVV2 number from the back, when you made a purchase? OOPS, it was YOUR fault for actually buying something. According to you, the only way to be safe is to isolate yourself from the world, and make everything you need from scratch. No one should be responsible for protecting your interests.

If I grabbed your info from your trash, it's your fault, right? Because you didn't incinerate your trash, right?

You are wrong in claiming that everything posted on the internet is public domain. That is an assumption you are attempting to back up with obfuscation. What is posted on the internet is no different than what is on the shelf in a library, what is on TV, and what is on the radio. You have the right to enjoy it. You do not have the right to rebroadcast it without permission.

--
"I love deadlines. I love the whooshing sound they make as they fly by." -D. Adams