Wayback Machine Safe, Settlement Disappointing

Posted by ryuzaki0 on Thursday August 31, 2006 @10:08AM from the get-me-out-of-here-mr-wizard dept.

Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because it would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

21 of 182 comments

Simple post by Kagura · 2006-08-31 10:12 · Score: 3, Informative

http://www.archive.org/
I want.... by Whiney Mac Fanboy · 2006-08-31 10:12 · Score: 4, Funny

Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

I want a search engine that only indexes items excluded in the robots.txt file :-)

--
There are shills on slashdot. Apparently, I'm one of them.
Autolawyers by Doc Ruby · 2006-08-31 10:16 · Score: 3, Insightful

What's really disappointing is that it's apparently cheaper to pay lawyers to settle a case than it is to defend your right to ignore optional guidelines like robots.txt in US courts.

If Congress were serious about keeping the US economy "safe and effective", it would reform the "lawyers' job security" laws. Instead it will surely make them even worse, and make the lawyer tax on technology mandatory.

--
make install -not war
1. Re:Autolawyers by arthurpaliden · 2006-08-31 10:21 · Score: 3, Insightful
  
  Unless lawyers are paid by the state, like doctors in Canada, they cannot be considered officers of the court whose job it is to represent your rights before said court. Once they accept payment from a client, either actual or pending, they become no more than hired sales consultants peddling their client's version of the truth.
  
  --
  Undetectable Steganography? Yep, there's an app fo
2. Re:Autolawyers by Doc Ruby · 2006-08-31 10:38 · Score: 4, Insightful
  
  There's a good case to be made for lawyers being paid by the state, as they certainly are working in those offices on that business. But even more than doctors they cannot be allowed to make their own interests coincide with those of the state. Lawyers often work for people against the state, which must be recognized by the state as a primary responsibility of lawyers. Doctors rarely find their interests conflicting with those of the state (except when they're not getting paid on time ;), so that structure isn't as dangerous.
  
  There's probably a way to ensure that lawyers represent people's rights better than they do now. Regular random audits of billings and practices. More "contempt of court" punishment. More suspended/revoked licenses, especially for repeated frivolous representation. More "malpractice" awards. There ought to be more competition, with more standardized reviews contextualizing all those "scores", published for consumers.
  
  Lawyers even more than doctors hide behind consumer ignorance and blind "respect". Exposing their performance as part of the shopping process would make them more competitive, and better adhere to the required "ethics" that usually are assumed to come with the tie.
  
  --
  make install -not war
Don't need no Wayback by kaizenfury7 · 2006-08-31 10:17 · Score: 5, Funny

If you go directly to their site, you get a version of their site that looks like it's from 1995.
I sense a little two-faced opinion here by InsaneGeek · 2006-08-31 10:18 · Score: 4, Insightful

which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place

So by that logic, if I didn't want AOL to release my search information, I shouldn't be mad, as it's my fault for having used them in the first place? Or if I want my copyrighted information not to be republished by someone else, I should simply not publish at all? How about this: if I don't want my GPL code resold by someone in a closed source product, I should just know better and not put it out in the open to begin with. And if I post something stupid when I'm 9, we believe it should follow me around throughout my entire lifetime, because a 9 year old should know better.
1. Re:I sense a little two-faced opinion here by fm6 · 2006-08-31 10:51 · Score: 4, Interesting
  
  Another example: someone I know wrote an essay that he thought only people in his class would ever see. It contained one or two mildly embarrassing disclosures, not terribly personal, but not something you'd want a complete stranger to know about you. Some idiot put it up on the school web site without his permission.
  Here's a nasty possibility. Suppose somebody unintentionally publishes information useful to terrorists. DHS drops by and points out the error, and the information is withdrawn. Does the Wayback Machine have a right to keep the information online?
  In fact, the Wayback Machine has never asserted a right to keep anything online. As the article points out, they'll remove stuff that's noncompliant with the current robots.txt, even though it was compliant at the time it was spidered. This lawsuit wasn't about their right to keep stuff online. It was just somebody accusing them of being negligent about enforcing their own policies.
2. Re:I sense a little two-faced opinion here by gsn · 2006-08-31 11:00 · Score: 4, Insightful
  
  That's crazy - when you typed your search term into AOL you had an expectation of privacy and you did not for one minute believe that they would release that data. All webpages are copyrighted and the Wayback Machine is relying on fair use to archive copies for educational use. If you publish information (it's automatically copyrighted) and someone reproduces it, they might be able to under fair use or they might be infringing your copyright - talk to your lawyer. And yes, if you posted something on the net when you were 9 that was stupid, it might well follow you around for the rest of your life. Same goes if you were in a porno in college and you put it online. Sorry. Tough shit. Maybe your parents should have paid more attention to your online activities. Or you should have known better. IANAL and 9 year olds may get some protection as minors, but the basic point remains - if you publish something online, you have no expectation of privacy. This is not at all what you were doing when you sent AOL your search queries - you published zilch.
  
  If you post something on the net then I can point my browser to it - there is no privacy, nor was there any expectation of it. I could have used wget -r -erobots=off on your page every day and got all its content - and I'd have that archive even when you deleted it or moved it into some private archive, and wget would have happily ignored your robots.txt. Since obeying robots.txt is voluntary, I simply chose not to.
  
  News websites often want you to pay for older content, but there is nothing theoretically stopping you from saving all the content day by day. You are comparing apples and oranges.
  
  Here's the summary - we posted evidence online that was used against us in a court of law, we lost, we sued the people who provided that evidence, and because it's cheaper to settle than deal with bloody lawyers we settled with them.
  
  --
  Reality must take precedence over public relations, for nature cannot be fooled.
If you don't want it read... by saskboy · 2006-08-31 10:18 · Score: 3, Insightful

...Don't put it on the Internet. In fact, don't even type it into a computer, or write it down.
People shouldn't put anything on the Internet that they wouldn't want their worst enemy, boss, NSA, or grandmother to see. Obviously since the porn industry exists online, few people follow this rule, but it's a good one nonetheless.

I enjoy Archive.org and when I get nostalgic about my websites of the past, it's there to show me a glimpse into history.

--
Saskboy's blog is good. 9 out of 10 dentists agree.
What REALLY pisses me OFF by scenestar · 2006-08-31 10:33 · Score: 4, Insightful

Is that some sites that used to exist had no robots.txt file, yet still get blocked.

After a certain domain had been out of use for years, some adware/search-rank link farm (or whatever it is) took over the "hijacked" domain and added a robots.txt file to it.

One can now get formerly accessible sites removed from archive.org. EVEN IF THE ORIGINAL OWNER NEVER INTENDED TO.

--
perpetually dwelling in the -1 pits
Check out their robots.txt... by Anonymous Coward · 2006-08-31 10:36 · Score: 3, Interesting

Check out their robots.txt: http://www.healthcareadvocates.com/robots.txt They ONLY restrict the Internet Archive from accessing their web site, and don't restrict any other spider... Haven't they heard of Google's cache?
A world without cooperation by Anonymous Coward · 2006-08-31 10:42 · Score: 5, Insightful

Obeying robots.txt is "voluntary" in the same sense that obeying RFCs is voluntary. In other words, it isn't. You can technically ignore any and all standards, but there will be sanctions. In the case of robots.txt, these sanctions can very well be court ruling against you, because robots.txt is an established standard for regulation of the interaction between automated clients and webservers. As such it is an effective declaration of the rights that a server operator is willing to give to automated clients in contrast to human clients. This is especially important with regard to services which mirror webpages. Doing so without the (assumed) consent of the author is a straightforward copyright violation and if the author explicitly denies robot access, then the service operator knowingly redistributes the work against the author's will.

Even if you don't fear the legal system, disregarding robots.txt can quickly get you in trouble. There are junk scripts which feed bots endlessly, and there are automated blocklists for misbehaving bots. If people program their bots to ignore robots.txt, these and possibly more proactive self-defense mechanisms will become the norm. Is that the net you want? Maybe obeying robots.txt is the better alternative, don't you think?
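
For readers who want to see what "obeying robots.txt" looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "ExampleBot" name, the example.com URLs, and the allowed_urls helper are all made up for illustration; a real crawler would download the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots.txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical crawl queue.
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]

# A well-behaved bot drops the disallowed URL before making any request.
polite = allowed_urls(ROBOTS_TXT, "ExampleBot", candidates)
print(polite)
```

The check costs one extra request per site and a string comparison per URL. Nothing in the protocol enforces it, which is why the thread keeps calling it voluntary.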
1. Re:A world without cooperation by Anonymous Coward · 2006-08-31 11:15 · Score: 5, Insightful
  
  An attitude like yours is exactly why people go to court over these things. If you don't even adhere to the most basic rules, then it's easier and less costly to have you pay my lawyers and a fine instead of trying to stop robots from reading information that human users are supposed to see without difficulty. The lack of common courtesy on the net is disconcerting. The server tells you in no uncertain terms that you are not welcome, but you keep requesting "forbidden" pages. Consider an analogous situation in real life: You are walking in the park and someone asks you for a dollar. You decline, but the beggar keeps asking. You're saying that accepting your first denial as binding is "voluntary" and the beggar can keep bugging you as long as he likes. If that happened to me twice, I'd have the asshole arrested, and that's exactly what you're going to see online if people don't behave, especially when their behaviour leads to copyright violations which would have been avoided if they had followed the robot exclusion standard.
Retroactive robots.txt by Kelson · 2006-08-31 10:47 · Score: 5, Insightful

I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.

First, some background. I have a weblog I've been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, despite the fact that they forwarded with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.

Flash forward to earlier this week. I've made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post... and to my horror, discover that it's missing half a sentence at the beginning of a paragraph and I don't remember the sense of what I originally wrote!

My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I'm able to pull up the summary with a link to the original location. I click on it... and I see:

Excluded by robots.txt (or words to that effect).

Now this is a page that was not blocked at the time that ia_archiver spidered it, but that was later blocked. The Wayback machine retroactively blocked access to the page based on the robots.txt content. I searched through the documentation and couldn't determine whether the data had actually been removed or just blocked, so I decided to alter my site's robots.txt file, fire off a request for clarification, and see what happened.

As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.

In summary, the Wayback Machine will block end-users from accessing anything that is in your current robots.txt file. If you remove the restriction from your robots.txt, it will re-enable access, but only if it had archived the page in the first place.
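
The behaviour summarized above can be modelled in a few lines of Python. This is only a sketch of what the comment describes, not the Internet Archive's actual code: the wayback_visible function, the example paths, and the robots.txt contents are hypothetical, and the only real API used is the standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser


def wayback_visible(archived_paths, current_robots_txt, path, agent="ia_archiver"):
    """Model: a snapshot is shown only if it was archived at some point
    AND the site's current robots.txt still allows the archiver's agent."""
    if path not in archived_paths:
        return False  # never spidered, so there is nothing to re-enable
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical old permalink that was spidered before any block existed.
archived = {"/archives/2002/review.html"}

# Current robots.txt after the old links were added to it.
blocking = "User-agent: *\nDisallow: /archives/\n"
# Current robots.txt after the restriction was removed again.
open_again = "User-agent: *\nDisallow: /tmp/\n"

print(wayback_visible(archived, blocking, "/archives/2002/review.html"))
print(wayback_visible(archived, open_again, "/archives/2002/review.html"))
print(wayback_visible(archived, open_again, "/never-archived.html"))
```

In this model the decision depends on the robots.txt in force today, not the one in force at crawl time, which is the retroactive effect described above.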
Wayback Machine essential for public domain by proxima · 2006-08-31 10:52 · Score: 3, Interesting

Many people think of the Wayback Machine as being a tool for history and nostalgia. However, consider copyright expiration (IANAL, etc.). Many web pages have items like "Copyright 1995-2006 Blah". Some of the content was created as early as 1995. Assuming, of course, that items created in modern times eventually have their copyright expire, we will need a record of the content of these pages at that time.

As more content moves online, the idea of publishing a work becomes blurred. Revisions years later can effectively update the copyright of the work, if the reader cannot distinguish when the content was created. So the Wayback Machine will hopefully provide that resource. The amount of potentially public-domain content there is huge.

As a side note, it will be interesting to note when the first GPL programs (for example) lose their copyright. Of course, by then, the languages will seem more than archaic.

--
"The universe seems neither benign nor hostile, merely indifferent." --Carl Sagan
Re:Info published on the Internet... by Lactoso · 2006-08-31 11:41 · Score: 3, Insightful

And just what does that check to your hosting company pay for aside from the physical location and maintenance of the webserver? Propagation of your website's IP address to DNS and bandwidth. And what do you need bandwidth for if not to share your web pages with the internet at large...
Re:Info published on the Internet... by phulegart · 2006-08-31 12:37 · Score: 5, Informative

So if my content is behind a protected "members area" then it is still public domain and should be freely available? If I am a photographer, and my site clearly states that all images are copyright of a certain date and that use of them without my permission is forbidden, that means nothing? If someone uses images of me without my permission, that they got from a website or protected members area, how is it that I can get them removed by complaining? If they are public domain, then it should be my tough luck, right?

If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

If I post on a forum site that I am selling stolen credit card info and bank info, my post should not be touched, because it is public domain and it should be freely available?

--
"I love deadlines. I love the whooshing sound they make as they fly by." -D. Adams
... I could make it so you were never born. by Corngood · 2006-08-31 13:07 · Score: 3, Interesting

You missed the best part of the quote.
Wrong, wrong, wrong by kimvette · 2006-08-31 13:55 · Score: 3, Informative

As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, that you should modify your robots.txt and re-submit your site, and when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:

http://www.archive.org/about/faqs.php#2

http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html

Let's review the text here, just in case someone from archive.org scurries to change it:

Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive

To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt) and then submit your site below.

The robots.txt file will do two things:

1. It will remove all documents from your domain from the Wayback Machine.

2. It will tell the Internet Archive's crawler not to crawl your site in the future.

To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

User-agent: ia_archiver
Disallow: /

Robots.txt is the most widely used method for controlling the behavior of automated robots on your site (all major robots, including those of Google, Alta Vista, etc. respect these exclusions). It can be used to block access to the whole domain, or any file or directory within. There are a large number of resources for webmasters and site owners describing this method and how to use it. Here are a few:

http://www.global-positioning.com/robots_text_file/index.html

http://www.webtoolcentral.com/webmaster/tools/robots_txt_file_generator

http://pageresource.com/zine/robotstxt.htm

Once you have put a robots.txt file up, submit your site (www.yourdomain.com) on the form on http://pages.alexa.com/help/webmasters/index.html#crawl_site.

The robots.txt file must be placed at the root of your domain (www.yourdomain.com/robots.txt). If you cannot put a robots.txt file up, submit a request to wayback2@archive.org.
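
As a rough check of what the ia_archiver record quoted in the policy does, the sketch below feeds it to Python's standard urllib.robotparser and asks about two crawlers. The "Googlebot" agent string and the index.html URL are just illustrative assumptions. Python's parser treats a blank line as the end of a record, so the two directives are kept on adjacent lines here.

```python
from urllib.robotparser import RobotFileParser

# The record from the removal policy, with both directives in one record.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page on the placeholder domain used in the policy text.
page = "http://www.yourdomain.com/index.html"

# The Internet Archive's crawler is excluded from the whole site...
archiver_allowed = parser.can_fetch("ia_archiver", page)
# ...while a crawler that no record names falls through to "allowed".
other_allowed = parser.can_fetch("Googlebot", page)

print(archiver_allowed, other_allowed)
```

This matches the observation elsewhere in the thread that a file like this restricts only the Internet Archive and leaves every other spider alone.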

By not honoring those directives, are they not engaging in both copyright infringement and fraud?

--
The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50
Re:Info published on the Internet... by iminplaya · 2006-08-31 14:34 · Score: 3, Interesting

If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

If anything bad comes from it, it only means that the banks employ weak security. That information by itself should mean nothing. Complain to the financial institutions, not the person who posts it. Make it the bank's problem and it will go away. Don't use their services until they make it secure without making it unduly inconvenient for the customer. The silly passwords and 20 minute waits for failed logins do nothing for security. Make financial security the institution's responsibility instead of suppressing the flow of information. And furthermore, you know what you can do with your copyrights. If you don't want people to use your photos keep them to yourself. If you don't want your information divulged, then don't reveal it to anybody.

--
What?