Slashdot Mirror


Wayback Machine Safe, Settlement Disappointing

Jibbanx writes "Healthcare Advocates and the Internet Archive have finally resolved their differences, reaching an undisclosed out-of-court settlement. The suit stemmed from HA's anger over the Wayback Machine showing pages archived from their site even after they added a robots.txt file to their webserver. While the settlement is good for the Internet Archive, it's also disappointing because a trial would have tested HA's claims in court. As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place."

3 of 182 comments

  1. Simple post by Kagura · · Score: 3, Informative
  2. Re:Info published on the Internet... by phulegart · · Score: 5, Informative

    So if my content is behind a protected "members area" then it is still public domain and should be freely available? If I am a photographer, and my site clearly states that all images are copyrighted as of a certain date and that use of them without my permission is forbidden, that means nothing? If someone uses images of me without my permission, that they got from a website or protected members area, how is it that I can get them removed by complaining? If they are public domain, then it should be my tough luck, right?

    If I post your credit card and bank information on a forum site, does that mean it is now public domain and you have no protection?

    If I post on a forum site that I am selling stolen credit card info and bank info, my post should not be touched, because it is public domain and it should be freely available?

    --
    "I love deadlines. I love the whooshing sound they make as they fly by." -D. Adams
  3. Wrong, wrong, wrong by kimvette · · Score: 3, Informative
    As the article notes, you can't really un-ring the bell of publishing something online, which is exactly what HA wanted to do. Obeying robots.txt files is voluntary, after all, and if the company didn't want the information online, they shouldn't have put it there in the first place.


    Wrong, wrong, wrong. archive.org explicitly tells you that if you want your content removed from their index, you should modify your robots.txt and resubmit your site; when their bot reads your robots.txt and sees the appropriate directives, your content will be dropped from the index. See:

    http://www.archive.org/about/faqs.php#2

    http://web.archive.org/web/20050305142910/http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html

    Let's review the text here, just in case someone from archive.org scurries to change it:

    Addendum: An Example Implementation of Robots.txt-based Removal Policy at the Internet Archive

     


    To remove a site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. www.yourdomain.com/robots.txt) and then submit your site below.

    The robots.txt file will do two things:

              1. It will remove all documents from your domain from the Wayback Machine.

              2. It will tell the Internet Archive's crawler not to crawl your site in the future.

    To exclude the Internet Archive's crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say:

        User-agent: ia_archiver
        Disallow: /

    Robots.txt is the most widely used method for controlling the behavior of automated robots on your site (all major robots, including those of Google, Alta Vista, etc. respect these exclusions). It can be used to block access to the whole domain, or any file or directory within. There are a large number of resources for webmasters and site owners describing this method and how to use it. Here are a few:

                          http://www.global-positioning.com/robots_text_file/index.html

                          http://www.webtoolcentral.com/webmaster/tools/robots_txt_file_generator

                          http://pageresource.com/zine/robotstxt.htm

    Once you have put a robots.txt file up, submit your site (www.yourdomain.com) on the form on http://pages.alexa.com/help/webmasters/index.html#crawl_site.

    The robots.txt file must be placed at the root of your domain (www.yourdomain.com/robots.txt). If you cannot put a robots.txt file up, submit a request to wayback2@archive.org.
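
For anyone who wants to check what those two directives actually say to a crawler, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed, the bot name SomeOtherBot, and the page URL are made up for illustration; only the robots.txt text comes from the policy quoted above. It shows how a well-behaved parser reads the rules, not whether any particular crawler (or archive) honors them.

```python
from urllib.robotparser import RobotFileParser

# The directives quoted in the removal policy above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the text locally
    instead of downloading /robots.txt from a live server.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example URL and a made-up second bot name.
PAGE = "http://www.yourdomain.com/index.html"
archiver_allowed = is_allowed(ROBOTS_TXT, "ia_archiver", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)

print("ia_archiver allowed:", archiver_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Under these rules the ia_archiver agent is excluded from the whole site, while agents with no matching record fall through to the default of being allowed.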


    By not honoring those directives, are they not engaging in both copyright infringement and fraud?
    --
    The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50