The Wayback Machine, Friend or Foe?

Posted by Cliff on Wednesday June 19, 2002 @09:34AM from the giving-google's-cache-a-run-for-its-money dept.

ShaunC asks: "As the webmaster of numerous sites, I'm curious how others feel about the Wayback Machine. What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998. I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies? I certainly didn't provide either. Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews." This site last made an appearance on Slashdot earlier this year. Internet archival sites are right smack in the crosshairs of copyright, but they are useful. Anyone who has ever used Google's cache (and there are plenty of those links on Slashdot) can attest to this. Of course, the issue that may bug many content providers is how to opt out of such services, since some see archiving as a copyright violation. Is it possible to balance the issues of copyright and history, or will these two Internet resources find themselves in legal trouble in the future?

"The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out? I manage a number of domains and the process of refining robots.txt files and submitting myself to the Wayback Machine for removal seems to be intrusive. Worse, domains I've abandoned (which have lapsed or been re-registered by someone else) are forever archived in the Machine and I have no way to exclude them. Why should I have to deliberately remove my copyrighted material from an archive which was never granted permission to replicate that material in the first place?"

Yummy by sheepab · 2002-06-19 09:38 · Score: 2, Informative

Slashdot from 1997.

--

In college, really poor, need a flatscreen.
"The Wayback Machine" by pb · 2002-06-19 09:40 · Score: 3, Informative

"The Wayback Machine" has been a pet project for a long time, and we're only now seeing results. I know for a fact that they have pages back at least as far as 1996, and it's a damn shame they don't have anything that much earlier...

And yes, it obeys the Robots Exclusion Standard.

"Ask Google" strikes again; I would hope that you could find all of this information by searching, or reading an "About" page, or something. Fortunately, these abortions to journalism don't appear on the Front Page very often.

--
pb Reply or e-mail; don't vaguely moderate.
Robots.txt by mshowman · 2002-06-19 09:41 · Score: 5, Informative

I recently placed a restrictive robots.txt file on my site, and now when I try to access any of the past revisions, I get a message saying that the owner has restricted access to the site via robots.txt. They seem to have that aspect under control.
Legally you can stop them, but why? by the_womble · 2002-06-19 09:43 · Score: 3, Informative

If you own the copyright, they cannot archive it without your permission; legally, that is all there is to it.
Of course, in practice you have to pursue this and ask them to remove it.
If you really object, I suggest sending them a list of every site you have or have had, with dates, and a request to remove everything. Then you only need to notify them when you put up a new site that should also be excluded. That would not be such a nuisance, would it?
That said, I think they are providing an interesting service, so unless you are harmed by it, why object?
I am interested in knowing how they had such old versions of your site, though. Do search engines keep archives?
I love it. by gripdamage · 2002-06-19 09:45 · Score: 3, Informative

What's the problem?

If you do something illegal on your website, you won't be held responsible more than once just because the data persists on the Wayback Machine. If you remove the offensive material from your site, that's all you can do. The Wayback Machine can deal with its own lawsuit threats. And I'm sure they'll remove material if you are the site owner and ask nicely.

As far as outdated information goes, anyone reading pages on the Wayback Machine and expecting them to be current would have to be crazy. It's an archive, after all.

It's easy to opt out. Google provides instructions in their webmaster FAQ, which points out: "There is a standard for robot exclusion at http://www.robotstxt.org/wc/norobots.html."
Re:Erm by JebusIsLord · 2002-06-19 09:48 · Score: 2, Informative

Yes, it does follow the robots.txt protocol. So there really isn't a problem, now is there?

--
Jeremy
Re:Erm by JebusIsLord · 2002-06-19 09:56 · Score: 2, Informative

Shoot, that should be:

User-agent: *
Disallow: /

--
Jeremy
Library archives are given broader copyright uses by tiltowait · 2002-06-19 10:02 · Score: 5, Informative

... and the Wayback Machine is sponsored by, among others, the Library of Congress. The Archive itself is a 501(c)(3) public nonprofit. See 17 U.S.C. Section 108(a)(3) for more information.

Strange that such a complaint would appear within a group espousing that "information wants to be free." :)
Re:For what it's worth... by MushMouth · 2002-06-19 10:28 · Score: 2, Informative

Alexa does the Archive's crawl. Notice that Brewster Kahle's name is attached to both.
Re:Erm by zootread · 2002-06-19 10:31 · Score: 2, Informative

Yeah, you can add a robots.txt file and ask them to remove your site, and it'll be wiped from their records. The problem is, if you don't have access to the site anymore, you can't throw in the robots.txt file. But I just checked on a web page I had asked them to remove. It no longer existed, so I couldn't put up a robots.txt file, but I made the request anyway.

It looks like the page has been removed! My guess is that if you request removal of a page that doesn't exist anymore, they probably will remove it for you. This web page revealed me as the pothead and pro-marijuana person that I was (and still am, though in private) back in college. I was afraid my employers were going to find my old web page, but they're probably potheads too... Still, it's good to be able to cover up the silliness of my past.

--
Zoot!
Someone hasn't done their research by mfos.org · 2002-06-19 10:32 · Score: 4, Informative

A few things:

1) They've been archiving since 1998, but they've only recently had the horsepower to provide a live connection to it.

2) It is very easy to not have your stuff indexed. The directions are here.
Re:court evidence? by MushMouth · 2002-06-19 10:36 · Score: 2, Informative

Already used in the Go.com/GoTo.com trademark suit 3+ years ago.
Re:Erm by dswensen · 2002-06-19 10:59 · Score: 5, Informative

Yes it does, and how. In fact, immediately upon reading this story, I went to the Wayback Machine and checked out my personal website archive. There it was, material dating back to 1996 ("Oh God, no, not the digging man GIF!"). I made a new robots.txt file:

User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE

... uploaded it, went back to the Wayback Machine, and got:

Robots.txt Query Exclusion.

We're sorry, access to [site] has been blocked by the site owner via robots.txt.
Read more about robots.txt
See the site's robots.txt file.
Try another request or click here to search for all pages on [site]

So, yeah, they seem to check the site for the most current robots.txt file before they show the archive. And if the robots.txt disallows archiving the site, ALL the entries are marked unavailable, not just the current ones.

So, it's pretty easy to solve the problem of the Wayback Machine -- and probably without going balls-out with the "disallow everything everywhere" like I did.
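
The behavior described above (the archive consulting the site's current robots.txt at display time and hiding every snapshot, old or new, if it is blocked) can be sketched roughly as follows. This is only a guess at the logic, not the Archive's actual code: the function name `visible_snapshots` and the snapshot dates are made up for illustration, and `ia_archiver` is the crawler name quoted elsewhere in this thread. It uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Crawler name quoted elsewhere in this thread; assumed here, not verified.
ARCHIVE_AGENT = "ia_archiver"


def visible_snapshots(snapshots, current_robots_txt, agent=ARCHIVE_AGENT):
    """Return the snapshots that may be shown, judged only by the site's
    *current* robots.txt (hypothetical helper, whole-site check on "/")."""
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    if not parser.can_fetch(agent, "/"):
        # Blocked today means every snapshot is hidden, however old it is.
        return []
    return list(snapshots)


# Made-up snapshot dates for illustration.
snapshots = ["1996-12-01", "1999-05-10", "2002-06-01"]

blanket_block = """User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE
"""

hidden = visible_snapshots(snapshots, blanket_block)
shown = visible_snapshots(snapshots, "")  # an empty robots.txt blocks nothing
```

With the blanket rule in place, `hidden` comes back empty; with no rules at all, `shown` contains all three dates.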
Re:Like it or not, it's the law by Anonymous Coward · 2002-06-19 12:12 · Score: 1, Informative

Err, as someone pointed out earlier, copyright law gives libraries and archives special fair use powers.
Re:Erm by guttentag · 2002-06-19 18:56 · Score: 3, Informative

A number of people who don't want their content archived by the Internet Archive may still want search engines to direct traffic to their sites (The Washington Post does this). If that's the case, use this in your robots.txt file:

User-agent: ia_archiver
Disallow: /

Most (all?) search engines provide information on how to specifically exclude their spiders (while allowing everyone else). Just go to the engine's site and search for info on how they treat robots.txt.
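
To sanity-check a targeted rule like this before uploading it, one option is to run it through Python's standard `urllib.robotparser` and see which crawlers it shuts out. A minimal sketch: the helper name `blocked_agents` is made up, and `Googlebot` is used only as an example of a search engine crawler that should stay allowed.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents, path="/"):
    """Map each user-agent name to True if robots_txt blocks it from path
    (hypothetical helper for illustration)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}


targeted_rule = """User-agent: ia_archiver
Disallow: /
"""

result = blocked_agents(targeted_rule, ["ia_archiver", "Googlebot"])
for agent, blocked in result.items():
    print(agent, "blocked" if blocked else "allowed")
```

Here `ia_archiver` is reported as blocked while `Googlebot` stays allowed, since no rule in the file names it and there is no `User-agent: *` fallback. Real crawlers may interpret edge cases differently from this parser, so treat it as a quick check rather than a guarantee.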
Websites from 1998.. by Chicane-UK · 2002-06-19 19:28 · Score: 2, Informative

I read an article about the site. The project has actually been running since 1998; that's when they started collecting people's websites and adding hardware to their 'collective' to store all the data. They only made the site public in, like, 2001 (or whenever it was), despite collecting for so long.

I think if you use the Wayback Machine to go back to their own site in 1998/1999, their front page tells you this.

--
"Hey! Unless this is a nude love-in, get the hell off my property!!"