The Wayback Machine, Friend or Foe?
ShaunC asks: "As the webmaster of numerous sites, I'm curious how others feel about the Wayback Machine. What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998. I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies? I certainly didn't provide either. Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews." This site last made an appearance on Slashdot earlier this year. Internet archival sites are right smack in the crosshairs of copyright, but they are useful. Anyone who has ever used Google's cache (and there are plenty of those links on Slashdot) can attest to this. Of course, the issue that may bug many content providers is how to opt out of such services, since some see it as a copyright violation. Is it possible to balance the issues of copyright and history, or will these two Internet resources find themselves in legal trouble in the future?
"The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out? I manage a number of domains and the process of refining robots.txt files and submitting myself to the Wayback Machine for removal seems to be intrusive. Worse, domains I've abandoned (which have lapsed or been re-registered by someone else) are forever archived in the Machine and I have no way to exclude them. Why should I have to deliberately remove my copyrighted material from an archive which was never granted permission to replicate that material in the first place?"
Slashdot from 1997.
In college, really poor, need a flatscreen.
"The Wayback Machine" has been a pet project for a long time, and we're only now seeing results. I know for a fact that they have pages back at least as far as 1996, and it's a damn shame they don't have anything that much earlier...
And yes, it obeys the Robots Exclusion Standard.
"Ask Google" strikes again; I would hope that you could find all of this information by searching, or reading an "About" page, or something. Fortunately, these abortions to journalism don't appear on the Front Page very often.
pb Reply or e-mail; don't vaguely moderate.
I had recently placed a restrictive robots.txt file on my site, and when trying to access any of the past revisions, I get a message saying that the owner has restricted access to the site via robots.txt. They seem to have that aspect under control.
Of course, in practice you have to pursue this and ask them to remove it.
If you really object, I suggest sending them a list of every site you have or have had, with dates, and a request to remove everything. Then you only need to notify them when you put up a new site that should also be excluded. That would not be such a nuisance, would it?
That said, I think they are providing an interesting service, so unless you are harmed by it, why object?
I am interested in knowing how they had such old versions of your site though. Do search engines keep archives?
What's the problem?
If you do something illegal on your website, you won't be held responsible more than once just because the data persists on the Wayback Machine. If you remove the offensive material from your site, that's all you can do. The Wayback Machine can deal with their own lawsuit threats. And I'm sure they'll remove material if you are the site owner and ask nicely.
As for outdated information, anyone reading pages on the Wayback Machine and expecting them to be current would have to be crazy. It's an archive, after all.
It's easy to opt out. Google provides instructions in their webmaster FAQ, which points out "There is a standard for robot exclusion at http://www.robotstxt.org/wc/norobots.html."
Yes, it does follow the robots.txt protocol. Therefore there really isn't a problem, now is there?
Jeremy
Shoot, that should be:
User-agent: *
Disallow: /
Jeremy
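For anyone who wants to sanity-check a robots.txt before uploading it, here is a minimal sketch using Python's standard urllib.robotparser. It only shows how a standards-following parser reads the two rules; the example.com URL and the allowed() helper are placeholders made up for illustration, and nothing here claims to be how any particular crawler is implemented.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "Disallow: /" shuts every path off for every robot.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# An empty "Disallow:" value means the opposite: nothing is disallowed.
BLOCK_NOTHING = "User-agent: *\nDisallow:\n"

URL = "http://example.com/index.html"  # placeholder URL, not a real site of mine

print(allowed(BLOCK_ALL, "ia_archiver", URL))      # False
print(allowed(BLOCK_NOTHING, "ia_archiver", URL))  # True
```

The thing to watch is the slash: leave it off the Disallow line and the file permits everything.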
... and the Wayback Machine is sponsored by, amongst others, the Library of Congress. The archive itself is a 501(c)(3) public nonprofit. See 17 U.S.C. Section 108(a)(3) for more information.
:)
Strange that such a complaint would appear within a group espousing that "information wants to be free."
Alexa does the Archive's crawl. Notice that Brewster Kahle's name is attached to both.
Yeah, you can add a robots.txt file and ask them to remove your site, and it'll be wiped from their records. The problem is, if you don't have access to the site anymore, you can't put up a robots.txt file. I just checked on a web page I had asked them to remove. It no longer existed, so I couldn't put up a robots.txt file, but I made the request anyway.
It looks like the page has been removed! My guess is that if you request removal of a page that doesn't exist anymore, they probably will remove it for you. This web page revealed me as the pothead and pro-marijuana person that I was (and still am, though in private) back in college. I was afraid my employers were going to find my old web page, but they're probably potheads too. But still, it's good to be able to cover up the silliness of my past.
Zoot!
A few things
1) They've been archiving since 1998, but they've only recently had the horsepower to provide a live connection to it.
2) It is very easy to not have your stuff indexed. The directions are here.
Already used in the Go.com GoTo.com trademark suit 3+ years ago.
Yes it does, and how. In fact, immediately upon reading this story, I went to the Wayback Machine and checked out my personal website archive. There it was, material dating back to 1996 ("Oh God, no, not the digging man GIF!"). I made a new robots.txt file:
User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE
... uploaded it, went back to the Wayback Machine, and got:
Robots.txt Query Exclusion.
We're sorry, access to [site] has been blocked by the site owner via robots.txt.
Read more about robots.txt
See the site's robots.txt file.
Try another request or click here to search for all pages on [site]
So, yeah, they seem to check the site for the most current robots.txt file before they show the archive. And if the robots.txt disallows archiving the site, ALL the entries are marked unavailable, not just the current ones.
So, it's pretty easy to solve the problem of the Wayback Machine -- and probably without going balls-out with the "disallow everything everywhere" like I did.
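To make the retroactive part concrete, here is a rough sketch of the behavior I observed: the site's current robots.txt decides which stored snapshots get shown, whatever year they were captured. This is a guess at the logic from the outside, not the Archive's actual code. The visible_snapshots() function and the snapshot tuples are invented for illustration, and the ia_archiver agent name is the one mentioned elsewhere in this thread.

```python
from urllib.robotparser import RobotFileParser


def visible_snapshots(current_robots_txt, snapshots, agent="ia_archiver"):
    """Return the (year, path) snapshots that today's robots.txt still permits.

    Hypothetical model: every stored snapshot is checked against the
    robots.txt the site serves right now, not the one from capture time.
    """
    rules = RobotFileParser()
    rules.parse(current_robots_txt.splitlines())
    return [(year, path) for year, path in snapshots
            if rules.can_fetch(agent, path)]


# Made-up archive contents for a single site.
SNAPSHOTS = [(1996, "/index.html"), (1999, "/index.html"), (2001, "/about.html")]

# With no robots.txt rules at all, every snapshot stays visible.
print(visible_snapshots("", SNAPSHOTS))

# After uploading a blanket exclusion, old and new snapshots alike are hidden.
print(visible_snapshots("User-agent: *\nDisallow: /\n", SNAPSHOTS))
```

Under this model a narrower rule, such as disallowing a single directory, would hide only the matching paths, which is why the blanket rule is more than most people need.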
err, as someone pointed out earlier, copyright law gives libraries and archives special fair use powers.
User-agent: ia_archiver
Disallow: /
Most (all?) search engines provide information on how to specifically exclude their spiders (while allowing everyone else). Just go to the engine's site and search for info on how they treat robots.txt.
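If you want to confirm that a per-robot rule like the one above blocks only the archive crawler, a quick check with Python's standard urllib.robotparser might look like this. It is a sketch of how a standards-following parser reads the file; "Googlebot" and the example.com URL are just stand-ins for "some other spider" and "some page", and may_crawl() is a helper made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The rule from the comment above: exclude only the ia_archiver robot.
ARCHIVE_ONLY = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ARCHIVE_ONLY.splitlines())


def may_crawl(agent, url="http://example.com/"):
    """Report whether `agent` may fetch `url` under ARCHIVE_ONLY."""
    return parser.can_fetch(agent, url)


for agent in ("ia_archiver", "Googlebot"):
    print(agent, may_crawl(agent))
```

Because the file has no "User-agent: *" section, any robot not named in it falls through to the default, which is to allow the fetch.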
I read an article about the site. The project has actually been running since 1998; that's when they started collecting people's websites and adding hardware to their 'collective' to store all the data. They only made the site public around 2001 (or whenever it was), despite collecting for so long.
I think if you use the Wayback Machine to go back to their own site in 1998/1999 their front page tells you this.
"Hey! Unless this is a nude love-in, get the hell off my property!!"