The Wayback Machine, Friend or Foe?

Posted by Cliff on Wednesday June 19, 2002 @09:34AM from the giving-google's-cache-a-run-for-its-money dept.

ShaunC asks: "As the webmaster of numerous sites, I'm curious how others feel about the Wayback Machine. What particularly interests me is the fact that the Machine is a relatively new animal, yet it contains snapshots from my sites dating back to 1998. I can't help but wonder: where did they get such old copies of my websites, and who gave them permission to make those copies? I certainly didn't provide either. Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews." This site last made an appearance on Slashdot earlier this year. Internet archival sites are right smack in the crosshairs of copyright, but they are useful. Anyone who has ever used Google's cache (and there are plenty of those links on Slashdot) can attest to this. Of course, the issue that may bug many content providers is how to opt out of such services, since some see it as a copyright violation. Is it possible to balance the issues of copyright and history, or will these two Internet resources find themselves in legal trouble in the future?

"The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out? I manage a number of domains and the process of refining robots.txt files and submitting myself to the Wayback Machine for removal seems to be intrusive. Worse, domains I've abandoned (which have lapsed or been re-registered by someone else) are forever archived in the Machine and I have no way to exclude them. Why should I have to deliberately remove my copyrighted material from an archive which was never granted permission to replicate that material in the first place?"

26 of 508 comments

Robots.txt by mshowman · 2002-06-19 09:41 · Score: 5, Informative

I had recently placed a restricted robots.txt file on my site and when trying to access any of the past revisions, I get a message saying that the owner has restricted access to the site via robots.txt. They seem to have that aspect under control.
There are more than copyright concerns... by Anonymous Coward · 2002-06-19 09:41 · Score: 4, Insightful

It's a scary thought that things kids are saying on message boards when they're teenagers are going to be back to haunt them when they apply for jobs in their mid 40s...

I mean, if everything I posted on BBSes in the 1980s were still attributable to me... yikes.

Remember kids. Use a nickname, and change it frequently if you ever want to run for any kind of office.
1. Re:There are more than copyright concerns... by TheMonkeyDepartment · 2002-06-19 09:44 · Score: 4, Insightful
  
  Well, that's a great point, and it's a good illustration of the double-edged sword of free speech. You are free to say whatever dumbshit, ridiculous things you want. But you are also free to deal with the social consequences.
2. Re:There are more than copyright concerns... by rhaig · 2002-06-19 10:19 · Score: 4, Interesting
  
  DejaNews was my best tool to weed out resumes.
  
  Before I scheduled even a phone interview, I'd always search DejaNews for the person in question. Sometimes I'd come up with a definite hit (first and last name as well as email and mentioning the local area or some work that was on their resume) and I'd be able to see what kind of person I was really dealing with. That's when I started looking at what I'd posted.
  
  --
  "We are not tolerant people. We prefer drastically effective solutions"
3. Re:There are more than copyright concerns... by madmancarman · 2002-06-19 15:38 · Score: 4, Interesting
  
  DejaNews was my best tool to weed out resumes.
  Before I scheduled even a phone interview, I'd always search DejaNews for the person in question. Sometimes I'd come up with a definite hit (first and last name as well as email and mentioning the local area or some work that was on their resume) and I'd be able to see what kind of person I was really dealing with. That's when I started looking at what I'd posted.
  This kind of freaked me out when I started teaching in 1998 - I'd been running a large fan web site devoted to one of my favorite bands, and being heavily into the band, I posted a lot in their newsgroup and participated in more than one flame war. Of course, I was in college and in my very early 20's and late teens, but it's all archived on DejaNews now, with no way to remove it. I really doubt any public school districts are going to wise up to this (or even care, considering the national teacher shortage), but I wouldn't be surprised if it came back to haunt me in some way some day. As a previous poster mentioned, such is the burden of free speech.
  An interesting thing did happen to me at the beginning of this school year. I teach high school computer classes, and I was talking about managing that fan web site when one of my students (a junior) opened his eyes really big and pointed at me with his jaw dropped, sort of aghast. I paused and asked him what was wrong, and he exclaimed that he downloaded and used the guitar tabs I'd written years earlier when he was in junior high. I found that kind of amusing!
  I think the archiving of the internet is particularly scary when I can still find a lousy guitar tab of Pearl Jam's "Footsteps" that I did back in 1992, when I was a senior in high school piggybacking off an account at the nearby university, on my parents' Apple //e, while I was still learning how to play guitar. Obviously, the internet can have a much longer shelf life than a ProDOS 5.25" floppy (excluding news sites that "expire" their articles after limited availability).
  
  --
  First they ignore you, then they laugh at you, then they fight you, then you win. -- Gandhi
Opting out -- of publicly available HTTP??? by TheMonkeyDepartment · 2002-06-19 09:41 · Score: 4, Interesting

When you publish something on the web, it is publicly available via HTTP. End of story. Responsible netizens can observe the requests of "robots.txt" but they don't have to. If you want something more controlled, create a VPN or intranet or some other kind of non-public data server.

Your argument is similar to that of newspaper publishers who didn't like "deep linking." What they couldn't (or didn't want to) understand is that the nature of an HTTP web server is quite simple. A client asks for a file, the server gives it back. Using that protocol implies that you are OK with that. If you're not, I suggest you look into different technologies, instead of complaining about lack of control, in a medium that was never intended to provide it.
1. Re:Opting out -- of publicly available HTTP??? by KillerCow · 2002-06-19 09:56 · Score: 4, Insightful
  
  When you publish something on the web, it is publicly available via HTTP. End of story.
  
  I don't think that that is a good enough standard. When a television show is broadcast, or when a book is published, it is publicly available -- but we don't think that the publisher loses their right to copyright protection in these cases. Publishing on the web is similar. The creator wants people to see his/her creation, but does not automatically give visitors the right to archive and retransmit the works.
2. Re:Opting out -- of publicly available HTTP??? by TheCarp · 2002-06-19 10:11 · Score: 5, Interesting
  
  The other question is one of historical record.
  
  What you say does not BELONG to you. It is not property. Once you write it, it exists. You may own the medium it is on, but once it is out in the world it is uncontrollable and no longer owned. You may hold copyright... but a hundred years from now when you are long since dead and copyright is expiring, then what?
  
  We have the works of Galileo, we have letters that Thomas Jefferson wrote to people, why? because they were written. Many years later, long after the fact, these were made public and part of historic record because they survived.
  
  On the net, we have a culture of written information appearing and disappearing. This information is part of our culture; it's things that we read and see. When it goes away - for whatever reason - we have lost something.
  
  I have websites from 96 that exist now only in the Wayback Machine. Yeah, some of the stuff I said back then I don't agree with now, and would rather not have associated with me, but by that same token, I wouldn't want it to be lost forever. If someone read it and what I wrote had enough impact on them that they want to see it again... then I would not even dream of trying to stop them (even if the impact was one of disgust - an impact is an impact) - even if it's just someone wanting to see what the web looked like 5 years ago... I think that's valid... I think that's an important record of our culture.
  
  The only thing I can see a case for really is the removal of personal information that shouldn't have been public in the first place. Beyond that though, I think it's good... I mean... it's not something that is ever going to be mistaken for a live current site - you have to actually go to the Wayback Machine and ask for it.
  
  All in all this is a good thing and I hope it survives a long time.
  
  -Steve
  
  --
  "I opened my eyes, and everything went dark again"
3. Re:Opting out -- of publicly available HTTP??? by krypto246 · 2002-06-19 10:45 · Score: 4, Insightful
  
  People are just pissed about this archiving because they like the internet to be a 100% responsibility-free zone - no matter what you say or do, you can always change, edit or delete it later. How about standing behind your comments and opinions, instead of just deleting them when they can be held against you? Yes - use nicknames and aliases, but don't expect the things you put out there to be temporary. You put something out into the internet, it stays there, and it can be found later; that's the power of the net, and the price you pay for it.
Re:"The Wayback Machine" by Disevidence · 2002-06-19 09:44 · Score: 4, Insightful

I think the question is not about its being publicly available, but rather about it archiving web pages that were taken down at later dates for various reasons.

It's legally grey, and all it really takes is for some paranoid person to sue, and then the fireworks start.

IANAL.

--
Think nothing is impossible? Try slamming a revolving door.
I like it but... by rknop · 2002-06-19 09:44 · Score: 4, Insightful

When I first discovered it, it was a lot of fun. Much nostalgia; it was fun seeing earlier versions of my webpages. Some go back quite a number of years.

On the other hand, I was horrified when I realized that there was full archiving of www.dramex.org. If you visit that site, you will see that there are a large number of scripts (as in plays), many of which have restrictions on use. Over the years, we've had people request that scripts be removed from the site; of course, we did so. However, they weren't necessarily removed from the archive, and an archive keeps them forever. Specifically with the wayback machine, I was able to submit stuff that removed the specific directories I was worried about (they don't archive the scripts from www.dramex.org, just the "front page" stuff which is all part of the fun), and keep them from doing it again.

I like the idea of archives; it preserves history. The web is a transient medium, but not entirely. Yes, much of the content is dynamic and should only be dynamic. Some of it, though, is like the front page of a newspaper. Each day, what's on "today's front page" is different-- but there is value and use in seeing what was on the front page in any day in history.

But sometimes you need to delete something and make sure it really is no longer available. When you don't completely control your site (i.e. somebody else archives it, rather than just mirrors it), that becomes impossible.
(Incremental backups can have a similar issue. If you only back up files which are "newer than the last backup", your backup doesn't have the information about files which have been *deleted* since the last backup. When you restore, you might find some files there you thought shouldn't exist any more.)
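The incremental-backup problem described above can be seen in miniature with a toy sketch in Python. Everything here is hypothetical and chosen only for illustration: the file tree is modelled as a plain dict, and `incremental_backup` and `restore` are made-up names, not any real backup tool's API.

```python
# Toy model: a "filesystem" is a dict mapping path -> (mtime, content).
# All names here are hypothetical, chosen only to illustrate the point.

def incremental_backup(source, backup, last_backup_time):
    """Copy only files modified after last_backup_time into backup."""
    for path, (mtime, content) in source.items():
        if mtime > last_backup_time:
            backup[path] = (mtime, content)
    # Note what is missing: nothing here notices files that were
    # deleted from source since the previous run.
    return backup

def restore(backup):
    """Restoring simply returns everything the backup ever captured."""
    return dict(backup)

# Day 1: first backup picks up both files.
source = {"index.html": (1, "front page"), "script.txt": (1, "restricted play")}
backup = incremental_backup(source, {}, last_backup_time=0)

# Day 2: the restricted script is deleted and the front page is edited.
del source["script.txt"]
source["index.html"] = (2, "front page v2")
backup = incremental_backup(source, backup, last_backup_time=1)

restored = restore(backup)
print(sorted(restored))  # script.txt is still listed, although it was deleted
```

A backup that also recorded the list of paths present at each run could drop deleted files on restore; this sketch leaves that out on purpose, to show the failure.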

(Dramex.org has changed so that it's not straightforward to get directly to the scripts any more. META tags tell the search engines to leave the actual scripts alone, and you can only get the text itself via CGI. Yes, it's easy to subvert if you put your mind to it, but at least you do have to put your mind to it, and automated search engines or archivers won't. 90% of the security for 1% of the effort.)
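A rough sketch of how a well-behaved archiver might honor such META tags, using only Python's standard `html.parser`. Which directives Dramex.org actually used is not stated above, so treating `noindex`, `noarchive` and `none` as "leave this page alone" is an assumption, and `RobotsMetaParser` and `may_archive` are hypothetical names.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())

# Assumption: these are the directives a polite archiver treats as "keep out".
BLOCKING = {"noindex", "noarchive", "none"}

def may_archive(html):
    """Return False if the page carries a blocking robots META directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not (parser.directives & BLOCKING)

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(may_archive(page))                                                # False
print(may_archive("<html><head><title>Front page</title></head></html>"))  # True
```

As the comment says, this only stops robots that choose to check; nothing in the tag enforces anything.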

-Rob
As a webmaster of various sites... by schon · 2002-06-19 09:45 · Score: 5, Insightful

As a webmaster of various sites, I have no problem with archives.. if I didn't want people to see my stuff, I wouldn't have put it on the internet in the first place.

where did they get such old copies of my websites, and who gave them permission to make those copies?

They probably got the copies the same way everybody else did - by surfing. You (implicitly) gave them permission to cache your sites by not including an appropriate entry in your robots.txt.

The way I see it, archives are much like SPAM; I never opted in, why should it be my responsibility to opt out?

Archives are nothing like spam. Spam is primarily harassment. These guys aren't harassing you. They did ask your permission (by way of checking your robots.txt). If you've since changed your mind, it's your responsibility to notify them.

Google caches material too - do you consider them to be spam as well?

Archive sites provide a valuable resource to the rest of the 'net. If you don't like it, put an appropriate entry in your robots.txt file, and be done with it.
Preserving information is important. by Chiasmus_ · 2002-06-19 09:46 · Score: 5, Insightful

I doubt that I'm alone in my belief that it is always tragic when any piece of information--no matter how trivial--is lost forever.

If a person has offered that information for free at any point, to the extent that an automated script could access it, then I believe that information can be safely considered public domain. I doubt that there's any mechanism by which Richard M. Stallman could lose his mind and "rein in" all copies of GNU, or by which Stephen King could recall all his novels and refund the purchase price; once something is offered to the public, it no longer belongs exclusively to the publisher.

In my opinion, the value of archives in the future immeasurably outweighs occasional inconveniences of having information stick around longer than the author would have wished.

--
"Beware he who would deny you access to information, for in his heart he deems himself your master."
Archives need to be made by Waffle+Iron · 2002-06-19 09:48 · Score: 4, Insightful

If the courts determine that it is technically illegal to make archives of electronic content, then the copyright laws should be changed to explicitly allow archiving. Otherwise, we could eventually lose track of history. The only written record of large portions of our civilization would be relegated to a few rusting web server hard drives buried in landfills.
If you read 1984, you might remember that the government tightly controlled all old copies of documents so that they could manipulate history as they wished. We might get into a similar situation by accident if we don't allow independent archives of electronic information.
With traditional media, you publish something on paper, but you don't get to control who puts the paper copies in which archives. That has served us well for keeping track of history, and an equivalent system needs to be maintained for electronic content.
Friend to Hosting Companies by Da+J+Rob · 2002-06-19 09:50 · Score: 5, Funny

I was talking to this guy who works for a web hosting company, and he says a fourth of his sales calls are people calling him up 'cause they're pissed that their last hosting company 'lost' their site (in reality, most of the time it's later found out that the guy deleted it himself or renamed index.html to index2.html, etc.). He says that for 90% of the sites he can find a copy on the Wayback Machine. He'll then start to quote the website's contents to the guy on the phone and usually will have the amazed (and dumbfounded) customer signing a hosting contract by the end of the day.
Re:"The Wayback Machine" by martyn+s · 2002-06-19 09:58 · Score: 4, Insightful

So I suppose libraries should just stop carrying books because the author doesn't like what he wrote anymore? I mean, what the fuck?
TV Broadcast analogy by rknop · 2002-06-19 09:58 · Score: 4, Interesting

Some have already drawn analogies to TV broadcasts, saying hey, it was broadcast, you get to keep a copy. You can't bitch now if people still have that copy, unless you're Jack Valenti.

You can spin this how you want. Here's one valid way to think about it though: a TV network broadcasts a show. You make a private copy on a VCR tape. Jack Valenti aside, you can watch that copy again as often as you like, and it's no big deal. However, you do *not* have the right to rebroadcast your copy of that show to the public without the permission of the original copyright holder. (I have my B5 tapes. I'm watching them through again now, showing them to my wife. I'm sure nobody is upset about this. But I'd be in deep doo-doo if I managed to broadcast them on a local access station, or uploaded them to a public website.)

If you are inclined to be negative about the Wayback Machine, you could view it this way. While the page existed on the original site, it was broadcast to the public. If somebody made a personal copy, they have it and will always have it, even if the site goes down. However, when the site goes down, individuals do not necessarily have the right to then "rebroadcast" (i.e. post) themselves the content they downloaded and kept. This, however, is what the WayBack machine is doing.

Mind you, except for the issue with www.dramex.org that I noted above (and which I fixed long ago), I like the WayBack machine, and am happy that they archived the content which was implicitly copyrighted to me. I would have opted in if I had wanted to. But, of course, I didn't know about it back in 1996 to opt in.

I don't have a good answer to the questions. Just thoughts.

-Rob
Library archives are given broader copyright uses by tiltowait · 2002-06-19 10:02 · Score: 5, Informative

.... and wayback is sponsored, amongst others, by the Library of Congress. The archive is itself a 501(c)(3) public nonprofit. See 17 U.S.C. SECTION 108(a)(3) for more information.

Strange that such a complaint would appear within a group espousing that "information wants to be free." :)
Purist? Pure what? by American+AC+in+Paris · 2002-06-19 10:03 · Score: 5, Insightful

Perhaps I'm too much of a purist, but I've always seen the internet as an ever-changing medium, not a permanent one. Archives have bothered me ever since the fledgling days of DejaNews.
I'd say it makes you more of a control freak than a purist, personally.
Seriously, how did you ever get it into your head that a medium that serves documents to the general public on demand would be somehow exempt from archiving?
Would it bother you if John Q. Savant could recite the contents of your web pages from memory ten years after you'd taken them down?
Would it bother you to learn that stock prices, perhaps the most "ever-changing" thing out there, are permanently archived by a variety of services?
Or are you just jittery at the thought that your spouse/boss/Friendly Neighborhood Representative of The Man/kids may be able to someday look at the shite you plastered all over the web in your younger days? ("Ech, that stupid Netscape 2 animated title hack--honey, you actually -did- that?")

--
Obliteracy: Words with explosions
You have given permission by MrResistor · 2002-06-19 10:08 · Score: 4, Insightful

By the very act of posting your site on the web you have given permission to make copies of it. Otherwise, how would anyone view it? And if no one is supposed to view it, why have you published it in a publicly accessible space?

If I went to your website 2 years ago and never closed or refreshed that browser window, would I now be violating your copyright? What if I saved the page so I could view it later offline? What if I never erased that file, would that mean that I'm violating your copyright? I have several floppies of web sites I saved at school for viewing at home from the days when I was stuck on a crappy dial-up service. Does that make me a pirate? What about all the copies of sites held in my browser's cache?

Don't get me wrong, I understand where the sentiment is coming from, even if I disagree with it. I'm just trying to point out how incongruous it is with the basic nature of computers and the internet and how they work.

These questions aside, though, I have to come down in favor of the historians. People here are always whining about old movies/books/music being lost because their owners refuse to let them go, even if they aren't using them. Why should the web suffer the same fate? The rate of destruction is far faster on the internet, and since it isn't a physical medium, the information has to be actively archived if it is to be preserved.

--
Under capitalism man exploits man. Under communism it's the other way around.
dating back to 1998 by quantaman · 2002-06-19 10:28 · Score: 4, Funny

Anyone else find it mildly disturbing that 1998 is considered to be distant history?

--
I stole this Sig
Someone hasn't done their research by mfos.org · 2002-06-19 10:32 · Score: 4, Informative

A few things

1) They've been archiving since 1998, but they've only recently had the horsepower to provide a live connection to it.

2) It is very easy to not have your stuff indexed. The directions are here.
Re:Erm by kevinank · 2002-06-19 10:34 · Score: 5, Insightful

The goal of the person who started archive.org was to record the history of the world wide web. The assumption was that whatever anyone thinks about the archive, there will never be another chance to go back and get that data once it is lost.
The copies that they have archived in their databases are individual copies served from the original web requests, so they have the right to keep them. They became their copy when they were originally downloaded. Whether they have the right to make new copies and redistribute them depends on how you think fair use applies to that content.
Ultimately if a lot of people start suing them they will probably shut down the archive to public access and only allow researchers to view their original copies on site. And if you'd prefer that, well, you'll end up with the world you deserve.

--
LibBT: BitTorrent for C - small - fast - clean (Now Versio
Re:"The Wayback Machine" by Rick+the+Red · 2002-06-19 10:34 · Score: 5, Insightful

No, the issue is more akin to a library carrying newspapers and magazines for years, and their publishers suddenly telling the libraries "those copies are out of date, stop letting people read them." Why? If you didn't want anyone to read it, why did you put it out on the web?
Are you ashamed of what you did back then, when you were young and foolish? Grow up -- we're all ashamed of what we did when we were young and foolish, and years from now you'll be ashamed of what you're doing today. Get over it.
Personally, I think archives are great. Whenever I design an application I always ask about archiving, because inevitably they're gonna want it and it's easier to design in from the start. Oh, you want to know what your top 10 customers ordered last Christmas? Now you tell me! Geeze, we flushed that data last February, 'cause you said once the credit card cleared you didn't care to pay for the storage. But I digress.
Someday your next client will want examples of your previous work, then you'll go crawling on your hands and knees to the Wayback Machine, begging them to show you what your pages looked like. And they'll honor your robots.txt file and tell you to get lost.

--
If all this should have a reason, we would be the last to know.
Re:Erm by Ross+C.+Brackett · 2002-06-19 10:42 · Score: 5, Funny

Well, the default is to not plug your server into the Internet in the first place, now isn't it? To quote Doug from Ghost World, "It's America, dude, learn the rules."

Seriously, if someone's precious intellectual property - as if anything worthwhile was ever posted on the Internet in the first place - becomes compromised because they don't know a basic principle of how to run a website, well then boo hoo.

It's worth the tradeoff. That the Wayback Machine exists is seriously cool, and some day will be of definite historical worth. If the occasional Brady Bunch erotic slash fiction author has to take a ride on the waaahmbulance because "A Very Brady Gangbang (M/m/F/f nc b/d)" got copied without their permission for the greater historical good, then that's a price worth paying.
Re:Erm by dswensen · 2002-06-19 10:59 · Score: 5, Informative

Yes it does, and how. In fact, immediately upon reading this story, I went to the Wayback Machine and checked out my personal website archive. There it was, material dating back to 1996 ("Oh God, no, not the digging man GIF!"). I made a new robots.txt file:

User-agent: *
Disallow: /
# BITE ME WAYBACK MACHINE

... uploaded it, went back to the Wayback Machine, and got:

Robots.txt Query Exclusion.

We're sorry, access to [site] has been blocked by the site owner via robots.txt.
Read more about robots.txt
See the site's robots.txt file.
Try another request or click here to search for all pages on [site]

So, yeah, they seem to check the site for the most current robots.txt file before they show the archive. And if the robots.txt disallows archiving the site, ALL the entries are marked unavailable, not just the current ones.

So, it's pretty easy to solve the problem of the Wayback Machine -- and probably without going balls-out with the "disallow everything everywhere" like I did.
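For the less drastic route, here is a small sketch that uses Python's standard `urllib.robotparser` to compare a blanket rule with a narrower one. Several things are assumptions for illustration only: `ia_archiver` as the user-agent name the archive's crawler answers to, the `example.org` URLs, the `/scripts/` directory, and the helper name `allowed`. It models ordinary robots.txt rule matching, not how the Wayback Machine applies the file to snapshots it already holds.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The "disallow everything everywhere" file from the comment above.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

# A narrower alternative: keep one (assumed) crawler out of one directory.
BLOCK_ONE_DIRECTORY = """\
User-agent: ia_archiver
Disallow: /scripts/
"""

front_page = "http://example.org/index.html"    # hypothetical URLs
script = "http://example.org/scripts/play.txt"

print(allowed(BLOCK_EVERYONE, "ia_archiver", front_page))       # False
print(allowed(BLOCK_ONE_DIRECTORY, "ia_archiver", front_page))  # True
print(allowed(BLOCK_ONE_DIRECTORY, "ia_archiver", script))      # False
print(allowed(BLOCK_ONE_DIRECTORY, "SomeOtherBot", script))     # True
```

With the narrower file, the front page stays open to the archiver and other robots are unaffected, which is closer to what the Dramex.org poster earlier in the thread described wanting.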