Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)

Posted by EditorDavid on Saturday April 22, 2017 @08:09PM from the universal-access-to-all-knowledge dept.

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."

But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?

8 of 174 comments

yeah by Anonymous Coward · 2017-04-22 20:11 · Score: 5, Informative

yeah!
1. Re:yeah by ArmoredDragon · 2017-04-22 20:25 · Score: 5, Informative
  
  Law of headlines indeed, and there's already an established way for web developers to indicate that they don't want content cached or archived while still being searchable:
  <meta name="robots" content="noarchive">
  So archive.org could just honor that, and the problem would be solved. Google honors exactly this.
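
For illustration, here is a minimal sketch of how a crawler might honor that tag, using only Python's standard library. The names `RobotsMetaParser` and `allows_archiving` are made up for this example; nothing here describes how archive.org or Google actually implement the check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def allows_archiving(html_text):
    """Return False if any robots meta tag on the page says 'noarchive'."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" not in parser.directives


page = '<html><head><meta name="robots" content="noarchive"></head></html>'
print(allows_archiving(page))
```

The catch is that the crawler has to fetch the page before it can see the tag, so this only controls what gets stored or shown, not what gets requested.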
2. Re:yeah by Zocalo · 2017-04-22 21:45 · Score: 5, Informative
  
  Even more specific robots.txt directive for this instance:
  
  User-agent: ia_archiver
  Disallow: /
  
  As is often the case, Lauren is going off half-cocked with only part of the story. The IA already has a policy for removal requests (email info@) and is only considering extending its current practice of ignoring robots.txt beyond the present "test zone" of the .gov and .mil domains, where it has not had any problems. They probably will do that (and for their archival purposes it's a good idea in principle), but before starting to scream and shout I think it's only fair to see whether they listen to the feedback and provide a specific opt-out policy and technical mechanisms, like at least honoring either of the above, prior to going live on the rest of the Internet. It's going to be a two-way street anyway, because they're going to find a lot more sites that feed multiple MB of pseudo-random crap to spiders that ignore robots.txt in order to do things like poison spammers' address lists, so it's actually in their best interests to provide an opt-out they honor.
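
As a rough sketch of what honoring a crawler-specific robots.txt opt-out looks like, the check can be done with Python's standard `urllib.robotparser`. The helper name `can_crawl` and the `example.com` URL are placeholders for this example, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks only the ia_archiver user agent.
# The field name is hyphenated and each directive sits on its own line.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ia_archiver", "Googlebot"):
    print(agent, can_crawl(ROBOTS_TXT, agent, "http://example.com/page.html"))
```

With these rules the archiver is turned away while other crawlers are unaffected, which is the kind of targeted opt-out being asked for.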
  
  Besides, it's going to be interesting to see what kind of idiotic crap web admins who should know better think is safely hidden and/or secured because of robots.txt - it's useful to know who is particularly clueless so you can avoid them at all costs. :)
  
  --
  UNIX? They're not even circumcised! Savages!
3. Re:yeah by Zocalo · 2017-04-22 23:11 · Score: 5, Informative
  
  IA does still spider, but they seem to use a more nuanced system than the rudimentary "start at /, then recursively follow every link" approach used by more trivial site spider algorithms. Firstly, they don't download an entire site in one go - they spread things out over time to avoid putting large spikes into the traffic pattern, which is friendlier for sites that are bandwidth limited and on things like "xGB/month" plans. Secondly, they have a "popularity weighting" system that governs the order in which they spider and refresh sections of a given site, which is the main reason for the difference between the level of content for popular and less popular sites - although I have no idea whether that's based entirely on something like the site's Alexa ranking or is also weighted against how dynamic the content is (e.g. a highly dynamic site like Slashdot would get a bump up the priority, whereas a mostly static reference site might get downgraded). Combine the two approaches and you get the results you are seeing: major web homepages get spidered more or less every day with several levels of links retrieved, while some random personal blog only gets spidered every few weeks or more, and only with the homepage and first level or two of links ever getting looked at.
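
To make the idea concrete, here is a toy sketch of a popularity-weighted revisit schedule built on a priority queue. Everything in it is an assumption for illustration: the 50/50 weighting, the 30-day base interval, and the names `revisit_interval_days` and `build_schedule` are invented, and it is not a description of how the IA actually schedules crawls. Crawl depth is not modelled.

```python
import heapq


def revisit_interval_days(popularity, change_rate, base_days=30.0):
    """Hypothetical weighting: both inputs are in [0, 1]; higher values
    shorten the gap between visits, down to a floor of one day."""
    weight = 0.5 * popularity + 0.5 * change_rate
    return max(1.0, base_days * (1.0 - weight))


def build_schedule(sites, horizon_days):
    """Return (day, site) crawl events in time order.

    sites maps a site name to a (popularity, change_rate) pair.
    """
    # Every site gets a first visit on day 0; the heap keeps the
    # earliest pending visit on top.
    queue = [(0.0, name) for name in sorted(sites)]
    heapq.heapify(queue)
    events = []
    while queue:
        day, name = heapq.heappop(queue)
        if day >= horizon_days:
            continue  # past the planning window, do not reschedule
        events.append((day, name))
        popularity, change_rate = sites[name]
        next_day = day + revisit_interval_days(popularity, change_rate)
        heapq.heappush(queue, (next_day, name))
    return events


sites = {"busy-news-site": (1.0, 1.0), "quiet-personal-blog": (0.0, 0.0)}
events = build_schedule(sites, horizon_days=30)
for name in sites:
    print(name, sum(1 for _, visited in events if visited == name))
```

Under these made-up weights the busy site is visited daily and the quiet one only once in the 30-day window, roughly the pattern described above.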
  
  --
  UNIX? They're not even circumcised! Savages!
No brainer by fnj · 2017-04-22 20:14 · Score: 5, Insightful

Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.
1. Re:No brainer by thsths · 2017-04-22 20:20 · Score: 5, Insightful
  
  But that is not the question asked, is it?
  robots.txt should apply to the page at the time. I do not see any decent argument against that.
  But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
2. Re:No brainer by blackest_k · 2017-04-22 23:11 · Score: 5, Insightful
  
  One problem I run into is with owner's manuals for old film cameras: a lot of the time they disappear from the company website when the company gets taken over by another one. Sometimes archive.org can come to the rescue if I can find where they used to be. Fair enough, the new company may only be interested in the digital models and have no interest in the historical products made by the company it acquired, but they make boneheaded choices like erasing the historical information the original company put out for its customers.
  Worse still is when a domain name lapses and is bought by another company that had zero access to the content of the former site. They bought a name, not a right to control the history of the former site.
  The other thing that bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
  At the very least, when a website changes hands the new owner should not be able to erase the history of the site under the previous owner.
  
  --
  Blarney Quality Restaurant, Plants
YES!! by Vadim Makarov · 2017-04-22 22:48 · Score: 5, Insightful

I applaud the direction the Internet Archive is taking. They should fully implement it.
A year ago one of my domain names was stolen through the negligence of the registrar. The site was a non-profit resource that I had maintained for the past 15 years. The squatter who now owns the name put a deny-all rule in robots.txt. As a result, the website, with its quantity of useful information, has totally disappeared from existence and from the archive record.
I do not see sufficiently important reasons to remove information that was once publicly accessible. There are some reasons, but the public benefits of having access to all past public information outweigh them all.

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.