Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)

Posted by EditorDavid on Saturday April 22, 2017 @08:09PM from the universal-access-to-all-knowledge dept.

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."

But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?

No brainer by fnj · 2017-04-22 20:14 · Score: 5, Insightful

Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.
1. Re:No brainer by thsths · 2017-04-22 20:20 · Score: 5, Insightful
  
  But that is not the question asked, is it?
  robots.txt should apply to the page at the time. I do not see any decent argument against that.
  But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
2. Re:No brainer by blackest_k · 2017-04-22 23:11 · Score: 5, Insightful
  
  One problem I run into is with owner manuals for old film cameras: a lot of the time they disappear from the company website when the company gets taken over by another one. Sometimes archive.org can come to the rescue if I can find where they used to be. Fair enough, the new company may only be interested in the digital models and have no interest in the historical products made by the company it acquired, but then they make boneheaded choices like erasing the historical information the original company put out for its customers...
  Worse still is when a domain name lapses and is bought by another company that had zero access to the content of the former site. They bought a name, not a right to control the history of the former site.
  The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
  At the very least when a website changes hands the new owner should not be able to erase the history of the site under the previous owner.
  
  --
  Blarney Quality Restaurant, Plants
3. Re:No brainer by dissy · 2017-04-23 00:54 · Score: 4, Insightful
  
  It should be even easier than that.
  Archive.org should archive everything, including the robots.txt contents, at each scan.
  The content being displayed from the archive.org website itself however could then still honor robots.txt at the time of the scan, purely for "display" purposes.
  This way changing robots.txt to block search engines would not delete or hide any previous information.
  Also the new information would still be in the archive, even if not displayed due to the current robots.txt directives.
  Although it would require more work to do so properly, this would potentially allow for website owners to retroactively "unhide" content in the archive in the past as well.
  Proper in this case would require some way to verify the domain owner, but this could likely be as simple as creating another specifically named text file in the website's root path, with content provided by the archive.
  That can be as simple as the old school "cookie" data like so many other services use such as Google, or as complex as a standard that allows date ranges specified along with directives.
  But in any case, this would preserve copies of the website for future use, such as for when copyright protection expires.
  Despite everyone having a differing opinion on just how long "limited time" should be in "securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries", no one who wants to be taken seriously can dispute that this expiration must happen at some point.
  Since the vast majority of authors make no considerations to protect our property, that task clearly needs to fall on us to secure.
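The owner-verification step suggested in the comment above could be sketched as follows. This is only an illustration under stated assumptions, not anything the Internet Archive is known to do: the file name `archive-verification.txt` and the helpers `issue_token` and `is_owner` are hypothetical, and fetching the file over HTTP is left out so the example stays self-contained.

```python
import hmac
import secrets

# Hypothetical file the archive would ask the claimed owner to publish at the site root.
VERIFICATION_PATH = "/archive-verification.txt"


def issue_token() -> str:
    """Create an unguessable token for the archive to hand to the claimed owner."""
    return secrets.token_urlsafe(32)


def is_owner(expected_token: str, fetched_body: str) -> bool:
    """Compare the issued token with the body fetched from VERIFICATION_PATH."""
    # Constant-time comparison; surrounding whitespace in the file is ignored.
    return hmac.compare_digest(
        expected_token.encode(), fetched_body.strip().encode()
    )


token = issue_token()
print(is_owner(token, token + "\n"))      # owner published the token as given
print(is_owner(token, "something else"))  # file exists but content does not match
```

A token check like this only proves control of the domain at the moment of the check, so it says nothing about who controlled the site when older snapshots were taken.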
4. Re:No brainer by mrchaotica · 2017-04-23 08:56 · Score: 4, Insightful
  
  The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
  This is the single most important reason there could ever be!
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
No. by Gravis+Zero · 2017-04-22 20:44 · Score: 4, Insightful

A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.
If you have an example of an organization completely ignoring robots.txt, do tell.

--
Anons need not reply. Questions end with a question mark.
1. Re:No. by dwywit · 2017-04-22 21:13 · Score: 4, Insightful
  
  robots.txt is a polite way of saying "please don't"
  But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?
  
  --
  They sentenced me to twenty years of boredom
Re:Here is my clever idea... by blind+biker · 2017-04-22 21:27 · Score: 3, Insightful

Then why even have a website visible on the internet, if you don't want it searchable and archivable? Those two effectively mean "invisible" - because as long as it is visible, it is also archivable - if nothing else, manually.

--
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
YES!! by Vadim+Makarov · 2017-04-22 22:48 · Score: 5, Insightful

I applaud the direction the Internet Archive is taking. They should fully implement it.
A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I had maintained for the past 15 years. The squatter who now owns the name put a deny-all rule in robots.txt. As a result, a website with a fair quantity of useful information has totally disappeared from existence and from the archive record.
I do not see sufficiently important reasons to remove information that was once publicly accessible. There are some reasons; however, the public benefits of having access to all past public information outweigh them all.

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
This will have some big negative consequences by jonwil · 2017-04-22 23:29 · Score: 3, Insightful

Think about a big site like github.com.
Imagine how many terabytes of pretty-printed source code and other things archive.org would be pulling were it to crawl all of GitHub.
And that's just one site, there are many others that generate pretty-printed source code and other large things.
Or what about if it crawls Google and starts archiving all sorts of Google search URLs or Google maps URLs or whatever.
Simple solution? by RDW · 2017-04-23 00:39 · Score: 4, Insightful

How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.
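The rule proposed here (judge each snapshot by the robots.txt that was live when it was captured, with an owner-requested removal as the only retroactive override) can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not the Wayback Machine's actual logic: the `Snapshot` record, the `is_displayable` helper, and the `example-archiver` user agent are assumptions made for the example.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for this illustration.
USER_AGENT = "example-archiver"


@dataclass
class Snapshot:
    url: str
    captured_on: str  # ISO date of the crawl
    robots_txt: str   # robots.txt exactly as served on that date
    owner_requested_removal: bool = False


def is_displayable(snap: Snapshot) -> bool:
    """Decide visibility from the robots.txt stored with the snapshot itself."""
    if snap.owner_requested_removal:
        return False
    parser = RobotFileParser()
    parser.parse(snap.robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, snap.url)


# The original site allowed crawling; the later parked domain denies everything.
old = Snapshot("http://example.org/manual.html", "2010-05-01",
               "User-agent: *\nDisallow:\n")
parked = Snapshot("http://example.org/manual.html", "2017-04-01",
                  "User-agent: *\nDisallow: /\n")

print(is_displayable(old))     # True: the later deny-all does not reach back
print(is_displayable(parked))  # False: blocked by the robots.txt of its own date
```

Because each snapshot carries its own copy of robots.txt, a change of domain ownership only affects captures made after the change.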