Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)



Posted by EditorDavid on Saturday April 22, 2017 @08:09PM from the universal-access-to-all-knowledge dept.

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."

But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?

32 of 174 comments

yeah by Anonymous Coward · 2017-04-22 20:11 · Score: 5, Informative

yeah!
1. Re:yeah by ArmoredDragon · 2017-04-22 20:25 · Score: 5, Informative
  
  Law of headlines indeed, and there's already an established way for web developers to indicate that they don't want content cached or archived while still being searchable:
  <meta name="robots" content="noarchive">
  So archive.org could just honor that, and the problem would be solved. Google honors exactly this.
2. Re:yeah by Zocalo · 2017-04-22 21:45 · Score: 5, Informative
  
  Even more specific robots.txt directive for this instance:
  
  User-agent: ia_archiver
  Disallow: /
  
  As is often the case, Lauren is going off half-cocked with only part of the story. The IA already has a policy for removal requests (email info@) and is only considering extending its current position of ignoring robots.txt to sites outside its current "test zone" of the .gov and .mil domains, where it has not had any problems. They probably will do that (and for their archival purposes it's a good idea in principle), but before starting to scream and shout I think it's only fair to see whether or not they listen to the feedback and provide some specific opt-out policy and technical mechanisms, like at least honoring either of the above, prior to going live on the rest of the Internet. It's going to be a two-way street anyway, because they're going to find a lot more sites that feed multiple MB of pseudo-random crap to spiders that ignore robots.txt to try and do things like poison spammers' address lists, so it's actually in their best interests to provide an opt-out they honor.
  
  Besides, it's going to be interesting to see what kind of idiotic crap web admins who should know better think is safely hidden and/or secured because of robots.txt - it's useful to know who is particularly clueless so you can avoid them at all costs. :)
  
  --
  UNIX? They're not even circumcised! Savages!
3. Re:yeah by Zocalo · 2017-04-22 23:11 · Score: 5, Informative
  
  IA does still spider, but they seem to use a more nuanced system than the rudimentary "start at /, then recursively follow every link" approach used by more trivial site spider algorithms. Firstly, they don't download an entire site in one go - they spread things out over time to avoid putting large spikes into the traffic pattern, which is friendlier for sites that are bandwidth limited and on things like "xGB/month" plans. Secondly, they have a "popularity weighting" system that governs the order in which they spider and refresh sections of a given site, which is the main reason for the difference between the level of content for popular and less popular sites - although I have no idea whether that's based entirely off something like the site's Alexa ranking or is also weighted against how dynamic the content is (e.g. a highly dynamic site like Slashdot would get a bump up the priority, whereas a mostly static reference site might get downgraded). Combine the two approaches and you get the results you are seeing: major web homepages get spidered more or less every day with several levels of links retrieved, while some random personal blog only gets spidered every few weeks or more, and only with the homepage and first level or two of links ever getting looked at.
  
  --
  UNIX? They're not even circumcised! Savages!
4. Re:yeah by wisnoskij · 2017-04-23 07:59 · Score: 2
  
  But should that matter? If the website is publicly facing, why should you not be able to archive it (regardless of their wishes)? I can take pictures of houses I see from the street. The law seems fairly straightforward here, and it is easy to build any sort of wall around your website if you wish to keep the public and archivers out.
  
  --
  Troll is not a replacement for I disagree.
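
The two opt-out signals mentioned in this thread (the `noarchive` robots meta tag and a robots.txt group for `ia_archiver`) can both be checked mechanically. Below is a minimal sketch using only Python's standard library. The helper names `has_noarchive` and `may_crawl` are made up for illustration, and nothing here is a claim about how archive.org's crawler actually behaves.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content is a comma-separated list such as "noindex, noarchive"
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noarchive(html):
    """True if the page asks robots not to keep an archived copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


def may_crawl(robots_txt, user_agent, url):
    """True if the given robots.txt text allows user_agent to fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A crawler that wanted to honor both signals would call `may_crawl` before fetching a page and `has_noarchive` after fetching it, since the meta tag is only visible once the page has been downloaded.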
Cautiously saying yes to this by haruchai · 2017-04-22 20:14 · Score: 2

but it may have consequences I haven't considered

--
Pain is merely failure leaving the body
No brainer by fnj · 2017-04-22 20:14 · Score: 5, Insightful

Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.
1. Re:No brainer by thsths · 2017-04-22 20:20 · Score: 5, Insightful
  
  But that is not the question asked, is it?
  robots.txt should apply to the page at the time. I do not see any decent argument against that.
  But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
2. Re:No brainer by blackest_k · 2017-04-22 23:11 · Score: 5, Insightful
  
  One problem I run into is with owner manuals for old film cameras: a lot of the time they disappear from the company website when the company gets taken over by another. Sometimes archive.org can come to the rescue if I can find where they used to be. Fair enough, the new company may only be interested in the digital models and have no interest in the historical products made by the company it acquired, but then they make boneheaded choices like erasing the historical information the original company put out for its customers.
  Worse still is when a domain name lapses and is bought by another company that had zero access to the content of the former site. They bought a name, not a right to control the history of the former site.
  The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
  At the very least, when a website changes hands the new owner should not be able to erase the history of the site under the previous owner.
  
  --
  Blarney Quality Restaurant, Plants
3. Re:No brainer by c · 2017-04-22 23:43 · Score: 2
  
  But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
  Exactly. The policy where someone with no interest in a site (i.e. takeovers, lapsed domains, etc) can retroactively wipe all archives with just a couple of lines in a config is flat-out wrong.
  Ignoring robots.txt entirely, though, is a bad idea. Some sites use it to block archiving, sure, but some others use it to tell robots to avoid places where they'll never return from. There's a case for ignoring "Disallow: /", or anything that's significantly different from what, say, the Google search indexer is allowed to see.
  
  --
  Log in or piss off.
4. Re:No brainer by dissy · 2017-04-23 00:54 · Score: 4, Insightful
  
  It should be even easier than that.
  Archive.org should archive everything, including the robots.txt contents, at each scan.
  The content being displayed from the archive.org website itself however could then still honor robots.txt at the time of the scan, purely for "display" purposes.
  This way changing robots.txt to block search engines would not delete or hide any previous information.
  Also the new information would still be in the archive, even if not displayed due to the current robots.txt directives.
  Although it would require more work to do so properly, this would potentially allow for website owners to retroactively "unhide" content in the archive in the past as well.
  Proper in this case would require some way to verify the domain owner, but this could likely be as simple as creating another specifically named text file in the websites root path, with content provided by the archive.
  That can be as simple as the old school "cookie" data like so many other services use such as Google, or as complex as a standard that allows date ranges specified along with directives.
  But in any case, this would preserve copies of the website for future use, such as for when copyright protection expires.
  Despite everyone having a differing opinion on just how long "limited time" should be in "securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries", no one who wants to be taken seriously can dispute that this expiration must happen at some point.
  Since the vast majority of authors make no considerations to protect our property, that task clearly needs to fall on us to secure.
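
The ownership check suggested above (a specifically named text file in the website's root path, with content provided by the archive) might look roughly like the following sketch. Every concrete detail is an assumption made for illustration: the file name `archive-verify.txt`, the HMAC-derived token, and the `fetch` callable that stands in for a real HTTP request.

```python
import hashlib
import hmac

VERIFY_PATH = "/archive-verify.txt"  # hypothetical file name, not a real convention


def issue_token(secret, domain):
    """Derive the token the archive would hand to someone claiming `domain`."""
    return hmac.new(secret, domain.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def verify_owner(secret, domain, fetch):
    """Return True if the site currently serves the expected token.

    `fetch` is any callable taking a URL and returning the response body as
    text, or None if the file does not exist (hypothetical interface).
    """
    expected = issue_token(secret, domain)
    body = fetch("http://" + domain + VERIFY_PATH)
    if body is None:
        return False
    # Constant-time comparison of the served text against the expected token.
    return hmac.compare_digest(body.strip().encode("utf-8"), expected.encode("utf-8"))
```

Deriving the token from a secret and the domain name means the archive does not have to store one token per request. Note that a check like this only proves control of the domain at the moment it runs, which is exactly the limitation the lapsed-domain complaints in this thread are about.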
5. Re: No brainer by sumdumass · 2017-04-23 04:22 · Score: 2
  
  Internet Archive is recognized by the Library of Congress and the US Copyright Office as a bona fide library organization and as such is immune from most copyright laws in its pursuit of archiving and allowing access, with some restrictions of course.
  Section 108 lays out the framework but US regulations provide more specifics in the exemptions and uses. As far as I know, they fall completely within the scope of the laws and limitations even if they ignore the robots.txt because the copyright law creates an exception to the rights imposed by law concerning libraries.
  Even though there is no legal definition of pirating, I don't think they apply to even the common definition if translated to legal means as they are exempt from the restrictions normal people and organizations are subject to.
6. Re:No brainer by mrchaotica · 2017-04-23 08:56 · Score: 4, Insightful
  
  The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
  This is the single most important reason there could ever be!
  
  --
  "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Robots.txt is not only for privacy by Lorens · 2017-04-22 20:17 · Score: 3, Interesting

It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.
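
To make the scale of the problem concrete, here is a small simulation of such a service together with a crawler that gives up after a fixed page budget. It is only a sketch: `multiplication_page` and `crawl` are invented names, the "site" is simulated in memory, and a page budget is just one of several ways a crawler could bound itself.

```python
import random
from collections import deque


def multiplication_page(a, b, rng):
    """Simulate the service: the product, plus links to ten new random pairs."""
    links = [(rng.randint(1, 10**6), rng.randint(1, 10**6)) for _ in range(10)]
    return a * b, links


def crawl(start, max_pages, seed=0):
    """Breadth-first crawl that stops after archiving max_pages pages.

    Returns the archived pages and the number of links still waiting.
    """
    rng = random.Random(seed)
    queue = deque([start])
    seen = {start}
    archived = {}
    while queue and len(archived) < max_pages:
        a, b = queue.popleft()
        product, links = multiplication_page(a, b, rng)
        archived[(a, b)] = product
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived, len(queue)
```

Each archived page adds up to ten new links, so the queue grows roughly nine times faster than it drains. Without the `max_pages` cap the loop would never finish, which is the situation a `Disallow` rule is meant to warn crawlers about.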
Block wildcard by Anonymous Coward · 2017-04-22 20:35 · Score: 2, Interesting

archive.org should block wildcard robots.txt, eg ones that say block everything. With a few exceptions:
Image boards (e.g. 4chan, reddit, and similar forums) - due to how frequently they change, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and due to the rampant piracy, archiving them would only lead to further DMCA requests aimed at archive.org
Piracy sites - For obvious reasons.
Domain parking - A domain parking site should be treated as spam.
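
Whatever policy is applied to "block everything" files, the first step is recognizing one. The sketch below treats a robots.txt as a blanket block when its wildcard group contains `Disallow: /` and no `Allow` rule. The function name `is_blanket_block` and that exact definition are assumptions for illustration, and the parsing is deliberately simplified compared with a full robots.txt parser.

```python
def is_blanket_block(robots_txt):
    """True if the `User-agent: *` group disallows the whole site."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))

    for group_agents, group_rules in groups:
        if "*" not in group_agents:
            continue
        has_allow = any(f == "allow" and v for f, v in group_rules)
        if ("disallow", "/") in group_rules and not has_allow:
            return True
    return False
```

A file that only blocks a named crawler, or only blocks a sub-path such as `/cgi-bin/`, is not reported, so rules that protect crawlers from generated content are left alone.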
1. Re:Block wildcard by mikael · 2017-04-22 20:39 · Score: 2
  
  It could archive a specific thread on a board once there has been no activity for over six months.
  
  --
  Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
2. Re:Block wildcard by _merlin · 2017-04-22 21:51 · Score: 2
  
  Threads on 4chan last hours, not months.
3. Re:Block wildcard by KiloByte · 2017-04-23 00:41 · Score: 2
  
  Piracy sites -- they deserve special protection, as they're very likely to be disappeared against their owner's wishes.
  Image boards -- a glimpse into ephemeral content is worth keeping, even if you miss most of it.
  Domain parking -- I agree with you, they're 100% spam. But they're the primary reason such deletion must not be retroactive.
  
  --
  The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Random generated content by DrYak · 2017-04-22 20:37 · Score: 4, Informative

It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results
And archive.org already has a correct behaviour for that:
- it won't try to download the whole infinity of solutions in one go (e.g. generating gigabytes' worth of data out of the 1kB Perl/PHP/NodeJS/whatever source)
- instead it will occasionally rescan the page, every few days (more or less frequently, depending on the popularity of the links)
It provides a small glimpse of what a user could have seen back then on the website.
By the way, back in the 2000s, this was exactly a popular way to poison the spiders of spam robots that were scanning the web for e-mail addresses.
- Either they honour robots.txt and do not scan that or any other source of e-mail addresses on the site.
- Or they ignore robots.txt and follow links they aren't authorised to, and end up siphoning gigabytes' worth of bogus e-mail addresses auto-generated by a small Perl script, which will pollute their base of harvested addresses.
Archive.org's spider might be a tiny bit susceptible to this kind of thing.
Not as much as a spam e-mail-harvesting spider (which will try to download as much as possible, much more aggressively than archive.org), but such a labyrinth of links might still get the archive's spider lost.
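
The "rescan occasionally, more often for popular links" behaviour can be sketched as a priority queue of next-visit times. This is an illustration of the general idea only: the popularity scores, the 1-to-30-day interval range, and the names `revisit_interval_days` and `schedule` are all assumptions, not a description of archive.org's real scheduler.

```python
import heapq


def revisit_interval_days(popularity, min_days=1.0, max_days=30.0):
    """Map a popularity score in [0, 1] to a revisit interval in days."""
    popularity = min(1.0, max(0.0, popularity))
    return max_days - popularity * (max_days - min_days)


def schedule(sites, horizon_days):
    """Return the (day, url) visits made within horizon_days, in time order.

    `sites` maps each URL to its popularity score.
    """
    heap = [(0.0, url) for url in sorted(sites)]
    heapq.heapify(heap)
    visits = []
    while heap:
        day, url = heapq.heappop(heap)
        if day > horizon_days:
            break  # everything left in the heap is even later
        visits.append((day, url))
        # One page per visit, then come back later: no site is drained in one go.
        heapq.heappush(heap, (day + revisit_interval_days(sites[url]), url))
    return visits
```

Because each visit fetches a bounded amount and the next visit is pushed into the future, a link labyrinth costs such a crawler a trickle of requests over time rather than gigabytes in one sitting.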

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
No. by Gravis+Zero · 2017-04-22 20:44 · Score: 4, Insightful

A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.
If you have an example of an organization completely ignoring robots.txt, do tell.

--
Anons need not reply. Questions end with a question mark.
1. Re:No. by dwywit · 2017-04-22 21:13 · Score: 4, Insightful
  
  robots.txt is a polite way of saying "please don't"
  But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?
  
  --
  They sentenced me to twenty years of boredom
Here is my clever idea... by Snard · 2017-04-22 20:51 · Score: 3, Interesting

Maybe there can be a separate directive/section added inside robots.txt that gives direction to sites like archive.org on these matters. So both search engines and archival systems can behave honorably. If someone really does not want their site archived for the ages, archive.org should clearly respect that.

--
- Mike
1. Re:Here is my clever idea... by blind+biker · 2017-04-22 21:27 · Score: 3, Insightful
  
  Then why even have a website visible on the internet, if you don't want it searchable and archivable? Those two effectively mean "invisible" - because as long as it is visible, it is also archivable - if nothing else, manually.
  
  --
  "The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
robots.txt indeed does NOT have value by blind+biker · 2017-04-22 21:24 · Score: 2, Interesting

The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.

--
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
YES!! by Vadim+Makarov · 2017-04-22 22:48 · Score: 5, Insightful

I applaud the direction internet archive takes. They should fully implement it.
A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I had maintained for the past 15 years. The squatter who now owns the name put deny all in robots.txt. As a result, the website with some quantity of useful information has totally disappeared from existence and from the archive record.
I do not see sufficiently important reasons to remove information that was once in public access. There are some reasons, however the public benefits of having access to all past public information outweigh them all.

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
This will have some big negative consequences by jonwil · 2017-04-22 23:29 · Score: 3, Insightful

Think about a big site like github.com.
Imagine how many terabytes of pretty-printed source code and other things archive.org would be pulling were it to crawl all of GitHub.
And that's just one site, there are many others that generate pretty-printed source code and other large things.
Or what about if it crawls Google and starts archiving all sorts of Google search URLs or Google maps URLs or whatever.
No. by Megane · 2017-04-23 00:13 · Score: 4, Interesting

robots.txt is intended to indicate what parts of a site should not be scanned recursively, often for technical reasons such as generated content. It is used especially for sub-paths like /cgi-bin/, but there is no technical reason why the content of any arbitrary URL can't be programmatically generated. It might be and you wouldn't even know it, because the generated content may be the same most of the time, such as a navigation menu.
However, it was also not intended to be used to remove previously-archived content, as archive.org is currently using it. When an archived page changes status in robots.txt, they should note the first date that the status changed, then simply stop updating it unless and until robots.txt re-allows it.
Scanning and archiving are two different operations, and robots.txt is only intended to apply to the former.

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
Simple solution? by RDW · 2017-04-23 00:39 · Score: 4, Insightful

How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.
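
A minimal sketch of that rule: keep a dated history of the robots.txt versions seen for a site, and decide whether a snapshot may be shown using the version that was in effect when the snapshot was captured. The `RobotsHistory` class and its method names are hypothetical; only `bisect` and `urllib.robotparser` are real standard-library modules.

```python
import bisect
from datetime import date
from urllib.robotparser import RobotFileParser


class RobotsHistory:
    """Dated robots.txt versions for one site (hypothetical helper)."""

    def __init__(self):
        self._dates = []
        self._texts = []

    def record(self, seen_on, robots_txt):
        """Store the robots.txt text observed on the given date."""
        index = bisect.bisect_right(self._dates, seen_on)
        self._dates.insert(index, seen_on)
        self._texts.insert(index, robots_txt)

    def in_effect(self, when):
        """Return the latest version recorded on or before `when`."""
        index = bisect.bisect_right(self._dates, when) - 1
        return self._texts[index] if index >= 0 else ""

    def snapshot_visible(self, url, captured_on, user_agent="ia_archiver"):
        """Apply the robots.txt of the capture date, never a later one."""
        rules = RobotFileParser()
        rules.parse(self.in_effect(captured_on).splitlines())
        return rules.can_fetch(user_agent, url)


history = RobotsHistory()
history.record(date(2010, 1, 1), "User-agent: *\nDisallow:")    # original owner: open
history.record(date(2016, 6, 1), "User-agent: *\nDisallow: /")  # new owner: block all
```

With this history, a snapshot captured in 2012 stays visible after the 2016 change, while a capture from 2017 is hidden. The owner-requested deletion mentioned above would be a separate, manual path and is not modelled here.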
I conducted a 2 yr experiment on Internet Archive by Anonymous Coward · 2017-04-23 01:20 · Score: 2, Informative

I wanted to know if it was possible to delete content from the Internet Archive. Their FAQ and support staff were very vague and only referred me to the robots.txt file. I found that they archive everything even if you tell them not to. The robots.txt file only controls whether or not the public can view it.
Experiment 1) Buy an expired domain and host it with a robots.txt file telling Internet Archive not to archive it. Before the experiment I confirmed that Internet Archive had a history for this expired domain. After buying the domain and hosting the robots.txt file, I confirmed that Internet Archive no longer allowed access to it. I allowed the domain to expire, then went back to Internet Archive. The entire history of the domain was still there.
Experiment 2) I browsed the history of an existing website that I host to confirm they had it. Next I hosted a robots.txt file telling Internet Archive not to archive it and verified that the public can no longer browse the archive for this domain. Next I changed a picture on the website for six months, then changed it back. I waited another six months, removed the robots.txt file and checked the Internet Archive. I found that they had been taking snapshots throughout the year even though my robots.txt told them not to. The picture they were not supposed to have archived was visible in the archive.
If you really don't want them to archive your website, you can maybe block all of their IP addresses from accessing your server. Possibly by determining the domain name from the IP address and then checking if it is the Internet Archive.
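
The "determine the domain name from the IP address" idea is usually done as a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and make sure it maps to the same IP. A sketch follows, with loud assumptions: that the crawler's addresses reverse-resolve to names under `archive.org` has not been verified here, the function name `is_from_domain` is made up, and the default resolvers cover IPv4 only.

```python
import socket


def _reverse_lookup(ip):
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_from_domain(ip, domain="archive.org",
                   reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS check that `ip` belongs to `domain`."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if hostname != domain and not hostname.endswith("." + domain):
        return False
    try:
        # Anyone controlling their own reverse zone can claim any hostname,
        # so the name must also resolve back to the same address.
        return ip in forward(hostname)
    except OSError:
        return False
```

A server could run this check once per new client address, cache the answer, and refuse the request when it returns True. The resolver functions are parameters so the logic can be exercised without touching the network.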
Re:NO use meta tags to control this by allo · 2017-04-23 01:48 · Score: 2

Gone doesn't mean there will be no replacement. It just tells you that the replacement will not be the same file. So you can re-request the URL, but you should not try to resume a download from there.
Re:No by TheFakeTimCook · 2017-04-23 01:58 · Score: 2

Archive.org needs to respect copyright law and stop this blatant reproduction of protected works!
Unless I explicitly consent to archiving (or searching for that matter), my content should never reside on someone else's server.
Sounds like you shouldn't have put that information on the Internet in the first place.
Re: No by Anonymous Coward · 2017-04-23 03:44 · Score: 2, Informative

Robots.txt is a suggestion, not a requirement.