Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)

Posted by EditorDavid on Saturday April 22, 2017 @08:09PM from the universal-access-to-all-knowledge dept.

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." It adds: "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these 'disappeared' sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."

But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?

Robots.txt is not only for privacy by Lorens · 2017-04-22 20:17 · Score: 3, Interesting

It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.
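A minimal sketch of the kind of service described above, to show why a crawler that follows every link never finishes. The `/multiply` path, the query parameter names, and the operand limit of 1000 are assumptions made for illustration only.

```python
import random


def render_page(a: int, b: int, rng: random.Random) -> str:
    """Return the product of a and b, followed by ten links to random pairs."""
    lines = [f"{a} x {b} = {a * b}"]
    for _ in range(10):
        # Hypothetical URL scheme; each link leads to yet another generated page.
        x, y = rng.randrange(1000), rng.randrange(1000)
        lines.append(f"/multiply?a={x}&b={y}")
    return "\n".join(lines)


# A seeded generator keeps the example reproducible.
page = render_page(6, 7, random.Random(0))
print(page)
```

Even with both operands capped below 1000 there are 1,000,000 distinct URLs behind a few lines of code; without a cap the set is unbounded, which is the case a Disallow rule is meant to handle.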
Block wildcard by Anonymous Coward · 2017-04-22 20:35 · Score: 2, Interesting

archive.org should block wildcard robots.txt files, e.g. ones that say block everything, with a few exceptions:
Image boards (e.g. 4chan, reddit, and similar forums) - Because they change so frequently, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and the rampant piracy would only lead to further DMCA requests aimed at archive.org.
Piracy sites - For obvious reasons.
Domain parking - A domain parking site should be treated as spam.
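Treating blanket robots.txt files differently first requires recognizing one. The following is a rough heuristic using Python's standard `urllib.robotparser`; the function name `blocks_everything` is hypothetical, and the check only asks whether a crawler with no group of its own may fetch the site root.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str) -> bool:
    """Heuristic: True if a generic crawler may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "*" stands in for a crawler that has no group of its own in the file.
    return not parser.can_fetch("*", "/")


BLANKET = "User-agent: *\nDisallow: /\n"
PARTIAL = "User-agent: *\nDisallow: /cgi-bin/\n"

print(blocks_everything(BLANKET))
print(blocks_everything(PARTIAL))
```

A file that blocks the root but re-allows some sub-paths would still be flagged by this check, so a real crawler would need a finer test, plus some way to apply the exceptions listed above.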
Here is my clever idea... by Snard · 2017-04-22 20:51 · Score: 3, Interesting

Maybe there can be a separate directive/section added inside robots.txt that gives direction to sites like archive.org on these matters. So both search engines and archival systems can behave honorably. If someone really does not want their site archived for the ages, archive.org should clearly respect that.
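One existing mechanism that comes close to this is a per-crawler group in robots.txt. The sketch below is not the new directive proposed here; the crawler names `ExampleArchiver` and `ExampleSearchBot` are made up, and the rules are checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: archival crawlers are kept out entirely,
# everyone else is only kept out of the search results pages.
ROBOTS_TXT = """\
User-agent: ExampleArchiver
Disallow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("ExampleArchiver", "ExampleSearchBot"):
    allowed = parser.can_fetch(agent, "https://example.com/page")
    print(agent, "may fetch /page:", allowed)
```

This only works if the archiver announces a stable name and chooses to honor its own group, which is exactly the behavior under debate.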

--
- Mike
robots.txt indeed does NOT have value by blind+biker · 2017-04-22 21:24 · Score: 2, Interesting

The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.

--
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
No. by Megane · 2017-04-23 00:13 · Score: 4, Interesting

robots.txt is intended to indicate what parts of a site should not be scanned recursively, often for technical reasons such as generated content. It is especially meant for sub-paths like /cgi-bin/, but there is no technical reason why the content of any arbitrary URL can't be programmatically generated. It might be and you wouldn't even know it, because the generated content may be the same most of the time, such as a navigation menu.
However, it was also not intended to be used to remove previously archived content, as archive.org currently uses it. When an archived page changes status in robots.txt, they should note the first date that the status changed, then simply stop updating it unless and until robots.txt re-allows it.
Scanning and archiving are two different operations, and robots.txt is only intended to apply to the former.
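The policy suggested above (record the first date of a block, freeze updates, keep what was already archived) can be sketched as follows. `PageArchive` and its fields are hypothetical names for illustration, not a description of how archive.org stores anything.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class PageArchive:
    """Hypothetical record for one archived URL."""
    snapshots: list = field(default_factory=list)  # (day, content) pairs
    blocked_since: Optional[date] = None  # first day robots.txt disallowed the URL

    def visit(self, day: date, allowed: bool, content: str = "") -> None:
        if not allowed:
            # Note only the first date of the block; earlier snapshots stay.
            if self.blocked_since is None:
                self.blocked_since = day
            return
        # Allowed (again): clear the block and resume taking snapshots.
        self.blocked_since = None
        self.snapshots.append((day, content))


page = PageArchive()
page.visit(date(2017, 1, 1), allowed=True, content="v1")
page.visit(date(2017, 2, 1), allowed=False)
page.visit(date(2017, 3, 1), allowed=False)
print(page.blocked_since, len(page.snapshots))
```

The key design choice is that a disallow never touches `snapshots`; it only stops new ones from being added.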

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }