Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)
Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective."
"We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."
But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?
yeah!
neh!
but it may have consequences I haven't considered
Pain is merely failure leaving the body
Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.
It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.
Cache everything on the Pirate Bay. I'm sure the lawyers will find that useful in some court somewhere.
archive.org should block wildcard robots.txt, eg ones that say block everything. With a few exceptions:
Image boards (eg 4chan, reddit, and similar forums) due to how frequently they change, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and due to the rampant piracy, would only lead to further DMCA requests aimed at archive.org
Piracy sites - For obvious reasons.
Domain parking - A domain parking site should be treated as spam.
It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results
And archive.org already has a correct behaviour for that :
- it won't try to download the whole infinity of solutions in one go (e.g.: generating gigabytes' worth of data out of the 1kB Perl/PHP/NodeJS/whatever source)
- instead it will occasionally rescan the page, every few days (more or less frequently, depending on popularity of the links)
It provides a small glimpse of what a user could have seen back then on the website.
By the way, back in the 2000s, this was exactly a popular way to poison SPAM robot spiders that were scanning the web for e-mail addresses.
- Either they honour robots and not scan that or any other sources of e-mail on the site.
- Or they ignore robots.txt and follow links they aren't authorised to, and end up siphoning gigabytes' worth of bogus e-mail addresses auto-generated by a small Perl script, which will pollute their base of harvested addresses.
Archive.org's spider might be a tiny bit more susceptible to this kind of thing.
Not as much as a SPAM email-harvesting spider (which will try to download as much as possible, much more aggressively than archive.org), but still, such a labyrinth of links might get the archive's crawler lost.
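A rough sketch of how a crawler can avoid getting lost in such a labyrinth: cap the number of pages taken per host and the link depth followed. This is not how archive.org's crawler actually works; the function names, the limits and the `fetch_links` callback are all made up for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, fetch_links, max_pages_per_host=50, max_depth=5):
    """Breadth-first crawl with a per-host page budget and a depth limit.

    fetch_links is a caller-supplied (hypothetical) function that returns
    the list of URLs linked from a given page.
    """
    seen = {start_url}
    pages_per_host = {}
    visited = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        host = urlparse(url).netloc
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue  # budget for this host is spent, skip the page
        pages_per_host[host] = pages_per_host.get(host, 0) + 1
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def endless_site(url):
    # Simulates the trap: every page links to ten brand new pages.
    return [f"{url}/{i}" for i in range(10)]
```

Calling `crawl("http://trap.example", endless_site, max_pages_per_host=20)` stops after 20 pages even though the simulated site never runs out of links.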
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
If they do that on my sites (and many others I'm sure) they'll get locked out.
A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.
If you have an example of an organization completely ignoring robots.txt, do tell.
Anons need not reply. Questions end with a question mark.
Maybe there can be a separate directive/section added inside robots.txt that gives direction to sites like archive.org on these matters. So both search engines and archival systems can behave honorably. If someone really does not want their site archived for the ages, archive.org should clearly respect that.
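For what it's worth, robots.txt already lets you write a separate section per crawler, keyed on User-agent, which gets part of the way there. A minimal sketch using Python's standard urllib.robotparser; the crawler name "example-archiver" and the example.com URLs are invented for illustration and are not any real archive's user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the archiver out entirely, but let
# everyone else (e.g. search engines) in except for /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `allowed("example-archiver", "http://example.com/page")` is False while the same URL is allowed for any other crawler name. It only works if the crawler chooses to honor it, which is the whole argument.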
- Mike
Section 1 a & b (http://www.legislation.gov.uk/ukpga/1990/18/section/1)
Access to the information is unauthorised (robots.txt says no) but they do it anyway and wilfully.
robot.txt doubly so.
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.
The problem is new owners of domains adding a robots.txt causing the archive to remove old site scrapes. It seems entirely reasonable to assume that adding robots.txt file should only apply to current content as chances are that prior content is not content that is owned by a new owner of a domain. I think that existing content should remain but new scrapes stop when a new robots.txt file appears on the domain. A complaints procedure then provided for content owners who didn't realise that their content was being archived to request that it be removed.
The bad guys already have everything. Why shouldn't the rest of us?
It is legal and should be legal. It is also dickish and a nice example for why honor based systems are doomed to fail.
The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make their published content nearly impossible to find through search. Fuck that, fuck robots.txt and the train it came with.
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
Using a robots.txt is as awful as favicon.ico
And why the fuck do modern browsers ignore a 410 GONE response for /favicon.ico and continually re-request it? WTF is the point of a permanently removed status code if it isn't cached?
But deprecate robots and have the http request be a legitimate way to ask for permission. If a web page can be got without login, then it's fair game.
This is the case now, it's just that the wealthy corrupted the courts and have used robots.txt as a pretext for how there's no right given by them giving you what you asked.
If, on day 1, the robots.txt file of a given site allows information to be collected and archive.org does it, they would be fully complying with robots.txt. If, on day 2, that site modifies the robots.txt file and restricts the access to all the bots, archive.org shouldn't collect any more information, but why delete the day-1-rightfully-stored one? Such a deletion would be exclusively motivated by their own policy, not by what should be expected from a robots.txt compliance.
A different story would be determining whether they can rightfully store and display information from other sites, what the owners have to say about that and for how long certain type of information might be kept. Nothing of this has to do with respecting robots.txt, but with privacy and third-party information management on the lines of the right to be forgotten.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
and when finished it should archive itself and restart the universe.
Archive.org needs to respect copyright law and stop this blatant reproduction of protected works!
Unless I explicitly consent to archiving (or searching for that matter), my content should never reside on someone else's server.
Some clarifications just in case:
- I don't think that archive.org or any other site should fully ignore robots.txt, or any other express indication of what the website owner wants.
- The robots.txt files of my two sites don't include any kind of restriction and never did.
- All the crawling bots which I develop (currently running ones ranking web domains) always respect robots.txt or, depending upon the exact conditions, anything else which clearly indicates the site owner expectations.
- I am not precisely a (restricted) copyright fan and my whole online activity may be considered public domain.
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.
that "Turn on, tune in, drop out" was pert and remains so, even more so now with such a netdorked world. WTF are you bickering about it still?
Baidu.
Ended up blocking all their scanning IPs because of their poor behavior. That was a few years ago.
Maybe they've changed?
I applaud the direction internet archive takes. They should fully implement it.
A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I maintained for the past 15 years. The squatter who now owns the name put deny all in robots.txt. As the result the website with some quantity of useful information has totally disappeared from existence and from the archive record.
I do not see sufficiently important reasons to remove information that was once in public access. There are some reasons, however the public benefits of having access to all past public information outweigh all them.
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.
A) public website
B) access control
Pick one
Dishonor Robots.txt and I add this to htaccess: Deny from 207.241.224.2
Think about a big site like github.com.
Imagine how many terabytes of pretty-printed source code and other things archive.org would be pulling were it to crawl all of GitHub.
And that's just one site, there are many others that generate pretty-printed source code and other large things.
Or what about if it crawls Google and starts archiving all sorts of Google search URLs or Google maps URLs or whatever.
What is not done voluntarily can always be legislated and enforced judiciously. Yet another internet nightmare.
robots.txt is intended to indicate what parts of a site should not be scanned recursively, often for technical reasons such as generated content. It is especially meant for sub-paths like /cgi-bin/, but there is no technical reason why the content of any arbitrary URL can't be programmatically generated. It might be and you wouldn't even know it, because the generated content may be the same most of the time, such as a navigation menu.
However, it was also not intended to be used to remove previously-archived content, as archive.org is currently using it. When an archived page changes status in robots.txt, they should note the first date that the status changed, then simply stop updating it until and if robots.txt re-allows it.
scanning and archiving are two different operations, and robots.txt is only intended to apply to the former.
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
I think a web page (as opposed to site) has the right to be public or not.
But once they decide to be public, then search and archive from all should be permitted.
No cherry picking so only Google can index the page.
Certainly, no after the fact removal from an archive, if the page was public at the time it was captured.
The IA is especially good at performing a public service.
I think they should honor a robots.txt request that applies to all, but use the most permissive directive for sites that are picky about which indexing they support.
Legally, a spider should have the same access as the general public.
If a site limits access to the general public, then the spider will see the same limit.
The robots.txt file is a suggestion to make things more efficient.
It is not a locked door to some and open to others.
How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.
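A quick sketch of that rule, assuming the archive keeps a dated history of what each site's robots.txt said. The data layout and function names are invented for illustration, and treating "no robots.txt on record yet" as allowed is an assumption as well.

```python
from bisect import bisect_right
from datetime import date


def policy_at(policy_history, when):
    """Return whether archiving was allowed on the date `when`.

    policy_history is a list of (effective_date, allowed) pairs sorted by
    date; each entry applies from its date until the next entry.
    """
    dates = [effective for effective, _ in policy_history]
    index = bisect_right(dates, when)
    if index == 0:
        return True  # assumption: no robots.txt on record means allowed
    return policy_history[index - 1][1]


def visible_snapshots(snapshot_dates, policy_history):
    """Keep only snapshots that were permitted when they were captured."""
    return [s for s in snapshot_dates if policy_at(policy_history, s)]
```

With a history of `[(date(2005, 1, 1), True), (date(2016, 6, 1), False)]`, a 2010 snapshot stays visible and a 2017 one does not, no matter what the domain's new owner publishes later.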
While everyone is debating this, keep in mind that the problem is the Abuse of robots.txt to the point of absurdity. At what point do people have a right to open their Windows to the public, do stuff to attract attention, and then get mad at others for looking into their windows in the first place. People who make mistakes should have some way to undo those mistakes at some point, but to continue to leave your window open and then have everyone put a robots.txt, retroactive or otherwise, is not the answer.
I wanted to know if it was possible to delete content from the Internet Archive. Their FAQ and support staff were very vague and only referred me to the robots.txt file. I found that they archive everything even if you tell them not to. The robots.txt file only controls whether or not the public can view it.
Experiment 1) Buy an expired domain and host it with a robots.txt file telling Internet Archive not to archive it. Before the experiment I confirmed that Internet Archive had a history for this expired domain. After buying the domain and hosting the robots.txt file, I confirmed that Internet Archive no longer allowed access to it. I allowed the domain to expire, then went back to Internet Archive. The entire history of the domain was still there.
Experiment 2) I browsed the history of an existing website that I host to confirm they had it. Next I hosted a robots.txt file telling Internet Archive not to archive it and verified that the public can no longer browse the archive for this domain. Next I changed a picture on the website for six months, then changed it back. I waited another six months, removed the robots.txt file and checked the Internet Archive. I found that they had been taking snapshots throughout the year even though my robots.txt told them not to. The picture they were not supposed to have archived was visible in the archive.
If you really don't want them to archive your website, you can maybe block all of their IP addresses from accessing your server. Possibly by determining the domain name from the IP address and then checking if it is the Internet Archive.
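A minimal sketch of that reverse lookup idea in Python, with a forward lookup added to confirm the result, since anyone who controls their own reverse DNS can claim any hostname. The ".archive.org" suffix is an assumption about how their crawler hosts are named, not something I have verified, and the function name is made up.

```python
import socket


def is_archive_crawler(ip, suffix=".archive.org",
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Guess whether `ip` belongs to the crawler's domain via DNS.

    The reverse and forward resolvers are parameters so the check can be
    exercised without touching the network.
    """
    try:
        hostname = reverse(ip)[0]  # PTR lookup: IP -> hostname
    except OSError:
        return False  # no reverse record at all
    if not hostname.lower().rstrip(".").endswith(suffix):
        return False
    try:
        addresses = forward(hostname)[2]  # hostname -> list of IPs
    except OSError:
        return False
    # Forward-confirm: the hostname must resolve back to the same IP.
    return ip in addresses
```

A web server hook could call this once per new client IP and cache the answer. Doing a DNS lookup on every request would be slow.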
There are already flags like "noarchive" to get google to index the site, but not provide public "google cache" links (you can assume they still cache it, but that doesn't matter for you).
So archive.org should ignore noindex directives, but not noarchive ones.
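Roughly what that check could look like on the archive's side, as a sketch using only Python's standard html.parser. It looks at the page's robots meta tag and treats noarchive as the only directive that matters for display. The class and function names are invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def archive_may_display(html):
    """Ignore noindex, but refuse to show pages marked noarchive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

A page carrying only `noindex` would still be shown, while one carrying `noindex, noarchive` would not.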
You don't put private things with a "please don't read" note on them in the town library noticeboard, dumbass!
Art, whether ephemeral or not, again, either don't put it up or accept it. There's memory. People will remember your "ephemeral art", and really who are you to decide what your art means to the viewer?
Experiences? Experience being archived! What a pointless point you made.
Change your mind? But you haven't. You still leave it in public with a "please do not read" note on it. Change your mind about where you put it, not make others change their minds because you're a lazy fucker.
You have all control over it. Take it off the internet. Block access without a login. What you can't control is other people. Why the hell should you?
etc? What "and so on"? Those ones were shit. And they were the best you could come up with!
.
However, the way Microsoft has been acting recently (e.g., Windows 10 forced upgrades), I doubt if they even care about what I try to tell them via robots.txt.
The Microsoft attitude apparently is pervasive within the company.
"Should Archive.org Ignore Robots.txt Directives And Cache Everything?"
No.
Just cruising through this digital world at 33 1/3 rpm...
don't publish it openly in the first place.
Hey I like privacy too but come on, it's the internet! If you put something out there that's not password-protected then consider it to be publicly available, already. Respecting robots.txt is a matter of politeness.
You have to explain.
Lazy fucker.
For the non-lazy fucker who actually gave a point, if your scripts are in a directory properly rather than munged into the same area your content is, then any sane robot will not execute them. If the scripts contain content information, ur doin it wrong. Separate the content from the presentation. It's what the markup system is meant for, but if you're going to put content in the scripts then your scripts being executed is necessary, and robots.txt is being abused by you along with the "please don't execute this!" plea to hide information.
Dumbasshit spiders will grab everything. But they're going to ignore robots.txt as well if they're that dumb or badly written, so some crawlers like archive.org ignoring it will be zero different.
So what? When DoubleClick argues that they ought to have the same advantages as Archive.org, they'll only manage to look like douchebags reaching their filthy hands into a cookie jar.
It's not always a bad thing to set up douchebag-honeypot moral exemption, even if it does depend on the mass audience (mostly) managing to find two sticks to rub together.
The real solution here is to make the directives in robots.txt more explicit concerning the predatory/non-predatory use cases.
As long as this is their policy going forward from 2017 and they don't revive old cached pages from years ago, I don't have a particular problem with this. But in the past, their explicit policy was that sites could be removed via the robots.txt file and that must be honored. From now on I guess I'll use .htaccess to achieve the same effect where necessary.
archive.org should ignore robots.txt as a means to prevent archiving material. archive.org should however be smart enough to know what can be ignored, based on content.
More data points to show you more Mc Donald's ads probably sounds awesome to them. You can't be cool, popular, and decent all at once, Xeni and crew..
Of course, I'm sure multiple TLAs already have a copy of everything, particularly anything political, technical, blogs, etc, including the "dark web" and other encrypted sites. A leopard can't change his spots and I would predict the FBI has dossiers on most American citizens. It's in their DNA, dating from J. Edgar at their birth. Now the dossiers are electronic, searchable, and probably do include real DNA. They likely include info from the older paper files - like that record of the subscription I had to the People's Daily back in the 70s during my Maoist phase - now scanned and searchable too. With the exposure of all these hacking tools, perhaps there is even a backup of my old blog server which mysteriously crashed in 2007 wiping out years of pointless political fulminating read by one or two people. If it wasn't for these search engines ignoring robots.txt, I would have had no traffic at all. Certain corporations also have essentially complete copies of the internet - like Google, Yahoo (Verizon), and Microsoft. If only we could search and browse them with a better interface than the Wayback Machine.
Whoever thought that was a good idea is a moron, full stop.
Different archive copies from when the site was under different ownership should retain their own policies - whether it is fully restricted, not restricted at all, or in between. Yes, that will take up space, holding on different copies of robots.txt files, linking them to websites, etc, but it is better than some archives not being available because of their current policy.
If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot
Mod parent up +Informative
If they start disregarding retroactive robots.txt directives, I'll start donating again. Domain name owners shouldn't be able to play "Ministry of Truth" with content that had previously been made publicly available on their domains.
I also agree. On today's internet nobody asks for permission to show us advertising, to follow us around the internet showing us ads, to try to sell us things on social networks, and now ISPs will sell our browsing history. So why should a "public library" not be able to just back up a full website? Also it is completely lawful for them to copy contents and information: https://www.law.cornell.edu/us...
Good luck Internet Archive. Back up everything in the world !!! Preserve all knowledge !!!
'Nuff said.
First of all, everything should be archived for future generations and researchers. Otherwise, it defeats the whole point of the project.
But for the general public, the robots.txt should be honored and content hidden with a few conditions. First of all, it should not be retroactive. I've seen valuable information lost when domains have changed name and the new owner has blocked the contents with a robots.txt. Second of all, there should be a review system to override the robots.txt. For example, if a site is cited in Wikipedia, the robots.txt should be ignored and hidden content unblocked.
So there's already permission for copying there. Don't honour http get requests without a login.
As a retired research librarian I come down strongly on the 'cache everything' side.
Removing materials dilutes the historical and sociological usefulness of the archive.
A hundred years from now one's peccadillos will mean nothing, but the record of such things will help researchers understand our times.
If you attempt to directly access cached URLs, you're hostile, same answer.
How you define "cached URLs" could determine how much money you have to spend fielding support calls from legitimate users who have bookmarked a document on your site.
Go watch the up and coming lame-arse movie called The Circle. The book sucked so the movie will be just as bad. If you ask this question you are proven to be stupid and young. lol. No doubt about it. You never studied history, haven't a clue about governments, and don't have any morals as you don't respect peoples rights.
A robots.txt file is a nice way of telling another, "please don't copy my site." However, the more mature and sophisticated answer is "if you copy this portion of my site, you may be liable for copyright infringement." This whole problem is really a problem with the limitations of robots.txt. Telling someone "please do this" or "please don't do that" is not nearly as significant as "you have a right to do this" and "I will sue you if you do that".
Fast Federal Court and I.T.C. updates
By ignoring robots.txt, archive.org would be gaining unauthorized access to a computer system as access was expressly denied as per the Robots Exclusion Standard.
To further disseminate the archived pages would be added infringements.
I think that they need to campaign site owners to modify their robots.txt and if need be, lobby for exclusions to the Computer Misuse Act.
[Rent This Space]
there is no law that says you have to obey the robots.txt, it's nice search engines etc obey the robots.txt, but they certainly don't have to.
Every site we develop has a dev instance with Robots set to disallow, and a prod instance. If the dev instances get exposed to the outside world (search, archive, wayback) then you have to teach reviewers how to use local vhost files which will be a huge pain in the ass, or put .htaccess passwords on everything which is just plain stupid.
Murphy was an optimist
If you think you want to keep a copy of my homebrew Ars Magica roleplaying game logging site feel free.
If you're relying on robots.txt to prevent people from seeing dev branch, then you're Doing It Wrong. Such things should be password/log-in protected.