Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org)

yeah by Anonymous Coward · 2017-04-22 20:11 · Score: 5, Informative

yeah!

Re:yeah by ArmoredDragon · 2017-04-22 20:25 · Score: 5, Informative

Law of headlines indeed, and there's already an established way for web developers to indicate that they don't want content cached or archived while still being searchable:
<meta name="robots" content="noarchive">
So archive.org could just honor that, and the problem would be solved. Google honors exactly this.
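For what it's worth, checking for that tag is cheap on the crawler side. A minimal sketch using Python's standard html.parser; the names RobotsMetaParser and allows_archiving are made up for illustration, and this is not a claim about how archive.org or Google actually implement it:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def allows_archiving(html):
    """Hypothetical helper: False if the page carries a noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

A crawler could call allows_archiving() on the fetched page and keep the copy out of the public archive when it returns False.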
Re:yeah by Zocalo · 2017-04-22 21:45 · Score: 5, Informative

Even more specific robots.txt directive for this instance:

User-agent: ia_archiver
Disallow: /
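
To see what a rule like this does for a crawler that chooses to honor it, here is a small sketch with Python's standard urllib.robotparser. The example.com URL is just a placeholder, and nothing here says which parser the IA actually uses:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The agent named in the file is shut out of the whole site...
print(parser.can_fetch("ia_archiver", "http://example.com/page.html"))
# ...while agents with no matching entry are still allowed.
print(parser.can_fetch("Googlebot", "http://example.com/page.html"))
```

The point is that the rule is per user agent, so it blocks the archiver without touching search indexing.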

As is often the case, Lauren is going off half-cocked with only part of the story. The IA already has a policy for removal requests (email info@) and is only considering expanding their current position of ignoring robots.txt to sites outside their current "test zone" of the .gov and .mil TLDs, where they have not had any problems. They probably will do that (and for their archival purposes it's a good idea in principle), but before starting to scream and shout I think it's only fair to see whether they listen to the feedback and provide a specific opt-out policy and technical mechanisms, like at least honoring either of the above, prior to going live on the rest of the Internet. It's going to be a two-way street anyway because they're going to find a lot more sites that feed multiple-MB of pseudo-random crap to spiders that ignore robots.txt to try and do things like poison spammers' address lists, so it's actually in their best interests to provide an opt-out they honor.

Besides, it's going to be interesting to see what kind of idiotic crap web admins who should know better think is safely hidden and/or secured because of robots.txt - it's useful to know who is particularly clueless so you can avoid them at all costs. :)

--
UNIX? They're not even circumcised! Savages!
Re:yeah by Anonymous Coward · 2017-04-22 22:25 · Score: 0

>> yeah!
> Law of headlines indeed
https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines
"Any headline that ends in a question mark can be answered by the word no."
Re:yeah by ArmoredDragon · 2017-04-22 22:31 · Score: 1

Yep, for some bizarro reason I was thinking I was replying to the "neh" post below this one, and I haven't had my drop of liquor yet.
Re:yeah by gustygolf · 2017-04-22 22:33 · Score: 1

It's going to be a two-way street anyway because they're going to find a lot more sites that feed multiple-MB of pseudo-random crap to spiders that ignore robots.txt
I don't think archive.org actually spiders things any more. They've been on-demand archival for, what, over a decade?
I mean, they had the Alexa toolbar that automatically submitted everything that the user browsed to their index, and that is (was?) likely their main source of entries...
Try looking at an unpopular site, and you'll find few and incomplete entries spanning over several years, especially as you go deeper than the front page. But a popular web site has archive entries available for pretty much every day of their history.

--
"Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
Re:yeah by Zocalo · 2017-04-22 23:11 · Score: 5, Informative

IA does still spider, but they seem to use a more nuanced system than the rudimentary "start at /, then recursively follow every link" approach used by more trivial site spider algorithms. Firstly, they don't download an entire site in one go - they spread things out over time to avoid putting large spikes into the traffic pattern which is more friendly for sites that are bandwidth limited and on things like "xGB/month" plans. Secondly, they have a "popularity weighting" system that governs the order they spider and refresh sections of a given site, which is the main reason for the difference between the level of content for popular and less popular sites - although I have no idea whether that's based entirely off something like the site's Alexa ranking or is also weighted against how dynamic the content is (e.g a highly dynamic site like Slashdot would get a bump up the priority, whereas a mostly static reference site might get downgraded). Combine the two approaches and you get the results you are seeing: major web homepages get spidered more or less every day with several levels of links retrieved, while some random personal blog only get spidered every few weeks or more, and only with the homepage and first level or two of links ever getting looked at.
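
A rough sketch of what such a weighting could look like, purely as an illustration. The formula, the 0..1 scores and the site names are all invented here; I have no knowledge of the IA's real scheduler:

```python
import heapq


def revisit_interval_days(popularity, change_rate, base_days=30.0):
    """Hypothetical weighting: popular and fast-changing sites come up sooner.

    popularity and change_rate are assumed to be scores in the range 0..1.
    """
    weight = 1.0 + 9.0 * popularity + 20.0 * change_rate
    return base_days / weight


def build_schedule(sites):
    """Return (due_in_days, site) pairs, soonest first."""
    heap = []
    for name, popularity, change_rate in sites:
        heapq.heappush(heap, (revisit_interval_days(popularity, change_rate), name))
    return [heapq.heappop(heap) for _ in range(len(heap))]


schedule = build_schedule([
    ("big-news-site.example", 1.0, 1.0),     # popular and highly dynamic
    ("static-reference.example", 0.8, 0.0),  # popular but rarely changes
    ("personal-blog.example", 0.0, 0.1),     # obscure, changes occasionally
])
for days, site in schedule:
    print(f"{site}: revisit in about {days:.1f} days")
```

Spreading the fetches for one site over that interval, rather than doing them all at once, would give the "no traffic spikes" behaviour as well.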

--
UNIX? They're not even circumcised! Savages!
Re:yeah by Anonymous Coward · 2017-04-23 03:44 · Score: 0

It was just a matter of time. So anyone with an IQ above that of a doorknob knows to now use aliases on the internet and avoid, pretty much at all costs, any real-life references.
Oh wait, facebook... I rest my case.
Re:yeah by wisnoskij · 2017-04-23 07:59 · Score: 2

But should that matter? If the website is publicly facing, why should you not be able to archive it (regardless of their wishes)? I can take pictures of houses I see from the street. The law seems fairly straightforward here, and it is easy to build any sort of wall around your website you wish to keep the public and archivers out.

--
Troll is not a replacement for I disagree.
Re:yeah by gustygolf · 2017-04-23 17:15 · Score: 1

while some random personal blog only get spidered every few weeks or more,
Well, my experience (as a user of archive.org, not as a webmaster) is more like 'every few years'...
FWIW, I mostly look up old static sites from around fifteen years ago. Back when people still had hitcounters.

--
"Slow Down Cowboy! It's been 58 minutes since you last successfully posted a comment" -- slashdot, driving users away.
Re: yeah by Anonymous Coward · 2017-04-23 19:05 · Score: 0

Since when do we support opt out, instead of opt in?
Re:yeah by Anonymous Coward · 2017-04-23 23:38 · Score: 0

Hell no! Archiving is just fine, but as soon as you start ignoring the website owners settings, you're on the wrong side. As soon as archive.org start doing this, I'm adding them to the "bad bots" setting of our platform, giving them 403's on a few thousand websites.
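A "bad bots" rule of that kind usually boils down to a User-Agent substring match. A minimal sketch, assuming the tokens below are the ones you would want to match (check your own logs first) and using a made-up status_for helper rather than any particular platform's setting:

```python
# Assumed tokens; verify against the User-Agent strings in your own logs.
BAD_BOT_TOKENS = ("ia_archiver", "archive.org_bot")


def status_for(user_agent):
    """Return the HTTP status to send: 403 for listed bots, 200 otherwise."""
    ua = (user_agent or "").lower()
    if any(token.lower() in ua for token in BAD_BOT_TOKENS):
        return 403
    return 200
```

Keep in mind this only stops bots that identify themselves honestly.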

neh by Anonymous Coward · 2017-04-22 20:13 · Score: 0

neh!

Cautiously saying yes to this by haruchai · 2017-04-22 20:14 · Score: 2

but it may have consequences I haven't considered

--
Pain is merely failure leaving the body

Re:Cautiously saying yes to this by bertoelcon · 2017-04-22 20:20 · Score: 1

Bandwidth seems like a likely problem if everyone does it.

--
Anything can be found funny, from a certain point of view.
Re:Cautiously saying yes to this by Anonymous Coward · 2017-04-22 23:21 · Score: 1

Cautiously saying yes to this but it may have consequences I haven't considered
And that's how we ended up with Donald Trump, you bastard!
Re:Cautiously saying yes to this by Zocalo · 2017-04-22 23:39 · Score: 1

I think the law of averages would take care of that. Bandwidth is pretty cheap and the chances are that even if you are constrained by bandwidth, as might be the case with a smaller site on an "xGB/day" hosting plan, then it's more likely to be the case there won't be too many GB of content to spider in the first place. There are always exceptions though, and where there is a real problem there are still going to be workarounds, e.g. explicit opt out clauses for spiders like IA's or, if all else fails, denying access based on User-Agent strings.

It does clearly depend on what effect this might have on the value of "everyone" though. Spidering (for legit purposes and otherwise) is mostly just background noise at present; the real bad actors - cyber criminals - already ignore robots.txt, and not every good actor would significantly benefit from ignoring robots.txt. The only real reasons a good actor might have for ignoring it are for better archiving (as with IA's proposals) or more complete search engine indices, but if the reason for the content being excluded via robots.txt is that it is highly dynamic, transient, or just fodder for bad robots, then it's of minimal value to search engines anyway. Even if some (or all) of the search engines were to follow IA's lead on this, I think they'd still be looking at balancing that with more intelligence in their spidering just to avoid the risk of cluttering up their databases with broken links and expired data, and that's likely to limit the bandwidth requirements considerably.

--
UNIX? They're not even circumcised! Savages!
Re:Cautiously saying yes to this by Anonymous Coward · 2017-04-23 03:50 · Score: 0

Anything can be found funny, from a certain point of view.
Your pain, suffering and ultimate death, for example?
Re:Cautiously saying yes to this by Anonymous Coward · 2017-04-23 04:01 · Score: 1

And somehow the world is still relieved it wasn't Hillary Clinton...
If given the choice of picking someone who could ruin your life, would you pick the evil conniving devil you know, or the bumbling orange buffoon you don't know?
Re:Cautiously saying yes to this by postbigbang · 2017-04-23 07:16 · Score: 1

My process is that if you go around the robots.txt, you're hostile, and you route to null on the next access. If you attempt to directly access cached URLs, you're hostile, same answer. The file of IPv4 and IPv6 addresses that have attempted this is easily a half-mile long.
Happy to add archive.org to it. Baidu, Bing, and yes, Google, are already there. Most of them have been from AWS instances snooping around. They get the same answer.
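In case anyone wants to try the same approach, the logic is roughly the sketch below. The path prefixes and the TrapBlocklist class are hypothetical stand-ins, and a real setup would push the addresses into a firewall or null route rather than a Python set:

```python
# Hypothetical paths that robots.txt tells crawlers to stay out of.
DISALLOWED_PREFIXES = ("/private/", "/cache/")


class TrapBlocklist:
    """Blocks a client on every request after it touches a disallowed path."""

    def __init__(self):
        self.blocked = set()

    def handle(self, ip, path):
        if ip in self.blocked:
            return "drop"
        if path.startswith(DISALLOWED_PREFIXES):
            self.blocked.add(ip)  # takes effect on the next access
        return "serve"
```

The addresses are plain strings here, so IPv4 and IPv6 entries can sit in the same list.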

--
---- Teach Peace. It's Cheaper Than War.
Re:Cautiously saying yes to this by Anonymous Coward · 2017-04-23 07:56 · Score: 0

Ever hear of motherless.com? Or trolls in general? It gives them boners.
Re:Cautiously saying yes to this by laie_techie · 2017-04-24 06:14 · Score: 1

As well as Obama destroying insurance ($5000 deductibles for everyone), Dianne Feinstein becoming a Billionaire *only while* in Congress, and of course the classic "We have to pass the bill to know what's in the bill." That from the bobble-head Nancy Pelosi. Yep. Oh yeah, you bastard!
My plan's deductible is $500 per person ($1500 / family), so I have a hard time believing $5k; $5k is closer to max out of pocket. Of course, I'm paying $500 / month premiums ...

No brainer by fnj · 2017-04-22 20:14 · Score: 5, Insightful

Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.

Re:No brainer by thsths · 2017-04-22 20:20 · Score: 5, Insightful

But that is not the question asked, is it?
robots.txt should apply to the page at the time. I do not see any decent argument against that.
But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
Re: No brainer by Anonymous Coward · 2017-04-22 20:21 · Score: 1

While the retroactive nature is indeed dumb, the simple fact is that if I don't want content I created/own to be copied by archive.org, it shouldn't be. And that should include content that maybe I didn't have a problem with being mirrored previously, but now do, albeit not through a stupid retroactive robots.txt file. This is throwing the baby out with the bathwater.
Re:No brainer by blackest_k · 2017-04-22 23:11 · Score: 5, Insightful

One problem I run into is with owner manuals for old film cameras: a lot of the time they disappear from the company website when the company gets taken over by another. Sometimes archive.org can come to the rescue if I can find where they used to be. Fair enough, the new company may only be interested in the digital models and have no interest in the historical products made by the company they acquired, but then they make boneheaded choices like erasing the historical information the original company put out for its customers.
Worse still is when a domain name lapses and is bought by another company that had zero access to the content of the former site. They bought a name, not a right to control the history of the former site.
The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
At the very least, when a website changes hands the new owner should not be able to erase the history of the site under the previous owner.

--
Blarney Quality Restaurant, Plants
Re:No brainer by c · 2017-04-22 23:43 · Score: 2

But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
Exactly. The policy where someone with no interest in a site (i.e. takeovers, lapsed domains, etc) can retroactively wipe all archives with just a couple lines in a config is flat-out wrong.
Ignoring robots.txt entirely, though, is a bad idea. Some sites use it to block archiving, sure, but some others use it to tell robots to avoid places where they'll never return from. There's a case for ignoring "Disallow: /", or anything that's significantly different from what, say, the Google search indexer is allowed to see.
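One way to read that last idea, as a sketch only: honor a block when the search indexer is blocked too (likely a crawler trap or junk), and treat an archiver-only block as the case up for debate. The should_skip name and the choice of Googlebot as the reference agent are my own assumptions:

```python
from urllib.robotparser import RobotFileParser


def should_skip(robots_txt, url, archiver="ia_archiver", reference="Googlebot"):
    """Hypothetical policy: skip a URL only if the reference search bot is blocked too."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(archiver, url):
        return False  # the archiver is allowed, nothing to honor
    # Blocked for the archiver: respect it only when it is not archiver-specific.
    return not parser.can_fetch(reference, url)
```

With this, a "User-agent: *" rule covering /cgi-bin/ is still respected, while a blanket "Disallow: /" aimed only at the archiver is not.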

--
Log in or piss off.
Re:No brainer by dissy · 2017-04-23 00:54 · Score: 4, Insightful

It should be even easier than that.
Archive.org should archive everything, including the robots.txt contents, at each scan.
The content being displayed from the archive.org website itself however could then still honor robots.txt at the time of the scan, purely for "display" purposes.
This way changing robots.txt to block search engines would not delete or hide any previous information.
Also the new information would still be in the archive, even if not displayed due to the current robots.txt directives.
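A minimal sketch of that idea, assuming each snapshot simply carries the robots.txt text that was served at scan time. The Snapshot record and is_displayable are illustrative names, not anything archive.org exposes:

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class Snapshot:
    url: str
    body: str
    robots_txt: str  # robots.txt exactly as served when this scan ran


def is_displayable(snapshot, agent="ia_archiver"):
    """Decide display from the robots.txt stored with the snapshot, not today's."""
    parser = RobotFileParser()
    parser.parse(snapshot.robots_txt.splitlines())
    return parser.can_fetch(agent, snapshot.url)
```

Both snapshots stay in storage either way; only the display decision differs, and it never depends on whatever robots.txt the domain serves now.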
Although it would require more work to do so properly, this would potentially allow for website owners to retroactively "unhide" content in the archive in the past as well.
Proper in this case would require some way to verify the domain owner, but this could likely be as simple as creating another specifically named text file in the website's root path, with content provided by the archive.
That can be as simple as the old school "cookie" data like so many other services use such as Google, or as complex as a standard that allows date ranges specified along with directives.
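The simple "cookie" variant could look something like this sketch. The function names are invented, and fetching the file from the site's root is left out on purpose:

```python
import hmac
import secrets


def issue_token():
    """Random value the archive would hand to the person claiming the domain."""
    return secrets.token_hex(16)


def verify_ownership(expected_token, fetched_file_contents):
    """True if the file fetched from the site's root holds the issued token."""
    expected = expected_token.strip().encode("utf-8")
    found = fetched_file_contents.strip().encode("utf-8")
    return hmac.compare_digest(expected, found)
```

Note that this only proves control of the domain today, which is exactly why it should unlock "unhide" requests and not retroactive deletion.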
But in any case, this would preserve copies of the website for future use, such as for when copyright protection expires.
Despite everyone having a differing opinion on just how long "limited time" should be in "securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries", no one who wants to be taken seriously can argue that this expiration should never happen.
Since the vast majority of authors make no considerations to protect our property, that task clearly needs to fall on us to secure.
Re: No brainer by Anonymous Coward · 2017-04-23 03:30 · Score: 0

It's a copyright violation to store and distribute those files without permission. They could make an argument when they weren't ignoring the expressed interests of the site operator, but now, I don't think they can even pretend like they're not pirating the content.
Re: No brainer by sumdumass · 2017-04-23 04:22 · Score: 2

Internet Archive is recognized as a bona-fide library organization recognized by the library of congress and US copyright office and as such is immune from most copyright laws in their pursuit of archiving and allowing access- with some restrictions of course.
Section 108 lays out the framework but US regulations provide more specifics in the exemptions and uses. As far as I know, they fall completely within the scope of the laws and limitations even if they ignore the robots.txt because the copyright law creates an exception to the rights imposed by law concerning libraries.
Even though there is no legal definition of pirating, I don't think they apply to even the common definition if translated to legal means as they are exempt from the restrictions normal people and organizations are subject to.
Re: No brainer by Anonymous Coward · 2017-04-23 04:37 · Score: 1

So why the fuck did the person put the stuff on the website in the first place? DMCA and anything similar to it is stupid.
Re: No brainer by Aighearach · 2017-04-23 06:05 · Score: 1

Even though there is no legal definition of pirating
"Piracy: a robbery or forcible depredation on the high seas, without lawful authority, done animus furandi, in the spirit and intention of universal hostility.
"It is not necessary that the motive be plunder or that the depredations be directed against the vessels of all nations indiscriminately. As in robbery upon land, it is only necessary that the spoilation or intended spoilation be felonious, that is with intent to injure and without legal authority or lawful excuse."

--Bouvier's Law Dictionary (1897)
Re: No brainer by sumdumass · 2017-04-23 06:54 · Score: 1

Ok, so I should have said there is no legal definition of pirating concerning copyright.
Most people would have read that into the comment seeing how the entire discussion being replied to was about copyright. But I guess I should admit that I did not account for the one interpretation by someone not following along.
Re: No brainer by Aighearach · 2017-04-23 08:05 · Score: 1

Ok, so I should have said there is no legal definition of pirating concerning copyright.
Nobody cares if the vessel that you pillage is carrying copyright assignments, gold dust, or toilet paper. The definition of piracy does not change.

Most people...
Are fucking idiots who don't understand the difference between propagandized hyperbole, and whatever other options for communication there are.
Re:No brainer by mrchaotica · 2017-04-23 08:56 · Score: 4, Insightful

The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...
This is the single most important reason there could ever be!

--
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Re:No brainer by CanEHdian · 2017-04-23 21:41 · Score: 1

It's even worse, the domain name for a retro-gaming related website I consulted via wayback expired and was re-registered; the new robots.txt file now makes the old website inaccessible!

--
When the copyright term is "forever minus a day", live every day like it's the last.
Re:No brainer by Michael_Paoli · 2017-04-24 17:11 · Score: 1

Absolutely! robots.txt applies at the time, not retroactively. I see nothing in the robots.txt standard that implies it should apply retroactively. It's basically a do/don't crawl *now*, it says nothing of what was the case before. So, if it was allowed per robots.txt when the pages were crawled, those are fair game for archiving ... period. archive.org or anyone else deeming themselves the exception regarding what robots.txt specified, and crawling where it says not to - that's the wrong direction to go - period. But what was allowed and crawled earlier, perfectly fine, and nothing in a change of robots.txt should imply it should alter availability or presentation of what was crawled when robots.txt earlier allowed it to be crawled at that time. End of story. :-) "Make things as simple as possible, but not simpler." - Einstein
Re: No brainer by sumdumass · 2017-04-29 05:49 · Score: 1

Nobody cares if the vessel that you pillage is carrying copyright assignments, gold dust, or toilet paper. The definition of piracy does not change.
lol.. As if the internet is a vessel on the high seas.

Are fucking idiots who don't understand the difference between propagandized hyperbole, and whatever other options for communication there are.
I'm assuming either you are including yourself or this was just an exercise to illustrate that. Either way, it is annoyingly silly due to the obvious nature of the comment.

Robots.txt is not only for privacy by Lorens · 2017-04-22 20:17 · Score: 3, Interesting

It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.
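
To make that concrete, here is roughly what such a page generator amounts to. This is a sketch; the /multiply URL scheme and the function name are invented for the example:

```python
import random


def multiplication_page(a, b, rng=random):
    """Render one page: the product, plus links to ten other random pairs."""
    lines = [f"{a} x {b} = {a * b}"]
    for _ in range(10):
        x, y = rng.randint(0, 10**6), rng.randint(0, 10**6)
        lines.append(f'<a href="/multiply?a={x}&b={y}">{x} x {y}</a>')
    return "\n".join(lines)
```

A dozen lines of code, yet every page hands a spider ten more URLs it has almost certainly never seen, so a crawler that follows everything never finishes.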

Suuuuuuure by Anonymous Coward · 2017-04-22 20:25 · Score: 0

Cache everything on the Pirate Bay. I'm sure the lawyers will find that useful in some court somewhere.

Re:Suuuuuuure by Anonymous Coward · 2017-04-22 20:40 · Score: 0

Cache everything on 4chan. I'm sure that... okay, nobody's going to find that useful.
Re: Suuuuuuure by Anonymous Coward · 2017-04-23 17:13 · Score: 0

There's already a bunch of third party 4chan archives. The users find them useful for browsing threads that they just missed before deletion.
Personally, I'm not in favor of them, as I feel that a feature of 4chan is to be transient. Oh well.

Block wildcard by Anonymous Coward · 2017-04-22 20:35 · Score: 2, Interesting

archive.org should block wildcard robots.txt, eg ones that say block everything. With a few exceptions:

Image boards (eg 4chan, reddit, and similar forums) due to how frequently they change, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and due to the rampant piracy, would only lead to further DMCA requests aimed at archive.org

Piracy sites - For obvious reasons.

Domain parking - A domain parking site should be treated as spam.

Re:Block wildcard by mikael · 2017-04-22 20:39 · Score: 2

It could archive a specific thread on a board once there has been no activity for over six months.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:Block wildcard by _merlin · 2017-04-22 21:51 · Score: 2

Threads on 4chan last hours, not months.
Re:Block wildcard by KiloByte · 2017-04-23 00:41 · Score: 2

Piracy sites -- they deserve special protection, as they're very likely to be disappeared against their owner's wishes.
Image boards -- a glimpse into ephemeral content is worth keeping, even if you miss most of it.
Domain parking -- I agree with you, they're 100% spam. But they're the primary reason such deletion must not be retroactive.

--
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
Re:Block wildcard by wisnoskij · 2017-04-23 08:05 · Score: 1

More likely minutes on the popular ones. But 4chan already has its own archive websites I believe.

--
Troll is not a replacement for I disagree.

Random generated content by DrYak · 2017-04-22 20:37 · Score: 4, Informative

It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results

And archive.org already has a correct behaviour for that :
- it won't try to download the entire infinity of solutions in one go (e.g. generating gigabytes' worth of data out of the 1kB Perl/PHP/NodeJS/whatever source)
- instead it will occasionally rescan the page, every few days (more or less frequently, depending on popularity of the links)
It provides a small glimpse of what a user could have seen back then on the website.

By the way, back in the 2000s, this was exactly a popular way to poison the spiders of SPAM robots that were scanning the web for e-mail addresses.
- Either they honour robots.txt and don't scan that or any other source of e-mail addresses on the site.
- Or they ignore robots.txt and follow links they aren't authorised to, and end up siphoning gigabytes' worth of bogus e-mail addresses auto-generated by a small Perl script, which will pollute their base of harvested addresses.

Archive.org's spider might be a tiny bit more susceptible to this kind of thing.
Not as much as a SPAM email-harvesting spider (which will try to download as much as possible, much more aggressively than archive.org), but such a labyrinth of links might still get the archiver lost.
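
The usual defence against getting lost is a hard budget per site. A minimal sketch, assuming a breadth-first crawl with a page limit and a depth limit; the limits and the endless_links stand-in are arbitrary, and I am not claiming this is archive.org's actual algorithm:

```python
from collections import deque


def crawl(start, get_links, max_pages=50, max_depth=3):
    """Breadth-first crawl that stops at a page budget and a depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


def endless_links(url):
    """Stand-in for a link labyrinth: every page links to ten brand-new URLs."""
    return [f"{url}/{i}" for i in range(10)]


pages = crawl("http://trap.example", endless_links)
print(len(pages))
```

The budget holds even when every request produces fresh URLs, so the spider walks away with a small glimpse of the labyrinth instead of gigabytes of it.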

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]

Re:Random generated content by a_n_d_e_r_s · 2017-04-23 01:10 · Score: 1

You are assuming the web page will have the same URL. But what if the script auto-generates new URLs for each request?
Then they will get an unlimited amount of web pages.
It's not hard to make a web page that works that way.

--
Just saying it like it are.
Re:Random generated content by Anonymous Coward · 2017-04-23 02:16 · Score: 0

But since that is archive.org's problem to solve, it's sort of irrelevant to the question at hand.

No by Anonymous Coward · 2017-04-22 20:41 · Score: 1

If they do that on my sites (and many others I'm sure) they'll get locked out.

Re:No by TheFakeTimCook · 2017-04-23 01:58 · Score: 2

Archive.org needs to respect copyright law and stop this blatant reproduction of protected works!
Unless I explicitly consent to archiving (or searching for that matter), my content should never reside on someone else's server.
Sounds like you shouldn't have put that information on the Internet in the first place.
Re: No by Anonymous Coward · 2017-04-23 03:44 · Score: 2, Informative

Robots.txt is a suggestion, not a requirement.
Re:No by sumdumass · 2017-04-23 04:45 · Score: 1

Check out section 108 of the US copyright law. It provides exceptions for libraries and archival libraries. You really have no "copyright" say in the matter, as the Internet Archive is a bona fide library and your legal rights granted by copyright do not apply to them.
Other sites, yes. But not for the internet archive.
Re: No by Anonymous Coward · 2017-04-23 09:41 · Score: 0

Accessing my server is a privilege, not a right - if you abuse my good will, you're gone.
Re:No by Anonymous Coward · 2017-04-23 20:02 · Score: 0

Put it behind a captcha then, like a toddler.
Re: No by Anonymous Coward · 2017-04-24 01:50 · Score: 0

Not sending spam to my email address is also just a suggestion, not a requirement. I will *still* block you if you do it.
Re: No by Anonymous Coward · 2017-04-24 09:39 · Score: 0

Robots.txt is a suggestion, not a requirement.
Then it's useless.

No. by Gravis+Zero · 2017-04-22 20:44 · Score: 4, Insightful

A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.

If you have an example of an organization completely ignoring robots.txt, do tell.

--
Anons need not reply. Questions end with a question mark.

Re:No. by locater16 · 2017-04-22 21:02 · Score: 1

The value of it is already eroded since it's messing up their site's intended purpose. They're just trying to correct for a technical problem; this is a technical problem, not some ethical dilemma.
Re:No. by dwywit · 2017-04-22 21:13 · Score: 4, Insightful

robots.txt is a polite way of saying "please don't"
But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?

--
They sentenced me to twenty years of boredom
Re:No. by Anonymous Coward · 2017-04-22 21:20 · Score: 1

archive.org ignoring robots.txt is a slippery slope, indeed. but there is no 'technical problem' here.
a web site operator specifically CHOOSES ON THEIR OWN to include directives in robots.txt to tell a bot to 'fuck off'. if they choose to add wayback machine to their robots.txt file, it is their choice, and archive.org should always honor such request
Re:No. by Anonymous Coward · 2017-04-22 21:43 · Score: 1

People like you are why we can't have nice things. You think it's fine to do whatever you like to other people as long as it's not punishable by law, just because you can, no matter what their opinion.
IOW: You're an asshole.
Re:No. by Anonymous Coward · 2017-04-22 23:58 · Score: 0

Let me put it bluntly: If you have web server log files, you have examples of organizations ignoring robots.txt. The robots exclusion standard is not a contract. It is a hint that is meant to help avoid excessive resource usage. Hints will be ignored.
Re:No. by Anonymous Coward · 2017-04-23 01:19 · Score: 0

The thing is that there already are people ignoring robots.txt. Spiders scanning for e-mail addresses to sell aren't going to care about robots.txt, for example.
robots.txt is a way to communicate the desire of the hosting party to spiders. If the person writing the spider thinks that there is a miscommunication, it doesn't seem unreasonable to interpret it as they see fit.
If the host feels differently it can always write a mail to the spiders owner and ask them to stop doing that, or even stop serving the page to the spider entirely.
robots.txt is still a pretty bad way to prevent undesired traffic since it relies on the other party to comply. It's about as efficient as filtering content based on the evil bit.
Re:No. by Anonymous Coward · 2017-04-23 01:26 · Score: 0

Actually he asked a question. Instead of an answer you served back abuse. All I have seen is abuse or expressions of disregard in support of your position. So please answer... what is the downside of ignoring robots.txt besides "well, others will ignore it too"? If there is no downside in the first place then having others ignore it is still no downside. 0*n=0.
I see risk for the person ignoring the robots.txt. They may have problems but that is self inflicted so self regulating. How is a publicly viewable website harmed by others ignoring robots.txt? Is it a bandwidth use thing? I hardly see how respectable organizations will create much traffic. If non-reputable organizations or individuals running spiders are of concern then Wayback's decision will make no difference. Those people will ignore the file anyway. There is a legitimate reason to argue for ignoring. People are controlling the narrative by removing articles. They are in effect erasing the historical record. This seems a much more pertinent concern than honoring a courtesy based system that will be abused by the unprincipled, both by robots and by website owners wishing to erase prior embarrassment. If it were just other robots abusing the courtesy then there would be good reason for defending it. When the courtesy is abused by websites then all that evaporates, as any good reason to comply is outweighed by the perversion. It seems silly to make a stand on principle for something that is so corrupted.
Re:No. by Anonymous Coward · 2017-04-23 02:12 · Score: 0

People like you are why we can't have nice things. You think it's fine to do whatever you like to other people as long as it's not punishable by law, just because you can, no matter what their opinion.
IOW: You're an asshole.
If someone puts a sign on their lawn saying that certain kinds of people aren't allowed to look at their house, is that a nice thing? Are we an asshole for not giving a shit about their opinion?
Re:No. by Anonymous Coward · 2017-04-23 03:24 · Score: 0

None of that is relevant, because this isn't about prevention. It's about respect, and whether someone who is considered reasonably reputable should show that or not. It's just like a book: nothing physically prevents you from putting it through a photocopier, but you don't do it anyway. And there are quite stiff penalties if you do, and get caught distributing your copies. Which is the bottom line the people asking this seem to completely miss.
Technically, archiving someone else's content if done without permission is copyright infringement. You don't have to lock your site down for copyright to apply, hell you don't even have to have a robots.txt. And if there indeed is a robots.txt which tells you to buzz off, and you still copy it, you're wilfully infringing. Not something someone would like to get caught doing.
Re: No. by Anonymous Coward · 2017-04-23 03:49 · Score: 1

No. The purpose of the file is to let crawlers know that a page is not suitable for indexing, and/or to give site operators an "opt out" capability for crawlers which CHOOSE to offer such features. If you have a problem with a particular company then block their IP space.
Re: No. by Anonymous Coward · 2017-04-23 03:52 · Score: 0

No no no. Robots.txt is a two party consent mechanism. As in, both parties have to explicitly agree to use it and on how to use it. Neither side is required to use it. It's NOT an access control mechanism nor was it ever meant to be.
Re:No. by Anonymous Coward · 2017-04-23 03:55 · Score: 0

Ok, I'll bite on this one.
No, he didn't ask a question, he said "So what?", which translates as "I don't give a shit". He got answered in kind.
You want a downside to ignoring robots.txt? How about I serve you a lawsuit for wilful copyright infringement? The fact that others ignore robots.txt is as useless a defence as any preschooler trying to evade punishment by claiming that "Steve did it too!". It's a childish, stupid and useless argument that doesn't work anywhere.
The harm for me as a publisher of a public website is that I lose control over my own material. It's not like I relinquish any kind of control over it just because I publish something. If you want to test this theory, take a book and feed it through a photocopier. There's nothing physically to stop you from doing that. Then take your copies, and go to the town square and start handing out your copies. And again, other people ignoring the rules without getting punished does not absolve you from following them, it just means that they didn't get caught or that the enforcement agency didn't deem them important enough to deal with.
I absolutely agree that people trying to control the narrative by deleting articles is a huge problem, but behaving like an asshole is not the way to combat it. Instead, people who do this should be treated as non-sources, because they simply are not reliable. However it's important that articles can be retraced and removed, in case they are outright false, wrong, misleading or otherwise cause harm to, for instance, a third party. I find answering that with "so what?" to be an incredibly narrow-minded and assholeish way to react.
Your argument basically boils down to "There are people who are unprincipled assholes, therefore let us all be unprincipled assholes", which I find to be an incredibly sad and destructive approach.
Re:No. by Anonymous Coward · 2017-04-23 04:07 · Score: 0

Actually he asked a question. Instead of an answer you served back abuse. All I have seen is abuse of expressions of disregard bla, bla
No, you and he are still just fucking assholes. Dad's a lawyer? Yep.
Re: No. by Anonymous Coward · 2017-04-23 04:49 · Score: 0

I guess you meant "Yes, yes, yes", because you're saying exactly what I said. You might even want to read what I said about the book and the photocopier again. There doesn't have to be an access control mechanism, because there already is one - NB that I'm talking in the sense of outright archiving, i.e. copying the whole thing - which is copyright law.
Publishing would be the permission to access the content, and a missing robots.txt could be seen as an implicit permission to archive the site, but if you just flat out ignore it when it says NO, you have absolutely no legal right whatsoever to copy it and you're committing wilful copyright infringement. Just the same as if you feed that book through the photocopier.
And again, just to make sure: This isn't about access, it's about copying.
Re: No. by bn-7bc · 2017-04-23 04:55 · Score: 1

Hmm, are there any kinds of modules/plugins for webservers that can (via a user-configurable blacklist) just send a fake 404 status if a particular user agent tries to connect, and otherwise just forward the request for normal processing? That would probably solve this problem, right?
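As a rough illustration of the idea (not a real server module), a user-agent blacklist can be sketched as a small Python WSGI wrapper. The names block_agents, hello_app and status_for are made up for this sketch, and the blacklist simply reuses the crawler names mentioned elsewhere in this thread; a crawler that lies about its User-Agent would of course pass straight through.

```python
# Hypothetical sketch: function names and blacklist entries are illustrative only.
BLOCKED_AGENTS = ("ia_archiver", "archive.org_bot")  # substrings to match (assumed)


def block_agents(app, blocked=BLOCKED_AGENTS):
    """Wrap a WSGI app; answer 404 to requests whose User-Agent is blacklisted."""
    lowered = tuple(b.lower() for b in blocked)

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(b in agent for b in lowered):
            body = b"Not Found"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else is forwarded for normal processing.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    body = b"hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(app, user_agent):
    """Call a WSGI app once and return the status line it produced."""
    seen = []

    def start_response(status, headers):
        seen.append(status)

    list(app({"HTTP_USER_AGENT": user_agent}, start_response))
    return seen[0]
```

Wrapping the site as block_agents(hello_app) leaves ordinary browsers untouched and only changes the response for the listed agents.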
Re:No. by duke_cheetah2003 · 2017-04-23 05:35 · Score: 1

A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.
If you have an example of an organization completely ignoring robots.txt, do tell.
I gotta agree with this. The mechanism of robots.txt needs to be respected in all cases, lest it become obsolete and ignored if big enough players decide it is meaningless and ignorable.
I personally don't give a hoot about my page(s) appearing in an archive, what I don't want, is Google, Bing, Yahoo, or anyone else, indexing my pages so they might appear in search results with terms that may be present on my pages. Not hiding anything, frankly there's almost nothing on my webserver (visible at least), even if robots.txt was absent. It's just my choice, and I rather like that it is respected.
Re:No. by duke_cheetah2003 · 2017-04-23 05:38 · Score: 1

robots.txt is a polite way of saying "please don't"
But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?
This right here needs elaboration. Sure, I can put my stuff on a webserver for the world to see. But you see, what I didn't sign up for is every search engine to download all my webpages and make them available in search results. Feel free to poke my website as a human, but not as an indexer, hence robots.txt asking robots to bother someone else.
Re:No. by Aighearach · 2017-04-23 06:10 · Score: 1

will only lead to the justification of other organizations
Well, if organizations don't even need to "justify" what they scan or don't scan, then this is a non-argument.
Re:No. by Anonymous Coward · 2017-04-23 06:27 · Score: 0

So who cares what you "signed up for"? You didn't sign up for it to be only viewed by pre-agreement, and nobody else signed up to avoid indexing "your stuff".
Get an actual sign up: login and session keys.
Re:No. by Anonymous Coward · 2017-04-23 07:23 · Score: 0

Let me put it bluntly: Our society is built on hints, because there is no way everything could be codified into law or customs. Your refusal to take hints is what's going to leave you wasting away in a nursing home if you can afford one, or a ditch if you can't, because nobody will give a shit about you.
Re:No. by Anonymous Coward · 2017-04-23 13:17 · Score: 0

If you are protecting sensitive sections of your website with robots.txt, you're a fool. That's the first file hackers download to see where they should start looking.
Re:No. by Anonymous Coward · 2017-04-24 01:28 · Score: 0

The problem is that when domains expire, the squatters that occupy it afterwards tend to use robots.txt to hide all old content from the Archive.
I'd rather have the value of robots.txt be eroded than the real value that the Archive provides.

Here is my clever idea... by Snard · 2017-04-22 20:51 · Score: 3, Interesting

Maybe there can be a separate directive/section added inside robots.txt that gives direction to sites like archive.org on these matters. So both search engines and archival systems can behave honorably. If someone really does not want their site archived for the ages, archive.org should clearly respect that.

--
- Mike

Re:Here is my clever idea... by Anonymous Coward · 2017-04-22 21:06 · Score: 1

Already possible, practically since the inception of robots.txt.
User-agent: archive.org_bot
Disallow: /
Re:Here is my clever idea... by blind+biker · 2017-04-22 21:27 · Score: 3, Insightful

Then why even have a website visible on the internet, if you don't want it searchable and archivable? Those two effectively mean "invisible" - because as long as it is visible, it is also archivable - if nothing else, manually.

--
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
Re:Here is my clever idea... by Zocalo · 2017-04-22 23:46 · Score: 1

Try explaining that to the legacy mainstream media dinosaurs that are still busy taking Google to court for spidering, indexing, and linking to their content, despite the debacle of Spain a few years back, and see how far it gets you. Common sense is in short supply in some corners of the Internet, and fairly large corners at that.

--
UNIX? They're not even circumcised! Savages!
Re:Here is my clever idea... by Anonymous Coward · 2017-04-22 23:55 · Score: 0

Then why even have a website visible on the internet, if you don't want it searchable and archivable? Those two effectively mean "invisible" - because as long as it is visible, it is also archivable - if nothing else, manually.
Experiences, ephemeral art, privacy (having stuff publicly indexed, or publicly available on some well-known archival website, is very different from having a few rogue and private robots indexing your content and rarely making any of it available publicly, particularly in easy ways to the general public), changes of mind, some amount of control (even if limited, it is very important, psychologically, when dealing with large public and well-known entities...), etc.
There will be two results:
1) People will start domain/IP blocking Archive.org/Alexa robots and maybe even most robots beside a few well-known ones (and that means consolidating the power of a few robots, preventing most competition).
2) People will be more wary to publish anything on the web. Also known as self-censorship.
Re:Here is my clever idea... by allo · 2017-04-23 01:46 · Score: 1

> 2) People will be more wary to publish anything on the web. Also known as self-censorship.
Actually you NEED to be wary. If you publish something, it is there, it will be copied, screenshotted, archived ... not only by bots. Have a look at Twitter. When a celebrity tweets something dumb, there are 5 people posting screenshots before they have the chance to delete the tweet.
You should not censor yourself. But maybe you choose to use your anonymity online. Even an anonymity which isn't anonymous against your hoster is anonymous against archives. You do not need to put your name next to everything (dumb) you say online. An interesting opinion is worth a read independent of the writer.

Could this break the computer misuse act? by Anonymous Coward · 2017-04-22 21:01 · Score: 1

Section 1 a & b (http://www.legislation.gov.uk/ukpga/1990/18/section/1)

Access to the information is unauthorised (robots.txt says no) but they do it anyway and wilfully.

Re:Could this break the computer misuse act? by Anonymous Coward · 2017-04-22 22:59 · Score: 0

If something is publicly available on the web, I doubt simply accessing information is an issue.
However, being on the web does not imply that a right to redistribute is granted, but basically that's what they are doing. Has this been tried in court?
Re:Could this break the computer misuse act? by earthloop · 2017-04-23 00:58 · Score: 1

The British Library also maintain an archive. The FAQ relating to their crawler is quite an eye opener:
(http://www.bl.uk/aboutus/legaldeposit/websites/websites/faqswebmaster/)
: Do you respect robots.txt?
: As a rule, yes: we do follow the robots exclusion protocol. However, in certain circumstances we may choose to overrule robots.txt. For instance: if content is necessary to render a page (e.g. Javascript, CSS) or content is deemed of curatorial value and falls within the bounds of the Legal Deposit Libraries Act 2003.
: Can I stop the crawling by using robots.txt or blocking your IP?
: Adding our crawls to robots.txt will stop further crawling once we reconsider the file (see above). Similarly, blocking our IP will stop all further access from that IP address. However, the British Library and other deposit libraries are entitled to copy UK-published material from the internet for this national collection. If you disallow our crawler or block our IP, you will introduce barriers to us fulfilling our legal obligations.

Privacy is an illusion by aglider · 2017-04-22 21:16 · Score: 1

robots.txt doubly so.

--
Sent as ripples into the electromagnetic field. No single photon has been harmed in the process.

The current system is stupid. by Going_Digital · 2017-04-22 21:20 · Score: 1

The problem is new owners of domains adding a robots.txt, causing the archive to remove old site scrapes. It seems entirely reasonable to assume that adding a robots.txt file should only apply to current content, as chances are that prior content is not owned by the new owner of a domain. I think that existing content should remain but new scrapes should stop when a new robots.txt file appears on the domain. A complaints procedure could then be provided for content owners who didn't realise that their content was being archived to request that it be removed.

Re:The current system is stupid. by Anonymous Coward · 2017-04-22 22:58 · Score: 0

Exactly: Why not scrape the robots.txt for each entry and be done with it.
If they want pages removed from the public (!) archive view, they can just email.
If they want it deleted they should show a good reason.
Re:The current system is stupid. by Anonymous Coward · 2017-04-23 01:14 · Score: 0

Sometimes it's because there's a database URL that provides unreliable or non-existent result whenever there's a robot accessing it.
Re:The current system is stupid. by arth1 · 2017-04-23 02:33 · Score: 1

The problem with robots.txt is that it doesn't contain a validity period.
Say I add mustnotbecrawled.html, a link to it in existingpage.html, and a modification to /robots.txt that bans crawling of mustnotbecrawled.html. The problem is that a robot might have downloaded robots.txt right before my publishing, and does not see that it shouldn't crawl it. So it does.
It could be argued that a crawler should always re-load robots.txt if encountering a document newer than the last server transmit time for robots.txt, but that adds a lot of extra requests.
Some propose using the meta tag for excluding browsers, but that has its own problems. Like only working for XML type documents. And being applied after the fact. If I have a several megabytes HTML, and want to exclude it to save bandwidth, the meta tag won't work. It adds a little bit extra bandwidth.
I think this should be handled at user-agent level, where crawlers identify themselves as a crawler, and the web server can make the decision on whether to serve them based on that.
Re:The current system is stupid. by Anonymous Coward · 2017-04-23 03:54 · Score: 0

It could be argued that a crawler should always re-load robots.txt if encountering a document newer than the last server transmit time for robots.txt, but that adds a lot of extra requests.
How about:
1. fetch robots.txt
2. crawl a bunch of pages
3. fetch robots.txt
4. apply the restrictions of both robots.txt files (the "before" and "after" files) to the pages that were crawled in between.
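A minimal sketch of step 4, assuming the two robots.txt bodies have already been fetched as text. It uses Python's standard urllib.robotparser; the helper names parse_robots and keep_allowed_by_both and the default agent string are made up for illustration. It only compares the two snapshots it is given, so any rule that appeared and disappeared between the two fetches stays invisible to it.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Build a parser from the text of one robots.txt snapshot."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def keep_allowed_by_both(urls, robots_before, robots_after, agent="ia_archiver"):
    """Keep only the URLs that BOTH robots.txt snapshots allow for this agent."""
    before = parse_robots(robots_before)
    after = parse_robots(robots_after)
    return [
        url for url in urls
        if before.can_fetch(agent, url) and after.can_fetch(agent, url)
    ]
```

Pages crawled between the two fetches would be passed in as urls; anything dropped from the returned list gets discarded instead of stored.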
Re:The current system is stupid. by arth1 · 2017-04-23 04:45 · Score: 1

No, that won't work. Changes may have taken place in-between the two copies of robots.txt.
An example: A newspaper.
At the first fetch of robots.txt, an article might not exist. The first version of it has not yet been verified, and is published with a new robots.txt that tells robots not to crawl it. Then, the article is modified and verified, and a new robots.txt published that now allows crawling it.
Yet, a spider may have caught the first robots.txt from before the article, the article while it was in bad shape, and the second robots.txt from after it was corrected. Both robots.txt files agree that it can be cached, yet the copy that was crawled was never meant for caching, and the robots.txt at the time it was published even said so.

I mean, why not? by Anonymous Coward · 2017-04-22 21:20 · Score: 0

The bad guys already have everything. Why shouldn't the rest of us?

Re:I mean, why not? by Anonymous Coward · 2017-04-22 21:57 · Score: 0

Because the bad guys have money and we do not. Face, meet boot.
Re:I mean, why not? by Anonymous Coward · 2017-04-22 22:28 · Score: 0

If I don't become a criminal someone else just will...

Legal but dickish by Anonymous Coward · 2017-04-22 21:23 · Score: 0

It is legal and should be legal. It is also dickish and a nice example for why honor based systems are doomed to fail.

robots.txt indeed does NOT have value by blind+biker · 2017-04-22 21:24 · Score: 2, Interesting

The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make search of their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.

--
"The agriculture ministry is not in charge of Gundam" - Japanese ministry official.

Re:robots.txt indeed does NOT have value by Anonymous Coward · 2017-04-23 00:51 · Score: 0

So, only what you want and fuck what the content creator wants? No, that's not right. If they don't want it archived, then it shouldn't be archived. If it causes you a small amount of inconvenience, well, welcome to life. Deal with it.
Re:robots.txt indeed does NOT have value by Anonymous Coward · 2017-04-23 01:41 · Score: 0

What the scientific publisher wants and what the content creator (usually a scientist) wants are rarely the same.
Re:robots.txt indeed does NOT have value by Anonymous Coward · 2017-04-23 04:31 · Score: 1

If they don't want it archived, then it shouldn't be archived.
Why not? If they don't always get exactly what they want, well, welcome to life. They should deal with it.
Re:robots.txt indeed does NOT have value by duke_cheetah2003 · 2017-04-23 05:45 · Score: 1

The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make search of their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.
Keep in mind, if the world collectively decides to ignore robots.txt, a polite and easy way to tell indexers to go away, people will take stronger measures to prevent indexers from doing unwanted things with content they don't own and have no rights to, right up to blocking indexer sourced requests outright, no robots.txt, no http, just the middle finger of 'connection closed by foreign host.'
Re: robots.txt indeed does NOT have value by Anonymous Coward · 2017-04-23 06:12 · Score: 0

I would hate to see what you say to rape victims.
Re: robots.txt indeed does NOT have value by Anonymous Coward · 2017-04-23 06:31 · Score: 0

Something completely different, because the two situations are not even remotely alike?
(And the fact that you even suggest they might be seriously brings into question *your* attitudes to rape.)

NO use meta tags to control this by Anonymous Coward · 2017-04-22 21:38 · Score: 0

Using a robots.txt is as awful as favicon.ico

And why the fuck do modern browsers ignore a 410 GONE response for /favicon.ico and continually re-request it? WTF is the point of a permanently removed status code if it isn't cached?

Re:NO use meta tags to control this by allo · 2017-04-23 01:48 · Score: 2

Gone doesn't mean there will be no replacement. It just tells, that the replacement will not be the same file. So you can re-request the URL, but you should not try to resume a download from there.

Accept copyright by Anonymous Coward · 2017-04-22 21:44 · Score: 0

But deprecate robots and have the http request be a legitimate way to ask for permission. If a web page can be got without login, then it's fair game.

This is the case now, it's just that the wealthy corrupted the courts and have used robots.txt as a pretext for how there's no right given by them giving you what you asked.

Re: Accept copyright by Anonymous Coward · 2017-04-22 21:50 · Score: 0

You misunderstood copyright. Go study and learn
Re: Accept copyright by Anonymous Coward · 2017-04-23 00:02 · Score: 0

It's very annoying to have bots calling your compute scripts, etc. Starting expensive computations just because they want to crawl. That's the main purpose of robots.txt. Not copyright.
Re: Accept copyright by Anonymous Coward · 2017-04-23 02:20 · Score: 0

You misunderstood copyright. Go study and learn
You misunderstood what "fair use" provisions exist. Go study and learn.

Their behaviour so far was beyond mere compliance by CustomSolvers2 · 2017-04-22 22:01 · Score: 1

If, on day 1, the robots.txt file of a given site allows information to be collected and archive.org does so, they would be fully complying with robots.txt. If, on day 2, that site modifies the robots.txt file and restricts access for all bots, archive.org shouldn't collect any more information, but why delete what was rightfully stored on day 1? Such a deletion would be motivated exclusively by their own policy, not by what should be expected from robots.txt compliance.

A different story would be determining whether they can rightfully store and display information from other sites, what the owners have to say about that and for how long certain type of information might be kept. Nothing of this has to do with respecting robots.txt, but with privacy and third-party information management on the lines of the right to be forgotten.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

Yes everything... by Anonymous Coward · 2017-04-22 22:05 · Score: 0

and when finished it should archive itself and restart the universe.

No by Anonymous Coward · 2017-04-22 22:07 · Score: 0

Archive.org needs to respect copyright law and stop this blatant reproduction of protected works!

Unless I explicitly consent to archiving (or searching for that matter), my content should never reside on someone else's server.

Re:Their behaviour so far was beyond mere complian by CustomSolvers2 · 2017-04-22 22:18 · Score: 1

Some clarifications just in case:
- I don't think that archive.org or any other site should fully ignore robots.txt, or any other express indication of what the website owner wants.
- The robots.txt files of my two sites don't include any kind of restriction and never did.
- All the crawling bots which I develop (currently running ones ranking web domains) always respect robots.txt or, depending upon the exact conditions, anything else which clearly indicates the site owner expectations.
- I am not precisely a (restricted) copyright fan and my whole online activity may be considered public domain.

--
Custom Solvers 2.0 = Alvaro Carballo Garcia = varocarbas.

every one of you knows for certain by weedjams · 2017-04-22 22:18 · Score: 1

that "Turn on, tune in, drop out" was pert and remains so, even more so now with such a netdorked world. WTF are you bickering about it still?

Example? Ok - Baidu by Anonymous Coward · 2017-04-22 22:23 · Score: 0

Baidu.

Ended up blocking all their scanning IPs because of their poor behavior. That was a few years ago.

Maybe they've changed?

YES!! by Vadim+Makarov · 2017-04-22 22:48 · Score: 5, Insightful

I applaud the direction internet archive takes. They should fully implement it.

A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I maintained for the past 15 years. The squatter who now owns the name put deny all in robots.txt. As a result the website with some quantity of useful information has totally disappeared from existence and from the archive record.

I do not see sufficiently important reasons to remove information that was once in public access. There are some reasons, however the public benefits of having access to all past public information outweigh them all.

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.

Re:YES!! by Anonymous Coward · 2017-04-23 00:58 · Score: 0

No. It will only lead to ISPs, carriers and even tier1-backbone-providers blocking archive.org on ip level, cutting them off from EVERYTHING. Also, ignoring a deny in robots.txt could lead to a dmca lawsuit for circumventing copy-protection...
Re:YES!! by bertvanleussen · 2017-04-23 01:46 · Score: 0

I applaud the direction internet archive takes. They should fully implement it.
A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I maintained for the past 15 years. The squatter who now owns the name put deny all in robots.txt. As a result the website with some quantity of useful information has totally disappeared from existence and from the archive record.
I do not see sufficiently important reasons to remove information that was once in public access. There are some reasons, however the public benefits of having access to all past public information outweigh them all.
Utter nonsense. If your "website with some quantity of useful information" was in any way important you would have republished the content on a new domain, which would have been indexed by search engines quite quickly. Archive.org is not a substitute for your obvious lack of due care in taking backups of your data.
Archive.org is NOT an official archive of the web. If they stop respecting robots.txt, then why should others keep respecting it? They claim to be special but they are not any different from any other search engine or data harvester.
Re:YES!! by Vadim+Makarov · 2017-04-23 13:24 · Score: 1

I did not have deny in robots.txt at the time the site was crawled. I do not mind if archive.org does not cache the "domain for sale" page, but why should the new owner be able to delete the entire history from the previous domain owners?
People abandon useful resources and domain names for many life reasons. To begin with, we all die. Companies change. Organisations change. Life priorities change. In my case the reason not to reinstate is the sheer lack of time: things are scripted for that domain name and it would take time to reconfigure that. Even if I did, once I die or become demented it would meet the same fate, the domain name will fall into someone else's hands. Internet archive is an extremely important public resource to cover for all that.

--
17779 eligible voters in a district, 17779 'vote' as one. This is Russia.

Yes they should ignore by Anonymous Coward · 2017-04-22 23:18 · Score: 0

A) public website

B) access control

Pick one

Uh huh... Sure. by NormanHaga2580 · 2017-04-22 23:25 · Score: 0

Dishonor robots.txt and I add this to .htaccess: Deny from 207.241.224.2

Re:Uh huh... Sure. by Anonymous Coward · 2017-04-23 01:19 · Score: 0

Just redirect to a page fed from /dev/urandom

This will have some big negative concequences by jonwil · 2017-04-22 23:29 · Score: 3, Insightful

Think about a big site like github.com.
Imagine how many terabytes of pretty-printed source code and other things archive.org would be pulling were it to crawl all of GitHub.
And that's just one site, there are many others that generate pretty-printed source code and other large things.

Or what about if it crawls Google and starts archiving all sorts of Google search URLs or Google maps URLs or whatever.

Re:This will have some big negative concequences by allo · 2017-04-23 01:51 · Score: 1

They aren't that dumb ... whoever writes a crawler puts in some protections against websites that are too big or sites autogenerating content with dynamic URLs. So for example they put non-popular GitHub links at the end of a queue to check them after everything else has been processed. So they may slowly add unimportant GitHub content, but won't crawl terabytes of data right away, only some megabytes every now and then. Their bandwidth and storage capacity are limited as well.
Re:This will have some big negative concequences by Anonymous Coward · 2017-04-23 09:12 · Score: 0

I believe the syntax-highlighting is done with javascript
Re:This will have some big negative concequences by Lost+Race · 2017-04-23 11:01 · Score: 1

Obviously (to some of us, anyway) the crawler should honor robots.txt, but the archive should not. Once something is in the archive it should be in there forever.

Really want politicians And The courts involved? by Anonymous Coward · 2017-04-22 23:31 · Score: 0

What is not done voluntarily can always be legislated and enforced judiciously. Yet another internet nightmare.

No. by Megane · 2017-04-23 00:13 · Score: 4, Interesting

robots.txt is intended to indicate what parts of a site should not be scanned recursively, often for technical reasons such as generated content. It is used especially for sub-paths like /cgi-bin/, but there is no technical reason why the content of any arbitrary URL can't be programmatically generated. It might be and you wouldn't even know it, because the generated content may be the same most of the time, such as a navigation menu.

However, it was also not intended to be used to remove previously-archived content, as archive.org is currently using it. When an archived page changes status in robots.txt, they should note the first date that the status changed, then simply stop updating it until and if robots.txt re-allows it.

Scanning and archiving are two different operations, and robots.txt is only intended to apply to the former.

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }

Come one come all or not by Anonymous Coward · 2017-04-23 00:17 · Score: 0

I think a web page (as opposed to site) has the right to be public or not.

But once they decide to be public, then search and archive from all should be permitted.
No cherry picking so only Google can index the page.
Certainly, no after the fact removal from an archive, if the page was public at the time it was captured.

The IA is especially good at performing a public service.
I think they should honor a robots.txt request that applies to all, but use the most permissive directive for sites that are picky about which indexing they support.

Legally, a spider should have the same access as the general public.
If a site limits access to the general public, then the spider will see the same limit.
The robots.txt file is a suggestion to make things more efficient.
It is not a locked door to some and open to others.

Simple solution? by RDW · 2017-04-23 00:39 · Score: 4, Insightful

How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.
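One way this could look, as a sketch only: the archive keeps a dated history of the robots.txt files it saw, and each snapshot is judged against the version in force on its capture date rather than the current one. The history format, the helper names robots_in_force and snapshot_visible, and the default agent string are all assumptions made for this example; the parsing uses Python's standard urllib.robotparser.

```python
from bisect import bisect_right
from datetime import date
from urllib.robotparser import RobotFileParser


def robots_in_force(history, when):
    """Return the robots.txt text that applied on `when`.

    `history` is a list of (date_first_seen, robots_txt_text) sorted by date.
    Before the first recorded robots.txt, nothing is disallowed.
    """
    dates = [seen for seen, _ in history]
    index = bisect_right(dates, when)
    return history[index - 1][1] if index else ""


def snapshot_visible(history, captured_on, url, agent="ia_archiver"):
    """Judge a snapshot by the robots.txt of its capture date, not today's."""
    parser = RobotFileParser()
    parser.parse(robots_in_force(history, captured_on).splitlines())
    return parser.can_fetch(agent, url)
```

Under this scheme a later "Disallow: /" from a new domain owner stops new captures from being shown but leaves older snapshots alone; deletion on request would still need a separate, manual path.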

Re:Simple solution? by Anonymous Coward · 2017-04-23 01:27 · Score: 0

How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.
Why isn't this top comment? Is this too obvious?

Robots.txt is being Abused by Anonymous Coward · 2017-04-23 01:11 · Score: 0

While everyone is debating this, keep in mind that the problem is the Abuse of robots.txt to the point of absurdity. At what point do people have a right to open their Windows to the public, do stuff to attract attention, and then get mad at others for looking into their windows in the first place. People who make mistakes should have some way to undo those mistakes at some point, but to continue to leave your window open and then have everyone put a robots.txt, retroactive or otherwise, is not the answer.

I conducted a 2 yr experiment on Internet Archive by Anonymous Coward · 2017-04-23 01:20 · Score: 2, Informative

I wanted to know if it was possible to delete content from the Internet Archive. Their FAQ and support staff were very vague and only referred me to the robots.txt file. I found that they archive everything even if you tell them not to. The robots.txt file only controls whether or not the public can view it.

Experiment 1) Buy an expired domain and host it with a robots.txt file telling Internet Archive not to archive it. Before the experiment I confirmed that Internet Archive had a history for this expired domain. After buying the domain and hosting the robots.txt file, I confirmed that Internet Archive no longer allowed access to it. I allowed the domain to expire, then went back to Internet Archive. The entire history of the domain was still there.

Experiment 2) I browsed the history of an existing website that I host to confirm they had it. Next I hosted a robots.txt file telling Internet Archive not to archive it and verified that the public can no longer browse the archive for this domain. Next I changed a picture on the website for six months, then changed it back. I waited another six months, removed the robots.txt file and checked the Internet Archive. I found that they had been taking snapshots throughout the year even though my robots.txt told them not to. The picture they were not supposed to have archived was visible in the archive.

If you really don't want them to archive your website, you can maybe block all of their IP addresses from accessing your server. Possibly by determining the domain name from the IP address and then checking if it is the Internet Archive.
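
The reverse-lookup check described above might be sketched like this in Python. The `.archive.org` suffix is an assumption (the crawler's hosts may not resolve to names under that domain), the function name is made up, and a PTR record alone can be forged, so a real setup would also confirm that the name resolves back to the same IP.

```python
import socket

def looks_like_archive_crawler(ip, resolve=socket.gethostbyaddr,
                               suffixes=(".archive.org",)):
    """Return True if the reverse-DNS name for ip ends with a listed suffix.

    resolve is injectable so the check can be exercised without network
    access; by default it performs a real PTR lookup.
    """
    try:
        hostname = resolve(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
    hostname = hostname.rstrip(".").lower()
    return any(hostname.endswith(suffix) for suffix in suffixes)
```

The server would call this on the client address of each request and refuse to serve the ones that match.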

Why ignoring? by allo · 2017-04-23 01:36 · Score: 1

There are already flags like "noarchive" to get Google to index the site, but not provide public "Google cache" links (you can assume they still cache it, but that doesn't matter for you).
So archive.org should ignore noindex directives, but not noarchive ones.
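
A minimal Python sketch of that rule, assuming the directives arrive in a robots meta tag: "noindex" is ignored and only "noarchive" blocks archiving. The class and function names are made up for illustration, and the `X-Robots-Tag` HTTP header is not handled.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)

def archive_allowed(html):
    """Honor noarchive only; noindex and other directives do not block."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

A page marked only "noindex" would still be archived; one marked "noindex, noarchive" would not.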

privacy? Null. by Anonymous Coward · 2017-04-23 01:52 · Score: 0

You don't put private things with a "please don't read" note on them in the town library noticeboard, dumbass!
Art, whether ephemeral or not, again, either don't put it up or accept it. There's memory. People will remember your "ephemeral art", and really who are you to decide what your art means to the viewer?
Experiences? Experience being archived! What a pointless point you made.
Change your mind? But you haven't. You still leave it in public with a "please do not read" note on it. Change your mind about where you put it, not make others change their minds because you're a lazy fucker.
You have all control over it. Take it off the internet. Block access without a login. What you can't control is other people. Why the hell should you?
etc? What "and so on"? Those ones were shit. And they were the best you could come up with!

Re:privacy? Null. by Anonymous Coward · 2017-04-23 03:43 · Score: 0

First off, this isn't about completely preventing people from accessing the content. This is about keeping it out of the search engines and places where the 99% of people that aren't doing deep web searches are looking for things. There's a huge difference between being found in Google's database and being found in an obscure one that's only used by a few dozen people or having the pages served up by a 3rd party where they can be up there for years past the point where you've taken them down or changed the content.
Of course if something is connected to the internet it may be publicly accessible, nobody's arguing that point. We're arguing about whether or not it's acceptable to illegally distribute other people's work in defiance of their expressed wishes.

Bing's spider wasn't obeying robots.txt by QuietLagoon · 2017-04-23 02:05 · Score: 1

I conversed with the good people at Bing, and was told pretty much that it's a bug they don't intend to fix. They also told me how to code my website to get around the bug. Needless to say, I did the work to get around the bug. However, instead of restricting Bing from the parts of the site I restrict to all search engines, now Bing is totally restricted from browsing any part of the web site.

However, the way Microsoft has been acting recently (e.g., Windows 10 forced upgrades), I doubt if they even care about what I try to tell them via robots.txt.

The Microsoft attitude apparently is pervasive within the company.

No by JustAnotherOldGuy · 2017-04-23 02:05 · Score: 1

"Should Archive.org Ignore Robots.txt Directives And Cache Everything?"

No.

--
Just cruising through this digital world at 33 1/3 rpm...

Yes. If you want to keep it private, then by The+Cisco+Kid · 2017-04-23 02:23 · Score: 1

don't publish it openly in the first place.

it's about politeness by Anonymous Coward · 2017-04-23 02:28 · Score: 0

Hey I like privacy too but come on, it's the internet! If you put something out there that's not password-protected then consider it to be publicly available, already. Respecting robots.txt is a matter of politeness.

I do not misunderstand copyright by Anonymous Coward · 2017-04-23 02:29 · Score: 0

You have to explain.

Lazy fucker.

For the non-lazy fucker who actually gave a point, if your scripts are in a directory properly rather than munged into the same area your content is, then any sane robot will not execute them. If the scripts contain content information, ur doin it wrong. Separate the content from the presentation. It's what the markup system is meant for, but if you're going to put content in the scripts then your scripts being executed is necessary, and robots.txt is being abused by you along with the "please don't execute this!" plea to hide information.

Dumbasshit spiders will grab everything. But they're going to ignore robots.txt as well if they're that dumb or badly written, so some crawlers like archive.org ignoring it will be zero different.

blantant-predator moral honeypot by epine · 2017-04-23 02:44 · Score: 1

A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt.

So what? When DoubleClick argues that they ought to have the same advantages as Archive.org, they'll only manage to look like douchebags reaching their filthy hands into a cookie jar.

It's not always a bad thing to set up douchebag-honeypot moral exemption, even if it does depend on the mass audience (mostly) managing to find two sticks to rub together.

The real solution here is to make the directives in robots.txt more explicit concerning the predatory/non-predatory use cases.

Re:blantant-predator moral honeypot by DamonHD · 2017-04-23 03:55 · Score: 1

The obvious (and already available) solution is to have the spider mark its incoming HTTP request at the TCP level appropriately:
https://www.ietf.org/rfc/rfc35...
Rgds
Damon

--
http://m.earth.org.uk/

Pages should not be revived retroactively by Anonymous Coward · 2017-04-23 02:59 · Score: 0

As long as this is their policy going forward from 2017 and they don't revive old cached pages from years ago, I don't have a particular problem with this. But in the past, their explicit policy was that sites could be removed via the robots.txt file and that must be honored. From now on I guess I'll use .htaccess to achieve the same effect where necessary.

archive.org should be smarter by iamagloworm · 2017-04-23 03:02 · Score: 1

archive.org should ignore robots.txt as a means to prevent archiving material. archive.org should however be smart enough to know what can be ignored, based on content.

Boingboing wants ad impressions by itomato · 2017-04-23 03:36 · Score: 1

More data points to show you more McDonald's ads probably sounds awesome to them. You can't be cool, popular, and decent all at once, Xeni and crew.

Privacy, how quaint. by gordonb · 2017-04-23 03:46 · Score: 1

Of course, I'm sure multiple TLAs already have a copy of everything, particularly anything political, technical, blogs, etc, including the "dark web" and other encrypted sites. A leopard can't change his spots and I would predict the FBI has dossiers on most American citizens. It's in their DNA, dating from J. Edgar at their birth. Now the dossiers are electronic, searchable, and probably do include real DNA. They likely include info from the older paper files - like that record of the subscription I had to the People's Daily back in the 70s during my Maoist phase - now scanned and searchable too. With the exposure of all these hacking tools, perhaps there is even a backup of my old blog server which mysteriously crashed in 2007 wiping out years of pointless political fulminating read by one or two people. If it wasn't for these search engines ignoring robots.txt, I would have had no traffic at all. Certain corporations also have essentially complete copies of the internet - like Google, Yahoo (Verizon), and Microsoft. If only we could search and browse them with a better interface than the Wayback Machine.

The problem is this retroactive application of it. by Travelsonic · 2017-04-23 03:53 · Score: 1

Whoever thought that was a good idea is a moron, full stop.
Different archive copies from when the site was under different ownership should retain their own policies - whether it is fully restricted, not restricted at all, or in between. Yes, that will take up space, holding on to different copies of robots.txt files, linking them to websites, etc, but it is better than some archives not being available because of their current policy.

--
If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot

Re:I conducted a 2 yr experiment on Internet Archi by Anonymous Coward · 2017-04-23 03:55 · Score: 0

Mod parent up +Informative

Yes by Anonymous Coward · 2017-04-23 04:09 · Score: 0

If they start disregarding retroactive robots.txt directives, I'll start donating again. Domain name owners shouldn't be able to play "Ministry of Truth" with content that had previously been made publicly available on their domains.

An excellent decision !!! by martiniturbide · 2017-04-23 04:56 · Score: 1

I also agree. On today's internet nobody asks for permission to show us advertising, to follow us around the internet showing us ads, or to try to sell us things on social networks, and now ISPs will sell our browsing history. So why should a "public library" not be able to just back up a full website? Also it is completely lawful for them to copy contents and information: https://www.law.cornell.edu/us...

Good luck Internet Archive. Back up everything in the world !!! Preserve all knowledge !!!

Yes by Anonymous Coward · 2017-04-23 05:07 · Score: 0

'Nuff said.

Yes, with Conditions by slacka · 2017-04-23 06:00 · Score: 1

First of all, everything should be archived for future generations and researchers. Otherwise, it defeats the whole point of the project.

But for the general public, the robots.txt should be honored and content hidden with a few conditions. First of all, it should not be retroactive. I've seen valuable information lost when domains have changed name and the new owner has blocked the contents with a robots.txt. Second of all, there should be a review system to override the robots.txt. For example, if a site is cited in Wikipedia, the robots.txt should be ignored and hidden content unblocked.

http get, requesting, supplied, consenting by Anonymous Coward · 2017-04-23 06:20 · Score: 0

So there's already permission for copying there. Don't honour http get requests without a login.

As a retired research librarian by Anonymous Coward · 2017-04-23 10:04 · Score: 0

As a retired research librarian I come down strongly on the 'cache everything' side.

Removing materials dilutes the historical and sociological usefulness of the archive.

A hundred years from now one's peccadillos will mean nothing, but the record of such things will help researchers understand our times.

What are "cached URLs"? by tepples · 2017-04-23 11:37 · Score: 1

If you attempt to directly access cached URLs, you're hostile, same answer.

How you define "cached URLs" could determine how much money you have to spend fielding support calls from legitimate users who have bookmarked a document on your site.

Re:What are "cached URLs"? by postbigbang · 2017-04-23 11:54 · Score: 1

The site is static. It goes through revision. No one in their right mind bookmarked the site; it gets 100 legit visits a year. It's a honeypot.
But spiders cache URLs and try to find them again. Nope.

--
---- Teach Peace. It's Cheaper Than War.

Re:What are "cached URLs"? by Anonymous Coward · 2017-04-25 00:49 · Score: 0

I'll bet you're one of those douches that tries to block right-clicking, too.

No. by Anonymous Coward · 2017-04-23 14:18 · Score: 0

Go watch the up and coming lame-arse movie called The Circle. The book sucked so the movie will be just as bad. If you ask this question you are proven to be stupid and young. lol. No doubt about it. You never studied history, haven't a clue about governments, and don't have any morals as you don't respect people's rights.

A robots.txt file is a nice way of telling another, "please don't copy my site." However, the more mature and sophisticated answer is "if you copy this portion of my site, you may be liable for copyright infringement." This whole problem is really a problem with the limitations of robots.txt. Telling someone "please do this" or "please don't do that" is not nearly as significant as "you have a right to do this" and "I will sue you if you do that".

--
Fast Federal Court and I.T.C. updates

Unauthorized Access by Aaron+B+Lingwood · 2017-04-23 20:08 · Score: 1

By ignoring robots.txt, archive.org would be gaining unauthorized access to a computer system as access was expressly denied as per the Robots Exclusion Standard.

To further disseminate the archived pages would be added infringements.

I think that they need to campaign for site owners to modify their robots.txt and, if need be, lobby for exclusions to the Computer Misuse Act.

--
[Rent This Space]

Re:Unauthorized Access by Anonymous Coward · 2017-04-24 04:42 · Score: 0

By ignoring robots.txt, archive.org would be gaining unauthorized access to a computer system as access was expressly denied as per the Robots Exclusion Standard.
Bollocks. What 'Robots Exclusion Standard'? Who has the authority to define such a standard? What if I set up a website with a file called 'access' in a folder called 'security' and containing the words 'no robots'? Does that have legal force? It's a clear expression of my intent.
Many people here seem to think the intent of robots.txt was not originally to prevent archiving but to prevent the 'robot' going haywire with recursive or infinite links. Are they all wrong? Who gets to say?
robots.txt is far too woolly, ill-defined and lacking in official status to have any real legal force - particularly in the criminal instance you're positing.

problem for websites by SuperDre · 2017-04-23 20:35 · Score: 1

there is no law that says you have to obey the robots.txt, it's nice search engines etc obey the robots.txt, but they certainly don't have to.

Really Stupid by MooseMiester · 2017-04-24 04:32 · Score: 1

Every site we develop has a dev instance with robots.txt set to disallow, and a prod instance. If the dev instances get exposed to the outside world (search, archive, wayback) then you have to teach reviewers how to use local vhost files, which will be a huge pain in the ass, or put .htaccess passwords on everything, which is just plain stupid.

--
Murphy was an optimist

Sure... by ne1av1cr · 2017-04-24 05:21 · Score: 1

If you think you want to keep a copy of my homebrew Ars Magica roleplaying game logging site feel free.

Re: Really Stupid - Doing It Wrong by Anonymous Coward · 2017-04-26 13:16 · Score: 0

If you're relying on robots.txt to prevent people from seeing dev branch, then you're Doing It Wrong. Such things should be password/log-in protected.

Slashdot Mirror

174 comments