White House Website Limits Iraq-Related Crawling

Posted by simoniker on Monday October 27, 2003 @09:24AM from the intriguingly-specific dept.

oscarcar writes "Dan Gillmor is reporting on the White House website's use of its robots.txt file to disable search engines from crawling certain material. Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere."

3 of 837 comments

Re:Oh please by phritz · 2003-10-27 09:32 · Score: 5, Insightful

Congratulations to simoniker, poster of the most inanely paranoid comment I have ever read here on slashdot. And that's saying something.

I have to admit, when I first read the story I thought someone was being paranoid. But you really should RTF robots.txt file before you accuse the poster of being paranoid. The disallowed files are extraordinarily specific. I really can't come up with a plausible explanation beyond simoniker's.

Re:Drawing farfetched conclusions by johnnyb · 2003-10-27 10:12 · Score: 5, Insightful

It really doesn't look like it. It looks like someone screwed up, because none of those directories appear to exist at all. I mean really, what are the chances of /firstlady/photos/2003/01/iraq actually having at some time contained real data?

It looks like someone did a

find . -type d | perl -e 'while(<>){chomp; print "${_}/iraq\n"; print "${_}/text\n";}' > robots.txt

I have no idea what the purpose would be, but it seems like a funny thing to do if you were trying to hide something.

By the way, who is going around looking at people's robots.txt files?

--
Engineering and the Ultimate

Re:Other, arguably more reasonable explanations by EinarH · 2003-10-27 10:49 · Score: 5, Insightful

Didn't think so, not a single one that I went to is a valid URL, and I highly doubt that they were valid to begin with.
From
http://www.bway.net/~keith/whrobots/disdirs.html
Some of the directories that 404 truly are empty of files. For instance:
http://www.whitehouse.gov/news/timeline/iraq
doesn't have files.
But at least some of the directories that 404 above do have files in them, just not an index file. For instance:
http://www.whitehouse.gov/infocus/iraq/100days
does not have an index page, so just entering that URL will give a 404.
However, the directory has the following files in it:
http://www.whitehouse.gov/infocus/iraq/100days/100days.pdf
http://www.whitehouse.gov/infocus/iraq/100days/introduction.html
http://www.whitehouse.gov/infocus/iraq/100days/part1.html
http://www.whitehouse.gov/infocus/iraq/100days/part2.html
http://www.whitehouse.gov/infocus/iraq/100days/part3.html
http://www.whitehouse.gov/infocus/iraq/100days/part4.html
http://www.whitehouse.gov/infocus/iraq/100days/part5.html
http://www.whitehouse.gov/infocus/iraq/100days/part6.html
http://www.whitehouse.gov/infocus/iraq/100days/part7.html
http://www.whitehouse.gov/infocus/iraq/100days/part8.html
http://www.whitehouse.gov/infocus/iraq/100days/part9.html
http://www.whitehouse.gov/infocus/iraq/100days/part10.html
All of those files are excluded by the directory-level Disallow entry in robots.txt.
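
For anyone who wants to check how a directory-level entry covers the files beneath it, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt excerpt in it is a hypothetical single rule written for illustration, not a copy of the actual whitehouse.gov file, and the helper name is_crawlable is made up here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt: one directory-level rule of the kind discussed above.
# This is NOT a copy of the real whitehouse.gov robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infocus/iraq/100days
"""

BASE = "http://www.whitehouse.gov"


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes the file's lines; nothing is fetched over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in (
        "/infocus/iraq/100days/part1.html",   # inside the disallowed directory
        "/infocus/iraq/100days/100days.pdf",  # inside the disallowed directory
        "/infocus/iraq/index.html",           # outside it (illustrative path)
    ):
        print(path, "->", "allowed" if is_crawlable(BASE + path) else "blocked")
```

Disallow rules are matched as path prefixes, so a single entry for a directory also blocks every file under that directory, whether or not the directory itself has an index page.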

And yes, these files *are* relevant.

--
Melius mori in libertate quam vivere in servitute.