Slashdot Mirror


White House Website Limits Iraq-Related Crawling

oscarcar writes "Dan Gillmor is reporting on the White House website's use of its robots.txt file to disable search engines from crawling certain material. Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere."

3 of 837 comments

  1. Re:Oh please by phritz · · Score: 5, Insightful
    Congratulations to simoniker, poster of the most inanely paranoid comment I have ever read here on slashdot. And that's saying something.

    I have to admit, when I first read the story I thought someone was being paranoid. But you really should RTF robots.txt file before you accuse the poster of being paranoid. The disallowed files are extraordinarily specific. I really can't come up with a plausible explanation beyond simoniker's.

  2. Re:Drawing farfetched conclusions by johnnyb · · Score: 5, Insightful

    It really doesn't look like it. It looks like someone screwed up, because none of those directories appear to exist at all. I mean really, what are the chances of /firstlady/photos/2003/01/iraq actually having at some time contained real data?

    It looks like someone did a

    find . -type d | perl -e 'while(<>){chomp; print "${_}/iraq\n"; print "${_}/text\n";}' > robots.txt

    I have no idea what the purpose would be, but it seems like a funny thing to do if you were trying to hide something.
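    For anyone who doesn't read Perl, here is a rough Python sketch of the same guess. It is only an illustration of the idea, not a claim about how the real file was built. The function name guess_entries, the "Disallow:" prefix, and the throwaway firstlady/photos tree are all assumptions made for the example.

```python
import os
import tempfile


def guess_entries(root, suffixes=("iraq", "text")):
    """Yield one Disallow line per (directory, suffix) pair under root.

    Hypothetical helper: it blindly appends each suffix to every
    directory it finds, whether or not that path really exists.
    """
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # The top-level directory maps to the site root.
        prefix = "" if rel == "." else "/" + rel.replace(os.sep, "/")
        for suffix in suffixes:
            yield f"Disallow: {prefix}/{suffix}"


# Demo on a made-up directory tree (not the real site layout).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "firstlady", "photos"))
    demo_lines = sorted(guess_entries(root))

print("\n".join(demo_lines))
```

    A generator like this would emit an iraq entry for every directory, which is one way to end up with entries for paths that never held anything.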

    By the way, who is going around looking at people's robots.txt files?

  3. Re:Other, arguably more reasonable explanations by EinarH · · Score: 5, Insightful
    Didn't think so, not a single one that I went to is a valid URL, and I highly doubt that they were valid to begin with.
    From
    http://www.bway.net/~keith/whrobots/disdirs.html
    Some of the directories that 404 truly are empty of files. For instance:
    http://www.whitehouse.gov/news/timeline/iraq

    doesn't have files.

    But at least some of the directories that 404 above do have files in them, just not an index file. For instance:

    http://www.whitehouse.gov/infocus/iraq/100days

    does not have an index page, so just entering that URL will give a 404.

    However, the directory has the following files in it:

    http://www.whitehouse.gov/infocus/iraq/100days/100days.pdf
    http://www.whitehouse.gov/infocus/iraq/100days/introduction.html
    http://www.whitehouse.gov/infocus/iraq/100days/part1.html
    http://www.whitehouse.gov/infocus/iraq/100days/part2.html
    http://www.whitehouse.gov/infocus/iraq/100days/part3.html
    http://www.whitehouse.gov/infocus/iraq/100days/part4.html
    http://www.whitehouse.gov/infocus/iraq/100days/part5.html
    http://www.whitehouse.gov/infocus/iraq/100days/part6.html
    http://www.whitehouse.gov/infocus/iraq/100days/part7.html
    http://www.whitehouse.gov/infocus/iraq/100days/part8.html
    http://www.whitehouse.gov/infocus/iraq/100days/part9.html
    http://www.whitehouse.gov/infocus/iraq/100days/part10.html

    All those files are excluded by the directory disallow entry in robots.txt.

    And yes, these files *are* relevant.
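    If you want to check the directory-level exclusion yourself, here is a minimal sketch using Python's standard urllib.robotparser. The one-rule ROBOTS_TXT excerpt and the is_blocked helper are assumptions for illustration, not a copy of the actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt: a single directory-level rule of the kind discussed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infocus/iraq/100days
"""


def is_blocked(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a crawler honouring robots_txt may not fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "http://www.whitehouse.gov" + path)


paths = (
    "/infocus/iraq/100days/part1.html",
    "/infocus/iraq/100days/100days.pdf",
    "/infocus/iraq/index.html",  # made-up path outside the rule
)
results = {path: is_blocked(path) for path in paths}

for path, blocked in results.items():
    print(path, "blocked" if blocked else "allowed")
```

    The rule is a plain path-prefix match, so one entry for the directory covers every file under it even though the directory URL itself returns a 404.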
    --

    Melius mori in libertate quam vivere in servitute.