Slashdot Mirror


White House Website Limits Iraq-Related Crawling

oscarcar writes "Dan Gillmor is reporting on the White House website's use of its robots.txt file to disable search engines from crawling certain material. Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere."

34 of 837 comments

  1. Funny by sulli · · Score: 5, Funny

    whitehouse.com doesn't have that problem.

    --

    sulli
    RTFJ.
    1. Re:Funny by sulli · · Score: 5, Funny
      Also, TRUE God loving Americans would absolutely love to see that filthy, degrading website taken down because of the damage it causes to children who go there on accident.

      I feel the same way about whitehouse.gov. Couldn't have said it better myself.

      --

      sulli
      RTFJ.
    2. Re:Funny by jovlinger · · Score: 5, Interesting

      true. true. Apparently some poor fool made similar remarks on k5 a while back, and did indeed receive a personal visit from the SS. No charges filed, but 'tis a rude awakening indeed when your online words come and knock on your door.

  2. Drawing farfetched conclusions by Armethius · · Score: 4, Funny
    possibly to prevent people from finding changes to past statements and information when archived elsewhere
    Or possibly not...
  3. upside by 514x0r · · Score: 5, Funny

    it's good to see the whitehouse embracing technology so much.

    --

    !(^((ri)|(mp))aa$)
  4. Other, arguably more reasonable explanations by rot26 · · Score: 4, Interesting

    Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere.

    Maybe, but I would think they might also be looking for "shady" spiders that ignored robots.txt. I wouldn't be surprised if there aren't a few honeypot pages in there too.

    --



    To ensure perfect aim, shoot first and call whatever you hit the target
    1. Re:Other, arguably more reasonable explanations by RobertB-DC · · Score: 4, Funny

      Maybe, but I would think they might also be looking for "shady" spiders that ignored robots.txt. I wouldn't be surprised if there aren't a few honeypot pages in there too.

      Oh, crap. I just plugged in /firstlady/images/iraq, and now you tell me I'd better watch out. Damn this static IP address!

      Quick, Slashdot that link before the Agents get to my cube!

      --
      Stressed? Me? Of course not. Stress is what a rubber band feels before it breaks, silly.
    2. Re:Other, arguably more reasonable explanations by sketerpot · · Score: 4, Interesting

      Honeypot or not, look at robots.txt. It's creepy: just about every entry is an Iraq-related page, and there are a lot of entries. If they wanted to just have a few honeypots, that shouldn't involve that many entries, or so many with the common theme of Iraq.

    3. Re:Other, arguably more reasonable explanations by greenhide · · Score: 5, Funny

      How can they be Iraq related if they didn't exist to begin with?

      A question that GW gets asked all the time. :-)

      --
      Karma: Chevy Kavalierma.
    4. Re:Other, arguably more reasonable explanations by EinarH · · Score: 5, Insightful
      Didn't think so, not a single one that I went to is a valid URL, and I highly doubt that they were valid to begin with.
      From
      http://www.bway.net/~keith/whrobots/disdirs.html
      Some of the directories that 404 truly are empty of files. For instance:
      http://www.whitehouse.gov/news/timeline/iraq

      doesn't have files.

      But at least some of the directories that 404 above do have files in them, just not an index file. For instance:

      http://www.whitehouse.gov/infocus/iraq/100days

      does not have an index page, so just entering that URL will give a 404.

      However, the directory has the following files in it:

      http://www.whitehouse.gov/infocus/iraq/100days/100days.pdf
      http://www.whitehouse.gov/infocus/iraq/100days/introduction.html
      http://www.whitehouse.gov/infocus/iraq/100days/part1.html
      http://www.whitehouse.gov/infocus/iraq/100days/part2.html
      http://www.whitehouse.gov/infocus/iraq/100days/part3.html
      http://www.whitehouse.gov/infocus/iraq/100days/part4.html
      http://www.whitehouse.gov/infocus/iraq/100days/part5.html
      http://www.whitehouse.gov/infocus/iraq/100days/part6.html
      http://www.whitehouse.gov/infocus/iraq/100days/part7.html
      http://www.whitehouse.gov/infocus/iraq/100days/part8.html
      http://www.whitehouse.gov/infocus/iraq/100days/part9.html
      http://www.whitehouse.gov/infocus/iraq/100days/part10.html

      All those files are excluded by the directory disallow entry in robots.txt

      And yes, these files *are* relevant.
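
      To illustrate how a single directory entry covers every file beneath it, here is a minimal sketch using Python's standard robots.txt parser. The one-rule excerpt and the "ExampleBot" agent name are assumptions for illustration; the real file had many more entries. The /infocus/iraq rule is the one reported as disallowed elsewhere in this thread.

```python
from urllib.robotparser import RobotFileParser

# Assumed one-rule excerpt for illustration; not the full whitehouse.gov file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infocus/iraq
"""

BASE = "http://www.whitehouse.gov"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if a compliant crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, BASE + path)


# Rules match by path prefix, so files under the directory are covered too.
for path in ("/infocus/iraq/100days/part1.html", "/news/"):
    print(path, "allowed" if allowed(path) else "blocked")
```

      Because matching is by prefix, a well-behaved crawler skips part1.html through part10.html even though none of them is named in the file.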
      --

      Melius mori in libertate quam vivere in servitute.

  5. Cue somebody... by Dave2 Wickham · · Score: 4, Insightful

    Cue somebody to take a crawler (hell, even a bash script using wget) to specifically archive these pages. Hell, they could even use a user-agent which doesn't look like a bot.

    Of course, people would be less likely to trust random-Joe from the Internet than, say, The Wayback Machine, but I expect this is what will happen...
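
    A rough sketch of that idea in Python rather than bash and wget, assuming nothing beyond the standard library. The user-agent string and the helper names (build_request, local_name, archive) are hypothetical, and nothing is fetched unless archive() is called explicitly.

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical user-agent string; any value can be passed in instead.
USER_AGENT = "Mozilla/5.0 (compatible; example-archiver)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def local_name(url):
    """Flatten the URL path into a file name, e.g. a/b/c.html -> a_b_c.html."""
    path = urlsplit(url).path.strip("/")
    return path.replace("/", "_") or "index.html"


def archive(url, timeout=30):
    """Fetch one page and save it next to the script; returns bytes written."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        data = resp.read()
    with open(local_name(url), "wb") as fh:
        fh.write(data)
    return len(data)
```

    Calling archive() on each URL of interest would save a local copy, though such a copy carries no independent proof of when or from where it was taken.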

  6. Everything Iraq.... by c_oflynn · · Score: 4, Informative

    It looks like 99% of the stuff related to Iraq is filtered out in robots.txt.

    But not a problem, on google.com I just specify the site by saying 'Iraq site:whitehouse.gov' and it had 14,000 hits... the first one is the root of the /infocus/iraq directory (which is disallowed in robots.txt).

  7. Re:Oh please by phritz · · Score: 5, Insightful
    Congratulations to simoniker, poster of the most inanely paranoid comment I have ever read here on slashdot. And that's saying something.

    I have to admit, when I first read the story I thought someone was being paranoid. But you really should RTF robots.txt file before you accuse the poster of being paranoid. The disallowed files are extraordinarily specific. I really can't come up with a plausible explanation beyond simoniker's.

  8. Truly Frightening. by Dlugar · · Score: 5, Funny

    Obviously, they're keeping people from accessing the top-secret teeball Iraq files ! Besides:

    Disallow: /teeball/iraq/
    check out these other frightening examples of censorship:
    Disallow: /kids/spotty/iraq
    Disallow: /kids/eggroll/iraq
    Disallow: /kids/barney/iraq
    Disallow: /easter/iraq
    Disallow: /mrscheney/iraq
    Disallow: /national-anthem/iraq

    Truly frightening.
    --
    Computer Go: Writing Software to Play the Ancient Game of Go
  9. I, for one... by wardomon · · Score: 5, Funny

    welcome our White House Robot Overlords. It would be funnier if it weren't true.

    --

    - - - If the sun is a star, why can't I see it at night?
  10. Re:Oh please by cgranade · · Score: 4, Interesting

    This gets modded up as Insightful? I mean, the White House is routinely editing their transcripts, and if bots like Google and Wayback can go and find that no, Bush said that we found weapons, not a weapons program, then there goes Bush's latest FUD... *thud*. Just because it's a tinfoil hat worthy theory doesn't mean it isn't true... most aren't, but therein lies the issue: most.

    --

    #define DRM chmod 000

  11. related links by js7a · · Score: 4, Interesting
    A couple of web sites that (1) have in the past done a great job of catching these kinds of things, and (2) have mailing lists you can subscribe to:

    Here's a minor example of something those two sites didn't catch: Remember Iraq's so-called "mobile biological weapons factories"? A month after the story broke that they were for weather balloons, the CIA moved their report's URL.

    An intriguing fact about this whitehouse.gov/*/iraq thing is that they do in fact cover some of the important statements which are apparently not duplicated in the press release, conference, and briefing directories. Perhaps there was a "unique urgency" to cover up some poor choices of words?

  12. Not conspiracy, but I don't know what it *is* eith by Have Blue · · Score: 4, Informative

    If you try actually *loading* the directories listed in the robots.txt, they don't exist. Not one. Not by going to their index.html or trying to find them through the site navigation. While they could still be accused of deleting them, many of the links are unlikely to have existed in the first place (http://www.whitehouse.gov/president/heartland-tour-gallery/iraq? /president/holiday/decorations/iraq? /president/tee-ball-01/iraq?) This may be just some IT grunt running a bad script on robots.txt.

  13. Re:More American Censorship by kableh · · Score: 4, Insightful

    Keep telling yourself that.

    And 70% of the people in this country STILL think that Saddam played some part in 9/11. What was your point again?

  14. Missing Iraq and 9.11 files by jjn1056 · · Score: 5, Informative

    Looks like they removed a bunch of files where they were making claims that Saddam was behind 9/11. One could be led to suspect that now that Bush got his war he doesn't need that lie anymore, and wants to erase all history of it since it undermines his authority.

    --
    Peace, or Not?
  15. Re:And your ... by SQL Error · · Score: 4, Insightful

    Better explanation: Someone screwed up a search-and-replace in a major way. Many (most?) of those pages with "iraq" in them don't exist.

    It looks like someone blocked off parts of the site to web-crawlers; I don't know for sure why all those blah/bloo/iraq entries are in there but they sure as hell don't lead to anything.

    Censorship: 0
    Screwups: 100

  16. re: and your ... by ed.han · · Score: 4, Insightful

    what's that old saying? "never attribute to malice that which can be attributed to stupidity" or something like that?

    let's not get reactionary here, folks. it wouldn't make sense to do what's being alleged:

    1. every major journalist worth his/her salt would be all over it within hours. so it wouldn't succeed in obscuring information.

    2. it would create an incredible backlash as soon as detected. what purpose would this serve?

    ed

  17. Re:Interesting allegation... by davebo · · Score: 4, Informative

    The complaint is they've done it before - "combat operations are done" became "major combat operations are done" when the fighting didn't stop. You can check here.

    Compare the screenshots of what used to be on the white house website vs what's currently on the website.

    Yes, I know, "how do we know this blogger didn't alter the screenshots?" You don't.

  18. Barney, agent provocateur of the CIA? You Decide by mykepredko · · Score: 5, Funny

    Downloading the "robots.txt" file and doing a quick ctrl-f on different words, I discovered that there are six instances of "Barney" coming up in the robots.txt:

    Disallow: /holiday/2002/barney/iraq
    Disallow: /holiday/2002/barney/text
    Disallow: /kids/barney/iraq
    Disallow: /kids/barney/text
    Disallow: /kids/photoessays/barney/iraq
    Disallow: /kids/photoessays/barney/text

    Which is the same number as "cheney", "powell" had 4, "saddam" didn't have any and "bush" only comes up with "bushpets".

    Clearly, there is something to do with Barney and Iraq that The White House doesn't want you to know about.
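
    The same tally can be scripted instead of done with ctrl-f. This is a small sketch that counts how many Disallow lines mention each keyword; the sample text is just the six lines quoted above, not the full file, and count_keywords is a made-up helper name.

```python
from collections import Counter

# Only the six entries quoted in this comment, not the complete robots.txt.
SAMPLE = """\
Disallow: /holiday/2002/barney/iraq
Disallow: /holiday/2002/barney/text
Disallow: /kids/barney/iraq
Disallow: /kids/barney/text
Disallow: /kids/photoessays/barney/iraq
Disallow: /kids/photoessays/barney/text
"""


def count_keywords(robots_text, keywords):
    """Count the Disallow lines whose path contains each keyword."""
    counts = Counter()
    for line in robots_text.splitlines():
        if not line.lower().startswith("disallow:"):
            continue  # skip User-agent lines, comments, and blanks
        path = line.split(":", 1)[1].strip().lower()
        for word in keywords:
            if word in path:
                counts[word] += 1
    return counts


counts = count_keywords(SAMPLE, ["barney", "iraq", "text", "saddam"])
for word in ("barney", "iraq", "text", "saddam"):
    print(word, counts[word])
```

    Run against the whole file, the same function would give the per-name totals without any manual searching.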

    myke

  19. Re:Drawing farfetched conclusions by johnnyb · · Score: 5, Insightful

    It really doesn't look like it. It looks like someone screwed up, because none of those directories appear to exist at all. I mean really, what are the chances of /firstlady/photos/2003/01/iraq actually having at some time contained real data?

    It looks like someone did a

    find . -type d | perl -e 'while(<>){chomp; print "${_}/iraq\n"; print "${_}/text\n";}' > robots.txt

    I have no idea what the purpose would be, but it seems like a funny thing to do if you were trying to hide something.
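
    For anyone who wants to see what such a script would produce, here is a hedged Python sketch of the same guess: walk a directory tree and emit an iraq and a text entry for every directory. The function name, the "Disallow:" prefix, and the tiny temporary tree are my own assumptions; nobody here knows how the real file was generated.

```python
import os
import tempfile


def disallow_entries(root, suffixes=("iraq", "text")):
    """Emit one Disallow line per (directory, suffix) pair under root."""
    entries = []
    for dirpath, dirnames, _files in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        rel = os.path.relpath(dirpath, root)
        web = "/" if rel == "." else "/" + rel.replace(os.sep, "/") + "/"
        for suffix in suffixes:
            entries.append(f"Disallow: {web}{suffix}")
    return entries


# Hypothetical two-level tree standing in for the site's document root.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "kids", "barney"))
    demo = disallow_entries(root)

print("\n".join(demo))
```

    Every directory gets both entries whether or not a matching subdirectory exists, which would explain paths like /kids/barney/iraq that lead nowhere.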

    By the way, who is going around looking at people's robots.txt files?

  20. Re:A CLASSIC QUOTE... by Selanit · · Score: 4, Insightful
    Nothing's hidden, it's all there, it's all searchable from the white house website, just not from search engines.
    Correction: it's all there, as far as we can tell. How can I be sure that the results returned by the whitehouse.gov search engine are full and complete when google and all the other search engines have been partially crippled? There's no way to verify the completeness of the results -- I just have to take their word for it. Just like I was asked to take their word about Hussein's weapons of mass destruction.

    Paranoia aside, I object to these restrictions as a matter of principle. They're making it more difficult to access publicly available information. It's not classified, and it never was. I, as a citizen of the U.S.A., have a right to know what my leaders have said and done.

    Let's assume the whitehouse.gov search engine is completely honest, and faithfully returns a complete listing of all materials on the site having to do with Iraq. If that's so, then there should be no reason to disable other search engines, since their results would just confirm the internal results.

    But the restrictions are in place, meaning that someone thought there was a good reason to do so. Restricting access makes it more difficult for people to research information pertaining to Iraq on the whitehouse.gov web site. Who are the people most likely to be doing that? Answer: journalists, activists, and concerned citizens. Obviously these restrictions aren't enough by themselves to dissuade a determined researcher; but it might slow them down. And it might actually stop a diffident researcher completely.

    I'm not even going to go into scenarios where the whitehouse.gov search engine is not trustworthy, because serving up "doctored" speeches or information is highly unlikely. There are too many other archives to compare against, and it would be a major scandal if the administration was found to be altering records on its website. They'd have to be really, really dumb to do that.

    The whole thing still leaves a bad taste in my mouth, though.
  21. 1984: simple answer by flossie · · Score: 4, Funny
    There is a very simple explanation for this, as anyone who has read 1984 should know. In order for the glorious government to effectively serve the greater good, they need to be able to communicate changes of policy quickly and effectively. If, for instance, the enemy in a war changes, it is necessary to quickly update all documents that describe how evil the enemy are. Rather than manually editing all the documents, it is much easier to have one generic word, say "text", which can then be altered as appropriate:

    sed 's/text/iraq/g'
    sed 's/text/iran/g'
    sed 's/text/cuba/g'
    sed 's/text/belgium/g'
    etc.

    Obviously robots.txt just happened to be in the path!

  22. Re: and your ... by AllUsernamesAreGone · · Score: 4, Insightful

    "1. every major journalist worth his/her salt would be all over it within hours."

    Don't be naive. How long do you think that any mainstream journalist who made a story of this would have a job for? The answer - not long. The US media in particular, although the UK is getting as bad, is little more than a relay system for government propaganda and real, detailed, complete examination of government behaviour, with equal air time to truly dissenting opinions (how many times has Chomsky been on CNN in the past 4 months?) is out of the question. What the government does is Good and Right and Should Not Be Questioned.

    Media by the elite, serving the elite.

  23. Re:A CLASSIC QUOTE... by fermion · · Score: 4, Insightful
    The rules for transparency go beyond merely 'not hiding' information. It is necessary to make information available from well-known locations in the most convenient form practical. This, for instance, is why we have a congressional record rather than just binders of unsorted documents in a basement of some public building.

    The other rule for transparency is that all material information be made available, kept, or destroyed in accordance to public regulation and individual policy. Individual policy must be consistent and decisions must be defensible based on policy.

    The fact that people do not understand these two aspects of transparency is what allows situations like Enron to develop. The latter is what caused the destruction of Arthur Andersen. They have done nothing wrong, but they did not follow their own policy on document destruction, which made them look like at best idiots and at worst criminals.

    We may compare this to other ventures to suggest policy. The NYT does not want google to cache articles because the NYT sells those articles after a certain time. Many other companies do not want deep linking because it reduces ad revenue. A fascist government may want to ensure all users enter their site from a top page to make sure all users must go through the daily propaganda. A library tries hard to not track patrons so that no one is afraid of using the library. The rationale of the White House is beyond me.

    The White House is not hiding documents. However, they are reducing the transparency of the government by limiting the avenues by which the public may access documents. Since the White House has stated many times that it believes in transparency, and in fact requires transparency when dealing with other governments, one can stipulate that transparency is the appropriate standard. So, until someone comes up with a policy that was developed and vetted through the normal processes used in the U.S., one has every reason to suspect nefarious motives.

    And, if I may modify a statement that conservatives like to make, if you do not like transparency, go move to Iraq.

    --
    "She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
  24. Re:country is not at war by flossie · · Score: 4, Insightful

    An honourable country would not keep people imprisoned in Guantanamo Bay without either giving them PoW status or charging them with a specific offence and giving them the right to a fair trial, including free, unhindered and unmonitored access to legal counsel.

  25. Re: and your ... by Zeinfeld · · Score: 4, Interesting
    every major journalist worth his/her salt would be all over it within hours. so it wouldn't succeed in obscuring information.

    Where have you been living the past five years? Journalists don't criticize Bush.

    They still have not published the fact that he deserted from the national guard during Vietnam and they practically ignored his DUI conviction.

    The GOP has the media cowed with their constant 'liberal media' babble. The number of journalists who are prepared to hold Bush to account is tiny - Krugman, Conason, Ivins, Alterman. After that it's Al Franken, Jon Stewart and David Letterman.

    it would create an incredible backlash as soon as detected. what purpose would this serve?

    The chances that the mainstream media will pick this one up are very small. Just think how they would have reacted if it was Clinton!

    --
    Looking for an Information Security student project suggestion?
    Try http://dotcrimeManifesto.com/
  26. Re:EXACTLY by davebo · · Score: 4, Insightful

    Nobody thinks Bush and Cheney are updating the website. Jeeze. But the folks that are running the website (and I would bet this extends down to the actual webmaster/tech guy) are political appointees who are there to make the president look good. That is their job. Their actions are all filtered through this political role.

    Let's present an alternate scenario - since you have no evidence for yours, I don't have to present any evidence for mine.

    It's May - Pres. makes his speech on the Carrier, the assumption by those-in-charge is that Chalabi's government will have control of the country within a couple of weeks and the US troops will be heading on home. The web folks (who want to make B & C look good) declare "combat's done! the troops are coming home! re-elect Bush!"

    A few months later, that rosy scenario hasn't quite panned out. The aircraft carrier speech is becoming a liability for Bush - people started counting the number of dead troops in Iraq since he gave the speech, and it keeps going up. The web folks (who want to make B & C look good) say to themselves "this is a potential embarrassment to the president - let's see how we can make it less embarrassing."

    And there you have it.

  27. Re: and your ... by Zeinfeld · · Score: 4, Interesting
    The crux of this argument is that Bush missed some drills in 1972 while he was working on a political campaign in Alabama.

    The crux of the matter is that he refused to have his pilot's medical just after the Pentagon added a check for illegal drug use.

    You can try to spin this whichever way that Karl Rove tells you but the facts are against you. The fact is that your great leader is a coward who ducked the draft and then deserted to avoid a drug test.

    --
    Looking for an Information Security student project suggestion?
    Try http://dotcrimeManifesto.com/
  28. Re:And your ... by Darby · · Score: 4, Insightful

    Well terrorists have been attacking us since we have been in Iraq till this point in time, but i guess that doesnt mean there is any link..... naaaah

    Native people fighting against an occupying force are known as freedom fighters, not terrorists.

    Try again, sparky.