White House Website Limits Iraq-Related Crawling
oscarcar writes "Dan Gillmor is reporting on the White House website's use of its robots.txt file to disable search engines from crawling certain material. Many excluded items in the robots.txt file involve mentions of Iraq, possibly to prevent people from finding changes to past statements and information when archived elsewhere."
It looks like 99% of the stuff related to Iraq is filtered out in robots.txt.
But not a problem: on google.com I just specified the site by saying 'Iraq site:whitehouse.gov' and it had 14,000 hits... the first one is the root of the /infocus/iraq directory (which is disallowed in robots.txt).
>I wouldn't be surprised if there aren't a few honeypot pages in there too.
:)
On the production server of the US presidential home page? I'll go with the other theory
The use of the robots.txt file by crawlers isn't mandatory; at no point is it ever enforced. It's merely a courtesy.
All you'd have to do to continue indexing their site is to write a crawler that ignores robots.txt.
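To make the "courtesy" point concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The sample rules and the helper name polite_fetch_allowed are made up for illustration; only the /infocus/iraq path comes from this thread. The check only happens if the crawler decides to run it; a crawler that skips the call simply requests the URL anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample rules; only the /infocus/iraq path is taken from the thread.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /infocus/iraq
"""

def polite_fetch_allowed(robots_text, url, agent="*"):
    """Return True if a well-behaved crawler would consider `url` fetchable."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)

# A polite crawler asks first; a rude one just never calls this function.
iraq_ok = polite_fetch_allowed(SAMPLE_ROBOTS, "http://www.whitehouse.gov/infocus/iraq")
news_ok = polite_fetch_allowed(SAMPLE_ROBOTS, "http://www.whitehouse.gov/news/")
print(iraq_ok, news_ok)
```

Nothing on the server side is involved here: the file is parsed and honored (or not) entirely by the client.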
Or you could cue him instead. That might make more sense.
Nope... didn't take me long to find something that was disallowed to be a valid URL:
Disallow: /infocus/iraq
http://www.whitehouse.gov/infocus/iraq is a valid URL.
If you try actually *loading* the directories listed in the robots.txt, they don't exist. Not one. Not by going to their index.html or trying to find them through the site navigation. While they could still be accused of deleting them, many of the links are unlikely to have existed in the first place (http://www.whitehouse.gov/president/heartland-tour-gallery/iraq? /president/holiday/decorations/iraq? /president/tee-ball-01/iraq?). This may be just some IT grunt running a bad script on robots.txt.
I can't see this as a conspiracy .. it's just too silly.
Why on Earth wouldn't they just EDIT the bleedin' files? They wouldn't have to delete them or set up robots.txt, they would just change them to reflect the "message of the moment". They probably do that anyway, same as a lot of other sites.
Do they really think people would be blocked by robots.txt?? Nobody's that dumb (yeah they could be Windows MCSE droids but c'mon).
I think they did it for some other reason like keeping traffic down.
Another possibility: a hacker got in there and did this because a) he only had write access to robots.txt for some reason or b) he wanted to play a subtle joke. But I doubt that too.
Anyway this is strange, but pointless, so I wouldn't bother with it unless you're a Democrat looking for something else to whine about...
Most of the pages in the robots.txt are actually 404s and don't exist anymore. It's that simple. Keeps the robots from constantly requesting content that doesn't exist anymore. A few are blocked because they are bandwidth-intensive videos and things, and some others are blocked for more mundane reasons, I assume.
Seems odd and pointless to me. I'd like a statement explaining it. A lot like the "Disallow: /hidden/passwd" kind of entries.
Looks like someone just added IRAQ to all of the existing links. It's obviously some sort of search/replace/copy function. Go look for yourself, I found this one:
Disallow: /firstlady/recipes/iraq
Now, how many pages would this possibly block?
M@
Krispy Cream is people
Looks like they removed a bunch of files where they were making claims that Saddam was behind 9/11. One could be led to suspect that now that Bush got his war he doesn't need that lie anymore, and wants to erase all history of it since it undermines his authority.
Peace, or Not?
- Grep the errors log for 404's from search engines.
- Parse out the directory paths.
- Add those to robots.txt.
Which might explain why at least one of the directories -
I have to agree that it's more strange than sinister. Besides, I'm not sure that the web site is the official archive for White House statements.
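The three log-scraping steps listed above (grep the 404s from search engines, parse out the directory paths, add them to robots.txt) could look roughly like the sketch below. Everything specific here is an assumption: the Apache "combined" log layout, the BOT_MARKERS list, the sample log lines and the function name disallow_lines are all hypothetical, since nobody in the thread knows how the real file was generated.

```python
import re
from posixpath import dirname

# Assumed Apache "combined" log layout; the real server's format is unknown.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical list of crawler user-agent substrings.
BOT_MARKERS = ("googlebot", "slurp", "msnbot")

def disallow_lines(log_lines):
    """Turn crawler 404s into a sorted list of 'Disallow: <dir>' lines."""
    dirs = set()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if not m or m.group("status") != "404":
            continue  # step 1: keep only 404s...
        if not any(b in m.group("agent").lower() for b in BOT_MARKERS):
            continue  # ...that came from search engines
        # step 2: reduce the requested path to its directory
        path = m.group("path").split("?", 1)[0]
        dirs.add(path.rstrip("/") if path.endswith("/") else dirname(path))
    # step 3: emit robots.txt entries, skipping the site root
    return ["Disallow: " + d for d in sorted(dirs) if d and d != "/"]

# Made-up sample log lines for illustration only.
SAMPLE_LOG = [
    '1.2.3.4 - - [27/Oct/2003:10:00:00 -0500] "GET /firstlady/recipes/iraq/index.html HTTP/1.0" 404 209 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [27/Oct/2003:10:00:01 -0500] "GET /news/index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [27/Oct/2003:10:00:02 -0500] "GET /missing/page.html HTTP/1.1" 404 209 "-" "Mozilla/4.0"',
]
print("\n".join(disallow_lines(SAMPLE_LOG)))
```

A script like this would happily list directories that never existed, as long as some crawler once asked for them.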
everyone knows they are also used to prevent google from indexing stuff people would rather keep (semi) private.
The US government has no business with semi-private material. Either don't put it on the website, or make it publicly available to everyone, including Google and friends.
The complaint is they've done it before - "combat operations are done" became "major combat operations are done" when the fighting didn't stop. You can check here.
Compare the screenshots of what used to be on the white house website vs what's currently on the website.
Yes, I know, "how do we know this blogger didn't alter the screenshots?" You don't.
> There hasn't been a real declared war since WWII. You can't "declare war on terrorists" and be done with it either, wars are supposed to be declared on countries when you go to fight them.
Also, US wars have to be declared by the Congress rather than by the White House... or at least that's the way it worked back when the Constitution still meant something.
Sheesh, evil *and* a jerk. -- Jade
There are a lot of missing dates, but it looks to me like whitehouse.gov had a major site redesign sometime between Jul 13 and Sep 13 2001, and that when the new site was released they started putting in lots of the disallow statements for certain paths.
From Jul 13:
7-13 Whitehouse.gov
7-13 Robots.txt
From Sep 13:
9-13 Whitehouse.gov
9-13 Robots.txt
It seems to me like the simplest explanation is just that their redesigned site has multiple paths to the same information, and for some reason they felt that their search engine rankings would improve if they eliminated superfluous paths. Although I'll admit it's suspicious that their old robots.txt from 2 years ago had 151 Disallows, and the one from today has 1552 Disallows, while the site uses basically the same navigation structure.
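For anyone who wants to repeat the count on two archived copies of robots.txt, here is a rough sketch. The function names and the two tiny sample files are hypothetical (only /infocus/iraq and /firstlady/recipes/iraq appear in this thread), and note it counts distinct paths, so its totals could differ from a raw count of Disallow lines if a file repeats entries.

```python
def disallowed_paths(robots_text):
    """Collect the distinct non-empty Disallow paths from a robots.txt body."""
    paths = set()
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths

def compare(old_text, new_text):
    """Summarize how the Disallow set changed between two snapshots."""
    old, new = disallowed_paths(old_text), disallowed_paths(new_text)
    return {
        "old_count": len(old),
        "new_count": len(new),
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }

# Made-up snapshots for illustration; paste in the real archived files instead.
OLD = "User-agent: *\nDisallow: /cgi-bin\nDisallow: /search\n"
NEW = (
    "User-agent: *\n"
    "Disallow: /cgi-bin\n"
    "Disallow: /infocus/iraq\n"
    "Disallow: /firstlady/recipes/iraq\n"
)
print(compare(OLD, NEW))
```

Looking at the "added" list, rather than just the totals, is what would show whether the growth is mostly paths ending in /iraq.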
Other posters have claimed it's more than one. I haven't checked, so I don't know. However, even if it is just infocus/iraq, that's still a hell of a lot.
That subdirectory seems to contain all or most of the transcripts of Ari Fleischer's and Bush's interviews and press conferences leading up to the war and after. An example is this:
http://www.whitehouse.gov/infocus/iraq/excerpts_sept26.html
He didn't ban media coverage. He banned cameras and recording equipment at homecomings which feature flag-draped coffins.
There are a huge number of yeast infections in this county. Probably because we're downriver from the bread factory.
No, it's just the kind of subtle manipulation this administration has perfected. They probably realized that if they pulled all kinds of documents from the web site that it'd appear as if they were limiting access to the public record.
It's all still there for all to see, but it's not as easy to find. So they can say "We're not hiding anything." while they actually hide it.
Things that become inconvenient or embarrassing after the fact are hard to hide. At the time this quote by Dick seemed reasonable: link
"Simply stated, there is no doubt that Saddam Hussein now has weapons of mass destruction. There is no doubt that he is amassing them to use against our friends, against our allies, and against us."
Now maybe less so. Also, re: the uranium production in Africa, Fleischer sounds like a complete fool.
This is the first example of the Bush administration confronting the forged Iraq/African uranium document. This is from March 14th, 2003.
On March 17th, 2003 Bush gave Hussein 48 hours to leave Iraq, and on the 19th he launched "Operation Iraqi Freedom".
So for at least a week -before- the shooting started the Bush administration had reporters at press conferences asking questions about the forged uranium documents. The mainstream press didn't pick up on this story until July.
Link
Q Ari, the President said in his State of the Union address, the British government has learned that Saddam Hussein recently sought significant quantities of uranium from Africa. And since then, the IAEA said that those were forged documents --
MR. FLEISCHER: I'm sorry, whose statement was that?
Q The President, in his State of the Union address. Since then, the IAEA has said those were forged documents. Was the administration aware of any doubts about these documents, the authenticity of the documents, from any government agency or department before it was submitted to the IAEA?
MR. FLEISCHER: These are matters that are always reviewed with an eye toward the various information that comes in and is analyzed by a variety of different people. The President's concerns about Iraq stem from multiple places, involving multiple threats that Iraq can possess, and these are matters that remain discussed.
Fleischer stalls for time by pretending that he didn't understand the source of the quote (as if "President" and "State of the Union" in the first sentence were unclear), then comes up with a moronic bit of doublespeak. No wonder he quit. Read his last sentence in that press conference aloud. That sentence is the official line one week before the war. Lots of confidence there.
If the White House can make it a little more difficult for reporters or their opponents to dig up embarrassing quotes or timelines, you can bet your last dollar they will. -dameron
See:
http://www.whitehouse.gov/news/releases/2003/05/text/20030501-15.html
which differs from
http://www.whitehouse.gov/news/releases/2003/05/iraq/20030501-15.html
In the text version, the page says 'President Bush Announces Combat Operations in Iraq Have Ended', while in the robot-accessible version it is 'President Bush Announces Major Combat Operations in Iraq Have Ended'.
Get your own screenshots.
http://www.whitehouse.gov/infocus/iraq
Not any more.
Although the current Google cache lists
[snip 22 lines]
the current robots.txt leaps from
to
Conspiracy theory over...
Referring to a website critical of him (but correct in every detail)