Millions of Pages Google Hijacked using ODP Feed

Posted by CmdrTaco on Wednesday March 23, 2005 @03:36AM from the well-this-isn't-going-well dept.

The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."

21 of 427 comments

A Real Question by 2names · 2005-03-23 03:40 · Score: 1, Interesting

What gains are made when someone hijacks a web site? This has probably been discussed before, but I'm too lazy right now to look it up. Anyone?

--
"I'm just here to regulate funkiness."
Easy to prosecute, hmmm? by r00t · 2005-03-23 03:41 · Score: 4, Interesting

Google has the records, and probably the original
site exists with behavior dependent on browser name
being GoogleBot or not. The replacement site will
generally have some way of making money, which can
be tracked via financial transactions.
301 redirects by Anonymous Coward · 2005-03-23 03:45 · Score: 3, Interesting

A few months ago, I rearranged my website. To make sure people could still find things, I put 301 redirects on all the old pages that I moved.

I noticed in my logs that search engines have repeatedly requested the 301 pages, but often don't follow the links to the new pages. And when searched with google, the pages still show up with the old urls. Should I be using 302 redirects instead?
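For reference, the two codes differ only in what they promise: 301 means the page has moved permanently, 302 that the move is temporary. A minimal Python sketch of a permanent-redirect table, assuming made-up paths in `MOVED` and a hypothetical helper named `redirect_for`:

```python
from http import HTTPStatus

# Hypothetical old-path -> new-path table; the paths are illustrative only.
MOVED = {
    "/old/about.html": "/about/",
    "/old/faq.html": "/help/faq/",
}


def redirect_for(path):
    """Return (status, headers) for a request path.

    Moved pages answer 301 (Moved Permanently) with a Location header.
    Anything else is reported as 200 here to keep the sketch short.
    """
    if path in MOVED:
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    return HTTPStatus.OK, {}
```

Whether a given search engine updates its index after seeing the 301 is up to that engine; the sketch only shows what the server sends.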
Re:302 by ari_j · 2005-03-23 03:53 · Score: 2, Interesting

I'm still not seeing any explanation of how it works, only what happens when it does work.
Not a surprise by faust2097 · 2005-03-23 04:04 · Score: 4, Interesting

For at least the last 18-24 months it's been increasingly difficult to find non-spam/redirect/affiliate program links for a search on any popular consumer product on Google. Maybe they have too much faith in their current PageRank and think it needs to be tweaked instead of overhauled. Maybe they think they have enough momentum and don't care. They certainly should have the talent and resources to do something about this and it's kind of sad that they haven't. I predict we'll see another whizzy side project in a few months instead.

The thing is that all they have to do is keep it just good enough that people won't leave. Remember, AdWords is Google's product, everything else [gmail, orkut, etc] they've got is just a way to show you those ads. Google's success is entirely because they had clearly better search results than anyone else. If another company can clearly best them then Google may be in trouble.
Re:Gopher by ari_j · 2005-03-23 04:08 · Score: 2, Interesting

IE doesn't support gopher:// URLs any longer, so assume that demand for Gopher would drive market share of Firefox et al. The problem is driving the demand for Gopher when IE doesn't support it.
My site is affected by barcodez · 2005-03-23 04:11 · Score: 4, Interesting

My site, the humor archives, has been affected by this. I can tell because if you do the following search you can see a bunch of sites that are/were 302ing to my domain. I'm pretty pissed off and I seriously hope Google acts soon to rectify the matter.

Re:Do what I'm going to do... by greg1104 · 2005-03-23 04:39 · Score: 2, Interesting

P/E ratios are a poor way to compare stocks in new growth companies, as they don't account for the rate at which earnings growth is accelerating which is far more important than the current earnings amount. P/E looks back, not forward.

If you look at the earnings rating that Investor's Business Daily computes, GOOG gets a score of 99, the maximum possible. I quote from IBD founder Bill O'Neil's book "How To Make Money in Stocks" to point out how short-sighted P/E thinking is with companies like this, picking one example most here are familiar with.

"America Online sold for over 100 times earnings in November 1994 before increasing 14,900% from 1994 to its top in December 1999."

I've made plenty of money buying companies with a P/E of near 100 and selling as it hit 500+. That said, I still wouldn't buy GOOG, but it's more because the market cap extrapolated from the relatively small number of public shares seems insane.
Re:I've had it with Google! by boy_of_the_hash · 2005-03-23 04:49 · Score: 2, Interesting

Except that you should be using 301 when your URI scheme changes.
And how to report this to Google... by ites · 2005-03-23 04:54 · Score: 2, Interesting

Email to webmaster@google.com with the keyword "canonicalpage".

Google are not taking this problem seriously.

I'd suggest that if your website is affected, you send an email as above.

--
Sig for sale or rent. One previous user. Inquire within.
it cant stop by grungefade · 2005-03-23 05:09 · Score: 1, Interesting

This has been possible in php forever. And a lot more hidden than a 302 redirect. You go to one page and depending on where you came from, it shows different content, but the url stays exactly the same. Here, go fool google:

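<?php
// The Referer header is optional, so fall back to an empty string.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$findme = 'google';

// stripos() returns FALSE when 'google' is absent (case-insensitive match).
$from_google = stripos($referrer, $findme);

if ($from_google === FALSE) {
    echo "Original Content";
} else {
    // Content shown to visitors who arrive from Google
    $content = file_get_contents("http://www.yahoo.com");
    echo $content;
    die();
}
?>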

Googlebot won't see the yahoo.com content because it doesn't send a referrer of google. Or you could do the same thing to Googlebot: get the IP of Googlebot and show it different information than what's there.
He's answered this before by Anonymous Coward · 2005-03-23 05:13 · Score: 1, Interesting

He stated that he is an engineer employed by Google. He surfs Slashdot and to some extent speaks for Google, although his membership is paid for out of his own pocket. Someone else, not affiliated with Google, had the user ID before but agreed to transfer it to this guy.

At least, that's what he has said before.
Re:Ugh. This is so not true. by metamatic · 2005-03-23 05:21 · Score: 4, Interesting

Frankly, I'd like to see Google start blocking content-free traffic-boosting sites from the page results entirely.

Google has login accounts, so let logged-in users have a link saying "report spam site". Track who files the most reliable reports, and if a few of those people all agree that a site is spam, nuke its pagerank.
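A rough Python sketch of that quorum idea, assuming each reporter carries a reliability score between 0 and 1. The function name `is_spam`, the 0.9 threshold and the quorum of three are all invented for illustration and are not how Google or OpenRatings actually work:

```python
def is_spam(reporters, reliability, min_reliability=0.9, quorum=3):
    """Decide whether enough trusted users flagged a site.

    reporters:   iterable of user ids that reported the site
    reliability: dict mapping user id -> fraction of that user's past
                 reports that were confirmed (assumed to be tracked elsewhere)
    """
    # Count each trusted reporter once, however many times they reported.
    trusted = {u for u in reporters if reliability.get(u, 0.0) >= min_reliability}
    return len(trusted) >= quorum
```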

See how OpenRatings does reliability calculations for more info. Or buy them :-)

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Re:RTFA by IMarvinTPA · 2005-03-23 05:29 · Score: 2, Interesting

This explains to me what's going on.

Although it seems backwards to me from what they should do.
What Google needs to do is not index 302s and instead index the final page. Alternatively/additionally, make sure the domain remains the same when accepting a 302 and indexing it.

As it is, it sounds like they're indexing my change of address card and ignoring my current residence.
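The two rules suggested above could be sketched in Python roughly as follows. This is an illustration of the proposal, not a description of what any crawler does, and the function name `url_to_index` is hypothetical:

```python
from urllib.parse import urljoin, urlsplit


def url_to_index(requested_url, status, location=None):
    """Pick which URL an indexer should credit for a fetched page.

    Hypothetical policy: on a 302, keep the requested URL only when the
    redirect stays on the same host; otherwise credit the redirect target.
    """
    if status != 302 or not location:
        return requested_url
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(requested_url, location)
    same_host = urlsplit(requested_url).hostname == urlsplit(target).hostname
    return requested_url if same_host else target
```

With this rule, a 302 from a third-party site to your page leaves your own URL as the one that gets indexed.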

IMarv

--
Trusting software vendors is no smarter than trus
Doesn't affect Yahoo by X · 2005-03-23 05:30 · Score: 4, Interesting

I'm surprised nobody has mentioned that Yahoo has already closed the 302 hole.

--
sigs are a waste of space
Can anybody provide a working example? by turnstyle · 2005-03-23 05:44 · Score: 2, Interesting

Is there a specific search that someone can suggest that would demonstrate this problem?

--
Here's what I do: Bitty Browser & Andromeda
Re:RTFA by Zeinfeld · 2005-03-23 06:00 · Score: 2, Interesting

My apologies, but the details of this exploit were linked-to in a previous article as well as this one, and you can't move for explanations of how it works.
If I find both articles confused and confusing, then it is a bit much to expect other people to follow them; I am listed as an original contributor to the design of HTTP.
The real problem here is not the 302; it's a bug in the googlebot, and fortunately a relatively easy one to fix. When googlebot sees a 302 redirect to a page, it treats the actual page and the redirect to the page as if they are one and the same. It should not; instead it should give the 302 linking URL a lower score than the URL linked to. I think this is pretty obvious from the specs. It should be a pretty quick fix.
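As a toy Python illustration of that scoring rule, assume every URL for a piece of content starts from the same base score and redirecting URLs are discounted. The function name `pick_canonical` and the `penalty` factor are invented for this sketch and say nothing about how Google's ranking really works:

```python
def pick_canonical(target_url, redirecting_urls, base_score=1.0, penalty=0.5):
    """Choose the URL to show for one page's content.

    The URL that actually serves the content keeps base_score; each URL
    that merely 302-redirects to it is scaled by `penalty` (assumed to be
    between 0 and 1), so it cannot outrank the target.
    """
    scores = {target_url: base_score}
    for url in redirecting_urls:
        if url != target_url:
            scores[url] = base_score * penalty
    best = max(scores, key=scores.get)
    return best, scores
```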
This is one of the problems I have every week when someone comes along with a 'new' attack that is simply a slight twist on something that has been around for years. I recently got called by a journalist researching IM 'viruses'. Unfortunately, it was only afterwards that I realized that all this 'new' attack was telling us is that once a machine is infected by spyware there is very little that can be done to protect the user.

--
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
clsc.net seems to be down... by luap2000 · 2005-03-23 06:25 · Score: 4, Interesting

here's my write-up on the problem from early February called Google and the Mysterious Case of the 1969 Pagejackers. the problem has been around for a long, long time.

personally, i'm ready to give up google maps or something else (autolink?) if they would 'fix' this or at least be more transparent about what's going on. ;)

btw, the word on the net is that the googleguy posting here isn't the real one. anybody have details on this?

-kpaul

--
J-Log: Journalism News, Media Views
Re:302 by Ryan+Stortz · 2005-03-23 06:33 · Score: 4, Interesting

I think a reasonable solution to this would be for Google to send a second spider to the site for every 302 redirect they find, with a user-agent indicating it's IE or any other browser. Then compare the data.

Although, they could probably still figure out it's google by their IP, but it's a step in the right direction.
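A minimal Python sketch of that double fetch, assuming two illustrative user-agent strings and a hypothetical `looks_cloaked` helper. Comparing exact bytes would misfire on pages that change on every request, so a real check would need something fuzzier:

```python
import hashlib
from urllib.request import Request, urlopen

# Assumed user-agent strings, for illustration only.
CRAWLER_UA = "Googlebot/2.1"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def fetch(url, user_agent):
    """Fetch a URL with the given User-Agent and return the body bytes."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read()


def looks_cloaked(url, fetcher=fetch):
    """True if the crawler and browser user agents receive different bodies."""
    as_crawler = hashlib.sha256(fetcher(url, CRAWLER_UA)).hexdigest()
    as_browser = hashlib.sha256(fetcher(url, BROWSER_UA)).hexdigest()
    return as_crawler != as_browser
```

The `fetcher` parameter exists so the comparison can be tried against a stub without touching the network.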

--
Bugs are just features that have been fixed.
Possible defense: HTTP 301 filter by accidentalGeek · 2005-03-23 08:27 · Score: 2, Interesting

I haven't tried this. It's just an idea knocking around in my head.

What would happen if I set up a stateful filter on my web server that did the following?

1. If the http client provided a referrer header and that header contains my own domain name, exit (and let the request be processed normally)

3. Record the user agent header, client IP address, and current timestamp in some sort of temporary lookup table

4. Issue an HTTP 301 with an absolute URL that points to the current page but with some technically insignificant rewrite from the way that the client requested it. For example, if the request is a simple GET, append a "?" or "&".

If the client was not referred by an internal link, this filter would instruct the client to reload the page in a way that ensures that it knows the correct, full URL.

By itself, this would simply cause an infinite loop which a robot would probably detect. That's where the temporary lookup table and slightly modified URL come in. I left step two out of the list above because it does not apply until the second time the agent hits our page:

2. Consult the lookup table. If this agent already hit this page within the last n seconds, exit and allow the request to be processed normally.
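The four steps can be sketched in Python as a single function. Like the idea itself this is untested against real crawlers; `MY_DOMAIN`, `WINDOW_SECONDS` and the in-memory `_recent` table are assumptions made for the sketch:

```python
import time

MY_DOMAIN = "example.org"   # assumed site name
WINDOW_SECONDS = 30         # the "n seconds" from step 2; value is arbitrary
_recent = {}                # (user_agent, ip, path) -> time of last redirect


def filter_request(path, query, referrer, user_agent, ip, now=None):
    """Return None to serve the page normally, or the URL for a 301."""
    now = time.time() if now is None else now
    # Step 1: requests referred from our own domain pass straight through.
    if referrer and MY_DOMAIN in referrer:
        return None
    # Step 2: this agent was redirected here recently, so let it through.
    key = (user_agent, ip, path)
    seen = _recent.get(key)
    if seen is not None and now - seen <= WINDOW_SECONDS:
        return None
    # Step 3: remember the agent and the time.
    _recent[key] = now
    # Step 4: redirect to the absolute URL plus a harmless "?" or "&".
    base = f"http://{MY_DOMAIN}{path}"
    return f"{base}?{query}&" if query else f"{base}?"
```

A production version would also need to expire old entries from the table, which this sketch leaves out.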

I don't know much about how robots such as googlebot behave. I'd love to see a reply from someone who knows more than I do.
Re:Ugh. This is so not true. by Anonymous Coward · 2005-03-23 09:43 · Score: 1, Interesting

The only problem, GoogleGuy, is that you folks are the ones creating the duplicate content. You can store the URL on the site doing the 302 as a URL-only entity without content, and store the redirected-to site's content for that page only under the redirected-to site, not under the redirecting URL as well.

You are currently hurting a totally innocent party.

Googlebot is not a browser.

The purpose of the redirects was to deliver content to a person.

I ask site1 please get me this.

It gets it or tells my browser where it is.

If my browser is told here is where it is then my browser says site2 please get me this .... and so it continues until a limit is reached, an error occurs, or it shows up on my screen.

In all cases the page is on the referred to site and that is where it belongs under its name on that site.

You can still keep a place holder on the original referring site in your database but not in the index so the next time you spider that link you'll have the proper starting point.