Millions of Pages Google Hijacked using ODP Feed
The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
What gains are made when someone hijacks a web site? This has probably been discussed before, but I'm too lazy right now to look it up. Anyone?
"I'm just here to regulate funkiness."
Google has the records, and probably the original
site exists with behavior dependent on browser name
being GoogleBot or not. The replacement site will
generally have some way of making money, which can
be tracked via financial transactions.
A few months ago, I rearranged my website. To make sure people could still find things, I put 301 redirects on all the old pages that I moved.
I noticed in my logs that search engines have repeatedly requested the pages that return 301s, but often don't follow the redirects to the new pages. And when I search Google, the pages still show up with the old URLs. Should I be using 302 redirects instead?
I'm still not seeing any explanation of how it works, only what happens when it does work.
For at least the last 18-24 months it's been increasingly difficult to find non-spam/redirect/affiliate program links for a search on any popular consumer product on Google. Maybe they have too much faith in their current PageRank and think it needs to be tweaked instead of overhauled. Maybe they think they have enough momentum and don't care. They certainly should have the talent and resources to do something about this and it's kind of sad that they haven't. I predict we'll see another whizzy side project in a few months instead.
The thing is that all they have to do is keep it just good enough that people won't leave. Remember, AdWords is Google's product, everything else [gmail, orkut, etc] they've got is just a way to show you those ads. Google's success is entirely because they had clearly better search results than anyone else. If another company can clearly best them then Google may be in trouble.
IE doesn't support gopher:// URLs any longer, so assume that demand for Gopher would drive market share of Firefox et al. The problem is driving the demand for Gopher when IE doesn't support it.
My site, the humor archives, has been affected by this. I can tell because if you do the following search you can see a bunch of sites that are/were 302ing to my domain. I'm pretty pissed off and I seriously hope Google act soon to rectify the matter.
----
P/E ratios are a poor way to compare stocks in new growth companies, as they don't account for the rate at which earnings growth is accelerating which is far more important than the current earnings amount. P/E looks back, not forward.
If you look at the earnings rating that Investor's Business Daily computes, GOOG gets a score of 99, the maximum possible. I quote from IBD founder Bill O'Neil's book "How To Make Money in Stocks" to point out how short-sighted P/E thinking is with companies like this, picking one example most here are familiar with.
"America Online sold for over 100 times earnings in November 1994 before increasing 14,900% from 1994 to its top in December 1999."
I've made plenty of money buying companies with a P/E of near 100 and selling as it hit 500+. That said, I still wouldn't buy GOOG, but it's more because the market cap extrapolated from the relatively small number of public shares seems insane.
Except that you should be using 301 when your URI scheme changes.
Email to webmaster@google.com with the keyword "canonicalpage".
Google are not taking this problem seriously.
I'd suggest that if your website is affected, you send an email as above.
Sig for sale or rent. One previous user. Inquire within.
This has been possible in PHP forever. And it's a lot more hidden than a 302 redirect. You go to one page and, depending on where you came from, it shows different content, but the URL stays exactly the same. Here, go fool Google:
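<?php
// Show different content depending on whether the visitor arrived from Google.
// HTTP_REFERER is missing on direct visits, so fall back to an empty string.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$findme = 'google';
// strpos() returns FALSE when 'google' is absent; compare with === because
// a match at position 0 would also be falsy.
$from_google = strpos($referrer, $findme);
if ($from_google === FALSE) {
    // Everyone else, including crawlers that send no referrer
    echo "Original Content";
} else {
    // Content that people who come from Google see
    $content = file_get_contents("http://www.yahoo.com");
    echo $content;
    die();
}
?>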
Googlebot won't see the yahoo.com content because it doesn't send a Google referrer. Or you could do the same thing to Googlebot: get Googlebot's IP and show it different information than what's really there.
He stated that he is an engineer employed by Google. He surfs Slashdot and to some extent speaks for Google, although his membership is paid for out of his own pocket. Someone else, not affiliated with Google, had the user ID before but agreed to transfer it to this guy.
At least, that's what he has said before.
Frankly, I'd like to see Google start blocking content-free traffic-boosting sites from the page results entirely.
:-)
Google has login accounts, so let logged-in users have a link saying "report spam site". Track who files the most reliable reports, and if a few of those people all agree that a site is spam, nuke its pagerank.
See how OpenRatings does reliability calculations for more info. Or buy them
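To make the "track who files the most reliable reports" idea concrete, here is a rough sketch. It is not how OpenRatings or Google actually does it; the class name, the smoothing formula and the threshold are all made up for illustration. Each reporter gets a reliability score from how often their past reports were confirmed, and a site is flagged once the combined reliability of the people reporting it crosses a threshold.

```python
from collections import defaultdict


class SpamReports:
    """Toy reliability-weighted spam reporting (hypothetical scheme)."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold
        # reporter -> [confirmed reports, total reviewed reports]
        self.history = defaultdict(lambda: [0, 0])
        # site -> set of reporters who flagged it
        self.reports = defaultdict(set)

    def record_review(self, reporter, was_spam):
        """Record the outcome of a manual review of one past report."""
        stats = self.history[reporter]
        if was_spam:
            stats[0] += 1
        stats[1] += 1

    def reliability(self, reporter):
        """Smoothed confirmation rate; an unknown reporter starts at 0.5."""
        confirmed, total = self.history[reporter]
        return (confirmed + 1) / (total + 2)

    def report(self, reporter, site):
        self.reports[site].add(reporter)

    def is_flagged(self, site):
        """Flag the site once the summed reliability reaches the threshold."""
        score = sum(self.reliability(r) for r in self.reports[site])
        return score >= self.threshold
```

With a threshold of 2.0, three reporters with a good track record are enough to flag a site, while three brand-new accounts are not. A real system would also have to deal with sock puppets, which this sketch ignores.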
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
This explains to me what's going on.
Although it seems backwards to me from what they should do.
What Google needs to do is not index 302s and instead index the final page. Alternatively/additionally, make sure the domain remains the same when accepting a 302 and indexing it.
As it is, it sounds like they're indexing my change of address card and ignoring my current residence.
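A minimal sketch of that rule, assuming the crawler knows the URL it requested, the status code and the Location header. The function name is made up, and this is only a guess at the logic, not Googlebot's actual behavior: a 302 that stays on the same host keeps the requested URL, and a 302 that leaves the host gets indexed under the destination.

```python
from urllib.parse import urljoin, urlsplit


def url_to_index(requested_url, status, location):
    """Return the URL under which the fetched content should be indexed."""
    if status != 302 or not location:
        return requested_url
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(requested_url, location)
    src_host = (urlsplit(requested_url).hostname or "").lower()
    dst_host = (urlsplit(target).hostname or "").lower()
    if src_host == dst_host:
        # Temporary redirect inside one site: keep the address people link to.
        return requested_url
    # Redirect to somebody else's site: the content belongs to the target.
    return target
```

So a 302 from http://a.example/old to http://b.example/page would be indexed as http://b.example/page, and the hijacker's URL never gets credit for the content.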
IMarv
Trusting software vendors is no smarter than trus
I'm surprised nobody has mentioned that Yahoo has already closed the 302 hole.
sigs are a waste of space
Is there a specific search that someone can suggest that would demonstrate this problem?
Here's what I do: Bitty Browser & Andromeda
I am listed as an original contributor to the design of HTTP. If I find both articles confused and confusing, then it is a bit much to expect other people to follow them.
The real problem here is not the 302, it's a bug in Googlebot. Fortunately, it's a relatively easy one to fix. When Googlebot sees a 302 redirect to a page, it treats the actual page and the redirect to the page as if they are one and the same. It should not; instead, it should give the 302 linking URL a lower score than the URL linked to. I think this is pretty obvious from the specs. It should be a pretty quick fix.
This is one of the problems I have every week when someone comes along with a 'new' attack that is simply a slight twist on something that has been around for years. I recently got called by a journalist researching IM 'viruses'. Unfortunately, it was only afterwards that I realized that all this 'new' attack was telling us is that once a machine is infected by spyware there is very little that can be done to protect the user.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
here's my write-up on the problem from early February called Google and the Mysterious Case of the 1969 Pagejackers. the problem has been around for a long, long time.
;)
personally, i'm ready to give up google maps or something else (autolink?) if they would 'fix' this or at least be more transparent about what's going on.
btw, the word on the net is that the googleguy posting here isn't the real one. anybody have details on this?
-kpaul
J-Log: Journalism News, Media Views
I think a reasonable solution to this would be for Google to send a second spider to the site for every 302 redirect they find, with a user-agent indicating it's IE or any other browser. Then compare the data.
Although the site could probably still figure out it's Google by the IP, it's a step in the right direction.
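Roughly what I mean, as a sketch. The user-agent strings are examples, and `fetch` is a stand-in for whatever HTTP client the crawler uses; it is assumed to take a URL and a user-agent and return the final URL plus the body. Nothing here is Google's real code.

```python
import hashlib

# Example user-agent strings (assumptions, chosen for illustration)
CRAWLER_UA = "ExampleBot/1.0"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def fingerprint(body):
    """Hash the response body (bytes) so two fetches can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def looks_cloaked(url, fetch):
    """Fetch the URL twice, as a crawler and as a browser, and compare.

    fetch(url, user_agent) must return (final_url, body_bytes).
    """
    bot_url, bot_body = fetch(url, CRAWLER_UA)
    browser_url, browser_body = fetch(url, BROWSER_UA)
    return (bot_url != browser_url
            or fingerprint(bot_body) != fingerprint(browser_body))
```

An exact hash comparison will also trip on pages that legitimately change between requests (rotating ads, timestamps), so a real check would need a fuzzier comparison.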
Bugs are just features that have been fixed.
I haven't tried this. It's just an idea knocking around in my head.
What would happen if I set up a stateful filter on my web server that did the following?
1. If the HTTP client provided a referrer header and that header contains my own domain name, exit (and let the request be processed normally)
3. Record the user agent header, client IP address, and current timestamp in some sort of temporary lookup table
4. Issue an HTTP 301 with an absolute URL that points to the current page but with some technically insignificant rewrite from the way that the client requested it. For example, if the request is a simple GET, append a "?" or "&"
If the client was not referred by an internal link, this filter would instruct the client to reload the page in a way that ensures that it knows the correct, full URL.
By itself, this would simply cause an infinite loop which a robot would probably detect. That's where the temporary lookup table and slightly modified URL come in. I left step two out of the list above because it does not apply until the second time the agent hits our page:
2. Consult the lookup table. If this agent already hit this page within the last n seconds, exit and allow the request to be processed normally.
I don't know much about how robots such as googlebot behave. I'd love to see a reply from someone who knows more than I do.
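For what it's worth, here is the filter as an untested-in-the-wild sketch, with the steps in the order they would actually run (1, 2, 3, 4). The class name, the 30-second default and the `http://` scheme are my own assumptions, and I compare the referrer's host for equality rather than doing a substring match. It only decides what to do; wiring it into a real web server is left out.

```python
import time
from urllib.parse import urlsplit


class BounceFilter:
    """Decide whether to serve a request or bounce it once with a 301."""

    def __init__(self, own_host, window_seconds=30, clock=time.time):
        self.own_host = own_host
        self.window = window_seconds
        self.clock = clock
        # (client IP, user agent, path) -> time of the last redirect issued
        self.seen = {}

    def handle(self, path, query, referrer, user_agent, client_ip):
        """Return None to process normally, or the URL to send in a 301."""
        # Step 1: internal referrer, nothing to do.
        if referrer and (urlsplit(referrer).hostname or "") == self.own_host:
            return None
        # Step 2: this agent was bounced recently, so let it through.
        key = (client_ip, user_agent, path)
        now = self.clock()
        last = self.seen.get(key)
        if last is not None and now - last <= self.window:
            return None
        # Step 3: remember this agent.
        self.seen[key] = now
        # Step 4: redirect to the same page with a harmless "?" or "&" added.
        new_query = query + "&" if query else ""
        return "http://{}{}?{}".format(self.own_host, path, new_query)
```

The lookup table is never pruned here, so a real version would need to expire old entries.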
The only problem, GoogleGuy, is that you folks are the ones creating the duplicate content. You can store the URL on the site doing the 302 as a URL-only entity without content. Don't store the redirected-to site's content under the redirecting URL; store that content only under the redirected-to site's own URL for that page.
.... and so it continues until a limit is reached, an error occurs, or it shows up on my screen.
You are currently hurting a totally innocent party.
Googlebot is not a browser.
The purpose of the redirects was to deliver content to a person.
I ask site1 please get me this.
It gets it or tells my browser where it is.
If my browser is told "here is where it is", then my browser says: site2, please get me this.
In all cases the page is on the referred to site and that is where it belongs under its name on that site.
You can still keep a place holder on the original referring site in your database but not in the index so the next time you spider that link you'll have the proper starting point.
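Here is that idea as a sketch, under my own assumptions about the crawler: `fetch` is a hypothetical callable returning (status, location, body), and the hop limit of 5 is arbitrary. Redirecting URLs are kept only as placeholders for the next crawl, and the content is filed under the URL that actually served it.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch, max_hops=5):
    """Follow redirects the way a browser would.

    fetch(url) must return (status, location, body).
    Returns (index_entry, placeholders): index_entry is (final_url, body),
    or None on an error or when the hop limit is reached; placeholders are
    the redirecting URLs, kept as crawl starting points but not indexed.
    """
    placeholders = []
    url = start_url
    for _ in range(max_hops + 1):
        status, location, body = fetch(url)
        if status in (301, 302) and location:
            placeholders.append(url)       # remember it, but store no content
            url = urljoin(url, location)   # Location may be relative
            continue
        if status == 200:
            return (url, body), placeholders
        return None, placeholders          # an error occurred
    return None, placeholders              # the limit was reached
```

So if site1 302s to site2, the page is indexed under its name on site2, and site1's URL survives only as the place to start spidering next time.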