Millions of Pages Google Hijacked using ODP Feed
The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
OMG!!! Slashdot's been hijacked too!
I'll turn into a supernova and burn up everything. Well I'll turn into a black little hole and you'll turn into string.
What gains are made when someone hijacks a web site? This has probably been discussed before, but I'm too lazy right now to look it up. Anyone?
"I'm just here to regulate funkiness."
This is a placeholder. I'll include more details of why you shouldn't listen to Threadwatch.org in a bit, and debunk this some. Let me get this posted and I'll follow up.
(Yes, I am GoogleGuy.)
I am really, extremely, entirely confused about the article altogether. Is the hijacking more or less about Google digging into your site even when your robots.txt is refusing Googlebot entry?
This is the last straw! I'm going back to MSN, where I know that my data and privacy are being protected!!
*duck*
Socialism: A feeling of discontent and resentment caused by a desire for the possessions or qualities of another.
Google has the records, and probably the original
site exists with behavior dependent on browser name
being GoogleBot or not. The replacement site will
generally have some way of making money, which can
be tracked via financial transactions.
For every Good Thing, there are at least 100 different ways to abuse it.
FLR
I wasn't sure what a 302 hijack was, so here's the obligatory lowdown for those who didn't RTFA (from the page linked in the article): This exploit allows any webmaster to have his own "virtual pages" rank for terms that pages belonging to another webmaster used to rank for. Successfully employed, this technique will allow the offending webmaster ("the hijacker") to displace the pages of the "target" in the Search Engine Results Pages ("SERPS"), and hence (a) cause search engine traffic to the target website to vanish, and/or (b) further redirect traffic to any other page of choice.
arg
see subject.
"I'm just here to regulate funkiness."
A few months ago, I rearranged my website. To make sure people could still find things, I put 301 redirects on all the old pages that I moved.
I noticed in my logs that search engines have repeatedly requested the 301 pages, but often don't follow the links to the new pages. And when searched with google, the pages still show up with the old urls. Should I be using 302 redirects instead?
"Oh! Look! Something beautiful! Something impressive! I must destroy it!"
pah. feeling jaded today, i guess.
"hey, could you pass me a paper towel? er.. I mean... DEPLOY ABSORBTION PANEL!"
buy GOOG on the dip as many non-techie investors panic sell. 8)
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
As web presence -defined as within about the first 10-20 results of a search- becomes more and more important to "success," black hat techniques such as this, to eliminate competitors, will become more and more common. Google, or any other search tool needs to be able to stay above the fray and not be subject to hacks such as this.
This is why Gopher will always be better than your feeble world wide web junk.
Damn Google!!! Do you mean this is not www.kuro5hin.org ??
I can imagine it now... The slashdotting to end all slashdots. If every site in Google was 302 redirected to RIAA.com, how amazing would that be...
People are already using the 302 redirect on services such as no-ip and dyndns so they can gain multiple listings (even whole pages of them) for the terms they want.
Business Voyeur
1. Search Google for 'allinurl:', e.g. 'allinurl:slashdot.org'.
2. Copy and paste any dubious URLs into this tool and check whether they're using 302 redirects or not.
3. Panic!
/me notices that my company's web site has been thusly hijacked... and yes! Doing a Google search on the main text on my company's web site shows dozens of unrelated sites high in the ranking. None of these actually have the text on their pages.
One example: http://www.tradedoubler.it.
Luckily, the phrase in question is complete gibberish and no-one ever finds our site through Google, only by reputation and word of mouth.
Still, I think it's clear Google have a serious problem here...
Sig for sale or rent. One previous user. Inquire within.
You are right, MSN only sets cookies that last the lifetime of their current OS.
Grundgesetz * 23. Mai 1949 - 30. November 2007 - http://www.vorratsdatenspeicherung.de/
302 hijacks work because Google goes to http://bad.site/ and gets redirected to http://good.site/. It then treats the contents of the bad.site as identical to that of good.site. The effect seems similar to if somebody simply copied an entire page off of your site (I'm not sure if it's actually more serious than this), but it's easier to do because you're just keeping a small table of redirections.
How serious is it? Don't know. It's pretty easy for a webmaster to check for hijacking and have her pages de-hijacked (see aforementioned article). It's probably not as screamingly awful as the threadwatch.org article suggests, but the redirector sites are rather annoying. Several of the comments in the webmaster article suggest that Google has already started moving on the problem.
For at least the last 18-24 months it's been increasingly difficult to find non-spam/redirect/affiliate program links for a search on any popular consumer product on Google. Maybe they have too much faith in their current PageRank and think it needs to be tweaked instead of overhauled. Maybe they think they have enough momentum and don't care. They certainly should have the talent and resources to do something about this and it's kind of sad that they haven't. I predict we'll see another whizzy side project in a few months instead.
The thing is that all they have to do is keep it just good enough that people won't leave. Remember, AdWords is Google's product, everything else [gmail, orkut, etc] they've got is just a way to show you those ads. Google's success is entirely because they had clearly better search results than anyone else. If another company can clearly best them then Google may be in trouble.
301 is a permanent redirect, 302 temporary.
This is why the "302 hack" works. If the redirect is only supposed to be temporary, the search engine keeps the URL of the 302 as the URL for the document, but indexes the content of the page to which the redirect is directed.
301 is what you should be using to point the SEs to your new pages if you've moved them. The behavior is supposed to be for the SEs to replace the old URL in their index with the new one, and furthermore count all links to the 301ed URL as being towards the new one. I don't know why it's not working for the grandparent poster, but it's the way that the functionality is "advertised" for Google and Yahoo, and it should work.
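To make the 301/302 distinction concrete, here is a minimal Python sketch of the indexing behaviour described above. It is only a toy model of what posters in this thread say search engines do, not anyone's actual crawler code; the function name `indexed_url` and the example URLs are made up for illustration.

```python
from typing import Optional


def indexed_url(requested_url: str, status: int, location: Optional[str]) -> str:
    """Return the URL a (hypothetical) index would file the fetched content under.

    Assumption: this mirrors the behaviour described in the thread, nothing more.
    """
    if status == 301 and location:
        # Permanent move: the old URL is replaced by the new one.
        return location
    if status == 302 and location:
        # Temporary move: the content comes from `location`,
        # but it is still filed under the URL that was requested.
        return requested_url
    # No redirect: the content belongs to the URL that served it.
    return requested_url


# A moved page (301) ends up indexed under its new address.
print(indexed_url("http://example.com/old", 301, "http://example.com/new"))
# A 302 keeps the redirecting URL, which is what the "302 hack" relies on.
print(indexed_url("http://bad.site/", 302, "http://good.site/"))
```

The second call is the whole trick in miniature: the content is fetched from good.site, but the record keeps the bad.site URL.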
500GB of disk, 5TB of transfer, $5.95/mo
I was thinking that some major crisis had broken out and a million pages were hijacked at once, creating something bigger than any other Internet event, and that it caused Google's stock to tank and forced them to go private again, lay off workers and go bankrupt. But that's crazy. But still, word it right. Damn it.
In America, you spam computers. In Soviet Russia, computers spam you!
My site the humor archives has been affected by this. I can tell because if you do the following search you can see a bunch of sites that are/were 302ing to my domain. I'm pretty pissed off and I seriously hope Google act soon to rectify the matter.
----
How about adding "Fiction: Google information for webmasters contains any facts"?
Domain-vultures - you know, the pr0n companies out there that harvest recently expired domains and point them to their adult content. There are lots and lots of sites like this, where some admin forgot to re-register the site, etc. And then, the domain is held hostage by the pr0n site until a ransom of a couple thousand is paid to them. Just a thought.
So try Teoma instead. They're not as well known as Google but I find they return much more relevant results in many cases.
I say that it would be an appropriate end for the company that bought DejaNews and is continuously screwing with the useful Usenet archive tool that it once represented.
Then again, Deja.com 'the place for consumers to search for product info' was an abomination in the middle years before the Google takeover.
Why do all the good Internet resources gradually turn to shit?
what major headlines ? millions of pages !! the world is coming to an end !!!!
a quick whois on threadwatch.org (the submitter's site) reveals it's hosted by search engine spammers
platinax.co.uk which is registered to a UK "company" called BriteCorp
http://www.britecorp.co.uk/
who offer all the usual SE spamming methods
coincidence ?
a whois on britecorp's platinex site reveals they have removed their address from the whois db, and their websites contact details are a mobile phone number (07963 808470)
further investigation on britecorp reveals they are not a "real" company but trading as "Brian Turner" (pic) and companies house dont seem to have any records of any of these companies, though iam sure further investigation could find out more
so why would a supposedly reputable marketing company have a cell phone as a primary contact point ?
something to hide egh ?
or perhaps local trading standards would like to hear about them and their "services" ?
northern scum by any other name
I was on slashdot reading all this stuff when my browser redirected to porn sites......
Oh wait. I got bored and did a search for porn...
I guess that's different.
To continue having the victim's hits redirected, the redirect needs to stay in place, doesn't it?
What in the world does the hijacker gain by having google point him, only to then load the victim's page?
hawk
It will also break many "click trackers", "portals", "directory sites", "search engine optimizers", and other annoyances, which is probably a plus for Google users. You know, those sites where you click on some phrase in Google and, three redirects later, you're at some irrelevant porno site.
If the site is porn and the correct site was something that might attract kids, get the site on that.
There are also some nice laws involving computer misuse. One could argue that Google had been "hacked".
An imaginative prosecutor will have many more ideas.
Why not just fix the bug and then recreate the rankings index? Googlebot hits my sites all the time, so I know that it covers the rest of the internet quite often as well. With their amount of hardware, it probably wouldn't take long.
A musician without the RIAA, is like a fish without a bicycle.
AllinURL returns results where the results are in the URL. So they *should* be returned.
I'm not convinced by this whole 302 nonsense. I haven't seen a single search where a 302 scraper site is ranking above the site it 302s for the scammed text.
To me it sounds like people's sites drop for whatever reason, then they look for a reason and they grasp at this 302 story.
I do an allinurl on my various sites (8 of them) and 6 have scrapers attached, only 1 has disappeared recently and that seems to have been caused by a change of IP address or maybe the loss of the yahoo directory link or perhaps because I have lots of pages with 20-30% similar content.
But if I only had 1 site I could easily blame a 302 problem.
The article is confused and badly written. It never explains the exploit being used. So stop dumping on people. It is not at all surprising that people don't get what is going on when the description is crud.
What is really going on has nothing to do with 302, or at least very little. What these people are doing is to set up fake web sites using content filched from genuine Web sites. This allows (or is believed to allow) them to climb the google rankings.
I don't see why someone would use a 302 response when they can just copy the entire content unless there is some sort of bug in Google's pagerank that is not being explained. Copying the entire content is much simpler.
So what the attacker does is to set up their site so that when the googlebot comes round it publishes some legitimate content, then when other folk follow the site from a google search they get pages infested with spyware or the like.
This would certainly explain the number of times I have done a Google search and ended up at an idiotic 'search site' that does nothing for me.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
It seems that when page A redirects to B, Google not only considers that a hit for A, but also assigns B's content to A (I just skimmed through all the posts here so maybe that's not what happens).
In that case, it seems to make more sense to just ignore A altogether since the hit and content rightfully belong to B.
This could be done by treating redirects as empty one-link pages, thus unifying the handlers and defeating this practice.
This story does not need "debunking".
What it needs is a rapid and satisfactory answer or Google will find themselves at the receiving end of more angst than they even know is possible.
A concrete example. My company's web site has been in existence since 1995. So we have pretty good page ranking. Our main page has one phrase, very distinct, unique.
When I search for this phrase (in quotes), Google reports hundreds of matches. These sites (except our own) do not contain the phrase but are sites that sell traffic boosting.
The 302 problem is real.
Incidentally, I just spent 15 minutes at Google.com looking for a way to report the problem. Where is that mention of "canonicalpage"? In the bottom shelf of a filing cabinet, behind a locked door that says "beware of the tiger"?
I'm not surprised you got only 30 reports. What I am surprised at is that you appear to speak for Google yet have such an inane response to what is a real (and for many people, a terrifying) problem.
Sig for sale or rent. One previous user. Inquire within.
This has very real potential to be taken advantage of for phishing scams.
Imagine someone searching for their bank's website on Google (because some think that [searching] is how the web works!) and clicking the wrong link. That link takes them to a site that looks just like their bank's website, and maybe there is a security alert on the front page asking them to verify their information. After doing so, they could be redirected to their real bank's site, never having realized their error.
Experience has shown me that most non-techies know they type an address into their browser, but after that, they pay no attention to it which makes this a real possibility.
Google seems to me to have got itself into a corner. There are lots of pages of crap, but my non-302 site is not in its index. Google is playing the "we're smarter than the search engine optimizers" game. The 'genuine' websites that have real information lose out. Google is losing the plot; I don't wish to pay anybody, or advertise on Google. I do have some Yahoo pages indexed on Google. I don't consider them important, and to me they prove how dumb their 'bot' is. When Google stops playing the 'game' perhaps things might improve? Google is not a definitive search in my eyes.
Send Peter Clifford Francis Macrae comdoms to 23 Bedford St, St.Neots, PE19 1AX, England
This is hilarious! Someone please mod up! Hope I get the above mods in M2.
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
Why do all the good Internet resources gradually turn to shit?
Because someone wants to make money from it.
And the knowledge that they fear is a weapon to be used against them...
It's about pushing unrelated sites up in the rankings.
For instance: I have a site with excellent page ranking. Now a new site will set up, and do a 302 to my site. Google now gives this new site my page ranking. When the new site is indexed, it removes the 302 redirection.
When you search for my site, you now find these new sites instead. There is no redirection when you click on a link, but the "cached text" that Google shows is wrong.
Basically this technique allows people to get high page rankings without earning them. It's very widespread - I counted over 60 such parasites for my company's web site (which has excellent page ranking).
Sig for sale or rent. One previous user. Inquire within.
And I know two other people who sent one. Maybe you should check again? I doubt me and my mates account for 10% of your responses. If you believe that the people affected by this are all "spammers" then perhaps the problem is false positives for your spam detection filters. In fact you should probably take a look at your spam detection filters anyway. Last time I checked (probably much more recently than you checked for canonicalpage emails), there was a bunch of scraper sites running AdSense where good relevant results used to be.
Email to webmaster@google.com with the keyword "canonicalpage".
Google are not taking this problem seriously.
I'd suggest that if your website is affected, you send an email as above.
Sig for sale or rent. One previous user. Inquire within.
This was originally posted the first time a story about this ran, but since a lot of people are still confused, here it is again...
There seems to be a lot of confusion as to why exactly this is such a big deal. A lot of people saying there's no problem or that this is nothing new... basically just not understanding the issue. Let me explain:
Suppose you have a small business under the domain http://xyz.com/, and search engines bring you a lot of traffic because you rank high for keywords in your market. You have a lot of people out there linking to you, a lot of satisfied customers, good content on your site. You're always in the top 10 somewhere when people search for "xyz widgets".
Well, this issue with Google makes it very easy -- incredibly easy -- for someone to knock your site out of the rankings entirely. And I mean for *everything*, to where searching for your own company name in quotes literally buries you hundreds of pages deep in the results. We're talking sites going from getting 1000 unique hits to 10 overnight.
And here's the kicker: It requires absolutely no technical knowledge, no time investment, and is perfectly legal...
All I have to do is have another domain handy that is roughly as popular as yours. And I make a "links" page, like one of those directory services, that lists your website. But instead of being a normal hyperlink, it's a CGI (or PHP or ASP or whatever) script that generates a 302 redirect to your domain... Now, these are very simple, common scripts. One-liners that you can download from cgiscripts.com and stick on your server. The original intent of these scripts is to track which links are being clicked on your site. But now they've found a new use, because when Google gets that 302, all hell breaks loose.
See, according to the HTTP spec, 302 is a *temporary* redirect, which means Google is supposed to interpret whatever content it finds at the 302 target (your site) as really belonging to the URL of the source (my site). Google is just obeying the spec strictly here, and with devastating results. Why? BECAUSE THE DUPE FILTER NOW KICKS IN! You see, Google has a "dupe filter" that says if the same exact content is found for two unique URLs, then one of the URLs is obliterated in the rankings. Because after all, searchers don't want to be finding the same content over and over. If that happens, they'll start using a different search engine. But Google, sticking strictly to the HTTP spec, doesn't know who the content really belongs to when it gets a 302.
So Google essentially flips a coin. And if it comes up tails, say bye-bye to your domain in the rankings. Your *entire* domain. Because the dupe filter isn't limited to just the page that the 302 is pointing to -- it applies across your entire domain.
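For anyone who wants to see the "dupe filter plus coin flip" idea spelled out, here is a small Python sketch. It is an assumption-laden toy: nobody outside Google knows how the real filter works, and the names `dedupe`, `hijacker.example` and `other.example` are invented here. It only shows the mechanism as described above: group URLs by identical content, keep one per group at random.

```python
import hashlib
import random
from collections import defaultdict


def dedupe(index, rng):
    """Keep one randomly chosen URL for each distinct page text.

    `index` maps URL -> page text as recorded by the (hypothetical) crawler.
    """
    groups = defaultdict(list)
    for url, text in index.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # The "coin flip": the filter has no idea which URL is the real owner.
    return {rng.choice(sorted(urls)) for urls in groups.values()}


index = {
    "http://xyz.com/": "xyz widgets",                   # the real site
    "http://hijacker.example/out?id=1": "xyz widgets",  # 302 pointing at xyz.com
    "http://other.example/": "something else",
}
kept = dedupe(index, random.Random())
print(sorted(kept))
```

Run it a few times and xyz.com survives in roughly half of the runs. The sketch does not model the "entire domain" effect mentioned above.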
These 302 "exit-link-trackers" are all over the web. They've been used by webmasters for years. But it's just recently that Google has started treating 302 this way, so it didn't have any bad effect before. But now it kills you.
The funny thing is, the solution seems pretty simple: Just stop treating 302s this way if they point to a different domain. But for whatever reason Google isn't listening. Hopefully the press that's being generated now will give them the kick in the ass that they need.
Okay, so basically this is the problem: when Google encounters a status 302 redirection (as opposed to the status 301 redirection) it then indexes the content as belonging to the initial URL, not the URL at the end result of the 302 redirection. Other things happen later because of google's design.
302 redirections are temporary redirections - the idea is that a 302 is supposed to be used when someone needs to be redirected to a new page, but should still use the original URL if they want to come back later. As an example, the page http://purl.oclc.org/OCLC/PURL/CONTRIBUTORS performs a 302 redirect to http://purl.oclc.org/docs/contributors.html. This means that although your web browser needs to go to some other URL for the content at the moment, it really should remember the first url as the permanent one.
Contrast this with what happens when your browser visits http://snowplow.org/martin - you get sent a 301 redirect to http://snowplow.org/martin/. (Note the extra slash) In this case, the server is saying "the url with the slash on the end is the real location, and you should not try to come back here without the final slash in the future."
Ideally, if every web browser behaved according to spec., bookmarks (remember bookmarks?) would get automatically updated to the new URL when you selected them and the redirect was a 301 redirect. However, for a 302 redirect, the bookmark would stay as is.
302 redirects can be very useful when you want to set up a hierarchy of "logical" URLs that will permanently point to the correct location. 301 redirects are useful when you're obsoleting an old URL and wish people to go and use the new URL from now on.
Okay, so how does this relate to google? Well, let's suppose that you have a great site on fruitbats. I can set up http://www.example.com/topics/fruitbats to be a 302-style redirect to your site, essentially saying "The information at http://www.example.com/topics/fruitbats is temporarily being hosted by http://www.yoursite.com/". Now google, when it spiders pages, will see that, will go retrieve the text from your page and will then index it under http://www.example.com/topics/fruitbats, since after all I just gave a temporary (302) redirect.
But it gets worse, because a final part of google's indexing process is to compare pages for identical text, and throw out all but one of the URLs. Apparently this stage has nothing to go on other than the text and the recorded URLs, and so your URL stands a fifty-fifty chance of being thrown out.
Except that I've not just redirected http://www.example.com/topics/fruitbats to your site, but also http://www.example.com/topics/fruitbat, http://www.example.com/topics/fruit_bat, and http://www.example.com/topics/fruit_bats. Now your lone URL doesn't stand much of a chance of being the one kept by the "throw out duplicates" processor, does it?
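To put a number on that: if we assume (and it is only an assumption, following the "fifty-fifty" guess above) that the duplicate stage picks the surviving URL uniformly at random, your odds shrink with every extra redirecting URL. A quick Python check, with `survival_chance` being a name invented for this sketch:

```python
from fractions import Fraction


def survival_chance(redirecting_urls: int) -> Fraction:
    """Chance that the one genuine URL is the one kept.

    Assumes a uniform random pick among the genuine URL and all
    redirecting URLs that carry the same text.
    """
    if redirecting_urls < 0:
        raise ValueError("redirecting_urls must be non-negative")
    return Fraction(1, redirecting_urls + 1)


print(survival_chance(1))  # 1/2, the single-redirect coin flip
print(survival_chance(4))  # 1/5, the four fruitbat URLs listed above
```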
In a sense, of course, there's little google can do to prevent this, because even if they weighted 302-redirects lower in their "throw out duplicates" stage, I could always just go snag a copy of your website each time googlebot visits, in essence doing the redirection myself. (How? Just search the apache mod_rewrite guide for "Dynamic Mirror") However, doing it through 302 redirects means that google pays for the bandwidth to go get your page, not me. (Not that this is necessarily a significant amount of bandwidth, since we're only talking about basic google here and not images. Depending on the revenue you get by misdirecting google queries it might be economical)
Of course, for this to really work, I'd need a list of websites sorted by category to build up my redirect db. But wait! The ODP feed provides exactly that.
I am a little bit wary of doi
I'm so thinking of the recent iTMS DRM hack...
No, the way it works is with the 302, but only for the googlebot.
For this to work the scammer has to give the 302 only to the googlebot, all other browsers need to get the content of the scammer's page. If you google for "cheapest car insurance" (IIRC) you can find an example of this. Change your User Agent accordingly and click on the top Google link, you'll end up at another site. Change back to Mozilla and you'll get the scammer's site.
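If you'd rather not fiddle with your browser's User Agent by hand, the same check can be sketched in Python. This is a rough illustration, not a polished tool: the two User-Agent strings are just example values, and `fetch_status` and `differs_by_user_agent` are names made up here. It sends a single GET without following redirects and compares what a crawler-like client and a browser-like client get back.

```python
import http.client
from urllib.parse import urlsplit

# Illustrative values only; real crawler and browser strings vary.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0"


def fetch_status(url, user_agent, timeout=10.0):
    """Issue one GET without following redirects; return (status, Location header)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def differs_by_user_agent(as_bot, as_browser):
    """True when the two (status, Location) results disagree."""
    return as_bot != as_browser


# Offline example with canned results: a 302 for the bot, a normal page for browsers.
print(differs_by_user_agent((302, "http://good.site/"), (200, None)))
```

To try it against a live URL you would call `fetch_status(url, GOOGLEBOT_UA)` and `fetch_status(url, BROWSER_UA)` and feed both results to `differs_by_user_agent`. Sites that check the crawler's IP address rather than its User Agent would not be caught this way.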
Sig is on vacation
This has been possible in php forever. And a lot more hidden than a 302 redirect. You go to one page and depending on where you came from, it shows different content, but the url stays exactly the same. Here, go fool google:
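<?php
// Show different content depending on whether the visitor arrived from Google.
// HTTP_REFERER is optional and client-controlled, so fall back to an empty string.
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$findme = 'google';
// strpos() returns FALSE when 'google' does not appear in the referrer.
$from_google = strpos($referrer, $findme);
if ($from_google === FALSE) {
    // Visitors who did not come from Google (Googlebot included) see this.
    echo "Original Content";
}
else {
    // Content people see that come from google
    $content = file_get_contents("http://www.yahoo.com");
    echo $content;
    die();
}
?>
Googlebot won't see the yahoo.com content because it doesn't send a Google referrer. Or you could do the same thing to Googlebot: get the IP of Googlebot and show it different information than what's there.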
He stated that he is an engineer employed by Google. He surfs Slashdot and to some extent speaks for Google, although his membership is paid for out of his own pocket. Someone else, not affiliated with Google, had the user ID before but agreed to transfer it to this guy.
At least, that's what he has said before.
My apologies, but the details of this exploit were linked-to in a previous article as well as this one, and you can't move for explanations of how it works. I also tend to get irritable with people who, when explicitly presented with information on a subject, can't be bothered to even attempt reading it (as the GP obviously hasn't, obviously not understanding the first thing about how it works), and instead just want everyone to explain it for them (again).
;-)
What is really going on has nothing to do with 302, or at least very little. What these people are doing is to set up fake web sites using content filched from genuine Web sites. This allows (or is believed to allow) them to climb the google rankings.
Nope. They're using a combination of 302 HTTP response headers and a bug/misfeature in Google's spidering system - they don't have to have any kind of access to the site being hijacked, and they aren't copying anything off the site. They set up a 302 redirect to the hijackee, and Google itself gets confused and attributes the hijackee's content to the hijacker.
This is all explained in the article, although since you apparently haven't understood it either I accept it might be badly-written and/or overly technical.
I don't see why someone would use a 302 response when they can just copy the entire content unless there is some sort of bug in Google's pagerank that is not being explained. Copying the entire content is much simpler.
This way they aren't just hosting the same content as another site (competing for rankings and leaving themselves open to accusations of copyright violation), they're actually knocking the original site out of the Google rankings altogether, in a pretty subtle way (so it might even go unnoticed by the site owner), with very little work (esp. compared to replicating a whole page/site), and without explicitly violating any laws (that I can see).
So what the attacker does is to set up their site so that when the googlebot comes round it publishes some legitimate content, then when other folk follow the site from a google search they get pages infested with spyware or the like.
Not quite - this is common-or-garden (and long-known-about) page cloaking, which is a pain in the arse, but you can live with it. What the article is talking about is entirely different (see above).
Everything in moderation, including moderation itself
I guess what I haven't seen asked yet is:
Why is this not fixed yet?
C'mon Google.
-- Note: If you don't agree with me, don't bother replying. I won't read it.
I sort of agreed, it was really bad about a month or two ago, but has been getting better for most of the "commonly searched" terms. Some fairly obscure searches still turn up a bit of crap, but you can't do it for everyone.
A "Don't show me any results from this subnet + domain from now on" feature would be nice, as would google banning some of the worst offenders (which it seems to have done).
1q2w3e4r5t6y7u8i9o0pqawsedrftgthyjukilo;p'azsxdcf
This explains to me what's going on.
Although it seems backwards to me from what they should do.
What Google needs to do is not index 302s and instead index the final page. Alternatively/additionally, make sure the domain remains the same when accepting a 302 and indexing it.
As it is, it sounds like they're indexing my change of address card and ignoring my current residence.
IMarv
Trusting software vendors is no smarter than trus
I'm surprised nobody has mentioned that Yahoo has already closed the 302 hole.
sigs are a waste of space
Look, there *was* circumstantial evidence for the "Greg Duffy" thing ... i.e. just enough to make it a discussion. I agree that fearmongering is not the way to go. I appreciate that you looked into the issue (and my first instinct is to trust your explanation, that it was a DNS issue).
However, if this is Google's PR method, I think you are kind of asking for it! In the absence of information, the internet community will speculate until the cows come home. I'm not saying it's right, I'm just saying that's reality. Even though I said on my site that I thought Google didn't do anything underhanded I bet a lot of people were still not convinced. Google can do a little better than this, and although you have been fairly nice to me (thanks) this response is a little flamebaity for PR. Please understand that I mean no offense, it's just constructive criticism. Even if everything you say is true, a representative of the company should always at least attempt to sugar coat something like your last paragraph.
Also, on a more personal note, maybe Google should embrace the people that are involved in researching these problems instead of using this broken communications policy. I know that in my case I contacted you guys 5 *months* ago about the Google Print problem I described and never got any followup except for my t-shirt (which I really like). I have some great ideas about possible solutions to the problem I described, and as far as I can see Google has not fixed the root of the problem. When are you guys going to contact me?
-Greg Duffy
My company's web site is imatix.com
You will notice that the site's main page contains very little text. There is one marketroid phrase, "Strategic solutions for a complex world".
Now search Google for this phrase.
Look at the results. A completely irrelevant site has come in at first place. imatix.com is now at second place (this changed today).
imatix.com is an old site, with very high page rank. Now, it does not matter much for us, since no-one is going to search for this phrase, but if this can hit imatix.com, it can hit other sites.
The problem is entirely real, and it is extremely serious. I'd say, if Google don't fix this before it hits the main media, they will suffer irreparable damage to their reputation.
Sig for sale or rent. One previous user. Inquire within.
There is a simple solution for Google: Only honor 302 redirects when the original and target domains match (or the target is a subdomain of the original domain).
In all other cases treat a 302 (temporary) as a 301 (permanent) redirect, thus giving credit for the content to the actual hoster of the content.
This allows webmasters to continue using 302s to set up logical URLs to mask the organization of underlying content but eliminates the ability to hijack completely.
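To make the proposed rule concrete, here is a minimal Python sketch of it. It is not how Googlebot (or any real crawler) works; the function names `same_site` and `url_to_index` are made up for illustration, and the only inputs assumed are the redirecting URL, the redirect target, and the status code.

```python
from urllib.parse import urlsplit


def same_site(source_url: str, target_url: str) -> bool:
    """True if the target host equals the source host or is a subdomain of it."""
    src = (urlsplit(source_url).hostname or "").lower()
    dst = (urlsplit(target_url).hostname or "").lower()
    if not src or not dst:
        return False
    return dst == src or dst.endswith("." + src)


def url_to_index(source_url: str, target_url: str, status: int) -> str:
    """Pick which URL gets credit for the content behind a redirect.

    Hypothetical rule from the comment above: a 302 is honored (the
    redirecting URL keeps the credit) only within one site; every other
    redirect is treated like a 301 and the target keeps the credit.
    """
    if status == 302 and same_site(source_url, target_url):
        return source_url
    return target_url
```

With this rule, `url_to_index("http://bad.site/1", "http://good.site/", 302)` credits `http://good.site/`, so the hijacker's URL never replaces the real one. Note that the check is one-directional, as in the proposal: a 302 from example.com to www.example.com would count as same-site, but not the other way around.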
Natural != (nontoxic || beneficial)
Is there a specific search that someone can suggest that would demonstrate this problem?
Here's what I do: Bitty Browser & Andromeda
everyone knows google is #1
being at the top makes you a target and every little gnat is going to chew at you trying to get a piece.
remember altavista and others..
they ended up so spammed you had to go through pages of results to find anything any good.
I just think it has taken a while, but they are catching up with google now.
anime+manga together at last.. in real time.
If I find both articles confused and confusing, then it is a bit much to expect other people to follow them; I am listed as an original contributor to the design of HTTP.
The real problem here is not the 302; it's a bug in Googlebot, fortunately a relatively easy one to fix. When Googlebot sees a 302 redirect to a page, it treats the actual page and the redirect to the page as if they are one and the same. It should not; instead it should give the 302 linking URL a lower score than the URL linked to. I think this is pretty obvious from the specs. It should be a pretty quick fix.
This is one of the problems I have every week when someone comes along with a 'new' attack that is simply a slight twist on something that has been around for years. I recently got called by a journalist researching IM 'viruses', unfortunately it was only afterwards that I realized that all this 'new' attack was telling us is that once a machine is infected by spyware there is very little that can be done to protect the user.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
Taco posts the same URL.
Lars T.
To the guy who modded me down from perfect to terrible Karma - Apple haters still suck
Try http://s.teoma.com/search?q=See%2C+according+to+the+HTTP+spec%2C+302+is+a+*temporary*+redirect%2C+which+means+Google+is+supposed+to+interpret+whatever+content+it+finds+at+the+302+target+%28your+site%29+as+really+belonging+to+the+URL+of+the+source+%28my+site%29.+Google+is+just+obeying+the+spec+strictly+here%2C+and+with+devestating+results.+Why%3F+BECAUSE+THE+DUPE+FILTER+NOW+KICKS+IN%21+You+see%2C+Google+has+a+%22dupe+filter%22+that+says+if+the+same+exact+content+is+found+for+two+unique+URLs%2C+then+one+of+the+URLs+is+obliterated+in+the+rankings.+Because+after+all%2C+searchers+don%27t+want+to+be+finding+the+same+content+over+and+over.+If+that+happens%2C+they%27ll+start+using+a+different+search+engine.+But+Google%2C+sticking+strictly+to+the+HTTP+spec%2C+doesn%27t+know+who+the+content+really+belongs+to+when+it+gets+a+302.%0D%0A&qcat=1&qsrc=0&Search.x=0&Search.y=0
You, sir, are an obvious troll. The first link points to slashdot.org/imatrix.com, which of course returns a Slashdot 404 error page, and the google.com search link returns the imatrix.com website's link rated number one, with a bunch of placeholder sites below it. So how does this demonstrate any harm to imatrix.com's page ranking?
Come on, mods, at least read the post and check the links before you mod something up as informative.
Apocalypse Cancelled, Sorry, No Ticket Refunds
So what happens if you change your browser to identify as Googlebot?
Problem solved, right? The scammer redirects you via 302, and you see the original content of the page that the scammer wants to index.
When has jurisdiction ever stopped the USA?
We just grabbed a guy out of Australia who'd
never set foot on US soil, unless you count
Australia as US soil.
"Almost Nothing" means "Not Nothing", aka "Ok, yeah, a couple of things". In this case, there's at least one technique that works, and there may be others that nobody's discovered or ranted about yet. But this one's ugly.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
here's my write-up on the problem from early February called Google and the Mysterious Case of the 1969 Pagejackers. the problem has been around for a long, long time.
;)
personally, i'm ready to give up google maps or something else (autolink?) if they would 'fix' this or at least be more transparent about what's going on.
btw, the word on the net is that the googleguy posting here isn't the real one. anybody have details on this?
-kpaul
J-Log: Journalism News, Media Views
I was about to say that the solution was to not replace pages in the index when the redirection happens across domains. But then public hosting sites like angelfire.com came to my mind. So they just have to stop replacing redirects altogether.
The evil attacker would just then load up all the IP address blocks owned by Google into the redirect script and use that instead of the User-Agent.
dtach - A tiny program that emulates the detach feat
If a 302 is intended as a temporary redirect, shouldn't the Googlebot not replace the original index, considering a 302 redirect is supposed to be temporary? Why would anyone ever want what is supposed to be a temporary to be permanently indexed? Google should never have been indexing "temporary" redirects in the first place.
Why all the yammering and discussion on this?
It's pretty simple; 302 redirects allow bad guys to exploit Google.
It doesn't matter that it's the wrong way to use a 302 redirect. They are the BAD GUYS. Remember the "spammers lie" truism?
It's the Google rule that is broken. 302 should be treated as "can't find site" in their search rankings rather than assuming the data sent by the web server is honest. It sucks that some legit users of 302 won't get ranked as well because of it, but boo hoo. Let anybody that has hardware or software problems get better equipment in the first place if their freaking world ends when they don't get ranked in their keyword group. I have NO SYMPATHY for someone that shoestrings their vital revenue stream infrastructure and then wonders why things go bad. It reminds me of my job too much.
Buy Google ADs if you need to make money off your site traffic.
Google will change the rule or they won't. If they want to stay relevant, they'd better. I find myself getting irritated with Google's crappy search results a lot nowadays; sooner or later I will find one of the little startups to use, and they can kiss off if it keeps up. So I figure they will get to it. They are Google, they are good at what they do.
Now what I think they should do is download snippets of pages via the Google toolbar, which then sends the data to Google, to make a massively distributed bot-net spider that is indistinguishable from the web-using masses. At that point, as far as exploiting Google via IP of the bot or user agent of the bot goes, IT IS ALL OVER.
Move along, nothing to see here but a bunch of people that don't understand redirect and HTTP protocols.
I have a solution in mind that may or may not work depending on robot/indexer behavior.
What would happen if I installed a filter on my Web server (in pseudocode):
if (!(http.referrer matches my.domain))
{
    send(301, target=full_url)
}
In English, if the client was referred by an external link, immediately issue a 301 redirect to the full URL of the current page. This should inform the robot that it is now looking at a new site and should start indexing content under a URL that I'm positive that I control.
This solution will collapse if the robot refuses to follow the 301 or does something that I don't expect.
I'd be interested to see a response from someone who understands spiders (since I don't).
"Redirects to a page should be treated as having far less PageRank value than the page itself. That will fix the problem."
Won't work. In a nutshell (expounded upon below) -- Google doesn't know what the "real page" is.
Specifically, I generate "301" on my own sites:
http://blurt.org/page_of_files
will generate a 301 redirect to:
http://blurt.org/page_of_files/
which then gets an index of files. Ok?
Now, sometimes I relocate data (popular big files):
http://blurt.org/p_o_f/bigfile.html
would get a 302 redirect to
http://big_ass_isp.com/myplace/bigfile.html
I don't want to lose my "pagerank" over the content location! And I am doing it to myself here...
Google will look at my site, and see the redirect. It spiders the page, and it is under MY url (where it should be).
But, if Google spiders the OTHER site (say, through an automatically generated links page), it may eliminate the proper URL.
But, REDUCING "pagerank" because I am providing for more bandwidth? (which is what you are suggesting). Why on earth would anyone want that?
Now, what SHOULD happen is that the page should have a tag on it that says "Please index me if you accessed me directly" or "Please index me if you accessed me via REDIRECT from there".
That would do it (I think -- not much of a web master, I'm afraid).
As it is, since the big_ass_isp site is under my control, I place a robots.txt file there to prevent Google from spidering big_ass_isp view of my pages.
But, I can't stop other people from generating a REDIRECT to me (but I *can* additionally tag the page).
I don't think that there is much that Google can do about this. (Well, they could honour a custom tag, and propose it to the W3 folk, which is what I think will happen.)
Ratboy
Just another "Cubible(sic) Joe" 2 17 3061
Being the author of TFA that appeared a few days ago, i'll apologize for any confusion - yet, i'd say that you nailed it. Google has one page (as in "a set of indexed content") and a minimum of two URLs associated with it - at least one of these returns something else than "OK", or "Not Modified", or whatever. Still, Google manages to pick one of the URLs that doesn't return one of these codes as being the appropriate URI for the set of content.
The interesting thing is that once the average searcher sees a result for, say "Site A" and clicks on it in good faith, he is not taken to "Site A" but directly to a script that is already in place on "Site B" and 100% controlled by "Site B".
Last time Googlebot saw this script, it redirected instantly to "Site A" ("302 Found"), but, you know... Scripts are scripts - they do one thing until you make them do another thing. And if you're a bit smart you can even make them conditional, showing Googlebot one thing and everybody else another. This is not even rocket science, it's really trivial programming at best. All you need is an "appropriate" site to forward "Site A" users to - preferably one that makes you instant money.
I'm not so sure about this though (that's why i snipped it from your post):
Before i wrote TFA that appeared a couple of days ago, i had been writing about this problem on search engine related fora for a very long time - literally more than a year, perhaps even two. These fora are frequented by verified search engine representatives, and the problem has also been solved ...By Yahoo! Not by Google.
It's more serious than simply copying your content, because the new site *replaces* yours in the ranking rather than competing with it. When I set up http://bad.site/1 through http://bad.site/100, all claiming to own the content at http://good.site/, Google displays only one of the 101 options in the listing -- and yours isn't too likely to get picked.
Actually, maybe that would happen anyway if I simply copied your site content perfectly 100 times. Not sure about that. Still, that aspect makes it much more of a concern.
Let's see if I'm understanding this right. Correct me if I'm wrong...
I set up a goat.cx mirror. The goat.cx mirror contains a 302 redirect to slashdot.org, making Google think that my content has been temporarily moved to slashdot.org. Therefore Google thinks that what's at slashdot.org is just a temporary version of what's normally at my website. Therefore people who would have been sent to Slashdot get sent to my site instead, i.e. people trying to find Slashdot via Google get goatse'd.
Correct?
I haven't tried this. It's just an idea knocking around in my head.
What would happen if I set up a stateful filter on my web server that did the following?
1. If the http client provided a referrer header and that header contains my own domain name, exit (and let the request be processed normally)
3. Record the user agent header, client IP address, and current timestamp in some sort of temporary lookup table
4. Issue a http 301 with an absolute URL that points to the current page but with some technically insignificant rewrite from the way that the client requested it. For example, if the request is a simple GET, append a "?" or "&"
If the client was not referred by an internal link, this filter would instruct the client to reload the page in a way that ensures that it knows the correct, full URL.
By itself, this would simply cause an infinite loop which a robot would probably detect. That's where the temporary lookup table and slightly modified URL come in. I left step two out of the list above because it does not apply until the second time the agent hits our page:
2. Consult the lookup table. If this agent already hit this page within the last n seconds, exit and allow the request to be processed normally.
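For anyone who wants to poke at the idea, the four steps can be sketched in Python roughly as follows. This is only a decision function, not a real web server filter: the class name `ReferrerBounceFilter`, the `decide` method, the 30-second default window and the `http://` scheme are all assumptions made for illustration, and nothing here says how any actual robot would react to the redirect.

```python
import time


class ReferrerBounceFilter:
    """Decide whether to serve a request or bounce it with a 301 (sketch)."""

    def __init__(self, own_domain, window_seconds=30.0, clock=time.monotonic):
        self.own_domain = own_domain
        self.window = window_seconds
        self.clock = clock
        # Temporary lookup table: (user agent, client IP, path) -> last bounce time.
        # A real filter would also need to purge old entries.
        self.seen = {}

    def decide(self, path, referrer, user_agent, client_ip):
        """Return ("serve", None) or ("redirect", absolute_url)."""
        # Step 1: referred from our own domain, process normally.
        if referrer and self.own_domain in referrer:
            return ("serve", None)

        now = self.clock()
        # Ignore the query string so the rewritten URL matches the first hit.
        key = (user_agent, client_ip, path.split("?", 1)[0])

        # Step 2: this agent was bounced recently, so let it through this time.
        last = self.seen.get(key)
        if last is not None and now - last <= self.window:
            return ("serve", None)

        # Step 3: remember who we are about to bounce, and when.
        self.seen[key] = now

        # Step 4: 301 to the same page with a technically insignificant rewrite.
        separator = "&" if "?" in path else "?"
        return ("redirect", f"http://{self.own_domain}{path}{separator}")
```

The first request for `/page` without an internal referrer gets `("redirect", "http://my.example/page?")`; the follow-up request for `/page?` from the same agent inside the window is served normally, which is what breaks the loop.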
I don't know much about how robots such as googlebot behave. I'd love to see a reply from someone who knows more than I do.
Would it be possible for Google to simply disregard all 302 redirects that refer to a domain different than the document being crawled? In this way, all sites using 302 redirects legitimately (referring to their own content in another location on their domain) would be unaffected while the site hijacking scum would be eliminated.
I find laziness to be an excellent motivator.
Sorry for not writing this in the article - it's pretty long already and you just have to cut somewhere, but here goes:
Yahoo was exactly as vulnerable as the rest of the search engines. In fact this problem was pretty bad with Yahoo at one point. What Yahoo did was simply to fix it by implementing some internal rules about how to interpret redirects.
I believe it was fixed around June 2004 - at that time the problem had already been known (and abused) for a long time, but use was not widespread yet. The details of the fix can be seen on this one-page PDF
It's simple (and identical to the solution i suggest in my article): When "Yahoobot" (actually it's called "Slurp") sees a 302 redirect, it checks if the domains of the redirect and the target are the same. If the redirect is from one domain to another, Yahoo keeps the URI from the target domain. If the redirect is from one page to another on the same domain, Yahoo keeps the "source" (ie. the redirect script URI).
of course, some just have to take your word for it, but i've heard it from other sources too.
;)
;)
in fact, i think i've learned *a lot* today.
re: the k5 article - cool. thanks. i'll try to see if an editor will let me change it later tonight. (and i admit that piece was a bit 'out there' too, although i tried my hardest to be objective for the most part...)
when you go from 7k to 700 visits a day, you start to lash out at things. i think i may have found the real cause of the problem, though. someone was DOSing my site and taking it down. i think Googlebot was trying to hit me before the server could come back up and maybe i got put on an 'unreliable server' list?
in any case, i'm looking into PPC and other options to spread things out some.
thanks again. please don't hold my foolishness against me.
-kpaul (the real one)
J-Log: Journalism News, Media Views
You're an idiot. That's a part of Ask Jeeves, which we all know sucks, unless you're too dumb to realize that you don't need to have a web search in question format.
XeRo
Whilst my original comment was supposed to be slightly tongue in cheek, I shall nevertheless play your pedantic definition game.
Prior to the get out clause afforded to them by the use of the word "almost" they explicitly state that it is fiction to say a competitor can have another site removed from Google's index.
Either it is fiction, or it can be done. If it is merely hard to do, it comes under the heading of fact.
Absolutely Roflmao!!
:)
:)
:)
I guess some people have never heard of the term "sole trader".
My internet business is barely a year old - almost everything is communicated with other webmasters via e-mail - phone support is provided as a last option, but it means that if anyone really needs to use it, then they can have my immediate attention wherever I am, to have their concerns addressed immediately.
As for spamming - well, this is one of those "anonymous cowards" some of us are familiar with, who believes that if you purchase a link from another site, or become involved in a link exchange, or register your site in a directory - then you're a spammer.
Thanks for the heads up on the Platinax registration details, though - hadn't realised they'd been left out. I had a run in with some Belgian Nazis last year, after I booted them from a forum I admin, when they tried to use it for promoting Neo-nazi propaganda. They've tried a few times to get back at me since, so I've been trying to reclaim some privacy online. Platinax reg details should be public, though - I'll put something online, then try and find a PO Box for the hate crap.
all emailed google about this problem, like, right now? You think possibly that the wrath of a million slashdotters would make them listen?
Already sent them an emaily.
Try not to let life get in the way of living.
....why would you want to be an investment manager for others in the first place?
it takes more than a little slashdotting... try again: Pagejack article
I admit I haven't been paying attention to this. What exactly is this 302 exploit? Is it just a matter of attackers spoofing referrer entries in their GET requests so the attacker's web site gets listed on the target's blogroll? Or is there more to it than that?
I'm proud of my Northern Tibetian Heritage
An example was posted in the beginning of the thread: site:drudgereport.com
A quick count showed 12% of the top 100 not being the real domain (i may have missed one or two). Actually this is quite common for the major news sites (please disregard opinion on drudgereport, this is about his URLs not his journalism)
And, clsc.net is still not down :p
As also written in TFA: The search engine spiders don't send a referrer, so your method won't work. No, they can't "just send a referrer", because they could have found a link to your site on a lot of pages, so which one should they choose? Also, some popular firewalls don't send the referrer either.
(see title of post)
LOL All these posts to explain the same thing over and over. The sad part is..there are others that still don't get it.
i didn't see it in the first few hundred of a site:domain.com
what's the other explanation for another URL having my title, description and cache?
when i run a header check on it, it shows a 302 redirect then 200.
in this case, i think the other person doesn't know using 302 isn't correct (ie. it's a link collection script...)
is there another explanation?
-kpaul
J-Log: Journalism News, Media Views
A woman of age 50 can be a grandma easily, (ie child at 23, gives birth at 20). That grandma grew up in the 60s/70s, and most likely went to clubs and hanged out with the hippies etc... so 21st grandmas are all hip and cool not like the yesteryear pre 40s teeners.
Liberty freedom are no1, not dicks in suits.
"is supposed to interpret whatever content it finds at the 302 target (your site) as really belonging to the URL of the source (my site)."
Claiming ownership of someone else's copyrighted works, I would think, is actionable.
good luck in seeking a legal remedy though.
'allinurl:' shows URLs that contain a specific keyword, which can lead to false positives. 'site:' is supposed to show only the pages that Google knows about within a certain domain. If you search for 'site:yoursite.com' and get results from sites other than yoursite.com, then you know you've been affected. Especially if those other domains have taken the #1 result.
Here is one example.
Here is another example.
Thanks to the War on Drugs, it's easier to buy meth than it is to buy cold medicine!
kpaul - i'm not sure if you re-read these threads (it's a good thing to do, as /. threads aren't linear - new messages pop up everywhere, including the parts you have already read once). If you don't, i'll locate you elsewhere.
If you do, and for others:
The "site:example.com" search is a good tool. However, it's not always practical if you have a lot of pages, as it's not always that you will be able to spot hijacks among the first 1000 pages.
So, try searching for specific document titles instead, putting the document title in quotes. This way you will easily see if there's a result that has your headline, your snippet, your cache, and a URL that is not from your domain.
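If you have already collected the result URLs from such a search by hand, a few lines of Python can do the eyeballing for you. This is just a sketch: `foreign_results` is a made-up helper, the list of URLs is whatever you pasted in from the results page, and it does not query any search engine itself.

```python
from urllib.parse import urlsplit


def foreign_results(result_urls, my_domain):
    """Return the result URLs whose host is neither my_domain nor a subdomain of it."""
    my_domain = my_domain.lower()
    flagged = []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host != my_domain and not host.endswith("." + my_domain):
            flagged.append(url)
    return flagged
```

Anything it returns is only a candidate: as noted in this thread, a foreign URL carrying your title and snippet can also be a plain copy, a mirror, or a proxy rather than a redirect hijack.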
>> what's the other explanation
In all this talk about 302's we sometimes forget that a META REFRESH with a timeout of zero can do the exact same thing as a 302 redirect.
also, it could simply be one of these:
- a copy of your page(s) on another domain
- a mirror of your page(s) on another domain
- another domain proxying your domain
Regarding these three cases, the "wrong URLs" should not be seen as an error, imho.
When i use these words, "mirror" usually refers to "close to verbatim copy" (more than 95% verbatim copy) -- ie. almost no difference in the content from your page -- while "copy" could easily just be fragments of your page, perhaps even mixed with fragments of other pages. A "proxy" will be 100% verbatim copies; in fact it will be your exact site, only shown on another URL.
For those that follow these things, 1bu.com is not a proxy, it's a mirror (as it strips out flash and stuff).
I am seeing all over the net a discussion on 302 Hijackings and that Google is evil. But the thing is no one is discussing the actual cause of it. The actual cause is the HTTP Protocol that says EXPLICITLY
"10.3.3 302 Found
The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field." - Emphasis not mine.
You can read it for yourself at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
Now we all know the importance of protocol. It's a language for communicating. In this case the protocol was basically developed when the WEB was pure and unadulterated, when people expected others to follow the protocol and not misuse it.
But with money always comes greed and dishonesty. WEB originally was not built with Business in mind. It was for free Information Interchange. But it has just evolved to a state where Commercially the WEB can be harnessed (exploited whatever) for its potential.
So now any search engine that follows the protocol to the letter is in effect aiding the hijacking, but is it the mistake of the search engine or the protocol? Unlike human languages, protocols don't evolve uninhibited. If they did, then very soon no browser could understand all the servers and vice versa, i.e. you might need 10 kinds of browsers to access 10 different websites, because those 10 websites talk different languages.
(Now come to think of it, is this not what is happening in the DRM world. You download music from one site and you can't play it on another without a hack). That is the reason there is a standard and it gets revised every so often so that it can also keep up with the times.
So some of the suggestions, like throwing the redirecting page into the bin and keeping the target page, will really have web-wide repercussions for people who use 302 with the standard in mind and with a legitimate purpose. So you ask who uses it and for what purpose?
Let me give an example.
Ever tried buying from Amazon.com?
Okay how do you reach the homepage?
Well i type in amazon.com into my browser and i get the page. BUT the url at which i get the page is exactly now http://www.amazon.com/exec/obidos/subst/home/home.html/103-7996157-2162261
Use this server header tool for understanding what happens http://www.webrankinfo.com/english/tools/server-header.php
1) Enter www.amazon.com
It says
HTTP/1.1 301 Moved Permanently
Date: Thu, 24 Mar 2005 14:38:22 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix) amarewrite/0.1 mod_fastcgi/2.2.12
Set-Cookie: skin=; domain=.amazon.com; path=/; expires=Wed, 01-Aug-01 12:00:00 GMT
Location: http://www.amazon.com:80/exec/obidos/subst/home/home.html
Connection: close
Content-Type: text/plain
So amazon.com doesn't exist (don't mistake me, I mean the page amazon.com); what exists is http://www.amazon.com:80/exec/obidos/subst/home/home.html.
2) Now enter http://www.amazon.com:80/exec/obidos/subst/home/home.html in the box.
It says
HTTP/1.1 302
Date: Thu, 24 Mar 2005 14:40:48 GMT
Server: Stronghold/2.4.2 Apache/1.3.6 C2NetEU/2412 (Unix) amarewrite/0.1 mod_fastcgi/2.2.12
Set-Cookie: session-id-time=1112256000; path=/; domain=.amazon.com; expires=Thursday, 31-Mar-2005 08:00:00 GMT
Set-Cookie: session-id=002-8272699-5270422; path=/; domain=.amazon.com; expires=Thursday, 31-Mar-2005 08:00:00 GMT
Location: http://www.amazon.com/exec/obidos/subst/home/home.html/002-8272699-5270422
Connection: close
Content-Type: text/html
So now the home page is temporarily at http://www.amazon.com/exec/obidos/subst/home/home.html/002-8272699-5270422
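If you save header dumps like the two above, a small Python helper can pull out the part that matters for this discussion: the status code and where the redirect points. This is a rough sketch, not a full HTTP parser; `parse_head` and `redirect_kind` are hypothetical names, and the classification covers only the 301 and 302 codes discussed here.

```python
def parse_head(raw):
    """Return (status_code, location_or_None) from a raw response head."""
    lines = raw.strip().splitlines()
    # Status line looks like "HTTP/1.1 302" or "HTTP/1.1 301 Moved Permanently".
    status = int(lines[0].split(None, 2)[1])
    headers = {}
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            # Keep the first occurrence of each header name.
            headers.setdefault(name.strip().lower(), value.strip())
    return status, headers.get("location")


def redirect_kind(status):
    """Label the two redirect codes this thread is about."""
    if status == 301:
        return "permanent"
    if status == 302:
        return "temporary"
    return "not a 301/302 redirect"
```

Feeding it the first dump gives status 301 with the `Location` value from that dump, and the second gives 302, i.e. a permanent hop followed by a temporary one.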
If Google were
Let's say you could beat the index pretty soundly, achieving a reliable 15% annual return. Let's further say you have considerably more available capital than most, say 1 million dollars. So you can manage your money and make 150,000 in a year, and you probably should.
Why would you want to be an investment manager for others? Because if you can reliably achieve a 15% return, "others" will pay you several million a year, at least.