Millions of Pages Google Hijacked using ODP Feed
The Real Nick W writes "Threadwatch reports that millions of pages are being Google Hijacked using the 302 redirect exploit and the ODP's RDF dump. The problem has been around for a couple of years and is just recently starting to make major headlines. By using the Open Directory's data dump of around 4 million sites, and 302'ing each of those sites, the havoc being wreaked on the Google database could have catastrophic effects for both Google and the websites involved."
I am thoroughly confused by this article. Is the hijacking about Google crawling your site even when your robots.txt tells Googlebot to stay out?
For every Good Thing, there are at least 100 different ways to abuse it.
FLR
"Oh! Look! Something beautiful! Something impressive! I must destroy it!"
pah. feeling jaded today, i guess.
"hey, could you pass me a paper towel? er.. I mean... DEPLOY ABSORPTION PANEL!"
buy GOOG on the dip as many non-techie investors panic sell. 8)
Diplomacy is the art of saying, "Nice doggie!" until you can find a rock.
As web presence (defined as appearing within about the first 10-20 results of a search) becomes more and more important to "success," black hat techniques such as this, used to eliminate competitors, will become more and more common. Google, or any other search tool, needs to be able to stay above the fray and not be subject to hacks such as this.
Wow, getting modded up just for leaving a message on our answering machine! I guess it's true, just like with Wil Wheaton, if you claim to be (or are) someone of alleged importance, you too can get +5 Informative on every post, no matter what you say (or don't)!
Prosecute for what? Is there a law against redirecting web pages? I think this would be a pretty difficult prosecution. Google's going to have to take technical steps on this one.
It will also break many "click trackers", "portals", "directory sites", "search engine optimizers", and other annoyances, which is probably a plus for Google users. You know, those sites where you click on some phrase in Google and, three redirects later, you're at some irrelevant porno site.
Why not just fix the bug and then recreate the rankings index? Googlebot hits my sites all the time, so I know that it covers the rest of the internet quite often as well. With their amount of hardware, it probably wouldn't take long.
A musician without the RIAA, is like a fish without a bicycle.
AllinURL returns results where the results are in the URL. So they *should* be returned.
I'm not convinced by this whole 302 nonsense. I haven't seen a single search where a 302 scraper site is ranking above the site it 302s for the scammed text.
To me it sounds like people's sites drop for whatever reason, then they look for a reason and they grasp at this 302 story.
I do an allinurl on my various sites (8 of them) and 6 have scrapers attached, only 1 has disappeared recently and that seems to have been caused by a change of IP address or maybe the loss of the yahoo directory link or perhaps because I have lots of pages with 20-30% similar content.
But if I only had 1 site I could easily blame a 302 problem.
The article is confused and badly written. It never explains the exploit being used. So stop dumping on people. It is not at all surprising that people don't get what is going on when the description is crud.
What is really going on has nothing to do with 302, or at least very little. What these people are doing is setting up fake web sites using content filched from genuine web sites. This allows (or is believed to allow) them to climb the Google rankings.
I don't see why someone would use a 302 response when they can just copy the entire content unless there is some sort of bug in Google's pagerank that is not being explained. Copying the entire content is much simpler.
So what the attacker does is to set up their site so that when the googlebot comes round it publishes some legitimate content, then when other folk follow the site from a google search they get pages infested with spyware or the like.
This would certainly explain the number of times I have done a Google search and ended up at an idiotic 'search site' that does nothing for me.
Looking for an Information Security student project suggestion?
Try http://dotcrimeManifesto.com/
It seems that when page A redirects to B, Google not only considers that a hit for A, but also assigns B's content to A (I just skimmed through all the posts here so maybe that's not what happens).
In that case, it seems to make more sense to just ignore A altogether since the hit and content rightfully belong to B.
This could be done by treating redirects as empty one-link pages, thus unifying the handlers and defeating this practice.
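A rough sketch of what "treat redirects as empty one-link pages" might look like inside an indexer. This is only an illustration of the idea above, not how Google does it; the Page record and the page_from_response helper are names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical index record: a URL, its own text, and its outgoing links."""
    url: str
    text: str = ""
    links: list = field(default_factory=list)


def page_from_response(url, status, body="", location=None, links=()):
    """Build an index record from a crawl response.

    A redirect is stored as an empty page whose only link is the target,
    so the redirecting URL never gets credited with the target's content.
    """
    if status in (301, 302, 303, 307) and location:
        return Page(url=url, text="", links=[location])
    return Page(url=url, text=body, links=list(links))
```

With this, page A (the redirector) ends up with no text to rank on, and B keeps its own content and simply gains one inbound link.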
I think that is the RIAA wet dream -- to have every web page point to it. Don't they believe the only way to save music is to kill the web?
Why don't you just pick the new URL as the canonical one? This way, any hijacking attempts would have no effect. And if I really want to do a permanent redirect, I don't want the old URL to stay in Google's database, anyway. I guess transferring the PageRank would be tricky (would make it possible to hurt a page by redirecting from a very low-rated one), but this still seems to be a lot less open to abuse.
EagerEyes.org: Visualization and Visual Communication
Hey, if you've run across spammy sites, have you filled out a spam report and used the keyword slashdot? I mentioned in an earlier comment from a different story that you can do this. We got eight reports last time, and the responses are on their way. We do check that data to look for new tricks that spammers are trying.
Well shucks GG, not every webmaster is glued to WMW and other forums.. and even if they did the signal/noise ratio on this topic is so low that you probably couldn't find the information even if you were looking. It's hardly an obvious reporting mechanism. Although posting it on /. should help some, so that's appreciated. Thanks.
But look - what we have here are a whole bunch of webmasters who have been nuked off the face of the earth by 302 redirects and just don't have the technical knowledge to try and fix it. Mom and Pop stores, hobbyists, nonprofits etc etc. These people are just gonna get pasted.. they'll just be wondering why they don't get any visitors any more.
This is a HUGELY serious problem - and it's getting worse all the time as more and more people deliberately try to exploit the 302 bug. I've been hit by this bug myself, and let me tell you that unless you know EXACTLY what to look for you'd be stuffed - all you'd see is your traffic flatlining.
The key issue here - and it's the kind of issue that will really, really hit the headlines when it's exploited - is redirection. Sure, I can use a 302 and send Googlebot to the correct page.. so first of all I, not the publisher, basically 0wn the content of that page. *Then* I insert an exploit into the 302 redirect.. and hey presto, I've 0wned hundreds of thousands if not millions of computers. *That's* going to make unpleasant reading for Google when it hits the headlines - "Use Google and Get Owned". Nasty.
Never email donotemail@WeAreSpammers.com
This has very real potential to be taken advantage of for phishing scams.
Imagine someone searching for their bank's website on Google (because some think that [searching] is how the web works!) and clicking the wrong link. That link takes them to a site that looks just like their bank's website, and maybe there is a security alert on the front page asking them to verify their information. After doing so, they could be redirected to their real bank's site, never having realized their error.
Experience has shown me that most non-techies know they type an address into their browser, but after that, they pay no attention to it which makes this a real possibility.
This is hilarious! Someone please mod up! Hope I get the above mods in M2.
See my journal for slashdot ID's by year. Mine created in 2005. http://slashdot.org/journal/289875/slashdot-ids-by-year
and that prosecutor has to get pretty imaginative to get jurisdiction over the people in some countries.
prosecution can't fix this problem.
world was created 5 seconds before this post as it is.
Thanks. And remember, identity theft is not a joke, unless you steal the identity of a clown.
Google/GoogleGuy isn't being evil, just seemingly suffering from ignorance and/or apathy.
That said, I'm reminded of a quote I heard once: "The only thing necessary for the triumph of evil is for good men to do nothing". Please stop doing nothing, Google.
Obviously you did not google for it.
This is an idiotic response. Why on earth do people mod stuff like this up? Who in the hell is going to google for "canonicalpage"??? That is the solution you moron. Let me see you search for and find the solution without entering the term for the solution itself.
You are a moron, and whoever modded you up is even stupider than you.
I sort of agreed, it was really bad about a month or two ago, but has been getting better for most of the "commonly searched" terms. Some fairly obscure searches still turn up a bit of crap, but you can't do it for everyone.
A "Don't show me any results from this subnet + domain from now on" feature would be nice, as would google banning some of the worst offenders (which it seems to have done).
1q2w3e4r5t6y7u8i9o0pqawsedrftgthyjukilo;p'azsxdcf
It's EXTREMELY informative, because it tells you what Google's official position is. Whether you like it or not, you need to know that. "Informative" doesn't mean "good".
If Bill Gates posted here in defence of some MS policy, it would hopefully similarly be modded "informative".
Look, there *was* circumstantial evidence for the "Greg Duffy" thing ... i.e. just enough to make it a discussion. I agree that fearmongering is not the way to go. I appreciate that you looked into the issue (and my first instinct is to trust your explanation, that it was a DNS issue).
However, if this is Google's PR method, I think you are kind of asking for it! In the absence of information, the internet community will speculate until the cows come home. I'm not saying it's right, I'm just saying that's reality. Even though I said on my site that I thought Google didn't do anything underhanded I bet a lot of people were still not convinced. Google can do a little better than this, and although you have been fairly nice to me (thanks) this response is a little flamebaity for PR. Please understand that I mean no offense, it's just constructive criticism. Even if everything you say is true, a representative of the company should always at least attempt to sugar coat something like your last paragraph.
Also, on a more personal note, maybe Google should embrace the people that are involved in researching these problems instead of using this broken communications policy. I know that in my case I contacted you guys 5 *months* ago about the Google Print problem I described and never got any followup except for my t-shirt (which I really like). I have some great ideas about possible solutions to the problem I described, and as far as I can see Google has not fixed the root of the problem. When are you guys going to contact me?
-Greg Duffy
There is a simple solution for Google: Only honor 302 redirects when the original and target domains match (or the target is a subdomain of the original domain).
In all other cases treat a 302 (temporary) as a 301 (permanent) redirect, thus giving credit for the content to the actual hoster of the content.
This allows webmasters to continue using 302s to set up logical URLs that mask the organization of the underlying content, but eliminates the ability to hijack completely.
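For what it's worth, the rule is small enough to sketch in a few lines. This is only an illustration under my own assumptions: same_site and canonical_url are made-up names, and the "subdomain" test is a plain hostname suffix check, not anything Google is known to use.

```python
from urllib.parse import urlsplit


def same_site(source_host, target_host):
    """True if the target is the source host itself or a subdomain of it."""
    source_host = source_host.lower().rstrip(".")
    target_host = target_host.lower().rstrip(".")
    return target_host == source_host or target_host.endswith("." + source_host)


def canonical_url(source_url, status, target_url):
    """Decide which URL gets credit for the content behind a redirect.

    A 302 is honored as "temporary" (credit stays with the source URL)
    only when it stays on the same site. Every other redirect is treated
    like a 301, so credit goes to the URL that actually hosts the content.
    """
    source_host = urlsplit(source_url).hostname or ""
    target_host = urlsplit(target_url).hostname or ""
    if status == 302 and same_site(source_host, target_host):
        return source_url
    return target_url
```

Note that under the rule as literally stated, a 302 from www.example.com to the bare example.com counts as cross-domain and is treated like a 301, which still credits the site's own URL.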
Natural != (nontoxic || beneficial)
The problem you are describing here is not a 302 hijacking. Those sites don't do any redirecting, and they aren't duplicating your site's pages and bumping you out of the loop. They just happen to have a link to your site and your "motto" on their page. The fact that their page comes up before yours does seem stupid, but it is unrelated to the 302 hijacking issue.
Ironically, the word ironically is often used incorrectly.
The same thing can be done with a CNAME record. Give your domain a CNAME to www.google.com. It will eventually obtain a PageRank of 10. But this PageRank is useless for obtaining better search positions, as it will go away once the real PageRank is calculated.
This may not be a problem because the PageRank shown in the toolbar is generally not the real PageRank Google uses to determine its positions.
These techniques may be nothing more than a placebo, and there's probably a few Google employees who get a good laugh out of webmasters using such techniques.
Why all the yammering and discussion on this?
It's pretty simple; 302 redirects allow bad guys to exploit Google.
It doesn't matter that it's the wrong way to use a 302 redirect. They are the BAD GUYS. Remember the "spammers lie" truism?
It's the Google rule that is broken. 302 should be treated as "can't find site" in their search rankings rather than assuming the data sent by the web server is honest. It sucks that some legit users of 302 won't get ranked as well because of it, but boo hoo. Let anybody that has hardware or software problems get better equipment in the first place if their freaking world ends when they don't get ranked in their keyword group. I have NO SYMPATHY for someone that shoestrings their vital revenue stream infrastructure and then wonders why things go bad. It reminds me of my job too much.
Buy Google ADs if you need to make money off your site traffic.
Google will change the rule or they won't. If they want to stay relevant, they'd better. I find myself getting irritated with Google's crappy search results a lot nowadays; sooner or later I will find one of the little startups to use and they can kiss off if it keeps up. So I figure they will get to it. They are Google, they are good at what they do.
Now what I think they should do is download snippets of pages via the Google toolbar, which then sends the data to Google, to make a massively distributed bot-net spider that is indistinguishable from the web-using masses. At that point, as far as exploiting Google via IP of the bot or user agent of the bot IT IS ALL OVER.
Move along, nothing to see here but a bunch of people that don't understand redirect and HTTP protocols.
As an alternative, I'd love a cookie-based version of this where you could click "ignore all results from this domain". After a couple of weeks you'd get rid of most of them on your personal browser. Make the lists sharable even. All the pagerank wannabes can do is start from scratch with new URLs.
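The filtering side of that is trivial. A minimal sketch, assuming the ignore list is just a set of domain names kept on the client (filter_results is a made-up name, and nothing here talks to a real search engine):

```python
from urllib.parse import urlsplit


def filter_results(result_urls, ignored_domains):
    """Drop every result whose host is an ignored domain or a subdomain of one."""
    ignored = {d.lower() for d in ignored_domains}
    kept = []
    for url in result_urls:
        host = (urlsplit(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in ignored):
            continue  # the user asked never to see this domain again
        kept.append(url)
    return kept
```

Sharing lists would then just be a matter of merging two sets of domain names before filtering.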
More precisely, googlebot always sends the same referrer. Here's a snippet from an apache access log.
In practice, a static referrer and no referrer amount to the same thing so you're right from a practical standpoint. The referrer is not useful.
But that's OK because the system I described does not depend on the referrer header. If a referrer header is available, it will use it as a shortcut to determine whether the client was referred by an internal link and potentially bypass the whole redirect process. This saves system and network use for the majority of cases when the client is an ordinary web browser, but it's not essential and clearly won't be useful when the client is googlebot (or some other robot that does not provide a referrer).
If the client is a googlebot, the filter will see that there's no referrer. It will then check its stateful cache to determine if it has seen this robot recently. If so, it will let the robot right through and the request will be processed normally. If not, it will issue the slightly obfuscated 301 redirect. When the robot follows this redirect, the filter will be invoked again. This time, it will recognize the robot from its previous visit and will let it through.
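To make the flow concrete, here is a minimal sketch of that filter logic. The details are assumptions for illustration only: CANONICAL_HOST, SEEN_TTL, the RedirectFilter class and the client_id key (say, IP plus user agent) are all made up, a useless or external referrer is handled the same as no referrer, and the "slightly obfuscated" part of the redirect is left out since it isn't spelled out here.

```python
import time
from urllib.parse import urlsplit

CANONICAL_HOST = "www.example.com"  # hypothetical: the host we want indexed
SEEN_TTL = 3600                     # assumption: "recently" means within an hour


class RedirectFilter:
    """Decide per request whether to pass it through or answer with a 301."""

    def __init__(self, now=time.time):
        self._now = now   # injectable clock, so the logic can be tested
        self._seen = {}   # stateful cache: client_id -> time of last visit

    def handle(self, client_id, path, referrer=None):
        """Return ("pass", None) or ("redirect", url_for_the_301)."""
        # Shortcut: a referrer from our own host means an internal link.
        if referrer and urlsplit(referrer).hostname == CANONICAL_HOST:
            return ("pass", None)
        now = self._now()
        last = self._seen.get(client_id)
        self._seen[client_id] = now
        # A client we redirected recently is let straight through.
        if last is not None and now - last < SEEN_TTL:
            return ("pass", None)
        # First visit (or cache entry expired): issue the 301.
        return ("redirect", "http://" + CANONICAL_HOST + path)
```

A robot's first request gets the redirect, and the request it makes after following it passes, which matches the two-step behaviour described above.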
If, for example, I use redirects to distribute traffic between multiple servers on multiple hosts, the GoogleBot's behaviour of treating the redirecting host as the website's canonical host is correct. I want users to use the referring host so that I can change physical hosts with impunity.
well, a bunch of people have suggested that 302s should only be honored by crawlers if the domain is the same. i think that's a pretty good idea.
It's not Google that's broken--it's the web. It's just that the two-legged weasels are only now starting to pry open the cracks.
why do you say that? how is the web broken because of the way google crawls it? the http standard was designed before googlebots were crawling it. long long before. the googlebot needs to be more intelligent is all.
in this age of communication i'm just not getting through
claus, I'm glad that you mentioned this search. I looked through those 100 results. Every example that I saw in those results was from a while ago--they were all listed with the Supplemental Result tag. So this is already handled correctly in our main index, and as urls are updated in the supplemental index, those examples should be handled correctly as well.
Thanks for mentioning this search; it's a good point. We've already made some changes to improve our heuristics, and you can see that improvement in the fact that current urls look better than the supplemental urls.
In a sense, of course, there's little google can do to prevent this, because even if they weighted 302-redirects lower in their "throw out duplicates" stage, I could always just go snag a copy of your website each time googlebot visits, in essence doing the redirection myself.
However, doing it through 302 redirects means that google pays for the bandwidth to go get your page, not me.
Ah, but doing it through a 302 also means that the target site can't notice you making regular hits to it and block your IP address.
There's also perhaps a legal distinction. Actively copying someone else's site without permission is pretty clearly copyright infringement. Just 302ing to it most likely isn't.