Google Fires Back About Search Engine Spam
coondoggie writes "The folks at Google are taking issue over spam and the quality of Google searches, which some claim has gone down in recent months. Today on Google's official blog, Principal Engineer Matt Cutts said, 'January brought a spate of stories about Google’s search quality. Reading through some of these recent articles, you might ask whether our search quality has gotten worse. The short answer is that according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been in terms of relevance, freshness and comprehensiveness. Today, English-language spam in Google’s results is less than half what it was five years ago, and spam in most other languages is even lower than in English.' Cutts also explained that the company has made a few significant changes to their method of indexing."
My anecdotal evidence trumps your empirical evidence any day!
When you're afraid to download music illegally in your own home, then the terrorists have won!
According to our own tests we are 100% awesome. We have tested you and you are not :( --Elgoog
A typical problem companies have is measuring the quality of their products: By their metric, it's great! But per the user experience it's not. The users must be wrong.
The metric doesn't always capture the things that the users care about. Also, expectations can change. Better than five years ago may not be good enough
Based on my experience, Google's search quality is insufficient to make it useful for most purposes. It's plan B now. No search engine is much better, but plan A is to use better resources: Wikipedia, knowledge written or compiled by an expert, etc.
"spam in most other languages is even lower than in English."
This is definitely not true for Spanish. There has always been a higher level of spam results for Spanish.
Bottom line is that their 'metrics' are faulty. Who gives a damn about freshness when the content is irrelevant? Bottom line is that in recent memory it's actually become more difficult to find good results using Google.
PS. No one cares about forum postings that barely scratch the surface of a subject, contain incomprehensible grammar, or just contain questions about your topic rather than relevant information. But if google doesn't even want to recognize that it is doing things that customers don't like they will eventually go the way of the dodo bird as well.
the evaluation metrics we've refined over the past decade
In other words, as long as they keep changing the evaluation criteria, they always pass them!
I've seen more parked domains in Google results than I have actual content recently.
But it is becoming increasingly difficult to find the information I really want/need in my searches. Maybe it is time to change your metrics.
"To those who are overly cautious, everything is impossible. "
"according to the evaluation metrics that we’ve refined over more than a decade, Google’s search quality is better than it has ever been in terms of relevance, freshness and comprehensiveness."
And thus begins the downfall of Google. Once you start drinking your own lemonade and stop listening to the people who use your product, you're on a greased downhill slope.
And the worms ate into his brain.
Perception is reality.
Anyway, I think the argument is: The spammers are gaming your Metrics. It's not that there's 50% less spam in your search results, it's that you're detecting 50% less spam in the first place.
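To make that argument concrete, here is a minimal sketch of how a measured spam rate can halve while the real spam rate stays put. The numbers and the function name are hypothetical, not anything Google has published, and it assumes the detector produces no false positives.

```python
def measured_spam_rate(true_spam_rate: float, detector_recall: float) -> float:
    """Share of results the metric flags as spam.

    Assumption: the detector never flags a legitimate page, so the measured
    rate is simply the true rate scaled by the fraction of spam it catches.
    """
    return true_spam_rate * detector_recall


# Hypothetical numbers: the true spam rate is the same in both years,
# but spammers have learned to evade half of the detector's checks.
five_years_ago = measured_spam_rate(true_spam_rate=0.10, detector_recall=0.80)
today = measured_spam_rate(true_spam_rate=0.10, detector_recall=0.40)

print(f"measured then: {five_years_ago:.2%}, measured now: {today:.2%}")
```

In this toy case the metric reports half as much spam even though users see exactly as much as before, which is the gap between "detected" and "present" that the comment is pointing at.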
See, this is where Google goes off the rails and starts to believe its own press. Cutts said, in effect, "Our search engine tells us that our search engine is doing just fine." Yeah, well, ultimately Google's search engine isn't the center of the universe and the ultimate authority on everything. The users are. If the users say that the quality of search results is going down, then it's going down. Period. Google had better figure out how to change their evaluation metrics to reflect what users are seeing rather than attempt to change users' opinions to match what their evaluation metrics say.
Every time I search for something these days I get some ridiculous set of non-results due to the fuzzy matching. I search for "TIPC layer3" and Google nicely finds me results about TCP Layer3, because Google thinks I must have typo'd something. This happens constantly with searches that are one or two letters off from a more popular term: the results I get are adjusted because the alternative ranks higher.
Google's search is not getting better, it's getting more and more 'Clippy' every year.
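For anyone wondering why "TIPC" gets treated as a typo for "TCP", here is a rough sketch of the usual edit-distance idea. This is plain Levenshtein distance with a made-up threshold; it is an assumption for illustration, not a description of what Google actually does.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


def is_near_miss(term: str, popular_term: str, max_distance: int = 2) -> bool:
    """Hypothetical rule: a different term within max_distance edits of a
    more popular term is a candidate for silent 'correction'."""
    a, b = term.lower(), popular_term.lower()
    return a != b and edit_distance(a, b) <= max_distance


print(is_near_miss("TIPC", "TCP"))
```

A rule like this can't tell a real typo from a legitimate but rare term, which is why an exact-match override matters so much for technical searches.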
Google has a dilemma. If their search engine takes you directly to the place you want to go, they don't make any money. For a good analysis of this, see "Google Sucks All the Way to the Bank", by Jill Whalen. She is, unfortunately, right. It's essential for Google's success that some of their own ads be more relevant than their search results. Part of their revenue comes from sending users on a side-trip to AdWords-heavy pages. We've measured this, using a browser plug-in which reports AdWords appearances to us. About 36% of domains with AdWords (counting domain names, not traffic) are what we consider "bottom feeders", junk sites with a commercial purpose but no identifiable business behind them.
On the local search front, spam in Google Places is even worse than in their main search results. This, though, appears to be due to ineptitude, not malice. Google added a business search system to Google Maps a year or two ago; that's what Google Places really is. You've been able to go to a Google Maps page and search for businesses for some time now. Few people knew this.
Then, in October 2010, Google merged the map search results into their main search results. "Places" results suddenly got top billing in Google. The "search engine optimization" (SEO) industry swung into action, and began spamming Google Places on a massive scale. (We have a paper on this, which has been mentioned by Techdirt, the New York Observer, etc. It's an amusing read.) Recommendation spamming, which had been going on for a while at a low level, grew substantially once recommendations started affecting Google search results.
This, incidentally, is why Blekko won't work. If they get enough market share to matter, techniques will be developed to spam them into meaninglessness.
Stopping web spam is technically quite possible. We do it by finding the business behind the web site, and doing some automated due diligence. We check business records, SEC filings, BBB ratings, and Dun and Bradstreet to verify business legitimacy. We down-rate most of the junk. We try to err in the down-rating direction, taking the position that it's the job of a company to demonstrate their legitimacy by using their real name and address on their web site, which has to match real-world business records. Our demo site for this shows what search is like if you take a hard line on spam.
Our approach requires more of a hard-ass attitude than Google's business model can perhaps afford. With Blekko making Google look foolish, though, and Bing slowly improving, Google may have to actually do something that works, even if it cuts into revenue from the spam.
I've switched to other search engines; from my experience, Google provides too many tangential and corporate references when I do research.
Also, how does Google "know" that their search results were valid? I'll often do a Google search, click a couple of links, and after being disappointed, I'll go to another search engine where I get more useful results.
What bugs me the most are searches on technical or medical topics, where Google gives me a dozen "harvester" results -- e.g., I get sites that have stolen conversations from other message boards and republished them along with tons of ads. Yuck! There must be dozens or hundreds of sites, all with broken answers to questions about JavaScript and/or medicines.
Just because evidence is anecdotal doesn't mean it should be blithely discounted. If I say "Ouch" at being cut, that means the injury hurt me; the pain is quite real even if no one else has felt it.
All about me
I think that the solid consensus among the people I know that track such things is that the spammers are winning and the quality of search is going down. I know that this is my own experience. That may or may not mean that Google is slacking off, but I don't think that perception comes from thin air.
"Our tests say we're better than what our customers are saying!"
In this context, spam means web sites that don't actually contain any real content, just junk text, lists of keywords, etc., together with paid links or banner ads and the like. They won't answer any question you may have, unless you are asking to see more spam. There is more and more of this crap, and it dominates some web search queries.
I'm seeing less spam than a few years ago when link farms and Wikipedia clones were showing up everywhere on the top results pages. This smells like Microsoft funded FUD.
I am becoming gerund, destroyer of verbs.
I've certainly noticed the quality of searches going down recently, at least for less common searches. I regularly search for oddball system files, software, drivers, etc. The first few pages of results are often very scammy-looking sites devoid of actual content, and what I am looking for is a dozen pages in. Often these results trump even official big-company web sites. Heck, while half asleep I used Google to search for OpenOffice, clicked the first link, clicked a big download button, and when trying to install it later I realized whatever I downloaded was certainly *NOT* OpenOffice. (I don't know what it was; I deleted it quickly.)
In the last few years, I've found search results have been dominated more and more by content mills like Associated Content, eHow, HubPages, About, and others; or some low-quality Q&A page, like Yahoo Answers. The pages are hastily written and edited, and low on content. The articles are also typically written by someone without any relevant knowledge or experience, so the information is either common knowledge or wrong.
If Google's metrics say quality is up, but their users think quality is down, then Google's metrics need to be revised to match user experience more closely. I've started using Duck Duck Go because they block content mills, and thus I think their results are as good as or better than Google's, even without the complicated algorithms and all the data Google has accumulated.
In my own experience spam on Google is constantly getting worse and more frustrating to deal with ... I expect it for searches where there are not likely to be any hits, but it is also starting to creep into top spots in situations where there is denser information available.
I remember back in the day people working logistics used to run algorithms to maximize profits for store supply chains, but their efforts actually lost a great deal of revenue, as the algorithms did not understand human factors and how people having to go somewhere else to get an objectively less profitable item would impact their sales.
It is a complex space and to think you can simply throw algorithms at detecting and characterizing a problem you can't detect and quantify in the first place (Unless they actually can but are choosing not to for obvious evil reasons) seems more than just a little bit naive.
If I were google I would conduct a survey and see what real humans think about the problem rather than playing the part of a foolish statistician.
I also take exception to Matt's message. Don't tell someone who's pissed off about the amount of spam that it is getting better. This is an amateur-hour losing proposition. Just tell us what you plan on doing to fix it or don't say anything at all.
What about an "Elite" search engine? "Made by geeks/nerds for geeks/nerds."
(I lost track of the political correctness, pick either or your own.)
The guy who wants drivers, the guy who wants the KDE results, the guy who wants the scrotwm, my advanced search examples, on and on. We don't want to buy things. We're out to search for ruthless hard info.
Google took a cute step with the "reading level". It sorta helps.
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
Here's a great example of returning pages that don't contain what you're searching for.
Search for +open +cat +mug +frame
The first link only contains 2 of the 4 terms.
Returning a page that does not contain a required search term is a failure state.
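The check being asked for here is simple enough to sketch. This is only an illustration of the "+term means required" rule against a page's visible text; the function name is made up, and whole-word, case-insensitive matching is an assumption.

```python
import re
from typing import List


def missing_required_terms(query: str, page_text: str) -> List[str]:
    """Return the '+'-prefixed query terms that do not appear
    as whole words (case-insensitive) in page_text."""
    required = [t[1:].lower() for t in query.split()
                if t.startswith("+") and len(t) > 1]
    words = set(re.findall(r"\w+", page_text.lower()))
    return [t for t in required if t not in words]


# Hypothetical page text containing only two of the four required terms.
page = "Open the frame before hanging the picture."
print(missing_required_terms("+open +cat +mug +frame", page))
```

Any page for which this list is non-empty should not be returned for the query at all, let alone as the first link.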
I find being offended by me offensive.
Well, a real article from 1997 would be better than what I was describing (though I find adding an Ubuntu version, as that's what I've been using, works wonders).
e.g. Blah don't work 10.10 ubuntu
My point is that the "fresh" sites I get are all paragraphs of the same article, in a site called techwizbang.com or some such, with my exact google search appended as a search query on their site.
I wouldn't even be offended if it was the front page of a blog, but when it's some clearly (I hope) robot generated blog full of ads, often linking to other blogs, I would call it SPAM, even if "fresh", "Comprehensive", and "relevant". It is true they have pretty much halted SPAM in the sense that a porn site comes up when looking for something else, but these search pages, or keyword pages, on crappy robot generated blogs are pissing me off.
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Actually it does, the description text is hidden until some user actions are taken. A ctrl-f on the page may not return results for the terms, but viewing source and ctrl-f does.
insight through the mind
Actually, if you read the blog post from Google linked in TFS, they aren't saying that "there is no problem" (as parent post's title suggested) or that "it's great" (as parent post's text suggested.)
They did say that their own metrics don't show the trend that various, mostly anecdotal, critics have claimed. But they also said that they view the spam that does exist as a problem, and they announced several steps to address it:
As we’ve increased both our size and freshness in recent months, we’ve naturally indexed a lot of good content and some spam as well. To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly. The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments. We’ve also radically improved our ability to detect hacked sites, which were a major source of spam in 2010. And we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content. We’ll continue to explore ways to reduce spam, including new ways for users to give more explicit feedback about spammy and low-quality sites.
As “pure webspam” has decreased over time, attention has shifted instead to “content farms,” which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. Nonetheless, we hear the feedback from the web loud and clear: people are asking for even stronger action on content farms and sites that consist primarily of spammy or low-quality content. We take pride in Google search and strive to make each and every search perfect. The fact is that we’re not perfect, and combined with users’ skyrocketing expectations of Google, these imperfections get magnified in perception. However, we can and should do better.
This is not a company denying that there is a problem because their internal metrics don't match the problems being reported. It is a company acknowledging that there is a problem and committing to take action on it, even though their own internal metrics don't agree with their critics on the size of or trend in the problem.
If I was at google, the very first thing I would implement would be a double robot:
- the classic one, identified as googlebot
- another discreet one, identified as IE7 (or whatever is the most common browser at the time), with the page rendered by IE7, blurred a bit and then OCRed.
The two are then compared, and if they are far from matching, dump the pagerank in the bit bucket. This way you eliminate hidden text, white on white and see text in GIFs.
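The comparison step could look something like the sketch below. It assumes the two fetches and the OCR pass have already happened and you are holding two strings: the text served to the crawler and the text recovered from the rendered page. The Jaccard word-overlap measure and the 0.5 threshold are arbitrary choices for illustration, not part of the proposal above.

```python
import re


def word_set(text: str) -> set:
    """Lowercased set of words in the text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts: 1.0 is identical, 0.0 is disjoint."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 1.0  # two empty pages count as matching
    return len(wa & wb) / len(wa | wb)


def looks_cloaked(crawler_text: str, rendered_text: str,
                  threshold: float = 0.5) -> bool:
    """Flag a page when what the bot was served differs too much
    from what a human would actually see. Threshold is an assumption."""
    return jaccard(crawler_text, rendered_text) < threshold


# Hypothetical example: keyword stuffing hidden from the human-visible page.
served_to_bot = "cheap pills casino loans buy now driver download"
seen_by_user = "driver download"
print(looks_cloaked(served_to_bot, seen_by_user))
```

A word-set comparison ignores word order and repetition, so a real version would probably need something stricter, but it is enough to catch hidden text and white-on-white keyword lists.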
Non-Linux Penguins ?