Microsoft Tracks Down Mass Fake Web Pages
An anonymous reader writes "According to an article in The New York Times, Microsoft researchers have discovered tens of thousands of junk Web pages, created only to lure search-engine users to advertisements. While most of us have run across them from time to time, the company researchers have found the pages are deliberately generated in vast numbers by a small group of shadowy operators. By following the money trail, Microsoft researchers were able to track the flow from big-name advertisers to search engine spammers. Many use Google's blogspot.com to set up spam doorway pages. 'The practice has proved to be a vexing problem for the major search companies, which struggle to prevent both spammers and companies specializing in improving legitimate clients' Web traffic -- a field known as search-engine optimization -- from undermining their page-ranking systems. Surprisingly, the researchers noted that the vast bulk of the junk listings was created from just two Web hosting companies and that as many as 68 percent of the advertisements sampled were placed by just three advertising syndicators.' The report is available at the Microsoft Strider Search Ranger project page."
They could have saved a lot of time and money by just visiting forums like DigitalPoint. These doorways and other spammy sites are for sale every day. It's no secret.
Developers: We can use your help.
I was actually surprised to find their "what to do" points so simple and to the point.
Man. This Microsoft project is just a ripoff of Google's Gandalf Search Wizard project...
This guy's the limit!
I fully expect to see an improvement in my search results ... for about five minutes, until the SEO spammers crank out their next method of making the Internet less efficient.
I am, therefore you think.
Is it really cheaper to use Page Ranking companies instead of just, well, PAYING for an advertisement on Google or MSN or something?
Time to time? For me it seems like more than 50% when I scan the search results. Maybe less, maybe more, but certainly more than "time to time". For many of my searches, I may not find anything truly relevant until the second or third page. People have learned how to play Google to the point where, more and more, Windows Live is starting to give better results (scary!).
If you want news from today, you have to come back tomorrow.
they harvested most of their results from Google.
Summation 2
There's actually some pretty decent research here. The site cloning report is a good read.
http://research.microsoft.com/SearchRanger/Spam_Attack_by_Website_Clones.htm
The cloning of popular blogs has been a scourge for a while now, both for manipulating search engines and for good old-fashioned advertising: using someone else's content to draw visitors in.
-- Using the preview button since 2005
It's coming from inside the building!!!
The original generic sig.
PageRank is designed to be resistant to exactly this sort of attack. The amount of Google karma you get is proportional to the karma of the pages that link to you. Creating lots of pages with no karma that link to you therefore shouldn't do you any good at all. Why do they bother?
Theories:
(1) There's a subtle way that it helps I haven't spotted yet, perhaps to do with non-PageRank elements of Google's search ordering
(2) This is all done by a very few companies because they are the few that don't understand PageRank and therefore don't realise it won't help...
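For what it's worth, theory (1) can be checked against the textbook formula. The following is a minimal sketch, assuming the classic damped PageRank (damping factor 0.85) and not whatever Google actually runs; the page names and the 50-page farm are made up for illustration. In this model every page gets a small baseline share even with no inbound links, so a farm of otherwise unlinked pages still passes some rank to its target.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[p] / n
                for t in pages:
                    new[t] += share
        rank = new
    return rank


# Hypothetical tiny web: nobody links to "target".
honest = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "target": ["hub"]}

# Same web plus 50 farm pages that nobody links to, all pointing at "target".
farmed = dict(honest)
for i in range(50):
    farmed[f"farm{i}"] = ["target"]

honest_rank = pagerank(honest)
farmed_rank = pagerank(farmed)

if __name__ == "__main__":
    print("target without farm:", round(honest_rank["target"], 4))
    print("target with farm:   ", round(farmed_rank["target"], 4))
```

In this toy graph the target's share goes from roughly 0.04 to roughly 0.12 once the farm is added, purely from the baseline share of the farm pages. Whether real search engines discount such pages is a separate question that this sketch does not answer.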
Xenu loves you!
Ok. Forgive me if MS just discovering this makes me think they just entered 2002. That crap is _not_ new folks.
On the other hand, what idiot spouts off about two hosting companies being responsible without naming them? Seriously. This isn't Fark, you can't get kicked off for calling some asshole out.
Quick, somebody make a few thousand clones of this report.
Microsoft researchers have discovered tens of thousands of junk Web pages, created only to lure search-engine users to advertisements.
In other news, Microsoft researchers have discovered that the sky is blue and that water is wet.
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
It amazes me how dumb Microsoft search researchers are as they are probably the last ones to discover that the majority of spam web pages are created by a handful of shadow operators. If I were in charge of the researchers who made this finding so late, I would have them dismissed promptly as this finding is too little, too late.
Question everything
Google is already developing methods to deal with clusters of these fakes. Usually they're scraping web directories and databases. I've seen a lot of this lately, searching for dental hygiene schools for my girlfriend. Usually they're linking to each other, even if they're huge clusters. Legit SEO guys (yes, there are consultants who actually try to get your site linked legitimately and by hand) call these areas "bad neighborhoods". Whatever Google's doing, though, clearly isn't enough, and a lot of these guys are using adsense to make money. Martinibuster's got a few good links on the subject.
www.about.com?
On another note, I've been wondering, based on results I see fairly regularly, whether it is possible for a site to dynamically produce a page based on the google search that it is linked from.
When I am looking at search results I often hit pages that look like they were designed to match exactly my query, but are full of meaningless high level fluff, ads and links.
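It is certainly possible in principle. Here is a minimal sketch, assuming the browser sends a Referer header containing the search URL with the query in a `q` parameter; the function names and the page template are hypothetical, not taken from any real doorway kit.

```python
import html
from urllib.parse import urlparse, parse_qs


def search_terms_from_referrer(referrer):
    """Return the search words from a Google-style referrer URL, or []."""
    parsed = urlparse(referrer or "")
    if "google." not in parsed.netloc:
        return []
    # parse_qs decodes "+" and %xx escapes for us.
    query = parse_qs(parsed.query).get("q", [""])[0]
    return query.split()


def build_page(referrer):
    """Build a page heading that parrots the visitor's own query back."""
    terms = search_terms_from_referrer(referrer)
    if not terms:
        return "<h1>Welcome</h1>"
    phrase = html.escape(" ".join(terms))
    return f"<h1>Everything about {phrase}</h1>"


if __name__ == "__main__":
    print(build_page("http://www.google.com/search?hl=en&q=dental+hygiene+schools"))
```

A page built this way would look tailored to the query while containing nothing of substance, which matches the fluff-and-ads pages described above.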
You can see how they make them: fed by Digg, obviously.
(found via Digg's "who blogged about this" feature, remove f- from the start of the url)
f-cartoons-plugin.com/blog/
f-www.primenewsblog.com/
f-fatmobil.com/blog/
f-www.cartoonsfans.com/blog/
f-searchroads.com/blog/
...a friend of mine figured he could get great Google listings by autogenerating trashy link farm pages. He had the top 1000 porn search terms all cunningly misspelled, e.g. "Brittney Spares", and hundreds of thousands of static pages all linking into each other across a bunch of subdomains. For about a year we reckoned he had some stupid percentage of all porn listings in Google, and in that time he made around $1,000,000 from banner clicks. Eventually Google caught on to it and blocked his sites en masse, but he'd made enough to buy some property by then.
I just finished reading how much the Strider group at M$ has accomplished and how, and it is rather amazing. They lifted the covers off typo-domain squatters exploiting Google's programs, built a progressive honeypot setup that detects which levels of XP are attackable by different malware attacks (up to and including reporting zero-day exploits if the latest "patch hardened" machine is exploited), and now this project. Even better, they are publishing the "how", and any OS (AKA Mac OS or any of the Linux distros) could benefit by using similar approaches on even more machines.
So -- from an admitted open source advocate -- here's a rare kudo to the giant in Redmond for keeping a "white hat" and his group -- and letting them work.
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
I just use Firefox with the Adblock & Filterset.G add-ins. I don't see any ads to click on.
Thanks for an informative post. Beats the typical whiny M$ iz S4T4|\| crap.
Google does keep up, but quietly. Anecdotally, last week I was searching for a certain spec ARM9 dev board (the VULCAN-Lite) with USD also as a search term, and all kinds of fake keyword sites and Eastern Bloc bride services were in the top 20 results.
I sent Google feedback with my search terms (VULCAN-Lite +USD), explained what spam was popping up, and as I write this comment a few days later-- the Google search comes back clean (empty for +USD, no spam in first 30 results for VULCAN-Lite). They apparently listen and respond to random user feedback pretty quickly.
I'm going to find a copy of this list and check it, fully expecting to find Linux sites in it.
That's all it's been from Microsoft lately. Microsoft, the anti-Linux company (who also sells some software on the side).
As far as research divisions of big companies go, MS's is the most uncool by miles. I have yet to see any announcements coming from MS Research that evoke anything other than a yawn - this announcement being a good example. This can't be said of HP, IBM, the old AT&T, the old SUN, etc.
One wonders if MS hires talented people only in order to prevent them from doing interesting research for other companies, not in order to do interesting research for them.
I read the research paper a couple of days ago after reading about it in the NY Times. Seeing how this research is Microsoft-funded and implicates Google, claiming their syndicators are in cooperation with the spammers, one has to question researcher bias. I'd like to see a peer-reviewed and independently verified article before accepting these outrageous claims. Note that the researchers focused on a few keywords and strictly limited the scope of their efforts. This doesn't mean the findings are untrue; it just calls their methodology into question.
signature pending slashdot approval
Firefox has an extension called customizegoogle. It adds a 'filter' option to a google results page. Allows one to filter out the sneaky pages that hi-jack your search query.
Most people have no idea what they are doing, and are silently panicking on the inside.
I look at these situations much like I look at people who cheat welfare systems and such. So many people spend so much time figuring out how to cheat a system that I wonder how much of a difference it would make to the net outcome if that same time were spent trying to work the system the right way...
dB Masters
What was the point of this effort? To improve its own search results? To show up Google?
Proud member of the American Non Sequitur Society. We might not make much sense, but boy do we love pizza!
Microsoft's search excels in spreading malware
(just posted today on the reg.)
http://www.theregister.co.uk/2007/03/20/windows_live_malware/
No wonder Microsoft never has any real innovation.
What if I want MY page to just be a sea of ads? I setup the code, I did the work, why can't I show what I want? It's not my fault that Google misreads my page or gives someone else a higher ranking because of it. I'm sure there are whole boatload of sites that could be deemed "junk", but out here in the digital wild west, I'm free to do what I want on my 10MB of free space....Aren't I?
Someone go get those bloody bastards, and shoot them dead. The Internet won't be that much of a safer place (other bastards will rise to replace them), but every step taken to sanitize it will be a welcome one.
Here's a thought: why can't search spiders be a bit smarter, and discard any links on a page that are set to "display: none"? Or, better yet, why not flag them as potentially abusive? I realize there are legitimate reasons for hiding a link with the CSS display attribute if you're using dynamic HTML, but I'd venture to guess the majority of hidden links are used for search engine manipulation.
Of course, the scammers would just try some other tactic -- perhaps hiding links in Z-layers behind opaque graphics -- but it is always an arms race, isn't it?
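As a rough illustration of the "flag them" idea, here is a minimal sketch using Python's standard html.parser. It only looks at inline `style` attributes on a link or its enclosing elements and ignores stylesheets and scripts, so it is an assumption-laden toy rather than how any real spider works; the class and function names are made up.

```python
import re
from html.parser import HTMLParser

HIDDEN = re.compile(r"display\s*:\s*none", re.IGNORECASE)

# Elements that never get a closing tag, so they must not be pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HiddenLinkFinder(HTMLParser):
    """Sort links into visible and hidden, based on inline styles only."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # (tag, is_hidden) for each open element
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        inherited = bool(self.open_tags) and self.open_tags[-1][1]
        is_hidden = inherited or bool(HIDDEN.search(style))
        if tag == "a" and attrs.get("href"):
            (self.hidden if is_hidden else self.visible).append(attrs["href"])
        if tag not in VOID:
            self.open_tags.append((tag, is_hidden))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break


def find_links(page_html):
    """Return (visible_links, hidden_links) for a chunk of HTML."""
    finder = HiddenLinkFinder()
    finder.feed(page_html)
    return finder.visible, finder.hidden
```

A spider could drop the hidden list or count it as a spam signal. As noted above, legitimate dynamic HTML hides links too, so a flag is probably safer than an outright discard.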
Comment removed based on user account deletion
"According to an article [in the] New York Times, Microsoft researchers have discovered tens of thousands of junk Web pages ..."
There are plenty of pages pushing junk out there. Here's one I came across just today:
http://onecare.live.com/standard/en-us/default.htm
The site in question seems to have changed from Google AdSense, but when I complained about it (WHEN THEY WERE USING GOOGLE ADSENSE) they basically said it was not against their policy. I think Microsoft knows that Google is getting lots of revenue from cybersquatters, so that is why they are going after them.
Original Message Follows:
Subject: Other
Date: Fri, 22 Sep 2006 18:20:26 -0000
Hi there, I came across a site by accident and I notice that it seems to
direct link to adsense advertisers via a direct link as opposed to the
hidden javascript that is normally the case. Also the page seems to be
made up entirely of adsense ads which I thought was against adsense
policy.
Can I do this too? http://paypall.com/
BTW I see lots and lots of sites like this and they all seem to take
advantage of mistakes when typing in website urls - this seems EVIL to
me...
Regards,
Hi,
Thank you for your email. It appears that paypall.com is a member of our AdSense for Domains (AFD) program. Because we respect the confidentiality of all publishers, we cannot disclose any additional details of our relationships with other sites.
If you own sites that generate more than 750,000 page views per month you may be eligible for our AFD program. If you meet this requirement and you'd like to learn more about the program, please visit http://www.google.com/domainpark .
For additional questions, I'd encourage you to visit the AdSense Help Center (http://www.google.com/adsense_help), our complete resource center for all AdSense topics. Alternatively, feel free to post your question on the forum just for AdSense publishers: the AdSense Help Group (http://groups.google.com/group/adsense-help).
Sincerely,
Kevin
The Google AdSense Team
To access the Google AdSense home page or to log in to your account,
please visit: https://www.google.com/adsense
This didn't make microsoft sound nearly evil enough for /.
"Microsoft Strider Search Ranger"? Come on now. Are they turning to japanime/manga naming conventions? How long until: Microsoft Laser Super Action Happy Extreme START!!!!! Microsoft Real Swift Rainbow Sunshine Police Now LOVE!!!! This is taking the concept of branding into its exact opposite. And then you have things like, Apple TV. And you wonder why MSFT is tanking.
10,000 pages??? Geez, I want to work for microsoft, those guys make wally look industrious http://www.google.com/search?hl=en&q=allinurl%3Adm xargs&btnG=Google+Search
Well... it's a bit like blaming the PC security problems mostly on Windows. The shoe fits.
As a Slashdot discussion grows longer, the probability of an analogy involving cars approaches one.
Old news, see: http://johnbokma.com/mexit/2006/07/13/
Have been reporting this to Google for over a year. Only recently long lists (thousands) of blogs got /finally/ accepted by the abuse desk. If I can find thousands of blogs with some Perl, why can't Google fix this before those blogs get spammed on thousands and thousands of open guestbooks, blogs, etc.
Furthermore, the problem is not limited to Google. LayeredTech, ThePlanet, and several other hosting providers have no problem at all with making it a pain in the ass to report abuse and just host too much garbage for too long.
And all the while non-solutions like Akismet are applied by the masses. It's time some people create a draft on how comments should be stored in blogging software (hint: including remote IP, proxy-related environment variables, etc.) and we get an online reporting tool like SpamCop. Filtering? Look at your inbox. It's not going to happen. And CAPTCHA? By the time bots have problems with it, most people can't solve them.
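To make the storage hint concrete, here is a minimal sketch assuming a CGI/WSGI-style `environ` dict. The record layout, the choice of which proxy headers to keep, and the names `StoredComment` and `comment_from_environ` are all assumptions for illustration, not a proposed standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Request variables that proxies commonly set; which ones to keep is a guess.
PROXY_VARS = ("HTTP_X_FORWARDED_FOR", "HTTP_VIA", "HTTP_FORWARDED", "HTTP_CLIENT_IP")


@dataclass
class StoredComment:
    body: str
    remote_addr: str      # address of the connecting host
    proxy_headers: dict   # whatever proxy-related variables were present
    user_agent: str
    received_at: str      # UTC timestamp, ISO 8601


def comment_from_environ(environ, body):
    """Build a storable record from the request environment and comment text."""
    return StoredComment(
        body=body,
        remote_addr=environ.get("REMOTE_ADDR", ""),
        proxy_headers={k: environ[k] for k in PROXY_VARS if k in environ},
        user_agent=environ.get("HTTP_USER_AGENT", ""),
        received_at=datetime.now(timezone.utc).isoformat(),
    )
```

With records like this, a reporting tool would have something to hand to an abuse desk besides the spam text itself.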
Seriously, I have had phishing email for some of these 80.77.x.y websites recently as well. A "Good on ya!" to MicroSoft & UC Davis! Root the bastards out and stomp 'em!
.. paranoid crackpot leftover from the days of Amiga.
welcome to the social, MS
Funny timing: last night we turned down $73k/month in advertising from one of the top three spam-supporting syndicators. They were seeking an average of $1.16 per click-through.
I am very glad I read the detailed report from end to end. We seek value in advertising, not spam, but it is very difficult for well meaning companies to figure out which is which. You shouldn't have to be a rocket scientist to differentiate the deceptive tactics/companies from the valid ones. I guess most forms of fraud end up being abstractly similar to this scheme in the end though.
If something smells fishy don't eat it.
JohnE
jobbank.com - Search jobs, post resume,
This is good work by Microsoft. They've tracked down a few big-time web spammers, all the way up the food chain. But there are more.
We've been working on the web spam problem, from a different angle. Our starting point is the legal requirement that a business cannot be anonymous. Every legitimate business must have an identifiable person or corporation behind it. (See CA B&P code sec. 17358 ("disclosure of ... legal name and address information shall appear on ... the first screen displayed ... (or) on the screen on which a buyer may place the order for goods or services ...") and the European Directive on Electronic Commerce ("the service provider shall render easily, directly and permanently accessible to the recipients of the service and competent authorities, at least the following information: (a) the name of the service provider; (b) the geographic address at which the service provider is established...").)
Given that basis, our solution to web spam is straightforward: if we can't find a valid business name and address on a web site that's selling or advertising, it's not a legitimate business. Of course, if there is a name and address, it should match business license data, corporate registration data, fictitious name filings, and similar records of business existence.
So we have a system that parses web pages in some detail, looking for addresses. If a web site has a name and address on it that obeys postal addressing rules, we can usually find it. We have access to some business databases, and we're adding more. We look at some other info, like SSL certs and BBB seals, which has some credibility. Thus, we can check for legitimacy.
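To give a feel for the address-matching step, here is a minimal sketch. It assumes plain page text, US-style addresses, and a hypothetical in-memory set of business records; the regex and the function names are illustrative guesses and not the actual SiteTruth parser, which handles far more cases.

```python
import re

# Rough pattern for "123 Main Street, Springfield, IL 62701".
ADDRESS = re.compile(
    r"\b(?P<number>\d{1,6})\s+"
    r"(?P<street>[A-Za-z0-9. ]+?\s"
    r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive|Ln|Lane|Way)\b\.?)"
    r",?\s+(?P<city>[A-Za-z. ]+?),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)\b"
)


def find_addresses(page_text):
    """Return a list of address dicts found in the page text."""
    found = []
    for m in ADDRESS.finditer(page_text):
        street = " ".join(f"{m['number']} {m['street']}".lower().split())
        found.append({"street": street, "city": m["city"].strip(),
                      "state": m["state"], "zip": m["zip"]})
    return found


def is_listed(page_text, records):
    """True if any address on the page matches a (street, zip) record."""
    return any((a["street"], a["zip"]) in records for a in find_addresses(page_text))


# Hypothetical stand-in for a business license database.
KNOWN_BUSINESSES = {("123 main street", "62701")}
```

A page with no parseable address, or with one that matches no record, would fail the check; what to do with a failure is a ranking decision outside this sketch.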
Our goal is to feed this into search engine rankings, so that non-legitimate businesses fall out of visibility.
"Doorway pages" and "affiliates" with no business behind them aren't legitimate businesses, so they're toast. Completely phony addresses won't work, either; they won't match business records. Stealing the name and address of a legitimate business is felony identity theft, which is a place you don't want to go. (Also, sometimes, we can detect and report that.)
An early version of this is already running at SiteTruth.com. If you're responsible for a commercial web site, run it through the "Detailed SiteTruth analysis, for Webmasters" and see what SiteTruth finds. If SiteTruth can't find your business name and address, you might want to fix that. The day will come when it affects your search placement.
This is the alpha test phase for SiteTruth; there's more coming.
Web spam used to be a safe tactic. That was then. This is now.
Anyone who makes a website, no matter who considers it "junk", is still not forcing or spamming any search engine. In order for a site to be listed in a search engine, they (the search engines) must send out their crawlers in search of websites...If an SE ends up listing a "junk" website because its crawler found it in the endless boundaries of cyberspace...that's not the problem of the owner of the so-called "junk" website...nor should it be.
There is NO SUCH thing as "spamming a Search Engine"
There is only what THEY allow to be indexed in their SE or what they don't
This term was invented by the SE folks for their own purposes: to get you on their side by being your "protector".
------
"The most terrifying thing happened to me today...I visited a blog and it...it...it had ADS on it and information about...about...watches...and...and it did not make sense to me...and the ads were from Google and Yahoo...and...and...oh my god...the madness! I think it was a "spam website"...that's been talked about SO MUCH...on CNN, MSNBC...oh the pain...I feel so dirty...I am going to take a shower and a bottle of NyQuil...I just can't deal with this..." LOL
-----
BTW: I thought search engine companies invented algorithms and search engine filters and about a thousand other things to rid themselves of so-called "junk sites" starting about 5 years ago.
What's scary is that somehow, along with Google, they now think they can take, and are trying to take, complete control over the internet...Earth to Google and MS...you don't and never will.
But at the rate things are going, they will control enough of it...to ruin it for a lot of people.
If they don't want what they consider "junk" websites in their SE...and they cannot control this through technology...then they should go back to the "old-fashioned" directory days, where every website and blog (blogs did not exist...LOL) that gets indexed into their SE is first viewed by a "person" and approved or disapproved for inclusion in their SE.
Just as with the real history of humans on this planet...if you do not remember all the freedoms the internet provided in the past...you too will lose more and more of your freedom on the internet...with Google and MS as your masters...and you will never know or remember exactly how it happened...it just will be.
This is not ALL about websites they don't like...it is ALL about controlling everything you do online...whether you're aware of it or not.
That's it...nothing too "heavy"...LOL
Peace! Chipper Jack
http://www.iraqsinconvenienttruth.com/
Hi, I love analogies...but that's a bad one. Peace
http://www.iraqsinconvenienttruth.com/
Looks like the dudes are having a little shot at Google/Yahoo. Meanwhile, one of the authors cites 4 of his own articles, so I wonder if he's trying to pump up his academic ratings on the side.
I have read the stories about "we have a long list of blocked IP addresses and all the horrible bots are using my bandwidth". Brett Tabke is a liar. I have tried accessing his site from many different (static!) IPs in different /16 blocks and they were all blocked. Tabke's business model is to have an ad-free website and charge $180 per year for access to the site. He wants to attract new paying customers without paying for bandwidth to nonpaying visitors. So he blocked access by nonpaying visitors from Asia and Europe completely because they were not generating enough revenue.