MS Research Automates Search Engine Spam Hunt
Barbie Dollar writes "Researchers at Microsoft are working on an ambitious new project to hunt down and neutralize large-scale search engine spammers. The project, called Strider Search Defender, automates the discovery of search spammers through non-content analysis. The project integrates technology from two previous Microsoft Research prototypes (Strider HoneyMonkey and Strider URL Tracer) and promises a new approach to removing junk results from search engine queries."
Every anti-Microsoft blog and article in existence has been flagged as search engine spam.
More at 11.
"You will pay for your lack of vision..." - Emperor Palpatine to Ray Charles
Sure, preventing search engines from indexing blogspam posts is great. Maybe that's the first step, but it's not going for the root cause - the botnets that run the apps that post/email in the first place, and the compromised webservers hosting order sites.
Web 2.0 == Giant Blogspam Circle Jerk
By cracking down, Microsoft could effectively reduce the number of spam sites. The result would be fewer AdWords and microAds displayed and clicked, which could lower revenue for Google and Yahoo.
A side effect is better search results, which would increase use of Google again. Where is MSN Search in all of this...I don't know. But fewer of those crap sites, the better.
Researchers at Microsoft are working on an ambitious new project to hunt down and neutralize large-scale search engine spammers.
So, if by some miracle, they actually discover a way to hunt down and neutralize the search engine spammers, what are the odds that they share this information with other Search Engine companies?
I'm all for people being allowed to try and game the system...Anything else would restrict the whole purpose of the Internet as a repository for whatever the hell someone wants to put in there.
At the same time, I'm all for search engines blacklisting people who game the system, parked domains, crap aggregator pages, etc. It's all about building a better mousetrap.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
..that Strider HoneyMonkey was Arwen's pet name for Aragorn?
My guess is the next project will be called Strider Hiryu and this will eliminate said spam.
search-spam sucks bad. I'm tired of doing searches and finding 100s of useless links and "secondary search pages" with nothing but ads and other junk [spyware/adware].
Tom
Someday, I'll have a real sig.
"Strider Search Defender" is just a cover name. It's really the "Aragorn Search Defender" it just likes to remain incognito so that spam-zombies don't think to hunt it down.
If this signature is witty enough, maybe somebody will like me.
Sure, preventing search engines from indexing blogspam posts is great. Maybe that's the first step, but it's not going for the root cause - the botnets that run the apps that post/email in the first place, and the compromised webservers hosting order sites.
These are not mutually exclusive goals. If you take away any incentive for spamalizing content (meaning, not only does it not boost your search placement, it penalizes you), then much of the pressure to run botnets and crack servers goes away.
Don't disappoint your bird dog. Go to the range.
So, if by some miracle, they actually discover a way to hunt down and neutralize the search engine spammers, what are the odds that they share this information with other Search Engine companies?
Their purpose is to make their own search engine more effective for users, thus generating more traffic for them. A nice side effect would be that Yahoo and Google, etc., would feel more pressure to integrate similar technologies into their own engines. As usual, competition produces the best results.
Don't disappoint your bird dog. Go to the range.
All major search engines have been doing this for quite some time. Google is probably the best hunter of them all, and the most recent update, which occurred on June 27, banned a large number of spammers who had billions of sites indexed. Unfortunately, the war on spam is quite difficult. The spammers are working with non-content pages, but it is only a matter of time before they start generating non-gibberish content to spam with, too.
Hopefully, Microsoft's approach will have some effect and push other operators to work harder on preventing web spam.
Amusingly, you're most likely getting affected only if you're searching for penis pumps, pornographic content and gambling.
Full Tilt
Seems to me that a group of 10 people could easily flag a large number of spam websites. Is this currently being done by any major engine?
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
Of course, with their track record of Neat Ideas vs. Actual Products, (WinFS, etc.) I'm not holding my breath.
I am, however, wishing them luck.
Microsoft forgot to mention my non-content based method of blocking comment spam entirely known as Bad Behavior. And now that they seem to have swiped a few of my ideas, I'm going to have to go see what they're up to...
How am I supposed to fit a pithy, relevant quote into 120 characters?
This *must* be one of the next battle lines in the so-called search wars.
I remember the first time I saw google - I was blown away: "Wow. These results are exactly the web pages I was looking for!" But that's no longer the case when you search in google. They've really fallen behind in being able to separate out (or, as they say, "search for") the pages I want from the junk.
I hope google will win this war, but maybe microsoft chucking some money at the problem will help light a fire under google to get this fixed before someone else does it better. If searching at google no longer brings me relevant results better than any other source, I'm gonna start looking for somewhere else to search. Just like I did when I switched to google from yahoo back in the twentieth century.
I seriously doubt MS is going to "shut down" every windows box on the planet. ...searching for spam, eh? oddly spam finds the rest of us, but at MS they have to "search" for it.
Why are you even spending time reading and posting on Slashdot at all if you're so worried about the middle east and the end times being near, hmmm??
you all fail at classic gaming.
we can only hope that this research is as fruitful as their speech synthesis research, email spam blocking, multiplatform video codec, next-gen filesystem, advanced CLI shell, and portable computing.
yay for MS research!
If you don't know what AltaVista is (was), get off my lawn.
Arms race.
This is exactly what happens in email. You say "Oh! I can filter 99% of my spam by grabbing anything with 'Viagra' in the subject line!"
The spammers, noticing this, start using subject lines like "Urgent! Read now!"
You adjust your filter to watch for anything with "Urgent" in the subject line and "Viagra" in the body.
They send you Vi.ag.ra instead. You catch that, they send you Vlagra.
They send "Penis pills". You filter anything with "Penis". Then your friend changes their signature to "The Pen is Mightier than the Sword". Since your filter is smart enough to catch "Vi ag ra", it's also dumb enough to think "Pen is" means "Penis".
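To make that false positive concrete, here is a minimal Python sketch. The pattern and the `looks_spammy` name are made up for illustration, not taken from any real filter: it tolerates separator characters between letters so it can catch obfuscated spellings, and that same tolerance is what trips on the signature.

```python
import re

# Hypothetical pattern: the letters p-e-n-i-s with any run of
# non-word characters (spaces, dots, ...) or underscores allowed between them.
OBFUSCATED = re.compile(r"p[\W_]*e[\W_]*n[\W_]*i[\W_]*s", re.IGNORECASE)

def looks_spammy(text: str) -> bool:
    """Return True if the obfuscation-tolerant pattern matches anywhere."""
    return OBFUSCATED.search(text) is not None

# The obfuscated spam and the innocent signature both match:
# looks_spammy("P.e.n.i.s pills") and
# looks_spammy("The Pen is Mightier than the Sword") are both True.
```

Tightening the pattern (say, disallowing spaces) just invites the spammers to use spaces, which is the arms race in miniature.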
You adjust your filter to assign a score based on how many bad things it notices, and you add a few good things to even the score -- like whitelisting a few close friends, and anything coming in with "I AM NOT SPAM" in the subject line. Of course, you realize it won't work entirely -- the spammers will eventually use "I AM NOT SPAM", and sooner or later you'll get an email from someone you never heard of, who wants to talk to you about a business proposition, who got your email from somewhere like a forwarded message or somewhere else on the Internet, and they don't add the "I AM NOT SPAM" flag. But for a while, it works.
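A score-based filter of this kind might look roughly like the Python sketch below. Everything in it is an assumption for illustration: the rule weights, the threshold, the whitelist address, and the function names are hypothetical, not any real filter's rule set.

```python
import re

# Hypothetical rules: (pattern, weight). Positive weights count against
# the message, negative weights count in its favour.
RULES = [
    (re.compile(r"v[\W_]*[il1][\W_]*a[\W_]*g[\W_]*r[\W_]*a", re.I), 5.0),
    (re.compile(r"\burgent\b", re.I), 2.0),
    (re.compile(r"I AM NOT SPAM"), -4.0),
]
WHITELIST = {"friend@example.com"}  # hypothetical trusted sender
THRESHOLD = 5.0                     # hypothetical cut-off

def spam_score(sender: str, subject: str, body: str) -> float:
    """Sum the weights of every rule that matches the subject or body."""
    text = subject + "\n" + body
    score = sum(weight for pattern, weight in RULES if pattern.search(text))
    if sender.lower() in WHITELIST:
        score -= 10.0  # whitelisted senders get a large bonus
    return score

def is_spam(sender: str, subject: str, body: str) -> bool:
    return spam_score(sender, subject, body) >= THRESHOLD
```

The weakness shows up immediately: a spammer who puts "I AM NOT SPAM" in the subject drops a Viagra message from 5.0 to 1.0, under the threshold.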
Then the spammers start sending messages that contain no text at all, just a few large images.
You filter that, meaning you completely miss your grandmother's email -- family photos -- or your girlfriend's birthday surprise email -- you fill in the blanks.
Before you know it, you're spending all your spare time tweaking your spam filtering settings, and it's still not enough. You thought it would be so easy -- just a Perl one-liner used to block 99% of your spam, with 0 false positives! But things are changing too fast now. At some point, you get the genius idea to make it open source. Hundreds of like-minded people flock to it, desperate. Every day, your spamfilter downloads a new copy of the rules database, a collection of Perl one-liners used to catch spam. But you're getting hundreds of spams a day now, which means as soon as the spammers switch tactics, you could have a thousand spams in your inbox before you get the daily database update -- and that's assuming the daily update has a rule that blocks these.
Basically, you've created SpamAssassin. Works like an anti-virus program. It also means that someone has to get hit with a new virus (type of spam) before the filter can block it, but even when it's at its best, it's still nowhere near good enough. Remember, 95% accuracy on 500 spams a day means you still get 25 spams in your inbox.
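The arithmetic behind that last figure, as a tiny Python sketch (the function name is made up for illustration):

```python
def spam_reaching_inbox(spams_per_day: int, accuracy: float) -> int:
    """Spam messages that slip past a filter with the given catch rate."""
    return round(spams_per_day * (1 - accuracy))

# 500 spams a day at 95% accuracy leaves 500 * 0.05 = 25 in the inbox;
# even at 99% accuracy, 5 still get through.
```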
This is why it's best to automate this kind of thing. Use a statistical filter such as dspam, bogofilter, or crm114. They are actually more accurate, when trained by humans, than a hand-coded filter.
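For anyone who hasn't seen one, here is a minimal sketch of a statistical filter in Python. It is a toy naive-Bayes classifier written for illustration, with a hypothetical class name and tokenizer; it is not how dspam, bogofilter, or crm114 are actually implemented, though it shows the same basic idea of learning word statistics from messages a human has labeled.

```python
import math
import re
from collections import Counter

class TinyBayesFilter:
    """Toy naive-Bayes spam filter trained on human-labeled messages."""

    def __init__(self):
        # Per label: how many training messages contained each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Unique lowercase words; presence matters, not frequency.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.docs[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        # Work in log-odds, with add-one smoothing so unseen tokens
        # never produce a zero probability.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Usage is just `f.train(message, "spam")` or `f.train(message, "ham")` whenever a human corrects the filter, then `f.spam_probability(new_message)`. Nobody writes a rule by hand; when the spammers change vocabulary, retraining follows them.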
So yes, you do need humans to train your web filter, but you also need your humans to continue to train and retrain a statistical filter. You can't just pick an arbitrary five websites and either assume that's all there is, or remove everything like those five, because that just starts the exact same arms race I've just described.
Don't thank God, thank a doctor!
So in other words, it'll be called Aragorn when it becomes master?
Windows has detected an undetectable error.
Google could cut their spam to 1/4 if they stop accepting websites whose domains are less than 7 days old (this would render domain kiting useless).
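The check itself would be trivial. A rough Python sketch, assuming the crawler already has the domain's registration date from somewhere (for example a WHOIS lookup, which is not shown), and with a made-up function name:

```python
from datetime import datetime, timedelta, timezone

MIN_DOMAIN_AGE = timedelta(days=7)  # the 7-day cut-off suggested above

def old_enough_to_index(registered_at, now=None):
    """True if the domain was registered at least MIN_DOMAIN_AGE ago.

    registered_at is assumed to be a timezone-aware datetime.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - registered_at >= MIN_DOMAIN_AGE
```

The hard part is not the comparison but getting reliable registration dates at crawl scale.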
I would pay to use a search engine that removed all "blogs" and shopping sites from the results.
This addresses a particular kind of spam page that is promoted in a particular way.
But it does nothing to address the vast majority of the pages that contaminate search engine results. I'm referring to automatically generated pages that look like good pages and hence rank well in search engines, but really have little except links and perhaps some public domain info. E.g., there could be one each for every resort hotel in Mexico. The search engine result turns up a summary that makes it look like there are "reviews" there. But either the reviews section is empty, or else they reproduce something that's available on dozens of other sites as well. In one case, apparently, a single such site had 4 billion "different" pages. I'm not making that number up.
More sophisticated kinds of link-network analysis will be needed before those bite the dust.
To err is human. To forgive is good system design.
Non-content analysis? Isn't that patented by Slashdot readers?