El Reg Says Google Choking on Spam Sites
Grubby Games writes "The Register is reporting that Google is full, and in trouble." From the article: "Recently, we featured a software tool that can create 100 Blogger weblogs in 24 minutes, called Blog Mass Installer. A subterranean industry of sites providing 'private label articles,' or PLAs exists to flesh out 'content' for these freshly minted sites. And as a result, legitimate sites are often caught in the cross fire. But the new algorithms may not be solely to blame. Google's chief executive Eric Schmidt has hinted at another reason for the recent chaos. In Google's earnings conference call last month, Schmidt was frank about the extent of the problem. 'Those machines are full,' he said. 'We have a huge machine crisis.'" James Robertson points out that's a fairly selective bit of quoting.
Thanks!
This issue is a bit more complicated than you think.
With hardware (and bandwidth) getting cheaper, I find it hard to believe that Google has actually run out of space. But certainly the explosion in the number of web pages is an issue, especially with auto-generated pages. One current example is the V7ndotcom Elursrebmem SEO contest (white-hat celiac charity site I'm supporting) - that nonsense phrase returned zero results on January 15th, 2006 ... but now returns almost 5,000,000 ... of which I gotta believe the vast majority were NOT typed in by humans.
So maybe it's more that the techniques/algorithms used to spider and index are struggling with the bazillions of web pages out there. Or it could just be disgruntled webmasters PO'ed that their web site isn't listed!
Hulk SMASH Celiac Disease
Slashdot reports on obviously incorrect stories... anyway!
Gmail:
Over 2721.241062 megabytes (and counting) of free storage...
Methinks Google has more room to spare than The Register says.
Wow...so there really is an end to the internet.
concrete5: a cms made for marketing, but strong enough for geeks.
"Google is full" is a pretty nice and catchy headline.
See how many news items appear on Google News front page referring to this soon. Irony...
I just realized that many of the jokes we apply to lawyers could also be used on spammers with good effect:
So what do you have when you push 50% of all the spammers in the world into a hole and bury them? A good start.
Did you know that if you took all the spammers in the world and lined them up end to end around the equator of the earth that two thirds of them would drown?
Earlier this year, a google search on our product name suddenly returned 10 times as many results. I wondered what the hell happened. Good thing we don't rely too heavily on Google.
I'm not a computer person, but couldn't Google just upgrade to a bigger disk drive?
I saw one at bestbuy.com that looks pretty good.
i think they'll be needing a bigger boat...
Who ever would have thought that an upstanding, respected news source like The Register could have ever written anything with poor journalistic integrity?
I would have never expected to see this day!
My heroes are all burning!
Try this...
Go to yahoo and search for "slashdot poneys". This will bring up a bunch of results, all approximately 1 month old.
Now do the same search on google. Notice how many of the results from yahoo do not appear in the google results at all.
Google has such a big backlog that they don't get around to spidering new sites for several months. While google does give priority to certain high-profile sites like slashdot and visits those frequently, most other sites do not get indexed for several months.
OMG ITS THE END!! AAAAH!
I'm surprised nobody's mentioned the fabled Google Internet. If Google's servers are getting full with all the pages of auto-generated content, why not just not list them? Create your own internet without all those and suddenly Google has more than enough space and bandwidth. That quote (flaky though it is) is the best 'evidence' anybody's been able to find to support such a theory.
I've always pictured the color of OS zealotry as a sort of bright flamingo pinkish hue
In creating adsense, google opened the floodgates for spammers who do not want to create good content. In fact, there are even people who copy tons of content from wikipedia and throw up adsense on the top and sides of the pages.
There are people who are literally making $10,000 or more per month just putting up junk content sites that are auto generated for the purpose of creating adsense revenue.
Don't get me wrong, I think adsense is a good thing, but Google's allowance of spam sites is giving adsense a bad name.
I glance at the google results for some of my own sites and the Reg is correct, Google's index is completely out of date, even for a super small time guy like me.
I know the GoogleBot indexes the site almost every day. Yet, while one of my sites is completely out of date (the Cache is from 2005), another is almost completely up to date.
Google's got problems.
--- Kicking the Cheat since late 2002
Meanwhile, for no good reason, here's some gorgeous stats porn on how Google (and Yahoo and MSN) crawled a sample website. The animations and closeups of the trees are very cool.
Google is going to be filled up too fast by /. dupes, too.
Now it's time to become self-aware on all this data and launch Skynet.
Just remember that /dev/null filled up years ago. Yet, we seem to be doing just fine.
There is no reasonable defense against an idiot with an agenda
:wq
You know, you write code assuming that an end user somewhere will do the dumbest thing imaginable, but I guess nobody ever imagined the possible effects of collusion between extreme stupidity and cleverness (spammers). I know I never would have thought that someone would go to such lengths and spend so much time to barely scrape out a living while pissing off countless hordes of people. How do you go about creating enough international legislation and cooperation to catch these guys without crippling the internet with regulation? Are third world countries even capable of compliance? All I can think of is that we need something on the level of the UN where tech-heavy countries are given jurisdiction over other nations that don't have the resources needed to police these kinds of things in exchange for a fee, or maybe a guarantee that said nation will dedicate x amount of troops to any areas needing occupation to stop civil war or genocide or something. Am I over-reacting here? I just can't help but think that dealing with this problem without any legal consequence for the spammers is just encouraging and allowing them to come up with ways around whatever solution is currently in place.
Eh, or I could be completely off my rocker, and just not competent enough to see a simple and effective method of combating these guys.
Ex nihilo nihil fit.
I do hate it when searching for something about 4-10 pages in a row are purely sites that pretend to have what you're looking for but are merely meta dumps with adwords or other advertising mechanisms on them. Some of them even have valid cached pages. That said, this article, while certainly Fud, is only Fud Light. I personally prefer Fud Dark- at least I can generally laugh at the article's absurdity. This one was more or less just plain retarded.
Can you imagine a Beowulf cluster of Googles?
That's "Mr. Soulless Automaton" to you, Bub.
Some of you might recall that for a long time the Google index stood at around 4 billion pages. It turned out that this was because of the limited number of unique 32 bit index values. To handle this, Google created two index values to reference each page. One is called the "Selector", and the other is called the "Offset". Simply put, the selector is left shifted by 4 bits and added to the offset so that Google can find any page on the internet simply by knowing its selector and offset. According to the article, Google has exhausted these values as well, and will introduce something called "protected mode page rank" where the selector is shifted farther to create a greater range of values.
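For anyone who missed the reference: selector-shifted-by-4-plus-offset is how x86 real mode builds addresses. A minimal sketch of that arithmetic, assuming 16-bit selector and offset values as on the 8086 (the function name is made up for illustration, and none of this describes anything Google actually runs):

```python
def linear_address(selector: int, offset: int) -> int:
    """Real-mode style address: selector shifted left 4 bits, plus offset.

    The 16-bit limits are an assumption borrowed from the 8086, not from
    the comment above.
    """
    if not (0 <= selector <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("selector and offset must each fit in 16 bits")
    return (selector << 4) + offset


# Different selector:offset pairs can name the same location.
same_a = linear_address(0x1234, 0x0010)
same_b = linear_address(0x1235, 0x0000)

# The highest reachable address is only a little over 1 MiB.
highest = linear_address(0xFFFF, 0xFFFF)
```

Which is why the scheme runs out so quickly: two 16-bit values buy barely 20 bits of range, not 32.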
Unknown host pong.
I thought I found it!
{} ------ When I think of a good sig, I'll put it here
So says Andrew Orlowski. Remind me why we take him seriously?
No statement is true, not even this one.
Do what I do when the toilet bowl is full of crap - FLUSH.
Unrelated to the main body of the article, but the "OneWebDay" mentioned in the snippet at the bottom describes a symbol as "Three middle fingers outstretched with the thumb and little finger touching". Since when have web developers been associated with The Scout Movement?
How many people can read hex if only you and dead people can read hex?
Google can't be full, there's still space left on Earth!
--Udo.
Web masters have been forced to go out of their way to optimize to "what Google likes" and cut out flash, ajax, and everything else that the Google bot can't crawl for waayy too long. It's about time Google started paying the price for it.
I don't know. Spanish is not my first language. In English, please.
Well given that a human would have a hard time deciding if the page was autogen'ed if the text was in their second language, this *is* quite an issue.
So it sounds like Google needs to *shudder* have a user feedback system where humans with logins add moderation metadata to the search results and in return get results based on this moderation en masse.
I know what you're thinking,
It would withstand abuse since a massive amount of human-entered data would keep spambots from trying to exploit the moderation system. What's more, their toolbar could incorporate the control to flag a page as autogen'ed garbage.
You are checking your backups, aren't you?
...then eventually the spam sites will actually contain the information you were looking for.
On Slashdot we celebrate Cinco de Mayo, you insensitive clod!
Teh suxor.
Nerd rage is the funniest rage.
Those 300 GB hard drives they bought from Best Buy turned out to only have 279 GB!!!!
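The joke lands because drive makers count decimal gigabytes while most operating systems report binary ones. A quick sketch of the arithmetic (the function name is only for illustration):

```python
GB = 10**9    # decimal gigabyte, as printed on the box
GIB = 2**30   # binary gigabyte, as many operating systems report it


def advertised_gb_to_gib(advertised_gb: float) -> float:
    """Convert a drive's advertised decimal capacity to binary gigabytes."""
    return advertised_gb * GB / GIB


reported = advertised_gb_to_gib(300)  # a little over 279
```

No bytes were lost, just relabeled.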
Money. As long as the internet was predominantly non-commercial, it was peachy. Remember the days? About 15 years ago? The net was great.
Then money came. dot-com came. And the turd started hitting the fans.
Now, I'm not saying to "outlaw" making money on the net. As much as I'd enjoy the "free and open" net of the old days, without people making (or hoping to make) money from the internet, we would still be hanging on dialup and paying inane amounts of cash for it. But it's time for some radical changes.
1. Educate the people. Educate them and tell how to tell the con artists from the content. Spam would be no problem if it didn't pay off, if there weren't so many falling for it. Outlawing spam is pointless. Inform people that there is no such thing as a free dinner and even less that their penis grows to horse size by ordering some junk online.
2. Inform the people around you what's going on in the 'net, legally, and how it affects them. They only know what the media spins for them. Tell them the whole story. And tell them it's time to put some pressure on some politicians. So far I'm still waiting for politicians to consider the 'net and 'net issues as something to address in their election campaigns. Drown them in letters so they start addressing the issue.
3. DDoS the spammers and linkfarmers. Yes, it's illegal. Yes, I don't give a fuck. No, not the sender. It's more likely than not a hijacked PC. DDoS the linked page. Blow the one who decided that spam is the way to advertize his service off the net. Don't worry, you won't start a war. That's already running. Needn't do it right away, but I'd reserve that as an option if the rest fails.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
Delete from internet.world
where lower(page_text) like '% beastiality%'
or lower(page_text) like '% lose weight%'
or lower(page_text) like '% refinance%'
or lower(page_text) like '% ebay%'
or lower(page_text) like '% make money fast%'
or lower(page_text) like '% enlarge your%'
or lower(page_text) like '% teens%';
commit;
It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning
3. DDoS the spammers and linkfarmers. Yes, it's illegal. Yes, I don't give a fuck. No, not the sender. It's more likely than not a hijacked PC. DDoS the linked page. Blow the one who decided that spam is the way to advertize his service off the net. Don't worry, you won't start a war. That's already running. Needn't do it right away, but I'd reserve that as an option if the rest fails.
Careful, that linked page is 99.9% likely to be a legitimate user's hacked hosting account. What's faaaaaar more effective is a phone call (or even an email!) to the hosting company. When I worked support for a hosting company and I got a call about this, it'd take me all of 2 seconds to suspend the account.
DDoSing the linked page is:
1. no skin off of the spammer's nose
2. a pain in the ass to the hosting company
3. far more time-consuming and less effective than a quick phone call.
We're smarter than those spammers, let's act like it.
Sony ha
Did you have to massage that? Or do you have a gift?
Man, you really need that seminar!
.. Which curves? The broadband adoption curve vs. the cost per gigabyte of commodity drives curve.
Google surfs in there, adding new hardware with cheaper storage whilst decommissioning older hardware (in theory, since it's more power-efficient to do more with newer hardware, and power is only gonna be gettin more expensive, better to ditch boxes when they are, say, 50% less cpu/storage capable than a box with the equivalent amp requirement) so considering how cheap storage has gotten and how quickly it has gotten there, this may just be a bit of underpromising going on.
I just hope they recycle the boxes appropriately as they're rotated out of the lineup. I can't imagine that they have anything other than token systems (like the Lego server) running that are more than 4 years old...
(Also I have to wonder, with the power requirements, if they're interested in having their own custom low-power x86-compatible chipsets and motherboards built, maybe SBCs...)
I think I had a nerdgasm from just reading that.
Can you imagine a Beowulf cluster of Googles?
No, but this being Cinco de Mayo and all, I can imagine a cluster of Beer-wulf Goggles!
Sony ha
Sooner or later, Google will be unable to cope with the data on the ever increasing web servers. Also, the unindexed-yet-could-be-shared data on our desktop exceeds the data indexed by google by a factor of hundreds.
Solution? go p2p! Use software like Krawler[x] or iMeem to share and access content and communities. Heck, Krawler[x] even does p2p *full-text* search on all kinds of documents. Create self sustaining communities instead of getting into the ad-based muck of today's web world...
It used to be that Yahoo Slurp was the most aggressive crawler. Last year whenever I launched a new site Slurp was first and most aggressive. Now, oddly enough, the most aggressive is Googlebot, followed by the Ask Jeeves Teoma bot.
Also one other thing I have noticed is that Google is now aggressively downloading MP3s.
By looking in the logs, you can see the future in the past!
Deleting won't help at this point. It's gotten to the point where it needs a format and a clean install.
Meanwhile we can decide which websites are no longer needed and not bother reinstalling them, because they're crap anyway, take up space, and are hard to remove.
HD Trailers
selective quoting on slashdot? ZOMG!
Can a concept similar to BlueFrogs be utilized for weeding out these sites? For example a toolbar in Firefox that allows you to tag sites as spam and the results being transmitted to Google / Yahoo etc (any that want them) and they could incorporate those results in reducing weightage of a site? Or to take it a step further a la BlueFrog actually accumulate those results daily / hourly and complain to the host / registrar etc?
Fuck the bloggers, delete them all and nobody will miss them!
The vast majority of the world goes online to get information and receives opinion, WTF!
I don't give a shit what you think!
I don't know if that's related but I noticed that googling for "en" stopped rendering en.wikipedia.org as the first match. I used to just type "en" in the firefox address bar to go there. Now I have to type the whole name or use bookmarks.
So the site that gets updated has links to it that Google thinks are good, and the site that doesn't get updated doesn't have good linkage. That is to say, if it would come up at the top of the list in a Google search, it gets scanned more often, but if it would come up on page 32 of 32, it gets scanned very very rarely.
- "History shows again and again how nature points out the folly of men" -- Blue Oyster Cult, 'Godzilla'
I suspect they've ratcheted up the "popular" part of the search to the exclusion of actually matching keywords.
My crappy little site doesn't get any hits in the first 4 pages for a search of "mcgrew", despite the fact that the word "McGrew" is in the URL, the copyright notice, an alt tag, and there's even a copy of "Dangerous Dan McGrew" on the site.
I used to have another site back in the last century that regularly got linked by Blue's News, Planet Quake, sCary's, and tons of small sites. Five years ago "mcgrew" would have brought up the first page.
My pagerank is a negative number now? =(
Well, not negative; if I put in "antique sheet music mcgrew" it comes up after five other results, none of which contains the word "mcgrew." So it's listed, it's just ignoring some of your search words in favor of popularity (of which I've lost all of mine apparently).
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
are the problem. Here's what google needs to do: Every page that has Viagra or Cialis, immediately gets purged. It then will need to add page rendering so it can render pages and then do some sort of pattern matching to look for anything graphic or otherwise that might vaguely be mistaken for viagra or cialis, and nix those pages as well.
Okay, so the people who actually want information on viagra or cialis will have to resort to the old-fashioned way, watching TV, but at least that fixes the internet.
Surely, Google isn't the only one to take advantage of cheap hardware? According to Netcraft the internet doubled in size in the last three years, increasing by 3.1 million new hostnames in April 2006 alone.
Full of crap, that is.
Stop it, you're using up the last space on the Inter.........Oh Shit.
Two days ago, Google seemed to forget what enclosed quotes were for. Also, it is returning pages upon pages of useless "supplemental results" -- I often jump to Page 10 just to try to skip past that.
Let's not even talk about the spam pages. I've emailed suggestions for instance banning domains that use javascript redirects -- you know, you see a SEO page with javascript off and the porno page with it on. No legit site shunts off visitors to third party sites with zero delay.
I've also suggested a Slashdot type moderation system for Google registered users. A page can be moderated up or down -- if a page gets low enough a Google employee can have a look and flush the entire domain forever if need be.
But they're not working on this or any other issue with the search engine. The index doesn't seem to have been updated since February. Image Search is full of images long since gone.
My guess is the article should not automatically be dismissed. My thoughts are Google is wasting entirely too much time taking over the computer world to actually be bothered fixing the search.
vjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45; rlesjbk;smkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk 43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smk n65ed69803-atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgj kl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smkn65ed69803- atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s 5uyispb08sp6tj45;rlesjbk;smkn65ed69803-atg9uee;shj itrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp 6tj45;rlesjbk;smkn65ed69803-atg9uee;shjitrs;y
b iorpeaituoeapbreahbjrel;jgvkl;ajiowapguiovjdksal;t jk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;s mkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk43l;jt43q gjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smkn65ed6980 3-atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl ;s5uyispb08sp6tj45;rlesjbk;smkn65ed69803-atg9uee;s hjitrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08 sp6tj45;rlesjbk;smkn65ed69803-atg9uee;shjitrs;y; rlesjbk;smkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk 43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smk n65ed69803-atg9uee;shjitrs;y
vjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45; rlesjbk;smkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk 43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smk n65ed69803-atg9uee;shjitrs;y
vjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45; rlesjbk;smkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk 43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smk n65ed69803-atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgj kl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smkn65ed69803- atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s 5uyispb08sp6tj45;rlesjbk;smkn65ed69803-atg9uee;shj itrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp 6tj45;rlesjbk;smkn65ed69803-atg9uee;shjitrs;y
vjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45; rlesjbk;smkn65ed69803-atg9uee;shjitrs;yvjdksal;tjk 43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smk n65ed69803-atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgj kl;dfjbkl;s5uyispb08sp6tj45;rlesjbk;smkn65ed69803- atg9uee;shjitrs;yvjdksal;tjk43l;jt43qgjkl;dfjbkl;s 5uyispb08sp6tj45;rlesjbk;smkn65ed69803-atg9uee;shj itrs;y
fjksdl;afjkdsla;jf4rioap5789403qagjkl;jr
fjkds;akfsdla;jfkdls;ajfewiaof???@##@!#fdjksalf;d
vjdksal;tjk43l;jt43qgjkl;dfjbkl;s5uyispb08sp6tj45
you'll get member pages at xanga as #3, myspace as #4
Google needs more pigeons.
Are you accusing The Register of biased reporting?!? I thought they were the gold standard of objectivity, but now I may need to reconsider my stance. I guess I'll just have to rely on The Inquirer for a fair and balanced look at the tech industry.
I've recently been reading a great book called The Google Story which states plainly that google stole the idea of text based, inconspicuous advertising from another company called GoTo.com (which was apparently later renamed Overture Inc.).
It says that google stripped out the idea of paying someone else to do their advertising, and apparently goto also did things like ensure that certain sites were placed higher in relevant search results rather than just displaying them in a side bar, but I figured you'd appreciate the info.
How to use coral cache: http://slashdot.org.nyud.net:8090/~oscartheduck
I believe Alexa tried this, google might want to reimplement this...
... etc etc ...
///
.....
When google bar installed and you are logged into google (gmail or anything) put a little button there :
Rate this site : Search engine spam, good info, mediocre
yes I would click on it (if it is a function that does not take me to 30 other sites and require me to log -in
It is time we start using our customers/visitors/human feedback. ANYONE can generate content from other sites. Just wget whatever, strip the html, mix words into random order, paste at bottom of generated_page_19235534.html and googlebot goes there, indexes it, and you will rank, until some human looks at it, realizes that it is HIS/HER site's content and reports it to the search engines as spam...
oh well, when it happens he has already lost his rankings for duplicate content
10 years ago (or more) my bbs (pcboard) had a very cool feature :
...
.... so rate the damn sites .... put a "this site is good content" "this site is random shit" on the results, so you can click (ajax so it just leaves you on the same page) ...
...
,,, and yes, user interaction could have worked ,,,
/... let people decide what is content and what is a random collection of keywords stuffed into a sentence ..
when new users signed up, old user rated the user upon a question form and decided : stay or go
that worked with 100 people or so (small BBS)
now the net and google is a BIG BBS
user interaction is good, bots are dumb and if you have 100+ sites with different SEO blackhat crap running on them you'll figure what works and will always flood SE results
Oh well yea I lost a business due to people actually stealing hard-work-collected-content from my sites
anyway
NOT A BOT
Everyone knows it's
-Executive Chief Officer Peter Henry Barret
The cesspool just got a check and balance.
quick! Someone tell the RIAA!
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Why not emulate the slashdot/blog-forum mechanism of moderation?
:-)
Allow search engine users to "vote" on the +/- worthiness of search results. Place a strict cap on the number of votes allowed from an IP address to avoid google-bomb type behaviour (perhaps image map the voting buttons to make it harder to vote via web-bot?).
If the search result is spam, mod the link down so others can avoid this crud in the future. Allow modding up too, why not. If a domain gets modded down "too many" times (weight as needed), give the domain bad karma and predispose its servers and pages to negative weightings (too many downmods to crap73.ihostspam.biz downmods the karma of all the *.ihostspam.biz) for a while (tune "parole" period as needed). Also, send the registered owner / webmaster of the domain email or perhaps even paper mail (?) informing him/her of the downgrading of their search results, and why. If the domain is actually colocation hosting for multiple organizations, perhaps the other users there can arrange a suitable "blanket party" for the guilty. E.g. - take down the offending site / server, and/or perhaps physically harm the scumbag responsible in the most egregious cases.
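To make the idea concrete, here is a rough sketch of the per-IP vote cap and the domain-wide karma roll-up. Everything in it is an assumption for illustration: the class name, the cap of 5 votes, the threshold of -3, and the naive "last two labels" rule for finding the parent domain (which would be wrong for things like .co.uk). The parole period and the mail to the webmaster are left out.

```python
from collections import defaultdict

MAX_VOTES_PER_IP = 5       # hypothetical cap per IP address
BAD_KARMA_THRESHOLD = -3   # hypothetical "too many" downmods


class Moderation:
    def __init__(self):
        self.votes_by_ip = defaultdict(int)
        self.domain_score = defaultdict(int)

    @staticmethod
    def parent_domain(host: str) -> str:
        # Naive: keep the last two labels, so crap73.ihostspam.biz
        # and other.ihostspam.biz share one karma bucket.
        return ".".join(host.lower().split(".")[-2:])

    def vote(self, ip: str, host: str, delta: int) -> bool:
        """Record a +1 or -1 vote; return False if the IP is over its cap."""
        if delta not in (-1, 1):
            raise ValueError("delta must be +1 or -1")
        if self.votes_by_ip[ip] >= MAX_VOTES_PER_IP:
            return False
        self.votes_by_ip[ip] += 1
        self.domain_score[self.parent_domain(host)] += delta
        return True

    def has_bad_karma(self, host: str) -> bool:
        score = self.domain_score.get(self.parent_domain(host), 0)
        return score <= BAD_KARMA_THRESHOLD
```

Three downmods from three different addresses against crap73.ihostspam.biz would be enough here to flag every host under ihostspam.biz, which is the collective punishment the post asks for.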
Yeah, user interaction might be asking for too much, but what do the rest of you think?
Yow! I'm supposed to have a plan?
Remove all duplicate posts. That should give us enough time for Taco to buy more hard drives!
Problem solved.
Coderz 4 Life
....can we keep the teens? :D
Visceral Psyche Films
With the gradual advent of the Semantic Web, it should become possible to discriminate between different types of material. In its simplest form, metadata will let you distinguish between the meanings of individual words or phrases, so you can search for "crystal" and specify whether you are interested in rock crystal, crystal glassware, people called Crystal, Crystal Reports, etc.
We could also label content according to its level of "hardness" and objectivity. It would be nice to be able to discriminate between (for example) dictionaries and encyclopedias, technical papers, marketing collateral, and opinion. The further you move towards the "opinion" end of the spectrum, the mushier the process of discriminating gets. Most of us would agree that we trust the Encyclopedia Britannica, the Oxford English Dictionary, or articles in Nature or Scientific American. But opinion, in its very essence, is more controversial. I can think of some people whose opinion I respect and value highly. Others may be very interesting, provocative and knowledgeable - but not necessarily always as sound. At the other extreme, we have masses of blogs and other groups with strikingly low signal-to-noise ratios.
So why not instrument Google and other search engines to prioritize the highest-value material, and index the rest on a "best efforts" basis? There could also be specialist engines for certain special types of material, to give some sort of coverage. Think how nice it would be to search Google Groups for "Java" without being buried in job-related postings. Why not have a separate job search engine - or at least a separate Google option?
I am sure that there are many other solipsists out there.
Delete from internet.world
where page_text ilike '% beastiality%'
or page_text ilike '% lose weight%'
or page_text ilike '% refinance%'
or page_text ilike '% ebay%'
or page_text ilike '% make money fast%'
or page_text ilike '% enlarge your%'
or page_text ilike '% teens%';
commit;
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Well, Artists Against 419 have already beaten you to that idea.
They seem to do pretty well at blowing them off the net, too.
BIKO PLEASE GUYMEN KEEP OFF
I too have thought about using some sort of trust system and have wondered how such a thing may be implemented. Luckily somebody smarter than me has also thought about it, written some research papers, and built a prototype.
;)
... I wonder how much CPU that would take up. Good thing those multicore processors are starting to come out. =/
http://www.cs.cornell.edu/people/egs/credence/
Credence was designed to filter spam from peer-to-peer network searches. The creators implemented a prototype as a Limewire addon. You can read the link to find out how it works, but in summary, the user can rate files positively or negatively. Users are encouraged to rate the files based on whether the metadata (filename, artist, album, etc.) match the actual content. I believe there is a distributed hash table where the results of a user's ratings are stored.
As a user's ratings accumulate, the software finds other users with similar ratings and uses them to determine if search results are spam or not. This creates the feedback effect that you noted. If a spammer injects false votes, their results will not look like your results and you won't trust them.
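Here is a toy version of that feedback effect. It is a simplified stand-in, not Credence's actual algorithm: the agreement measure, the function names, and the choice to ignore (rather than invert) peers who disagree with you are all my own assumptions. Votes are +1 or -1, and each user is a dict mapping item to vote.

```python
def similarity(mine: dict, theirs: dict) -> float:
    """Agreement in [-1, 1] over items both users rated; 0.0 if no overlap."""
    shared = set(mine) & set(theirs)
    if not shared:
        return 0.0
    agree = sum(1 if mine[k] == theirs[k] else -1 for k in shared)
    return agree / len(shared)


def weighted_score(item: str, mine: dict, peers: list) -> float:
    """Sum peers' votes on item, weighted by how much they agree with me.

    Peers who mostly disagree with me get weight 0 (assumption: distrust
    them rather than flip their votes).
    """
    total = 0.0
    for votes in peers:
        if item in votes:
            trust = max(0.0, similarity(mine, votes))
            total += trust * votes[item]
    return total


mine = {"a": 1, "b": -1}
honest = {"a": 1, "b": -1, "c": -1}    # rates like me, says "c" is junk
spammer = {"a": -1, "b": 1, "c": 1}    # rates opposite to me, pushes "c"
score_c = weighted_score("c", mine, [honest, spammer])
```

The spammer's upvote for "c" carries no weight because their history looks nothing like mine, so the honest peer's downvote decides it.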
Of course in order for this to work, you would need to rate a lot of files so you can determine who is a spammer and who is not. Who better to do this than the foremost internet company of our age?
I don't know how well the system would scale though. The web page claims that the program has been downloaded more than 10 000 times, but the network status page is down. I think the last time I saw the network status page, it was around 1-2 thousand active users.
If such a system could be implemented for websites, it would essentially be a fuzzy trust system where you don't need to explicitly declare friends and trusted sources. If something like this were integrated into the google toolbar there could be a button similar to the spam button for emails. The toolbar could probably also have some automatic voting mechanism where commonly visited websites have a small positive vote. (note, massive privacy concerns. Users broadcasting their surfing habits to some publicly accessible DHT) But if it could be done, the system would double as a phishing filter.
One random idea, assume people who have google toolbar installed trust google with their web surfing habits (does the toolbar track its users?). Then the DHT could be made accessible only to Google simply by using a CA type system with the Goog at the root.