Poor Spelling Beats Google's China Filter
antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."
...that not everything can be filtered, but this is a search using the English alphabet. How good (read: horrible) is the filter for searches in the Chinese language?
They called me mad, and I called them mad, and damn them, they outvoted me. -Nathaniel Lee
As we all know, Google has a patented page ranking system that calculates the correlation of words with websites. It does this (primarily) by reading all of its cached pages and parsing their HTML links to determine what words are being used to describe the page each link points to.
A while back, exploiting this was known as Google Bombing, and certain individuals gamed Google's system very effectively by linking to pages with words that, by all rights, were not very accurate. For example, do a Google search for the word 'failure' and the top result is George W. Bush's White House biography.
So what do you do to help the Chinese? Perhaps you could make a page with two columns. In one column would be the correctly spelled key word, as plain text with no link. In the other column would be all the permuted misspellings, each linking to the real sites. You could host this on your website and send it to friends, asking them to host it too. They would need to alter it slightly before hosting it, but it would effectively provide the page ranks for the misspellings and give anyone in China (who has access to your page) a key if they need it.
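To make the link-text idea above concrete, here is a rough sketch in Python. It is a toy, not how Google actually ranks anything: it only counts which URLs each anchor word points at. The names `AnchorCollector`, `build_anchor_index` and `top_result`, and the example.org URLs, are made up for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_anchor_index(pages):
    """Map each lowercased anchor word to a Counter of the URLs it links to."""
    index = defaultdict(Counter)
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        for href, text in parser.links:
            for word in text.lower().split():
                index[word][href] += 1
    return index


def top_result(index, word):
    """Return the URL most often described by `word`, or None if unseen."""
    hits = index.get(word.lower())
    return hits.most_common(1)[0][0] if hits else None


# Hypothetical pages: two of them "bomb" the same target with the same word.
pages = [
    '<a href="http://example.org/bio">failure</a>',
    '<p><a href="http://example.org/bio">miserable failure</a></p>',
    '<a href="http://example.org/other">failure</a>',
]
index = build_anchor_index(pages)
print(top_result(index, "failure"))
```

The point is just that the target page never has to contain the word itself; enough pages linking to it with that word is what pushes it up in this toy index.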
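A minimal sketch of how such a key page could be generated, assuming "misspellings" means every string one edit (delete, swap, replace, insert) away from the key word. That assumption produces hundreds of variants, most of which nobody would ever type, so in practice you would pick the common ones by hand. The function names and the example.org URL are hypothetical.

```python
import string
from html import escape


def misspellings(word, alphabet=string.ascii_lowercase):
    """Return all distinct strings one edit away from `word`, excluding itself."""
    word = word.lower()
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    variants = set(deletes + swaps + replaces + inserts)
    variants.discard(word)  # replacing a letter with itself gives the original
    return sorted(variants)


def key_page(word, target_url, limit=10):
    """Build a two-column HTML table: plain key word | linked misspelling."""
    rows = []
    for variant in misspellings(word)[:limit]:
        rows.append(
            '<tr><td>{}</td><td><a href="{}">{}</a></td></tr>'.format(
                escape(word), escape(target_url, quote=True), escape(variant)
            )
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


variants = misspellings("tiananmen")
print("tienanmen" in variants)  # the misspelling from the article is one edit away
print(key_page("tiananmen", "http://example.org/photos", limit=3))
```

Each friend who hosts a copy could change `limit` or shuffle the rows, which covers the "slightly alter it" part of the idea.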
My work here is dung.
Now, was this simply a failure of the filter method used, or did Google deliberately create a weak filter to subvert the effort?
...search for common mis-spellings such as "Tienanmen" will yield plenty of photos of tanks.
So I did a Google search and all those pictures of tanks are basically one photo hosted on different sites.
LSA (latent semantic analysis) is useful for dealing with synonyms, so I cannot see any reason why it wouldn't work with misspellings (assuming that they're common).
bang goes my karma... again...
First - I don't think it would have any "real-world value". Using words like "warez" may have some "real-world value" but I think the moment some misspelled word becomes a dissident symbol, Google would have to filter it out.
Second - let's not forget that the Chinese don't quite "spell" things when writing. I don't know how well (if at all) Bayesian filtering and the like would work for "kanji" (or whatever they call it).
As I recall, the exact same arguments were made by corporations such as Coca-Cola who did business in South Africa under the Apartheid regime. They claimed they were helping bring about reform from within, giving good jobs to blacks, etc., while incidentally promoting the regime and helping to undercut the resistance.
They're filtering English misspellings, but what about French, Spanish, or German? A Chinese person could just search for what they're looking for in a different language. Granted, English is taught to everyone in Chinese schools, but the folks who know other languages can start finding things and spreading them to the others.