Poor Spelling Beats Google's China Filter
antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."
I are a gud spelr!
The simple truth is that interstellar distances will not fit into the human imagination
- Douglas Adams
that not everything can be filtered, but this is a search using the English alphabet. How good (read: horrible) is the filter for searches in Chinese?
They called me mad, and I called them mad, and damn them, they outvoted me. -Nathaniel Lee
This gives me an idea of how I can get past Bush and Co. monitoring my internet usage. I'll be able to say with a straight face that I never searched for Porn, but rather I was hoping to find information about shellfish
...as a Leader of the Revolution.
Kind of reminds me of when Napster installed that half-assed search filter. Midonna and Mitallica suddenly became quite popular.
People who want to get information will get it, and you can't stop them.
As we all know, Google has a patented page ranking system that calculates the correlation of words with websites. It does this (primarily) by reading links from all of its cached websites and parsing HTML links to determine what words are being used to describe the page in the link.
A while back, exploiting this was known as Google bombing: certain individuals gamed Google's system very effectively by linking to pages with words that, by all rights, did not describe them accurately. After all, do a Google search for the word 'failure' and the top site is George W. Bush's biography on the White House domain.
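To make the link-text idea concrete, here is a toy sketch. It is not Google's actual algorithm, only an illustration that counts which URL each anchor word points to most often. The class and function names and the example.org URLs are made up for the example.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def anchor_index(pages):
    """Map each anchor word to a Counter of the URLs it links to."""
    index = defaultdict(Counter)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.links:
            for word in text.lower().split():
                index[word][href] += 1
    return index


def top_result(index, word):
    """Return the URL most often described by `word`, or None."""
    hits = index.get(word.lower())
    return hits.most_common(1)[0][0] if hits else None
```

With enough pages using the same word in links to the same URL, that URL wins for the word no matter what the target page itself says, which is the whole trick behind a Google bomb.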
So what do you do to help the Chinese? Perhaps you could make a page with two columns. In one column would be the correct text with no link and the key word. In the other column would be all the permuted misspellings with links to the real sites. You could host this on your website and send it to friends asking them to also host it. They would need to alter it slightly before hosting it, but it would effectively provide the page ranks for the misspellings and give anyone in China (who has access to your page) a key if they need it.
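A rough sketch of what generating such a two-column page might look like. The misspelling rules (swap one vowel, or transpose two adjacent letters after the first) are an assumption picked for illustration, and `misspellings`, `key_page`, and the target URL are hypothetical names, not anything from the article.

```python
from html import escape

VOWELS = "aeiou"


def misspellings(word):
    """Return simple one-edit misspellings of `word`, sorted."""
    variants = set()
    for i, ch in enumerate(word):
        # Rule 1: replace a single vowel with a different vowel.
        if ch.lower() in VOWELS:
            for v in VOWELS:
                if v != ch.lower():
                    variants.add(word[:i] + v + word[i + 1:])
        # Rule 2: transpose adjacent letters, leaving the first letter alone.
        if 0 < i < len(word) - 1 and word[i] != word[i + 1]:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    return sorted(variants)


def key_page(word, target_url):
    """Build a two-column HTML table: correct word, linked misspelling."""
    rows = [
        f"<tr><td>{escape(word)}</td>"
        f'<td><a href="{escape(target_url)}">{escape(v)}</a></td></tr>'
        for v in misspellings(word)
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

For "Tiananmen", the vowel rule produces "Tienanmen" among others, so the example from the article falls out of even this crude generator.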
My work here is dung.
This is a perfect example of why I've been saying all along that Google is making the right decision in cooperating with the Chinese Government: http://yro.slashdot.org/comments.pl?sid=175251&cid=14571383
Now, was this simply a failure of the filter method used, or did Google deliberately create a weak filter to subvert the effort?
So.. Chinese people speaking the same broken Engrish on the Internet as they typically do elsewhere beats the Great Firewall of China.
Engrish in the spirit of Freedom!
--- We need more Ron Paul!
It would probably be better to *NOT* point these things out.
LSA is useful for dealing with synonyms, so I cannot see any reason why it wouldn't work with misspellings (assuming that they're common).
bang goes my karma... again...
Thanks for your feedback. We will endeavour to respond to your bug report as soon as possible, and release an update if appropriate.
Sincerely,
Google information liberation management team
Google Inc. "Do no evil."
Chinese web users can see full, uncensored results for their Google search by replacing "&meta=" with "&meta=cr%3DcountryBR" in the URL. Once the string is replaced, the censorship will not affect the results.
This is what a Chinese search for democracy looks like after this method has been applied:
http://www.google.cn/search?hl=zh-CN&q=democracy+china&btnG=%E6%90%9C%E7%B4%A2&meta=cr%3DcountryBR
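For anyone who would rather not edit URLs by hand, the replacement described above can be sketched in a few lines. This only performs the string rewrite from the comment; whether the search engine honors the parameter is outside what the code can show. The function name is made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_country_restrict(url, restrict="cr=countryBR"):
    """Set the `meta` query parameter to `restrict`, URL-encoded."""
    parts = urlsplit(url)
    # Drop any existing `meta` value (including the empty "&meta=").
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "meta"
    ]
    pairs.append(("meta", restrict))
    # urlencode turns "cr=countryBR" into "cr%3DcountryBR".
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

Parsing the query string instead of doing a blind text replace means the rewrite also works when `meta` is missing, and running it twice does not stack the parameter.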
It's not just any picture of tanks; it's the picture of that guy who paused on the way home from shopping to stand in front of four tanks. You know, big metal machines that can squash a pedestrian flat without noticing? Amazingly, as famous as this picture is, it is unknown inside China. My Chinese friends in college had never seen it or anything of those ill-fated demonstrations despite being in Beijing when it was happening. The word on the street in town during the protests was simply that 'something is happening' and everybody better stay in their homes if they know what's good for them. The Chinese government's crackdown on the media is impressively (depressingly?) comprehensive.
I don't know how well (if at all) Bayesian filtering and stuff would work for "kanji".
All right, this question has come up several times in the thread.
The Mandarin dialect has approximately 31 phonetic components. These can be combined as single-phoneme, dual-phoneme, and triple-phoneme groups. Some sounds always stand alone, some combine into triples, some do not. Some phonemes only exist as initials. Some only as finals, etc. etc. The end result is a hundred-odd unique phonetic combinations.
Then there are tones. Five tones per phonetic combination. There are a few sounds that never appear in certain tone patterns, but this is the exception, and not the rule. So this brings us up into mid 3-digits of total possible sound groupings, including intonation.
Now, you've probably heard somewhere that there are thousands of characters. So if there are only a few hundred unique sounds, but thousands of characters, of course, you have homonyms everywhere.
(I was going to do a demo of how this works, but /. doesn't like me writing in hanzi. Go to http://www.zhongwen.com/ and go to the "pronunciation" section of the dictionary. You'll see it as clear as day that way.)
Now, the problem is that there are many characters mapping to each sound. As such, while you can only mess with English words so much before they become unrecognizable (porn, pron, pr0n, prawn, etc.), you can make hundreds of permutations of any common phrase in Chinese simply by swapping out the correct character for a different one.
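To give a feel for how fast the permutations pile up, here is a small sketch that swaps each character of a phrase for same-sounding characters. The homophone table is a tiny hand-picked assumption covering just two syllables (shì and jiè, as in 世界, "world"); a real table would come from a pronunciation dictionary and would be far larger.

```python
from itertools import product

# Hypothetical, hand-picked homophone table: each list shares one reading.
HOMOPHONES = {
    "世": ["世", "市", "事", "式"],  # all read shì
    "界": ["界", "借", "届"],        # all read jiè
}


def homophone_variants(phrase):
    """Return every spelling of `phrase` using same-sounding characters."""
    # Characters missing from the table are kept as they are.
    choices = [HOMOPHONES.get(ch, [ch]) for ch in phrase]
    return ["".join(combo) for combo in product(*choices)]
```

Even with this toy table, a two-character word has 4 x 3 = 12 spellings that sound identical when read aloud, and only one of them is the version a keyword filter would be looking for.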
I am not aware of a Chinese version of l33t-speak. There's trashy, slang Chinese, sure. But either you have the right character, or you don't. Without a standard nomenclature for screwing up words, it becomes hard to try alternate 'spellings' to work around the filter.
This seems as good a place to bring it up as any.
Let's do a thought experiment.
On one side, we have a reasonably interesting search engine company.
On the other, we have a control-minded, autocratic government.
The search engine company (that wants to operate in China) is told by the autocratic government "We don't want Bad Things sneaking in through the search engine. Keep Bad Things out."
The search engine company says "OK. We'll play along. Give us a list of things you don't want to see. We'll get rid of them".
"Taiwan Independence" returns 0 results.
"Free Tibet" is delinked.
Various combinations of Tiananmen, 6 and 4 mysteriously vanish.
Unfortunately, Bad Things do not fit into nice little boxes. People misspell words. While it is easy to come up with a list of sites that contain Bad Things you do not want to see, new sites come up all the time. Is my friend's picture gallery from Tiananmen just some postcards to the folks back home, or is there some subtle political commentary in there? Well, you'll have to read it and find out.
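The misspelling problem is easy to see with a toy exact-match blocklist. This is only a guess at the simplest possible filter, not a description of what Google or anyone else actually runs; `BLOCKED_TERMS` and `is_blocked` are made-up names.

```python
# Hypothetical blocklist handed over by the censor.
BLOCKED_TERMS = {"tiananmen"}


def is_blocked(query):
    """Block a query if any of its words is exactly on the list."""
    return any(word in BLOCKED_TERMS for word in query.lower().split())
```

Here `is_blocked("Tiananmen")` is true while `is_blocked("Tienanmen")` is false, so a one-letter slip walks straight past the list, and every new variant has to be added by hand.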
If I search on (former Taiwanese president) Lee Teng-Hui, does that contain Bad Things? Does it link to Bad Things? How dangerous is a stooped 85-year-old former college professor anyhow?
Is Gandhi axiomatically Bad? Martin Luther King? Dostoyevsky? The list goes on and on and on.
The censors can control the obvious things. Ultimately, they will lose.
The real problem is that China is, for all its faults, a modern country. People come in, people fly out. When I go to China, lots of people ask what's going on in the outside world. I am a little circumspect in what I say, but my memory banks don't magically get erased when I cross over from Hong Kong to Shenzhen. Over 90% of the Chinese students you see toiling away at your local research university will ultimately go home. That's just the way it goes. They too don't forget whatever subversive thoughts may have crept into their heads during five or six years of study abroad.
The deck is stacked, and the good guys will ultimately win.