Slashdot Mirror


Poor Spelling Beats Google's China Filter

antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."

6 of 248 comments

  1. Re:Valuable Lesson from Spammers by hunterx11 · · Score: 2, Informative

    Kanji is the Japanese term for Chinese characters. In Mandarin it is hanzi. For the sake of completeness, it's hanja in Korean.

    --
    English is easier said than done.
  2. Re:Obvious by Heian-794 · · Score: 4, Informative

    I can only add that the Chinese government, with their insistence on the not-at-all-intuitive-to-non-Chinese-speakers romanization system that is Pinyin, have only themselves to blame.

    Ask a number of reasonably educated people whose native languages use the Roman alphabet to listen to a Chinese person pronounce "Tiananmen" and then write down what they think the spelling should be. I guarantee many of them will "misspell" it as "Tienanmen", since the vowel in question is pronounced like the sound that most languages express with an "e".

    Expect more of this as Pinyin isn't going away any time soon.

    (And yes, I do have my flame-retardant jacket, Academic Dispute Wear Edition, all prepared!)

  3. How to Hack Google's censor in China by DigDuality · · Score: 4, Informative

    Chinese web users can see full, uncensored results for their Google search by replacing "&meta=" with "&meta=cr%3DcountryBR" in the URL. Once the string is replaced, the censorship will not affect the results.

    This is what a Chinese search for "democracy china" looks like after this method has been applied:

    http://www.google.cn/search?hl=zh-CN&q=democracy+china&btnG=%E6%90%9C%E7%B4%A2&meta=cr%3DcountryBR
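
    The substitution described above can be sketched in a few lines of Python. This is only an illustration of the string rewrite: the function name force_meta is made up for the example, the value cr%3DcountryBR is the one quoted in this comment, and nothing here checks whether the search engine still honors the parameter.

```python
def force_meta(url: str, value: str = "cr%3DcountryBR") -> str:
    """Set every 'meta=' query parameter in url to the given value.

    Hypothetical helper: it only rewrites the string and sends no request.
    """
    base, sep, query = url.partition("?")
    if not sep:
        # No query string at all, so there is nothing to rewrite.
        return url
    params = query.split("&")
    # Overwrite the meta parameter whether it is empty or already set.
    params = ["meta=" + value if p.startswith("meta=") else p for p in params]
    return base + sep + "&".join(params)


if __name__ == "__main__":
    print(force_meta("http://www.google.cn/search?hl=zh-CN&q=democracy+china&meta="))
```

    Splitting on "&" instead of doing a blind replace of "&meta=" avoids doubling the value when the parameter is already filled in.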

  4. Re:Obvious by Heian-794 · · Score: 3, Informative

    Putko, they did of course have standards, but they only make sense if you already speak Chinese.

    "Tian" does not rhyme with "fan", but somehow, "duo" and "luo" rhyme with "po" and "fo", which do contain "u" sounds in the middle; they just aren't written because plain "po" doesn't exist.

    One of the purposes of pinyin was to potentially replace the character system, so I can understand them not considering the interests of non-native speakers, but if you're going to force it on non-natives too, well, expect spelling "errors" to become unavoidable when they use Chinese.

  5. Re:Valuable Lesson from Spammers by wumingzi · · Score: 5, Informative

    I don't know how well (if at all) bayesian filtering and stuff would work for "kanji"

    All right, this question has come up several times in the thread.

    The Mandarin dialect has approximately 31 phonetic components. These can be combined as single phoneme, dual phoneme, and triple phoneme groups. Some sounds always stand alone, some combine into triples, some do not. Some phonemes only exist as initials. Some only as finals, etc. etc. The end result is a hundred-odd unique phonetic combinations.

    Then there are tones. Five tones per phonetic combination. There are a few sounds that never appear in certain tone patterns, but this is the exception, and not the rule. So this brings us up into mid 3-digits of total possible sound groupings, including intonation.

    Now, you've probably heard somewhere that there are thousands of characters. So if there are only a few hundred unique sounds, but thousands of characters, of course, you have homonyms everywhere.

    (I was going to do a demo of how this works, but /. doesn't like me writing in hanzi. Go to http://www.zhongwen.com/ and go to the "pronunciation" section of the dictionary. You'll see it as clear as day that way).

    Now, the problem is that there are many characters mapping to each sound. As such, while you can only mess with English words so much before they become unrecognizable (porn, pron, pr0n, prawn, etc.), you can make hundreds of permutations of any common phrase in Chinese simply by swapping out the correct character for a different one.
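
    To put a rough number on that, here is a small Python sketch of how same-sound substitution multiplies. The table is entirely made up: "A", "B" and "C" stand in for the characters of a phrase, and the entries after them are placeholders for characters that share the same sound. It is not real dictionary data.

```python
from itertools import product

# Hypothetical homophone table. Keys are placeholder "characters";
# each list holds the original plus made-up same-sound stand-ins.
HOMOPHONES = {
    "A": ["A", "A1", "A2"],
    "B": ["B", "B1"],
    "C": ["C", "C1", "C2", "C3"],
}


def variants(phrase):
    """Return every way to write phrase by swapping in same-sound characters.

    phrase is a list of character tokens; tokens missing from the table
    are kept unchanged.
    """
    choices = [HOMOPHONES.get(ch, [ch]) for ch in phrase]
    return [list(combo) for combo in product(*choices)]


if __name__ == "__main__":
    all_forms = variants(["A", "B", "C"])
    print(len(all_forms))  # 3 * 2 * 4 spellings of one three-character phrase
```

    The count is the product of the number of options per character, so even a handful of homophones per character reaches hundreds of forms for a short phrase.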

    I am not aware of a Chinese version of l33t-speak. There's trashy, slang Chinese, sure. But either you have the right character, or you don't. Without a standard nomenclature for screwing up words, it becomes hard to try alternate 'spellings' to work around the filter.

  6. Using Images To Subvert Filter by Anonymous Coward · · Score: 1, Informative

    In the back of my mind, I am beginning to think that images are the only way to really get the point across. Images are hard to parse for text and meaning.

    If you could post an image of a tank rolling over someone in China with a good embedded caption, it might get the point across without alerting the Chinese government.