Slashdot Mirror


Poor Spelling Beats Google's China Filter

antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."

1 of 248 comments

  1. Re:Valuable Lesson from Spammers by wumingzi · Score: 5, Informative

    I don't know how well (if at all) bayesian filtering and stuff would work for "kanji"

    All right, this question has come up several times in the thread.

    The Mandarin dialect has approximately 31 phonetic components. These can be combined as single-phoneme, dual-phoneme, and triple-phoneme groups. Some sounds always stand alone, some combine into triples, some do not. Some phonemes exist only as initials, some only as finals, and so on. The end result is a hundred-odd unique phonetic combinations.

    Then there are tones: five per phonetic combination. A few sounds never appear in certain tones, but those are the exception, not the rule. So this brings us up into the mid three digits of total possible sound groupings, including intonation.

    Now, you've probably heard somewhere that there are thousands of characters. So if there are only a few hundred unique sounds but thousands of characters, then of course you have homophones everywhere.
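
    A back-of-the-envelope sketch of that pigeonhole argument, in Python. The specific numbers are assumptions standing in for the rough figures above ("a hundred-odd" combinations taken as 120, "thousands" of characters taken as 5000), and the function name is made up for illustration.

```python
import math

def min_largest_homophone_group(num_characters, num_sounds):
    """Pigeonhole bound: at least one sound must be shared by this many characters."""
    return math.ceil(num_characters / num_sounds)

# Assumed stand-ins for the rough figures in the comment, not measured data.
BASE_COMBINATIONS = 120   # "a hundred-odd" phonetic combinations
TONES = 5                 # tones per combination
CHARACTERS = 5000         # "thousands" of characters

# Upper bound on distinct sounds: every combination in every tone.
max_sounds = BASE_COMBINATIONS * TONES

print(max_sounds)
print(min_largest_homophone_group(CHARACTERS, max_sounds))
```

    Even when every combination is allowed in every tone, which overstates the number of sounds, some sound has to carry several characters. With fewer real sounds the crowding only gets worse.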

    (I was going to do a demo of how this works, but /. doesn't like me writing in hanzi. Go to http://www.zhongwen.com/ and go to the "pronunciation" section of the dictionary. You'll see it as clear as day that way).

    Now, the problem is that there are many characters mapping to each sound. As such, while you can only mess with English words so much before they become unrecognizable (porn, pron, pr0n, prawn, etc.), you can make hundreds of permutations of any common phrase in Chinese simply by swapping out the correct character for a different one.
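
    To make the swapping concrete, here is a rough Python sketch. The HOMOPHONES table is a tiny hand-picked illustration (an assumption, not a real pronunciation dictionary), variants() is a hypothetical helper name, and nothing here checks whether a swapped phrase would still be recognizable to a reader.

```python
from itertools import product

# Toy homophone table (assumption): each character maps to a few characters
# that share its pronunciation. A real table would come from a dictionary.
HOMOPHONES = {
    "天": ["天", "添"],        # tian, first tone
    "安": ["安", "鞍", "氨"],  # an, first tone
    "门": ["门", "扪"],        # men, second tone
}

def variants(phrase, table):
    """Return every spelling of phrase obtained by swapping in same-sound characters."""
    # Characters missing from the table are left as they are.
    choices = [table.get(ch, [ch]) for ch in phrase]
    return ["".join(combo) for combo in product(*choices)]

if __name__ == "__main__":
    results = variants("天安门", HOMOPHONES)
    print(len(results))
    print(results)
```

    The count is the product of the per-character choices, so even this toy table turns a three-character phrase into 2 x 3 x 2 = 12 strings, one of which is the original. With realistic homophone lists the number grows into the hundreds very quickly, which is the problem a keyword filter faces.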

    I am not aware of a Chinese version of l33t-speak. There's trashy, slang Chinese, sure. But either you have the right character, or you don't. Without a standard nomenclature for screwing up words, it becomes hard to try alternate 'spellings' to work around the filter.