Poor Spelling Beats Google's China Filter
antifoidulus writes "CNN's money section contains a blurb (among other blurbs) about how poor spelling can beat Google's Chinese filter. The example given in the article is that a search for "Tiananmen" will yield peaceful pictures of the square, but a search for common misspellings such as "Tienanmen" will yield plenty of photos of tanks."
Kind of reminds me of when Napster installed that half-assed search filter. Midonna and Mitallica suddenly became quite popular.
People who want to get information will get it, and you can't stop them.
This is a perfect example of why I've been saying all along that Google is making the right decision in cooperating with the Chinese Government: http://yro.slashdot.org/comments.pl?sid=175251&cid=14571383
Who would have thought a technique spammers use to beat filters would have real-world value?
Is Google's filter Bayesian-based?
Ignorance is curable, stupid is forever.
It would probably be better to *NOT* point these things out.
Can you spell Bukcake? Or Pusy? Or AZZ? Get that by the filters!!!! But seriously, this is where pr0n comes from, the spelling that is, to get by filters...
...and so the weakness of computers is revealed: people and their presumption of perfection.
Sig? - yeah, whatever.
Google has really good suggested search terms for typos. Hint, hint. Skeet, skeet.
A NYC lawyer blogs. http://www.chuangblog.com/
SHUT UP!
Do you want to ruin it?
Come on, damnit! Shutupabout it.
Consider this the "getting your foot kicked under the table" move.
Check out my sysadmin blog!
In Chinese, a single character ( for example -- though I'm not sure if this will display properly) represents a whole syllable (as well as a meaning or idea), rather than a consonant or vowel, as most English letters do (some are unpronounced, or just change the sound of another letter).
This eliminates certain types of bad spellings, obviously, but opens certain avenues that aren't available in English, such as choosing characters with similar meanings but different sounds, or similar sounds but different meanings.
For the Tiananmen example, the characters for TianAnMen () mean "Heaven," "Peace," "Gate." Heaven could be replaced with "Sky," which has a completely different sound, or "Money," which (if I recall correctly) is pronounced "Qian" (Q sounds close to English CH). This could also happen with the other two characters in this word, and of course for many other 'bad' words.
The reason that common words like "pr0n" have become associated with porn, or other examples, is that a community of users agreed upon a certain misspelling of those words, and the same can and WILL happen in China to evade whatever filters search engines use. There is no way to have an even semi-open search system that doesn't allow human ingenuity to overcome its filters, and the brief history of the internet in the west indicates that these filters will, ultimately, be only partially and temporarily effective.
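To make that concrete, here's a rough sketch in Python of why an exact-match keyword filter loses to a misspelling, and why tightening it with fuzzy matching only moves the goalposts. Nobody outside Google knows how the real filter works, so everything here (the blocklist, the function names, the 0.8 cutoff) is made up for illustration.

```python
import difflib

# Hypothetical blocklist; the real filter's terms and logic are not public.
BLOCKED_TERMS = ["tiananmen"]

def exact_filter(query: str) -> bool:
    """Return True if any word in the query exactly matches a blocked term."""
    return any(word in BLOCKED_TERMS for word in query.lower().split())

def fuzzy_filter(query: str, cutoff: float = 0.8) -> bool:
    """Return True if any word is at least `cutoff` similar to a blocked term."""
    for word in query.lower().split():
        # get_close_matches returns the blocked terms similar enough to `word`.
        if difflib.get_close_matches(word, BLOCKED_TERMS, n=1, cutoff=cutoff):
            return True
    return False

if __name__ == "__main__":
    for q in ["Tiananmen", "Tienanmen", "beijing square"]:
        print(q, "exact:", exact_filter(q), "fuzzy:", fuzzy_filter(q))
```

The exact filter catches "Tiananmen" and misses "Tienanmen"; the fuzzy one catches both. But a fuzzy filter still can't do anything about a community agreeing on a substitute word that looks nothing like the original, which is the point above.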
Although the moon is smaller than the earth, it is farther away.
They aren't necessarily out to defeat the determined. They can, however, quickly and easily sanitize the popular perceptions by sweeping things under the rug. To the average citizen, they do a little search and never see anything particularly shocking. Mission accomplished. And as I said, given time, the determined will eventually get their message across. The Internet just adds another layer to a game that's been going on since the dawn of government.
Am I the only one thinking "why are we advertising this so they modify their filters and improve them"? That's great that people are finding ways around the filters... but maybe keep that on the down low??
I'd rather see Google grandstand about not bowing to China's governmental pressure to assist in forceful suppression of ideas. Yes, that may get Google banned in China. However, Google is so big and powerful everywhere else in the world that news of its existence and popularity would become known to some curious folks in China who would begin to resent their government for banning it. In that resentment you'll find the seeds for a transforming change. That's a more self-aware path to change than embracing the half-truth of letting the Chinese people think: "Google? Oh yes. We have that too."
Look... as much grief as Google is getting for this, they know hackers are going to get past the wall. The Great Fire Wall of China will work about as well as the original did. It's there to make a point and it's not going to stop anyone.
1n Ch1n3s3, a s1ngl3 charact3r ( f0r 3xampl3 -- th0ugh 1'm n0t sur3 1f
th1s w1ll d1splay pr0p3rly) r3pr3s3nts a wh0l3 syllabl3 (as w3ll as a
m3an1ng 0r 1d3a), rath3r than a c0ns0nant 0r v0w3l, as m0st 3ngl1sh
l3tt3rs d0 (s0m3 ar3 unpr0n0unc3d, 0r just chang3 th3 s0und 0f an0th3r
l3tt3r).
Th1s 3l1m1nat3s c3rta1n typ3s 0f bad sp3ll1ngs, 0bv10usly, but 0p3ns
c3rta1n av3nu3s that ar3n't ava1labl3 1n 3ngl1sh, such as ch00s1ng
charact3rs w1th s1m1lar m3an1ngs but d1ff3r3nt s0unds, 0r s1m1lar
s0unds but d1ff3r3nt m3an1ngs.
F0r th3 T1ananm3n 3xampl3, th3 charact3rs f0r T1anAnM3n () m3an
"H3av3n," "P3ac3," "Gat3." H3av3n c0uld b3 r3plac3d w1th "Sky," wh1ch
has a c0mpl3t3ly d1ff3r3nt s0und, 0r "M0n3y," wh1ch (1f 1 rcall
c0rr3ctly) 1s pr0n0unc3d "Q1an" (Q s0unds cl0s3 t0 3ngl1sh CH). Th1s
c0uld als0 happ3n w1th w1th th3 0th3r tw0 charact3rs 1n th1s w0rd, and
0f c0urs3 f0r many 0th3r 'bad' w0rds.
Th3 r3as0n that c0mm0n w0rds l1k3 "pr0n" hav3 b3c0m3 ass0c1at3d w1th
p0rn, 0r 0th3r 3xampl3s, 1s that a c0mmun1ty 0f us3rs agr33d up0n a
c3rta1n m1ssp3ll1ng 0f th0s3 w0rds, and th3 sam3 can and W1LL happ3n
1n Ch1na t0 3vad3 what3v3r f1lt3rs s3arch 3ng1n3s us3. Th3r3 1s n0 way
t0 hav3 an 3v3n s3m1-0p3n s3arch syst3m that d03sn't all0w human
1ng3nu1ty t0 0v3rc0m3 1ts f1lt3rs, and th3 br13f h1st0ry 0f th3
1nt3rn3t 1n th3 w3st 1nd1cat3s that th3s3 f1lt3rs w1ll, ult1mat3ly, b3
0nly part1ally and t3mp0rar1ly 3ff3ct1v3.
Are you suggesting that the US population should be considered educated?
You are checking your backups, aren't you?
meh. english romanization is not at all intuitive to non-english speakers: "cough", "ghost", "cant", "cent", "through", "trough". at least pinyin is consistent.
This is a tautology.