Google Faces Plagiarism Questions Over Chinese Software


Posted by Zonk on Sunday April 8, 2007 @07:03AM from the i'll-just-take-a-look-and-yoink dept.

yaohua2000 writes "Google's laboratory in China has launched its first product, a Pinyin Input Method Editor. The software lets users type romanized Chinese (Pinyin) on a QWERTY keyboard and converts it into Chinese characters. Users soon discovered that the data Google used for the product was unusually similar to the data used by a Chinese rival, Sogou. Google has evaded the question about software similarities, reports PC World. 'The similarities, which included an error involving the name of a celebrity, were noted on a Google Labs discussion board about its Pinyin IME. Users noted that entering the Pinyin pinggong into the Google IME incorrectly produced the name of Feng Gong, an actor and comedian.'"


This wouldn't be the first time... by Anonymous Coward · 2007-04-08 07:45 · Score: 3, Interesting

This wouldn't be the first time that Google used other people's software in their live services without due credit.

Another example is the spell checkers that Google's Gmail has for the dozen or so languages it supports. Nowhere to be found is an explanation of where these spell-checkers come from, so it would be safe to assume that Google wrote them themselves, or at least bought them from some company that allowed Google not to credit it? Well, the reality is sadder. It turns out that Google actually uses the free-software project, aspell, to do its spell-checking, and the dozens of person-years that went into writing the actual dictionaries for aspell were simply co-opted by Google. When you spell-check in some language X, you do not see any credit for the person who wrote the dictionary, or for aspell. Even if you look very hard in the documentation, this credit is nowhere to be found. It's all very legal under the GPL, but ugly behavior, especially for scientists (like most of the Google who's-who) who are used to giving credit where credit is due.

And how do I know that Google's Gmail uses free-software spell-checkers? Well, I used a method very similar to that described in the article. I'm the author of one of the dictionaries that Google "adopted", and I deliberately inserted some "misspelled" (aka "easter-egg") words into the dictionary, so I can immediately recognize a spell-checker based on my dictionary - and it turns out that Google's Gmail spell-checker is indeed based on my dictionary.
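
A minimal sketch of the kind of check described above, written in Python. The planted words, the `find_canaries` helper and the stand-in word set are all hypothetical; the actual dictionary and its easter-egg words are not given here.

```python
def find_canaries(canaries, accepts_word):
    """Return the planted words that the suspect checker treats as correct."""
    return sorted(word for word in canaries if accepts_word(word))


# Hypothetical planted entries: deliberately wrong spellings that no
# independently built dictionary should contain.
CANARIES = {"recieveable", "occurrance", "definatly"}

# Stand-in for the spell-checker under test; a real check would query the
# live service instead of a local word set.
suspect_words = {"receive", "occurrence", "recieveable", "definatly"}

hits = find_canaries(CANARIES, suspect_words.__contains__)
print(f"{len(hits)} of {len(CANARIES)} canaries accepted: {hits}")
```

One accepted canary could be a coincidence; several at once are much harder to explain without a common source.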

So it's great that Google reuses other software - free-software and commercial software - but they should learn to give credit where credit is due. It doesn't have to be on the google.com homepage (of course) - even some deep-down help page would do.
1. Re:This wouldn't be the first time... by Dominic_Mazzoni · 2007-04-08 08:20 · Score: 1, Interesting
  
  I work at Google. Email me with more information and I'll pass it on to the Gmail team.
2. Re:This wouldn't be the first time... by cubic6 · 2007-04-08 08:45 · Score: 5, Interesting
  
  Care to release those words that prove that Google uses Aspell? I don't see any proof in your post, just claims that are impossible to verify because you give very little information. You're an author of some dictionary that's used in Aspell, you put intentionally misspelled words in your dictionary, but you don't tell us which dictionary or which words, so what do we have to go by? Why is your post any more trustworthy than any other AC post? Furthermore, it's pretty suspicious that you claim that you INTENTIONALLY put incorrect words in your dictionary to catch people using it as part of a larger project, when such use is perfectly legal. Things like that undermine Aspell's credibility as a reference tool, which, as a contributor, I would think you'd care about.
  
  --
  Karma: Contrapositive
Re:Ironic, isn't it? by Anonymous Coward · 2007-04-08 08:00 · Score: 3, Interesting

Right, of course. It's perfectly ok to refer to the Chinese in a discriminatory way based on a broad generalization. I mean... any decision a corporation in China makes obviously represents the entire country. Just like Diebold and Microsoft do for the US. The Chinese government has refused on multiple occasions to enforce the copyright of others and blatantly turns a blind eye to this sort of behavior. If someone in China were to take the Microsoft source and re-sell it as a Chinese OS, the government would probably smile and buy the OS and say they were "supporting the Chinese economy" or "supporting the Chinese developers". This happened to Cisco, when a Chinese company stole their source and re-sold the exact same product. The government didn't do a damn thing. The country is notorious for this sort of behavior, so the generalization is fair, I would say.

Not to mention, all these observations are made only by the Chinese... that's what "users" means, right? This would be a bad assumption, since I know quite a few people who are not Chinese or of Chinese descent who can at least speak or write some Chinese (either Mandarin or Cantonese, depending on the person).

And, of course, the company is clearly making a huge deal out of this right now aren't they? Even though, according to the article that nobody seems to have read, Sohu.com hasn't actually done anything yet. But I must be new here, too. I am under the assumption everyone actually bothers to read the articles and see anything more than what they want to see. I don't think this matters. It is still fair to say, no one should bitch because it would really be the pot calling the kettle black.
Re:not saying it's the case by Gwwfps · 2007-04-08 09:49 · Score: 2, Interesting

Pinyin Chinese IMEs work primarily from a database of words; algorithms are only used to form a word the first time it is typed. So the issue here is the similarity in the databases used by Google and Sogou. The most damning evidence IMO is that Google Pinyin actually produces the names of several Sogou employees (Zhao Liyang, Tong Zijian, Lu(v) Jieyong), which Sogou apparently put into their word database as a kind of signature. Since the chances of getting anything but famous people's names correct out of the box for any IME are quite low (e.g. Google Pinyin doesn't produce the name of any person in my family the first time), there's definitely something fishy going on here.

As for the typos, most of them can definitely be explained away as coincidences, as they are common pronunciation errors many people will make. The only one in there that can be seen as evidence of plagiarism is "Ping Gong". It's supposed to be "Feng Gong", the name of a Chinese comedian. As even non-Chinese speakers can see, they are nothing like each other pronunciation-wise.
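
To make the comparison concrete, here is a rough Python sketch of how shared errors between two IME word databases could be listed. The tiny databases, the reference readings and the `wrong_entries` helper are made up for illustration; only the pinggong/Feng Gong case comes from the reports.

```python
def wrong_entries(database, readings):
    """Return (pinyin, word) pairs filed under a reading the reference rejects."""
    return {
        (pinyin, word)
        for pinyin, words in database.items()
        for word in words
        # Words missing from the reference cannot be judged, so skip them.
        if readings.get(word) not in (None, pinyin)
    }


# Hypothetical reference: word -> correct toneless pinyin.
READINGS = {"冯巩": "fenggong", "中国": "zhongguo", "北京": "beijing"}

# Hypothetical extracts from two IME databases: pinyin -> words offered.
ime_a = {"pinggong": ["冯巩"], "zhongguo": ["中国"], "beijing": ["北京"]}
ime_b = {"pinggong": ["冯巩"], "zhongguo": ["中国"], "beijin": ["北京"]}

shared = wrong_entries(ime_a, READINGS) & wrong_entries(ime_b, READINGS)
print(shared)
```

Here the second database has an error of its own ("beijin"), which is the common, coincidence-prone kind; only the error present in both is reported.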

There is possibly more to this than Google plagiarising, though. For example, Sogou publicly releases the typos it has fixed; that's how people realized that there are similarities. However, it seems Sogou hasn't actually fixed some of those typos, even though they said in the release that they had. Google has actually fixed all of those already, according to some users on forums.

Maybe both Google and Sogou licensed their databases from a single source? Maybe the parent post is close: since both Sogou and Google have data coming from their respective search engines, maybe the similarity is because people are searching for similar things? However, until Google can come up with an explanation for those employee names being in their database, it is most likely that they copied from Sogou.
Re:This is big news in China by epine · 2007-04-08 10:27 · Score: 5, Interesting

I was involved in a very early effort to develop a pinyin based IME. Think 4.77 MHz. It worked quite well, in fact. Good dictionaries are hard to come by. Back then, not easy at all. In fact, we liberated data quite freely from any resource we could obtain. I made it a policy that each dictionary term had to come from at least two independent sources (sources unlikely to have stolen from each other). The singletons had to be manually reviewed by a qualified linguist. It's like that old saying: stealing from one source is plagiarism, stealing from multiple sources is research.
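
The two-source policy can be sketched in a few lines of Python. The source names, the sample terms and the `triage_terms` helper are hypothetical, and the sketch simply assumes the sources have already been judged independent by hand, which is the hard part.

```python
from collections import defaultdict


def triage_terms(sources):
    """Split terms into accepted (two or more sources) and singletons for review."""
    seen_in = defaultdict(set)
    for source_name, terms in sources.items():
        for term in terms:
            seen_in[term].add(source_name)
    accepted = {term for term, names in seen_in.items() if len(names) >= 2}
    review = set(seen_in) - accepted
    return accepted, review


# Hypothetical word lists, assumed to be independent of one another.
sources = {
    "source_a": {"中国", "北京", "电脑"},
    "source_b": {"中国", "北京"},
    "source_c": {"中国", "词典"},
}

accepted, review = triage_terms(sources)
print("accepted:", sorted(accepted))
print("needs a linguist:", sorted(review))
```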

Eventually I found an extremely effective compression method (the IME portion of our system fit into 128K including dictionary) using a hash table approach. Collisions in the hash table generated spurious terms. The spurious terms that conflicted with legitimate terms were suppressed by a "phantom dictionary". The rest of the phantoms were allowed to remain. These only came up for pinyin bigrams (almost always bigrams) that were non-productive in the stock dictionary. The user supplied dictionary took priority over the system dictionary (and the phantoms it contained) so conflicts didn't arise.

Because of the way the hash table was constructed, our dictionary generated an exponentially increasing number of phantoms with increasing phrase length. By the time you got to four character phrases, the phantoms vastly outnumbered the legitimate vocabulary. Note that our system distinguished 8000 hanzi characters for the input system, so the space of possible four character phrases was up in the trillions, and the phantoms were extremely sparse by that metric, and never seen in the wild.
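
For readers who want a feel for the phantom idea, here is a toy Python sketch under heavy assumptions. It is one possible reading of the scheme, not the original construction: the hash (SHA-256 here), the table size, the candidate characters and the vocabulary are all invented, and the table is a plain set of slot numbers rather than a packed bit array.

```python
import hashlib
from itertools import product

TABLE_BITS = 64  # deliberately tiny so that collisions (phantoms) can occur

# Hypothetical per-syllable candidate characters (toneless pinyin).
CANDIDATES = {
    "zhong": ["中", "种", "重", "钟"],
    "guo": ["国", "过", "果"],
    "bei": ["北", "被", "背"],
    "jing": ["京", "经", "精", "景"],
}

# Hypothetical stock vocabulary: syllable sequence -> legitimate words.
LEGIT = {
    ("zhong", "guo"): {"中国"},
    ("bei", "jing"): {"北京", "背景"},
}


def slot(word):
    """Map a word to a table slot with a stable hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_BITS


def raw_lookup(table, syllables):
    """Every candidate combination whose slot is set, phantoms included."""
    combos = product(*(CANDIDATES[s] for s in syllables))
    return {"".join(chars) for chars in combos if slot("".join(chars)) in table}


def build(legit):
    """Return (table, phantom_dict) for the given vocabulary."""
    table = {slot(word) for words in legit.values() for word in words}
    phantoms = set()
    for syllables, words in legit.items():
        # Spurious hits that would sit beside real terms get suppressed.
        for word in raw_lookup(table, syllables) - words:
            phantoms.add((syllables, word))
    return table, phantoms


def lookup(table, phantoms, syllables):
    """Words offered for a syllable sequence after phantom suppression."""
    hits = raw_lookup(table, syllables)
    return {word for word in hits if (syllables, word) not in phantoms}


table, phantoms = build(LEGIT)
for syllables, words in LEGIT.items():
    assert lookup(table, phantoms, syllables) == words
# A sequence with no legitimate terms may still yield unsuppressed phantoms.
print(lookup(table, phantoms, ("guo", "jing")))
```

Anyone enumerating such a table sees real terms and leftover phantoms mixed together, with nothing in the table itself to tell them apart.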

Any competitor who had decided to enumerate our dictionary (I could have suggested several practical ways to achieve this) would have ended up with barrels of nonsense, unless they also devoted the resources, as we had, to "research" rather than plagiarise.

Nor was it possible to copy our dictionary directly in its compressed format, as the hash function was tied to a hardware dongle. I never heard that the algorithm embedded in the dongle was ever cracked directly, but I do know that the vendor's recommended algorithm for feeding the dongle was awful, and failed most of my statistical tests. We beefed up the routine until many (but far from all) of the statistical tests for randomness were satisfied, and then ran the device at ten times over spec to get the performance we required. Fun times.

A funny story is that our software was listed as "cracked" on some hacker site because some l33t dude had removed the code to test for the presence of a functioning dongle, and the message we displayed "where's your dongle?" (OK, it wasn't quite like that) without noticing that with the dongle absent, the pinyin input method used white noise as the dictionary hash function, and produced nothing but chicken soup for the hanzi output text. To successfully change the hash function and maintain the dictionary compression ratio, you had to solve a bipartite graph matching problem and then recompute the phantom table, and none of that code shipped with the product.

In this era, with the amount of data you can scrape off the internet on the barest whim, I'm a bit shocked that anyone still stoops to our tried and true "research" methodologies from the mid eighties. My involvement ended around 1991 as it became apparent that Windows 3.x was going to take over the world. My joy in life at that time was writing bug-free code, and I didn't see any way to achieve that the way the world was turning. If someone tapped me on the shoulder and woke me up after my fifteen year snooze, I could probably suggest many fascinating IME features I had planned back then that still haven't been implemented, though I haven't checked on this in a long while. We already had simplified/classical, Mandarin/Cantonese working from a single dictionary. It wasn't proper dialectal Cantonese though; that was something I wished to do, but never completed. We did all this pre-Unicode, so we had to invent our own Unicode, too. Anyone need a first edition Unicode standard? I think I've got three.
Input method by DrYak · 2007-04-08 11:09 · Score: 5, Interesting

Just fucking google it ;)

Chinese is a complex language to write. It doesn't use an alphabet (like most Western languages). It doesn't even use syllables (like, for example, two of the Japanese writing systems); it uses logographs: in an over-simplified way, we can say it uses one symbol for every different word/idea/etc.
This makes thousands of different symbols (according to Wikipedia: a little less than 50k variants in the Kangxi dictionary).

This ISN'T something you can put on a regular Western 107-key keyboard.

Therefore you have several solutions:

- Custom keyboards:
Use special keyboards where the couple of thousand most frequent symbols are present.
Not very practical (symbols are harder to find compared to looking for a letter on a 107-key keyboard). Wikipedia has a picture.

- By shape of characters:
Either by handwriting recognition, or by decomposing characters (into their different strokes) and putting those on a regular keyboard layout.

- By sound of words:
Either by using something like Zhuyin, which is a system that was invented to help teach Chinese. It has 37 symbols covering the consonant and vowel sounds of Chinese. As such, it can be used for other purposes, like putting it on a keyboard: the person types the sound and the software guesses the corresponding word/logogram.
Or by using Pinyin, which uses Latin letters to write the sound (and is thus interesting for computers, on which Latin keyboards are widespread).

The mapping of sound to logographs isn't completely straightforward. For example, Chinese is a tonal language, but some systems don't require the writer to specify tones using marks. Some software work is required. And this software isn't infallible.

Google released such software. Users can phonetically type Chinese on any Western keyboard using (tone-less) pinyin, and the software tries to convert it to actual Chinese characters.
This software produces the same correct results as another popular one. (Hopefully. If the Google software didn't give the correct results, there would be problems. It wouldn't be a functional pinyin input system.)
Sometimes the software hesitates and gives a choice of possibilities. Most of the time, they are the same as the competitor's (possibly explained by the fact that both programs have to process the same user input, using the same pronunciation system that isn't unambiguous).
But sometimes the Google software is plain wrong, and produces the same errors as the competitor. And THIS is suspicious, because maybe some part of the software uses pieces from the competitor (part of the algorithm? statistical data?).
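
A very rough Python sketch of the lookup step, assuming a toy lexicon keyed by tone-less pinyin. The entries, the usage counts and the `candidates` function are invented for illustration; real IMEs use far larger databases and smarter ranking.

```python
# Hypothetical toy lexicon: tone-less pinyin -> {word: usage count}.
LEXICON = {
    "shi": {"是": 900, "时": 400, "事": 350, "十": 300},
    "zhongguo": {"中国": 800},
    "beijing": {"北京": 700, "背景": 200},
}


def candidates(pinyin, lexicon=LEXICON, limit=5):
    """Return up to `limit` words for a pinyin string, most frequent first."""
    words = lexicon.get(pinyin, {})
    # Sort by descending count, breaking ties by the word itself.
    ranked = sorted(words, key=lambda w: (-words[w], w))
    return ranked[:limit]


print(candidates("beijing"))       # ['北京', '背景']
print(candidates("shi", limit=2))  # ['是', '时']
print(candidates("pinggong"))      # [] in this toy lexicon
```

Because the output is driven almost entirely by what is in the lexicon, two programs built on the same word data will make the same choices and the same mistakes.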

The company is suing Google on the grounds that if both programs behave the same down to the bugs, maybe some part could have been illegally copied.

Meanwhile, adepts of Google Seppuku rejoiced worldwide at a cheap and easy-to-find piece of software that could also be used to produce random Chinese characters to be subsequently imported into Google as Kanji.

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
true perspective by WindBourne · 2007-04-08 11:55 · Score: 3, Interesting

It is Chinese stealing from other Chinese. Not really surprising since they have no qualms about stealing from any company and then trying to claim it as their own work.

It is also partly why you do not want to use China to do any IP type work. They will steal from others and leave your company at risk, as well as allow other Chinese companies to steal from yours.

Understand that this is simply a big part of who they are now. They have been taught for the last 60 years that all the property belongs to the state and the community. It will be difficult for them to consider private ownership of anything for a number of generations. I am guessing that it will end about the time that China considers itself a superpower (which will happen). Sadly, that may be when a war occurs between China and (America|Russia|Europe|India). Offhand, I am guessing Russia. They will need a number of their resources (land, water, oil, etc).

--
I prefer the "u" in honour as it seems to be missing these days.