Plagiarism has been confirmed officially by Google, Sohu and IDG news reporter Sumner Lemon.
Google admits word database came from third party - Network World
http://www.networkworld.com/news/2007/040907-update-google-admits-word-database.html
An earlier report by the same reporter: Sohu to Google: Take down copycat software
http://www.networkworld.com/news/2007/040707-sohu-to-google-take-down.html
Google China's Official Apology to Sohu.com (in Chinese)
http://googlechinablog.com/2007/04/blog-post.html
You are wrong on two counts.
1. There is no one-to-one mapping between Pinyin and Chinese characters; one Pinyin syllable corresponds to about 17 Chinese characters on average. To improve first-choice accuracy in Pinyin-to-character conversion, an IME needs a Chinese word list together with word-frequency information and phonetic annotations (one Chinese character may have several different pronunciations) to serve as its language model (think of continuous speech recognition in English). Such data are usually derived from huge corpora, manually checked and proofread search-engine keywords, and so on. They take a huge amount of time to maintain and are copyrighted. Some public-domain data exist in this area, but they perform far worse than the proprietary data maintained by Sogou Pinyin.
2. Sohu.com (NASDAQ: SOHU), the owner of Sogou Pinyin Input, has nothing to do with Baidu (NASDAQ: BIDU). Sohu, Baidu, and Google.cn are simply three competitors in China.
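To make point 1 a bit more concrete, here is a minimal Python sketch of how a word list with frequencies and phonetic annotations could drive first-choice ranking. Everything in it is an assumption for illustration: the `LEXICON` entries, the frequency counts, and the `build_index`/`lookup` helpers are made up, and a real IME such as Sogou Pinyin certainly uses a far larger lexicon and a more sophisticated language model.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (word, pinyin reading, frequency count).
# The words are real, but the counts are invented for illustration.
LEXICON = [
    ("北京", "beijing", 9000),
    ("背景", "beijing", 4000),
    ("背静", "beijing", 10),
    ("银行", "yinhang", 5000),  # the character 行 is read "hang" here
    ("行人", "xingren", 1200),  # the same character is read "xing" here
]


def build_index(lexicon):
    """Group words by pinyin reading, most frequent word first."""
    index = defaultdict(list)
    for word, pinyin, freq in lexicon:
        index[pinyin].append((word, freq))
    for entries in index.values():
        entries.sort(key=lambda item: item[1], reverse=True)
    return index


def lookup(index, pinyin):
    """Return candidate words for a reading, best guess first."""
    return [word for word, _ in index.get(pinyin, [])]


INDEX = build_index(LEXICON)
print(lookup(INDEX, "beijing"))  # ['北京', '背景', '背静']
```

The ranking is only as good as the frequency data behind it, which is why the word list and its statistics are the valuable part of an IME.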
No, as a native Chinese speaker, I can tell you that most Chinese internet users were *enraged* by Google China's recent public announcement. In it, Google China acknowledged that Google Pinyin "used" data from non-Google sources and said those data had been removed in Google Pinyin's latest update, but it did not acknowledge Sogou Pinyin and has not yet apologized in public for the plagiarism. According to Sogou programmers, there are still undisclosed Sogou easter eggs even in the latest version of Google Pinyin.
Please check the following message for more details. "Google Pinyin's plagiarism" has been one of the most influential internet news stories in China over the past few days. Many Chinese internet users find it funny to see an American company whose motto is "Don't be evil" steal encrypted, copyright-protected data from a competitor in China so blatantly.
http://slashdot.org/comments.pl?sid=229975&cid=18658353
Google Suggest != Pinyin input method. Pinyin (http://en.wikipedia.org/wiki/Pinyin) is a romanization system that represents the pronunciation of Chinese characters and words in the Latin alphabet. A Pinyin input method is a system that translates Pinyin (e.g. Beijing) into Chinese characters (e.g. 北京). Most Chinese people use Pinyin to enter Chinese characters on a QWERTY keyboard. Since there are only around 400 distinct syllables in Chinese and around 6763 commonly used Chinese characters, one Pinyin syllable corresponds to around 17 Chinese characters on average. That is why we need new data sets and algorithms to train a Pinyin input method for higher accuracy. It is completely different from what Google Suggest is supposed to do.
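For anyone who wants to check the arithmetic, here is a quick Python sketch. The two counts come from the comment above. The variable names and the small `HOMOPHONES` sample are my own illustration and nowhere near a complete table.

```python
# Figures quoted above: about 400 distinct syllables and
# 6763 commonly used characters.
SYLLABLES = 400
COMMON_CHARACTERS = 6763

average = COMMON_CHARACTERS / SYLLABLES
print(f"{average:.1f} characters per syllable on average")  # 16.9

# Illustrative sample only: a few of the many characters read "shi".
HOMOPHONES = {"shi": ["是", "时", "事", "十", "市"]}
print(len(HOMOPHONES["shi"]), "candidates for 'shi' in this tiny sample")
```

Typing a single syllable therefore leaves the IME with many candidates, and it has to guess which one you meant.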
There IS plenty of evidence that the Google Pinyin IME (a.k.a. Google Pinyin) copied the data library of the Sogou Pinyin IME (a.k.a. Sogou Pinyin). The Sogou Pinyin developers put some easter eggs into their product (e.g. the names of all the Sogou development team members and a few spelling typos), and programmers at Google China copied all of these easter eggs and typos verbatim into Google Pinyin.
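The reason easter eggs and typos are such strong evidence can be sketched in a few lines of Python. This is only an illustration: the entries below are hypothetical placeholders, not the actual Sogou entries, and `shared_canaries` is a name I made up. Two independently built lexicons will naturally share ordinary words, but they should not share entries that exist only because one team planted or mistyped them.

```python
def shared_canaries(canaries, suspect_entries):
    """Return the planted (reading, word) pairs found in the suspect lexicon."""
    suspect = set(suspect_entries)
    return sorted(entry for entry in canaries if entry in suspect)


# Hypothetical planted entries: a made-up team name and a misspelled reading.
CANARIES = {
    ("kaifazhejia", "开发者甲"),
    ("nihoa", "你好"),  # deliberate typo in the reading
}

# Hypothetical suspect lexicon: ordinary words plus both planted entries.
SUSPECT = [
    ("nihao", "你好"),
    ("beijing", "北京"),
    ("kaifazhejia", "开发者甲"),
    ("nihoa", "你好"),
]

hits = shared_canaries(CANARIES, SUSPECT)
print(f"{len(hits)} of {len(CANARIES)} planted entries found in the suspect lexicon")
```

Sharing "beijing" proves nothing, while sharing every planted entry is very hard to explain by coincidence.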
Sohu.com (NASDAQ: SOHU), the owner of Sogou Pinyin, accused Google China of plagiarism in an official announcement today (in Chinese), asking Google to stop the copyright infringement and apologize to Sohu in public media.
http://tech.sina.com.cn/i/2007-04-08/17041454175.shtml
The PR officer of Google China (NASDAQ: GOOG) also released an official response a few hours later today (in Chinese).
http://tech.sina.com.cn/i/2007-04-08/18351454194.shtml
Google China's official response acknowledged that "the Google Pinyin IME input method included some data not created by Google itself, and those data have been removed in the latest update". The announcement still did not acknowledge the original data creator, nor did it apologize for the copyright infringement. According to Sohu, there are still undisclosed "easter eggs" created by Sogou Pinyin programmers even in the latest update of Google Pinyin.
FYI: Here are screenshots of a few easter eggs and typos from Sogou Pinyin that are found verbatim in Google Pinyin.
http://www.donews.com/Content/200704/69ce12fbc8264b76b78f44791dad8379.shtm