Google Admits to Using Sohu Database

Posted by CowboyNeal on Monday April 9, 2007 @11:46AM from the cut-and-paste dept.

prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged Google improperly tapped its database for its Pinyin IME product, stirring controversy on whether two databases were similar just due to normal research process. Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"

13 of 209 comments

Dictionary mistakes. by Tackhead · 2007-04-09 11:53 · Score: 5, Funny

> Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents.
...including the ones for "plagiarize", "research", and apparently a new one for the 2000s under "leverage".
Leverage! Leverage!
Let no one else's work cut short your edge,
Against the truth you can surely hedge,
So don't cut short your edge,
But leverage, leverage, leverage!

(One man deserves the credit! One man deserves the blame!
And Sergei Brin Ivanovich Lobachevsky is his name!)
Google's initial explanation by Anonymous Coward · 2007-04-09 11:55 · Score: 5, Funny

"In the future, Google invents a time machine that's used by a rogue employee to travel back in time to give Sohu this database. It's clear then that Sohu stole our database."
This reminds me of by Diordna · 2007-04-09 11:57 · Score: 5, Interesting

"Stolen from Apple Computer" (whole story)
So... by Anonymous Coward · 2007-04-09 11:58 · Score: 5, Interesting

When caught making a mistake, they admit it, work to resolve it, and move on?
I think there are a few other companies who could learn from that approach ...
Time for a slogan change? by GFree · 2007-04-09 12:12 · Score: 5, Funny

"Do no evil"

should be changed to

"Do just a tiny bit of evil"

which at this rate will probably end up as

"All your web are belong to us"
1. Re:Time for a slogan change? by LarsG · 2007-04-09 12:33 · Score: 5, Insightful
  
  This reminds me of Animal Farm and how the commandments on the barn wall changed.
  
  The people outside looked from Google to MS, and from MS to Google, and from Google to MS again; but already it was impossible to say which was which.
  
  --
  If J.K.R wrote Windows: Puteulanus fenestra mortalis!
Do no evil by z-j-y · 2007-04-09 12:26 · Score: 5, Insightful

Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.

Seriously, this is just pathetic. I am appalled by the Google apologists on Slashdot.

Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.

For those who don't see what the big deal is: the mapping from an ASCII sequence to a Chinese character or phrase is not trivial; it is actually what Chinese input is all about.
1. Re:Do no evil by ShawnDoc · 2007-04-09 13:07 · Score: 5, Insightful
  
  This is a serious problem when dealing with Chinese companies. Now that Google has opened offices in China and has staffed them with native Chinese people, they're going to have a hard time enforcing Western-style ideas about copyright and what constitutes "doing no evil". It's a problem we've run into in the past with our Chinese operations. The way the problem was "solved", by removing the engineers' names but still clearly using the other company's engine (they didn't remove the identical bugs), is something I have seen happen in the past when dealing with our R&D team in China, when we've found them using code they "borrowed" either from open source code or from an engineer's past employer. I've never seen it handled in public like this, however. Google is going to need to take some serious QA steps in their Chinese offices to keep stuff like this from happening again, or else risk their Chinese office ruining the entire company's reputation.
Ironic by smackt4rd · 2007-04-09 12:42 · Score: 5, Funny

So now American companies are pirating Chinese software? Oh the irony! :)
Re:On what do you base your judgment? by Daengbo · 2007-04-09 13:36 · Score: 5, Informative

In my mind, there is some question of whether a database of facts should, in fact (hee hee), be copyrightable at all. The characters were not original. The pinyin is not original. The pinyin for each character is, in fact, well established. Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?

It reminds me of a court case a few years ago in Thailand, where a judge put several Thai fonts into the public domain, stating "No one owns the Thai alphabet. It belongs to the people."

--
Put identity in the browser.
Re:Exactly how did they get a copy of the DB? by tooyoung · 2007-04-09 13:45 · Score: 5, Informative

> OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?
>
> I suspect that there's more to this story that we're not hearing.

Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. Whether or not this data was publicly available is the issue.

For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person who compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
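
The same effect can be sketched in a few lines of Python. This is only a toy: instead of a neural net and a naïve-Bayes classifier it uses a nearest-neighbor and a nearest-centroid classifier, and the one-dimensional "handwriting" data, the names, and the numbers are all made up for illustration.

```python
# Toy illustration: two different learners, one mislabeled training set.
# All data and names here are invented for the example.

# One numeric "feature" per sample: true 'n' samples sit near 0, true 'q' near 10.
true_data = [(0.0, "n"), (1.0, "n"), (2.0, "n"),
             (8.0, "q"), (9.0, "q"), (10.0, "q")]

# The person who compiled the data set swapped the two labels.
swap = {"n": "q", "q": "n"}
training = [(x, swap[label]) for x, label in true_data]


def nearest_neighbor(train, x):
    """Classifier A: copy the label of the closest training sample."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def nearest_centroid(train, x):
    """Classifier B: pick the label whose class mean is closest to x."""
    sums = {}
    for value, label in train:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    means = {label: total / count for label, (total, count) in sums.items()}
    return min(means, key=lambda label: abs(means[label] - x))


# Unseen samples with their true labels.
test_points = [(0.5, "n"), (9.5, "q")]
for x, truth in test_points:
    a = nearest_neighbor(training, x)
    b = nearest_centroid(training, x)
    print(f"x={x}: truth={truth!r}, A says {a!r}, B says {b!r}")
```

The two classifiers share no code, yet both call the 'n' sample a 'q' and the 'q' sample an 'n', because the error lives in the shared training data.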

I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Given that Google is in the web-mining business, I would guess that they just grabbed this data off of the net, and someone forgot to think about whether they had the right to use it.
Tutorial on Chinese input by microbee · 2007-04-09 14:02 · Score: 5, Informative

There are a lot of misunderstandings about how an IME works and how Google copied non-public databases. So let me explain.

An IME accepts keyboard input and converts it into characters of a given language. There are many different input methods that decide how to generate Chinese characters using English keyboards, and pinyin is one of them (and the most popular one).

Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin can convert into many different characters. In general, the same sequence could map to many different words (possibly several dozen), and you usually need to select among them by choosing 1, 2, 3, ... (the input bar displays them for you to choose from, sometimes requiring page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.

A good implementation uses the following approaches:
1. Adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
2. Allow partial input for common phrases. This inputs a whole phrase at once, with each character requiring only its first English letter. It speeds up input significantly.
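
The two approaches above can be sketched roughly as follows. This is a minimal illustration, not how Sohu's or Google's IME actually works: the tiny dictionary, the abbreviation table, and the function names `candidates` and `select` are all hypothetical.

```python
# Toy pinyin IME sketch. The dictionary, counts, and function names are
# hypothetical; a real IME ships a far larger, carefully curated dictionary.
from collections import defaultdict

# Full pinyin -> candidate words (a tiny hand-made sample).
DICTIONARY = {
    "shi": ["十", "是", "事", "时"],
    "nihao": ["你好"],
    "zhongguo": ["中国"],
}

# Abbreviation (first letter of each syllable) -> whole phrases.
ABBREVIATIONS = {
    "nh": ["你好"],
    "zg": ["中国"],
}

# How often the user has picked each word so far.
usage = defaultdict(int)


def candidates(keys):
    """Return candidates for the typed keys, most frequently used first."""
    words = DICTIONARY.get(keys, []) + ABBREVIATIONS.get(keys, [])
    # sorted() is stable, so words with equal counts keep dictionary order.
    return sorted(words, key=lambda word: -usage[word])


def select(keys, index):
    """Pick the candidate at `index` and remember the choice."""
    word = candidates(keys)[index]
    usage[word] += 1
    return word


print(candidates("shi"))   # initial dictionary order
select("shi", 1)           # the user picks the second candidate
print(candidates("shi"))   # the picked word has moved to the front
print(candidates("nh"))    # two keystrokes produce a whole phrase
```

Even in this toy, all of the useful behavior comes from the contents of the tables rather than from the code, which is why the dictionary is the valuable part.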

So the quality of a pinyin input method depends heavily on how well it can guess and prioritize the guesses, and thus on the dictionary being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.

The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary by just installing its input method, and Google likely obtained it this way. However, I don't think it's in an open-standard format, so Google probably did some reverse engineering to be able to actually use it in its own software.
Oblig futurama quote by pedantic+bore · 2007-04-09 14:03 · Score: 5, Funny

"The internet is about the free exchange of other people's ideas!"

--
Am I part of the core demographic for Swedish Fish?