Google Admits to Using Sohu Database
prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged Google improperly tapped its database for its Pinyin IME product, stirring controversy over whether the two databases were similar merely because of the normal research process. Today Google admitted that its new product for the Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"
Google doing evil, or sticking it to evil?
~
Leverage! Leverage!
Let no one else's work cut short your edge,
Against the truth you can surely hedge,
So don't cut short your edge,
But leverage, leverage, leverage!
(One man deserves the credit! One man deserves the blame!
And Sergei Brin Ivanovich Lobachevsky is his name!)
"In the future, Google invents a time machine that's used by a rogue employee to travel back in time to give Sohu this database. It's clear then that Sohu stole our database."
I'm sure someone will step up and help them save face in this embarrassing situation! When in doubt, you can always try to change the subject; that worked well in the previous thread. Now that I think about it, we need a RoughlyDrafted-esque site for Google. Anyone up to the task?
"Stolen from Apple Computer" (whole story)
lol, Google may or may not be evil but they can spin doctor with the Microsofts of the world.
Now what could be so wrong about leveraging non-Google resources?
I guess Google Labs will have to subscribe to Turnitin.com now.
Proof by very large bribes. QED.
Could be just a coincidence. Doesn't quantum physics state that essentially anything is possible? /apologist
When caught making a mistake, they admit it, work to resolve it, and move on? ...
I think there are a few other companies who could learn from that approach
and slashdot smells it, lol!
surely after helping so many students copy their research papers you should know the number 1 rule of copying another person's work: Change the F*CKING NAME!
Is this a single isolated incident or simply the first one of more coming from the company that does no evil?
Not in the States at least, AFAIK...
The mistakes were the giveaway. Surely these are "creative works"?
Engineering is the art of compromise.
"Do no evil"
should be changed to
"Do just a tiny bit of evil"
which at this rate will probably end up as
"All your web are belong to us"
It's not stealing. Trivially. Not disputing they probably do some illegal stuff, but illegal doesn't mean wrong.
As far as I can see, google are the greatest force for good (good: destroying copyright law!) in a long time.
So then, the guy who stole my car stereo, was he "leveraging some non-car thief assets"?
I don't respond to AC's.
Talk about drinking the kool-aid...
They're a search engine. They're not curing cancer or solving world hunger. No, they are not the greatest force for good in a long time.
I recommend tagging this "copyvio"
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
As if the Chinese aren't the biggest pirates/copycats around.
Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.
Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.
Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.
For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.
Ok, so we do do some evil, but just with our competitor's code. That isn't so bad, is it?
OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?
I suspect that there's more to this story that we're not hearing.
This reminds me of the recent story about GPL code found in OpenBSD. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.
Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.
> They have not complied with Sohu's requests to date.
:-)
One of Sohu's demands was to remove it. They did that, even prior to the cease & desist deadline, per the article. It sounds like they'll have to compensate Sohu next, which isn't overly surprising. As for where they got it, perhaps someone sold it to them? We don't know, so I'll reserve judgment about whether it was acquired in an un-Google "evil" way until we hear the rest of the story.
> It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
As for the copyright stance, I honestly don't care. Yes, I dislike Microsoft's hypocrisy concerning copyright, but I don't really give a damn about imaginary property at this point in time, and I don't see Google out there telling people that copyright infringement is evil, wrong, Communist and anti-American.
Frankly, I'm more inclined to distribute my works with only one request: that you do not acknowledge my authorship in any way. Of course, almost the only way to enforce that is to post AC
...do as the Romans do?
So now American companies are pirating Chinese software? Oh the irony! :)
Copyrights exist for a reason...read a book or something and figure it out.
Oh, they exist for a reason alright. That's why I oppose 'em! http://piratpartiet.se/
Google should just convince someone plausibly responsible to commit seppuku with the promise their family would be taken care of in thanks for removing their shame.
... Theo De Raadt says that the Chinese are INHUMAN.
*ducks*
Nobody believes the official spokesman, but everybody trusts an unidentified source. -- Ron Nessen
I say Google stops being apologetic and says "so what". After all, China has no respect for U.S. copyrights and patents and steals from us every day.
If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try and tell if/when someone steals their data. If Zyugyz Road in Somecity, CA exists- the legal team fires at will.
It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.
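For what it's worth, checking a suspect dataset for those planted entries is about as simple as it sounds. A rough sketch in Python, assuming both datasets are just sets of street names; the function name and all the data except "Zyugyz Road" are made up for illustration, and this is not how any particular vendor actually does it:

```python
def find_planted_entries(planted, suspect):
    """Return the planted (deliberately wrong) entries that also appear in suspect."""
    return sorted(planted & suspect)

# Hypothetical data: only "Zyugyz Road" comes from the comment above.
planted = {"Zyugyz Road"}
suspect_map = {"Main Street", "Oak Avenue", "Zyugyz Road"}
clean_map = {"Main Street", "Oak Avenue"}

print(find_planted_entries(planted, suspect_map))  # ['Zyugyz Road']
print(find_planted_entries(planted, clean_map))    # []
```

The same idea applies to the shared Pinyin mistakes in the story: an error that exists nowhere in the real world is hard to explain as independent research.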
Please help metamoderate.
Following the protocols established by the recent OpenBSD/Linux Broadcom driver fiasco, the proper response would be to denounce Sohu for having been ripped off by Google.
Shame on you Sohu! This is inhuman!
Google may be filled with the best engineers, but once you move out of North America, they know nothing about ethics or morality.
I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
Forget thrust, drag, lift and weight. Airplanes fly because of money.
After all, we know that all Google employees are under Total Management Mind Control, and that Google Knows Everything Everyone's Doing. It's not even remotely possible that a handful of Google employees in China could shadily cut corners (using an already-extant database instead of compiling one from their own company's data) without Sergey Brin and Larry Page having personally authorized it from Mountain View, or that it would actually take a bit of time for upper management to investigate an issue when it's uncovered.
Good, google admitted it. I bet google contracted a Chinese company to supply them the database though. Apart from that, basically every piece of IP the USA has ever created has been copied by the Chinese and profit has been made. But that doesn't make it right, and google needs to come 100% clean because if we start doing what the Chinese do to us, then there will be no more good people left in the world...
renegadesx got the memo, apparently.
How is Google's pinyin IME better than the tons of other pinyin IMEs out there? I tried it, and apart from having a search button, it doesn't seem to be a whole lot better than the Microsoft Pinyin IME that comes with Windows.
How does Google plan to set themselves apart from the rest of the competition and, even better, how does this fit into the "big picture"? Will the mass of adopters suddenly begin using Google search because it's built into their IME?
But they SAID they weren't evil, therefore that MUST make them good! Or, at least, that is how I fit into my naive worldview! Everything is either absolutely evil (Microsoft) or absolutely good (Google). There is no in-between.
There are a lot of misunderstandings about how IME works and how Google copied non-public databases. So let me explain.
IME accepts keyboard input and converts it into characters of a particular language. There are many different input methods that decide how to generate Chinese characters using an English keyboard, and pinyin is one of them (and the most popular one).
pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin will convert into "" "" ""
... in general, the same sequence could map to many different words (possibly several dozen), and you usually need to select among them by choosing 1, 2, 3, ... (the input bar displays the candidates, sometimes requiring page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.
A good implementation uses the following approaches:
1. adjust each word's position by how frequently it has been used in the past, so the most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
2. allow partial input for common phrases. This inputs a whole phrase at once, with each character requiring only its first English letter. It speeds up input significantly.
So the quality of the pinyin method depends heavily on how well the input method can guess and prioritize the guesses, and thus on the dictionary that is being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.
The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain sohu's dictionary by just installing its input method, and Google has likely obtained it this way. However, I don't think it's in an open-standard format, so Google has probably done some reverse engineering to be able to actually use it in its own software.
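To make the two approaches above concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the class name, the method names and the handful of sample entries are all made up, and it has nothing to do with how Sohu's or Google's dictionaries are actually stored.

```python
from collections import defaultdict

class ToyPinyinIME:
    """Toy pinyin lookup: frequency-ranked candidates plus first-letter phrase input."""

    def __init__(self):
        self.index = defaultdict(list)  # typed keys -> candidate texts
        self.freq = defaultdict(int)    # candidate text -> times selected so far

    def add(self, pinyin, text):
        """Register text under its full pinyin and, for phrases, its initials."""
        syllables = pinyin.split()
        keys = ["".join(syllables)]
        if len(syllables) > 1:
            # Approach 2: "zhong guo" can also be typed as just "zg".
            keys.append("".join(s[0] for s in syllables))
        for key in keys:
            if text not in self.index[key]:
                self.index[key].append(text)

    def lookup(self, typed):
        """Approach 1: most frequently selected candidates come first."""
        return sorted(self.index.get(typed, []), key=lambda t: -self.freq[t])

    def select(self, text):
        """Record that the user picked this candidate."""
        self.freq[text] += 1

ime = ToyPinyinIME()
for ch in ["是", "时", "事", "十"]:  # four of the many characters read "shi"
    ime.add("shi", ch)
ime.add("zhong guo", "中国")

ime.select("事")
ime.select("事")
print(ime.lookup("shi"))  # ['事', '是', '时', '十']
print(ime.lookup("zg"))   # ['中国']
```

A real dictionary has tens of thousands of entries with sensible starting frequencies, which is exactly the part that takes time to build.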
"The internet is about the free exchange of other people's ideas!"
Am I part of the core demographic for Swedish Fish?
I've been thinking about this. Throwing the evilness of Google aside for a moment, why should someone be able to copyright a listing of the phonetic pronunciation of an alphabet?
Let's just imagine how I might create this list. I would have to hire people who spoke Chinese. Then I would ask them to record the pronunciation of each character that they know. This is pretty easy because in Chinese each character has only one pronunciation (per dialect, anyway). There are about 3500 characters that you need to know in order to be literate. And all of these people would have learned these at school.
But how did they learn them? Well, they had a textbook and they memorized the list from the textbook.
Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.
But let's say I do that. Now I have a list of the 3500 most common characters. And with that, I've probably got 99% of everything that's in a newspaper. But that's probably not good enough. I probably want a list of say 60,000 characters. Otherwise it's pretty useless in a general sense. Uncommon characters are uncommon, but you *will* bump into the words over time.
So where do I find these characters? Can I hire some guy that knows them all? It would be very difficult. The best place to look is in a book. But wait... what am I going to do? Every time I find a character my people don't know, look it up in a book? Why don't I just copy it from the book in the first place? That's just copyright infringement again.
Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.
But if I'm going to have a community project like this, what the heck do I need copyright for? What am I protecting? If everyone is going to contribute, everyone should benefit.
So, personally, I don't think one should have copyright on this kind of material (same thing for spelling). It's just not in the public interest. This goes doubly so now that we have the internet and creating these kinds of projects is very inexpensive.
OK, I've gone on long enough... But one more rant. What's with this "do no evil" thing? Isn't that setting the bar a little low? If I told my parents that I'd work hard not to be evil, I think they'd be somewhat disappointed in me. If Google wanted to actually "do some good" rather than "do no evil", they could start a community project to collect this data and share it with the world.
Sigh... I guess we'll have to wait for some guy in his garage (but here's betting that someone has already started something).
That's just like that old story about the resort where there were girls looking for husbands and husbands looking for girls. It's not a symmetrical situation. If BSD coders feel it's all right to give their work away for free to commercial companies, it doesn't mean GPL coders should be forced to do the same. Even if the BSD people disagree about the way GPL people licence their code, they should take care to respect the other point of view.
And you thought Easter Eggs were just there for kicks. ;)
127.0.0.1
TURN ABOUT IS FAIR PLAY.
Ok fine, we have stolen from them before... but Beef and Broccoli don't count.
I am very small, utmostly microscopic.
It's not whether or not they exist for a reason that I question.
It's whether or not they exist for a good reason.
The language isn't copyrighted, and google was more than free to come up with their own dictionary/database. However, in this case they used somebody else's. The infringement is not against the language itself, but against the use of somebody's precompiled database (inclusive of errors, amusingly enough).
its the facts of life.
They're significantly reducing the lock-in to Microsoft products, by encouraging, buying and thereafter funding web application projects that often overlap with what is currently locked in to Microsoft. They even brew some of their own sometimes. They continue the development of Linux and Python with a wide adoption of both. All of these things are creating wealth for everyone, and crippling Microsoft little by little, which we know is what we want. I'd much rather have a Google & Microsoft duopoly if it means Microsoft would finally have to clean up its shit and accommodate whatever open source platform Google would support in that scenario.
Sam ty sig.
Sorry, I was just leveraging some non-personal resources.
Finally, the first (?) crack in the building appears.
Am I just going to have to start-up my own evil-free(tm) company?
The chief of Google China, Kai-Fu Li, used to be Microsoft's vice president, go figure...
In the US, a list of words in lexicographic order is not necessarily copyrightable (eg. phone books).
Is it also so in China? And does China have laws making databases IP like the US?
Americans seem to think that their bizarre and extreme notions of IP are universal law.
Perhaps someone here is an expert on Chinese IP law - did Google-China do anything illegal?
Ummm ... hi there ... Canadian here ... please can we not get dragged into this :-)
]{
Dare not use your real name, eh, anonymous coward? The head of Google China was educated in North America, he worked in North America and he was sent back to China by Microsoft. So where did he learn his engineering ethics? Do you want to compare the number of IT lawsuits going on in America and China? I have to give it to you though. That was a quick one! I can't imagine anyone able to strike so low so fast, except for someone who always has this little hate in mind.
What I think is that you are one among many who are envious that they don't have the ideas and insight that Google has. Google just bought youtube, and somehow you think they have the resources to prevent millions of people from uploading copyrighted content. They are obligated to take down what they are told is copyrighted and they have done that. This works the same for any hosting provider. Scanning books so that the internet public would be able to search for books the way we do for websites was an awesome idea. By returning the names of the books and small quotations, this protects the copyright owners. This was not copyright infringement, because they were not selling the contents of the books, only the ability to search them. I think this is actually great for authors, whether they realize it or not - and it isn't illegal just because some may want it to be. And finally, if you don't like your page cached by Google then exclude google using the robots.txt exclusion STANDARD, as every good webmaster knows. (On second thought maybe you like traffic to your site?)
I'm not sure what happened in this case, but I do know that in American law you cannot copyright something unless it has some artistic value. If what Google took is judged to be just raw data such as a phonebook, have they broken the law? I think the more Google "does no evil" the more people will try to prove them wrong, but the courts will decide ultimately. I don't blame you for being envious, for I am too. But I admit it.
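For anyone who hasn't touched the robots.txt exclusion mentioned above, here's a rough sketch of what it amounts to, checked with Python's standard urllib.robotparser. The rules and the example.com URL are placeholders I made up; treat it as an illustration of the standard, not as a statement of how Google's crawler behaves in every case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents that shut out Googlebot only.
rules = [
    "User-agent: Googlebot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is told to stay away; crawlers with no matching rule are allowed.
print(parser.can_fetch("Googlebot", "http://example.com/page.html"))  # False
print(parser.can_fetch("OtherBot", "http://example.com/page.html"))   # True
```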
and why aren't the wealthy held accountable?
i wonder if there are any ethical wealthy people?
sure doesn't seem so
Sohu cares?
Like every successful hi-tech company, sohu.com is registered in the US or on a Caribbean island and run by western venture capitalist firms.
Now that would make Google guilty.
It's definitely not enough to learn 3500 characters and their meanings. Contemporary Chinese uses mostly two character words. So depending on the context the meaning of every character changes and that needs to be learnt as well. For example xin means heart, yet mostly it is used in conjunction with another word: xiao-xin means careful, xin-xin (not the same characters) means confidence, guan-xin means to be concerned about, xin-li means mentality etc.
There are literally hundreds of dictionary entries containing the character for xin. Granted that you might get an idea about the meaning of a word if you know each of the characters, but mostly you will still have to learn meaning and usage and they will most definitely need separate dictionary entries. E.g. gu-shi means story, shi-gu (same characters) means accident. - So the data structure itself is much more complex than you put it.
Do you have any more detail on that Thai font decision by the way, like what fonts it involved ? The PSL, DS ones ?
Oh please... if Google wanted to distance itself from it, they could have done so long ago. "Sorry, mates, some of our employees fucked up, they've been fired and the offending code/product/database is now being pulled off the market until we build our own replacement."
The whole bullshit, including trying to get away with just deleting the original developers' names, and press releases about "leveraging non-Google assets" is what's damning Google. It's not just that the original incident happened, it's that from there Google seemed to not even understand why it's bad and why the heck should they give a damn. The original incident may have been an individual developer's fuck-up, but from there it's Google and their corporate policies deciding how to deal with it. And how they _did_ choose to deal with it, frankly, stinks.
Yes, no one expects total mind control, but if _also_ the legal team is out of control and answers it in a way unrepresentative of Google, and _also_ the PR team is out of control and pulls a damning "we were just leveraging someone else's resources" statement on their own, etc, then, ffs, they have a problem. At some point you have to assume some responsibility and control, and not just hide behind not knowing what everyone else is doing. If you don't even know what your legal and PR teams are doing at all, even in a public incident, then you better assert some control real fast.
Additionally "do no evil" does imply a dose of responsibility there. You can't say, basically, "oh, the Mafia does no evil, it's just some of our members that we don't really mind-control, that are shooting people or fitting them with cement shoes." If the individual members are free to do evil, and get the company's full backing in some "we were only leveraging other people's resources" statement, then on what do you base that "do no evil" slogan any more?
RL "evil" isn't some "Black And White" game notion, involving actively hating all humanity and actively seeking to do harm, including self-harm, just for harm's sake. And no company does that overtly anyway, so if that's what Google is distancing itself from, then it doesn't say much.
RL "evil", including corporate evil, is more along the lines of not giving a damn about who gets hurt, if it helps you forward your own interests. It's not actively trying to poison a river just for the chuckle of seeing some people get sick, it's not caring who gets sick as long as you saved some money by just dumping your waste in the river. It's not actively trying to get some excuse to shoot some people as a Mafia don, it's about not giving a damn if it takes some corpses to forward your own interests in an area. If shooting some people to make an example is what works, so be it, it's as good a means to an end as any. Etc.
Or to get back to corporations, Enron too didn't make defrauding investors its whole purpose, it just didn't give a damn who gets hurt by their lies. It had no qualms even with advising its own employees to buy stock at a time when management was selling theirs. Again, not because some super-villain at the top had a chuckle at hurting employees, but because they didn't give a damn.
Basically it's not about having some principles to create as much suffering and destruction as possible, it's about lacking the principles and empathy to avoid doing it. That's what corporate evil is: simple sociopathic behaviour.
And if an organization doesn't give a damn at all about what its employees are doing, and who they're hurting, as long as they get the product out the door, then, congrats, it just lost all credibility for some "do no evil" claim. It just showed as many sociopathic tendencies as any other corporation, only maybe in a more decentralized fashion. You know, why have one sociopath at the top coming up with all evil schemes, when you can have a thousand sociopaths in lower positions encouraged to feel free to come up with their own heists.
A polar bear is a cartesian bear after a coordinate transform.
Again, it is not the "known data" that is at question here, but the database as an object in its entirety.
Nobody is accusing Google of "copying Chinese characters", but rather of copying a specific collection that somebody has invested time and money in creating. This is not a corpus, but rather more like a dictionary. Anyone can create one, but google - which I have eminent respect for in other areas, but not this one - has decided to take somebody else's "dictionary" rather than creating their own. The compilation existed as somebody else's work. Likely google could have made an attempt to buy it. Equally likely, they could have produced a similar offering on their own. Instead, they chose to take another group's work and then refused to give said group adequate compensation, or even to admit that they had taken it from said group.
Make that 95%, and count me in as one of those who'd disagree and who're curious as well.
butter the donkey
Google is a corporation and should be dealt with as such. It is not an individual with a single mind or strong beliefs.
It can make mistakes and do evil regardless of what it says. Its primary purpose is to make money, and it can and will do anything to achieve that.
Also, it is easy to do what you believe in with a small group of like-minded people. It is much harder to do so with over 10,000 people, most of whom don't think the way you do or share your moral obligations.
The person responsible for the copying has been sacked. ...
The person responsible for the sacking has been sacked...
All this loose talk about 'plagiarism' and 'stolen' is pointless. Either Google infringed a copyright or they did nothing illegal. Pick one.
There are limits on whether you can copyright facts at all, and they vary from country to country. Does China even have a copyright law that covers dictionaries?
I think GP is a troll, but the actual point is valid. It isn't that parts of the world "know nothing about ethics or morality." It is that other cultures have other standards of ethics and morality. While most cultures have similar basic ethics and morals (do not kill, do not steal--actually a generalization of the first, etc.), something that falls into a gray area like reusing the IP of another will be inconsistent throughout the world. Besides, we don't really have an established moral outlook on IP infringement, which is why we call it "theft" more often than not. It's because theft is the closest thing we know to IP infringement. Hence, it is negative by association.
It is not to say that Google using Sohu's database is OK because it happened in China. If Sohu started using Google's database, Google would likely make a big stink about it too. But it probably isn't as big a deal over there as it is here. Certainly, it would not be considered "evil" behavior. It wouldn't be good either, but it doesn't quite fall into evil yet.
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
You must be Chinese because your English sucks really badly. How about going to ESL and brushing up on your grammar?
meh, the argument for why compilations of public domain "facts" should be considered a copyrightable work is that it is work to compile those facts. Why people can't understand that not all work results in property is beyond me, but there's ya reasoning.
I don't know about in China (does China even have a copyright system to begin with?), but in the U.S., the amount of "work" you put into something doesn't matter one whit in terms of it being copyrightable. You could spend your entire life compiling statistics on something, and at the end of the project, the only thing that you could copyright would be things like the actual typesetting and any copy that you wrote in between the statistics themselves. It's the same thing with recipes: anyone can copy Julia Child's french bread recipe from Mastering the Art of French Cooking, what they can't copy is the text itself describing how to execute/implement that recipe.
But enormous amounts of effort are routinely put into things like mathematical and physics tables (and historically, they were a lot more important than they are now), and the data themselves aren't protected. You can't own the digits of pi, or the atomic weights of the elements, regardless of how much time you spend figuring them out. The problems associated with letting people "own" and claim copyright to bare facts or compilations of facts would greatly outweigh the possible economic benefits of letting people derive additional economic gain from them.
If the Chinese allow companies to copyright bare collections of facts, they're a bunch of idiots.
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
I don't believe that morality comes into it. Possibly ethics, but my limited experience with the US tells me that if you can a) gain advantage, b) get away with it, and c) the exposure is less than the cost of doing it yourself, then you steal/copy/infringe on the "IP". Anything less would be bad business. China isn't so different...
At any rate, since human perception is highly flawed, the practice of "Doing evil to combat perceived evil" can really be reduced to "Doing evil and hoping it limits the evil that others do". However, "Doing evil and hoping it limits the evil that others do" is really the same thing as simply "Doing evil." In fact, it is even worse, because it is really "Doing evil while competing with other evil in the hopes that you are the only one left".
Naturally, once we have truly followed the diabolical nature of this new approach, we are simply left with "Doing evil in the hopes of having a monopoly on doing evil."
Shame on you, google.
It is your personal duty to fight for what is right on a daily basis. Ignoring injustice is identical to approving