Google Admits to Using Sohu Database
prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged Google improperly tapped its database for its Pinyin IME product, stirring controversy over whether the two databases were similar simply as a result of normal research. Today Google admitted that its new product for the Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"
Google doing evil, or sticking it to evil?
'Mistake' is a bit euphemistic here. The dictionary was never made public, yet Google somehow managed to acquire it. They have not complied with Sohu's requests to date. They dragged their feet over the whole issue and only came clean when there was more than sufficient proof they were infringing.
It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
Actually, when caught, they just removed the developer's names from the dictionary. When a big deal of it was made, *then* they went to town 'not doing evil'. They still haven't said how it happened; I bet they will quietly settle it, and we will never hear more.
--
WHO ATE MY BREAKFAST PANTS?
Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.
Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.
Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.
For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.
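To make that concrete, here is a toy sketch of what such a mapping looks like. The three entries and the `candidates` helper are made up for illustration and have nothing to do with Sohu's or Google's actual dictionaries; a real IME dictionary has hundreds of thousands of entries plus ranking data.

```python
# Toy pinyin-to-Chinese lookup. The entries below are illustrative
# assumptions, not taken from any real IME dictionary.
TOY_DICT = {
    "ma": ["妈", "马", "吗"],   # one syllable, several candidate characters
    "nihao": ["你好"],          # a multi-syllable phrase
    "zhongguo": ["中国"],
}

def candidates(pinyin):
    """Return the candidate characters/phrases for an ASCII pinyin string."""
    return TOY_DICT.get(pinyin.lower(), [])

print(candidates("ma"))
print(candidates("NiHao"))
```

The hard part is not the lookup itself but building and ranking the table, which is why the dictionary is the valuable asset.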
This reminds me of Animal Farm and how the commandments on the barn wall changed.
The people outside looked from Google to MS, and from MS to Google, and from Google to MS again; but already it was impossible to say which was which.
If J.K.R wrote Windows: Puteulanus fenestra mortalis!
This reminds me of the recent story about GPL code found in OpenBSD. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.
Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.
> They have not complied with Sohu's requests to date.
:-)
One of Sohu's demands was to remove it. They did that, even prior to the cease & desist deadline, per the article. It sounds like they'll have to compensate Sohu next, which isn't overly surprising. As for where they got it, perhaps someone sold it to them? We don't know, so I'll reserve judgment about whether it was acquired in an un-Google "evil" way until we hear the rest of the story.
> It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.
As for the copyright stance, I honestly don't care. Yes, I dislike Microsoft's hypocrisy concerning copyright, but I don't really give a damn about imaginary property at this point in time, and I don't see Google out there telling people that copyright infringement is evil, wrong, Communist and anti-American.
Frankly, I'm more inclined to distribute my works with only one request: that you do not acknowledge my authorship in any way. Of course, almost the only way to enforce that is to post AC
Did he leave you an exact copy?
What?
Google may be filled with the best engineers, but once you move out of North America, they know nothing about ethics or morality.
I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.
Forget thrust, drag, lift and weight. Airplanes fly because of money.
After all, we know that all Google employees are under Total Management Mind Control, and that Google Knows Everything Everyone's Doing. It's not even remotely possible that a handful of Google employees in China could shadily cut corners (using an already-extant database instead of compiling one from their own company's data) without Sergey Brin and Larry Page having personally authorized it from Mountain View, or that it would actually take a bit of time for upper management to investigate an issue when it's uncovered.
When caught making a mistake, they admit it, work to resolve it, and move on? ...
I think there are a few other companies who could learn from that approach
What a great approach indeed! Steal, and if caught, deny it a little, then cover it up.
Actually I think Google learned that from someone else's company, or is Google "innovating" here? A debate for the coming generations.
Replace all instances of "Google" with "Microsoft" in your post and see if your argument makes any sort of sense!
According to TFA, the data (which apparently was built by the Sohu company) was not publicly available and was not licensed to other companies. Obviously, the data must exist in some form within the product itself. That would suggest that either the company had some unsecured internal servers, or that Google hired some of their people who conveniently kept a copy of the data, or they figured out how to decode the data dictionary from a copy of the product.
Interestingly, TFA says that Google are now using "tens of thousands" of data points culled from their web crawls, whereas previously the Sohu dataset contained 300,000+ data points. That suggests that a straight web crawl is much less effective than doing the legwork that the Sohu company did. In fact, speculating a little more: 330,000 is the size of the dataset claimed by Sohu, and 300,000 is the overlap size claimed by the company. Assuming Google's product had both web crawl data and Sohu's data initially, that would suggest that Google's web crawl data is only about 30,000 data points, one tenth the size.
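Spelled out, the back-of-envelope arithmetic is just the following. The two figures are the ones quoted from TFA; reading the difference as the size of Google's own crawl data is pure speculation (strictly, it only counts Sohu entries that did not show up in the overlap), and the variable names are made up for this sketch.

```python
# Figures as quoted above; everything else is a speculative assumption.
sohu_dataset_size = 330_000   # size of the dataset claimed by Sohu
claimed_overlap = 300_000     # overlap size claimed by the company

# Speculative step: treat the difference as Google's own crawl-derived data.
speculated_crawl_size = sohu_dataset_size - claimed_overlap
ratio_to_overlap = speculated_crawl_size / claimed_overlap

print(speculated_crawl_size)
print(ratio_to_overlap)
```

That gives 30,000 entries, which is one tenth of the 300,000 overlap and in line with the "tens of thousands" figure.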
In information retrieval, database size tends to matter more than algorithms. For example, one major reason for Google's own superiority over its competitors in web search is that its own webcrawl dataset is at least twice the size of its nearest competitor's. If you look at a company like Ask.com, who are fourth and have some very interesting clustering algorithms based on the Teoma search engine, they would definitely be competitive with Google if they only had a comparable size web crawl database.
Oh please... if Google wanted to distance itself from it, they could have done so long ago. "Sorry, mates, some of our employees fucked up, they've been fired and the offending code/product/database is now being pulled off the market until we build our own replacement."
The whole bullshit, including trying to get away with just deleting the original developers' names, and press releases about "leveraging non-Google assets" is what's damning Google. It's not just that the original incident happened, it's that from there Google seemed to not even understand why it's bad and why the heck should they give a damn. The original incident may have been an individual developer's fuck-up, but from there it's Google and their corporate policies deciding how to deal with it. And how they _did_ choose to deal with it, frankly, stinks.
Yes, no one expects total mind control, but if _also_ the legal team is out of control and answers it in a way unrepresentative of Google, and _also_ the PR team is out of control and pulls a damning "we were just leveraging someone else's resources" statement on their own, etc, then, ffs, they have a problem. At some point you have to assume some responsibility and control, and not just hide behind not knowing what everyone else is doing. If you don't even know what your legal and PR teams are doing at all, even in a public incident, then you better assert some control real fast.
Additionally "do no evil" does imply a dose of responsibility there. You can't say, basically, "oh, the Mafia does no evil, it's just some of our members that we don't really mind-control, that are shooting people or fitting them with cement shoes." If the individual members are free to do evil, and get the company's full backing in some "we were only leveraging other people's resources" statement, then on what do you base that "do no evil" slogan any more?
RL "evil" isn't some "Black And White" game notion, involving actively hating all humanity and actively seeking to do harm, including self-harm, just for harm's sake. And no company does that overtly anyway, so if that's what Google is distancing itself from, then it doesn't say much.
RL "evil", including corporate evil, is more along the lines of not giving a damn about who gets hurt, if it helps you forward your own interests. It's not actively trying to poison a river just for the chuckle of seeing some people get sick, it's not caring who gets sick as long as you saved some money by just dumping your waste in the river. It's not actively trying to get some excuse to shoot some people as a Mafia don, it's about not giving a damn if it takes some corpses to forward your own interests in an area. If shooting some people to make an example is what works, so be it, it's as good a means to an end as any. Etc.
Or to get back to corporations, Enron too didn't make defrauding investors its whole purpose, it just didn't give a damn who gets hurt by their lies. It had no qualms even with advising its own employees to buy stock at a time when management was selling theirs. Again, not because some super-villain at the top had a chuckle at hurting employees, but because they didn't give a damn.
Basically it's not about having some principles to create as much suffering and destruction as possible, it's about lacking the principles and empathy to avoid doing it. That's what corporate evil is: simple sociopathic behaviour.
And if an organization doesn't give a damn at all about what its employees are doing, and who they're hurting, as long as they get the product out the door, then, congrats, it just lost all credibility for some "do no evil" claim. It just showed as many sociopathic tendencies as any other corporation, only maybe in a more decentralized fashion. You know, why have one sociopath at the top coming up with all evil schemes, when you can have a thousand sociopaths in lower positions encouraged to feel free to come up with their own heists.
A polar bear is a cartesian bear after a coordinate transform.