Slashdot Mirror


Google Admits to Using Sohu Database

prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged Google improperly tapped its database for its Pinyin IME product, stirring controversy on whether two databases were similar just due to normal research process. Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"

209 comments

  1. Is this... by Hsensei · · Score: 1, Insightful

    Google doing evil, or sticking it to evil?

    --
    ~
    1. Re:Is this... by Anonymous Coward · · Score: 1, Funny

      well considering it's from godless communist china they must be sticking it to evil!
      hooray!

    2. Re:Is this... by renegadesx · · Score: 1

      It's both. Do evil to combat evil. That's the American way now, didn't you get the memo?

      --
      Make SELinux enforcing again!
    3. Re:Is this... by Anonymous Coward · · Score: 1, Funny

      Sorry, I believe that memo was supposed to go down the memory hole. We are at war with East-Asia...

    4. Re:Is this... by Simon+Garlick · · Score: 0, Troll

      American evil is GOODER than dirty stinkin' gook evil.

    5. Re:Is this... by renegadesx · · Score: 1, Insightful

      OK, I was just making a joke; you are being blatantly racist and get modded up

      --
      Make SELinux enforcing again!
    6. Re:Is this... by GCH · · Score: 0

      Since communism does not recognize intellectual property, what is the problem? Or, does that only apply to the proletariat? China needs to decide whether they are capitalist or communist ... this middle road approach is going to only cause more problems.

    7. Re:Is this... by jstomel · · Score: 3, Funny

      Chinks are chinese, gooks are vietnamese. People need to learn to keep their racial slurs straight or soon we won't be able to tell who anybody hates, and that would be terrible!

    8. Re:Is this... by 808140 · · Score: 4, Interesting

      No, actually, "gook" is a term that originated in the Korean war for Korean people. Because many of the soldiers who fought in the Korean war were officers in the Vietnam war, their racial slurs were adopted and modified by a new generation, leading to great confusion about the origins of the term.

      The etymology of the word gook is interesting, because it may be one of the few racial slurs that originated with a people's term for themselves. In Korean, guk means "country" and by extension a country's people; when it is not modified (cf. waiguk, outside country, foreigner) it is understood to be Korea or its peoples. Speakers of Chinese will recognize the word as having Sinitic origin (guó, country, and wàiguó, foreign country, respectively, in Mandarin).

      The term was appropriated by the Americans during the Korean war and used as a racial slur for Korean people in general, which must have been confusing to the Koreans (imagine someone using "American" as a slur for Americans to get an idea). Then, in Vietnam, the old "Asians are all the same" mentality prompted GIs to extend its meaning (imagine "American" being a racial slur for all white people, for example -- yes, I know many Americans aren't white, it's not a perfect analogy, deal with it).

    9. Re:Is this... by zippthorne · · Score: 1

      The middle road approach typically works itself out eventually. See the French Revolution.

      --
      Can you be Even More Awesome?!
    10. Re:Is this... by rm69990 · · Score: 1

      More like Google being a business. Seriously, get over this "Evil" crap, it's marketing speak for crying out loud. Nothing more, nothing less.

      If you actually believe that Google will forgo profits to avoid appearing evil to an extremely small percentage of the population that actually give a shit what Google as a company does, I have a bridge to sell you.

      Every Slashdot user could quit using Google, and the effect on their financial situation would be negligible. So they don't really care what you think, or whether you think they are "evil" or not.

      Nonetheless, evil is an incredibly subjective term.

    11. Re:Is this... by Anonymous Coward · · Score: 2, Insightful

      imagine someone using "American" as a slur for Americans to get an idea

      Why imagine? Come to Europe! But make sure to say you're Canadian...

    12. Re:Is this... by AlecC · · Score: 1

      Something like using "Yankee" or "Yank" to refer to all Americans. But we wouldn't do that, would we?

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    13. Re:Is this... by ThePengwin · · Score: 0, Flamebait

      *pats you on the back*

      It's okay, welcome to the internet.

    14. Re:Is this... by ajs · · Score: 1

      It's neither. It's a mistake, and human beings make mistakes. They hired someone who did the wrong thing, and I'm sure that mistake is being rectified.... Opening foreign offices is tricky stuff, and controlling them is trickier. Google is just starting to figure this out.

    15. Re:Is this... by Anonymous Coward · · Score: 0

      imagine someone using "American" as a slur for Americans to get an idea
      ----
      Don't need to imagine it. Where have you been LOL. American is often used to describe people from the US in a derogatory fashion, often with sneering emphasis on the word "American".

      -AC

    16. Re:Is this... by Anonymous Coward · · Score: 0

      http://www.amen.mobi/

      i THINK THAT IS SIN

    17. Re:Is this... by Dextrously · · Score: 1

      *start off topic* Yeah, I have to say this isn't uncommon. Many already use "American" as an insult referring to U.S. citizens. Although I always found the title "American" to be a bit arrogant. Wouldn't anyone living on the American continent be an American? Most U.S. citizens refer to themselves as simply Americans, but would refer to someone from, say, Venezuela as a Mexican, even though, once again, both Mexican people and Venezuelan people are also Americans. Anyhow, referring to groups of people with derogatory intentions is wrong. No matter what country you live in, no one is beyond the title of idiot. :) *end off topic*

      Getting back to the whole Google using another's database though: I wonder if it was just the actions of one person within the company or actually planned out by management. I might consider this just one person's laziness getting the better of the company, or possibly a misunderstanding about what they can and can't use. Either way, direct copies are no-nos. Even younger children in school know to at least switch words around or intentionally introduce a mistake into their work. ;D

    18. Re:Is this... by Anonymous Coward · · Score: 0

      Imagine someone using "American" as a slur for Americans to get an idea

      Why imagine? Come to Europe! But make sure to say you're Canadian..

      This has been true in most European countries for decades, especially in countries that have a lot of military and business cooperation with the USA. As the War on Terror continues, it has spread.

      Personally, I find it unfair to Canadians and South Americans.

  2. Dictionary mistakes. by Tackhead · · Score: 5, Funny
    > Today Google admitted that its new product for Chinese market 'was built leveraging some non-Google database resources.' The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents.

    ...including the ones for "plagiarize", "research", and apparently a new one for the 2000s under "leverage".

    Leverage! Leverage!
    Let no one else's work cut short your edge,
    Against the truth you can surely hedge,
    So don't cut short your edge,
    But leverage, leverage, leverage!

    (One man deserves the credit! One man deserves the blame!
    And Sergei Brin Ivanovich Lobachevsky is his name!)

    1. Re:Dictionary mistakes. by Anonymous Coward · · Score: 0

      You can't do that on the internet

    2. Re:Dictionary mistakes. by Anonymous Coward · · Score: 0

      You, sir, are a genius.

    3. Re:Dictionary mistakes. by PezJunkie42 · · Score: 1

      If I had mod points today, you would have just won all of them.

    4. Re:Dictionary mistakes. by Anonymous Coward · · Score: 0
      > You, sir, are a genius.

      Naw, he's just leveraging Tom Lehrer.

    5. Re:Dictionary mistakes. by Warg!+The+Orcs!! · · Score: 1

      ...lehrverage....

      --
      Travelling forward in time at a rate of 1 second per second.
    6. Re:Dictionary mistakes. by StuartHankins · · Score: 1

      Tom Lehrer? Wow, haven't heard that in a while.

  3. Google's initial explanation by Anonymous Coward · · Score: 5, Funny

    "In the future, Google invents a time machine that's used by a rogue employee to travel back in time to give Sohu this database. It's clear then that Sohu stole our database."

    1. Re:Google's initial explanation by BungaDunga · · Score: 2, Funny

      In fact, if we hadn't used their database, our employee wouldn't have been able to go back in time to give it to Sohu, and we wouldn't have been able to steal their database. QED.

    2. Re:Google's initial explanation by Anonymous Coward · · Score: 0

      Watch out for that next zebra crossing you come to.

  4. Have no fear! by mattgreen · · Score: 1

    I'm sure someone will step up and help them save face in this embarrassing situation! When in doubt, you can always try to change the subject, that has worked well in the previous thread. Now that I think about it, we need a RoughlyDrafted-esque site for Google, anyone up to the task?

  5. This reminds me of by Diordna · · Score: 5, Interesting

    "Stolen from Apple Computer" (whole story)

  6. "built leveraging some non-Google resources" by Anonymous Coward · · Score: 0

    lol, Google may or may not be evil but they can spin doctor with the Microsofts of the world.

    Now what could be so wrong about leveraging non-Google resources?

  7. Turnitin.com Subscription Coming by slashbob22 · · Score: 3, Funny

    I guess Google Labs will have to subscribe to Turnitin.com now.

    --
    Proof by very large bribes. QED.
  8. Could be just a coincidence. by Anonymous Coward · · Score: 0

    Could be just a coincidence. Doesn't quantum physics state that essentially anything is possible? /apologist

  9. So... by Anonymous Coward · · Score: 5, Interesting

    When caught making a mistake, they admit it, work to resolve it, and move on?
    I think there are a few other companies who could learn from that approach ...

    1. Re:So... by Timesprout · · Score: 4, Insightful

      'Mistake' is a bit euphemistic here. The dictionary was never made public, yet Google somehow managed to acquire it. They have not complied with Sohu's requests to date. They dragged their feet over the whole issue and only came clean when there was more than sufficient proof they were infringing.

      It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.

      --
      Do not try to read the dupe, thats impossible. Instead, only try to realize the truth
      What truth?
      There is no dupe
    2. Re:So... by Breakfast+Pants · · Score: 4, Insightful

      Actually, when caught, they just removed the developer's names from the dictionary. When a big deal of it was made, *then* they went to town 'not doing evil'. They still haven't said how it happened; I bet they will quietly settle it, and we will never hear more.

      --
      WHO ATE MY BREAKFAST PANTS?
    3. Re:So... by Anonymous Coward · · Score: 0

      Or possibly at least one free software project...

    4. Re:So... by inviolet · · Score: 1

      It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.

      Perhaps so. But then, Google has billions of dollars in the bank. They have no need to steal anything from anyone, and every reason not to.

      Can you really suppose that anyone in Google management decided to snag Sohu's database? Google is in the database business, so they know all about the salting of databases. They had to know that any commercial database will be filled with giveaway records (e.g., in this case, the developers' names).

      Probably, Google legitimately acquired the database by subcontracting with some of the locals -- locals who stole it on their own prerogative. And now that it's hit the fan, Google can't say anything in its own defense without making the situation worse.

      Once everyone calms down, I'll bet we learn that a certain acquisition manager at Google got reprimanded for failing to perform proper "due diligence" before approving the purchase from someone who turned out to be shady.

      --
      FATMOUSE + YOU = FATMOUSE
    5. Re:So... by suv4x4 · · Score: 2, Insightful

      When caught making a mistake, they admit it, work to resolve it, and move on?
      I think there are a few other companies who could learn from that approach ...


      What a great approach indeed! Steal, and if caught, deny it a little, then cover it up.

      Actually I think Google learned that from someone else's company, or is Google "innovating" here? A debate for the coming generations.

    6. Re:So... by Anonymous Coward · · Score: 1, Insightful

      Replace all instances of "Google" with "Microsoft" in your post and see if your argument makes any sort of sense!

    7. Re:So... by Anonymous Coward · · Score: 0

      That's what I've been telling people about Microsoft Visual J++ all along.

      Can you really suppose that anyone in Microsoft management decided to snag Sun's programming language? Microsoft is in the database business, so they know all about the features of the programming languages. They had to know that any programming language will be filled with giveaway features (in Java's case, treating everything, except the primitive type, as an object).

      Probably, Microsoft legitimately developed the programming language by subcontracting with some of the locals in Redmond, WA -- locals who stole it on their own prerogative. And now that it's hit the fan, Microsoft can't say anything in its own defense without making the situation worse.

    8. Re:So... by ClosedSource · · Score: 1

      That would make some sense except for the fact that J++ was presented as a Java clone from day one. Sun sued MS on the basis of violating a contract; Sun never claimed that MS had stolen anything, because nothing was.

    9. Re:So... by Anonymous Coward · · Score: 0

      Whoa there, buddy this is Google we're talking about. They weren't "doing evil" in the first place. They were "leveraging". Now they're just "leveraging" in a slightly different way.

    10. Re:So... by aussie_a · · Score: 1

      They didn't make a mistake, they deliberately violated someone's copyright. Of course they fixed it and moved on, however tons of companies DON'T violate people's copyright in the first place. Perhaps Google, as well as some other companies, can learn from those that get it right in the first place.

    11. Re:So... by philpalm · · Score: 1

      Do you believe the other company will receive any compensation? Most likely they will be leveraged out of their niche.

    12. Re:So... by Anonymous Coward · · Score: 0

      'Mistake' is a bit euphemistic here. The dictionary was never made public, yet Google somehow managed to acquire it. They have not complied with Sohu's requests to date. They dragged their feet over the whole issue and only came clean when there was more than sufficient proof they were infringing.

      "Mistake" may well be a euphemism, but "dragging their feet"? Not really. Google is a big company. There were probably three or four programmers who had any knowledge about what was going on. Sohu complains - the legal and PR people get involved, but they knew nothing. The first thing they decided was almost certainly, "Make no move until we are sure what we should do." Hence the delay.

  10. Once again, GOOGLE FARTS! by Anonymous Coward · · Score: 0

    and slashdot smells it, lol!

  11. Cmon Google... by Anonymous Coward · · Score: 3, Funny

    surely after helping so many students copy their research papers you should know the number 1 rule of copying another person's work: Change the F*CKING NAME!

  12. I wonder... by flyboy81 · · Score: 2, Interesting

    Is this a single isolated incident or simply the first one of more coming from the company that does no evil?

    1. Re:I wonder... by themushroom · · Score: 1

      ...while working with the Chinese guvmint?

    2. Re:I wonder... by AmberBlackCat · · Score: 1

      I guess the only thing reasonably certain is it's the first time they got caught.

  13. Are dictionaries copyrightable? by Anonymous Coward · · Score: 0

    Not in the States at least, AFAIK...

  14. Mistakes are by EmbeddedJanitor · · Score: 1

    The mistakes were the giveaway. Surely these are "creative works"?

    --
    Engineering is the art of compromise.
    1. Re:Mistakes are by Anonymous Coward · · Score: 0

      just like they always are. Map makers used to insert tiny mistakes to keep other cartographers from copying their work.

  15. Time for a slogan change? by GFree · · Score: 5, Funny

    "Do no evil"

    should be changed to

    "Do just a tiny bit of evil"

    which at this rate will probably end up as

    "All your web are belong to us"

    1. Re:Time for a slogan change? by Ngarrang · · Score: 2, Funny

      Do no evil, or don't get caught.
      We redefine evil.
      Emulate or Innovate, which ever is more convenient.

      --
      Bearded Dragon
    2. Re:Time for a slogan change? by LarsG · · Score: 5, Insightful

      This reminds me of Animal Farm and how the commandments on the barn wall changed.

      The people outside looked from Google to MS, and from MS to Google, and from Google to MS again; but already it was impossible to say which was which.

      --
      If J.K.R wrote Windows: Puteulanus fenestra mortalis!
    3. Re:Time for a slogan change? by Viceroy+Potatohead · · Score: 1

      "Your search - do no evil - did not match any documents" [or]
      "Did you mean: services?"

      I think it's high time for Google to do an internetectomy to remove references to "do no evil" so they can get back to business as usual, without us calling them on it all the time.

    4. Re:Time for a slogan change? by Dragonslicer · · Score: 1

      "Do just a tiny bit of evil"
      The Diet Coke of evil?
    5. Re:Time for a slogan change? by Anonymous Coward · · Score: 2, Funny

      It's not gotten to that point yet. If you want to figure out which is Google and which is MS, if you're ducking chairs or you hear the distant chant of "developers, developers, developers", it's MS.

    6. Re:Time for a slogan change? by Anonymous Coward · · Score: 0

      I thought the slogan was 'don't be evil'... Now, considering the balance of good/evil that we are currently aware exists within the Google microcosm, do you think they are 'being' evil? Is it *really* necessary for one to 'do no evil' in order to not *be* evil? Or does it mean maintaining a popular perception among the critical that you are mostly good...with a few bad seeds and decisions scattered throughout... This is life, my friend. Not being evil means doing what you think is best...regardless of the rules imposed upon you. The depths of rationalization can go pretty far, but popular perception holds a strong net to catch you if you're willing to stay above ground.

      So far, it seems Google hasn't needed to use the safety net of perception very often... Let's just hope it doesn't tear beneath them. They're pretty heavy.

    7. Re:Time for a slogan change? by nephridium · · Score: 1

      My sentiments exactly. What will prevent Google from becoming evil? The company is growing rapidly. The information it possesses about many people is enough to theoretically pinpoint who/where they are, what their political affiliations might be, what they like to do in their spare time, etc. etc. - THAT is power! And power corrupts. Maybe not now, maybe the "Google guys" have enough foresight and prudence to guard against their company becoming too evil for now, but things are bound to change in a generation or two.

      Animal farm ran very well at the beginning until Napoleon replaced the original "good" leader (Snowball). From then on the rules of Animalism were rewritten and rewritten until it was indistinguishable from the other farms run by (the evil) humans.

      --
      And when you gaze long enough into the code, the code will also gaze into you.
    8. Re:Time for a slogan change? by LarsG · · Score: 1

      It's not gotten to that point yet.

      True, but they seem to be playing on the proverbial greased incline. Some of which isn't really Google's fault but is more a result of them being so big and having to make choices between evils. e.g. China

      --
      If J.K.R wrote Windows: Puteulanus fenestra mortalis!
  16. Re:Do no evil? by Anonymous Coward · · Score: 0

    It's not stealing. Trivially. Not disputing they probably do some illegal stuff, but illegal doesn't mean wrong.

    As far as I can see, google are the greatest force for good (good: destroying copyright law!) in a long time.

  17. Car stereo by DogDude · · Score: 3, Funny

    So then, did the guy who stole my car stereo, was he "leveraging some non-car thief assets"?

    --
    I don't respond to AC's.
    1. Re:Car stereo by iminplaya · · Score: 2, Insightful

      Did he leave you an exact copy?

      --
      What?
    2. Re:Car stereo by Anonymous Coward · · Score: 0

      "Did he leave you an exact copy?"

      So the guy that counterfeited 20 million dollars, did he devalue the dollar and thus everything I've worked for, or was he just "leveraging his laser color printer".

      You're right it's not stealing but it's not exactly harmless either.

    3. Re:Car stereo by iminplaya · · Score: 1

      So you're saying that an extra copy of your stereo diminishes the value of the one you have in your car somehow? Or are you just upset that he "paid" less for it than you did for yours? Should we set a quota of the number of stereos on the street so nobody can sell one for less than the price you paid? Please remember that all value placed on currency is strictly faith based, especially now. If nobody wants your money, it isn't worth dog poop, with or without the counterfeiters.

      --
      What?
  18. Re:Do no evil? by Anonymous Coward · · Score: 0

    Talk about drinking the kool-aid...

    They're a search engine. They're not curing cancer or solving world hunger. No, they are not the greatest force for good in a long time.

  19. New tag: copyvio by Matt+Perry · · Score: 1

    I recommend tagging this "copyvio"

    --
    Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.
  20. Pot meet Kettle by Anonymous Coward · · Score: 0

    As if the Chinese aren't the biggest pirates/copycats around.

    1. Re:Pot meet Kettle by Anonymous Coward · · Score: 0

      and, apparently, cry babies to boot

  21. Do no evil by z-j-y · · Score: 5, Insightful

    Google is going to release a statement that stealing code/data is not evil in China, and Google must fit in local cultures and abide by local laws.

    Seriously, this is just pathetic. I am appalled by the Google apologists on slashdot.

    Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough? They did this by stealing data and who knows what from others. Mind you that the data is not publicly available, so Google must have committed certain crimes to obtain the data.

    For those who don't see what's the big deal: the mapping from ASCII sequence to Chinese character/phrase is not trivial; actually it is what Chinese input is all about.

    1. Re:Do no evil by maxume · · Score: 2, Interesting

      There is no way to tell if the copying was done by 'Google' or if it was done by some engineer on their own. Sure, 'Google' needs to take steps to make sure that what they put out meets some sort of standard, but the backpedaling and whatnot is pretty much the response you would get no matter how the copying was initiated, so there isn't much reason to assume where the responsibility for the copying lies.

      --
      Nerd rage is the funniest rage.
    2. Re:Do no evil by QuantumG · · Score: 2, Insightful

      Or done by a Chinese company which Google outsourced to. Isn't that how all corporations do their evil? Outsource it to Evil Inc. Everyone except Microsoft and Enron I guess.

      --
      How we know is more important than what we know.
    3. Re:Do no evil by homer_s · · Score: 1

      Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough?

      So, offering a 'me too' product is now evil?

    4. Re:Do no evil by ShawnDoc · · Score: 5, Insightful

      This is a serious problem when dealing with Chinese companies. Now that Google has opened offices in China and has staffed them with native Chinese people, they're going to have a hard time enforcing western style ideas about copyright and what constitutes "doing no evil". It's a problem we've run into in the past with our Chinese operations. The way the problem was "solved", by removing the engineers' names but still clearly using the other company's engine (they didn't remove the identical bugs), is something I have seen happen in the past when dealing with our R&D team in China when we've found them using code they "borrowed" either from open source code or from an engineer's past employer. I've never seen it handled in public like this, however. Google is going to need to take some serious QA steps in their Chinese offices to keep stuff like this from happening again or else risk their Chinese office ruining the entire company's reputation.

    5. Re:Do no evil by ReallyEvilCanine · · Score: 2, Insightful
      I'm appalled, too. I'm also surprised. What I'm not is a Google apologist. I still stand by the crux of my comment based on my work in I18N and with IMEs.


      Google must have committed certain crimes to obtain the data.
      No, or at least, "Not necessarily intentionally". The dictionary could've been indexed via the spiders. It could've been indexed via the desktop search app. There are lots of ways that Google could've got the information. Anyone who works for Google, knows the deep ins and outs of their data handling, and who reads and posts on this site ain't gonna tell. As I wrote in the last comment, Google is information. They get it from everywhere, and they know how to store, sort and use it. It may well have been intentional theft, but I don't think Google the corporation has reached the point where they actually believe "All Data Are Belong To Us".

    6. Re:Do no evil by Anonymous Coward · · Score: 0
      They're not using the same engine. They're using (mostly) the same *data*, that was mined from the competitor's program.

      Of course it's still a bad thing.

    7. Re:Do no evil by The_Wilschon · · Score: 1

      When the me-tooist is a corporate giant and the me-firsters are still quite small, the me-tooist will typically crush the me-firsters merely by virtue of its size, name recognition, and ability to lose money on a market for a while in order to gain a monopoly of it.

      Even if they hadn't ganked anybody's data to do it, shoehorning themselves into a market full of players much smaller than themselves is not very nice.

      Gratuitous analogy: Michael Johnson steals a kid's shoes and then wears them to run at a high school track meet.

      --
      SIGSEGV caught, terminating

      wait... not that kind of sig.
    8. Re:Do no evil by Achromatic1978 · · Score: 1

      The dictionary could've been indexed via the spiders.

      The database wasn't bulk browseable.

      It could've been indexed via the desktop search app.

      I certainly hope not. I would be horrified to find that my desktop search database was being uploaded to Google.

      The information was NOT publicly available. Making it out as though Google just happened upon the database because "Google is information" (?!?) just reeks of a new way to spin.

    9. Re:Do no evil by Nazlfrag · · Score: 1

      Well sometimes you just can't find anywhere more evil to outsource to.

    10. Re:Do no evil by adelord · · Score: 1

      This is a serious problem when dealing with Chinese companies. Now that Google has opened offices in China and has staffed them with native Chinese people, they're going to have a hard time enforcing western style ideas about copyright and what constitutes "doing no evil". It's a problem we've run into in the past with our Chinese operations. The way the problem was "solved", by removing the engineers' names but still clearly using the other company's engine (they didn't remove the identical bugs), is something I have seen happen in the past when dealing with our R&D team in China when we've found them using code they "borrowed" either from open source code or from an engineer's past employer. I've never seen it handled in public like this, however. Google is going to need to take some serious QA steps in their Chinese offices to keep stuff like this from happening again or else risk their Chinese office ruining the entire company's reputation.

      Yours is the cleanest explanation for this event I've read so far.
      A little bit of cultural context, and some insight into the difficulties any well-meaning company faces as they grow and grow in size and number of locations- an insightful post. Thanks.
      I'm not a Google apologist, but I tend to give them the benefit of the doubt most of the time. This inclination may be the result of an illusion, but there usually is a "they're doing less than evil" explanation like this one.
      --
      Eugene Debs: "Money constitutes no proper basis of civilization"
    11. Re:Do no evil by asninn · · Score: 1

      Chinese input is a well established market; Google Giant forces itself into the market with a product that is very similar to existing ones and offers no innovation. That is not evil enough?

      Um, no, that's not evil at all - it's called capitalism. Now, you might argue that capitalism in general is evil, but that'd hardly be Google's fault...

      Seriously, if Google doesn't have anything new to offer, no innovations, no improvements or changes over existing products, then they won't do very well in the "well established market"; and if people do flock to them and their product, then maybe they had something to offer after all, even if it was something you didn't/couldn't/didn't want to see. (Of course, that's assuming that they're not using any underhanded or outright illegal tactics to ensure their success, like other companies *coughmscough* did (and continue to do), but I honestly don't see how they could do so here even if they wanted to.)

      --
      butter the donkey
    12. Re:Do no evil by ioshhdflwuegfh · · Score: 1

      Now we just need some clean explanation for the "any well-meaning company" term...

    13. Re:Do no evil by Anonymous Coward · · Score: 0

      Google Desktop Search isn't supposed to upload your private desktop data from your private index to their corporate databases. That is UNBELIEVABLY evil. Worse than Sony's rootkit. Windows Desktop Search doesn't do that. Nor X1.

      However, it is a known fact that GDS "phones home" even when you specifically ask it not to do so. (http://ansemond.com/blog/?p=78/)

      Evil is as evil does. Google does do evil, and not just in China. They tried it right here in my own home.

    14. Re:Do no evil by ioshhdflwuegfh · · Score: 1

      [Google is] going to have a hard time enforcing [in China] western style ideas about copyright and what constitutes "doing no evil"

      What do you mean? Are you saying that the responsibility of Google is to enforce the Western values upon "natives" (to borrow your word) of the non-Western world?
    15. Re:Do no evil by ioshhdflwuegfh · · Score: 1

      I'm appalled, too. I'm also surprised.
      Why?

      What I'm not is a Google apologist.
      So naturally you go on to provide some apologies for Google:

      The dictionary could've been indexed via the spiders. It could've been indexed via the desktop search app. There are lots of ways that Google could've got the information. Anyone who works for Google, knows the deep ins and outs of their data handling, and who reads and posts on this site ain't gonna tell.
      And then, well, some more apologies:

      Google is information.
      Well, thanks for this explanation, I always thought of Google as being a corporation.

      They get it from everywhere, and they know how to store, sort and use it.
      This must be some very deep secret knowledge that only Google possesses:

      Anyone who works for Google, knows the deep ins and outs of their data handling, and who reads and posts on this site ain't gonna tell.
      Google is information... Google is corporation... information... corporation... brain in pain:

      It may well have been intentional theft, but I don't think Google the corporation has reached the point where they actually believe "All Data Are Belong To Us".
  22. About that do no evil stuff.... by pcause · · Score: 1

    Ok, so we do do some evil, but just with our competitor's code. That isn't so bad, is it?

  23. Exactly how did they get a copy of the DB? by WoTG · · Score: 1

    OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

    I suspect that there's more to this story that we're not hearing.

    1. Re:Exactly how did they get a copy of the DB? by tooyoung · · Score: 5, Informative

      OK, so now that Google has admitted to copying the sohu.com pinyin database... exactly how did they get a copy in the first place? Is there a publicly available file for personal use or was there some sort of web scraping or what?

      I suspect that there's more to this story that we're not hearing.


      Exactly. Reading 95% of the comments for this story and yesterday's story, everyone seems to think that this is about stealing code. This is about Google using the same data to train an algorithm. Both algorithms make the same mistakes because they were trained using the same data, which contained incorrectly labeled information. It is whether or not this data was publicly available that is the issue.

      For (a horribly contrived) example: Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person that compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' are labeled as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.
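
      The contrived example above can be sketched in a few lines. This is only an illustration under stated assumptions: the two binary "features" and the tiny training set are invented, and a nearest-neighbour lookup stands in for the neural net. The point is just that two unrelated learners inherit the same swapped labels.

```python
# Toy illustration (hypothetical data): two different learners trained on the
# same mislabeled samples reproduce the same mistake.
from collections import Counter, defaultdict

# Each sample is (features, label). The features are made-up binary traits:
# (has_descender, has_closed_loop). A real 'n' is (0, 0), a real 'q' is (1, 1).
# Whoever compiled this data set swapped the labels for 'n' and 'q'.
TRAINING = [((0, 0), "q")] * 5 + [((1, 1), "n")] * 5 + [((0, 1), "o")] * 5


def train_nearest_neighbour(samples):
    """Return a predictor that copies the label of the closest training sample."""
    def predict(x):
        # Hamming distance between feature tuples.
        return min(samples, key=lambda s: sum(a != b for a, b in zip(s[0], x)))[1]
    return predict


def train_naive_bayes(samples):
    """Return a naive-Bayes predictor over binary features."""
    label_counts = Counter(label for _, label in samples)
    # feature_counts[label][i][value] = how often feature i took that value
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1

    def predict(x):
        def score(label):
            p = label_counts[label] / len(samples)
            for i, value in enumerate(x):
                # Laplace smoothing over the two possible binary values.
                p *= (feature_counts[label][i][value] + 1) / (label_counts[label] + 2)
            return p
        return max(label_counts, key=score)
    return predict


knn = train_nearest_neighbour(TRAINING)
bayes = train_naive_bayes(TRAINING)
real_n = (0, 0)
print(knn(real_n), bayes(real_n))  # prints: q q
```

      Neither predictor shares any code with the other, yet both call a real 'n' a 'q', because the error lives in the data.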

      I'm not defending Google at all here. If they stole the data from Sohu, they should get in trouble. Based on the fact that Google is in the web-mining business, I would guess that they just grabbed this data off of the net, and someone forgot to think about if they had the right to use it.
    2. Re:Exactly how did they get a copy of the DB? by martin-boundary · · Score: 2, Insightful
      To paraphrase Wirth: "Programs = Code + Data"

      According to TFA, the data (which apparently was built by the Sohu company) was not publicly available and was not licensed to other companies. Obviously, the data must exist in some form within the product itself. That would suggest that either the company had some unsecured internal servers, or that Google hired some of their people who conveniently kept a copy of the data, or they figured out how to decode the data dictionary from a copy of the product.

      Interestingly, TFA says that Google are now using "tens of thousands" of data points culled from their web crawls, whereas previously the Sohu dataset contained 300,000+ data points. That suggests that a straight web crawl is much less effective than doing the legwork that the Sohu company did. In fact, speculating a little more: 330,000 is the size of the dataset claimed by Sohu, and 300,000 is the overlap size claimed by the company. Assuming Google's product had both web crawl data and Sohu's data initially, that would suggest that Google's web crawl data is only about 30,000 data points, one tenth the size.

      In information retrieval, database size tends to matter more than algorithms. For example, one major reason for Google's own superiority over its competitors in web search is that its own webcrawl dataset is at least twice the size of its nearest competitor. If you look at a company like Ask.com who are fourth and have some very interesting clustering algorithms based on the teoma search engine, they would definitely be competitive with Google if they only had a comparable size web crawl database.

    3. Re:Exactly how did they get a copy of the DB? by Anonymous Coward · · Score: 0

      Wouldn't a database of pinyin -> Chinese characters be considered facts?

      Why would copyright be involved?

      They could claim plagiarism, big deal, I don't think google claimed they created the dictionary.

    4. Re:Exactly how did they get a copy of the DB? by nebosuke · · Score: 1

      No, because the same pinyin can map to many Chinese characters. The logic+data that determines the preference of a given mapping over another equally valid mapping is not only complex and substantial, but subjective as well, and therefore definitely constitutes a creative work.

    5. Re:Exactly how did they get a copy of the DB? by PassBy · · Score: 2, Informative

      I think you are misunderstanding how a Pinyin input works. But anyhow, it is rumored that Sohu had put some "database fingerprints" in their database. That means there are hard-coded patterns of Chinese characters that you wouldn't normally get by typing in the corresponding English letters (e.g. the names of some Sohu employees). The mistake confirmed by Chinese users is in fact a misspelling. A Chinese comedian's name, which should be spelled "feng gong" (two characters), can only be output by typing "ping gong" in both IMEs. I am going to try to explain why this is obviously a proof of "leveraging". Names of people and other things in Chinese are mostly combinations of Chinese characters that have no logical or other connection. That means that, by just using algorithms, names won't come up by just typing their corresponding pronunciation.
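
      A rough sketch of how such fingerprints could be checked, assuming you had both dictionaries as simple pinyin-to-candidate mappings plus an independent reference. All three mappings below are invented for illustration; the "zzz" entry is a made-up stand-in for a planted engineer name, and nothing here reflects the real file formats of either product.

```python
# Hypothetical miniature dictionaries: pinyin key -> top candidate.
REFERENCE = {"feng gong": "冯巩", "bei jing": "北京", "shang hai": "上海"}
IME_A = {"ping gong": "冯巩", "bei jing": "北京", "shang hai": "上海", "zzz": "某工程师"}
IME_B = {"ping gong": "冯巩", "bei jing": "北京", "zzz": "某工程师"}


def shared_anomalies(a, b, reference):
    """Entries that are identical in both a and b but not supported by the reference."""
    return {
        key: a[key]
        for key in a.keys() & b.keys()
        if a[key] == b[key] and reference.get(key) != a[key]
    }


suspicious = shared_anomalies(IME_A, IME_B, REFERENCE)
print(sorted(suspicious))  # prints: ['ping gong', 'zzz']
```

      Ordinary entries such as "bei jing" agree with the reference and drop out; what remains is the shared misspelling and the shared planted entry, which independent work would be unlikely to reproduce.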

    6. Re:Exactly how did they get a copy of the DB? by Hucko · · Score: 1

      It was stored on a Windows box?

      botnet
      holes
      obsessed with botnets

      --
      Semi-automatic amateur armchair Australian philosopher; conjecture ready at any moment...
    7. Re:Exactly how did they get a copy of the DB? by ioshhdflwuegfh · · Score: 1

      Let's say that I write some handwriting recognition software using a neural net. In order to train my software, I use a large database of handwriting samples that I have found on the web. However, the person that compiled this database made the mistake of labeling all of the sample images of the letter 'n' as the letter 'q', and all of the images of the letter 'q' are labeled as the letter 'n'. Person B comes along and uses the same data set to train a naïve-Bayes classifier. Guess what? Both algorithms will make the same mistakes when it comes to the letters 'n' and 'q'. Not because I stole code from Person B, but because we used the same training data.

      According to this example, both algorithms behave pretty much the same way because they've been trained on the same set of data, no? That is, algorithms that need to be trained in order to be functional at all only depend in an even more crucial way on the training set than, say, a key-based search of a phone book.
  24. this is quite troubling by martin-boundary · · Score: 2, Insightful
    It is clear from this example that _some_ Google engineers have not the first clue about what clean room engineering is and when it should be used. Everyone in the software industry is under pressure to produce, that doesn't mean cutting corners is acceptable.

    This reminds me of the recent story about GPL code found in OpenBSD. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license. He apparently thought it was ok to do this as long as all the offending functions would be renamed in the final release, but was caught checking in unmodified functions by accident.

    Google is well known for using a lot of GPL software, but it is also true that they do not distribute the source code of their flagship programs to the public. Episodes like this make people wonder if they "accidentally" use some GPL code in their distributed products without telling anyone.

    1. Re:this is quite troubling by QuantumG · · Score: 1

      Uh huh. Are you trying to suggest that there is something wrong with this:

      1. Take existing code under incompatible license
      2. Write new functionality and integrate into your code
      3. Test and develop your application until it is "ready"
      4. Replace incompatible code with your own code

      I mean, if you were talking about using proprietary code in the first step then I could imagine that you might have some kind of argument.. but it's GPL code man.. you're free to do whatever you want with it. Only when you distribute it are you required to place other code that it is based on under the GPL.. and if you remove the GPL licensed code then you have no such responsibility anymore.

      Unfortunately the dude fucked up.. everyone does it now and then.

      --
      How we know is more important than what we know.
    2. Re:this is quite troubling by tppublic · · Score: 1
      ...if you were talking about using proprietary code in the first step then I could imagine that you might have some kind of argument

      No need for imagination. Go read Sega v. Accolade.

      it's GPL code man.. you're free to do whatever you want with it

      NO. You are Free to do whatever the license grants you the right to do. From GPL Section 2(b): "You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License." (emphasis added)

      You are proposing to directly leverage GPL code to develop new code. That new code and the combined code are a derivative work of the GPL code.

      Thus, to directly answer your question: Yes, there is something wrong with what you are proposing. No lawyer would want you to do it, because the work you produce in steps 2 and 3 is a derivative work of GPL code, and thus must be licensed under the GPL to avoid copyright infringement.

    3. Re:this is quite troubling by QuantumG · · Score: 1

      Yeah, you're on crack if you think that new code you write is a derivative work just because you have read some GPL code.

      --
      How we know is more important than what we know.
    4. Re:this is quite troubling by martin-boundary · · Score: 1

      .. but it's GPL code man.. you're free to do whatever you want with it.
      Of course you can. But if you modify _it_, then the end product is covered under the GPL. Let's take your example:

      1. Take existing code under incompatible license
      No problem there. At this point you have a copy of the GPL'd code, and no code of your own. You can do anything you like with the code.

      2. Write new functionality and integrate into your code
      At this point you have a derivative of the original GPL'd code. No problem there, you can do anything you like with the code.

      3. Test and develop your application until it is "ready"
      That's fine too.

      4. Replace incompatible code with your own code
      Here you're taking some GPL code, and modifying it. The result is GPL code. It doesn't matter if your modification consists of "removing" the "original" GPL'd code, the code you're modifying is still GPL, so the result is GPL.

      Now granted, it looks confusing that you can end up with GPL'd code which looks like you've written it all yourself, but that's because in this scenario the developer was sloppy about the disclaimers. If he'd been more pedantic, he would have seen where the mistake lies, as follows:

      In step 1., the code is marked with the GPL copyright disclaimer on each source file. To get from step 1. to step 2., whenever you copy a GPL function into a new source file, you must _also_ copy the accompanying disclaimer into the new source file. Now your new source file has the GPL copyright disclaimer (pedantic, but necessary). Next you modify your source file any way you like, but you can't remove the GPL disclaimer, even though you can remove and change all the code below it. At some point, all the code below is your own, but the GPL disclaimer is still there and valid, because it was present throughout the development. If you now remove the disclaimer and put a BSD one in instead, you're clearly breaching the copyright.

      So if you act pedantically, you can't fail to see where you're stuck with the GPL. Also, if you're pedantic, you can easily see how to go around the issue: create a set of source files which act as a _proxy_ for the copied GPL'd functions and aren't directly mixed with your other code, then you'll be able to split off the GPL code in the end. Besides, it makes the whole code more modular.

      FWIW, I agree that the BSD guy made a mistake he paid dearly for, but if we as developers are going to play the copyright game and make a fuss when others abuse it, then we must play it _correctly_ and not _sloppily_.

    5. Re:this is quite troubling by QuantumG · · Score: 1

      At this point you have a derivative of the original GPL'd code. No problem there, you can do anything you like with the code.

      No.. if you distribute it *then* you are obligated to release your code under the GPL, *but not before*.

      --
      How we know is more important than what we know.
    6. Re:this is quite troubling by martin-boundary · · Score: 1

      No.. if you distribute it *then* you are obligated to release your code under the GPL, *but not before*.
      The GPL applies to the source code throughout its existence, not merely to distributed source code if and when it gets distributed. In fact, the line "Copyright (C) DATE AUTHOR" which is filled in somewhere near the top of the disclaimer is a statement of ownership.

    7. Re:this is quite troubling by QuantumG · · Score: 1

      Dude, you don't know what you are talking about ok? Stop speaking now.

      Fucking Slashdot.

      --
      How we know is more important than what we know.
    8. Re:this is quite troubling by martin-boundary · · Score: 1

      Is that a quick way to "save face" and retire from the thread? I can accept that.

    9. Re:this is quite troubling by Dun+Malg · · Score: 1

      It is clear from this example that _some_ Google engineers have not the first clue about what clean room engineering is and when it should be used.

      What kind of idiot "clean room engineers" a freakin' dictionary? You "clean room" the software that uses the dictionary...
      --
      If a job's not worth doing, it's not worth doing right.
    10. Re:this is quite troubling by martin-boundary · · Score: 1

      What kind of idiot "clean room engineers" a freakin' dictionary? You "clean room" the software that uses the dictionary...
      Spot on. But remember that the dictionary was probably obtained in some form by examining the software that directly uses it. In other words, Google's programmers were reverse engineering a competitor's product.

  25. On what do you base your judgment? by Anonymous Coward · · Score: 4, Insightful

    > They have not complied with Sohu's requests to date.

    One of Sohu's demands was to remove it. They did that, even prior to the cease & desist deadline, per the article. It sounds like they'll have to compensate Sohu next, which isn't overly surprising. As for where they got it, perhaps someone sold it to them? We don't know, so I'll reserve judgment about whether it was acquired in an un-Google "evil" way until we hear the rest of the story.

    > It's not the first time Google have taken a fairly liberal interpretation of someone else's copyright either.

    As for the copyright stance, I honestly don't care. Yes, I dislike Microsoft's hypocrisy concerning copyright, but I don't really give a damn about imaginary property at this point in time, and I don't see Google out there telling people that copyright infringement is evil, wrong, Communist and anti-American.

    Frankly, I'm more inclined to distribute my works with only one request: that you do not acknowledge my authorship in any way. Of course, almost the only way to enforce that is to post AC :-)

    1. Re:On what do you base your judgment? by Daengbo · · Score: 5, Informative

      In my mind, there is some question of whether a database of facts should, in fact (hee hee), be copyrightable at all. The characters were not original. The pinyin is not original. The pinyin for each character is, in fact, well established. Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?

      It reminds me of a court case a few years ago in Thailand, where a judge put several Thai fonts into the public domain, stating "No one owns the Thai alphabet. It belongs to the people."

    2. Re:On what do you base your judgment? by QuantumG · · Score: 2, Interesting

      meh, the argument for why compilations of public domain "facts" should be considered a copyrightable work is that it is work to compile those facts. Why people can't understand that not all work results in property is beyond me, but there's ya reasoning.

      --
      How we know is more important than what we know.
    3. Re:On what do you base your judgment? by Daengbo · · Score: 1

      I know the reasoning: I just don't understand it. Writing a historical novel or even a biography is different from a raw database of publicly available facts. One is art, while the other is just data entry.

    4. Re:On what do you base your judgment? by buro9 · · Score: 1

      Any book out there is merely a collection of public-domain words, it's the arrangement of them into a single collection that is copyrighted.

      A database is little different.

      There is of course time and effort spent in creating the collection, and some of the interpretation could be argued to be a creative effort in and of itself.

      A map is public-domain knowledge, but the compiled article is copyrighted. It's hard to imagine why this database should be exempt from copyright when every other instance of compiled public knowledge I can think of right now is copyrighted.

    5. Re:On what do you base your judgment? by Daengbo · · Score: 2, Informative
      Well, Duke's law page makes it clear that copyright is based on originality and not "sweat of the brow."

      The relevant portion:

      In 1991, the Supreme Court addressed this question in Feist Publications v. Rural Telephone Co. Feist is a publishing company specializing in area-wide telephone directories, and Rural is a public utility company that provides telephone service to Northwest Kansas. Feist had almost 50,000 white page listings in fifteen counties, while Rural had fewer than 8,000. The white pages listed the names, phone numbers, and towns of residence of all of the residents in a particular area alphabetically by last name. The two companies competed vigorously for yellow page advertisements. Feist copied Rural's collection of white page listings in order to compile its own. The district court granted summary judgment to Rural, relying on the 'sweat of the brow' doctrine, which justified protection because of the labor involved in collecting and arranging the facts.

      The Supreme Court rejected this doctrine because, with the Copyright Act of 1976, Congress made it clear that originality was a requirement for copyright protection.
      I submit that there is no originality in the character -- Pinyin pairing, though perhaps there is in the use of the engineers' names.
    6. Re:On what do you base your judgment? by Torvaun · · Score: 1

      IIRC, there was a time when most maps included a location or feature that didn't really exist. If I further recall correctly, one of those locations was a partial inspiration for R'lyeh, an island that was only there on the map. The whole point of this was to force other cartographers to explore for themselves instead of just copying the map. If someone else's map had your R'lyeh, you can sue them for copyright violation. Looks like that's what's happening here.

      --
      I see your informative link, and raise you a pithy comment.
    7. Re:On what do you base your judgment? by heinousjay · · Score: 2, Funny

      So the slogan is data entry wants to be free?

      --
      Slashdot - where whining about luck is the new way to make the world you want.
    8. Re:On what do you base your judgment? by asninn · · Score: 1

      Why should a compilation of public-domain facts which in itself is a derivative work be copyrightable?

      Sohu is probably asserting copyright over the errors they introduced. ;)

      --
      butter the donkey
    9. Re:On what do you base your judgment? by gaspyy · · Score: 1

      Someone has taken the time to compile the data into the database. It cost time and money to do so. Google chose to take the shortcut and use that db instead of making their own (which further hints that the work involved was not trivial at all - you really can't argue that Google made a mistake, that they didn't know what they were doing).

      It's a little similar with the fonts: the Latin or Thai or whatever alphabet isn't (and shouldn't be) copyrightable. However, creating a font is not an easy task, especially one that works at small sizes. The creators should imho be protected one way or the other (actually font plagiarism is all the rage - fonts are not copyrightable, only their names)

      Of course, this is Slashdot and it's Google we are talking about, so all their actions are justifiable - heaven forbid anyone to criticize their behavior.

    10. Re:On what do you base your judgment? by Plutonite · · Score: 1

      That is a great point, but they would argue that it is the effort put into the work that makes it "theirs". Same with an encyclopedia - you can use (and cite) it, but you sure as hell can't produce a carbon copy under a different name with zero recognition. Recognition of authorship, among free(beer) work at least, is a courtesy we have no need to abandon.

      Which brings me to the GP issue: why don't you want your name on things you've done? Recognition is a "nice" thing. If all of mathematics was written down in nameless books with mystery authors, we would live in a crappy world indeed. The people who are both capable and willing to produce knowledge should be valued for it, because therein lies the worth of the human race. Or something.

    11. Re:On what do you base your judgment? by gotem · · Score: 1

      As for where they got it, perhaps someone sold it to them?
      I know, they did a google search for:
      "index of" pinyin.dic

    12. Re:On what do you base your judgment? by Jeff+DeMaagd · · Score: 1

      I suppose a book shouldn't be copyrighted, because it uses letters and words that already exist?

    13. Re:On what do you base your judgment? by That's+Unpossible! · · Score: 1

      Exactly! Writing a book is simply re-arranging factual letters into known words, and common sentences.
      How is that considered ORIGINAL? Bahhh...

      --
      Ironically, the word ironically is often used incorrectly.
    14. Re:On what do you base your judgment? by ioshhdflwuegfh · · Score: 1

      I submit that there is no originality in the character -- Pinyin pairing, though perhaps there is in the use of the engineers' names.

      So then after the removal of these names, the originality of the database is gone?
    15. Re:On what do you base your judgment? by ioshhdflwuegfh · · Score: 1

      Recognition of authorship, among free(beer) work at least, is a courtesy we have no need to abandon.
      nor, strictly speaking, to follow. For example:

      If all of mathematics was written down in nameless books with mystery authors, we would live in a crappy world indeed.
      How come? Would mathematics be different?
    16. Re:On what do you base your judgment? by sunnybayz · · Score: 1

      A Pinyin input method (like Sogou Pinyin or Google Pinyin) not only maps pinyin (a string of letters) to Chinese characters, but also prompts the user with the most likely words or phrases following the character just typed. The accuracy of phrase prediction is the major metric for evaluating Pinyin IMEs, because it dominates the speed of typing Chinese sentences. To predict phrases accurately, people have to build a statistical language model, which is non-trivial work. It requires a lot of machine learning, parameter tuning and manual labeling. In principle, both Google and Sogou can build good models out of the huge amount of Web pages in their databases. As far as I know, Sogou spent more than one year in building and refining their model, while Google took it off the shelf. I think Google did the wrong thing.
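
      To give a feel for what "statistical language model" means here, below is a deliberately tiny sketch, not how either product actually works. It assumes a made-up candidate table and a four-string "corpus", and ranks candidates for a pinyin syllable by how often each character followed the previous one (a bigram count), falling back to overall frequency.

```python
from collections import Counter

# Hypothetical toy data: candidate characters per pinyin syllable and a tiny corpus.
CANDIDATES = {"shi": ["是", "市", "事"], "bei": ["北", "被"], "jing": ["京", "经"]}
CORPUS = ["北京市", "北京是", "这是事", "北京市"]

# How often each character, and each adjacent pair of characters, occurs.
unigrams = Counter(ch for text in CORPUS for ch in text)
bigrams = Counter(text[i:i + 2] for text in CORPUS for i in range(len(text) - 1))


def rank(pinyin, previous=None):
    """Order candidates for `pinyin` by bigram count with `previous`, then unigram count."""
    def score(ch):
        pair = bigrams[previous + ch] if previous else 0
        return (pair, unigrams[ch])
    return sorted(CANDIDATES[pinyin], key=score, reverse=True)


print(rank("shi"))                # no context: ranked by overall frequency
print(rank("shi", previous="京"))  # after 京, 市 moves to the front
```

      A real model would be trained on a vastly larger corpus with smoothing, word segmentation and longer contexts, which is where the year of tuning and labeling mentioned above would go.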

    17. Re:On what do you base your judgment? by Kadin2048 · · Score: 1

      Someone has taken the time to compile the data into the database. It cost time and money to do so. Google chose to take the shortcut and use that db instead of making their own (which further hints that the work involved was not trivial at all - you really can't argue that Google made a mistake, that they didn't know what they were doing).

      Doesn't matter; copyright -- at least U.S. and I think British copyright, I have no idea what if any philosophy underlies the Chinese system, if indeed they have one -- is blind to the time, money, and "sweat of the brow" that went into creating a work. The only thing that matters is the originality of the final work. Compilations of facts, although they may require a lot of energy to create, are not copyrightable. The canonical case involves telephone directories, which require a lot of effort to assemble, and may be freely copied.

      Unless someone can come up with a convincing argument for why this pinyin database is more "creative" than the compilation of a telephone directory or a book of mathematical tables, I don't really see why the copyright is an issue.

      --
      "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
    18. Re:On what do you base your judgment? by Daengbo · · Score: 1

      I'm not saying anything about what Google did. I think attribution is important, and I don't know the particulars of the extent of what was copied. I just heard so many people talking about "sweat of the brow" as reason for what Google did being wrong, and wanted to present facts like this to let them know that's not the way it is. Databases require trade secrets, apparently.

  26. When in Rome... by Anonymous Coward · · Score: 0

    ...do as the Romans do?

  27. Ironic by smackt4rd · · Score: 5, Funny

    So now american companies are pirating chinese software? Oh the irony! :)

  28. Re:Do no evil? by Anonymous Coward · · Score: 0

    Copyrights exist for a reason...read a book or something and figure it out.

  29. Re:Do no evil? by Anonymous Coward · · Score: 0

    Oh, they exist for a reason alright. That's why I oppose 'em! http://piratpartiet.se/

  30. Seppuku by Anonymous Coward · · Score: 0

    Google should just convince someone plausibly responsible to commit seppuku with the promise their family would be taken care of in thanks for removing their shame.

  31. Their new spokesperson ... by myster0n · · Score: 2, Funny

    ... Theo De Raadt says that the Chinese are INHUMAN.

    *ducks*

    --
    Nobody believes the official spokesman, but everybody trusts an unidentified source. -- Ron Nesen
    1. Re:Their new spokesperson ... by Anonymous Coward · · Score: 1, Funny

      In Soviet Russia the inhumans say that Theo De Raadt is Chinese.

    2. Re:Their new spokesperson ... by micromuncher · · Score: 1

      C'mon. Anyone who misuses root access and a position of authority such as sysadmin to delete a term paper of someone who disagrees with them and is subsequently fired by the university who needs to send the police after him to retrieve server room keys MUST be an authority on authority.

      --
      /\/\icro/\/\uncher
  32. Tough Luck by Plekto · · Score: 0, Troll

    I say Google stops being apologetic and says "so what". After all, China has no respect for U.S. copyrights and patents and steals from us every day.

  33. Were the errors intentional? by SuperBanana · · Score: 3, Informative

    If you ask around in the GIS/mapping community, it's known that the [street] map data providers (Delorme, Garmin, etc) will insert garbage data here and there. A street name is slightly wrong, or they have a mystery street that doesn't exist in the real world. They use it to try and tell if/when someone steals their data. If Zyugyz Road in Somecity, CA exists- the legal team fires at will.

    It's kind of weird, considering that most mapping companies do little more than get their hands on town/county/state GIS data for cheap, massage it a bit, then charge assloads of money for it.

    1. Re:Were the errors intentional? by Dan+East · · Score: 1

      The same happens with government medical related data. Take the ICD9 database for example. It is distributed in a database format not conducive to programmatic access. For example, there are hundreds of codes with the description of "Other". Its description only makes sense in the context of all its parent levels, which then produces an extremely large, redundant description. Companies will simply reformat the data, take copyright and profit.

      Dan East

      --
      Better known as 318230.
    2. Re:Were the errors intentional? by Anonymous Coward · · Score: 0

      Yes same here. In this google/sohu IME story the sohu developers inserted their names into sogou pinyin IME, so these names come out as the first choice for the particular key stroke sequence, although these names are all not trivial ones. Then people found that the same names come out as the first choice in google's IME, the only reason this can happen is that google is *copying* sohu's IME dictionary.

      As you are probably not a user of a Chinese pinyin IME, here is an example to help you understand the situation: imagine that you have an English IME, you type in "sb", and you get "SuperBanana" as the first choice. What would you think?

    3. Re:Were the errors intentional? by Anonymous Coward · · Score: 0

      Yes, MSFT does the same with its software. Some unpleasant people dare to call those bugs.

  34. Shame! by BluBall · · Score: 3, Funny

    Following the protocols established by the recent OpenBSD/Linux Broadcom driver fiasco, the proper response would be to denounce Sohu for having been ripped off by Google.

    Shame on you Sohu! This is inhuman!

  35. Re:Any surprise this was done in China? by BiggerIsBetter · · Score: 3, Insightful

    Google may be filled with the best engineers, but once you move out of North America, they know nothing about ethics or morality.

    I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.

    --
    Forget thrust, drag, lift and weight. Airplanes fly because of money.
  36. Right! Google is evil! by SEE · · Score: 3, Insightful

    After all, we know that all Google employees are under Total Management Mind Control, and that Google Knows Everything Everyone's Doing. It's not even remotely possible that a handful of Google employees in China could shadily cut corners (using an already-extant database instead of compiling one from their own company's data) without Sergey Brin and Larry Page having personally authorized it from Mountain View, or that it would actually take a bit of time for upper management to investigate an issue when it's uncovered.

  37. Not a big deal by gaz_hayes · · Score: 0, Flamebait

    Good, google admitted it. I bet google contracted a Chinese company to supply them the database though. Apart from that, basically every piece of IP the USA has ever created has been copied by the Chinese and profit has been made. But that doesn't make it right, and google needs to come 100% clean because if we start doing what the Chinese do to us, then there will be no more good people left in the world...

    1. Re:Not a big deal by Achromatic1978 · · Score: 1

      Tell you what, grab an M16 and man the borders. What the fuck piece of xenophobic, nationalistic tripe is this? "no more good people left in the world"?

    2. Re:Not a big deal by Anonymous Coward · · Score: 0

      Yeah. If we torture the terrorists, the terrorists have won.

  38. Google fucks up, so bash the Americans by Anonymous Coward · · Score: 0

    renegadesx got the memo, apparently.

  39. Please tell me by Mazin07 · · Score: 1

    How is Google's pinyin IME better than the tons of other pinyin IMEs out there? I tried it, and apart from having a search button, it doesn't seem to be a whole lot better than the Microsoft Pinyin IME that comes with Windows.

    How does Google plan to set themselves apart from the rest of the competition and, even better, how does this fit into the "big picture"? Will the mass of adopters suddenly begin using Google search because it's built into their IME?

    1. Re:Please tell me by hackingbear · · Score: 2, Funny

      The advanced feature will be:


      When you are typing your term paper using this IME, the IME will automatically google the Web and find other papers on the same topic, so you can just stop thinking and typing and instead copy from those papers at the click of a button.

  40. Re:Do no evil? by mattgreen · · Score: 1

    But they SAID they weren't evil, therefore that MUST make them good! Or, at least, that is how I fit into my naive worldview! Everything is either absolutely evil (Microsoft) or absolutely good (Google). There is no in-between.

  41. Tutorial on Chinese input by microbee · · Score: 5, Informative

    There are a lot of misunderstandings about how an IME works and how Google copied non-public databases. So let me explain.

    An IME accepts keyboard input and converts it into characters of a given language. There are many different input methods that decide how to generate Chinese characters using English keyboards, and pinyin is one of them (and the most popular one).

    Pinyin is popular because it's simple and has almost no learning curve. However, it suffers from the problem of aliasing. For example, "shi" under pinyin will convert into "" "" "" ... in general, the same sequence can map to many different words (possibly several dozen), and you usually need to select among them by choosing 1, 2, 3, ... (the input bar displays the choices, sometimes requiring page-down). A naive implementation of pinyin is thus very slow and cumbersome to use.

    A good implementation uses the following approaches:
    1. adjust each word's position by how frequently it has been used in the past. The most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
    2. allow partial input for common phrases. This inputs a whole phrase at once, each character requiring only its first English letter. It speeds up input significantly.

    So the quality of the pinyin method depends heavily on how well the input could guess and prioritize the guesses, and thus the dictionary that is being used. And generating this dictionary (keeping it both contemporary and accurate) takes a lot of time.
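
    The two approaches above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any shipping IME stores its data: the class name, the entries, and every frequency are made up for illustration.

```python
from collections import defaultdict


class TinyPinyinIME:
    """Toy candidate lookup: full pinyin or first letters, ranked by use count."""

    def __init__(self, entries):
        # entries: (list of pinyin syllables, Chinese text, starting frequency)
        self.freq = {}
        self.index = defaultdict(set)
        for syllables, text, freq in entries:
            self.freq[text] = freq
            self.index["".join(syllables)].add(text)
            if len(syllables) > 1:
                # Approach 2: a phrase can also be typed by its initials.
                self.index["".join(s[0] for s in syllables)].add(text)

    def candidates(self, keys):
        # Approach 1: the most frequently used candidates come first.
        found = self.index.get(keys, set())
        return sorted(found, key=lambda t: (-self.freq[t], t))

    def commit(self, text):
        # Each selection nudges that candidate toward the front.
        self.freq[text] += 1


# Made-up frequencies, for illustration only.
ime = TinyPinyinIME([
    (["shi"], "是", 90), (["shi"], "时", 60), (["shi"], "事", 50),
    (["guan", "xi"], "关系", 40),
    (["gao", "xing"], "高兴", 45),
    (["gan", "xie"], "感谢", 30),
])
print(ime.candidates("shi"))  # aliasing: one key sequence, several characters
print(ime.candidates("gx"))   # initials only, three phrases compete
for _ in range(6):
    ime.commit("关系")
print(ime.candidates("gx"))   # the phrase the user keeps picking has moved up
```

    Everything interesting lives in the entries and their counts, which is why the dictionary rather than the lookup code is the valuable part.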

    The dictionary is typically distributed together with the input method (or it wouldn't work). You could obtain Sohu's dictionary just by installing its input method, and Google likely obtained it this way. However, I don't think it's in an open-standard format, so Google probably did some reverse engineering to be able to actually use it in its own software.

    1. Re:Tutorial on Chinese input by Anonymous Coward · · Score: 0

      It is a known fact in China that the first version of Sohu's IME was an almost exact duplicate of another Chinese vendor's IME product.

    2. Re:Tutorial on Chinese input by Dahan · · Score: 0
      What's so special about SoHu's (or Google's) IME anyway? I use the IME that comes with Windows XP (Microsoft New Phonetic IME 2002a) and find it perfectly adequate.

      1. adjust each word's position by how frequently it has been used in the past. The most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).
      MS's IME does this.

      2. allow partial input for common phrases. This inputs a whole phrase at once, each character only requiring the first English letters. It speeds up input significantly.
      MS's IME doesn't do the "partial input" part, but it does have a phrase list and will automatically select the right characters from among the homophones--you do, however, have to enter the complete pinyin/zhuyin. You can also define your own user phrase list.
    3. Re:Tutorial on Chinese input by loyukfai · · Score: 1
      1. adjust each word's position by how frequently it has been used in the past. The most frequently used words are shifted to the front, making selection much faster. Typically they should fit on the first page (no scrolling required).

      While it's a good idea in the long term, it cannot be done constantly or too frequently. People who really do type a lot memorize the position of the characters on the list. If the position changes too frequently, it'll instead slow them down a lot.
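
      One possible compromise is to damp the reordering. The sketch below is my own illustration, not something any of the IMEs discussed here is known to do; `record_use` and its `margin` parameter are hypothetical. A candidate moves up one slot at a time, and only once it is clearly ahead of its neighbor:

```python
def record_use(order, counts, chosen, margin=3):
    """Count one use of `chosen`; promote it one slot only if clearly ahead."""
    counts[chosen] = counts.get(chosen, 0) + 1
    i = order.index(chosen)
    # Swap with the neighbor in front only when the lead reaches `margin`.
    if i > 0 and counts[chosen] >= counts.get(order[i - 1], 0) + margin:
        order[i - 1], order[i] = order[i], order[i - 1]
    return order


order = ["a", "b", "c"]            # candidate list as displayed
counts = {"a": 5, "b": 5, "c": 1}  # made-up usage counts
record_use(order, counts, "b")
record_use(order, counts, "b")
after_two = list(order)            # "b" is ahead, but not by enough to move
record_use(order, counts, "b")
after_three = list(order)          # now the lead equals the margin
print(after_two, after_three)
```

      A larger margin keeps positions stable for people who type from memory, at the cost of adapting more slowly.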

  42. Oblig futurama quote by pedantic+bore · · Score: 5, Funny

    "The internet is about the free exchange of other people's ideas!"

    --
    Am I part of the core demographic for Swedish Fish?
    1. Re:Oblig futurama quote by Anonymous Coward · · Score: 0

      Close.

      You forgot "ideas that were exchanged only a day earlier"

      Welcome to Slashdot

  43. That shouldn't be copyrightable by wrook · · Score: 4, Interesting

    I've been thinking about this. Throwing the evilness of Google aside for a moment, why should someone be able to copyright a listing of the phonetic pronunciation of an alphabet?

    Let's just imagine how I might create this list. I would have to hire people who speak Chinese. Then I would ask them to record the pronunciation of each character that they know. This is pretty easy because in Chinese each character has only one pronunciation (per dialect, anyway). There are about 3500 characters that you need to know in order to be literate. And all of these people would have learned these at school.

    But how did they learn them? Well, they had a textbook and they memorized the list from the textbook.

    Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.

    But let's say I do that. Now I have a list of the 3500 most common characters. And with that, I've probably got 99% of everything that's in a newspaper. But that's probably not good enough. I probably want a list of say 60,000 characters. Otherwise it's pretty useless in a general sense. Uncommon characters are uncommon, but you *will* bump into the words over time.

    So where do I find these characters? Can I hire some guy that knows them all? It would be very difficult. The best place to look is in a book. But wait... what am I going to do? Every time I find a character my people don't know, look it up in a book? Why don't I just copy it from the book in the first place? That's just copyright infringement again.

    Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.

    But if I'm going to have a community project like this, what the heck do I need copyright for? What am I protecting? If everyone is going to contribute, everyone should benefit.

    So, personally, I don't think one should have copyright on this kind of material (same thing for spelling). It's just not in the public interest. This goes doubly so now that we have the internet and creating these kinds of projects is very inexpensive.

    OK, I've gone on long enough... But one more rant. What's with this "do no evil" thing? Isn't that setting the bar a little low? If I told my parents that I'd work hard not to be evil, I think they'd be somewhat disappointed in me. If Google wanted to actually "do some good" rather than "do no evil", they could start a community project to collect this data and share it with the world.

    Sigh... I guess we'll have to wait for some guy in his garage (but here's betting that someone has already started something).

    1. Re:That shouldn't be copyrightable by progprog · · Score: 0

      Really, the task of creating this list authoritatively without infringing copyright is monumental. Probably the *only* way to do it is with a community project where people just submit the pronunciations they know.

      It's not just about pronunciations, it's about the choices that appear and the order they appear in.

      Take the term "guanxi" (meaning "connections"). One term, two characters. For a good dictionary, the correct characters for this term will map it to the default choice available after typing all six letters. A garbage dictionary would have no concept of common terms and perhaps put the characters for "can wash" before the characters for "connections".

      Ordering the terms appropriately is important since a pinyin spelling maps to multiple characters. There is a huge difference in efficiency when the exact term you want is within the first couple "hits", as it may. Which is something Google may have some experience in...

      There *is* such a community project - SCIM. I wonder why Google didn't use/extend SCIM's database instead.

    2. Re:That shouldn't be copyrightable by Anonymous Coward · · Score: 0

      This is absolutely correct. The OP (wrook) has a misunderstanding of how pinyin romanization works (also, many characters have more than one pronunciation, even in the same dialect -- take 行, for example, which can either be pronounced xing2 as in "to walk" or hang2 as in the word for "bank"). Writing a dictionary is incredibly difficult given that going from pinyin to Chinese character can have as many as 1:50 mappings.
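
      To make the many-to-many point concrete, here is a small Python sketch. The four-character table is a tiny hand-picked sample and `invert` is a hypothetical helper; a real dictionary has tens of thousands of entries. It turns a character-to-readings table into the pinyin-to-characters table an IME needs:

```python
from collections import defaultdict

# Character -> readings, tones written as trailing digits (sample data only).
readings = {
    "行": ["xing2", "hang2"],  # one character, two pronunciations
    "银": ["yin2"],
    "星": ["xing1"],
    "形": ["xing2"],
}


def invert(readings, keep_tones=False):
    """Build pinyin -> characters; tones are stripped unless keep_tones is set."""
    table = defaultdict(list)
    for char, prons in readings.items():
        for p in prons:
            key = p if keep_tones else p.rstrip("12345")
            if char not in table[key]:
                table[key].append(char)
    return dict(table)


print(invert(readings)["xing"])                    # what a toneless typist sees
print(invert(readings, keep_tones=True)["xing2"])  # typing tones narrows it down
```

      Since most people type without tones, each key collects even more candidates than the toned table would suggest.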

    3. Re:That shouldn't be copyrightable by Psx29 · · Score: 2, Informative
      This is pretty easy because in Chinese each character has only one pronunciation (per dialect, anyway).

      In the case of Mandarin, while it is the case most of the time that each character has only one pronunciation, there are cases where a character may have a different reading depending on the compound word it is in. The case with simplified Chinese as per the mainland makes this even more burdensome, as multiple characters with different tones or different pronunciations altogether were combined to make the language easier to read/write.

      Mapping the standard pinyin (romanised) transcription of each character is not the hard part. The hard part is performing analysis on the sentence structure allowing one to type with a minimal amount of tone marks and saving time in the process. Correct me if I'm wrong but I believe it is this analytical data that google has been accused of stealing and as such, there is no justification for this being in the public domain.
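
      As a rough idea of what that analysis involves, here is a heavily simplified Python sketch: it splits an unbroken, toneless key string into dictionary words and keeps the most probable split. The lexicon, the probabilities, and `best_conversion` are all made up for illustration; this is a plain unigram model, not a claim about what Sohu or Google actually ship.

```python
import math


def best_conversion(keys, lexicon):
    """lexicon: pinyin -> (text, probability). Return the most probable split."""
    n = len(keys)
    best = [None] * (n + 1)  # best[i] = (log-probability, texts) for keys[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = keys[start:end]
            if best[start] is None or piece not in lexicon:
                continue
            text, prob = lexicon[piece]
            score = best[start][0] + math.log(prob)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [text])
    return None if best[n] is None else best[n][1]


# Hypothetical probabilities, for illustration only.
lexicon = {
    "xianzai": ("现在", 0.020),
    "xian": ("先", 0.010),
    "zai": ("在", 0.050),
    "xi": ("西", 0.005),
    "an": ("安", 0.004),
}
print(best_conversion("xianzai", lexicon))  # one common word beats three pieces
print(best_conversion("zaixian", lexicon))
```

      The code is generic; the numbers attached to each entry are what take the time to collect.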

    4. Re:That shouldn't be copyrightable by Anonymous Coward · · Score: 0

      This is pretty easy because in Chinese each character has only one pronunciation

      I don't know where you learned that but in mandarin (the official and main Chinese dialect) there are 4 tones and even 5 if we consider the neutral.

      A good example is ma, which can be written in pinyin as mā, má, mǎ, mà or ma. Each of these has a different meaning and a different pronunciation. Also, the Chinese character can be the same or different.

      Check this website to know more about pinyin: http://www.chinesenow.net/cnword/default.aspx and just type ma, you'll see the meaning

      To learn Chinese try this: http://www.chinesepod.com/

    5. Re:That shouldn't be copyrightable by Anonymous Coward · · Score: 0

      Interestingly, there has relatively recently (1991) been an analogous Dutch court case where the judge decided exactly that, stating that just a compilation of public domain facts does not, by itself, constitute original authorship and therefore is not a protected work in the sense described by copyright law.

    6. Re:That shouldn't be copyrightable by dominator · · Score: 1

      Wait. I can't just memorize a list from one book and put it in another book. That's copyright infringement. In order for it not to be copyright infringement, I need to make sure that my sources all memorized the pronunciations from different sources. That's going to be difficult.
      That's not how copyright law works, at least in the USA. Lists are facts. Facts are not copyrightable, nor are compilations thereof. Mainly because copyright isn't determined by the "sweat of the brow" rule, but rather "the sine qua non of copyright is originality." (Justice O'Connor) The relevant SCOTUS case on this is Feist v. Rural.
    7. Re:That shouldn't be copyrightable by PaulCotney · · Score: 1

      Yes, there are projects dedicated to this kind of thing going on. CEDICT: Chinese-English dictionary http://www.mandarintools.com/cedict.html is available. Also, several IMEs are percolating on Sourceforge if anyone is interested, and I think the majority of them build their data off of CEDICT.

      Mind you, the database that Google was playing with was probably larger and more current. It may have also had some fields to allow the program to better determine which character combinations belong to which "word" using some kind of frequency calculation. I believe this being the art to making a good IME. However, public Chinese pinyin resources are definitely available so I think it was just Google being lazy and sloppy.

    8. Re:That shouldn't be copyrightable by Jeff+DeMaagd · · Score: 1

      I think there's a difference between just copying someone else's list and compiling your own from numerous sources that aren't identifiable.

      One secret I've heard about the textbook industry is that an author might use dozens of sources, and reinterpret the information in their own words. Much of the information in a textbook is often public knowledge or from public domain sources, but there's the work needed to compile that information into that particular order. I think using many different sources and refactoring it is far better than using just a single source so completely that all the errors are mimicked too.

    9. Re:That shouldn't be copyrightable by Noted+Futurist · · Score: 1

      Yes, "do no evil" is a cop out, and indicative of what they truly are. "DO GOOD" is a worthwhile goal, but not doing evil ends in... evil.

    10. Re:That shouldn't be copyrightable by bunytu · · Score: 1

      Using a standard keyboard to input Chinese characters is not such a simple thing. There are more than 10 well-known input methods, and they are there for a reason. While it's true that one character almost always has one pronunciation, hence one way to input it using roman letters, one pronunciation normally has more than 20 characters to match, so how the software predicts words makes a huge difference. On top of that, Chinese people almost never use those roman letters in real life, whether speaking or writing, and very few Chinese pronounce standard/news-reader type Mandarin, so they will certainly make lots of mistakes on a keyboard. Many input programs give users different levels/types of mistake tolerance. If you ever use Chinese input, try "shi": 207 characters come your way, and around 50 are frequently used ones.

  44. No symmetry by mangu · · Score: 1
    This reminds me of the recent story about GPL code found in OpenBSD. There too, an OpenBSD developer took someone else's code and started modifying it without keeping the GPL license


    That's just like that old story about the resort where there were girls looking for husbands and husbands looking for girls. It's not a symmetrical situation. If BSD coders feel it's all right to give their work away for free to commercial companies, it doesn't mean GPL coders should be forced to do the same. Even if the BSD people disagree about the way GPL people licence their code, they should take care to respect the other point of view.

    1. Re:No symmetry by Anonymous Coward · · Score: 0

      What ill-informed propaganda.

      "That's just like that old story about the resort where there were girls looking for husbands and husbands looking for girls. It's not a symmetrical situation."

      Which explains why Linux did it first years ago?

      http://slashdot.org/bsd/01/09/24/1432223.shtml

      "If BSD coders feel it's all right to give their work away for free to commercial companies, it doesn't mean GPL coders should be forced to do the same."

      You're a fool. The licenses are not bilateral *either* way. I thought this was cleared up years ago thanks to a /. story on the matter, but apparently the myth perpetuates or has returned--it's not okay to use gpl code in bsd code without permission, and it's not okay to use bsd code in gpl code without permission.

      "Even if the BSD people disagree about the way GPL people licence their code, they should take care to respect the other point of view."

      Respect? *pfft* Like how most Linux proponents still think Apache had a GPL license? How open source groups, made up of mainly gpl proponents, push that license near exclusively as a fit-all, often times making code unavailable for bsd users?

      You know squat about what you are talking about if you are going to throw out the respect card. Theo is wrong, but he has a reason for his stance--the Linux community is shit when it comes to BSD issues so why should he give a rat's ass.

      Funny on your view is revisionist history--majority ignorant. Linux is more popular now. For a long time, the rumor was Linux's TCP/IP stack (as well as W2K and NT) was taken from the BSD stack (the Linux claim is certainly discounted). Rumors like this always swirl, and some turn out to be true.

      Now people with short memories, or with no knowledge of what goes on in either camp, can make sweeping, stupid claims.

      btw, "BSD" is not one community, much as Linux does not have one community. The issue you have here is with OpenBSD.

      btw2, you do realize you missed the poster's point that code swapping happens, whether by error, purposeful, or with working with or even merely seeing[1] the same data.

      btw3, let's find out the facts before we go crucifying one party or the other. Too many people "fill-in" things that aren't there, rush to judgment, or seek facts to support their view for no other reason than to win an argument.

      [1] Story--In college, a buddy and I took the same class. We often proofread each other's papers a couple days before, to swap ideas, as well as check for errors. This was encouraged by the professor to talk and discuss papers, and should be the point of any educational atmosphere anyways.

      While we were both heavy computer users, we did this with dead trees--marked up the printouts with pens during a sitdown. Read, chat, talk about the ideas, talk about other stuff, return to the papers, that sort of thing, given them back, go do the revisions and work on our own papers more independently.

      The morning before they were due, we also would do a final passthrough, usually over breakfast at the cafeteria.

      One time, in going over each other's papers the morning before, we both read a section in the other's and said, "Uhh, dude, this looks like what I wrote." We put the papers side by side, and we had a paragraph which had ~4 words different in the whole thing outside of some minor phrasing differences, i.e. "As well, Socrates ignored" vs. "Socrates ignored, as well,"....

      Reason? We had discussed this particular topic in depth, it was a somewhat difficult, odd idea to fully understand, and we both reworked our parts to what we had discussed. We're both smart guys, very good recall, and we went back and wrote what we had discussed when we had worked the issue out. And came up with nearly the same thing in the respective parts of our papers.

      Totally unintentional. A plagiarism catcher these days would have nailed it though. Anyways, apply

  45. Easter Eggs save the day. by DarkLegacy · · Score: 1
    > In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.

    And you thought Easter Eggs were just there for kicks. ;)

    --
    127.0.0.1
    1. Re:Easter Eggs save the day. by Anonymous Coward · · Score: 0

      What's really strange is engineers making a dictionary. I would have thought that you need to hire linguists for that kind of a job.

  46. Finally we steal some IP from them! by gatkinso · · Score: 2, Funny

    TURN ABOUT IS FAIR PLAY.

    Ok fine, we have stolen from them before... but Beef and Broccoli don't count.

    --
    I am very small, utmostly microscopic.
  47. Re:Do no evil? by Thexare+Blademoon · · Score: 0

    It's not whether or not they exist for a reason that I question.

    It's whether or not they exist for a good reason.

  48. And it isn't by phorm · · Score: 1

    The language isn't copyrighted, and google was more than free to come up with their own dictionary/database. However, in this case they used somebody else's. The infringement is not against the language itself, but against the use of somebody's precompiled database (inclusive of errors, amusingly enough).

    1. Re:And it isn't by Daengbo · · Score: 1

      Exactly my point. It was a database of known data, not a "creative work." There was no creativity here, and I question whether a compilation of facts with no artistic merit qualifies as copyrightable. I'm in a large minority with this opinion, by the way. A dictionary is copyrightable because someone had to write the definitions, but should a corpus be?

  49. google is SO in troubbblllllle by mycall · · Score: 0

    its the facts of life.

  50. Re:Do no evil? by setagllib · · Score: 3, Interesting

    They're significantly reducing the lock-in to Microsoft products, by encouraging, buying and thereafter funding web application projects that often overlap with what is currently locked in to Microsoft. They even brew some of their own sometimes. They continue the development of Linux and Python with a wide adoption of both. All of these things are creating wealth for everyone, and crippling Microsoft little by little, which we know is what we want. I'd much rather have a Google & Microsoft duopoly if it means Microsoft would finally have to clean up its shit and accommodate whatever open source platform Google would support in that scenario.

    --
    Sam ty sig.
  51. Here's your wallet back mate by Paranoia+Agent · · Score: 2, Funny

    Sorry, I was just leveraging some non-personal resources.

  52. About time by XCondE · · Score: 1

    Finally, the first (?) crack in the building appears.

    Am I just going to have to start-up my own evil-free(tm) company?

  53. Old habits die hard? by PassBy · · Score: 1

    The chief of Google China, Kai-Fu Li, used to be Microsoft's vice president, go figure...

  54. Provincialist Americans by Keith+McClary · · Score: 1

    In the US, a list of words in lexicographic order is not necessarily copyrightable (eg. phone books).

    Is it also so in China? And does China have laws making databases IP like the US?

    Americans seem to think that their bizarre and extreme notions of IP are universal law.

    Perhaps someone here is an expert on Chinese IP law - did Google-China do anything illegal?

  55. Re:Any surprise this was done in China? by Kristoph · · Score: 1

    Ummm ... hi there ... Canadian here ... please can we not get dragged into this :-)

    ]{

  56. Re:Any surprise this was done in China? by PassBy · · Score: 1

    Dare not use your real name, eh, anonymous coward? The head of Google China was educated in North America, he worked in North America and he was sent back to China by Microsoft. So where did he learn his engineering ethics? Do you want to compare the number of IT lawsuits going on in America and China? I have to give it to you though. That was a quick one! I can't imagine anyone able to strike so low so fast, except for someone who always has this little hate in mind.

  57. Re:Do no evil? by Anonymous Coward · · Score: 0

    What I think is that you are one among many who are envious that they don't have the ideas and insight that Google has. Google just bought youtube, and somehow you think they have the resources to prevent millions of people from uploading copyrighted content. They are obligated to take down what they are told is copyrighted and they have done that. This works the same for any hosting provider. Scanning books so that the internet public would be able to search for books the way we do for websites was an awesome idea. By returning the names of the books and small quotations, this protects the copyright owners. This was not copyright infringement, because they were not selling the contents of the books, only the ability to search them. I think this is actually great for authors, whether they realize it or not - and it isn't illegal just because some may want it to be. And finally, if you don't like your page cached by Google then exclude google using the robots.txt exclusion STANDARD, as every good webmaster knows. (On second thought maybe you like traffic to your site?)
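
    For anyone who hasn't used it, the exclusion mechanism looks like this. A minimal sketch using Python's standard `urllib.robotparser` to check a robots.txt; the site URL and the second bot name are placeholders I made up, while `Googlebot` is the user-agent token Google's crawler identifies itself with:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that shuts out Google's crawler but leaves everyone else alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/page.html"  # placeholder URL
googlebot_allowed = rp.can_fetch("Googlebot", url)
other_allowed = rp.can_fetch("SomeOtherBot", url)
print(googlebot_allowed, other_allowed)
```

    Note that this only tells well-behaved crawlers what to skip; it is a convention, not an access control.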

    I'm not sure what happened in this case, but I do know that in American law you cannot copyright something unless it has some artistic value. If what Google took is judged to be just raw data such as a phonebook, have they broken the law? I think the more Google "does no evil" the more people will try to prove them wrong, but the courts will decide ultimately. I don't blame you for being envious, for I am too. But I admit it.

  58. how come the wealthy so need to cheat? by Anonymous Coward · · Score: 0

    and why aren't the wealthy held accountable?

    i wonder if there are any ethical wealthy people?

    sure doesn't seem so

  59. Begs teh question. by Anonymous Coward · · Score: 2, Funny

    Sohu cares?

  60. No, sohu.com is a Delaware company by hackingbear · · Score: 1
    Depends on how you look at it. Officially it is a Delaware company according to its S-1 filing.

    1. Organisation and Nature of Operations

    Sohu.com Inc. (the "Company") was incorporated in Delaware, USA in August 1996 under the name of Internet Technologies China, Inc. The Company changed its name to Sohu.com Inc. in September 1999. The Company does not have any substantive operations of its own and substantially all of its primary business operations are conducted through its wholly-owned subsidiary, Sohu ITC Information Technology (Beijing) Co., Ltd., which was incorporated in the People's Republic of China during 1997. The Company offers internet-based advertising and content through its internet portal site, Sohu.com. The Company conducts its business within one industry segment and markets its products and services to clients primarily in the People's Republic of China.


    Like every successful hi-tech company, sohu.com is registered in the US or on a Caribbean island and run by western venture capitalist firms.


    Now that would make Google guilty.

  61. Minor correction by Anonymous Coward · · Score: 0

    It's definitely not enough to learn 3500 characters and their meanings. Contemporary Chinese uses mostly two character words. So depending on the context the meaning of every character changes and that needs to be learnt as well. For example xin means heart, yet mostly it is used in conjunction with another word: xiao-xin means careful, xin-xin (not the same characters) means confidence, guan-xin means to be concerned about, xin-li means mentality etc.

    There are literally hundreds of dictionary entries containing the character for xin. Granted that you might get an idea about the meaning of a word if you know each of the characters, but mostly you will still have to learn meaning and usage and they will most definitely need separate dictionary entries. E.g. gu-shi means story, shi-gu (same characters) means accident. - So the data structure itself is much more complex than you put it.

  62. Thai fonts by Anonymous Coward · · Score: 0

    Do you have any more detail on that Thai font decision by the way, like what fonts it involved ? The PSL, DS ones ?

    1. Re:Thai fonts by Daengbo · · Score: 1

      I'll take a look into it. It's been a few years now. Some popular Thai fonts were being illegally distributed (surprise!) and so the copyright owner went to court over it. In a surprise verdict, the judge ruled that fonts could not be copyrighted. Googling brings up this page, but I couldn't get the page to show up. Maybe it's for members only, or maybe the Bangkok Post (or Thai Internet, in general) is just insanely slow.

  63. Oh please... by Moraelin · · Score: 2, Insightful

    Oh please... if Google wanted to distance itself from it, they could have done so long ago. "Sorry, mates, some of our employees fucked up, they've been fired and the offending code/product/database is now being pulled off the market until we build our own replacement."

    The whole bullshit, including trying to get away with just deleting the original developpers' names, and press releases about "leveraging non-Google assets" is what's damning Google. It's not just that the original incident happened, it's that from there Google seemed to not even understand why it's bad and why the heck should they give a damn. The original incident may have been an individual developper's fuck-up, but from there it's Google and their corporate policies deciding how to deal with it. And how they _did_ chose to deal with it, frankly, stinks.

    Yes, no one expects total mind control, but if _also_ the legal team is out of control and answers it in a way unrepresentative of Google, and _also_ the PR team is out of control and pulls a damning "we were just leveraging someone else's resources" statement on their own, etc, then, ffs, they have a problem. At some point you have to assume some responsibility and control, and not just hide behind not knowing what everyone else is doing. If you don't even know what your legal and PR teams are doing at all, even in a public incident, then you better assert some control real fast.

    Additionally "do no evil" does imply a dose of responsibility there. You can't say, basically, "oh, the Mafia does no evil, it's just some of our members that we don't really mind-control, that are shooting people or fitting them with cement shoes." If the individual members are free to do evil, and get the company's full backing in some "we were only leveraging other people's resources" statement, then on what do you base that "do no evil" slogan any more?

    RL "evil" isn't some "Black And White" game notion, involving actively hating all humanity and actively seeking to do harm, including self-harm, just for harm's sake. And no company does that overtly anyway, so if that's what Google is distancing itself from, then it doesn't say much.

    RL "evil", including corporate evil, is more along the lines of not giving a damn about who gets hurt, if it helps you forward your own interests. It's not actively trying to poison a river just for the chuckle of seeing some people get sick, it's not caring who gets sick as long as you saved some money by just dumping your waste in the river. It's not actively trying to get some excuse to shoot some people as a Mafia don, it's about not giving a damn if it takes some corpses to forward your own interests in an area. If shooting some people to make an example is what works, so be it, it's as good a means to an end as any. Etc.

    Or to get back to corporations, Enron too didn't make defrauding investors its whole purpose, it just didn't give a damn who gets hurt by their lies. It had no qualms even with advising its own employees to buy stock at a time when management was selling theirs. Again, not because some super-villain at the top had a chuckle at hurting employees, but because they didn't give a damn.

    Basically it's not about having some principles to create as much suffering and destruction as possible, it's about lacking the principles and empathy to avoid doing it. That's what corporate evil is: simple sociopathic behaviour.

    And if an organization doesn't give a damn at all about what its employees are doing, and who they're hurting, as long as they get the product out the door, then, congrats, it just lost all credibility for some "do no evil" claim. It just showed as many sociopathic tendencies as any other corporation, only maybe in a more decentralized fashion. You know, why have one sociopath at the top coming up with all the evil schemes, when you can have a thousand sociopaths in lower positions encouraged to feel free to come up with their own heists?

    --
    A polar bear is a cartesian bear after a coordinate transform.
    1. Re:Oh please... by SEE · · Score: 1

      if Google wanted to distance itself from it, they could have done so long ago

      So you think less than a week from release of product, to issue raised, to confession of theft and replacement with a genuinely in-house database, isn't a fast-enough response time? So, tell me, is it just that you never have dealt with a large organization before in your lifetime, or is it that you believe that Google management has magic powers?

    2. Re:Oh please... by Moraelin · · Score: 1

      They obviously had the time to first try to "fix" it by removing the original developers' names, and now to pull the weird "we're just leveraging non-Google assets" statement. It seems to me like it would have taken exactly the same time to make a less irritating statement.

      Let me also say that I seriously doubt that they could replace such a database in-house within a week. There's a _lot_ of work involved in such a thing. Even if you have the most l33t code ever, the research involved isn't something you'd get done in a couple of days. So I'm guessing they're just continuing to show a "so who cares about other people's IP, as long as they don't know we use it" attitude.

      --
      A polar bear is a cartesian bear after a coordinate transform.
    3. Re:Oh please... by SEE · · Score: 1

      1) What they had time for was somebody to yell at a lower-level guy "Fix this!" and the lower-level guy to make a stupid, half-assed response. Again, have you never dealt with a large organization made up of human beings?

      2) The guys who made the original complaint say the new database doesn't rip off theirs. Maybe they're ripping off somebody else's, but you have no evidence of that.

  64. it is not known data by phorm · · Score: 2, Informative

    Again, it is not the "known data" that is at question here, but the database as an object in its entirety.

    Nobody is accusing Google of "copying Chinese characters", but rather of copying a specific collection that somebody has invested time and money in creating. This is not a corpus, but rather more like a dictionary. Anyone can create one, but Google - which I have eminent respect for in other areas, but not this one - has decided to take somebody else's "dictionary" rather than creating their own. The compilation existed as somebody else's work. Likely Google could have made an attempt to buy it. Equally likely, they could have produced a similar offering on their own. Instead, they chose to take another group's work and then refused to give said group adequate compensation, or even to admit that they had taken it from said group.

    1. Re:it is not known data by Daengbo · · Score: 1

      See here.

  65. Re:Any surprise this was done in China? by asninn · · Score: 1

    I'm curious how much time you've spent outside of North America, because I'm pretty sure 92% of the world population would disagree with you.

    Make that 95%, and count me in as one of those who'd disagree and who're curious as well.

    --
    butter the donkey
  66. Why is anyone surprised? by ady1 · · Score: 1

    Google is a corporation and should be dealt with as such. It is not an individual with a single mind or strong beliefs.

    It can make mistakes and do evil regardless of what they say. Their primary purpose is to make money and they will and can do anything to achieve it.

    Also, it is easy to do what you believe in with a small, like-minded group of people. It is much harder to do so with over 10,000 people, most of whom don't think the way you do or share your moral obligations.

  67. Google's response by Loconut1389 · · Score: 2, Funny

    The person responsible for the copying has been sacked. ...
    The person responsible for the sacking has been sacked...

  68. illegal? or just wrong? by Anonymous Coward · · Score: 0

    All this loose talk about 'plagiarism' and 'stolen' is pointless. Either Google infringed a copyright or they did nothing illegal. Pick one.

    There are limits on whether you can copyright facts at all, and they vary from country to country. Does China even have a copyright law that covers dictionaries?

  69. Re:Any surprise this was done in China? by steelfood · · Score: 1

    I think GP is a troll, but the actual point is valid. It isn't that parts of the world "know nothing about ethics or morality." It is that other cultures have other standards of ethics and morality. While most cultures have similar basic ethics and morals (do not kill, do not steal--actually a generalization of the first, etc.), something that falls into a gray area like reusing the IP of another will be inconsistent throughout the world. Besides, we don't really have an established moral outlook on IP infringement, which is why we call it "theft" more often than not. It's because theft is the closest thing we know to IP infringement. Hence, it is negative by association.

    It is not to say that Google using Sohu's database is OK because it happened in China. If Sohu started using Google's database, Google would likely make a big stink about it too. But it probably isn't as big a deal over there as it is here. Certainly, it would not be considered "evil" behavior. It wouldn't be good either, but it doesn't quite fall into evil yet.

    --
    "If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
  70. Re:Any surprise this was done in China? by Anonymous Coward · · Score: 0

    You must be Chinese because your English sucks really badly. How about going to ESL and brushing up on your grammar?

  71. Nobody cares about "work." by Kadin2048 · · Score: 1

    meh, the argument for why compilations of public domain "facts" should be considered a copyrightable work is that it is work to compile those facts. Why people can't understand that not all work results in property is beyond me, but there's ya reasoning.

    I don't know about in China (does China even have a copyright system to begin with?), but in the U.S., the amount of "work" you put into something doesn't matter one whit in terms of it being copyrightable. You could spend your entire life compiling statistics on something, and at the end of the project, the only thing that you could copyright would be things like the actual typesetting and any copy that you wrote in between the statistics themselves. It's the same thing with recipes: anyone can copy Julia Child's french bread recipe from Mastering the Art of French Cooking; what they can't copy is the text itself describing how to execute/implement that recipe.

    But enormous amounts of effort are routinely put into things like mathematical and physics tables (and historically, they were a lot more important than they are now), and the data themselves aren't protected. You can't own the digits of pi, or the atomic weights of the elements, regardless of how much time you spend figuring them out. The problems associated with letting people "own" and claim copyright to bare facts or compilations of facts would greatly outweigh the possible economic benefits of letting people derive additional economic gain from them.

    If the Chinese allow companies to copyright bare collections of facts, they're a bunch of idiots.

    --
    "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  72. Re:Any surprise this was done in China? by BiggerIsBetter · · Score: 1

    I don't believe that morality comes into it. Possibly ethics, but my limited experience with the US tells me that if a) you can gain advantage, b) you can get away with it, and c) the exposure is less than the cost of doing it yourself, then you steal/copy/infringe on the "IP". Anything less would be bad business. China isn't so different...

    --
    Forget thrust, drag, lift and weight. Airplanes fly because of money.
  73. Doing evil to combat evil.. by iendedi · · Score: 1

    Its both. Do evil to combat evil. Thats the American way now, didn't you get the memo?

    That is only one step away from "Doing evil to combat perceived evil". Or is that even one step?

    At any rate, since human perception is highly flawed, the practice of "Doing evil to combat perceived evil" can really be reduced to "Doing evil and hoping it limits the evil that others do". However, "Doing evil and hoping it limits the evil that others do" is really the same thing as simply "Doing evil." In fact, it is even worse, because it is really "Doing evil while competing with other evil in the hopes that you are the only one left".

    Naturally, once we have truly followed the diabolical nature of this new approach, we are simply left with "Doing evil in the hopes of having a monopoly on doing evil."

    Shame on you, Google.
    --

    It is your personal duty to fight for what is right on a daily basis. Ignoring injustice is identical to approving