Slashdot Mirror


Google Open-Sources SyntaxNet Natural-Language Understanding Library, Parsey McParseface Training Model

Google announced on Thursday that it is open sourcing its new language parsing model called SyntaxNet. It's a piece of natural-language understanding software, Google says, that you can use to automatically parse sentences, as part of its TensorFlow open source machine learning library. The company also announced that it is releasing something called Parsey McParseface (Google has a sense of humor), which is a pre-trained model for parsing English-language text. Nate Swanner of The Next Web attempts to explain it: Combining machine learning and search techniques, Parsey McParseface is 94 percent accurate, according to Google. It also leans on SyntaxNet's neural-network framework for analyzing the linguistic structure of a sentence or statement, which parses the functional role of each word in a sentence. If you're confused, here's the short version: Parsey and SyntaxNet are basically like five-year-old humans who are learning the nuances of language. In Google's simple example above, 'saw' is the root word (verb) for the sentence, while 'Alice' and 'Bob' are subjects (nouns). Parsey's scope can get a bit broader, too.
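For readers who want something concrete, the "functional role of each word" idea can be sketched as a small data structure. This is only an illustration, not SyntaxNet's API or output format: the `Token` class, the hand-written parse of "Alice saw Bob", and the relation labels are all assumptions. The labels follow common dependency-grammar conventions, under which the second noun would normally be the object of the verb rather than a second subject. The scoring function shows one common way to measure parser accuracy (the fraction of words given the correct head); the summary does not say whether that is the exact metric behind the 94 percent figure.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    word: str
    head: int    # index of the governing word; 0 marks the root
    label: str   # relation to the head (label names are assumptions)


# Hand-written parse of "Alice saw Bob"; not produced by any parser.
PARSE = [
    Token(1, "Alice", 2, "nsubj"),
    Token(2, "saw", 0, "ROOT"),
    Token(3, "Bob", 2, "dobj"),
]


def root(tokens):
    """Return the single token whose head is 0."""
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def dependents(tokens, head_index):
    """List (word, label) pairs for every token attached to head_index."""
    return [(t.word, t.label) for t in tokens if t.head == head_index]


def head_accuracy(gold, predicted):
    """Fraction of tokens whose predicted head matches the gold head."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("parses must be non-empty and the same length")
    correct = sum(g.head == p.head for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Calling `root(PARSE)` picks out "saw", and `dependents(PARSE, 2)` lists the two nouns with their roles.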

5 of 56 comments

  1. Re:Prase this, McParseface by mythosaz · · Score: 4, Interesting

    ...and while McParseface is at it, he can chew on:

    "Wouldn't the sentence 'I want to put a hyphen between the words Fish and And and And and Chips in my Fish-And-Chips sign' have been clearer if quotation marks had been placed before Fish, and between Fish and and, and and and And, and And and and, and and and And, and And and and, and and and Chips, as well as after Chips?"

  2. Re:SubjectsSuck by Aighearach · · Score: 2

    It all read just fine to me. The only mistake I noticed was that

    natural-language understanding software

    should have been

    natural-language-understanding software

    since it is the software doing the understanding, not the language. The quote itself is clear and concise. If you didn't understand it, that probably just means you lack the technical vocabulary to even make use of the tool.

  3. Time flies like an arrow. by jeffb (2.718) · · Score: 3, Interesting

    Fruit flies like a banana.

  4. Rule-based still easily best by Jezral · · Score: 2

    94% syntax is definitely good, for a machine learning parser. Now if you were to come to the land of rule-based parsers, 94% is the norm.

    Google loves machine learning, and it's easy to see why. That's how they made their whole stack. They have the huge amounts of data to train on, and the hardware to do so. It's so seductive to just throw a mathematical model at huge amounts of data and let it run for a few weeks.

    Rule-based systems don't need any data to work with - they just need a computational linguist to spend a year writing down the few thousand rules. But the end result is vastly better, fully debuggable, easily updatable, understandable, and domain independent. That last bit is really important. A system trained for legalese won't work on newspapers, but a rule-based system usually works equally well for all domains.

    In 2006, VISL had a rule-based parser doing 96% syntax for Spanish (PDF) - our other parsers are also in that range, and have naturally improved since then. Google is hopelessly behind the state of the art.

    1. Re:Rule-based still easily best by Jezral · · Score: 2

      which seams much more expensive than

      It'd seem that way, but it's really not if you factor in the whole chain.

      Machine learning needs high quality annotated treebanks to train from. Creating those treebanks takes many many years. It is newsworthy when a new treebank of a mere 50k words is published. Add to that the fact that each treebank likely uses different annotations, and you need to adjust your machine learner for that, or add a filter. Plus each treebank is for a specific domain, so your finished parser is domain-specific. If you want to work with other kinds of text, you need to produce a treebank for that domain and then train on it.
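The "add a filter" step for treebanks with different annotations could look roughly like the sketch below. Both tag inventories and the shared target tag set are hypothetical, made up for illustration; real treebanks differ in far more than tag names.

```python
# Hypothetical tag inventories for two treebanks, mapped to one shared tag set.
TAG_MAP_A = {"NN": "NOUN", "NNS": "NOUN", "VBD": "VERB", "DT": "DET"}
TAG_MAP_B = {"N": "NOUN", "V": "VERB", "ART": "DET"}


def normalize(tagged_words, tag_map):
    """Rewrite (word, tag) pairs into the shared tag set.

    Raises KeyError on an unmapped tag so that gaps in the mapping are
    noticed instead of silently passed through to the learner.
    """
    normalized = []
    for word, tag in tagged_words:
        if tag not in tag_map:
            raise KeyError(f"no mapping for tag {tag!r} on word {word!r}")
        normalized.append((word, tag_map[tag]))
    return normalized
```

After normalization, the same sentence annotated in either scheme yields identical training input.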

      Thus, the bulk work is in annotation and mathematical models. Google skipped the step of creating a treebank, and instead uses available ones. There aren't any usable treebanks for smaller languages, making the whole machine learning endeavor useless for all but the large languages.

      Rule-based parsers are the opposite of that. You can put the same amount of man hours into creating rules as you otherwise would a treebank plus mathematical model, but you can do so on any old laptop with almost zero data to work from. You just need to know the language. A parser produced in this way is not domain specific, but can be easily specialized for a domain if needed. And a rule-based parser can be used as a bootstrap engine for creating high quality treebanks, because the rules are upwards of 99% accurate, meaning humans only need to put a fraction of work on top of it.

      And as I wrote, rules are debuggable. You can figure out exactly why a word was misanalyzed, and fix it. Machine learning can't do that. The edit-compile-test loop of machine learning is in weeks or hours - with rules it's in minutes or seconds.
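The debuggability point can be illustrated with a toy disambiguator that records which rule removed which reading. This is a minimal sketch and not the VISL parser or any real rule formalism: the lexicon, both rules, and their names are invented for the example, and a real grammar would have thousands of rules rather than two.

```python
# Toy lexicon: every word lists all the tags it could have (hypothetical data).
LEXICON = {
    "the": {"DET"},
    "alice": {"NOUN"},
    "bob": {"NOUN"},
    "saw": {"NOUN", "VERB"},
}


def rule_no_verb_after_det(readings, i):
    """Propose removing VERB when the previous word is unambiguously DET."""
    if i > 0 and readings[i - 1] == {"DET"}:
        return "VERB"
    return None


def rule_verb_between_nouns(readings, i):
    """Propose removing NOUN when both neighbours are unambiguously NOUN."""
    if 0 < i < len(readings) - 1:
        if readings[i - 1] == {"NOUN"} and readings[i + 1] == {"NOUN"}:
            return "NOUN"
    return None


RULES = [
    ("no-verb-after-det", rule_no_verb_after_det),
    ("verb-between-nouns", rule_verb_between_nouns),
]


def disambiguate(words):
    """Apply every rule to every word and log each removal.

    Returns (readings, trace); trace holds (word, rule name, removed tag).
    Words missing from LEXICON raise KeyError.
    """
    readings = [set(LEXICON[w.lower()]) for w in words]
    trace = []
    for i, word in enumerate(words):
        for name, rule in RULES:
            tag = rule(readings, i)
            # Never remove the last remaining reading of a word.
            if tag in readings[i] and len(readings[i]) > 1:
                readings[i].discard(tag)
                trace.append((word, name, tag))
    return readings, trace
```

If a word comes out with the wrong tag, the trace names the rule responsible, which is the kind of minutes-or-seconds feedback loop the comment describes.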