Linguistics Meets Linux: A Review of Morphix-NLP
Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications onto a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included, and there is still plenty of room on the CD, which uses a compressed filesystem to give us the best of both worlds."
All this language processing packed onto a single CD yet
Trolling is a art,
I was in the process of downloading this already. Damn you slashdot!
Neat.
--dw
Actually, I saw someone working on something like parsing English as a programming language; try a Google for 'controlled english' sometime. The general idea is that management may not be able to write the specifications, but they can read them and tell you it isn't what they're really after _before_ you code the thing.
Maxis will have The Sims actually talking, instead of looking "special".
If you want to play the typical stereotype... please at least get it right.
:)
It's the Japanese who have problems pronouncing L's... and the Chinese have problems pronouncing R's.
The Westerners, on the other hand, can pronounce almost anything, but will never ever get facts right.
Welley Corporation - SLM Scammers
This page has some reasons.
New version? Got this after some googling
http://www.forum2010.org/
Actually, this software seems like it would be totally useless for that purpose. The software was developed with a bunch of heuristics and domain knowledge put in by experts in English or the relevant language. Without similar expertise, the software can't be adapted to a new language. The software isn't a universal translator.
So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
There was a brief time when they were Forum 3000, but the domain has fallen into the hands of domain squatters.
:)
Forum 2000 and 3000 died mainly because the people who ran them got bored and/or wanted to work on their graduate theses. It sure was fun to play with the Zephyr interface while it lasted, though.
I wonder whether Forum 2010 is run by the same folks. I doubt it since Forum 2000 and 3000 were both Carnegie Mellon projects, and forum2010.org is registered to someone in St. Louis.
For more information, click here.
Here is where you can go to download the .iso image.
Try not to kill their site. If someone has downloaded it, it would be nice of them to post a .torrent on Slashdot.
I wonder whether Forum 2010 is run by the same folks. I doubt it since Forum 2000 and 3000 were both Carnegie Mellon projects, and forum2010.org is registered to someone in St. Louis.
That's me, actually. You can't expect hundreds of Slashdot geeks to suddenly slam my site without me noticing. ];-)
Forum 2010 had, in fact, nothing to do with the great fellows at Forum2k/3k aside from inspiration. And, just to end the rumors, I built the F2.01k matrix and all my own SOMADs as a senior project for my Comp Sci degree at Fontbonne University.
Now, I'm late for a date! Please don't destroy the matrix while I'm gone!
--
I'm a programmer getting my master's in linguistics. Computer Science undergrad. Trust me. This is some tough stuff... until you learn the basics. Then everything starts making sense. There is a huge hurdle getting into any field... and it is usually because of the terminology. Every field has its own terminology because every field needs to be extremely precise in its explanations.
Linguists don't think Knuth is very lucid.
Linguistics is neat. Syntax (the study of the structure of language), Phonology (the study of the interactions of sounds and what a child has to actually 'learn'), Phonetics (the study of the human language system and the sounds that it can produce/hear), and Morphology (the study of the smallest possible unit that holds 'meaning') all work together to form an idea of what goes on in the human mind.
http://jones.ling.indiana.edu/~prrodrig
Well, I'll answer your questions both in respect to NLP, and also more generally.
First of all, most practical NLP techniques aren't *that* complicated simply because they must be able to be computed quickly. There are quite a few statistical hacks prevalent
Most NLP techniques use probabilistic variants of two models: finite automata and pushdown automata (both models are actually pretty simple, but if you don't know what they are, they may sound complicated).
Finite automata consume input and transition to different states (a finite number of them) based on that input. They can also be interpreted as generating output instead of consuming input.
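To make that concrete, here is a minimal sketch of a deterministic finite automaton in Python. The states, the part-of-speech tags, and the `accepts` helper are all made up for illustration; the automaton just recognizes a toy noun phrase of the form determiner, any number of adjectives, then a noun.

```python
# Hypothetical toy automaton: DET ADJ* NOUN.
# State names and tags are illustrative assumptions, not from any NLP toolkit.
TRANSITIONS = {
    ("start", "DET"): "need_noun",
    ("need_noun", "ADJ"): "need_noun",  # loop: any number of adjectives
    ("need_noun", "NOUN"): "done",
}
ACCEPTING = {"done"}


def accepts(tags):
    """Consume the tags one at a time, moving between a finite set of states."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING
```

Calling `accepts(["DET", "ADJ", "NOUN"])` walks start, need_noun, need_noun, done and accepts; a bare `["DET"]` stops in a non-accepting state and is rejected.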
Pushdown automata are almost the same, except that they also have a stack that they can push symbols onto and pop symbols off. They recognize exactly the languages that Context Free Grammars (CFGs) generate, so the two are often talked about interchangeably.
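The stack is what buys the extra power: it lets the machine match counted or nested structure that no finite automaton can track. A minimal sketch, assuming the textbook language of n "a" symbols followed by n "b" symbols (the function name and stack marker are hypothetical):

```python
def accepts_anbn(symbols):
    """Stack-based recognizer for a^n b^n (n >= 0), a classic context-free language."""
    stack = []
    seen_b = False
    for s in symbols:
        if s == "a":
            if seen_b:          # an "a" after a "b" breaks the pattern
                return False
            stack.append("A")   # push one marker per "a"
        elif s == "b":
            seen_b = True
            if not stack:       # more "b"s than "a"s
                return False
            stack.pop()         # each "b" cancels one "a"
        else:
            return False        # symbol outside the alphabet
    return not stack            # accept only if every "a" was matched
```

The same push-and-pop idea is what lets a parser keep track of nested clauses in a sentence.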
As I said above, most NLP techniques use probabilistic variants of and small extensions to these two concepts.
Markov models (probabilistic finite automata) work so well for modeling speech because they are flexible, simple, and linear, just like speech. CFGs work so well for modeling language because they are flexible and hierarchical, and so can capture the recursive nature of language (think about "the man who killed the horse who killed the dog who...").
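As a rough illustration of the "probabilistic finite automaton" idea, here is a sketch of a first-order Markov (bigram) model over words. The function names, the `<s>`/`</s>` boundary markers, and the unsmoothed maximum-likelihood estimate are my own assumptions; real systems add smoothing and much more.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]  # assumed boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def probability(counts, prev, cur):
    """Unsmoothed estimate of P(cur | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0
```

Each word acts as a state, and the counts give the transition probabilities out of it, which is why the model is linear: it only ever looks one step back.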
Having said all of that, I don't think that these models capture the way that humans process language/speech. I think that neural networks have the potential to capture this better. They just aren't mature enough. We also don't really have a good architecture to run neural networks. A human brain has about 10^11 neurons and on the order of 10^14 synapses (within an order of magnitude or so) that run in parallel. Try simulating that on today's serial architectures, and you'll run into problems.
So my hypothesis is that there is probably some inherently simple learning algorithm for neural networks that we just don't know yet that will help solve many different types of problems (there is some biological evidence of there being a single learning algorithm implemented in the brain).
So yes, there is likely a simpler answer, but until we know it, we have to use heuristics and statistical hacks in order to build systems that work.
As to science in general, the reason it all sounds complicated is twofold:
First, things interact in a very chaotic way. Even if the interactions are simple, when you compose many very small interactions, you find complex behavior.
Secondly, even if the interactions are actually simple, we humans with our Newtonian intuitions have a hard time understanding non-Newtonian interactions.
Hope that helped.
http://yetanotherpoliticalrant.blogspot.com
I guess this would interest you too. BTW, have you read "Le Ton Beau de Marot" by Hofstadter?
In 1977, Xerox adopted Systran for internal translations by creating a Multinational Customized English that's easier to translate. [1]
In 1930, C.K. Ogden proposed a tiny version of English: just 850 words that could be learned in a few months and used to say anything. He called it Basic English (BE). [2] [3]
There is no talk of linguistics complete without mentioning Chomsky's political diatribes. :)
He pretty much defined linguistic theory for the past 40 years. Once he had a voice he turned into somewhat of a political critic. A conspiracy-theorist. I don't see him solving any political problems, and I don't know how well respected he is by those who study such things, but I think he's a loon. (But, oh god, I wish I could study with him. :) )
Chomsky's papers are tough to comprehend for beginners. (Which I am.) Those who are interested in learning Chomskian theory may wish to pick up some Andrew Radford. (He is very understandable, and his book "Transformational Grammar" is aimed at the undergraduate-level syntax class. Once you tackle that, you can read Haegeman, "Government and Binding," which seems to be the most used graduate-level book... but this one is quite boring.)
In the meantime, a linguistic glossary which may help you get through some of the papers you may find: http://tristram.let.uu.nl/UiL-OTS/Lexicon/
http://jones.ling.indiana.edu/~prrodrig
Linguists have always been geeky. Don't forget that Larry Wall is a linguist first.
The only computer class I ever took was in 1983 called "Computer tools for natural language analysis". It was an introductory Unix course. We learned grep, awk, sed as well as tools like vi, Mail, and rogue. And a tiny little bit of C. But since then I've taught C at the graduate level.
Linguistics is all about the representation and manipulation of information. But instead of it being about languages we design for particular purposes, it is about the language system that we use naturally.
Suppose you have a few thousand languages that you know were written with the same tools (like lex and yacc, but not lex and yacc), but you have no access to those tools. Suppose you are trying to figure out what those tools are from examining the languages (not the compilers) that have been specified using those tools. That is what theoretical linguistics is trying to do. We know that the specification of English and the specification of Dyirbal and every other human language out there are somehow "written" with the same tools. It's pretty neat stuff.
Linguists were early adopters of TeX, have had a Unix affinity for a while, and as people who are interested in how information is internally represented and manipulated, like reading the source.
I remember once nagging the sys admins to always make sure that there is a man page for anything added to /usr/bin or /usr/local/bin.
The next day, they asked me to look at the manpage for something to see if it met with my approval. The DESCRIPTION was the C source. I was happy to say that it did, indeed, meet with my approval.
At one point, a well-known professor (Geoffrey Pullum) had written a little essay for a newsletter on the "grammar of Unix" using linguistic-style analyses of the shell. Naturally several of us feigned outrage at his confusion of "Unix" with the shell. Another linguist (Bill Poser) went so far as to write a shell which was verb (command) final and postpositional. That is, instead of saying
cat foo bar > bang
you would say
foo bar bang > cat
That is, the arguments precede the command, and the redirect symbols go after the filename they redirect to or from. Now for various reasons, I had root access on a machine that Pullum used. So I changed his shell to this command-final one. He actually caught on remarkably quickly. And after a quick
/bin/sh chsh
he was ready to concede the point.
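For anyone curious what that word order amounts to, here is a hypothetical Python sketch that rewrites a command-final, postpositional line into ordinary shell order. This is not Poser's shell, just a token shuffle based on the two rules described above, and it assumes plain whitespace-separated tokens with no quoting or pipes.

```python
REDIRECTS = {">", ">>", "<"}  # assumed set of redirect symbols


def to_conventional(line):
    """Rewrite 'foo bar bang > cat' as 'cat foo bar > bang'."""
    tokens = line.split()
    if not tokens:
        return ""
    *rest, command = tokens       # the command comes last
    args, redirects = [], []
    for tok in rest:
        if tok in REDIRECTS:
            if not args:
                raise ValueError("redirect symbol with no preceding filename")
            # the symbol follows its filename, so pull that filename back out
            redirects += [tok, args.pop()]
        else:
            args.append(tok)
    return " ".join([command] + args + redirects)
```

So `to_conventional("foo bar bang > cat")` gives back the familiar `cat foo bar > bang`.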
For me, there is no surprise that linguists, and particularly computational linguists, are OSS enthusiasts. But that is enough of my random musings for now.
Prime numbers are exactly what Alan Greenspan says they are -S. Minsky