Linguistics Meets Linux: A Review of Morphix-NLP
Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications into a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included, and there's still a lot of space on the CD, which uses a compressed filesystem for bringing us the best of both worlds."
Should have used BitTorrent. Then it'd be "I was in the process of downloading this already. Yay for Slashdot!!!"
The World Wide Web is dying. Soon, we shall have only the Internet.
This page has some reasons.
New version? Got this after some googling
http://www.forum2010.org/
So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.
Well, yeah. I _know_ that. I was just speculating that such tools would be useful in the effort of learning/translating/etc. a language that had not as yet been studied formally.
--dw
There was a brief time when they were Forum 3000, but the domain has fallen into the hands of domain squatters.
:)
Forum 2000 and 3000 died mainly because the people who ran them got bored and/or wanted to work on their graduate theses. It sure was fun to play with the Zephyr interface while it lasted, though.
I wonder whether Forum 2010 is run by the same folks. I doubt it since Forum 2000 and 3000 were both Carnegie Mellon projects, and forum2010.org is registered to someone in St. Louis.
For more information, click here.
Here is where you can go to download the .iso image.
Try not to kill their site. If someone has downloaded it, it would be nice of them to post a .torrent on Slashdot.
I was surprised to read that GATE was not listed in the package list. It's the best piece of software to tie together the discrete components that were included. Another complaint is that there are a lot of so-so implementations of very good algorithms. (#define NOT_FLAMEBAIT = 1) I suppose that you have to turn to corporate software to get the really robust implementations and to free software when you want the cutting edge.
I wonder whether Forum 2010 is run by the same folks. I doubt it since Forum 2000 and 3000 were both Carnegie Mellon projects, and forum2010.org is registered to someone in St. Louis.
That's me, actually. You can't expect hundreds of Slashdot geeks to suddenly slam my site without me noticing. ];-)
Forum 2010 had, in fact, nothing to do with the great fellows at Forum2k/3k aside from inspiration. And, just to end the rumors, I built the F2.01k matrix and all my own SOMADs as a senior project for my Comp Sci degree at Fontbonne University.
Now, I'm late for a date! Please don't destroy the matrix while I'm gone!
--
I have been using the base Morphix system for a Bengali l10n Live CD project (which was mentioned at slashdot a few days back). I am really amazed by its capabilities - if you want to have a LiveCD of your own - this is probably the best starting point.
For documentation, you may want to have a look at the Morphix Wiki.
While you're right that this probably won't be of much help to the typical anthropologist, it's not at all true that most of the software has lots of built-in domain knowledge.
At least half the tools are general purpose applications for constructing various kinds of models, whether they be trees or HMMs or n-gram models or entropy models.
Believe it or not a lot of NLP work gets done on understanding algorithms that apply broadly across languages.
There is some English specific stuff on the CD, but most of it isn't.
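To illustrate why such model-building tools are language-independent, here is a minimal sketch of a bigram model with add-one smoothing. It is not taken from any package on the CD; the function names and the toy corpus are made up for illustration, and whitespace tokenization is an assumption that would not suit every language.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count context unigrams and bigrams over whitespace-separated tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus (hypothetical); nothing in the code depends on the language used.
corpus = ["the cat sat", "the dog sat", "kedi oturdu"]
unigrams, bigrams = train_bigrams(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}
p_cat_given_the = bigram_prob("the", "cat", unigrams, bigrams, len(vocab))
```

The only thing the code knows about the input is that it is a sequence of tokens, which is the sense in which these tools apply across languages.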
The only software
No talk of linguistics is complete without mentioning Chomsky's political diatribes. :)
He pretty much defined linguistic theory for the past 40 years. Once he had a voice he turned into somewhat of a political critic. A conspiracy-theorist. I don't see him solving any political problems, and I don't know how well respected he is by those who study such things, but I think he's a loon. (But, oh god, I wish I could study with him. :) )
Chomsky's papers are tough to comprehend for beginners (which I am). Those who are interested in learning Chomskian theory may wish to pick up some Andrew Radford. (He is very understandable, and his book "Transformational Grammar" is aimed at the undergraduate-level syntax class. Once you tackle that, you can read Haegeman, "Government and Binding," which seems to be the most used graduate-level book... but this one is quite boring.)
In the meantime, a linguistic glossary which may help you get through some of the papers you may find: http://tristram.let.uu.nl/UiL-OTS/Lexicon/
http://jones.ling.indiana.edu/~prrodrig
(1) Insert the knob behind the lever.
In (1) you could perhaps use a handful of terms instead of "knob" -- controlled language enforces only certain licensed terms, thus increasing overall consistency (same term for the same thing). This can be checked automatically once a positive list (or typically a hierarchy called a "thesaurus") has been set up.
(2) He saw the girl on the hill with the telescope.
The second and third cases are lexical and structural ambiguity: we want to avoid problems like those in (2), where "saw" could be the past tense of "to see" or have another (more morbid) interpretation. Even worse, it is unclear whether the girl is on the hill, carrying the telescope, or whether "he" is spying on the girl with the telescope. I leave it as an exercise to the reader how many combinations (possible interpretations) there are in a sentence like (2) [Hint: Which verb? Who is where? Who carries the telescope?].
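One way to approach the exercise is to enumerate the readings mechanically. The sketch below is only one possible way of counting: it assumes each prepositional phrase attaches to some head to its left, that attachments may not cross (a common simplifying assumption in parsing), and it takes the two verb senses at face value. The word positions and names are written out by hand for this one sentence.

```python
from itertools import product

# Positions of the relevant words in
# "He saw(0) the girl(1) on(2) the hill(3) with(4) the telescope".
HEADS = {"saw": 0, "girl": 1, "hill": 3}           # possible attachment sites
PPS = {"on the hill": 2, "with the telescope": 4}  # position of each preposition
VERB_SENSES = ["to see (past tense)", "to saw (cut)"]

def crosses(arc_a, arc_b):
    """True if two head-to-PP arcs cross each other."""
    (a1, a2), (b1, b2) = sorted(arc_a), sorted(arc_b)
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

def attachments():
    """Yield every non-crossing way to attach both PPs to a head on their left."""
    pp_names = list(PPS)
    options = [[h for h, pos in HEADS.items() if pos < PPS[pp]] for pp in pp_names]
    for choice in product(*options):
        arcs = [(HEADS[h], PPS[pp]) for h, pp in zip(choice, pp_names)]
        if not crosses(arcs[0], arcs[1]):
            yield dict(zip(pp_names, choice))

readings = [(sense, att) for sense in VERB_SENSES for att in attachments()]
print(len(readings), "candidate readings under these assumptions")
```

Under these assumptions the only attachment ruled out is "on the hill" modifying the verb while "with the telescope" modifies the girl, since those two arcs would cross. A human reader discards most of the remaining readings as implausible, but a parser cannot do so without extra knowledge.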
In a Controlled Language scenario such as ACE, after some initial investment in thesaurus construction, thesaurus lookup and simple parsing techniques are used to report problematic passages to a human editor, who has to correct them manually.
This is not programming in natural language. Typically only large companies can afford the initial investment.
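The thesaurus-lookup step described above can be sketched roughly as follows. This is not how ACE or any real controlled-language checker is implemented; the thesaurus contents, the `check` function, and the sample text are all hypothetical, and a real system would add parsing on top of plain word lookup.

```python
import re

# Hypothetical thesaurus: each licensed term maps to the variants it replaces.
THESAURUS = {
    "knob": {"button", "dial", "handle"},
    "lever": {"stick", "arm"},
}
# Reverse index: unlicensed variant -> licensed term.
VARIANT_TO_LICENSED = {v: k for k, vs in THESAURUS.items() for v in vs}

def check(text):
    """Return (line_no, word, suggestion) for each unlicensed variant found."""
    issues = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word in VARIANT_TO_LICENSED:
                issues.append((line_no, word, VARIANT_TO_LICENSED[word]))
    return issues

manual = "Insert the knob behind the lever.\nTurn the dial behind the stick."
report = check(manual)
for line_no, word, suggestion in report:
    print(f"line {line_no}: '{word}' is not licensed; use '{suggestion}'")
```

The checker only reports; as noted above, the correction itself is still left to the human editor.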
I dropped out 15 years ago, so I'm not really the best person to ask. For popular books on linguistics, I'd recommend The Language Instinct by Steven Pinker. (It is the book I wish I'd written). My favorite journal back in the days when I was reading them was Natural Language and Linguistic Theory.
If you've had any contact, you'll know that linguistics is a bitterly divided field. I was of the west-coast variety. But you need advice from someone working in the field now. I'd suggest that you drop by your local university and ask around. But do remember that there are substantial divisions in linguistics, so take what you are told with a grain of salt.
Prime numbers are exactly what Alan Greenspan says they are -S. Minsky
As both a partly self-labeled linguistic anthropologist and a cultural anthropologist, I would like to respectfully qualify the parent's statements on the state of the field. This really isn't meant as a flame but I do enjoy discussions on the difficult relationship between linguistics and anthropology.
First, while anthropology seems to emphasize linguistics to a much lesser degree than in Boas' era, a large number of anthropologists do work on language, in one way or another. Granted, the groundwork of deciphering unknown languages isn't really part of the discipline anymore, but thorough research projects on how language and language varieties work in social and cultural settings are prominent in the work of many anthropologists, from Michael Silverstein to Alessandro Duranti. Whether or not you call this type of language science "linguistics" is a matter of choice. The fact remains that language still plays a prominent role in contemporary anthropology.
The matter of whether or not "post-modernism" killed cultural anthropology is also open to debate. While I understand the claim and did feel some frustrations caused by "post-modern" anthropology, I think that the ultimate impact is that of enhancing anthropology. True, most cultural anthropologists have stopped writing monographs about "The Xs," but "post-modern" self-criticism is now being replaced by hybrid research activities combining theory and practice. Interestingly enough, language has a large impact on much of this work, at least in the form of meaningful exchanges. Again, maybe not "linguistic" in the strictest sense, but surely enough to warrant language training.
Alexandre http://enkerli.wordpress.com/
Actually, Chomsky (or one of his contemporaries anyhow) discovered early on that almost no natural language can be represented solely by regular languages, or even context-free languages. Chomsky initially even tried to use unrestricted/semi-Thue grammars to represent natural languages, but realized just as quickly that this HUGE class of languages is much, much too big (in fact, it's actually Turing complete, and only useful to those doing research in the theory of computation, not the theory behind human language). That left the context-sensitive languages in the original Chomsky hierarchy, but even those languages were found to be much too general, and the most general simulators for the linear bounded automata needed to process CSLs apparently require exponential time to operate. Current research in computational linguistics seems to concentrate on classes of languages between CFLs and CSLs: formal languages that are "mildly" context-sensitive, used to characterize human languages. One example is the tree-adjoining grammars (which also incidentally have been found to characterize RNA secondary structures very well, and are of great use in bioinformatics). There are a few other models out there which I researched while making a writeup on the Chomsky hierarchy for E2, but unfortunately E2 is still down... :(
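A small illustration of what sits just beyond the context-free boundary: the "mirror" language (w followed by its reverse) has nested dependencies and is context-free, while the "copy" language (w followed by w) has crossing dependencies, is not context-free, and is a standard example of something tree-adjoining grammars can still generate. The two membership tests below are plain string checks rather than grammar-based recognizers, and the function names are made up for this sketch.

```python
def is_mirror(s):
    """True for w + reversed(w): nested dependencies, a context-free pattern."""
    half, odd = divmod(len(s), 2)
    return not odd and s[:half] == s[half:][::-1]

def is_copy(s):
    """True for w + w: crossing dependencies, not a context-free pattern."""
    half, odd = divmod(len(s), 2)
    return not odd and s[:half] == s[half:]

for candidate in ["abba", "abab", "aabaab", "abc"]:
    print(candidate, "mirror:", is_mirror(candidate), "copy:", is_copy(candidate))
```

In the mirror case the first symbol pairs with the last, so the pairs nest like brackets; in the copy case the first symbol pairs with the first symbol of the second half, so the pairs cross. Crossing patterns of this kind are the abstract shape of the constructions that motivated looking past context-free grammars.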
Apparently computational linguistics is taking the same course that most other fields in artificial intelligence have taken lately. One camp takes the formal symbol manipulation approach (the original Chomsky theory and its descendants), and the other camp includes more recent approaches based on neural nets, fuzzy logic, genetic algorithms, and so forth, which are grounded more in biology than in abstract mathematics. Sorta like the traditional SMPA robotics vs. Dr. Brooks' behavioral robotics.
Give me six lines written by the hand of the most honest man, and I will find in them something to hang him.