Linguistics Meets Linux: A Review of Morphix-NLP
Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications into a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included, and there's still a lot of room on the CD, which uses a compressed filesystem to bring us the best of both worlds."
All this language processing packed onto a single CD yet
Trolling is a art,
I was in the process of downloading this already. Damn you slashdot!
Neat.
--dw
This means that GCC will have to be expanded to support all human languages as well as programming languages...
Maxis will have The Sims actually talking, instead of looking "special".
Does anyone remember Forum 2000 (link does not actually work)? It's got some neat technology behind it. And the conversations between surfers and the SOMADs were hilarious. When I first saw the site, I thought it was actual people imitating the different characters. Does anyone know what happened to the site and why it no longer functions? I miss it.
I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
If you want to play the typical stereotype... please at least get it right.
:)
It's the Japanese who have problems pronouncing L's... and the Chinese have problems pronouncing R's.
Westerners, on the other hand, can pronounce almost anything, but will never ever get facts right.
Welley Corporation - SLM Scammers
This page has some reasons.
Actually, this software seems like it would be totally useless for that purpose. The software was developed with a bunch of heuristics and domain knowledge put in by experts in English or the relevant language. Without similar expertise, the software can't be adapted to a new language. The software isn't a universal translator.
So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
No states have laws like that; this summer Texas ditched theirs, and they were the last to do so.
Stiff sodomy laws? There's a joke in there somewhere...
Here is where you can go to download the .iso image. Try not to kill their site. If someone has downloaded it, it would be nice of them to post a .torrent on Slashdot.
This article is about linguistics, and he said "go read Chomsky", so I went and read Chomsky's bibliography. What I'm about to say applies to all modern philosophers and mathematicians:
God damn, them are some fancy-schmancy sounding titles! Does anybody ever get the feeling sometimes that maybe things are simpler than our smartest people currently make them out to be? If you can't talk as simple as I'm talking now, you ain't really "nailed it."
The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".
I saw some links about all the math they never teach anymore (compound arithmetic, like pounds shillings pence, comes to mind). I think something similar will be the case in 1000 years with everything Chomsky and any arbitrary math guy says: they just haven't thought about how to say it simply yet. Life just *ain't* that complicated (if you have the right way to think.)
I was surprised to read that GATE was not listed in the package list. It's the best piece of software to tie together the discrete components that were included. Another complaint is that there are a lot of so-so implementations of very good algorithms. (#define NOT_FLAMEBAIT 1) I suppose that you have to turn to corporate software to get the really robust implementations and to free software when you want the cutting edge.
I wonder whether Forum 2010 is run by the same folks. I doubt it since Forum 2000 and 3000 were both Carnegie Mellon projects, and forum2010.org is registered to someone in St. Louis.
That's me, actually. You can't expect hundreds of Slashdot geeks to suddenly slam my site without me noticing. ];-)
Forum 2010 had, in fact, nothing to do with the great fellows at Forum2k/3k aside from inspiration. And, just to end the rumors, I built the F2.01k matrix and all my own SOMADs as a senior project for my Comp Sci degree at Fontbonne University.
Now, I'm late for a date! Please don't destroy the matrix while I'm gone!
--
I remember when I was first let loose on a Unix system, and discovered tools like 'lex' and 'yacc' for lexical analysis and parsing. I was amazed that advanced language processing was so well supported - it was a short while before I discovered that they weren't for natural language processing :)
Ceterum censeo subscriptionem esse delendam.
I would say that Westerners cannot pronounce simple Chinese.
English is the only language I know, but I studied Mandarin Chinese for a few years.
There are all sorts of things in there that we have a lot of trouble pronouncing.
Can your karma go above being Excellent?
Can any of these natural language tools be helpful for spam filtering?
Cheers, Joel
I have been using the base Morphix system for a Bengali l10n Live CD project (which was mentioned on Slashdot a few days back). I am really amazed by its capabilities. If you want to have a LiveCD of your own, this is probably the best starting point.
For documentation, you may want to have a look at the Morphix Wiki.
Wow. That's the first slashborging ("All Slashdotters should have the same opinions! Be consistent, dammit!") post I've seen in a long time.
Even though they're stupid as hell, I was beginning to miss them.
Win dain a lotica, en vai tu ri silota
While NLP has many benefits, it can also freeze certain linguistic elements that should be removed or amended.
As a simple example, take spell checking. When the computer can remember the spelling for every word and fix it automatically, who is going to worry about spelling simplification or reform? Yet changing to a standardized phonetic spelling would probably help people in the long run, if only by allowing children time to actually *write* rather than spending time in rote memorization and spelling bees.
The same holds true for grammar. Program existing grammatical rules -- in all of their illogical complexity -- into computers, and you reduce the incentive to simplify and improve such rules. If we had continued to use Roman numerals until the advent of handheld calculators, would there be as much incentive for using Arabic numerals? And yet, without zero and the simplicity of the latter, mathematics would be far poorer for it today. And if computers can soon parse logographic languages like Chinese, will it prevent simplification or even conversion to an (arguably better) phonetic alphabet?
NLP is important, granted, and will help more than it hurts, but it is important to realize that it has some potential drawbacks.
The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".
....
That's also why none of the good stuff was made by the Romans. It was the Greeks, then the Arabs, who had good numerals and made the discoveries, before the knowledge of a proper number system finally returned to Europe in more recent centuries. Roman numerals were more like the Dark Ages of mathematics.
I think something similar will be the case in 1000 years with everything Chomsky and any arbitrary math guy says: they just haven't thought about how to say it simply yet. Life just *ain't* that complicated (if you have the right way to think.)
Life might not, but math certainly can. E.g. x^n + y^n = z^n has no solutions in positive integers x, y, z for n > 2. Proof: 250 pages or so. The final article that puts it all together is 100+ pages alone. And you won't understand shit until you've read a couple thousand pages of basic number theory. If you think that's ever going to be something you can slap up on the blackboard in an hour, you're wrong.
For all that's been said and done, I think most "simplifying" moves have been made. I've done quite a bit of higher math, and I certainly haven't found any "easy" way to explain it to others. Sure, I can *show* you how phasors rotating in the complex plane can be used to derive the output of an AC circuit of resistors, capacitors and inductors, but no one will understand why.
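To make the phasor remark concrete, here is a minimal sketch, assuming a series R-L-C circuit driven by a sinusoidal source. The function names and the component values are made up for illustration; the only thing taken from the comment above is the idea of treating voltages and currents as complex numbers.

```python
import cmath
import math


def series_rlc_impedance(r, l, c, omega):
    """Complex impedance of a series R-L-C branch at angular frequency omega (rad/s)."""
    # Z = R + j*omega*L + 1/(j*omega*C), written out as real and imaginary parts.
    return complex(r, omega * l - 1.0 / (omega * c))


def steady_state_current(v_amplitude, r, l, c, omega):
    """Return (amplitude, phase in radians) of the current phasor I = V / Z.

    The source voltage is taken as the phase reference (phase 0).
    """
    current = v_amplitude / series_rlc_impedance(r, l, c, omega)
    return abs(current), cmath.phase(current)


if __name__ == "__main__":
    # Hypothetical component values, chosen only for illustration.
    r, l, c = 100.0, 0.5, 10e-6      # ohms, henries, farads
    omega = 2 * math.pi * 50.0       # 50 Hz source
    amplitude, phase = steady_state_current(10.0, r, l, c, omega)
    print(f"|I| = {amplitude:.4f} A, phase = {math.degrees(phase):.1f} degrees")
```

At the resonant frequency omega = 1/sqrt(L*C) the imaginary part of the impedance cancels, so the current is simply V/R and in phase with the source. That is an easy sanity check on the sketch.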
Most people will never get past the "apples" math. 3, 1/2, sqrt(2), all operations on them can be understood by thinking of it in terms of physical objects. Now try to make people "understand" e.g. complex numbers and operations. Hell, most people have trouble understanding a trivial induction proof.
Now say I've got a standard induction proof:
f(1) is true.
if f(n) is true, f(n+1) is true.
And this proves it for every n, no matter how large.
Then, people believe it's some "infinity magic". But in reality it's simply that for every finite number there is a conventional, finite proof.
Let's say I want to prove it for f(325266235235352):
f(1) is true.
Since f(1) is true, f(2) must be true.
Since f(2) is true, f(3) must be true.
...
Since f(325266235235352 - 1) is true, f(325266235235352) is true.
But people don't understand that. Which tells me they will never understand 90% of higher math, because it won't get much simpler than that...
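The unrolling above can be sketched in a few lines of Python. This is only an illustration under assumptions: the statement f(n) is taken to be "1 + 2 + ... + n = n(n+1)/2", and the function names are hypothetical. The code checks the base case and each individual step numerically and prints the same finite chain of lines; it does not replace the general proof of the step.

```python
def f(n):
    """The example statement: 1 + 2 + ... + n equals n(n+1)/2."""
    return sum(range(1, n + 1)) == n * (n + 1) // 2


def step_holds(n):
    """Check the induction step at one n: adding (n+1) to n(n+1)/2 gives (n+1)(n+2)/2."""
    return n * (n + 1) // 2 + (n + 1) == (n + 1) * (n + 2) // 2


def finite_proof(target):
    """Build the ordinary, finite chain of steps that reaches f(target)."""
    assert f(1), "base case failed"
    lines = ["f(1) is true."]
    for n in range(1, target):
        assert step_holds(n), f"step failed at n={n}"
        lines.append(f"Since f({n}) is true, f({n + 1}) must be true.")
    return lines


if __name__ == "__main__":
    for line in finite_proof(5):
        print(line)
```

For any finite target the chain has exactly that many lines. Nothing "infinite" ever happens; the chain is just long.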
Kjella
Live today, because you never know what tomorrow brings
Linguists have always been geeky. Don't forget that Larry Wall is a linguist first.
The only computer class I ever took was in 1983 called "Computer tools for natural language analysis". It was an introductory Unix course. We learned grep, awk, sed as well as tools like vi, Mail, and rogue. And a tiny little bit of C. But since then I've taught C at the graduate level.
Linguistics is all about the representation and manipulation of information. But instead of it being about languages we design for particular purposes, it is about the language system that we use naturally.
Suppose you have a few thousand languages that you know were written with the same tools (like lex and yacc, but not lex and yacc), but you have no access to those tools. Suppose you are trying to figure out what those tools are from examining the languages (not the compilers) that have been specified using those tools. That is what theoretical linguistics is trying to do. We know that the specification of English and the specification of Dyirbal and every other human language out there are somehow "written" with the same tools. It's pretty neat stuff.
Linguists were early adopters of TeX, have had a Unix affinity for a while, and as people who are interested in how information is internally represented and manipulated, like reading the source.
I remember once nagging the sys admins to always make sure that there is a man page for anything added to /usr/bin or /usr/local/bin.
The next day, they asked me to look at the manpage for something to see if it met with my approval. The DESCRIPTION was the C source. I was happy to say that it did, indeed, meet with my approval.
At one point, a well-known professor (Geoffrey Pullum) had written a little essay for a newsletter on the "grammar of Unix" using linguistic-style analyses of the shell. Naturally several of us feigned outrage at his confusion of "Unix" with the shell. Another linguist (Bill Poser) went so far as to write a shell which was verb (command) final and postpositional. That is, instead of saying
cat foo bar > bang
you would say
foo bar bang > cat
That is, the arguments precede the command, and the redirect symbols go after the filename they redirect to or from. Now for various reasons, I had root access on a machine that Pullum used. So I changed his shell to this command-final one. He actually caught on remarkably quickly. And after a quick
/bin/sh chsh
he was ready to concede the point.
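For anyone who wants to play with the idea, here is a rough sketch of the word-order rule described above, translating a command-final line back into conventional shell order. It is not Poser's shell, whose internals I know nothing about. The function name is made up, and it only handles whitespace-separated tokens with the redirect symbols >, >> and <.

```python
REDIRECTS = {">", ">>", "<"}


def to_conventional(line):
    """Turn a command-final line into ordinary shell word order.

    Rule assumed here: the last token is the command, and a redirect
    symbol follows the filename it applies to.
    """
    tokens = line.split()
    if not tokens:
        return ""
    command, rest = tokens[-1], tokens[:-1]
    out = [command]
    i = 0
    while i < len(rest):
        # A filename followed by a redirect symbol: swap them back.
        if i + 1 < len(rest) and rest[i + 1] in REDIRECTS:
            out.extend([rest[i + 1], rest[i]])
            i += 2
        else:
            out.append(rest[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(to_conventional("foo bar bang > cat"))
    print(to_conventional("/bin/sh chsh"))
```

A real shell would also need quoting, pipes and so on; this only covers the two examples from the story.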
For me, there is no surprise that linguists, and particularly computational linguists, are OSS enthusiasts. But that is enough of my random musings for now.
Prime numbers are exactly what Alan Greenspan says they are -S. Minsky
(1) Insert the knob behind the lever.
In (1) you could perhaps use a handful of terms instead of "knob" -- controlled language enforces only certain licensed terms, thus increasing overall consistency (same terms for same thing). This can be checked automatically once a positive list (or typically a hierarchy called a "thesaurus") has been set up.
(2) He saw the girl on the hill with the telescope.
The second and third cases are lexical and structural ambiguity: we want to avoid problems like those in (2), where "saw" could be the past tense of "to see" or have another (more morbid) interpretation. Even worse, it is unclear whether the girl is on the hill, carrying the telescope, or whether "he" is spying on the girl with the telescope. I leave it as an exercise to the reader how many combinations (possible interpretations) there are in a sentence like (2) [Hint: Which verb? Who is where? Who carries the telescope?].
In a Controlled Language scenario, e.g. ACE, after some initial investment in thesaurus construction, thesaurus lookup and simple parsing techniques are used to report problematic passages to a human editor, who has to correct them manually.
This is not programming in natural language. Typically only large companies can afford the initial investment.
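The thesaurus lookup step can be sketched roughly as follows. This is a toy under stated assumptions, not how ACE or any real controlled-language checker is implemented: the thesaurus contents and the function name are invented, and the "parsing" is just splitting the sentence into lowercase words.

```python
import re

# Hypothetical thesaurus: each licensed term maps to variants it should replace.
THESAURUS = {
    "knob": {"dial", "button", "handle"},
    "lever": {"arm", "switch"},
}


def find_unlicensed_terms(sentence, thesaurus=THESAURUS):
    """Return (found_word, licensed_term) pairs for a human editor to review."""
    variant_to_licensed = {
        variant: licensed
        for licensed, variants in thesaurus.items()
        for variant in variants
    }
    findings = []
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in variant_to_licensed:
            findings.append((word, variant_to_licensed[word]))
    return findings


if __name__ == "__main__":
    for found, licensed in find_unlicensed_terms("Insert the dial behind the lever."):
        print(f'"{found}" is not licensed; use "{licensed}" instead')
```

Note that the checker only reports; the correction is still left to the editor, as described above. A lookup this naive would also flag legitimate uses (an "arm" that really is an arm), which is part of why the thesaurus work is expensive.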
Where are the "All Your Base" trolls when it's actually relevant?
Wh47 d1d j00 541, 31337 15n't t3h r0xor5 ne m0r3???
As both a partly self-labeled linguistic anthropologist and a cultural anthropologist, I would like to respectfully qualify the parent's statements on the state of the field. This really isn't meant as a flame but I do enjoy discussions on the difficult relationship between linguistics and anthropology.
First, while anthropology seems to emphasize linguistics to a much lesser degree than in Boas' era, a large number of anthropologists do work on language, in one way or another. Granted, the groundwork of deciphering unknown languages isn't really part of the discipline anymore, but thorough research projects on how language and language varieties work in social and cultural settings are prominent in the work of many anthropologists, from Michael Silverstein to Alessandro Duranti. Whether or not you call this type of language science "linguistics" is a matter of choice. The fact remains that language still plays a prominent role in contemporary anthropology.
The matter of whether or not "post-modernism" killed cultural anthropology is also open to debate. While I understand the claim and did feel some frustrations caused by "post-modern" anthropology, I think that the ultimate impact is that of enhancing anthropology. True, most cultural anthropologists have stopped writing monographs about "The Xs," but "post-modern" self-criticism is now being replaced by hybrid research activities combining theory and practice. Interestingly enough, language has a large impact on much of this work, at least in the form of meaningful exchanges. Again, maybe not "linguistic" in the strictest sense, but surely enough to warrant language training.
Alexandre http://enkerli.wordpress.com/