Linguistics Meets Linux: A Review of Morphix-NLP
Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications into a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included and there's still a lot of space on the CD, which uses a compressed filesystem for bringing us the best of both worlds."
Does anyone remember Forum 2000 (link does not actually work)? It's got some neat technology behind it. And the conversations between surfers and the SOMADs were hilarious. When I first saw the site, I thought it was actual people imitating the different characters. Does anyone know what happened to the site and why it no longer functions? I miss it.
I claim first use of "Error No. 0B" - or "No. 0B error." It'll be the new ID 10T!
Actually, this software seems like it would be totally useless for that purpose. The software was developed with a bunch of heuristics and domain knowledge put in by experts in English or the relevant language. Without similar expertise, the software can't be adapted to a new language. The software isn't a universal translator.
So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.
"When you sit with a nice girl for two hours, it seems like two minutes. When you sit on a hot stove for two minutes, it
You can get such lists pretty easily without having to type them in. Just looking up the most frequently used POS for each word gives almost 90% accuracy. Alternatively, I wrote a program that automatically predicts the POS for new words.
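For what it's worth, the most-frequent-tag lookup is only a few lines. Here is a minimal sketch in Python, assuming you already have (word, tag) pairs from some tagged corpus. The tag names, the toy training data, and the NOUN fallback for unseen words are all made up for illustration, and this is not the predictor program mentioned above.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_words):
    """Map each (lowercased) word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, lexicon, default="NOUN"):
    """Tag each word with its most frequent tag; unseen words get `default`."""
    return [(w, lexicon.get(w.lower(), default)) for w in words]

# Toy, hand-made training data (hypothetical; a real run would use a tagged corpus).
TRAINING = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("dogs", "NOUN"), ("run", "VERB"), ("they", "PRON"), ("run", "VERB"),
]

lexicon = train_most_frequent_tag(TRAINING)
print(tag(["The", "dogs", "run"], lexicon))
```

Here "run" was seen twice as a verb and once as a noun, so it always comes out as a verb. That is exactly the kind of mistake this baseline makes on ambiguous words.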
However, your BNF grammar is likely to come unstuck as soon as you try to parse either casual English or moderately complex English. Either one very quickly leads to adding lots of infrequently used grammar rules, and hence lots of ambiguity in even simple sentences.
The idea of controlled English was to create a useful subset of English that does conform to a BNF grammar (or LL(1), or something, I forget). Writing in it turns out to be quite hard -- very easy to forget you're writing in a programming language. But there is at least one controlled-English machine-assisted translator.
Given a few years, I wouldn't be surprised to see a program like that be the basis of the next big thing in programming languages.
This article is about linguistics, and he said "go read Chomsky", so I went and read Chomsky's bibliography. What I'm about to say applies to all modern philosophers and mathematicians:
God damn, them are some fancy-schmancy sounding titles! Does anybody ever get the feeling sometimes that maybe things are simpler than our smartest people currently make them out to be? If you can't talk as simple as I'm talking now, you ain't really "nailed it."
The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".
I saw some links about all the math they never teach anymore (compound arithmetic, like pounds, shillings and pence, comes to mind). I think something similar will be the case in 1000 years with everything Chomsky and any arbitrary math guy says: they just haven't thought about how to say it simply yet. Life just *ain't* that complicated (if you have the right way to think.)
I remember when I was first let loose on a Unix system, and discovered tools like 'lex' and 'yacc' for lexical analysis and parsing. I was amazed that advanced language processing was so well supported - it was a short while before I discovered that they weren't for natural language processing :)
Ceterum censeo subscriptionem esse delendam.
While NLP has many benefits, it can also freeze certain linguistic elements that should be removed or amended.
As a simple example, take spell checking. When the computer can remember the spelling for every word and fix it automatically, who is going to worry about spelling simplification or reform? Yet changing to a standardized phonetic spelling would probably help people in the long run, if only by allowing children time to actually *write* rather than spending time in rote memorization and spelling bees.
The same holds true for grammar. Program existing grammatical rules -- in all of their illogical complexity -- into computers, and you reduce the incentive to simplify and improve such rules. If we had continued to use Roman numerals until the advent of handheld calculators, would there be as much incentive for using Arabic numerals? And yet, without zero and the simplicity of the latter, mathematics would be far poorer for it today. And if computers can soon parse logographic languages like Chinese, will it prevent simplification or even conversion to an (arguably better) phonetic alphabet?
NLP is important, granted, and will help more than it hurts, but it is important to realize that it has some potential drawbacks.
I guess this would interest you too. BTW, have you read "Le Ton Beau de Marot" by Hofstadter?
In 1977, Xerox adopted Systran for internal translations by creating a Multinational Customized English that's easier to translate. [1]
In 1930, C.K. Ogden proposed a tiny version of English: just 850 words that could be learned in a few months and used to say anything. He called it Basic English (BE). [2] [3]
The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".
....
That's also why none of the good stuff was made by the Romans - it was the Greeks, then the Arabs, who had good numerals and made the discoveries, before the knowledge of a proper number system finally returned to Europe in more recent centuries. The Roman numerals were more like the Dark Ages of mathematics.
I think something similar will be the case in 1000 years with everything Chomsky and any arbitrary math guy says: they just haven't thought about how to say it simply yet. Life just *ain't* that complicated (if you have the right way to think.)
Life might not, but math certainly can. E.g. x^n + y^n = z^n has no solutions in positive integers x, y, z for n > 2. The proof alone is 250 pages long or so. The final article that puts it all together is 100+ pages by itself. And you won't understand shit until you've read a couple thousand pages of basic number theory. If you think that's ever going to be something you can slap up on the blackboard in an hour, you're wrong.
For all that's been said and done, I think most "simplifying" moves have been made. I've done quite a bit of higher math, and I certainly haven't found any "easy" way to explain it to others. Sure, I can *show* you how phasors rotating in the complex plane can be used to derive the output of an AC circuit of resistors, capacitors and inductors, but no one will understand why.
Most people will never get past the "apples" math. 3, 1/2, sqrt(2): all operations on them can be understood by thinking of them in terms of physical objects. Now try to make people "understand" e.g. complex numbers and operations on them. Hell, most people have trouble understanding a trivial induction proof.
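To be concrete about the "showing" part, here is a rough sketch of the phasor calculation in Python, assuming the simplest case I could pick: a series RLC circuit driven by a sinusoidal source taken as the zero-phase reference. The function name and the component values are made up for illustration.

```python
import cmath
import math

def series_rlc_current(v_amplitude, freq_hz, r_ohm, l_henry, c_farad):
    """Return the current phasor (complex amps) of a series RLC circuit.

    The source voltage is the phase reference (angle 0), so the angle of
    the returned phasor is how far the current leads (+) or lags (-) it.
    """
    omega = 2 * math.pi * freq_hz
    # Z = R + j*omega*L + 1/(j*omega*C) = R + j*(omega*L - 1/(omega*C))
    impedance = complex(r_ohm, omega * l_henry - 1.0 / (omega * c_farad))
    return v_amplitude / impedance

# Hypothetical component values, chosen only for the example.
R, L, C = 100.0, 0.1, 1e-6
f_resonance = 1.0 / (2 * math.pi * math.sqrt(L * C))

for f in (100.0, f_resonance, 5000.0):
    i = series_rlc_current(10.0, f, R, L, C)
    print(f"{f:8.1f} Hz  |I| = {abs(i):.4f} A  phase = {math.degrees(cmath.phase(i)):+.1f} deg")
```

At resonance the inductive and capacitive reactances cancel, so the current is just V/R and in phase with the source. Below resonance the capacitor dominates and the current leads, above it the inductor dominates and the current lags. The arithmetic is a one-liner; the "why" is still the hard part.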
Now say I got a standard induction proof:
f(1) is true.
if f(n) is true, f(n+1) is true.
And this proves it for n infinitely large.
Then, people believe it's some "infinity magic". But in reality it's simply that for every finite number there is a conventional, finite proof.
Let's say I want to prove it for f(325266235235352):
f(1) is true.
Since f(1) is true, f(2) must be true.
Since f(2) is true, f(3) must be true.
...
Since f(325266235235352 - 1) is true, f(325266235235352) is true.
But people don't understand that. Which tells me they will never understand 90% of higher math, because it won't get much simpler than that...
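If it helps, the unrolling can be written as a few lines of Python. This is only a sketch of the idea (it generates the text of the finite proof, it does not check any particular f), and the function name is made up.

```python
def proof_steps(n):
    """Yield, one line at a time, the ordinary finite proof of f(n).

    The proof is the base case followed by n - 1 applications of the
    induction step, so it has exactly n lines for any finite n >= 1.
    """
    yield "f(1) is true."
    for k in range(1, n):
        yield f"Since f({k}) is true, f({k + 1}) must be true."

for line in proof_steps(4):
    print(line)
```

For n = 325266235235352 the generator would produce that many lines. Very long, but finite, and no different in kind from the four lines printed here.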
Kjella
Live today, because you never know what tomorrow brings
Actually, not very many anthropologists these days do much linguistic work. That's partly because linguistics has developed as a separate field and partly because cultural anthropology was largely taken over by Postmodernists, as a result of which it has nearly died. Most research on "exotic" languages these days is done either by linguists or by missionaries (who want to translate the New Testament).
I am a linguist and have done extensive fieldwork, mostly on Carrier, the native language of a large region of northern British Columbia. (I also hack a little. Once upon a time I wrote the head-final shell mentioned in Charles Dodgson's comment.) Software is increasingly used for this kind of work, but for the most part it is not the sort of NLP software provided on the Morphix-NLP CD. A lot of that software is useful primarily if you've got a large corpus to work with, and it often presupposes that some basic resources exist, such as a lexicon, or at least a wordlist with part-of-speech information. For many languages even basic resources such as a lexicon don't exist or aren't available in electronic form, and when you're dealing with really small languages, there aren't any ready-made corpora, such as news text. If you want a text corpus, you've got to make it yourself, usually by recording people telling stories or whatever, and transcribing it. This is an important part of fieldwork, but it's incredibly slow and tedious.
There are some tools designed specifically for this kind of linguistic research. One is Transcriber, a tool that assists a human being in transcribing audio recordings. One of the older tools is Shoebox, a dictionary database program for field linguists, originally written to run under DOS.
Some of us have used Unix tools to extract and process information, e.g. grep to do regular expression searches. Ken Church at Bell Labs used to give a tutorial "Unix for Poets" on how to use Unix tools for linguistics. Here is his handout. For example, I've produced dictionaries of several dialects of Carrier using scripts written mostly in AWK plus the usual Unix tools, controlled by elaborate Makefiles. Some of us also use emacs a lot, not only as an editor but for doing searches. If you're interested in what kinds of software are of interest to linguists, you might check out the Computational Resources for Linguistic Research page.
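To give a flavor of this kind of processing, here is a small sketch in Python rather than shell or AWK. It does roughly what a tr | sort | uniq -c | sort -rn pipeline plus a grep would do: count word frequencies and pull out the lines that match a regular expression. The function names and the sample text are made up for illustration, and the ASCII-only tokenizer is an assumption that would need changing for most of the languages discussed here.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercased words, treating any run of ASCII letters as a word."""
    return Counter(re.findall(r"[A-Za-z]+", text.lower()))

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# Hypothetical two-line "corpus"; real use would read transcribed texts from files.
SAMPLE = "The cat sat on the mat.\nThe dog sat too."

for word, count in word_frequencies(SAMPLE).most_common(2):
    print(count, word)
print(grep(r"\bdog\b", SAMPLE.splitlines()))
```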
It is worth mentioning that the spread of the Internet has made available a lot of useful material for linguistic research. There are now quite a few languages for which you can obtain a good chunk of text (say at least 100K words), and often you can find parallel text (that is, the language you're interested in plus a translation into English or another language that is useful to you). But this works mostly for relatively big languages, that is, say, languages with a million or more speakers. There are around 340 such languages, depending on how you count, about 2% of the world's oral languages.
One topic that concerns some of us is how software and other technology can speed up the process of documenting dying languages. Languages are rapidly becoming extinct - some experts estimate that as many as 90% of the languages currently spoken will be extinct in 100 years. [Computer languages may be proliferating at the same rate.:)] The late Ken Hale had seven languages die on him. If we don't find a way to speed up the documentation, or slow down the rate of extinction, most of those languages are going to die without very much being known about them.
I'm not at all interested in airy analysis of sentence structure -- I like historical linguistics. Ever wonder about a word like "go"? Why is its preterit "went"? Well, the preterit used to be "eode", which actually comes from the same stem from which Latin ire comes. And from ire, we get only the French future stem ir- (as in J'irai -- I will go). This is important and all, but why are linguists so interested in this computer-related stuff, and not in the rich and varied history of our language, as well as those of many others?
Let's see... if it had a good language guesser that could be fit into a plugin, then we could toss all messages in languages we can't read (or see no use for). For instance, all messages I get that are in English are either from some mailing list, or spam. I've actually been working on a "spot English" plugin to use on the mail that isn't automatically shunted into the mailing-list folders, but if the work is already done, yay!
You might think that looking at the charset used would be enough but 'taint so! Frequency of letters isn't good enough either. Two good ways are checking for the most frequent words or the most frequent letter trigrams. If you want to know more, see if you can find the paper "Comparing two language identification schemes" by Gregory Grefenstette. It used to be openly hosted at Xerox but now the server is gone.
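The trigram idea fits in a few lines. Below is a minimal sketch in Python, assuming cosine similarity between trigram counts as the comparison, which is my own pick and not necessarily the scoring used in the paper. The two training sentences are tiny made-up placeholders; anything real would be trained on far more text per language.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping letter trigrams, padding with one space at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def guess_language(text, profiles):
    """Return the name of the profile most similar to the text."""
    target = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))

# Hypothetical, far-too-small training samples, for illustration only.
PROFILES = {
    "english": trigram_profile(
        "the quick brown fox jumps over the lazy dog and the cat is on "
        "the mat with the hat this is what they have said"),
    "german": trigram_profile(
        "der schnelle braune fuchs springt auf den faulen hund und die katze "
        "ist auf der matte mit dem hut das ist was sie gesagt haben"),
}

print(guess_language("they said the dog is lazy", PROFILES))
print(guess_language("der hund ist faul", PROFILES))
```

For a "spot English" mail plugin you would only need the English profile and a similarity threshold, but the threshold would have to be tuned on real mail.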