Linguistics Meets Linux: A Review of Morphix-NLP

Posted by CowboyNeal on Thursday December 11, 2003 @02:24PM from the natural-linux-processing dept.

Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications into a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included, and there's still a lot of room on the CD, which uses a compressed filesystem for bringing us the best of both worlds."

Ironic.. by grub · 2003-12-11 14:26 · Score: 5, Funny

All this language processing packed onto a single CD yet /. can't run a spellchecker... :)

--
Trolling is a art,
Re:Great... by lakeland · 2003-12-11 14:31 · Score: 5, Funny

Actually, I saw someone working on something like parsing English as a programming language. Try a Google for 'controlled english' sometime. The general idea is that management may not be able to write the specifications, but they can read them and tell you it isn't what they're really after _before_ you code the thing.
Re:Good Chinese Compression by MoThugz · 2003-12-11 14:38 · Score: 5, Funny

If you want to play the typical stereotype... please at least get it right.

It's the Japanese who have problems pronouncing L's... and the Chinese have problems pronouncing R's.

The Westerners, on the other hand, can pronounce almost anything, but will never ever get facts right :)

--
Welley Corporation - SLM Scammers
Re:Chomsky and stuff by monecky · 2003-12-11 16:14 · Score: 5, Interesting

I'm a programmer getting my masters in linguistics. Computer Science undergrad. Trust me. This is some tough stuff... until you learn the basics. Then everything starts making sense. There is a huge hurdle getting into any field... and it is usually because of the terminology. Every field has its own terminology because every field needs to be extremely precise in its explanations.

Linguists don't think Knuth is very lucid.

Linguistics is neat. Syntax (the study of the structure of language), Phonology (the study of the interactions of sounds and what a child has to actually 'learn'), Phonetics (the study of the human language system and the sounds that it can produce/hear), and Morphology (the study of the smallest possible unit that holds 'meaning') all work together to form an idea of what goes on in the human mind.

--
http://jones.ling.indiana.edu/~prrodrig
Re:that's pretty cool by belmolis · 2003-12-11 20:42 · Score: 5, Interesting

Actually, not very many anthropologists these days do much linguistic work. That's partly because linguistics has developed as a separate field and partly because cultural anthropology was largely taken over by Postmodernists, as a result of which it has nearly died. Most research on "exotic" languages these days is done either by linguists or by missionaries (who want to translate the New Testament).

I am a linguist and have done extensive fieldwork, mostly on Carrier, the native language of a large region of northern British Columbia. (I also hack a little. Once upon a time I wrote the head-final shell mentioned in Charles Dodgson's comment.) Software is increasingly used for this kind of work, but for the most part it is not the sort of NLP software provided on the Morphix-NLP CD. A lot of that software is useful primarily if you've got a large corpus to work with, and it often presupposes that some basic resources exist, such as a lexicon, or at least a wordlist with part of speech information. For many languages even basic resources such as a lexicon don't exist or aren't available in electronic form, and when you're dealing with really small languages, there aren't any ready-made corpora, such as news text. If you want a text corpus, you've got to make it yourself, usually by recording people telling stories or whatever, and transcribing it. This is an important part of fieldwork, but it's incredibly slow and tedious.

There are some tools designed specifically for this kind of linguistic research. One is Transcriber, a tool that assists a human being in transcribing audio recordings. One of the older tools is Shoebox, a dictionary database program for field linguists, originally written to run under DOS.

Some of us have used Unix tools to extract and process information, e.g. grep to do regular expression searches. Ken Church at Bell Labs used to give a tutorial "Unix for Poets" on how to use Unix tools for linguistics. Here is his handout. For example, I've produced dictionaries of several dialects of Carrier using scripts written mostly in AWK plus the usual Unix tools, controlled by elaborate Makefiles. Some of us also use emacs a lot, not only as an editor but for doing searches. If you're interested in what kinds of software are of interest to linguists, you might check out the Computational Resources for Linguistic Research page.
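As a rough illustration of the grep-and-count style of work described above, here is a minimal Python sketch. It is not taken from the "Unix for Poets" handout or from the AWK scripts mentioned; the sample text, the function names, and the simple letters-only tokenizer are all assumptions made for the example. A real orthography, with apostrophes or other word-internal marks, would likely need a different tokenizer.

```python
import re
from collections import Counter

# Hypothetical sample text; real work would read a transcribed corpus file.
SAMPLE = """The dog saw the cat. The cat saw the bird.
A bird sang, and the dog barked."""


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(text):
    """Count how often each word token occurs, like sort | uniq -c."""
    return Counter(tokenize(text))


def grep_lines(text, pattern):
    """Return the lines that match a regular expression, like grep."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]


if __name__ == "__main__":
    # Most frequent words first, count then word.
    for word, count in word_frequencies(SAMPLE).most_common(3):
        print(count, word)
    # Lines containing the whole word "saw".
    for line in grep_lines(SAMPLE, r"\bsaw\b"):
        print(line)
```

The same two steps, counting tokens and filtering lines by pattern, are what a pipeline of the usual Unix tools would do; the Python version is only meant to make the idea concrete.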

It is worth mentioning that the spread of the internet has made available a lot of useful material for linguistic research. There are now quite a few languages for which you can obtain a good chunk of text (say at least 100K words), and often you can find parallel text (that is, the language you're interested in plus a translation into English or another language that is useful to you). But this works mostly for relatively big languages, that is, say, languages with a million or more speakers. There are around 340 such languages, depending on how you count, about 2% of the world's oral languages.

One topic that concerns some of us is how software and other technology can speed up the process of documenting dying languages. Languages are rapidly becoming extinct - some experts estimate that as many as 90% of the languages currently spoken will be extinct in 100 years. [Computer languages may be proliferating at the same rate.:)] The late Ken Hale had seven languages die on him. If we don't find a way to speed up the documentation, or slow down the rate of extinction, most of those languages are going to die without very much being known about them.