New Online Dictionaries Automate Away the Linguistic Middleman
An article in The New York Times highlights two growing collections of words online that effectively bypass the traditional dictionary publishing system of slow aggregation and curation. Wordnik is a private venture that has already raised more than $12 million in capital, while the Corpus of Contemporary American English is a project started by Brigham Young professor Mark Davies. These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary in their emphasis on avoiding human intervention rather than fostering it. Says Wordnik founder Erin McKean in the linked article, 'Language changes every day, and the lexicographer should get out of the way. ... You can type in anything, and we'll show you what data we have.'
"You can type in anything, and we'll show you what data we have" sounds a lot like Google search.
CCAE is an annotated corpus more than a dictionary. It counts words, word co-occurrences, etc. It's also manually annotated with parts of speech and other such things, not fully automated. Its scope is bigger and more recent than what was possible before computers, but the general idea is ancient: 18th-century classicists would manually compile frequency and word co-occurrence tables for ancient languages to try to get an understanding of their structure.
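To make the counting part concrete, here is a minimal sketch of word-frequency and co-occurrence tables in Python. It is not how CCAE or Wordnik actually work; the sample sentence, the function names, and the two-word window are all my own assumptions, picked only to illustrate the idea.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(tokens):
    """Frequency table: how many times each word occurs."""
    return Counter(tokens)


def count_cooccurrences(tokens, window=2):
    """Count unordered word pairs appearing within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only at the next `window` tokens so each pair is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


# Toy sample standing in for a corpus (made up for illustration).
sample = "the cat sat on the mat and the cat slept on the mat"
tokens = tokenize(sample)
word_freq = count_words(tokens)
pair_freq = count_cooccurrences(tokens, window=2)

print(word_freq.most_common(3))
print(pair_freq.most_common(3))
```

A real corpus adds part-of-speech tags and far more text on top of this, but the tables themselves are the same kind of thing the classicists compiled by hand.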
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
At the risk of being elitist, I wonder if I should adjust my use of language to that of the average American.
"Slow" sources that take the time to verify things are the ones that become reliable sources, the kind Wikipedia cites. Unfortunately, "new" ideas can fall victim to deletionism.
Let's eliminate the making-sense and explaining that human beings can do. The absurdity of most spell check and voice recognition "did you mean" suggestions doesn't give me much hope that it's all just a matter of having enough data. Yes, Google can seem almost prescient, but only if thousands of other people are looking for the same things as I am. When I could really use a hint, Google never comes up with something useful. On the contrary, then I have to coax it not to replace my carefully selected search term with something "more popular" that hasn't got anything to do with what I'm looking for.
"The Free Dictionary" appears to be just a spammy repackaging of Wikipedia content. Lots of their articles even have a footer saying they're licensed under the GFDL from Wikipedia.
What are these guys? All we get is what they're not:
traditional dictionary publishing system
slow aggregation
curation
crowd-sourced effort
human intervention
I'm guessing they are also not street taco vendors, Catholic priests or Christmas tree salesmen. Great, that really narrows it down. So, what are they? I mean in terms of workflow, or data diagrams, or even user experience. And who are their users, anyway? Unless they provide a really good reason, the rest of the world will continue to use Wikipedia/Wikimedia products, Google (let's face it, mostly Google), and the Urban Dictionary (dare I invoke Encyclopedia Dramatica?).
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
Have gnu, will travel.
Obviously, I'd suppose you still needed a few lexicographers to come up with the system.
And to maintain it, right?
The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?
I'm not a lawyer, but I play one on the Internet.
So if I type in "anything" I won't get just an interpreted response
but really -- what... everything?
bjd
I wonder what kind of sales pitch it takes to get $12 million for a free web dictionary.
'Just imagine if we could provide 100 definitions from other people for the word "butt", how much is that worth to you?'
It doesn't detect that telivision is an incorrect spelling because there are so many authoritative examples of that spelling: http://www.wordnik.com/words/telivision
Google seems to do a good job of detecting spelling errors and automatically updating its dictionary, and of course it also shows you websites where that word is used. I don't really see what Wordnik provides.
we've eliminated the middle man by letting users submit whatever they want, and pocketing all the money!
You make it sound like they're completely removing the human elements. A corpus by its nature does that, since its compilers are only really involved in setting the bounds of the collection and letting the authors speak for themselves. Wordnik, on the other hand, allows *anyone* to contribute, but contributors aren't allowed to give definitions (definitions are only gathered from official dictionaries and the like). What you do with Wordnik is give examples of the word in context, because it's actually really hard to define some words.
The thing is, there are no editors trying to decide 'is this a word or not' ... if you put it in, it's a word. It doesn't matter that only you and your 4 friends use it, or whether it's important enough: if you want to add it, you can. Yes, they also automate adding stuff from other sources, and so did Wikipedia early on (the CIA World Factbook for countries, the US Census for places in the US, etc.).
Yes, you can use Wordnik as a sort of meta-dictionary, but you can also add words to it, look up their values in Scrabble, and tag words (words you hate, jargon in your field, etc.). It *is* fostering human intervention: how many of you out there can add a word to a print dictionary? And unlike those print dictionaries, we don't have to wait 3-4 years before someone decides that something is 'officially' a word.
Build it, and they will come^Hplain.
Regarding Wordnik, I don't think Rick Santorum is going to be a fan of their site.
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
It might have something to do with tech sites that take pains to point out that the Internet is not just the World Wide Web.
All of the people interviewed, as well as the author of the NY Times article, leave a major issue unmentioned: historical word use. As a very enthusiastic user of the Oxford English Dictionary (yes, it has the place of honour in my living room), I am amazed each time I look up a word in the venerable OED at the thick and variegated strata of historical meaning, and the gradual shifts in it, even for words we think of as "simple".
To wit, neither the Wordnik nor the CCAE person mentioned these important aspects of a dictionary's use. For good reason: corpora such as Wordnik and the CCAE are "mere" aggregations of internet use, which, not incidentally, is not necessarily identical to everyday use.
Religious speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
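Current-generation nonsense; it's high time we return to Latin. Ita et vos per linguam nisi manifestum sermonem dederitis, quo modo scietur id quod dicitur? eritis enim in aëra loquentes. (So likewise you, unless you utter plain speech by the tongue, how shall it be known what is said? For you will be speaking into the air.)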
But I'd accept Old English.