New Online Dictionaries Automate Away the Linguistic Middleman

Posted by timothy on Sunday January 1, 2012 @06:31AM from the boncha-porftis-hworkin dept.

An article in The New York Times highlights two growing collections of words online that effectively bypass the traditional dictionary publishing system of slow aggregation and curation. Wordnik is a private venture that has already raised more than $12 million in capital, while the Corpus of Contemporary American English is a project started by Brigham Young University professor Mark Davies. These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it. Says Wordnik founder Erin McKean in the linked article, 'Language changes every day, and the lexicographer should get out of the way. ... You can type in anything, and we'll show you what data we have.'

60 comments

Isn't that called Googling? by hawks5999 · 2012-01-01 06:34 · Score: 3, Insightful

"You can type in anything and we'll show you the data we have" sounds a lot like a Google search.
1. Re:Isn't that called Googling? by Dachannien · 2012-01-01 06:37 · Score: 1
  
  Indeed.
2. Re:Isn't that called Googling? by Sqr(twg) · 2012-01-01 07:22 · Score: 1
  
  The difference should be in the prioritizing of results. The first few pages from Google might give only hits based on the most common meaning of a word, while Wordnik, according to TFA, should group citations by meaning.
  In practice, this didn't seem to work for the words I tried.
3. Re:Isn't that called Googling? by Samantha+Wright · 2012-01-01 09:37 · Score: 4, Insightful
  
  Here's the results for 'magic'.
  
  Gee, it sure looks like they're returning random search engine results next to—oh look, a list of opinions as proffered by so-called "linguistic middlemen."
  
  I like how the top example for how 'magic' is used in English isn't even purely English, but a bullet point about features in the Zend framework. I'll make a habit of saying "__magic()" in everyday speech more often!
  
  I think the worst outcome of this is that PHP now somehow has influence on the evolution of a natural language. I do not believe I am alone in feeling terrified by this prospect.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
4. Re:Isn't that called Googling? by Samantha+Wright · 2012-01-01 09:39 · Score: 1
  
  Yes.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
5. Re:Isn't that called Googling? by Phrogman · 2012-01-01 10:12 · Score: 1
  
  Yes here: http://www.wordnik.com/words/Fascism
  and here: http://www.wordnik.com/words/Anti-Semitism
  
  --
  "The first time I got drunk, I got married. The second time I bought a chimpanzee, after that I stayed sober" Arian Seid
6. Re:Isn't that called Googling? by Jezral · 2012-01-01 19:45 · Score: 1
  
  Then maybe what you want is DeepDict. E.g., magic is used like http://gramtrans.com/deepdict/lookup.php?word=magic&class=N&lang=eng&top=200 - it is not free, though all words starting with 's' are currently open to viewing for anyone.
  It yields info such as: black magic, Orlando magic, ceremonial magic ... magic kingdom, magic roundabout, magic flute ... practice magic, radiate magic ... magic of animation ... etc
  (disclaimer: I work on the DeepDict project)
7. Re:Isn't that called Googling? by Anonymous Coward · 2012-01-02 02:13 · Score: 0
  
  There were two instances of 'magic' in that sentence, what makes you think the dictionary referred to the second one and not the first one?
  The __magic() might very well be invoking black magic as the sentence indicates, how are you to know?
CCAE isn't that nontraditional by Trepidity · 2012-01-01 06:38 · Score: 2

CCAE is an annotated corpus more than a dictionary. It counts words, word co-occurrences, etc. It's also manually annotated with parts of speech and other such things, not fully automated. Its scope is bigger and more recent than what was possible before computers, but the general idea is ancient: 18th-century classicists would manually compile frequency and word co-occurrence tables for ancient languages to try to get an understanding of their structure.

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
1. Re:CCAE isn't that nontraditional by hedwards · 2012-01-01 07:08 · Score: 1
  
  Having access to a good corpus is really helpful, but once you start hitting the 2k word count, additional entries aren't really that helpful to anybody other than hardcore linguists. At that point it's generally more helpful to have information about what words frequently travel together and where they're likely to appear in a sentence.
2. Re:CCAE isn't that nontraditional by gsnedders · 2012-01-01 07:53 · Score: 1
  
  It depends what you're doing. I've spent a while dealing with the Scottish Corpus of Texts and Speech, and there the size is around four million words. If you're doing anything based upon dialects, size does make a very real difference, because you're interested in the density of usage by area. Personally, even in a non-linguistic context, I find it useful to know whether someone in x is likely to know (by virtue of using) a word y.
3. Re:CCAE isn't that nontraditional by RancidPeanutOil · 2012-01-02 05:27 · Score: 1
  
  If you're a trained computational linguist then obviously yes, you can generate this information (collocations and concordances) from corpora. The COCA actually has some pretty neat features along those lines, but it's not user-intuitive.
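  For readers wondering what "collocations and concordances" look like in practice, here is a minimal Python sketch. It is an illustration only, not how COCA or Wordnik work internally: the sample text and the function names (tokenize, collocations, concordance) are made up for this example, and a collocation is approximated here as nothing more than a frequent adjacent word pair.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())

def collocations(tokens, top=3):
    """Count adjacent word pairs (bigrams) and return the most frequent ones.

    Note: pairs are counted across sentence boundaries, which a real
    corpus tool would normally avoid.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(top)

def concordance(tokens, keyword, width=2):
    """Return every occurrence of keyword with `width` words of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok.upper()] + right))
    return lines

# Made-up sample text, standing in for a real corpus.
sample = (
    "Black magic is not stage magic. "
    "The magic flute plays; black magic fades."
)
tokens = tokenize(sample)

print(collocations(tokens, top=1))
print(concordance(tokens, "flute"))
```

  On a corpus of millions of words the same two ideas apply, though real tools usually rank pairs with an association measure rather than raw counts, which is where the trained linguist comes in.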
Good idea? by colinrichardday · 2012-01-01 06:40 · Score: 1

At the risk of being elitist, I wonder if I should adjust my use of language to that of the average American.
1. Re:Good idea? by rolfwind · 2012-01-01 07:01 · Score: 1
  
  It's inevitable, language always adjusts to popular usage eventually, even with guards in place that act as filters.
  Though I still cringe when people say they "could care less."
  Not that all rules set in place by self-anointed authorities make sense. I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.
2. Re:Good idea? by vlm · 2012-01-01 07:05 · Score: 4, Funny
  
  Though I still cringe when people say they "could care less."
  That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.
  
  --
  "Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
3. Re:Good idea? by Anonymous Coward · 2012-01-01 07:12 · Score: 2, Informative
  
  You're sentence could of been improved if you had leveraged a preposition to end it with.
4. Re:Good idea? by Trepidity · 2012-01-01 07:28 · Score: 1
  
  To be quite honest, it's not an, uh, a very uncommon pattern of speech, if I may say so, to interject one's spoken English with, discourse... discourse particles, and, well, other minor disfluencies, which do--- which do vary by social class, but more in, uh, word choice than in what you might call actual frequency.
  
  --
  10 PRINT CHR$(205.5+RND(1)); : GOTO 10
5. Re:Good idea? by colinrichardday · 2012-01-01 07:29 · Score: 1
  
  I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence.
  So I'm not the only one? Yeah!!! Although I believe that you may have a question mark outside of quotes if the sentence (and not the quoted material) is a question.
6. Re:Good idea? by bigstrat2003 · 2012-01-01 07:40 · Score: 3, Interesting
  
  This post proves that there should be a "made my brain explode" moderation option.
  
  --
  "16MB (fuck off, MiB fascists)" - The Mighty Buzzard
7. Re:Good idea? by Samantha+Wright · 2012-01-01 09:48 · Score: 5, Insightful
  
  Oh, that's purely typographical. When moving blocks of metal type around, a full-stop/period or comma is more delicate than a quotation mark, since it's only x-height and not capital letter height. Typographers got in the habit of putting them on the inside to keep them safe. That's also why certain ligatures of f and the long s were preserved from scribal writing: those letters were designed to hook over others, and if the next letter was tall then it would create a structural instability (an x-height hole.) If modern punctuation had evolved before the invention of moveable type, we would probably put the quotation mark directly above the other punctuation mark, and use logical punctuation for ? and !. However, it didn't, so it was all put inside to stay consistent.
  
  To be honest, I find it visually more pleasant. After looking at code that passes strings around as arguments in C-style imperative languages all day, it's nice to see something without a big gap on the baseline (this "is," an "example", for you.) Since the quotation mark is already floating up and away from the letters, it's less jarring to see it separated from the word than a comma or period. (This is more or less the modern aesthetic justification for keeping it the traditional way. However, modern typographers don't always agree with traditionalists: watch what happens when you point out that the "single" space used to separate sentences prior to the invention of the typewriter was actually larger than a standard double space.)
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
8. Re:Good idea? by AlienIntelligence · 2012-01-01 10:20 · Score: 2
  
  Though I still cringe when people say they "could care less."
  That begs the question if inappropriate use of "begs the question" is like, worse, like, than like using the word like, like in as the first like word after every like lung inhalation. I think that is a full 360 degree reversal from your suggestion.
  I live in the corner of a quad of homes that creates an interesting
  amplifying effect of sounds, within the area. So that a house that
  is completely on the other side, hundreds of feet away, you can
  clearly hear people talk. [Yeah, it DOES suck].
  So, the other day, I heard this teen-thing speaking to her folks
  and about the 20th like, I was gonna "say loudly" since that's
  all one has to do...
  "Like will you shut the fuck up"
  But tis the season and all that crap.
  -AI
  
  --
  For me, it is far better to grasp the Universe as it really is than to persist in delusion
9. Re:Good idea? by Anonymous Coward · 2012-01-01 18:35 · Score: 0
  
  "I never understood why end-of-sentence punctuation should appear inside quotations, especially if it might not match what was quoted, like making a question out of a sentence."
  For what it's worth, it's now accepted that you DON'T need to do it that way. "Logical quoting" is the alternative. It works exactly like most coders would expect:
  http://home.swipnet.se/sunnanvind/logical.html
  http://catb.org/jargon/html/writing-style.html
10. Re:Good idea? by RancidPeanutOil · 2012-01-02 05:28 · Score: 1
  
  For all intensive purposes, yes
11. Re:Good idea? by lsatenstein · 2012-01-02 06:17 · Score: 1
  
  What about punctuation for other languages such as and or the Spanish inverted question mark at the beginning and ? at the end of a question
  
  --
  Leslie Satenstein Montreal Quebec Canada
12. Re:Good idea? by Samantha+Wright · 2012-01-02 07:10 · Score: 1
  
  I didn't know this one off the top of my head, but Wikipedia says they were introduced in Spanish in 1754 because there's no way to recognize that a sentence is a question just from looking at the words; it's purely a tonal difference—and for really long sentences it can get disorienting if you have to go back and re-read it because you just found out that it was a question when you got to the end. I imagine the exclamation point was just made to be consistent.
  
  What were the other symbols you tried to type? Guillemets?
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Citation needed by Anonymous Coward · 2012-01-01 06:51 · Score: 0

"Slow" sources that take the time to verify things are what's needed to become reliable sources, which Wikipedia cites. Unfortunately "new" ideas can be a victim of deletionism.
1. Re:Citation needed by Samantha+Wright · 2012-01-01 10:11 · Score: 1
  
  Funny, I don't see any citations for that... It must not be notable.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
Great Idea by Anonymous Coward · 2012-01-01 06:52 · Score: 1

Let's eliminate the making-sense and explaining that human beings can do. The absurdity of most spell check and voice recognition "did you mean" suggestions doesn't give me much hope that it's all just a matter of having enough data. Yes, Google can seem almost prescient, but only if thousands of other people are looking for the same things as I am. When I could really use a hint, Google never comes up with something useful. On the contrary, then I have to coax it not to replace my carefully selected search term with something "more popular" that hasn't got anything to do with what I'm looking for.
Re:Wikitionary? by Trepidity · 2012-01-01 06:53 · Score: 4, Informative

"The Free Dictionary" appears to be just a spammy repackaging of Wikipedia content. Lots of their articles even have a footer saying they're licensed under the GFDL from Wikipedia.

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
What are these guys? by vlm · 2012-01-01 07:02 · Score: 1

What are these guys? All we get is what they're not:

traditional dictionary publishing system

slow aggregation

curation

crowd-sourced effort

human intervention
I'm guessing they are also not street taco vendors, Catholic priests or Christmas tree salesmen. Great, that really narrows it down. So, what are they? I mean in terms of workflow, or data diagrams, or even user experience. And who are their users, anyway? Unless they provide a really good reason, the rest of the world will continue to use Wikipedia/Wikimedia products, Google (let's face it, mostly Google), and the Urban Dictionary (dare I invoke Encyclopedia Dramatica?).

--
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
1. Re:What are these guys? by Anonymous Coward · 2012-01-01 08:06 · Score: 0
  An excellent way to find the information you're looking for would be to visit the links in the story. Since you seem to have misplaced them, here's another copy for your convenience:
  growing collections of online words
  Wordnik
  Corpus of Contemporary American English
  Furthermore, the summary actually contains the text "These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it."
  How did you manage to make your post and miss all of the above?
So then ... by PPH · 2012-01-01 07:08 · Score: 2

... if it's used, it is automatically entered into this 'dictionary'. On one hand, I shudder to think of the direction that various languages might take. On the other hand, there could be hope for words like malamanteau. That seems perfectly cromulant to me.

--
Have gnu, will travel.
1. Re:So then ... by Anonymous Coward · 2012-01-01 08:29 · Score: 0
  
  The word you're looking for is cromulent, not cromulant, doofus.
2. Re:So then ... by AK+Marc · 2012-01-01 08:51 · Score: 1
  
  That's always how language was defined. Why change now?
  
  --
  Learn to love Alaska
3. Re:So then ... by Man+On+Pink+Corner · 2012-01-01 12:53 · Score: 2
  
  He has altered the English language. Pray he does not alter it further.
Lexicographers out of the way by Compaqt · 2012-01-01 07:09 · Score: 3, Informative

Obviously, I'd suppose you still needed a few lexicographers to come up with the system.
And to maintain it, right?
The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?

--
I'm not a lawyer, but I play one on the Internet. Blog
1. Re:Lexicographers out of the way by VortexCortex · 2012-01-01 08:38 · Score: 4, Funny
  
  Obviously, I'd suppose you still needed a few lexicographers to come up with the system.
  And to maintain it, right?
  The problem seems to be when you've put 95% of lexicographers out of a job, who's going to train the next bunch, and will it be cost-effective at a university level to have a graduate program in such for 1 or 2 individuals?
  Syntax error on line(s): 1 thru 1 Ambiguous contraction in "I'd".
  Syntax error on line(s): 1 thru 1 Mixed tense in "still needed". Note: Root word "need" satisfies the expression.
  Syntax error on line(s): 3 thru 3 Incomplete sentence.
  Syntax error on line(s): 5 thru 5 Expected colon after "be" in "to be when".
  Syntax error on line(s): 5 thru 5 Expected capitalization of "when" in "to be when".
  Syntax error on line(s): 5 thru 5 Extraneous comma.
  Note: This message is generated only once for multiple errors.
  Point taken: Screw the Lexicographers!
2. Re:Lexicographers out of the way by smellotron · 2012-01-01 15:57 · Score: 1
  
  Syntax error on line(s): <snip>
  
  I would like to subscribe to your newsletter. Do you provide an Outlook plugin?
3. Re:Lexicographers out of the way by Livius · 2012-01-01 16:03 · Score: 1
  
  Syntax error: unknown token "thru"
4. Re:Lexicographers out of the way by Anonymous Coward · 2012-01-01 18:43 · Score: 0
  
  http://en.wiktionary.org/wiki/thru
words will not escape us anymore? by Mister+Liberty · 2012-01-01 07:09 · Score: 1

So if I type in "anything" I won't get just an interpreted response
but really -- what... everything?
bjd
Wordnik is a dictionary aggregator by NaCh0 · 2012-01-01 08:05 · Score: 3, Funny

I wonder what kind of sales pitch it takes to get $12 million for a free web dictionary.
'Just imagine if we could provide 100 definitions from other people for the word "butt", how much is that worth to you?'
1. Re:Wordnik is a dictionary aggregator by eulernet · 2012-01-01 10:59 · Score: 1
  
  Totally agree, and it seems that their data is not cross-checked at all:
  http://www.wordnik.com/words/internet
  antonyms
  Words with the opposite meaning:
  World Wide Web
  WTF ?
Telivision by aembleton · 2012-01-01 08:14 · Score: 4, Insightful

It doesn't detect that telivision is an incorrect spelling because there are so many authoritative examples of that spelling: http://www.wordnik.com/words/telivision

Google seems to do a good job of detecting spelling errors and automatically updating its dictionary, and of course it also shows you websites where that word is used. I don't really see what Wordnik provides.
1. Re:Telivision by Anonymous Coward · 2012-01-01 08:21 · Score: 0
  
  Hey man, language changes every day. Not only are we free to redefine words the wya we pelase, we nac splel thme ayn way ew ikle. Gte wthi ti anm!!
2. Re:Telivision by AK+Marc · 2012-01-01 09:41 · Score: 1, Troll
  
  What's funny is that 4 of the top 5 examples are by conservatives attacking liberals (and one transcription error on a CNN interview). What's that say about where our language is going and who is taking it there?
  
  --
  Learn to love Alaska
3. Re:Telivision by VortexCortex · 2012-01-01 10:45 · Score: 1
  
  I second this notion. I frequently use the define: $searchTerm query with Google.
  For example: telivision,
  or: Wordnik
  
  Compare the latter to the same search on Wordnik: Wordnik
  Bonus: Those Google links are wrapped in TLS, so no one sees the query terms or results in transit. https://www.wordnik.com/ takes you to their developer site...
4. Re:Telivision by vikingpower · 2012-01-01 22:20 · Score: 1
  
  Language use, and interpretations thereof, is not politically bias-free. Even more so opinions, scientific or not, on language use.
  
  --
  Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
5. Re:Telivision by AK+Marc · 2012-01-02 09:42 · Score: 1
  
  If we find one group, say Wal-Mart shoppers, who use words that don't exist like misunderestimate and nuk-u-lar more than others, does that mean anything? And if so, what?
  
  --
  Learn to love Alaska
$12m in venture capital to invent Urban Dictionary by SpiralSpirit · 2012-01-01 08:35 · Score: 1

we've eliminated the middle man by letting users submit whatever they want, and pocketing all the money!
What a horrible summary by oneiros27 · 2012-01-01 11:11 · Score: 1

These sources differ from both conventional dictionary publishers and crowd-sourced efforts like the excellent Wiktionary for their emphasis on avoiding human intervention rather than fostering it.

You make it sound like they're completely removing the human elements. And sure, a corpus by nature does that, as its compilers are only really involved in setting the bounds of the collection and letting the authors speak for themselves. Wordnik, on the other hand, allows *anyone* to contribute, but they're not allowed to give definitions (definitions are only gathered from official dictionaries and the like). What you do with Wordnik is give examples of the word in context -- because it's actually really hard to define some words.
The thing is -- there are no editors trying to decide 'is this a word or not' ... if you put it in, it's a word. It doesn't matter that only you and your 4 friends use it, or whether it's important enough -- if you want to add it, you can. Yes, they also automate adding stuff from other sources, and so did Wikipedia early on (CIA factbook for countries, US census for places in the US, etc.).
Yes, you can use Wordnik as a sort of meta-dictionary, but you can also add words to it, look up their values in Scrabble, tag words (words you hate, jargon in your field, etc.). It *is* fostering human intervention -- how many of you out there can add a word to a print dictionary? And unlike those print dictionaries, we don't have to wait 3-4 years before someone decides that something is 'officially' a word.

--
Build it, and they will come^Hplain.
1. Re:What a horrible summary by martin-boundary · 2012-01-01 13:20 · Score: 1
  
  That's a stupid idea. To use an analogy that Slashdot understands: a traditional dictionary is like a standards document. It's useful to promote interoperability between speakers both during a single transaction (conversation between two parties), and also in log files (written documents to be read again later).
  Collecting random words on the web into a dictionary is like getting rid of standards altogether, or saying that every piece of software out there, no matter what it does, is standards compliant. We saw what that leads to in the early browser wars.
  We need language gate keepers. It's ok if language evolves somewhat over a period of 100 years, but if it changes so much that we can't make sense of what people wrote even 10 years ago, then we're in big trouble. In particular, dictionaries *shouldn't* be published more often than once every 25 years or so: It actually helps continuity if we force ourselves to use the same language that was current in the previous generation.
2. Re:What a horrible summary by oneiros27 · 2012-01-02 14:53 · Score: 1
  
  Do you really mean to tell me that you only use words as they're defined in the dictionary? And if so, which dictionary? Because as we all know, there are lots of different standards out there. And then there's versioning of the standards, and those implementations that aren't quite compliant (in language, those would be regional dialects). Language is not as cut and dried as you think it might be.
  But your suggestion is actually done in other countries -- the French have a government group that officially approves new words to be added to their language, with the result that they have far fewer words than we do.
  And I admit, there are problems with allowing anyone to change the language -- we have judges who are willing to use modern definitions of terms to decide what 200+ year old legal documents mean ... because after all, they should've planned for language to change when they wrote the Constitution and the Bill of Rights.
  
  --
  Build it, and they will come^Hplain.
Wordnik by metamatic · 2012-01-01 11:43 · Score: 1

Regarding Wordnik, I don't think Rick Santorum is going to be a fan of their site.

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak
Internet != WWW by tepples · 2012-01-01 14:48 · Score: 1

It might have something to do with tech sites that take pains to point out that the Internet is not just the World Wide Web.
1. Re:Internet != WWW by Anonymous Coward · 2012-01-02 01:04 · Score: 0
  
  "!=" != "antonym"
  lameness filter is lame
Historical accretion by vikingpower · 2012-01-01 22:16 · Score: 1

All of the people interviewed, as well as the author of the NY Times article, leave a major issue unmentioned: historical word use. I am a very enthusiastic user of the Oxford English Dictionary (yes, it has the place of honour in my living room), and each time I look up a word in the venerable OED I am amazed at the thick and variegated strata of historical meaning, and the gradual shifts in it, even for words we think of as "simple".
Neither the Wordnik nor the CCAE person mentioned these important aspects of a dictionary's use. For good reason: corpora such as Wordnik and the CCAE are "mere" aggregations of internet use, which, not incidentally, is not necessarily the same as everyday use.

--
Religous speak to God. Insane are spoken to by God. When all shut up, one can finally hear Shostakovich in peace
continuity by mcswell · 2012-01-02 05:27 · Score: 1

Current generation nonsense, it's high time we return to Latin. Ita et vos per linguam nisi manifestum sermonem dederitis, quo modo scietur id quod dicitur? eritis enim in aëra loquentes.
But I'd accept Old English.