Physicists Discover Evolutionary Laws of Language
Hugh Pickens writes "Christopher Shea writes in the WSJ that physicists studying Google's massive collection of scanned books claim to have identified universal laws governing the birth, life course and death of words, marking an advance in a new field dubbed 'Culturomics': the application of data-crunching to subjects typically considered part of the humanities. Published in Science, their paper gives the best-yet estimate of the true number of words in English: a million, far more than any dictionary has recorded (the 2002 Webster's Third New International Dictionary has 348,000), with more than half of the language considered 'dark matter' that has evaded standard dictionaries (PDF). The paper tracked word usage through time (each year, for instance, 1% of the world's English-speaking population switches from 'sneaked' to 'snuck') and found that English continues to grow at a rate of 8,500 new words a year. However, the growth rate is slowing, partly because the language is already so rich that the 'marginal utility' of new words is declining. Another discovery is that the death rate for words is rising, largely as a matter of homogenization as regional words disappear and spell-checking programs and vigilant copy editors choke off the chaotic variety of words much more quickly, in effect speeding up the natural selection of words. The authors also identified a universal 'tipping point' in the life cycle of new words: Roughly 30 to 50 years after their birth, words either enter the long-term lexicon or tumble off a cliff into disuse and go '23 skidoo' as children either accept or reject their parents' coinages."
Anyone that has played Scrabble (especially against a computer) knows that there are tons of words out there that no one has ever heard of, most of which you can't even find a definition for. What the hell is a Qi? I don't know, but I can get 66 points for it.
'Culturomics'? You'd think that people studying words would be able to come up with a better word than that.
When our name is on the back of your car, we're behind you all the way!
This looks like really interesting and important research - perhaps even a tenth as important as these physicists think it is!
What physicists do when they are bored ... take away research from other fields
NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
Please. No more portmanteaus with -onomics on the end. I automatically think of Reagan.
I write professional videogame reviews! http://www.digitallydownloaded.net/
The OED has about 600 thousand words, though this is still a lot less than a million. It would be interesting to see the most commonly used word that isn't in the dictionary.
-- Ed Avis ed@membled.com
>Anyone that has played Scrabble (especially against a computer) knows that there are tons of words out there that no one has ever heard of, most of which you can't even find a definition for. What the hell is a Qi? I don't know, but I can get 66 points for it.
Qi is a simple one: it's a two-letter word, and there are roughly a hundred two-letter words accepted by TWL which are hackable. Qi is also something I've seen reading Chinese philosophy, so that doesn't really upset me. The ones that really get me when I play against computers or people who cheat are actually the longer ones. Recently I have seen outgnawn, aliquot, mahoes, votive; the list goes on when your friends are using websites to look up permutations.
You can study this stuff and memorize things like I-dumps: ziti, ilia, ixia, inion, etc. But in the end what really got my scores higher was studying the short 2 and 3 letter words and building thick crossword-like packs of words especially over TL tiles.
My work here is dung.
Related: I recently learned that a large portion of the PhDs working at a particular Google office have astrophysics degrees. Go figure.
weinersmith
Everything in the world is just applied physics, except for mathematics.
Bringing mathematical rigour to fields of research where it has previously been ignored can clearly provide some interesting insights.
...Grand Unification Theory of Cosmology Proven.
My husband works for Merriam-Webster as an assistant editor/lexicographer. You wouldn't believe some of the stuff that goes on there. People will call and demand fame for a word. For example, some guy called in and said he'd been the one to come up with the word 'ginormous', and wanted credit for it. They don't seem to understand the process. MW's archives in the basement are a CIA-esque compilation of language; they'll use every collegiate they have for reference, going all the way back to the first one. Husband says it won't be long before internet-meme creations are included.
You want to know how to help your kids? LEAVE THEM THE F*&K ALONE. --George Carlin
I remember an episode of 'Recess', a Saturday morn cartoon from the late 90's, where the main characters made up a word to replace swearing: whomps. It wasn't long before the school board dog-piled them, saying it wasn't allowed as they considered it a swear now since all the kids were using it to curse. It was a very interesting episode.
You want to know how to help your kids? LEAVE THEM THE F*&K ALONE. --George Carlin
Why would physicists be studying this kind of thing?
When you graduate with a PhD in physics, you get three things:
The third means that you are obliged, at least once, to submit a paper about some other field to arxiv.org. Ideally, this paper should not cite any relevant research in the field - only other papers by physicists - and, for bonus points, should base its entire thesis on a weak statistical correlation.
I am TheRaven on Soylent News
It's not in the dictionary. Look it up.
Yeah, it seems that while 'sneaked' is older and more likely to be considered THE true version by any authority that accepts only one of the two, 'snuck' seems more natural to many people and is gaining ground in all English speaking countries, even in newspapers (e.g. much more common in Canadian newspapers than 'sneaked').
weinersmith
While I agree that grammar Nazis can go a bit far, and I had no problem with what you just wrote (I'll ignore threw/through, this is /. after all), I find that a lot of people write impossible-to-parse sentences. I see this in business correspondence all the time. I'll get an email from a coworker, and I won't even know what they are asking because what they typed doesn't make any sense at all, or can be interpreted about 5 different ways. A lot of it comes from people being too lazy to just type out a whole sentence or paragraph, which is sometimes what is required to get the point across. I think a lot of it is due to people not being able to type fast enough, so they just get impatient, and write the shortest thing possible, instead of what actually makes sense.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Well obviously Google employees working in their moon office would have astrophysics degrees.
(As an aside: that page is the second hit for googling "google jobs" for some reason.)
I see this all the time (I have a PhD in the humanities and I am a software engineer) where someone from outside the field does something and claims it is a universal law but really, they just worked on English and cannot (or will not) prove that it works for other languages. Usually, these papers also lack any kind of literature review and ignore many of the problems that this would uncover. I saw one paper by a physicist that tried to use bit fields to model language change; it was just massively reductionist and couldn't explain anything at all for all the mathematical rigour.
I go to my University's language lunch which has lots of this and scare the pants off grad students by saying "this is all very well but does this work for Japanese or Old Irish or any other language?" This usually makes their faces go white because naturally English is the ONLY language that matters and is therefore "universal".
So physicists have reinvented battleship curves. Congratulations! We couldn't have done it a century ago without you!
It is the alternate spelling of "Chi", a concept in Daoist philosophy that represents the primal energy of the universe.
As in "tai chi". As in "qi gong". It is also sometimes spelled "ki".
The ancient Chinese must have played a lot of Scrabble
The Scrabble word that bothers me is "aa". I mean seriously. Who even wants to play with you any more? It's not fun when you start bringing out the scrabble dictionary. I thought we said no 2-letter words, anyway. And no, I'm not being a baby.
You are welcome on my lawn.
There have long been mathematical studies of how long irregular verbs might survive in the English language. I remember seeing the first such article a while back.
Basically, the more a verb is used, the longer it will take us to be liberated from its influence. Some, like the verb "to be", are so ensconced in our language that they may take many, many generations to eliminate.
Of course, this ignores any political movement to eliminate them. As countries become closer, if English remains the language of democracy, there may be a push to make English more standard. A new English without all the rule contradictions it currently has would be double-plus good.
"That's the way to do it" - Punch
I'm sure Americans will have created 8000 of those new words each year. Not content with the ones we British gave them, they wanted their own.
Jonathanjk.com
>Bringing mathematical rigour...
Physicists are widely known for their lack of mathematical rigor. David Hilbert, perhaps the most influential mathematician of the 20th century (who incidentally discovered Einstein's field equations before Einstein, but was nice enough not to get into a priority dispute, since most of the work leading up to the discovery was Einstein's), is often quoted as saying some variation on, "Physics is too difficult for physicists!" His meaning was apparently that the mathematics required to rigorously justify assertions in advanced physics is often beyond the reach (or inclination) of physicists. This isn't necessarily a bad thing, by the way, but it indicates the traditional lack of rigor in physicists' math.
The paper itself says,
"We use concepts from economics to gain quantitative insights into the role of exogenous factors on the evolution of language, combined with methods from statistical physics to quantify the competition arising from correlations between words and the memory-driven autocorrelations in u_i(t) across time."
Perhaps "Bringing quantitative statistical analysis..." is a better phrase.
First off, I'd say your lack of language skills is indeed impacting your ability to coherently formulate an argument. Otherwise you would have noticed that the original post is not, in fact, about grammar at all. Rather, it's about words.
That said, I would also say that HOW you present an argument is just as important as (if not more important than) the content of the argument itself. The point of making an argument at all is to convince someone else of the validity of your viewpoint. This task is impossible if you are unable to make yourself understood, and it's very difficult if people have to dig through your statements in order to tease out your meaning. Also, the better your language skills, the less the chance your arguments will be misunderstood.
The reality is, though, it doesn't really matter how terrific your ideas are if you are unable to efficiently articulate them. Which makes the better point?
1) "I took my - you know - thing .... the thing that sits on the round things.... you know - the THING.... yeah - with the keys and stuff - the THING. Anyway, I took the thing to the place... you know - where I do stuff... there's coffee and papers and stuff - the PLACE... and the guy who tells me what to do... you know - the PLACE."
- or -
2) "I drove my car to work."
s/threw/through/g
"through" is an adverb indicating a passage between locations or a change of state.
"threw" is the past tense of throw.
Grammar Nazi's often get a bit extreme but when your basic spelling is up-to-shit the actual meaning of your writing gets lost. Yes language evolves - this means we coin new words, we gradually change laws of grammar - but it is not a license to write whatever you want and claim it means what you intended to mean.
I'm fairly certain from context that you intended to write "through" for example - but if I hadn't recognized it I would have been wondering if you were so badly bullied that teachers actually threw you around in school.
>I have only learned to dislike people who feel the need to correct every detail, and discredit my arguments
It's not a discrediting of arguments to correct grammar mistakes. However, repeating them when you have been corrected just makes you look stupid. Worse, it makes you an asshole. Yeah, YOU are the asshole. Why ? Because using the proper conventions of language (grammar, spelling etc.) is a form of politeness. It makes your writing easy to read.
Furthermore, it is to your own advantage as well. When you ignore good language rules what you write more often than not doesn't mean what you intended it to mean. Some of your readers will simply misunderstand you. Others will be annoyed. Very few will actually have a clue what you were trying to say- because what you were trying to write and what you actually write no longer bear any but the most limited of resemblances.
The only thing that saves the grammar-ignorant from being completely illiterate is the human ability to infer meaning from context - but context is incredibly culture, time and location specific. So the meaning of your words now become discernible exclusively to people who share your background. Everybody else (that could literally be people who live two neighborhoods away) are just sitting there shaking their heads and wondering what the fuck you're trying to say.
Oh and for a little encouragement... I am writing in my THIRD Language and very nearly all of the fucking time I get it right... you first language speakers have absolutely no excuse.
Unicode killed the ASCII-art *
Had you clicked the link to the PDF provided in the summary, you'd have stumbled onto their paper -- as in "the thing we're discussing here" -- where they mention Spanish and Hebrew were also studied.
Every end has half a stick.
Not surprising really. What does an astrophysicist do? Point hyper sensitive instruments at random portions of the sky and generate humongous data sets that need heavy processing to extract structure and meaning. A really large part of Astrophysics these days is data analysis, almost all of it done with automated codes.
Which is, for example, why Renaissance Technologies has a lot of astrophysicists on board as well.
Physicists claimed the evolution of language was based on some characterization of words of vocalization pattern and energy usage, the idea being that languages which afford more efficient energy requirements to the speaker tend to survive by natural selection process, just as animals in any environment evolve physical characteristics that are specifically adapted to efficient energy usage in that environment.
All languages evolve like this. The only reason we feel the need to fixate them on a standard is it gives a pretence of security. The rules themselves are just a long winded way of trying to legitimise the eccentricity of a language (English) pasted together from various other European languages. Our words are disparate, our Italian alphabet is lacking several letters and our accent changes every five miles down the road in a country of 80 million. Occasionally, we are lucky enough to get away with flouting the rules without being shot down by some jobsworth pedant. There will never be any kind of reform from the top down, if that were possible in any way whatsoever there would be no French 'weekend' for sure.
You want to put the rules in perspective, consider the many millions of human beings to come that are born into the world all thinking the same thing: 'frankly, I could not give a toss about cultural heritage mammy, now where is my coke and crisps please?'. If you don't want to be paddling a canoe up a waterfall the rest of your life, then it is much more pragmatic to be relaxed about such matters, because people are much more willing to respect convention when they are not beaten over the head with it.
That stupid word always drived me crazy.
Yeeeaaaah!
Upward mobility is a slippery slope - the higher you climb the more you show your ass.
All this reminds me of when a mathematician, a physicist, and an engineer were told of a man who is across the room from a woman and moves half the remaining distance to the woman every minute. The mathematician said, "The man will never reach the woman." The physicist said, "In twenty minutes the man will be within an atomic radius of the woman and can be said to have reached her." The engineer said, "No problem, in five minutes that guy will be close enough for all practical purposes."
Please adjust this joke to the sexual proclivities of your audience as needed.
You shall see a cow on the roof of a cotton house.
Newton's laws weren't overturned, they were refined. Chomsky seems to have had the wrong idea entirely.
"Dog-pile" is pretty recent too. Viz magazine made up a new swear word "fitbin" to put on the front cover simply to point out to WHSmiths (who also used to put Private Eye on the top shelf because they thought it was an adult entertainment magazine simply because of the name) that their in-house censorship of titles was pathetic and Daily Mail-y in the extreme.
"Wait. Something's happening. It's opening up! My God, it's full of apricots!"
>Published in Science, their paper gives the best-yet estimate of the true number of words in English: a million, far more than any dictionary has recorded (the 2002 Webster's Third New International Dictionary has 348,000) with more than half of the language considered 'dark matter' that has evaded standard dictionaries (PDF).
Umm, no. The phrase "true number of words in English" is sufficiently ill-defined to make the question meaningless. There are two ways people think about whether something is a "true word" in English: more or less, you either rely on an authoritative reference to make that determination (which is not what's happening here), or you note its existence by some level of usage in practice, and set a somewhat arbitrary bar for how often the word has been used (which is what's happening here).
As per Zipf's law, etc, tweak that "bar" a little bit, and you'll get quite different results.
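To make the point about the bar concrete, here is a minimal sketch. It assumes an idealized Zipf distribution in which the word of rank r occurs exactly top_count / r times; the function name and the numbers are invented for illustration and are not taken from the paper.

```python
def words_above_bar(top_count: int, bar: int) -> int:
    """Count the ranks r whose Zipf frequency top_count / r is at least `bar`.

    Assumption (not from the paper): the r-th most common word occurs
    exactly top_count / r times in the corpus.
    """
    if top_count <= 0 or bar <= 0:
        raise ValueError("top_count and bar must be positive")
    # top_count / r >= bar  is equivalent to  r <= top_count / bar
    return top_count // bar


if __name__ == "__main__":
    top_count = 1_000_000_000  # hypothetical count of the most common word
    for bar in (2000, 1000, 500):
        print(f"bar={bar}: {words_above_bar(top_count, bar)} words qualify")
```

Under this toy model, halving the usage bar doubles the number of "words", so the headline figure depends heavily on where the bar is set.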
I'm a nature photographer.
When taking "History of the English Language" last year as part of my graduate work, the professor I studied under was part of the Middle English Dictionary Project. It was interesting to speak with him on the life and death of words after the printing press, and I remember him giving a 30 to 50 year estimation for a word to cement itself or become rare. It doesn't really seem like this is anything new.
From TFA, the researchers were analyzing Google's corpus of primarily English texts. Anything they have to say about the development of language can thus only be said to hold true for English.
Different languages work differently, and are subject to different pressures of usage and culture and global politics. Somehow I doubt that Māori or Arabic or German are changing in quite the same ways or at quite the same rates as English.
TL;DR: "Universal", my shiny white honky ass.
"What in the name of Fats Waller is that?"
"A four-foot prune."
I agree with your main point, and agree that the modern Hebrew vocabulary is subject to diverse influences, including European languages.
That said, Hebrew (modern or otherwise) is not that hard to classify -- it is firmly in the Semitic language grouping, itself part of the Afroasiatic language family. Hebrew is a cousin to Arabic, and a cousin to ancient Egyptian, Touareg, Somali, and Amharic (Ethiopian).
Cheers,
"What in the name of Fats Waller is that?"
"A four-foot prune."
Speaking as a linguist (working on my Ph.D.) this is something of a tempest in a tea-pot. The most relevant use would be for glottochronology - a field that's largely been abandoned by anyone seriously working on historical linguistics because of the various problems involved with that approach, including what the authors of the paper find, that the rate of word loss is not constant over time. They have a better idea of the rate of word loss, which could help improve glottochronology, but the method has a lot of flaws regardless.
Also, the question they're asking - how do words change over time, in terms of coining, becoming current, and becoming obsolete - really isn't a question historical linguists are that concerned about. Historical linguists are much more interested in how the forms of words change over time (phonological change), or how their function changes over time (grammaticalization), whereas the coinage and loss of words isn't often so important, especially on the large scale statistical level. Furthermore, this type of model probably handles languages with phenomena like avoidance speech poorly, since that would change how and why words are kept or lost.
Their language sample is at heart a convenience sample - they happened to have access to lots of data in those three languages, and it is largely written data. Spanish and English are both related languages with very similar cultural contexts, while Hebrew is a strange choice in that it has an ancient history, but only quite recent revitalised usage. Whether most spoken interaction (which is what linguists tend to be more interested in) has even a tiny subset of the total number of words they are talking about is an open question and would be better tested against corpora with a large quantity of spoken data such as the British National Corpus or the International Corpus of English.
It's an interesting study, but if it hadn't been written by physicists I'm not sure if it would have ended up in Diachronica or the Journal of Historical Linguistics, much less Science. Their "statistical rules" are interesting, but really not of any great use to wider linguistic inquiry. I think its import is really just exaggerated by the fact that science editors read Science and NOT most linguistics journals, and therefore they think it's really impressive.
From your post, it seems the assignation of the asshole title should not be exclusive... A grammarian would point out several errors in your post (mostly subject-verb agreements), and some of these even vary between British and American English. I'm sure everyone who read your post, or the post above it, understood what both of you were trying to say, so most of your arguments do not apply in this case. While I agree that abandoning grammatical rules can make communication very difficult, I also think some grammatical rules have been detrimental to clarity. Some rules are not even agreed upon - see ending a sentence with a preposition, where to put the punctuation with respect to the closing quotation mark, whether "everybody" can be plural, etc. Syntax is important, but a lot of language is like white-space, and languages that rigidly interpret white-space are a pain in the ass, just like grammarians.
It's not that similar, actually. In the above "paradox", you have a sum of the total distance covered after x minutes. If they were 10 feet apart, then after x minutes it is 5 + 2.5 + 1.25 + ... until you have x terms. As x goes to infinity, this sum approaches the full 10 feet, so the math is right: the full 10 feet is never reached in any finite time. And so the physics/engineering joke is fine: technically they will not meet following those rules, but there's always a point of "close enough". The rule itself is impossible to follow, though.
In Zeno's paradox of Achilles and the tortoise, it works like this. The tortoise is, say, moving at 1 foot per second, and is 10 feet ahead. Achilles moves at 10 feet per second (~7mph), so after 1 second he will reach the point where the tortoise is now. But after that 1 second the tortoise will be another foot ahead, so Achilles must take another 0.1 seconds to reach the new point, but in that 0.1 seconds the tortoise has moved again, and so on forever, with the next step taking 0.01 seconds but still not catching the tortoise. Even if you allow for the physics/engineering "close enough", at no point is Achilles EVER past the tortoise, only "close enough" to call him "caught up". The reason this is different is that x terms in the sum no longer take exactly x minutes, since each term is over a shorter time as well as a shorter distance. If you take the limits on the infinite sum, the distance between them goes to 0, and the total amount of time goes to a finite number, not infinity (in this case, that finite number is 1 and 1/9 seconds, exactly what you get if you just ask how long it takes a person going 9 feet per second to cross the original 10 foot distance). Mathematically there is no problem with taking a finite amount of time to go a finite distance, so there is no paradox; the equation works out exactly when Achilles catches up to the tortoise. It's not a time reachable in the sums you came up with to describe it, but it's still a finite time. Whereas in the dance paradox above, the time it takes to reach 0 distance IS infinite.
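The two cases can be compared numerically. This is a minimal sketch using the figures from the comment above (10 feet, 10 ft/s, 1 ft/s); the function names are invented for illustration.

```python
def minutes_until_within(start_distance: float, threshold: float) -> int:
    """Halve the remaining distance once per minute; return the number of
    minutes until the remaining distance is at most `threshold`."""
    if start_distance <= 0 or threshold <= 0:
        raise ValueError("distances must be positive")
    minutes = 0
    remaining = start_distance
    while remaining > threshold:
        remaining /= 2
        minutes += 1
    return minutes


def achilles_chase(head_start: float, achilles_speed: float,
                   tortoise_speed: float, stages: int) -> tuple[float, float]:
    """Sum the first `stages` stages of Zeno's chase.

    In each stage Achilles runs to where the tortoise was at the start of
    the stage. Returns (elapsed_seconds, remaining_gap_in_feet).
    """
    gap = head_start
    elapsed = 0.0
    for _ in range(stages):
        stage_time = gap / achilles_speed   # time to close the current gap
        elapsed += stage_time
        gap = tortoise_speed * stage_time   # tortoise's advance meanwhile
    return elapsed, gap


if __name__ == "__main__":
    # Dance-floor rule: each extra minute only halves what is left.
    for threshold in (1.0, 0.01, 1e-6):
        print(threshold, minutes_until_within(10.0, threshold))

    # Achilles: the stages shrink in time as well as distance.
    for stages in (1, 2, 5, 30):
        print(stages, achilles_chase(10.0, 10.0, 1.0, stages))
```

In the first case the number of minutes keeps growing as the threshold shrinks and no finite number of minutes gives zero; in the second the elapsed time settles toward 10/9 seconds while the gap goes to zero.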
ASCII stupid question, get a stupid ANSI
At least I have the excuse that I'm not writing in my first, or indeed even my second language.
Unicode killed the ASCII-art *
>Because using the proper conventions of language (grammar, spelling etc.) is a form of politeness.
On this point (and while I concur with your sentiments overall), I would like to point out that it's not so much that using language correctly is a form of politeness as that it shows the writer takes the point being communicated, and the communication medium, seriously. Interchanging their/there/they're, two/to/too, for/four, than/then, through/threw, and other such errors imply one of three things:
1) Non-native speaker, confusing one word for another.
2) Ignorance on the writer's part.
3) Carelessness.
Two of the three boil down to negligence. The remaining one can be easily confirmed or disproven through context; non-native speakers typically have a very distinct pattern of grammatical errors that are more complex in nature. Non-native speakers' errors are typically tense, object-subject, or otherwise construct-related, as opposed to simply not using (and hence not knowing) the correct word. There are vocabulary issues for non-native speakers, but they tend to substitute similar-meaning words, as opposed to completely unrelated but similar-sounding words.
But I digress. For the former case, since the writing was done trivially, the act of reading would be trivial as well. If the writer obviously doesn't care about what's written, why should any reader put the same level of effort into parsing the words? Since the writer cannot be bothered to use a dictionary (which is easier than ever now that there's a search bar at the top of every browser) or proof-read, or even learn the language used, why should the reader bother with trying to decipher the information that's probably not so important anyway?
As such, it's not so much about respecting the reader, but respecting the point. The merits of the point can only be addressed after acknowledging that the point is important enough to warrant addressing. And that starts with the initial communication of the point--in this case and in many others, the written text.
"If a nation expects to be ignorant and free in a state of civilization, it expects what never was and never will be."
Poorly worded title. I don't see any laws, theories, or other predictive content, just some analysis.
I was crazy back when being crazy really meant something. (Charles Manson)
A year or so ago a contributor to the London Review of Books identified a golden age of swearing, until it was pointed out that the "apparent prevalence of the word fuck in the period before 1820, and its complete disappearance for more than a century thereafter, can be explained by the end of the use in printing of the ‘long s’, which modern optical character recognition sees as an ‘f’. All the apparent ‘fucking’ before then is actually just ‘sucking’"