Using The Web For Linguistic Research
prostoalex writes "The Economist says linguists are gradually adopting the World Wide Web as a useful corpus for linguistic research. Google is used, among other resources, to research how the written language evolves and how some non-standard examples of usage become more or less acceptable (The Economist quotes the phrase 'He far from succeeded,' where 'far from' is used as an adverb). LanguageLog is a resource linked in the article, where linguists discuss current peculiarities of the English language."
It's probably a good thing that they steer away from Slashdot as a corpus of English usage. Or, should I say, in SOVIET RUSSIA it's best Slashdot stays away from THEM! Or is it that only old people use the Internet as a corpus of the English language while pouring hot grits down a naked and petrified Natalie Portman's pants?
Indeed what their sayin is true. U can learn English very well, especially grammer readin /. frist psots. Teh intarweb seems to certainly kick arse for that sorta research. Very 1337 articel. Thx d00dz.
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
When we might actually say words like 'lol' out loud. Imagine a deal going down between two mining companies, and the CEO of one company, with a straight face and a deadly serious demeanour, saying to the cameras: "Despite many thinking we pwned them in the deal, we believe it came out leet for every1"
http://www.sandstorming.com
More than just web users are adjusting to this shift in language. I continuously question my co-workers (social workers) about telling the youth what is proper and what is not. If a language does not evolve, then it dies. Using words drawn mostly from slang and rap lyrics is becoming more than just normal; it is becoming the standard.
It came to me that the English language was in deep trouble when people started saying "rotfl" and "lol" in person. There seems to be kind of a backlash brewing though, with improved email composition styles dictated by employers, and such.
SNACKS ARE AWESOME
This is not the first time that Google (and search engines in general) have changed how we do things.
Nowadays copyright holders use Google to search for potential violations of their intelectual property. Plagiarism is easy to detect thanks to Google as well. Instead of using rather expensive systems to search for duplicate work, teachers are now one search away from distinguishing original work from the rest.
Does He far from succeeded, sound totally fuckin retarded to anybody else?? Like something an idiot would say to try and seem intelligent?
This begs the question of how much "incorrect" use of a phrase is necessary for it to become the "correct" use of a phrase.
NB: I'm being ironic.
One mo'e cripplin' bombshell hit da damn already beleaguered *BSD community when IDC confirmed dat *BSD market share gots dropped yet again, now waaay down t'less dan some fracshun uh 1 puh'cent uh all servers. Comin' on de heels uh a recent Netcraft survey which plainly states dat *BSD gots lost mo'e market share, dis news serves t'reinfo'ce whut we've knode all along. What it is, Mama! *BSD is collapsin' in complete disarray, as fittin'ly 'esemplified by failin' wasted last in de recent Sys Admin comprehensive netwo'kin' test. Man!
You's duzn't need t'be de Amazing Kreskin t'predict *BSD's future. De hand writin' be on de wall, dig dis: *BSD faces a bleak future. In fact dere won't be any future at all fo' *BSD a'cuz *BSD be dyin'. Doodads is lookin' real baaaad fo' *BSD. As many of us is already aware, *BSD continues t'lose market share. Red ink flows likes some riva' of blood.
FreeBSD be de most endangered uh dem all, havin' lost 93% uh its co'e developuh's. De sudden and unpleasant departures uh long time FreeBSD developuh's Jo'dan Hubbard and Mike Smid only serve t'undersco'e da damn point mo'e clearly. Slap mah fro! Dere kin no longa' be any doubt, dig dis: FreeBSD be dyin'.
Let's keep t'de facts and look at da damn numbers.
OpenBSD leada' Deo states dat dere are 7000 users uh OpenBSD. How many users uh NetBSD is dere? Let's see. De numba' of OpenBSD versus NetBSD posts on Usenet be roughly in ratio uh 5 to 1. Derefo'e dere is about 7000/5 = 1400 NetBSD users. BSD/OS posts on Usenet is about half uh de volume uh NetBSD posts. Derefo'e dere are about 700 users uh BSD/OS. A recent article put FreeBSD at about 80 puh'cent uh de *BSD market. Man! Derefo'e dere is (7000+1400+700)*4 = 36400 FreeBSD users. Dis be consistent wid de numba' of FreeBSD Usenet posts.
Due t'de troubles uh Walnut Creek, abysmal sales and so's on, FreeBSD went out uh business and wuz snatchn upside by BSDI who sell anoda' troubled OS. Now BSDI be also wasted, its co'pse turned ova' to yet anoda' charnel crib.
All majo' surveys show dat *BSD gots steadily declined in market share. *BSD be very sick and its long term survival prospects is very dim. WORD! If *BSD be to survive at all it gots'ta be among OS dilettante dabblers. *BSD continues t'decay. Slap mah fro! Nodin' sho't uh a miracle could save it at dis point in time. Fo' all practical purposes, *BSD be wasted.
Fact, dig dis: *BSD be dyin'
hammerrevolution.com --;
There are more non-native speakers on the web than native speakers. In the European community, native English speakers are by far a minority. That way French expressions are pouring into the language in an unstoppable way. Those expressions are then used by native-speaking politicians and are broadcast by television. That way they enter the mainstream of the English language.
Regards
I've used the web for corpus linguistics research. My last big project was to look at a lot of web pages with Mexican and Chilean slang Spanish, and see if there was a difference in vocabulary usage. There was a significant difference; I could, 70% of the time, tell if a given passage was Chilean or Mexican Spanish.
I could have gotten a higher accuracy rate, but this was just a simple undergraduate project.
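A rough idea of how a vocabulary-based guess like this could work is sketched below. This is not the poster's method, which the comment does not describe; the marker-word lists are a tiny illustrative assumption, and a real project would derive them from the collected pages.

```python
import re

# Illustrative marker words only (an assumption, not data from the project).
CHILEAN_MARKERS = {"cachai", "fome", "pololo"}
MEXICAN_MARKERS = {"chido", "güey", "órale"}

def guess_dialect(passage: str) -> str:
    """Guess the dialect by counting marker words; 'unknown' on a tie."""
    words = re.findall(r"\w+", passage.lower())
    chilean = sum(word in CHILEAN_MARKERS for word in words)
    mexican = sum(word in MEXICAN_MARKERS for word in words)
    if chilean == mexican:
        return "unknown"
    return "chilean" if chilean > mexican else "mexican"

print(guess_dialect("¿Cachai? Esa película es fome."))
print(guess_dialect("Órale, güey, qué chido."))
```

Counting a handful of fixed words is about the simplest approach possible, which may be one reason an accuracy well short of 100% is unsurprising for short passages that contain no slang at all.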
Without RTFA my fist instint is to say why post anything related to natural language on slashdot? But the truth is, as a sysadming/webmaster/anything that plugs into an outlet for a small credit union I am appalled at the way people want to write on the web. It's hard to describe, but see (for the moment) this for a crippled example (yeah, a work site published externally, FSCK'ing horrible - more where that came from). Anyhow, it seems the second people publish shit one the web they give up on grammer/puncuation etc - in the included link originally draft had every link capitolized. No bold, color or anything - fuck it, aparently it's OK to throw proper grammer to the wind if it's on the web, even if the purpose is to manage peoples retirement. ARGH.
side note - my bad grammer/spelling is OK only because I'm a FUCKING CODER. I don't want to hear from the grammer/spelling Nazis on the text of this post.
anyhow - slight possibility of feedback on a complelty offsubject page I'm working on, here. Break it, fuck with it whatever. Jon.
There needs to be an annual prize for the highest compression ratio using random pages from the web as the corpus. This would probably do more for real advancement of artificial intelligence than the Turing competitions.
Seastead this.
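For anyone wondering how the ratio in such a contest might be scored, here is a minimal sketch. It assumes a general-purpose compressor (zlib from the Python standard library) as a stand-in; a serious entry would presumably use a far stronger language model, and the function name is made up for illustration.

```python
import zlib

def compression_ratio(text: str, level: int = 9) -> float:
    """Return original size divided by compressed size (higher is better)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot score empty text")
    return len(raw) / len(zlib.compress(raw, level))

# Highly repetitive text compresses far better than a short varied sentence.
print(compression_ratio("the cat sat on the mat. " * 200))
print(compression_ratio("Colorless green ideas sleep furiously."))
```

The better a program predicts the next word of ordinary text, the fewer bits it needs to store it, which is the link to artificial intelligence the comment is pointing at.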
Unlike French and Italian, English has no official institution that defines 'correct' usage. Essentially, the English-speaking world just 'makes it up' as it goes. Thus when I see the adverb 'really' butchered into 'real' I must try not to get annoyed, e.g. "It's real hard to use your mother tongue." vs. "It's really hard to use your mother tongue." Please help me here - is the misuse/non-use of 'really' something that's taught in school?
Good! Natural language is a moving target. The web is an excellent communication medium and ignoring it would be quite a silly move. The example reminds me of "To boldly go", which was not proper, but its elegance is hard to argue against.
Though I've done it at a higher level of the educational system (while doing a Ph.D. in Linguistics). The big, big advantage of using search engines is the sheer size and variety of the content available on the web. For a number of things, there is simply no other way to get enough examples, because the phenomenon you're interested in is just too rare. The downsides are repetitiveness (it's often the case that you get the same document a lot of times at many different URLs; for example, song lyrics), typos, unreliable language-detection algorithms in search engines (search for weird stuff in Spanish in Google, and you'll often get back some Portuguese results), unreliable numbers, etc.
Are you adequate?
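The repetitiveness problem mentioned above (the same song lyrics at many URLs) can be reduced with a simple exact-duplicate filter. The sketch below is one possible approach under stated assumptions: pages arrive as (url, text) pairs, and two pages count as the same document if their text matches after lowercasing and collapsing whitespace. Near-duplicates with small edits would need something fuzzier.

```python
import hashlib

def dedupe_pages(pages):
    """Keep the first (url, text) pair seen for each distinct normalized text."""
    seen = set()
    unique = []
    for url, text in pages:
        # Normalize case and whitespace so trivially reformatted copies match.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique

# Hypothetical URLs, for illustration only.
pages = [
    ("http://example.com/a", "La la la,  la la"),
    ("http://example.org/b", "la la la, la la\n"),
    ("http://example.net/c", "Something else entirely"),
]
print(len(dedupe_pages(pages)))
```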
roflcopter....
The article addresses this in a weird way, where it first draws attention to the distinction, but once it reaches its crux, where google is used as a tool, the distinction is ignored entirely; instead it opts to focus on stranger things.
I woulda thght such a thng was unpossible.
Scouring the net for written material, prose or otherwise, and studying, analyzing, tabulating it is a cool and grand idea. Lots to be learned I'm sure.
... What about researching and analyzing vernacular data that is not publicly available on google, news sites, public message boards, usenet, etc? What similarities and differences can be found in what is considered to be personal or private communication?
However
I'm almost sure someone has thought of this before but the obvious problem is: how is one able to collect ample data categorized as private or personal communications? After all, it isn't possible to just google or grep ICQ or AIM logs from thousands of people...or is it?
Yes, we can record the errors made by the uneducated public (and even those done by, uhm, me). The question is: should we do that or not?
I was pretty taken aback when a council of linguists in Poland suddenly declared some widely-chastised and not even very popular errors to be valid usage. I was brought up in circles of people who not only put a lot of stress on the language you use, but also cruelly point out every incorrect word or phrase you use -- and this made me quite intolerant of bad speech.
Being but a dirty foreigner, I know that my English can sound bad to the ears of native English speakers -- that's why I sometimes ask people to correct me if they spot errors.
In other words: some people find careless speech repulsive. Thus, we should do whatever we can to promote correct usage as opposed to legalising incorrect uses.
The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
When in doubt between two spellings of a word, check the search result counts in Google. I've used that trick.
Then again, my idea of fun is to use Google counts for finding the words that get misspelt (Google ratio with misspelled: 5%) the most often.
I thought compatable was common, but I only get a 1% ratio there. Maybe there should be a category 'non-native'.
Is conneXion considered an error? I like it much better than connection.
Just now I found out that there are lists, e.g. at most commonly misspelled words.
I think that for most of the 20th century, English, and most languages in the industrialized world, were largely static, dominated by the written word which was dominated by proper grammar. Since WWII, popular culture and faster communications have increasingly exposed us to local vernaculars, mostly through radio and television. The written word lagged behind in its cultural evolution.
Thanks to the internet (initially email, BBS's and IRC, but more widely known on the Web), we now have a hybrid of the spoken and written word: the "typed word". This form of language evolves at the same rate as the spoken word, and injects its own vernacular as a side effect of the medium: acronym and abbreviation "words" (rofl, how r u), along with common misspellings (pwned), and mixing letters with numbers or punctuation (133t, n00b). All of these serve at least one purpose, whether as a form of super shorthand, insult, the appearance of being "cool", or are merely the result of laziness on the part of the author. Most typed-word terms don't transfer well when spoken.
One of my hobbies is studying (European) languages and how they are related. Sometimes I worry about the damage the typed word is causing to the spoken and written word (and any proper linguist should at least be interested in the phenomenon). Luckily, most typed word expressions aren't pronounceable, and the ones that are sound absurd, because they are removed from their original context when spoken, and everyone recognizes gibberish when they hear it. How the typed word affects the written word remains to be seen. Yes both are typed now, but only the written word has a chance of going through an editorial process. I think it will take a very long time for the formal lexicon and rules of grammar to embrace, however reluctantly if ever, the typed vernacular.
My favourite piece of linguistic research using Google is to search for "attention to detial".
http://www.google.com/search?hl=en&q=%22attention+to+detial%22&btnG=Google+Search
I then have a laugh at all the hits...
I've had the chance to use Google as a grammar or style checker in my day job as a glorified copy editor. I type two nearly identical expressions X and Y in the search box. If expression X gets 10,100 hits and expression Y only 500 hits, I use expression X.
For example, as a non-native speaker, I found myself waffling between the expression (A) "run for mayor of" and the expression (B) "run as mayor of." Letting Google arbitrate, I found 14,900 hits for (A) and only 200 hits for (B). I chose (A).
I discovered there's practically a dead heat between the expressions "a new lease on life" (which, if I'm not mistaken, is the expression favored by American usage) and "a new lease of life," with the latter nosing out the former 144,000 hits to 140,000. In this instance I let my own usage arbitrate. Since I'm more exposed to American than to British English, I chose "on."
I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
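The hit-count trick described above can be written down as a small decision rule. This is only a sketch: the counts are typed in by hand from the comment (no search API is called), and the `min_ratio` threshold of 2.0 is an arbitrary assumption for when a lead is large enough to trust.

```python
def pick_by_hits(hits, min_ratio=2.0):
    """Return the phrase with the most hits, or None if the lead is too small.

    hits maps each candidate phrase to its hit count; at least two are needed.
    """
    if len(hits) < 2:
        raise ValueError("need at least two candidate phrases")
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    (best, best_count), (_, second_count) = ranked[0], ranked[1]
    if best_count == 0:
        return None  # nobody uses either form
    if second_count == 0 or best_count / second_count >= min_ratio:
        return best
    return None  # near dead heat: fall back on your own judgment

print(pick_by_hits({"run for mayor of": 14_900, "run as mayor of": 200}))
print(pick_by_hits({"a new lease on life": 140_000, "a new lease of life": 144_000}))
```

The second call returns None, which matches what the poster did with "lease on/of life": when the counts are close, the numbers settle nothing and personal usage decides.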
I love a bit of cunning linguistics.
Indy Media Watch-Proctologist of the Internet
...which was this little program I wrote around the nascence of the internet. It took any sentence as input and kept a record of which words preceded each word, and which words followed each unique word. The idea was to build up a simple map of which words could precede or follow others completely without context. From this you could follow paths that made sentences or paths that looped forever, or paths that made no sense, and some interesting paths that made unintended sense.
... The next iteration built up a more discrete map by scoring proximity of unique words in sentences and inclusion in sentences together. Again, the idea was to build a simple statistical map free of any context, simply to get a sense of pure lexical association.
Why a tree? Language and genealogy seem to have a common thread. Meaning is like genetics. Language is expressive. Information is a kind of tree whose branches grow as reality elaborates and past events accumulate. New terms need to be invented for the dynamics we perceive in reality, just as new names are given to individuals as they emerge into the world. Patterns, continuity, periodicity. Such things lie at the heart of material existence and provide the hooks for consciousness itself. Information theory is the next great frontier, along with particle physics. Already they have converged and diverged and converged again. And playing with artificial trees turns out to be a lot of fun.
As for the "Meme Tree" program
The theory is that the internal consistency of these various lexical maps should roughly reflect many aspects of associative meaning. You could think of the statistical map as a Godelian bubble whose "truth" - if you will - is imposed by the laws governing the statistical associations. We don't derive the laws of language and meaning from these exercises, but we create an internally-complete map that reflects something about the nature of meaning.
There is a practical aim as well. If you can derive the strength of equivalence and the various levels and colors of associative meaning you could in theory build a "Truth Machine" capable of answering any question with a high degree of accuracy. The result of any question could be computed as any other information retrieval problem would be.
I never got around to having my little Meme Tree programs scrape the internet for random sentences. However, this should be a very simple thing to do. Google has had programming contests in the past - programs that use the Google database in interesting ways. Statistical analysis of language is basically what they do. Research projects on their data could provide stunning insights into the nature of information itself, its relation to language and to reality, and likely into our very nature as linguistic beings.
-- thinkyhead software and media
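The first version of the program described above (a record of which words precede and follow each word, plus paths through that map) might look roughly like the sketch below. It is a reconstruction from the description only, not the original "Meme Tree" code; the function names and the rule of always taking the alphabetically first follower are assumptions made to keep the example deterministic.

```python
from collections import defaultdict

def build_maps(sentences):
    """Record, for every word, the set of words seen right after and right before it."""
    followers = defaultdict(set)
    predecessors = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            followers[left].add(right)
            predecessors[right].add(left)
    return followers, predecessors

def walk(followers, start, max_steps=10):
    """Follow the alphabetically first follower until a dead end, a repeat, or max_steps."""
    path = [start]
    seen = {start}
    while len(path) <= max_steps:
        options = sorted(followers.get(path[-1], ()))
        if not options:
            break  # dead end: nothing ever followed this word
        path.append(options[0])
        if options[0] in seen:
            break  # the path has looped back on itself
        seen.add(options[0])
    return path

followers, predecessors = build_maps(["the cat sat", "the dog sat down"])
print(walk(followers, "the"))
```

Picking a random follower instead of the first one gives the wandering, sometimes accidentally meaningful paths the comment mentions.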
Link on front page of bbc.co.uk - bbc.co.uk/voices/ - their attempt at tracking accents and dialects across the UK.
Just a month ago I finished a paper exploring using Google counts in great detail for language analysis and other forms of meaning extraction.
"Automatic Meaning Discovery Using Google": http://arxiv.org/abs/cs.CL/0412098/
Comments welcome, -Rudi.
English isn't my first language, so I often use Google to verify the use of an expression by comparing the number of hits I get for various forms, or as a "spell-checker" by using Google "Did you mean" suggestions to correct my spelling mistakes.
Lately, I find that some mistakes have become so "popular" that I can't do this anymore, because Google now recognizes the mistake as a "valid" search word.
Please refrain from trademarking your 'unique' spelling of intellectual. Thank you.
Those expressions are then used by native-speaking politicians and are broadcast by television.
Dude, it's worse, the French have already infiltrated as far as the advertising business and are using covert channels to spread some dangerous crack I heard was called La Liberte
http://french.about.com/b/a/081281.htm
Slightly more seriously
Apart from pointing out that your use of the word native is rather presumptive of geographic origin in this big wide internet thing, I wonder if this linguistic adoption is more one way towards English since the internet. OK the French got Le Weekend, and tons of anglicised nouns, tried to ban them all and didn't manage. But I read Friday that a British pilot training firm lost a contract to a French one. The reason cited by the Asian airline was that, whilst the training had to be in English, the French trainers spoke better, clearer, more intelligible English than did the English. I can't argue with that. Sadly.
lol @ anonymous cowards
lol lol lol
My question to all: I ask you, master linguists and computer scientists, how far are we from having self-forming dictionaries based strictly on cached Google data?
two years?
-o- Geoff Peters
I have just read the above and I must admit it: I am teh lame, amn't I?
Sincerely,
Pan Tarhei Hosé, PhD.
"Homo sum et cogito ergo odi profanum vulgus et libido."
I am American but have to write in Japanese for work. No matter how much you learn in school, when you write in a foreign language, you'll hit a point of wondering if what you wrote is how native speakers say something or is even understandable. Whenever I hit a point like that, I put the sentence in question (or key fragments thereof) into a Google search. If nothing comes up, I know I have to rewrite. If only a few links come up, I know what I wrote might be a little weird, but is at least understandable. If I get pages and pages of links, I'm golden.
I use search engines all the time for linguistics research: when I'm reading or translating from one language to another, and I run into an odd usage, I just type the phrase in the magic box and *poof*, I get hundreds of contextual examples. Likewise, if I'm writing in a foreign language, and I need to know if a preposition or a construction is correct (and not simply words), again all I have to do is type it in and see what comes out.
Measuring how the internet changes world languages is only a small part of what the 'net offers those interested in linguistics and linguistic usage. Most of the web data archived on Google does not consist of ROTFLMAOs and pwn3ds; it consists of everyday usage, and a good deal of that is from the last decade. Much of linguistics deals precisely with that: how the language is used on a daily basis. That's also how dictionaries come about: they're *descriptive* accounts of usage (which is why the high school journalistic trick of beginning an article with "Webster's defines fistula as..." doesn't work. Dictionaries don't lay down the law, they describe it).
Of course, some people have been arguing that this gives room for errors and abuses. Of course it does! Just 'cos something doesn't play by the rules doesn't mean it's not in common usage. And just because people don't follow rules of orthography, grammar and style doesn't excuse us from teaching these things, or trying to follow them. After all, language is about communication, and these corruptions hinder our ability to communicate, especially communicating complex thoughts.
So yeah, "to impact" is to make an active verb out of a passive participle, and "to impinge" should be used instead. There are plenty of uses of "bonified" out there. Google finds about 20,000 such occurrences. That doesn't make it correct. Nor does that make Google's suggested correction "bonafide" correct either (306,000 occurrences). The correct spelling is *bona fide* (1,050,000 occurrences).
And don't worry too much about purely textual forms appearing in speech. LOL is just this decade's SOB. A spoken "I R0XX0R, J00 5UXX0R" shouldn't alarm us too much when we consider all those medical shows where doctors run around yelling "Get me a boron enema STAT!", pompous academics actually say "such economic perturbations may affect the governance of a certain cryptodictatorship, VIZ the United States", and we all drop down to the pharmacist to "Fill an RX", all spoken forms of what are written Latin abbreviations (statim -- immediately, videlicet -- that is, Rx -- Respondeo, although some classicists may insist it's the symbol for Jupiter).
One linguistic area that is interesting is the gradual adoption of worldwide slang. We hear Americans these days using terms like "Bog Standard" and "Arsed".
What's the point of this rant? Teh intardnet is a great resource for linguistic usage, beyond the navel-reflection of IT professionals. Disciplines like linguistics deal in examples of usage, and the internet is a great stockpile of everyday language. Descriptive grammar and descriptive dictionaries are not an excuse for ignoring arbitrary rules. Most of the linguistic phenomena we see with internet usage are not new.
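To put the three "bona fide" counts quoted in this comment side by side, here is a small worked calculation. The counts are copied from the comment as given; the function name is made up for illustration.

```python
def variant_shares(hits):
    """Return each spelling's share of the combined hit count."""
    total = sum(hits.values())
    if total == 0:
        raise ValueError("no hits at all")
    return {variant: count / total for variant, count in hits.items()}

counts = {"bonified": 20_000, "bonafide": 306_000, "bona fide": 1_050_000}
for variant, share in variant_shares(counts).items():
    # Roughly 1.5%, 22% and 76% respectively.
    print(f"{variant}: {share:.1%}")
```

Here the most frequent form happens to be the standard one, but as the comment argues, a share describes usage and does not by itself settle which spelling is correct.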
Alot, alot, alot, alot...
I really, really hope that 'alot' will never become accepted usage. But its use seems to be growing... a lot.
Slightly off-topic, but does anyone know where I can download (or buy) a font that uses the letters for KK Phonics?
The IPA phonic set is widespread and available from many sources, but I'm having a hard time finding one for KK Phonics.
Most dictionaries show pronunciation keys in both, but IPA seems to be more popular currently.
Excellent points. Linguists study language as it is used not as it is prescribed.
operating system
linux kernel
free software
And citations linked to those pairs such as:
Linus torvalds as the moving force behind the operating system that is reshaping the computing industry.
Andrew tanenbaum has been derided for his heavy hand and misjudgements of the linux kernel such a reaction to tanenbaum is unfair.
Respect for richard stallman's contributions to the free software movement and consider him the real pioneer in the field but I believe...linus who has turned that dream into the beginning of a reality by bringing...next level.
IMHO, as client-end data analysis gets more sophisticated (and increased broadband use allows for quicker web data mining), linguistic tools on the desktop can leverage the raw data on the web to do some pretty interesting things.
Really, the web is the largest corpus out there. Using Google is just a great way to get it down to a manageable size.
Adding or changing characters in a literal string seems like misquoting. Traditionally in handwritten work the comma went almost directly under the quotation mark. When people shifted to typewriters and then computers, an arbitrary choice was made to put the comma first. Most programmers I meet seem to have reversed that choice.
This post written under Gentoo-linux with an SCO IP license.
Let me pat myself on the back here .... you know how the Breen site has a link that points to Google images? It was me who suggested that to Prof. Breen a few years back. I hated finding words in Japanese on the site that meant what I wanted to say but either turned out to be obscure and never used, or actually have different meanings entirely (think of English, how lie and lay sound the same, but are very different in meaning.) If no images come up, the word is something no-one ever uses and if the images are all wrong for the meaning I want, I try another word till the images are right. It was me who also got the katakana for Lucy Liu's name listed on the site. I'm still waiting for her to be so impressed by my effort on her part that she turns up at my door to be mine ... LOL, I think it'll be a long wait ...
There is also http://www.ethnologue.com/, which keeps track of over 6000 human languages.
I don't think there's anything unbelievable about it. I'm a non-native English speaker and I often find other non-native speakers easier to understand than native speakers. Native speakers speak faster, and they often employ more subtle distinctions between different sounds, which non-natives have difficulty hearing or reproducing accurately. Then there are dialect issues: some native speakers are near-incomprehensible even when they attempt to mimic 'standard' BBC English. Another factor is that a non-native speaker may have a more limited vocabulary. Of course, in all this I assume that the non-native speaker is close to fluent, but even a strong foreign accent can be surprisingly easy to decipher for another foreigner.
This is misleading in suggesting that LanguageLog is limited to English. Actually, it deals with all sorts of linguistic topics and languages.
[Esperanto is] a joke. Latin wi' t' grammar took out.
There is no language with the "grammar took out". Every language has a grammar. Some have "fusional" morphology like Latin and Greek, with multiple meanings in a given affix; some have "agglutinative" morphology like Turkish and Esperanto, with simpler affixes stacked in a word; and some have more "isolating" morphology like Toki Pona, Chinese, and (to an extent) English, with each word being an independent unit to a large extent. Over time, isolating languages become agglutinative, agglutinative languages become fusional, and fusional languages become isolating.
I have an undergrad degree in Linguistics (U of MN back when they actually had a Linguistics Department), and would like to point out a few things that probably need to be considered when doing this kind of research (in no particular order).
1. Dialect/Sociolect/Idiolect - What may be acceptable for a black Kentucky high school girl to say to her peers may never be uttered by a white 50 something banker from Seattle. And people have various individual language-use "foibles" that can throw off a study (this is called an idiolect). So, I think this can tend to ignore just how acceptable saying a thing may be. Just because *someone* *somewhere* utters a phrase does not endow it with meaning or acceptability.
2. Medium of communication - I agree with others here that the Internet is somewhat unique as a form of communicating. It is very interesting to me to observe that we seem to be seeing the rise of "written only" phrases. Obviously, written language evolved from spoken, but this would seem to be a different creature. Will "LOL" or "ROFL" ever really come into spoken usage? (I, for one hope not, but that's another topic). If those phrases do come into common usage, is this a new phenomenon? I can't think of other examples of this off the top of my head. Are there other ways in which this medium influences communication?
3. Knowledge of the speaker - The Internet provides this "body of data" but how much data does it provide about the speakers (writers)? I suspect that it generally does not provide much. Look at the average Slashdot (or other) posting. You may or may not have information about the poster (I always post as AC), and that information may or may not be factual. How can you know that you're studying the writing of someone from America or Britain or India?
4. Orthography - It is very common for orthography to be ignored by phonologists. Historical Linguists often rely on orthography to trace a word's history. How valuable is orthography? This kind of searching gives it ultimate value, which could be dangerous. It can also leave out attempts at simplification. If I search for "laughing out loud" but don't search for "lol" am I really getting everything I am looking for? How do I know every phrase to search for? I think it's interesting to see how ppl simplify orthography for the sake of rapid (or easier?) communication. And it seems the internet goes for a more phonological (and intuitive) orthography with usages like "r u" for "Are you", etc.
I'm sure there are more aspects to consider, but this is a start. This is an interesting direction for research, but I think it is more thorny than the article lets on.
--Jonathan
CBS News Sunday Morning ran a piece today on BuzzMetrics http://www.buzzmetrics.com/, a data mining company that uses Google, among other tools surely, to dig through blogs, forums, etc. to find out what people are saying about particular companies or products. It was interesting that their analysts' job included not only the data mining but helping their customers make sense of the way they said it for incorporation into their marketing campaigns.
A user signing as phaln on Slashdot today remarks, apropos of a comment exchange about using the entire web as a corpus (the way we often do here at Language Log Plaza), which led to some comments on the sort of random slangy stuff on the web that might make that a bad idea for grammarians seeking information about English:
It came to me that the English language was in deep trouble when people started saying "rotfl" and "lol" in person.
Now, the user is being humorous, of course. But it is remarkable how often people say this sort of thing. It reaches newspaper columns and magazines as well as everyday conversations about language ("Oh, you're a linguist? What do you think about the way Internet slang is changing the language?"). I've heard a half-hour radio discussion about it on the BBC World Service (in the middle of the night; it was a real yawn, a perfect fix for my insomnia). It seems likely that at least some people really do think English might be altered radically by the intrusion of email abbreviations for phrases like "[I'm] rolling on the floor laughing" or "[I'm] laughing out loud" into regular spoken English.
Don't worry. Nothing radical or even slightly significant will happen. Suppose, say, "rotfl" (pronounced "rotfull") became quite common in speech (which seems unlikely, since if your interlocutor falls down and rolls on the floor laughing it generally needs no comment; but maybe as a metaphor, or on the phone). What would have changed? One interjection (a word grammatically like "ouch") added. Total effect on language: utterly trivial. Not even noise level. Interjections are so unimportant to the fabric of the language that they are almost completely ignored in grammars. There's almost nothing to say. They have no syntactic properties at all -- you pop one in when the spirit moves you. And their basic meaning is simply expressive of a transitory mental state ("Ouch!" means something like "That hurt!"). Don't worry about English. It will do fine. Not even floods of email-originated phrases entering the lexicon would change it in any significant way. If phaln were to suggest such a thing seriously I would be LOL.
From: http://itre.cis.upenn.edu/~myl/languagelog/archives/001829.html#more
Also, for anyone interested, Pullum's crusade against Dan Brown is simply delightful. A good entry about it (Pullum posts about Dan Brown all the time):
http://itre.cis.upenn.edu/~myl/languagelog/archives/001628.html
It's troubling to read so many comments that worry that the linguistic researchers will find "bad language", and worse, that people have moderated such comments up. It reflects a misunderstanding of what linguists do: they want to get a description of the language as it is used, and as it changes, and historically speaking, usages that start in the gossip of teenage girls often become mainstream a couple of generations later. They need it all, and they probably need the crappy stuff most of all, because it is closer to spoken English.
Backlash you say...
But employers have always been the driving force behind stiff formality, which is the antithesis of useful productivity. They are the people that have so many of us trussed up in those idiotic suits and ties all day--and what good has that ever done anyone, outside of the textile industry?
It's similar but even worse in Japan, where employers train all their new incoming young workers in an entire dialect. That's the "keigo" or polite speech forms, a swampland of gratuitous linguistic complexity that annoys native and foreigner alike. Only the moldiest old hyperconservative codgers keep it alive--because they have the money. Everyone in their right mind shuns keigo, until faced with the job hunt.
Standardizing languages somewhat does make sense. But letting the [money-minded but otherwise mindless PHB] employers drive the process is hardly likely to yield an improved product for the rest of us to live with. The "back" in that "backlash" is more like the "back" in "backwards".
I can help you there, since I do understand the difference. But I'd really rather not...
Methinks you aren't just annoyed by the really-versus-real dichotomy. There are loads of word-pairs like that in English, differentiated only by the fact that the adverbial form ends in -ly while the adjectival form does not.
But when you start learning other languages, you find that not all of them have this Nazi attitude [am I comparing you to Hitler already? :-)] about rigidly differentiating their adverbs from their adjectives. I mean heck, all you are trying to do is modify the next word; who needs a whole separate form if that next word happens to be a verb versus a noun?
Didn't some programming languages learn from this problem a while back? Let's hear it for operator overloading.
Features like that -ly fascism add needless complexity to a language. For example, think of that craziness about associating gender with all nouns in the Romance languages. [Who isn't annoyed by that??] Features that maintain needless complexity are the first things to be discarded when language reform or evolution comes along--and we all say good riddance.
Umm, don't you think it is more likely that the "people with too much time on their hands" are the ones who have time to look up what is proper usage?
:-)
"Fracking", you say. Yeah, I sometimes don't have time to look up the proper spelling of a word either.
Cheers
Glanced over your paper, nice work; thanks for making your implementation available, I'll have to try this sometime.
Very interesting conversation! I really don't want to criticize anybody's comment... Well, say a 25th-century person read our messages now; what do you think he would say about the language we use?! He would say we speak like morons & have no respect for the proper form of words, hahaha. Now ask a modern British man what he thinks about the way Americans speak, hahaha, or ask an American what he thinks about the way the Irish speak... Geographical, political/economic & time factors play the key role. Besides, the main tendency of any language is to simplify itself as it develops. The Internet does the greatest job in this sense, 'cause it globalizes every new thing we get. It mirrors everything. So it makes sense to use it for linguistic analysis. Also, I think it's a little bit too late to be worried about preserving "the pure inglish language". Things have gone way too far. There are millions of people who learnED English & hold a much stronger position in English-speaking society than many "native English speakers". They also speak Chinese, Japanese, Hawaiian, Russian, German, French, Spanish, Italian... They grow up & "launch a Google". And their children will grow up bilingual as well. & No matter what you say about pure, great English... they'll keep chatting during expensive American school classes & use all of them: "U, cuz, gotta, thanx, ru, lol, etc..." & form the modern language. But this is not like a native speaker writing "GESS HOW I SORE" (from one of the comments here, ahaha). Do you feel it?! They are the future of the world. & Many of them start their English with Slashdot. Good for them. Good for Slashdot. Sooner or later English will be strongly influenced by other languages, and its use will get simplified as well.
I don't think this has been established one way or another. Linguistic complexity would be an incredibly difficult term to objectively quantify.
Humorless sig goes here.
For those who are interested, another great linguistics/language (English, mostly) site is Literal Minded. He gets into backformation and all sorts of weird language phenomena. The author often uses Google to justify (or disprove) his theories. And no, I'm not him... just an avid reader!
In my linguistic school I was told many times that this tendency does take place. And that's why we often pronounce one way & write the other. The written form is more fixed. Oral speech changes constantly & mirrors all the factors which influence its development. Next the written form gets transformed, catching up with the oral. Say gorgeous. // splendidly or showily brilliant or magnificent // From gorge, gorget. Etymology: Middle English gorgayse, from Middle French gorgias elegant, from gorgias wimple. Now let's see... Middle English gorgayse & modern gorge. What looks simpler?
gorgeous
Etymology: Middle English gorgayse, from Middle French gorgias elegant, from gorgias wimple, from gorge gorget
: splendidly or showily brilliant or magnificent
Shouldn't you be comparing gorgeous (modern) and gorgayse (Middle English)? It seems that, from earliest to most modern, it went
gorget -> gorgias -> gorgayse -> gorgeous
No?
I remember from my linguistics classes that it's not really possible to say which language is more complicated than another.
Humorless sig goes here.
Text Compression as a Test for Artificial Intelligence, 1999 AAAI Proceedings. Matt Mahoney shows that text prediction or compression is a stricter test for AI than the Turing test. (1 page poster, compressed Postscript).
Seastead this.
OK, gorgayse came from gorget & turned into gorgeous, which is pronounced differently than it is spelled (base: gorge; same as fame-famous). So this base got simplified as time went by & got the modernized ending -ous. P.S.: synthetic languages are for sure more complicated. I'd rather die than learn one.
Well, English has always had a problem with the spelling not being quite consistent with the actual pronunciation. I'm not sure this proves the point one way or the other though.
I agree with you on the synthetic languages. On the other hand, it could be argued that the complexity in syntax is just moved into the morphological level.
Humorless sig goes here.