Slashdot Mirror


Using The Web For Linguistic Research

prostoalex writes "The Economist says linguists are gradually adopting the World Wide Web as a useful corpus for linguistic research. Google is used, among other resources, to research how the written language evolves and how some non-standard examples of usage become more or less acceptable (The Economist quotes the phrase 'He far from succeeded,' where 'far from' is used as an adverb). LanguageLog is a resource linked in the article, where linguists discuss current peculiarities of the English language."

30 of 205 comments

  1. They should probably avoid Slashdot by Peter+Cooper · · Score: 4, Funny

    It's probably a good thing that they steer away from Slashdot as a corpus of English usage. Or, should I say, in SOVIET RUSSIA it's best Slashdot stays away from THEM! Or is it that only old people use the Internet as a corpus of the English language while pouring hot grits down a naked and petrified Natalie Portman's pants?

    1. Re:They should probably avoid Slashdot by mizhi · · Score: 2, Interesting

      Hopefully, they'll harvest well-written webpages for data and not those of 13-year-old girls drooling over Orlando Bloom, AOL users, or porn sites.

      Actually, I take that back.

      It could actually be very interesting from a lexical or morphological point of view. Consider the phenomenon of abbreviating words, such as "u" for "you" or "ur" for "you're" or "ru" for "are you." Language teachers in classrooms have been seeing it crop up in actual homework assignments. While reading such language may be like having glass wiped across the eyes of people educated before computers came into widespread use, it's interesting how it's affecting younger people.

      There's a collision between the high-tech world children grow up in today and the way language is taught in schools, much like the gap between how students speak on the street and how they are expected to speak in the classroom or the professional world. Remember when it was proposed that Ebonics be considered a valid dialect for use in the classroom?

      What would be even more interesting to study is how keyboards affect the structure of languages. It seems that people are under the assumption that languages are static and don't change, but this is incorrect.

      Because the keyboard is still the main way of inputting information into the computer, people take shortcuts, and I would be surprised if that didn't start to affect their use of language in other contexts.

      I'm just rambling, but such studies would be akin to sociological studies that look at the influence of technology on social organization.

      --
      Humorless sig goes here.
    2. Re:They should probably avoid Slashdot by Joe+Tie. · · Score: 2, Interesting

      Because the keyboard is still the main way of inputting information into the computer, people take shortcuts

      One thing that's always been at the front of my mind: why aren't these kids learning how to type? Or at least to type with any reasonable amount of skill. The only computer I had as a child was a Commodore 64, and I was still faster than most of today's youth even with their abbreviations. I was somewhat lucky in that our schools somehow foresaw the advent of the home computer and made sure we knew how to type, but I'd certainly hope that held even more true in today's schools!

      --
      Everything will be taken away from you.
  2. Indeed by Pan+T.+Hose · · Score: 4, Funny

    Indeed what their sayin is true. U can learn English very well, especially grammer readin /. frist psots. Teh intarweb seems to certainly kick arse for that sorta research. Very 1337 articel. Thx d00dz.

    --
    Sincerely,
    Pan Tarhei Hosé, PhD.
    "Homo sum et cogito ergo odi profanum vulgus et libido."
  3. I rue the day... by sandstorming · · Score: 3, Funny

    When we might actually say words like 'lol' out aloud. Imagine a deal going down between two mining companies and the CEO of one company with a straight face, and deadly serious demeanour saying to the cameras: "Despite many thinking we pwned them in the deal, we believe it came out leet for every1"

    1. Re:I rue the day... by Peter+Cooper · · Score: 3, Interesting

      When we might actually say words like 'lol' out aloud.

      I've heard it done. I've also heard 'roffle' (an attempt at pronouncing ROTFL I guess). Bizarre, really, since those terms are attempts to turn physical real-life actions into a verbal-only form.

    2. Re:I rue the day... by JustKidding · · Score: 2, Informative

      You may be unaware that "lol" actually is a correct word in the Dutch language, meaning (having) fun.

      lol (de ~) 1 [inf.] plezier
      (taken from www.vandale.nl, an authoritative Dutch dictionary)

  4. Epiphany by phaln · · Score: 2, Funny

    It came to me that the English language was in deep trouble when people started saying "rotfl" and "lol" in person. There seems to be kind of a backlash brewing though, with improved email composition styles dictated by employers, and such.

    --
    SNACKS ARE AWESOME
  5. Google does it again by vladd_rom · · Score: 3, Interesting

    This is not the first time Google (and search engines in general) has changed how we do things.

    Nowadays copyright holders use Google to search for potential violations of their intellectual property. Plagiarism is easy to detect thanks to Google as well. Instead of using rather expensive systems to search for duplicate work, teachers are now one search away from distinguishing original work from the rest.

  6. *BSD be dyin' by Anonymous Coward · · Score: 2, Funny
    It be now official. Netcraft gots confirmed, dig dis: *BSD be dyin'

    One mo'e cripplin' bombshell hit da damn already beleaguered *BSD community when IDC confirmed dat *BSD market share gots dropped yet again, now waaay down t'less dan some fracshun uh 1 puh'cent uh all servers. Comin' on de heels uh a recent Netcraft survey which plainly states dat *BSD gots lost mo'e market share, dis news serves t'reinfo'ce whut we've knode all along. What it is, Mama! *BSD is collapsin' in complete disarray, as fittin'ly 'esemplified by failin' wasted last in de recent Sys Admin comprehensive netwo'kin' test. Man!

    You's duzn't need t'be de Amazing Kreskin t'predict *BSD's future. De hand writin' be on de wall, dig dis: *BSD faces a bleak future. In fact dere won't be any future at all fo' *BSD a'cuz *BSD be dyin'. Doodads is lookin' real baaaad fo' *BSD. As many of us is already aware, *BSD continues t'lose market share. Red ink flows likes some riva' of blood.

    FreeBSD be de most endangered uh dem all, havin' lost 93% uh its co'e developuh's. De sudden and unpleasant departures uh long time FreeBSD developuh's Jo'dan Hubbard and Mike Smid only serve t'undersco'e da damn point mo'e clearly. Slap mah fro! Dere kin no longa' be any doubt, dig dis: FreeBSD be dyin'.

    Let's keep t'de facts and look at da damn numbers.

    OpenBSD leada' Deo states dat dere are 7000 users uh OpenBSD. How many users uh NetBSD is dere? Let's see. De numba' of OpenBSD versus NetBSD posts on Usenet be roughly in ratio uh 5 to 1. Derefo'e dere is about 7000/5 = 1400 NetBSD users. BSD/OS posts on Usenet is about half uh de volume uh NetBSD posts. Derefo'e dere are about 700 users uh BSD/OS. A recent article put FreeBSD at about 80 puh'cent uh de *BSD market. Man! Derefo'e dere is (7000+1400+700)*4 = 36400 FreeBSD users. Dis be consistent wid de numba' of FreeBSD Usenet posts.

    Due t'de troubles uh Walnut Creek, abysmal sales and so's on, FreeBSD went out uh business and wuz snatchn upside by BSDI who sell anoda' troubled OS. Now BSDI be also wasted, its co'pse turned ova' to yet anoda' charnel crib.

    All majo' surveys show dat *BSD gots steadily declined in market share. *BSD be very sick and its long term survival prospects is very dim. WORD! If *BSD be to survive at all it gots'ta be among OS dilettante dabblers. *BSD continues t'decay. Slap mah fro! Nodin' sho't uh a miracle could save it at dis point in time. Fo' all practical purposes, *BSD be wasted.

    Fact, dig dis: *BSD be dyin'

  7. Be carefull thought... by Anonymous Coward · · Score: 3, Interesting

    There are more non-native speakers on the web than
    native speakers.
    In the European community the native English
    speaking persons are by far a minority. That way
    French expressions are pouring into the language
    in an unstoppable way. Those expressions are then
    used by native speaking politicians and are
    broadcasted by television. That way they enter the
    mainstream of the English language.

    Regards

    1. Re:Be carefull thought... by Spy+Hunter · · Score: 2, Insightful
      You're overdramatizing. This is a process that will take hundreds if not thousands of years, even with technology helping to accelerate it. It's not like we'll wake up 10 years from now with a unified language and forget how to read today's literature!

      By the time we have a unified language, we'll have a whole new set of literature to go along with it. Today's literature will be like ancient Greek literature, and yes, it will only be readable by people with special training. It will need to be translated, just like ancient Greek is today. What's the big deal? The biggest difference is that only one translation would be needed, and therefore all the translation work could be focused on that.

      Furthermore, nobody will be forced to adopt a unified language. It will simply evolve. Words will travel from one language to another. Phrases will creep in from other languages. Languages will become closer, and eventually merge. You can see it happening today; at least the beginnings. It will only continue even faster, as the Internet is here to stay and the growth of the global marketplace shows no signs of slowing.

      Academics care about linguistic diversity in an abstract sense, but normal people really don't. People care about it, but in a much more practical sense of everyday communication. People will accept gradual, evolutionary changes to their language, as long as they can express themselves in a way they like. Academics often fight against change, because their theories were all developed to explain the old ways of doing things. They will fight against language unification; luckily I believe they will not be able to prevent it, or even slow it very much. [Note: this is a gross generalization about "academics", please remember that all generalizations are false.]

      You ask what's so great about a global language? The removal of all language barriers from everything! Duh!

      Maybe you don't personally notice any language barriers right now, but that doesn't mean you couldn't benefit from their removal. Maybe there are some really cool people in China right now doing brilliant work in your field that you just don't know about because it's all in Chinese. Maybe you would benefit from the increased efficiency of a global economy without language barriers. I think it's an indisputable fact that removing language barriers is a great thing.

      --
      main(c,r){for(r=32;r;) printf(++c>31?c=!r--,"\n":c<r?" ":~c&r?" `":" #");}
    2. Re:Be carefull thought... by monecky · · Score: 2, Interesting

      > Academics care about linguistic diversity in an abstract sense, but normal people really don't.

      I think you're a bit wrong on this. There are around 6,800 languages. Most languages have developed their own culture. Do you really think millions of people around the globe would be willing to lose their identity?

      For example, after the collapse of the Soviet Union, Uzbeks started replacing Russian loan-words with the original Uzbek words.

      Paul Rodrigues

      --
      http://jones.ling.indiana.edu/~prrodrig
    3. Re:Be carefull thought... by rob_squared · · Score: 2, Funny

      Don't worry, according to the French we're doing far greater damage to their language and culture.

      --
      I don't get it.
  8. I've used the web for corpus linguistics research by Anonymous Coward · · Score: 2, Informative

    I've used the web for corpus linguistics research. My last big project was to look at a lot of web pages with Mexican and Chilean slang Spanish, and see if there was a difference in vocabulary usage. There was a significant difference; I could, 70% of the time, tell if a given passage was Chilean or Mexican Spanish.

    I could have gotten a higher accuracy rate, but this was just a simple undergraduate project.
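A rough idea of how vocabulary alone can separate the two varieties, as a minimal Python sketch. The post does not say what method was actually used, so everything here is an assumption: the `guess_variety` function, the two tiny marker-word sets, and the sample sentences are hypothetical illustrations, not the poster's data or approach.

```python
# Hypothetical marker words; a real study would learn these from a corpus.
MEXICAN_MARKERS = {"güey", "chido", "órale", "neta"}
CHILEAN_MARKERS = {"weón", "cachai", "bacán", "fome"}


def guess_variety(passage):
    """Guess 'mexican', 'chilean', or 'unknown' by counting marker words."""
    # Lowercase, split on whitespace, and trim surrounding punctuation.
    words = [w.strip(".,;:!?¿¡\"'()") for w in passage.lower().split()]
    mexican_hits = sum(w in MEXICAN_MARKERS for w in words)
    chilean_hits = sum(w in CHILEAN_MARKERS for w in words)
    if mexican_hits == chilean_hits:
        return "unknown"  # no evidence either way, or a tie
    return "mexican" if mexican_hits > chilean_hits else "chilean"


# Made-up sample passages, for illustration only.
print(guess_variety("¿Cachai? Esa película es súper fome."))
print(guess_variety("Órale, qué chido."))
```

Passages containing none of the marker words come back as "unknown", which is one reason a word-list approach like this tops out well below perfect accuracy.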

  9. Non-official English by Anonymous Coward · · Score: 2, Informative

    Unlike French and Italian, there is no official institution that defines 'correct' English. Essentially, the English-speaking world just 'makes it up' as it goes. Thus when I see the adverb 'really' butchered into 'real' I must try not to get annoyed. For example: "It's real hard to use your mother tongue." vs. "It's really hard to use your mother tongue." Please help me here - is the misuse/non-use of 'really' something that's taught in school?

    1. Re:Non-official English by Kafir · · Score: 2, Insightful
      From Merriam-Webster Online:
      real (3, adverb): VERY (he was real cool -- H. M. McLuhan)
      usage Most handbooks consider the adverb real to be informal and more suitable to speech than writing. Our evidence shows these observations to be true in the main, but real is becoming more common in writing of an informal, conversational style. It is used as an intensifier only and is not interchangeable with really except in that use.

      I'd say you're fighting a losing battle on this one. I'm not too bothered by it, either; the English language has other words that function both as adjectives and as adverbs, despite the existence of a distinct adverb form - near dead and nearly dead are both standard, for instance.
  10. 'Language' == spoken || written? by adam31 · · Score: 2, Insightful
    How do you even pronounce 'pwn3d' ? Google is not a tool to study speech patterns, and there's nothing to say that speech even resembles written text.

    The article addresses this in a weird way: it first draws attention to the distinction, but once it reaches its crux, where Google is used as a tool, the distinction is ignored entirely; instead it opts to focus on stranger things.

  11. Re:inner city teens by Kafir · · Score: 3, Insightful

    i countinously question my co-workers (social workers) in telling the youth what is propper and not.

    I'm glad they're telling the youth what is proper; you're clearly incompetent to do so.

    using words... is becoming more than just the normal, it is becoming the standard.

    Is that right? Using words is "becoming more than just the normal"? I've been using words for years now; I'm glad to hear that's becoming the standard. Your post is a perfect example of why people should learn to write in something approaching standard English. Your meaning is barely intelligible, and you sound like an idiot.

  12. Popular usage != wanted usage by KiloByte · · Score: 2, Informative

    Yes, we can record the errors made by the uneducated public (and even those done by, uhm, me). The question is: should we do that or not?

    I was pretty taken aback when a council of linguists in Poland suddenly declared some widely-chastised and not even very popular errors to be valid usage. I've been brought up in circles of people who not only put a lot of stress on the language you use, but also cruelly point out every incorrect word or phrase you use -- and this made me quite intolerant of bad speech.

    Being but a dirty foreigner, I know that my English can sound bad in the ears of native English speakers -- that's why I sometimes ask people to correct me if they spot errors.

    In other words: some people find careless speech repulsive. Thus, we should do whatever we can to promote correct usage as opposed to legalising incorrect uses.

    --
    The creatures outside looked from Alt-Right to Antifa; but already it was impossible to say which was which.
  13. Three types of language by Dracos · · Score: 3, Interesting

    I think that for most of the 20th century, English, like most languages in the industrialized world, was largely static, dominated by the written word, which in turn was dominated by proper grammar. Since WWII, popular culture and faster communications have increasingly exposed us to local vernaculars, mostly through radio and television. The written word lagged behind in its cultural evolution.

    Thanks to the internet (initially email, BBS's and IRC, but more widely known on the Web), we now have a hybrid of the spoken and written word: the "typed word". This form of language evolves at the same rate as the spoken word, and injects its own vernacular as a side effect of the medium: acronym and abbreviation "words" (rofl, how r u), along with common misspellings (pwned), and mixing letters with numbers or punctuation (133t, n00b). All of these serve at least one purpose, whether as a form of super shorthand, an insult, or the appearance of being "cool", or they are merely the result of laziness on the part of the author. Most typed-word terms don't transfer well when spoken.

    One of my hobbies is studying (European) languages and how they are related. Sometimes I worry about the damage the typed word is causing to the spoken and written word (and any proper linguist should at least be interested in the phenomenon). Luckily, most typed word expressions aren't pronounceable, and the ones that are sound absurd, because they are removed from their original context when spoken, and everyone recognizes gibberish when they hear it. How the typed word affects the written word remains to be seen. Yes both are typed now, but only the written word has a chance of going through an editorial process. I think it will take a very long time for the formal lexicon and rules of grammar to embrace, however reluctantly if ever, the typed vernacular.

  14. Google as a grammar checker by Hal+XP · · Score: 2, Interesting

    I've had the chance to use Google as a grammar or style checker in my day job as a glorified copy editor. I type two nearly identical expressions X and Y in the search box. If expression X gets 10,100 hits and expression Y only 500 hits, I use expression X.

    For example, as a non-native speaker, I found myself waffling between the expression (A) "run for mayor of" and the expression (B) "run as mayor of." Letting Google arbitrate, I found 14,900 hits for (A) and only 200 hits for (B). I chose (A).

    I discovered there's practically a dead heat between the expressions "a new lease on life" (which, if I'm not mistaken, is the expression favored by American usage) and "a new lease of life," with the latter nosing out the former 144,000 hits to 140,000. In this instance I let my own usage arbitrate. Since I'm more exposed to American than to English, I chose on.
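The hit-count comparison described in this post can be sketched in a few lines of Python. This is only an illustration: the counts are the ones quoted in the post, typed in by hand rather than fetched from any search API, and the `pick_variant` helper and its `min_ratio` threshold are hypothetical names invented for this example.

```python
def pick_variant(hits, min_ratio=2.0):
    """Return the phrase with the most hits, or None if it is too close to call.

    hits: dict mapping each candidate phrase to a hand-entered hit count.
    min_ratio: hypothetical threshold; the winner needs at least this many
               times the hits of the runner-up to count as decisive.
    """
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    if second_hits == 0 or best_hits / second_hits >= min_ratio:
        return best
    return None  # near dead heat: fall back on your own usage


# Counts as quoted in the post.
mayor = {"run for mayor of": 14_900, "run as mayor of": 200}
lease = {"a new lease on life": 140_000, "a new lease of life": 144_000}

print(pick_variant(mayor))  # run for mayor of
print(pick_variant(lease))  # None
```

The ratio test is one possible way to encode the "dead heat" case; where to set the threshold is a judgment call, not something the hit counts themselves can settle.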

    --
    I'm a sci-fi vegan: I don't want the aliens to think we have as much right to live as the fried chickens we eat.
  15. Reminds me of "Meme Tree"... by Slur · · Score: 3, Informative

    ...which was this little program I wrote around the nascence of the internet. It took any sentence as input and kept a record of which words preceded each word, and which words followed each unique word. The idea was to build up a simple map of which words could precede or follow others completely without context. From this you could follow paths that made sentences or paths that looped forever, or paths that made no sense, and some interesting paths that made unintended sense.
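A map like the one described here might look roughly like this in Python. The original program is not shown in the post, so this is a guess at its shape: `build_maps`, `walk`, the sample sentences, and the rule of always taking the alphabetically first follower are all assumptions made for illustration.

```python
from collections import defaultdict


def build_maps(sentences):
    """Record which words were seen directly after and directly before each word."""
    followers = defaultdict(set)
    predecessors = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            followers[left].add(right)
            predecessors[right].add(left)
    return followers, predecessors


def walk(followers, start, max_words=10):
    """Follow one path through the map, taking the alphabetically first follower.

    Stops at a dead end or after max_words words, since paths can loop forever.
    """
    path = [start]
    while len(path) < max_words and followers.get(path[-1]):
        path.append(min(followers[path[-1]]))
    return " ".join(path)


followers, predecessors = build_maps(
    ["the cat sat on the mat", "the dog sat on the cat"]
)
print(sorted(followers["the"]))     # ['cat', 'dog', 'mat']
print(sorted(predecessors["sat"]))  # ['cat', 'dog']
print(walk(followers, "dog", max_words=6))  # dog sat on the cat sat
```

The last line shows the looping behaviour mentioned in the post: the path circles back through "sat" and only ends because of the word limit.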

    Why a tree? Language and genealogy seem to have a common thread. Meaning is like genetics. Language is expressive. Information is a kind of tree whose branches grow as reality elaborates and past events accumulate. New terms need to be invented for the dynamics we perceive in reality, just as new names are given to individuals as they emerge into the world. Patterns, continuity, periodicity. Such things lie at the heart of material existence and provide the hooks for consciousness itself. Information theory is the next great frontier, along with particle physics. Already they have converged and diverged and converged again. And playing with artificial trees turns out to be a lot of fun.

    As for the "Meme Tree" program ... The next iteration built up a more discrete map by scoring proximity of unique words in sentences and inclusion in sentences together. Again, the idea was to build a simple statistical map free of any context, simply to get a sense of pure lexical association.
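One way to picture this second iteration is a co-occurrence score that rewards words appearing close together in the same sentence. The post does not give the actual scoring rule, so the 1/distance weighting below, the `association_scores` name, and the sample sentences are assumptions chosen to keep the sketch small.

```python
from collections import defaultdict
from itertools import combinations


def association_scores(sentences):
    """Score each unordered word pair by how often and how closely it co-occurs.

    Every time two different words share a sentence, the pair gains
    1 / (distance in words), so adjacent words gain 1.0 and distant ones less.
    """
    scores = defaultdict(float)
    for sentence in sentences:
        words = sentence.lower().split()
        for (i, a), (j, b) in combinations(enumerate(words), 2):
            if a != b:
                scores[tuple(sorted((a, b)))] += 1.0 / (j - i)
    return scores


scores = association_scores(["the cat sat", "the cat ran"])
print(scores[("cat", "the")])  # 2.0
print(scores[("sat", "the")])  # 0.5
```

Pairs are stored in sorted order so that ("cat", "the") and ("the", "cat") share one score, which matches the context-free, order-free association the post describes.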

    The theory is that the internal consistency of these various lexical maps should roughly reflect many aspects of associative meaning. You could think of the statistical map as a Godelian bubble whose "truth" - if you will - is imposed by the laws governing the statistical associations. We don't derive the laws of language and meaning from these exercises, but we create an internally-complete map that reflects something about the nature of meaning.

    There is a practical aim as well. If you can derive the strength of equivalence and the various levels and colors of associative meaning you could in theory build a "Truth Machine" capable of answering any question with a high degree of accuracy. The result of any question could be computed as any other information retrieval problem would be.

    I never got around to having my little Meme Tree programs scrape the internet for random sentences. However, this should be a very simple thing to do. Google has had programming contests in the past - programs that use the Google database in interesting ways. Statistical analysis of language is basically what they do. Research projects on their data could provide stunning insights into the nature of information itself, its relation to language and to reality, and likely into our very nature as linguistic beings.

    --
    -- thinkyhead software and media
  16. BBC voices by matt+me · · Score: 2, Informative

    Link on front page of bbc.co.uk - bbc.co.uk/voices/ - their attempt at tracking accents and dialects across the UK.

  17. Done: nous sommes desolés que notre president by new500 · · Score: 3, Insightful

    . . .

    Those expressions are then
    used by native speaking politicians and are
    broadcasted by television.


    Dude, it's worse, the French have already infiltrated as far as the advertising business and are using covert channels to spread some dangerous crack i heard was called La Liberte :

    http://french.about.com/b/a/081281.htm

    Slightly more seriously :

    Apart from pointing out that your use of the word native is rather presumptive of geographic origin in this big wide internet thing, i wonder if this linguistic adoption is more one way towards English since the internet. OK the French got Le Weekend, and tons of anglicised nouns, tried to ban them all and didn't manage. But i read Friday that a British pilot training firm lost a contract to a French one. The reason cited by the Asian airline was that, whilst the training had to be in English, the French trainers spoke better, clearer, more intelligible English than did the English. I can't argue with that. Sadly.

  18. Writing in Japanese by minairia · · Score: 3, Insightful

    I am American but have to write in Japanese for work. No matter how much you learn in school, when you write in a foreign language, you'll hit a point of wondering if what you wrote is how native speakers say something or is even understandable. Whenever I hit a point like that, I put the sentence in question (or key fragments thereof) into a Google search. If nothing comes up, I know I have to rewrite. If only a few links come up, I know what I wrote might be a little weird, but is at least understandable. If I get pages and pages of links, I'm golden.

  19. Linguistics 101 by DingerX · · Score: 2, Insightful

    I use search engines all the time for linguistics research: when I'm reading or translating from one language to another, and I run into an odd usage, I just type the phrase in the magic box and *poof*, I get hundreds of contextual examples. Likewise, if I'm writing in a foreign language, and I need to know if a preposition or a construction is correct (and not simply words), again all I have to do is type it in and see what comes out.

    Measuring how the internet changes world languages is only a small part of what the 'net offers those interested in linguistics and linguistic usage. Most of the web data archived on Google does not consist of ROTFLMAOs and pwn3ds; it consists of everyday usage, and a good deal of that is from the last decade. Much of linguistics deals precisely with that: how the language is used on a daily basis. That's also how dictionaries come about: they're descriptive accounts of usage (which is why the high school journalistic trick of beginning an article with "Webster's defines fistula as..." doesn't work. Dictionaries don't lay down the law, they describe it).

    Of course, some people have been arguing that this gives room for errors and abuses. Of course it does! just 'cos something doesn't play by the rules doesn't mean it's not in common usage. And just because people don't follow rules of orthography, grammar and style doesn't excuse us from teaching these things, or trying to follow them. After all, language is about communication, and these corruptions hinder our ability to communicate, especially communicating complex thoughts.

    So yeah, "to impact" is to make an active verb out of a passive participle, and "to impinge" should be used instead. There are plenty of uses of "bonified" out there. Google finds about 20,000 such occurrences. That doesn't make it correct. Nor does that make Google's suggested correction "bonafide" correct either (306,000 occurrences). The correct spelling is "bona fide" (1,050,000 occurrences).

    And don't worry too much about purely textual forms appearing in speech. LOL is just this decade's SOB. A spoken "I R0XX0R, J00 5UXX0R" shouldn't alarm us too much when we consider all those medical shows where doctors run around yelling "Get me a boron enema STAT!", pompous academics actually say "such economic perturbations may affect the governance of a certain cryptodictatorship, VIZ the United States", and we all drop down to the pharmacist to "Fill an RX", all spoken forms of what are written Latin abbreviations (statim -- immediately, videlicet -- that is, Rx -- recipe, "take", although some classicists may insist it's the symbol for Jupiter).

    One linguistic area that is interesting is the gradual adoption of worldwide slang. We hear Americans these days using terms like "Bog Standard" and "Arsed".

    What's the point of this rant? Teh intardnet is a great resource for linguistic usage, beyond the navel-reflection of IT professionals. Disciplines like linguistics deal in examples of usage, and the internet is a great stockpile of everyday language. Descriptive grammar and descriptive dictionaries are not an excuse for ignoring arbitrary rules. Most of the linguistic phenomena we see with internet usage are not new.

  20. Programmer grammar by cbr2702 · · Score: 2, Insightful

    Adding or changing characters in a literal string seems like misquoting. Traditionally in handwritten work the comma went almost directly under the quotation mark. When people shifted to typewriters and then computers, an arbitrary choice was made to put the comma first. Most programmers I meet seem to have reversed that choice.

    --
    This post written under Gentoo-linux with an SCO IP license.
  21. Re:inner city teens by chialea · · Score: 2, Interesting

    >His meaning is perfectly intelligible, but some language snobs (very few of whom are actually linguists and know anything much about language) pretend not to be able to understand certain accent/dialects in order to feel superior.

    Incomprehension often has very little to do with that. A friend of mine moved to MA from NC at the same time as I moved from CA. She could not understand most people there, most people there could not understand her. I could, on the other hand, understand both of them. I've been at at least one conference in which two non-native speakers of English could not understand each other at all, and required a native speaker to translate.

    There are simply certain grammatical patterns that I don't understand well, if at all. It has nothing to do with snobbery; I simply can't understand, most likely because I haven't been exposed to it all that much.

    When using media of international exchange, I would certainly try to make myself comprehensible. I spend quite a lot of time trying to do this in my research papers and communication. Writing in unambiguous, grammatically correct English (or something approaching it) is the first step towards sharing ideas with a wide audience. People limit their communication and opportunities by the language they use.

    Lea

  22. It looks like no one read the article by JoeBuck · · Score: 2, Insightful

    It's troubling to read so many comments that worry that the linguistic researchers will find "bad language", and worse, that people have moderated such comments up. It reflects a misunderstanding of what linguists do: they want to get a description of the language as it is used, and as it changes, and historically speaking, usages that start in the gossip of teenage girls often become mainstream a couple of generations later. They need it all, and they probably need the crappy stuff most of all, because it is closer to spoken English.