The Curious Case of Increasing Misspelling Rates On Wikipedia
An anonymous reader writes "The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources. As the saying goes, the more eyes the better. One particular student who was curious about this conducted rudimentary text mining on a sampling of the Wikipedia corpus to discover how misspelling rates on Wikipedia change through time. The results appear to indicate an increasing rate of misspellings through time. The author proposes that this consistent increase is the result of Wikipedia contributors using more complex language, which the test is unable to cope with. How do the results of this test compare to your own observations on the detail accuracy of massively crowd-sourced applications?"
Every web browser as auto spell-check capabilities these days. Most of them correct as you type.
So why should there be any misspellings on something that is managed strictly from a web interface?
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or
is it that the words are simply unknown to the normal spell-check dictionaries?
I find occasional misspellings in mainstream news articles as well (and I am by no means a natural born speller).
But most maddening to me is the "they're their there" errors, and similar wrong word usage.
Spell checkers offer little help in catching these, but a 6th grade education usually suffices.
Maybe the same people who wont waist there time checking they're spelling also cant be bothered to use the write word. ;-)
Sig Battery depleted. Reverting to safe mode.
Whether it's open source software or online collaborative projects, the smart people always get driven away over the long term. Smarter people are usually more interested in creating high-quality content, whereas stupider people end up putting out crap purely for political reasons. Eventually these stupider people start trying to modify the work of the smarter people, but do a poor job at it. When they're called out on their shitty work by the smart people, the fools make a huge stink. This soon devolves into a political mess where the smarter contributor is severely inhibited from contributing by the constant moaning and bitching of the idiots. Not wanting to waste time with such shenanigans, the smarter person leaves for some other endeavor. After a while, many of the smarter people are driven away, and the end result is that the stupider people make up the bulk of the project's contributions.
We've seen this happen with many open source software projects, and I don't think that other kinds of online collaborative projects are any different.
I wonder how many our typos compaired to how many our truely speling mistaks?
I can offer my own opinion of this phenomenon: the bad is driving out the good. Fewer competent writers are bothering to edit Wikipedia articles nowadays. Not only do contributions get reverted / deleted by editors who think they "own" the article, but good writers simply get tired of fixing the semi-literate ramblings of people who cannot write a coherent sentence.
It's the old axiom that incompetent people cannot recognize their own incompetence, and so do not realize that their "contributions" are not improving the article, but instead are making it worse. Eventually the good contributors get tired of sweeping back the ocean with a broom, and just walk away from Wikipedia.
So slashdot has just posted an article about a test where even the test's AUTHOR believes the results are due to shortcomings in the test itself. This has to be the most pointless article I've read in a while...
The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources.
[citation needed]
#DeleteChrome
Here is a typical example:
Person A and B are on the ground floor of some building.
Person A would like person B to have some parcel delivered to the 7th floor of the building.
Here's how person A delivers the request:
"Buddy, please bring this parcel up to the seventh floor, thanks".
I posit that this grammar is wrong. He should say:
"Buddy, please take this parcel to the seventh floor, thanks", because they are in the same area and buddy B, by doing the needful, will be leaving that place.
Worse still, you even hear it in the main stream media.
Other cases:
Folks addressing "data" and "media" as singular! Again, wrong. They should be using "datum" and "medium".
Eye don't no how ewe can automate proof reading. You still knead a human in the loupe.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Unless the article is locked, just fix the spelling errors yourself instead of whining about them getting worse.
If someone is passing you on the right, you are an asshole for driving in the wrong lane.
... and the growth in size of many articles, combined with the limited number of Wikipedia editors, is one possible reason why spelling errors may be on the increase. Also, one form of vandalism is the intentional introduction of spelling errors.
The curious part is not the increase in rates, or that crowdsourcing doesn't solve the problem, but that spell checkers don't solve the problem. What's up with that?
icebike is a victim of Muphry's Law.
Older versions of IE do not have spell check.
TFA measured "Average misspellings as percentage of sampled content", up from 0 in 2001 to over 6% now.
MSIE: The world's most standards-complaint web browser.
Welcome to the future, where text input has become minimalized and marginalized. When half the dictionary has been standardized to 3-letter abbreviations, I don't find it the least bit surprising that spelling and grammar have gone out the window. Maybe this isn't some new symptom of an illiterate world, but rather the written English word evolving in front of our eyes; the technological revolution.
I would guess that this is nothing to do with spellcheckers (which are useful for catching typos, but fairly useless for catching mis-spellings). As this was observed over time, might it not be possible that the decreasing level of literacy may be being exposed by a decrease in the average age of the contributors?
which don't even always follow normal English phonetic conventions
Wait, English has normal phonetic conventions?
I think some of the issue here is that a new generation is showing up with poor literacy skills. The primary schools are under pressure from government-mandated competency requirements, budget cuts, and various other issues, and have cut back on some of the basic skills that were once taught.
I work at a tutoring and assistance center at a college, and it is depressing to see the basic literacy skills of students coming out of high school. Writing skills are nonexistent, to the point where some of them do not even know how to hold a pencil correctly, and unless there is a computer with a spell checker, their spelling is limited to about the 4th grade level.
I have been seeing this for several years now, and these are the people who are replacing the older generation, who did not have computers as pervasive as they are now.
It's sad. Through all this web content, I am slowly unlearning how to spell or use proper grammar.
English teachers / professors (with a few exceptions) used to be my arch-enemies (as a math / science person), and I wished them all a pleasant, if sudden, death for their batshit-insane insistence on making mountains out of molehills (i before e, except after c; can't end a sentence with a preposition; this {subject}) with regard to the language, and yet lately I find myself wishing there were more of them.
It's not fair: I've nursed some of those grudges for years!
I am John Hurt.
I believe this is due to Wikipedia becoming more diverse. As more people learn about Wikipedia, more people contribute to it. Overall it's becoming more accessible to everyone, and therefore everyone is putting in their two cents. Whereas in the beginning you could argue that the population was centered around a niche crowd who were more pedantic than those who just wish to contribute in some form.
Surely one factor is that early in its history much of the content on Wikipedia was copied from a public domain edition (1911?) of the Encyclopaedia Britannica, in which one would expect to find very few spelling errors. Over time more and more of the content is user-generated, so naturally it is more likely to contain typos.
.sig withheld by request
The increase in the percentage of spelling errors is an artifact of his experimental procedure. He takes Wikipedia articles at random instead of analyzing the most popular ones. As Wikipedia has become larger, it has attracted more fringe topics, probably from authors in countries where English is not the first language. Wikipedia now probably has more articles that aren’t viewed and revised as much. Thus, random sampling now has a higher chance of selecting such articles and, with them, more spelling mistakes.
He should change his experiment so that he analyzes the spelling mistakes on the most accessed and modified pages in Wikipedia or discard articles where the activity on the article is below a certain threshold.
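To make the suggestion concrete, here is a rough sketch of the thresholding idea. The record fields (`revisions`, `words`, `misspelled`) and all of the numbers are made up for illustration; nothing here comes from the study's data or code, and the pooled rate is just one possible way to average.

```python
def pooled_misspelling_rate(articles, min_revisions=0):
    """Misspelled words / total words over articles with enough revisions.

    `articles` is a list of dicts with hypothetical keys:
    'revisions', 'words', 'misspelled'.
    Returns None if no article passes the threshold.
    """
    kept = [a for a in articles if a["revisions"] >= min_revisions]
    total_words = sum(a["words"] for a in kept)
    if total_words == 0:
        return None
    return sum(a["misspelled"] for a in kept) / total_words


# Invented sample: two well-maintained articles and one neglected stub.
sample = [
    {"revisions": 500, "words": 1000, "misspelled": 10},
    {"revisions": 3, "words": 200, "misspelled": 30},
    {"revisions": 120, "words": 800, "misspelled": 8},
]

rate_all = pooled_misspelling_rate(sample)
rate_active = pooled_misspelling_rate(sample, min_revisions=100)
```

With these invented numbers the neglected stub dominates the unfiltered rate, which is the effect being described; whether real Wikipedia data behaves the same way is exactly what the modified experiment would have to show.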
After the last time I tried to clean up some grammar and spelling in an article and it was immediately reverted with "didn't cite sources" I gave up.
Occasionally living proof of the Ballmer peak.
0% in 2001? As in, less than a percent by enough that it didn't make sense to round up to 1%? That's a huge increase. I wonder if it's a real increase, or a result of sabotage by either independent malefactors, or by Britannica using an automated approach...
Can you be Even More Awesome?!
That's a myth. If those eyes aren't attached to competent people, having more of them will do no good.
Knowledge is power; knowledge shared is power lost.
Very well put, if I may say so! (forgot to log in)
If common internet usage is any indicator, you eventually will be out of work because so few seem to care about spelling and grammar... so then they'll just let the engineers write the documentation. (Sorry for expressing such a horrid thought so close to a Major Holiday ;) )
/F
Actually I don't really believe that this will happen, but it is a scary thought.
Stupidity... has a habit of getting its way.
We awl noe wot ure saeng no mader howe u spell it.
if your life is such a big joke then why should I care?
This may sound like a get-off-my-lawn type post, but from what I've seen, the writing ability of younger people has severely declined. And it's not even that big a difference in age that I'm talking about here; I'm talking about people less than 10 years younger than me. I "abuse" the language a fair amount myself, but I'm talking about seeing people thinking column has a b in it, and despair doesn't have an e. There are fluctuations in the language that I'm used to, such as the color vs. colour thing, but basic spelling problems that would not be correct in any dialect seem to be pretty common. And of course we have the their vs. there problem.
I see a lot of good comments here, but the fact of the matter is that the novelty of editing has worn off for many of us. In the beginning, when Wikipedia was small, or when it was new, or whatever the reason, it was fun to keep an eye on a few pages. That novelty has worn off, and along with it, any desire to fix the little spelling errors I find along the way.
As a side note, Wikipedia had, at one time, a large number of articles about my profession. None of them was accurate, at least in the US sense of defining many terms, specializations, and equipment. Maybe other parts of the world call things differently, but I doubt to the degree that Wikipedia was wrong. Still, I wasn't about to go re-write and fix links in every article -- even if I would have been able to find sources.
So frankly, I've given up. Yes, I notice spelling errors on Wikipedia. I just read past them. It's not worth fighting with people over and it's not worth my time to fix. My interests lie elsewhere. Sorry, universe.
Also, misspellings and bad grammar on the internet are cool. Just look at some of the drivel published by actual legitimate news sources (AP, I'm looking at you. Would it kill you to spell check an article before posting? I know it wouldn't fix the "their/there" and related problems, but it's a start.)
Actually, it's 0.00. But many of the 2400 articles they sampled were less than 10 years old. Notably, the rate jumps up to 2.58% in 2002, and then continues to climb at a pretty steady 0.365%/year after that, with a slightly higher uptick between 2006 and 2007.
I'm not entirely sure what to take away from that, but it does seem that the more articles WP adds, the less people care about writing them properly.
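As a back-of-the-envelope check, the two figures quoted above (2.58% in 2002, then roughly 0.365 percentage points per year) can be turned into a straight-line projection. This is only the arithmetic implied by those two numbers, not a fit to the study's data, and the function name is made up.

```python
BASE_YEAR = 2002
BASE_RATE = 2.58        # percent, as quoted above
YEARLY_INCREASE = 0.365  # percentage points per year, as quoted above


def projected_rate(year):
    """Straight-line projection of the misspelling rate, in percent."""
    return BASE_RATE + YEARLY_INCREASE * (year - BASE_YEAR)


rate_2011 = round(projected_rate(2011), 3)  # 2.58 + 9 * 0.365 = 5.865
```

The straight line gives a bit under 6% for 2011, so the extra uptick around 2006-07 would have to account for the rest of the "over 6%" figure.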
MSIE: The world's most standards-complaint web browser.
Wikipedia is not immune to entropy.
http://alternatives.rzero.com/
Is it so hard to remember smallest to biggest?
Yes, because it's 2011-12-23 where I am, not 32-21-1102. Days are smaller than tens of days, right? I prefer ISO 8601 because biggest to smallest is "lexicographically monotonic" on any date in the common era, meaning that sorting a set of strings representing any dates since Jesus M. Christ was potty trained gives the same result whether one treats them as dates or as generic strings.
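A quick way to see the sorting property, sketched in Python with a handful of arbitrary dates picked only for illustration: sort the ISO 8601 strings as plain text, sort the same dates as dates, and compare. A day-first rendering is included for contrast.

```python
from datetime import date

# Arbitrary example dates, not taken from anywhere in particular.
dates = [
    date(2011, 12, 23),
    date(2001, 1, 15),
    date(2011, 2, 3),
    date(2007, 11, 30),
]

# date.isoformat() gives zero-padded YYYY-MM-DD strings.
iso_strings = [d.isoformat() for d in dates]
chronological_iso = [d.isoformat() for d in sorted(dates)]

# Plain string sort of ISO 8601 dates matches chronological order.
iso_sort_matches = sorted(iso_strings) == chronological_iso

# The same comparison with a day-first format (DD-MM-YYYY).
dmy_strings = [d.strftime("%d-%m-%Y") for d in dates]
chronological_dmy = [d.strftime("%d-%m-%Y") for d in sorted(dates)]
dmy_sort_matches = sorted(dmy_strings) == chronological_dmy
```

For this sample the ISO strings sort correctly as text and the day-first strings do not, because the string comparison looks at the day digits before it ever reaches the year.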
"was immediately reverted with "didn't cite sources" " - Could you provide a link to the diff? One problem could be a missing edit summary in combination with extensive rearranging of text. In such cases, it is difficult for another editor to see what changed by looking at the diff. If the edit comes from an anonymous IP address and does not have an edit summary, a hurried editor could misunderstand your intentions.
Avantslash: low-bandwidth mobile slashdot.
Part of the problem is the article selection methodology. By pulling random articles, the study author is going to be getting mostly articles that have received little attention, and mostly short articles. (Table 2 and Graph 2 show this very clearly--of the 2400 articles examined, only 14 existed in 2001. Half of them didn't exist until 2007. A quarter were created between 2009 and the present.) It's possible that what has been demonstrated is simply that relatively new articles on relatively unimportant topics tend to be less-well maintained.
The major issue is the corpus used for the study. While a half-million-word dictionary sounds impressive, it's still going to fall down in a couple of key areas. For one, foreign-language terms are likely to be nearly completely unrepresented. For another, a lot of proper nouns are going to be missing. If I write an article about Japanese manga or a Norwegian village, I'm going to be including all kinds of things that an English-language dictionary just isn't going to contain. (Worse, I'll get two misspellings for each Japanese term, since I'll have it in the article with both the original Japanese word plus the romanized transliteration). Another problem area will almost certainly be articles on highly technical topics (molecular biology is full of new and unusual abbreviations).
While certain classes of 'obvious' non-words aren't counted, many will be missed. For example, the article preprocessor filters out percentages, but will pass through numbers followed by the degree symbol (which will show up in scientific and geographic articles).
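To illustrate the kind of gap being described, here is a small sketch of a token filter that drops plain numbers, percentages, and degree readings before spell checking. The pattern and function names are hypothetical; this is not the study's preprocessor, just one way such a filter might be written.

```python
import re

# A number, optionally signed and with decimal or thousands separators,
# optionally followed by '%' or by a degree sign with an optional C/F/K unit.
NUMERIC_TOKEN = re.compile(r"^[+-]?\d+(?:[.,]\d+)*(?:%|°[CFK]?)?$")


def is_numeric_token(token):
    """True if the token looks like a number, percentage, or degree reading."""
    return bool(NUMERIC_TOKEN.match(token))


def spellcheck_candidates(tokens):
    """Keep only tokens that should actually be looked up in the dictionary."""
    return [t for t in tokens if not is_numeric_token(t)]


tokens = ["about", "45%", "at", "37.5°", "or", "20°C", "north"]
candidates = spellcheck_candidates(tokens)
```

A filter that only handled the `%` branch would hand `37.5°` and `20°C` to the dictionary, where they would presumably be counted as misspellings.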
What is noticeably lacking from the report is any mention of manual checking performed by the author to evaluate the accuracy of the results generated by the spell checker. Table 4 reports that about five percent of articles contain more than 25% misspelled words(!); honestly, even people on Twitter don't (generally) show that level of illiteracy. Are there certain types of articles which are responsible for these grossly inflated counts?
In summary -- sloppy methods give useless results. No news.
~Idarubicin
Hi,
My counter(?) hypothesis is that the long tail of articles grows most, and gets little to no proofreading. Therefore I'd love to see the results normalized by (log maybe) of Page Views (from http://stats.grok.se/ ). I also have a few doubts about the quality of randomly sampled pages in general, and about whether the growth of jargon (which may or may not end up counted as spelling errors) has increased.
Excellently interesting piece though! Great work.
Winton
But the Wiki-peepia is teeming with a plurality of crotchfruit who think spelling like a SPED is ok, and to call them on this fact would hurt their feelings.
I was going to comment, but I get tired of being chastized for 'correcting the teacher.'
Did you even bother to proofread your post? Or was it your purpose to obfuscate?
There is nothing to FEAR but NOTHING itself; and I fear there is a whole lot of nothing going on. --scorpivs
I'm sure Wikipedia's quality increased as the number of people who cared increased, but now everyone uses it and its quality is going back down. There is also an increase in articles on stupid stuff that attracts kids and idiots, so there will be more mistakes.
Australians have told me they consider their country's accent closer to American English, while Canadians have said their accent is closer to that of Britain. To my ear the converse is true, and perhaps these statements are more indicative of feelings about national identity.
in everyday speech?
You got me there. I use the ISO 8601 format only in writing. In speech, I have "Dee-sem-burr" to disambiguate. It's the same way a lot of languages say "four and twenty" but still write 24.
That sounds reasonably accurate, though my impression of Australian English is that it is about half-way between British and American, splitting the difference between the two varieties (and adding a few quirks of their own).
However, the English spoken in Canada is virtually the same as that spoken in the United States - certainly the differences are no greater than those which exist regionally in the U.S. - and I've never heard a Canadian say otherwise. Perhaps your friends were trying to say that Canadian English is closer to British English than U.S. English is - which is true, but just barely.
Chalk up the misspellings to smartphone use and use of abbreviations.
I have another problem related. I work in English (born and raised), (RWS)
I am in a French province and I also work and write in French, (RWS)
and I write and read Spanish. Spelling words that are common to English
and the second language always leads to misspellings.
English tends to double up consonants, the other languages do not.
BTW (RWS = Read Write Speak) (BTW = By the way)
Leslie Satenstein Montreal Quebec Canada
a hurried editor could misunderstand your intentions
See. It's not actually a problem. The editor was likely hurried, so you can't blame them for their blatant and repeated fuck up. After all, they "own" so many articles of their prose that need their constant protection.
Contrary to your hurried editor, a GOOD editor wouldn't make arbitrary decisions with no basis in fact. If they cannot see the change or a problem, perhaps they don't need to act. But they don't hesitate to hit revert as soon as they get the change notification!
The word you're grasping for is "colloquialism". They are not always defined by national geographic boundaries either.
"Soda", "cola", "pop", "coke", "coca", "pepsi", "tonic", "soft drink", "soda water", "fizzy drink", and "refresco", are common colloquialisms used as generic descriptions of carbonated sugary drinks, regardless of name brand similarities.
As in your example, "cigarette", "fag", and "smoke" are common colloquialisms.
Those are just ones I'm familiar with, and can think of right off, because they're used in places I've been.
Serious? Seriousness is well above my pay grade.
"The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources. As the saying goes, the more eyes the better."
As the saying goes, none of us is as dumb as all of us.
Who's responsible for checking those wikipedia entries for correct spelling? All of us? You mean, none of us.
No one ever, ever, cites a diff when they are bitching about Wikipedia on Slashdot.
Americans do not "control" soft dictionaries. Tin foil much?
Set your country code and browser language appropriately and you are good to go.
Sig Battery depleted. Reverting to safe mode.
Wasn't misspelled, idiot.
Sig Battery depleted. Reverting to safe mode.
The author proposes that this consistent increase is the result of Wikipedia contributors using more complex language, which the test is unable to cope with.
Simple: add something to the test that checks average word and sentence length as a rudimentary way of determining sentence complexity.
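Something along these lines would do as a first pass. It is a deliberately crude sketch: sentences are split on `.`, `!` and `?`, words are runs of letters and apostrophes, and the function name is made up. It is not part of the original test.

```python
import re


def complexity_stats(text):
    """Return average word length (letters) and average sentence length (words)."""
    # Crude sentence split: break on runs of ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Crude word split: runs of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sentence_len": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),
    }


stats = complexity_stats("The cat sat. It ran far away.")
# 7 words totalling 21 letters over 2 sentences
```

If both averages climb over the years alongside the misspelling rate, that would at least be consistent with the "more complex language" explanation; if they stay flat, the explanation looks weaker.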
Well, since you asked, range of breadth is like depth of draft, only horizontally.
It so happens that the University of Melbourne offers baccalaureate degrees in range of breadth studies. This is apparently better than majoring in one subject while also completing a minor in another subject. Or maybe it has something to do with the flip in the Coriolis effect on that far side of the equator. As you are probably aware by now, I am no expert in these areas.
Will
I had a short stint working at a defense contractor with a bunch of folks with less programming experience than me. What really amazed me is one day one of my co-workers was reading a news article and complained that they'd misspelled the word "organize" with an "s" rather than a "z".
I just commented, "Oh, that's just the British spelling." to which everyone in the group was literally amazed at how I could possibly know that.
Because, um, I've read a few books?
I have no problems with our friends in the UK spelling and pronouncing things differently. It's just another bit of spice to the variety of language. Of course, too much of our language is polluted with the foul flavors of misuse and ignorance, but being the good Grammar Nazi that I am, I'll keep plugging along trying not to make mistakes myself and making the occasional snarky comment upon seeing some egregious misuse of the language.
Several years ago, my biggest peeve was people using "loose" for "lose", an error which seemed to me to grow very fast. The new one that really chafes my behind is the increasing use of "suppose" for "supposed", as in "You're supposed to use good grammar if you want to communicate effectively." This one bothers me a lot more because it's not a case of failing to know the vagaries of English spelling (after all, "choose" rhymes with "lose") but a complete misapprehension of the grammar of the words involved.
You are in a maze of twisty little passages, all alike.
You're right. I'm not in the habit of bookmarking Wikipedia articles I read, nor keeping a running list of ones I attempted to edit. This incident was about three years ago, and I have no idea what article it was I was trying to edit. All I remember was a fairly annoyed feeling, followed by "that's the last time I offer to help."
Occasionally living proof of the Ballmer peak.
Presumably it would be the last, or near to last, edits made before you abandoned your Wikipedia account?
It is a damn poor mind indeed which can't think of at least two ways to spell any word.
- Andrew Jackson