The Curious Case of Increasing Misspelling Rates On Wikipedia
An anonymous reader writes "The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources. As the saying goes, the more eyes the better. One particular student who was curious about this conducted rudimentary text mining on a sampling of the Wikipedia corpus to discover how misspelling rates on Wikipedia change through time. The results appear to indicate an increasing rate of misspellings through time. The author proposes that this consistent increase is the result of Wikipedia contributors using more complex language, which the test is unable to cope with. How do the results of this test compare to your own observations on the detail accuracy of massively crowd-sourced applications?"
Every web browser as auto spell-check capabilities these days. Most of them correct as you type.
So why should there be any misspellings on something that is managed strictly from a web interface?
Is it part of the arrogance of those electing themselves to write and editing articles on wiki that they refuse to use a spell checker, or
is it that the words are simply unknown to the normal spell-check dictionaries?
I find occasional misspellings in mainstream news articles as well (and I am by no means a natural born speller).
But most maddening to me is the "they're their there" errors, and similar wrong word usage.
Spell checkers offer little help in catching these, but a 6th grade education usually suffices.
Maybe the same people who wont waist there time checking they're spelling also cant be bothered to use the write word. ;-)
Whether it's open source software or online collaborative projects, the smart people always get driven away over the long term. Smarter people are usually more interested in creating high-quality content, whereas stupider people end up putting out crap purely for political reasons. Eventually these stupider people start trying to modify the work of the smarter people, but do a poor job at it. When they're called out on their shitty work by the smart people, the fools make a huge stink. This soon devolves into a political mess where the smarter contributor is severely inhibited from contributing by the constant moaning and bitching of the idiots. Not wanting to waste time with such shenanigans, the smarter person leaves for some other endeavor. After a while, many of the smarter people are driven away, and the end result is that the stupider people make up the bulk of the project's contributions.
We've seen this happen with many open source software projects, and I don't think that other kinds of online collaborative projects are any different.
I wonder how many our typos compaired to how many our truely speling mistaks?
I wonder if these folks have corrected for the fact that there is just more content out there than before, which means the people and systems checking for spelling have more to crawl through. If the time those people and systems spend hasn't grown as fast, then the misspelling rate will rise...
I can offer my own opinion of this phenomenon: the bad is driving out the good. Fewer competent writers are bothering to edit Wikipedia articles nowadays. Not only do contributions get reverted / deleted by editors who think they "own" the article, but good writers simply get tired of fixing the semi-literate ramblings of people who cannot write a coherent sentence.
It's the old axiom that incompetent people cannot recognize their own incompetence, and so do not realize that their "contributions" are not improving the article, but instead are making it worse. Eventually the good contributors get tired of sweeping back the ocean with a broom, and just walk away from Wikipedia.
So slashdot has just posted an article about a test where even the test's AUTHOR believes the results are due to shortcomings in the test itself. This has to be the most pointless article I've read in a while...
your a looser, so their!!!!
The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources.
[citation needed]
#DeleteChrome
Here is a typical example:
Person A and B are on the ground floor of some building.
Person A would like person B to have some parcel delivered to the 7th floor of the building.
Here's how person A delivers the request:
"Buddy, please bring this parcel up to the seventh floor, thanks".
I posit that this grammar is wrong. He should say:
"Buddy, please take this parcel to the seventh floor, thanks", because they are in the same area and buddy B, by doing the needful, will be leaving that place.
Worse still, you even hear it in the main stream media.
Other cases:
Folks addressing "data" and "media" as singular! Again, wrong. They should be using "datum" and "medium".
Eye don't no how ewe can automate proof reading. You still knead a human in the loupe.
For all intensive purposes, "whom" is no longer a word. That begs the question, "who cares"?
Unless the article is locked, just fix the spelling errors yourself instead of whining about them getting worse.
If someone is passing you on the right, you are an asshole for driving in the wrong lane.
... and the growth in size of many articles, combined with the limited number of Wikipedia editors, is one possible reason why spelling errors may be on the increase. Also, one form of vandalism is the intentional introduction of spelling errors.
Is not the increase in rates, and that crowdsourcing doesn't solve the problem, but that spell checkers don't solve the problem. What's up with that?
icebike is a victim of Muphry's Law.
... this whole article is basically just asking us if we think the internet is misspelling things more often these days?
It's quite likely there are also a lot of homonym failures, as aptly pointed out humorously in the other posts. Just because it passes through spell-check doesn't mean it's technically correct.
Then you also have a problem with technical words, Anglicization of foreign words (which don't even always follow normal English phonetic conventions - translation of Japanese words/names for instance), unique names, etc. that aren't included in most browser dictionaries. If you're not watching close enough, it's certain that auto-correct can easily make a mess of things.
older IE's do not have spell check.
So, I wonder if this text mining exercise took into account the fact that, although all contributors (to the English version of Wikipedia) are, in fact, typing their articles in English - spelling can vary throughout the world. For example - a spelling of "defence" is not "wrong" - it is just the way a Brit, Aussie, Kiwi etc would spell what Americans call "defense". Same goes with colour/color, centre/center, organise/organize etc etc etc. If you include the many instances of these sorts of words across thousands of wikipedia articles, you're going to skew your results...
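To illustrate the regional-spelling point above, here is a minimal sketch, not the study's actual method. It assumes a tiny hand-picked variant table (`BRITISH_TO_AMERICAN`) and a toy US-only dictionary, both hypothetical, and shows how mapping known variants before the lookup changes the count.

```python
# Hypothetical, hand-picked mapping of British spellings to American ones.
# A real check would need a far larger table (or a second dictionary).
BRITISH_TO_AMERICAN = {
    "defence": "defense",
    "colour": "color",
    "centre": "center",
    "organise": "organize",
}


def count_misspellings(words, dictionary, variants=BRITISH_TO_AMERICAN):
    """Count words missing from `dictionary` after mapping known variants."""
    misses = 0
    for word in words:
        normalized = word.lower()
        # Treat a known regional variant as its dictionary equivalent.
        normalized = variants.get(normalized, normalized)
        if normalized not in dictionary:
            misses += 1
    return misses


# Toy data, invented for this example.
us_dictionary = {"the", "color", "of", "defense", "center"}
sample = ["The", "colour", "of", "defence", "centre", "teh"]

naive = count_misspellings(sample, us_dictionary, variants={})
adjusted = count_misspellings(sample, us_dictionary)
print(naive, adjusted)
```

With the variant table applied, only the genuine typo ("teh") is left in the count.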
Welcome to the future, where text input has become minimalized and marginalized. When half the dictionary has become standardized to 3 letter abbreviations, I don't find it the least bit surprising that spelling and grammar have gone out the window. Maybe this isn't some new symptom of an illiterate world, but rather the written English word evolving in front of our eyes; the technological revolution.
...I shall never be out of work.
Egg corn.
I would guess that this is nothing to do with spellcheckers (which are useful for catching typos, but fairly useless for catching mis-spellings). As this was observed over time, might it not be possible that the decreasing level of literacy may be being exposed by a decrease in the average age of the contributors?
I think some of the issue here is that a new generation is showing up with poor literacy skills. The primary schools are under pressure to meet their government-mandated competency requirements, budget cuts, and various other issues, and have cut back on some of the basic skills that were once taught.
I work at a tutoring center / assistance center at a college and it is depressing how poor students' basic literacy skills are when they come out of high school. Writing skills are nonexistent; some of them do not even know how to hold a pencil correctly, and unless there is a computer with a spell checker, their spelling is limited to about the 4th grade level.
I have been seeing this for several years now, and these are the people who are replacing the older generation, who did not have computers as pervasive as they are now.
I personally, think that the biggest problem, as a random contributor, is the hostile attacking attitude that wikipedia has for outside contributors. Far too often, I've seen my edits reverted, or, worse, completely blocked due to huge random network-wide ip address blocks (thanks, Alison, you f#$$#% b$^%$). I don't bother editing articles anymore -- and if I do manage to luckily slip past one of the huge ip network blocks, I expect that my contributions will be destroyed in some political fight, so I might just comment on the discussion page, if I really feel strongly about it.
I use wikipedia on a daily basis, but due to the hostile attitude and blatant biases, I don't trust it, and I've mostly given up on it. It's become just another website for me -- there are other, more specialized websites out there -- which sometimes have higher quality information.
Another possible reason for more errors, aside from nobody giving a damn about wikipedia anymore, is that Firefox, with its built in spell checker, is declining in use, in favor of Chrome. Chrome doesn't max out the cpu and destroy batteries like Firefox -- but it also doesn't have a built in spellchecker.
I've noticed many more spelling errors in online news articles recently. I'm not talking about blogs, but actual news publications, including AP articles. Most of the errors are the type where a word is spelled correctly, but it's obviously the wrong word, and an automated spell checker wouldn't catch it. Ex: "It was am important step..."
Before yesterday it had referred to a web-based chat called cg:irg for a year.
It's sad. Through all this web content, I am slowly unlearning how to spell or use proper grammar.
English teachers / professors (with a few exceptions) used to be my arch-enemies (as a math / science person), and I wished them all a pleasant, if sudden, death for their batshit-insane insistence on making mountains out of molehills (i before e, except after c; can't end a sentence with a preposition; this {subject}) with regard to the language, and yet lately I find myself wishing there were more of them.
It's not fair: I've nursed some of those grudges for years!
I am John Hurt.
I believe this is due to Wikipedia becoming more diverse. As more people learn about Wikipedia, more people contribute to it. It's overall becoming more accessible to everyone, and therefore everyone is putting in their two cents. Whereas in the beginning you could argue that the population was more centered around a niche crowd who are more pedantic than those that just wish to contribute in some form.
Surely one factor is that early in its history much of the content on Wikipedia was copied from a public domain edition (1911?) of the Encyclopaedia Britannica, in which one would expect to find very few spelling errors. Over time more and more of the content is user-generated, so naturally it is more likely to contain typos.
The increase in the percentage of spelling errors is an artifact of his experimental procedure. He randomly takes a Wikipedia article instead of analyzing the most popular ones. As Wikipedia has become larger, it has attracted more fringe topics, probably from authors in countries where English is not the first language. Wikipedia now probably has more articles that aren’t viewed and revised as much. Thus, random sampling now has a higher chance of selecting such articles, and so finds more spelling mistakes.
He should change his experiment so that he analyzes the spelling mistakes on the most accessed and modified pages in Wikipedia or discard articles where the activity on the article is below a certain threshold.
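A rough sketch of the thresholding idea, assuming each sampled article comes with an edit count, a word count, and a misspelled-word count. The field names and all the numbers below are made up for illustration; nothing here comes from the study's data.

```python
def filter_by_activity(articles, min_edits):
    """Keep only articles with at least `min_edits` revisions."""
    return [a for a in articles if a["edits"] >= min_edits]


def misspelling_rate(articles):
    """Misspelled words divided by total words across all given articles."""
    total_words = sum(a["words"] for a in articles)
    total_misses = sum(a["misspelled"] for a in articles)
    return total_misses / total_words if total_words else 0.0


# Invented sample: one rarely edited stub sits between two busy articles.
sample = [
    {"title": "A", "edits": 500, "words": 1000, "misspelled": 10},
    {"title": "B", "edits": 3, "words": 200, "misspelled": 20},
    {"title": "C", "edits": 120, "words": 800, "misspelled": 8},
]

overall_rate = misspelling_rate(sample)
active_rate = misspelling_rate(filter_by_activity(sample, min_edits=100))
print(overall_rate, active_rate)
```

Comparing the two rates year by year would show whether the reported increase survives once the low-activity articles are dropped.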
After the last time I tried to clean up some grammar and spelling in an article and it was immediately reverted with "didn't cite sources" I gave up.
Occasionally living proof of the Ballmer peak.
Canadians don't call the sidewalk "pavement" as British do. They don't call a cigarette a "fag".
Canadian English is closer to American English than to British English. It's mostly American English with British spellings.
Some Babylonian confusion is the cost of globalization. And ever since the WWW was invented, English has deteriorated ever faster. But the phenomenon of languages deteriorating is as old as the languages themselves. Pidgin English was an offspring of colonialism. And see what happened to Latin, once it became popular. Then take a look at where colonialism is in effect taking place today. A language's popularity can well become its downfall.
For some comic relief: A link to Mark Twain's "Plan for the Improvement of English Spelling":
http://design.caltech.edu/erik/Misc/Twain_english.html
And please excuse any poor spelling and all awkward phrasing.. I'm just a Norwegian lass.
And it sed everything was spelled write.
I joined Wikipedia just because I grew tired of all the spelling errors and wanted to correct them.
That's a myth. If those eyes aren't attached to competent people, having more of them will do no good.
Knowledge is power; knowledge shared is power lost.
We awl noe wot ure saeng no mader howe u spell it.
if your life is such a big joke then why should I care?
This may sound like a get off my lawn type post, but from what I've seen it seems that the writing ability of younger people has severely declined. And it's not even that big a difference in age that I'm talking about here, I'm talking about people less than 10 years younger than me. I "abuse" the language a fair amount myself, but I'm talking about seeing people thinking column has a b in it, and despair doesn't have an e. There are fluctuations in the language that I'm used to, such as the color vs. colour thing, but basic spelling problems that would not be correct in any dialect seem to be pretty common. And of course we have the their vs. there problem.
I see a lot of good comments here, but the fact of the matter is that the novelty of editing has worn off for many of us. In the beginning, when Wikipedia was small, or when it was new, or whatever the reason, it was fun to keep an eye on a few pages. That novelty has worn off, and along with it, any desire to fix the little spelling errors I find along the way.
As a side note, Wikipedia had, at one time, a large number of articles about my profession. None of them was accurate, at least in the US sense of defining many terms, specializations, and equipment. Maybe other parts of the world call things differently, but I doubt to the degree that Wikipedia was wrong. Still, I wasn't about to go re-write and fix links in every article -- even if I would have been able to find sources.
So frankly, I've given up. Yes, I notice spelling errors on Wikipedia. I just read past them. It's not worth fighting with people over and it's not worth my time to fix. My interests lie elsewhere. Sorry, universe.
Also, misspellings and bad grammar on the internet are cool. Just look at some of the drivel published by actual legitimate news sources (AP, I'm looking at you. Would it kill you to spell check an article before posting? I know it wouldn't fix the "their/there" and related problems, but it's a start.)
Wikipedia is not immune to entropy.
http://alternatives.rzero.com/
Is it so hard to remember smallest to biggest?
Yes, because it's 2011-12-23 where I am, not 32-21-1102. Days are smaller than tens of days, right? I prefer ISO 8601 because biggest to smallest is "lexicographically monotonic" on any date in the common era, meaning that sorting a set of strings representing any dates since Jesus M. Christ was potty trained gives the same result whether one treats them as dates or as generic strings.
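A small sketch of that sorting property, using only Python's standard `datetime` module. The sample dates are arbitrary, and the MM/DD/YYYY strings are built by hand purely for contrast.

```python
from datetime import date

# Arbitrary sample dates, including one with a three-digit year.
dates = [date(2011, 12, 23), date(1999, 1, 5), date(2011, 2, 3), date(204, 7, 9)]

# ISO 8601: zero-padded, biggest unit first (YYYY-MM-DD).
iso_strings = [d.isoformat() for d in dates]
sorted_as_strings = sorted(iso_strings)
sorted_as_dates = [d.isoformat() for d in sorted(dates)]

# For contrast, a month-first format sorted as plain strings.
def us_style(d):
    return f"{d.month:02d}/{d.day:02d}/{d.year:04d}"

us_sorted_as_strings = sorted(us_style(d) for d in dates)
us_sorted_as_dates = [us_style(d) for d in sorted(dates)]

print(sorted_as_strings == sorted_as_dates)        # string order matches date order
print(us_sorted_as_strings == us_sorted_as_dates)  # string order does not match
```

The property relies on the zero padding: it holds for four-digit years, which is why the year 204 has to be written as 0204.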
"was immediately reverted with "didn't cite sources" " - Could you provide a link to the diff? One problem could be a lacking edit summary in combination with extensive rearranging of text. In such cases, it is difficult for an other editor to see what changed by looking at the diff. If the edit comes from an anonymous IP address and does not have an edit summary, a hurried editor could misunderstand your intentions.
Avantslash: low-bandwidth mobile slashdot.
Working as a substitute teacher in the high schools of a metropolitan public school system, I've seen spelling get worse over time. The students' knowledge in most other subjects has also nosedived. Teachers have been forced to lower the bar time and time again. Tests are seldom given that are not "open notes." Every year, the students seem more lazy. Most see virtually no reason to worry about how to spell. They know they could spell-check. But, sadly, they're too lazy to do so.
Part of the problem is the article selection methodology. By pulling random articles, the study author is going to be getting mostly articles that have received little attention, and mostly short articles. (Table 2 and Graph 2 show this very clearly--of the 2400 articles examined, only 14 existed in 2001. Half of them didn't exist until 2007. A quarter were created between 2009 and the present.) It's possible that what has been demonstrated is simply that relatively new articles on relatively unimportant topics tend to be less-well maintained.
The major issue is the corpus used for the study. While a half-million-word dictionary sounds impressive, it's still going to fall down in a couple of key areas. For one, foreign-language terms are likely to be nearly completely unrepresented. For another, a lot of proper nouns are going to be missing. If I write an article about Japanese manga or a Norwegian village, I'm going to be including all kinds of things that an English-language dictionary just isn't going to contain. (Worse, I'll get two misspellings for each Japanese term, since I'll have it in the article with both the original Japanese word plus the romanized transliteration). Another problem area will almost certainly be articles on highly technical topics (molecular biology is full of new and unusual abbreviations).
While certain classes of 'obvious' non-words aren't counted, many will be missed. For example, the article preprocessor filters out percentages, but will pass through numbers followed by the degree symbol (which will show up in scientific and geographic articles).
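As a sketch of what a slightly wider filter might look like (this is not the study's preprocessor, and the pattern below is an assumption about which tokens should be skipped), a single regular expression can drop plain numbers, percentages, and degree readings before the dictionary lookup:

```python
import re

# Assumed pattern: an optional sign, digits with optional separators, then an
# optional "%" or a degree sign with an optional unit letter (e.g. "°C").
NUMERIC_TOKEN = re.compile(r"[+-]?\d[\d.,]*(?:%|°[A-Za-z]?)?")


def strip_numeric_tokens(tokens):
    """Remove tokens that are numbers, percentages, or degree readings."""
    return [t for t in tokens if not NUMERIC_TOKEN.fullmatch(t)]


# Invented example sentence, already split on whitespace.
tokens = ["Oslo", "lies", "at", "59.9°", "N", "and", "averages",
          "-4.3°C", "with", "80%", "humidity"]
cleaned = strip_numeric_tokens(tokens)
print(cleaned)
```

Anything the pattern misses still reaches the spell checker, so a manual look at the rejected tokens would remain necessary.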
What is noticeably lacking from the report is any mention of manual checking performed by the author to evaluate the accuracy of the results generated by the spell checker. Table 4 reports that about five percent of articles contain more than 25% misspelled words(!); honestly, even people on Twitter don't (generally) show that level of illiteracy. Are there certain types of articles which are responsible for these grossly inflated counts?
In summary -- sloppy methods give useless results. No news.
~Idarubicin
Because the poster is lying. They really added "bob is gay" to an article and are butthurt that wikipedia wouldn't accept their brilliant vandalism, and they rationalize it by lying to themselves and the world about the facts around their edit.
I, for one, welcome our illiterate overlords.
I was getting the feeling that spelling really doesn't matter anymore. Misspellings seem to be so common nowadays, along with the use of incorrect words - loose/lose, there/their, etc. It seems that if the reader can figure out what the writer meant, then all is good.
Hi,
My counter(?) hypothesis is that the long tail of articles grows most, and gets little to no proof-reading. Therefore I'd love to see the results normalized by (log maybe) of Page Views (from http://stats.grok.se/ ). I also have a few doubts about the quality of randomly sampled pages in general, and about whether the amount of jargon (which may or may not end up counted as spelling errors) has increased.
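One way to read "normalized by log of page views" is a weighted average in which each article's misspelling rate counts in proportion to the log of its views. The sketch below assumes that reading; the field names and view counts are invented, and nothing is fetched from any statistics service.

```python
import math


def view_weighted_rate(articles):
    """Average per-article misspelling rate, weighted by log(1 + page views)."""
    weights = [math.log1p(a["views"]) for a in articles]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        w * a["misspelled"] / a["words"] for w, a in zip(weights, articles)
    )
    return weighted_sum / total_weight


# Invented sample: an unread stub full of errors and a well-read clean article.
sample = [
    {"views": 0, "words": 100, "misspelled": 50},
    {"views": 999, "words": 100, "misspelled": 1},
]

unweighted = sum(a["misspelled"] / a["words"] for a in sample) / len(sample)
weighted = view_weighted_rate(sample)
print(unweighted, weighted)
```

The log keeps a handful of hugely popular pages from dominating, while pages nobody reads contribute little or nothing.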
Excellently interesting piece though! Great work.
Winton
this just feels like the sort of thing Wikipedia would do on purpose, just to encourage other people to contribute ... remember, that's part of the MO over there.
But the Wiki-peepia is teeming with a plurality of crotchfruit who think spelling like a SPED is ok, and to call them on this fact would hurt their feelings.
My guess is that most of the increase in misspelling is because more and more contributors to English Wikipedia don't have English as a first language.
This is a good thing. It means less cultural bias. It also means more correct and informative articles in areas where the English speaking world lacks knowledge. Most native English speakers can only understand English, or have a very poor understanding of other languages; this means that knowledge from the non-English parts of the world is unlikely to ever reach the English speaking world, unless someone who doesn't have English as a first language makes an effort to communicate in English, which is surprisingly hard on some subjects, since English sucks as a medium to communicate some ideas. English is after all a simplistic trading language, a pidgin, historically developed mostly by users that don't have English as their first language and only use it for trade. Until very recently, it has never been used to communicate any deeper thoughts; even native English speakers used other languages like Latin, French and German, depending on subject (Latin was the language of philosophers and biologists, French of diplomats, politicians and social scientists, German of mathematicians, logicians and chemists, etc.). Apart from English being a very long-winded and inexpressive language, narrow-mindedness and bigotry are literally built into it.
I was going to comment, but I get tired of being chastized for 'correcting the teacher.'
Did you even bother to proofread your post? Or was it your purpose to obfuscate?
There is nothing to FEAR but NOTHING itself; and I fear there is a whole lot of nothing going on. --scorpivs
I'm sure Wikipedia's quality increased as the number of people who cared increased, but now that everyone uses it, its quality is going back down. There is also an increase in articles on stupid stuff that attracts kids and idiots, so there will be more mistakes.
Interesting, Slashdot...
A flawed test produces a flawed report, a flawed article, and completely off topic comments in the 4+. Who would have guessed.
in everyday speech?
You got me there. I use the ISO 8601 format only in writing. In speech, I have "Dee-sem-burr" to disambiguate. It's the same way a lot of languages say "four and twenty" but still write 24.
Chalk up the misspellings to smartphone use and use of abbreviations.
I have another problem related. I work in English (born and raised), (RWS)
I am in a French province and I also work and write in French, (RWS)
and I write and read Spanish. Spelling words that are common to English
and the second language always leads to misspellings.
English tends to double up consonants, the other languages do not.
BTW (RWS = Read Write Speak) (BTW = By the way)
Leslie Satenstein Montreal Quebec Canada
adding "misspelled" words to your own dictionary, will cause more mispelled words over time. simple.
ironic, huh?
a hurried editor could misunderstand your intentions
See. It's not actually a problem. The editor was likely hurried, so you can't blame them for their blatant and repeated fuck up. After all, they "own" so many articles of their prose that need their constant protection.
Contrary to your hurried editor, a GOOD editor wouldn't make arbitrary decisions with no basis in fact. If they cannot see the change or a problem, perhaps they don't need to act. But, they don't hesitate to hit revert as soon as they get the change notification!
"The crowd-sourced nature of Wikipedia might imply that its content should be more 'correct' than other sources. As the saying goes, the more eyes the better."
As the saying goes, none of us is as dumb as all of us.
Who's responsible for checking those wikipedia entries for correct spelling? All of us? You mean, none of us.
alas ... "lead me to ask" in the first paragraph when "led" was what was meant?
No one ever, ever, cites a diff when they are bitching about Wikipedia on Slashdot.
Americans control the initial soft dictionaries that are used by spell checkers, but between Canada, the UK, Australia, and South Africa, the MAJORITY of the world spells things like "colour" rather differently than the US would like.
Rather than say wikipedia is rife with spelling errors, maybe it's time to admit the Americans can't spell.
Proof that kids these days have no idea how to use what they have, which naturally extends to language use.
Nah, just kidding. Most people don't spell particularly well. Early in Wiki's life, it was maintained by a more concentrated population of geeks, who tend to spell better. Now that the lowest common denominator of spellers is closer to the mean average contributor, spelling prowess starts to list. I also don't doubt that with exponentially more pages out there, those of us spelling/grammar Nazis who troll for errors have too many to effectively deal with.
The main weakness I see in your protocol is your spelling check
=> could you confirm your results by using a blacklist of commonly misspelled words instead of a whitelist ?
this is a much quicker experiment and you can probably find a blacklist and/or improve it with the most common misspelled words you must have collected (and please tell us about that list : I'd examine it statistics-wise and timewise)
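A minimal sketch of the blacklist approach suggested above. The five-entry list is a hypothetical stand-in for a real list of common misspellings, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical blacklist: known misspelling -> intended word.
COMMON_MISSPELLINGS = {
    "teh": "the",
    "recieve": "receive",
    "seperate": "separate",
    "definately": "definitely",
    "occured": "occurred",
}


def find_known_misspellings(text, blacklist=COMMON_MISSPELLINGS):
    """Count only the words that appear on the blacklist."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in blacklist)


# Invented sentence: the place name would trip a whitelist check,
# but is ignored here because it is not a known misspelling.
text = "Teh village did not recieve teh parcel from Tsukuba."
hits = find_known_misspellings(text)
print(dict(hits))
```

The trade-off is the opposite of a whitelist: unusual names and jargon no longer produce false positives, but any misspelling missing from the list goes uncounted.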
"Every web browser as auto"
So you failed on your fourth word. Maddening indeed.
So the rule is that I shouldn't ask you what the hell "range of breadth" means? Bummer.
Well aren't you an idiot.
The author proposes that this consistent increase is the result of Wikipedia contributors using more complex language, which the test is unable to cope with.
Simple: add something to the test that checks average word and sentence length as a rudimentary way of determining sentence complexity.
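A rough sketch of that check. The sentence splitting on ".", "!" and "?" and the word pattern are simplifying assumptions, so abbreviations and numbers will be handled badly.

```python
import re


def complexity_stats(text):
    """Return (average word length, average words per sentence) for `text`."""
    # Crude sentence split on runs of ., ! or ?; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    avg_sentence_len = len(words) / len(sentences) if sentences else 0.0
    return avg_word_len, avg_sentence_len


avg_word_len, avg_sentence_len = complexity_stats("The cat sat. A dog ran off!")
print(avg_word_len, avg_sentence_len)
```

Tracking these two numbers per year alongside the misspelling rate would show whether the rate rises together with the complexity of the language.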
You're right. I'm not in the habit of bookmarking Wikipedia articles I read, nor keeping a running list of ones I attempted to edit. This incident was about three years ago, and I have no idea what article it was I was trying to edit. All I remember was a fairly annoyed feeling, followed by "that's the last time I offer to help."
Presumably it would be the last, or near to last, edits made before you abandoned your Wikipedia account?
It is a damn poor mind indeed which can't think of at least two ways to spell any word.
- Andrew Jackson