Developing a Vandalism Detector For Wikipedia
marpot writes "In an effort to assist Wikipedia's editors in their struggle to keep articles clean, we are conducting a public lab on vandalism detection. The goal is the development of a practical vandalism detector that is capable of telling apart ill-intentioned edits from well-intentioned edits. Such a tool, which will work somewhat like a spam detector, will release the crowd's workforce currently occupied with manual and semi-automatic edit filtering. The performance of submitted detectors will be evaluated based on a large collection of human-annotated edits, which has been crowdsourced using Amazon's Mechanical Turk. Everyone is welcome to participate."
Apparently, how their vandalism detector works right now is by automatically reverting any edits done by anonymous editors.
(And yeah, that's a bit sarcastic, but somewhat true.)
Welcome to Slashdot. Although everyone is welcome to contribute to Slashdot, at least one of your recent posts did not appear to be constructive and has been modded down. Please use TrollTalk for any test edits you would like to make, and read the welcome page to learn more about contributing constructively to this web site. Thank you.
Whoever posted this clearly isn't aware of the actual work being done in the field. For instance, I was running an anti-vandalism bot in 2006, and it wasn't new at the time. They've gotten much more sophisticated since then.
Why are they so intent on reinventing the wheel? Do they not even realize that the wheel exists already? Why not just improve on it instead?
Cyde Weys Musings - Scrutinizing the inscrutable
Wikipedia, the encyclopedia that anyone can edit - in my ass.
Harry Potter:
"The novels revolve around [[Harry Potter (character)|Harry Potter]], an orphan who discovers at the age of eleven that he is a wizard.{{cite web|url=http://edition.cnn.com/2000/books/reviews/07/14/review.potter.goblet/|title=Review: Gladly drinking from Rowling's 'Goblet of Fire'|date=14 July 2000|publisher=CNN|accessdate=28 September 2008}} Wizard ability is inborn, but children are sent to wizarding school to learn the magical skills necessary to succeed in the [[wizarding world]]. Harry is invited to attend the boarding school called [[Hogwarts|Hogwarts School of Witchcraft and Wizardry]]. Each book chronicles one year in Harry's life, and most of the events take place at Hogwarts.{{cite news|url=http://www.newsobserver.com/308/story/639602.html|title=Harry Potter, Hogwarts and Home|last=Frauenfelder|first=David|date=17 July 2007|publisher=The News & Observer Publishing Company |accessdate=29 September 2008}} As he struggles through adolescence, Harry learns to overcome many magical, social and emotional hurdles.{{cite web|url=http://www.southflorida.com/movies/sfe-potter-synopses,0,6711375.story|title=Plot summaries for the first five Potter books|last=Hajela|first=Deepti|date=14 July 2005|publisher=SouthFlorida.com|accessdate=29 September 2008}}"
"=== Supplementary works ===
{{see also|J. K. Rowling#Philanthropy|l1=J. K. Rowling: Philanthropy}}
Rowling has expanded the [[Harry Potter universe]] with several short books produced for various charities.{{cite web|url=http://news.bbc.co.uk/1/hi/business/6903111.stm|title=How Rowling conjured up millions|publisher=BBC|accessdate=7 September 2008 | date=19 July 2007}}{{cite web|url=http://www.alibris.com/search/books/qwork/1198169/used/Comic%20Relief%20:%20Quidditch%20through%20the%20ages|title=Comic Relief : Quidditch through the ages|publisher=Albris|accessdate=7 September 2008}} In 2001, she released ''[[Fantastic Beasts and Where to Find Them]]'' (a purported Hogwarts textbook) and ''[[Quidditch Through the Ages]]'' (a book Harry read for fun). Proceeds from the sale of these two books benefitted the charity [[Comic Relief]].{{cite web|url=http://www.comicrelief.com/stuff-to-buy/harrys-books/the-money/|title=The Money|publisher=Comic Relief|accessdate=25 October 2007}} In 2007, Rowling composed seven handwritten copies of ''[[The Tales of Beedle the Bard]]'', a collection of fairy tales that is featured in the final novel, one of which was auctioned to raise money for the Children's High Level Group, a fund for mentally disabled children in poor countries. The book was published internationally on 4 December 2008.{{cite web|title=
JK Rowling Fairy Tales To Go On Sale For Charity|work=ANI|year=2008|url=http://living.oneindia.in/insync/2008/harry-potter-jk-rowling-charity-020808.html
|accessdate=2 August 2008}}{{cite news|url=http://news.bbc.co.uk/1/hi/entertainment/7142656.stm|title=JK Rowling book fetches £2m|date= 13 December 2007|publisher=BBC|accessdate=13 December 2007}}{{cite web|url=http://www.amazon.co.uk/gp/feature.html?docId=1000137983|title=Amazon purchase book|publisher=Amazon.com Inc|accessdate=14 December 2007}} Rowling also wrote an 800-word [[Harry Potter prequel|prequel]] in 2008 as part of a fundraiser organised by the bookseller [[Waterstones]].{{cite web|title=Rowling pens Potter prequel for charities|author=Williams, Rachel |year=2008|publisher=''[[The Guardian]]''|url=http://www.guardian.co.uk/books/2008/may/29/harrypotter.jkjoannekathleenrowling}} Retrieved on 31 May 2008.
== Structure and genre ==
{{see also|Harry Potter influences and analogues}}
The ''Harry Potter'' novels fall within the genre of [[fantasy literature]]; however, in many respects they are also [[bildungsroman]]s, or [[coming of age]] novels.{{cite web|url=http://findarticles.com/p/articles/mi_m0OON/is_1_24/ai_107896944|title=Wizards and wainscots: generic structures and genre themes in the Harry Potter series|last=Anne Le Lievre|first=Kerrie|ye
I've had many more problems with admin abuse than vandalism. Vandalism is quick and easy to deal with. Admins are the biggest problem in Wikipedia editing; they have no accountability and abuse their power.
How about a log of each admin's activities, including reversions, bans, etc, and a way for non-admins to challenge actions (without spending countless hours in an appeal process worthy of a federal court).
How do we tell intent from the resulting content? Yes, clearly FUCK UR MOM and UR MOM SUCKS COCKS IN HELL are vandalism, but what about misinformed people making edits? Is that vandalism? There is no bad intent there. What about people who don't understand Wikipedia making edits that are not neutral? Is that vandalism?
Bayesian statistics are an interesting thing. Mwhwhwhwhaaaa. Who thought they would say that about stats?
Anyway, you can tell spam with a remarkably high degree of accuracy... Guess what. You can tell "Important" and "friends" emails with a similar degree of accuracy (you define what's important or who are friends). No offence to most vandals (of any type), but usually they are complete fuckwits. I suspect they and what they write are probably even more predictable than spammers.
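A minimal sketch of the Bayesian idea above, in Python, assuming a tiny hand-made set of edit texts labelled "vandalism" or "regular". The `NaiveBayes` class and the sample data are hypothetical illustrations of a spam-filter-style classifier, not any existing Wikipedia tool.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def log_score(self, text, label):
        """Log of (prior * likelihood) for one label, unnormalized."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denom)
        return score

    def classify(self, text):
        """Return the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))

# Hypothetical training data, for illustration only.
clf = NaiveBayes()
clf.train("ur mom sucks lol", "vandalism")
clf.train("this page sucks lol lol", "vandalism")
clf.train("added citation for the publication date", "regular")
clf.train("fixed spelling in the plot summary", "regular")

print(clf.classify("lol ur page sucks"))
print(clf.classify("added a citation to the summary"))
```

A real detector would need far more training data and better features than raw words, but the mechanics are the same as in a Bayesian spam filter.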
Before any more detectors are rolled out, how about they come up with a workable definition of vandalism? And actually use it fairly, ethically and logically.
There's a great deal of evidence to suggest the current definition of "vandalism" is something a wikiadmin decides he just doesn't like, or disagrees with, or that in some way interferes with his power-trip.
Right now, you can think of wikipedia as having two columns per article - first is the working article column, with the second being the discussion column.
What we really need is a third column, one for the currently published version of the article.
While this may not be popular, it would go a long way to getting rid of the spam, and might even solve some of the other issues facing wikipedia.
With such a system, you could even assign articles to a subject matter expert as the editor, who could approve changes, or just incorporate the best changes in.
Not every article would need to have this, but as articles mature, they could move to this over time.
Just have a look on the Discussion Page for "Dawn Wells" to understand why most Wikipedia Admins are Fuck Wads.
Since the problem is tantalizingly easy to frame as a standard data-mining or machine-learning problem, albeit with some quirks, there's quite a lot of work from a lot of research groups that seems to be looking at it. Some examples: one, two, three, four, five, six, seven.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
There is an art to Wikipedia abuse. If someone cites a Wikipedia article in some argument they're making, you can always just go to Wikipedia and edit the page so that they're wrong. But that's what a novice Wikipedia vandal does.
A pro knows to edit the article in a very subtle way, so that it looks like the person has poor reading comprehension. Let's say the person cites a Wikipedia article with a sentence like this, in order to support the argument that Colbert is a Democrat.
Although by his own account he was not particularly political before joining the cast of The Daily Show, Colbert is a self-described Democrat.[12][13]
This bears the mark of authority, because of the footnote superscripts that are already on it. (We can skip the step where we maliciously relocate them here.)
A novice might change it to this (correctly preserving the authoritative footnote superscripts):
Although by his own account he was not particularly political before joining the cast of The Daily Show, Colbert is a self-described Republican.[12][13]
It makes the person appear to be wrong, and the vandalism is obvious, like swapping Eurasia for Eastasia. There's no way he could have misread that.
But change it to this
Although by his own account he was not particularly political before joining the cast of The Daily Show, Colbert has even been described as a Democrat.[12][13]
and the person looks not only wrong, but plausibly wrong because it looks like he can't read. That's what makes successful Wikipedia vandalism an art.
Just because the tag exists, doesn't mean you can slap it everywhere you see an edit that doesn't support your world view! You deletionists are ruining Wikipedia for the rest of us. Assume Good Faith!
Please help publicise swpat.org - the software patents wiki
Whoever posted this clearly isn't aware of the actual work being done in the field. For instance, I was running a ___[thing]___ in _[year]_, and it wasn't new at the time. They've gotten much more sophisticated since then. Why are they so intent on reinventing the wheel? Do they not even realize that the wheel exists already? Why not just improve on it instead?
* * *
This looks like a useful template for the standard "why reinvent the wheel" Slashdot post; I hope you don't mind if I reuse it.
If it stops Deletionists from deleting well-intended edits. Better a short article than no article.
It was just too visionary for its time http://www.everytopicintheuniverseexceptchickens.com/
I believe that vandalism on Wikipedia can be limited. But would it really be possible to detect all kinds of vandalism?
FTA:
"Yahoo! Research will award a cash prize of 500 Euros to the winner of the plagiarism detection task. "
500 euros doesn't sound like much for detecting plagiarism on a site like Wikipedia...
Sack William Connolley.
That eco-terrorist has vandalised more climate-related pages (5500+) than the rest of the vandals put together.
http://wattsupwiththat.com/2009/12/22/william-connolley-and-wikipedia-turborevisionism/
I ask because I don't know. I can see turning a page into a screed as vandalism, but that doesn't differ greatly from many of the wikipedia articles that I've read; quite a few of them are overwhelmingly dedicated to hostility to the topic or advocates of the topic. Earlier today, when I was reading the news, there was a link to the Wikipedia article on the Tea Party movement: well over half of the article was dedicated to quotes from anti-Tea Party people (MSNBC, NYT, LAT, etc.) spouting off hostility to it.
Is that vandalism?
This project will place more power in the hands of anonymous, faceless Wikipedia bureaucrats. It is therefore harmful. If Wikipedia bureaucrats are too lazy to review possibly offensive material by hand and instead want a machine to do it for them then MAYBE the world does not need that kind of Wikipedia at all.
If you want to view a Wikipedia administrator drunk with his own sense of self-importance check this out:
As soon as you start trusting a vandalism detector over manual monitoring, a lot of stuff will start to slip through, some of it will make the news, and then the detector won't be trusted any longer. It will have a short life but will be interesting to watch.
Sew m@ny things that can bee done to bypass mechanisms. Even simple euphemisms like cleaning the old rifle http://images.clipartof.com/small/5039-Man-Cleaning-Inside-The-Barrel-Of-His-Unloaded-Rifle-Gun-Clipart.jpg ...are sure to slip through. There are so many language mechanisms that can be used to fool automated tools, but that will be immediately recognized by people.
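A minimal Python sketch of the point above, assuming a hypothetical one-word blocklist and a hypothetical character-substitution table (`BLOCKLIST` and `LEET` are made up for illustration). Normalizing obvious substitutions catches some disguised words, while a euphemism passes either check untouched.

```python
# Hypothetical substitution table and blocklist, for illustration only.
LEET = str.maketrans({"@": "a", "0": "o", "1": "i", "3": "e", "$": "s"})
BLOCKLIST = {"sucks"}

def naive_match(text):
    """True if any whitespace-separated word is on the blocklist."""
    return any(word in BLOCKLIST for word in text.lower().split())

def normalized_match(text):
    """Undo simple character substitutions before checking the blocklist."""
    return naive_match(text.translate(LEET))

print(naive_match("this page $uck$"))               # disguised word slips through
print(normalized_match("this page $uck$"))          # normalization catches it
print(normalized_match("cleaning the old rifle"))   # euphemism still passes
```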
... but the truth?
http://en.wikipedia.org/w/index.php?title=Nick_Xenophon&oldid=326486984#As_federal_senator
If the world doesn't want Wikipedia, they are more than welcome to stop reading it. In truth, however, it seems the world very much wants Wikipedia, since it is the 5th most popular website in the world (by unique visitors per month, if memory serves).
There are well-intentioned edits on Wikipedia? Even if there were, how could you tell...
Slashdot: Playing Favorites Since 1997
If I had mod points, I'd mod the parent up and the grandparent down. Seriously, almost everything in Wikipedia is transparent. Search the revision history and logs and look for the information you need. RTFM.
A lot of people on /. seem to derive very general opinions about admins from one disappointing personal encounter. They do not include diffs of their edits or their username. In my experience, in most cases the guy who got reverted by an admin broke some kind of rule (and often enough they just got reverted by a regular non-admin, but they assume it was an admin). Instead of RTFM, those people post as AC, complaining generally about admins without providing any traceable cases of admin abuse. I know my opinion isn't very popular, but unless you give concrete examples, your allegations are just FUD.
You're looking for a DWIM (Do What I Meant) interpreter with PDCH (Predictive Digital Concierge Heuristics). While the technology is available it's currently quite costly. Bugs, errata, and maintenance can deliver less than an optimal experience. Might I instead offer you this mail order bride? We have imported personal assistants in stock from less privileged nations - and if you have the means we can outsource minute-to-minute management of them to our Bangalore VPDT (Virtual Presence Discipline Team). Please consult your accountant and tax lawyer concerning withholding for personal staff, particularly if you intend to pursue public service.
/At your service!
Help stamp out iliturcy.
From my experience with contributing to Wikipedia, and from reading some of the talkback (is that what they're called?) discussions, I don't think there's much need for such a tool; there seems to be an elite class of Wiki users that delete anything that they deem unworthy while giving the most bizarre reasons for doing so.
I still think the best solution would be a color coding overlay over the text that would show the reader immediately 1.) how trustworthy the author has been and 2.) how long ago the edit was made (without being reverted). That way it would be easy to see the sections written by reputable authors who have always added useful info and distinguish them from "amendments" that have been entered just a few minutes ago by an anonymous coward. ;)
And for those who do not want to log in to edit, that would be fine too: if the edit stands the test of time, it's highly probable that the information entered was good, so over time it will get a similar color "status" as an edit from a reputable author. It would also be easy to see last-minute amendments by known authors, which, as we all know, should be taken with a (larger than usual) grain of salt, no matter how well known the author is.
Just add a toggle button to switch between default view and the color coded view.
BTW this system would also work very well for blogs and news sites.
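The color overlay described above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the `Span` record, the reputation scale from 0 to 1, the 30-day maturity window, the 0.75 cap on fresh text, and the color thresholds.

```python
from dataclasses import dataclass

@dataclass
class Span:
    text: str
    author_reputation: float  # hypothetical score in [0, 1]; 0 for anonymous
    age_days: float           # how long the text has survived unreverted

def trust_score(span, mature_after_days=30.0, fresh_cap=0.75):
    """Combine author reputation with survival time into a score in [0, 1].

    Text from any author reaches full trust once it has survived long enough.
    Fresh text is capped below full trust even for a reputable author.
    """
    age_factor = min(span.age_days / mature_after_days, 1.0)
    return age_factor + (1.0 - age_factor) * fresh_cap * span.author_reputation

def color_for(score):
    """Map a trust score to a coarse overlay color."""
    if score >= 0.8:
        return "green"
    if score >= 0.4:
        return "yellow"
    return "red"

for span in [
    Span("fresh anonymous edit", 0.0, 0.0),
    Span("fresh edit by a known author", 1.0, 0.0),
    Span("old anonymous edit", 0.0, 45.0),
]:
    print(span.text, "->", color_for(trust_score(span)))
```

With these made-up numbers, a fresh anonymous edit shows red, a fresh edit by a known author shows yellow, and anonymous text that has survived past the maturity window shows green.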
And when you gaze long enough into the code, the code will also gaze into you.
If the world doesn't want Wikipedia, they are more than welcome to stop reading it. In truth, however, it seems the world very much wants Wikipedia, since it is the 5th most popular website in the world (by unique visitors per month, if memory serves).
The problem isn't that the world doesn't need the Wiki, it's that the world generally misunderstands the Wiki.
Despite any claims to the contrary, the only USEFUL information on the Wiki is the references which are cited. The articles themselves are pure trash, and in most cases are the end result of a flame war between various editors. In the end you either have a horribly, obviously biased article, a completely deleted article, or an article which has been rendered so vague as to be useless through constant edit tinkering.
In short, the Wiki should NEVER be directly referenced for any type of citation, since (by its own claim) the only information in the Wiki is itself backed by outside sources. So if you need to cite the Wiki, at the very least use the citations they already dug up for you. The Wiki is a starting point for information, not the destination.
My hope would be that whether they read Wikipedia or not, people would not support projects like this one which place more power in the hands of Wikipedia admins. Such projects by definition place less power in the hands of ordinary Wikipedia users.
Hopefully companies like Google will also question whether Wikipedia is deserving of $2M contributions, especially when in terms of democratic process Wikipedia is getting worse instead of better, as admins go off on their power trips with more and more powerful tools.
Read the Wikipedia talk page for the Martin Heidegger article and you can see that parts of Wikipedia are infested with Neo-Nazi sympathizers who have the protection of a particular Wikipedia admin.
Who cares? 90% of the info on those sites is bogus anyway; it's like trying to fix the perpetually broken!!
Wikipedia's administrators don't seem to have a clue about the effects of vandalism.
The time wasted by humans whose job is solely to revert vandalism is irrelevant. There are more than enough people who are willing to do this work, and if they weren't doing this work they would not be contributing useful content to Wikipedia.
The negative effects are concentrated on the knowledgeable editors who are adding useful new content. There may be 5 to 10 persons actively adding content to an article. Each time a change is made to the article, each of these editors needs to examine the content to determine if it is
is everything that the admin establishment doesn’t agree with. Just like in a state with total censorship.
And on top of that, the admins often don’t know shit about anything.
Which is not surprising, considering that they most likely sit in underpants in their basement all day long. Why else would they have so much time to troll around Wikipedia on a deletion spree? Which is obviously not a very mentally healthy thing to do either.
It’s simple: As long as Wikipedia can at all be controlled by a subset of humanity, it’s doomed to fail as an encyclopedia for all people. By definition.
That’s why it must become a P2P system. With cascading information source rules definable by every user for himself. With everybody being able to be the publisher of his view of Wikipedia.
Because in the end, nearly all you know is based on trust in other sources (human beings) anyway. (Yes, including most of what you call “facts”. Unless you checked for yourself, that information IS based on trust.)
Any sufficiently advanced intelligence is indistinguishable from stupidity.
there is a subset of vandalism that a bot can be very good at detecting. this bot can never handle every kind of vandalism. for example, adding some subtly false statement to a biographical article, but spelling everything correctly, using correct grammar and adding something that looks like it could be a legitimate source is difficult for even human editors to recognize as vandalism.
adding 1s everywhere or deleting the entire article is very easy to detect.
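a rough python sketch of the easy cases just mentioned, assuming the bot sees the article text before and after an edit. the function names and thresholds are made up for illustration and are not taken from any real bot.

```python
import re

def looks_like_blanking(old_text, new_text, max_ratio=0.1):
    """True if the edit removed almost all of a non-trivial article."""
    return len(old_text) > 200 and len(new_text) < max_ratio * len(old_text)

def has_character_flood(text, min_run=10):
    """True if a single character repeats min_run or more times in a row."""
    return re.search(r"(.)\1{%d,}" % (min_run - 1), text) is not None

def is_obvious_vandalism(old_text, new_text):
    """Flag blanking, or a character flood that the old version did not have."""
    if looks_like_blanking(old_text, new_text):
        return True
    return has_character_flood(new_text) and not has_character_flood(old_text)

article = "Some article text. " * 20
print(is_obvious_vandalism(article, ""))                         # blanked
print(is_obvious_vandalism(article, article + "1111111111111"))  # flood of 1s
print(is_obvious_vandalism(article, article + "A sourced sentence."))
```

nothing here would catch the subtly false but well-written edit, which is the hard case described above.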
Case in point --- There is an article in Wikipedia about a certain country.
In that article, they blame their previous British colonial master for everything.
I tried to make some corrections to that article to make it more "neutral", and they changed it back within 10 minutes.
I tried again, and again they changed it back.
For the third time, I was warned by someone from Wikipedia (dunno if it's a volunteer or something) that I have no right to make any correction to that particular article anymore.
The "THEY" in question is the government of that country. They have a "cyber-patrol" group in charge of "online propaganda" and that Wikipedia article is one of their many lies, aka propaganda, they have put online.
Now, how do you define vandalism in this case?
Muchas Gracias, Señor Edward Snowden !
Officially, vandalism is defined as edits made in bad faith.
In other words, the scope of the problem does not include discovering the cure for human stupidity, however laudable that might be.
Furthermore, people here are failing to apply the 80-20 rule: if you can clean up 80% of the vandalism at 20% of the human effort currently expended, the attention available to deal with the difficult twenty percent would more than triple. I've seen entire pages replaced with the word "penis" or a crass four-word comment about some pimple twit schoolmate. There's a lot of low-hanging fruit here.
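The arithmetic behind "more than triple" can be checked with a short Python sketch. It rests on an assumption the comment does not state: that human effort is currently spread across vandalism in proportion to volume, so the difficult 20% gets 20% of the attention today. The function name is hypothetical.

```python
from fractions import Fraction

def attention_multiplier(easy_share, easy_cost_after):
    """How much more effort the hard cases get once the easy ones are cheap.

    Total human effort is normalized to 1.
    Assumption: before automation, effort is split in proportion to volume.
    """
    hard_share = 1 - easy_share
    hard_effort_before = hard_share          # proportional split
    hard_effort_after = 1 - easy_cost_after  # whatever the easy cases no longer need
    return hard_effort_after / hard_effort_before

# 80% of vandalism handled at 20% of the current effort.
print(attention_multiplier(Fraction(4, 5), Fraction(1, 5)))
```

Under that assumption the difficult cases go from 20% of the effort to 80% of it, a fourfold increase.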
I sometimes think Wikipedia needs to implement a mechanism where citations are corroborated by some semi-trusted party: "yes, this citation really contains the support for the claim added to the article." Any editor who hasn't contributed a corroborated citation needs to be kept on a fairly short rope. My opinion is that the underlying currency of good faith contribution is the properly cited claims, preferably from A-list source material and not Joe Random Blog.
How much vandalism is contributed by editors who have added fully sourced claims to three or more articles? If I've seen such a case of vandalism, I can't recall it. I've seen editors make half a dozen quasi-good faith contributions (always unsourced) who have then degenerated into petulance and destruction, perhaps when testing limits becomes a better way to get noticed.
Most of the vandalism I've run into has been fairly fresh, usually a couple of days old or at most a week. On obscure articles, I've encountered heavy vandalism that persisted unchallenged for months. In some ways the long-standing dark-corner vandalism is more problematic, like the mother-in-law who swipes her finger in some obscure crevice to document a damning laxity.
Another case I've often seen is vandalism caught by someone inexperienced, and fixed in that instance (but not with a conspicuous revert), while ten other vandalisms from the same editor on the same spree remain unrepaired. If an unproven editor's contribution seems to be suffering a higher than normal attrition rate, then everything the editor has done should be flagged for attention.
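The attrition-rate idea above might look something like this in Python. The input format (pairs of editor name and whether the edit was later reverted), the 5% baseline, the factor of 3, and the minimum edit count are all assumptions chosen for illustration. Deciding which editors count as "unproven" is left out.

```python
from collections import defaultdict

def flag_editors(edits, baseline_rate=0.05, factor=3.0, min_edits=3):
    """Return {editor: revert_rate} for editors reverted far more than normal.

    edits: iterable of (editor, was_reverted) pairs.
    An editor is flagged when their revert rate exceeds factor * baseline_rate
    and they have at least min_edits edits on record.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for editor, was_reverted in edits:
        totals[editor] += 1
        if was_reverted:
            reverted[editor] += 1

    flagged = {}
    for editor, count in totals.items():
        rate = reverted[editor] / count
        if count >= min_edits and rate > factor * baseline_rate:
            flagged[editor] = rate
    return flagged

# Hypothetical edit history.
history = (
    [("SpreeEditor", True), ("SpreeEditor", True),
     ("SpreeEditor", False), ("SpreeEditor", False)]
    + [("SteadyEditor", False)] * 10
    + [("OneOff", True)]
)
print(flag_editors(history))
```

Once an editor is flagged, every edit from that editor could be queued for review, including the ones nobody has reverted yet.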
A lot could be built on top of a decent blame function, such as the ability to determine whether two versions of an article differ only in text, and with better exposure statistics for how often an edit has been viewed by someone who ought to know the difference.
This article is no great bag of chips, but it contains some pertinent key phrases.
AI comes of age
This fellow Kroon seems to believe that augmented intelligence is the way of the future. I concur. The game is to best combine what humans do well with what the algorithms do better, combined with an effectiveness metric taking into account power law distributions, minus all the pointless hand-wringing about highly motivated adversaries escaping the cunning traps.
Profound acts of bad faith are not remotely the same problem. It's unconscionable scope-creep to bring these worries into the petty vandalism conversation. Yes, some fraction of the thwarted petty vandals will escalate into more profound acts of vandalism. Such is life. Problems remain for the future. Many people think we've made no progress on spam. My view is that the spam filters have essentially driven all the amateur spammers out of the system. Once the level of professionalism required to get spam past the spam filters begins to equal the difficulty of doing a real job, then the flow of spam will finally begin to atrophy.
Another example is ProPolice (or other stack smashing guards) which accomplishes nothing at all on a formal basis, but nevertheless tilts the landscape on exploit cost/benefit, and qualifies your adversaries. One of the heavy burdens on Wikipedia as it now sta
Rogue admins abusing their power? An "in" club? If you have a problem with an admin, provide evidence (a diff of the admin abusing his power) here. Follow the case, argue it out, and the admin will be dealt with. Every admin is elected in, guys. If you think Wikipedia is important enough that all the scary "rogue admins" are actually doing harm, go become a part of the election process. Anyone can vote, and your opinion matters regardless of how many edits you have, or how many articles you've worked on. This isn't like America where your vote only matters symbolically. You can stop these evil boogiemen from getting elected, if you want to. Admins aren't "above" the user. They're just the people who hold onto the brooms. It's the users who make the messes, and the users who point the messes out to the janitors. That's how it was back when I was involved in the community, anyway. Oh, cept SlimVirgin. She's a fucking fascist.
GFA/M/S d-- s: a--- C++++ UBL++$ P+ L+++ !E- W++ N+ !o K- w--- !O !M !V PS++ PE Y+ PGP+ t+++ 5- X+ R tv@ b++ DI++++ D+ G
Read the Wikipedia talk page for the Martin Heidegger article and you can see that parts of Wikipedia are infested with Neo-Nazi sympathizers who have the protection of a particular Wikipedia admin.
Really? Because I actually read through the damn thing, and all I see is a debate about the difference between being a Nazi and being a National Socialist. Add a number of people acting like pompous twats, and you get an edit war, not the coming of the Third Reich.
People replying to my sig annoy me. That's why I change it all the time.
As the owner of one of the first vandalism-reverting bots out there (although in terms of patterns, tawkerbot2 didn't do nearly as much as CB), the first take there was that if you remove the perceived vandalism almost immediately, people don't get any fun out of vandalizing and stop doing it. There was massive opposition at the outset, but then, as volumes increased, people began to freak when the bot was non-operational. Yes, it had false positives which needed to be dealt with, but if I recall correctly, statistically speaking, it was less than a 2% false positive rate, and this was on hundreds of thousands of edits.
Those who opposed any use of the term "Nazi" in the Heidegger article argued that it was pejorative, and that Heidegger was not a Nazi, he was a National Socialist.
A later commentator said that whether intentional or not, those who posted this drivel were attempting to rehabilitate the Nazis by arguing that they weren't Nazis at all, they were National Socialists. He mentioned Lithuania, where this process is farther along than it is here.