Developing a Vandalism Detector For Wikipedia
marpot writes "In an effort to assist Wikipedia's editors in their struggle to keep articles clean, we are conducting a public lab on vandalism detection. The goal is the development of a practical vandalism detector that is capable of telling apart ill-intentioned edits from well-intentioned edits. Such a tool, which will work somewhat like a spam detector, will free up the volunteers currently occupied with manual and semi-automatic edit filtering. The performance of submitted detectors will be evaluated based on a large collection of human-annotated edits, which has been crowdsourced using Amazon's Mechanical Turk. Everyone is welcome to participate."
Further Reading
http://en.wikipedia.org/wiki/User:ClueBot
Summation 2
Whoever posted this clearly isn't aware of the actual work being done in the field. For instance, I was running an anti-vandalism bot in 2006, and it wasn't new at the time. They've gotten much more sophisticated since then.
Why are they so intent on reinventing the wheel? Do they not even realize that the wheel exists already? Why not just improve on it instead?
Cyde Weys Musings - Scrutinizing the inscrutable
I'm not sure why he bragged about reversion speed. All that really depends on is your network connection. For one, your network connection has to be good enough to download, in real time, the diffs of all edits to Wikipedia. Most aren't.
Anyway, a decision as to whether a given diff is vandalism or not needs to be made in a small fraction of a second, as there are dozens of edits coming in every second, and if you continuously fall farther and farther behind, you lose. Given an ideal network connection, vandalism should be reverted in a couple of seconds or so.
I suppose there's some argument to be made for a large cluster of computers handling all edits on Wikipedia, each one spending up to a full second judging each individual edit, but the truth is that none of the algorithms currently in use for vandalism detection are nearly sophisticated enough to require so much computation time.
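To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch. It assumes a steady arrival rate and a fixed judging time per edit; the function names and the figures in the usage note are made up for illustration, not measurements of Wikipedia traffic.

```python
import math

def workers_needed(edits_per_second: float, seconds_per_edit: float) -> int:
    """Smallest number of parallel workers that keeps up with the edit stream."""
    # Each worker clears 1 / seconds_per_edit edits per second, so keeping up
    # takes at least edits_per_second * seconds_per_edit workers.
    return max(1, math.ceil(edits_per_second * seconds_per_edit))

def backlog_after(edits_per_second: float, seconds_per_edit: float,
                  workers: int, elapsed_seconds: float) -> float:
    """Edits still waiting after elapsed_seconds, starting from an empty queue."""
    capacity = workers / seconds_per_edit  # edits judged per second overall
    return max(0.0, (edits_per_second - capacity) * elapsed_seconds)
```

With an assumed 24 edits per second and a full second spent on each, workers_needed(24, 1.0) comes out at 24 machines, and a single machine would be 1380 edits behind after one minute. At an assumed 10 ms per edit, one machine keeps up with room to spare.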
Cyde Weys Musings - Scrutinizing the inscrutable
In response to whether those two examples are vandalism, the answer is no, they are not.
You'd need a strong AI to be able to make those determinations, and if such a thing existed, it'd make more sense just to have the strong AI write the encyclopedia.
What we're talking about here is obvious vandalism (blanking, insertion of curse words, etc.) of the type that can be detected by an algorithmic/heuristic program.
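For the sake of illustration, a check for those two obvious cases might look roughly like the sketch below. This is not how any particular bot does it; the word list, the thresholds, and the function name are all hypothetical.

```python
import re

# Hypothetical word list; a real bot would need a much longer, curated one.
BAD_WORDS = {"poop", "stupid", "sucks"}

def looks_like_vandalism(old_text: str, new_text: str,
                         blank_ratio: float = 0.1) -> bool:
    """Flag only the obvious cases: page blanking and newly inserted bad words."""
    old_len = len(old_text.strip())
    new_len = len(new_text.strip())
    # Blanking: a non-trivial page shrinks to a small fraction of its old size.
    if old_len > 200 and new_len < old_len * blank_ratio:
        return True
    old_words = set(re.findall(r"[a-z']+", old_text.lower()))
    new_words = set(re.findall(r"[a-z']+", new_text.lower()))
    # Only count bad words the edit introduced, not ones already on the page.
    return bool((new_words - old_words) & BAD_WORDS)
```

Comparing against the old revision is a deliberate choice: an article that legitimately contains one of the listed words should not trip the filter on every later edit. Anything subtler than these two cases falls straight through, which is the type II problem.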
Cyde Weys Musings - Scrutinizing the inscrutable
Oh yes, it definitely hits a large number of false positives, presumably also 'fixed' within 30 seconds. For every one that goes reported (including the hundreds or thousands of archived reports), there must be many that go unreported, by 'non-Wikipedians' who edited a page with an error, and then went on their way. Or by people who didn't stick around to 'watch' that their edit doesn't get 'fixed' by an automated process...
The false positive rate on the anti-vandalism bots is a lot lower than you would think. The bots are written quite conservatively, take a lot of factors into account, and only pull the revert trigger when they are quite sure.
It's the type II error rate that's pretty high. Unfortunately, that's not solvable without strong AI.
Cyde Weys Musings - Scrutinizing the inscrutable
We have studied the accuracy of ClueBot, and found that (on a small corpus) it has very good precision (low false positive rate), but a very low recall (low true positive rate). (See: http://www.uni-weimar.de/medien/webis/publications/downloads/papers/stein_2008c.pdf) But the picture might look quite different on a large scale.
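For anyone unfamiliar with the two measures, here is a short sketch of how they are computed from a labeled set of edits. The counts in the usage note are invented for illustration and are not the figures from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) from counts of reverted and missed edits.

    tp: vandal edits the bot reverted
    fp: good edits the bot reverted by mistake
    fn: vandal edits the bot let through
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Suppose a bot reverts 32 edits, 30 of which really were vandalism, while another 70 vandal edits slip past it. Then precision_recall(30, 2, 70) gives a precision of 0.9375 and a recall of 0.3: almost everything it reverts deserves it, but it catches under a third of the vandalism.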
A system like this has been implemented for the German Wikipedia. Almost everybody who has an account can mark article versions as vandalism-free; unless you are logged in, you see the last verified version by default.
(+1, Disagree)
I've had many more problems with admin abuse than vandalism. Vandalism is quick and easy to deal with. Admins are the biggest problem in Wikipedia editing; they have no accountability and abuse their power.
How about a log of each admin's activities, including reversions, bans, etc, and a way for non-admins to challenge actions (without spending countless hours in an appeal process worthy of a federal court).
What are you talking about? All users have logs that track their actions:
http://en.wikipedia.org/wiki/Special:Contributions/Jimbo_Wales
http://en.wikipedia.org/w/index.php?title=Special%3ALog&type=&user=Jimbo+Wales&page=&year=&month=-1&tagfilter=
Actions can be challenged at any point on the talk page or the administrator boards.
How about a log of each admin's activities, including reversions, bans, etc, and a way for non-admins to challenge actions (without spending countless hours in an appeal process worthy of a federal court).
Reversions: http://en.wikipedia.org/wiki/Special:Contributions
Bans: http://en.wikipedia.org/wiki/Special:Log/block
Deletes: http://en.wikipedia.org/wiki/Special:Log/delete
Anything else you're too lazy to find yourself?
Your hair look like poop, Bob! - Wanker.
If I had mod points, I'd mod the parent up and the grandparent down. Seriously, almost everything in Wikipedia is transparent. Search the revision history and logs and look for the information you need. RTFM.
A lot of people on /. seem to derive very general opinions about admins from one disappointing personal encounter. They do not include diffs of their edits or their username. In my experience, in most cases the guy who got reverted by an admin broke some kind of rule (and often enough they just got reverted by a regular non-admin, but they assume it was an admin). Instead of RTFMing, those people post as AC complaining generally about admins without providing any traceable cases of admin abuse. I know my opinion isn't very popular, but unless you give concrete examples your allegations are just FUD.
mod parent up