Competition Produces Vandalism Detection For Wikis

Posted by timothy on Sunday September 26, 2010 @02:40AM from the citation-needed dept.

marpot writes "Recently, the 1st International Competition on Wikipedia Vandalism Detection (PDF) finished: 9 groups (5 from the USA, 1 affiliated with Google) tried their best to detect all vandalism cases in a large-scale evaluation corpus. The winning approach (PDF) detects 20% of all vandalism cases without misclassifying regular edits; moreover, it can be adjusted to detect 95% of the vandalism edits while misclassifying only 30% of all regular edits. Thus, by applying both settings, manual double-checking would only be required on 34% of all edits. It is not yet known whether the rule-based bots on Wikipedia can compete with this machine learning-based strategy. Anyway, there is still a lot of potential for improvement, since the top 2 detectors use entirely different detection paradigms: the first analyzes an edit's content, whereas the second (PDF) analyzes an edit's context using WikiTrust."

Machine learning - right by Animats · 2010-09-26 04:21 · Score: 4, Informative

Wikipedia already has programs which detect most of the blatant vandalism. Page blanking and big deletions are caught immediately. Deletions that delete references generate warnings. Incoming text that duplicates other content on the Web is caught. That gets rid of most of the blatant vandalism. It's not a serious problem on Wikipedia.
The current headaches are mostly advertising, fancruft, and pushing of some political point of view. That's hard to deal with using what is, after all, a rather dumb machine learning algorithm that has no model of the content or subject matter.
Rules can only get so much by tawker · 2010-09-26 06:20 · Score: 3, Informative

As the owner of the first vandalism-reverting bot in mainstream use (http://en.wikipedia.org/wiki/User:Tawkerbot2), I guess I have a bit of perspective on the whole problem.
Originally the bot was created to auto-revert one very specific type of vandalism: a user who would put a picture of SpongeBob SquarePants (or Squidward, or some other cartoon character) into pages while blanking them. That was pretty easy to catch. Next we went to stuff like full page blanking, ALL CAP LETTER UPDATES and additions of a tonne of bad words, based on common vandalism trends (i.e., if a page had 0 profanity on it and someone added a few bad words, it would be reverted). Again, not too many false positives. That basically caught the "dumb kid" type of vandalism, and it was amazing how much lower a percentage of total edits it caught once students went back to school.
The only problem at the time was that it was a resource pig. The bot was originally running on a P2 300MHz with a grand total of 256MB of RAM, and the load got so high that we had to move it about 5 times.
It's interesting to note that at first many, many people were opposed to the idea of automated vandalism reversion; it was almost a contest to revert stuff first, and the bot would win the vast majority of the time. However, as time went on, my inbox started getting rather full whenever I had a power outage, the cat knocked the cord out of the box hosting it, etc. Community reaction to bots doing the grunt work in vandalism really changed.
Anyways, just my 2c on it, and just for the heck of it, to prove I'm actually the Tawker on wiki: http://en.wikipedia.org/w/index.php?title=User%3ATawker&action=historysubmit&diff=387163504&oldid=268687392
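The three rules mentioned above (full page blanking, all-caps updates, and profanity added to a page that had none) can be sketched roughly as follows. This is not Tawkerbot2's actual code: the word list, the thresholds, and the function names `inserted_text` and `looks_like_vandalism` are all made up for illustration, and a word-level diff from Python's `difflib` stands in for whatever diffing the bot really used.

```python
import difflib
import re

# Hypothetical word list and thresholds, chosen only for illustration.
PROFANITY = {"stupid", "sucks", "poop"}
WORD_RE = re.compile(r"[a-z']+")


def inserted_text(old_text, new_text):
    """Return the words that appear in the new revision but not the old one."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return " ".join(added)


def count_profanity(text):
    return sum(1 for word in WORD_RE.findall(text.lower()) if word in PROFANITY)


def looks_like_vandalism(old_text, new_text):
    """Return the name of the first rule that fires, or None."""
    # Rule 1: full page blanking.
    if old_text.strip() and not new_text.strip():
        return "page blanked"

    added = inserted_text(old_text, new_text)

    # Rule 2: the added text is (almost) entirely capital letters.
    letters = [c for c in added if c.isalpha()]
    if len(letters) >= 20 and sum(c.isupper() for c in letters) / len(letters) > 0.9:
        return "all caps"

    # Rule 3: a few bad words added to a page that had none.
    if count_profanity(old_text) == 0 and count_profanity(added) >= 2:
        return "profanity added"

    return None
```

Rules like these are cheap and rarely misfire on ordinary edits, which matches the "not too many false positives" experience, but they only see the patterns someone thought to write down.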
1. Re:Rules can only get so much by Anonymous Coward · 2010-09-26 08:17 · Score: 1, Informative
  It looks like the winning entry uses all of those attributes plus a bunch more. From pages 3-4 of the paper.
  
  Anonymous -- Whether the editor is anonymous or not.
  Vandals are likely to be anonymous. This feature is used in one way or another in most antivandalism bots, such as ClueBot and AVBOT. In the PAN-WVC-10 training set (Potthast, 2010), anonymous edits represent 29% of the regular edits and 87% of vandalism edits.
  Comment length -- Length in characters of the edit summary.
  Long comments might indicate regular editing and short or blank ones might suggest vandalism, however, this feature is quite weak, since leaving an empty comment in regular editing is a common practice.
  Upper to lower ratio -- Uppercase to lowercase letters ratio
  Vandals often do not follow capitalization rules, writing everything in lowercase or
  in uppercase.
  Upper to all ratio -- Uppercase letters to all letters ratio.
  Digit ratio -- Digit to all characters ratio
  This feature helps to spot minor edits that only change numbers, which might help to find some cases of subtle vandalism where the vandal changes arbitrarily a date or a number to introduce misinformation.
  Non-alphanumeric ratio -- Non-alphanumeric to all characters ratio
  An excess of non-alphanumeric characters in short texts might indicate excessive
  use of exclamation marks or emoticons.
  Character diversity -- Measure of different characters compared to the length of inserted text.
  This feature helps to spot random keyboard hits and other non-sense. It should take
  into account QWERTY keyboard layout in the future.
  Character distribution -- Kullback-Leibler divergence of the character distribution of the inserted text with respect to the expectation. Useful to detect non-sense.
  Compressibility -- Compression rate of inserted text using the LZW algorithm.
  Useful to detect non-sense, repetitions of the same character or words, etc.
  Size increment -- Absolute increment of size, i.e., |new| - |old|.
  The value of this feature is already well-established. ClueBot uses various thresholds of size increment for its heuristics, e.g., a big size decrement is considered an
  indicator of blanking.
  Size ratio -- Size of the new revision relative to the old revision
  Complements size increment.
  Average term frequency -- Average relative frequency of inserted words in the new
  revision.
  In long and well-established articles, too many words that do not appear in the rest of the article indicate that the edit might include non-sense or non-related content.
  Longest word -- Length of the longest word in inserted text.
  Useful to detect non-sense.
  Longest character sequence -- Longest consecutive sequence of the same character in
  the inserted text.
  Long sequences of the same character are frequent in vandalism (e.g. aaggggghhhhhhh!!!!!, soooooo huge).
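Most of the statistics listed above are simple to compute from the inserted text. The sketch below is an illustration only, not the paper's code: the function name `edit_features` and the dictionary keys are made up, zlib (DEFLATE) is swapped in for LZW because it ships with Python, character diversity is approximated as distinct characters divided by length, whitespace is left out of the non-alphanumeric count, and the Kullback-Leibler and term-frequency features are omitted because they need a reference distribution and the full article text.

```python
import re
import zlib
from itertools import groupby


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def edit_features(inserted, comment="", anonymous=False, old_size=0, new_size=0):
    """Compute a handful of per-edit statistics from the inserted text."""
    upper = sum(c.isupper() for c in inserted)
    lower = sum(c.islower() for c in inserted)
    digits = sum(c.isdigit() for c in inserted)
    # Assumption: whitespace is not counted as non-alphanumeric.
    non_alnum = sum(not c.isalnum() and not c.isspace() for c in inserted)
    raw = inserted.encode("utf-8")
    return {
        "anonymous": int(anonymous),
        "comment_length": len(comment),
        "upper_to_lower": ratio(upper, lower),
        "upper_to_all": ratio(upper, upper + lower),
        "digit_ratio": ratio(digits, len(inserted)),
        "non_alnum_ratio": ratio(non_alnum, len(inserted)),
        # Assumption: distinct characters over length as a diversity measure.
        "char_diversity": ratio(len(set(inserted)), len(inserted)),
        # zlib stands in for LZW; low values mean highly repetitive text.
        "compressibility": ratio(len(zlib.compress(raw)), len(raw)),
        "size_increment": new_size - old_size,
        "size_ratio": ratio(new_size, old_size),
        "longest_word": max((len(w) for w in re.findall(r"\w+", inserted)), default=0),
        "longest_char_run": max((len(list(g)) for _, g in groupby(inserted)), default=0),
    }
```

A dictionary like this would then be turned into a feature vector and fed to a classifier. Very short insertions can yield a compression ratio above 1 because of zlib's fixed overhead, so that feature is only informative for longer text.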
  Along with analyzing those basic stats, the winning entry also examines categories of words.
  
  Vulgarisms -- Vulgar and offensive words, e.g., fuck, suck, stupid.
  Pronouns -- First and second person pronouns, including slang spellings, e.g., I, you, ya.
  Biased -- Colloquial words with high bias, e.g., coolest, huge.
  Sex -- Non-vulgar sex-related words, e.g., sex, penis, nipple.
  Bad -- Hodgepodge category for colloquial contractions (e.g. wanna, gotcha), typos (e.g.
  dosent), etc.
  All -- A meta-category, containing vulgarisms, pronouns, biased, sex-related and bad
  words.
  Good -- Words rarely used by vandals, mainly wiki-syntax elements (e.g. __TOC_
Re:20% with no false positives? by Rhaban · 2010-09-26 06:54 · Score: 2, Informative

Care to show us even one article where 99% of good edits are reverted? Remember, that will mean that over 99% of all edits are reverted.
not if there are bad edits that are not reverted.
Re:Manual double checking? by Anonymous Coward · 2010-09-26 07:31 · Score: 1, Informative

According to the 2nd link, the vandalism rate on Wikipedia is 2391/28468 = 0.084, not 0.60!

The second link actually says:

The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified.

So that is a vandalism rate of 2391/32452 = 0.074. When I do the math I get 33% of all edits requiring a manual check. The vast majority of them are false positives.
0.074 * (0.95-0.20) + (1-0.074) * 0.30 = 0.0555 + 0.2778 = 0.3333
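
The same arithmetic as a short script, for anyone who wants to plug in other numbers. The function and parameter names are made up for illustration; the inputs are the corpus counts and detector settings quoted in this thread. It uses the unrounded vandalism rate, so the result differs slightly in the fourth decimal place from the hand calculation above.

```python
def manual_check_fraction(vandalism_rate, strict_recall, loose_recall, loose_fpr):
    """Fraction of all edits caught only by the loose setting.

    Edits flagged by the strict setting are treated as certain vandalism,
    so only the rest of what the loose setting flags needs a human look.
    """
    vandal_flagged = vandalism_rate * (loose_recall - strict_recall)
    regular_flagged = (1 - vandalism_rate) * loose_fpr
    return vandal_flagged + regular_flagged


rate = 2391 / 32452  # vandalism edits / all edits in the corpus
frac = manual_check_fraction(rate, strict_recall=0.20, loose_recall=0.95, loose_fpr=0.30)
false_positive_share = (1 - rate) * 0.30 / frac

print(f"vandalism rate:        {rate:.3f}")
print(f"needs manual check:    {frac:.3f}")
print(f"of which regular edits: {false_positive_share:.3f}")
```

The last figure is the share of manually checked edits that turn out to be regular, which is what "the vast majority of them are false positives" refers to.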