Exactly, but both kinds of tools need to solve the same underlying problem: given an edit, is it vandalism?
The better those tools answer this question, the more time Wikipedia editors save.
Me too, experience that is. We took the features from our research that allow for high throughput, and implemented a live edit analysis for the English portion of Wikipedia. It listens on the IRC channel, downloads the wikitext of the old and new revision of each edit, and then does its magic. It once did so on an old laptop, connected at a maximum of 1 GBit/s.
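To make the "magic" a bit more concrete, here is a minimal sketch of the step that compares an old and a new revision. It is an illustration only, not our actual feature set: the function name `edit_features` and the four features are assumptions made up for this example, and the IRC listening and downloading parts are left out.

```python
import difflib


def edit_features(old_text: str, new_text: str) -> dict:
    """Compute a few illustrative (hypothetical) features of an edit."""
    old_words = old_text.split()
    new_words = new_text.split()

    # Align the two revisions word by word to find what the edit changed.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    inserted, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            inserted.extend(new_words[j1:j2])

    # Share of upper-case letters among the inserted letters.
    letters = [c for word in inserted for c in word if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    return {
        "words_inserted": len(inserted),
        "words_removed": len(removed),
        "size_ratio": (len(new_text) + 1) / (len(old_text) + 1),
        "upper_ratio": upper_ratio,
    }
```

Features like these would then be fed to a classifier; which features and which classifier work best is exactly the open research question.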
I cannot agree more with what you say, but I'd like to give it a twist: I want computers to assist me, and I want them to do it well, reliably, and robustly. If I happen to be a Wikipedia editor, that doesn't change a thing: I still want the computer to assist me with what I'm doing. Currently there is no such thing, and the only thing I'd like to do is foster research toward it.
Now, some always go ten steps further when someone talks about a new "solution" based on computers. They immediately envision a world where computers take over. And that, apart from being unrealistic today, must be considered ideological rather than logical.
After all, all you see here and all you see on Wikipedia is made possible only by machines working with intelligent algorithms.
You're right, it's machine learning, data mining, NLP, and information retrieval. But the fun thing is turning a research prototype into a tool that can be left alone most of the time. That hasn't happened yet.
Also, research on this problem started only in 2008, while the rule-based tools developed by Wikipedians have been around since 2006. The works you listed are actually all there is! That's not much to work with, is it?
We are very aware of the existing tools (Huggle, Twinkle, and so on). See the links in the above post, and see the links in the resources section of the competition Web page. An accurate vandalism detector will take a lot of research and development, just like spam detectors did...
Why did you stop developing your tool, anyway?
Don't you think there's a difference between accidentally writing 10 words that have been written before and accidentally writing 100? The former can hardly be called plagiarism, the latter won't happen accidentally.
For a human it is really quite easy to spot different writing styles, but for a computer it isn't, yet. That's why there is an analysis task dedicated to this problem at the competition.
There's only so much that can be done with current plagiarism detection approaches. We definitely expect similar approaches from unrelated participants.
No, simply because near-duplicate texts of sufficient length are not written accidentally and independently of one another. Take the comments on this page as an example: although many discuss the same arguments, I bet you won't find 10 words in a row which appear twice.
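If you want to check the bet yourself, here is a minimal sketch that looks for runs of n consecutive words shared by two texts. The function names are made up for this example, and real detectors are more elaborate; this only shows the basic idea of word n-gram overlap.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Return the set of all runs of n consecutive words (lower-cased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a: str, text_b: str, n: int = 10) -> set:
    """Return the word n-grams that occur in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)
```

Run `shared_ngrams(comment_a, comment_b)` over any two comments here: an empty result means they share no 10 words in a row.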
It's true, random text operations are not all too realistic. However, if a tool manages to find all of the randomly created cases accurately, it will definitely find the subset of texts that are also human-readable.
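For readers wondering what "random text operations" could look like, here is a minimal sketch. The three operations (delete, insert, swap) and the function name `obfuscate` are assumptions for illustration, not a description of how the competition corpus was actually built.

```python
import random


def obfuscate(words: list, n_ops: int, rng: random.Random) -> list:
    """Apply n_ops random word-level operations to a copy of words."""
    words = list(words)
    if not words:
        return words
    for _ in range(n_ops):
        op = rng.choice(("delete", "insert", "swap"))
        if op == "delete" and len(words) > 1:
            # Drop one word at a random position.
            del words[rng.randrange(len(words))]
        elif op == "insert":
            # Re-insert a word from the passage at a random position.
            words.insert(rng.randrange(len(words) + 1), rng.choice(words))
        elif op == "swap" and len(words) > 1:
            # Exchange two words at distinct random positions.
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    return words
```

The more operations you apply, the less readable the result gets, which is why a detector that copes with such cases should have an easier time with the readable ones.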
Try this: http://www.netspeak.org/?query=*%20microsoft%20sucks%20*
This is overestimated by far. Depending on how elaborate your edit model is, you can analyze edits live on a laptop.
We have studied the accuracy of ClueBot and found that (on a small corpus) it has very good precision (few false positives among the edits it flags), but very low recall (a low true positive rate, meaning it misses most vandalism). (See: http://www.uni-weimar.de/medien/webis/publications/downloads/papers/stein_2008c.pdf) But the picture might look quite different on a large scale.
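In case the two measures are unfamiliar, here is a minimal sketch of how they are computed from counts of true positives (tp), false positives (fp), and false negatives (fn). The function name is made up for this example, and the counts in the usage note are purely illustrative, not the numbers from the paper.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall) for a vandalism detector.

    tp: vandalism edits correctly flagged
    fp: regular edits wrongly flagged
    fn: vandalism edits that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A tool that flags 10 edits, 9 of them rightly, while missing 91 further vandalism edits, would be called as `precision_recall(9, 1, 91)`: high precision, low recall.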
I'd say if you obfuscate something enough it eventually becomes an original. Paraphrases are originals, aren't they?