Competition Seeks Best Approaches To Detecting Plagiarism
marpot writes "Does your school or university check your homework and theses for plagiarism? Nowadays, probably yes, but are they doing it properly? Little is known about plagiarism detection accuracy, which is why we are conducting a competition on plagiarism detection, sponsored by Yahoo! We have set up a corpus of artificial plagiarism that contains plagiarism with varying degrees of obfuscation, as well as translation plagiarism from Spanish or German source documents. A random plagiarist was employed who attempts to obfuscate his plagiarism with random sequences of text operations, e.g., shuffling, deleting, inserting, or replacing a word. Translated plagiarism is created using machine translation."
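For anyone wondering what a "random plagiarist" of this kind might look like, here is a minimal sketch in Python. It is not the competition's actual generator: the function name `obfuscate`, the `FILLER_WORDS` list, and the parameters are all made up for illustration. It just applies a random sequence of the four word-level operations the summary names.

```python
import random

# Hypothetical filler vocabulary for inserts/replacements (an assumption,
# not taken from the competition corpus).
FILLER_WORDS = ("the", "of", "and")

def obfuscate(text, n_ops=5, vocabulary=FILLER_WORDS, seed=None):
    """Apply n_ops random word-level operations to text."""
    rng = random.Random(seed)  # private generator so a seed gives repeatable output
    words = text.split()
    for _ in range(n_ops):
        op = rng.choice(("shuffle", "delete", "insert", "replace"))
        if op == "shuffle":
            rng.shuffle(words)
        elif op == "delete" and len(words) > 1:
            # Never delete the last remaining word.
            del words[rng.randrange(len(words))]
        elif op == "insert":
            words.insert(rng.randrange(len(words) + 1), rng.choice(vocabulary))
        elif op == "replace" and words:
            words[rng.randrange(len(words))] = rng.choice(vocabulary)
    return " ".join(words)
```

Calling `obfuscate("some copied sentence from a source", n_ops=3, seed=1)` returns a scrambled variant; raising `n_ops` would correspond, roughly, to a higher degree of obfuscation.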
The tools are fairly good, but, in my experience, they'll always report 3-7% or so of your paper as plagiarized, just because it's pretty difficult to write about _anything_ without unknowingly using previously written words. I would _hope_ that anyone who would pursue disciplinary action from such a tool's results would at least take a look to see if the sections being flagged are consequential.
I have no idea how good they are with catching paraphrasing, though... it strikes me that the semi-intelligent plagiarizers would be doing that more than a straight copy and paste. There's also the "acceptable vs unacceptable" distinction to be made.
Plausible conjecture should not be misrepresented as proof positive.
I once was on a Fido forum with someone who would often write responses nearly word-for-word identical to mine. It was uncanny; I'd see his post and recognize my own writing, only to realize it wasn't mine. Timestamps would sometimes show my post was written first, sometimes his. I imagine some others on the forum thought at least one of us was a sock puppet, but neither of us was.
(If he's on slashdot, he's probably composing a post just like this one)
That probably happens rarely. But build a big enough database, and it will happen often. Particularly given the restricted problem domains in undergraduate papers. It's not just a computer problem; even humans will think "plagiarism" when they see two papers with similar ideas and similar turns of phrase. Which I think demonstrates that plagiarism cannot be established satisfactorily merely by showing similarity between papers.
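A back-of-the-envelope sketch of the "big enough database" point, in Python. The per-document probability `p` is a made-up number, and the calculation assumes every comparison is independent, which real papers on the same assigned topic certainly are not; the function name is hypothetical too.

```python
def chance_of_coincidental_match(p, n_documents):
    """Probability of at least one chance match against n_documents,
    assuming each comparison independently matches with probability p."""
    return 1.0 - (1.0 - p) ** n_documents

# Illustrative only: the same tiny per-document probability,
# checked against databases of growing size.
for n in (1_000, 100_000, 10_000_000):
    print(n, chance_of_coincidental_match(1e-6, n))
```

Under these assumptions the chance of some innocent match climbs toward certainty as the database grows, even though any single comparison almost never matches.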
My wife teaches for Phoenix. Probably 90% of the plagiarism she sees is from students copying and pasting whole papers word-for-word from random cheat sites. Occasionally she'll get someone who fails to properly quote sources, but that's very much the minority. For the most part, the cheaters aren't all that bright, nor do they try to hide their cheating. They're just hoping they get away with it.
Plagiarism is a symptom of professors only being involved in the last step: reviewing the final product.
Require the students to submit multiple drafts. Meet with them for 15 minutes each and discuss their thought processes on the ongoing paper. You'll get better final products, teach people not to procrastinate, and smoke out people who have no involvement in their "own work."
What, can't do that because you have 60 students in a class? Well, there's part of the problem too.
We're trying to find a technology solution to a problem caused by too little student-teacher interaction. Typical!
Seriously, the humanities are in trouble. With over 6 billion people on the planet, it's extremely difficult to have an original thought. This sets the stage for endless repetition. Add to that the fact that teaching the humanities usually means imparting a teacher's single interpretation of the source material to the students. They then do the natural thing when it comes to writing a paper: they parrot back to the teacher what they've heard, knowing that's the only way to get a good grade. The resulting combination is deadly.
The papers are all going to be similar from the beginning, because it's a rare instructor who actually encourages dissenting opinions (and that fault in teaching is a whole other discussion of its own). Then the papers are going to be similar because there really are only so many ways to interpret the source material that are defensible. And finally, the papers are heavily likely to be similar to at least one other paper written about the subject, when every paper ever written on the subject is considered (exactly what the plagiarism sites attempt to do).
I think the problem this competition is trying to solve is intractable in the face of the current educational system. It's gotten to the point where, if the software considers a large enough number of sources, even the instructor's own papers are going to look like plagiarism.
Hell, look at the Slashdot comment system. A million people read the front page, but only a few thousand post comments. Thousands more are content to simply moderate the comments, and face it, comments they agree with are more likely to be modded up, one way or another. Then compare the modded comments. We get a lot of duplicate or near duplicate thought, and hence near duplicate comments on every article. Why? Because when you get enough people together in one place, discussing the same subject in writing, there are only so many viewpoints and only so many comments that won't get modded down for being of the "cubic what?" variety.
Time to go back to grading on spelling and grammar. We've reached the end of the grading on ideas road. Coherency of presentation is all we have left. (One could argue it's all we ever had.)
I realize that plagiarism detection represents an interesting problem in computer science, and that it goes some distance toward solving a serious problem. However, I read an article in the Chronicle of Higher Education, behind a paywall, alas, which leads me to believe that it is only a partial solution to academic dishonesty. The article suggested that, thanks to the Internet, the costs of human capital are now so low that hiring a ghostwriter to compose one's papers, sidestepping the problem of plagiarism to begin with, is far more expedient than plagiarism itself. It described a Russian-"businessman"-headed network of Filipino paper-writers, most paid between $1 and $3 a page, who are able to market their services to the West through a web site and remote call centers. At $20/page to the end-user, with no possibility of plagiarism detection, I think that most desperate students would find this a good deal. In my opinion, ghostwriting will supplant plagiarism as time goes on.
What is a teacher to do? In-class writing samples would seem to be the only hope of detecting ghostwriting. Students could, of course, argue that at home, they can "polish" their papers, and that therefore they will not resemble the in-class samples. Moreover, checking samples against papers is a thankless and time-consuming task which is only a preliminary to actually evaluating the work. Perhaps there is a computer-based solution to this, but, in the meantime, perhaps potential ghostwriting customers could take their desires to their logical conclusion, and simply buy their degrees on the Internet directly.
"Imaginary solutions to real problems."
The Computer Science department at my uni routinely scans final year dissertations using automated software. Mine was flagged up as "possibly plagiarised"; a significant amount of content could be found elsewhere on the web (can't remember the exact percentage).
My project supervisor said he was very surprised when he got the email from the system saying it came back positive, given the small amount of research in the area (there are only 5 or 6 papers on the same topic that I am aware of) and the lack of any other research on that exact method of solving the problem.
When I found this out I was more than a little worried, since I wasn't aware of copying any other work. It turns out that it had picked up on stupid stuff, like the boilerplate at the beginning of the dissertation, or phrases like "In conclusion,", and nothing longer than 3 or 4 words in any paragraph.
This sort of plagiarism detection that detects word shuffling is fine for people who REALLY don't have a clue (i.e. the ones who forget to change the @author javadoc tag when copying their friend's Java coursework), but it would still be relatively trivial to change enough words in a sentence to fool the system.
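To make that concrete, here is a rough sketch in Python of the simplest kind of matcher, one that counts shared word n-grams. I have no idea what my uni's software really does internally, so treat this as an assumed stand-in; the function names and the default `n=4` are mine.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in text (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(suspect, source, n=4):
    """Fraction of the suspect's word n-grams that also occur in source."""
    suspect_grams = word_ngrams(suspect, n)
    if not suspect_grams:
        return 0.0  # text shorter than n words: nothing to compare
    return len(suspect_grams & word_ngrams(source, n)) / len(suspect_grams)

source = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown cat jumps over the sleepy dog"

print(overlap_fraction(suspect, source, n=4))  # 0.0
print(overlap_fraction(suspect, source, n=2))  # 0.5
```

Swapping two words out of nine wipes out every 4-word match, while dropping to 2-word windows catches half the text but would just as happily flag "in conclusion" in everybody's dissertation. That is the trade-off in a nutshell.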