Competition Seeks Best Approaches To Detecting Plagiarism
marpot writes "Does your school/university check your homeworks/theses for plagiarism? Nowadays, probably Yes, but are they doing it properly? Little is known about plagiarism detection accuracy, which is why we conduct a competition on plagiarism detection, sponsored by Yahoo! We have set up a corpus of artificial plagiarism which contains plagiarism with varying degrees of obfuscation, and translation plagiarism from Spanish or German source documents. A random plagiarist was employed who attempts to obfuscate his plagiarism with random sequences of text operations, e.g., shuffling, deleting, inserting, or replacing a word. Translated plagiarism is created using machine translation."
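The "random plagiarist" in the summary can be pictured with a small sketch. This is only an illustration under assumptions, not the organizers' actual generator: the function name `obfuscate`, the filler `vocabulary`, the four-word shuffle window, and the `seed` parameter are all made up for the example.

```python
import random

def obfuscate(text, n_ops=5, vocabulary=("lorem", "ipsum", "dolor"), seed=None):
    """Apply n_ops random word-level operations to text (hypothetical sketch)."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_ops):
        op = rng.choice(["shuffle", "delete", "insert", "replace"])
        if op == "shuffle" and len(words) > 1:
            # Shuffle a short window of adjacent words (window size is an assumption).
            start = rng.randrange(len(words) - 1)
            end = min(len(words), start + 4)
            window = words[start:end]
            rng.shuffle(window)
            words[start:end] = window
        elif op == "delete" and len(words) > 1:
            # Drop one word at a random position.
            del words[rng.randrange(len(words))]
        elif op == "insert":
            # Insert a filler word at a random position.
            words.insert(rng.randrange(len(words) + 1), rng.choice(vocabulary))
        elif op == "replace" and words:
            # Overwrite one word with a filler word.
            words[rng.randrange(len(words))] = rng.choice(vocabulary)
    return " ".join(words)
```

Raising `n_ops` gives a heavier degree of obfuscation; passing the same `seed` reproduces the same mangled text.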
Now, I understand that plagiarism is common among the weakest of undergrad writers; but "machine translation from Spanish or German source documents" and "random text operations" seem like unrealistic experimental stimuli.
In order to be a success, a plagiarized paper has to survive scrutiny by automated systems, if any are deployed, and human graders, if any are paying attention. Machine translation and text mangling should trivially defeat automated systems, at least any that aren't cranked well into World o' false positives territory; but would they pass human scrutiny? Even if they did, handing in something produced by machine translation and text mangling would probably earn you a referral to "Remedial English 101 For Life".
Simply using words would not constitute plagiarism. You just can't allow students to use words that somebody else has used before.
For more information on this technique, please read my recent paper, Clickous Verandim Redundo Berata Quizzomandus.
He's getting rather old, but he's a good mouse.
Just imagine everyone's surprise when all the entrants turn in the exact same process.
If brevity is the soul of wit, then how does one explain Twitter?
A plagiarised paper just smells bad, and is characterized by shifts in voice and writing style and sudden ignorance of the critical points raised earlier. The same author who can't write a grammatically correct sentence one moment is throwing down complex constructions the next. The harder part is identifying the source of the plagiarism. For undergraduate papers, even the harder part is trivial. After all, the point of plagiarism is that the author is too lazy to write anything original.
For academics (professors), the situation isn't all that different. Plagiarism is usually a mix of stupidity, laziness and pressure to get stuff done. It usually happens where big, popularizing authors try to rip off the obscure ones (go back twenty years a la Mr. Ambrose, or pick something in a different language, preferably Italian), or when someone needs a book in an obscure field, and tries to pirate something really obscure.
Even so, if a plagiarist has enemies who give a damn, they can find the source fairly fast. So why construct a test for the most obfuscated cases, when a plagiarist clever enough to obfuscate could simply write something original and sufficiently clever?
... use the same system the US Patent Office uses for finding prior art.
On second thought, scratch that idea.
Have gnu, will travel.
Calculate an md5 hash of the paper, if it matches the md5 of another, it's plagiarized.
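Taken literally, the md5 idea looks like the sketch below. The helper names `md5_of` and `is_exact_copy` and the sample sentences are invented for illustration. It catches byte-for-byte copies only, so changing a single word is enough to slip past it.

```python
import hashlib

def md5_of(text):
    """Hex MD5 digest of the paper's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def is_exact_copy(paper, known_hashes):
    """True only if the paper is byte-for-byte identical to a known one."""
    return md5_of(paper) in known_hashes

original = "Plagiarism is the act of presenting someone else's work as your own."
tweaked = original.replace("act", "practice")  # one-word edit
known = {md5_of(original)}

caught_copy = is_exact_copy(original, known)   # identical text matches
caught_tweak = is_exact_copy(tweaked, known)   # edited text does not match
```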
It's a monkeys-on-a-typewriter thing. These companies add papers to their database as they compare them. If you feed enough papers into a database, eventually they will all come back plagiarized. There are not an infinite number of possible term papers; there are only so many things that could be written for a topic that make sense, and most English teachers recycle topics. Why English departments buy into this I don't understand. Let it go for long enough (it would only take another decade or two at most) and you will start getting people who didn't even know they were plagiarizing getting kicked out of college. I'm not talking about improper citations; I'm talking about a guy in Washington having the same idea as a guy in New York 20 years later. I'm not a lawyer, so I don't know if this is possible, but couldn't they copyright these databases in some form or render them proprietary? If they did that, their business model could change to just collecting royalties.
I once was on a Fido forum with someone who would often write responses nearly word-for-word identical to mine. It was uncanny; I'd see his post and recognize my own writing, only to realize it wasn't mine. Timestamps would sometimes show my post was written first, sometimes his. I imagine some others on the forum thought at least one of us was a sock puppet, but neither of us was.
(If he's on slashdot, he's probably composing a post just like this one)
That probably happens rarely. But build a big enough database, and it will happen often. Particularly given the restricted problem domains in undergraduate papers. It's not just a computer problem; even humans will think "plagiarism" when they see two papers with similar ideas and similar turns of phrase. Which I think demonstrates that plagiarism cannot be established satisfactorily merely by showing similarity between papers.
Plagiarism is a symptom of professors only being involved in the last step: reviewing the final product.
Require the students to submit multiple drafts. Meet with them for 15 minutes each and discuss their thought processes on the ongoing paper. You'll get better final products, teach people not to procrastinate, and smoke-out people who have no involvement in their "own work."
What, can't do that because you have 60 students in a class? Well, there's part of the problem too.
We're trying to find a technology solution to a problem with less student-teacher interaction. Typical!
Law enforcement uses automated fingerprint detection to identify possible matches. It never claims a match based on the computer.
Using a program as the sole plagiarism judge and jury is profoundly unfair. If a university wants to discipline a student for a plagiarism hit, then it needs to obtain the source document--and pay the source document's creator if necessary to obtain it.
Confronting the student with the alleged source gives the student a fair chance to defend himself/herself.
Seriously, the humanities are in trouble. With over 6 billion people on the planet, it's extremely difficult to have an original thought. This sets the stage for endless repetition. Add to that the fact that the very process of teaching the humanities usually means imparting a teacher's single interpretation of the source material to the students who then do the natural thing when it comes to writing a paper and parrot back to the teacher what they've heard, knowing that's the only way to get a good grade, and the resulting combination is deadly.
The papers are all going to be similar from the beginning, because it's a rare instructor who actually encourages dissenting opinions (and that fault in teaching is a whole other discussion of its own). Then the papers are going to be similar because there really are only so many ways to interpret the source material that are defensible. And finally, the papers are heavily likely to be similar to at least one other paper written about the subject, when every paper ever written on the subject is considered (exactly what the plagiarism sites attempt to do).
I think the problem this competition is trying to solve is intractable in the face of the current educational system. It's gotten to the point where, if the software considers a large enough number of sources, even the instructor's own papers are going to look like plagiarism.
Hell, look at the Slashdot comment system. A million people read the front page, but only a few thousand post comments. Thousands more are content to simply moderate the comments, and face it, comments they agree with are more likely to be modded up, one way or another. Then compare the modded comments. We get a lot of duplicate or near duplicate thought, and hence near duplicate comments on every article. Why? Because when you get enough people together in one place, discussing the same subject in writing, there are only so many viewpoints and only so many comments that won't get modded down for being of the "cubic what?" variety.
Time to go back to grading on spelling and grammar. We've reached the end of the grading on ideas road. Coherency of presentation is all we have left. (One could argue it's all we ever had.)
Here's a good article explaining how Google makes plagiarism detection easy: http://questioncopyright.org/node/4 There was a story a couple years ago about one of these plagiarism detection services, Turnitin, getting sued for copyright infringement... does anyone know if that went anywhere? http://education.zdnet.com/?p=953
Plagiarism is a symptom of professors only being involved in the last step: reviewing the final product.
Require the students to submit multiple drafts. Meet with them for 15 minutes each and discuss their thought processes on the ongoing paper. You'll get better final products, teach people not to procrastinate, and smoke-out people who have no involvement in their "own work."
What, can't do that because you have 60 students in a class? Well, there's part of the problem too.
We're trying to find a technology solution to a problem with less student-teacher interaction. Typical!
I never taught a class involving humanities paper writing (in the science classes I taught, I could detect borrowed work by asking our kids to explain the calculations in their presentations and reports), but my wife meets with her students at least once after they turn in a required outline and bibliography to her. The bibliography, meeting, and my wife's extensive knowledge of scholarship in her field have made plagiarism rare and very obvious. Also, they make the students write vastly better papers and learn a lot more. Even having students meet with a TA to discuss paper ideas and progress is a huge help, and required outlines, drafts, and (especially) bibliographies should be part of the writing process in every lower level undergrad class. In upper level classes, the meeting is sufficient.
"I zero-index my hamsters" - Willtor (147206)
This is a useful mechanism for search engines, which need to distinguish original content from hundreds or thousands of blogs echoing it. Imagine the Web with all the duplicate, repetitive material ignored. No wonder Yahoo is supporting this. Someone over there is thinking.
I realize that plagiarism detection represents an interesting problem in computer science, and that it goes some distance toward solving a serious problem. However, I read an article in the Chronicle of Higher Education, behind a paywall, alas, which leads me to believe that it is only a partial solution to academic dishonesty. The article suggested that, thanks to the Internet, the costs of human capital are now so low that hiring a ghostwriter to compose one's papers, sidestepping the problem of plagiarism to begin with, is far more expedient than plagiarism itself. It described a Russian-"businessman"-headed network of Filipino paper-writers, most paid between $1 and $3 a page, who are able to market their services to the West through a web site and remote call centers. At $20/page to the end-user, with no possibility of plagiarism detection, I think that most desperate students would find this a good deal. In my opinion, ghostwriting will supplant plagiarism as time goes on.
What is a teacher to do? In-class writing samples would seem to be the only hope of detecting ghostwriting. Students could, of course, argue that at home, they can "polish" their papers, and that therefore they will not resemble the in-class samples. Moreover, checking samples against papers is a thankless and time-consuming task which is only a preliminary to actually evaluating the work. Perhaps there is a computer-based solution to this, but, in the meantime, perhaps potential ghostwriting customers could take their desires to their logical conclusion, and simply buy their degrees on the Internet directly.
"Imaginary solutions to real problems."
The Computer Science department at my uni routinely scans final year dissertations using automated software. Mine was flagged up as "possibly plagiarised"; a significant amount of content could be found elsewhere on the web (can't remember the exact percentage).
My project supervisor said that when he got the email from the system saying it came back positive, he was very surprised, given the small amount of research in the area (there are only 5 or 6 papers on the same topic that I am aware of) and no other research on that exact method of solving the problem.
When I found this out I was more than a little worried - I wasn't aware of copying any other work. It turns out that it had picked up on stupid stuff, like the boilerplate at the beginning of the dissertation, or phrases like "In conclusion,", and nothing longer than 3 or 4 words in any paragraph.
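A rough sketch of why short matches are weak evidence, assuming a detector that compares word n-grams (I don't know what the university's software actually does; `shingles`, `overlap`, and the sample sentences are made up). With a small n, stock phrases like "In conclusion" register as hits; requiring longer runs makes them disappear.

```python
import re

def shingles(text, n):
    """Set of lowercase word n-grams in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(suspect, source, n=5):
    """Fraction of the suspect's word n-grams that also occur in the source."""
    suspect_grams = shingles(suspect, n)
    if not suspect_grams:
        return 0.0
    return len(suspect_grams & shingles(source, n)) / len(suspect_grams)

suspect = "In conclusion, the proposed method works well on small graphs."
source = "In conclusion, we find that the results are inconclusive."

short_hits = overlap(suspect, source, n=2)  # nonzero: shares "in conclusion"
long_hits = overlap(suspect, source, n=5)   # zero: no shared five-word run
```

The choice of n is the trade-off: small values flag boilerplate, large values are easier to dodge by swapping a word here and there.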
This sort of plagiarism detection that detects word shuffling is fine for people that REALLY don't have a clue (i.e. the ones that forget to change the @author javadoc tag when copying their friend's Java coursework), but it would still be relatively trivial to change enough words in a sentence to fool the system.
If you have graded more than 2 assignments in your life, and really read each and every paper, and provided good critical feedback, then it is really easy to spot a plagiarized paper.
Also, a grader usually knows the subject matter and has read many other good and bad works on the subject. You can get a feel for a person's writing style and depth of knowledge on a subject in just a few sentences. When you "smell something fishy", it usually is.
So far, whenever I "smell something fishy" I try to find the best sentence near the fishiness and paste it into Google. Plagiarists are not going to rewrite every sentence; if they do, then they probably learned something anyway. No, plagiarists are just lazy and in a hurry and deep down they know they deserve to be caught.
- I live the greatest adventure anyone could possibly desire. - Tosk the Hunted
They are trying to invalidate plagiarism detection software by proving that you can still manage to plagiarise in a way it won't detect (false negative). The thing is, this isn't the problem with plagiarism software; the real problem is where it detects plagiarism when none in fact took place (false positive). This will happen in a few ways:
1) There have been several highly publicized incidents where students have been in big trouble for plagiarising their own work. This is ludicrous, they wrote it in the first place!
2) A large enough database of phrases, paragraphs, etc. will eventually encompass the majority of ways of phrasing a particular idea, therefore when discussing an existing idea the odds of saying something that has been said before will eventually approach certainty.
Now this wouldn't necessarily apply if you were inventing a whole new concept, but in most classes that is not what you are being asked to do; instead you are asked to research how something has already been done. There is bound to be duplication here, especially as the database grows. This doesn't mean you plagiarised something, merely that someone else has worded something similarly in the past. (For it to be plagiarism you would have had to have seen and copied that earlier work; in this case you may not even know about it.)
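The database-growth argument can be put in back-of-envelope form. This is a sketch under strong assumptions: every stored document has the same small chance `p` of coincidentally matching an honest paper, and matches are independent. Both `p` and the function name are hypothetical.

```python
def chance_of_false_match(p, n):
    """Probability that an honest paper coincidentally matches at least one
    of n stored documents, assuming independent per-document chance p."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical per-document chance of one in a million.
small_db = chance_of_false_match(1e-6, 10_000)
large_db = chance_of_false_match(1e-6, 10_000_000)
```

Under these assumptions the chance of a false hit is small for the modest database and close to certain for the large one, which is the point made above.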
And the Postmodernism Generator?
You don't have to write much of anything at all. Would you get a good grade? Fuck no. Would they FLUNK YOU FOR IT? Fuck no. Because it's graded by untenured faculty who have to curry favour with students, or it's graded by Grad Assistants who don't give a shit, and why should they?
Oh, look, a paper by Cindy Bleethstain. She's a fucking idiot. Let's see. Hmmmm. Yup. Incomprehensible bullshit, as usual. Give her a C+ because some of it is intelligible and kind of funny.
Oh, look another paper by Guido LeDouchebag. Bottlecaps are smarter than this turnip. Hmmm. Yup. More incomprehensible bullshit. C+. At least he finally discovered the spellchecker.
THAT'S what it is often like, unfortunately.
I read the paper, and if there is a passage that is noticeably different in tone, I'll copy-paste a section into Google and see where they pulled it. 9 times out of 10, it's a direct lift from a web page, unattributed. I send it back, and tell them "Footnotes, please. Also, automatic single grade loss. Right off the top."
If it comes back still broken, then I nail 'em for plagiarism. It's a big deal, and requires paperwork I don't like to fill out...
So far I've only had one student have the cojones to not bother fixing their attributions, and he got crucified by the Ethics board. He was an arrogant little prick, too.
RS
Shoes for Industry. Shoes for the Dead.
The students cannot fake it, if the teacher cares about them learning.
Many many many moons ago, I was a Chem. Eng. grad student. This was before the internet existed, and before my beard had turned gray. One of my duties to pay my way was supervising a lab course for undergrads, and marking the students' lab reports (they were expected to produce about 20 pages per week just on this one lab course). I insisted on interviewing them individually on their reports, where they had to explain their results and conclusions. Nobody tried faking anything twice, because it was caught immediately; they had to read up and understand the background, or they were in deep shit. That class got the highest average mark ever in the year-end exam on the associated theory (the professor was pleasantly surprised).
Those who can make you believe absurdities can make you commit atrocities. - Voltaire