Slashdot Mirror


Software Finds Plagiarism In Research

shmG writes "Researchers from the Virginia Bioinformatics Institute have created a seek-and-destroy program — for plagiarism. Called ET Blast, it's designed to find plagiarism in scientific papers. It does a full-text analysis, and then looks for similar publications in several databases. 'We have better literature,' Garner said. 'There are abstracts and full papers, and a database called Crisp, where you compare stuff to every grant the NIH gets. It's compared to any research that's been funded.'"

111 comments

  1. What about ... by gstoddart · · Score: 3, Interesting

    What about academic "recycling".

    I remember being told a long time ago that some researchers will basically make several permutations of the same paper to submit to a bunch of different places. It's essentially the same paper, with nothing new in it, but if you can get several places to publish it, you can pad out your publications list.

    --
    Lost at C:>. Found at C.
    1. Re:What about ... by notgm · · Score: 4, Insightful

      if you resubmit your own work, it's not plagiarism.

    2. Re:What about ... by notgm · · Score: 1

      rewriting your own articles isn't classified as stealing.

    3. Re:What about ... by Travelsonic · · Score: 1

      Nor is plagiarism - plagiarism is fraud. The idea that it is more than that, IMO of course, is built on a fallacy driven need to drill into the heads of minds [like myself] about the seriousness of it. I can't help, however, in thinking that they have gone overboard and instead of effectively teaching us away, they seem more bent on SCARING THE EVERLOVING SHIT out of us and that, IMO, does NOT help in understanding ANY material/concept/etc.

      --
      If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot
    4. Re:What about ... by tmosley · · Score: 1

      I'd hope not. For most of the papers and grants coming out of our lab, we use the same introduction, and many of the procedures are the same, so many sections are just cut and pasted with a few words changed here or there to fit the particular experiment or project we were working on. It would be a major pain in the ass if we had to start rewriting that crap every time.

    5. Re:What about ... by Chris+Mattern · · Score: 1

      if you resubmit your own work, it's not plagiarism.

      It is, however, fraud in most cases, since most scientific journals require that papers submitted to them be research that is unpublished and not currently submitted for publication elsewhere.

    6. Re:What about ... by gstoddart · · Score: 1

      I'd hope not. For most of the papers and grants coming out of our lab, we use the same introduction, and many of the procedures are the same, so many sections are just cut and pasted with a few words changed here or there to fit the particular experiment or project we were working on.

      Oh, I get that you may do a series of experiments all with some commonality. That is fine.

      I'm specifically talking about people who essentially recycle the same paper several times with no material changes to any of the research or conclusions. That to me is bordering on being a little dodgy.

      But, maybe that's the norm.

      --
      Lost at C:>. Found at C.
    7. Re:What about ... by PvtVoid · · Score: 1

      I remember being told a long time ago that some researchers will basically make several permutations of the same paper to submit to a bunch of different places. It's essentially the same paper, with nothing new in it, but if you can get several places to publish it, you can pad out your publications list.

      So what? You can't plagiarize yourself. Researchers put out multiple, nearly identical papers all the time, especially those published in conference proceedings. (For example, this guy just got elected vice president of the American Physical Society.) It's also very common to recycle review material from one paper you have written to use in another.

      This is entirely distinct from university academic misconduct policies which require papers and so forth submitted in fulfillment of course requirements to be original, i.e. not submitted to other courses.

    8. Re:What about ... by Anonymous Coward · · Score: 0

      if you resubmit your own work, it's not plagiarism.

      My university begs to differ.

      How does one even redo a very similar assignment when you have the same thought process? There are bound to be similarities that their plagiarism detector would flag.

    9. Re:What about ... by crmarvin42 · · Score: 1

      When you publish in a scientific journal you hand over the copyright to the work. Therefore if you publish those results again, without citing the previous publication, you can be sued for copyright violations by the original publishing journal. I've heard of authors being banned from a specific journal for such behavior. It is not "Plagiarism" in that you are not taking credit for someone else's work, but it is academic dishonesty of the sort that can ruin a career.

      For example, my advisor gave a (Review) presentation at a scientific meeting (complete with manuscript for the Conference Proceedings). It was well received so he was then asked by the society holding the conference to submit the proceeding manuscript to their scientific journal. They wanted to make sure that it could be read by their subscribers even if they couldn't make it to the conference. My advisor spent a lot of time on the phone with them making sure that they would prominently denote that the second publication was a re-print with a reference to the original conference proceedings. He didn't want to be accused of this "Recycling" that you are referring to. I won't pretend some don't do it, but no one at a respectable university.

      Besides, most researchers work in one small area. Usually the number of interested individuals is small enough where everyone knows everyone else. If I tried what you describe, I know a lot of people that would catch it in short order. It would be career suicide.

      --
      Bureaucracy expands to meet the needs of the expanding bureaucracy.-Oscar Wilde
    10. Re:What about ... by Anonymous Coward · · Score: 0

      From ACM's "Policy and Procedures on Plagiarism" (http://www.acm.org/publications/policies/plagiarism_policy):

      "Self-plagiarism is a related issue. In this document we define self-plagiarism as the verbatim or near-verbatim reuse of significant portions of one's own copyrighted work without citing the original source[*]. Note that self-plagiarism does not apply to publications based on the author's own previously copyrighted work (e.g., appearing in a conference proceedings) where an explicit reference is made to the prior publication[**]. Such reuse does not require quotation marks to delineate the reused text but does require that the source be cited."

      [*] See Collberg and Kobourov, http://portal.acm.org/citation.cfm?doid=1053291.1053293.
      [**] Manuscripts submitted to ACM Journals and Transactions based on the author’s own previously copyrighted work (e.g., appearing in a conference proceedings) must be disclosed at the time of submission and an explicit reference to the prior publication must be included in the submitted manuscript. The norm for ACM Journals and Transactions is that the submitted manuscript must contain at least 25% new content material (i.e., material that offers new insights, new results, etc.). For more details see http://www.acm.org/pubs/sim_submissions.html .

    11. Re:What about ... by robotkid · · Score: 2, Informative

      if you resubmit your own work, it's not plagiarism.

      Let me clarify the issue for those not accustomed to the rules of scientific publishing.

      There IS such a thing as self-plagiarism, and it's not necessarily a minor offense. At its core, if you submit essentially the same work to multiple venues with the intent to pass each off as an independent body of work when they are not, then there is intent to deceive and that is an ethical breach of conduct. Worst case scenario, the author list and abstract have been changed just enough that it leads others to believe this particular experiment has actually been independently confirmed and duplicated when it has not.

      Most journals require that you affirm that the same manuscript is not currently under consideration for publication in another journal and has not already been published in a highly similar form elsewhere (except maybe as a conference abstract). This is different than re-submission, where a manuscript was rejected from one publication and you are now free to send it to another venue. And then there is the copyright issue, that as authors you are not necessarily the sole copyright holder (often the journal has some claim), in which case a duplicate publication is actually a violation of the journal's copyright.

      There is also the case where one comprehensive study is artificially split into smaller, less meaningful sub-studies with the intent to pad publication counts (there was an example of a prenatal intervention study where the effects on the mothers and on the infants for the exact same study were published separately without any reference to each other, diminishing the usefulness of the study). This is not a copyright issue but a scientific integrity issue: presumably the medical audience of such a study could be harmed by not being told both sets of outcomes for the same study in any sort of obvious way.

      There is an excellent resource on what constitutes scientific plagiarism (including self-plagiarism) here: http://facpub.stjohns.edu/~roigm/plagiarism/Self%20plagiarism.html

    12. Re:What about ... by buchner.johannes · · Score: 1

      In the dejavu subsite, most publications share an author, so yes, it does "recycling" / self-plagiarism.

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    13. Re:What about ... by Travelsonic · · Score: 1

      Somebody should tell ACM that plagiarism has jack shit to do with copyright in and of itself - copyright infringement =/= plagiarism. Plagiarism might involve copyright infringement, but that is NOT how it is determined.

      --
      If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot
    14. Re:What about ... by Phoghat · · Score: 1

      It's not plagiarism it's bullshit

      --
      Think of how stupid the average person is, and realize half of them are stupider than that.
  2. does it translate to chinese? by Rivalz · · Score: 1

    Would be nice to widen it to IP & Copyright infringement.

  3. How is this different from Turnitin? by mlts · · Score: 4, Informative

    This sounds almost exactly like turnitin.com where when one uploads a paper to it, it searches almost anything it can get ahold of and will list any text in any academic journal that is copied verbatim.

    1. Re:How is this different from Turnitin? by Anonymous Coward · · Score: 0

      Exactly like turnitin.com? Maybe VBI plagiarized.

    2. Re:How is this different from Turnitin? by Anonymous Coward · · Score: 1, Informative

      Those cunts @ turnitin archive *YOUR* paper for eternity (without payment and without any course for redress) to achieve network effects and enhance their service.

    3. Re:How is this different from Turnitin? by Anonymous Coward · · Score: 0

      FTFA: Unlike plagiarism detectors, such as Turnitin.com, he said ET Blast uses different databases for comparative analysis. "We have better literature," Garner said. "There are abstracts and full papers, and a database called Crisp, where you compare stuff to every grant the NIH gets. It's compared to any research that's been funded."

    4. Re:How is this different from Turnitin? by robotkid · · Score: 1

      This sounds almost exactly like turnitin.com where when one uploads a paper to it, it searches almost anything it can get ahold of and will list any text in any academic journal that is copied verbatim.

      An apt analogy. Imagine the following scenario: you are simultaneously enrolled in two classes that both require a lengthy essay which constitutes a large portion of your final grade. You find the two assignments to have similar enough parameters and decide to submit the same essay to both teachers without any prior approval for the double-dipping, thus making it appear you have spent more effort than you actually have. You are only "plagiarizing yourself", so no harm, right?

      Doubtful.

      Self-plagiarism is unethical if there is an intent to deceive; it has little to do with copyright. Teachers aren't using turnitin to make sure copyright infringers pay royalties, they are trying to find out who is cheating.

    5. Re:How is this different from Turnitin? by buchner.johannes · · Score: 1

      If you want to know the difference between this and turnitin, you'd have to read the article, it specifically mentions a few differences...

      --
      NB: The message above might reflect my opinion right now, but not necessarily tomorrow or next year.
    6. Re:How is this different from Turnitin? by redbeard55 · · Score: 2, Interesting

      There is no harm; you have done the required work. Using your work in more than one place doesn't harm anyone. Assertions to the contrary are ridiculous.

    7. Re:How is this different from Turnitin? by arkenian · · Score: 1

      Not true. A college credit is awarded based on the amount of work you've done in a given subject area. If they're giving you two credits for one credit of work, that's just as wrong as if I, as a government contractor, bill my time twice because a paper I wrote applies to two projects. It's not me using the work on both projects that's wrong, it's having my customer pay me twice for the same work. This is about making it harder to cheapen the degree.

  4. Inspired by the World's #1 Hacker by Anonymous Coward · · Score: 0

    So did they run it against LIGATT's Gregory Evans' titular training book on how to Become the World's #1 Hacker, 100% plagiarized? http://www.amazon.com/How-Become-Worlds-No-Hacker/dp/0982609108/ref=cm_cr_pr_product_top

  5. Spam detector? by Zerth · · Score: 1

    Even better if it will show papers that are suspiciously similar to pharmaceutical companies advertising literature.

  6. Red faces all round then.... by accessbob · · Score: 1

    Since researchers constantly plagiarize their own work in order to get their paper count up, there are going to be some very red faces....

    1. Re:Red faces all round then.... by noidentity · · Score: 2, Funny

      Since researchers constantly plagiarize their own work

      Is this where the author of something passes it off as his own? I agree, that's a terrible thing.

  7. You can't plagiarize yourself [Re:What about ...] by Geoffrey.landis · · Score: 1

    if you resubmit your own work, it's not plagiarism.

    Correct! It's amazing to see how many people don't understand this point, but it's correct: you can't plagiarize yourself, because plagiarism is the act of passing somebody else's work off as being yours.

    I hate it when researchers report the same work in many different papers, but although it is a violation of research reporting standards, and in some cases a violation of an intellectual property contract... it's not plagiarism.

    --
    http://www.geoffreylandis.com
  8. False positives/negatives by Travelsonic · · Score: 1

    I wonder, what is the false positive / false negative rate? I mean, places like turnitin.com, for example, show this problem quite well, given how even quotes - cited and all - raise some flags.

    --
    If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot
  9. There's nothing about "destroying" in the article by Zontar_Thing_From_Ve · · Score: 2, Insightful

    I can't blame the submitter for this one. The article itself uses the term "search and destroy" early on, yet says absolutely nothing about destroying anything.

  10. Shocking plagiarism already found by Anonymous Coward · · Score: 2, Funny

    They found that a research paper on hydrogen stole two-thirds from an existing paper on water.

    1. Re:Shocking plagiarism already found by Abstrackt · · Score: 1

      Thankfully, a follow-up paper on oxygen doused this explosive situation.

      --
      They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance. - Terry Pratchett
  11. CRISP no longer exists ... it's now RePORTer by Anonymous Coward · · Score: 0

    link to RePORTer: http://projectreporter.nih.gov/reporter.cfm

    in addition, it only contains funded grant materials, and only abstracts. perhaps, (s)he is referring to Pubmed and PubMedCentral (PMC).

    anyway, I'm not scared.

  12. Re:There's nothing about "destroying" in the artic by Zontar_Thing_From_Ve · · Score: 1

    Oops! Make that "seek and destroy" instead of "search and destroy", but still, it's just sensationalism.

  13. plagiarism differs in science vs. English Lit. by onionman · · Score: 5, Insightful

    I once had an English teacher who said, "If you have more than five consecutive words matching a source, without a citation then it's plagiarism." Perhaps that's how freshman writing assignments are graded, but it's silly when applied to scientific papers. Pick up any math paper on number theory, and you're bound to find the sentence "Let p be an odd prime number." without citation, but that would hardly qualify as plagiarism. Yet, syntactic matching appears to be exactly what this program is doing.

    What constitutes "plagiarism" in a scientific paper is very different from plagiarism in journalism or English literature. In scientific writing, it is expected that authors will use the same flat, impersonal style and repeat definitions and the results of others to save the reader the time of having to look them up. So, simple pattern matching between science papers will result in a great many false positives. In science (and math) writing what matters is the new result which the author is claiming. It seems to me that it would be nearly impossible for a computer program to detect the distinction.
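
    To make the false-positive worry concrete, here is a minimal sketch of the English teacher's rule, reading "more than five consecutive words" as any shared run of six words. It is not how ET Blast works (the article gives no algorithm); the function names and the two sample sentences are made up for illustration.

```python
import re

def word_ngrams(text, n=6):
    """Return the set of all runs of n consecutive lowercased words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_runs(doc_a, doc_b, n=6):
    """Runs of n consecutive words that appear in both documents."""
    return word_ngrams(doc_a, n) & word_ngrams(doc_b, n)

# Two invented, unrelated opening sentences from hypothetical number theory papers.
paper_a = "Let p be an odd prime number. We show that the sum vanishes modulo p."
paper_b = "Throughout, let p be an odd prime number and let g be a primitive root."

flagged = shared_runs(paper_a, paper_b)
for run in sorted(flagged):
    print(" ".join(run))
```

    Both papers get flagged purely because of the stock definition, which is the kind of match a naive rule cannot tell apart from real copying.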

    1. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      exactly.

      I'm afraid that someone's hard work will be destroyed because a sentence matches someone else's paper.

    2. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      "Let p be an odd prime number."

      I guess this could be considered a FACT. Facts are not covered by copyright. They are probably looking for the more obvious forms of plagiarism.

    3. Re:plagiarism differs in science vs. English Lit. by Oxford_Comma_Lover · · Score: 1

      > I once had an English teacher who said, "If you have more than five consecutive words matching a source, without a citation then it's plagiarism." Perhaps that's how freshman writing assignments are graded, but it's silly when applied to scientific papers.

      No. Just... no. It is not "silly," it is insulting, in either freshman english lit or scientific papers. Any teacher who defines plagiarism that way has a lot more to learn than he has to teach.

      --
      -- IANAL, this isn't legal advice, and definitely isn't legal advice for you. Also, Squee!
    4. Re:plagiarism differs in science vs. English Lit. by pz · · Score: 3, Insightful

      Furthermore, when a scientist has spent a number of years on a long-term research plan, the condensed versions of what he is studying become so well rehearsed that they get memorized. I have stock phrases that I use when I want to describe this or that aspect of my work because, after giving dozens of presentations about it, they are the ones that work best. They are the most highly polished and refined. They communicate the idea well. And so, they often get trotted out with every manuscript or grant application. My students and post-docs learn to use the same phrasing because, flatly, it works.

      None of the instances of those phrases or full sentences require attribution because they are all from the same motherspring of thought. We are the writers. And, as you might imagine, this might well produce a raft of false positives to a system that blindly compares text.

      --

      Put my fist through my alarm clock with its ding-dong death inside my ear. - The Blackjacks.
    5. Re:plagiarism differs in science vs. English Lit. by careysub · · Score: 1

      ... Yet, syntactic matching appears to be exactly what this program is doing.

      What constitutes "plagiarism" in a scientific paper is very different from plagiarism in journalism or English literature. In scientific writing, it is expected that authors will use the same flat, impersonal style and repeat definitions and the results of others to save the reader the time of having to look them up. So, simple pattern matching between science papers will result in a great many false positives. In science (and math) writing what matters is the new result which the author is claiming. It seems to me that it would be nearly impossible for a computer program to detect the distinction.

      Hours of speculation and typing can save one minute of reading TFA. From the article:

      "Unlike other plagiarism detectors, it does not use phrases or similar words to check for copying. Helio Text actually looks at the entirety of the text."

      So no, it does not. It uses instead some sort of similarity metric computed from analyzing the entire text. This is possibly similar to the text distance metrics used in vector space search engine models (see: en.wikipedia.org/wiki/Vector_space_model ). They will be publishing a paper online in PLoS ONE.

      --
      Starships were meant to fly, Hands up and touch the sky - Nicky Minaj
    6. Re:plagiarism differs in science vs. English Lit. by luis_a_espinal · · Score: 1

      > I once had an English teacher who said, "If you have more than five consecutive words matching a source, without a citation then it's plagiarism." Perhaps that's how freshman writing assignments are graded, but it's silly when applied to scientific papers.

      No. Just... no. It is not "silly," it is insulting, in either freshman english lit or scientific papers. Any teacher who defines plagiarism that way has a lot more to learn than he has to teach.

      Perhaps so, but I could see where such a rule could come from, and it could instill a discipline of making sure things are properly cited. Without any other context, obviously the rule is rubbish, but I could see it as an excellent rule to live by when taking freshman courses in writing/composition.

    7. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      Just whitelist common turns of phrase. Or require the length of the matches to be longer. What you're claiming isn't impossible to get around. They apply these same technologies to code as well in schools. And it works!
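
      A rough sketch of both ideas, assuming a hand-maintained whitelist and a tunable minimum run length. The whitelist contents, the function names, and the sample sentences are hypothetical; this is not taken from any real detector.

```python
import re

# Hypothetical whitelist of stock phrases that should never count as a match.
COMMON_PHRASES = {"let p be an odd prime number", "without loss of generality"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def segments(text, whitelist=COMMON_PHRASES):
    """Split the token stream at whitelisted phrases so no run can span them."""
    cleaned = " " + " ".join(tokenize(text)) + " "
    for phrase in whitelist:
        cleaned = cleaned.replace(" " + phrase + " ", " | ")
    return [seg.split() for seg in cleaned.split("|")]

def shared_runs(doc_a, doc_b, min_words=6):
    """Runs of min_words consecutive non-whitelisted words found in both texts."""
    def runs(text):
        found = set()
        for seg in segments(text):
            for i in range(len(seg) - min_words + 1):
                found.add(tuple(seg[i:i + min_words]))
        return found
    return runs(doc_a) & runs(doc_b)

original = "Let p be an odd prime number. We show that the sum vanishes modulo p."
unrelated = "Throughout, let p be an odd prime number and let g be a primitive root."
copied = "We show that the sum vanishes modulo p for every such p."

print(shared_runs(original, unrelated, min_words=4))    # only the stock phrase overlapped
print(len(shared_runs(original, copied, min_words=6)))  # the copied sentence still matches
```

      Raising min_words trades missed short copies for fewer false alarms, and the whitelist has to be curated per field, so neither knob is free.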

    8. Re:plagiarism differs in science vs. English Lit. by sribe · · Score: 1

      ...you're bound to find the sentence "Let p be an odd prime number."

      Actually, I kind of doubt you see that exact phrase very often. Although, you're certainly more likely to see it than "Let p be an even prime number."

    9. Re:plagiarism differs in science vs. English Lit. by sribe · · Score: 1

      Replying to myself, yes, I know about 2 ;-)

    10. Re:plagiarism differs in science vs. English Lit. by Antisyzygy · · Score: 1

      There are approximately 120,000 words in the English language. Most high school students probably know a third of that at most, so 40,000. It's not hard to imagine that across 1000 documents there would be a pretty high chance that 5 words would match in a sequence, especially since there is grammar as well as common ways of expressing things, such as "had been taken to heart".

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
    11. Re:plagiarism differs in science vs. English Lit. by Antisyzygy · · Score: 1

      That makes sense. So it constructs a feature vector for each text and computes their distances relative to each other. If there is a distance below some threshold then the papers are suspect.

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
    12. Re:plagiarism differs in science vs. English Lit. by onionman · · Score: 1

      ... Yet, syntactic matching appears to be exactly what this program is doing.

      What constitutes "plagiarism" in a scientific paper is very different from plagiarism in journalism or English literature. In scientific writing, it is expected that authors will use the same flat, impersonal style and repeat definitions and the results of others to save the reader the time of having to look them up. So, simple pattern matching between science papers will result in a great many false positives. In science (and math) writing what matters is the new result which the author is claiming. It seems to me that it would be nearly impossible for a computer program to detect the distinction.

      Hours of speculation and typing can save one minute of reading TFA. From the article:

      "Unlike other plagiarism detectors, it does not use phrases or similar words to check for copying. Helio Text actually looks at the entirety of the text."

      So no, it does not. It uses instead some sort of similarity metric computed from analyzing the entire text. This is possibly similar to the text distance metrics used in vector space search engine models (see: en.wikipedia.org/wiki/Vector_space_model ). They will be publishing a paper online in PLoS ONE.

      I did RTFA. However, there is no code, no algorithm description, no indication whatsoever in TFA describing exactly how their program operates. From the vague references in TFA it appears that this is nothing more than a glorified, article+abstract-wide, pattern matcher. Perhaps it is a little more clever and uses something similar to Google's page ranking algorithm via applying distance metrics to textual spaces. However, that is also a form of syntactic analysis rather than a context analysis. Barring further information on the algorithm, I can't see how your description invalidates my previous point.

    13. Re:plagiarism differs in science vs. English Lit. by Oxford_Comma_Lover · · Score: 4, Insightful

      > Perhaps so, but I could see where such a rule could come from, and it could instill a discipline of making sure things are properly cited. Without any other context, obviously the rule is rubbish, but I could see it as an excellent rule to live by when taking freshman courses in writing/composition.

      But that's half the problem. The rule may come from a desire to instill discipline, but it's just a bad rule, because it teaches that plagiarism of ideas isn't plagiarism at all, and that stringing five words together in a way that's been used before is, and that rewriting something in your own words makes it no longer plagiarism.

      Demand students live by a childish rule, and you will at best be someone they have to ignore as they try to actually learn things.

      --
      -- IANAL, this isn't legal advice, and definitely isn't legal advice for you. Also, Squee!
    14. Re:plagiarism differs in science vs. English Lit. by Lehk228 · · Score: 1

      well it's better to use constants than magic numbers anyways, though you should use a more descriptive name than p

      --
      Snowden and Manning are heroes.
    15. Re:plagiarism differs in science vs. English Lit. by Antisyzygy · · Score: 1

      It makes a feature vector of the text in its entirety, then computes the "distance" between any two vectors. This distance is computed in some high-dimensional space, I believe, not necessarily using a Euclidean metric. If the "distance" is below some threshold the papers are suspect, I'd imagine.
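
      A minimal sketch of that idea, assuming plain word-count vectors and cosine distance as the non-Euclidean metric. The article does not say what features or metric the real tool uses, and the 0.2 threshold here is an arbitrary placeholder.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words feature vector: word -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_distance(vec_a, vec_b):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def is_suspect(text_a, text_b, threshold=0.2):
    """Flag a pair of texts whose distance falls below the (assumed) threshold."""
    return cosine_distance(term_vector(text_a), term_vector(text_b)) < threshold

print(is_suspect("the cat sat on the mat", "the cat sat on a mat"))
print(is_suspect("the cat sat on the mat", "primes are odd except two"))
```

      Word counts ignore word order entirely, so this catches reshuffled copies but is easy to fool with synonyms.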

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
    16. Re:plagiarism differs in science vs. English Lit. by NoSig · · Score: 1

      Actually it is quite common in papers that deal with primes in the first place, though the phrase is more often just "Let p be an odd prime" rather than "let p be an odd prime number".

    17. Re:plagiarism differs in science vs. English Lit. by tbischel · · Score: 1

      "Pick up any math paper on number theory, and you're bound to find the sentence 'Let p be an odd prime number.' without citation, but that would hardly qualify as plagiarism."

      I wonder how often you see specifically an odd prime number... since two is the only even prime, it's really the oddest of the bunch.

    18. Re:plagiarism differs in science vs. English Lit. by chad_r · · Score: 1

      I wonder how often you see specifically an odd prime number... since two is the only even prime, it's really the oddest of the bunch.

      The answer is:

      "About 48,200 results (0.53 seconds)"

    19. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      I read a piece years ago about the people at NIH that check papers being submitted for plagiarism. One of their flags (not condemnations, but a flag requiring further checks) is 144 characters in a row the same. They noted that this was based on the phrase 'the Constitution of the United States of America'.
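
      For what it's worth, a flag like that is easy to sketch with Python's standard difflib. The 144-character cutoff is the number recalled above; reading it as "at least 144", and the function names, are my assumptions, not the actual NIH procedure.

```python
from difflib import SequenceMatcher

FLAG_LENGTH = 144  # cutoff recalled in the comment above; treated here as "at least"

def longest_shared_run(text_a, text_b):
    """Longest run of characters that appears verbatim in both texts."""
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    return text_a[match.a:match.a + match.size]

def needs_review(text_a, text_b, flag_length=FLAG_LENGTH):
    """True when the texts share a verbatim run of flag_length characters or more."""
    return len(longest_shared_run(text_a, text_b)) >= flag_length

# Invented sample: the same 180-character passage pasted into two different texts.
passage = "the quick brown fox jumps over the lazy dog. " * 4
doc_a = "Intro one. " + passage + "Ending one"
doc_b = "Different start! " + passage + "Another end"

print(needs_review(doc_a, doc_b))
print(needs_review("short text", "short test"))
```

      A hit only means a human should look; a long quoted and cited passage would trip it just as readily as a copied one.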

    20. Re:plagiarism differs in science vs. English Lit. by jvkjvk · · Score: 1

      because it teaches that plagiarism of ideas isn't plagiarism at all, and that stringing five words together in a way that's been used before is, and that rewriting something in your own words makes it no longer plagiarism.

      While I agree with your general premise about childish rules... Just no.

      Plagiarism is taking someone else's words and claiming them as your own.

      You seem to be infected by the IP bug.

      Fortunately for the rest of us, one cannot plagiarize ideas. Reformulating a concept in your own words does not count as plagiarism, nor should it.

      Regards.

    21. Re:plagiarism differs in science vs. English Lit. by sribe · · Score: 1

      Actually it is quite common in papers that deal with primes in the first place, though the phrase is more often just "Let p be an odd prime" rather than "let p be an odd prime number".

      OK. I didn't remember it phrased that way from any number theory, but that was decades ago for me. Seems to me a bit obtuse compared to calling out the exception that is being excluded. But if it's done that way, it's done that way, regardless of my opinions...

      Regarding your point on phrasing, yeah, just google the two. Yours wins 169,000 to 0.

    22. Re:plagiarism differs in science vs. English Lit. by AlecC · · Score: 1

      There is a grayer area than that. If I rewrite your book, with a paragraph-by-paragraph correspondence, the same plot, the same characters with names and appearances slightly changed, it is still changed. A book called Earl of the Rings, about a hibbit from the Shaw taking a broach to be destroyed in Mt Gloom would probably be plagiarism (unless it changed enough to become parody).

      --
      Consciousness is an illusion caused by an excess of self consciousness.
    23. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      The essential info to have here is "what type of feature vector?": if it's too focused on the words themselves, it becomes easy to foil such an algorithm (for instance, by using a thesaurus and slightly rephrasing things); on the other hand, an algorithm that tries to extract the semantics of the text into a feature vector is bound to have a higher rate of false positives, be very computationally expensive and/or create feature vectors that are larger than the original object itself (choose at least two of the three).

      So there is obviously some trade-off involved here and I'm not so sure there's an acceptable "sweet spot" (meaning, an instance of such an algorithm that provides both high sensitivity and a low false positive rate) unless these guys made some serious breakthroughs in AI/text mining.

      Ok. Enough speculation... "Virginia _Bioinformatics_ Institute" and an algorithm that contains BLAST in its name? It's obvious that this revolves around direct text-to-text alignment and not around the construction of feature vectors based on semantic information followed by calculation of distances (wikipedia confirms it: http://en.wikipedia.org/ETBLAST), which means it is bound to catch only the most obvious plagiarism (of the "Ctrl+C/Ctrl+V" type) and not the more elaborate types.

      Nothing to see here, please move along (meaning... wake me up when computers can actually assess "well done" plagiarism; although... yeah... this does look pretty neat, regardless :P)

    24. Re:plagiarism differs in science vs. English Lit. by Idarubicin · · Score: 3, Interesting

      You seem to be infected by the IP bug.

      Fortunately for the rest of us, one cannot plagiarize ideas. Reformulating a concept in your own words does not count as plagiarism, nor should it.

      You seem to be infected by a different sort of IP bug.

      Plagiarism is not the same thing as copyright infringement (though it's not uncommon for the same act to involve elements of both). One can plagiarize public domain sources. One can plagiarize ideas.

      Plagiarism is what happens when a writer presents other people's work (their words or their ideas) as his own, without giving due credit to the source. Pretending that you thought of something when you're actually just copying another author's reasoning is intellectual dishonesty, and squarely within the realm of plagiarism.

      If you copy someone's words verbatim, there is an added obligation to specifically identify the copied passage by blockquoting, using quotation marks, or otherwise clearly setting off the passage from the rest of your writing. If you're just paraphrasing, there's no obligation to use quotation marks (that would be silly) but there remains a need to properly name your source (through footnotes or other means). Rewriting someone else's work in your own words is otherwise still very much plagiarism.

      --
      ~Idarubicin
    25. Re:plagiarism differs in science vs. English Lit. by cpm99352 · · Score: 1

      Once upon a time, there *BAM*


      What a stupid rule.

    26. Re:plagiarism differs in science vs. English Lit. by Anonymous Coward · · Score: 0

      Are there any even prime numbers? ;)

    27. Re:plagiarism differs in science vs. English Lit. by canajin56 · · Score: 1

      "Syntactic matching appears to be exactly what this program is doing". At least you openly admit that you are only assuming you know how the fuck it works. Given that they are working in bioinformatics, and that it's called "ET BLAST", I'm going to go out on a limb and say that it works similarly to how BLAST works.

      When you're computing the similarity of two proteins (or DNA sequences), well, you could just put the two amino acid (or base pair) sequences side by side and count up where they match. Only, some amino acids are quite similar, so replacing one with the other probably wouldn't change the shape of the protein (at least, not enough to stop it performing its purpose). But some (like proline and just about any other aa) are quite different, and so aren't too likely to be aligned if they are matched like this. So, you have a pairwise lookup table of likelihood scores. These are usually based on the log-odds ratio, meaning the log of Pr(X and Y are related) / Pr(X and Y are randomly selected amino acids). The bottom just comes from the common distribution of all 20 aa, and the top comes from wherever you like. Probably also from observation.

      Now, that's easy! You just sum up the lookup values for all pairs aligned between sequences A and B. But substitution is not the only kind of mutation you can get. You can have amino acids cut, and have amino acids added in. So, you assign a cost for these gaps, and score the aligned parts plus the cost of the gaps. Only, now that's not trivial. Now it's an NP Hard search problem. BLAST works by an approximation of this. It finds "anchors", which are highly similar subsequences. Then it forces those to be aligned, and so breaks things into several smaller subproblems. Since it's an exponential search problem, even just cutting it in half due to a SINGLE anchor match speeds it up immensely. (Usually you'll only even bother scoring if you have several anchor matches.)

      Anywho, if you read not just TFA but TFP (the fine paper, not the blurb about the paper; its title is "Text similarity: an alternative way to search MEDLINE"), you can see how it works. It does what I guessed before reading it (based only on the name!): it does dynamic programming to align words, and ranks them by a log-odds lookup table trained on a large dataset. They do it in a hierarchical manner. They have a whole-text alignment and a sentence-based alignment. That is, in the sentence-based one, each sentence is matched up with the highest-scoring sentence in the document you're matching it to, and vice versa. And each sentence is scored via the log-odds score. The score of the document is the sum of these scores. And the whole-text score aligns everything, so it doesn't rearrange the sentences.

      In a math-paper context, your common phrase "Let p be an odd prime number" would have a very low score. Because even if all 7 words did line up perfectly, each one is a common word in a math paper, so it would have a low log-odds score. (That is, prime would appear in a lot of sentences that are totally different, so seeing prime is a very poor indicator that two sentences are related at all.) So, although it would be a "perfect" match in two different papers, the sentence score would be nearly zero, so while that sentence would not harm the score of the two papers as a whole, it also would NOT be contributing to their "sameness" very much.

      But yeah, it's still context free, which sucks for natural language, but what are you going to do? You could make it, like, a 3rd-level context-sensitive frequency count, so then your sentence "Let p be an odd prime number" becomes "NULL NULL Let" "NULL Let p" "Let p be" "p be an" "be an odd" "an odd prime" "odd prime number" "prime number NULL" "number NULL NULL". And you compare the odds of each of THOSE tokens in similar sentences to their overall occurrence across all sentences. This makes your lookup table a lot bigger, and it means you need a bigger training data set for computing that table in the first place. An
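
      To make the word-alignment idea concrete, here is a minimal sketch of a global (Needleman-Wunsch style) alignment over words, scored with log-odds weights. It is not eTBLAST's actual code: the function names, the toy probabilities, and the default_match, mismatch and gap values are all made up for illustration.

```python
import math


def log_odds_weights(related_prob, background_prob):
    """Weight per word: log(P(word | related pair) / P(word | random pair)).

    Both probability tables are hypothetical inputs; a real system would
    estimate them from a large training corpus.
    """
    return {w: math.log(related_prob[w] / background_prob[w])
            for w in related_prob}


def align_score(a, b, weights, default_match=0.1, mismatch=-1.0, gap=-0.5):
    """Best global alignment score of two word lists via dynamic programming.

    Matching words earn their log-odds weight (or default_match if the word
    is not in the table), mismatched words cost `mismatch`, and skipping a
    word in either list costs `gap`.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                pair = weights.get(a[i - 1], default_match)
            else:
                pair = mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align the two words
                              score[i - 1][j] + gap,       # skip a word of a
                              score[i][j - 1] + gap)       # skip a word of b
    return score[n][m]


# Toy numbers: "prime" is common everywhere, "riemann" is rare outside
# related sentences, so it gets a much larger weight.
weights = log_odds_weights({"prime": 0.30, "riemann": 0.20},
                           {"prime": 0.25, "riemann": 0.001})
print(align_score("let p be an odd prime".split(),
                  "let p be an odd prime".split(), weights))
print(align_score("the riemann zeta function".split(),
                  "riemann zeta".split(), weights))
```

      With weights like these, a perfectly matching sentence made only of common words adds little to the total, which is the behaviour described above for "Let p be an odd prime number".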

      --
      ASCII stupid question, get a stupid ANSI
  14. Where's the code? by Qubit · · Score: 1

    I poked around the site, and found the page describing some JSON APIs and things, but no links to code or developer pages.

    So where's the code?

    Hmm, okay, that's weird. The project is run by the Virginia Bioinformatics Institute, but the disclaimer says:

    This software and data are provided to enhance knowledge and encourage progress in the scientific community and are to be used only for research and educational purposes. Any reproduction or use for commercial purpose is prohibited without the prior express written permission of the University of Texas Southwestern Medical Center.

    So they don't hold copyright to it? Or they didn't write it? Hmmmm....

    --

    coding is life /* the rest is */
    1. Re:Where's the code? by Anonymous Coward · · Score: 0

      run the code on itself and you can see if it's plagiarised!

    2. Re:Where's the code? by Antisyzygy · · Score: 1

      They probably are using some code owned by that institute to save time writing it themselves.

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
  15. And now the all seeing eye turns to teachers and p by Anonymous Coward · · Score: 0

    About flipping time!

    This is no different than the student version and it is very good at what it's doing. This means that profs who "cheat" will get caught. Amazing!

  16. Re:You can't plagiarize yourself [Re:What about .. by Travelsonic · · Score: 5, Interesting

    In High School, they tried to cram the concept of "self plagiarism" down our throats - what a crock of shit... you can NOT by DEFINITION plagiarize YOUR OWN WORKS. Recycling may be lazy, may violate other ethics, but to call it plagiarism is, IMO, very intellectually dishonest of these institutions.

    --
    If you believe in privacy, and believe you have "nothing to hide" at the same time, you're a goddammed idiot
  17. Lame by Anonymous Coward · · Score: 0

    Not only is this lame, but how does it handle passages where a work is quoted legitimately? I despise crap like this and Turnitin.

  18. In other news: by Anonymous Coward · · Score: 0

    Several academic and research institutions have noted a sharp drop of published research from their Chinese native researchers. Experts are speculating there may be a link between this and the institutions' recent adoption of ET Blast.

  19. Too bad by luis_a_espinal · · Score: 1

    Because beyond a certain point of recycling, it's just dishonesty.

  20. Recycling by luis_a_espinal · · Score: 1

    Even though recycling is not plagiarism, I would love to see this tool being used to create some sort of recycling ranking for individual academics and colleges. There is a not-so-fine line between exploring different aspects of a subject and simply recycling for the purpose of maximizing presence. The former is necessary for the pursuit of research. The latter is just f* dishonesty (and a costly one for society, since it is typically used for securing research moolah).

    1. Re:Recycling by Antisyzygy · · Score: 1

      It's considered unethical by the majority of scientists to recycle papers unless there is a significant update from one to the next, i.e., methods changed, or additional steps are taken which improve the results. It is not considered unethical, however, to resubmit your paper to a different conference or journal if it was rejected by another.

      --
      That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
    2. Re:Recycling by luis_a_espinal · · Score: 1

      It's considered unethical by the majority of scientists to recycle papers unless there is a significant update from one to the next, i.e., methods changed, or additional steps are taken which improve the results. It is not considered unethical, however, to resubmit your paper to a different conference or journal if it was rejected by another.

      I know, I was referring to the former. In fact, submitting a paper to different conferences (say, within the same year) is something I would *not* consider recycling.

  21. Re:You can't plagiarize yourself [Re:What about .. by Anonymous Coward · · Score: 3, Informative

    I actually ran into this in grad school. When writing a tech related paper, I referenced one of my past papers on the same subject as a source. My professor made it clear I had to cite myself to avoid "self-plagiarism". I thought it quite possibly the stupidest thing I had ever heard in my life, and it was coming from a celebrated PhD at a major New England university.

  22. That's all fine and good, but... by bobdotorg · · Score: 2, Interesting

    ... can it find dupes on Slashdot?

    --
    __ Someday, but not this morning, I'll finally learn to use the preview button.
    1. Re:That's all fine and good, but... by The_mad_linguist · · Score: 1

      Can it load the front page?

      Finding dupes on slashdot is like finding corruption in congress.

  23. UK by Anonymous Coward · · Score: 0

    In the UK they do this for UCAS (university) applications: they check your personal statement to test whether it's plagiarised or just crap (uses too many common themes), and warn you about it.

  24. Re:You can't plagiarize yourself [Re:What about .. by Attila+Dimedici · · Score: 1

    The reason for the rise of the concept of "self-plagiarism" is these types of automated plagiarism detectors. If I have written a lot of papers that are in their database and can lift sections out of a previous paper I wrote without citing it as a source, these programs are going to generate a lot of false positives.

    --
    The truth is that all men having power ought to be mistrusted. James Madison
  25. Re:You can't plagiarize yourself [Re:What about .. by Anonymous Coward · · Score: 0

    Maybe the problem is that we don't have good terms to differentiate between appropriate reuse of one's own writing and unacceptable reuse.

    For instance, it's a violation of academic ethics to try to publish the exact same paper in multiple places. You're effectively trying to increase your publication count without adding anything new to the body of knowledge. It's still not plagiarism, since it's your own work, but it is unethical.

    Not citing previous work when writing a paper is also wrong, though not in the same way. It can be either an honest mistake, lazy, or downright unethical (e.g. not citing the work of someone you don't like). Not citing your own previous work in the area is similarly wrong. Not because it would be plagiarism, but because citations are vital to help others understand the context, significance, and background to the present work. So you should cite yourself when appropriate, just as you would cite others.

    And lastly, there are times where re-using your own material is absolutely acceptable. For instance when releasing a new edition of a book, it just makes sense to tweak the things that need changing. It doesn't make sense to rewrite every sentence to avoid 'plagiarizing' yourself. Similarly if you write a review article of a certain field, it just makes sense to re-use some of the text from a previous review (now outdated) that you wrote. (There may or may not be secondary copyright concerns, depending on the various contracts in place.) It isn't plagiarism, and it isn't wrong.

    Perhaps academia needs to develop terms to cleanly differentiate between these cases. Or alternately people need to be more specific when they are talking about appropriate vs. inappropriate behavior. Abusing "plagiarism" as a catch-all for "unethical publication" confuses the issue.

  26. Re:You can't plagiarize yourself [Re:What about .. by Lucky75 · · Score: 1

    Tell that to my university where I got accused of academic dishonesty for reusing one paragraph again in a course that I failed. Utterly ridiculous.

    "Okay, I give myself permission to copy work from myself....there.....now it isn't plagiarism."

    --
    DNA -- National Dyslexic Association
  27. Re:You can't plagiarize yourself [Re:What about .. by Lucky75 · · Score: 1

    To follow my last post, I usually like to use the following argument: If I'm asked what the answer to 1+1 is, I'm going to answer '2'. I'm not going to say that the answer is '3' next time just to make my answer different.

    --
    DNA -- National Dyslexic Association
  28. Re:You can't plagiarize yourself [Re:What about .. by Anonymous Coward · · Score: 5, Interesting

    Yes, but maybe the problem is that we don't have good terms to differentiate between appropriate reuse of one's own writing and unacceptable reuse.

    For instance, it's a violation of academic ethics to try to publish the exact same paper in multiple places. You're effectively trying to increase your publication count without adding anything new to the body of knowledge. It's still not plagiarism, since it's your own work, but it is unethical.

    Not citing previous work when writing a paper is also wrong, though not in the same way. It can be either an honest mistake, lazy, or downright unethical (e.g. not citing the work of someone you don't like). Not citing your own previous work in the area is similarly wrong. Not because it would be plagiarism, but because citations are vital to help others understand the context, significance, and background to the present work. So you should cite yourself when appropriate, just as you would cite others.

    And lastly, there are times where re-using your own material is absolutely acceptable. For instance when releasing a new edition of a book, it just makes sense to tweak the things that need changing. It doesn't make sense to rewrite every sentence to avoid 'plagiarizing' yourself. Similarly if you write a review article of a certain field, it just makes sense to re-use some of the text from a previous review (now outdated) that you wrote. (There may or may not be secondary copyright concerns, depending on the various contracts in place.) It isn't plagiarism, and it isn't wrong.

    Perhaps academia needs to develop terms to cleanly differentiate between these cases. Or alternately people need to be more specific when they are talking about appropriate vs. inappropriate behavior. Abusing "plagiarism" as a catch-all for "unethical publication" confuses the issue.

  29. Re:You can't plagiarize yourself [Re:What about .. by Antisyzygy · · Score: 1

    RECYCLOPS will make you recycle!

    --
    That brings me to an interesting point, / . is just "the ramblings of socially-inept, technology-literate news-mongers".
  30. How is that news? by hnangelo · · Score: 1

    How is that news? I've seen a few universities using systems like that for a few years now...

  31. amount of scientific plagiarism creeping up by peter303 · · Score: 1

    The first study I read in Nature ten years ago placed it at about 1-2% in European/North American journals. A more recent study doubled that figure. Pilot tests in Asia find the number well into double digits.

    No one has fully stated the cause for the increase. I am guessing it's better software, and that nearly all papers are in electronic databases now. A more pessimistic explanation would be that as the "Internet Generation" enters the scientific workforce, their sloppy IP habits migrate into research papers.

    The same recent Nature article recommended routine scans of submitted papers to reduce plagiarism retractions in the future. Retractions are always embarrassing to editors.

  32. Publishers are using CrossCheck by 1_brown_mouse · · Score: 1

    http://www.crossref.org/crosscheck.html

    They already create DOIs for their published work and now can check the works before publishing.

  33. Re:You can't plagiarize yourself [Re:What about .. by pentalive · · Score: 1

    It does help others find your previous work.

  34. It's like fingerprint analysis. by MarkvW · · Score: 1

    In fingerprint analysis, the computer spits out a possible match. It's up to the human to determine whether or not that match is valid. It's the same with this stuff.

  35. Re:You can't plagiarize yourself [Re:What about .. by noidentity · · Score: 1

    To follow my last post, I usually like to use the following argument: If I'm asked what the answer to 1+1 is, I'm going to answer '2'. I'm not going to say that the answer is '3' next time just to make my answer different.

    You're just not being creative enough. You can come up with a different answer, for example "1+1 is 1.999..." or "1+1 is 1, for sufficiently large values of 1" etc.

  36. If only .... by Anonymous Coward · · Score: 0

    If only the Philippine Supreme Court had this technology ..... http://www.abs-cbnnews.com/insights/08/09/10/plagiarism-supreme-court

  37. Can you get false positives? by SimonJG · · Score: 1

    How does this text comparison work? Is it intelligent enough to weight the different sections differently?

    Very often, much of the introductory and methodology sections may be recycled or adapted from previous publications and only the results and conclusions are scientifically novel.

    1. Re:Can you get false positives? by dxk3355 · · Score: 1

      I just saw a presentation that described the two basic formulas that these programs use. The important one is to measure Damerau–Levenshtein distance. This can be combined with fuzzy string searching or other algorithms to determine a percentage.
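
      For anyone who hasn't met it, here is a minimal sketch of that distance in its restricted "optimal string alignment" form, where insertions, deletions, substitutions and swaps of two adjacent characters each cost 1. The function names are mine, and turning the distance into a percentage by dividing by the longer length is just one assumed normalisation, not necessarily what these programs do.

```python
def osa_distance(s, t):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    n, m = len(s), len(t)
    # d[i][j] is the distance between the prefixes s[:i] and t[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Swap of two adjacent characters counts as a single edit.
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def similarity_percent(s, t):
    """Assumed normalisation: 100 * (1 - distance / longer length)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - osa_distance(s, t) / longest)


print(osa_distance("plagiarism", "plagairism"))  # one adjacent swap
print(similarity_percent("abcd", "abdc"))
```

      The same recurrence works on lists of words instead of characters, which is closer to what you would want for comparing sentences.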

    2. Re:Can you get false positives? by dxk3355 · · Score: 1

      Dice's coefficient is another important one. You can use it across words, sentences, and paragraphs to determine a similarity measure.
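
      A minimal sketch of Dice's coefficient, 2|A∩B| / (|A| + |B|), applied to sets of items. The helper names are mine, and treating two empty inputs as identical is an assumption; the same function covers character bigrams, words, or whole sentences depending on what you feed it.

```python
def dice_coefficient(a, b):
    """Dice's coefficient of two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def char_bigrams(text):
    """All overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Character level: "night" and "nacht" share only the bigram "ht".
print(dice_coefficient(char_bigrams("night"), char_bigrams("nacht")))

# Word level: two sentences that differ in a single word.
print(dice_coefficient("let p be an odd prime".split(),
                       "let q be an odd prime".split()))
```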

    3. Re:Can you get false positives? by canajin56 · · Score: 1

      It doesn't rate different sections differently. But it works by going, for each sentence, find the most similar sentence and compute the score. The score for matching sentences will be somewhat high. The score for dissimilar ones will be quite negative. So, even if it's 50% recycled stuff and 50% all new stuff, the negative score from the 50% of the sentences that are new will dominate the total score, and it's going to end up pretty low overall.

      But yeah, it's not a plagiarism detection algorithm, it's a similarity measure. You can easily trip it up if you're trying to. And you can be measured as "similar" without plagiarism. It's up to a person to make that call. This is just a measure of similarity, originally meant to speed up a literature search. If you took an existing problem, built on previous works, made some improvements and adjustments to previous methodologies, and got better or more interesting results, then your paper probably will have a high similarity score to previous papers. And it's supposed to, that is the entire point of the program! As a side effect, it notices cases of blatant plagiarism, where somebody lifts the majority of a paper verbatim and publishes it elsewhere. Though most cases are "self plagiarism", where a person publishes multiple times with minor variations to inflate their publication list and get more funding.
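
      A minimal sketch of that best-match-per-sentence scheme. The per-sentence score here is a stand-in I made up (+1 per shared word, -0.5 per unshared word), not the trained log-odds score the real tool uses, and the function names are hypothetical.

```python
def sentence_score(s1, s2, match=1.0, miss=-0.5):
    """Toy sentence score: reward shared words, penalise unshared ones."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return match * len(w1 & w2) + miss * len(w1 ^ w2)


def document_score(doc_a, doc_b):
    """Sum each sentence's best match in the other document, both ways."""
    if not doc_a or not doc_b:
        return 0.0
    total = 0.0
    for src, dst in ((doc_a, doc_b), (doc_b, doc_a)):
        for sentence in src:
            total += max(sentence_score(sentence, other) for other in dst)
    return total


original = ["we measured enzyme activity", "samples were stored cold"]
half_new = ["we measured enzyme activity", "results show novel binding"]

print(document_score(original, original))  # every sentence finds itself
print(document_score(original, half_new))  # new sentences drag the total down
```

      Even in this toy version, the sentences with no good partner pull the total well below that of a verbatim copy, which is why a human still has to look at whatever gets flagged.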

      --
      ASCII stupid question, get a stupid ANSI
  38. It's no more... by fletto · · Score: 1

    Unfortunately, during the beta stage the program came across this certain Spielberg movie and a Metallica song and offed itself. Too bad, it seemed to be a pretty handy piece of software.

  39. Awesome by JambisJubilee · · Score: 1

    This is a really great tool, actually. For scientific work, the time between gathering notes/ideas/data and writing them up can be significant. Even an academic mini-thesis might have 200+ citations. By the time you write the paper it's hard to remember which of your (handwritten) notes are original. I've always wanted a tool that could double check for me.

  40. Tit-for-tat by Anonymous Coward · · Score: 0

    Software finds plagiarism in Research? Research finds plagiarism in Software. Research finds plagiarism in software used to find plagiarism in research. Software finds plagiarism in research used to find plagiarism in software. Will this arms race never end? Please?? And won't someone think of the children?

  41. Re:You can't plagiarize yourself [Re:What about .. by Geoffrey.landis · · Score: 1

    I was saying to myself, wait, this post is identical to the previous one... duh.

    But, since you're posting as anonymous, it doesn't increase your publication count to republish it. Fail.

    (And, anyway, "Anonymous Coward" is already the most-cited author on slashdot.)

    --
    http://www.geoffreylandis.com
  42. That's all fine and good, but... by TubeSteak · · Score: 0, Redundant

    ... can it find dupes on Slashdot?

    --
    [Fuck Beta]
    o0t!
  43. Is it really a plagiarism tool? by saiful76 · · Score: 1

    The article points to this link for the search engine. I did a search with a small paragraph copied from a paper and got many results with different scores (it doesn't explain what these scores mean). It didn't say decisively whether the text was copied from any source, which is what you'd expect from a plagiarism tool.

    Secondly, the About page doesn't mention plagiarism at all. What it says is: "eTBLAST is a unique search engine for searching biomedical literature. Our service is very different from PubMed. While PubMed searches for "keywords", our search engine lets you input an entire paragraph and returns MEDLINE abstracts that are similar to it. This is something like PubMed's "Related Articles" feature, only better because it runs on your unique set of interests."

    However, I must say that the results did give a lot of interesting related papers on the same subject, which are not easy to find with a keyword search. To me, it looks more like a search engine where you can search using a paragraph instead of keywords, which is quite impressive in itself. The site also offers a few nifty features such as "Find an Expert" and "Find a Journal" which should be useful for research professionals. I also found the citations page to be quite informative. Since this service is free with APIs available, it can be a great source for creating mashups.

  44. Re:There's nothing about "destroying" in the artic by Anonymous Coward · · Score: 0

    It has the potential to destroy the reputations of unethical scientists, because some scientists publish slight variations of a single paper of theirs in multiple journals to increase their publication count. This system will hopefully bring some sunshine on that practice, because the scientists who are essentially copying their own papers make other scientists (who are actually doing research and writing separate papers for each publication) look bad, as their publication count isn't as high. A publication count lower than another scientist's translates to less money for research the next year.

    Thanks to these scientists for bringing sunshine to their own field. This is kind of a meta-version of the mantra that science is self-correcting, as some scientists are using science to destroy the reputations of other, unethical, scientists. Go science!

  45. Re:You can't plagiarize yourself [Re:What about .. by BrokenHalo · · Score: 1

    Maybe the problem is that we don't have good terms to differentiate between appropriate reuse of one's own writing and unacceptable reuse.

    It always used to make me chuckle to find textbook references cited as "personal observation" in journal articles written by one of my university's professors. Most scientists can't get away with that. But if you are as much of a bigwig in your field as he was, I guess it's not as arrogant as it might seem.

  46. you may be violating copyright. by pigwiggle · · Score: 1

    like i said

    --
    46 & 2
  47. not so cut and dried by pigwiggle · · Score: 1

    Most publications are group work. Maybe the first author wrote the entire work without input, using only the results of others. And maybe every other author made significant changes or critiques. Those words can't be reused – unless they include every previous author in the new list. Reuse an introduction a few times and the author list is going to get pretty long. Anyway, it is copyright violation to use previously published phrases and images in a publication for a different publisher. That is clear.

    --
    46 & 2
  48. ET Blast by HTH+NE1 · · Score: 1

    "Ouch."

    --
    Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
  49. Didn't I read this same article last week? by minstrelmike · · Score: 1

    Didn't I read this same article last week?

  50. Old hat by Tomji · · Score: 1

    My English prof back in 2000 had this software already.
    However, my final paper was "borrowing" quite heavily and he didn't find out. Maybe this version works better? :)

  51. Re:You can't plagiarize yourself [Re:What about .. by Anonymous Coward · · Score: 0

    Now that should be easy to fix in the software:

    if author_new == author_old then plagiarism = False

  52. Re:You can't plagiarize yourself [Re:What about .. by Anonymous Coward · · Score: 0

    if you resubmit your own work, it's not plagiarism.

    Correct! It's amazing to see how many people don't understand this point, but it's correct: you can't plagiarize yourself, because plagiarism is the act of passing somebody else's work off as being yours.

    I hate it when researchers report the same work in many different papers, but although it is a violation of research reporting standards, and in some cases a violation of an intellectual property contract... it's not plagiarism.

    Tell that to John Fogerty! Okay, he won, but still.....

  53. Software finds plagiarism in VBI's own material in by Geminii · · Score: 1

    5... 4... 3... 2... 1...