Detecting plagiarism by comparing sequences of references
October 14, 2013
I recently reviewed a manuscript bearing a suspicious resemblance to an already-published paper. The exact wording of the previous paper had not been copied; however, in a couple of sections, the new manuscript made a series of points in the same order as the previous paper, using a very similar sequence of references. The sequence of greatest overlap is shown below (with references from Paper B renumbered according to the bibliography of Paper A).
While this similarity does not itself prove that plagiarism occurred, it certainly constitutes grounds for further investigation.
It occurred to me that comparing sequences of references might be a good way of detecting possible cases of “subtle plagiarism,” in which the original text has been rephrased. This would be sort of analogous to the BLAST (Basic Local Alignment Search Tool) algorithms used in biology to identify related nucleotide and amino acid sequences (as in the comparison below of Calcium-Dependent Protein Kinases from different organisms, taken from Ojo et al. 2010).
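To make the analogy concrete, here is a minimal sketch of how two reference sequences might be compared. It uses Smith-Waterman local alignment, the exact dynamic-programming method that BLAST approximates heuristically. The function name `local_align`, the scoring values, and the reference numbers in the example are all assumptions chosen for illustration; they are not taken from the manuscripts discussed above or from any published tool.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two lists of reference IDs.

    Returns (best_score, aligned_a, aligned_b); None marks a gap.
    The scoring values are illustrative assumptions, not tuned ones.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; scores never drop below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            s = max(0,
                    score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                    score[i - 1][j] + gap,       # a[i-1] has no partner
                    score[i][j - 1] + gap)       # b[j-1] has no partner
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)

    # Trace back from the best-scoring cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# Made-up citation orders, with Paper B renumbered to Paper A's bibliography.
paper_a = [12, 3, 7, 8, 9, 15, 21]
paper_b = [40, 7, 8, 30, 9, 15, 2]
best, seg_a, seg_b = local_align(paper_a, paper_b)
print(best, seg_a, seg_b)
```

In this toy case the best local alignment pairs references 7, 8, 9 and 15 in both papers, with one extra citation (30) inserted in Paper B. As with the real example, a high score would only flag a passage for closer reading; it would not establish plagiarism.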
In poking around online, I noticed that not only had others independently come up with “my” idea, they had already implemented it and published papers about it. The 2-page Gipp & Beel contribution to the 21st ACM Conference on Hypertext and Hypermedia (June 2010) provides a nice introduction to the approach.
Gipp and coworkers are also developing CitePlag.org, a website intended to let others perform document comparisons of their own. According to Gipp, the site currently struggles with long documents and with citation styles it does not yet account for. My own testing indicates that it is not especially useful at the moment, but it is definitely on the right track.