The paper explores the notions of plagiarism and its inverse, text ownership, and asks how close or different they are, from a procedural point of view that might seek to establish either of these properties. The emphasis is on procedures rather than on the conventional divisions of authorship studies, plagiarism detection etc. We use as a particular example the notion of computational detection of text rewrites, in the benign sense of journalists' adaptations of the Press Association newsfeed. The conclusion of it all is that, whatever may be the case in copyright law, procedural detection and establishment of ownership is a complex and vexed matter.
The paper was an invited address at the Annual Meeting Digital Resources in the Humanities 2000 at Sheffield University. The work described was supported by EPSRC award GR/M34041. The author gratefully acknowledges comments and suggestions from Ted Dunning, Paul Clough, Roberta Catizone and Louise Guthrie.
I have used the term 'ownership' here hoping to separate out the considerations I want to raise from legal issues of copyright about which I know little, except that the ease of general access to enormous amounts of text that the web now offers is making traditional notions of copyright harder to enforce, if not meaningless. Indeed, a recent court ruling in the US [] has decided that web links to material that is illegal based on copyright law (and the DVD decrypting software in particular) are as illegal as the material pointed to, and that links to sites with illegal links are illegal, which may have the effect that the whole web is now illegal in the US. So one may consider the copyright issue more confused than ever and best not discussed further.
It is not simply the amount of text on the web, and the ease of copying text verbatim, that are the source of our current unease about text ownership, but the relative weakness of detection mechanisms for identifying even substantial unaltered quotation. In a recent experiment reported in the Communications of the ACM, small-scale but suggestive, an American academic took 100 words at random from a dictionary and then accessed the sites named by those words, each followed by .com (this procedure is now almost universally successful!). He extracted a reasonably sized piece of text from each site located and then fed it back as a long string search, with a well-known webcrawler over a range of search engines. Only about one in five texts was relocated by this method, which the author took to mean that, with current commercially available web search technology, a student copier (and he was concerned with simple student essay plagiarism from the web) has only a 20% chance of detection, even when a suspicious marker can be bothered to do such a search.
That result is an interesting base line, and one perhaps higher than many would expect given the low level of most experiences with any kind of demanding web search. It certainly makes the claims of the many commercially available plagiarism detection tools [] seem dubious, or at least those that do not work with a special segregated set of texts, a point we shall return to. But all this is little more than anecdote, and this paper is not about plagiarism and copyright as such, for I am more concerned with what one might, more portentously, call the phenomenology of text use and reuse. The web is nothing qualitatively new in that regard: huge amounts of text have been whizzing about the internet for decades. I was on the Arpanet in 1972 or so when Minsky sent out the whole of his book on frames to a number of people for comment. For me, and possibly for every recipient, that was one's first encounter with a very large, editable chunk of prose. At the time, that had to be something that had been composed on line, since no one was then inputting much in the way of corpora composed by someone else and in another medium.
It became obvious to many in the early Seventies therefore, long before the web and structured corpora as we now have them, that the ability to reuse one's own prose, as well as that of other people (collaborators on joint papers being the standard benign case), was going to change the nature of documents, and was going to make text what it now is, virtually a mass term, something quantifiable and obtainable in bulk, much as owners of new libraries were once said to buy books by volume, and independently of their content.
I suspect that reuse of one's own prose has also risen enormously since that time, for those who normally operate with computers as writing tools, and that there are many documents in circulation such that, no matter whose names occur at their tops, they are in fact an amalgam of many hands, so much so that no one could be sure of who wrote what. The classic modern example would probably be American reports within the military contract research domain, a world I knew very well over decades: it is often the case that no one even pretends they are to be read by anyone or have any real function whatever, because they transfer spelling errors and all kinds of irrelevant material from version to version. They simply fulfil reporting requirements and the genre itself is so lifeless that it is in no one's interest to say, stop this, let us start again and write a document from scratch.
Again, I believe it is common now, in the academic thesis world, to find chunks of theses that have been incorporated by a student from his supervisor's work, often with the tacit agreement of the supervisor, who sees this as one way of helping a marginal or inadequate doctoral student, and that this is just about defensible, and possibly less effort, in an overstressed world, than editing and reshaping every sentence of a student whose first language is far from English. Did renaissance painters, we may console ourselves by imagining, not stroll about their studios adding a head here or there to a student's work, not worrying about who then signed the whole work?
These phenomena are clues that point us towards our goal: that of benign plagiarism or adaptation, done by individuals and groups, and in situations where no one is deceived and no author is exploited. Such texts rarely reach well-balanced corpora, of course, and are often shelfware, and stand to real text as styrofoam does to some more substantial material.
Some of you will think at this point that none of this is an original situation created by technology: to go from the ridiculous to the sublime, the King James Bible may well be such an object, created by a committee so that no individual's prose can be identified reliably, and many of whose contributors may have drawn from a range of multilingual sources already well worked over by other authors. This is true, but the Bible was probably an exception in that respect whereas I suggest that it may now be becoming something closer to the norm.
I also want to ask what the relevance of these phenomena is for the creation of digital libraries: much academic technical writing will now be of the conflated type that I have described above, as opposed to the work of a single author, or a small group, writing material they would not knowingly use again. A question that arises, and must already have done so, is how one can ensure that any material is included only once in such a collection. It might seem that the answer should be obvious in any well-edited and balanced collection; but as corpora grow, it may not be possible to ensure backwards compatibility by the methods of scholarship alone. One remembers the case of Stalin's collected works, whose length was announced soon after his death, and before the material had been surveyed, not least because his collected works had to be not shorter than Lenin's in terms of shelf length. Every message and war telegram is said to have been included, some more than once. But we may assume that those considerations will not affect the sort of collections we have in mind. Any solution to this would not be of merely scholarly interest, by the way: in commercial environments, there is said to be still a great need for a reliable algorithm that will remove semi-duplicates from what a web search returns, to cut down on the enormous and unusable numbers of hits in a way that may be easier than just going for greater precision.
This issue is not simply one of duplication, which could perhaps be legitimate if the same passage appeared more than once in a single author's works, but that of adaptation or rewriting. I have touched on the classic topic of plagiarism detection, one long known to scholars and now to computer programmers (the latter in both senses, since so much plagiarism and adaptation is now of computer programs), yet so much text adaptation is in fact benign, and the key example is press rewriting. As many of you will know (and this has been the topic of a technical paper in this conference by my colleague Paul Clough), press agencies such as the Press Association and Reuters in this country, Associated Press in the US and so on, issue great volumes of news text each day with the specific intention that newspapers will use this material as it is (verbatim, as they call it, unsurprisingly) or rewrite it to fit their available space and house style, which is done especially by those papers with fewer resources for original source material, such as local and regional prints.
We began this work, my colleagues Rob Gaizauskas, Paul Clough, Scott Piao and myself, along with colleagues in the Sheffield Department of Journalism, as an exercise in quantifying text use in the context of the Press Association's (PA) press feed. The interest of the PA itself was to find a way of quantifying the degree to which different newspapers receiving the feed actually used it. The obvious feature here that is of interest, in connection with the issue of text ownership, is that the determination cannot be made on any simple quantification of key words in common between the PA feed and a newspaper text, because all versions of the story will contain some or all of the relevant terms and names.
It is important to see how this particular application of text reuse, that is to say measuring when the PA is (versus when it is not) a story source, is just an example of what might become a generic text attribution technology, one with close relationships to the other 'attribution technologies' we have touched on. Let us try a simple taxonomy by generic task, independently of applications.
We can perhaps distinguish the following processes, as opposed to particular application tasks:
I. Of these n sets of texts, to which is text A most similar?
II. Has this set of texts one or more subsets whose members are improbably similar to each other?
III. Is there a subset of texts (out there in a wide set like the web) similar to text A?
IV. Is text A improbably similar to text B?
I. is plagiarism when the putative sources are known; or authorship where the possible candidates are known, as in the case of the Federalist papers, or forensic cases such as when all those in the village who might have written the poison pen letter are known, or whether Mrs A's will is more like her other writing or like her butler's. It is also very close to the classic routing problem of deciding, of a new message, which topic bin to put it in.
But notice here that, although these examples answer the same question and may use the same technique, the features examined may be quite different: e.g. to detect the authorship of Mrs. A or the butler, we would not in general be looking at other wills they might have written, and would therefore be looking for some statistical signature of him or her over the closed class words, since topic words would again give no guidance.
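As a minimal sketch of what a "statistical signature over the closed class words" might look like for a question of type I, the fragment below compares relative frequencies of a handful of function words. The word list, the cosine measure and the function names are all assumptions made for illustration; real authorship work would use far longer lists and far larger samples.

```python
from collections import Counter
import math
import re

# Assumed, deliberately tiny list of closed-class (function) words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "it", "but", "upon", "by"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_candidate(disputed, candidates):
    """Task I: name of the candidate whose profile is nearest to the disputed text."""
    target = profile(disputed)
    return max(candidates, key=lambda name: cosine(target, profile(candidates[name])))

samples = {
    "Mrs A": "upon the table upon the chair upon the floor",
    "butler": "it is by the door but it is by the wall",
}
print(closest_candidate("upon the roof upon the stair", samples))
```

Note that topic words play no part here: only the closed-class counts enter the comparison, which is the point made above.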
II. is mass class cheating on student exercises, such as the recent Edinburgh case, where a subset of class essays were copied from one of the essays and this was detected, though it may have been harder to determine which was the source essay. This is also the form of testing for self-plagiarism of, say, the academic papers of an individual, though we tend not to call this plagiarism but overpublication. Lest you think this a fantastic example, let me assure you it goes back at least to [], where Rick Bellow used a rather different connectionist technique to show that the papers of some well-known authors in the field of Artificial Intelligence were so similar as to be essentially the same paper under different titles. Since then, sophisticated web retrieval sites like the one at the NEC laboratory [] in Princeton allow you to check very easily when cited papers have sentences and phrases in common, whether or not they appear to be by different authors. Applications to "closed group" corpora like all one's CV papers could easily become part of a future RAE exercise. You have been warned.
III. is web-style plagiarism of student texts like the initial ACM example, and also web search retrieval itself.
IV. is a subquestion of III, when we are simply comparing individuals rather than proper samples, and is close to the rewrite problem we have identified in journalism, but also identical to many forms of the rewrite-cum-plagiarism question, benign or otherwise: was this Gospel or historical work rewritten from that? Was this particular student essay rewritten from that source?
In each type of question we have used the word 'similar', and of course the important question is what kinds of statistical function, if any, express a useful notion of similarity. Candidate functions normally express the likelihood of a set of texts containing words or word-strings drawn from the same set by chance. A major problem with that, as is well known, is that it tends to ignore what one might call similarities independent of authorship, e.g. topic. As Ken Church put it in a recent talk [], if you have one Noriega in a text, another Noriega is very likely, no matter what the overall statistics of occurrence of the name Noriega are. If that example still carries an element of authorship discrimination, since the same author will be producing the successive examples of "Noriega", then one might consider the recent upsurge of occurrences of "George W. Bush" in texts world-wide, which has no consequences whatever for similarity indicating authorship, only topic.
In the case of the kind of text ownership or attribution we have taken as our focus, benign press rewrites (though, as we noted, plagiarism among historians would be more exciting, and formally identical), one will not be able to distinguish between those articles that have and have not been rewritten from the same Press Association source by any kind of criterion having to do with words key to the topic, such as proper names in a court report, since they will tend to be just as likely to occur in non-rewrites as rewrites, as well as in the case where A and B are not rewrites of each other (in either direction) but both rewrites of a single, possibly unidentified, source text.
Much of the original inspiration for MeTeR came from Ted Dunning [], a former student of mine who has since become well known in the text statistics world, and whose speciality has been inference from very small samples. What we hoped was that the key discriminator of reuse would lie in common ngrams of words, syntagms if you prefer, that were probably not made up of topic-specific words but would still be indicative, criterial, of a rewritten text rather than a contemporary one covering the same story. The problem was always that the samples were so small (the comparison of single newspaper stories) and therefore most of the heavyweight methods listed above as I-III would not be applicable, since they all rest on the notion of a significant sample.
However, it has not turned out to be as straightforward as that: even very long common ngrams may not be criterial of rewriting, since we have found a 14-word ngram in common between two stories on the same topic that are not mutual rewrites nor, so far as we know, derived from a common source. Ted Dunning has pointed out to me the frequency of the 7-gram "staff writer of the Wall Street Journal", which not only carries no implication of common authorship but can be seen not to do so, because it is plainly used in place of personal attribution.
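To make the notion of a common ngram concrete, here is a small sketch that finds the word ngrams two texts share and the length of their longest common run of words. The tokeniser and function names are assumptions for illustration, not the MeTeR implementation; the example pair shows how a formulaic by-line alone produces a long match with no rewriting involved.

```python
import re

def tokens(text):
    """Lower-cased word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(toks, n):
    """Set of all contiguous n-word sequences in a token list."""
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def shared_ngrams(a, b, n):
    """Word n-grams that occur in both texts."""
    return ngrams(tokens(a), n) & ngrams(tokens(b), n)

def longest_shared_run(a, b):
    """Length in words of the longest contiguous sequence common to both texts."""
    ta, tb = tokens(a), tokens(b)
    best = 0
    prev = [0] * (len(tb) + 1)
    for i in range(1, len(ta) + 1):
        cur = [0] * (len(tb) + 1)
        for j in range(1, len(tb) + 1):
            if ta[i - 1] == tb[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the diagonal run
                best = max(best, cur[j])
        prev = cur
    return best

a = "By a staff writer of the Wall Street Journal, markets fell."
b = "Markets rose, says a staff writer of the Wall Street Journal."
print(longest_shared_run(a, b), len(shared_ngrams(a, b, 7)))
```

A long shared run is therefore evidence to be weighed, not a verdict: the by-line above is exactly the kind of match that would have to be discounted.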
The dotplot displays relations that can be intuitively seen as rewrites, and our task now is to determine what set of criteria, fused together, as it were, will explicate or capture what we can see in the dotplot pictures. We have no definitive answers yet, but have every reason to think, as I said, that quite disparate phenomena may have to be fused together, in the sense of a linear weighted sum of criteria. One negative element may be newspaper style, one that would have to be subtracted from a weighting: part of the difference between a PA story and a rewrite may be an adaptation to house style. Newspapers have house style books (those of the more politically distinctive newspapers like the Guardian and the Telegraph, with rules such as 'do not use partner in the sense of spouse', make good reading), but it may prove hard to capture their constraints computationally. Kilgarriff has shown that there certainly is a quantifiable difference [] between the styles of the broadsheets (i.e. one not captured simply by vocabulary size vis-a-vis the tabloids).
There may be some mileage in detecting the relative sizes of, say, the proper name sets in differing document versions as an indication of derivation direction, in that a rewrite may be expected to have, at most, a subset of the proper names that occur in a source. This is a technique well known in the standard scholarship of document versions. In the Seventies, Dan Bobrow at Xerox-PARC proposed a formal structuring of document versions worked on by many authors, as a way of recovering access to earlier versions when necessary, a suggestion that was well ahead of its time.
Within MeTeR, the complex structure of sentence transformations that our Journalism Department colleague John Arundel has located in a study of PA material may allow us to construct some form of minimal rearrangement calculus, one that could allow us to display a rewrite (but not a non-rewrite) as the product of some kind of higher-level spell correction of the original, permuting not letters but short syntagms. If we have any success in that area in the next year, it will have consequences beyond rewrite detection, and may give new life to one of the oldest and least developed parts of computational linguistics, the control of style, in this case newspaper house style, whose automation, or even partial automation, would attract Lord Copper's immediate attention.
As I said, all this is work in progress and we do not know what will prove the optimal combination of techniques, if any, though we do know that in the broader, surrounding disciplines appropriately combined methods have often proved the best available. Those disciplines are usually known as Information Retrieval (IR) and Information Extraction (IE): the first, and older, discipline being the statistical location of a relevant subset of documents from a wider set (broadly now the technology behind web search), and IE the use of linguistic techniques that examine the structures of word strings to locate facts, or other structured material, within large document collections.
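For readers who have not met one, a dotplot can be sketched in a few lines, assuming the usual word-level form in which a cell is marked when word i of one text equals word j of the other, so that shared sequences show up as diagonal runs. The second function illustrates a linear weighted sum of criteria; the feature names and weights are invented for the example and are not values from our experiments.

```python
import re

def tokens(text):
    """Lower-cased word tokens; punctuation is discarded."""
    return re.findall(r"[a-z0-9']+", text.lower())

def dotplot(a, b):
    """Boolean matrix: cell [i][j] is True when word i of a equals word j of b."""
    ta, tb = tokens(a), tokens(b)
    return [[wa == wb for wb in tb] for wa in ta]

def render(matrix):
    """Plain-text picture of the matrix, '#' for a match and '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in matrix)

def combined_score(features, weights):
    """Linear weighted sum of criteria; a negative weight subtracts a criterion,
    as one might for overlap that is due only to house style."""
    return sum(weights[name] * value for name, value in features.items())

print(render(dotplot("the cat sat", "a cat sat down")))

# Hypothetical criteria and weights, for illustration only.
features = {"ngram_overlap": 0.5, "house_style": 0.2}
weights = {"ngram_overlap": 1.0, "house_style": -0.5}
print(combined_score(features, weights))
```

The diagonal of two '#' marks in the printed picture corresponds to the shared sequence "cat sat"; what the eye does with such diagonals is what the weighted criteria would have to reproduce.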
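The proper-name heuristic can be sketched as a strict-subset test between the name sets of two versions. The name spotter below is a crude stand-in (capitalised words that do not open a sentence) and is purely an assumption for the example; any real system would use a proper name recogniser.

```python
import re

def proper_names(text):
    """Crude stand-in for name recognition: capitalised words
    that are not the first word of a sentence."""
    names = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z']+", sentence)
        names.update(w for w in words[1:] if w[0].isupper())
    return names

def likely_direction(a, b):
    """Guess derivation direction: the version whose names form a strict
    subset of the other's is taken to be the rewrite."""
    na, nb = proper_names(a), proper_names(b)
    if nb < na:
        return "a -> b"
    if na < nb:
        return "b -> a"
    return "undetermined"

source = ("Police said John Smith met Mary Jones in Leeds. "
          "The court heard evidence from Peter Brown.")
rewrite = "Police said John Smith met Mary Jones."
print(likely_direction(source, rewrite))
```

When neither set contains the other, the sketch declines to guess, which is probably the honest answer for two independent stories on the same event.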
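One very rough way to picture "spell correction over syntagms" is an edit distance computed over words rather than letters, with adjacent transposition allowed as a single operation. This is only an analogy under stated assumptions (single words as units, unit costs, the restricted Damerau-Levenshtein recurrence); it is not the rearrangement calculus itself, which does not yet exist.

```python
def word_edit_distance(source, target):
    """Edit distance over word tokens: insert, delete, substitute,
    or swap two adjacent words, each at cost 1."""
    s, t = source.lower().split(), target.lower().split()
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a word
                          d[i][j - 1] + 1,         # insert a word
                          d[i - 1][j - 1] + cost)  # keep or substitute
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbours
    return d[len(s)][len(t)]

print(word_edit_distance("the minister resigned yesterday",
                         "the minister yesterday resigned"))
```

A rewrite would then be a text reachable from the source at low cost, whereas an independent story on the same topic would need something close to wholesale substitution.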
You can see elements of both these techniques in the task list given earlier. Although the two are difficult to distinguish completely, and both are now involved in the more sophisticated web search engines, they still retain a difference of emphasis in technique: IR remains overwhelmingly quantitative, or as it is often put 'a text is a bag of words', whereas IE remains symbolic in the sense of working on explicit linear structure, something one tends to associate firmly with the notion of text. That this last may be prejudice can be seen by the striking, for some alarming, success of devices like Tom Landauer's patented student marking package []. It is shown student essays of various grades and then proceeds to assign grades to new essays that are virtually the same as those assigned by the US professors of the classes, although the method is very largely a 'bag of words' one and the final grading algorithm would assign the same grade if the words of the essay were presented in random order!
In more recent competitions that the US defence agency DARPA has run on language processing, the competition on computer question-answering (locating the appropriate answers to questions in large bodies of text) has seen a fascinating clash, as yet unresolved, between these two kinds of methods. On the one hand is the IR approach, which attempts to answer questions using only its standard technique of locating the text (i.e. an answer sentence) most similar to the question; this is a test of how far the notion of 'similar text' can go. On the other is the IE approach, which is, naturally you might think, to study the structure of questions and appropriate answers. The successful 'question answering' website ASK JEEVES [] is a judicious mixture of these two methodologies. More surprisingly to some, the 'most similar text' approach (i.e. technique III above) has been used to generate computer dialogue in the Loebner human-computer dialogue competition [], by searching large newspaper corpora for the sentence most similar to what the human said last and returning it as a dialogue response. This is, in a sense, to reduce all human dialogue to question-answering, broadly construed. Although such an approach has never won the competition, it has done far better than many thought it had any right to expect.
These sorts of results alarm some more than others; their success, if it is success, could cause loss of interest in text-as-such, if text could be understood or produced by such banausic techniques, rather as Minsky predicted that people would lose interest in chess once computers began to win at it. I remember the last American university I worked at, where the main glory of the English Department was not its literary scholars but technical text editing and text simplification, and they would point to their excellent student employment figures to answer any criticism. I sometimes wonder, in the middle of the night, if the excesses of recent continental literary theory can be seen as a reaction to all this text technology, stemming perhaps from a desire to produce text and text theories inscrutable by machines, even if, alas, that has the same effect on human readers.
The last thought reminds one that, as always, there is nothing really new here, or at least nothing essentially connected to computers: Ogden and Richards' Basic English text simplification system in the Thirties came from the same 'objective approach to text' as the logical grammar movement of Carnap in Vienna, which sought to dismiss metaphysical text, and much else, as 'grammatically ill formed', a movement which, after many twists and turns, is still with us as Chomsky's generative grammar program. That, in turn, created much of modern formal linguistics and has been the major source for IE if not IR, and for non-sentimental approaches to text in general.
To return to text ownership in conclusion: should we all be thinking of privatising all this endeavour and copyrighting our own text in some formal manner as we produce it, rather than leaving it to techniques like the ones above? That is, of course, a tricky one, since one's own authorship signature may well not be distinctive in terms of techniques like II above. Shakespeare might look less distinctive if we had more substantial corpora written by his contemporaries, or lots of his own writing on, say, astrology or necromancy rather than Kings and Illyria; we cannot know. There might be hope for distinctiveness for Joyce, Hemingway or even Ballard but we, as individuals, should not be optimistic that such techniques would establish our uniqueness, certainly not across domains. Indeed, the extraordinary (to me at least) figure that corpora in different domains share no more than 2 or 3% of word types makes any general signature of an individual, completely independent of domain, very hard to imagine.
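The order-blindness of a 'bag of words' is easy to demonstrate. The sketch below is not Landauer's method, whose details are not given here; it assumes only plain word counts and cosine similarity, and shows that shuffling an essay leaves its similarity to a reference text unchanged.

```python
from collections import Counter
import math
import random
import re

def bag(text):
    """Word counts, ignoring order entirely."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(b1, b2):
    """Cosine similarity between two bags of words."""
    dot = sum(b1[w] * b2[w] for w in b1)
    norm = (math.sqrt(sum(c * c for c in b1.values()))
            * math.sqrt(sum(c * c for c in b2.values())))
    return dot / norm if norm else 0.0

def shuffled(text, seed=0):
    """The same words in a pseudo-random order."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

essay = "the treaty ended the war and the war reshaped the continent"
reference = "the war ended with a treaty"
print(cosine(bag(essay), bag(reference)))
print(cosine(bag(shuffled(essay)), bag(reference)))
```

Both printed values are the same, since the two bags are identical; anything that depends on explicit linear structure, as IE does, is invisible to such a measure.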
The standard Cabinet Office technique for identifying unique document copies is a differential spacing for each, and this has been the basis of "signature" and "watermark" techniques, but we cannot rely on that in a world where electronic documents can easily be reedited so as to lose format, or passed around without any format. There is always steganography, that hides messages in diagrams and photos, but those would then become the key parts of our documents and not the text itself.
However, there is in fact hope from both those possibilities. Steganography can be extended to electronic text, certainly if the document is something more than ASCII, and even there some will claim that the pattern of usage of optional commas could be made distinctive of an author. There are such things as synonyms for spaces (&nbsp; in HTML, for instance) or the names of paragraph types. All of these can be used to carry hidden meaning invisible to the casual reader.
There may be hope, if hope we want, in the general shift that is taking place in text markup. We are told that all documents will be stored with XML markup in a few years and this would seem to render IE, as presently practised, redundant. IE will not be needed to locate, say, proper names in documents as identifying individuals, because these will be inserted for us by a shadowing XML editor that marks up our prose automatically as we write, just as it could spell correct silently for us without seeking our attention. This is, of course, not the end of IE but IE reborn, since the very same techniques as now analyse and extract items from electronic text will be needed at generation time to put in the XML markup. IE will simply shift from an analytic to a generative technique but with much the same content. However, and here may be the point about ownership, it may prove expensive to mark up text afresh, so original markup may well stay with a text for ever and be part of it, just not the part we see when we read it. There will be the place, perhaps, to code our ownership discreetly and in a way revealed only with a key. A plagiarist would then be forced to remove all markup before use, or someone will then invent indelible markup that cannot be removed.
At a simpler DIY level, it has been possible for some time to create PDF documents in such a way that they cannot easily be copied as text, thus giving protection against anything short of retyping, which the whole of our discussion assumes is what no one does any more. Perhaps the only systematic way for an individual determined to establish their ownership of their text is to register with one of the new sites that takes a signature based on an ngram sample across a submitted document, to be compared automatically with all fresh submissions to the registration pool. But there are, of course, two problems here: first, you must register before your plagiarist, for obvious reasons, and that may not be too difficult. More seriously, these will again be systems, like most plagiarism detection systems, that work on a closed document set: a university class, or the set of papers submitted to a particular conference, or in this case, a growing set of volunteer registrants.
Your plagiarist may well not be shameless enough to register and, as we saw, general checks against the open web set of documents are not very effective and may not get better as it grows. The kind of reuse systems we have discussed (that is, of type IV) only apply when the potential suspect has been identified. Other systems tend to be of closed class types I and II, because the most general type III is ineffective.
This paper has, I fear, been long on gossip and anecdote and short on technology. At the end I remain haunted by the possibility of pieces of reused text, orphans passed about for ever, and with no access to their original parentage. They would be in a sense Darwinian text fragments (drawing again on the much exploited parallel of text and the linear structure of the human genome), ones that had survived inexplicably because of some property that led to repeated reuse. One could then imagine Bobrow's document version hierarchy proposal as analogous to the search for the text-equivalents in a language of the primeval ur-forms of mitochondrial DNA.
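A toy illustration of the 'synonyms for spaces' idea: each gap between words carries one bit, an ordinary space for 0 and a non-breaking space for 1. The scheme and the function names are assumptions for the example, it presumes a cover text with single ordinary spaces only, and it offers no robustness whatever against reformatting.

```python
NBSP = "\u00a0"  # non-breaking space, rendered like an ordinary space

def embed(text, bits):
    """Hide a string of '0'/'1' characters in the gaps between words."""
    words = text.split(" ")
    if len(bits) > len(words) - 1:
        raise ValueError("not enough spaces to carry the message")
    out = [words[0]]
    for k, word in enumerate(words[1:]):
        bit = bits[k] if k < len(bits) else "0"  # unused gaps stay ordinary
        out.append((NBSP if bit == "1" else " ") + word)
    return "".join(out)

def extract(text, n_bits):
    """Read the first n_bits gaps back as a string of '0'/'1' characters."""
    gaps = [ch for ch in text if ch in (" ", NBSP)]
    return "".join("1" if ch == NBSP else "0" for ch in gaps[:n_bits])

cover = "ownership of text is a complex and vexed matter"
marked = embed(cover, "10110010")
print(extract(marked, 8))
```

The marked text looks the same on screen as the cover text, but the mark is lost as soon as someone normalises the whitespace, which is the weakness already noted for format-based watermarks.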
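What such a registration site does can be sketched as follows, assuming a signature made of hashed word 5-grams and a simple containment threshold. The class, its parameters and the threshold are hypothetical, chosen for illustration, and do not describe any particular service; a `keep_mod` greater than 1 would keep only a sample of the hashes, as a real system would for reasons of space.

```python
import hashlib
import re

def fingerprint(text, n=5, keep_mod=1):
    """Set of hashed word n-grams; hashes not divisible by keep_mod are dropped."""
    toks = re.findall(r"[a-z0-9']+", text.lower())
    hashes = set()
    for i in range(len(toks) - n + 1):
        gram = " ".join(toks[i:i + n])
        h = int(hashlib.sha1(gram.encode("utf-8")).hexdigest()[:12], 16)
        if h % keep_mod == 0:
            hashes.add(h)
    return hashes

class Registry:
    """A closed pool of registered documents, compared in order of arrival."""

    def __init__(self):
        self.entries = []  # (owner, fingerprint) pairs

    def submit(self, owner, text, threshold=0.5):
        """Register a text and return the owners of earlier texts that share
        at least `threshold` of the new text's fingerprint."""
        fp = fingerprint(text)
        matches = [o for o, old in self.entries
                   if fp and len(fp & old) / len(fp) >= threshold]
        self.entries.append((owner, fp))
        return matches

pool = Registry()
original = "the quick brown fox jumps over the lazy dog near the river bank"
print(pool.submit("alice", original))
print(pool.submit("bob", original + " today"))
```

The order of the two submissions is everything: had the copy been registered first, it is the original that would have been flagged.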
It may be that, in the final analysis, ownership of text is only a transitory thing at best, more like the lifetime ownership of a piece of genetic code than we might want to think. One of Chomsky's most potent and popularised delusions was that new utterances are in general newly created or generated, but evidence from corpora makes this highly unlikely: recent Longmans studies on English [] dialogue corpora (under Geoffrey Leech) showed that over 50% of English dialogue even on academic matters was composed of repeated ngrams, of up to 4 in length. The fact that phrase books work for foreign languages as well as they do may not be a mark of their relative poverty in comparison to a full language but rather that they may well be an effective partial model of it, whose effectiveness depends only on their size. We are, as language animals, to a substantial degree simply permuters of substrings already well established within the history of the whole language.
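One way to put a number on "composed of repeated ngrams" is the fraction of word tokens that fall inside some ngram, of length 2 to 4, occurring more than once in the sample. This is an assumed operationalisation for illustration; the measure actually used in the Longman studies is not given here and may well differ.

```python
from collections import Counter
import re

def repeated_ngram_coverage(text, max_n=4, min_n=2):
    """Fraction of tokens lying inside at least one word n-gram
    (min_n <= n <= max_n) that occurs more than once in the text."""
    toks = re.findall(r"[a-z']+", text.lower())
    covered = [False] * len(toks)
    for n in range(min_n, max_n + 1):
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        counts = Counter(grams)
        for i, gram in enumerate(grams):
            if counts[gram] > 1:
                for k in range(i, i + n):
                    covered[k] = True
    return sum(covered) / len(toks) if toks else 0.0

sample = "i don't know what you mean but i don't know why"
print(repeated_ngram_coverage(sample))
```

On the sample, six of the eleven words lie inside the repeated "i don't know"; over a large dialogue corpus the same count is what a figure like "over 50%" would summarise.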
All this could seem like seeing language as the experience of going to see Hamlet and finding it full of cliches, but repeated on a potentially universal scale. But, like treason, perhaps plagiarism and text reuse that prosper are no longer treason, but the new establishment.
[] Shivakumar, N. and Garcia-Molina, H. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings 2nd International Conference on Theory and Practice of Digital Libraries, Austin, TX.
[] Kilgarriff, A. 1997. Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora. Also published in Proceedings Fifth ACL Workshop on Very Large Corpora, Beijing and Hong Kong, August 1997.