Paraphrase corpus

This corpus contains paraphrastic sentences with human-annotated word/phrase alignments. It was created by Trevor Cohn, Mirella Lapata and Chris Callison-Burch at the University of Edinburgh in 2006/2007.

Update March 2013: The corpus has recently been hand-corrected and extended with extra layers of annotation, including named entities and syntactic parse structure. Please visit Scott Martin's site for this version of the data, and see also their COLING paper, which includes a description of the dataset.

The original sentences were drawn from three sources and annotated by two annotators. The sentence pairs were drawn at random from the following corpora:

The annotators were given the following annotation guidelines, and marked up the data using a web-based annotation tool (source will be made available).

Both the MTC and MSR texts are covered by licensing agreements. Please ensure that you hold the appropriate licences, described here and here.

Please refer to the README file for details of the file locations and formats, and the scripts included for processing the data.

Download the corpus