Datasets

Quality Estimation

  • For all WMT quality estimation datasets, please check WMT12, WMT13, WMT14, WMT15 and WMT16.

  • PLANT - 1,000 machine translation errors manually annotated at the phrase level over 400 French-English source sentences and their machine translations, along with their human post-edited versions and the original references. Download. For more information on this dataset, check this paper.

  • TSD13 dataset - English-Spanish WMT12 machine translations produced by various MT systems and post-edited by 10 translation students: for training, 200 source sentences, each translated by a given MT system and post-edited by all 10 annotators; for test, 100 source sentences translated by 10 MT systems (1,000 data points), with the post-editing shared among the 10 annotators. Download. For more on how we used this dataset to rank MT systems via post-editing, check this paper.

  • WPTP12 dataset - machine translations with post-editing performed by multiple translators with different levels of expertise: 299 unique English source sentences and their machine translations into Spanish produced by eight MT systems, along with their human post-edited versions (by eight translators), totalling 1,624 {source, machine translation, post-edited translation} triples. For each triple, we provide post-editing time (global and normalised, i.e. seconds/word) and the HTER score (see the HTER sketch after this list). Download. For more on this dataset, check this paper.

  • WMT12 feature sets: feature sets from all but one of the teams participating in the WMT12 Quality Estimation shared task (see the WMT12 dataset below).

  • WMT12 dataset - machine translations with human judgements and post-edits: 2,254 English-Spanish source sentences and their machine translations, along with their human post-edited versions, the original references, and a 1-5 quality score. For the latter, the official version used in the WMT12 shared task on quality estimation is a weighted average over 3 annotators, but all 3 individual annotations (and the weights) are also available for both the training and test sets (see the weighted-average sketch after this list). For more on this dataset, check this paper.

  • EAMT11 dataset - machine translations with human judgements and post-edits: 1,000 English-Spanish (en-es) and 2,525 French-English (fr-en) source sentences and their machine translations, along with their human post-edited versions, a 1-4 quality score, post-editing time, and HTER score. Download. For more on this dataset, check this paper.

  • EAMT09 dataset - machine translations with human judgements: 16,000 sentences, their reference translations, their machine translations as produced by 4 SMT systems, and their scores on a 1-4 scale as given by professional translators. Download. For more on this dataset, check this paper.
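
Several of the post-editing datasets above (WPTP12, EAMT11, WMT12) share the {source, machine translation, post-edited translation} triple format, together with effort indicators such as HTER (the TER edit distance between the machine translation and its post-edited version) and normalised post-editing time in seconds per word. The Python sketch below only illustrates how these quantities relate: it approximates HTER with a plain word-level edit distance (the official scores use a proper TER tool, which also handles block shifts), it assumes normalisation by the number of MT words, and the example triple and timing figure are made up rather than taken from any of the datasets above.

    # Effort indicators for a {source, MT, post-edit} triple.
    # HTER is approximated here with a plain word-level edit distance; the
    # official dataset scores use a proper TER tool, which also handles
    # block shifts.

    def edit_distance(hyp_tokens, ref_tokens):
        """Word-level Levenshtein distance between two token lists."""
        m, n = len(hyp_tokens), len(ref_tokens)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[m][n]

    def approx_hter(machine_translation, post_edit):
        """Edits from MT to post-edit, normalised by post-edit length."""
        hyp, ref = machine_translation.split(), post_edit.split()
        return edit_distance(hyp, ref) / max(len(ref), 1)

    def seconds_per_word(post_editing_seconds, machine_translation):
        """Normalised post-editing time (here: seconds per MT word)."""
        return post_editing_seconds / max(len(machine_translation.split()), 1)

    # Illustrative triple and timing figure (not taken from any dataset above).
    source = "The cat sat on the mat ."
    mt = "El gato se sento en la alfombra ."
    pe = "El gato se sentó sobre la alfombra ."

    print(round(approx_hter(mt, pe), 3))         # 0.25: fraction of edited words
    print(round(seconds_per_word(12.0, mt), 3))  # 1.5 s/word for an assumed 12 s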
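
For the WMT12 dataset, the official quality score is a weighted average over the three annotators' 1-5 judgements, and the individual annotations and weights are distributed with the data. The weighted-average sketch below only illustrates the arithmetic; the scores and weights shown are made up, and the real values ship with the dataset.

    # Combining per-annotator 1-5 scores into a single weighted-average
    # quality score. The scores and weights below are made up; the real
    # per-annotator values and weights ship with the WMT12 data.

    def weighted_average(scores, weights):
        """Weighted mean of annotator scores (weights need not sum to 1)."""
        assert len(scores) == len(weights)
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

    annotator_scores = [4, 3, 4]         # hypothetical 1-5 judgements
    annotator_weights = [1.0, 0.8, 1.2]  # hypothetical per-annotator weights

    print(round(weighted_average(annotator_scores, annotator_weights), 2))  # 3.73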

Others

  • SubIMDB: a structured corpus of subtitles. Download

  • Sentence-aligned English-Portuguese-Spanish parallel corpus: more than 200K (and growing) sentences from scientific news texts extracted from the trilingual FAPESP magazine (see the loading sketch after this list). Download

  • Simple Wikipedia EN-ES parallel corpus: a random selection of 2,016 segments from the English Simple Wikipedia corpus, manually translated into Spanish. Download

  • Images as Context Dataset: Wikipedia images, their captions, human and automatic translations, similar images from ImageNet, and keywords from WordNet. Download
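
A common convention for sentence-aligned corpora such as the trilingual FAPESP one above is one sentence per line, with line i of each language file corresponding to the same segment. The loading sketch below assumes that layout; the file names are hypothetical, and the actual distribution may use a different format.

    # Reading a line-aligned trilingual corpus, assuming one sentence per line
    # with line i aligned across the three files. File names are hypothetical;
    # the actual distribution may use a different layout.

    from itertools import islice

    def read_aligned(en_path, pt_path, es_path, limit=None):
        """Yield (English, Portuguese, Spanish) sentence triples."""
        with open(en_path, encoding="utf-8") as en, \
             open(pt_path, encoding="utf-8") as pt, \
             open(es_path, encoding="utf-8") as es:
            triples = zip(en, pt, es)
            if limit is not None:
                triples = islice(triples, limit)
            for en_line, pt_line, es_line in triples:
                yield en_line.strip(), pt_line.strip(), es_line.strip()

    # Example usage with hypothetical file names:
    # for en_s, pt_s, es_s in read_aligned("fapesp.en", "fapesp.pt", "fapesp.es", limit=3):
    #     print(en_s, "||", pt_s, "||", es_s)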

Software

Slides, etc.