Datasets

Multimodal Learning

  • Multi30K, an extension of the Flickr30K dataset (31K images) with translations of the image descriptions into German, French and Czech.
  • How2, a dataset of instructional videos with 2,000 hours of audio, video and human transcriptions in English, as well as translations into Portuguese for a 300-hour subset.
  • For all WMT multimodal machine translation datasets, please check WMT16, WMT17 and WMT18.
  • Images as Context: Wikipedia images, their captions, human and automatic translations, similar images from ImageNet, and keywords from WordNet.

Quality Estimation

  • QT21 dataset, a large collection of human post-edits and error annotations on translations from statistical and neural MT systems for four language pairs, with up to 45,000 sentences per language pair, as described in this paper.
  • For all WMT quality estimation datasets, please check WMT12, WMT13, WMT14, WMT15, WMT16, WMT17 and WMT18.
  • PLANT - 1,000 machine translation errors manually annotated at the phrase level over 400 French-English source sentences and their machine translations, along with their human post-edited versions and original references. Download. For more information on this dataset, check this paper.
  • TSD13 dataset - English-Spanish WMT12 machine translations by various MT systems, post-edited by 10 translation students: for training, 200 source sentences, each translated by a given MT system and post-edited by all 10 annotators; for test, 100 source sentences translated by 10 MT systems (1,000 data points), with their post-editing shared among the 10 annotators. Download. For more on how we used this dataset to rank MT systems via post-editing, check this paper.
  • WPTP12 dataset - machine translations with post-editing performed by multiple translators with different levels of expertise: 299 unique English source sentences and their machine translations into Spanish produced by eight MT systems, along with their human post-edited versions (by eight translators), totalling 1,624 {source, machine translation, post-edited translation} triples. For each triple, we provide post-editing time (global and normalised, i.e. seconds/word) and HTER score (a small record sketch follows this list). Download. For more on this dataset, check this paper.
  • WMT12 feature sets: feature sets from all but one of the participating teams in the WMT12 Quality Estimation shared task (see data below).
  • WMT12 dataset - machine translations with human judgements and post-edits: 2,254 English-Spanish source sentences and their machine translations, along with their human post-edited versions, original references, and a 1-5 quality score. For the latter, the official score used in the WMT12 shared task on quality estimation is a weighted average over 3 annotators, but all 3 individual annotations (and the weights) are also available for both training and test sets (see the weighted-average sketch after this list). For more on this dataset, check this paper.
  • EAMT11 dataset - machine translations with human judgements and post-edits: 1,000 English-Spanish (en-es) and 2,525 French-English (fr-en) source sentences and their machine translations, along with their human post-edited versions, a 1-4 quality score, post-editing time, and HTER score. Download. For more on this dataset, check this paper.
  • EAMT09 dataset - machine translations with human judgements: 16,000 sentences, their reference translations, their machine translations as produced by 4 SMT systems, and their scores on a 1-4 scale as given by professional translators. Download. For more on this dataset, check this paper.
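
The WPTP12 entry above reports post-editing time both globally and normalised as seconds per word. The sketch below shows what one {source, machine translation, post-edited translation} triple might look like in code; the field names and the choice of normalising by MT length are illustrative assumptions, not the actual file format of the download.

    # Hypothetical record layout for one WPTP12-style triple; field names and
    # the normalisation choice are illustrative, not the released file format.
    from dataclasses import dataclass

    @dataclass
    class PostEditRecord:
        source: str        # English source sentence
        mt: str            # machine translation into Spanish
        post_edit: str     # human post-edited translation
        pe_seconds: float  # global post-editing time in seconds

        def normalised_time(self) -> float:
            """Post-editing time in seconds per word (here: per MT word)."""
            return self.pe_seconds / max(len(self.mt.split()), 1)

    record = PostEditRecord(
        source="The report was published yesterday.",
        mt="El informe fue publicado ayer.",
        post_edit="El informe se publicó ayer.",
        pe_seconds=12.5,
    )
    print(round(record.normalised_time(), 2))  # 2.5 seconds/word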
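For the WMT12 quality estimation data, the official 1-5 score is a weighted average over the three annotators, with the individual annotations and weights also released. A small sketch of that combination step, using made-up weights purely for illustration (the real weights ship with the dataset):

    # Combine three annotators' 1-5 scores using per-annotator weights.
    def weighted_score(scores, weights):
        assert len(scores) == len(weights)
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

    print(round(weighted_score([4, 3, 5], [0.5, 0.2, 0.3]), 2))  # 4.1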

Text Adaptation and Simplification

Software

Slides, etc.