Speech separation challenge

Organisers:

Martin Cooke (University of Sheffield, UK)
Te-Won Lee (UCSD, USA)

Sponsored by the Pascal network.

Latest news

Introduction

Do you have an algorithm for speech separation? If so, you should take part in the first-ever large-scale global comparison of techniques for separating and recognising speech! Results will be presented at a special session of Interspeech 2006, taking place in Pittsburgh (USA) from 17 to 21 September 2006. Note that although the deadline for submission to Interspeech has passed, this website and the challenge will remain open. Feel free to submit any results you obtain to the organisers for display on this site.

The task is to recognise speech from a target talker in the presence of either stationary noise or other speech. You will be provided with plenty of training data and two test sets. One test set contains sentences spoken in speech-shaped noise at a number of SNRs (signal-to-noise ratios) ranging from clean to -12 dB. The other consists of pairs of sentences at a range of TMRs (target-to-masker ratios) from 6 to -9 dB. Only one signal per mixture is provided (i.e. the task is "single microphone").
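As a rough illustration of what a TMR (or SNR) in dB means, the sketch below scales a masker so that the ratio of target RMS to masker RMS matches a requested value before adding the two signals. This is only one common convention and is an assumption on our part; it is not a description of how the challenge mixtures were actually constructed. The function names `rms` and `mix_at_tmr` are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square level of a non-empty sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_tmr(target, masker, tmr_db):
    """Scale the masker so that 20*log10(rms(target)/rms(masker)) == tmr_db,
    then add it to the target. Returns (mixture, masker_gain).

    Assumption: the ratio is defined on RMS levels over the whole signal.
    """
    if len(target) != len(masker):
        raise ValueError("target and masker must have the same length")
    masker_rms = rms(masker)
    if masker_rms == 0.0:
        raise ValueError("masker is silent; the ratio is undefined")
    # Gain that brings the masker to the level implied by the requested ratio.
    gain = rms(target) / (masker_rms * 10 ** (tmr_db / 20.0))
    mixture = [t + gain * m for t, m in zip(target, masker)]
    return mixture, gain


if __name__ == "__main__":
    target = [1.0, -1.0, 1.0, -1.0]
    masker = [2.0, 2.0, -2.0, -2.0]
    mixture, gain = mix_at_tmr(target, masker, tmr_db=-9.0)
    print(len(mixture), gain)
```

A negative TMR, as at the lower end of the two-talker test set, means the masker is louder than the target after scaling.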

Speech material is drawn from the recently collected GRID corpus [1], which consists of simple sentences of the form

     <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>

    e.g. "place white at L 3 now"

(the number after each colon indicates the number of choices at that position).
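The sentence structure can be sketched as a list of slots, which makes it easy to count the possible sentences and to split a sentence into its components. Only the slot sizes (4, 4, 4, 25, 10, 4) and the example sentence come from the description above; the concrete word lists are our recollection of the GRID corpus and should be treated as assumptions, as should the names `GRID_SLOTS`, `count_sentences` and `parse_sentence`.

```python
import math
import string

# Slot sizes follow the grammar above. The word lists themselves are
# assumptions (recalled from the GRID corpus description in [1]).
GRID_SLOTS = [
    ("command", ["bin", "lay", "place", "set"]),
    ("color", ["blue", "green", "red", "white"]),
    ("preposition", ["at", "by", "in", "with"]),
    ("letter", [c for c in string.ascii_uppercase if c != "W"]),  # 25 letters
    ("number", [str(d) for d in range(10)]),                      # 10 digits
    ("adverb", ["again", "now", "please", "soon"]),
]


def count_sentences(slots=GRID_SLOTS):
    """Number of distinct sentences: the product of the choices per slot."""
    return math.prod(len(choices) for _, choices in slots)


def parse_sentence(sentence, slots=GRID_SLOTS):
    """Split a sentence into named slots, checking each word is allowed."""
    words = sentence.split()
    if len(words) != len(slots):
        raise ValueError(f"expected {len(slots)} words, got {len(words)}")
    parsed = {}
    for word, (name, choices) in zip(words, slots):
        if word not in choices:
            raise ValueError(f"{word!r} is not a valid {name}")
        parsed[name] = word
    return parsed


if __name__ == "__main__":
    print(count_sentences())
    print(parse_sentence("place white at L 3 now"))
```

With the slot sizes given above, the grammar allows 4 x 4 x 4 x 25 x 10 x 4 = 64,000 distinct sentences.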

Although the task is not particularly representative of everyday speech, it was chosen for the speech separation challenge because

We welcome contributions from the widest possible range of approaches to the speech separation/recognition problem, e.g. signal processing/enhancement, ICA, CASA, model-based techniques and hybrid approaches. We would be very glad to see results for well-known existing algorithms as well as for the latest techniques: the novelty lies in evaluating algorithms on a large data set so that they can be compared with other approaches. Thanks to funding from the Pascal network, we will be able to provide extensive email support during the challenge period.

Training, development and final test data

The training and development sets are drawn from a closed set of 34 talkers of both genders. The training and development data are available as zip files, which we have split for ease of downloading.

Note: Although utterances were semi-automatically screened (details in [1]), we are aware of a very small number of errors in the training data sets (estimated at < 0.1 %) where the sentence does not match the name (usually due to speaker error, and usually just in the letter component), or where the recording was truncated. We will update a list of corrections as we spot them.

Default recogniser and scoring scripts

An easy-to-use HMM-based recogniser is now available for those of you whose algorithms produce 'enhanced' waveforms. The entire process, from waveforms to scores, has been automated. A scoring script is provided as part of the recogniser package; you should use it even if you plan to use your own recogniser. The outputs of the scoring script provide the minimal set of results to report in your paper. You can also use the scoring scripts during development. This document describes both the scoring scripts and the recogniser.

Rules of the challenge

Results so far

The following authors have kindly agreed to have their Interspeech submissions made available (note that these articles should not be cited without the express permission of the authors):

Martin & Te-Won

 

Acknowledgements: Jon Barker helped to construct the stimuli and two-talker task. Ning Ma and Youyi Lu helped develop the recogniser and associated scoring scripts.

[1] Cooke, M. P., Barker, J., Cunningham, S. P. and Shao, X. (2006) An audio-visual corpus for speech perception and automatic speech recognition, Journal of the Acoustical Society of America, 120: 2421-2424.

[2] Barker, J. and Cooke, M. P. Modelling speaker intelligibility in noise, accepted for Speech Communication.

 

Last updated: 11 November 06