SPHEAR TMR NETWORK

5. Work Plan

Here are details of the tasks. For each task we specify the partners involved, with the lead partner first.

Follow the task update links to find out what's going on.

Theme 1: Temporal Organisation of Auditory Events

Task 1.1: Assessing the role of synchrony/desynchrony of spectral channels in speech perception (Keele, Grenoble, IDIAP).

ASR systems generally use sequences of spectral frames of about 20 ms as input, and it has often been assumed that the human auditory system processes speech in a similar manner. However, human listeners are relatively insensitive to temporal shifts of different spectral regions, at least for vowel-vowel transitions and continuous speech (§6.6). This finding, which has clear implications for recognition strategies, will be explored further using psychoacoustic methodology, focussing on approximants and plosives, where co-ordination of the articulators may be more precise. In parallel, we will develop a phasic system which groups synchronous spectral events into a global event and test its role in speech recognition in noise.

Task 1.2: Understanding the 'precedence effect' for reflective sounds (Bochum, Patras).

The precedence effect is thought to be based on the interaural sound pressure and time differences at the two ears, but recent work [Clifton 87, Wolf 91] suggests a central control mechanism. We will address this point using Bochum's 'virtual auditory environment' technique (see §4.1.1). In addition we will adapt and expand Patras's masking model (§6.7) to the case of reflected sounds and derive psychoacoustic representations of the room response.

task update for 1.2

Theme 2: Segregating Sound Sources

Task 2.1: Spectral and spectro-temporal segregation with shared channels (Grenoble, Keele)

Existing work based on AM maps will be extended here within the experimental paradigm of double-vowel segregation in the ShATR database (§4.5.1). AM maps will be coupled by linear schema-based segregation.

Task 2.2: Integrating grouping strategies across monaural cues (Keele, Grenoble, Sheffield)

The primary cues are harmonicity, common fate and closure. Prosodic fundamental frequency modulation and very low amplitude modulation frequencies (1-20Hz) may also be investigated. The models will be evaluated against subject performance on stimuli where the segregation cues reinforce each other and where they conflict. Strategies for combining cues, either as separate streams or as a joint representation, will be compared.

Task 2.3: Integrating monaural processing with binaural cues (Bochum, Grenoble, Patras)

Initially we will define representations for the cues to be integrated. Signals generated with the virtual auditory reality software (§4.1.1) will be used to test the combination. We will study strategies for adapting binaural processor parameters depending on the output of monaural processing. Details of Bochum's binaural sound localisation model are given in §6.2.

Task 2.4: Coupling of Audio-Visual cues (Grenoble)

The challenge of AV recognition is to achieve better performance by fusing the two modalities than for A alone or for V alone, as listeners do, in order to increase the robustness of ASR in adverse conditions. Purpose-built audiovisual corpora will be exploited, using Grenoble facilities for acquiring and processing visual speech inputs [Lallouache, 90].

task update for 2.4

Task 2.5: Conflict resolution in multi-stream systems and models of auditory attention (Sheffield, Keele).

If a range of cues is used for auditory source separation, then there is scope for conflicts in the assignment of signal features to perceptual streams. High-level processes are likely to be involved in resolving these conflicts as well as in the ability to attend to one of multiple streams. Psychophysical data (e.g. duplex perception, Darwin, Broadbent and Ladefoged) suggest mechanisms for conflict resolution. We aim to extend ASA models to simulate conflict resolution strategies seen in human perceptual experiments and to model attentional mechanisms. Two approaches will be investigated: an oscillatory associative memory (Sheffield, see §4.2.1) and a representational model based on AM maps (Keele) with hierarchical cue ordering. The models will be evaluated on psychophysical stimuli with conflicting cues as well as with spontaneous speech data.

Task 2.6: Schema-based segregation by connectionist techniques (Grenoble).

Segregation of complex sounds such as vowels is performed using knowledge about formant positions. This 'knowledge' can be represented by the weights of a trained multi-layer perceptron. We propose to separate sound streams by using such 'connectionist schemas', and to couple this process to the modules developed in 2.1 and 2.2. Conversely, the segregation can be driven using outputs of the perceptron as confidence, and a primitive segregation process can be recruited top-down.

task update for 2.6

Theme 3: Speech Recognition within Auditory Scenes

Task 3.1: Multistream recognition techniques (IDIAP, Sheffield).

Multistream recognition theory (§4.4.1) will be strengthened by a number of studies: we have to learn the trade-offs and efficient use of several model parameters, such as optimal features and timescales for each stream, choice of recombination unit (phone, syllable or other), recombination criteria (likelihood- or posterior-based recombination, mixture of experts paradigm...), optimal weighting scheme (based on some kind of reliability measure or cross-correlation information), and stream-specific linguistic categories.
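As a rough illustration of likelihood-based recombination with a weighting scheme, the sketch below combines per-stream log-likelihoods for each candidate unit using a normalised weighted sum. The function names, the stream weights and the toy scores are all hypothetical, and the weighted sum is only one of the recombination criteria the task leaves open.

```python
def recombine(stream_loglikes, weights):
    """Weighted sum of per-stream log-likelihoods for each candidate unit.

    stream_loglikes: list of dicts, one per stream, mapping unit -> log-likelihood.
    weights: one non-negative reliability weight per stream (assumed to be given).
    """
    if len(stream_loglikes) != len(weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    units = stream_loglikes[0].keys()
    # Normalise the weights so that the combined score stays on a comparable scale.
    return {
        unit: sum(w / total * scores[unit]
                  for w, scores in zip(weights, stream_loglikes))
        for unit in units
    }


def best_unit(stream_loglikes, weights):
    """Return the unit with the highest recombined score."""
    combined = recombine(stream_loglikes, weights)
    return max(combined, key=combined.get)


# Toy example (invented numbers): two sub-band streams that disagree.
low_band = {"a": -2.0, "i": -5.0}
high_band = {"a": -6.0, "i": -1.0}

print(best_unit([low_band, high_band], [0.9, 0.1]))  # low band trusted more
print(best_unit([low_band, high_band], [0.1, 0.9]))  # high band trusted more
```

In this toy case the decision follows whichever stream is given the larger weight, which is why the choice of reliability measure matters.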

task update for 3.1

more on multi-channel recognition

Task 3.2: Missing data recognition techniques (Sheffield, Grenoble, IDIAP)

In occluded speech recognition (§4.4.2), we will schedule work on various theoretical and practical issues concerned with statistical estimation of missing data, and how such estimates may be used in recognition algorithms. Multiband recognition might also be adapted to the missing data case (only a subset of the bands being available for use at a given time). A related study will use the 'dominance effect' in time-frequency representations in order to track vowel formants in noisy conditions, supplying the resulting marked channels to a missing data recogniser.
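One simple way to use a reliability mask in recognition is to score only the channels marked as reliable. The sketch below assumes a diagonal-covariance Gaussian model for each class, in which case ignoring the unreliable channels amounts to marginalising over them. The function name, the mask and the toy models are hypothetical, and this is not necessarily the estimation scheme the task will adopt.

```python
import math


def masked_log_likelihood(frame, mask, means, variances):
    """Log-likelihood of the reliable channels under a diagonal Gaussian.

    frame: spectral values, one per channel.
    mask: True where the channel is judged reliable, False where it is missing.
    means, variances: per-channel model parameters (assumed already trained).
    """
    log_like = 0.0
    for x, reliable, mu, var in zip(frame, mask, means, variances):
        if reliable:
            # Missing channels contribute nothing, i.e. they are integrated out.
            log_like += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return log_like


# Toy example (invented numbers): the middle channel is swamped by noise.
frame = [1.0, 9.0, 2.0]
all_reliable = [True, True, True]
mask = [True, False, True]
model_a = ([1.0, 1.0, 2.0], [1.0, 1.0, 1.0])  # (means, variances)
model_b = ([3.0, 9.0, 4.0], [1.0, 1.0, 1.0])

for name, (means, variances) in (("A", model_a), ("B", model_b)):
    full = masked_log_likelihood(frame, all_reliable, means, variances)
    partial = masked_log_likelihood(frame, mask, means, variances)
    print(name, round(full, 3), round(partial, 3))
```

With every channel included, the noisy middle channel pulls the decision towards model B; once that channel is masked out, model A scores higher.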

task update for 3.2

Task 3.3: Estimating the reliability of speech representations (Grenoble, Sheffield, IDIAP)

Issues are: What parameters control the reliability of a given measurement in a fusion process? How should we estimate the reliability of these parameters? What control mechanism can utilise these reliability estimates? Recogniser confidence might be assessed on the basis of signal-to-noise estimates or (in a multistream formalism) on cross-stream correlations or (in missing-data recognition) on the degree of occlusion.

Task 3.4: Speech enhancement techniques using noise masking models of the auditory mechanism (Patras, Bochum, Daimler-Benz)

This task develops research on speech enhancement based on auditory masking thresholds (§3.3 and §6.7). A complementary objective will be to provide enhanced signals to other SPHEAR labs so that they will be able to assess their methods by comparison with the original degraded data.

Task 3.5: Recognition from binaural models (Bochum, IDIAP, Daimler-Benz)

This existing research strand (§6.2) will be extended to situations closer to real life by modelling more than one interfering sound source and considering reflections and reverberations. We will adapt the extraction of the desired signal to the actual acoustic environment. The virtual auditory environment technique (§4.1.1) will provide data for recogniser training and testing. We will investigate binaural recognition in the multi-stream formalism.

Task 3.6: Recognising 'near-speech' (Sheffield, Keele)

The methodology behind this task is explained in §4.2.2. We will use the multistream and missing data speech recognition techniques (tasks 3.1, 3.2) to compare the performance of listeners and algorithms on speech which has been spectrally filtered (bandpass, bandstop, lowpass, highpass), temporally distorted or masked by nonstationary additive noise. While data for listeners exists for many of these stimulus types, we schedule new perceptual studies on the effects of noise bands of various durations in different spectral regions.

Theme 4: Recognition Applications

Task 4.1: Auditory processing for cellular phone and in-car applications (Daimler-Benz, in consultation with all other partners)

The objective of this task is to assess existing auditory techniques for the cellular phone and in-car tasks. This will involve comparison with standard approaches for robust ASR such as spectral subtraction or parallel model combination (§3.2).
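For readers unfamiliar with the baseline, the sketch below shows a common textbook form of magnitude spectral subtraction applied to a single frame: a noise magnitude estimate is subtracted in each frequency bin and the result is floored to avoid negative values. The function name, the floor parameter and the numbers are assumptions for illustration and are not taken from the comparison systems themselves.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.05):
    """Magnitude spectral subtraction for one frame.

    noisy_mag: magnitude spectrum of the noisy frame, one value per bin.
    noise_mag: estimated noise magnitude per bin (e.g. averaged over non-speech frames).
    floor: fraction of the noisy magnitude kept as a lower bound (hypothetical default).
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    # Flooring prevents negative magnitudes where the noise estimate is too large.
    return [max(y - d, floor * y) for y, d in zip(noisy_mag, noise_mag)]


# Toy example (invented numbers): the second bin is dominated by noise.
enhanced = spectral_subtract([1.0, 0.5, 2.0], [0.2, 0.6, 0.5])
print(enhanced)
```

The enhanced magnitudes would normally be recombined with the noisy phase before resynthesis or passed directly to feature extraction.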

Task 4.2: Multi-stream recognition in the cellular phone and in-car applications (IDIAP, Daimler-Benz).

This task will start by testing and comparing the different trade-offs and possible implementations of the multi-stream approach outlined in task 3.1 on regular (microphone and telephone) international reference databases with known comparison points. To test robustness to variable conditions, we will then test the same systems on databases artificially corrupted by adding stationary narrow-band noise, non-stationary narrow-band noise and wide-band noise (e.g. car noise). The systems will not be re-trained, since we aim to develop methods that are robust to unknown conditions. Finally, we will experiment with application-specific databases for the cellular phone task.
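Artificial corruption of a clean database is usually done by scaling a noise recording so that the mixture has a chosen signal-to-noise ratio. The sketch below shows that scaling on short sample lists. The function name and the toy signals are hypothetical, and the SNR levels to be used in the task are not specified here.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR (in dB); returns (mixture, noise gain)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("both signals must have non-zero power")
    # Choose the gain so that speech_power / (gain**2 * noise_power) equals 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Toy example (invented samples): unit-power speech, noise with power 0.25.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

for snr_db in (0, 10, 20):
    mixture, gain = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(gain, 3), [round(v, 3) for v in mixture])
```

The same routine can be applied with narrow-band or wide-band noise recordings; only the noise input changes.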

Task 4.3: Missing data recognition in the cellular phone and in-car applications (Sheffield, Daimler-Benz).

This task will follow the same pattern as task 4.1, but we will also investigate whether the small-vocabulary aspects of the cellular phone and in-car tasks (e.g. voice dialling) and the speaker-dependent aspects (personal phone-book) are more suitable for partial recognition than large-vocabulary problems.

Task 4.4: Engineering SPHEAR recognition techniques into recognition products (Daimler-Benz, IDIAP, Sheffield)

In this final task we are concerned with adapting what we have learned for use in a real-time recogniser with limited memory and computational resources. There will be an inevitable trade-off between what we would like to do and what is feasible in a commercial product.