SPHEAR TMR NETWORK

4. RESEARCH METHOD

Our research is inherently inter-disciplinary and requires a variety of methodologies, providing excellent training opportunities. Where particularly novel techniques are to be used, we explain them in sub-sections.

4.1 Psychoacoustic experimentation

This involves investigating the performance of listeners under controlled conditions. Keele, Bochum and Grenoble have established facilities for this kind of work, which is the basis of tasks 1.1, 1.2, 2.4 and 3.6. A novel technique is to be used by Bochum in tasks 1.2, 2.2 and 3.5:

4.1.1 Psychoacoustic investigations with virtual sound sources

Previous experiments investigating the precedence effect have used simple, artificial stimuli with only a few sound sources and reflections. The perceptual plausibility of the stimuli is of major importance for the reaction of the listener, and this plausibility depends on how closely the sound pressure at the ear drum during the experiment matches that in real environments.

In an anechoic environment, the sound reaching the eardrum is distorted by the head, body and pinna in a way that depends on the direction of incidence. These distortions change the spectrum at each ear (monaural cues) and the differences between the ears (interaural time differences, interaural level differences), and can be described formally by a transfer function, the head-related transfer function (HRTF). HRTFs are measured with miniature microphones positioned within the ear canal. An authentic free-field situation is then generated by filtering the test sound with the corresponding HRTF and delivering the result through headphones. Situations in real rooms can be simulated if the reflections at the walls are modelled as mirror sources: the sound of the direct source and of each mirror source is filtered with the HRTF for its direction. Recently, the method has been improved to allow movements of the listener and the sound sources in the virtual auditory environment.

With this technique, psychoacoustical experiments with natural and plausible stimuli can be designed while all parameters of the virtual scenarios remain under control. The advantages of this technique have been demonstrated in neurophysiological experiments [Hartung and Sterbing 96].
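To make the rendering procedure concrete, the following minimal sketch (in Python, with purely illustrative names; measured head-related impulse responses are assumed to be available as arrays) filters a test sound with the HRIR pair for the direct source and adds one wall reflection as a delayed, attenuated mirror source:

    # Minimal sketch of binaural rendering with HRTFs and one mirror source.
    # hrir_left/hrir_right are head-related impulse responses (time-domain
    # HRTFs) for a given direction; all names here are illustrative.
    import numpy as np

    def render_binaural(signal, hrir_left, hrir_right):
        """Filter a mono signal with the HRIR pair for one direction."""
        left = np.convolve(signal, hrir_left)
        right = np.convolve(signal, hrir_right)
        return left, right

    def add_mirror_source(direct_lr, signal, hrir_l, hrir_r,
                          delay_samples, attenuation):
        """Add one wall reflection, modelled as a delayed, attenuated
        mirror source rendered through the HRIRs for its own direction."""
        refl_l, refl_r = render_binaural(attenuation * signal, hrir_l, hrir_r)
        n = max(len(direct_lr[0]), delay_samples + len(refl_l))
        out_l, out_r = np.zeros(n), np.zeros(n)
        out_l[:len(direct_lr[0])] += direct_lr[0]
        out_r[:len(direct_lr[1])] += direct_lr[1]
        out_l[delay_samples:delay_samples + len(refl_l)] += refl_l
        out_r[delay_samples:delay_samples + len(refl_r)] += refl_r
        return out_l, out_r

In practice each reflection direction requires its own measured HRIR pair, and the delay and attenuation follow from the path length of the mirror source.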

4.2 Computer modelling of auditory processing and speech perception

We build computer models in order to develop and test theories of auditory competencies. Sheffield, Bochum, Keele and Grenoble have leading reputations for this work. The field now known as Computational Auditory Scene Analysis was to a large extent pioneered by these labs. Modelling is important in tasks 1.2, all the tasks in theme 2 and task 3.6. Computer models may themselves take many forms, for instance direct implementations of mathematical systems, object-oriented simulations, or neural nets of various kinds (e.g. tasks 2.5 and 2.6). Two novel modelling approaches are outlined below:

4.2.1 Neuronal oscillators (task 2.5)

Neural models take a number of forms, ranging from abstract artificial neural networks to more physiologically motivated models which represent the grouping of auditory features by patterns of coherent neural oscillations. We adopt the latter approach in task 2.5. In this model, the grouping of energy in auditory channels is signalled by the pattern of temporal synchronisation in a network of neural oscillators. An interesting property of the model is that the oscillators are able to code the assignment of channel energy to more than one sound source [Brown et al 96].
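The sketch below is not the task 2.5 model itself, but a simplified phase-oscillator illustration of the underlying principle: channels whose features are similar are strongly coupled and therefore synchronise, so each sound source comes to be represented by one common phase:

    # Illustrative sketch (not the task 2.5 model): a small network of
    # phase oscillators in which auditory channels with similar features
    # are coupled and synchronise, signalling membership of one source.
    import numpy as np

    def simulate(features, width=5.0, steps=2000, dt=0.01, k=2.0):
        """features: one value per auditory channel (e.g. dominant F0).
        Coupling is strong between channels with similar features, so
        each group of channels settles into a common phase."""
        n = len(features)
        sim = np.exp(-(np.subtract.outer(features, features) / width) ** 2)
        phase = 2 * np.pi * np.random.rand(n)
        for _ in range(steps):
            # Kuramoto-style update: each oscillator is pulled towards
            # the phases of the channels it is strongly coupled to.
            diff = np.subtract.outer(phase, phase)
            phase = phase + dt * (k / n) * np.sum(sim * np.sin(-diff), axis=1)
        return np.mod(phase, 2 * np.pi)

    phases = simulate(np.array([100., 101., 99., 200., 202., 198.]))
    # Channels near 100 Hz share one phase, channels near 200 Hz another.

The model of [Brown et al 96] uses relaxation oscillators rather than the simple phase oscillators shown here, but the grouping-by-synchrony principle is the same.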

4.2.2 Modelling recognition of `near-speech' (task 3.6)

A large number of psychoacoustical experiments have shown that listeners are remarkably robust in recognising speech which has undergone a wide variety of processes such as filtering, temporal distortion, masking and spectral reduction. Any adequate model of speech perception must account for these abilities. Current acoustic representations of speech have (almost exclusively) been designed with clean speech as the sole model. We contend that designing representations and recognition processes to cope with `near-speech' will lead to insights into robust speech recognition. This is the methodology of task 3.6.
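As an illustration of how such `near-speech' stimuli might be generated, the following sketch applies three of the distortions listed above (band-limiting, temporal distortion and noise masking) to a clean utterance; all parameter values are illustrative only:

    # Hedged sketch of generating `near-speech' test stimuli by applying
    # the kinds of distortions listed above to a clean utterance.
    import numpy as np
    from scipy.signal import butter, lfilter, resample

    def bandpass(x, fs, lo=300.0, hi=3400.0):
        """Telephone-style band-limiting (filtering)."""
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        return lfilter(b, a, x)

    def time_compress(x, factor=0.7):
        """Uniform temporal distortion by resampling."""
        return resample(x, int(len(x) * factor))

    def add_noise(x, snr_db=0.0):
        """Additive noise masking at a given signal-to-noise ratio."""
        noise = np.random.randn(len(x))
        noise *= np.sqrt(np.mean(x ** 2)
                         / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
        return x + noise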

4.3 Evaluating auditory models

An important issue is how one should evaluate a computer model. In some cases the `right answer' can be made available for comparison. In other tasks (e.g. 3.6) it is appropriate to compare the performance of the model with the performance of listeners on the same stimuli. A specialised task may be designed to isolate the effect one is modelling. The `double-vowel' paradigm for sound segregation research (used in tasks 2.1, 2.2 and 2.6) is a good example: listeners and models are asked to recognise simultaneously-spoken vowels. Sometimes a further control is available from the use of synthetic stimuli. Automatic speech recognition often forms a useful paradigm for assessing whether a model of auditory processing is adequate. For instance, one can attempt to recognise speech evidence which has been segregated from other sources in an auditory scene. Sheffield has developed this particular paradigm, which is used in task 3.2 and explained further below. All the SPHEAR partners have experience in recognition experiments.
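For concreteness, the following sketch shows one simple way a synthetic double-vowel stimulus can be constructed: two vowels with different fundamental frequencies are generated as pulse trains shaped by formant resonators and then summed. The formant values are the classic Peterson and Barney averages; everything else is illustrative:

    # Minimal sketch of a double-vowel stimulus: two harmonic sources
    # with different F0s, shaped by formant resonators, then summed.
    import numpy as np
    from scipy.signal import lfilter

    def vowel(f0, formants, fs=16000, dur=0.5):
        """Impulse train at f0 through second-order formant resonators."""
        n = int(fs * dur)
        src = np.zeros(n)
        src[::int(fs / f0)] = 1.0          # glottal pulse train
        out = src
        for f, bw in formants:             # one resonator per formant
            r = np.exp(-np.pi * bw / fs)
            a = [1.0, -2 * r * np.cos(2 * np.pi * f / fs), r * r]
            out = lfilter([1.0], a, out)
        return out / np.max(np.abs(out))

    # /a/ and /i/ with a four-semitone F0 difference, then summed:
    va = vowel(100.0, [(730, 90), (1090, 110), (2440, 120)])
    vi = vowel(126.0, [(270, 60), (2290, 100), (3010, 120)])
    double_vowel = va + vi

Listeners and models are then both asked to report the two constituent vowels, and their identification rates can be compared directly as a function of the F0 difference.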

4.4 Automatic speech recognition techniques

In theme 3 we intend to develop, and perhaps combine, two new ASR paradigms in order to provide a framework in which we can cast multi-scale models of temporal events and multi-modal input, and deal with incomplete data. We provide more details below.

4.4.1 Multistream speech recognition (task 3.1)

Multistream speech recognition is a generalisation of the multiband approach, which has already shown promise [Bourlard and DuPont 96; Hermansky et al 96]. In multiband recognition, the frequency range is split into several bands, and the information in each band is used for phonetic probability estimation by an independent module. These probabilities are combined later in the recognition process. The multiband paradigm is motivated by psychoacoustic studies [Allen 93], by its potential robustness to noise, and by the potential for using parallel processing architectures. Multiband speech recognition has recently been shown to give a lower word error rate than traditional systems in both normal and band-limited noise conditions. Although still preliminary, this result was recently confirmed on a difficult conversational speech task in the framework of an international evaluation [WS 96]. While the theory of multistream systems is in principle straightforward, we have still to learn the trade-offs between segment choices, features, recombination approaches and so on.
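The sketch below illustrates only the recombination step, under the assumption that each band module already outputs per-frame phone posterior probabilities; the log-linear weighted combination shown is one common choice, not necessarily the one we will adopt:

    # Sketch of multiband recombination: per-band phone posteriors are
    # merged by a weighted log-linear combination, renormalised per frame.
    import numpy as np

    def combine_bands(band_posteriors, weights=None):
        """band_posteriors: list of (frames x phones) arrays, one per
        band module. Returns combined per-frame phone posteriors."""
        if weights is None:
            weights = np.ones(len(band_posteriors)) / len(band_posteriors)
        log_p = sum(w * np.log(p + 1e-10)
                    for w, p in zip(weights, band_posteriors))
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        return p / p.sum(axis=1, keepdims=True)

One attraction of this formulation is that the weights can be adapted, for instance to down-weight bands judged to be dominated by noise.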

4.4.2 Missing-data speech recognition (task 3.2)

The problem of recognition with incomplete data is of interest in many fields, most obviously vision [Ahmad and Tresp 93]. Sheffield is probably the first lab to have researched this topic in an ASR context, and has reported promising results [Green et al 95]. There are two approaches to dealing with missing data: one can ignore it (for instance by integrating over the missing dimensions with marginal distributions) or attempt to reconstruct it (for instance by forming a distribution for the missing data given the present data). Both techniques have a number of variations, depending on the form of the distributions chosen, the transformation from auditory representations to those suitable for statistical models, and the assumptions made for tractable computation. The theory and practice are still developing [Cooke et al 96], and this work will continue in task 3.2.
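For a diagonal-covariance Gaussian state model, both strategies take a particularly simple form, sketched below (variable names are illustrative; `present' is a boolean mask over feature dimensions):

    # Sketch of the two missing-data strategies for a diagonal-covariance
    # Gaussian state model: marginalise over missing dimensions, or
    # reconstruct them from the class-conditional distribution.
    import numpy as np

    def marginal_log_likelihood(x, present, mean, var):
        """Score using only the present dimensions: for a diagonal
        Gaussian, integrating out the missing dimensions simply drops
        their terms from the log-likelihood."""
        d = x[present] - mean[present]
        return -0.5 * np.sum(d * d / var[present]
                             + np.log(2 * np.pi * var[present]))

    def impute(x, present, mean):
        """Reconstruct missing dimensions with their class-conditional
        means (the simplest form of data imputation)."""
        y = mean.copy()
        y[present] = x[present]
        return y

With full covariances, or with constraints such as bounds on the masked energy, the marginal and the conditional reconstruction are more elaborate, which is part of what task 3.2 will explore.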

4.5 Corpus-based methods and assessment

Much of our work depends on the availability of recorded databases of speech and other sounds. In some cases these corpora are specialised for particular experiments, for instance the double vowels (see above) and the audio-visual corpus associated with task 2.5. For larger-scale experiments, a number of speech corpora are now available, both for read speech (e.g. TIMIT, Wall Street Journal) and for more spontaneous speech (e.g. SWITCHBOARD). These corpora are necessary for the training and evaluation of ASR systems. Wherever appropriate, we will seek to perform contrasting studies on the same corpus, so that results are directly comparable, and we will follow the internationally established ASR evaluation methodology. Two resources of direct interest to SPHEAR are described below.

4.5.1 The ShATR corpus of auditory scenes

The ShATR corpus was designed and recorded by Sheffield, in collaboration with ATR Kyoto, for research in Auditory Scene Analysis within the HCM network SPHERE, the precursor to this proposal. In the ShATR recordings, five speakers sitting round a table are engaged in collaborative problem-solving, with a high degree of speaker overlap. Each speaker has an individual head-mounted microphone, and binaural recordings of the scene are made from a manikin. In addition, an omni-directional microphone provides a monaural recording of the complete scene. The individual channels provide the `right answer' for separation studies. ShATR is to be used in tasks 2.1, 2.2 and 3.2.

4.5.2 Speech corpora

For the recognition applications in theme 4, Daimler-Benz will provide databases related to the in-car and cellular-phone tasks. It will be possible to evaluate the auditory techniques we develop on these applications, in contrast to conventional techniques.