SPHEAR TMR NETWORK

3. SCIENTIFIC ORIGINALITY OF THE PROJECT

In this section the paragraphs are numbered for ease of reference.

3.1 There has been much progress in Automatic Speech Recognition (ASR) over the last decade. Stochastic modelling and connectionist techniques have been intensively investigated, using increasingly large speech databases and fuelled by the DARPA common evaluation exercises. Most effort in acoustic modelling has been devoted to Hidden Markov Models (HMMs): current large-vocabulary systems typically train HMMs for triphones and incorporate trigram language models [Young 96]. Sheffield and IDIAP are collaborators in one of the leading large-vocabulary systems, which is a hybrid of HMM and connectionist methods [Hochberg et al. 95]. Speaker-independent performance on read speech is approaching 90% word accuracy, but this falls to about 50% for spontaneous speech [WS 96].

3.2 There has been considerable interest in 'Robust ASR', i.e. maintaining recognition performance in the presence of noise [see Gong 95 for a recent review]. Much work has sought to improve recognition performance in the presence of stationary background noise. Well-known methods are spectral subtraction [Lockwood and Boudy 92, Van Compernolle 89] and adaptation of the recogniser's models to the characteristics of the noise, e.g. [Gales and Young 93]. ASR can be improved in non-stationary noise by incorporating models of the noise sources themselves into the recognition system [Varga and Moore 90]. However, this approach is restricted to typical noises which can be modelled in advance. There has been relatively little work on compensation for the influence of room reverberation. One such approach modifies the trajectories of the short-term energy in individual subbands [Hirsch 90]. All these techniques are based on estimation of the noise characteristics. Their performance is poor in comparison to human listeners on similar data: typically, recognition rates begin to fall at signal-to-noise ratios of +10dB, whereas there is no appreciable effect on human performance until -5dB [Green et al 95]. Furthermore, the auditory system maintains robust performance in unpredictable environments, where the characteristics of the noise are unknown and not fixed. In this network we intend to exploit computer modelling of the auditory system for robust ASR.
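To make the estimation-based approach concrete, the following Python sketch implements basic magnitude spectral subtraction; the frame length, overlap, noise-estimation interval and spectral floor are illustrative assumptions rather than values taken from the cited work.

    # Minimal magnitude spectral subtraction (illustrative sketch only).
    # Assumes x is a 1-D numpy array whose first few frames contain noise
    # alone, from which a stationary noise spectrum is estimated.
    import numpy as np

    def spectral_subtraction(x, frame_len=256, hop=128, noise_frames=10, floor=0.01):
        window = np.hanning(frame_len)
        frames = [x[i:i + frame_len] * window
                  for i in range(0, len(x) - frame_len, hop)]
        spectra = [np.fft.rfft(f) for f in frames]
        # Estimate the noise magnitude spectrum from the leading frames.
        noise_mag = np.mean([np.abs(s) for s in spectra[:noise_frames]], axis=0)
        out = np.zeros(len(x))
        for i, s in enumerate(spectra):
            mag, phase = np.abs(s), np.angle(s)
            # Subtract the noise estimate; retain a small spectral floor.
            clean_mag = np.maximum(mag - noise_mag, floor * mag)
            frame = np.fft.irfft(clean_mag * np.exp(1j * phase), frame_len)
            out[i * hop:i * hop + frame_len] += frame * window
        return out

The cleaned signal can then be passed to a recogniser trained on clean speech; the scheme works only to the extent that the noise estimate taken from speech-free frames remains valid, which is precisely the limitation noted above for non-stationary noise.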

3.3 One approach to robust ASR is to apply speech enhancement techniques to the noisy signal, with the aim of making it more like the clean speech on which the system was trained. Speech enhancement techniques have been studied for nearly three decades. Much work has been based on the empirical or optimal modification of the degraded signal's Short-Time Spectral Amplitude (STSA), using mathematical criteria. The state of the art in such optimisation is based on statistical estimators, using Minimum Mean Square Error criteria or HMM models of the original signal's STSA or linear prediction parameters. Other contemporary methods employ wavelet signal decomposition or Neural Networks. However, relatively little effort has been devoted to optimising STSA modifiers using perceptual criteria. Patras have recently begun to research such techniques, utilising noise masking models of the auditory mechanism [Paraskevas and Mourjopoulos 95, Tsoukalas et al 97], work which we intend to develop within the SPHEAR network (Task 3.4).
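As a point of reference for the statistical estimators mentioned above, the sketch below applies a Wiener-type gain to the STSA of a single frame; it is a generic illustration, not the MMSE-STSA estimator nor the masking-based scheme of the Patras work, and the SNR floor is an assumed parameter.

    # Generic Wiener-type STSA modifier for one spectral frame (illustrative).
    import numpy as np

    def wiener_stsa(noisy_mag, noise_mag, snr_floor=0.05):
        # A-posteriori SNR from the noisy magnitude and a noise estimate.
        post_snr = (noisy_mag ** 2) / np.maximum(noise_mag ** 2, 1e-12)
        # Crude a-priori SNR; decision-directed smoothing across frames
        # would normally be used here.
        prio_snr = np.maximum(post_snr - 1.0, snr_floor)
        gain = prio_snr / (1.0 + prio_snr)
        return gain * noisy_mag

A perceptually optimised modifier would instead shape the gain so that residual noise stays below an estimated auditory masking threshold, rather than minimising a purely mathematical error criterion.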

3.4 'Auditory Scene Analysis' (ASA) is the term coined by Bregman [Bregman 90] for the remarkable ability of listeners to separate, and pay selective attention to, the different sound sources in a cluttered acoustic environment. Bregman considers that an auditory scene is perceived as a number of 'streams', each stream corresponding to a single source. SPHEAR partners Sheffield, Keele, Grenoble and Bochum have played a leading role in the development of 'Computational Auditory Scene Analysis' (CASA), a field which is now of worldwide interest [see Proc. 1st Workshop on Computational Auditory Scene Analysis, IJCAI 95, Montreal, for a recent collection of papers]. Following (and developing) the psycho-acoustic results of Bregman and others, CASA researchers have modelled the formation of time-frequency representations in the auditory system and the Gestalt grouping cues (such as harmonicity, common amplitude modulation and binaural differences) used to make organisation in these representations explicit and thus form Bregman's streams. Cooke and Brown of Sheffield were perhaps the first to demonstrate sound-source segregation in this way [Cooke 93, Cooke and Brown 93, Brown and Cooke 94]. Bochum have a long-standing interest in spatial hearing [Blauert 96] and have led the exploitation of binaural grouping cues in their 'cocktail party listener' [Bodden 93]. Berthommier of Grenoble and Meyer of Keele have an established collaboration in the development of Amplitude Modulation (AM) maps for CASA [Berthommier and Meyer 95, Meyer and Berthommier 96]. Audiovisual coherence also plays a part in scene analysis, and Grenoble is a leading player in the field of audiovisual speech perception and recognition [Robert-Ribes et al 96].
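A minimal sketch of one primitive grouping cue, harmonicity, is given below: each filterbank channel is autocorrelated, and channels whose autocorrelation peaks at a common lag (a common fundamental period) are assigned to one stream. The sample rate, pitch range and lag tolerance are illustrative assumptions, and the sketch is not a description of any of the cited systems.

    # Group auditory filterbank channels by common periodicity (harmonicity cue).
    # 'channels' is an (n_channels, n_samples) numpy array of filterbank outputs.
    import numpy as np

    def group_by_harmonicity(channels, fs=16000, fmin=80.0, fmax=320.0, tol=2):
        min_lag, max_lag = int(fs / fmax), int(fs / fmin)
        best_lags = []
        for ch in channels:
            env = np.abs(ch)                                  # crude envelope
            ac = np.correlate(env, env, mode='full')[len(env) - 1:]
            best_lags.append(min_lag + int(np.argmax(ac[min_lag:max_lag])))
        best_lags = np.array(best_lags)
        dominant = int(np.bincount(best_lags).argmax())       # most common period
        # Channels whose best lag is close to the dominant period form one stream.
        return np.abs(best_lags - dominant) <= tol

Analogous procedures can be written for the other cues: common amplitude modulation compares envelope modulation across channels, and binaural grouping compares interaural time and level differences.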

3.5 CASA is still a young field, and the first two themes of the work we propose involve improvements to existing techniques and frameworks in which different strands of research can be brought together. Thus theme 1 involves dealing with auditory events at multiple time-scales, and theme 2 focuses on the interplay between 'primitive' and 'schema-driven' processes in sound source segregation. Primitive processes make use of low-level constraints derived from the physics of sound and the properties of the auditory system, whereas schema-driven processes use more specific knowledge about the properties of sound-source types. Primitive processing is usually thought of as 'bottom-up', seeking to bind together auditory evidence, whereas schema-driven processing often, but not always, works 'top-down', looking for evidence to support a hypothesis. When multiple sound sources are present, we need to distinguish between shared-channel conditions, in which the response in an auditory frequency channel is additive between the sources, and dominant-channel conditions, in which each channel can be assigned to a single source, the other sources having minimal effect. More details are given in section 2 and section 5.
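The channel conditions can be made concrete in terms of local energies: in the sketch below a channel is labelled as dominated by one source when that source's energy exceeds the combined energy of the others by a margin, and as shared otherwise. The 3 dB margin and the function itself are illustrative assumptions, not part of the proposed framework.

    # Label each auditory channel as dominated by one source or shared.
    # 'energies' is an (n_sources, n_channels) numpy array of channel energies.
    import numpy as np

    def label_channels(energies, margin_db=3.0):
        margin = 10.0 ** (margin_db / 10.0)
        labels = []
        for ch in energies.T:
            strongest = int(np.argmax(ch))
            rest = ch.sum() - ch[strongest]
            # Dominant-channel condition: one source exceeds the rest by the margin.
            labels.append(strongest if ch[strongest] > margin * max(rest, 1e-12)
                          else -1)                            # -1 marks a shared channel
        return np.array(labels)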

3.6 Conventional ASR systems generally take a single sound wave as input, and assume that this contains all the information from a clean speech source, and nothing else. This assumption does not hold in natural surroundings, and for robust ASR it will be necessary to adapt recognition techniques to deal with more realistic cases. This is theme 3 of our proposal. In general, we will have multiple data streams, for instance from the two ears, from analysis at multiple time-resolutions, from multi-modal input, and/or from independent recognisers in different frequency bands. IDIAP [Bourlard and Dupont 96] and others [Hermansky et al 96] are already researching a `multistream recognition' formalism, which we intend to develop in SPHEAR. There is support for a multistream approach from the evidence that human listeners process the information in different frequency subbands quite independently of each other [Fletcher 53, Allen 94]. If the observation streams are assumed to be entirely synchronous, they may be accommodated easily in conventional recognition systems, but if we have asynchronous streams, for instance at multiple time scales, a new structure is required. The core idea is to have an independent recogniser in each stream, but to bring the `opinions' of these recognisers together in time at some recombination level. Subject to some viable assumptions, training algorithms for multistream recognition can be formulated and the admissibility property (the guarantee of finding the best solution in its own terms) can be preserved. This is further discussed in section 4 and section 5.
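A minimal sketch of the recombination step is given below, assuming each stream's recogniser delivers per-class log-likelihoods at the recombination level; the weighted linear combination is just one simple recombination rule, and the equal weights are an illustrative assumption.

    # Combine the 'opinions' of independent per-stream recognisers by a
    # weighted sum of per-class log-likelihoods (one simple recombination rule).
    import numpy as np

    def recombine(stream_loglikes, weights=None):
        # stream_loglikes: (n_streams, n_classes) array of log-likelihoods.
        stream_loglikes = np.asarray(stream_loglikes, dtype=float)
        if weights is None:
            weights = np.ones(len(stream_loglikes)) / len(stream_loglikes)
        combined = weights @ stream_loglikes                  # (n_classes,)
        return int(np.argmax(combined)), combined

In the asynchronous case, the recombination level would be a linguistically meaningful anchor point, for instance a syllable or word boundary, rather than an individual frame.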

3.7 Furthermore, the data presented to the recogniser will generally be incomplete: if CASA is used to separate the speech evidence from other sources prior to recognition, this separation will never be perfect. Some spectro-temporal regions will not be recovered because they were appropriated by other sources - the dominant-channel condition above. We call such data `occluded speech', by analogy with vision. Sheffield have worked on this problem of missing-data recognition, and have adapted HMM techniques to the incomplete-observation case. In digit recognition, more than 90% of the time-frequency elements can be removed (by simulated ASA) without affecting recognition performance [Green et al 95]. This result demonstrates the potential of CASA for robust recognition - it approaches human performance.
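The adaptation to incomplete observations can be illustrated by marginalising a diagonal-covariance Gaussian state likelihood over the missing spectro-temporal components, so that only the components marked as present contribute to the score; this is a schematic sketch under that assumption, not the full Sheffield system.

    # State likelihood for 'occluded speech': integrate out the missing
    # components of a diagonal Gaussian, scoring only the present dimensions.
    import numpy as np

    def marginal_log_likelihood(obs, present, mean, var):
        # obs, mean, var: spectral vectors; present: boolean mask of reliable dims.
        o, m, v = obs[present], mean[present], var[present]
        return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (o - m) ** 2 / v)

The same idea carries over to HMM decoding: each state likelihood in the search is replaced by its marginal over the present components, leaving the decoding procedure itself unchanged.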

3.8 Our objective in theme 4 is to deploy and evaluate the fundamental research of themes 1 and 2 and the recognition techniques developed in theme 3 in application areas of interest to our industrial partner, Daimler-Benz. Daimler-Benz wishes to develop spoken dialogue systems for its product range, especially in the field of mobile communication. The applications of primary interest are the control of cellular phones and of in-car functions by speech, which provide an excellent testbed for auditory-based techniques. They demand noise robustness (hands-free control in a car is the most testing condition), and are sufficiently difficult to be beyond the capability of contemporary recognisers. Daimler-Benz are able to provide experimental data for these applications and we will be able to make comparative assessments with other techniques.