TMR NETWORK SPHEAR

1. RESEARCH TOPIC

SPHEAR brings together seven partners (six from the EU and one Swiss) with shared interests in understanding the auditory system and using this understanding both as a basis for explaining speech perception and as a principled framework for speech technology. The twin goals of the network are to achieve a better understanding of auditory information processing and to deploy this understanding for automatic speech recognition (ASR) in adverse conditions. The project acronym SPHEAR stands for Speech, HEAring and Recognition. This program, with its background of established collaboration, its interdisciplinary nature, its scientific fascination and its promise for exploitation, provides an excellent opportunity to train young post-doctoral researchers in a vital, emerging field.

This field offers challenges which are both intellectually compelling and commercially important. Spoken language is one of the foundations on which human intelligence rests, and understanding it is an important aspect of any scientific account of perception and cognition. Speech technology (coding, recognition, synthesis, verification, morphing...) is becoming part of everyday life, but current capability is still, by any measure, poor in comparison with human performance [Lippmann 96]. In particular, ASR performance deteriorates rapidly when even small amounts of noise are present. In striking contrast, human listeners function well even at negative signal-to-noise ratios, and in the presence of several sound sources which may be changing unpredictably. Mainstream ASR approaches are arguably reaching their limits in the face of these problems, and it is difficult to see how to extend them directly [Bourlard et al. 96].

Speech communication has developed to make use of existing mechanisms (the vocal apparatus and the auditory system), and is profoundly influenced by their characteristics. Highly-developed perceptual skills enable us to perform 'Auditory Scene Analysis' (ASA): determining how many sound sources there are, where they are located, and what their characteristics are. The studies which we propose promise to make speech recognition more robust by modelling such aspects of auditory processing.

SPHEAR partners Sheffield, Keele, Grenoble and Bochum have, over the last ten years, played leading roles in the development of 'Computational Auditory Scene Analysis' (CASA). We have achieved some success in modelling the separation of sound sources by representational techniques [e.g. Brown and Cooke 94, Berthommier and Meyer 95]. This methodology involves grouping processes which bind together features in the time-frequency representation of sound events extracted by auditory models. Grouping can be based on 'primitive' cues, such as harmonicity or common amplitude modulation, or can be 'schema-driven', based on stored knowledge of source characteristics. Important cues come from binaural hearing - the differences between the signals received at the two ears. We have also begun to address the problem of how this auditory grouping might be used in robust speech recognition [Green et al. 95, Cooke et al. 96, Patterson 96].
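To illustrate the primitive-grouping idea, the sketch below groups frequency channels whose amplitude envelopes are strongly correlated - the common-amplitude-modulation cue mentioned above. It is a simplified illustration, not code from the project: the correlation threshold, the greedy grouping scheme and the synthetic two-source envelopes are our own assumptions.

```python
import numpy as np

def group_by_common_am(envelopes, threshold=0.8):
    """Greedily group frequency channels whose amplitude envelopes are
    strongly correlated - a primitive cue for assigning them to one source.
    envelopes: array of shape (n_channels, n_samples)."""
    n = envelopes.shape[0]
    corr = np.corrcoef(envelopes)          # channel-by-channel correlation
    groups, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, n):
            if j not in assigned and corr[i, j] >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups

# Two synthetic "sources": channels 0-1 share one modulator, 2-3 another.
t = np.linspace(0, 1, 1000)
mod_a = 1 + 0.5 * np.sin(2 * np.pi * 4 * t)    # 4 Hz amplitude modulation
mod_b = 1 + 0.5 * np.sin(2 * np.pi * 7 * t)    # 7 Hz amplitude modulation
envelopes = np.stack([mod_a, 1.2 * mod_a, mod_b, 0.8 * mod_b])
print(group_by_common_am(envelopes))            # -> [[0, 1], [2, 3]]
```

Scaled copies of the same modulator correlate perfectly, while the 4 Hz and 7 Hz modulators are nearly uncorrelated, so the channels separate into the two intended sources.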

The program of work involves four themes. Themes 1 and 2 develop CASA further: Theme 1 addresses the need to exploit the temporal organisation of events at multiple time-scales, while Theme 2 addresses the need to improve sound-source segregation by coupling events expressed in different representations through a combination of bottom-up and top-down processes. In parallel with these developments, we will extend our CASA-based recognition work in Theme 3. The key here is the ability to recognise on the basis of unreliable data: for a speech source in an auditory scene, some time-frequency regions will be 'hidden' by other sources, and several sources may contribute to a given observation. We propose to cast CASA-based speech recognition within a new 'multi-stream' recognition architecture developed by Bourlard at our Swiss partner, IDIAP. This provides a probabilistic way of combining recognition cues from several sources as well as a good framework for automatically handling unreliable information. As a real-life testbed for these ideas, and as an exploitation path, Theme 4 is joint work with our industrial partner, Daimler-Benz, on speech recognition in mobile communication systems. This application area is particularly appropriate because it is demanding enough to challenge any state-of-the-art ASR system, it is characterised by non-stationary noise, and it is commercially attractive.
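The Theme 3 idea of recognising from unreliable data can be sketched as a likelihood computation that marginalises over masked time-frequency components. The code below is a minimal illustration under our own assumptions (single diagonal-Gaussian class models and a binary reliability mask), not the project's recogniser:

```python
import numpy as np

def masked_log_likelihood(x, mask, mean, var):
    """Log-likelihood of observation x under a diagonal Gaussian, using
    only the components the mask flags as reliable; the unreliable
    ('hidden') dimensions are marginalised away, which for a diagonal
    covariance simply means omitting their terms."""
    r = mask.astype(bool)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + d ** 2 / var[r])

# Two hypothetical source classes in a 2-channel feature space.
mean_a, mean_b = np.array([0.0, 0.0]), np.array([5.0, 5.0])
var = np.ones(2)

x = np.array([0.1, 9.0])        # channel 2 dominated by another source...
mask = np.array([1, 0])         # ...so only channel 1 is treated as reliable

ll_a = masked_log_likelihood(x, mask, mean_a, var)
ll_b = masked_log_likelihood(x, mask, mean_b, var)
print('class A' if ll_a > ll_b else 'class B')   # -> class A
```

Using the full (unmasked) observation, the corrupted second channel would pull the decision towards class B; marginalising it out lets the reliable evidence decide, which is the essence of recognition from incomplete data.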