SPHEAR TMR NETWORK

2. PROJECT OBJECTIVES

Our overall objectives are to gain a better understanding of the human auditory system and to use this understanding to improve automatic speech recognition (ASR) systems. Although over the years speech researchers have (knowingly or unknowingly) used certain properties of human hearing in the design of speech engineering systems [Bourlard et al 96], there has been little sustained multi-disciplinary effort in this direction, a situation which a TMR network can correct. Each of our four themes has a number of associated tasks. The themes are not meant to be a rigid division: many tasks relate to, and in some cases depend on, tasks in other themes. We introduce the themes and their tasks below:

Theme 1: Temporal Organisation of Auditory Events

Cognitive processing of sound operates at multiple timescales. In auditory perception, these range from microseconds for sound source localisation, through milliseconds for pitch analysis and event detection, to tens of milliseconds for the analysis of the low-frequency amplitude and frequency modulation characteristics of speech. In speech perception, phones are recognised over perhaps 100 milliseconds, and syllables and speech acts over longer timescales still. It is necessary to build a consistent picture of processing and interaction at all these timescales, from signal to percept (a toy illustration of analysis at several timescales is sketched below). The tasks in theme 1 all relate to aspects of temporal organisation, and the multistream recognition architecture we develop in theme 3 has the potential to express integration across timescales. The tasks in theme 1 have the following objectives:
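
The following is a minimal sketch of the multi-timescale idea, not part of the project workplan: it computes short-time energy envelopes of one toy signal with windows loosely matched to the pitch, modulation and phone scales mentioned above. The signal, the window lengths and the envelope helper are our own illustrative choices.

    import numpy as np

    def envelope(signal, sample_rate, window_ms):
        # Short-time RMS energy over non-overlapping windows of window_ms.
        win = max(1, int(sample_rate * window_ms / 1000.0))
        n_frames = len(signal) // win
        frames = signal[:n_frames * win].reshape(n_frames, win)
        return np.sqrt(np.mean(frames ** 2, axis=1))

    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate    # one second of signal
    # Toy stand-in for speech: a 200 Hz carrier with 4 Hz amplitude modulation.
    sig = np.sin(2 * np.pi * 200 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t))
    for window_ms in (1, 10, 100):  # pitch-period / modulation / phone scales
        env = envelope(sig, sample_rate, window_ms)
        print(window_ms, "ms window ->", len(env), "frames")

Shorter windows preserve fine structure within each pitch period; longer ones retain only the modulation envelope or the coarse energy contour.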

Theme 2: Segregating Sound Sources

SPHEAR partners Bochum, Grenoble, Keele and Sheffield have established work on the segregation of simultaneous sounds. In this theme we continue that work, which has matured to the point where a principal challenge is to model the integration of primitive cues with high-level 'schema-driven' mechanisms. Human listeners are able to segregate simultaneous speech using a range of very simple features, such as the monaural cues of common harmonicity and common temporal pattern. Similarly, the primitive binaural cues for auditory scene analysis (ASA) are the interaural time difference (ITD) and interaural level difference (ILD) of each spectral component: the ITD arises from the different distances sound waves must travel to each ear, and the ILD from attenuation by the shadow of the head (a toy estimator of both cues is sketched below). Each of these segregation cues has been demonstrated in isolation and modelled with some success [Bodden 93, Berthommier and Meyer 95, Brown and Cooke 94]. However, single cues only partly explain human competence in ASA: the different low-level grouping cues interact, and high-level, schema-driven processes contribute to the perception of simultaneous speech. Furthermore, speech is a multimodal means of communication. A number of experiments on human listeners have demonstrated the role of vision in speech identification, particularly in acoustic noise [Campbell et al 97], and we need to couple visual and auditory cues. Theme 2 is closely linked to the experimental data obtained in theme 1 and will inform the data processing strategies of theme 3. The objectives of theme 2 are:
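
As a concrete illustration of the primitive binaural cues just described, the sketch below estimates a broadband ITD by cross-correlation and an ILD as an energy ratio in decibels. It is a toy estimator under our own simplifying assumptions (two time-aligned ear signals as floating-point numpy arrays), not one of the cited models.

    import numpy as np

    def binaural_cues(left, right, sample_rate, max_itd_s=0.001):
        # ITD: lag of the cross-correlation peak, searched only over
        # physically plausible lags (|ITD| below about 1 ms for a human head).
        full = np.correlate(left, right, mode="full")
        lags = np.arange(-(len(right) - 1), len(left))   # lag in samples
        keep = np.abs(lags) <= int(max_itd_s * sample_rate)
        itd = lags[keep][np.argmax(full[keep])] / sample_rate
        # ILD: broadband energy ratio between the two ears, in dB.
        ild = 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
        return itd, ild

    # Toy check: delay the right channel by 0.5 ms and halve its amplitude.
    rng = np.random.default_rng(0)
    src = rng.standard_normal(1600)
    left = src
    right = 0.5 * np.concatenate([np.zeros(8), src[:-8]])  # 8 samples at 16 kHz
    # Expected output: about (-0.0005, 6.0); with this sign convention a
    # negative ITD means the right channel lags.
    print(binaural_cues(left, right, 16000))

In a full model these cues would be computed per spectral component (e.g. per auditory filter channel) rather than broadband, as the theme description above indicates.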

Theme 3: Speech Recognition within Auditory Scenes

Rather than receiving a single data stream which contains all the required information and nothing that is unwanted, recognisers operating in natural auditory scenes face multiple data streams, each of which may be incomplete or contaminated by interference from other sources (section 3.6). Even a single speech signal may be best processed as several representational streams. The objective of this theme is to extend recognition paradigms to such cases: we will develop `multistream' recognition (task 3.1: explanation in sections 3.6 and 4) and `missing data' recognition (task 3.2: explanation in sections 3.7 and 4); a toy sketch of both ideas follows below. We also schedule related tasks with the following objectives:
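
To indicate what the two paradigms amount to computationally, here is a minimal sketch under our own simplifying assumptions: diagonal-covariance Gaussian mixture state models, and a binary reliability mask delivered by some unspecified front end; both function names are hypothetical. Missing-data recognition marginalises the unreliable spectral components, which for diagonal Gaussians reduces to dropping those dimensions; multistream recognition combines per-stream log-likelihoods under stream weights.

    import numpy as np

    def logpdf_missing_data(x, reliable, means, variances, log_mix_weights):
        # Log-likelihood of frame x under a diagonal-covariance GMM, using
        # only the components of x flagged reliable: marginalising a
        # dimension of a diagonal Gaussian is just dropping that dimension.
        d = reliable.astype(bool)
        per_mix = -0.5 * (np.log(2.0 * np.pi * variances[:, d])
                          + (x[d] - means[:, d]) ** 2 / variances[:, d]).sum(axis=1)
        return np.logaddexp.reduce(log_mix_weights + per_mix)

    def multistream_score(stream_loglikes, stream_weights):
        # Weighted log-linear combination of per-stream log-likelihoods.
        return float(np.dot(stream_weights, stream_loglikes))

    # Toy usage: 4 mixtures over a 20-channel spectral frame, ~30% missing.
    rng = np.random.default_rng(1)
    means = rng.standard_normal((4, 20))
    variances = np.ones((4, 20))
    log_mix_weights = np.log(np.full(4, 0.25))
    x = rng.standard_normal(20)
    reliable = rng.random(20) > 0.3
    print(logpdf_missing_data(x, reliable, means, variances, log_mix_weights))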

Theme 4: Recognition Applications

Our objective in theme 4 is to explore the effectiveness of auditory processing, and of the techniques developed in the other themes, in practical ASR systems. Daimler-Benz are primarily interested in the voice control of cellular telephones (section 3.8) and in in-car recognition, for which there is considerable industrial interest. The cellular phone application requires robust recognition of connected words and 'keyword-spotting' in a domain which is partly speaker-independent (commands, digits, ...) and partly speaker-dependent (personal phone book). It also requires 'barge-in', i.e. speech output which can be interrupted by speech input. It is necessary to deal both with stationary noise (e.g. car noise) and with unpredictable interruptions (e.g. a door opening), and to operate in rooms with different reverberant characteristics. Our intention is to study the new approaches in comparison with known approaches to recognition in noisy environments (one such baseline is sketched below). The investigations will be based on Daimler-Benz's own databases, recorded in different environments. Initially, we will apply auditory processing together with conventional recognition in the cellular phone and in-car applications. Then we will apply our new recognition techniques (multistream and missing data). Finally, we will turn our attention to the problem of implementing the new approaches in real-time recognition systems under the constraints of limited computational and memory resources. The tasks associated with theme 4 have the following objectives:
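
For context on the known approaches mentioned above, the sketch below shows one standard baseline against stationary noise such as car noise: magnitude spectral subtraction, with the noise spectrum estimated from a lead-in segment assumed to be speech-free. It is illustrative only (our own parameter choices, not Daimler-Benz's system), and it does nothing for unpredictable, non-stationary interference, which is one motivation for the multistream and missing-data approaches.

    import numpy as np

    def spectral_subtract(noisy, sample_rate, noise_ms=250,
                          frame=256, hop=128, floor=0.01):
        # noisy: float numpy array; the first noise_ms are assumed speech-free.
        window = np.hanning(frame)
        n_noise = int(sample_rate * noise_ms / 1000.0)
        # Average magnitude spectrum of the speech-free lead-in.
        noise_mag = np.mean([np.abs(np.fft.rfft(window * noisy[i:i + frame]))
                             for i in range(0, n_noise - frame, hop)], axis=0)
        out = np.zeros(len(noisy))
        for i in range(0, len(noisy) - frame, hop):
            spec = np.fft.rfft(window * noisy[i:i + frame])
            # Subtract the noise magnitude, keep a small spectral floor,
            # and resynthesise with the noisy phase (overlap-add).
            mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)
            out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
        return out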