RESPITE: Programme

ESPRIT Reactive Long Term Research Project:

Recognition of Speech by Partial Information Techniques (RESPITE)

No 28149

PROJECT PROGRAMME

Objectives
Workplan
1. Introduction
2. Detailed Workplan

1. Objectives

RESPITE will extend and apply two novel technologies, missing data theory and multi-stream theory, to the problem of robust automatic speech recognition (ASR), with particular application to cellular phones and in-car environments. It will also support studies whose purpose is to inform this endeavour.

The specific measurable objectives are to

1. develop techniques for identifying reliable data;
2. advance the theory of multi-stream processing;
3. advance the theory of missing and masked data handling;
4. inform the above by obtaining new perceptual data on speech recognition;
5. combine missing data and multi-stream processing with existing robust ASR methods;
6. evaluate all this within a framework of demonstrator ASR applications to cellular phones and in cars.

1.1 Yardsticks

For the recognition-based objectives (2, 3 and 5) we will use well-established corpus-based evaluation techniques for ASR (for instance word accuracy), which will allow the benefit of each of the above innovations to be quantified in comparison to standard approaches. These studies will be made on standard reference data and on in-house data (see Task 1.1 in section 2.2.2).
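As an illustration of the word accuracy measure mentioned above, the following is a minimal sketch rather than the project's scoring tool. It assumes whitespace-separated transcriptions and the usual definition of accuracy as (N - S - D - I) / N, where N is the number of reference words and S, D and I are the substitutions, deletions and insertions found by a minimum-edit-distance alignment. The function name `word_accuracy` is hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return word accuracy (N - S - D - I) / N for two transcriptions.

    Both arguments are whitespace-separated word strings. The total
    number of errors S + D + I is the word-level edit distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: minimum edits turning the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr

    errors = prev[-1]
    return (len(ref) - errors) / len(ref)
```

Because insertions are counted as errors, accuracy defined this way can fall below zero when the recogniser outputs many spurious words.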

For the demonstrators (objective 6), error rates can be measured as the user attempts to accomplish her/his task, under varying conditions.

Yardsticks for identifying reliable data (objective 1) can be based on comparisons between the algorithms' outputs and a predefined 'optimal' labelling of signal regions, and on recognition results that employ the data deemed to be reliable compared to those based on indiscriminate use of the whole signal.
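One possible form of the first comparison is sketched below, purely for illustration. It assumes that both the algorithm's output and the reference labelling are binary time-frequency masks of equal shape (1 for reliable, 0 for unreliable), stored as lists of rows. The function name `mask_agreement` and the choice of hit rate and false-alarm rate as summary figures are assumptions, not part of the programme.

```python
def mask_agreement(estimated, reference):
    """Compare an estimated reliability mask with a reference labelling.

    Both masks are lists of rows of 0/1 values with the same shape.
    Returns the fraction of truly reliable regions that were found
    (hit rate) and the fraction of unreliable regions wrongly marked
    as reliable (false-alarm rate).
    """
    hits = misses = false_alarms = rejections = 0
    for est_row, ref_row in zip(estimated, reference):
        for est, ref in zip(est_row, ref_row):
            if ref and est:
                hits += 1
            elif ref:
                misses += 1
            elif est:
                false_alarms += 1
            else:
                rejections += 1

    reliable = hits + misses
    unreliable = false_alarms + rejections
    return {
        "hit_rate": hits / reliable if reliable else 0.0,
        "false_alarm_rate": false_alarms / unreliable if unreliable else 0.0,
    }
```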

The success of the perceptual studies (objective 4) can be evaluated indirectly by the extent to which their results are deployed within recognition schemes and the resulting effect on performance. The studies will, in addition, have scientific merit in their own right. In this sense, the measures of success are those of experimental science: Has an experiment been designed which will elicit the information required? Have results been obtained which are statistically significant? Are these results reproducible? Can the results be understood in terms of the model which provoked the experiment?

2. Workplan

2.1 Introduction

In addition to the person-months accounted for here, substantial further resources will be committed to the project by its authors and their colleagues.

There are 6 work packages:

WP0 Management
WP1 Resources and Basic Technologies
WP2 Identifying Reliable Information
WP3 Recognition Techniques
WP4 Application Demonstrators and Evaluation
WP5 Dissemination and Take Up

In outline, the relation between these is as follows:

WP0 covers the coordinator's role and resource and is active throughout the programme.
WP1 provides the platform for the main body of the work and occupies months 1-12.
WP2 is required for Missing Data recognition and desirable for Multi Stream recognition. It runs throughout the project lifetime but will produce results incrementally, from month 6.
WP3 develops the central Missing Data and MultiStream technology and their integration. It will be active throughout the programme.
WP4 is concerned with assessment and deployment of the technology and again will be active throughout the programme.
WP5 covers the publication and technology take up aspects of the project and again will be active throughout the programme.

2.2 Detailed work plan

For each WP we specify the executing partners and the manager.

WP0 Management

Work Package Manager: Sheffield
Executing Partners: Sheffield
Project management is documented in section 3.

WP1 Resources and Basic Technologies

Work Package Manager: FPMs
Executing Partners: Daimler-Benz, MATRA, IDIAP, FPMs, Sheffield, ICSI
WP1 encapsulates the work involved in establishing common resources, a common software framework and baseline recognition systems for comparison with research prototypes.

Task T1.1 Database management

Task Coordinator: Daimler-Benz
Executing Partners: Daimler-Benz, MATRA, IDIAP, FPMs
Speech recognition technology is dependent on the availability of substantial corpora of spoken material for training and for evaluation. In the case of RESPITE, we are fortunate in that much of the data we need has already been collected. We intend to make use of the following resources:
Standard evaluation databases for robust ASR, allowing direct comparison with results already reported.
A GSM speech database recently collected at IDIAP on two different (low-end and high-end) cellular phones. We also intend to complement those calls with about 50 calls recorded in a quiet room simultaneously via the GSM line and directly onto DAT from a microphone.
Extensive recordings of speech in cars recently made by Daimler-Benz, addressing the hands-free speaking style.
MATRA will provide access to an important multi-lingual speech database, covering in-car environments and GSM networks, which is to be recorded within the new SpeechDat-Car project.
The man-months allocated to T1.1 cover the work involved in organising and processing material from these databases in such a way that they can be used by all partners as the basis for RESPITE research. For instance, work at FPMs will involve converting the databases to the STRUT format.

Task T1.2 Baseline recognition systems

Task co-ordinator: FPMs
Executing partners: FPMs, ICSI, Sheffield, IDIAP, Daimler-Benz, MATRA
We will first establish baseline results for our databases using 'reference' speech recognition research systems. Two kinds of platform are of interest:
Hidden Markov Model (HMM) systems, exemplified by HTK, a commercial package from Entropic which has been used as the basis for Sheffield's missing data work.
Hybrid systems which combine Artificial Neural Nets and HMMs, exemplified by STRUT (developed by FPMs) and the ICSI system.
These two methodologies have strengths and drawbacks: HMM systems are more generally accepted as a reference but are less amenable to some of the modifications we will need to make. Rather than attempting to coerce partners into using a single system at the outset, we will therefore pursue the pluralistic approach of establishing baseline results using the three recognition systems (HTK, STRUT and ICSI) mentioned above. Baseline configurations, configured for the evaluation databases, will be duplicated across the partners. For RESPITE portability, STRUT's data structure design and programming interface will be improved. Key features of the ICSI system are that it already includes code for multi-stream recognition, and front-end visualisation tools.
It is anticipated that as research progresses, we will migrate away from multiple recognition systems to a single system, possibly drawing pieces from each of the baseline recognisers. We will make this decision when we address the design of the first demonstrators (see T4.2). We will not invest undue effort into porting the research results to systems where they are not required within the project.

WP2 Identifying reliable information

Work Package Manager: ICP
Executing Partners: Sheffield, ICP, ICSI, IDIAP, FPMs
This work package is concerned with the identification of the regions within the signal to which the recognition techniques of WP3 should be paying the most attention and, conversely, indicating which features and streams should be regarded as contaminated or 'missing'. To this end, we will investigate a variety of techniques, based both on conventional statistical signal processing, and on the study and modelling of human audition.

Task T2.1 Computational Auditory Scene Analysis

Task coordinator: ICP
Executing partners: Sheffield, ICP, ICSI
The field of 'computational auditory scene analysis' (CASA) encompasses everything from the peripheral frequency analysis of the cochlea through to abstract constraints such as 'expectations' of familiar sounds. Task T2.1 will pursue the following studies:
Reference CASA implementation. ICP, ICSI and Sheffield will combine their existing expertise to produce software which will detect local (bottom-up) sound-organization cues such as harmonicity, common onset/offset, common modulation, and, for binaural signals, spatial location. These cues will be used to separate evidence from different sound sources. The effectiveness of this software will be evaluated using SNR and/or recognition-based metrics for assessing sound-source separation as outlined in section 2.1.
Comparison of source separation techniques. In a comparative study, we will also consider approaches to source separation not directly motivated by auditory processing, namely blind source separation and model decomposition.
Interfacing CASA to speech recognition. We will research the coupling between CASA processing and recognition. In addition to performing sound-source separation, CASA can provide other information of use to a recogniser, such as the possible locations of speaker changes or overlap. By the same token, the analysis performed within the recogniser can provide constraints (such as the preferred interpretation of an ambiguous segment) that could be useful to the CASA processor. In the longer term (for the second generation of demonstrators, T4.2), we will investigate closer coupling of CASA and recognition, so that CASA is no longer seen as a front end which identifies reliable evidence for use by an adapted conventional recogniser. This goal links to the developments in recognition architectures proposed in T3.2 and T3.4.

Task T2.2 Other information-location techniques

Task Coordinator: FPMs
Executing partners: IDIAP, Sheffield, ICP, ICSI, FPMs
Sound source separation is not the only way in which the most useful features within the signal can be identified. Since alternative techniques are likely to contribute complementary results, we will investigate the following areas:
Signal-to-noise ratio (SNR) estimation. We will improve existing schemes adopted by FPMs, ICSI and MATRA for estimating background noise, to make the process more robust and adaptive to changing noise conditions.
Confidence and entropy measures. The information quality in a channel (measured by entropy or statistical confidence measures) will be used to identify reliable features.
Experiments in human speech perception. We will conduct experiments to inform the design of time-frequency decompositions to use in speech recognition. In particular, we will extend investigations into the way that subband envelopes at different timescales (i.e. the 'modulation spectra') affect the intelligibility of speech.
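To illustrate the entropy-based measure listed above, here is a minimal sketch. It assumes that each channel or stream supplies a vector of class posterior probabilities (for instance phone posteriors from a neural network) and that a flat, high-entropy distribution indicates unreliable evidence. The function names and the normalisation to the range 0 to 1 are illustrative assumptions rather than a specification of the measures to be developed.

```python
import math


def posterior_entropy(posteriors):
    """Shannon entropy, in bits, of a posterior probability vector."""
    return -sum(p * math.log2(p) for p in posteriors if p > 0.0)


def reliability(posteriors):
    """Map entropy to a score between 0 and 1.

    A score of 1 means one class is certain; 0 means the posteriors
    are uniform, so the channel carries no discriminative information.
    """
    n_classes = len(posteriors)
    if n_classes < 2:
        raise ValueError("need at least two classes")
    return 1.0 - posterior_entropy(posteriors) / math.log2(n_classes)
```

A score of this kind could serve either as a hard threshold for marking features as missing or as a soft weight when streams are recombined.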

WP3 Recognition Techniques

Work Package Manager: IDIAP
Executing Partners: all
In WP3 we investigate new recognition techniques, including those based on the missing data and multi-stream approaches, and extend these techniques to take advantage of the results of WP2. All methods and open issues discussed below will be tested on the common databases defined in Task T1.1. Results will be compared to "standard" (state-of-the-art) noise robust speech recognition techniques also benefiting, when possible, from findings of WP2.

Task T3.1 Developments in noise robust speech recognition

Task Coordinator: MATRA
Executing Partners: MATRA, IDIAP, Daimler-Benz
This task will test different variants of the new approaches in the framework of standard approaches (e.g. by doing subband emission probability weighting). It will be based on the existing noise-robust recognition systems at MATRA and Daimler-Benz.

Task T3.2 Missing Data Recognition

Task Coordinator: Sheffield
Executing Partners: Sheffield, IDIAP, FPMs, ICP
The following topics will be investigated:
Porting missing data techniques to other platforms: for instance FPMs will implement missing data techniques in STRUT.
Encoding spectral dynamics. We will make use of temporal derivatives and time-domain interpolation to improve phone probability estimates.
Exploiting masked regions. Missing data performance can be improved by exploiting counter-evidence: the probability that a given model could have been masked by the observed values. A joint perceptual-modelling study will be pursued here in order to better understand the relative weight attached to limited positive evidence and masked counter-evidence in the auditory-phonetic metric.
Decoder architecture for multiple sources. We will extend the traditional hypothesis search space to handle multiple simultaneous sources. Each evidence group provided by CASA will be regarded either as the continuation of existing hypotheses, in which case it is treated as partial evidence, or as the start of a new acoustic source, in which case it not only triggers new paths, but can be used as counter-evidence for the ongoing sources.
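The use of partial evidence and counter-evidence described above can be illustrated with a small sketch. It makes several assumptions that are not stated in the programme: state models are diagonal-covariance Gaussians, features are non-negative spectral energies, and the speech energy in a masked channel cannot exceed the observed value. Under these assumptions, reliable components contribute their ordinary Gaussian likelihood, while masked components contribute the probability mass the model places between zero and the observation. All function names are hypothetical.

```python
import math


def _log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def _gauss_cdf(x, mean, var):
    """Gaussian cumulative distribution, via erfc for accurate tails."""
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2.0 * var))


def missing_data_log_likelihood(obs, reliable, means, variances,
                                use_bounds=True):
    """Log likelihood of one frame under a diagonal Gaussian state model.

    obs, means, variances: per-channel values of equal length.
    reliable: per-channel booleans; False marks a masked channel.
    With use_bounds=False, masked channels are simply ignored
    (plain marginalisation). With use_bounds=True, each masked channel
    contributes the mass between 0 and the observed value, which
    penalises models that expect more energy than was observed.
    """
    total = 0.0
    for x, ok, mean, var in zip(obs, reliable, means, variances):
        if ok:
            total += _log_gauss(x, mean, var)
        elif use_bounds:
            mass = _gauss_cdf(x, mean, var) - _gauss_cdf(0.0, mean, var)
            total += math.log(max(mass, 1e-300))  # floor avoids log(0)
    return total
```

With plain marginalisation, two models that differ only in a masked channel score identically; with bounds, the model that predicts more energy than the mask allows is penalised, which is the sense in which the masked region acts as counter-evidence.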

Task T3.3 Multi-stream Recognition

Task Coordinator: IDIAP
Executing Partners: IDIAP, ICSI, FPMs, ICP
This task will investigate multi-band and multiple time scale speech recognition. As with the missing data task, we will begin by making existing multi-stream software available to all partners. In the case of multi-band speech recognition, the following issues will be investigated:
Features: whether novel signal processing techniques are better suited to multi-band recognition than conventional processing.
Level of combination: whether the sub-band information streams should be combined on a per state level, or fluidly over a phone or a syllable.
Method of combination: the correct strategy for combining the sub-bands. Different likelihood or posterior probability based combination approaches will be tested. Furthermore, the reliability measures defined in Task 2.2 will also be tested for the combination.
Choosing sub-bands: what is the optimal number of sub-bands and what should the cutoffs be? This is still an open issue and so far the only way to get insights into it is to test different possibilities. Obviously, we would like to have as many sub-bands as possible, while keeping enough information in each band and minimizing the correlation across bands. Findings from Task T2.2 will be useful here.
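As a concrete example of one posterior-based combination strategy of the kind referred to above, the sketch below forms a weighted log-linear (geometric) combination of per-band class posteriors. It is only one of several candidate rules and is not claimed to be the method the project will adopt. It assumes that each sub-band recogniser outputs a posterior vector over the same classes and that a non-negative reliability weight is available for each band. The function name `combine_streams` and the probability floor are assumptions.

```python
import math


def combine_streams(stream_posteriors, weights, floor=1e-10):
    """Weighted log-linear combination of per-stream class posteriors.

    stream_posteriors: one posterior vector per stream, all over the
        same classes.
    weights: one non-negative reliability weight per stream; they are
        normalised to sum to one.
    Returns a combined posterior vector that sums to one.
    """
    if len(stream_posteriors) != len(weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    norm_weights = [w / total for w in weights]

    n_classes = len(stream_posteriors[0])
    log_scores = [
        sum(w * math.log(max(post[c], floor))
            for w, post in zip(norm_weights, stream_posteriors))
        for c in range(n_classes)
    ]

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores)
    unnormalised = [math.exp(s - peak) for s in log_scores]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]
```

Setting a band's weight to zero removes it from the decision entirely, so a hard missing-band decision is a special case of this soft weighting.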
For multiple time scale speech recognition, the following issues will be investigated:
Feature extraction: There are a number of experiments that need to be done to determine the best form for long-term signal representation, and to make use of some results of psycho-acoustic experiments. This will include critical band filtering (with the right trade off between temporal and spectral resolution) and the modulation spectrum. More generally, our explorations may show problems with some aspects of the long-time features we are planning to use, and we will modify them as suggested by our diagnostics and intuition.
Incorporating multiple time scale units: As for subband based ASR, this task will also involve research into levels of combination and methods for combination.
We will also need to modify or rewrite parts of our existing decoders to incorporate the new models. The solutions that will be considered and implemented here are referred to as 'HMM combination' and the 'two-level dynamic programming algorithm'.

Task T3.4 Combining Recognition Techniques

Task Coordinator: Sheffield
Executing Partners: Sheffield, IDIAP, FPMs, ICP
We will investigate a number of ways in which the techniques we are developing might interact:
The essence of multi-stream is several parallel feature streams and recognisers which are eventually combined; the essence of missing-data is modifying the probability estimation based on additional data on feature reliability. If not all the required features for a single stream are always available, one can implement missing data recognition within each stream. Similarly, the multi-stream recombination computation could be formulated in missing data fashion.
If missing data recognition is to be deployed within hybrid ANN/HMM systems, it will be necessary to adapt the techniques to produce phone probability vectors. This might be done on the basis of already-trained statistical models or with ANN architectures (such as Radial Basis Functions).
Multi-stream processing might exploit the output of a scene analysing module (Task T2.1) by treating each group as a separate stream, with recombination points whenever new groups appear. Any path through the resulting lattice represents a particular assignment of groups to streams.
In situations where some characteristics of the noise are known, it of course makes sense to use this knowledge. For instance in a car there will be predictable engine noise together with other unknown noises. We will therefore investigate ways of combining techniques like those developed by MATRA with the missing data and multi-stream approaches developed here.

WP4 Application demonstrators and evaluation

Work Package Manager: MATRA
Executing Partners: all
Each of the tasks in WP2 and WP3 has an accompanying evaluation schedule making use of the databases introduced in Task T1.1. In addition, there is a need to develop techniques within the framework of complete applications. Hence in WP4 we will specify and build application demonstrators in which we will deploy recognition modules as they become available. The demonstrators will involve specific applications within the in-car/telephone domains, for instance voice interaction with navigation systems.
While the databases of T1.1 provide material for recogniser training and for testing in the form of raw recognition performance, the demonstrators provide different challenges - the integration of recognition techniques into a live system with a habitable interface and perhaps limited computing resources. Assessment here should be based on performance while carrying out a task, as outlined in section 2.1. We will also be able to obtain subjective measures of performance across a wide set of conditions and speakers.

Task T4.1 Definition of application demonstrators

Task Coordinator: Daimler-Benz
Executing Partners: Daimler-Benz, MATRA, IDIAP
The RESPITE industrial partners and IDIAP will define several demonstrator applications for the in-car and cellular phone domains. It will be necessary to specify the software interfaces required for each demonstrator, so that all partners can contribute. It will also be necessary to establish recognition performance targets for each demonstrator. IDIAP will test the resulting systems on their GSM database, which is particularly well suited to testing robustness to noise and to variability in frequency bandwidth.

Task T4.2 Demonstrator design and evaluation

Task Coordinator: MATRA
Executing Partners: all
This task formalises the process of building and evaluating demonstrator recognition applications deploying RESPITE techniques. The model is to produce an assessment of the problems involved in doing this after 12 months, and two generations of demonstrators scheduled for month 24 and month 33.
This task consists of three activities:
integration: The various algorithmic components will be integrated into in-house real-time interfaces developed by MATRA and Daimler-Benz.
application development: software design, implementation and testing.
validation: each year, all partners will contribute to the validation of the updated system.

WP5 Dissemination of Results and Exploitation

Dissemination of results and exploitation is dealt with in detail in section 9. Briefly,

Scientific results will be disseminated by all the usual channels.
A RESPITE web site will document project progress and provide accessibility to the community.
Training measures will be put in place for future researchers.
An international workshop will be held around month 27.
Exploitation paths are straightforward, through the product range of the industrial partners and the direct link of demonstrator systems.

These pages are maintained by Jon Barker, jon@dcs.shef.ac.uk

Last modified: Mon Dec 20 16:21:06 GMT 1999