ESCA-NATO

Tutorial and Research Workshop on

ROBUST SPEECH RECOGNITION FOR UNKNOWN COMMUNICATION CHANNELS

PONT-À-MOUSSON, FRANCE, 17-18 April 1997

KEYNOTE TALKS


Hynek Hermansky, Carlos Avendano, Sarel van Vuuren, and Sangita Tibrewala

Oregon Graduate Institute of Science and Technology, Portland, Oregon, and International Computer Science Institute, Berkeley, California

Should Recognizers Have Ears?

The paper discusses the authors' experience with applying auditory knowledge to the automatic recognition of speech. It argues that the reason for studying human auditory perception for engineering applications should be the ability of perception to suppress some parts of the information in the speech message. Three properties of human speech perception
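
One well-known embodiment of such perceptual suppression of information is RASTA processing, which band-pass filters the temporal trajectories of the log spectrum so that components changing more slowly or more quickly than speech are attenuated. A minimal illustrative sketch (not code from the paper; the coefficients follow the published RASTA filter, and log_spec is assumed to be a frames-by-bands array):

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_spec):
        # Band-pass filter each log-spectral trajectory over time.
        # The FIR part [2, 1, 0, -1, -2]/10 is a sliding slope estimate;
        # the single pole at 0.98 sets the low-frequency cut-off.
        numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
        denom = np.array([1.0, -0.98])
        return lfilter(numer, denom, log_spec, axis=0)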


Sadaoki Furui

NTT Human Interface Laboratories & Tokyo Institute of Technology

Recent Advances in Robust Speech Recognition

This paper overviews the main technologies that have recently been developed for making speech recognition systems more robust at both the acoustic and linguistic processing levels. These technologies are reviewed from the viewpoint of the stochastic pattern-matching paradigm for speech recognition. Improved robustness enables better speech recognition over a wide range of unexpected and adverse conditions by reducing mismatches between training and testing speech utterances. This paper focuses on supervised vs. unsupervised adaptation techniques, the Bayesian adaptive learning approach, the minimum classification error (MCE/GPD) approach, the parallel model combination (PMC, HMM composition) technique, and spontaneous speech recognition techniques.
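
To illustrate the Bayesian adaptive learning mentioned above: the MAP re-estimate of a Gaussian mean interpolates between the prior (e.g. speaker-independent) mean and the sample mean of the adaptation frames, weighted by state occupancy. A minimal sketch, assuming per-frame occupancy probabilities gammas are available from a forward-backward pass; the prior weight tau is an illustrative choice, not a value from the paper:

    import numpy as np

    def map_adapt_mean(mu_prior, frames, gammas, tau=10.0):
        # MAP re-estimate of one Gaussian mean: a prior-weighted
        # interpolation between the old mean and the adaptation data.
        occ = gammas.sum()                                # total occupancy
        weighted_sum = (gammas[:, None] * frames).sum(axis=0)
        return (tau * mu_prior + weighted_sum) / (tau + occ)

With little adaptation data (small occ) the estimate stays near the prior mean; with plenty of data it approaches the maximum-likelihood sample mean.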


Steven Greenberg

International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704

On the Origins of Speech Intelligibility in the Real World

Current-generation speech recognition systems seek to identify words via analysis of their underlying phonological constituents. Although this stratagem works well for carefully enunciated speech emanating from a pristine acoustic environment, it has fared less well for recognizing speech spoken under more realistic conditions, such as

Under such "real-world" conditions the acoustic properties of speech make it difficult to partition the acoustic stream into readily definable phonological units, thus rendering the process of word recognition highly vulnerable to departures from "canonical" patterns.

Analysis of informal, spontaneous speech (e.g., the Switchboard corpus) indicates that the stability of linguistic representation is more likely to reside at the syllabic and phrasal levels than at the phonological level. In consequence, attempts to represent words merely as sequences of phones, and to derive meaning from simple chains of lexical entities, are unlikely to yield high levels of recognition performance under such real-world conditions.

A multi-tiered representation of speech is proposed, one in which only partial information from each level of linguistic abstraction is required for sufficient identification of lexical and phrasal elements. Such tiers of linguistic abstraction are unified through a hierarchically organized process of temporal binding and are, in principle, highly tolerant of the sorts of "distortions" imposed on speech in the real world.


Richard M. Stern

Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Compensation for environmental degradation in automatic speech recognition

The accuracy of speech recognition systems degrades when they are operated in adverse acoustical environments. This paper reviews various methods by which more detailed mathematical descriptions of the effects of environmental degradation can improve speech recognition accuracy, using both "data-driven" and "model-based" compensation strategies. Data-driven methods learn environmental characteristics through direct comparisons of speech recorded in the noisy environment with the same speech recorded under optimal conditions. Model-based methods use a mathematical model of the environment and attempt to use samples of the degraded speech to estimate the model parameters. These general approaches to environmental compensation are discussed in terms of recent research in environmental robustness at CMU, and in terms of similar efforts at other sites. In general, compensation algorithms are evaluated in a series of experiments measuring recognition accuracy for speech from the ARPA Wall Street Journal database that is corrupted by artificially added noise at various signal-to-noise ratios (SNRs), and in more natural speech recognition tasks.
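
In its simplest form, the data-driven strategy reduces to learning an additive cepstral correction from "stereo" recordings of the same utterances in clean and degraded conditions. The hypothetical sketch below estimates a single global correction vector; the actual CMU algorithms in this family (e.g. FCDCN) condition the correction on SNR and on codebook region rather than using one vector:

    import numpy as np

    def stereo_correction(clean_cep, noisy_cep):
        # One global additive correction vector learned from stereo
        # data: the average difference between clean and degraded
        # cepstra over matched frames (frames x dims arrays).
        return (clean_cep - noisy_cep).mean(axis=0)

    # At test time the correction is simply added to each noisy frame:
    # compensated = noisy_cep_test + correction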


Mark Gales

Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, England.

"Nice" Model-Based Compensation Schemes for Robust Speech Recognition

As speech technology is applied to real-world applications, there is a need to build systems that are insensitive to differences between training and test conditions. These differences may result from ambient background noise, channel differences, speaker stress, etc. A variety of techniques have been applied to this problem. This paper examines one class of approach, model-based compensation, in which a speech model is combined with an "additive noise" model, a "channel" model and, in the general case, a speaker-stress model to generate a corrupted-speech model. This will be referred to as a "nice" model-based compensation scheme. The label does not necessarily refer to good performance; it refers purely to the fact that a model set "matched" to the test environment is generated without the need for speech data from the new environment. Various ways of performing this compensation will be described, along with the advantages and, of course, the disadvantages of such an approach. In addition, methods for combining the approach with compensation schemes that make use of speech data in the new environment will be detailed. This combined approach overcomes some of the limitations of the standard "nice" schemes.
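
For the additive-noise part of such a combination, a widely used device is the log-add approximation of parallel model combination (PMC): map the cepstral means of the speech and noise models into the linear spectral domain, add them, and map back. A minimal sketch, assuming C and C_inv are the DCT matrix used to compute the cepstra and its inverse; channel, speaker-stress, and variance compensation are omitted:

    import numpy as np

    def pmc_log_add(mu_speech, mu_noise, C, C_inv, gain=1.0):
        # Log-add approximation of PMC: combine the speech and noise
        # model means in the linear spectral domain, where additive
        # noise really is additive, then return to the cepstral domain.
        s_lin = np.exp(C_inv @ mu_speech)   # speech mean, linear spectrum
        n_lin = np.exp(C_inv @ mu_noise)    # noise mean, linear spectrum
        return C @ np.log(s_lin + gain * n_lin)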


Chin-Hui Lee

Dialogue Systems Research Department, Bell Laboratories, Lucent Technologies, 600-700 Mountain Ave., Room 2D-425, Murray Hill, NJ 07974-0636, USA

On Feature and Model Compensation Approach to Robust Speech Recognition

By now it should not be surprising that high-performance speech recognition systems can be designed for a wide variety of tasks in many different languages. This is mainly attributed to the use of powerful statistical pattern-matching paradigms coupled with the availability of large amounts of task-specific language and speech training examples. However, it is also well known that such high performance can only be maintained when the testing data resemble the training data. In reality, there often exists an acoustic mismatch between training and testing conditions. These differences, in transducer, channel, environment, speaker, and speaking style, account for most of the performance degradation in speech recognition. The speech distortion usually appears as a combination of the above differences, but the exact form of the distortion is often unknown and difficult to model. One way to reduce acoustic mismatches is to adjust the speech and acoustic features according to some models of the differences. Another is to adjust the parameters of the statistical models, e.g., hidden Markov models, so that the modified models better characterize the distorted speech features. Depending on the knowledge used, this family of feature and model compensation techniques can be roughly categorized into three classes: (1) training-based compensation, (2) blind compensation, and (3) structure-based compensation. We provide an overview of the capabilities and limitations of this family of approaches. The relationship between adaptation and compensation will also be discussed.
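
Perhaps the simplest example of a blind feature-compensation technique of the kind this taxonomy covers is cepstral mean subtraction, which removes a stationary convolutional (channel) distortion without any model of, or data from, the new environment. A minimal sketch:

    import numpy as np

    def cepstral_mean_subtraction(cep):
        # cep: frames x dims.  A time-invariant channel appears as an
        # additive bias in the cepstral domain, so subtracting each
        # dimension's per-utterance mean removes it blindly (along
        # with some stationary speaker characteristics).
        return cep - cep.mean(axis=0, keepdims=True)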


Maurizio Omologo

IRST - Istituto per la Ricerca Scientifica e Tecnologica, Italy

On the future trends of hands-free ASR: variabilities in the environmental conditions and in the acoustic transduction

Hands-free interaction represents a key point for increasing the flexibility of present applications and for developing new speech recognition applications in which the user cannot be encumbered by either hand-held or head-mounted microphones.

When the microphone is far from the speaker, the transduced signal is affected by degradations of various kinds, which are often unpredictable. Special microphones and multimicrophone acquisition systems represent one way of reducing some environmental noise effects. Robust processing and adaptation techniques can further be used to compensate for the different kinds of variability that can be present in the recognizer input.

The purpose of this paper is to revisit some of the assumptions about the different sources of this variability and to discuss both the special transducer systems and the compensation/adaptation techniques that can be adopted. In particular, the paper will refer to the use of multimicrophone systems to overcome some undesired effects caused by room acoustics (e.g. reverberation) and by coherent/incoherent noise (e.g. competing talkers, computer fans).
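
As a concrete example of such multimicrophone processing, the simplest beamformer is delay-and-sum: align each channel on its estimated time delay of arrival from the speaker and average, which reinforces the direct path while attenuating incoherent noise and, to a lesser degree, reverberation. A minimal sketch, assuming integer sample delays estimated elsewhere (e.g. by cross-correlation between channels):

    import numpy as np

    def delay_and_sum(signals, delays):
        # signals: channels x samples; delays: per-channel delay in
        # samples.  Shift each channel so the wavefronts line up,
        # then average.  (np.roll wraps at the array edges, which is
        # acceptable for a sketch but not for production use.)
        aligned = [np.roll(sig, -d) for sig, d in zip(signals, delays)]
        return np.mean(aligned, axis=0)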


Shigeki Sagayama and Kiyoaki Aikawa

Issues Relating to the Future of ASR for Telecommunications Applications

Issues relating to automatic speech recognition (ASR) are discussed with respect to applications in the telecommunications area in the near future. As a preliminary, we introduce an interesting discussion from a past conference in Japan about what is hindering the spread of ASR. Then, some relatively new robustness issues in telephone-based ASR applications are discussed. These include accurate voice/noise discrimination, multiple microphones, utterance verification/rejection for flexible-vocabulary systems, breath noise and hand noise, instantaneous adaptation to environmental noise, a spelling method for Japanese Kanji texts, dialog control issues, the lack of transparency of ASR systems, children's voices, HMM training with localized data, and adaptive dialog strategies.
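
To make the first of these issues concrete, the baseline that accurate voice/noise discrimination must improve on is often no more than an energy threshold over a crude noise-floor estimate. A hypothetical sketch (the 10th-percentile floor and 10 dB margin are illustrative choices, not values from the paper), which fails in exactly the telephone conditions the abstract lists, such as breath and hand noise:

    import numpy as np

    def energy_vad(frames, margin_db=10.0):
        # frames: frames x samples.  Flag a frame as voice when its
        # energy exceeds a crude noise-floor estimate by margin_db.
        energy_db = 10.0 * np.log10((frames ** 2).sum(axis=1) + 1e-12)
        noise_floor = np.percentile(energy_db, 10)
        return energy_db > noise_floor + margin_db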

