Speech Fragment Decoding
The CHiME project builds upon a computational hearing framework developed under a previous EPSRC grant (GR/R47400/01). The framework, Speech Fragment Decoding (SFD), couples two problems that are usually addressed sequentially: source separation and source recognition. Approaches that tackle these problems in sequence typically lack power because the recognition stage operates on a single separation hypothesis. In contrast, SFD searches over a large number of likely speech/background segmentations for a solution that jointly optimises separation and recognition.
System Overview
The figure below provides an overview of the SFD system; the brief description that follows explains how each component works and how it is being developed in the CHiME project.
How it works
The acoustic signal is first analysed using a gammatone filterbank to produce a time-frequency energy representation like that shown in the bottom left of the figure. The basic SFD technique makes the approximation that the energy at any time-frequency 'pixel' is dominated by a single source: the problem is then to determine which pixels are dominated by the target speech source. A 'Fragment Analysis' module uses a variety of signal-driven processes (typically, within-channel pitch estimation and, in the binaural case, location estimation) to locally group time-frequency pixels into extended fragments of energy that are believed to originate from a common source. Pixels in a fragment belong together and hence share the same foreground or background label. This information constrains the range of permissible foreground/background segmentations, i.e. it constitutes a simple segmentation model.
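As a rough illustration of this front end, the sketch below builds a gammatone-filterbank energy map and then groups high-energy time-frequency pixels into contiguous fragments. This is only a minimal sketch: real fragment analysis uses pitch and location cues rather than simple connected components, and all function names and parameter values here are illustrative, not taken from the CHiME system.

```python
# Minimal sketch of the SFD front end, assuming a mono signal and a
# 4th-order gammatone filterbank realised by direct FIR convolution.
# Fragments are formed by naive connected-component grouping of
# high-energy pixels; the real system groups pixels using pitch
# (and, binaurally, location) cues.
import numpy as np
from scipy.ndimage import label

def erb_space(f_low, f_high, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    ear_q, min_bw = 9.26449, 24.7
    k = np.arange(1, n_channels + 1)
    return -(ear_q * min_bw) + np.exp(
        k * (np.log(f_low + ear_q * min_bw)
             - np.log(f_high + ear_q * min_bw)) / n_channels
    ) * (f_high + ear_q * min_bw)

def gammatone_energy(signal, fs, n_channels=32, frame_len=400, hop=160):
    """Time-frequency energy map: one row per channel, one column per frame."""
    cfs = erb_space(50.0, 0.45 * fs, n_channels)
    t = np.arange(int(0.025 * fs)) / fs            # 25 ms impulse responses
    starts = range(0, len(signal) - frame_len, hop)
    energy = np.empty((n_channels, len(starts)))
    for c, fc in enumerate(cfs):
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)    # equivalent rectangular bandwidth
        ir = t**3 * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
        band = np.convolve(signal, ir, mode="same")
        energy[c] = [np.sum(band[s:s + frame_len] ** 2) for s in starts]
    return energy

def form_fragments(energy, threshold_db=-30.0):
    """Group contiguous above-threshold pixels into fragments (boolean masks)."""
    log_e = 10.0 * np.log10(energy + 1e-12)
    active = log_e > log_e.max() + threshold_db
    labelled, n = label(active)                    # connected components
    return [labelled == i for i in range(1, n + 1)]
```

Each returned mask marks one fragment; the decoding stage described next then only has to decide, per fragment, whether its pixels are labelled foreground or background.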
A decoding stage, informed by a set of hidden Markov models of the speech source (and, possibly, models of the noise background), uses a Viterbi search to consider all possible spoken utterances in conjunction with all permissible foreground/background segmentations. (For fragment-based segmentation models this search can be conducted efficiently.) Missing-data speech recognition techniques (Cooke et al., 2001) are exploited to handle the fact that the speech observations are unknown in the regions masked by the noise background. The most likely combination of utterance and segmentation is output as the result of the decoding.
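The toy decoder below illustrates how segmentation search and missing-data scoring combine, under heavy simplifying assumptions that are not the CHiME implementation: each word is modelled by a single diagonal Gaussian over the whole energy map rather than an HMM, and the 2^F fragment labellings are enumerated exhaustively rather than folded into an efficient Viterbi pass. All model names and parameters are invented for illustration.

```python
# Toy rendering of the joint utterance/segmentation search. Assumed
# simplifications: one diagonal Gaussian per word instead of an HMM,
# exhaustive enumeration of the 2^F fragment labellings instead of an
# efficient Viterbi search, and a single Gaussian noise model to score
# background pixels so that different segmentations remain comparable.
import itertools
import numpy as np
from scipy.stats import norm

def missing_data_loglik(x, present, mu, sigma):
    """Score only the pixels labelled foreground. Masked pixels are
    marginalised out, which for a diagonal Gaussian means omitting those
    dimensions (Cooke et al. 2001 additionally bound the marginal using
    the observed noisy energy)."""
    return norm.logpdf(x[present], mu[present], sigma[present]).sum()

def decode(energy, fragments, word_models, noise_model):
    """energy: (channels, frames) map; fragments: boolean masks of the
    same shape; word_models: dict word -> (mu, sigma) over flat pixels."""
    x = energy.ravel()
    masks = [f.ravel() for f in fragments]
    mu_n, sigma_n = noise_model
    best_score, best_word, best_labels = -np.inf, None, None
    for labels in itertools.product((0, 1), repeat=len(masks)):
        fg = np.zeros_like(x, dtype=bool)
        for lab, m in zip(labels, masks):
            if lab:                      # all pixels in a fragment share one label
                fg |= m
        # Background pixels are explained by the noise model.
        bg_ll = norm.logpdf(x[~fg], mu_n[~fg], sigma_n[~fg]).sum()
        for word, (mu, sigma) in word_models.items():
            score = missing_data_loglik(x, fg, mu, sigma) + bg_ll
            if score > best_score:
                best_score, best_word, best_labels = score, word, labels
    return best_word, best_labels, best_score
```

Note that scoring background pixels under an explicit noise model is what keeps likelihoods from different segmentations comparable; without it, segmentations that label fewer pixels as foreground would trivially win.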
In addition to refining the existing SFD approach, CHiME is also looking at ways in which the framework may be exploited to adapt clean speech models and learn noise models directly from noisy data.