MINUTES of the joint SPHEAR/RESPITE meeting
Hotel aux Mille Etoiles, Les Marecottes, Switzerland 1999sep13-14
(minutes recorded by Dan Ellis, dpwe@icsi.berkeley.edu)
Attendees:
Phil Green (Sheffield, Chair)
Herve Bourlard (IDIAP, Host)
RESPITE: Fritz Class (DaimlerChrysler), Christophe Ris (FPMs), Herve Glotin (ICP), Dan Ellis (ICSI), Andrew Morris (IDIAP), Catherine Glorion (Matra), Philip Lockwood (Matra), Jon Barker (Sheffield), Martin Cooke (Sheffield)
SPHEAR: Karsten Lehn (Bochum), Christos Tsakostas (Bochum), Joan Mari (DaimlerChrysler), Astrid Hagen (IDIAP), Katrin Keller (IDIAP), Christopher Kermorvant (IDIAP), Bill Ainsworth (Keele), John Mourjopoulos (Patras), Joerg Bucholz (Patras), Ascension Vizinho (Sheffield)
Co-ordinator's report (Phil Green)
SPHEAR
- 2nd year payments have just been distributed to partners.
- A euro 10k deduction arose because payments are only for documented spending, except the initial payment, which is clawed back from subsequent payments.
- Payment was delayed because some overheads were over-claimed.
Mid-term review
- should be around March 2000; the next SPHEAR meeting is Feb 26/27 in Bochum
- Mon Feb 28 will be a 1-day review for Christianne Bernard + an expert
- Report due 1 month earlier; PG will collect pieces in 1999dec
- Review is a 1 hr co-ordinator's report, a 5 min 'tour de table' each, + 10 min presentations by young researchers
- Young researchers give feedback through questionnaires
- *Everybody* must attend!
SPHEAR 3-monthly reports
are due in Oct and Dec; the Dec report is to include pieces for the mid-term report.
RESPITE
Project officer
is now Antonio Sanfilippo.
- He gets short, bi-monthly management reports (the 1st covered 6 months), which are based on the monthly email reports.
- The Year 1 report is due around the end of Dec.
- Cost reports have about 1 month extra to be finalized.
Annual reviews
for Human Lang Techs (DG XIII) are all presented during an 'annual event' in Luxembourg in early Feb:
- about 4h/project, by the co-ordinator and 1 or 2 others
- a single panel of 2 or 3 reviewers per set of projects
- PDG to hassle for the date.
"International workshop"
has been promised in the RESPITE programme - will be around March 2001 - start thinking now.
General
Lab visits:
Please make them, and record them in progress reports.
Web sites:
will be visited by reviewers (also used for recruiting) => keep them up-to-date and lively.
DaimlerChrysler is behind a firewall, so must send web content to Sheffield to be put on their site.
Joint publications
are a very good thing (esp. for SPHEAR), e.g. the Speech Comm. special issue arising from the Tampere meeting.
The easy part of the project is over: now we have to really get to work!
Technical session 1: Audition & CASA
(Martin Cooke, chair)
Christos Tsakostas: "Precedence effect & auditory streams"
The maximum delay before a spatially-separated echo is heard as distinct depends on the signal type.
Bregman-style streaming cues were used to manipulate precedence threshold.
Complex interactions were observed; these are modeled as a time-varying fusion threshold for each distinct perceived stream, which can be raised or lowered by the presence of other streams.
discussion:
is there evidence for the *level* at which the precedence effect occurs? i.e. is it peripheral or cortical?
Joerg Bucholz: "A model for both masking and reflection phenomena"
Auditory masking (e.g. forward masking) has been extensively modeled, as has the acoustic effect of rooms, but how do the two interact?
A model that follows HRTF convolution with the MPEG-style simultaneous
masking model plus the Pueschel revcor model of forward masking may be
able to reproduce known "room masking threshold" data on ability of sounds
from different directions to mask each other.
discussion:
SPHEAR has more 'basic science' content than RESPITE, but it's still focussed towards practical ends like speech enhancement and recognition from binaural signals.
Bill Ainsworth: "Effect of filtered noise on voiced-plosive perception"
Work in progress linking psychoacoustics and speech recognition. Earlier
work (JASA 1994) modeled intelligibility of syllables in continuous and
gated noise maskers, although experimental results show some remaining
curiosities.
New experiments look at the effect of filtered noise (high-pass, low-pass, band-pass, band-stop) on various CV syllables - *lots* of conditions -> lots of data.
So far, results are consistent with exponential adaptation with time constants around 200 ms. Need to run more subjects and enhance the model.
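The exponential-adaptation account mentioned above can be sketched numerically. This is a minimal illustration only, assuming dB-domain recovery of the masked threshold with a single ~200 ms time constant; the function and parameter names are invented, not from the talk:

```python
import math

def masked_threshold(t_after_offset, masker_level_db,
                     quiet_threshold_db=0.0, tau=0.2):
    """Illustrative exponential forward-masking recovery: the elevated
    detection threshold decays from the masker level back towards the
    threshold in quiet with time constant tau (seconds)."""
    decay = math.exp(-t_after_offset / tau)
    return quiet_threshold_db + (masker_level_db - quiet_threshold_db) * decay
```

At masker offset the threshold equals the masker level; one time constant (200 ms) later it has fallen to about 37% of the elevation, consistent with the fitted time constants reported.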
Martin Cooke: "Recognition of band-pass speech"
Did intelligibility tests for spoken digits band-pass filtered at various
center-frequencies and bandwidths, then compared to a missing-data speech
recognizer.
ASR performs much like listeners for CFs below 1500 Hz, but listeners are far better at higher frequencies.
Fine-scale frequency sampling reveals performance *dips* in ASR at
around 700Hz and 2kHz - also for listeners at 700Hz.
Overall, this suggests ASR is like listeners in the low band, but a 2nd mechanism is used by listeners for the high band, perhaps based on the fine temporal structure which is ignored by ASR.
Relates to the two-syllable percepts reported for Warren's vowel-sequence illusion.
Herve Glotin: "CASA vs SNR estimation for Time Delay of Arrival processing"
ICP has recorded a stereo version of mixed Numbers95 utterances played
over two speakers, recorded on two microphones.
"Binaural" style processing could separate information for each source
based on a cross-correlation measure in each time-frequency cell, estimating
the time-delay (and hence azimuth) of the dominant energy in that cell.
Using these as labels for a missing-data recognizer can improve WER from 73% (best single mic) to 49% (labelled cells only).
This uses binaural info in the classifier; could use it in the feature
calculation instead by using the labels as a time-frequency filter to maximize
the SNR of one source. This approach gives 66% WER.
discussion:
- isn't there a significant advantage for the 'facing' mic? No, because the recordings didn't use a dummy head, so there is no head shadow.
- the 125 ms window used for time-delay estimation seems very long.
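The per-cell time-delay estimate described above boils down to finding the cross-correlation peak between the two channels. A minimal time-domain sketch for a single band is below; a real system would apply this in each time-frequency cell of a filterbank, and the function name is invented:

```python
def best_lag(x, y, max_lag):
    """Cross-correlate two band-filtered channel signals over lags in
    [-max_lag, max_lag] and return the lag with the largest correlation,
    i.e. the estimated inter-channel delay (in samples) of the dominant
    source in this band."""
    def corr(lag):
        # sum x[n] * y[n - lag] over the overlapping region only
        return sum(x[n] * y[n - lag]
                   for n in range(max(0, lag), min(len(x), len(y) + lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)

# usage: x is y delayed by 2 samples, so the estimated lag is 2
y = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

The recovered lag, with the known microphone spacing, gives the azimuth of the dominant energy in that cell.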
Jon Barker: "The CASA toolkit"
CASA toolkit aims to make CASA processing widely accessible, easy to play
with, and efficient enough to use on large speech corpora. Currently consists
of a framework for C++ processing blocks which can be composited and linked
together by a scripting language.
Should be easily extensible by reusing existing code: challenge now
is to 'populate' with different CASA functions.
First release is nearing completion.
discussion:
- some 'generic CASA units' might include summarizers and robust trackers
- conventional speech feature calculation could easily be incorporated into the same framework.
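The "processing blocks composited by a scripting language" design can be illustrated in miniature. This is not the toolkit's actual C++ API, just a hypothetical sketch of the composition idea with invented names:

```python
class Block:
    """Minimal sketch of a composable processing unit: each block wraps a
    function, and blocks are chained with '|' into a pipeline, loosely in
    the spirit of the toolkit's linked processing blocks."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # chaining: feed this block's output into the next block
        return Block(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

# usage: compose a trivial two-stage chain
halve = Block(lambda frames: [v / 2 for v in frames])
total = Block(lambda frames: sum(frames))
pipeline = halve | total
```

New units (trackers, summarizers, feature calculators) would then just be further `Block` instances reusing existing code.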
General discussion:
- CASA is in a delicate state: the basic science is still unclear, but it has to deliver soon to retain credibility. Also, because we promised to make it work in RESPITE!
- But how do we know what "is" and "is not" CASA?
  - information that is applied at the classification/decoding stage rather than in feature calculation?
  - the foundation is audition rather than engineering, but the two can converge.
  - Systems that don't identify *multiple* sources are not CASA.
- CASA can be compared to SNR estimation and to blind source separation (BSS):
  - all are based on the idea of the *independence* of different sources
  - BSS looks for *perfect* separation; CASA is much fuzzier.
- Complex, non-stationary noise (like factory noise), where the number of sources changes rapidly, should be where CASA can prove its usefulness.
Technical session 2: Automatic Speech Recognition
(Herve Bourlard, chair)
Herve Bourlard: "Work at IDIAP on Respite, Sphear and Multichan (Swiss)"
- Multi-band work is pursuing the 'full combination' approach, with weights based on SNR estimation, CASA, error minimization, the Fletcher 'product of errors' rule (if it extends to >2 bands), and unsupervised EM estimation, which isn't quite working yet.
Ultimately, do this with 20 bands, for ~1M combination alternatives?
- Noise adaptation and reduction: spectral subtraction appears to outperform
missing-data based on comparable noise estimates.
Noise modeling report compares spec. sub., missing data, cepstral normalization
and blind equalization.
- Multi-scale models now being extended to advanced modeling of dependence
among time-frequency coefficients e.g. wavelet tree, where each value may
depend on a pair of values at the next-finest time resolution
-> "feature HMM" estimating acoustic probabilities for speech HMM.
Dan Ellis: "Connectionist AURORA system & other multistream work at
ICSI"
- Neural net trained on the AURORA (noisy digits) task does a little better than HTK overall, although significantly worse in clean conditions.
- Summing the pre-nonlinearity phoneme outputs of nets based on different
features, then using these values to train an HTK system gives the best
performance, at 52% of HTK baseline WER overall.
- Other projects are looking at using different feature types for different
conditions, and ways to optimize multiband pronunciation models.
Joan Mari: "Full combination for likelihood based systems"
- The full-combination approach expresses a weighted combination of models for each band configuration, with weights estimated from the input features. But reworking this in the *likelihood* domain of conventional Gaussian mixture models ends up with the weight terms *not* depending on the acoustics - so what do we do?
discussion:
- estimate the weights by EM maximization of the data likelihood, which gives a path by which the data influences the weights
- maybe the weights don't have to depend on the data, because mismatched models have very low likelihoods (but: inliers (small-deviation corruptions) vs. outliers)
Phil Green: "Progress in missing data ASR"
- Missing data can be done via marginalization of unknown dimensions or
imputation of missing values.
Imputation is not as good, but supports a wider range of processing
e.g. deltas, cepstra.
- State-dependent data imputation incorporating bounds constraints
works pretty well.
- Current missing-data masks are based on SNR estimates, so this only works insofar as the noise is static.
- Future work includes frequency filtering, using RASTA, link to CASA
- (mostly covered at Eurospeech and in the Sp. Comm. article on the website)
** Sheffield to test with DaimlerChrysler car noise (tougher than NOISEX) for comparison with IDIAP results.
discussion:
- a 'CASA decoder' (search across possible assignments of CASA-derived chunks of data) will improve things further (currently a Matlab prototype at Sheffield)
-> better than parallel model combination (PMC) because there are fewer decision points.
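The marginalization-with-bounds approach above can be sketched for a single diagonal-Gaussian state. Reliable dimensions contribute their density; unreliable ones contribute the probability mass between 0 and the observed value, since the clean speech energy cannot exceed the noisy observation. This is a simplified illustration with invented names, not the Sheffield implementation:

```python
import math

def norm_cdf(x, mu, sd):
    """Cumulative distribution of a normal with mean mu, std dev sd."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def md_log_likelihood(obs, mask, mu, sd):
    """Missing-data log-likelihood of one frame under a diagonal Gaussian:
    mask[d] is True where the dimension is reliable (use the density),
    False where it is masked (integrate the density over [0, obs[d]])."""
    ll = 0.0
    for x, m, u, s in zip(obs, mask, mu, sd):
        if m:
            ll += -0.5 * math.log(2 * math.pi * s * s) - (x - u) ** 2 / (2 * s * s)
        else:
            mass = norm_cdf(x, u, s) - norm_cdf(0.0, u, s)
            ll += math.log(max(mass, 1e-300))  # guard against log(0)
    return ll
```

Imputation would instead fill in the masked dimensions (e.g. with the state-conditional expectation under the same bounds), which is what allows deltas and cepstra to be computed downstream.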
Catherine Glorion: "MATRA Work in noise robust ASR"
- Root-adaptive schemes, i.e. X(f)^(1/gamma) rather than log(X(f)).
Adapt gamma to the noise conditions: large gamma beats log for car noise, but log is better for babble
-> use a 'babble detector' based on a stationarity measure (babble is very non-stationary).
- Nonlinear spectral subtraction i.e. subtract some function of online
noise estimate.
An improved Voice Activity Detector has been developed, but it is not yet optimized for this speech recognition task.
discussion:
- Different ways to use the AURORA noisy-digits database:
  - use "multi-train", i.e. train on a mix of clean & noisy with the same noise types as the test set (cheating? but it's the baseline)
  - train on clean only. There's still enough data for a good model, and the results are much more interesting, but it's harder.
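The root-vs-log compression choice from the Matra talk is a one-liner to illustrate. A hypothetical helper (not Matra's code), assuming power-domain spectral values:

```python
import math

def compress(power, gamma=None):
    """Root vs. log amplitude compression of spectral values:
    X^(1/gamma) when gamma is given, log(X) otherwise.  The idea from
    the talk is to switch based on the detected noise type (large gamma
    for car noise, log for babble)."""
    if gamma is None:
        return [math.log(x) for x in power]
    return [x ** (1.0 / gamma) for x in power]
```

The babble detector would select which branch to use at run time, since gamma that helps in stationary car noise hurts in non-stationary babble.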
Christophe Ris: "Noise level estimation"
- Comparison of the Hirsch histogram method (dynamic thresholding), "Dromedary" histogram clustering (recover two histogram peaks by EM?), and an envelope follower (median filtering of low-energy frames).
- All systems avoid explicit pause detection, but rely on some speech pauses being included in the analysis window.
- An alternative is to look at the noise floor in between the harmonics of a narrow-band Fourier analysis - this can see noise during voiced segments.
- Evaluate by the MSE of the noise estimate against 1 Hz sinusoid-modulated noise.
discussion:
- Mons also reported 'good results' using missing data: RASTA filtering of post-imputation data got about 12% WER at 10 dB SNR - good!
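The histogram-style noise estimation compared above works roughly as follows: collect frame energies over a window (speech and pauses mixed), histogram them, and read the noise level off the low-energy mode, with no explicit pause detector. A much-simplified sketch, not any partner's actual implementation:

```python
def histogram_noise_estimate(frame_energies_db, nbins=40):
    """Hirsch-style idea in miniature: histogram the frame energies in
    the analysis window and return the centre of the most populated bin,
    which (given enough noise-only frames) tracks the noise level."""
    lo, hi = min(frame_energies_db), max(frame_energies_db)
    width = (hi - lo) / nbins or 1.0  # degenerate case: all frames equal
    counts = [0] * nbins
    for e in frame_energies_db:
        idx = min(int((e - lo) / width), nbins - 1)
        counts[idx] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width
```

As noted in the minutes, this only works if the analysis window actually contains some speech pauses; the narrow-band inter-harmonic alternative removes that requirement.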
General discussion:
- RESPITE has promised to *combine* multi-stream and missing data. Are we doing this?
  - e.g. use MD techniques in the *individual bands* of a multiband system?
  - ICP-style systems could use MD techniques to give weights for multiband full-combination systems
** Andrew Morris to collect a list of missing data<->multistream
transfer ideas.
Technical session 3: Demonstrators for RESPITE & SPHEAR
(P. Lockwood, chair)
- SPHEAR 'theme 4' promises increasingly realistic applications & a focus on in-car applications from the 2nd year onwards.
- RESPITE promises definition of the demo applications (plural) by EOY99, then two generations of implementations over the following 2 years.
- RESPITE also has work packages for transfer of the new techniques to systems of the commercial partners.
Catherine Glorion: Overview of MATRA ASR system:
Speech -> voice detect -> noise reduction -> feature extract -> [CEP]
[CEP] -> normalize -> vector quantize -> HMM train -> [HMM]
[HMM] -> finite-state def -> HMM decoder -> words
CEP and HMM are well-defined interface points for inserting new bits:
- CEP includes VAD status, noise energy, cepstral coeffs etc.
- HMM consists of HTK-compatible model and distribution definitions.
Fritz Class: Overview of DaimlerChrysler system:
Speech -> noise reduction -> feature extract -> [CEP] -> LDA -> VQ -> [VQ]
[VQ] -> HMM verification -> results (words / n-best / lattice)
The VQ interface point is the likelihoods of the top 10 codewords (of 512).
discussion:
- main differences: LDA in the DC system; Matra is continuous, DC semi-continuous (VQ)
- the best way to interface to the RESPITE techniques is to provide a link from the signal (waveform) level straight to state likelihoods for the HMM models.
HMMs will need retraining (e.g. transitions) on the new data.
Applications:
Matra uses ASR for:
- navigation command & control (including spelling place names)
- phone: voice dialling (speaker-dependent keywords), control
- radio control: commands, station names
DaimlerChrysler applications are much the same.
Best for the RESPITE demo?: connected digits for telephone dialling.
** Matra to provide French-language training data for continuous numbers + maybe features for AURORA-style remote recognition?
Summary:
- ASR tasks will be connected digits in French and US English
- databases will be AURORA (clean train / multi-train) + the French subset of SpeechDat-Car (available around Jan 2000)
- DaimlerChrysler has an in-car US English DB, but not with the 'clean' conditions needed for proper missing-data training.
Open discussion:
- What about SPHEAR? Use the same basic goals and databases.
DAY 2: ASR break-out discussion
Andrew Morris, chair
- Joan's full-combination likelihood equations
- product-of-errors rule compared with hybrid posteriors, HMM likelihoods
- RBFs for connectionist-style MD-compatible likelihoods?
Audition break-out discussion
(Joerg & Christos)
(dpwe wasn't there)
Final wrap-up session (left-over presentations from day
1):
Chris Kermorvant: "Multipath stochastic equalization"
- find a transform from observed features to training-matched features by examining the best path through the decoder.
Iterate between finding the best decoder path and the best equalization -
i.e. a *state-dependent* feature mapping.
Katrin Keller: "Wavelet domain Hidden Markov Trees (WHMT)"
- replace GMMs with a more complex statistical inference model
- assign a 'hidden state' at nodes in the wavelet tree, infer feature distributions for descendent nodes dependent on the state - figure out the whole thing with EM.
Andrew Morris:
- Radial Basis Function (RBF) nets for connectionist likelihoods
- not really competitive with HTK GMMs?
- Can use HTK to find Gaussians, then use 2nd layer to calculate posteriors
using Bayes' rule.
- Posterior decomposition into mismatch & utility measures - e.g.
weights can have components due to noise/features and a part that simply
means certain streams are best for certain classes.
- Clean-data likelihood as a mismatch measure - maybe you can detect noisy data by looking at its likelihood distribution.
But in practice many noise conditions look a lot like the clean data - maybe there's *less* likelihood variation?
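The "HTK Gaussians + 2nd layer computing posteriors via Bayes' rule" idea above is just the standard Bayes inversion, sketched here with invented names (the real second layer would operate on per-state Gaussian likelihoods):

```python
def posteriors_from_likelihoods(likelihoods, priors):
    """Bayes' rule: P(q|x) = p(x|q) P(q) / sum_q' p(x|q') P(q'),
    turning per-class likelihoods (e.g. from HTK-trained Gaussians)
    into the class posteriors a connectionist system would output."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```

This gives posterior outputs without training a net, so the decomposition into mismatch and utility components can then be studied on top of them.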