MINUTES of the joint SPHEAR/RESPITE meeting
Hotel aux Mille Etoiles, Les Marecottes, Switzerland 1999sep13-14
(minutes recorded by Dan Ellis, dpwe@icsi.berkeley.edu)
Attendees:
Phil Green (Sheffield, Chair)
Herve Bourlard (IDIAP, Host)
RESPITE: Fritz Class (DaimlerChrysler), Christophe Ris (FPMs), Herve Glotin (ICP), Dan Ellis (ICSI), Andrew Morris (IDIAP), Catherine Glorion (Matra), Philip Lockwood (Matra), Jon Barker (Sheffield), Martin Cooke (Sheffield)
SPHEAR: Karsten Lehn (Bochum), Christos Tsakostas (Bochum), Joan Mari (DaimlerChrysler), Astrid Hagen (IDIAP), Katrin Keller (IDIAP), Christopher Kermorvant (IDIAP), Bill Ainsworth (Keele), John Mourjopoulos (Patras), Joerg Bucholz (Patras), Ascension Vizinho (Sheffield)
Co-ordinator's report (Phil Green)
SPHEAR
- 2nd year payments have just been distributed to partners.
- A euro 10k deduction arose because payments are only for documented spending, except the initial payment, which is clawed back from subsequent payments.
- Payment was delayed because some overheads were over-claimed.
Mid-term review
- should be around March 2000; the next SPHEAR meeting is Feb 26/27 in Bochum
- Mon Feb 28 will be a 1-day review for Christianne Bernard + an expert
- Report due 1 month earlier; PG will collect pieces in 1999dec
- Review is a 1 hr co-ordinator's report, a 5 min 'tour de table' each, + 10 min presentations by young researchers
- Young researchers give feedback through questionnaires
- *Everybody* must attend!
SPHEAR 3-monthly reports
are due in Oct and Dec; the Dec report is to include pieces for the mid-term report.
RESPITE
Project officer
is now Antonio Sanfilippo.
- He gets short, bi-monthly management reports (the 1st covered 6 months), which are based on the monthly email reports.
- The Year 1 report is due around the end of Dec.
- Cost reports have about 1 month extra to be finalized.
Annual reviews
for Human Lang Techs (DG XIII) are all presented during an 'annual event' in Luxembourg in early Feb:
- about 4h/project, by the co-ordinator and 1 or 2 others
- a single panel of 2 or 3 reviewers per set of projects
- PDG to hassle for the date.
"International workshop"
has been promised in the RESPITE programme - will be around March 2001 - start thinking now.
General
Lab visits:
Please make them, and record them in progress reports.
Web sites:
will be visited by reviewers (also used for recruiting) => keep them up-to-date and lively.
DaimlerChrysler is behind a firewall, so must send web content to Sheffield to be put on their site.
Joint publications
are a very good thing (esp. for SPHEAR), e.g. the Speech Comm. special issue arising from the Tampere meeting.
The easy part of the project is over: now we have to really get to work!
Technical session 1: Audition & CASA
(Martin Cooke, chair)
Christos Tsakostas: "Precedence effect & auditory streams"
The maximum delay before a spatially-separated echo is heard as distinct depends on the signal type.
Bregman-style streaming cues were used to manipulate precedence threshold.
Complex interactions were observed; these are modeled as a time-varying fusion threshold for each distinct perceived stream, which can be raised or lowered by the presence of other streams.
discussion:
is there evidence for the *level* at which the precedence effect occurs? i.e. is it peripheral or cortical?
Joerg Bucholz: "A model for both masking and reflection phenomena"
Auditory masking (e.g. forward masking) has been extensively modeled, as has the acoustic effect of rooms, but how do the two interact?
A model that follows HRTF convolution with the MPEG-style simultaneous
masking model plus the Pueschel revcor model of forward masking may be
able to reproduce known "room masking threshold" data on ability of sounds
from different directions to mask each other.
discussion:
SPHEAR has more 'basic science' content than RESPITE, but it's still focussed towards practical ends like speech enhancement and recognition from binaural signals.
Bill Ainsworth: "Effect of filtered noise on voiced-plosive perception"
Work in progress linking psychoacoustics and speech recognition. Earlier
work (JASA 1994) modeled intelligibility of syllables in continuous and
gated noise maskers, although experimental results show some remaining
curiosities.
New experiments look at the effect of filtered noise (high-pass, low-pass, band-pass, band-stop) on various CV syllables - *lots* of conditions -> lots of data.
So far, results are consistent with exponential adaptation with time constants around 200 ms. Need to run more subjects and enhance the model.
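The exponential-adaptation account mentioned above can be sketched numerically. This is a minimal illustration only, assuming dB-domain recovery of the masked threshold with a single ~200 ms time constant; the function and parameter names are invented, not from the talk:

```python
import math

def masked_threshold(t_after_offset, masker_level_db,
                     quiet_threshold_db=0.0, tau=0.2):
    """Illustrative exponential forward-masking recovery: the elevated
    detection threshold decays from the masker level back towards the
    threshold in quiet with time constant tau (seconds)."""
    decay = math.exp(-t_after_offset / tau)
    return quiet_threshold_db + (masker_level_db - quiet_threshold_db) * decay
```

At masker offset the threshold equals the masker level; one time constant (200 ms) later it has fallen to about 37% of the elevation, consistent with the fitted time constants reported.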
Martin Cooke: "Recognition of band-pass speech"
Did intelligibility tests for spoken digits band-pass filtered at various
center-frequencies and bandwidths, then compared to a missing-data speech
recognizer.
ASR performs much like listeners for CFs below 1500 Hz, but listeners are far better at higher frequencies.
Fine-scale frequency sampling reveals performance *dips* in ASR at
around 700Hz and 2kHz - also for listeners at 700Hz.
Overall, this suggests ASR is like listeners in the low band, but a 2nd mechanism is used by listeners for the high band, perhaps based on the fine temporal structure which is ignored by ASR.
Relates to the two-syllable percepts reported for Warren's vowel-sequence illusion.
Herve Glotin: "CASA vs SNR estimation for Time Delay of Arrival processing"
ICP has recorded a stereo version of mixed Numbers95 utterances played
over two speakers, recorded on two microphones.
"Binaural" style processing could separate information for each source
based on a cross-correlation measure in each time-frequency cell, estimating
the time-delay (and hence azimuth) of the dominant energy in that cell.
Using these as labels for a missing-data recognizer can improve WER from 73% (best single mic) to 49% (labelled cells only).
This uses binaural info in the classifier; could use it in the feature
calculation instead by using the labels as a time-frequency filter to maximize
the SNR of one source. This approach gives 66% WER.
discussion:
- isn't there a significant advantage for the 'facing' mic? No, because the recordings didn't use a dummy head, so there is no head shadow.
- the 125 ms window used for time-delay estimation seems very long.
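The per-cell time-delay estimate described above boils down to finding the cross-correlation peak between the two channels. A minimal time-domain sketch for a single band is below; a real system would apply this in each time-frequency cell of a filterbank, and the function name is invented:

```python
def best_lag(x, y, max_lag):
    """Cross-correlate two band-filtered channel signals over lags in
    [-max_lag, max_lag] and return the lag with the largest correlation,
    i.e. the estimated inter-channel delay (in samples) of the dominant
    source in this band."""
    def corr(lag):
        # sum x[n] * y[n - lag] over the overlapping region only
        return sum(x[n] * y[n - lag]
                   for n in range(max(0, lag), min(len(x), len(y) + lag)))
    return max(range(-max_lag, max_lag + 1), key=corr)

# usage: x is y delayed by 2 samples, so the estimated lag is 2
y = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
```

The recovered lag, with the known microphone spacing, gives the azimuth of the dominant energy in that cell.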
Jon Barker: "The CASA toolkit"
CASA toolkit aims to make CASA processing widely accessible, easy to play
with, and efficient enough to use on large speech corpora. Currently consists
of a framework for C++ processing blocks which can be composited and linked
together by a scripting language.
Should be easily extensible by reusing existing code: challenge now
is to 'populate' with different CASA functions.
First release is nearing completion.
discussion:
- some 'generic CASA units' might include summarizers and robust trackers
- conventional speech feature calculation could easily be incorporated into the same framework.
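The "processing blocks composited by a scripting language" design can be illustrated in miniature. This is not the toolkit's actual C++ API, just a hypothetical sketch of the composition idea with invented names:

```python
class Block:
    """Minimal sketch of a composable processing unit: each block wraps a
    function, and blocks are chained with '|' into a pipeline, loosely in
    the spirit of the toolkit's linked processing blocks."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # chaining: feed this block's output into the next block
        return Block(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

# usage: compose a trivial two-stage chain
halve = Block(lambda frames: [v / 2 for v in frames])
total = Block(lambda frames: sum(frames))
pipeline = halve | total
```

New units (trackers, summarizers, feature calculators) would then just be further `Block` instances reusing existing code.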
General discussion:
- CASA is in a delicate state: the basic science is still unclear, but it has to deliver soon to retain credibility. Also, because we promised to make it work in RESPITE!
- But how do we know what "is" and "is not" CASA?
  - information that is applied at the classification/decoding stage rather than in feature calculation?
  - the foundation is audition rather than engineering, but the two can converge.
  - Systems that don't identify *multiple* sources are not CASA.
- CASA can be compared to SNR estimation and to blind source separation (BSS):
  - all are based on the idea of the *independence* of different sources
  - BSS looks for *perfect* separation; CASA is much fuzzier.
- Complex, non-stationary noise (like factory noise), where the number of sources changes rapidly, should be where CASA can prove its usefulness.
Technical session 2: Automatic Speech Recognition
(Herve Bourlard, chair)
Herve Bourlard: "Work at IDIAP on Respite, Sphear and Multichan (Swiss)"
- Multi-band work is pursuing the 'full combination' approach, with weights based on SNR estimation, CASA, error minimization, the Fletcher 'product of errors' rule (if it extends to >2 bands), and unsupervised EM estimation, which isn't quite working yet.
Ultimately, do this with 20 bands, for ~1M combination alternatives?
- Noise adaptation and reduction: spectral subtraction appears to outperform
missing-data based on comparable noise estimates.
Noise modeling report compares spec. sub., missing data, cepstral normalization
and blind equalization.
- Multi-scale models now being extended to advanced modeling of dependence
among time-frequency coefficients e.g. wavelet tree, where each value may
depend on a pair of values at the next-finest time resolution
-> "feature HMM" estimating acoustic probabilities for speech HMM.
Dan Ellis: "Connectionist AURORA system & other multistream work at
ICSI"
- Neural net trained on the AURORA (noisy digits) task does a little better than HTK overall, although significantly worse in clean conditions.
- Summing the pre-nonlinearity phoneme outputs of nets based on different
features, then using these values to train an HTK system gives the best
performance, at 52% of HTK baseline WER overall.
- Other projects are looking at using different feature types for different
conditions, and ways to optimize multiband pronunciation models.
Joan Mari: "Full combination for likelihood based systems"
- The full-combination approach expresses a weighted combination of models for each band configuration, with weights estimated from the input features. But reworking this in the *likelihood* domain of conventional Gaussian mixture models ends up with the weight terms *not* depending on the acoustics - so what do we do?
discussion:
- estimate the weights by EM maximization of the data likelihood, which gives a path by which the data influences the weights
- maybe the weights don't have to depend on the data, because mismatched models have very low likelihoods (but: inliers (small-deviation corruptions) vs. outliers)
Phil Green: "Progress in missing data ASR"
- Missing data can be done via marginalization of unknown dimensions or
imputation of missing values.
Imputation is not as good, but supports a wider range of processing
e.g. deltas, cepstra.
- State-dependent data imputation incorporating bounds constraints
works pretty well.
- Current missing-data masks are based on SNR estimates, so this only works insofar as the noise is static.
- Future work includes frequency filtering, using RASTA, link to CASA
- (mostly covered at Eurospeech and in the Sp. Comm. article on the website)
** Sheffield to test with DaimlerChrysler car noise (tougher than NOISEX) for comparison with IDIAP results.
discussion:
- a 'CASA decoder' (search across possible assignments of CASA-derived chunks of data) will improve things further (currently a Matlab prototype at Sheffield)
-> better than parallel model combination (PMC) because there are fewer decision points.
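The marginalization-with-bounds approach above can be sketched for a single diagonal-Gaussian state. Reliable dimensions contribute their density; unreliable ones contribute the probability mass between 0 and the observed value, since the clean speech energy cannot exceed the noisy observation. This is a simplified illustration with invented names, not the Sheffield implementation:

```python
import math

def norm_cdf(x, mu, sd):
    """Cumulative distribution of a normal with mean mu, std dev sd."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def md_log_likelihood(obs, mask, mu, sd):
    """Missing-data log-likelihood of one frame under a diagonal Gaussian:
    mask[d] is True where the dimension is reliable (use the density),
    False where it is masked (integrate the density over [0, obs[d]])."""
    ll = 0.0
    for x, m, u, s in zip(obs, mask, mu, sd):
        if m:
            ll += -0.5 * math.log(2 * math.pi * s * s) - (x - u) ** 2 / (2 * s * s)
        else:
            mass = norm_cdf(x, u, s) - norm_cdf(0.0, u, s)
            ll += math.log(max(mass, 1e-300))  # guard against log(0)
    return ll
```

Imputation would instead fill in the masked dimensions (e.g. with the state-conditional expectation under the same bounds), which is what allows deltas and cepstra to be computed downstream.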
Catherine Glorion: "MATRA Work in noise robust ASR"
- Root-adaptive schemes, i.e. X(f)^(1/gamma) rather than log(X(f)).
Adapt gamma to the noise conditions: large gamma beats log for car noise, but log is better for babble
-> use a 'babble detector' based on a stationarity measure (babble is very non-stationary).
- Nonlinear spectral subtraction i.e. subtract some function of online
noise estimate.
An improved Voice Activity Detector has been developed, but it is not yet optimized for this speech recognition task.
discussion:
- Different ways to use the AURORA noisy-digits database:
  - use "multi-train", i.e. train on a mix of clean & noisy with the same noise types as the test set (cheating? but it's the baseline)
  - train on clean only. There's still enough data for a good model, and the results are much more interesting, but it's harder.
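The root-vs-log compression choice from the Matra talk is a one-liner to illustrate. A hypothetical helper (not Matra's code), assuming power-domain spectral values:

```python
import math

def compress(power, gamma=None):
    """Root vs. log amplitude compression of spectral values:
    X^(1/gamma) when gamma is given, log(X) otherwise.  The idea from
    the talk is to switch based on the detected noise type (large gamma
    for car noise, log for babble)."""
    if gamma is None:
        return [math.log(x) for x in power]
    return [x ** (1.0 / gamma) for x in power]
```

The babble detector would select which branch to use at run time, since gamma that helps in stationary car noise hurts in non-stationary babble.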
Christophe Ris: "Noise level estimation"
- Comparison of the Hirsch histogram method (dynamic thresholding), "Dromedary" histogram clustering (recover two histogram peaks by EM?), and an envelope follower (median filtering of low-energy frames).
- All systems avoid explicit pause detection, but rely on some speech pauses being included in the analysis window.
- An alternative is to look at the noise floor in between the harmonics of a narrow-band Fourier analysis - this can see noise during voiced segments.
- Evaluate by the MSE of the noise estimate against 1 Hz sinusoid-modulated noise.
discussion:
- Mons also reported 'good results' using missing data: RASTA filtering of post-imputation data got about 12% WER at 10 dB SNR - good!
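The histogram-style noise estimation compared above works roughly as follows: collect frame energies over a window (speech and pauses mixed), histogram them, and read the noise level off the low-energy mode, with no explicit pause detector. A much-simplified sketch, not any partner's actual implementation:

```python
def histogram_noise_estimate(frame_energies_db, nbins=40):
    """Hirsch-style idea in miniature: histogram the frame energies in
    the analysis window and return the centre of the most populated bin,
    which (given enough noise-only frames) tracks the noise level."""
    lo, hi = min(frame_energies_db), max(frame_energies_db)
    width = (hi - lo) / nbins or 1.0  # degenerate case: all frames equal
    counts = [0] * nbins
    for e in frame_energies_db:
        idx = min(int((e - lo) / width), nbins - 1)
        counts[idx] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width
```

As noted in the minutes, this only works if the analysis window actually contains some speech pauses; the narrow-band inter-harmonic alternative removes that requirement.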
General discussion:
- RESPITE has promised to *combine* multi-stream and missing data. Are we doing this?
  - e.g. use MD techniques in the *individual bands* of a multiband system?
  - ICP-style systems could use MD techniques to give weights for multiband full-combination systems
** Andrew Morris to collect a list of missing data<->multistream
transfer ideas.
Technical session 3: Demonstrators for RESPITE & SPHEAR
(P. Lockwood, chair)
- SPHEAR 'theme 4' promises increasingly realistic applications & a focus on in-car applications from the 2nd year onwards.
- RESPITE promises definition of the demo applications (plural) by EOY99, then two generations of implementations over the following 2 years.
- RESPITE also has work packages for transfer of the new techniques to systems of the commercial partners.
Catherine Glorion: Overview of MATRA ASR system:
Speech -> voice detect -> noise reduction -> feature extract -> [CEP]
[CEP] -> normalize -> vector quantize -> HMM train -> [HMM]
[HMM] -> finite-state def -> HMM decoder -> words
CEP and HMM are well-defined interface points for inserting new bits:
- CEP includes VAD status, noise energy, cepstral coeffs etc.
- HMM consists of HTK-compatible model and distribution definitions.
Fritz Class: Overview of DaimlerChrysler system:
Speech -> noise reduction -> feature extract -> [CEP] -> LDA -> VQ -> [VQ]
[VQ] -> HMM verification -> results (words / n-best / lattice)
The VQ interface point is the likelihoods of the top 10 codewords (of 512).
discussion:
- main differences: LDA in the DC system; Matra is continuous, DC semi-continuous (VQ)
- the best way to interface to the RESPITE techniques is to provide a link from the signal (waveform) level straight to state likelihoods for the HMM models.
HMMs will need retraining (e.g. transitions) on the new data.
Applications:
Matra uses ASR for:
- navigation command & control (including spelling place names)
- phone: voice dialling (speaker-dependent keywords), control
- radio control: commands, station names
DaimlerChrysler applications are much the same.
Best for the RESPITE demo?: connected digits for telephone dialling.
** Matra to provide French-language training data for continuous numbers + maybe features for AURORA-style remote recognition?
Summary:
- ASR tasks will be connected digits in French and US English
- databases will be AURORA (clean train / multi-train) + the French subset of SpeechDat-Car (available around Jan 2000)
- DaimlerChrysler has an in-car US English DB, but not with the 'clean' conditions needed for proper missing-data training.
Open discussion:
- What about SPHEAR? Use the same basic goals and databases.
DAY 2: ASR break-out discussion
Andrew Morris, chair
- Joan's full-combination likelihood equations
- product-of-errors rule compared with hybrid posteriors, HMM likelihoods
- RBFs for connectionist-style MD-compatible likelihoods?
Audition break-out discussion
(Joerg & Christos)
(dpwe wasn't there)
Final wrap-up session (left-over presentations from day
1):
Chris Kermorvant: "Multipath stochastic equalization"
- find a transform from observed features to training-matched features by examining the best path through the decoder.
Iterate between finding the best decoder path and the best equalization -
i.e. a *state-dependent* feature mapping.
Katrin Keller: "Wavelet domain Hidden Markov Trees (WHMT)"
- replace GMMs with a more complex statistical inference model
- assign a 'hidden state' at nodes in the wavelet tree, infer feature distributions for descendent nodes dependent on the state - figure out the whole thing with EM.
Andrew Morris:
- Radial Basis Function (RBF) nets for connectionist likelihoods
- not really competitive with HTK GMMs?
- Can use HTK to find Gaussians, then use 2nd layer to calculate posteriors
using Bayes' rule.
- Posterior decomposition into mismatch & utility measures - e.g.
weights can have components due to noise/features and a part that simply
means certain streams are best for certain classes.
- Clean-data likelihood as a mismatch measure - maybe you can detect noisy data by looking at its likelihood distribution.
But in practice many noise conditions look a lot like the clean data - maybe there's *less* likelihood variation?
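The "HTK Gaussians + 2nd layer computing posteriors via Bayes' rule" idea above is just the standard Bayes inversion, sketched here with invented names (the real second layer would operate on per-state Gaussian likelihoods):

```python
def posteriors_from_likelihoods(likelihoods, priors):
    """Bayes' rule: P(q|x) = p(x|q) P(q) / sum_q' p(x|q') P(q'),
    turning per-class likelihoods (e.g. from HTK-trained Gaussians)
    into the class posteriors a connectionist system would output."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```

This gives posterior outputs without training a net, so the decomposition into mismatch and utility components can then be studied on top of them.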