MINUTES of 2nd joint SPHEAR-RESPITE workshop
TCTS Lab, Mons, Sept 15-17, 2000
minutes taken by Andrew Morris, firstname.lastname@example.org
Bochum: Prof. Blauert
Patras: Joerg Buchholz
Keele: Georg Meyer, Bill Ainsworth
Sheffield: Phil Green, Stuart Cunningham, Jon Barker, Jordy Cohen
IDIAP: Andrew Morris, Astrid Hagen
ICP: Laurent Varin, Martin Heckmann
ICSI: Dan Ellis
DaimlerChrysler: Joan Mari Hilario, Udo Haiber, Fritz Class
FPMs: Christophe Ris, Jean-Marc Boite, Laurent Couvreur, Stéphane
Dupont, Olivier Pietquin
Babel: Olivier Deroo
Hervé Bourlard, Martin Cooke, Jens Blauert, Christos
Scientific reviews. For interesting legal reasons,
neither project has yet had one. This does not matter, except that the
last stage of each project's funding will be held up until its scientific
review has been completed.
2 new SPHEAR placements: Kalle Palomaki, Sheffield
(on auditory attention); Olivier Crouzet, Keele
IDIAP has a vacancy for a full-time job on RESPITE.
SPHEAR and bi-monthly RESPITE reports are both due by end of September.
Please send Phil details of any new work not reported to this meeting.
Web-based progress report to be updated by end of this year.
RESPITE cost statements: next payments may be held
up a bit due to late submission of cost statements from Matra and ICP,
plus some errors in the submitted statements which had to be sorted out.
These were finally submitted in August, so payments are expected around the end of the year.
Babel have replaced Matra as industrial partner in
RESPITE. The contract is to be amended. Babel will be funded retrospectively,
back to Feb 2000 when Matra left.
RESPITE international CRAC workshop (Consistent &
Reliable Acoustic Cues for sound analysis) will take place just before
Eurospeech 2001, in Aalborg. To encourage proper discussion, this will
be limited to around 40 people. However, contributions from as many SPHEAR/RESPITE
partners as possible are welcome (deadline: end of April 2001). Authors of selected
presentations will be invited to expand them for a special issue of Speech Communication.
Presentations (publicly accessible) here
Some points arising from these talks:
Standard test data: In future everyone should try to base tests
on Aurora 2.0. This can be obtained from ELRA (http://www.icp.grenet.fr/ELRA/home.html).
It has a more realistic range of noises, and also has enough clean data
to permit training with clean data when required. Reported results must
use the full training and test sets, but as the new test set is even larger
than before, it will be important to use some subset of the test set for
preliminary experiments.
Common segmentation: Aurora does not come with a training data segmentation.
It would help make results more directly comparable if we could agree on
a common segmentation. It would further help if we could agree on a common
speech unit and HMM topology.
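To make the topology question concrete, here is a minimal Python sketch of the kind of model most partners are using. The 16-state, left-to-right whole-word layout and the transition values are assumptions for illustration, not an agreed choice:

    import numpy as np

    n_states = 16                          # assumed states per word model
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = 0.6                      # self-loop (illustrative value)
        A[i, i + 1] = 0.4                  # forward transition
    A[-1, -1] = 1.0                        # final state absorbs

    # Rows must sum to 1 for a valid stochastic transition matrix.
    assert np.allclose(A.sum(axis=1), 1.0)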
Standard results presentation: It was agreed that in future everyone
will present results with SNR increasing to the right, and WER increasing
upwards.
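For example, a plot following this convention might be produced as below (matplotlib assumed; the numbers are made up for illustration):

    import matplotlib.pyplot as plt

    snr = [-5, 0, 5, 10, 15, 20]                 # dB, cleanest on the right
    wer = [85.0, 58.2, 30.5, 12.1, 4.8, 2.0]     # %, made-up results
    plt.plot(snr, wer, marker="o")
    plt.xlabel("SNR (dB)")                       # SNR increases to the right
    plt.ylabel("WER (%)")                        # WER increases upwards
    plt.show()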
Recognition demonstrators: plans
Psycho-acoustics: plans and products
The DC demonstrator will be for both SPHEAR and RESPITE.
The Babel demonstrator will be for RESPITE.
Both first generation demonstrators should be complete
by the end of this year. Because of the change from Matra to Babel, plans
have been disrupted, but we will still need to address that milestone.
Babel already have a working demonstrator, built around STRUT.
They have already implemented Stéphane Dupont's recent 'multi-bands
trained in artificial noise' system. This was demonstrated to work in real
time in a very high level of noise for a small vocabulary task. They are
willing to implement other experimental systems as well if required.
Extension is envisaged to replace the default speech processor, and to
build a graphical user interface.
As well as speech recognition, Babel also offer speech synthesis, and speech
and music compression.
Need to decide on I/O formats.
Graphical interface will be able to demonstrate each stage of the recognition
process, as well as real-time recognition. This will be very useful for
explaining to people how each system works (except for any CASA front end,
which will require separate demonstration).
The 'feature level' interface and the procedure for conversion to DC UniIO
format, are both complete and tested, so tests should now be quite fast
to run. This will enable testing with any kinds of features, including
features for 'tandem' processing.
Initial tests were made with Aurora 1.0. These tests should now be rerun,
and all future tests should use Aurora 2.0.
Care is needed in the exact specification of the speech features which are
given to DC for testing.
The further 'likelihoods level' interface remains to be implemented. This
will input likelihoods (or scaled likelihoods) and transition probabilities.
This will be necessary for testing any HMM/ANN based system, where (scaled)
likelihoods will be produced outside of the DC system, for decoding by
the DC system.
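For reference, scaled likelihoods in hybrid HMM/ANN systems are usually obtained by dividing each ANN posterior by its class prior, since by Bayes' rule P(q|x)/P(q) = p(x|q)/p(x), and p(x) is constant across states at each frame. A minimal Python sketch (function name and floor value are mine):

    import numpy as np

    def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
        """posteriors: (n_frames, n_states) ANN outputs;
        priors: (n_states,) relative state frequencies from training."""
        post = np.maximum(posteriors, floor)     # avoid log(0)
        return np.log(post) - np.log(priors)     # log P(q|x) - log P(q)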
Likelihoods from HMM/ANN are generally (though not necessarily) one per
phoneme. This will require input to an HMM using one-state-per-phoneme
models, whereas the DC system generally uses hidden-state whole-word models.
ACTION: either the DC demonstrator must be able to handle medium-large vocabulary
recognition decoding, using sub-word models - or else the likelihoods which
are input to DC's system must be not for phonemes, but for whole-word hidden
states. This is possible, but is certainly not the usual way HMM/ANNs
are used at present.
Note that a minor further processing step applied to such 'likelihoods'
input data would permit processing via the already completed 'feature level'
interface. Such tests would also be of interest in their own right.
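One plausible reading of this note, along the lines of the 'tandem' processing mentioned earlier (a sketch under assumed details, not an agreed procedure): log-compress the posteriors and decorrelate them with PCA, after which they can pass through the feature-level interface like ordinary features.

    import numpy as np

    def tandem_features(posteriors, n_keep=24, floor=1e-10):
        """n_keep (assumed value): number of PCA components retained."""
        logp = np.log(np.maximum(posteriors, floor))
        logp -= logp.mean(axis=0)                # remove per-dimension mean
        _, _, vt = np.linalg.svd(logp, full_matrices=False)
        return logp @ vt[:n_keep].T              # PCA-decorrelated features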
The first generation RESPITE demonstrator was originally planned to be on the
SpeechDat Car database from Matra. However, Matra have left and we do not
have this database. It was agreed to use Aurora 2.0 instead. Several partners
already have this.
The number of possible tests (involving combinations of different features,
different speech enhancement techniques prior to feature extraction, etc.,
as well as the possibility of training either with clean data, or mixed
clean and noisy data) is very large. Before running more tests on
either system, we therefore need to consider carefully which tests should
be run. ACTION: agree on which tests
we should start running.
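A quick illustration of the problem (the axis values below are assumed examples, not an agreed list): even three small axes already give dozens of runs, before HMM topology or noise condition is varied.

    from itertools import product

    features = ["MFCC", "PLP", "J-RASTA", "ANN posteriors"]       # assumed
    enhancement = ["none", "spectral subtraction", "Wiener filter"]
    training = ["clean", "mixed clean+noisy"]
    tests = list(product(features, enhancement, training))
    print(len(tests))   # 4 * 3 * 2 = 24 runs already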
Most partners are currently using whole-word HMMs with small vocabulary
recognition tasks. ACTION: we still
need to decide the exact HMM topology which each demonstrator will use.
Do we intend to also test with sub-word HMMs? This may not be necessary
for the demonstrators, but it would certainly be desirable to test at
least our best systems on medium sized vocabulary CSR as well, which would
require monophone or triphone models.
Do we want to test missing-data using data imputation? Andrej had some
quite good results combining spectral subtraction (SS) with missing-data (MD) imputation.
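As a reminder of what MD imputation involves, a minimal per-frame sketch (function and variable names are mine; a diagonal-covariance GMM of clean speech is the assumed model): responsibilities are computed from the reliable spectral components only, and the missing components are replaced by the responsibility-weighted component means.

    import numpy as np

    def impute_frame(frame, reliable, weights, means, variances):
        """frame: (D,) spectral vector; reliable: (D,) bool mask;
        weights (K,), means (K, D), variances (K, D): clean-speech GMM."""
        d = frame[reliable] - means[:, reliable]
        log_resp = (np.log(weights)
                    - 0.5 * ((d ** 2 / variances[:, reliable])
                             + np.log(2 * np.pi * variances[:, reliable])).sum(axis=1))
        resp = np.exp(log_resp - log_resp.max())
        resp /= resp.sum()
        out = frame.copy()
        out[~reliable] = resp @ means[:, ~reliable]  # conditional-mean fill
        return out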
For some of the more exotic experimental systems, it might be a good idea
to do a quick complexity study before bothering to test them on
a system which is aimed at real-time processing.
Report on both demonstrators due March 2001.
Scope for interaction between different techniques
Bill's report on the psycho-acoustics group discussion
to go here
Several areas of potentially useful interaction
between the different techniques presented were identified. A summary
of these ideas will be presented as a clickable grid, to go here soon.
SPHEAR & RESPITE steering
committee - See the private pages
of SPHEAR & RESPITE
It will be very nice to have a standard test database.
Liked the CASA work. If they haven't done this already,
it's important that the CASA people check the wide literature on correlograms.
Could the multi-source decoder idea be taken down to
the level of individual time/frequency pixels?
Answer (Andrew Morris) - that sounds very like
the HMM/GME approach which I presented at this meeting, but with the data
vector covering several speech frames instead of just one. The optimal
size of data window as used in HMM/ANN ASR is around 100 ms (9 frames).
The problem here (barring quantum computers) is the number of different
possible positions of missing data components within this window, and the
time it takes to evaluate the posterior state probabilities for each one.
I would guess that about 10 different positions of missing data is feasible,
and 10 is a lot less than 2^(9*20).
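For the record, the arithmetic behind that last figure (a 9-frame window of 20 frequency channels, as assumed in the answer above):

    # 9 frames x 20 channels = 180 time-frequency pixels, each of which
    # can independently be present or missing:
    n_patterns = 2 ** (9 * 20)
    print(f"{n_patterns:.2e}")   # about 1.5e+54 possible missing-data patterns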
You should feel pretty good about the work you're doing.
ACTION: For end of this year
"Interfacing CASA to speech recognition". Concerns mostly Fred and Jon.
"Experiments in human speech perception". Laurent to reminf Fred about
something he did in this line.
Documented software for multi-stream speech recognition (IDIAP, FPM)
STRUT onto web, with pointer to documentation?
RBF/GME software? (undocumented, over 9000 lines of C code)
ACTION: For end of this month
"Combining multi-stream with missing-data approaches" (IDIAP, report, not
software). All multi-band is MS+MD. So is HMM/GME. Should not be hard to
do a quick summary report, and refer to various recent publications.
"Multi-source decoder and report" (Sheffield)
Phil noted that "ICP part in RESPITE demonstrator
has to be discussed"
RESPITE: Thurs Jan 25th, Fri 26th in Luxembourg -
Phil to check
SPHEAR: Fri 6th, Sat 7th April in Keele
CRAC: Sun 2nd December 2001 (replaces the joint SPHEAR/RESPITE
meeting for next year).