
MINUTES of 2nd joint SPHEAR-RESPITE workshop

FCTS Lab, Mons, Sept 15-17, 2000

minutes taken by Andrew Morris, morris@idiap.ch

Attendees

Bochum: Prof. Blauert
Patras: Joerg Buchholz
Keele: Georg Meyer, Bill Ainsworth
Sheffield: Phil Green, Stuart Cunningham, Jon Barker, Jordy Cohen
IDIAP: Andrew Morris, Astrid Hagen
ICP: Laurent Varin, Martin Heckmann
ICSI: Dan Ellis
DaimlerChrysler: Joan Mari Hilario, Udo Haiber, Fritz Class
FPMs: Christophe Ris, Jean-Marc Boite, Laurent Couvreur, Stéphane Dupont, Olivier Pietquin
Babel: Olivier Deroo

Apologies

Herve Bourlard, Martin Cooke, Jens Blauert, Christos Tsakostas

Introduction (Phil)

Scientific reviews. For interesting legal reasons, neither project has yet had one. This does not matter, except that the last stage of each project's funding will be held up until its scientific review has been completed.
Recruitment

2 new SPHEAR placements: Kalle Palomaki, Sheffield (on auditory attention); Olivier Crouzet, Keele
IDIAP has a vacancy for a full-time job on RESPITE.

Scientific reports

ACTION: tri-monthly SPHEAR and bi-monthly RESPITE reports are both due by end of September. Please send Phil details of any new work not reported to this meeting.
ACTION: RESPITE web based progress report to be updated by end of this year.

RESPITE cost statements: the next payments may be held up a bit due to late submission of cost statements from Matra and ICP, plus some errors in the submitted statements which had to be sorted out. The statements were finally submitted in August, so payments are expected around the end of the year.
Babel have replaced Matra as Industrial partner in RESPITE. Contract is to be amended. Babel will be funded in retrospect back to Feb 2000 when Matra left.
RESPITE international CRAC workshop (Consistent & Reliable Acoustic Cues for sound analysis) will take place just before Eurospeech 2001, in Aalborg. To encourage proper discussion, this will be limited to around 40 people. However, contributions from as many SPHEAR/RESPITE partners as possible are welcome (deadline end April 2001). Selected presentations will be invited to be expanded for a special issue of Speech Communication.

Research updates

Presentations (publicly accessible) here

Some points arising from these talks:

Standard test data: In future everyone should try to base tests on Aurora 2.0, which can be obtained from ELRA (http://www.icp.grenet.fr/ELRA/home.html). Aurora 2.0 has a more realistic range of noises, and also has enough clean data to permit training with clean data when required. Reported results must use the full training and test sets, but as the new test set is even larger than before, it will be important to use some subset of the test set for initial tests.
Common segmentation: Aurora does not come with a training data segmentation. It would help make results more directly comparable if we could agree on a common segmentation. It would further help if we could agree on a common speech unit and HMM topology.
Standard results presentation: It was agreed that in future everyone will present results with SNR increasing to the right, and WER increasing upwards.
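
One way to pick the "subset of the test set for initial tests" mentioned above, so that every partner can reproduce the same subset, is sketched below. This is only an illustration: the function name, the (utterance id, condition) input format and the 10% default are assumptions made for the example, not something agreed at the meeting. It draws the same fraction from every noise/SNR condition using a fixed random seed.

```python
import random
from collections import defaultdict

def pick_initial_test_subset(utterances, fraction=0.1, seed=0):
    """Return a reproducible subset, sampled separately within each condition.

    utterances: iterable of (utterance_id, condition) pairs. The condition is
    any sortable label, e.g. a (noise_type, snr_db) tuple (hypothetical format).
    """
    by_condition = defaultdict(list)
    for utt_id, condition in utterances:
        by_condition[condition].append(utt_id)

    rng = random.Random(seed)  # fixed seed, so every site gets the same subset
    subset = []
    for condition in sorted(by_condition):
        ids = sorted(by_condition[condition])  # sort so input order is irrelevant
        n_keep = max(1, round(len(ids) * fraction))  # keep at least one per condition
        subset.extend((utt_id, condition) for utt_id in rng.sample(ids, n_keep))
    return subset
```

Final reported results would still use the full test set. A subset like this is only for quick initial runs.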

Discussion groups

Recognition demonstrators: plans and products

The DC demonstrator will be for both SPHEAR and RESPITE. The Babel demonstrator will be for RESPITE.
Both first generation demonstrators should be complete by the end of this year. Because of the change from Matra to Babel, plans have been disrupted, but we will still need to address that milestone.
Babel demonstrator

Babel already have a working demonstrator, built around STRUT.
They have already implemented Stéphane Dupont's recent 'multi-bands trained in artificial noise' system. This was demonstrated to work in real time in a very high level of noise for a small vocabulary task. They are willing to implement other experimental systems as well if required.
Extension is envisaged to replace the default speech processor, and to build a graphical user interface.
As well as speech recognition, Babel also offer speech synthesis, and speech and music compression.
Need to decide on I/O formats.
Graphical interface will be able to demonstrate each stage of the recognition process, as well as real-time recognition. This will be very useful for explaining to people how each system works (except for any CASA front end, which will require separate demonstration).

DC demonstrator

The 'feature level' interface and the procedure for conversion to DC UniIO format are both complete and tested, so tests should now be quite fast to run. This will enable testing with any kind of features, including features for 'tandem' processing.
Initial tests were made with Aurora 1.0. These tests should now be rerun, and all future tests should now use Aurora 2.0.
Care is needed in exact specification of the speech features which are given to DC for testing.
The further 'likelihoods level' interface remains to be implemented. This will input likelihoods (or scaled likelihoods) and transition probabilities. This will be necessary for testing any HMM/ANN based system, where (scaled) likelihoods will be produced outside of the DC system, for decoding by DC's HMM.
Likelihoods from HMM/ANN are generally (though not necessarily) one per phoneme. This will require input to an HMM using one-state-per-phoneme models, whereas the DC system generally uses hidden-state whole-word models. ACTION: either the DC demonstrator must be able to handle medium-large vocabulary recognition decoding, using sub-word models, or else the likelihoods which are input to DC's system must be not for phonemes, but for whole-word hidden states. This is possible, but is certainly not the usual way HMM/ANNs are used at present.
Note that a minor further processing of such 'likelihoods' input data would permit processing via the already completed 'feature level' interface. Such tests would also be of interest in their own right.
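
For reference, the "scaled likelihoods" that an HMM/ANN system would pass through the likelihoods level interface are usually obtained by dividing the network's state posteriors by the state priors, since by Bayes' rule p(x|q)/p(x) = P(q|x)/P(q). A minimal sketch of that step follows. The function name, the list-based frame format and the flooring constant are assumptions made for illustration and say nothing about the actual DC interface.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert one frame of state posteriors P(q|x) into log scaled likelihoods.

    Dividing each posterior by its state prior P(q) gives p(x|q)/p(x). The
    factor p(x) is the same for every state in a frame, so it does not change
    which path the decoder finds best.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    # Floor both terms so that a zero probability cannot produce log(0).
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]
```

For example, a state with posterior 0.7 and prior 0.5 gets a log scaled likelihood of log(1.4), while a state whose posterior is below its prior gets a negative value.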

Planned tests

First generation RESPITE demonstrator was originally planned to be on the SpeechDat Car database from Matra. However, Matra have left and we do not have this database. It was agreed to use Aurora 2.0 instead. Several partners already have this.
The number of possible tests (involving combinations of different features, different speech enhancement techniques prior to feature extraction, etc., as well as the possibility of training either with clean data, or mixed clean and noisy data) is very large. Before running more tests on either system, we therefore need to consider carefully which tests should be run. ACTION: agree on which tests we should start running.
Most partners are currently using whole-word HMMs with small vocabulary recognition tasks. ACTION: we still need to decide the exact HMM topology which each demonstrator will use.
Do we intend to also test with sub-word HMMs? This may not be necessary for the demonstrators, but it would certainly be desirable to test at least our best systems on medium sized vocabulary CSR as well, which would require monophone or triphone models.
Do we want to test missing-data using data imputation? Andrej had some quite good results combining SS with MD imputation.
For some of the more exotic experimental systems, it might be a good idea to do a quick complexity study before bothering to test them on a system which is aimed at real-time processing.

Report on both demonstrators due March 2001.

Psycho-acoustics: plans and products

Bill's report on the psycho-acoustics group discussion to go here

Scope for interaction between different techniques presented

Several areas of potentially useful interaction between the different techniques presented were identified. Summary of ideas will be presented as a clickable grid to go here - soon.

SPHEAR & RESPITE steering committee - See the private pages of SPHEAR & RESPITE

Jordy's slot

It will be very nice to have a standard test database.
Liked CASA work. If they haven't done this already, it's important that CASA people check the wide literature on correlograms.
Could multi-source decoder idea be taken down to the level of individual time/freq pixels?

Answer (Andrew Morris) - that sounds very like the HMM/GME approach which I presented at this meeting, but with the data vector covering several speech frames instead of just one. The optimal size of data window as used in HMM/ANN ASR is around 100 ms (9 frames). The problem here (barring quantum computers) is the number of different possible positions of missing data components within this window, and the time it takes to evaluate the posterior state probabilities for each one. I would guess that about 10 different positions of missing data is feasible, and 10 is a lot less than 2^(9*20).

You should feel pretty good about the work you're doing.
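
To put the 2^(9*20) in Andrew's answer above into perspective, the short calculation below counts the possible present/missing patterns over a window. It reads the expression as 9 frames of 20 spectral components each, which is an interpretation of the figures quoted in the answer rather than something stated explicitly in the minutes.

```python
FRAMES = 9       # window length in frames, as quoted in the answer
COMPONENTS = 20  # components per frame, inferred from the 2^(9*20) expression

def count_missing_data_masks(frames, components):
    """Each time/frequency pixel is either present or missing: 2 choices each."""
    return 2 ** (frames * components)

total = count_missing_data_masks(FRAMES, COMPONENTS)
print(f"{total:.3e} possible masks, against roughly 10 judged feasible")
```

The total is about 1.5 x 10^54, so exhaustive evaluation over all patterns is out of the question and only a small hand-picked set of patterns could be evaluated.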

Imminent deliverables

ACTION: For end of this year

"Interfacing CASA to speech recognition". Concerns mostly Fred and Jon.
"Experiments in human speech perception". Laurent to remind Fred about something he did in this line.
Documented software for multi-stream speech recognition (IDIAP, FPM)

STRUT onto web, with pointer to documentation?
RBF/GME software? (undocumented, over 9000 lines of C code)

ACTION: For end of this month

"Combining multi-stream with missing-data approaches" (IDIAP, report, not software). All multi-band is MS+MD. So is HMM/GME. Should not be hard to do a quick summary report, and refer to various recent publications.
"Multi-source decoder and report" (Sheffield)

AOB

Phil noted that "ICP part in RESPITE demonstrator has to be discussed"

Forthcoming meetings

RESPITE: Thurs Jan 25th, Fri 26th in Luxembourg - Phil to check

SPHEAR: Fri 6th, Sat 7th April in Keele

CRAC: Sun 2nd December 2001 (replaces the joint SPHEAR/RESPITE meeting for next year).