UPDATE: Jordan Cohen has stepped down since moving to DRAGON
DECIDED on three-monthly reports. The first of these will be
July 99.
UPDATE: No, it doesn't. If more than 3 people need SPHEAR funding
we'll seek approval.
Theme 1: Temporal Organisation and Auditory Events
Task 1.1: Assessing the role of synchrony/desynchrony of spectral channels in speech perception.
The work on temporal shifts planned at Keele has been delayed by about 6 months owing to recruitment problems.
Experiments on the perception of speech in noise combined with labial gestures are planned at Grenoble.
Theme 2: Segregating Sound Sources
Task 2.1: Spectral and spectro-temporal segregation with shared channels.
At Keele it was found that the ShATR database was not suitable for the planned double-vowel segregation experiments, so the experiments have instead been carried out successfully using vowel pairs extracted from the TIMIT database.
Task 2.2: Integrating grouping strategies across monaural cues.
The role of continuity and proximity in the grouping of formants in vowel-nasal syllables has been investigated at Keele. This work is on time, but the 18-month period may be extended if interesting results continue to be produced. It was suggested that the modelling work on cue integration might use the RESPITE toolkit being developed in a parallel project.
Task 2.4: Coupling of audio-visual cues.
This task has benefited from a placement (Jon Barker) at Grenoble and substantial progress has been achieved. A further placement is required.
Task 2.5: Conflict resolution in multi-stream systems and auditory attention
The task is closely linked with task 2.2 (monaural integration).
New developments
Recent work by Darwin and co-workers on the integration of F0 and ITD suggests that F0 is used primarily for simultaneous grouping, while ITD information is used to label speaker positions after segregation has been performed.
Progress
Oscillator models are being optimised at Sheffield. At Keele, oscillator models were evaluated for speech segregation by F0, using modulation maps as the input representation.
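As a rough illustration of the idea (not the Sheffield or Keele implementation), the sketch below assigns one phase oscillator per spectral channel, couples oscillators whose dominant modulation (F0) estimates are similar, and groups the channels whose oscillators phase-lock. The per-channel F0 estimates and all parameters are hypothetical.

```python
import numpy as np

def group_channels_by_f0(f0_hz, steps=3000, dt=0.01, coupling=2.0,
                         sigma=10.0, coherence_thresh=0.9):
    """Minimal sketch of oscillator-based grouping (not the SPHEAR model).

    One phase oscillator per spectral channel; coupling is strong between
    channels whose dominant modulation (F0) estimates are similar, so those
    oscillators phase-lock while unrelated channels drift apart.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    n = len(f0)
    omega = f0 / 50.0                                  # natural frequencies (arbitrary scale)
    k = coupling * np.exp(-((f0[:, None] - f0[None, :]) / sigma) ** 2)
    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2 * np.pi, n)             # random initial phases
    history = []
    for t in range(steps):
        # Kuramoto update: dtheta_i = omega_i + (1/n) sum_j k_ij sin(theta_j - theta_i)
        theta = theta + dt * (omega + (k * np.sin(theta[None, :] - theta[:, None])).mean(axis=1))
        if t >= steps - 500:                           # keep the settled phases
            history.append(theta.copy())
    ph = np.array(history)
    # Time-averaged pairwise phase coherence: ~1 for locked pairs, low otherwise.
    coh = np.abs(np.exp(1j * (ph[:, :, None] - ph[:, None, :])).mean(axis=0))
    groups = -np.ones(n, dtype=int)
    g = 0
    for i in range(n):
        if groups[i] < 0:
            groups[coh[i] > coherence_thresh] = g
            g += 1
    return groups

# Channels from two harmonic sources (~100 Hz and ~200 Hz) form two groups.
print(group_channels_by_f0([100, 102, 98, 200, 199, 203]))  # e.g. [0 0 0 1 1 1]
```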
Additional work has been carried out at Keele to study F0, continuity and proximity as grouping cues from a psychophysical perspective.
Plans
Oscillators as labelling devices in modulation maps will be investigated in ITD/F0 grouping tasks by the Sheffield group. Non-speech signals overlapping speech signals will be investigated further at Keele.
Lab visits / new placements
Lab visits between Keele and Sheffield are planned for June '99.
Task 2.6: Schema-based segregation by connectionist techniques.
A placement at Grenoble is being sought.
Theme 3: Speech Recognition within Auditory Scenes
Task 3.3: Reliability estimation
New developments
none
Progress
Work in Grenoble in collaboration with IDIAP has concentrated on the development of a reliability estimator based on zero-crossing statistics. This algorithm is currently being evaluated against an autocorrelation-based approach used by IDIAP. In addition, four established techniques were implemented at Sheffield.
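The Grenoble and IDIAP estimators are not reproduced in these minutes; as a minimal sketch of the two underlying cues, the code below marks frames as dominated by a voiced source either from zero-crossing statistics or from the height of the normalised autocorrelation peak in the pitch lag range. Frame sizes and thresholds are hypothetical.

```python
import numpy as np

def frame(x, size=400, hop=160):
    """Split a signal into overlapping frames (e.g. 25 ms / 10 ms at 16 kHz)."""
    starts = range(0, len(x) - size + 1, hop)
    return np.stack([x[s:s + size] for s in starts])

def zcr_reliability(frames, max_zcr=0.12):
    """Zero-crossing cue: voiced, low-frequency-dominated frames have a low
    zero-crossing rate. The threshold is hypothetical."""
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return zcr < max_zcr

def autocorr_reliability(frames, fs=16000, f0_min=60, f0_max=400, min_peak=0.4):
    """Autocorrelation cue: a strong normalised peak in the pitch lag range
    indicates a dominant periodic (voiced) source."""
    reliable = []
    for f in frames:
        f = f - f.mean()
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]   # non-negative lags
        if ac[0] <= 0:                                      # silent frame
            reliable.append(False)
            continue
        ac = ac / ac[0]
        lo, hi = fs // f0_max, fs // f0_min
        reliable.append(ac[lo:hi].max() > min_peak)
    return np.array(reliable)

# Toy check: a 150 Hz harmonic tone vs. white noise, 16 kHz.
fs, t = 16000, np.arange(16000) / 16000
voiced = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
noise = np.random.default_rng(0).standard_normal(len(t))
for name, sig in [("voiced", voiced), ("noise", noise)]:
    fr = frame(sig)
    print(name, zcr_reliability(fr).mean(), autocorr_reliability(fr).mean())
```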
Plans
The collaboration between IDIAP and Sheffield will continue and new work between Grenoble and Sheffield is planned for the next 12 months.
Lab visits / new placements
The possibility of placements outside the SPHEAR project, and of posting a member of the Sheffield team to Grenoble, was discussed and will be taken further.
Joint papers
A joint Grenoble/IDIAP paper is in preparation.
Task 3.6: Near Speech Recognition
New developments
none
Progress
Work at Sheffield concentrates on the effect of multi-channel filtering on the perception of speech sounds and on modelling these data using missing-data recognition techniques. The proposal also includes modelling the recognition of temporally distorted speech, which will form part of the assignment of a post-doctoral researcher at Keele, once hired, to fit in with the perception experiments proposed in Task 1.1.
Plans
The modelling of psychophysical results is ongoing work at Sheffield. Collaborative work between Sheffield and Keele, to study whether changes in the perception of vowel-nasal syllables are due to masking or to segregation, will be discussed.
Lab visits / new placements
A meeting between researchers from Keele and Sheffield is planned. The following placements were suggested:
Postdocs from Patras to Keele and from Keele to Grenoble
Students from Grenoble to Keele and to Bochum
Joint papers
Papers are being prepared by Meyer (Keele) and Berthommier (Grenoble) on AM maps and Schwartz, Berthommier (Grenoble) and Barker (Sheffield) on AVSA.
The psychoacoustical modelling approach to auditory scene analysis, introduced by Bregman (1990), helps to improve technical systems. The starting point of primitive auditory scene analysis is an internal time-frequency representation of the incoming signal mixture. From this representation different features, e.g. amplitude modulation, frequency modulation, common signal onset and offset, periodicity and spatial position features, can be derived and used to perform an auditory grouping of the time-frequency elements. The resulting groups are likely to correspond to the acoustical events of the sound-emitting sources.
It will be shown that this mechanism can be modelled by a grouping approach called "temporal fuzzy cluster analysis". The most important features for concurrent speech segregation, i.e. spatial position and fundamental frequency features, are selected and used by the model in a combined manner to perform a segmentation of auditory scenes.
A corpus of five English vowels at different fundamental frequencies and a set of head-related transfer functions are used to set up different virtual acoustical scenes, providing a test environment for the system. The model is tested in scenes that supply optimal information only for the spatial position feature extractor or only for the fundamental frequency feature extractor, respectively, and in scenes that contain both cues. The performance in enhancing the target speaker is assessed for these scenarios using only spatial position features, only fundamental frequency features, and a combination of both types of features. As expected, it turns out that the hybrid system, which relies on binaural and monaural features, performs well in all conditions.
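The "temporal fuzzy cluster analysis" algorithm itself is not given in this abstract; the sketch below only illustrates its generic clustering step, assigning time-frequency units soft memberships to sources from a combined (spatial position, F0) feature vector using standard fuzzy c-means. The unit features and parameters are hypothetical.

```python
import numpy as np

def fuzzy_c_means(features, n_sources=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: returns a soft membership of each
    time-frequency unit to each source. `features` is (units, dims),
    e.g. columns = (ITD in ms, scaled F0 estimate), suitably normalised."""
    x = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(n_sources), size=len(x))     # initial memberships
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ x) / w.sum(axis=0)[:, None]       # weighted prototypes
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        # u_ic = d_ic^-p / sum_j d_ij^-p  (standard FCM membership update)
        u = 1.0 / (d ** p * (1.0 / d ** p).sum(axis=1, keepdims=True))
    return u, centers

# Hypothetical units: (ITD ms, F0/100) for a left ~120 Hz talker
# and a right ~220 Hz talker.
units = np.array([[-0.4, 1.2], [-0.35, 1.25], [-0.45, 1.15],
                  [0.5, 2.2], [0.55, 2.15], [0.45, 2.25]])
u, centers = fuzzy_c_means(units)
print(u.round(2))        # soft source memberships per unit
print(centers.round(2))  # recovered (ITD, F0) source prototypes
```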
The first step in our studies was to use natural stimuli, such as speech and music, in order to get ideas for further experiments. Pilot experiments showed that the precedence effect (P.E.) depends significantly on the level of the signal. In the case of music stimuli, where auditory streams are involved, it seems that the P.E. operates separately on each stream and that a different threshold is formed for each of them.
Further psychoacoustical experiments, with broadband and passband noise as stimuli, showed that there is a dependence on the duration of the signal. The dependence on the rise time was not as strong as expected.
The human auditory system is capable of suppressing a large part of the room reflections, often utilizing the precedence effect. To quantify this suppression, an amplitude threshold, called the absolute perception threshold (APT), can be determined, below which a reflection is inaudible. The APT can be described as a 9-dimensional function whose parameters are the absolute acoustical power of the signal, the relative and the absolute incident direction (elevation and azimuth), the delay time, the number of reflections, the frequency, and the actual signal. This 9-parameter space can be transformed into a finite number of 3-dimensional spaces. A number of psychometric experiments were developed, using a virtual binaural environment, in order to assess the sensitivity of the APT to the above parameters. It was found that single reflections have to be separated into a very early part and a later part. This boundary was determined to be frequency dependent and was found to lie between 10 ms and 40 ms. From the tests, it was found that for the early delays no significant dependency of the APT on the direct signal power or the frequency exists. For the later delays, the APT decreases with increasing signal power and is also dependent on the frequency. The effects were found to be balanced out by the influence of realistic environments and realistic signals. In addition, it was proposed that the APT function might be further simplified by taking into account some psychoacoustical models.
Jean-Luc Schwartz talked about "Synchrony and reliability in multi-stream fusion", with examples and some new data from audiovisual (AV) and audio speech. He recalled that the prototypical model in AV speech perception, that is, a model based on separate identification of the A and V streams followed by a decision process, has two main characteristics. (1) It assumes independence of the two streams in the decision process (the temporal AV coordination is unexploited). (2) It considers that the weighting of the fusion process might be based on ambiguities at the output of each decision process (entropy). However, various considerations on AV speech show that both these assumptions are arguable. Data on AV temporal dependencies suggest that the temporal coordination of the A and V streams is likely to be important in the perception/recognition process. Data on the reliability of the A stream in AV speech perception show that the equation "ambiguous (high entropy) = unreliable" is likely to be wrong. Finally, various markers developed at ICP for the labelling of reliable vs. unreliable portions of the time-frequency representation were briefly presented. They rely on zero-crossing statistics or autocorrelation cues for detecting a dominant voiced source, and on binaural intercorrelation cues for detecting a dominant source at a separate location.
We show that the perceived nasal in synthetic vowel-nasal syllables can be influenced by the preceding vowel. A synthetic nasal which, in isolation, would be perceived as /m/ (formants at 250, 1000 and 2000 Hz) is perceived as /n/ if it is preceded by a vowel with a high F2 value and there are no transitions between vowel and nasal. A number of experiments were carried out with the aim of explaining this finding within the auditory scene analysis framework.
We show that formant continuity and proximity are consistent with our experimental data.
Within the context of continuous-density HMM speech recognition in noise, we report on imputation of missing time-frequency regions using emission state probability distributions. Criteria based on spectral subtraction and local signal-to-noise estimation are used to separate the present from the missing components. We consider two approaches to the problem of classification with missing data: marginalization and data imputation. A formalism for data imputation based on the probability distributions of individual hidden Markov model states is presented. We report on recognition experiments comparing state-based data imputation to marginalization in the context of connected digit recognition of speech mixed with factory noise at various global signal-to-noise ratios, and on wideband restoration of speech. A potential advantage of the approach is that it can be followed by conventional techniques for speech recognition, such as cepstral features or artificial neural networks.
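As a minimal sketch of the two operations for a single diagonal-Gaussian state distribution (not the formalism of the paper): marginalization sums the per-dimension log densities over the present components only, while state-based data imputation replaces the missing components with the state-conditional expected values, which for independent dimensions are simply the state means. All numbers below are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Per-dimension log density of independent Gaussians."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(x, mask, mean, var):
    """Marginalisation: integrate out the missing dimensions, i.e. sum
    the per-dimension log densities over the present ones only."""
    return log_gauss_diag(x, mean, var)[mask].sum()

def impute(x, mask, mean):
    """State-based imputation: replace missing dimensions with the
    state-conditional expectation (the state mean for diagonal Gaussians)."""
    y = x.copy()
    y[~mask] = mean[~mask]
    return y

# Hypothetical 4-channel spectral frame, two channels masked by noise.
x = np.array([2.0, 5.0, 9.9, 1.0])           # observed (noisy) frame
mask = np.array([True, True, False, False])  # True = reliable ("present")
state_mean = np.array([2.1, 4.8, 3.0, 0.5])
state_var = np.ones(4)

print(marginal_log_likelihood(x, mask, state_mean, state_var))
print(impute(x, mask, state_mean))           # -> [2.0, 5.0, 3.0, 0.5]
```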
This paper addresses the problem of speech recognition in the presence of additive noise. To deal with this problem, it is possible to estimate the noise characteristics using methods previously developed for speech enhancement techniques. Spectral subtraction can then be used to reduce the effect of additive noise on speech in the spectral domain. Some techniques have also recently been proposed for recognition with missing data. These approaches require an estimation of the local SNR to detect the speech spectral features which are relatively free from noise, so as to perform recognition on these parts only. In this article, we compare these two strategies, spectral subtraction and "missing data", on continuous speech additively disturbed by real noise. It is shown that missing data methods can improve recognition performance under certain noise conditions but still need to be improved in order to reach the performance of spectral subtraction.
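Neither implementation is given in the abstract; the sketch below shows generic versions of the two ingredients, assuming the noise spectrum can be estimated from leading speech-free frames: spectral subtraction with a flooring constant, and a local-SNR criterion producing the present/missing mask. Frame sizes, floors and thresholds are hypothetical.

```python
import numpy as np

def stft_mag(x, size=256, hop=128):
    """Magnitude spectrogram via framed FFT (Hann window)."""
    win = np.hanning(size)
    frames = [x[s:s + size] * win for s in range(0, len(x) - size + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def spectral_subtraction(mag, noise_frames=10, alpha=1.0, floor=0.05):
    """Subtract a noise spectrum estimated from the first frames
    (assumed speech-free), flooring the result to avoid negatives."""
    noise = mag[:noise_frames].mean(axis=0)
    clean = np.maximum(mag - alpha * noise, floor * mag)
    return clean, noise

def missing_data_mask(clean, noise, snr_threshold_db=0.0):
    """Local-SNR criterion: a time-frequency cell is 'present' when the
    estimated local SNR exceeds the threshold."""
    snr_db = 20 * np.log10((clean + 1e-12) / (noise + 1e-12))
    return snr_db > snr_threshold_db

# Toy signal: noise only, then a 500 Hz tone in noise, 8 kHz.
rng = np.random.default_rng(0)
fs, t = 8000, np.arange(8000) / 8000
x = 0.1 * rng.standard_normal(len(t))
x[2000:] += np.sin(2 * np.pi * 500 * t[2000:])
mag = stft_mag(x)
clean, noise = spectral_subtraction(mag)
mask = missing_data_mask(clean, noise)
print(mag.shape, mask.mean())   # fraction of cells judged reliable
```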
Multi-band systems are based on independent processing of sub-bands until some stage is reached at which the sub-band features or sub-band partial likelihoods are recombined into a single feature vector or likelihood score. In this way noise is prevented from spreading across the whole feature vector, so that some parts of it remain reliable. The main problems that must be addressed in the design of a multi-band system are feature extraction, the recombination strategy (feature recombination or likelihood recombination) and the decoding algorithm (only if likelihood recombination is not performed at the state level). If likelihood recombination is selected, the likelihoods can be recombined using the Full Combination approach or, assuming independent sub-bands, recombined in a linear or non-linear fashion. At DaimlerChrysler we have modified our LDA-feature front-end to handle multiple sub-bands, and we are presently training models using a feature recombination strategy, that is, an LDA feature vector made of concatenated sub-band LDA vectors. As a next step we will modify our pattern-matching block so as to use a likelihood recombination strategy at the state level. Extensive tests will then be undertaken to compare the two strategies. A final aim remains recombination at the phone or some other sub-word level, in which we will allow for asynchrony between sub-bands.
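A minimal sketch of the two recombination strategies named above, assuming per-band feature vectors and per-band log-likelihood scores are already computed (the DaimlerChrysler LDA front-end is not reproduced): feature recombination concatenates the sub-band vectors into one observation, while linear likelihood recombination weights the per-band scores under a sub-band independence assumption.

```python
import numpy as np

def feature_recombination(band_features):
    """Feature recombination: concatenate per-band feature vectors
    (e.g. per-band LDA vectors) into one observation vector."""
    return np.concatenate(band_features)

def linear_likelihood_recombination(band_loglikes, weights=None):
    """Likelihood recombination assuming independent sub-bands:
    weighted sum of per-band log-likelihoods per state/class."""
    ll = np.asarray(band_loglikes)            # shape: (bands, classes)
    if weights is None:
        weights = np.full(len(ll), 1.0 / len(ll))
    return weights @ ll                       # shape: (classes,)

# Hypothetical numbers: 3 sub-bands, 2 candidate classes.
feats = [np.array([0.1, -0.3]), np.array([1.2]), np.array([0.4, 0.0])]
print(feature_recombination(feats))           # one concatenated vector
scores = [[-4.0, -6.0], [-5.5, -5.0], [-3.9, -7.1]]
print(linear_likelihood_recombination(scores))
```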
The performance of most ASR systems degrades rapidly with data mismatch relative to the data used in training. Under many realistic noise conditions a significant proportion of the spectral representation of a speech signal, which is highly redundant, remains uncorrupted. In the "missing feature" approach to this problem, mismatching data is simply ignored, but the need to base recognition on unorthogonalised spectral features results in reduced performance on clean speech. In multi-band ASR the results from independent recognition on a number of within-band orthogonalised sub-bands are combined. This approach more accurately reflects the uncertainty in mismatch detection, but the loss of joint information due to independent sub-band processing can also result in reduced performance on clean speech. In this article the "full combination" approach to noise-robust ASR is presented, in which multiple data streams are associated not with individual sub-bands but with sub-band combinations. In this way no assumption of sub-band independence is required. Initial tests show some improved robustness to noise with no significant loss of performance on clean speech. Moreover, the weighting scheme used in the combination of sub-bands is essential to this approach. Equal weights and SNR-based weights have already been investigated and have shown good results. New weighting schemes are now under investigation, including weights based on the least-mean-square error as well as weights based on the expectation-maximization approach.
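The full combination rule can be sketched as a weighted sum of posteriors over all 2^B sub-band subsets, P(q|x) = sum over S of w_S * P(q|x_S), so that no independence between bands is assumed. In the sketch below the per-subset posteriors are random placeholders for real expert outputs, and equal weights stand in for the SNR- or EM-based schemes mentioned above.

```python
import itertools
import numpy as np

def full_combination(subset_posteriors, weights=None):
    """Full combination: P(q | x) = sum over sub-band subsets S of
    w_S * P(q | x_S). `subset_posteriors` maps each subset (a frozenset
    of band indices) to that expert's class posterior vector."""
    subsets = list(subset_posteriors)
    if weights is None:
        weights = {s: 1.0 / len(subsets) for s in subsets}  # equal weights
    post = sum(weights[s] * np.asarray(subset_posteriors[s]) for s in subsets)
    return post / post.sum()                                # renormalise

# Hypothetical experts for every subset of 2 bands (including the empty
# set, whose expert can only return the class prior).
bands = [0, 1]
all_subsets = [frozenset(c) for r in range(len(bands) + 1)
               for c in itertools.combinations(bands, r)]
posteriors = {s: np.random.default_rng(i).dirichlet(np.ones(3))
              for i, s in enumerate(all_subsets)}
print(full_combination(posteriors).round(3))
```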
The project is in good shape.
There is a Greek scientist in Bochum and a German one in Patras, namely Christos Tsakostas and Joerg Buchholz. Both are jointly supervised by Prof. Blauert (Bochum) and Prof. Mojoupoulos (Patras). Christos Tsakostas is working in the field of "The Precedence Effect and Auditory Stream Segregation". Joerg Buchholz is active in the field of "Binaural Room Masking and Exploitation of Early Reflections". Both themes are well placed in the work program of SPHEAR.
In Bochum, the following young scientists are working in related areas, thus providing a synergy effect:
Jonas Braasch, Thomas Djelani, Klaus Hartung, Mark Brueggen
At Patras the following young scientists add to the team:
Dionysos Tsoukalas, Thanasis Koutras
Joerg Buchholz has reported on his ideas and test software concerning the exploitation of early reflections in rooms to enhance speech quality.
Christos Tsakostas has reported on extensive psychoacoustic pilot experiments concerning the echo threshold in the precedence effect.
A newsgroup is planned to be set up for tasks 1.2 and 2.3. A joint conference publication is on the drawing board.
The main research at Patras and Bochum is centred around tasks 1.2 and 2.3 (see the memorandum on these tasks). This is reflected in the current employment and recruiting situation. Almost all resources of the partners will be allocated to these tasks in order to go into detail within these scientific areas. This results in a less detailed elaboration of tasks 3.4 and 3.5.
Currently no scientist paid by the network is working on tasks 3.4 and 3.5. There will be input from other projects in related areas running at Patras and Bochum.
Progress
Task 3.4: Patras developed the final version of the masking model algorithm (scheduled for month 7).
Task 3.5: No progress (scheduled for month 25).
Plans
Task 3.4: Patras will provide the masking model in the form of a program to be used by the other partners. This program will be submitted for possible inclusion in the CASA toolkit.
Task 3.5: Results from other projects will be made available. There will be input from Sheffield as well.
Publications: Jens Blauert contributes a recent overview of binaural models.
Visits: Joerg Buchholz will have a placement at IDIAP (about 3 months' duration). There will be further mutual exchanges between Patras and Bochum.
Look at the paper on "Parallel Model Composition and Missing Data" (Nokia), which seems interesting.
Investigate acoustic back-off as described by de Veth.
UPDATE: To accommodate the mid-term review the workshop dates have moved to 26th-28th February
Jordan Cohen gave some comments on his impressions of the workshop. He pointed out that nothing like it exists in the USA. He stated that he got a very positive picture from SPHEAR's extensive collaboration all over Europe, including a high number of research institutes connected across disciplines by the project. Jordan suggested creating a SPHEAR newsgroup. Sheffield agreed to look into this.
UPDATE: It's difficult to run a newsgroup from Sheffield because
of problems with the firewall. Other labs have been asked if they could
host it.