SPHEAR Workshop

Patras, 30th April and 1st May 1999

Meeting Record

INCOMPLETE

  •  Welcome by Prof. George Kokkinakis
  • Coordinator's Report
  • Progress Reports
  • Research Updates
  • Working Groups
  • Steering Committee Meeting
  • Report back & planning


  • Coordinator's Report

     

    Project Status

    RESPITE links

    SPHEAR Web Site

    Approval for travel outside EC


    Progress Reports

    Theme 1: Temporal Organisation and Auditory Events

    Task 1.1: Assessing the role of synchrony/desynchrony of spectral channels in speech perception.

    The work on temporal shifts due to take place at Keele has been delayed by about 6 months due to recruitment problems.

    Experiments on the perception of speech in noise combined with labial gestures are planned at Grenoble.

    Theme 2: Segregating Sound Sources

    Task 2.1: Spectral and spectro-temporal segregation with shared channels.

    At Keele it was found that the ShATR database was not suitable for the planned double-vowel segregation experiments so they have instead been successfully carried out using vowel pairs extracted from the TIMIT database.

    Task 2.2: Integrating grouping strategies across monaural cues.

    The role of continuity and proximity in the grouping of formants in vowel-nasal syllables has been investigated at Keele. This work is on time but the 18-month period may be extended if interesting results continue to be produced. It was suggested that the modelling work on cue integration might use the RESPITE Toolkit being developed in a parallel project.

    Task 2.4: Coupling of audio-visual cues.

    This task has benefited from a placement (Jon Barker) at Grenoble and substantial progress has been achieved. A further placement is required.

    Task 2.5: Conflict resolution in multi-stream systems and auditory attention

    The task is closely linked with task 2.2 (monaural integration).

    New developments

    Recent work by Darwin and co-workers on the integration of F0 and ITD suggests that F0 is used primarily for simultaneous grouping, while ITD information is used to label speaker positions after segregation has been performed.

    Progress

    Oscillator models are being optimised by Sheffield. Oscillator models were evaluated for speech segregation by F0 using modulation maps as the input representation at Keele.

    Additional work has been carried out to study F0, continuity and proximity as grouping cues from a psychophysical perspective at Keele.

    Plans

    Oscillators as labelling devices in modulation maps will be investigated in ITD/F0 grouping tasks by the Sheffield group. Non-speech signals overlapping speech signals will be investigated further at Keele.

    Lab visits / new placements

    Lab visits between Keele and Sheffield are planned for June '99.

    Task 2.6: Schema-based segregation by connectionist techniques.

    A placement at Grenoble is being sought.

    Theme 3: Speech recognition within Auditory Scenes.

    Task 3.3: Reliability estimation

    New developments

    none

    Progress

    Work in Grenoble in collaboration with IDIAP concentrated on the development of a reliability estimator based on zero-crossing statistics. This algorithm is currently being evaluated against an autocorrelation-based approach used by IDIAP. In addition, four established techniques were implemented at Sheffield.

    Plans

    The collaboration between IDIAP and Sheffield will continue and new work between Grenoble and Sheffield is planned for the next 12 months.

    Lab visits / new placements

    The possibility of placements outside the SPHEAR project and the posting of a member of the Sheffield team to Grenoble were discussed and will be taken further.

    Joint papers

    A joint Grenoble/IDIAP paper is in preparation.

    Task 3.6: Near Speech Recognition

    New developments

    none

    Progress

    Work at Sheffield concentrates on the effect of multi-channel filtering on the perception of speech sounds and on modelling these data using missing data recognition techniques. The proposal also includes modelling the recognition of temporally distorted speech, which will be part of the assignment of a post-doctoral researcher at Keele, once hired, to fit in with the perception experiments proposed in task 1.1.

    Plans

    The modelling of psychophysical results is ongoing work at Sheffield. Collaborative work between Sheffield and Keele to study whether changes in the perception of vowel-nasal syllables are due to masking or segregation will be discussed.

    Lab visits / new placements

    A meeting between researchers from Keele and Sheffield is planned. The following placements were suggested:

    Postdocs from Patras to Keele and from Keele to Grenoble

    Students from Grenoble to Keele and to Bochum

    Joint papers

    Papers are being prepared by Meyer (Keele) and Berthommier (Grenoble) on AM maps and Schwartz, Berthommier (Grenoble) and Barker (Sheffield) on AVSA.


    Research Updates

    Karsten Lehn

    Modelling The Cocktail-Party-Effect Using Multiple Cues of Computational Auditory Scene Analysis

    The psychoacoustical modelling approach of auditory scene analysis, introduced by Bregman (1990), helps to improve technical systems. The starting point of the primitive auditory scene analysis is an internal time-frequency representation of the incoming signal mixture. From this representation different features, e.g. amplitude modulation, frequency modulation, common signal onset and offset, periodicity and spatial position features, can be derived and used to perform an auditory grouping of the time-frequency elements. The resulting groups are likely to correspond to the acoustical events of the sound emitting sources.

    It will be shown that this mechanism can be modelled by a grouping approach called "temporal fuzzy cluster analysis". The most important features for concurrent speech segregation, i.e. spatial position and fundamental frequency features, are selected and used by the model in a combined manner for performing a segmentation of auditory scenes.

    A corpus of five English vowels at different fundamental frequencies and a set of head-related transfer functions are used to set up different virtual acoustical scenes providing a test environment for the system. The model is tested in scenes that supply optimal information only for the spatial position and fundamental frequency feature extractors respectively, and in scenes that contain both cues. The performance in enhancing the target speaker is assessed for these scenarios, using only spatial position features, only fundamental frequency features, and a combination of both types of features. As expected, it turns out that the hybrid system, which relies on binaural and monaural features, performs well in all conditions.

    Jens Blauert

    Psychoacoustical background of the precedence effect

    Christos Tsakostas

    Precedence effect's dependence on signal parameters

    The problems that speech recognition systems encounter nowadays depend mostly on room reverberation. Systematic studies of the precedence effect (P.E.) are necessary in order to understand how humans cope with reflections in reverberant environments. As is already known, the P.E. is signal dependent; our goal is to find out which parameter or combination of parameters of the signal allows it to operate.

    The first step in our studies was to use natural stimuli, such as speech and music, in order to get ideas for further experiments. Pilot experiments showed that the P.E. depends significantly on the level of the signal. In the case of music stimuli, where auditory streams are involved, it seems that the P.E. operates separately on each stream and a different threshold is formed for each one of them.

    Further psychoacoustical experiments with broadband and passband noise used as stimuli showed that there is a dependence on the duration of the signal. The dependence on the rise time was not as high as expected.

    Joerg Buchholz

    Evaluation and application of the Absolute Perception Threshold of room reflections

    The human auditory system is capable of suppressing a large part of the room reflections, often utilizing the precedence effect. Towards this goal, an amplitude threshold, called the absolute perception threshold (APT), can be determined, below which a reflection is inaudible. The APT can essentially be described as a 9-dimensional function with the absolute acoustical power of the signal, the relative and the absolute incident direction (elevation and azimuth), the delay time, the number of reflections, the frequency and the actual signal as parameters. This 9-parameter space can be transformed to a finite number of 3-dimensional spaces. A number of psychometric experiments were developed, using a virtual binaural environment, in order to assess the sensitivity of the APT to the above parameters. It was found that single reflections have to be separated into a very early part and a later part. The boundary between them was determined to be frequency dependent and was found to lie between 10 ms and 40 ms. From the test, it was found that for the early delays there is no significant dependence of the APT on the direct signal power or the frequency. For the later delays, the APT decreases with increasing signal power and is also dependent on the frequency. The effects were found to be balanced out by the influence of realistic environments and realistic signals. In addition, it was proposed that the APT function might be further simplified by taking into account some psychoacoustical models.

    Jean-Luc Schwartz

    Synchrony and reliability in multi-stream fusion: some data from audio and audiovisual speech

    Jean-Luc Schwartz talked about "Synchrony and reliability in multi-stream fusion", with examples and some new data coming from audiovisual (AV) and audio speech. He recalled that the prototypical model in AV speech perception, that is, a model based on separate identification of the A and the V streams followed by a decision process, has two main characteristics. (1) It assumes an independence of the two streams in the decision process (temporal AV coordination unexploited). (2) It considers that the weighting of the fusion process might be based on ambiguities at the output of each decision process (entropy). However, various considerations on AV speech show that both these assumptions are arguable. Data on AV temporal dependencies suggest that the temporal coordination of the A and V streams is likely to be important in the perception/recognition process. Data on reliability of the A stream in AV speech perception show that the "ambiguous (high entropy) = unreliable" equation is likely to be wrong. Finally, various markers developed at ICP for the labelling of reliable vs. unreliable portions of the time-frequency representation were briefly presented. They rely on zero-crossing statistics or autocorrelation cues for detecting a dominant voiced source, and binaural cross-correlation cues for detecting a dominant source in a separate location.
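    As an illustration of characteristic (2), the following is a minimal sketch of entropy-based stream weighting, not the model presented in the talk. The function names, the particular weighting rule (weight proportional to the entropy deficit of each stream) and the linear fusion are assumptions made only for this example.

```python
import math

def entropy(posterior):
    """Shannon entropy (in nats) of a class posterior distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def entropy_weights(post_a, post_v):
    """Hypothetical rule: a stream gets more weight the further its
    entropy lies below the maximum possible entropy log(n_classes)."""
    max_h = math.log(len(post_a))
    conf_a = max(0.0, max_h - entropy(post_a))
    conf_v = max(0.0, max_h - entropy(post_v))
    total = conf_a + conf_v
    if total == 0.0:
        # Both streams are maximally ambiguous: fall back to equal weights.
        return 0.5, 0.5
    return conf_a / total, conf_v / total

def fuse(post_a, post_v):
    """Linear fusion of audio and visual posteriors (an assumed choice)."""
    w_a, w_v = entropy_weights(post_a, post_v)
    return [w_a * a + w_v * v for a, v in zip(post_a, post_v)]

# A maximally ambiguous audio stream is effectively ignored.
fused = fuse([0.5, 0.5], [0.9, 0.1])
```

    This is exactly the "ambiguous = unreliable" assumption that the data reported in the talk call into question, so the sketch shows what is being criticised rather than a recommended scheme.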

    Jon Barker

    Estimation of speech acoustics from visual speech features.

    Bill Ainsworth

    Perception of concurrent approximant-vowel syllables

    Georg Meyer

    Formant continuity and proximity as a grouping cue in vowel-nasal syllables (with a bit of duplex perception thrown in)

    We show that the perceived nasal in synthetic vowel-nasal syllables can be influenced by the preceding vowel. A synthetic nasal which, in isolation, would be perceived as /m/ (formants at 250, 1000 and 2000 Hz) is perceived as /n/ if it is preceded by vowels with high F2 values and without transitions between vowel and nasal. A number of experiments were carried out with the aim of explaining this finding using the auditory scene analysis framework.

    We show that grouping by formant continuity and proximity is consistent with our experimental data.

    Jon Barker

    The RESPITE CASA toolkit

    Ljubomir Josifovski

    State based imputation of missing data for robust speech recognition and speech enhancement

    Within the context of continuous-density HMM speech recognition in noise, we report on imputation of missing time-frequency regions using emission state probability distributions. Criteria based on spectral subtraction and local signal-to-noise estimation are used to separate the present from the missing components. We consider two approaches to the problem of classification with missing data: marginalization and data imputation. A formalism for data imputation based on the probability distributions of individual hidden Markov model states is presented. We report on recognition experiments comparing state-based data imputation to marginalization in the context of connected digit recognition of speech mixed with factory noise at various global signal-to-noise ratios, and wideband restoration of speech. A potential advantage of the approach is that it can be followed by conventional techniques such as cepstral features or artificial neural networks for speech recognition.
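    The difference between marginalization and state-based imputation can be sketched as follows. This is a simplified illustration, assuming each state has a single Gaussian with diagonal covariance (real systems typically use mixtures); the function names are hypothetical and not taken from the work reported.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(frame, mask, mean, var):
    """Marginalization: score only the components marked as present.

    With a diagonal covariance, integrating out the missing components
    simply drops their terms from the sum.
    """
    return sum(log_gauss(x, m, v)
               for x, present, m, v in zip(frame, mask, mean, var)
               if present)

def impute_from_state(frame, mask, mean):
    """State-based imputation: keep present values and replace missing
    ones by the state mean, which is the conditional mean of a missing
    component under a diagonal Gaussian."""
    return [x if present else m for x, present, m in zip(frame, mask, mean)]

frame = [1.0, 9.0, 2.0]          # second component is swamped by noise
mask = [True, False, True]       # True = present, False = missing
mean = [1.0, 2.0, 3.0]
var = [1.0, 1.0, 1.0]

score = marginal_log_likelihood(frame, mask, mean, var)
restored = impute_from_state(frame, mask, mean)
```

    Marginalization yields only a score, whereas imputation yields a complete restored frame, which is what allows further processing such as a cepstral transform to be applied afterwards.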

    Ascencion Vizinho

    Noise Estimation and Missing Data Recognition

    Astrid Hagen (for Christopher Kermorvant)

    Some experiments of speech recognition in noise with noise reduction techniques

    This paper addresses the problem of speech recognition in the presence of additive noise. To deal with this problem, it is possible to estimate the noise characteristics using methods which have previously been developed for speech enhancement techniques. Spectral subtraction can then be used to reduce the effect of additive noise on speech in the spectral domain. Some techniques have also recently been proposed for recognition with missing data. These approaches require an estimation of the local SNR to detect the speech spectral features which are relatively free from noise so as to perform recognition on these parts only. In this article, we compare these two different strategies, spectral subtraction and "missing data", on continuous speech additively disturbed with real noise. It is shown that missing data methods can improve recognition performance under certain noise conditions but still need to be improved in order to reach the performance of spectral subtraction.
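    The two strategies compared above can be sketched on magnitude spectra as follows. This is only an illustration under stated assumptions: the noise estimate is a plain average over frames assumed to be noise-only, subtraction is done in the magnitude domain with a simple floor, and the 0 dB threshold and all function names are hypothetical choices rather than the settings used in the experiments.

```python
import math

def estimate_noise(noise_frames):
    """Average magnitude spectrum over frames assumed to contain noise only."""
    n = len(noise_frames)
    return [sum(column) / n for column in zip(*noise_frames)]

def spectral_subtract(frame, noise, floor=0.01):
    """Subtract the noise estimate per bin, flooring at a fraction of the input."""
    return [max(x - n, floor * x) for x, n in zip(frame, noise)]

def reliability_mask(frame, noise, snr_threshold_db=0.0):
    """Mark a bin as reliable when its estimated local SNR exceeds the threshold."""
    mask = []
    for x, n in zip(frame, noise):
        speech = max(x - n, 0.0)
        if n <= 0.0:
            mask.append(speech > 0.0)
        elif speech <= 0.0:
            mask.append(False)
        else:
            mask.append(20.0 * math.log10(speech / n) > snr_threshold_db)
    return mask

noise = estimate_noise([[1.0, 2.0], [3.0, 2.0]])
frame = [10.0, 2.5]
cleaned = spectral_subtract(frame, noise)   # enhanced frame for a standard recogniser
mask = reliability_mask(frame, noise)       # present/missing labels for missing data recognition
```

    Both strategies rely on the same noise estimate; they differ in whether the noisy bins are corrected (spectral subtraction) or excluded from recognition (missing data).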

    Martin Cooke

    Evidence and Counter Evidence in Recognition by Listeners & Machines

    Joan Mari

    LDA features for MultiBand models

    Multi-band systems are based on independent processing of sub-bands until some stage is reached at which the sub-band features or sub-band partial likelihoods are recombined into one single feature or likelihood score. This avoids spreading the noise across the whole feature vector, so that some parts of the feature vector remain reliable. The main problems that must be addressed during the design of a multi-band system are feature extraction, recombination strategy (whether feature recombination or likelihood recombination) and decoding algorithm (only if likelihood recombination is not at the state level). If likelihood recombination is selected, then likelihoods can be recombined using the Full Combination approach, or else by assuming independent sub-bands and recombining them in a linear or non-linear fashion. At DaimlerChrysler we have modified our LDA-feature front-end in order to handle multiple sub-bands, and we are presently training models using a feature recombination strategy, that is, an LDA feature vector made of concatenated sub-band LDA vectors. As a next step we will modify our pattern-matching block so as to use a likelihood recombination strategy at the state level. Then, extensive tests will be undertaken so as to compare both strategies. A final aim remains recombination at phone or any sub-word level, in which we will allow for asynchrony between sub-bands.
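    The recombination strategies named above can be sketched for a single frame and a single state as follows. This is a minimal illustration, not the DaimlerChrysler implementation; the function names and the optional weights are assumptions.

```python
import math

def recombine_features(subband_vectors):
    """Feature recombination: concatenate the sub-band feature vectors
    into one vector that is scored by a single model."""
    return [value for vector in subband_vectors for value in vector]

def recombine_log_likelihoods(log_likes, weights=None):
    """Likelihood recombination assuming independent sub-bands: a
    (weighted) sum of log-likelihoods, i.e. a product of likelihoods."""
    if weights is None:
        weights = [1.0] * len(log_likes)
    return sum(w * ll for w, ll in zip(weights, log_likes))

def recombine_linear(likes, weights):
    """Linear likelihood recombination: a weighted sum of likelihoods."""
    return sum(w * l for w, l in zip(weights, likes))

features = recombine_features([[1.0, 2.0], [3.0]])
product_score = recombine_log_likelihoods([-1.0, -2.0])
linear_score = recombine_linear([0.2, 0.4], [0.5, 0.5])
```

    With likelihood recombination, a noisy sub-band can be down-weighted without retraining, whereas feature recombination retains the joint information between sub-bands.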

    Astrid Hagen

    Different weighting schemes in the Full-Combination Approach

    The performance of most ASR systems degrades rapidly with data mismatch relative to the data used in training. Under many realistic noise conditions a significant proportion of the spectral representation of a speech signal, which is highly redundant, remains uncorrupted. In the "missing feature" approach to this problem mismatching data is simply ignored, but the need to base recognition on unorthogonalised spectral features results in reduced performance in clean speech. In multiband ASR the results from independent recognition on a number of within-band orthogonalised sub-bands are combined. This approach more accurately reflects the uncertainty in mismatch detection, but loss of joint information due to independent sub-band processing can also result in reduced performance with clean speech. In this article the "full combination" approach to noise robust ASR is presented, in which multiple data streams are associated not with individual sub-bands but with sub-band combinations. In this way no assumption of sub-band independence is required. Initial tests show some improved robustness to noise with no significant loss of performance with clean speech. The weighting scheme used in the combination of sub-bands is also essential to this approach. Equal weights and SNR-based weights have already been investigated and have shown good results. New weighting schemes are now under investigation, including Least-Mean-Square Error based weights as well as weights based on the Expectation-Maximization approach.
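    The combination step can be sketched as a weighted sum of class posteriors over all sub-band combinations. This is a simplified illustration using equal weights by default; `posterior_for` is a hypothetical callable standing in for an expert trained on a given set of sub-bands (for the empty set it might, for instance, return the class priors).

```python
from itertools import combinations

def subband_combinations(n_bands):
    """All 2**n_bands subsets of sub-band indices, including the empty set."""
    return [combo for size in range(n_bands + 1)
            for combo in combinations(range(n_bands), size)]

def full_combination(posterior_for, n_bands, weights=None):
    """Weighted sum of the class posteriors from every sub-band combination.

    The weights are expected to sum to one; equal weights are used if
    none are given.
    """
    combos = subband_combinations(n_bands)
    if weights is None:
        weights = [1.0 / len(combos)] * len(combos)
    combined = None
    for combo, weight in zip(combos, weights):
        posterior = posterior_for(combo)
        if combined is None:
            combined = [0.0] * len(posterior)
        combined = [c + weight * p for c, p in zip(combined, posterior)]
    return combined

def toy_expert(combo):
    """Toy stand-in: uninformative without data, confident otherwise."""
    return [0.5, 0.5] if not combo else [0.8, 0.2]

combined = full_combination(toy_expert, n_bands=2)
```

    The number of combinations grows as 2 to the power of the number of sub-bands, so a scheme of this kind is only practical for a small number of sub-bands unless the experts are approximated.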


    Working Groups

    Working Groups were asked to consider

    Tasks 1.2 and 2.3

    Chair: Jens Blauert

    The project is in good shape.

    There is a Greek scientist in Bochum and a German one in Patras, namely Christos Tsakostas and Joerg Buchholz. Both are jointly supervised by Prof. Blauert, Bochum and Prof. Mourjopoulos, Patras. Christos Tsakostas is working in the field of "The Precedence Effect and Auditory Stream Segregation". Joerg Buchholz is active in the field of "Binaural Room Masking and Exploitation of Early Reflections". Both themes are well placed in the work program of SPHEAR.

    In Bochum, the following young scientists are working in related areas, thus providing a synergy effect:

    Jonas Braasch, Thomas Djelani, Klaus Hartung, Mark Brueggen

    At Patras the following young scientists add to the team:

    Dionysos Tsoukalas, Thanasis Koutras

    Joerg Buchholz has reported on his ideas and test software concerning the exploitation of early reflections in rooms to enhance speech quality.

    Christos Tsakostas has reported on extensive psychoacoustic pilot experiments concerning the echo threshold in the precedence effect.

    A newsgroup is planned for tasks 1.2 & 2.3. A joint conference publication is on the drawing board.

    Tasks 1.1, 2.1, 2.4, 2.6

    Chair: Bill Ainsworth

    Tasks 2.5, 3.3, 3.6

    Chair: Georg Meyer

    Tasks 3.4, 3.5

    Chair: Karsten Lehn

    The main research at Patras and Bochum is centred around tasks 1.2 and 2.3 (see memorandum of these tasks). This is reflected by the current employment and recruiting situation. Almost all resources of the partners will be allocated to these tasks in order to go into detail within these scientific areas. This results in a less detailed elaboration of tasks 3.4 and 3.5.

    Currently no scientist paid by the network is working on tasks 3.4 and 3.5. There will be input from other projects on related areas running in Patras and Bochum.

    Progress

    Task 3.4: Patras developed the final version of the masking model algorithm (scheduled for month 7).

    Task 3.5: No progress (scheduled for month 25).

    Plans

    Task 3.4: Patras provides the masking model in the form of a program to be used by the other partners. This program will be submitted for possible inclusion in the CASA toolkit.

    Task 3.5: Results from other projects will be made available. There will be input from Sheffield as well.

    Publications: Jens Blauert contributes a recent overview on binaural models.

    Visits: Joerg Buchholz will have a placement at IDIAP (about 3 months' duration). There will be further mutual exchanges between Patras and Bochum.

    Tasks 3.1, 3.2, 4.x

    Martin Cooke (Chair), Ascencion Vizinho, Ljubomir Josifovski, Joan Mari, Astrid Hagen

    Joint Publications

    IDIAP/Sheffield: could include a comparison between Shef data imputation and marginalization missing data approaches and IDIAP spectral subtraction and/or missing data as well as the Full-Combination approach.

    Common Database

    For this, though, a common database would be needed. As a common standard for the experiments is also required, it was proposed to use the "Aurora standard", which is also used in the RESPITE project. For IDIAP (Chris and Astrid) this would mean including the TIDigit database in their experiments. Both Shef and IDIAP have to make greater use of Daimler's car noise in the experiments. OR would it be easier if Shef did multi-stream experiments and IDIAP did experiments aimed in their direction, instead of changing databases? For testing in the cellular phone environment, the options discussed were either using the IDIAP/RESPITE cellular phone database or waiting for a corresponding database from Daimler.

    Exchange of Ph.D. Students among Institutes

    Joerg from Bochum, who is doing his first year at Patras, would be interested in coming to IDIAP for some time (in half a year or so). If this were for less than 3 months, the paying strategy might not need to be changed. Either Christopher or Astrid could go to Shef, e.g. to work on a joint publication.

    Other Work

    The "TRAPS" approach by Hermansky, using time- and frequency-based features, might be interesting to look at (ICSLP 99).

    Look at the paper about "Parallel Model Composition and Missing Data" (Nokia), which seems interesting.

    Investigate acoustic back-off as described by de Veth


    Plenary Session

    The working groups reported their work and gave an overview of the planned work for the upcoming year. The workshop ended with some organisational topics including next year's SPHEAR meeting. It was decided to hold the SPHEAR meeting 2000 at Bochum, probably at the beginning of March.

    UPDATE: To accommodate the mid-term review the workshop dates have moved to 26th-28th February

    Jordan Cohen gave some comments on his impressions of the workshop. He pointed out that nothing like it exists in the USA. He stated that he got a very positive picture of SPHEAR's extensive collaboration all over Europe, which includes a large number of research institutes connected across disciplines by the project. Jordan suggested creating a SPHEAR newsgroup. Sheffield agreed to look into this.

    UPDATE: It's difficult to run a newsgroup from Sheffield because of problems with the firewall. Other labs have been asked if they could host it.