Research

Current and recent research projects

TAPAS (2017 - 2021) Training Network on Automatic Processing of PAthological Speech

TAPAS is an H2020 Marie Curie Initial Training Network that will provide research opportunities for 15 PhD students (Early Stage Researchers) to study automatic processing of pathological speech. The network consists of 12 European research institutes and 9 associated partners.

The TAPAS work programme targets three key research problems:

  • Detection: We will develop speech processing techniques for early detection of conditions that impact on speech production. The outcomes will be cheap and non-invasive diagnostic tools that provide early warning of the onset of progressive conditions such as Alzheimer’s and Parkinson’s.
  • Therapy: We will use newly-emerging speech processing techniques to produce automated speech therapy tools. These tools will make therapy more accessible and more individually targeted. Better therapy can increase the chances of recovering intelligible speech after traumatic events such as a stroke or oral surgery.
  • Assisted Living: We will re-design current speech technology so that it works well for people with speech impairments. People with speech impairments often have other co-occurring conditions, making them reliant on carers. Speech-driven tools for assisted living are a way to allow such people to live more independently.

The TAPAS consortium includes clinical practitioners, academic researchers and industrial partners, with expertise spanning speech engineering, linguistics and clinical science. This rich network will train a new generation of 15 researchers, equipping them with the skills and resources necessary for lasting success.

DeepArt (2017 - 2018) Deep learning of articulatory-based representations of dysarthric speech

DeepArt is a Google Faculty Award project that is targeting dysarthria, a particular form of disordered speech arising from poor motor control and a resulting lack of coordination of the articulatory system. At Sheffield, we have demonstrated that state-of-the-art training techniques developed for mainstream HMM/DNN speech recognition can raise baseline performance for dysarthric speech recognition.

The DeepArt project will aim to advance the state of the art by conducting research in three key areas:

  • articulatory-based representations;
  • use of synthetic training data; and
  • novel approaches to DNN-based speaker adaptive training.

AV-COGHEAR (2015 - 2018) Towards visually-driven speech enhancement for cognitively-inspired multi-modal hearing-aid devices

AV-COGHEAR is an EPSRC-funded project that is being conducted in collaboration with the University of Stirling. Current commercial hearing aids use a number of sophisticated enhancement techniques to try to improve the quality of speech signals. However, today’s best aids fail to work well in many everyday situations. In particular, they fail in busy social situations where there are many competing speech sources; they fail if the speaker is too far from the listener and the speech is swamped by noise. We have identified an opportunity to solve this problem by building hearing aids that can ‘see’.

AV-COGHEAR aims to develop a new generation of hearing aid technology that extracts speech from noise by using a camera to see what the talker is saying. The wearer of the device will be able to focus their hearing on a target talker and the device will filter out competing sound. This ability, which is beyond that of current technology, has the potential to improve the quality of life of the millions suffering from hearing loss (over 10m in the UK alone).

The project is bringing together researchers with the complementary expertise necessary to make the audio-visual hearing-aid possible. The project combines contrasting approaches to audio-visual speech enhancement that have been developed by the Cognitive Computing group at Stirling and the Speech and Hearing Group at Sheffield. The Stirling approach uses the visual signal to filter out noise, whereas the Sheffield approach uses the visual signal to fill in ‘gaps’ in the speech. The MRC Institute of Hearing Research (IHR) will provide the expertise needed to evaluate the approach on real hearing loss sufferers. Phonak AG, a leading international hearing aid manufacturer, is providing the advice and guidance necessary to maximise the potential for industrial impact.
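
The contrast between the two approaches can be illustrated with a deliberately simplified sketch (below). The function names, the mask threshold and the idea of a ready-made visually-derived time-frequency mask are assumptions for illustration only; neither group’s actual algorithm is reproduced here, and estimating such a mask from the video is the hard part that the sketch omits.

    # Illustrative only: two ways of using a (hypothetical) visually-derived
    # time-frequency mask, with values in [0, 1], aligned with a noisy spectrogram.
    import numpy as np

    def enhance_by_filtering(noisy_spec, visual_mask):
        """'Filter out noise' style: attenuate regions that the visual
        evidence marks as noise-dominated."""
        return noisy_spec * visual_mask

    def enhance_by_filling(noisy_spec, visual_mask, clean_speech_estimate):
        """'Fill in the gaps' style: keep regions judged reliable and replace
        the rest with an estimate drawn from a model of clean speech (here a
        crude stand-in supplied by the caller)."""
        reliable = visual_mask > 0.5
        return np.where(reliable, noisy_spec, clean_speech_estimate)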

INSPIRE (2012 - 2016) Investigating Speech in Real Environments

INSPIRE is an FP7 Marie Curie Initial Training Network that will provide research opportunities for 13 PhD students (Early Stage Researchers) and 3 postdocs (Experienced Researchers) to study speech communication in real-world conditions. The network consists of 10 European research institutes and 7 associated partners (5 businesses and 2 academic hospitals). The senior researchers in the network are academics in computer science, engineering, psychology, linguistics, hearing science, as well as R&D scientists from leading businesses in acoustics and hearing instruments, and ENT specialists. The scientific goal of INSPIRE is to better understand how people recognise speech in real life under a wide range of conditions that are “non-optimal” relative to the controlled conditions in laboratory experiments, e.g., speech in noise, speech recognition under divided attention.

CHiME (2009 - 2012) Computational Hearing in Multisource Environments

CHiME is an EPSRC-funded project that aims to develop a framework for computational hearing in multisource environments. The approach operates by exploiting two levels of processing that combine to simultaneously separate and interpret sound sources (J. Barker, N. Ma, A. Coy, & M. Cooke, 2010). The first processing level exploits the continuity of sound source properties to clump the acoustic mixture into fragments of energy belonging to individual sources. The second processing level uses statistical models of specific sound sources to separate fragments belonging to the acoustic foreground (i.e. the ‘attended’ source) from fragments belonging to the background.
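
As a rough, purely illustrative sketch of the first processing level, the snippet below clumps above-threshold energy in a spectrogram into fragments using simple connected-component labelling. The function name, the fixed threshold and the scipy dependency are assumptions made for the sketch; the project’s own fragment generation exploits richer source-continuity cues than a bare energy threshold.

    # Level one, sketched: clump a time-frequency representation into
    # fragments of locally dominant energy. This is not the CHiME system itself.
    from scipy.ndimage import label

    def make_fragments(spectrogram_db, threshold_db=-40.0):
        """Label connected regions of above-threshold energy as fragments.

        spectrogram_db : 2-D array (frequency x time) of log energies.
        Returns (fragments, n), where fragments has the same shape and
        contains 0 for background and 1..n for fragment identities.
        """
        active = spectrogram_db > threshold_db
        fragments, n = label(active)  # 4-connected regions by default
        return fragments, n

    # Level two would then search over foreground/background labellings of
    # these fragments, scoring each candidate foreground with statistical
    # models of the attended source (e.g. missing-data speech models).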

The project will investigate and develop key aspects of the proposed two-level hearing framework:

  • statistical tracking models to represent sound source continuity;
  • approaches for combining statistical models of foreground and background sound sources;
  • approximate search techniques for decoding acoustic scenes in real time; and
  • strategies for learning sound source models directly from noisy audio data.

CHiME will build a demonstration system simulating a speech-driven home-automation application operating in a noisy domestic environment.

References

  1. J. Barker, N. Ma, A. Coy, & M. Cooke. (2010). Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Computer Speech and Language, 24(1), 94–111. 10.1016/j.csl.2008.05.003 [PDF]

Earlier research projects

POP (2006 - 2009) Perception on Purpose

POP was a three-year EC FP6 Specific Targeted Research project that combined auditory scene analysis and vision on robotic platforms. A key achievement in the audio processing was the combination of binaural source localisation techniques (S. Harding, J. Barker, & G. J. Brown, 2006) with a spectro-temporal fragment-based sound source separation component to produce a robust sound source localisation implementation suitable for real-time audio motor control (H. Christensen, N. Ma, S. N. Wrigley, & J. Barker, 2009). We also spent some time on the tricky problem of using acoustic location cues when the ears that are generating the estimates are themselves moving along unpredictable and possibly unknown trajectories (H. Christensen & J. Barker, 2009).
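
The elementary binaural cue behind this work can be sketched as a cross-correlation estimate of the interaural time difference (ITD). The sketch below is a minimal version with illustrative names and parameters; the robustness in the project came from restricting such estimates to spectro-temporal fragments dominated by a single source rather than applying them to whole signals.

    # Minimal sketch: estimate the interaural time difference (ITD) between
    # left- and right-ear signals by maximising their cross-correlation.
    import numpy as np

    def estimate_itd(left, right, fs, max_itd_s=1e-3):
        """Return the lag (in seconds) of the right signal relative to the
        left that maximises their correlation, searched over +/- max_itd_s."""
        max_lag = int(max_itd_s * fs)
        lags = np.arange(-max_lag, max_lag + 1)
        centre_left = left[max_lag:len(left) - max_lag]
        scores = [np.dot(centre_left,
                         right[max_lag + lag:len(right) - max_lag + lag])
                  for lag in lags]
        return lags[int(np.argmax(scores))] / fs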

These demos show an early prototype sound-localising robot called Poppy and a custom-made audio-visual robot called Popeye.

The project also constructed a small corpus of synchronised stereoscopic and binaural recordings (E. Arnaud et al., 2008) called CAVA, which is freely available for download.

References

  1. H. Christensen, & J. Barker. (2009). Using location cues to track speaker changes from mobile, binaural microphones. In Proceedings of the 10th Annual Conference of the International Speech Communication Association (Interspeech 2009). Brighton, UK.
  2. H. Christensen, N. Ma, S. N. Wrigley, & J. Barker. (2009). A speech fragment approach to localising multiple speakers in reverberant environments. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4593–4596). Taipei, Taiwan: IEEE. 10.1109/ICASSP.2009.4960653 [PDF]
  3. E. Arnaud, H. Christensen, Y-C. Lu, J. Barker, V. Khalidov, M. Hansard, … R. Horaud. (2008). The CAVA Corpus: Synchronised Stereoscopic and Binaural Datasets with Head Movements. In ICMI ’08 Proceedings of the 10th international conference on Multimodal interfaces (pp. 109–116). Crete, Greece. 10.1145/1452392.1452414 [PDF]
  4. S. Harding, J. Barker, & G. J. Brown. (2006). Mask estimation for missing data speech recognition based on statistics of binaural interaction. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 58–67. 10.1109/TSA.2005.860354 [PDF]

AVASR (2004 - 2007) Audio visual speech recognition in the presence of multiple speakers

This was an EPSRC project which looked at audio-visual speech recognition in ‘cocktail party’ conditions – i.e. when there are several people speaking simultaneously. The work first showed that standard multistream AVASR approaches are not appropriate in these conditions (X. Shao & J. P. Barker, 2008). The project then developed an audio-visual extension of the speech fragment decoding approach (Barker & Shao, 2009) that, like humans, is able to exploit the visual signal not only for its phonetic content but also in its role as a cue for acoustic source separation. The latter role is also observed in human audio-visual speech processing, where the visual speech input can produce an ‘informational masking release’ leading to increased intelligibility even in conditions where the visual signal provides little or no useful phonetic content.
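
For reference, the standard multistream combination that the project found wanting can be sketched as a fixed weighting of per-stream log-likelihoods (the names and the example weight below are illustrative). A single global weight says nothing about which parts of the acoustics actually belong to the target talker, which is the gap the fragment-decoding extension fills by treating the visual stream as a separation cue as well.

    # Conventional multistream AVASR combination, sketched: per-state
    # log-likelihoods from the audio and visual streams are combined with a
    # stream weight lam in [0, 1]. Illustrative, not the papers' exact form.
    import numpy as np

    def multistream_loglik(loglik_audio, loglik_visual, lam=0.7):
        """log p(o_a, o_v | q) ~ lam * log p(o_a | q) + (1 - lam) * log p(o_v | q)."""
        return lam * np.asarray(loglik_audio) + (1.0 - lam) * np.asarray(loglik_visual)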

The project also partially funded the collection of the AV Grid corpus (M. Cooke, J. Barker, S. Cunningham, & X. Shao, 2006), which is available for download.

Demos of a face marker tracking tool (Barker, 2005) that was built at the start of the project can be found here.

References

  1. Barker, J., & Shao, X. (2009). Energetic and informational masking effects in an audio-visual speech recognition system. IEEE Transactions on Audio, Speech and Language Processing, 17(3), 446–458. 10.1109/TASL.2008.2011534 [PDF]
  2. X. Shao, & J. P. Barker. (2008). Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment. Speech Communication, 50(4), 337–353. 10.1016/j.specom.2007.11.002 [PDF]
  3. M. Cooke, J. Barker, S. Cunningham, & X. Shao. (2006). An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 120(5), 2421–2424. 10.1121/1.2229005 [PDF]
  4. Barker, J. (2005). Tracking Facial Markers with an Adaptive Marker Collocation Model. In Proceedings of the 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 665–668). Philadelphia, PA: IEEE. 10.1109/ICASSP.2005.1415492 [PDF]

Multisource (2002 - 2005) Multisource decoding for speech in the presence of other sound sources

This was an EPSRC funded project that aimed “to generalise Automatic Speech Recognition decoding algorithms for natural listening conditions, where the speech to be recognised is one of many sound sources which change unpredictably in space and time”. During this project we continued the development of the Speech Fragment Decoding approach (that was begun towards the end of the RESPITE project) leading to a theoretical framework published in (J. Barker, M. P. Cooke, & D. P. W. Ellis, 2005). Also during this time we experimented with applications of the missing data approach to binaural conditions (S. Harding, J. Barker, & G. J. Brown, 2006) and as a technique for handling reverberation (K. J. Palomäki, G. J. Brown, & J. Barker, 2004).

References

  1. S. Harding, J. Barker, & G. J. Brown. (2006). Mask estimation for missing data speech recognition based on statistics of binaural interaction. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 58–67. 10.1109/TSA.2005.860354 [PDF]
  2. J. Barker, M. P. Cooke, & D. P. W. Ellis. (2005). Decoding speech in the presence of other sources. Speech Communication, 45(1), 5–25. 10.1016/j.specom.2004.05.002 [PDF]
  3. K. J. Palomäki, G. J. Brown, & J. Barker. (2004). Techniques for handling convolutional distortion with ‘missing data’ automatic speech recognition. Speech Communication, 43(1–2), 123–142. 10.1016/j.specom.2004.02.005 [PDF]

RESPITE (1999 - 2002) Recognition of Speech by Partial Information TEchniques

Before taking up a lectureship I spent three years as a postdoc working on the EC ESPRIT-funded RESPITE project. The project focused on researching and developing new methodologies for robust Automatic Speech Recognition based on missing-data theory and multiple classification streams. During the project soft missing data techniques were developed (J. Barker, L. Josifovski, M. P. Cooke, & P. D. Green, 2000) and competitively evaluated on the Aurora speech recognition task (J. Barker, M. Cooke, & P. Green, 2001). At the same time, and in collaboration with Martin Cooke and Dan Ellis, the initial ideas for what became Speech Fragment Decoding were formulated (J. Barker, M. P. Cooke, & D. P. W. Ellis, 2000). A separate collaboration with Andrew Morris and Hervé Bourlard led to a generalisation of the missing data approach (‘soft data modelling’) that is closely related to what is now known as ‘uncertainty decoding’ (A. C. Morris, J. Barker, & H. Bourlard, 2001).
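
The ‘soft’ idea can be sketched for a single diagonal-covariance Gaussian over spectral channels, as below. The bounded-marginal form and all of the names are illustrative simplifications of the published techniques, not a reproduction of them.

    # Soft missing-data likelihood, sketched. With a hard mask each channel is
    # either present (use the Gaussian density) or missing (marginalise over
    # the clean value, bounded above by the observed energy); the soft version
    # weights the two terms by a per-channel reliability in [0, 1].
    import numpy as np
    from scipy.stats import norm

    def soft_missing_data_loglik(x, reliability, mean, std):
        """x, reliability, mean, std: 1-D arrays over spectral channels."""
        present = norm.pdf(x, loc=mean, scale=std)
        # Bounded marginal: the clean energy is assumed to lie in [0, x].
        missing = (norm.cdf(x, loc=mean, scale=std)
                   - norm.cdf(0.0, loc=mean, scale=std)) / np.maximum(x, 1e-10)
        per_channel = reliability * present + (1.0 - reliability) * missing
        return float(np.sum(np.log(np.maximum(per_channel, 1e-300))))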

It was also during the RESPITE project that the CASA Toolkit (CTK) was developed. CTK aimed to provide a flexible and extensible software framework for the development and testing of Computational Auditory Scene Analysis (CASA) systems. The toolkit allowed auditory-based signal processing front-ends to be developed using a graphical interface (somewhat similar to Simulink). The toolkit also contained implementations of the various missing data speech recognition algorithms that have been developed at Sheffield. The front-end processing code has largely been made redundant by MATLAB; however, we still use the CTK missing data and speech fragment speech recognition code. The code is no longer supported but can be downloaded from here.

References

  1. J. Barker, M. Cooke, & P. Green. (2001). Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise. In Proceedings of the 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Eurospeech 2001 (pp. 213–216). Aalborg, Denmark. [PDF]
  2. A. C. Morris, J. Barker, & H. Bourlard. (2001). From Missing Data to Maybe Useful Data: Soft Data Modelling for Noise Robust ASR. In Proceedings of the Workshop on Innovation in Speech Processing (WISP 2001). Stratford-upon-Avon, UK. [PDF]
  3. J. Barker, M. P. Cooke, & D. P. W. Ellis. (2000). Decoding speech in the presence of other sound sources. In Proceedings of the International Conference on Spoken Language Processing. Beijing, China. [PDF]
  4. J. Barker, L. Josifovski, M. P. Cooke, & P. D. Green. (2000). Soft decisions in missing data techniques for robust automatic speech recognition. In Proceedings of the 6th International Conference on Spoken Language Processing (Interspeech 2000). Beijing, China. [PDF]

SPHEAR (1998 - 1999) Speech Hearing and Recognition

Prior to RESPITE, I spent a year at ICP in Grenoble (now known as Gipsa-Lab) as a Postdoc on SPHEAR, an EC Training and Mobility of Researchers network. The twin goals of the network were to achieve better understanding of auditory information processing and to deploy this understanding in automatic speech recognition for adverse conditions. During the year I worked with Frédéric Berthommier and Jean-Luc Schwartz studying the relation between audio and visual aspects of the speech signal (Barker & Berthommier, 1999; Barker & Berthommier, 1999; Barker, Berthommier, & Schwartz, 1998).

References

  1. Barker, J. P., & Berthommier, F. (1999). Estimation of speech acoustics from visual speech features: A comparison of linear and non-linear models. In Proceedings of the ISCA Workshop on Auditory-Visual Speech Processing (AVSP) ’99. University of California, Santa Cruz. [PDF]
  2. Barker, J. P., & Berthommier, F. (1999). Evidence of correlation between acoustic and visual features of speech. In Proc. ICPhS ’99. San Francisco. [PDF]
  3. Barker, J. P., Berthommier, F., & Schwartz, J. L. (1998). Is primitive AV coherence an aid to segment the scene? In Proceedings of the ISCA Workshop on Auditory-Visual Speech Processing (AVSP) ’98. Sydney, Australia. [PDF]

SPRACH (1997 - 1998) Speech Recognition Algorithms for Connectionist Hybrids

SPRACH was an ESPRIT Long Term Research Programme project running from 1995 to 1998, on which I was employed for a brief six-month stint while completing my PhD thesis. I had some fun doing some audio segmentation work with Steve Renals (then at Sheffield). The SPRACH project was performing speech recognition on radio broadcasts using what was then called a ‘hybrid MLP/HMM’ recogniser, i.e. an MLP is used to estimate phone posteriors which are then converted into likelihoods and decoded using an HMM in the usual manner. The audio segmentation work attempted to use features derived from the phone posteriors to segment the audio into regions that would be worth decoding (i.e. likely to give good ASR results) and regions that would not (i.e. either non-speech or very noisy speech regions) (Barker, Williams, & Renals, 1998).
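
The two ingredients can be sketched as follows, with illustrative names: the hybrid conversion of MLP phone posteriors into scaled likelihoods for HMM decoding, and a simple posterior-based per-frame confidence of the kind used to mark regions worth decoding. The actual confidence measures in (Barker, Williams, & Renals, 1998) are more refined than a raw posterior entropy.

    import numpy as np

    def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-10):
        """Hybrid MLP/HMM conversion: p(x|q) is proportional to
        P(q|x) / P(q) (Bayes' rule with the constant p(x) dropped)."""
        return posteriors / np.maximum(priors, eps)

    def frame_confidence(posteriors, eps=1e-10):
        """Negative per-frame entropy of the phone posteriors: high when one
        phone clearly dominates (likely clean speech), low when the posterior
        is flat (likely non-speech or heavily corrupted speech)."""
        p = np.clip(posteriors, eps, 1.0)
        return np.sum(p * np.log(p), axis=-1)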

References

  1. Barker, J. P., Williams, G., & Renals, S. (1998). Acoustic confidence measures for segmenting broadcast news. In Proc. ICSLP ’98. Sydney, Australia. [PDF]

PhD Thesis (1994 - 1997) Auditory organisation and speech perception

My thesis work (Barker, 1998; Barker & Cooke, 1997), supervised by Martin Cooke, was inspired by a then recently published paper (Remez et al., 1994) which employed experiments using a particular synthetic analogue of natural speech, known as ‘sine wave speech’ (SWS), to apparently invalidate the auditory scene analysis (ASA) account of perception – at least, in as far as it showed that ASA did not seem to account for the perceptual organisation of speech signals. This was a big deal at the time because it raised doubt about whether computational models of auditory scene analysis (CASA) were worth pursuing as a technology for robust speech processing. The thesis confirmed Remez’s observation that listeners can be prompted to hear SWS utterances as coherent speech percepts despite SWS seemingly lacking the acoustic ‘grouping’ cues that were supposedly essential for coherency under the ASA account. However, the thesis went on to demonstrate that the coherency of the sine wave speech percept is fragile – e.g. listeners are not able to attend to individual SWS utterances when pairs of SWS utterances are presented simultaneously (the ‘sine wave speech cocktail party’ (Barker & Cooke, 1999)). Computational modelling studies indicated that, in fact, the fragility of SWS and the limited intelligibility of simultaneous sine wave speakers could be described fairly well by CASA-type models that combine bottom-up acoustic grouping rules and top-down models.

References

  1. Barker, J. P., & Cooke, M. P. (1999). Is the sine-wave speech cocktail party worth attending? Speech Communication, 27(3–4), 159–174. 10.1016/S0167-6393(98)00081-8 [PDF]
  2. Barker, J. P. (1998). The relationship between auditory organisation and speech perception: Studies with spectrally reduced speech (PhD thesis). Sheffield University, U.K.
  3. Barker, J. P., & Cooke, M. P. (1997). Modelling the recognition of spectrally reduced speech. In Proceedings of Eurospeech ’97 (pp. 2127–2130). Rhodes, Greece. [PDF]