Research

Current and recent research projects

Cadenza (2022 - 2027) Machine Learning Challenges to Revolutionise Music Listening for People with Hearing Loss

Cadenza is a 4.5-year EPSRC project in collaboration with the University of Salford (Comp Sci), University of Leeds and University of Nottingham (Medicine), with the support of the BBC, Google, Logitech and Sonova AG, and with user engagement via the Royal National Institute for the Deaf (RNID).

1 in 6 people in the UK has a hearing loss, and this number will increase as the population ages. Poorer hearing makes music harder to appreciate. Picking out lyrics or melody lines is more difficult; the thrill of a musician creating a barely audible note is lost if the sound is actually inaudible, and music becomes duller as high frequencies disappear. This risks disengagement from music and the loss of the health and wellbeing benefits it creates.

The project will look at personalising music so it works better for those with a hearing loss.

The project will consider:

Processing and remixing mixing desk feeds for live events or multitrack recordings.
Processing of stereo recordings in the cloud or on consumer devices.
Processing of music as picked up by hearing aid microphones.

The project aims to accelerate research in this area by organising a series of signal processing challenges. These challenges will grow a collaborative community who can apply their skills and knowledge to this problem area.

The project will develop the tools, databases and objective models needed to run the challenges. This will lower barriers that currently prevent many researchers from considering hearing loss. The data will include the results of listening tests into how real people perceive audio quality, along with a characterisation of each test subject’s hearing ability, because the music processing needs to be personalised. We will develop new objective models to predict how people with a hearing loss perceive the audio quality of music. Such data and tools will allow researchers to develop novel algorithms.

Project Website: http://cadenzachallenge.org

Clarity (2019 - 2024) Challenges to Revolutionise Hearing Device Processing

Clarity is a 5-year EPSRC project in collaboration with the University of Cardiff (Psychology), University of Nottingham (Medicine) and University of Salford (Comp Sci), with the support of the Hearing Industry Research Consortium, Action for Hearing Loss, Amazon and Honda.

The project aims to transform hearing-device research by the introduction of open evaluations (“challenges”) similar to those that have been the driving force in many other fields of speech technology. The project will develop the simulation tools, models, databases and listening test protocols needed to facilitate such challenges. We will develop simulators to create different listening scenarios and baseline models to predict how hearing-impaired listeners perceive speech in noise. Data will also include the results of large-scale speech-in-noise listening tests along with a comprehensive characterisation of each test subject’s hearing ability. These data and tools will form a test-bed to allow other researchers to develop their own algorithms for hearing aid processing in different listening scenarios. The project will run three challenge cycles with steering from industry partners and the speech and hearing research communities.

Project Website: http://claritychallenge.org

TAPAS (2017 - 2021) Training Network on Automatic Processing of PAthological Speech

TAPAS is an H2020 Marie Curie Initial Training Network that will provide research opportunities for 15 PhD students (Early Stage Researchers) to study automatic processing of pathological speech. The network consists of 12 European research institutes and 9 associated partners.

The TAPAS work programme targets three key research problems:

Detection: We will develop speech processing techniques for early detection of conditions that impact on speech production. The outcomes will be cheap and non-invasive diagnostic tools that provide early warning of the onset of progressive conditions such as Alzheimer’s and Parkinson’s.
Therapy: We will use newly emerging speech processing techniques to produce automated speech therapy tools. These tools will make therapy more accessible and more individually targeted. Better therapy can increase the chances of recovering intelligible speech after traumatic events such as a stroke or oral surgery.
Assisted Living: We will re-design current speech technology so that it works well for people with speech impairments. People with speech impairments often have other co-occurring conditions making them reliant on carers. Speech-driven tools for assisted-living are a way to allow such people to live more independently.

The TAPAS consortium includes clinical practitioners, academic researchers and industrial partners, with expertise spanning speech engineering, linguistics and clinical science. This rich network will train a new generation of 15 researchers, equipping them with the skills and resources necessary for lasting success.

DeepArt (2017 - 2018) Deep learning of articulatory-based representations of dysarthric speech

DeepArt is a Google Faculty Award project that is targeting dysarthria, a particular form of disordered speech arising from poor motor control and a resulting lack of coordination of the articulatory system. At Sheffield, we have demonstrated that state-of-the-art training techniques developed for mainstream HMM/DNN speech recognition can raise baseline performance for dysarthric speech recognition.

The DeepArt project will aim to advance the state of the art by conducting research in three key areas:

articulatory-based representations;
use of synthetic training data; and
novel approaches to DNN-based speaker-adaptive training.

AV-COGHEAR (2015 - 2018) Towards visually-driven speech enhancement for cognitively-inspired multi-modal hearing-aid devices

AV-COGHEAR is an EPSRC-funded project that is being conducted in collaboration with the University of Stirling. Current commercial hearing aids use a number of sophisticated enhancement techniques to try and improve the quality of speech signals. However, today’s best aids fail to work well in many everyday situations. In particular, they fail in busy social situations where there are many competing speech sources; they fail if the speaker is too far from the listener and swamped by noise. We have identified an opportunity to solve this problem by building hearing aids that can ‘see’.

AV-COGHEAR aims to develop a new generation of hearing aid technology that extracts speech from noise by using a camera to see what the talker is saying. The wearer of the device will be able to focus their hearing on a target talker and the device will filter out competing sound. This ability, which is beyond that of current technology, has the potential to improve the quality of life of the millions suffering from hearing loss (over 10m in the UK alone).

The project is bringing together researchers with the complementary expertise necessary to make the audio-visual hearing aid possible. The project combines contrasting approaches to audio-visual speech enhancement that have been developed by the Cognitive Computing group at Stirling and the Speech and Hearing Group at Sheffield. The Stirling approach uses the visual signal to filter out noise, whereas the Sheffield approach uses the visual signal to fill in ‘gaps’ in the speech. The MRC Institute of Hearing Research (IHR) will provide the expertise needed to evaluate the approach on real hearing loss sufferers. Phonak AG, a leading international hearing aid manufacturer, is providing the advice and guidance necessary to maximise potential for industrial impact.

INSPIRE (2012 - 2016) Investigating Speech in Real Environments

INSPIRE is an FP7 Marie Curie Initial Training Network that will provide research opportunities for 13 PhD students (Early Stage Researchers) and 3 postdocs (Experienced Researchers) to study speech communication in real-world conditions. The network consists of 10 European research institutes and 7 associated partners (5 businesses and 2 academic hospitals). The senior researchers in the network are academics in computer science, engineering, psychology, linguistics, hearing science, as well as R&D scientists from leading businesses in acoustics and hearing instruments, and ENT specialists. The scientific goal of INSPIRE is to better understand how people recognise speech in real life under a wide range of conditions that are “non-optimal” relative to the controlled conditions in laboratory experiments, e.g., speech in noise, speech recognition under divided attention.

CHiME (2009 - 2012) Computational Hearing in Multisource Environments

CHiME is an EPSRC-funded project that aims to develop a framework for computational hearing in multisource environments. The approach operates by exploiting two levels of processing that combine to simultaneously separate and interpret sound sources. The first processing level exploits the continuity of sound source properties to clump the acoustic mixture into fragments of energy belonging to individual sources. The second processing level uses statistical models of specific sound sources to separate fragments belonging to the acoustic foreground (i.e. the ‘attended’ source) from fragments belonging to the background.

The project will investigate and develop key aspects of the proposed two-level hearing framework:

statistical tracking models to represent sound source continuity;
approaches for combining statistical models of foreground and background sound sources;
approximate search techniques for decoding acoustic scenes in real time; and
strategies for learning sound source models directly from noisy audio data.

CHiME will build a demonstration system simulating a speech-driven home-automation application operating in a noisy domestic environment.

Earlier research projects

POP (2006 - 2009) Perception on Purpose

POP was a three-year EC FP6 Specific Targeted Research project that combined auditory scene analysis and vision on robotic platforms. A key achievement in the audio processing was the combination of binaural source localisation techniques with a spectro-temporal fragment-based sound source separation component to produce a robust sound source localisation implementation suitable for real-time audio motor control. We also spent some time on the tricky problems of trying to use acoustic location cues when the ears that are generating the estimates are themselves moving on unpredictable and possibly unknown trajectories.

These demos show an early prototype sound-localizing robot called Poppy and a custom-made Audio Visual robot called Popeye.

The project also constructed a small corpus of synchronised stereoscopic and binaural recordings called CAVA, which is freely available for download.

AVASR (2004 - 2007) Audio visual speech recognition in the presence of multiple speakers

This was an EPSRC project which looked at audio-visual speech recognition in ‘cocktail party’ conditions, i.e. when there are several people speaking simultaneously. The work first showed that standard multistream AVASR approaches are not appropriate in these conditions. The project then developed an audio-visual extension of the speech fragment decoding approach that, like humans, is able to exploit the visual signal not only for its phonetic content but also in its role as a cue for acoustic source separation. The latter role is also observed in human audio-visual speech processing, where the visual speech input can produce an ‘informational masking release’ leading to increased intelligibility even in conditions where the visual signal provides little or no useful phonetic content.

The project also partially funded the collection of the AV Grid corpus, which is available for download.

Demos of a face marker tracking tool that was built at the start of the project can be found here.

Multisource (2002 - 2005) Multisource decoding for speech in the presence of other sound sources

This was an EPSRC-funded project that aimed “to generalise Automatic Speech Recognition decoding algorithms for natural listening conditions, where the speech to be recognised is one of many sound sources which change unpredictably in space and time”. During this project we continued the development of the Speech Fragment Decoding approach (which was begun towards the end of the RESPITE project), leading to the publication of a theoretical framework. Also during this time we experimented with applications of the missing data approach to binaural conditions and as a technique for handling reverberation.

RESPITE (1999 - 2002) Recognition of Speech by Partial Information TEchniques

Before taking up a lectureship I spent three years as a postdoc working on the EC ESPRIT-funded RESPITE project. The project focused on researching and developing new methodologies for robust Automatic Speech Recognition based on missing-data theory and multiple classification streams. During the project soft missing data techniques were developed and competitively evaluated on the Aurora speech recognition task. At the same time, and in collaboration with Martin Cooke and Dan Ellis, the initial ideas for what became Speech Fragment Decoding were formulated. A separate collaboration with Andrew Morris and Herve Bourlard led to a generalisation of the missing data approach (‘soft data modelling’) that is closely related to what is now known as ‘uncertainty decoding’.

It was also during the RESPITE project that the CASA Toolkit (CTK) was developed. CTK aimed to provide a flexible and extensible software framework for the development and testing of Computational Auditory Scene Analysis (CASA) systems. The toolkit allowed auditory-based signal processing front-ends to be developed using a graphical interface (somewhat similar to Simulink). The toolkit also contained implementations of the various missing data speech recognition algorithms that have been developed at Sheffield. The front-end processing code has largely been made redundant by MATLAB; however, we still use the CTK missing data and speech fragment speech recognition code. The code is no longer supported but can be downloaded from here.

SPHEAR (1998 - 1999) Speech Hearing and Recognition

Prior to RESPITE, I spent a year at ICP in Grenoble (now known as Gipsa-Lab) as a postdoc on SPHEAR, an EC Training and Mobility of Researchers network. The twin goals of the network were to achieve better understanding of auditory information processing and to deploy this understanding in automatic speech recognition for adverse conditions. During the year I worked with Frédéric Berthommier and Jean-Luc Schwartz studying the relation between audio and visual aspects of the speech signal.

SPRACH (1997 - 1998) Speech Recognition Algorithms for Connectionist Hybrids

SPRACH was an ESPRIT Long Term Research Programme project running from 1995 to 1998, on which I was employed for a brief six-month stint while completing my PhD thesis. I had some fun doing some audio segmentation work with Steve Renals (then at Sheffield). The SPRACH project was performing speech recognition on radio broadcasts using what was then called a ‘hybrid MLP/HMM’ recogniser, i.e. an MLP is used to estimate phone posteriors, which are then converted into likelihoods and decoded using an HMM in the usual manner. The audio segmentation work attempted to use features derived from the phone posteriors to segment the audio into regions that would be worth decoding (i.e. likely to give good ASR results) and regions that would not (i.e. either non-speech or very noisy speech regions).
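
The posterior-to-likelihood step in a hybrid recogniser can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the SPRACH recogniser: the function names, the two-phone toy posteriors, the priors and the transition probabilities are all made up for the example. Dividing each posterior by the phone prior gives a ‘scaled likelihood’ (Bayes’ rule with the frame-constant p(x) dropped), which an ordinary Viterbi pass can then decode.

```python
import math


def posteriors_to_scaled_loglikes(posteriors, priors):
    """Convert per-frame phone posteriors P(q|x) into scaled log-likelihoods.

    By Bayes' rule p(x|q) = P(q|x) p(x) / P(q). p(x) is the same for every
    phone within a frame, so log P(q|x) - log P(q) can stand in for
    log p(x|q) during decoding. Assumes all probabilities are > 0.
    """
    return [
        [math.log(post) - math.log(prior) for post, prior in zip(frame, priors)]
        for frame in posteriors
    ]


def viterbi(loglikes, log_trans, log_init):
    """Return the most likely state sequence given per-frame log scores."""
    n_states = len(log_init)
    score = [log_init[s] + loglikes[0][s] for s in range(n_states)]
    backptrs = []
    for frame in loglikes[1:]:
        new_score, ptrs = [], []
        for s in range(n_states):
            # Best predecessor of state s at this frame.
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptrs.append(best)
            new_score.append(score[best] + log_trans[best][s] + frame[s])
        score = new_score
        backptrs.append(ptrs)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy data: two phones, four frames of MLP-style posteriors.
priors = [0.5, 0.5]
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.3), math.log(0.7)]]
log_init = [math.log(0.5), math.log(0.5)]

loglikes = posteriors_to_scaled_loglikes(posteriors, priors)
path = viterbi(loglikes, log_trans, log_init)
print(path)
```

On this toy input the decoded path is [0, 0, 1, 1]. A real system would use many more phone classes, priors estimated from the training alignments, and HMM topologies constrained by a lexicon and language model.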

PhD Thesis (1994 - 1997) Auditory organisation and speech perception

My thesis work, supervised by Martin Cooke, was inspired by a then recently published paper (Remez et al., 1994) which employed experiments using a particular synthetic analogue of natural speech, known as ‘sine wave speech’ (SWS), to apparently invalidate the auditory scene analysis (ASA) account of perception, at least insofar as it showed that ASA did not seem to account for the perceptual organisation of speech signals. This was a big deal at the time because it raised doubt about whether computational models of auditory scene analysis (CASA) were worth pursuing as a technology for robust speech processing. The thesis confirmed Remez’ observation that listeners can be prompted to hear SWS utterances as coherent speech percepts despite SWS seemingly lacking the acoustic ‘grouping’ cues that were supposedly essential for coherency under the ASA account. However, the thesis went on to demonstrate that the coherency of the sine wave speech percept is fragile: for example, listeners are not able to attend to individual SWS utterances when pairs of SWS utterances are presented simultaneously (the ‘sine wave speech cocktail party’). Computational modelling studies indicated that, in fact, the fragility of SWS and the limited intelligibility of simultaneous sine wave speakers could be described fairly well by CASA-type models that combine bottom-up acoustic grouping rules and top-down models.