MSc projects 2018-19

  • JPB-MSc-1: Distant microphone speech processing for CHiME-5 - Data Simulation (CS+SLP or ACS) (Jianbo Wu)
  • JPB-MSc-2: Distant microphone speech processing for CHiME-5 - Factored TDNNs (CS+SLP or ACS) (Hewei Ye)
  • JPB-MSc-2b: Distant microphone speech processing for CHiME-5 - Source Enhancement (CS+SLP or ACS) (Ziyuan Xia)
  • JPB-MSc-3: Visualisation Tools for a Speech Perception Database (CS+SLP or SSIT)
  • JPB-MSc-4: Lip reading for audio speech enhancement (CS+SLP or ACS) (Chenfeng Wei)
  • JPB-MSc-5: Blink detection for web navigation (ACS) (Mingqian Shi)
  • JPB-MSc-6: Eye tracking software for audio-visual speech perception research (CS+SLP or ACS) (Zixuan Zhang)

Mail all: JWu58@sheffield.ac.uk, HYe4@sheffield.ac.uk, ZXia8@sheffield.ac.uk, CWei7@sheffield.ac.uk, MShi6@sheffield.ac.uk, ZZhang120@sheffield.ac.uk

The project descriptions below are only intended as starting points. If you wish to discuss possibilities in greater detail I encourage you to email me to arrange a meeting.


JPB-MSc-1: Distant microphone speech processing for CHiME-5 - Data Simulation (CS+SLP or ACS)

Data Simulation: Training data augmentation using room acoustic simulation

CHiME-5 provides just 50 hours of training data. This is not really sufficient for best performance: 100 hours or 200 hours would be better. The idea of this project is to use techniques for simulating room acoustics to remix CHiME recordings. Specifically, non-reverberant speech segments from the worn microphones will be mixed with different background sounds and then reprocessed to sound as though they have been recorded through the distant microphones.
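A minimal sketch of the remixing step is given below. It assumes a room impulse response `rir` obtained from a room acoustics simulator (e.g. an image-source method implementation) and a long background recording; the function and variable names are illustrative, not part of the CHiME-5 baseline:

```python
import numpy as np
from scipy.signal import fftconvolve

def remix(clean_speech, background, rir, snr_db, rng=np.random.default_rng(0)):
    """Simulate a distant-microphone recording from a worn-microphone segment.

    clean_speech : 1-D array, near-field speech from a worn microphone
    background   : 1-D array, background noise recorded in the room
    rir          : 1-D array, simulated room impulse response (speaker -> distant mic)
    snr_db       : target speech-to-background ratio in dB
    """
    # 1. Reverberate the clean speech by convolving with the room impulse response.
    reverberant = fftconvolve(clean_speech, rir)[:len(clean_speech)]

    # 2. Cut a random background segment of the same length.
    start = rng.integers(0, len(background) - len(reverberant))
    noise = background[start:start + len(reverberant)]

    # 3. Scale the noise to achieve the requested SNR.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    # 4. Mix to produce a new 'distant microphone' training utterance.
    return reverberant + gain * noise
```

Sweeping over impulse responses, backgrounds and SNRs is what turns the 50 hours of worn-microphone speech into a much larger augmented training set.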

[TOP]


JPB-MSc-2: Distant microphone speech processing for CHiME-5 - Factored TDNNs (CS+SLP or ACS)

Acoustic Modelling: Improved acoustic modelling using factored-layer time-delay neural networks

The CHiME-5 baseline system uses an acoustic model based on a time-delay neural network (TDNN). Recently it has been shown that these models can be improved by using so-called ‘factored’ layers, which replace a fully connected layer with two smaller layers that together have fewer parameters. This project will upgrade the CHiME baseline to use a factored-layer TDNN and run experiments to find the most effective way to produce the factorisation.
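To give a feel for the idea, the NumPy sketch below factors a single dense layer through a small bottleneck and compares parameter counts. The dimensions are invented for illustration; the real baseline is built with the Kaldi toolkit, and its factored TDNN layers additionally impose a semi-orthogonal constraint on one of the factors:

```python
import numpy as np

# Illustrative dimensions only (not taken from the CHiME-5 recipe).
d_in, d_out, bottleneck = 1536, 1536, 160

full_params = d_in * d_out                                 # single dense layer
factored_params = d_in * bottleneck + bottleneck * d_out   # two smaller layers

x = np.random.randn(d_in)
V = np.random.randn(bottleneck, d_in) * 0.01   # first factor: project down
U = np.random.randn(d_out, bottleneck) * 0.01  # second factor: expand back up

y = U @ (V @ x)   # replaces y = W @ x, with W approximated by the product U @ V

print(f"full layer: {full_params:,} params, factored: {factored_params:,} params")
```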

Description

The student on this project will contribute to the development of a distant microphone speech recognition system (e.g. similar to Amazon Alexa or Google Home). The project will be part of a larger project being conducted in collaboration with Toshiba Research Labs. The system will be designed for recognising conversational speech using audio captured in people’s homes. The project will be using CHiME-5, a brand-new conversational speech dataset made up of recordings of real dinner parties recorded in people’s homes (http://spandh.dcs.shef.ac.uk/chime_challenge/). The data is currently being recorded by Sheffield in collaboration with Google and others and will be released in January.

The CHiME-5 dataset will be released with a baseline speech recognition system. The MSc project will aim to improve this baseline system by working on one component. There are multiple sub-tasks that may form the focus of this project: deep learning for acoustic or language modelling; multiple microphone speech enhancement; speech source separation for overlapping speech; training data simulation and augmentation. The project is ideally suited to students on the Computer Science with Speech and Language Processing MSc, but a student on the ACS MSc with an interest in machine learning could also be suitable.

The project is suitable for up to two students. Each student would be working with the same data and baseline software framework, but will focus on a different aspect of the system.

If you are interested in this project please make an appointment to see me.

Background reading

  • Distant microphone speech recognition reference: http://spandh.dcs.shef.ac.uk/projects/chime/
  • The CHiME-5 dataset
  • Further information may appear on my website project pages. http://staffwww.dcs.shef.ac.uk/people/J.Barker//project-year/pgt-2017.html
[TOP]


JPB-MSc-2b: Distant microphone speech processing for CHiME-5 - Source Enhancement (CS+SLP or ACS)

Source Enhancement: Time-frequency mask estimation for source separation

Distant microphone speech recognition systems often employ a time-frequency mask which identifies the time-frequency elements that are believed to be corrupted by noise. These masks can be used to filter out the noise, or for steering microphone arrays. This project would look at deep neural network approaches for mask estimation. These approaches have been shown to work in situations where the speech and noise have very different characteristics, but are harder to employ in tasks where the noise is coming from an interfering speaker. We will look at approaches that solve this problem by using a ‘speaker embedding’ to identify the target speaker.
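As a rough illustration of how an estimated mask is used, the sketch below applies a ratio mask to a noisy mixture. The STFT settings and function names are assumptions, and the ‘oracle’ mask shown here (computed from known speech and noise) is only a stand-in for the mask a neural network would estimate from the mixture:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(noisy, mask, fs=16000, nperseg=512):
    """Filter a noisy signal with a time-frequency mask (values in [0, 1])."""
    _, _, noisy_stft = stft(noisy, fs=fs, nperseg=nperseg)
    enhanced_stft = mask * noisy_stft            # attenuate noise-dominated cells
    _, enhanced = istft(enhanced_stft, fs=fs, nperseg=nperseg)
    return enhanced

def oracle_ratio_mask(speech, noise, fs=16000, nperseg=512):
    """Ideal ratio mask computed from known speech and noise signals.
    Only available in simulation; in practice a DNN estimates the mask."""
    _, _, S = stft(speech, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    return np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)
```

The same mask values can alternatively be used as weights when estimating the spatial statistics that steer a microphone-array beamformer towards the target speaker.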

Description

The student on this project will contribute to the development of a distant microphone speech recognition system (e.g. similar to Amazon Alexa or Google Home). The project will be part of a larger project being conducted in collaboration with Toshiba Research Labs. The system will be designed for recognising conversational speech using audio captured in people’s homes. The project will be using CHiME-5, a brand-new conversational speech dataset made up of recordings of real dinner parties recorded in people’s homes (http://spandh.dcs.shef.ac.uk/chime_challenge/). The data is currently being recorded by Sheffield in collaboration with Google and others and will be released in January.

The CHiME-5 dataset will be released with a baseline speech recognition system. The MSc project will aim to improve this baseline system by working on one component. There are multiple sub-tasks that may form the focus of this project: deep learning for acoustic or language modelling; multiple microphone speech enhancement; speech source separation for overlapping speech; training data simulation and augmentation. The project is ideally suited to students on the Computer Science with Speech and Language Processing MSc, but a student on the ACS MSc with an interest in machine learning could also be suitable.

The project is suitable for up to two students. Each student would be working with the same data and baseline software framework, but will focus on a different aspect of the system.

If you are interested in this project please make an appointment to see me.

Background reading

  • Distant microphone speech recognition reference: http://spandh.dcs.shef.ac.uk/projects/chime/
  • The CHiME-5 dataset
  • Further information may appear on my website project pages. http://staffwww.dcs.shef.ac.uk/people/J.Barker//project-year/pgt-2017.html
[TOP]


JPB-MSc-3: Visualisation Tools for a Speech Perception Database (CS+SLP or SSIT)

Description

The English Consistent Confusion Corpus is a large-scale collection of noise-induced British English speech misperceptions. These misperceptions have been elicited by asking listeners to transcribe English words mixed with complex noise backgrounds. The corpus has been distilled from over 300,000 listener responses and includes responses to over 9,000 individual noisy speech tokens. Of these, a subset of over 3,000 tokens induce ‘consistent confusions’, i.e. tokens that are misheard in the same way by a significant number of listeners.

This project will build a web-based tool for browsing and visualising the contents of this database.

[TOP]


JPB-MSc-4: Lip reading for audio speech enhancement (CS+SLP or ACS)

Description

Speech can be hard to understand when there is a lot of background noise present. There are many well-established signal processing techniques for removing noise from speech signals; however, most of these techniques fail to make the speech any more intelligible - they just make it sound less noisy. This project will investigate an exciting new audio-visual strategy that starts with a noisy video recording of the speaker. The system will then use computer vision techniques to extract speech information from the pattern of the speaker’s lip movements. This information will then be used to improve the speech audio signal. (This isn’t a far-fetched idea, it is something that you and I do naturally when listening to speech!)

The project breaks into several components, any one of which could be the main focus: i/ image processing for visual feature extraction from video; ii/ machine learning for visual to acoustic feature mapping; iii/ testing new algorithms for speech signal processing. It will also require software skills for building tools and demonstration systems, and it will provide experience of evaluating speech signals by running controlled listening experiments.
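To give a flavour of the first component, the sketch below extracts mouth-region landmarks from a video using dlib’s standard 68-point face model and OpenCV. The model file path and the frame-handling details are illustrative assumptions, not part of any existing project code:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard 68-point landmark model distributed with dlib (path is illustrative).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_landmarks(video_path):
    """Return an array of shape (num_frames_with_a_face, 20, 2) of mouth landmarks."""
    cap = cv2.VideoCapture(video_path)
    features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue                      # skip frames where no face is found
        shape = predictor(gray, faces[0])
        # Points 48-67 of the 68-point model outline the lips.
        mouth = [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
        features.append(mouth)
    cap.release()
    return np.array(features)
```

A per-frame feature sequence like this would then be the input to the visual-to-acoustic mapping stage (component ii/).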

The project will run in parallel with a UK Research Council-funded collaboration between Sheffield, the University of Stirling, the Institute of Hearing Research and Phonak that is developing camera-equipped hearing aids.

Background reading

The following references provide some background to the field.

[TOP]


JPB-MSc-5: Blink detection for web navigation (ACS)

Description

This project will develop and test blink detection as a means to control a standard web-browser. The goal will be to evaluate this as a technology for disabled users who are unable to use a conventional keyboard or touch-driven interface.

The project builds on a project that ran this year, which built a Chrome web-browser extension that allows the browser to be operated using a brain-computer interface (BCI). This system works, but the BCI technology is hard for many users to use. The new project will build a video-based blink detection system that can interface with this modified web browser to provide an alternative control mechanism that many users will find easier to operate, and which avoids the intrusion of having to wear a headset.

There has been a lot of recent work on automatic blink detection (e.g., it is used to monitor driver alertness in some cars). The techniques are well understood and can be easily implemented with recent computer vision and machine learning toolkits. However, challenges remain in working out how best to integrate the technology with a web browser, and how to produce a system that won’t be prone to false positives (e.g., distinguishing deliberate blinks from blinks that occur spontaneously).

It is expected that the student will use high-level tools such as the OpenCV computer vision library and the face-tracking components of the dlib machine learning library (via its Python bindings).
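A minimal sketch of the standard eye-aspect-ratio (EAR) approach to blink detection is shown below. The landmark indices follow dlib’s 68-point model; the model path and the EAR threshold are illustrative values that would need tuning:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # path illustrative

def eye_aspect_ratio(pts):
    """EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|); small values indicate a closed eye."""
    p = np.array(pts, dtype=float)
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

def eyes_closed(gray_frame, threshold=0.2):
    """Return True if the eyes in this frame appear closed (a candidate blink)."""
    faces = detector(gray_frame)
    if not faces:
        return False
    shape = predictor(gray_frame, faces[0])
    left = [(shape.part(i).x, shape.part(i).y) for i in range(36, 42)]   # left eye
    right = [(shape.part(i).x, shape.part(i).y) for i in range(42, 48)]  # right eye
    ear = (eye_aspect_ratio(left) + eye_aspect_ratio(right)) / 2.0
    return ear < threshold
```

Distinguishing deliberate blinks from spontaneous ones would then come down to measuring how many consecutive frames the eyes stay closed and thresholding that duration.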

Prerequisites

  • Experience in Python programming.

Further reading

  • OpenCV Python tutorial
  • dlib - A toolkit for making real world machine learning applications.
  • An example blink detection research paper.
  • http://staffwww.dcs.shef.ac.uk/people/J.Barker//teaching.html
[TOP]


JPB-MSc-6: Eye tracking software for audio-visual speech perception research (CS+SLP or ACS)

Description

Eye tracking is used in psychology experiments as a way of monitoring a subject’s attention. In the Speech and Hearing research group we are interested in learning how normal-hearing and hearing-impaired listeners use their eyes to capture ‘visual speech cues’ (e.g., lip movements) that help them understand speech. This can be achieved by tracking a user’s gaze direction while they watch videos of speech presented on a monitor.

The Department has access to a state-of-the-art wearable eye-tracking device, the Tobii Pro Glasses 2, that could potentially be used for these experiments. However, wearable eye-trackers only track gaze relative to the wearer’s head orientation: if used in a screen-based experiment, they do not directly tell you whereabouts on the screen the user is looking. This problem can be solved with some additional computer vision software.

This project would develop software that would allow the Tobii Pro Glasses to be used for screen-based experiments. This is essentially a video processing task that can be solved using computer vision techniques available in the OpenCV toolkit. The project would then demonstrate the effectiveness of the software by using it in some audio-visual speech perception experiments that we have planned.
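One plausible approach (an assumption about the design, not a prescription) is to locate the monitor’s four corners in each scene-camera frame, for example via fiducial markers on the bezel, and then use an OpenCV homography to map the glasses’ gaze point into screen coordinates:

```python
import cv2
import numpy as np

def gaze_to_screen(gaze_xy, screen_corners_px, screen_w=1920, screen_h=1080):
    """Map a gaze point from scene-camera pixels to screen coordinates.

    gaze_xy           : (x, y) gaze point reported by the glasses, in scene-camera pixels
    screen_corners_px : 4x2 array of the monitor's corners in the scene-camera image,
                        ordered top-left, top-right, bottom-right, bottom-left
                        (detected elsewhere, e.g. from markers on the monitor bezel)
    """
    src = np.asarray(screen_corners_px, dtype=np.float32)
    dst = np.float32([[0, 0], [screen_w, 0], [screen_w, screen_h], [0, screen_h]])

    # Homography from the camera's view of the screen to screen coordinates.
    H, _ = cv2.findHomography(src, dst)

    point = np.float32([[gaze_xy]])        # shape (1, 1, 2) as OpenCV expects
    mapped = cv2.perspectiveTransform(point, H)
    return tuple(mapped[0, 0])             # (x, y) in screen pixels
```

Recomputing the homography every frame keeps the mapping valid as the wearer moves their head.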

If you are interested then please make an appointment to see me so that I can explain the problem in more detail.

Requirements

  • An interest in Computer Vision
  • Some Python programming experience

Background Reading

  • Tobii Glasses
  • Python OpenCV
  • Further information may appear on my website project pages. http://staffwww.dcs.shef.ac.uk/people/J.Barker//project-year/pgt-2017.html
[TOP]