MSc projects 2006 - 07

  • JPB-MSc-1: Audio-Visual Lip Tracking (Pinelopi Sotiropoulou)
  • JPB-MSc-2: Eye-tracking for head pose estimation (Murad Abouammoh)
  • JPB-MSc-3: Audio-based speaker location estimation for diarization (Maral Dadvar)

The project descriptions below are only intended as starting points. If you wish to discuss possibilities in greater detail I encourage you to email me to arrange a meeting.


JPB-MSc-1: Audio-Visual Lip Tracking

Description

Lip tracking is the task of following the outline of a speaker’s lips through a sequence of video frames. This task is an important component of many audio-visual speech processing applications – including audio-visual speech recognition. The most successful lip tracking systems employ a technique known as active shape and appearance modelling (ASM/AAM). This technique employs a statistical model of the shape (and possibly appearance) of the speaker’s lips that has been learnt from a small number of video frames in which the lip outlines have been traced by hand. The tracking system then examines the video and employs an iterative search to find a sequence of smoothly changing lip shapes that fit this model well. Although this technique can work well in optimal conditions, trackers are prone to drift away from the correct solution when the video quality is poor.
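
To make the shape-modelling idea concrete, here is a minimal sketch (in Python/NumPy, purely illustrative; the function names and the ±3 standard-deviation limit are my own choices, not a prescription for the project) of how a point-distribution model could be learnt from hand-traced lip outlines and used to keep a candidate shape plausible during the search:

    import numpy as np

    def train_shape_model(shapes, n_modes=5):
        # shapes: (N, 2K) array, N hand-traced frames, each a flattened
        # set of K (x, y) lip landmark points, assumed already aligned.
        mean_shape = shapes.mean(axis=0)
        cov = np.cov(shapes - mean_shape, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1][:n_modes]   # keep the largest modes
        return mean_shape, eigvecs[:, order], eigvals[order]

    def constrain_shape(candidate, mean_shape, modes, eigvals, limit=3.0):
        # Project a candidate outline onto the model and clip each mode
        # parameter to +/- limit standard deviations, so the tracker can
        # only propose statistically plausible lip shapes.
        b = modes.T @ (candidate - mean_shape)
        b = np.clip(b, -limit * np.sqrt(eigvals), limit * np.sqrt(eigvals))
        return mean_shape + modes @ b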

This project aims to improve the reliability of a video-based lip tracking system by exploiting the audio speech signal. There is a well-understood mapping between speech sounds and the positions of the lips required to produce them (e.g. the sounds ‘m’, ‘b’ and ‘p’ are made by bringing the lips together, for vowels sounds the lips are open). So when we hear speech we have an expectation of how the lips should behave. A lip tracking system based on audio-visual models can take advantage of this statistical correlation to work more robustly in situations where the video information alone is inadequate.
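
As a toy illustration of how the audio expectation might be folded in, the sketch below treats the video-derived and audio-predicted values of a lip-shape parameter as two noisy estimates of the same quantity and combines them by precision weighting. This is only one possible fusion scheme, and the idea of an audio model that predicts a shape parameter with a confidence is an assumption for illustration, not the method the project must adopt:

    import numpy as np

    def fuse_parameters(video_b, video_var, audio_b, audio_var):
        # Precision-weighted (inverse-variance) fusion of a lip-shape
        # parameter estimated from the video search with the value
        # predicted from the audio.  When the video evidence is poor its
        # variance is large and the fused estimate leans on the audio.
        w_video = 1.0 / np.asarray(video_var)
        w_audio = 1.0 / np.asarray(audio_var)
        return (w_video * video_b + w_audio * audio_b) / (w_video + w_audio)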

Additional Information

For evaluation purposes the project will use part of a large corpus of audio-visual speech data that has recently been collected at Sheffield. Much of the necessary preprocessing has already been performed, and some tools have already been developed, allowing the project to have a running start. This is a ‘real’ project that is aligned with ongoing research within the Speech and Hearing Research Group. It is hoped that if successful the project outputs will form the foundation of publishable research.

The project is interdisciplinary in nature, involving elements of video processing, audio processing and machine learning, and offers a great opportunity to gain experience in these areas. There are no specific prerequisites, but this is a challenging project and students are expected to be willing to engage with material that may not be covered directly in the taught modules.

Prerequisites

  • Maths (matrices, vectors, linear algebra), good programming skills. Experience of MATLAB will be an advantage.

Initial reading

  • T.F.Cootes, C.J. Taylor, D.H.Cooper and J.Graham (1995) “Active Shape Models - Their training and application”, Computer Vision and Image Understanding 61(1) pp. 38-59 here
  • T.F. Cootes and C.J. Taylor (2001), “Statistical models of appearance for medical image analysis and computer vision.” Proc. SPIE Medical Imaging here
  • I. Matthews, T.F. Cootes and J.A. Bangham (2002), “Extraction of Visual Features for Lipreading.” IEEE PAMI Vol.24, No.2, pp.198-213 here
  • D.Cristinacce and T.F.Cootes (2004), “A comparison of shape constrained facial feature detectors.” Proc. Int. Conf on Face and Gesture Recognition here

Data and Code

See here.



JPB-MSc-2: Eye-tracking for head pose estimation

Description

This project aims to track the position of facial features in ‘talking head’-style video data. By tracking fixed features (e.g. eye, nose and mouth positions) it will be possible to accurately estimate the position and orientation (the ‘pose’) of the speaker’s head. The eye-tracking component may also be used to estimate the speaker’s blink rate.
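
As a rough illustration of how pose can be read off once the fixed features have been located, the sketch below (Python/NumPy; the particular cues and normalisation are illustrative choices rather than the project's required method) derives a roll angle from the line joining the eyes and a simple yaw cue from the nose position:

    import numpy as np

    def rough_pose_cues(left_eye, right_eye, nose_tip):
        # All arguments are (x, y) pixel coordinates of tracked features.
        dx, dy = np.subtract(right_eye, left_eye)
        roll = np.degrees(np.arctan2(dy, dx))        # in-plane head tilt
        eye_mid = np.add(left_eye, right_eye) / 2.0
        interocular = np.hypot(dx, dy)
        # Horizontal offset of the nose from the eye midpoint, normalised
        # by the inter-ocular distance: roughly zero when facing the camera.
        yaw_cue = (nose_tip[0] - eye_mid[0]) / interocular
        return roll, yaw_cue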

As a starting point the project will experiment with eigen-feature and feature collocation techniques.
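
For the eigen-feature idea, a minimal sketch in the spirit of Turk and Pentland's eigen-decomposition approach is given below (the patch format, function names and reconstruction-error scoring are my own illustrative choices): a PCA basis is learnt from hand-labelled eye patches, and candidate patches are scored by how well that basis reconstructs them.

    import numpy as np

    def train_eigenfeature(patches, n_components=10):
        # patches: (N, P) array of flattened, hand-labelled eye patches.
        mean = patches.mean(axis=0)
        # SVD of the centred data gives the principal 'eigen-eyes'.
        _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
        return mean, vt[:n_components]

    def feature_score(patch, mean, components):
        # 'Distance from feature space': reconstruction error of a
        # candidate patch.  Scan this over the image and keep the minimum
        # to locate the feature.
        centred = patch - mean
        recon = components.T @ (components @ centred)
        return np.linalg.norm(centred - recon)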

Additional Information

For evaluation purposes the project will use part of a large corpus of audio-visual speech data that has recently been collected at Sheffield. In order to train the tracking system and evaluate the results the project will involve hand-annotating a subset of the data. This is a ‘real’ project that is aligned with ongoing research within the Speech and Hearing Research Group. It is hoped that if successful the project outputs may contribute to publishable research.

Prerequisites

  • Maths (matrices, vectors, linear algebra) and programming skills.

Initial reading

  • A.W.Senior (1999), “Face and feature finding for a face recognition system.” Proc 2nd Int. Conf. on Audio and Video-based Biometric Person Authentication, pp. 154-159, Washington DC here
  • M.Turk and A.Pentland (1991), “Eigenfaces for recognition”, Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86 here
  • D.Cristinacce and T.F.Cootes (2004), “A comparison of shape constrained facial feature detectors.” Proc. Int. Conf on Face and Gesture Recognition here
  • S. Spors and R. Rabenstein (2001), “A real-time face tracker for color video”, IEEE Int. Conf. on Acoustics, Speech & Signal Processing (ICASSP), Utah, USA, May 2001 here


Data

See here.



JPB-MSc-3: Audio-based speaker location estimation for diarization

Description

Given a recording of a conversation, diarization is the task of deciding, ‘Who spoke when?’. Most approaches to the task involve first detecting when a change in speaker has occurred. This can be difficult given that there is often little or no silence between speakers, and that speakers can have similar sounding voices.

If a conversation has been recorded using two microphones it is possible to estimate the location from which the sound is coming (i.e. in the same way that we can guess where a sound is coming from by using our two ears). As different speakers will be sitting at different positions in the room, it is possible to detect speaker changes by noting a sudden change in the estimated direction of the sound. However, estimating sound source direction is difficult (due to echoes from the walls) and direction estimates are noisy.
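
To illustrate the kind of two-microphone direction estimate involved, here is a minimal sketch (Python/NumPy) of a generalised cross-correlation (GCC-PHAT) delay estimate converted to an azimuth. The 20 cm microphone spacing is an assumed value for illustration, not the geometry of the recordings used in the project:

    import numpy as np

    def estimate_azimuth(left, right, fs, mic_spacing=0.2, c=343.0):
        # Cross-correlate the two channels with phase-transform (PHAT)
        # weighting, take the peak as the inter-channel time delay, and
        # convert that delay to an angle for the assumed microphone spacing.
        n = len(left) + len(right)
        spec = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
        spec /= np.abs(spec) + 1e-12                 # PHAT weighting
        cc = np.fft.irfft(spec, n)
        max_lag = int(fs * mic_spacing / c)          # physically possible lags
        cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
        delay = (np.argmax(np.abs(cc)) - max_lag) / fs
        sin_theta = np.clip(delay * c / mic_spacing, -1.0, 1.0)
        return np.degrees(np.arcsin(sin_theta))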

This project will evaluate the use of sound location as a feature for diarization. It will be evaluated using artificial ‘conversations’ made by concatenating utterances that have been recorded with two microphones from a variety of positions. Different approaches to smoothing the noisy location estimates will be considered; these may include hidden Markov modelling and particle filtering.
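
As a sketch of the smoothing step, the minimal bootstrap particle filter below tracks the per-frame azimuth estimates, so that a sudden, persistent jump in the filtered track can be read as a speaker change. The random-walk motion model and the noise values are assumptions made for illustration; the project may equally well use an HMM or another smoother:

    import numpy as np

    def smooth_azimuths(observations, n_particles=500,
                        motion_std=2.0, obs_std=15.0):
        # Bootstrap particle filter over a 1-D azimuth track: particles
        # random-walk slowly (speakers mostly stay put), observations are
        # noisy, and the posterior mean gives the smoothed direction.
        rng = np.random.default_rng(0)
        particles = rng.uniform(-90.0, 90.0, n_particles)
        track = []
        for z in observations:
            particles = particles + rng.normal(0.0, motion_std, n_particles)
            weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2) + 1e-12
            weights /= weights.sum()
            particles = particles[rng.choice(n_particles, n_particles, p=weights)]
            track.append(particles.mean())
        return np.array(track)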

Additional Information

This project will be supervised by me, with the support of Dr Christensen, a Research Associate who is currently working on robust sound source localisation. It is a ‘real’ project that is aligned with ongoing research within the Speech and Hearing Research Group. If successful, the project may contribute to publishable research.

Prerequisites

  • Maths (some understanding of probability theory) and Java, C or C++ programming skills. Some experience with MATLAB may be helpful.

Initial reading

  • Y.Denda, T.Nishiura and Y.Yamashita (2006), “Robust talker direction estimation based on weighted CSP analysis and maximum likelihood estimation”, IEICE Trans. Inf. & Syst. Vol. E89-D, No. 3, pp. 1050–1057 (here)
  • T.M.Shackleton, R.Meddis and J.Hewitt (1992), “Across frequency integration in a model of lateralization”, JASA 91(4), pp. 2276–2279 (here)
  • P.Perez, J.Vermaak and A.Blake (2004), “Data fusion for visual tracking with particles”, Proc. IEEE, 92(3), pp. 495-513 (here)