MSc projects 2007 - 08

  • JPB-MSc-1: Comparison of transform-based visual features for automatic lip reading (HLT) (Liang Chang)
  • JPB-MSc-2: Rapid adaptation of visual speech models using `eigenvoice' techniques
  • JPB-MSc-3: Audio-based ego-motion estimation (HLT/ACS/DataComms) (Huaxin Zhang)
  • JPB-MSc-5: Face detection and tracking in stereoscopic video data

The project descriptions below are only intended as starting points. If you wish to discuss possibilities in greater detail I encourage you to email me to arrange a meeting.


JPB-MSc-1: Comparison of transform-based visual features for automatic lip reading (HLT)

Description

The visual feature extraction algorithm is the most critical component of a successful automatic lip reading system: if the recogniser is supplied with poor features it will not produce good recognition results. Common feature extraction techniques operate by applying a fixed transform to a region of the image centred on the lips. However, there are many competing transforms that can be applied and very little agreement over which is `best’. This project will take advantage of a new audio-visual speech corpus – the Grid corpus – to test various visual features in direct competition.

The Grid corpus is one of the largest non-proprietary audio-visual corpora currently available and its design enables visual feature techniques to be quickly and easily compared. The project will make use of previous work with the Grid corpus that establishes a set of baseline results for the commonly-employed 2D-DCT transform. These baseline results will be compared against variants of the standard 2D-DCT feature and against other transforms such as PCA, wavelet transforms and, possibly, more exotic techniques such as `sieve features’ (Harvey et al., 1997). The project will also experiment with standard front-end normalisation techniques, such as histogram equalisation and lighting correction, that are designed to reduce intra- and inter-speaker variability. There has been very little systematic research in this area and the scope of the project is wide. It would make an excellent starting point for future PhD research.
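
To make the feature extraction step concrete, the fragment below sketches how a 2D-DCT feature vector might be computed from a mouth region of interest, with histogram equalisation applied first as a simple front-end normalisation. This is only an illustrative Python sketch: the ROI size, the number of retained coefficients and the function names are assumptions rather than values from the Grid corpus baseline, and in the project such features would feed HTK-based recognisers.

```python
"""Sketch: 2D-DCT visual features from a mouth region of interest (ROI).

Illustrative only: the ROI size, the equalisation step and the number of
retained coefficients are assumptions, not the Grid corpus baseline setup.
"""
import numpy as np
from scipy.fftpack import dct


def equalise(roi):
    """Histogram equalisation of a greyscale ROI with values in 0..255,
    a simple front-end normalisation intended to reduce lighting variability."""
    hist, _ = np.histogram(roi.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum()
    cdf = 255.0 * cdf / cdf[-1]                       # map CDF onto 0..255
    return np.interp(roi.ravel(), np.arange(256), cdf).reshape(roi.shape)


def dct_features(roi, n_coeffs=6):
    """Apply a 2D-DCT to the equalised ROI and keep the low-order
    n_coeffs x n_coeffs block of coefficients as the frame's feature vector."""
    roi = equalise(roi.astype(np.float64))
    coeffs = dct(dct(roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:n_coeffs, :n_coeffs].ravel()


if __name__ == '__main__':
    # Random stand-in for a 32x48 pixel mouth ROI cropped from one video frame.
    frame_roi = np.random.randint(0, 256, size=(32, 48), dtype=np.uint8)
    print(dct_features(frame_roi).shape)              # (36,) features per frame
```

The competing transforms mentioned above (PCA, wavelets, sieves) would simply replace the DCT step in a pipeline of this shape.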

Additional information

The project will involve using HTK and will best suit a student who is comfortable using Linux and understands the basic principles of writing shell scripts.

This project is suitable for HLT students.

Initial reading

  • I. Matthews, G. Potamianos, C. Neti and J. Luettin (2001) “A comparison of model and transform-based visual features for audio-visual LVCSR.” Proc ICME-01 here
  • P. S. Aleksic and A. K. Katsaggelos (2004) “Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition” Proc ICASSP 2004, Volume 5, 917-920 here
  • R. Harvey, I. Matthews, J. A. Bangham and S. Cox, (1997) “Lip reading from scale-space measurements” In Proc. CVPR-97, pages 582–587 here
  • M. Cooke, J. Barker, S. Cunningham and X. Shao, (2006) “An audio-visual corpus for speech perception and automatic speech recognition.” Journal of the Acoustical Society of America 120(5), 2421-2424, 2006 here
[TOP]


JPB-MSc-2: Rapid adaptation of visual speech models using `eigenvoice' techniques

Description

The performance of a speech recognition system can be improved by adapting the model parameters to better fit the characteristics of the user. The standard algorithms (e.g. MAP adaptation and MLLR adaptation) adapt slowly, requiring large amounts of data from the user. In recent years new algorithms have emerged that can adapt rapidly using very little data. One of the best known of these is the eigenvoice technique. This technique has been shown to work very well for acoustic speech models but has not previously been tested on visual speech models (i.e. as used in automatic lip reading systems).

This exciting project will apply the eigenvoice model adaptation technique to models of visual speech. This has never previously been attempted and, if successful, the work could lead to publishable results.

The project is ambitious but manageable because it builds on the output of previous related research conducted at Sheffield. The project will employ the audio-visual Grid corpus – a large audio-visual speech database that has been collected at Sheffield in recent years. The project will also make use of visual features and speaker-dependent models that have been constructed from this data as part of a recent EPSRC research grant.
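
As a rough illustration of the technique, the toy Python sketch below builds an eigenvoice basis by PCA over speaker-dependent mean supervectors and then estimates the combination weights for a new speaker by maximum likelihood from a handful of aligned adaptation frames. It makes strong simplifying assumptions (one Gaussian per state, identity covariances, a known frame-to-state alignment) and all names and dimensions are invented for illustration; the real project would work with HTK model sets built from the Grid corpus.

```python
"""Sketch: rapid adaptation with an `eigenvoice' basis on toy Gaussian means.

Strong simplifying assumptions: one Gaussian per state, identity covariances
and a known frame-to-state alignment.  All dimensions are invented.
"""
import numpy as np


def build_eigenvoices(supervectors, n_eigen=4):
    """PCA over speaker-dependent mean supervectors (speakers x dims)."""
    mean_sv = supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(supervectors - mean_sv, full_matrices=False)
    return mean_sv, vt[:n_eigen]                      # (dims,), (n_eigen, dims)


def adapt(mean_sv, eigenvoices, frames, alignment, feat_dim):
    """Maximum-likelihood eigenvoice weights from a little adaptation data.

    frames:    (T, feat_dim) adaptation observations
    alignment: (T,) index of the state each frame is assigned to
    """
    n_eigen = eigenvoices.shape[0]
    A = np.zeros((n_eigen, n_eigen))
    b = np.zeros(n_eigen)
    for x, g in zip(frames, alignment):
        sl = slice(g * feat_dim, (g + 1) * feat_dim)
        E_g = eigenvoices[:, sl]                      # basis rows for this state
        A += E_g @ E_g.T
        b += E_g @ (x - mean_sv[sl])
    weights = np.linalg.lstsq(A, b, rcond=None)[0]
    return mean_sv + weights @ eigenvoices            # adapted mean supervector


if __name__ == '__main__':
    rng = np.random.default_rng(0)
    n_speakers, n_states, feat_dim = 20, 8, 3
    speaker_svs = rng.normal(size=(n_speakers, n_states * feat_dim))
    m0, basis = build_eigenvoices(speaker_svs)
    # Ten adaptation frames from a `new' speaker with a made-up alignment.
    frames = rng.normal(size=(10, feat_dim))
    alignment = rng.integers(0, n_states, size=10)
    print(adapt(m0, basis, frames, alignment, feat_dim).shape)   # (24,)
```

The point of the technique is visible in the shapes: the new speaker is described by a handful of weights rather than a full set of model means, which is why so little adaptation data is needed.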

Additional information

The project will involve using HTK and will best suit a student who is comfortable using Linux and understands the basic principles of writing shell scripts. Some experience of MATLAB will also be helpful.

This project is suitable for HLT students.

Initial reading

  • R. Kuhn, J.-C. Junqua, P. Nguyen and N. Niedzielski (2000) “Rapid speaker adaptation in eigenvoice space”, IEEE Trans. Speech and Audio Proc., Vol. 8, No. 6, pp. 695–707 here
  • P. C. Woodland (2001) “Speaker adaptation for continuous density HMMs: A review”, Invited Lecture, Adaptation-2001, pp. 11-19.
  • P. S. Aleksic and A. K. Katsaggelos (2004) “Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition” Proc ICASSP 2004, Volume 5, 917-920 here
  • M. Cooke, J. Barker, S. Cunningham and X. Shao, (2006) “An audio-visual corpus for speech perception and automatic speech recognition.” Journal of the Acoustical Society of America 120(5), 2421-2424, 2006 here
[TOP]


JPB-MSc-3: Audio-based ego-motion estimation (HLT/ACS/DataComms)

Description

Consider a robot moving around its environment using only eyes and ears as input. How can such a robot keep precise track of its movement? This problem of `egomotion estimation’ is a major research topic in robotics. A naive robot would simply trust that its motors are making it move as instructed: if it tells its motors to turn it 10 degrees to the left, it would trust that it had turned exactly 10 degrees to the left. Unfortunately, this is a poor strategy: motors are not precise, wheels can slip and external agents can act to cause the robot to move in unexpected ways. If this naive strategy were employed the robot’s belief would quickly diverge from reality. A more robust solution is to confirm the expected motion using feedback from the environment. In the case of a simple audio-visual robot this would mean using input from the microphones (ears) and cameras (eyes). For example, if the head turns to the left, the extent of the turn can be measured by the degree to which the visual image moves to the right.

Projects JB-3 and JB-4 will examine the egomotion problem using binaural (two ear) and stereoscopic (two eye) recordings that have been made using in-ear microphones and head-mounted cameras worn by a human subject whose head movements are tracked by a precise head-tracking system. Recordings have been made in a sequence of acoustic and visual environments of increasing complexity, allowing plenty of scope for experimentation. Two projects are proposed. JB-3 will attempt to estimate the head motion using purely acoustic cues recovered from the audio recordings. It will be evaluated by comparing the estimated motion with the output of the head-tracker. JB-4 will be the visual counterpart of JB-3, using visual cues such as optical flow to estimate the head motion. If both projects are adopted there may even be an opportunity to combine the two systems to produce a complete audio-visual solution.

The visual project, JB-4, will use tried and tested optical flow techniques as a starting point. It is expected that the student will work with the OpenCV library so that a baseline solution can be developed quickly. Project JB-3 is more experimental: the egomotion estimation problem has been studied extensively using video input but there is very little previous work using audio. It is fairly well understood how we can estimate the direction of a sound source using differences between the signals arriving at the left and right ears. In short, subtle inter-aural (`between ear’) time differences (ITDs) and inter-aural level differences (ILDs) can be used to estimate the angle of arrival of a sound. This technique is normally used to localise potentially moving sound sources. However, the same cues could potentially be used to estimate the motion of the listener’s head by detecting apparent changes in the position of stationary background sound sources. For example, if the ticking sound of my wall clock appears to move 10 degrees to the left, this is probably because I have moved my head 10 degrees to the right. If all the sounds in my office move to the left at the same time and by the same amount, then the obvious explanation is that these sources are actually stationary and that it is my head that has moved. JB-3 will use this idea to develop a concept of `acoustic flow’ equivalent to the idea of `optical flow’ used in JB-4 (a simple sketch of the idea is given below).
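
The Python fragment below sketches the acoustic side of this idea: cross-correlate the left and right channels over short frames to estimate the inter-aural time difference of the dominant background source, convert it to an apparent azimuth with a simple far-field model, and read the head rotation off the drift of that azimuth over time. The microphone spacing, frame length and the use of plain cross-correlation (rather than, say, the weighted CSP analysis or the binaural model in the reading list) are illustrative assumptions; existing MATLAB localisation software would be the real starting point.

```python
"""Sketch: `acoustic flow' -- head rotation from the apparent drift of a
stationary sound source.  Microphone spacing, frame length and the use of
plain cross-correlation are illustrative assumptions.
"""
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
EAR_SPACING = 0.18       # m, assumed spacing of the in-ear microphones


def itd_to_azimuth(itd):
    """Far-field approximation: sin(azimuth) = ITD * c / d, in degrees."""
    s = np.clip(itd * SPEED_OF_SOUND / EAR_SPACING, -1.0, 1.0)
    return np.degrees(np.arcsin(s))


def frame_itd(left, right, fs, max_lag_s=0.8e-3):
    """ITD of the dominant source in one frame, from the lag that maximises
    the cross-correlation of the left and right channels."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left[max(0, -l):len(left) - max(0, l)],
                    right[max(0, l):len(right) - max(0, -l)]) for l in lags]
    return lags[int(np.argmax(xcorr))] / fs


def head_rotation(left_sig, right_sig, fs, frame_len=2048):
    """Apparent azimuth track of a (presumed stationary) background source;
    its negated drift since the first frame is the estimated head rotation."""
    azimuths = []
    for start in range(0, len(left_sig) - frame_len, frame_len):
        itd = frame_itd(left_sig[start:start + frame_len],
                        right_sig[start:start + frame_len], fs)
        azimuths.append(itd_to_azimuth(itd))
    azimuths = np.asarray(azimuths)
    return -(azimuths - azimuths[0])    # degrees turned since the first frame
```

A fuller treatment would pool evidence across many background sources, as in the wall-clock example above, rather than trusting a single dominant source per frame.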

For project JB-3 there will also be an opportunity to record fresh data using a recently acquired head-tracking system that can be easily set up and run from a laptop computer.

Additional information

For JB-3 MATLAB programming experience will be an advantage as it will enable existing localisation software to be used as a starting point.

JB-4 will make extensive use of the OpenCV computer vision library which provides functionality for optical flow estimation. Some previous experience of programming in C will be beneficial but a well-motivated student should be able to learn sufficient C in order to complete the project.
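
For orientation, the snippet below shows the kind of baseline JB-4 might begin with: dense Farneback optical flow between consecutive frames, with the median horizontal flow converted into an approximate yaw angle. It uses the OpenCV Python bindings purely for brevity (the project assumes the C library), and the focal length and the median-flow heuristic are illustrative assumptions rather than a recommended design.

```python
"""Sketch: yaw estimate from dense optical flow (JB-4 baseline idea).

Uses the OpenCV Python bindings as a stand-in for the C API; the focal
length and the median-flow heuristic are illustrative assumptions.
"""
import numpy as np
import cv2


def yaw_from_flow(prev_grey, next_grey, focal_px=700.0):
    """Approximate horizontal head rotation (degrees) between two consecutive
    8-bit greyscale frames, from the median horizontal flow component."""
    # Farneback dense flow parameters, in order: pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev_grey, next_grey, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx = float(np.median(flow[..., 0]))               # pixels of image motion
    # If the whole image shifts right, the head has turned left (sign assumed).
    return -np.degrees(np.arctan2(dx, focal_px))
```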

JB-3 is suitable for HLT or ACS students, and JB-4 is suitable for DataComms and ACS students.

Initial reading

JB-3

  • Y. Denda, T. Nishiura and Y. Yamashita (2006) “Robust talker direction estimation based on weighted CSP analysis and maximum likelihood estimation”, IEICE Trans. Inf. & Syst. Vol. E89-D, No. 30, pp. 1050–1057 (here)
  • T. M. Shackleton, R. Meddis and J. Hewitt (1992) “Across frequency integration in a model of lateralization”, JASA 91(4), 2276–2279 (here)

JB-4

  • OpenCV - Intel’s Open source computer vision library (http://www.intel.com/technology/computing/opencv/).
  • A.-T. Tsao, C.-S. Fuh, Y.-P. Hung and Y.-S. Chen, (1997) “Ego-motion estimation using optical flow fields observed from multiple cameras”, Proceedings of CVPR’97 p.457 (here)
  • L. Zelnik-Manor (2004) “The optical flow field” Invited talk, Caltech 2004 (here)
  • T. Mori (1985) “An active method of extracting egomotion parameters from optical flow” Biological Cybernetics 52(6):405-407 (here)

[TOP]


JPB-MSc-5: Face detection and tracking in stereoscopic video data

Description

As part of a large EC-funded research project the Department has recently made a set of video recordings using a pair of head-mounted stereoscopic cameras. The recordings have been designed in part to test person-tracking algorithms using a series of increasingly complex tests involving one or more actors. Of particular interest is the fact that the stereoscopic recordings allow object locations to be estimated not only in terms of left/right and up/down but also in terms of distance (i.e. near/far). This project will apply ‘off-the-shelf’ face detection algorithms (e.g. Viola and Jones, 2001) to this data in order to detect and track faces appearing in the left and right video images. Then, by pairing faces across the left and right images and measuring the apparent difference in their positions (the ‘disparity’), the distance can also be estimated (see the sketch below).
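
As a concrete starting point, the sketch below detects faces in a left/right frame pair using OpenCV's stock Haar cascade (a Viola-Jones style detector), pairs detections by vertical position, and converts the horizontal disparity of each pair into a rough distance via depth = focal length * baseline / disparity. It is written against the OpenCV Python bindings for brevity, and the baseline, focal length and naive pairing rule are placeholder assumptions, not properties of the Department's recordings.

```python
"""Sketch: face detection in a stereo pair and a rough depth from disparity.

The camera baseline and focal length are placeholders, not values from the
Department's stereoscopic recordings.
"""
import cv2

BASELINE_M = 0.10    # assumed separation of the two head-mounted cameras (m)
FOCAL_PX = 700.0     # assumed focal length in pixels

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')


def detect_faces(grey):
    """Viola-Jones style detection; returns (x, y, w, h) boxes."""
    return detector.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)


def face_distances(left_grey, right_grey):
    """Pair each left-image face with the right-image face at the closest
    vertical position, then estimate depth = f * B / disparity."""
    left_faces = detect_faces(left_grey)
    right_faces = detect_faces(right_grey)
    distances = []
    for (xl, yl, wl, hl) in left_faces:
        # Naive pairing: the right-image face whose centre row is closest.
        match = min(right_faces,
                    key=lambda f: abs((f[1] + f[3] / 2) - (yl + hl / 2)),
                    default=None)
        if match is None:
            continue
        disparity = (xl + wl / 2) - (match[0] + match[2] / 2)    # pixels
        if disparity > 0:
            distances.append(FOCAL_PX * BASELINE_M / disparity)  # metres
    return distances
```

In practice the pairing would also exploit the calibrated geometry of the camera rig, but this simple rule is enough to illustrate how the near/far estimate described above can be obtained.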

One particular focus of the study will be to demonstrate how the distance estimate can be used to resolve the confusion that can occur when a person being tracked walks behind another. Trackers that use a single camera can mistakenly lock onto the face of the nearer person. With two cameras, it should be possible to use the distance estimate to prevent this from happening.

Additional information

This is a challenging project that is suitable for a confident ACS or DataComms MSc student. DataComms students who have taken the computer vision module are particularly well qualified.

The project will make use of the OpenCV computer vision library which provides functionality for face detection. Some previous experience of programming in C will be beneficial but a well-motivated student should be able to learn sufficient C in order to complete the project.

Initial reading

  • OpenCV - Intel’s Open source computer vision library (http://www.intel.com/technology/computing/opencv/).
  • P. Viola and M. Jones (2001) “Rapid object detection using a boosted cascade of simple features” Proc CVPR 2001 Volume 1, pages 511-518 (here)
  • S. Gutierrez and J. L. Marroquin (2004) “Robust approach for disparity estimation in stereo vision”, Image and Vision Computing 22(3):183-195
  • Example Java code for computing image disparity (http://sourceforge.net/projects/daoi/).
[TOP]