Analyzing and evaluating the use of visemes in an interpolative synthesizer for visual speech

O. Martinez-Lazalde, "Analyzing and evaluating the use of visemes in an interpolative synthesizer for visual speech", PhD thesis, Department of Computer Science, The University of Sheffield, 2010. (Supervisor: Dr Steve Maddock).

PhD Abstract

Visemes are the visual counterpart of phonemes, with a single viseme typically representing a group of phonemes that are visually similar. These visemes are usually based on the static poses used in producing a phoneme (which we call static visemes) and are used in conjunction with an interpolative technique to create an interpolative visual speech synthesizer. This thesis uses a Constraint-based approach for the interpolation and investigates the use of three types of visemes: static visemes, coarticulated visemes and enhanced visemes. The data for the visemes is obtained from motion-captured speech from two speakers using an inexpensive 3D motion capture system. The captured data is mapped to a 3D synthetic face model using a novel hybrid approach based on Radial Basis Functions and Mixtures of Probabilistic Principal Component Analyzers. This process results in the 3D face model replicating the movement of the speakers. From this, facial animation parameters based on Principal Component Analysis are extracted and used to create data for the coarticulated and enhanced visemes that feed into the Constraint-based approach. The parameters are tuned using different corpora from the two speakers.
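
As a rough illustration of the two data-processing steps named above (Radial Basis Function mapping of captured motion onto a face model, followed by PCA extraction of facial animation parameters), the following minimal Python sketch shows one plausible form they could take. It is not the thesis implementation: the thesis uses a hybrid of RBFs and Mixtures of Probabilistic Principal Component Analyzers, whereas this sketch uses a plain Gaussian RBF, and the function names, array shapes and kernel width are illustrative assumptions.

import numpy as np

def rbf_map(src_landmarks, dst_landmarks, query_points, sigma=10.0):
    """Warp query_points from capture space to model space using a
    Gaussian RBF fitted on corresponding landmark pairs (Nx3 arrays)."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    K = kernel(src_landmarks, src_landmarks)
    # Solve for per-coordinate weights; the displacement formulation
    # reduces to the identity mapping when src == dst.
    weights = np.linalg.solve(K + 1e-8 * np.eye(len(K)),
                              dst_landmarks - src_landmarks)
    return query_points + kernel(query_points, src_landmarks) @ weights

def pca_parameters(frames, n_components=10):
    """frames: (T, 3V) flattened mapped vertex positions per frame.
    Returns the mean face, a PCA basis, and per-frame parameters."""
    mean = frames.mean(axis=0)
    U, S, Vt = np.linalg.svd(frames - mean, full_matrices=False)
    basis = Vt[:n_components]            # rows span the facial motion space
    params = (frames - mean) @ basis.T   # low-dimensional trajectories that
    return mean, basis, params           # could feed an interpolative synthesizer

In this framing, each captured frame would be passed through rbf_map to drive the synthetic face, and the resulting vertex trajectories stacked into frames for pca_parameters; the per-frame params are then the kind of low-dimensional facial animation parameters from which viseme targets can be derived.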

The results show that coarticulated visemes, which incorporate information about the context of a viseme, give the best performance of the different kinds of viseme and enhancement data tested. In addition, separate experiments revealed that using separate visemes for the phonemes /b/, /m/ and /p/ produced better results than using a single viseme for all three (which is what is typically done in interpolative synthesizers). A brief investigation of speaking rate also reveals that it needs to be incorporated into the parameters used in a visual speech synthesizer.

Publications