Stuart N WrigleyBSc(Hons) PhD MIET SMIEEE MAHEP
Operations and Business Development Manager, UKRI CDT in Speech and Language Technologies and their Applications

Speech Recognition by Dynamic Time Warping

Dynamic Time Warping

In order to understand DTW, two concepts need to be dealt with,

The question of optimal feature extraction is not trivial. If we are interested in recognition, then an ideal feature extractor might be one that produces a string of words (making the rest of the recognizer redundant)! On the other hand, separating the feature extraction process from the pattern recognition process is a sensible thing to do, since it enables us to encapsulate the pattern recognition process.

We shall concern ourselves with frame based feature extraction. That is, the feature analysis consists of computing a feature vector at regular intervals. For example if we perform a filterbank analysis, then our feature vector may consist of the energies in each band averaged over (say) 20ms. For a linear predictive analysis the feature vector consists of the prediction coefficients (or transformations of them). A common feature vector type used in speech recognition are Mel Frequency Cepstral Coefficients (MFCCs).

Since the feature vectors could possibly have multiple elements, a means of calculating the local distance is required. The distance measure between two feature vectors is calculated using the Euclidean distance metric. Therefore the local distance between feature vector x of signal 1 and feature vector y of signal 2 is given by,


Although the Euclidean metric is computationally more expensive than some metrics, it does give more weight to large differences in a single feature. It can also be shown that this metric has several desirable theoretical properties when comparing cepstra, as in this project.

If it is required to trace back along the best-matching path (rather than just knowing the score at the end of it) then a backtrace array (or backpointer array) must be kept with entries in the array pointing to the preceding point in that path.