Stuart N Wrigley BSc(Hons) PhD MIET SMIEEE MAHEP
Operations and Business Development Manager, UKRI CDT in Speech and Language Technologies and their Applications

Speech Recognition by Dynamic Time Warping

Symmetrical DTW Algorithm

Speech is a time-dependent process. Several utterances of the same word are likely to have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the words being spoken at different rates. To obtain a global distance between two speech patterns (represented as a sequence of vectors) a time alignment must be performed.

This problem is illustrated in figure 0, in which a ``time-time'' matrix is used to visualize the alignment. As with all the time alignment examples, the reference pattern (template) goes up the side and the input pattern goes along the bottom. In this illustration the input ``SsPEEhH'' is a `noisy' version of the template ``SPEECH''. The idea is that `h' is a closer match to `H' than to anything else in the template. The input ``SsPEEhH'' will be matched against all templates in the system's repository. The best matching template is the one with the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is the sum of the local distances that make up the path.

Figure 0: Illustration of a time alignment path between a template pattern ``SPEECH'' and a noisy input ``SsPEEhH''.
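As a toy illustration of how a global distance is accumulated (treating the characters in the figure as stand-in frames; a real system compares acoustic feature vectors), the sketch below scores one hand-picked alignment path by summing its local distances. The character-based distance function is purely hypothetical:

    # Toy illustration: score one particular alignment path between the
    # template "SPEECH" and the noisy input "SsPEEhH".
    template = "SPEECH"
    utterance = "SsPEEhH"

    def local_distance(a, b):
        # 0 for an exact match, 0.5 for a case-insensitive match
        # (e.g. 'h' vs 'H'), 1 otherwise.
        if a == b:
            return 0.0
        if a.lower() == b.lower():
            return 0.5
        return 1.0

    # One hand-picked path of (input index, template index) pairs.
    path = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
    print(sum(local_distance(utterance[i], template[j]) for i, j in path))

The recogniser's task is to find, over all templates and all valid paths, the path with the lowest such sum; the dynamic programming algorithm described below does this efficiently.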

How do we find the best-matching (i.e. lowest global distance) path between an input and a template? We could evaluate all possible paths, but this is extremely inefficient because the number of possible paths is exponential in the length of the input. Instead, we will consider what constraints there are on the matching process (or that we can impose on it) and use those constraints to come up with an efficient algorithm. The constraints we shall impose are straightforward and not very restrictive:

For now we will say that every frame in the template and the input must be used in a matching path. This means that if we take a point (i,j) in the time-time matrix (where i indexes the input pattern frame and j the template frame), then the previous point must have been (i-1,j-1), (i-1,j) or (i,j-1) (see figure 1.4). The key idea in dynamic programming is that at point (i,j) we simply continue with the lowest-distance path from (i-1,j-1), (i-1,j) or (i,j-1).
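As a rough illustration of the exponential growth mentioned above, the following sketch counts the alignment paths permitted by these three predecessor moves; for two patterns of length n the count grows roughly like 5.8^n, which is why exhaustive evaluation is hopeless:

    # Count the monotone alignment paths from (1, 1) to (n, m) when each step
    # moves one cell right, up, or diagonally, as described above.
    def count_paths(n, m):
        counts = [[0] * (m + 1) for _ in range(n + 1)]
        counts[1][1] = 1
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if (i, j) != (1, 1):
                    counts[i][j] = (counts[i - 1][j]
                                    + counts[i][j - 1]
                                    + counts[i - 1][j - 1])
        return counts[n][m]

    for n in (5, 10, 20, 40):
        print(n, count_paths(n, n))  # grows roughly like 5.8**n

Dynamic programming sidesteps this explosion by keeping, at each point, only the best path found so far.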

This algorithm is known as Dynamic Programming (DP). When applied to template-based speech recognition, it is often referred to as Dynamic Time Warping (DTW). DP is guaranteed to find the lowest distance path through the matrix, while minimizing the amount of computation. The DP algorithm operates in a time-synchronous manner: each column of the time-time matrix is considered in succession (equivalent to processing the input frame-by-frame) so that, for a template of length N, the maximum number of paths being considered at any time is N.

If D(i,j) is the global distance up to, and including, point (i,j) and the local distance at (i,j) is given by d(i,j), then

    D(i,j) = min[D(i-1,j-1), D(i-1,j), D(i,j-1)] + d(i,j)    (1)

Given that D(1,1) = d(1,1) (this is the initial condition), we have the basis for an efficient recursive algorithm for computing D(i,j). The final global distance D(n,N), where n is the length of the input and N the length of the template, gives us the overall matching score of the template with the input. The input word is then recognized as the word corresponding to the template with the lowest matching score. (Note that N will be different for each template.)

For basic speech recognition, DP has a small memory requirement: the only storage required by the search (as distinct from the templates) is an array that holds a single column of the time-time matrix.

Note: For ease of explanation, it is assumed that the columns and rows are numbered from 0 onwards.

This means that the only directions in which the match path can move when at (i,j) in the time-time matrix are:

  - horizontally, to (i+1, j);
  - vertically, to (i, j+1);
  - diagonally, to (i+1, j+1).

Computationally, (1) is already in a form that could be recursively programmed. However, unless the language is optimised for recursion, this method can be slow even for relatively small pattern sizes. Another method, which is both quicker and requires less memory, uses two nested for loops. This method needs only two arrays that hold adjacent columns of the time-time matrix.
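For reference, a direct memoised transcription of recurrence (1) might look like the Python sketch below (zero-based indexing; the patterns are assumed to be sequences of feature vectors, and d(x, y) is an assumed, user-supplied local distance between two vectors):

    from functools import lru_cache

    def dtw_recursive(inp, template, d):
        # Direct transcription of recurrence (1), memoised so that each cell
        # of the time-time matrix is evaluated at most once.
        @lru_cache(maxsize=None)
        def D(i, j):
            cost = d(inp[i], template[j])
            if i == 0 and j == 0:   # initial condition: the origin's global
                return cost         # cost is just its local cost
            if i == 0:              # first column: only a vertical predecessor
                return cost + D(i, j - 1)
            if j == 0:              # first row: only a horizontal predecessor
                return cost + D(i - 1, j)
            return cost + min(D(i - 1, j - 1), D(i - 1, j), D(i, j - 1))

        return D(len(inp) - 1, len(template) - 1)

Even with memoisation, deep recursion can exhaust the call stack for long patterns, which is one more reason to prefer the iterative two-column method described next.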

The algorithm to find the least global cost is:

  1. Calculate column 0, starting at the bottommost cell. The global cost to this cell is just its local cost. Then, the global cost for each successive cell is the local cost for that cell plus the global cost to the cell below it. This column is called predCol (the predecessor column).
  2. Calculate the global cost to the first cell of the next column (curCol). This is the local cost for the cell plus the global cost to the bottommost cell of the previous column.
  3. Calculate the global cost of the remaining cells of curCol. For example, at (i,j) this is the local distance at (i,j) plus the minimum of the global costs at (i-1,j), (i-1,j-1) and (i,j-1).
  4. Assign curCol to predCol and repeat from step 2 until all columns have been calculated.
  5. The global cost is the value stored in the topmost cell of the last column.

In Python-style pseudocode, one possible realisation of this process (assuming each pattern is a sequence of feature vectors and that d(x, y) is a user-supplied local distance, such as the Euclidean distance between two frames) is:
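    def dtw(inp, template, d):
        # Two-column search implementing steps 1-5 above.  Columns correspond
        # to input frames, rows to template frames, and d(x, y) is the local
        # distance between two feature vectors.
        n_rows = len(template)

        # Step 1: the first column.  The bottommost cell's global cost is its
        # local cost; each cell above adds the global cost of the cell below it.
        pred_col = [d(inp[0], template[0])]
        for j in range(1, n_rows):
            pred_col.append(d(inp[0], template[j]) + pred_col[j - 1])

        for i in range(1, len(inp)):
            # Step 2: the bottommost cell of the current column.
            cur_col = [d(inp[i], template[0]) + pred_col[0]]
            # Step 3: every other cell adds the cheapest of its three possible
            # predecessors to its local cost.
            for j in range(1, n_rows):
                cur_col.append(d(inp[i], template[j]) + min(
                    pred_col[j],      # (i-1, j)   horizontal predecessor
                    pred_col[j - 1],  # (i-1, j-1) diagonal predecessor
                    cur_col[j - 1],   # (i, j-1)   vertical predecessor
                ))
            # Step 4: the current column becomes the predecessor column.
            pred_col = cur_col

        # Step 5: the global cost is in the topmost cell of the last column.
        return pred_col[-1]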

To build a recogniser, the input pattern is taken and the above process is repeated for each template file. The template file that gives the lowest global cost is the word estimate.
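As a hypothetical usage sketch (the names recognise, templates and input_pattern are illustrative, not from the original), the recogniser reduces to picking the word whose template yields the smallest DTW cost:

    # Hypothetical usage of dtw() above: `templates` maps each word label to
    # its template pattern, `input_pattern` is the unknown utterance, and all
    # patterns are sequences of feature vectors.
    def recognise(input_pattern, templates, d):
        best_word, best_cost = None, float("inf")
        for word, template in templates.items():
            cost = dtw(input_pattern, template, d)
            if cost < best_cost:
                best_word, best_cost = word, cost
        return best_word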