Isolated Word Recognition

Let each spoken word be represented by a sequence of speech vectors or observations $\mathbf{O}$, defined as

$\displaystyle \mathbf{O} = \mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T$ (1.1)

where $\mathbf{o}_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then be regarded as that of computing

$\displaystyle \arg\max_i \left\{ P(w_i \vert \mathbf{O}) \right\}$ (1.2)

where $w_i$ is the $i$'th vocabulary word. This probability is not directly computable, but using Bayes' Rule gives

$\displaystyle P(w_i \vert \mathbf{O}) = \frac{P(\mathbf{O}\vert w_i)\, P(w_i)}{P(\mathbf{O})}$ (1.3)

Thus, for a given set of prior probabilities $P(w_i)$, the most probable spoken word depends only on the likelihood $P(\mathbf{O}\vert w_i)$. Given the dimensionality of the observation sequence $\mathbf{O}$, the direct estimation of the joint conditional probability $P(\mathbf{o}_1,\mathbf{o}_2,\ldots \vert w_i)$ from examples of spoken words is not practicable. However, if a parametric model of word production such as a Markov model is assumed, then estimation from data is possible since the problem of estimating the class conditional observation densities $P(\mathbf{O}\vert w_i)$ is replaced by the much simpler problem of estimating the Markov model parameters.
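To make the decision rule of equations 1.2 and 1.3 concrete, the following minimal Python sketch (not part of HTK) picks the most probable word given per-word log likelihoods $\log P(\mathbf{O}\vert w_i)$ that are assumed to have been computed already; the vocabulary and all numerical values are invented purely for illustration. Since $P(\mathbf{O})$ is common to every word, it drops out of the arg max.

\begin{verbatim}
# Hypothetical illustration of equation 1.2 via Bayes' Rule.
# log P(O|w_i) and log P(w_i) are made-up values; P(O) is common
# to all words and so can be ignored in the arg max.
log_likelihood = {"one": -310.2, "two": -295.7, "three": -328.4}
log_prior      = {"one": -1.1,   "two": -1.1,   "three": -1.1}

best_word = max(log_likelihood,
                key=lambda w: log_likelihood[w] + log_prior[w])
print(best_word)   # -> two
\end{verbatim}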

[Figure 1.1: Isolated Word Problem]

In HMM based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model as shown in Fig. 1.2. A Markov model is a finite state machine which changes state once every time unit, and each time $t$ that a state $j$ is entered, a speech vector $\mathbf{o}_t$ is generated from the probability density $b_j(\mathbf{o}_t)$. Furthermore, the transition from state $i$ to state $j$ is also probabilistic and is governed by the discrete probability $a_{ij}$. Fig. 1.2 shows an example of this process where the six state model moves through the state sequence $X=1,2,2,3,4,4,5,6$ in order to generate the sequence $\mathbf{o}_1$ to $\mathbf{o}_6$. Notice that in HTK, the entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite models as explained in more detail later.

The joint probability that $\mathbf{O}$ is generated by the model $M$ moving through the state sequence $X$ is calculated simply as the product of the transition probabilities and the output probabilities. So for the state sequence $X$ in Fig. 1.2,

$\displaystyle P(\mathbf{O},X\vert M) = a_{12}\, b_2(\mathbf{o}_1)\, a_{22}\, b_2(\mathbf{o}_2)\, a_{23}\, b_3(\mathbf{o}_3) \ldots$ (1.4)
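As a concrete illustration of equation 1.4, the Python sketch below (not HTK code) evaluates $P(\mathbf{O},X\vert M)$ for the state sequence of Fig. 1.2 using a toy six state model; the transition probabilities, output densities and observations are all invented for the example.

\begin{verbatim}
import numpy as np

# Toy 6-state model in the style of Fig. 1.2: states 1 and 6 are the
# non-emitting entry/exit states, states 2-5 emit.  All numbers invented.
a = np.zeros((7, 7))                 # a[i, j] = transition probability a_ij
a[1, 2] = 1.0
for j in range(2, 6):
    a[j, j], a[j, j + 1] = 0.6, 0.4  # self-loop or move to the next state

means = {j: float(j) for j in range(2, 6)}

def b(j, o):
    # Output density b_j(o): a 1-D unit-variance Gaussian (illustrative only)
    return np.exp(-0.5 * (o - means[j]) ** 2) / np.sqrt(2.0 * np.pi)

# Observations o_1..o_6 and the state sequence X = 1,2,2,3,4,4,5,6
O = [2.0, 2.1, 3.0, 4.0, 4.1, 5.0]
X = [1, 2, 2, 3, 4, 4, 5, 6]

# Equation 1.4: product of transition and output probabilities along X
p = a[X[0], X[1]]
for t, o_t in enumerate(O, start=1):
    p *= b(X[t], o_t) * a[X[t], X[t + 1]]
print(p)                             # P(O, X | M) for this toy model
\end{verbatim}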

However, in practice, only the observation sequence $\mathbf{O}$ is known and the underlying state sequence $X$ is hidden. This is why it is called a Hidden Markov Model.

[Figure 1.2: The Markov Generation Model]

Given that $X$ is unknown, the required likelihood is computed by summing over all possible state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is

$\displaystyle P(\mathbf{O}\vert M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^T b_{x(t)}(\mathbf{o}_t)\, a_{x(t)x(t+1)}$ (1.5)

where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model exit state.

As an alternative to equation 1.5, the likelihood can be approximated by only considering the most likely state sequence, that is

$\displaystyle \hat{P}(\mathbf{O}\vert M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^T b_{x(t)}(\mathbf{o}_t)\, a_{x(t)x(t+1)} \right\}$ (1.6)

Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive procedures exist which allow both quantities to be calculated very efficiently. Before going any further, however, notice that if equation 1.2 is computable then the recognition problem is solved. Given a set of models $M_i$ corresponding to words $w_i$, equation 1.2 is solved by using 1.3 and assuming that
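The recursive procedures referred to here are the forward recursion for equation 1.5 and the Viterbi recursion for equation 1.6. The sketch below is an illustrative Python rendering of both (it is not HTK source code); it assumes the same conventions as the toy model above, i.e. transition probabilities a[i, j], an output density function b(j, o), a non-emitting entry state 1 and a non-emitting exit state N, and could be applied directly to the a, b and O defined in the previous sketch.

\begin{verbatim}
def forward_likelihood(a, b, O, N):
    # P(O|M) of equation 1.5 by the forward recursion.
    # alpha[j] holds P(o_1..o_t, x(t)=j | M) for the emitting states 2..N-1.
    emitting = range(2, N)
    alpha = {j: a[1, j] * b(j, O[0]) for j in emitting}
    for o_t in O[1:]:
        alpha = {j: sum(alpha[i] * a[i, j] for i in emitting) * b(j, o_t)
                 for j in emitting}
    return sum(alpha[i] * a[i, N] for i in emitting)

def viterbi_likelihood(a, b, O, N):
    # The approximation of equation 1.6: replace the sums by maxima so that
    # only the most likely state sequence contributes.
    emitting = range(2, N)
    phi = {j: a[1, j] * b(j, O[0]) for j in emitting}
    for o_t in O[1:]:
        phi = {j: max(phi[i] * a[i, j] for i in emitting) * b(j, o_t)
               for j in emitting}
    return max(phi[i] * a[i, N] for i in emitting)

# e.g. forward_likelihood(a, b, O, N=6) using the toy model defined earlier.
\end{verbatim}

In a real recogniser these products of many small probabilities would be computed in the log domain to avoid numerical underflow.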

$\displaystyle P(\mathbf{O}\vert w_i) = P(\mathbf{O}\vert M_i).$ (1.7)

All this, of course, assumes that the parameters $\{a_{ij}\}$ and $\{b_{j}(\mathbf{o}_t)\}$ are known for each model $M_i$. Herein lies the elegance and power of the HMM framework. Given a set of training examples corresponding to a particular model, the parameters of that model can be determined automatically by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of representative examples of each word can be collected, a HMM can be constructed which implicitly models all of the many sources of variability inherent in real speech. Fig. 1.3 summarises the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word using a number of examples of that word. In this case, the vocabulary consists of just three words: ``one'', ``two'' and ``three''. Secondly, to recognise some unknown word, the likelihood of each model generating that word is calculated and the most likely model identifies the word.

[Figure 1.3: Using HMMs for Isolated Word Recognition]
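Putting the pieces together, the recognition scheme of Fig. 1.3 amounts to training one model per vocabulary word and then choosing the model which gives the highest likelihood for the unknown utterance. The schematic Python sketch below is purely illustrative: the helper functions train_hmm and log_likelihood are hypothetical stand-ins for the parameter re-estimation and scoring procedures described later in this book, not HTK tools.

\begin{verbatim}
def recognise(unknown_O, training_data, train_hmm, log_likelihood):
    # training_data  : maps each vocabulary word to a list of example
    #                  observation sequences
    # train_hmm      : estimates an HMM from those examples
    # log_likelihood : scores a sequence against a model (forward or Viterbi)
    models = {word: train_hmm(examples)              # one HMM per word
              for word, examples in training_data.items()}
    # Equation 1.7: use P(O|M_i) in place of P(O|w_i); equal priors assumed.
    return max(models, key=lambda w: log_likelihood(models[w], unknown_O))
\end{verbatim}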

