Discriminative Training

The previous sections have described how maximum likelihood (ML) estimates of the HMM model parameters can be initialised and estimated. This section briefly describes how discriminative training is implemented in HTK. It is not meant as a definitive guide to discriminative training; rather, it aims to give sufficient information so that the configuration and command-line options associated with the discriminative training tool HMMIREST can be understood.

HTK supports discriminative training using the HMMIREST tool. Both the Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) training criteria are supported. In both cases the aim is to estimate the HMM parameters in such a way as to (approximately) reduce the error rate on the training data. Hence the criteria take into account not only the actual word-level transcription of the training data but also ``confusable'' hypotheses which give rise to similar language model / acoustic model log likelihoods. The form of the MMI criterion to be maximised may be expressed as

$\displaystyle {\cal F}_{\tt mmi}(\lambda) = \frac{1}{R}\sum_{r=1}^R\log\left( P({\cal H}^r_{\tt ref}\vert{\mbox{\boldmath$O$}}^r,\lambda) \right) = \frac{1}{R}\sum_{r=1}^R\log\left( \frac{P({\mbox{\boldmath$O$}}^r\vert{\cal H}^r_{\tt ref},\lambda)P({\cal H}^r_{\tt ref})}{\sum_{\cal H}P({\mbox{\boldmath$O$}}^r\vert{\cal H},\lambda)P({\cal H})} \right)$     (8.8)

Thus the average log-posterior of the reference, $ {\cal H}^r_{\tt ref}$, is maximised. Here the summation over $ {\cal H}$ is, in principle, over all possible word sequences. In practice it is restricted to the set of confusable hypotheses, which is defined by a lattice.
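As a rough illustration, the sketch below (in Python, not part of HTK) evaluates this criterion from log-domain scores. The data layout, with a hypothetical 'ref' score and a list of per-hypothesis 'hyps' scores for each utterance, is assumed purely for the example; in practice HMMIREST derives these quantities from word lattices.

    import math

    def log_sum_exp(xs):
        """Numerically stable log(sum(exp(x))) over log-domain scores."""
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def mmi_criterion(utterances):
        """Average log-posterior of the reference (equation 8.8).

        Each element of `utterances` is assumed to hold:
          'ref'  : log P(O^r|H_ref,lambda) + log P(H_ref)
          'hyps' : the corresponding score for every lattice hypothesis,
                   including the reference itself
        """
        total = 0.0
        for utt in utterances:
            # log numerator minus log denominator = log posterior of the reference
            total += utt['ref'] - log_sum_exp(utt['hyps'])
        return total / len(utterances)

Note that the denominator sum is carried out in the log domain via log_sum_exp, since the acoustic likelihoods involved are far too small to represent directly.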

The MPE training criterion is an example of minimum Bayes' risk training. The general criterion to be minimised can be written as

$\displaystyle {\cal F}_{\tt mpe}(\lambda) = \sum_{r=1}^R\sum_{\cal H}P({\cal H}\vert{\mbox{\boldmath$O$}}^r,\lambda) {\cal L}({\cal H},{\cal H}^r_{\tt ref})$     (8.9)

where $ {\cal L}({\cal H},{\cal H}^r_{\tt ref})$ is the ``loss'' between the hypothesis $ {\cal H}$ and the reference, $ {\cal H}^r_{\tt ref}$. In general, various forms of loss function may be used; in MPE training, however, the loss is the Levenshtein edit distance between the phone sequences of the reference and the hypothesis. In HTK, rather than minimising this expression, the normalised average phone accuracy is maximised. This may be expressed as
$\displaystyle {\hat{\lambda}} = \arg\max_{\lambda}\left\{ 1 - \frac{1}{\sum_{r=1}^RQ^r}{\cal F}_{\tt mpe}(\lambda) \right\}$     (8.10)

where $ Q^r$ is the number of phones in the transcription for training sequence $ r$.
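To make the loss and the normalisation concrete, the sketch below computes the phone-level Levenshtein distance and the normalised average phone accuracy of equation 8.10. It assumes the posterior-weighted expected loss for each utterance has already been computed; the function names and data layout are illustrative only, not the HMMIREST implementation.

    def levenshtein(ref, hyp):
        """Edit distance between two phone sequences (the MPE loss)."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (r != h)))  # substitution
            prev = cur
        return prev[-1]

    def normalised_phone_accuracy(expected_losses, phone_counts):
        """1 - F_mpe / sum_r Q^r, the quantity maximised in equation 8.10.

        expected_losses[r] approximates sum_H P(H|O^r,lambda) L(H, H_ref),
        and phone_counts[r] is Q^r, the number of reference phones.
        """
        return 1.0 - sum(expected_losses) / sum(phone_counts)

For example, levenshtein(['s', 'iy', 't'], ['s', 'ih', 't']) gives 1, a single phone substitution.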

In the HMMIREST implementation the language model scores, including the grammar scale factor, are combined into the acoustic models to yield a numerator acoustic model, $ {\cal M}^{\tt num}_r$, and a denominator acoustic model, $ {\cal M}^{\tt den}_r$, for utterance $ r$. In this case the MMI criterion can be expressed as

$\displaystyle {\cal F}_{\tt mmi}(\lambda) = \sum_{r=1}^R\log\left( \frac{P({\mbox{\boldmath$O$}}^r\vert{\cal M}^{\tt num}_r)}{P({\mbox{\boldmath$O$}}^r\vert{\cal M}^{\tt den}_r)} \right)$     (8.11)

and the MPE criterion is expressed as
$\displaystyle {\cal F}_{\tt mpe}(\lambda) = \sum_{r=1}^R\sum_{\cal H} \left(\frac{P({\mbox{\boldmath$O$}}^r\vert{\cal M}_{\cal H})}{P({\mbox{\boldmath$O$}}^r\vert{\cal M}^{\tt den}_r)}\right) {\cal L}({\cal H},{\cal H}^r_{\tt ref})$     (8.12)

where $ {\cal M}_{\cal H}$ is the acoustic model for hypothesis $ {\cal H}$.
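Under the same illustrative data layout as before, both criteria can then be written in terms of these combined models: a numerator score for equation 8.11, and per-hypothesis scores whose log-sum gives the denominator score for equations 8.11 and 8.12. The sketch below is an assumption-laden outline, not the HMMIREST implementation.

    import math

    def log_sum_exp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def mmi_num_den(utterances):
        """MMI criterion with LM scores folded into the models (equation 8.11)."""
        total = 0.0
        for utt in utterances:
            log_den = log_sum_exp(utt['hyp_scores'])   # log P(O^r|M_den_r)
            total += utt['log_num'] - log_den          # minus log P(O^r|M_num_r)
        return total

    def mpe_criterion(utterances):
        """MPE criterion (equation 8.12): posterior-weighted phone loss."""
        total = 0.0
        for utt in utterances:
            log_den = log_sum_exp(utt['hyp_scores'])
            for log_h, loss in zip(utt['hyp_scores'], utt['losses']):
                # P(O^r|M_H) / P(O^r|M_den_r) weights the loss L(H, H_ref)
                total += math.exp(log_h - log_den) * loss
        return total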

In practice, approximate forms of the MMI and normalised average phone accuracy criteria are optimised.


