Discriminative Parameter Re-Estimation Formulae

For both MMI and MPE training, the estimation of the model parameters is based on variants of the Extended Baum-Welch (EBW) algorithm. In HTK the following form is used to estimate the means and covariance matrices
$\displaystyle \hat{{\mbox{\boldmath$\mu$}}}_{jm} = \frac{
\sum_{r=1}^R \sum_{t=1}^{T_r} (L^{{\tt num}r}_{jm}(t) - L^{{\tt den}r}_{jm}(t))\,{\bf o}^{r}(t)
+ D_{jm}{{\mbox{\boldmath$\mu$}}}_{jm} + \tau^{\tt I}{{\mbox{\boldmath$\mu$}}}^{\tt p}_{jm}}
{\sum_{r=1}^R \sum_{t=1}^{T_r} (L^{{\tt num}r}_{jm}(t) - L^{{\tt den}r}_{jm}(t))
+ D_{jm} + \tau^{\tt I}}$     (8.13)

and
$\displaystyle \hat{{\mbox{\boldmath$\Sigma$}}}_{jm} = \frac{
\sum_{r=1}^R \sum_{t=1}^{T_r} (L^{{\tt num}r}_{jm}(t) - L^{{\tt den}r}_{jm}(t))\,{\bf o}^{r}(t){\bf o}^{r}(t)^{\scriptstyle\sf T}
+ D_{jm}{\bf G}^{\tt s}_{jm} + \tau^{\tt I}{\bf G}^{\tt p}_{jm}}
{\sum_{r=1}^R \sum_{t=1}^{T_r} (L^{{\tt num}r}_{jm}(t) - L^{{\tt den}r}_{jm}(t))
+ D_{jm} + \tau^{\tt I}}
- \hat{{\mbox{\boldmath$\mu$}}}_{jm}\hat{{\mbox{\boldmath$\mu$}}}^{\scriptstyle\sf T}_{jm}$     (8.14)

where
$\displaystyle {\bf G}_{jm}^{\tt s} = {{\mbox{\boldmath$\Sigma$}}}_{jm} + {{\mbox{\boldmath$\mu$}}}_{jm}{{\mbox{\boldmath$\mu$}}}_{jm}^{\scriptstyle\sf T}$     (8.15)
$\displaystyle {\bf G}_{jm}^{\tt p} = {{\mbox{\boldmath$\Sigma$}}}^{\tt p}_{jm} + {{\mbox{\boldmath$\mu$}}}^{\tt p}_{jm}{{\mbox{\boldmath$\mu$}}}^{{\tt p}{\scriptstyle\sf T}}_{jm}$     (8.16)
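To make the structure of equations (8.13)-(8.16) concrete, the following NumPy sketch applies the update to a single Gaussian component given its accumulated numerator and denominator statistics. The function and argument names (ebw_update, num_x, den_xx, etc.) are illustrative only and are not part of HTK; HMMIREST accumulates and applies the corresponding quantities internally, and $ D_{jm}$ is normally chosen per Gaussian so that the updated variances remain positive.

```python
import numpy as np

def ebw_update(num_occ, num_x, num_xx,    # sum_t L^num, sum_t L^num o(t), sum_t L^num o(t) o(t)^T
               den_occ, den_x, den_xx,    # the corresponding denominator statistics
               mu, sigma,                 # current mean mu_jm and covariance Sigma_jm
               mu_p, sigma_p,             # I-smoothing prior mean and covariance
               D, tau_i):                 # Gaussian-specific constant D_jm and tau^I
    """Illustrative EBW mean/covariance update for one Gaussian, eqns (8.13)-(8.16)."""
    # G^s_jm and G^p_jm, eqns (8.15) and (8.16)
    G_s = sigma + np.outer(mu, mu)
    G_p = sigma_p + np.outer(mu_p, mu_p)

    # common denominator of eqns (8.13) and (8.14)
    denom = (num_occ - den_occ) + D + tau_i

    # mean update, eqn (8.13)
    mu_hat = ((num_x - den_x) + D * mu + tau_i * mu_p) / denom

    # covariance update, eqn (8.14)
    sigma_hat = ((num_xx - den_xx) + D * G_s + tau_i * G_p) / denom \
                - np.outer(mu_hat, mu_hat)

    return mu_hat, sigma_hat
```

For the diagonal covariance models normally used in HTK, only the diagonal elements of the resulting covariance matrix would be retained.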

The difference between the MMI and MPE criteria lies in how the numerator, $ L^{{\tt num}r}_{jm}(t)$, and denominator, $ L^{{\tt den}r}_{jm}(t)$, ``occupancy probabilities'' are computed. For MMI, these are the posterior probabilities of Gaussian component occupation for either the numerator or denominator lattice. However for MPE, in order to keep the same form of re-estimation formulae as MMI, an MPE-based analogue of the ``occupation probability'' is computed, which is related to an approximate error measure for each phone arc in the denominator lattice: positive values are treated as numerator statistics and negative values as denominator statistics.

In these update formulae there are a number of parameters to be set.

The best configuration options and parameter settings will be task and criterion specific and so will need to be determined empirically. The values shown in the tutorial section of this book can be treated as a reasonable starting point. Note that the grammar scale factors used in the tutorial are low compared to those often used in typical large vocabulary speech recognition systems, where values in the range 12-15 are used.

The estimation of the weights and the transition matrices has a similar form. Only the component prior updates will be described here. $ c^{(0)}_{jm}$ is initialised to the current model parameter $ c_{jm}$. The values are then updated 100 times using the following iterative update rule:

$\displaystyle {c}^{(i+1)}_{jm} = \frac{\sum_{r=1}^R
\sum_{t=1}^{T_r}L_{jm}^{{\tt num}r}(t) + k_{jm}c^{(i)}_{jm} + \tau^{\tt W}c^{\tt p}_{jm}}
{\sum_{n}\left(\sum_{r=1}^R
\sum_{t=1}^{T_r}L_{jn}^{{\tt num}r}(t) + k_{jn}c^{(i)}_{jn} + \tau^{\tt W}c^{\tt p}_{jn}\right)}$     (8.18)

where
$\displaystyle k_{jm} = \max_n\left\{\frac{\sum_{r=1}^R\sum_{t=1}^{T_r}L_{jn}^{{\tt den}r}(t)}
{c_{jn}}\right\}
- \frac{\sum_{r=1}^R\sum_{t=1}^{T_r}L_{jm}^{{\tt den}r}(t)}
{c_{jm}}$     (8.19)
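The following NumPy sketch shows one way this iterative update could be carried out for the component weights of a single state. The function and variable names are illustrative only; the arrays num_occ and den_occ stand for the summed occupancies $ \sum_{r}\sum_{t}L^{{\tt num}r}_{jm}(t)$ and $ \sum_{r}\sum_{t}L^{{\tt den}r}_{jm}(t)$ of each component $ m$ of the state.

```python
import numpy as np

def update_weights(num_occ, den_occ, c, c_p, tau_w, n_iter=100):
    """Illustrative iterative component-weight update, eqns (8.18) and (8.19).

    num_occ, den_occ -- per-component summed numerator/denominator occupancies
    c                -- current component weights c_jm of one state
    c_p              -- prior weights c^p_jm
    tau_w            -- I-smoothing constant tau^W
    """
    # k_jm, eqn (8.19): the largest denominator-occupancy/weight ratio in the
    # state minus the ratio for this component
    ratio = den_occ / c
    k = np.max(ratio) - ratio

    c_i = c.copy()                    # c^(0)_jm initialised to the current weights
    for _ in range(n_iter):           # 100 iterations, as described above
        numer = num_occ + k * c_i + tau_w * c_p
        c_i = numer / numer.sum()     # eqn (8.18): renormalise over the components n
    return c_i
```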

In a similar fashion to the estimation of the means and covariance matrices, a range of forms can be used to specify the prior for the component weights or the transition matrix entries. The same configuration options used for the mean and covariance matrix will determine the exact form of the prior.

For the component prior, the I-smoothing weight, $ \tau^{\tt W}$, is specified using the configuration variable ISMOOTHTAUW; this is normally set to 1. The equivalent smoothing term for the transition matrices is set using ISMOOTHTAUT, and again a value of 1 is often used.
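For example, the relevant entries of an HMMIREST configuration file using these typical values would look as follows (shown in isolation; all other discriminative training settings are omitted):

```
# I-smoothing constants for the component weights and the transition matrices
ISMOOTHTAUW = 1
ISMOOTHTAUT = 1
```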

