Baum-Welch Re-Estimation

To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae.

[Figure: Representing a Mixture]

Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see the figure "Representing a Mixture").

Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is

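$\displaystyle b_j({\mbox{\boldmath$o$}}_t) = \frac{1}{\sqrt{(2 \pi)^n \vert {\mbox{\boldmath$\Sigma$}}_j \vert}} e^{-\frac{1}{2}({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j)^{\scriptstyle\sf T} {\mbox{\boldmath$\Sigma$}}_j^{-1}({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j)}$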

(1.10)
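As an illustration of equation 1.10, the following minimal sketch evaluates the log of the density for the special case of a diagonal covariance matrix. The diagonal restriction and the function name `log_gaussian_diag` are assumptions made here to keep the example short; they are not taken from HTK.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log of eq. 1.10, assuming a diagonal covariance with entries var[k]."""
    n = len(o)
    # For a diagonal matrix, log |Sigma| is the sum of the log variances.
    log_det = sum(math.log(v) for v in var)
    # (o - mu)^T Sigma^{-1} (o - mu) reduces to a sum of scaled squared differences.
    mahal = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahal)

# One-dimensional standard normal evaluated at its mean.
density = math.exp(log_gaussian_diag([0.0], [0.0], [1.0]))
```

Working with the log of the density rather than the density itself keeps the values in a safe numeric range when many of them are later combined.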

If there was just one state $j$ in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of ${\mbox{\boldmath $\mu$}}_j$ and ${\mbox{\boldmath $\Sigma$}}_j$ would be just the simple averages, that is

$\displaystyle \hat{{\mbox{\boldmath$\mu$}}}_j = \frac{1}{T} \sum_{t=1}^{T} {\mbox{\boldmath$o$}}_t$

(1.11)

and

$\displaystyle \hat{{\mbox{\boldmath$\Sigma$}}}_j = \frac{1}{T} \sum_{t=1}^{T} ({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j) ({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j)^{\scriptstyle\sf T}$

(1.12)
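Equations 1.11 and 1.12 can be sketched in a few lines of plain Python. The function name `ml_estimates` and the sample data are illustrative assumptions; the covariance is centred on the mean just estimated from the same data.

```python
def ml_estimates(observations):
    """Return (mean, covariance) for a list of equal-length vectors (eqs. 1.11, 1.12)."""
    T = len(observations)
    n = len(observations[0])
    # Eq. 1.11: simple average of the observation vectors.
    mean = [sum(o[k] for o in observations) / T for k in range(n)]
    # Eq. 1.12: average outer product of the deviations from the mean.
    cov = [[sum((o[r] - mean[r]) * (o[c] - mean[c]) for o in observations) / T
            for c in range(n)]
           for r in range(n)]
    return mean, cov

obs = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
mean, cov = ml_estimates(obs)
```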

In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HINIT. HINIT first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.
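The first step of this procedure, the equal division of the training vectors amongst the states, might look like the sketch below. This is not the HINIT source: the function names are hypothetical, scalar observations are assumed for brevity, and the subsequent Viterbi realignment loop is omitted.

```python
def uniform_segments(n_frames, n_emitting):
    """Map each frame index to an emitting state using equal contiguous blocks."""
    if n_frames < n_emitting:
        raise ValueError("need at least one frame per emitting state")
    return [t * n_emitting // n_frames for t in range(n_frames)]

def init_state_stats(observations, n_emitting):
    """Per-state (mean, variance) from a uniform segmentation of scalar observations."""
    assignment = uniform_segments(len(observations), n_emitting)
    stats = []
    for j in range(n_emitting):
        frames = [o for o, s in zip(observations, assignment) if s == j]
        mean = sum(frames) / len(frames)                          # eq. 1.11
        var = sum((o - mean) ** 2 for o in frames) / len(frames)  # eq. 1.12
        stats.append((mean, var))
    return stats

stats = init_state_stats([1.0, 1.2, 5.0, 5.4, 9.0, 9.2], 3)
```

A fuller version would replace `uniform_segments` on later passes with the state alignment returned by the Viterbi algorithm and repeat until the estimates stop changing.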

Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector ${\mbox{\boldmath $o$}}_t$ contributes to the computation of the maximum likelihood parameter values for each state $j$. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if $L_j(t)$ denotes the probability of being in state $j$ at time $t$ then the equations 1.11 and 1.12 given above become the following weighted averages

$\displaystyle \hat{{\mbox{\boldmath$\mu$}}}_j = \frac{ \sum_{t=1}^{T} L_j(t) {\mbox{\boldmath$o$}}_t} {\sum_{t=1}^{T} L_j(t)}$

(1.13)

and

$\displaystyle \hat{{\mbox{\boldmath$\Sigma$}}}_j = \frac{ \sum_{t=1}^{T} L_j(t) ({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j) ({\mbox{\boldmath$o$}}_t - {\mbox{\boldmath$\mu$}}_j)^{\scriptstyle\sf T}} {\sum_{t=1}^{T} L_j(t)}$

(1.14)

where the summations in the denominators are included to give the required normalisation.
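A minimal sketch of these weighted averages for a single state is given below. The name `weighted_estimates` is hypothetical, and the covariance is centred on the newly estimated mean, which is one reading of equation 1.14 and is an assumption here.

```python
def weighted_estimates(observations, occupancy):
    """Weighted mean and covariance for one state (eqs. 1.13, 1.14).

    observations: list of equal-length vectors o_t
    occupancy:    list of weights L_j(t), one per observation
    """
    denom = sum(occupancy)  # normalising term shared by both equations
    n = len(observations[0])
    mean = [sum(L * o[k] for L, o in zip(occupancy, observations)) / denom
            for k in range(n)]
    cov = [[sum(L * (o[r] - mean[r]) * (o[c] - mean[c])
                for L, o in zip(occupancy, observations)) / denom
            for c in range(n)]
           for r in range(n)]
    return mean, cov

obs = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
hard_mean, hard_cov = weighted_estimates(obs, [1.0, 1.0, 0.0])
flat_mean, flat_cov = weighted_estimates(obs, [0.5, 0.5, 0.5])
```

With weights of 0 and 1 the result reduces to the hard assignment of equations 1.11 and 1.12 over the selected frames, and with equal weights it reduces to the plain average over all frames.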

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).

Of course, to apply equations 1.13 and 1.14, the probability of state occupation $L_j(t)$ must be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the forward probability $\alpha_j(t)$ for some model $M$ with $N$ states be defined as

$\displaystyle \alpha_j(t) = P({\mbox{\boldmath$o$}}_1,\ldots,{\mbox{\boldmath$o$}}_t, x(t)=j \vert M).$

(1.15)

That is, $\alpha_j(t)$ is the joint probability of observing the first $t$ speech vectors and being in state $j$ at time $t$. This forward probability can be efficiently calculated by the following recursion

$\displaystyle \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1) a_{ij} \right] b_j({\mbox{\boldmath$o$}}_t).$

(1.16)

This recursion depends on the fact that the probability of being in state $j$ at time $t$ and seeing observation ${\mbox{\boldmath $o$}}_t$ can be deduced by summing the forward probabilities for all possible predecessor states $i$ weighted by the transition probability $a_{ij}$. The slightly odd limits are caused by the fact that states $1$ and $N$ are non-emitting. The initial conditions for the above recursion are

$\displaystyle \alpha_1(1) = 1$

(1.17)

$\displaystyle \alpha_j(1) = a_{1j} b_j({\mbox{\boldmath$o$}}_1)$

(1.18)

for $1<j<N$ and the final condition is given by

$\displaystyle \alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T) a_{iN}.$

(1.19)

Notice here that from the definition of $\alpha_j(t)$,

$\displaystyle P({\mbox{\boldmath$O$}}\vert M) = \alpha_N(T).$

(1.20)

Hence, the calculation of the forward probability also yields the total likelihood $P({\mbox{\boldmath $O$}}\vert M)$.
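The forward recursion of equations 1.16 to 1.20 can be sketched as follows. The code uses zero-based indices, so state 0 plays the role of the non-emitting entry state 1 and state N-1 that of the exit state N. The output likelihoods $b_j({\mbox{\boldmath $o$}}_t)$ are assumed to be precomputed, and the toy model values are invented purely for illustration.

```python
def forward(a, b):
    """Forward probabilities for an HMM with non-emitting entry and exit states.

    a[i][j]: transition probability from state i to state j (N x N)
    b[j][t]: precomputed output likelihood b_j(o_t); rows 0 and N-1 are unused
    Returns (alpha, total) with total = P(O|M).
    """
    n_states = len(a)
    n_frames = len(b[1])
    emitting = range(1, n_states - 1)
    alpha = [[0.0] * n_frames for _ in range(n_states)]
    for j in emitting:                      # eq. 1.18
        alpha[j][0] = a[0][j] * b[j][0]
    for t in range(1, n_frames):            # eq. 1.16
        for j in emitting:
            s = sum(alpha[i][t - 1] * a[i][j] for i in emitting)
            alpha[j][t] = s * b[j][t]
    # eqs. 1.19 and 1.20: leave the model through the exit state
    total = sum(alpha[i][n_frames - 1] * a[i][n_states - 1] for i in emitting)
    return alpha, total

# Toy left-to-right model with two emitting states and three frames.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [None, [0.5, 0.2, 0.1], [0.1, 0.4, 0.6], None]
alpha, total = forward(a, b)
```

For this toy model only two state sequences are possible, and summing their probabilities by hand gives the same total likelihood as the recursion.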

The backward probability $\beta_j(t)$ is defined as

$\displaystyle \beta_j(t) = P({\mbox{\boldmath$o$}}_{t+1},\ldots,{\mbox{\boldmath$o$}}_T \vert x(t)=j , M).$

(1.21)

As in the forward case, this backward probability can be computed efficiently using the following recursion

$\displaystyle \beta_i(t) = \sum_{j=2}^{N-1} a_{ij} b_j({\mbox{\boldmath$o$}}_{t+1}) \beta_j(t+1)$

(1.22)

with initial condition given by

$\displaystyle \beta_i(T) = a_{iN}$

(1.23)

for $1<i<N$ and final condition given by

$\displaystyle \beta_1(1) = \sum_{j=2}^{N-1} a_{1j} b_j({\mbox{\boldmath$o$}}_1) \beta_j(1).$

(1.24)
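A corresponding sketch of the backward recursion in equations 1.22 to 1.24 follows. As before, indices are zero-based (state 0 is the entry state and state N-1 the exit state), the output likelihoods are assumed to be precomputed, and the toy model is invented for illustration.

```python
def backward(a, b):
    """Backward probabilities for an HMM with non-emitting entry and exit states.

    a[i][j]: transition probability from state i to state j (N x N)
    b[j][t]: precomputed output likelihood b_j(o_t); rows 0 and N-1 are unused
    Returns (beta, total) where total is the final condition of eq. 1.24.
    """
    n_states = len(a)
    n_frames = len(b[1])
    emitting = range(1, n_states - 1)
    beta = [[0.0] * n_frames for _ in range(n_states)]
    for i in emitting:                        # eq. 1.23
        beta[i][n_frames - 1] = a[i][n_states - 1]
    for t in range(n_frames - 2, -1, -1):     # eq. 1.22
        for i in emitting:
            beta[i][t] = sum(a[i][j] * b[j][t + 1] * beta[j][t + 1]
                             for j in emitting)
    total = sum(a[0][j] * b[j][0] * beta[j][0] for j in emitting)  # eq. 1.24
    return beta, total

# Toy left-to-right model with two emitting states and three frames.
a = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
b = [None, [0.5, 0.2, 0.1], [0.1, 0.4, 0.6], None]
beta, total = backward(a, b)
```

The value returned as `total` is the likelihood of the whole observation sequence, so it provides a useful consistency check against the forward pass.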

Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,

$\displaystyle \alpha_j(t) \beta_j(t) = P({\mbox{\boldmath$O$}},x(t)=j \vert M).$

(1.25)

Hence,

$\displaystyle L_j(t) = P(x(t)=j \vert {\mbox{\boldmath$O$}},M) = \frac{P({\mbox{\boldmath$O$}},x(t)=j \vert M)}{P({\mbox{\boldmath$O$}}\vert M)} = \frac{1}{P} \alpha_j(t) \beta_j(t)$

(1.26)

where $P=P({\mbox{\boldmath $O$}}\vert M)$.
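Equation 1.26 amounts to an element-wise product followed by a division, as the short sketch below shows. The numeric values of the forward and backward probabilities and of $P$ are illustrative figures worked out by hand for a toy model with two emitting states and three frames; they are assumptions for the example, not results from HTK.

```python
def occupation(alpha, beta, total):
    """State occupation probabilities L_j(t) = alpha_j(t) * beta_j(t) / P (eq. 1.26)."""
    return [[a_jt * b_jt / total for a_jt, b_jt in zip(alpha_j, beta_j)]
            for alpha_j, beta_j in zip(alpha, beta)]

# One row per emitting state, one column per frame.
alpha = [[0.5, 0.06, 0.0036],
         [0.0, 0.08, 0.048]]
beta = [[0.0288, 0.072, 0.0],
        [0.03528, 0.126, 0.3]]
total = 0.0144

L = occupation(alpha, beta, total)
# Sum over states at each frame; every entry should be close to 1.
column_sums = [sum(L[j][t] for j in range(len(L))) for t in range(3)]
```

Because the model must be in exactly one emitting state at each time, the occupation probabilities at any given frame sum to one, which is a convenient sanity check.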

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm is now in place. The steps in this algorithm may be summarised as follows

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as accumulators.
2. Calculate the forward and backward probabilities for all states $j$ and times $t$.
3. For each state $j$ and time $t$, use the probability $L_j(t)$ and the current observation vector ${\mbox{\boldmath $o$}}_t$ to update the accumulators for that state.
4. Use the final accumulator values to calculate new parameter values.
5. If the value of $P=P({\mbox{\boldmath $O$}}\vert M)$ for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.

All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.
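The role of the accumulators when several training sequences are used can be sketched as below. The class `MeanAccumulator` is hypothetical, handles only the mean of one state, and assumes scalar observations and precomputed occupation probabilities; it is meant to show the bookkeeping, not the HTK data structures.

```python
class MeanAccumulator:
    """Numerator and denominator sums of eq. 1.13 for a single state."""

    def __init__(self):
        self.num = 0.0  # running sum of L_j(t) * o_t
        self.den = 0.0  # running sum of L_j(t)

    def update(self, observations, occupancy):
        """Add the contribution of one training sequence."""
        for o, L in zip(observations, occupancy):
            self.num += L * o
            self.den += L

    def value(self):
        """New mean, computed once all sequences have been accumulated."""
        return self.num / self.den

acc = MeanAccumulator()
# Two training sequences, each paired with its own occupation probabilities.
acc.update([1.0, 3.0], [1.0, 0.5])
acc.update([2.0], [0.5])
new_mean = acc.value()
```

The sums are simply carried across sequences, and the division is performed only once at the end, which is why multiple sequences add no extra complexity.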

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is carried out in HTK using log arithmetic.
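The following sketch illustrates why log arithmetic helps and how the sums in the recursions can be formed without leaving the log domain. The helper `log_add` is a generic formulation written for this example; it is not claimed to match the routine used inside HTK.

```python
import math

LOG_ZERO = float("-inf")  # stands for log(0)

def log_add(x, y):
    """Return log(exp(x) + exp(y)) computed from the log values x and y."""
    if x < y:
        x, y = y, x          # make x the larger of the two
    if y == LOG_ZERO:
        return x             # adding zero probability changes nothing
    return x + math.log1p(math.exp(y - x))

# Multiplying 500 probabilities of 0.001 underflows in ordinary arithmetic,
# whereas the equivalent sum of logs stays well within range.
direct = 1.0
log_total = 0.0
for _ in range(500):
    direct *= 1e-3
    log_total += math.log(1e-3)

log_half = log_add(math.log(0.2), math.log(0.3))
```

Products of probabilities become sums of logs, and each summation over predecessor or successor states becomes a chain of `log_add` calls.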

The HTK program which implements the above algorithm is called HREST. In combination with the tool HINIT for estimating initial values mentioned earlier, HREST allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.
