Word n-gram models

Equation 14.1 suggests a way of approximating $ \hat{P}(\mathcal{W})$: limit the context on which each conditional probability depends:

$\displaystyle \hat P(w_1, w_2, \ldots, w_m) \simeq \prod_{i=1}^{m} \hat P(w_i \;\vert\; w_{i-n+1}, \ldots, w_{i-1})$ (14.2)

for some $ n \geqslant 1$. If language is assumed to be ergodic - that is, the probability of any state can be estimated from a sufficiently long history, independent of the starting conditions - then for sufficiently large $ n$ equation 14.2 is exact. In practice, however, data sparsity and the storage space required for the estimates limit $ n$ to values between 1 and 4 inclusive. Models using contiguous but limited context in this way are usually referred to as $ n$-gram language models, and the conditioning context of the probability ($ w_{i-n+1}, \ldots, w_{i-1}$ in equation 14.2) is referred to as the history.
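As an illustration (not part of the original text), for a trigram model ($ n = 3$) and a four-word sequence, equation 14.2 becomes:

$\displaystyle \hat P(w_1, w_2, w_3, w_4) \simeq \hat P(w_1)\, \hat P(w_2 \vert w_1)\, \hat P(w_3 \vert w_1, w_2)\, \hat P(w_4 \vert w_2, w_3)$

where the first two factors use truncated histories because no earlier words are available; in practice the history is often padded with a sentence-start symbol instead.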

Estimates of probabilities in $ n$-gram models are commonly based on maximum likelihood estimates - that is, by counting events in context on some given training text:

$\displaystyle \hat P(w_i \vert w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$ (14.3)

where $ C(\cdot)$ is the count of the given word sequence in the training text. Refinements to this maximum likelihood estimate are described later in this chapter.
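As an illustrative sketch of equation 14.3 - not HTK code; the function and variable names here are invented for this example - the counts and maximum likelihood estimates for a trigram model can be computed from a whitespace-tokenised training text as follows:

from collections import Counter

def train_ngram_mle(tokens, n=3):
    """Collect n-gram counts C(w_{i-n+1},...,w_i) and history counts C(w_{i-n+1},...,w_{i-1})."""
    ngram_counts = Counter()
    history_counts = Counter()
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])
        ngram_counts[history + (tokens[i],)] += 1
        history_counts[history] += 1
    return ngram_counts, history_counts

def mle_prob(word, history, ngram_counts, history_counts):
    """Maximum likelihood estimate of P(word | history) as in equation 14.3."""
    if history_counts[history] == 0:
        return 0.0  # unseen history: no estimate available (see section 14.3 on smoothing)
    return ngram_counts[history + (word,)] / history_counts[history]

# toy training text; real models are trained on far larger corpora
tokens = "the cat sat on the mat because the cat was tired".split()
ngrams, histories = train_ngram_mle(tokens, n=3)
print(mle_prob("sat", ("the", "cat"), ngrams, histories))  # 0.5: "the cat" occurs twice, once followed by "sat"

Unseen events simply receive zero probability in this sketch, which is exactly the deficiency that the smoothing methods of section 14.3 address.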

The choice of $ n$ has a significant effect on the number of potential parameters in the model, which is bounded above by $ \vert\mathbb{W}\vert^n$, where $ \mathbb{W}$ is the set of words known to the language model - its vocabulary. A 4-gram model with a typical 65,000-word vocabulary can therefore have up to $ 65{,}000^4 \simeq 1.8\times10^{19}$ parameters. In practice, however, only a small subset of the possible word combinations represent likely word sequences, so the storage requirement is far smaller than this theoretical maximum - of the order of $ 10^{11}$ times smaller. Even with this significant reduction in coverage and a very large training text, many plausible word sequences will not be encountered in the training text, or will not occur a statistically significant number of times. It would not be sensible to assign all unseen sequences zero probability, so methods of coping with low- and zero-occurrence word tuples have been developed; these are discussed later in section 14.3.
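Putting these figures together (an illustrative order-of-magnitude calculation based on the numbers above, not a measured model size), a reduction of $ 10^{11}$ on the theoretical maximum corresponds to roughly

$\displaystyle \frac{1.8\times10^{19}}{10^{11}} \approx 1.8\times10^{8}$

distinct 4-gram parameters actually stored.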

It is not only storage space that must be considered, however - it is also necessary to attach a reasonable degree of confidence to the derived estimates, which requires suitably large quantities of training text. Increasing the amount of training text gives greater confidence in the model estimates, but it also demands more storage space and longer analysis times when estimating the model parameters, which may place practical limits on how much data can be used in constructing the final model or how thoroughly it can be analysed. At the other end of the scale, for restricted-domain models there may be only a limited quantity of suitable in-domain text available, so local estimates may need smoothing with global priors.

In addition, if language models are to be used for speech recognition then it is preferable to train them on accurate acoustic transcriptions where possible - that is, text which includes the hesitations, repetitions, word fragments, mistakes and all the other deviations from purely grammatical language that characterise everyday speech. Such transcriptions are in limited supply, however, since they must be specifically prepared; transcripts produced for other purposes almost invariably correct any disfluencies or mistakes made by speakers.

