Class n-gram models

One method of reducing the number of word history equivalence classes to be modelled in the $ n$-gram case is to consider some words as equivalent. This can be implemented by mapping each word to a word class $ g \in \mathbb{G}$ using a classification function $ G(w) = g$. If any class contains more than one word then this mapping results in fewer distinct word classes than there are words, $ \vert\mathbb{G}\vert < \vert\mathbb{W}\vert$, thus reducing the number of separate contexts that must be considered. The equivalence classes can then be described as a sequence of these classes:

$\displaystyle \mathcal E_{\textrm{class-{\it n}-gram}}(w_1, \ldots, w_{i}) = \mathcal E(G(w_{i-n+1}), \ldots, G(w_{i}))$ (14.6)
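For illustration, the following is a minimal Python sketch of such a classification function; the hand-written class map and class names are purely hypothetical, since in practice the map would be found automatically (see below).

\begin{verbatim}
# Hypothetical word-to-class map implementing G(w).
G = {
    "monday": "DAY", "tuesday": "DAY", "friday": "DAY",
    "london": "CITY", "paris": "CITY",
    "on": "FUNC", "in": "FUNC",
}

def class_history(words, n):
    """Map the last n words onto their class sequence, as in equation 14.6."""
    return tuple(G[w] for w in words[-n:])

# Two distinct word histories that collapse to the same class history:
print(class_history(["on", "monday"], 2))   # ('FUNC', 'DAY')
print(class_history(["in", "friday"], 2))   # ('FUNC', 'DAY')
\end{verbatim}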

A deterministic word-to-class mapping like this has some advantages over a word $ n$-gram model, since the reduction in the number of distinct histories reduces the storage space and training data requirements whilst improving the robustness of the probability estimates for a given quantity of training data. Because multiple words can be mapped to the same class, the model can make more confident predictions about an infrequent word in a class on the basis of other, more frequent words in the same class than is possible in the word $ n$-gram case, and for the same reason it can generalise to words used in contexts which are not explicitly encountered in the training text. These gains, however, come at the cost of a reduced ability to distinguish between different histories, although this may be offset by the ability to choose a higher value of $ n$.

The most commonly used form of class $ n$-gram model uses a single classification function, $ G(.)$, as in equation 14.6, which is applied to each word in the $ n$-gram, including the word being predicted. Considering the bigram case for clarity, given $ G(.)$ the language model has the terms $ w_i$, $ w_{i-1}$, $ G(w_i)$ and $ G(w_{i-1})$ available to it. The probability estimate can be decomposed as follows:

$\displaystyle P_{\textrm{class'}}(w_i \;\vert\; w_{i-1}) = P(w_i \;\vert\; G(w_i), G(w_{i-1}), w_{i-1}) \;\times\; P(G(w_i) \;\vert\; G(w_{i-1}), w_{i-1})$ (14.7)

It is assumed that $ P(w_i \;\vert\; G(w_i), G(w_{i-1}), w_{i-1})$ is independent of $ G(w_{i-1})$ and $ w_{i-1}$, and that $ P(G(w_i) \;\vert\; G(w_{i-1}), w_{i-1})$ is independent of $ w_{i-1}$, resulting in the model:

$\displaystyle P_{\textrm{class}}(w_i \;\vert\; w_{i-1}) = P(w_i \;\vert\; G(w_{i})) \;\times\; P(G(w_i) \;\vert\; G(w_{i-1}))$ (14.8)
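As a concrete sketch of equation 14.8, the two component distributions can be estimated from word, class and class-bigram counts. The Python code below uses unsmoothed maximum-likelihood estimates purely for illustration; a practical model would apply discounting and back-off, and the function names are assumptions of this sketch.

\begin{verbatim}
from collections import Counter

def train_class_bigram(corpus, G):
    """Maximum-likelihood estimates of the two terms in equation 14.8.
    corpus is a list of words, G is a word-to-class dictionary."""
    word_count   = Counter(corpus)
    class_count  = Counter(G[w] for w in corpus)
    hist_count   = Counter(G[w] for w in corpus[:-1])
    class_bigram = Counter((G[a], G[b]) for a, b in zip(corpus, corpus[1:]))

    def p_word_given_class(w):            # P(w_i | G(w_i))
        return word_count[w] / class_count[G[w]]

    def p_class_given_class(g_prev, g):   # P(G(w_i) | G(w_{i-1}))
        return class_bigram[(g_prev, g)] / hist_count[g_prev]

    def p_class(w, w_prev):               # equation 14.8
        return p_word_given_class(w) * p_class_given_class(G[w_prev], G[w])

    return p_class
\end{verbatim}

Note that the class-bigram counts are accumulated over far fewer distinct events than word-bigram counts would be, which is the source of the robustness discussed above.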

Almost all reported class $ n$-gram work using statistically-found classes is based on clustering algorithms which optimise $ G(.)$ on the basis of bigram training set likelihood, even if the class map is to be used with longer-context models. It is interesting to note that this approximation appears to work well, suggesting that the class maps found are in some respects ``general'' and capture features of natural language which apply irrespective of the context length used when finding them.
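One common family of such algorithms is exchange clustering. The following naive Python sketch re-scores the whole training set after every tentative move, which is far too slow for realistic vocabularies (practical implementations update counts and likelihood incrementally); the initialisation, iteration count and unsmoothed likelihood are illustrative assumptions rather than a description of any particular tool.

\begin{verbatim}
import math
from collections import Counter

def bigram_class_log_likelihood(corpus, G):
    """Training-set log-likelihood of the class bigram model (equation 14.8),
    using unsmoothed maximum-likelihood estimates."""
    word_count   = Counter(corpus)
    class_count  = Counter(G[w] for w in corpus)
    hist_count   = Counter(G[w] for w in corpus[:-1])
    class_bigram = Counter((G[a], G[b]) for a, b in zip(corpus, corpus[1:]))
    ll = sum(c * math.log(c / class_count[G[w]])
             for w, c in word_count.items())
    ll += sum(c * math.log(c / hist_count[g1])
              for (g1, g2), c in class_bigram.items())
    return ll

def exchange_cluster(corpus, num_classes, iterations=3):
    """Greedily move each word to whichever class most increases the
    bigram training-set likelihood, for a fixed number of sweeps."""
    vocab = sorted(set(corpus))
    G = {w: i % num_classes for i, w in enumerate(vocab)}  # round-robin start
    for _ in range(iterations):
        for w in vocab:
            best_g, best_ll = G[w], None
            for g in range(num_classes):
                G[w] = g
                ll = bigram_class_log_likelihood(corpus, G)
                if best_ll is None or ll > best_ll:
                    best_g, best_ll = g, ll
            G[w] = best_g
    return G
\end{verbatim}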

