Class n-gram models

One method of reducing the number of word history equivalence classes to be modelled in the $ n$-gram case is to consider some words as equivalent. This can be implemented by mapping each word to a word class $ g \in \mathbb{G}$ using a classification function $ G(w) = g$. If any class contains more than one word then this mapping results in fewer distinct word classes than there are words, $ \vert\mathbb{G}\vert < \vert\mathbb{W}\vert$, thus reducing the number of separate contexts that must be considered. The equivalence classes can then be described as a sequence of these classes:

$\displaystyle \mathcal E_{\textrm{class-{\it n}-gram}}(w_1, \ldots, w_{i}) = \mathcal E(G(w_{i-n+1}), \ldots, G(w_{i}))$ (14.6)
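For illustration, the following is a minimal Python sketch of such a classification function; the hand-written class map and class names are purely hypothetical, since in practice the map would be found automatically (see below).

\begin{verbatim}
# Hypothetical word-to-class map implementing G(w).
G = {
    "monday": "DAY", "tuesday": "DAY", "friday": "DAY",
    "london": "CITY", "paris": "CITY",
    "on": "FUNC", "in": "FUNC",
}

def class_history(words, n):
    """Map the last n words onto their class sequence, as in equation 14.6."""
    return tuple(G[w] for w in words[-n:])

# Two distinct word histories that collapse to the same class history:
print(class_history(["on", "monday"], 2))   # ('FUNC', 'DAY')
print(class_history(["in", "friday"], 2))   # ('FUNC', 'DAY')
\end{verbatim}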

A deterministic word-to-class mapping like this has some advantages over a word $ n$-gram model, since the reduction in the number of distinct histories reduces the storage space and training data requirements whilst improving the robustness of the probability estimates for a given quantity of training data. Because multiple words can be mapped to the same class, the model can make more confident predictions about an infrequent word in a class on the basis of other, more frequent words in the same class than is possible in the word $ n$-gram case, and for the same reason it can generalise to words used in contexts which are not explicitly encountered in the training text. These gains, however, come at the cost of a reduced ability to distinguish between different histories, although this may be offset by the ability to choose a higher value of $ n$.

The most commonly used form of class $ n$-gram model uses a single classification function, $ G(.)$, as in equation 14.6, which is applied to each word in the $ n$-gram, including the word being predicted. Considering the bigram case for clarity, given $ G(.)$ the language model has the terms $ w_i$, $ w_{i-1}$, $ G(w_i)$ and $ G(w_{i-1})$ available to it. The probability estimate can be decomposed as follows:

$\displaystyle P_{\textrm{class'}}(w_i \;\vert\; w_{i-1}) = P(w_i \;\vert\; G(w_i), G(w_{i-1}), w_{i-1}) \;\times\; P(G(w_i) \;\vert\; G(w_{i-1}), w_{i-1})$ (14.7)

It is assumed that $ P(w_i \;\vert\; G(w_i), G(w_{i-1}), w_{i-1})$ is independent of $ G(w_{i-1})$ and $ w_{i-1}$, and that $ P(G(w_i) \;\vert\; G(w_{i-1}), w_{i-1})$ is independent of $ w_{i-1}$, resulting in the model:

$\displaystyle P_{\textrm{class}}(w_i \;\vert\; w_{i-1}) = P(w_i \;\vert\; G(w_{i})) \;\times\; P(G(w_i) \;\vert\; G(w_{i-1}))$ (14.8)
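As a concrete sketch of equation 14.8, the two component distributions can be estimated from word, class and class-bigram counts. The Python code below uses unsmoothed maximum-likelihood estimates purely for illustration; a practical model would apply discounting and back-off, and the function names are assumptions of this sketch.

\begin{verbatim}
from collections import Counter

def train_class_bigram(corpus, G):
    """Maximum-likelihood estimates of the two terms in equation 14.8.
    corpus is a list of words, G is a word-to-class dictionary."""
    word_count   = Counter(corpus)
    class_count  = Counter(G[w] for w in corpus)
    hist_count   = Counter(G[w] for w in corpus[:-1])
    class_bigram = Counter((G[a], G[b]) for a, b in zip(corpus, corpus[1:]))

    def p_word_given_class(w):            # P(w_i | G(w_i))
        return word_count[w] / class_count[G[w]]

    def p_class_given_class(g_prev, g):   # P(G(w_i) | G(w_{i-1}))
        return class_bigram[(g_prev, g)] / hist_count[g_prev]

    def p_class(w, w_prev):               # equation 14.8
        return p_word_given_class(w) * p_class_given_class(G[w_prev], G[w])

    return p_class
\end{verbatim}

Note that the class-bigram counts are accumulated over far fewer distinct events than word-bigram counts would be, which is the source of the robustness discussed above.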

Almost all reported class $ n$-gram work using statistically-found classes is based on clustering algorithms which optimise $ G(.)$ on the basis of bigram training set likelihood, even if the class map is to be used with longer-context models. It is interesting to note that this approximation appears to work well, suggesting that the class maps found are in some respects ``general'' and capture features of natural language which apply irrespective of the context length used when finding them.
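One common family of such algorithms is exchange clustering. The following naive Python sketch re-scores the whole training set after every tentative move, which is far too slow for realistic vocabularies (practical implementations update counts and likelihood incrementally); the initialisation, iteration count and unsmoothed likelihood are illustrative assumptions rather than a description of any particular tool.

\begin{verbatim}
import math
from collections import Counter

def bigram_class_log_likelihood(corpus, G):
    """Training-set log-likelihood of the class bigram model (equation 14.8),
    using unsmoothed maximum-likelihood estimates."""
    word_count   = Counter(corpus)
    class_count  = Counter(G[w] for w in corpus)
    hist_count   = Counter(G[w] for w in corpus[:-1])
    class_bigram = Counter((G[a], G[b]) for a, b in zip(corpus, corpus[1:]))
    ll = sum(c * math.log(c / class_count[G[w]])
             for w, c in word_count.items())
    ll += sum(c * math.log(c / hist_count[g1])
              for (g1, g2), c in class_bigram.items())
    return ll

def exchange_cluster(corpus, num_classes, iterations=3):
    """Greedily move each word to whichever class most increases the
    bigram training-set likelihood, for a fixed number of sweeps."""
    vocab = sorted(set(corpus))
    G = {w: i % num_classes for i, w in enumerate(vocab)}  # round-robin start
    for _ in range(iterations):
        for w in vocab:
            best_g, best_ll = G[w], None
            for g in range(num_classes):
                G[w] = g
                ll = bigram_class_log_likelihood(corpus, G)
                if best_ll is None or ll > best_ll:
                    best_g, best_ll = g, ll
            G[w] = best_g
    return G
\end{verbatim}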

