Estimating probabilities

Language models seek to estimate the probability of each possible word sequence event occurring. In order to calculate maximum likelihood estimates, this set of events must be finite so that the language model can ensure that the sum of the probabilities of all events is 1 given some context. In an $n$-gram model the combination of the finite vocabulary and fixed-length history limits the number of unique events to $\vert\mathbb{W}\vert^n$. For any $n>1$ it is highly unlikely that all word sequence events will be encountered in the training corpora, and many that do occur may appear only once or twice. A language model should not assign any unseen event zero probability, but without an infinite quantity of training text it is almost certain that there will be events the model does not encounter during training, so various mechanisms have been developed to redistribute probability within the model such that these unseen events are given some non-zero probability.
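To make the sparsity problem concrete, the following Python sketch (the toy corpus and function name are illustrative, not part of the HTK tools) estimates bigram probabilities by maximum likelihood and shows an unseen bigram receiving zero probability:

    from collections import Counter

    # Toy corpus: under pure maximum likelihood, any unseen bigram gets zero probability.
    corpus = "the cat sat on the mat the cat lay on the mat".split()

    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])

    def mle_bigram_prob(w1, w2):
        """Maximum likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
        if unigrams[w1] == 0:
            return 0.0
        return bigrams[(w1, w2)] / unigrams[w1]

    print(mle_bigram_prob("the", "cat"))  # seen in training: 0.5
    print(mle_bigram_prob("the", "dog"))  # unseen: 0.0, the problem discounting addresses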

As in equation 14.3, the maximum likelihood estimate of the probability of an event $\mathcal{A}$ occurring is defined by the number of times that event is observed, $a$, and the total number of samples in the training set over all observations, $A$, giving $P(\mathcal{A}) = \frac{a}{A}$. With this definition, events that do not occur in the training data are assigned zero probability, since for them $a=0$. [Katz 1987] suggests multiplying each observed count by a discount coefficient, $d_a$, which depends upon the number of times the event is observed, $a$, such that $a' = d_a \cdot a$. Using this discounted occurrence count, the probability of an event that occurs $a$ times becomes $P_\mathrm{discount}(\mathcal{A}) = \frac{a'}{A}$. Different discounting schemes have been proposed that define the discount coefficient, $d_a$, in specific ways. The same discount coefficient is used for all events that occur the same number of times, reflecting the symmetry requirement that two events occurring with equal frequency, $a$, must have the same probability, $p_a$.
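A minimal sketch of this discounting step, assuming some illustrative values for $d_a$ (a real scheme such as Good-Turing derives the coefficients from the training counts rather than fixing them by hand):

    from collections import Counter

    # Hypothetical discount coefficients d_a for small counts; a real scheme
    # (e.g. Good-Turing) derives these from the count-of-counts statistics.
    discount = {1: 0.5, 2: 0.75, 3: 0.9}

    def discounted_prob(event, counts):
        """P_discount(A) = d_a * a / A, with d_a = 1 for counts not listed above."""
        a = counts[event]
        A = sum(counts.values())
        d_a = discount.get(a, 1.0)  # events seen often enough keep their full count
        return (d_a * a) / A

    counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("on", "the"): 3})
    print(discounted_prob(("cat", "sat"), counts))  # 0.5 * 1 / 6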

Defining $c_a$ as the number of events that occur exactly $a$ times, so that $A = \sum_{a\ge 1} a \cdot c_a$, the total reserved mass left over for distribution amongst the unseen events is $1 - \frac{1}{A}\sum_{a\ge 1} d_a \cdot c_a \cdot a$; shared uniformly amongst the $c_0$ unseen events, each receives probability $\frac{1}{c_0}\left(1 - \frac{1}{A}\sum_{a\ge 1} d_a \cdot c_a \cdot a\right)$.
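The reserved mass computation can be sketched as follows, again with hypothetical discount coefficients and a hypothetical reserved_mass helper; the count-of-counts $c_a$ are recovered directly from the observed counts:

    from collections import Counter

    def reserved_mass(counts, discount, vocab_size):
        """Probability mass freed by discounting, shared uniformly among unseen events.

        counts:      observed event -> occurrence count a
        discount:    a -> d_a (d_a = 1.0 for counts not listed)
        vocab_size:  size of the full event space, so c_0 = vocab_size - len(counts);
                     assumed large enough that c_0 > 0
        """
        A = sum(counts.values())
        c = Counter(counts.values())  # count-of-counts: c_a
        seen_mass = sum(discount.get(a, 1.0) * a * c_a for a, c_a in c.items()) / A
        c0 = vocab_size - len(counts)  # number of unseen events
        total_reserved = 1.0 - seen_mass
        return total_reserved, total_reserved / c0  # total, and per unseen event

    counts = {"a": 3, "b": 1, "c": 1}
    total, per_event = reserved_mass(counts, {1: 0.5, 2: 0.75}, vocab_size=10)
    print(total, per_event)  # 0.2 reserved in total, 0.2 / 7 per unseen event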


