Good-Turing discounting

In [Good 1953]14.13 a method of discounting maximum likelihood estimates was proposed whereby the count of an event occurring $ a$ times is discounted with

$\displaystyle d_a = (a+1)\,\frac{c_{a+1}}{a\,c_a}$ (14.14)
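Equation 14.14 can be sketched directly from a table of "counts of counts". The function name and the toy counts below are illustrative, not part of the HTK toolkit; $ c_a$ is taken to be the number of distinct events observed exactly $ a$ times.

```python
from collections import Counter

def good_turing_discounts(counts):
    """Good-Turing discount d_a = (a+1) * c_{a+1} / (a * c_a)  (eq. 14.14).

    `counts` maps each event to its observed count a; c_a is the number
    of distinct events seen exactly a times (the "count of counts").
    """
    count_of_counts = Counter(counts.values())        # c_a for each observed a
    discounts = {}
    for a in sorted(count_of_counts):
        c_a = count_of_counts[a]
        c_a1 = count_of_counts.get(a + 1, 0)          # zero when no event occurs a+1 times
        discounts[a] = (a + 1) * c_a1 / (a * c_a)
    return discounts

# Toy word counts: c_1 = 3, c_2 = 1, c_3 = 1
d = good_turing_discounts({"the": 3, "cat": 1, "sat": 1, "on": 2, "mat": 1})
# d[1] = 2*1/(1*3) = 2/3, d[2] = 3*1/(2*1) = 1.5, d[3] = 4*0/(3*1) = 0.0
```

Note that the highest observed count already yields $ d_a = 0$ because $ c_{a+1} = 0$, which is exactly the failure mode discussed next.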

A problem with this scheme, referred to as Good-Turing discounting, is that because of the count $ c_a$ in the denominator it fails whenever $ c_a = 0$ while some count $ c_b > 0$ exists for $ b > a$. Inevitably, as $ a$ increases the count $ c_a$ tends towards zero, and for high $ a$ there are likely to be many such zero counts. A solution to this problem was proposed in [Katz 1987], which defines a cut-off value $ k$ above which counts are not discounted14.14 - this is justified by regarding these more frequently observed counts as reliable and therefore not in need of discounting. Katz then describes a revised discount equation which preserves the same amount of mass for the unseen events:

$\displaystyle d_a = \left\{ \begin{array}{c@{\quad:\quad}l} \frac{(a+1)\,\frac{c_{a+1}}{a\,c_a} - (k+1)\,\frac{c_{k+1}}{c_1}} {1 - (k+1)\frac{c_{k+1}}{c_1}} & 1 \le a \le k\\ 1 & a>k \end{array}\right.$ (14.15)

This method is itself unstable, however: for example, if $ k\,c_k > c_1$ then $ d_a$ will be negative for $ 1 \le a \le k$.
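The Katz scheme of equation 14.15 can be sketched as follows, taking the count-of-counts table $ c_a$ as input. The function name and the example table are assumptions for illustration; a practical implementation would also need to guard against the negative-discount case noted above (a comment marks where).

```python
def katz_discounts(count_of_counts, k):
    """Katz discounts (eq. 14.15): Good-Turing discounting below the
    cut-off k, no discounting (d_a = 1) above it.

    `count_of_counts[a]` is c_a, the number of events seen exactly a times.
    """
    # Shared mass term (k+1) * c_{k+1} / c_1 appearing in numerator and denominator
    A = (k + 1) * count_of_counts.get(k + 1, 0) / count_of_counts[1]
    discounts = {}
    for a in sorted(count_of_counts):
        if a > k:
            discounts[a] = 1.0                       # reliable counts: undiscounted
        else:
            gt = (a + 1) * count_of_counts.get(a + 1, 0) / (a * count_of_counts[a])
            discounts[a] = (gt - A) / (1 - A)        # may go negative in the unstable case
    return discounts

# Example count-of-counts table with cut-off k = 5:
cc = {1: 100, 2: 40, 3: 20, 4: 12, 5: 8, 6: 5, 7: 4, 8: 3}
d = katz_discounts(cc, k=5)
# Here A = 6*5/100 = 0.3, so e.g. d[1] = (0.8 - 0.3)/0.7 and d[6] = 1.0
```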

