Good-Turing discounting

In [Good 1953]14.13 a method of discounting maximum likelihood estimates was proposed whereby the count of an event occurring $ a$ times is discounted with

$\displaystyle d_a = (a+1)\,\frac{c_{a+1}}{a\,c_a}$ (14.14)
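Equation 14.14 can be sketched directly from a table of "counts of counts". The function name and the toy counts below are illustrative, not part of the HTK toolkit; $ c_a$ is taken to be the number of distinct events observed exactly $ a$ times.

```python
from collections import Counter

def good_turing_discounts(counts):
    """Good-Turing discount d_a = (a+1) * c_{a+1} / (a * c_a)  (eq. 14.14).

    `counts` maps each event to its observed count a; c_a is the number
    of distinct events seen exactly a times (the "count of counts").
    """
    count_of_counts = Counter(counts.values())        # c_a for each observed a
    discounts = {}
    for a in sorted(count_of_counts):
        c_a = count_of_counts[a]
        c_a1 = count_of_counts.get(a + 1, 0)          # zero when no event occurs a+1 times
        discounts[a] = (a + 1) * c_a1 / (a * c_a)
    return discounts

# Toy word counts: c_1 = 3, c_2 = 1, c_3 = 1
d = good_turing_discounts({"the": 3, "cat": 1, "sat": 1, "on": 2, "mat": 1})
# d[1] = 2*1/(1*3) = 2/3, d[2] = 3*1/(2*1) = 1.5, d[3] = 4*0/(3*1) = 0.0
```

Note that the highest observed count already yields $ d_a = 0$ because $ c_{a+1} = 0$, which is exactly the failure mode discussed next.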

A problem with this scheme, referred to as Good-Turing discounting, is that because of the count $ c_a$ in the denominator it fails whenever $ c_a = 0$ while some count $ c_b > 0$ exists for $ b > a$. Inevitably, as $ a$ increases the count $ c_a$ tends towards zero, and for high $ a$ there are likely to be many such zero counts. A solution to this problem was proposed in [Katz 1987], which defines a cut-off value $ k$ above which counts are not discounted14.14 - this is justified by regarding these more frequently observed counts as reliable and therefore not in need of discounting. Katz then describes a revised discount equation which preserves the same amount of mass for the unseen events:

$\displaystyle d_a = \left\{ \begin{array}{c@{\quad:\quad}l} \frac{(a+1)\,\frac{c_{a+1}}{a\,c_a} - (k+1)\,\frac{c_{k+1}}{c_1}} {1 - (k+1)\frac{c_{k+1}}{c_1}} & 1 \le a \le k\\ 1 & a>k \end{array}\right.$ (14.15)

This method is itself unstable, however: for example, if $ k\,c_k > c_1$ then $ d_a$ will be negative for $ 1 \le a \le k$.
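The Katz scheme of equation 14.15 can be sketched as follows, taking the count-of-counts table $ c_a$ as input. The function name and the example table are assumptions for illustration; a practical implementation would also need to guard against the negative-discount case noted above (a comment marks where).

```python
def katz_discounts(count_of_counts, k):
    """Katz discounts (eq. 14.15): Good-Turing discounting below the
    cut-off k, no discounting (d_a = 1) above it.

    `count_of_counts[a]` is c_a, the number of events seen exactly a times.
    """
    # Shared mass term (k+1) * c_{k+1} / c_1 appearing in numerator and denominator
    A = (k + 1) * count_of_counts.get(k + 1, 0) / count_of_counts[1]
    discounts = {}
    for a in sorted(count_of_counts):
        if a > k:
            discounts[a] = 1.0                       # reliable counts: undiscounted
        else:
            gt = (a + 1) * count_of_counts.get(a + 1, 0) / (a * count_of_counts[a])
            discounts[a] = (gt - A) / (1 - A)        # may go negative in the unstable case
    return discounts

# Example count-of-counts table with cut-off k = 5:
cc = {1: 100, 2: 40, 3: 20, 4: 12, 5: 8, 6: 5, 7: 4, 8: 3}
d = katz_discounts(cc, k=5)
# Here A = 6*5/100 = 0.3, so e.g. d[1] = (0.8 - 0.3)/0.7 and d[6] = 1.0
```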

