Statistical language models are estimated by counting the number of events in a sample source text. These event counts are stored in gram files. Provided that they share a common word map, gram files can be grouped together in arbitrary ways to form the raw data pool from which a language model can be constructed. For example, a text source containing 100m words could be processed and stored as two gram files. A few months later, a third gram file could be generated from a newly acquired text source. This new gram file could then be added to the original two files to build a new language model. The original source text is not needed, and the gram files need not be changed.
A gram file consists of a header followed by a sorted list of N-gram counts. The header contains the following items, each written on a separate line:

    Ngram   = 3
    WMap    = US_Business_News
    Entries = 50345980
    WMCheck = XEROX 340987
    Gram1   = AN ABLE ART
    GramN   = ZEALOUS ZOO OWNERS
    Source  = WSJ Aug 94 to Dec 94
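As a rough illustration of this layout, the header can be read as a sequence of key = value lines terminated by the \Grams\ keyword. The following sketch is purely illustrative and is not HTK's actual reader; the function name is hypothetical.

```python
# Sketch: parse a gram file header of "Key = Value" lines, stopping
# at the \Grams\ keyword that introduces the binary N-gram records.
# Illustrative only -- not HTK's actual implementation.

def parse_gram_header(lines):
    header = {}
    for line in lines:
        line = line.strip()
        if line == "\\Grams\\":   # N-gram data begins after this line
            break
        key, _, value = line.partition("=")
        header[key.strip()] = value.strip()
    return header

example = """\
Ngram   = 3
WMap    = US_Business_News
Entries = 50345980
WMCheck = XEROX 340987
Gram1   = AN ABLE ART
GramN   = ZEALOUS ZOO OWNERS
Source  = WSJ Aug 94 to Dec 94
\\Grams\\
"""

h = parse_gram_header(example.splitlines())
```

The Gram1 and GramN fields read here are the ones the tools later use to sequence a set of input files.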
The N-grams themselves begin immediately following the line containing the keyword \Grams\. They are listed in lexicographic sort order such that for the N-gram w1 w2 ... wN, the first word w1 varies the least rapidly and the last word wN varies the most rapidly. Each N-gram consists of a sequence of 3-byte word ids followed by a single 1-byte count. If the N-gram occurred more than 255 times, then it is repeated with the counts being interpreted to the base 256. For example, if a gram file contains the sequence

    w1 w2 ... wN c1   w1 w2 ... wN c2   w1 w2 ... wN c3

then this corresponds to the N-gram w1 w2 ... wN occurring c1 + c2*256 + c3*256^2 times.
When a group of gram files is used as input to a tool, they must be organised so that the tool receives the N-grams as a single stream in sort order; that is, as far as the tool is concerned, the net effect must be as if there were just a single gram file. Of course, a sufficient approach would be to open all input gram files in parallel and then scan them as needed to extract the required sorted N-gram sequence. However, if two gram files were organised such that the last N-gram in one file was ordered before the first N-gram of the second file, it would be much more efficient to open and read the files in sequence. Files such as these are said to be sequenced, and in general, HTK tools are supplied with a mix of sequenced and non-sequenced files. To optimise input in this general case, all HTK tools which input gram files start by scanning the header fields Gram1 and GramN. This information allows a sequence table to be constructed which determines the order in which the constituent gram files must be opened and closed. This sequence table is designed to minimise the number of individual gram files which must be kept open in parallel.
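The requirement that a tool see one sorted stream can be sketched with a standard k-way merge; in this illustration each gram file is modelled as an iterable of (word-id tuple, count) records already in sort order. The function name and record shape are hypothetical, and fully sequenced files could simply be read one after another instead of merged.

```python
import heapq

# Sketch: merge several individually sorted N-gram streams into a
# single sorted stream, as HTK tools require of their input.
# heapq.merge keeps only one record per stream in memory, which is
# the general (non-sequenced) case; non-overlapping Gram1/GramN
# ranges would allow files to be read strictly in sequence instead.

def merged_grams(*streams):
    return heapq.merge(*streams, key=lambda rec: rec[0])

file_a = [((1, 2, 3), 5), ((1, 2, 9), 2)]   # sorted records from one file
file_b = [((1, 2, 4), 7), ((2, 1, 1), 1)]   # sorted records from another

merged = list(merged_grams(file_a, file_b))
# merged is in lexicographic order of the word-id tuples
```

The sequence table described above plays the role of deciding which streams actually need to be open at the same time; here the merge simply opens all of them.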
This gram file sequencing is invisible to the HTK user, but it is important to be aware of it. When a large number of gram files are accumulated to form a frequently used database, it may be worth copying the gram files using LGCOPY. This has the effect of transforming the gram files into a fully sequenced set, thus ensuring that subsequent reading of the data is maximally efficient.