Footnotes

often referred to as a codebook exponent.

Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.

To understand equations involving a non-emitting state at time

, the time should be thought of as being $t-\delta t$ if it is an entry state, and $t+\delta t$ if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place between frames.

...accumulators ^1.4

Note that normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can result in the denominator summations being different. Hence, in HTK the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.

... operation ^1.5

They can even be avoided altogether by using a flat start as described in section 8.3.

... minor ^1.6

In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For example, the HEREST tool includes facilities for pruning on both the forward and backward passes and parallel operation on a network of machines.

... Model ^1.7

See ``Token Passing: a Conceptual Model for Connected Speech Recognition Systems'', SJ Young, NH Russell and JHS Thornton, CUED Technical Report F_INFENG/TR38, Cambridge University, 1989. Available by anonymous ftp from svr-ftp.eng.cam.ac.uk.

... dependent ^3.1

A stage of the tutorial deals with adapting the speaker dependent models for new speakers

... used ^3.2

Available by anonymous ftp from svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/beep.tar.gz. Note that items beginning with unmatched quotes, found at the start of the dictionary, should be removed.

... files ^3.3

Not to be confused with files containing edit scripts.

... arguments ^3.4

Most UNIX shells, especially the C shell, only allow a limited and quite small number of arguments.

... together ^3.5

Note that if the transition matrices had not been tied, the CO command would be ineffective since all models would be different by virtue of their unique transition matrices.

... file ^3.6

The HLED tool may have to be used to insert silences at the start and end of each transcription or alternatively HRESULTS can be used to ignore silences (or any other symbols) using the -e option.

... systems ^3.7

HVITE becomes progressively less efficient as the vocabulary size is increased and cross-word triphones are used.

... extension ^3.8

An additional, more restrictive, licence must be agreed to in order to download HDECODE.

... model ^3.9

gzip and gunzip are assumed to be in the current path.

... is ^4.1

All of the examples in this book assume the UNIX Operating System and the C Shell but the principles apply to any OS which supports hierarchical files and command line arguments

... element)^4.2

Block sizes typically grow as more blocks are allocated

... retries ^4.3

This does not work if input filters are used.

... units ^5.1

The somewhat bizarre choice of 100nsec units originated in Version 1 of HTK when times were represented by integers and this unit was the best compromise between precision and range. Times are now represented by doubles and hence the constraints no longer apply. However, the need for backwards compatibility means that 100nsec units have been retained. The names SOURCERATE and TARGETRATE are also non-ideal, SOURCEPERIOD and TARGETPERIOD would be better.

... input ^5.2

This method of applying a zero mean is different to HTK Version 1.5 where the mean was calculated and subtracted from the whole speech file in one operation. The configuration variable V1COMPAT can be set to revert to this older behaviour.

... databases ^5.3

Many of the more recent speech databases use compression. In these cases, the data may be regarded as being logically encoded as a sequence of 2-byte integers even if the actual storage uses a variable length encoding scheme.

... function ^5.4

Note that some textbooks define the denominator of equation 5.4 as $1 - \sum_{i=1}^p a_i z^{-i}$ so that the filter coefficients are the negatives of those computed by HTK.

... root.^5.5

the degree of compression can be controlled by setting the configuration parameter COMPRESSFACT which is the power to which the amplitudes are raised and defaults to 0.33)

... parameterisation ^5.6

In any event, setting the compatibility variable V1COMPAT to true in HPARM will ensure that the calculation of energy is compatible with that computed by the Version 1 tool HCODE.

...#tex2html_wrap_inline51107#^5.7

Unless V1COMPAT is set to true.

... discarded ^5.8

Some applications may require the 0'th order cepstral coefficient in order to recover the filterbank coefficients from the cepstral coefficients.

... letters ^6.1

Some command names have single letter alternatives for compatibility with earlier versions of HTK.

...#tex2html_wrap_inline51665#^7.1

No current HTK tool can estimate or use these.

... element ^7.2

Covariance matrices are actually stored internally in lower triangular form

... CLASS="MATH">^7.3

The Choleski storage format is not used by default in HTK Version 2

... matrix ^7.4

Transform matrices are not used by any of the supported HTK tools.

... parameters ^7.5

If C0 or normalised log-energy are added these will be stripped prior to applying the linear transform

... definition ^7.6

The fact that this is possible does not mean that it is recommended practice!

... brackets ^7.7

This definition covers the textual version only. The syntax for the binary format is identical apart from the way that the lexical items are encoded.

... states ^7.8

Integer numbers are specified as either char or short. This has no effect on text-based definitions but for binary format it indicates the underlying C type used to represent the number.

... CLASS="MATH">^7.9

specifically, in equation 7.2 the GCONST value seen in HMM sets is calculated by multiplying the determinant of the covariance matrix by ${\mbox{\boldmath $(2 \pi)^n$}}$

... numbers ^7.10

Though the notation support n-ary trees, the regression class tree code can only generate binary regression class trees.

... as ^8.1

This form of criterion assumes that the language model parameters are fixed. As such it should really be called maximum conditional likelihood estimation.

... training ^8.2

The minimum word error rate (MWE) criterion is also implemented. However MPE has been found to perform slightly better on current large vocabulary speech recognition systems.

... matrices ^8.3

Discriminative training with multiple streams can also be run. However to simplify the notation it is assumed that only a single stream is being used.

... priors ^8.4

The third accumulate is not now used, but is stored for backward compatibility.

... data ^9.1

MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or additive noise effects.

... data ^9.2

MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or additive noise effects.

... transform ^9.3

There is no advantage in storing twice the log determininat, however this is maintained for backward compatibility with internal HTK releases.

... consider ^9.4

The current code in HHED for generating decision trees does not support generating trees for multiple streams. However, the code does support adaptation for hand generated trees.

... using ^9.5

In the current implementation of the code this form of transform can only be estimated in addition to the MLLRMEAN transform

... where ^9.6

For efficiency this transformation is implemented as

$\displaystyle \hat{{\mbox{\boldmath $o$}}}_r(t) = {\mbox{\boldmath $A$}}_r{\mbo... ...\mbox{\boldmath $b$}}_r = {\mbox{\boldmath $W$}}_r{\mbox{\boldmath $\zeta$}}(t)$

(9.18)

... mapping ^10.1

The physical HMM which corresponding to several logical HMMs will be arbitrarily named after one of them.

... likely ^11.1

Remember that discrete probabilities are scaled such that 32767 is equivalent to a probability of 0.000001 and 0 is equivalent to a probability of 1.0

... HMMs ^11.2

Also called semi-continuous HMMs in the the literature.

... words ^12.1

More precisely, nodes represent the ends of words and arcs represent the transitions between word ends. This distinction becomes important when describing recognition output since acoustic scores are attached to arcs not nodes.

... definitions ^13.1

Large HMM sets will often be distributed across a number of MMF files, in this case, the -H option will be repeated for each file.

... 10 ^13.2

The default behaviour of HRESULTS is slightly different to the widely used US NIST scoring software which uses weights of 3,3 and 4 and a slightly different alignment algorithm. Identical behaviour to NIST can be obtained by setting the -n option.

... words ^13.3

All the examples here will assume that each label corresponds to a word but in general the labels could stand for any recognition unit such as phones, syllables, etc. HRESULTS does not care what the labels mean but for human consumption, the labels SENT and WORD can be changed using the -a and -b options.

... commandword ^13.4

The HLED EX command can be used to compute phone level transcriptions when there is only one possible phone transcription per word

... control ^13.5

The underlying signal number must be given, HTK cannot interpret the standard Unix signal names such as SIGINT

...c:fundaments.^14.1

The theory components of this chapter - these first four sections - are condensed from portions of ``Adaptive Statistical Class-based Language Modelling'', G.L. Moore; Ph.D thesis, Cambridge University 2001

... conditions ^14.2

See section 5 of [Shannon 1948] for a more formal definition of ergodicity.

... fact.^14.3

Based on the analysis of 170 million words of newspaper and broadcast news text.

... text ^14.4

A couple of hundred million words, for example.

... class ^14.5

Since it is assumed that words are placed in the same class because they share certain properties.

... bigram ^14.6

By convention unigram refers to a 1-gram, bigram indicates a 2-gram and trigram is a 3-gram. There is no standard term for a 4-gram.

... 1993]^14.7

R. Kneser and H. Ney, ``Improved Clustering Techniques for Class-Based Statistical Language Modelling''; Proceedings of the European Conference on Speech Communication and Technology 1993, pp. 973-976

...;^14.8

That is, $C(G(w))=\sum_{x: G(x)=G(w)}C(x)$ .

... class.^14.9

Given this initialisation, the first $(\vert\mathbb{G}\vert-1)$ moves will be to place each word into an empty class, however, since the class map which maximises $F_{\mathrm{M}_\mathrm{C}}$ is the one which places each word into a singleton class.

... model.^14.10

Which will be higher, given maximum likelihood estimates.

... probability,^14.11

If it did then from equation 14.1 it follows that the probability of any piece of text containing that event would also be zero, and would have infinite perplexity.

... 1987]^14.12

S.M. Katz, ``Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recogniser''; IEEE Transactions on Acoustic, Speech and Signal Processing 1987, vol. 35 no. 3 pp. 400-401

... 1953]^14.13

I.J. Good, ``The Population Frequencies of Species and the Estimation of Population Parameters''; Biometrika 1953, vol. 40 (3,4) pp. 237-264

... discounted ^14.14

It is suggested that ``

or so is a good choice''

... discounting,^14.15

H. Ney, U. Essen and R. Kneser, ``On Structuring Probabilistic Dependences in Stochastic Language Modelling''; Computer Speech and Language 1994, vol.8 no.1 pp.1-38

... 1948]^14.16

C.E. Shannon, ``A Mathematical Theory of Communication''; The Bell System Technical Journal 1948, vol. 27 pp. 379-423, 623-656. Available online at http://galaxy.ucsd.edu/new/external/shannon.pdf

... HTK.^14.17

In fact a very simple text conditioning Perl script is included in LMTutorial/extras/LCond.pl for demonstration purposes only - it converts text to uppercase (so that words are considered equivalent irrespective of case) and reads the input punctuation in order to tag sentences, stripping most other punctuation. See the script for more details.

...-grams.^14.18

LGPREP can also perform text modification using supplied rules.

... tools.^15.1

STARTWORD and ENDWORD to be precise.

... like.^15.2

The exception to this is that differing text may follow a % character.

... \Grams\^16.1

That is, the first byte of the binary data immediately follows the newline character

... further.^17.1

On a 65,000 word vocabulary test set with 170 million words of training text this was found to occur after around 45 iterations

... memory,^17.2

other than a few small local variables taken from the stack as functions are called

... practice.^17.3

Note that these schemes are approximately similar, since the most frequent words are most likely to be encountered sooner in the training text and thus occur higher up in the word map

... option.^17.4

The author always uses this option but has not empirically tested its efficaciousness

... 1)^17.5

This is due to the different observation caching mechanisms used in HDECODE and the HADAPT module

... algorithm ^17.6

This algorithm is significantly different from earlier versions of HTK where K-means clustering was used at every iteration and the Viterbi alignment was limited to states

... ignored ^17.7

Prototypes should either have GConst set (the value does not matter) to avoid HTK trying to compute it or variances should be set to a positive value such as 1.0 to ensure that GConst is computable

... labels ^17.8

In earlier versions of HTK, HLED command names consisted of a single letter. These are still supported for backwards compatibility and they are included in the command summary produced using the -Q option. However, commands introduced since version 2.0 have two letter names.

...TARGETKIND ^17.9

The TARGETKIND is equivalent to the HCOERCE environment variable used in earlier versions of HTK

... grammar.^17.10

The expression between double angle brackets must be a simple list of alternative node names or a variable which has such a list as its value

... recognition ^17.11

In HTK V2 it is preferable for these context-loop expansions to be done automatically via HNET, to avoid requiring a dictionary entry for every context-dependent model

... used ^17.12

If the base-names or left/right context of the context-dependent names in a context-dependent loop are variables, no $ symbols are used when writing the context-dependent nodename.

... revised ^17.13

With the added benefit of rectifying some residual bugs in the HTK V1.5 implementation

... mechanism ^17.14

Using this option only makes sense if the HMM has skip transitions

... transcriptions ^17.15

The choice of ``Sentence'' and ``Word'' here is the usual case but is otherwise arbitrary. HRESULTS just compares label sequences. The sequences could be paragraphs, sentences, phrases or words, and the labels could be phrases, words, syllables or phones, etc. Options exist to change the output designations `SENT' and `WORD' to whatever is appropriate.

... mode ^17.16

It is not,of course, necessary to have multiple processors to use this program since each `parallel' activation can be executed sequentially on a single processor