In this section, the basic mechanisms involved in transforming a
speech waveform into a sequence of parameter vectors will be
described. Throughout this section, it is assumed that the
SOURCEKIND is WAVEFORM and that data is being read from
an HTK format file via HWAVE. Reading from different format
files is described below in section 5.11.
Much of the
material in this section also applies to data read direct from an audio
device; the
additional features needed to deal with this latter case are
described later in section 5.12.
The overall process is illustrated in Fig.
which shows the sampled waveform being converted into a
sequence of parameter blocks. In general, HTK regards
both waveform files and parameter files as being just
sample sequences, the only difference being that in the former
case the samples are 2-byte integers and in the latter they
are multi-component vectors. The sample rate of the input
waveform will normally be determined from the input file
itself. However, it can be set explicitly using the
configuration parameter SOURCERATE. The period
between each parameter vector determines the output sample
rate and it is set using the configuration parameter
TARGETRATE. The segment of waveform used to determine
each parameter vector is usually referred to as a window
and its size is set by the
configuration parameter WINDOWSIZE. Notice that the
window size and frame rate are independent. Normally,
the window duration will be longer than the frame period so that
successive windows overlap as illustrated in
Fig.
For example, a waveform sampled at 16kHz would be converted into 100 parameter vectors per second using a 25 msec window by setting the following configuration parameters.
    SOURCERATE = 625
    TARGETRATE = 100000
    WINDOWSIZE = 250000

Remember that all durations are specified in 100 nsec units.
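The arithmetic behind these three values can be checked with a short Python sketch. It is illustrative only and not part of HTK; the helper names `htk_timing` and `samples_per` are hypothetical. It converts a sample rate, a frame rate and a window length into 100 nsec units, and then derives how many waveform samples each frame shift and each window spans.

```python
UNITS_PER_SECOND = 10_000_000  # one second expressed in 100 ns units


def htk_timing(sample_rate_hz, frames_per_second, window_ms):
    """Return SOURCERATE, TARGETRATE and WINDOWSIZE in 100 ns units."""
    return {
        "SOURCERATE": UNITS_PER_SECOND / sample_rate_hz,
        "TARGETRATE": UNITS_PER_SECOND / frames_per_second,
        "WINDOWSIZE": window_ms * UNITS_PER_SECOND / 1000,
    }


def samples_per(duration_units, source_rate_units):
    """Number of waveform samples spanned by a duration in 100 ns units."""
    return round(duration_units / source_rate_units)


cfg = htk_timing(sample_rate_hz=16000, frames_per_second=100, window_ms=25)
shift = samples_per(cfg["TARGETRATE"], cfg["SOURCERATE"])
window = samples_per(cfg["WINDOWSIZE"], cfg["SOURCERATE"])
overlap = window - shift  # samples shared by two successive windows

print(cfg)
print(shift, window, overlap)
```

For the 16kHz example, each window covers 400 samples and successive windows start 160 samples apart, so neighbouring windows share 240 samples.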
Independent of what parameter kind is required, there are some simple pre-processing operations that can be applied prior to performing the actual signal analysis. Firstly, the DC mean can be removed from the source waveform by setting the Boolean configuration parameter ZMEANSOURCE to true (i.e. T). This is useful when the original analogue-digital conversion has added a DC offset to the signal. It is applied to each window individually so that it can be used both when reading from a file and when using direct audio input.
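Secondly, it is common practice to pre-emphasise the signal by applying the first order difference equation

\[ s'_n = s_n - k \, s_{n-1} \]

to the samples \( s_n, \; n = 1, \ldots, N \) in each window, where \( k \) is the pre-emphasis coefficient set by the configuration parameter PREEMCOEF. Finally, it is usually beneficial to taper the samples in each window so that discontinuities at the window edges are attenuated. This is done by setting the Boolean configuration parameter USEHAMMING to true, which applies the transformation

\[ s'_n = \left\{ 0.54 - 0.46 \cos\left( \frac{2\pi (n-1)}{N-1} \right) \right\} s_n \]

to the samples in the window.

In practice, all three of the above are usually applied. Hence, a configuration file will typically contain the following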
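    ZMEANSOURCE = T
    USEHAMMING = T
    PREEMCOEF = 0.97

Certain types of artificially generated waveform data can cause numerical overflows with some coding schemes. In such cases adding a small amount of random noise to the waveform data solves the problem. The noise is added to the samples using

\[ s'_n = s_n + q \, \mathrm{RND}() \]

where RND() is a uniformly distributed random value over the interval [-1.0, +1.0) and \( q \) is the scaling factor, which is set with the configuration parameter ADDDITHER.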
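The pre-processing operations discussed above can be sketched for a single window in Python. This is an illustration under stated assumptions rather than HTK's own code: the function names are hypothetical, the first sample is left unchanged by pre-emphasis because it has no predecessor, and the order in which dither is combined with the other steps is not addressed, so `add_dither` is kept separate.

```python
import math
import random


def remove_dc(window):
    """Subtract the window's own mean (the effect of ZMEANSOURCE = T)."""
    mean = sum(window) / len(window)
    return [s - mean for s in window]


def pre_emphasise(window, k=0.97):
    """Apply s'[n] = s[n] - k * s[n-1]; first sample kept as is (assumption)."""
    rest = [window[n] - k * window[n - 1] for n in range(1, len(window))]
    return [window[0]] + rest


def hamming(window):
    """Taper with 0.54 - 0.46 * cos(2*pi*n / (N-1)) for n = 0 .. N-1."""
    size = len(window)  # assumed to be at least 2
    return [(0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))) * s
            for n, s in enumerate(window)]


def add_dither(window, q, seed=0):
    """Add q * r with r uniform in [-1, 1); a fixed seed repeats the noise."""
    rng = random.Random(seed)
    return [s + q * (2.0 * rng.random() - 1.0) for s in window]


def preprocess(window, k=0.97):
    """DC removal, then pre-emphasis, then Hamming taper."""
    return hamming(pre_emphasise(remove_dc(window), k))


# Toy 400-sample window: a slow sinusoid riding on a DC offset of 100.
frame = [100.0 + 10.0 * math.sin(0.3 * n) for n in range(400)]
out = preprocess(frame)
print(len(out), out[0], out[200])
```

Removing the mean per window, rather than over the whole file, is what lets the same step be used on live audio, where the full signal is not available in advance.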
One problem that can arise when processing speech waveform files obtained from external sources, such as databases on CD-ROM, is that the byte-order may be different to that used by the machine on which HTK is running. To deal with this problem, HWAVE can perform automatic byte-swapping in order to preserve proper byte order. HTK assumes by default that speech waveform data is encoded as a sequence of 2-byte integers as is the case for most current speech databases. If the source format is known, then HWAVE will also make an assumption about the byte order used to create speech files in that format. It then checks the byte order of the machine that it is running on and automatically performs byte-swapping if the order is different. For unknown formats, proper byte order can be ensured by setting the configuration parameter BYTEORDER to VAX if the speech data was created on a little-endian machine such as a VAX or an IBM PC, and to anything else (e.g. NONVAX) if the speech data was created on a big-endian machine such as a SUN, HP or Macintosh machine.
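The byte-order check described above can be illustrated with a small Python sketch. It is not HWAVE's implementation; the function names and the "little"/"big" labels are this sketch's own and are not HTK configuration values. It compares the byte order of the data with that of the host machine and shows that swapping each pair of bytes recovers the same 2-byte integer samples.

```python
import struct
import sys


def needs_swap(data_order, machine_order=sys.byteorder):
    """True when the data's byte order differs from the machine's."""
    return data_order != machine_order


def read_samples(raw, data_order):
    """Decode 2-byte signed integer samples stored in the given byte order."""
    count = len(raw) // 2
    prefix = "<" if data_order == "little" else ">"
    return list(struct.unpack(prefix + str(count) + "h", raw[:2 * count]))


def swap_bytes(raw):
    """Swap the two bytes of every sample; raw must have even length."""
    swapped = bytearray(raw)
    swapped[0::2], swapped[1::2] = raw[1::2], raw[0::2]
    return bytes(swapped)


little = struct.pack("<3h", 1, -2, 300)   # samples written little-endian
print(needs_swap("little"))               # depends on the host machine
print(read_samples(little, "little"))
print(read_samples(swap_bytes(little), "big"))
```

Reading the original bytes as little-endian and the swapped bytes as big-endian yields the same sample values, which is the property the automatic byte-swapping relies on.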
The reading/writing of HTK format waveform files can be further controlled
via the configuration parameters NATURALREADORDER and
NATURALWRITEORDER. The effect and default settings of these parameters
are described in section .
Note that BYTEORDER should not be used when NATURALREADORDER
is set to true. Finally, note that HTK can also byte-swap parameterised
files in a similar way provided that only the byte-order of each 4 byte float
requires inversion.
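For parameterised data, the equivalent operation reverses the bytes of each 4 byte float. The following Python sketch shows the idea on an in-memory buffer; the function name `swap_float32` is hypothetical and the sketch says nothing about how HTK handles file headers.

```python
import struct


def swap_float32(raw):
    """Reverse the byte order of every 4-byte group in the buffer."""
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


big = struct.pack(">2f", 1.5, -0.25)        # two floats stored big-endian
restored = struct.unpack("<2f", swap_float32(big))
print(restored)
```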