Speech Signal Processing

In this section, the basic mechanisms involved in transforming a speech waveform into a sequence of parameter vectors will be described. Throughout this section, it is assumed that the SOURCEKIND is WAVEFORM and that data is being read from an HTK format file via HWAVE. Reading from different format files is described below in section 5.11. Much of the material in this section also applies to data read directly from an audio device; the additional features needed to deal with this latter case are described later in section 5.12.

The overall process is illustrated in the Speech Encoding Process figure below, which shows the sampled waveform being converted into a sequence of parameter blocks. In general, HTK regards both waveform files and parameter files as being just sample sequences, the only difference being that in the former case the samples are 2-byte integers and in the latter they are multi-component vectors. The sample rate of the input waveform will normally be determined from the input file itself. However, it can be set explicitly using the configuration parameter SOURCERATE. The period between each parameter vector determines the output sample rate and it is set using the configuration parameter TARGETRATE. The segment of waveform used to determine each parameter vector is usually referred to as a window and its size is set by the configuration parameter WINDOWSIZE. Notice that the window size and frame rate are independent. Normally, the window size will be larger than the frame period (the interval set by TARGETRATE) so that successive windows overlap, as the figure shows.

For example, a waveform sampled at 16kHz would be converted into 100 parameter vectors per second using a 25 msec window by setting the following configuration parameters.

    SOURCERATE = 625
    TARGETRATE = 100000
    WINDOWSIZE = 250000

Remember that all durations are specified in 100 nsec units.
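The arithmetic behind these three values can be checked with a short Python sketch. The function names `htk_config`, `frame_layout` and `count_frames` are hypothetical helpers, not part of HTK, and the frame count assumes that only complete windows are used, which is an assumption about end-of-file handling rather than something stated above.

```python
# Sketch only: converts rates and durations into 100 nsec units.
# All function names are hypothetical helpers, not HTK functions.

UNITS_PER_SECOND = 1e7  # one unit is 100 nsec


def htk_config(sample_rate_hz, frames_per_second, window_seconds):
    """Return SOURCERATE, TARGETRATE and WINDOWSIZE in 100 nsec units."""
    return {
        "SOURCERATE": UNITS_PER_SECOND / sample_rate_hz,            # sample period
        "TARGETRATE": round(UNITS_PER_SECOND / frames_per_second),  # frame period
        "WINDOWSIZE": round(window_seconds * UNITS_PER_SECOND),     # window length
    }


def frame_layout(cfg):
    """Return (samples per window, samples between window starts)."""
    window_samples = round(cfg["WINDOWSIZE"] / cfg["SOURCERATE"])
    shift_samples = round(cfg["TARGETRATE"] / cfg["SOURCERATE"])
    return window_samples, shift_samples


def count_frames(num_samples, cfg):
    """Count complete windows in a waveform (assumption: partial windows dropped)."""
    window_samples, shift_samples = frame_layout(cfg)
    if num_samples < window_samples:
        return 0
    return 1 + (num_samples - window_samples) // shift_samples


cfg = htk_config(sample_rate_hz=16000, frames_per_second=100, window_seconds=0.025)
print(cfg)                # SOURCERATE 625.0, TARGETRATE 100000, WINDOWSIZE 250000
print(frame_layout(cfg))  # 400 samples per window, 160 samples between windows
```

With these settings each 400-sample window starts 160 samples after the previous one, so consecutive windows share 240 samples.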

[Figure: Speech Encoding Process]

Independent of what parameter kind is required, there are some simple pre-processing operations that can be applied prior to performing the actual signal analysis. Firstly, the DC mean can be removed from the source waveform by setting the Boolean configuration parameter ZMEANSOURCE to true (i.e. T). This is useful when the original analogue-to-digital conversion has added a DC offset to the signal. It is applied to each window individually so that it can be used both when reading from a file and when using direct audio input.
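As an illustration of per-window DC removal, the following Python sketch subtracts the mean of a single window from each of its samples. The function name `remove_dc_mean` is hypothetical and the code is not taken from HTK; it only shows the operation described above.

```python
# Sketch only: per-window DC mean removal, as enabled by ZMEANSOURCE = T.
# remove_dc_mean is a hypothetical name, not an HTK function.

def remove_dc_mean(window):
    """Return a copy of the window with its mean value subtracted."""
    if not window:
        return []
    mean = sum(window) / len(window)
    return [s - mean for s in window]


# A window with a DC offset of 100 added by the converter.
print(remove_dc_mean([101, 99, 102, 98]))  # [1.0, -1.0, 2.0, -2.0]
```

Because the mean is computed from the window alone, the same code works on a stream of windows from an audio device, where the whole-file mean is never available.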

Secondly, it is common practice to pre-emphasise the signal by applying the first order difference equation

$\displaystyle {s^{\prime}}_n = s_n - k\,s_{n-1}$

(5.1)

to the samples $\{s_n, n=1,N \}$ in each window. Here $k$ is the pre-emphasis coefficient, which should be in the range $0 \leq k < 1$. It is specified using the configuration parameter PREEMCOEF. Finally, it is usually beneficial to taper the samples in each window so that discontinuities at the window edges are attenuated. This is done by setting the Boolean configuration parameter USEHAMMING to true. This applies the following transformation to the samples $\{s_n, n=1,N \}$ in the window

$\displaystyle {s^{\prime}}_n = \left\{ 0.54 - 0.46 \cos \left( \frac{2 \pi (n-1)}{N-1} \right) \right\} s_n$

(5.2)

When both pre-emphasis and Hamming windowing are enabled, pre-emphasis is performed first.
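Equations 5.1 and 5.2 can be sketched in Python as follows. The function names are hypothetical. Equation 5.1 does not say what happens to the first sample of a window, which has no predecessor; scaling it by $(1 - k)$ is an assumption of this sketch, as is returning a one-sample window unchanged in `hamming`.

```python
import math

# Sketch only: pre-emphasis (eq. 5.1) followed by Hamming windowing (eq. 5.2).
# Function names are hypothetical, not HTK functions.


def pre_emphasise(window, k=0.97):
    """Apply s'_n = s_n - k * s_{n-1} within one window."""
    if not window:
        return []
    # Assumption: the first sample has no predecessor, so scale it by (1 - k).
    out = [window[0] * (1.0 - k)]
    out.extend(window[n] - k * window[n - 1] for n in range(1, len(window)))
    return out


def hamming(window):
    """Apply the taper 0.54 - 0.46 * cos(2*pi*(n-1)/(N-1)) for n = 1..N."""
    size = len(window)
    if size < 2:
        return list(window)  # assumption: nothing to taper
    # enumerate() is 0-based, so its index i plays the role of (n - 1).
    return [
        (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (size - 1))) * s
        for i, s in enumerate(window)
    ]


def pre_emphasise_then_window(window, k=0.97):
    """Pre-emphasis is performed first, then the Hamming taper."""
    return hamming(pre_emphasise(window, k))


print(hamming([1.0] * 5))  # taper rises from about 0.08 to 1.0 and back
```

The order matters because the taper changes the ratio between neighbouring samples, so applying it before the difference equation would give a different result.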

In practice, all three of the above are usually applied. Hence, a configuration file will typically contain the following

    ZMEANSOURCE = T
    USEHAMMING = T
    PREEMCOEF = 0.97

Certain types of artificially generated waveform data can cause numerical overflows with some coding schemes. In such cases adding a small amount of random noise to the waveform data solves the problem. The noise is added to the samples using

$\displaystyle {s^{\prime}}_n = s_n + q RND()$

(5.3)

where $RND()$ is a uniformly distributed random value over a fixed interval and $q$ is the scaling factor. The amount of noise added to the data ($q$) is set with the configuration parameter ADDDITHER. A positive value causes the noise signal added to be the same every time (ensuring that the same file always gives exactly the same results). With a negative value the noise is random and the same file may produce slightly different results in different trials.
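The behaviour of equation 5.3 under positive and negative settings can be sketched in Python. Several details here are assumptions rather than facts from the text: the random values are drawn from $[-1, 1)$, the magnitude of the setting is used as the scaling factor, a value of zero disables dithering, and the fixed seed `12345` is an arbitrary illustrative choice. The function name `add_dither` is hypothetical.

```python
import random

# Sketch only: dithering in the style of eq. 5.3, s'_n = s_n + q * RND().
# Assumptions: RND() is uniform on [-1, 1), |setting| is the scale,
# and the fixed seed is arbitrary. add_dither is a hypothetical name.


def add_dither(samples, setting, seed=12345):
    """Add scaled uniform noise; repeatable if setting > 0, random if < 0."""
    if setting == 0.0:
        return list(samples)  # assumption: zero means no dithering
    q = abs(setting)
    # Positive setting: seeded generator, so the noise is the same every time.
    # Negative setting: unseeded generator, so each run differs.
    rng = random.Random(seed) if setting > 0 else random.Random()
    return [s + q * (2.0 * rng.random() - 1.0) for s in samples]


silence = [0] * 8  # artificially generated all-zero data
first = add_dither(silence, 1.0)
second = add_dither(silence, 1.0)
print(first == second)  # True: a positive setting is repeatable
```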

One problem that can arise when processing speech waveform files obtained from external sources, such as databases on CD-ROM, is that the byte order may differ from that used by the machine on which HTK is running. To deal with this problem, HWAVE can perform automatic byte-swapping in order to preserve proper byte order. HTK assumes by default that speech waveform data is encoded as a sequence of 2-byte integers, as is the case for most current speech databases. If the source format is known, then HWAVE will also make an assumption about the byte order used to create speech files in that format. It then checks the byte order of the machine that it is running on and automatically performs byte-swapping if the order is different. For unknown formats, proper byte order can be ensured by setting the configuration parameter BYTEORDER to VAX if the speech data was created on a little-endian machine such as a VAX or an IBM PC, and to anything else (e.g. NONVAX) if the speech data was created on a big-endian machine such as a SUN, HP or Macintosh machine.
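The byte-swapping decision for 2-byte samples can be sketched in Python. This illustrates the logic described above and is not HWAVE's code; the function names are hypothetical, and the only input taken from the text is that a BYTEORDER of VAX means little-endian data and any other value means big-endian data.

```python
import struct
import sys

# Sketch only: decide whether 2-byte samples need swapping and decode them.
# Function names are hypothetical, not HWAVE functions.


def needs_swap(byteorder_setting):
    """True if the data byte order differs from this machine's byte order."""
    data_is_little = byteorder_setting == "VAX"
    machine_is_little = sys.byteorder == "little"
    return data_is_little != machine_is_little


def swap_int16(raw):
    """Exchange the two bytes of every 2-byte sample."""
    swapped = bytearray(raw)
    swapped[0::2], swapped[1::2] = raw[1::2], raw[0::2]
    return bytes(swapped)


def decode_samples(raw, byteorder_setting):
    """Return the samples as Python ints, swapping bytes only when required."""
    if needs_swap(byteorder_setting):
        raw = swap_int16(raw)
    # "=" selects native byte order with standard sizes; "h" is a 2-byte int.
    return list(struct.unpack(f"={len(raw) // 2}h", raw))


big_endian_data = struct.pack(">3h", 1, -2, 300)
print(decode_samples(big_endian_data, "NONVAX"))  # [1, -2, 300]
```

In everyday Python the swap can be avoided by passing `<` or `>` to `struct.unpack` directly; the explicit version is shown only because it mirrors the check-then-swap procedure described above.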

The reading/writing of HTK format waveform files can be further controlled via the configuration parameters NATURALREADORDER and NATURALWRITEORDER. The effect and default settings of these parameters are described in section . Note that BYTEORDER should not be used when NATURALREADORDER is set to true. Finally, note that HTK can also byte-swap parameterised files in a similar way provided that only the byte-order of each 4 byte float requires inversion.
