In all of the preceding discussion, it has been assumed that input was from speech files stored on disk. These files would normally have been stored in parameterised form so that little or no conversion of the source speech data was required. When HVITE is invoked with no files listed on the command line, it assumes that input is to be taken directly from the audio input. In this case, configuration variables must be used to specify firstly how the speech waveform is to be captured and secondly, how the captured waveform is to be converted to parameterised form.
Dealing with waveform capture first, as described in section 5.12, HTK provides two main forms of control over speech capture: signals/keypress and an automatic speech/silence detector. To use the speech/silence detector alone, the configuration file would contain the following
# Waveform capture
SOURCERATE=625.0
SOURCEKIND=HAUDIO
SOURCEFORMAT=HTK
USESILDET=T
MEASURESIL=T
OUTSILWARN=T
ENORMALISE=F
where the source sampling rate is being set to 16 kHz (SOURCERATE gives the sample period in units of 100 ns, so 625.0 corresponds to 62.5 microseconds per sample). Notice that the SOURCEKIND must be set to HAUDIO and the SOURCEFORMAT must be set to HTK. Setting the Boolean variable USESILDET causes the speech/silence detector to be used, and the MEASURESIL and OUTSILWARN variables result in a measurement being taken of the background silence level prior to capturing the first utterance. To make sure that each input utterance is being captured properly, the HVITE option -g can be set to cause the captured wave to be output after each recognition attempt. Note that for a live audio input system, the configuration variable ENORMALISE should be explicitly set to FALSE both when training models and when performing recognition, because energy normalisation cannot be used with live audio input and the default setting for this variable is TRUE.
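The relationship between SOURCERATE and the sampling frequency can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and are not part of HTK; the only fact relied upon is that HTK expresses sample periods in units of 100 ns.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK times are expressed in units of 100 ns


def source_rate_to_hz(source_rate: float) -> float:
    """Convert a sample period in 100 ns units to a sampling frequency in Hz."""
    if source_rate <= 0:
        raise ValueError("sample period must be positive")
    return HTK_UNITS_PER_SECOND / source_rate


def hz_to_source_rate(frequency_hz: float) -> float:
    """Convert a sampling frequency in Hz to a sample period in 100 ns units."""
    if frequency_hz <= 0:
        raise ValueError("sampling frequency must be positive")
    return HTK_UNITS_PER_SECOND / frequency_hz


print(source_rate_to_hz(625.0))    # 16000.0
print(hz_to_source_rate(16000.0))  # 625.0
```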
As an alternative to using the speech/silence detector, a signal can be used to start and stop recording. For example,
# Waveform capture
SOURCERATE=625.0
SOURCEKIND=HAUDIO
SOURCEFORMAT=HTK
AUDIOSIG=2
would result in the Unix interrupt signal (usually the Control-C key) being used as a start and stop control. Key-press control of the audio input can be obtained by setting AUDIOSIG to a negative number.
Both of the above can be used together. In this case, audio capture is disabled until the specified signal is received; from then on, control is in the hands of the speech/silence detector.
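The AUDIOSIG example above gives the signal as a number. If the number for a named signal is not known on a particular system, it can be looked up, for instance with Python's standard signal module. The helper name below is hypothetical and the snippet is only a convenience sketch, unrelated to HTK itself.

```python
import signal


def signal_number(name: str) -> int:
    """Return the numeric value of a named signal on the current platform."""
    return signal.Signals[name].value


# The interrupt signal is number 2 on typical Unix systems,
# which matches the AUDIOSIG=2 setting shown above.
print(signal_number("SIGINT"))
```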
The captured waveform must be converted to the required target parameter kind. Thus, the configuration file must define all of the parameters needed to control the conversion of the waveform to the required target kind. This process is described in detail in Chapter 5. As an example, the following parameters would allow conversion to Mel-frequency cepstral coefficients with delta and acceleration parameters.
# Waveform to MFCC parameters
TARGETKIND=MFCC_0_D_A
TARGETRATE=100000.0
WINDOWSIZE=250000.0
ZMEANSOURCE=T
USEHAMMING = T
PREEMCOEF = 0.97
USEPOWER = T
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
Many of these variable settings are the default settings and could be omitted; they are included explicitly here as a reminder of the main configuration options available.
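As a rough check on what these settings imply, the following sketch derives the frame geometry and the size of each parameter vector. It assumes that TARGETRATE and WINDOWSIZE, like SOURCERATE, are in units of 100 ns, and that the waveform is captured at the 16 kHz rate used earlier (SOURCERATE=625.0). The function names are hypothetical, and the vector-size calculation handles only the _0, _E, _D and _A qualifiers.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK times are expressed in units of 100 ns


def frame_geometry(source_rate: float, target_rate: float, window_size: float) -> dict:
    """Derive window and shift sizes in samples, and the frame rate in Hz."""
    return {
        "samples_per_window": round(window_size / source_rate),
        "samples_per_shift": round(target_rate / source_rate),
        "frames_per_second": HTK_UNITS_PER_SECOND / target_rate,
    }


def vector_size(target_kind: str, num_ceps: int) -> int:
    """Number of components per frame for a kind such as MFCC_0_D_A."""
    _base, *qualifiers = target_kind.split("_")
    unsupported = set(qualifiers) - {"0", "E", "D", "A"}
    if unsupported:
        raise ValueError(f"qualifiers not handled by this sketch: {sorted(unsupported)}")
    # Static part: the cepstra plus C0 and/or energy if requested.
    static = num_ceps + ("0" in qualifiers) + ("E" in qualifiers)
    # Deltas and accelerations each append another copy of the static part.
    blocks = 1 + ("D" in qualifiers) + ("A" in qualifiers)
    return static * blocks


print(frame_geometry(625.0, 100000.0, 250000.0))
print(vector_size("MFCC_0_D_A", 12))  # 39
```

With the settings above this gives a 25 ms window of 400 samples, advanced by 160 samples (10 ms) per frame, and 39 components per frame.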
When HVITE is executed in direct audio input mode, it issues a prompt prior to each input and it is normal to enable basic tracing so that the recognition results can be seen. A typical terminal output might be
READY[1]>
Please speak sentence - measuring levels
Level measurement completed
DIAL ONE FOUR SEVEN
== [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3)
READY[2]>
CALL NINE TWO EIGHT
== [233 frames] -97.0850 [Ac=-22402.5 LM=-218.4] (Act=21.8)
etc
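The result lines in this output can be picked apart with a short script. The sketch below assumes the line layout is exactly as shown above; the field names are hypothetical labels chosen for illustration. In the sample output the second number agrees, to within rounding, with the sum of the Ac and LM scores divided by the number of frames, and the script checks that reading.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"
RESULT_LINE = re.compile(
    rf"==\s*\[(\d+) frames\]\s*({NUMBER})\s*"
    rf"\[Ac=({NUMBER})\s+LM=({NUMBER})\]\s*\(Act=({NUMBER})\)"
)


def parse_result_line(line: str) -> dict:
    """Extract the numeric fields from a '== [N frames] ...' result line."""
    match = RESULT_LINE.search(line)
    if match is None:
        raise ValueError(f"not a result line: {line!r}")
    frames, per_frame, acoustic, lm, active = match.groups()
    return {
        "frames": int(frames),
        "per_frame": float(per_frame),
        "acoustic": float(acoustic),
        "lm": float(lm),
        "active": float(active),
    }


result = parse_result_line("== [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3)")
average = (result["acoustic"] + result["lm"]) / result["frames"]
# The recomputed average should match the printed per-frame score up to rounding.
print(abs(average - result["per_frame"]) < 0.01)  # True
```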
If required, a transcription of each spoken input can be output to a label file or an MLF in the usual way by setting the -e option. However, to do this a file name must be synthesised. This is done by using a counter prefixed by the value of the HVITE configuration variable RECOUTPREFIX and suffixed by the value of RECOUTSUFFIX.
For example, with the settings
RECOUTPREFIX = sjy
RECOUTSUFFIX = .rec
then the output transcriptions would be stored as sjy0001.rec, sjy0002.rec, etc.
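The naming scheme can be mimicked as follows. This is a sketch only: the function name is hypothetical, and the four-digit zero-padded counter starting at 1 is inferred from the example names above rather than taken from a documented format.

```python
def rec_output_name(prefix: str, counter: int, suffix: str, width: int = 4) -> str:
    """Build a transcription file name from a prefix, a padded counter and a suffix."""
    # width=4 is an assumption based on the sjy0001.rec example.
    return f"{prefix}{counter:0{width}d}{suffix}"


for n in (1, 2):
    print(rec_output_name("sjy", n, ".rec"))
# sjy0001.rec
# sjy0002.rec
```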