Variable Frame Rate (VFR) Feature Extraction Toolkit (download)
*Note: If you download and use this code, please cite (You et al., 2004).*
Variable Frame Rate (VFR) analysis is a feature extraction method for noise-robust automatic speech recognition (ASR). It builds on speech perception research showing that dynamic spectro-temporal information is important and, hence, that speech segments of equal duration are not all equally important perceptually. For example, formant transitions at the onset of a vowel can carry more discriminative information than the steady-state part of the vowel. The general framework of VFR can be summarized as follows:
An important aspect of VFR systems is the distance metric used. In (Ponting and Peeling, 1991), the authors present a metric based on Euclidean distance in the feature domain. In (Zhu and Alwan, 2000), a weighted Euclidean distance is used, which accounts for the energy of the current frame. In (You et al., 2004), the authors present a metric which approximates the entropy of the current frame.
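To make the role of the distance metric concrete, here is a minimal Python sketch of distance-driven frame selection. It assumes the commonly described accumulate-and-threshold scheme: distances between consecutive frames are summed, and a frame is kept whenever the running sum reaches a threshold. The names `select_frames`, `threshold`, and `weights` are hypothetical and are not part of the toolkit. The optional per-frame weights only stand in for an energy-based weighting, and the entropy-based metric of (You et al., 2004) is not reproduced here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_frames(frames, threshold, weights=None):
    """Return indices of frames kept by accumulate-and-threshold selection.

    frames    : list of feature vectors computed at a fixed, high frame rate
    threshold : accumulated distance needed to keep a frame (assumed parameter)
    weights   : optional per-frame weights (hypothetical, e.g. energy-based)
    """
    if not frames:
        return []
    kept = [0]          # always keep the first frame
    accumulated = 0.0
    for i in range(1, len(frames)):
        d = euclidean(frames[i - 1], frames[i])
        if weights is not None:
            d *= weights[i]
        accumulated += d
        if accumulated >= threshold:
            kept.append(i)
            accumulated = 0.0   # restart accumulation after each kept frame
    return kept


# Steady regions contribute no distance; transitions trigger selection.
frames = [[0, 0], [0, 0], [3, 4], [3, 4], [3, 4], [0, 0]]
kept = select_frames(frames, threshold=4.0)
```

With this input, frames are kept at the two transitions and skipped in the steady-state stretches, which is the behavior the perceptual argument above motivates.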
This toolkit implements a MATLAB version of the system described in (You et al., 2004).
This toolkit consists of one function: feature=VFR(input,fsamp,
The function VFR( ) is passed raw speech data, and returns MFCC features, along with deltas and double deltas, corresponding to selected frames. Frames are concatenated into a matrix, with time as the vertical axis. The function includes an option to display graphic results of the current VFR analysis. Figure 1 shows an example output display.
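The layout of the returned matrix can be illustrated with a short Python sketch. It is not the toolkit's MATLAB code: the source does not say how the deltas are computed, so the standard regression formula with a two-frame window and edge replication is assumed here, and the names `deltas` and `stack_features` are hypothetical.

```python
def deltas(rows, window=2):
    """Regression-based deltas for a time-by-feature matrix (list of rows).

    Assumed formula: d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2),
    with the first and last frames replicated at the edges.
    """
    n_frames = len(rows)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(rows[t])):
            num = 0.0
            for n in range(1, window + 1):
                ahead = rows[min(t + n, n_frames - 1)][k]
                behind = rows[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def stack_features(static):
    """Append deltas and double-deltas to each row; time stays on the rows."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Five selected frames with 14 static values each give a 5 x 42 matrix.
static = [[float(t)] * 14 for t in range(5)]
features = stack_features(static)
```

Each row of `features` corresponds to one selected frame, with the static values first, then the deltas, then the double-deltas.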
Table I provides word-accuracy results for VFR analysis, along with results for the MFCC baseline, when applied to Sets A and B of the Aurora-2 database. The baseline system extracts 13 cepstral coefficients, along with the frame energy. The overall feature vector then comprises the static coefficients, along with deltas and double-deltas. The recognizer used 16-state, 6-mixture word models. For the results in Table I, voice activity detection (VAD) was performed based on (Sohn and Kim, 1999).
Note that VFR may be used
in conjunction with other front-end speech enhancement techniques (Zhu
et al., 2001).
Figure 1. Example output display from VFR analysis. The input signal ("one") was corrupted with subway noise at 5 dB.
Table I. Word-accuracy results (%) for VFR obtained on the Aurora-2 database.

| Set | Noise | Features | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB | -5 dB |
|---|---|---|---|---|---|---|---|---|
| Set A | N1 | MFCC | 95.30 | 90.05 | 70.74 | 40.47 | 8.60 | 0.28 |
| Set A | N1 | + VFR | 97.14 | 93.77 | 84.74 | 62.70 | 29.75 | 11.15 |
| Set A | N2 | MFCC | 96.43 | 91.17 | 74.55 | 38.81 | 6.17 | 0.15 |
| Set A | N2 | + VFR | 98.19 | 95.53 | 86.61 | 61.67 | 26.15 | 10.91 |
| Set A | N3 | MFCC | 96.12 | 89.02 | 67.31 | 25.23 | 0.95 | 0.00 |
| Set A | N3 | + VFR | 98.00 | 95.20 | 83.48 | 53.33 | 17.09 | 9.60 |
| Set A | N4 | MFCC | 95.34 | 88.31 | 68.90 | 33.91 | 4.97 | 0.46 |
| Set A | N4 | + VFR | 97.35 | 93.64 | 82.32 | 59.24 | 27.74 | 12.10 |
| Set A | Ave. | MFCC | 95.80 | 89.64 | 70.38 | 34.61 | 5.17 | 0.22 |
| Set A | Ave. | + VFR | 97.67 | 94.54 | 84.29 | 59.24 | 25.18 | 10.94 |
| Set B | N1 | MFCC | 97.05 | 92.45 | 78.11 | 46.58 | 10.68 | 0.52 |
| Set B | N1 | + VFR | 98.59 | 96.10 | 89.16 | 68.84 | 34.76 | 13.79 |
| Set B | N2 | MFCC | 95.89 | 90.36 | 72.67 | 38.06 | 6.44 | 0.27 |
| Set B | N2 | + VFR | 97.61 | 94.20 | 84.64 | 57.71 | 23.76 | 11.00 |
| Set B | N3 | MFCC | 96.91 | 93.05 | 79.54 | 44.08 | 11.09 | 1.49 |
| Set B | N3 | + VFR | 98.57 | 96.57 | 90.07 | 67.25 | 31.82 | 13.15 |
| Set B | N4 | MFCC | 96.88 | 91.33 | 74.36 | 36.25 | 5.03 | 0.31 |
| Set B | N4 | + VFR | 98.40 | 95.87 | 87.57 | 61.22 | 24.38 | 11.23 |
| Set B | Ave. | MFCC | 96.68 | 91.80 | 76.17 | 41.24 | 8.31 | 0.65 |
| Set B | Ave. | + VFR | 98.29 | 95.69 | 87.86 | 63.76 | 28.68 | 12.29 |
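One way to read Table I is in terms of relative word-error reduction, which compares the errors that remain rather than the raw accuracies. The Python sketch below shows the arithmetic for the Set A average row; the helper name `relative_error_reduction` is hypothetical, and the inputs are copied directly from the table.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in word error rate (%), from word accuracies in %."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Set A average row of Table I: (MFCC accuracy, + VFR accuracy) per SNR.
set_a_average = {
    "20 dB": (95.80, 97.67),
    "15 dB": (89.64, 94.54),
    "10 dB": (70.38, 84.29),
    "5 dB": (34.61, 59.24),
    "0 dB": (5.17, 25.18),
}

reductions = {
    snr: relative_error_reduction(mfcc, vfr)
    for snr, (mfcc, vfr) in set_a_average.items()
}
```

At 10 dB, for instance, the word error falls from 29.62% to 15.71%, a relative reduction of roughly 47%.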
References:
K. M. Ponting and S. M. Peeling, 1991, The use of variable frame rate analysis in speech recognition, Computer Speech and Language, Vol. 5, No. 2, pp. 169-179.
Q. Zhu and A. Alwan, 2000, On the use of variable frame rate in speech recognition, ICASSP, pp. 3264-3267.
H. You, Q. Zhu and A. Alwan, 2004, Entropy-based variable frame rate analysis of speech signals and its application to ASR, ICASSP, pp. 549-552.
Q. Zhu, X. Cui, M. Iseli and A. Alwan, 2001, Noise Robust Feature Extraction for ASR using the Aurora 2 Database, Proc. EUROSPEECH, Aalborg, Denmark, Vol. 1, pp. 185-188.