Variable Frame Rate

Variable Frame Rate (VFR) Feature Extraction Toolkit (download)

*Note: If you download and use this code, please reference (You et al., 2004).*

Variable Frame Rate (VFR) analysis is a feature extraction method for noise-robust automatic speech recognition (ASR). It builds on speech perception research showing that dynamic spectro-temporal information is important and, hence, that speech segments of equal duration are not equally important perceptually. For example, formant transitions at the onset of a vowel can carry more discriminative information than the steady-state part of the vowel. The general framework of VFR can be summarized as follows:

Oversample the observed speech in the feature domain, e.g. by windowing the speech signal every ~2.5 ms.
Determine a local measure of feature distance for each frame.
Calculate a cumulative sum of the feature distance over consecutive frames.
- If the current sum is less than a predetermined threshold, discard the current frame.
- If the current sum is greater than the predetermined threshold, keep the current frame and reset the cumulative sum to zero.
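
The selection loop in the framework above can be sketched as follows. This is only an illustrative Python sketch, not the toolkit's MATLAB code: the function name `select_frames`, the toy distances, and the threshold value are hypothetical, and treating a sum exactly equal to the threshold as "discard" is an assumption, since the description does not cover that case.

```python
def select_frames(distances, threshold):
    """Return the indices of frames kept by cumulative-distance thresholding.

    distances[i] is the local feature distance of oversampled frame i;
    how that distance is measured is left to the caller.
    """
    kept = []
    running_sum = 0.0
    for i, d in enumerate(distances):
        running_sum += d
        if running_sum > threshold:
            kept.append(i)      # enough change has accumulated: keep this frame
            running_sum = 0.0   # and restart the accumulation
        # otherwise the frame is discarded and the sum carries over
    return kept


# Toy input: small distances mimic a steady-state region,
# large distances mimic a rapid spectral transition.
distances = [0.125] * 6 + [1.5] * 3 + [0.125] * 6
kept = select_frames(distances, threshold=1.0)
print(kept)  # [6, 7, 8]
```

In this toy case every frame in the transition is kept and every steady-state frame is dropped, which is the behavior the framework is designed to produce.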

An important aspect of VFR systems is the distance metric used. In (Ponting and Peeling, 1991), the authors present a metric based on Euclidean distance in the feature domain. In (Zhu and Alwan, 2000), a weighted Euclidean distance is used that accounts for the energy of the current frame. In (You et al., 2004), the authors present a metric that approximates the entropy of the current frame.
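
To make the first two kinds of metric concrete, here is a minimal Python sketch. It assumes the distance is taken between two adjacent feature vectors, and it leaves the weight as a caller-supplied number: the exact energy weighting of (Zhu and Alwan, 2000) and the entropy approximation of (You et al., 2004) are not reproduced here. The function names are hypothetical.

```python
import math


def euclidean_distance(prev_frame, cur_frame):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_frame, cur_frame)))


def weighted_distance(prev_frame, cur_frame, weight):
    """Euclidean distance scaled by a frame-dependent weight.

    The weight is a placeholder for an energy-derived factor; how it is
    computed from the frame energy is deliberately left unspecified.
    """
    return weight * euclidean_distance(prev_frame, cur_frame)


prev_frame = [1.0, 2.0, 3.0]
cur_frame = [1.0, 5.0, 7.0]
plain = euclidean_distance(prev_frame, cur_frame)            # 5.0
scaled = weighted_distance(prev_frame, cur_frame, weight=0.5)  # 2.5
```

A weight below one shrinks the distance contributed by a frame, so such a frame adds less to the cumulative sum and is less likely to be kept.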

This toolkit implements a MATLAB version of the system described in (You et al., 2004).

This toolkit consists of a single function: feature=VFR(input,fsamp,winshft,VAD)

The function VFR( ) takes raw speech data and returns MFCC features, along with deltas and double deltas, for the selected frames. Frames are concatenated into a matrix, with time along the vertical axis (one frame per row). The function includes an option to display graphic results of the current VFR analysis. Figure 1 shows an example output display.
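
As a rough illustration of that output layout, the Python sketch below appends deltas and double deltas to a matrix of static features stored one frame per row. The regression formula, the window of two frames on each side, and the edge handling by repetition are common choices assumed here for illustration; the toolkit's actual delta computation is not documented on this page, and the function names are hypothetical.

```python
def deltas(frames, n=2):
    """Regression-based deltas over a window of n frames on each side.

    frames is a list of rows (one feature vector per frame). Frames beyond
    the ends are approximated by repeating the first or last frame.
    """
    num_frames = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        row = []
        for j in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                ahead = frames[min(t + k, num_frames - 1)][j]
                behind = frames[max(t - k, 0)][j]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Return rows of [static, delta, double delta] for every frame."""
    d1 = deltas(static)
    d2 = deltas(d1)  # double deltas are deltas of the deltas
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy static features: a linear ramp in column 0, a constant in column 1.
static = [[float(t), 2.0] for t in range(5)]
features = add_dynamic_features(static)
```

Each output row is three times as wide as the static vector, and the number of rows equals the number of frames passed in.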

Table I provides word-accuracy results for VFR analysis, along with results for the MFCC baseline, when applied to Sets A and B of the Aurora-2 database. The baseline system extracts 13 cepstral coefficients, along with the frame energy. The overall feature vector then consists of the static coefficients, along with deltas and double deltas. The recognizer used 16-state, 6-mixture word models. For the results in Table I, voice activity detection (VAD) was performed based on (Sohn and Kim, 1999).

Note that VFR may be used in conjunction with other front-end speech enhancement techniques (Zhu et al., 2001).

Figure 1. Example output display from VFR analysis. The input signal (“one”) was corrupted with Subway Noise at 5 dB.

Table I. Word-accuracy results (%) for VFR obtained on the Aurora-2 database, by noise condition (N1 to N4) and SNR.

Set	Noise	Features	20 dB	15 dB	10 dB	5 dB	0 dB	-5 dB
Set A	N1	MFCC	95.30	90.05	70.74	40.47	8.60	0.28
	N1	+ VFR	97.14	93.77	84.74	62.70	29.75	11.15
	N2	MFCC	96.43	91.17	74.55	38.81	6.17	0.15
	N2	+ VFR	98.19	95.53	86.61	61.67	26.15	10.91
	N3	MFCC	96.12	89.02	67.31	25.23	0.95	0.00
	N3	+ VFR	98.00	95.20	83.48	53.33	17.09	9.60
	N4	MFCC	95.34	88.31	68.90	33.91	4.97	0.46
	N4	+ VFR	97.35	93.64	82.32	59.24	27.74	12.10
	Ave.	MFCC	95.80	89.64	70.38	34.61	5.17	0.22
	Ave.	+ VFR	97.67	94.54	84.29	59.24	25.18	10.94
Set B	N1	MFCC	97.05	92.45	78.11	46.58	10.68	0.52
	N1	+ VFR	98.59	96.10	89.16	68.84	34.76	13.79
	N2	MFCC	95.89	90.36	72.67	38.06	6.44	0.27
	N2	+ VFR	97.61	94.20	84.64	57.71	23.76	11.00
	N3	MFCC	96.91	93.05	79.54	44.08	11.09	1.49
	N3	+ VFR	98.57	96.57	90.07	67.25	31.82	13.15
	N4	MFCC	96.88	91.33	74.36	36.25	5.03	0.31
	N4	+ VFR	98.40	95.87	87.57	61.22	24.38	11.23
	Ave.	MFCC	96.68	91.80	76.17	41.24	8.31	0.65
	Ave.	+ VFR	98.29	95.69	87.86	63.76	28.68	12.29

References:

K. M. Ponting and S. M. Peeling, 1991, The use of variable frame rate analysis in speech recognition, Computer Speech and Language, Vol. 5, No. 2, pp. 169-179.

Q. Zhu and A. Alwan, 2000, On the use of variable frame rate in speech recognition, ICASSP, pp. 3264-3267.

H. You, Q. Zhu and A. Alwan, 2004, Entropy-based variable frame rate analysis of speech signals and its application to ASR, ICASSP, pp. 549-552.

Q. Zhu, X. Cui, M. Iseli and A. Alwan, 2001, Noise Robust Feature Extraction for ASR using the Aurora 2 Database, Proc. EUROSPEECH, Aalborg, Denmark, Vol. 1, pp. 185-188.