Brian Strope


Threading Spectral Peaks

Interference, in the form of background noise and the frequency shaping of the acoustic environment, alters the spectral representation of speech. The auditory system is remarkably robust to these distortions.

Most automatic speech recognition systems statistically characterize sequences of spectral representations corresponding to a particular speech unit (usually a word, syllable, or phoneme). Noise corrupts the statistics of the signal's spectral representation and significantly reduces recognition accuracy.

Instead of directly characterizing the spectrum, this work focuses on aspects of the spectrotemporal variation which are robust to background noise, and which may have a stronger connection to underlying production and perception mechanisms. Specifically, the temporal correlations of dominant spectral peaks are robust to background noise and may provide an anchor for robust speech representations. Although local spectral peaks are distributed essentially at random in time and frequency during background noise, during speech they follow a relatively consistent and highly correlated pattern, which is largely independent of the channel response.

Processing occurs in 5 stages:

  • auditory filtering
  • channel specific adaptation
  • cross-channel peak isolation (using bandpass cepstral liftering; a rough sketch follows the list)
  • threading spectrotemporally neighboring peaks, and approximating their temporal derivatives
  • choosing a low, mid and high peak (and temporal derivatives) to use as features for speech recognition.
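
As a rough illustration of the peak-isolation stage, a bandpass cepstral lifter might be sketched in Python as follows. The transform, the quefrency band, and the peak-picking rule shown here are illustrative assumptions rather than the parameters actually used.

    import numpy as np

    def isolate_peaks(log_channel_energies, low_q=2, high_q=30):
        """Bandpass cepstral liftering over one frame of log channel energies.

        Keeping only a mid band of quefrencies suppresses the overall spectral
        tilt (low quefrencies) and fine structure (high quefrencies); the local
        maxima of the liftered log spectrum are taken as the isolated peaks.
        """
        spectrum = np.asarray(log_channel_energies, dtype=float)
        coeffs = np.fft.rfft(spectrum)             # quefrency-domain coefficients
        lifter = np.zeros_like(coeffs)
        lifter[low_q:high_q] = 1.0                 # bandpass lifter
        smoothed = np.fft.irfft(coeffs * lifter, n=len(spectrum))
        peaks = [i for i in range(1, len(smoothed) - 1)
                 if smoothed[i - 1] < smoothed[i] > smoothed[i + 1]]
        return smoothed, peaks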
The first three stages are described previously. The threading stage uses a dynamic programming algorithm to connect the dots: new peaks (dots) in each time frame connect to the closest existing thread in frequency, and if no thread lies within a threshold, a new thread is created. A simplified sketch of this rule follows.
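
The following Python sketch implements only the greedy nearest-thread assignment described above, not the full dynamic programming search; the frequency threshold and the per-frame data layout are illustrative assumptions.

    def thread_peaks(frames, freq_threshold_hz=200.0):
        """Greedily connect per-frame spectral peaks into threads.

        frames: one list of (freq_hz, magnitude) peaks per analysis frame.
        Returns a list of threads, each a list of (frame_index, freq_hz, magnitude).
        """
        threads = []    # all threads found so far
        active = []     # indices of threads extended in the previous frame

        for t, peaks in enumerate(frames):
            claimed = []                          # threads extended in this frame
            for freq, mag in peaks:
                # Closest active thread (by frequency) not yet claimed in this frame.
                candidates = [i for i in active if i not in claimed]
                best = min(candidates,
                           key=lambda i: abs(freq - threads[i][-1][1]),
                           default=None)
                if best is not None and abs(freq - threads[best][-1][1]) <= freq_threshold_hz:
                    threads[best].append((t, freq, mag))
                    claimed.append(best)
                else:
                    threads.append([(t, freq, mag)])  # no thread close enough: start a new one
                    claimed.append(len(threads) - 1)
            active = claimed

        return threads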

The final stage converts these threads into parameters suitable for speech recognition. Three peaks, with center frequencies equally spaced on a Mel-warped scale, track the nearest threads. The tracked frequency positions do not jump instantaneously to new threads; they are low-pass filtered by a first-order recursive filter whose update fraction depends on the magnitude of the peak. As a result, the three frequency positions (and their temporal derivatives) update quickly toward large-magnitude peaks and slowly toward weak peaks. A rough sketch of this tracking rule is given below.
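
The Python sketch below illustrates the magnitude-dependent first-order recursive update for the three trackers. The initial center frequencies, the mapping from peak magnitude to update fraction, and its bounds are illustrative assumptions, and the temporal derivatives are omitted.

    def track_three_peaks(frames, init_freqs=(500.0, 1500.0, 3000.0),
                          min_alpha=0.05, max_alpha=0.9, mag_ref=1.0):
        """Track low, mid, and high peak positions with magnitude-dependent smoothing.

        frames: one list of (freq_hz, magnitude) thread peaks per frame.
        Returns the smoothed (low, mid, high) frequency positions for each frame.
        """
        positions = list(init_freqs)
        trajectory = []

        for peaks in frames:
            updated = []
            for pos in positions:
                if peaks:
                    # Nearest thread (in frequency) to this tracker's current position.
                    freq, mag = min(peaks, key=lambda p: abs(p[0] - pos))
                    # First-order recursive update: the update fraction grows with
                    # peak magnitude, so strong peaks move the tracker quickly and
                    # weak peaks only nudge it.
                    alpha = min_alpha + (max_alpha - min_alpha) * min(1.0, mag / mag_ref)
                    pos = (1.0 - alpha) * pos + alpha * freq
                updated.append(pos)
            positions = updated
            trajectory.append(tuple(positions))

        return trajectory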


Publications

  • ICASSP 98: B. Strope and A. Alwan, "Robust Word Recognition Using Threaded Spectral Peaks," to appear in Proc. of IEEE ICASSP, May 1998.

  • ASA/ASJ 96: B. Strope and A. Alwan, "Dynamic auditory representations and statistical speech recognition," Proc. of the Acous. Soc. of Amer. Vol. 100, No. 4, 2788, Oct. 1996.
