Project Summary

In voiced speech, the vocal folds open and close quasi-periodically and thus convert the glottal airflow (air volume velocity) into a train of flow pulses, which is referred to as the voice source excitation signal.

Early models of the source signal used a simple impulse train to model voiced excitation. None of these models has been calibrated with direct observations of glottal area changes, which are the proximal cause of the air pressure changes that we hear as sound. The effective study of the voice source thus requires both more accurate source models and a comprehensive set of underlying observations on which to base the models. The primary goal of the proposed research is to develop and evaluate a new, more powerful source model based on direct observations of vocal fold vibrations.

Besides the critical need to calibrate source models with underlying physiological data, we also need to better understand the linkage between model parameters and perceived quality. None of the previous source models has been systematically validated perceptually. That is, we cannot presently predict well how a given change in a model parameter will affect what listeners hear.

The voice source contains important lexical and non-lexical information. The non-lexical information can convey, for example, prosodic events, emotional state, and cues to the uniqueness of the speaker’s voice. In engineering applications, there is a need for a more accurate source model that can represent different voice qualities. Such a model could improve the naturalness of text-to-speech (TTS) systems. In addition, understanding which aspects of the source signal, if any, are speaker-specific should aid in developing better speaker identification algorithms.

We propose to build on our preliminary work in developing a new source model by recording high-speed images of vocal fold vibrations with simultaneous audio recordings, analyzing the corpus to better parameterize the new voice source model and study speaker variability, performing perception experiments to uncover which aspects of the glottal model are perceptually salient, and using the model in TTS and speaker identification algorithms.

The project fosters interdisciplinary activities at:

This work is supported in part by NSF Grant No. IIS-1018863 and by NIH/NIDCD Grant Nos. DC01797 and DC011300.


Voice source, high-speed recording, vocal folds, speech synthesis, speech production model, perceptual validation.


Glottaltopograph (GTG) analysis tool: a toolkit to analyze high-speed laryngeal videos.

Glottaltopography is a method for analyzing high-speed laryngeal videos. The method is described in this paper: Gang Chen, Jody Kreiman, Abeer Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds", Computer Speech and Language, 2014, in press. Briefly, the "glottaltopogram" is based on principal component analysis of the pixels' light-intensity time sequences from consecutive video images. The method reveals the overall synchronization of the vibrational patterns of the vocal folds over the entire laryngeal area, and it is effective in visualizing both pathological and normal vocal fold vibratory patterns. The GTG toolkit is available for download here.
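The core idea, principal component analysis of per-pixel light-intensity time sequences, can be illustrated with a minimal sketch. This is not the GTG toolkit's code, and the published method may differ in preprocessing, normalization, and how components are mapped to an image. The function name `pixel_pca`, the array layout (frames, height, width), and the synthetic test video are all assumptions made for illustration.

```python
import numpy as np

def pixel_pca(frames, n_components=2):
    """PCA of per-pixel intensity time series (illustrative sketch only).

    frames: array of shape (n_frames, height, width) with light intensities.
    Returns per-pixel score maps, temporal components, and explained variance.
    """
    n_frames, height, width = frames.shape
    # One row per pixel, one column per video frame.
    X = frames.reshape(n_frames, height * width).T.astype(float)
    # Remove each pixel's mean intensity so PCA captures variation over time.
    X -= X.mean(axis=1, keepdims=True)
    # SVD of the centered data: rows of Vt are temporal components,
    # U * S gives each pixel's score on those components.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = (U[:, :n_components] * S[:n_components]).T
    score_maps = scores.reshape(n_components, height, width)
    explained = S**2 / np.sum(S**2)
    return score_maps, Vt[:n_components], explained[:n_components]

# Hypothetical synthetic "video": the left and right halves of an 8x8 image
# oscillate at the same frequency but in opposite phase.
t = np.arange(64)
wave = np.sin(2 * np.pi * t / 16)
frames = np.zeros((64, 8, 8))
frames[:, :, :4] = wave[:, None, None]
frames[:, :, 4:] = -wave[:, None, None]

score_maps, components, explained = pixel_pca(frames)
```

In this toy case the first score map takes opposite signs in the two halves of the image, which is the kind of spatial synchronization pattern a component map can make visible. Real laryngeal video would need cropping, motion compensation, and other steps not shown here.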

VoiceSauce: A Program for Voice Analysis

VoiceSauce is an application, implemented in Matlab, which provides automated voice measurements over time from audio recordings. Inputs are standard wave (*.wav) files and the measures currently computed are: F0, Formants F1-F4, H1(*), H2(*), H4(*), A1(*), A2(*), A3(*), H1(*)-H2(*), H2(*)-H4(*), H1(*)-A1(*), H1(*)-A2(*), H1(*)-A3(*), Energy, and Cepstral Peak Prominence ... (details)
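As a rough illustration of one of the listed measures, the sketch below estimates an uncorrected H1-H2, taken here as the level difference in dB between the first and second harmonics of the spectrum. It is written in Python rather than Matlab, it is not VoiceSauce's algorithm, and it does not attempt whatever the (*) variants denote. The function name `h1_minus_h2`, the search bandwidth, and the synthetic two-harmonic signal are assumptions for illustration, and F0 is assumed to be known in advance.

```python
import numpy as np

def h1_minus_h2(signal, fs, f0, search_hz=20.0):
    """Uncorrected H1-H2 in dB from a single analysis frame (sketch only)."""
    # Taper the frame to reduce spectral leakage, then take the magnitude spectrum.
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

    def harmonic_db(k):
        # Strongest spectral peak within +/- search_hz of the k-th harmonic.
        band = (freqs >= k * f0 - search_hz) & (freqs <= k * f0 + search_hz)
        return 20.0 * np.log10(spectrum[band].max())

    return harmonic_db(1) - harmonic_db(2)

# Hypothetical test signal: two harmonics with amplitudes 1.0 and 0.5.
fs = 16000
t = np.arange(fs) / fs            # one second of samples
f0 = 120.0
signal = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

diff_db = h1_minus_h2(signal, fs, f0)
```

For this signal the second harmonic has half the amplitude of the first, so the result should be close to 20·log10(2), about 6 dB. Measurements on real speech additionally require F0 tracking and frame-by-frame analysis over time.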

