UCLA Speech Processing and Auditory Perception Laboratory

 
COPYRIGHT NOTICE
This page includes the toolkits and data that we used in our papers. Please cite our corresponding papers when using any of the following materials.

 

Shareware: Code, Databases, and Useful Links

Code

Glottaltopograph (GTG) analysis tool: a toolkit to analyze high-speed laryngeal videos.

Glottaltopography is a method to analyze high-speed laryngeal videos. The method is described in this paper: Gang Chen, Jody Kreiman, Abeer Alwan, "The glottaltopogram: a method of analyzing high-speed images of the vocal folds", Computer Speech and Language, 2014, in press. Briefly, the "glottaltopogram" is based on principal component analysis of pixels' light-intensity time sequences from consecutive video images. This method reveals the overall synchronization of the vibrational patterns of the vocal folds over the entire laryngeal area. This method is effective in visualizing pathological and normal vocal fold vibratory patterns. The GTG toolkit is available for download here.


Harmfreq_MOLRT: a statistical-model-based likelihood ratio test (LRT) speech/non-speech detection algorithm

Harmfreq_MOLRT is a statistical-model-based speech/non-speech detection algorithm that uses a likelihood ratio test (LRT). The likelihood ratios (LRs) for voiced and unvoiced frames are computed differently: for voiced frames, the LR is calculated using only the harmonic DFT coefficients; for unvoiced frames, it is calculated using all DFT coefficients. It is an improved version of the multiple observation (MO) LRT VAD proposed by Ramirez et al. [Matlab code of Harmfreq_MOLRT VAD]
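
The voiced/unvoiced distinction described above can be illustrated with a minimal Python sketch. It is not the Harmfreq_MOLRT code. It assumes the Gaussian statistical model commonly used in LRT-based VADs, in which the log LR of one DFT bin is gamma*xi/(1+xi) - ln(1+xi), with gamma the a posteriori SNR and xi the a priori SNR. The function names, the rounding rule used to pick harmonic bins, and the default parameters are all hypothetical, and the multiple-observation step is omitted.

```python
import math

def bin_log_lr(gamma, xi):
    """Log likelihood ratio of one DFT bin under the assumed Gaussian model."""
    return gamma * xi / (1.0 + xi) - math.log(1.0 + xi)

def harmonic_bins(f0, fs, n_fft):
    """Indices of the DFT bins nearest to multiples of f0, up to fs/2 (assumed rule)."""
    bins = []
    h = 1
    while h * f0 <= fs / 2.0:
        bins.append(int(round(h * f0 * n_fft / fs)))
        h += 1
    return bins

def frame_log_lr(gammas, xis, voiced, f0=None, fs=8000, n_fft=256):
    """Average per-bin log LR over harmonic bins (voiced) or all bins (unvoiced)."""
    if voiced:
        idx = [k for k in harmonic_bins(f0, fs, n_fft) if k < len(gammas)]
    else:
        idx = list(range(len(gammas)))
    if not idx:
        return 0.0
    return sum(bin_log_lr(gammas[k], xis[k]) for k in idx) / len(idx)

# Toy frame: high SNR at the harmonics of 250 Hz, noise-only everywhere else.
n_bins = 129                      # n_fft / 2 + 1 for n_fft = 256
harm = set(harmonic_bins(250.0, 8000, 256))
gammas = [11.0 if k in harm else 1.0 for k in range(n_bins)]
xis = [10.0 if k in harm else 0.0 for k in range(n_bins)]
lr_harmonic_only = frame_log_lr(gammas, xis, voiced=True, f0=250.0)
lr_all_bins = frame_log_lr(gammas, xis, voiced=False)
print(lr_harmonic_only, lr_all_bins)
```

In this toy frame the harmonic-only average is larger than the all-bin average, because the noise-only bins between harmonics contribute a log LR of zero and dilute the average.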

MBSC: a Multi-Band Summary Correlogram (MBSC)-based pitch detection algorithm for noisy speech

MBSC is a Multi-Band Summary Correlogram (MBSC)-based pitch detection algorithm for noisy speech. The package contains the Matlab code that was used to generate the pitch detection results reported in L. N. Tan and A. Alwan, "Multi-Band Summary Correlogram-based Pitch Detection for Noisy Speech", Speech Communication, in press. A fast version of the code is also provided in the package. [Matlab code of MBSC pitch detector (to be updated soon)]

SAFE: a Statistical Algorithm for F0 Estimation

SAFE is a toolkit that uses a Statistical Algorithm for F0 Estimation for both clean and noisy speech. It was developed at the Speech Processing and Auditory Perception Laboratory at UCLA by Wei Chu and Prof. Abeer Alwan. Available for download from here.

VFR: Variable Frame Rate


    Variable Frame Rate (VFR) analysis is a feature extraction method for noise-robust automatic speech recognition (ASR). It builds on speech perception research showing that dynamic spectro-temporal information is important and, hence, that not all equal-duration speech segments are equally important perceptually. For example, formant transitions at the onset of a vowel can carry more discriminative information than the steady-state part of the vowel ... (details) (download)

VoiceSauce: A Program for Voice Analysis


    VoiceSauce is an application, implemented in Matlab, which provides automated voice measurements over time from audio recordings. Inputs are standard wave (*.wav) files and the measures currently computed are: F0, Formants F1-F4, H1(*), H2(*), H4(*), A1(*), A2(*), A3(*), H1(*)-H2(*), H2(*)-H4(*), H1(*)-A1(*), H1(*)-A2(*), H1(*)-A3(*), Energy, and Cepstral Peak Prominence ... (details)

XVocal: Vocal Tract Articulatory Synthesizer


    XVocal is the UNIX version of Dr. Shinji Maeda's Vocal Tract Articulatory Synthesizer, VTCALCS (originally developed for the PC platform). In 1995, Edmond Chi Hin Chui of our laboratory ported the PC version to UNIX. With Dr. Maeda's permission, XVocal is now freely available for research purposes only. Please see the user manual for detailed instructions on how to use the program... (details)

CTMRedit: a Matlab-based MRI Image Segmentation Tool with GUI


    A Matlab GUI for viewing, segmenting, and interpolating CT and MRI Images. Written by Mark Hasegawa-Johnson and Jul Cha... (details)

Speechdemo: a Matlab-based Speech Processing Platform with GUI


    Speechdemo is a Matlab-based graphical tool for speech analysis by Qifeng Zhu. It supports simultaneous analysis of signals in two channels. The user can view the signal in time and frequency using a variety of analysis tools such as the Discrete Fourier Transform (DFT), Linear Predictive Coding (LPC), Mel-Frequency Cepstral Coefficients (MFCC), and others... (details)

ITU G.722 Wide-band Codec implementation in ANSI C


    (ANSI C code)

Databases

 

Databases Distributed through the Linguistic Data Consortium at the University of Pennsylvania (LDC)


    • The Child Subglottal Resonances Database. Released through the LDC, 2022, ISBN: 1-58563-985-0
    • UCLA Speaker Variability Database. Released through the LDC, 2021, ISBN: 1-58563-977-X
    • UCLA High-Speed Laryngeal Audio and Video Database. Released through the LDC, 2017, ISBN: 1-58563-803-X
    • The Subglottal Resonances Database. Released through the LDC, 2015, ISBN: 1-58563-711-4

UCLA Speaker Variability Database


    A database designed to sample speaking variability within individual speakers and across a large number of speakers is available through this website. It will also be available from the Linguistic Data Consortium (LDC) as of October 2021. (download, paper, Readme)

Consonant Vowel Tokens (CV) Database


    An extensive database of 1,728 isolated Consonants and Vowels (CV) is available through this website. (details)

VTR Formants Database


    The speech group at Microsoft Research (Redmond, Washington, US) and IPAM and Electrical Engineering at UCLA (Los Angeles, CA, US) have recently jointly developed a database for manually labeled vocal-tract-resonance (or formant) trajectories, for research in speech processing including analysis, synthesis, and recognition. (details)

Narrated Videotape Showing 3D Tongue and Vocal Tract Reconstructions from MRI Data for Consonants and Vowels


    A narrated videotape showing 3D tongue and vocal tract reconstructions from MRI data for consonants and vowels as produced by two talkers. Sample 3D models can be seen at: http://www.ee.ucla.edu/~spapl/projects/mri.html. This videotape is an effective teaching aid and was produced by Shrikanth Narayanan and Abeer Alwan. ... (details)

    For a free copy of the videotape, please email Prof. Alwan at: alwan@icsl.ucla.edu

Non-Speech Time-Stamps for Aurora 2 Test Sets


    The label files in this package contain the time-stamps of silence (sil) and short pause (sp) segments found in the Aurora-2 test sets. These time-stamps were obtained through manual visual inspection of the spectrograms of the clean test files.

Consonant-Vowel-Consonant (CVC) syllables spoken at different rates in the presence of different levels of babble noise

    This database includes raw audio files and audio files corrupted by babble noise at 0 dB and at 5 dB.

Useful Links

 


UCSC Speech Links

                                                                                                                                     

Alexander Graham Bell's Path to the Telephone

 

F0 Estimation Resources

(from the PhD dissertation of Arturo Camacho, SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music, 2007)

AC-P: This algorithm (Boersma, 1993) computes the autocorrelation of the signal and divides it by the autocorrelation of the window used to analyze the signal. It uses post-processing to reduce discontinuities in the pitch trace. It is available with the Praat System at <http://www.fon.hum.uva.nl/praat>. The name of the function is ac.
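
The core idea of the AC-P entry above, dividing the autocorrelation of the windowed signal by the autocorrelation of the window, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Praat's implementation: the Hann window, the search range, and the names `hann`, `autocorr`, and `acp_pitch` are hypothetical, and the post-processing of the pitch trace is omitted.

```python
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag, normalized so that r[0] == 1."""
    r = [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
         for lag in range(max_lag + 1)]
    return [v / r[0] for v in r]

def acp_pitch(frame, fs, f0_min=120.0, f0_max=500.0):
    """Estimate F0 (Hz) of one frame from the window-corrected autocorrelation."""
    n = len(frame)
    mean = sum(frame) / n
    w = hann(n)
    xw = [(s - mean) * wi for s, wi in zip(frame, w)]
    min_lag = int(fs / f0_max)
    max_lag = int(fs / f0_min)
    r_x = autocorr(xw, max_lag)   # autocorrelation of the windowed signal
    r_w = autocorr(w, max_lag)    # autocorrelation of the window itself
    corrected = [r_x[lag] / r_w[lag] for lag in range(max_lag + 1)]
    best = max(range(min_lag, max_lag + 1), key=lambda lag: corrected[lag])
    return fs / best

# Toy input: 30 ms of a 200 Hz sine sampled at 8 kHz (period of 40 samples).
fs = 8000
frame = [math.sin(2.0 * math.pi * 200.0 * i / fs) for i in range(240)]
f0_acp = acp_pitch(frame, fs)
print(f0_acp)
```

Dividing by the window's autocorrelation compensates for the taper that the window imposes at larger lags, so the peak at the true period is not underestimated relative to shorter lags.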

AC-S: This algorithm uses the autocorrelation of the cubed signal. It is available with the Speech Filing System at <http://www.phon.ucl.ac.uk/resource/sfs>. The name of the function is fxac.

ANAL: This algorithm (Secrest and Doddington, 1983) uses autocorrelation to estimate the pitch, and dynamic programming to remove discontinuities in the pitch trace. It is available with the Speech Filing System at <http://www.phon.ucl.ac.uk/resource/sfs>. The name of the function is fxanal.

CATE: This algorithm uses a quasi-autocorrelation function of the speech excitation signal to estimate the pitch. We implemented it based on its original description (Di Martino, 1999). The dynamic programming component used to remove discontinuities in the pitch trace was not implemented.

CC: This algorithm uses cross-correlation to estimate the pitch and post-processing to remove discontinuities in the pitch trace. It is available with the Praat System at <http://www.fon.hum.uva.nl/praat>. The name of the function is cc.

CEP: This algorithm (Noll, 1967) uses the cepstrum of the signal and is available with the Speech Filing System at <http://www.phon.ucl.ac.uk/resource/sfs>. The name of the function is fxcep.
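
As a rough illustration of the cepstral approach named in the CEP entry above, the following Python sketch takes the inverse DFT of the log magnitude spectrum and picks the quefrency with the largest value. It is not the fxcep code. The naive O(N^2) DFT, the spectral floor, the search range, and the names `dft` and `cepstrum_pitch` are assumptions chosen to keep the example dependency-free, and the input is an idealized pulse train whose length is a whole number of periods.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse transform includes the 1/N factor."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def cepstrum_pitch(frame, fs, f0_min=120.0, f0_max=500.0, floor=1e-6):
    """Estimate F0 (Hz) from the peak of the real cepstrum within a lag range."""
    spectrum = dft(frame)
    # The small floor avoids log(0) at bins where the spectrum vanishes.
    log_mag = [math.log(abs(v) + floor) for v in spectrum]
    cep = [v.real for v in dft(log_mag, inverse=True)]
    min_q = int(fs / f0_max)
    max_q = int(fs / f0_min)
    best = max(range(min_q, max_q + 1), key=lambda q: cep[q])
    return fs / best

# Toy input: pulse train with a period of 40 samples at 8 kHz (200 Hz).
fs_cep = 8000
pulses = [1.0 if i % 40 == 0 else 0.0 for i in range(400)]
f0_cep = cepstrum_pitch(pulses, fs_cep)
print(f0_cep)
```

The harmonics of a periodic signal make the log spectrum itself periodic along the frequency axis, and the cepstrum turns that regular spacing into a peak at the pitch period.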

ESRPD: This algorithm (Bagshaw, 1993; Medan, 1991) uses a normalized cross-correlation to estimate the pitch, and post-processing to remove discontinuities in the pitch trace. It is available with the Festival Speech Synthesis System at <http://www.cstr.ed.ac.uk/projects/festival>. The name of the function is pda.

RAPT: This algorithm (Secrest and Doddington, 1983) uses a normalized cross-correlation to estimate the pitch, and dynamic programming to remove discontinuities in the pitch trace. It is available with the Speech Filing System at <http://www.phon.ucl.ac.uk/resource/sfs>. The name of the function is fxrapt.

SHS: This algorithm (Hermes, 1988) uses subharmonic summation. It is available with the Praat System at <http://www.fon.hum.uva.nl/praat>. The name of the function is shs.

SHR: This algorithm (Sun, 2000) uses the subharmonic-to-harmonic ratio. It is available at Matlab Central <http://www.mathworks.com/matlabcentral>, under the title “Pitch Determination Algorithm”. The name of the function is shrp.

TEMPO: This algorithm (Kawahara et al., 1999) uses the instantaneous frequency of the outputs of a filterbank. It is available with the STRAIGHT System at its author's web page <http://www.wakayama-u.ac.jp/~kawahara>. The name of the function is exstraightsource.

YIN: This algorithm (de Cheveigné and Kawahara, 2002) uses a modified version of the average squared difference function. It is available from its author's web page at <http://www.ircam.fr/pcm/cheveign/sw/yin.zip>. The name of the function is yin.
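
The "modified version of the average squared difference function" in the YIN entry above can be sketched as follows. This is a simplified Python illustration, not the distributed yin code: it computes the squared difference function, normalizes it by its cumulative mean, and takes the first dip below a threshold. The threshold of 0.1, the window and lag sizes, and the name `yin_pitch` are illustrative assumptions, and the later refinement steps of the published method (such as parabolic interpolation) are left out.

```python
import math

def yin_pitch(frame, fs, window, max_lag, threshold=0.1):
    """Estimate F0 (Hz); requires len(frame) >= window + max_lag."""
    # Squared difference function d(tau) over a fixed-length window.
    d = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        d[tau] = sum((frame[j] - frame[j + tau]) ** 2 for j in range(window))
    # Cumulative mean normalized difference: d(tau) divided by the mean of d(1..tau).
    cmnd = [1.0] * (max_lag + 1)
    running = 0.0
    for tau in range(1, max_lag + 1):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0.0 else 1.0
    # First lag that dips below the threshold, followed down to its local minimum.
    for t in range(1, max_lag + 1):
        if cmnd[t] < threshold:
            while t + 1 <= max_lag and cmnd[t + 1] < cmnd[t]:
                t += 1
            return fs / t
    # No dip below the threshold: fall back to the global minimum.
    best = min(range(1, max_lag + 1), key=lambda t: cmnd[t])
    return fs / best

# Toy input: 200 Hz sine at 8 kHz, so the expected period is 40 samples.
fs_yin = 8000
sine = [math.sin(2.0 * math.pi * 200.0 * i / fs_yin) for i in range(300)]
f0_yin = yin_pitch(sine, fs_yin, window=200, max_lag=100)
print(f0_yin)
```

The normalization keeps the function near 1 at small lags, which prevents the trivial minimum at lag zero from being selected, and taking the first dip rather than the global minimum guards against picking a multiple of the true period.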