F0 Estimation

Project Summary

Fundamental frequency (F0) estimation is the process of extracting the periodicity of voiced speech segments. It is important for many speech processing applications, such as automatic speech recognition (ASR), text-to-speech (TTS) synthesis, speech enhancement, speaker verification, and voice separation.
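To make the notion of periodicity concrete, the following is a minimal sketch of a classic autocorrelation-based F0 estimator applied to a synthetic voiced frame. It is a textbook illustration only, not the method proposed in this project; the function name `estimate_f0_autocorr`, the 60 to 400 Hz search range, and the test signal are all assumptions chosen for the example.

```python
import numpy as np

def estimate_f0_autocorr(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the F0 (in Hz) of one voiced frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Full autocorrelation; keep the non-negative lags only (index 0 is lag 0).
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags that correspond to plausible F0 values.
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic "voiced" frame: a 200 Hz fundamental plus its second harmonic.
sample_rate = 16000
frame_len = 640  # 40 ms at 16 kHz
t = np.arange(frame_len) / sample_rate
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

f0 = estimate_f0_autocorr(frame, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

On clean, strongly periodic input the autocorrelation peak sits at the pitch period. In noise the peak becomes unreliable, which is the problem that more robust DSP and DNN-based estimators try to address.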

F0 estimation algorithms based on digital signal processing (DSP), such as the multi-band summary correlogram (MBSC) method, are robust to noisy speech. Recent studies show that deep neural networks (DNNs) that map raw waveform segments directly to F0 estimates can outperform DSP-based methods. However, the generalization and noise robustness of these DNNs have not been fully addressed.

To improve the noise robustness of DNN-based F0 estimation, we propose a hybrid approach that fuses noise-robust auxiliary DSP representations with raw waveform representations to obtain F0 estimates using deep learning. As the auxiliary DSP features, we use summary correlograms (SCs), an intermediate representation computed by a modified version of MBSC. The modified version processes input waveforms much faster than the original algorithm, which makes the fusion practical in our proposed DNN architecture. We show that the proposed fusion network, FusedF0, can outperform both the state-of-the-art DSP baselines and the state-of-the-art raw-waveform DNN, CREPE, under different noise conditions on the PTDB-TUG database.

Keywords

Pitch Estimation, Summary-correlogram Fusion, Noise-robust, Deep learning

Students and Collaborators

Prof. Abeer Alwan, SPAPL, UCLA

Eray Eren, SPAPL, UCLA

Abeer Alwan (alwan@ee.ucla.edu)