DSP-based F0 estimation algorithms, such as the multi-band summary correlogram (MBSC) algorithm, are robust to noisy speech. Recent studies show that DNNs that map raw waveform segments directly to F0 estimates can outperform DSP-based methods. However, the generalization and noise robustness of these DNNs have not been fully addressed.
To further improve the noise robustness of DNN-based F0 estimation, we propose a hybrid approach that fuses noise-robust auxiliary DSP representations with raw waveform representations to obtain F0 estimates using deep learning. As the auxiliary DSP features, we use summary-correlograms (SCs), which are intermediate features of a modified version of MBSC. The modified version processes input waveforms much faster than the original algorithm, which enables the fusion in our proposed DNN architecture. On the PTDB-TUG database, we show that the proposed fusion network, FusedF0, can outperform both state-of-the-art DSP baselines and the state-of-the-art raw-waveform DNN, CREPE, under different noise conditions.
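To make the idea of fusing a correlogram-style feature with a raw waveform frame concrete, here is a minimal sketch. It is not the FusedF0 architecture or the MBSC algorithm: a single-band normalized autocorrelation stands in for a summary-correlogram, plain concatenation stands in for the fusion step, and all function names and parameter values (16 kHz sampling rate, 1024-sample frame, 50 to 500 Hz search range) are assumptions chosen only for illustration.

```python
import math

def autocorrelogram(frame, min_lag, max_lag):
    """Normalized autocorrelation over a lag range.

    Assumption: a single-band stand-in for a summary-correlogram;
    the real MBSC combines several bands and is not reproduced here.
    """
    energy = sum(x * x for x in frame) or 1.0
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        for lag in range(min_lag, max_lag + 1)
    ]

def fuse(frame, sc):
    """Hypothetical fusion: concatenate raw samples and SC-like features
    into one input vector that a DNN could consume."""
    return list(frame) + list(sc)

def f0_from_peak(sc, min_lag, sample_rate):
    """Classical DSP estimate: F0 from the lag of the strongest peak."""
    best = max(range(len(sc)), key=sc.__getitem__)
    return sample_rate / (min_lag + best)

# Synthetic clean test tone (assumed values, not from the paper).
sample_rate = 16000
f0_true = 200.0
frame = [math.sin(2 * math.pi * f0_true * t / sample_rate) for t in range(1024)]

# Lag range covering roughly 50 Hz to 500 Hz.
min_lag, max_lag = sample_rate // 500, sample_rate // 50

sc = autocorrelogram(frame, min_lag, max_lag)
fused = fuse(frame, sc)
f0_est = f0_from_peak(sc, min_lag, sample_rate)
print(len(fused), round(f0_est, 1))
```

For this clean tone the peak-picking estimate should land close to 200 Hz. In a hybrid system of the kind described above, a vector like `fused` (or separate network branches for each representation) would be passed to a trained DNN instead of relying on peak picking alone.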
Pitch Estimation, Summary-correlogram Fusion, Noise-robust, Deep learning
Prof. Abeer Alwan, SPAPL, UCLA
Eray Eren, SPAPL, UCLA
Abeer Alwan (alwan@ee.ucla.edu)