Learning Optimal Prosody Embedding Codebook based on F0 and Energy

Publication information

Authors	PORTEŠ David, HORÁK Aleš
Year of publication	2025
Type	Article in proceedings
Conference	Interspeech 2025
MU Faculty or unit	Faculty of Informatics
DOI	https://doi.org/10.21437/Interspeech.2025-1020
Keywords	prosody, VQ-VAE, Fundamental frequency, F0, Energy, embeddings
Description	Fundamental frequency (F0) and energy are both prominent features of prosody, and together they have been used across a wide variety of speech-processing tasks. However, freely available pre-trained vector representations of these features are lacking. In this paper, we therefore provide the research community with high-quality joint embeddings of frame-level F0 and energy features, learned with the VQ-VAE architecture. By converting F0 and energy into a single stream of vector embeddings, we make it possible to use prosody seamlessly in modern architectures such as multimodal LLMs. To ensure maximum embedding quality, we conduct a large-scale hyperparameter search totaling over 150 experiments on the LibriTTS dataset. We outperform previous work on F0 embeddings, reaching an F0 Frame Error (FFE) below 1 percent while simultaneously embedding the additional feature of energy. We publish our best-performing models on Hugging Face.
