Publication details

Learning Optimal Prosody Embedding Codebook based on F0 and Energy

Authors

PORTEŠ David, HORÁK Aleš

Year of publication 2025
Type Article in Proceedings
Conference Interspeech 2025
MU Faculty or unit

Faculty of Informatics

Citation
Doi https://doi.org/10.21437/Interspeech.2025-1020
Keywords prosody, VQ-VAE, Fundamental frequency, F0, Energy, embeddings
Description Both the Fundamental frequency (F0) and Energy are prominent features of prosody. Together, they have been used across a wide variety of speech-processing tasks. However, there is a lack of freely available pre-trained vector representations of these features. Therefore, in this paper, we provide the research community with high-quality joint embeddings of the frame-level F0 and Energy features, using the VQ-VAE architecture. By converting F0 and Energy into a single stream of vector embeddings, we make it possible to seamlessly use prosody in modern architectures, such as multimodal LLMs. To ensure maximum embedding quality, we conduct a large-scale hyperparameter search, totaling over 150 experiments on the LibriTTS dataset. We outperform previous works on F0 embeddings, reaching an F0 Frame Error (FFE) below 1 percent, while simultaneously embedding the additional feature of Energy. We publish our best-performing models on Hugging Face.
