Publication details
Learning Optimal Prosody Embedding Codebook based on F0 and Energy
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in Proceedings |
| Conference | Interspeech 2025 |
| MU Faculty or unit | |
| Citation | |
| DOI | https://doi.org/10.21437/Interspeech.2025-1020 |
| Keywords | prosody, VQ-VAE, fundamental frequency, F0, energy, embeddings |
| Description | Fundamental frequency (F0) and energy are both prominent features of prosody, and together they have been used across a wide variety of speech-processing tasks. However, freely available pre-trained vector representations of these features are lacking. In this paper, we therefore provide the research community with high-quality joint embeddings of frame-level F0 and energy features, learned with the VQ-VAE architecture. Converting F0 and energy into a single stream of vector embeddings makes it possible to use prosody seamlessly in modern architectures, such as multimodal LLMs. To ensure maximum embedding quality, we conduct a large-scale hyperparameter search totaling over 150 experiments on the LibriTTS dataset. We outperform previous work on F0 embeddings, reaching an F0 Frame Error (FFE) below 1 percent while simultaneously embedding the additional feature of energy. We publish our best-performing models on Hugging Face. |
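
The description reports reconstruction quality as FFE (F0 Frame Error). As a rough illustration of what that number measures, the sketch below computes FFE under its common definition: the fraction of frames with either a voicing decision error or a gross pitch error, where a gross pitch error is a voiced frame whose estimate deviates from the reference by more than 20 percent. The function name `f0_frame_error`, the convention of encoding unvoiced frames as 0.0 Hz, and the toy F0 tracks are assumptions of this sketch. The record does not describe the paper's exact evaluation setup, which may differ.

```python
def f0_frame_error(ref_f0, est_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumption of this sketch: unvoiced frames are encoded as 0.0 Hz,
    and both tracks are already aligned frame by frame.
    """
    if len(ref_f0) != len(est_f0):
        raise ValueError("F0 tracks must have the same number of frames")
    if not ref_f0:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, est in zip(ref_f0, est_f0):
        ref_voiced = ref > 0.0
        est_voiced = est > 0.0
        if ref_voiced != est_voiced:
            # Voicing decision error: one track is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(est - ref) > tolerance * ref:
            # Gross pitch error: both voiced, but off by more than the tolerance.
            errors += 1
    return errors / len(ref_f0)


# Toy example with made-up values (Hz), not data from the paper.
reference = [0.0, 100.0, 110.0, 120.0, 0.0]
reconstructed = [0.0, 101.0, 140.0, 0.0, 0.0]
ffe = f0_frame_error(reference, reconstructed)
print(f"FFE = {ffe:.1%}")
```

In this toy input, the third frame is a gross pitch error and the fourth is a voicing error, so two of the five frames count as errors.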