Publication details

Learning Optimal Prosody Embedding Codebook based on F0 and Energy

Authors

PORTEŠ David, HORÁK Aleš

Year of publication 2025
Type Article in proceedings
Conference Interspeech 2025
Faculty / MU unit

Faculty of Informatics

Citation
DOI https://doi.org/10.21437/Interspeech.2025-1020
Keywords prosody, VQ-VAE, Fundamental frequency, F0, Energy, embeddings
Description Both fundamental frequency (F0) and energy are prominent features of prosody, and together they have been used across a wide variety of speech-processing tasks. However, freely available pre-trained vector representations of these features are lacking. In this paper, we therefore provide the research community with high-quality joint embeddings of frame-level F0 and energy, learned with the VQ-VAE architecture. Converting F0 and energy into a single stream of vector embeddings makes it possible to use prosody seamlessly in modern architectures such as multimodal LLMs. To ensure maximum embedding quality, we conduct a large-scale hyperparameter search totaling over 150 experiments on the LibriTTS dataset. We outperform previous work on F0 embeddings, reaching an F0 Frame Error (FFE) below 1 percent while simultaneously embedding the additional feature of energy. We publish our best-performing models on Hugging Face.
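
As a rough illustration of two ideas mentioned in the description, the sketch below shows a nearest-codeword lookup of the kind used in the quantization step of a VQ-VAE, and a common formulation of the F0 Frame Error. It is not the authors' implementation: the function names, the toy codebook, the convention that unvoiced frames carry an F0 of 0.0, and the 20 percent gross-pitch-error tolerance are all assumptions made for this example, and the encoder and decoder networks are omitted entirely.

```python
def quantize(frames, codebook):
    """Map each (f0, energy) frame to the index of its nearest codeword.

    Hypothetical helper: uses squared Euclidean distance, as in the
    standard VQ-VAE quantization step. In a real model the frames would
    be encoder outputs, not raw features.
    """
    indices = []
    for frame in frames:
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def f0_frame_error(ref_f0, est_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumptions of this sketch: unvoiced frames are encoded as 0.0, and a
    gross pitch error is a relative deviation above `tolerance` on frames
    that both sequences mark as voiced.
    """
    if len(ref_f0) != len(est_f0):
        raise ValueError("sequences must have the same number of frames")
    if not ref_f0:
        raise ValueError("need at least one frame")
    errors = 0
    for ref, est in zip(ref_f0, est_f0):
        ref_voiced = ref > 0.0
        est_voiced = est > 0.0
        if ref_voiced != est_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(est - ref) > tolerance * ref:
            errors += 1  # gross pitch error
    return errors / len(ref_f0)


# Toy data, invented for illustration only.
codebook = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.2), (0.9, 0.8)]
codes = quantize(frames, codebook)

ref = [0.0, 100.0, 200.0, 0.0, 150.0]
est = [0.0, 105.0, 260.0, 120.0, 150.0]
ffe = f0_frame_error(ref, est)
```

In the toy data, two of the five frames count as errors (one gross pitch error and one voicing error), so the reported FFE of under 1 percent corresponds to fewer than one such frame in a hundred.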
