Publication information
Learning Optimal Prosody Embedding Codebook based on F0 and Energy
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Article in proceedings |
| Conference | Interspeech 2025 |
| MU Faculty or unit | |
| Citation | |
| DOI | https://doi.org/10.21437/Interspeech.2025-1020 |
| Keywords | prosody, VQ-VAE, fundamental frequency, F0, energy, embeddings |
| Description | Both the fundamental frequency (F0) and energy are prominent features of prosody. Together, they have been used across a wide variety of speech-processing tasks. However, there is a lack of freely available pre-trained vector representations of these features. Therefore, in this paper, we provide the research community with high-quality joint embeddings of the frame-level F0 and energy features, using the VQ-VAE architecture. By converting F0 and energy into a single stream of vector embeddings, we make it possible to use prosody seamlessly in modern architectures, such as multimodal LLMs. To ensure maximum embedding quality, we conduct a large-scale hyperparameter search, totaling over 150 experiments on the LibriTTS dataset. We outperform previous works on F0 embeddings, reaching an F0 Frame Error (FFE) below 1 percent, while simultaneously embedding the additional feature of energy. We publish our best-performing models on the Hugging Face website. |
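
The description mentions a VQ-VAE codebook without detailing the quantization step. The following is a minimal sketch of the generic nearest-neighbor codebook lookup used in vector quantization, not the authors' implementation. The function name `quantize`, the toy codebook, and the two-dimensional latent vectors are all hypothetical. In an actual VQ-VAE the lookup operates on encoder outputs rather than on raw F0 and energy values, and the codebook is learned during training.

```python
from typing import List, Sequence, Tuple


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to its nearest codebook entry (squared L2 distance)."""
    indices: List[int] = []
    for vec in latents:
        # Squared Euclidean distance from this frame's latent to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        # The index of the closest code vector is the discrete token for this frame.
        indices.append(dists.index(min(dists)))
    # The quantized sequence is the stream of selected code vectors (embeddings).
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Hypothetical toy codebook with three 2-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
# Hypothetical per-frame latent vectors (stand-ins for encoder outputs).
latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.4]]

indices, quantized = quantize(latents, codebook)
print(indices)    # one codebook index per frame
print(quantized)  # the corresponding embedding vectors
```

Each frame is thereby reduced to a single codebook index, and the sequence of selected code vectors forms the single stream of embeddings that a downstream model could consume.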