Learning deep speech representations for phonetics research


The speech signal is a rich source of information that conveys not only linguistic content but also extra- and paralinguistic information, such as the speaker's identity, gender, emotional state, age, or social status. However, these traits are hidden in complex, non-transparent variations of the speech signal and remain largely inaccessible to speech research. Given the recent progress in speech synthesis and voice conversion brought about by deep learning, we argue that synthesized speech can become a valuable tool for research in phonetics. The overarching goal of this project is therefore to explore the potential of deep generative modeling of speech as a tool to support basic research in phonetics. To constrain the task, we will not consider the synthesis of stimuli from text, but concentrate on the dedicated manipulation of speech to generate new speech signals with desired properties. The goal is to develop generative models that represent the speech signal by latent variables; this representation should be compact yet informative about the observed signal, capture different sources of variation of the speech signal in different dimensions, allow the targeted manipulation of a phonetic cue along phonetically plausible dimensions, and be amenable to human interpretation.
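The workflow described above, encoding speech into a compact latent representation, shifting a single latent dimension, and decoding back, can be sketched as follows. This is a minimal illustration only: the random linear encoder and decoder below are hypothetical stand-ins for a trained deep generative model (such as a variational autoencoder), and the dimension sizes and the chosen latent index are arbitrary assumptions, not part of the project description.

```python
import numpy as np

# Hypothetical stand-ins for a trained encoder/decoder pair: random linear
# maps playing the role of a deep generative model of speech.
rng = np.random.default_rng(0)
signal_dim, latent_dim = 64, 8  # assumed toy sizes

W_enc = rng.standard_normal((latent_dim, signal_dim)) / np.sqrt(signal_dim)
W_dec = rng.standard_normal((signal_dim, latent_dim)) / np.sqrt(latent_dim)

def encode(x):
    """Map a speech-like frame to a compact latent vector."""
    return W_enc @ x

def decode(z):
    """Map a latent vector back to the signal domain."""
    return W_dec @ z

x = rng.standard_normal(signal_dim)  # stand-in for an observed speech frame
z = encode(x)

# Manipulate one latent dimension while leaving the others fixed,
# mimicking a controlled change of a single phonetic cue.
z_shifted = z.copy()
z_shifted[3] += 2.0  # arbitrary dimension and step size for illustration

x_new = decode(z_shifted)

# In this linear sketch the change in the output lies exactly along one
# decoder column, i.e. along a single, interpretable direction of variation.
delta = x_new - decode(z)
print(np.allclose(delta, 2.0 * W_dec[:, 3]))
```

In a trained model the encoder and decoder are deep networks and the latent dimensions must first be disentangled so that an individual dimension corresponds to a phonetically meaningful factor; the sketch only shows the manipulation mechanics, not how such a disentangled representation is learned.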

Key Facts

Project duration:
04/2021 - 12/2024
Funded by:
DFG (GEPRIS project database)
Project title: Tiefe generative Modelle für die Phonetikforschung (Deep Generative Models for Phonetics Research)

Principal Investigators


Prof. Dr. Reinhold Häb-Umbach

Communications Engineering / Heinz Nixdorf Institute


Petra Wagner

Universität Bielefeld
