
Special problems for deep learning based text-to-speech synthesis

THESIS TOPIC PROPOSAL

Institute: Budapest University of Technology and Economics
computer sciences
Doctoral School of Informatics

Thesis supervisor: Tamás Gábor Csapó
Location of studies (in Hungarian): Távközlési és Médiainformatikai Tanszék
Abbreviation of location of studies: TMIT

Description of the research topic:

State-of-the-art text-to-speech (TTS) synthesis is based on statistical parametric methods. In the last decade, particular attention has been paid to deep neural network (DNN) based TTS, which has gained much popularity due to its advantages in flexibility and smoothness. Another advantage is the availability of speaker adaptation methods, i.e. the ability to create an individual voice for any target speaker. In a parametric system, the speech signal is decomposed into parameters representing the excitation (e.g. fundamental frequency, F0) and the spectrum of speech, and these are fed to a machine learning system. After the statistical model has been trained on the training data, the parameter sequences generated during synthesis are converted back to a speech signal with reconstruction methods (e.g. speech vocoders, excitation models).
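As a rough illustration of what extracting an excitation parameter involves, the sketch below estimates F0 for a single frame by picking the peak of the normalized autocorrelation. This is a textbook technique chosen for brevity, not the analysis used by any particular vocoder; the function name `estimate_f0`, the search range and the voicing threshold are assumptions made for this example.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0,
                voicing_threshold=0.3):
    """Estimate F0 in Hz for one frame; return 0.0 for unvoiced frames.

    Illustrative autocorrelation peak picking, not a production tracker.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]           # remove the DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0                          # silent frame: treat as unvoiced

    # Candidate pitch periods (in samples) inside the allowed F0 range.
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 1)

    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr

    # Weak periodicity is interpreted as an unvoiced frame.
    if best_lag == 0 or best_corr < voicing_threshold:
        return 0.0
    return sample_rate / best_lag


# Usage: a synthetic 200 Hz sine, 40 ms long, sampled at 16 kHz.
sample_rate = 16000
sine_frame = [math.sin(2.0 * math.pi * 200.0 * i / sample_rate)
              for i in range(640)]
f0 = estimate_f0(sine_frame, sample_rate)
```

For the synthetic sine the estimate should land close to 200 Hz. Real speech would additionally need windowing, frame-by-frame tracking and a more robust voicing decision, and the spectral parameters would be extracted separately.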
Speech synthesis systems can generate audible speech output in interaction with a live user if they are designed for real-time, low-latency use. User-in-the-loop experiments that investigate the interaction between users and such systems, or that explore their potential applications, would be a key contribution to the field.
Although TTS systems are nowadays intelligible, the limitations of current parametric techniques do not yet allow full naturalness in real-time systems, and there is room for improvement in approaching human speech. Some vocoding methods (e.g. STRAIGHT and WORLD) yield close to natural synthesized speech, but they are typically computationally expensive and thus not suitable for real-time implementation, especially in embedded environments.
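One common way to quantify whether a vocoder is fast enough for real-time use is the real-time factor: processing time divided by the duration of the generated audio, where values below 1 mean the system runs faster than real time. The following is a minimal sketch of such a measurement. The helper `real_time_factor`, the placeholder `dummy_vocoder` and the 5 ms frame shift are assumptions for illustration, not part of any existing toolkit.

```python
import time


def real_time_factor(synthesize, n_frames, frame_shift_s=0.005):
    """Return processing time divided by the duration of the audio produced.

    `synthesize` is any callable that takes the number of frames to render.
    """
    start = time.perf_counter()
    synthesize(n_frames)
    elapsed = time.perf_counter() - start
    audio_duration = n_frames * frame_shift_s
    return elapsed / audio_duration


def dummy_vocoder(n_frames):
    """Hypothetical stand-in for a vocoder: one trivial operation per frame."""
    return [frame_index * 0.5 for frame_index in range(n_frames)]


# 200 frames at a 5 ms frame shift correspond to one second of audio.
rtf = real_time_factor(dummy_vocoder, 200)
```

In practice the measurement would be repeated over many utterances on the target hardware, since a vocoder that is real-time on a desktop CPU may not be on an embedded device.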
The possible research tasks of the Ph.D. student are the following:
- Review the related scientific literature, including recent results in deep neural network based text-to-speech synthesis.
- Propose, design and implement computationally feasible solutions for speech vocoding.
- Conduct research on the speaker dependency of TTS, and propose speaker adaptation methods.
- Propose, design and implement deep learning based solutions (e.g. convolutional and recurrent neural networks, generative adversarial networks) for TTS with the new vocoder.
- Demonstrate the effectiveness of the theoretical results in a sample application scenario.
- Evaluate the results with objective and subjective methods.

Required language skills: English or Hungarian
Further requirements:
basic signal processing and machine learning knowledge

Number of students who can be accepted: 1

Deadline for application: 2020-06-15