Thesis topic proposal
 
Tamás Gábor Csapó
Special problems for deep learning based text-to-speech synthesis

Institute: Budapest University of Technology and Economics
computer sciences
Doctoral School of Informatics

Thesis supervisor: Tamás Gábor Csapó
Location of studies (in Hungarian): Távközlési és Médiainformatikai Tanszék
Abbreviation of location of studies: TMIT


Description of the research topic:

State-of-the-art text-to-speech (TTS) synthesis is based on statistical parametric methods. In the last decade, particular attention has been paid to deep neural network (DNN) based TTS, which has gained much popularity due to its advantages in flexibility and smoothness. Another advantage is the availability of speaker adaptation methods, i.e. the ability to create an individual voice for any target speaker. In a parametric system, the speech signal is decomposed into parameters representing the excitation (e.g. fundamental frequency, F0) and the spectrum of speech, and these are fed to a machine learning system. After the statistical model has been learnt from the training data, the parameter sequences generated during synthesis are converted back to a speech signal with reconstruction methods (e.g. speech vocoders, excitation models).
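The analysis step described above can be illustrated with a minimal sketch of F0 estimation. It assumes a clean synthetic tone in place of real speech and uses a plain autocorrelation peak search; the function names `synth_tone` and `estimate_f0` and all parameter values are hypothetical, and practical vocoders use considerably more robust pitch trackers.

```python
import math


def synth_tone(f0_hz, sample_rate, duration_s):
    """Generate a pure sine tone as a stand-in for one voiced speech frame."""
    n = int(sample_rate * duration_s)
    return [math.sin(2.0 * math.pi * f0_hz * i / sample_rate) for i in range(n)]


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) from the autocorrelation peak within a plausible lag range.

    Assumption: the frame is voiced and longer than the longest searched lag.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    frame = synth_tone(f0_hz=100.0, sample_rate=8000, duration_s=0.05)
    print(estimate_f0(frame, sample_rate=8000))
```

In a full parametric system, an F0 value like this would be extracted for every frame, together with spectral parameters, and the resulting sequences would form the training targets of the statistical model.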
Speech synthesis systems can generate audible speech output in interaction with a live user if they are designed for real-time, low-latency use. User-in-the-loop experiments that investigate the interaction between users and such systems, or that explore their potential applications, would be a key contribution to the field.
Although today's TTS systems are intelligible, the limitations of current parametric techniques do not yet allow full naturalness in real-time systems, and there is room for improvement in approaching human speech. Some vocoding methods (e.g. STRAIGHT and WORLD) yield close-to-natural synthesized speech, but they are typically computationally expensive and are thus not suitable for real-time implementation, especially in embedded environments.
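Whether a vocoder is suitable for real-time use is commonly expressed as a real-time factor (RTF): processing time divided by the duration of the audio produced. The sketch below is only an illustration of that arithmetic; the function names and the example timings are hypothetical and are not measurements of any particular vocoder.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(processing_seconds, audio_seconds, max_rtf=1.0):
    """A system keeps up with playback only if its RTF stays below max_rtf."""
    return real_time_factor(processing_seconds, audio_seconds) < max_rtf


if __name__ == "__main__":
    # Hypothetical timings: 2.0 s of speech synthesized in 0.5 s or in 3.0 s.
    print(real_time_factor(0.5, 2.0), is_real_time(0.5, 2.0))
    print(real_time_factor(3.0, 2.0), is_real_time(3.0, 2.0))
```

On embedded hardware a stricter `max_rtf` than 1.0 may be a sensible choice, so that headroom remains for the rest of the synthesis pipeline.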
The possible research tasks of the Ph.D. student are the following:
- Review the related scientific literature, including recent results in deep neural network based text-to-speech synthesis.
- Propose, design and implement computationally feasible solutions for speech vocoding.
- Conduct research on the speaker dependency of TTS, and propose speaker adaptation methods.
- Propose, design and implement deep learning based solutions (e.g. convolutional and recurrent neural networks, generative adversarial networks) for TTS with the new vocoder.
- Demonstrate the effectiveness of the theoretical results in a sample application scenario.
- Evaluate the results with objective and subjective methods.
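As an illustration of the objective evaluation mentioned in the last task, the following minimal sketch computes mel-cepstral distortion (MCD), a widely used objective measure of spectral distance between natural and synthesized speech. It assumes the two utterances are already time-aligned frame by frame and that the 0th (energy) coefficient is excluded; the function name and the toy input values are hypothetical, and the proposal itself does not prescribe a specific metric.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Average MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each frame is a sequence of coefficients; index 0 (energy) is skipped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equally long")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared_error)
    return total / len(ref_frames)


if __name__ == "__main__":
    # Toy coefficients, not real speech data.
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.4, -0.1], [0.8, 0.5, -0.3]]
    print(mel_cepstral_distortion(reference, synthesized))
```

Objective scores of this kind are usually reported alongside subjective listening tests, since a low spectral distance does not by itself guarantee perceived naturalness.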

Required language skills: English or Hungarian
Further requirements: basic signal processing and machine learning knowledge

Number of students who can be accepted: 1

Deadline for application: 2020-06-15

