Tamás Gábor Csapó
Expressive Speech Signal Processing with AI advances

THESIS TOPIC PROPOSAL

Institute: Budapest University of Technology and Economics
Discipline: computer sciences
Doctoral school: Doctoral School of Informatics

Thesis supervisor: Tamás Gábor Csapó
Location of studies: Department of Telecommunications and Media Informatics
Abbreviation of location of studies: TMIT


Description of the research topic:

AI systems now display human-like capabilities such as learning, planning, analyzing, and speaking. Image recognition for face unlock in cellphones, ML-based financial fraud detection, and voice assistants are examples of AI software used in everyday life. Natural human speech is highly expressive and varies with emotion (e.g., angry, sad), speaking style (e.g., whispering, boasting), speaker characteristics (e.g., age, gender), accent (e.g., native, foreign), and the conversational register through which people express their relationships. Thanks to advances in deep learning, text-to-speech models built with artificial neural networks have greatly improved the quality and naturalness of synthesized speech. However, these systems are often perceived as lacking expressiveness: they are difficult to interpret, they do not convey the full range of information carried by speech, and they give the user little control over how the speech sounds. This is because people speak with complex rhythm, intonation, and timbre that are challenging for AI to emulate. Therefore, the quality of synthesized expressive speech remains a challenge, and finding ways to add expressiveness to speech synthesis models is still an open issue.
To address this problem, we first need to understand the composition of spoken utterances, which combine content, speaker, and style factors. The research question is then how variations in speech characteristics such as intonation, stress, rhythm, and prosody can be modeled for text-to-speech synthesis. Thus, our goal at BME TMIT SmartLab is to address the problem of synthesizing expressive speech and to build an expressive, controllable Text-to-Speech (TTS) framework with high-quality and robust synthesis.
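
As a purely illustrative example of how these factors can be kept separate in a neural model, the sketch below conditions a simple PyTorch text encoder on distinct content, speaker, and style embeddings. This is not the proposed framework; the module structure, dimensions, and vocabulary sizes are assumptions chosen only for the example.

import torch
import torch.nn as nn

class FactorizedTTSEncoder(nn.Module):
    """Encodes phoneme content and adds speaker and style conditioning."""

    def __init__(self, n_phonemes=80, n_speakers=10, n_styles=5, dim=256):
        super().__init__()
        self.content = nn.Embedding(n_phonemes, dim)  # what is said
        self.speaker = nn.Embedding(n_speakers, dim)  # who says it
        self.style = nn.Embedding(n_styles, dim)      # how it is said
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, phoneme_ids, speaker_id, style_id):
        # Content sequence plus broadcast speaker/style embeddings
        x = self.content(phoneme_ids)
        x = x + self.speaker(speaker_id).unsqueeze(1)
        x = x + self.style(style_id).unsqueeze(1)
        h, _ = self.encoder(x)
        return self.proj(h)  # conditioning for an acoustic decoder/vocoder

# Toy usage: a batch of 2 utterances, 12 phoneme IDs each
enc = FactorizedTTSEncoder()
phonemes = torch.randint(0, 80, (2, 12))
out = enc(phonemes, speaker_id=torch.tensor([0, 1]), style_id=torch.tensor([2, 4]))
print(out.shape)  # torch.Size([2, 12, 256])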

Research tasks:

- Present an overview of current TTS techniques, focusing on end-to-end expressive neural architectures.
- Study and analyze the effectiveness and weaknesses of speech features such as fundamental frequency (F0), duration, intensity, and other prosodic parameters (an extraction sketch follows this list).
- Design a neural architecture that can synthesize speech in a computationally efficient, parallel way, even with limited training data.
- Build models and tools for high-quality, controllable speech synthesis that capture the expressiveness of human speech.
- Evaluate the system from different aspects, including the intelligibility, naturalness, and listener preference of the synthetic speech (a listening-test summary sketch also follows this list).
- Contribute to high-quality research papers and publish them at peer-reviewed conferences and in journals.
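
As a minimal sketch of extracting the prosodic parameters mentioned above, the example below computes an F0 track, a frame-level intensity proxy, and a crude voiced-duration measure with librosa. The file name is a placeholder and the analysis settings are assumptions, not prescribed choices.

import librosa
import numpy as np

wav, sr = librosa.load("utterance.wav", sr=22050)  # placeholder file name

# Fundamental frequency (F0) track via probabilistic YIN
f0, voiced_flag, _ = librosa.pyin(
    wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Intensity proxy: frame-level RMS energy in dB
rms = librosa.feature.rms(y=wav)[0]
intensity_db = librosa.amplitude_to_db(rms, ref=np.max)

# Crude duration proxy: total voiced time (512-sample hop assumed)
voiced_seconds = np.sum(voiced_flag) * 512 / sr

print(f"mean F0: {np.nanmean(f0):.1f} Hz, "
      f"mean intensity: {intensity_db.mean():.1f} dB, "
      f"voiced time: {voiced_seconds:.2f} s")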
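
For the evaluation task, the sketch below shows one conventional way to summarize a listening test: mean opinion scores (MOS) with 95% confidence intervals and a simple AB preference count. All ratings are made-up placeholders, not results.

import numpy as np
from scipy import stats

# 1-5 naturalness ratings per system (placeholder values only)
ratings = {
    "baseline": np.array([3, 4, 3, 3, 4, 3, 2, 4, 3, 3]),
    "expressive": np.array([4, 4, 5, 4, 3, 4, 4, 5, 4, 4]),
}

for system, r in ratings.items():
    mos = r.mean()
    ci = stats.t.interval(0.95, len(r) - 1, loc=mos, scale=stats.sem(r))
    print(f"{system}: MOS {mos:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")

# AB preference test: how often listeners preferred each system
preferences = ["expressive", "expressive", "baseline", "expressive", "expressive"]
for system in sorted(set(preferences)):
    print(f"{system} preferred in {preferences.count(system)}/{len(preferences)} trials")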

Required language skills: English
Number of students who can be accepted: 1

Deadline for application: 2023-01-10

