Tamás Gábor Csapó
Speech-related biosignal processing using deep learning methods

THESIS TOPIC PROPOSAL

Institute: Budapest University of Technology and Economics
Field of science: computer sciences
Doctoral school: Doctoral School of Informatics

Thesis supervisor: Tamás Gábor Csapó
Location of studies (in Hungarian): Távközlési és Médiainformatikai Tanszék
Abbreviation of location of studies: TMIT


Description of the research topic:

During the last several years, there has been significant interest in processing speech-related biosignals (e.g. laryngeal and articulatory gestures, muscular action potentials) with deep learning methods, because these biosignals have the potential to overcome limitations of traditional acoustic-based systems for spoken communication. Within this area, articulatory-to-acoustic conversion is a subtopic often referred to as “Silent Speech Interfaces” (SSI). The main idea is to record the soundless articulatory movement and to automatically generate speech from the movement information, while the subject is not producing any sound. For this automatic conversion task, typically ultrasound tongue imaging (UTI), electromagnetic articulography (EMA), magnetic resonance imaging (MRI), surface electromyography (sEMG) or multimodal approaches are used; these methods may also be combined with a simple video recording of the lip movements, resulting in sensor-to-speech systems.
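As an illustration of the articulatory-to-acoustic mapping described above, the sketch below maps a single ultrasound tongue image frame to one frame of a mel-spectrogram with a small convolutional network. It is a minimal, hypothetical PyTorch example: the input size (64x128 pixels), the 80-dimensional mel target and all layer sizes are illustrative assumptions, not part of the proposal.

import torch
import torch.nn as nn

class UTIToMel(nn.Module):
    """Toy CNN: one ultrasound tongue image frame -> one 80-dim mel-spectrogram frame.
    All sizes are illustrative assumptions (64x128 input, 80 mel bands)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 64x128 -> 32x64
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32x64 -> 16x32
            nn.ReLU(),
        )
        self.head = nn.Linear(32 * 16 * 32, n_mels)

    def forward(self, x):                # x: (batch, 1, 64, 128)
        h = self.conv(x).flatten(1)      # (batch, 32*16*32)
        return self.head(h)              # (batch, 80) predicted mel frame

model = UTIToMel()
frames = torch.randn(4, 1, 64, 128)      # a mini-batch of ultrasound frames
mels = model(frames)                     # predicted mel frames; a vocoder would turn these into a waveform
print(mels.shape)                        # torch.Size([4, 80])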
The main challenge in the field of biosignal-based speech processing is to handle session and speaker dependency. Session dependency is a source of variance which comes from the possible misalignment of the recording equipment. For example, for UTI recordings, the probe-fixing headset has to be mounted on the speaker before use, and in practice it is impossible to mount it in exactly the same position as before. This inherently causes the recorded ultrasound video to be misaligned compared to a video recorded in a previous session, so such recordings are not directly comparable. The same problem occurs with other articulatory equipment. Although there are many research results on generating intelligible speech or recognizing content from EMA, UTI, sEMG, lip video and multimodal data, these studies were conducted on relatively small databases and typically with one or just a few speakers. Moreover, all of the articulatory tracking devices are highly sensitive to the speaker; therefore, the development of novel methods for normalization, alignment, model adaptation and speaker adaptation would be highly important.
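A very simple baseline for reducing session-to-session variance, in the spirit of the normalization methods mentioned above, is to standardize each recording session separately before training. The sketch below is an assumed NumPy example; the array shapes and the plain per-session z-scoring are illustrative only and much weaker than the alignment or adaptation methods the research would target.

import numpy as np

def normalize_per_session(sessions):
    """Z-score every session with its own mean and std.

    sessions: list of arrays, each of shape (n_frames, height, width),
              holding the ultrasound frames of one recording session.
    Returns a list of arrays with the same shapes, standardized per session,
    so that gross brightness/offset differences between sessions are reduced.
    """
    normalized = []
    for frames in sessions:
        mean = frames.mean()
        std = frames.std() + 1e-8        # avoid division by zero
        normalized.append((frames - mean) / std)
    return normalized

# Example with random data standing in for two recording sessions
session_a = np.random.rand(100, 64, 128) * 255       # brighter session
session_b = np.random.rand(80, 64, 128) * 180 + 20   # different gain/offset
norm_a, norm_b = normalize_per_session([session_a, session_b])
print(norm_a.mean(), norm_b.mean())                   # both approximately 0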
The possible research tasks of the Ph.D. student are the following:
- Review the related scientific literature, including novel results in deep neural network based biosignal processing.
- Propose, design and implement computationally feasible solutions for handling session dependency of biosignals (e.g. ultrasound / EMA).
- Conduct research on the speaker dependency of biosignals, and propose speaker adaptation methods (e.g. Capsule networks / Transformer networks).
- Propose, design and implement deep learning based solutions (e.g. convolutional and recurrent neural networks, Generative Adversarial Networks) for the Silent Speech Interface scenario.
- Demonstrate the effectiveness of the theoretical results in a sample application scenario.
- Evaluate the results with objective and subjective methods (a sketch of one common objective measure follows this list).
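For the objective evaluation mentioned in the last item, Mel-Cepstral Distortion (MCD) between natural and synthesized speech is a commonly used measure in articulatory-to-acoustic conversion studies. The snippet below is a NumPy sketch of the standard MCD formula; the mel-cepstral coefficients are assumed to be pre-computed and time-aligned, and the arrays here are random placeholders.

import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Frame-averaged Mel-Cepstral Distortion in dB.

    ref, syn: arrays of shape (n_frames, n_coeffs) with time-aligned
              mel-cepstral coefficients; the 0th (energy) coefficient
              is excluded from the distance, as is common practice.
    """
    diff = ref[:, 1:] - syn[:, 1:]                        # drop energy coefficient
    dist_per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(dist_per_frame)

# Placeholder data: 200 aligned frames with 25 mel-cepstral coefficients each
natural = np.random.randn(200, 25)
synthesized = natural + 0.1 * np.random.randn(200, 25)    # small distortion
print(f"MCD: {mel_cepstral_distortion(natural, synthesized):.2f} dB")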

Required language skills: English or Hungarian
Further requirements: basic signal processing and machine learning knowledge

Number of students who can be accepted: 1

Deadline for application: 2021-06-14


