
Deep learning based fusion of acoustic and linguistic features for spoken document retrieval

THESIS TOPIC PROPOSAL

Institute: Budapest University of Technology and Economics
computer sciences
Doctoral School of Informatics

Thesis supervisor: György Szaszák
Location of studies (in Hungarian): Távközlési és Médiainformatikai Tanszék
Abbreviation of location of studies: TMIT

Description of the research topic:

Research objectives:
Automatic summarization extracts the most relevant information from text or speech. Speech is usually first transcribed by an automatic speech recognizer (ASR), and summarization is then carried out on the resulting text, in part because text-based tools offer more advanced semantic analysis. Automatically transcribed text often lacks punctuation marks, because speech recognizers model pure word sequences; an important cue to the structural organization of the conveyed information is therefore lost. One objective of the research is to recover punctuation and information structure from speech prosody. Subsequent steps in the spoken document summarization pipeline rank the most relevant words or sentences for the summary, often using TF-IDF scores complemented by Latent Semantic Analysis. The bottleneck in this process is the limited ability to capture relations and concepts related to meaning. Deep learning based word embeddings have shown very strong semantic modelling capabilities in recent years, so a straightforward application is to use them for word or sentence ranking during summarization. Finally, combining acoustic-prosodic and linguistic features in neural networks can bring the two feature types into a common framework.
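The TF-IDF based ranking step mentioned above can be illustrated with a minimal sketch. It is not the method proposed for the thesis: it assumes each sentence is treated as its own "document", uses a smoothed IDF, scores a sentence by the length-normalized sum of its TF-IDF weights, and leaves out the Latent Semantic Analysis step entirely. The function names `tokenize` and `rank_sentences` are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?"


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in sentence.split())
    return [w for w in words if w]


def rank_sentences(sentences):
    """Return (sentence, score) pairs sorted by descending TF-IDF score.

    Assumption: every sentence counts as one document when computing
    document frequencies.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)

    # Document frequency: in how many sentences does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        score = 0.0
        for word, count in tf.items():
            # Smoothed inverse document frequency.
            idf = math.log((1 + n) / (1 + df[word])) + 1.0
            score += (count / len(tokens)) * idf
        scores.append(score)

    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in order]


if __name__ == "__main__":
    transcript = [
        "the cat sat",
        "the cat sat",
        "the prosody model",
    ]
    for sentence, score in rank_sentences(transcript):
        print(f"{score:.3f}  {sentence}")
```

Sentences made up of words that are rare across the transcript rank highest. A score of this kind only counts word occurrences, which is why it cannot capture relations between words that differ in form but are close in meaning.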

Open problems:
- Improving tokenization and automatic punctuation of ASR output by using phonological phrase alignment or another approach capable of extracting the information structure conveyed by speech prosody.
- Exploring new approaches in creating word embeddings for highly agglutinating languages; examination of morpheme based approaches, development of efficient learning algorithms capable of modelling sparse data.
- Prosodic-acoustic and linguistic feature fusion with neural networks.
- Abstractive speech summarization (where, rather than extracting whole sentences, a completely new summary is generated using only words from the original text or speech); sequence-to-sequence modelling approaches for speech summarization.
- Slot filling and question answering tasks with deep neural networks for automatic speech understanding.
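As a rough illustration of the feature fusion item in the list above, the following sketch shows one simple variant, early fusion: a linguistic vector and a prosodic vector are concatenated and passed through a single sigmoid unit that scores whether a punctuation boundary follows the word. Everything concrete here is an assumption made for the example: the choice of early fusion, the feature values, the two prosodic features (pause duration and a pitch reset flag), and the hand-set weights, which a real system would learn from data.

```python
import math


def fuse(linguistic, prosodic):
    """Early fusion: concatenate the two feature vectors into one."""
    return list(linguistic) + list(prosodic)


def boundary_probability(features, weights, bias):
    """Single sigmoid unit over the fused feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs for one word in an ASR transcript.
word_embedding = [0.2, -0.1, 0.4]   # linguistic features (toy embedding)
prosody = [0.35, 1.0]               # pause duration in seconds, pitch reset flag

# Hypothetical, hand-set parameters; a trained network would learn these.
weights = [0.5, -0.3, 0.1, 4.0, 1.5]
bias = -2.0

fused = fuse(word_embedding, prosody)
p = boundary_probability(fused, weights, bias)

if __name__ == "__main__":
    print(f"fused vector length: {len(fused)}")
    print(f"boundary probability: {p:.3f}")
```

With these toy weights, a long pause and a pitch reset push the score above 0.5, while the same word with no prosodic evidence stays below it. Other designs, such as fusing the outputs of separate prosodic and linguistic models, are equally plausible.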

Required language skills: English
Further requirements:
Some experience in speech or language technology, deep learning and neural networks
Good mathematical background (probability theory, statistics and processes)
Experience working in a Linux environment with C/C++, Python and other scripting languages

Number of students who can be accepted: 2

Deadline for application: 2018-07-31