Our aim is to advance robust speech processing for everyday conversational scenarios, addressing key limitations of current state-of-the-art approaches. Current speech foundation models such as Whisper cannot natively handle multi-talker, multi-channel conversational speech. Instead, they must be embedded in a complex pipeline that combines independently trained subsystems for diarization, source separation, and automatic speech recognition (ASR), a design that suffers from error propagation across stages. This project pursues two complementary directions: 1) developing a modular multi-channel ASR system as a streamlined pipeline of existing pre-trained components, including diarization and target-speaker ASR, fine-tuned jointly to mitigate error propagation (see the sketch below); and 2) building a novel, more computationally efficient “Whisper-style” foundation model for joint diarization and ASR with extended context handling. Key research questions include the feasibility of fully end-to-end meeting transcription, how to handle multi-channel data effectively with single-channel pre-trained models, and the differentiable integration of components, particularly diarization and ASR.
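
To make direction 1) concrete, the minimal PyTorch sketch below illustrates one way such a differentiable pipeline could be wired: a diarization module produces soft per-speaker activity masks that condition a target-speaker ASR module, so gradients from the ASR loss can flow back into the diarizer during joint fine-tuning. All module names, dimensions, and the mask-gating conditioning scheme are illustrative assumptions, not a description of an existing system or of the project's final design.

```python
import torch
import torch.nn as nn


class Diarizer(nn.Module):
    """Stand-in for a pre-trained diarization model (hypothetical).

    Maps frame-level features to soft per-speaker activity
    probabilities; kept differentiable so ASR gradients reach it.
    """
    def __init__(self, feat_dim: int, max_speakers: int):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, max_speakers)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(feats)          # (B, T, 128)
        return torch.sigmoid(self.head(h))  # (B, T, S) soft activity masks


class TargetSpeakerASR(nn.Module):
    """Stand-in for a pre-trained single-speaker ASR model (hypothetical).

    Consumes features gated by one speaker's soft activity mask and
    emits frame-level token logits (e.g. for a CTC-style loss).
    """
    def __init__(self, feat_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 256, batch_first=True)
        self.head = nn.Linear(256, vocab_size)

    def forward(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        gated = feats * mask.unsqueeze(-1)  # soft target-speaker selection
        h, _ = self.encoder(gated)
        return self.head(h)                 # (B, T, V) token logits


class JointPipeline(nn.Module):
    """Diarization + target-speaker ASR, fine-tuned as one graph."""
    def __init__(self, feat_dim: int = 80, max_speakers: int = 4,
                 vocab_size: int = 500):
        super().__init__()
        self.diarizer = Diarizer(feat_dim, max_speakers)
        self.asr = TargetSpeakerASR(feat_dim, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        masks = self.diarizer(feats)        # (B, T, S)
        # Run ASR once per speaker slot; stack to (B, S, T, V).
        logits = [self.asr(feats, masks[..., s])
                  for s in range(masks.shape[-1])]
        return torch.stack(logits, dim=1)


if __name__ == "__main__":
    model = JointPipeline()
    feats = torch.randn(2, 100, 80)         # (batch, frames, mel bins)
    per_speaker_logits = model(feats)
    print(per_speaker_logits.shape)         # torch.Size([2, 4, 100, 500])
```

The key design choice in this sketch is that speaker selection uses soft masks rather than hard segmentation decisions, which keeps the whole computation graph differentiable during fine-tuning; at inference time the masks could still be thresholded to obtain discrete diarization output.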