JSALT2025
Jelinek Summer Workshop on Speech and Language Technology in Brno, Czechia, EU
June 9th - August 1st 2025

End-to-End Multi-Channel Multi-Talker ASR (EMMA)

Our aim is to advance robust speech processing for everyday conversational scenarios by addressing limitations of current state-of-the-art approaches. Current speech foundation models such as Whisper cannot natively handle multi-talker, multi-channel conversational speech. Instead, they must be integrated into a complex pipeline that combines independently trained subsystems for diarization, source separation, and automatic speech recognition (ASR), and such pipelines suffer from error propagation. This project pursues two complementary directions: 1) developing a modular multi-channel ASR system from a streamlined pipeline of existing pre-trained components, including diarization and target-speaker ASR, and fine-tuning the whole pipeline jointly to avoid error propagation; and 2) building a novel, more computationally efficient “Whisper-style” foundation model for joint diarization and ASR with extended context handling. Key research questions include whether fully end-to-end meeting transcription is feasible, how to effectively handle multi-channel data with single-channel pre-trained models, and how to integrate components, particularly diarization and ASR, in a differentiable way.

Read the detailed description of the topic

Group members:

Team Leaders:
Lukas Burget
Samuele Cornell
Senior Members:
Jun Du
Matthew Wiesner
Matthew Maciejewski
Yoshiki Masuyama
Grad Students:
Martin Kocour
Darshan Prabhu
Ruoyu Wang
Xiluo He
Alexander Polok
Marc Deegen
Jiangyu Han
Dominik Klement
Junyi Peng
Undergrad Students:
Mehmet Emre Tiryaki
Rohan Phadke
Affiliates:
Shinji Watanabe
Bolaji Yusuf