JSALT2025
Jelinek Summer Workshop on Speech and Language Technology in Brno, Czechia, EU
June 9th - August 1st 2025

Summer Workshop

Summer workshops will take place at the Faculty of Information Technology, June 23rd - August 1st 2025.

Three to four research topics will be selected. A research team will be formed around each topic.

Why attend

Topics & Research Groups

Topic: Play your Part: Towards LLM role-playing agents that stick to their role

Large Language Models (LLMs) - neural networks trained as auto-regressive generative models on web-scale text datasets - can be prompted to perform various tasks, including dialogue, enabling natural, human-like interaction. To facilitate interaction with LLMs and prevent harmful behavior, complex prompts are crafted to shape the persona of the simulated character. This topic addresses consistency and controllability of LLM agents in the challenging context of long-form interactions. We propose a two-pronged approach. First, we will explore metrics that identify and quantify deviations from desired behavior, along with the evaluation sets needed to measure these metrics effectively. Second, we will mitigate such deviations by developing improved control techniques. Our methods will build on a deeper understanding of the mechanisms underlying role-playing and jailbreaking, gained through modern mechanistic interpretability techniques and model-based analysis of interaction dynamics. Two applications involving long-form interaction and of significant practical relevance - multi-turn task-oriented dialogues and the simulation of doctor-patient interactions with diverse personas - will inform the design of our methods and serve as testbeds for their evaluation.
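As a toy illustration of the first prong, a deviation metric could score each dialogue turn against the persona description and flag turns that drift off-role. The bag-of-words cosine similarity and all names below are illustrative assumptions for this sketch, not the team's planned method.

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_deviations(persona: str, turns: list[str], threshold: float = 0.1) -> list[int]:
    """Return indices of dialogue turns whose similarity to the persona
    description falls below the threshold (candidate role deviations)."""
    return [i for i, t in enumerate(turns) if bow_cosine(persona, t) < threshold]

# Example: the second turn shares no vocabulary with the persona and is flagged.
persona = "polite medieval innkeeper who speaks about the inn and its guests"
turns = ["welcome to the inn traveler",
         "as a language model i cannot help with that"]
print(flag_deviations(persona, turns))
```

A real metric would of course rely on learned representations rather than word overlap; the sketch only shows the shape of the measurement loop.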

Read detailed description of the topic

Topic: Advancing Expert-Level Reasoning and Understanding in Large Audio Language Models

To exhibit intelligence in the physical world, both AI agents and humans must comprehend and then reason about sound, including speech, non-speech sounds, and music. However, research in complex reasoning with audio has lagged behind modalities such as language and vision. This gap stems from several challenges: the limited capabilities of current audio-understanding algorithms and architectures, the scarcity of large-scale training datasets, and the lack of comprehensive benchmarks for assessing advanced audio processing capabilities. The recent open-source MMAU benchmark has revealed that even state-of-the-art Large Audio Language Models (LALMs), including proprietary ones, achieve only 53% accuracy on complex audio reasoning tasks. This deficiency represents a crucial bottleneck in the development of multimodal AI systems and the progression toward AGI. We are embarking on an intensive project to address critical limitations in foundational LALMs. Our workshop is focused on advancing expert-level understanding and complex reasoning in audio-language models. The team, drawn from several universities and industry partners in the US, Europe, and Asia, with students and senior professionals from various disciplines, will allow us to achieve these goals.

Read detailed description of the topic

Topic: End-to-End Multi-Channel Multi-Talker ASR (EMMA)

Our aim is to advance robust speech processing for everyday conversational scenarios, addressing limitations in current state-of-the-art approaches. Current speech foundation models such as Whisper cannot natively handle multi-talker, multi-channel conversational speech. Instead, they must be integrated into a complex pipeline that combines independently trained subsystems for diarization, source separation, and automatic speech recognition (ASR), which suffers from error propagation. This project pursues two complementary directions: 1) developing a modular multi-channel ASR system as a streamlined pipeline of existing pre-trained components, including diarization and target-speaker ASR, fine-tuned jointly to avoid error propagation; and 2) building a novel, more computationally efficient “Whisper-style” foundation model for joint diarization and ASR with extended context handling. Key research questions include the feasibility of fully end-to-end meeting transcription, how to effectively handle multi-channel data with single-channel pre-trained models, and the differentiable integration of components, particularly diarization and ASR.
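The first direction, a modular diarize-then-transcribe pipeline, can be sketched as below. The `Segment` type and the component signatures are hypothetical placeholders for real pre-trained diarization and target-speaker ASR models; the sketch illustrates the data flow only, not the joint fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "spk0" (placeholder scheme)
    start: float   # seconds
    end: float
    audio: bytes   # raw samples for this region (placeholder type)

# Hypothetical component interfaces; a real system would wrap pre-trained
# diarization and target-speaker ASR models behind these signatures.
Diarizer = Callable[[bytes], list[Segment]]
TargetSpeakerASR = Callable[[Segment], str]

def transcribe_meeting(audio: bytes,
                       diarize: Diarizer,
                       asr: TargetSpeakerASR) -> list[tuple[str, str]]:
    """Modular pipeline: diarize first, then run target-speaker ASR on each
    segment, returning (speaker, text) pairs in temporal order."""
    segments = sorted(diarize(audio), key=lambda s: s.start)
    return [(seg.speaker, asr(seg)) for seg in segments]
```

In the project's first direction both stages would then be fine-tuned together, so that ASR gradients can correct diarization errors instead of inheriting them.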

Read detailed description of the topic

Topic: TTS4ALL: TTS in low-resource scenarios - data management, methodology, models, evaluation

The project aims to effectively train and evaluate TTS systems under scarce training data and in complex linguistic contexts. We aim to set up effective data collection, preparation, and evaluation protocols adapted to this situation. We will also explore effective strategies for training TTS models for spoken languages without a written form, or for dialects without standardized writing systems. In addition, we will address the use of Self-Supervised Learning (SSL) for building TTS and investigate SSL layers to find where linguistic content and emotions are encoded. Furthermore, we will benefit from our multidisciplinary and highly skilled team to build TTS for additional applications, including speech pseudonymization and streaming TTS. Speech pseudonymization is an area lacking existing resources and prior studies. It involves altering the linguistic content of recorded natural speech to protect the speaker’s identity while maintaining the intelligibility of the utterance. This could be particularly useful in scenarios where privacy is a concern, such as legal or child-protection contexts. Streaming TTS is also an emerging topic: it allows speech generation to proceed as symbolic inputs (text or discrete tokens) are provided. This could be particularly useful for integrating TTS with the output of a textual Large Language Model (LLM) or for simultaneous speech translation. Streaming TTS could enable real-time applications where immediate feedback is required, such as conversational agents or live broadcasting.
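To make the streaming idea concrete, here is a minimal sketch of a synthesizer loop that emits audio as symbolic inputs arrive, buffering a small lookahead for right-context. The `synthesize_chunk` callable and the lookahead scheme are illustrative assumptions, not a description of an actual system.

```python
from typing import Callable, Iterable, Iterator

def streaming_tts(tokens: Iterable[str],
                  synthesize_chunk: Callable[..., bytes],
                  lookahead: int = 2) -> Iterator[bytes]:
    """Sketch of streaming TTS: buffer a small lookahead of symbolic inputs
    (text or discrete tokens), then emit audio for the oldest token as soon
    as enough right-context is available. `synthesize_chunk` stands in for
    a real incremental synthesizer."""
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) > lookahead:
            # enough right-context: synthesize the oldest buffered token
            yield synthesize_chunk(buf[0], context=buf[1:])
            buf.pop(0)
    # input stream ended: flush the remaining buffered tokens
    while buf:
        yield synthesize_chunk(buf[0], context=buf[1:])
        buf.pop(0)
```

Because audio leaves the generator before the full input is known, a loop of this shape could sit directly behind a token-by-token LLM output stream; the lookahead trades latency for synthesis context.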

Read detailed description of the topic

Interested in getting involved? Joining one of the research teams requires a personal invitation from the group leader and negotiation of practical terms with the organizers. Please contact us.

Program

Workshop (June 23rd - August 1st 2025)

Research teams will work on selected topics.

Opening Day (June 23rd 2025)

Program/presentations will be published here.

Plenary Lectures

Expected once a week. Program/presentations will be published here.

Social events during workshop

Program will be published here.

Closing Presentations (July 31st - August 1st 2025)

Program/presentations will be published here.

You might also be interested in previous workshops.