JSALT2025
Jelinek Summer Workshop on Speech and Language Technology in Brno, Czechia, EU
June 9th - August 1st 2025
Watch JSALT2025 live on YouTube.
5th plenary: Friday, July 18th, 11:00, Room E112.
Hervé Bredin: Speaker diarization, a love loss story

Plenary Lectures

All lectures will be live-streamed via YouTube.

Plenary lecture 1 Fri June 27, 11:00, Room E112, Neural Target Speech and Sound Extraction: An Overview

Marc Delcroix [NTT Communication Science Laboratories]

Humans can listen to a desired sound within a complex acoustic scene consisting of a mixture of various sounds. This phenomenon, called the cocktail party effect or selective hearing, enables us to listen to an interlocutor in a noisy cafe, focus on a particular instrument in a song, or notice a siren on the road.

In this talk, I will discuss target speech/sound extraction (TSE), which isolates the speech signal of a target speaker or a target sound from a mixture of several speakers or sounds using clues that identify the target in the mixture. Such clues might be a spatial clue indicating the direction of the target, a video of the target, or a prerecorded enrollment audio from which the speaker’s voice or the target sound characteristics can be derived. I will introduce the foundation and present recent research on neural-based TSE for speech and arbitrary sounds.
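To make the clue-conditioning idea concrete, here is a minimal, illustrative PyTorch sketch of a TSE-style network: an enrollment recording is summarized into a speaker embedding that conditions a mask-estimation network applied to the mixture. All module names, sizes, and the overall layout are assumptions for illustration, not the specific architectures covered in the talk.

```python
# Minimal sketch of clue-conditioned target speech extraction (TSE):
# an enrollment encoder produces a speaker embedding that conditions a
# mask-estimation network applied to the mixture. All sizes and names
# are illustrative assumptions, not the talk's architecture.
import torch
import torch.nn as nn

class TinyTSE(nn.Module):
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        # Enrollment branch: summarizes the target speaker's voice.
        self.enroll_enc = nn.GRU(n_freq, emb_dim, batch_first=True)
        # Mixture branch: conditioned on the speaker embedding at every frame.
        self.mix_enc = nn.GRU(n_freq + emb_dim, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, enroll_spec):
        # mix_spec, enroll_spec: (batch, frames, n_freq) magnitude spectrograms
        _, h = self.enroll_enc(enroll_spec)           # (1, batch, emb_dim)
        spk_emb = h[-1]                               # (batch, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        feats, _ = self.mix_enc(torch.cat([mix_spec, cond], dim=-1))
        mask = self.mask_head(feats)                  # time-frequency mask in [0, 1]
        return mask * mix_spec                        # estimated target spectrogram

mix = torch.rand(2, 100, 257)     # mixture of several speakers
enroll = torch.rand(2, 50, 257)   # prerecorded enrollment of the target
est = TinyTSE()(mix, enroll)
print(est.shape)                  # torch.Size([2, 100, 257])
```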

Marc Delcroix is a Distinguished Researcher with NTT Communication Science Laboratories, NTT Corporation, Japan. He received the M.Eng. degree from the Free University of Brussels, Brussels, Belgium, and Ecole Centrale Paris, Paris, France, in 2003, and the Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in 2007. His research interests include various aspects of speech and audio processing, such as target speech and sound extraction, speech enhancement, robust speech recognition, model adaptation, and speaker diarization.

He is a member of the CHiME challenge steering committee and the AASP-TC, a past member of the SL-TC (2018-2023), and served on the organizing committees of the REVERB Challenge 2014, ASRU 2017, and SLT 2022.

Plenary lecture 2 Tue July 1, 11:00, Room E112, Cross-layer models for conversational speech recognition in low-resourced scenarios

Barbara Schuppler [TU Graz, Austria]

In recent years, conversational speech has become a major focus in speech science and technology. As dialogue systems evolve from transactional tools into socially interactive agents, they demand increasingly accurate automatic speech recognition (ASR). At the same time, conversational data offers unique insights into human speech processing. Drawing on the cross-layer optimization principle from communications engineering, I adopt a similar view of how meaning is accessed across multiple levels of speech information. In this talk, I present findings from my group’s work on integrating pronunciation and prosodic variation into ASR for conversational speech. Our hybrid approach—combining data-driven and knowledge-based methods—proves especially effective in low-resource settings. While transformer-based models often outperform classical systems, the latter still excel with short, fragmented utterances when paired with linguistic knowledge. Beyond ASR, our methods inform fields like pathological speech analysis, dementia prediction, and assistive speech technologies.

Barbara Schuppler studied Physics and Spanish Philology at the University of Graz and the Universidad Autónoma de Madrid, completing a diploma thesis in experimental physics in 2007. She conducted her dissertation within the Marie-Curie RTN "Sound-to-Sense" at Radboud University Nijmegen, with research visits at NTNU Trondheim. After working as a teacher at the Graz International Bilingual School, she was awarded an FWF Hertha-Firnberg Grant in 2012 and joined the Signal Processing and Speech Communication Laboratory at TU Graz. Now Associate Professor at TU Graz, her research interests include methods for the quantitative analysis of prosody and pronunciation variation in conversational speech and the integration of the resulting phonetic and linguistic knowledge into speech technology, with a specific focus on applications in the educational and healthcare sectors.

Plenary lecture 3 Thu July 10, 11:00, Room E112, Differentiable Modeling for Machine Learning

Ramani Duraiswami [University of Maryland, College Park]

Deep neural networks have achieved great success in learning functions that relate large datasets to their labels in areas like natural language processing, computer vision, and speech processing, and they continue to revolutionize many areas of science and society. I will briefly describe various themes of machine learning research underway in my group at UMD: Differentiable Modeling, speeding up the Attention mechanism in Training and Inference in Transformer architectures, and Large Audio Language Models.

The main part of my talk will focus on various uses of differentiable forward modeling.

An under-appreciated aspect of the deep learning revolution is the use of automatic differentiation and backpropagation on differentiable computational graphs to obtain the parameters specifying the networks. Before learning from data became the method of choice, scientists spent entire careers developing forward models that captured much scientific knowledge about their domains, grounded in mathematics, physics, biology, and other fields. Making these forward models differentiable allows that knowledge to be incorporated into deep learning architectures, yielding computational pipelines that combine deep learning with parameter optimization, cost-function minimization, inverse-problem solution, implicit neural representations, and the learning of explainable models, and that work well in domains where data is sparse.

We apply these ideas in domains like computer graphics, human hearing, room acoustics, signal processing, and in the solution of inverse problems arising in mathematical physics. We will present example solutions and results.
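As a toy illustration of the differentiable-forward-model idea (not one of the applications from the talk), the sketch below writes a classical forward model, a damped sinusoid, in PyTorch and recovers its physical parameters from noisy observations by backpropagating through the model. The model choice, parameter values, and optimizer settings are all assumptions for illustration.

```python
# Minimal sketch of differentiable forward modeling: write a classical
# forward model in an autodiff framework, then recover its physical
# parameters from observations by gradient descent. The damped oscillator
# is an illustrative stand-in, not a model from the talk.
import torch

def forward_model(t, amp, freq, decay):
    # Classical forward model: y(t) = A * exp(-d t) * sin(2 pi f t)
    return amp * torch.exp(-decay * t) * torch.sin(2 * torch.pi * freq * t)

t = torch.linspace(0, 2, 200)
true = forward_model(t, torch.tensor(1.5), torch.tensor(3.0), torch.tensor(0.8))
observed = true + 0.02 * torch.randn_like(true)       # noisy measurements

# Unknown parameters, optimized through the differentiable forward model.
params = torch.tensor([1.0, 2.5, 0.5], requires_grad=True)  # [amp, freq, decay]
opt = torch.optim.Adam([params], lr=0.05)
for step in range(500):
    opt.zero_grad()
    pred = forward_model(t, *params)
    loss = torch.mean((pred - observed) ** 2)          # data-fit cost function
    loss.backward()                                    # backprop through the physics
    opt.step()

print(params.detach())  # should move toward the true [1.5, 3.0, 0.8]
```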

Ramani Duraiswami is a Professor in the Department of Computer Science at the University of Maryland, College Park. He also holds appointments in the Artificial Intelligence Institute at Maryland, UMIACS, Electrical Engineering, the Robotics program, the Neural and Cognitive Sciences program, and the Applied Math and Scientific Computing program at the same university. Prof. Duraiswami received his B.Tech. from IIT Bombay and his Ph.D. from The Johns Hopkins University. His research interests are in machine learning, scientific computing, and computational perception. Two companies have been spun out of his research, and the audio engine used in content that plays on millions of shipping VR headsets, PCs, and headphones is based on work from his lab.

Plenary lecture 4 Tue July 15, 11:00, Room E112, Large Concept Model: beyond token-based Large Language Models

Loïc Barrault [Meta AI, France]

Current methods to reach Advanced Machine Intelligence are almost all based on the token-level Large Language Model paradigm. It has attracted a large part of the research community, and huge progress has been made. However, we argue that token-based LLMs lack crucial characteristics of human intelligence, such as explicit reasoning and planning, hierarchical processing, and multilingual processing, which limits their potential. In this talk, I will present the Large Concept Model, a model trained to reason over a multimodal and multilingual sentence representation space. This diffusion-based model shows strong performance on several generative tasks and exhibits strong zero-shot multilingual capabilities. Several variants have been explored, including an initial attempt at hierarchical processing of text.

Loïc Barrault is a Research Scientist at Meta AI. Previously, he was a Senior Lecturer in the NLP group of the University of Sheffield and an Associate Professor at LIUM, University of Le Mans. He obtained his PhD at the University of Avignon in 2008 in the field of automatic speech recognition. His research focuses on statistical and neural machine translation, considering multiple modalities (multimodal neural machine translation) and designing lifelong learning methods for MT. Recent work includes text-to-text machine translation for 200 languages (NLLB200), speech-to-speech translation for 100 languages (Seamless-M4T) along with its expressive version (Seamless Expressive), and reasoning in the embedding space (Large Concept Model).

Plenary lecture 5 Fri July 18, 11:00, Room E112, Speaker diarization, a love loss story

Hervé Bredin [pyannoteAI]

This talk traces the evolution of speaker diarization systems, from traditional multi-stage approaches to modern end-to-end architectures. I will begin with a historical overview, highlighting the shift toward end-to-end modeling and the subsequent partial return to hybrid approaches. The core of the talk focuses on loss functions designed to address key challenges in diarization: permutation invariance through the powerset loss, streaming diarization via a look-ahead loss, and the extension to joint speaker segmentation and separation using the PixIT loss.
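As a rough illustration of the powerset idea (a sketch only, not pyannote.audio's actual implementation), the snippet below maps frame-level multi-speaker activity to single powerset class labels so that ordinary cross-entropy can replace a permutation-invariant multi-label loss. The assumed inventory (at most three local speakers, at most two overlapping) is a common choice but an assumption here.

```python
# Minimal sketch of the powerset formulation: each frame is labeled with
# one class per *subset* of active speakers and trained with plain
# cross-entropy. Class inventory (<=3 local speakers, <=2 overlapping)
# is an assumption for illustration.
from itertools import combinations
import torch
import torch.nn.functional as F

speakers = [0, 1, 2]
# Powerset classes: silence, each single speaker, each pair of speakers.
classes = [frozenset()] + [frozenset(c)
                           for k in (1, 2)
                           for c in combinations(speakers, k)]
class_index = {c: i for i, c in enumerate(classes)}   # 7 classes in total

def multilabel_to_powerset(activity):
    # activity: (frames, n_speakers) binary matrix -> (frames,) class indices
    return torch.tensor([class_index[frozenset(torch.nonzero(f).flatten().tolist())]
                         for f in activity])

activity = torch.tensor([[0, 0, 0],    # silence
                         [1, 0, 0],    # speaker 0
                         [1, 1, 0]])   # speakers 0 and 1 overlap
targets = multilabel_to_powerset(activity)            # tensor([0, 1, 4])
logits = torch.randn(3, len(classes))                 # frame-wise model outputs
loss = F.cross_entropy(logits, targets)               # ordinary multi-class loss
print(targets, loss.item())
```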

Hervé Bredin (a.k.a. the pyannote guy) is currently on leave from CNRS (he was a tenured research scientist there between 2008 and 2025) and is now Chief Science Officer at pyannoteAI, a startup he co-founded around the pyannote.audio open-source speaker diarization toolkit.

Plenary lecture 6 Tue July 22, 11:00, Room E112, Methodologies for Music Understanding and Generation in the Context of Trustworthy AI

Xavier Serra [Universitat Pompeu Fabra, Barcelona]

Music is a deeply human form of expression, shaped by rich cultural, social, and perceptual dimensions. At the Music Technology Group (MTG) of Universitat Pompeu Fabra, we approach music understanding and generation as computational challenges that require not only powerful machine learning techniques but also domain-aware, transparent, and ethically grounded methodologies. In this talk, I will provide an overview of the MTG's research efforts at the intersection of music and artificial intelligence, highlighting our work on audio analysis, symbolic music processing, generative modeling, and multimodal representation learning. I will place special emphasis on how we incorporate principles of trustworthy AI—including fairness, transparency, cultural awareness, and reproducibility—across our research pipeline. Through examples from recent projects I will illustrate the importance of interdisciplinary collaboration and open science practices in advancing the field in a socially responsible way. The talk aims to foster dialogue on how music AI research can balance innovation with accountability, and how technical choices can reflect broader values—particularly when dealing with creative, subjective, and culturally diverse domains like music.

Xavier Serra is Professor at the Department of Engineering of Universitat Pompeu Fabra (UPF) in Barcelona, where he founded and directs the Music Technology Group (MTG). With a PhD in Computer Music from Stanford University (1989), he is internationally recognized for his contributions to sound and music computing, particularly in the analysis, description, and synthesis of musical signals, and in the development of AI-based methodologies for music understanding. He was awarded an Advanced Grant from the European Research Council (ERC) for the CompMusic project, a pioneering initiative that promoted multicultural and interdisciplinary approaches in music information research. He currently directs the UPF-BMAT Chair on Artificial Intelligence and Music, which is dedicated to fostering ethical, transparent, and context-aware AI technologies that can empower the music sector and support creators, educators, and users. His research combines data-driven and knowledge-based approaches, with applications in music creation, education, and cultural heritage. A strong advocate of open science, he actively promotes open datasets, software, and reproducible research practices. He has led numerous international and national research projects, collaborated closely with industry and cultural institutions, and supervised a large number of doctoral students, contributing to the advancement of a trustworthy and human-centered music technology.

Plenary lecture 7 Thu July 24, 11:00, Room E112, Helpful AI Models: You can't always get what you want, but you might get what you need

Jordan Boyd-Graber Ying [University of Maryland]

AIs are trained in many ways, depending on the application. For example, on specific tasks the goal is to maximize accuracy; with "general purpose" LLMs, the goal is to give users answers they want. This talk argues that the focus should be slightly different: we should specifically measure human-computer workflows and optimize the ability of the combined team at the task. I'll discuss three examples of human-computer teams that our group has explored: learning new vocabulary, strategic negotiation, and identifying false claims. For learning new vocabulary, we adapt alignment tuning to combine what looks helpful and what is truly helpful into a flashcard scheduler that can improve the overall quality of study aids from our QA system. For strategic negotiation, we have computer agents help human players of the board game Diplomacy think strategically and detect lies, which we capture using an analysis of grounded statements with abstract meaning representation and value functions. Finally, we show that computers can help humans identify false statements, but only when the computer is not confidently incorrect. I'll close with a discussion of how these questions lead into a broader discussion of human skill vs. computer skill, how to measure it, and on what datasets.

Jordan Boyd-Graber is a full professor at the University of Maryland. He has worked on model evaluations for human-centered topic models, psychologically inspired leaderboards, human–computer machine translation, and question answering. He also contributed new models for improving generative models with RL, interactive approaches for question answering, topic models, and negotiations. Of his twenty former PhD students, five have gone on to tenure track positions. He and his students have been recognized with paper awards at EMNLP (2023), IUI (2018), NAACL (2016), and NeurIPS (2009, 2015), and he won the 2015 Karen Spärck Jones Award and a 2017 NSF CAREER Award. He served as PC for ACL 2023, SAC for EMNLP and NAACL, AC for ACL, NAACL, EMNLP, and NeurIPS, Poster Chair for EMNLP 2022, Tutorial Chair for ACL 2017, and Advisor for the ACL 2014 SRW.

He previously was an assistant professor at the University of Colorado, a Visiting Research Scientist at Google Zürich, and an intern (Praktikant) at the Berlin-Brandenburg Akademie der Wissenschaften. His undergraduate degrees are in Computer Science and History from the California Institute of Technology, and he received his PhD from Princeton University.