JSALT2025
Jelinek Summer Workshop on Speech and Language Technology in Brno, Czechia, EU
June 9th - August 1st 2025
Watch JSALT2025 live on YouTube.
Second plenary: Tuesday, July 1st, 11:00, Room E112.
Barbara Schuppler: Cross-layer models for conversational speech recognition in low-resourced scenarios

Summer school

The summer school will take place at the Brno University of Technology, Faculty of Information Technology, June 9th - June 20th 2025.

All lectures and hands-on sessions will be held in lecture room E112 (exception: room G202 on June 12-13).

The morning lectures are open to the public. We request that you inform us at least one day in advance if you plan to attend (e-mail: jsalt2025@jhu.edu).

Program

Day 1 Mon June 9

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 09:30 Welcome
Sanjeev Khudanpur and Honza Černocký
room E112
09:30 - 12:00 Automatic Speech Recognition - from GMMs to neural networks
Petr Schwarz [Brno University of Technology]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Managing an ASR Project: From Data to Delivery
Igor Szöke and Karel Veselý [Brno University of Technology]
room E112
17:30 Summer school welcome picnic
FIT BUT E103 respirium

Day 2 Tue June 10

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Human Hearing and Speech Engineering
Hynek Hermansky [Brno University of Technology and Johns Hopkins University]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Current trends in speaker verification / Extracting speaker-related representations from speech
Oldřich Plchot [Brno University of Technology]
room E112

Day 3 Wed June 11

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Speaker Diarization: From Modular to End-to-End Systems
Federico Landini [Deepgram]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Computer Vision and Handwriting Recognition
Michal Hradiš [Brno University of Technology]
room E112

Day 4 Thu June 12

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Evolutionary machine learning in engineering design
Lukáš Sekanina [Brno University of Technology]
room G202
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Learning for physical interaction: from pixels to machines that see, reason and act
Josef Šivic [Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University and ELLIS Unit Prague]
room G202

Day 5 Fri June 13

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Natural Language Processing with Transformer-based Models I.
Ondřej Bojar and Ondřej Dušek [Charles University, Prague]
room G202
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Natural Language Processing with Transformer-based Models II.
Ondřej Bojar and Ondřej Dušek [Charles University, Prague]
room G202

Day 6 Mon June 16

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Beyond transformers - new sequence processing architectures
Jan Chorowski [Pathway], Marek Adamczyk [Warsaw University], Adrian Łańcucki [NVIDIA]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Speech Synthesis: From Models to Signals – and Back to… Models
Jindřich Matoušek [Department of Cybernetics & New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia in Pilsen]
room E112

Day 7 Tue June 17

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Robust speech recognition I.
Lukas Burget [Brno University of Technology], Samuele Cornell [Carnegie Mellon University], Yoshiki Masuyama [Mitsubishi Electric Research Laboratories]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Robust speech recognition II.
Lukas Burget [Brno University of Technology], Alexander Polok [Brno University of Technology]
room E112

Day 8 Wed June 18

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Speech Processing for Low-Resource Scenarios I.
Salima Mdhaffar [Avignon University]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Speech Processing for Low-Resource Scenarios II.
Salima Mdhaffar [Avignon University]
room E112

Day 9 Thu June 19

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Synthetic dialogue generation with role-playing LLMs I.
Ricard Marxer [University of Toulon], Sergio Burdisso [Idiap Research Institute], Alessa Carbo [Johns Hopkins University], Antonio Almudevar [University of Zaragoza], Severin Baroudi [University of Toulon], Isabella Gidi [Harvard University]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Synthetic dialogue generation with role-playing LLMs II.
Ricard Marxer [University of Toulon], Sergio Burdisso [Idiap Research Institute], Alessa Carbo [Johns Hopkins University], Antonio Almudevar [University of Zaragoza], Severin Baroudi [University of Toulon], Isabella Gidi [Harvard University]
room E112

Day 10 Fri June 20

08:00 bus departure from the BUT dorms (Kolejní 2, Brno) to the venue (Božetěchova 2/1, Brno)
08:15 breakfast at FIT BUT, Stary Pivovar
09:00 - 12:00 Introduction to Multimodal Large Language Models I.
Ramani Duraiswami [University of Maryland], Santosh Kesiraju [Brno University of Technology] and Alicia Lozano-Diez [Universidad Autónoma de Madrid]
room E112
12:00 lunch at FIT BUT, Stary Pivovar
13:00 - 17:00 Introduction to Multimodal Large Language Models II.
Ramani Duraiswami [University of Maryland], Santosh Kesiraju [Brno University of Technology] and Alicia Lozano-Diez [Universidad Autónoma de Madrid]
room E112
17:30 Summer school closing ice-cream / beer
FIT BUT E103 respirium

Day 1 Mon June 9 - morning: Automatic Speech Recognition - from GMMs to neural networks

Petr Schwarz [Brno University of Technology]

This tutorial will trace the development of speech recognizers from modular HMM/GMM architectures to modern sequence-to-sequence, end-to-end trained neural network architectures. Petr has spent more than 20 years in speech recognition research, developed a commercial speech recognition core for Phonexia, and helped to start the Kaldi toolkit. He is happy to share his experience.

Petr Schwarz [PhD, Brno University of Technology, 2009] is a senior researcher in BUT Speech@FIT at the Faculty of Information Technology (FIT) of BUT. He has broad experience in speech technologies, ranging from voice biometry, speech transcription, and keyword spotting to language identification. At BUT, Petr has worked on many national, EU, and US research projects and in many international technology evaluation campaigns, such as those organized by the U.S. National Institute of Standards and Technology (NIST). In 2006, Petr co-founded Phonexia and served for several years as its CEO and CTO. Phonexia sells speech technologies to more than 60 countries. Currently, he is working on conversational AI technologies for emergency services. Petr has also participated in several JSALT workshops.

Day 1 Mon June 9 - afternoon: Managing an ASR Project: From Data to Delivery

Igor Szöke and Karel Veselý [Brno University of Technology]

This talk explores the key challenges of deploying Automated Speech Recognition (ASR), from data collection (including privacy considerations and the transcription process) to model pretraining and the pros and cons of pre-trained vs. from-scratch models for low-resource languages. We'll also touch on runtime applications, streaming ASR, and deployment, highlighting practical decisions that impact ASR performance, latency, and scalability.

Igor Szöke is a senior researcher in BUT Speech@FIT at the Faculty of Information Technology (FIT) of BUT. He obtained his Ing. degree from BUT in 2004 and his PhD from BUT in 2010. His research areas are machine learning and speech data mining (automatic speech recognition, spoken term detection, and data augmentation). He was BUT's PI in the H2020 CleanSky2 project ATCO2, which developed a unique platform for collecting, organizing, and pre-processing air-traffic control voice communication data. He is currently coordinating CNECT/LUX/2022/OP/0030, which aims to develop an ASR system usable by EC services and European industry.

Karel Veselý, Ph.D. (Ing. [MS], Brno University of Technology, 2010; Ph.D., Brno University of Technology, 2018; SCOPUS h-index 13) is a senior researcher in the BUT Speech@FIT group. He was on research internships at LIA Avignon (France), the IBM Speech R&D group in Prague, and Johns Hopkins University. He is one of the co-founders of the Kaldi toolkit, its third-largest contributor in terms of lines of code, and the author of its nnet1 module. He is known for his work on multilingual and semi-supervised ASR training, and he has also worked on goodness-of-pronunciation systems for child speech. He was instrumental in several EC projects (ATCO2, HAAWAII), DARPA RATS, IARPA Babel, and IARPA MATERIAL, and is the key ASR developer in the European CNECT project.

Day 2 Tue June 10 - morning: Human Hearing and Speech Engineering

Hynek Hermansky [Brno University of Technology and Johns Hopkins University]

The tutorial will review a few relevant properties of human hearing. We will argue that, in the course of human evolution, these hearing properties became imprinted on human speech. Some possible implications of this hypothesis for improving speech technology will be discussed.

Hynek Hermansky received his Doctor of Engineering degree in Electrical Engineering from the University of Tokyo (1979-1983), undertook doctoral graduate studies at the Technical University Brno, Czech Republic (1973-1978), and received his Ing. (MS equivalent) from the Technical University Brno, Czech Republic (1967-1972). His current appointments are Research Professor at the Brno University of Technology, Czech Republic, and Julian S. Smith Professor Emeritus at The Johns Hopkins University. He is a Life Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a Fellow of the International Speech Communication Association (ISCA). He has been awarded the Medal for Scientific Achievements from ISCA (2013) and the Flanagan Speech and Audio Processing Award from IEEE (2020). He has over 250 peer-reviewed publications and has been granted thirteen patents.

Day 2 Tue June 10 - afternoon: Current trends in speaker verification / Extracting speaker-related representations from speech

Oldřich Plchot [Brno University of Technology]

This talk will cover state-of-the-art and emerging methods for extracting speaker representations (embeddings) from speech. We will compare unsupervised, self-supervised, weakly supervised, and fully supervised approaches and discuss various applications and use cases that fit the methods. More focus will be given to self-supervised Transformers and their use for extracting speaker representations, as these models have quickly risen in popularity and become an integral part of state-of-the-art speech modeling for automatic speech recognition.
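Since the talk centers on comparing speaker representations, the following minimal sketch (an illustration, not material from the tutorial) shows the last step of a typical verification pipeline: scoring two embeddings by cosine similarity and thresholding. The random vectors and the threshold value are placeholders for embeddings from a trained extractor and a calibrated decision point.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def verify(emb_enroll: np.ndarray, emb_test: np.ndarray,
           threshold: float = 0.5) -> bool:
    """Accept the trial if the score exceeds the threshold; the value
    0.5 is a placeholder, calibrated on development data in practice."""
    return bool(cosine_score(emb_enroll, emb_test) >= threshold)

# Toy trial: random vectors stand in for embeddings produced, e.g.,
# by a self-supervised Transformer followed by pooling.
rng = np.random.default_rng(0)
enroll, test = rng.normal(size=256), rng.normal(size=256)
print(cosine_score(enroll, test), verify(enroll, test))
```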

Dr. Oldřich Plchot (Ing. [MS], Brno University of Technology, 2007; Ph.D., Brno University of Technology, 2014) is a senior researcher in the BUT Speech@FIT research group. He worked on the EU-sponsored project MOBIO (7th FP) as well as on several projects sponsored at the local Czech level. He was the technical lead of the US Air Force EOARD-sponsored project "Improving the capacity of language recognition systems to handle rare languages using radio broadcast data", and a key member of personnel in the BEST and RATS Patrol projects sponsored by U.S. IARPA and DARPA, respectively. He participated in several high-profile international research workshops: BOSARIS, held in Brno in 2010 and 2012, and the Johns Hopkins University (MD, USA) summer research workshop in 2013. He contributed significantly to the success of the BUT team in numerous international evaluations organized by NIST (speaker recognition since 2010, language recognition since 2007) as well as in evaluations organized within IARPA and DARPA projects. He is currently serving as a PI on the H2020 project ELOQUENCE, coordinating BUT efforts in NLU, conversational LLMs, and dialogue systems. He has authored or co-authored more than 70 papers, including in IEEE Transactions on Audio, Speech, and Language Processing and at high-profile conferences such as ICASSP and Interspeech. He is the recipient of the 2016 "Josef Hlávka Prize", awarded to the most talented PhD students and young researchers of Czech technical universities.

Day 3 Wed June 11 - morning: Speaker Diarization: From Modular to End-to-End Systems

Federico Landini [Deepgram]

In this lecture, we will cover the basics of speaker diarization, the task of finding "who spoke when" in an audio recording. We will discuss traditional modular systems based on clustering, and then end-to-end systems in which a single neural network processes the audio and directly produces the diarization output.
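To make the modular pipeline concrete, here is a minimal, illustrative sketch (not part of the lecture materials) of its clustering stage: per-segment speaker embeddings grouped by agglomerative hierarchical clustering. It assumes scikit-learn (1.2 or newer for the metric argument), and the embeddings are simulated rather than extracted from audio.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Simulated per-segment embeddings from two synthetic "speakers";
# a real system would extract these from short speech segments.
rng = np.random.default_rng(0)
segments = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 64)),  # speaker A
    rng.normal(loc=1.0, scale=0.1, size=(4, 64)),  # speaker B
])

# Cluster segments by cosine distance; the distance threshold decides
# how many speakers emerge (a key hyperparameter in real systems).
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5,
    metric="cosine", linkage="average",
).fit_predict(segments)

print(labels)  # one speaker label per segment: "who spoke when"
```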

Federico Landini obtained his PhD from Brno University of Technology (BUT), where he was advised by Lukáš Burget and Mireia Diez while working on speaker diarization. He has contributed to both modular diarization systems (e.g., VBx) and end-to-end neural models (e.g., EEND and DiaPer), always open-sourcing recipes and models. During his PhD, he also partnered in R&D collaborations with companies and participated in diarization challenges, including leading the BUT team to successful results in the DIHARD and VoxSRC challenges. He has also done internships at Meta, Facebook, Apple, and Microsoft, and obtained his Computer Science degree from the University of Buenos Aires. He has served as a reviewer for IEEE/ACM TASLP, EURASIP/ISCA Speech Communication, Interspeech, ICASSP, IEEE SLT, IEEE ASRU, and ISCA Odyssey, and won the Best Reviewer Award at ASRU 2023. He is now working at Deepgram as a Research Scientist.

Day 3 Wed June 11 - afternoon: Computer Vision and Handwriting Recognition

Michal Hradiš [Brno University of Technology]

This lecture introduces key tasks in computer vision and provides a high-level overview of neural networks commonly employed to address them, emphasizing model architectures and training methodologies. Special attention will be given to large Transformer-based models and contemporary approaches to their pre-training. Additionally, we will discuss handwritten text recognition systems, exploring parallels and connections with speech recognition techniques.

Michal Hradiš is a senior researcher at the Faculty of Information Technology, Brno University of Technology, specialising in computer vision and machine learning. He is the principal developer of PERO OCR, a printed and handwritten text recognition system used by libraries and archives. Currently, his team focuses on semantic document understanding, information retrieval, and efficient exploration of large databases of historical documents in the projects semANT, Smart Digilinka, and Orbis Pictus. He has experience from several computer vision startups, and he is now developing a medical device that assists gastroenterologists during colonoscopies and other procedures.

Day 4 Thu June 12 - morning: Evolutionary machine learning in engineering design

Lukáš Sekanina [Brno University of Technology]

The use of evolutionary algorithms for the automated design of programs, electronic circuits, neural networks, antennas, and other objects has become a fruitful approach in computer science and engineering in the last decade. The reason is that evolutionary approaches can handle the design process in a holistic, multi-objective way and create solutions with unique properties. With the massive development of AI based on machine learning, evolutionary machine learning has been introduced to enrich machine learning methods with evolutionary computing techniques and vice versa. This tutorial surveys the key ingredients of evolutionary design methods, focusing on genetic programming. Examples of evolved solutions (such as approximate arithmetic circuits, neural network architectures, and image filters) that show unique properties compared to conventional designs will be presented and discussed.
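As a toy illustration of the evolutionary loop underlying these methods (illustrative only; real evolutionary design of circuits or architectures uses richer genotypes, genetic programming operators, and multi-objective fitness), here is a minimal (1+4) evolution strategy on a bit-string genotype:

```python
import random

# Toy fitness: how many bits of the genotype match a target pattern.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

def fitness(genotype):
    return sum(g == t for g, t in zip(genotype, TARGET))

def mutate(genotype, rate=0.1):
    """Flip each bit independently with the given probability."""
    return [1 - g if random.random() < rate else g for g in genotype]

random.seed(0)
parent = [random.randint(0, 1) for _ in TARGET]
for generation in range(100):
    # (1+4) strategy: the best of the parent and 4 mutants survives.
    offspring = [mutate(parent) for _ in range(4)]
    parent = max([parent] + offspring, key=fitness)
    if fitness(parent) == len(TARGET):
        break

print(generation, parent, fitness(parent))
```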

Lukáš Sekanina (Senior Member of IEEE) received all his degrees from the Brno University of Technology, Czech Republic (Ing. in 1999, Ph.D. in 2002), where he is currently a full professor and Head of the Department of Computer Systems. He was awarded the Fulbright scholarship and worked on the evolutionary circuit design with NASA Jet Propulsion Laboratory at Caltech in 2004. He was a visiting lecturer with Pennsylvania State University (2001), Universidad Politécnica de Madrid (2012), and a visiting researcher with the University of Oslo in 2001. He has worked as an associate editor of IEEE Transactions on Evolutionary Computation and an editorial board member of the Genetic Programming and Evolvable Machines Journal. As a PC co-chair, he contributed to organizing conferences such as ICES, EuroGP, DDECS, DTIS, and DATE. He was a principal investigator of eight projects supported by the Czech Science Foundation. Prof. Sekanina co-authored one patent and over 250 research papers, mostly in genetic programming, approximate computing, evolvable hardware, and automated design methods.

Day 4 Thu June 12 - afternoon: Learning for physical interaction: from pixels to machines that see, reason and act

Josef Šivic [Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University and ELLIS Unit Prague]

Large-scale neural networks have enabled major progress in several areas of artificial intelligence, including natural language processing and computer vision, demonstrating remarkable performance on complex tasks such as writing computer programs or creating images. These impressive results are powered by Internet-scale datasets, transformer-based neural architectures, self-supervised learning techniques, and supercomputer infrastructures. However, the progress has been so far limited in areas that require interactions with physical environments where collecting large-scale datasets is challenging and slow. Examples include visually guided robotic manipulation of objects, where data collection is limited by the speed of the robot. In this talk, I will show our recent progress on this hard problem as well as point to interesting results by others. I will also explain some of the key necessary concepts to make the talk accessible to students who have background in machine learning and speech/language but do not have expertise in computer vision or robotics.

Josef Sivic holds a distinguished researcher position at the Czech Institute of Robotics, Informatics and Cybernetics (CIIRC) at the Czech Technical University in Prague, where he heads the Intelligent Machine Perception team and the ELLIS Unit Prague. He received the habilitation degree from Ecole Normale Superieure in Paris in 2014 and PhD from the University of Oxford in 2006. After PhD, he was a post-doctoral associate at the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology and then spent more than 10 years at Inria Paris where he received an ERC Starting Grant. He was awarded the British Machine Vision Association Sullivan Thesis Prize, three test-of-time awards at major computer vision conferences, and, in 2023, an ERC Advanced Grant. From 2019 to 2025 he served on the board of the European Laboratory of Learning and Intelligent Systems (ELLIS).

Day 5 Fri June 13 - morning: Natural Language Processing with Transformer-based Models I.

Ondřej Bojar and Ondřej Dušek [Charles University, Prague]

This tutorial will provide a brief introduction to selected prominent tasks in natural language processing (NLP), with the main focus on the application of generative language models based on the Transformer neural architecture, up to and including large language models (LLMs). We will discuss the basics of Transformer models as well as their issues and common misconceptions, and we will give a brief introduction to LLM training. We will then introduce several NLP tasks and look into ways of applying Transformers and LLMs to them, including machine translation (where Transformers were first introduced), speech translation, data-to-text generation, as well as chatbots/dialogue systems.
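For a flavor of the Transformer-based generation the tutorial builds on, here is a minimal next-token-prediction example. It assumes the Hugging Face transformers library; the model name (gpt2) is merely a small, convenient example, not one used in the tutorial.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers were first introduced for", return_tensors="pt")
# Greedy decoding: repeatedly append the most probable next token.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```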

Ondřej Bojar is an Associate Professor at ÚFAL (Institute of Formal and Applied Linguistics), Charles University, and a lead scientist in machine translation in the Czech Republic. He has been co-organizing the well-known series of shared tasks in machine translation and machine translation evaluation (WMT) since 2013. He has produced many successful works on pre-neural statistical machine translation as well as on newer neural approaches. Apart from text-based translation, he has also been focusing on speech translation and meeting minuting.

Ondřej Dušek is an Assistant Professor at ÚFAL, Charles University, focusing on natural language generation and human-computer dialogue. His research focuses on generative language models, mostly applied to the data-to-text and dialogue response generation tasks. He is specifically interested in semantic accuracy and grounding in language generation, as well as ways of evaluating generation accuracy.

Day 5 Fri June 13 - afternoon: Natural Language Processing with Transformer-based Models II.

Ondřej Bojar and Ondřej Dušek [Charles University, Prague]

Continuation of the morning tutorial.

Day 6 Mon June 16 - morning: Beyond transformers - new sequence processing architectures

Jan Chorowski [Pathway], Marek Adamczyk [Warsaw University], Adrian Łańcucki [NVIDIA]


Jan Chorowski is the CTO at Pathway, where he builds live AI systems underpinned by a proprietary real-time data processing engine and an AI framework. He received his M.Sc. degree in electrical engineering from Wrocław University of Technology and his Ph.D. from the University of Louisville. He has worked at the University of Wrocław and has collaborated with several research teams, including Google Brain, Microsoft Research, and Yoshua Bengio's lab.

Marek Adamczyk is a theoretical computer scientist with a passion for empirical validation of mathematics. His fields of expertise are probability theory, algorithmics, and high-dimensional geometry. He works on algorithms that predict the future: problems that are combinatorial in nature but come from the context of stochastic optimization, online optimization, mechanism design, algorithmic game theory, and core machine learning. Simply put, optimization under uncertainty.

Adrian Łańcucki is a senior engineer at NVIDIA. His research focuses on representation learning and generative modeling for text and speech, as well as improving quality and efficiency at scale. In 2019, Adrian obtained a Ph.D. in machine learning from the University of Wroclaw, Poland. Since then, he has actively collaborated with academia.

Day 6 Mon June 16 - afternoon: Speech Synthesis: From Models to Signals – and Back to… Models

Jindřich Matoušek [Department of Cybernetics & New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia in Pilsen]

A brief history, state-of-the-art, and future directions of speech synthesis

Jindřich Matoušek is an Associate Professor at the Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia (UWB) in Pilsen, Czech Republic. He received his M.Sc. and Ph.D. degrees from UWB and has been active as a researcher at the university since 1999, joining the New Technologies for the Information Society (NTIS) research center in 2012. His primary research and teaching interests are computer speech processing, especially speech synthesis. J. Matoušek is the author or co-author of over 180 scientific publications and has served as Area Chair for “Speech Synthesis and Spoken Language Generation” at Interspeech 2021 and 2022 and Publication Chair of Interspeech 2021. He has extensive experience leading and participating in research projects. In 2020, he was on the team awarded the Czech Technology Agency (TAČR) Award for the project “Elimination of the Language Barriers Faced by the Handicapped Viewers of the Czech Television”. For his leadership of the TAČR project “Automatic voice banking and reconstruction for patients after total laryngectomy”, he received the 2020 City of Pilsen Award. J. Matoušek is also active in the scientific community as a member of the evaluation committee for the Antonín Svoboda Award for the best Czech Ph.D. thesis in cybernetics and informatics.

Day 7 Tue June 17 - morning: Robust speech recognition I.

Lukas Burget [Brno University of Technology], Samuele Cornell [Carnegie Mellon University], Yoshiki Masuyama [Mitsubishi Electric Research Laboratories]

While ASR achieves superhuman performance on clean benchmarks, it struggles in real-world scenarios like meeting transcription, where word error rates exceed 35% versus under 3% on clean data. This lecture examines the challenges of robust ASR for conversational speech, including noise, reverberation, multiple speakers, and overlapped speech (>15% of meeting duration). The lecture covers evaluation methodologies for long-form multi-speaker audio, including concatenated minimum permutation WER (cpWER), and surveys key datasets from AMI to current benchmarks like CHiME-7/8 and NOTSOFAR1. Technical approaches are categorized into front-end methods (speech separation, beamforming, target speaker extraction) and back-end methods (self-supervised features, serialized output training, target-speaker ASR). Robust ASR remains an active research area with significant opportunities, particularly as large language models enable new applications like automated meeting summarization. Key challenges include speaker tracking, training-inference mismatches, and integrating speech separation, diarization, and recognition components.
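To make the cpWER metric concrete: each speaker's reference utterances are concatenated, and the WER is computed under the speaker permutation that minimizes total errors. Below is a minimal, illustrative sketch (brute force over permutations; real evaluations use dedicated scoring tools, and this simplification ignores extra hypothesis speakers):

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance with a rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (r != h))
    return d[-1]

def cpwer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER: errors are counted under
    the best mapping of hypothesis speakers to reference speakers."""
    refs = [r.split() for r in ref_by_speaker]
    hyps = [h.split() for h in hyp_by_speaker]
    hyps += [[]] * (len(refs) - len(hyps))  # pad missing hyp speakers
    total_ref_words = sum(len(r) for r in refs)
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps, len(refs)))
    return best / total_ref_words

# Swapped speaker labels incur no error under the best permutation.
print(cpwer(["hello there", "how are you"],
            ["how are you", "hello there"]))  # -> 0.0
```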

The morning part of the tutorial will include:

  1. Introduction to the problems related to "in-the-wild" conversational speech recognition (Samuele Cornell) (30 min)
    1. Long-form data
    2. Multiple-talkers, overlapped speech
    3. Reverberation and noise
  2. Metrics and datasets (Samuele Cornell) (15 min)
  3. Break (10 mins)
  4. Front-end techniques for robust ASR (Yoshiki Masuyama) (45 min)
    1. Single-channel techniques
      1. Time-frequency mask estimation with neural networks
      2. Permutation-invariant training (PIT) with signal-level metrics (see the sketch after this outline)
    2. Multi-channel techniques
      1. Classical beamforming
      2. Neural beamforming
  5. Break (15 min)
  6. Back-end techniques for robust ASR (Lukas Burget and Samuele Cornell) (45 min)
    1. PIT-based techniques
    2. Serialized output training
    3. Target-speaker ASR
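The permutation-invariant training mentioned in the outline above resolves the arbitrary ordering of separated speakers by scoring every output-to-reference permutation and keeping the best one. A minimal PyTorch sketch with a plain MSE signal-level loss (illustrative; real systems typically use losses such as SI-SDR):

```python
import itertools
import torch

def pit_mse_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """PIT loss over shapes (batch, num_sources, num_samples): try every
    source permutation and keep the lowest per-example MSE."""
    num_src = est.size(1)
    per_perm = []
    for perm in itertools.permutations(range(num_src)):
        diff = est[:, list(perm), :] - ref
        per_perm.append((diff ** 2).mean(dim=(1, 2)))
    # Best permutation per batch element, averaged over the batch.
    return torch.stack(per_perm).min(dim=0).values.mean()

# Toy batch: 2 mixtures, 2 estimated sources, 100 samples each.
loss = pit_mse_loss(torch.randn(2, 2, 100), torch.randn(2, 2, 100))
print(loss.item())
```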

Lukáš Burget is an associate professor at the Faculty of Information Technology, Brno University of Technology (FIT BUT), and the scientific director of the BUT Speech@FIT group. He is recognized internationally as a leading expert in speech processing. He received his Ph.D. from Brno University of Technology in 2004. From 2000 to 2002, he was a visiting researcher at OGI, Portland, USA, and in 2011–2012, he spent a sabbatical at SRI International in Menlo Park, USA. His research interests include various areas of speech processing, such as acoustic modeling for speech, speaker, and language recognition. He has played a leading role in several JSALT workshops, notably leading the 2008 team that introduced i-vectors, the first widely adopted speech embeddings, and contributing in 2009 to the creation of the widely used Kaldi ASR toolkit. In 2006, he co-founded Phonexia, a company that now employs over 50 full-time staff and delivers speech technologies to clients in more than 60 countries.

Samuele Cornell is currently a postdoctoral research associate at Carnegie Mellon University's Language Technologies Institute, in Prof. Shinji Watanabe's research group (WAVLab). He obtained his doctoral degree in Information Engineering at Università Politecnica delle Marche, Italy, in 2023. His research interests are mainly in the area of robust speech processing (speech enhancement, speech separation, diarization, automatic speech recognition) for distant multi-talker conversational scenarios, and in the broader field of machine listening (sound event detection and classification). He is an author of, and significant contributor to, several popular open-source speech-processing toolkits (e.g., SpeechBrain, ESPnet, the Asteroid source-separation toolkit), and has organized or co-organized popular audio processing challenges such as DCASE (Task 4 2021, 2022, 2024), CHiME (CHiME-7/8 DASR lead organizer), and URGENT (2024 and 2025).

Yoshiki Masuyama is a visiting research scientist at Mitsubishi Electric Research Laboratories (MERL) and a collaborator with the Ono Laboratory in the Department of Computer Science, Faculty of Systems Design, Tokyo Metropolitan University. He received his Ph.D. from Tokyo Metropolitan University in 2024. His research interests focus on the integration of signal processing and machine learning technologies for efficient and robust audio processing. He has worked on a wide range of audio signal processing tasks, especially multichannel speech separation, robust automatic speech recognition, and multimodal learning. He is the recipient of the Best Student Paper Award at the IEEE Spoken Language Technology Workshop 2022.

Day 7 Tue June 17 - afternoon: Robust speech recognition II.

Lukas Burget [Brno University of Technology], Alexander Polok [Brno University of Technology]

The afternoon part of the tutorial will include:

  1. Introduction to the diarization-conditioned Whisper (DiCoW) target-speaker ASR method (Lukas Burget) (20 min)
  2. Lab on target speaker ASR (Alexander Polok) (2 h)

Alexander Polok is a Junior Researcher and PhD student at the Faculty of Information Technology, Brno University of Technology (BUT). His research focuses on speech recognition, with an emphasis on practical and efficient methods for applying ASR models in conversational settings. His work has received several honors, including the Brno PhD Talent Scholarship, the Jury Award for CHiME Task 2: NOTSOFAR, recognition at IT SPY 2023 for one of the best diploma theses, and a nomination for the Government Award for Talented Students. He also participated in the JSALT 2023 workshop.

Day 8 Wed June 18 - morning: Speech Processing for Low-Resource Scenarios I.

Salima Mdhaffar [Avignon University]

The morning part of the tutorial will include:

Lecture 1: Speech semantics-based tasks for low-resource scenarios

Lecture 2: TTS for low-resource languages

Dr. Salima Mdhaffar is a full-time senior researcher at Avignon University (LIA). She received a PhD in Computer Science from Le Mans University in 2020. In 2020, as a postdoctoral researcher at LIA, she worked on neural end-to-end systems for named entity recognition, spoken language understanding, and self-supervised learning, as part of the European SELMA H2020 project. She has also worked on federated learning and privacy preservation for ASR systems as part of the ANR (French National Research Agency) DeepPrivacy project. During her postdoc, she spent a few months as a visiting researcher at MILA within the SpeechBrain team. In 2024, she became a full-time senior researcher at LIA. She is involved in many ANR collaborative projects: E-SSL, TRADEF, SpeechPrivacy, and Pantagruel. She also works on industrial projects with Airbus, Elyadata, and Sonos. Her main research interests are automatic speech recognition, semantic information extraction from speech, speech translation, and SSL. She is a (co-)author of more than 35 papers at international conferences. She co-supervises 2 Ph.D. students in the field of low-resource speech processing.

Day 8 Wed June 18 - afternoon: Speech Processing for Low-Resource Scenarios II.

Salima Mdhaffar [Avignon University]

The afternoon part of the tutorial will include:

Lab: Speech Processing for Low-Resource Scenarios with SpeechBrain.
This part of the tutorial will run on Google Colab.
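For a taste of the toolkit (this is not the lab notebook itself), here is a minimal sketch of running a pretrained SpeechBrain recognizer. The model identifier and audio path are examples, and newer SpeechBrain releases expose the same classes under speechbrain.inference:

```python
from speechbrain.pretrained import EncoderDecoderASR

# Download a pretrained model from the Hugging Face hub and cache it.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
# The audio path is a placeholder for a 16 kHz speech recording.
print(asr_model.transcribe_file("example.wav"))
```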

Day 9 Thu June 19 - morning: Synthetic dialogue generation with role-playing LLMs I.

Ricard Marxer [University of Toulon], Sergio Burdisso [Idiap Research Institute], Alessa Carbo [Johns Hopkins University], Antonio Almudevar [University of Zaragoza], Severin Baroudi [University of Toulon], Isabella Gidi [Harvard University]

The tutorial focuses on using role-playing LLMs to generate synthetic dialogues, with a focus on applying mechanistic interpretability techniques for analysis. We will start by introducing basic concepts about how LLMs are used for role playing. The tutorial will then present techniques for developing an actionable understanding of LLMs, such as difference-of-means analysis, sparse autoencoders (SAEs), and the binding framework. Finally, we will explain how role-playing LLMs can be used to perform dialogue generation, from basic setups using a single LLM to a persona-based multi-agent configuration.
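To sketch the persona-based multi-agent configuration mentioned above (a schematic illustration, not the workshop's actual pipeline), two persona-conditioned agents can simply take turns, each seeing the dialogue so far. The chat function below is a stand-in for any chat-style LLM API, and the personas are invented examples:

```python
def chat(persona_prompt: str, history: list[str]) -> str:
    """Stand-in for a chat-style LLM call. A real implementation would
    send the persona prompt plus the dialogue history to a model and
    return its reply; a canned string keeps this sketch runnable."""
    return f"(utterance conditioned on: {persona_prompt[:28]}...)"

# Invented personas; any role-play scenario follows the same pattern.
PERSONAS = {
    "doctor": "You are a calm, methodical physician taking a history.",
    "patient": "You are an anxious patient describing vague symptoms.",
}

def generate_dialogue(num_turns: int = 6) -> list[str]:
    """Alternate between two persona-conditioned agents, each of which
    is prompted with the full conversation generated so far."""
    history: list[str] = []
    speakers = list(PERSONAS)
    for turn in range(num_turns):
        speaker = speakers[turn % 2]
        history.append(f"{speaker}: {chat(PERSONAS[speaker], history)}")
    return history

for line in generate_dialogue():
    print(line)
```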

Ricard Marxer is a Full Professor at the University of Toulon and a researcher at the Laboratoire d’Informatique et Systèmes (LIS), affiliated with the CNRS. He also serves as the founding director of the Erasmus Mundus Joint Master's Degree in Marine and Maritime Intelligent Robotics (MIR) and leads the DYNI (DYNamics of Information) research team. His research interests encompass machine learning, artificial intelligence, unsupervised learning, machine listening, speech and language processing, music technology, bioacoustics, AI and robotics, marine robotics, AI safety, and responsible AI. Marxer has contributed to numerous publications in these fields and collaborates extensively with researchers in speech, music, and marine bioacoustics. His recent work includes developing systems for underwater audio processing and marine mammal monitoring, as well as applications of deep learning to hearing aid technology and speech enhancement.

Sergio Burdisso is a postdoctoral researcher at the Idiap Research Institute in Martigny, Switzerland, specializing in artificial intelligence and machine learning. His research interests encompass reinforcement learning, natural language processing, and multimodal systems, with applications in areas such as depression detection, media bias analysis, and fake news detection. Burdisso has contributed to numerous publications in top-tier conferences and journals, collaborating with international experts in the field. His work reflects a commitment to advancing AI technologies for societal impact, particularly in healthcare and media literacy.

Antonio Almudevar is currently pursuing a PhD at the University of Zaragoza. Previously, he earned a Bachelor's degree in Telecommunication Engineering and a Master's degree in Modeling, Mathematical Research, Statistics, and Computing from the University of Zaragoza. His primary research focus is representation learning, specifically leveraging information theory and variational inference methods to design and analyze the properties of representation spaces in deep learning models.

Alessa Carbo is a rising junior at Johns Hopkins University majoring in computer science, with research interests in AI alignment and safety, mechanistic interpretability, NLP, and the theoretical foundations of learning and generalization in deep neural networks.

Severin Baroudi is a PhD student at the University of Toulon, where he works on self-supervised models to improve speaker diarization and source separation. He holds a Master's degree in machine learning applied to audio and image signals from Paul Sabatier University in Toulouse. Outside of work, he likes pop culture (movies in particular), working out at the gym, running, and traveling.

Isabella Gidi is a rising junior at Harvard University studying Applied Math with a minor in Mind, Brain, and Behavior. She is interested in interpretability, natural language processing, and semantics.

Day 9 Thu June 19 - afternoon: Synthetic dialogue generation with role-playing LLMs II.

Ricard Marxer [University of Toulon], Sergio Burdisso [Idiap Research Institute], Alessa Carbo [Johns Hopkins University], Antonio Almudevar [University of Zaragoza], Severin Baroudi [University of Toulon], Isabella Gidi [Harvard University]

Continuation of the morning tutorial.

Day 10 Fri June 20 - morning: Introduction to Multimodal Large Language Models I.

Ramani Duraiswami [University of Maryland], Santosh Kesiraju [Brno University of Technology] and Alicia Lozano-Diez [Universidad Autónoma de Madrid]

Introduction to Multimodal Large Language Models – Ramani Duraiswami

  1. LLM Introduction and Basics
    1. Next Token Prediction
    2. Transformers, Conformer
    3. Training and inference
    4. SFT vs LoRA finetuning
  2. CLIP, CLAP
    1. Images, Audio
  3. Fine Tuning vs Full
  4. Loss functions - Contrastive, Cross-Attention (see the sketch below)
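As an illustration of the contrastive objective listed in item 4 (a minimal sketch, not the tutorial's code), here is the CLIP/CLAP-style symmetric cross-entropy loss over a batch of paired embeddings; the dimensions and temperature are arbitrary examples:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss: matched pairs (a_i, b_i) are the
    positives; all other in-batch pairs serve as negatives."""
    emb_a = F.normalize(emb_a, dim=-1)  # e.g., image or audio embeddings
    emb_b = F.normalize(emb_b, dim=-1)  # e.g., text embeddings
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(emb_a.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: 8 paired embeddings of dimension 512.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```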

Details of Training Audio Language Models – Santosh Kesiraju and Alicia Lozano-Diez

  1. Audio Encoder - CLAP, AST, Whisper
  2. Large Audio LLMs - Audio Flamingo-2, Flamingo-3
  3. Datasets and synthetic augmentation for audio
  4. Creating AQA benchmarks
  5. Future work: Audio Generation; Audio-to-Audio

Ramani Duraiswami is Professor and Associate Chair (for Graduate Studies) at the Department of Computer Science, and in UMIACS, at the University of Maryland. Prof. Duraiswami received his B.Tech. at IIT Bombay and his Ph.D. at The Johns Hopkins University. After spending a few years working in industry, he joined the University of Maryland, where he established the Perceptual Interfaces and Reality Lab. He has broad research interests, including both algorithm development (for machine learning, statistics, wave propagation and scattering, the fast multipole method) and systems development/applications (spatial audio capture, rendering, and personalization; computer vision; acoustics). He has published over 280 peer-reviewed archival papers, co-authored a book, holds several issued patents, and according to Google Scholar has an h-index of 64 (in 2023). Some of his research has been spun out into a startup, VisiSonics, whose technology is in millions of devices. A particular theme of Prof. Duraiswami's recent research has been combining machine learning with scientific simulation and understanding the interaction of waves (electromagnetic, acoustic, and visual) with objects. Prof. Duraiswami has affiliate appointments in UMIACS, ECE, AMSC, MRC, and NACS.

Santosh Kesiraju received his PhD from the International Institute of Information Technology, Hyderabad, India. He is currently a senior researcher in the Speech@FIT group at Brno University of Technology. His research interests lie at the intersection of speech and language technologies, mainly speech recognition, speech-to-text translation, and spoken dialogue systems. He collaborates with universities in India, Europe, and the USA. He has supervised 4 Bachelor's and 3 Master's theses and is currently co-supervising 3 PhD students.

Alicia Lozano-Diez received a double degree in Computer Science Engineering and Mathematics from Universidad Autónoma de Madrid (UAM), Spain, in 2012, and the postgraduate Master in Research and Innovation in Information and Communications Technologies (I2-TIC) from the same university in 2013. Since 2012, she has been with the Audias research group at UAM. During her Ph.D., in 2015 and 2017, she joined the Speech group (Speech@FIT) at Brno University of Technology (BUT, Brno, Czech Republic) for research internships of four and two months, respectively. In the summer of 2016, she interned at SRI International (STAR Lab, California, USA). Her research mainly focuses on deep neural network (DNN) based systems for automatic language and speaker recognition. She finished her Ph.D. in 2018 and took an assistant professor position at UAM, continuing her research in the Audias group. In 2019, she received H2020 Marie Curie funding for the project "Robust End-To-End SPEAKER recognition based on deep learning and attention models" and joined Speech@FIT (BUT) for almost two years as a post-doc researcher.

Day 10 Fri June 20 - afternoon: Introduction to Multimodal Large Language Models II.

Ramani Duraiswami [University of Maryland], Santosh Kesiraju [Brno University of Technology] and Alicia Lozano-Diez [Universidad Autónoma de Madrid]

Laboratory (Python notebook exercises)

  1. Simple examples
  2. Trying out AF2 and AF3
  3. Trying out MMAU-Pro
  4. Preparing AQA data
  5. Simple training

Social events during summer school

The program will be published here.

You might also be interested in previous summer school lectures.