Summer workshops will take place at the Faculty of Information Technology, June 23rd - August 1st, 2025.
Three to four research topics will be selected. A research team will be formed around each topic.
Large Language Models (LLMs) - neural networks trained as auto-regressive generative models on web-scale text datasets - can be prompted to perform various tasks, including dialogue, enabling natural, human-like interaction. To facilitate interaction with LLMs and prevent harmful behavior, complex prompts are crafted to shape the persona of the simulated character. This topic aims to address the issue of consistency and controllability in LLM agents within the challenging context of long-form interactions. We propose a dual-pronged approach. Firstly, we will explore metrics to identify and quantify deviations from desired behavior, along with the necessary evaluation sets to measure these metrics effectively. Secondly, we will delve into mitigating such deviations through the development of improved control techniques. Our methods will be based on gaining a deeper understanding of the mechanisms underlying role-playing and jailbreaking through modern mechanistic interpretability techniques, and the analysis of interaction dynamics using a model-based approach. Two applications involving long-form interaction and of significant practical relevance - multi-turn task-oriented dialogues and the simulation of doctor-patient interactions with diverse personas - will inform the design of our methods and serve as testbeds for their evaluation.
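As a concrete illustration of the first prong, one simple (and by no means the only) way to quantify persona drift is to embed the persona description and each assistant turn and track their similarity across the conversation. The sketch below uses sentence-transformers; the model name, example turns, and threshold are illustrative assumptions, not the workshop's chosen metric.

```python
# Minimal sketch of one possible persona-consistency metric: embed the persona
# description and each assistant turn, then track cosine similarity over the
# conversation. Model name, example data, and threshold are illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

persona = "You are a calm, formal hotel receptionist who never gives medical advice."
assistant_turns = [
    "Good evening, how may I assist you with your reservation?",
    "Certainly, I will arrange a late check-out for you.",
    "Dude, just take some painkillers, you'll be fine!",  # drifted turn
]

persona_emb = encoder.encode(persona, convert_to_tensor=True)
turn_embs = encoder.encode(assistant_turns, convert_to_tensor=True)

# Cosine similarity between each turn and the persona description.
similarities = util.cos_sim(turn_embs, persona_emb).squeeze(1)

for turn, score in zip(assistant_turns, similarities):
    flag = "DRIFT?" if score < 0.2 else "ok"   # 0.2 is an arbitrary cut-off
    print(f"{score:.2f} [{flag}] {turn}")
```

A real evaluation set would of course pair such embedding-based signals with human judgments and task-specific checks; this only shows the shape of a turn-level deviation score.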
To exhibit intelligence in the physical world, both AI agents and humans must comprehend and then reason about sound (including speech, non-speech sounds, and music). However, research on complex reasoning with audio has lagged behind modalities such as language and vision. This discrepancy is due to several challenges: the limited capabilities of current audio-understanding algorithms, the scarcity of large-scale training datasets, the limitations of existing architectures, and the lack of comprehensive benchmarks for assessing advanced audio processing capabilities. The recent open-source MMAU benchmark has revealed that even state-of-the-art Large Audio Language Models (LALMs), including proprietary ones, achieve only 53% accuracy on complex audio reasoning tasks. This deficiency represents a crucial bottleneck in the development of multimodal AI systems and the progression toward AGI. We are embarking on an intensive project to address these critical limitations in foundational LALMs. Our workshop is focused on advancing expert-level understanding and complex reasoning in audio-language models. The team, drawn from universities and industry across the US, Europe, and Asia, and comprising students and senior professionals from various disciplines, will allow us to achieve these goals.
Our aim is to advance robust speech processing for everyday conversational scenarios, addressing some limitations of current state-of-the-art approaches. Current speech foundation models such as Whisper cannot natively handle multi-talker, multi-channel conversational speech. They must instead be integrated into a complex pipeline that combines independently trained subsystems for diarization, source separation, and automatic speech recognition (ASR), which suffers from error propagation. This project pursues two complementary directions: 1) developing a modular multi-channel ASR system as a streamlined pipeline of existing pre-trained components, including diarization and target-speaker ASR, fine-tuned jointly to avoid error propagation; and 2) building a novel, more computationally efficient “Whisper-style” foundation model for joint diarization and ASR with extended context handling. Key research questions include the feasibility of fully end-to-end meeting transcription, how to effectively handle multi-channel data with single-channel pre-trained models, and the differentiable integration of components, particularly diarization and ASR.
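For intuition, the sketch below shows the kind of naive cascade that the first direction starts from: an off-the-shelf diarization model followed by per-segment ASR, glued together without any joint fine-tuning. It assumes pyannote.audio and openai-whisper are installed; the model names are illustrative, the pyannote model may require a Hugging Face access token, and this is not the workshop's final system.

```python
# Minimal sketch of a naive diarization + ASR cascade on a single-channel file.
# Model names are illustrative; the workshop would fine-tune such components
# jointly rather than chain them like this.
import torchaudio
import whisper                       # openai-whisper
from pyannote.audio import Pipeline  # may require a Hugging Face access token

AUDIO = "meeting.wav"                # hypothetical recording

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
asr = whisper.load_model("small")

waveform, sr = torchaudio.load(AUDIO)
if sr != 16000:                      # Whisper expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    sr = 16000

diarization = diarizer(AUDIO)

# Transcribe each speaker turn independently; a wrong diarization boundary
# directly corrupts the ASR input (the error propagation discussed above).
for segment, _, speaker in diarization.itertracks(yield_label=True):
    start, end = int(segment.start * sr), int(segment.end * sr)
    chunk = waveform[0, start:end].numpy()
    text = asr.transcribe(chunk, fp16=False)["text"].strip()
    print(f"[{segment.start:7.1f}s-{segment.end:7.1f}s] {speaker}: {text}")
```

The project's first direction replaces this loose chain with a pipeline whose components are fine-tuned together, so diarization errors can be compensated downstream rather than propagated.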
The project aims to effectively train and evaluate TTS systems in situations of scarce training data and complex linguistic contexts. We aim to set up effective data collection, preparation, and evaluation protocols adapted to these situations. We will also explore effective strategies for training TTS models for spoken languages without a written form or for dialects without standardized writing systems. In addition, we will address the use of Self-Supervised Learning (SSL) for building TTS and investigate SSL layers to find where linguistic content and emotion are encoded. Furthermore, we will draw on our multidisciplinary and highly skilled team to build TTS for additional applications, including speech pseudonymization and streaming TTS. Speech pseudonymization is an area with few existing resources and prior studies. It involves altering the linguistic content of recorded natural speech to protect the speaker’s identity while maintaining the intelligibility of the utterance. This could be particularly useful in scenarios where privacy is a concern, such as in legal or child protection contexts. Streaming TTS is also an emerging topic: it allows speech to be generated as symbolic inputs (text or discrete tokens) are provided. This could be particularly useful for integrating TTS with the output of a textual Large Language Model (LLM) or for simultaneous speech translation. Streaming TTS could enable real-time applications where immediate feedback is required, such as conversational agents or live broadcasting.
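To make the SSL layer analysis concrete, the sketch below extracts per-layer hidden states from a wav2vec2 model with Hugging Face transformers. The model name and probing target are illustrative assumptions; in practice, a small probe trained on each layer's representations would indicate where linguistic content and emotion are encoded.

```python
# Minimal sketch of layer-wise analysis of an SSL speech model: extract hidden
# states from every transformer layer, then (in the real study) train a simple
# probe per layer to see where phones or emotion are best decoded.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"          # illustrative choice of SSL model
extractor = AutoFeatureExtractor.from_pretrained(model_name)
ssl_model = Wav2Vec2Model.from_pretrained(model_name, output_hidden_states=True)
ssl_model.eval()

# One second of dummy 16 kHz audio stands in for a real utterance.
dummy_audio = torch.zeros(16000)
inputs = extractor(dummy_audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = ssl_model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each (1, frames, dim).
for layer_idx, hidden in enumerate(outputs.hidden_states):
    pooled = hidden.mean(dim=1)                # utterance-level representation
    # In the actual analysis, `pooled` (or frame-level features) would feed a
    # small probe (e.g. logistic regression) predicting phones or emotion, and
    # per-layer probe accuracy reveals where each property is encoded.
    print(f"layer {layer_idx:2d}: pooled shape {tuple(pooled.shape)}")
```

Such per-layer probing results would then guide which SSL representations to feed into the TTS systems described above.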
Interested in getting involved? Joining one of the research teams requires a personal invitation from the group leader and negotiating practical terms with the organizers. Please contact us.
Research teams will work on selected topics.
Program/presentations will be published here.
Expected once a week. Program/presentations will be published here.
Program will be published here.
Program/presentations will be published here.
You might also be interested in previous workshops.