To exhibit intelligence in the physical world, both AI agents and humans must comprehend and then reason about sound, including speech, non-speech sounds, and music. However, research on complex reasoning with audio has lagged behind modalities such as language and vision. This gap stems from several challenges: the limited capabilities of current algorithms for audio understanding, the scarcity of large-scale training datasets, the limitations of existing model architectures, and the lack of comprehensive benchmarks for assessing advanced audio processing capabilities. The recent open-source MMAU benchmark has revealed that even state-of-the-art Large Audio Language Models (LALMs), including proprietary ones, achieve only 53% accuracy on complex audio reasoning tasks. This deficiency represents a crucial bottleneck in the development of multimodal AI systems and in the progression toward AGI. We are embarking on an intensive project to address these critical limitations in foundational LALMs. Our workshop focuses on advancing expert-level understanding and complex reasoning in audio-language models. Our team, which brings together students and senior professionals from multiple disciplines, drawn from several universities and industry partners across the US, Europe, and Asia, positions us to achieve these goals.