This repository contains the code and model checkpoints for "DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models". It is a fork of the SpatialAST Repository.
To set up the environment, run the following commands:
conda env create -f environment.yml
bash timm_patch/patch.sh
DSpAST is trained and evaluated on the same binaural dataset, based on Audioset and SoundSpaces 2.0, that was used to train SpatialAST, the binaural encoder of BAT. More details, including instructions on how to download the dataset, can be found in the SpatialAST Repository.
DSpAST is trained in three stages. For each stage, we provide a training script and a trained checkpoint:
| Stage | Audioset split | Epochs | Loss Weights (SED,DP,DOAE) | Training Script | Checkpoint |
|---|---|---|---|---|---|
| 1 | unbalanced (10% per epoch) | 100 | (1,0,0) | scripts/finetune-stage1.sh | |
| 2 | unbalanced (1% per epoch) | 50 | (100,2,1) | scripts/finetune-stage2.sh | |
| 3 | balanced (100% per epoch) | 50 | (100,2,1) | scripts/finetune-stage3.sh | |
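The stage settings in the table can be read as a small configuration. The following is a minimal sketch, not the code used by the training scripts. It assumes that the three task losses (sound event detection, distance prediction, and direction-of-arrival estimation) are combined as a plain weighted sum, and that "x% per epoch" means a random subset of that size is drawn each epoch. The names `STAGES`, `combined_loss`, and `sample_epoch_indices` are hypothetical and do not appear in this repository.

```python
import random

# Stage settings copied from the table; the dictionary layout is illustrative only.
STAGES = {
    1: {"fraction": 0.10, "epochs": 100, "weights": (1.0, 0.0, 0.0)},
    2: {"fraction": 0.01, "epochs": 50, "weights": (100.0, 2.0, 1.0)},
    3: {"fraction": 1.00, "epochs": 50, "weights": (100.0, 2.0, 1.0)},
}


def combined_loss(sed, dp, doae, weights):
    """Weighted sum of the three task losses (assumed form of the objective)."""
    w_sed, w_dp, w_doae = weights
    return w_sed * sed + w_dp * dp + w_doae * doae


def sample_epoch_indices(num_clips, fraction, rng):
    """Draw a random subset of clip indices for one epoch (assumed sampling scheme)."""
    k = max(1, round(num_clips * fraction))
    return rng.sample(range(num_clips), k)


if __name__ == "__main__":
    rng = random.Random(0)
    for stage, cfg in STAGES.items():
        # Dummy per-task loss values, only to show how the weights act.
        loss = combined_loss(0.5, 0.2, 0.3, cfg["weights"])
        subset = sample_epoch_indices(1000, cfg["fraction"], rng)
        print(f"stage {stage}: loss={loss:.2f}, clips per epoch={len(subset)}")
```

Under this reading, stage 1 optimizes only the SED term, while stages 2 and 3 add the two spatial terms with smaller weights.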
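For inference, use the script scripts/inf.sh. On our system, the provided checkpoints achieve the following results. Note that stage 1 is trained with loss weights (1,0,0), i.e. with the SED loss only, so its spatial metrics (ER20°, MAE, DER) are not expected to be meaningful: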
| Binaural Encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
|---|---|---|---|---|
| SpatialAST | 49.90 | 24.43 | 17.87 | 32.50 |
| DSpAST (stage 1) | 53.05 | 98.56 | 95.57 | 97.58 |
| DSpAST (stage 2) | 52.64 | 20.31 | 14.44 | 28.35 |
| DSpAST (stage 3) | 54.53 | 20.28 | 14.44 | 28.03 |
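To make the spatial columns easier to interpret, here is a minimal sketch of how such metrics could be computed. It is not the evaluation code behind scripts/inf.sh. It assumes that ER20° is the percentage of predictions whose angular error exceeds 20°, that MAE is the mean angular error in degrees, and that DER is the percentage of distance predictions that are off by more than a threshold. The 0.5 m threshold and the function names `angular_error_deg` and `localization_metrics` are assumptions made for illustration.

```python
import math


def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions (azimuth, elevation in degrees)."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def localization_metrics(pred_dirs, true_dirs, pred_dist, true_dist,
                         angle_thresh=20.0, dist_thresh=0.5):
    """Return (ER20, MAE, DER) under the assumed definitions; ER20 and DER are percentages."""
    errors = [angular_error_deg(*p, *t) for p, t in zip(pred_dirs, true_dirs)]
    er20 = 100.0 * sum(e > angle_thresh for e in errors) / len(errors)
    mae = sum(errors) / len(errors)
    dist_wrong = sum(abs(p - t) > dist_thresh for p, t in zip(pred_dist, true_dist))
    der = 100.0 * dist_wrong / len(true_dist)
    return er20, mae, der


if __name__ == "__main__":
    # Two toy predictions: one off by 10 degrees, one off by 90 degrees.
    pred_dirs = [(0.0, 0.0), (0.0, 0.0)]
    true_dirs = [(10.0, 0.0), (90.0, 0.0)]
    print(localization_metrics(pred_dirs, true_dirs, [1.0, 2.0], [1.2, 3.0]))
```

Lower values are better for all three quantities, matching the (↓) markers in the table.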
Similar performance improvements can also be observed when DSpAST is used as the binaural encoder for spatial audio reasoning with LLMs. See our paper for further information.
If you use any part of this code for your work, we kindly ask you to cite the following papers:
@article{wilkinghoff2025dspast,
author = {Wilkinghoff, Kevin and
Tan, Zheng-Hua},
title = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
journal = {arXiv:2509.13927},
year = {2025}
}
and the original BAT paper, which is the foundation of this work:
@inproceedings{zheng2024bat,
author = {Zheng, Zhisheng and
Peng, Puyuan and
Ma, Ziyang and
Chen, Xie and
Choi, Eunsol and
Harwath, David},
title = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
booktitle = {Proc. ICML},
year = {2024}
}
