Code for the paper "DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models"

zhenghuatan/DSpAST

DSpAST: Disentangled Spatial Audio Spectrogram Transformer

This repository contains the code and model checkpoints for the paper "DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models". It is a fork of the SpatialAST repository.


Installation

To set up the environment, run the following commands:

conda env create -f environment.yml
bash timm_patch/patch.sh
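
After the two commands have finished, a quick import check can confirm that the environment is usable. The following is a minimal sketch, not part of the repository: it assumes that `torch` and `timm` are among the packages installed by `environment.yml` (the `timm_patch` directory suggests `timm`, while `torch` is a guess), and the helper name `check_packages` is hypothetical.

```python
import importlib.util


def check_packages(names=("torch", "timm")):
    """Return a mapping from package name to whether it can be imported.

    The default package names are assumptions; adjust them to match
    the contents of environment.yml.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Run this inside the activated conda environment.
    for name, found in check_packages().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it inside the activated conda environment. A missing package points to a problem with the environment creation step rather than with the patch script.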

Dataset

DSpAST is trained and evaluated on the same binaural dataset, based on AudioSet and SoundSpaces 2.0, that was used to train SpatialAST, the binaural encoder of BAT. Details, including instructions for downloading the dataset, can be found in the SpatialAST repository.


Training

DSpAST is trained in three stages. For each stage, we provide a training script and a trained checkpoint:

| Stage | AudioSet split | Epochs | Loss weights (SED, DP, DOAE) | Training script | Checkpoint |
|---|---|---|---|---|---|
| 1 | unbalanced (10% per epoch) | 100 | (1, 0, 0) | scripts/finetune-stage1.sh | |
| 2 | unbalanced (1% per epoch) | 50 | (100, 2, 1) | scripts/finetune-stage2.sh | |
| 3 | balanced (100% per epoch) | 50 | (100, 2, 1) | scripts/finetune-stage3.sh | |
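
To illustrate how the loss weights in the table might enter training, here is a minimal sketch. It assumes that the three task losses are combined as a plain weighted sum, and that SED, DP and DOAE stand for sound event detection, distance prediction and direction-of-arrival estimation. Neither assumption is stated in this README, so treat the training scripts as the authoritative source. The names `STAGE_LOSS_WEIGHTS` and `combined_loss` are hypothetical.

```python
# Loss weights per training stage, copied from the table above: (SED, DP, DOAE).
STAGE_LOSS_WEIGHTS = {
    1: (1.0, 0.0, 0.0),
    2: (100.0, 2.0, 1.0),
    3: (100.0, 2.0, 1.0),
}


def combined_loss(stage, sed_loss, dp_loss, doae_loss):
    """Weighted sum of the three task losses (assumed combination rule)."""
    w_sed, w_dp, w_doae = STAGE_LOSS_WEIGHTS[stage]
    return w_sed * sed_loss + w_dp * dp_loss + w_doae * doae_loss


if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    for stage in (1, 2, 3):
        total = combined_loss(stage, sed_loss=0.5, dp_loss=2.0, doae_loss=3.0)
        print(f"stage {stage}: total loss = {total}")
```

Under this reading, stage 1 optimizes only the SED term, because the DP and DOAE weights are zero, while stages 2 and 3 train all three tasks jointly.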

Inference

For inference, use the script scripts/inf.sh. On our system, the provided checkpoints achieve the following results:

| Binaural encoder | mAP (↑) | ER20° (↓) | MAE (↓) | DER (↓) |
|---|---|---|---|---|
| SpatialAST | 49.90 | 24.43 | 17.87 | 32.50 |
| DSpAST (stage 1) | 53.05 | 98.56 | 95.57 | 97.58 |
| DSpAST (stage 2) | 52.64 | 20.31 | 14.44 | 28.35 |
| DSpAST (stage 3) | 54.53 | 20.28 | 14.44 | 28.03 |
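
The differences to the SpatialAST baseline can be read off the table with a few lines of Python. The sketch below only restates the reported numbers and subtracts them. The dictionary layout and the helper name `delta_vs_baseline` are illustrative and not part of the repository.

```python
# Results copied from the table above. Higher is better for mAP;
# lower is better for ER20, MAE and DER.
RESULTS = {
    "SpatialAST":       {"mAP": 49.90, "ER20": 24.43, "MAE": 17.87, "DER": 32.50},
    "DSpAST (stage 1)": {"mAP": 53.05, "ER20": 98.56, "MAE": 95.57, "DER": 97.58},
    "DSpAST (stage 2)": {"mAP": 52.64, "ER20": 20.31, "MAE": 14.44, "DER": 28.35},
    "DSpAST (stage 3)": {"mAP": 54.53, "ER20": 20.28, "MAE": 14.44, "DER": 28.03},
}


def delta_vs_baseline(encoder, baseline="SpatialAST"):
    """Absolute difference of each metric relative to the baseline encoder."""
    return {
        metric: round(RESULTS[encoder][metric] - RESULTS[baseline][metric], 2)
        for metric in RESULTS[encoder]
    }


if __name__ == "__main__":
    for name in RESULTS:
        if name != "SpatialAST":
            print(name, delta_vs_baseline(name))
```

The stage 3 checkpoint improves on SpatialAST in all four metrics. The stage 1 row has high ER20°, MAE and DER values, which is consistent with its loss weights of (1, 0, 0): if those weights mean that only the SED loss is active, the spatial outputs are not trained at that stage.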

Similar performance improvements can also be observed when using DSpAST as a binaural encoder for spatial audio reasoning with LLMs. Please have a look at our paper for further information.


References

If you use any part of this code for your work, we kindly ask you to cite the following papers:

@article{wilkinghoff2025dspast,
    author     = {Wilkinghoff, Kevin and
                  Tan, Zheng-Hua},
    title      = {{DSpAST:} Disentangled Representations for Spatial Audio Reasoning with Large Language Models},
    journal    = {arXiv:2509.13927},
    year       = {2025}
}

and the original BAT paper, which is the foundation of this work:

@inproceedings{zheng2024bat,
  author       = {Zheng, Zhisheng and
                  Peng, Puyuan and
                  Ma, Ziyang and
                  Chen, Xie and
                  Choi, Eunsol and
                  Harwath, David},
  title        = {{BAT:} Learning to Reason about Spatial Sounds with Large Language Models},
  booktitle    = {Proc. ICML},
  year         = {2024}
}
