AudioLLMs/AudioBench


🔥 AudioBench 🔥

⚡ A repository for evaluating AudioLLMs in various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡
🌟 View our live leaderboard on Hugging Face Spaces 🌟

🏠 AudioBench Leaderboard | 🤗 Hugging Face Datasets | 🤗 AudioLLM Paper Collection

AudioBench is a universal benchmark for evaluating audio large language models (AudioLLMs) on speech, audio-scene, and voice understanding tasks across 50+ datasets. New to the codebase? See ARCHITECTURE.md for how the pieces fit together.

🔧 Installation

Installation with pip:

pip install -r requirements.txt

⏩ Quick Start

For model-as-judge evaluation, the judge model is served with vLLM on port 5000.

The example below hosts Llama-3-70B-Instruct as the judge and evaluates Qwen2-Audio-7B-Instruct.

# Step 1:
# Serve the judge model with vLLM (this example uses an int4-quantized version)
# This requires one 80 GB GPU
bash vllm_model_judge_llama_3_70b.sh

# Step 2:
# Run model inference and compute the evaluation results on the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 evaluates all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
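
Step 2 only works once the judge service from Step 1 is reachable. As a minimal sketch, assuming the launch script starts vLLM's OpenAI-compatible server (which exposes `/v1/models`), a small polling helper can confirm the service is up before the evaluation starts. `wait_for_judge` and `JUDGE_URL` are hypothetical names introduced for this example; they are not part of AudioBench.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of AudioBench.
# Assumption: the judge is served by vLLM's OpenAI-compatible server on
# port 5000, so GET /v1/models succeeds once the model has loaded.
JUDGE_URL=${JUDGE_URL:-http://localhost:5000/v1/models}

# wait_for_judge URL RETRIES DELAY_SECONDS
# Returns 0 as soon as URL answers with an HTTP success status,
# or 1 after RETRIES failed attempts.
wait_for_judge() {
  wfj_url=$1
  wfj_retries=$2
  wfj_delay=$3
  wfj_attempt=0
  while [ "$wfj_attempt" -lt "$wfj_retries" ]; do
    # -s: silent, -f: treat HTTP errors as failure, --max-time: cap each try
    if curl -sf --max-time 5 -o /dev/null "$wfj_url"; then
      return 0
    fi
    wfj_attempt=$((wfj_attempt + 1))
    sleep "$wfj_delay"
  done
  return 1
}

# Example usage (commented out so that sourcing this file has no side effects):
# wait_for_judge "$JUDGE_URL" 30 10 || { echo "judge not reachable" >&2; exit 1; }
```

Loading a 70B model can take several minutes, so a generous retry count and delay are usually safer than a single check.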

To evaluate on a different dataset, replace the DATASET and METRICS values (see the full list below):

DATASET=librispeech_test_clean
METRICS=wer
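
Because only the dataset and metric change between runs, several evaluations can be queued in one loop. The sketch below reuses the two dataset/metric pairs shown above and the same positional arguments as the Quick Start call to eval.sh. `run_eval` and `DRY_RUN` are hypothetical names introduced here, not part of AudioBench; in the default dry-run mode the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical sweep script, not part of AudioBench.
MODEL_NAME=Qwen2-Audio-7B-Instruct
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1        # -1 evaluates all test samples
DRY_RUN=${DRY_RUN:-1}       # assumption: 1 = print commands, 0 = execute them

# run_eval DATASET METRIC
# Prints (dry run) or executes one eval.sh call with the shared settings.
run_eval() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "bash eval.sh $1 $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $2 $NUMBER_OF_SAMPLES"
  else
    # stdin is redirected so eval.sh cannot consume the loop's input below
    bash eval.sh "$1" "$MODEL_NAME" "$GPU" "$BATCH_SIZE" "$OVERWRITE" "$2" "$NUMBER_OF_SAMPLES" < /dev/null
  fi
}

# One "dataset metric" pair per line.
while read -r dataset metric; do
  run_eval "$dataset" "$metric"
done <<'EOF'
cn_college_listen_mcq_test llama3_70b_judge
librispeech_test_clean wer
EOF
```

Run it with `DRY_RUN=0` once the printed commands look right. Datasets scored with the judge metric still need the judge service running.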

📊 Supported Datasets

AudioBench supports 50+ datasets. Full names, metrics, and usage are in examples/supported_datasets.md.

🤖 Supported Models

See examples/adding_new_model.md for setup details.

➕ Add Your Own Dataset / Model

Your own dataset (two steps):

  1. Make a copy of one of the customized dataset loaders. Example: cn_college_listen_mcq_test. Customize it for your own dataset.
  2. Add a new entry in dataset.py.

Your own model: as long as the model can run inference, you can load it and generate responses. See examples/adding_new_model.md.

For an overview of how datasets, models, and metrics fit together, see ARCHITECTURE.md.

🏆 Leaderboard & Users

🌟 View the live AudioBench Leaderboard on Hugging Face Spaces

To submit your model to the leaderboard, email: bwang28c@gmail.com

Researchers, companies or groups that are using AudioBench:

📝 Change Log

  • Mar 2025: Support the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese and Indonesian).
  • Mar 2025: Support the MMAU test set: multiple-choice questions for speech, audio and music understanding.
  • Mar 2025: AudioBench now supports over 50 datasets.
  • Mar 2025: Support the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
  • Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
  • Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; results are updated on the leaderboard.
  • Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
  • Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
  • Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
  • Aug 2024: The leaderboard is live.
  • Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
  • Jul 2024: Support all 26 initial datasets listed in the AudioBench manuscript.

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

✅ To-Do List

  • Features
    • Evaluation with audio/speech generation
    • Evaluation with multi-round chatbots
    • Support other judge models and report the results
    • Update AI-SHELL from WER to CER
  • Bugs
    • Threads of model-as-judge
    • Post-processing script for IMDA PART4 which contains code-switching in 4 languages.

🙌 Contributors

  • Xue Cong Tey (MMAU-mini Dataset)
