🔥 AudioBench 🔥

⚡ A repository for evaluating AudioLLMs on various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡
🌟 View our live leaderboard on Hugging Face Spaces 🌟

🏠 AudioBench Leaderboard | 🤗 Huggingface Datasets | 🤗 AudioLLM Paper Collection

AudioBench is a universal benchmark for evaluating audio large language models (AudioLLMs) on speech, audio-scene, and voice understanding tasks across 50+ datasets. New to the codebase? See ARCHITECTURE.md for how the pieces fit together.

🔧 Installation

Installation with pip:

pip install -r requirements.txt

⏩ Quick Start

For model-as-judge evaluation, we serve the judge model as a service via vLLM on port 5000.

The example below hosts a Llama-3-70B-Instruct judge model and evaluates the Qwen2-Audio-7B-Instruct model.

# Step 1:
# Serve the judge model using the vLLM framework (this example uses an int4-quantized version)
# This requires one 80GB GPU
bash vllm_model_judge_llama_3_70b.sh

# Step 2:
# Run model inference and compute the evaluation results on a second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 means use all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
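
If you prefer to launch evaluations from Python, the positional argument order of eval.sh shown above can be wrapped in a small helper. This is a minimal sketch: `build_eval_command` is a hypothetical name, not part of AudioBench, and it only assumes the argument order in the command above.

```python
import shlex


def build_eval_command(dataset, model_name, gpu, batch_size, overwrite,
                       metrics, number_of_samples=-1):
    """Return the argv list for eval.sh in the positional order shown above.

    Hypothetical helper for illustration; it is not part of AudioBench.
    """
    return [
        "bash", "eval.sh",
        dataset,
        model_name,
        str(gpu),
        str(batch_size),
        str(overwrite),          # passed as the literal string "True" or "False"
        metrics,
        str(number_of_samples),  # -1 means use all test samples
    ]


cmd = build_eval_command(
    dataset="cn_college_listen_mcq_test",
    model_name="Qwen2-Audio-7B-Instruct",
    gpu=2,
    batch_size=1,
    overwrite=True,
    metrics="llama3_70b_judge",
)
print(shlex.join(cmd))
```

To actually run the command, pass `cmd` to `subprocess.run(cmd, check=True)` from the repository root, with the judge server from Step 1 already running.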

To evaluate on a different dataset, replace the DATASET and METRICS values (see the full list below):

DATASET=librispeech_test_clean
METRICS=wer
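
For readers new to the wer metric: word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates the textbook definition only. It assumes plain whitespace tokenization and no text normalization, and the function names are hypothetical; AudioBench's own wer implementation may preprocess text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace tokenization (an illustration only)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.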

📊 Supported Datasets

AudioBench supports 50+ datasets. Full names, metrics, and usage are in examples/supported_datasets.md.


🤖 Supported Models

See examples/adding_new_model.md for setup details.

➕ Add Your Own Dataset / Model

Your own dataset, in two steps:

1. Make a copy of one of the customized dataset loaders (for example, cn_college_listen_mcq_test) and adapt it to your dataset.
2. Add a new entry in dataset.py.

Your own model: as long as the model can run inference, you can load it and generate responses. See examples/adding_new_model.md.
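
Conceptually, a model only needs to turn an audio input plus a text instruction into a text response. The sketch below illustrates that idea with a stand-in model. Every name in it (`EchoModel`, `generate`, `run_inference`, the sample keys) is hypothetical and is not AudioBench's actual interface; follow examples/adding_new_model.md for the real integration steps.

```python
class EchoModel:
    """Hypothetical stand-in for a real AudioLLM; not part of AudioBench."""

    def generate(self, audio, instruction):
        # A real model would run inference on the waveform here.
        return f"[{len(audio)} samples] {instruction}"


def run_inference(model, samples):
    """Collect one text response per (audio, instruction) sample."""
    return [model.generate(s["audio"], s["instruction"]) for s in samples]


# One second of silence at an assumed 16 kHz sampling rate.
samples = [{"audio": [0.0] * 16000, "instruction": "Transcribe the speech."}]
responses = run_inference(EchoModel(), samples)
print(responses)
```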

For an overview of how datasets, models, and metrics fit together, see ARCHITECTURE.md.

🏆 Leaderboard & Users

🌟 View the live AudioBench Leaderboard on Hugging Face Spaces

To submit your model to the leaderboard, email: bwang28c@gmail.com

Researchers, companies or groups that are using AudioBench:

📝 Change Log


Mar 2025: Support the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
Mar 2025: Support the MMAU test set: multiple-choice questions for speech, audio, and music understanding.
Mar 2025: AudioBench now supports over 50 datasets!
Mar 2025: Support the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
Aug 2024: The leaderboard is live.
Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
Jul 2024: Support all 26 initial datasets listed in the AudioBench manuscript.

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}

✅ To-Do List

Features
- Evaluation with audio/speech generation
- Evaluation with multi-round chatbots
- Support other judge models and report the results
- Update AI-SHELL from WER to CER

Bugs
- Threads of model-as-judge
- Post-processing script for IMDA PART4, which contains code-switching in 4 languages

🙌 Contributors

Xue Cong Tey (MMAU-mini Dataset)
