A repository for evaluating AudioLLMs in various tasks
AudioBench: A Universal Benchmark for Audio Large Language Models
View our live leaderboard on Hugging Face Spaces
AudioBench Leaderboard | Hugging Face Datasets | AudioLLM Paper Collection
AudioBench is a universal benchmark for evaluating audio large language models (AudioLLMs) on speech, audio-scene, and voice understanding tasks across 50+ datasets. New to the codebase? See ARCHITECTURE.md for how the pieces fit together.
- Installation
- Quick Start
- Supported Datasets
- Supported Models
- Add Your Own Dataset / Model
- Leaderboard & Users
- Change Log
- Citation
- To-Do List
- Contributors
Installation with pip:
pip install -r requirements.txt
For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.
The example below hosts a Llama-3-70B-Instruct judge and evaluates the Qwen2-Audio-7B-Instruct model.
# Step 1:
# Serve the judgement model with the vLLM framework (this example uses an int4-quantized version)
# This requires one 80GB GPU
bash vllm_model_judge_llama_3_70b.sh
# Step 2:
# We perform model inference and obtain the evaluation results with the second GPU
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1 # -1 evaluates all test samples
MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
To evaluate on a different dataset, just replace the DATASET and METRICS values (see the full list below):
DATASET=librispeech_test_clean
METRICS=wer
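To run several datasets in one go, the eval.sh call can be wrapped in a small driver. The following Python sketch is not part of the repository: `build_eval_command` and `run_all` are hypothetical helper names, and the only thing it assumes about eval.sh is the positional argument order shown in the Quick Start command. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess

# Dataset/metric pairs taken from the supported-dataset list.
RUNS = [
    ("librispeech_test_clean", "wer"),
    ("covost2_en_zh_test", "bleu"),
    ("cn_college_listen_mcq_test", "llama3_70b_judge"),
]


def build_eval_command(dataset, metrics, model_name="Qwen2-Audio-7B-Instruct",
                       gpu=2, batch_size=1, overwrite=True, number_of_samples=-1):
    """Return the argv list for one eval.sh run (hypothetical helper).

    Argument order follows the Quick Start call:
    DATASET MODEL_NAME GPU BATCH_SIZE OVERWRITE METRICS NUMBER_OF_SAMPLES
    """
    return [
        "bash", "eval.sh",
        dataset, model_name, str(gpu), str(batch_size),
        str(overwrite), metrics, str(number_of_samples),
    ]


def run_all(runs=RUNS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = [build_eval_command(dataset, metrics) for dataset, metrics in runs]
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Judge-based metrics such as llama3_70b_judge still need the judgement model from Step 1 to be running before the commands are executed for real.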
AudioBench supports 50+ datasets. Full names, metrics, and usage are in examples/supported_datasets.md.
Full dataset list:
- librispeech_test_clean, ASR, English, Metric: wer
- librispeech_test_other, ASR, English, Metric: wer
- common_voice_15_en_test, ASR, English, Metric: wer
- peoples_speech_test, ASR, English, Metric: wer
- gigaspeech_test, ASR, English, Metric: wer
- tedlium3_test, ASR, English, Metric: wer
- tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
- earnings21_test, ASR, English, Long recording, Metric: wer
- earnings22_test, ASR, English, Long recording, Metric: wer
- aishell_asr_zh_test, ASR, Chinese, Metric: wer
- covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
- covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
- covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
- covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
- covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
- covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
- cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- spoken-mqa_short_digit, Speech Instruction, Metric: acc
- spoken-mqa_long_digit, Speech Instruction, Metric: acc
- spoken-mqa_single_step_reasoning, Speech Instruction, Metric: acc
- spoken-mqa_multi_step_reasoning, Speech Instruction, Metric: acc
- audiollm_instructionfollowing, Instruction Following (IFEval-Audio), Metric: llama3_70b_judge_combined
- clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- muchomusic_test, Music Understanding, Metric: llama3_70b_judge,gpt4o_judge
- imda_part1_asr_test, Singlish ASR, Metric: wer
- imda_part2_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_asr_test, Singlish ASR, Metric: wer
- imda_part4_30s_asr_test, Singlish ASR, Metric: wer
- imda_part5_30s_asr_test, Singlish ASR, Metric: wer
- imda_part6_30s_asr_test, Singlish ASR, Metric: wer
- imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- seame_dev_man, English-Chinese Code-Switching, Metric: wer
- seame_dev_sge, English-Chinese Code-Switching, Metric: wer
- mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge,string_match,gpt4o_judge
- gigaspeech2_thai, ASR for Thai language, Metric: wer
- gigaspeech2_indo, ASR for Indonesian language, Metric: wer
- gigaspeech2_viet, ASR for Vietnamese language, Metric: wer
- ASCEND, English-Chinese Code-Switching, Metric: wer
- [fleurs] speech translation
- [AIR-Bench] airbench tasks
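Most ASR datasets in the list are scored with wer, and the to-do list mentions moving AISHELL from WER to CER. The sketch below shows how the two rates differ. It is a plain edit-distance implementation written for illustration, not AudioBench's scoring code, which may apply its own text normalization; `edit_distance` and `error_rate` are hypothetical names.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1      # reference token missing from hypothesis
            insertion = current[j - 1] + 1  # extra token in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Edit distance divided by reference length, over words or characters."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Character rate: ignore spaces and compare character by character.
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # English: whitespace splitting gives meaningful word tokens.
    print(error_rate("the cat sat", "the cat sit", unit="word"))
    # Chinese has no spaces, so the whole sentence becomes one "word";
    # a single wrong character then counts as a 100% word error.
    print(error_rate("今天天气很好", "今天天气真好", unit="word"))
    print(error_rate("今天天气很好", "今天天气真好", unit="char"))
```

For unsegmented text such as Chinese, the character-level rate is the more informative of the two, which is the motivation behind the AISHELL to-do item.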
See examples/adding_new_model.md for setup details.
- cascade_whisper_large_v3_llama_3_8b_instruct
- cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- MERaLiON-AudioLLM-Whisper-SEA-LION
- Qwen-Audio-Chat
- Qwen2-Audio-7B-Instruct
- SALMONN_7B: requires an extra git clone.
- WavLLM_fairseq: deprecated. The inference setup is too involved; the loader is kept for reference.
- whisper_large_v3
- whisper_large_v2
- gemini-1.5-flash: API key required
- gemini-2-flash: API key required
- gpt-4o-audio: API key required
- phi_4_multimodal_instruct
- seallms_audio_7b
- ultravox: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b, https://www.ultravox.ai/
- llama3_s
- audio-flamingo-2
- [GLM4-Voice]
- [Mini-Omni]
- [SLAM-Omni]
- [https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct]
- [https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b]
Your own dataset, in two steps:
- Make a copy of one of the customized dataset loaders. Example: cn_college_listen_mcq_test. Customize it for your own dataset.
- Add a new entry in dataset.py.
Your own model: as long as the model can do inference, you can load it and generate responses. See examples/adding_new_model.md.
For an overview of how datasets, models, and metrics fit together, see ARCHITECTURE.md.
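The two dataset steps can be pictured roughly as follows. This is a hypothetical sketch only: the `Sample` fields, the `MyDatasetLoader` class, its `prepare` method, and the `DATASET_LOADERS` mapping are illustrative assumptions and do not reflect the actual interface in dataset.py. For the real structure, copy an existing loader such as the one for cn_college_listen_mcq_test.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation item (hypothetical fields)."""
    audio_path: str
    instruction: str
    reference: str


class MyDatasetLoader:
    """Hypothetical loader that turns raw records into Sample objects."""

    def __init__(self, records, number_of_samples=-1):
        # Follow the Quick Start convention: -1 means "use every test sample".
        if number_of_samples != -1:
            records = records[:number_of_samples]
        self.records = records

    def prepare(self):
        """Map each raw record to the fields the evaluation would consume."""
        return [
            Sample(audio_path=r["audio"], instruction=r["question"], reference=r["answer"])
            for r in self.records
        ]


# Step 2 (hypothetical): register the loader under a new dataset name,
# standing in for "add a new entry in dataset.py".
DATASET_LOADERS = {"my_dataset_test": MyDatasetLoader}


if __name__ == "__main__":
    raw = [
        {"audio": "clip_001.wav", "question": "What is the speaker asking for?", "answer": "Directions"},
        {"audio": "clip_002.wav", "question": "Where does the dialogue take place?", "answer": "A library"},
    ]
    loader = DATASET_LOADERS["my_dataset_test"](raw, number_of_samples=1)
    for sample in loader.prepare():
        print(sample)
```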
View the live AudioBench Leaderboard on Hugging Face Spaces
To submit your model to the leaderboard, email: bwang28c@gmail.com
Researchers, companies or groups that are using AudioBench:
- Llama3-S: When Llama Learns to Listen
- llms-eval
- More to come...
Change log:
- Mar 2025: Support the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Support the MMAU test set: multiple-choice questions for speech, audio, and music understanding.
- Mar 2025: AudioBench now supports over 50 datasets.
- Mar 2025: Support the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live on Hugging Face Spaces.
- Jul 2024: We are working hard on the leaderboard and speech translation datasets. Stay tuned!
- Jul 2024: Support all 26 initial datasets listed in the AudioBench manuscript.
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
- Features
  - Evaluation with audio/speech generation
  - Evaluation with multi-round chatbot
  - Support other model-as-judge options and report the results
  - Update AISHELL from WER to CER
- Bugs
  - Threads of model-as-judge
  - Post-processing script for IMDA PART4, which contains code-switching in 4 languages.
- Xue Cong Tey (MMAU-mini Dataset)
