AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

arXiv: https://arxiv.org/pdf/2402.07729.pdf
ACL 2024 Main Conference: https://aclanthology.org/2024.acl-long.109.pdf

<figure> <img src="Images/main_figure.png" width="60%"> </figure>

AIR-Bench (Audio InstRuction Benchmark) is the first benchmark designed to evaluate the ability of large audio-language models (LALMs) to understand various types of audio signals (including human speech, natural sounds, and music) and to interact with humans in textual format.

AIR-Bench encompasses two dimensions: foundation and chat benchmarks. The former consists of 19 tasks with approximately 19k single-choice questions. The latter contains 2k instances of open-ended question-and-answer data.

Overview of LALM Response

<img src="Images/Foundation_Example.png" width="60%"> <img src="Images/Chat_Example.png" width="60%">

LeaderBoard

Chat LeaderBoard

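| Rank | Categories | Speech | Sound | Music | Mixed Audio | Average |
|---|---|---|---|---|---|---|
| 🏅 | Qwen2-Audio | 7.18 | 6.99 | 6.79 | 6.77 | 6.93 |
| 🥈 | Qwen-Audio-Turbo | 7.04 | 6.59 | 5.98 | 5.77 | 6.34 |
| 🥉 | SALMONN | 6.16 | 6.28 | 5.95 | 6.08 | 6.11 |
| 4 | Qwen-Audio | 6.47 | 6.95 | 5.52 | 5.38 | 6.08 |
| 5 | Gemini-1.5-pro | 6.97 | 5.49 | 5.06 | 5.27 | 5.70 |
| 6 | BLSP | 6.17 | 5.55 | 5.08 | 4.52 | 5.33 |
| 7 | Pandagpt | 3.58 | 5.46 | 5.06 | 2.93 | 4.25 |
| 8 | Next-gpt | 3.86 | 4.76 | 4.18 | 2.92 | 4.13 |
| 9 | SpeechGPT | 1.57 | 0.95 | 0.95 | 1.14 | 1.15 |
| 10 | Macaw-LLM | 0.97 | 1.01 | 0.91 | 1.00 | 1.01 |
| | Whisper+GPT 4 | 7.54 | / | / | / | / |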

Foundation Leaderboard

| Categories | Qwen-Audio-Turbo | Qwen-Audio | Pandagpt | SALMONN | Next-gpt | BLSP | SpeechGPT | Whisper+GPT 4 |
|---|---|---|---|---|---|---|---|---|
| Rank | 🏅 | 🥈 | 🥉 | 4 | 5 | 6 | 7 | / |
| Speech grounding | 45.4% | 56.1% | 23.0% | 25.3% | 25.4% | 25.0% | 28.8% | 35.0% |
| Spoken language identification | 95.9% | 92.8% | 34.6% | 28.1% | 23.7% | 30.8% | 39.6% | 96.8% |
| Speaker gender recognition | 82.5% | 67.2% | 66.5% | 35.5% | 57.0% | 33.2% | 29.2% | 21.9% |
| Emotion recognition | 60.0% | 43.2% | 26.0% | 29.9% | 25.7% | 27.4% | 37.6% | 59.5% |
| Speaker age prediction | 58.8% | 36.0% | 42.5% | 48.7% | 62.4% | 51.2% | 20.4% | 41.1% |
| Speech entity recognition | 48.1% | 71.2% | 34.0% | 51.7% | 26.1% | 37.2% | 35.9% | 69.8% |
| Intent classification | 56.4% | 77.8% | 28.5% | 36.7% | 25.6% | 46.6% | 45.8% | 87.7% |
| Speaker number verification | 54.3% | 35.3% | 43.2% | 34.3% | 25.4% | 28.1% | 32.6% | 30.0% |
| Synthesized voice detection | 69.3% | 48.3% | 53.1% | 50.0% | 30.8% | 50.0% | 39.2% | 40.5% |
| Audio grounding | 41.6% | 23.9% | 38.3% | 24.0% | 62.2% | 34.6% | 26.1% | / |
| Vocal sound classification | 78.1% | 84.9% | 31.6% | 45.3% | 23.5% | 29.8% | 26.2% | / |
| Acoustic scene classification | 61.3% | 67.5% | 55.7% | 34.1% | 24.1% | 25.2% | 23.7% | / |
| Sound question answering | 62.8% | 64.6% | 48.7% | 28.4% | 18.8% | 36.1% | 33.9% | / |
| Music instruments classification | 59.6% | 59.1% | 47.7% | 41.3% | 24.3% | 22.8% | 29.1% | / |
| Music genre classification | 77.1% | 71.2% | 39.8% | 45.3% | 28.1% | 26.1% | 29.3% | / |
| Music note analysis-pitch | 30.1% | 28.6% | 26.4% | 26.4% | 25.1% | 23.5% | 24.1% | / |
| Music note analysis-velocity | 25.1% | 25.4% | 27.2% | 22.8% | 23.1% | 24.9% | 25.2% | / |
| Music question answering | 62.5% | 48.2% | 50.7% | 54.6% | 47.1% | 31.0% | 31.3% | / |
| Music emotion detection | 39.0% | 36.1% | 36.7% | 32.2% | 25.4% | 28.3% | 29.7% | / |
| Average | 57.8% | 54.5% | 40.2% | 36.0% | 31.5% | 31.4% | 30.0% | / |

Download AIR-Bench

Please refer to the issue.

Easy Evaluation

Step 1: Evaluate on the Foundation Benchmark

Run inference with your model on the Foundation Benchmark

python Inference_Foundation.py

[Optional] Alignment on the Foundation Benchmark

This step is optional. It applies when your model does not reliably answer with one of the option letters (A, B, C, or D), so its responses need to be aligned to an option using GPT. We provide a script that calls GPT in batches; the only change you need to make is to supply your own GPT call credentials (MIT_SPIDER_TOKEN and MIT_SPIDER_URL).

python align_in_foundation.py
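
Before calling GPT, it can help to check which responses already contain an unambiguous option letter. The following is a minimal sketch and not part of the repository: `extract_choice` is a hypothetical helper, and the regular expression is an assumption about how a model might phrase its answer. Responses for which it returns `None` are the ones that would still need GPT alignment.

```python
import re
from typing import Optional

# Hypothetical helper (not part of AIR-Bench): match a standalone capital A-D.
_CHOICE_PATTERN = re.compile(r"\b([ABCD])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter if exactly one distinct letter A-D appears
    as a standalone token, otherwise None (missing or ambiguous)."""
    letters = set(_CHOICE_PATTERN.findall(response))
    if len(letters) == 1:
        return letters.pop()
    return None


responses = ["B", "The answer is (C).", "It sounds like a dog barking.", "A or B"]
# Responses without a single clear letter are candidates for GPT alignment.
needs_alignment = [r for r in responses if extract_choice(r) is None]
```

Note that this heuristic is deliberately conservative but not foolproof: a sentence that starts with the article "A" would be read as option A, which is one reason a GPT-based alignment pass may still be preferable.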

Calculate the score on the Foundation Benchmark

python score_foundation.py
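
The Foundation Leaderboard reports a percentage per task, which for single-choice questions suggests a per-task accuracy. A minimal sketch of such a computation follows. The record fields (`task`, `answer_gt`, `prediction`) are assumed for illustration and may not match the files that score_foundation.py actually reads.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Map each task name to the fraction of records whose prediction
    equals the ground-truth option."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        if rec["prediction"] == rec["answer_gt"]:
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records with assumed field names, for illustration only.
records = [
    {"task": "Emotion recognition", "answer_gt": "A", "prediction": "A"},
    {"task": "Emotion recognition", "answer_gt": "C", "prediction": "B"},
    {"task": "Audio grounding", "answer_gt": "D", "prediction": "D"},
]
scores = accuracy_by_task(records)
for task, acc in scores.items():
    print(f"{task}: {acc:.1%}")
```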

Step 2: Evaluate on the Chat Benchmark

Run inference with your model on the Chat Benchmark

python Inference_Chat.py

Calculate the GPT score on the Chat Benchmark

python score_chat.py

Note

The final score is the average of the model prediction scores over two scoring passes: score once, then swap the positions of answer_gt and the model prediction, score again, and average the two results.
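
The position swap described in the note can be sketched as follows. `judge` is a hypothetical callable standing in for the GPT scoring call; it is assumed to take the two answers in presentation order and return one score per answer in that same order. The actual prompt and response parsing live in score_chat.py and are not reproduced here.

```python
from typing import Callable, Tuple

# Hypothetical judge signature:
# (first_answer, second_answer) -> (score_of_first, score_of_second)
Judge = Callable[[str, str], Tuple[float, float]]


def debiased_prediction_score(answer_gt: str, prediction: str, judge: Judge) -> float:
    """Score the prediction in both positions and average the two scores."""
    # Pass 1: reference answer shown first, prediction second.
    _, pred_score_second = judge(answer_gt, prediction)
    # Pass 2: prediction shown first, reference answer second.
    pred_score_first, _ = judge(prediction, answer_gt)
    return (pred_score_first + pred_score_second) / 2


def toy_judge(first: str, second: str) -> Tuple[float, float]:
    """Stand-in judge with a deliberate bias: +1 for whichever answer is first."""
    return (len(first.split()) + 1.0, float(len(second.split())))


score = debiased_prediction_score("a dog barks loudly", "a dog", toy_judge)
```

With the toy judge, the prediction receives 2.0 when shown second and 3.0 when shown first, so averaging the two passes cancels half of the position bias in each direction.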

Merge score

Aggregate the scores on the chat dataset to obtain the final score. A simple reference script is provided in cal_score.py.
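
As an illustration of the merge step, the sketch below averages per-sample scores within each category and then takes the unweighted mean of the category averages. This is an assumption about the aggregation rather than a description of cal_score.py; it is chosen because the Average column of the Chat LeaderBoard is close to the unweighted mean of the four category scores (for example, Qwen-Audio: (6.47 + 6.95 + 5.52 + 5.38) / 4 = 6.08).

```python
from statistics import mean


def merge_chat_scores(scores_by_category):
    """Return (per-category means, unweighted mean of those means)."""
    category_means = {
        category: mean(values) for category, values in scores_by_category.items()
    }
    overall = mean(list(category_means.values()))
    return category_means, overall


# Toy per-sample scores; category names follow the Chat LeaderBoard columns.
toy_scores = {
    "Speech": [7.0, 8.0],
    "Sound": [6.0, 7.0],
    "Music": [5.0, 6.0],
    "Mixed Audio": [4.0, 5.0],
}
category_means, overall = merge_chat_scores(toy_scores)
```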

License

AIR-Bench is released under the Apache License, Version 2.0.

Citing

If you find this repository helpful, please consider citing it:

@article{yang2024air,
  title={AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension},
  author={Yang, Qian and Xu, Jin and Liu, Wenrui and Chu, Yunfei and Jiang, Ziyue and Zhou, Xiaohuan and Leng, Yichong and Lv, Yuanjun and Zhao, Zhou and Zhou, Chang and others},
  journal={arXiv preprint arXiv:2402.07729},
  year={2024}
}