# @BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology (WACV 2025)
by Xin Jiang*, Junwei Zheng*, Ruiping Liu, Jiahang Li, Jiaming Zhang†, Sven Matthiesen, Rainer Stiefelhagen
* denotes equal contribution and † denotes corresponding author
<p align="center">
  <a href="https://arxiv.org/pdf/2409.14215"><img src="https://img.shields.io/badge/arXiv-2409.14215-red" /></a>
  <a href="https://junweizheng93.github.io/publications/ATBench/ATBench.html"><img src="https://img.shields.io/badge/Project-page-green" /></a>
  <a href="https://pytorch.org/"><img src="https://img.shields.io/badge/Framework-PyTorch-orange.svg" /></a>
  <a href="https://github.com/jystin/ATBench/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" /></a>
</p>

## News
- [2024.09.17] ATBench (Assistive Technology Benchmark) is accepted to WACV 2025.
- [2024.10.13] We are excited to release the ATModel (Assistive Technology Model) training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md).
## Introduction
ATBench is designed based on a pre-design user study with PVIs (People with Visual Impairments) and covers the five most crucial vision-language tasks: Panoptic Segmentation, Image Captioning, Visual Question Answering (VQA), Depth Estimation, and Optical Character Recognition (OCR). We also propose a novel ATModel that can address all five tasks simultaneously.
More details can be found in our [arXiv paper](https://arxiv.org/pdf/2409.14215).
## Getting Started
### Checkpoints and Numbers
| Model | PS<br/>(ADE-150)<br/>PQ | DE<br/>(NYU-V2)<br/>RMSE | OCR<br/>(6 datasets avg)<br/>Acc (%) | IC<br/>(VizWiz_Cap)<br/>CIDEr | VQA<br/>(VizWiz_VQA)<br/>Acc (%) | #Params |
|---|---|---|---|---|---|---|
| Unified-IO (S) | - | 0.649 | - | - | 42.4 | 71M |
| Unified-IO (B) | - | 0.469 | - | - | 45.8 | 241M |
| Unified-IO (L) | - | 0.402 | - | - | 47.7 | 776M |
| X-Decoder (T) | 41.6 | - | - | - | - | 164M |
| GIT (T) | - | - | - | 113.1 | 68.0 | 0.7B |
| PaLI (T) | - | - | - | 117.2 | 67.5 | 3.0B |
| ATModel | 38.5 | 0.425 | 80.1 | 52.5 | 53.7 | 62M |
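
As a quick, illustrative comparison (not part of the released code), the following standalone Python snippet recomputes the parameter-efficiency gap using the numbers copied from the table above; billion-scale parameter counts are converted to millions for the comparison.

```python
# Illustrative only: compares reported VizWiz VQA accuracy vs. model size
# using the numbers from the table above (not part of the ATBench code).

# (model, VizWiz_VQA accuracy in %, parameter count in millions)
results = [
    ("Unified-IO (S)", 42.4, 71),
    ("Unified-IO (B)", 45.8, 241),
    ("Unified-IO (L)", 47.7, 776),
    ("GIT (T)", 68.0, 700),    # 0.7B -> 700M
    ("PaLI (T)", 67.5, 3000),  # 3.0B -> 3000M
    ("ATModel", 53.7, 62),
]

atmodel_params = next(p for name, _, p in results if name == "ATModel")

for name, acc, params in results:
    ratio = params / atmodel_params
    print(f"{name:<16} VQA acc: {acc:5.1f}%  params: {params:5d}M  ({ratio:4.1f}x ATModel size)")
```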
### Installation, Dataset, Training and Evaluation Guide
Please follow [INSTALL.md](INSTALL.md), [DATASET.md](DATASET.md), [TRAIN.md](TRAIN.md), and [EVALUATION.md](EVALUATION.md).
## Acknowledgement
- We build our work on top of X-Decoder and reuse parts of its code. We thank the authors for their open-source repository.
## Citation
If you find our work useful in your research, please cite:
@inproceedings{jiang2025atbench,
title={@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology},
author={Jiang, Xin and Zheng, Junwei and Liu, Ruiping and Li, Jiahang and Zhang, Jiaming and Matthiesen, Sven and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2025}
}