# @BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology (WACV 2025)
by Xin Jiang*, Junwei Zheng*, Ruiping Liu, Jiahang Li, Jiaming Zhang†, Sven Matthiesen, Rainer Stiefelhagen
* denotes equal contribution and † denotes corresponding author
<p align="center">
  <a href="https://arxiv.org/pdf/2409.14215"><img src="https://img.shields.io/badge/arXiv-2409.14215-red" /></a>
  <a href="https://junweizheng93.github.io/publications/ATBench/ATBench.html"><img src="https://img.shields.io/badge/Project-page-green" /></a>
  <a href="https://pytorch.org/"><img src="https://img.shields.io/badge/Framework-PyTorch-orange.svg" /></a>
  <a href="https://github.com/jystin/ATBench/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" /></a>
</p>

## News
- [2024.09.17] ATBench (Assistive Technology Benchmark) is accepted to WACV 2025.
- [2024.10.13] We are excited to release the ATModel (Assistive Technology Model) training code (INSTALL.md, DATASET.md, TRAIN.md, EVALUATION.md).
## Introduction
ATBench is designed based on a pre-design user study with PVIs (People with Visual Impairments) and covers the five most crucial vision-language tasks: Panoptic Segmentation, Image Captioning, Visual Question Answering (VQA), Depth Estimation, and Optical Character Recognition (OCR). We also propose a novel ATModel that can address all five tasks simultaneously.
More details can be found in our [arXiv paper](https://arxiv.org/pdf/2409.14215).
## Getting Started
### Checkpoints and Numbers
| Model | PS<br/>(ADE-150)<br/>PQ | DE<br/>(NYU-V2)<br/>RMSE | OCR<br/>(6 datasets avg)<br/>Acc (%) | IC<br/>(VizWiz_Cap)<br/>CIDEr | VQA<br/>(VizWiz_VQA)<br/>Acc (%) | #Params |
|---|---|---|---|---|---|---|
| Unified-IO (S) | - | 0.649 | - | - | 42.4 | 71M |
| Unified-IO (B) | - | 0.469 | - | - | 45.8 | 241M |
| Unified-IO (L) | - | 0.402 | - | - | 47.7 | 776M |
| X-Decoder (T) | 41.6 | - | - | - | - | 164M |
| GIT (T) | - | - | - | 113.1 | 68.0 | 0.7B |
| PaLI (T) | - | - | - | 117.2 | 67.5 | 3.0B |
| ATModel | 38.5 | 0.425 | 80.1 | 52.5 | 53.7 | 62M |
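
As a quick, illustrative comparison (not part of the released code), the following standalone Python snippet recomputes the parameter-efficiency gap using the numbers copied from the table above; billion-scale parameter counts are converted to millions for the comparison.

```python
# Illustrative only: compares reported VizWiz VQA accuracy vs. model size
# using the numbers from the table above (not part of the ATBench code).

# (model, VizWiz_VQA accuracy in %, parameter count in millions)
results = [
    ("Unified-IO (S)", 42.4, 71),
    ("Unified-IO (B)", 45.8, 241),
    ("Unified-IO (L)", 47.7, 776),
    ("GIT (T)", 68.0, 700),    # 0.7B -> 700M
    ("PaLI (T)", 67.5, 3000),  # 3.0B -> 3000M
    ("ATModel", 53.7, 62),
]

atmodel_params = next(p for name, _, p in results if name == "ATModel")

for name, acc, params in results:
    ratio = params / atmodel_params
    print(f"{name:<16} VQA acc: {acc:5.1f}%  params: {params:5d}M  ({ratio:4.1f}x ATModel size)")
```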
### Installation, Dataset, Training and Evaluation Guide
Please follow [INSTALL.md](INSTALL.md), [DATASET.md](DATASET.md), [TRAIN.md](TRAIN.md), and [EVALUATION.md](EVALUATION.md).
## Acknowledgement
- We build our work on top of X-Decoder and reuse parts of its code. We thank the authors for their open-source repository.
## Citation
If you find our work useful in your research, please cite:
@inproceedings{jiang2025atbench,
title={@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology},
author={Jiang, Xin and Zheng, Junwei and Liu, Ruiping and Li, Jiahang and Zhang, Jiaming and Matthiesen, Sven and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2025}
}