IDA-VLM

IDA-VLM weights <a href="https://huggingface.co/jiyatai/IDA-VLM">🤗</a>, MM-ID <a href="https://huggingface.co/jiyatai/MM-ID">🤗</a>, Visual instruction tuning data with ID reference <a href="https://huggingface.co/datasets/jiyatai/visual_instruction_tuning_ID_reference">🤗</a>.

This is the code base for IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model.

We propose visual instruction tuning with ID reference, which unleashes the potential of LVLM in identity memory and recognition across diverse scenes, and develop an ID-aware LVLM, IDA-VLM. This paper paves the way for future artificial intelligence systems to possess multi-identity visual inputs, thereby facilitating the comprehension of complex visual narratives like movies.

Samples:

<img src="./fig/samples.png"> <details> <summary>Animation image urls</summary> https://img1.doubanio.com/view/photo/l/public/p2625512480.webp, https://img1.doubanio.com/view/photo/m/public/p2901199610.webp, https://img2.doubanio.com/view/photo/m/public/p2896107391.webp, https://img2.doubanio.com/view/photo/l/public/p2895851711.webp, https://olimg.3dmgame.com/uploads/images/xiaz/2021/0924/1632447816995.jpg, https://i0.hdslb.com/bfs/archive/0384c2f5139013b1ceae84395bbd58fae25898ef.jpg, https://act-webstatic.mihoyo.com/event-static/2023/08/15/9797cacf6d60a54f91fb6f68546b43e1_6723404097102093983.jpg?x-oss-process=image/quality,Q_80/resize,m_lfit,s_700 </details>

Todo list:

Release code.
Release benchmark images, tuning data.
Release model weights and easy start.

We have three main contributions: MM-ID, tuning data construction and model training.

In MM-ID, we introduce the task format and evaluation methods. ID_reference_data contains the processing code for producing instruction tuning data. Model includes training and inference code, which is based on Qwen-VL-Chat.
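To illustrate what an instruction with ID reference might look like, here is a minimal sketch that assembles reference images and a test image into the list format accepted by the Qwen-VL-Chat tokenizer. The function name `build_id_query`, the character names, the file paths, and the prompt wording are all hypothetical; the actual prompt template used by IDA-VLM is the one defined in Model.

```python
def build_id_query(id_images, test_image, question):
    """Assemble a Qwen-VL-style list-format query (illustrative only).

    id_images:  dict mapping a character name to a reference image path
                (hypothetical layout, not the official IDA-VLM template).
    test_image: path of the image the question is about.
    question:   instruction that refers to the characters by name.
    """
    items = []
    # One reference image per identity, each followed by a short name tag.
    for name, path in id_images.items():
        items.append({"image": path})
        items.append({"text": f"This is {name}. "})
    # The test image and the question come last.
    items.append({"image": test_image})
    items.append({"text": question})
    return items


# Hypothetical example inputs.
query_items = build_id_query(
    {"Alice": "ids/alice.jpg", "Bob": "ids/bob.jpg"},
    "frames/scene_01.jpg",
    "What is Alice doing in this image?",
)
```

With Qwen-VL-Chat, a list like this is usually converted with `tokenizer.from_list_format(query_items)` and passed to `model.chat(tokenizer, query=..., history=None)`. Whether the IDA-VLM inference script uses the same entry point should be checked in Model.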

For a quick start, download the MM-ID images (or prepare your own ID images and test images) and the model weights. You can then run instruction tasks with ID reference, as detailed in Model.
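One possible way to fetch the assets is through the `huggingface_hub` package. This is a sketch, not the official setup script: the repository IDs come from the links at the top of this README, the repo types are inferred from the URL shapes, and the helper names and the `./downloads` layout are assumptions.

```python
from pathlib import Path

# Repository IDs copied from the links in this README.
# Repo types are inferred from the URLs (only the tuning data lives under /datasets/).
REPOS = {
    "weights": ("jiyatai/IDA-VLM", "model"),
    "mm_id": ("jiyatai/MM-ID", "model"),
    "tuning_data": ("jiyatai/visual_instruction_tuning_ID_reference", "dataset"),
}


def plan_downloads(root="./downloads", names=("weights", "mm_id")):
    """Return a list of (repo_id, repo_type, local_dir) for the requested assets."""
    root = Path(root)
    return [(REPOS[n][0], REPOS[n][1], str(root / n)) for n in names]


def fetch(plan):
    """Download every repository in the plan (requires network access)."""
    # Imported lazily so the planning step works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    for repo_id, repo_type, local_dir in plan:
        snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)


# The quick start only needs the weights and the MM-ID images.
plan = plan_downloads()
```

Calling `fetch(plan)` would then download both repositories into `./downloads/weights` and `./downloads/mm_id`. Pass `names=("tuning_data",)` to `plan_downloads` to fetch the instruction tuning data instead.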

License

The majority of this project is licensed under the Qwen-VL License.

Acknowledgements

Qwen-VL: The codebase we build upon.
MovieNet: The main dataset we use for tuning data construction.