<div align="center">

IntCLIP: Synergy of Sight and Semantics: Visual Intention Understanding with CLIP

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a> Conference Paper

</div>

Updates

Abstract

Multi-label Intention Understanding (MIU) for images is a critical yet challenging domain, primarily due to the ambiguity of intentions leading to a resource-intensive annotation process. Current leading approaches are held back by the limited amount of labeled data. To mitigate the scarcity of annotated data, we leverage the Contrastive Language-Image Pre-training (CLIP) model, renowned for its proficiency in aligning textual and visual modalities. We introduce a novel framework, Intention Understanding with CLIP (IntCLIP), which utilizes a dual-branch approach. This framework exploits the 'Sight'-oriented knowledge inherent in CLIP to augment 'Semantic'-centric MIU tasks. Additionally, we propose Hierarchical Class Integration to effectively manage the complex layered label structure, aligning it with CLIP's nuanced sentence feature extraction capabilities. Our Sight-assisted Aggregation further refines this model by infusing the semantic feature map with essential visual cues, thereby enhancing the intention understanding ability. Through extensive experiments conducted on the standard MIU benchmark and other subjective tasks such as Image Emotion Recognition, IntCLIP clearly demonstrates superiority over current state-of-the-art techniques.
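
For readers new to CLIP-style multi-label classification, the sketch below shows the general idea of scoring one image against per-class text prompts with CLIP features. It is a minimal illustration only, not the IntCLIP pipeline: it assumes the OpenAI CLIP package is installed, and the class names, prompt template, and image path are placeholders.

```python
# Minimal sketch (not the IntCLIP pipeline): multi-label scoring of an image
# against per-class text prompts with CLIP features. Class names, the prompt
# template, and "example.jpg" are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical intention classes; the real label set comes from the dataset.
classes = ["share feelings", "show skill", "enjoy life"]
prompts = clip.tokenize(
    [f"a photo expressing the intention to {c}" for c in classes]
).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity per class; a sigmoid gives independent multi-label
    # scores instead of a softmax over mutually exclusive classes.
    logits = 100.0 * img_feat @ txt_feat.t()
    scores = logits.sigmoid().squeeze(0)

for c, s in zip(classes, scores.tolist()):
    print(f"{c}: {s:.3f}")
```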

Architecture

Overview of proposed Intention Understanding with CLIP (IntCLIP). (a) Demonstrates the MIU with CLIP process, incorporating Sight-assisted Aggregation to fuse sight and semantic features with text encoding, followed by Asymmetric Loss (AS Loss) optimization to align final image and text features. The Hierarchical Class Integration extracts abundant class information from hierarchical labels. (b) Depicts the Sight-semantic Image Encoding strategy, where the CLIP-initialized encoder is dedicated to capturing sight-related features, and the semantic branch, through partially learnable deep layers, evolves to accommodate semantic information, thereby preserving both objective visual details and subjective semantic insights.
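
The dual-branch encoding in (b) can be pictured roughly as in the sketch below: a frozen CLIP-initialized branch keeps the sight-related features, while a semantic branch shares the shallow layers and trains only a copy of the deep layers; the two feature maps are then fused. The layer split, the copying scheme, and the simple weighted-sum fusion (a stand-in for Sight-assisted Aggregation) are illustrative assumptions, not the exact IntCLIP implementation.

```python
# Illustrative sketch of a dual-branch sight/semantic image encoder.
# Layer split, weight copying, and fusion are assumptions made for clarity;
# see the repository code for the actual IntCLIP implementation.
import copy
import torch
import torch.nn as nn


class DualBranchEncoder(nn.Module):
    def __init__(self, clip_visual_blocks: nn.ModuleList, num_deep_layers: int = 3):
        super().__init__()
        split = len(clip_visual_blocks) - num_deep_layers

        # Shared shallow layers, kept frozen with CLIP weights.
        self.shallow = clip_visual_blocks[:split]
        for p in self.shallow.parameters():
            p.requires_grad = False

        # Sight branch: frozen CLIP deep layers (objective visual details).
        self.sight_deep = clip_visual_blocks[split:]
        for p in self.sight_deep.parameters():
            p.requires_grad = False

        # Semantic branch: a learnable copy of the deep layers (subjective cues).
        self.semantic_deep = copy.deepcopy(clip_visual_blocks[split:])
        for p in self.semantic_deep.parameters():
            p.requires_grad = True

        # Stand-in for Sight-assisted Aggregation: a learnable mixing weight.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.shallow:
            x = blk(x)
        sight, semantic = x, x
        for blk in self.sight_deep:
            sight = blk(sight)
        for blk in self.semantic_deep:
            semantic = blk(semantic)
        # Fuse sight cues into the semantic feature map.
        return self.alpha * semantic + (1.0 - self.alpha) * sight
```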

Getting started

Dependencies installation

Install IntCLIP from scratch

```bash
conda create -n intclip python=3.7
conda activate intclip

git clone https://github.com/yan9qu/IntCLIP.git
cd IntCLIP
pip install -r requirements.txt
cd ..
```

In the conda environment, install pycocotools and randaugment with pip:

```bash
pip install pycocotools
pip install randaugment
```

Install extra dependencies

Follow the link to install dassl.
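
After installing the dependencies above, a quick import check such as the one below can confirm the environment is set up. The package names are taken from this README; the module name installed by the randaugment package may differ on your system.

```python
# Quick sanity check that the main dependencies named in this README import cleanly.
import torch
import pycocotools  # noqa: F401
import randaugment  # noqa: F401  (module name may differ from the pip package)
import dassl        # noqa: F401

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```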

Data preparation

Prepare datasets

Warning: please note that we do not own the copyright of any of the datasets used here. To get access to the images, please contact the original authors through the link provided.

You should specify the following paths in opts.py:

```
--datadir
--dataset_config_file
```
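
As a rough illustration of how these two options are typically wired up with argparse (the actual opts.py in this repository may differ, and the paths below are placeholders):

```python
# Hypothetical sketch of the two options mentioned above; the real opts.py
# defines them and may use different names, types, or defaults.
import argparse

parser = argparse.ArgumentParser(description="IntCLIP training options (sketch)")
parser.add_argument("--datadir", type=str, required=True,
                    help="Root directory of the dataset images (placeholder).")
parser.add_argument("--dataset_config_file", type=str, required=True,
                    help="Path to the dataset config file (placeholder).")

args = parser.parse_args(["--datadir", "/path/to/dataset",
                          "--dataset_config_file", "/path/to/dataset_config.yaml"])
print(args.datadir, args.dataset_config_file)
```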

Training

Training IntCLIP on Intentonomy:

```bash
python train.py
```

Inference

The results on the test and validation sets are reported in the output of train.py during training.
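
If you want to score saved predictions yourself, mean average precision is one common multi-label metric. The snippet below is a generic sketch using scikit-learn with placeholder arrays; it is not tied to the exact metrics reported by train.py.

```python
# Generic multi-label evaluation sketch (not this repo's exact metrics):
# mean average precision over classes, from sigmoid scores and binary labels.
import numpy as np
from sklearn.metrics import average_precision_score

# Placeholder arrays of shape (num_samples, num_classes).
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])

ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                for c in range(y_true.shape[1])]
print("mAP:", float(np.mean(ap_per_class)))
```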

Citation

If you find this project useful for your research, please use the following BibTeX entry.

```bibtex
@inproceedings{intclip,
    title={Synergy of Sight and Semantics: Visual Intention Understanding with CLIP},
    author={Qu Yang and Mang Ye and Dacheng Tao},
    booktitle={ECCV},
    year={2024}
}
```

:heart: We thank the following users for open-sourcing their repositories on GitHub, which we build on in our experiments.