SDA-CLIP: Surgical Visual Domain Adaptation Using Video and Text Labels

Created by Yuchong Li

This repository contains the PyTorch implementation of SDA-CLIP.

We introduce a Surgical Domain Adaptation method based on the Contrastive Language-Image Pretraining model (SDA-CLIP) to recognize cross-domain surgical actions. Specifically, we utilize a Vision Transformer (ViT) and a Transformer, both initialized with CLIP pre-trained parameters, to extract video and text embeddings, respectively. The text embedding serves as a bridge between the virtual reality (VR) and clinical domains. Inter- and intra-modality loss functions are employed to enhance the consistency of embeddings belonging to the same class.
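
Below is a rough sketch of this pipeline, assuming the openai `clip` package. The function names, the mean-pooling over frames, the fixed temperature, and the loss form are illustrative only and are not the exact SDA-CLIP implementation.

```python
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def encode_video(frames):
    # frames: (B, T, 3, 224, 224) preprocessed clip frames already on `device`.
    # Encode each frame with the ViT image encoder, then mean-pool over time.
    b, t = frames.shape[:2]
    feats = model.encode_image(frames.flatten(0, 1))       # (B*T, 512)
    feats = feats.view(b, t, -1).mean(dim=1)                # (B, 512)
    return F.normalize(feats, dim=-1)

def encode_text(prompts):
    # prompts: list of B action-label descriptions (one per video in the batch).
    tokens = clip.tokenize(prompts).to(device)              # (B, 77)
    return F.normalize(model.encode_text(tokens), dim=-1)   # (B, 512)

def inter_modality_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE-style loss pulling each video toward the text
    # embedding of its own class and away from the other classes in the batch.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```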

Our code is based on CLIP and ActionCLIP.

Prerequisites

Requirements

The required environment is recorded in requirements.txt.

Pretrained models

We use the base model (ViT-B/16 for both the image and text encoders) pre-trained by CLIP-openai. The model can be downloaded via the link and should be saved in ./models/.
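
As a minimal sketch, the pre-trained weights can be fetched directly into ./models/ with the openai `clip` package (the `download_root` argument controls where the checkpoint is cached):

```python
import torch
import clip

# Download (or reuse) the ViT-B/16 checkpoint under ./models/.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device, download_root="./models")
```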

Model weights

Our model weights for the hard and soft domain adaptation tasks can be downloaded via the link.
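
A hypothetical loading sketch is shown below; the checkpoint filename and its key layout are assumptions, so consult the repository's evaluation script for the exact procedure.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device, download_root="./models")

# "sda_clip_weights.pt" is a placeholder name for the downloaded checkpoint.
state_dict = torch.load("./models/sda_clip_weights.pt", map_location=device)
# strict=False because the checkpoint may contain modules beyond the CLIP backbone.
model.load_state_dict(state_dict, strict=False)
```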