
<br /> <p align="center"> <h1 align="center">CLIP Visual Spatial Reasoning</h1> <h3 align="center">Benchmark CLIP models using Visual Spatial Reasoning.</h3> <p align="center"> <a href="https://github.com/cambridgeltl/visual-spatial-reasoning">Original Visual Spatial Reasoning repo</a> </p> </p>

Note: Unless marked as fine-tuning results, the scores below are true zero-shot (no fine-tuning). I benchmark the following CLIP models:

OpenClip laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
OpenClip laion/CLIP-ViT-H-14-laion2B-s32B-b79K
OpenAI Clip openai/clip-vit-large-patch14-336

Findings:

Using the (True) / (False) modifiers proposed in the paper gives results no better than random (48.85% to 49.51%).
After experimenting with many strategies for modifying the prompts, I was able to get results of about 55% (so slightly better than random).

Open questions:

Will fine-tuning the model show the same or better results than the model types in the VSR paper?
How do the different relationships score (does CLIP natively understand any relationships reasonably well)?
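
One way to approach the second question is to break accuracy down by relation rather than reporting a single score. The following is a minimal sketch, assuming each evaluated example has been recorded as a `(relation, predicted_label, true_label)` tuple. The function name `accuracy_by_relation` and the record layout are hypothetical and are not part of this repo's scripts, and the example records are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_relation(records):
    """Compute accuracy per relation.

    records: iterable of (relation, predicted_label, true_label) tuples.
    This layout is an assumption, not the format used by the eval scripts.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for relation, predicted, label in records:
        total[relation] += 1
        if predicted == label:
            correct[relation] += 1
    return {relation: correct[relation] / total[relation] for relation in total}


# Illustrative records only, not real results.
example = [
    ("left of", 1, 1),
    ("left of", 0, 1),
    ("on top of", 1, 1),
    ("on top of", 0, 0),
]

per_relation = accuracy_by_relation(example)
for relation, acc in sorted(per_relation.items()):
    print(f"{relation}: {acc:.2%}")
```

Relations that score well above 50% would be candidates for ones CLIP handles reasonably well without fine-tuning.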

fine-tuning results

python src\train.py --base_model ViT-L/14@336px --mini_batch_size 20 --batch_size 500 --learning_rate 2e-5

test_accuracy: 65.07% trained model: model_run-113-65-07.pt

v-002 results

Uses the modified prompts, i.e.:

The horse is left of
The horse is left of the person.

python src\eval002.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval002.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 55.44%

python src\eval002.py --model_url openai/clip-vit-large-patch14-336

Score: 54.39%

v-001 results

python src\eval001.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 55.23%

python src\eval001.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 53.83%

python src\eval001.py --model_url openai/clip-vit-large-patch14-336

Score: 53.86%

v-000 results

Uses the prompts from the VSR paper (but without retraining), i.e.:

The horse is left of the person. (False)
The horse is left of the person. (True)
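
A zero-shot decision with a prompt pair like the one above can be made by comparing how well each caption matches the image and picking the better match. The sketch below assumes the image-text similarity scores have already been computed by a CLIP model. The helpers `predict_label` and `accuracy` are hypothetical names, the numbers are made up, and this may differ from what `eval000.py` actually does.

```python
def predict_label(sim_true, sim_false):
    """Return 1 if the '(True)' caption matches the image better, else 0."""
    return 1 if sim_true > sim_false else 0


def accuracy(similarities, labels):
    """Fraction of examples where the predicted label equals the gold label.

    similarities: list of (sim_true, sim_false) pairs, one per example.
    labels: list of gold labels (1 = statement is true, 0 = false).
    """
    predictions = [predict_label(t, f) for t, f in similarities]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up similarity pairs and labels, for illustration only.
similarities = [(0.31, 0.29), (0.22, 0.27), (0.25, 0.24), (0.30, 0.33)]
labels = [1, 0, 0, 1]

score = accuracy(similarities, labels)
print(f"Score: {score:.2%}")
```

Because every example is a binary choice, a model that cannot tell the two captions apart lands near 50%, which is the baseline the scores below should be read against.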

python src\eval000.py --model_url laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Score: 49.24%

python src\eval000.py --model_url laion/CLIP-ViT-H-14-laion2B-s32B-b79K

Score: 49.51%

python src\eval000.py --model_url openai/clip-vit-large-patch14-336

Score: 48.85%

install

conda env create
conda activate clip-vsr

run

python src\eval.py

Download images

See data/ folder's readme. Images should be saved under data/images/.

Citation

If you use the VSR dataset, please cite the original authors:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}

License

This project is licensed under the Apache-2.0 License.