Learning to Generate Grounded Visual Captions without Localization Supervision

<img src="teaser/pytorch-logo-dark.png" width="10%"> License: MIT

This is the PyTorch implementation of our paper:

Learning to Generate Grounded Visual Captions without Localization Supervision<br> Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira<br> European Conference on Computer Vision (ECCV), 2020 <br>

[arXiv] [GitHub] [Project]

<p align="center"> <img src="teaser/concept.png" width="100%"> </p>

10-min YouTube Video

<p align="center"> <a href="https://youtu.be/X84Tg0ULu1Y"> <img src="https://img.youtube.com/vi/X84Tg0ULu1Y/maxresdefault.jpg" width="75%"> </a> </p>

How to start

Clone the repo recursively:

git clone --recursive git@github.com:chihyaoma/cyclical-visual-captioning.git

If you didn't clone with the --recursive flag, you'll need to fetch the pybind submodule manually from the top-level directory:

git submodule update --init --recursive

Installation

The proposed cyclical method can be applied directly to image and video captioning tasks.

Currently, the installation guide and our code for video captioning on the ActivityNet-Entities dataset are provided in anet-video-captioning.
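
For intuition before diving into the code, below is a minimal, hypothetical PyTorch sketch of one cyclical training step as described in the paper: a decoding stage generates the caption while attending over region features, a localization stage grounds the generated words back to the regions, and a reconstruction stage regenerates the caption conditioned on the localized features using the same shared decoder. The module names and interfaces (`decoder`, `localizer`) are assumptions for illustration only; the actual implementation lives in anet-video-captioning.

```python
# Illustrative sketch of one cyclical training step (not the repo's API).
# `decoder` and `localizer` are hypothetical modules standing in for the
# real ones in anet-video-captioning.
import torch.nn.functional as F

def cyclical_step(decoder, localizer, regions, caption, optimizer):
    """regions: (B, R, D) region features; caption: (B, T) word indices."""
    # 1) Decoding stage: generate the caption while attending over regions.
    logits, _ = decoder(regions, caption)             # logits: (B, T, V)
    loss_dec = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), caption.reshape(-1))

    # 2) Localization stage: ground each generated word back to the regions.
    #    argmax carries no gradient; the localizer learns through the
    #    attended features via the reconstruction loss below.
    words = logits.argmax(dim=-1)                     # (B, T)
    grounded = localizer(regions, words)              # (B, T, D)

    # 3) Reconstruction stage: regenerate the caption from the localized
    #    features with the same decoder, tying grounding to captioning.
    logits_rec, _ = decoder(grounded, caption)
    loss_rec = F.cross_entropy(
        logits_rec.reshape(-1, logits_rec.size(-1)), caption.reshape(-1))

    loss = loss_dec + loss_rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```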

Acknowledgments

Chih-Yao Ma and Zsolt Kira were partly supported by DARPA’s Lifelong Learning Machines (L2M) program, under Cooperative Agreement HR0011-18-2-0019, as part of their affiliation with Georgia Tech. We thank Chia-Jung Hsu for her valuable and artistic help with the figures.

Citation

If you find this repository useful, please cite our paper:

@inproceedings{ma2020learning,
    title={Learning to Generate Grounded Visual Captions without Localization Supervision},
    author={Ma, Chih-Yao and Kalantidis, Yannis and AlRegib, Ghassan and Vajda, Peter and Rohrbach, Marcus and Kira, Zsolt},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2020},
    url={https://arxiv.org/abs/1906.00283},
}