Awesome Video-Text Retrieval by Deep Learning
A curated list of deep learning resources for video-text retrieval.
Contributing
Please feel free to open pull requests to add papers.
Markdown format:
- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
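As a rough illustration of the entry format above, the following sketch assembles one list line from its parts. The helper name `format_entry`, its parameters, and the `example.com` URLs are hypothetical and not part of this repository; links that are not supplied are simply omitted.

```python
from typing import Optional


def format_entry(
    key: str,
    title: str,
    venue: str,
    year: int,
    paper: Optional[str] = None,
    code: Optional[str] = None,
    homepage: Optional[str] = None,
) -> str:
    """Build one list entry in the contribution format (hypothetical helper)."""
    # Links are rendered in a fixed order; any link that is None is skipped.
    links = [("paper", paper), ("code", code), ("homepage", homepage)]
    rendered = " ".join(f"[[{name}]]({url})" for name, url in links if url)
    entry = f"- `[{key}]` {title}. {venue}, {year}."
    return f"{entry} {rendered}" if rendered else entry


print(format_entry(
    key="Dong et al. CVPR19",
    title="Dual Encoding for Zero-Example Video Retrieval",
    venue="CVPR",
    year=2019,
    paper="https://example.com/paper",  # placeholder URL, not the real link
    code="https://example.com/code",    # placeholder URL, not the real link
))
```

The printed line can be pasted under the matching year heading.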
Table of Contents
- Implementations
- Papers
  - 2023
  - 2022
  - 2021
  - 2020
  - 2019
  - 2018
  - Before
- Ad-hoc Video Search
- Other Related
- Datasets
Implementations
PyTorch
- hybrid_space
- dual_encoding
- w2vvpp
- Mixture-of-Embedding-Experts
- howto100m
- collaborative
- hgr
- coot
- mmt
- ClipBERT
TensorFlow
Others
- w2vv (Keras)
Useful Toolkit
Papers
2023
- [Pei et al. CVPR23] CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR, 2023. [paper]
- [Li et al. CVPR23] SViTT: Temporal Learning of Sparse Video-Text Transformers. CVPR, 2023. [paper] [code]
- [Wu et al. CVPR23] Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval. CVPR, 2023. [paper] [code]
- [Ko et al. CVPR23] MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models. CVPR, 2023. [paper] [code]
- [Wang et al. CVPR23] All in One: Exploring Unified Video-Language Pre-Training. CVPR, 2023. [paper] [code]
- [Girdhar et al. CVPR23] IMAGEBIND: One Embedding Space To Bind Them All. CVPR, 2023. [paper] [code]
- [Huang et al. CVPR23] VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval. CVPR, 2023. [paper] [code]
- [Li et al. CVPR23] LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling. CVPR, 2023. [paper] [code]
- [Huang et al. CVPR23] Clover: Towards a Unified Video-Language Alignment and Fusion Model. CVPR, 2023. [paper] [code]
- [Ji et al. CVPR23] Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning. CVPR, 2023. [paper]
- [Gan et al. CVPR23] CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset. CVPR, 2023. [paper] [code]
- [Zhao et al. CVPRW23] Cali-NCE: Boosting Cross-Modal Video Representation Learning With Calibrated Alignment. CVPR Workshop, 2023. [paper]
- [Ma et al. TCSVT23] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. TCSVT, 2023. [paper]
2022
- [Dong et al. ACMMM22] Partially Relevant Video Retrieval. ACM Multimedia, 2022. [homepage] [paper] [code]
  - A new text-to-video retrieval subtask.
- [Wang et al. ACMMM22] Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. ACM Multimedia, 2022. [paper] [code]
- [Wang et al. ACMMM22] Learn to Understand Negation in Video Retrieval. ACM Multimedia, 2022. [paper] [code]
- [Falcon et al. ACMMM22] A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval. ACM Multimedia, 2022. [paper] [code]
- [Ma et al. ACMMM22] X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. ACM Multimedia, 2022. [paper]
- [Hu et al. ECCV22] Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. ECCV, 2022. [paper] [code]
- [Liu et al. ECCV22] TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval. ECCV, 2022. [paper] [code]
- [Dong et al. TCSVT22] Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. TCSVT, 2022. [paper] [code]
- [Li et al. CVPR22] Align and Prompt: Video-and-Language Pre-training with Entity Prompts. CVPR, 2022. [paper] [code]
- [Shvetsova et al. CVPR22] Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval. CVPR, 2022. [paper] [code]
- [Ge et al. CVPR22] Bridging Video-text Retrieval with Multiple Choice Questions. CVPR, 2022. [paper] [code]
- [Han et al. CVPR22] Temporal Alignment Networks for Long-term Video. CVPR, 2022. [paper] [code]
- [Gorti et al. CVPR22] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. CVPR, 2022. [paper] [code]
- [Lu et al. NeurIPS22] LGDN: Language-Guided Denoising Network for Video-Language Modeling. NeurIPS, 2022. [paper]
- [Liu et al. SIGIR22] Animating Images to Transfer CLIP for Video-Text Retrieval. SIGIR, 2022. [paper]
- [Zhao et al. SIGIR22] CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. SIGIR, 2022. [paper]
- [Liu et al. ACL22] Cross-Modal Discrete Representation Learning. ACL, 2022. [paper]
- [Gabeur et al. WACV22] Masking Modalities for Cross-modal Video Retrieval. WACV, 2022. [paper]
- [Cao et al. AAAI22] Visual Consensus Modeling for Video-Text Retrieval. AAAI, 2022. [paper] [code]
- [Cheng et al. AAAI22] Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. AAAI, 2022. [paper] [code]
- [Wang et al. TMM22] Many Hands Make Light Work: Transferring Knowledge from Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2022. [paper]
- [Park et al. NAACL22] Exposing the Limits of Video-Text Models through Contrast Sets. NAACL, 2022. [paper] [code]
- [Song et al. TOMM22] Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval. TOMM, 2022. [paper]
- [Bai et al. ARXIV22] LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval. arXiv:2207.04858, 2022. [paper]
- [Bain et al. ARXIV22] A CLIP-Hitchhiker's Guide to Long Video Retrieval. arXiv:2205.08508, 2022. [paper]
- [Gao et al. ARXIV22] CLIP2TV: Align, Match and Distill for Video-Text Retrieval. arXiv:2111.05610, 2022. [paper]
- [Jiang et al. ARXIV22] Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations. arXiv:2204.03382, 2022. [paper]
2021
- [Dong et al. TPAMI21] Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
- [Wei et al. TPAMI21] Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper]
- [Lei et al. CVPR21] Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
- [Wray et al. CVPR21] On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
- [Chen et al. CVPR21] Learning the Best Pooling Strategy for Visual Semantic Embedding. CVPR, 2021. [paper] [code]
- [Wang et al. CVPR21] T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval. CVPR, 2021. [paper]
- [Miech et al. CVPR21] Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. CVPR, 2021. [paper]
- [Liu et al. CVPR21] Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval. CVPR, 2021. [paper]
- [Chen et al. ICCV21] Multimodal Clustering Networks for Self-Supervised Learning from Unlabeled Videos. ICCV, 2021. [paper]
- [Croitoru et al. ICCV21] TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. ICCV, 2021. [paper] [code]
- [Yang et al. ICCV21] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment. ICCV, 2021. [paper]
- [Bain et al. ICCV21] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV, 2021. [paper] [code]
- [Wen et al. ICCV21] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-Training for Vision-Language Representation. ICCV, 2021. [paper] [code]
- [Luo et al. ACMMM21] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising. ACM Multimedia, 2021. [paper]
- [Wu et al. ACMMM21] HANet: Hierarchical Alignment Networks for Video-Text Retrieval. ACM Multimedia, 2021. [paper] [code]
- [Liu et al. ACMMM21] Progressive Semantic Matching for Video-Text Retrieval. ACM Multimedia, 2021. [paper]
- [Han et al. ACMMM21] Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. ACM Multimedia, 2021. [paper]
- [Wei et al. ACMMM21] Meta Self-Paced Learning for Cross-Modal Matching. ACM Multimedia, 2021. [paper]
- [Patrick et al. ICLR21] Support-set Bottlenecks for Video-text Representation Learning. ICLR, 2021. [paper]
- [Qi et al. TIP21] Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
- [Song et al. TMM21] Spatial-temporal Graphs for Cross-modal Text2Video Retrieval. IEEE Transactions on Multimedia, 2021. [paper]
- [Dong et al. NEUCOM21] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper] [code]
- [Jin et al. SIGIR21] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR, 2021. [paper]
- [He et al. SIGIR21] Improving Video Retrieval by Adaptive Margin. SIGIR, 2021. [paper]
- [Wang et al. IJCAI21] Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. IJCAI, 2021. [paper]
- [Chen et al. AAAI21] Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval. AAAI, 2021. [paper]
- [Hao et al. ICME21] What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval. ICME, 2021. [paper]
- [Wu et al. ICME21] Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval. ICME, 2021. [paper]
- [Song et al. ICIP21] Semantic-Preserving Metric Learning for Video-Text Retrieval. IEEE International Conference on Image Processing, 2021. [paper]
- [Hao et al. ICMR21] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval. ICMR, 2021. [paper]
- [Liu et al. ARXIV21] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. arXiv:2103.15049, 2021. [paper]
- [Akbari et al. ARXIV21] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv:2104.11178, 2021. [paper] [code]
- [Fang et al. ARXIV21] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv:2106.11097, 2021. [paper] [code]
- [Luo et al. ARXIV21] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv:2104.08860, 2021. [paper] [code]
- [Li et al. ARXIV21] Align and Prompt: Video-and-Language Pre-training with Entity Prompts. arXiv:2112.09583, 2021. [paper] [code]
2020
- [Yang et al. SIGIR20] Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
- [Ging et al. NeurIPS20] COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
- [Gabeur et al. ECCV20] Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage]
- [Li et al. TMM20] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
- [Wang et al. TMM20] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
- [Chen et al. TMM20] Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
- [Wu et al. ACMMM20] Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
- [Feng et al. IJCAI20] Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
- [Wei et al. CVPR20] Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
- [Doughty et al. CVPR20] Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
- [Chen et al. CVPR20] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
- [Zhu et al. CVPR20] ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
- [Miech et al. CVPR20] End-to-End Learning of Visual Representations From Uncurated Instructional Videos. CVPR, 2020. [paper] [code] [homepage]
- [Zhao et al. ICME20] Stacked Convolutional Deep Encoding Network For Video-Text Retrieval. ICME, 2020. [paper]
- [Luo et al. ARXIV20] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]
2019
- [Dong et al. CVPR19] Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
- [Song et al. CVPR19] Polysemous visual-semantic embedding for cross-modal retrieval. CVPR, 2019. [paper]
- [Wray et al. ICCV19] Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
- [Xiong et al. ICCV19] A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
- [Li et al. ACMMM19] W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
- [Liu et al. BMVC19] Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
- [Choi et al. BigMM19] From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]
2018
- [Dong et al. TMM18] Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
- [Zhang et al. ECCV18] Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
- [Yu et al. ECCV18] A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
- [Shao et al. ECCV18] Find and focus: Retrieve and localize video events with natural language queries. ECCV, 2018. [paper]
- [Mithun et al. ICMR18] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
- [Miech et al. arXiv18] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]
Before
- [Yu et al. CVPR17] End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR, 2017. [paper] [code]
- [Otani et al. ECCVW16] Learning joint representations of videos and sentences with web image search. ECCV Workshop, 2016. [paper]
- [Xu et al. AAAI15] Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, 2015. [paper]
Ad-hoc Video Search
Other Related
- [Rouditchenko et al. INTERSPEECH21] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech, 2021. [paper] [code]
- [Li et al. arXiv20] Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]
Datasets
- [MSVD] Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
- [MSRVTT] Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
- [TGIF] Li et al. TGIF: A new dataset and benchmark on animated GIF description. CVPR, 2016. [paper] [homepage]
- [AVS] Awad et al. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
- [LSMDC] Rohrbach et al. Movie description. IJCV, 2017. [paper] [dataset]
- [ActivityNet Captions] Krishna et al. Dense-captioning events in videos. ICCV, 2017. [paper] [dataset]
- [DiDeMo] Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
- [HowTo100M] Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
- [VATEX] Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]
Licenses
To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.