Awesome-Temporal-Sentence-Grounding-in-Videos
<p align="center"> <img width="250" src="https://camo.githubusercontent.com/1131548cf666e1150ebd2a52f44776d539f06324/68747470733a2f2f63646e2e7261776769742e636f6d2f73696e647265736f726875732f617765736f6d652f6d61737465722f6d656469612f6c6f676f2e737667" alt="Awesome!"> </p>

A curated list of resources on grounding natural language in video and related areas. :-)
Introduction
This area covers two main tasks:
- Temporal Activity Localization by Language: given a query that describes an activity, find the start and end times of the corresponding action (event).

  <div align="center"><img height="200px" src="https://res.cloudinary.com/dzu6x6nqi/image/upload/v1554267644/Awesome%20Language%20Moment%20Retrieval/TALL_-_2.png"></div>
- Spatio-temporal object referring by language: given a query that describes an object or person, find a sequence of consecutive bounding boxes (that is, a tube) through space and time.

  <div align="center"><img width="500px" src="https://res.cloudinary.com/dzu6x6nqi/image/upload/v1554267650/Awesome%20Language%20Moment%20Retrieval/SPRL_-_4.png"></div>
Format
Markdown format:
- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
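
For contributors who prefer to generate entries rather than type them, the format above can be produced with a few lines of Python. This is only a sketch: the helper name `format_entry` and the `example.com` URLs are placeholders invented for illustration, not part of this repository.

```python
from typing import Optional


def format_entry(title: str, paper_url: str, first_author: str,
                 venue: str, code_url: Optional[str] = None) -> str:
    """Render one list entry in the Markdown format described above.

    All argument values are supplied by the contributor; nothing is looked up.
    """
    entry = f"- [{title}]({paper_url}) - {first_author} et al, `{venue}`."
    if code_url:
        # The code link is optional; papers without released code omit it.
        entry += f" [[code]]({code_url})"
    return entry


if __name__ == "__main__":
    # Placeholder values only, mirroring the template line.
    print(format_entry("Paper Name", "https://example.com/paper",
                       "Author 1", "Conference Year",
                       "https://example.com/code"))
```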
Change Log
- 2019/12/16: Add CBP (AAAI 2020)
Table of Contents
Papers
Survey
- None.
Before
- Grounded Language Learning from Video Described with Sentences - H. Yu et al, `ACL 2013`.
- Visual Semantic Search: Retrieving Videos via Complex Textual Queries - Dahua Lin et al, `CVPR 2014`.
- Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework - R. Xu et al, `AAAI 2015`.
- Unsupervised Alignment of Actions in Video with Text Descriptions - Y. C. Song et al, `IJCAI 2016`.
2017
- Localizing Moments in Video with Natural Language - Lisa Anne Hendricks et al, `ICCV 2017`. [code]
- TALL: Temporal Activity Localization via Language Query - Jiyang Gao et al, `ICCV 2017`. [code]
- Spatio-temporal Person Retrieval via Natural Language Queries - M. Yamaguchi et al, `ICCV 2017`. [code]
- Attention-based Natural Language Person Retrieval - Tao Zhou et al, `CVPR 2017`.
- Where to Play: Retrieval of Video Segments using Natural-Language Queries - S. Lee et al, `arXiv 2017`.
2018
- Find and Focus: Retrieve and Localize Video Events with Natural Language Queries - Dian Shao et al, `ECCV 2018`.
- Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos - B. Liu et al, `ECCV 2018`.
- Temporally Grounding Natural Sentence in Video - J. Chen et al, `EMNLP 2018`.
- Localizing Moments in Video with Temporal Language - Lisa Anne Hendricks et al, `EMNLP 2018`.
- Object Referring in Videos with Language and Human Gaze - A. B. Vasudevan et al, `CVPR 2018`. [code]
- Weakly Supervised Dense Event Captioning in Videos - X. Duan et al, `NIPS 2018`.
- Actor and Action Video Segmentation from a Sentence - Kirill Gavrilyuk et al, `CVPR 2018`.
- Attentive Moment Retrieval in Videos - M. Liu et al, `SIGIR 2018`.
2019
- Multilevel Language and Vision Integration for Text-to-Clip Retrieval - H. Xu et al, `AAAI 2019`. [code]
- Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos - Dongliang He et al, `AAAI 2019`.
- To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression - Y. Yuan et al, `AAAI 2019`. [code]
- Semantic Proposal for Activity Localization in Videos via Sentence Query - S. Chen et al, `AAAI 2019`.
- MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment - Da Zhang et al, `CVPR 2019`.
- Weakly Supervised Video Moment Retrieval From Text Queries - N. C. Mithun et al, `CVPR 2019`.
- Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model - W. Wang et al, `CVPR 2019`.
- Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos - Yitian Yuan et al, `NIPS 2019`. [code]
- WSLLN: Weakly Supervised Natural Language Localization Networks - M. Gao et al, `EMNLP 2019`.
- ExCL: Extractive Clip Localization Using Natural Language Descriptions - S. Ghosh et al, `NAACL 2019`.
- Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos - Zhu Zhang et al, `SIGIR 2019`. [code]
- Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention - B. Jiang et al, `ICMR 2019`. [code]
- MAC: Mining Activity Concepts for Language-based Temporal Localization - Runzhou Ge et al, `WACV 2019`. [code]
- Temporal Localization of Moments in Video Collections with Natural Language - V. Escorcia et al, `arXiv 2019`.
- Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention - C. R. Opazo et al, `arXiv 2019`.
- Tripping through time: Efficient Localization of Activities in Videos - Meera Hahn et al, `arXiv 2019`.
- [Related] Localizing Unseen Activities in Video via Image Query - Zhu Zhang et al, `IJCAI 2019`.
2020
- Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction - Jingwen Wang et al, `AAAI 2020`. [code]
Dataset
Benchmark Results
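
The columns in the tables below, such as `R@1 IoU@0.5`, follow the "R@n, IoU=m" convention common in this literature: the percentage of queries for which at least one of the top-n predicted segments overlaps the ground-truth segment with temporal IoU of at least m. The sketch below illustrates that computation. The function names and the toy segments are made up for illustration, and individual papers may differ in details (for example, strict `>` versus `>=` at the threshold), so this should not be read as any paper's official evaluation script.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds: Sequence[Sequence[Segment]],
                gts: Sequence[Segment],
                n: int,
                iou_threshold: float) -> float:
    """Percentage of queries with a top-n prediction whose IoU >= threshold.

    ranked_preds[i] holds the predictions for query i, best first.
    """
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:n]):
            hits += 1
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Toy data: two queries, two ranked predictions each (hypothetical values).
    preds = [[(0.0, 10.0), (5.0, 15.0)],
             [(20.0, 30.0), (0.0, 5.0)]]
    gts = [(5.0, 15.0), (22.0, 32.0)]
    print(recall_at_n(preds, gts, n=1, iou_threshold=0.5))  # 50.0
    print(recall_at_n(preds, gts, n=5, iou_threshold=0.5))  # 100.0
```

In the toy data, the top-1 prediction for the first query has IoU 1/3 with its ground truth, so it only counts as a hit once the second-ranked prediction is admitted (n of 2 or more) or the threshold drops to 0.3.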
ActivityNet Captions
Model | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
---|---|---|---|---|---|---|---|---|---|
MCN | 42.80 | 21.37 | 9.58 | - | - | - | - | - | PB |
CTRL | 49.09 | 28.70 | 14.0 | - | - | - | - | - | PB |
ACRN | 50.37 | 31.29 | 16.17 | - | - | - | - | - | PB |
QSPN | - | 45.3 | 27.7 | 13.6 | - | 75.7 | 59.2 | 38.3 | PB |
TGN | 70.06 | 45.51 | 28.47 | - | 79.10 | 57.32 | 44.20 | - | PB |
SCDM | - | 54.80 | 36.75 | 19.86 | - | 77.29 | 64.99 | 41.53 | PB |
CBP | - | 54.30 | 35.76 | 17.80 | - | 77.63 | 65.89 | 46.20 | PB |
TripNet | - | 48.42 | 32.19 | 13.93 | - | - | - | - | RL |
ABLR | 73.30 | 55.67 | 36.79 | - | - | - | - | - | RL |
ExCL | - | 63.30 | 43.6 | 24.1 | - | - | - | - | PF |
PFGA | 75.25 | 51.28 | 33.04 | 19.26 | - | - | - | - | PF |
WSDEC-X (Weakly) | 62.7 | 42.0 | 23.3 | - | - | - | - | - | |
WSLLN (Weakly) | 75.4 | 42.8 | 22.7 | - | - | - | - | - | |
Charades-STA
Model | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
---|---|---|---|---|---|---|---|---|---|
CTRL | - | - | 23.63 | 8.89 | - | - | 58.92 | 29.52 | PB |
ABLR | - | - | 24.36 | 9.01 | - | - | - | - | PB |
SMRL | - | - | 24.36 | 11.17 | - | - | 61.25 | 32.08 | PB |
ACL-K | - | - | 30.48 | 12.20 | - | - | 64.84 | 35.13 | PB |
SAP | - | - | 27.42 | 13.36 | - | - | 66.37 | 38.15 | PB |
QSPN | - | 54.7 | 35.6 | 15.8 | - | 95.8 | 79.4 | 45.4 | PB |
MAN | - | - | 46.53 | 22.72 | - | - | 86.23 | 53.72 | PB |
SCDM | - | - | 54.44 | 33.43 | - | - | 74.43 | 58.08 | PB |
CBP | - | - | 36.80 | 18.87 | - | - | 70.94 | 50.19 | PB |
TripNet | - | 51.33 | 36.61 | 14.50 | - | - | - | - | RL |
ExCL | - | 65.1 | 44.1 | 23.3 | - | - | - | - | RL |
PFGA | - | 67.53 | 52.02 | 33.74 | - | - | - | - | PF |
DiDeMo
Model | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 |
---|---|---|---|---|---|---|---|---|
TMN | 22.92 | - | - | - | 76.08 | - | - | - |
MCN | 28.10 | - | - | - | 78.21 | - | - | - |
TGN | 28.23 | - | - | - | 79.26 | - | - | - |
MAN | 27.02 | - | - | - | 81.70 | - | - | - |
WSLLN (Weakly) | 19.4 | - | - | - | 54.4 | - | - | - |
TACoS
Model | R@1 IoU@0.1 | R@1 IoU@0.3 | R@1 IoU@0.5 | R@1 IoU@0.7 | R@5 IoU@0.1 | R@5 IoU@0.3 | R@5 IoU@0.5 | R@5 IoU@0.7 | Method |
---|---|---|---|---|---|---|---|---|---|
MCN | 2.62 | 1.64 | 1.25 | - | 2.88 | 1.82 | 1.01 | - | PB |
CTRL | 24.32 | 18.32 | 13.30 | - | 48.73 | 36.69 | 25.42 | - | PB |
TGN | 41.87 | 21.77 | 18.90 | - | 53.40 | 39.06 | 31.02 | - | PB |
ACRN | 24.22 | 19.52 | 14.62 | - | 47.42 | 34.97 | 24.88 | - | PB |
ACL-K | 31.64 | 24.17 | 20.01 | - | 57.85 | 42.15 | 30.66 | - | PB |
SCDM | - | 26.11 | 21.17 | - | - | 40.16 | 32.18 | - | PB |
CBP | - | 27.31 | 24.79 | 19.10 | - | 43.64 | 37.40 | 25.59 | PB |
TripNet | - | 23.95 | 19.17 | 9.52 | - | - | - | - | RL |
SMRL | 26.51 | 20.25 | 15.95 | - | 50.01 | 38.47 | 27.84 | - | RL |
ABLR | 34.7 | 19.5 | 9.4 | - | - | - | - | - | RL |
ExCL | - | 45.5 | 28.0 | 14.6 | - | - | - | - | PF |
Popular Implementations
PyTorch
TensorFlow
- jiyanggao/TALL
- runzhouge/MAC
- BonnieHuangxin/SLTA
- yytzsy/ABLR_code
- yytzsy/SCDM
- JaywongWang/TGN
- JaywongWang/CBP
Others
- None.
Licenses
To the extent possible under law, muketong has waived all copyright and related or neighboring rights to this work.