Home

Awesome

Awesome Visual Grounding

A curated list of research papers in grounding. Link to the code if available is also present.

Have a look at SCOPE.md to get familiar with what grounding means and the tasks considered in this repository.

To maintaing the quality of the repo, I have gone through all the listed papers at least once before adding them to ensure their relevance to grounding. However, I might have missed some paper(s) or added some irrelevant paper(s). Feel free to open an issue in that case. I will go through the paper and then add / remove it.

Table of Contents

Contributing

Feel free to contact me via email (ark.sadhu2904@gmail.com) or open an issue or submit a pull request. To add a new paper via pull request:

  1. Fork the repo, change readme. Put the new paper under the correct heading, and place it at the correct chronological position.
  2. Copy its reference in MLA format
  3. Put ** around the title
  4. Provide link to the paper (arxiv/semantic scholar/conference proceedings).
  5. If code or website exists, link that too.
  6. Send a pull request. Ideally, I will review the request within a week.

Demos

  1. MATTNet demo: http://vision2.cs.unc.edu/refer/comprehension

Other Compilations:

Shoutout to some other awesome stuff on vision and language grounding:

  1. Multi-modal Reading List by Paul Liang (@pliang279) : https://github.com/pliang279/awesome-multimodal-ml/
  2. Temporal Grounding by Mu Ketong (@iworldtong): https://github.com/iworldtong/Awesome-Grounding-Natural-Language-in-Video
  3. Temporal Grounding by WuJie (@WuJie1010): https://github.com/WuJie1010/Awesome-Temporally-Language-Grounding. Also, checkout their implementation of some of the popular papers: https://github.com/WuJie1010/Temporally-language-grounding

Datasets

Image Grounding Datasets

  1. Flickr30k: Plummer, Bryan A., et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of the IEEE international conference on computer vision. 2015. [Paper] [Code] [Website]

  2. RefClef: Kazemzadeh, Sahar, et al. Referitgame: Referring to objects in photographs of natural scenes. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014. [Paper] [Website]

  3. RefCOCOg: Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  4. Visual Genome: Krishna, Ranjay, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123.1 (2017): 32-73. [Paper] [Website]

  5. RefCOCO and RefCOCO+: 1. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  6. GuessWhat: De Vries, Harm, et al. Guesswhat?! visual object discovery through multi-modal dialogue. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. [Paper] [Code] [Website]

  7. Clevr-ref+: Liu, Runtao, et al. Clevr-ref+: Diagnosing visual reasoning with referring expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code] [Website]

  8. Talk2Car: Deruyttere, Thierry, et al. Talk2Car: Taking Control of Your Self-Driving Car Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. [Paper] [Website] [Code]

  9. KB-Ref: Wang, Peng, et al. Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge. Proceedings of the 28th ACM International Conference on Multimedia. 2020. [Paper] [Code]

  10. Ref-Reasoning: Yang, Sibei, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code] [Website]

  11. Cops-Ref: Chen, Zhenfang, et al. Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  12. SUNRefer: Liu, Haolin, et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code] [Website]

Video Datasets

  1. TaCoS: Regneri, Michaela, et al. Grounding action descriptions in videos. Transactions of the Association of Computational Linguistics 1 (2013): 25-36. [Paper] [Website]

  2. Charades: Sigurdsson, Gunnar A., et al. Hollywood in homes: Crowdsourcing data collection for activity understanding. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Website]

  3. Charades-STA: Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017).[Paper] [Code]

  4. Distinct Describable Moments (DiDeMo): Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code] [Website]

  5. ActivityNet Captions: Krishna, Ranjay, et al. Dense-captioning events in videos. Proceedings of the IEEE International Conference on Computer Vision. 2017. [Paper] [Website]

  6. Charades-Ego: [Website]

    • Sigurdsson, Gunnar, et al. Actor and Observer: Joint Modeling of First and Third-Person Videos. CVPR-IEEE Conference on Computer Vision & Pattern Recognition. 2018. [Paper] [Code]
    • Sigurdsson, Gunnar A., et al. "Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos." arXiv preprint arXiv:1804.09626 (2018). [Paper] [Code]
  7. TEMPO: Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

  8. ActivityNet-Entities: Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Embodied Agents Platforms:

  1. Matterport3D: Chang, Angel, et al. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158 (2017). [Paper] [Code] [Website]

    • Photorealistic rooms
  2. AI2-THOR: Kolve, Eric, et al. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017). [Paper] [Website]

    • Actionable objects!
  3. Habitat AI: Savva, Manolis, et al. Habitat: A platform for embodied ai research. Proceedings of the IEEE International Conference on Computer Vision. 2019. (ICCV 2019) [Paper] [Website]

Paper Roadmap (Chronological Order):

Visual Grounding / Referring Expressions (Images):

  1. Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]

  2. Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]

  3. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  4. Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  5. Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]

  6. Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]

  7. Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]

  8. Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]

  9. Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]

  10. Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]

  11. Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]

  12. Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]

  13. Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]

  14. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  15. Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]

  16. Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]

  17. Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]

  18. Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]

  19. Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]

  20. Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]

  21. Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]

  22. Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]

  23. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  24. Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]

  25. Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]

  26. Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]

  27. Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]

  28. Li, Ruiyu, et al. Referring image segmentation via recurrent refinement networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper] [Code]

  29. Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]

  30. Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]

  31. Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]

  32. Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]

  33. Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  34. Shi, Hengcan, et al. Key-word-aware network for referring expression image segmentation. Proceedings of the European Conference on Computer Vision (ECCV). 2018.[Paper] [Code]

  35. Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]

  36. Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]

  37. Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]

  38. Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]

  39. Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]

  40. Liu, Daqing, et al. Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019. [Paper] [Code]

  41. RETRACTED (see #2): Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]

  42. Hong, Richang, et al. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). 2019. [Paper]

  43. Liu, Xihui, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]

  44. Dogan, Pelin, Leonid Sigal, and Markus Gross. Neural Sequential Phrase Grounding (SeqGROUND). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  45. Datta, Samyak, et al. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. arXiv preprint arXiv:1903.11649 (2019). (ICCV 2019) [Paper]

  46. Fang, Zhiyuan, et al. Modularized textual grounding for counterfactual resilience. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  47. Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  48. Yang, Sibei, Guanbin Li, and Yizhou Yu. Cross-Modal Relationship Inference for Grounding Referring Expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]

  49. Yang, Sibei, Guanbin Li, and Yizhou Yu. Dynamic Graph Attention for Referring Expression Comprehension. arXiv preprint arXiv:1909.08164 (2019). (ICCV 2019) [Paper] [Code]

  50. Wang, Josiah, and Lucia Specia. Phrase Localization Without Paired Training Examples. arXiv preprint arXiv:1908.07553 (2019). (ICCV 2019) [Paper] [Code]

  51. Yang, Zhengyuan, et al. A Fast and Accurate One-Stage Approach to Visual Grounding. arXiv preprint arXiv:1908.06354 (2019). (ICCV 2019) [Paper] [Code]

  52. Sadhu, Arka, Kan Chen, and Ram Nevatia. Zero-Shot Grounding of Objects from Natural Language Queries. arXiv preprint arXiv:1908.07129 (2019).(ICCV 2019) [Paper] [Code] Disclaimer: I am an author of the paper

  53. Liu, Xuejing, et al. Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. arXiv preprint arXiv:1908.10568 (2019). (ICCV 2019) [Paper] [Code]

  54. Chen, Yi-Wen, et al. Referring Expression Object Segmentation with Caption-Aware Consistency. arXiv preprint arXiv:1910.04748 (2019). (BMVC 2019) [Paper] [Code]

  55. Liu, Jiacheng, and Julia Hockenmaier. Phrase Grounding by Soft-Label Chain Conditional Random Field. arXiv preprint arXiv:1909.00301 (2019) (EMNLP 2019). [Paper] [Code]

  56. Liu, Yongfei, Wan Bo, Zhu Xiaodan and He Xuming. Learning Cross-modal Context Graph for Visual Grounding. arXiv preprint arXiv: (2019) (AAAI-2020). [Paper] [Code]

  57. Yu, Tianyu, et al. Cross-Modal Omni Interaction Modeling for Phrase Grounding. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link] [Code]

  58. Qiu, Heqian, et al. Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link]

  59. Wang, Qinxin, et al. MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding. arXiv preprint arXiv:2010.05379 (2020). [Paper] [Code]

  60. Liao, Yue, et al. A real-time cross-modality correlation filtering method for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper]

  61. Hu, Zhiwei, et al. Bi-directional relationship inferring network for referring image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  62. Yang, Sibei, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  63. Luo, Gen, et al. Multi-task collaborative network for joint referring expression comprehension and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]

  64. Gupta, Tanmay, et al. Contrastive learning for weakly supervised phrase grounding. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]

  65. Yang, Zhengyuan, et al. Improving one-stage visual grounding by recursive sub-query construction. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]

  66. Wang, Liwei, et al. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]

  67. Sun, Mingjie, Jimin Xiao, and Eng Gee Lim. Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  68. Liu, Haolin, et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  69. Liu, Yongfei, et al. Relation-aware Instance Refinement for Weakly Supervised Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]

  70. Lin, Xiangru, Guanbin Li, and Yizhou Yu. Scene-Intuitive Agent for Remote Embodied Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]

  71. Sun, Mingjie, et al. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE transactions on pattern analysis and machine intelligence (TPAMI 2021). [Paper] [Code]

  72. Mu, Zongshen, et al. Disentangled Motif-aware Graph Learning for Phrase Grounding. arXiv preprint arXiv:2104.06008 (AAAI 2021). [Paper]

  73. Chen, Long, et al. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. arXiv preprint arXiv:2009.01449 (AAAI-2021). [Paper] [Code]

  74. Deng, Jiajun, et al. TransVG: End-to-End Visual Grounding with Transformers. arXiv preprint arXiv:2104.08541 (2021). [Paper] [Unofficial Code]

  75. Du, Ye, et al. Visual Grounding with Transformers. arXiv preprint arXiv:2105.04281 (2021). [Paper]

  76. Kamath, Aishwarya, et al. MDETR--Modulated Detection for End-to-End Multi-Modal Understanding. arXiv preprint arXiv:2104.12763 (2021). [Paper]

  77. Cho, Junhyeong, et al. Collaborative Transformers for Grounded Situation Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2022. [Paper] [Code] [Website]

Natural Language Object Retrieval (Images)

  1. Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]

  2. Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]

  3. Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]

  4. Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]

  5. Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]

  6. Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Grounding Relations / Referring Relations

  1. Krishna, Ranjay, et al. Referring relationships. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code] [Website]

  2. Raboh, Moshiko et al. Differentiable Scene Graphs. (2019). [Paper]

  3. Conser, Erik, et al. Revisiting Visual Grounding. arXiv preprint arXiv:1904.02225 (2019). [Paper]

    • Critique of Referring Relationship paper

Video Grounding (Activity Localization) using Natural Language:

  1. Yu, Haonan, et al. Grounded Language Learning from Video Described with Sentences Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2013. [Paper]

  2. Xu, Ran, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. Proceedings of the AAAI Conference on Artificial Intelligence. 2015. [Paper]

  3. Song, Young Chol, et al. Unsupervised Alignment of Actions in Video with Text Descriptions Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2016. [Paper]

  4. Gao, Jiyang, et al. Tall: Temporal activity localization via language query. arXiv preprint arXiv:1705.02101 (2017). Method name: TALL [Paper] [Code]

  5. Hendricks, Lisa Anne, et al. Localizing moments in video with natural language. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: MCN [Paper] [Code]

  6. Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. Video Object Segmentation with Language Referring Expressions. arXiv preprint arXiv:1803.08006 (2018). [Paper] [Website]

  7. Xu, Huijuan, et al. Joint Event Detection and Description in Continuous Video Streams. arXiv preprint arXiv:1802.10250 (2018). [Paper] [Code]

  8. Xu, Huijuan, et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning. arXiv preprint arXiv:1804.05113 (2018). [Paper] [Code]

  9. Liu, Bingbin, et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos. European Conference on Computer Vision. Springer, Cham, 2018. [Paper] [Website]

  10. Liu, Meng, et al. Attentive Moment Retrieval in Videos. Proceedings of the International ACM SIGIR Conference . 2018. [Paper] [Website]

  11. Chen, Jingyuan, et al. Temporally grounding natural sentence in video. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. [Paper]

  12. Hendricks, Lisa Anne, et al. Localizing Moments in Video with Temporal Language. arXiv preprint arXiv:1809.01337 (2018). [Paper] [Code] [Website]

  13. Wu, Aming, and Yahong Han. Multi-modal Circulant Fusion for Video-to-Language and Backward. IJCAI. Vol. 3. No. 4. 2018. [Paper] [Code]

  14. Ge, Runzhou, et al. MAC: Mining Actiivity Concepts for Language-based Temporal Localization. arXiv preprint arXiv:1811.08925 (2018). [Paper] [Code]

  15. Zhang, Da, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. arXiv preprint arXiv:1812.00087 (2018). [Paper]

  16. He, Dongliang, et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Proceedings of the AAAI Conference on Artificial Intelligence. 2019. [Paper]

  17. Wang, Weining, Yan Huang, and Liang Wang. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]

  18. Ghosh, Soham, et al. ExCL: Extractive Clip Localization Using Natural Language Descriptions. arXiv preprint arXiv:1904.02755 (2019). [Paper]

  19. Chen, Shaoxiang, and Yu-Gang Jiang. Semantic Proposal For Activity Localization In Videos Via Sentence Query. Proceedings of the AAAI Conference on Artificial Intelligence. 2019.[Paper]

  20. Yuan Y, Mei T, Zhu W. To Find Where You Talk: Temporal Sentence Localization In Video With Attention Based Location Regression. Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 9159-9166. [Paper]

  21. Mithun N C, Paul S, Roy-Chowdhury A K. Weakly Supervised Video Moment Retrieval From Text Queries. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 11592-11601.[Paper]

  22. Escorcia, Victor, et al. Temporal Localization of Moments in Video Collections with Natural Language. arXiv preprint arXiv:1907.12763 (2019). (ICCV 2019) [Paper] [Code]

  23. Wang, Jingwen, Lin Ma, and Wenhao Jiang. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction. arXiv preprint arXiv:1909.05010 (2019). (AAAI 2020) [Paper] [Code]

Grounded Description (Image) (WIP)

  1. Hendricks, Lisa Anne, et al. Generating visual explanations. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Pytorch Code]

  2. Jiang, Ming, et al. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. arXiv preprint arXiv:1909.02050 (2019). (EMNLP 2019) [Paper] [Code]

  3. Lee, Jason, Kyunghyun Cho, and Douwe Kiela. Countering language drift via visual grounding. arXiv preprint arXiv:1909.04499 (2019). (EMNLP 2019) [Paper]

Grounded Description (Video) (WIP)

  1. Ma, Chih-Yao, et al. Grounded Objects and Interactions for Video Captioning. arXiv preprint arXiv:1711.06354 (2017). [Paper]

  2. Zhou, Luowei, et al. Grounded video description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper] [Code]

Visual Grounding Pretraining

  1. Sun, Chen, et al. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766 (2019). [Paper]

  2. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv preprint arXiv:1908.02265 (Neurips 2019) [Paper] [Code]

  3. Li, Liunian Harold, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557 (2019). [Paper] [Code]

  4. Li, Gen, et al. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. arXiv preprint arXiv:1908.06066 (2019). [Paper]

  5. Tan, Hao, and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019). [Paper] [Code]

  6. Su, Weijie, et al. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019). [Paper]

  7. Chen, Yen-Chun, et al. UNITER: Learning UNiversal Image-TExt Representations. arXiv preprint arXiv:1909.11740 (2019). [Paper]

  8. Li Liunian Harold, Pengchuan Zhang, Haotian Zhang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021). [Paper] [Code]

Visual Grounding in 3D

  1. Chen, Dave Zhenyu, Angel X. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer International Publishing, 2020. [Paper] [Code]

  2. Achlioptas, Panos, et al. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. European Conference on Computer Vision. Springer, Cham, 2020. [Website]

  3. Yuan, Zhihao, et al. Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring. arXiv preprint arXiv:2103.01128 (2021). [Paper] [Code]

  4. Rozenberszki, David, et al. Language-Grounded Indoor 3D Semantic Segmentation in the Wild. European Conference on Computer Vision. Springer, Tel Aviv, 2022. [Website] [Paper] [Code]

Grounding for Embodied Agents (WIP):

  1. Shridhar, Mohit, et al. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. arXiv preprint arXiv:1912.01734 (2019). [Paper] [Code] [Website]

Misc:

  1. Han, Xudong, Philip Schulz, and Trevor Cohn. Grounding learning of modifier dynamics: An application to color naming. arXiv preprint arXiv:1909.07586 (2019). (EMNLP 2019) [Paper] [Code]

  2. Yu, Xintong, et al. What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues. arXiv preprint arXiv:1909.00421 (2019). (EMNLP 2019) [Paper] [Code]