Paper Routing for Visual Grounding

Visual Grounding
Referring Expressions
Phrase Grounding

Reprensentation Approach

几个在VG任务中的主流视觉backbone

rpn
maskrcnn
retinanet(fpn)
Vit
DETR

文本表示的编码方式/编码器模型

word2vec [File]
bert

VG paper routing

Karpathy, Andrej, Armand Joulin, and Li F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. Advances in neural information processing systems. 2014. [Paper]

RNN类方法

Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. Method name: Neural Talk. [Paper] [Code] [Torch Code] [Website]

1
2
3

RPN作为视觉backbone+BiRNN编码文本，前19个region和karpathy分割的snippets（phrase）映射到同一长度vector后进行相似度计算S，max(0,S)以衡量整个图片与句子的相似程度。

* 整体是用的retrieval的baseline，类似于SCAN等retrival任务的特征处理方式

Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

Method name: Spatial Context Recurrent
ConvNet (SCRC)* [Paper] [Code] [Website]

1
2
3

文本提首先进入一个embedding层
CNN同样提取global contextual feature和 local feature，
LSTM 获取local 和 global信息（两个单元），local处理[x box ,x spatial]

Mao, Junhua, et al. Generation and comprehension of unambiguous object descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Wang, Liwei, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. [Paper] [Code]
Yu, Licheng, et al. Modeling context in referring expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper][Code]
Nagaraja, Varun K., Vlad I. Morariu, and Larry S. Davis. Modeling context between objects for referring expression understanding. European Conference on Computer Vision. Springer, Cham, 2016.[Paper] [Code]
Rohrbach, Anna, et al. Grounding of textual phrases in images by reconstruction. European Conference on Computer Vision. Springer, Cham, 2016. Method Name: GroundR [Paper] [Tensorflow Code] [Torch Code]
Wang, Mingzhe, et al. Structured matching for phrase localization. European Conference on Computer Vision. Springer, Cham, 2016. Method name: Structured Matching [Paper] [Code]
Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Website]
Fukui, Akira et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. EMNLP (2016). Method name: MCB [Paper][Code]
Endo, Ko, et al. An attention-based regression model for grounding textual phrases in images. Proc. IJCAI. 2017. [Paper]
Chen, Kan, et al. MSRC: Multimodal spatial regression with semantic context for phrase grounding. International Journal of Multimedia Information Retrieval 7.1 (2018): 17-28. [Paper -Springer Link]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Yu, Licheng, et al. A joint speakerlistener-reinforcer model for referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code][Website]
Hu, Ronghang, et al. Modeling relationships in referential expressions with compositional modular networks. Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017. [Paper] [Code]
Luo, Ruotian, and Gregory Shakhnarovich. Comprehension-guided referring expressions. Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017. [Paper] [Code]
Liu, Jingyu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. Proceedings of CVPR. 2017. [Paper]
Xiao, Fanyi, Leonid Sigal, and Yong Jae Lee. Weakly-supervised visual grounding of phrases with linguistic structures. arXiv preprint arXiv:1705.01371 (2017). [Paper]
Plummer, Bryan A., et al. Phrase localization and visual relationship detection with comprehensive image-language cues. Proc. ICCV. 2017. [Paper] [Code]
Chen, Kan, Rama Kovvuri, and Ram Nevatia. Query-guided regression network with context policy for phrase grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017. Method name: QRC [Paper] [Code]
Liu, Chenxi, et al. Recurrent Multimodal Interaction for Referring Image Segmentation. ICCV. 2017. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Li, Xiangyang, and Shuqiang Jiang. Bundled Object Context for Referring Expressions. IEEE Transactions on Multimedia (2018). [Paper ieee link]
Yu, Zhou, et al. Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding. arXiv preprint arXiv:1805.03508 (2018). [Paper] [Code]
Yu, Licheng, et al. Mattnet: Modular attention network for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. [Paper] [Code] [Website]
Deng, Chaorui, et al. Visual Grounding via Accumulated Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper]
Li, Ruiyu, et al. Referring image segmentation via recurrent refinement networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.[Paper] [Code]
Zhang, Yundong, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining. arXiv preprint arXiv:1808.00265 (2018). [Paper]
Chen, Kan, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. arXiv preprint arXiv:1803.03879 (2018). [Paper] [Code]
Zhang, Hanwang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code]
Cirik, Volkan, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. arXiv preprint arXiv:1805.10547 (2018).[Paper] [Code]
Margffoy-Tuay, Edgar, et al. Dynamic multimodal instance segmentation guided by natural language queries. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Shi, Hengcan, et al. Key-word-aware network for referring expression image segmentation. Proceedings of the European Conference on Computer Vision (ECCV). 2018.[Paper] [Code]
Plummer, Bryan A., et al. Conditional image-text embedding networks. Proceedings of the European Conference on Computer Vision (ECCV). 2018. [Paper] [Code]
Akbari, Hassan, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding. arXiv preprint arXiv:1811.11683 (2018). [Paper]
Kovvuri, Rama, and Ram Nevatia. PIRC Net: Using Proposal Indexing, Relationships and Context for Phrase Grounding. arXiv preprint arXiv:1812.03213 (2018). [Paper]
Chen, Xinpeng, et al. Real-Time Referring Expression Comprehension by Single-Stage Grounding Network. arXiv preprint arXiv:1812.03426 (2018). [Paper]
Wang, Peng, et al. Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks. arXiv preprint arXiv:1812.04794 (2018). [Paper]
Liu, Daqing, et al. Learning to Assemble Neural Module Tree Networks for Visual Grounding. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2019. [Paper] [Code]
RETRACTED (see #2): Deng, Chaorui, et al. You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding. arXiv preprint arXiv:1902.04213 (2019). [Paper]
Hong, Richang, et al. Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). 2019. [Paper]
Liu, Xihui, et al. Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019. [Paper]
Dogan, Pelin, Leonid Sigal, and Markus Gross. Neural Sequential Phrase Grounding (SeqGROUND). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Datta, Samyak, et al. Align2ground: Weakly supervised phrase grounding guided by image-caption alignment. arXiv preprint arXiv:1903.11649 (2019). (ICCV 2019) [Paper]
Fang, Zhiyuan, et al. Modularized textual grounding for counterfactual resilience. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Ye, Linwei, et al. Cross-Modal Self-Attention Network for Referring Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Cross-Modal Relationship Inference for Grounding Referring Expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (CVPR) 2019. [Paper]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Dynamic Graph Attention for Referring Expression Comprehension. arXiv preprint arXiv:1909.08164 (2019). (ICCV 2019) [Paper] [Code]
Wang, Josiah, and Lucia Specia. Phrase Localization Without Paired Training Examples. arXiv preprint arXiv:1908.07553 (2019). (ICCV 2019) [Paper] [Code]
Yang, Zhengyuan, et al. A Fast and Accurate One-Stage Approach to Visual Grounding. arXiv preprint arXiv:1908.06354 (2019). (ICCV 2019) [Paper] [Code]
Sadhu, Arka, Kan Chen, and Ram Nevatia. Zero-Shot Grounding of Objects from Natural Language Queries. arXiv preprint arXiv:1908.07129 (2019).(ICCV 2019) [Paper] [Code]

Disclaimer: I am an author of the paper

Liu, Xuejing, et al. Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. arXiv preprint arXiv:1908.10568 (2019). (ICCV 2019) [Paper] [Code]
Chen, Yi-Wen, et al. Referring Expression Object Segmentation with Caption-Aware Consistency. arXiv preprint arXiv:1910.04748 (2019). (BMVC 2019) [Paper] [Code]
Liu, Jiacheng, and Julia Hockenmaier. Phrase Grounding by Soft-Label Chain Conditional Random Field. arXiv preprint arXiv:1909.00301 (2019) (EMNLP 2019). [Paper] [Code]
Liu, Yongfei, Wan Bo, Zhu Xiaodan and He Xuming. Learning Cross-modal Context Graph for Visual Grounding. arXiv preprint arXiv: (2019) (AAAI-2020). [Paper] [Code]
Yu, Tianyu, et al. Cross-Modal Omni Interaction Modeling for Phrase Grounding. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link] [Code]
Qiu, Heqian, et al. Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension. Proceedings of the 28th ACM International Conference on Multimedia. ACM 2020. [Paper: ACM Link]
Wang, Qinxin, et al. MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding. arXiv preprint arXiv:2010.05379 (2020). [Paper] [Code]
Liao, Yue, et al. A real-time cross-modality correlation filtering method for referring expression comprehension. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper]
Hu, Zhiwei, et al. Bi-directional relationship inferring network for referring image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]
Yang, Sibei, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]
Luo, Gen, et al. Multi-task collaborative network for joint referring expression comprehension and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [Paper] [Code]
Gupta, Tanmay, et al. Contrastive learning for weakly supervised phrase grounding. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]
Yang, Zhengyuan, et al. Improving one-stage visual grounding by recursive sub-query construction. Proceedings of the European Conference on Computer Vision (ECCV). 2020. [Paper] [Code]
Wang, Liwei, et al. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]
Sun, Mingjie, Jimin Xiao, and Eng Gee Lim. Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]
Liu, Haolin, et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]
Liu, Yongfei, et al. Relation-aware Instance Refinement for Weakly Supervised Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper] [Code]
Lin, Xiangru, Guanbin Li, and Yizhou Yu. Scene-Intuitive Agent for Remote Embodied Visual Grounding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2021. [Paper]
Sun, Mingjie, et al. Discriminative triad matching and reconstruction for weakly referring expression grounding. IEEE transactions on pattern analysis and machine intelligence (TPAMI 2021). [Paper] [Code]
Mu, Zongshen, et al. Disentangled Motif-aware Graph Learning for Phrase Grounding. arXiv preprint arXiv:2104.06008 (AAAI 2021). [Paper]
Chen, Long, et al. Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding. arXiv preprint arXiv:2009.01449 (AAAI-2021). [Paper] [Code]
Deng, Jiajun, et al. TransVG: End-to-End Visual Grounding with Transformers. arXiv preprint arXiv:2104.08541 (2021). [Paper] [Unofficial Code]
Du, Ye, et al. Visual Grounding with Transformers. arXiv preprint arXiv:2105.04281 (2021). [Paper]
Kamath, Aishwarya, et al. MDETR–Modulated Detection for End-to-End Multi-Modal Understanding. arXiv preprint arXiv:2104.12763 (2021). [Paper]

Natural Language Object Retrieval (Images)

Guadarrama, Sergio, et al. Open-vocabulary Object Retrieval. Robotics: science and systems. Vol. 2. No. 5. 2014. [Paper] [Code]
Hu, Ronghang, et al. Natural language object retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. Method name: Spatial Context Recurrent ConvNet (SCRC) [Paper] [Code] [Website]
Wu, Fan et al. An End-to-End Approach to Natural Language Object Retrieval via Context-Aware Deep Reinforcement Learning. CoRR abs/1703.07579 (2017): n. pag. [Paper] [Code]
Li, Jianan, et al. Deep attribute-preserving metric learning for natural language object retrieval. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017. [Paper: ACM Link]
Nguyen, Anh, et al. Object Captioning and Retrieval with Natural Language. arXiv preprint arXiv:1803.06152 (2018). [Paper] [Website]
Plummer, Bryan A., et al. Open-vocabulary Phrase Detection. arXiv preprint arXiv:1811.07212 (2018). [Paper] [Code]

Grounding Relations / Referring Relations

Krishna, Ranjay, et al. Referring relationships. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. [Paper] [Code] [Website]
Raboh, Moshiko et al. Differentiable Scene Graphs. (2019). [Paper]
Conser, Erik, et al. Revisiting Visual Grounding. arXiv preprint arXiv:1904.02225 (2019).
[Paper]
- Critique of Referring Relationship paper

Grounded Description (Image) (WIP)

Hendricks, Lisa Anne, et al. Generating visual explanations. European Conference on Computer Vision. Springer, Cham, 2016. [Paper] [Code] [Pytorch Code]
Jiang, Ming, et al. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. arXiv preprint arXiv:1909.02050 (2019). (EMNLP 2019) [Paper] [Code]
Lee, Jason, Kyunghyun Cho, and Douwe Kiela. Countering language drift via visual grounding. arXiv preprint arXiv:1909.04499 (2019). (EMNLP 2019) [Paper]