A deep dense captioning framework with joint localization and contextual reasoning
Rui Kong , Wei Xie
Journal of Central South University, 2021, Vol. 28, Issue 9: 2801–2813.
Dense captioning aims to simultaneously localize and describe regions-of-interest (RoIs) in images in natural language. We identify three key problems: 1) RoIs are dense and highly overlapping, making accurate localization of each target region challenging; 2) some target regions are visually ambiguous and hard to recognize from appearance alone; 3) a sufficiently deep image representation is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module, and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
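The first challenge above, dense and highly overlapping RoIs, is conventionally quantified in region-proposal pipelines with intersection-over-union (IoU) overlap tests, with near-duplicate candidates pruned by non-maximum suppression (NMS). The sketch below is a generic illustration of that standard idea, not the paper's joint localization module; all function names are hypothetical.

```python
# Generic sketch: IoU overlap and greedy NMS for dense, overlapping RoI
# candidates. Boxes are (x1, y1, x2, y2) tuples in pixel coordinates.
# This illustrates the localization difficulty, not the proposed module.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box in each cluster of boxes whose
    pairwise IoU exceeds thresh; return indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

When RoIs overlap heavily, as in dense captioning, a fixed NMS threshold must trade off between suppressing genuine neighboring regions and keeping duplicates, which is one motivation for learning localization jointly with the captioning task.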
Keywords: dense captioning; joint localization; contextual reasoning; deep convolutional neural network