A deep dense captioning framework with joint localization and contextual reasoning
Rui Kong , Wei Xie
Journal of Central South University, 2021, Vol. 28, Issue 9: 2801–2813.
Dense captioning aims to simultaneously localize and describe regions-of-interest (RoIs) in images in natural language. We identify three key problems: 1) RoIs are dense and highly overlapping, making accurate localization of each target region challenging; 2) some target regions are visually ambiguous and hard to recognize from appearance alone; 3) a sufficiently deep image representation is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module, and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
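The first challenge above, dense and highly overlapping RoIs, is conventionally quantified in region-proposal pipelines with intersection-over-union (IoU) overlap tests, with near-duplicate candidates pruned by non-maximum suppression (NMS). The sketch below is a generic illustration of that standard idea, not the paper's joint localization module; all function names are hypothetical.

```python
# Generic sketch: IoU overlap and greedy NMS for dense, overlapping RoI
# candidates. Boxes are (x1, y1, x2, y2) tuples in pixel coordinates.
# This illustrates the localization difficulty, not the proposed module.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box in each cluster of boxes whose
    pairwise IoU exceeds thresh; return indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

When RoIs overlap heavily, as in dense captioning, a fixed NMS threshold must trade off between suppressing genuine neighboring regions and keeping duplicates, which is one motivation for learning localization jointly with the captioning task.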
Keywords: dense captioning; joint localization; contextual reasoning; deep convolutional neural network