A survey of deep learning-based visual question answering
Tong-yuan Huang, Yu-ling Yang, Xue-jiao Yang
Journal of Central South University, 2021, Vol. 28, Issue (3): 728-746.
With the rise and continued development of machine learning, and of deep learning in particular, research on visual question answering (VQA) has made significant progress; the field carries both theoretical research significance and practical application value. It is therefore worthwhile to summarize the current state of research and provide a reference for researchers in this field. This article presents a detailed and in-depth analysis and summary of relevant research and typical methods in visual question answering. First, background knowledge on VQA was introduced. Second, the issues and challenges of visual question answering were discussed, together with promising directions for particular methodologies. Third, the key sub-problems affecting visual question answering were summarized and analyzed. Then, the commonly used datasets and evaluation metrics were reviewed. Next, the popular algorithms and models in VQA research were compared and summarized. Finally, future development trends of visual question answering were discussed and conclusions drawn.
Keywords: computer vision / natural language processing / visual question answering / deep learning / attention mechanism
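To make the attention mechanism named in the keywords concrete: a typical VQA model scores each image region against the question embedding, normalizes the scores with a softmax, and fuses the attention-weighted visual feature with the question for answer prediction. The following is a minimal toy sketch of that scoring-and-pooling step only; the function names, dimensions, and feature values are illustrative assumptions, not taken from the surveyed models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question_vec, region_feats):
    """Score each image region by dot product with the question
    embedding, then return the attention-weighted sum of regions."""
    scores = [sum(q * r for q, r in zip(question_vec, region))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    return [sum(w * region[d] for w, region in zip(weights, region_feats))
            for d in range(dim)]

# Toy example: a 4-d question embedding attending over 3 image regions.
q = [1.0, 0.0, 0.0, 0.0]
regions = [[0.1, 0.2, 0.0, 0.0],
           [2.0, 0.0, 0.5, 0.0],
           [0.3, 0.1, 0.0, 0.9]]
attended = attend(q, regions)  # pooled visual feature, dominated by region 1
```

In full models the dot-product scoring is replaced by a learned projection (or bilinear pooling), and the pooled feature is fused with the question vector before a classifier over candidate answers; this sketch shows only the soft-attention pooling common to those designs.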