A survey of deep learning-based visual question answering
Tong-yuan Huang, Yu-ling Yang, Xue-jiao Yang
Journal of Central South University, 2021, Vol. 28, Issue (3): 728-746.
With the rise and continued development of machine learning, and of deep learning in particular, research on visual question answering (VQA) has made significant progress; the field carries both theoretical research significance and practical application value. It is therefore worthwhile to summarize the current state of research and provide a reference for researchers in this field. This article presents a detailed and in-depth analysis and summary of relevant research and typical methods in visual question answering. First, background knowledge on VQA was introduced. Second, the issues and challenges of visual question answering were discussed, together with promising directions for particular methodologies. Third, the key sub-problems affecting visual question answering were summarized and analyzed. Then, the commonly used datasets and evaluation metrics were reviewed. Next, the popular algorithms and models in VQA research were compared and summarized. Finally, future development trends of visual question answering were discussed and conclusions drawn.
Keywords: computer vision / natural language processing / visual question answering / deep learning / attention mechanism
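To make the attention mechanism named in the keywords concrete: a typical VQA model scores each image region against the question embedding, normalizes the scores with a softmax, and fuses the attention-weighted visual feature with the question for answer prediction. The following is a minimal toy sketch of that scoring-and-pooling step only; the function names, dimensions, and feature values are illustrative assumptions, not taken from the surveyed models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question_vec, region_feats):
    """Score each image region by dot product with the question
    embedding, then return the attention-weighted sum of regions."""
    scores = [sum(q * r for q, r in zip(question_vec, region))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    return [sum(w * region[d] for w, region in zip(weights, region_feats))
            for d in range(dim)]

# Toy example: a 4-d question embedding attending over 3 image regions.
q = [1.0, 0.0, 0.0, 0.0]
regions = [[0.1, 0.2, 0.0, 0.0],
           [2.0, 0.0, 0.5, 0.0],
           [0.3, 0.1, 0.0, 0.9]]
attended = attend(q, regions)  # pooled visual feature, dominated by region 1
```

In full models the dot-product scoring is replaced by a learned projection (or bilinear pooling), and the pooled feature is fused with the question vector before a classifier over candidate answers; this sketch shows only the soft-attention pooling common to those designs.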