Cross-media analysis and reasoning: advances and directions
Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO
Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
Cross-media analysis / Cross-media reasoning / Cross-media applications