Cross-media analysis and reasoning: advances and directions

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO

Front. Inform. Technol. Electron. Eng., 2017, Vol. 18, Issue 1: 44-57. DOI: 10.1631/FITEE.1601787
Review

Abstract

Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
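To make the first of these directions concrete: the classical starting point for cross-media uniform representation is canonical correlation analysis (CCA; Hotelling, 1936), which maps two media types (e.g., image features and text features describing the same items) into one shared space where they are maximally correlated; the deep and cluster-based variants surveyed in this review build on the same objective. The sketch below is a minimal illustration of that classical formulation, not an algorithm from the paper; the toy data and feature dimensions are hypothetical.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Classical canonical correlation analysis (Hotelling, 1936).

    Finds projections Wx, Wy such that X @ Wx and Y @ Wy are maximally
    correlated -- the simplest uniform representation for two media types.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view and cross-view covariances.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whitening transforms Kx, Ky with Kx.T @ Cxx @ Kx = I (via Cholesky).
    Kx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Ky = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # Top singular directions of the whitened cross-covariance give the
    # canonical directions; the singular values are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx.T @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

# Hypothetical toy data: 100 paired items, 64-d "image" and 32-d "text"
# features generated from 4 shared latent factors plus noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 4))
X = Z @ rng.normal(size=(4, 64)) + 0.1 * rng.normal(size=(100, 64))
Y = Z @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(100, 32))
Wx, Wy, corrs = cca(X, Y, k=2)
# X @ Wx and Y @ Wy now live in one shared space; nearest-neighbor search
# across them gives a bare-bones cross-media retrieval baseline.
```

Deep extensions such as deep CCA and the correspondence autoencoder keep this correlation objective but replace the linear projections with neural networks, which is one route to the uniform representation the review discusses.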

Keywords

Cross-media analysis / Cross-media reasoning / Cross-media applications

Cite this article

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO. Cross-media analysis and reasoning: advances and directions. Front. Inform. Technol. Electron. Eng., 2017, 18(1): 44-57. https://doi.org/10.1631/FITEE.1601787


RIGHTS & PERMISSIONS

© 2017 Zhejiang University and Springer-Verlag Berlin Heidelberg