Cross-media analysis and reasoning: advances and directions

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO

Front. Inform. Technol. Electron. Eng., 2017, Vol. 18, Issue 1: 44-57. DOI: 10.1631/FITEE.1601787
Review

Abstract

Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
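To make the first of these directions concrete: the classical starting point for cross-media uniform representation is canonical correlation analysis (CCA; Hotelling, 1936), which maps two media types (e.g., image features and text features describing the same items) into one shared space where they are maximally correlated; the deep and cluster-based variants surveyed in this review build on the same objective. The sketch below is a minimal illustration of that classical formulation, not an algorithm from the paper; the toy data and feature dimensions are hypothetical.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Classical canonical correlation analysis (Hotelling, 1936).

    Finds projections Wx, Wy such that X @ Wx and Y @ Wy are maximally
    correlated -- the simplest uniform representation for two media types.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view and cross-view covariances.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whitening transforms Kx, Ky with Kx.T @ Cxx @ Kx = I (via Cholesky).
    Kx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Ky = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    # Top singular directions of the whitened cross-covariance give the
    # canonical directions; the singular values are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx.T @ Cxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

# Hypothetical toy data: 100 paired items, 64-d "image" and 32-d "text"
# features generated from 4 shared latent factors plus noise.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 4))
X = Z @ rng.normal(size=(4, 64)) + 0.1 * rng.normal(size=(100, 64))
Y = Z @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(100, 32))
Wx, Wy, corrs = cca(X, Y, k=2)
# X @ Wx and Y @ Wy now live in one shared space; nearest-neighbor search
# across them gives a bare-bones cross-media retrieval baseline.
```

Deep extensions such as deep CCA and the correspondence autoencoder keep this correlation objective but replace the linear projections with neural networks, which is one route to the uniform representation the review discusses.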

Keywords

Cross-media analysis / Cross-media reasoning / Cross-media applications

Cite this article

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO. Cross-media analysis and reasoning: advances and directions. Front. Inform. Technol. Electron. Eng., 2017, 18(1): 44-57. https://doi.org/10.1631/FITEE.1601787


RIGHTS & PERMISSIONS

© 2017 Zhejiang University and Springer-Verlag Berlin Heidelberg