Cross-media analysis and reasoning: advances and directions

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO

Front. Inform. Technol. Electron. Eng., 2017, 18(1): 44-57. DOI: 10.1631/FITEE.1601787
Review

Abstract

Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
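Of the seven areas above, cross-media uniform representation (1) is the most algorithmically concrete: features from different modalities are projected into one shared space where paired samples align, enabling cross-media retrieval. As a minimal illustrative sketch only, the snippet below uses canonical correlation analysis, a classical method for learning such a shared space; the synthetic features, dimensions, and retrieval step are assumptions for illustration, not the survey's actual pipeline.

```python
# Sketch of cross-media uniform representation via canonical correlation
# analysis (CCA). Feature extractors, dimensions, and data are illustrative
# assumptions; real systems would use, e.g., CNN image features and text
# topic/embedding vectors for paired documents.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Paired features for the same 200 documents, one row per document.
img_feats = rng.standard_normal((200, 128))  # image view (synthetic stand-in)
txt_feats = rng.standard_normal((200, 64))   # text view (synthetic stand-in)

# Project both modalities into a shared 10-dimensional space in which
# paired image/text samples are maximally correlated.
cca = CCA(n_components=10)
img_shared, txt_shared = cca.fit_transform(img_feats, txt_feats)

# Cross-media retrieval: rank text items for an image query by cosine
# similarity in the shared space.
query = img_shared[0]
sims = txt_shared @ query / (
    np.linalg.norm(txt_shared, axis=1) * np.linalg.norm(query) + 1e-12
)
ranking = np.argsort(-sims)
print("top-5 text matches for image 0:", ranking[:5])
```

Deep variants replace the two linear projections with neural networks trained on the same correlation objective, but the shared-space idea is unchanged.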

Keywords

Cross-media analysis / Cross-media reasoning / Cross-media applications

Cite this article

Yu-xin PENG, Wen-wu ZHU, Yao ZHAO, Chang-sheng XU, Qing-ming HUANG, Han-qing LU, Qing-hua ZHENG, Tie-jun HUANG, Wen GAO. Cross-media analysis and reasoning: advances and directions. Front. Inform. Technol. Electron. Eng., 2017, 18(1): 44-57. DOI: 10.1631/FITEE.1601787



RIGHTS & PERMISSIONS

Zhejiang University and Springer-Verlag Berlin Heidelberg


Supplementary files

FITEE-0044-17004-YXP_suppl_1

FITEE-0044-17004-YXP_suppl_2
