Towards human-like and transhuman perception in AI 2.0: a review
Yong-hong TIAN, Xi-lin CHEN, Hong-kai XIONG, Hong-liang LI, Li-rong DAI, Jing CHEN, Jun-liang XING, Jing CHEN, Xi-hong WU, Wei-min HU, Yu HU, Tie-jun HUANG, Wen GAO
Perception is the interface through which an intelligent system interacts with the real world. Without sophisticated and flexible perceptual capabilities, it is impossible to create advanced artificial intelligence (AI) systems. For the next-generation AI, called ‘AI 2.0’, one of the most significant features will be that AI is empowered with intelligent perceptual capabilities that can simulate the mechanisms of the human brain and are likely to surpass the human brain in terms of performance. In this paper, we briefly review the state-of-the-art advances across different areas of perception, including visual perception, auditory perception, speech perception, and perceptual information processing and learning engines. On this basis, we envision several R&D trends in intelligent perception for the forthcoming era of AI 2.0, including: (1) human-like and transhuman active vision; (2) auditory perception and computation in an actual auditory setting; (3) speech perception and computation in a natural interaction setting; (4) autonomous learning of perceptual information; (5) large-scale perceptual information processing and learning platforms; and (6) urban omnidirectional intelligent perception and reasoning engines. We believe these research directions should be highlighted in future plans for AI 2.0.
Intelligent perception / Active vision / Auditory perception / Speech perception / Autonomous learning