Towards human-like and transhuman perception in AI 2.0: a review

Yong-hong TIAN; Xi-lin CHEN; Hong-kai XIONG; Hong-liang LI; Li-rong DAI; Jing CHEN; Jun-liang XING; Jing CHEN; Xi-hong WU; Wei-min HU; Yu HU; Tie-jun HUANG; Wen GAO

doi:10.1631/FITEE.1601804

Front. Inform. Technol. Electron. Eng, 2017, Vol. 18, Issue 1: 58-67. DOI: 10.1631/FITEE.1601804

Review

Abstract

Perception is the interface through which an intelligent system interacts with the real world. Without sophisticated and flexible perceptual capabilities, it is impossible to create advanced artificial intelligence (AI) systems. For next-generation AI, called ‘AI 2.0’, one of the most significant features will be that AI is empowered with intelligent perceptual capabilities that can simulate the human brain’s mechanisms and are likely to surpass the human brain in performance. In this paper, we briefly review state-of-the-art advances across different areas of perception, including visual perception, auditory perception, speech perception, and perceptual information processing and learning engines. On this basis, we envision several R&D trends in intelligent perception for the forthcoming era of AI 2.0, including: (1) human-like and transhuman active vision; (2) auditory perception and computation in an actual auditory setting; (3) speech perception and computation in a natural interaction setting; (4) autonomous learning of perceptual information; (5) large-scale perceptual information processing and learning platforms; and (6) urban omnidirectional intelligent perception and reasoning engines. We believe these research directions should be highlighted in future plans for AI 2.0.

Keywords

Intelligent perception / Active vision / Auditory perception / Speech perception / Autonomous learning

Cite this article

Yong-hong TIAN, Xi-lin CHEN, Hong-kai XIONG, Hong-liang LI, Li-rong DAI, Jing CHEN, Jun-liang XING, Jing CHEN, Xi-hong WU, Wei-min HU, Yu HU, Tie-jun HUANG, Wen GAO. Towards human-like and transhuman perception in AI 2.0: a review. Front. Inform. Technol. Electron. Eng, 2017, 18(1): 58‒67 https://doi.org/10.1631/FITEE.1601804
