Scene text detection and recognition: recent advances and future trends
Yingying ZHU, Cong YAO, Xiang BAI
Text, one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is useful in a wide range of vision-based applications; text detection and recognition in natural scenes have therefore become important and active research topics in computer vision and document analysis. In recent years especially, the community has seen a surge of research effort and substantial progress in these fields, though a variety of challenges (e.g., noise, blur, distortion, occlusion, and variation) remain. The purposes of this survey are threefold: 1) to introduce up-to-date works, 2) to identify state-of-the-art algorithms, and 3) to predict potential research directions. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source code, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.
text detection / text recognition / natural image / algorithms / applications