In response to the above problems, deep learning-based text detection algorithms have emerged in recent years. They adapt and optimize general object detection algorithms in several respects, including feature extraction, the region proposal network (RPN), multi-task joint training, loss function design, non-maximum suppression (NMS), and semi-supervised learning. As a result, text detection accuracy in natural scene images has improved significantly. For example, the connectionist text proposal network (CTPN) [10] improves text detection accuracy by exploiting the contextual characteristics of text characters through a bidirectional long short-term memory (BiLSTM) network. RRPN [11] uses bounding boxes together with their rotation angles as text block annotations and training data, thereby gaining the ability to detect rotated text blocks. DMPNet [12] uses quadrilaterals (non-rectangular) as text block annotations because they surround the text content more tightly. SegLink [13] cuts words into smaller blocks that are easier to detect and then concatenates adjacent blocks into text lines. TextBoxes [14,15] can detect thinner text lines by introducing rectangular convolution kernels. FTSN [16] uses mask-NMS instead of the traditional NMS algorithm to filter candidate bounding boxes. WordSup [17] uses a semi-supervised learning strategy to train character-level text detection models with word-level annotated data.
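To make the NMS step mentioned above concrete, the following is a minimal sketch of the conventional greedy, box-based NMS procedure that several of these detectors build on or replace. It is an illustration under stated assumptions, not the implementation of any cited method: boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and the function names `iou` and `nms` and the default threshold of 0.5 are hypothetical choices made here. The mask-based variant used in FTSN is not covered.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: axis-aligned (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    A candidate is kept only if its IoU with every already-kept box
    does not exceed iou_threshold (an illustrative default).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    confidences = [0.9, 0.8, 0.7]
    print(nms(candidates, confidences))
```

In this usage example the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box does not overlap either and is kept. For rotated or quadrilateral text boxes such as those in RRPN or DMPNet, the overlap computation would need to operate on polygons rather than axis-aligned rectangles.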