Introduction
In clinical medicine, pathological examination has been regarded as a gold standard for cancer diagnosis for more than 100 years [
1]. Pathologists use a microscope to observe histological sections. Many advanced technologies, including hematoxylin and eosin (H&E) staining and spectral methods, have been applied in the preparation of tissue slides to improve imaging quality. However, intra- and interobserver disagreement cannot be avoided in visual observation and subjective interpretation, even for experienced pathologists [
2,
3]. This limited agreement has motivated the development of computational methods for pathological diagnosis [
4–
16] because automatic approaches can attain robust performance. The first step for computer-aided analysis is digital imaging.
Digital imaging is the process of acquiring, compressing, storing, and displaying scenes digitally. Whole slide imaging is a more advanced and frequently used technology in pathology compared with traditional digital imaging technologies that process static images through cameras [
17–
19]. This technology involves two processes. A specialized scanner is utilized to convert an entire glass histopathology or cytopathology slide into a digital slide. A virtual slide viewer is used to visualize the collected digital files. This technique can efficiently generate high-resolution whole slides. A whole slide image (WSI) typically contains billions of pixels, with file sizes ranging from 200 MB to 1 GB [
20]. The data generated from one patient's biopsy can reach the TB level because a WSI records the information of the entire tissue section. The WSI immensely promotes systematic image analysis of microscopic morphology.
Digital pathology (DP) analysis aims to determine the degree of tissue cancerization and predict its prognosis. Traditional computational methods objectively evaluate disease-related tissue changes by extracting mathematical features, such as textural [
6,
7,
9,
12], morphological [
4,
5], structural [
6,
13–
16], and fractal features [
8,
11]. However, differences in staining, fixation, and sectioning can introduce large variances, which limit the generalization of these hand-crafted features. Deep learning (DL) methods provide domain-agnostic solutions that are well suited to DP image analysis tasks.
DL is a popular machine learning method based on a deep neural network. A deep neural network [
21] is composed of multiple nonlinear modules and can learn representations from raw data without domain expertise. As a representation learning framework, it extracts multilevel features during multilayer propagation. Network internal parameters are randomly initialized at the beginning and updated through backpropagation (backward propagation of errors). DL can discover intricate structures in large data sets, and more training data usually result in enhanced performance. A convolutional neural network (CNN) is a popular deep neural network [
22]. It was first proposed by LeCun
et al. [
23]. CNN is suitable for processing images in the form of 2D or 3D arrays. Convolutional and pooling layers are two typical components. The convolutional layer extracts feature maps from previous layers through the convolutional kernel. Every unit of the output feature map is the multiply–accumulate of the convolutional kernel and the local input area. The same kernel is shared among all the units in a feature map, making the network invariant to target locations. The concepts of weight sharing and local receptive fields reduce model complexity and the number of weights, thereby allowing deeper CNNs. The pooling layer merges neighboring features from the previous convolutional layer, immensely reducing the feature dimension and increasing the robustness to small shifts. Nonlinearity, such as a rectified linear unit module, is another important component that enables the network to emulate complex functions [
24]. CNN takes raw images (or large patches) as input, avoiding the complex feature extraction of conventional recognition algorithms. CNN is highly invariant to translation, scaling, inclination, and other forms of deformation. It has brought great breakthroughs in image processing and is widely applied in a wide variety of computer vision tasks. Histopathology images are characterized by data complexity, making deep architectures extremely suitable for complex feature learning in DP analysis [
25].
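To make these components concrete, the following is a minimal sketch of a CNN assembled from the pieces described above (shared-kernel convolution, rectified linear unit nonlinearity, and pooling), written in PyTorch; the layer sizes and the 64 × 64 patch size are illustrative assumptions, not taken from any cited model.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # Convolutional layer: one 3x3 kernel set is shared across all
            # spatial locations (weight sharing), keeping the weight count low.
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),   # nonlinearity (rectified linear unit)
            nn.MaxPool2d(2),         # pooling merges neighboring features
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        h = h.mean(dim=(2, 3))       # global average over spatial dimensions
        return self.classifier(h)

# Example: classify a batch of eight 64x64 RGB patches.
logits = TinyCNN()(torch.randn(8, 3, 64, 64))  # shape: (8, 2)
```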
This paper systematically reviews the research directions of DL in DP research: (1) Image preprocessing–stain normalization. Stain variation can seriously deteriorate training results. Color normalization is a prerequisite step for automatic algorithms. (2) Obtaining clinical/biological structure information. Qualitative and quantitative histology characteristics are the key indicators used in cancer evaluation [
26,
27]. The extraction and analysis of low- (cell), middle- (gland), and high-level (region) objects (Fig. 1) through classification, semantic segmentation, detection, and instance segmentation are extensively discussed. (3) Grading and prognosis. DL methods that either extract valid pathological information to facilitate subsequent survival prediction and treatment suggestions or conduct cancer grading on the WSI are comprehensively presented. We aim to provide readers with either medical or engineering backgrounds with thorough and detailed cases in the intersection of DL and medical image processing.
Stain normalization
Original tissue sections are visually transparent under the microscope. Efficient examination requires dyeing tissue sections with colored histochemical stains. Stains enhance the contrast among various structures by selectively binding to particular cellular components. However, color variation exists in histopathology images because of the uncertainty of multiple factors, such as (1) scanner type, (2) stain concentration, (3) manufacturer, (4) elapsed time, (5) environmental temperatures upon staining, and (6) digitization. Undesired color variations affect the visual examination of tissues by pathologists and the automatic analysis of DP images by software. Stain normalization (also known as color normalization), which essentially transfers the color characteristics of a reference image to other images, has been developed to alleviate interimage biases. Although DL algorithms can partially mitigate color variation through proper data augmentation, performance still deteriorates without stain normalization when data are limited. Therefore, stain normalization acts as a prerequisite for pathological image analysis.
Various approaches [
28–
31] have been proposed for stain normalization. Khan
et al. [
28] presented a supervised method based on nonlinear mapping using a representation derived from color deconvolution. Principal color histograms were derived from a set of quantized histograms. A global stain color descriptor was then computed. Thus, a supervised classification framework was learned to calculate image-specific stain matrices. Vahadane
et al. [
29] proposed a structure-preserving method for stain separation and color normalization to preserve biological structure information by modeling stain density maps on the basis of nonnegativity, sparsity, and soft classification. Janowczyk
et al. [
30] presented an algorithm based on an unsupervised neural network. They explored sparse autoencoders (SAEs) through an iterative process to partition similar tissue types in source and target images. The learned filters can optimally reconstruct the original image. The color of the target image was then altered to match the source image tissue through histogram specification. BenTaieb and Hamarneh [
31] proposed a different approach based on generative adversarial networks. They cast stain normalization as a style transfer problem to transfer the staining appearance of tissue images across data sets. A recent review summarized the pros and cons of various stain normalization methods for histopathology images [
32].
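As a concrete illustration of the simplest family of such methods, the sketch below performs a Reinhard-style global color transfer by matching per-channel LAB statistics to a reference image; it is not one of the learned approaches above, and the epsilon guard and uint8 input convention are our own assumptions.

```python
import numpy as np
from skimage import color

def reinhard_normalize(source_rgb: np.ndarray, target_rgb: np.ndarray) -> np.ndarray:
    """Match the per-channel LAB mean/std of source_rgb to target_rgb.

    Both inputs are H x W x 3 uint8 RGB images; returns a uint8 RGB image.
    """
    src = color.rgb2lab(source_rgb)
    tgt = color.rgb2lab(target_rgb)
    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std()
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        # Shift and rescale each LAB channel to the reference statistics.
        out[..., c] = (src[..., c] - s_mean) / (s_std + 1e-8) * t_std + t_mean
    rgb = color.lab2rgb(out)                 # floats in [0, 1]
    return (np.clip(rgb, 0, 1) * 255).astype(np.uint8)
```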
Cell level
Cellular objects are frequently used biomarkers for cancer histology diagnosis [
33,
34]. In accordance with the widely used Nottingham–Bloom–Richardson grade for breast cancer screening [
33], tubule formation, nuclear pleomorphism, and the number of mitotic figures are essential for the assessment of breast cancer. Nuclei features, such as spatial distribution and morphological appearance, play an important role in identifying mitotic index and nuclear pleomorphism level [
2,
33–
35]. The number of mitoses is a critical predictor of tumor aggressiveness and is of great importance for cancer screening and assessment. Manual cell segmentation and mitosis counting are time consuming and labor intensive for pathologists. The results are usually subjective, exhibiting intra- and interindividual variability. Thus, the development of automatic segmentation and detection methods is essential for efficient and reliable pathological diagnosis. However, accurate detection and segmentation of nuclei face the following difficulties: (1) many cells touch one another with weak nuclear boundaries, making them difficult to discriminate; (2) individual nuclei vary in size, shape, texture, and appearance; (3) background clutter, stain imbalance, and image artifacts are unavoidable. In addition to the general difficulties of nuclei analysis in histopathology images, mitosis analysis poses some unique challenges. Mitoses exhibit diverse shape configurations at different growth stages, and apoptotic nuclei look similar to mitoses, thereby causing false positives during detection [
36]. A number of DL-based approaches have been proposed to achieve enhanced performance. Here, we first review the application of DL in cell and nucleus semantic segmentation. We then summarize the development of cell and mitosis detection. We discuss nuclei instance segmentation, which is a combined task of nuclei semantic segmentation and detection. Table 1 presents an overview of each task in the cell-level analysis.
Cell and nucleus semantic segmentation
The segmentation task, which aims to assign a class label to each pixel of an image, is common in pathology image analysis. Neural membrane segmentation in electron microscopy (EM) images is the primary task in automated reconstruction pipelines for mapping neural connections [
65]. Cellular object segmentation is a prerequisite step for cellular morphology computation, characteristic quantification, and cell recognition. In glioblastoma multiforme, the round morphology of oligodendroglioma can be distinguished from the elongated and irregular morphology of astrocytoma with the assistance of nuclear segmentation [
39]. In cervical cytology diagnosis, nuclear segmentation is necessary to discover all types of cytological abnormalities that are usually identified by certain nuclear abnormality [
66]. Therefore, the development of accurate automatic nuclear segmentation methods is essential for computer-assisted diagnosis to ease pathologists’ burden of manual inspection and alleviate the missed diagnosis and misdiagnosis. Previous studies [
67–
69] for nucleus or cell segmentation focused on region growing, thresholding, clustering, level set, supervised color-texture-based, and watershed-based methods. These conventional approaches depend on elaborately designed features or representations that require intense domain knowledge and adapt poorly to different circumstances. DL approaches have been increasingly applied for automated nuclear segmentation. CNNs were initially used in classification tasks, where the input is an image, and the corresponding output is a single class label. Hence, some studies have regarded segmentation tasks as pixelwise classification tasks [
37,
39,
40]. These methods follow fixed pipeline steps. First, a group of windows is densely sampled from the WSI via a sliding window. The class probability of the central pixel is then predicted by utilizing the rich context information of the window. The pixel is assigned a label, such as “foreground” or “background,” according to a fixed threshold. Finally, the whole image is segmented in accordance with the labels of all the pixels. DL methods function as the feature extractor and classifier during the pixel-level classification. Cireşan
et al. [
37] first applied a CNN-based method to neural membrane segmentation in EM stacks using the above strategy. They achieved the best performance at
ISBI 2012 EM Segmentation Challenge [
65]. Similarly, Zhou
et al. [
39] proposed an approach for nuclear segmentation. They randomly cropped image patches at nuclear centers and jointly learned a set of convolutional filters and a sparse linear regressor. The segmentation results were obtained by applying a threshold to the regressor prediction in performing pixelwise binary classification. For most single-scale CNN-based methods, a small patch size cannot provide sufficient context information for learning effective features. Windows with large sizes are also unfeasible to achieve the goal because of the high computational cost. Song
et al. [
40] implemented a multi-scale CNN framework to segment cervical cytoplasm and nuclei while handling the scale variations in the nuclei. They first extracted feature representations via multi-scale CNN and obtained coarse segmentation by inputting the features into a two-layer neural network. A graph partitioning model was then implemented on the basis of coarse segmentation and superpixels for fine segmentation. Finally, they computed nuclear markers to split touching nuclei. A similar approach that combines CNN features with morphological operations for nuclei segmentation was presented in Reference [
42]. Xing
et al. [
41] used a CNN model to learn the feature representation of raw data and to generate a probability map, in which each pixel was assigned a probability of belonging to a nucleus. An iterative region merging approach was then performed to initialize the contours. Accurate nucleus segmentation was achieved via a selection-based sparse shape model. Although the above pixel-based methods [
37,
39–
42] have shown promising performance, they have obvious limitations. First, the patch size in pixel-based methods is small. Although this property can increase localization accuracy, rich context information cannot be fully utilized. Second, densely selected patches increase the computational burden, making the algorithms cumbersome and time consuming. Finally, these methods usually rely on prior assumptions about the target structure, thereby reducing generalization to targets with different morphologies. Semantic segmentation has become considerably more efficient with the success of the fully convolutional network (FCN) [
70]. This end-to-end network consists of a downsampling path and an upsampling path. The downsampling path extends the classification network structure, which is composed of convolutional and pooling operations. The upsampling path includes convolutional and deconvolutional layers that can produce dense classification scores for a whole image. The full image is fed as input rather than multiple dense patches during training and testing. A pixelwise probability map is output in a single forward propagation. Zhang
et al. [
43] combined an FCN and a graph-based approach for cervical nuclei segmentation. The FCN acted as a nucleus high-level feature learner and generated a nucleus label mask and a nucleus probabilistic map. The former was used to guide the graph construction, and the latter was formulated into an in-region cost function. They achieved a state-of-the-art Zijdenbos similarity index of 0.92 ± 0.09. Ronneberger
et al. [
38] innovatively modified and extended the FCN by adding a symmetric expanding path to the usual contracting path. High-resolution features from the contracting path and the upsampled output from the expanding path were combined to enable precise localization. The resulting architecture was named U-net, and many variants were subsequently developed for various medical imaging tasks [
44,
71,
72]. Without pre- and postprocessing, FCN [
70] and its variants, such as U-net [
38], have proved well suited to medical image segmentation tasks.
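To make the encoder–decoder idea concrete, here is a minimal U-net-style sketch in PyTorch with a single skip connection; the channel counts and input size are illustrative assumptions rather than the original U-net configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.enc1 = conv_block(3, 16)          # contracting path, full resolution
        self.enc2 = conv_block(16, 32)         # contracting path, half resolution
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # expanding path
        self.dec1 = conv_block(32, 16)         # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        # Skip connection: concatenate high-resolution contracting-path features
        # with the upsampled expanding-path features for precise localization.
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)                   # dense per-pixel class scores

scores = TinyUNet()(torch.randn(1, 3, 128, 128))  # shape: (1, 2, 128, 128)
```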
Cell and mitosis detection
A detection task aims to find and locate the regions of interest (ROIs), such as nuclei or mitoses, in a tissue slide and is of great importance to cancer screening. Cell spatial distribution analysis and mitosis counting can help in distinguishing differentiation degrees [
73]. Automatic cell/nuclei detection serves as an essential prerequisite for a series of subsequent tasks, such as cell/nuclei instance segmentation, tracking, and morphological measurements [
74]. Object detection is necessary to automatically acquire spatial distribution and quantity information. A detection task requires classifying and positioning the targets. Many traditional methods based on hand-crafted features have been proposed [
75–
78] to solve this problem. A number of DL-based studies have been conducted on this field in recent years.
In recent studies toward nuclei detection, feature representations learned by DL-based methods have been demonstrated more effective than hand-crafted features [
79]. DL-based nuclei detection approaches can be divided into two categories: (1) conduct per-pixel classification via a sliding window to find local maxima on a probability map. Xu
et al. [
47] applied a stacked SAE to learn high-level features using this strategy and fed them into a softmax classifier to classify each patch as nuclear or nonnuclear. Their method achieved promising performance on breast cancer histopathology images. Sirinukunwattana
et al. [
46] proposed a spatially constrained CNN to generate a probability mask for spatial regression. Their model predicted the centroid locations of nuclei and the confidence that they corresponded to true centroids; (2) regress the proximity value for each pixel. Xie
et al. [
45] presented a CNN-based structured regression approach. Their method generated proximity patches that show high values for pixels near cellular centers rather than a single class label assigned to an input image patch. In this way, each cell was detected. However, the model with a large-size proximity patch was difficult to train because the subsampling operation resulted in information loss. In another work [
48], they extended the previous method with a fully residual CNN to directly output a dense proximity prediction that had the same size as the input image. The experimental results showed the superiority of their method in terms of detection accuracy and speed.
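A minimal sketch of the second category, assuming a network has already produced a dense proximity map (H x W, with high values near cell centers): detections are simply thresholded local maxima. The min_distance and threshold values are illustrative.

```python
import numpy as np
from skimage.feature import peak_local_max

def detect_cells(prox: np.ndarray, min_distance: int = 6,
                 threshold: float = 0.5) -> np.ndarray:
    """Return an (N, 2) array of (row, col) coordinates of detected cells."""
    # peak_local_max suppresses maxima closer than min_distance pixels,
    # playing the same role as nonmaxima suppression for point detections.
    return peak_local_max(prox, min_distance=min_distance,
                          threshold_abs=threshold)
```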
Given the importance of mitosis counting for cancer prognosis and treatment, several challenges and contests have been organized to promote the development of robust mitosis detection algorithms [
80–
82]. Most methods regard this task as follows: (1) as a classification problem that relies on a sliding window to classify whether an image patch is centered on a mitosis; (2) as a semantic segmentation task that first segments the images to find mitosis candidates and then classifies these candidates to obtain the final detections; (3) as a proposal-based detection issue that produces a number of potential proposals on the image patch and retains the most probable ones.
(1) As a classification problem. Cireşan
et al. [
49] utilized CNN as a pixelwise classifier via a sliding window manner to detect the mitoses in breast histology images. Their method achieved the best performance in the
2012 ICPR Mitosis Detection Challenge [
81]. Wang
et al. [
50] performed mitosis detection in breast cancer pathology images with a combination of CNN and hand-crafted features. They presented a cascaded approach that maximally exploited the two distinct feature sets. Such a strategy exhibited higher accuracy and lower computational cost compared with each individual feature extraction technique. To acquire sufficient training images, crowdsourcing was introduced in the learning process of CNN to exploit additional data sources annotated by nonexpert users for mitosis detection [
52]. They trained a multi-scale CNN model with the same basic architecture on different image scales to perform mitosis detection and provided the crowds with mitosis candidates for annotation. The collected annotations were then passed to the existing CNN with the specially designed aggregation layer attached for model refinement and ground truth generation. These methods are usually computationally demanding.
(2) As a semantic segmentation task. Chen
et al. [
36] proposed a deep regression network based on FCNs. Their method can produce a dense score map with the same size as the input image and can be trained in an end-to-end fashion. The regression layer was added at the end of the FCN architecture, and the proximity score map was defined to efficiently locate the mitotic centroid on the map. Their method was concise and general enough to be applied to other similar tasks. They also utilized a cross-domain transfer learning strategy to alleviate the insufficiency of medical data. In another study [
51], they used two networks to build a cascade detection system. They first exploited FCN to locate the candidates of mitoses. A second fine discrimination model was then applied to classify the candidate patches. Li
et al. [
55] expanded the mitosis labels, which were usually single pixels, into labels with concentric circles, where the inner circle was a mitotic region, and the outer ring was a “middle ground.” Thus, a concentric loss was defined to detect mitosis with a segmentation model.
(3) As a proposal-based issue. Li
et al. [
53] applied a proposal-based deep detection network to the mitosis detection task and utilized a patch-based deep verification network to improve the accuracy.
Strategy 1 (classification-based) was largely applied in early DL-based studies. Advances in computational power and DL technology have made Strategy 2 (semantic segmentation-based) the mainstream. Strategy 3 (proposal-based) is becoming increasingly popular with the success of detection algorithms developed for natural scenes.
Nuclei instance segmentation
Instance segmentation is a unified work that groups pixels into semantic classes and object instances [
83]. Compared with detection, instance segmentation provides a finer mask rather than a box for each independent object. Unlike semantic segmentation, foreground pixels are separated into different individuals [
84]. Nuclei instance segmentation, which aims to differentiate nuclear regions from the background and separate them into different individuals, is fundamental to acquire many grading indexes. For example, the average nuclear size estimated by nuclei instance segmentation results is an important indicator used for nuclear pleomorphism scoring [
85] in manual [
86,
87] and automatic [
88] measurements. Increasing effort has been devoted to nuclei instance segmentation.
Some public data sets [
61,
89–
91], such as the Fluo-N2DL-HeLa data set from ISBI cell tracking challenge [
89], the Computational Precision Medicine Nuclear Segmentation Challenge at MICCAI 2015 and MICCAI 2017 [
90], the Triple Negative Breast Cancer [
91], and the Multiorgan Nuclei Segmentation (MoNuSeg) Data set [
61], are available to facilitate the development of nuclei instance segmentation algorithms. The cell/nucleus instance segmentation approaches are classified into two strategies, namely, (1) semantic segmentation-based: apply semantic segmentation with traditional image processing methods (image labeling, watershed, etc.) to separate nuclei into different individuals; (2) detection-based: present potential seeds (centers of each nucleus) or proposals (bounding boxes surrounding each nucleus) initially and predict segmented masks on the basis of detection outputs.
(1) Semantic segmentation-based. To separate touching nuclei into individual ones, Chen
et al. [
59] proposed to integrate contour information into a multilevel FCN for developing a deep contour-aware network. Their method won the
2015 MICCAI Nuclei Segmentation Challenge [
90]. Kumar
et al. [
61] explicitly introduced a third class of nuclear boundary (besides foreground that is inside any nucleus and background that is outside every nucleus) to discriminate different nuclear instances. Song
et al. [
60] first relied on a semantic FCN to classify each pixel into the class of nuclei, cytoplasm, and background. They then defined overlapped cell splitting as a discrete labeling task and designed a suitable cost function to split crowded cells. They applied a dynamic multitemplate deformation model for cell boundary refinement. Similarly, Yang
et al. [
56] achieved instance-level segmentation of glial cells in 3D images by first obtaining voxel-level segmentation via FCN. Subsequently, they adopted a
k-terminal cut algorithm to separate touching cells. Naylor
et al. [
63] predicted the distance map of nuclei with U-net and a regression loss. Postprocessing was conducted on the basis of the output of the regression network. Zhou
et al. [
64] achieved the highest ranking in the
2018 MICCAI MoNuSeg Challenge [
92]. Their method aggregated multilevel information between two task-specific decoders. With bidirectionally combined task-specific features, they leveraged the spatial and texture dependencies between the nuclei and contours. Zeng
et al. [
72] applied a residual-inception-channel attention-U-net. Their model output contour and nuclei masks to achieve instance segmentation. The above methods tend to obtain semantic segmentation results as a premise and implicitly or explicitly introduce contour priors (e.g., as an additional label, encoded in a distance loss, etc.) to disentangle nucleus-to-nucleus connections. Such methods are simple and fast. However, they have disadvantages. Oversegmentation leads to spuriously identified nuclei instances, whereas undersegmentation fails to split touching nuclei accurately. These methods usually require heavily engineered postprocessing, making them difficult to generalize, and nuclei with irregular boundaries cannot be well handled.
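A minimal sketch of this semantic-segmentation-based strategy, assuming a binary nuclei mask is already available from some segmentation network: a distance transform followed by marker-controlled watershed splits touching nuclei into instances.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_touching_nuclei(mask: np.ndarray, min_distance: int = 5) -> np.ndarray:
    """mask is a boolean H x W array; returns an int32 instance label map."""
    dist = ndi.distance_transform_edt(mask)        # distance to background
    peaks = peak_local_max(dist, min_distance=min_distance, labels=mask)
    markers = np.zeros(mask.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)  # one seed per nucleus
    # Watershed floods from the seeds over the inverted distance map, so the
    # split lines fall along the necks between touching nuclei.
    return watershed(-dist, markers, mask=mask)
```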
(2) Detection-based. Akram
et al. [
57,
58] designed a two-stage network for nuclei instance segmentation. In the first stage, an FCN model proposed possible bounding boxes for cell proposals and their scores. Nonmaxima suppression [
93] was then implemented to remove low-scored duplicate proposals. In the second stage, segmentation masks were obtained through thresholding and morphological operations [
57] or predicted using a second CNN for each proposed bounding box [
58]. Ho
et al. [
62] proposed a 3D CNN framework for nuclei detection and segmentation in fluorescence microscopy images. They first used the 3D distance transform [
94] to detect individual seeds, and a 3D CNN then segmented nuclei within a subvolume centered at each seed. Region-based object detection algorithms [
95,
96] have been increasingly used for nuclei instance segmentation [
97,
98]. Detection-based approaches usually exhibit high performance but are time consuming and may produce false positives or false negatives.
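For reference, the nonmaxima suppression step used in such detection-based pipelines can be sketched as follows; the intersection-over-union (IoU) threshold is an illustrative assumption.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> np.ndarray:
    """boxes is (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]               # best-scoring proposals first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                  (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_r - inter + 1e-8)
        order = order[1:][iou <= iou_thresh]     # drop duplicate proposals
    return np.array(keep)
```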
Gland level
A gland is a group of cells that can synthesize and release proteins and carbohydrates [
99,
100]. Glands are found in many types of organs, such as the colon, breast, and lung. Cancerization of these organs usually causes organizational and structural changes in their glands [
101]. Adenocarcinoma is a common form of carcinoma [
102]. Among all the colon cancers, colorectal adenocarcinoma is the most common form that originates in intestinal glands [
100]. Most of the pancreatic cancers are adenocarcinomas [
100,
103]. Quantitatively measured glandular formation is a crucial indicator of the degree of differentiation [
104], and architectural features (such as size and shape) are equally important for the prediction of prognosis [
105]. Instance segmentation [
106] is a prerequisite step to obtain the morphological and statistical features of glands in digital histology images. In this section, we review the application of DL in gland instance segmentation.
Instance segmentation
Gland instance segmentation is a challenging task because of the huge variance of gland morphology [
101]. First, the size and shape of glands vary with different sectioning orientations. Glands within the same tissue image can have different sizes when the orientation differs (Fig. 2A and 2B). Second, the density difference between connective tissue and glands can cause artifacts at the boundaries of glandular tissue (Fig. 2C). Third, section thickness and dye freshness or uniformity will lead to intensity variations (Fig. 2B and 2D). Finally, glands in neoplastic tissue usually show heterogeneous appearances (Fig. 2E and 2F). The glandular structures degenerate with increasing cancer grade. Separating touching glands, which is a crucial step in instance segmentation, is challenging.
The automatic segmentation of glands in histology images has been explored by many studies [
104]. Traditional methods rely on the extraction of hand-crafted features, conventional classifiers, and a large amount of prior knowledge [
67,
101,
112–
125]. Although these studies perform well for tissues with many regular glands (usually healthy and benign cases), they yield unsatisfactory results for cancer cases, where the glands show substantial variation and deformation. DL algorithms have enabled accurate and general gland segmentation.
In the
2015 MICCAI Colon Gland Segmentation Challenge (GlaS) [
100], DL methods were introduced for gland segmentation and classification tasks and showed performance superior to that of traditional methods. Chen
et al. [
59] proposed an FCN-based deep contour-aware network (DCAN). This method combined gland segmentation and contour prediction into a unified multitask structure. Multiple regularizations can provide complementary information for the two tasks. The final segmentation mask, with individual glands separated from each other, can be generated through this single forward network by fusing the predicted probability map and predicted contour. In the semantic segmentation branch, DCAN combined multilevel contextual features to handle variant gland shapes. This method achieved the best performance in the contest. Xu
et al. [
106,
109] conducted gland instance segmentation on this data set. They designed a deep multichannel side supervision system (DMCS), where region and boundary cues were fused with side supervision to solve the instance segmentation issue. The DMCS adopted one channel for semantic segmentation and another channel for contour prediction. The region feature channel followed the FCN structure. The holistically-nested edge detector (HED)-side convolution channel was inspired by HED [
126], where five multi-scale outputs were generated from the FCN channel to combine into a final edge map. The boundary information loss during the FCN downsampling can be compensated by adding an additional contour segmentation channel. Different from DCAN, DMCS alleviated the burden of postprocessing by generating the final instance segmentation result through fully convolutional layers. Their method exceeded the previously reported performance on this data set, including the competition champion [
59].
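The fusion step shared by these contour-aware methods can be sketched as follows, assuming a network has already produced gland and contour probability maps; the thresholds are illustrative, and the postprocessing of DCAN and DMCS differs in detail.

```python
import numpy as np
from scipy import ndimage as ndi

def fuse_seg_and_contour(p_seg: np.ndarray, p_contour: np.ndarray,
                         t_seg: float = 0.5, t_cnt: float = 0.5) -> np.ndarray:
    """Both inputs are H x W probability maps; returns an instance label map."""
    interior = (p_seg > t_seg) & (p_contour < t_cnt)  # gland minus its boundary
    labels, _ = ndi.label(interior)                   # one integer id per gland
    # The removed boundary pixels can be assigned back to the nearest instance,
    # e.g., with a watershed or morphological dilation (omitted here).
    return labels
```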
DCAN and DMCS aim to preserve object boundaries by adding an additional branch. BenTaieb
et al. [
110] incorporated topology and smoothness information into the training of FCN by designing a new loss function. In this work, two losses, namely, a traditional pixel-level loss and a topology-aware loss, were combined. Hierarchical relations and boundary smoothness constraints were encoded into a unified topology-aware loss function. The proposed algorithm enforced the label hierarchy: epithelium and lumen regions were labeled as foreground, and the surrounding stroma was labeled as background. The boundary was smoothed by constraining neighboring pixels to output similar probabilities. This end-to-end network required neither postprocessing nor additional computational burden during testing. BenTaieb
et al. [
111] creatively proposed another method to solve the gland instance segmentation problem. On the basis of the idea that classification and segmentation tasks can provide complementary information for each other, they designed a jointly trained classification and segmentation network. Two loss functions were utilized, where the first penalized the classification of gland grading, and the other was used for pixel-level segmentation. The pretrained classification output can provide spatial information on multiple glands that assists in the discrimination of different objects. In this way, improved performance was achieved for the two tasks.
In addition to the abovementioned FCN-based approaches, some studies have viewed gland segmentation as a pixel-based classification problem [
107,
108]. One of the participant teams in 2015 GlaS [
100], Kainz
et al. [
107], applied two distinct CNNs as pixelwise classifiers. The first network, called Object-Net, performed four-class classification to distinguish benign background, benign gland, malignant background, and malignant gland. The second network, called Separator-Net, was trained to predict the separating structures among glands for differentiating touching glands. A postprocessing method based on weighted total variation transformed the combination of the two outputs into a final segmentation result. This method successfully segmented glands from the background and classified whether the tissue was benign or malignant. Li
et al. [
108] used Alexnet [
126] and Googlenet [
127] (with weights pretrained on ImageNet [
128]) for window classification rather than using specifically designed CNN structures. Their result outperformed the previous state-of-the-art method that utilized hand-crafted features with support vector machine (SVM) (HC-SVM) [
129]. They also validated that fusing CNN and HC-SVM results in additional improvement, indicating that the two contrasting methods are complementary (Table 3).
Region level
The DP analysis is regarded as a region-level task when the target ROI is a larger tissue structure than the cell and gland. Region-level information is an important diagnostic factor. The biological behavior and morphological characteristics of tissue cells and contextual information should be considered. Pathologists need to identify tumor areas in WSIs, determine carcinoma types (e.g., small cell and non-small cell types for lung cancer), and assess their aggressiveness for subsequent treatment. Several challenges exist in the automatic analysis of region-level tasks [
130]: (1) the complexity of clinical feature representation: histopathology characteristics, such as morphology, scale, texture, and color distribution, can be remarkably heterogeneous for different cancer types, making it difficult to find a general pattern across cancers; (2) the insufficiency of training images: different from a natural scene image data set that can include millions of images, a pathological image data set usually contains only hundreds of images; (3) the extremely large size of a single histopathology image: a gigapixel WSI typically measures more than 100 000 pixels on a side and contains more than 1 million descriptive objects, thereby making effective feature extraction difficult. In this section, we will illustrate the application of DL in region-level structure analysis from two aspects. The first aspect is to classify whether the region is cancerous or distinguish different cancer subtypes. The second aspect is to segment certain structures associated with specific clinical features. Table 4 presents an overview of each task in the region-level analysis.
Classification
Previous studies [
141,
142] on histopathology image classification mainly focused on manual feature design. Automatic feature learning by CNNs has drawn much attention with the development of DL methods. Directly applying CNNs to region-level analysis is impractical because of the large scale of WSIs and high computational cost. Downsampling a WSI to fit into a neural network is unsuitable because essential details may be lost. Alternatively, most studies [
130,
132,
134] choose to perform analysis locally on small, densely cropped patches. The final results are acquired on the basis of the overall patch-level predictions. Cross-domain transfer learning is widely utilized to alleviate data insufficiency. Xu
et al. [
130,
132] exploited CNN activation features to achieve region-level classification. They first divided each region into a set of overlapping patches. A transfer learning strategy was then explored by pretraining CNN with ImageNet [
126]. Each patch was transformed into a 4096-dimensional feature vector. Feature pooling and feature selection were implemented to yield a region-level feature vector for reducing redundancy and selecting a subset of many relevant features. Region-level features were passed to a linear SVM [
143] for classification. Their method achieved a state-of-the-art accuracy of 97.5% in the
MICCAI 2014 Brain Tumor Digital Pathology Challenge [
144]. Similarly, Källén
et al. [
134] developed a region-level classification algorithm to report the Gleason score. They used the pretrained OverFeat [
145] to extract multilayer features for each patch. Random forest (RF) [
146] or SVM was used for patch classification. They applied a voting strategy on all patches for different classes to classify the whole region image. The class with the highest votes was assigned. However, this approach may miss small lesion areas in a cancerous WSI. An average pooling strategy was exploited for aggregating patch-level features to region-level ones [
140]. In particular, they first fine-tuned a pretrained VGG-16 network [
147] using fixed-size patches sampled from the lesion spots. The CNN was then used to extract a convolutional feature vector for each patch. The softmax probabilities output by the CNN were used as weights of the patch-level feature representations. Region-level features were obtained through the average pooling of patch-level ones. Their method may be limited because they only considered ROIs with the most severe diagnosis within a WSI.
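A minimal sketch of this patch-based pipeline, assuming fixed-size RGB patches cropped from one region: a feature vector per patch from an ImageNet-pretrained CNN (ResNet-18 here purely for illustration; the cited works used VGG-16 or OverFeat), plain average pooling into a region-level vector (the weighted variant replaces the mean with a softmax-weighted sum), and a linear SVM classifier.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

# ImageNet-pretrained backbone with the classification head removed.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()            # keep the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def region_feature(patches) -> torch.Tensor:
    """patches: list of fixed-size H x W x 3 uint8 RGB arrays from one region."""
    batch = torch.stack([preprocess(p) for p in patches])
    feats = backbone(batch)                  # (num_patches, 512)
    return feats.mean(dim=0)                 # average pooling -> region vector

# Training: one pooled vector per region, fitted with a linear SVM, e.g.,
#   X = torch.stack([region_feature(p) for p in regions]).numpy()
#   clf = LinearSVC().fit(X, region_labels)
```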
The abovementioned supervised patch-level prediction strategies have several inherent drawbacks. First, these methods require WSIs with well-annotated cancer regions. However, pixelwise annotations are seldom available because of the large sizes of WSIs. Comparatively, many WSIs with only image-level ground truth labels can be utilized. Second, patch-level labels do not consistently correspond with the region label [
133]. Simple decision fusion approaches (e.g., voting and max pooling) are insufficiently robust to make the right region-level prediction. Weakly supervised methods, especially multiple instance learning (MIL)-based ones [
131,
133,
139,
148–
151], are extensively investigated to solve the above issues. Xu
et al. [
131] extracted feature representations via deep neural networks and applied the MIL framework for classification. Their MIL performance exceeded that of supervised DL features. Hou
et al. [
133] proposed an expectation–maximization (EM)-based method that combines patch-based CNN with supervised decision fusion. First, they identified discriminative patches in WSIs by combining an EM-based method with CNN. The histograms of patch-level predictions were then passed to a multiclass logistic regression or SVM to predict the WSI-level labels. They verified their method on two WSI data sets. However, their algorithm was computationally intensive, and its performance improvement was slight. Wang
et al. [
139] first collected discriminative patches using an FCN. Spatially contextual information was then utilized for feature selection. A globally holistic region descriptor was generated by aggregating the features from multiple representative instances and fed into an RF for WSI-level classification.
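The core MIL idea behind these weakly supervised methods can be sketched with the simplest aggregation rule, max pooling over patch logits, so that only the slide-level label is needed for training; the patch encoder and feature dimension are placeholders.

```python
import torch
import torch.nn as nn

class MaxPoolingMIL(nn.Module):
    """A bag (slide) is positive if any of its patches is positive."""

    def __init__(self, patch_encoder: nn.Module, feat_dim: int):
        super().__init__()
        self.encoder = patch_encoder           # any patch-level CNN -> (N, feat_dim)
        self.score = nn.Linear(feat_dim, 1)    # patch-level logit

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, 3, H, W), all cropped from a single slide.
        logits = self.score(self.encoder(patches)).squeeze(-1)
        return logits.max()                    # bag logit = most suspicious patch

# Trained with nn.BCEWithLogitsLoss against the slide-level label only.
```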
Segmentation
Considering the large sizes and small quantity of tissue sections, the region-level segmentation problem is typically formulated as a patch-level classification task. In the
MICCAI 2014 Brain Tumor Digital Pathology Challenge [
144], Xu
et al. [
130,
132] extracted patch features using CNN (pretrained on ImageNet [
126]). A linear SVM was then applied to classify the patches as necrosis or nonnecrosis. The method achieved state-of-the-art performance in the challenge. Qaiser
et al. [
136] computed the persistent homology and CNN features of each patch. The two types of features were separately fed into an RF regression model. The output prediction was obtained via a multistage ensemble strategy. Although their approach was elaborately designed, the improvement was small. A superpixel-based scheme was used to oversegment the region into atomic areas rather than cropping patches with a tiling strategy; this scheme was shown to perform better than fixed-size square window-based approaches [
135].
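The underlying tiling strategy can be sketched as follows, assuming a patch-level classifier has already been trained: crop a nonoverlapping grid of patches, classify each one, and stitch the predictions into a coarse region-level label map.

```python
import numpy as np

def tile_and_stitch(wsi: np.ndarray, classify_patch, patch: int = 256) -> np.ndarray:
    """wsi is H x W x 3; classify_patch maps one patch to an integer class.

    Returns an (H // patch, W // patch) map with one label per tile.
    """
    rows, cols = wsi.shape[0] // patch, wsi.shape[1] // patch
    label_map = np.zeros((rows, cols), dtype=np.int64)
    for i in range(rows):
        for j in range(cols):
            tile = wsi[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            label_map[i, j] = classify_patch(tile)   # e.g., a CNN forward pass
    return label_map
```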
Some studies have applied weakly supervised methods for segmentation, where only region-level labels are available. Courtiol
et al. [
138] adopted a MIL technique for localizing disease areas in WSIs. They first cropped all patches from nonoverlapping grids. The pretrained ResNet-50 [
152] architecture was then utilized to extract patch-level feature vectors, followed by feature embedding. Top instances and negative evidence were selected. A multilayer perceptron classifier was used for classification. Several studies have combined FCN and MIL for weakly supervised segmentation. Jia
et al. [
137] developed a new MIL algorithm under a deep weak supervision (DWS) formulation. They utilized the first three stages of the VGGNet and connected side-output layers with DWS under MIL. Area constraints were presented to regularize the sizes of predicted positive instances. Subsequently, the side-outputs were merged to exploit the multi-scale predictions. The method was general and had potential applications for various medical image analysis tasks.
Grading and prognosis
Prognosis provides the prediction of possible disease development, such as the likelihood of disease deterioration or amelioration, chance of survival, and expectations of life quality [
153]. Recent advancements of DL methods in histopathology make DL a promising tool to enhance the accuracy of prognosis [
154].
DL methods are primarily applied to extract cancer-related features or structures during prognosis. Non-small cell lung cancer (NSCLC) is typically treated with surgical resection. However, disease recurrence usually occurs for patients with early-stage NSCLC. Validated biomarkers are scarce for the prediction of the recurrence rate [
155]. To assist subsequent therapy decisions (whether or not to adopt adjuvant chemotherapy), Wang
et al. [
156] first used a CNN- and watershed-based method for nuclear segmentation. They then extracted and selected numerous features from the previously segmented nuclei pixels. Three different classifiers were used to distinguish binary classes corresponding to recurrence or nonrecurrence. In their experiment, DL showed slightly better performance compared with conventional watershed algorithms. Vaidya
et al. [
157] presented a radiology–pathology fusion approach to predict recurrence-free survival. This method also used a CNN to segment nuclei. The pathological features extracted from the nuclei and the intratumoral/peritumoral features from computed tomography scans were combined, which was an innovative concept. After feature selection using a minimum redundancy maximum relevance algorithm, three classifiers were used for recurrence risk stratification.
Gene status is an important indicator of prognosis. The mutation of specific genes reveals the level of tumor severity and the probability of healing [
158–
160]. However, modeling the relationship between pathology images and high-dimensional genetic data is challenging [
161]. Many studies have focused on the prediction of specific gene mutation. Coudray
et al. [
162] used a modified Inception v3 to predict the mutation of six genes (STK11, EGFR, FAT1, SETBP1, KRAS, and TP53). In particular, the network takes a lung adenocarcinoma (LUAD) tile as input and outputs the mutation status of each of the six genes.
Some research groups [
163] studied the feasibility of DL methods as substitutes for complicated or expensive prognosis-related clinical examinations. Gastrointestinal cancer patients with microsatellite-instable (MSI) tumors benefit considerably from immunotherapy. However, the MSI test requires additional immunohistochemical or genetic analysis. Many patients do not receive MSI screening, so potential responders to immune checkpoint inhibitors may miss timely treatment [
164]. To address this problem, Kather
et al. [
163] successfully identified MSI from ubiquitously available H&E-stained histology images. This method first trained ResNet-18 [
152] to detect tumors. Square tiles were then cut from tumor regions. Another ResNet-18 [
152] was used for MSI classification after color normalization. With advanced DL, a broad target population can be screened without additional laboratory testing.
Cancer grade describes the appearance of tumors and closely correlates with survival. It is an important determinant of prognosis and has attracted the interest of many researchers. CNN can mine the hidden patterns of pathology characteristics. Therefore, cancer grading can be conducted directly on raw WSIs, without intermediate feature extraction (e.g., tissue structures and cell morphology). Nagpal
et al. [
165] designed a DL system that outperforms human pathologists. The WSI was divided into dense patches, where each patch was classified as either nontumor or Gleason pattern (GP) 3/4/5 by InceptionV3 [
166]. After calculating the GP quantitation of a slide, Gleason scoring was predicted. Ing
et al. [
167] demonstrated the suitability of semantic segmentation for prostate adenocarcinoma grading. A semantic segmentation network can delineate and classify tumor regions. In their method, foreground tissue areas were first roughly located through gray-level thresholding and then partitioned into multiple subtiles. Subsequently, four CNN architectures (FCN-8s [
70], SegNet-Full [
168], SegNet-Basic [
168], and U-net [
38]) were trained to divide the areas of stroma, benign epithelium, and Gleason Grade 3/4/5 tumors. Segmentation was conducted in differently scaled images, which were combined and reassembled to whole slide grading maps. Li
et al. [
169] regarded automatic Gleason grading as an instance segmentation task. In this way, glands with the same grading are separated into different individuals. They proposed a novel Path R-CNN model. In Path R-CNN, the ResNet model was used as a backbone to extract feature maps and feed them into two branches. The first branch was a cascade of a region proposal network (RPN) and a grading network head (GNH). In this classical two-stage branch, the RPN worked as a proposal generator, and the GNH predicted a binary mask, class, and box offset for each ROI. The second branch, called the epithelial network head (ENH), was a simple binary classification network. The ENH can filter images without tumors, for which the network outputs the whole image as background. The network outputs the results produced by the GNH when the ENH predicts the existence of tumors. A fully connected conditional random field model reduces the artifacts at the edges of stitched patches. Although Path R-CNN adopted an end-to-end structure, the training process was two-stage: the GNH and the high layers of the ResNet backbone were trained first and fixed, and the ENH was then trained.
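As a concrete example of the quantitation step in such grading systems, the sketch below converts per-patch Gleason pattern predictions into a slide-level score using the classical primary + secondary rule; it illustrates the principle only and is not the exact scoring logic of Ref. [165].

```python
from collections import Counter

def gleason_score(patch_patterns) -> int:
    """patch_patterns: iterable of per-patch labels in {0, 3, 4, 5},
    where 0 means nontumor and 3/4/5 are Gleason patterns."""
    counts = Counter(p for p in patch_patterns if p != 0)
    if not counts:
        return 0                               # no tumor detected
    ranked = [p for p, _ in counts.most_common()]
    primary = ranked[0]                        # most prevalent pattern
    secondary = ranked[1] if len(ranked) > 1 else primary
    return primary + secondary                 # e.g., 3 + 4 = Gleason score 7

print(gleason_score([0, 3, 3, 4, 0, 3]))       # -> 7
```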
Conclusions
The advent of DL has immensely improved the consistency, efficiency, and accuracy of pathology analysis. As a powerful tool, DL provides reliable support for diagnostic assessment and treatment decisions. In comparison with traditional machine learning methods, DL algorithms can uncover unrecognized features to assist prognosis. CNN benefits from transfer learning, meaning that performance can be improved by pretraining the network on large nonpathology data sets (e.g., ImageNet [
128]).
However, DL-based approaches are still in the exploratory stage and have several limitations. First, large high-quality data cohorts labeled by multiple authoritative pathologists are needed. The performance of CNN strongly depends on the quality and quantity of training data, and the evaluation of different algorithms relies on the objectivity of the gold standard. However, annotation is relatively expensive, thereby hindering the acquisition of adequate data. Second, intrinsic difficulties exist in histology annotation, even for expert pathologists. Although experts often disagree, only a single criterion is used for network training and testing, which makes evaluation unreliable. Third, strong robustness for clinical application has not been achieved. Results will probably degrade when an algorithm proposed in one study is applied to another type of target (WSIs with different image magnifications, staining conditions, cancer types, etc.). More reliable paradigms are required. Finally, DL methods are still “black boxes” whose internal patterns cannot be comprehended. Low interpretability creates a semantic gap, making these methods undependable in clinical practice. Clearly, much progress remains to be made in this field.
Despite all these challenges, DL has certainly driven a revolution in DP diagnosis. In the future, effective cancer prognosis, image registration, and biological target prediction on pathology images through DL methods will be valuable and challenging research directions. Solving medical problems through unsupervised or weakly supervised learning, such as one-shot learning, remains to be explored. We envision that DL will serve as a decision support tool for human pathologists and immensely alleviate their workloads via high-throughput analysis. The combination of human and artificial intelligence shows a bright prospect.