INTRODUCTION
Deep learning is revolutionizing many data-rich research fields, such as biology and medicine [
1]. Almost in parallel with the boom of deep learning over the past decade, single-cell technology has become a major driving force behind rapid advances in biomedicine. With unprecedented quantity and resolution, single-cell data offer exciting opportunities to uncover many secrets of life, such as cancer drug resistance, gene regulation in embryonic development, and the mechanisms of stem cell differentiation and reprogramming [
2–
4]. With the increasing speed of data generation (jumping from hundreds of cells to millions of cells in the past few years), single-cell biology calls for powerful bioinformatics methods. Against this backdrop of big data, it is not surprising that single-cell bioinformatics intersects with deep learning. In this review, we focus on single-cell RNA-seq (scRNA-seq) data, because it is currently one of the most prevalent data types in molecular biology and, by directly profiling the gene expression levels of individual cells, it links dynamics at the molecular and cellular levels.
Despite the intense development of bioinformatics techniques for scRNA-seq data analysis in the past 5 years, some challenges are still not fully addressed, e.g., dropout events, batch effect, noise, high dimensionality, and scalability. On the other hand, deep learning researchers need to understand these computational challenges in order to use deep learning to help with the data analysis and knowledge discovery. By summarizing the challenges and state-of-the-art methods of applying deep learning to scRNA-seq data analysis, this review may contribute to building a bridge between single-cell bioinformatics and deep learning.
To use existing deep learning methods or develop new ones, we need to obtain datasets of sufficient quantity and quality. Specialized scRNA-seq databases containing pre-processed data have been developed. For instance, PanglaoDB is an scRNA-seq database collected from mouse and human cells [
5]. It contains pre-processed and pre-computed analyses from more than 1,054 single-cell experiments covering most major single-cell platforms and protocols, based on more than 4 million cells from a wide range of tissues and organs. In addition, many scRNA-seq datasets have been released along with journal publications (
e.g., [
6–
8]), some of which have been used in the papers reviewed below.
CHALLENGES OF scRNA-SEQ DATA ANALYSIS
To develop useful algorithms and software tools for the analysis of scRNA-seq data, it is crucial to recognize and understand the computational challenges posed by the analysis tasks. Although many approaches have been proposed to address these challenges, more and better techniques are still needed. Moreover, with the fast advances in single-cell technologies, new data analysis challenges will continue to appear. The following is a list of challenges that have recently been addressed by deep learning methods. Due to space limitations, state-of-the-art methods that are not based on deep learning are not reviewed here. For details, readers can refer to a few reviews focused on the challenges in scRNA-seq data analysis [
9–
11].
Batch effect
High-throughput single-cell RNA-seq data are often collected in multiple batches, under different conditions, on different platforms, or by different laboratories. Differences among batches inevitably lead to shifts in gene expression values, which may be confounded with the biological variation arising from inherent inter-cell heterogeneity. Such technical bias is called
batch effect. If not corrected, batch effect would result in spurious structures in the data and misleading conclusions in downstream analysis. It is worth noting that batch effect is not unique to single-cell data analysis. Recently published methods for batch effect removal in scRNA-seq data include canonical correlation analysis (CCA) [
12] and mutual nearest neighbors (MNN) [
13].
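As a rough illustration of the MNN idea (a generic Python sketch under toy assumptions, not the published implementation), the snippet below finds mutual nearest neighbor pairs between two hypothetical batches; the embeddings, batch sizes, and the choice of k are placeholders.

```python
# Toy sketch of the mutual nearest neighbors (MNN) idea for batch correction:
# cells in batch A and batch B that appear in each other's k-nearest-neighbor
# lists are treated as anchors of the same biological state.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
batch_a = rng.standard_normal((200, 50))        # toy low-dimensional embeddings of batch A
batch_b = rng.standard_normal((180, 50)) + 0.5  # batch B with an artificial shift

k = 10
nn_ab = NearestNeighbors(n_neighbors=k).fit(batch_b).kneighbors(batch_a, return_distance=False)
nn_ba = NearestNeighbors(n_neighbors=k).fit(batch_a).kneighbors(batch_b, return_distance=False)

# (i, j) is an MNN pair if each cell appears in the other's k-nearest-neighbor list
mnn_pairs = [(i, j) for i in range(len(batch_a)) for j in nn_ab[i] if i in nn_ba[j]]
print(f"{len(mnn_pairs)} mutual nearest neighbor pairs")
```

In the published method, such pairs are then used to estimate batch-correction vectors that align the batches.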
Dropout events
For lowly expressed genes, the small numbers of RNA molecules and the stochastic nature of transcription can lead to spurious zero entries in the expression matrices of scRNA-seq data. This is called
dropout [
14]. To correct for the zero-inflation bias caused by dropout, numerous statistical methods (especially imputation) have been recently proposed [
15,
16]. Coincidentally, in deep learning, the term “dropout” refers to a regularization technique that addresses overfitting [
17,
18]. When the dropout technique is applied in deep learning, every neuron has a chance of being temporarily ignored during model training, thereby preventing co-adaptation among neighboring neurons and increasing the robustness and generalizability of the whole neural network.
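To make the distinction concrete, here is a minimal PyTorch sketch of the deep learning sense of dropout; the layer sizes and dropout rate are illustrative only.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2000, 128),   # e.g., 2000 genes as input features (illustrative sizes)
    nn.ReLU(),
    nn.Dropout(p=0.5),      # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(128, 10),
)

x = torch.randn(32, 2000)   # a toy batch of 32 "cells"
model.train()               # dropout is active: a different random subset is silenced each pass
_ = model(x)
model.eval()                # dropout is disabled at inference time
_ = model(x)
```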
Technical noise
In addition to batch effects and dropout events, other technical factors, such as cDNA amplification bias, cell cycle effects, and insufficient sequencing depth, can also introduce biases into scRNA-seq data, especially for lowly expressed genes; such biases are called
technical noise. On the other hand, single-cell data contain intrinsic biological variability, which can reveal valuable insights into the mechanisms of gene regulation at the single-cell level. It is therefore crucial and challenging to separate the technical noise from the biological noise [
19]. To tackle the challenge, powerful biological techniques, such as ERCC (external RNA control consortium) spike-in [
20] and unique molecular identifier (UMI) [
21], have been developed and widely used. Nonetheless, computational techniques are still needed to further correct for the technical noise. The past few years have seen an increasing number of denoising methods,
e.g., normalization [
22,
23], as well as statistical models for both biological noise and technical noise [
24].
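As one concrete example of such denoising by normalization, the sketch below applies a simple library-size normalization (counts per 10,000 followed by a log transform) to a toy count matrix; this is a generic recipe rather than any specific published method.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 2000)).astype(float)         # toy cells-by-genes count matrix
library_sizes = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # total counts per cell
normalized = np.log1p(counts / library_sizes * 1e4)               # counts per 10k, then log(1 + x)
print(normalized.shape)
```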
Curse of dimensionality
A key step in scRNA-seq data analysis is dimensionality reduction. Typically, an scRNA-seq dataset contains the expression profiles of a large number of genes, where each gene corresponds to a dimension, and the expression profile of each cell corresponds to a data point in the high-dimensional cell state space. In some data analysis steps (
e.g., clustering), the distance between data points plays crucial roles. In a high-dimensional space, however, as the data points become sparse, the distance measures (
e.g., Euclidean distance, Mahalanobis distance and Manhattan distance) lose their effectiveness, rendering the concept of nearest neighbor unclear and data analysis problems difficult. This is called the
curse of dimensionality [
25]. Moreover, it can lead to overfitting, especially when the number of data points is relatively small. One way to alleviate the issues caused by high dimensionality is to collect more data, but in most cases that would be infeasible because the amount of data required increases exponentially with the dimensionality. Thus, an alternative and practical solution is dimensionality reduction [
26].
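The distance concentration underlying the curse of dimensionality can be seen in a few lines of NumPy: as the number of dimensions grows, the gap between the nearest and farthest neighbor shrinks relative to the average distance (the sample sizes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 20, 200, 2000):
    points = rng.standard_normal((500, d))         # 500 random "cells" in d dimensions
    query = rng.standard_normal(d)
    dist = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
    contrast = (dist.max() - dist.min()) / dist.mean()
    print(f"d = {d:5d}   relative nearest/farthest contrast = {contrast:.3f}")
```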
Many approaches of dimensionality reduction have been employed or developed for scRNA-seq data,
e.g., PCA (principal component analysis) [
27], t-SNE [
28,
29], diffusion map [
13], GPLVM [
30,
31], SIMLR [
32], and UMAP [
33], etc. Recently, some dimensionality reduction methods have been devised to take into account characteristics unique to scRNA-seq data, such as ZIFA, which explicitly models dropout events [
34]. Tested on simulated and real data, these methods have been shown to be effective in extracting salient factors from high-dimensional scRNA-seq datasets and to help improve performance in various downstream analyses,
e.g., clustering, visualization, cell type discovery, developmental trajectory reconstruction, pseudotime inference, and gene regulatory network inference. However, existing dimensionality reduction methods still have limitations, such as a lack of robustness to random sampling, an inability to capture global structures while focusing on local structures of the data, sensitivity to parameters, and high computational cost.
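A typical two-step reduction is sketched below with scikit-learn on a toy matrix: PCA to a few tens of components, then t-SNE for a 2D visualization; the matrix and all parameter values are placeholders rather than recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(1000, 2000)).astype(float)   # hypothetical cells-by-genes matrix
log_expr = np.log1p(counts)                                  # simple variance-stabilizing transform

pcs = PCA(n_components=50).fit_transform(log_expr)           # linear reduction to 50 components
embedding = TSNE(n_components=2, perplexity=30).fit_transform(pcs)  # nonlinear 2D embedding
print(embedding.shape)                                       # (1000, 2)
```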
Scalability
While dimensionality reduction mainly deals with the large number of genes in scRNA-seq data, the other key dimension of data size is the number of cells. Both dimensions pose the challenge of scalability. Since the advent of the droplet-based scRNA-seq technique (
i.e., Drop-seq) [
35], the number of cells profiled in each experiment has reached tens of thousands and even millions. High-throughput single-cell projects, such as the Human Cell Atlas (HCA) project [
4], are generating data from enormous numbers of cells, which calls for more efficient and scalable algorithms for modeling and data analysis. For instance, some methods for dimensionality reduction and clustering require multiplication of two
N ×
N matrices, where
N is the number of cells. Besides algorithmic innovations, some parallel and high-performance computing techniques,
e.g., graphics processing units (GPUs), are also frequently used in different areas of bioinformatics [
36].
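A quick back-of-the-envelope calculation shows why such N × N operations become prohibitive as the number of cells grows (assuming single-precision storage):

```python
# Memory needed just to store one dense N x N cell-cell matrix in float32 (4 bytes per entry)
for n_cells in (10_000, 100_000, 1_000_000):
    gib = n_cells * n_cells * 4 / 2**30
    print(f"N = {n_cells:>9,d}  ->  {gib:8.1f} GiB")
```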
AUTOENCODERS
In the past 5 years, numerous statistical and machine learning methods have been proposed to address the above issues of scRNA-seq data analysis. However, deep learning techniques have been involved in tackling these issues only since around 2017. Among these techniques, autoencoder has been the most popular so far.
Autoencoders are artificial neural networks for unsupervised learning. They are sometimes also regarded as self-supervised learning, since the target output is taken directly from the input data. The main idea of an autoencoder is to learn efficient representations of data by forcing the neural network to reconstruct its input as accurately as possible under some constraints [
37]. While directly learning the identity function would be easy but useless, imposing constraints on the internal hidden layers of the neural network (
e.g., lower dimensions) can force the model to ignore irrelevant information and capture most essential patterns in the data. A typical autoencoder comprises an encoder (which converts the input data to an internal representation) and a decoder (which generates output from the internal representation). The loss function is usually the reconstruction error based on some measure of distance between the input and output data (
e.g., Euclidean distance or Kullback–Leibler divergence). Autoencoders are often used for dimensionality reduction. In fact, principal component analysis (PCA) can be considered a special case of the autoencoder in which the cost function is the mean squared error and only linear activation functions are used. Other applications of autoencoders include pretraining for supervised learning, feature extraction, information retrieval, etc.
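The basic encoder-decoder structure and reconstruction loss can be sketched in a few lines of PyTorch; the layer sizes, optimizer settings, and toy data below are illustrative only.

```python
import torch
import torch.nn as nn

n_genes, latent_dim = 2000, 32   # illustrative sizes

encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_genes))

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, n_genes)     # a toy batch of 64 cells
for _ in range(10):              # a few training steps
    z = encoder(x)               # low-dimensional internal representation
    x_hat = decoder(z)           # reconstruction of the input
    loss = loss_fn(x_hat, x)     # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```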
Depending on the specific application, there are variants of autoencoders, such as stacked (or deep) autoencoders (where multiple hidden layers are stacked) [
38], denoising autoencoders (where noise is added to the input in order to push the autoencoder to recognize the informative patterns in the data) [
39], and variational autoencoders (where a sampling layer lies between the encoder and decoder such that the output data consist of instances sampled from the same distribution as the input data) [
40].
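The defining ingredient of the variational autoencoder, the sampling layer between encoder and decoder, can be sketched as a generic reparameterization-trick module (not taken from any reviewed method; sizes are placeholders).

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Maps hidden activations to a latent sample plus a KL penalty toward a standard normal prior."""
    def __init__(self, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # z ~ N(mu, sigma^2), differentiable
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return z, kl

z, kl = VAEBottleneck()(torch.randn(64, 256))   # toy hidden activations for 64 cells
```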
DEEP LEARNING METHODS FOR scRNA-SEQ DATA ANALYSIS
In the following, we briefly review several deep learning methods that have been applied to scRNA-seq data analysis, published between 2017 and 2019. Although the number of papers is still small in this direction, an explosion of publications is likely to come soon.
Shaham
et al. proposed one of the first methods that use neural networks to remove batch effect [
41]. They used residual neural networks (ResNets), which can be made very deep because they avoid exploding or vanishing gradients and gain performance with depth. ResNets can easily learn mapping functions that are close to the identity function, and thus are suitable for calibrating a source sample to match the target sample in batch effect removal. Moreover, the maximum mean discrepancy (MMD), a measure of distance between two probability distributions, was used to define the loss function of the ResNets. The authors applied the MMD-ResNet method to remove batch effects in both mass cytometry and scRNA-seq data.
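For intuition, a generic (biased) Gaussian-kernel estimate of the squared MMD between two batches, in the spirit of the loss used by MMD-ResNet but not the authors' implementation, could look like this:

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b).pow(2)                 # pairwise squared Euclidean distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st                 # small when the two distributions match

source = torch.randn(128, 32) + 1.0               # toy embeddings with a batch shift
target = torch.randn(128, 32)
print(mmd2(source, target).item())
```

Minimizing such a term pushes the calibrated source distribution toward the target distribution.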
Batch effect removal is usually done before cell clustering in scRNA-seq data analysis. However, a recently proposed algorithm called “DESC” (deep embedding algorithm for single-cell clustering) combines clustering and batch effect removal in an iterative framework [
42], considering the fact that different cell types have different degrees of vulnerability to batch effect. Here, a stacked autoencoder is used for dimensionality reduction, as a step of pretraining and parameter initialization for the iterative clustering. The trained encoder layers represent the mapping from the original data to the lower-dimensional representation.
To address the issue of dropout, several imputation methods have been developed,
e.g., MAGIC [
43], scImpute [
44] and drImpute [
45]. Inspired by the recent success of autoencoders for sparse matrix imputation in collaborative filtering for recommendation systems, Talwar
et al. proposed an autoencoder-based method called “AutoImpute” to handle the dropout in scRNA-seq data [
46]. The authors used overcomplete autoencoders, which aim to regenerate the imputed expression matrix by focusing on the non-zero entries in the input sparse matrix.
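The key trick, a reconstruction loss restricted to the observed (non-zero) entries, can be sketched as a generic masked mean-squared error (not the exact objective of AutoImpute):

```python
import torch

def masked_mse(x, x_hat):
    mask = (x != 0).float()                       # 1 where the entry was observed (non-zero)
    return ((x_hat - x).pow(2) * mask).sum() / mask.sum().clamp(min=1)

x = torch.tensor([[0.0, 3.0, 0.0], [2.0, 0.0, 5.0]])      # toy sparse expression matrix
x_hat = torch.tensor([[1.0, 2.5, 0.5], [2.2, 4.0, 4.5]])  # toy reconstruction
print(masked_mse(x, x_hat))                       # only the four non-zero entries contribute
```

Because the zeros do not contribute to the loss, the network is free to fill them in with imputed values.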
Eraslan
et al. proposed a method called “DCA” (deep count autoencoder) to solve the problems of denoising and imputation together [
47]. The main idea is to define the loss function in terms of noise models,
e.g., the negative binomial (NB) and zero-inflated negative binomial (ZINB). When the ZINB noise model is used, the loss function is the negative log-likelihood of the ZINB distribution. The output layer of DCA includes three nodes for each gene, representing the mean of the NB distribution (taken as the denoised data) and the two remaining parameters of the ZINB distribution (
i.e., dispersion and dropout probability). Compared with other imputation and denoising methods for scRNA-seq data, DCA has two advantages: the ability to capture nonlinear dependencies among genes, and scalability to millions of cells thanks to the efficiency of the autoencoder and its support for GPUs.
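For concreteness, a simplified ZINB negative log-likelihood of the kind used as a loss in DCA might look like the following generic formulation (not the authors' implementation; the parameter tensors below are toy values):

```python
import torch

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    """x: observed counts; mu: NB mean; theta: NB dispersion; pi: dropout (zero-inflation) probability."""
    log_theta_mu = torch.log(theta + mu + eps)
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
              + theta * (torch.log(theta + eps) - log_theta_mu)
              + x * (torch.log(mu + eps) - log_theta_mu))          # log NB(x | mu, theta)
    log_nb_zero = theta * (torch.log(theta + eps) - log_theta_mu)  # log NB(0 | mu, theta)
    zero_case = torch.log(pi + (1 - pi) * torch.exp(log_nb_zero) + eps)
    nonzero_case = torch.log(1 - pi + eps) + log_nb
    return -torch.where(x < 0.5, zero_case, nonzero_case).mean()

x = torch.tensor([0.0, 0.0, 3.0, 7.0])                             # toy counts for one gene
print(zinb_nll(x, mu=torch.full_like(x, 2.0),
               theta=torch.full_like(x, 1.5),
               pi=torch.full_like(x, 0.3)))
```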
Several deep learning methods for the dimensionality reduction of scRNA-seq data have been recently proposed. Lin
et al. compared four neural network architectures for learning representations of scRNA-seq data, and used a denoising autoencoder (DAE) for unsupervised pre-training [
48]. Their main goals, however, were supervised learning to discriminate cell types and database queries to infer cell types or states. They also found that neural networks incorporating prior knowledge of protein-protein and protein-DNA interactions can perform better. Moreover, analysis of the learned models can yield biological insights, demonstrating some degree of interpretability of the neural networks. Ding
et al. used the variational autoencoder (VAE) to infer the approximate posterior distributions of low-dimensional latent variables, thereby learning a parametric mapping from a high-dimensional space to a low-dimensional embedding [
49]. Compared with popular existing methods such as t-SNE, their method named “scvis” can capture the global structure in the data, is more robust to noise, and has better interpretability thanks to the probabilistic nature of the latent variable model. However, according to Becht
et al. [
33], the running time of scvis for dimensionality reduction is long. Wang and Gu proposed a method called “VASC”, which uses a deep variational autoencoder for dimensionality reduction and visualization of scRNA-seq data [
50]. The architecture of VASC includes an encoder network, a decoder network, and a zero-inflation layer that models the dropout events. Compared with existing methods such as PCA, t-SNE and ZIFA, VASC can capture nonlinear patterns in the data and has broader data compatibility. Moreover, VASC could recover cell developmental processes based on dimensionality reduction, although it is still insufficient for recovering cell differentiation trajectories. Integrating the Gene Ontology (GO) with deep neural networks, Peng
et al. proposed both an unsupervised method called “GOAE” (Gene Ontology AutoEncoder) and a supervised method called “GONN” (Gene Ontology Neural Network) for dimensionality reduction and clustering [
51]. Their experimental results show that, by incorporating prior knowledge from GO, both the clustering performance and the interpretability of the neural networks can be improved. However, the model is sensitive to the threshold used to select GO terms.
Most of the above methods each focus on one or two data processing and analysis tasks. It would be desirable, however, to integrate different tools into one joint framework. Lopez
et al. developed an integrative software tool called “scVI” (single-cell variational inference), which can carry out tasks including batch correction, library-size bias correction, dropout correction, imputation, dimensionality reduction, clustering, and visualization [
52]. The rationale is that different analysis tasks can reuse a common low-dimensional representation of scRNA-seq data to increase consistency and flexibility. scVI is a probabilistic approach based on a hierarchical Bayesian model. It uses variational autoencoders for dimensionality reduction and neural networks for multiple tasks such as estimating the dropout probability. Note that the scVI algorithm is built on a highly modular deep learning framework, and thus better results could be obtained by testing other combinations of modules (
e.g., nonlinearities, regularization). Another hierarchical Bayesian model with a deep autoencoder for data denoising, named SAVER-X, was published recently [
53]. In addition, SAVER-X employs transfer learning to automate cross-study information sharing and data integration. As such, the authors were able to perform cross-species scRNA-seq data analysis, transferring information from mouse cells to human cells to alleviate the shortage of human cell data. More recently, a deep recurrent learning method called “scScope” was proposed to perform batch effect removal, dropout imputation, cell subpopulation identification, and other tasks [
54]. At the core of scScope is an autoencoder, with a layer for batch effect correction and a layer for imputation. The output of the imputation layer is fed back to the input of the encoder, forming a recurrent network structure; with only one iteration, the framework reduces to a standard autoencoder. Moreover, like other deep learning methods, scScope supports parallel training on GPUs, and thus promises scalability to millions of cells.
Most of the deep learning methods reviewed above are summarized in Table 1.
DISCUSSION
The methods reviewed here have mostly used autoencoders, a popular class of unsupervised machine learning methods, probably because labeled data are often unavailable for scRNA-seq datasets, which renders supervised learning inapplicable.
Despite the exciting development of deep learning methods for scRNA-seq data analysis, there are still limitations that point to interesting directions for future work. First, deep learning has mostly been used for data preprocessing (
e.g., dimensionality reduction, denoising, and imputation), serving to enhance rather than directly carry out downstream analysis tasks (
e.g., lineage identification, gene regulatory network inference). Secondly, in most cases it is still unclear whether deep learning methods are significantly better than traditional statistical or machine learning methods. An interesting direction for future work would be to compare deep learning methods with other methods on specific tasks, to gain insights into when and why deep learning performs better. Thirdly, the reported performance of deep learning methods may be sensitive to the values of hyper-parameters [
55], and the robustness and generalizability of deep learning methods should be assessed by testing on some third-party data. Of course, this issue is not specific to scRNA-seq data analysis. Fourthly, the integration of scRNA-seq data with other types of single-cell data (
e.g., single-cell Hi-C data, scATAC-seq data, single-cell proteomics and cell imaging data) could be facilitated by some deep learning methods in the future [
56–
58]. Last but not least, more interpretability is needed for scientific knowledge discovery. Although some of the reviewed papers have included biological interpretation of the hidden layers of autoencoders, a desirable aspect of interpretability is the automatic construction of a model that is able to simulate the single-cell dynamics of transcription. For that, generative adversarial network (GAN) models might be promising [
59]. Although not as popular as some other deep learning methods in genomics, GAN has recently been applied to single-cell genomics [
60]. For example, Ghahramani
et al. used a GAN model to generate synthetic scRNA-seq data by simulation and to achieve dimensionality reduction [
61]. Their model offers higher parameter interpretability than other models. A recent method for data integration, called the Manifold-Aligning GAN (MAGAN), was proposed by Amodio
et al. [
62]. MAGAN aligns two manifolds to maintain pointwise correspondence in order to integrate scRNA-seq data and proteomic data (
e.g., mass cytometry).
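As a rough illustration of the GAN setup in this context, the sketch below pits a toy generator of expression-like vectors against a discriminator; the architecture, sizes, and training schedule are placeholders and far simpler than the published single-cell GAN models.

```python
import torch
import torch.nn as nn

n_genes, noise_dim = 2000, 64                                  # illustrative sizes
generator = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                          nn.Linear(256, n_genes), nn.Softplus())   # non-negative "expression"
discriminator = nn.Sequential(nn.Linear(n_genes, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
real = torch.rand(32, n_genes)                                 # stand-in for a real minibatch of cells

for _ in range(5):
    # Discriminator step: label real cells 1 and generated cells 0
    fake = generator(torch.randn(32, noise_dim)).detach()
    d_loss = bce(discriminator(real), torch.ones(32, 1)) + bce(discriminator(fake), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: try to make generated cells look real to the discriminator
    fake = generator(torch.randn(32, noise_dim))
    g_loss = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```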