Imputation of single-cell gene expression with an autoencoder neural network

Md. Bahadur Badsha, Rui Li, Boxiang Liu, Yang I. Li, Min Xian, Nicholas E. Banovich, Audrey Qiuyan Fu

Quant. Biol. 2020, Vol. 8, Issue 1: 78-94. DOI: 10.1007/s40484-019-0192-7
RESEARCH ARTICLE

Abstract

Background: Single-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed in a single experiment, scRNA-seq still has several limitations, including high dropout rates, which leave a large number of genes with zero read counts in the scRNA-seq data and complicate downstream analyses.

Methods: To overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with random initial values of the parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows for the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates.
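
The idea in the Methods paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single hidden layer, plain gradient descent, NumPy instead of TensorFlow, and simulated data, and every name in it (`init_params`, `train`, `impute`, the layer sizes, the dropout rate) is hypothetical. The loss is computed only over nonzero entries, so zeros act as missing values. A LATE-like fit starts from random parameters, and a TRANSLATE-like fit starts from parameters trained on a reference data set.

```python
import numpy as np

def init_params(n_genes, n_hidden, rng):
    """Random initial parameters (the starting point of the LATE-like fit)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_genes, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_genes)),
        "b2": np.zeros(n_genes),
    }

def forward(params, X):
    """Encode with one ReLU layer, decode with a linear layer."""
    H = np.maximum(0.0, X @ params["W1"] + params["b1"])
    return H, H @ params["W2"] + params["b2"]

def masked_mse(X, Y, mask):
    """Mean squared error over observed (nonzero) entries only."""
    return float(np.sum(mask * (Y - X) ** 2) / np.sum(mask))

def train(X, params, n_steps=500, lr=0.02):
    """Gradient descent on the masked MSE; zeros contribute no gradient."""
    params = {k: v.copy() for k, v in params.items()}
    mask = (X > 0).astype(float)  # assumption: zero means missing
    n_obs = mask.sum()
    for _ in range(n_steps):
        H, Y = forward(params, X)
        dY = 2.0 * mask * (Y - X) / n_obs
        dZ = (dY @ params["W2"].T) * (H > 0)  # backprop through ReLU
        grads = {"W1": X.T @ dZ, "b1": dZ.sum(axis=0),
                 "W2": H.T @ dY, "b2": dY.sum(axis=0)}
        for k in params:
            params[k] -= lr * grads[k]
    return params

def impute(X, params):
    """Fill zeros with the decoder output; keeping observed values is a
    choice made in this sketch, not something the abstract specifies."""
    _, Y = forward(params, X)
    return np.where(X > 0, X, np.maximum(Y, 0.0))

# Simulated low-rank expression data (hypothetical sizes and rates).
rng = np.random.default_rng(0)
n_genes, n_hidden = 20, 8
loadings = rng.uniform(0.0, 0.6, (3, n_genes))
truth = rng.gamma(2.0, 0.5, (200, 3)) @ loadings      # single-cell "truth"
reference = rng.gamma(2.0, 0.5, (100, 3)) @ loadings  # reference data set
dropped = rng.random(truth.shape) < 0.3               # simulated dropouts
X_obs = np.where(dropped, 0.0, truth)
mask = (X_obs > 0).astype(float)

# LATE-like: random initial parameters, trained on the single-cell data.
start = init_params(n_genes, n_hidden, rng)
late = train(X_obs, start)

# TRANSLATE-like: pre-train on the reference, then continue on X_obs.
pretrained = train(reference, init_params(n_genes, n_hidden, rng))
translate = train(X_obs, pretrained)

loss_init = masked_mse(X_obs, forward(start, X_obs)[1], mask)
loss_late = masked_mse(X_obs, forward(late, X_obs)[1], mask)
loss_translate = masked_mse(X_obs, forward(translate, X_obs)[1], mask)
imputed = impute(X_obs, late)
print(loss_init, loss_late, loss_translate)
```

The only difference between the two fits in this sketch is the `params` argument passed to `train`, which mirrors the abstract's description of TRANSLATE as LATE with initial parameter estimates taken from a reference data set.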

Results: On both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU.

Conclusions: We demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.

Keywords

single-cell / gene expression / deep learning / autoencoder

Cite this article

Md. Bahadur Badsha, Rui Li, Boxiang Liu, Yang I. Li, Min Xian, Nicholas E. Banovich, Audrey Qiuyan Fu. Imputation of single-cell gene expression with an autoencoder neural network. Quant. Biol., 2020, 8(1): 78‒94 https://doi.org/10.1007/s40484-019-0192-7

DATA AVAILABILITY

Data sets used in this paper are published and publicly available at the following websites:
• Mouse bone marrow data: GEO GSE72857 (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72857).
• GTEx gene expression data (Gene TPMs) (gtexportal.org/home/datasets).
• Human PBMC and mouse brain data from 10X Genomics (support.10Xgenomics.com/single-cell-gene-expression/datasets).
• Mouse retina data: provided in the scVI software package (github.com/YosefLab/scVI/tree/master/tests/data); original data from GEO GSE81905 (www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE81905).

CODE AVAILABILITY

Our software is implemented in Python and builds on Google TensorFlow. It can be run on CPUs and GPUs. The source code is available at the website (github.com/audreyqyfu/LATE).

SUPPLEMENTARY MATERIALS

The supplementary materials can be found online with this article at https://doi.org/10.1007/s40484-019-0192-7.

AUTHOR CONTRIBUTIONS

A.Q.F. conceived the project. R.L., B.L. and A.Q.F. developed the method. R.L., M.X. and A.Q.F. implemented the software. M.B.B., R.L. and A.Q.F. analyzed the data. M.B.B., R.L., B.L., Y.I.L., N.E.B. and A.Q.F. interpreted the results. A.Q.F. wrote the manuscript with input from other authors.

ACKNOWLEDGEMENTS

We acknowledge the extensive support on hardware and software provided by the Computational Resources Core (CRC) through the Institute for Bioinformatics & Evolutionary Studies at the University of Idaho, and in particular the assistance from Dr. Benjamin Oswald, the Director of the CRC. This research is supported by NIH R00HG007368 (to A.Q.F.) and partially by the NIH/NIGMS grant P20GM104420 to the Institute for Modeling Collaboration & Innovation at the University of Idaho.

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The gene expression data used for the analyses described in this manuscript were obtained from the GTEx Portal on 04/11/2018.

COMPLIANCE WITH ETHICS GUIDELINES

The authors Md. Bahadur Badsha, Rui Li, Boxiang Liu, Yang I. Li, Min Xian, Nicholas E. Banovich and Audrey Qiuyan Fu declare that they have no conflicts of interest.

This article does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

© 2020 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature