Tuning hyperparameters of doublet-detection methods for single-cell RNA sequencing data
Nan Miles Xi, Angelos Vasilopoulos
Background: Doublets in single-cell RNA sequencing (scRNA-seq) data pose a major challenge for downstream data analysis. Computational doublet-detection methods have been developed to remove doublets from scRNA-seq data, yet their default hyperparameter settings may not provide optimal performance.
Methods: We propose a strategy to tune the hyperparameters of scDblFinder, a cutting-edge doublet-detection method. We use a full factorial design to explore the relationship between hyperparameters and detection accuracy on 16 real scRNA-seq datasets, then obtain the optimal hyperparameters from a response surface model via convex optimization.
Results: We show that the optimal hyperparameters provide top performance across scRNA-seq datasets under various biological conditions. Our tuning strategy can be applied to other computational doublet-detection methods, and it offers insights into hyperparameter tuning for a broader range of computational methods in scRNA-seq data analysis.
Conclusions: The hyperparameter configuration significantly impacts the performance of computational doublet-detection methods. Our study is the first attempt to systematically explore the optimal hyperparameters under various biological conditions and optimization objectives, and it provides much-needed guidance for hyperparameter tuning in these methods.
Doublets are a major confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational doublet-detection methods aim to remove doublets from scRNA-seq data, and their performance depends on appropriate hyperparameter settings. In this study, we explore the optimal hyperparameters for scDblFinder, a cutting-edge doublet-detection method. Our optimization uses a full factorial design, a response surface model, and 16 real scRNA-seq datasets. The optimal hyperparameters achieve top doublet-detection performance under a wide range of biological conditions. Our methodology is applicable to a broader range of computational methods in scRNA-seq data analysis.
scRNA-seq / doublet detection / hyperparameter tuning / experimental design / response surface model
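The tuning strategy summarized above (a full factorial design, then a response surface model, then optimization) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the two factors `x1` and `x2` stand in for unnamed hyperparameters in coded units, `hypothetical_accuracy` is a synthetic placeholder rather than a measured detection accuracy, and a second-order polynomial is only one common choice of response surface. Because the fitted surface here is a concave quadratic, its maximizer is found in closed form instead of with a general convex solver.

```python
from itertools import product

def full_factorial(levels):
    """Enumerate every combination of the per-factor levels."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[n] for n in names))]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def quad_features(x1, x2):
    """Second-order model terms: 1, x1, x2, x1^2, x2^2, x1*x2."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def fit_quadratic(design, response):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    X = [quad_features(d["x1"], d["x2"]) for d in design]
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * y for r, y in zip(X, response)) for i in range(p)]
    return solve(XtX, Xty)

def maximizer(beta):
    """Stationary point of the fitted quadratic; valid only if it is concave."""
    _, b1, b2, b11, b22, b12 = beta
    if not (b11 < 0 and 4 * b11 * b22 - b12 ** 2 > 0):
        raise ValueError("fitted surface is not concave; no interior maximum")
    # Setting the gradient to zero gives a 2x2 linear system.
    return solve([[2 * b11, b12], [b12, 2 * b22]], [-b1, -b2])

def hypothetical_accuracy(x1, x2):
    """Synthetic placeholder response, NOT a result from the study."""
    return 0.9 - (x1 - 0.3) ** 2 - 2.0 * (x2 + 0.2) ** 2

# Two hypothetical hyperparameters, three coded levels each: 3 x 3 = 9 runs.
levels = {"x1": [-1.0, 0.0, 1.0], "x2": [-1.0, 0.0, 1.0]}
design = full_factorial(levels)
response = [hypothetical_accuracy(d["x1"], d["x2"]) for d in design]

beta = fit_quadratic(design, response)
optimum = maximizer(beta)
print("fitted coefficients:", [round(b, 4) for b in beta])
print("estimated optimum (coded units):", [round(v, 4) for v in optimum])
```

In a real tuning run, each design point would be evaluated by running the detection method on benchmark datasets and recording an accuracy metric, and the coded optimum would be mapped back to the hyperparameters' natural units.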