A structural variation genotyping algorithm enhanced by CNV quantitative transfer

Tian ZHENG, Xinyang QIAN, Jiayin WANG

Front. Comput. Sci., 2022, Vol. 16, Issue 6: 166905. DOI: 10.1007/s11704-021-1177-z
RESEARCH ARTICLE

Abstract

Genotyping structural variations in the presence of copy number variations (CNVs) is a nascent and challenging problem. CNVs, a prevalent and critical form of genetic variation that causes abnormal copy numbers of large genomic regions in cells, often affect transcription and contribute to a variety of diseases. The characteristics of CNVs make existing genotyping features and algorithms ambiguous, so heterozygous variations may be erroneously genotyped as homozygous, which seriously affects the accuracy of downstream analysis. As the allelic copy number increases, the genotyping error rate rises sharply. Some instances with different copy numbers help the genotype classifier, while others seriously degrade the accuracy of the model. Motivated by these observations, we propose a transfer learning-based method that genotypes structural variations accurately while accounting for CNVs. The method first partitions the instances by allelic copy number and trains the basic machine learning framework on the different genotype datasets. It then increases the weights of the instances that contribute to classification and decreases the weights of the instances that hinder correct genotyping. By adjusting the weights of the instances with different allelic copy numbers, the contribution of all instances to genotyping is maximized, and the genotyping errors on heterozygous variations caused by CNVs are minimized. We applied the proposed method to both simulated and real datasets, and compared it with popular algorithms including GATK, Facets and Gindel. The experimental results demonstrate that the proposed method outperforms the others in terms of accuracy, stability and efficiency. The source code has been uploaded to github/TrinaZ/CNVtransfer for academic use only.
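
The abstract does not spell out the weight update, so the following is only a minimal sketch of one common instance-weighting scheme for transfer learning, a TrAdaBoost-style round. It assumes that instances sharing the copy-number state being genotyped form the "target" set and that instances with other allelic copy numbers form the auxiliary "source" set. The function name `reweight_round` and all parameter names are hypothetical and do not come from the paper or its repository.

```python
import math


def reweight_round(weights, is_target, mistakes, n_rounds):
    """One TrAdaBoost-style weight update (illustrative, not the paper's code).

    weights   : current positive instance weights
    is_target : True for instances in the copy-number state being genotyped,
                False for auxiliary instances with other copy numbers
    mistakes  : True where the current base classifier misgenotyped the instance
    n_rounds  : total number of boosting rounds planned
    """
    n_source = max(sum(1 for t in is_target if not t), 1)
    # Fixed decay factor applied to misclassified auxiliary instances.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    # Weighted error is measured on target instances only.
    target_total = sum(w for w, t in zip(weights, is_target) if t)
    target_wrong = sum(
        w for w, t, m in zip(weights, is_target, mistakes) if t and m
    )
    eps = target_wrong / target_total
    eps = min(max(eps, 1e-10), 0.499)  # clamp so beta_tgt stays in (0, 1)
    beta_tgt = eps / (1.0 - eps)

    updated = []
    for w, t, m in zip(weights, is_target, mistakes):
        if m and t:
            w = w / beta_tgt   # hard target instance: weight increases
        elif m:
            w = w * beta_src   # misleading auxiliary instance: weight decreases
        updated.append(w)

    total = sum(updated)
    return [w / total for w in updated]  # renormalise to sum to 1
```

Under these assumptions, repeating the round lets auxiliary instances that keep disagreeing with the classifier fade out, while target instances that are hard to genotype gain influence. This mirrors the idea stated in the abstract of raising the weights of helpful instances and lowering those of harmful ones.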

Keywords

genotyping / copy number variations / transfer learning

Cite this article

Tian ZHENG, Xinyang QIAN, Jiayin WANG. A structural variation genotyping algorithm enhanced by CNV quantitative transfer. Front. Comput. Sci., 2022, 16(6): 166905. https://doi.org/10.1007/s11704-021-1177-z

Acknowledgements

The work was supported by the National Natural Science Foundation of China (Grant No. 31701150) and the Fundamental Research Funds for the Central Universities (CXTD2017003).

Supporting Information

The supporting information is available online at journal.hep.com.cn and link.springer.com.

RIGHTS & PERMISSIONS

2022 Higher Education Press