mtFFPECleaner: A high-fidelity machine learning approach for artifact removal in high-throughput mitochondrial DNA sequencing data of archival formalin-fixed paraffin-embedded specimens

Shengjing Li , Wenjie Guo , Tianlei Sun , Fanfan Xie , Fan Peng , Kaixiang Zhou , Zhangwen Lei , Xu Guo , Yang Liu , Jinliang Xing

Interdisciplinary Medicine, 2025, Vol. 3, Issue 6: e70062

DOI: 10.1002/inmd.70062
RESEARCH ARTICLE

Abstract

Archival formalin-fixed paraffin-embedded (FFPE) specimens are a valuable resource for mitochondrial DNA (mtDNA) biomarker studies. However, formalin-induced artifacts mimic somatic mutations and confound clinical interpretation. Current artifact-removal tools are optimized for nuclear DNA and lack mtDNA-specific adaptation to its high copy number, GC-content disparity between strands, and heteroplasmy, so a tailored computational solution is needed. We developed mtFFPECleaner, a machine learning framework integrating multidimensional features of mtDNA mutations, including variant allele frequency (VAF) distribution, strand orientation bias score, sequence context, and local base composition. The framework employs a random forest classifier trained on 837 ground-truth genuine mutations and 1169 artifacts from 23 paired FFPE-fresh frozen (FF) samples. Model training used tenfold cross-validation, followed by independent validation on an additional 15 paired FFPE-FF samples. Our analyses revealed that formalin-induced artifacts in mtDNA next-generation sequencing (NGS) data predominantly occur as C>T/G>A transitions, particularly at low VAF, with significant strand bias and sequence context dependence. The mtFFPECleaner classifier, optimized through balanced sampling (1:2 ratio of artifacts to genuine mutations), achieved 98.7% specificity and 98.2% sensitivity in the independent validation set, outperforming conventional nuclear DNA artifact-removal tools, including SOBDetector and DEEPOMICS FFPE. Furthermore, after artifact removal by mtFFPECleaner, the mutational spectra of 314 FFPE samples showed remarkable concordance with those observed in FF samples. Importantly, the artifact burden correlated with the duration of FFPE sample storage, underscoring mtFFPECleaner's capability to mitigate formalin-induced damage accumulated over decades in archival biospecimens. mtFFPECleaner represents the first dedicated solution for enhancing mutational fidelity in mtDNA NGS data from FFPE specimens. The open-source R package (https://github.com/AlienLemon117/mtFFPECleaner) ensures scalability for large-scale archival studies, unlocking the translational potential of FFPE biobanks.
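
The feature types named in the abstract (VAF, strand orientation bias, substitution class, local base composition) and the reported sensitivity and specificity can be illustrated with a minimal Python sketch. This is not the mtFFPECleaner implementation, which is distributed as an R package. The `Variant` fields, the strand-bias formula, and the choice of "artifact" as the positive class are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One candidate mtDNA variant call (all field names are hypothetical)."""
    ref: str        # reference base
    alt: str        # alternate base
    alt_fwd: int    # alt-supporting reads on the forward strand
    alt_rev: int    # alt-supporting reads on the reverse strand
    depth: int      # total read depth at the site
    context: str    # flanking sequence around the site, e.g. "ACG"


def vaf(v: Variant) -> float:
    """Variant allele frequency: alt-supporting reads over total depth."""
    return (v.alt_fwd + v.alt_rev) / v.depth if v.depth else 0.0


def strand_bias(v: Variant) -> float:
    """Illustrative score in [0, 1]: 0 = even split, 1 = one strand only.

    Assumed formula; the paper's strand orientation bias score may differ.
    """
    total = v.alt_fwd + v.alt_rev
    return abs(v.alt_fwd - v.alt_rev) / total if total else 0.0


def is_ct_ga(v: Variant) -> bool:
    """True for the C>T / G>A transitions typical of formalin damage."""
    return (v.ref.upper(), v.alt.upper()) in {("C", "T"), ("G", "A")}


def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in the local sequence."""
    seq = seq.upper()
    return sum(b in ("G", "C") for b in seq) / len(seq) if seq else 0.0


def feature_vector(v: Variant) -> dict:
    """Collect the per-variant features a classifier could be trained on."""
    return {
        "vaf": vaf(v),
        "strand_bias": strand_bias(v),
        "is_ct_ga": is_ct_ga(v),
        "gc_fraction": gc_fraction(v.context),
    }


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity).

    Both inputs are sequences of booleans where True = artifact.
    Treating artifacts as the positive class is an assumption here.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t and p for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec
```

In a workflow like the one described, such feature vectors would be computed for variants labeled by FFPE-FF comparison and passed to a random forest implementation; the classifier itself is omitted here to keep the sketch free of third-party dependencies.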

Keywords

artifact removal / FFPE specimens / machine learning / mitochondrial DNA mutation / next-generation sequencing

Cite this article

Shengjing Li, Wenjie Guo, Tianlei Sun, Fanfan Xie, Fan Peng, Kaixiang Zhou, Zhangwen Lei, Xu Guo, Yang Liu, Jinliang Xing. mtFFPECleaner: A high-fidelity machine learning approach for artifact removal in high-throughput mitochondrial DNA sequencing data of archival formalin-fixed paraffin-embedded specimens. Interdisciplinary Medicine, 2025, 3(6): e70062 DOI:10.1002/inmd.70062

RIGHTS & PERMISSIONS

© 2025 The Author(s). Interdisciplinary Medicine published by Wiley-VCH GmbH on behalf of Nanfang Hospital, Southern Medical University.