mtFFPECleaner: A high-fidelity machine learning approach for artifact removal in high-throughput mitochondrial DNA sequencing data of archival formalin-fixed paraffin-embedded specimens
Shengjing Li, Wenjie Guo, Tianlei Sun, Fanfan Xie, Fan Peng, Kaixiang Zhou, Zhangwen Lei, Xu Guo, Yang Liu, Jinliang Xing
Interdisciplinary Medicine, 2025, Vol. 3, Issue 6: e70062
Archival formalin-fixed paraffin-embedded (FFPE) specimens are a valuable resource for mitochondrial DNA (mtDNA) biomarker studies. However, formalin-induced artifacts mimic somatic mutations and confound clinical interpretation. Current artifact-removal tools are optimized for nuclear DNA and are not adapted to the high copy number, strand-specific GC-content disparity, and heteroplasmy of mtDNA, so a tailored computational solution is needed. We developed mtFFPECleaner, a machine learning framework that integrates multidimensional features of mtDNA mutations, including variant allele frequency (VAF) distribution, strand orientation bias score, sequence context, and local base composition. The framework employs a random forest classifier trained on 837 ground-truth genuine mutations and 1169 artifacts from 23 paired FFPE-fresh frozen (FF) samples. Model training was performed using tenfold cross-validation, followed by independent validation on an additional 15 paired FFPE-FF samples. Our analyses revealed that formalin-induced artifacts in mtDNA next-generation sequencing (NGS) data predominantly occur as C>T/G>A transitions, particularly in low VAF ranges, with significant strand bias and sequence context dependence. The mtFFPECleaner classifier, optimized through balanced sampling (1:2 ratio of artifacts to genuine mutations), achieved 98.7% specificity and 98.2% sensitivity in the independent validation set, outperforming conventional nuclear DNA artifact-removal tools, including SOBDetector and DEEPOMICS FFPE. Furthermore, following artifact removal by mtFFPECleaner, the mutational spectra of 314 FFPE samples were closely concordant with those observed in FF samples.
Importantly, we observed that the artifact burden correlated with the duration of FFPE sample storage, underscoring mtFFPECleaner's capability to effectively mitigate formalin-induced damage accumulated over decades in archival biospecimens. mtFFPECleaner represents the first dedicated solution for enhancing mutational fidelity in mtDNA NGS data from FFPE specimens. The open-source R package (https://github.com/AlienLemon117/mtFFPECleaner) ensures scalability for large-scale archival studies, unlocking the translational potential of FFPE biobanks.
artifact removal / FFPE specimens / machine learning / mitochondrial DNA mutation / next-generation sequencing
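To make the per-variant features named in the abstract concrete, the following is a minimal Python sketch, not the mtFFPECleaner implementation (which is distributed as an R package). The `Variant` record, the field names, and in particular the strand bias formula are assumptions chosen for illustration; the abstract does not define how the strand orientation bias score is computed.

```python
from dataclasses import dataclass

# Substitution classes the abstract reports as dominant among formalin artifacts.
FORMALIN_SIGNATURE = {("C", "T"), ("G", "A")}


@dataclass
class Variant:
    """Hypothetical record for one candidate mtDNA variant."""
    ref: str        # reference base
    alt: str        # alternate base
    alt_fwd: int    # alt-supporting reads on the forward strand
    alt_rev: int    # alt-supporting reads on the reverse strand
    depth: int      # total read depth at the site


def vaf(v: Variant) -> float:
    """Variant allele frequency: alt-supporting reads over total depth."""
    return (v.alt_fwd + v.alt_rev) / v.depth if v.depth else 0.0


def strand_bias(v: Variant) -> float:
    """Illustrative bias score in [-1, 1]; 0 means balanced strands.

    Assumption: this simple normalized difference stands in for the
    paper's strand orientation bias score, whose definition is not
    given in the abstract.
    """
    total = v.alt_fwd + v.alt_rev
    return (v.alt_fwd - v.alt_rev) / total if total else 0.0


def features(v: Variant) -> dict:
    """Collect the features a downstream classifier could consume."""
    return {
        "vaf": vaf(v),
        "strand_bias": strand_bias(v),
        "is_ct_ga": (v.ref, v.alt) in FORMALIN_SIGNATURE,
    }


# A low-VAF, strand-skewed C>T call: the profile the abstract associates
# with formalin artifacts.
suspect = Variant(ref="C", alt="T", alt_fwd=18, alt_rev=2, depth=2000)
suspect_features = features(suspect)
```

In a full pipeline, dictionaries like `suspect_features` would be combined with sequence context and local base composition and passed to a classifier such as a random forest; that step is omitted here because it depends on details the abstract does not give.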
2025 The Author(s). Interdisciplinary Medicine published by Wiley-VCH GmbH on behalf of Nanfang Hospital, Southern Medical University.