How error correction affects polymerase chain reaction deduplication: A survey based on unique molecular identifier datasets of short reads

Pengyao Ping; Tian Lan; Shuquan Su; Wei Liu; Jinyan Li

Quant. Biol. 2025, Vol. 13, Issue 3: e99. DOI: 10.1002/qub2.99

REVIEW ARTICLE



Abstract

Next-generation sequencing data are widely used in downstream bioinformatics applications, and numerous techniques have been developed for PCR-deduplication and error-correction to eliminate the bias and errors introduced during sequencing. This study provides, for the first time, a joint overview of recent advances in PCR-deduplication and error-correction for short reads. In particular, we use UMI-based PCR-deduplication strategies and sequencing data to assess the performance of solely computational PCR-deduplication approaches and to investigate how error correction affects the performance of PCR-deduplication. Our survey and comparative analysis reveal that the deduplicated reads generated by solely computational PCR-deduplication and error-correction methods differ substantially from the sets of reads obtained by UMI-based deduplication methods. Existing solely computational PCR-deduplication and error-correction tools can eliminate some errors but still leave hundreds of thousands of erroneous reads uncorrected. All the error-correction approaches introduce thousands or more new sequences after correction, and these new sequences do not benefit the PCR-deduplication process. Based on our findings, we discuss future research directions and suggest improvements to existing computational approaches to enhance the quality of short-read sequencing data.
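The contrast between sequence-only and UMI-based deduplication can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the procedure used in the survey: the reads, the UMIs, and the function names `dedup_by_sequence` and `dedup_by_umi` are all hypothetical, and real UMI-based tools typically also account for errors within the UMIs themselves.

```python
from collections import Counter

# Hypothetical toy reads as (UMI, sequence) pairs; real data would come from FASTQ files.
reads = [
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACGT"),  # PCR duplicate: same UMI, same sequence
    ("AACG", "ACGTACGA"),  # same UMI, but with a one-base error
    ("TTGC", "ACGTACGT"),  # a different molecule that happens to share the sequence
    ("GGAT", "TTTTCCCC"),
]


def dedup_by_sequence(reads):
    """Sequence-only deduplication: keep one copy of each distinct sequence."""
    return {seq for _, seq in reads}


def dedup_by_umi(reads):
    """Simplified UMI-based deduplication (assumption: error-free UMIs).

    Reads are grouped by UMI and the most frequent sequence in each group
    is kept as the representative of that molecule.
    """
    groups = {}
    for umi, seq in reads:
        groups.setdefault(umi, Counter())[seq] += 1
    return [(umi, counts.most_common(1)[0][0]) for umi, counts in sorted(groups.items())]


seq_only = dedup_by_sequence(reads)
umi_based = dedup_by_umi(reads)
print(sorted(seq_only))
print(umi_based)
```

In this toy case, the sequence-only result retains the erroneous read `ACGTACGA` and merges two distinct molecules that share a sequence, whereas the UMI-based result keeps one representative per molecule. This is the kind of divergence between the two read sets that the abstract describes.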

Keywords

error correction / next generation sequencing (NGS) / PCR-deduplication / polymerase chain reaction (PCR) duplicates / unique molecular identifier (UMI)

Cite this article


Pengyao Ping, Tian Lan, Shuquan Su, Wei Liu, Jinyan Li. How error correction affects polymerase chain reaction deduplication: A survey based on unique molecular identifier datasets of short reads. Quant. Biol., 2025, 13(3): e99. DOI: 10.1002/qub2.99

