INTRODUCTION
Third-generation sequencing (TGS), also called single-molecule, real-time (SMRT) sequencing, including the PacBio RS II platform developed by Pacific Biosciences (PacBio) and the MinION platform developed by Oxford Nanopore, captures sequence information directly during the replication of DNA molecules. TGS generates reads with an average length on the order of 10 kb, much longer than next-generation sequencing (NGS) reads (100–500 bp), which significantly improves the completeness and continuity of de novo genome assemblies. Furthermore, the weak effect of classical sequencing biases, such as GC content, allows complex genome regions to be resolved and structural variants to be explored with unprecedented accuracy and resolution. Many results based on TGS have been published in different fields, such as de novo assembly, haplotype phasing, N6-methyladenine DNA modification, transcriptome analysis and structural variation detection; TGS has advanced genomic studies in an impressive way. However, TGS technology has a serious drawback: its sequencing error rate (11%–15% for PacBio raw reads, 5%–35% for Nanopore raw reads) is much higher than that of NGS (1%). The high error rate together with the long read length brings great challenges to de novo assembly of TGS sequences. Because of the dramatic difference in data characteristics between TGS and NGS sequences, computational methods designed for NGS sequences often do not work for TGS sequences. Furthermore, the sequencing errors of PacBio and Nanopore sequences are also remarkably different, which requires them to be addressed separately.
The widely used de novo assembly workflows developed for TGS sequences can currently be categorized into two strategies: “correction-and-assembly” and “assembly-and-correction”. The “correction-and-assembly” strategy corrects the sequencing errors in the TGS raw reads first, and then assembles the genome using the corrected reads. The “assembly-and-correction” strategy, on the other hand, assembles the genome using the error-prone raw reads and then corrects the genome assembly with the raw reads. Because of the computationally intensive error correction process, the “correction-and-assembly” approach is usually slower. However, directly assembling a genome from raw reads with high sequencing error rates may introduce assembly errors, especially in genomes containing repetitive regions, which degrades the quality of the reference genome and biases downstream analyses. Practice has demonstrated that the “correction-and-assembly” strategy can reconstruct highly contiguous and accurate genome assemblies.
In this review we first examine the sequencing errors of PacBio and Oxford Nanopore sequences respectively, and then introduce the currently widely used “correction-and-assembly” and “assembly-and-correction” de novo assembly workflows together with several relevant methods, such as alignment methods and error correction methods designed for TGS sequences. We summarize all of these computational methods in Table 1. To analyze the performance of the workflows, we quote two benchmarks and discuss the assembly results on each dataset. Based on their performance on different datasets, we then list some usage guidelines for applying these computational methods. Finally, we discuss the development prospects of sequencing technologies and conclude this paper.
CHARACTERISTICS OF SMS SEQUENCES
Initially, the sequences (also called continuous long reads, CLRs) generated by the PacBio RS system with the first generation of chemistry (C1 chemistry) had average lengths of only around 1500 bp [1]. The PacBio RS II system with the sixth generation of polymerase and the fourth generation of chemistry (P6-C4 chemistry) has now improved average read lengths to over 10 kb; the maximum read length can exceed 60 kb and the N50 can reach up to 20 kb [2]. Error rates of PacBio raw reads are 11%–15%, and the large majority of sequencing errors are insertions and deletions (indels).
Because errors are randomly distributed in CLRs, a sufficient number of sequencing passes yields a circular consensus sequence (CCS) with a significantly reduced error rate [3]. For example, an accuracy of >99% can be achieved with a coverage of 15 passes. However, since the total length of a CLR is limited by the lifetime of the polymerase [4], there is a tradeoff between the number of sequencing passes and the CCS read length. Recently, CCS has been optimized to improve the accuracy and length of single-molecule real-time sequencing [5]: the average sequence length has been increased to 13.5 kb and the accuracy is as high as 99.8%. This improvement has been successfully used for variant detection and de novo assembly of human genomes.
The lack of bias and the random nature of sequencing errors [6] make it possible, and relatively easy, to correct PacBio reads to high accuracy. Heuristic gapped alignment methods are able to effectively find overlaps between the template (the raw read to be corrected) and its supporting reads (reads sampled from the same genomic regions as the template) when they differ by less than 30%. Based on these overlaps, the template can be corrected using a consensus algorithm. The accuracy of corrected reads can be as high as 97%–99%.
The MinION sequencer developed by Oxford Nanopore Technologies (ONT) generates long reads based on the transit of a DNA molecule through a nanoscopic pore, an approach that has been considered one of the most promising ways to detect polymeric molecules since the 1990s [7]. Over the past years, ONT has released several MinION chemistry versions (R6.0, R7.0, R7.3, R9 and R9.4) and several updates of the base-calling tools that strongly improved the performance of the MinION device. While the earliest R6.0 chemistry generated a total yield on the order of tens of Mb and 2D reads with a median length of 4 kb and an average error rate of 40%, the more recent R7.0 and R7.3 improved the throughput to hundreds of Mb and the 2D read length and accuracy to 6 kb and 80%–85%, respectively [7]. In October 2016, new flow cells containing the R9.4 1D chemistry were released by ONT and used to generate the first nanopore high-coverage (30×) human genome with about 40 flow cells [8]. At the same time, a new protocol was developed to generate ultra-long reads (5× coverage) whose N50 was longer than 100 kb and whose maximum read length reached 882 kb. De novo assembly of this 35× raw read dataset yielded a contiguous assembly with an NG50 as long as 6.4 Mb [8].
Compared with PacBio sequences, errors in Nanopore reads are much more complex [9,10]. First, error rates of Nanopore reads are higher and broadly distributed both among reads and among subsequences of a read, ranging from 5% to 35%. Second, subsequences with extremely high error rates (50%) commonly exist in Nanopore reads [9]. This means that not only the error rates of two different reads, but also those of two different subsequences sampled from the same read, can differ greatly. These error characteristics make it much more difficult to align, correct and de novo assemble Nanopore reads. Furthermore, due to the higher error rates, the average identity of reads corrected from Nanopore data is generally lower than that of reads corrected from PacBio data. For example, the corrected reads of the 35× human raw sequences [8] mentioned above average only 92% identity to the reference, while reads corrected from PacBio raw reads are typically >99% identical to the reference genome. The decreased accuracy of corrected Nanopore sequences makes it harder to assemble genomes accurately, especially complex genomes.
CORRECTION-AND-ASSEMBLY METHODS
The “correction-and-assembly” methods obtain highly accurate consensus reads by correcting the noisy raw reads and then assemble the genome with the corrected reads. The assembly process generally consists of four stages. (i) Perform a pairwise mapping on the raw reads to find supporting reads for each template. (ii) Align the supporting reads to the template and correct the template based on these overlaps. (iii) Perform a pairwise mapping on the corrected reads and assemble the genome based on the mapping results. (iv) Polish the genome with various data sources, such as the raw reads and NGS sequences, using polishing algorithms such as Nanopolish [11], Pilon [12], smrtlink [13] and Racon [14] to further improve the accuracy of the genome. Note that these polishing methods are also used for correcting errors in genomes assembled by the “assembly-and-correction” approaches.
At the early stage, intuitive and automated hybrid workflows were developed for constructing finished, high-quality small bacterial genomes (1–10 Mb) [15–17]. In these hybrid methods, accurate short reads (from SMRT circular consensus sequencing [5] or NGS reads) are used to correct errors in the long SMRT sequencing reads. These methods have been applied successfully to a variety of microbes and eukaryotic organisms. They generally involve at least two different sequencing libraries and at least two types of sequencing runs (and sometimes different sequencing platforms). For more efficient genome assembly, especially for large and complex genomes, a homogeneous workflow that requires only one library and one sequencing platform is desirable to simplify the assembly process and to reduce sequencing costs. Another shortcoming of the hybrid approach is that the constructed genomes tend to break in regions with insufficient NGS coverage (owing to GC or sequence context biases) [18].
Currently the monohybrid methods that rely solely on TGS data have become the mainstream assembly approaches. These methods avoid the drawbacks in hybrid approaches and have been successfully applied for constructing large genomes (such as the human genome).
Both the hybrid and monohybrid assembly methods involve sequence alignment and error correction as intermediate steps. Sequence alignment of TGS reads is computationally intensive and remains the most time-consuming step in genome assembly. In this section, we first discuss the existing alignment algorithms developed for TGS reads (see the first category of Table 1). We then introduce the consensus methods that are widely used in the mainstream assembly workflows (see the second category of Table 1). Finally we describe the assembly workflows that have been extensively used for both PacBio and Nanopore sequences.
Alignment methods for TGS sequences
Strictly speaking, the existing alignment methods designed for TGS sequences are all variants of the standard Smith-Waterman algorithm [
19]. They employ various heuristics based on one or more characteristics of TGS reads to reduce the searching space and decrease memory usage of the original Smith-Waterman algorithm. Many of them follow BLAST’s “seed-and-extension” searching paradigm [
20].
BLASR
BLASR (basic local alignment with successive refinement) [21] uses a successive refinement approach to map SMRT sequences to a reference genome in the setting where the divergence between them is dominated by insertion and deletion sequencing errors. Its core idea is to narrow down the search space from all genome positions to a relatively small number of candidate genome intervals to which the read is likely to map. To determine which candidate interval is most likely to be the true mapping, a detailed alignment is performed on each candidate interval and the best one is chosen as the mapping result.
BLASR takes a read $q$ and a genome $G$ as inputs. The search process consists of three phases.
Step 1: Detect candidate intervals by clustering short exact matches. BLASR builds either a suffix array (SA) or a BWT-FM index [22] on the genome, according to the time and space requirements, and queries for exact matches of subsequences (of a minimum length) between $q$ and $G$. The BWT-FM index has been used substantially for aligning NGS short reads [23,24] and for accelerating the traditional BLAST search [25,26]. It is a very flexible data structure that allows efficient searching for fixed-length matches, maximal exact matches, and matches in which mismatched bases are allowed. The set of matches (also called anchors) is denoted $M$. The anchors in $M$ are then clustered with global chaining [27]. To do so, the anchors are sorted by genome and query positions, and clusters of anchors are found in intervals of roughly the length of the read. For each anchor $a$, a set $C_a$ is created containing the anchors whose genome positions fall within a read-length interval starting at the genome position of $a$. In each $C_a$, the global chaining algorithm [27] is used to find a maximal subset of anchors that do not overlap and are listed in increasing read and genome positions. BLASR assigns each cluster a frequency-weighted score computed from the lengths of its anchors and the genome frequencies of the corresponding subsequences, so that anchors from repetitive subsequences contribute less. After scoring, only the top-scoring clusters (10 by default) are retained.
Step 2: Map $q$ to the candidate intervals using sparse dynamic programming (SDP) [28]. For a candidate interval, let $a_{\min}$ and $a_{\max}$ be the anchors in its cluster that have the minimum and maximum genome positions, respectively. Given the maximum insertion rate $\epsilon$ of the sequencing device, the interval is extended outward from the genome positions of $a_{\min}$ and $a_{\max}$ by the lengths of the flanking, unanchored parts of the read scaled according to $\epsilon$, so that the whole read can be contained in the interval. Similar to Step 1, a set of matches is then found between $q$ and the candidate interval; the matches used here have a fixed short length and are denoted $M'$. The sparse dynamic programming finds the largest subset of anchors in $M'$ that are in increasing read and genome positions.
Step 3: Use the sparse dynamic programming alignment as a guide for a detailed banded alignment. The SDP in Step 2 aligns only the subsequences of $q$ covered by anchors in $M'$ to the candidate interval. To align every base of $q$ to the interval, BLASR runs a banded dynamic programming. Due to the large indel rate of SMRT sequences, the size of a single band containing the entire alignment can be prohibitively large; to reduce the search space, BLASR uses the SDP result as a guide, letting the band follow the layout of the chained anchors. After this step, the read bases are assigned to reference positions. The BLASR authors also provide a detailed theoretical analysis to support the clustering and to compute mapping quality values.
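To make the anchoring stages above concrete, the following sketch chains a set of exact-match anchors into a maximal collinear, non-overlapping subset, in the spirit of the global chaining applied inside each candidate cluster. The anchor representation and the length-sum scoring are illustrative assumptions, not BLASR's implementation.

```python
# Minimal sketch of collinear anchor chaining, in the spirit of BLASR's
# global chaining step.  Anchors are (read_pos, genome_pos, length) tuples;
# the scoring (sum of anchor lengths) is a simplifying assumption.

def chain_anchors(anchors):
    """Return the highest-scoring subset of non-overlapping anchors that
    increase in both read and genome coordinates (O(n^2) dynamic program)."""
    anchors = sorted(anchors)                      # sort by read, then genome position
    n = len(anchors)
    score = [a[2] for a in anchors]                # best chain score ending at anchor i
    back = [-1] * n                                # predecessor for traceback
    for i in range(n):
        ri, gi, li = anchors[i]
        for j in range(i):
            rj, gj, lj = anchors[j]
            # anchor j must end before anchor i starts, in both coordinates
            if rj + lj <= ri and gj + lj <= gi and score[j] + li > score[i]:
                score[i], back[i] = score[j] + li, j
    # trace back from the best-scoring chain end
    best = max(range(n), key=lambda i: score[i])
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = back[best]
    return chain[::-1]

if __name__ == "__main__":
    anchors = [(0, 100, 12), (15, 118, 10), (14, 300, 10), (30, 131, 15)]
    print(chain_anchors(anchors))   # -> [(0, 100, 12), (15, 118, 10), (30, 131, 15)]
```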
DALIGN
DALIGN [29] proposes a sensitive and threaded filter for detecting seed points that are likely to have a significant local alignment passing through them, together with a linear expected-time heuristic local alignment algorithm. It is developed for detecting read-to-read overlaps among noisy SMRT long reads. Specifically, given two blocks of noisy SMRT long reads, denoted $A$ and $B$, DALIGN aims to find local alignments between reads that are sufficiently long and stringent.
The first step of DALIGN is rapid seed detection with the threaded filter. Unlike BLASR, which employs a suffix array or BWT-FM index to search for variable-length anchors, DALIGN uses a much simpler series of highly optimized sorts to discover $k$-mer matches in a given diagonal band between two reads. The filter proceeds as follows:
1. Build a $k$-mer list for block $A$, in which each entry records a $k$-mer of a read in $A$ together with its read and position.
2. Similarly, build a $k$-mer list for block $B$.
3. Sort both lists in order of the hash values of their $k$-mers.
4. Build a $k$-mer match list by a merge sweep of the two sorted $k$-mer lists.
5. Sort the match list lexicographically on the read pair and the diagonal of each match.
Steps 1, 2 and 4 are easily seen to take linear time and space. The most time-consuming part is the sorting in Steps 3 and 5, which in theory takes $O(N\log N)$ time, where $N$ is the list size. DALIGN optimizes the encoding of the lists in Steps 3 and 5 by squeezing their elements into 64-bit integers, so the problem becomes realizing a threadable, memory-coherent sort of an array of $N$ 64-bit integers. DALIGN implements a radix sort [30] in which each 64-bit integer is considered as a vector of $64/b$ $b$-bit digits, where $b$ is a free parameter whose typical value is 8. The radix sort then sorts the numbers by stably sorting the array on the first $b$-bit digit, then on the second, and so on, in $64/b$ sorting passes. The sorting process asymptotically takes $O\big((64/b)(N + 2^b)\big)$ time; as $b$ is a fixed small constant, the algorithm is effectively $O(N)$ in time complexity.
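As an illustration of this sorting scheme, the sketch below sorts 64-bit integer keys with a least-significant-digit radix sort over 8-bit digits. It is only an illustrative single-threaded Python version under the stated digit width, not DALIGN's threaded, memory-coherent C implementation.

```python
# Illustrative LSD radix sort of 64-bit keys over b-bit digits (b = 8 here),
# mirroring the sorting scheme described for DALIGN's k-mer lists.

def radix_sort_u64(keys, b=8):
    """Stably sort a list of 64-bit integers, one b-bit digit per pass."""
    mask = (1 << b) - 1
    passes = 64 // b                           # 8 passes when b = 8
    for p in range(passes):
        shift = p * b
        buckets = [[] for _ in range(1 << b)]  # one bucket per digit value
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]  # stable concatenation
    return keys

if __name__ == "__main__":
    import random
    data = [random.getrandbits(64) for _ in range(10000)]
    assert radix_sort_u64(data) == sorted(data)
```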
The second step of DALIGN is rapid local alignment from a seed point $(i, j)$ reported by the filter in the first step, where $i$ and $j$ are the positions of a matching $k$-mer in the two reads. The original $O(nd)$ difference algorithm [31] centers on the idea of computing progressive “waves” of furthest reaching (f.r.) points. Starting from $(i, j)$ on diagonal $i - j$, the goal is to find the longest possible paths starting at $(i, j)$, first with 0 differences, then with 1 difference, 2 differences, and so on. The problem is that the waves become wider and wider as they are computed further away from $(i, j)$ in each direction. To narrow down the waves, DALIGN uses several strategies to trim the span of a wave by removing f.r. points that are extremely unlikely to lie on the desired local alignment. The key idea is that a desired local alignment should not, over any reasonable segment, have an exceedingly low correlation; this is also the idea behind the X-drop criterion in heuristic alignment tools such as BLAST. The first trimming principle is to remove f.r. points for which the last columns of the alignment contain too few matches. The second trimming principle is to keep only f.r. points that are within a bounded number of antidiagonals of the maximal antidiagonal reached by the wave. The computation of successive waves eventually ends because either (a) the boundary of either sequence is reached, or (b) all the f.r. points fail the regional alignment quality criterion, in which case it is assumed that the two reads no longer correlate with each other.
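The sketch below shows the basic furthest-reaching wave recurrence that this local alignment step builds on: it computes the minimum number of insertions and deletions between two sequences by extending waves of f.r. points along diagonals. DALIGN's seed-anchored local extension and trimming heuristics are omitted.

```python
# Basic furthest-reaching (f.r.) wave computation (the O(nd) scheme):
# wave d stores, for each diagonal k = x - y, the furthest x reachable with
# d differences (insertions/deletions).  Trimming heuristics are omitted.

def indel_distance_waves(a, b):
    """Minimum number of insertions and deletions turning a into b."""
    n, m = len(a), len(b)
    offset = n + m                       # shift so diagonals index a plain list
    fr = [0] * (2 * (n + m) + 1)         # furthest-reaching x per diagonal
    for d in range(n + m + 1):           # number of differences
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and fr[offset + k - 1] < fr[offset + k + 1]):
                x = fr[offset + k + 1]       # step from diagonal k+1
            else:
                x = fr[offset + k - 1] + 1   # step from diagonal k-1
            y = x - k
            while x < n and y < m and a[x] == b[y]:   # extend matches for free
                x += 1
                y += 1
            fr[offset + k] = x
            if x >= n and y >= m:
                return d                 # both sequences fully consumed
    return n + m

if __name__ == "__main__":
    # 4 indels: a substitution counts as one deletion plus one insertion here
    print(indel_distance_waves("GATTACA", "GCATGCA"))  # 4
```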
Experiments show that DALIGN is much more sensitive and can be 22 to 39 times faster than BLASR on the PacBio human dataset. Furthermore, it takes only 15,600 core hours for overlapping the 54× PacBio human dataset. Compared to the 404,000 core hours using BLASR when running on the Google “Exacycle” platform, this represents a substantial 25× reduction in computation time.
Minimap and minimap2
Minimap is an alignment-free mapping approach for TGS sequences. In addition to pairwise read overlapping, minimap also supports read-to-genome and genome-to-genome mapping. Its most distinctive feature is that it takes minimizer matches as seeds and skips the detailed alignment step, which makes it very fast compared with other mapping tools.
Minimizers [32] were proposed as a replacement for $k$-mers that reduces storage requirements without reducing search sensitivity. Minimap maps a $k$-mer to a 64-bit integer hash value in three steps. First, choose a function $\phi$ such that $\phi(\mathrm{A})=0$, $\phi(\mathrm{C})=1$, $\phi(\mathrm{G})=2$ and $\phi(\mathrm{T})=3$. Next, for a $k$-mer $s = a_1a_2\cdots a_k$, define $\phi(s) = \phi(a_1)\cdot 4^{k-1} + \phi(a_2)\cdot 4^{k-2} + \cdots + \phi(a_k)$, which encodes $s$ as an integer in $[0, 4^k)$. Finally, the function $h(s) = \mathrm{hash}(\phi(s))$ is used to transform each $k$-mer into a 64-bit integer hash value. Here hash (see Algorithm 2 of [33]) is an invertible integer hash function chosen so that complex $k$-mers tend to have small hash values while repetitive $k$-mers, such as a long run of A’s, tend to have large hash values. A $(w,k)$-minimizer of a string is the $k$-mer having the smallest hash value in a surrounding window of $w$ consecutive $k$-mers. Note that there can be more than one minimizer in a window of $w$ consecutive $k$-mers.
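As a concrete illustration, the sketch below extracts (w,k)-minimizers from a DNA string. The 2-bit base encoding follows the scheme just described, while `mix64` is a simple stand-in integer mixer rather than minimap's invertible hash, so the exact minimizer set will differ from minimap's.

```python
# Sketch of (w,k)-minimizer extraction.  The 2-bit base encoding follows the
# scheme described above; mix64 is an illustrative integer mixer, not the
# invertible hash used by minimap itself.

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MASK64 = (1 << 64) - 1

def mix64(x):
    """Cheap 64-bit integer mixer (placeholder for minimap's hash function)."""
    x = (x ^ (x >> 31)) * 0x7fb5d329728ea185 & MASK64
    x = (x ^ (x >> 27)) * 0x81dadef4bc2dd44d & MASK64
    return x ^ (x >> 33)

def kmer_hashes(seq, k):
    """Yield (position, hash) for every k-mer, using the base-4 encoding phi."""
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + ENCODE[base]   # phi(s) in [0, 4^k)
        yield i, mix64(value)

def minimizers(seq, w, k):
    """Return (position, hash) pairs that are (w,k)-minimizers: the k-mer(s)
    with the smallest hash in every window of w consecutive k-mers."""
    hashes = list(kmer_hashes(seq, k))
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(h for _, h in window)
        selected.update(p for p in window if p[1] == smallest)
    return sorted(selected)

if __name__ == "__main__":
    for pos, h in minimizers("ACGTTGCATGACGTACGT", w=5, k=5):
        print(pos, hex(h))
```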
Minimap works in two stages: indexing and mapping. In the indexing stage, it collects minimizers of all target sequences, sorts the minimizers by their hash values using the radix sort introduced in DALIGN [29], and builds a hash table for them. Minimap samples only roughly $2/(w+1)$ of the total $k$-mers (recall that a minimizer is the $k$-mer having the smallest hash value among $w$ consecutive $k$-mers), so its storage requirement is much smaller than that of DALIGN; because it samples the most informative $k$-mers, the sensitivity does not decrease significantly. In the mapping stage, minimap collects all the minimizer hits between the query and the target sequences. A single-linkage clustering is then performed to detect the maximal collinear subset of hits by solving a longest increasing subsequence problem (Algorithm 4 of [33]), and this subset is output as the final mapping result.
Minimap has been optimized and extended into minimap2 [34], which adopts a more accurate chaining method and supports recovering base-level alignments by performing DP-based global alignments between adjacent anchors in a chain. Minimap2 supports aligning many more types of sequences, such as accurate short reads, noisy SMRT long reads, full-length noisy direct RNA or cDNA reads, as well as assembly contigs or closely related full chromosomes of hundreds of megabases in length.
To chain the minimizer hits (anchors), minimap2 sorts them by ending reference position. Let $f(i)$ be the maximal chaining score up to the $i$-th anchor in the sorted list; it is calculated with the dynamic programming
$$f(i) = \max\Big\{\max_{1\le j<i}\{f(j) + \alpha(j,i) - \beta(j,i)\},\ w_i\Big\},$$
where $w_i$ is the length of anchor $i$, $\alpha(j,i)$ is the number of matching bases contributed by anchor $i$ when appended to a chain ending at anchor $j$, and $\beta(j,i)$ is the gap cost, a function of the difference between the reference and query distances separating the two anchors. In the implementation, a gap of length $\ell$ costs $0.01\cdot\bar{w}\cdot|\ell| + 0.5\log_2|\ell|$, where $\bar{w}$ is the average anchor length. A direct computation of all the $f(i)$'s obviously takes $O(n^2)$ time; minimap2 reduces this by a heuristic that examines only a bounded number of preceding anchors $j$ for each $i$.
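The following sketch implements the chaining recurrence above on a list of anchors. The anchor representation, the simplified gap cost and the `max_pred` bound on examined predecessors are illustrative assumptions rather than minimap2's exact implementation.

```python
# Sketch of the chaining dynamic program described above.  Anchors are
# (x, y, w) triples: ending reference position x, ending query position y,
# and anchor length w.  max_pred bounds how many predecessors are examined.

import math

def chain_score(anchors, avg_len, max_pred=50):
    """Return (sorted anchors, chaining scores f(i))."""
    anchors = sorted(anchors)                       # sort by ending reference position
    f = []
    for i, (xi, yi, wi) in enumerate(anchors):
        best = wi                                   # a chain may start at anchor i
        for j in range(max(0, i - max_pred), i):
            xj, yj, wj = anchors[j]
            if xj >= xi or yj >= yi:
                continue                            # anchor j must precede i in both coords
            # alpha: matching bases anchor i adds on top of a chain ending at j
            alpha = min(xi - xj, yi - yj, wi)
            gap = abs((yi - yj) - (xi - xj))        # difference of query/reference gaps
            beta = 0.0 if gap == 0 else 0.01 * avg_len * gap + 0.5 * math.log2(gap)
            best = max(best, f[j] + alpha - beta)
        f.append(best)
    return anchors, f

if __name__ == "__main__":
    anchors = [(100, 50, 15), (130, 80, 15), (131, 200, 15), (170, 121, 15)]
    print(chain_score(anchors, avg_len=15))   # the collinear anchors chain together
```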
To recover detailed alignments, minimap2 performs DP-based global alignments between adjacent anchors in each chain with a 2-piece affine gap cost [35], which helps to recover longer indels. To further accelerate the global alignment, an SSE implementation based on the difference-based formulation developed in [36] is used.
Edlib
Edlib [37] is developed for efficiently performing an important category of sequence alignment: the calculation of the Levenshtein distance (also called edit distance), which is the minimum number of single-character edits (insertions, deletions or substitutions) needed to transform one sequence into the other. Calculating the Levenshtein distance is equivalent to calculating the maximum alignment score of a standard dynamic programming under the unit-cost scoring system. Edlib is in principle an extended implementation of the original bit-vector dynamic programming algorithm [38]. The original bit-vector algorithm accelerates the dynamic programming computation of the edit distance by packing multiple cells into one CPU register as a bit-vector, which enables the values of multiple cells to be computed at a time; however, it only supports a semi-global mode in which gaps at the start and end of the query sequence are not penalized. Edlib extends the bit-vector algorithm, using an extended banded algorithm, to also support global alignment and a semi-global mode in which only gaps at the end of the query are not penalized, covering all three kinds of alignment.
Another deficiency of the original bit-vector algorithm is that it returns no traceback information (the sequence of operations, i.e., substitutions, insertions and deletions, performed on one sequence to transform it into the other). The traceback is critical to many sequence analyses such as sequencing error correction and variant detection. Edlib remedies this defect by finding the optimal alignment path for all three supported alignment modes in linear space with Hirschberg's divide-and-conquer strategy [39].
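To make the notions of edit distance and traceback concrete, here is a plain quadratic dynamic program with traceback. It illustrates what Edlib computes in its global mode, but it includes none of Edlib's bit-vector, banded or linear-space optimizations.

```python
# Plain O(nm) edit-distance DP with traceback, illustrating what Edlib computes
# in global mode, without any of its bit-vector or banded optimizations.

def edit_distance_with_path(query, target):
    n, m = len(query), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                       # delete the whole query prefix
    for j in range(m + 1):
        dist[0][j] = j                       # insert the whole target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (query[i - 1] != target[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # traceback: recover the sequence of operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (query[i - 1] != target[j - 1]):
            ops.append("M" if query[i - 1] == target[j - 1] else "X")  # match / mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("D")                  # deletion from the query
            i -= 1
        else:
            ops.append("I")                  # insertion into the query
            j -= 1
    return dist[n][m], "".join(reversed(ops))

if __name__ == "__main__":
    print(edit_distance_with_path("GATTACA", "GCATGCA"))  # (3, 'MXXMXMM')
```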
The consensus algorithms for TGS sequences
The correction of noisy TGS long reads is critical to many genomic studies with high resolution, such as structural variation detection. The currently widely used monohybrid correction methods work by aligning supporting reads to the template sequence. The overlaps are then used to build a multiple sequence alignment, from which a consensus sequence is resolved with some scoring scheme.
DAGCon
DAGCon is developed as part of the hierarchical genome assembly workflow HGAP [6] to correct the seeding sequences. It draws on the idea of [40], in which a multiple sequence alignment is represented as a directed acyclic graph.
DAGCon constructs a directed acyclic graph, called the sequence alignment graph (SAG), from the alignments between the supporting reads and the template in five steps. Step 1: Normalize mismatches and degenerate alignment positions in each alignment. Mismatches in an alignment are indistinguishable from an adjacent insertion-deletion pair to the correction algorithm; furthermore, for a given gap cost function there can be more than one way to place gaps in an alignment. To reduce the potential inconsistency introduced by these two cases and to improve the final consensus accuracy, each mismatch is replaced by an insertion-deletion pair and every gap is moved to its rightmost equivalent position (see Supplementary Figure S1 of [6] for an example). Step 2: Set up an initial directed acyclic graph from the template. Specifically, an initial backbone graph is built whose nodes correspond to the residues of the template, one node per template position, connected by directed edges between consecutive positions. Step 3: Construct a multigraph from the alignments and merge the edges that connect the same pair of nodes. Each alignment is represented as a list of aligned template-position/read-base elements and added to the backbone graph (Algorithm 1 of [6]). Step 4: Merge nodes. To reduce the complexity of the backbone graph, DAGCon merges nodes that are similar to each other (for example, nodes having the same base label that are connected to the same out-node or from the same in-node) while preserving the directed acyclic property of the graph (Algorithm 4 of [6]). Step 5: Generate the consensus sequence from the simplified backbone graph. Each node in the graph is given a score based on the weights of its out-edges, and each path is assigned a score by summing the scores of its nodes. The consensus is constructed by finding the path with the highest score among all possible paths using a simple dynamic programming (Algorithms 5 and 6 of [6]).
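The final step can be illustrated with a small highest-scoring-path computation over a DAG whose nodes carry base labels and scores. The graph encoding and scoring below are simplified assumptions for illustration, not DAGCon's actual data structures.

```python
# Sketch of consensus extraction as a highest-scoring path in a DAG (the idea
# behind the final DAGCon step).  Nodes carry a base label and a score; edges
# are a simple adjacency dict.  This is an illustration, not DAGCon itself.

from collections import defaultdict

def best_path(nodes, edges, source, sink):
    """nodes: {id: (base, score)}, edges: {id: [successor ids]} (acyclic).
    Returns the bases concatenated along the maximum-score source->sink path."""
    order, seen = [], set()
    def visit(u):                              # DFS postorder
        if u in seen:
            return
        seen.add(u)
        for v in edges.get(u, []):
            visit(v)
        order.append(u)
    visit(source)
    best, back = defaultdict(lambda: float("-inf")), {}
    best[source] = nodes[source][1]
    for u in reversed(order):                  # reversed postorder = topological order
        for v in edges.get(u, []):
            if best[u] + nodes[v][1] > best[v]:
                best[v], back[v] = best[u] + nodes[v][1], u
    path, u = [], sink                         # trace back and collect node labels
    while u != source:
        path.append(nodes[u][0])
        u = back[u]
    path.append(nodes[source][0])
    return "".join(c for c in reversed(path) if c != "")

if __name__ == "__main__":
    nodes = {0: ("", 0), 1: ("A", 3), 2: ("C", 1), 3: ("G", 3), 4: ("", 0)}
    edges = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
    print(best_path(nodes, edges, 0, 4))       # "AG": the A branch outweighs C
```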
FALCON-sense
FALCON-sense is the consensus algorithm released together with the FALCON assembler [41]. FALCON-sense starts by performing a pairwise mapping of the noisy raw reads with DALIGN [29]. Each template is then processed as follows. First, a variation of the $O(nd)$ difference algorithm [31] is applied to align the supporting reads to the template, where all differences in the alignments are encoded as insertions and deletions only. For each position of every alignment, an “alignment tag” is created that encodes the positions and bases in both the supporting read and the template. These tags are then used to construct an alignment graph in the following way. For each tag, if it does not yet exist in the graph, a corresponding node is added. For consecutive tags from an alignment, an edge with weight 1 is added to the graph; if the edge already exists, its weight is increased by 1. After processing all the alignment tags we obtain a directed acyclic graph in which every node has a tag and the weight of each edge indicates how many reads support the connection between its two nodes. Finally, a standard dynamic programming algorithm is applied to the alignment graph to find the highest-confidence path, and the consensus sequence is generated by concatenating the bases of the tags on this best path. Supplementary Figure 12 of [41] illustrates the consensus process.
In implementation, FALCON-sense is much simpler, more robust and more efficient than the DAGCon algorithm. It is also integrated into another widely used de novo assembly workflow, Canu, as its error correction method.
The correction-and-assembly workflows
There are three widely used “correction-and-assembly” assemblers, which are described below.
FALCON
The FALCON assembler [41] follows the same workflow as the hierarchical genome assembly pipeline HGAP [18] with many computationally optimized implementations. The most distinctive feature of FALCON is that it also supports phased diploid genome assembly through the FALCON-Unzip algorithm.
The FALCON workflow starts with error correction using the FALCON-sense algorithm (Section FALCON-sense). The DALIGN [29] algorithm is then used to identify overlaps between all pairs of corrected reads. These overlaps are used to construct a directed string graph [42], which is further simplified by assembly string graph reduction (Supplementary Figure 14 of [41]) into the final “haplotype-fused assembly graph”. After the reduction, the primary and associated contigs are constructed from the corrected reads and their overlap relationships. For each primary contig, FALCON collects all the raw reads associated with it and with its associated contigs according to the overlapping information. The raw reads are then aligned to the contigs with BLASR [21] for phasing heterozygous SNPs and identifying the haplotype of each raw read (Supplementary Figure 15 of [41]).
To construct haplotype-specific contigs (haplotigs), the FALCON-Unzip algorithm constructs a haplotype-specific assembly graph for each contig from its corresponding raw reads and then combines this graph with the fused assembly subgraph containing the paths of that contig, yielding a complete subgraph that has full read representation of both haplotypes (see Figure 1 and Supplementary Figure 13 of [41] for details). As a result, FALCON-Unzip builds a new primary contig and several haplotigs from the original assembly graph of that contig, and it assigns each phased read to the primary contig or to one of the haplotigs according to the phasing information.
Canu
The PBcR assembly workflow [43] is developed for efficient de novo assembly of large genomes such as the human genome. It proposes MHAP for overlapping noisy long reads using probabilistic, locality-sensitive hashing; DAGCon [6] and FALCON-sense are used for correcting errors in the long reads, and the Celera Assembler is used for assembling the corrected reads.
To reduce the running time of the overlapping step in the assembly pipeline, MHAP (MinHash Alignment Process) uses a dimensionality reduction technique, the MinHash sketch, for efficient alignment filtering. The MinHash sketch of a read is created by extracting all of its $k$-mers and converting them to fingerprints with $H$ hash functions, yielding $H$ sets of fingerprints. In MHAP, after the first hash set, the subsequent fingerprints are generated using XOR-shift pseudo-random number generators. The $k$-mer that produces the minimum value for a given hash function is referred to as the min-mer for that hash. The sketch of the sequence is then the ordered set of its $H$ min-mer fingerprints, which is much smaller than the set of all of the read's $k$-mers. To find overlaps between two reads, MHAP builds a sketch for each read and estimates their Jaccard similarity by computing the fraction of min-mers common to both sketches. If the two sketches are sufficiently similar, the shared min-mers are located in the two reads and the median difference of their positions is computed to determine the overlap offsets (see Figure 1 of [43] for an example). The sketches used here and the minimizers used in minimap and minimap2 share the same idea: both are reduced representations of the original sequence used to decrease memory usage and accelerate the overlapping process.
PBcR was later optimized into a more efficient assembler named Canu [44]. Canu implements an updated version of MHAP that uses a two-stage overlap filter, where the first stage identifies read pairs that are likely to share an overlap and the second stage estimates the extent and quality of the overlap. For the first stage, a tf-idf (term frequency, inverse document frequency) weighting is implemented in MHAP to increase the sensitivity for true overlaps and to reduce the number of false and repetitive overlaps. For a raw read set $R$, let $f_{\max}$ and $f_{\min}$ be the maximum and minimum observed frequencies of all the $k$-mers in $R$, respectively, and let $f(m)$ be the frequency of a specific $k$-mer $m$ in $R$. The inverse document frequency of $m$, denoted $\mathrm{idf}(m)$, is defined so that it decreases as $f(m)$ grows from $f_{\min}$ to $f_{\max}$ and is linearly rescaled to fall within a fixed range; a parameter controls how strongly less common $k$-mers are preferred over more common ones, and a $k$-mer that does not occur in $R$ is given the maximum value. The tf-idf weight of $m$ is then the product of $\mathrm{idf}(m)$ and its term frequency, i.e., the number of occurrences of $m$ in the read being sketched. After computing the weight of each $k$-mer, the original MinHash computation is replaced by applying a number of hash functions proportional to that weight per entry rather than a single hash function, and for each sketch entry the min-mer is chosen as the minimum hash value computed across all the applied hash functions. Since highly weighted $k$-mers are hashed more times, the chance that they are chosen as min-mers increases.
In the correction stage, Canu uses two filtering steps to avoid introducing overlaps from copy-specific repeat variants: a global filter in which each read chooses where it will supply correction evidence, and a local filter in which each read accepts or rejects the evidence supplied by other reads. The FALCON-sense consensus algorithm [41] is used for generating the corrected reads. Canu also reduces the runtime of the Celera Assembler by an order of magnitude. In the assembly stage, Canu uses a variant of the greedy “best overlap graph” algorithm [45] to construct a sparse overlap graph. The greedy algorithm loads only the “best” (longest) overlap for each read end into memory and thus consumes much less memory than the string graph formulation [42], which requires loading the full overlap graph into memory. Furthermore, to improve repeat and haplotype separation, Canu introduces a new “Bogart” algorithm to statistically filter repeat-induced overlaps and to retrospectively inspect the graph for potential errors.
MECAT
MECAT [46] is developed to effectively distinguish true overlaps from random read pairs that merely share similar subsequences. Traditional aligners such as BLAST [20] employ $k$-mer match methods to filter out random read pairs and to quickly find seeding positions for local gapped alignments. However, repetitive subsequences in reads produce large numbers of $k$-mer matches and result in excessive candidate seeding positions; on the other hand, simply masking these low-complexity regions, or ignoring $k$-mers with high frequencies, can cause correct overlaps to be lost. To address this issue, MECAT proposes a pseudolinear scoring algorithm based on distance difference factors (DDFs) to quickly filter excessive alignments.
For two $k$-mer matches whose positions are $(x_1, y_1)$ and $(x_2, y_2)$ respectively, where $x$ is the read position and $y$ is the reference position, their distance difference factor is defined as
$$\mathrm{DDF} = \big|(x_1 - x_2) - (y_1 - y_2)\big|.$$
If the DDF is below a threshold, the two $k$-mer matches are said to support each other in position. Now suppose the $k$-mer matches between a read and a target sequence are $m_1, m_2, \ldots, m_n$. The DDF score of each match is initialized to zero; for every match pair $m_i$ and $m_j$ whose DDF is below the threshold, the DDF scores of both matches are increased by one. The $k$-mer match with the highest DDF score then provides a position through which a significant local alignment is likely to pass. This DDF scoring process evidently takes $O(n^2)$ time, which is very slow for long SMRT reads. To speed it up, MECAT breaks the scoring into two steps so that the quadratic mutual scoring takes place only in local regions. In the first step, the target sequence is decomposed into consecutive blocks of equal length (2000 bp by default). The scoring starts by randomly selecting a block containing a sufficient number of $k$-mer matches, and the mutual DDF scoring is performed on the matches falling into that block; if the match with the highest score is significant (greater than a threshold), it is set as the seed match. In the second step, the scoring is extended to the neighboring blocks: for each $k$-mer match in a neighboring block, the DDF between it and the seed match is calculated, and if it is below the threshold the score of the seed match is increased by 1. The mutual scoring in the first step is quadratic only in the small number of matches within the selected block, and the extension scoring in the second step is linear in the number of remaining matches, so the overall scoring process finishes in pseudolinear time.
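The direct quadratic form of DDF scoring can be sketched as follows; MECAT's block decomposition, which makes the scoring pseudolinear, is not reproduced, and the threshold value is an arbitrary illustrative choice.

```python
# Quadratic DDF scoring sketch: each k-mer match is a (read_pos, ref_pos) pair,
# and two matches support each other when the difference of their diagonals is
# below a threshold.  MECAT's block decomposition is not reproduced here.

def ddf(m1, m2):
    """Distance difference factor of two k-mer matches (x = read, y = reference)."""
    (x1, y1), (x2, y2) = m1, m2
    return abs((x1 - x2) - (y1 - y2))

def ddf_scores(matches, epsilon=50):
    scores = [0] * len(matches)
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            if ddf(matches[i], matches[j]) <= epsilon:
                scores[i] += 1            # mutually supporting matches
                scores[j] += 1
    return scores

if __name__ == "__main__":
    # three matches on roughly the same diagonal plus one repeat-induced outlier
    matches = [(100, 1100), (420, 1425), (900, 1890), (450, 9050)]
    scores = ddf_scores(matches)
    best = max(range(len(matches)), key=lambda i: scores[i])
    print(scores, "-> seed match:", matches[best])
```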
Experiments show that the scores of the seeding positions grow linearly with their alignment lengths. After DDF score filtering, the candidate seeding positions are reduced by 50%–70% before they are extended into local alignments. In the correction stage, MECAT employs DDF scoring to effectively choose supporting reads for each template and then uses an adaptive DAGCon consensus algorithm [18] to correct the templates. The original DAGCon algorithm achieves higher precision than other correction methods but builds a sequence alignment graph (SAG) for the whole template, which consumes a lot of time and memory. MECAT accelerates the consensus process in two steps. First, it identifies bases in the template that are covered mainly by identical bases or mainly by gaps in the supporting reads; in the corrected read, these bases remain unchanged if they are covered mainly by identical bases, or are deleted if they are covered mainly by gaps. Second, the DAGCon algorithm is used only to correct the subsequences between two unchanged bases. MECAT originally used the assembly module from Canu [44], which has since been replaced by fsa, a string graph [42] based assembler.
ASSEMBLY-AND-CORRECTION METHODS
Flye
Flye [47] is developed for accurately resolving genomic repeats in de novo genome assembly. Unlike other mainstream assemblers, Flye starts by creating disjointigs, which represent concatenations of multiple disjoint genomic segments, and concatenates all the error-prone disjointigs into a single string. From this concatenation, Flye builds an accurate assembly graph, untangles it with reads and resolves the bridged repeats. Finally, Flye resolves the unbridged repeats using small differences between repeat copies and outputs accurate contigs formed by paths in the graph.
Wtdbg2
Wtdbg2 [48] proposes a fast pairwise mapping implementation and a layout algorithm based on a fuzzy Bruijn graph (FBG) to speed up large genome assembly. To reduce memory usage, current aligners (such as DALIGN and minimap2) split the input raw reads into small batches and perform pairwise mapping between batches. Wtdbg2 instead loads all the raw reads into memory and builds a hash table for the $k$-mers in all raw reads. For two reads that share $k$-mers, wtdbg2 performs a Smith-Waterman-like dynamic programming between them in terms of bins: each read is encoded as a consecutive sequence of 256 bp bins. The FBG proposed by wtdbg2 extends the basic idea of the de Bruijn graph, where each “base” of the FBG is a 256 bp bin and a $k$-bin consists of $k$ consecutive bins from reads. After simplifying the FBG, wtdbg2 builds the final consensus with partial order alignment over the edge sequences.
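A rough sketch of the binning idea is shown below: a read is chopped into 256 bp bins and k-bins are formed from consecutive bins. The bin fingerprint used here (the most frequent k-mers of the bin) is purely an illustrative choice, not wtdbg2's internal representation.

```python
# Sketch of wtdbg2-style binning: a read becomes a sequence of 256 bp bins and
# a k-bin is k consecutive bins.  The fingerprint below is only illustrative.

from collections import Counter

BIN_SIZE = 256

def bin_read(seq, k=15, top=4):
    """Split a read into 256 bp bins and fingerprint each bin."""
    bins = []
    for start in range(0, len(seq) - BIN_SIZE + 1, BIN_SIZE):
        window = seq[start:start + BIN_SIZE]
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        bins.append(tuple(sorted(m for m, _ in counts.most_common(top))))
    return bins

def k_bins(bins, k=4):
    """Enumerate k-bins: tuples of k consecutive bin fingerprints."""
    return [tuple(bins[i:i + k]) for i in range(len(bins) - k + 1)]

if __name__ == "__main__":
    import random
    random.seed(0)
    read = "".join(random.choice("ACGT") for _ in range(2048))
    b = bin_read(read)
    print(len(b), "bins;", len(k_bins(b)), "4-bin units")   # 8 bins; 5 4-bin units
```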
SOME BENCHMARKS
To illustrate the efficiency and assembly quality of the assembly workflows described in this review, we present some benchmarks in this section. The benchmark results also provide clues for the usage guidelines discussed in the next section. Note that there exists an abundance of benchmarks of this kind in the literature; we therefore do not carry out such tests ourselves but instead quote two representative reports and list their results here for easy reference and discussion.
Benchmarks on PacBio data
We list a simplified version of the experimental results reported in [48] in Table 2. The five workflows are used for de novo assembly of three moderate-size genomes (about 100 Mb) and one large human genome (about 3 Gb). More details on the datasets and the experimental setup are found in Table 1 and Supplementary Table 1 of [48]. On the three moderate-size genomes, Canu achieves the highest assembly quality but is the least efficient. FALCON, Flye and MECAT behave roughly the same in terms of NG50, NG75 and run time. Wtdbg2 is the fastest assembler and is at least an order of magnitude faster than Canu; the NG50 and NG75 of its assemblies are generally better than those of FALCON, Flye and MECAT. In assembling the 100× human dataset, FALCON has the best assembly quality. The assembly process of Wtdbg2 is dominated by the polishing step, which consumes about 75% of the computation time, and the reference genome coverage of the Wtdbg2 assembly is significantly smaller than that of Canu and FALCON.
Benchmarks on Nanopore data
The FALCON and MECAT workflows do not work with Nanopore data, so we assess the performance of the other three assemblers here. In Table 3 we quote the results from [49], where the three workflows are used for assembling five moderate-size genomes (about 100 Mb). Details on the datasets and experimental setup can be found in Table 1 and Supplementary Table 8 of [49]. As on the PacBio datasets, Canu takes the longest time; Wtdbg2 is still the most efficient workflow and can be two orders of magnitude faster than Canu.
On the other hand, due to the great discrepancy between PacBio and Nanopore sequences, the three assemblers behave quite differently on the two kinds of data in terms of assembly contiguity. The Flye workflow has the longest N50 on each dataset. Wtdbg2 obtains better N50 than Canu on the A. thaliana and D. melanogaster datasets, while on the complex genomes C. reinhardtii, O. sativa and S. pennellii, Canu produces larger N50 than Wtdbg2, which shows the advantage of the correction-and-assembly strategy on complex genomes. We also note that Wtdbg2 outputs significantly more contigs than the other two workflows on every dataset.
SOME USE GUIDES
If your task is to carry out a sequence alignment, the preferred tool is generally minimap2 [34], which can meet most processing demands. As a general-purpose sequence mapping tool, minimap2 performs satisfactorily in both efficiency and sensitivity. It supports both read-to-read pairwise mapping and reference mapping, and a large spectrum of sequence types: DNA and mRNA sequences, PacBio and Nanopore noisy long reads, as well as assembly contigs or closely related full chromosomes of hundreds of megabases in length.
If your study (not necessarily de novo genome assembly) involves raw read correction as an intermediate step, there is no need to re-implement the consensus algorithms described above and build the overlap-and-correct pipeline yourself. To correct PacBio noisy long reads, either Canu or MECAT is recommended; if the dataset is large, MECAT is the preferred option, since it is substantially faster while maintaining the same level of accuracy as Canu. To correct Nanopore raw reads, the mainstream tool currently available is Canu.
There are multiple choices of workflow for de novo genome assembly with both PacBio and Nanopore sequences. For small and moderate-size datasets, especially for complex genomes, Canu reliably constructs contiguous and accurate assemblies. For large PacBio datasets, MECAT is recommended if the genome is complex; otherwise Flye and Wtdbg2 are also appropriate alternatives. For large Nanopore datasets, consider Flye, or Wtdbg2 if your computing resources are limited. Note that these workflows can also be used cooperatively. For example, if the quality of the contigs assembled by one assembler is surprisingly poor, you may suspect that the raw read dataset is problematic (that is, the data are incomplete or too noisy); to check this, you can assemble the contigs again with a second, fast assembler such as Wtdbg2 or Flye and inspect the quality of the new contigs.
CONCLUSIONS AND DISCUSSIONS
In this survey, we analyze in detail the features of SMS sequences from both the PacBio and the Nanopore platforms. We review and summarize several important and widely used computational methods designed for TGS, classified into three categories: alignment methods, consensus methods and assembly workflows; the alignment and consensus methods are integrated into the assembly workflows to carry out intermediate tasks. For each method, we give a general description of its design principle and an in-depth analysis of its core concepts and ideas. To illustrate the performance of the assembly workflows, we quote two benchmark results and analyze the behavior of the workflows on each dataset. Based on their design principles and benchmark results, we then provide some general usage guidelines for applying these methods.
Although substantial progress has been made, TGS still has shortcomings, and the range of genomic studies in which it is applied is still limited. Based on what we have learned, we expect TGS to develop in the following directions. Firstly, as the ultra-long read protocol [8] and circular consensus sequencing continue to evolve, SMS sequences should keep increasing in both length and accuracy. As a result, we will obtain more complete and contiguous assembled contigs that cover regions remaining unresolved today, and polyploid assembly of complex genomes will become feasible. Besides de novo assembly, TGS is expected to promote studies in other fields such as N6-methyladenine DNA modification, transcriptome analysis and structural variation detection. Secondly, third-generation sequencing technology is also expected to develop in other areas, such as single-molecule RNA sequencing, and to promote the relevant studies with higher resolution.
We hope that this review can serve as a roadmap for those who are interested in TGS and as a foundation on which more advanced computational methods can be developed.