A survey on de novo assembly methods for single-molecular sequencing
Ying Chen, Chuan-Le Xiao
A survey on de novo assembly methods for single-molecular sequencing
Background: The single-molecular sequencing (SMS) is under rapid development and generating increasingly long and accurate sequences. De novo assembly of genomes from SMS sequences is a critical step for many genomic studies. To scale well with the developing trends of SMS, many de novo assemblers for SMS have been released. These assembly workflows can be categorized into two different kinds: the correction-and-assembly strategy and the assembly-and-correction strategy, both of which are gaining more and more attentions.
Results: In this article we make a discussion on the characteristics of errors in SMS sequences. We then review the currently widely applied de novo assemblers for SMS sequences. We also describe computational methods relevant to de novo assembly, including the alignment methods and the error correction methods. Benchmarks are provided to analyze their performance on different datasets and to provide use guides on applying the computation methods.
Conclusion: We make a detailed review on the latest development of de novo assembly and some relevant algorithms for SMS, including their rationales, solutions and results. Besides, we provide use guides on the algorithms based on their benchmark results. Finally we conclude the review by giving some developing trends of third generation sequencing (TGS).
third generation sequencing / single-molecular real-time sequencing / sequence alignment / sequence error correction / de novo assembly
[1] |
Brown, S. D., Nagaraju, S., Utturkar, S., De Tissera, S., Segovia, S., Mitchell, W., Land, M. L., Dassanayake, A. and Köpke, M. (2014) Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnol. Biofuels, 7, 40
CrossRef
Pubmed
Google scholar
|
[2] |
Rhoads, A. and Au, K. F. (2015) PacBio sequencing and its applications. Genom. Proteom. Bioinf., 13, 278–289
CrossRef
Pubmed
Google scholar
|
[3] |
Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B.,
CrossRef
Pubmed
Google scholar
|
[4] |
Detter, J. C., Johnson, S. L., Bishop-Lilly, K. A., Chain, P. S., Gibbons, H. S., Minogue, T. D., Sozhamannan, S., Van Gieson, E.J. and Resnick, I. G. (2014) Nucleic acid sequencing for characterizing infectious and/or novel agents in complex samples. In: Biological Identification, pp. 3–53. Sawston: Woodhead Publishing
|
[5] |
Wenger, A. M., Peluso, P., Rowell, W. J., Chang, P. C., Hall, R. J., Concepcion, G. T., Ebler, J., Fungtammasan, A., Kolesnikov, A., Olson, N. D.,
CrossRef
Pubmed
Google scholar
|
[6] |
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E.,
CrossRef
Pubmed
Google scholar
|
[7] |
Magi, A., Semeraro, R., Mingrino, A., Giusti, B. and D’Aurizio, R. (2018) Nanopore sequencing data analysis: state of the art, applications and challenges. Brief. Bioinformatics, 19, 1256–1272
Pubmed
|
[8] |
Jain, M., Koren, S., Miga, K. H., Quick, J., Rand, A. C., Sasani, T. A., Tyson, J. R., Beggs, A. D., Dilthey, A. T., Fiddes, I. T.,
CrossRef
Pubmed
Google scholar
|
[9] |
Magi, A., Giusti, B. and Tattini, L. (2017) Characterization of MinION nanopore data for resequencing analyses. Brief. Bioinformatics, 18, 940–953
Pubmed
|
[10] |
Rang, F. J., Kloosterman, W. P. and de Ridder, J. (2018) From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol., 19, 90
CrossRef
Pubmed
Google scholar
|
[11] |
Loman, N. J., Quick, J. and Simpson, J. T. (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods, 12, 733–735
CrossRef
Pubmed
Google scholar
|
[12] |
Walker, B. J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., Cuomo, C. A., Zeng, Q., Wortman, J., Young, S. K.,
CrossRef
Pubmed
Google scholar
|
[13] |
Kingan, S. B., Heaton, H., Cudini, J., Lambert, C. C., Baybayan, P., Galvin, B. D., Durbin, R., Korlach, J. and Lawniczak, M. K. N. (2019) A high-quality de novo genome assembly from a single mosquito using PacBio sequencing. Genes (Basel), 10, 62
CrossRef
Pubmed
Google scholar
|
[14] |
Vaser, R., Sović I., Nagarajan, N. and Šikić, M. (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res., 27, 737–746
CrossRef
Pubmed
Google scholar
|
[15] |
Bashir, A., Klammer, A., Robins, W. P., Chin, C. S., Webster, D., Paxinos, E., Hsu, D., Ashby, M., Wang, S., Peluso, P.,
CrossRef
Pubmed
Google scholar
|
[16] |
Koren, S., Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D.,
CrossRef
Pubmed
Google scholar
|
[17] |
Ribeiro, F. J., Przybylski, D., Yin, S., Sharpe, T., Gnerre, S., Abouelleil, A., Berlin, A. M., Montmayeur,A., Shea, T. P., Walker, B. J.,
CrossRef
Pubmed
Google scholar
|
[18] |
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E.,
CrossRef
Pubmed
Google scholar
|
[19] |
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197
CrossRef
Pubmed
Google scholar
|
[20] |
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402
CrossRef
Pubmed
Google scholar
|
[21] |
Chaisson, M. J. and Tesler, G. (2012) Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 13, 238
CrossRef
Pubmed
Google scholar
|
[22] |
Ferragina, P. and Manzini, G. (2000) Opportunistic data structures with applications In: Proceedings the 41st Annual Symposium on Foundations of Computer Science, pp. 390–398. IEEE
|
[23] |
Li, H. and Durbin, R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595
CrossRef
Pubmed
Google scholar
|
[24] |
Ben, L. and Salzberg, S. L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359
|
[25] |
Chen, Y., Ye, W., Zhang, Y. and Xu, Y. (2015) High speed BLASTN: an accelerated MegaBLAST search tool. Nucleic Acids Res., 43, 7762–7768
CrossRef
Pubmed
Google scholar
|
[26] |
Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K. and Yiu, S. M. (2008) Compressed indexing and local alignment of DNA. Bioinformatics, 24, 791–797
CrossRef
Pubmed
Google scholar
|
[27] |
Abouelhoda, M. I. and Ohlebusch, E. A. (2003) Local chaining algorithm and its applications in comparative genomics. In: International Workshop on Algorithms in Bioinformatics, pp. 1–16. Berlin: Springer
|
[28] |
Eppstein, D., Galil, Z., Giancarlo, R. and Italiano, G. F. (1992) Sparse dynamic programming I: linear cost functions. J. Assoc. Comput. Mach., 39, 519–545
CrossRef
Google scholar
|
[29] |
Myers, G. (2014) Efficient local alignment discovery amongst noisy long reads In: International Workshop on Algorithms in Bioinformatics, pp. 52–67. Berlin: Springer
|
[30] |
Cormen, T. H., Leiserson,C. E., Rivest, R. L. and Stein, C. (2009) Introduction to Algorithms (3rd. edition), pp.197–204. Cambridge: MIT Press
|
[31] |
Myers, E. W. (1986) AnO (ND) difference algorithm and its variations. Algorithmica, 1, 251–266
CrossRef
Google scholar
|
[32] |
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. and Yorke, J. A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369
CrossRef
Pubmed
Google scholar
|
[33] |
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110
CrossRef
Pubmed
Google scholar
|
[34] |
Li, H. (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100
CrossRef
Pubmed
Google scholar
|
[35] |
Gotoh, O. (1990) Optimal sequence alignment allowing for long gaps. Bull. Math. Biol., 52, 359–373
CrossRef
Pubmed
Google scholar
|
[36] |
Suzuki, H. and Kasahara, M. (2018) Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, 19, 45
CrossRef
Pubmed
Google scholar
|
[37] |
Šošic, M. and Šikic, M. (2017) Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance. Bioinformatics, 33, 1394–1395
CrossRef
Pubmed
Google scholar
|
[38] |
Myers, G. (1999) A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. Assoc. Comput. Mach., 46, 395–415
CrossRef
Google scholar
|
[39] |
Hirschberg, D. S. (1975) A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18, 341–343
CrossRef
Google scholar
|
[40] |
Lee, C., Grasso, C. and Sharlow, M. F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464
CrossRef
Pubmed
Google scholar
|
[41] |
Chin, C. S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion,G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A.,
CrossRef
Pubmed
Google scholar
|
[42] |
Myers, E. W. (2005) The fragment assembly string graph. Bioinformatics, 21(suppl_2), ii79–ii85
|
[43] |
Berlin, K., Koren, S., Chin, C. S., Drake, J. P., Landolin, J. M. and Phillippy, A. M. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630
CrossRef
Pubmed
Google scholar
|
[44] |
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27, 722–736
CrossRef
Pubmed
Google scholar
|
[45] |
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24, 2818–2824
CrossRef
Pubmed
Google scholar
|
[46] |
Xiao, C. L., Chen, Y., Xie, S. Q., Chen,K.-N., Wang, Y., Han, Y., Luo F., Xie , Z. (2017) MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods, 14, 1072–1074
|
[47] |
Kolmogorov, M., Yuan, J., Lin, Y. and Pevzner, P. A. (2019) Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol., 37, 540–546
CrossRef
Pubmed
Google scholar
|
[48] |
Ruan, J. and Li, H. (2020) Fast and accurate long-read assembly with wtdbg2. Nat. Methods, 17, 155–158
CrossRef
Pubmed
Google scholar
|
[49] |
Chen, Y., Nie, F., Xie, S.-Q. and Zheng, Y.-F. (2020) Fast and accurate assembly of Nanopore reads via progressive error correction and adaptive read selection. bioRxiv, 930107
|
/
〈 | 〉 |