Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era
Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova, Paola Bonizzoni
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era
Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs.
Results: In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact.
Conclusions: We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field.
overlap graphs / de Bruijn graphs / genome assembly / long reads / string graphs
[1] |
Tomescu, A.I., Medvedev, P. (2016) Safe and Complete Contig Assembly Via Omnitigs. In: Research in Computational Molecular Biology, Singh, M. (eds.), RECOMB 2016. Lecture Notes in Computer Science. vol 9649. Springer, Cham
|
[2] |
Burrows, M. and Wheeler, D. J. (1994) A block-sorting lossless data compression algorithm. CiteSeer
|
[3] |
Ferragina, P. and Manzini, G. (2005) Indexing compressed text. J. Assoc. Comput. Mach., 52, 552–581
CrossRef
Google scholar
|
[4] |
Baker, M. (2012) De novo genome assembly: what every biologist should know. Nat. Methods, 9, 333–337
CrossRef
Google scholar
|
[5] |
Miller, J. R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327
CrossRef
Pubmed
Google scholar
|
[6] |
Paszkiewicz, K. and Studholme, D. J. (2010) De novo assembly of short sequence reads. Brief. Bioinform., 11, 457–472
CrossRef
Pubmed
Google scholar
|
[7] |
Narzisi, G. and Mishra, B. (2011) Comparing de novo genome assembly: the long and short of it. PLoS One, 6, e19175
CrossRef
Pubmed
Google scholar
|
[8] |
Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X., Liu, B.,
CrossRef
Pubmed
Google scholar
|
[9] |
Jayakumar, V. and Sakakibara, Y. (2019) Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform., 20, 866–876
CrossRef
Pubmed
Google scholar
|
[10] |
Ghurye, J. and Pop, M. (2019) Modern technologies and algorithms for scaffolding assembled genomes. PLOS Comput. Biol., 15, e1006994
CrossRef
Pubmed
Google scholar
|
[11] |
Shi, F. (1996) Suffix arrays for multiple strings: A method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, Jaffar, J., Yap, R.H.C. (eds.), ASIAN 1996. Lecture Notes in Computer Science, vol 1179. Springer, Berlin, Heidelberg
|
[12] |
Manber, U. and Myers, G. (1993) Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22, 935–948
CrossRef
Google scholar
|
[13] |
Lam, T. W., Li, R., Tam, A., Wong, S., Wu, E. and Yiu, S. M. (2009) High throughput short read alignment via bi-directional BWT. In Bioinformatics and Biomedicine (BIBM ’09), pp. 31–36, Washington, DC
|
[14] |
Simpson, J. T. and Durbin, R. (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26, i367–i373
CrossRef
Pubmed
Google scholar
|
[15] |
Simpson, J. T. and Durbin, R. (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res., 22, 549–556
CrossRef
Pubmed
Google scholar
|
[16] |
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760
CrossRef
Pubmed
Google scholar
|
[17] |
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25
CrossRef
Pubmed
Google scholar
|
[18] |
Cox, A. J., Garofalo, F., Rosone, G. and Sciortino, M. (2016) Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37, 17–33
CrossRef
Google scholar
|
[19] |
Li, H. (2014) Fast construction of FM-index for long sequence reads. Bioinformatics, 30, 3274–3275
CrossRef
Pubmed
Google scholar
|
[20] |
Bonizzoni, P., Della Vedova , G., Pirola, Y., Previtali, M. and Rizzi, R. (2019) Multithread multistring Burrows-Wheeler transform and longest common prefix array. J. Comput. Biol., 26
CrossRef
Google scholar
|
[21] |
Bloom, B. H. (1970) Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13, 422–426
CrossRef
Google scholar
|
[22] |
Fan, L., Cao, P., Almeida, J. and Broder, A. Z. (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8, 281–293
CrossRef
Google scholar
|
[23] |
Broder, A. Z. (1997) On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES 1997, pp.21–29. IEEE
|
[24] |
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. and Yorke, J. A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369
CrossRef
Pubmed
Google scholar
|
[25] |
Salmela, L.and Rivals, E. (2014) LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30, 3506–3514
|
[26] |
Nordberg, H., Cantor, M., Dusheyko, S., Hua, S., Poliakov, A., Shabalov, I., Smirnova, T., Grigoriev, I. V. and Dubchak. I. (2013) The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acids Research, 42, D26–D31
|
[27] |
Myers, E. W. (2005) The fragment assembly string graph. Bioinformatics, 21, i79–i85
CrossRef
Pubmed
Google scholar
|
[28] |
Gonnella, G. and Kurtz, S. (2012) Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13, 82
CrossRef
Pubmed
Google scholar
|
[29] |
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
CrossRef
Pubmed
Google scholar
|
[30] |
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2016) LSG: An external-memory tool to compute string graphs for next-generation sequencing data assembly. J. Comput. Biol., 23, 137–149
CrossRef
Pubmed
Google scholar
|
[31] |
Peng, Y. and Leung, H. C. M., Yiu, S. M. and Chin, F. Y. L. (2010) IDBA—a practical iterative de Bruijn graph de novo assembler. In Research in Computational Molecular Biology, Bonnie B., (ed.), pp. 426–440, Springer Berlin Heidelberg
|
[32] |
Medvedev, P., Pham, S., Chaisson, M., Tesler, G. and Pevzner, P. (2011) Paired de Bruijn graphs: A novel approach for incorporating mate pair information into genome assemblers. In Proceedings of the 15th Annual International Conference on Research in Computational Molecular Biology, RECOMB’11, pp. 238–251, Springer-Verlag
|
[33] |
Pham, S. K., Antipov, D., Sirotkin, A., Tesler, G., Pevzner, P. A. and Alekseyev, M. A. (2012) Pathset graphs: A novel approach for comprehensive utilization of paired reads in genome assembly. In Research in Computational Molecular Biology, Chor, B., (ed.), pp. 200–212, Springer Berlin Heidelberg
|
[34] |
Ronen, R., Boucher, C., Chtsaz, H. and Pevzner, P. (2012) SEQuel: improving the accuracy of genome assemblies. Bioinformatics, 28, i188–i196
|
[35] |
Bao, E., Jiang, T., and Girke, T. (2014) AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics, 30, i319–i328
|
[36] |
Li, H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838–1844
CrossRef
Pubmed
Google scholar
|
[37] |
Nong, G., Zhang, S. and Chan, W. H. (2011) Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput., 60, 1471–1484
CrossRef
Google scholar
|
[38] |
Bauer, M., Cox, A. and Rosone, G. (2013) Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483, 134–148
CrossRef
Google scholar
|
[39] |
Bonizzoni, P., Della Vedova, G. , Pirola, Y., Previtali, M. and Rizzi, R. (2017) FSG: Fast String Graph construction for de novo assembly. J. Comput. Biol., 24, 953–968
CrossRef
Pubmed
Google scholar
|
[40] |
Garey, M. R. and Johnson, D. S. (1979) Computer and Intractability: A Guide to the Theory of NP-completeness. Freeman, W. H., (ed.), San Francisco
|
[41] |
Conway, T. C. and Bromage, A. J. (2011) Succinct data structures for assembling large genomes. Bioinformatics, 27, 479–486
CrossRef
Pubmed
Google scholar
|
[42] |
Chikhi, R. and Rizk, G. (2013) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol., 8, 22
CrossRef
Pubmed
Google scholar
|
[43] |
Salikhov, K., Sacomoto, G. and Kucherov, G. (2014) Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol. Biol., 9, 2
CrossRef
Pubmed
Google scholar
|
[44] |
Jackman, S. D., Vandervalk, B. P., Mohamadi, H., Chu, J., Yeo, S., Hammond, S. A., Jahesh, G., Khan, H., Coombe, L., Warren, R. L.,
CrossRef
Pubmed
Google scholar
|
[45] |
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. and Medvedev, P. (2015) On the representation of de Bruijn graphs. J. Comput. Biol., 22, 336–352
CrossRef
Pubmed
Google scholar
|
[46] |
Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics, volume 7534 of Lecture Notes in Computer Science, Raphael, B. and Tang, J. (eds.), pp. 225–235. Springer Berlin Heidelberg
|
[47] |
Boucher, C., Bowe, A., Gagie, T., Puglisi, S. J. and Sadakane, K. (2015) Variable-order de Bruijn graphs. In Data Compression Conference (DCC), pp. 383–392
|
[48] |
Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M. and Puglisi, S. J. (2018) Bidirectional variable-order de Bruijn graphs. Int. J. Found. Comput. Sci., 29, 1279–1295
CrossRef
Google scholar
|
[49] |
Muggli, M. D., Bowe, A., Noyes, N. R.Morley, P.S., Belk, K. E., Raymond, R., Gagie, T., Puglisi, S. J., Boucher, C. (2017) Succinct colored de Bruijn graphs. Bioinformatics, 33, 3181–3187
|
[50] |
Almodaresi, F., Pandey, P. and Patro, R. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017).
|
[51] |
Almodaresi, F., Sarkar, H., Srivastava, A. and Patro, R. (2018) A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34, i169–i177
CrossRef
Pubmed
Google scholar
|
[52] |
Muggli, M. D., Alipanahi, B., Boucher, C. (2019) Building large updatable colored de Bruijn graphs via merging. bioRxiv, 229641
|
[53] |
Neil, C.Jones,Pavel A Pevzner, and Pavel Pevzner. (2004) An introduction to bioinformatics algorithms. MIT Press
|
[54] |
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27, 722–736
CrossRef
Pubmed
Google scholar
|
[55] |
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M. and Phillippy, A. M. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630
CrossRef
Pubmed
Google scholar
|
[56] |
Chu, J., Mohamadi, H., Warren, R. L., Yang, C. and Birol, I. (2017) Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics, 33, 1261–1270
Pubmed
|
[57] |
Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv, 530972
|
[58] |
Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. and Tse, D. N. (2017) HINGE: long-read assembly achieves optimal repeat resolution. Genome Res., 27, 747–756
CrossRef
Pubmed
Google scholar
|
[59] |
Myers, G. (2014) Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics, pp. 52–67. Springer
|
[60] |
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24, 2818–2824
CrossRef
Pubmed
Google scholar
|
[61] |
Lin, Y., Yuan, J., Kolmogorov, M., Shen, M. W., Chaisson, M. and Pevzner, P. A. (2016) Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA, 113, E8396–E8405
CrossRef
Pubmed
Google scholar
|
[62] |
Prjibelski, A. D.,Vasilinetc, I., Bankevich, A., Gurevich, A., Krivosheeva, T., Nurk, S., Pham, S., Korobeynikov, A., Lapidus, A., and Pevzner. P.A. (2014) Exspander: a universal repeat resolver for DNA fragment assembly. Bioinformatics, 30, i293–i301
|
[63] |
Chin, C.-S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion, G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A.,
CrossRef
Pubmed
Google scholar
|
[64] |
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E.,
CrossRef
Pubmed
Google scholar
|
[65] |
Fasulo, D., Halpern, A., Dew, I., and Mobarry. C. (2002) Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18, S294S302
|
[66] |
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A.,
CrossRef
Pubmed
Google scholar
|
[67] |
Berger, A. and Lafferty, J. (2017) Information retrieval as statistical translation. ACM SIGIR Forum, 51, 219–226
|
[68] |
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110
CrossRef
Pubmed
Google scholar
|
[69] |
Välimäki, N., Ladra, S. and Mäkinen, V. (2012) Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213, 49–58
CrossRef
Google scholar
|
[70] |
Egidi, L., Louza, F. A., Manzini, G. and Telles, G. (2018) External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, volume 113, pp. 1–14
|
[71] |
Liu, B., Liu, C.-M., Li, D., Li, Y., Ting, H.-F., Yiu, S. M., Luo, R. and Lam, T. W. (2016) BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics, 17, 499
CrossRef
Pubmed
Google scholar
|
[72] |
Sohn, J. and Nam, J.-W. (2016) The present and future of de novo whole-genome assembly. Brief. Bioinform., bbw096
|
[73] |
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
|
[74] |
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F.,
CrossRef
Pubmed
Google scholar
|
[75] |
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. and McVean, G. (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 44, 226–232
CrossRef
Pubmed
Google scholar
|
[76] |
Pandey, P., Almodaresi, F., Bender, M. A., Ferdman, M., Johnson, R. and Patro, R. (2018) Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst., 7, 201–207.e4
CrossRef
Pubmed
Google scholar
|
[77] |
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123
CrossRef
Pubmed
Google scholar
|
[78] |
Holt, J. and McMillan, L. (2014) Merging of multi-string BWTs with applications. Bioinformatics, 30, 3524–3531
CrossRef
Pubmed
Google scholar
|
[79] |
Minkin, I., Pham, S. and Medvedev, P. (2017) TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33, 4024–4032
Pubmed
|
[80] |
Marcus, S., Lee, H. and Schatz, M. C. (2014) SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30, 3476–3483
CrossRef
Pubmed
Google scholar
|
[81] |
Cazaux, B., Lecroq, T. and Rivals, E. (2014) From indexing data structures to de Bruijn graphs. In Combinatorial Pattern Matching, pp. 89–99. Springer
|
[82] |
Baier, U., Beller, T. and Ohlebusch, E. (2016) Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32, 497–504
CrossRef
Pubmed
Google scholar
|
[83] |
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B. E., Ghaffaari, A., Kersey, P., Kloosterman, W., Makinen, V., Novak, A.,
|
/
〈 | 〉 |