Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova, Paola Bonizzoni

PDF(342 KB)
PDF(342 KB)
Quant. Biol. ›› 2019, Vol. 7 ›› Issue (4) : 278-292. DOI: 10.1007/s40484-019-0181-x
REVIEW

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Author information +
History +

Abstract

Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs.

Results: In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact.

Conclusions: We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field.

Keywords

overlap graphs / de Bruijn graphs / genome assembly / long reads / string graphs

Cite this article

Download citation ▾
Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova, Paola Bonizzoni. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quant. Biol., 2019, 7(4): 278‒292 https://doi.org/10.1007/s40484-019-0181-x

References

[1]
Tomescu, A.I., Medvedev, P. (2016) Safe and Complete Contig Assembly Via Omnitigs. In: Research in Computational Molecular Biology, Singh, M. (eds.), RECOMB 2016. Lecture Notes in Computer Science. vol 9649. Springer, Cham
[2]
Burrows, M. and Wheeler, D. J. (1994) A block-sorting lossless data compression algorithm. CiteSeer
[3]
Ferragina, P. and Manzini, G. (2005) Indexing compressed text. J. Assoc. Comput. Mach., 52, 552–581
CrossRef Google scholar
[4]
Baker, M. (2012) De novo genome assembly: what every biologist should know. Nat. Methods, 9, 333–337
CrossRef Google scholar
[5]
Miller, J. R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327
CrossRef Pubmed Google scholar
[6]
Paszkiewicz, K. and Studholme, D. J. (2010) De novo assembly of short sequence reads. Brief. Bioinform., 11, 457–472
CrossRef Pubmed Google scholar
[7]
Narzisi, G. and Mishra, B. (2011) Comparing de novo genome assembly: the long and short of it. PLoS One, 6, e19175
CrossRef Pubmed Google scholar
[8]
Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X., Liu, B., (2012) Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief. Funct. Genomics, 11, 25–37
CrossRef Pubmed Google scholar
[9]
Jayakumar, V. and Sakakibara, Y. (2019) Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform., 20, 866–876
CrossRef Pubmed Google scholar
[10]
Ghurye, J. and Pop, M. (2019) Modern technologies and algorithms for scaffolding assembled genomes. PLOS Comput. Biol., 15, e1006994
CrossRef Pubmed Google scholar
[11]
Shi, F. (1996) Suffix arrays for multiple strings: A method for on-line multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, Jaffar, J., Yap, R.H.C. (eds.), ASIAN 1996. Lecture Notes in Computer Science, vol 1179. Springer, Berlin, Heidelberg
[12]
Manber, U. and Myers, G. (1993) Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22, 935–948
CrossRef Google scholar
[13]
Lam, T. W., Li, R., Tam, A., Wong, S., Wu, E. and Yiu, S. M. (2009) High throughput short read alignment via bi-directional BWT. In Bioinformatics and Biomedicine (BIBM ’09), pp. 31–36, Washington, DC
[14]
Simpson, J. T. and Durbin, R. (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26, i367–i373
CrossRef Pubmed Google scholar
[15]
Simpson, J. T. and Durbin, R. (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res., 22, 549–556
CrossRef Pubmed Google scholar
[16]
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760
CrossRef Pubmed Google scholar
[17]
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25
CrossRef Pubmed Google scholar
[18]
Cox, A. J., Garofalo, F., Rosone, G. and Sciortino, M. (2016) Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37, 17–33
CrossRef Google scholar
[19]
Li, H. (2014) Fast construction of FM-index for long sequence reads. Bioinformatics, 30, 3274–3275
CrossRef Pubmed Google scholar
[20]
Bonizzoni, P., Della Vedova , G., Pirola, Y., Previtali, M. and Rizzi, R. (2019) Multithread multistring Burrows-Wheeler transform and longest common prefix array. J. Comput. Biol., 26
CrossRef Google scholar
[21]
Bloom, B. H. (1970) Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13, 422–426
CrossRef Google scholar
[22]
Fan, L., Cao, P., Almeida, J. and Broder, A. Z. (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8, 281–293
CrossRef Google scholar
[23]
Broder, A. Z. (1997) On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES 1997, pp.21–29. IEEE
[24]
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. and Yorke, J. A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369
CrossRef Pubmed Google scholar
[25]
Salmela, L.and Rivals, E. (2014) LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30, 3506–3514
[26]
Nordberg, H., Cantor, M., Dusheyko, S., Hua, S., Poliakov, A., Shabalov, I., Smirnova, T., Grigoriev, I. V. and Dubchak. I. (2013) The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acids Research, 42, D26–D31
[27]
Myers, E. W. (2005) The fragment assembly string graph. Bioinformatics, 21, i79–i85
CrossRef Pubmed Google scholar
[28]
Gonnella, G. and Kurtz, S. (2012) Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13, 82
CrossRef Pubmed Google scholar
[29]
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
CrossRef Pubmed Google scholar
[30]
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2016) LSG: An external-memory tool to compute string graphs for next-generation sequencing data assembly. J. Comput. Biol., 23, 137–149
CrossRef Pubmed Google scholar
[31]
Peng, Y. and Leung, H. C. M., Yiu, S. M. and Chin, F. Y. L. (2010) IDBA—a practical iterative de Bruijn graph de novo assembler. In Research in Computational Molecular Biology, Bonnie B., (ed.), pp. 426–440, Springer Berlin Heidelberg
[32]
Medvedev, P., Pham, S., Chaisson, M., Tesler, G. and Pevzner, P. (2011) Paired de Bruijn graphs: A novel approach for incorporating mate pair information into genome assemblers. In Proceedings of the 15th Annual International Conference on Research in Computational Molecular Biology, RECOMB’11, pp. 238–251, Springer-Verlag
[33]
Pham, S. K., Antipov, D., Sirotkin, A., Tesler, G., Pevzner, P. A. and Alekseyev, M. A. (2012) Pathset graphs: A novel approach for comprehensive utilization of paired reads in genome assembly. In Research in Computational Molecular Biology, Chor, B., (ed.), pp. 200–212, Springer Berlin Heidelberg
[34]
Ronen, R., Boucher, C., Chtsaz, H. and Pevzner, P. (2012) SEQuel: improving the accuracy of genome assemblies. Bioinformatics, 28, i188–i196
[35]
Bao, E., Jiang, T., and Girke, T. (2014) AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics, 30, i319–i328
[36]
Li, H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838–1844
CrossRef Pubmed Google scholar
[37]
Nong, G., Zhang, S. and Chan, W. H. (2011) Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput., 60, 1471–1484
CrossRef Google scholar
[38]
Bauer, M., Cox, A. and Rosone, G. (2013) Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483, 134–148
CrossRef Google scholar
[39]
Bonizzoni, P., Della Vedova, G. , Pirola, Y., Previtali, M. and Rizzi, R. (2017) FSG: Fast String Graph construction for de novo assembly. J. Comput. Biol., 24, 953–968
CrossRef Pubmed Google scholar
[40]
Garey, M. R. and Johnson, D. S. (1979) Computer and Intractability: A Guide to the Theory of NP-completeness. Freeman, W. H., (ed.), San Francisco
[41]
Conway, T. C. and Bromage, A. J. (2011) Succinct data structures for assembling large genomes. Bioinformatics, 27, 479–486
CrossRef Pubmed Google scholar
[42]
Chikhi, R. and Rizk, G. (2013) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol., 8, 22
CrossRef Pubmed Google scholar
[43]
Salikhov, K., Sacomoto, G. and Kucherov, G. (2014) Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol. Biol., 9, 2
CrossRef Pubmed Google scholar
[44]
Jackman, S. D., Vandervalk, B. P., Mohamadi, H., Chu, J., Yeo, S., Hammond, S. A., Jahesh, G., Khan, H., Coombe, L., Warren, R. L., (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res., 27, 768–777
CrossRef Pubmed Google scholar
[45]
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. and Medvedev, P. (2015) On the representation of de Bruijn graphs. J. Comput. Biol., 22, 336–352
CrossRef Pubmed Google scholar
[46]
Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics, volume 7534 of Lecture Notes in Computer Science, Raphael, B. and Tang, J. (eds.), pp. 225–235. Springer Berlin Heidelberg
[47]
Boucher, C., Bowe, A., Gagie, T., Puglisi, S. J. and Sadakane, K. (2015) Variable-order de Bruijn graphs. In Data Compression Conference (DCC), pp. 383–392
[48]
Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M. and Puglisi, S. J. (2018) Bidirectional variable-order de Bruijn graphs. Int. J. Found. Comput. Sci., 29, 1279–1295
CrossRef Google scholar
[49]
Muggli, M. D., Bowe, A., Noyes, N. R.Morley, P.S., Belk, K. E., Raymond, R., Gagie, T., Puglisi, S. J., Boucher, C. (2017) Succinct colored de Bruijn graphs. Bioinformatics, 33, 3181–3187
[50]
Almodaresi, F., Pandey, P. and Patro, R. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017).
[51]
Almodaresi, F., Sarkar, H., Srivastava, A. and Patro, R. (2018) A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34, i169–i177
CrossRef Pubmed Google scholar
[52]
Muggli, M. D., Alipanahi, B., Boucher, C. (2019) Building large updatable colored de Bruijn graphs via merging. bioRxiv, 229641
[53]
Neil, C.Jones,Pavel A Pevzner, and Pavel Pevzner. (2004) An introduction to bioinformatics algorithms. MIT Press
[54]
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27, 722–736
CrossRef Pubmed Google scholar
[55]
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M. and Phillippy, A. M. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630
CrossRef Pubmed Google scholar
[56]
Chu, J., Mohamadi, H., Warren, R. L., Yang, C. and Birol, I. (2017) Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics, 33, 1261–1270
Pubmed
[57]
Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv, 530972
[58]
Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. and Tse, D. N. (2017) HINGE: long-read assembly achieves optimal repeat resolution. Genome Res., 27, 747–756
CrossRef Pubmed Google scholar
[59]
Myers, G. (2014) Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics, pp. 52–67. Springer
[60]
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly of pyrosequencing reads with mates. Bioinformatics, 24, 2818–2824
CrossRef Pubmed Google scholar
[61]
Lin, Y., Yuan, J., Kolmogorov, M., Shen, M. W., Chaisson, M. and Pevzner, P. A. (2016) Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA, 113, E8396–E8405
CrossRef Pubmed Google scholar
[62]
Prjibelski, A. D.,Vasilinetc, I., Bankevich, A., Gurevich, A., Krivosheeva, T., Nurk, S., Pham, S., Korobeynikov, A., Lapidus, A., and Pevzner. P.A. (2014) Exspander: a universal repeat resolver for DNA fragment assembly. Bioinformatics, 30, i293–i301
[63]
Chin, C.-S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion, G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A., (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods, 13, 1050–1054
CrossRef Pubmed Google scholar
[64]
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E., (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10, 563–569
CrossRef Pubmed Google scholar
[65]
Fasulo, D., Halpern, A., Dew, I., and Mobarry. C. (2002) Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18, S294S302
[66]
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A., (2000) A whole-genome assembly of Drosophila. Science, 287, 2196–2204
CrossRef Pubmed Google scholar
[67]
Berger, A. and Lafferty, J. (2017) Information retrieval as statistical translation. ACM SIGIR Forum, 51, 219–226
[68]
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110
CrossRef Pubmed Google scholar
[69]
Välimäki, N., Ladra, S. and Mäkinen, V. (2012) Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213, 49–58
CrossRef Google scholar
[70]
Egidi, L., Louza, F. A., Manzini, G. and Telles, G. (2018) External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, volume 113, pp. 1–14
[71]
Liu, B., Liu, C.-M., Li, D., Li, Y., Ting, H.-F., Yiu, S. M., Luo, R. and Lam, T. W. (2016) BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics, 17, 499
CrossRef Pubmed Google scholar
[72]
Sohn, J. and Nam, J.-W. (2016) The present and future of de novo whole-genome assembly. Brief. Bioinform., bbw096
[73]
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
[74]
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., (2018) Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol., 36, 875–879
CrossRef Pubmed Google scholar
[75]
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. and McVean, G. (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 44, 226–232
CrossRef Pubmed Google scholar
[76]
Pandey, P., Almodaresi, F., Bender, M. A., Ferdman, M., Johnson, R. and Patro, R. (2018) Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst., 7, 201–207.e4
CrossRef Pubmed Google scholar
[77]
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123
CrossRef Pubmed Google scholar
[78]
Holt, J. and McMillan, L. (2014) Merging of multi-string BWTs with applications. Bioinformatics, 30, 3524–3531
CrossRef Pubmed Google scholar
[79]
Minkin, I., Pham, S. and Medvedev, P. (2017) TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33, 4024–4032
Pubmed
[80]
Marcus, S., Lee, H. and Schatz, M. C. (2014) SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30, 3476–3483
CrossRef Pubmed Google scholar
[81]
Cazaux, B., Lecroq, T. and Rivals, E. (2014) From indexing data structures to de Bruijn graphs. In Combinatorial Pattern Matching, pp. 89–99. Springer
[82]
Baier, U., Beller, T. and Ohlebusch, E. (2016) Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32, 497–504
CrossRef Pubmed Google scholar
[83]
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B. E., Ghaffaari, A., Kersey, P., Kloosterman, W., Makinen, V., Novak, A., (2016) Computational pan-genomics: status, promises and challenges. Brief. Bioinform., 19, 118–135

COMPLIANCE WITH ETHICS GUIDELINES

The authors Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova and Paola Bonizzoni declare that they have no conflict of interests.
This article is a review article and does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

2019 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
AI Summary AI Mindmap
PDF(342 KB)

Accesses

Citations

Detail

Sections
Recommended

/