Comparison of k-mer-based de novo comparative metagenomic tools and approaches

Alise Jany Ponsero, Matthew Miller, Bonnie Louise Hurwitz

Microbiome Research Reports, 2023, Vol. 2, Issue 4: 27

DOI: 10.20517/mrr.2023.26
Original Article


Abstract

Aim: Comparative metagenomic analysis requires measuring pairwise similarity between the metagenomes in a dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes depend heavily on the quality and completeness of the reference database, which makes them difficult to apply to less-studied microbiota. In contrast, de novo comparative metagenomic methods rely only on the sequence composition of the metagenomes being compared. Although each approach has its strengths and limitations, systematic comparisons between them remain limited.

Methods: We developed sets of simulated short-read metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics, and (2) evaluate the effect of k-mer sketching and filtering. We then used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo comparative metagenomic analysis.

Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distances correlated well with taxonomic distances for quantitative beta-diversity metrics, but poorly for presence/absence distances. Community complexity, in terms of taxa richness, and sequencing depth significantly affected the quality of the k-mer-based distances, whereas low levels of sequence contamination and sequencing error had limited impact. Finally, we benchmarked currently available de novo comparative metagenomic tools, compared their output on two datasets of fecal metagenomes, and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches.

Conclusion: This study expands our understanding of the strengths and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
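
To make the distinction between quantitative and presence/absence k-mer distances concrete, the following is a minimal Python sketch and not the pipeline used in the study. It assumes exact counting of canonical k-mers (no sketching or filtering), and it uses Bray-Curtis as an example of a quantitative metric and Jaccard as an example of a presence/absence metric. The abstract does not name specific metrics, so these choices, along with all function names and the toy reads, are illustrative assumptions.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes upper-case A/C/G/T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(reads, k: int) -> Counter:
    """Count canonical k-mers in an iterable of reads (hypothetical helper).

    K-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= valid:
                counts[canonical(kmer)] += 1
    return counts


def bray_curtis(a: Counter, b: Counter) -> float:
    """Quantitative dissimilarity: uses k-mer abundances."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


def jaccard_distance(a: Counter, b: Counter) -> float:
    """Presence/absence dissimilarity: ignores k-mer abundances."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


# Toy samples (assumed data): same two "genomes", opposite relative abundances.
sample_a = kmer_counts(["ACGTACGT"] * 9 + ["TTTTGGGG"], k=4)
sample_b = kmer_counts(["ACGTACGT"] + ["TTTTGGGG"] * 9, k=4)

print("Bray-Curtis:", bray_curtis(sample_a, sample_b))
print("Jaccard:    ", jaccard_distance(sample_a, sample_b))
```

Both toy samples contain exactly the same set of k-mers, so the presence/absence distance is zero, while the abundance-based distance is not. This is one simple way in which the two families of metrics can disagree.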

Keywords

De-novo comparative metagenomics / metagenomes / k-mers

Cite this article

Alise Jany Ponsero, Matthew Miller, Bonnie Louise Hurwitz. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Research Reports, 2023, 2(4): 27. DOI: 10.20517/mrr.2023.26


