Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data

Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun

PDF(1009 KB)
PDF(1009 KB)
Quant. Biol. ›› 2020, Vol. 8 ›› Issue (2) : 143-154. DOI: 10.1007/s40484-020-0200-y
RESEARCH ARTICLE
RESEARCH ARTICLE

Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data

Author information +
History +

Abstract

Background: Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.

Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences.

Results: We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method.

Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.

Graphical abstract

Keywords

Markov chains / next generation sequencing / transition probabilities / confidence intervals

Cite this article

Download citation ▾
Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun. Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data. Quant. Biol., 2020, 8(2): 143‒154 https://doi.org/10.1007/s40484-020-0200-y

References

[1]
Almagor, H. (1983) A Markov analysis of DNA sequences. J. Theor. Biol., 104, 633–645
CrossRef Pubmed Google scholar
[2]
Reinert, G., Schbath, S. and Waterman, M. S. (2005) Statistics on words with applications to biological sequences. In: Applied Combinatorics on Words, Lothaire, M. ed., ch. 6, pp. 268–352 New York: Cambridge University Press
[3]
Blaisdell, B. E. (1985) Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol., 21, 278–288
CrossRef Pubmed Google scholar
[4]
Pevzner, P. A., Borodovsky, M.Y., and MironovA. A. (1989) Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dyn., 6, 1013–1026
CrossRef Pubmed Google scholar
[5]
Hong, J. (1990) Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res., 18, 1625–1628
CrossRef Pubmed Google scholar
[6]
Arnold, J., Cuticchia, A. J., Newsome, D. A., Jennings, IIIW. W. 3rd and Ivarie, R. (1988) Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res., 16, 7145–7158
CrossRef Pubmed Google scholar
[7]
Avery, P. J. (1987) The analysis of intron data and their use in the detection of short signals. J. Mol. Evol., 26, 335–340
CrossRef Pubmed Google scholar
[8]
Narlikar, L., Mehta, N., Galande, S. and Arjunwadkar, M. (2013) One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 41, 1416–1424
CrossRef Pubmed Google scholar
[9]
Ren, J., Song, K., Deng, M., Reinert, G., Cannon, C. H. and Sun, F. (2016) Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, 32, 993–1000
CrossRef Pubmed Google scholar
[10]
Billingsley, P. (1961)Statistical Inference for Markov Processes, vol. 2. Chicago: University of Chicago Press Chicago
[11]
Billingsley, P. (1961) Statistical methods in Markov chains. Ann. Math. Stat., 32, 12–40.
CrossRef Google scholar
[12]
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
CrossRef Pubmed Google scholar
[13]
Zerbino, D. R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829
CrossRef Pubmed Google scholar
[14]
Zhai, Z., Reinert, G., Song, K., Waterman, M. S., Luan, Y. and Sun, F. (2012) Normal and compound Poisson approximations for pattern occurrences in NGS reads. J. Comput. Biol., 19, 839–854
CrossRef Pubmed Google scholar
[15]
Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M. and Sun, F. (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J. Comput. Biol., 20, 64–79
CrossRef Pubmed Google scholar
[16]
Sun, F., Arnheim, N. and Waterman, M. S. (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res., 23, 3034–3040
CrossRef Pubmed Google scholar
[17]
Daley, T.and Smith, A. D. (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 22, 3159–3165
[18]
Lander, E. S. and Waterman, M. S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239
CrossRef Pubmed Google scholar
[19]
Zhang, Z. D., Rozowsky, J., Snyder, M., Chang, J. and Gerstein, M. (2008) Modeling ChIP sequencing in silico with applications. PLOS Comput. Biol., 4, e1000158
CrossRef Pubmed Google scholar
[20]
Daley, T. and Smith, A. D. (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327
CrossRef Pubmed Google scholar
[21]
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
CrossRef Pubmed Google scholar
[22]
Schwartz, S., Oren, R. and Ast, G. (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS One, 6, e16685
CrossRef Pubmed Google scholar

SUPPLEMENTARY MATERIALS

The supplementary materials can be found online with this article at https:// 10.1007/s40484-020-0200-y.

ACKNOWLEDGEMENTS

We thank Dr. Gesine Reinert of Oxford University for discussions and help related to the problems studied in this paper. LW is supported by NSFC grants (Nos.11571349 and 91630314), the National Key R&D Program of China under Grant 2018YFB0704304, NCMIS of CAS, LSC of CAS, and the Youth Innovation Promotion Association of CAS. JR and FS were supported by US National Science Foundation (NSF) (DMS-1518001) and National Institutes of Health (NIH) (R01GM120624, 1R01GM131407).

COMPLIANCE WITH ETHICS GUIDELINES

The authors Lin Wan, Xin Kang, Jie Ren and Fengzhu Sun declare that they have no conflict of interests.
This article does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

2020 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
AI Summary AI Mindmap
PDF(1009 KB)

Accesses

Citations

Detail

Sections
Recommended

/