Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data

Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun

Quant. Biol. 2020, Vol. 8, Issue (2): 143-154. DOI: 10.1007/s40484-020-0200-y
RESEARCH ARTICLE

Abstract

Background: Markov chains (MC) have been widely used to model molecular sequences. Estimation of the MC transition matrix, and of confidence intervals for the transition probabilities, from long sequence data has been studied intensively over the past decades. Next generation sequencing (NGS) generates a large number of short reads. These reads can overlap, and some regions of the genome may not be sequenced at all, resulting in a new type of data. Based on NGS data, the transition probabilities of an MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.
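As an illustration of the moment estimators mentioned above, the following is a minimal sketch for a first-order chain over the DNA alphabet. It assumes the estimator is the usual ratio of pooled word counts (the count of the pair ab divided by the count of all pairs starting with a); the abstract does not spell out the estimator, so this form and the function names `count_transitions` and `estimate_transitions` are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_transitions(reads):
    """Count adjacent letter pairs (a, b), pooled over all reads.

    Pairs containing a letter outside ALPHABET (e.g. 'N') are skipped.
    """
    counts = Counter()
    for read in reads:
        for a, b in zip(read, read[1:]):
            if a in ALPHABET and b in ALPHABET:
                counts[(a, b)] += 1
    return counts


def estimate_transitions(counts):
    """Return (p_hat, row_totals) for a first-order chain.

    p_hat[(a, b)] = N_ab / sum_c N_ac; rows with no observations are omitted.
    This ratio-of-counts form is an assumed illustration, not taken from the paper.
    """
    row_totals = {a: sum(counts[(a, b)] for b in ALPHABET) for a in ALPHABET}
    p_hat = {
        (a, b): counts[(a, b)] / row_totals[a]
        for a in ALPHABET
        for b in ALPHABET
        if row_totals[a] > 0
    }
    return p_hat, row_totals


# Toy usage: two overlapping reads from the same region.
reads = ["ACGTAC", "CGTACG"]
counts = count_transitions(reads)
p_hat, row_totals = estimate_transitions(counts)
```

Because overlapping reads contribute the same genomic positions more than once, the pooled counts are not independent observations, which is why the long-sequence theory cannot be applied to them directly.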

Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data have approximately the same distributions as the corresponding statistics for long sequences.

Results: We apply the asymptotic properties of these statistics to derive theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated and real data sets, and compare the results with those obtained by the parametric bootstrap method.
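To make the role of the effective coverage d concrete, here is a hedged sketch of a Wald-type interval for a single transition probability. For long sequences, the classical result is that the estimator is asymptotically normal with variance p(1 - p) divided by the row count. The sketch assumes that for NGS reads this variance is inflated by the factor d, which is one reading of "after scaling by the effective coverage d"; the exact statistics and the definition of d are in the full paper and are not reproduced here. The function name `scaled_wald_interval` is hypothetical, and d is treated as a user-supplied input.

```python
import math
from statistics import NormalDist


def scaled_wald_interval(p_hat, n_row, d, level=0.95):
    """Wald-type interval for one transition probability.

    p_hat : estimated transition probability
    n_row : pooled count of words starting with the conditioning letter
    d     : effective coverage (assumed variance inflation factor; d = 1
            recovers the classical long-sequence interval)
    """
    if n_row <= 0 or d <= 0:
        raise ValueError("n_row and d must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    half = z * math.sqrt(d * p_hat * (1.0 - p_hat) / n_row)
    # Clip to [0, 1] since the target is a probability.
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


# Same estimate and count, with and without the assumed coverage correction.
lo1, hi1 = scaled_wald_interval(0.5, 100, d=1.0)
lo4, hi4 = scaled_wald_interval(0.5, 100, d=4.0)
```

Under this assumption the interval width grows with the square root of d, so ignoring d when reads overlap heavily would give intervals that are too narrow.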

Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.

Keywords

Markov chains / next generation sequencing / transition probabilities / confidence intervals

Cite this article

Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun. Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data. Quant. Biol., 2020, 8(2): 143-154. DOI: 10.1007/s40484-020-0200-y

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature

Supplementary files

QB-20200-OF-WL_suppl_1
