Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data
Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun
Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data
Background: Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.
Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences.
Results: We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method.
Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
Markov chains / next generation sequencing / transition probabilities / confidence intervals
[1] |
Almagor, H. (1983) A Markov analysis of DNA sequences. J. Theor. Biol., 104, 633–645
CrossRef
Pubmed
Google scholar
|
[2] |
Reinert, G., Schbath, S. and Waterman, M. S. (2005) Statistics on words with applications to biological sequences. In: Applied Combinatorics on Words, Lothaire, M. ed., ch. 6, pp. 268–352 New York: Cambridge University Press
|
[3] |
Blaisdell, B. E. (1985) Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol., 21, 278–288
CrossRef
Pubmed
Google scholar
|
[4] |
Pevzner, P. A., Borodovsky, M.Y., and MironovA. A. (1989) Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dyn., 6, 1013–1026
CrossRef
Pubmed
Google scholar
|
[5] |
Hong, J. (1990) Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res., 18, 1625–1628
CrossRef
Pubmed
Google scholar
|
[6] |
Arnold, J., Cuticchia, A. J., Newsome, D. A., Jennings, IIIW. W. 3rd and Ivarie, R. (1988) Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res., 16, 7145–7158
CrossRef
Pubmed
Google scholar
|
[7] |
Avery, P. J. (1987) The analysis of intron data and their use in the detection of short signals. J. Mol. Evol., 26, 335–340
CrossRef
Pubmed
Google scholar
|
[8] |
Narlikar, L., Mehta, N., Galande, S. and Arjunwadkar, M. (2013) One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 41, 1416–1424
CrossRef
Pubmed
Google scholar
|
[9] |
Ren, J., Song, K., Deng, M., Reinert, G., Cannon, C. H. and Sun, F. (2016) Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, 32, 993–1000
CrossRef
Pubmed
Google scholar
|
[10] |
Billingsley, P. (1961)Statistical Inference for Markov Processes, vol. 2. Chicago: University of Chicago Press Chicago
|
[11] |
Billingsley, P. (1961) Statistical methods in Markov chains. Ann. Math. Stat., 32, 12–40.
CrossRef
Google scholar
|
[12] |
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
CrossRef
Pubmed
Google scholar
|
[13] |
Zerbino, D. R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829
CrossRef
Pubmed
Google scholar
|
[14] |
Zhai, Z., Reinert, G., Song, K., Waterman, M. S., Luan, Y. and Sun, F. (2012) Normal and compound Poisson approximations for pattern occurrences in NGS reads. J. Comput. Biol., 19, 839–854
CrossRef
Pubmed
Google scholar
|
[15] |
Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M. and Sun, F. (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J. Comput. Biol., 20, 64–79
CrossRef
Pubmed
Google scholar
|
[16] |
Sun, F., Arnheim, N. and Waterman, M. S. (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res., 23, 3034–3040
CrossRef
Pubmed
Google scholar
|
[17] |
Daley, T.and Smith, A. D. (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 22, 3159–3165
|
[18] |
Lander, E. S. and Waterman, M. S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239
CrossRef
Pubmed
Google scholar
|
[19] |
Zhang, Z. D., Rozowsky, J., Snyder, M., Chang, J. and Gerstein, M. (2008) Modeling ChIP sequencing in silico with applications. PLOS Comput. Biol., 4, e1000158
CrossRef
Pubmed
Google scholar
|
[20] |
Daley, T. and Smith, A. D. (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327
CrossRef
Pubmed
Google scholar
|
[21] |
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
CrossRef
Pubmed
Google scholar
|
[22] |
Schwartz, S., Oren, R. and Ast, G. (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS One, 6, e16685
CrossRef
Pubmed
Google scholar
|
/
〈 | 〉 |