Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data

Lin Wan; Xin Kang; Jie Ren; Fengzhu Sun

doi:10.1007/s40484-020-0200-y

Quant. Biol. ›› 2020, Vol. 8 ›› Issue (2) :143 -154. DOI: 10.1007/s40484-020-0200-y

RESEARCH ARTICLE

Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data

Author information +

History +

PDF (1009KB)

Abstract

Background: Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid.

Methods: In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences.

Results: We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method.

Conclusions: We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.

Graphical abstract

Keywords

Markov chains / next generation sequencing / transition probabilities / confidence intervals

Cite this article

Download citation ▾

Lin Wan, Xin Kang, Jie Ren, Fengzhu Sun. Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data. Quant. Biol., 2020, 8(2): 143-154 DOI:10.1007/s40484-020-0200-y

登录浏览全文

4963

注册一个新账户忘记密码

INTRODUCTION

Markov chains (MC) have been widely used to model molecular sequences [¹,²]. They have been used to study the dependencies between the bases [³], the enrichment and depletion of certain word patterns [⁴], prediction of occurrences of long word patterns from short patterns [⁵,⁶], and the detection of signals in introns [⁷]. Narlikar et al. [⁸] studied the effect of the order of MCs on several biological problems including phylogenetic analysis, assignment of sequence fragments to different genomes in meta-genomic studies, motif discovery, and functional classification of promoters. Usually the transition probabilities of the Markov chain are estimated from data, and hence the uncertainty of the estimated transition probabilities has to be taken into account. This is done for example in [²] for word counts in long sequences.

Often DNA samples do not consist of a few long sequences, but of many short sequences, for example based on next generation sequencing (NGS) technology. For short sequences the standard asymptotic results do not apply, see for example [⁹]. In [⁹] it was shown that the word count standardization includes an additional scaling factor, the effective coverage d. While [⁹] considered word count statistics under a Markov chain with estimated transition probabilities, in this study we are interested in finding confidence regions for the transition probabilities. We derive a normal and a chi-square approximation for several statistics related to MC based on many short sequences (also called reads) from a long underlying sequence that is assumed to be a realization of a Markov chain; and again the effective coverage d appears.

Consider an r-th order Markov chain A = A₁A₂… A_n of length n. The maximum likelihood estimate of transition probability from a word w of length r to a letter b is given by

P^w, b = f w b / f w,

where fw is the number of occurrences of word w within the sequence. Many investigators have studied the limit distribution of

P^w, b

and other related statistics when the sequence length n tends to infinity [¹⁰,¹¹]. In the classical papers [¹⁰,¹¹], Billingsley studied many important problems related to MC. In particular, he showed the following important results.

Suppose the long sequence follows a simple first order Markov chain, with finite state space

A

of size s. We call the elements of

A

letters, as we will be thinking of a Markov chain on the set of nucleotides or amino acids. Denote the transition probability from letter i to letter j by p_ij , and let

(1)

ξ i j = (f i j − f i p i j) / f i 12

Theorem 3.1 in [¹¹] shows that if the Markov chain is stationary and ergodic, then as the sequence length n → ∞ the distribution of the s²-dimensional random vector x = (x_i,j ) converges to the normal distribution with mean 0 and covariance matrix (l_i,j_;_k,l), where

(2)

λ i, j; k, l = δ i, k (δ j, l p i j − p i j p k l),

where du,v is the Kronecker delta that is 1 when u = v, and 0 otherwise. Let z = (z1,…, zs) be the random vector with components

(3)

ζ i = (f i − n p i) / n 12,

Theorem 3.3 in [¹¹] also shows that the distribution of the random vector z converges to the normal distribution with covariance matrix

α i j + O (1 n)

, where

α i j = δ i j p i − p i p j + p i Σ m = 1 ∞ (p i j (m) − p j) + p j Σ m = 1 ∞ (p j i (m) − p i) .

In [¹¹] it is also shown that the statistic

(4)

S = Σ i j (f i j − f i p i j) 2 f i p i j .

is asymptotically chi-square in distribution, and the number of degrees of freedom is s(s - 1) (assuming the transition probabilities p_ij_{are all positive), where s is the size of the alphabet set $A$ .}

These results were derived under the assumption of one sample consisting of a long realisation of the Markov chain. With the development of next generation sequencing technologies, instead of one long sequence, typically short fragments from the long genome sequence are sampled, while assembly of these genome fragments as in [¹²,¹³] can be challenging.

In this study, we investigate the asymptotic distributions of similar statistics described above as in [¹⁰,¹¹], but under the new model proposed in [⁹,¹⁴] for the NGS short reads data (see Fig. 1 and Section of “Methods” for the details of the model). We demonstrate that, after scaling by the effective coverage d proposed in [⁹], these statistics converge to the same corresponding distributions as those statistics for long sequences. We further apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate these asymptotic distributions and our theoretical confidence intervals using simulated as well as real data sets. In addition, we propose a parametric bootstrap method, and compare the bootstrap confidence intervals with our theoretical confidence intervals. We find that, the asymptotic properties of these statistics and the theoretical confidence intervals given in this study are highly accurate for the NGS short reads data, providing a powerful tool for NGS data analysis.

RESULTS

Our theoretical results are mainly presented in Section of “Methods”: we first introduce the probabilistic model for NGS data based on a MC sequence and random sampling of short reads; we then derive the statistics of interest based on NGS reads and the effective coverage d; we then develop two propositions on the asymptotic distributions of these statistics, and finally we propose the theoretical confidence intervals as well as the parametric bootstrap confidence intervals for MC transition probabilities based on NGS short reads data. Although we cannot rigorously prove the major propositions developed in this study, we validate statistical properties of these NGS related statistics and theoretical confidence intervals of transition probabilities using simulated NGS data sets as well as a real viral genome sequence data in this section.

Validating the theoretical results using simulated NGS data sets

We validate Proposition 1, Proposition 2 and the theoretical confidence intervals for MC transition probabilities of Eq. 16 in Section of “Methods” using simulated NGS short reads data sets.

For each simulated NGS data set, we first simulate the underlying genome sequence with MC using the same set of parameters of transition probabilities P = [p_ij_{] as in [⁹], which are listed in Supplementary Table S1. The initial position of the simulated genome is sampled based on the stationary distribution of the MC. We set the genome length G = 100, 000 for all simulated data sets. We then randomly sample M reads of length b bps from the simulated underlying genome. The reads are sampled based on the Poisson process as described in Section of “Methods”. The Poisson process can be homogeneous or inhomogeneous.}

We simulate 3 NGS data sets with reads sampled by homogeneous Poisson processes. The parameters of the 3 homogeneous data sets are (1) M = 500, b = 100 (denoted as H1 thereafter); (2) M = 1000, b = 200 (denoted as H2 thereafter); and (3) M = 1000, b = 300 (denoted as H3 thereafter).

For each of these homogeneous data sets, the rate c of Poisson processes is calculated by

c = M G − β ≈ M G .

Noting Remark 1, the sequencing coverage C = bc = bM/G. We follow Eq. 13 and calculate the corresponding effective coverage d as

d = 1 + β c = 1 + β M G .

For real NGS data, the reads are generally generated inhomogeneously from the genome. To test our results for inhomogeneous cases, we also simulate 3 NGS data sets with reads sampled by inhomogeneous Poisson processes. To achieve this, we divide the long underlying genome sequence into 100 consecutive non-overlapped bins b1, … , bt, … , b100 with the same size; the sampling rates along the positions within the same bin bt are equal; and the rates of the different bins are proportional to 100 random variables drawn from the gamma distribution G(1, 20) [¹⁵]. The simulated 3 inhomogeneous data sets of different read numbers and read lengths are (1) M = 500, b = 100 (denoted as IH1 thereafter); (2) M = 1000, b = 200 (denoted as IH2 thereafter); and (3) M = 1000, b = 300 (denoted as IH3 thereafter).

Validating the approximate normal distribution of $ξ i j (R)$ in Proposition 1

To validate the approximate normal distribution of

ξ i j (R)

by Proposition 1, we first standardize the

ξ i j (R)

ξ i j (R) ˜ = ξ i j (R) d p i j (1 − p i j) → N (0, 1),

where the estimated values of d and p_i,j are used in the standardization of

ξ i j (R)

. For each of the 6 data sets (H1, H2, H3, IH1, IH2, and IH3), we repeat the simulating processes by the same parameters 2,000 times to generate 2,000 samples, and then calculate the

ξ i j (R) ˜

value for each sample, and check whether 2,000 values of

ξ i j (R) ˜

have the standard normal distribution by the Kolmogorov-Smirnov (KS) test.

For the homogeneous data sets, for all ij, the

ξ i j (R) ˜

's have approximately mean of 0 and variance of 1, with p-values being uniformly distributed in the region of (0, 1) (Fig. 2), indicating that all

ξ i j (R) ˜

's follow the standard normal distribution. We obtain similar results for the inhomogeneous data sets (Supplementary Fig. S1). These results show strong validations of Proposition 1.

Validating the approximate distributions of $ζ i (R)$ and $S (R)$ in Proposition 2

To validate the approximate distributions of

ζ i (R)

and

S (R)

in Proposition 2, we first standardize them as follows:

ζ i (R) ˜ = ζ i (R) d α i i → N (0, 1),

and

S (R) ˜ = S (R) d → χ 2 (s (s − 1)) .

We then check on the simulated data sets, whether the new statistic

ζ i (R) ˜

has the standard normal distribution for all i ∈ A, and the new statistic

ζ i (R) ˜

has the chi-square distribution with df = s(s - 1) = 4 × 3= 12, respectively.

Our simulation results show that: 1) for all i ∈ A, the

ζ i (R) ˜

approximates to the standard normal distribution quite well for both homogeneous data sets (Supplementary Fig. S2) and inhomogeneous data sets (Supplementary Fig. S3); 2) the

S (R) ˜

approximates a chi-square distribution with df = 12 (Mean=12 and Variance=24) on both homogeneous data sets and inhomogeneous data sets as well: the calculated values of (Mean, Variance and p-value) are (12.02, 25.15, 0.402) for H1, (12.00, 24.75, 0.690) for H2, (12.14, 24.82, 0.556) for H3, (11.81, 22.92, 0.058) for IH1, (12.15, 23.65, 0.064) for IH2, and y (11.88, 24.18, 0.453) for IH3.

Validating the theoretical confidence intervals of MC transition probabilities using NGS data

One of the major applications of Proposition 1 is to construct the theoretical confidence intervals of MC transition probabilities using NGS data. We validate the derived theoretical level 1 -a confidence interval Eq.16 by simulation study.

For each of the simulated homogeneous (H1, H2 and H3) and inhomogeneous (IH1, IH2 and IH3) data sets, we

1. Simulate one sample, estimate $p^i j s$ ,

2. Calculate theoretical level 1 -a (a ∈ {0.01, 0.05, 0.1}) confidence interval for $p i j$ by Eq.16;

3. Repeat Steps 1 and 2 for 1,000 times, and calculate the fraction of times that the theoretical level 1 -a confidence intervals cover the true $p i j$ .

The simulation results in Fig. 3 and Supplementary Fig. S4 show that, the fractions of times that the theoretical confidence intervals cover the true $p i j$ are very close to 1 -a (a = 0.01, 0.05, 0.1) for both homogeneous and inhomogeneous data sets.

Calculating theoretical confidence intervals of $p i j$ based the estimated $d^$

In real study, the effective coverage d is generally not available for the NGS data sets. To deal with data sets with unknown d, we adopt the approach proposed in [⁹], and estimate the effective coverage d as follows:

(5) $d^= m e d i a n {(Z w (R)) 2, w ∈ A k} / 0. 456$

where $Z w (R)$ is given in Eq. 14.

We find that, by plug-in the estimated effective coverage $d^$ to the theoretical confidence intervals for $p i j$ in Eq.16, we can still achieve similar accuracies in calculating the theoretical confidence intervals (Supplementary Figs. S5, S6).

Validating the theoretical confidence interval for $p i j$ by parametric bootstrap

For each of the homogeneous NGS data sets (H1, H2, H3), we calculate the confidence intervals by parametric bootstrap method. We repeat 100 bootstrap calculation for each data set. We find that the fraction of times that the bootstrap confidence intervals covering the true transition probabilities approximate the level 1 -a well (Supplementary Figs. S7S9). Meanwhile, the theoretical confidence intervals by Eq.16 achieve similar results as the parametric bootstrap confidence intervals both in 1) the fraction of time cover the true transition probability (Supplementary Figs. S7S9), and 2) lengths of confidence intervals (Supplementary Fig. S10).

Application to real viral genome sequence data

We apply the theoretical results developed in this study on a real viral genome sequence Bacillus phage SP-beta. The length G of the genomic sequence, which was downloaded from NCBI, is 134, 416 in bp. The frequencies of A, G, C and T are 45,580 (33.91%), 21,078 (15.68%), 25,477 (18.95%) and 42,281 (31.46%), respectively. We model the Bacillus genome sequence with a 2ndorder MC after estimating MC order by method developed in [⁹]. We estimate the transition probability based on the genome sequence as
$p^i j, k = f i j k f i j,$
which are shown in Supplementary Table S2. The stationary distribution p of the 2nd-order MC is calculated based on the estimated transition probabilities (Supplementary Table S2). Based on Theorem 3.1 in Billingsley [¹¹], we derive the theoretical confidence for long genome sequence as follows
(6) $CI 1 − α Long = (p^i j, k − z 1 − α 2 p^i j, k (1 − p^i j) f i j, p^i j, k + z 1 − α 2 p^i j, k (1 − p^i j, k) f i j)$

See Table 1 for an example of the confidence intervals for p_AA,A by (6).

We then simulate NGS data sets by generating short reads from the Bacillus phage SP-beta genome sequence based on both homogeneous and inhomogeneous Poisson processes. The read length is b = 200 in bp and we generate NGS data sets using coverage C = bc ∈ {0.5, 1, 2, 5, 10}. We then estimate the MC transition matrix using NGS reads as
$p ˜ i j, k = f i j k (R) f i j (R),$
and extend the theoretical confidence intervals of Eq.16 for NGS to the 2nd-order MC transition probabilities p_ij,k as follows
(7) $CI 1 − α NGS = (p^i j, k − z 1 − α 2 d p^i j, k (1 − p^i j, k) f i j (R), p^i j, k + z 1 − α 2 d p^i j, k (1 − p^i j, k) f i j (R)) .$

We calculate the theoretical confidence intervals by Eq. (7) as well as the bootstrap confidence intervals by Eq. (19) for transition probability p_AA,A on homogeneous NGS data sets (Table 1). Although we do not have true transition probabilities to validate these confidence intervals, we find that the theoretical confidence intervals by Eq. (7) as well as the parametric bootstrap by Eq. (19) are very close to the confidence intervals estimated from the long sequence by Eq. 6 (Table 1).

The theoretical confidence intervals by Eq. (7) for inhomogeneous NGS data sets are list in Supplementary Table S3.

DISCUSSION

In this study, we proposed statistics for estimations of MC transition probabilities and their corresponding confidence intervals based on next generation sequence reads. Although we did not rigorously prove the two propositions (Propositions 1 and 2) on the asymptotic distributions of $ξ i j (R)$ , $ζ i (R)$ and S^(R), our simulation studies showed that the propositions and the derived formulas for the confidence intervals of the transition probabilities perform decently well on simulated NGS datasets with wide regions of sequencing coverage.

In addition, we proposed a parametric bootstrap method for the confidence intervals of MC transition probabilities based on NGS reads data, and showed that both theoretical confidence intervals derived from the two propositions and the bootstrap methods achieve similar results with high accuracies, further supporting the validity of Propositions 1 and 2. However, it is worth noting that the parametric bootstrap based approach is computationally expensive in sampling data sets many times; while the theoretical confidence intervals bypass the need of the time consuming sampling step, thereby making it highly applicable to large-scale NGS datasets.

The developed theory and methods implicitly assume error-free for the NGS short reads data. Since the sequencing error rates of bulk sequencing are lower than 1% for Illumina sequencing machines, the error-free assumption on NGS data will not be a big issue. The rapid advances in single-cell sequencing (SCS) [¹⁶,¹⁷] provide unprecedented insights of cell heterogeneity. The statistical theory and methods developed in this study will be promising for the quantitative characterizations of SCS data. However, due to limited molecular materials within a single cell as well as high dropout events, the SCS data are error-prone. For SCS data analysis, the sequencing errors are non-negligible and will blur the estimators, making their asymptotic distributions biased. We will study the asymptotic distribution theory of statistics based on the error-prone NGS data in our future work.

METHODS

Probabilistic modeling of a MC sequence and random sampling of the reads using NGS

In NGS, a large number of reads are randomly sampled from the genome. Hence two random processes are involved in the generation of the short reads data: i) the generation of the underlying genome sequence with genome length G base pairs (bps), and ii) random sampling of the M reads with length b from the simulated genome in i). See Fig. 1 for an illustration.

In this study, we assume the underlying genome sequence is generated by a Markov Chain. We treat mainly the case of a first-order Markov chain, as the results can be easily adapted to higher order Markov chains. We thus use a first-order stationary and ergodic MC to model the underlying genome sequence with each letter taking values in a finite alphabet set $A$ of size s. Since our study is based on genomic sequences, $A$ = {A, C, G, T} and s = 4. We denote the transition matrix of MC as P = [p_ij_]4×4.

We model the random sampling of the reads using a Poisson process as in the Lander-Waterman model [¹⁸]. The Poisson process $ρ$ is a process on [0, ∞) and is assumed to be independent of the Markov chain. As in [¹⁴,¹⁸-²¹], we assume that the genome is a sequence of contiguous bases and that the distribution of reads along the genome follows a potentially inhomogeneous Poisson process with rate c(x) = c(x, G) at position x. We suppress the argument G unless we take limits. If c(x) = c for all x, we refer to the sampling of the reads as homogeneous. Allowing for inhomogeneity in the sampling Poisson process reflects observations about NGS data in [²²] that the sampling rates of reads at different positions are not homogeneous.

We assume that all sampled reads have the same length of b bps. This length b does not depend on the length of the underlying sequence, nor on the parameters of the Markov chain or the Poisson process. For NGS data, b is given by the technology which is applied to obtain the reads.

A total of M reads are independently sampled from the genome of length G bps. The M reads are our data; they are a realisation of the above Poisson process and the Markov chain sample A, as follows. Suppose that an event in $ρ$ occurs at t. Let s be the closest integer ≤t. Then the Poisson event results in a read R = (As, As+1, ... , As+b ) from the underlying Markov chain A = (A1, ... , AG).

Statistics related to Markov chains based on NGS data

We extend the statistics $ξ i j$ , $ζ i$ and S for long sequence in [¹¹] to $ξ i j (R)$ , $ζ i (R)$ and S(R) for the NGS short sequence/reads data accordingly as follows.

Let $f i j (R)$ be the number of occurrences of dinucleotide (ij) within the M reads and $f i (R)$ = $Σ j f i j (R)$ . Define
(8) $ξ i j (R) = f i j (R) − f i (R) p i j f i (R) .$

Note that the moment estimation of p_ij from NGS data is $p^i j (R) = f i j (R) f i (R)$ , thus $ξ i j (R)$ can also be written as
$ξ i j (R) = f i (R) (p^i j (R) − p i j) .$

Similarly, we define
(9) $ζ i (R) = f i (R) − n (R) p i n (R),$
and
(10) $S (R) = Σ i j (f i j (R) − f i (R) p i j) 2 f i (R) p i j,$
where
$n (R) = M β,$
with M being the number of reads.

To see the centering in the above equations, we first calculate the expectation of $f i (R)$ and of $f i j (R)$ . For this purpose we introduce some notations. For a read Rk = (r_k,₁, . . . , r_k,_b ), we denote its starting position on genome G by S(Rk ), rounded to the lower integer. The left-hand end of the read is a Poisson process which creates the read starts at t such that s≤t<s + 1 for a nonnegative integer s, then we set S(R_k ) = s. Let f_i be the number of occurrences of letter i in the genome, we have that
$f i = Σ s = 1 G 1 (A s = i), E f i = G p i,$
where 1 denotes the indicator function which is 1 when the argument is true, and 0 otherwise.

To calculate the counts of letter i in the M reads, the $f i (R)$ can be written in the form of S(R_k ) as follows
$f i (R) = Σ s = 1 G − β Σ k = 1 M 1 (S (R k) = s) Σ ℓ =1 β 1 (r k, ℓ = i) = Σ s = 1 G − β Σ k = 1 M 1 (S (R k) = s) Σ ℓ =1 β 1 (A s + ℓ − 1 = i) .$

By the independence of the Poisson process and the Markov chain, we have
$E f i (R) = p i β Σ s = 1 G − β E (Σ 1 (S (R k) = s) k = 1 M) = p i β Σ s = 1 G − β E (ρ (s, s + 1)) = p i β ∫ 0 G − β c (x) d x = (G − β) c ¯ β p i = M β p i = n (R) p i,$
where $c ‾$ is the averaged coverage, which is calculated as follows
$c ‾ = 1 G − β ∫ 0 G − β c (x) d x = M / (G − β) .$

This explains the centering in Eq. 9.

For the centering in Eq. 8 we first calculate
$f i j (R) = Σ s = 1 G − β Σ k = 1 M 1 (S (R k) = s) Σ ℓ =1 β − 1 1 (r k, ℓ = i, r k, ℓ + 1 = j) = Σ s = 1 G − β Σ k = 1 M 1 (S (R k) = s) Σ ℓ =1 β − 1 1 (A s + ℓ − 1 = i, A s + ℓ = j) .$

Note that
$E (1 (A s + ℓ − 1 = i, A s + ℓ = j) | A 1, ..., A s + ℓ − 1) = p i j 1 (A s + ℓ − 1 = i),$
we then have
$E f i j (R) = p i j Σ s = 1 G − β Σ k = 1 M 1 (S (R k) = s) Σ ℓ =1 β 1 (A s + ℓ − 1 = i) = p i j f i (R) .$

This conditional expectation calculation makes the centering in Eq. 8 reasonable. Note that we do not center by $E f i j (R) = M β p i p i j$ . The centering in Eq. 10 follows from the centering in Eq. 8.

The effective coverage d of NGS data

In order to state our results we need to introduce some more notations. For a position s ∈{1, 2, . . . , G}, let L(s) denote the (random) number of reads which cover position s. For n = 1, 2, . . . ,
$C (n) = {s ∈ {1, 2, ..., G} : L (s) = n}$
denote the set of positions that is covered by exactly n reads. As in [⁹], we also call C(n) the region of coverage n. Note that a region does not need to be contiguous; we group all the parts of the genome with coverage n into the n-th region.

We define
(11) $d n = 1 G E | C (n) |,$
which is the expected percentage of the genome that has coverage n. We then define the effective coveraged as follows
(12) $d = d (G) = Σ n = 1 ∞ n 2 d n Σ n = 1 ∞ n d n .$

To calculate d, we note that
$E | C (n) | = Σ s = 1 G P (L (s) = n) = Σ s = 1 G P (ρ (s − β, s] = n) .$
can be with an explicit expression, since $ρ (s − β, s]$ has a Poisson distribution with parameter $∫ s − β s c (x) d x$ . In the homogeneous case, $ρ (s − β, s]$ has Poisson distribution with parameter bc. Then d_n = P(Poi(bc) = n), where Poi(bc) denotes a Poisson random variable with parameter bc. Thus in the homogeneous case, we have
$Σ n = 1 ∞ n d n = E P o i (β c) = β c,$
and
$Σ n = 1 ∞ n 2 d n = E P o i 2 (β c) = β c + (β c) 2 .$

Thus, in the homogeneous case, d has a simple expression as
(13) $d = 1 + β c .$

Remark 1. The results in [⁹] also include d, but the notation is slightly different. The rate c(x) in this paper translates into the rate c(x)/b in [⁹] so that bc in this paper is c in [⁹]. We argue that in the setup it is more intuitive to let the rate of the Poisson process not be scaled by the length of the reads, and hence we change in notation.

Asymptotic distributions of statistics related to Markov chains based on NGS data

The previous study [⁹] showed that under the same NGS short reads generative model as in Methods section which models the underlying long genome sequence as a MC and uses the LanderWaterman model for random sampling of the reads, the traditional standardization of word count as
(14) $Z w (R) = f w (R) − E w (R) σ^w (R),$
where $(σ^w (R)) 2 = E w (R) (1 − f − w (R) f − w − (R)) (1 − f − w − (R) f − w (R))$ , for NGS short reads does not converge to the standard normal distribution as that for a long sequence. Thus [⁹], proposed the concept of effective coverage d for NGS short reads data given in Eq. 12, and proved that $Z w R d$ approximates to the standard normal distribution.

In this study, we show through simulations that the effective coverage d is also critical in the asymptotic distribution theory of MC transition probability estimators based on NGS short reads. We have the following two propositions for the asymptotic distributions of $ξ i j (R)$ , $ζ i (R)$ and S^(R), all of which are related to MC for NGS short reads data.

Proposition 1. Assuming the underlying genomic sequence follows a stationary and ergodic MC with transition probability matrix $P = [p i j] 4 × 4$ . We also assume that a set of reads by NGS are randomly sampled according to the Poisson process, possibly inhomogeneous, with rate function c(x). Then, as the genome length G tends to infinity, $(1 d ξ i j (R)) i, j ∈ A$ converge in distribution to a mean zero normal vector with covariance S with entries $Σ i, j; k, l = λ i, j; k, l$ where d is the scaling factor given in Eq. 12 and $λ i, j; k, l$ is given in Eq. 2.

Proposition 2. Under the same assumptions of Proposition 1, as G → ∞,

1. The vector $(1 d ζ i (R)) i ∈ A$ converges in distribution to a mean zero normal vector with covariance matrix
$(α i j) i, j ∈ A$
.

2. The statistic S^(R)/d has an approximate $χ 2$ -distribution with s(s - 1) degrees of freedom.

Remark 2. In this study, we only show the validity of the two propositions by simulation studies. Currently, we cannot provide rigorous mathematical proofs for the two propositions, due to the complex intercorrelations of words within reads. We leave them as our conjectures for future mathematical justifications.

Theoretical confidence intervals for MC transition probabilities based on NGS reads data

We show in Proposition 1 that, $ξ i j R$ has an asymptotic normal distribution
$ξ i j R = f i j R − f i R p i j f i (R) ∼ N (0, d p i j (1 − p i j)) .$

We applied this property to obtain the theoretical confidence interval for p_ij using NGS data. By dividing the numerator and denominator by $f i (R)$ , we have
$ξ i j R = p^i j − p i j 1 / f i (R) ∼ N (0, d p i j (1 − p i j)),$
where, $p^i j = f i j R / f i R$ , and hence,
$p^i j − p i j d p i j (1 − p i j) / f i (R) ~ N (0, 1) .$

For 0 <a< 1, let z₁_-a be the number such that the area under the standard normal density function to the left of z₁_-a is 1 -a. Thus, we have
(15) $P (− z 1 − α 2 ≤ p^i j − p i j d p i j (1 − p i j) / f i (R) ≤ z 1 − α 2) = 1 − α .$

Manipulation of the inequalities gives
$P (1 1 + γ (p^i j + 12 γ − γ p^i j (1 − p^i j) + γ 2) ≤ p i j ≤ 1 1 + γ (p^i j + 12 γ + γ p^i j (1 − p^i j) + γ 2)) = 1 − α,$
wher $γ = z 1 − α 2 2 d f i (R)$ . Therefore, we obtain the level 1 -a confidence interval for p_ij as
$CI 1 − α = 1 1 + γ (p^i j + 12 γ − γ p^i j (1 − p^i j) + γ 2, p^i j + 12 γ + γ p^i j (1 − p^i j) + γ 2) .$

In practice, $f i (R)$ can be very large, making $f i (R) ≫ z 1 − α 2 2 d$ and $γ ≈ 0$ . We thus discard terms of resulting in a simplified confidence interval for p_ij as
(16) $CI 1 − α NGS = (p^i j − z 1 − α 2 d p^i j (1 − p^i j) f i (R), p^i j + z 1 − α 2 d p^i j (1 − p^i j) f i (R)) .$

Parametric bootstrap for obtaining the confidence intervals for transition probabilities and the stationary distribution of MC

Let X be a stationary and ergodic Markov chain with transition matrix P= $[p i j]$ _s×s. Assume that a set of reads are randomly sampled from X according to the Poisson process, denoted by {R₁, R₂, … , R_M}. We can estimate the transition matrix P and stationary distribution $π$ by $P^= [p^i j]$ and $π^= [π^i]$ , where,
(17) $p^i j = f i j (R) f i (R),$
(18) $π^i = Σ j ∈ S f i j (R) Σ m = 1 M (| R m | − 1),$
where |R_m| is the length of read R_m, m = 1, 2, …, M . Moreover,
$π^P^= π^,$
$Σ i ∈ S π^i = 1.$

When G → ∞,
$lim ⁡ G → ∞ P^= P, lim ⁡ G → ∞ π^i = π i .$

We provide a parametric bootstrap method for finding the confidence intervals for p_ij and p_i_{as follows:}

B1 Simulate the NGS data set. Simulate a genome sequence of length G by the stationary and ergodic Markov chain P , and the sequence starts from the stationary distribution p; generate NGS data set by sampling M reads ${R 1, R 2, ⋯, R M}$ from the simulated genome by the Poisson process. Note that, for a real data set, we skip Step B1 and go directly to B2.

B2 Given a the set of reads of the NGS data set (either a simulated one from Step B1 or a real one), estimate parameters of transition matrix $P^$ and stationary distribution $π^$ by Eqs. 17 and 18.

B3 Generate bootstrap data set. Simulate a new Markov genome sequence of length G based on the estimated parameters $P^$ and $π^$ in B2; and generate a bootstrap data set with M reads ${R 1 *, R 2 *, ⋯, R M *}$ by the Poisson process.

B4 Estimate parameters of transition matrix $P^*$ and stationary distribution $π^*$ from the simulated bootstrap data set in B3 by Eqs. 17 and 18.

B5 Repeat Steps B3–B4B times to get B estimators $p^i j *$ s and $π^i *$ s; For each dinucleotide, order $p^i j *$ from the smallest to largest values as $p^i j *$ (1), $p^i j *$ (2), • • • , $p^i j *$ (B), and the level 1 -a confidence interval for p_ij is given by
(19) $CI 1 - α B = (p^i j * (⌊ B × α 2 ⌋), p^i j * (⌈ B × (1 − α 2) ⌉));$

For each i, order $π^i *$ from the smallest to largest values as $π^i *$ (1), $π^i *$ (2), … , $π^i *$ (B), and the level 1 -a confidence interval for $π i$ is given by
(20) $CI 1 - α b = (π^i * (⌊ B × α 2 ⌋), π^i * (⌈ B × (1 − α 2) ⌉)) .$

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]
Almagor, H. (1983) A Markov analysis of DNA sequences. J. Theor. Biol., 104, 633–645

[2]
Reinert, G., Schbath, S. and Waterman, M. S. (2005) Statistics on words with applications to biological sequences. In: Applied Combinatorics on Words, Lothaire, M. ed., ch. 6, pp. 268–352 New York: Cambridge University Press

[3]
Blaisdell, B. E. (1985) Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding. J. Mol. Evol., 21, 278–288

[4]
Pevzner, P. A., Borodovsky, M.Y., and MironovA. A. (1989) Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words. J. Biomol. Struct. Dyn., 6, 1013–1026

[5]
Hong, J. (1990) Prediction of oligonucleotide frequencies based upon dinucleotide frequencies obtained from the nearest neighbor analysis. Nucleic Acids Res., 18, 1625–1628

[6]
Arnold, J., Cuticchia, A. J., Newsome, D. A., Jennings, IIIW. W. 3rd and Ivarie, R. (1988) Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res., 16, 7145–7158

[7]
Avery, P. J. (1987) The analysis of intron data and their use in the detection of short signals. J. Mol. Evol., 26, 335–340

[8]
Narlikar, L., Mehta, N., Galande, S. and Arjunwadkar, M. (2013) One size does not fit all: on how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 41, 1416–1424

[9]
Ren, J., Song, K., Deng, M., Reinert, G., Cannon, C. H. and Sun, F. (2016) Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, 32, 993–1000

[10]
Billingsley, P. (1961)Statistical Inference for Markov Processes, vol. 2. Chicago: University of Chicago Press Chicago

[11]
Billingsley, P. (1961) Statistical methods in Markov chains. Ann. Math. Stat., 32, 12–40.

[12]
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753

[13]
Zerbino, D. R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829

[14]
Zhai, Z., Reinert, G., Song, K., Waterman, M. S., Luan, Y. and Sun, F. (2012) Normal and compound Poisson approximations for pattern occurrences in NGS reads. J. Comput. Biol., 19, 839–854

[15]
Song, K., Ren, J., Zhai, Z., Liu, X., Deng, M. and Sun, F. (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J. Comput. Biol., 20, 64–79

[16]
Sun, F., Arnheim, N. and Waterman, M. S. (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res., 23, 3034–3040

[17]
Daley, T.and Smith, A. D. (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 22, 3159–3165

[18]
Lander, E. S. and Waterman, M. S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239

[19]
Zhang, Z. D., Rozowsky, J., Snyder, M., Chang, J. and Gerstein, M. (2008) Modeling ChIP sequencing in silico with applications. PLOS Comput. Biol., 4, e1000158

[20]
Daley, T. and Smith, A. D. (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327

[21]
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235

[22]
Schwartz, S., Oren, R. and Ast, G. (2011) Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS One, 6, e16685

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature

Share on WeChat

PDF (1009KB)

Part of a collection:

Supplementary files

QB-20200-OF-WL_suppl_1

2312

Accesses

0

Citation

Detail

Sections

Recommended

About the journal

Browse

Authors & reviewers