WaveNano: a signal-level nanopore base-caller via simultaneous prediction of nucleotide labels and move labels through bi-directional WaveNets

Sheng Wang , Zhen Li , Yizhou Yu , Xin Gao

Quant. Biol. ›› 2018, Vol. 6 ›› Issue (4) : 359 -368.

METHODOLOGY ARTICLE

Abstract

Background: The Oxford MinION nanopore sequencer is an appealing third-generation genome sequencing device that is portable and no larger than a cellphone. Despite MinION's ability to sequence ultra-long reads in real time, the high error rate of existing base-calling methods, especially in indels (insertions and deletions), prevents its use in a variety of applications.

Methods: In this paper, we show that such indel errors are largely due to the segmentation process applied to the input electrical current signal from MinION. All existing methods conduct segmentation and nucleotide label prediction sequentially, so errors accumulated in the first step irreversibly influence the final base-calling. We further show that the indel issue can be significantly reduced by accurately assigning nucleotide and move labels directly on the raw signal; these labels can then be learned simultaneously and efficiently by a bi-directional WaveNet model through feature sharing. Our bi-directional WaveNet model, with residual blocks and skip connections, is able to capture extremely long dependencies in the raw signal. Taking the predicted moves as segmentation guidance, we employ Viterbi decoding to obtain the final base-calling results from the smoothed nucleotide probability matrix.

Results: Our proposed base-caller, WaveNano, achieves good performance on real MinION sequencing data from Lambda phage.

Conclusions: The signal-level nanopore base-caller WaveNano can obtain higher base-calling accuracy, and generate fewer insertions/deletions in the base-called sequences.

Keywords

nanopore sequencing / bi-directional WaveNets / base-calling / third generation sequencing / deep learning

Cite this article

Sheng Wang, Zhen Li, Yizhou Yu, Xin Gao. WaveNano: a signal-level nanopore base-caller via simultaneous prediction of nucleotide labels and move labels through bi-directional WaveNets. Quant. Biol., 2018, 6(4): 359-368 DOI:10.1007/s40484-018-0155-4

INTRODUCTION

Over the last decade, high-throughput second-generation sequencing technology has revolutionized genomic research with the ability to sequence the whole genome of a variety of organisms on earth [1]. In 2014, Oxford Nanopore Technologies (ONT) released a third-generation sequencing platform, MinION, which is a portable, single-molecule genomic sequencing device no larger than an iPhone [2] (see Figure 1A). The device has two key features: long reads and point-of-care use [3].

MinION directly senses native, individual DNA single-strand without the need for polymerase chain reaction (PCR) amplification, which enables the device to sequence extremely long reads (typically from 12k to 120k bp, or even longer) of DNA without a reduction in the sequence quality [4]. This allows users to generate reads spanning most repetitive sequences, which most second-generation sequencing technologies based on short reads (typically from 75 to 150 bp) cannot unambiguously resolve [1].

MinION can be used for sequencing anywhere and in real time, as it is portable and does not require any special setup or calibration procedures [5]. It was reported that MinION has been used for diagnostic investigation during the Ebola outbreak in Guinea, west Africa [6]. Another report indicated that MinION has been tested on the International Space Station (ISS) for real-time sequencing of a Lambda phage and the results showed no difference in performance on the ISS and on earth [7].

The key innovation of the MinION sequencer is the direct measurement of changes in the electrical current signal (denoted as the raw signal) when a single-strand DNA passes through the nanopore [8]. In MinION, a few hundred nanopores are embedded in a voltage-biased membrane. Single-strand DNA sequences pass through these pores. At each time point, there are five consecutive nucleotides in a pore (denoted as a 5-mer). The electrical current signal is measured at each time point for each pore. The underlying assumption is that different 5-mers in the pore produce different electrical currents. The goal is to decode the time-course electrical current signals into the sequence of nucleotides. This procedure is referred to as base-calling (Figure 1B). However, the frequency of the electrical current measurements and the speed at which the DNA sequence passes through the pore are not coordinated, which causes the main technical difficulty for base-calling. In general, the frequency of the electrical current measurements is 7‒9 times higher than the passing speed of the DNA sequence. That is, each 5-mer is measured 8‒10 times on average, yet the variance of the number of measurements per 5-mer is very high because the passing speed is inconsistent. Here, a pore model is defined as the correspondence between the expected current signal and the 5-mer inside the pore at the same time point. Thus, given a pore model, we can annotate the raw signal with 5-mer labels and move labels. Specifically, a 5-mer label indicates which 5-mer a given time point of the raw signal belongs to, whereas a move label indicates whether the 5-mer stays or moves at the next time point of the signal (Figure 1B).

The most problematic issue in nanopore sequencing is the high sequencing error rate, especially in indels (insertions and deletions) [9]. To solve this issue, a variety of approaches have been proposed and they can be roughly categorized into two groups: (i) machine learning based and (ii) consensus based. The former group relies on a machine learning model that learns the mapping between the input nanopore current signals (with length L1) and the corresponding DNA sequence (with length L3), i.e., base-calling. The latter group not only relies on the base-called reads (here, a read is a decoded DNA sequence), but also requires additional resources, such as read mapping on a given reference genome [4], error correction using short reads from second-generation sequencing [10], or the overlapping base-called reads from the same nanopore experiments (such as Nanopolish [2,11] and PoreSeq [12]). In either case, a more accurate base-calling method is needed to facilitate the power and advantages of the nanopore sequencing. Here we focus on solving two key issues of the existing machine learning-based base-calling approaches: (ⅰ) serial base-calling, and (ⅱ) model architecture.

Serial base-calling. In almost all existing methods [13,14], the base-calling process is divided into two serial steps: segmentation and decoding (see Figure 2A). The segmentation step takes the discrete time-course electrical current signals as input and outputs a segmented stepwise curve. This step can be considered as a clustering step, which tries to group current signals for the same 5-mers together. The decoding step then takes the segmented curve as input and decodes it to the nucleotide sequence. However, since the segmentation step is based on local signal information only and does not take global context into account, insertion (i.e., one 5-mer is divided into multiple segments) and deletion (multiple 5-mers are merged to the same segment) issues commonly exist due to the noisy nature of the current signal measurement [13]. Such indel issues not only affect the event-level features, but also irreversibly harm the final base-calling performance.

Model architecture. Another key issue is that the machine learning model architectures in existing methods are not suitable for the ultra-long sequence of the nanopore current signals [15]. In brief, the first generation of base-calling methods is based on hidden Markov models (HMMs) [13]. The limitation of HMMs lies in the short-range dependency. To overcome this issue, the second generation of base-callers is established on recurrent neural network (RNN) architectures, such as long short-term memory (LSTM) [16] or gated recurrent unit (GRU) [17]. Thanks to a large hidden state space, the RNN base-callers can potentially capture long-distance dependencies in the input signal [14]. However, since the length of the nanopore signal is extremely long (about 100k bp on average), traditional RNN-based architectures still cannot handle these long sequences properly [15].

To further improve supervised learning methods for base-calling, we can borrow ideas from a very recent technological breakthrough in speech recognition via deep learning [15]. Deep learning [18] is a powerful machine learning technique that has revolutionized image classification [19], natural language processing [20], and bioinformatics [21]. Recently, a deep learning architecture, WaveNet [15], demonstrated superior performance in speech generation. If we consider the nanopore signal as a speech signal, then base-calling is similar to speech recognition. Thus, the recently developed WaveNet architecture might also work for base-calling.

In this paper, we propose a novel method, WaveNano, which skips the segmentation step and directly predicts the 5-mer labels and move labels simultaneously from the electrical current signals. The main contributions of this paper are as follows:

• We develop a method to accurately label the ground-truth 5-mer label and move label for each time point of the raw electrical current signal using dynamic time warping [8], which provides supervised training data for our base-calling method.

• We propose a novel model to simultaneously predict the 5-mer label and the move label for each time point via bi-directional WaveNets [15] with stacked residual blocks of convolutional neural network (shown in Figure 3).

• We show that using predicted move labels as the segmentation guidance can help solve the indel issue substantially. Experimental results on the Lambda phage demonstrate that our proposed method outperforms the existing methods and achieves good accuracy.

RESULTS

Lambda phage data preparation

We sequenced the genome of a Lambda phage, which is provided by ONT for calibration of the nanopore sequencers, based on the one-dimensional (1D) protocol. The reference DNA sequence of the Lambda phage is made available by ONT. Before sequencing, the Lambda phage DNA library preparation was performed using the genomic DNA sequencing kit (Oxford Nanopore). Following the manufacturer's instructions, we used the SQK-MAP-003 sequencing kit for R9.4 MinION flow cells. A new MinION flow cell was used for each sequencing run. The library was loaded onto the MinION flow cell and the genomic DNA 48-hour sequencing protocol was initiated using the MinKNOW software. We obtained around 24,000 reads, with an average length of 63,000 bp for electrical current signals.

Evaluation metrics

To evaluate the performance of our base-caller WaveNano, we adopted a measurement similar to the one used in BLAST [22] to compare the similarity between the reference DNA sequence $S_{ref}$ with length $L_{ref}$ and the base-called sequence $S_{call}$ with length $L_{call}$. The scoring function assigns a reward of 1 for identical nucleotides and penalties of 2 for mismatched nucleotides, 2 for gap opening, and 1 for gap extension. Based on this scoring function, the optimal alignment path can be obtained by dynamic programming [23]. Along the path, we count the number of exact matches (denoted as $M$), the number of gaps (denoted as $G$), and the length of the alignment (denoted as $L_{ali}$).

We used the four evaluation metrics below to assess the quality of the base-called sequence: (i) sequence identity with respect to $S_{ref}$ (defined as SeqID1 = $M / L_{ref}$), (ii) sequence identity with respect to $S_{call}$ (defined as SeqID2 = $M / L_{call}$), (iii) sequence identity (defined as SeqID = $M / L_{ali}$), and (iv) the gap ratio (defined as Grate = $G / L_{ali}$). Generally speaking, a higher sequence identity and a lower gap ratio indicate a higher similarity between the base-called sequence and the reference sequence.
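As a minimal sketch of how these four metrics follow from an alignment, the snippet below takes two already-aligned strings that use `-` as the gap character. The function name `alignment_metrics`, the gap character, and the choice to count $G$ as the number of gapped alignment columns are assumptions made for illustration; they are not taken from the WaveNano code.

```python
def alignment_metrics(aligned_ref, aligned_call, gap="-"):
    """Compute SeqID1, SeqID2, SeqID and Grate from a pairwise alignment."""
    if len(aligned_ref) != len(aligned_call):
        raise ValueError("aligned sequences must have equal length")
    l_ali = len(aligned_ref)                                # alignment length
    l_ref = sum(1 for c in aligned_ref if c != gap)         # reference length
    l_call = sum(1 for c in aligned_call if c != gap)       # base-called length
    pairs = list(zip(aligned_ref, aligned_call))
    matches = sum(1 for r, c in pairs if r == c and r != gap)
    gaps = sum(1 for r, c in pairs if r == gap or c == gap)  # assumed definition of G
    return {
        "SeqID1": matches / l_ref,
        "SeqID2": matches / l_call,
        "SeqID": matches / l_ali,
        "Grate": gaps / l_ali,
    }


# Toy alignment with one insertion and one deletion in the base-called sequence.
metrics = alignment_metrics("ACGT-ACGTA", "ACGTTAC-TA")
```

Because the alignment length is at least as large as either sequence length, SeqID can never exceed SeqID1 or SeqID2 under these definitions.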

In addition, as shown in Equation (4), besides optimizing the loss function of the move label, we also try to optimize the AUC score of the move label directly during training. Although we acknowledge that the area under the precision-recall curve (AUPRC) is a better score for evaluating classification performance on imbalanced problems, it is currently very challenging to develop a simple approach that directly optimizes AUPRC for linear-structured data, such as the sequence labeling problem here [24]. In this work, we compare the AUC and AUPRC for move label prediction.

Compared methods

We compared WaveNano with the state-of-the-art official base-calling tool Metrichor (https://metrichor.com/). Metrichor was initially based on a hidden Markov model, with an average base-calling accuracy in terms of SeqID of around 70% [13]. With the latest release of ONT R9, it was reported that Metrichor's base-calling algorithm has evolved into a bi-directional LSTM model [25], boosting the base-calling accuracy to close to 90% [9]. However, it should be noted that Metrichor is a cloud-based platform and its source code is not open to the public. Recently, ONT released Albacore, a binary-only tool for offline base-calling. It was reported that these official tools have similar base-calling accuracy [14].

Comparison results

We trained WaveNano and conducted the evaluation on Lambda phage. The performance was obtained through five-fold cross validation on about 24,000 reads of Lambda phage. Experimental results show that WaveNano achieves better performance than Metrichor and Albacore, under all four measurements, SeqID1, SeqID2, SeqID, Grate (Table 1). Specifically, WaveNano achieves 0.632 and 0.947 accuracy for the 5-mer label and move label prediction, respectively. Through Viterbi decoding with the segmentation guidance, the final base-calling obtains 0.956, 0.932, and 0.923 sequence identity, and 0.056 gap ratio, which implies base-calling from WaveNano not only predicts more accurate DNA sequences, but also solves the indel issue significantly better than the official base-callers, Metrichor and Albacore.

Table 1 also shows that the accuracy of Metrichor is similar to that of Albacore, especially in terms of SeqID and Grate. In addition, Albacore has a higher SeqID1 and a lower SeqID2 than Metrichor, indicating that the DNA sequence predicted by Albacore has more matches to the ground-truth sequence than that predicted by Metrichor, but the sequence predicted by Metrichor is shorter than that predicted by Albacore.

We further studied the importance of the move label prediction as the segmentation guidance and of the bi-directional WaveNets by removing each of them from WaveNano and evaluating the performance. Experimental results show that both the move guidance and the bi-directional WaveNets are important components in the success of WaveNano: removing each of them causes the 5-mer label accuracy to drop significantly, to 0.564 and 0.597, respectively. Furthermore, for the final base-calling results, the gap ratio increases to 12.8% and 7.1%, respectively. Of the two components, the move label as the segmentation guidance is more important than the bi-directional WaveNets.

The AUC (AUPRC) of move label prediction for WaveNano and WaveNano without bi-WaveNet is 0.872 (0.595) and 0.863 (0.581), respectively. If we remove the AUC loss term in Eq. 4, these prediction results would become 0.867 (0.586) and 0.856 (0.569), respectively.

Runtime

It is difficult to directly compare the running time of Metrichor with that of WaveNano because Metrichor is a cloud-based tool and we do not know its exact parameter settings. However, we can compare the running time of WaveNano with that of the official offline base-caller, Albacore. WaveNano took 0.0000416 seconds to base-call one time point of the signal, or 0.5 seconds for a signal sequence with 12,000 time points, whereas Albacore took 2 seconds to base-call the same signal sequence.

DISCUSSION AND CONCLUSIONS

In this paper we proposed a novel base-caller, WaveNano, for third-generation nanopore sequencing. We showed that our method is better than state-of-the-art methods at processing the nanopore current signal data, obtaining higher base-calling accuracy and generating fewer insertions/deletions on the Lambda phage dataset. Consequently, DNA sequences base-called by WaveNano are more accurate and contain fewer gaps than those produced by Metrichor and Albacore, the current cutting-edge official base-callers.

The superiority of our method is rooted largely in the machine learning model, bi-directional WaveNets (Bi-WaveNets), which is a deep learning model based on the dilated residual CNN. Such an architecture can be regarded as an alternative to RNNs with gates (e.g., LSTM, GRU). To our knowledge, the official base-calling algorithms from ONT (i.e., Metrichor and Albacore) are based on bi-directional LSTM (Bi-LSTM). It is worth mentioning that WaveNets are more suitable for capturing ultra-long dependencies for the following two reasons: (i) WaveNets are auto-regressive and consist of stacked causal filters with dilated convolutions that allow their receptive fields to grow exponentially with depth, which is essential for capturing the ultra-long-range temporal dependencies in the input data; (ii) the dilated convolution layers make WaveNets much faster than RNNs with gate units. Consequently, compared to RNNs, WaveNets can exploit ultra-long-range temporal dependencies in the signal sequence in an efficient and effective way. Moreover, our proposed Bi-WaveNets model can capture both upstream and downstream information, whereas traditional WaveNets can only capture upstream information.

Finally, it should be noted that several recently released base-callers can work directly on the raw electrical current signals. Base-calling from the raw signal (without segmenting the signal into events) first appeared in Albacore v2.0 in September 2017. Albacore is the official offline base-caller, whose source code is not open to the public. Users can download Albacore from the Nanopore community website, but they need an account to log in. Later on, Scrappie, which is also from the ONT research group and whose source code is open to the public, added support for base-calling from the raw signal. Chiron [26] is the first third-party base-caller to perform raw-signal base-calling, and it is also based on deep learning. From this trend, we foresee that nanopore signal-level processing aided by deep learning will be the future.

METHODS

Here we propose WaveNano, a novel offline base-caller for third-generation nanopore sequencing. WaveNano simultaneously infers the 5-mer label and the move label of each time point of the input electrical current signal (see Figure 2B), by using bi-directional WaveNets with residual blocks and gated activation units. Exploiting the predicted move labels as the segmentation guidance, we employ Viterbi decoding with the predicted 5-mer label probability matrix to obtain the final DNA sequence.

Problem formulation

The base-calling process can be formulated as follows. Given an input electrical current signal sequence $X = x_1, x_2, \ldots, x_{L_1}$ with $L_1$ time points, we need to decode the final DNA sequence $B = b_1, b_2, \ldots, b_{L_3}$ with $L_3$ nucleotides, where $x_i$ is the electrical current measurement of a 5-mer (e.g., “ACGTT”) at time point $i$, and $b_j$ is a nucleotide that takes one of the four values in $\{A, T, C, G\}$. Note that the frequency of the electrical current measurements is about 7‒9 times higher than the speed at which the single-strand DNA passes through the nanopore. For consecutive measurements $x_i$ and $x_{i+1}$, the 5-mer either stays in the pore or moves by one nucleotide. We denote this annotation as the move label sequence $X_m$ with length $L_1$. Such stay/move labels can later serve as the segmentation guidance to convert the electrical current signals to a 5-mer event sequence.

Previous methods, such as [13,14], divide the base-calling process into two serial steps: segmentation and decoding (see Figure 2A). In particular, the current signal sequence is first fragmented into an event sequence $X' = x'_1, x'_2, \ldots, x'_{L_2}$ with length $L_2$ ($L_3 \le L_2 \le L_1$) through segmentation on $X$. Due to the noisy nature of the segmentation process, the indel issues in these methods irreversibly harm the final base-calling performance. Thus, we propose a novel approach, WaveNano, to simultaneously predict the 5-mer label $y_1$ (i.e., which 5-mer among all possible $4^5 = 1024$ 5-mers is in the pore) and the move label $y_2$ (i.e., whether the 5-mer moves by a nucleotide at the next time point) for each time point of the input current signal sequence $X$.

Signal-level ground-truth labeling

In order to train our deep learning model, we need to prepare the supervised training data first. For our method, training data refer to the ground-truth 5-mer label and move label annotation for each time point of the training electrical current signals, which are not directly available given the raw signals. We thus need to do signal-level labeling to assign the 5-mer label y1 and move label y2 to each time point of the signal, where the size of the label space for y1 and y2 is 1024 and 2, respectively.

Our original training data set contains the raw time-course electrical current signals X of length L1 and their corresponding DNA sequence B of length L3. However, it does not contain the corresponding 5-mer label and move label sequences, each of which should be of length L1 as well. We first try to find an alignment between X and B.

Although it seems intractable to perform this alignment directly, we can use the official ONT pore model, which describes the electrical current signals that are expected to be observed for each 5-mer [13]. Given the DNA sequence $B$ with length $L_3$, we use a 5-mer sliding window to generate all $L_3$ 5-mers (the last four 5-mers contain fewer than 5 nucleotides, but are still in the pore). Each 5-mer is then converted to its expected electrical current signal value according to the official ONT pore model parameters. After transforming $B$ into the expected signal sequence $S$ of length $L_3$, an optimal alignment path between the two signal sequences, $X$ and $S$, can be obtained using dynamic time warping (DTW) [8].
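The conversion from a DNA sequence to an expected signal can be sketched as follows. This is an illustration under stated assumptions: the helper names `kmers` and `expected_signal` are hypothetical, the two current levels in `toy_pore_model` are placeholders rather than values from the ONT pore model, and for simplicity only full-length 5-mers are generated (the truncated 5-mers at the end of the sequence are skipped).

```python
def kmers(sequence, k=5):
    """Slide a window of width k over the sequence, keeping full-length k-mers only."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def expected_signal(sequence, pore_model, k=5):
    """Map each k-mer to its expected current level using a k-mer -> level table."""
    return [pore_model[kmer] for kmer in kmers(sequence, k)]


# Placeholder levels for illustration only; these are NOT values from the ONT pore model.
toy_pore_model = {"ACGTT": 90.0, "CGTTA": 75.5}

five_mers = kmers("ACGTTA")
levels = expected_signal("ACGTTA", toy_pore_model)
```

In practice the table would hold one entry for each of the 1024 possible 5-mers.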

To determine the optimal path via DTW, we recursively compute an $L_1 \times L_3$ matrix $D$, where the entry $D(n,m)$ is the total cost of an optimal path between $X(x_1, \ldots, x_n)$ and $S(s_1, \ldots, s_m)$. Here,

$$D(n,m) = \min\{D(n-1,m-1),\ D(n,m-1),\ D(n-1,m)\} + c(n,m),$$

where $c$ is an $L_1 \times L_3$ matrix containing the distances between element $x_n$ of the sequence $X$ and element $s_m$ of the sequence $S$. We use the Z-score difference to calculate $c(n,m)$. Note that a recent approach speeds up the DTW calculation to $O(N)$ time complexity [27].
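The DTW recurrence can be written as a short dynamic program. The sketch below is a plain quadratic-time version, not the accelerated method of [27]; the function name `dtw_path` is hypothetical, and the local distance is taken as the absolute difference between two values that are assumed to be Z-score normalized already.

```python
def dtw_path(x, s):
    """Align sequences x and s with dynamic time warping.

    Returns (total_cost, path), where path is a list of (n, m) index pairs
    (0-based) from the start to the end of both sequences.
    """
    inf = float("inf")
    n_len, m_len = len(x), len(s)
    # One extra row and column of infinities removes the boundary special cases.
    D = [[inf] * (m_len + 1) for _ in range(n_len + 1)]
    D[0][0] = 0.0
    for n in range(1, n_len + 1):
        for m in range(1, m_len + 1):
            cost = abs(x[n - 1] - s[m - 1])  # local distance c(n, m)
            D[n][m] = cost + min(D[n - 1][m - 1], D[n][m - 1], D[n - 1][m])
    # Backtrack from the last cell to recover the optimal path.
    path = []
    n, m = n_len, m_len
    while n > 0 and m > 0:
        path.append((n - 1, m - 1))
        candidates = [
            (D[n - 1][m - 1], n - 1, m - 1),
            (D[n - 1][m], n - 1, m),
            (D[n][m - 1], n, m - 1),
        ]
        _, n, m = min(candidates)
    path.reverse()
    return D[n_len][m_len], path


# Five noisy measurements aligned to three expected levels.
cost, path = dtw_path([0.1, 0.0, 1.1, 0.9, 2.0], [0.0, 1.0, 2.0])
```

In this toy case the first two measurements align to the first expected level, the next two to the second, and the last one to the third.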

After aligning the reference DNA sequence $B$ to the electrical current signal sequence $X$, it is straightforward to assign the 5-mer label $y_1(t)$ for each time point $t$. To assign the move label $y_2(t)$ at time point $t$, we check whether the 5-mers at time points $t$ and $t+1$ are the same. If they are, we assign stay at time point $t$; otherwise, we assign move.
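A minimal sketch of the move labeling rule is given below. It assumes the alignment has already been reduced to one reference 5-mer position per time point; comparing positions rather than 5-mer strings, encoding stay as 0 and move as 1, and labeling the final time point as stay are all illustrative conventions, and the function name `move_labels` is hypothetical.

```python
def move_labels(kmer_index_per_time_point):
    """Return 1 (move) if the 5-mer changes at the next time point, else 0 (stay).

    The last time point has no successor and is labeled stay by convention here.
    """
    idx = kmer_index_per_time_point
    labels = [1 if idx[t] != idx[t + 1] else 0 for t in range(len(idx) - 1)]
    labels.append(0)
    return labels


# Time points 0-1 align to 5-mer 0, time points 2-3 to 5-mer 1, time point 4 to 5-mer 2.
aligned_index = [0, 0, 1, 1, 2]
y2 = move_labels(aligned_index)
```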

Overall pipeline of WaveNano

The overall base-calling pipeline of WaveNano is presented in Figure 3. It takes the current signal with length $L_1$ as input. Because the electrical current values vary widely across time points, the current signal is first Z-score normalized [8]. In contrast to existing machine learning models [13,14], which depend on the segmentation step performed by MinKNOW [13], WaveNano is trained on the Z-score-normalized original current signal with the ground-truth 5-mer labels and move labels obtained as described in the section “Signal-level ground-truth labeling”. Thus, WaveNano can essentially overcome the indel (insertion/deletion) issues that are common in existing methods and are caused by segmentation errors. Moreover, because WaveNano conducts base-calling on the entire signal sequence, it leverages bi-directional WaveNets consisting of residual blocks with skip connections. The feature maps of the bi-directional WaveNets are then concatenated to capture the ultra-long dependencies within the signal sequence. On top of the concatenated feature maps, we predict move labels (binary classification) and 5-mer labels (1024-class classification) simultaneously through a one-dimensional convolutional neural network (1D CNN) with a 1×1 kernel. Using the predicted move labels as segmentation guidance, the predicted 5-mer probability matrix (of size $L_1 \times 1024$) is segmented accordingly and then fed into the Viterbi decoding block to obtain the final base-called sequence with length $L_3$. It is worth mentioning that WaveNano does not rely on any other segmentation or decoding tools, and is thus a self-contained offline framework.
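The Z-score normalization used as the first step of the pipeline can be sketched as follows. The function name `zscore` and the use of the population standard deviation are assumptions; the source does not state which variant is used.

```python
import math


def zscore(signal):
    """Normalize a signal to zero mean and unit standard deviation."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((v - mean) ** 2 for v in signal) / n  # population variance
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("a constant signal cannot be z-score normalized")
    return [(v - mean) / std for v in signal]


# Toy current values; real reads contain many thousands of time points.
normalized = zscore([80.0, 90.0, 100.0, 110.0, 120.0])
```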

Bi-directional WaveNets for joint learning of 5-mer labels and move labels

Our proposed WaveNano method operates directly on the waveform of the Z-score-normalized input electrical current signal. Given a waveform $x = x_1, \ldots, x_T$, and in contrast to the generative WaveNet model [15] (shown in Figure 4), the joint probabilities of the 5-mer label, $p_1(x)$, and the move label, $p_2(x)$, are factorized as follows:

$$p_1(x) = \prod_{t=1}^{T} p_1(x_t \mid x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T)$$

$$p_2(x) = \prod_{t=2}^{T} p_2(x_t \mid x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T)$$

That is, the 5-mer label and the move label at each time point are conditioned on all other time points.

Since stacked dilated causal convolutions can have a very large receptive field [15], we exploit two parallel WaveNets with stacked dilated 1D CNNs, taking the forward current signal and the reversed current signal as inputs, respectively. Following the configuration in which the dilation is doubled at each layer up to a limit [15], we use two repeated blocks in WaveNano, each with dilations 1, 2, 4, ..., 4096. Furthermore, gated activation units $z = \tanh(W_{f,k} * X) \odot \sigma(W_{g,k} * X)$ [28], where $*$ denotes convolution, $\odot$ denotes element-wise multiplication, and $\sigma$ is the sigmoid function, as well as residual and parameterized skip connections [19], are used in WaveNano to speed up convergence and enable a deeper model [15]. Specifically, residual connections are shown in the dotted box and the outputs of all residual blocks are summed through skip connections (Figure 3). In order to capture the long-range temporal dependency on all other time points, the feature maps of the bi-directional WaveNets are combined through concatenation.
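To illustrate the gated activation unit and the two-direction arrangement, here is a single-channel sketch in plain Python. It is not the WaveNano implementation: the real layers are learned multi-channel convolutions, and the function names, weights, and toy signal below are all chosen for illustration.

```python
import math


def dilated_causal_conv(x, weights, dilation):
    """1D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...

    weights[0] multiplies the current sample and weights[j] the sample
    j * dilation steps back. Positions before the start are treated as zero.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_activation(x, w_filter, w_gate, dilation):
    """z = tanh(W_f * x) multiplied element-wise by sigmoid(W_g * x)."""
    f = dilated_causal_conv(x, w_filter, dilation)
    g = dilated_causal_conv(x, w_gate, dilation)
    return [math.tanh(a) / (1.0 + math.exp(-b)) for a, b in zip(f, g)]


signal = [0.0, 1.0, 0.0, 0.0, 0.0]  # a single impulse at time point 1
w_f, w_g = [0.5, 0.5, 0.5], [0.0, 0.0, 0.0]

# The forward branch sees the signal as is; the backward branch sees it reversed,
# and its output is flipped back so both branches are indexed by the same time points.
forward = gated_activation(signal, w_f, w_g, dilation=2)
backward = gated_activation(signal[::-1], w_f, w_g, dilation=2)[::-1]
features = list(zip(forward, backward))  # per-time-point concatenation
```

With a dilation of 2, the impulse at time point 1 reaches time point 3 in the forward branch but never in the backward branch, which shows why the two directions are concatenated to cover both upstream and downstream context.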

On top of the concatenated feature maps, WaveNano predicts the 5-mer label and move label simultaneously through a 1D CNN with a 1×1 kernel and softmax activation, trained by minimizing the combined loss. It is worth mentioning that move prediction is a class-imbalanced problem (there are about 8 times more stay labels than move labels). Thus, we add the approximated AUC loss [29] to address the class imbalance in move prediction.

Specifically, we denote by $f$ the move prediction layer with softmax activation illustrated in Figure 3. According to Ref. [30], the AUC of a predictor $f$ is defined as $AUC(f) = P(f(t_0) < f(t_1) \mid t_0 \in D_0, t_1 \in D_1)$, where $D_0$ and $D_1$ are the samples with ground-truth labels stay and move, respectively. Its unbiased estimator, i.e., the Wilcoxon-Mann-Whitney statistic, is

$$\widehat{AUC}(f) = \frac{1}{n_0 n_1} \sum_{t_0 \in D_0} \sum_{t_1 \in D_1} I\big(f(t_0) < f(t_1)\big),$$

where $n_0 = |D_0|$, $n_1 = |D_1|$, and $I(\cdot)$ is the indicator function. In order to add the non-continuous AUC loss to the continuous cross-entropy loss and optimize the combined loss through gradient descent, we approximate the AUC loss by a polynomial approximation of the indicator function $I(\cdot)$ with degree $d$ [30], i.e.,

$$AUC_{move} = \frac{1}{n_0 n_1} \sum_{t_0 \in D_0,\ t_1 \in D_1} \sum_{k=0}^{d} \sum_{l=0}^{k} \alpha_{kl}\, f(t_1)^{l} f(t_0)^{k-l},$$

where $\alpha_{kl} = c_k C_k^l (-1)^{k-l}$ is a constant.
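For reference, the exact (non-differentiable) Wilcoxon-Mann-Whitney statistic can be computed by counting correctly ordered pairs, as in the sketch below. The function name `wmw_auc` and the toy scores are hypothetical, and the polynomial surrogate is not reproduced because its coefficients $c_k$ are not given in this section.

```python
def wmw_auc(stay_scores, move_scores):
    """Fraction of (stay, move) pairs in which the move sample gets the higher score."""
    n0, n1 = len(stay_scores), len(move_scores)
    correct = sum(1 for s0 in stay_scores for s1 in move_scores if s0 < s1)
    return correct / (n0 * n1)


# Predicted move probabilities for three stay samples and two move samples.
auc = wmw_auc(stay_scores=[0.1, 0.4, 0.35], move_scores=[0.8, 0.3])
```

Here four of the six pairs are ordered correctly, so the estimate is 4/6. The indicator makes this count piecewise constant in the scores, which is why a smooth surrogate is needed for gradient-based training.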

Thus, the combined loss is:

$$loss = loss_{5mer} + \lambda_1\, loss_{move} + \lambda_2\, AUC_{move},$$

where $loss_{5mer} = -\frac{1}{T} \sum_i m_i^* \log(m_i)$, $loss_{move} = -\frac{1}{T} \sum_i s_i^* \log(s_i)$, and $T$ is the length of the input signal. Here, $m_i$ and $s_i$ are the predicted probabilities of the 5-mer label and the move label, $m_i^*$ and $s_i^*$ are the ground-truth 5-mer label and move label, respectively, and $\lambda_1, \lambda_2$ are the trade-off parameters.
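A small numeric sketch of the combined loss is given below. It assumes one-hot ground-truth labels, so each cross-entropy term reduces to the negative log-probability of the true class. The function names are hypothetical, the values passed for the 5-mer loss and the AUC term are placeholders, and the default weights are the $\lambda_1 = 0.1$ and $\lambda_2 = 0.15$ reported in the architecture section.

```python
import math


def cross_entropy(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the ground-truth class."""
    total = sum(-math.log(probs[label])
                for label, probs in zip(true_labels, predicted_probs))
    return total / len(true_labels)


def combined_loss(loss_5mer, loss_move, auc_term, lambda1=0.1, lambda2=0.15):
    """loss = loss_5mer + lambda1 * loss_move + lambda2 * auc_term."""
    return loss_5mer + lambda1 * loss_move + lambda2 * auc_term


# Four time points; class 0 = stay, class 1 = move.
move_true = [0, 0, 1, 0]
move_pred = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
loss_move = cross_entropy(move_true, move_pred)

# Placeholder values for the 5-mer loss and the AUC term, for illustration only.
total = combined_loss(loss_5mer=2.0, loss_move=loss_move, auc_term=0.25)
```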

Viterbi decoding with 5-mer labels and move labels as segmentation guidance

Given the predicted probabilities of the 5-mer labels $p_1$ and move labels $p_2$ for the electrical current signal sequence $X$, WaveNano first segments the 5-mer label sequence under the guidance of the predicted move labels. Specifically, for a time range $t_1$ to $t_2$ of the original signal sequence, if all predicted stay probabilities are above a given threshold $\theta$, then the predicted 5-mer label probability (e.g., for a certain label $l$) for each time point $t'$ in this segment is calculated as $p_1'(t', l) = \frac{1}{t_2 - t_1} \sum_{i=t_1}^{t_2} p_1(i, l)$. This produces a segmented event sequence $X'$ with 5-mer label probabilities $p_1'$, which can be interpreted as a smoothed predicted 5-mer label probability matrix. Finally, WaveNano runs the Viterbi decoding algorithm [2] on this smoothed probability matrix to compute the most likely 5-mer sequence $S'$, which can then be transformed directly into the DNA sequence $B$.
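The move-guided smoothing step can be sketched as follows. This is one possible reading of the procedure, not the authors' code: a segment is closed after any time point whose stay probability is not above the threshold, the average is taken over the number of time points in the segment, and one averaged row is returned per segment. The function name and the two-label toy distributions (instead of 1024 labels) are illustrative.

```python
def smooth_by_segments(stay_probs, kmer_probs, theta=0.9):
    """Average 5-mer probabilities within runs of confident 'stay' time points.

    stay_probs[t] is the predicted probability that the 5-mer stays after time point t.
    kmer_probs[t] is the predicted distribution over 5-mer labels at time point t.
    Returns one averaged distribution per segment.
    """
    segments, current = [], []
    for t, probs in enumerate(kmer_probs):
        current.append(probs)
        is_last = t == len(kmer_probs) - 1
        if stay_probs[t] <= theta or is_last:
            n_labels = len(probs)
            mean = [sum(p[l] for p in current) / len(current) for l in range(n_labels)]
            segments.append(mean)
            current = []
    return segments


stay = [0.95, 0.97, 0.2, 0.99, 0.5]
probs = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9], [0.3, 0.7]]
segments = smooth_by_segments(stay, probs, theta=0.9)
```

With these toy values the five time points collapse into two events, and raising `theta` to 0.96 yields three events.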

It should be noted that the segmentation process in WaveNano is completely different from that in previous methods, such as Metrichor and Albacore, in the following two aspects: (ⅰ) our procedure employs a supervised learning model, WaveNet, which can capture ultra long-range temporal dependencies, whereas existing methods only exploit local information for segmentation; (ⅱ) our procedure is more flexible than previous methods with the help of a tunable parameter θ. A larger θ will produce a longer base-called sequence, whereas a smaller θ will produce a shorter sequence. We found that setting θ to 0.9 as the default value leads to an appropriate base-called sequence with a comparable length to the reference DNA sequence.
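The decoding step of this section can be illustrated with a simplified Viterbi search over k-mers. The sketch assumes that every event advances the strand by exactly one nucleotide, so consecutive k-mers overlap by k-1 bases and all valid transitions are equally likely; these transition rules, the function name `viterbi_kmers`, and the toy 2-mer probabilities are assumptions, since the section does not specify the transition model.

```python
import math
from itertools import product


def viterbi_kmers(event_probs, k=5, alphabet="ACGT"):
    """Most likely k-mer path where each k-mer overlaps the previous one by k-1 bases.

    event_probs is a list of dicts mapping k-mer -> probability, one dict per event.
    Returns the decoded nucleotide sequence.
    """
    states = ["".join(p) for p in product(alphabet, repeat=k)]
    floor = 1e-12  # avoids log(0) for k-mers missing from a dict
    score = {s: math.log(event_probs[0].get(s, 0.0) + floor) for s in states}
    back = []
    for probs in event_probs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Valid predecessors end with the first k-1 bases of s.
            preds = [a + s[:-1] for a in alphabet]
            best = max(preds, key=lambda p: score[p])
            new_score[s] = score[best] + math.log(probs.get(s, 0.0) + floor)
            pointers[s] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # The first k-mer contributes k bases, each later k-mer contributes its last base.
    return path[0] + "".join(s[-1] for s in path[1:])


# Toy example with 2-mers instead of 5-mers to keep the state space small.
events = [
    {"AC": 0.9, "AG": 0.1},
    {"CG": 0.6, "GT": 0.4},
    {"GT": 0.8, "GA": 0.2},
]
decoded = viterbi_kmers(events, k=2)
```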

Neural network architecture

In WaveNano, we exploit bi-directional WaveNets, each with two repeated residual blocks, which consist of causal 1D CNN layers with kernel size 3 and dilated 1D CNN layers with dilations $1, 2, \ldots, 2^{12}$. The obtained 12 feature maps, each with 128 channels, are summed through skip connections. The feature maps of the bi-directional WaveNets are then concatenated to form the contextual feature vector. The contextual feature vector is regularized with dropout (0.5) to avoid overfitting and fed to two 1D CNN layers with a 1×1 kernel. The output layers for the 5-mer label and move label predictions have 1024 and 2 units, respectively, each with a softmax activation function. We set $\lambda_1 = 0.1$ and $\lambda_2 = 0.15$ to balance the two jointly learned tasks.
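As a rough check of how far such a stack can see, the receptive field of stacked dilated convolutions can be computed directly. The sketch assumes one dilated layer of kernel size 3 per dilation value from $2^0$ to $2^{12}$ in each of the two repeated blocks and ignores any additional non-dilated layers; the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilations, n_blocks):
    """Number of input time points visible to one output of stacked dilated convolutions."""
    per_block = sum((kernel_size - 1) * d for d in dilations)
    return 1 + n_blocks * per_block


dilations = [2 ** i for i in range(13)]  # 1, 2, 4, ..., 4096
rf_one_direction = receptive_field(kernel_size=3, dilations=dilations, n_blocks=2)
```

Under these assumptions each direction covers 32,765 time points, which is larger than the padded input length of 12K used for training.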

Our code is implemented in TensorFlow, a publicly available deep learning framework. Weights in our neural networks are initialized using the default settings in TensorFlow. We train all the layers in our deep network simultaneously using the Adam optimizer. The batch size is set to 10 and the input length is fixed to 12K through padding. The entire deep network is trained on a single NVIDIA GeForce GTX TITAN X GPU with 12GB memory. It takes about one to two weeks to train our deep network. In the testing stage, the 5-mer label and move label prediction for a read with length 12,000 takes 0.5 s on average.

References

[1]

Cao, M. D., Nguyen, S. H., Ganesamoorthy, D., Elliott, A. G., Cooper, M. A. and Coin, L. J. (2017) Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nat. Commun., 8, 14515

[2]

Loman, N. J., Quick, J. and Simpson, J. T. (2015) A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat. Methods, 12, 733–735

[3]

Li, Y., Han, R., Bi, C., Li, M., Wang, S. and Gao, X. (2018) DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics, 34, 2899–2908

[4]

Jain, M., Fiddes, I. T., Miga, K. H., Olsen, H. E., Paten, B. and Akeson, M. (2015) Improved data analysis for the MinION nanopore sequencer. Nat. Methods, 12, 351–356

[5]

Lu, H., Giordano, F. and Ning, Z. (2016) Oxford Nanopore MinION sequencing and genome assembly. Genom. Proteom. Bioinf., 14, 265–279

[6]

Quick, J., Loman, N. J., Duraffour, S., Simpson, J. T., Severi, E., Cowley, L., Bore, J. A., Koundouno, R., Dudas, G., Mikhail, A., et al. (2016) Real-time, portable genome sequencing for Ebola surveillance. Nature, 530, 228–232

[7]

Castro-Wallace, S. L., Chiu, C. Y., John, K. K., Stahl, S. E., Rubins, K. H., McIntyre, A. B. R., Dworkin, J. P., Lupisella, M. L., Smith, D. J., Botkin, D. J., et al. (2017) Nanopore DNA sequencing and genome assembly on the International Space Station. Sci. Rep., 7, 18022

[8]

Loose, M., Malla, S. and Stout, M. (2016) Real-time selective sequencing using nanopore technology. Nat. Methods, 13, 751–754

[9]

Jain, M., Olsen, H. E., Paten, B. and Akeson, M. (2016) The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol., 17, 239

[10]

Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C. and McCombie, W. R. (2015) Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res., 25, 1750–1756

[11]

Sovic, I., Šikić, M., Wilm, A., Fenlon, S. N., Chen, S. and Nagarajan, N. (2016) Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap. Nat. Commun., 7, 11307

[12]

Szalay, T. and Golovchenko, J. A. (2015) De novo sequencing and variant calling with nanopores using PoreSeq. Nat. Biotechnol., 33, 1087–1091

[13]

David, M., Dursi, L. J., Yao, D., Boutros, P. C. and Simpson, J. T. (2017) Nanocall: an open source basecaller for Oxford Nanopore sequencing data. Bioinformatics, 33, 49–55

[14]

Boža, V., Brejová, B. and Vinař, T. (2017) DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PLoS One, 12, e0178751

[15]

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. (2016) Wavenet: A generative model for raw audio. ArXiv, 1609.03499

[16]

Hochreiter, S. and Schmidhuber, J. (1997) Long short-term memory. Neural Comput., 9, 1735–1780

[17]

Chung, J., Gulcehre, C., Cho, K. H. and Bengio, Y. (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv, 1412.3555

[18]

LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning. Nature, 521, 436–444

[19]

He, K., Zhang, X., Ren, S., and Sun, J. (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas

[20]

Hirschberg, J. and Manning, C. D. (2015) Advances in natural language processing. Science, 349, 261–266

[21]

Wang, S., Sun, S., Li, Z., Zhang, R. and Xu, J. (2017) Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol., 13, e1005324

[22]

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410

[23]

Pearson, W. R. and Miller, W. (1992) Dynamic programming algorithms for biological sequence comparison. In Methods in Enzymology. pp. 575–601, Elsevier

[24]

Wang, S., Ma, J. and Xu, J. (2016) AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. Bioinformatics, 32, i672–i679

[25]

McIntyre, A. B., Rizzardi, L., Yu, A. M., Alexander, N., Rosen, G. L., Botkin, D. J., Stahl, S. E., John, K. K., Castro-Wallace, S. L., McGrath, K., et al. (2016) Nanopore sequencing in microgravity. npj Microgravity, 2, 16035

[26]

Teng, H., Cao, M. D., Hall, M. B., Duarte, T., Wang, S. and Coin, L. J. M. (2018) Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. Gigascience, 7, giy037

[27]

Han, R., Li, Y., Wang, S. and Gao, X. (2017) An accurate and rapid continuous wavelet dynamic time warping algorithm for unbalanced global mapping in nanopore sequencing. bioRxiv, 238857

[28]

van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems

[29]

Wang, S., Sun, S. and Xu, J. (2016) AUC-maximized deep convolutional neural fields for protein sequence labeling. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds), vol 9852. Springer, Cham

[30]

Calders, T. and Jaroszewicz, S. (2007) Efficient AUC optimization for classification. In Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, Kok, J. N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds), vol 4702. Springer, Berlin, Heidelberg

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
