INTRODUCTION
Third-generation sequencing (TGS), also called single-molecule, real-time (SMRT) sequencing, including the PacBio RS II platform developed by Pacific Biosciences (PacBio) and the MinION platform developed by Oxford Nanopore, captures sequence information directly during the replication of DNA molecules. TGS generates reads with an average length on the order of 10 kb, much longer than next-generation sequencing (NGS) reads (100–500 bp), which significantly improves the completeness and continuity of de novo genome assemblies. Furthermore, the weak effect of classical sequencing biases, such as GC content, allows complex genome regions to be resolved and structural variants to be explored with unprecedented accuracy and resolution. Many results based on TGS have been published in different fields, such as de novo assembly, haplotype phasing, N6-methyladenine DNA modification, transcriptome analysis and structural variation detection; TGS has advanced genomic studies in an impressive way. However, TGS technology has a serious drawback: its sequencing error rate (11%–15% for PacBio raw reads, 5%–35% for Nanopore raw reads) is much higher than that of NGS (1%). The high error rate together with the long read length brings great challenges to de novo assembly of TGS sequences. Because of the dramatic difference in data characteristics between TGS and NGS sequences, computational methods designed for NGS sequences often do not work for TGS sequences. Furthermore, the sequencing errors of PacBio and Nanopore sequences are also remarkably different, which requires them to be addressed separately.
The widely used de novo assembly workflows developed for TGS sequences can currently be categorized into two strategies: “correction-and-assembly” and “assembly-and-correction”. The “correction-and-assembly” strategy corrects the sequencing errors in the TGS raw reads first, and then assembles the genome using the corrected reads. The “assembly-and-correction” strategy, on the other hand, assembles the genome using the error-prone raw reads and then corrects the genome assembly with the raw reads. Because of the computationally intensive error correction process, the “correction-and-assembly” approach is usually slower. However, directly assembling a genome from raw reads with high sequencing error rates may introduce assembly errors, especially in genomes containing repetitive regions, which degrades the quality of the reference genome and biases downstream analyses. Practice has demonstrated that the “correction-and-assembly” strategy can reconstruct highly contiguous and accurate genome assemblies.
In this review we first examine the sequencing errors of PacBio and Oxford Nanopore sequences respectively, and then introduce the currently widely used “correction-and-assembly” and “assembly-and-correction” de novo assembly workflows together with several relevant methods, such as alignment methods and error correction methods designed for TGS sequences. We summarize all of these computational methods in Table 1. To analyze the performance of the workflows, we quote two benchmarks and discuss the assembly results on each dataset. Based on their performance on different datasets, we then list some usage guidelines for applying these computational methods. Finally, we discuss the development prospects of sequencing technologies and conclude this paper.
CHARACTERISTICS OF SMS SEQUENCES
Initially, the sequences (also called continuous long reads, CLRs) generated by the PacBio RS system with the first generation of chemistry (C1 chemistry) had average lengths of only around 1500 bp [1]. The PacBio RS II system with the sixth generation of polymerase and the fourth generation of chemistry (P6-C4 chemistry) has now improved average read lengths to over 10 kb; the maximum read length can exceed 60 kb and the N50 can reach up to 20 kb [2]. Error rates of PacBio raw reads are 11%–15%, and the large majority of sequencing errors are insertions and deletions (indels).
Because errors are randomly distributed in CLRs, a sufficient number of sequencing passes yields a circular consensus sequence (CCS) with a significantly reduced error rate [3]. For example, an accuracy of >99% can be achieved with a coverage of 15 passes. However, since the total length of a CLR is limited by the lifetime of the polymerase [4], there is a tradeoff between the number of sequencing passes and the CCS read length. Recently, CCS has been optimized to improve the accuracy and length of single-molecule real-time sequencing [5]: the average sequence length has been increased to 13.5 kb and the accuracy is as high as 99.8%. This improvement has been successfully used for variant detection and de novo assembly of human genomes.
The lack of bias and the random nature of sequencing errors [6] make it possible, and relatively easy, to correct PacBio reads to high accuracy. Heuristic gapped alignment methods are able to effectively find overlaps between the template (the raw read to be corrected) and its supporting reads (reads sampled from the same genomic regions as the template) when they differ by less than 30%. Based on these overlaps, the template can be corrected using a consensus algorithm. The accuracy of corrected reads can be as high as 97%–99%.
The MinION sequencer developed by Oxford Nanopore Technologies (ONT) generates long reads based on the transit of a DNA molecule through a nanoscopic pore, an approach that has been considered one of the most promising ways to detect polymeric molecules since the 1990s [7]. Over the past years, ONT has released several MinION chemistry versions (R6.0, R7.0, R7.3, R9 and R9.4) and several updates of the base-calling tools that strongly improved the performance of the MinION device. While the earliest R6.0 chemistry generated a total yield on the order of tens of Mb and 2D reads with a median length of 4 kb and an average error rate of 40%, the more recent R7.0 and R7.3 improved the throughput to hundreds of Mb and the 2D read length and accuracy to 6 kb and 80%–85%, respectively [7]. In October 2016, new flow cells containing the R9.4 1D chemistry were released by ONT and used to generate the first nanopore high-coverage (30×) human genome with about 40 flow cells [8]. At the same time, a new protocol was developed to generate ultra-long reads (5× coverage) whose N50 was longer than 100 kb and whose maximum read length reached 882 kb. De novo assembly of this 35× raw read dataset yielded a contiguous assembly with an NG50 as long as 6.4 Mb [8].
Compared with PacBio sequences, errors in Nanopore reads are much more complex [9,10]. First, error rates of Nanopore reads are higher and broadly distributed both among reads and among subsequences of a read, ranging from 5% to 35%. Second, subsequences with extremely high error rates (50%) commonly exist in Nanopore reads [9]. This means that not only the error rates of two different reads, but also those of two different subsequences sampled from the same read, can differ greatly. These error characteristics make it much more difficult to align, correct and de novo assemble Nanopore reads. Furthermore, due to the higher error rates, the average identity of reads corrected from Nanopore data is generally lower than that of reads corrected from PacBio data. For example, the corrected reads of the 35× human raw sequences [8] mentioned above average only 92% identity to the reference, while reads corrected from PacBio raw reads are typically >99% identical to the reference genome. The decreased accuracy of corrected Nanopore sequences makes it harder to assemble genomes accurately, especially complex genomes.
CORRECTION-AND-ASSEMBLY METHODS
The “correction-and-assembly” methods obtain highly accurate consensus reads by correcting the noisy raw reads and then assemble the genome with the corrected reads. The assembly process generally consists of four stages. (i) Perform a pairwise mapping on the raw reads to find supporting reads for each template. (ii) Align the supporting reads to the template and correct the template based on these overlaps. (iii) Perform a pairwise mapping on the corrected reads and assemble the genome based on the mapping results. (iv) Polish the genome with various data sources, such as the raw reads and NGS sequences, using polishing algorithms such as Nanopolish [11], Pilon [12], smrtlink [13] and Racon [14] to further improve the accuracy of the genome. Note that these polishing methods are also used for correcting errors in genomes assembled by the “assembly-and-correction” approaches.
At the early stage, intuitive and automated hybrid workflows were developed for constructing finished, high-quality small bacterial genomes (1–10 Mb) [15–17]. In these hybrid methods, accurate short reads (from SMRT circular consensus sequencing [5] or NGS reads) are used to correct errors in the long SMRT sequencing reads. These methods have been applied successfully to a variety of microbes and eukaryotic organisms. They generally involve at least two different sequencing libraries and at least two types of sequencing runs (and sometimes different sequencing platforms). For more efficient genome assembly, especially for large and complex genomes, a homogeneous workflow that requires only one library and one sequencing platform is desirable to simplify the assembly process and to reduce sequencing costs. Another shortcoming of the hybrid approach is that the constructed genomes tend to break in regions with insufficient NGS coverage (owing to GC or sequence context biases) [18].
Currently the monohybrid methods that rely solely on TGS data have become the mainstream assembly approaches. These methods avoid the drawbacks in hybrid approaches and have been successfully applied for constructing large genomes (such as the human genome).
Both the hybrid and monohybrid assembly methods involve sequence alignment and error correction as intermediate steps. Sequence alignment of TGS reads is computationally intensive and remains the most time-consuming step in genome assembly. In this section, we first discuss the existing alignment algorithms developed for TGS reads (see the first category of Table 1). We then introduce the consensus methods that are widely used in the mainstream assembly workflows (see the second category of Table 1). Finally we describe the assembly workflows that have been extensively used for both PacBio and Nanopore sequences.
Alignment methods for TGS sequences
Strictly speaking, the existing alignment methods designed for TGS sequences are all variants of the standard Smith-Waterman algorithm [
19]. They employ various heuristics based on one or more characteristics of TGS reads to reduce the searching space and decrease memory usage of the original Smith-Waterman algorithm. Many of them follow BLAST’s “seed-and-extension” searching paradigm [
20].
BLASR
BLASR (basic local alignment with successive refinement) [21] uses a successive refinement approach to map SMRT sequences to a reference genome in the setting where the divergence between them is dominated by insertion and deletion sequencing errors. Its core idea is to narrow down the search space from all genome positions to a relatively small number of candidate genome intervals to which the read is likely to map. To determine which candidate interval is most likely to be the true mapping, a detailed alignment is performed on each candidate interval and the best one is chosen as the mapping result.
BLASR takes a read $q$ and a genome $G$ as inputs. The search process consists of three phases.
Step 1: Detect candidate intervals by clustering short exact matches. BLASR builds either a suffix array (SA) or a BWT-FM index [22] on the genome, according to the time and space requirements, and queries for exact matches of subsequences (of a minimum length) between $q$ and $G$. The BWT-FM index has been used substantially for aligning NGS short reads [23,24] and for accelerating the traditional BLAST search [25,26]. It is a very flexible data structure that allows efficient searching for fixed-length matches, maximal exact matches, and matches in which mismatched bases are allowed. The set of matches (also called anchors) is denoted $M$. The anchors in $M$ are then clustered with global chaining [27]. To do so, the anchors are sorted by genome and query positions, and clusters of anchors are found in intervals of roughly the length of the read. For each anchor $a$, a set $C_a$ is created containing the anchors whose genome positions fall within a read-length interval starting at the genome position of $a$. In each $C_a$, the global chaining algorithm [27] is used to find a maximal subset of anchors that do not overlap and are listed in increasing read and genome positions. BLASR assigns each cluster a frequency-weighted score computed from the lengths of its anchors and the genome frequencies of the corresponding subsequences, so that anchors from repetitive subsequences contribute less. After scoring, only the top-scoring clusters (10 by default) are retained.
Step 2: Map $q$ to the candidate intervals using sparse dynamic programming (SDP) [28]. For a candidate interval, let $a_{\min}$ and $a_{\max}$ be the anchors in its cluster that have the minimum and maximum genome positions, respectively. Given the maximum insertion rate $\epsilon$ of the sequencing device, the interval is extended outward from the genome positions of $a_{\min}$ and $a_{\max}$ by the lengths of the flanking, unanchored parts of the read scaled according to $\epsilon$, so that the whole read can be contained in the interval. Similar to Step 1, a set of matches is then found between $q$ and the candidate interval; the matches used here have a fixed short length and are denoted $M'$. The sparse dynamic programming finds the largest subset of anchors in $M'$ that are in increasing read and genome positions.
Step 3: Use the sparse dynamic programming alignment as a guide for a detailed banded alignment. The SDP in Step 2 aligns only the subsequences of $q$ covered by anchors in $M'$ to the candidate interval. To align every base of $q$ to the interval, BLASR runs a banded dynamic programming. Due to the large indel rate of SMRT sequences, the size of a single band containing the entire alignment can be prohibitively large; to reduce the search space, BLASR uses the SDP result as a guide, letting the band follow the layout of the chained anchors. After this step, the read bases are assigned to reference positions. The BLASR authors also provide a detailed theoretical analysis to support the clustering and to compute mapping quality values.
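To make the anchoring stages above concrete, the following sketch chains a set of exact-match anchors into a maximal collinear, non-overlapping subset, in the spirit of the global chaining applied inside each candidate cluster. The anchor representation and the length-sum scoring are illustrative assumptions, not BLASR's implementation.

```python
# Minimal sketch of collinear anchor chaining, in the spirit of BLASR's
# global chaining step.  Anchors are (read_pos, genome_pos, length) tuples;
# the scoring (sum of anchor lengths) is a simplifying assumption.

def chain_anchors(anchors):
    """Return the highest-scoring subset of non-overlapping anchors that
    increase in both read and genome coordinates (O(n^2) dynamic program)."""
    anchors = sorted(anchors)                      # sort by read, then genome position
    n = len(anchors)
    score = [a[2] for a in anchors]                # best chain score ending at anchor i
    back = [-1] * n                                # predecessor for traceback
    for i in range(n):
        ri, gi, li = anchors[i]
        for j in range(i):
            rj, gj, lj = anchors[j]
            # anchor j must end before anchor i starts, in both coordinates
            if rj + lj <= ri and gj + lj <= gi and score[j] + li > score[i]:
                score[i], back[i] = score[j] + li, j
    # trace back from the best-scoring chain end
    best = max(range(n), key=lambda i: score[i])
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = back[best]
    return chain[::-1]

if __name__ == "__main__":
    anchors = [(0, 100, 12), (15, 118, 10), (14, 300, 10), (30, 131, 15)]
    print(chain_anchors(anchors))   # -> [(0, 100, 12), (15, 118, 10), (30, 131, 15)]
```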
DALIGN
DALIGN [29] proposes a sensitive and threaded filter for detecting seed points that are likely to have a significant local alignment passing through them, together with a linear expected-time heuristic local alignment algorithm. It is developed for detecting read-to-read overlaps among noisy SMRT long reads. Specifically, given two blocks of noisy SMRT long reads, denoted $A$ and $B$, DALIGN aims to find local alignments between reads that are sufficiently long and stringent.
The first step of DALIGN is rapid seed detection with the threaded filter. Unlike BLASR, which employs a suffix array or BWT-FM index to search for variable-length anchors, DALIGN uses a much simpler series of highly optimized sorts to discover $k$-mer matches in a given diagonal band between two reads. The filter proceeds as follows:
1. Build a $k$-mer list for block $A$, in which each entry records a $k$-mer of a read in $A$ together with its read and position.
2. Similarly, build a $k$-mer list for block $B$.
3. Sort both lists in order of the hash values of their $k$-mers.
4. Build a $k$-mer match list by a merge sweep of the two sorted $k$-mer lists.
5. Sort the match list lexicographically on the read pair and the diagonal of each match.
Steps 1, 2 and 4 are easily seen to take linear time and space. The most time-consuming part is the sorting in Steps 3 and 5, which in theory takes $O(N\log N)$ time, where $N$ is the list size. DALIGN optimizes the encoding of the lists in Steps 3 and 5 by squeezing their elements into 64-bit integers, so the problem becomes realizing a threadable, memory-coherent sort of an array of $N$ 64-bit integers. DALIGN implements a radix sort [30] in which each 64-bit integer is considered as a vector of $64/b$ $b$-bit digits, where $b$ is a free parameter whose typical value is 8. The radix sort then sorts the numbers by stably sorting the array on the first $b$-bit digit, then on the second, and so on, in $64/b$ sorting passes. The sorting process asymptotically takes $O\big((64/b)(N + 2^b)\big)$ time; as $b$ is a fixed small constant, the algorithm is effectively $O(N)$ in time complexity.
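As an illustration of this sorting scheme, the sketch below sorts 64-bit integer keys with a least-significant-digit radix sort over 8-bit digits. It is only an illustrative single-threaded Python version under the stated digit width, not DALIGN's threaded, memory-coherent C implementation.

```python
# Illustrative LSD radix sort of 64-bit keys over b-bit digits (b = 8 here),
# mirroring the sorting scheme described for DALIGN's k-mer lists.

def radix_sort_u64(keys, b=8):
    """Stably sort a list of 64-bit integers, one b-bit digit per pass."""
    mask = (1 << b) - 1
    passes = 64 // b                           # 8 passes when b = 8
    for p in range(passes):
        shift = p * b
        buckets = [[] for _ in range(1 << b)]  # one bucket per digit value
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        keys = [key for bucket in buckets for key in bucket]  # stable concatenation
    return keys

if __name__ == "__main__":
    import random
    data = [random.getrandbits(64) for _ in range(10000)]
    assert radix_sort_u64(data) == sorted(data)
```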
The second step of DALIGN is rapid local alignment from a seed point $(i, j)$ reported by the filter in the first step, where $i$ and $j$ are the positions of a matching $k$-mer in the two reads. The original $O(nd)$ difference algorithm [31] centers on the idea of computing progressive “waves” of furthest reaching (f.r.) points. Starting from $(i, j)$ on diagonal $i - j$, the goal is to find the longest possible paths starting at $(i, j)$, first with 0 differences, then with 1 difference, 2 differences, and so on. The problem is that the waves become wider and wider as they are computed further away from $(i, j)$ in each direction. To narrow down the waves, DALIGN uses several strategies to trim the span of a wave by removing f.r. points that are extremely unlikely to lie on the desired local alignment. The key idea is that a desired local alignment should not, over any reasonable segment, have an exceedingly low correlation; this is also the idea behind the X-drop criterion in heuristic alignment tools such as BLAST. The first trimming principle is to remove f.r. points for which the last columns of the alignment contain too few matches. The second trimming principle is to keep only f.r. points that are within a bounded number of antidiagonals of the maximal antidiagonal reached by the wave. The computation of successive waves eventually ends because either (a) the boundary of either sequence is reached, or (b) all the f.r. points fail the regional alignment quality criterion, in which case it is assumed that the two reads no longer correlate with each other.
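The sketch below shows the basic furthest-reaching wave recurrence that this local alignment step builds on: it computes the minimum number of insertions and deletions between two sequences by extending waves of f.r. points along diagonals. DALIGN's seed-anchored local extension and trimming heuristics are omitted.

```python
# Basic furthest-reaching (f.r.) wave computation (the O(nd) scheme):
# wave d stores, for each diagonal k = x - y, the furthest x reachable with
# d differences (insertions/deletions).  Trimming heuristics are omitted.

def indel_distance_waves(a, b):
    """Minimum number of insertions and deletions turning a into b."""
    n, m = len(a), len(b)
    offset = n + m                       # shift so diagonals index a plain list
    fr = [0] * (2 * (n + m) + 1)         # furthest-reaching x per diagonal
    for d in range(n + m + 1):           # number of differences
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and fr[offset + k - 1] < fr[offset + k + 1]):
                x = fr[offset + k + 1]       # step from diagonal k+1
            else:
                x = fr[offset + k - 1] + 1   # step from diagonal k-1
            y = x - k
            while x < n and y < m and a[x] == b[y]:   # extend matches for free
                x += 1
                y += 1
            fr[offset + k] = x
            if x >= n and y >= m:
                return d                 # both sequences fully consumed
    return n + m

if __name__ == "__main__":
    # 4 indels: a substitution counts as one deletion plus one insertion here
    print(indel_distance_waves("GATTACA", "GCATGCA"))  # 4
```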
Experiments show that DALIGN is much more sensitive and can be 22 to 39 times faster than BLASR on the PacBio human dataset. Furthermore, it takes only 15,600 core hours for overlapping the 54× PacBio human dataset. Compared to the 404,000 core hours using BLASR when running on the Google “Exacycle” platform, this represents a substantial 25× reduction in computation time.
Minimap and minimap2
Minimap is an alignment-free mapping approach for TGS sequences. In addition to pairwise read overlapping, minimap also supports read-to-genome and genome-to-genome mapping. Its most distinctive feature is that it takes minimizer matches as seeds and skips the detailed alignment step, which makes it very fast compared with other mapping tools.
Minimizers [32] were proposed as a replacement for $k$-mers that reduces storage requirements without reducing search sensitivity. Minimap maps a $k$-mer to a 64-bit integer hash value in three steps. First, choose a function $\phi$ such that $\phi(\mathrm{A})=0$, $\phi(\mathrm{C})=1$, $\phi(\mathrm{G})=2$ and $\phi(\mathrm{T})=3$. Next, for a $k$-mer $s = a_1a_2\cdots a_k$, define $\phi(s) = \phi(a_1)\cdot 4^{k-1} + \phi(a_2)\cdot 4^{k-2} + \cdots + \phi(a_k)$, which encodes $s$ as an integer in $[0, 4^k)$. Finally, the function $h(s) = \mathrm{hash}(\phi(s))$ is used to transform each $k$-mer into a 64-bit integer hash value. Here hash (see Algorithm 2 of [33]) is an invertible integer hash function chosen so that complex $k$-mers tend to have small hash values while repetitive $k$-mers, such as a long run of A’s, tend to have large hash values. A $(w,k)$-minimizer of a string is the $k$-mer having the smallest hash value in a surrounding window of $w$ consecutive $k$-mers. Note that there can be more than one minimizer in a window of $w$ consecutive $k$-mers.
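As a concrete illustration, the sketch below extracts (w,k)-minimizers from a DNA string. The 2-bit base encoding follows the scheme just described, while `mix64` is a simple stand-in integer mixer rather than minimap's invertible hash, so the exact minimizer set will differ from minimap's.

```python
# Sketch of (w,k)-minimizer extraction.  The 2-bit base encoding follows the
# scheme described above; mix64 is an illustrative integer mixer, not the
# invertible hash used by minimap itself.

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MASK64 = (1 << 64) - 1

def mix64(x):
    """Cheap 64-bit integer mixer (placeholder for minimap's hash function)."""
    x = (x ^ (x >> 31)) * 0x7fb5d329728ea185 & MASK64
    x = (x ^ (x >> 27)) * 0x81dadef4bc2dd44d & MASK64
    return x ^ (x >> 33)

def kmer_hashes(seq, k):
    """Yield (position, hash) for every k-mer, using the base-4 encoding phi."""
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + ENCODE[base]   # phi(s) in [0, 4^k)
        yield i, mix64(value)

def minimizers(seq, w, k):
    """Return (position, hash) pairs that are (w,k)-minimizers: the k-mer(s)
    with the smallest hash in every window of w consecutive k-mers."""
    hashes = list(kmer_hashes(seq, k))
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(h for _, h in window)
        selected.update(p for p in window if p[1] == smallest)
    return sorted(selected)

if __name__ == "__main__":
    for pos, h in minimizers("ACGTTGCATGACGTACGT", w=5, k=5):
        print(pos, hex(h))
```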
Minimap works in two stages: indexing and mapping. In the indexing stage, it collects minimizers of all target sequences, sorts the minimizers by their hash values using the radix sort introduced in DALIGN [29], and builds a hash table for them. Minimap samples only roughly $2/(w+1)$ of the total $k$-mers (recall that a minimizer is the $k$-mer having the smallest hash value among $w$ consecutive $k$-mers), so its storage requirement is much smaller than that of DALIGN; because it samples the most informative $k$-mers, the sensitivity does not decrease significantly. In the mapping stage, minimap collects all the minimizer hits between the query and the target sequences. A single-linkage clustering is then performed to detect the maximal collinear subset of hits by solving a longest increasing subsequence problem (Algorithm 4 of [33]), and this subset is output as the final mapping result.
Minimap has been optimized and extended into minimap2 [34], which adopts a more accurate chaining method and supports recovering base-level alignments by performing DP-based global alignments between adjacent anchors in a chain. Minimap2 supports aligning many more types of sequences, such as accurate short reads, noisy SMRT long reads, full-length noisy direct RNA or cDNA reads, as well as assembly contigs or closely related full chromosomes of hundreds of megabases in length.
To chain the minimizer hits (anchors), minimap2 sorts them by ending reference position. Let $f(i)$ be the maximal chaining score up to the $i$-th anchor in the sorted list; it is calculated with the dynamic programming
$$f(i) = \max\Big\{\max_{1\le j<i}\{f(j) + \alpha(j,i) - \beta(j,i)\},\ w_i\Big\},$$
where $w_i$ is the length of anchor $i$, $\alpha(j,i)$ is the number of matching bases contributed by anchor $i$ when appended to a chain ending at anchor $j$, and $\beta(j,i)$ is the gap cost, a function of the difference between the reference and query distances separating the two anchors. In the implementation, a gap of length $\ell$ costs $0.01\cdot\bar{w}\cdot|\ell| + 0.5\log_2|\ell|$, where $\bar{w}$ is the average anchor length. A direct computation of all the $f(i)$'s obviously takes $O(n^2)$ time; minimap2 reduces this by a heuristic that examines only a bounded number of preceding anchors $j$ for each $i$.
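The following sketch implements the chaining recurrence above on a list of anchors. The anchor representation, the simplified gap cost and the `max_pred` bound on examined predecessors are illustrative assumptions rather than minimap2's exact implementation.

```python
# Sketch of the chaining dynamic program described above.  Anchors are
# (x, y, w) triples: ending reference position x, ending query position y,
# and anchor length w.  max_pred bounds how many predecessors are examined.

import math

def chain_score(anchors, avg_len, max_pred=50):
    """Return (sorted anchors, chaining scores f(i))."""
    anchors = sorted(anchors)                       # sort by ending reference position
    f = []
    for i, (xi, yi, wi) in enumerate(anchors):
        best = wi                                   # a chain may start at anchor i
        for j in range(max(0, i - max_pred), i):
            xj, yj, wj = anchors[j]
            if xj >= xi or yj >= yi:
                continue                            # anchor j must precede i in both coords
            # alpha: matching bases anchor i adds on top of a chain ending at j
            alpha = min(xi - xj, yi - yj, wi)
            gap = abs((yi - yj) - (xi - xj))        # difference of query/reference gaps
            beta = 0.0 if gap == 0 else 0.01 * avg_len * gap + 0.5 * math.log2(gap)
            best = max(best, f[j] + alpha - beta)
        f.append(best)
    return anchors, f

if __name__ == "__main__":
    anchors = [(100, 50, 15), (130, 80, 15), (131, 200, 15), (170, 121, 15)]
    print(chain_score(anchors, avg_len=15))   # the collinear anchors chain together
```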
To recover detailed alignments, minimap2 performs DP-based global alignments between adjacent anchors in each chain with a 2-piece affine gap cost [35], which helps to recover longer indels. To further accelerate the global alignment, an SSE implementation based on the difference-based formulation developed in [36] is used.
Edlib
Edlib [37] is developed for efficiently performing an important category of sequence alignment: the calculation of the Levenshtein distance (also called edit distance), which is the minimum number of single-character edits (insertions, deletions or substitutions) needed to transform one sequence into the other. Calculating the Levenshtein distance is equivalent to calculating the maximum alignment score of a standard dynamic programming under the unit-cost scoring system. Edlib is in principle an extended implementation of the original bit-vector dynamic programming algorithm [38]. The original bit-vector algorithm accelerates the dynamic programming computation of the edit distance by packing multiple cells into one CPU register as a bit-vector, which enables the values of multiple cells to be computed at a time; however, it only supports a semi-global mode in which gaps at the start and end of the query sequence are not penalized. Edlib extends the bit-vector algorithm, using an extended banded algorithm, to also support global alignment and a semi-global mode in which only gaps at the end of the query are not penalized, covering all three kinds of alignment.
Another deficiency of the original bit-vector algorithm is that it returns no traceback information (the sequence of operations, i.e., substitutions, insertions and deletions, performed on one sequence to transform it into the other). The traceback is critical to many sequence analyses such as sequencing error correction and variant detection. Edlib remedies this defect by finding the optimal alignment path for all three supported alignment modes in linear space with Hirschberg's divide-and-conquer strategy [39].
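To make the notions of edit distance and traceback concrete, here is a plain quadratic dynamic program with traceback. It illustrates what Edlib computes in its global mode, but it includes none of Edlib's bit-vector, banded or linear-space optimizations.

```python
# Plain O(nm) edit-distance DP with traceback, illustrating what Edlib computes
# in global mode, without any of its bit-vector or banded optimizations.

def edit_distance_with_path(query, target):
    n, m = len(query), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                       # delete the whole query prefix
    for j in range(m + 1):
        dist[0][j] = j                       # insert the whole target prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (query[i - 1] != target[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # traceback: recover the sequence of operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (query[i - 1] != target[j - 1]):
            ops.append("M" if query[i - 1] == target[j - 1] else "X")  # match / mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("D")                  # deletion from the query
            i -= 1
        else:
            ops.append("I")                  # insertion into the query
            j -= 1
    return dist[n][m], "".join(reversed(ops))

if __name__ == "__main__":
    print(edit_distance_with_path("GATTACA", "GCATGCA"))  # (3, 'MXXMXMM')
```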
The consensus algorithms for TGS sequences
The correction of noisy TGS long reads is critical to many genomic studies with high resolution, such as structural variation detection. The currently widely used monohybrid correction methods work by aligning supporting reads to the template sequence. The overlaps are then used to build a multiple sequence alignment, from which a consensus sequence is resolved with some scoring scheme.
DAGCon
DAGCon is developed as part of the hierarchical genome assembly workflow HGAP [6] to correct the seeding sequences. It draws on the idea of [40], in which a multiple sequence alignment is represented as a directed acyclic graph.
DAGCon constructs a directed acyclic graph, called the sequence alignment graph (SAG), from the alignments between the supporting reads and the template in five steps. Step 1: Normalize mismatches and degenerate alignment positions in each alignment. Mismatches in an alignment are indistinguishable from an adjacent insertion-deletion pair to the correction algorithm; furthermore, for a given gap cost function there can be more than one way to place gaps in an alignment. To reduce the potential inconsistency introduced by these two cases and to improve the final consensus accuracy, each mismatch is replaced by an insertion-deletion pair and every gap is moved to its rightmost equivalent position (see Supplementary Figure S1 of [6] for an example). Step 2: Set up an initial directed acyclic graph from the template. Specifically, an initial backbone graph is built whose nodes correspond to the residues of the template, one node per template position, connected by directed edges between consecutive positions. Step 3: Construct a multigraph from the alignments and merge the edges that connect the same pair of nodes. Each alignment is represented as a list of aligned template-position/read-base elements and added to the backbone graph (Algorithm 1 of [6]). Step 4: Merge nodes. To reduce the complexity of the backbone graph, DAGCon merges nodes that are similar to each other (for example, nodes having the same base label that are connected to the same out-node or from the same in-node) while preserving the directed acyclic property of the graph (Algorithm 4 of [6]). Step 5: Generate the consensus sequence from the simplified backbone graph. Each node in the graph is given a score based on the weights of its out-edges, and each path is assigned a score by summing the scores of its nodes. The consensus is constructed by finding the path with the highest score among all possible paths using a simple dynamic programming (Algorithms 5 and 6 of [6]).
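The final step can be illustrated with a small highest-scoring-path computation over a DAG whose nodes carry base labels and scores. The graph encoding and scoring below are simplified assumptions for illustration, not DAGCon's actual data structures.

```python
# Sketch of consensus extraction as a highest-scoring path in a DAG (the idea
# behind the final DAGCon step).  Nodes carry a base label and a score; edges
# are a simple adjacency dict.  This is an illustration, not DAGCon itself.

from collections import defaultdict

def best_path(nodes, edges, source, sink):
    """nodes: {id: (base, score)}, edges: {id: [successor ids]} (acyclic).
    Returns the bases concatenated along the maximum-score source->sink path."""
    order, seen = [], set()
    def visit(u):                              # DFS postorder
        if u in seen:
            return
        seen.add(u)
        for v in edges.get(u, []):
            visit(v)
        order.append(u)
    visit(source)
    best, back = defaultdict(lambda: float("-inf")), {}
    best[source] = nodes[source][1]
    for u in reversed(order):                  # reversed postorder = topological order
        for v in edges.get(u, []):
            if best[u] + nodes[v][1] > best[v]:
                best[v], back[v] = best[u] + nodes[v][1], u
    path, u = [], sink                         # trace back and collect node labels
    while u != source:
        path.append(nodes[u][0])
        u = back[u]
    path.append(nodes[source][0])
    return "".join(c for c in reversed(path) if c != "")

if __name__ == "__main__":
    nodes = {0: ("", 0), 1: ("A", 3), 2: ("C", 1), 3: ("G", 3), 4: ("", 0)}
    edges = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
    print(best_path(nodes, edges, 0, 4))       # "AG": the A branch outweighs C
```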
FALCON-sense
FALCON-sense is the consensus algorithm released together with the FALCON assembler [41]. FALCON-sense starts by performing a pairwise mapping of the noisy raw reads with DALIGN [29]. Each template is then processed as follows. First, a variation of the $O(nd)$ difference algorithm [31] is applied to align the supporting reads to the template, where all differences in the alignments are encoded as insertions and deletions only. For each position of every alignment, an “alignment tag” is created that encodes the positions and bases in both the supporting read and the template. These tags are then used to construct an alignment graph in the following way. For each tag, if it does not yet exist in the graph, a corresponding node is added. For consecutive tags from an alignment, an edge with weight 1 is added to the graph; if the edge already exists, its weight is increased by 1. After processing all the alignment tags we obtain a directed acyclic graph in which every node has a tag and the weight of each edge indicates how many reads support the connection between its two nodes. Finally, a standard dynamic programming algorithm is applied to the alignment graph to find the highest-confidence path, and the consensus sequence is generated by concatenating the bases of the tags on this best path. Supplementary Figure 12 of [41] illustrates the consensus process.
In implementation, FALCON-sense is much simpler, more robust and more efficient than the DAGCon algorithm. It is also integrated into another widely used de novo assembly workflow, Canu, as its error correction method.
The correction-and-assembly workflows
There are three widely used “correction-and-assembly” assemblers, which are described below.
FALCON
The FALCON assembler [41] follows the same workflow as the hierarchical genome assembly pipeline HGAP [18] with many computationally optimized implementations. The most distinctive feature of FALCON is that it also supports phased diploid genome assembly through the FALCON-Unzip algorithm.
The FALCON workflow starts with error correction using the FALCON-sense algorithm (Section FALCON-sense). The DALIGN [29] algorithm is then used to identify overlaps between all pairs of corrected reads. These overlaps are used to construct a directed string graph [42], which is further simplified by assembly string graph reduction (Supplementary Figure 14 of [41]) into the final “haplotype-fused assembly graph”. After the reduction, the primary and associated contigs are constructed from the corrected reads and their overlap relationships. For each primary contig, FALCON collects all the raw reads associated with it and with its associated contigs according to the overlapping information. The raw reads are then aligned to the contigs with BLASR [21] for phasing heterozygous SNPs and identifying the haplotype of each raw read (Supplementary Figure 15 of [41]).
To construct haplotype-specific contigs (haplotigs), the FALCON-Unzip algorithm constructs a haplotype-specific assembly graph for each contig from its corresponding raw reads and then combines this graph with the fused assembly subgraph containing the paths of that contig, yielding a complete subgraph that has full read representation of both haplotypes (see Figure 1 and Supplementary Figure 13 of [41] for details). As a result, FALCON-Unzip builds a new primary contig and several haplotigs from the original assembly graph of that contig, and it assigns each phased read to the primary contig or to one of the haplotigs according to the phasing information.
Canu
The PBcR assembly workflow [43] is developed for efficient de novo assembly of large genomes such as the human genome. It proposes MHAP for overlapping noisy long reads using probabilistic, locality-sensitive hashing; DAGCon [6] and FALCON-sense are used for correcting errors in the long reads, and the Celera Assembler is used for assembling the corrected reads.
To reduce the running time of the overlapping step in the assembly pipeline, MHAP (MinHash Alignment Process) uses a dimensionality reduction technique, the MinHash sketch, for efficient alignment filtering. The MinHash sketch of a read is created by extracting all of its $k$-mers and converting them to fingerprints with $H$ hash functions, yielding $H$ sets of fingerprints. In MHAP, after the first hash set, the subsequent fingerprints are generated using XOR-shift pseudo-random number generators. The $k$-mer that produces the minimum value for a given hash function is referred to as the min-mer for that hash. The sketch of the sequence is then the ordered set of its $H$ min-mer fingerprints, which is much smaller than the set of all of the read's $k$-mers. To find overlaps between two reads, MHAP builds a sketch for each read and estimates their Jaccard similarity by computing the fraction of min-mers common to both sketches. If the two sketches are sufficiently similar, the shared min-mers are located in the two reads and the median difference of their positions is computed to determine the overlap offsets (see Figure 1 of [43] for an example). The sketches used here and the minimizers used in minimap and minimap2 share the same idea: both are reduced representations of the original sequence used to decrease memory usage and accelerate the overlapping process.
PBcR was later optimized into a more efficient assembler named Canu [44]. Canu implements an updated version of MHAP that uses a two-stage overlap filter, where the first stage identifies read pairs that are likely to share an overlap and the second stage estimates the extent and quality of the overlap. For the first stage, a tf-idf (term frequency, inverse document frequency) weighting is implemented in MHAP to increase the sensitivity for true overlaps and to reduce the number of false and repetitive overlaps. For a raw read set $R$, let $f_{\max}$ and $f_{\min}$ be the maximum and minimum observed frequencies of all the $k$-mers in $R$, respectively, and let $f(m)$ be the frequency of a specific $k$-mer $m$ in $R$. The inverse document frequency of $m$, denoted $\mathrm{idf}(m)$, is defined so that it decreases as $f(m)$ grows from $f_{\min}$ to $f_{\max}$ and is linearly rescaled to fall within a fixed range; a parameter controls how strongly less common $k$-mers are preferred over more common ones, and a $k$-mer that does not occur in $R$ is given the maximum value. The tf-idf weight of $m$ is then the product of $\mathrm{idf}(m)$ and its term frequency, i.e., the number of occurrences of $m$ in the read being sketched. After computing the weight of each $k$-mer, the original MinHash computation is replaced by applying a number of hash functions proportional to that weight per entry rather than a single hash function, and for each sketch entry the min-mer is chosen as the minimum hash value computed across all the applied hash functions. Since highly weighted $k$-mers are hashed more times, the chance that they are chosen as min-mers increases.
In the correction stage, Canu uses two filtering steps to avoid introducing overlaps from copy-specific repeat variants: a global filter in which each read chooses where it will supply correction evidence, and a local filter in which each read accepts or rejects the evidence supplied by other reads. The FALCON-sense consensus algorithm [41] is used for generating the corrected reads. Canu also reduces the runtime of the Celera Assembler by an order of magnitude. In the assembly stage, Canu uses a variant of the greedy “best overlap graph” algorithm [45] to construct a sparse overlap graph. The greedy algorithm loads only the “best” (longest) overlap for each read end into memory and thus consumes much less memory than the string graph formulation [42], which requires loading the full overlap graph into memory. Furthermore, to improve repeat and haplotype separation, Canu introduces a new “Bogart” algorithm to statistically filter repeat-induced overlaps and to retrospectively inspect the graph for potential errors.
MECAT
MECAT [46] is developed to effectively distinguish true overlaps from random read pairs that merely share similar subsequences. Traditional aligners such as BLAST [20] employ $k$-mer match methods to filter out random read pairs and to quickly find seeding positions for local gapped alignments. However, repetitive subsequences in reads produce large numbers of $k$-mer matches and result in excessive candidate seeding positions; on the other hand, simply masking these low-complexity regions, or ignoring $k$-mers with high frequencies, can cause correct overlaps to be lost. To address this issue, MECAT proposes a pseudolinear scoring algorithm based on distance difference factors (DDFs) to quickly filter excessive alignments.
For two $k$-mer matches whose positions are $(x_1, y_1)$ and $(x_2, y_2)$ respectively, where $x$ is the read position and $y$ is the reference position, their distance difference factor is defined as
$$\mathrm{DDF} = \big|(x_1 - x_2) - (y_1 - y_2)\big|.$$
If the DDF is below a threshold, the two $k$-mer matches are said to support each other in position. Now suppose the $k$-mer matches between a read and a target sequence are $m_1, m_2, \ldots, m_n$. The DDF score of each match is initialized to zero; for every match pair $m_i$ and $m_j$ whose DDF is below the threshold, the DDF scores of both matches are increased by one. The $k$-mer match with the highest DDF score then provides a position through which a significant local alignment is likely to pass. This DDF scoring process evidently takes $O(n^2)$ time, which is very slow for long SMRT reads. To speed it up, MECAT breaks the scoring into two steps so that the quadratic mutual scoring takes place only in local regions. In the first step, the target sequence is decomposed into consecutive blocks of equal length (2000 bp by default). The scoring starts by randomly selecting a block containing a sufficient number of $k$-mer matches, and the mutual DDF scoring is performed on the matches falling into that block; if the match with the highest score is significant (greater than a threshold), it is set as the seed match. In the second step, the scoring is extended to the neighboring blocks: for each $k$-mer match in a neighboring block, the DDF between it and the seed match is calculated, and if it is below the threshold the score of the seed match is increased by 1. The mutual scoring in the first step is quadratic only in the small number of matches within the selected block, and the extension scoring in the second step is linear in the number of remaining matches, so the overall scoring process finishes in pseudolinear time.
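The direct quadratic form of DDF scoring can be sketched as follows; MECAT's block decomposition, which makes the scoring pseudolinear, is not reproduced, and the threshold value is an arbitrary illustrative choice.

```python
# Quadratic DDF scoring sketch: each k-mer match is a (read_pos, ref_pos) pair,
# and two matches support each other when the difference of their diagonals is
# below a threshold.  MECAT's block decomposition is not reproduced here.

def ddf(m1, m2):
    """Distance difference factor of two k-mer matches (x = read, y = reference)."""
    (x1, y1), (x2, y2) = m1, m2
    return abs((x1 - x2) - (y1 - y2))

def ddf_scores(matches, epsilon=50):
    scores = [0] * len(matches)
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            if ddf(matches[i], matches[j]) <= epsilon:
                scores[i] += 1            # mutually supporting matches
                scores[j] += 1
    return scores

if __name__ == "__main__":
    # three matches on roughly the same diagonal plus one repeat-induced outlier
    matches = [(100, 1100), (420, 1425), (900, 1890), (450, 9050)]
    scores = ddf_scores(matches)
    best = max(range(len(matches)), key=lambda i: scores[i])
    print(scores, "-> seed match:", matches[best])
```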
Experiments show that the scores of the seeding positions grow linearly with their alignment lengths. After DDF score filtering, the candidate seeding positions are reduced by 50%–70% before they are extended into local alignments. In the correction stage, MECAT employs DDF scoring to effectively choose supporting reads for each template and then uses an adaptive DAGCon consensus algorithm [18] to correct the templates. The original DAGCon algorithm achieves higher precision than other correction methods but builds a sequence alignment graph (SAG) for the whole template, which consumes a lot of time and memory. MECAT accelerates the consensus process in two steps. First, it identifies bases in the template that are covered mainly by identical bases or mainly by gaps in the supporting reads; in the corrected read, these bases remain unchanged if they are covered mainly by identical bases, or are deleted if they are covered mainly by gaps. Second, the DAGCon algorithm is used only to correct the subsequences between two unchanged bases. MECAT originally used the assembly module from Canu [44], which has since been replaced by fsa, a string graph [42] based assembler.
ASSEMBLY-AND-CORRECTION METHODS
Flye
Flye [47] is developed for accurately resolving genomic repeats in de novo genome assembly. Unlike other mainstream assemblers, Flye starts by creating disjointigs, which represent concatenations of multiple disjoint genomic segments, and concatenates all the error-prone disjointigs into a single string. From this concatenation, Flye builds an accurate assembly graph, untangles it with reads and resolves the bridged repeats. Finally, Flye resolves the unbridged repeats using small differences between repeat copies and outputs accurate contigs formed by paths in the graph.
Wtdbg2
Wtdbg2 [48] proposes a fast pairwise mapping implementation and a layout algorithm based on a fuzzy Bruijn graph (FBG) to speed up large genome assembly. To reduce memory usage, current aligners (such as DALIGN and minimap2) split the input raw reads into small batches and perform pairwise mapping between batches. Wtdbg2 instead loads all the raw reads into memory and builds a hash table for the $k$-mers in all raw reads. For two reads that share $k$-mers, wtdbg2 performs a Smith-Waterman-like dynamic programming between them in terms of bins: each read is encoded as a consecutive sequence of 256 bp bins. The FBG proposed by wtdbg2 extends the basic idea of the de Bruijn graph, where each “base” of the FBG is a 256 bp bin and a $k$-bin consists of $k$ consecutive bins from reads. After simplifying the FBG, wtdbg2 builds the final consensus with partial order alignment over the edge sequences.
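A rough sketch of the binning idea is shown below: a read is chopped into 256 bp bins and k-bins are formed from consecutive bins. The bin fingerprint used here (the most frequent k-mers of the bin) is purely an illustrative choice, not wtdbg2's internal representation.

```python
# Sketch of wtdbg2-style binning: a read becomes a sequence of 256 bp bins and
# a k-bin is k consecutive bins.  The fingerprint below is only illustrative.

from collections import Counter

BIN_SIZE = 256

def bin_read(seq, k=15, top=4):
    """Split a read into 256 bp bins and fingerprint each bin."""
    bins = []
    for start in range(0, len(seq) - BIN_SIZE + 1, BIN_SIZE):
        window = seq[start:start + BIN_SIZE]
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        bins.append(tuple(sorted(m for m, _ in counts.most_common(top))))
    return bins

def k_bins(bins, k=4):
    """Enumerate k-bins: tuples of k consecutive bin fingerprints."""
    return [tuple(bins[i:i + k]) for i in range(len(bins) - k + 1)]

if __name__ == "__main__":
    import random
    random.seed(0)
    read = "".join(random.choice("ACGT") for _ in range(2048))
    b = bin_read(read)
    print(len(b), "bins;", len(k_bins(b)), "4-bin units")   # 8 bins; 5 4-bin units
```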
SOME BENCHMARKS
To illustrate the efficiency and assembly quality of the assembly workflows described in this review, we present some benchmarks in this section. The benchmark results also provide clues for the usage guidelines discussed in the next section. Note that there exists an abundance of benchmarks of this kind in the literature; we therefore do not carry out such tests ourselves but instead quote two representative reports and list their results here for easy reference and discussion.
Benchmarks on PacBio data
We list a simplified version of the experimental results reported in [48] in Table 2. The five workflows are used for de novo assembly of three moderate-size genomes (about 100 Mb) and one large human genome (about 3 Gb). More details on the datasets and the experimental setup are found in Table 1 and Supplementary Table 1 of [48]. On the three moderate-size genomes, Canu achieves the highest assembly quality but is the least efficient. FALCON, Flye and MECAT behave roughly the same in terms of NG50, NG75 and run time. Wtdbg2 is the fastest assembler and is at least an order of magnitude faster than Canu; the NG50 and NG75 of its assemblies are generally better than those of FALCON, Flye and MECAT. In assembling the 100× human dataset, FALCON has the best assembly quality. The assembly process of Wtdbg2 is dominated by the polishing step, which consumes about 75% of the computation time, and the reference genome coverage of the Wtdbg2 assembly is significantly smaller than that of Canu and FALCON.
Benchmarks on Nanopore data
The FALCON and MECAT workflows do not work with Nanopore data, so we assess the performance of the other three assemblers here. In Table 3 we quote the results from [49], where the three workflows are used for assembling five moderate-size genomes (about 100 Mb). Details on the datasets and experimental setup can be found in Table 1 and Supplementary Table 8 of [49]. As on the PacBio datasets, Canu takes the longest time; Wtdbg2 is still the most efficient workflow and can be two orders of magnitude faster than Canu.
On the other hand, due to the great discrepancy between PacBio and Nanopore sequences, the three assemblers behave quite differently on the two kinds of data in terms of assembly contiguity. The Flye workflow has the longest N50 on each dataset. Wtdbg2 obtains better N50 than Canu on the A. thaliana and D. melanogaster datasets, while on the complex genomes C. reinhardtii, O. sativa and S. pennellii, Canu produces larger N50 than Wtdbg2, which shows the advantage of the correction-and-assembly strategy on complex genomes. We also note that Wtdbg2 outputs significantly more contigs than the other two workflows on every dataset.
SOME USE GUIDES
If your task is to carry out a sequence alignment, the preferred tool is generally minimap2 [34], which can meet most processing demands. As a general-purpose sequence mapping tool, minimap2 performs satisfactorily in both efficiency and sensitivity. It supports both read-to-read pairwise mapping and reference mapping, and a large spectrum of sequence types: DNA and mRNA sequences, PacBio and Nanopore noisy long reads, as well as assembly contigs or closely related full chromosomes of hundreds of megabases in length.
If your study (not necessarily de novo genome assembly) involves raw read correction as an intermediate step, there is no need to re-implement the consensus algorithms described above and build the overlap-and-correct pipeline yourself. To correct PacBio noisy long reads, either Canu or MECAT is recommended; if the dataset is large, MECAT is the preferred option, since it is substantially faster while maintaining the same level of accuracy as Canu. To correct Nanopore raw reads, the mainstream tool currently available is Canu.
There are multiple choices of workflow for de novo genome assembly with both PacBio and Nanopore sequences. For small and moderate-size datasets, especially for complex genomes, Canu reliably constructs contiguous and accurate assemblies. For large PacBio datasets, MECAT is recommended if the genome is complex; otherwise Flye and Wtdbg2 are also appropriate alternatives. For large Nanopore datasets, consider Flye, or Wtdbg2 if your computing resources are limited. Note that these workflows can also be used cooperatively. For example, if the quality of the contigs assembled by one assembler is surprisingly poor, you may suspect that the raw read dataset is problematic (that is, the data are incomplete or too noisy); to check this, you can assemble the contigs again with a second, fast assembler such as Wtdbg2 or Flye and inspect the quality of the new contigs.
CONCLUSIONS AND DISCUSSIONS
In this survey, we analyze in detail the features of SMS sequences from both the PacBio and the Nanopore platforms. We review and summarize several important and widely used computational methods designed for TGS, classified into three categories: alignment methods, consensus methods and assembly workflows; the alignment and consensus methods are integrated into the assembly workflows to carry out intermediate tasks. For each method, we give a general description of its design principle and an in-depth analysis of its core concepts and ideas. To illustrate the performance of the assembly workflows, we quote two benchmark results and analyze the behavior of the workflows on each dataset. Based on their design principles and benchmark results, we then provide some general usage guidelines for applying these methods.
Although substantial progress has been made, TGS still has shortcomings, and the range of genomic studies in which it is applied is still limited. Based on what we have learned, we expect TGS to develop in the following directions. Firstly, as the ultra-long read protocol [8] and circular consensus sequencing continue to evolve, SMS sequences should keep increasing in both length and accuracy. As a result, we will obtain more complete and contiguous assembled contigs that cover regions remaining unresolved today, and polyploid assembly of complex genomes will become feasible. Besides de novo assembly, TGS is expected to promote studies in other fields such as N6-methyladenine DNA modification, transcriptome analysis and structural variation detection. Secondly, third-generation sequencing technology is also expected to develop in other areas, such as single-molecule RNA sequencing, and to promote the relevant studies with higher resolution.
We hope that this review can serve as a roadmap for those who are interested in TGS and as a foundation on which more advanced computational methods can be developed.