1 INTRODUCTION
RNA-Seq is an application of next-generation sequencing (NGS) technologies to perform transcriptome-wide profiling. As one of the most cost-effective approaches, RNA-Seq has been widely applied in humans, model organisms and non-model species and has provided unprecedented insights into the transcriptomic landscape [
1–
8]. The versatile applications of RNA-Seq include (i) whole transcriptome reconstruction based on
de novo transcriptome assembly [
9], (ii) identification of novel transcripts [
4], (iii) detection of differentially expressed genes [
10] or transcripts [
11] between experimental groups, (iv) detection of alternatively spliced isoforms [
12], (v) detection of allele-specific expression [
13], (vi) construction of co-expression networks [
14], (vii) identification of RNA editing sites [
15], and (viii) identification of DNA variations in gene regions [
16]. Many of these applications have been the subject of recent reviews [
2,
17–
20], but the field is rapidly evolving, particularly with respect to methods for mapping of short-read RNA-Seq data in model organisms, which is a fundamental step for all forms of RNA-Seq data analysis. In this review we focus on recent progress in read-mapping algorithms for RNA-Seq data and reference-guided transcriptome assembly, which is recommended if the aim is to detect novel transcripts. Additionally, we discuss the latest developments in differential expression analysis from RNA-Seq data, which is the primary interest of biologists in many RNA-Seq studies. We conclude with a perspective on future directions for RNA-Seq.
2 STRATEGIES FOR TRANSCRIPTOMIC ANALYSIS WITH A REFERENCE GENOME
For organisms with a reference genome, direct mapping to the reference and/or reference-guided transcriptome assembly are more computationally efficient than de novo assembly and are the most commonly used strategies.
Direct mapping is a straightforward option for transcriptomic analysis in model organisms with a well-annotated reference genome or transcriptome. Using this strategy, RNA-Seq reads are directly aligned to the reference genome or to transcript sequences using mapping tools such as TopHat [
21], TopHat2 [
22], HISAT [
23], HISAT2 [
23], MapSplice [
24], SOAPSplice [
25] or STAR [
26] for splice-junction mapping, or Bowtie [
27], Bowtie2 [
28], BWA [
29], BWA-SW [
30], BWA-MEM [
31], SOAP [
32] or SOAP2 [
33] for non-splice-junction mapping. Based on the annotation, each feature (i.e., gene, transcript or exon) is assigned a count value or a normalized count value by counting the number of RNA-Seq reads covering the feature, with these count values representing the relative abundance of features in the transcriptome. Comprehensive annotation is advantageous for this approach, but a simulation study has shown that the method is robust to the presence of incomplete annotation and any incorrect transcripts present in a curated set do not absorb much signal [
34]. In summary, direct mapping to the reference is a popular approach for analysis of RNA-Seq data, both because the analysis workflow is straightforward and due to the availability of many well-developed downstream software tools (e.g., edgeR [
35], DESeq [
36], DESeq2 [
37], SAMseq [
38], baySeq [
39], NOIseq [
40], limma [
41], NBPSeq [
42], TSPM [
43] and EBSeq [
44] for differential expression analysis).
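To make the count-based quantification step concrete, the following minimal sketch (purely illustrative; the gene coordinates, read positions and gene names are hypothetical, and real analyses would use dedicated tools such as HTSeq or featureCounts) assigns aligned reads to annotated genes by coordinate overlap and converts the raw counts to counts per million (CPM):

```python
# Minimal illustration of count-based quantification: assign aligned reads to
# annotated genes by coordinate overlap, then normalise to counts per million (CPM).
# All coordinates and names below are hypothetical toy values.

genes = {                      # gene -> (chrom, start, end), 1-based inclusive
    "geneA": ("chr1", 1000, 5000),
    "geneB": ("chr1", 8000, 12000),
}

# Each aligned read is represented by its chromosome and start/end positions.
aligned_reads = [
    ("chr1", 1200, 1300), ("chr1", 4900, 5000),
    ("chr1", 8100, 8200), ("chr1", 20000, 20100),
]

def overlaps(read, gene):
    r_chrom, r_start, r_end = read
    g_chrom, g_start, g_end = gene
    return r_chrom == g_chrom and r_start <= g_end and r_end >= g_start

counts = {g: 0 for g in genes}
for read in aligned_reads:
    for g, coords in genes.items():
        if overlaps(read, coords):
            counts[g] += 1

# CPM normalisation: scale raw counts by the total number of assigned reads.
total = sum(counts.values()) or 1
cpm = {g: 1e6 * c / total for g, c in counts.items()}
print(counts)   # {'geneA': 2, 'geneB': 1}
print(cpm)
```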
Reference-guided transcriptome assembly is a more ambitious approach for transcriptomic analysis. This method involves aligning reads to a reference genome and using both the alignment outcomes and curated annotations to infer transcript structures. This strategy is attractive because it can leverage a reference genome and existing annotations for the discovery of novel transcripts. In theory, this strategy is superior to the direct mapping approach because it offers the possibility of obtaining a more complete set of gene/transcript sequences, as has now been shown in many RNA-Seq studies [
11,
45,
46], whereas direct mapping relies on current annotations for model organisms that are often incomplete. However, a potential caveat is that due to typical limitations in RNA-Seq data, such as short read length, sequencing errors and biases, and/or errors introduced during alignment and assembly, reference-guided transcriptome assembly may generate large numbers of partial transcripts and even assembled artefacts that can confound transcriptomic analyses. Recent studies have shown that these assembled artefacts can account for a substantial proportion of the signal when performing expression analyses [
34]. In addition, compared to the direct mapping approach, there are fewer downstream tools supporting this methodology. Cufflinks [
11,
47] and Scripture [
45] were the first software tools to implement reference-guided transcriptome assembly. Trinity was initially designed for
de novo transcriptome assembly [
48], but now it also offers reference-guided transcriptome assembly in recently released versions. More recently a related method called StringTie has been released that claims to have improved performance compared to Cufflinks [
49]. Reference-guided transcriptome assembly is unquestionably the best option if the objective is to identify novel transcripts. Conversely, direct mapping to the reference is arguably the best choice for analysis of RNA-Seq data in well-annotated model organisms.
3 MAPPING ALGORITHMS FOR SHORT-READ DATA
Ideally, the first step in analysis of RNA-Seq data would involve mapping of short-read sequences to a reference transcriptome. However, because the transcriptome is incompletely annotated even for well-studied species, mapping RNA-Seq reads to a reference genome is preferable for organisms whose reference genomes are available.
A wide variety of mapping algorithms and software tools have been developed over the past few years. For example, more than 60 aligners are listed in the study by Fonseca et al. (2012) [50] and the number continues to increase (e.g., 84 aligners were listed at the time of writing). The growing number of aligners is indicative of the importance of sequence alignment to the research community and is evidence of the active development of mapping tools. However, it also presents challenges to researchers in terms of selecting a suitable aligner for their studies.
Mapping tools for short-read sequencing data can be divided into two major groups: (i) unspliced aligners that are designed to align continuous reads to a reference without consideration for splicing junctions, and (ii) spliced aligners that are capable of splitting reads at intron-exon boundaries. For RNA-Seq studies, unspliced aligners are mainly applied when (i) organisms do not contain introns in their genomes (e.g., most bacteria and some eukaryotic microorganisms), or (ii) sequence reads are mapped to a library of known transcript sequences (i.e., a reference transcriptome) rather than a reference genome sequence. On the other hand, spliced aligners are capable of mapping RNA-Seq data to a reference genome. Below we briefly discuss the mapping algorithms and tools that are commonly applied to RNA-Seq data. Note that most unspliced aligners discussed below are designed for short-read NGS data rather than specifically for RNA-Seq data. However, these unspliced aligners can be applied in RNA-Seq studies under the aforementioned scenarios.
Three broad categories of mapping algorithms are commonly used in analysis of short-read data (reviewed in Ref. [
51]): algorithms based on hash tables, algorithms based on suffix trees and algorithms based on merge sorting (Figure 1).
3.1 Algorithms and tools based on hash tables
A hash table is a key-value data structure used to implement an associative array. The most important feature of this type of data structure is that it can map keys to values very efficiently. The idea of hash table based algorithms, which essentially follow the seed-and-extend paradigm by matching a short section (i.e., a seed) of each read to the reference and extending these seed matches to the full length of the read, can be traced back to when the BLAST algorithm was first developed [
52,
53].
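As an illustration of the seed-and-extend paradigm, the sketch below (a simplified toy example, not the implementation of any particular aligner) builds a hash table of k-mer seed positions in a reference sequence and extends each seed match over the full read length while counting mismatches:

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Hash table mapping every k-mer (seed) in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read, reference, index, k, max_mismatches):
    """Use the first k bases of the read as the seed, then extend over the full read."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTAGCCGTACGATTACG"   # toy reference sequence
index = build_seed_index(reference, k=4)
print(seed_and_extend("ACGTTAGC", reference, index, k=4, max_mismatches=1))  # [(4, 0)]
```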
Arguably, Eland was the first successful aligner integrated into the Illumina data processing package that utilized the seed-and-extend paradigm in short-read alignment (A. J. Cox, unpublished). The concept of Eland is to split a read into segments, create a memory-resident hash table for all read segments and scan for inexact matches using combinations of segments as exact hash keys. This seed strategy is also known as the spaced-seed approach. This approach inspired the development of many other short-read aligners based on spaced seeds, such as SOAP [
32], MAQ [
54], RMAP [
55,
56], and ZOOM [
57], among others. The downside of the spaced-seed approach is that gaps are not permitted within the seed. More recent methods have sought to overcome this limitation by use of dynamic programming to detect gaps during the extension step or by attempting small gaps at each read position [
32,
58]. Ultimately, the problem was overcome by the
q-gram filter and multiple seed hits approaches. The
q-gram filter is based on the observation that the substrings of an approximate match must have a certain number of
q-grams (i.e., strings of length
q) in common [
59]. In general, methods based on spaced seeds and the q-gram filter are similar in that they both rely on a hash table for fast exact matching. Spaced-seed methods initiate seed extension from one long seed match, whereas q-gram methods usually initiate extension from multiple relatively short seed matches. SHRiMP [
60] and RazerS [
61] are two successful examples that implement the q-gram filter, which provides a way to build an index that allows gaps. Later, RazerS 3 [
62] was developed as a successor to RazerS with a superior running time and the capability to map reads of various lengths with many insertion and deletion errors. In addition to using the
q-gram filter, RazerS 3 makes use of Open Multi-Processing (OpenMP) to provide shared-memory parallelization with dynamic load balancing, a pigeonhole-based filter with controllable sensitivity, and a banded version of Myers' bit-vector algorithm for verification, improving both running time and sensitivity [
62].
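The q-gram filter can be illustrated with a short sketch. The following toy example (not the RazerS implementation) applies the q-gram lemma, which states that a read of length n matching a reference window with at most e errors must share at least n + 1 - (e + 1) * q q-grams with that window, to decide whether a candidate window is worth verifying:

```python
def qgrams(s, q):
    """All length-q substrings (q-grams) of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def passes_qgram_filter(read, window, q, max_errors):
    """q-gram lemma: a match with at most e errors shares at least
    len(read) + 1 - (e + 1) * q q-grams with the reference window."""
    threshold = len(read) + 1 - (max_errors + 1) * q
    read_grams = set(qgrams(read, q))
    shared = sum(1 for g in qgrams(window, q) if g in read_grams)
    return shared >= threshold

read   = "ACGTTAGCCGTA"            # toy read
window = "ACGTTTGCCGTA"            # one substitution relative to the read
print(passes_qgram_filter(read, window, q=3, max_errors=1))           # True: candidate kept
print(passes_qgram_filter(read, "GGGGGGGGGGGG", q=3, max_errors=1))   # False: filtered out
```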
Major improvements in seed extension were also achieved by accelerating the standard Smith-Waterman algorithm with vectorization (i.e., processing multiple query sequences in one CPU cycle) and by constraining dynamic programming to a band around seeds, both of which significantly accelerate the alignment process. For example, the striped Smith-Waterman algorithm (a Smith-Waterman implementation in which the Single-Instruction Multiple-Data (SIMD) registers are parallel to the query sequence but are accessed in a striped pattern) achieved a 2−8 fold performance improvement over other SIMD-based Smith-Waterman implementations [
63]. Novoalign (http://www.novocraft.com/products/novoalign/), CLC Genomics workbench (http://www.clcbio.com/products/clc-genomics-workbench/), SHRiMP [
60] and SMALT (http://www.sanger.ac.uk/science/tools/smalt-0) are examples of software that utilize the accelerated Smith-Waterman algorithm in the alignment. BWA-MEM [
31] also recently joined this category, introducing several innovations including seeding and re-seeding, improved seed extension, chaining (i.e., linking a group of seeds that are collinear and close to each other) and chain filtering (i.e., filtering out overlapping short chains according to defined criteria), all designed for optimal alignment of reads of 70 bp or longer.
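The idea of constraining dynamic programming around a seed can be sketched as a banded edit-distance computation, in which only cells within a fixed band around the main diagonal are filled in. This is a simplified illustration of the principle rather than the striped SIMD Smith-Waterman used by the tools above:

```python
def banded_edit_distance(read, ref_window, band):
    """Edit distance computed only within +/- band cells of the main diagonal,
    which is the idea behind constraining dynamic programming around a seed."""
    INF = float("inf")
    n, m = len(read), len(ref_window)
    prev = [INF] * (m + 1)
    for j in range(min(m, band) + 1):   # first row, restricted to the band
        prev[j] = j
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        if i - band <= 0:               # first column is inside the band
            curr[0] = i
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            cost = 0 if read[i - 1] == ref_window[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,   # match / mismatch
                          prev[j] + 1,          # deletion in the read
                          curr[j - 1] + 1)      # insertion in the read
        prev = curr
    return prev[m]

print(banded_edit_distance("ACGTTAGC", "ACGTAGC", band=2))   # 1 (one extra base in the read)
```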
3.2 Algorithms and tools based on suffix trees
A suffix tree is a compressed trie containing all the suffixes of a given sequence (e.g., a genome sequence), built by pre-processing the sequence data into a space-efficient data structure. After the construction of a suffix tree, fast query searches can be performed easily, for instance locating a substring with a specified number of mismatches. Algorithms based on suffix trees essentially reduce the inexact matching problem to the exact matching problem. This is achieved by first identifying exact matches and then building inexact matches supported by these exact matches [
51].
Use of a trie greatly enhances alignment efficiency because multiple loci that share an identical substring in a reference need only be aligned once (since identical alignments collapse on a single path in the trie), whereas alignment needs to be performed independently for each locus using the hash table approach. The suffix tree is undoubtedly one of the most important and widely used data structures in string processing. However, algorithms based on suffix trees are memory intensive because even the most space-efficient implementation [
64] requires at least 12.5 bytes per bp, which equates to > 37 GB for the human genome (~3 Gbp). Continuous improvements have been made to overcome this obstacle, culminating in the enhanced suffix array [
65] and FM-index [
66]. An enhanced suffix array uses a basic suffix array enhanced with several auxiliary arrays, leading to a reduction in space consumption to 6.25 bytes per bp. An FM-index (Full-text index in Minute space) is a compressed full-text substring index based on the Burrows-Wheeler transform [
67], which allows compression of the input text while still supporting fast substring queries.
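A minimal sketch of the Burrows-Wheeler transform and FM-index backward search is given below. It constructs the BWT naively from sorted rotations and counts exact occurrences of a pattern by extending the match one character at a time from right to left; production aligners use far more compact and efficient data structures:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' is the end marker)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(bwt_str, pattern):
    """Count exact occurrences of pattern using FM-index backward search."""
    # C[c]: number of characters in the text that are lexicographically smaller than c.
    sorted_bwt = sorted(bwt_str)
    C = {c: sorted_bwt.index(c) for c in set(bwt_str)}
    def occ(c, i):               # occurrences of c in bwt_str[:i]
        return bwt_str[:i].count(c)
    lo, hi = 0, len(bwt_str)     # current suffix-array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one character at a time, right to left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

genome = "ACGTACGTTAGCACGT"      # toy genome sequence
b = bwt(genome)
print(fm_count(b, "ACGT"))       # 3 exact occurrences
```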
A number of publicly available aligners have been developed based on suffix tree algorithms. For example, Segemehl [
68] uses an enhanced suffix array, while Bowtie [
27], BWA [
29], SOAP2 [
33], and BWA-SW [
30] are based on the FM-index. Bowtie2 combines ultrafast FM-index-based seeding with efficient extension by dynamic programming in order to obtain gapped alignments [
28]. RSEM is a software package for quantifying gene and isoform abundances from short-read RNA-Seq data [
69]. It uses the Bowtie/Bowtie2 alignment program to align reads against transcript sequences rather than a genome reference, with parameters specifically chosen for transcript quantification from RNA-Seq data (e.g., the “--estimate-rspd” option enables RSEM to use the data to learn how RNA-Seq reads are distributed across a transcript). TopHat is one of the few tools that supports splice junction mapping for RNA-Seq reads. It first maps RNA-Seq reads to a genome reference using Bowtie, and then analyses the mapping results to identify splice junctions between exons [
21]. TopHat2 [
22] is the descendant of TopHat. By using Bowtie or Bowtie2 as the underlying mapping engine and adopting a two-step approach, in which (i) potential splice sites for introns are detected and (ii) these candidate splice sites are then used in a subsequent step to correctly align multiexon-spanning reads, TopHat2 is able to align reads spanning insertions and deletions on the same chromosome, even very large ones, as well as reads spanning translocations involving different chromosomes [
22]. MapSplice [
24] and SOAPSplice [
25] use a similar two-step approach for splice junction mapping. Spliced Transcripts Alignment to a Reference (STAR) is another popular splice junction mapper based on an algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and a stitching procedure [
26]. Another recent and highly promising method is HISAT [
23] and its upgraded version HISAT2. In addition to using one global FM index that represents the whole genome, HISAT and HISAT2 use a large number of small local FM indexes that collectively cover the whole genome for the effective alignment of RNA-Seq reads [
23].
3.3 Algorithms and tools based on merge sorting
The alignment algorithm based on merge sorting uses not only the most probable base, but also all possible bases with a probability above a certain base probability threshold provided by the Illumina probability file. It then generates all possible reads with a probability above a certain read probability threshold. For the core alignment, it sorts all these generated reads in lexicographical order and then merges them sequentially with a pre-sorted table of windows of reference sequences and their reverse complements. This approach eliminates the need for an indexed structure by replacing random I/O with sequential I/O. Currently, the only software tools using this approach are Slider [
70] and its descendant SliderII [
71].
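The core idea behind merge-sort-based alignment can be illustrated as follows: both the reads and the reference windows are sorted once, and exact matches are then found in a single sequential pass over the two sorted lists, with no random lookups into an index. This toy sketch omits the probability-based read generation used by Slider:

```python
def merge_match(reads, reference, read_len):
    """Sketch of merge-sort-based matching: sort the reads and the reference windows
    once, then find exact matches in a single sequential pass over both sorted lists."""
    windows = sorted((reference[i:i + read_len], i)
                     for i in range(len(reference) - read_len + 1))
    sorted_reads = sorted(reads)
    hits, w = [], 0
    for read in sorted_reads:
        # Advance sequentially through the sorted windows until they are >= the read.
        while w < len(windows) and windows[w][0] < read:
            w += 1
        j = w
        while j < len(windows) and windows[j][0] == read:   # report all matching windows
            hits.append((read, windows[j][1]))
            j += 1
    return hits

reference = "ACGTACGTTAGC"                  # toy reference sequence
reads = ["CGTT", "ACGT", "GGGG"]            # toy reads
print(merge_match(reads, reference, read_len=4))   # [('ACGT', 0), ('ACGT', 4), ('CGTT', 5)]
```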
3.4 Summary for mapping algorithms and tools
In general, the strengths of hash table based algorithms are that they can tolerate high levels of genomic variation and easily perform partial alignment (such as for exon-exon junction reads), but this comes at the cost of high memory requirements for hashing and poor sensitivity for alignment of reads in repetitive regions. In comparison, the strengths of algorithms based on suffix trees are that they are able to perform fast alignment, especially for exact matches, and these algorithms offer alignments with high sensitivity in repetitive regions. However, suffix tree based algorithms are generally less tolerant of high genomic variation than hash table based algorithms. Merge sorting based tools, such as Slider and SliderII, are becoming less popular, primarily because they use Illumina probability files as input rather than more standard file formats (such as FASTQ), and recent sequencing platforms (such as HiSeq 2000) do not provide Illumina probability files. Table 1 lists some popular aligners that have been widely applied in short-read sequence alignment. Note that this is not a complete list. Readers are referred to the study by Fonseca et al. (2012) [
50] and the well-maintained high-throughput sequencing mappers website for a more comprehensive list of aligners.
4 DIFFERENTIAL EXPRESSION ANALYSIS
The primary goal of most RNA-Seq studies is to identify differentially expressed genes (DEGs) and/or differentially expressed transcripts (DETs) between experimental groups. Prior to data analysis, quality control is usually performed to assess the quality of sequencing reads, including sequence quality scores, GC content and sequence duplication levels. There are a number of tools that are designed for this purpose, such as FastQC and FASTX-Toolkit. To quantify gene expression, RNA-Seq reads need to be aligned to a reference genome for model organisms (e.g., using HISAT2 [
23]) or to a library of transcriptome sequences reconstructed using
de novo assembly strategies for organisms without reference sequences (e.g., using Trinity [
48] for
de novo assembly, and RSEM [
69] for mapping and detection of DETs). If detecting novel isoforms is of interest in a study, then reference-guided assembly needs to be performed (e.g., using StringTie [
49]), followed by a merge step to generate a non-redundant set of transcripts (e.g., using Cufflinks-Cuffmerge [
72]) for downstream analyses. Following alignment, the expression level of genes/transcripts is quantified by counting the number of reads aligned to each feature (e.g., using HTSeq or StringTie [
49] for generating gene-level or transcript-level count tables, respectively). Subsequently, a range of statistical methods can be applied to assess the significance of differences in expression level observed between experimental groups (e.g., using edgeR [
35] or Ballgown [
73] for detection of DEGs and DETs, respectively). A general workflow for differential expression analysis is illustrated in Figure 2.
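A hedged sketch of such a workflow is shown below as a Python driver script. The file names (genome_index, sample_R1.fastq, annotation.gtf, etc.) are placeholders, the command-line options are illustrative and should be checked against each tool's documentation, and the final statistical testing with edgeR, DESeq2 or Ballgown (typically done in R) is not shown:

```python
import subprocess

def run(cmd):
    """Run one pipeline step and stop on failure."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Splice-aware alignment of paired-end reads to the reference genome.
run(["hisat2", "-x", "genome_index", "-1", "sample_R1.fastq", "-2", "sample_R2.fastq",
     "-S", "sample.sam"])

# 2. Sort and index the alignment.
run(["samtools", "sort", "-o", "sample.bam", "sample.sam"])
run(["samtools", "index", "sample.bam"])

# 3. Reference-guided assembly/quantification against an existing annotation.
run(["stringtie", "sample.bam", "-G", "annotation.gtf", "-o", "sample_assembled.gtf"])

# 4. Gene-level read counting for count-based DE tools such as edgeR or DESeq2.
with open("sample_counts.txt", "w") as out:
    subprocess.run(["htseq-count", "-f", "bam", "-s", "no", "sample.bam", "annotation.gtf"],
                   stdout=out, check=True)
```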
4.1 Tools and methods for RNA-Seq differential expression analysis
Accurate quantification of gene expression and detection of DEGs and DETs are non-trivial [
74,
75] due to (i) biases and errors inherent in NGS technology [
76–
78], (ii) biases of abundance measures due to the effects of nucleotide composition and the varying length of genes or transcripts [
79,
80], (iii) undetermined effects of both sequencing depth and the number of replicates, (iv) the mixture of technical and biological variation, and (v) the existence of alternative gene isoforms and overlapping sense-antisense transcripts [
72]. Considerable effort has been made to address these difficulties [
72,
81,
82]. In early RNA-Seq studies lacking biological replicates, the distribution of feature counts across technical replicates was reported to be well fitted by a Poisson distribution, in which the variance equals the mean [
76,
83]. However, when biological replicates are included in RNA-Seq studies, the Poisson distribution underestimates the variation seen in many studies [
84,
85], a problem known as overdispersion. Several methods have been proposed to account for overdispersion in RNA-Seq differential expression analysis, including Auer
et al.’s (2011) [
43] two-stage Poisson model based on quasi-likelihood, the negative binomial (NB) distribution [
35,
36], and non-parametric methods such as NOISeq [
40] and SAMseq [
38]. Among these methods, the NB distribution has become the dominant approach for modelling feature counts in RNA-Seq data [
35,
36,
80] due to its capability to account for both technical and biological variance. A number of software tools have been developed based on the NB distribution, including DESeq [
36], DESeq2 [
37], edgeR [
35], and baySeq [
39], among others.
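The overdispersion problem can be illustrated with a short simulation (toy example; the parameter values are arbitrary and the moment-based dispersion estimate is only analogous in spirit to what edgeR or DESeq actually fit). Counts are drawn from a Gamma-Poisson (i.e., negative binomial) model, so the observed variance greatly exceeds the mean expected under a pure Poisson model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated counts for one gene across biological replicates: the true mean varies
# between replicates (biological variation), which a Poisson model cannot capture.
true_means = rng.gamma(shape=5, scale=40, size=8)     # replicate-specific expression
counts = rng.poisson(true_means)                      # Gamma-Poisson mixture = NB counts

mean, var = counts.mean(), counts.var(ddof=1)
print(f"mean = {mean:.1f}, variance = {var:.1f}")      # variance >> mean: overdispersion

# Method-of-moments estimate of the NB dispersion phi, where var = mean + phi * mean^2,
# analogous in spirit (not in detail) to the dispersion parameter modelled by NB-based tools.
phi = max(var - mean, 0) / mean**2
print(f"estimated dispersion phi = {phi:.3f}")
```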
Although most existing tools were developed for differential expression analysis at the gene level, it is worth noting that Cufflinks-Cuffdiff [
11] and its upgraded version Cuffdiff2 [
72] implemented a more ambitious method for transcript-level differential expression analysis. Cuffdiff2 estimates count variances for each transcript among biological replicates under a beta negative binomial model of fragment count variability [
72]. Another software package, RSEM, computes maximum likelihood abundance estimates at transcript-level resolution using the Expectation-Maximization algorithm for its directed graphical model [
69]. A key feature of RSEM is that it only requires the user to provide a set of reference transcript sequences, such as one produced by a
de novo transcriptome assembler, which allows for RNA-Seq analysis of species for which only transcript sequences are available [
69]. Ballgown is another recently developed software tool that performs linear model-based differential expression analysis at transcript-level resolution [
73]. It also offers functionality for visualization of the transcript assembly on a gene-by-gene basis and extraction of abundance estimates for exons, introns, transcripts or genes [
73].
Differential expression analysis at transcript-level resolution is unquestionably an ideal approach, as all RNA-Seq reads originate from transcripts, whereas gene-based analyses represent a combination of all isoforms at the same gene locus. One simple scenario that illustrates this point is when two isoforms are differentially expressed in different directions (i.e., one isoform is up-regulated and the other isoform is down-regulated), in which case one may not detect any gene-level differential expression (see the numeric sketch below). However, a key challenge in transcript-level quantification from RNA-Seq data is that lists of transcripts are incomplete, even for well-studied model organisms. As a consequence, if a gene has novel isoforms, then RNA-Seq reads originating from these isoforms may be assigned to other known isoforms, leading to incorrect quantification of those known isoforms.
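The following tiny numeric sketch (hypothetical counts) illustrates the isoform-cancellation scenario mentioned above:

```python
# Hypothetical mean counts for two isoforms of the same gene in two conditions,
# illustrating how opposite isoform-level changes can cancel out at the gene level.
control   = {"isoform1": 300, "isoform2": 100}
treatment = {"isoform1": 100, "isoform2": 300}   # isoform1 down 3-fold, isoform2 up 3-fold

gene_control   = sum(control.values())     # 400
gene_treatment = sum(treatment.values())   # 400
print(gene_treatment / gene_control)       # 1.0 -> no gene-level differential expression
```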
The field of differential expression analysis, although maturing, is still growing quickly and new software tools are continuously being developed. A few comparison studies have been reported to evaluate the performance of different RNA-Seq software tools. Soneson and Delorenzi [
86] evaluated 11 software packages (DESeq [
36], edgeR [
35], NBPSeq [
42], TSPM [
43], baySeq [
39], EBSeq [
44], NOISeq [
40], SAMseq [
38], ShrinkSeq [
87] and two versions of limma [
36,
41]) mainly based on simulated data sets and concluded that the method of choice in a particular situation depends on the experimental conditions. Rapaport
et al. [
88] evaluated six of the most commonly used differential expression software packages (Cuffdiff [
89], edgeR [
35], DESeq [
36], PoissonSeq [
90], baySeq [
39], and limma [
41]) by considering a number of key features, including normalization, accuracy of differential expression detection and differential expression analysis when one condition has no detectable expression. They found significant differences among the methods, but comparable performance was found between array-based methods (e.g., limma) adapted to RNA-Seq data and methods specially designed for RNA-Seq (e.g., edgeR). Seyednasrollah
et al. [
91] performed a systematic comparison of eight widely used software packages (edgeR [
35], DESeq [
36], baySeq [
39], NOIseq [
40], SAMseq [
38], limma [
41], Cuffdiff2 [
72] and EBSeq [
44]) for detecting differential expression between sample groups, focusing on measures that are of practical interest to researchers when analysing RNA-Seq data sets, including the number of DEGs identified using different numbers of replicates, their consistency within and between pipelines, the estimated proportion of false discoveries and the runtimes. They found marked differences among software packages and concluded that the number of replicates and the heterogeneity of the samples should be taken into account when selecting the analysis pipeline [
91]. Zhang
et al. [
75] have recently demonstrated that edgeR outperforms DESeq and Cuffdiff2 using both real and simulated RNA-Seq data sets, considering the number of replicates, sequencing depth, and balanced vs. unbalanced sequencing depth within and between groups. These comparison studies have provided useful guidelines for proper study design and selection of a suitable software tool for RNA-Seq differential expression analysis. However, new software tools such as DESeq2 [
37] and Ballgown [
73] have since been developed and most existing tools have been upgraded (typically resulting in improved performance). The fast-growing number of new tools and the active development of existing tools also make it difficult to choose the best (or the most suitable) software tool for differential expression analysis in a given RNA-Seq study, though edgeR and limma were previously reported to perform well under many circumstances compared with others [
75,
91].
4.2 Key factors in study design: sequencing depth and sample size
Sequencing depth and sample size are two key factors that affect differential expression analysis. Zhang
et al. [
75] have shown that the performance of Cuffdiff2 is sensitive to sequencing depth, whereas DESeq and edgeR appear relatively stable and thus are a better choice for differential expression analysis when sequencing depth is low (i.e., fewer than 10 million reads). There is evidence that the number of DEGs discovered in RNA-Seq studies is positively correlated with sequencing depth [
40,
75], suggesting a strong effect of sequencing depth on differential expression analysis. Unbalanced sequencing depth between groups can also have negative effects on the performance of differential expression analysis for some software tools [
75].
Another key factor for RNA-Seq differential expression analysis is the sample size (i.e., the number of biological replicates in each group). In theory, one would expect an increase in statistical power for the identification of DEGs with an increasing number of biological replicates, and indeed a positive correlation between DEGs and the number of biological replicates has been reported by Seyednasrollah
et al. [
91] and Zhang
et al. [
75]. However, different versions of software tools may have opposite effects on the correlation between DEGs and the number of biological replicates; for example, Seyednasrollah
et al. found that with a different version of Cuffdiff2 the number of detected DEGs decreased as the number of samples increased [
91].
Since budgetary constraints are common with RNA sequencing, an optimal experimental design needs to balance the sequencing depth for each sample with the number of replicates for each group. The consensus position of many studies [
75,
88,
92] is that the overall impact of the sequencing depth is not as critical as sample size, and thus including sufficient biological replicates should be the prime consideration for RNA-Seq study designs. The required number of biological replicates depends on a number of factors, including the amount of biological variation in the samples to be sequenced. Several studies have suggested that 4−6 biological replicates from inbred mouse cell populations, and at least 14 biological replicates from human cell lines (unrelated individuals in the same ethnic group), are required for RNA-Seq differential expression analysis [
75,
91]. Larger sample sizes are likely to be required for animal/human tissue samples compared to cell lines or cells from inbred lab strains. However, to determine the optimal number, more gold-standard datasets and comprehensive evaluations based on these datasets are required to guide future RNA-Seq study designs.
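As a rough illustration of how the number of biological replicates affects the ability to detect differential expression, the following toy simulation (arbitrary parameter values; a t-test on log counts stands in for a dedicated RNA-Seq method, so the absolute power values should not be taken literally) estimates detection power for a fixed 2-fold change as the number of replicates per group increases:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate_power(n_reps, fold_change=2.0, mean=200, dispersion=0.1, n_sim=2000):
    """Fraction of simulations in which a 2-fold change is detected at p < 0.05.
    Counts are drawn from a negative binomial (Gamma-Poisson) model; a t-test on
    log counts stands in for a dedicated RNA-Seq method, purely for illustration."""
    detected = 0
    for _ in range(n_sim):
        mu1, mu2 = mean, mean * fold_change
        g1 = rng.poisson(rng.gamma(1 / dispersion, mu1 * dispersion, n_reps))
        g2 = rng.poisson(rng.gamma(1 / dispersion, mu2 * dispersion, n_reps))
        p = stats.ttest_ind(np.log1p(g1), np.log1p(g2)).pvalue
        detected += p < 0.05
    return detected / n_sim

for n in (2, 4, 8, 14):
    print(n, "replicates per group -> estimated power", simulate_power(n))
```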
5 FUTURE DIRECTIONS
5.1 Long-read RNA-Seq
Long reads have greater potential than short reads at many levels. For transcriptomic analysis with a genome reference, long-read RNA-Seq data has greater power than short-read data to (i) unambiguously map to the reference genome [
51,
93], (ii) detect indels and structural variations, especially for variants in repeat regions [
94], (iii) produce full-length transcripts without assembly [
95,
96], (iv) resolve transcriptional complexity for gene loci with a massive number of isoforms and/or antisense transcripts [
96,
97], and (v) detect allele-specific expression and allele-specific AS patterns [
93].
Roche 454 was the first high-throughput sequencing platform offering long-read sequencing using the pyrosequencing technology and sequencing-by-synthesis approach [
98]. It can generate relatively long reads of up to 1 kb (average read length of 450 bp with the Roche 454 FLX Titanium sequencer [
78]). The use of 454 sequencing has led to a better understanding of the structure of the human genome [
99] since its launch in 2005, enabling the first non-Sanger sequence of an individual human [
100] and opening up new approaches for transcriptomic studies [
101].
The PacBio SMRT (single molecule real-time) sequencing platform, also known as one of the third-generation sequencing platforms, has been pioneered by Pacific Biosciences [
102,
103]. PacBio SMRT sequencing is built upon several key innovations (i.e., zero-mode waveguides and phospholinked nucleotides) that harness the natural process of DNA replication and enable real-time observation of DNA synthesis [
102]. It offers long-read sequencing with an average read length > 10 kb, and a proportion of reads longer than 60 kb. Despite the relatively high error rate associated with PacBio SMRT technology, the SMRT sequencing platform has achieved many successful applications in the RNA-Seq field since its commercial launch in early 2011, including but not limited to obtaining comprehensive gene sets for non-model eukaryotes [
95], characterization of full-length alleles in complex gene loci [
96], and resolving the transcriptomic complexity [
104]. Another important innovation based on the PacBio SMRT platform is Iso-Seq (isoform sequencing; http://www.pacb.com/applications/rna-sequencing/), a method for the production of complete and unbiased full-length complementary DNA (cDNA) sequences without transcriptome reconstruction. This approach provides accurate information about alternatively spliced exons, transcriptional start sites and alternative polyadenylation sites directly from sequencing.
The Oxford Nanopore Technologies MinION offers a new approach for long-read sequencing. MinION uses nanopore sequencing technology, which can discriminate individual nucleotides by measuring the change in electrical conductivity as DNA molecules pass through a nanopore [
105,
106]. As the first commercially available sequencer that uses nanopores, MinION offers read lengths of tens of kilobases, with theoretically no instrument-imposed limitation on the size of sequenced reads [
107]. An important feature of nanopore sequencing is that the sequencing process does not rely on DNA replication. It has the advantage of reading full-length molecules in real time and has the potential for sequencing RNA without conversion to cDNA, which is extremely attractive because it may allow recognition of modified RNA bases during real-time sequencing and therefore shed light on the types and putative functions of RNA modifications. Currently, direct sequencing of RNA using nanopore technology is yet to be developed, but this development is expected in the near future.
The broad application of long-read sequencing is currently constrained by relatively high error rates of sequenced nucleotides and relatively high sequencing and computational costs [
108] (e.g., it was estimated that 32.86 CPU years would be required to process the PacBio raw reads for error-correction-overlap at ~44X sequencing coverage in Pendleton
et al.’s study [
108]), compared to short-read sequencing. Nonetheless, it is foreseeable that long-read sequencing will play a more important role in future RNA-Seq studies.
5.2 Single-cell RNA-Seq
Cells are the basic units of biological structure and function. Each tissue is a mixture of different cell types, and these subpopulations, or indeed individual cells in a single subpopulation, may have temporal and spatial variation in gene expression. There is a growing demand for single-cell profiling that is driven by the need for (i) direct analysis of rare cell types or cells with insufficient material for conventional RNA-Seq protocols, (ii) identification of cell subpopulations in tissues [
109] and (iii) profiling interesting subpopulations of cells from a heterogeneous population [
110]. To fully understand how complex tissues work in development and physiology, it will be essential to study transcriptional programs at single-cell resolution.
With the application of RNA imaging techniques such as RNA-FISH (fluorescent
in situ hybridization targeting ribonucleic acid molecules), single-cell measurements of gene expression are now possible. Previous studies have provided important insights into the dynamics of transcription and cell-to-cell variation in gene expression [
111–
113]. However, such approaches can only examine the expression of a small number of genes in each experiment, thus restricting our ability to perform transcriptome-wide examinations of gene expression and co-expression patterns.
Recent technological advances have enabled RNA-Seq whole-transcriptome analysis of a single cell [
114]. Several such methods for profiling single cells have emerged, such as CEL-Seq [
115], Smart-seq2 [
116] and MARS-Seq [
7]. Typically, these methods first separate the cells by fluorescence-activated cell sorting (FACS) [
6] or microfluidics [
8], and then amplify each cell’s transcriptome separately for RNA-Seq, typically profiling hundreds to a few thousand cells in one experiment. To overcome the low-throughput issue, two droplet-based RNA-Seq approaches, inDrop RNA-Seq [
117] and Drop-Seq [
118], have recently been developed to enable fast profiling of the transcriptome for thousands of individual cells. Both approaches encapsulate cells into droplets and use novel barcoding strategies to match each mRNA to its cell of origin; inDrop RNA-Seq uses a microfluidic platform for droplet barcoding whereas Drop-Seq uses a split-pool synthesis approach to generate large numbers of distinctly barcoded beads that are delivered into individual droplets [
117,
118]. Klein
et al. also claimed that the inDrop RNA-Seq method has a theoretical capacity to barcode tens of thousands of cells per run [
117], which will be important for its future application for profiling large populations of cells when sequencing throughput is high enough to afford multiplexing tens of thousands of cell samples in a single run. Meanwhile, G&T-Seq offers a powerful method for simultaneously sequencing a single cell’s genome and transcriptome, thereby enabling direct identification of genetic variations and their effect on gene expression at single-cell resolution [
104]. Macaulay
et al. demonstrated the power of G&T-Seq by sequencing the genome and transcriptome of single cells in parallel, revealing cellular properties that could not be inferred from DNA or RNA sequencing alone.
Methods and tools for single cell RNA-Seq analysis are only now beginning to emerge. Pollen
et al. reported an analysis strategy for unbiased analysis and comparison of cell populations from heterogeneous tissue by microfluidic single-cell capture and low-coverage sequencing of many cells using existing tools [
119]. Meanwhile, Trapnell
et al. reported a toolkit called Monocle, an unsupervised algorithm that increases the temporal resolution of transcriptome dynamics using single-cell RNA-Seq data collected at multiple time points [120]. Another recent method called scLVM (single-cell Latent Variable Model) has been developed to tease apart different sources of gene expression heterogeneity in single-cell transcriptomes, in particular that due to cell cycle-induced variation [
109]. Interest in single-cell RNA-Seq is growing rapidly. It is foreseeable that single-cell RNA-Seq will significantly accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution, revolutionizing our view of the transcriptome. In addition, the integrated analysis of a cell’s transcriptome, genome and eventually epigenome will enable a more complete understanding of the molecular machinery of cells and how this relates to higher order phenotypic variation.