INTRODUCTION
One of the most popular genomics analyses involves identifying genetic and epigenetic variants and genotyping the variants genome-wide in large populations of interest as has been widely implemented to phenotype prediction, gene mapping and isolation, assessment of population diversity in genetics, ecological and evolutionary biology [
1–
3]. This is made practical in fast developing techniques of sequencing technology during the past decades. Supplementary Table 1 summarizes the major technical features of three generations of genome sequencing variant assay [
4].
This article is dedicated to the Special Collection of Recent Advances in Next-Generation Bioinformatics (Ed. Xuegong Zhang).
The Sanger sequencing technique, which was developed in 1977 and based on the principle of selective incorporation of chain-terminating dideoxynucleosides by DNA polymerase [
5], has dominated the DNA sequencing for about 25 years. The major limitation of the Sanger sequencing in throughput and efficiency urges emerging new alternatives. Microarray-based DNA sequencing technique represents the first “poster-Sanger” method for simultaneously profiling the DNA sequencing polymorphisms at genome-wide candidate loci. It has greatly facilitated genome-wide association studies (GWAS) and QTL mapping [
6] through accurately genotyping a large population at a high-density set of discontiguous genomic coordinates. The microarray technique requires design of probes as tiling representation of the reference sequence of the genome of interest [
7]. There are two key commercial producers providing microarray chips that satisfy different needs of SNP detecting and genotyping in plants or crops species (Supplementary Table 1), for example, Affymetrix has developed chips for crop species including barley, cotton, lettuce, maize, pepper, rice, rose, soybean, strawberry, wheat etc., and Illumina developed chips for maize etc. Although the microarray technology enables the efficient genotyping of large populations in parallel at tens of thousands of SNPs, its relying on the prior knowledge of genomic sequence of the organism under question to design probes has seriously restricted its application to the species whose genomic sequence and polymorphic sequence are yet properly annotated.
Techniques of the next-generation sequencing (NGS) matured in 2005 and have become more and more popular experimental tool for genomic, transcriptomic and epigenetic analyses in plant and crop species [
8,
9] without need of prior information of the genome sequence. The techniques confer efficient re-sequencing of the genome, transcriptome and epigenome of any diploid species, with high throughput polymorphism detecting and genotyping as a typical example of utilities of the NGS techniques in addition to their many other uses. The whole genome re-sequencing has been implemented in various plant species including
Arabidopsis, potato (
solanum tuberosum), rice and maize [
10–
13] and the cost of the re-sequencing analysis has continuously declined over the past decade. But it is still infeasible and highly costly to implement whole genome DNA sequencing for genotyping large populations of most plant and crop species, which generally have a large genome size, from many millions to even billions of base pairs [
14,
15]. This report focuses on design and implement of optimized protocols of sequencing based genetic and epigenetic detecting and genotyping in large populations of plant and crop species.
REDUCED-REPRESENTATION SEQUENCING STRATEGIES FOR GENOTYPING LARGE PLANT POPULATIONS
Basic idea behind sequencing based genotyping large populations is reduced representation of genome to be sequenced, addressing the problem of cost limitation of the whole genome sequencing. Supplementary Table 2 summarizes features of 3 different strategies of reduced representation sequencing.
The first “reduced-representation” sequencing approach is to re-sequence transcriptome rather than genome to identify the genetic polymorphisms only in transcribed genomic regions. Compared to the large size of genome in eukaryotic species, the size of transcriptome is significantly reduced. Thus, amount of short-read data required and associated cost can be dramatically reduced. In the past five years, RNA-sequencing has been implemented to identify tens to thousands of genetic variants in human, cottonwood and potato genomes [
16–
18]. Because expression of genes in different samples can vary widely and change dynamically, the RNA-seq based approach bears several key inherent weaknesses. For example, the allelic imbalances in RNA-seq data can significant affect the analysis of the nuclear genotype [
19,
20]. Furthermore, only the genetic polymorphisms in commonly expressed regions can be identified in most of the sequenced samples. For those locating in tissue-specific expressed regions (or development stage specific expressed regions), they can be detected only in those few specific samples. In 2011, Christodoulou
et al. introduced a method that tried to enable efficient sequencing of low-abundance RNAs and to decrease the proportion of reads from highly expressed RNAs. However, normalizing range of dynamic expression in NGS is not an easy task [
21].
To ease the problem caused by differential expression of genes in transcriptome, the hybrid capture sequencing approach was proposed to extract a small part of whole genomic DNA [
22]. If nucleotide sequence information for regions of interest is known, hybrid-capture probes are designed to bind to regions of interest and these regions can be directly isolated for sequencing [
23,
24]. Recently, Uitdewilligen
et al. (2013) designed 57,054 oligonucleotide probes to capture DNA fragments for 807 potato genes and inferred sample genotypes from the sequence data [
25]. An obvious technical hurdle to this hybrid capture sequencing strategy is requirement of genomic sequence information to enable design of a large number of unique capturing probes for surrogating the predefined genome regions. The capture sequencing strategy could receive highly accurate genotyping result, but may be inappropriate when population under question is considerably divergent from the probe designing reference. The probes thus designed may poorly bind to the regions of interest and result in severe biases in capturing the target regions [
2]. Furthermore, the DNA may not be evenly presented by the targeted genomic regions. For instance, the extracted DNA is generally enriched in targeted regions with high GC content, due to the more stable binding [
26].
From 2008, a new “reduced-representation” sequencing strategy referred as “RAD-seq” has been widely proposed for genotyping large population in several animal and plant species [
27–
31]. This RAD-seq strategy mainly combines genotyping-by-sequencing with restriction-enzyme sites associated DNA fragments (RAD) to make this goal considerably more time and cost-effective in comparison to conventional platforms such as microarray based genotyping technologies, RNA-seq and capture sequencing. It can identify and score hundreds to thousands of genetic markers randomly distributed across the interested genome, from a group of sequenced samples. Although there are several modified forms of the original RAD-seq method (Table 1), they still share some common and significant drawbacks. For example, DNA samples for genetic or genomic analyses are usually collected from leaves to ensure gain of both the integrity and quantity of DNA extracted. However, the current RAD-seq methods ignored a fundamental fact that there are a large copy number (1,000~10,000) of chloroplast sequence and rRNA genes in plant leaf cells [
31–
34]. Without properly dealing with the issue, the chloroplast DNA and/or rRNA genes may occupy a substantially large proportion (up to 60%) of sequence reads obtained from the standard RAD-seq protocol and their modified versions. This will, in turn, significantly dilute coverage of the sequence reads across the genome under question. To address this issue, we described here an optimized RAD approach whose comparison to other versions of RAD-seq was summarized in Table 1.
Table 1 summarized the type of enzyme, the number of restriction enzymes, need of segment size selection, removal of DNA of chloroplast and RNA genes, multiplex level of sample pooling for four main representatives of the RAD-seq methods and the optimized RAD-seq protocol described here. The optimized RAD-seq method integrated bioinformatic analysis and a computer guided sequencing library construction design in order to maximize uniformity and coverage of sequence reads across the genome under question and to minimize presentation of chloroplast and rRNA DNA in the sequence reads.
AN OPTIMIZED RAD-SEQ STRATEGY FOR GENOTYPING LARGE PLANT POPULATIONS
Based on re-analysis of the current RAD-seq experiments summarized, we observed that a large proportion (30%~60%) of obtained short sequence reads were mapped to chloroplast sequence and rRNA genes [
32–
34]. And, this represents a serious waste of high-throughput sequencing capacity and resources since these reads are typically directly discarded for downstream analysis. Furthermore, it is crucial to achieve high recovery of DNA fragments within a specified size range during the size-selection step. However, existing RAD-seq approaches have generally implemented traditional manual gel purification to extract the targeted DNA fragment. This method is obviously superficial, has low reproducibility, and can introduce low molecular weight contamination [
35]. In 2012, Peterson et al. introduced the use of Pippin-Prep automated size-selection technology to precisely extract the targeted DNA fragments [
36]. Furthermore, Quail
et al. introduced the double size selection strategy for NGS library construction which could give tighter and more uniform size distribution for selected DNA fragments among the desired size ranges [
37]. We described here how this size selection may be effectively implemented in the optimized RAD-seq approach.
The optimization was made to maximize the number of genomic DNA sequence reads that uniformly cover a plant genome under study and to minimize the number of sequence reads representing chloroplast DNA and rRNA genes. The optimized approach provides both a generic bioinformatics tool and a library construction protocol. The bioinformatics pipeline was developed in a user friendly manner to work out the cutting sites from different combinations of restriction enzymes (REs) from 269 possible REs across a given plant genome of interest, and then to determine the optimal combination of REs. The simulation based bioinformatic analysis requires sequence information of the genome and information of the cutting sites of all 269 possible REs. The 4 popular REs (
EcoR I,
Hind III,
Msp I and
Mse I) could be selected for the first round of
in silico digestion to make the optimal combination of REs that satisfies the following criteria: (i) recovery of the largest possible number of genomic regions within the target size range required by Illumina sequencing (224 – 424 bp). This number should be at least as large as the predefined number based on consideration of the experimental objectives, including the coverage required per sample, the number of samples to be sequenced, and the density of DNA polymorphisms to be targeted; and (ii) minimization of the number of DNA fragments recovered from chloroplast sequence and rRNA genes. For the second round of digestion, we employed the 269 commercially available REs (http://insilico.ehu.es/restriction/main/index2.php) to achieve the optimal combination of REs for completely removing the DNA fragments from chloroplast sequence and rRNA genes while keeping the maximum number of selected genomic DNA fragments almost intact. Thus, these two rounds of RE digestion are designed to not only cut genomic DNA sequence into pre-designed fragments but also to remove the large proportion of DNA fragments from chloroplast sequence and rRNA genes during RAD-seq library construction. Here, bioinformatics tools and the corresponding computer programs were generic in design and compilation, enabling their use in the design of RAD-seq experiments without needing specialized bioinformatics or computational skills. Experimentally, we proposed here a double size-selection strategy using the Pippin Prep automated size selection technology, in order to reproducibly extract DNA fragments with a pre-defined size range from the bioinformatic prediction, and at the same time, to enable removal of dimers from the RAD-seq libraries [
35,
37].
The workflow of library construction for the optimized RAD-seq protocol was shown in Figure 1. In brief, it consists of the following 6 steps: (i) first round of digestion to cut the DNA sequence of each sample; (ii) ligating the barcoded adapters to RE cut sites; (iii) pooling equimolar amounts of DNA fragments for each sample and carrying out precise size-selection using the Pippin Prep automated size selection platform; (iv) second round of digestion to remove the selected DNA fragments from chloroplast sequence and rRNA genes; (v) PCR amplification using Illumina primers for pooled samples; (vi) second round of size-selection using Pippin Prep to ensure high recovery of DNA fragments in the required size range and remove adapter dimers.
To experimentally validate the optimized RAD-seq approach, we have constructed several pooled sequencing libraries from leave tissues from parental lines and their segregating offspring of both diploid and tetraploid Arabidopsis and potato. The sequence data analysis shows that the optimized RAD-seq approach designed for the Arabidopsis and potato genomes can effectively remove DNA fragments derived from chloroplast sequence and rRNA genes, and the short reads collected are mostly concentrated onto the targeted genomic regions. A balanced representation of sequence reads was obtained from across all pooled samples. Generally, after the second round of digestion, proportion of reads mapped to chloroplast and rRNA genes could significantly reduce from 60% to less than 10%. Meanwhile, these short-reads could precisely and consistently map to the selected genomic regions among different samples. These features demonstrate the robustness and efficiency of the optimized RAD-seq approach developed here and further indicate that one can feasibly design and effectively implement the protocol to achieve expected coverage of polymorphic sequence markers for large plant populations given information of the genome sequence and ideally, though not necessarily, information of the sequence polymorphism distribution in the genome.
RAD BASED BISULFITE SEQUENCING FOR PLANT SPECIES
DNA methylation plays a fundamental role in the regulation of gene expression and is widespread in the genome of eukaryotic species. Currently, bisulfite sequencing (BS) approach is the sole and standard method that can accurately measure the DNA methylation at both single-base resolution at a genome scale [
38]. BS method uses bisulfite treatment in combination with high-throughput sequencing to draw the most complete picture of a DNA methylome. It has been successfully applied to elucidate the methylomes of human cells as well as of other species such as mouse,
Arabidopsis, maize [
39–
42]. But, as the same as whole genome sequencing, the genome wide BS still bears the high cost and need for exceptional depth of sequencing, and this hinders BS for profiling large populations. In 2005, Meissner
et al. firstly proposed “reduced representation bisulfite sequencing” (RRBS) strategy for covering only 1% of the mouse genome [
43]. This approach basically combines “
Bgl II” restriction-enzyme sites associated DNA fragments (RAD) with the standard BS platform to enrich for the regions of genome that have a high CpG content. And, this RRBS approach has been widely applied to the DNA methylation studies for several mammal species such as human and mouse genomes [
43–
46].
The idea of the optimized RAD-seq method aforementioned can be extended to develop a RRBS method specific to plant and crop species. Basically, the plant/crop specific RRBS method is proposed to minimize presentation of chloroplast and rRNA DNA in the sequencing library and, in turn, in sequence reads finally generated. Figure 2 illustrates diagrammatic procedure of a plant/crop specific RRBS experiment with Arabidopsis as an experimental model. Compared to the optimized RAD-seq experiments, the optimized RRBS experiment differs in the following aspects.
Firstly, in the first round of digestion of the RAD-seq, REs were selected from EcoR I, Hind III, Msp I and Mse I. In the RRSB, we used 9 methylation insensitive restriction enzymes in the simulation prediction of the cutting sites across the genome under study so as to determine the optimal combination of REs in the RRBS-seq experiment (Table 2). The first 5 are 6 bp or 5 bp REs and can identify “cytosine” sites in any plant genome. This ensures at least one “CpG”, “CHH” or “CHG” site in every selected DNA fragment. The remaining 4 except for “Mse I” are 4 bp REs and can identify “CpG” sites. Thus, use of these REs ensures the selected DNA fragments could detect at least one “CpG” site. Thus, use of the above the 9 REs guarantees 2 candidate methylation sites (“CpG”, “CHH” or “CHG”) in every DNA fragment selected for sequencing library construction. The numbers of produced DNA fragments varied markedly over different REs’ combinations when applied to shear the Arabidopsis genome (Table 3). The current Illumina Hiseq sequencer generates short reads with length varying among 2×100 bp, 2×125 bp and 2×150 bp. Thus, we determined the size range of selected DNA fragments from 224 bp and 424 bp. The numbers of selected DNA fragments for different REs’ combinations were shown in Table 3. For the RRBS-seq experimental design, we will be able to detect the methylation events from about 20,000 small DNA regions evenly distributed across the whole Arabidopsis genome. To reach this objective, we selected 4 candidate combinations of Res for the first round of digestion to shear the Arabidopsis genome: “Cac8 I/Taq I”, “Fnu4H I/Taq I”, “BsaJ I/Cac8 I/Taq I” and “BsaJI/DdeI/MseI”. For these 4 different REs’ combinations, each of REs’ combinations could produce enough genomic DNA fragments (>30,000) with length from 224 bp to 424 bp while the number of selected DNA fragments from chloroplast sequence and rRNA genes was controlled at a relatively low level (Table 3).
From the selected 4 RE combinations, we found that there were 40~53 DNA fragments from the rRNA genes and chloroplast sequence in the selected genomic DNA fragments. Due to an extremely high copy number of rRNA genes and chloroplast sequence, a very large proportion of reads could be generated for these undesirable segments from the RRBS-seq sequencing (Table 4A). In order to remove these undesirables, we searched the cutting sites of 269 different restriction enzymes in the selected DNA fragments for the optimal combination to minimize the DNA fragments representing the chloroplast sequence and rRNA genes at the same time to keep the number of the selected genomic DNA fragments as much intact as possible after the second round of digestion (Table 5).
The analysis presented above recommends use of the “BsaJI/DdeI/MseI” combination for the first round digestion for shearing Arabidopsis genome and the “Xmn I/Pct I/Tfi I” combination in the second round digesting to remove the DNA fragments representing the chloroplast sequence and rRNA genes. After the two rounds of digestion, there were about 15,011 genomic DNA fragments remained as sequencing templates but 2 undesirable fragments from the chloroplast sequence. Here, we carried out numerical analysis to work out proportion of short-reads that might be mapped on the chloroplast genome. It is clear from Table 4B that about 93% and 96% of sequence reads generated from the design would be mapped to the genomic DNA in the Arabidopsis diploid and tetraploid species respectively, leaving only 7% and 4% of the undesirable sequence reads for the two species.
Furthermore, the 15,011 selected genomic DNA fragments would cover 3.7 Mb genomic regions (based on 2×125 bp paired-ends sequencing strategy). For size of the Arabidopsis genome (119 Mb), the sequence reads account for 3.1% (3.7/119) of the whole genome. Furthermore, we explored distribution of detectable “candidate methylation sites” among 15,011 selected genomic DNA fragments. Table 6 indicates the number of detectable “candidate methylation sites” in each of the Arabidopsis chromosome and shows there would be a total of 718,904 candidate methylation sites to be detected through the RRBS-seq experiment designed. Compared to the total 21,468,778 candidate methylation sites in the Arabidopsis genome, 3.4% (718,904/21,468,778) of the whole candidate methylation sites would be targeted in the RRBS experiment as summarized in Figure 3, which shows that the detected “candidate methylation sites” could randomly and high-density disperse among the whole Arabidopsis genomic sequences.
Secondly, the adapter oligonucleotides in the optimized RRSB have all cytosines (C nucleotides) replaced by 5′methyl-cytosines, so as to prevent deamination of these cytosines during the bisulfite conversion reaction.
Thirdly, in bisulfite conversion reaction, the unmethylated cytosine is deaminated into a uracil in the selected DNA fragments. Here, the bisulfite conversion efficiency is extremely important for the RRBS experiments. With a low efficiency of the bisulfite conversion, biases may be introduced into the methylation profiles. In the optimized RRBS experiment, we used the commercial “EZ DNA methylation-gold kit” (Zymo Research, Orange County, CA, USA) to improve efficiency of the bisulfite conversion reaction. Although it has been proposed that the conversion efficiency of “EZ DNA methylation-gold kit” can be reached to over 99%, we tested performance of the kit for its bisulfite conversion efficiency. The fact that chloroplast DNA in any plant is free of methylation provided us with a unique opportunity to achieve direct assessment of the bisulfite conversion efficiency by deliberately remaining a few chloroplast DNA fragments in the sequencing library. The conversion efficiency can thus be assessed by checking conversion states of the chloroplast DNA fragments.
A pilot experiment designed by the optimized RRBS protocol was carried out for 2 autotetraploid
Arabidopsis parental lines (ABRC: CS3151 and CS3900) and their 4 F2 offspring. The raw sequenced short-reads data showed an even distribution among these pooled samples: the proportion of reads from each sequenced sample ranged from 12.71% to 22.15% (one sample student
t-test,
P-value>0.05; the coefficient of variation=22.49%), only about 7.33% of short-reads were not assigned to samples (Table 7). Furthermore, the raw short-reads data were aligned to the
Arabidopsis reference sequence using bismark [
47]. The alignment result clearly showed that over 55% of short-reads could be successfully aligned to the genome sequence, while only a small minority of reads (3%) aligned to the chloroplast sequence and rRNA genes (Figure 4). These results were consistent with our simulated prediction based on the high copy number of chloroplast and rRNA DNA sequences in
Arabidopsis leaf cells (Table 4B). Thus, the optimized RRBS-seq strategy developed in the present study have been achieved in effectively removing the reads derived from the chloroplast and rRNA DNA and significantly increasing the proportion of short-reads aligned to the genomic sequence. Additionally, we analyzed the proportion of sequenced fragments that fell within the selected size range (224 bp ~ 424 bp). The length distribution of the sequenced DNA fragments for the pooled RRBS-seq dataset was shown in Figure 5, which clearly indicates that more than 95% of sequenced fragments are within the target size range.
DISCUSSION
Next generation sequencing (NGS) techniques enhance the power for sequence-based identification and genotyping sequence variants of genetic and epigenetic markers. This opens tremendous research opportunity for understanding genetic and epigenetic basis of complex genetic traits [
48,
49]. Although the NGS is experiencing a fast decline in costs and provides more robust and informative data, it is still unaffordable for many laboratories to implement the technique for large population genetic and epigenetic analyses. The solution to this problem is the methods so called reduced presentation sequencing such as those described in Geraldes
et al. 2011, Uitdewilligen
et al. 2013, Baird
et al. 2008 [
16,
25,
27]. In fact, it is not necessary to have a complete set of genomic sequence variants for many genetic or genomic analyses. For example, mapping resolution in any genetic linkage analysis depends rather on the number of recombinants between genetic markers than on the density of the markers [
50]. Thus, use of very dense marker maps will not lead to improved mapping efficiency.
Among the reduced presentation sequencing approaches, the restriction enzyme associated DNA sequencing, or RAD-seq in short [
27], gains the most popularity for its simple to implement, cost and time effective and flexibility in designing for various purposes. Thus, it represents an idea choice for large population sequence variant scanning and genotyping.
On the other hand, leaf tissues are usually used to extract DNA samples in plant or crop species. A large copy number of the chloroplast genome and rRNA genes in the DNA samples from leaf tissues inevitably lead to significant presentation in the sequence reads to be generated from the DNA samples. Re-analysis of sequence data from several typical plant RAD-seq experiments shows that these undesirable sequence reads may account for up to 60% of all sequence reads generated [
32–
34]. To address this problem, we describe here the optimized RAD-seq method for large plant population sequence variants screening and genotyping. The method is in fact an
in silico guided design of restriction enzymes used to shear genomic DNA into segments with prior required lengths and to remove DNA segments representing chloroplast and rRNA genes in sequencing library construction. This in combination with accurate DNA segment selection by use of Pippin-Pre confers the evenness of genome coverage of selected and sequenced DNA segments, the flexibility to design for required sequence coverage over the genome of interest and for targeting genomic regions to be sequenced, effective control of undesirable sequence reads representing nongenomic or high copy number genomic DNA. Moreover, the ideas behind the optimized RAD-seq experiment can be extended to achieve cost and time effective genome-wide profiling of DNA methylation through the reduced representation bisulfite sequencing (RRBS) method. We optimistically anticipate that the desirable features of the optimized RAD-seq and RRBS methods will open a new era for genomic and epigenetic profiling of large plant and crop populations.
Higher Education Press and Springer-Verlag Berlin Heidelberg