Introduction
Wheat is one of the most important crops in the world, occupying 17% of the arable land acreage worldwide, feeding about 40% of the world population and providing 20% of the total food calories and protein in human nutrition [
1].
Wheat grain development is a complex process that involves coordination among a number of different tissues [
2]. Most highly expressed genes in the wheat endosperm are related to storage proteins, starch metabolism enzymes and defense proteins [
3]. Seed specificity of gene expression is affected by the precise control of time and space, involving interaction between cis-elements and transcription factors. Wheat storage proteins are predominantly synthesized and stably accumulated in mature endosperm tissue. The main gluten components, prolamins, are encoded by multigene families whose expression is regulated temporally and spatially. This regulation is determined by promoter elements such as the prolamin box, which comprises endosperm and nitrogen elements [
4]. Transcription factors that bind to these sites include endosperm box factor-I, endosperm box factor-II and storage protein activator (SPA) in wheat grain [
5]. Starch is <FootNote>
FT-0
</FootNote>synthesized through the coordinated interactions of several biosynthetic starch enzymes, including ADP glucose pyrophosphorylase, starch synthase, starch branching enzyme and starch debranching enzyme [
6]. Therefore, understanding the molecular mechanisms by which these metabolic pathways are regulated is expected to provide valuable information for improving wheat quality.
The availability of more than one million expressed sequence tags has made it possible to use a variety of high-throughput methods to identify genes that are differentially expressed during grain development [
7,
8]. Moreover, transcriptome sequencing offers a manageable approach to study the complex structure and function of the wheat transcripts and can help in unraveling the genetics of wheat endosperm.
A total of 17949 publicly available expressed sequence tags generated from endosperm specific cDNA libraries have been analyzed and 2237 differentially expressed genes at six different time points (3, 7, 14, 21, 28 and 35 days post anthesis; dpa) were identified. This has elucidated much information about storage proteins, putative defense proteins and proteins involved in starch and sucrose metabolism [
3]. The relative expression abundances of different gene categories in the developing wheat caryopsis from 8 to 40 dpa have been analyzed using the Serial Analysis of Gene Expression method [
9]. Wan studied the transcriptome expression profile of wheat cultivar, Hereward, from 6 to 42 dpa using the Affymetrix wheat GeneChip oligonucleotide arrays and found that during this period, 14550 genes were differentially expressed [
10]. Yang analyzed 201 seed-specific unigenes using digital differential display technology combined with wheat chip expression and carried out the classification and gene ontology (GO) enrichment, revealing that the seed-specific unigenes participated mainly in defense reactions, stress, nutrition storage activity, enzyme inhibition and redox (reduction–oxidation) activity [
11]. Xu sequenced rice, a Gramineae model plant, during different periods of embryonic development (3–5, 7 and 14 dpa) using the Illumina/Solexa sequencing technology and found that a large number of genes related to regulation of metabolism were induced by DNA replication and processing [
12]. Examples include signal transduction genes mainly expressed during the early and middle embryonic development periods, protein accumulation-related genes mainly expressed during the middle embryonic development period, and starch/sucrose metabolism-related genes and protein modification-related genes expressed during the later period of embryonic development.
The Roche/454 GS FLX sequencer provides a useful tool for studying the transcriptome of wheat seed, which uses the pyrophosphate molecule released on nucleotide incorporation by DNA polymerase to fuel a downstream set of reactions that ultimately produces light from the cleavage of oxyluciferin by luciferase [
13]. Roche/454 GS FLX sequencer produces an average read length of 400 bp per sample (per bead), with a combined throughput of 100 Mb of sequence data per 7h run [
14,
15].
Here, the Roche/454 sequencing technology was employed to study the gene expression profiles of bread wheat grain and leaf-stem during grain filling. Both the grain library and leaf-stem library were mixed to assemble as the reference sequences. The leaf-stem served as a control to the grain to obtain a profile of differentially expressed genes in the grain, specifically for mining genes associated with wheat quality.
Materials and methods
Plant materials
The bread wheat cultivar, Nongda211, was grown under natural conditions in Beijing. Plants were tagged at anthesis and sampled at 5-day intervals from 5 to 25 dpa and 28 dpa. The grain, flag leaf and stem below the spike were cut off and quickly frozen with liquid nitrogen separately. Frozen tissues were stored at -80°C.
RNA extraction, cDNA synthesis and sequencing
For RNA extraction, equal amounts of flag leaf and stem below the spike tissues (leaf-stem library) from each stage were pooled. Equal amounts of developing grain (grain library) were also used. Frozen tissue was ground with a mortar and pestle and approximately 100 mg of powdered tissue was sampled. RNA was isolated using the RNA extraction kit (Beijing Autolab Biotechnology Co., Ltd) according to the manufacturer’s instructions. Total RNA was treated with DNase (TURBO DNase; Ambion, Austin, TX, USA) to remove DNA. RNA concentration was measured using a Nanodrop 2000 spectrophotometer (Thermo Scientific, Wilmington, USA). RNA integrity was assessed by 1% agarose gel electrophoresis.
High-quality cDNA was obtained using the SMART PCR cDNA synthesis protocol (Clontech) and checked on a 2% agarose gel to verify cDNA quality and fragment length. The cDNAs distributed between 500 and 5000 bp were collected and polyT was removed by restriction enzyme digestion (Beijing Autolab Biotechnology Co., Ltd). Approximately 3 μg of each cDNA sample was sheared into small fragments via nebulization and sequenced in a single 454 run using a GS-FLX sequencer (Beijing Autolab Biotechnology Co., Ltd).
De novo assembly
The raw reads were filtered to obtain high-quality clean reads prior to assembly, which was performed by removing the adaptor sequences, short sequences (less than 50 bp), contaminated sequences and low-quality sequences containing more than 20% nucleotides in a read with a Q-value ≤10. Clean reads were de novo assembled to generate non-redundant unigenes using the Newbler assembly programs with the default parameters. The non-redundant unigenes were used for further analysis in this study.
Functional annotation
Sequence-based and domain-based alignments were used to compare sequences. Sequence-based alignments were performed against three public databases [the NCBI non-redundant database (NR), the NCBI nucleotide sequences database (NT) and the Swiss-Prot protein database) with NCBI BLAST 2.2.28+ software (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.28/)], using an
E-value threshold of 10
-5 [
16]. Domain-based alignments were performed against the euKaryotic Orthologous Groups database (KOG; http://www.ncbi.nlm.nih.gov/COG/) with an
E-value threshold of 10
-3. Protein structure domains of all unigenes were predicted with the HMMER 3.0 package and hmmscan programs against the Pfam database (http://pfam.sanger.ac.uk/) [
17]. Functional categorization by GO terms was performed based on the best hits from the non-redundant and Pfam databases using BLAST2GO software [
18], with an
E-value threshold of 10
-5. To produce the pathway annotation and identify the BRITE functional hierarchies, sequences were submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) [
19]. KAAS annotated each submitted sequence with KEGG orthology (KO) identifiers, which represent an orthologous group of genes directly linked to an object in the KEGG pathways and BRITE functional hierarchy.
Analysis of differentially expression unigenes
To analyze unigene transcript levels, the uniquely mapped reads for a specific unigene were counted by mapping reads of each library to
de novo assembled non-redundant unigenes using Bowtie and RSEM software [
20]. And the reads per kb per million mapped sequence reads (RPKM) values were computed as recommended by Mortazavi [
21]. After computing the RPKM, the read count of each unigene was obtained. Differences in unigene transcript abundance between the grain and leaf-stem libraries were obtained from normalized read count values using TMM and DEGseq software [
22,
23]. Fold changes for each unigene between the grain and leaf-stem components were computed as the ratio of the read count values. If the read count value of the grain or the leaf-stem was 0, 0.001 was used instead of 0 to measure the fold change. The significance of differential transcript abundance was calculated using the
Q-value (Benjamini-Hochberg corrected
P-value) to justify the
P-value, and only unigenes with a |log
2 (FC)|≥1 and a
Q-value≤0.005 were used for subsequent analysis. The formula to determine the
P-value between the grain and the leaf-stem was:
where N1and N2represent the total number of clean reads mapped to all unigenes in each sample, and x and y represent the number of clean reads mapped to a common unigene in the leaf-stem and the grain libraries, respectively.
Enrichment analysis of DEGs
To assign putative biological functions and pathway involvement to the differentially expression unigenes, enrichment analysis was carried out. The GO enrichment was analyzed with the GOseq method and KEGG pathway enrichment was analyzed with KOBAS (2.0) [
24,
25], using Wallinius non-central hyper-geometric and hyper-geometric tests for statistical analysis, respectively. At first, all unigenes showing significant differences in transcript abundance (differentially expressed unigenes) between two libraries were mapped to the GO and KEGG pathway databases and then the number of unigenes for every GO term and KO term was calculated. The hyper-geometric test was applied to identify significantly enriched GO and KO terms from the set of DEGs. The formula for the gene enrichment test was
where N represents the total number of unigenes with GO and KEGG pathway annotation; n represents the number of DEGs in N; M represents the number of unigenes that were annotated to certain GO or KO terms; and m represents the number of DEGs in M. The initially obtained P-values were then adjusted using a Benjamini–Hochberg correction and a corrected P-value of 0.05 was adopted as a threshold.
Unigenes verified via qPCR
Grain of each stage was ground with a mortar and pestle and approximately 100 mg of powdered tissue was sampled. RNA was isolated using the RNAprep pure plant kit (TIANGEN Biotech, China), cDNA synthesized from total RNA using the PrimeScript RT reagent kit (TAKARA, Japan) and qPCR performed using the SYBR Premix Ex Taq (TAKARA, Japan) according to the manufacturer’s instructions. Primer sequences of unigenes are listed in the Appendix A (Table S1). Real-time PCR was conducted in a 7500 and 7500 Fast Sequence Detection System (Applied Biosystems). The pooled grain served as control. In each run, samples were assayed for both the target genes and ubi gene (endogenous control). Relative expression was calculated using 2
-∆∆CT method [
26].
Results
Transcriptome sequencing and de novo assembly
The bread wheat grain cDNA library and leaf-stem cDNA library were sequenced using a Roche/454 GS-FLX sequencer. Approximately 1.07 and 1.52 G clean reads were generated from the grain and the leaf-stem libraries, respectively. Then the clean reads of the two samples were mixed for assembly as reference sequences. Transcripts shorter than 100 bp were hard to annotate accurately, so only those greater than 100 bp were used in further analysis. A total of 61393 unigenes with an N50 (50% of the assembled bases were incorporated into contigs of 728 bp or longer) of 1814 and an N90 of 750 were obtained. The unigenes had an average length of 1456 bp, a minimum length of 102 bp and a maximum length of 12033 bp (Table 1).
Functional annotation of assembled unigenes
A total of 61393 distinct unigenes were acquired after assembly. Initially, all unigenes were blasted against the NCBI NR, the NCBI NT, the Swiss-Prot, the Pfam and the KOG databases. According to the hits retrieved from the Nr and Pfam databases, GO annotation was conducted. Finally, the KAAS was used to perform KEGG pathway annotation against the KEGG database. Among the 61393 unigenes, 58724 (95.7%) were returned as matches in at least one database and 7883 (12.8%) returned at least one hit in all databases (Table 2).
Global analysis of gene expression
As mentioned above, the clean reads of grain and leaf-stem samples were mixed and assembled as reference sequences. Then the two libraries were mapped individually using RSEM software. The total numbers of mapped reads were 722255 in the grain and 1091068 in the leaf-stem. To measure gene expression, the read count was converted into RPKM, a normalized measure of read density that allows transcript levels to be compared both within and between samples. There were 48869 annotated genes yielded after removing overlapping sequences, ranging from 100 to ≥2000 bp, which provided abundant data for the analysis of grain and leaf-stem. Expression of these genes is summarized in Fig. 1.
To obtain statistical confirmation of differential gene expression, the above 48869 unigenes were chosen for further analysis. The read count was used as an indicator to identify the DEGs for the samples with no biologic replications. Putative DEGs were identified using criteria mentioned in the methods. A total of 22045 DEGs were identified in both libraries, including 7355 upregulated and 14690 downregulated unigenes in the grain (Fig. 2 , Appendix B, Table S2).
Identification and annotation of potential DEGs
To better understand the significance of DEGs in wheat grain, 7355 upregulated genes in the grain were focused on (Appendix C, Table S3). Among the 7355 DEGs, 6899 were assigned to 3727 GO terms (Appendix D, Table S4), providing an overview of the ontology content. Among them, 60 GO terms, the top 20 of three main categories (Fig. 3), were significantly represented using a corrected P-value≤0.05 as the threshold. The terms dominant in the biological process category were carbohydrate metabolic process (GO: 0005975) and cellular component organization or biogenesis (GO: 0071840). Intracellular organelle (GO: 0043229) was dominant in the cellular component category. Alpha-amylase inhibitor activity (GO: 0015066) and hydrolase activity acting on glycosyl bonds (GO: 0016789) were dominant in the molecular function category.
KEGG pathway enrichment analysis was performed to better elucidate the functions of the upregulated DEGs. The KEGG enrichment results were displayed in a graphical form. The degree of KEGG pathway enrichment was represented by the enrichment factor, the Q-value and the number of unigenes enriched in a KEGG pathway. The enrichment factor indicates the ratio of DEGs enriched in this pathway to the total number of annotated unigenes in this pathway. DEGs showing upregulated expression were assigned to 212 metabolic pathways (Appendix E, Table S5) and the top 20 were chosen for further analysis. For upregulated DEGs, biosynthesis of secondary metabolites (500 DEGs), protein processing in endoplasmic reticulum (223 DEGs), ribosome (185 DEGs), and starch and sucrose metabolism (161 DEGs) were the dominant pathways (Fig. 4).
Unigenes related to protein process
In the GO significant enrichment analysis of the upregulated unigenes in grain, six GO terms which were dominant in the main molecular function category represented various enzymes associated with protein metabolism, including endopeptidase inhibitor activity, peptidase regulator activity and others. The GO term of intracellular non-membrane-bound organelle was also strongly represented. Ribosome is one of the non-membrane-bound organelles and the location of protein maturation and processing. In the KEEG pathway enrichment analysis, protein processing in the endoplasmic reticulum was strongly represented. All these showed that proteins were over-represented among the unigenes, which indicated that many proteins were synthesized and accumulated in grains.
Some unigenes were chose to verify via qPCR. Predicted high molecular weight gluten subunit (unigene: isotig52107) and gliadin (unigene: isotig08198) were found to be expressed significantly in the grain. Analysis of qPCR showed that these two kinds of protein were expressed strongly in grain at 15-20 dpa, but at a much lower level in the leaf-stem (Fig. 5a). The same result was observed for some transcription factors, SPA (unigene: isotig43997), wheat prolamin box binding factor (WPBF) (unigene: isotig37169), GAmyb (unigene: isotig21650) (Fig. 5b).
Unigenes associated carbohydrates
From the GO enrichment analysis of the upregulated unigenes in grain, the dominant terms in the main biological process category were almost metabolic processes associated with carbohydrates, including the sucrose metabolic process, glucan metabolic process, starch metabolic process and others. In the molecular function category, 10 of the 20 GO terms indicated significant enrichment of multiple enzymes associated with saccharometabolism, including glucosyltransferase, sucrose synthase and 1, 4-alpha-glucan branching enzyme. The KEGG pathway enrichment analysis of genes upregulated in grain also demonstrated that starch and sucrose metabolism were significantly enriched. As above, genes associated with carbohydrate metabolism were strongly expressed.
Discussion
The Roche/454 GS FLX sequencer has the great strength of producing an average read length of 400 bp per sample (per bead). Compared with Illumina and SOLiD platforms, this is more efficient for
de novo sequencing for plants without a reference genome. Although the physical map of
Triticum urartu and
Aegilops tauschii [
27,
28], the wheat D- and A-genome progenitor, has been accomplished, current knowledge of bread wheat genome sequence is insufficient for functional genomics analysis. Because of the lack of a reference sequence for wheat, the clean reads were
de novo assembled in the study.
The metabolic pathways of the starch, protein and lipid components are clear, but the mechanisms by which they are regulated remain poorly understood. In this study, a transcript profile was obtained and many key genes associated with grain qualities can be identified to study the genetic regulatory mechanisms governing metabolism of nutritionally important substance. Consistent with the function of the endosperm as a storage organ for protein and carbohydrate, these genes encoded mainly storage proteins, enzymes involved in starch metabolism and unknown proteins.
During the wheat grain filling period, an abundance of storage proteins is synthesized and accumulated in the proteinoplast, which illustrates that an efficient regulation network exists in the wheat endosperm. There are three main types of grain seed endosperm-specific promoter conservative cis-motifs: GCN4, PB box and 5′-AACA/TA-3′, which interact with transcription factors of the alkaline leucine zipper, single zinc finger protein and R2R3MYB binding domain types [
29].
Two transcription factors were selected for further study,
SPA and
WPBF.
SPA is a key regulator of transcription of wheat grain storage protein, gluten strength and grain hardness [
30].
SPA is located in the first homologous group of wheat and it belongs to the
Opaque2 transcription factor subfamily [
31-
33].
WPBF belongs to the DNA binding with one finger transcription factor subfamily. The gene encoding WPBF is located in the fifth homologous group of wheat and its product has a single zinc finger domain. WPBF regulates the expression of endosperm functional genes via binding to the prolamin box of endosperm storage protein [
34], which affects the synthesis and accumulation of wheat prolamin.
Genes associated with carbohydrate metabolism were strongly expressed, which was consistent with the fact that starch content in wheat grain accounts for about 53%-70% of the total mass. Intriguingly, in the GO enrichment analysis, one GO term annotated as alpha-amylase inhibitor activity in the molecular function category was the most strongly enriched, with >1000 RPKM. Alpha-amylase inhibitor is a glucoside hydrolase, which is rich in seeds of cereal crops and leguminous crops [
35]. In the endosperm these proteins might serve primarily as defense proteins to protect the seeds from predatory insects [
36,
37]. The genes encoding these proteins have been applied in genetic engineering to improve insect resistance [
38].
A total of 2669 unigenes had no significant matches to any known protein, which might be due in part to their novelty, high divergence or untranslated regions. Moreover, there were many putative proteins of unknown function among the annotated unigenes.
This study provides the first direct performance comparison between two major different organs in bread wheat. Flag leaf, stem below the spike and the developing grain constitute the relationship between source, sink and flow. Transcriptome profiling of these at the grain filling stage in wheat will be helpful in understanding the genes involved the resource flow from flag leaves to the grains.
Overall, transcriptome analysis of gene expression profiles for grain during grain filling in wheat will permit elucidation of the genetic regulation of this process. Therefore, further research to better understand the molecular mechanisms that regulate these metabolic pathways will be warranted.
Conclusions
In this study, global characterization of the transcriptome of grain and leaf-stem tissues in wheat was achieved using Roche/454 GS-FLX sequencing. There were about 1.06 and 1.52 G clean-sequencing reads from grain and leaf-stem generated respectively. The enrichment of GO terms and KEGG pathway analysis were used to describe the overall biological processes upregulated in the grain. The transcriptional patterns of genes involved in protein and carbohydrate metabolism were described as examples of DEG families. These data will facilitate gene discovery and functional genomics studies in wheat.
Higher Education Press and Springer-Verlag Berlin Heidelberg