Introduction
Age is an important developmental cue for trees, and great changes occur as they age. For example, young trees reactivate earlier than old trees with increasing temperature in the spring, thus having a longer period of growth, which indicates that meristem activity is affected by age [
1–
4]. Notably, cambium cell division in locally-heated portions of the
Cryptomeria japonica stem during the dormant period restarts earlier in younger than in older trees [
3], indicating that age affects the activity of the lateral meristem in response to temperature signals. The regular division of meristem cells in shoot and stem contributes to the continued growth in height and girth of a tree for many years, and this involves orderly cell-cycle progression and the expression of cell-cycle genes. The sensitivity of cambium to auxin, which is important in the control of cambium activity and wood formation [
5–
11], declines as trees age [
12–
14].
In addition to the differential responses to environmental and hormonal stimuli, the underlying mechanisms for transition from vegetative growth to flowering occurs as trees age are still poorly-defined. Given the age-control of meristem activity and identity, the expression of implicated genes appears to be regulated by age. However, limited information is available on the molecular basis of tree aging, especially the regulatory mechanisms of meristem activity.
Investigations of the molecular mechanisms of tree aging have identified many age-related genes [
15–
22]. Some of these genes participate in wood production [
21,
22] and the control of flowering time [
23–
26], while only a few have been associated with the regulation of meristem activity and identity. Due to the lack of genomic and transcriptome data, information about transcriptome reprogramming during tree aging is still limited. Thus the availability of transcriptome data for some trees, of which the genome has not been sequenced, would enable global analysis of transcriptome reprogramming during aging.
Larix kaempferi is a forest tree of important ecological and economic value widely-grown in the northern hemisphere, and its somatic embryogenesis has been used to study the regulatory mechanisms of plant development [
27–
35]. With the advantage of next-generation sequencing technologies, deep sequencing of the transcriptome has become rapid and economical, making it easier to develop molecular markers at the genome scale. Recently, transcriptome sequencing of
L. kaempferi embryogenic cell cultures and
L. gmelinii needles using the Roche 454 and Illumina Solexa sequencing platforms has been reported [
36,
37], greatly adding to availability of transcriptome information for
Larix. However, the
Larix transcriptome is still one of the least explored among conifers [
38] and remains insufficient for tree-aging studies. Here, we sequenced the transcriptome of the stems of 1-, 2-, 5-, 10-, 25- and 50-year-old
L. kaempferi, and assembled it
de novo with the transcriptome of the embryogenic cell cultures for global analysis of transcriptome resources and the development of genome-wide markers.
Materials and methods
Plant material
The materials were sampled in July 2011 at Dagujia seed orchard (42°22′ N, 124°51′ E), Liaoning Province, in North-east China. The uppermost main stems produced in the current year from three trees were harvested from 2-, 5-, 10-, 25- and 50-year-old L. kaempferi trees and ten stems from 1-year-old trees, which had been grown from seed. After removing the needles, the stems from trees of the same age were pooled, frozen in liquid nitrogen, and stored at -80°C for RNA extraction. We selected these ages because they constitute an entire rotation period from establishment to harvest, and include the vegetative and reproductive phases of L. kaempferi.
CDNA library preparation and transcriptome sequencing
CDNA (cDNA) library preparation and transcriptome sequencing were performed according to the manufacturer’s instructions (Illumina, San Diego, CA). Briefly, total RNA was extracted from each sample using plant RNA purification reagent according to the manufacturer’s instructions (Invitrogen, San Diego, CA). RNA was quantified on an ND-1000 Spectrophotometer (Thermo Fisher Scientific, Inc., Wilmington, DE). One microgram of total RNA from each sample was pooled and 5 μg of the mixed total RNA was used to generate an RNA library for Illumina paired-end sequencing. After isolation from 5 μg of the mixed total RNA, the mRNA was broken into short fragments. Using these short fragments as templates, a random hexamer primer was used to synthesize first-strand cDNA. Then second-strand cDNA was synthesized. The double-stranded cDNA was cleaned up and used for end repair and addition of poly A. Thereafter, these strands were connected to sequencing adapters and the desired cDNA was isolated through a 2% agarose gel. To enrich the desired cDNA, PCR amplification was performed. PCR products were isolated on 2% agarose gel and purified using a PCR column purification kit. Finally, the library was sequenced using the Illumina HiSeqTM 2000 platform.
Transcriptome assembly
The raw Illumina RNA sequencing reads were first preprocessed by discarding reads with adaptors, with unknown nucleotides >5%, of low-quality (quality score<20), or<20 bp. Then,
de novo assembly was performed using Trinity software [
39] with default parameters. The Roche 454 raw reads (accession number: SRX110249) [
36] were first preprocessed by trimming adaptors and Poly A, and then the preprocessed sequences were assembled using Newbler 2.6 (Roche; cDNA assembly mode) with default parameters. Another 3,874 expressed sequence tags (ESTs) [
40,
41] were used to increase the transcriptome coverage. Any vector contamination of the ESTs was removed using the seqclean program (http://compbio.dfci.harvard.edu/tgi/software/). The assembled Illumina transcripts, Roche 454 sequencing transcripts and ESTs were further assembled using CAP3 software [
42] with default parameters.
Transcriptome annotation
To identify the
L. kaempferi protein-coding genes, blastx (the basic local alignment search tool, search protein databases using a translated nucleotide query) [
43] was used to search our assembled sequences against four sets of protein sequences from
Arabidopsis thaliana (http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index),
Picea abies (http://congenie.org/),
Populus trichocarpa (http://www.phytozome.net/poplar.php) and
Vitis vinifera (http://plants.ensembl.org/Vitis_vinifera/Info/Index/) with an
e-value of 1e-5. The sequences with no blastx hits were then searched against the NR database (NCBI non-redundant protein database ‘nr’) with an e-value of 1e-5. To identify the biologic processes enriched in the
L. kaempferi transcriptome, we assigned the Gene Ontology (GO) terms associated with the top hits in the four protein databases to the annotated transcripts.
Evolutionary analysis
To do evolutionary analysis, we extracted likely coding regions from the
L. kaempferi annotated transcripts using TransDecoder, included with the Trinity software. First, protein sequences longer than 30 amino-acids from
A. thaliana,
L. kaempferi,
P. abies,
P. trichocarpa and V. vinifera were collected. Second, blastp (the basic local alignment search tool, simply compares a protein query to a protein database) [
43] was used against a database containing a protein data set of the five species with the
e-value of 1e-7. Then, the single-copy orthologous genes among the five species were used to construct a phylogenetic tree using PhyML [
44] with default parameters.
Molecular marker development
To aid Larix breeding and genetic diversity studies, we identified single nucleotide polymorphism (SNP), simple sequence repeat (SSR), and insertion and deletion (InDel) markers using the transcriptome data obtained in this work. To identify SNPs, the sequencing reads were aligned with the representative transcripts in the CLC genomic workbench (http://www.clcbio.com/products/clc-genomics-workbench/). The general alignment parameters were set to defaults except that non-specific matches were ignored to minimize read-alignment ambiguities. To capture reliable SNPs, we adjusted the minimum read coverage to 5.
SSRs were identified using Msatfinder (http://www.genomics.ceh.ac.uk/msatfinder/). The repeat thresholds for mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide motifs were set at 12, 8, 5, 5, 5, and 5, respectively. Only SSRs with flanking sequences longer than 50 bp on both sides were collected.
BWA and SAMtools software [
45,
46] were used to align reads to transcriptome reference and call InDels. The filtering threshold was set as follows: read depth no less than 10 and quality score no less than 20. The default parameter was used for quality control of flanking sequences in the SAMtools mpileup step.
Results
Transcriptome assembly
From the stem transcriptome sequencing of 1-, 2-, 5-, 10-, 25- and 50-year-old L. kaempferi, a total of 26074916 reads were generated and have been deposited in the NCBI SRA database (accession number: SRR1107838). Another 591759 Roche 454 sequencing reads and 3874 ESTs were also included in the L. kaempferi transcriptome sequence assembly. Assembly of 26670549 reads generated 146786 transcripts, ranging from 200 to 16701 bp, with a mean length of 849 bp and an N50 length of 1538 bp (Table 1).
Transcriptome annotation
Transcripts were first annotated by blastx of four protein databases from A. thaliana, P. abies, P. trichocarpa, and V. vinifera, and then the NR database (e-value<1e-5). Altogether, 79182 (53.9%) of the 146786 transcripts had significant matches, at least one hit in these databases. Among the 79182 annotated transcripts, 74744 (94.4%) were matched to P. abies, 56573 (71.4%) to A. thaliana, 55807 (70.5%) to V. vinifera, 54687 (69.1%) to P. trichocarpa, and 3022 (3.8%) to the NR database.
GO annotation was performed for the 79182 annotated transcripts in terms of ‘biologic process’, ‘molecular function’ and ‘cellular component’. In total, 55349 (69.9%) transcripts were assigned to a total of 2024491 GO terms: 53241 transcripts were assigned to the biologic process category, 53015 to molecular function, and 51800 to cellular component. In the biologic process category, ‘response to stress’ was the most abundant GO term (21280, 40.0%), followed by ‘biosynthetic process’ (20855, 39.2%), and ‘development of an anatomical structure’ (17558, 33.0%) (Fig. 1).
Evolutionary analysis
Seventy-two single-copy gene families across A. thaliana, L. kaempferi, P. abies, P. trichocarpa and V. vinifera were obtained. The gene families were concatenated to five super-peptides (100084 peptide sites) to construct a phylogenetic tree. The phylogenetic tree analysis showed that L. kaempferi was closely related to P. abies (Fig. 2), consistent with their taxonomic classification and evolutionary relationship.
Identification of gene-associated markers
In total, 463482 high-quality SNPs from 48578 transcripts were identified, including 272417 transitions and 191065 transversions (Table 2). The minor allele frequencies of SNPs were estimated from the sequence data (Fig. 3a). The distribution of SNPs per transcript was estimated (Fig. 3b) and the overall frequency of all types of SNPs in the transcriptome was one per 108 bp. Among these SNPs, 364227 (78.6%) were identified from transcripts with annotation information, and they were distributed in 32453 known genes (Table 2).
The 4756 SSRs detected in 4438 transcripts (Table 2) included 2165 (45.5%) mononucleotide, 569 (12.0%) dinucleotide, 1941 (40.8%) trinucleotide, 57 (1.2%) tetranucleotide, 17 (0.4%) pentanucleotide and 7 (0.1%) hexanucleotide motifs. Apart from the mononucleotide motifs, the most abundant was ACG\CGT (9.7%), followed by AT\AT (8.5%), and AAG\CTT (8.5%) (Fig. 3c). Suitable PCR primers were designed for 3595 SSRs using primer 3 [
47].
A total of 12434 InDels were identified in 10357 transcripts (Table 2), including 4080 deletions and 8354 insertions. Only 1640 (15.8%) of the sequences had more than one InDel. Among the 4080 deletions, >68% were mononucleotide deletions, followed by dinucleotide (16.7%), trinucleotide (11.0%), tetranucleotide (3.1%), and pentanucleotide (1.0%) deletions. Among the 8354 insertions, mononucleotide insertions were also prevalent (59.8%), while insertions of dinucleotide (19.7%), trinucleotide (13.6%), tetranucleotide (4.3%), and pentanucleotide (2.6%) accounted for small proportions.
Discussion
Environmental signals control tree growth and development. Notably, age affects their responses to these environmental signals [
3,
48]. Studying the underlying mechanisms is important not only for understanding the adaptation of trees in the context of global climate change, but also for forest breeding and management. For example, in one rotation period of
Larix, the rate of increase in height peaks at the age of about 15 [
49]; so determining the optimal time to harvest forest trees is important. Here, large-scale identification of the transcripts associated with ‘response to stress’ (21280), ‘response to salt stress’ (8016), ‘response to cadmium ion’ (6166), ‘response to cold’ (5313), ‘response to water deprivation’ (4999), ‘response to heat’ (2933), ‘response to oxidative stress’ (2566), and ‘response to osmotic stress’ (2544) will facilitate studies of the environmental regulation of tree growth and development and the mechanisms of age control of the responses to these environmental signals.
As a tree ages, floral meristems develop. Based on the functional annotation by GO, 4851 transcripts were assigned to the GO term ‘vegetative to reproductive phase transition of meristem’, suggesting that the transcriptome in meristem is reprogrammed during aging. The transcripts assigned to ‘regulation of transcription, DNA-dependent’ (6391), ‘positive regulation of transcription, DNA-dependent’ (4275), and ‘chromosome organization’ (3350) might participate in these transcriptome reprogramming processes. These results provide important information for studies of the phase-transition of meristem from vegetative to reproductive during tree aging.
When we collected the samples for sequencing, the
L. kaempferi trees were actively growing. Active cambium cell division and wood formation were reflected by the assignation of 8439 transcripts to ‘cell differentiation’, 6901 to ‘cell wall organization or biogenesis’, 4112 to ‘cell cycle’, 3390 to ‘cell division’, 3213 to ‘cytokinesis by cell plate formation’, 2997 to ‘cell wall organization’ and 2568 to ‘plant-type cell wall organization’. Hormones control cambium activity and wood formation [
50,
51], and this was supported by the assignation of 4541 transcripts to ‘jasmonic acid mediated signaling pathway’, 3855 to ‘response to auxin stimulus’, 3386 to ‘response to jasmonic acid stimulus’, and 3006 to ‘response to ethylene stimulus’. These results will help to understand the molecular basis of wood formation and the regulation of cambium activity by hormones.
Somatic embryogenesis is not only a valuable technique in clonal propagation, but also provides a useful experimental system to study the regulatory mechanisms of plant development [
52–
54]. Hormonal signals also play regulatory roles in somatic embryogenesis. For example, abscisic acid promotes the maturation of the somatic embryo, regulates the synchronization of its development, maintains its dormant state, and suppresses its germination [
29,
31,
35,
55,
56]. Based on the functional annotation by GO, 17558 transcripts were assigned to the GO term ‘anatomical structure development’, 6878 to ‘cell morphogenesis’, 6604 to ‘response to abscisic acid stimulus’, 6403 to ‘embryo development ending in seed dormancy’, 5668 to ‘embryo development’, and 3455 to ‘abscisic acid mediated signaling pathway’. Consistent with the reorganization of meristem during somatic embryogenesis, 4179 transcripts were assigned to ‘regulation of meristem growth’ and 2932 to ‘meristem initiation’. These data depict the features of somatic embryogenesis and give prominence to the regulation of somatic embryo development by abscisic acid.
Molecular markers are important for the assessment of genetic diversity [
57–
61], pedigree and mating system analyses [
62], hybrid identification [
63], the development of genetic maps, and marker-assisted breeding. In
Larix, EST-SSR markers have been developed [
64,
65] and two highly-informative SSRs have also been identified recently [
66], but other types of markers, such as SNPs and InDels, have not been well developed. Here, using
Larix transcriptome sequences, we identified 463482 SNPs, 4756 SSRs, and 12434 InDels, which will have considerable utility for the assessment of genetic diversity and marker-assisted breeding in
Larix.
Conclusions
Here, we present an analysis of Larix transcriptomes. All of these data enrich the transcriptome resources of Larix, are publicly available online, and will serve as useful tools for understanding the molecular basis of tree growth and development and for breeding and genetic diversity studies of Larix.
Higher Education Press and Springer-Verlag Berlin Heidelberg