Introduction
Repetitive elements are abundant in plant genomes. They can be categorized into three main types: tandem repeats (TRs), transposable elements (TEs) and high-copy number genes. TRs are often divided into three subclasses according to the size of the repetitive unit: satellites (>100 bp), minisatellites (7–100 bp) and microsatellites (1–6 bp)
[1,
2]. TRs frequently occur within or close to genes, i.e., in the untranslated regions (UTRs) up- and down-stream of open reading frames, within introns, or in coding regions (CDS)
[3]. TRs appear in high densities in the centromeric, telomeric, and subtelomeric regions of many eukaryotes, comprising hundreds or thousands of repeats
[4]. They are also found at interspersed positions and in low-recombining regions, such as sex or B chromosomes
[1,
5]. The reason why TRs are such ubiquitous elements in genomes is still not completely known. While originally classified as nonfunctional or junk DNA, more recent studies strongly hint at either a functional or evolutionary role
[3,
6–
12]. The high mutation rates of TRs lead to their prominent role and their importance in many fields of molecular evolution
[13,
14]. They are used as informative molecular markers in population genetics and molecular breeding, in plants as well as in animals
[15–
18]. Besides TRs, TEs are another class of important repetitive elements that are particularly abundant in plant genomes. They are important in genome and gene evolution
[19]. Their main characteristic is their ability to move or copy themselves within the genome
[20]. They are divided into two classes, RNA-mediated Class I retrotransposons and DNA-mediated Class II transposons. Both classes contain elements that encode functional products required for transposition (autonomous) and elements that only retain the
cis sequences necessary for recognition by the transposition machinery (non-autonomous). Class I elements can further be divided into several subclasses: SINEs, LINEs, long-terminal repeat (LTR) retrotransposons and terminal-repeat retrotransposons in miniature, which are LTR non-autonomous elements
[21]. Class II elements comprise autonomous and non-autonomous transposons, including MITEs (miniature inverted-repeat transposable elements). TEs can serve as a very rich source of identifiable polymorphisms. Some studies suggest that TEs might even be more useful as molecular markers (e.g., SSAP, IRAP or REMAP markers), in particular in plant breeding applications, than other markers (e.g., SSR and AFLP)
[22,
23].
Citrus is one of the most popular fruit crops worldwide, with great economic and health value. It grows throughout the tropical and subtropical regions of the world. The major citrus-producing areas are south and east Asia (led by China, India and Japan), the Americas (led by Brazil, USA, Mexico and Argentina) and the Mediterranean basin (led by Spain, Italy, Egypt and Turkey)
[24]. Although citrus is one of the most important fruit crops, its genome has been much less explored than those of other plant species (e.g., rice, maize and soybean). Knowledge of repetitive sequence elements is essential for understanding the nature and consequences of genome size variation between different species, and the large-scale organization and evolution of plant genomes
[1]. Several methods have recently been developed for the analysis of repetitive sequence elements in genomes
[1,
2,
25–
30]. Expressed sequence tag (EST) databases are valuable resources for predictions regarding genome structure and genomic organization in the transcribed regions of genomes. The large number of publicly available citrus EST sequences offers a great opportunity to study transcribed regions in citrus genomes. The analysis of repetitive elements in citrus ESTs will provide valuable information for studying important questions concerning the genetic improvement of citrus, as well as a valuable resource for the development of genetic tools such as molecular markers. Several studies have been conducted in the past to find repetitive elements in the genomes of many plant species, including papaya [1], maize [31] and soybean [32]. Although several studies have already analyzed citrus ESTs and characterized microsatellites to develop SSR markers [33, 34], most details concerning other repeat characteristics, such as minisatellites, satellites and TEs found in the transcribed regions of citrus, remain unexplored. Thus a detailed structural analysis of the transcribed regions of citrus genomes remains to be performed. In this study we screened clustered non-redundant EST data sets of 11
Citrus spp. for TRs and TEs with the aim of understanding the genomic organization in the transcribed regions of citrus. For TRs we compared the densities and length characteristics of different repeat types and unit size ranges. TEs were classified and their frequencies were computed. For selected TEs, we also estimated phylogenetic distances.
Materials and methods
Sequence retrieval and processing
EST sequences of the 11 Citrus spp. were retrieved from NCBI (http://www.ncbi.nlm.nih.gov) on November 14, 2015 (Table 1). A Perl script est_trimmer.pl (http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl) was used to remove unusual EST sequences, vector contamination, poly-A and poly-T bases from the EST sequences. After that, the CAP3 program (http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::cap3) was used to obtain non-redundant EST sequences. The 11 sets of non-redundant sequences were used for subsequent data mining.
Tandem repeat detection
TRs were detected in the citrus EST data sets by using the Tandem Repeats Finder (TRF) software
[25] and PHOBOS, version 3.2.6
[35]. Both programs were used to search for imperfect TRs in a unit size range from 1 to 1000 bp without using a pre-specified motif library. TRF was used with default parameters. PHOBOS used the alignment scores 1, −5, −5 and 0 for match, mismatch, gap and N positions, respectively. In every TR, the first repeat unit was not scored, and a maximum of four successive Ns was allowed. For a TR to be considered in the analysis, it was required to have a minimum repeat alignment score of 12 if its unit size was less than or equal to 12 bp, or a score of at least the unit size for unit sizes above 12 bp. As a consequence, mono-, di- and tri-nucleotide repeats were required to have a minimum length of 13, 14 and 15 bp, respectively, to achieve the minimum score. For repeat units above 12 bp, a perfect repeat had to be at least two units long, and an imperfect repeat even longer, to obtain the minimum score.
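The length thresholds quoted above follow directly from the scoring rule. The short Python sketch below reproduces that arithmetic for perfect repeats only; the function names are illustrative and are not part of PHOBOS or TRF, and the sketch does not attempt to model mismatches, gaps or N positions.

```python
def min_required_score(unit_size: int) -> int:
    """Minimum alignment score for a TR to be kept, following the stated rule."""
    return 12 if unit_size <= 12 else unit_size


def perfect_repeat_score(length_bp: int, unit_size: int, match: int = 1) -> int:
    """Score of a perfect repeat: the first unit is not scored,
    every later base counts as one match."""
    return max(0, length_bp - unit_size) * match


def min_perfect_length(unit_size: int) -> int:
    """Shortest perfect repeat (in bp) that reaches the minimum score."""
    return unit_size + min_required_score(unit_size)


for unit in (1, 2, 3, 12, 13, 50):
    length = min_perfect_length(unit)
    # The shortest accepted length scores exactly the minimum.
    assert perfect_repeat_score(length, unit) == min_required_score(unit)
    print(f"unit size {unit:>2} bp -> minimum perfect length {length} bp")
```

For unit sizes of 1, 2 and 3 bp this gives 13, 14 and 15 bp, and for any unit above 12 bp it gives exactly two units, which matches the thresholds stated in the text.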
All TRs with units that differ only by circular permutations and/or the reverse complement are assigned to the same repeat type, so every repeat type comprises several repeat units. For example, the pattern (GCC)n also represents (CCG)n, (CGC)n, (GGC)n, (GCG)n and (CGG)n. This convention allows counting and identifying repeat units without reference to the repeat unit phase or strand
[2]. In this study, we follow the convention of representing a repeat type by the unit that comes first in an alphabetical ordering of all units associated with it [36]. For example, the repeat type represented by the unit AAG incorporates all TRs with units AAG, AGA, GAA, TTC, TCT and CTT. Furthermore, TR patterns are always listed under the smallest possible unit size. For example, patterns like (ACACAC)n or (ACAC)n were included in the category (AC)n. As a result, the total number of theoretically possible, non-overlapping patterns was reduced. Finally, the term repeat type is distinguished from the term repeat class, which we use to denote the collection of all repeats with the same repeat unit size (e.g., mono-, di-, tri-nucleotide repeats). TR characteristics such as the density and mean length of repeat types were computed using the program Sat-Stat, version 1.3.1 (http://www.ruhr-uni-bochum.de/spezzoo/cm/). Three TR characteristics were analyzed in this study: (1) the TR density measured in bp/Mbp, which gives the proportion of bp found in repeats with respect to all bp in the sequence; (2) the number of repeats found on average in a sequence of a certain length, measured in TR/Mbp; and (3) the mean length of repeats, measured in bp.
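The repeat-type convention and the three summary statistics can be illustrated with a minimal Python sketch. This is not the Sat-Stat implementation; the function names are hypothetical, and the input to the statistics function (two repeats of 117 and 623 bp in exactly 1 Mbp of sequence) is an illustrative toy input.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit: str) -> str:
    return unit.translate(_COMPLEMENT)[::-1]


def smallest_unit(unit: str) -> str:
    """Reduce e.g. ACACAC or ACAC to AC (smallest possible unit size)."""
    for size in range(1, len(unit) + 1):
        if len(unit) % size == 0 and unit[:size] * (len(unit) // size) == unit:
            return unit[:size]
    return unit


def rotations(unit: str) -> set:
    """All circular permutations of a unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def repeat_type(unit: str) -> str:
    """Alphabetically first unit among all rotations on both strands."""
    base = smallest_unit(unit.upper())
    return min(rotations(base) | rotations(reverse_complement(base)))


def tr_statistics(repeat_lengths_bp, sequence_length_bp):
    """Density (bp/Mbp), repeat count (TR/Mbp) and mean repeat length (bp)."""
    total_bp = sum(repeat_lengths_bp)
    mbp = sequence_length_bp / 1_000_000
    mean = total_bp / len(repeat_lengths_bp) if repeat_lengths_bp else 0.0
    return {
        "density_bp_per_mbp": total_bp / mbp,
        "count_per_mbp": len(repeat_lengths_bp) / mbp,
        "mean_length_bp": mean,
    }


print(repeat_type("GAA"), repeat_type("CTT"))   # both map to AAG
print(repeat_type("ACACAC"))                    # reduced to AC
print(tr_statistics([117, 623], 1_000_000))     # toy input, not study data
```

Taking the minimum over both strands and all rotations is what makes the count independent of repeat phase and strand, as described above.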
Transposable element analysis
To find TEs in the transcribed regions of the citrus genome, we used a combination of homology-based and de novo methods. Given that there are many known families of TEs in plants, homology-based methods should be highly effective in identifying and annotating them. We built a custom plant TE library by combining plant repeats from Repbase [30], the plant repeat databases from TIGR (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats) and GenBank for our initial classification of TEs (Table S1). Repeat elements identified as rRNA sequences, centromere-related sequences, telomere-related sequences and unclassified sequences in the TIGR databases were excluded from our repeat library, leaving a database of 6880 repeat sequences that were used to search the transcribed regions of the citrus ESTs. The custom plant TE library was then compared with the citrus EST data sets using BLASTN. BLASTN analyses were performed using an expect threshold of 10, a word size of 11, match/mismatch scores of 2/−3, and gap costs of 5 for existence and 2 for extension. We only considered search hits with an e-value < 1×10^−5. A Perl script was developed for summarizing the results.
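The filtering and summarizing step could look roughly like the following Python sketch. It is not the Perl script used in this study. It assumes that the BLASTN results were saved in the standard 12-column tabular format (query, subject, ..., e-value in column 11, bit score in column 12) and that the TE library entries were the queries and the ESTs the subjects; all identifiers in the example are hypothetical.

```python
import csv
import io
from collections import defaultdict

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def te_hits_per_est(tabular_text: str, cutoff: float = E_VALUE_CUTOFF) -> dict:
    """Map each EST (subject) to the set of TE library entries (queries)
    that hit it with an e-value below the cutoff."""
    hits = defaultdict(set)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, subject_id, e_value = row[0], row[1], float(row[10])
        if e_value < cutoff:
            hits[subject_id].add(query_id)
    return dict(hits)


# Hypothetical example rows, not real search results.
example = "\n".join([
    "gypsy_001\tEST_A\t92.5\t200\t15\t0\t1\t200\t10\t209\t2e-30\t250",
    "copia_007\tEST_A\t88.0\t150\t18\t0\t1\t150\t300\t449\t4e-12\t120",
    "sine_042\tEST_B\t80.0\t40\t8\t0\t1\t40\t5\t44\t0.001\t35",
])

result = te_hits_per_est(example)
print(len(result), "EST(s) with at least one significant TE hit")
```

Counting ESTs rather than raw hits avoids counting the same transcript several times when it matches multiple library entries.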
Phylogenetic analysis
Homologous TE sequences were retrieved from the citrus EST data sets with the aid of BLASTN searches and an in-house Perl script. Copia- and gypsy-like TE-EST sequences were pooled separately with randomly selected citrus copia- and gypsy-like genomic sequences. Sequences were aligned and trees were constructed with MEGA5 [37].
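As an illustration of what a pairwise distance between two aligned TE sequences means, the Python sketch below computes the simple p-distance (proportion of differing sites). The distance model used in MEGA5 for this study is not stated here, so the choice of p-distance, the treatment of gaps and Ns, and the function name are all assumptions made for the example.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.
    Columns containing a gap ('-') or 'N' in either sequence are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Toy aligned fragments, not sequences from the study.
print(p_distance("ACGTACGT", "ACGTTCGT"))  # one difference in eight sites
```

A matrix of such distances over all sequence pairs is the usual input for distance-based tree building.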
Results
In this study we analyzed 525510 clean ESTs from 11 Citrus spp. After a CAP3 assembly of each data set, the number of sequences was reduced to 200968 unigenes. Therefore, about 62% of the citrus ESTs are redundant in the EST databases. Each unigene data set was used for further TR and TE analyses, and the main results are summarized in Table 1. The percentage of TR-containing ESTs was roughly identical among the studied species except for Citrus unshiu and C. paradisi. The highest number of TEs was recorded for Citrus sinensis, while the lowest was found in Citrus limettioides. Overall, about 1.5% of the ESTs contain a known TE.
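The redundancy figure follows from the two sequence counts given above. A short check in Python, using only those two numbers:

```python
clean_ests = 525_510   # clean ESTs before assembly
unigenes = 200_968     # non-redundant sequences after CAP3 assembly

# Fraction of ESTs that were collapsed into other sequences by the assembly.
redundant_fraction = 1 - unigenes / clean_ests
print(f"redundant ESTs: {redundant_fraction:.1%}")  # 61.8%
```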
Tandem repeat analysis
Characteristics of TRs in all 11 Citrus genomes
Transcribed regions of the 11 Citrus spp. were searched for TRs. On average, 22% of the analyzed EST sequences contain one or more TR loci, and for most species the fraction of TR-containing ESTs is relatively close to this value. Details are shown in Table 1. The highest fraction is found for Citrus limettioides (32%) and the lowest for C. paradisi (8%). We plotted TR densities against the size of the EST data set (which approximates the size of the transcribed region). The TR densities vary only slightly among the studied Citrus spp. No significant correlation was found between the size of the EST data set and the density of TRs (Fig. 1a, r = 0.05, P>0.1). A comparison of the mean lengths of TRs of all 11 genomes shows that TRs are shortest in C. paradisi (average length 9.22 bp) and longest in Citrus trifoliata (average length 99.74 bp). Again, no significant correlation between the size of the EST data set and the mean length of TRs was found (Fig. 1b, r = 0.319, P>0.1). A comparison of TR densities of the different repeat classes is given in Fig. 1c. The result shows that the relative densities of different repeat classes are considerably taxon-specific. For example, Citrus limonia has a high relative density of mononucleotide repeats, whereas dinucleotide repeats are rare. The proportions of di-, tetra-, penta- and hexa-nucleotide repeats and of 7–30 and 31–50 bp repeats are very similar in all the studied species except for C. limonia and C. limettioides.
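Whether a correlation coefficient of this size is significant can be checked with the usual t-transformation of Pearson's r. The sketch below assumes one data point per species (n = 11) and a two-sided test at the 10% level; the test actually used for Fig. 1 is not stated in the text, so this is only a plausibility check of the reported r values.

```python
import math

N_SPECIES = 11
# Two-sided 10% critical value of Student's t for df = n - 2 = 9 (t-table value).
T_CRIT_DF9 = 1.833


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2) for a Pearson correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


for label, r in (("TR density", 0.05), ("mean TR length", 0.319)):
    t = t_statistic(r, N_SPECIES)
    verdict = "significant" if abs(t) > T_CRIT_DF9 else "not significant"
    print(f"{label}: r = {r}, t = {t:.2f} -> {verdict} at the 10% level")
```

Under these assumptions both t values (about 0.15 and 1.01) stay well below the critical value, which agrees with the statement that neither correlation is significant.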
TRs were classified into three unit size ranges, namely microsatellites (1–6 bp), minisatellites (7–100 bp) and satellites (>100 bp). Results for the different unit size ranges are given in Table 2. As expected, micro- and minisatellites are more abundant than satellites in the transcribed regions of Citrus spp. The highest densities of TRs are recorded in C. sinensis, while the lowest densities are found in C. paradisi. The densities of micro- and minisatellites in the transcribed regions are taxon-specific. A high abundance of microsatellites was found in the ESTs of C. sinensis, C. trifoliata, C. reticulata, C. aurantium, C. latifolia, C. aurantifolia and C. limettioides, and a high abundance of minisatellites was found in the ESTs of C. clementina, C. unshiu, C. limonia and C. paradisi. In total, minisatellites contribute more to the TR coverage than microsatellites.
Genomic densities of mono- to tri-nucleotide repeat types
Repeat type usage of mono-, di- and tri-nucleotide repeats in the 11 genomes is summarized in Table 3. The repeat type usage in ESTs varies strongly between taxa. Even among more closely related Citrus spp., only a few common features can be observed. For example, the density of ACT, ACG and CG repeats is consistently low in all species. The repeat types AG, AT, AAG and AAT have high densities in all species. The densities of poly-C repeats are generally high, except for C. unshiu and C. paradisi, where they are even lower than poly-A repeat densities. Poly-A repeats have the highest density in C. sinensis among the 11 species.
Characteristics of tandem repeats with unit sizes 1–50 bp in expressed sequence tags of all 11 Citrus spp.
Most previous studies only analyzed TR characteristics in the unit size range 1–6 bp. In this study we compared the TR characteristics in ESTs of 11 species in three unit size ranges, namely 1–6, 7–10 and 11–50 bp. Our results show that the density of TRs with a unit size in the range 7–50 bp contributes significantly to the total repeat density in the unit size range 1–50 bp (Fig. 2). The relative contribution ranges between 17.6% in C. limonia and 42.9% in C. paradisi, with a mean value of 31.9%. Among the 11 EST data sets, strong differences are also found for individual repeat classes (Fig. 2; Fig. S1). TR densities in C. sinensis, C. clementina and C. trifoliata are slightly below average. Mononucleotide repeats represent the dominant repeat class, followed by tri- and di-nucleotide repeats, in ESTs of all 11 Citrus spp. For the longer repeat units, there are usually only very few repeat types that contribute to the density of their repeat classes. A comparison of the longest repeat length and mean repeat length is presented in Fig. S2. This analysis reveals a strong difference between the mean lengths of TRs among different repeat classes and species. A maximum mean repeat length of 370 bp is found for the 48 bp repeat class in ESTs of C. aurantium, which consists of two repeats of length 117 bp and 623 bp. All mean repeat lengths are shorter than 200 bp in the unit size range 1–50 bp for all citrus ESTs except for C. clementina and C. aurantium.
Transposable elements
The availability of a large number of EST sequences provides an opportunity to estimate the transcriptional activity of transposable elements. In this study we used the custom TE data sets as BLASTN queries against the citrus EST database. The results reveal that 1.53% of the total citrus ESTs (3083 sequences) showed significant sequence homology (e-value < 1×10^−5) with one of the TE families (Table 1). Class I (RNA-mediated) elements are more abundant than Class II (DNA-mediated) elements (Fig. 3a) in all studied Citrus spp.
Among the different TE families, gypsy-like elements are most frequent in the citrus EST database (37% of the total TE-ESTs), followed by copia-like LTR retrotransposons, while SINE elements are least abundant (0.13%). Comparing TE copy numbers in ESTs with respect to different families, gypsy elements are almost four times as frequent as copia-like elements (Fig. 3c). No significant correlation was found between the copy numbers of each TE family and the numbers of ESTs (Fig. 3d). In citrus we found TE families with a large number of family members that were found in only a few ESTs, and TE families with a low number of members that were found in many ESTs. For example, the SINE elements have 664 family members in the database we searched, but we identified homology with only three citrus ESTs. In contrast, the 925 different Ty3-gypsy elements in the database could be found in 1133 ESTs.
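The contrast between library size and EST hits can be made explicit with a small calculation on the counts quoted in the text. The Python sketch below uses only those numbers; the "ESTs per library member" ratio is an illustrative measure introduced here, not one used in the study.

```python
# Counts quoted in the text.
UNIGENES = 200_968   # non-redundant citrus ESTs
TE_ESTS = 3_083      # ESTs with a significant TE hit
FAMILIES = {
    # family: (members in the TE library, ESTs with a significant hit)
    "Ty3-gypsy": (925, 1133),
    "SINE": (664, 3),
}


def ests_per_member(members: int, est_hits: int) -> float:
    """Average number of EST hits per library member of a family."""
    return est_hits / members


te_share = TE_ESTS / UNIGENES
gypsy_share = FAMILIES["Ty3-gypsy"][1] / TE_ESTS
print(f"TE-derived ESTs: {te_share:.2%}")            # 1.53%
print(f"gypsy share of TE-ESTs: {gypsy_share:.0%}")  # 37%
for family, (members, hits) in FAMILIES.items():
    print(family, round(ests_per_member(members, hits), 4))
```

The two families differ by more than two orders of magnitude in EST hits per library member, which is the pattern described above.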
Hundreds to thousands of copies of TEs are found in the genome, but the question is how many of these are transcriptionally active. To answer this question, we constructed a phylogenetic tree based on EST sequences and randomly selected genomic sequences of the TE families of citrus (Fig. 4). The phylogenetic analysis indicated that transcriptionally active TEs are found in distinct clades, and very few are shared with genomic-sequence-based TEs. These findings indicate that few evolutionary branches of the TE family have retained transcriptional capability.
Discussion
Tandem repetitive elements
Tandem repeats are among the most common elements in plant genomes and are key to understanding genome organization and evolution. Available EST or GSS sequences provide an opportunity to study TR elements in transcribed regions of genomes. Although many studies have been conducted on TRs in plant genomes, few have studied these elements in citrus. Furthermore, most studies are restricted to TRs in the unit size range of 1–6 bp. In particular, very little is known about TR elements in transcribed regions. In this study, we analyzed and compared the TR content in the transcribed regions of 11 Citrus spp. in three unit size ranges: 1–6 bp (microsatellites), 7–100 bp (minisatellites) and >100 bp (satellites). Our results reveal that on average 6.93% of each EST sequence is covered with TRs and that a significant proportion of this coverage is contributed by minisatellites, with their contribution being almost two-fold the contribution of microsatellites and 22-fold the contribution of satellites (Table 2). This finding suggests that both microsatellites and minisatellites play a role in the organization and function of the transcribed regions of citrus.
Several studies have shown that TRs are generally non-randomly distributed in genomes
[2,
3,
38]. Exceptions have been reported for example for the papaya genome, where TRs are more or less randomly distributed
[1]. We found that several TR characteristics are non-randomly distributed in the transcribed regions of citrus genomes. No significant correlation was found between the size of the EST data set and TR densities or length characteristics in this study, which is consistent with results obtained for complete genomes, e.g., Tautz et al.
[39], Tóth et al.
[40] and Victoria et al.
[41]. Except for the relatively low densities of ACT, ACG and CG repeats and high densities for AG, AT, AAG and AAT repeats, no TR characteristics were found to be common in all 11 citrus genomes. This result is in agreement with the comparative genomic analyses of a wide range of plant groups reported by Tóth et al.
[40]. The dominance of taxon rather than group specific characteristics has also been reported for
Arabidopsis, barley, rice and wheat
[42,
43] when comparing number counts of satellites or when considering densities
[2]. Evidence suggests that ACG repeats are underrepresented in most eukaryotic genomes
[40]. The only known counterexample among green plants is the alga
Ostreococcus lucimarinus, which has a particularly high density of ACG repeats
[2]. Usually, CG, ACG and CCG repeat densities are low in higher plants. This is generally attributed to the fact that methylated CpG dinucleotides are highly mutable, which disrupts CpG rich domains on short timescales
[2,
40]. However, other mechanisms have also been proposed; see Tóth et al.
[40]. Low densities of CCG repeats were also found in the genomes of
C. clementina,
Brassica and yeast. According to our results, CG, ACG and CCG repeats have low abundances in the transcribed regions of all citrus genomes (Table 3).
Notably, the high absolute and relative di- and tri-nucleotide repeat densities found in
C. sinensis are almost exclusively based on the high densities of the AG, AAG and AGC repeat types that are also common in all other
Citrus spp. in this study (Table 3). Victoria et al.
[41] reported that AG and AAG repeats generally predominate among di- and tri-nucleotide repeats in higher plants. Several studies demonstrated that poly-A repeats are more frequent than poly-C repeats in almost all vascular plants, which was also found in the present study
[44]. As a general trend and except for the features just mentioned, we find that common TR characteristics are rare. We also observed that the length of TRs did not correlate with the repeat unit size.
Transposable elements
TEs are a major component of many plant genomes and are important for their physical structure. Several studies show that TEs can account for as much as 80% to 90% of plant genomes
[45], and that some TEs are transcriptionally active. To understand the impact of retrotransposition on plant genome evolution, it is important to identify active members of TE families that are present with high copy numbers
[45]. Once TEs accumulate and degrade in a genome, they usually become functionally inactive. However, partial or rearranged TE copies may retain their ability to initiate transcription
[19]. Cells have active mechanisms to protect the integrity of their genomes against TE activity by transcriptional silencing
[46]. Under certain circumstances, some TEs can escape this cell control with the result that they are able to get transcribed and transposed
[47]. This phenomenon is frequently observed under biotic or abiotic stress or in cell cultures
[48–
51]. Consequently, TE transcripts are more abundant in cDNA libraries obtained from stress-treated tissue. Thus, the presence of TE transcripts in cDNA libraries can be expected, and EST databases can be used to identify functionally active TEs in genomes. Here we searched citrus EST databases in order to identify transcriptionally active TE families in citrus genomes. We identified Ac-Ds, CACTA, SINE, MITE,
copia-like and
gypsy-like TE families as functionally active in citrus genomes. Previous work suggests that
copia- and
gypsy-like TE families are highly abundant in citrus genomes and that some of the members of these families are transcriptionally active
[52,
53]. Our work does not fully support these findings, since the number of ESTs that originated from TEs was low. Since TEs are not necessarily located in transcribed regions, we cannot conclude that the overall TE content in citrus genomes is low. Our analysis suggests that the ratio of
gypsy- to
copia-like elements in transcribed regions of the citrus genomes is close to 3:1. This suggests that
gypsy TE families are either more frequent or transcriptionally more active than
copia families.
Gypsy elements were also found to dominate over other TE families in maize EST data sets
[19]. A primary analysis of the sweet orange genome reveals that 20% of the genome contains transposable elements and that
gypsy elements are predominant
[54]. Although we found a low level of transcriptional activity of TEs (about 1.5% of all ESTs) in citrus genomes, this is compatible with a similar study in maize
[19].
There was no correlation found between the number of TEs of a given family in the database and the number of ESTs they were found in (R² = 0.326, Fig. 3d). Meyers et al.
[31] reported that TE numbers were negatively correlated with the EST database size in maize, while Vicient [19] did not find any correlation between maize ESTs and TEs. Comparison of TE family members in the database with the number of occurrences of TEs in the EST database, together with the phylogenetic analysis of TEs, suggested that high-copy retroelements are transcriptionally less active than low-copy number retroelements in
Citrus spp. Similar findings were also reported for maize. Rabinowicz et al.
[55] found that high-copy retroelements were frequently located outside of hypomethylated regions of the genome, while low-copy retroelements were located inside the hypomethylated regions. Consequently, low-copy retroelement families may escape methylation and therefore remain transcriptionally active.
Conclusions
TRs are abundant elements in transcribed regions of citrus genomes, with 22% of the citrus ESTs containing a TR and 7% of the EST sequences covered by TRs. Notably, TRs with a unit size longer than 6 bp contributed significantly to the TR content. TE abundance is rather low in ESTs of citrus, where on average 1.5% of the transcripts are derived from citrus transposable elements. TEs found in ESTs are assumed to be transcriptionally active members of TE families, and it would be worthwhile to study their role in gene expression and citrus genome evolution. TR and TE abundance varies strongly from species to species, with very few common features among species.
The Author(s) 2017. Published by Higher Education Press. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0)