1 INTRODUCTION
The expression of a gene as measured by the abundance of its mRNA transcripts has been taken as a basic unit for the quantitative study of the gene’s function and regulation. Gene transcription is regulated by transcription factors interacting with genomic elements in the promoter and enhancer regions. Genomic variations that affect those elements or transcription factors are an important cause of variations in gene expression. Expression quantitative trait locus (eQTL) analysis studies the association of genomic variations with gene expression variations, and is a key to decode the molecular network that regulates the expression of genes [
1–
6]. Genome-scale eQTL analyses have discovered many genomic variations that can affect gene expression, including
cis-eQTLs located inside or close to the gene which highlight genomic regions with regulatory elements, and
trans-eQTLs that are far away from the gene or on different chromosomes which highlights regions associated with transcription factors that regulate the gene or possible long-range DNA interactions, e.g., [
3,
7–
15].
Most human genes are composed of multiple exons and the procedure of transcription is accompanied by splicing. Alternative splicing is the mechanism that allows a gene to produce multiple isoforms of transcripts by using different combinations of exons in the splicing. This is an important mechanism of co- or post-transcriptional regulation in humans and other high organisms. Different isoforms and differences in their quantitative proportions can have important biologic consequences [
16–
19]. It has been widely observed that many alternative splicing events are associated with complex diseases such as cancers [
20–
22]. For example, the spleen tyrosine kinase gene (
SYK) has an effect on breast cancer when it undergoes aberrant alternative splicing [
11]. Later its alternative splicing is found as a regulator of mitosis and cell survival, and has effects on multiple types of tumors [
23–
27]. Alternative splicing was found as a rare event when it was discovered in 1970s. For a long period, it had been understood that alternative spliced genes may compose about 40-60% of human genes [
16]. Recent advancement in next-generation sequence (NGS) for RNAs (RNA-Seq) experiments has found that actually most human genes can have alternative splicing [
10,
28,
29]. However, the study of alternative splicing regulation in eQTL studies is still at the starting phase.
Alternative splicing adds another layer of regulation to the system of gene expression. With the ubiquitous existence of alternative isoforms, the concept of gene expression becomes less well defined. As different isoforms of the same gene can have differences in their functions, it is natural to use the expression of each isoform to replace the “gene expression” in earlier studies. Computational biologists have developed bioinformatics tools that can map RNA-Seq reads to isoforms instead of genes [
30–
35], infer new isoforms that have not been annotated in databases [
36–
38], and estimate the expression levels of isoforms [
9,
39–
42]. The tasks are challenging and cannot be exempt from errors as the current NGS technology only provides short sequencing reads, and cannot obtain reads of the full-length transcript directly. Sequenced short reads can be the mixture of multiple isoforms, besides being affected by noises and biases. Unique solutions are not mathematically guaranteed or accomplishable in many situations, and the inferences or estimations usually depend on assumptions that might not fit the biologic truth. Therefore, exon-centric methods have also been proposed to detect genes that show differential splicing between compared samples [
20,
43]. Such methods do not depend on annotations or assumptions about splicing isoforms, and leave the task of inferring and estimating isoform expression to the downstream analysis of only the fewer detected genes. Similar to this idea, junction reads have also been used to connect two spliced exons to define concepts such as exon-inclusion levels or splicing levels to quantify the relative expression of isoforms [
28,
44].
Microarrays had been the major technique to measure gene expression as well as genomic variations for the last decade, and were also the major technique for eQTL studies [
45–
51]. RNA-Seq revolutionized the technology and provides higher resolution and accuracy in both measuring genomic variations and RNA expressions. Several recent studies have used RNA-Seq data and sequencing-based single nucleotide polymorphisms (SNPs) data to perform eQTL studies [
3,
4,
52–
54]. They investigated associations of SNPs with gene expressions at the gene level, the splicing level or the exon level with RNA-Seq data, aiming to reveal more complex regulatory relations of genetic variations with gene expression. Hull et al. still using microarray data, reported that splicing patterns of exons depend on the single nucleotide polymorphisms distributed in flanking introns or exons [
45]. Pickrell et al. used RNA-Seq data to discover a large number of eQTLs and splicing QTLs, and concluded that an exon’s inclusion is affected by the variation within and near the consensus splice sites [
3]. Montgomery et al. conducted an eQTL study at gene, transcript and exon levels with RNA-Seq data of 60 European individuals and identified more eQTLs than with microarrays [
9], which showed the potential of using exons as the functional unit in eQTL study. Heinzen et al. carried out eQTL mapping and exon eQTL mapping in human primary cells with exon-level microarrays and suggested that “splicing effects may be of more phenotypic significance than overall gene expression changes” [
47]. Lalonde et al. identified many isoform eQTLs which are located near splice site and influence the splicing of cassette exons [
52]. Recently, Lappalainen et al. systematically mapped
cis-eQTLs for exon quantifications that can capture both gene expression and splicing variation, and reported a large number of
cis-regulatory eQTLs of various types, including gene eQTLs, exon eQTLs, transcript ratio QTLs, miRNA eQTLs and transcribed repeat eQTLs. They showed that the genetic loci affecting transcript structures are largely independent of gene eQTLs and they are both common in humans [
54]. These studies bring new insights on the complexity of transcription and splicing regulation, and also highlight the importance to study gene expression at the level of exons.
We had proposed to use exon expression as a basic unit in studying transcriptomes with RNA-Seq data, especially for studying differential splicing patterns [
55]. Although isoforms are the basic unit of function in the current understanding, all isoforms are composed of expressed exons, and all analyses on gene expression and isoform expression are based on short reads mapped to exons and their junctions. Due to the limited read length, noises and biases in sequencing data and the intrinsic complexity of possible isoform compositions, inferring isoforms and estimating their expressions can introduce extra inaccuracy or uncertainty. On the other hand, splicing can be viewed as the regulation of exon expression on top of the transcriptional regulation of the whole gene. This understanding has been confirmed from the observations in several recent studies, e.g., [
54]. In this work, we study the association of genomic variations with exon expression variations using RNA-Seq data and call it exon-expression QTL mapping or eeQTL mapping. The expression of an exon is regulated by both transcription and splicing. We conducted both gene-level eQTL and exon-level eeQTL mapping for both
cis- and
trans- loci, based on an RNA-Seq data set of the samples from the HapMap Project [
56]. Stringent computational protocols were adopted to ensure a low false positive rate in the discovery. We found that there are a noticeable amount of exons that have significant eeQTLs that are not eQTLs of their host gene, especially
trans-eeQTLs that are distant from the target gene or are on a different chromosome. Further study on such eeQTLs will bring new understanding to the regulation of exon expression that is independent with the transcriptional regulation.
2 DATA AND METHODS
The RNA-Seq data we used in this study were from the study by Pickrell et al. [
3]. They sequenced RNA from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals. The individuals are genotyped by the International HapMap Project [
56]. We used the release 27 of the HapMap genotypes obtained. We left out individuals with>50% missing genotype values and also SNPs with more than 3 missing values (5%) among the individuals. After the filtering, the data used in this study include 54 individuals and about 14 million SNPs. There were still missing values of around 1% in the remaining data after this filtering. We used the software package BIMBAM [
57] to do missing data imputation in the data. It is based on the fastPHASE model [
58], and we set the parameters as to run the EM algorithm 5 times with 20 steps per run. After the filtering and imputation, we got the genotype value of 0, 1 or 2 at each studied SNP site.
The RNA-Seq reads were mapped to the genome using gene models of the Ensembl database in the original work [
3]. Following the procedures used by Pickrell et al. [
3], we applied a series of pre-processing steps on RNA-Seq read counts at exons and genes to get the normalized gene and exon expression data. The steps include GC-content correction, correction for possible batch effects, principal component analysis to remove confounders, and two rounds of normalization. Sequence coverage can be influenced by the GC-content of the region which can cause biases in estimating gene and exon expression. We binned exons according to their GC contents and applied smoothing on them in each sequencing lane [
3,
59,
60]. As the data were sequenced at two centers, and some were at different concentrations, we also adopted the correction step as described in [
3] to compensate for differences between the two centers as well as different concentrations. Quantile normalization was applied on the expression data to make them follow a Gaussian distribution, which is a prerequisite of the ANOVA method used for detecting associations. Then principal component analysis was applied on the data to remove unmeasured confounders in the correlation between different individuals. We used the same setting of removing 16 principal components which has been reported to give the largest number of eQTLs in downstream analysis in [
3]. Another round of quantile normalization was applied on the residuals.
We are equally interested in eQTLs and eeQTLs in local regions of the gene and in the whole genome. Therefore, unlike the strategy used in [
3] and most other eQTL literature that detected local associations and genome-wide associations separately with different models and parameters, we used ANOVA to test for the associations of gene and exon expressions with genotypes of all the studied genome-wide SNPs. This will give associations with SNPs of different distances equal opportunity to be detected. We controlled the false discovery rate (FDR) to be less than 25%. The FDR was estimated via permutation and was controlled for possible associations on the whole genome. The permutation was done by randomly shuffling the expression of genes and exons, and we detected associations of the shuffled gene and exon expression with genotypes using the same method. Note that the whole-genome genes or exons for each individual were handled together during the shuffling. Therefore, the dependence among genes and the dependence among markers were kept during the permutation. Any detected association on the permuted data will be false discovery. As the expression of genes and exons has been normalized, we can treat all genes and exons using the same threshold of the p-value. At a given p-value level, we calculate the ratio of the number of discoveries obtained on the permuted data to the number of discoveries on the real data, and use the p-value at which this ratio becomes 25% as the threshold for controlling FDR≤25% on the real data.
In most existing work on eQTL, candidate SNPs were restricted to a nearby region of the gene for cis-eQTL study as the discovery power will be too low if all SNPs on the genome is considered. This made the standards for calling local associations and genome-wide associations very different, and is a major reason why many cis-eQTLs could be identified but trans-eQTLs were few. In our study, we took all SNPs on the genome equally for eQTL and eeQTL study with the same model and same threshold. This gives equal opportunity for discovering cis- and trans-loci. But since the number of candidate genome-wide SNPs are several magnitudes larger than those in the cis-region, it can be expected that the discovery power for cis-loci becomes lower than existing reports.
3 RESULTS
The key question we investigated was the possible differences between exon-expression QTLs (eeQTLs) and gene expression QTLs (eQTLs). Therefore, we focused on only the 929 genes that have been reported to have eQTLs on this data set in the study of Pickrell et al. [
3] to limit the computational workload of this study. In the original work, the authors focused on SNPs located within 200 kb of target genes and identified 929 genes that have local eQTLs at the FDR level of 10%. These genes contain a total of 9,552 exons. We searched both local and distant eQTLs and eeQTLs within these genes by a more stringent criterion, with SNPs on the whole genome as candidates. Expanding to whole-genome SNPs makes the power of detecting eQTLs and eeQTLs much lower, but it can give equal opportunities for eQTLs and eeQTLs at different distances from the target genes. For gene expression, we reported only 411 significant eQTL associations (FDR≤25%) that involve 77 genes and 411 SNPs. For exon expression, we found a total of 1,302 significant eeQTL associations, which belong to 138 genes and 408 SNPs.
Since we only called eQTLs or eeQTLs in a small fraction of the 929 genes that have been reported to have eQTLs, we double-checked the p-value of the gene in our study. We ranked the genes according to the minimal p-value of each gene with all SNPs. The largest minimal p-value among those genes, obtained on the gene ENSG00000209849, is 1.16e-05, which is at the same level with the p-value 1.96e-05 reported on this gene by Pickrell et al. [
3] at http://eqtl.uchicago.edu. But since we used genome-wide SNPs in controlling for the multiple testing, the threshold for eQTL p-value is 1.93e-07 and for eeQTL p-value is 5.2e-08 for FDR≤25%. This confirmed that the lower calling rate in our work is due to the more stringent criterion we used for calling significant associations.
Associations of mapping fall into different categories as
cis- and
trans- effects. In eQTL mapping literature, it is a common practice to define
cis- and
trans-eQTLs based on the genomic distance between target genes and SNPs, but there has been no precise definition of the threshold to discriminate distances of
cis- and
trans- loci. Some authors including Pickrell et al. used 200 kb as the boundary between
cis- and
trans-eQTL [
3], and some others used 100 kb [
1]. For eeQTLs, since they may be associated with splicing regulation, it is expected that
cis-elements for splicing be closer to the exon and within the gene region. But a solid definition is still missing. In our study, we categorized three types of eQTLs and eeQTLs by considering the distance between gene/exon and its associated SNPs:
local eQTLs and eeQTLs that are located close to the target gene on the same chromosome (distance≤100 kb),
distant eQTLs and eeQTLs that are located far away from the target gene (distance>100 kb) but are on the same chromosome, and
external eQTLs and eeQTLs that are located on different chromosomes from the target gene. The external loci and distant loci most likely correspond to SNPs associated with
trans-factors that regulate the expression of genes or exons by transcriptional or splicing regulation. The local loci are more likely to be associated with
cis-elements of genes or exons that receive the regulation signals. Different thresholds for the distinction between local and distance loci might lead to different observations on the relative number of local or distant associations. We had experimented with different thresholds from 10 kb to 200 kb, and the general observations are consistent although the specific numbers will change.
Table 1 summarizes the numbers of significant associations we identified and the numbers of genes and SNPs involved with the associations of the three categories. We can observe that there are more local eQTLs than distant and external eQTLs. This is expected as all the discoveries were done within genes that have been reported to have local eQTLs in the neighborhood of the genes in the previous study [
3]. We reported fewer eQTL/eeQTL genes because we adopted more stringent criteria in this work. However, when looking at the exon level, we observed that there are more genes detected with distant and external eeQTLs than genes with local eeQTLs (110 vs. 36), although the number of local associations is larger (245 vs. 1057). For gene level eQTL, we didn’t see such a trend, and there are more genes and associations found at local loci than distant and external loci. This is more obvious if we calculate the ratio of the number of eeQTLs to the number of eQTLs identified in the same category, as shown in Table 2. This is a strong indication that more trans-regulatory elements can be identified when we look at the expression at the exon level instead of the gene level.
From Tables 1 and 2, we can see that in general, more associations can be found at exon level than at gene level, and so is the number of genes involved in the associations. However, for local loci which are very possibly cis-elements of regulation, although 3-fold more associations can be found at the exon level, the number of SNPs involved and the number of genes identified at exon level is less than the numbers identified at the gene level. This observation brought some surprise, since if there is no alternative splicing events and if the read distribution on the exons of the same gene is uniform, genes that show significance eQTLs should also be detected at most of their exons. Even if there is alternative splicing event, the associations detected at gene level should still be reflected at some of the constitutive exons of the genes. The possible reason is that since most genes are composed of multiple exons, the signal of gene expression, which sums up the signals of all its exons, is stronger and more robust to noises. Once the signal of gene expression is spread among its multiple exons, especially for those genes with lower expression and therefore less read coverage at each exon, the signal becomes weaker and less powerful to detect significant associations. We have observed that associations with genes of lower expression levels are less likely to be detected at their exons. Besides, the number of candidate exons is about 10 times larger than the number of candidate genes. This makes the p-value threshold for calling an eeQTL more stringent than that for calling an eQTL at the same FDR level.
Table 3 gives another view of the overlap of genes discovered at the gene level and at the exon level. If a gene that has an eQTL is also detected to have eeQTL for at least one exon of that gene, we call it as a shared gene. If a gene with an eQTL is not detected to have any eeQTL for any of its exons, we call the association as a gene-only association. If a gene with no eQTL is detected to have an eeQTL for one of its exons, we call the association as an exon-only association. Table 3 summarized the number of genes of those situations in the local, distant and external categories. Similarly, the eQTLs and eeQTLs can also be categorized as shared, gene-only and exon-only loci. If an eQTL for a gene is detected as an eeQTL of at least one exon of the gene, or equally if an eeQTL of one exon is also detected as an eQTL for the gene hosting the exon, we call it as a shared locus. If an eQTL of a gene is not detected as a significant eeQTL for any of its exons, we call the locus as gene-only locus. And if an eeQTL is detected for an exon but is not detected as an eQTL for the gene hosting the exon, we call it as an exon-only locus. Table 4 summarized that number of loci of those situations in the three categories. Note that some of the shared genes might have different loci for the eQTL and eeQTL, and some of the gene-only and exon-only loci may be of the same gene. It can happen that a gene is detected with an eQTL and one of its exons is also detected with an eeQTL (so the gene is a shared gene), but the eQTL and eeQTL are not on the same loci (so the loci are one gene-only locus and one exon-only locus).
Tables 3 and 4 further strengthened the observation from Tables 1 and 2. For the majority of genes that have been detected to be associated with some SNPs in this data set, the associations were only detected at the exon expression level (106 among 183= 32+ 106+ 45). Also many detected SNPs only have associations with expression of exons but not of the whole genes (247 among 658). Especially, while most published results on eQTL and/or splicing eQTL had put more attention on cis-loci and reported few trans-loci, we observed that when we study the expression at the exon level and put equal attention on all genome-wide SNPs, more trans-eeQTLs (external and distant eeQTLs) can be found than cis-eeQTLs (local eeQTLs) in numbers. This phenomenon has not been observed at the gene level eQTL, indicating that they are hard to be detected at gene expression level. The distant and external eeQTL SNPs are very likely linked with trans-factors that regulate gene splicing.
The definition of local vs. distant loci on the same chromosome is based on the distance between the SNP of the gene with which eQTL or eeQTL is detected. The threshold is set as 100 kb in the above experiments. We did a series of experiments with different settings of the threshold, and found that for the choices from 10 kb to 200 kb, the specific numbers corresponding to those in Tables 1-4 have some changes but the overall observation on the trends and on the comparison of eQTL vs. eeQTL results does not change.
In Figs. 1 and 2, we picked up a few examples that have exon-only eeQTLs detected to illustrate the situation that exon-level associations cannot be detected at the gene level. Figure 1 is the example of the gene ENSG00000077984 (gene CST7) on chr20 with the SNP rs1036333 on chr2. A significant eeQTL was found in the 2nd exon of the gene with this SNP with p-value of 2.20e-08 but significant eQTL was not found at the gene level. The SNP rs1036333 was in the intron region of the gene PLCL1 on chr2. We can see significant differences in the expressions between the genotypes at the exon level but not the gene level. Figure 2 is the example of ENSG00000160392 (gene C19 or F47) on chr19 with the SNP rs11882778, about 1.1 Mb away from the gene on the same chromosome. Most of the samples are of the CT or CC genotypes. The two groups have very similar distribution in the gene expression but a significant difference can be seen in the exon expression. The genes with detected eeQTLs and eQTLs are listed in the Supplementary File 1 (for eeQTLs) and Supplementary File 2 (for eQTLs).
4 DISCUSSION
The ubiquitous existence of alternative splicing events in human genes has made people to pay more attention to the expression and regulation of alternative isoforms. Isoforms are composed of combinations of exons and the quantitative regulation of isoforms must be implemented by the regulation of exons in the splicing procedures. On the other hand, current sequencing technology can only measure short reads and cannot cover whole mRNAs. The quantitative estimation of isoform expression is based on the read counts on the exons and their junctions. Therefore, it is more natural to study the expression and regulation of alternative splicing isoforms at the exon level instead of at the gene or isoform level. In this work, we proposed to use exon expression levels to replace gene expression levels in eQTL study and called this strategy as eeQTL or exon expression QTL. We followed the existing methods for data preprocessing and QTL mapping, but corrected for multiple testing by controlling the false discovery rate for all genome-wide SNPs instead of only the candidate SNPs in a selected region. This introduces no pre-assumption on the location of the regulatory loci and gives equal opportunity for discovering cis- and trans- factors. The computational experiments show that for gene-level eQTL study, more cis-loci are detected than trans-loci, but the proportion of identified trans-loci is larger than existing studied which reported mostly cis-loci. On the other hand, for exon-level eeQTL study, there are significantly more trans-loci being found than cis-loci. Many of the exon-level eeQTLs do not show any significant associations with SNPs at the gene level. These observations suggest that regulation of exons by trans-factors adds another layer of regulation after transcriptional regulation, and exon expression QTL study can be a powerful approach for detecting such regulatory factors.
The presented work is still quite preliminary since it only analyzed the genes that have been previously reported to have eQTL in this data set, and also the criterion adopted in this work is very conservative. Also the possible functional association of the discovered eeQTL SNPs with alternative splicing has not been further explored. But the observations can be a proof of concept for a more systematic survey for this direction. As more RNA-Seq data in multiple tissues accompanied with genotype data are becoming available, it’s the time for a complete study of genetic variations that are associated with and possibly responsible for the regulation of alternative splicing in cooperation with transcriptional regulation by integrating eQTL studies at the gene level, the exon level and the splicing junction level.
From the methodological viewpoints, there are still much open questions for eeQTL studies. Variations in exon expression levels can be affected by changes in steady-state gene expression level, or changes in exon splicing activities, or both. As discussed above, the signals at exon levels are also weaker and more sensitive to sequencing noises and biases than at gene levels. Methods for deconvoluting signals of the transcription regulation and splicing regulations need to be studied. Using junction reads to calculate splicing QTLs or transcript QTLs is another strategy for mapping genomic variations associated with splicing regulation, e.g., [
54,
61,
62]. For genes and isoforms with reasonable coverage on junction reads, such methods can identify splicing variations more efficiently. It can be expected that by integrating observations from eQTL, eeQTL and splicing QTL studies, we will be able to gain a more systematic understanding on the nature of genomic regulations on gene expression and alternative splicing.
Higher Education Press and Springer-Verlag Berlin Heidelberg