1 INTRODUCTION
Genome-wide association studies (GWAS) have been very successful in identifying genetic variants associated with complex human traits. However, these studies exhibit limited statistical power due to the polygenic effect of genetic variants as well as small effect sizes. Moreover, the underlying molecular mechanisms between genetic variation and these traits are not well understood [
1,
2]. One way that genetic variants can affect traits is through modulating gene expression [
3]. The importance of gene expression regulation has been shown by expression quantitative trait loci (eQTL) studies. Such relationships between gene expression and the traits can be studied by measuring both genetic variation and transcriptome data in the same study subjects. Unfortunately, such designs might not be cost effective. For some traits, it is hard to retrieve tissue samples to measure transcriptome levels [
3–
5]. Thus, we have a limited number of datasets with matched GWAS and transcriptome data.
Transcriptome-wide association studies (TWAS) shed novel insights into complex disease mechanisms when transcriptome data are not available in GWAS samples. These studies integrate information from GWAS and eQTL studies to test for gene-trait associations and prioritize potential causal genes that mediate variants effects on the traits. TWAS can reduce the multiple testing burden (~number of genes) compared to testing traits with genetic variations (~number of SNPs), and is a powerful tool to identify susceptible risk genes in complex human traits [
1,
4].
Briefly, there are three steps in a TWAS framework (see Fig. 1). In the first step, an imputation model is trained in a reference panel, for which matched genotype and transcriptome data are available. Usually, the imputation model is a linear model that takes individual level genotype as input, and outputs the predicted gene expression level, which corresponds to the genetically regulated component of gene expression [
1]. The number of predictable genes is limited by the power of the reference panel. Publicly available resources of reference panels are summarized in Table 1. In particular, GTEx provides the most comprehensive database, with adequate power to predict gene expression in 54 tissue sites (The number is based on GTEx v8, the number of tissues used is method-dependent). The imputation models can be trained either by using single tissue or by leveraging information from multiple tissues.
In the second step the single-tissue/cross-tissue model is applied to a GWAS cohort of interest to predict (or impute) gene expression. In the third step a gene-trait association test is conducted. The association can be tested either in a relevant tissue or in a cross-tissue manner. The connections between different gene-based tests will be further illustrated in the discussion section. There are two types of cross-tissue tests. The first one constructs a test statistic that combines information from single-tissue tests [
12]. The other one directly tests trait of interest with predicted expression in multiple tissues [
13].
Alternatively, some approaches [
2,
4,
12–
16] combined the second step and the third step, which made use of summary statistics from large GWAS cohorts instead of individual-level data. Such approaches leveraged the publicly available GWAS summary statistics from large GWAS repositories such as the GWAS Catalog [
17]. However, these summary-based methods are lack of information on covariance structures of genetic variants (
i.e., linkage disequilibrium (LD)), thus we need to use the training set or a population reference set such as the 1000 Genomes Project [
18] to estimate this information in the study population. Furthermore, the GWAS summary statistics and the reference transcriptome data should come from independent cohorts. We underscore the importance of independence here as GWAS summary statistics oftentimes include multiple cohorts that may have overlaps with the reference panels.
In 2019, Wainberg
et al. [
19] gave a perspective review on TWAS. They highlighted two challenges in interpreting GWAS results, co-regulation and tissue bias, suggested best practices among current studies, and discussed future opportunities for TWAS. In contrast, our review concentrates in the methodological aspect of TWAS framework, and summarized and discussed the extensions of TWAS methods.
2 METHODS AND TOOLS
In this section, we review the methods and tools to conduct TWAS analysis (see Table 2). To avoid ambiguity, we refer to the studies under the three-step framework (see introduction) as TWAS approaches, and to distinguish the work of Gusev
et al. as FUSION [
4]. The TWAS framework originates from PrediXcan [
1] and FUSION [
4]. The major difference between the two methods is that PrediXcan uses individual-level data as input while FUSION can take both individual-level data and summary-level GWAS results.
Gamazon
et al. firstly introduced the concept of genetically regulated expression (GReX), which excluded trait-altered components as well as non-genetic components such as those affected by environmental factors [
1]. Based on this concept, the PrediXcan framework focuses on estimating GReX and correlating them with a phenotype of interest. Specifically, PrediXcan adopts a penalized regression model, Elastic-Net [
30], to train weights between observed gene expressions and cis-SNPs using a reference transcriptome dataset. Next, GReX is imputed with individual-level genotype data and the trained weights. Then, a gene-trait association test is performed with imputed GReX and a phenotype of interest.
FUSION adopts a Bayesian sparse linear mixed model (BSLMM) [
31] to train weights between observed gene expressions and cis-SNPs with a reference dataset and tests for the association between predicted gene expression and phenotypes of interest. BLSMM combines Bayesian variable selection (BVSR) [
32] and linear mixed model (LMM) [
33] with the assumption of a normal mixture prior [
16,
31]. The summary-based model is implemented in the publicly available software FUSION.
Following PrediXcan and FUSION, other methods extend TWAS by making modifications on each step of the TWAS framework. Here, we discuss these methods in details.
3 EXTENSIONS IN IMPUTATION MODELS
3.1 Cross-tissue model
Given the challenge of training a robust and accurate gene expression model in a single tissue, fQTL [
14] and UTMOST [
12] model gene expressions in multiple tissues jointly. fQTL adopts a cross-tissue Bayesian multivariate linear regression model that hierarchically decomposes the effect size of cis-SNPs into SNP-specific and tissue-specific effects. In addition, fQTL used the spike-and-slab prior for variable selection and solved the model using an optimization method with fast convergence called stochastic variation inference (SVI) [
34]. UTMOST builds a cross-tissue prediction model by using a multivariate regression that penalizes on both within-tissue effects and cross-tissue effects. Compared with the Elastic-Net model on single tissue, UTMOST achieved improved prediction accuracy.
3.2 Nonparametric model
To avoid limitations of parametric prediction models, Nagpal
et al. utilized a nonparametric Bayesian model, latent Dirichlet (DPR) process regression [
35] to model the distribution of cis-SNP effect sizes [
16]. Compared to traditional parametric methods, this model is data driven and models the complex genetic architecture of gene expression. Actually, Elastic-Net [
30] and BSLMM [
31] can be considered as special cases of DPR. Further, Nagpal
et al. employed a variational Bayesian algorithm, which approximates MCMC with less computational burden [
36], to obtain posterior estimates of the effect sizes. In real application of ROS/MAP data [
37–
39], Nagpal
et al. showed the DPR model can predict more genes than PrediXcan. Their method is implemented in a publicly available software tool called TIGAR.
4 EXTENSIONS IN GENE-TRAIT ASSOCIATION TEST
4.1 Cross-tissue test
In addition to training a cross-tissue imputation model, another idea to integrate information across different tissues is to modify the testing step in the TWAS framework. MultiXcan adopted multivariate regression to test the association between phenotypes of interest and predicted gene expression in multiple tissues [
13]. They used the first
k principal components of the predicted gene expressions in multiple tissues to address the multicollinearity problem. Applied to 222 traits studied in UK Biobank, MultiXcan detected more associations than the single-tissue based PrediXcan in 103 traits. In their simulation study, PrediXcan exhibited more benefits under a setting in which the single causal tissue is known and the predicted gene expression captures the regulatory mechanism well in that tissue.
Rather than performing the test jointly in MultiXcan, UTMOST combines the testing results in each tissue to cross-tissue z-scores by using a generalized Berk-Jones (GBJ) test [
40]. Compared with single-tissue methods including PrediXcan and FUSION, UTMOST showed a remarkable improvement in the accuracy of gene expression imputation and the number of identified significant genes in 50 GWAS of complex human traits. Li
et al. investigated the influence of tissue context on gene prioritization in TWAS approaches, using PrediXcan as a representative of single-tissue TWAS and UTMOST as a representative of cross-tissue TWAS [
41]. Excluding the overlaps of significant genes identified, they found PrediXcan typically identified genes that are more unique in a single tissue, while UTMOST tended to prioritize gene-trait association in multiple tissues as well as in disease-related tissues.
4.2 Probabilistic model
The uncertainties imposed in the imputation step might have an impact in the test of gene-trait association. Bhutani
et al. discussed possible sources of the uncertainties, including model misspecification and biased parameter estimation, and developed a two-stage Bayesian regression model, BAY-TS, to address the limitation [
42]. Essentially, the method considered the posterior distribution of imputed gene expression instead of single point estimate to incorporate the uncertainty. They compared their method with PrediXcan and demonstrated that BAY-TS was able to detect more genes in real data analysis.
Yang
et al. jointly modeled the imputation step and testing step by utilizing a mixed model that takes individual-level GWAS data and can be performed tissue-by-tissue [
20]. Further, they derived a computationally efficient and stable algorithm named PX-EM [
43]. In real data analyses of 25 traits in NFBC1966 (dbGAP at http://ncbi.nlm.nih.gov/gap (Study Accession: phs000276.v1.p1) and GERA studies (dbGAP at http://ncbi.nlm.nih.gov/gap (Study Accession: phs000674.v2.p2), they observed that CoMM identified more genes than PrediXcan, which suggests the necessity and benefit of modeling uncertainties.
4.3 Summary-based model
Many TWAS approaches can be adapted to take summary-level GWAS data, in favor of leveraging the massive sample sizes in meta-analysis and making their methods more applicable when individual-level data are not available. After firstly adopting summary-level GWAS data in FUSION, a number of summary-based methods were developed successively. S-PrediXcan [
2], S-MultiXcan [
13], CoMM-S
2 [
15] are the summary-based version of PrediXcan, S-MultiXcan and CoMM, respectively.
Barbeira
et al. derived the test statistics of PrediXcan with summary-level data, and further showed that the performance of their method, S-PrediXcan, is in high concordance with that of PrediXcan [
2]. Along with the work, a general tool named MetaXcan built on S-PrediXcan was proposed. They also showed that FUSION based on summary statistics is equivalent to S-PrediXcan in terms of the Z-score of gene-trait association tests up to a scaling factor, related to the proportion of variance explained by a SNP’s allelic dosage and the proportion of variation explained by gene expression. In practice, this factor is very close to 1, leading to similar results between the two methods.
Yang
et al. has upgraded CoMM to CoMM-S
2 [
15]. The basic idea behind CoMM-S
2 is similar to CoMM, but the upgraded framework accommodates individual-level eQTL data and summary-level GWAS data. An efficient algorithm based on variational Bayesian EM was proposed accordingly to fit the model. The performance of CoMM-S
2 is comparable to CoMM, and CoMM-S
2 performs better than CoMM and S-PrediXcan when cellular heritability is low.
4.4 SPU test
Xu
et al. generalized the gene-based test in the TWAS framework [
44]. They first showed the test of individual-level data in PrediXcan and FUSION are special cases of a general testing framework, weighted sum test of a generalized linear model [
44,
45]. Then extended the sum test approach to adaptive sum of powered score (SPU) test, which assigns different power (an integer ) to the score vector in a data adaptive manner. Since there is no uniformly powerful test under different situations, the adaptive SPU tests (aSPU) gained more power by calculating the minimal P-value across a set of SPU tests of different weighting schemes. Their method can also be applied to both individual level and summary-level GWAS data.
5 INTEGRATION OF ANNOTATION DATA
Efforts have been made to integrate genomic annotations to improve gene prediction accuracy in the traditional framework of TWAS [
21,
22,
46–
48].
EpiXcan [
22] extended TWAS by integrating epigenomic data to increase accuracy in the imputation step. It leverages epigenomic information (DNA methylation histone modification and chromatin accessibility) [
46] to learn the prior of SNPs to be regulatory in a Bayesian hierarchical model, and then used the priors to derive penalty factors for SNPs, which is subsequently employed in the weighted Elastic Net approach to predict gene expression. TF-TWAS concerned the effect of transcription-factor (TF) polymorphism on transcription, and integrated this information by including SNPs within the coding regions of TFs in the imputation model, in addition to the cis-SNPs used in PrediXcan. Three models of how the TF polymorphisms affect gene expression were explored. The method was tested in four tissues, and identified 48 genes with improved R
2 comparing to PrediXcan.
Exploration of three-dimensional structure of the human genome enables studies to integrate information such as chromatin interaction data into TWAS. Wu
et al. first identified the promoter regions and the main body of genes based on a chromatin state model, and then defined the enhancer region based on interactions with promoter regions [
47]. Analogous to modeling cis-SNPs in traditional TWAS [
1,
4], they restricted candidate SNPs in the imputation model to the enhancer-promoter region, which reduced the number of candidate SNPs and may improve prediction accuracy. They conducted gene-based tests and identified novels genes that provide complementary information to results from FUSION.
6 EXTENSION TO “X-WAS”
The TWAS framework can be naturally extended to “X-WAS”, which accommodates other intermediate phenotypes that can be imputed by GWAS data.
For example, the role of DNA methylation, an epigenetic mechanism that is essential for both genomic processes and normal development in human beings [
49,
50], has been studies in the MWAS framework in schizophrenia [
51], bipolar disorder [
51], Parkinson’s disease [
52], and Alzheimer’s disease [
53]. Methods of methylation imputation and methylation-trait association test are also analogical to TWAS methods. Similarly, Xu
et al. conducted IWAS, imaging-wide association study (IWAS), which used gray matter volumes of several brain regions of interest as imaging intermediate phenotypes [
54]. The proposed IWAS framework is applicable to both individual-level and summary-level GWAS data, and other imaging intermediate phenotypes.
7 OTHER RELATED METHODS
Essentially, TWAS approaches integrate GWAS data with eQTL studies to identify disease/trait associated genes. Besides TWAS, there are other methods to integrate information from both GWAS and eQTL studies. Mendelian randomization (MR) based methods and colocalization methods are along this line of work.
7.1 MR-based methods
MR-based analysis uses genetic variants as instrumental variables to test for the association between gene expression and complex traits. Oftentimes, MR-base methods have more assumptions than TWAS, which leverage cis-eQTLs as instrument variables to infer gene-trait associations. The method is based on observed gene expressions, which do not involve gene expression prediction like TWAS.
Zhu
et al. utilizes the most significant associated cis-eQTLs as the instrument variable to detect target genes associated with a complex trait in SMR [
55]. Porcu
et al. proposed a multivariable (multi-gene) multi-instrument MR approach named Transcriptome-Wide Mendelian Randomization (TWMR), which can reduce bias due to pleiotropic effects [
56]. Compared to TWAS, MR-based methods (
e.g., SMR and TWMR) are more sensitive to co-regulation due to violation of model assumptions. For SMR, genetic variants and the conditioned outcome is assumed to be independent, while InSIDE (Instrument Strength Independent of Direct Effect) was assumed in TWMR.
Barbeira
et al. systematically evaluated and compared SMR and S-PrediXcan. They showed that SMR P-values tend to be less significant compared with S-PrediXcan, and discussed possible explanations [
2]. First, only using a single SNP as the instrument in SMR will render the method less powerful. Second, it is inherent in the SMR framework that the significance is limited by the significance of the eQTL association. Noticeably, the derived statistic in SMR is not well calibrated because the chi-square approximation is only valid in two extreme cases: when the eQTL association is much more significant than the GWAS association or when the GWAS association is much more significant than the eQTL association [
2]. Similar limitations apply to TWMR. In addition, TWMR excludes pleiotropic SNPs to avoid estimation bias, the method could suffer from power loss when more SNPs are excluded owing to evidence for mild pleiotropy with increasing GWAS study size [
56].
7.1.1 Colocalization
Colocalization analysis identifies genetics variants that are significant in both GWAS and eQTL studies. Unlike TWAS, colocalization does not perform gene expression prediction and gene-trait association tests. Instead, colocalized SNPs are the unit of their concern. However, FUSION is conceptually similar to a test for colocalization of signal between expression and a complex trait [
4,
26,
57]. More importantly, these colocalization methods can serve as a post filtering step to mitigate issues with LD-contamination after TWAS-type gene-level associations (see Discussion).
Nica
et al. proposed an empirical method, Regulatory Trait Concordance (RTC), to reveal that significant SNPs in GWAS are colocalized with cis- or trans-eQTLs after accounting for LD structure [
58]. He
et al. proposed a Bayesian statistical method, Sherlock, to match the genetic signature of a specific gene with GWAS signals [
29]. Sherlock utilizes information in both cis- and trans-SNPs to constitute the genetic signature while accounting for the uncertainty of LD using a block-level Bayes factor.
Plagnol
et al. proposed a statistical procedure, QTLMatch, to detect whether the colocalization of GWAS and eQTL signals is coincidental by testing the null hypothesis of a sole, causal variant for the GWAS and the eQTL signals [
25]. COLOC, a Bayesian method expanded from QTLMatch, considers five scenarios for the colocalization of GWAS and eQTL signals, and subsequently provides the posterior probabilities for the five conditions. From simulation results [
4], FUSION and COLOC had similar power under the scenario with a single typed causal variant and COLOC has slightly lower power at small GWAS sizes. However, FUSION could have better performance when the causal variant was untyped or in the presence of allelic heterogeneity, which is likely due to the fact that FUSION explicitly models LD to better capture untyped variants. Multiple-trait-coloc (moloc), extends the COLOC framework to integrate GWAS summary statistics and multiple molecular QTL data to provide evidence how the variants are shared across multiple traits (
e.g., complex trait, gene expression and DNA methylation).
eCAVIAR (eQTL and GWAS Causal Variant Identification in Associated Regions) [
28] provides the colocalization posterior probability (CLPP) for each variant in a locus for a specific eGene (eGenes are genes that have at least one significant variant (P-value<1e–5 when corrected for multiple hypothesis) in a tissue to indicate that the variant is causal in both GWAS and eQTL studies. CLPP requires the marginal statistics from GWAS and eQTL studies while accounting for LD structure of genetic variants in a locus. eCAVIAR allows for multiple variants to be causal in a single locus, which is more accurate than COLOC and RTC in the presence of allelic heterogeneity.
Wen
et al. developed a unified inference framework, enloc, for enrichment analysis, fine-mapping and colocalized association testing [
27]. Their model first estimates the enrichment level of molecular QTLs in the GWAS signals. Then fine-mapping of GWAS signals is conducted to construct the SNP-level priors accounting for the uncertainty of the association status of molecular QTLs. The SNP-level colocalization probability (SCP) and regional colocalization probability (the sum of the SCPs of correlated SNPs within an LD block that harbors a GWAS signal) can be obtained from the previous steps. COLOC and eCAVIAR can be viewed as special cases of enloc, which eliminates the subjective prior specification.
8 DISCUSSION
In this review, we systematically summarize the development of TWAS, which integrates GWAS and eQTL data under the assumption that gene expression mediates the association between genetic variants and traits. Compared with GWAS, TWAS reduces the testing burden and provides straightforward biological interpretations.
A pipeline for a TWAS analysis should be carefully chosen based on the available data and the trait of interest. Systematic evaluation and comparison of TWAS approaches are needed to guide selection of tools. In particular, Wainberg
et al. [
19] suggested using fine mapping methods in interpreting TWAS results. Moreover, they raised concerns regarding to selection of single tissue and cross-tissue methods. In general, cross-tissue analysis may identify more potential causal genes by leveraging information across tissues, and single-tissue analysis may be more powerful when the mechanism behind the trait of interest and tissues are explicit. Comprehensive comparisons between PrediXcan and UTMOST can be found in [
41]. In addition, functional annotation data may also boost statistical power and provide complementary information to prioritize causal genes.
In recent years, TWAS have emerging applications on identifying novel candidate genes of complex human diseases such as cancers [
47,
59–
62], neuropsychiatric diseases [
41,
63],
etc. [
64–
66]. (see Table S1). In addition to the macro-phenotypes, micro-phenotypes such as chromatin status has also been considered. Gusev
et al. defined chromatin status with individual level epigenetics data, and identified genes associated with chromatin status with TWAS framework [
48]. After integrating the results from TWAS of schizophrenia and chromatin phenotypes, they identified overlapping genes that bring insights to the regulatory mechanism.
Despite the success of TWAS, there are also arising concerns in its further application. Low prediction accuracy of gene expression will lead to insufficient testing power, and may cause spurious conclusions. Nonetheless, a standard of informative prediction accuracy is not definitive [
59]. First, the choice of the reference dataset could be problematic, and the inherent limitations of prediction model arise from the reference transcriptome data. For single tissue TWAS approaches, there is not a principled approach to selecting the tissue to be used. Furthermore, TWAS may suffer low prediction accuracy when data are not available in the right tissue for the trait of interest [
5] or sample sizes of disease relevant tissues are relatively small [
60,
61]. Second, there is lack of assessment whether the prediction models learned in one population can be transferred to another population. Keys
et al. evaluated the accuracy of trans-ancestral prediction of gene expression [
67]. The prediction models were trained in European ancestral panels and applied to predict gene expression in African American individuals. As expected, the R
2 levels were higher in Europeans than that of African-Americans, underscoring the importance of ancestry specific reference eQTL panels. Third, most TWAS methods assume linear models for the prediction of GReX, however, prediction accuracy will be affected when the relationship between GReX and the SNPs is nonlinear, or when trans-regulation plays an important part in modulating gene expression [
61]. A straightforward solution is to incorporate trans-eQTLs in the prediction step [
68]. However, it is still challenging to decide how to appropriately model trans-eQTLs, as they are more difficult to detect because of a much heavier multiple testing burden and smaller effect sizes than cis-eQTLs [
69,
70]. Also, trans-eQTLs are more likely to be tissue-specific [
68,
71,
72].
In addition to poor prediction accuracy, interpretation of TWAS associations could also be complicated by confounding factors, especially when SNPs affect the phenotype of interest in non-regulatory mechanisms [
73]. Due to LD in the genome, co-regulation in TWAS may lead to prioritization of non-causal genes. Mancuso
et al. discussed four scenarios of co-regulation, and proposed a fine-mapping method, FOCUS, to address this problem [
19,
23].