1 INTRODUCTION
Genetic prediction of complex disease is a major goal in human genetics research. Accurate stratification of genetic risk requires a quantitative understanding of the genetic architecture underlying the trait of interest. The success of genome-wide association studies (GWAS) in the past decade has shed important light on the etiology of numerous complex diseases. Common single-nucleotide polymorphisms (SNPs) typically have weak to moderate individual effects on the phenotypic outcome. However, when the effects of hundreds of thousands of SNPs are aggregated, they explain a substantial proportion of the phenotypic variability [1–3]. By aggregating the risk burden of numerous SNPs, including those that fail to surpass the genome-wide significance threshold, polygenic risk scores (PRSs) have been shown to yield greater predictive performance for various diseases and traits. Compared with genetic risk models built upon statistical learning methods that require individual-level genotype and phenotype data [4–6], PRS methods directly model publicly available GWAS summary statistics and enjoy superior computational efficiency and broader applicability. Consequently, construction and evaluation of PRSs have become a routine follow-up analysis for many recent GWASs [7–9].
Despite these successes, PRSs derived from summary-level data still have only moderate predictive capability. In particular, simple PRSs based on independent genetic markers and marginal effect estimates may not fully capture the genetic architecture of complex traits or provide accurate prediction results [10]. To improve the predictive power of PRSs derived from GWAS summary statistics, it is crucial to select the best set of genetic markers and estimate their effect sizes accurately [11]. Despite recent advances in biobank-scale GWASs, it remains challenging to precisely identify causal variants and quantify variants' true effects, in part due to the presence of linkage disequilibrium (LD).
In this review, we introduce recent advancements in PRS methodology. We summarize commonly used PRS approaches (Table 1), including simple yet popular methods and recent, more sophisticated models, with a focus on post-GWAS estimation of SNP weights and selection of optimal tuning parameters. We also discuss methods that achieve both objectives simultaneously. Specifically, we first focus on PRS frameworks that integrate GWAS data with LD, functional genome annotations, and pleiotropy information to estimate variant effects. Then, we introduce recent methods for PRS model fine-tuning. Finally, we discuss the limitations of PRS applications and lay out some future directions for PRS research.
2 THE BASICS OF PRS MODELS
A PRS is a weighted sum of effect allele counts at a set of pre-selected genetic markers:
$$S = \sum_{i \in R} w_i X_i,$$
where $S$ denotes the PRS, $X_i$ indicates the $i$th SNP in the dataset, $w_i$ is the weight assigned to the $i$th SNP, and $R$ is the set of SNPs included in the model. The standard PRS approach uses marginal association coefficients obtained from GWAS as weights and applies arbitrary thresholds on the association strength of genetic variants (e.g., $p < 5\times10^{-8}$ or no threshold at all). It is not always ideal to include all SNPs in the prediction model, especially when the GWAS is underpowered and coefficient estimates are noisy [24]. PRSs with stringent p-value cutoffs may outperform other models if few causal variants exist for the phenotype of interest or if the GWAS does not have sufficient statistical power. However, PRSs based on a genome-wide significance threshold have been shown to underperform on polygenic traits [1]. In general, SNPs that are strongly associated with the phenotype should be prioritized, but properly selecting the thresholds remains an empirical challenge in real data applications.
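To make the construction concrete, the sketch below computes a simple thresholded PRS from a genotype matrix and GWAS marginal statistics; all variable and function names are illustrative rather than from any specific software.

```python
import numpy as np

def simple_prs(genotypes, betas, pvals, p_threshold=5e-8):
    """Weighted sum of allele counts over SNPs passing a p-value cutoff.

    genotypes : (n_samples, m_snps) matrix of effect allele counts (0/1/2)
    betas     : (m_snps,) marginal GWAS effect estimates used as weights
    pvals     : (m_snps,) marginal GWAS association p-values
    """
    keep = pvals < p_threshold          # the SNP set R
    return genotypes[:, keep] @ betas[keep]

# toy example with random data
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 500)).astype(float)
beta_hat = rng.normal(0, 0.05, size=500)
p = rng.uniform(0, 1, size=500)
scores = simple_prs(X, beta_hat, p, p_threshold=0.05)
```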
Due to pervasive LD in the genome, SNPs at the same genomic locus may be strongly correlated. Therefore, it is a common approach to prune SNPs in the data so that only independent predictors are included in the model. LD-pruning is an iterative algorithm that removes SNPs having strong correlations with an 'index SNP' in each designated LD block. In this case, the 'index SNP' is randomly chosen. Additionally, a threshold (e.g., for the LD strength $r^2$) needs to be selected to decide whether a pair of SNPs are in LD. A similar but more popular method is LD-clumping (also known as informed LD-pruning). This approach incorporates information from GWAS associations and always selects the most significant marker within an LD range to be the 'index SNP'. Combined with p-value thresholding, these approaches are referred to as "pruning + thresholding" (P+T) and "clumping + thresholding" (C+T) and can be implemented using software tools such as PLINK [28] and PRSice-2 [12].
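A rough sketch of the greedy clumping idea described above (not the exact PLINK or PRSice-2 implementation) iterates over SNPs ordered by significance and discards neighbors in high LD with each selected index SNP:

```python
import numpy as np

def ld_clump(pvals, ld_r2, r2_threshold=0.1):
    """Greedy LD-clumping over a single locus.

    pvals : (m,) GWAS p-values for the SNPs at the locus
    ld_r2 : (m, m) pairwise LD (r^2) matrix for the same SNPs
    Returns indices of retained 'index SNPs'.
    """
    remaining = set(range(len(pvals)))
    kept = []
    for snp in np.argsort(pvals):           # most significant first
        if snp not in remaining:
            continue
        kept.append(int(snp))
        # drop all SNPs in strong LD with the index SNP (including itself)
        remaining -= {j for j in remaining if ld_r2[snp, j] >= r2_threshold}
    return kept
```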
A common application of PRS is to predict individual trait values or disease outcomes in an external dataset independent from the training GWAS samples. It is common practice to regress sample phenotypes on PRS values in the testing data and report $R^2$ and the area under the ROC curve (AUC) to quantify the predictive performance for continuous and dichotomized outcomes, respectively. Note that there does not exist a universal "best" subset of SNPs that always achieves the highest prediction accuracy. Depending on the true underlying genetic architecture of the trait being studied and the quality of the GWAS data, the LD and p-value thresholds need to be optimized by comparing PRSs' predictive power on independent samples.
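A minimal evaluation sketch along these lines, using scikit-learn; the function evaluate_prs and its arguments are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score

def evaluate_prs(prs, y_continuous=None, y_binary=None):
    """Report R^2 for a continuous trait and/or AUC for a binary trait."""
    results = {}
    if y_continuous is not None:
        reg = LinearRegression().fit(prs.reshape(-1, 1), y_continuous)
        results["R2"] = reg.score(prs.reshape(-1, 1), y_continuous)
    if y_binary is not None:
        results["AUC"] = roc_auc_score(y_binary, prs)
    return results
```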
3 PENALIZED REGRESSION MODELS BASED ON GWAS SUMMARY STATISTICS
PRS models that clump SNPs and use marginal GWAS weights may not achieve optimal prediction accuracy when there are multiple causal SNPs at a single locus or when the causal SNPs are not included in the data. Several methods that incorporate LD into the PRS model have been shown to improve predictive performance.
LDpred is one of the first summary-statistics-based PRS frameworks that explicitly incorporates LD [13]. Following recent advances in polygenic modeling and heritability estimation, LDpred was built upon a random effects model. It uses an empirical Bayesian approach to estimate SNP effects. More specifically, the SNP weights can be denoted as
$$w = E\left(\beta \mid \hat{\beta}, V\right),$$
where $\beta$ denotes the vector of SNP causal effects on the phenotype, $\hat{\beta}$ is the vector of marginal effect estimates obtained from GWAS summary statistics, $V$ is the LD matrix, which in practice is typically estimated from an external reference panel, and $m$ is the number of SNPs in the GWAS. Combining marginal GWAS coefficients and LD, LDpred re-estimates the effects of all SNPs using the posterior expectation without pruning any SNPs. Of note, this framework has been adopted by multiple PRS methods developed later [14, 15]. The main difference among these approaches is the choice of prior for $\beta$.
LDpred models SNP effects with two types of priors: an infinitesimal prior that assumes equal contribution from all SNPs in the data, i.e.,
$$\beta_i \sim N\!\left(0, \frac{h^2}{m}\right),$$
where $h^2$ denotes the heritability of the phenotype; or a non-infinitesimal (point-normal) prior that assumes non-zero effects for only a proportion of causal SNPs, i.e.,
$$\beta_i \sim \begin{cases} N\!\left(0, \dfrac{h^2}{m p}\right) & \text{with probability } p, \\[4pt] \delta_0 & \text{with probability } 1-p, \end{cases}$$
where $\delta_0$ is a point mass at 0 and $p$ is the proportion of causal SNPs. Here, $p$ is treated as a tuning parameter and needs to be selected using an independent validation dataset. LDpred enjoys the flexibility of being able to handle both sparse and polygenic genetic architectures. Under the non-infinitesimal model, LDpred uses a Gibbs sampler to estimate the posterior expectation $E(\beta \mid \hat{\beta}, V)$ in each LD block. Under the infinitesimal model, where all SNPs are considered causal, the posterior expectation has a closed-form solution.
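Under the infinitesimal model and standardized genotypes, the closed form is approximately $E(\beta \mid \hat{\beta}, V) = \big(\tfrac{m}{N h^2} I + V\big)^{-1}\hat{\beta}$ [13], where $N$ is the GWAS sample size. A minimal NumPy sketch of this re-weighting for one LD block (variable names are illustrative, not from the LDpred software):

```python
import numpy as np

def ldpred_inf_weights(beta_hat, ld, n, h2, m):
    """Closed-form posterior mean of SNP effects under the LDpred
    infinitesimal model for one LD block (standardized genotypes assumed).

    beta_hat : (k,) marginal effect estimates for SNPs in the block
    ld       : (k, k) LD (correlation) matrix for the block
    n        : GWAS sample size
    h2       : trait heritability
    m        : total number of SNPs genome-wide
    """
    k = len(beta_hat)
    shrink = (m / (n * h2)) * np.eye(k)
    return np.linalg.solve(shrink + ld, beta_hat)
```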
With a similar choice of post-GWAS estimator for SNP effects, PRS-CS is a recently developed Bayesian high-dimensional PRS framework that employs a continuous shrinkage prior on $\beta$ [14]. Unlike the normal or point-normal priors implemented in LDpred, the prior in PRS-CS is a continuous distribution that is jointly determined by a global scaling parameter and a marker-specific local shrinkage parameter:
$$\beta_i \sim N\!\left(0, \frac{\sigma^2}{N}\,\phi\,\psi_i\right),$$
where $\phi$ is the global shrinkage parameter, $\psi_i$ is the SNP-specific local shrinkage factor, $\sigma^2$ is the residual variance, and $N$ is the sample size. While the global scaling factor represents the uniform shrinkage applied to all markers, the local shrinkage parameter is unique for each variant and can be drawn from an appropriate absolutely continuous distribution (e.g., a gamma-gamma distribution):
$$\psi_i \sim \mathrm{G}(a, \delta_i), \qquad \delta_i \sim \mathrm{G}(b, 1),$$
where $\mathrm{G}$ denotes the gamma distribution and $\delta_i$ is the unknown scale parameter of the distribution for the local shrinkage factor at each SNP. Such a prior design enables the algorithm to shrink noisy estimates towards zero while maintaining the large effects of SNPs demonstrating stronger signals, making PRS-CS adaptive to diverse genetic architectures. The optimal global shrinkage factor can be obtained either through external validation on an independent dataset, or estimated through a full Bayesian approach that assigns a half-Cauchy prior to the global scaling parameter.
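To build intuition for how the gamma-gamma prior produces SNP-specific shrinkage, the sketch below draws per-SNP prior variances $\tfrac{\sigma^2}{N}\phi\psi_i$; the parameter values and the scale parameterization of the gamma draws are illustrative assumptions, and this is not the PRS-CS software.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_prior_variances(m, a=1.0, b=0.5, phi=0.01, sigma2=1.0, n=50000):
    """Draw per-SNP prior variances sigma2/N * phi * psi_i under a
    gamma-gamma continuous shrinkage prior (illustrative parameter values)."""
    delta = rng.gamma(shape=b, scale=1.0, size=m)   # delta_i ~ G(b, 1)
    psi = rng.gamma(shape=a, scale=delta)           # psi_i | delta_i ~ G(a, delta_i)
    return sigma2 / n * phi * psi

variances = draw_prior_variances(m=1000)
# heavy right tail: most SNPs get a tiny prior variance (strong shrinkage),
# while a few retain a large prior variance (little shrinkage)
print(np.percentile(variances, [50, 90, 99]))
```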
We note that frequentist methods have also been developed to fit penalized regression models on GWAS summary statistics. These methods (e.g., lassosum [16] and PANPRS [17]) provide a different perspective on re-estimating SNP effects using marginal association results as inputs. A traditional penalized regression model (e.g., lasso) requires both the phenotype and genotype data measured on each individual to optimize the objective function and estimate regression coefficients. However, it can be shown that the penalized loss function can be iteratively optimized using only the inner product of genotypes and the inner product of genotypes and phenotypes:
$$\frac{1}{2}\,\beta^\top X^\top X \beta - \beta^\top X^\top y + \lambda \|\beta\|_1,$$
where $X$ and $y$ are the standardized genotypes and phenotypes and $\lambda$ is the tuning parameter. Notably, the inner product of genotypes $X^\top X$ can be obtained from the LD correlation matrix, and the inner product of genotypes and phenotypes $X^\top y$ can be obtained from marginal association results. Regarding the selection of the tuning parameter $\lambda$ that controls the penalty strength, lassosum implements a "pseudo-validation" approach that replaces the validation genotypes and phenotypes in the PRS-phenotype correlation with terms that can be obtained from GWAS summary statistics:
$$\mathrm{cor}\!\left(X^{*}\beta,\, y^{*}\right) = \frac{\beta^\top X^{*\top} H y^{*}}{\sqrt{\beta^\top X^{*\top} H X^{*}\beta}\,\sqrt{y^{*\top} H y^{*}}},$$
where $X^{*}$ and $y^{*}$ are the genotypes and phenotypes of the testing samples, and $H$ is the mean-centering matrix. Built upon a similar idea, PANPRS is a newer method that also trains penalized regression using GWAS summary statistics. Compared with lassosum, PANPRS can model not only quantitative but also binary traits and is capable of integrating external annotation information.
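As a minimal sketch of this idea, the coordinate-descent lasso below is driven entirely by an LD matrix (standing in for $X^\top X / n$) and SNP-phenotype correlations (standing in for $X^\top y / n$); it is a generic illustration under standardized genotypes and phenotypes, not the exact lassosum or PANPRS algorithm.

```python
import numpy as np

def summary_lasso(ld, r, lam, n_iter=100):
    """Coordinate-descent lasso using only summary-level inputs.

    ld  : (m, m) LD correlation matrix (proxy for X'X / n)
    r   : (m,) SNP-phenotype correlations (proxy for X'y / n)
    lam : L1 penalty strength
    """
    m = len(r)
    beta = np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            # partial residual correlation for SNP j
            rho = r[j] - ld[j] @ beta + ld[j, j] * beta[j]
            # soft-thresholding update
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / ld[j, j]
    return beta
```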
4 INCORPORATING FUNCTIONAL ANNOTATIONS IN PRS MODELS
Another major limitation of genetic prediction methods is the lack of biological interpretation. A plethora of transcriptomic and epigenomic annotation data have been generated and made available by large consortia including GTEx [29], ENCODE [30], and the Roadmap Epigenomics Project [31]. Integrating these data with GWAS has provided insights into functional variant fine-mapping, heritability enrichment, and risk gene prioritization [32]. Methods have also been developed to incorporate functional annotation data in PRS models.
Most Bayesian shrinkage PRS methods use noninformative priors for SNP effect sizes. Leveraging advances in statistical methods that partition heritability by functional annotation [33, 34], AnnoPred [15] uses an empirically estimated informative prior to prioritize SNP predictors in the PRS:
$$\beta_i \sim N\!\left(0, h_i^2\right).$$
Here, $h_i^2$ is the per-SNP heritability estimate in an annotation-stratified heritability model. That is,
$$h_i^2 = \sum_{j} \tau_j\, \mathbb{1}\{\text{SNP } i \in A_j\},$$
where $A_j$ denotes the $j$th annotation category in the analysis and $\tau_j$ is the corresponding regression coefficient from stratified LD score regression, which quantifies the per-SNP heritability for variants in the $j$th annotation [33]. Similar to LDpred, AnnoPred also uses the posterior expectation of SNP effects, i.e., $E(\beta \mid \hat{\beta}, V)$, as weights in the PRS. This approach accounts for the trait-specific genetic architecture by empirically and adaptively prioritizing functional SNPs with greater impacts. We also note that functional annotations have been similarly incorporated into the penalty terms of frequentist approaches such as PANPRS [17]. Using continuous traits as an example, when modeling functional annotations, PANPRS obtains the penalized regression coefficients by minimizing
$$\frac{1}{2}\,\beta^\top X^\top X \beta - \beta^\top X^\top y + \sum_{i}\left(\lambda_0 + \sum_{s} \lambda_s Z_{is}\right)|\beta_i|,$$
where $\lambda_0$ is the baseline penalty, $\lambda_s$ is the penalty for the $s$th annotation category, and $Z_{is}$ is a binary variable that equals 1 when the $i$th SNP is not annotated in the $s$th annotation category and 0 otherwise. In this way, SNPs that are annotated in more categories of functional annotation receive a smaller penalty and are therefore weighted more heavily in the final PRS.
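A small sketch mirroring the two annotation formulas above: computing per-SNP heritability from annotation membership and stratified LD score regression coefficients, and computing annotation-dependent penalties. All names and numerical values are made up for illustration; this is not the AnnoPred or PANPRS code.

```python
import numpy as np

def per_snp_h2(annot, tau):
    """Per-SNP heritability h_i^2 = sum_j tau_j * 1{SNP i in A_j} from a
    binary annotation matrix and stratified LD score regression coefficients."""
    return annot @ tau

def annotation_penalties(annot, lam0, lam_s):
    """Per-SNP L1 penalties lam0 + sum_s lam_s * Z_is, where Z_is = 1
    when SNP i is NOT annotated in category s (illustrative form)."""
    not_annotated = 1 - annot
    return lam0 + not_annotated @ lam_s

annot = np.array([[1, 0, 1],
                  [0, 0, 0],
                  [1, 1, 1]])          # 3 SNPs x 3 annotation categories
tau = np.array([1e-5, 5e-6, 2e-5])     # per-annotation heritability coefficients
print(per_snp_h2(annot, tau))          # SNPs in more categories get larger prior variance
print(annotation_penalties(annot, 0.01, np.array([0.02, 0.02, 0.02])))
```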
5 INTEGRATING MULTIPLE PHENOTYPES
Genetic correlation analysis and pleiotropic association mapping have revealed concordant genetic associations across many traits [35–37]. The same genetic variant may yield correlated effects on several diseases or traits. Aggregating association results from multiple genetically correlated GWASs may increase the effective sample size, improve the estimation of SNP effects, and enhance the prediction accuracy of PRS. We introduce several methods that jointly model summary statistics from multiple GWASs in genetic risk prediction.
PleioPred [18] is a multi-trait extension of both LDpred [13] and AnnoPred [15]. PleioPred models each SNP's effects on two different traits with a bivariate normal distribution, i.e.,
$$\begin{pmatrix}\beta_{i1}\\ \beta_{i2}\end{pmatrix} \sim N\!\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix} h_{i1}^2 & \rho\, h_{i1} h_{i2} \\ \rho\, h_{i1} h_{i2} & h_{i2}^2 \end{pmatrix}\right),$$
where $\beta_{i1}$ and $\beta_{i2}$ denote the $i$th SNP's effects on the two traits, $h_{i1}^2$ and $h_{i2}^2$ are the per-SNP heritabilities for the two traits, and $\rho$ is the genetic correlation. Conditioning on marginal association statistics and an external LD reference panel, the method provides an empirical Bayesian estimator of SNP effects for both traits. The PRSs are constructed as $S_1 = \sum_{i} E(\beta_{i1} \mid \hat{\beta}_1, \hat{\beta}_2, V)\, X_i$ and $S_2 = \sum_{i} E(\beta_{i2} \mid \hat{\beta}_1, \hat{\beta}_2, V)\, X_i$, respectively. Like AnnoPred, PleioPred can be generalized to non-infinitesimal models, and it upweights the effects of SNPs in annotation regions with strong heritability enrichment. Under the infinitesimal assumption, the genetic correlation can be estimated using existing methods [36, 37] and the SNP effects have a closed-form solution. A Gibbs sampler approach is used to estimate effects under the non-infinitesimal model.
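To illustrate the empirical Bayes idea for the simplest case of a single unlinked SNP (a sketch assuming standardized genotypes and phenotypes, not the PleioPred implementation), the posterior mean of the two effects is the prior covariance times the inverse of the prior-plus-sampling covariance applied to the marginal estimates:

```python
import numpy as np

def bivariate_posterior_mean(beta_hat, h2_snp, rho, n):
    """Empirical Bayes posterior mean of a single unlinked SNP's effects
    on two traits (illustrative; standardized genotypes/phenotypes assumed).

    beta_hat : (2,) marginal estimates for the two traits
    h2_snp   : (2,) per-SNP heritabilities for the two traits
    rho      : genetic correlation between the traits
    n        : (2,) GWAS sample sizes for the two traits
    """
    cross = rho * np.sqrt(h2_snp[0] * h2_snp[1])
    cov_prior = np.array([[h2_snp[0], cross],
                          [cross, h2_snp[1]]])
    cov_noise = np.diag(1.0 / np.asarray(n))        # sampling variance of beta_hat
    return cov_prior @ np.linalg.solve(cov_prior + cov_noise, beta_hat)

print(bivariate_posterior_mean(np.array([0.02, 0.01]),
                               np.array([1e-4, 1e-4]), rho=0.6,
                               n=np.array([200000, 150000])))
```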
Since PleioPred, multiple methods based on similar ideas have been developed (e.g., wMT-SBLUP [19] and CTPR [20]), showing various degrees of improvement. MTAG is another approach designed for multi-trait GWAS meta-analysis [21]. For a given SNP, MTAG models its true effects on a number of traits with the random effects model
$$\beta_i = (\beta_{i1}, \ldots, \beta_{iT})^\top \sim N(0, \Omega),$$
where $\beta_i$ are the random effects of the $i$th SNP on the $T$ traits. Then, MTAG solves for the generalized method of moments estimator of the effect on trait $t$:
$$\hat{\beta}_{it}^{\text{MTAG}} = \frac{\left(\dfrac{\omega_t}{\omega_{tt}}\right)^{\!\top}\left(\Omega - \dfrac{\omega_t \omega_t^\top}{\omega_{tt}} + \Sigma_i\right)^{-1}\hat{\beta}_i}{\left(\dfrac{\omega_t}{\omega_{tt}}\right)^{\!\top}\left(\Omega - \dfrac{\omega_t \omega_t^\top}{\omega_{tt}} + \Sigma_i\right)^{-1}\dfrac{\omega_t}{\omega_{tt}}},$$
where $\omega_t$ and $\omega_{tt}$ are the $t$th column and diagonal element of $\Omega$, the covariance matrix of $\beta_i$, and $\Sigma_i$ is the sampling covariance matrix of the marginal estimates $\hat{\beta}_i$. Although genetic prediction is not its primary purpose, effect sizes estimated by MTAG can be used to construct PRSs. Due to the improved precision of the effect estimates, these PRSs generally outperform single-trait PRS approaches (a small numerical sketch of this estimator is given at the end of this section). Finally, a recent approach, GenomicSEM [22], leverages pairwise genetic correlations to quantify the genetic basis of the underlying psychopathological factor shared by multiple psychiatric traits. By borrowing information from multiple GWASs, the PRS for the latent factor outperforms predictive models constructed using each GWAS separately.
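Returning to the MTAG estimator reconstructed above, a minimal NumPy sketch for a single SNP is shown below; the covariance matrices and effect values are made-up inputs and this is not the MTAG software.

```python
import numpy as np

def mtag_estimate(beta_hat, omega, sigma, t):
    """MTAG-style estimate of a SNP's effect on trait t from multi-trait
    marginal estimates (illustrative reimplementation of the formula above).

    beta_hat : (T,) marginal GWAS estimates for one SNP across T traits
    omega    : (T, T) covariance matrix of true effects across traits
    sigma    : (T, T) sampling covariance matrix of beta_hat
    t        : index of the target trait
    """
    w_t = omega[:, t] / omega[t, t]
    middle = omega - np.outer(omega[:, t], omega[:, t]) / omega[t, t] + sigma
    middle_inv = np.linalg.inv(middle)
    return (w_t @ middle_inv @ beta_hat) / (w_t @ middle_inv @ w_t)

omega = np.array([[1.0, 0.6], [0.6, 1.0]]) * 1e-4   # effect covariance across 2 traits
sigma = np.diag([1 / 50000, 1 / 80000])             # sampling variances (roughly 1/N)
beta_hat = np.array([0.015, 0.010])
print(mtag_estimate(beta_hat, omega, sigma, t=0))
```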
6 MODEL TUNING
Most existing PRS methods include tuning parameters. Examples include the p-value and LD thresholds in a C+T PRS, the proportion of causal variants in LDpred, and the penalty strength in penalized regressions. These parameters add flexibility to the models. When properly selected, they effectively improve the predictive performance of PRS. When individual-level data are available, it is straightforward to select the optimal tuning parameters through cross-validation [38]. For a k-fold cross-validation, we equally partition the entire dataset into k folds. Each time, we fix the tuning parameter values and hold out one fold of data as the validation set to evaluate PRS performance. We use the remaining (k−1) folds as the training set to conduct GWAS and estimate SNP weights. After repeating this procedure k times, we obtain the optimal PRS model by comparing the average predictive performance across different values of the tuning parameters.
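A schematic of this cross-validation loop for tuning the p-value threshold of a simple PRS, using scikit-learn for fold splitting; the marginal GWAS step is reduced to per-SNP regressions on standardized data, and all names are illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold

def marginal_gwas(X, y):
    """Per-SNP marginal regression of y on each standardized SNP."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    n = len(y)
    beta = Xs.T @ ys / n                       # marginal effect estimates
    z = beta * np.sqrt(n)                      # approximate z-scores
    pvals = 2 * stats.norm.sf(np.abs(z))
    return beta, pvals

def tune_p_threshold(X, y, thresholds, k=5):
    """Pick the p-value cutoff with the best average validation R^2."""
    scores = {t: [] for t in thresholds}
    for train, valid in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        beta, pvals = marginal_gwas(X[train], y[train])
        for t in thresholds:
            keep = pvals < t
            if keep.any():
                prs = X[valid][:, keep] @ beta[keep]
                r = np.corrcoef(prs, y[valid])[0, 1]
                scores[t].append(0.0 if np.isnan(r) else r ** 2)
            else:
                scores[t].append(0.0)
    return max(thresholds, key=lambda t: np.mean(scores[t]))
```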
Despite its broad applicability, cross-validation is almost never used to fine-tune PRS models in practice, since individual-level data are rarely available for the full GWAS sample. Instead, a standard approach is to use summary statistics from a large GWAS to train PRS models under different tuning parameter values, and to use an external dataset independent from the GWAS to select the tuning parameters that yield the best performance. However, in practice, most datasets with sufficient and easily accessible samples are likely to have already been included in the large GWAS. Even if such a dataset is available, researchers may prefer to use it as the testing dataset to report the optimized PRS's predictive performance, rather than holding it out as a validation dataset for model selection. To address this challenge, SummaryAUC is a new approach designed for assessing a PRS's prediction accuracy using summary-level data as the validation set [23]. For a case-control GWAS, SummaryAUC estimates the area under the ROC curve (AUC) as the probability that a randomly selected case has a higher PRS than a randomly selected control. When individual-level PRSs can be calculated, this corresponds to a simple two-sample Z statistic. When individual-level validation samples are not available, SummaryAUC approximates the standardized Z score using summary statistics and minor allele frequencies from a GWAS. This method advances the field by removing the need for individual-level validation data when evaluating the performance of PRSs on binary traits.
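The probability interpretation of the AUC used above can be checked directly when individual-level PRSs are available. The sketch below compares the pairwise-probability estimate with a normal-approximation version based on a standardized difference in mean PRS; this is a generic illustration, not the SummaryAUC approximation itself.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prs_cases = rng.normal(0.3, 1.0, size=500)
prs_controls = rng.normal(0.0, 1.0, size=2000)

# AUC as P(randomly chosen case has a higher PRS than a randomly chosen control)
auc_pairwise = np.mean(prs_cases[:, None] > prs_controls[None, :])

# normal approximation based on a standardized difference in mean PRS
z = (prs_cases.mean() - prs_controls.mean()) / np.sqrt(
    prs_cases.var(ddof=1) + prs_controls.var(ddof=1))
auc_normal = stats.norm.cdf(z)

print(auc_pairwise, auc_normal)
```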
However, SummaryAUC does not fully resolve the challenges in model tuning. In many cases, even an independent summary statistics dataset may not be available. PUMAS is a new method that can fine-tune PRS models using only GWAS summary statistics [24]. This method uses a resampling approach to create down-sampled training and validation GWAS summary statistics from the input GWAS, and then applies a procedure similar to cross-validation to fine-tune PRS models. PUMAS provides a way to approximate the predictive $R^2$ from regressing the phenotypes of the validation dataset on the PRS, without accessing individual-level genetic and phenotypic data. PUMAS can be applied not only to traditional PRSs, where the tuning parameter is simply the association p-value cutoff, but also to more sophisticated PRS frameworks such as LDpred that use GWAS summary statistics as input. With this approach, it has become possible to systematically benchmark and optimize PRS models for various traits using publicly available GWAS summary data.
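To convey the flavor of such summary-statistics resampling, the toy sketch below perturbs full-GWAS effect estimates to mimic a training subsample, under an independent-SNP assumption in which the conditional variance of the subsample estimate is $se^2(N/N_{\text{train}} - 1)$. This is only an illustration of the general resampling idea, not the PUMAS algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

def subsample_sumstats(beta_hat, se, n_full, n_train):
    """Draw pseudo training-subsample effect estimates from full-GWAS
    summary statistics (independent SNPs assumed).

    Conditional on the full-sample estimate, the training estimate is
    approximately N(beta_hat, se^2 * (n_full / n_train - 1)).
    """
    extra_sd = se * np.sqrt(n_full / n_train - 1.0)
    return beta_hat + rng.normal(0.0, extra_sd)

beta_hat = np.array([0.02, -0.01, 0.005])
se = np.array([0.004, 0.004, 0.004])
beta_train = subsample_sumstats(beta_hat, se, n_full=100000, n_train=75000)
print(beta_train)
```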
7 PRS COMPUTATION IN PRACTICE
There are other summary-data-based PRS methods that we did not cover in this review (e.g., SBayesR [25], sBLUP [26], and DBSLMM [27]). These PRS frameworks generalize regression models that typically require access to individual-level genotype and phenotype data so that they can use GWAS summary statistics as inputs and construct PRSs accordingly. The abundance of PRS methods requires researchers not only to optimize the PRS within a single model framework but also to benchmark the most predictive PRS model across different phenotypes. A recent study conducted a systematic comparison of the most predictive PRSs from 15 PRS methods across 25 disease phenotypes in the UK Biobank cohort and found that complicated PRS methods do not necessarily outperform simple PRS models in terms of AUC [39]. Besides prediction accuracy, computational complexity is another important aspect of PRS application in practice. Compared with more sophisticated methods such as LDpred or lassosum, C+T is the most scalable PRS method for biobank-scale datasets and requires less computational time and memory [27]. In practice, researchers may need to consider the tradeoff between predictive performance and computational burden when selecting PRS algorithms [4].
8 DISCUSSION
PRS methods and applications have flourished in recent years, largely owing to the fast-growing sample sizes of biobank cohorts and widely available GWAS summary statistics. These methods provide a practical alternative for building genetic prediction models when individual-level data cannot be directly accessed. In general, PRS methods based on GWAS summary statistics often have a substantially lower computational burden compared with models that need to be trained on individual-level data; meanwhile, as GWAS sample sizes continue to grow, PRS methods have also achieved comparable prediction accuracy for some traits. In this review, we have discussed recent advances in PRS approaches with a focus on methods that estimate SNP effects and optimize tuning parameters. These methods have laid the groundwork for developing accurate and robust genetic prediction models for a variety of diseases and traits and may have broad applications in disease diagnosis and precision medicine.
However, existing PRS methods still have limitations. First, the sensitivity and specificity of PRS remain too low to be immediately useful for clinical intervention in most diseases. Despite the methodological advances, the improvement in prediction accuracy for most PRS approaches is usually incremental compared with a simple PRS based on marginal GWAS effects. However, a recent study has convincingly demonstrated that PRSs can be used to identify individuals with substantially elevated risk for coronary artery disease and a few other phenotypes despite the low AUC [40], which hints at a need to develop better metrics to quantify the performance of PRS. Second, even if a PRS is predictive, its effect may be mediated through the environment and needs to be interpreted with caution. Since a person's genotypes are correlated with the genotypes of other family members (e.g., parents), which are subsequently correlated with the family environment, effect estimates obtained from a GWAS are mixtures of both direct and indirect genetic effects [41, 42]. A recent study has pointed out that the predictive performance of PRS is substantially reduced for many traits when the GWAS is conducted on sibling pairs, which controls for the shared environment [43]. Blindly trusting the PRS without understanding the underlying etiology may lead to bias and misinterpretation of results in PRS applications. Another major limitation of current PRS methods is the lack of portability. It has been extensively discussed in multiple studies that existing PRSs cannot accurately predict the disease risk of individuals from populations that differ from the GWAS cohorts, possibly due to differences in LD patterns, causal effect sizes, allele frequencies, and environmental mediators across populations [44–47]. This has become a major hurdle in PRS applications, especially because major GWAS cohorts lack diversity [48]: over 70% of GWAS samples came from three countries, namely the United States, the United Kingdom, and Iceland. While the number of participants of European ancestry has increased exponentially over the years, the proportion of other ethnicities in GWAS samples has declined since 2014 [44]. Methods have been developed to incorporate GWAS data from multiple populations to improve the trans-ethnic portability of predictive performance [49], but few approaches have achieved improvements using summary statistics alone [50]. Overcoming the poor portability of PRS will greatly benefit risk prediction research. We anticipate emerging PRS methodologies in the near future that offer novel insights and solutions to these challenges.