1 Introduction
Pulmonary hypertension (PH) represents a progressive and life-threatening cardiopulmonary syndrome characterized by pulmonary vascular remodeling, elevated pulmonary arterial pressure, and increased vascular resistance. These pathological changes culminate in right ventricular dysfunction and ultimately lead to premature mortality [1, 2]. The clinical challenge is compounded by the disease’s insidious onset and rapid progression, with approximately 50% of cases being diagnosed at advanced stages when therapeutic options are limited [2, 3]. This diagnostic delay underscores the critical need for early risk stratification to enable timely clinical intervention and improve patient prognosis.
Current risk prediction models for incident PH predominantly target high-risk populations, such as patients with connective tissue diseases, including systemic sclerosis [4] and systemic lupus erythematosus [5]. Although useful for these specific groups, these models have key limitations: (1) limited generalizability to the general population, where early PH detection remains challenging; (2) suboptimal predictive accuracy due to dependence on routine clinical variables; and (3) a lack of mechanistic insights for therapeutic target discovery. To overcome these constraints, proteomic profiling offers a transformative approach. As functional mediators of biological processes, proteins encapsulate genetic, environmental, and pathological influences [6–8], enabling large-scale plasma proteomic analyses to both identify novel biomarkers for enhanced risk stratification and uncover molecular mechanisms driving PH pathogenesis.
A recent large-scale proteomic study [9] analyzing plasma samples from 41 931 UK Biobank participants identified 20 potential protein biomarkers for PH prediction. While this represents an important advance in the field, several critical limitations constrain its clinical applicability and biological significance: first, the predictive performance of these proteomic markers was not systematically compared with established clinical risk factors; second, the findings lacked validation in an independent cohort; third, no pathway analyses were conducted to elucidate the biological mechanisms underlying the identified signatures; and fourth, causal inference approaches (e.g., Mendelian randomization (MR) analysis) were not employed to prioritize potential therapeutic targets. These methodological gaps highlight the need for a more comprehensive proteomic investigation that not only identifies predictive biomarkers but also advances our understanding of PH pathogenesis and facilitates target discovery.
The UK Biobank (UKB) Pharma Proteomics Project (UKB-PPP) [10], a sub-study of the UK Biobank, provides an unprecedented opportunity to address these questions through large-scale proteomic analysis. This study sought to identify plasma protein biomarkers associated with PH risk, evaluate their causal relationships and therapeutic potential, and develop and validate a protein-based predictive model while comparing its performance with conventional clinical risk factors in the general population.
2 Materials and methods
2.1 Study design and participants
The UKB is a large, ongoing prospective cohort study that recruited participants between 2006 and 2010 across 22 assessment centers in England, Wales, and Scotland. Approximately 500 000 participants were enrolled; at baseline they completed touchscreen questionnaires, face-to-face nurse interviews, and physical measurements, and provided biological samples for laboratory analyses [11–13]. The study was approved by the North West Multi-Center Research Ethics Committee (11/NW/0382), and all participants provided signed informed consent at the start of the study.
The UKB-PPP is a substudy within the UKB that conducted plasma proteomic profiling in a subset of over 50 000 participants [10]. From the initial 53 029 participants with complete proteomic data, we excluded 9370 individuals who were of non-White British ancestry, lacked genetic data, showed discrepancies between self-reported sex and X-chromosome heterozygosity, or exhibited excess relatedness. We further excluded 139 participants with prevalent PH at baseline, yielding a final analytical cohort of 43 520 eligible individuals. For analytical purposes, we assigned 38 499 English participants to the model development cohort and 5021 Scottish/Welsh participants to the external validation cohort. Within the development cohort, we performed stratified random sampling based on outcome events to create a training set (70%, n = 26 951) and a testing set (30%, n = 11 548) (Fig. S1).
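The cohort flow and the outcome-stratified 70/30 split described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `stratified_split`, the 0/1 outcome list, and the toy cohort size are assumptions made for illustration and do not reproduce the authors' R implementation.

```python
import random

# Cohort flow reported in the text: exclusions, then development/validation split.
eligible = 53029 - 9370 - 139          # 43520 eligible participants
development, validation = 38499, 5021  # English vs. Scottish/Welsh participants
training, testing = 26951, 11548       # 70% / 30% of the development cohort


def stratified_split(outcomes, train_fraction=0.7, seed=0):
    """Split participant indices into training and testing sets.

    Sampling is done separately within events (1) and non-events (0), so
    both sets keep roughly the same event proportion.
    `outcomes` is a hypothetical list of 0/1 event indicators.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        idx = [i for i, y in enumerate(outcomes) if y == label]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Toy cohort (illustrative numbers only): 1000 participants, 20 events.
outcomes = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(outcomes)
train_events = sum(outcomes[i] for i in train_idx)
test_events = sum(outcomes[i] for i in test_idx)
```

Stratifying on the outcome matters here because incident PH is rare; an unstratified split could leave the testing set with too few events for stable performance estimates.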
2.2 Blood proteomics assessments
Plasma proteomic profiling was performed using the Olink Explore 3072 proximity extension assay (PEA), an antibody-based platform that quantified 2941 protein analytes, capturing 2923 unique proteins. Prior to analysis, the UKB laboratory team randomized and plated all samples to minimize batch effects. Protein measurements were conducted on three NovaSeq 6000 Sequencing Systems, followed by strict quality control and normalization at Olink’s processing facilities. The resulting data were transformed into inverse-rank normalized protein expression (NPX) values for each participant, reported in Olink’s proprietary log2-scale units for relative protein quantification.
2.3 Protein selection and risk score derivation
We initially excluded 12 proteins with > 20% missing values, retaining 2911 proteins for PH risk score development. For these remaining proteins, we imputed the limited missing values using protein-specific mean values, a well-established approach in large-scale proteomic studies [14–16]. A Cox proportional hazards model for PH risk was fitted, incorporating protein levels along with age and sex as covariates. Variable selection was performed using least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation to optimize penalization strength and determine the final set of predictors (Fig. S2, Table S1). The coefficients for the protein risk score were derived by applying the cross-validated LASSO penalty to the full training dataset within the Cox model’s objective function [17]. We then calculated the protein risk score for PH as the weighted sum of the proteins selected by the LASSO regression, using the corresponding coefficients as weights.
In addition to the primary holdout method, we performed 5-fold cross-validation to validate the reproducibility of protein biomarker selection (Table S2).
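The two data-handling steps in this section, protein-specific mean imputation and the weighted-sum risk score, can be sketched as follows. The coefficient values are hypothetical placeholders (the fitted LASSO coefficients are reported in Table S1 and are not reproduced here), and the function names are illustrative assumptions.

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values
    for one protein, i.e. protein-specific mean imputation."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def protein_risk_score(npx, coefficients):
    """Weighted sum of protein levels, using LASSO-Cox coefficients as weights.

    `npx` maps protein name -> NPX value for one participant;
    `coefficients` maps protein name -> coefficient for the selected proteins.
    """
    return sum(beta * npx[protein] for protein, beta in coefficients.items())


# Hypothetical coefficients for three of the selected proteins (placeholders,
# NOT the values from Table S1).
example_coefficients = {"EDN1": 0.30, "GDF15": 0.25, "NPC2": -0.10}
example_participant = {"EDN1": 1.0, "GDF15": 2.0, "NPC2": 0.5}

imputed = impute_with_mean([1.0, None, 3.0])
score = protein_risk_score(example_participant, example_coefficients)
```

Because the score is a linear predictor from a Cox model, a one-unit increase corresponds to multiplying the hazard by e; the per-SD hazard ratios reported later rescale this to the score's standard deviation.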
2.4 Assessment of clinical risk factors for PH
Clinical risk factors for PH included established demographic, anthropometric, and clinical factors: age, sex, body mass index (BMI), smoking status, pulse rate, and the prevalence of hypertension, diabetes, cardiovascular disease (CVD), chronic kidney disease (CKD), and chronic respiratory disease.
Data on age, sex, height, weight, and smoking status were collected via standardized questionnaires, while pulse rate was measured during automated blood pressure assessments. BMI was calculated as weight (kg) divided by height squared (m²). Hypertension was defined as systolic blood pressure (SBP) ≥ 140 mmHg, diastolic blood pressure (DBP) ≥ 90 mmHg, self-reported use of antihypertensive medication, a history of hypertension, or International Classification of Diseases (ICD)-9 (401) or ICD-10 (I10) codes. Diabetes was defined as prevalent diabetes [18] or hemoglobin A1c (HbA1c) ≥ 6.5%, while CKD included self-reported diagnosis, ICD-10 (N18) codes, estimated glomerular filtration rate (eGFR) < 60 mL/min/1.73 m², or urine albumin-to-creatinine ratio (UACR) ≥ 30 mg/g. eGFR was calculated using the Chronic Kidney Disease Epidemiology Collaboration equation [19]. CVD comprised coronary heart disease, atrial fibrillation, heart failure, and stroke. Chronic respiratory diseases included chronic obstructive pulmonary disease (COPD), idiopathic pulmonary fibrosis (IPF), and asthma. These conditions were ascertained through self-reported history, hospital admission records, and death registry data.
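The threshold-based definitions above translate directly into simple rules. The sketch below encodes them under stated assumptions: all inputs are assumed to be non-missing, ICD codes are matched by prefix, and the function and parameter names are hypothetical. The eGFR value is taken as an input rather than computed, since the specific CKD-EPI equation version is not stated in the text.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height squared (m^2)."""
    return weight_kg / height_m ** 2


def has_hypertension(sbp, dbp, on_antihypertensives=False, history=False, icd_codes=()):
    """SBP >= 140 mmHg, DBP >= 90 mmHg, antihypertensive use, a history of
    hypertension, or an ICD-9 401 / ICD-10 I10 code (prefix match assumed)."""
    coded = any(code.startswith(("401", "I10")) for code in icd_codes)
    return sbp >= 140 or dbp >= 90 or on_antihypertensives or history or coded


def has_diabetes(prevalent_diabetes, hba1c_percent):
    """Prevalent diabetes or HbA1c >= 6.5%."""
    return prevalent_diabetes or hba1c_percent >= 6.5


def has_ckd(self_reported, egfr, uacr_mg_per_g, icd_codes=()):
    """Self-report, an ICD-10 N18 code, eGFR < 60 mL/min/1.73 m^2,
    or UACR >= 30 mg/g."""
    coded = any(code.startswith("N18") for code in icd_codes)
    return self_reported or coded or egfr < 60 or uacr_mg_per_g >= 30
```

Each definition is a logical OR over measurement thresholds and recorded diagnoses, so a single qualifying source is sufficient to classify a participant as having the condition.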
2.5 Study outcome
The study outcome was incident PH, encompassing both primary and secondary PH. Cases were identified through hospital inpatient records and death registry data using ICD-10 codes I27.0 (primary PH) and I27.2 (secondary PH), along with ICD-9 code 4160 (primary PH).
2.6 Statistical analysis
The normality of all continuous variables was assessed using the Kolmogorov–Smirnov test. Non-normally distributed variables (defined as P < 0.05; including age, BMI, and pulse rate) were expressed as median (interquartile range, IQR), while categorical variables were presented as proportions. Between-group comparisons (training set vs. testing set in the development cohort, and validation cohort vs. development cohort) were conducted using the Wilcoxon rank-sum test for continuous variables and the chi-square test for categorical variables.
Hazard ratios (HRs) with corresponding 95% confidence intervals (CIs) were calculated using Cox proportional hazards models to examine the association between the PH protein risk score and incident PH risk. Model performance was assessed through both discrimination and calibration measures. Discrimination was evaluated using Harrell’s C-index, while reclassification performance and improvement over the reference model were quantified using the continuous net reclassification index (NRI) and integrated discrimination improvement (IDI), computed with the R package survIDINRI. All 95% CIs were estimated using bootstrap methods.
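As an illustration of the discrimination measure, the following is a minimal, quadratic-time sketch of Harrell's C-index for right-censored data. It is not the authors' code: it ignores pairs with tied event times and counts tied risk scores as half-concordant, which is one common convention.

```python
def harrell_c_index(times, events, scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when participant i has an observed event and
    a strictly shorter follow-up time than participant j. The pair is
    concordant when i, who failed earlier, also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored participants cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data: four participants, the third is censored at time 6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.7, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_scores)
```

In the toy data, five pairs are comparable and four are concordant, giving a C-index of 0.8; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.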
Enrichment analysis was performed on LASSO-selected candidate proteins to investigate potential biological mechanisms. Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted using the Database for Annotation, Visualization and Integrated Discovery (DAVID) database (v2023q4), with statistical significance assessed via Fisher’s exact test. Additionally, protein-protein interaction (PPI) networks were constructed and analyzed using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database.
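The logic of a Fisher's exact enrichment test can be sketched as a one-sided hypergeometric tail probability. This is the plain test; DAVID reports a modified, more conservative variant (the EASE score), so values from this sketch would not match DAVID output exactly. The pathway size and hit count in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, selected, annotated, background):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= hits), where X is the number of annotated proteins among
    `selected` proteins drawn without replacement from a `background` that
    contains `annotated` pathway members (hypergeometric distribution).
    """
    upper = min(selected, annotated)
    total = comb(background, selected)
    tail = sum(
        comb(annotated, x) * comb(background - annotated, selected - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 30 selected proteins out of a 2911-protein background,
# a pathway with 40 annotated proteins, 4 of which are among the selected 30.
p_example = enrichment_p_value(hits=4, selected=30, annotated=40, background=2911)
```

The choice of background matters: using the 2911 assayed proteins rather than the whole genome avoids inflating enrichment for categories that the assay panel itself over-represents. Which background was supplied to DAVID is not stated in the text.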
MR analyses were performed using the “TwoSampleMR” R package to investigate causal relationships between candidate proteins and PH. Protein quantitative trait loci (pQTLs) were identified from the UKB-PPP data [10]. cis-pQTLs were defined as genome-wide significant (P < 5 × 10⁻⁸) and linkage disequilibrium (LD)-independent genetic variants located within 500 kb upstream or downstream of the transcription start site of the gene encoding the protein. Genetic information for PH (GWAS ID: finn-b-I9_HYPTENSPUL) was obtained from the IEU GWAS Database (Table S3). MR effects were estimated using the Wald ratio method for single pQTLs and the inverse-variance weighted (IVW) method for multiple pQTLs. The druggability potential of candidate PH-associated proteins was evaluated using the Therapeutic Target Database (TTD; accessed on 21 July 2024).
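The two MR estimators named above can be written out in a few lines. This sketch uses the standard first-order formulas: the Wald ratio divides the variant-outcome effect by the variant-exposure effect, and the fixed-effect IVW estimate is a precision-weighted combination of the per-variant ratios. It is an illustration only; the TwoSampleMR package may compute the IVW standard error differently (for example, by allowing for heterogeneity), and the input values below are made up.

```python
from math import sqrt


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio for a single instrument (first-order standard error)."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect inverse-variance weighted estimate across instruments."""
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(betas_exposure, ses_outcome)
    )
    return numerator / denominator, sqrt(1.0 / denominator)


# Made-up summary statistics for illustration (log-odds scale for the outcome).
single_estimate, single_se = wald_ratio(0.5, 0.1, 0.02)
multi_estimate, multi_se = ivw([0.5, 0.4], [0.1, 0.08], [0.02, 0.02])
```

With one instrument the IVW formula reduces to the Wald ratio, which is why the two methods can be applied interchangeably depending on how many cis-pQTLs are available for a protein. Exponentiating the estimate gives the odds ratio per unit increase in genetically predicted protein level.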
A two-tailed P-value < 0.05 was considered statistically significant for all analyses. All statistical analyses were conducted using R software (v4.1.1).
3 Results
3.1 Baseline characteristics of participants
Among the 38 499 participants in the development cohort, the mean age was 57.3 years (SD = 8.1), and 46.4% were male. Participants in the training and testing sets exhibited similar baseline characteristics. Compared with the development cohort, participants in the external validation cohort (N = 5021) were significantly younger (mean age: 58.0 years vs. 59.0 years, P < 0.001), had a higher BMI (27.0 kg/m² vs. 26.8 kg/m², P < 0.001) and a faster pulse rate (69.7 bpm vs. 69.0 bpm, P < 0.001). Additionally, the validation cohort exhibited a higher prevalence of hypertension (60.0% vs. 56.7%, P < 0.001) and diabetes (6.4% vs. 6.0%, P < 0.001), but a lower prevalence of CKD (7.8% vs. 9.2%, P < 0.001) (Table 1).
3.2 Association of PH protein risk score with incident PH risk
A PH protein risk score was derived from 30 selected proteins among 2911 candidates in the training set (Fig. S2, Table S1).
In the testing set, over a median follow-up of 13.1 years, 142 (1.2%) participants developed PH. As shown in Fig. 1A, the PH protein risk score was significantly positively associated with incident PH (per SD increment, adjusted HR 2.34, 95% CI 2.05–2.67). Similar results were observed in the external validation cohort (per SD increment, adjusted HR 2.38, 95% CI 1.94–2.92) (Fig. 1B).
3.3 Network analysis and biological pathways of PH-associated proteins
Enrichment analysis of the 30 proteins in the PH risk score revealed significant pathway associations across multiple biological domains: extracellular region/space (GO cellular component), cholesterol efflux/blood pressure regulation (GO biological process), hormone activity/calcium ion binding (GO molecular function), and the hypoxia-inducible factor 1 (HIF-1) signaling pathway and vascular smooth muscle contraction (KEGG pathways) (Fig. 2A–2D, Table S4). PPI analysis identified endothelin-1 (EDN1) as a central network hub (Fig. 2E, Table S5).
3.4 MR analysis and druggability of PH-associated proteins
In the two-sample MR analysis, 217 single nucleotide polymorphisms (SNPs) for 28 candidate proteins served as instrumental variables. The results identified significant causal associations: repulsive guidance molecule A (RGMA) showed a positive association with PH (OR 2.73, 95% CI 1.15–6.48), whereas NPC intracellular cholesterol transporter 2 (NPC2) showed an inverse association with PH (OR 0.27, 95% CI 0.08–0.93) (Table S6).
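Odds ratios from MR are estimated on the log scale, so a reported 95% CI can be used to back out an approximate standard error and two-sided P-value. The sketch below assumes a Wald-type interval that is symmetric on the log scale (estimate ± 1.96 × SE); because the published bounds are rounded, the derived quantities are approximate and are not values reported by the authors.

```python
from math import erfc, log, sqrt


def se_from_ci(lower, upper, z=1.96):
    """Approximate SE of log(OR) from a symmetric 95% CI on the log scale."""
    return (log(upper) - log(lower)) / (2 * z)


def two_sided_p(odds_ratio, se_log_or):
    """Two-sided normal P-value for H0: OR = 1."""
    z_stat = abs(log(odds_ratio)) / se_log_or
    return erfc(z_stat / sqrt(2))


# Reported estimates from the text (Table S6): OR and 95% CI bounds.
rgma_or, rgma_lo, rgma_hi = 2.73, 1.15, 6.48
npc2_or, npc2_lo, npc2_hi = 0.27, 0.08, 0.93

rgma_p = two_sided_p(rgma_or, se_from_ci(rgma_lo, rgma_hi))
npc2_p = two_sided_p(npc2_or, se_from_ci(npc2_lo, npc2_hi))

# On the log scale the point estimate sits midway between the bounds, so the
# geometric mean of the bounds should approximately recover the OR.
rgma_midpoint = sqrt(rgma_lo * rgma_hi)
npc2_midpoint = sqrt(npc2_lo * npc2_hi)
```

Both intervals exclude 1, which is consistent with the derived P-values falling below the 0.05 threshold; the wide intervals also show that the estimates are imprecise.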
Within the 30-protein risk score panel, nine proteins (116 drugs total) were identified as druggable targets, including four clinically validated targets (angiopoietin-2 (ANGPT2), carbonic anhydrase 4 (CA4), carbonic anhydrase 14 (CA14), and epidermal growth factor receptor (EGFR)) used in macular degeneration and cancer therapies; three clinical trial targets (RGMA for multiple sclerosis, growth/differentiation factor 15 (GDF15) for heart failure, pro-adrenomedullin (ADM) for respiratory distress syndrome); and two literature-reported targets (NPPB, EDN1) (Table S7). Notably, one drug candidate targeting these proteins was discontinued during Phase 2 trials.
3.5 Predictive performance of individual proteins in the PH risk score
Among the 30 candidate proteins evaluated in the testing set, eight demonstrated strong predictive performance for PH risk (C-index ≥ 0.800), including ANGPT2, GDF15, N-terminal pro-B-type natriuretic peptide (NT-proBNP), WAP four-disulfide core domain protein 2 (WFDC2), EDN1, lamin-B2 (LMNB2), ADM, and RNA binding protein fox-1 homolog 3 (RBFOX3). These findings were consistently replicated in the external validation cohort (Fig. S3).
3.6 Discriminative performance of PH risk prediction models
In the testing set, the PH protein risk score model demonstrated superior discriminative ability for PH risk (C-index = 0.873, 95% CI 0.846–0.900) compared to both the basic demographic model (age and sex; C-index = 0.761, 95% CI 0.726–0.795) and the clinical risk factor model (C-index = 0.843, 95% CI 0.815–0.870) (Table 2). Notably, eight key proteins (ANGPT2, GDF15, NT-proBNP, WFDC2, EDN1, LMNB2, ADM, and RBFOX3) contributed most significantly to this predictive performance, achieving a combined C-index of 0.863 (95% CI 0.835–0.890) (Fig. 3).
Model enhancement analyses revealed that incorporating the protein risk score into the clinical model significantly improved discrimination (C-index increased from 0.843 to 0.881; C-index increase = 0.039, 95% CI 0.001–0.077), while adding clinical factors to the protein model provided minimal improvement (C-index increased from 0.873 to 0.881; C-index increase = 0.008, 95% CI −0.029–0.046) (Table 2). These patterns were consistently replicated in the external validation cohort (Table 2, Fig. 3).
3.7 Reclassification performance of PH risk prediction models
The addition of the PH protein risk score to the clinical risk factors model significantly enhanced risk reclassification in the testing set, as evidenced by both continuous NRI improvement (NRI 0.258, 95% CI 0.106–0.336) and IDI improvement (IDI 0.053, 95% CI 0.024–0.089) for 10-year PH risk prediction. These improvements in reclassification performance were consistently observed in the external validation cohort (Table 2).
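To make the two reclassification measures concrete, the sketch below computes the continuous NRI and the IDI for a binary outcome at a fixed horizon. This is a deliberate simplification: it ignores censoring, whereas the survIDINRI package used in the study accounts for censored follow-up. The predicted risks in the example are made up.

```python
def continuous_nri(p_old, p_new, events):
    """Continuous (category-free) NRI for a binary outcome.

    NRI = [P(up | event) - P(down | event)]
        + [P(down | non-event) - P(up | non-event)],
    where "up"/"down" mean the new model assigns a higher/lower risk.
    """
    n_events = sum(events)
    n_nonevents = len(events) - n_events
    up_e = down_e = up_n = down_n = 0
    for old, new, event in zip(p_old, p_new, events):
        if new > old:
            up_e, up_n = up_e + event, up_n + (1 - event)
        elif new < old:
            down_e, down_n = down_e + event, down_n + (1 - event)
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents


def idi(p_old, p_new, events):
    """IDI: mean change in predicted risk among events minus the mean
    change among non-events."""
    diff_events = [n - o for o, n, e in zip(p_old, p_new, events) if e]
    diff_nonevents = [n - o for o, n, e in zip(p_old, p_new, events) if not e]
    return (sum(diff_events) / len(diff_events)
            - sum(diff_nonevents) / len(diff_nonevents))


# Made-up 10-year risks from a reference model and an extended model.
toy_events = [1, 1, 0, 0]
toy_old = [0.20, 0.30, 0.20, 0.10]
toy_new = [0.40, 0.25, 0.10, 0.05]
toy_nri = continuous_nri(toy_old, toy_new, toy_events)
toy_idi = idi(toy_old, toy_new, toy_events)
```

The continuous NRI ranges from −2 to 2 and counts only the direction of risk changes, while the IDI also reflects their magnitude, which is why the two measures are usually reported together.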
3.8 Sensitivity analysis
Our sensitivity analyses confirmed the robustness of the PH protein risk score across multiple validation approaches. For primary PH prediction, the PH protein risk score demonstrated strong performance in both the testing set (C-index = 0.871, 95% CI 0.831–0.910) and external validation cohort (C-index = 0.875, 95% CI 0.817–0.933), with significant improvement when added to clinical risk factors (Table S8). The PH protein risk score maintained superior discriminative ability after excluding participants with missing protein data (testing set: C-index = 0.897, 95% CI 0.869–0.924; validation cohort: C-index = 0.871, 95% CI 0.815–0.927; Table S9). In addition, 5-fold cross-validation identified 33 proteins for risk score construction, which replicated 73% (22/30) of our primary proteins (Fig. S4). Notably, the 33-protein score showed comparable predictive performance to our primary 30-protein model (development cohort: 0.869 vs. 0.873; validation cohort: 0.881 vs. 0.878; Table S10), demonstrating both the stability of core predictive signatures and the reproducibility of our selection methodology.
4 Discussion
This large-scale proteomic study identified and validated a 30-protein risk score for incident PH in the general population, demonstrating superior predictive performance (C-index > 0.87) compared to conventional clinical risk factors. Through comprehensive analyses integrating machine learning, network biology, and MR, we not only developed a robust risk stratification tool but also uncovered novel biological pathways and potential therapeutic targets for PH. Our findings advance the field by addressing critical gaps in current PH prediction models and providing mechanistic insights into disease pathogenesis.
4.1 Proteomic profiling outperforms traditional risk assessment
Our study demonstrates that the PH protein risk score exhibits strong discriminative ability (C-index = 0.873) and significantly improves risk reclassification over clinical models (NRI = 0.258). This superior predictive performance (C-index > 0.87) reflects the proteome’s unique capacity to integrate inherited risk, environmental exposures, and active pathological processes driving PH development [20, 21]. The marginal improvement from adding clinical factors to the PH protein risk score suggests proteomic profiling alone may provide a more efficient risk stratification tool than conventional multi-parameter approaches. Among the eight key predictive proteins, LMNB2 and RBFOX3 emerged as novel PH biomarkers. LMNB2, a nuclear lamina protein that regulates proliferation and DNA methylation, may promote PH through endothelial-to-mesenchymal transition and vascular remodeling [22, 23]. RBFOX3, an RNA splicing regulator, could modulate PH progression via alternative splicing of vascular remodeling genes [24, 25]. These findings expand the PH biomarker landscape beyond established candidates (e.g., GDF15 [26], NT-proBNP [27, 28], ADM [29, 30]), underscoring proteomics’ potential for early disease detection.
4.2 Causal insights and therapeutic potential
Our MR analysis identified two proteins with causal associations to PH pathogenesis: RGMA (OR = 2.73) and NPC2 (OR = 0.27). RGMA, a vascular patterning molecule currently in clinical trials for multiple sclerosis, promotes vascular smooth muscle cell dedifferentiation and remodeling [31], suggesting its therapeutic potential for PH. Conversely, NPC2’s inverse association reveals a previously unrecognized role of cholesterol metabolism in PH, potentially mediated through lipid-driven inflammatory or proliferative pathways [32]. Notably, nine proteins in our risk score represent druggable targets, including both approved therapies (e.g., ANGPT2 inhibitors) and investigational agents, providing immediate opportunities for drug repurposing. While other predictive proteins lacked causal associations, their strong performance as biomarkers warrants further investigation into their roles as disease effectors.
4.3 Mechanistic insights and therapeutic implications
Pathway analyses confirmed and extended current understanding of PH pathophysiology, particularly through vascular smooth muscle contraction dysregulation, HIF-1 signaling activation, and calcium homeostasis alterations. The central network position of EDN1 serves a dual purpose: it reinforces the endothelin pathway’s known role in vascular remodeling [33] and supports the biological relevance of our proteomic approach.
4.4 Advancements and translational implications
Our study significantly advances prior proteomic research in PH by addressing key methodological limitations through establishing generalizability to the general population, achieving superior predictive accuracy via novel protein biomarkers, and providing mechanistic insights for therapeutic target identification. The protein risk score’s superior performance over clinical factors highlights the critical need to move beyond conventional risk stratification tools that rely solely on routine clinical variables and lack pathophysiological specificity.
Clinically, this risk score creates opportunities for early intervention by identifying high-risk individuals during the subclinical phase, a crucial advance given the diagnostic delays and poor outcomes characteristic of late-stage PH. Its implementation could guide targeted screening (e.g., echocardiography) and preventive strategies for at-risk subgroups. From a therapeutic perspective, our MR-prioritized targets (RGMA and NPC2) create new research avenues for mechanistic investigation and clinical trials. These collective insights bridge fundamental discovery with clinical application, paving the way for precision medicine approaches in PH management.
4.5 Study limitations
Several limitations should be considered when interpreting our findings. First, the exclusive inclusion of White European participants may limit generalizability to other ethnic populations. Second, while the Olink platform provides broad proteomic coverage, it does not encompass all potentially relevant proteins. Third, the statistical robustness of our MR analysis was constrained by the modest number of PH cases in the FinnGen GWAS, necessitating validation in larger datasets. Fourth, the UK Biobank’s lack of detailed PH subclassification was partially mitigated by our consistent findings in primary PH sensitivity analyses. Finally, while our integrated clinical and MR analyses identified robust proteomic signatures, experimental validation in preclinical models is needed to confirm their biological roles in PH pathogenesis. Nonetheless, these findings provide clinically meaningful insights into PH pathogenesis and establish a foundation for future mechanistic and translational research.
In conclusion, this study establishes a protein-based risk score for incident PH that demonstrates superior performance compared to conventional clinical models while providing novel insights into disease mechanisms. By integrating predictive analytics with causal inference and druggability assessments, we present a potential framework for translating proteomic discoveries into clinically useful tools and therapeutic strategies. Future validation studies in prospective cohorts and exploration of early intervention applications will be important next steps. These findings suggest that proteomic approaches may offer promising avenues to improve risk stratification and target discovery in PH, potentially helping to address critical unmet needs in this challenging disease.
Data availability and compliance statement
The authors declare that the acquisition and subsequent use of all data presented in this manuscript fully comply with all relevant local, national, and international laws, regulations, ethical guidelines, and the terms of use associated with the original data sources.
The authors bear full legal responsibility for ensuring the legality of data acquisition and all subsequent uses.
The UK Biobank data are available on application to the UK Biobank, and the analytic methods that support the findings of this study will be available from the corresponding authors on request.