1. Introduction
According to the World Health Organization (WHO) in its report Obesity and Overweight, obesity is a chronic and multifaceted condition characterized by excessive fat accumulation that can negatively affect health. It increases the risk of developing conditions such as type 2 diabetes, heart disease and certain cancers. Obesity also impacts bone health, reproductive function, and can reduce quality of life by affecting sleep and physical activity levels [
1]. The link between Body Mass Index (BMI) and cancer risk is a critical concern in the field of health. BMI, a metric calculated from a person’s height and weight, is commonly used to assess whether an individual is underweight, overweight, or at a healthy weight. The formula for calculating BMI is: BMI = weight (kg) / height (m)
2. Both high and low BMI are associated with an increased risk of developing cancer.
A high BMI (overweight or obesity) increases the risk of several cancers including breast, colorectal, endometrial, esophageal and kidney cancers [
2,
3,
4]. Excess fat alters hormone levels, such as androgens, estrogens and progesterone that can promote cancer growth [
5]. It is also linked to insulin resistance and elevated insulin levels, further increasing cancer risk. Obesity triggers chronic inflammation and immune response changes that support cancer cell growth [
6]. Conversely, low BMI (underweight) can increase cancer risk due to malnutrition and weakened immunity, making the body more susceptible to cancer while also being associated with digestive system cancers such as oral and gastric cancers [
7]. Maintaining a healthy BMI (18.5–24.9) through a balanced diet and moderate exercise can therefore help reduce cancer risk.
BMI significantly affects the prognosis of cancer patients. Huang
et al. [
8] showed that a high BMI (obesity) is linked to poorer outcomes, with obese patients facing higher risks of complications, treatment failure and cancer recurrence, especially in breast and colorectal cancers. Obese patients also have higher mortality risks in cancers like uterine, colorectal, ovarian and liver cancers [
7]. In contrast, underweight patients may experience malnutrition, weakened immunity, poor treatment tolerance and longer recovery, leading to shorter survival [
9]. Maintaining a normal BMI (18.5–24.9) improves prognosis, with better treatment outcomes, lower recurrence and longer survival, as it supports immune function, reduces metabolic issues, and enhances treatment effectiveness.
Obese breast cancer patients, especially postmenopausal women, have a higher risk of recurrence after surgery [
10]. Overweight and obese prostate cancer patients are at greater risk of postoperative recurrence and have lower survival rates [
11] while further reducing treatment effectiveness and inducing more severe side effects [
12]. Significant weight changes, whether loss or gain, can affect prognosis, particularly during chemotherapy, and markedly, excessive weight loss (cancer cachexia) is linked to poorer survival outcomes [
13]. In conclusion, both high and low BMI are associated with worse cancer prognosis, while maintaining a normal BMI improves treatment outcomes and survival.
While obesity is traditionally seen as a carcinogen, it may have a protective effect in certain stages and types of cancer by enhancing antitumor immunotherapy. This challenges the view that obesity increases cancer mortality risk, known as the “Obesity Paradox”. The paradox suggests that overweight and Class 1 obese (BMI = 25–34.9) cancer patients may have a better prognosis than lean individuals, though this is not true for all patients or cancer types [
14]. Studies like Tu
et al. [
15] found that although overweight or obesity increases the risk of developing cancer, among patients already diagnosed with cancer, having a slightly higher body weight around the time of diagnosis is associated with lower risk of death and longer survival, and this finding applies to most cancer types. Petrelli
et al. [
16] found that obese patients with cancers like renal cell carcinoma, lung cancer and melanoma may better tolerate chemotherapy and experience lower mortality rates.
Alifano
et al. [
17] studied the impact of preoperative BMI on survival in non-small cell lung cancer (NSCLC) patients undergoing lung resection, and determined that underweight patients had lower survival rates, while overweight and obese patients had better outcomes. For obese patients, a higher BMI was associated with improved survival, even after adjusting for various factors. In breast cancer, Modi
et al. [
18] found that a higher BMI worsened survival in early breast cancer (EBC) but improved survival in advanced breast cancer (ABC). In colorectal cancer, some studies linked obesity to a higher risk of death, while others showed that obese patients had longer survival and better treatment tolerance [
19]. In summary, while obesity is generally associated with increased health risks, in certain cancer patients, a higher BMI may be linked to lower mortality, potentially due to factors like treatment tolerance, biological mechanisms, hormone levels and cancer subtypes.
This study used The Cancer Genome Atlas (TCGA) data to explore the relationship between BMI and gene expression. BMI, as an indicator of body fat, can influence various aspects of a cancer patient’s physiological condition, nutritional status and immune system. Abnormal BMI (either high or low) is often associated with biological changes that may affect gene expression, metabolic pathways, immune responses and more. By analyzing TCGA data, we can investigate the correlation between BMI and specific genes or gene clusters related to metabolism, inflammation, hormone regulation or cell proliferation, providing insights into how BMI impacts cancer development and prognosis. Additionally, genes interact to regulate processes like cell growth, death and migration, and studying such interactions between BMI-related genes can offer a more comprehensive understanding of how BMI influences cancer biology. For example, certain genes may collaborate in high BMI patients to drive cancer progression or influence treatment response. This research could uncover molecular mechanisms linking BMI to cancer and identify new targets or biomarkers to better understand how BMI affects cancer onset and prognosis.
The BMI Characteristics of TCGA Cancer Samples
We analyzed data from 1825 patients across seven TCGA cancer types to explore BMI-associated molecular characteristics with details shown in Table
1. Due to the small number of underweight cases (BMI
18.5), patients were grouped based on Hu
et al. [
20] into normal weight (BMI
25), overweight (25
BMI
30) and obese (BMI
30). The study focused on BMI characteristics analysis for bladder urothelial carcinoma (BLCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), colorectal adenocarcinoma (COADREAD), esophageal carcinoma (ESCA), kidney renal papillary cell carcinoma (KIRP) and liver hepatocellular carcinoma (LIHC).
As shown in Fig.
1, over 60% of patients in most cancer types had a BMI
25, except for ESCA and LIHC. The low obesity rate in ESCA could result from symptoms like difficulty or pain when swallowing, vomiting after meals, and weight loss; similarly, LIHC patients often suffer from appetite loss and rapid weight loss, leading to more normal-weight individuals and fewer with obesity.
Fig.
2 (top) shows the distribution of overweight and obese patients across cancer types, totaling 1093 individuals. BLCA had the highest number (209 patients, 19.1%), while ESCA had the lowest (78 patients, 7.1%), consistent with the previously noted lower BMI trend in ESCA. Patients were also grouped by sex: 750 females and 1075 males. As shown in Fig.
2 (middle), BLCA was the most common cancer type among overweight and obese males. Based on these findings, we further examined BMI-related gene expressions in BLCA and CESC. In Fig.
2 (bottom), CESC had the highest proportion among overweight and obese females. This supports findings by Clarke
et al. [
21], which suggest that overweight and obese women face a higher risk of CESC, potentially due to under-diagnosis of precancerous lesions.
Fig.
3 shows the BMI distribution for male and female patients across various cancer types. The boxplot highlights the median BMI, interquartile range (IQR) and outliers. Generally, the median BMI is similar for both genders, except in BLCA and KIRP, where males have slightly higher BMIs. In COAD, COADREAD and ESCA, females have a slightly higher median BMI. Additionally, CESC shows greater BMI variability, suggesting more significant differences among patients.
Fig.
4 displays the two-year and five-year survival rates for patients across different cancer types. Except for ESCA and LIHC, which have lower obesity rates, other cancer types with higher obesity proportions have a two-year survival rate above 35%, some exceeding 55% (e.g., KIRP), while five-year survival rate differences are less pronounced. Fig.
5 shows Kaplan-Meier survival curves for each BMI category (normal, overweight, obese), with a log-rank test
p-value of 0.001, indicating a significant survival difference between the groups where overweight and obese groups have higher survival rates than the normal group.
We conducted a Cox regression analysis using the three BMI categories (normal, overweight and obese) as a categorical variable, adjusting for age and gender. Table
2 shows the coefficient estimates,
p-values and hazard ratios for BMI (overweight) and BMI (obesity) across seven cancer types with results indicating that, except for BLCA and ESCA, the coefficient estimates for BMI (overweight) and BMI (obesity) were mostly negative, suggesting a lower risk of death with higher BMI. Specifically, negative and statistically significant coefficients for BMI (obesity) were seen in COAD, COADREAD and the combined data, while the coefficient for BMI (overweight) was also negative and significant for the combined data.
In summary, this study analyzed 1825 patients from seven TCGA cancer types to investigate the relationship between BMI, cancer characteristics and prognosis. Most patients were overweight or obese except for those with ESCA and LIHC, likely due to weight loss related to disease symptoms. Among high-BMI patients, bladder cancer (BLCA) was most common in males, and cervical cancer (CESC) in females. Higher BMI was generally associated with better two-year survival rates. Kaplan-Meier and Cox regression analyses indicated that, except for BLCA and ESCA, higher BMI was linked to a lower risk of death in most cancer types.
Although the Kaplan–Meier curves indicate that overweight and obese patients had significantly better survival rates (p = 0.001), the Cox regression results varied across cancer types. The non-significant findings in BLCA and ESCA might be due to limited sample sizes (particularly in ESCA), tumor heterogeneity, or the lack of clinical information such as treatment details and comorbidities in the TCGA dataset. It is also important to note that the Kaplan–Meier analysis reflects unadjusted survival differences, while the Cox model adjusts for variables such as age and gender, so such adjustment might dilute the effect of BMI, especially in cancer types where these covariates have a strong impact on prognosis. Caution should therefore be exercised when interpreting the relationship between BMI and survival.
2. Materials and Methods
2.1 Data Structure and the Multiple Pathways
We conducted a study with n subjects, each with genetic data represented by a vector of
p genes
, the interactions between genes are represented by
, based on a specific genotyping encoding. The number of genes might exceed the sample size, and high-dimensional statistics literature, such as Jacob
et al. [
22] provides theoretical guidance on the relationship between
p and
n. Genes are grouped into
G potentially overlapping pathways, where a gene could belong to multiple pathways, reflecting their hierarchical structure—common in gene expression data. The study aimed to identify genes and interactions associated with BMI phenotype using this natural structure. Pathway data are available from the human molecular signature database (MSigDB):
http://www.broadinstitute.org/gsea/msigdb.
TCGA transcriptomic data were obtained using the R package “UCSCXenaTools” [
23], while genomic data for TCGA BLCA, LIHC, ESCA and CESC used in this study were available from the TCGA Hub on the UCSC Xena platform (
https://tcga.xenahubs.net). For example, the TCGA BLCA dataset comprised 365 patients with recorded height and weight measurements, along with gene expression profiles covering 20,501 genes per individual. BMI was derived from the height and weight data.
2.2 Regression Model Performance Metrics
To compare the performance of regression models, the study used several performance metrics from the R package “Metrics” such as Min-Max Accuracy, mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE), root mean squared error (RMSE), mean absolute scaled error (MASE), mean absolute error (MAE) and median absolute error (MDAE). The definitions are as follows:
Among these metrics, only Min-Max Accuracy favors values close to 1, indicating better model performance; for all others, so smaller values imply fewer errors and better predictions. Each metric suits different contexts: RMSE emphasizes larger errors and is suitable when extreme errors matter, while MAE provides a general average error size. MDAE, based on the median, is more robust to outliers and useful when extreme values exist. MASE is common in time series analysis, allowing comparison across different scales. MAPE and SMAPE express errors as percentages; however, MAPE can become unstable when actual values near zero, while SMAPE mitigates this through symmetric treatment. Studies [
24] highlight MAPE’s bias toward low predictions, making it less reliable in some cases. SMAPE, introduced to address MAPE’s limitations, is gaining popularity due to its balanced error expression [
25,
26]. Thus, this study focuses on SMAPE results. In summary, selecting the right metric based on data traits is crucial for meaningful model evaluation.
2.3 The Overlapping Group Screening Approach for Quantitative Trait
This study applied overlapping group screening (OGS) to identify gene and interaction biomarkers related to quantitative phenotypes. OGS has been widely used in genomics, addressing outcomes such as censored survival time [
27], binary tissue classification [
28] and multinomial cancer subtype classification [
29]. The method uses a two-stage group screening to detect both main and interaction effects. Since gene pathways can overlap, the latent effect model by Jacob
et al. [
22] is adopted to handle shared genes across groups. All transcriptomic data must be standardized before applying OGS. The procedure of the OGS method for linear regression models
is as follows.
Step 1: To identify biologically relevant gene sets (i.e., pathways), we began by applying the overlapping group binary logistic regression framework, implemented via the R package “grpregOverlap” [
30]. This approach enables simultaneous feature selection while accommodating gene membership in multiple functional groups.
Step 2: Building on the strategy proposed by Wang and Chen [
27], we generated sets of gene-gene (G-G) interaction pairs, which fall into three categories: interactions occurring within a single candidate pathway, interactions spanning two distinct candidate pathways identified in Step 1, and interactions linking one selected pathway with a previously uncharacterized pathway. For each group of G-G interactions, we evaluated its association with the quantitative trait using the Sequence Kernel Association Test (SKAT) [
31]. This test yields a group-level
p-value by aggregating the effects of individual interactions through a weighted sum of chi-square distributions. The
p-values are derived using the Davies method [
32], implemented in the “CompQuadForm” R package [
33]. A lower
p-value signifies stronger evidence for association and thus higher relevance in downstream analysis. Interaction groups are retained for model development if their p-values fall below a predetermined threshold.
Step 3: Finally, incorporating the selected pathways and interaction groups, we constructed a predictive model for BMI based on microarray data. This is achieved through regularized linear regression methods, including ridge regression, lasso [
34] and adaptive lasso [
35], as implemented in the “glmnet” R package [
36]. We also provided a detailed description of the OGS method, including its mathematical formulation, underlying assumptions (such as group sparsity and gene overlap), and the associated optimization procedure, with full details presented in the
Supplementary Material.
2.4 Regularized Linear Regression
Regularized regression controls model complexity by adding a penalty term to the objective function, aiming to prevent overfitting, improve generalization, and identify key variables in high-dimensional data. It extends ordinary least squares (OLS) by penalizing both large and small regression coefficients to reduce overfitting and model complexity. Common methods include ridge (L2), lasso (L1) and adaptive lasso. Adaptive lasso builds on ridge estimates to reduce multi-collinearity, then uses their absolute values in a weighted lasso penalty, increasing the chance of shrinking unimportant coefficients to zero and selecting the most relevant features.
According to the study by Hoerl and Kennard [
37], the objective function of the ridge regularized linear regression model is as follows:
The vector represents the genes and gene interaction terms selected using the OGS method, while is the penalty term, which corresponds to the sum of the squared coefficients of the variables considered in the candidate model.
According to the study by Tibshirani [
34], the objective function of the linear regression model with lasso (least absolute shrinkage and selection operator) is as follows:
The vector represents the genes and gene interaction terms selected using the OGS method, while is the penalty term, which corresponds to the sum of the absolute values of the coefficients of the variables considered in the candidate model.
Adaptive lasso is an improved regularization method of lasso, primarily aimed at overcoming the issue of selection inconsistency in lasso and enhancing the accuracy of variable selection. The core idea is to introduce weights into the penalty term of lasso, adjusting the penalty based on the importance of the variables. According to Zou [
35], the objective function of adaptive lasso is as follows:
The vector represents the genes and gene interaction terms selected using the OGS method, and (where, in cases with more parameters than samples, it is recommended to use the OLS estimated coefficients; otherwise, the ridge method’s estimated coefficients are suggested as the initial estimates for the weights). The parameter controls the degree of weight decay, and is the regularization parameter. This design allows adaptive lasso to apply smaller penalties to variables with larger coefficients and larger penalties to those with smaller coefficients, thus enhancing the consistency of variable selection when identifying important variables. The advantages and disadvantages of the three regularization functions are summarized in Supplementary Table 1.
2.5 Generalized Ridge Regression
In high-dimensional data (where
p n), traditional linear regression is not suitable for processing such data, so ridge regression is typically used for these types of data. Traditional ridge regression uniformly shrinks all regression coefficients towards zero, which might not be the best strategy when dealing with high-dimensional and sparse data. However, generalized ridge regression is a regression method designed for high-dimensional data and sparse models [
38] where the goal is to extend uniform shrinkage in high-dimensional scenarios to non-uniform shrinkage, replacing the identity matrix
in ridge regression with a diagonal matrix
, and it proposes
where λ > 0 is the shrinkage parameter, and is the threshold parameter. The diagonal elements of the main diagonal matrix are suggested to take larger values for components that are close to zero, forming
where
The optimal is estimated through the modified generalized cross-validation function, which is defined as:
where . Then, are defined as . Given , is continuous with respect to , and any optimization method (such as the R optim function) can be used to minimize it in order to obtain . In sparse and high-dimensional models, the histogram of , can be approximated as a standard normal distribution, so a search range of is enough. Since is discontinuous with respect to , it is recommended to use grid search, defined as .
Finally, the generalized ridge regression estimator is defined as
In addition, , where
Generalized ridge regression can be implemented using the R package “g.ridge”. Emura
et al. [
39] showed through simulations and real data that it outperforms traditional ridge regression. They recommend standardizing predictors and excluding an intercept term; thus, the response variable Y should be centered by subtracting its mean. This study uses estimates from generalized ridge regression as initial weights for Adaptive Lasso, aiming to enhance prediction performance.
2.6 The Alternative Classification Methods
In our machine learning (ML) pipeline, we begin by utilizing the OGS method to pinpoint key gene biomarkers and interaction signals that will serve as features for downstream predictive modeling.
To construct regression-based prediction models, we explore several algorithms. For support vector regression (SVR), we adopt a radial basis function kernel. Hyperparameter tuning is performed for the “Cost” parameter across a wide logarithmic scale: 10
-2, 10
-1, 1, 10
1, …,10
5, including 0, and for the “Epsilon” parameter across values from 0.0 to 1.0 in 0.1 increments. This grid search and model evaluation are carried out using the tune function from the “e1071” R package, which performs cross-validation to identify the optimal SVR configuration. For the k-nearest neighbors (KNN) algorithm, we employ a rectangular kernel and use the kknn function from the “kknn” package. The number of neighbors is tuned within a range of 1 to 50 through cross-validation, seeking the value that yields the best predictive accuracy. Within the Random Forest (RF) modeling framework, two key hyperparameters are optimized: the number of trees, evaluated across values from 1 to 500, and the number of features considered at each split, explored over a range from 1 to d/3, where d denotes the number of features [
40]. This tuning process is implemented using the randomForest and tuneRF functions from the “randomForest” R package, again relying on cross-validation to determine the configuration that offers the best generalization performance.
3. Results
In the following simulations, we evaluate the predictive performance of the proposed OGS method combined with regularized regression or machine learning models, and compare it to Oracle, sure independence screening (SIS) Lasso and Ordinary Lasso methods. The Oracle method uses the true underlying model, which is known in simulations but not in real-world cases. SIS Lasso first ranks genes and gene-gene (G-G) interactions using univariate regression, selects the top n / (2 * log (n)) predictors, and applies lasso regression to build the final model. Ordinary Lasso directly applies lasso regression to all genes and G-G interactions without preselection.
3.1 Evaluation on Simulated Genomic Datasets With Overlapping Gene Structures
To demonstrate the effectiveness of the proposed OGS-based feature selection combined with regularized linear regression, we perform a comprehensive numerical study. This analysis not only highlights the behavior of our approach under controlled conditions but also benchmarks its predictive accuracy against a set of well-established machine learning algorithms. A synthetic dataset comprising 300 simulated observations is employed for training purposes. Each individual response is generated from an underlying linear regression framework, ensuring a structured ground truth for model comparison,
with the covariates are distributed uniformly between (–3,3) and denotes the two-way interaction covariates. To evaluate the predictive performance of the models, we generate an independent test dataset consisting of 100 samples. These samples are drawn from the same underlying distribution as the training data but are not used during model training, ensuring an unbiased assessment of generalization accuracy.
The simulation considers both gene group size and overlapping structure, as illustrated in Fig.
6. For example, groups 10 and 11 each contain 15 genes, sharing 5 genes and totaling 25 unique ones. In total, the study includes 462 genes and 594 group-specific gene effects. Groups 1, 7, 13 and 19 are set to have significant effects, with constant values of 2.25, 2.25, 1.5 and –1.5 respectively. In group 13, key gene-gene interactions (G78–G90, G80–G88, G82–G86) have coefficients of 4, 6 and 4. For groups 13 and 7, interactions (G23–G81, G24–G83, G25–G85) have coefficients 4, 6 and 4.8. The simulation includes 106,953 major gene and G-G interaction pairs.
We conducted the simulation 200 times to obtain numerical results. As shown in Table
3, the OGS method with ridge, lasso and adaptive lasso penalties consistently outperformed other methods, including standard ML techniques. Notably, OGS_G.ridge_ALasso achieved the best SMAPE performance, averaging 96.4460% with a standard deviation of 13.1026%, indicating both low prediction error and high stability. This highlights its strong advantage in high-dimensional prediction tasks.
Compared to other ML methods like OGS_RF, OGS_SVR and OGS_KNN, OGS_G.ridge_ALasso shows significantly higher accuracy. OGS_KNN (SMAPE: 159.9890%) and OGS_SVR (SMAPE: 139.3309%) have large fluctuations, indicating poor stability and adaptability. Even against the relatively strong OGS_Lasso, OGS_G.ridge_ALasso performs better across multiple metrics, including SMAPE, Min-Max Accuracy, MAE and MDAE, demonstrating more balanced predictive performance.
While both OGS_G.ridge and OGS_Ridge are based on ridge regression, OGS_G.ridge uses a generalized version that adjusts penalty strength based on variable characteristics, offering more flexibility. It achieves nearly 18% lower SMAPE than OGS_Ridge. Despite a slightly higher standard deviation, its improved accuracy suggests that generalized ridge better captures complex data relationships. Comparing OGS_ALasso with OGS_G.ridge_ALasso shows further improvement when generalized ridge is introduced. Both use adaptive lasso for variable selection, but OGS_G.ridge_ALasso applies variable-specific penalties, aiding in retaining key information. Though the SMAPE improvement is about 3.8%, it remains meaningful in complex high-dimensional data.
In conclusion, OGS_G.ridge_ALasso shows clear advantages over traditional ridge, adaptive lasso and other ML methods, offering low error and high stability. It is especially effective for high-dimensional structured problems like genetic data, highlighting both theoretical progress and practical potential.
We also examine an alternative gene network with 24 groups, each containing 23 genes, detailed in Fig.
7. This setup includes 504 genes and 552 group-specific gene effects. Groups 1, 7, 13 and 19 have significant effects, with underlying values of 2.25, 2.25, 1.5 and –1.5 respectively. In group 13, key gene-gene interactions (G253–G265, G255–G263, G257–G261) have coefficients 4, 6 and 4. For groups 13 and 7, interactions (G128–G266, G130–G268, G132–G270) have coefficients 4, 6 and 4.8. The simulation includes 127,260 major gene and interaction pairs.
According to Table
4, OGS_G.ridge_ALasso achieved the best predictive performance among all models, with an average SMAPE of 114.8411% and a standard deviation of 13.7029%, indicating both low error and high stability—making it well-suited for high-dimensional structured data. In contrast, most machine learning methods showed higher SMAPE values, likely due to overfitting or difficulty capturing the underlying data structure.
Compared to OGS_Lasso, OGS_G.ridge_ALasso improved SMAPE by about 5% and outperformed in other metrics, showing greater stability and predictive power. The addition of adaptive lasso to OGS_G.ridge further enhanced accuracy, significantly reducing SMAPE. This highlights the substantial performance gain achieved by combining the two techniques. OGS_Ridge, which lacks flexible variable selection, had relatively poorer performance. Similarly, OGS_G.ridge_ALasso outperformed OGS_ALasso (SMAPE = 118.0533%), highlighting the benefit of incorporating generalized ridge.
To further evaluate the robustness of the proposed method, we conducted an additional simulation study for sensitivity analysis. This analysis incorporated two different gene group structures (high-overlap and low-overlap scenarios) and simulated varying effect sizes by reducing the magnitude of the true biomarker coefficients (original coefficients divided by 2). Our results (Supplementary Tables 2,3) show that OGS_G.ridge_ALasso accurately identifies the relevant gene groups under these conditions, demonstrating strong robustness to varying signal strengths.
In summary, OGS_G.ridge_ALasso stands out in both accuracy and stability across metrics. It is particularly well-suited for high-dimensional and structurally complex prediction tasks, such as those involving genomic data, and holds significant practical value for real-world applications.
3.2 Real Data Application: TCGA CESC Data
Given the potential contamination and batch effects in TCGA transcriptomic data, we performed data normalization during the preprocessing stage to minimize the impact of technical variability. Furthermore, to account for the possible presence of outliers, we employed the non-parametric Kendall’s tau correlation coefficient to identify the top 1000 genes significantly associated with BMI for subsequent analyses as this method is less sensitive to outliers, thereby enhancing the overall robustness of the analysis.
Our TCGA CESC dataset includes 258 subjects: 100 with BMI 25, 72 with 25 BMI 30 and 86 with BMI 30. Given the likely limited pool of BMI-associated genes, we first streamline the gene set using Kendall’s tau correlation, selecting the top 1000 genes with the highest absolute correlations. Among these, 581 genes are mapped to 617 pathways based on the Gene Ontology Cellular Component (GO-CC) database. The remaining 419 unmapped genes are either excluded or grouped separately within the OGS framework. This results in 169,071 or 500,500 main gene and gene-gene interaction effects.
We used a validation set approach, randomly splitting the dataset into 80% training (206 samples) and 20% testing (52 samples), and repeated this process 30 times after excluding 419 unmapped genes. Table
5 summarizes the average prediction results for BMI. We also evaluated two additional GO pathway databases: Biological Process (GO-BP) and Molecular Function (GO-MF) with results in
Supplementary Tables 4,5. Across all databases, OGS_G.ridge_ALasso outperformed other methods, including standard machine learning models, particularly in terms of SMAPE, while also showing strong results in other metrics.
Next, based on the GO-CC pathway database, we apply our proposed method to the entire TCGA CESC dataset for model selection and parameter estimation. The method identifies 11 genes and 60 gene-gene interaction biomarkers, with corresponding network structure shown in Fig.
8 and and the top 10 highest and bottom 10 lowest biomarker coefficient values in shown in Table
6.
Supplementary Table 6 presents all the selected biomarkers and their corresponding coefficients. In summary, we have incorporated node sizes and edge weight elements into the network figures. Larger nodes represent genes with greater importance. Additionally, the color of each node indicates the direction of effect: red represents a positive effect, while yellow represents a negative effect.
Some of the selected biomarkers have been confirmed to have biological significance in existing literature. For example, Li
et al. [
41] indicated that
WNT1 is a target gene of miR-34a, and the decreased expression of miR-34a (suppressed by HPV E6/E7) leads to an increase in
WNT1 expression. In CESC,
WNT1 promotes cell proliferation, invasion, and the conversion of “E-cadherin to P-cadherin” by activating the WNT/
-catenin signaling pathway, further driving cancer progression. Therefore,
WNT1 plays a carcinogenic role in CESC. Additionally, Park [
42] highlighted the association between the
CRK gene and CESC. The
CRK gene belongs to the Crk family and encodes an adaptor protein with SH2 and SH3 domains, involved in various cellular processes, such as cell proliferation, migration and survival. Studies have shown that
CRK is upregulated in several human cancers, including cervical cancer.
3.3 Real Data Application: TCGA BLCA Data
Our TCGA BLCA dataset includes 356 subjects: 147 with BMI 25, 124 with 25 BMI 30 and 85 with BMI 30. Given the likely limited pool of BMI-associated genes, we first streamline the gene set using Kendall’s tau correlation, selecting the top 1000 genes with the highest absolute correlations. Among these, 620 genes are mapped to 586 pathways based on the GO-CC database. The remaining 380 unmapped genes are either excluded or grouped separately within the OGS framework. This results in 192,510 or 500,500 main gene and gene-gene interaction effects.
We used a validation set approach, randomly splitting the dataset into 80% training (284 samples) and 20% testing (72 samples), and repeated this process 30 times after excluding 380 unmapped genes. Table
7 summarizes the average prediction results for BMI. We also evaluated two additional GO pathway databases: GO-BP and GO-MF with results in
Supplementary Tables 7,8. Across all databases, OGS_G.ridge_ALasso outperformed other methods, including standard machine learning models, particularly in terms of SMAPE, while also showing strong results in other metrics.
Next, based on the GO-CC pathway database, we apply our proposed method to the entire TCGA BLCA dataset for model selection and parameter estimation. The method identifies 23 genes and 19 gene-gene interaction biomarkers, with corresponding network structure shown in Fig.
9 with the top 10 highest and bottom 10 lowest biomarker coefficient values shown in Table
6.
Supplementary Table 9 presents all the selected biomarkers and their corresponding coefficients.
The
HUS1 gene has been confirmed to have biological significance in the literature [
43].
HUS1 is a protein involved in DNA repair, and its expression is elevated in BLCA. Studies indicate that inhibiting
HUS1 enhances chemotherapy efficacy in cisplatin-sensitive cancer cells, but has no significant effect in resistant cells. Additionally, high expression of
HUS1 is associated with poor prognosis in patients, suggesting that
HUS1 might be a key factor influencing responses to platinum-based chemotherapy and could potentially serve as a therapeutic target.
3.4 Real Data Application: TCGA LIHC Data
Our TCGA LIHC dataset includes 334 subjects: 177 with BMI 25, 89 with 25 BMI 30 and 68 with BMI 30. Given the likely limited pool of BMI-associated genes, we first streamline the gene set using Kendall’s tau correlation, selecting the top 1000 genes with the highest absolute correlations. Among these, 634 genes are mapped to 557 pathways based on the GO-CC database. The remaining 366 unmapped genes are either excluded or grouped separately within the OGS framework. This results in 201,295 or 500,500 main gene and gene-gene interaction effects.
We used a validation set approach, randomly splitting the dataset into 80% training (267 samples) and 20% testing (67 samples), and repeated this process 30 times after excluding 366 unmapped genes. Table
8 summarizes the average prediction results for BMI. We also evaluated two additional GO pathway databases: GO-BP and GO-MF with results in
Supplementary Tables 10,11. Across all databases, OGS_G.ridge_ALasso outperformed other methods, including standard machine learning models, particularly in terms of SMAPE, while also showing strong results in other metrics.
Next, based on the GO-CC pathway database, we apply our proposed method to the entire TCGA LIHC dataset for model selection and parameter estimation. The method identifies 16 genes and 132 gene-gene interaction biomarkers, with corresponding network structure shown in Fig.
10 with the top 10 highest and bottom 10 lowest biomarker coefficient values shown in Table
6.
Supplementary Table 12 presents all the selected biomarkers and their corresponding coefficients.
Among the selected biomarkers, some have already been confirmed in existing literature to hold biological significance. A study by Chen
et al. [
44] found that the expression of
CAT was significantly downregulated in advanced LIHC tissues, and that high
CAT expression was associated with better survival outcomes. Furthermore, when
CAT expression is low, MET inhibitors such as SU11274 may serve as effective treatment options for LIHC with low
CAT expression. This suggests that
CAT might play an important role in LIHC and is closely related to tumor progression and prognosis. Wang
et al. [
45] reported that
ZFP36 is a gene associated with ferritinophagy and exhibits abnormal expression in immune cells in LIHC with their results indicating that
ZFP36 could be closely involved with the functions of monocytes and macrophages, and might participate in immune regulation and tumor progression.
ZFP36 shows potential as a target for liver cancer research or therapy. Zhao
et al. [
46] identified
B4GALT2 as a gene related to amino acid metabolism, which has been incorporated into a prognostic risk model for LIHC patients, and as its high expression is associated with poorer survival outcomes and could be involved in immune regulation and metabolic reprogramming in tumors, it has the potential to serve as a biomarker for liver cancer treatment and prognosis assessment.
3.5 Real Data Application: TCGA ESCA Data
Our TCGA ESCA dataset includes 175 subjects: 97 with BMI 25, 49 with 25 BMI 30 and 29 with BMI 30. Given the likely limited pool of BMI-associated genes, we first streamline the gene set using Kendall’s tau correlation, selecting the top 1000 genes with the highest absolute correlations. Among these, 642 genes are mapped to 532 pathways based on the GO-CC database. The remaining 358 unmapped genes are either excluded or grouped separately within the OGS framework. This results in 206,403 or 500,500 main gene and gene-gene interaction effects.
We used a validation set approach, randomly splitting the dataset into 80% training (140 samples) and 20% testing (35 samples), and repeated this process 30 times after excluding 358 unmapped genes. Table
9 summarizes the average prediction results for BMI. We also evaluated two additional GO pathway databases: GO-BP and GO-MF with results in
Supplementary Tables 13,14. Across all databases, Our proposed method (OGS_G.ridge_ALasso), while slightly outperformed by OGS_KNN in terms of the SMAPE (%) evaluation metric, still demonstrates better predictive performance compared to the other methods.
Next, based on the GO-CC pathway database, we apply our proposed method to the entire TCGA ESCA dataset for model selection and parameter estimation. The method identifies 8 genes and 63 gene-gene interaction biomarkers, with corresponding network structure shown in Fig.
11 with the top 10 highest and bottom 10 lowest biomarker coefficient values are shown in Table
6.
Supplementary Table 15 presents all the selected biomarkers and their corresponding coefficients.
The
ACE gene has been shown to have biological significance in the literature [
47]. The study demonstrated that, among patients with esophageal cancer, those with the
ACE gene D/D genotype were more likely to develop postoperative pulmonary complications, with a risk more than three times higher than that of patients with the I/I or I/D genotypes. In addition, serum
ACE levels were positively correlated with the presence of the
ACE D allele; the higher the
ACE level, the greater the risk of postoperative pulmonary complications. This suggests that the insertion/deletion polymorphism of the
ACE gene may could an important role in susceptibility to postoperative pulmonary injury in patients with esophageal cancer.
3.6 BMI-Stratified Survival Analysis Reveals Gene-Specific Prognostic Associations
In this study, patients with various cancer types are stratified into three groups based on their BMI: normal, overweight and obese. Within each BMI subgroup, a Cox proportional hazards model is applied to assess the association between the expression level of individual genes and overall survival with each model including only one gene as the primary explanatory variable, and age and gender being incorporated as covariates to adjust for potential confounding effects. A stratified analytical approach is employed to evaluate whether BMI classification modulates the prognostic impact of gene expression on survival outcomes and to identify gene signatures with significant prognostic relevance within specific BMI categories. Gene selection is based on both the analytical framework developed in this study and relevant targets reported in the literature. The results are summarized in Table
10.
The results show that most genes do not exhibit statistically significant associations with overall survival across different cancer types and BMI subgroups. However, in LIHC, the expression of the CAT gene in the overweight group is significantly negatively associated with overall survival (coefficient = –0.7447, p = 0.0005), indicating a potential protective prognostic role. In contrast, the B4GALT2 gene shows a significant positive association (coefficient = 0.4846, p = 0.0227), suggesting it might function as a risk factor. Additionally, in CESC, the expression of the CRK gene in the overweight group demonstrates a borderline significant positive association (p = 0.0611). Overall, the findings indicate that specific genes are significantly associated with survival only within certain BMI categories, suggesting that BMI could act as a modifier of gene-based prognostic effects and warrants further investigation into the underlying biological mechanisms.
We perform Kaplan-Meier survival analyses within each BMI subgroup, using the median gene expression as a cutoff. Several genes show significant or borderline associations with overall survival in specific BMI categories (Supplementary Figs. 1–7). WNT1 is significantly associated with survival in the normal BMI group (p = 0.0036), while CAT shows protective associations in both the normal (p = 0.0028) and overweight (p = 0.0087) groups. B4GALT2 is linked to worse survival in the normal group (p = 0.037), and CRK shows a borderline association in the overweight group (p = 0.059). HUS1 (p = 0.1) and ZFP36 (p = 0.1) also display borderline significance in the normal and obese groups. These results suggest that BMI could modify the prognostic impact of gene expression.
3.7 Gene Set Analysis Identifies BMI-Related Pathways in Various Cancer Types Using MGSA
To better interpret the biological significance of gene expression data, we analyzed not only individual genes but also gene sets with functional or regulatory relevance. Model-based Gene Set Analysis (MGSA) is a Bayesian approach that reveals expression patterns and functional relationships among gene sets within biological systems [
48]. MGSA integrates pathway information and statistical modeling to evaluate gene set changes under different conditions, helping to understand their impact on diseases or physiological states. This method is implemented in the R package “mgsa” and incorporates Gene Ontology data [
49]. In this study, we identified important genes using our proposed method, OGS_G.ridge_ALasso, and annotate them based on gene sets from the MSigDB database, including collections such as c2.cp.biocarta, c2.cp.kegg, c5.GOBP, c5.GOCC and c5.GOMF.
Using MGSA, we identified multiple biologically relevant pathways significantly associated with BMI across different cancers. In BLCA, GOBP_INTRACILIARY_TRANSPORT_INVOLVED_IN_CILIUM_ASSEMBLY, related to cilium assembly and intraciliary transport, links to BMI, highlighting processes crucial for cell polarity and signal transduction. In ESCA, KEGG_MEDICUS_REFERENCE_AVP_V2R_PKA_SIGNALING_PATHWAY, involving antidiuretic hormone signaling via the V2 receptor and PKA activation, associates with BMI, implicating metabolism and intracellular signaling regulation. In CESC, GOBP_REGULATION_OF_NEUROMUSCULAR_JUNCTION_DEVELOPMENT, which regulates neuromuscular junction formation and function, correlates with BMI. Finally, in LIHC, GOBP_NEGATIVE_REGULATION_OF_HYDROGEN_PEROXIDE_METABOLIC_PROCESS, involved in negative regulation of hydrogen peroxide metabolism and oxidative stress response, shows significant association with BMI. These findings support the biological relevance of the identified biomarkers and provide insight into the pathways through which BMI may influence cancer biology.
4. Discussion
4.1 Potential Improvements to the OGS Method
Identifying susceptibility genes and variants for complex diseases is challenging due to the often unknown underlying disease mechanisms. The OGS method, which incorporates the SKAT to screen for gene-gene interactions, relies on predefined pathways to extract gene network information although such reliance might lead to information loss while limiting the method’s scope; additionally, while current implementations typically consider only simple two-way or multiplicative interactions, future research should aim to develop statistical approaches capable of capturing higher-order and more complex interactions.
Multivariate analysis is another widely used approach, testing grouped variants defined by genes, pathways, or physical locations. Common statistical methods include burden tests, SKAT and the combined SKAT-O, each showing strengths under different biological conditions [
50] Data-adaptive methods, which adjust models based on the data structure, have attracted increasing attention. Notably, Ueki [
51] proposed a novel approach based on Yanai’s generalized coefficient of determination, which allows for the control of type I error without requiring test-specific null distributions. This method is computationally efficient and broadly applicable to models such as lasso, ridge and elastic net, enabling both variant selection and feature filtering. In summary, future efforts should focus on developing more flexible, accurate and interpretable methods to improve the detection of relevant variants in genome-wide studies.
4.2 Leveraging TCGA for Multi-Omics Insights into BMI and Cancer
TCGA provides a rich and comprehensive resource for exploring the relationship between BMI and cancer. With its extensive genomic and clinical datasets spanning numerous cancer types, TCGA enables multi-omics analyses—integrating gene expression, somatic mutations, DNA methylation and copy number variations. This multi-layered approach allows for a deeper understanding of how BMI could influence cancer biology at the molecular level [
52], while its pan-cancer structure facilitates cross-tumor comparisons, helping to uncover both shared and cancer-specific BMI-related molecular features.
We agree that validating BMI-associated gene signatures in independent cohorts is a crucial step to enhance the generalizability of our findings. We have reviewed multiple external datasets from sources such as Gene Expression Omnibus (GEO) and International Cancer Genome Consortium (ICGC) and confirmed that some do include BMI-related annotations [
53,
54]. Although external validation was not performed in the current study, we fully recognize its importance in establishing the broader applicability of our results. Future studies should consider validating the gene signatures identified here using independent datasets with BMI information, although owing to considerable heterogeneity in data generation platforms and clinical variable definitions, incorporating such datasets would require extensive data harmonization and standardization efforts, introducing complexity beyond the scope of this study. In light of these challenges, we focused on leveraging the pan-cancer framework of TCGA to assess the internal consistency and robustness of BMI-associated gene signatures across multiple cancer types. These results provide preliminary support for the validity of our findings.
4.3 Limitations of Western BMI Standards and the Need for Population-Specific Cutoffs
In our current simulated and real data analysis, BMI was treated as a continuous variable rather than categorized into discrete groups. This modeling approach allowed us to avoid arbitrary threshold effects and better capture dose-dependent associations between BMI and clinical or molecular outcomes. As such, the issue of cutoff definition (e.g., WHO vs. Asian-specific thresholds) did not directly affect our core statistical framework; nonetheless, we fully acknowledge the importance of population-specific BMI classification when translating findings into clinical practice.
The widely used BMI classification system—defining 18.5–24.9 as normal, 25+ as overweight and 30+ as obese—is based predominantly on data from Western populations, although research on individuals of Asian descent often exhibit higher body fat percentages at the same BMI levels along with increased susceptibility to chronic conditions such as cancer, type 2 diabetes, hypertension and cardiovascular disease. Consequently, applying Western BMI thresholds to Asian populations might lead to underestimation of health risks.
To address this disparity, region-specific resources such as the Taiwan Biobank and Taiwan’s National Health Insurance Research Database offer valuable Asian population-based insights. These have informed proposals to lower the overweight and obesity thresholds to 23 and 27 for Asian populations respectively, in the endeavor to enhance disease risk prediction and improve public health strategies. Looking ahead, the integration of genetic, dietary and environmental factors holds promise for advancing more personalized and accurate health assessment models.
4.4 Addressing Confounding and Limitations in the TCGA Dataset Related to the Obesity Paradox
Although this study utilizes TCGA data to investigate the relationship between obesity and cancer prognosis and observes the so-called “obesity paradox” in certain contexts, where patients with higher BMI exhibit better outcomes, these findings should be interpreted with caution. The TCGA dataset lacks many potential confounding variables such as treatment history, comorbidities (e.g., diabetes) and lifestyle factors (e.g., smoking and physical activity), which might lead to residual confounding and distort the true relationship between obesity and prognosis. Furthermore, BMI is measured at a single time point, limiting the ability to capture weight changes over time and to differentiate between fat and muscle mass. The retrospective nature of the study design further constrains causal inference; accordingly, future prospective cohort studies with more comprehensive clinical data are warranted to clarify the true causal relationship between obesity and cancer prognosis and validate the observed “obesity paradox” phenomenon reported in this study.
4.5 Need for Experimental Confirmation
Regarding experimental validation such as qPCR or Western blot, we agree that such assays could further strengthen the biological significance of our findings. However, these experiments fall beyond the current scope and resources of this study, which is primarily computational in nature. Although this study does not include laboratory-based experimental validation, we conducted pathway enrichment analysis using the MGSA method to explore the functional relevance of the top-ranked BMI-associated genes, with results indicating that these genes could be involved in metabolic and inflammatory pathways known to be associated with obesity and cancer progression.
4.6 Cancer-Specific vs. Pan-Cancer Perspectives
This study focuses on four cancer types: BLCA, CESC, LIHC and ESCA, selected based on their sample sizes, BMI distributions and biological characteristics. However, the current results are not yet placed within the broader TCGA pan-cancer framework, which might limit the generalizability of the findings. Notably, LIHC and ESCA show lower obesity rates, which likely relate to their unique disease etiologies and risk factors such as viral infections, alcohol consumption or chronic hepatitis thereby possibly weakening the association between BMI and cancer outcomes; additionally, the interactions between BMI and gene expression vary across cancer types, suggesting that the prognostic impact of BMI could be cancer-type specific. Future research should expand analysis to additional cancer types to more comprehensively assess the universality and biological relevance of BMI-associated gene signatures.
5. Conclusions
This study highlights the critical role of BMI in cancer development and prognosis, using TCGA data for in-depth analysis. Both high and low BMI levels were found to be linked to increased cancer risk, where obesity plays a role through hormonal imbalance, inflammation and immune dysfunction, while at the same time, underweight individuals might suffer from malnutrition and reduced treatment tolerance. BMI also affects treatment outcomes: obese patients often face higher complication risks, while underweight individuals might recover less effectively. Interestingly, the “obesity paradox” suggests that in some cancers, higher BMI can be linked to better survival, underscoring the complex, context-dependent nature of the BMI–cancer relationship.
Using the OGS method combined with regularized regression, particularly the OGS_G.ridge_ALasso model, we identified BMI-related genes and interactions from high-dimensional genetic data. This model performed best in accuracy and consistency, particularly in settings with overlapping genetic signals. By applying it to TCGA data, we uncovered meaningful molecular links between BMI and cancer, supporting the importance of maintaining a healthy BMI and offering a framework for exploring gene networks and biomarkers that can inform personalized cancer care.
National Science and Technology Council of Republic of China (Taiwan)(NSTC 112-2118-M-194-003-MY2)