Introduction
Hepatocellular carcinoma (HCC), one of the leading causes of cancer-related deaths, is rarely detected in its early stage and is usually fatal within months of diagnosis [
1]. The treatment of HCC is challenging because of diverse patient conditions, tumor characteristics, and liver function [
2]. In addition, HCC is highly heterogeneous based on the genome data [
3]. Thus, identifying high-risk patients is useful to guide the selection of optimal treatment.
Gene expression analyses have yielded insights into the molecular aspects of tumor progression [
4], differentiation [
5], and recurrence [
6,
7]. Although many cancer molecular biomarkers have been discovered, one single gene still cannot represent precisely the tumor characteristics because of tumor heterogeneity. Transcriptome-based prognostic signatures are becoming increasingly popular. Peng
et al. identified a core gene module in glioblastoma and reported that the module showed better prediction results than any single gene in the module [
8]. Thus, genome-wide expression analysis might help represent each tumor more precisely.
Liu
et al. summarized the main frameworks and methods currently available to infer the transcriptional regulatory networks from microarray gene expression data. They found that coexpression analysis is a powerful tool for exploring the intrinsic organization of a transcriptome [
9,
10]. Coexpression network analysis focuses on gene-to-gene relationships, contrary to differential analysis that focuses on individual gene expression. The former identifies intrinsic modules of coordinately expressed genes in an unsupervised manner [
11]. Coexpression analysis has been utilized in cancer research to illustrate signaling pathway [
12], chemotherapy resistance [
13], and tumor heterogeneity [
14]. Coexpression analysis is performed in liver cancer studies to reveal distinct patterns of liver cancer progression from chronic hepatitis B and hepatitis C [
15]. Meanwhile, coexpression analysis has been proved to help predict cancer survival in breast cancer and glioma [
16,
17].
In the present study, we performed weighted gene coexpression network analysis (WGCNA) to analyze a dataset comprising 488 samples from four publicly available data sets [
11]. We identified genome-wide modules of coexpressed genes with clear functional interpretations. According to the constructed modules, we determined the modules correlated with prognostic signature, which was represented by overall survival (OS) and tumor-free survival (TFS). The results provided potential gene biomarkers and signaling pathways for the prognosis of HCC.
Materials and methods
Publicly available microarray data sets
The raw microarray data from four HCC datasets (GSE62232, GSE14520, GSE15765, and GSE9843) were downloaded from the Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo/) (GSE15765 also contains some cholangiocarcinoma and mixed-type cancer samples. These samples were excluded before analysis). The combined datasets containing 488 samples were originally generated using Affymetrix arrays (U133 Plus 2.0 and U133A platforms), which had 22 277 probe sets in common. The raw datasets were normalized using RMA algorithm. We filtered the probe sets of individual study [signal>log2100 in at least 50% of samples and invariance (interquartile range, IQR)>1.0] prior to WGCNA. A total of 3706 probe sets passed the filtering criteria.
WGCNA
After ComBat adjustment for potential batch effects, the signed WGCNA coexpression measure was performed on the 3706 filtered probe sets using the R “WGCNA” package [
18,
19]. In a weighted gene coexpression network, the nodes represent genes, and the edges represent the connection strength (adjacency),
aij = |cor(
xi,
xj)|
β between the two gene expression profiles
xi and
xj [
15]. The coexpression modules were identified by raising the soft thresholding power to 6 to produce a weighted network, which was the lowest power because the scale-free topology fit index curve flattens out upon reaching a high value. We clustered the genes hierarchically based on their measured network distance known as topological overlap (TO), which assessed gene pairs in terms of their direct correlation and also the degree of agreement in terms of associations to the other genes in the dataset. Modules were identified using the Dynamic Tree Cut algorithm (deep split= 2, cut height= 0.4; default values were used for the other parameters) [
20]. We calculated the module eigengenes (ME) and clustered them according to their correlation to quantify the coexpression similarity of entire modules [
18]. The MEs were calculated by retaining the first principal component following principal component analysis of the log
2 (normalized expression data). Module membership is the degree to which the expression of a gene agrees with the ME expression. This measure of network centrality was determined by calculating the Pearson’s correlation coefficient (PCC) between each individual gene and ME.
Clinical and survival analysis
Clinical variables (gender, age, tumor size, tumor number, cirrhosis, HBV infection, AFP level, ALT level, TNM staging, BCLC staging, and CLIP staging) were compiled for each gene expression module. Pearson’s correlation was analyzed between the modules and clinical variables. Survival analysis was conducted on the dataset of GSE14520, considering that it was the only dataset with comprehensive clinical and survival information. Overall survival (OS) and tumor-free survival (TFS) were used as survival endpoints, which were determined via the “survival” R package. For the single module survival analysis, the data were dichotomized around the median expression of each gene module. Cox regression analysis was performed to evaluate the hazard ratio (HR). Survival curves were constructed by the Kaplan-Meier method and compared by the log-rank test.
Functional annotation
The modules were functionally annotated based on the analytical results of their gene compositions. Gene ontology (GO) biological process enrichment analysis was conducted on the coexpression modules using the R “topGO” package. Moreover, linkages of the co-expressed genes were highly related to their functions, and network-based ontology analysis (http://app.aporc.org/NOA/) was utilized to provide deeper insights [
21]. The hub genes were defined as genes that exhibit significant correlation with MEs and high within-module connectivity. We identified the hub gene by the highest module membership value (kME), which measures correlations between each gene and each ME. The resulting network of the WGCNA data was exported to network software VisANT to visualize the results [
22].
Results
Identification of proper samples and probe sets
We investigated the functional organization of the HCC transcriptome by analyzing the microarray public datasets (GSE62232, GSE14520, GSE15765, and GSE9843) generated by Affymetrix HG-U133A or HG-U133 Plus 2.0 microarray platforms, which had 22 277 probe sets in common. The datasets consisted of 488 HCC samples. We eliminated the probe sets with low expression or with low variability on a study-by-study basis. A total of 3706 probe sets passed the filtering criteria (Fig. 1).
Detection of gene coexpression modules using WGCNA
We applied the WGCNA algorithm to identify the gene coexpression modules with respect to the remaining microarray data (488 samples×3706 probe sets). WGCNA focuses on the TO rather than the correlation of two genes with each other, so it could illustrate the consistent patterns of gene coexpression in the transcriptome. The soft thresholding power was set to 6, and the minimal module size was set to 30. The WGCNA elucidated 10 coexpressed modules ranging in size from 31 to 1046 (assigning each module an arbitrary color for reference) and showed a single group of 718 non-coexpressed probe sets (designated as “gray”) (Fig. 2A). We merged the modules because their genes are highly coexpressed and they might be very similar to the module expression files identified by the Dynamic Tree Cut. We calculated their eigengenes and clustered them according to their correlation to quantify the coexpression similarity of entire modules. We chose the height cut of 0.4, resulting in eight merged coexpressed modules (Fig. 2B).
Module correlated with clinicopathological variables of HCC
We tried to explore whether any of the groups of genes from each of the identified modules were correlated with the clinical variables of HCC. We calculated the PCCs between the modules and gender, age, tumor size, tumor number, cirrhosis, HBV infection, AFP level, ALT level, TNM staging, BCLC staging, and CLIP staging (Fig. 3). The pink (34 probe sets) and red (117 probe sets) modules yielded significant PCCs with liver function (pink module: r = 0.20, P = 0.002; red module: r = 0.14, P = 0.03). By contrast, the turquoise (1046) and black (70) modules were found to be negatively correlated with tumor BCLC staging (turquoise module: r = -0.19, P = 0.002) and CLIP staging (turquoise module: r = -0.14, P = 0.03; black module: r = -0.14, P = 0.02). In addition, we observed some negative correlations between the black module and tumor number (r = -0.13, P = 0.04). These results suggested that the highly coexpressed genes in the same module have potential biological significance. Each module might represent some clinical features in HCC patients.
Module associated with significant associations to OS and TFS
We also explored the significant associations to OS and TFS of these modules given their biological significance. We applied the Cox regression model in 242 patients with complete survival data to calculate the HRs and corresponding P values for each dichotomized module (Table 1). The turquoise and red modules in the eight merged modules were shown to have significant associations with the prognosis of HCC patients. The downregulated expression of the genes within the turquoise module indicated poor outcomes of OS (HR= 0.487, P<0.001) and TFS (HR= 0.649, P = 0.013). By contrast, upregulated expression of genes within the red module indicated poor outcomes of TFS (HR= 1.461, P = 0.029). However, the poor outcome was not statistically significant (P = 0.076), although a higher risk of OS was observed in the red module (HR= 1.445).
The results of the survival analysis of these modules were consistent with their biological significance found by Pearson’s correlation analysis. The turquoise module was previously shown to be correlated with the tumor BCLC staging (PCC= -0.19, P = 0.002) and CLIP staging (PCC= -0.14, P = 0.03). Thus, the low expression of genes within the turquoise module might represent a higher tumor stage of HCC, which was consistent with the poor outcomes in the low expression group of the turquoise module. Similarly, the red module was correlated with the ALT level (PCC= 0.14, P = 0.03), which indicates that the high expression of genes within the red module might indicate poor liver function. Thus, the result was consistent with the poor survival of the high expression group of the red module.
Enrichment analysis of the genes within the turquoise module
We focused on the turquoise module and performed some enrichment analysis because this module (1046 probe sets) was correlated with tumor staging (BCLC staging and CLIP staging) and cancer prognosis (OS and TFS) of HCC patients. From the Kaplan-Meier curves, we found that upregulated expression of genes within the turquoise module indicated better outcomes of HCC patients, both in the OS and TFS (Fig. 4A and 4B). Therefore, some protective mechanisms may be hidden in the turquoise module. Categories associated with inflammation, immune, virus-related, and interferon-mediated pathways were enriched in the turquoise module after the GO enrichment analysis (Fig. 4C). Moreover, considering the highly related functions of the co-expressed genes, the network-based ontology analysis of the turquoise module showed that small molecule and cellular ketone metabolic processes were among the top biological processes; the microsome and cytoplasmic parts were among the top components; and oxidoreductase and catalytic activities were among the top molecular functions.
We exported the turquoise data to VisANT and visualized the top 30 hub genes in the module (Fig. 4D). HSD17B6 [hydroxysteroid (17-β) dehydrogenase 6], CES2 (carboxylesterase 2), DCXR (dicarbonyl/L-xylulose reductase), SEC14L2 (SEC14-like lipid binding 2), CYB5A (cytochrome B5 type A), EHHADH (enoyl-CoA, hydratase/3-hydroxyacyl CoA dehydrogenase), PCK2 (phosphoenolpyruvate carboxykinase 2), GLYAT (glycine-N-acyltransferase), HAO1 [hydroxyacid oxidase (glycolate oxidase) 1], ABAT (4-aminobutyrate aminotransferase), SLC10A1 [solute carrier family 10 (sodium/bile acid cotransporter), member 1], SLC27A5 [solute carrier family 27 (fatty acid transporter), member 5], ALDH6A1 (aldehyde dehydrogenase 6 family, member A1), SERPINC1 [serpin peptidase inhibitor, clade C (antithrombin), member 1], GYS2 (glycogen synthase 2), HAGH (hydroxyacylglutathione hydrolase), PRC1 (protein regulator of cytokinesis 1), TOP2A (topoisomerase 2 α), GINS1 (GINS complex subunit 1), BUB1B (BUB1 mitotic checkpoint serine/threonine kinase B), and CDK1 (cyclin-dependent kinase 1) were among the top 30 hub genes.
Discussion
Gene signatures identified from genome-based assays are expected to contribute to HCC stratification [
23]. Several studies have attempted to define the gene signatures predicting the survival or recurrence of HCC [
24,
25]. In this study, we utilized the WGCNA algorithm to analyze the mRNA datasets containing 488 HCC samples to identify the genes associated with clinical characteristics and cancer prognosis.
We identified eight distinct coexpression modules from 3706 probe sets which passed the pre-filtering criteria for WGCNA analysis. Among the identified modules, the pink and red modules were associated with liver function, whereas the turquoise and black modules were negatively correlated with tumor staging. In addition, some negative correlations were found between the black module and tumor number, suggesting that the highly coexpressed genes in the same module were of potential biological significance. Given that the biological significance of these modules might have potential clinical manifestations, we performed Cox regression analysis for each module to explore their survival significance. Poor outcomes were found in the low expression group of the turquoise module, which was consistent with the previous suggestion that low expression of the turquoise module might represent higher tumor stage of HCC. Similarly, Cox analysis indicated poor outcomes in the high expression group of the red module, which was consistent with poor liver function represented by these groups of genes.
The turquoise and red modules were inferred to be correlated with tumor staging and liver function, respectively. Thus, we chose the turquoise module for the subsequent analysis because this module might indicate tumor characteristics more accurately. Moreover, the turquoise module contained the largest groups of probe sets in the identified modules. GO enrichment analysis suggested that inflammation, immune, virus-related, and interferon-mediated pathways were over-represented within the turquoise module. Hepatitis infection and anti-viral treatment might play significant roles in the development and prognosis of HCC [
26]. To some extent, the result also partially explained the different mortalities among the HBV-positive, HCV-positive, and non-B non-C HCC patients [
27].
We visualized the top 30 hub genes in the turquoise module by VisANT to obtain an insight on the hidden mechanisms in the turquoise module [
22].
HSD17B6, CES2, DCXR, SEC14L2, CYB5A, EHHADH, PCK2, GLYAT, HAO1, ABAT, SLC10A1, SLC27A5, ALDH6A1, SERPINC1, GYS2, HAGH, PRC1, TOP2A, GINS1, BUB1B, and
CDK1 were among the top 30 hub genes. Most of these genes had been associated with cancer development or prognosis, and
CDK1 was the most studied gene in this group. Although
CDK1 has been extensively investigated in cancer therapy, the role of this gene in liver cancer has not been fully elucidated yet. For the first time, we suggest
CDK1 as a biomarker for HCC prognosis [
28]. Among the hub genes,
TOP2A is also a promising biomarker for cancer prognosis. The
TOP2A gene encodes a DNA topoisomerase that controls and alters the topologic states of DNA during transcription. Various reports have shown that TOP2A expression is a valuable prognostic marker for tumor progression, recurrences, and poor prognosis [
29]. Meanwhile, SERPINC1 inhibits thrombin and other activated serine proteases of the coagulation system, thereby preventing embolism and indicating better prognosis [
30]. In addition,
SERPINC1 was reported to efficiently inhibit tumor angiogenesis in a mouse model [
31].
In summary, we identified eight gene coexpression clusters from a large-scale HCC dataset using WGCNA. We associated a number of these network modules to HCC clinicopathological variables, as well as to OS and TFS, and uncovered the gene expression signature associated with HCC survival. We identified several significant pathways and potential hub genes correlated with the HCC prognosis using gene enrichment analysis. However, the results should be verified in future studies with larger sample size in the clinical assessment.
Higher Education Press and Springer-Verlag Berlin Heidelberg