1 INTRODUCTION
Functional characterization of the long noncoding RNA (lncRNA) has attracted considerable attentions [
1–
4] due to their pivotal role in suppressing DNA synthesis [
5], transcriptionally/post-transcriptionally regulating RNAs [
6], and modulating the process of protein translation [
7]. A typical protocol for inferring the lncRNA function in a system is the guilt-by-association based on differential expression analysis [
8]. The weighted correlation network analysis (WGCNA) based on the guilt-by-association principle is a well-known type of correlation network [
9,
10], and often used in combination with differential expression analysis (WGCNA-DEA) for lncRNA function prediction [
11–
14]. With the recent discovery of lncRNAs as the key regulator of disease pathogenesis [
15–
18] and drug resistance [
19,
20], the WGCNA-DEA is anticipated to reveal lncRNA-disease associations [
21] and accelerate the experimental discovery of lncRNA function for a studied disease [
22]. Till now, this method has been successfully applied to describe the disease phenotype via analyzing co-expression network [
13], and identify some lncRNAs associated with disease metastasis [
23].
However, only 6,000 (~7%) of over 90,000 lncRNAs in human genome have been characterized as “disease-associated” by experiments [
21,
24], and the major problems leading to this lack of experimentally verified associations include: (1) the significant number of false positive function assignments [
25] and (2) the extensive genetic heterogeneity of many diseases [
26,
27]. On the one hand, the disease-associated lncRNAs tend to have great inter-individual expression variabilities [
8,
28], which is largely ignored by differential expression testing [
29]. In other words, WGCNA-DEA method can only capture the differential expression of lncRNAs at the population level [
29], which brings about a large number of false positive predictions [
25]. On the other hand, the disease-associated single-nucleotide polymorphisms (SNPs) can not only alter lncRNAs’ expression level but also modify their secondary structure [
30,
31]. Since structural variations are not taken into consideration by differential expression analysis, the application of WGCNA-DEA alone may not be capable of fully describing the genetic heterogeneity of complex diseases [
26,
27,
32].
To cope with these major problems, the algorithm for detecting condition-specific expression is proposed to be in place of differential expression testing for describing inter-individual expression variability [
8]. Particularly, this variability is assessed using the standard measure ‘coefficient of variation (CV)’ [
33]. A low value of CV denotes a lncRNA in normal cell (health control), while a high value represents disease-related lncRNA [
8,
33]. Moreover, due to the superiority of SNPs in representing disease heterogeneity [
34,
35], disease-associated SNPs have been integrated with lncRNA expression level for predicting the function of lncRNAs in heterogeneous diseases such as malignancy [
36] and diabetes [
34]. Thus, it is essential and of great interest to have a tool that is capable of characterizing the lncRNA functions by not only controlling the false discovery rate but also tolerating the disease heterogeneity [
8,
37–
39].
More importantly, due to its much wider prevalence, longer illness-course, and poorer treatment-response than single disorder, the comorbidity (a simultaneous or sequential presence of multiple indications in one patient) has emerged to be one of the most vital tasks in clinical medicine [
40–
42], for instance, psychiatric comorbidity [
43,
44]. The regulation by lncRNAs is critical to the pathogenesis of various comorbidities [
45], and is expected to be used to improve the efficacy of current therapies [
46]. Compared with the single disease, comorbidity is even more heterogeneous, and shows greater inter-individual variabilities in lncRNA expression [
47,
48]. Thus, it is also necessary to make a tool effective in characterizing the function of lncRNAs in any comorbidity of interest [
45,
46].
A variety of powerful online tools have been designed to facilitate the functional characterization of lncRNA [
8]. Some of the tools are dedicated as databases that offer the experimentally verified and/or computationally predicted lncRNA data [
49–
63]. The others are popular web-servers including: AnnoLnc2 [
64], Co-LncRNA [
65], FARNA [
25], Lnc-GFP [
66], LncRNA2Function [
67], LncRNAs2Pathways [
68], ncFANs [
69], and so on. The majority of the web-servers are based on the construction of co-expression network and/or the differential expression analysis, but neither integrate disease-associated SNP data nor detect condition-specific expression [
65–
69]. AnnoLnc2 is unique in systematically annotating the newly identified human lncRNA and provides disease-associated SNPs for the annotated lncRNAs [
64], but it is specifically for the lncRNAs associated with 39 cancer types [
64]. Moreover, none of the available online tools can effectively characterize the function of lncRNAs in comorbidity. It is thus urgently needed to construct online tools that can characterize lncRNA functions (via simultaneously integrating disease-associated SNP data and detecting the condition-specific expression of lncRNAs) for a significantly increased number of diseases and comorbidities. However, no such tool is yet available.
In this study, a novel web-server FCCLnc was therefore constructed. As illustrated in Fig. 1, the FCCLnc is unique in (1) integrating diverse SNP data that are identified to be associated with 193 diseases (standardized by the latest version of WHO International Classification of Diseases, ICD-11 [
70–
72]), (2) enabling the functional characterization of lncRNAs in all 193 diseases and various comorbidities (the combinations among multiple diseases from those 193 diseases), (3) reducing the false characterization through detecting the condition-specific expression of lncRNA, and (4) offering interactive visualization of the lncRNA-centered co-expression networks. In summary, FCCLnc is distinguished for its capacity of characterizing lncRNA functions for a significantly increased number of diseases and comorbidities, and is therefore expected to emerge to be an indispensable complement to other available tools. FCCLnc can be freely accessible (without login requirement) at: https://idrblab.org/fcclnc/.
2 RESULTS AND DISCUSSION
2.1 Web service provided by and operating procedure adopted in FCCLnc
To make the usage of FCCLnc convenient, the operating procedure implemented in this tool was provide via four sequential steps (illustrated in Fig. 1). STEP (1): matrices upload based on raw RNA-seq data (upload the expression matrices of lncRNAs & mRNAs); STEP (2): identification of disease-associated lncRNAs (through integrating SNP-disease association data, and detecting condition-specific expression); STEP (3): constructing a co-expression network among lncRNAs and mRNAs (by the guilt-by-association based on the neighboring genes of the studied lncRNAs); STEP (4): functional characterization of lncRNAs based on the newly constructed co-expression network (GO & KEGG enrichments were conducted to facilitate the functional annotation). The general workflow of FCCLnc integrating all four steps was illustrated in Fig. 1. Detailed user manual and website demo were systematically provided in the ‘Manual’ panel of FCCLnc.
The systematic analysis on the functions provided by FCCLnc could reveal its uniqueness. First, a direct upload of lncRNA & mRNA expression matrices was allowed in STEP (1), which made it possible to analyze clinical/experimental sequencing data. Such function is very crucial for the researchers working in clinical or precision medicine [
40–
42]. Second, FCCLnc stood out among available tools by not only integrating SNP-disease association data but also detecting condition-specific expressions of lncRNAs in its STEP (2). These novel features would made it capable of characterizing lncRNA function by simultaneously controlling FDR and tolerating disease heterogeneity [
8,
37]. Third, the characterization of lncRNA function in comorbidity was realized, for the first time, in FCCLnc by overlapping disease-associated genes among comorbid diseases. FCCLnc enabled the functional characterizations for a very wide range of diseases (193 diseases in total standardized by ICD-11), which included 19 infectious diseases, 36 cancers, 12 immune system disorders, 10 metabolic diseases, 40 neurodevelopmental disorders, 17 digestive system diseases, 15 circulatory system disorders, 10 respiratory system diseases, 4 genitourinary system disorders, 10 musculoskeletal system diseases, 4 developmental anomalies, and 16 other disorders of skin, eyes, ear or mastoid process. The combinations among multiple diseases from those 193 would result in the widest coverage of comorbidities so far. Finally, lncRNA-centered co-expression network was generated by FCCLnc for the interactive visualization and download. User can navigate to any step of interest through the “BACK” or “NEXT” button of each step. The resulting network for disease/comorbidity could be downloaded in the format of ‘.html’ (for visualization and text mining) and ‘.cys’ (supporting further network analysis in Cytoscape [
73]). Experimentally verified lncRNAs were also collected and integrated into FCCLnc for enhancing the visualization of the co-expression network.
2.2 Characterizing the function of lncRNAs in a particular disease by FCCLnc
In order to evaluate the capacity of FCCLnc in characterizing the function of lncRNAs in disease, six datasets were collected and shown in Table 1, which included: GSE106388 [
74], GSE112523 [
75], GSE128682 [
76], GSE129398 [
77], GSE131526 [
78], and TCGA-BC [
79]. These datasets were transcriptomic data of diverse diseases (including asthma, schizophrenia, ulcerative colitis, obesity, type-I diabetes and breast cancer, respectively). As shown in Fig. 2, the performances of FCCLnc and one of the most frequently utilized methods (WGCNA-DEA) [
8] were compared based on two key indexes. First, the percentage of successful prediction (PSP) was utilized to measure method’s success rate (%) in characterizing experimentally verified lncRNAs [
21]. As shown in Fig. 2A, the PSPs of FCCLnc varied (from 9.8% for TCGA-BC to 100% for GSE106388) and the PSPs of WGCNA-DEA also differed greatly (from 0% for four datasets to 14.3% for GSE112523). It was clear to see that, for all datasets, the PSPs of FCCLnc could significantly surpass those of WGCNA-DEA, which showed the good performances of FCCLnc on characterizing the experimentally verified lncRNAs. Second, the enrichment factor (EF) was further used to assess method’s ability to control the false characterization. As shown in Fig. 2B, the EFs of FCCLnc differed greatly (from 5.3 for TCGA-BC to 180.5 for GSE106388) and the EFs of WGCNA-DEA also varied (from 0.0 for four datasets to 3.0 for GSE112523). It was also obvious that, for all datasets, the EFs of FCCLnc were consistently better than those of WGCNA-DEA, which indicated the superior capacity of FCCLnc in controlling the false characterization of lncRNA function.
Based on a breast cancer dataset TCGA-BC collected from The Cancer Genome Atlas [
79], the performances of FCCLnc and WGCNA-DEA were further compared based on the enrichment analysis of KEGG pathways and GO terms. As illustrated in Fig. 3A, besides the chord diagram on the left side, two additional pie charts were drawn to illustrate the percentages of pathways (enriched based on FCCLnc (upper) and WGCNA-DEA (lower)) that were further confirmed by literature search to be closely related to breast cancer. As shown, 95% of the pathways enriched (
p-value<0.05) based on FCCLnc were also confirmed by previously reported literatures (the detail descriptions were provided in Supplementary Table S1), and 60% of the enriched pathways (
p-value<0.05) were identified by FCCLnc only. When it comes to WGCNA-DEA, the percentage confirmed by previous reports decreased to 79%, and only 55% of the enriched pathways were solely identified by WGCNA-DEA. Like pathways, the results of GO terms enrichment were illustrated in Fig. 3B. 83% of the GO terms enriched (adjusted
p-value<0.05) based on FCCLnc were confirmed by previously reported literatures (the detail descriptions were provided in Supplementary Table S2) and 78% of those enriched GO terms (adjusted
p-value<0.05) were found by FCCLnc only. For WGCNA-DEA, its percentage confirmed by previous reports was only 66%. All in all, these results further validated the good performance of FCCLnc in lncRNA functional annotation.
2.3 Characterizing the function of lncRNAs in certain comorbidity by FCCLnc
LncRNA regulation is essential to the pathogenesis of various comorbidities [
45]. Therefore, the ability of FCCLnc to characterize the function of lncRNAs in certain comorbidity was evaluated. Particularly, two datasets were collected (as shown in Table 1), which included GSE133099 [
76] (containing patients of type-2 diabetes and obesity), and GSE78936 [
80] (containing patients of schizophrenia and bipolar disorder). Comorbidity-associated lncRNAs were therefore identified by finding the genetic factors (“common disease genes”) shared by the comorbid diseases [
81,
82]. Figure 4 showed the lncRNAs and mRNAs co-expression networks in obesity patients comorbid with type-2 diabetes (identified using GSE133099 [
76]). As shown, two lncRNAs (DLEU1 and CCNT2-AS1) and their eighteen co-expressed mRNAs were found to be common disease genes. Recent study has revealed the essential roles of DLEU1 in thyroid hormone synthesis [
83]. Since thyroid hormones are strongly associated with both obesity [
84–
86] and diabetes [
87,
88], DLEU1 may be considered to be a critical regulator of this comorbidity. Meanwhile, although there were few studies on CCNT2-AS1, its co-expressed mRNA CCNT2 (Fig. 4) had been reported to possess genetic variations that were significantly associated with obesity and diabetes [
89].
Similarly, Supplementary Fig. S1 showed the co-expression network between lncRNAs and mRNAs in the schizophrenia patients comorbid with bipolar disorder (constructed using GSE78936 [
80]). As shown, three lncRNAs (LINC00243, MIR3681HG and TSBP1-AS1) and seven co-expressed mRNAs were discovered to be common disease genes. Although there were few reports on those identified lncRNAs, some of their co-expressed mRNAs (Supplementary Fig. S1) have been reported to be strongly associated with both comorbid diseases. For example, the mRNA ATAT1, co-expressed with LINC00243, was found to be extensively associated with the pathogenesis of schizophrenia [
90] and bipolar disorder [
91]; the ROCK2, co-expressed with MIR3681HG, was reported to play an important role in both schizophrenia [
92] and bipolar disorder [
93]. All in all, it is feasible to use FCCLnc to characterize the function of lncRNAs in certain comorbidity.
3 CONCLUSIONS
FCCLnc is distinguished for its capacity to characterize the lncRNA functions for a significantly increased number of diseases and comorbidities, and it is expected to emerge as an indispensable complement to other available tools. With the rapid accumulation of next generation sequencing data, FCCLnc and other powerful tools could collectively promote various life science researches, including pathological study, precision medicine, drug/target discovery, disease association, and biomarker identification.
4 MATERIALS AND METHODS
4.1 Discovering the potentially disease-associated lncRNAs by SNP-disease associations
The majority of disease-associated SNPs located in lncRNAs modify the secondary structures or influence expression levels, thereby affecting their regulatory function, hence contributing to the development of disease [
30]. In this study, SNP-disease association data were therefore collected to facilitate the identification of potentially disease-associated lncRNAs. First, the SNP-disease associations were collected from three well-known sources: GWASdb [
94], NHGRI-EBI GWAS Catalog [
95] and GRASP2 [
96]. This led to 17,842 SNPs located in the lncRNAs associated with (
p-value<0.001) 156 diseases (standardized by the latest version of ‘International Classification of Diseases’ provided by the World Health Organization, ICD-11). Moreover, a literature review of over 350 recent publications yielded 4,616 additional SNPs located in the lncRNAs associated with 72 standardized diseases. All in all, 24,339 associations between 193 standardized diseases and 22,458 SNPs were collected for subsequent analysis. Second, the chromosome information of lncRNAs was downloaded from the NONCODEV5 (human reference genome hg38) [
53] to match the disease-associated SNPs to the lncRNA region. Finally, 10,936 lncRNAs (with at least one disease-associated SNP) were identified to be “potentially disease-associated”.
4.2 Detecting the inter-individual variability of lncRNA by condition-specific expression
The lncRNAs transcribed in a certain disease indication tend to show high expression variability, and the algorithm for detecting condition-specific expression is therefore proposed to be in place of differential expression testing for describing such inter-individual expression variability [
8]. The variability could be assessed using the standard measure ‘coefficient of variation (CV)’ [
33,
97,
98]. A low value of CV denotes a lncRNA in normal cell (healthy individual), while a high value represents the disease-related lncRNA [
8,
33]. Herein, the CV is first defined as the ratio between the standard deviation of the lncRNA expression levels measured across the patients and its mean [
33]. Using those “potentially disease-associated” lncRNAs identified in previous section, their CV values were then calculated and ranked. Finally, top-N ranked lncRNAs (N= 100, 200, 300, 400 or 500 depending on user’s preference) were selected and identified as “disease-associated” ones for the subsequent functional characterization.
4.3 Constructing the co-expression network based on lncRNAs’ neighboring genes
Co-expressed genes were known as more likely co-regulated and functionally associated, which made the identification of the co-expressed neighboring protein-coding gene helpful in assigning lncRNA function [
8,
69,
99,
100]. To construct the co-expression network between disease-associated lncRNAs and their neighboring genes, the WGCNA [
9] based on the guilt-by-association principle was used to compute co-expression network of lncRNAs with their neighboring genes. Herein, the comprehensive data of 96,308 lncRNAs and 19,975 protein coding genes were first collected from NONCODEV5 (human reference genome hg38) [
53] and GENCODEV31 (human reference genome hg38) [
51], respectively. Then, the neighboring genes within 5 kb [
101], 10 kb [
102], 20 kb [
103], 50 kb [
104], 70 kb [
105], 100 kb [
106], 200 kb [
107], 300 kb [
108], 400 kb [
109] and 500 kb [
110] up/downstream of the studied lncRNAs were calculated based on the collected location information, which resulted in a collection of neighboring genes of the studied disease-associated lncRNAs. WGCNA [
9] was used to compute a co-expression network based on the studied lncRNAs and their neighboring genes. The resulting co-expression network was illustrated in FCCLnc, and it could be downloaded in the formats of ‘.html’ (for visualization analysis) and ‘.cys’ (supporting the network analysis in Cytoscape [
73]). In the meantime, ~6,000 experimentally verified lncRNAs were collected from LncRNADisease [
21] and LncRNA2Target [
50]. Such data were integrated into FCCLnc for enhancing the visualization of the co-expression network by highlighting the lncRNAs with experimental verification information.
4.4 Annotating the lncRNA function based on gene ontology and KEGG pathway
The Gene Ontology (GO) [
111] and Kyoto Encyclopedia of Genes and Genome (KEGG) pathway [
112,
113] were widely adopted to characterize the function of disease-associated lncRNAs. Herein, the GO annotations (containing biological processes, molecular functions, and cellular components) as well as KEGG pathways of proteins-coding gene were downloaded from the gene set enrichment analysis (GSEA) database [
114]. Then, GO and KEGG enrichment analysis were performed using the mRNAs identified by FCCLnc. In particular, the statistical significance of GO terms and KEGG pathways enrichments were evaluated by hypergeometric test with
p-value less than 0.05 [
115]. Finally, a chord diagram illustrating the enrichment results was displayed directly in FCCLnc online server.
4.5 Characterizing the lncRNA function in comorbidity using common disease genes
The mechanism of lncRNAs in the comorbid diseases were frequently explained by their shared genetic factors (namely “common disease genes”) [
81,
82]. In other words, direct overlap of the disease-associated genes among comorbid disease was found to be one of the critical factors for explaining the corresponding comorbidity [
81,
82]. Thus, the upload of RNA expression matrices that containing the data of multiple diseases was allowed in FCCLnc. First, the RNA expression data of each disease were analyzed using the sequential steps discussed in the first three sections of Materials and Methods, which resulted in multiple co-expression networks (each contained a great many of lncRNAs & mRNAs). Then, a direct overlap of disease-associated RNAs among multiple diseases was conducted, which identified a set of common disease genes (RNAs). Third, multiple co-expression networks were linked together based on this set of common disease genes, and the resulting network is the co-expression network of the corresponding comorbidity. Finally, the common disease lncRNAs were therefore characterized as comorbidity-associated, and their function was further annotated by their co-expressed mRNAs. The enrichment results of multiple co-expression networks could also be overlapped to facilitate the functional annotation.
4.6 Evaluating the ability of FCCLnc to characterize the function of lncRNA in disease
Two indexes were adopted here to evaluate the ability of FCCLnc to characterize the function of disease-associated lncRNAs, both of which were based on the experimentally validated disease-associated lncRNAs. These indexes included: (1) PSP, and (2) EF. Particularly, the experimentally verified disease-associated lncRNAs were collected directly from LncRNADisease [
21], which provided a number of “true” lncRNAs for each of those 193 diseases, and the PSPs (%) of FCCLnc in characterizing these true lncRNAs were used as the first index for evaluating the performances. Moreover, the EF was adopted here to denote the concentration of the experimentally verified lncRNAs among the prediction results of FCCLnc compared to their concentration throughout the entire lncRNAs in expression matrix, which was known to be effective in assessing false discovery by fully considering the real-world true lncRNAs [
115]. EF could be represented by:
where N
FCCLnc denoted the number of lncRNAs characterized by FCCLnc as ‘disease-associated’; N
trueFCCLnc represented the number of ‘true’ lncRNAs successfully characterized by the FCCLnc as ‘disease-associated’; N
all was the total number of lncRNAs in the expression matrix uploaded by users; and N
true indicated the number of ‘true’ lncRNAs in the uploaded expression matrix based on the LncRNADisease data [
21]. EF value is no less than zero. Only when an EF is larger than 1, there is an enrichment. The larger the EF, the lower the FDR.
The weighted correlation network analysis based on differential expression lncRNA (WGCNA-DEA) frequently utilized method for inferring the lncRNA functions [
12,
13] was assessed here to compare with FCCLnc. Based on RNA-seq data of control-case studies, WGCNA-DEA was implemented based on two procedures. First, differential expression analysis was conducted for finding the differential lncRNA by R package DESeq2 [
116]. Second, the weighted correlation network between differential lncRNA and protein-coding neighboring genes was built.
4.7 Required data formats of FCCLnc input files and server implementation details
There were two input files for FCCLnc analysis that provided the expression matrices of lncRNAs and mRNAs, respectively. The required formats of both input files were almost the same, which should provide two RNA-by-sample matrices in csv format. Particularly, the sample ID and class of samples are sequentially provided in the first two rows of input file. The sample ID is uniquely assigned according to users’ preferences. Title of the second row must be kept to “label” without change during the analysis, and class of samples indicates the sample groups. For characterizing the lncRNA function in single disease, there should be only two types of group names in each of those two input files (using the exact word ‘control’ to indicate healthy individuals, and denoting all disease samples by disease name. For example, users could use ‘diabetes’ or ‘type-II diabetes’ to label the samples of type-II diabetes patients). For characterization in comorbidity, all samples should be labeled by their disease name. For instance, when studying the anxiety comorbidity in schizophrenia, user could use ‘Anxiety’ and ‘SCZ’ to denote the class of samples in the uploaded expression matrices. The sample ID and class of samples must be identical in both matrices. The unique RNA IDs (NONCODE ID and Ensemble Gene ID for lncRNA and mRNA, respectively) should be provided. The expression matrix could be reads counts, transcripts per kilobase million (TPM), reads per kilobase million (RPKM) or fragments per kilobase million (FPKM). Example files can be directly downloaded from the “Analysis” panel of FCCLnc.
FCCLnc website is deployed on a server running Cent OS Linux v7.0 operating system, Apache Tomcat servlet container and Apache HTTP web server v2.2.15. Its interface was developed by R v3.6.2 and R package Shiny v0.13.1 running on Shiny-server v1.4.1.759 [
117,
118]. A variety of packages in R package were utilized in the background processes [
119], which included SpeCond, htmlTable, visNetwork, networkD3, htmltools, WGCNA, shinythemes, shiny, shinyjs, shinyBS, and shinydashboard. The popular browsers such as Google Chrome, Mozilla Firefox, Safari, and Internet Explorer (11 or later) were compatible with the official website of FCCLnc (without login requirement).
4.8 The sample datasets collected for validating the case studies in this work
To test the utility of FCCLnc, the expression matrices of both lncRNA and mRNA were collected from Gene Expression Omnibus (GEO) [
120] and The Cancer Genome Atlas (TCGA) [
79]. There were eight benchmark datasets in total collected for analyzing case study. As shown in Table 1, the first six datasets were used for the case-control studies [
121] for six single disease, and the last two datasets contained the patients of two different diseases (utilized for the case study of the corresponding comorbidity). Sample details in each benchmark dataset were shown in Table 1, which included the detail of expression unit and the number of lncRNAs and mRNAs.
The Author(s) 2021. Published by Higher Education Press