Introduction
One of the major goals in cancer research is the identification of altered genes that play a causal role in cancer development/progression. Many gene fusions have been recognized as important diagnostic and/or prognostic markers in human malignancies. The identification of specific gene fusions in common solid tumors, such as
TMPRSS2-ERG in prostate cancer and
EML4-ALK in lung cancer, signaled a new view of solid tumor pathogenesis and further heightened the interest in this type of genetic alteration in cancer [
1,
2]. From the first success of imatinib in treating BCR-ABL1-driven chronic myeloid leukemia to the more recent success of crizotinib in treating ALK-rearranged non-small cell lung cancer (NSCLC), the identification of recurrent fusion genes in human cancer has had considerable therapeutic significance [
3-
7].
Historically, the identification of fusion genes has been largely dependent on the detection of structural chromosomal abnormalities through conventional cytogenetic analysis. Compared to hematological disease, technical and analytical problems that limit karyotype analysis commonly arise in solid tumors. For instance, solid tumor culture often fails or, even if metaphases can be obtained from the culture, the chromosome morphology is often poor and the karyotype is usually too complex to be analyzed completely. Therefore, most gene fusions (75%) known so far have been identified in hematological diseases, whereas only 10% of known recurrent gene fusions have been identified in carcinomas [
8,
9].
Aside from the limitations of conventional cytogenetic analysis in solid tumors, traditional cytogenetic techniques used to identify genes involved in genetic translocations are labor-intensive. Traditional sequencing-based approaches, such as cloning of genomic DNA into bacterial artificial chromosomes or fosmids followed by end-sequencing or massively parallel cDNA sequencing, can complement karyotype analysis by providing a relatively high-throughput analysis across a given genome and can facilitate fusion gene identification. However, such approaches are not cost-effective for the broad screening of cancer samples [
10-
12]. In recent years, novel gene fusions have been identified based on the detection of cryptic genomic aberrations using new genome-wide screening approaches accompanied by powerful computational data analysis methods. This review focuses on advanced genome-wide screening approaches in fusion gene identification, including microarray-based approaches, next-generation sequencing (NGS), and NanoString technology. The fundamental rationale and strategy for gene fusion identification using each biotech platform are also discussed. Regardless of the technical platform used to study a given tumor genome or transcriptome, an appropriate bioinformatic data analysis algorithm is essential to success.
Array-based technical platforms in the identification of gene fusions
Two key aspects of cancer gene fusions are relevant to the array-based strategies in this application. First, cancer gene fusions that encode a chimeric protein are often characterized by an intragenic discontinuity in the RNA expression level of the exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners due to differences in the level of activation of their respective promoters. This feature of cancer gene fusion is fundamental to the identification of gene fusions based on gene-expression profiling. Second, cancer gene fusions are commonly associated with intragenic changes in DNA copy number. Therefore, array-based DNA copy number analysis can reveal unbalanced gene fusions.
Gene-expression profiling in the identification of gene fusions
Tomlins
et al. [
1] developed the cancer outlier profile analysis (COPA) algorithm. This bioinformatic tool seeks to accentuate and identify outlier profiles by applying a simple numerical transformation based on the median and median absolute deviation of a gene expression profile. Outlier genes are those with very high expression in a subset of samples in the microarray data. The authors focused their analysis on the outlier profiles of known causal cancer genes, as defined by the Cancer Gene Census. In six independent prostate cancer profiling studies, COPA ranked
ERG and
ETV1, both of which encode ETS family transcription factors, within the top 10 outlier genes. To determine the mechanism responsible for
ERG and
ETV1 overexpression, they measured the expression of
ETV1 exons by exon-walking quantitative RT-PCR (Q-RT-PCR) in samples that displayed
ETV1 overexpression. Based on the intragenic discontinuity in the RNA expression level of exons revealed by exon-walking Q-RT-PCR, they performed 5′ RNA ligase-mediated rapid amplification of cDNA ends (RLM-RACE). Sequencing of the RLM-RACE products revealed the presence of
TMPRSS2-ETV1 fusions and the fusion structure was consistent with the exon-walking Q-RT-PCR results. 5′ RLM-RACE was also performed in samples with
ERG overexpression and identified a
TMPRSS2-ERG fusion transcript. In addition, fluorescence
in situ hybridization (FISH) analysis on an independent cohort of 29 cases demonstrated that 79% had a
TMPRSS2-ETV1 fusion or
ERG rearrangement. These results were further confirmed by several other groups [
13-
17]. Following the groundbreaking identification of
TMPRSS2-ETS fusion, a few other ETS family member fusions have been identified during the past few years [
18-
20]. Several independent studies reported that most cases of prostate cancer currently detectable by prostate-specific antigen screening harbor either the common
TMPRSS2-ERG fusion (~50%) or one of the less common fusions involving
TMPRSS2 or other 5′ partners (5% to 10%) [
21]. Considering the high incidence of prostate cancer, the
TMPRSS2 fusion with ETS family members is the first specific gene fusion identified in a common human carcinoma. The identification of this type of recurrent gene fusion in prostate cancer has important implications for understanding prostate carcinogenesis and developing new targeted therapies. One of the important clinical implications is that the
TMPRSS2-ERG fusion transcript can be detected in urine and represents a highly specific prostate cancer biomarker [
22-
25].
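The transformation behind COPA, as described above, can be illustrated with a minimal sketch. It assumes that each gene is median-centered, scaled by its median absolute deviation (MAD), and then ranked by a high percentile of the transformed values. The function names, the default 90th percentile, and the toy gene labels are assumptions made for illustration and are not taken from the original publication.

```python
from statistics import median


def copa_transform(values):
    """Median-center one gene's expression values and scale them by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the gene shows no spread across samples")
    return [(v - med) / mad for v in values]


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list (q in 0..100)."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def copa_rank(expression, q=90):
    """Rank genes by the q-th percentile of their transformed values.

    `expression` maps a gene label to a list of per-sample expression values.
    The percentile q is an assumed tuning parameter.
    """
    scores = {
        gene: percentile(sorted(copa_transform(vals)), q)
        for gene, vals in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: GENE_A is strongly overexpressed in two of ten samples.
toy = {
    "GENE_A": [1.0, 1.2, 0.9, 1.1, 1.0, 8.0, 9.0, 1.05, 0.95, 1.0],
    "GENE_B": [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.3, 0.7],
}
ranking = copa_rank(toy)
```

Because the median and MAD are insensitive to a minority of extreme samples, a gene that is overexpressed in only a subset of cases keeps a large transformed value in those cases and rises to the top of the ranking.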
Exon-array based screening for gene fusions
Compared to a 3′-biased gene-expression array, the gene-expression exon array can reveal more directly a given intragenic discontinuity in the RNA expression level of exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners. The Affymetrix GeneChip® Human Exon 1.0 ST Array, which is the exon array used in our study, consists of 1.4 million probe sets targeting more than one million exon clusters across the entire genome. This array design enables two complementary levels of analysis: gene-level and exon-level expression analyses. The latter allows different isoforms of gene transcripts to be distinguished. The fundamental rationale for exon-level expression profiling in the detection of fusion genes is based on the observation that most gene fusions that lead to the formation of a chimeric fusion protein cause an intragenic discontinuity in the RNA expression level of the exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners. This is attributable to the differences in strength or activity of the promoters of the two translocation partner genes (Fig. 1). Furthermore, in some cases, the non-oncogenic, reciprocal fusion gene may be lost due to an unbalanced translocation event [
26].
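The idea of an intragenic discontinuity can be made concrete with a small sketch that scans the ordered exon-level signals of one gene for the split point with the largest difference in mean expression between the 5′ and 3′ exons. This is an illustrative heuristic only, not the scoring model of any study cited here; the function name, the `min_exons` parameter, and the toy values are assumptions.

```python
from statistics import mean


def best_breakpoint(exon_levels, min_exons=2):
    """Find the split with the largest 5'/3' shift in mean exon expression.

    `exon_levels` lists exon-level signals (for example, log2 intensities)
    in 5' to 3' order. Returns (n_five_prime_exons, shift), where shift is
    the mean of the 3' exons minus the mean of the 5' exons, or None when
    the gene has too few exons to place `min_exons` on each side.
    """
    best = None
    for i in range(min_exons, len(exon_levels) - min_exons + 1):
        shift = mean(exon_levels[i:]) - mean(exon_levels[:i])
        if best is None or abs(shift) > abs(best[1]):
            best = (i, shift)
    return best


# Toy profile: exons 1-4 are weakly expressed and exons 5-8 strongly
# expressed, as expected for a 3' fusion partner driven by a more active
# promoter.
levels = [2.0, 2.1, 1.9, 2.0, 7.9, 8.1, 8.0, 8.0]
split = best_breakpoint(levels)
```

A real analysis would also need to account for probe-level noise and for alternative splicing, which can produce similar exon-level patterns without any fusion.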
A few studies have shown that Affymetrix exon arrays can be used to detect known gene fusions. Jhavar
et al. [
27] reported the detection of the
TMPRSS2-ERG fusion in prostate cancer, and Lin
et al. [
28] demonstrated the detection of the
EML4-ALK fusion in breast cancer, colorectal cancer, and NSCLC. Moreover, we reported the successful use of exon array data in genome-wide screens for gene fusions without prior knowledge of the genetic background of a given case and described the identification of a novel, recurrent
HEY1-NCOA2 gene fusion in mesenchymal chondrosarcoma and another novel
NUP107-LGR5 fusion in a dedifferentiated liposarcoma case [
26]. In our study, we developed an unbiased data analysis pipeline called the “Fusion Score model” to score and rank genes for intragenic changes in expression. The transcription factor gene
NCOA2 was one of the candidates identified in a mesenchymal chondrosarcoma that showed a significant intragenic change in expression. A novel
HEY1-NCOA2 fusion was identified by 5′ RACE, which represented an in-frame fusion of
HEY1 exon 4 to
NCOA2 exon 13. RT-PCR and/or FISH evidence of this
HEY1-NCOA2 fusion was present in all additional mesenchymal chondrosarcomas tested with a definitive histologic diagnosis and adequate material for analysis but was absent in 15 samples of other subtypes of chondrosarcomas. The novel
HEY1-NCOA2 fusion appears to be the defining and diagnostic gene fusion in mesenchymal chondrosarcomas. Using the same approach, we also identified a
NUP107-LGR5 fusion in a dedifferentiated liposarcoma case.
Recently, Li
et al. [
29] have demonstrated the successful application of this approach in the detection of RET gene fusion in lung cancer. For all these successful applications of exon array in fusion gene identification, the same fundamental strategy was followed. The most challenging part of the workflow is filtering the candidate gene list that was generated by a computational data analysis algorithm and spotting the gene most likely to be involved in a functional fusion event. The candidate list generated by the computational data analysis algorithm usually consists of hundreds of genes, which are then narrowed down using different strategies to determine the most promising candidates. For instance, to identify the
HEY1-NCOA2 fusion, the initial list of candidate genes was cross-referenced to a list of genes previously known to be involved in cancer gene fusions or belonging to the same gene families. Genes common to both lists were selected for further investigation, which involved reviewing all the alternative transcripts for each gene in the UCSC database (http://genome.ucsc.edu/). This review was performed to exclude candidates in which the discontinuity in exon-level expression was likely due to alternative splicing. Finally, additional steps were taken to further prioritize candidate genes based on biological plausibility, including mapping the predicted breakpoint of each gene to its protein domains, because an important functional domain should be preserved in a fusion protein. As a result, only a few genes were highly prioritized and selected for experimental validation. Different studies may focus on different genes of interest, but the principle of prioritizing candidate genes based on biological plausibility is generally accepted. The same principle is also applicable to the analysis of data generated by whole-transcriptome sequencing.
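The three filtering steps described above (cross-referencing against known fusion genes, excluding discontinuities explained by alternative splicing, and requiring that a functional domain be preserved) can be sketched as a simple filter. The record layout, field names, and gene labels below are hypothetical and serve only to show the order of the checks; in practice the splicing review is a manual step.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    breakpoint_aa: int           # predicted breakpoint, protein coordinates
    domain_start_aa: int         # start of the key functional domain
    domain_end_aa: int           # end of the key functional domain
    retained_side: str           # "5prime" or "3prime" part kept in the fusion
    explained_by_splicing: bool  # outcome of the transcript review


def domain_preserved(c):
    """True if the retained part of the protein contains the whole domain."""
    if c.retained_side == "3prime":
        return c.breakpoint_aa <= c.domain_start_aa
    return c.breakpoint_aa >= c.domain_end_aa


def prioritize(candidates, known_fusion_genes):
    """Apply the three filters in the order described in the text."""
    return [
        c.gene
        for c in candidates
        if c.gene in known_fusion_genes
        and not c.explained_by_splicing
        and domain_preserved(c)
    ]


# Hypothetical candidates; only GENE_A passes all three filters.
candidates = [
    Candidate("GENE_A", 300, 400, 650, "3prime", False),
    Candidate("GENE_B", 500, 400, 650, "3prime", False),  # domain disrupted
    Candidate("GENE_C", 300, 400, 650, "3prime", True),   # splicing artifact
    Candidate("GENE_D", 300, 400, 650, "3prime", False),  # not a known gene
]
shortlist = prioritize(candidates, {"GENE_A", "GENE_B", "GENE_C"})
```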
Intragenic DNA copy number change in the identification of gene fusions
Gene fusions arising from intrachromosomal deletions can be directly detected by high-resolution array-based tests. In addition, intragenic changes in genomic copy number are frequently found in other fusion subtypes because of the acquisition of additional copies of the translocated chromosome or the loss of the reciprocal (non-functional) derivative chromosome. Therefore, array-based DNA copy number analysis is another method that can be used to screen for gene fusions associated with unbalanced genomic aberrations flanking the fusion points. This hypothesis was verified by a previous study that identified
KDR-PDGFRA in a glioblastoma patient sample [
30]. In this study, a high-resolution array-based comparative genomic hybridization (aCGH) platform, designed to densely cover 89 tyrosine kinase (TK) genes, was used to screen for potential rearrangements involving TK genes in glioma samples. Cases showing intragenic copy number variation between the 5′ and 3′ ends of the TK genes were studied further using a second, custom-designed, high-density aCGH array to fine map the intragenic breakpoints. The aCGH screening identified a complex amplicon on chromosome 4q12. Intragenic copy number changes were found in both
PDGFRA and
KDR genes, which are localized on 4q12 and are transcribed in opposite directions. Interestingly, the DNA copy number profile suggested the amplification of DNA segments preserving the 5′ end of
KDR and the 3′ portion of
PDGFRA, including its kinase domain. Intragenic breakpoints of both genes were successfully narrowed down into very small regions on genomic DNA by the fine-mapping array. Based on the pattern of copy number changes across this region, we hypothesized that an intrachromosomal rearrangement may have resulted in a gene fusion between
KDR and
PDGFRA. All possible exon combinations between these two genes were reviewed, and those that would produce out-of-frame fusion transcripts were eliminated. Then, a panel of RT-PCR assays was designed to identify the potential fusion transcript in the RNA extracted from this tumor tissue; these assays confirmed the presence of an in-frame fusion between the
KDR and
PDGFRA genes. Functional studies demonstrated that the KDR-PDGFRA fusion protein had constitutively elevated TK activity and transforming potential. However, in practice, intragenic copy number changes might not always be detected in both fusion partners. In such cases, an appropriate RACE approach can be used to identify the other fusion partner at the transcript level. The RACE products are then confirmed by RT-PCR, and only in-frame fusions are considered for further functional investigation.
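The step of eliminating exon combinations that would produce out-of-frame transcripts can be sketched as a reading-frame check. The sketch assumes that the coding length of each exon is known for both partners: a junction between the end of exon i of the 5′ partner and the start of exon j of the 3′ partner is in frame when the number of coding bases contributed by the 5′ partner and the number of coding bases that normally precede exon j leave the same remainder modulo 3. The function name and the exon lengths are hypothetical.

```python
from itertools import accumulate


def in_frame_junctions(five_cds_lengths, three_cds_lengths):
    """List (5' exon, 3' exon) pairs, 1-based, that keep the reading frame.

    Each argument lists the number of coding bases per exon, in transcript
    order, for the 5' and 3' partner genes. Untranslated bases must be
    excluded before calling this function.
    """
    # Frame reached after the last base of each 5' partner exon.
    end_phase = [n % 3 for n in accumulate(five_cds_lengths)]
    # Frame in which each 3' partner exon starts in its native transcript.
    preceding = [0] + list(three_cds_lengths)[:-1]
    start_phase = [n % 3 for n in accumulate(preceding)]
    return [
        (i + 1, j + 1)
        for i, p_end in enumerate(end_phase)
        for j, p_start in enumerate(start_phase)
        if p_end == p_start
    ]


# Hypothetical coding lengths for a three-exon 5' partner and 3' partner.
pairs = in_frame_junctions([100, 50, 30], [90, 61, 40])
```

Only the surviving pairs would need RT-PCR assays, which keeps the validation panel small when both genes have many exons.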
High-resolution array-based copy number analysis has become a well-developed test platform not only in research but also in clinical diagnosis. In addition, array vendors usually provide reliable and easy-to-use software for data analysis. In contrast, fusion gene screening by gene-expression/exon-expression arrays is based on information from RNA transcripts, which is much more complicated to interpret than DNA copy number for several reasons: for example, the baseline expression of a given gene may vary from tissue to tissue, and the observed variation in exon expression may result from alternative splicing. Therefore, bioinformaticians usually have to work very closely with geneticists to conduct a proper data analysis.
Next-generation sequencing further advances the identification of gene fusions
Copy number arrays can reveal breakpoints of fusion genes associated with genomic imbalances; however, gene fusions that result from balanced rearrangements cannot be detected. Next-generation sequencing, a new generation of non-Sanger-based sequencing technologies, is a cost-effective technique that has transformed genetics and genomics. This method enables high-throughput and highly automated sequencing of large stretches of DNA. It has various applications in research and clinical diagnosis [
31].
With the same fundamental sequencing platform, applications of NGS are largely dependent on sample resources, the way sequencing libraries are prepared, and the way NGS data are analyzed. A comparison between RNA-Seq and whole-genome sequencing in gene fusion screening is summarized in Table 1. As shown in the table, RNA-Seq is more suitable than whole-genome sequencing for fusion event screening, particularly for fusion genes encoding chimeric proteins [
32].
With regard to the discovery of gene fusions, the paired-end (PE) sequencing approach was formerly considered superior to the single-end (SE) approach. Whereas the SE approach detects gene fusions through reads spanning the fusion junction, the PE approach can detect a fusion chimera if a mate pair spans the fusion junction or if the mate pairs encompass the fusion junction [
33]. Both genome-wide massively parallel PE sequencing and PE transcriptome sequencing have been successfully employed in screening for fusion genes [
33-
36]. In addition to mate pairs that span or encompass the fusion junction, further evidence for the presence of a fusion transcript may be obtained from RNA-Seq because of its quantitative nature. Based on the evaluation of the signal track of each transcript, the intragenic discontinuity in the RNA expression level of the exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners can also be observed in RNA-Seq, because the number of reads that map to a given exon of a fusion partner is analogous to the expression intensity measured on an exon-expression array (Fig. 1). In fact, paired-end RNA-Seq has been extensively used for fusion gene screening. Easy-to-use protocols for paired-end transcriptome sequencing can be obtained from NGS vendors. However, recent advances in NGS platforms have resulted in dramatic increases in read length, indicating that gene fusion boundaries may be adequately represented by SE reads. With the continuous improvement of NGS techniques, fusion detection based on SE junction-spanning reads may become more efficient than the paired-end approach because of its cost-effectiveness.
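The two kinds of paired-end evidence mentioned above can be illustrated with a toy classifier. The sketch assumes that an upstream aligner has already assigned each aligned segment of each mate to a gene. A pair is called "spanning" when a single mate has segments in more than one gene, so the read itself crosses the junction, and "encompassing" when the two mates map entirely to different genes. The input representation and function names are assumptions; dedicated fusion callers apply many additional filters.

```python
from collections import Counter


def classify_pair(mate1_genes, mate2_genes):
    """Classify one read pair from the genes hit by each mate's segments."""
    m1, m2 = set(mate1_genes), set(mate2_genes)
    if len(m1) > 1 or len(m2) > 1:
        return "spanning"      # one mate crosses the fusion junction
    if m1 and m2 and m1 != m2:
        return "encompassing"  # the junction lies between the two mates
    return "concordant"        # both mates support the same gene


def count_support(pairs):
    """Tally the evidence classes over an iterable of (mate1, mate2) tuples."""
    return Counter(classify_pair(m1, m2) for m1, m2 in pairs)


# Toy alignments for a TMPRSS2-ERG candidate.
toy_pairs = [
    (["TMPRSS2"], ["ERG"]),             # encompassing
    (["TMPRSS2", "ERG"], ["ERG"]),      # spanning (split first mate)
    (["TMPRSS2"], ["TMPRSS2", "ERG"]),  # spanning (split second mate)
    (["ERG"], ["ERG"]),                 # concordant
]
support = count_support(toy_pairs)
```

With single-end data only the "spanning" class is available, which is why longer reads make the single-end approach more practical.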
Along with library preparation, data analysis is an important factor in ensuring successful sequencing applications. In comparison to the automated, easy-to-use protocols for generating sequencing data, NGS data analysis is relatively time-consuming. Analysis software that can rapidly interrogate such large amounts of data is needed to reveal the patterns therein. Many algorithms exist to address the needs of each application. Some of these algorithms are commercially available from software vendors, whereas many others were created by academic institutions and are freely available through the published literature. Several computational approaches that have been developed in recent years for the detection of gene fusion events are summarized in Table 2 [
37-
44]. Although these computational methods showed good sensitivity in discovering chimeric transcript events, further work is necessary to reduce the huge number of false-positive results that arise from this type of analysis. As NGS technologies become more affordable, the number of studies using this technology to discover fusion transcripts will increase.
NanoString nCounter gene expression system in the identification of gene fusions
Aside from the NGS-based whole genome/transcriptome screening for gene fusions, NanoString technology is also a new platform that can be used for gene fusion screening. Geiss
et al. (2008) described a non-amplification, hybridization and fluorescence detection-based platform, namely, NanoString nCounter gene expression system, to detect individual mRNA molecules [
45]. In brief, for each 70 bp to 100 bp region of a given gene transcript, a pair of sequence-specific probes (each containing a 35- to 50-base sequence) is designed. The first probe, a capture probe, consists of a sequence complementary to the target transcript and a short common sequence for signal capture. The second probe, a reporter probe, contains another target-specific sequence, which is immediately adjacent to that in the first probe, and a color (fluorescence)-code tag, which is a unique code created for the given gene transcript. Probes targeting numerous regions of interest in the transcriptome can be combined and hybridized to the total RNA of a test sample to study the expression levels of a set of genes and/or exons in the sample. When hybridized correctly to their target to form a complex, the pair of capture and reporter probes can be immobilized and visualized under a fluorescence microscope. The level of expression is measured by counting the number of codes for each target. Because of technical limitations, at most 800 regions of interest can be studied with one probe mix in one hybridization.
The fundamental rationale for using the NanoString nCounter expression system is the same as that discussed in the exon array screening for gene fusions (Fig. 1). Given that most gene fusions leading to the formation of a chimeric fusion protein cause an intragenic discontinuity in the RNA expression level of exons that are 5′ or 3′ to the fusion point in one or both of the fusion partners, we developed an efficient NanoString-based strategy to focus on screening for fusions involving TK genes [
46]. The NanoString nCounter expression assay design was based on the known genomic properties of existing TK fusions, i.e., these fusions invariably occur upstream of the exons encoding the kinase domain. Therefore, two sets of probe pairs were designed for each TK gene: one targeted a 100 bp region far 5′ to the kinase domain exons and the other was mapped to a 100 bp region within exons coding for kinase domain or 3′ to it. Tumor RNAs were hybridized to the NanoString probes and analyzed for outlier 3′ to 5′ expression ratios. Presumed novel fusion events were followed up by RACE and further verified by RT-PCR and FISH. With the application of this strategy, we identified
KIF5B-RET and
GOPC-ROS1 fusions in two lung adenocarcinoma patients, respectively, both of which may be immediately targetable by drugs. Although the relatively low throughput of this platform and the general design strategy described do not allow fine mapping of the fusion point in a given gene, the NanoString nCounter expression system is a promising platform for fusion transcript screening because it is less sensitive to the quality of RNA samples and requires only a small amount of input RNA (100 ng to 200 ng) for a single test. Therefore, for formalin-fixed paraffin-embedded tissue analysis, the NanoString nCounter expression system is better suited than exon arrays and RNA-Seq.
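The outlier analysis of 3′ to 5′ expression ratios can be sketched as follows for a single TK gene. The sketch assumes one 5′ probe count and one 3′ probe count per sample, computes a log2 ratio with a pseudocount, and flags samples whose ratio lies far above the cohort median in units of the median absolute deviation. The pseudocount, the cutoff `k`, and the sample labels are illustrative assumptions rather than the published criteria.

```python
from math import log2
from statistics import median


def log_ratio(count_3p, count_5p, pseudocount=1.0):
    """log2 of the 3'/5' count ratio; the pseudocount avoids division by zero."""
    return log2((count_3p + pseudocount) / (count_5p + pseudocount))


def outlier_samples(counts, k=5.0):
    """Return samples whose 3'/5' log ratio exceeds median + k * MAD.

    `counts` maps a sample label to (5' probe count, 3' probe count)
    for one TK gene. The cutoff k is an assumed tuning parameter.
    """
    ratios = {s: log_ratio(c3, c5) for s, (c5, c3) in counts.items()}
    values = list(ratios.values())
    med = median(values)
    mad = median([abs(r - med) for r in values])
    if mad == 0:
        return []
    return sorted(s for s, r in ratios.items() if (r - med) / mad > k)


# Toy counts: sample S5 expresses the kinase-domain (3') region far more
# strongly than the 5' region, the pattern expected for a TK fusion.
toy_counts = {
    "S1": (100, 110),
    "S2": (90, 95),
    "S3": (120, 115),
    "S4": (105, 100),
    "S5": (10, 900),
}
flagged = outlier_samples(toy_counts)
```

Flagged samples are candidates only; as described above, presumed fusion events still require RACE, RT-PCR, and FISH confirmation.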
Conclusions
In this review, several advanced genome-wide approaches for screening for gene fusions without prior knowledge of the genetic background of a given case have been discussed. In summary, with the rapid progress of NGS, sequencing technologies will continue to become faster and cheaper. RNA-Seq is now the approach of choice in screening for gene fusions. Data analysis remains the bottleneck in the application of whole-genome/transcriptome sequencing. Challenges include quickly and accurately aligning millions of sequence reads to a reference genome and data mining. In terms of data sorting and interpretation, what we have learned from the analysis of high-resolution array data is of great help. The NanoString nCounter gene expression system can only analyze a limited number of targets in each experiment; however, it is relatively cost-efficient, less sensitive to the quality of RNA, and less challenging in data analysis. Such qualities make this system preferable in screening for fusion events in a small group of genes of interest. Table 3 summarizes the advantages and disadvantages of three major RNA-based platforms (i.e., array-based gene/exon-expression profiling, RNA-Seq, and the NanoString nCounter gene expression system) in the discovery of gene fusions. In addition, workflows for the identification of recurrent fusion genes based on these high-throughput genome-wide screens are illustrated in Fig. 2.
With continuous innovation in genetics and genomics, the identification of novel gene fusions in human cancers is no longer time-consuming or labor-intensive. However, studying the increasing number of gene fusions and their functions in tumorigenesis will be a great challenge for researchers in the coming years.
Higher Education Press and Springer-Verlag Berlin Heidelberg