1 INTRODUCTION
Targeting tumor-specific mutations via customized chemical compounds can precisely eradicate the cancer cells without harming healthy tissues, which paves a way toward precision oncology. But this precision oncology strategy has not been successful in many refractory cancers such as glioblastoma (GBM). One of the main obstacles is the limited understanding of cancer evolution, in which cancer cells might acquire advantageous fitness to revive under treatment stress.
To study cancer evolution, researchers attempt to collect tumor samples from different locations (multi-regional) and/or at different time points (longitudinal) of the same patients. However, the collection of such data is extremely challenging, partly due to tumor resectability. To overcome this difficulty, one way is to integrate data from multiple sources, which is able to increase statistical power, potentially leading to new discoveries hidden in large-scale public datasets. Recently, Wang
et al. integrated longitudinal genomic data of GBM patients from six different sources [
1], and this integration has revealed the pattern of GBM evolution under therapy and discovered several somatic mutations exclusively in the tumors after treatment.
Here, we summarized the computational methods used in this paper [
1], developed an easy-to-use toolbox, namely CELLO (Cancer EvoLution for LOngitudinal data), and provided a step-by-step tutorial about how to analyze and visualize longitudinal next-generation sequencing data. Particularly, we analyzed several public longitudinal genomic sequencing datasets,
i.e., the whole-exome sequencing of the matched blood samples, the initial and the recurrent tumors from cancer patients, and demonstrated how to:
• preprocess the raw sequencing data into a tabular form of mutations;
• filter low-confidence somatic mutations;
• generate longitudinal mutational landscape;
• analyze mutational signature;
• cluster patients based on evolutionary patterns;
• identify clonal switching events; and
• infer temporal order of somatic mutations.
Notably, through a multi-platform data integration strategy, we extended the module of hypermutation signature analysis to be able to identify hypermutators from targeted DNA sequencing and RNA sequencing data, originally developed and used in our previous study on secondary glioblastoma [
2]. Besides brain tumor data, we tested CELLO using additional published datasets of longitudinal sequencing bladder and breast tumors [
3,
4], and illustrated that CELLO is applicable not only in brain tumor, but also in other cancer types. To benefit researchers who are interested in longitudinal cancer genomics study for analyzing their own data, both MATLAB and R versions of CELLO are developed. To ensure reproducibility and usability, we also present a docker version of CELLO based on the R implementation.
2 PREPARING AND PREPROCESSING OF INPUT DATA
2.1 Mapping of DNA and RNA sequencing data
DNA sequencing (DNA-seq) of tumor samples is able to profile somatic mutations and copy number variations in cancer cells, while RNA sequencing (RNA-seq) is commonly used to detect gene fusions and quantify expression changes in transcriptome. In this tutorial, we mainly focus on the analysis of whole-exome sequencing of the tumor tissue and normal control (blood or tumor adjacent normal tissue), as well as RNA-seq of tumor tissue, longitudinally collected from cancer patients.
The raw sequencing data (FASTQ files) should first go through quality control. The tool FastQC [
5] is recommended for this purpose. Low quality reads, such as those with average sequencing quality<20 or with more than three ambiguous bases (“N”s), should be discarded. The high quality reads can then be aligned to the reference genome (such as
hg19, which can be downloaded from the following site http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/). BWA-MEM [
6] is recommended for mapping DNA-seq data, and STAR [
7] for mapping RNA-seq data. For DNA sequencing data, duplicates from PCR (polymerase chain reaction) can be removed using the FastUniq toolkit [
8]. Gene expression profile can be extracted from the RNA-seq aligned files using FeatureCounts [
9], Cufflinks [
10] or RSEM [
11]. Analysis of the DNA-seq data is detailed as below.
2.2 Somatic mutation calling using SAVI2
Somatic mutation calling is to identify somatic mutations (single nucleic variants and small insertions/deletions) that are solely carried by tumor cells rather than normal cells. Statistical algorithm for variant frequency identification (SAVI) is an empirical Bayesian framework that models variant allelic frequency (VAF) [
12]. Based on the posterior distribution, SAVI can identify somatic mutations whose VAFs in a normal control are significantly lower than those in a tumor sample. SAVI2 pipeline further integrated multiple external databases and multi-step filters to rule out common single-nucleotide polymorphisms (SNPs), potential technical errors, strand-biased variants, and a number of other types of low-quality variants. Notably, unlike most mutation callers, SAVI2 is able to call mutations for sequencing data from multiple samples at one execution, which is an important advantage in the longitudinal sequencing data analysis.
2.3 Analysis of copy number alterations
Commonly used copy number alteration (CNA) detection tools, such as EXCAVATOR [
13], can be used to detect copy number alterations from the WES data. EXCAVATOR implemented a three-step normalization step to reduce the bias from GC contents, genomic mappability and exon size, all of which may significantly impact CNA calls. Other tools such as CNVkit [
14] can also be used, and it is important to calibrate the method using existing data and the reported CNA results to ensure the reliability of the tools. For samples with only RNA sequencing data, CNAPE will be used to infer the copy number change [
15].
2.4 Detection of gene fusions
Detection of gene fusions typically start from FASTQ files from RNA-seq. ChimeraScan [
16] is recommended to generate the raw gene fusion candidates because of its high sensitivity, but other tools such as STAR-fusion [
17] can also be used. The candidates can then be processed by Pegasus [
18] to annotate the gene fusions and prioritize the fusions based on their functional importance using a machine learning model. False positives and passenger gene fusions can then be filtered out based on the score and the number of supporting reads. It is recommended to validate the presence of the selected gene fusions using PCR and Sanger sequencing.
3 MATERIALS
3.1 Software
The whole CELLO pipeline requires the following software packages:
• SAVI2 and its dependent packages:
• Python (2.7 preferred);
• Scipy;
• Java;
• Samtools (v1.2);
• SnpEff (v4.1 C);
• tabix (v1.7);
• bgzip (v1.7);
• vcflib;
• Bedtools (v2.26.0); and
• MATLAB (R2016b) or R (version 3.6.1).
3.2 Example dataset
The required input file is a tabular data of mutations with additional functional annotations from SnpEff [
19], a dependent package called from SAVI2, and the two neighboring bases of the mutated position acquired by Bedtools [
20] getfasta command. In this tutorial, we provided an example data of mutations from 90 GBM patients before and after treatment previously published [
1]. Particularly, each patient in this dataset has three biological samples: blood, primary and recurrent tumors. The whole exome of those 270 samples were sequenced, and the output sequencing raw data were aligned to the reference genome (
hg19) using BWA-MEM
(Fig. 1A). Using SAVI2 and Bedtools, we derived a mutation table consisting of 56,242 rows/mutations and 27 columns/features (Fig.1B, raw data file: input.savi.txt in the GitHub repository). The detailed description of each mutation feature is listed in Table 1.
4 PROTOCOL
First of all, we provided a functional content of CELLO toolbox by listing the usage, input and output of each function in Table 2. Next, we explained in details how to use each function and what is the implication of the output result.
4.1 Clean data
First, we filtered out the mutations with allelic frequency less than 5% in both initial and recurrent tumors. We then defined somatic mutations in tumors by using the following criteria: (i) read depth of blood higher than 20 (default parameter); and (ii) number of altered reads in the control as 0 or 1. In CELLO, one can read in and clean the data using the following MATLAB or R codes.
MATLAB code:
saviTable= mutRead('input.savi.txt');
R code:
source('CELLO.R')
savi.table<- mutRead("input.savi.txt", 20, 1, 5)
4.1.1 Generate figures for longitudinal mutational landscape
A mutational landscape is to display a global picture of mutation occurrence in a tumor type using a grid heatmap, e.g., oncoprint. For longitudinal data analysis, one patient has two (or more) tumor samples, and hence a mutation may be (i) conserved in the both samples, (ii) present in the initial sample but disappear in the recurrent sample, and (iii) absent in the initial sample but newly emerge in the recurrent sample. Here we provide a customized mutational landscape to visualize the presence/absence of mutations during tumor progression using different colors. In practice, we first marked the key driver genes (knownDriverGene) in the data table (saviTable) and input them to the relevant function, mutLandscape(), which will automatically generate the longitudinal landscape (Fig. 1C). For example, using 12 patients in the example dataset, Fig. 1C displays for each patient in each column that (i) the number of somatic mutations in the stacked bar plot (upper panel) with shared mutations in yellow, primary-private mutations in red and recurrence-private mutations in black, and (ii) the presence of key functional drivers in the heatmap (lower panel) using the same color theme. Note that this layout design provides a general glance of mutational dynamics in cancer evolution, and works for patients with two tumor samples only.
To further display the co-occurrence and mutual exclusivity between different mutations, we performed Fisher’s exact test (FET) for each pair of key driver mutations/phenotypes and highlighted the significant pairs in a pyramid-shaped scatter plot (Fig. 1D). These comparisons were performed within primary and recurrent samples and shown on the left and right halves of the pyramid, respectively. Concretely, mutation in gene 1 co-occurs with mutation in gene 2 in 25 out of 100 recurrent samples, which leads to significant co-occurrence in the FET (odds ratio>1, P value<0.0001). In contrast, mutations in gene 2 and gene 3 exhibit a mutual exclusive pattern in primary samples (odds ratio<1, P value<0.0001). We screened each pair between mutations and phenotypes, and highlighted the significant co-occurrence and mutual exclusiveness using larger dots, compared to the smaller dots in grey denoting insignificant pairs. Significant co-occurrences in primary and recurrent samples are shown in red and black, respectively, whereas mutual exclusiveness are in green with red/black border denoting sample type. This figure can reveal statistically significant associations between mutations and phenotypes which are worth further investigation for potential causality.
Furthermore, we use a 3D scatter plot to display the proportion of each mutation present in each patient in three categories: commonly shared in both primary and recurrence, primary only and recurrence only (Fig. 1E). A mutation occurred more frequently in common will be located close to the upward axis in yellow, while the one present more in either primary or recurrence will be located close to leftward axis in red and rightward axis in black, respectively. For example, Fig. 1E shows mutations in gene 1 close to the upward axis in yellow, as majority of the mutations are observed before and after recurrence. Similarly, the mutations in gene 2 present in primary are not always inherited in the recurrent tumor, whereas the mutations in gene 3 are only observed in recurrence. This figure can highlight evolutionary conservation of driver mutations, and recurrence-enriched mutations for further investigation of potential recurrence-driving role. In practice, the longitudinal landscape, pairwise correlation, and primary-recurrence enrichment of mutation frequency can be computed and automatically visualized by CELLO as follow.
MATLAB code:
knownDriverGene={'TP53','ATRX','IDH1','EGFR','PTEN','PIK3CA','PIK3R1','PIK3CG','PDGFRA','RB1','NF1','PTPN11','LTBP4'};
[saviTable, mutGeneTable] = mutStats(knownDriverGene, saviTable);
hland=mutLandscape(saviTable, mutGeneTable);
hcom= mutCorrelation(mutGeneTable);
h3d= mutFrequency(mutGeneTable);
R code:
knownDriverGene<-c('LTBP4','PTPN11','NF1','RB1','PDGFRA','PIK3CG','PIK3R1','PIK3CA','PTEN','EGFR','IDH1','ATRX','TP53')
stats<- mutStats(savi.table, knownDriverGene, 5, remove_LOW= TRUE)
mutLandscape(stats$mutNum.table, stats$mutGenes.table)
mutCorrelation(stats$mutGenes.table)
freq.table<- mutFrequency(savi.table, knownDriverGene, stats$mutGenes.table, 5)
4.1.2 Analyze mutational signature and hypermutation
It is known that the treatment of alkylating agent,
i.e., temozolomide (TMZ) for GBM patients might induce somatic hypermutation in cancer cell. The TMZ-induced mutations can be characterized by a special type of mutational signature,
i.e., CC>TC [
21]. To rapidly calculate this TMZ-induced signature, we proposed a customized Hypermutation score (
HM score) to capture the main characteristics of this signature as previously described in our sGBM study [
2], and combined it with tumor mutation load to sort out the TMZ-induced hypermutated samples. Recall that the
HM score formula is mathematically described as the summation of three terms: (i) the fraction of C>T mutations, (ii) the fraction of the dominant mutation type CC>TC among all the C>T mutations, and (iii) the fraction of the secondary mutation type CT>TT among all the C>T mutations. Notably, this secondary type is not uniquely contributed by TMZ treatment [
21], and therefore its fraction should not contribute to the
HM score when it becomes the dominant type within all the C>T mutations. As previously described [
2], the
HM score is mathematically formulated as
where
ML indicates mutation load; and the function
fC→T(
x) is the number of nucleotides mutating from C into T where
x represents the flanking nucleotide of the C at the 3-prime direction in the reference genome. And the letter N denotes any bases A, T, G or C. The function sign(
y) = 1 if
y>= 0, and
-1 otherwise, which determines whether CT>TT positively contributes to the
HM score or not. To highlight the hypermutated samples from the cohort, we visualized the number of somatic mutations (a.k.a., mutation load) and the
HM score of each sample in a 2D scatter plot with initial samples in red squares and recurrence samples in black circles (Fig. 1F). In our case, a hypermutated sample is determined through a cutoff of 350 mutations or more in whole-exome scale, and the hypermutation is attributed to TMZ induction if the
HM score is higher than 1.3, as shown within the grey shaded area (Fig. 1F). This figure is able to visualize hypermutated tumor samples with particular characteristic, and we expect that it is also applicable in hypermutated lung cancer with smoking signature and hypermutated skin cancer with ultraviolet signature. Complementarily, we provided a stacked bar plot displaying the proportion of each nucleotide change type, 6 in total when considering base-paring rule (see the legend, Fig. 1G), for each sample in three categories: primary (red squares in Fig. 1F), non-hypermutated relapsed (black circles in lower left corner, Fig. 1F) and hypermutated relapsed samples (black circles in upper right corner, Fig. 1F). This figure will highlight the enriched type of nucleotide change in the hypermutated samples, compared to the non-hypermutated samples (C>T or G>A in our GBM case). In addition, we also displayed the distribution of silent/missense ratio of each sample among the primary, non-hypermutated recurrence and hypermutated recurrence categories (Fig. 1H). This figure is used to visualize whether or not hypermutation imposes selective pressure in tumor evolution. In practice, one can use the
mutSignature() function in CELLO to complete the above analysis and visualize the corresponding results as follow.
MATLAB code:
hsig= mutSignature(saviTable);
R code:
hm.table<- mutSignature(savi.table, 15, 350, 1.3)
4.1.3 Cluster patients based on evolutionary pattern
With different shapes of phylogenetic tree, cancer evolution follows different patterns: linear, branching, neutral or punctuation [
22]. Moduli space is a geometric space of phylogenetic trees (Fig. 1I). Clustering analysis on the Moduli space can categorize the embedded trees with different evolutionary patterns. Particularly, we first constructed a phylogenetic tree of the initial and recurrent tumors of each patient with three branches: the branch of common mutations in yellow, the branch of primary-private mutations in red and the branch of recurrent-private mutations in black, and then embedded the trees into a Moduli space (Fig.1I). In this space, each dot represents a phylogenetic tree of a patient, and the coordinates of this dot are calculated by the structure and the branch length of the phylogenetic tree. Clustering those trees/patients on the Moduli space using un-supervised clustering methods such as k-means algorithm can group patients with similar phylogenetic trees into the same categories. In particular, the patients with a long trunk and extremely short branches in the tumor phylogenetic tree are deemed to follow a linear growth pattern, as illustrated on the top of the Moduli space in Fig. 1I. In contrast, the patients whose recurrent tumors do not inherit majority of mutations in the initial tumors (bottom-left corner and the center) are deemed to follow a branching pattern as the clonal structure of initial and recurrent tumors are dramatically different. In addition, the patients located in the bottom right corner are those developing hypermutation after TMZ treatment, whose phylogenetic trees have extremely long branch in black representing the number of recurrence-private mutations. In practice, the moduli analysis can be easily accomplished in a few seconds by CELLO with function
mutTreeClustering() in MATLAB or R as follow.
MATLBA code:
hmod= mutTreeClustering(saviTable);
R code:
cluster.table<- mutTreeClustering(stats$mutNum.table)
4.1.4 Infer the order of somatic mutation using TEDG
Tumor evolutionary directed graph (TEDG [
23]) is a computational model to infer and integrate mutation orders of multiple cancer patients into a directed graph of highly recurrent trajectories. This model was successfully applied in the longitudinal genomic data analyses of chronic lymphocytic leukemia [
23] and GBM [
1], and is generally applicable in other cancer types with tumor genome sampling conducted at two time points or more. A TEDG consists of major functional driver variants as the nodes (Fig. 1J). In the TEDG, one directed edge from gene 1 to gene 2 represents the gene 1 occurs earlier than the gene 2 in tumor progression. The thickness of the edges is proportional to the occurrence of this order,
i.e., how many patients have this mutation order in their temporally sampled tumors. The color gradient of nodes stands for the evolutionary direction from early events (red) to late events (black), which is quantified by the ratio of indegree and outdegree of each node. And the size of the nodes is proportional to the occurrence of the mutations. In practice, the TEDG can be constructed by the function
mutDirectedGraph() with optional deconvolution using minimum spanning tree.
MATLAB code:
G= mutDirectedGraph(saviTable, true);
R code:
TEDG<- mutDirectedGraph(Stats$mutGenes.table)
4.1.5 Identify “clonal-switching” events
Clonal switching refers to the switch between differentially mutated versions of the same gene in the initial and recurrent tumor of the same patient. In Fig. 1K, we illustrate this event by showing two different mutated loci of the same gene before and after recurrence. Notably, the red mutant present in primary genome disappears in the recurrence genome, whereas the black mutant is newly emerging in the recurrence. This phenomenon has been reported in multiple cancer types such as renal cell carcinoma [
24,
25] and glioblastoma [
1]. Clonal switching events are often observed in key driver genes. Since the possibility of back-mutation is extremely low, the different versions of mutations are believed to be developed independently, indicating a branched evolution of the initial and recurrent tumors. In our case, one can use the
mutSwitch() function in CELLO to identify and visualize the clonal switching events as the following MATLAB or R code.
MATLAB code:
hsw= mutSwitch(saviTable, 'PDGFRA');
R code:
switch.table<- mutSwitch(savi.table, knownDriverGene, 5, 20)
5 EXTENSION OF HYPERMUTATION DETECTION TO TARGETED DNA AND WHOLE-TRANSCRIPTOME SEQUENCING DATA
Built upon the original hypermutation method as mentioned above, we extended our pipeline of the hypermutation detection to targeted DNA and whole-transcriptome sequencing data derived from our previous study on secondary glioblastoma (sGBM) [
2]. To achieve this extension, we added extra filters to rule out sequencing noise and alignment artifacts in the DNA-targeted and RNA sequencing data (Fig. 2A). In particular, to remove sequencing noise in targeted data, we increased the VAF cutoff to 7%. And to remove potential uncommon germline variants and the alignment artefacts in RNA data, we removed the RNA variants with VAF>40% and the variants close to splice regions. Using the 43 whole-exome, 63 targeted-DNA and 51 RNA sequencing data from the sGBM cohort, CELLO identified 4, 6 and 8 hypermutated samples with extensive mutation load and a score of TMZ-induced signature larger than 1.3 (Fig. 2B–D). Within the 18 hypermutated samples in total, one sample was sequenced by both targeted-DNA and RNA platforms, and our pipeline deemed this samples as hypermutated in the both platforms, which verified the consistency and robustness of our pipeline under different sequencing protocols. Remarkably, the hypermutated and non-hypermutated RNA-seq samples are inseparable solely using the mutation load (Fig. 2D), but the gap of HM score is sufficiently large for a high-confidence separation, demonstrating that our additional filters preserve the hypermutation signature.
6 EXTENSION OF ANALYTICAL PIPELINE TO ADDITIONAL CANCER TYPES
To demonstrate that CELLO is applicable in other public datasets of longitudinal tumor sequencing, we collected two additional datasets from bladder cancer [
3] and breast cancer [
4]. Based on the phylogenetic tree topology, CELLO divided the patients into three clusters on the Moduli space (Fig. 2E–G). Similar to glioma (Fig. 2E), CELLO identified six bladder (Fig. 2F) and three breast cancer patients (Fig. 2G) undergoing a branched evolutionary mode under therapy.
In addition to glioma, CELLO constructed the TEDGs of the bladder and breast cancer patients (Fig. 2H–J). Unlike TP53 mutation as an early event and PIK3CA mutation as a late event in glioma (Fig. 2H), we observe they are both late events in the bladder cancer cohort (Fig. 2I), and early events in breast cancer cohort (Fig. 2J). This observation implies that the same gene may play different roles in the tumor evolution depending on the tissue of origins.
7 DISCUSSION
We developed CELLO, a versatile toolbox for comprehensive analysis of tumor evolution given longitudinal sequencing samples, following the previous work published on brain cancer [
1]. In this protocol, we elaborated the technical details with interpretation on how to use CELLO to analyze longitudinal tumor sequencing data in brain, bladder, and breast cancers, and how to detect hypermutation signature in the brain tumor data from our previous study [
2] generated by different sequencing protocols and platforms. Having these detailed workflow, one can use it to uncover evolutionary behaviors of other cancer types with customized modifications. We anticipate that more researchers can use CELLO to reveal the global picture of pan-cancer evolution, which can provide a guidance on how to target cancer evolution so as to prevent from deadly relapse.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature