Introduction
Tens of thousands of cancer genomes have been sequenced, and numerous point mutations, insertions and deletions (indels), structural variations, copy number alterations, epigenetic changes, and microbial infections have been uncovered [1] through the tumor–normal pairs method. In this method, the cancer genome sequence is compared with a reference human genome, and variants that are also found in counterpart normal controls (CNCs; counterpart normal tissues or peripheral blood) or that correspond to known single-nucleotide polymorphisms (SNPs) are subtracted [2]. However, whether some genomic alterations occur only in CNCs and not in tumor tissues remains unclear.
Approximately 90% of lung cancer deaths are caused by cigarette smoke, which contains more than 20 lung carcinogens, including nicotine-derived nitrosaminoketone (NNK) and polycyclic aromatic hydrocarbons (PAHs) [3]. Tobacco smoke induces cellular injury throughout the entire respiratory tract [4] and causes all types of lung cancer but is most strongly linked with small-cell lung cancer and lung squamous-cell carcinoma (LUSC), which accounts for approximately 25%–30% of all lung cancer cases [5,6]. Characterizing the genomic landscapes of LUSCs through the tumor–normal pairs method has revealed a large number of exonic mutations, genomic rearrangements, and segments of copy number alterations [7,8]. Long-term exposure to air pollution, another cause of lung cancer, also induces numerous genomic mutations in patients [9]. However, whether genomic alterations exist in counterpart normal lung tissues, which have been exposed to the same tobacco smoke or polluted air as the tumor tissues, remains unclear.
In this study, we report previously unidentified CNC variations found in 481 patients with LUSC. We sequenced the whole genomes of three normal lung tissue samples and their paired adjacent squamous cell carcinomas to characterize CNC-specific genomic variations. We also analyzed the genome sequences of 478 CNCs and paired LUSCs from The Cancer Genome Atlas (TCGA) datasets. We compared the genomic sequences of the CNCs with the reference human genome and filtered out alterations that were also found in the counterpart tumors or that were known germline variants (normal–tumor pairs method). Variations derived from sequencing artifacts were removed with the VarScan false-positive filter (fpfilter).
Materials and methods
Patients, genomic data, and analytical method
This study was approved by the research ethics committee of the Institute of Zoology, Chinese Academy of Sciences. The diagnosis of LUSC was confirmed by three pathologists. The tumor samples had a tumor cellularity greater than 80%, and the paired normal lung tissues had no tumor content. Genomic DNA samples were isolated from normal lung tissues obtained 5 cm or more away from the tumors. Sequencing libraries were constructed and sequenced on the Illumina HiSeq 2000 platform [9]. The raw sequencing data were processed with the FASTX-Toolkit to retrieve high-quality paired reads, which were then aligned to the reference human genome (hg19) [11] with the Burrows–Wheeler Aligner (BWA) using default parameters. After duplicates were marked with Picard, the Binary Alignment/Map (BAM) files were subjected to base quality recalibration and indel realignment with the Genome Analysis Toolkit (GATK) [10] (https://www.broadinstitute.org/gatk/). Variants in CNCs were called with the UnifiedGenotyper of GATK and filtered against dbSNP138 common germline variants (http://genome.ucsc.edu/) and against variants detected in the paired tumor samples. The false-positive filter incorporated in VarScan v2.3.9 was used to remove sequencing artifacts and other false-positive variants. The readcount files of variants in the CNC BAM files were generated with bam-readcount and entered into the filter with default parameters other than the following options: --min-var-basequal: 20; --min-var-mapqual: 20; --min-var-freq: 0.2; --min-var-count: 2. Variants that passed the false-positive filter with an allele frequency ≥ 0.20 in the normal counterpart and ≤ 0.05 in the tumor sample were retained as variants in CNCs. The called variants were validated by polymerase chain reaction (PCR) and Sanger capillary sequencing using the primers listed in Table S1 and the genomic DNA samples of the patients. Mutations in the cancer genomes were also analyzed with GATK using the above criteria.
The use of the TCGA genome data was approved by the National Institutes of Health of the United States of America with the approval number of #24437-4. The dbGaP accession number is phs000178.v9.p8. The TCGA genome data of 478 patients with LUSC (Table S2) were downloaded from the Cancer Genomics Hub (https://cghub.ucsc.edu/) and analyzed as described above.
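The selection criteria described in this subsection can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Variant` record and its field names are hypothetical, and the dbSNP138 lookup and the VarScan false-positive filter are represented only as precomputed Boolean flags. The two thresholds (allele frequency ≥ 0.20 in the CNC and ≤ 0.05 in the paired tumor) are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds stated in the Methods text.
NORMAL_MIN_VAF = 0.20  # minimum variant allele frequency in the CNC sample
TUMOR_MAX_VAF = 0.05   # maximum variant allele frequency in the paired tumor


@dataclass
class Variant:
    """Hypothetical record for one called variant (field names are assumptions)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    normal_vaf: float      # allele frequency in the counterpart normal control
    tumor_vaf: float       # allele frequency in the paired tumor
    in_dbsnp_common: bool  # True if listed among dbSNP138 common variants
    passed_fpfilter: bool  # True if it passed the false-positive filter


def is_cnc_specific(v: Variant) -> bool:
    """Return True if the variant meets all CNC-specific criteria."""
    return (
        v.passed_fpfilter
        and not v.in_dbsnp_common
        and v.normal_vaf >= NORMAL_MIN_VAF
        and v.tumor_vaf <= TUMOR_MAX_VAF
    )


def select_cnc_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Keep only the variants that qualify as CNC-specific."""
    return [v for v in variants if is_cnc_specific(v)]
```

In this sketch, a variant with an allele frequency of 0.50 in the normal sample and 0.02 in the tumor would be kept, whereas one with a tumor allele frequency of 0.30 would be discarded as shared with the tumor.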
Statistics
Differences between data groups were evaluated for significance with Student’s t-test using SPSS 17.0 for Windows (Chicago, IL, USA). Survival curves were plotted in accordance with the Kaplan–Meier method and compared through the log-rank test. P values < 0.05 were considered statistically significant.
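For readers unfamiliar with the survival statistics named above, the following pure-Python sketch shows how a Kaplan–Meier curve and a two-group log-rank test are computed. It is an illustration only and does not reproduce the SPSS workflow used in this study; the function names and input layout (follow-up times with an event indicator of 1 for death and 0 for censoring) are assumptions.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times: follow-up durations; events: 1 if death observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in pooled if g == 0 and u >= t)
        n_b = sum(1 for u, _, g in pooled if g == 1 and u >= t)
        d_a = sum(1 for u, e, g in pooled if g == 0 and u == t and e)
        d_b = sum(1 for u, e, g in pooled if g == 1 and u == t and e)
        n, d = n_a + n_b, d_a + d_b
        # Observed minus expected deaths in group A at this time point.
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

The P value relies on the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable, which keeps the sketch free of external dependencies.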
Results
Identification and validation of CNC genomic variations in patients with LUSC
In the initial screening, the genomic DNA samples of the paired normal lung tissues and cancer tissues of three patients with LUSC were sequenced to an average of 44.15× (range, 35.83×–50.68×) coverage and 64.63× (range, 62.03×–65.97×) coverage, respectively. Nucleotide substitutions and small indels were found in the three LUSC genomes, including 14 single nucleotide substitutions, one dinucleotide substitution (AG→CC in MADCAM1), and one indel (TCC deletion in NCL) (Table 1). PCR assays and subsequent sequencing were performed to verify the identified alterations in six genes in the normal lung tissues of the patients, and the results confirmed the existence of the alterations (Fig. 1). For example, the nucleotide at chr12:124810072 in NCOR2 is C in hg19. However, two peaks (C and T) of equal height were seen in the sequence from the normal lung tissue of a patient, whereas a high C peak and a very low T peak were detected in the paired tumor sample (Fig. 1A, left panel). This change might lead to an A2474V substitution in the encoded protein. Sequencing results obtained with another set of primers confirmed the existence of T in the normal lung rather than in the counterpart tumor sample of the patient (Fig. 1A, right panel). Nucleotide T at this position of NCOR2 was absent from the dbSNP138 common germline variants. Similarly, the normal lung tissues had variations in GLB1L (Fig. 1B), MACF1 (Fig. 1C), C10orf95 (Fig. 1D), and DPPA4 (Fig. 1E) relative to the tumor samples, hg19, and dbSNP138. The TCC deletion in NCL in normal lung tissues was also confirmed by Sanger capillary sequencing using genomic DNA and two sets of primers (Fig. 1F).
Analyses of TCGA datasets
We expanded the observations in TCGA datasets by analyzing the genome sequences of the normal–tumor pairs from 478 patients with LUSC. Of these patients (Table S2), 353 (73.85%) were males and 125 (26.15%) were females, and the median age was 68 years old (range, 39–90 years). The smoking histories of 468 patients were available. Among these patients, 450 (96.15%) were current smokers or reformed smokers (not smoking at the time of interview but had smoked at least 100 cigarettes in their life), and 18 (3.85%) were nonsmokers (not smoking at the time of the interview and had smoked less than 100 cigarettes in their life). Adjacent normal lung tissues and peripheral blood were used as normal controls for 224 (46.86%) and 254 (53.14%) of the 478 patients with LUSC, respectively.
Variations were found throughout the genomes of CNCs and tumor tissues. A mean of 0.566 exonic alterations per megabase (Mb) was recorded in the CNC samples. This value is considerably lower than that in tumor tissues (7.067 mutations/Mb, P<0.0001; Table 2). Normal lung tissues carried 0.588 exonic mutations/Mb, which is approximately equal to the value in peripheral blood (0.547 mutations/Mb; Table S3). CNCs from male patients had 0.598 exonic mutations/Mb, which is slightly more than that of CNCs from female patients (0.476 mutations/Mb; Table S4). Black patients had more CNC mutations than white patients (Table S5). Only 18 patients in this cohort were nonsmokers (Table S2). This small number might explain why the mutation burdens of CNCs and tumors in smokers were not significantly higher than those in nonsmokers, as reflected by mutations/Mb, mutated genes/sample, synonymous/nonsynonymous mutations, and indels/sample (Fig. S2A–S2H).
Nucleotide substitutions in TCGA datasets
The nucleotide variations of the CNC genomes were analyzed. The results showed that C:G→T:A transitions were the predominant nucleotide substitutions, followed by A:T→G:C transitions (Fig. 2A). In the tumor samples of the patients, C:G→A:T transversions were the predominant nucleotide substitutions, and C:G→T:A transitions were the second most prevalent nucleotide changes (Fig. 2A). We further showed that C:G→T:A transitions were the most prevalent nucleotide changes in the CNCs of nonsmokers and smokers (Fig. 2B) and of males and females (Fig. 2C).
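The six substitution classes used in this subsection (e.g., C:G→T:A) treat a change and its reverse complement as the same event. The following Python sketch shows one way to assign a single-nucleotide substitution to its class and to tally a spectrum; the function names are hypothetical and the sketch is not the code used to produce Fig. 2.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ref: str, alt: str) -> str:
    """Map a single-nucleotide change to one of six strand-collapsed classes.

    Changes whose reference base is G or T are converted to their reverse
    complement so that every class is written from a C:G or A:T base pair.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in ("G", "T"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}→{alt}:{COMPLEMENT[alt]}"


def substitution_spectrum(changes):
    """Count substitution classes for an iterable of (ref, alt) pairs."""
    return Counter(substitution_class(ref, alt) for ref, alt in changes)
```

For instance, both C>T and G>A are counted as C:G→T:A transitions, and both C>A and G>T are counted as C:G→A:T transversions.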
Altered genes in CNCs of TCGA datasets
We found a mean of 7.7887 altered genes per CNC sample. This value is considerably lower than that in tumor samples (164.8159 mutated genes/sample, P<0.0001; Table 2). In the 478 CNC samples, 25 genes had a variation rate of more than 2% (Fig. 2D and Table S6). ARSD [12] was the most frequently altered gene and was altered in 89/478 (18.62%) of the CNC samples (Fig. 2D). In the 89 CNCs, 192 variations were found in ARSD, and 28 (14.58%), 28 (14.58%), and 27 (14.06%) of these alterations led to G175D, L166Q, and M176K amino acid substitutions, respectively (Fig. 3A). MUC4, RBMX, MUC5B, RP1L1, and CDC27 were mutated in 42 (8.79%), 34 (7.11%), 18 (3.77%), 18 (3.77%), and 17 (3.56%) of the 478 CNC samples, respectively. Substitutions and small indels, which resulted in single amino acid substitutions or truncation of the encoded proteins, were frequently seen among the CNC variations. Some genes (e.g., ARSD) also had variation hotspots (Fig. 3). TP53, MLL2, PIK3CA, CDKN2A, and NFE2L2 are frequently mutated in LUSC [7]. However, no alteration in these genes was detected in the CNC samples (Table S6).
Altered signaling pathways
The affected signaling pathways were analyzed through Gene Ontology analysis [13]. The results showed that genes involved in the interferon-γ (IFN-γ)-mediated signaling pathway, O-glycan processing, and antigen processing and presentation were altered in CNC samples (Fig. S3A). Analysis using the Kyoto Encyclopedia of Genes and Genomes database showed that the allograft rejection, cell adhesion molecule, and asthma pathways were affected (Fig. S3B).
CNC variations associated with poor prognosis of the patients
We analyzed the potential association between variations in CNCs and the prognosis of the patients using the Kaplan–Meier method. Variations in two genes were associated with poor clinical outcome (Fig. 4). A splice-site variation in CTAGE5 (CTAGE family member 5) [14], c.1356+2_1356+3delTA, was observed in the CNCs of seven (1.46%) of the 478 patients (Fig. 4A, upper panel). Among the 473 patients with available survival information, the overall survival of the seven patients with the splicing variant of CTAGE5 in CNCs was considerably shorter than that of patients with wild-type CTAGE5 (P<0.0001; Fig. 4A). Nucleotide changes that result in an R301L substitution in the ubiquitin specific peptidase 17-like family member 7 (USP17L7) [15] gene were seen in six CNCs (Fig. 4B). Patients with these CNC variations had a considerably shorter survival time than those with wild-type USP17L7 (Fig. 4B).
Discussion
Chronic exposure to tobacco smoke causes the development of LUSC in the central airway. The development of LUSC follows a stepwise progression from hyperplasia through metaplasia and dysplasia to carcinoma in situ [16]. Molecular lesions (e.g., genetic mutations and somatic copy number variations) are present in premalignant patches [16–19], and somatic genomic mutations have been found in lung tumors [7,8]. In this study, we dissected the whole genome sequences of normal lung tissues from three patients with LUSC using the normal–tumor pairs method to characterize the genomic alterations present in normal lungs that have been exposed to tobacco smoke. We found that the normal lung tissues of the three patients harbored genomic variations that were not observed in their counterpart tumor samples, hg19, or dbSNP138 (Table 1). The six identified genomic variations were evident upon validation through Sanger capillary sequencing (Fig. 1A–1F). However, the tumor samples also exhibited very low peaks of the respective nucleotides (Fig. 1), suggesting the presence of normal lung epithelial cells in the tumor samples or allelic loss in the tumor cells. Genomic variations were also detected in the CNCs of lung adenocarcinomas (LUADs) in our own genome sequencing data and the TCGA dataset [20]. In the TCGA datasets, the CNC genomic alterations displayed a frequency of up to 18.62%, and variations in two genes were associated with poor prognosis. Although we were unable to verify these variations because the TCGA samples were unavailable, our results provide new opportunities for investigating cigarette smoke-induced genomic mutations in normal lungs and the still elusive process of lung carcinogenesis.
We found that C:G→T:A transitions are the most prevalent nucleotide changes in CNCs and the second most prevalent substitutions in LUSCs, whereas C:G→A:T transversions are the predominant nucleotide substitutions in LUSCs (Fig. 2A). Previous studies have shown that C:G→T:A transitions are the genomic signature of the tobacco carcinogen N-methyl-N′-nitro-N-nitrosoguanidine (MNNG) [20] and a mutational fingerprint of aging [21]. C:G→A:T transversions represent a genomic signature of PAHs, which are found in tobacco smoke and act as air pollutants [9,20,22]. Given that SNPs, including those of elderly individuals, had been filtered out in this study, C:G→T:A transitions in CNCs may reflect the exposure of the patients to environmental carcinogens such as MNNG (from tobacco smoke, or from second-hand smoke for nonsmokers). Our results are further consistent with the view that the genotoxicity of tobacco smoke triggers lung carcinogenesis.
Genomic variations are frequently seen in CNCs, suggesting that these alterations may perturb the biological function of the relevant proteins and may be involved in lung tumorigenesis. We hypothesize that some of the affected genes are proto-oncogenes (e.g., CDC27 [23] and MADCAM1 [24]) or tumor suppressors (e.g., NCOR1 [25]). Cells that harbor gain-of-function or loss-of-function mutations of these genes may be in a precancerous stage, and the accumulation of other mutations may result in transformation and the development of malignant neoplasms. Thus, LUSCs may have multiclonal origins with distinct genetic variants. In addition, many of the CNC-altered genes are associated with immune response (Fig. S3), which may help avoid immune destruction and cancer initiation. CNC cells may interact with tumor cells to provide an environment that either fosters or constrains carcinogenesis. Variations in components of the DNA-damage response system, such as RBMX [26], are also frequently seen in CNCs (Fig. 2D), suggesting their role in maintaining genome stability. Some CNC variations, i.e., those in CTAGE5 and USP17L7, are associated with poor patient prognosis (Fig. 4). This association further suggests their significance in lung carcinogenesis. Notably, some CNC variations that are similar to passenger mutations in tumor samples may have a minimal role in lung tumorigenesis. Further work is required to investigate the roles of CNC variations in the initiation and progression of LUSC.
Higher Education Press and Springer-Verlag GmbH Germany