INTRODUCTION
Many genetic diseases do not result from the dysfunction of a single gene [1], but are instead related to variations and mutations of multiple genes or their interplay, such as: (i) over- and under-expression of multiple genes [2], (ii) duplication or deletion of copies of several genes, and (iii) hypo- and hyper-methylation of multiple genes. Thus, identifying the mutated genes responsible for specific diseases remains a challenging issue [3]. Reliable detection of these genomic variations and mutated genes is fundamentally important for understanding the mechanisms of many diseases and genetic disorders, such as cancers, diabetes, neuropsychiatric disorders [4,5], birth defects, autoimmune disorders, autism and even susceptibility to HIV.
Nowadays, with the development of high-throughput genomic technologies, it has become easy and cost-effective to comprehensively characterize complex diseases using a wide range of genomic, epigenomic, transcriptomic, proteomic, and metabolomic datasets [6]. Specifically: (i) single-nucleotide polymorphism (SNP), copy number variation (CNV), loss of heterozygosity (LOH) and genomic rearrangement are datasets at the genome level; (ii) DNA methylation, histone modification, chromatin accessibility, transcription factor (TF) binding and microRNA (miRNA) are datasets at the epigenome level; (iii) gene expression and alternative splicing are datasets at the transcriptome level; (iv) protein expression and post-translational modification are datasets at the proteome level; and (v) metabolite profiling is the dataset at the metabolome level.
Copy number variation (CNV) is one of the most important forms of human genetic variation, which comprises not only sequence variants but also structural variants within populations. Although many genetic variants do not cause overt diseases, they influence disease susceptibility or drug response. Therefore, CNVs have drawn the attention of researchers and have been recognized as novel genetic variations related to genomic diseases.
The aim of this review is to give insights into the important role of CNVs in the identification of disease-related genes. The rest of the paper is organized as follows. In the Copy Number Variation Overview section, we give a brief overview of the definition of CNVs, their relation to genomic diseases, and detection methods. Then, we highlight the most significant disease genes determined using CNV methods. Finally, we explain the successful use of this genetic variant in diseases, especially its integration with other genomic data, which should be helpful for identifying new disease genes.
COPY NUMBER VARIATION OVERVIEW
Along with SNPs (which are among the most abundant genetic variations in humans), CNVs have attracted much attention, since they refer to a type of intermediate-scale structural variant (SV) in the genome. Researchers have given many definitions of CNV, as follows:
• CNVs are DNA segments present at variable copy numbers that contribute to a substantial proportion of the variation in a genome owing to their large size [7,8].
• CNVs refer to large-scale (>1 kb) chromosomal copy number changes, e.g., amplifications or deletions compared to a reference genome [9].
• CNVs are deletions or duplications of genomic regions larger than 1 kb [10].
• CNVs are inherited or de novo structural variations, including all kinds of genomic variations larger than 1 kb, such as insertions, deletions and duplications [11,12].
Briefly, CNVs are defined as either the gain (duplication) or loss (deletion) of a stretch of DNA as compared with a reference genome. They are characterized by the breakpoint loci (starting and ending points), single copy length and number of copies, and they may range in size from a kilobase to several megabases or even an entire chromosome, as depicted in Figure 1.
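The characterization above (breakpoints, length and number of copies relative to a reference) can be sketched as a small record type. This is only an illustrative sketch: the class name, field names, the 1-based inclusive coordinate convention, the default reference copy number of 2 (a diploid autosome) and the example coordinates are all assumptions made here, not part of any tool discussed in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    """Hypothetical minimal record for one copy number variant."""
    chrom: str
    start: int                 # first breakpoint (assumed 1-based, inclusive)
    end: int                   # second breakpoint (assumed inclusive)
    copy_number: int           # observed number of copies in the sample
    reference_copies: int = 2  # assumption: diploid autosomal reference

    @property
    def length(self) -> int:
        # Length of a single copy of the affected segment, in base pairs.
        return self.end - self.start + 1

    @property
    def kind(self) -> str:
        # Gain (duplication) or loss (deletion) relative to the reference.
        if self.copy_number > self.reference_copies:
            return "gain"
        if self.copy_number < self.reference_copies:
            return "loss"
        return "neutral"


# Made-up coordinates, for illustration only.
example = CNV("chr1", 1_000_001, 1_050_000, copy_number=3)
print(example.length, example.kind)  # prints: 50000 gain
```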
Additionally, CNVs involve more genomic sequence than SNPs and have potentially greater effects, including alteration of gene dosage, disruption of genes or perturbation of their expression levels. Moreover, CNVs have been shown to be enriched in genes involved in immune responses and cell–cell signaling, and in retrovirus- and transposition-related protein-coding genes [13]. Thus, based on the large portion of human CNVs that have been reported in the Database of Genomic Variants (DGV) [14–16], and their mild effect on multiple gene functions, many CNVs have been associated with disease susceptibility and severity, while the majority remain benign. For example, a duplication within the CCL3L1 (C–C motif chemokine ligand 3 like 1) gene is involved in HIV susceptibility and the development of AIDS [17], a deletion within the IRGM (immunity related GTPase M) gene is linked with Crohn’s disease [18], and a CNV located within the TSPAN8 (tetraspanin 8) gene is associated with type 2 diabetes [19]. Similarly, a lower copy number of FCGR3B predisposes to immunologically related glomerulonephritis in humans and rats [20] and a higher EGFR copy number is linked to non-small cell lung cancer [21].
CNVs are also associated with a range of neurodevelopmental disorders [22], including autism [23], schizophrenia [24], and depression [25]. Besides neuropsychiatric diseases, CNVs have been found to be linked with other disease types, including heart disease [26], obesity [27] and cancer [28], and have also been implicated in altered lifespan [29]. Furthermore, CNVs are linked with extensive phenotypic traits in domestic animals including pigs [30], sheep [31], chickens [32,33], dogs [34], and cattle [35,36], amongst others.
CNV DETECTION METHODS
Although CNV studies have developed considerably over time, little is known about how CNVs influence the phenotype of many rare and common complex diseases. To investigate this issue, various CNV detection methods have been developed. These methods can be categorized into two groups [37] (see Table 1 for details):
1. Genome-wide approaches, in which the entire genome is scanned for detecting CNVs.
(a) Microarray-based methods [
38] such as array comparative genomic hybridization (aCGH) (Figure 2) and single nucleotide polymorphism (SNP) arrays [
39].
(b) Karyotyping and fluorescence in situ hybridization (FISH) [
40].
(c) Synthetic high-density oligonucleotide arrays [
41].
(d) Deep sequencing platforms [
42].
(e) NanoString’s digital detection technology [
43].
(f) Next-generation sequencing (NGS) [
44] such as whole genome sequencing (WGS) and whole exome sequencing (WES).
However, genome-wide CNV analyses are not efficient for validating a small set of known CNVs. Targeted approaches are more efficient for that purpose.
2. Targeted approaches, which include:
(a) Quantitative polymerase chain reaction (qPCR) or Southern hybridization for single target screening [45] (Figure 2).
(b) Multiplex ligation-dependent probe amplification (MLPA) [
46].
(c) Multiplex amplifiable probe hybridization (MAPH) [
47].
(d) Multiplex amplicon quantification [
48].
In this regard, a more extensive range of common and high-throughput methods has been used, especially for CNVs in the human genome. Starting from targeted CNV screening and validation, various methods have been applied including qPCR, the paralogue ratio test (PRT), and molecular copy-number counting (MCC). qPCR compares the threshold cycles of a target versus a reference sequence. PRT uses a single pair of primers to exploit sequence similarities between the elements of the test and reference loci [49], while MCC uses PCR to count the number of molecules in DNA aliquots [50]. Additionally, multiplex PCR-based approaches such as MAPH, MLPA, multiplex amplicon quantification (MAQ), quantitative multiplex PCR of short fluorescent fragments (QMPSF) and multiplex PCR-based real-time invader assay (mPCR-RETINA) have also been successfully used [51].
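As a rough illustration of how qPCR threshold cycles (Ct) of a target and a reference sequence can be turned into a copy number estimate, the sketch below applies the commonly used 2^(-ΔΔCt) calculation. It is not taken from the studies cited above: the function name, the use of a calibrator sample with a known copy number of 2, and the assumption of perfect (100%) amplification efficiency are illustrative assumptions.

```python
def qpcr_copy_number(ct_target_sample, ct_ref_sample,
                     ct_target_calibrator, ct_ref_calibrator,
                     calibrator_copies=2):
    """Estimate target copy number with the 2^(-ddCt) approach.

    Assumes (hypothetically) 100% PCR efficiency, i.e. the product doubles
    every cycle, and a calibrator sample whose copy number is known.
    """
    # Normalize the target Ct to the reference locus within each sample.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    # Compare the test sample with the calibrator.
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return calibrator_copies * 2.0 ** (-delta_delta_ct)


# Made-up Ct values: the target amplifies one cycle earlier than in the
# calibrator, which suggests twice as much template (4 copies).
print(qpcr_copy_number(24.0, 25.0, 25.0, 25.0))  # prints: 4.0
# One cycle later suggests half as much template (1 copy).
print(qpcr_copy_number(26.0, 25.0, 25.0, 25.0))  # prints: 1.0
```

In practice, measured efficiencies below 100% and replicate variability mean that such estimates are usually rounded and checked against controls.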
On the other hand, from a high-throughput perspective, many platforms have been used extensively for CNV detection, ranging from cytogenetic technologies such as karyotyping and fluorescence in situ hybridization (FISH) to more accurate arrays such as CGH and SNP arrays. CGH arrays are considered a reliable method, since they can measure the fluorescence ratio along the length of each chromosome and identify novel regions of interest in the test sample. This method has the highest sensitivity and specificity [52], but gives relatively low resolution in CNV detection. SNP arrays are more commonly used for CNV analysis, since they provide high-resolution CNV detection based on hybridization intensities from custom and non-custom probes and require less sample DNA than CGH [53]. However, the main limitation of SNP arrays for CNV detection is the low SNP coverage of certain genomic regions.
In the same context of array analysis, a suite of algorithms has been used including but not limited to: CBS [
54], GLAD [
55], ITALICS [
56], CRLMM [
57], HMM, PennCNV [
58], ParseCNV [
59] and R.GADA [
60]. Each of these methods has distinctive features, and most of them incorporate the log R ratio (LRR) and B-allele frequency (BAF) for reliable CNV identification.
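To make the two signals concrete, the following sketch computes a log R ratio and a simplified B-allele frequency for a single SNP probe. It is a simplified illustration under stated assumptions, not the procedure of any tool listed above: LRR is taken as log2 of observed over expected total intensity, and BAF is approximated as the B-allele share of the total signal, whereas real pipelines typically interpolate BAF against canonical genotype clusters. The function names and intensity values are hypothetical.

```python
import math


def log_r_ratio(observed_intensity, expected_intensity):
    """LRR: log2 of observed total probe intensity over the expected value."""
    return math.log2(observed_intensity / expected_intensity)


def b_allele_frequency(a_intensity, b_intensity):
    """Simplified BAF: share of the B allele in the total allelic signal.

    Assumption: a plain intensity ratio stands in for the cluster-based
    interpolation used by production genotyping software.
    """
    return b_intensity / (a_intensity + b_intensity)


# Made-up values: halved total intensity gives LRR = -1, as expected for
# a one-copy loss in an otherwise diploid region.
print(log_r_ratio(0.5, 1.0))          # prints: -1.0
# A duplicated AAB genotype shifts BAF away from 0, 0.5 and 1.
print(round(b_allele_frequency(2.0, 1.0), 3))  # prints: 0.333
```

The combination is informative because a deletion lowers LRR and removes heterozygous BAF values near 0.5, while a duplication raises LRR and splits the heterozygous band.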
To overcome the limitations of array-based techniques, studies have turned to NGS, which has rapidly emerged as a viable option for identifying CNVs in human diseases. This approach confers a number of critical advantages including higher coverage and resolution, more precise detection of breakpoints, and a higher capability to identify smaller CNVs [61,62].
In general, three main approaches have been used with NGS data: (i) read depth (read count), (ii) paired-end and (iii) assembly [63], as shown in Figure 3, plus two additional strategies: split read (SR) and a combination of these four methods. In the read depth (RD) approach, a sliding window is used to count the number of short reads, and these read count values are then used to identify CNV regions. RD-based methods can detect exact copy numbers, large insertions and CNVs in complex genomic regions, and can be applied to both WGS and WES data. However, they cannot detect precise breakpoints, inversions or translocation events. The paired-end (PE) approach, or paired-end mapping (PEM), identifies genomic aberrations based on the distances between the two reads of a paired-end read pair, and therefore cannot be applied to single-end reads. PEM can efficiently identify inversions and translocations but is unable to detect CNVs in low-complexity regions. In the assembly (AS) approach, overlapping short reads are assembled into contigs, and CNV regions are detected by comparing these assembled contigs to the reference genome. SR methods, on the other hand, rely only on unique mapping information: they split the incompletely mapped reads of read pairs into multiple fragments, and the start and end fragments of each split read are then aligned to the reference genome to assign insertion or deletion events. For every approach, a diverse set of popular methods and tools has been developed, such as CNV-seq [
64], FREEC/Control-FREEC [
65], CNVnator [
66], SegSeq [
67], eventwise testing (EWT) [
68], BreakDancer [
69], ExomeCNV [
70], XHMM [
71], ExoCNVTest [
72], GPHMM [
73], CLImAT [
74], Cortex assembler [
75] and Magnolya [
76]. Although there has been great progress in each category, none of the methods and tools can comprehensively detect all types of CNVs. Therefore, combinatorial approaches have been used in an attempt to detect CNVs more reliably.
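The read depth idea described above can be sketched in a few lines. This is a toy illustration, not the algorithm of any tool named in this section: it uses fixed, non-overlapping windows as a simplification of a sliding window, takes the median window count as the copy-neutral baseline, and applies hypothetical ratio thresholds (1.5 for gain, 0.5 for loss). Real RD methods additionally correct for GC content and mappability and segment the signal statistically.

```python
from statistics import median


def window_counts(read_starts, region_length, window_size):
    """Count reads per fixed window (0-based read start positions assumed)."""
    n_windows = region_length // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        idx = pos // window_size
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


def call_windows(counts, gain_ratio=1.5, loss_ratio=0.5):
    """Label each window by its depth relative to the median window depth.

    The thresholds are illustrative assumptions, not published defaults.
    """
    baseline = median(counts)
    if baseline <= 0:
        raise ValueError("median read depth is zero; cannot normalize")
    calls = []
    for count in counts:
        ratio = count / baseline
        if ratio >= gain_ratio:
            calls.append("gain")
        elif ratio <= loss_ratio:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Toy data: five 100 bp windows with 10, 10, 20, 5 and 10 reads.
reads = [5] * 10 + [150] * 10 + [250] * 20 + [350] * 5 + [450] * 10
counts = window_counts(reads, region_length=500, window_size=100)
print(counts)                # prints: [10, 10, 20, 5, 10]
print(call_windows(counts))  # prints: ['neutral', 'neutral', 'gain', 'loss', 'neutral']
```

Because only window-level counts are used, the sketch also shows why RD methods cannot place breakpoints more precisely than the window size.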
Furthermore, several well-established methods have also been used to find recurrent copy number aberrations (RCNAs) or somatic copy number alterations (SCNAs) in a cohort of tumor patients. Among these methods, GISTIC (genomic identification of significant targets in cancer) [77], GISTIC 2.0 [78], JISTIC [79], NN-SSVD (non-negative sparse singular value decomposition) [80], DiNAMIC (discovering copy number aberrations manifested in cancer) [81] and PLA (piecewise-constant and low-rank approximation) [82] have been extensively applied. All of these methods focus on the identification of driver aberrations, which have proved to be crucial for the progression of many diseases, unlike passenger events, which have no functional effect. GISTIC identifies significant driver SCNAs by evaluating the frequency and amplitude of observed events based on a G-score (the product of frequency and average amplitude) and a greedy peel-off procedure. GISTIC 2.0, a revised version of GISTIC, discovers recurrent CNVs based on a G-score (the negative logarithm of the likelihood of both the frequency and amplitude of each aberration region) and an arbitrated peel-off procedure. Similarly, JISTIC, an improved implementation of the GISTIC algorithm, can detect more significant sub-regions within large aberrant regions. In addition, DiNAMIC detects RCNA regions using a peeling method tailored to its inner cyclic shift procedure and accepts various input data types (continuous, continuous segmented or discrete segmented). While PLA detects RCNAs by the sample frequency of the low-rank component from multi-sample data, NN-SSVD is able to discover RCNAs in complex patterns based on a low-rank approximation component of only one layer.
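The first G-score definition above (frequency multiplied by average amplitude) can be illustrated with a small sketch for amplifications. It is not the GISTIC implementation: the function name, the amplification threshold of 0.1 on log2 ratios, and the toy matrix are assumptions made here, and the permutation-based significance testing and peel-off steps are omitted entirely.

```python
def g_scores(log2_ratios, amp_threshold=0.1):
    """Per-marker amplification G-score: frequency x mean amplitude.

    log2_ratios: list of samples, each a list of per-marker log2 ratios.
    amp_threshold: hypothetical cutoff above which a marker counts as
    amplified in a sample.
    """
    n_samples = len(log2_ratios)
    n_markers = len(log2_ratios[0])
    scores = []
    for j in range(n_markers):
        amplified = [row[j] for row in log2_ratios if row[j] > amp_threshold]
        if not amplified:
            scores.append(0.0)
            continue
        frequency = len(amplified) / n_samples
        mean_amplitude = sum(amplified) / len(amplified)
        scores.append(frequency * mean_amplitude)
    return scores


# Toy cohort: 4 samples x 3 markers (made-up log2 ratios).
cohort = [
    [0.0, 1.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]
print([round(g, 3) for g in g_scores(cohort)])  # prints: [0.0, 1.0, 0.125]
```

The middle marker scores highest because it is amplified both frequently (3 of 4 samples) and strongly, which is the intuition behind prioritizing recurrent, high-amplitude events as candidate drivers.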
DISEASES RELATED COPY NUMBER VARIATION
In this review, we summarize a series of diseases related to CNVs (common or rare) and their associated disease genes. The details are given in Tables 2‒9 and the following subsections.
Immune response and inflammation
Various studies have confirmed the significant impact of CNVs on immune response and inflammation. According to initial estimates from Online Mendelian Inheritance in Man (OMIM) and Gene Ontology (GO) analysis, large portions of genes and exons were linked to CNVs and encode proteins involved in the immune response and inflammation. For example, low copy numbers of the gene CCL3L1 were associated with an accelerated rate of HIV progression and the development of AIDS (Table 2) [83].
Syndromes, schizophrenia, mental retardation and autism spectrum disorder
CNVs have been associated with diseases, through (i) dosage of a single gene [
84,
85], (ii) a contiguous set of genes (
e.g., Williams-Beuren syndrome [
86,
87], DiGeorge syndrome [
88], Smith-Magenis syndrome [
89], Potocki-Lupski syndrome [
90]) or (iii) allele combinations in the case of complex diseases.
Recent studies in (i) syndromes, (ii) schizophrenia [
91,
92], (iii) mental retardation [
93,
94] and (iv) autism spectrum disorder [
95] not only detected multiple disease related genes but also led to the description of variable phenotypes, novel microdeletion and microduplication syndromes [
96–
99].
Firstly, in the context of syndromes, several classic examples have been identified, including (i) the 15q11-q13 deletion associated with Prader-Willi and Angelman syndromes [100], (ii) the 17p11 deletion associated with Smith-Magenis syndrome [101], (iii) the 7q11 deletion associated with Williams-Beuren syndrome [102], and (iv) the 22q11 deletions associated with velocardiofacial syndrome (Table 3) [103].
Secondly, alterations in the following three regions were associated with both schizophrenia and mental retardation: (i) 1q21.1, (ii) 15q11.2 and (iii) 15q13.3. Deletions at all three loci were linked to schizophrenia and related psychoses [94,95], as was the 22q11.2 deletion syndrome (22qDS), an identifiable genetic alteration that has been associated only with schizophrenia [104].
Thirdly, duplication of the entire 15q11-q13.3 and both deletions and duplications of band 1q21.1 were associated with mental retardation.
Finally, duplication of the entire 15q11-q13.3 region was shown to cause autism spectrum disorder (ASD) [
95]. Similarly, duplication of band 1q21.1 was also identified in patients with ASD.
These examples suggest that a simple alteration at a given chromosomal position in the human genome can cause certain disorders and phenotypes.
Cancers
Accurate CNV detection has been reported to be an essential part of cancer genome analysis, and holds great promise for improving cancer prognosis and treatment decisions. Accordingly, significant effort has gone into finding associations between somatic CNVs and cancers, based on oncogene activation and tumor suppressor gene inactivation caused by copy number amplification and heterozygous/homozygous deletion, respectively.
Generally, there are three kinds of CNVs: (i) germline CNVs, (ii) somatic CNVs and (iii) inherited CNVs. Among these, somatic CNVs have been successfully associated with cancer. For example, Walters et al. [105] predicted that an amplified copy number of CHD7-PVT1 is likely to play a role in the tumorigenesis of small cell lung cancer. Fanciulli et al. have also found associations between somatic CNVs and other kinds of cancer, such as prostate and colorectal cancers [83].
In the context of prostate cancer, a landscape of CNVs associated with the clinical/pathological endpoints of metastasis has been observed in several studies, including that of Barbieri et al. [106] and the Cancer Genome Atlas (TCGA) prostate cancer cohort. The related CNVs include: (i) genomic deletions at 6p, 8p, 13q and 16p, (ii) genomic duplications at 7q and 8q, and (iii) focal alterations spanning PTEN, RB1, and tumor protein p53 (TP53), among others.
Additionally, in the same context of the cancer genome, risk factors for oral cavity squamous cell carcinoma (OSCC), including (i) cigarette smoking, (ii) alcohol consumption, and (iii) betel quid chewing, have caused many genomic aberrations and widespread genomic instability, especially in Eastern and Western countries [65,107,108]. Accordingly, various studies on OSCC have detected multiple mutations in genes such as TP53, NOTCH1, CASP8, FAT1, CDKN2A, HRAS and USP9X [109,110] and multiple CNV events (Table 4) [109,111,112], such as deletions at 3p, 8p, 9p and 18q and duplications at 3q, 5p, 7p, 8q, 11q, and 20q.
Furthermore, various methods have also been applied to detect alterations in heterogeneous OSCC patient samples. Among these methods, an approach called ultra-deep targeted sequencing (UDT-Seq) successfully identified new pathogenic CNVs, such as: (i) PIK3CA duplication [113], (ii) FGFR1 duplication [114], and (iii) deletions of PTEN, RB1, SMAD4, and TP53 [114–116].
A number of studies have also examined CNVs and their roles in lung cancer. For example, alterations in chromosome regions 3q26.2-q29, 3p26.3-p11.1, 17p13.3-p11.2 and 9p13.3-p13.2 have been deemed the main predictors for lung cancer. Moreover, an integrative analysis of the transcriptional profile and CNVs of lung cancer has captured more significant CNV-driven genes.
Cardiovascular disease
The association of CNVs with cardiovascular diseases was also demonstrated early. Many aberrant CNV loci and related disease genes have been discovered in disorders with strong genetic components, often single-gene or “monogenic” disorders (Table 5) [117], such as:
(i) Ventricular tachycardia: deletion and duplication of the calsequestrin gene at 21p13.2-1p13.1.
(ii) Hypertrophic cardiomyopathy: deletion of myosin, light polypeptide 3, alkali at band 3p21.31.
(iii) Immune disease and cardiomyopathy: deletion and duplication of major histocompatibility complex, class II, DR1 in 6p21.32.
Neuropsychiatric disease
A great number of studies have also demonstrated the role of CNVs in the etiology of several neuropsychiatric disorders [73,117]. Argyrophilic grain disease (AGD) is an example. To identify CNVs related to AGD, one study first used aCGH (180K platform) (Table 6) alone and then applied the same 180K aCGH platform to an extra 400 independent samples, and found that no rare CNV was significant. However, the study highlighted a 40 kb microdeletion at 17p13.2 that includes the CTNS gene, which is implicated in cystinosis, and a 65 kb deletion that includes the SHPK gene [119].
Autoimmune disorders
Autoimmune disorders have been associated with multiple CNVs rather than a single CNV, which is useful for understanding their pathogenesis and discovering new drug targets [120–123]. Several studies have reported this association by identifying genes and loci such as (Table 7):
(i) Systemic lupus erythematosus (SLE) [122]: Fcγ receptors located at 1q23, complement component 4 (C4) at 6p21, RABGAP1L, and a deletion at 10q21.3.
(ii) Psoriasis and Crohn’s disease (CD): ITP and β-defensin genes.
(iii) Rheumatoid arthritis (RA): VPREB1 at the 22q11 region.
(iv) Ankylosing spondylitis (AS) [76]: deletion of HHAT (1q32.2), HLA-DPB1 (6p21.3), PRKRA (2q31.2), EEF1DP3 (13q13.1) and 16p13.3.
Psoriasis
In contrast to most complex diseases, the role of common CNVs in the pathogenesis of psoriasis has been well addressed across multiple studies [123–126]. In particular, a promising association between psoriasis and a 32.2 kb deletion of the LCE3B and LCE3C genes was (i) identified in European populations [123], (ii) subsequently replicated in a Chinese cohort [126], and then confirmed by the ExoCNVTest exome sequencing method (Table 8).
Huntington’s disease
While it is known that gene deletion and duplication can affect neurological disease, a recent study also investigated the association between CNV and variable adult age of onset (AAO) of Huntington’s disease (HD). As a result, the copy number of SLC2A3 was observed to vary between 1 copy (heterozygous deletion) and 3 copies (heterozygous duplication) in HD [127]. Many other genes and loci have been related to neurodegenerative disorders, such as the triplication of alpha-synuclein or large deletions of the parkin gene causing Parkinson’s disease [128].
CONCLUSIONS
Nowadays, with advances in high-throughput genomic technologies, a variety of genomic datasets have been reported at various biological levels. In this review, we have focused on CNVs, which belong to the genomic level. We first gave an overview of the definition of CNVs, CNV-related diseases/phenotypes and detection methods. Then, based on a number of complex diseases, we summarized the critical role of CNVs (whether rare or common) in the identification of disease-related genes.
To date, a large number of genetic diseases and phenotypes have been associated with CNVs. CNVs play important roles in the study of these genetic diseases by: (i) identifying multiple genes, whether known or newly discovered, (ii) allowing discrimination of driver mutations for the pathogenesis or diagnosis of complex diseases and (iii) helping to develop personalized medicines.
Various methods have been developed to detect CNVs and to validate them reliably and accurately. These methods are: (i) cytogenetic and karyotyping methods, (ii) microarray-based methods, (iii) next-generation sequencing methods and (iv) third-generation approaches. Each of these approaches has advantages, such as higher effective resolution and less complex data analysis, and disadvantages, such as coverage biases, batch effects, and poor sensitivity and precision. However, accurate detection of small CNVs and their precise boundaries from massive amounts of data using these methods is still a challenge, largely due to the complexity of tumor samples. The validity and reliability of CNV detection should improve quickly as genotyping technologies advance, which will support the required replication.
In addition, the role of CNVs in genetic syndromes has long been recognized, with recurrent microdeletions/microduplications detected in syndromes such as Prader‒Willi, Smith‒Magenis and Williams‒Beuren. With the increased clinical use of array-based CNV analysis, the list of CNVs associated with disease phenotypes has continued to grow. This has led to the discovery of many new microdeletion and microduplication syndromes. These novel syndromes and the ever-expanding list of CNVs associated with disease phenotypes highlight the significant involvement of CNVs in genetic diseases.
An important issue that must also be mentioned in the context of CNVs is missing heritability [129]. This issue has been studied in order to estimate the heritability of common diseases. Missing heritability in genome-wide association studies, defined as the failure of the detected variants to account for a considerable fraction of heritability, is still a challenging issue in human genetics. To solve this puzzle, a number of CNV-based methods have been proposed. However, none of them has accurately accounted for missing heritability, due to the conflict arising between rare and common genetic variants.
In conclusion, CNV analysis alone will not yield sufficient advances without its integration with other genomic datasets. Multiple data integrations have proved successful, including with datasets such as gene expression, DNA methylation, protein-protein interaction (PPI), metabolic pathways and Gene Ontology. These integrations will help us understand disease susceptibility and pathogenesis from various perspectives.
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature