1 INTRODUCTION
Sequencing technology has developed swiftly since the location-specific primer extension strategy for DNA sequencing was first introduced by Ray Wu and then substantially improved by Frederick Sanger in the 1970s [
1]. Limited by the technology and methodology of the time, sequencing was first applied to small genomes, such as those of bacteriophages and other viruses [
2]. Beginning in the late 1980s, automated DNA sequencing, usually regarded as first-generation sequencing, was successfully applied for almost two decades and achieved a series of essential accomplishments [
3,
4]. In 1995, Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) published the first paper that used whole-genome shotgun sequencing to sequence the complete genome of a free-living organism, the bacterium
Haemophilus influenzae. By 2001, shotgun sequencing methods had been widely adopted to produce draft sequences of large genomes, most notably the initial rough draft of the human genome, a monumental worldwide achievement [
5,
6].
Despite steady improvement, first-generation sequencing still has serious limitations in cost, speed, scalability and resolution [
7]. Genome sequencing improved fundamentally with the arrival of next-generation sequencing (NGS) in the 2000s [
8,
9]. The newer technologies combine massively parallel short-read DNA sequencing with genome alignment and assembly methods to survey genomes rapidly and on an unprecedented scale, which makes large-scale whole genome sequencing accessible and practical for researchers [
10,
11]. Several NGS platforms for whole genome sequencing have emerged that offer high speed, comparatively low cost and high coverage, making whole genome sequencing an increasingly popular research tool. Currently, more than 90% of the reported complete human genome sequences have been produced on platforms from two companies, Illumina and Complete Genomics (CG) [
12]. Sequencing the whole genomes of many related organisms in turn enables large-scale comparative and evolutionary studies [
13,
14]. Whole genome sequencing also provides solutions to complex genomic and genetic research problems by offering the most comprehensive collection of rare variants and structural variations for sequenced individuals [
15]. Recently, whole genome sequencing has been successfully applied to reconstructing human population history [
1], uncovering the roles of rare variants in common diseases [
16,
17], and providing clinical interpretation and implications [
18–
21]. Moreover, whole genome sequencing has been reported to be more powerful than whole-exome sequencing for detecting exome variants [
22].
This review first describes a typical pipeline for whole genome sequencing, including laboratory template preparation, sequencing, genome assembly and quality control, and variant calling and annotation. We then compare whole genome and whole exome sequencing, and explore a wide range of applications of whole genome sequencing to both Mendelian and complex diseases in medical genetics. Finally, we highlight the impact of whole genome sequencing on cancer studies, regulatory variant analysis, predictive medicine and precision medicine.
2 A TYPICAL PIPELINE OF WHOLE GENOME SEQUENCING
We introduce a typical pipeline for whole genome sequencing (shown in Figure 1), which is largely based on the literature [
7,
23]. After lab preparation, a suitable sequencing platform is chosen for sequencing the samples. The next steps are genome assembly and quality control, followed by variant calling and annotation. Detected variants can be further analyzed to infer their biological relevance, and prioritized or filtered according to their causal relation to a phenotype of interest. Further verification tests can then be applied according to the results of the analysis.
High-quality NGS lab preparation is essential for accurate whole genome sequencing. As this step is often outsourced to sequencing companies, readers seeking more details are referred to the Illumina guidebook Preparing Samples for Sequencing Genomic DNA [
24] or to the instructions provided in the literature [
25].
2.1 Sequencing platforms
Many sequencing companies now provide whole genome sequencing services, so choosing an affordable and accurate sequencing platform is an essential step toward reliable and comprehensive sequencing output for further biological or bioinformatics analysis [
12,
26]. Table 1 shows the properties of some current sequencing platforms, as summarized by the AllSeq Knowledge Bank [
27]. In addition to these platforms, Complete Genomics, a leader in whole human genome sequencing, provides high-quality sequencing output (SNP calling rate > 90% with a reference consensus accuracy of > 99.999%).
2.2 Alignment or genome assembling
Each next-generation sequencing platform generates massive numbers of short reads for a genome, differing in quantity and read length. Because some genomes are highly complex, the most comprehensive and accurate genome assemblers rely on paired-end reads sequenced from both ends of a DNA fragment [
23]. Depending on whether a reference genome exists, two main approaches are used to integrate the short reads into longer contiguous sequences after quality assessment and then build the draft genome [
23,
28]. The first is reference-based assembly, which aligns the reads to a reference genome and produces a similar sequence with a tolerable number of differences. Because this method cannot generate novel sequences that differ from or are absent in the reference, it is sometimes combined with other methods to improve the accuracy of the assembly [
29]. A more complex and popular approach is
de novo genome assembly [
30], which can discover new sequences or generate a draft genome when no related reference genome exists. The
de novo genome assembly must cope with sequencing errors, repeat structures, and the computational complexity and speed of processing large amounts of data. Shorter sequence reads make
de novo assembly even more challenging [
31]. Some popular alignment and genome assembly tools for reference-based or
de novo assembly are listed in Table 2.
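To make the idea of reference-based placement concrete, the following minimal sketch scans a reference string and reports the ungapped position at which a read matches with the fewest mismatches. It is an illustration only: the function names `hamming` and `place_read`, the `max_mismatches` parameter and the toy sequences are assumptions introduced here, and practical aligners use indexed data structures and also model insertions and deletions rather than this exhaustive scan.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def place_read(read: str, reference: str,
               max_mismatches: int = 2) -> Optional[Tuple[int, int]]:
    """Return (offset, mismatches) of the best ungapped placement, or None.

    Every offset of the reference is tried; ties are resolved in favour of
    the leftmost offset. Indels are not modelled in this toy version.
    """
    best: Optional[Tuple[int, int]] = None
    for offset in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[offset:offset + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (offset, d)
    return best


# Toy reference and reads (hypothetical sequences).
reference = "ACGTTAGCCGATAGGCTA"
print(place_read("TAGCCGAT", reference))  # read identical to the reference
print(place_read("TAGCTGAT", reference))  # read carrying one mismatch
print(place_read("GGGGGGGG", reference))  # read with no acceptable placement
```

A read that cannot be placed within the mismatch budget is exactly the kind of sequence that a purely reference-based approach would miss and that a de novo assembler could still recover.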
2.3 Quality assessment
Quality control is essential both before and after read alignment and genome assembly. Raw reads generated by sequencing platforms contain errors that are common and inevitable during sequencing, such as low-quality reads, base-calling errors, and small insertions or deletions [
7]. Quality assessment is therefore used to measure the quality of raw reads and to remove, trim or correct poor reads, so that incorrectly assembled sequences are not passed on to downstream biological analysis. Quality assessment before read alignment and genome assembly usually includes plotting the quality score trend provided by the sequencing platform; checking for primer contamination, N content per base and GC bias; and trimming and filtering reads. As shown in Table 3, many tools have been designed to address the errors introduced by different sequencing platforms.
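The read-level checks described above can be sketched in a few lines. The example below assumes FASTQ-style quality strings in the Phred+33 encoding; the function names and all thresholds (`min_q`, `min_len`, `min_mean_q`, `max_n`) are illustrative assumptions rather than recommended settings or the behaviour of any particular tool.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def gc_fraction(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [b for b in seq.upper() if b in "ACGT"]
    return sum(b in "GC" for b in called) / len(called) if called else 0.0


def n_fraction(seq: str) -> float:
    """Fraction of positions reported as N (no base call)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(seq: str, quality: str, min_len: int = 5,
                  min_mean_q: float = 25.0, max_n: float = 0.1) -> bool:
    """Keep a read only if it is long enough, good enough and has few Ns."""
    if len(seq) < min_len:
        return False
    scores = phred_scores(quality)
    mean_q = sum(scores) / len(scores)
    return mean_q >= min_mean_q and n_fraction(seq) <= max_n


# Hypothetical read: eight high-quality bases followed by two poor ones.
seq, qual = trim_3prime("ACGTNACGTA", "IIIIIIII#!")
print(seq, qual, gc_fraction(seq), n_fraction(seq), passes_filter(seq, qual))
```

In this sketch trimming is applied before filtering, so a read is judged on the bases that would actually be passed on to alignment or assembly.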
Because of complex genomes with large repeats, read errors and PCR duplicates generated by sequencing platforms, no genome assembler can perfectly reconstruct the sequenced genome. Quality assessment is therefore also recommended for assembled draft genomes, to correct errors that may lead to mistaken biological interpretations [
28]. A variety of quality metrics have been devised to reflect different aspects of the assembly results, such as assembly size, number of contigs, the N50 or N90 statistic (a summary statistic of a set of contig or scaffold lengths), and the number of mismatches or mis-assemblies [
32]. Some popular tools for evaluating assemblers are collected in Table 4.
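As a small worked example of the N50 and N90 statistics, the sketch below uses the common definition: the contig length L such that contigs of length at least L together contain at least 50% (or 90%) of the total assembled bases. The function name `nx` and the contig lengths are hypothetical.

```python
from typing import Iterable


def nx(lengths: Iterable[int], percent: int = 50) -> int:
    """Return the Nx statistic (N50 for percent=50, N90 for percent=90).

    Contigs are visited from longest to shortest; the answer is the length
    of the contig at which the running total first reaches the requested
    percentage of the assembly size. Integer arithmetic avoids rounding.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths, total 300
print(nx(contigs, 50))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
print(nx(contigs, 90))  # 80 + 70 + 50 + 40 + 30 = 270 reaches 90%, so N90 is 30
```

A larger N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why it is reported alongside mismatch and mis-assembly counts.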
2.4 Variants calling
One prominent application of whole genome sequencing is identifying variants in the sequenced genome in order to study genetic associations with diseases, detect mutations in cancer, or characterize heterogeneous cell populations [
33]. The simplest procedure includes at least two components, an aligner and a variant caller. The aligner aligns the sequencing reads to a reference genome, and the variant caller identifies the positions of variants and assigns a genotype. By variant type, there are three classes of variant calling tools: single nucleotide variation (SNV) calling tools (which also cover indels), copy number variation (CNV) calling tools, and structural variation (SV) calling tools. Detecting SNVs and indels is essential for uncovering the genetics of diseases and for informing clinical diagnosis and treatment [
34]. CNVs are an important and special form of structural variation, and growing evidence indicates that they play an important role in human diversity and disease susceptibility, especially in complex diseases [
35]. The human genome contains an unexpectedly large amount of structural variation. Although the exact functions of most structural variants are not clear, they should not be overlooked in studies of human diseases and population genetics. Table 5 shows some popular variant calling tools.
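The division of labour between aligner and variant caller can be illustrated with a toy SNV caller that works on already aligned, ungapped reads. The allele-fraction thresholds (`het_min`, `hom_min`), the minimum depth and the example reads are assumptions made for illustration; production callers typically rely on probabilistic genotype models and base and mapping qualities rather than fixed cut-offs.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def pileup(alignments: List[Tuple[int, str]]) -> Dict[int, Counter]:
    """Count the bases observed at each reference position."""
    columns: Dict[int, Counter] = defaultdict(Counter)
    for offset, read in alignments:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return columns


def call_snvs(reference: str, alignments: List[Tuple[int, str]],
              min_depth: int = 4, het_min: float = 0.25,
              hom_min: float = 0.8) -> List[Tuple[int, str, str, str]]:
    """Return (position, ref, alt, genotype) for positions with support."""
    calls = []
    columns = pileup(alignments)
    for pos in sorted(columns):
        counts = columns[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        ref_base = reference[pos]
        alt_counts = {b: n for b, n in counts.items() if b != ref_base}
        if not alt_counts:
            continue  # every read agrees with the reference
        alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
        fraction = alt_n / depth
        if fraction >= hom_min:
            calls.append((pos, ref_base, alt_base, "1/1"))
        elif fraction >= het_min:
            calls.append((pos, ref_base, alt_base, "0/1"))
    return calls


reference = "ACGTACGTAC"
# (offset, read) pairs as an aligner might report them (hypothetical data).
alignments = [(0, "ACGTAC"), (0, "ACTTAC"), (1, "CTTACG"), (2, "GTACGA"),
              (4, "ACGAAC"), (4, "ACGAAC"), (3, "TACGAA")]
print(call_snvs(reference, alignments))
```

With these reads, half of the four reads covering position 2 carry T instead of G, giving a heterozygous call, while all four reads covering position 7 carry A instead of T, giving a homozygous alternate call.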
2.5 Variant annotation
Variant annotation is a crucial procedure in the analysis of genome sequencing data: it provides functional information for DNA variants and gives implications and evidence for biological analysis and disease studies [
36]. With the dramatic increase in the number and complexity of variants produced by whole genome sequencing, predicting the functional impact of variants has become a greater challenge than sequencing or variant calling itself. Annotations range from sequence context, conservation metrics, functional genomic properties and transcript information to protein structural and functional predictions. Most variant annotation tools support comprehensive analysis, prioritization or filtering of SNVs or small indels from many perspectives, such as CADD [
37], dbNSFP [
38], GATK [
39], GEMINI [
40], and SPRING [
41]. Although predicting the function of structural variants is more complex, some annotation tools have recently become available for analyzing structural variants, especially CNVs, including AnnTools [
42], ANNOVAR [
43], CNVannotator [
44] and VEP [
45]. For a comparatively complete list of variant annotation tools and hints on their usage, please see the literature [
7] for details.
3 COMPARISON BETWEEN WHOLE GENOME AND WHOLE EXOME SEQUENCING
With the rapid development of sequencing technology and the falling cost per sequencing run, whole genome sequencing is increasingly prevalent in uncovering the genetics of diseases, studying causal relations with cancers, performing genome-level comparative analysis, and providing clinical implications and guidance [
46,
47]. Clearly, whole genome sequencing is superior to whole exome sequencing when resources and time are not limiting. Compared with whole exome sequencing, whole genome sequencing allows examination of SNVs, indels, CNVs and SVs in both coding (~1% of the genome) and non-coding regions of the genome. Whole genome sequencing has more reliable and uniform sequence coverage, places no limitations on sequencing read length, and requires neither PCR amplification in library preparation nor a reference genome for assembly [
46]. Whole genome sequencing also has advantages for sequencing species other than human. Moreover, whole genome sequencing has been reported to be more powerful than whole exome sequencing for detecting exome variants [
22]. However, whole genome sequencing does suffer from higher cost and longer turnaround time (see Table 6 for exact numbers), and it is more difficult to accurately interpret the huge number and variety of detected variants [
48].
4 WHOLE GENOME SEQUENCING FOR MENDELIAN DISEASES
Mendelian diseases are disorders caused by a single gene, and make up the largest proportion of human inherited diseases. According to the OMIM database [
49], the largest collection of Mendelian diseases, about 7,000 different diseases have been characterized, of which ~3,500 have unknown genetic causes. Traditional approaches to pinpointing the causal genes for Mendelian diseases are mainly based on linkage analysis [
50], which measures the degree of co-segregation between genomic regions and disease status. The linked regions identified usually contain hundreds of candidate genes, which are further validated and investigated by Sanger sequencing. Despite its success in identifying causal genes for some diseases, several drawbacks prevent this strategy from being widely used today. For example, linkage analysis is only effective for familial diseases with a sufficient sample size [
51]. Whole exome sequencing (WES) has emerged as a powerful and popular approach to elucidating the genetic determinants of Mendelian diseases [
52]. With acceptable cost and easy interpretation, WES has identified causal genes for many Mendelian diseases [
51,
53–
55]. Recent evidence suggests that, from a technical perspective, whole genome sequencing (WGS) has advantages over WES in detecting exonic variants [
22]. For the same task of detecting variants in coding regions, including SNVs and indels, WGS identifies more variants missed by WES than WES identifies variants missed by WGS. This makes WGS a preferable alternative to WES when cost and time are not a concern. Besides coding variants, WGS provides more insight into genomic structural variants and noncoding variants. Recently, WGS has also been successfully applied to identify causal mutations in rare Mendelian diseases [
56,
57].
The widely used workflow for identifying disease-causing variants from exome sequencing in Mendelian diseases combines biological information about genes, the predicted functional consequence of variants, variant frequency in well-known large databases (e.g., 1000 G, ESP) and evolutionary conservation (e.g., GERP [
58]). The rationale behind this workflow is that disease-causing variants for Mendelian diseases tend to be rare variants that alter protein function in disease-related genes. Although this strategy has been applied successfully in some studies, it is suspected to lack power when the available sample size is limited. Because unaffected individuals can also carry such rare functional variants, additional variants in other samples or statistical evidence are needed to establish pathogenicity [
59]. This problem becomes even more difficult when whole genome sequencing is applied to Mendelian diseases. Because the number of variants is much larger than with WES, the list of candidate variants that need functional follow-up or manual investigation takes more time to work through even when filters are applied. Additionally, it is harder to evaluate the functional consequence of noncoding variants than of coding variants, since coding regions are better studied than the noncoding part of the genome. The large number of candidate variants and the difficulty of interpreting noncoding variants pose great challenges for applying WGS in clinical testing and medical research. Despite these difficulties, WGS is expected to play an important role in genetics as sequencing technologies develop and understanding of the human genome, especially noncoding regions, increases. For example, as more genomes are sequenced by the 1000 G Project and others, filters based on variant frequency will become more powerful thanks to a more complete catalogue of human genetic variants. With the efforts of large consortia such as ENCODE and Roadmap, a deeper understanding of noncoding genomes and regulatory elements will enable the development of computational methods to assess the regulatory impact of noncoding variants more precisely.
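The filtering workflow described above (rare in population databases, predicted to alter protein function, evolutionarily conserved) can be sketched as a simple filter-and-rank step. The `Variant` record, its field names, the gene labels and every threshold below are hypothetical choices made for illustration; they do not correspond to the schema or defaults of any database or tool named in this review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    gene: str             # hypothetical gene label
    consequence: str      # e.g. "missense", "synonymous", "stop_gained"
    population_af: float  # allele frequency in a reference population panel
    conservation: float   # conservation score; higher means more conserved


# Consequence classes treated as potentially protein-altering (an assumption).
PROTEIN_ALTERING = {"missense", "stop_gained", "frameshift", "splice_site"}


def prioritize(variants: List[Variant], max_af: float = 0.01,
               min_conservation: float = 2.0) -> List[Variant]:
    """Keep rare, protein-altering, conserved variants; rarest first."""
    kept = [v for v in variants
            if v.population_af <= max_af
            and v.consequence in PROTEIN_ALTERING
            and v.conservation >= min_conservation]
    return sorted(kept, key=lambda v: (v.population_af, -v.conservation))


candidates = [
    Variant("GENE_A", "missense", 0.0005, 4.8),
    Variant("GENE_B", "synonymous", 0.0001, 5.1),   # not protein-altering
    Variant("GENE_C", "stop_gained", 0.15, 3.9),    # too common
    Variant("GENE_D", "frameshift", 0.0, 5.5),
    Variant("GENE_E", "missense", 0.002, 0.4),      # poorly conserved
]
for v in prioritize(candidates):
    print(v.gene, v.consequence, v.population_af, v.conservation)
```

The sketch also shows why the approach scales poorly to WGS: the consequence filter has no natural counterpart for noncoding variants, so far more candidates survive unless a noncoding functional score is substituted.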
Functional prediction of sequence variants provides a fast assessment of the deleterious effect of variants, and is widely used as a filter in sequencing-based studies. Many tools have been developed for analyzing variants located in protein-coding regions, such as SIFT [
60] and PolyPhen2 [
61], but available methods for predicting the functional effects of noncoding sequence variants are relatively limited. Recently, several computational methods for whole-genome variants have been developed, including CADD [
62], DANN [
63], FATHMM-MKL [
64], Funseq [
65,
66], SInBaD [
67], deltaSVM [
68] and GWAVA [
69]. Most of these predictors use machine learning approaches to discriminate harmful variants from benign variants, with various genomic annotations (e.g., ENCODE) as features. Of these predictors, deltaSVM is the only one that considers cell type specificity. By training a gkm-SVM, a method for modeling DNA sequences, on cell-type-specific regulatory sequences and discovering the corresponding sequence vocabularies, deltaSVM can evaluate the regulatory effect of sequence variants in different cell lines. The development of computational prediction for noncoding sequence variants is expected to be an active area of research.
Besides SNVs, noncoding CNVs are believed to play important roles in Mendelian and complex diseases [
15,
70–
73]. CNVs are large alterations of the genome, including deletions and duplications, and are believed to have severe consequences because large portions of genes or regulatory elements are affected. Researchers have developed many tools for detecting both coding CNVs [
74,
75] and noncoding CNVs [
76]. However, elucidating the effect and predicting the consequences of CNVs, especially noncoding CNVs, remains an open problem.
5 WHOLE GENOME SEQUENCING FOR COMPLEX DISEASES
Common or complex diseases are diseases affected by more than one gene or variant, which makes the methods used for Mendelian diseases unsuitable for them. The variants that contribute to susceptibility to complex diseases usually have modest effect sizes, so genome-wide association studies (GWAS) are designed to study complex diseases [
77], in which a large cohort is required to ensure power. However, the soundness of GWAS has been questioned because of two important issues that lack an effective solution. The first is called “missing heritability” [
78]: associated common variants explain only a limited share of heritability, and rare variants are considered a source of the missing heritability [
79]. GWAS genotypes only common variants, whereas WGS overcomes this limitation by sequencing all variants, both common and rare. The second problem arises from linkage disequilibrium (LD, defined as the non-random association of alleles at different loci): associated variants detected in GWAS are usually not the functional variants themselves but are merely in LD with the true functional variants, which has led to the flourishing development of fine mapping methods [
80]. General approaches to fine mapping include dense genotyping and imputation, whereas WGS guarantees that the real functional variants are sequenced. We describe the use of WGS for overcoming these two issues in the following subsections.
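The definition of LD given above can be made quantitative with two standard pairwise measures for biallelic loci: D, the difference between the observed haplotype frequency and the frequency expected under independence, and r2, its squared correlation form. The sketch below assumes phased haplotypes coded as 0 (reference allele) or 1 (alternate allele); the function name `ld_stats` and the example data are hypothetical.

```python
from typing import List, Tuple


def ld_stats(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r2) for two biallelic loci from phased haplotypes.

    D  = p_AB - p_A * p_B
    r2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    r2 is NaN when either locus is monomorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# A tag SNP and a functional variant that always travel together.
perfect = [(1, 1)] * 4 + [(0, 0)] * 4
# Two loci whose alleles are combined at random.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)] * 2

print(ld_stats(perfect))      # D = 0.25, r2 = 1.0
print(ld_stats(independent))  # D = 0.0,  r2 = 0.0
```

When r2 between a genotyped tag SNP and an ungenotyped functional variant is close to 1, an association test cannot distinguish them statistically, which is the situation fine mapping has to resolve.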
5.1 Association Mapping
Association mapping has been successfully applied to discover variants associated with diseases or traits of interest, and it will continue to be a powerful approach for studying complex diseases or traits in the WGS setting. Recently, researchers used WGS to discover two loci associated with major depressive disorder [
81], providing evidence for the effectiveness of low-coverage WGS. In this study, association signals of common variants (MAF > 1%) were calculated with a linear mixed model [
82,
83], which has proved to be an effective method for association mapping while controlling for population structure. Although WGS has been successful for association mapping, several issues should be considered and handled properly in the future. Because of cost, the coverage of WGS for large-scale cohorts is low, which leads to potential quality problems in variant calling and imputation. Care must be taken to ensure the quality of the called variants, and methodological development is needed to improve accuracy. In addition, WGS discovers many rare variants besides common variants; how to use those variants with rare-variant association methods such as SKAT [
84] to find biologically meaningful associations poses a challenge.
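As a simplified illustration of a single-variant association test, the sketch below regresses a quantitative phenotype on genotype dosage (0, 1 or 2 copies of the alternate allele) by ordinary least squares and reports the slope and its t statistic. This is deliberately not the linear mixed model used in the cited study: it includes no covariates and no correction for population structure or relatedness, and the function name and data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def linear_association(dosages: Sequence[float],
                       phenotypes: Sequence[float]) -> Tuple[float, float]:
    """Return (beta, t) for the slope of phenotype on genotype dosage."""
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    t = beta / se if se > 0 else math.inf
    return beta, t


# Hypothetical dosages and trait values for six individuals.
beta, t = linear_association([0, 0, 1, 1, 2, 2],
                             [1.0, 1.2, 1.9, 2.1, 3.0, 3.2])
print(round(beta, 3), round(t, 2))
```

Converting t into a p-value requires the t distribution with n - 2 degrees of freedom, which is omitted here to keep the sketch dependency-free. In a genome-wide scan the test is repeated for every variant, so the multiple-testing burden grows with the number of variants that WGS discovers.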
5.2 Genetic architecture analysis
Understanding the genetic architecture (e.g., heritability) of complex diseases provides important insights into them. Traditionally, the genetic architecture of complex diseases has been studied with GWAS. For reasons of cost and efficiency, tagging SNPs in LD blocks are genotyped with genotyping platforms. Although thousands of significant loci have been discovered through GWAS, the associated SNPs account for only a small proportion of trait variance, a phenomenon called “missing heritability”. This limited ability to explain heritability has led the research community to question GWAS, and missing heritability has become an important problem [
78] in recent years. Because WGS can genotype every locus, it holds promise for resolving “missing heritability”. Recently, Taylor
et al. [
85] used WGS to study thyroid function, and identified more heritability than previous GWAS had. Alanna
et al. [
86] studied the genetic architecture of HDL-C (short for high-density lipoprotein cholesterol) with whole genome sequence data from 962 individuals, and found that common variants accounted for more heritability than rare variants for this complex trait, providing some insight and evidence regarding the debate between common and rare variants [
87]. These studies highlight the utility of association tests for rare variants [
84] and linear mixed models for estimating the heritability of complex traits [
88]. Since high-depth whole genome sequencing is not yet feasible at large scale, low-depth WGS is the major candidate for large-scale analysis. Special care must be taken to deal with artifacts caused by low depth, and strict quality control is essential to eliminate false positives [
89].
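For readers unfamiliar with how a linear mixed model yields a heritability estimate, the standard variance-component formulation is sketched below. It is given as general background under the usual textbook assumptions and is not necessarily the exact model fitted in the studies cited above. Here y is the vector of phenotypes, X the matrix of fixed-effect covariates with coefficients β, g the aggregate genetic effect, K a genetic relationship matrix computed from the genotyped or sequenced variants, and ε the residual.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon}, \\
\mathbf{g} &\sim \mathcal{N}\!\left(\mathbf{0},\ \sigma_g^{2}\,\mathbf{K}\right), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\ \sigma_e^{2}\,\mathbf{I}\right), \\
h^{2} &= \frac{\sigma_g^{2}}{\sigma_g^{2} + \sigma_e^{2}}
\end{aligned}
```

The heritability captured by the variants used to build K is the share of phenotypic variance attributed to the genetic component. Building K from WGS variants instead of array tag SNPs is what allows such an estimate to include the contribution of rare variants.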
5.3 Fine mapping
GWAS has identified thousands of disease- or trait-associated common variants, which provide insights into complex diseases or traits. To limit cost, GWAS usually genotypes only a few tag SNPs within a haplotype block that can span up to several thousand base pairs. Thus, associated variants are usually in LD with the real functional variants, and fine mapping is needed to uncover the real functional sites. Typical fine-mapping studies require all variants within the associated loci to be genotyped, which calls for targeted sequencing or imputation based on large population data (e.g., 1000 G, HapMap) [
90]. The accumulation of uncertainty across those steps could undermine the identification of causal variants underlying GWAS loci. WGS, however, can sequence all variants along the whole genome, and thus holds promise for solving this problem and facilitating progress towards discovering the causal variants underlying associations. Although significant differences exist between GWAS and WGS, methodological developments for refining GWAS results can also benefit the refinement of WGS results [
80,
91,
92]. The larger number of genotyped variants in WGS also makes fine mapping more challenging than in GWAS.
6 WHOLE GENOME SEQUENCING FOR CANCER
Cancer, an important type of complex disease, is a genetic disease that accounts for many deaths worldwide each year. The most popular platform for analyzing cancer genomes is currently whole exome sequencing, owing to its acceptable cost and high interpretability. Recently, several international groups, such as TCGA and ICGC, have devoted much attention to characterizing multiple types of cancer using WES, such as prostate cancer [
93] and gastric cancer [
94]. Somatic mutations detected in more than 1 million cancer samples have been accumulated and deposited in the COSMIC database, which holds the largest collection of somatic mutations thus far. Although WES is the primary approach for cancer research, it focuses on protein-coding regions and ignores noncoding regions, which limits deeper understanding of cancer [
95]. Several studies have revealed the pathogenic impact of noncoding mutations in the cancer genome, especially promoter mutations in the TERT gene, which encodes the catalytic subunit of the enzyme telomerase, the most important component of the telomerase complex [
96–
98].
With the increasing number of sequenced cancer genomes, systematic analysis of large-scale cancer whole genome sequences can identify noncoding regions of interest that are frequently mutated across different cancer types. Integrating whole genome sequence data from multiple cancer samples with regulatory annotations or expression profiles has emerged as an effective approach to studying somatic mutations in noncoding regions [
99–
101]. Fredriksson [
100] proposed a method to identify associations between regulatory regions containing somatic mutations and gene expression, and highlighted the TERT promoter region as having the most statistically significant association with TERT gene expression. Because of the low frequency of somatic mutations, a regional association test is used, which borrows its methodology from association tests for rare population variants [
84,
102,
103]. Weinhold [
99] performed three distinct analyses, namely hotspot analysis, regional recurrence analysis and transcription factor analysis, to identify functionally important somatic mutations in enhancers, promoters, 5′ UTRs and 3′ UTRs. Melton [
101] integrated data from 436 whole genome sequences with regulatory annotations from ENCODE to identify significantly mutated regulatory regions. All three studies [
99–
101] detected TERT promoter mutations as a significantly mutated region across multiple cancer types.
As with Mendelian diseases, computational methods for identifying deleterious variants are also important for the analysis of cancer genomes. However, variants that disrupt protein-coding regions or regulatory elements are not necessarily driver mutations whose effects lead to tumor progression. Although several methods have been developed specifically for cancer mutations [
104], their performance is far from satisfactory, suggesting that further improvement is needed.
7 WHOLE GENOME SEQUENCING FOR REGULATORY VARIANT ANALYSIS
Although variants that disrupt the 1% of the genome that is protein-coding tend to have large deleterious effects, variants in the remaining 99% of noncoding regions are also believed to play important roles in human diseases. Several reviews [
105,
106] have discussed regulatory variants. Mulin
et al. [
106] focused on general analysis of regulatory variants, including genetic mapping, prediction, prioritization, and functional validation. Frank
et al. [
105] discussed the role of regulatory variants in human complex traits and disease, especially the molecular nature of regulatory variants and their influence on the transcriptome and proteome. Ward
et al. [
107] discussed the interpretation of noncoding variants discovered in GWAS, with a focus on enrichment analysis of regulatory annotations among discovered loci. Here, we highlight recent developments, especially integrative analysis, in the interpretation and systems-level analysis of regulatory variants discovered by WGS.
A quantitative trait locus (QTL) is a region of the genome containing sequence variants that affect a molecular quantitative trait, such as gene expression (eQTL), chromatin accessibility (dsQTL), or alternative splicing (sQTL) (see Table 7 for more details). QTL studies can provide insights into the molecular mechanisms by which causal variants affect disease status. Typical eQTL studies require two types of data to test associations between variants and gene expression. One is the genotype of each recruited individual, which is often obtained with a genotyping array; the other is gene expression, which is often measured by microarray or RNA sequencing. The genotyping array is a cost-effective way to obtain genotypes, but it assays only certain loci, and so is likely to miss stronger associations between ungenotyped loci and gene expression. Whole genome sequencing overcomes this problem by discovering all sequence variants, allowing identification of all possible genetic associations between sequence variants and gene expression. For example, a recent WGS-based eQTL mapping study [
114] found that indels (short insertions and deletions) may play a more important role in cis-eQTLs than SNPs. This study fully sequenced 462 individuals and discovered all types of sequence variants, so it provided more insight than traditional eQTL studies.
Integrating variants discovered by WGS with functional annotations is a promising approach for dissecting regulatory variants. Although most attention has been paid to analyzing disease-associated loci discovered by GWAS together with regulatory annotations, similar ideas and methodology can also be applied in the WGS setting. Since WGS can discover every variant along the genome, it is expected to find more associations than GWAS, which poses a greater challenge for integrative analysis. Recent studies have found that disease- or trait-associated variants are enriched in DNase I hypersensitive site (DHS) regions [
115,
116], and these regions could be used to mark regulatory elements with functional potential. Such annotations will help to elucidate the molecular mechanisms underlying disease etiology and refine the mapping of associated variants. Several studies have also revealed the importance of transcription factor (TF) binding in disease etiology: disease-associated variants may contribute to pathogenesis by disrupting the binding of TFs such as PolII and NFkB [
116–
118].
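One simple way to quantify whether associated variants are enriched in an annotation such as DHS regions is to compare the observed overlap with the overlap expected by chance and to attach a hypergeometric tail probability. The sketch below is a minimal version of that idea. It treats all variants as exchangeable and ignores LD, allele frequency and other matching factors, so it should not be read as the procedure used in the cited studies; the function names and counts are hypothetical.

```python
from math import comb


def fold_enrichment(k: int, n: int, K: int, N: int) -> float:
    """Observed over expected overlap: (k / n) / (K / N).

    N: all tested variants, K: tested variants inside annotated regions,
    n: associated variants, k: associated variants inside annotated regions.
    """
    return (k / n) / (K / N)


def hypergeom_upper_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n variants are drawn without replacement from N,
    of which K lie inside annotated regions."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 1000 tested variants, 100 of them in DHS regions,
# 20 trait-associated variants, 8 of which fall in DHS regions.
print(fold_enrichment(8, 20, 100, 1000))
print(hypergeom_upper_tail(8, 20, 100, 1000))
```

With these counts two overlapping variants would be expected by chance, so observing eight corresponds to a fourfold enrichment with a small tail probability.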
7.1 Possible computational issues
With the increasing number of sequenced genomes, several computational issues need to be considered in order to facilitate the application of WGS to studies of diseases, gene regulation, genomics and other areas. The first issue is the speed of WGS data processing. Read mapping and variant calling for WGS data usually take a long time, and the problem becomes more severe when the number of samples is large. Recently, an ultra-fast WGS pipeline called SpeedSeq [
119] was developed, which greatly speeds up data processing. How to further speed up the process while guaranteeing accuracy will be an important computational issue. The second issue is how to quantify the impact of variants detected by WGS on a disease or trait of interest. We have reviewed several methods for this task in different scenarios, such as Mendelian diseases, complex diseases, cancers and regulatory variants. The common strategy underlying those methods is the integration of WGS data with information obtained from other sequencing technologies, such as ChIP-seq [
120], DNase-seq [
121], RNA-seq [
117] and ATAC-seq [
122]. How to integrate such genomics data with WGS will be an important field of research.
8 WHOLE GENOME SEQUENCING FOR PREDICTIVE MEDICINE AND PRECISION MEDICINE
Benefit from the high-throughput sequencing technologies with high speed and low cost, personal whole genome sequencing or whole exome sequencing becomes more and more available for customers. The genotype of a person can be achieved from the sequencing data, and compared to known disease databases or related published literature to determine likelihood of trait expression and the risk of some diseases. Our research group developed a database of human whole-genome single nucleotide variants and their functional predictions, namely dbWGFP [
123]. This database contains functional predictions and annotations for nearly 8.58 billion possible human whole-genome single nucleotide variants, each described by 48 functional prediction scores and 44 annotations. Specifically, the 48 prediction scores include 32 functional predictions calculated by 13 popular computational methods, 15 conservation features derived from 4 conservation calculation approaches, and 1 sensitivity measurement. The 44 annotations are obtained from the ENCODE project. dbWGFP helps to identify causative variants among the massive numbers of candidate variants derived from whole-genome or whole-exome sequencing data.
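The figures quoted for dbWGFP can be checked with simple arithmetic, sketched below in Python. The breakdown of the 48 scores is taken directly from the text. The reading of 8.58 billion as three alternative bases per reference position is our assumption and is not stated in the text; the variable names are illustrative.

```python
# Breakdown of the 48 prediction scores, as stated in the text.
score_groups = {
    "functional predictions": 32,
    "conservation features": 15,
    "sensitivity measurement": 1,
}
total_scores = sum(score_groups.values())

# Assumption (not stated in the text): every reference position can be
# substituted by any of the 3 other bases, so the number of possible
# SNVs is three times the number of positions considered.
total_possible_snvs = 8.58e9
alt_bases_per_position = 3
implied_positions = total_possible_snvs / alt_bases_per_position

print(f"prediction scores per variant: {total_scores}")
print(f"implied reference positions: {implied_positions:.3g}")
```

Under this assumption the database would cover roughly 2.86 billion reference positions, which is of the same order as the size of the human genome.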
Predictive medicine is a field of medicine that may use genetic information generated by personal whole genome sequencing to predict the probability of disease and to determine which medical treatments are appropriate for a particular individual [
124–
126]. Precision medicine is a medical model that tailors healthcare to the individual, including disease prevention, medical decisions and therapies [
127,
128]. Applications of predictive and precision medicine include selecting appropriate drugs for a patient to maximize efficacy and minimize side effects, or providing a tailored therapy to accelerate the patient's recovery [
129,
130].
9 CONCLUSION AND DISCUSSION
As the cost of whole genome sequencing decreases rapidly and approaches $1000, WGS is increasingly used to reveal the genetic basis of Mendelian and complex diseases, to elucidate novel disease biology, and to support clinical diagnosis and treatment. Whole genome sequencing provides exceptional coverage of genomic regions, including exonic, intronic and other unexplored noncoding regions, and captures a large collection of rare variants and a comprehensive set of structural variants. Combined with other types of data and annotations, WGS has also helped to interpret the genetics and biology underlying the cancer genome. In the future of predictive medicine and precision medicine, WGS will be an important tool to guide disease prevention and treatment.
Although WGS has been successfully applied in many studies, some problems remain to be solved. Next-generation sequencing technologies can generate tremendous amounts of data, but they suffer from sequencing errors, such as bias in GC- or AT-rich genomes and context-specific errors. Amplification, a necessary step for some platforms, may also introduce errors. In addition, most WGS studies cannot provide sufficient coverage, which may lead to mistakes in the genome assembly and variant calling steps. Furthermore, different sequencing platforms may yield different analysis results, especially for potential loss-of-function mutations or rare variants that are likely to be pathogenic [
47]. Although the cost of sequencing a whole genome has dropped dramatically, sequencing a relatively large number of samples at high coverage is still unaffordable for most researchers.
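To make the notion of sufficient coverage concrete, the Python sketch below computes mean coverage as the number of reads times the read length divided by the genome size, and uses an idealized Poisson model to estimate how many bases fall below a given depth. The model and all input values are assumptions made for illustration; real data deviate from it because of the GC bias and amplification errors mentioned above, so the estimates should be read as optimistic.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Mean coverage C = N * L / G."""
    return num_reads * read_length / genome_size


def prob_depth_below(coverage, min_depth):
    """P(depth < min_depth) for a base, assuming Poisson-distributed depth."""
    return sum(
        math.exp(-coverage) * coverage**k / math.factorial(k)
        for k in range(min_depth)
    )


# Hypothetical run: 600 million reads of 150 bp on a 3 Gb genome.
c = mean_coverage(600_000_000, 150, 3_000_000_000)  # 30.0

# Fraction of bases expected to have no reads, or fewer than 10 reads.
uncovered = prob_depth_below(c, 1)
shallow = prob_depth_below(c, 10)
print(f"mean coverage: {c:.1f}x")
print(f"fraction with zero reads: {uncovered:.3g}")
print(f"fraction with depth < 10: {shallow:.3g}")
```

Halving the number of reads halves the mean coverage, but the fraction of poorly covered bases grows much faster than that, which is one reason low-coverage designs are prone to variant calling mistakes.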
The main challenge in WGS studies is processing and interpreting whole genome sequencing data. Even with a quality control step, errors still arise during genome assembly, such as those caused by insufficient read coverage or mis-assembly [
131]. Another, more important step is to interpret the sequencing data, uncover the relationship between genotype and phenotype, and link the analyzed data to clinical applications [
132]. The volume of information contained in a genome sequence is so vast that it is hard to explain all of it comprehensively and accurately. The role of most variants, genes and non-coding elements in the human genome is still unclear or incompletely understood [
133,
134]. Although many bioinformatics approaches have been developed to handle sequencing data for different applications, most of the predicted or examined results remain to be verified. The pathogenic mechanisms of some diseases, such as cancers, are so complex that they require the analysis of much more WGS data from larger sample sets, combined with other data such as multi-omics data, functional data and clinicopathological data [
95].
Higher Education Press and Springer-Verlag Berlin Heidelberg